Since its initial release, the Kafka group membership protocol has offered Connect, Streams and Consumer applications an ingenious and robust way to balance resources among distributed processes. The process of rebalancing, as it’s widely known, allows Kafka APIs to define an embedded protocol for load balancing within the group membership protocol itself. Until now, rebalancing has been working under the simple assumption that every time a new group generation is created, the members join after first releasing all of their resources, getting a whole new load assignment by the time the new group is formed. This allows Kafka APIs to provide task fault-tolerance and elasticity on top of the group membership protocol. However, due to its side-effects on multi-tenancy and scalability this simple approach in rebalancing, also known as stop-the-world effect, is limiting larger scale deployments. Because of stop-the-world, application tasks get interrupted only for most of them to receive the same resources after rebalancing. In this technical deep dive, I’ll discuss the proposition of Incremental Cooperative Rebalancing as a way to alleviate stop-the-world and optimize rebalancing in Kafka APIs.
* The internals of Incremental Cooperative Rebalancing
* Uses cases that benefit from Incremental Cooperative Rebalancing
* Implementation in Kafka Connect
* Performance results in Kafka Connect clusters