LINE is a messaging service with 160+ million active users. Last year I talked about how we operate a Kafka cluster that receives more than 160 billion messages daily, and how we dealt with performance problems to meet our tight requirements. Since then we have deployed three more clusters, each for a different purpose (one in a different datacenter, one for security-sensitive use cases, and so on), while keeping the fundamental concept: one cluster for everyone to use. Letting many projects share a few multi-tenant clusters greatly reduces our operational cost and lets us concentrate our engineering resources on maximizing their reliability, but hosting topics with many different kinds of workloads has also led us through a lot of challenges.
In this talk I will introduce how we operate Kafka clusters shared among different services, and how we solved the problems we encountered in order to maximize their reliability. In particular, one of the most critical issues we have solved, delayed consumer Fetch requests causing a broker's network threads to be blocked, should be very interesting because it can degrade the overall performance of brokers in a very common situation. We managed to solve it using advanced techniques such as dynamic tracing and a tricky patch that controls in-kernel behavior from Java code.