Apache Kafka is now nearly ubiquitous in modern data pipelines. While the Kafka development model is elegantly simple, operating Kafka clusters in production environments is a challenge. Troubleshooting a misbehaving Kafka cluster is hard, especially when it spans potentially hundreds or thousands of topics, producers, and consumers, and billions of messages.
The root cause of lag in real-time applications may be an application problem – such as poor data partitioning or load imbalance – or a Kafka problem – such as resource exhaustion or suboptimal configuration. This makes achieving the best performance, predictability, and reliability for Kafka-based applications difficult. In the end, the operation of your Kafka-powered analytics pipelines could itself benefit from machine learning (ML).
In this presentation, we will explain how recent advances in machine learning and AI form the basis of a methodology for applying statistical learning to the rich monitoring data that Kafka makes available. This monitoring data includes metrics from Kafka brokers, producers, consumers, and infrastructure, as well as logs from various components of the Kafka ecosystem. We will also discuss how to use ML to identify the root causes of a number of Kafka-based application bottlenecks, slowdowns, and failures.
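To make the idea of statistical learning over Kafka monitoring data concrete, here is a minimal sketch (not from the talk itself) of one simple building block: flagging anomalous consumer lag with a rolling z-score. The metric values and window size are hypothetical; real pipelines would pull lag from a monitoring system and use richer models.

```python
# Minimal sketch: detect sudden consumer-lag anomalies with a rolling
# z-score over hypothetical per-minute lag samples (messages behind).
from statistics import mean, stdev

def lag_anomalies(lag_samples, window=10, threshold=3.0):
    """Return indices where lag deviates sharply from the recent
    rolling baseline, measured as a simple z-score."""
    anomalies = []
    for i in range(window, len(lag_samples)):
        baseline = lag_samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(lag_samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Steady lag near 100 messages, then a spike at index 12.
samples = [100, 102, 98, 101, 99, 103, 97, 100, 102, 99, 101, 100, 5000]
print(lag_anomalies(samples))  # → [12]
```

A production approach would correlate such anomalies across broker, producer, and infrastructure metrics to separate application-side causes (e.g. partition skew) from Kafka-side causes (e.g. resource exhaustion).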