Over the years at Yelp, we have relied on Kafka to build many complex applications and stream-processing data pipelines that solve a multitude of use cases, including powering our product experimentation workflow, search indexing, asynchronous task processing and more. Today, Kafka is at the core of our infrastructure. These applications use different versions of Kafka clients and different programming languages. To fulfill the requirements of these diverse use cases, we run several specialized Kafka clusters for high availability, consistency, exactly-once semantics and infinite retention. We endeavor to keep our clusters up to date with newer Kafka versions, which bring several critical bug fixes and exciting features like dynamic broker configuration, exactly-once semantics, Kafka offset management and improved tooling.

Our journey with Kafka started with version 0.8.2.0. Upgrading Kafka while ensuring client compatibility, zero downtime and negligible performance degradation across our ever-growing multi-regional cluster deployment exposed us to a plethora of unique challenges. This session will focus on the challenges we encountered and how we evolved our infrastructure tooling and upgrade strategy to overcome them. I will be talking about:

- How we rolled out new features such as Kafka offset storage, message timestamps, reassignment auto-throttling, etc.
- Core technical issues discovered during upgrades, such as log cleaner failures caused by large offsets.
- The in-house test suite that we built to validate new Kafka versions against our existing tooling and client libraries, exercise the upgrade and rollback process, and benchmark performance.
- The automation we built for safe and fast rolling upgrades and broker configuration deployment.