At ZipRecruiter, we connect employers with jobseekers. To do that, we record and process millions of events every day using Apache Kafka. Kafka availability is critical for our business: when Kafka is down, we lose money.
In this talk, we describe what we went through after a production incident that occurred during a rolling upgrade. That incident led us to practice chaos engineering so that our team would become better acquainted with Kafka’s internals and with handling incidents. We will share some best practices for chaos engineering on Kafka and show how to overcome failures when they appear. We will also share our strategies for performing a cluster update (rolling upgrade vs. blue-green vs. active-passive), including the pros and cons of each approach.