While Apache Kafka is designed to be fault-tolerant, there will be times when your Kafka environment just isn’t working as expected. Whether it’s a newly configured application not processing messages or an outage in a high-load, mission-critical production environment, it’s crucial to get back up and running as quickly and safely as possible. IBM has hosted production Kafka environments for several years and has in-depth knowledge of how to diagnose and resolve problems rapidly and accurately, keeping the impact on end users to a minimum. This session will share our experience of collecting and interpreting Kafka diagnostics effectively. We’ll talk through using these diagnostics to work out what’s gone wrong and how to recover from a system outage. With this knowledge, you will be better equipped to handle the problems your cluster throws at you.