Is your Kafka cluster keeping you up all night? URP driving you crazy? Do you want all the benefits of running Kafka as a shared data hub, but the thought of supporting that many clients on the same infrastructure has you spooked? We felt the same way, but it turned out Kafka can indeed be run at scale in a multi-tenant environment with the right mix of hardware, configuration, and filesystem sorcery. As we all know, a well rested administrator is a happy (less cranky) administrator.
At Pandora we run a large shared Kafka cluster supporting thousands of clients and trillions of messages per day. Further, access to our cluster is intentionally permissive, lowering the bar to entry for users in order to promote innovation and speed to production. This approach has pros, but also, predictably, many cons. As the number of clients grew, we noticed periods of performance degradation at peak volume. This symptoms of this included thrashing ISR expands/shrinks, decreased throughput, and increased response times. We were initially baffled by this behavior and affectionately named it “The Thing.” This talk is the story of how we tracked this problem to its root by diving deep into both Kafka and ZFS, ultimately alleviating the issue with ZFS configuration, broker configuration, and end-user education (read: client scolding). This allowed us to support our current load – with room for growth – without the need for additional hardware.