Walmart.com generates millions of events per second. At WalmartLabs, I’m working in a team called the Customer Backbone (CBB), where we wanted to upgrade to a platform capable of processing this event volume in real-time and store the state/knowledge of possibly all the Walmart Customers generated by the processing. Kafka streams’ event-driven architecture seemed like the only obvious choice. However, there are a few challenges w.r.t. Walmart’s scale: – the clusters need to be large and the problems thereof. – infinite retention of changelog topics, wasting valuable disk. – slow stand-by task recovery in case of a node failure (changelog topics have GBs of data) – no repartitioning in Kafka Streams.
As part of the event-driven development and addressing the challenges above, I’m going to talk about some bold new ideas we developed as features/patches to Kafka Streams to deal with the scale required at Walmart.
– Cold Bootstrap: Where in case of a Kafka Streams node failure, how instead of recovering from the change-log topic, we bootstrap the standby from active’s RocksDB using JSch and zero event loss by careful offset management.
– Dynamic Repartitioning: We added support for repartitioning in Kafka Streams where state is distributed among the new partitions. We can now elastically scale to any number of partitions and any number of nodes.
– Cloud/Rack/AZ aware task assignment: No active and standby tasks of the same partition are assigned to the same rack.
– Decreased Partition Assignment Size: With large clusters like ours (>400 nodes and 3 stream threads per node), the size of Partition Assignment of the KS cluster being few 100MBs, it takes a lot of time to settle a rebalance.
Key Takeaways: ‘
– Basic understanding of Kafka Streams.
– Productionizing Kafka Streams at scale.
– Using Kafka Streams as Distributed NoSQL DB.