Kafka Summit Logo
Organized by
Register

Kafka Summit NYC 2019

April 2, 2019 | New York City

Register

Cost Effectively and Reliably Aggregating Billions of Messages Per Day Using Kafka

Session Level: Intermediate

In this session, we will discuss Live Aggregators (LA), Mist’s highly reliable and massively scalable in-house real time aggregation system that relies on Kafka for ensuring fault tolerance and scalability. LA consumes billions of messages a day from Kafka with a memory footprint of over 750 GB and aggregates over 100 million timeseries. Since it runs entirely on top of AWS spot instances, it is designed to be highly reliable. LA can recover from hours long complete EC2 outages using its checkpointing mechanism that depends on Kafka. This recovery mechanism recovers the checkpoint and replays messages from Kafka where it left off, ensuring no data loss. The characteristic that sets LA apart is its ability to autoscale by intelligently learning about resource usage and allocating resources accordingly. LA emits custom metrics that track resource usage for different components, i.e., Kafka consumer, shared memory manager and aggregator, to achieve server utilization of over 70%. We do multi-level aggregations in LA to intelligently solve load imbalance issues amongst different partitions for a Kafka topic. We’d demonstrate multi-level aggregation using an example in which we aggregate indoor location data coming from different organizations both spatially and temporally. We’d explain how changing partitioning key, along with writing intermediate data back to Kafka in a new topic for the next level aggregators helps Mist scale our solution. LA runs on top of 400+ cores, comprised of 10+ different Amazon EC2 spot instance types/sizes. We track the CPU usage for reading each Kafka stream on all the different instance types/sizes. We have several months of such data from our production Mesos cluster, which we are incorporating into LA’s scheduler to improve our server utilization and avoid CPU hot spots from developing on our cluster. Detailed Blog:https://www.mist.com/live-aggregators-highly-reliable-massively-scalable-real-time-aggregation-system/


We use cookies to understand how you use our site and to improve your experience. Click here to learn more or change your cookie settings. By continuing to browse, you agree to our use of cookies.