Logging is a critical piece of Airbnb's infrastructure. A massive volume of logging events is emitted to Kafka and continuously ingested into the data warehouse for offline analytics. Logging data drives critical business and engineering functions and decisions, such as building search ranking models, running experiments (A/B testing), and identifying fraud and risk. At Airbnb, logging events are published to Kafka from services and clients, then ingested from Kafka into the data warehouse in near real time using Airstream (a product built on top of Spark Streaming).

With the rapid growth of logging traffic, we have run into serious engineering challenges in the scalability and reliability of the streaming ingestion infrastructure. For example, when the event rate spikes, we cannot simply increase the parallelism of Spark's Kafka reads or throw more resources at the job. Another serious problem is a large skew in event size: some events are much bigger than others. The out-of-the-box Kafka reader in Spark treats all topics equally, so the running times of tasks within a stage can vary widely with event size, resulting in significant computing inefficiency and lag in a streaming job.
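The skew problem can be sketched with a small simulation (the numbers and the mitigation shown are illustrative assumptions, not Airbnb's actual implementation). With one task per Kafka partition, a stage finishes only when its slowest task does; one common mitigation is to split a heavy partition's offset range into bounded chunks so task sizes even out:

```python
# Sketch of Spark/Kafka task skew with hypothetical numbers.
# The stock direct Kafka reader creates one Spark task per Kafka
# partition, so a stage is gated by its largest partition.

# Bytes buffered per Kafka partition for one micro-batch;
# the last "heavy" partition dominates.
partition_bytes = [100, 120, 90, 110, 2000]

BYTES_PER_SEC = 100  # assumed uniform per-task processing rate


def split(partitions, max_bytes):
    """Split each partition's byte range into chunks of at most max_bytes,
    yielding many small tasks instead of one oversized one."""
    tasks = []
    for size in partitions:
        while size > max_bytes:
            tasks.append(max_bytes)
            size -= max_bytes
        if size > 0:
            tasks.append(size)
    return tasks


naive_tasks = partition_bytes            # one task per partition
split_tasks = split(partition_bytes, 200)

# With enough executor slots, stage time is the slowest task's time.
print(max(naive_tasks) / BYTES_PER_SEC)  # 20.0 s: gated by the heavy partition
print(max(split_tasks) / BYTES_PER_SEC)  #  2.0 s: skew smoothed out
```

The same total work is done in both cases; bounding per-task input simply lets the scheduler spread the heavy partition across many executor slots.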
In this talk, I will start with an overview of the logging architecture at Airbnb. I will then dive into the scalability and reliability issues that emerged and share how we scaled the infrastructure to handle 10X growth in logging traffic.
The key takeaways include:
1) an architecture of near real-time and large-scale logging event ingestion from Kafka using Spark streaming;
2) limitations of the Kafka consumer in Spark and a more scalable solution;
3) lessons learned from productionizing critical real-time data infrastructure.