What are the most important considerations when shipping billions of daily events to analysis? In this session, I’ll share the journey we made building a reliable, near real-time data pipeline, discuss and compare several data loading techniques, and hopefully help you make better choices in your next pipeline.

MyHeritage collects billions of events every day: request logs from web servers and backend services, events describing user activity across different platforms, and change-data-capture logs recording every change made to its databases. Delivering these events for analysis is a complex task that requires a robust and scalable data pipeline. We decided to ship our events to Apache Kafka and load them into Google BigQuery for analysis.
In this talk, I’m going to share the lessons we learned and the best practices we adopted while working through the following loading techniques:
- Batch loading to Google Cloud Storage, followed by a load job that delivers the data to BigQuery
- Streaming data via the BigQuery streaming API, with Kafka Streams as the streaming framework
- Streaming data to BigQuery with Kafka Connect
- Streaming data with Apache Beam and its Google Cloud Dataflow runner
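To give a feel for the first technique, here is a minimal, stdlib-only sketch of the buffering step that precedes a batch load: events accumulate until either a size or an age threshold is reached, and then the batch is flushed (in a real pipeline, written as a file to Cloud Storage and picked up by a BigQuery load job). All names and thresholds are illustrative, not our production code.

```python
import time

class MicroBatcher:
    """Accumulate events and flush when a size or age threshold is hit.

    A simplified stand-in for the buffering that precedes writing a file
    to Google Cloud Storage for a BigQuery load job.
    """

    def __init__(self, flush, max_events=500, max_age_seconds=60.0,
                 clock=time.monotonic):
        self.flush = flush              # callback receiving a full batch,
                                        # e.g. "upload file to GCS"
        self.max_events = max_events
        self.max_age = max_age_seconds
        self.clock = clock
        self.buffer = []
        self.opened_at = None

    def add(self, event):
        if not self.buffer:
            self.opened_at = self.clock()   # batch starts aging now
        self.buffer.append(event)
        self._maybe_flush()

    def _maybe_flush(self):
        too_big = len(self.buffer) >= self.max_events
        too_old = self.buffer and (self.clock() - self.opened_at) >= self.max_age
        if too_big or too_old:
            batch, self.buffer = self.buffer, []
            self.flush(batch)
```

The trade-off these two thresholds encode is the heart of batch loading: bigger batches mean fewer load jobs and lower cost, at the price of higher end-to-end latency; the streaming techniques below invert that trade-off.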
Alongside the journey itself, I’ll discuss some important concepts in data loading:
- Batch vs. streaming load
- Processing time partitioning vs. event time partitioning
- Considerations for running your pipeline on-premises vs. in the cloud
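To make the partitioning distinction above concrete, here is a small stdlib-only sketch (field names are illustrative): with processing-time partitioning, an event lands in the partition of the day it was loaded, so a late-arriving event ends up under the wrong day; with event-time partitioning, the partition is derived from the timestamp carried inside the event itself.

```python
from datetime import datetime, timezone

def processing_time_partition(load_time: datetime) -> str:
    # Partition by when the pipeline loaded the event
    # (the behavior of ingestion-time partitioning).
    return load_time.strftime("%Y%m%d")

def event_time_partition(event: dict) -> str:
    # Partition by the timestamp inside the event itself
    # (column-based partitioning). "event_ts" is a hypothetical field.
    ts = datetime.fromtimestamp(event["event_ts"], tz=timezone.utc)
    return ts.strftime("%Y%m%d")

# A late event: it happened near midnight on Jan 1 but arrived on Jan 2.
late_event = {
    "event_ts": datetime(2020, 1, 1, 23, 50, tzinfo=timezone.utc).timestamp()
}
arrival = datetime(2020, 1, 2, 0, 10, tzinfo=timezone.utc)

print(processing_time_partition(arrival))   # 20200102 — the day it was loaded
print(event_time_partition(late_event))     # 20200101 — the day it happened
```

For analytics queries that filter by when things actually happened, event-time partitioning gives correct results for late data; processing-time partitioning is simpler for the pipeline but pushes the correction work onto every query.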
Hopefully this case study can help others build better data pipelines.