
Kafka Summit London 2018

April 23-24, 2018 | London

Sold Out

From Kafka to BigQuery – a Guide for Delivering Billions of Daily Events


MyHeritage collects billions of events every day: request logs from web servers and backend services, events describing user activity across its platforms, and change-data-capture logs recording every change made to its databases. Delivering these events for analysis is a complex task that requires a robust, scalable data pipeline. We decided to ship our events to Apache Kafka and load them into Google BigQuery for analysis.

What are the most important considerations when shipping billions of daily events for analysis? In this session, I'll share the journey we made building a reliable, near-real-time data pipeline, compare several data loading techniques, and hopefully help you make better choices in your next pipeline.

In this talk, I’m going to share some of the lessons we learned and best practices we adopted, while describing the following loading techniques: 

  • Batch loading to Google Cloud Storage and using a load job to deliver data to BigQuery
  • Streaming data via BigQuery API, along with Kafka Streams as the streaming framework
  • Streaming data to BigQuery with Kafka Connect
  • Streaming data with Apache Beam and its Cloud Dataflow runner

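To illustrate the Kafka Connect route, a sink connector configuration along these lines streams topics straight into BigQuery tables. This sketch assumes the community BigQuery sink connector; the connector class and property names vary by connector version, and the project, dataset, topics, and keyfile path are hypothetical:

```properties
name=bigquery-sink
connector.class=com.wepay.kafka.connect.bigquery.BigQuerySinkConnector
tasks.max=4
# Topics to load into BigQuery (hypothetical names)
topics=user-activity,request-logs
# Hypothetical GCP project id and target dataset
project=my-gcp-project
defaultDataset=analytics
# Service-account credentials for BigQuery access
keyfile=/etc/kafka-connect/gcp-credentials.json
```

With this approach the Connect framework handles offsets, retries, and scaling, which is a large part of its appeal over hand-rolled streaming code.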
Along with presenting our journey, I’ll discuss some important concepts of data loading:

  • Batch vs. streaming load
  • Processing time partitioning vs. event time partitioning
  • Considerations for running your pipeline on premise vs. in the cloud
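The partitioning distinction above matters for late-arriving events. A minimal Python sketch (function and field names are illustrative, not from the talk) shows how the two strategies place the same event in different partitions:

```python
from datetime import datetime, timezone

def processing_time_partition(now=None):
    """Partition by when the pipeline loads the event (ingestion time)."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%d")

def event_time_partition(event):
    """Partition by when the event actually happened."""
    ts = datetime.fromtimestamp(event["event_time"], tz=timezone.utc)
    return ts.strftime("%Y%m%d")

# A late-arriving event: it happened on 2018-03-01 UTC but may be
# loaded days later. Event-time partitioning keeps it with its day.
late_event = {"user_id": 42, "event_time": 1519862400}
print(event_time_partition(late_event))  # -> "20180301"
```

With processing-time partitioning the late event would land in whatever day the pipeline happened to load it, which skews daily analytics; event-time partitioning keeps queries over "what happened on day X" correct at the cost of having to handle writes into older partitions.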

Hopefully this case study can help others build better data pipelines.
