With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Pinterest has a hundred billion pins stored in MySQL at the scale of 100TB, and most of this data is needed to build data-driven products for machine learning and data analytics.
This talk discusses how Pinterest designed and built a continuous database (DB) ingestion system that moves MySQL data into near-real-time computation pipelines with only 15 minutes of latency, in support of our dynamic personalized recommendations and search indices. Pinterest helps people discover and do things that they love. We have billions of core objects (pins/boards/users) stored in MySQL at the scale of 100TB. All this data needs to be ingested onto S3/Hadoop for machine learning and data analytics. As Pinterest moves towards real-time computation, we face stringent service-level agreement (SLA) requirements, such as making the MySQL data available on S3/Hadoop within 15 minutes and serving the DB data incrementally for stream processing. We designed WaterMill, a continuous DB ingestion system that listens for MySQL binlog changes, publishes the MySQL changelogs as an Apache Kafka® change stream, and ingests and compacts the stream into Parquet columnar tables in S3/Hadoop within 15 minutes.
We would like to share how we addressed the following:
- Scalable data partitioning and an efficient compaction algorithm
- Stories on schema migration, rewind and recovery
- PII (personally identifiable information) processing
- Columnar storage for efficient incremental query
- How the DB change stream powers other use cases, such as cache invalidation in a multi-datacenter environment
- How we deal with the issue of S3 eventual consistency and rate limiting
Related technologies: Apache Kafka, stream processing, MySQL binlog processing, Amazon S3, Hadoop and Parquet columnar storage
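To make the idea of compacting a change stream into a table concrete, here is a minimal Python sketch. It is not WaterMill's actual algorithm, which this abstract does not describe. It assumes each change event carries a primary key, a monotonically increasing sequence number (standing in for a binlog position), and an operation type; the names `ChangeEvent` and `compact` are hypothetical, and plain dictionaries stand in for the Parquet tables.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level change, as it might appear on a change stream (hypothetical shape)."""
    key: int                    # primary key of the affected row
    seq: int                    # assumed monotonically increasing position in the log
    op: str                     # "upsert" or "delete"
    row: Optional[dict] = None  # full row image for upserts, None for deletes


def compact(snapshot: Dict[int, dict], events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Fold a batch of change events into a previous snapshot.

    For each key, only the event with the highest sequence number is applied,
    so duplicated or out-of-order events within the batch do not matter.
    """
    # Step 1: keep only the latest event per primary key.
    latest: Dict[int, ChangeEvent] = {}
    for ev in events:
        current = latest.get(ev.key)
        if current is None or ev.seq > current.seq:
            latest[ev.key] = ev

    # Step 2: apply the surviving events on top of a copy of the old snapshot.
    result = dict(snapshot)
    for key, ev in latest.items():
        if ev.op == "delete":
            result.pop(key, None)
        else:
            result[key] = ev.row
    return result


snapshot = {1: {"title": "a"}, 2: {"title": "b"}}
events = [
    ChangeEvent(key=1, seq=10, op="upsert", row={"title": "a2"}),
    ChangeEvent(key=2, seq=11, op="delete"),
    ChangeEvent(key=3, seq=12, op="upsert", row={"title": "c"}),
    ChangeEvent(key=1, seq=9, op="upsert", row={"title": "stale"}),  # older, ignored
]
new_snapshot = compact(snapshot, events)
```

In this sketch the previous snapshot is left unmodified, so a failed run can be retried against the same inputs. A production system would additionally partition the work by key and read and write columnar files rather than in-memory dictionaries.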