CERN, the European Laboratory for Particle Physics, hosts the biggest particle accelerators in the world and manages, among other resources, large data centres and a distributed computing infrastructure to pursue its scientific goals. Since two years, Apache Kafka is a core part of the new monitoring pipeline which collects, transports and processes several terabytes of metrics and logs per day from more than 30k hosts of the CERN data centres and world-wide grid distributed services. Not only Kafka proved to be the solid backbone of the ingestion infrastructure, it also enabled low-latency on-the-fly access to all monitoring data, opening new possibilities for data extraction, transformation and correlation.
Apache Flume is currently used as collector agent. A processing infrastructure is provided to users for the streaming analytics of monitoring data, based on Mesos/Marathon and Docker for job orchestration and deployment, with users developing the processing logic on the preferred framework (e.g. mainly Apache Spark, but with Kafka Streams/KSQL being an option too). The results of the analysis, as well as the raw data, are stored in HDFS as long term archive and on Elasticsearch and InfluxDB as backends for the visualisation layer, based on Grafana. This talk discusses the monitoring architecture, the challenges encountered in operating and scaling Kafka to handle billions of events per day and presents how users benefit from Kafka as central data hub for stream processing and analysis of monitoring data.