Databricks ingests more than 10 trillion data samples per day, a scale that pushed its traditional monitoring infrastructure to its limits. To maintain reliability and efficiency across its global operations on AWS, Azure, and GCP, the company rearchitected its monitoring stack.
The core of this effort was customizing open-source monitoring software, in particular the CNCF project Thanos, which Databricks forked into a new system codenamed Pantheon. Pantheon supports more than 5 billion active time series in real time, has cut monitoring infrastructure downtime roughly fivefold, and saves millions in annual cloud costs.
Pantheon: A Scaled Timeseries Database
Traditional time series databases (TSDBs) became a bottleneck for Databricks, struggling to keep up with near-daily scaling demands driven by exponential growth. Pantheon, the customized Thanos implementation, now runs more than 160 instances across the three cloud providers.
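To put the quoted figures in perspective, the following back-of-envelope sketch converts them into per-second and per-instance averages. The inputs are the article's own numbers, each stated as "over" and therefore best read as rough floors. The function name `scale_estimates` is hypothetical, and the calculation assumes load is spread evenly over time and across instances, which a real deployment almost certainly does not do.

```python
# Figures quoted in the article; each is given as "over", so treat as rough floors.
SAMPLES_PER_DAY = 10 * 10**12   # 10 trillion samples ingested daily
ACTIVE_SERIES = 5 * 10**9       # 5 billion active time series
INSTANCES = 160                 # Pantheon instances across three clouds
SECONDS_PER_DAY = 24 * 60 * 60


def scale_estimates(samples_per_day, active_series, instances):
    """Return rough averages derived from the quoted totals.

    Assumption (not from the article): load is uniform over the day
    and evenly divided among instances.
    """
    samples_per_second = samples_per_day / SECONDS_PER_DAY
    series_per_instance = active_series / instances
    samples_per_series_per_day = samples_per_day / active_series
    # Average gap between two samples of the same series, in seconds.
    avg_sample_interval_s = SECONDS_PER_DAY / samples_per_series_per_day
    return {
        "samples_per_second": samples_per_second,
        "series_per_instance": series_per_instance,
        "samples_per_series_per_day": samples_per_series_per_day,
        "avg_sample_interval_s": avg_sample_interval_s,
    }


if __name__ == "__main__":
    estimates = scale_estimates(SAMPLES_PER_DAY, ACTIVE_SERIES, INSTANCES)
    for name, value in estimates.items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the totals work out to roughly 116 million samples per second, about 31 million active series per instance, and one sample per series about every 43 seconds. These are averages only; the article does not say how series are actually distributed among instances.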