Databricks ingests more than 10 trillion data samples per day, a scale that pushed its traditional monitoring infrastructure to its limits. To maintain reliability and efficiency across its global operations on AWS, Azure, and GCP, the company undertook a significant rearchitecture of that infrastructure.
The core of this effort was customizing open-source monitoring tools, in particular the CNCF Thanos project, which Databricks forked into a new system codenamed Pantheon. The resulting system supports more than 5 billion active time series in real time. It has cut monitoring infrastructure downtime roughly fivefold while saving millions in annual cloud costs.