Databricks Crushes Monitoring Scale

Databricks scales monitoring to 10 trillion samples/day using customized open-source tools and a new Lakehouse platform for cost-effective high-cardinality data.

[Figure] Databricks' Pantheon architecture, a customized Thanos implementation for enhanced monitoring.

Databricks is ingesting over 10 trillion data samples daily, a scale that pushed its traditional monitoring infrastructure to its limits. To maintain reliability and efficiency across its global operations on AWS, Azure, and GCP, the company undertook a significant rearchitecture.

The core of this effort involved customizing open-source monitoring solutions, particularly the CNCF Thanos project, which was forked into a new system codenamed Pantheon. The initiative supports over 5 billion active time series in real time, has cut monitoring infrastructure downtime roughly fivefold, and saves millions in annual cloud costs.

Pantheon: A Scaled Timeseries Database

Traditional timeseries databases (TSDBs) became a bottleneck for Databricks, struggling with the near-daily scaling demands driven by exponential growth. Pantheon, a customized Thanos implementation, now operates at a massive scale, with over 160 instances across three cloud providers.


Its tiered storage architecture, keeping recent data in-memory and older data on object storage, decouples compute from storage. This allows for efficient scaling without historical data rebalancing.
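The routing logic behind such a tiered design can be sketched in a few lines. This is a minimal illustration, not Databricks' actual implementation: the four-hour hot window and the tier names are assumptions chosen for the example.

```python
from datetime import datetime, timedelta

# Hypothetical retention boundary: recent samples stay in memory,
# older ones live as compacted blocks in object storage.
IN_MEMORY_WINDOW = timedelta(hours=4)

def route_query(start: datetime, end: datetime, now: datetime) -> list[str]:
    """Decide which storage tiers a time-range query must touch."""
    tiers = []
    boundary = now - IN_MEMORY_WINDOW
    if end > boundary:
        tiers.append("in-memory")       # hot tier: recent samples
    if start < boundary:
        tiers.append("object-storage")  # cold tier: historical blocks
    return tiers

now = datetime(2025, 1, 1, 12, 0)
# A query over the last hour touches only the hot tier.
print(route_query(now - timedelta(hours=1), now, now))
# A query spanning the last two days must also read object storage.
print(route_query(now - timedelta(days=2), now, now))
```

Because the cold tier is just immutable blocks in object storage, adding query capacity means adding stateless readers rather than rebalancing historical data across nodes.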

Optimizations include distinct memory-retention policies for long-lived versus ephemeral workloads and a design using isolated Kubernetes StatefulSets for improved operational resilience and data isolation.

The platform also incorporates a purpose-built control plane. This automates releases, scaling, and self-healing, handling dozens of failure events weekly with minimal human intervention.

Taming High Cardinality with Aggregation and Hydra

The rapid growth of serverless and AI workloads at Databricks has led to a surge in high-cardinality metrics—metrics with numerous unique label combinations. This growth strains TSDBs and drives up costs.
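The multiplicative nature of the problem is easy to see: the potential series count is the product of the unique values of each label. The label names and counts below are illustrative, not figures from Databricks.

```python
from math import prod

# Illustrative label sets for a hypothetical serverless request metric.
# The potential series count is the product of unique values per label.
label_values = {
    "region": 30,        # cloud regions
    "cluster": 500,      # clusters per region (assumed)
    "endpoint": 200,     # API endpoints (assumed)
    "status_code": 10,   # HTTP status classes
}

series = prod(label_values.values())
print(f"{series:,} potential time series")  # 30,000,000 potential time series
```

Adding one more label with even modest cardinality multiplies that total again, which is why a single instrumentation change can overwhelm a TSDB.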

To combat this, Databricks implemented metric aggregation, dropping expensive labels from serverless systems during ingestion. The result is an aggregated fleetwide view at a fraction of the original cardinality.
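The core of ingestion-time aggregation can be sketched as: strip the high-cardinality labels from each sample's label set, then sum samples that now share a key. This is a simplified model of the technique, not Databricks' pipeline; the sample format and label names are assumptions.

```python
from collections import defaultdict

def aggregate(samples, drop_labels):
    """Sum samples after removing high-cardinality labels.

    Each sample is a (labels_dict, value) pair; samples whose
    remaining labels match are merged into one series.
    """
    out = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted(
            (k, v) for k, v in labels.items() if k not in drop_labels
        ))
        out[key] += value
    return dict(out)

samples = [
    ({"metric": "requests", "region": "us-east", "pod": "pod-a1"}, 5.0),
    ({"metric": "requests", "region": "us-east", "pod": "pod-b2"}, 3.0),
    ({"metric": "requests", "region": "eu-west", "pod": "pod-c3"}, 2.0),
]

# Dropping the per-pod label collapses three series into two,
# while preserving fleetwide request totals per region.
fleet = aggregate(samples, drop_labels={"pod"})
print(fleet)
```

The trade-off is deliberate: per-pod drill-down is lost at this layer, which is exactly the gap the Hydra platform described below is meant to fill.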

For advanced troubleshooting of these high-cardinality metrics, a novel Lakehouse-based platform called Hydra was developed. This system offers rich debugging capabilities at scale and boasts storage costs 50 times lower than previous solutions.

This approach to Databricks monitoring infrastructure highlights a broader trend of adapting observability tools for the demands of modern, hyperscale cloud platforms.

The insights shared in the Databricks engineering blog underscore the challenges and innovative solutions required to manage complex, rapidly evolving systems.

This innovation is reminiscent of how other companies are leveraging cloud-native architectures for efficiency, such as nOps rebuilding its cloud savings platform on Databricks.
