Databricks engineers are pushing the boundaries of infrastructure reliability and efficiency, with key contributions highlighted at SRECon 2026. The company is tackling complex multi-cloud challenges, including advanced Kubernetes load balancing. This work aims to improve the performance and availability of the Databricks Platform across AWS, Azure, and GCP.
Running thousands of microservices at scale exposes limitations in Kubernetes' native load balancing. The default kube-proxy and ClusterIP model operates at Layer 4, distributing connections rather than individual requests. This is especially problematic for gRPC services, which multiplex many requests over long-lived HTTP/2 connections: once a connection is established, every request on it lands on the same pod, leading to uneven resource utilization and performance degradation.
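The skew can be illustrated with a small simulation. The workload numbers below are hypothetical; the point is that connection-level (Layer 4) assignment pins each client's entire request volume to one pod, while request-level (Layer 7) balancing spreads it evenly:

```python
PODS = 4
# Hypothetical workload: 8 clients, each holding one long-lived
# HTTP/2 connection, with very different request volumes.
requests_per_client = [100, 5, 100, 5, 100, 5, 100, 5]

# Layer 4 (kube-proxy style): the connection is assigned to a pod
# once, at connect time; all of that client's requests follow it.
l4_load = [0] * PODS
for client, n in enumerate(requests_per_client):
    pod = client % PODS  # round-robin at connect time
    l4_load[pod] += n

# Layer 7: each individual request is balanced round-robin.
l7_load = [0] * PODS
rr = 0
for n in requests_per_client:
    for _ in range(n):
        l7_load[rr % PODS] += 1
        rr += 1

print("L4 per-pod load:", l4_load)  # heavy clients pile onto their pods
print("L7 per-pod load:", l7_load)  # even distribution
```

Here Layer 4 yields per-pod loads of [200, 10, 200, 10], while per-request balancing gives every pod 105 requests, even though both schemes used the same round-robin policy.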
Intelligent Kubernetes Load Balancing
Databricks has engineered a custom solution to overcome these limitations. Their approach moves beyond the default Layer 4 distribution, addressing traffic skew and tail latency spikes. The team will share architectural details and the trade-offs they considered, including why they did not adopt a full service mesh such as Istio.
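The abstract does not spell out the algorithm Databricks chose. One common Layer 7 technique for taming both skew and tail latency is a client-side "power of two choices" picker, which samples two backends per request and sends it to the less loaded one. A minimal sketch of that idea, not Databricks' actual implementation:

```python
import random

class P2CPicker:
    """Per-request 'power of two choices' least-loaded endpoint picker.

    Tracks outstanding (in-flight) requests per endpoint; each pick
    samples two endpoints and routes to the one with fewer in flight.
    """

    def __init__(self, endpoints, seed=0):
        self.outstanding = {ep: 0 for ep in endpoints}
        self._rng = random.Random(seed)  # seeded for reproducibility

    def pick(self):
        a, b = self._rng.sample(list(self.outstanding), 2)
        ep = a if self.outstanding[a] <= self.outstanding[b] else b
        self.outstanding[ep] += 1
        return ep

    def release(self, ep):
        # Call when the request completes, so load counts stay accurate.
        self.outstanding[ep] -= 1

picker = P2CPicker(["pod-a", "pod-b", "pod-c"])
ep = picker.pick()
print("routed to", ep, "in-flight:", picker.outstanding)
picker.release(ep)
```

Compared with plain round-robin, least-loaded-of-two reacts to slow or overloaded pods because a backlogged endpoint loses most of its pairwise comparisons, which is why variants of this scheme appear in many L7 load balancers.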