The open-source project llm-d, designed to orchestrate and scale distributed inference across accelerator infrastructure, is entering the Cloud Native Computing Foundation (CNCF) Sandbox. This marks an important step toward making production inference a standard, cloud-native capability.
Last May, CoreWeave joined Red Hat, IBM, Google, and NVIDIA as a founding contributor to llm-d, believing that production inference needed to be built in the open. llm-d's entry into the CNCF Sandbox signals a broader industry shift: more pioneers and established enterprises are treating production inference with the rigor, openness, and interoperability that modern AI workloads demand, recognizing that distributed inference is now foundational cloud-native infrastructure that requires a collaborative, multi-vendor approach.
Inference Becomes the Backbone of Rapidly Scaling AI
Inference at scale presents challenges distinct from traditional cloud workloads: it is stateful, hardware-sensitive, and must be cost-efficient to be viable. The rise of AI agents is transforming inference from a simple serving layer into a real-time, always-on production necessity.
Agent-driven workflows across customer support, software development, and internal operations depend on fast, reliable inference. This surge in demand strains infrastructure and forces teams to align cost and performance with the value each application delivers. Enterprises also need the flexibility to deploy models across public cloud, private data centers, and edge environments without vendor lock-in.
llm-d aims to address these challenges by intelligently managing inference workload placement and scaling, while offering deployment flexibility across diverse environments. Its move into the CNCF Sandbox provides a neutral ground for broader ecosystem adoption, extension, and contribution.
AI Serving Needs Its Own Orchestration Layer
While Kubernetes transformed software deployment, its standard orchestration capabilities are insufficient for the demands of large-scale inference workloads. The cost and behavior of LLM inference requests vary significantly based on factors like prompt length and model phase (prefill vs. decode).
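To make that point concrete, the deliberately simplified cost model below shows how two requests to the same model can place very different demands on hardware depending on prompt length and phase. This is not llm-d code; the functions and constants are invented purely for illustration.

```python
# Toy cost model illustrating why prefill and decode behave so differently.
# The constants and the model itself are illustrative assumptions, not
# measurements from llm-d or any particular accelerator.

def prefill_cost(prompt_tokens: int) -> float:
    """Prefill processes every prompt token in one parallel pass,
    so work grows with prompt length (largely compute-bound)."""
    return prompt_tokens * 1.0  # arbitrary units per token

def decode_cost(prompt_tokens: int, output_tokens: int) -> float:
    """Decode emits one token at a time, re-reading an ever-growing
    KV cache, so each step is dominated by memory traffic."""
    cost = 0.0
    for step in range(output_tokens):
        context_len = prompt_tokens + step
        cost += 0.1 + 0.001 * context_len  # per-step overhead + cache reads
    return cost

# Two requests to the same model stress the hardware very differently:
short_chat = (prefill_cost(200), decode_cost(200, 500))
long_rag = (prefill_cost(8000), decode_cost(8000, 100))
print(f"short chat -> prefill {short_chat[0]:.0f}, decode {short_chat[1]:.0f}")
print(f"long RAG   -> prefill {long_rag[0]:.0f}, decode {long_rag[1]:.0f}")
```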
Standard service routing overlooks these dynamics, leading to inefficient placement and unpredictable latency. llm-d introduces a purpose-built orchestration layer between serving frameworks and inference engines. This layer brings intelligence to workload routing, placement, and scaling. It integrates with Kubernetes-native components like KServe, Gateway API, and Prometheus, transforming complex distributed inference into a manageable, observable cloud-native workload.
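As an illustration of what intelligent routing means in practice, the hypothetical sketch below scores replicas by KV-cache affinity, free cache memory, and queue depth rather than by connection count alone. The field names and weights are invented for this example and do not reflect llm-d's actual scheduler or API.

```python
# Hypothetical sketch of cache- and load-aware endpoint selection, in the
# spirit of an inference-aware scheduler. Illustrative only.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queue_depth: int          # requests already waiting on this replica
    kv_cache_free: float      # fraction of KV-cache memory still available
    has_prefix_cached: bool   # earlier turns of this conversation cached here

def score(ep: Endpoint) -> float:
    """Higher is better: prefer replicas that already hold the request's
    prefix in KV cache, have spare cache memory, and short queues."""
    return (
        2.0 * (1.0 if ep.has_prefix_cached else 0.0)
        + 1.0 * ep.kv_cache_free
        - 0.5 * ep.queue_depth
    )

def pick(endpoints: list[Endpoint]) -> Endpoint:
    # Round-robin or least-connections would see only queue_depth;
    # an inference-aware scheduler can weigh cache locality as well.
    return max(endpoints, key=score)

replicas = [
    Endpoint("decode-0", queue_depth=3, kv_cache_free=0.2, has_prefix_cached=True),
    Endpoint("decode-1", queue_depth=1, kv_cache_free=0.8, has_prefix_cached=False),
]
print(pick(replicas).name)  # cache affinity outweighs the longer queue
```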
The project's strength lies in its broad coalition. From its inception, llm-d has united contributors, industry leaders, and academic supporters around the principle that production inference advances fastest through open, interoperable infrastructure. CoreWeave's contributions are informed by operating production inference under real-world demand and by the conviction that teams should retain control, performance tuning, and operational ownership without sacrificing architectural continuity or cost visibility.
The CNCF offers a natural home for this effort, providing transparent governance and a framework for growing open-source communities. This move creates a broader path to share contributions and collaborate with organizations tackling similar production challenges.
Some layers of AI infrastructure are too important to remain fragmented or vendor-defined. Inference is one of them. The llm-d community has an opportunity to make production inference more accessible, portable, and efficient across the full range of AI infrastructure environments. CoreWeave is proud to contribute alongside a growing community of supporters.
