Cloudflare is making a significant push into the large language model (LLM) arena with its Workers AI platform, announcing support for frontier open-source models. The company is kicking things off with Moonshot AI's Kimi K2.5, a move designed to equip developers with powerful tools for building sophisticated AI agents.
This expansion positions Cloudflare as a more comprehensive environment for AI development, moving beyond just execution primitives like Durable Objects and Workflows. The integration of Kimi K2.5, boasting a substantial 256k context window and multi-turn tool calling capabilities, directly addresses the need for capable models to power agentic tasks.
Kimi K2.5: A Cost-Effective Powerhouse
Cloudflare has been internally testing Kimi K2.5, integrating it into development tools and automated code review processes, including its public code review agent, Bonk. The model has reportedly demonstrated strong performance and cost efficiency, proving to be a viable alternative to larger proprietary models.
The company highlighted a specific use case where an agent performing security reviews processed over 7 billion tokens daily. By switching to Kimi K2.5 on Workers AI, Cloudflare claims to have cut costs by 77%, projecting annual savings of $2.4 million for that single workload compared to using a mid-tier proprietary model.
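Taken together, those figures imply rough per-token economics. The sketch below derives them from the article's numbers; the per-million-token rates are back-of-the-envelope estimates, not official pricing from Cloudflare or Moonshot AI.

```typescript
// Back-of-the-envelope check of the reported savings.
// Inputs come from the article; derived rates are estimates only.
const tokensPerDay = 7e9;           // > 7 billion tokens processed daily
const annualSavingsUsd = 2_400_000; // projected annual savings
const savingsFraction = 0.77;       // claimed 77% cost reduction

const dailySavingsUsd = annualSavingsUsd / 365;
// The savings are 77% of the original spend, so: original = savings / 0.77
const originalDailySpendUsd = dailySavingsUsd / savingsFraction;
const newDailySpendUsd = originalDailySpendUsd - dailySavingsUsd;

// Implied blended cost per million tokens, before and after the switch
const originalPerMTok = originalDailySpendUsd / (tokensPerDay / 1e6);
const newPerMTok = newDailySpendUsd / (tokensPerDay / 1e6);

console.log(originalPerMTok.toFixed(2)); // ~1.22 USD per million tokens
console.log(newPerMTok.toFixed(2));      // ~0.28 USD per million tokens
```

At 7 billion tokens a day, even fractions of a cent per million tokens compound quickly, which is why the workload-level framing (dollars per day, not dollars per request) is the one that matters here.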
This cost efficiency is crucial as demand for AI inference, particularly from personal and coding agents, skyrockets. Cloudflare aims to facilitate the enterprise shift toward open-source models that offer comparable reasoning capabilities without the premium price tag of proprietary solutions.
Optimizing the Inference Stack
While Workers AI has served models for two years, the focus was previously on smaller architectures. The introduction of models like Kimi K2.5 necessitated upgrades to Cloudflare's inference stack. The company has developed custom kernels for Kimi K2.5, leveraging its Infire inference engine to optimize performance and GPU utilization.
Cloudflare emphasizes that developers using Workers AI bypass the complexities of self-hosting and optimizing large open-source models, which typically requires expertise in machine learning engineering and DevOps. The platform handles these intricate optimizations, offering a simplified API-driven approach.
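In practice, that API-driven approach means a hosted model is a single HTTP call rather than a GPU deployment. The sketch below builds such a request against Cloudflare's REST endpoint for running Workers AI models; the model slug for Kimi K2.5 is a hypothetical placeholder, so check the official docs for the real identifier.

```typescript
// Sketch: constructing an API call to a hosted model on Workers AI.
// The model slug below is an illustrative assumption, not a confirmed id.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function buildInferenceRequest(
  accountId: string,
  apiToken: string,
  model: string,
  messages: ChatMessage[],
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    // Workers AI REST endpoint: POST /accounts/{account_id}/ai/run/{model}
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ messages }),
    },
  };
}

const req = buildInferenceRequest(
  "ACCOUNT_ID",
  "API_TOKEN",
  "@cf/moonshotai/kimi-k2.5", // hypothetical slug for illustration
  [{ role: "user", content: "Summarize this pull request." }],
);
```

Everything below that call (kernel selection, batching, GPU scheduling) is the part Cloudflare says it handles on the developer's behalf.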
Platform Enhancements for Agentic Workloads
Alongside the Kimi K2.5 integration, Cloudflare is rolling out platform improvements to enhance agent development. Prefix caching is now surfaced as a usage metric with discounted pricing for cached tokens, aiming to reduce latency and computational cost during multi-turn conversations.
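The reason prefix caching pays off for agents is that each turn of a conversation resends the same system prompt and prior history, so most input tokens are a repeat of an already-processed prefix. A toy illustration of that accounting (the token sequences here are invented):

```typescript
// Why prefix caching cuts multi-turn cost: successive turns share a long,
// identical prefix, and only the suffix needs fresh computation.
function longestSharedPrefix(a: string[], b: string[]): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

// Each turn's full input as a toy token sequence.
const turn1 = ["<sys>", "review", "code", "<user>", "fix", "bug"];
const turn2 = [...turn1, "<asst>", "done", "<user>", "add", "tests"];

const cached = longestSharedPrefix(turn1, turn2); // tokens servable from cache
const fresh = turn2.length - cached;              // tokens actually recomputed
```

In this toy case more than half of the second turn's input tokens are cache hits, which is exactly the fraction the discounted cached-token pricing is meant to reward; real agent transcripts with long system prompts skew even further toward the cached side.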
Cloudflare is also introducing a new x-session-affinity header, which improves cache hit rates by routing a session's requests to the same model instance, further boosting performance and lowering inference costs. Tools like OpenCode and the Agents SDK starter support the header automatically.
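Using the header amounts to sending a stable identifier with every turn of a conversation. The header name comes from the announcement, but the value format here (an opaque per-conversation id) is an assumption:

```typescript
// Sketch: pinning an agent session to one model instance via
// the x-session-affinity header. Value format is an assumption.
function withSessionAffinity(
  headers: Record<string, string>,
  sessionId: string,
): Record<string, string> {
  // Same id on every turn => same instance => warm prefix cache.
  return { ...headers, "x-session-affinity": sessionId };
}

const base = { "Content-Type": "application/json" };
const headers = withSessionAffinity(base, "conv-42");
```

The helper returns a new object rather than mutating the input, so a base header set can be shared across concurrent sessions safely.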
The redesigned asynchronous API offers a more robust path for non-real-time agentic tasks. The revamped system uses a pull-based approach, processing requests as capacity becomes available; this mitigates "Out of Capacity" errors and ensures durable execution for workloads like code scanning or research agents.
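The core idea of the pull-based design can be shown with a toy queue: submissions are always accepted and sit in a backlog until capacity frees up, instead of being rejected at submit time. All names here are illustrative, not Cloudflare's actual API.

```typescript
// Toy model of a pull-based async API: no rejection at submit time;
// work is pulled off the queue as capacity allows.
type Job = { id: string; done: boolean };

class PullQueue {
  private pending: Job[] = [];
  readonly completed: string[] = [];
  constructor(private capacity: number) {}

  submit(id: string): void {
    // Always accepted -- the backlog absorbs bursts instead of erroring.
    this.pending.push({ id, done: false });
  }

  // One scheduling tick: pull up to `capacity` jobs and finish them.
  tick(): void {
    const batch = this.pending.splice(0, this.capacity);
    for (const job of batch) this.completed.push(job.id);
  }

  get backlog(): number {
    return this.pending.length;
  }
}

const q = new PullQueue(2);
["scan-a", "scan-b", "scan-c"].forEach((id) => q.submit(id));
q.tick(); // capacity 2: processes scan-a and scan-b; scan-c waits
q.tick(); // processes scan-c
```

Because jobs persist in the queue rather than failing fast, a batch of overnight code scans completes eventually even if it briefly outpaces available GPUs, which is the durability property the article describes.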
Cloudflare is making Kimi K2.5 available on Workers AI starting today, with developer documentation, pricing, and integration details now published. The Agents SDK starter defaults to Kimi K2.5, and the model can also be accessed via OpenCode and the company's playground.
