Cloudflare is making a significant push into the large language model (LLM) arena with its Workers AI platform, announcing support for frontier open-source models. The company is kicking things off with Moonshot AI's Kimi K2.5, a move designed to equip developers with powerful tools for building sophisticated AI agents.
This expansion positions Cloudflare as a more comprehensive environment for AI development, moving beyond just execution primitives like Durable Objects and Workflows. The integration of Kimi K2.5, boasting a substantial 256k context window and multi-turn tool calling capabilities, directly addresses the need for capable models to power agentic tasks.
Kimi K2.5: A Cost-Effective Powerhouse
Cloudflare has been internally testing Kimi K2.5, integrating it into development tools and automated code review processes, including its public code review agent, Bonk. The model has reportedly demonstrated strong performance and cost efficiency, proving to be a viable alternative to larger proprietary models.
The company highlighted a specific use case where an agent performing security reviews processed over 7 billion tokens daily. By switching to Kimi K2.5 on Workers AI, Cloudflare claims to have cut costs by 77%, projecting annual savings of $2.4 million for that single workload compared to using a mid-tier proprietary model.
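Taken together, those figures imply rough per-token economics. The sketch below derives them from the article's numbers; the per-million-token rates are back-of-the-envelope estimates, not official pricing from Cloudflare or Moonshot AI.

```typescript
// Back-of-the-envelope check of the reported savings.
// Inputs come from the article; derived rates are estimates only.
const tokensPerDay = 7e9;           // > 7 billion tokens processed daily
const annualSavingsUsd = 2_400_000; // projected annual savings
const savingsFraction = 0.77;       // claimed 77% cost reduction

const dailySavingsUsd = annualSavingsUsd / 365;
// The savings are 77% of the original spend, so: original = savings / 0.77
const originalDailySpendUsd = dailySavingsUsd / savingsFraction;
const newDailySpendUsd = originalDailySpendUsd - dailySavingsUsd;

// Implied blended cost per million tokens, before and after the switch
const originalPerMTok = originalDailySpendUsd / (tokensPerDay / 1e6);
const newPerMTok = newDailySpendUsd / (tokensPerDay / 1e6);

console.log(originalPerMTok.toFixed(2)); // ~1.22 USD per million tokens
console.log(newPerMTok.toFixed(2));      // ~0.28 USD per million tokens
```

At 7 billion tokens a day, even fractions of a cent per million tokens compound quickly, which is why the workload-level framing (dollars per day, not dollars per request) is the one that matters here.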
This cost efficiency is crucial as demand for AI inference, particularly from personal and coding agents, skyrockets. Cloudflare aims to facilitate the enterprise shift toward open-source models that offer comparable reasoning capabilities without the premium price tag of proprietary solutions.
Optimizing the Inference Stack
While Workers AI has served models for two years, the focus was previously on smaller architectures. The introduction of models like Kimi K2.5 necessitated upgrades to Cloudflare's inference stack. The company has developed custom kernels for Kimi K2.5, leveraging its Infire inference engine to optimize performance and GPU utilization.
Cloudflare emphasizes that developers using Workers AI bypass the complexities of self-hosting and optimizing large open-source models, which typically requires expertise in machine learning engineering and DevOps. The platform handles these intricate optimizations, offering a simplified API-driven approach.
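In practice, that API-driven approach means a hosted model is a single HTTP call rather than a GPU deployment. The sketch below builds such a request against Cloudflare's REST endpoint for running Workers AI models; the model slug for Kimi K2.5 is a hypothetical placeholder, so check the official docs for the real identifier.

```typescript
// Sketch: constructing an API call to a hosted model on Workers AI.
// The model slug below is an illustrative assumption, not a confirmed id.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

function buildInferenceRequest(
  accountId: string,
  apiToken: string,
  model: string,
  messages: ChatMessage[],
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    // Workers AI REST endpoint: POST /accounts/{account_id}/ai/run/{model}
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/run/${model}`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ messages }),
    },
  };
}

const req = buildInferenceRequest(
  "ACCOUNT_ID",
  "API_TOKEN",
  "@cf/moonshotai/kimi-k2.5", // hypothetical slug for illustration
  [{ role: "user", content: "Summarize this pull request." }],
);
```

Everything below that call (kernel selection, batching, GPU scheduling) is the part Cloudflare says it handles on the developer's behalf.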
Platform Enhancements for Agentic Workloads
Alongside the Kimi K2.5 integration, Cloudflare is rolling out platform improvements to enhance agent development. Prefix caching is now surfaced as a usage metric with discounted pricing for cached tokens, aiming to reduce latency and computational cost during multi-turn conversations.
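The reason prefix caching pays off for agents is that each turn of a conversation resends the same system prompt and prior history, so most input tokens are a repeat of an already-processed prefix. A toy illustration of that accounting (the token sequences here are invented):

```typescript
// Why prefix caching cuts multi-turn cost: successive turns share a long,
// identical prefix, and only the suffix needs fresh computation.
function longestSharedPrefix(a: string[], b: string[]): number {
  let i = 0;
  while (i < a.length && i < b.length && a[i] === b[i]) i++;
  return i;
}

// Each turn's full input as a toy token sequence.
const turn1 = ["<sys>", "review", "code", "<user>", "fix", "bug"];
const turn2 = [...turn1, "<asst>", "done", "<user>", "add", "tests"];

const cached = longestSharedPrefix(turn1, turn2); // tokens servable from cache
const fresh = turn2.length - cached;              // tokens actually recomputed
```

In this toy case more than half of the second turn's input tokens are cache hits, which is exactly the fraction the discounted cached-token pricing is meant to reward; real agent transcripts with long system prompts skew even further toward the cached side.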
Cloudflare is also introducing a new x-session-affinity header, which improves cache hit rates by routing a session's requests to the same model instance, further boosting performance and lowering inference costs. Tools like OpenCode and the Agents SDK starter support the header automatically.
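Using the header amounts to sending a stable identifier with every turn of a conversation. The header name comes from the announcement, but the value format here (an opaque per-conversation id) is an assumption:

```typescript
// Sketch: pinning an agent session to one model instance via
// the x-session-affinity header. Value format is an assumption.
function withSessionAffinity(
  headers: Record<string, string>,
  sessionId: string,
): Record<string, string> {
  // Same id on every turn => same instance => warm prefix cache.
  return { ...headers, "x-session-affinity": sessionId };
}

const base = { "Content-Type": "application/json" };
const headers = withSessionAffinity(base, "conv-42");
```

The helper returns a new object rather than mutating the input, so a base header set can be shared across concurrent sessions safely.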
The redesigned asynchronous API offers a more robust path for non-real-time agentic tasks. The revamped system uses a pull-based approach, processing requests as capacity becomes available; this mitigates "Out of Capacity" errors and ensures durable execution for workloads like code scanning or research agents.
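The core idea of the pull-based design can be shown with a toy queue: submissions are always accepted and sit in a backlog until capacity frees up, instead of being rejected at submit time. All names here are illustrative, not Cloudflare's actual API.

```typescript
// Toy model of a pull-based async API: no rejection at submit time;
// work is pulled off the queue as capacity allows.
type Job = { id: string; done: boolean };

class PullQueue {
  private pending: Job[] = [];
  readonly completed: string[] = [];
  constructor(private capacity: number) {}

  submit(id: string): void {
    // Always accepted -- the backlog absorbs bursts instead of erroring.
    this.pending.push({ id, done: false });
  }

  // One scheduling tick: pull up to `capacity` jobs and finish them.
  tick(): void {
    const batch = this.pending.splice(0, this.capacity);
    for (const job of batch) this.completed.push(job.id);
  }

  get backlog(): number {
    return this.pending.length;
  }
}

const q = new PullQueue(2);
["scan-a", "scan-b", "scan-c"].forEach((id) => q.submit(id));
q.tick(); // capacity 2: processes scan-a and scan-b; scan-c waits
q.tick(); // processes scan-c
```

Because jobs persist in the queue rather than failing fast, a batch of overnight code scans completes eventually even if it briefly outpaces available GPUs, which is the durability property the article describes.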
Cloudflare is making Kimi K2.5 available on Workers AI starting today, with developer documentation, pricing, and integration details now published. The Agents SDK starter defaults to Kimi K2.5, and the model can also be accessed via OpenCode and the company's playground.
