Cloudflare's AI Code Review Overhaul

Cloudflare has engineered a sophisticated, multi-agent AI code review system to eliminate bottlenecks and improve code quality.

Figure: An overview of Cloudflare's AI code review orchestration system. (Image: Cloudflare)

Code reviews are essential for catching bugs and sharing knowledge, but they often become a significant bottleneck for engineering teams. Cloudflare experienced this firsthand, with merge requests waiting hours for initial feedback. To tackle this, they explored AI code review solutions.

Initial attempts with existing AI tools showed promise but lacked the customization needed for an organization the size of Cloudflare. A more direct approach, feeding raw diffs into large language models with basic prompts, resulted in a flood of vague, often inaccurate suggestions. This led Cloudflare to develop a CI-native orchestration system around OpenCode, an open-source coding agent.

Orchestrating AI Code Review at Scale

Cloudflare's current system deploys a coordinated group of specialized AI agents for each merge request. Instead of a single, monolithic model, up to seven distinct reviewers focus on areas like security, performance, code quality, documentation, release management, and compliance with their internal Engineering Codex. A central coordinator agent manages these specialists, deduplicating findings, assessing severity, and consolidating feedback into a single, structured comment.
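As a sketch of what the coordinator's consolidation step could look like in TypeScript: the finding shape, ranking, and deduplication key below are assumptions for illustration, not Cloudflare's actual data model.

```typescript
type Severity = "critical" | "warning" | "suggestion";

// Hypothetical finding shape; the real agent output format is not public.
interface Finding {
  file: string;
  line: number;
  severity: Severity;
  message: string;
  reviewer: string; // e.g. "security", "performance"
}

// Merge findings from all sub-reviewers: collapse duplicates that point at the
// same location, keeping the most severe report, then sort for the summary.
function consolidate(findings: Finding[]): Finding[] {
  const rank: Record<Severity, number> = { critical: 0, warning: 1, suggestion: 2 };
  const byLocation = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = byLocation.get(key);
    if (!existing || rank[f.severity] < rank[existing.severity]) {
      byLocation.set(key, f);
    }
  }
  return [...byLocation.values()].sort((a, b) => rank[a.severity] - rank[b.severity]);
}
```

Deduplicating by location before ranking means two agents flagging the same line produce one entry at the higher severity, which is what keeps the final comment readable.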

This system has been rigorously tested across tens of thousands of merge requests, effectively approving clean code, flagging genuine bugs with high accuracy, and blocking merges for critical issues or security vulnerabilities. This initiative is part of Cloudflare's broader strategy for improving engineering resiliency, known as Code Orange: Fail Small.

Modular Architecture for Flexibility

Building internal tooling that spans thousands of repositories requires extreme flexibility. Cloudflare opted for a composable plugin architecture to avoid hardcoding dependencies on specific version control systems or AI providers. This design ensures the system can adapt to future changes, such as supporting new VCS platforms or integrating different AI models.

Each plugin adheres to a defined `ReviewPlugin` interface with distinct lifecycle phases: bootstrap, configure, and postConfigure. This modularity isolates concerns; for example, the GitLab plugin doesn't interact with Cloudflare AI Gateway configurations, and vice versa. This approach ensures that VCS-specific coupling is contained within a single configuration file.
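A minimal sketch of what such a contract could look like; only the phase names come from the article, while the method signatures and context shape here are assumptions.

```typescript
// Hypothetical shared context that plugins read from and write to.
interface ReviewContext {
  config: Map<string, unknown>;
}

// Sketch of the ReviewPlugin contract with the three lifecycle phases named
// in the article; the real internal interface may differ.
interface ReviewPlugin {
  name: string;
  bootstrap(ctx: ReviewContext): Promise<void> | void;      // one-time setup
  configure(ctx: ReviewContext): Promise<void> | void;      // contribute settings
  postConfigure?(ctx: ReviewContext): Promise<void> | void; // react to final config
}

// Run each phase across all plugins before starting the next phase, so a
// later phase can read configuration contributed by any plugin without the
// plugins depending on each other directly.
async function runLifecycle(plugins: ReviewPlugin[], ctx: ReviewContext): Promise<void> {
  for (const p of plugins) await p.bootstrap(ctx);
  for (const p of plugins) await p.configure(ctx);
  for (const p of plugins) await p.postConfigure?.(ctx);
}
```

Phased execution is what lets, say, a model-override plugin adjust settings in `postConfigure` that a gateway plugin wrote during `configure`, with no direct coupling between the two.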

A typical internal review utilizes a roster of plugins covering:

  • GitLab VCS integration and comment posting
  • Cloudflare AI Gateway configuration and model tiering
  • Internal compliance checks against engineering RFCs
  • Distributed tracing and observability
  • Verification of `AGENTS.md` documentation
  • Dynamic per-reviewer model overrides
  • Review tracking and telemetry

Leveraging OpenCode Under the Hood

Cloudflare chose OpenCode for its extensive internal use, open-source nature allowing for upstream contributions, and a robust SDK ideal for plugin development. Crucially, OpenCode's server-first architecture enables programmatic session creation and SDK-driven prompting, avoiding limitations of CLI interfaces.

The orchestration operates in two layers. The Coordinator Process, spawned as a child process, handles the main review logic. To circumvent command-line argument limits on large merge requests, prompts are passed via standard input. Output is streamed as JSON Lines (JSONL) and parsed line by line, so every completed event survives even if a long-running process exits unexpectedly mid-write.
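The buffering logic that makes truncated output safe can be sketched as a generic JSONL chunk parser (illustrative, not Cloudflare's code):

```typescript
// Accumulate stdout chunks and emit one parsed event per complete JSONL line.
// Buffering on newline boundaries means every fully written event is usable
// even if the child process dies mid-write; only the trailing partial line
// is discarded.
function makeJsonlParser(onEvent: (event: unknown) => void): (chunk: string) => void {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep the incomplete tail for the next chunk
    for (const line of lines) {
      if (line.trim() !== "") onEvent(JSON.parse(line));
    }
  };
}
```

Wired up to a real child process, the same function would consume `child.stdout.on("data", ...)` chunks, while the oversized prompt is written to `child.stdin` rather than passed as an argument.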

Within the OpenCode process, a Review Plugin utilizes the `spawn_reviewers` tool. When the coordinator decides a review is needed, this tool launches individual sub-reviewer sessions via OpenCode's SDK. Each sub-reviewer operates in its own session with a specialized agent prompt, free to use tools like `grep` or search the codebase to return findings in a structured XML format.
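The fan-out itself can be sketched generically. The `run` callback below stands in for OpenCode's session API, whose exact calls are not described in the article; the shape that matters is that each specialist runs in its own concurrent session and the coordinator gathers the results.

```typescript
// Stand-in for "create a session for this agent, prompt it, return its raw
// output" via the SDK; injected so this sketch does not guess at real calls.
type RunReviewer = (agent: string, promptPath: string) => Promise<string>;

// Launch one isolated session per specialist agent concurrently and collect
// their findings (per the article, each returns structured XML).
async function spawnReviewers(
  agents: string[],
  promptPath: string,
  run: RunReviewer,
): Promise<Map<string, string>> {
  const results = await Promise.all(
    agents.map(async (agent) => [agent, await run(agent, promptPath)] as const),
  );
  return new Map(results);
}
```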

Specialized Agents for Focused Review

The system eschews a single, all-powerful model for a distributed approach using domain-specific agents. Each agent receives a tightly scoped prompt detailing what to flag and, importantly, what to ignore. For instance, the security reviewer is instructed to focus solely on exploitable or concretely dangerous issues, ignoring theoretical risks or suggestions for unchanged code.

This precise instruction, particularly defining what not to flag, is key to effective prompt engineering. It prevents the flood of speculative warnings that developers quickly learn to disregard. Each reviewer outputs findings classified by severity (critical, warning, or suggestion), so structured data rather than free-form prose drives downstream actions.
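Structured severities make the downstream gating easy to express. A hypothetical sketch of the verdict logic the article implies (approve clean code, comment on non-blocking findings, block on critical ones):

```typescript
type Severity = "critical" | "warning" | "suggestion";

// Map the consolidated severity list to a merge-request action. The exact
// policy is an assumption based on the behavior the article describes.
function reviewVerdict(severities: Severity[]): "approve" | "comment" | "block" {
  if (severities.includes("critical")) return "block";
  if (severities.length > 0) return "comment";
  return "approve";
}
```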

Tiered Model Usage

By segmenting review tasks, Cloudflare can assign AI models based on job complexity, optimizing cost and performance. Top-tier models like Claude Opus and GPT-5.4 are reserved for the Review Coordinator, which requires the highest reasoning capabilities to synthesize findings from multiple agents.

Standard-tier models such as Claude Sonnet and GPT-5.3 Codex handle intensive sub-reviewer tasks like code quality and security checks. Lightweight, text-heavy tasks, such as documentation review, are delegated to models like Kimi K2.5. All model assignments can be dynamically overridden via a Cloudflare Worker.
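The tier mapping with a dynamic override layer might look like the following. The model names come from the article, but the identifier strings and the override map (standing in for the Cloudflare Worker, whose API is not public) are illustrative.

```typescript
type Role = "coordinator" | "code-quality" | "security" | "documentation";

// Default assignments per the article's tiering; identifiers are assumptions.
const defaultModels: Record<Role, string> = {
  coordinator: "claude-opus",      // top tier: synthesizes all agent findings
  "code-quality": "claude-sonnet", // standard tier: intensive review work
  security: "gpt-5.3-codex",       // standard tier
  documentation: "kimi-k2.5",      // lightweight, text-heavy tasks
};

// Overrides would be fetched dynamically (the article says via a Worker);
// here they are just a partial map consulted before the defaults.
function modelFor(role: Role, overrides: Partial<Record<Role, string>>): string {
  return overrides[role] ?? defaultModels[role];
}
```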

Prompt injection is mitigated by assembling prompts from agent-specific files and shared rules, and by sanitizing user-controlled input to prevent breakout attacks. The system also saves tokens by passing paths to per-file patch files rather than embedding full diffs in prompts, with sub-reviewers accessing only relevant patches.

To combat the perception of hung jobs during complex AI processing, a simple heartbeat log provides real-time status updates. This has significantly reduced premature job cancellations.
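A heartbeat like this needs very little machinery; the interval and message format below are illustrative, not Cloudflare's.

```typescript
// Pure formatter, kept separate so it is easy to test.
function formatHeartbeat(startedMs: number, nowMs: number, stage: string): string {
  const elapsed = Math.round((nowMs - startedMs) / 1000);
  return `[heartbeat] ${elapsed}s elapsed, stage=${stage}`;
}

// Periodically print elapsed time and the current stage so CI logs show
// progress while the model works. Returns a stopper to call on completion.
function startHeartbeat(getStage: () => string, intervalMs = 30_000): () => void {
  const started = Date.now();
  const timer = setInterval(
    () => console.log(formatHeartbeat(started, Date.now(), getStage())),
    intervalMs,
  );
  return () => clearInterval(timer);
}
```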

The approach taken by Cloudflare engineering demonstrates a pragmatic path for integrating AI into critical development workflows, improving both efficiency and reliability.

© 2026 StartupHub.ai. All rights reserved.