Exa.ai, the AI-powered search engine, is tackling a critical gap in evaluating search capabilities for coding agents. Today, the company is open-sourcing 'WebCode,' a suite of benchmarks designed to rigorously assess how well web search functions serve AI developers. This move comes as Exa observes a significant surge in code search queries over the past year, particularly a sharp increase at the end of 2025. Precision in these results is paramount, as coding agents rely on retrieved context for multi-step reasoning, where stale or noisy data can derail complex processes. Exa has developed a dedicated pipeline to ensure fresh, clean results, prioritizing substantive content over full-page scraping.
The need for robust evaluation methods is underscored by the shortcomings of current public benchmarks. As detailed on exa.ai, these existing tools often fall prey to data contamination and saturation. Models may learn to answer benchmark questions through memorization rather than genuine reasoning, a problem highlighted by OpenAI's recent deprecation of its own SWE-bench Verified benchmark. This issue is particularly relevant for search, where agents must navigate new, niche, or rapidly updating information, such as changelogs and SDK documentation.
Evaluating Content Quality
WebCode's evaluation framework is split into two primary components: content quality and retrieval quality. Content quality focuses on how faithfully a search provider extracts relevant information from a given URL. Exa's approach involves creating a 'golden reference' by rendering web pages in a cloud browser, capturing screenshots, and using a multimodal model to generate markdown that mirrors the rendered output.
This method ensures that evaluations are based on what a human user would actually see, accounting for JavaScript execution and dynamic rendering. The scoring process combines LLM-judged metrics for semantic dimensions like completeness and accuracy with deterministic NLP metrics, such as a signal ratio and ROUGE-L, for precise, reproducible measurements.
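Exa does not publish its exact formulas, but the deterministic side of such a pipeline is straightforward to sketch. ROUGE-L, for instance, scores an extraction against the golden reference using the longest common subsequence of their tokens; a minimal implementation might look like this:

```python
def lcs_length(a: list, b: list) -> int:
    # dynamic-programming longest common subsequence over token lists
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def rouge_l_f1(candidate: str, reference: str) -> float:
    # ROUGE-L F1: harmonic mean of LCS-based precision and recall
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the metric is deterministic, the same extraction always receives the same score, which makes it useful as a stable complement to LLM judges.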
In tests on a dataset of 250 URLs, Exa demonstrated strong performance across various metrics, outperforming competitors like Parallel and Claude in areas such as completeness and signal. For example, Exa's extraction length was closer to the golden reference compared to other providers, indicating less extraneous 'chrome' or navigational content.
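A length-based closeness check of the kind described above can be sketched simply; the exact form Exa uses is not stated, but a plausible version compares extracted length to the golden reference, with ratios well above 1.0 suggesting leftover navigation or boilerplate:

```python
def length_ratio(extracted: str, golden: str) -> float:
    # ratio of extracted length to golden-reference length;
    # values near 1.0 mean the extraction matches the reference in size,
    # values well above 1.0 suggest extra navigational "chrome"
    if not golden:
        raise ValueError("golden reference must be non-empty")
    return len(extracted) / len(golden)

def length_closeness(extracted: str, golden: str) -> float:
    # distance from the ideal ratio of 1.0; lower is better
    return abs(length_ratio(extracted, golden) - 1.0)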
In-Document Search and Groundedness
Beyond full-page extraction, WebCode also evaluates 'highlights'—the ability to pinpoint the most relevant section within a document for a specific query. This is crucial for token-efficient code search and serves as a foundational element for techniques like Retrieval-Augmented Generation (RAG).
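A production highlights system would likely rely on embeddings, but the core idea, sliding a window over the document and returning the span most relevant to the query, can be sketched with simple token overlap (the window and stride parameters here are illustrative, not Exa's):

```python
def best_highlight(document: str, query: str,
                   window: int = 50, stride: int = 25) -> str:
    # split the document into overlapping word windows and return the one
    # with the greatest token overlap with the query (a crude stand-in for
    # the embedding similarity a real highlights system would use)
    words = document.split()
    q = set(query.lower().split())
    best_span, best_score = "", -1
    for start in range(0, max(len(words) - window, 0) + 1, stride):
        chunk = words[start:start + window]
        score = sum(1 for w in chunk if w.lower() in q)
        if score > best_score:
            best_score, best_span = score, " ".join(chunk)
    return best_span
```

Returning only the best span, rather than the full page, is what makes this approach token-efficient when the result is fed to a downstream agent.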
A key innovation is reframing RAG evaluation as a discriminative task ('does this context contain the answer?') rather than a generative one ('generate the answer from this context'). This separation helps isolate the performance of the retrieval component from the generative capabilities of the synthesis model.
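The distinction between the two framings is easiest to see in the prompts themselves. The wording below is hypothetical, not Exa's, but it illustrates how the discriminative version asks only a yes/no question about the retrieved context:

```python
def generative_prompt(context: str, question: str) -> str:
    # generative framing: the model must synthesize the answer itself,
    # conflating retrieval quality with the model's own ability
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def discriminative_prompt(context: str, question: str, answer: str) -> str:
    # discriminative framing: the model only judges whether the retrieved
    # context supports a known reference answer, isolating retrieval quality
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Reference answer: {answer}\n"
        "Does the context contain enough information to support the "
        "reference answer? Reply strictly 'yes' or 'no'."
    )
```

Because the discriminative judge is given the reference answer, a weak synthesis model can no longer mask a strong retrieval result, or vice versa.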
Exa's results show that while 'correctness' scores (which rely on the synthesis LLM) cluster tightly, 'groundedness' scores (which measure if the retrieved context actually supports the answer) exhibit much higher variance. This indicates groundedness is a more effective metric for differentiating the actual capabilities of search providers.
Retrieval Quality Across the Web
The evaluation then extends RAG-style code search from a single document to the entire web. Exa generates question-answer pairs from lengthy documentation, ensuring the queries are difficult for models to answer from memory alone. These queries test each provider's ability to retrieve URLs that contain the correct answer, measured by both groundedness and citation precision.
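Exa does not spell out the exact definitions, but plausible forms of these two metrics are simple set operations over the retrieved results, given a labeled set of answer-bearing URLs:

```python
def citation_precision(cited_urls: list, supporting_urls: list) -> float:
    # fraction of cited URLs that actually contain the answer
    if not cited_urls:
        return 0.0
    supporting = set(supporting_urls)
    return sum(1 for u in cited_urls if u in supporting) / len(cited_urls)

def groundedness(retrieved_urls: list, supporting_urls: list) -> float:
    # 1.0 if any retrieved result contains the answer, else 0.0
    supporting = set(supporting_urls)
    return float(any(u in supporting for u in retrieved_urls))
```

Averaged over a query set, groundedness rewards finding at least one answer-bearing page, while citation precision penalizes padding results with irrelevant URLs.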
For end-to-end coding tasks, Exa developed a novel benchmark mirroring real-world autonomous coding workflows. This benchmark involves sandboxed environments with bash tool calls and unit tests, explicitly evaluating a coding agent's ability to leverage web search. Unlike existing benchmarks that focus primarily on reasoning or tool use, Exa's evaluation directly measures the impact of search quality.
The benchmark construction involves careful library selection, knowledge checks against frontier models, and rigorous task generation and verification within a sandbox. Exa's findings indicate that native search integration significantly improves an agent's pass rate on these coding tasks compared to a baseline without search or with a separate search sub-agent.
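The comparison itself reduces to pass rates per configuration over the sandboxed unit-test runs. A minimal aggregation, assuming each run is recorded as a (configuration, task, passed) record, might look like:

```python
from collections import defaultdict

def pass_rates(results: list) -> dict:
    # results: list of (configuration, task_id, passed) records from
    # sandboxed unit-test runs; returns the pass rate per configuration
    totals, passes = defaultdict(int), defaultdict(int)
    for config, _task_id, passed in results:
        totals[config] += 1
        passes[config] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}
```

Comparing, say, a "native search" configuration against a "no search" baseline over the same task set is then a matter of reading off the two rates.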
WebCode addresses a critical need for specialized evaluations in the realm of coding agents. By focusing on both content and retrieval quality, Exa aims to drive improvements across the industry. The company is actively hiring to build out its search engine for leading coding agent companies.
