"Yesterday's data can't answer today's questions," a stark truth underscored at the AI Engineer World's Fair, highlights a critical challenge in developing reliable AI search systems. Julia Neagu, CEO and co-founder of Quotient AI, and Maitar Asher, Head of Engineering at Tavily, presented their collaborative framework for evaluating augmented AI systems, emphasizing a shift from static benchmarks to dynamic, real-time assessment.
At the AI Engineer World's Fair in San Francisco, Neagu and Asher detailed how Quotient AI and Tavily are tackling the complexities of AI search. Traditional monitoring, built for predictable software, falters when confronted with AI agents that operate in constantly evolving web environments, make real-time decisions, and handle arbitrary user queries. These dynamic systems present multiple, interconnected failure modes, from hallucinations to retrieval errors, rendering conventional evaluation metrics insufficient.
The core problem, as articulated by Neagu, is: "How do you build robust web-based AI search- and retrieval-augmented agents if: (1) the web is constantly changing AND (2) users make arbitrary queries?" Tavily, as an infrastructure layer powering millions of real-time AI search requests, experiences this volatility directly. Their solution involves building dynamic evaluation datasets that align with real-world events, offer broad coverage across diverse domains, and stay relevant through regular refreshing.
Quotient AI and Tavily have open-sourced a dynamic eval dataset generator, an agent that automatically creates evidence-based Q&A pairs by generating broad web search queries and aggregating grounding documents from multiple real-time AI search providers. This approach maximizes coverage and minimizes bias, moving beyond the limitations of static datasets like SimpleQA and HotpotQA, which, while useful for factual accuracy and multi-hop reasoning, cannot keep pace with the web's fluidity.
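To make the pipeline concrete, here is a minimal Python sketch of that kind of generator. The function names and the callable interfaces for query generation, search providers, and Q&A synthesis are illustrative assumptions, not the open-sourced tool's actual API.

```python
# Hypothetical sketch of a dynamic eval-set generator: query generation,
# multi-provider retrieval, and evidence-grounded Q&A synthesis are passed in
# as callables so any search API or LLM client could be plugged in.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GroundedQA:
    question: str
    answer: str
    evidence: list[str] = field(default_factory=list)  # source snippets or URLs


def generate_dynamic_dataset(
    seed_topics: list[str],
    make_queries: Callable[[str], list[str]],            # e.g. an LLM prompt that fans a topic out into broad queries
    search_providers: list[Callable[[str], list[dict]]],  # each returns [{"url": ..., "content": ...}, ...]
    synthesize_qa: Callable[[str, list[dict]], GroundedQA],  # LLM call that writes a Q&A pair grounded in the docs
    docs_per_query: int = 5,
) -> list[GroundedQA]:
    dataset: list[GroundedQA] = []
    for topic in seed_topics:
        for query in make_queries(topic):
            # Aggregate grounding documents across providers to broaden
            # coverage and dilute any single provider's bias.
            docs: list[dict] = []
            for provider in search_providers:
                docs.extend(provider(query)[:docs_per_query])
            if docs:
                dataset.append(synthesize_qa(query, docs))
    return dataset
```

Because the dataset is regenerated from live search results, rerunning the same pipeline on a schedule is what keeps the benchmark aligned with current events.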
Beyond mere accuracy, the framework advocates for a holistic evaluation. "We argue that it's important to measure accuracy, but you should not stop there," stated Deanna Emery, Founding AI Researcher at Quotient AI, elaborating on the need for unsupervised evaluation methods. These methods eliminate reliance on labeled data, enabling scalable and unbiased assessments of critical aspects like answer completeness, document relevance, and hallucination rates. Their findings demonstrate that dynamic benchmarks reveal significantly different performance rankings compared to static ones, exposing flaws traditional evaluations miss.
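As a rough illustration of label-free scoring, the sketch below computes document relevance, hallucination rate, and answer completeness from lexical overlap alone, using only the query, the retrieved documents, and the generated answer. The heuristics and thresholds are assumptions for demonstration; they are not Quotient AI's actual unsupervised evaluators.

```python
# Illustrative, label-free heuristics: score a single search-augmented
# response without any human-annotated ground truth.
import re


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def document_relevance(query: str, docs: list[str]) -> float:
    """Fraction of retrieved documents sharing substantial vocabulary with the query."""
    q = _tokens(query)
    if not docs or not q:
        return 0.0
    hits = sum(1 for d in docs if len(q & _tokens(d)) / len(q) >= 0.3)
    return hits / len(docs)


def hallucination_rate(answer: str, docs: list[str]) -> float:
    """Fraction of answer sentences with little lexical support in any retrieved document."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    doc_vocab = set().union(*(_tokens(d) for d in docs)) if docs else set()
    unsupported = sum(
        1 for s in sentences
        if len(_tokens(s) & doc_vocab) / max(len(_tokens(s)), 1) < 0.5
    )
    return unsupported / len(sentences)


def answer_completeness(query: str, answer: str) -> float:
    """Fraction of query terms addressed somewhere in the answer."""
    q = _tokens(query)
    return len(q & _tokens(answer)) / len(q) if q else 0.0
```

In practice these signals would come from stronger judges (embeddings or LLM graders), but the point stands: none of them requires a labeled answer key, so they scale to arbitrary live queries.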
The true power emerges when these metrics are combined. "Metrics work better together," Emery highlighted. By integrating answer completeness, document relevance, and hallucination metrics, the system can pinpoint specific failure modes and suggest actionable fixes. For example, low document relevance coupled with hallucinations might indicate a need to retrieve more comprehensive documents, while high relevance with persistent hallucinations could point to problems in how the model uses or paraphrases the retrieved content. This multi-dimensional analysis provides a clear roadmap for continuous improvement.
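That diagnostic logic can be expressed as a small triage step over the combined metrics. The thresholds and messages below are hypothetical; they simply mirror the failure-mode reasoning described in the talk.

```python
# A toy triage function (thresholds are arbitrary illustrations) mapping the
# combined metrics from the sketch above to suggested remediations.
def diagnose(relevance: float, hallucination: float, completeness: float) -> str:
    if hallucination > 0.2 and relevance < 0.5:
        return "Retrieval gap: fetch more (or more comprehensive) documents."
    if hallucination > 0.2 and relevance >= 0.5:
        return "Utilization gap: the model is misusing or over-paraphrasing good context."
    if completeness < 0.5:
        return "Coverage gap: the answer skips parts of the query; broaden retrieval or prompting."
    return "No dominant failure mode detected."


print(diagnose(relevance=0.3, hallucination=0.4, completeness=0.8))
# -> Retrieval gap: fetch more (or more comprehensive) documents.
```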
Ultimately, this approach aims to lay the foundation for AI systems that don't just retrieve information but continuously improve over time: AI that adapts seamlessly to changing contexts, augmented systems capable of debugging themselves mid-loop, and continuous optimization in production environments.

