Ankur Goyal, CEO of Braintrust, recently presented at the AI Engineer World's Fair in San Francisco, introducing Loop, a new evaluation assistant designed to transform the manual, often cumbersome process of AI model development. Goyal's talk highlighted how rapidly AI evaluation practice has grown, how labor-intensive it remains, and how Loop addresses these bottlenecks for developers building advanced AI products.
The current landscape of AI evaluation, or "evals," sees organizations logging a staggering number of experiments. Goyal noted, "On average, organizations log 12.8 experiments per day with Braintrust. Some of our customers run more than 3,000 evals a day." This volume underscores the intense iteration required in AI development, yet the process remains stubbornly human-centric. Engineers are spending significant time—"more than two hours in the product every day"—manually sifting through dashboards, attempting to discern actionable insights from raw evaluation data.
This manual burden is precisely what Braintrust aims to alleviate with Loop. Goyal articulated the core problem: "Evals are such a manual process. To date, every time you run an eval, the best thing you can do is look at a dashboard... and you walk away and think, okay, what changes can I make to my code or to my prompts so that this eval does better." Loop is positioned as an intelligent agent built directly into Braintrust that applies advanced large language models to automate these optimization tasks.
Loop serves as an evaluation assistant, automating prompt optimization and generating dataset rows directly in the playground. It analyzes the current context (scorers, datasets, prompts, and evaluations) and surfaces tailored changes. This reduces manual debugging effort and helps teams apply evaluation best practices quickly, yielding faster and more accurate results.
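For context, the raw material Loop works with is an eval definition: a dataset of input/expected pairs, a task under test, and one or more scorers. The sketch below follows the pattern from Braintrust's publicly documented Python SDK (the `Eval` entry point plus an off-the-shelf `autoevals` scorer); the project name, toy task, and data rows are illustrative and were not part of Goyal's talk.

```python
# A minimal Braintrust eval: dataset rows, a task under test, and a scorer.
# These are the pieces Loop's suggestions target: the prompt/task, the
# dataset rows, and the scorers. (Illustrative sketch; assumes the
# `braintrust` and `autoevals` packages and a BRAINTRUST_API_KEY.)
from braintrust import Eval
from autoevals import Levenshtein

Eval(
    "greeting-bot",  # hypothetical project name
    # Dataset: Loop can generate additional rows like these in the playground.
    data=lambda: [
        {"input": "Foo", "expected": "Hi Foo"},
        {"input": "Bar", "expected": "Hi Bar"},
    ],
    # Task under test: in a real project this would call a model with a
    # prompt, which is what Loop's prompt-optimization suggestions would edit.
    task=lambda input: "Hi " + input,
    # Scorer: string similarity between the output and the expected value.
    scores=[Levenshtein],
)
```

Each run of a script like this logs an experiment to Braintrust, the unit behind the "12.8 experiments per day" figure Goyal cited.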
The breakthrough that makes this kind of automation effective, according to Goyal, is recent. He specifically cited the performance of cutting-edge models: "We think that Claude 4 in particular was a real breakthrough moment, and it performs almost six times better than the previous leading model before it." This leap in model capability is what lets Loop analyze evals and propose optimizations reliably enough to be useful, and it suggests a pivotal shift in how AI development teams can operate.
By providing a side-by-side view of suggested edits to prompts and data, Loop ensures transparency and maintains developer control, a crucial aspect of responsible AI development. This iterative, AI-assisted approach promises to unlock a new level of efficiency, allowing AI engineers to focus on higher-level problem-solving rather than repetitive manual analysis. The integration of frontier models directly into the evaluation pipeline marks a significant step towards a more automated, and ultimately more productive, future for AI product development.

