Ankur Goyal, CEO of Braintrust, recently addressed attendees at the AI Engineer World's Fair, outlining five hard-earned lessons for developing successful AI applications. His talk emphasized the indispensable role of robust evaluation systems and the need to move beyond superficial metrics toward genuinely engineered approaches. The core message was clear: building impactful AI demands a sophisticated engineering mindset, particularly in how teams assess and refine model performance.
Effective evaluations are not incidental; they are deliberately constructed to reflect real-world performance. Goyal noted, "The most important property of a good dataset is that you can reconcile it with reality." This means moving past purely synthetic data to continuously incorporate genuine user feedback, transforming complaints into actionable evaluation metrics. He stressed that evaluations should be proactive, used "to play offense" by identifying new use cases and predicting performance, rather than merely for regression testing. A mature evaluation system, for instance, should enable a product team to roll out an update incorporating a new model within 24 hours.
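The idea of reconciling a dataset with reality can be made concrete. The sketch below is plain Python rather than the Braintrust SDK; the task function and exact-match scorer are hypothetical stand-ins. It shows one way a piece of user feedback can be captured as a permanent eval case and scored alongside the rest of the dataset:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str     # what the user actually asked
    expected: str  # the answer they should have received
    source: str    # "synthetic" or "user_feedback"

def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer; real scorers are often LLM- or rule-based."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(dataset: list[EvalCase], task: Callable[[str], str]) -> float:
    """Run the task over every case and return the mean score."""
    scores = [exact_match(task(case.input), case.expected) for case in dataset]
    return sum(scores) / len(scores)

# A production complaint becomes a permanent case the next release must pass.
dataset = [
    EvalCase(
        input="Cancel my subscription",
        expected="Your subscription has been cancelled.",
        source="user_feedback",
    ),
]
```

Re-running the same dataset on every candidate change is what makes the 24-hour model rollout described above credible rather than a leap of faith.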
The era of simple prompt engineering is waning, replaced by "context engineering." This involves optimizing the entire informational context provided to a large language model (LLM), including meticulously defined tools and their outputs. Braintrust's analysis reveals that the vast majority of tokens in a typical prompt come not from the system prompt itself, but from tool definitions and, above all, tool responses, which account for 67.6%. This demands precision in how tools are structured and how their outputs are presented to the model: even a seemingly minor change, such as serializing tool output as YAML instead of JSON, can dramatically affect LLM comprehension and performance.
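The JSON-versus-YAML point is easy to check empirically. The snippet below is a rough illustration, assuming tiktoken's cl100k_base encoding as a proxy tokenizer and PyYAML for serialization; the actual savings depend on the target model's tokenizer and the shape of the payload:

```python
# Compare how many tokens the same tool response costs as JSON vs. YAML.
import json
import tiktoken
import yaml  # PyYAML

enc = tiktoken.get_encoding("cl100k_base")

tool_response = {
    "results": [
        {"id": "ord_123", "status": "shipped", "eta_days": 2},
        {"id": "ord_456", "status": "pending", "eta_days": 5},
    ]
}

as_json = json.dumps(tool_response, indent=2)
as_yaml = yaml.dump(tool_response, sort_keys=False)

print("JSON tokens:", len(enc.encode(as_json)))
print("YAML tokens:", len(enc.encode(as_yaml)))
```

Because tool responses dominate the token budget, small per-response savings and clearer structure compound across every call an agent makes.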
Agility is paramount in the rapidly evolving AI landscape, where a new model can fundamentally alter product viability. Goyal highlighted a feature that scored only 10% on its evaluations with GPT-4o but reached 58% with Claude 4 Sonnet, turning it from unviable into viable. Such dramatic shifts underscore the need for model-agnostic systems, allowing developers to quickly integrate and test new models without extensive code changes. This proactive approach ensures organizations are prepared to capitalize on sudden leaps in model capabilities.
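One common way to stay model-agnostic is to treat the model as a configuration value behind a thin registry, so a new release can be wired in and re-evaluated the same day. The sketch below is illustrative only; the function names and model identifiers are hypothetical, and the provider calls are stubbed:

```python
from typing import Callable

ModelFn = Callable[[str], str]
MODEL_REGISTRY: dict[str, ModelFn] = {}

def register(name: str):
    """Register a model-calling function under a configurable name."""
    def wrap(fn: ModelFn) -> ModelFn:
        MODEL_REGISTRY[name] = fn
        return fn
    return wrap

@register("gpt-4o")
def call_gpt_4o(prompt: str) -> str:
    raise NotImplementedError("wire up the OpenAI client here")

@register("claude-4-sonnet")
def call_claude_4_sonnet(prompt: str) -> str:
    raise NotImplementedError("wire up the Anthropic client here")

def run_feature(prompt: str, model: str = "gpt-4o") -> str:
    # Swapping models is a one-line config change, not a code change.
    return MODEL_REGISTRY[model](prompt)
```

Paired with the evaluation harness above, adding a new model becomes a registry entry plus an eval run, rather than a rewrite.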
Braintrust's new "Loop" feature addresses this need directly, letting developers optimize the entire evaluation system rather than isolated prompts. Loop can auto-optimize prompts, datasets, and scorers within the platform. Optimizing all three together yields dramatically better results: in one benchmark, the score improved from 8.9% (prompt only) to 39.14% when the dataset, prompt, and scorers were optimized jointly. This enables rapid, intentional iteration, keeping AI applications continuously aligned with evolving model capabilities and user needs.
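To make the contrast with prompt-only tuning concrete, here is a rough sketch of the joint-optimization idea in generic Python (this is not the Loop API): each iteration proposes a change to the prompt, the dataset, or a scorer, and keeps it only if the aggregate evaluation score improves.

```python
from typing import Callable

def optimize_jointly(
    system: dict,
    evaluate: Callable[[dict], float],
    propose_change: Callable[[dict], dict],
    iterations: int = 20,
) -> tuple[dict, float]:
    """Greedy joint optimization over a system of {prompt, dataset, scorers}.

    Each iteration proposes a change to one component and keeps it only if
    the overall evaluation score improves.
    """
    best_score = evaluate(system)
    for _ in range(iterations):
        candidate = propose_change(dict(system))  # tweak prompt, dataset, or scorer
        score = evaluate(candidate)
        if score > best_score:  # greedy accept
            system, best_score = candidate, score
    return system, best_score
```

The point of the sketch is the search space, not the search strategy: once the dataset and scorers are movable parts alongside the prompt, the optimizer can fix the weakest component instead of over-tuning the prompt alone.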



