

AI Evals: Broken But Essential, Use Them Anyway

Ara Khan and Cline argue that AI evaluations, though flawed, are crucial. They outline common pitfalls and a process for iterative improvement, emphasizing honesty and nuanced assessment.

Jun 6 at 5:02 PM · 9 min read

Ara Khan presenting "Evals Are Broken, Use Them Anyway" to an audience. Image: AI Engineer

Visual TL;DR: AI evals are flawed, and two camps misunderstand them, yet evals remain essential. Using them well requires a three-stage approach, which is informed by heuristics for interpreting results and built around a loop of getting a score and then improving. That loop enables honest, nuanced assessment and clarifies what you're actually testing.

AI Evals Are Flawed: current evaluation methods for AI models are imperfect
Two Camps Misunderstand: over-reliance on quantitative benchmarks vs. ignoring metrics
Evals Are Essential: indispensable for building, interpreting, and improving AI agents
Three-Stage Approach: a structured process for iterative improvement of AI models
Heuristics for Interpreting: guidelines for understanding the nuances of evaluation results
Get Score, Then Improve: iterative process of scoring and refining AI agent performance
Honest, Nuanced Assessment: emphasizing transparency and detailed understanding of AI capabilities
What You're Testing: clarifying the actual capabilities and limitations being measured


How well evaluations actually measure AI systems is a critical and often contentious question. In their presentation 'Evals Are Broken, Use Them Anyway,' Ara Khan and Cline tackle it head-on, arguing that evaluations, despite their inherent flaws, are indispensable for driving progress in AI development.

Video: AI Evals: Broken But Essential, Use Them Anyway (AI Engineer)

Khan and Cline's core thesis is that while current evaluation methods for AI models are imperfect, they are still essential for building, interpreting, and ultimately improving AI agents. They highlight a common sentiment that many people are 'wrong about evals,' suggesting a need for a more nuanced understanding and application of these metrics.

The Two Camps of Flawed Evals

The presentation identifies two primary groups that misunderstand or misapply AI evaluations:

The 'Objective Metrics' Camp: This group relies too heavily on quantitative benchmarks, believing that high scores on public metrics automatically translate to real-world utility. The speakers point to models that excel on benchmarks but fail in practical applications, often because of 'benchmark overfitting.'
The 'Taste is King' Camp: Conversely, this group dismisses objective metrics entirely, prioritizing subjective 'taste' or qualitative assessments. While subjective feedback is valuable, relying on it exclusively can lead to inconsistent and biased evaluations.

Khan and Cline argue that the truth lies in the middle. Evals are neither the be-all and end-all, nor are they completely useless. There are indeed right and wrong ways to use them.

A Three-Stage Approach to Using Evals

To navigate the complexities of AI evaluations, the speakers propose a three-stage framework for developers:

Level 1: Leverage Evals from Outside Sources. This involves utilizing existing benchmarks and evaluation datasets created by others.
Level 2: Use Evals to Improve Your Own Agents. This stage focuses on applying evaluation results to iteratively refine the performance of your AI models based on specific use cases.
Level 3: Build Your Own Evals for Specific Use Cases. This advanced level involves creating custom evaluation frameworks tailored to the unique requirements of your AI application.

The presentation emphasizes that while Level 1 and Level 2 are crucial starting points, the ultimate goal is to develop bespoke evaluations that accurately reflect the desired performance in real-world scenarios.
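As a rough illustration of what a Level 3 eval can look like at its smallest, the sketch below runs a handful of hand-written cases through an agent and reports a score plus the names of the failing cases. Everything here is an assumption made for the example, not something shown in the talk: the `EvalCase` and `run_eval` names, the toy agent, and the three checks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One hypothetical test case: a prompt plus a check on the output."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the agent and collect the failing case names."""
    failures = []
    for case in cases:
        try:
            passed = case.check(agent(case.prompt))
        except Exception:
            # A crash counts as a failed case rather than aborting the run.
            passed = False
        if not passed:
            failures.append(case.name)
    total = len(cases)
    score = (total - len(failures)) / total if total else 0.0
    return {"score": score, "failures": failures}


def toy_agent(prompt: str) -> str:
    """Stand-in agent for illustration only: it upper-cases the prompt."""
    return prompt.upper()


# Invented cases; a real suite would encode the behavior your users need.
CASES = [
    EvalCase("keeps_length", "hello", lambda out: len(out) == 5),
    EvalCase("is_uppercase", "hello", lambda out: out.isupper()),
    EvalCase("adds_greeting", "hello", lambda out: out.startswith("Hi")),
]

result = run_eval(toy_agent, CASES)
print(result["score"], result["failures"])
```

Returning the failing case names alongside the score is a deliberate choice: the score says how much is wrong, while the names say where to look first.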

Heuristics for Interpreting Evals

To help developers interpret evaluation results more effectively, Khan and Cline offer several heuristics:

Heuristic 1: Don't Believe Model Lab Evals Blindly; Treat Them as Approximations. The scores and results reported by model labs are often optimized for specific conditions and may not reflect real-world performance. It's important to treat them as estimates rather than absolute truths.
Heuristic 2: Stay Current, But Don't Be the Earliest Adopter. The AI field moves incredibly fast, with new models and benchmarks emerging constantly. While it's important to stay informed, rushing to adopt the very latest release can mean relying on evaluations that are not yet thoroughly validated or that are susceptible to 'overfitting.' It's often better to wait for more robust, better-tested evaluations.
Heuristic 3: Look for Very Precise and New Evals. As benchmarks and evaluation methodologies evolve, newer and more precise evaluations are developed. These are more likely to capture the nuances of AI performance and provide more reliable insights than older, more standardized tests.
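One concrete way to read Heuristic 1 is that any pass rate measured on a finite set of tasks carries sampling uncertainty. The sketch below, which is an illustration added here rather than something from the talk, computes a Wilson score interval for a pass rate. The 80-out-of-100 figure is a made-up example, and the interval covers sampling noise only, not benchmark contamination or harness differences.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 is roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical reported result: 80 of 100 tasks passed.
low, high = wilson_interval(80, 100)
print(f"reported 0.80, plausible range {low:.3f} to {high:.3f}")

# The same pass rate on 10x more tasks gives a much tighter range.
low_big, high_big = wilson_interval(800, 1000)
print(f"reported 0.80, plausible range {low_big:.3f} to {high_big:.3f}")
```

On a small benchmark, a difference of a few points between two models can sit entirely inside this range, which is one reason to treat headline numbers as estimates.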

The Process: Get a Score, Then Improve

The core process for improving AI models through evaluation is presented as a cyclical approach:

Run the Eval: Obtain an original score using an existing benchmark.
Evaluate All Failures: Analyze the failures to identify common themes or patterns. For instance, does a particular tool blow up on massive files? Does inference consistently time out? Does the agent get distracted and go off track?
Bucket Your Failures: Group similar failures together to understand the root causes.
Make One Tiny Change: Implement a small, targeted improvement based on the failure analysis.
Re-run the Whole Eval: Re-run the full evaluation to measure the impact of the change, then repeat the cycle.
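The bucketing step above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the failure records, the task ids, the keyword rules, and the bucket names are all invented for the example, and real failure logs would likely need richer matching than substring checks.

```python
from collections import Counter

# Hypothetical failure records from one eval run (invented for illustration).
FAILURES = [
    {"task": "t01", "error": "tool crashed: file too large"},
    {"task": "t07", "error": "inference timeout after 120s"},
    {"task": "t09", "error": "tool crashed: file too large"},
    {"task": "t12", "error": "agent edited unrelated files"},
    {"task": "t15", "error": "inference timeout after 120s"},
    {"task": "t21", "error": "tool crashed: file too large"},
]

# Keyword rules checked in order; the first match wins. These are assumptions.
BUCKET_RULES = [
    ("large_file", ("file too large",)),
    ("timeout", ("timeout", "timed out")),
    ("off_track", ("unrelated",)),
]


def bucket_of(error: str) -> str:
    """Map one error message to a bucket name, or 'unbucketed' if none match."""
    text = error.lower()
    for bucket, keywords in BUCKET_RULES:
        if any(keyword in text for keyword in keywords):
            return bucket
    return "unbucketed"


def bucket_failures(failures: list[dict]) -> Counter:
    """Count how many failures land in each bucket."""
    return Counter(bucket_of(failure["error"]) for failure in failures)


buckets = bucket_failures(FAILURES)
for name, count in buckets.most_common():
    print(f"{name}: {count}")
```

Sorting buckets by size suggests where the one tiny change should go. Keeping an explicit 'unbucketed' bucket makes it visible when the rules no longer explain the failures being seen.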

What You're Actually Testing

The presentation clarifies that when conducting evaluations, you are essentially testing three intertwined components:

The Model Itself: The inherent capabilities and limitations of the AI model.
The Harness: The scaffolding or framework used to run the evaluations, which can significantly influence the results.
The Problem: Whether the problem being addressed is well-defined and 'sane.' If you're optimizing for a flawed or misaligned problem, even perfect scores will be misleading.

All three elements need to be in alignment, and honesty with oneself about these components is crucial for meaningful evaluation and improvement.
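Because a score reflects the model, the harness, and the problem at once, a change in score is only easy to interpret when one of the three changed between runs. The sketch below is one hypothetical way to enforce that bookkeeping. The `RunConfig` fields, the version labels, and the scores are assumptions made for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """The three things an eval run actually tests (labels are hypothetical)."""
    model: str
    harness: str
    problem_set: str


def changed_components(before: RunConfig, after: RunConfig) -> list[str]:
    """List which of the three components differ between two runs."""
    return [
        name
        for name in ("model", "harness", "problem_set")
        if getattr(before, name) != getattr(after, name)
    ]


def attribute_delta(before: RunConfig, after: RunConfig,
                    score_before: float, score_after: float) -> str:
    """Describe a score change, refusing to attribute it when confounded."""
    changed = changed_components(before, after)
    delta = score_after - score_before
    if not changed:
        return f"delta {delta:+.3f} with nothing changed: likely run-to-run variation"
    if len(changed) == 1:
        return f"delta {delta:+.3f} attributable to {changed[0]}"
    return f"delta {delta:+.3f} is confounded: {', '.join(changed)} changed"


base = RunConfig("model-a", "harness-v1", "tasks-v1")
one_change = RunConfig("model-a", "harness-v2", "tasks-v1")
two_changes = RunConfig("model-b", "harness-v2", "tasks-v1")

single = attribute_delta(base, one_change, 0.62, 0.66)
confounded = attribute_delta(base, two_changes, 0.62, 0.71)
print(single)
print(confounded)
```

This pairs naturally with the advice to make one tiny change at a time: if two components move together, the resulting number cannot say which one helped.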

Conclusion

Khan and Cline conclude by reiterating the importance of a pragmatic approach to AI evaluations. They advise finding relevant benchmarks, building honest evaluations, and critically assessing the results. Benchmarks alone are insufficient; they must be complemented by a qualitative 'vibe check' and a commitment to continuous, iterative improvement. By understanding the limitations and applying the right heuristics, developers can harness the power of evaluations to build better, more reliable AI systems.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
