The very nature of AI, particularly large language models (LLMs), introduces a profound challenge to traditional software development: how do you reliably measure goodness? This fundamental question underpinned the workshop led by David Karam, Product Director at Pi Labs and a veteran of Google Search, at the recent AI Engineer World's Fair in San Francisco.
A common refrain among attendees was that "evals can be labor-intensive" and "painful to get started with." Unlike deterministic software, LLMs are stochastic; their outputs can vary widely, making objective performance measurement difficult. One attendee highlighted the struggle to "define the metrics" for text generation, where the "correct answer" can manifest in myriad ways. This inherent ambiguity means that "each case is so unique that you just can't reuse some previous works." Consequently, the time spent on evaluation in AI development has skyrocketed, with one participant estimating it consumes "80% of feature development time," a stark increase over typical software development.
Karam and his co-founder, drawing on their decade at Google building Search's core AI, posit that evaluating LLMs is not merely a testing exercise but a holistic discipline of "quality engineering." Finding errors and measuring quality are paramount, they argue, which requires a shift from "vibe testing" to a structured, iterative loop: "measure, learn, build iteratively." The key, they contend, lies in dissecting a seemingly subjective notion of "goodness" into a "tree of signals." This architectural approach breaks complex, abstract quality notions down into granular, objective components.
By decomposing a subjective metric like "goodness" into a hierarchy of objective signals (a content score, a title score, or specific elements such as "action item clarity" and "due date inclusion" in a meeting summary), developers gain precise insight into *why* an output is good or bad. This granularity is crucial for debugging and for enabling optimization techniques such as reinforcement learning, which rely on nuanced feedback. Because the leaf signals are narrow and objective, they exhibit "low variance": they return the same score for the same input every time, which is vital for the stability and convergence of optimization algorithms. The result is a more reliable feedback loop, in which engineering effort translates directly into tangible improvements in model performance and user satisfaction.
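To make the idea concrete, here is a minimal sketch of such a signal tree for a meeting-summary product. The class, signal names, checks, and weights below are illustrative assumptions for this article, not Pi Labs' actual tooling or the workshop's exact definitions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Signal:
    """One node in the signal tree: either a leaf check or a weighted group of children."""
    name: str
    scorer: Optional[Callable[[str], float]] = None   # leaf: returns a score in [0, 1]
    children: List["Signal"] = field(default_factory=list)
    weight: float = 1.0

    def score(self, output: str) -> float:
        if self.scorer is not None:                    # leaf: one narrow, objective check
            return self.scorer(output)
        total = sum(child.weight for child in self.children)
        return sum(child.weight * child.score(output) for child in self.children) / total


# Leaf checks are deliberately narrow, so the same output always receives the same score.
def has_due_dates(summary: str) -> float:
    return 1.0 if "due" in summary.lower() else 0.0


def action_items_clear(summary: str) -> float:
    bullets = [line for line in summary.splitlines() if line.strip().startswith("-")]
    return 1.0 if bullets else 0.0


goodness = Signal("goodness", children=[
    Signal("title_score", scorer=lambda s: 1.0 if s.splitlines() and s.splitlines()[0].strip() else 0.0),
    Signal("content_score", children=[
        Signal("action_item_clarity", scorer=action_items_clear),
        Signal("due_date_inclusion", scorer=has_due_dates),
    ]),
])

print(goodness.score("Weekly sync\n- Draft the report, due Friday"))  # 1.0
```

Because every leaf check is deterministic, the aggregate score is stable for a given output, which is the "low variance" property that lets it serve as a dependable regression target or reward signal.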
The proliferation of LLMs means that "everyone has to do evals," even though "best practices don't exist." Automated evaluation with LLM-as-a-judge models offers scalability, but it is often clunky and inaccurate. Pi Labs instead advocates a scoring system in which individual, objective signals are combined and then rigorously calibrated against "ground truth" data, typically derived from human preferences or user behavior (such as clicks or thumbs-up data). Calibration ensures the scoring system reflects desired product behavior, so that improvements measured in development translate directly into impact for users. Their product aims to simplify this process, providing tools for designing custom metrics, identifying the most useful signal types, and integrating scoring models into both online and offline workflows. The goal is not merely to provide a shortcut, but to give AI engineers the methodology and technology to rigorously measure and improve their applications.
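As a rough illustration of what such calibration could look like, the sketch below fits weights over leaf-signal scores to thumbs-up labels with a simple logistic regression. The data, feature names, and choice of model are assumptions made for demonstration; they are not drawn from Pi Labs' product or the workshop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds leaf-signal scores for one model output:
# [title_score, action_item_clarity, due_date_inclusion]  (invented example data)
signal_scores = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Ground truth from human preference or user behavior (1 = thumbs-up, 0 = thumbs-down).
thumbs_up = np.array([1, 1, 0, 0, 1, 0])

# Fit weights so the aggregate score tracks what users actually preferred.
calibrator = LogisticRegression()
calibrator.fit(signal_scores, thumbs_up)

# Calibrated probability that a new output would earn a thumbs-up.
new_output_signals = np.array([[1.0, 0.0, 0.0]])
print(calibrator.predict_proba(new_output_signals)[0, 1])
```

The specific model matters less than the principle: the ground truth can come from thumbs-up data, clicks, or side-by-side human preferences, and the aggregate score is fit to what users actually reward rather than hand-tuned.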

