"Please no more evals." This blunt plea, voiced by Ben Hylak, CTO of Raindrop, encapsulated a core sentiment at the AI Engineer World's Fair in San Francisco. He, alongside Sid Bendre, co-founder of Oleve, articulated a critical truth: while AI demos dazzle, many AI products simply don't work as intended in the wild. Their joint presentation dissected the often-unspoken reality of AI product development, pivoting from theoretical evaluations to practical, iterative strategies for building robust applications.
Hylak's company, Raindrop, builds what he describes as "Sentry for AI products," tooling that finds and fixes issues in production, while Bendre is known for scaling viral consumer AI products at Oleve. Their discussion centered on the challenges of moving from AI proofs-of-concept to functional, scalable products, emphasizing the critical role of continuous iteration and real-world data.
The current AI landscape, while exciting, is fraught with unpredictable behavior. Even industry leaders like OpenAI are not immune to shipping "not so great products." Hylak cited instances where OpenAI's Codex produced "equally dumb" tests and Grok hallucinated bizarre claims about "white genocide" when asked about enterprise software. These examples underscore a fundamental truth: "More capable = more undefined behavior." This inherent non-determinism means that simply increasing model intelligence doesn't guarantee a flawless user experience; rather, it often introduces new, unexpected edge cases.
The speakers contended that the traditional reliance on "evals" to gauge product quality is often misleading. "They tell you how good your product is. They're not," Hylak asserted, invoking Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. OpenAI itself publicly acknowledges this limitation: "Our evals won't catch everything: We can't predict every issue...for more subtle or emerging issues, like changes in tone or style, real-world use helps us spot problems and understand what matters most to users." The real value, the speakers argued, lies not in predefined evaluations but in continuous "signals."
Signals are defined as "ground-truthy indicators of your app's performance." These encompass both explicit user feedback, such as thumbs up/down, copying content, or sharing, and implicit cues. Implicit signals involve detecting patterns of behavior rather than subjective judgments, identifying issues like user frustration, task failures, or instances of AI "laziness." By observing these patterns across both explicit and implicit signals, teams can pinpoint and address critical issues that traditional evaluations often miss.
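To make the distinction concrete, here is a minimal sketch of signal extraction. Raindrop's actual pipeline and schema are not public, so the event fields, marker phrases, and signal names below are purely illustrative: explicit signals are read directly from user actions, while implicit ones are inferred from behavioral patterns in the interaction.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical interaction record; field names are illustrative, not Raindrop's schema.
@dataclass
class InteractionEvent:
    user_message: str
    thumbs: Optional[str] = None      # "up" / "down" if the user clicked feedback
    copied_output: bool = False       # user copied the AI's response
    shared_output: bool = False       # user shared the response
    regenerated: bool = False         # user asked for the answer again

FRUSTRATION_MARKERS = ("that's wrong", "not what i asked", "you didn't", "try again")
LAZINESS_MARKERS = ("i can't do that", "as an ai", "rest of the code is left")

def extract_signals(event: InteractionEvent, ai_response: str) -> dict:
    """Turn one interaction into ground-truthy signals, explicit and implicit."""
    text = event.user_message.lower()
    resp = ai_response.lower()
    return {
        # Explicit signals: the user told us directly.
        "explicit_positive": event.thumbs == "up" or event.copied_output or event.shared_output,
        "explicit_negative": event.thumbs == "down",
        # Implicit signals: behavioral patterns rather than subjective judgments.
        "user_frustration": any(m in text for m in FRUSTRATION_MARKERS),
        "task_failure": event.regenerated,
        "ai_laziness": any(m in resp for m in LAZINESS_MARKERS),
    }
```

Aggregated across thousands of sessions, even crude signals like these surface failure patterns that a fixed eval suite never exercises.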
Oleve, a lean four-person team, exemplifies the power of this iterative, signal-driven approach, scaling to $6 million in annual recurring revenue and generating half a billion social media views. Sid Bendre highlighted that AI is inherently "chaotic and non-deterministic." To navigate this, Oleve developed "Trellis," a framework designed to "guide the chaos, don't eliminate it." Trellis involves discretizing the infinite output space into manageable "buckets," prioritizing workflows by their impact on business KPIs (a formula incorporating volume, negative sentiment, estimated achievable delta, and strategic relevance), and recursively refining these workflows. This structured approach ensures that AI magic is "engineered, repeatable, testable, and attributable, but not accidental."
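The talk does not spell out how Trellis weights these factors, so the sketch below assumes a simple multiplicative score over hypothetical workflow buckets; the point is the prioritize-and-recurse loop, not the exact formula.

```python
from dataclasses import dataclass

@dataclass
class WorkflowBucket:
    """One discretized slice of the output space (names and numbers are illustrative)."""
    name: str
    volume: int                 # sessions that hit this workflow
    negative_sentiment: float   # share of sessions with negative signals, 0..1
    achievable_delta: float     # estimated improvement if addressed, 0..1
    strategic_relevance: float  # alignment with business KPIs, 0..1

def priority_score(b: WorkflowBucket) -> float:
    # Simple multiplicative combination; Oleve's actual weighting is not public.
    return b.volume * b.negative_sentiment * b.achievable_delta * b.strategic_relevance

buckets = [
    WorkflowBucket("summarize lecture notes", volume=12_000, negative_sentiment=0.08,
                   achievable_delta=0.5, strategic_relevance=0.9),
    WorkflowBucket("generate flashcards", volume=3_000, negative_sentiment=0.25,
                   achievable_delta=0.7, strategic_relevance=0.6),
]

# Fix the highest-impact buckets first, then re-measure signals and recurse.
for b in sorted(buckets, key=priority_score, reverse=True):
    print(f"{b.name}: {priority_score(b):.0f}")
```

The recursion matters as much as the ranking: once a bucket is improved, its signals are re-measured and the queue is re-sorted, so effort keeps flowing to whatever currently hurts the KPIs most.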
The path to building successful AI products is not about achieving perfect, static models, but about embracing continuous iteration driven by real-world performance signals. Success in AI demands a dynamic feedback loop, leveraging both explicit user actions and subtle behavioral cues to refine and improve, rather than relying on metrics that merely confirm what is already known.

