The widespread adoption of voice AI agents, despite their immense promise, hinges on a critical factor: trust. Brooke Hopkins, CEO of Coval, illuminated this challenge and presented a compelling solution during her talk at the AI Engineer World’s Fair. Drawing parallels from her foundational work in autonomous vehicle evaluation at Waymo, Hopkins posited that the reliability infrastructure developed for self-driving cars holds the key to unlocking scalable, trustworthy conversational AI.
Hopkins highlighted a paradoxical perception of voice agents: "We are overestimating them AND we're underestimating them." Enterprises overestimate what agents can do today, attempting to automate entire workflows at once and landing in what she terms "PoC Hell," a perpetual proof-of-concept state that never reaches full production, even as they underestimate where the technology is headed. This stems from a false dichotomy in current deployment approaches: either conservative, deterministic, but ultimately expensive IVR trees, or autonomous, flexible, but inherently unpredictable AI.
The path to resolving this, Hopkins argued, lies in emulating the rigorous evaluation methodologies honed in the autonomous vehicle (AV) industry. Waymo's success, she noted, stems not merely from technological prowess but from a "magical" reliability born of extensive testing. A pivotal insight from self-driving was the shift from manual, brittle, scenario-specific evaluations to large-scale simulation. "I think large scale simulation has been the huge unlock for self-driving and robotics," Hopkins stated. Simulation at that scale enables probabilistic evaluation: measuring how often certain events occur across countless simulated runs, rather than issuing a binary pass/fail for a single instance.
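To make that contrast concrete, here is a minimal Python sketch of what a probabilistic evaluation might look like. Everything in it is illustrative rather than drawn from the talk: `run_simulation`, the event names, and the 5% threshold are placeholder assumptions, and the "simulation" itself is faked with random draws.

```python
import random
from collections import Counter

def run_simulation(scenario, seed):
    """Hypothetical stand-in for one simulated call between a voice agent and a
    scripted caller persona; returns the notable events observed in that call.
    The conversation is faked with random draws purely for illustration."""
    rng = random.Random(seed)
    events = []
    if rng.random() < 0.08:   # pretend ~8% of calls get escalated
        events.append("escalated_to_human")
    if rng.random() < 0.02:   # pretend ~2% of calls quote a wrong price
        events.append("quoted_wrong_price")
    return events

def event_rates(scenario, n_runs=1_000):
    """Probabilistic evaluation: instead of a binary pass/fail on one transcript,
    measure how often each event occurs across many simulated runs."""
    counts = Counter()
    for seed in range(n_runs):
        counts.update(run_simulation(scenario, seed))
    return {event: count / n_runs for event, count in counts.items()}

rates = event_rates("refund_request")
# e.g. {"escalated_to_human": 0.08, "quoted_wrong_price": 0.02}
assert rates.get("quoted_wrong_price", 0.0) < 0.05, "regression: wrong-price rate too high"
```

The question shifts from "did this one conversation pass?" to "how often does this failure mode occur, and is that rate acceptable?"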
Applying these learnings to voice AI means embracing responsive environments, durable tests, and comprehensive coverage. Conversational agents, like self-driving cars, operate in dynamic, real-world interactions where each turn creates a new state. This necessitates testing systems capable of responding to myriad variations, moving beyond rigid input/output checks to probabilistic evaluations of overall agent performance. The non-determinism of large language models (LLMs) can actually be an asset here, enabling broader scenario generation.
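That idea can be sketched in a few lines, again with entirely hypothetical names: `simulate_call`, `agent_reply`, and `user_reply` below are stand-ins, not any particular framework's API. A simulated caller, played by an LLM at non-zero temperature, responds turn by turn to whatever the agent actually said, so each run of the same scenario can wander down a different conversational path.

```python
from dataclasses import dataclass, field

@dataclass
class Conversation:
    # The transcript is the evolving state: every turn appends to it, and the
    # next turn must respond to whatever state the conversation is now in.
    transcript: list = field(default_factory=list)

def simulate_call(agent_reply, user_reply, goal, max_turns=20):
    """Drive one simulated call. `agent_reply` is the system under test;
    `user_reply` stands in for an LLM-backed caller persona run at a
    temperature above zero, so repeated runs with the same goal naturally
    explore different conversational paths."""
    convo = Conversation()
    for _ in range(max_turns):
        user_msg = user_reply(convo.transcript, goal)   # non-deterministic caller
        convo.transcript.append(("user", user_msg))
        agent_msg = agent_reply(convo.transcript)       # voice agent under test
        convo.transcript.append(("agent", agent_msg))
        if "goodbye" in agent_msg.lower():              # crude end-of-call check
            break
    return convo

# Rather than asserting one exact output for one fixed input, run the same goal
# many times and score outcome-level properties of each transcript (task
# completed? policy violated? call escalated?), then aggregate the results.
```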
A truly scalable evaluation strategy for voice AI leans on automated evaluations for "speed and scale," reserving manual human oversight for critical calibration. The goal is not to automate *all* evaluations, but to build constant evaluation loops that mirror AV development: small feature evals during development, larger regression sets, pre- and post-submit CI/CD, and live monitoring and detection once agents are in production.

The key to building trustworthy evaluations, Hopkins stressed, is iteration: "The key to good evals in conversational AI is to iterate and calibrate based on human feedback. Look at the data!" In practice, that means defining specific metrics, leveraging LLMs as judges, and repeatedly checking automated assessments against human judgment until they reach the level of reliability each product function demands; a minimal sketch of that calibration step follows below. Voice AI is poised to become the next major platform, and robust evaluation is the non-negotiable foundation for its widespread, trusted adoption.
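The calibration step might look something like the following minimal sketch; `llm_judge`, `labeled_transcripts`, the rubric wording, and the 90% agreement bar are illustrative assumptions rather than Coval's actual tooling.

```python
def judge_transcript(llm_judge, transcript, rubric):
    """Hypothetical LLM-as-judge call: ask a model to grade a transcript
    against a rubric and return "pass" or "fail"."""
    prompt = f"Rubric: {rubric}\nTranscript: {transcript}\nAnswer pass or fail."
    return llm_judge(prompt).strip().lower()

def calibrate(llm_judge, labeled_transcripts, rubric):
    """Compare automated judgments against human labels. Low agreement means the
    rubric or judge prompt needs another iteration before the judge can be
    trusted to run unattended at scale."""
    agree = 0
    for transcript, human_label in labeled_transcripts:
        if judge_transcript(llm_judge, transcript, rubric) == human_label:
            agree += 1
    return agree / len(labeled_transcripts)

# agreement = calibrate(my_llm, human_reviewed_calls,
#                       "Did the agent verify identity before sharing account details?")
# if agreement < 0.9: revise the rubric or judge prompt and re-run.
```

Once agreement is acceptably high for a given metric, the judge can run at scale across the simulation suites described earlier, with humans spot-checking its output periodically to keep it calibrated.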

