Large Language Models, "by nature, LLMs can be very unreliable," as Ido Pesok, an engineer and researcher at Vercel, plainly stated. This inherent unpredictability poses a significant challenge for developers aiming to build robust AI applications. Pesok, speaking at the AI Engineer World's Fair in San Francisco, illuminated this critical issue and presented Vercel v0's approach to evaluation, emphasizing the need for application-layer "evals" rather than solely relying on model-level benchmarks.
Pesok illustrated the problem with a relatable anecdote about a simple "Fruit Letter Counter" app. Despite initial success in testing, a user's slightly varied query exposed a glaring failure. This highlights a crucial truth: "No one is going to use something that doesn't work. It's literally unusable." While AI models might perform admirably in controlled demonstrations, their real-world deployment often uncovers unexpected "hallucinations" that render them ineffective for end-users. The challenge, therefore, lies in making AI software truly reliable.
Vercel v0, a full-stack AI coding platform that recently crossed 100 million messages sent and launched GitHub Sync, tackles this by focusing on evals at the application layer. This means evaluating how the AI performs within the context of actual user interactions and specific use cases, moving beyond abstract model performance metrics. A core insight for this process is the need to "understand your 'court'." Using a basketball analogy, Pesok explained that the "court" represents the domain of user queries your application is designed to handle. Just as a basketball player needs to practice across the entire court, developers must ensure their evaluations cover the full spectrum of relevant user inputs, avoiding "out-of-bounds" scenarios that don't reflect real usage.
Collecting comprehensive data is paramount. Pesok stressed that "there is no shortcut here. You really have to do the work and understand what your court looks like." This involves gathering diverse user feedback, from simple thumbs up/down ratings to analyzing random samples from application logs, and even monitoring community forums and social media for user-reported issues. This qualitative and quantitative data forms the foundation for effective evals.
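As a toy sketch of that collection step (the `LogEntry` shape and helper names here are hypothetical, standing in for whatever logging and feedback schema an application already has, not anything from Vercel's codebase), one approach is to combine a uniform random sample of production queries with every interaction a user explicitly thumbs-downed:

```typescript
// Hypothetical shape for a logged interaction with optional user feedback.
interface LogEntry {
  query: string;
  response: string;
  thumbs?: "up" | "down";
}

// Draw a uniform random sample of logged queries so the eval set reflects
// the real distribution of traffic, not just the failures people noticed.
function sampleLogs(logs: LogEntry[], n: number): LogEntry[] {
  const pool = [...logs];
  // Fisher-Yates shuffle, then take the first n entries.
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}

// Merge the random sample with all thumbs-down interactions, since explicit
// negative feedback marks known misses on the "court".
function buildCandidateDataset(logs: LogEntry[], sampleSize: number): LogEntry[] {
  const flagged = logs.filter((l) => l.thumbs === "down");
  const random = sampleLogs(logs, sampleSize);
  // Deduplicate by query text before handing the set off for annotation.
  const byQuery = new Map<string, LogEntry>();
  for (const entry of [...flagged, ...random]) byQuery.set(entry.query, entry);
  return [...byQuery.values()];
}
```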
Structuring these evaluations involves a key principle: separating constants from variables. Static user queries and their expected outputs reside in the "data" portion of the eval. The "task" section, however, is where developers introduce variables like different system prompts or retrieval-augmented generation (RAG) techniques, allowing them to experiment and iterate without constantly re-annotating data. This modularity ensures that changes to the underlying AI logic can be quickly tested against a consistent set of real-world scenarios.
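To illustrate that split, here is a minimal sketch (using Vercel's AI SDK and made-up fruit-counting examples, not v0's actual eval suite): the annotated queries and expected outputs sit apart from the system prompt being iterated on, so swapping the prompt requires no re-annotation.

```typescript
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Constant part of the eval: real user queries and their expected outputs.
// These are annotated once and change only when the "court" itself changes.
const data = [
  { input: "How many r's are in strawberry?", expected: "3" },
  { input: "How many a's are in banana?", expected: "3" },
];

// Variable part of the eval: the piece under experimentation. Editing this
// prompt (or inserting a RAG step) leaves the data above untouched.
const SYSTEM_PROMPT =
  "Count the requested letter carefully, then answer with only the number.";

async function task(input: string): Promise<string> {
  const { text } = await generateText({
    model: openai("gpt-4o-mini"),
    system: SYSTEM_PROMPT,
    prompt: input,
  });
  return text;
}

// Pair the fixed data with the current task variant and report the outputs;
// scoring the results is handled separately.
async function runEval() {
  for (const { input, expected } of data) {
    const output = await task(input);
    console.log({ input, expected, output });
  }
}

runEval();
```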
Scoring these evals should lean towards deterministic pass/fail outcomes to simplify debugging and communication across teams. While some domains might necessitate human review for nuanced outputs, the goal is to codify what constitutes a "failed" interaction. For instance, a simple string match for expected answers can be facilitated by instructing the AI to output its final answer within specific tags, making automated scoring more straightforward.
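A minimal sketch of such a scorer follows; the `<answer>` tag convention and the `scoreAnswer` helper are illustrative choices, not details quoted from the talk.

```typescript
// Deterministic scorer: pull the final answer out of <answer> tags and
// compare it to the expected string. Any formatting failure counts as a fail.
function scoreAnswer(output: string, expected: string): "pass" | "fail" {
  const match = output.match(/<answer>\s*([\s\S]*?)\s*<\/answer>/i);
  if (!match) return "fail"; // the model ignored the required output format
  return match[1].trim() === expected.trim() ? "pass" : "fail";
}

// The prompt instructs the model to wrap its conclusion in <answer> tags.
console.log(scoreAnswer("Let me count... <answer>3</answer>", "3")); // "pass"
console.log(scoreAnswer("I think the answer is 3.", "3"));           // "fail"
```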
The ultimate goal is continuous, systematic improvement. Integrating evals into the continuous integration/continuous deployment (CI/CD) pipeline allows for automated testing of every code change. Tools like Braintrust can generate reports showing improvements and regressions, providing clear visual feedback on how changes impact performance across the "court." This empirical approach ensures that "improvement without measurement is limited and imprecise." By systematically evaluating AI applications against real-world user data, teams can significantly increase reliability and quality, leading to higher conversion and retention rates, and ultimately, less time spent on reactive support.

