The frontier of artificial intelligence demands evaluation metrics that transcend academic benchmarks, a critical pivot highlighted by Tejal Patwardhan and Henry Scott-Green of OpenAI. Their presentation at OpenAI DevDay 2025 laid out a framework for measuring real-world AI progress, emphasizing that robust evaluation is essential both for cutting-edge model development and for building practical applications. This shift reflects a recognition that traditional tests, while foundational, no longer adequately capture the nuanced capabilities AI needs to perform economically valuable tasks.
Tejal Patwardhan, a researcher on OpenAI's Reinforcement Learning team leading frontier evals, opened the discussion by underscoring the intrinsic value of evaluation. For OpenAI, where "training runs are very expensive in terms of compute and researcher time," evaluations provide the essential signal to "measure progress" and "steer model training towards good outcomes." Historically, academic benchmarks like the SAT or AIME, the high school math competition, served to push models' reasoning capabilities forward. But these quickly reached a ceiling; as Tejal noted, "these evals can only measure so much." Models could achieve near-perfect scores on such tests yet remain incapable of performing real-world work. This stark disconnect necessitated a new approach, moving beyond theoretical aptitude to practical utility.
