The frontier of artificial intelligence demands evaluation metrics that transcend academic benchmarks, a critical pivot highlighted by Tejal Patwardhan and Henry Scott-Green of OpenAI. Their presentation at OpenAI DevDay 2025 laid out a framework for measuring real-world AI progress, emphasizing that robust evaluation is essential both for cutting-edge model development and for building practical applications. This shift reflects a recognition that traditional tests, while foundational, no longer adequately capture the nuanced capabilities AI needs to perform economically valuable tasks.
Tejal Patwardhan, a researcher on OpenAI's Reinforcement Learning team leading frontier evals, opened the discussion by underscoring the intrinsic value of evaluation. For OpenAI, where "training runs are very expensive in terms of compute and researcher time," evaluations provide the essential signal to "measure progress" and "steer model training towards good outcomes." Historically, academic benchmarks like the SAT or AIME, the high school math competition, served to push models' reasoning capabilities forward. But these quickly reached a ceiling; as Tejal noted, "these evals can only measure so much." Models could achieve near-perfect scores on such tests yet remain incapable of performing real-world work. This stark disconnect necessitated a new approach, moving beyond theoretical aptitude to practical utility.
