Benchmarks Fail Modern AI, Says OpenAI Scientist

OpenAI's Noam Brown discusses why traditional benchmarks fail modern AI, emphasizing the need for new evaluation methods that account for computational budgets and model capabilities.

Noam Brown, Research Scientist at OpenAI, speaking during a podcast interview. (NoPriors)

Noam Brown, a research scientist at OpenAI, argued in a recent discussion that traditional benchmarks no longer adequately assess the capabilities of modern artificial intelligence models. Progress in areas such as reasoning and multimodal understanding has outpaced the methods used to evaluate these systems.


The Shortcomings of Traditional Benchmarks

Brown explained that many existing benchmarks are not designed to capture the full spectrum of what today's AI models can achieve. These benchmarks often rely on static datasets and predefined metrics that do not reflect the dynamic and context-dependent nature of advanced AI applications. The core issue, as highlighted by Brown, is that the true capability of a model is often a function of the resources, such as computational budget or time, allocated to its testing.

Budget and Time Constraints in AI Evaluation

A key point was the link between the resources invested in testing and the performance observed. Brown illustrated this by suggesting that a model like GPT-3 given a $10 million testing budget would likely score significantly higher on benchmarks than the same model evaluated on a $10 budget. Benchmark results can therefore mislead: they may reflect the testing budget rather than the inherent capabilities of the model.
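To make the budget effect concrete, here is a minimal Python sketch, not Brown's method. It assumes a toy setup: each attempt at a task costs a fixed amount, attempts succeed independently with the same probability, and a perfect checker can recognize a correct answer. The names `BudgetPoint` and `solve_rate_at_budget`, and all numbers, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BudgetPoint:
    budget_cents: int
    attempts: int
    solve_rate: float


def solve_rate_at_budget(
    p_single: float, budget_cents: int, cost_per_attempt_cents: int
) -> BudgetPoint:
    """Toy model: chance that at least one affordable attempt succeeds.

    Assumes independent attempts and a perfect answer checker, which
    real benchmarks rarely provide.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if cost_per_attempt_cents <= 0:
        raise ValueError("cost_per_attempt_cents must be positive")
    # Integer cents avoid floating-point surprises when counting attempts.
    attempts = budget_cents // cost_per_attempt_cents
    solve_rate = 1.0 - (1.0 - p_single) ** attempts
    return BudgetPoint(budget_cents, attempts, solve_rate)


if __name__ == "__main__":
    # Hypothetical model: 5% success per attempt, 50 cents per attempt.
    for budget in (50, 500, 5000):
        point = solve_rate_at_budget(0.05, budget, 50)
        print(f"{point.budget_cents:>5} cents -> {point.attempts:>3} attempts, "
              f"solve rate {point.solve_rate:.2f}")
```

Under these assumptions the same model reports a very different score at each budget, which is why a single benchmark number says little unless the budget behind it is stated.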

Computational time matters as well. A model that is allowed to run for longer can perform better, yet traditional benchmarks often fail to account for this, which leaves an incomplete picture of the model's potential. Brown emphasized that AI evaluation should consider the time and resources a model needs to reach its best performance.

The Need for New Evaluation Frameworks

Brown argued that the current evaluation policies and benchmarks are not sufficiently robust for modern AI. He pointed out that the rapid pace of AI development means that models are constantly evolving, often surpassing the limitations of existing evaluation metrics. This necessitates a shift towards more dynamic and comprehensive evaluation frameworks that can accurately gauge AI capabilities across a wide range of tasks and conditions.

He stressed the importance of asking the right questions when evaluating models, such as: "What's the capability of the model?" and "At what budget should you evaluate these models?" The answer to these questions, he suggested, is not straightforward and requires a deeper understanding of how models perform under various constraints.
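One way to see why the budget question has no single answer is to compare models at matched budgets. The Python sketch below is purely illustrative: the model names, budgets, and scores are invented, and the helpers `score_at_budget` and `rank_at_budget` are hypothetical, not part of any real evaluation suite.

```python
from typing import Dict, List, Tuple

# Each curve is a list of (budget in USD, benchmark score) measurements.
Curve = List[Tuple[int, float]]


def score_at_budget(curve: Curve, budget_usd: int) -> float:
    """Best score among the measurements that fit within the budget."""
    affordable = [score for cost, score in curve if cost <= budget_usd]
    return max(affordable, default=0.0)


def rank_at_budget(curves: Dict[str, Curve], budget_usd: int) -> List[str]:
    """Model names ordered from best to worst at the given budget."""
    return sorted(
        curves,
        key=lambda name: score_at_budget(curves[name], budget_usd),
        reverse=True,
    )


# Invented numbers: model_a is better when cheap, model_b scales further.
curves = {
    "model_a": [(1, 0.40), (100, 0.45)],
    "model_b": [(1, 0.30), (100, 0.70)],
}

if __name__ == "__main__":
    for budget in (1, 100):
        print(budget, rank_at_budget(curves, budget))
```

With these made-up curves the ranking flips between the small and the large budget, so a leaderboard that omits the evaluation budget could support either conclusion.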

The Role of Self-Improvement and Competition

The discussion also touched upon the concept of self-improvement in AI and the competitive landscape. Brown noted that models are becoming increasingly adept at learning and adapting, which in turn drives the need for more sophisticated evaluation methods. The drive for better performance also fuels competition, pushing researchers to find novel ways to benchmark and compare AI systems.

In essence, Brown's insights highlight a critical challenge facing the AI community: the need to move beyond traditional, static benchmarks to develop more nuanced and realistic evaluation methods that can keep pace with the rapid progress of AI technology. This includes considering factors like computational resources, time, and the potential for self-improvement when assessing model capabilities.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.