Greg Kamradt, President of the ARC Prize Foundation, argues, "If you measure your AI in terms of a narrow domain, maybe a verticalized benchmark, well, then you're going to be making progress in that vertical domain." Speaking with attendees at OpenAI DevDay about measuring frontier AI with interactive evaluations, he contended that static benchmarks, while useful, are insufficient for evaluating AI agents' ability to explore, plan, and reliably execute across diverse, long-horizon tasks.
One of the primary insights Kamradt shared is the importance of defining intelligence before attempting to measure it. Referencing François Chollet's 2019 paper "On the Measure of Intelligence," he defined intelligence as "skill-acquisition efficiency," or, put simply, "What is your ability to learn new things?" This definition shifts the focus from narrow, task-specific performance to a model's capacity for generalization.
"There is no doubt that AI has made incredible progress recently, but the question that I'm asking myself is not 'is AI making progress,' it's 'what is AI making progress towards,'" Kamradt said.
This focus on generalization necessitates a shift towards interactive benchmarks. Static benchmarks, where a model is presented with a fixed dataset and asked to perform a specific task, can reward overfitting rather than genuine understanding. Interactive environments, on the other hand, force AI agents to adapt to novel situations, make decisions based on incomplete information, and learn from their mistakes. This approach more closely mirrors the challenges of the real world and provides a more accurate assessment of an AI's true intelligence.
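The contrast can be made concrete with a small sketch. The Python below uses hypothetical `model`, `agent`, and `environment` objects (they are not part of any ARC Prize tooling mentioned in the talk) and only illustrates the shape of the two evaluation styles, not any specific benchmark's implementation.

```python
# Sketch only: hypothetical interfaces, not ARC Prize code.

def evaluate_static(model, dataset):
    """Static benchmark: fixed inputs, one-shot answers, accuracy as the score."""
    correct = 0
    for example in dataset:
        prediction = model.answer(example.prompt)
        correct += int(prediction == example.label)
    return correct / len(dataset)

def evaluate_interactive(agent, environment, max_steps=500):
    """Interactive benchmark: the agent observes, acts, and adapts to feedback
    until the task is solved or the step budget runs out."""
    observation = environment.reset()
    for step in range(1, max_steps + 1):
        action = agent.act(observation)            # decide under incomplete information
        observation, reward, done = environment.step(action)
        agent.learn(observation, reward)           # adapt based on the outcome
        if done:
            return {"solved": True, "steps": step}
    return {"solved": False, "steps": max_steps}
```

The key structural difference is the feedback loop: in the interactive case the agent's next input depends on its own previous actions, so memorized answers cannot substitute for exploration and planning.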
Kamradt highlighted the ARC Prize Foundation's work in creating interactive benchmarks. One example is ARC-AGI-3, a series of 150 novel video game environments designed to test AI agents' ability to generalize. Kamradt described the foundation as a nonprofit whose mission is to act as "a North Star towards open progress of AGI," emphasizing its commitment to advancing general artificial intelligence.
To illustrate the difference between human and AI performance in interactive environments, Kamradt showcased a demo of GPT-5 playing a custom-built game. He emphasized that while humans could quickly grasp the game's mechanics and devise a winning strategy, the AI struggled to make meaningful progress.
The key takeaway is that intelligence is inherently interactive. It's not enough for AI to excel at static tasks; it must also be able to learn, adapt, and solve problems in dynamic, unpredictable environments. "You need your specific benchmark to measure generalization and target it," noted Kamradt.
This approach also allows for a more nuanced understanding of AI performance. Instead of simply measuring accuracy, interactive benchmarks allow for the assessment of "action efficiency" – how efficiently an AI agent can convert information from the environment into the value that it's looking for. This metric provides a more complete picture of an AI's overall intelligence and its potential for real-world applications.
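Kamradt did not give a formula for action efficiency, but one plausible reading of the idea is value gained per action taken, compared against a human baseline. The sketch below, with a hypothetical `action_efficiency` helper, is only an illustration of that reading.

```python
# Illustrative only: one plausible way to report "action efficiency";
# the exact metric used by ARC Prize is not specified in the talk.

def action_efficiency(value_achieved: float, actions_taken: int) -> float:
    """Value gained per action: higher means the agent needed fewer
    interactions with the environment to reach the same outcome."""
    if actions_taken == 0:
        return 0.0
    return value_achieved / actions_taken

# Example: an agent that scores 80 points in 400 actions is half as
# action-efficient as a human who scores 80 points in 200 actions.
agent_eff = action_efficiency(80, 400)   # 0.2
human_eff = action_efficiency(80, 200)   # 0.4
```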