Jeremy Berman, a research scientist at Reflection AI and the recent winner of the ARC-AGI v2 public leaderboard, articulated a profound shift in his approach to artificial intelligence during a recent interview. He argued that AI systems must synthesize new knowledge rather than merely compress existing data, and explained how his winning method for the ARC-AGI v2 challenge, which achieved a 29.4% top score, pivoted from generating Python code to evolving natural language descriptions. This marked a significant departure from his ARC-AGI v1 approach and highlighted a core insight: natural language offers a more expressive and adaptable programming paradigm for achieving genuine reasoning.
Berman's journey into AI research began only eight months ago, catalyzed by Jeff Hawkins's "A Thousand Brains" and by the compelling nature of the ARC-AGI challenge, which he views as an elegant way to expose the limitations of current language models. He describes the challenge as an IQ test for machines: solving it requires abstract pattern recognition to transform input grids into output grids according to underlying rules. The stark performance gap between humans (75% accuracy on ARC-AGI v1) and even advanced LLMs like GPT-4 (5%) underscored the problem. Berman's initial breakthrough on ARC-AGI v1, which achieved 53.6% accuracy, used an "Evolutionary Test-time Compute" approach: an LLM generated multiple candidate Python functions, which were then iteratively refined based on their performance against the example grids. Because Python is deterministic, each candidate's correctness could be verified directly.
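The evolutionary loop described above can be sketched in miniature. This is not Berman's implementation: the candidate pool, the toy grid operations, and the `evolve` function are all hypothetical stand-ins, with a fixed menu of transforms replacing the LLM that actually proposed Python functions. The verification step, however, mirrors the idea in the text: each candidate is scored by running it against the example grids, the best candidates are kept, and the rest of the pool is resampled.

```python
import random

# Hypothetical stand-in for LLM-generated candidates: a fixed menu of
# simple grid transforms. In the actual pipeline, candidates were Python
# functions written by an LLM.
OPS = {
    "identity": lambda g: g,
    "transpose": lambda g: [list(r) for r in zip(*g)],
    "flip_h": lambda g: [row[::-1] for row in g],   # mirror each row
    "flip_v": lambda g: g[::-1],                    # mirror row order
}

def score(op_name, examples):
    """Fraction of example (input, output) pairs the candidate solves exactly."""
    fn = OPS[op_name]
    return sum(fn(inp) == out for inp, out in examples) / len(examples)

def evolve(examples, generations=5, pool_size=4, seed=0):
    """Keep the best-scoring candidates each round; resample the rest."""
    rng = random.Random(seed)
    pool = rng.sample(list(OPS), k=pool_size)
    for _ in range(generations):
        ranked = sorted(pool, key=lambda c: score(c, examples), reverse=True)
        if score(ranked[0], examples) == 1.0:
            return ranked[0]  # verified against every example grid
        # "Refine": keep the top half, draw fresh candidates for the rest
        keep = pool_size // 2
        pool = ranked[:keep] + rng.choices(list(OPS), k=pool_size - keep)
    return ranked[0]

# One example pair whose underlying rule is "mirror horizontally"
examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
print(evolve(examples))  # → flip_h
```

The deterministic scoring step is what Python buys here: a candidate either reproduces the example outputs or it does not, so "fitness" needs no human judgment.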
