Jeremy Berman, a research scientist at Reflection AI and the recent winner of the ARC-AGI v2 public leaderboard, articulated a profound shift in his approach to artificial intelligence during a recent interview. He argued that AI systems must synthesize new knowledge rather than merely compress existing data, and explained how his winning method for the ARC-AGI v2 challenge, which achieved a 29.4% top score, pivoted from generating Python code to evolving natural language descriptions. This marked a significant departure from his ARC-AGI v1 approach and highlighted a core insight: natural language offers a more expressive and adaptable programming paradigm for achieving genuine reasoning.
Berman's journey into AI research began only eight months ago, catalyzed by Jeff Hawkins's "A Thousand Brains" and by the compelling nature of the ARC-AGI challenge, which he views as an elegant way to expose the limitations of current language models. He describes the challenge as an IQ test for machines: solving it requires abstract pattern recognition to transform input grids into output grids according to underlying rules. The stark performance gap between humans (75% accuracy on ARC-AGI v1) and even advanced LLMs like GPT-4 (5%) underscored the problem. Berman's initial breakthrough on ARC-AGI v1, which achieved 53.6% accuracy, used an "Evolutionary Test-time Compute" approach: an LLM generated multiple candidate Python functions, which were then iteratively refined based on their performance against the example grids. Because Python is deterministic, each candidate's correctness could be verified directly.
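The evolutionary loop described above can be sketched in miniature. This is not Berman's implementation: the candidate pool, the toy grid operations, and the `evolve` function are all hypothetical stand-ins, with a fixed menu of transforms replacing the LLM that actually proposed Python functions. The verification step, however, mirrors the idea in the text: each candidate is scored by running it against the example grids, the best candidates are kept, and the rest of the pool is resampled.

```python
import random

# Hypothetical stand-in for LLM-generated candidates: a fixed menu of
# simple grid transforms. In the actual pipeline, candidates were Python
# functions written by an LLM.
OPS = {
    "identity": lambda g: g,
    "transpose": lambda g: [list(r) for r in zip(*g)],
    "flip_h": lambda g: [row[::-1] for row in g],   # mirror each row
    "flip_v": lambda g: g[::-1],                    # mirror row order
}

def score(op_name, examples):
    """Fraction of example (input, output) pairs the candidate solves exactly."""
    fn = OPS[op_name]
    return sum(fn(inp) == out for inp, out in examples) / len(examples)

def evolve(examples, generations=5, pool_size=4, seed=0):
    """Keep the best-scoring candidates each round; resample the rest."""
    rng = random.Random(seed)
    pool = rng.sample(list(OPS), k=pool_size)
    for _ in range(generations):
        ranked = sorted(pool, key=lambda c: score(c, examples), reverse=True)
        if score(ranked[0], examples) == 1.0:
            return ranked[0]  # verified against every example grid
        # "Refine": keep the top half, draw fresh candidates for the rest
        keep = pool_size // 2
        pool = ranked[:keep] + rng.choices(list(OPS), k=pool_size - keep)
    return ranked[0]

# One example pair whose underlying rule is "mirror horizontally"
examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
print(evolve(examples))  # → flip_h
```

The deterministic scoring step is what Python buys here: a candidate either reproduces the example outputs or it does not, so "fitness" needs no human judgment.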
