A hallmark of general intelligence is the ability to discover causal regularities and then apply them, yet this ability is hard to evaluate. Current AI systems find it exceptionally difficult to bridge the complexity gap between scientific discovery and real-world engineering.
The SciCrafter Benchmark: Operationalizing Discovery-to-Application
To address this, researchers introduced SciCrafter, a Minecraft-based benchmark that operationalizes the discovery-to-application loop through parameterized redstone circuit tasks. Agents must light lamps in specified patterns, and scaling parameters deliberately increase both task complexity and the knowledge required. The goal is to push AI capabilities beyond current limitations by forcing genuine discovery rather than recall of memorized solutions.
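To make "parameterized task" concrete, here is a minimal Python sketch of what such a task specification and its success check could look like. It is not SciCrafter's actual interface: the `LampTask` class, its `num_lamps` and `period` parameters, the single-lit-lamp pattern, and the `is_solved` helper are all hypothetical names and choices made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LampTask:
    """Hypothetical task spec (not the benchmark's real API)."""

    num_lamps: int  # assumed scaling parameter: more lamps, larger circuit
    period: int     # assumed scaling parameter: ticks before the pattern repeats

    def target(self, tick: int) -> tuple[bool, ...]:
        # Illustrative pattern: exactly one lamp is lit per tick, and the
        # lit index repeats every `period` ticks.
        lit = (tick % self.period) % self.num_lamps
        return tuple(i == lit for i in range(self.num_lamps))


def is_solved(task: LampTask, observed: Sequence[Sequence[bool]]) -> bool:
    """True if at least one full period was observed and every tick matches."""
    if len(observed) < task.period:
        return False
    return all(tuple(state) == task.target(t) for t, state in enumerate(observed))
```

Raising `num_lamps` or `period` in this sketch makes the target harder to hit by chance, which is the sense in which scaling parameters can rule out memorized answers.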
Frontier Models Hit a Plateau, Revealing New Bottlenecks
Evaluating leading models, including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5, under a general-purpose code agent scaffold revealed a stark plateau: all models achieved approximately 26% success. To locate the cause, the analysis decomposed the loop into four capacities (knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application) and used targeted interventions to pinpoint the primary issues. While general knowledge application remains a significant gap, frontier models are increasingly bottlenecked by knowledge gap identification. This indicates a crucial shift: the challenge is moving from AI's ability to solve problems correctly to its ability to formulate the correct problems.
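One plausible way to read the intervention analysis is as a per-capacity comparison: assist the agent on one capacity at a time, rerun the tasks, and see how much the success rate moves relative to the unassisted baseline. The Python sketch below illustrates that bookkeeping under this assumption. The function names, the input layout (lists of per-task pass/fail outcomes), and the rule "largest gain marks the bottleneck" are illustrative choices, not the study's documented procedure.

```python
from typing import Mapping, Sequence

# The four capacities named in the analysis.
CAPACITIES = (
    "knowledge_gap_identification",
    "experimental_discovery",
    "knowledge_consolidation",
    "knowledge_application",
)


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks solved; `outcomes` must be non-empty."""
    return sum(outcomes) / len(outcomes)


def intervention_gains(
    baseline: Sequence[bool],
    intervened: Mapping[str, Sequence[bool]],
) -> dict[str, float]:
    """Success-rate change when each capacity is assisted in isolation."""
    base = success_rate(baseline)
    return {cap: success_rate(intervened[cap]) - base for cap in CAPACITIES}


def primary_bottleneck(gains: Mapping[str, float]) -> str:
    """Assumed heuristic: the capacity whose assistance helps most."""
    return max(gains, key=lambda cap: gains[cap])
```

Calling `intervention_gains` with baseline outcomes and one outcome list per assisted capacity returns the per-capacity deltas, and `primary_bottleneck` picks the largest. A real analysis would also need confidence intervals before declaring one capacity the bottleneck.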