The pursuit of artificial general intelligence has historically been fractured between proponents of symbolic, rules-based systems and advocates of generalized neural architectures. Yi Tay, Research Scientist at Google DeepMind and leader of the Reasoning and AGI team in Singapore, recently detailed the organization's decisive pivot away from the symbolic camp, a shift that underpinned the landmark IMO Gold achievement and redefined the company's approach to scaling intelligence. This move represents a high-stakes, non-consensus bet on end-to-end large language models (LLMs) that prioritize self-correction and experiential learning over brittle, specialized systems.
Tay spoke on the Latent Space podcast about the eighteen months leading up to the release of Gemini Deep Think and the IMO Gold accomplishment, a period characterized by rapid organizational and philosophical consolidation. Google's unification of Brain and DeepMind positioned the merged entity to execute a singular vision: scaling reasoning capabilities through reinforcement learning (RL). The critical decision was abandoning efforts like AlphaProof, the specialized system previously used for formal mathematical theorem proving, in favor of training a single, massive foundation model. The choice turned on a fundamental question about whether narrow AI could ever scale to general intelligence: "If one model can't do it, can we get to AGI?" The answer, implicitly, was no. The future demanded a unified, versatile architecture capable of generalization.
