According to Columbia CS Professor Vishal Misra, the limitations of Large Language Models in achieving true Artificial General Intelligence (AGI) hinge on their inability to generate genuinely novel scientific discoveries. "Any LLM that was trained on pre-1915 physics would never have come up with the theory of relativity. Einstein had to sort of reject the Newtonian physics and come up with this space time continuum. He completely rewrote the rules."
Misra spoke with Martin Casado and Erik Torenberg at a16z about the inherent constraints of LLMs, the reasons behind the effectiveness of chain-of-thought reasoning, and a vision for what genuine AGI would entail.
Misra argues that LLMs, at their core, are sophisticated pattern-matching machines that excel at predicting the next token in a sequence based on vast amounts of training data. This process, while impressive, fundamentally differs from human reasoning, which involves creativity, intuition, and the ability to challenge existing paradigms. "AGI will be when we are able to create new science, new results, new math. When an AGI comes up with a theory of relativity, it has to go beyond what it has been trained on to come up with new paradigms, new science. That's my definition of AGI."
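To make the pattern-matching claim concrete, here is a minimal sketch of next-token prediction using a toy bigram model. Real LLMs replace the frequency table with a neural network over billions of parameters, but the core objective, predicting the next token from statistics of the training data, is the same:

```python
import random
from collections import Counter, defaultdict

# Toy bigram "language model": next-token prediction as pattern matching.
# It can only ever emit tokens and transitions seen in its training data.
corpus = "the apple falls to the ground because gravity pulls the apple".split()

transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Sample the next token in proportion to training-set frequency."""
    counts = transitions[token]
    tokens, weights = zip(*counts.items())
    return random.choices(tokens, weights=weights)[0]

print(predict_next("the"))  # always a word that followed "the" in training
```

By construction, this model can only produce continuations already present in its corpus, which is precisely the constraint Misra argues scales up, in a more sophisticated form, to LLMs.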
A key insight from the discussion is that LLMs operate within a "Bayesian manifold," essentially navigating a pre-defined space of possibilities shaped by their training data. While they can refine their predictions and improve their accuracy within this space, they lack the capacity to transcend it and create entirely new concepts or theories.
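One hedged way to formalize the intuition (illustrative notation, not Misra's exact formulation) is to treat the model's next-token distribution as a posterior predictive over latent hypotheses \(h\) implicit in the training corpus \(\mathcal{D}\):

```latex
p_\theta(x_{t+1} \mid x_{1:t}) \;\approx\; \int p(x_{t+1} \mid x_{1:t}, h)\, p(h \mid \mathcal{D})\, dh
```

On this reading, prompting and in-context learning shift probability mass among hypotheses already supported by \(p(h \mid \mathcal{D})\); a hypothesis with no support in the training data, such as general relativity given only pre-1915 physics, is unreachable no matter how the context is shaped.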
Chain-of-thought reasoning, a technique that prompts LLMs to break complex problems into a series of smaller, more manageable steps, works because it reduces the entropy, or uncertainty, at each stage of the process. In Misra's framing, chain-of-thought reasoning and entropy reduction are inherently linked: by guiding the LLM through a structured sequence of steps, the prompt engineer narrows the range of possible outputs, yielding more coherent and accurate results. The technique does not, however, enable the LLM to generate truly novel insights.
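A small illustration of the entropy-reduction idea, using made-up probabilities rather than numbers measured from any real model:

```python
import math

def entropy(dist: dict[str, float]) -> float:
    """Shannon entropy in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical next-token distributions over candidate answers to the same
# question, with and without an intermediate reasoning step in the context.
direct_answer = {"9": 0.4, "11": 0.3, "13": 0.2, "7": 0.1}     # high uncertainty
after_step    = {"9": 0.9, "11": 0.05, "13": 0.03, "7": 0.02}  # step narrows options

print(f"H(answer | question)       = {entropy(direct_answer):.2f} bits")  # ~1.85
print(f"H(answer | question, step) = {entropy(after_step):.2f} bits")     # ~0.61
```

Each intermediate step conditions the next prediction on more structure, so the distribution over continuations sharpens; but the sharpened distribution is still drawn from the same learned manifold.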
The conversation also touched upon the critical distinction between modeling and prompt engineering. While prompt engineering can be a powerful tool for eliciting desired responses from LLMs, it does not fundamentally alter the underlying model or its capabilities.
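A toy contrast between the two (a hypothetical stand-in for a real model; the only point is that prompting changes the input while the parameters stay fixed):

```python
# Stand-in "parameters": a fixed next-word preference learned at training time.
WEIGHTS = {"Paris": 0.9, "Lyon": 0.1}

def generate(prompt: str) -> str:
    # Prompt engineering reshapes the *input*, steering which completion the
    # fixed distribution produces; WEIGHTS are never modified.
    best = max(WEIGHTS, key=WEIGHTS.get)
    return best if "one word" in prompt else f"The capital of France is {best}."

print(generate("What is the capital of France?"))
print(generate("What is the capital of France? Answer in one word."))
assert WEIGHTS == {"Paris": 0.9, "Lyon": 0.1}  # the model itself is unchanged

# Modeling, by contrast, would update WEIGHTS themselves (e.g., fine-tuning),
# changing what the model can do for every future prompt.
```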
Misra contends that LLMs are fundamentally incapable of recursive self-improvement, a key characteristic often associated with AGI. Because they are trained on existing data, they cannot generate knowledge that lies outside the scope of that data. True AGI, on the other hand, would possess the ability to challenge existing assumptions, formulate new hypotheses, and conduct experiments to validate them, leading to a continuous cycle of self-discovery and improvement.
The ultimate test for AGI, according to Misra, is whether it can make novel scientific discoveries that go beyond what it has been trained on, doing for some future paradigm what Einstein did when he "completely rewrote the rules" of Newtonian physics.

