In an era where large language models dominate headlines, a quietly audacious bet on "world models" is emerging as the next frontier in artificial intelligence. This vision, championed by Pim de Witte, CEO of General Intuition (GI), and backed by Khosla Ventures' largest seed investment since OpenAI, posits that spatial-temporal foundation models, trained on a unique trove of human gameplay data, will redefine how AI interacts with the physical and simulated worlds.
Pim de Witte recently spoke with Swyx, Editor of Latent Space, at General Intuition's offices, delving into the foundational technology, the strategic advantage of their data, and the expansive future applications of world models. The discussion illuminated why this approach, distinct yet complementary to LLMs, represents a profound shift in AI's capabilities, moving beyond mere content generation to active, intuitive understanding.
At its core, General Intuition is building agents that learn to perceive and act within environments, mimicking human intuition. Unlike traditional video models that merely predict the next likely frame, world models are tasked with a far more complex challenge. "What world models do is they actually have to understand the full range of possibilities and outcomes... and based on the action that you take, generate the next state," de Witte explained. This action-conditioned generation is crucial, enabling AI to not just observe but also to interact and anticipate consequences within dynamic environments.
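To make that distinction concrete, the toy sketch below contrasts the two interfaces. It is an illustration of the concept, not GI's architecture: a video model maps past frames to one predicted next frame, whereas a world model's next state is conditioned on the agent's action, so the same state can branch into different futures.

```python
from dataclasses import dataclass

# A plain video model predicts: frames -> next frame.
# A world model is action-conditioned: step(state, action) -> next state.
# The names below (State, ToyWorldModel) are illustrative, not GI's API.

@dataclass
class State:
    x: int  # agent position in a toy 1D world

class ToyWorldModel:
    """Minimal action-conditioned dynamics: the same state yields
    different futures depending on the action taken."""
    def step(self, state: State, action: str) -> State:
        if action == "left":
            return State(state.x - 1)
        if action == "right":
            return State(state.x + 1)
        return State(state.x)  # "stay"

model = ToyWorldModel()
s = State(x=0)
print(model.step(s, "left"))   # State(x=-1)
print(model.step(s, "right"))  # State(x=1): same state, different outcome
```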
The bedrock of GI's innovation is the colossal dataset amassed from Medal, de Witte's previous venture. Medal, a game clipping platform with 12 million users, has accumulated an astounding "3.8 billion clips of the best moments and actions in games, resulting in one of the most unique and diverse datasets of peak human behavior." This treasure trove of "episodic memory for simulation" provides an unparalleled resource for training AI. Crucially, this data is privacy-preserving, mapping actions to visual inputs and game outcomes without revealing personal user data, a prescient design choice that became a goldmine for world model development.
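As a rough illustration of what such privacy-preserving episodes might look like, here is a hypothetical record schema; the field names are assumptions for this sketch, not Medal's actual format. The essential property is that pixels, player inputs, and game outcomes are linked, while nothing identifying the user is stored.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for one gameplay clip as a training episode.
# Field names are illustrative assumptions, not Medal's real format.

@dataclass
class Step:
    frame: bytes        # encoded video frame (pixels only)
    action: List[str]   # inputs held at this frame, e.g. ["W", "mouse_left"]

@dataclass
class Episode:
    steps: List[Step] = field(default_factory=list)
    outcome: str = ""   # game result, e.g. "round_won" -- not user identity

ep = Episode(
    steps=[Step(frame=b"...", action=["W", "mouse_left"])],
    outcome="round_won",
)
```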
GI's agents are purely vision-based, operating on a "frames in, actions out" paradigm. De Witte demonstrated models that predict actions from raw pixels alone, with no reinforcement learning (RL), no fine-tuning, and "no game state." These agents exhibit "incredibly human-like" behaviors, even making the same mistakes as people or pausing to check the scoreboard the way a gamer would, a testament to the fidelity of their imitation learning.
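A minimal sketch of such a pixels-only policy follows. The architecture and sizes are assumptions for illustration, since GI has not published theirs: a small convolutional encoder consumes a stack of recent frames and emits logits over discrete controller actions.

```python
import torch
import torch.nn as nn

# "Frames in, actions out": no game state, no internal API, pixels only.
# Architecture and sizes are illustrative assumptions, not GI's model.

class PixelPolicy(nn.Module):
    def __init__(self, n_frames: int = 4, n_actions: int = 18):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * n_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)  # infers the flattened size

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3 * n_frames, H, W) -> logits over actions
        return self.head(self.encoder(frames))

policy = PixelPolicy()
frames = torch.randn(1, 12, 84, 84)     # 4 stacked 84x84 RGB frames
action = policy(frames).argmax(dim=-1)  # pick the most likely action
print(action.shape)                     # torch.Size([1])
```

In imitation learning, a policy like this would be trained with cross-entropy against the human's recorded input at each frame, which is what makes the resulting behavior so human-like.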
A significant insight from de Witte is the distinction between world models and simple video generation. World models demand an understanding of actions, memory, and partial observability: smoke, occlusion, and camera shake all hide information the model must reason around. This enables them to navigate, hide, and peek corners with human-like spatial reasoning, capabilities essential for real-world application. The ability to distill these giant policies into tiny, real-time models (sketched below) further enhances their practical utility.
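In its standard form, compressing a large policy into a small real-time one is knowledge distillation; GI's actual recipe is not public, so the sketch below shows only the generic technique: a frozen teacher's action distribution supervises a lightweight student through a temperature-softened KL loss.

```python
import torch
import torch.nn.functional as F

# Generic policy distillation step (a standard technique; GI's recipe
# is not public). A small "student" learns to match the action
# distribution of a large frozen "teacher" on the same frames.

def distill_step(teacher, student, frames, optimizer, temperature=2.0):
    with torch.no_grad():
        teacher_logits = teacher(frames)
    student_logits = student(frames)
    # KL divergence between softened teacher and student distributions
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny stand-ins so the step runs end to end:
teacher = torch.nn.Linear(16, 4)
student = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(student.parameters(), lr=1e-2)
print(distill_step(teacher, student, torch.randn(8, 16), opt))
```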
The ambition extends far beyond gaming. General Intuition is demonstrating transferability from arcade-style games to more realistic games, and crucially, to real-world video. This means their models can process and predict actions within any internet video, laying the groundwork for applications in robotics. De Witte envisions spatial-temporal foundation models powering the majority of "atoms-to-atoms interactions" in both simulation and the physical world by 2030, suggesting a future where intelligent agents seamlessly navigate and manipulate our environments. These models are not seen as rivals to large language models but as complementary forces, with LLMs handling symbolic reasoning and world models excelling at embodied, spatial intelligence.
De Witte’s personal journey, from running RuneScape private servers and reverse engineering to leading a frontier AI lab, underscores the unconventional path to innovation. His self-taught approach to deep learning fundamentals, actively seeking out and mastering core concepts, reflects the dedication required to tackle such complex problems. The decision to remain independent, turning down a reported $500 million offer from OpenAI, was driven by the recognition of their unique data moat and the belief in their ability to lead this research independently, a conviction validated by Khosla Ventures’ substantial investment.

