The future of artificial intelligence isn't just about language; it's about understanding and interacting with the physical world. This was the central theme when Pim de Witte, CEO of General Intuition, spoke with Swyx, the editor of Latent Space, in a recent interview. Their discussion delved into Khosla Ventures' significant investment in General Intuition, highlighting the startup's innovative approach to AI development through "world models" trained on vast datasets of human gameplay.
De Witte explained that while traditional video models predict the next likely or most entertaining frame, "what world models do is they actually have to understand the full range of possibilities and outcomes from the current state, and based on the action that you take, generates the next state." This distinction is crucial: it moves beyond passive prediction to an interactive understanding of cause and effect within a simulated environment. It is also a far harder problem than next-frame prediction, requiring the model to grasp underlying physics and spatial-temporal reasoning.
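To make that contrast concrete, here is a toy sketch, not General Intuition's code: a 1-D "world" where an object moves along a line. The function names and the trivial physics are illustrative assumptions; the point is only the interface difference, where a video model maps past frames to one predicted next frame, while a world model maps a state and an action to the next state.

```python
# Toy contrast (illustrative only): a video model extrapolates the
# observed trajectory alone; a world model is conditioned on the action.

def video_model_next(frames: list[int]) -> int:
    """Passive prediction: the most likely next frame follows from the
    past sequence alone -- an agent's action never enters."""
    velocity = frames[-1] - frames[-2]  # assume constant motion
    return frames[-1] + velocity

def world_model_step(state: int, action: int) -> int:
    """Interactive prediction: the next state is a function of the
    current state AND the action taken (-1, 0, or +1 here)."""
    return state + action

frames = [0, 1, 2, 3]
print(video_model_next(frames))        # 4: one answer, whatever the agent intends
print(world_model_step(3, action=-1))  # 2: same state, action changes the outcome
print(world_model_step(3, action=+1))  # 4: the full range of outcomes is reachable
```

The video model can only answer "what probably comes next"; the world model can answer "what comes next if I do X," which is exactly the counterfactual structure an agent needs.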
A core insight from the interview is General Intuition's unique dataset: 3.8 billion action-labeled highlight clips from the game-clip platform Medal. This trove, gathered from the start with privacy-first action labels, proved to be a goldmine for training world models. Unlike raw video, these clips pair explicit human actions with their immediate consequences, providing a rich "episodic memory for simulation." This lets agents learn not just what happens, but why it happens in response to specific inputs, laying the groundwork for genuine understanding rather than superficial imitation.
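As a rough illustration of why the action labels matter, here is a hypothetical record schema; Medal's actual data format isn't described in the interview, so every field name and shape below is an assumption.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ActionLabeledClip:
    """Hypothetical record for one highlight clip; field names and
    shapes are assumptions, not Medal's actual schema."""
    frames: np.ndarray   # (T, H, W, C) video frames
    actions: np.ndarray  # (T,) input at each frame (e.g., key/button id)

def to_transitions(clip: ActionLabeledClip):
    """Unroll a clip into (state, action, next_state) triples -- the
    cause-and-effect supervision a world model trains on."""
    return [
        (clip.frames[t], clip.actions[t], clip.frames[t + 1])
        for t in range(len(clip.actions) - 1)
    ]
```

Each (state, action, next_state) triple is exactly the supervision an action-conditioned model needs, and it is precisely what raw, unlabeled video cannot provide.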
The transferability of these world models is another key insight. De Witte outlined a clear path: from arcade-style games, where the physics are simpler, to more realistic games, and eventually to real-world video and robotics. The "frames in, actions out" recipe remains consistent, allowing the learned spatial-temporal reasoning to generalize across increasingly complex environments. This modular approach is designed to replace brittle, hand-coded behavior trees in existing systems, offering a more robust and adaptable solution for controlling agents and robots.
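The recipe itself is simple to state. Below is a minimal sketch of that loop, assuming a stand-in `ToyEnv` and `random_policy` (neither is a real API); in practice the policy would be a learned spatial-temporal model and the environment an arcade game, a realistic game, or a robot's sensor stream.

```python
import random

class ToyEnv:
    """Stand-in environment; any game or robot exposing reset/step fits."""
    def reset(self):
        return [0.0]              # initial "frames" (observation)
    def step(self, action):
        return [random.random()]  # next observation after acting

def random_policy(frames):
    """Placeholder for a learned spatial-temporal model."""
    return random.choice(["left", "right", "jump"])

env, policy = ToyEnv(), random_policy
frames = env.reset()
for _ in range(5):
    action = policy(frames)       # frames in, actions out
    frames = env.step(action)     # environment advances one tick
```

Because nothing in the loop names a particular game, swapping the environment leaves the agent code untouched; the generalization burden sits entirely inside the learned policy, which is why a single recipe can climb from arcade physics toward robotics.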
Related Reading
- AI's Ground Truth: Beyond Models to Infrastructure and Application
- Agent Memory: The New Frontier of AI Reliability
- Building Resilient AI Agents Through Abstraction
De Witte also emphasized that world models and large language models (LLMs) are complementary, not rivals. He envisions LLMs acting as orchestrators, handling high-level planning and communication, while world models provide the granular, real-time perception-action loop necessary for physical interaction. Text, in this context, becomes a form of compression, allowing for efficient communication of complex goals to the world model. This synergy promises a powerful combination for future AI systems, where symbolic reasoning meets embodied intelligence.
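A rough sketch of that division of labor, assuming hypothetical `llm_plan` and `WorldModelPolicy` interfaces (neither is a real General Intuition API): the LLM runs a slow, text-based planning loop, while the world-model policy runs the fast perception-action loop underneath it.

```python
def llm_plan(goal: str) -> list[str]:
    """Slow loop (hypothetical): an LLM compresses a complex intent
    into a few text subgoals -- text as compression."""
    return [f"navigate to {goal}", f"grasp {goal}"]

class WorldModelPolicy:
    """Fast loop (hypothetical): frames in, actions out, grounded in a
    world model's understanding of the current scene."""
    def act(self, subgoal: str, frames) -> str:
        return "move_forward"  # placeholder low-level action

policy = WorldModelPolicy()
for subgoal in llm_plan("the red door"):
    action = policy.act(subgoal, frames=None)  # perception-action loop
    print(f"{subgoal} -> {action}")
```

The two loops run at very different rates: a handful of tokens from the planner can steer thousands of per-frame decisions, which is what makes text such an efficient interface between the two systems.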
General Intuition's ambitious 2030 vision involves spatial-temporal foundation models powering 80% of future atoms-to-atoms interactions in both simulation and the real world. This bold prediction underscores the transformative potential de Witte sees in their approach. By focusing on models that deeply understand and interact with their environment, General Intuition aims to move beyond simple imitation learning to genuine reinforcement learning, driven by the ability to predict and react to negative events. Khosla's substantial seed investment signals strong belief in this foundational shift in AI development.

