"To me, AGI will not be complete without spatial intelligence. And I want to solve that problem." Dr. Fei-Fei Li, often hailed as the godmother of AI, articulated this audacious vision during a fireside chat at AI Startup School in San Francisco on June 16, 2025. Joined by Diana Hu, General Partner at Y Combinator, Li delved into her foundational work with ImageNet and its pivotal role in igniting the deep learning revolution, before charting the course for AI's demanding future.
Li recounted the early days of computer vision, a time when data was scarce and algorithms faltered. Her unwavering belief in data-driven methods, even when neural networks were out of favor, led to the creation of ImageNet in 2009. This massive, labeled dataset for visual recognition fundamentally shifted the paradigm, providing the backbone for modern computer vision. The breakthrough moment arrived in 2012, when AlexNet, trained on ImageNet using GPUs, dramatically outperformed earlier approaches on the ImageNet benchmark. "It was an old algorithm," Li noted, referring to convolutional neural networks, "but it was the first time that two GPUs were put together... for the computing of deep learning." This convergence of data, compute, and algorithms validated Li's vision, showing that both the quantity and the quality of data were crucial.
The success of ImageNet paved the way for machines to not just recognize objects, but to understand and describe entire scenes, and eventually, to generate images from text. This rapid progression, from object recognition to image captioning and generative models, fulfilled what Li once considered a "lifelong dream." However, ever the visionary, Li sees beyond the current language-centric AI hype, asserting that true Artificial General Intelligence (AGI) necessitates a mastery of the physical world.
Spatial intelligence, the core focus of her new company, World Labs, presents a challenge arguably more formidable than language. Li highlighted the fundamental differences: language is inherently one-dimensional and purely generative, a human construct without a direct physical counterpart. In contrast, the real world is three-dimensional, governed by complex physics, and requires continuous interaction and understanding, not just generation. "We don't have this spatial data on the internet," she explained, emphasizing how scarce such data is compared to text. The combinatorial explosion of possibilities in 3D space, coupled with the ill-posed problem of inferring 3D structure from 2D projections, makes spatial intelligence a significantly harder frontier.
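To make the "ill-posed" point concrete, here is a minimal sketch that was not part of the talk. It assumes an idealized pinhole camera with a hypothetical focal length of 1.0, and the helper names `project` and `points_along_ray` are illustrative only. It shows that many distinct 3D points land on the same 2D image location, so a single image cannot by itself determine depth.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(point: Point3D, focal_length: float = 1.0) -> Point2D:
    """Pinhole projection of a camera-frame 3D point onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


def points_along_ray(
    pixel: Point2D, depths: Sequence[float], focal_length: float = 1.0
) -> List[Point3D]:
    """Return one 3D point per depth, all on the viewing ray through `pixel`."""
    u, v = pixel
    return [(u * z / focal_length, v * z / focal_length, z) for z in depths]


# Assumed example values: one image location and four candidate depths.
pixel = (0.5, -0.25)
candidates = points_along_ray(pixel, depths=[1.0, 2.0, 4.0, 8.0])

# Each candidate is a different point in space...
reprojected = [project(p) for p in candidates]

# ...yet every one of them projects back to the same image location.
for point, image_point in zip(candidates, reprojected):
    print(point, "->", image_point)
```

Because every candidate reprojects to the same location, recovering the true 3D point from one view needs additional information, such as more views, motion, or learned priors about the world.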
Li's career is defined by tackling such "hard, bordering delusional" problems. She stressed the importance of intellectual fearlessness and the courage to pursue challenges that might seem insurmountable. For aspiring founders and AI professionals, her message is clear: whether in academia or industry, embrace curiosity and focus on building solutions for problems that truly excite you, regardless of their perceived difficulty. The progress in AI, from ImageNet to generative models, underscores the power of collective effort and open innovation.

