"Models are what they eat." That line sums up the shift in artificial intelligence described by Ari Morcos, CEO and co-founder of DatologyAI, on the Latent Space podcast. He argues that the prevailing focus on intricate model architectures and brute-force compute scaling has overlooked the most impactful lever for AI progress: sophisticated data curation.

Morcos, interviewed by Alessio Fanelli and swyx, traced his path from neuroscience, where he was immersed in the study of neural dynamics and inductive biases, to a stark realization he calls "the bitter lesson." After years of writing papers on why certain model representations were desirable, he kept arriving at the same uncomfortable conclusion around 2020: "all that really matters is the data."

The realization was that as data scales, the carefully engineered inductive biases in model architectures matter less and can even become "mildly harmful" past a threshold of roughly one million data points. The traditional computer science approach, which treated the dataset as a fixed given to optimize against, was fundamentally flawed. Self-supervised learning enabled a roughly million-fold increase in data quantity, from ImageNet-scale datasets to trillions of tokens, and ushered in a regime where models consistently "underfit" the available data.

In this data-abundant landscape, data quality, not just quantity or architectural novelty, is the primary determinant of model performance. Automated data curation, involving filtering, rebalancing, sequencing, and strategic synthetic data generation, is no longer a "plumbing" task but a frontier research problem. Humans are simply unequipped to discern the nuanced information gain of individual data points within massive datasets.
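To make the filtering and rebalancing steps concrete, here is a minimal Python sketch. It is not Datology's pipeline: the function names, the distinct-word quality heuristic, the thresholds, and the toy corpus are all assumptions chosen for illustration, and sequencing and synthetic data generation are left out entirely.

```python
import hashlib
from collections import defaultdict


def dedupe(docs):
    """Drop exact duplicates after lowercasing and collapsing whitespace."""
    seen, kept = set(), []
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def quality_score(text):
    """Toy heuristic (an assumption): share of distinct words in the text."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def filter_low_quality(docs, min_score=0.5, min_words=4):
    """Keep documents that are long enough and not highly repetitive."""
    return [
        d for d in docs
        if len(d["text"].split()) >= min_words
        and quality_score(d["text"]) >= min_score
    ]


def rebalance(docs, cap_per_topic):
    """Keep at most cap_per_topic documents per topic, preserving order."""
    counts, kept = defaultdict(int), []
    for d in docs:
        if counts[d["topic"]] < cap_per_topic:
            counts[d["topic"]] += 1
            kept.append(d)
    return kept


def curate(docs, cap_per_topic=2):
    """Chain the three steps: dedupe, then filter, then rebalance."""
    return rebalance(filter_low_quality(dedupe(docs)), cap_per_topic)


# Hypothetical toy corpus; the topic labels are assumed to be given.
corpus = [
    {"text": "Gradient descent updates weights along the negative gradient.", "topic": "ml"},
    {"text": "gradient descent  updates weights along the negative gradient.", "topic": "ml"},
    {"text": "buy now buy now buy now buy now", "topic": "spam"},
    {"text": "Attention layers mix information across token positions.", "topic": "ml"},
    {"text": "Batch normalization rescales activations during training.", "topic": "ml"},
    {"text": "The mitochondria produce most of the cell's ATP.", "topic": "biology"},
]

curated = curate(corpus)
for doc in curated:
    print(doc["topic"], "|", doc["text"])
```

On this toy corpus the near-identical second document is removed as a duplicate, the repetitive spam line fails the quality filter, and the third machine learning document is dropped by the per-topic cap. At real scale each step would be learned or model-based rather than hand-written, which is the point of treating curation as a research problem.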

Datology’s mission is to automate this complex curation process, making state-of-the-art data accessible and enabling the training of models that are simultaneously faster, better, and smaller. Their work has demonstrated large efficiency gains, reaching baseline performance 12x faster than previous methods by curating data so that each example retains high marginal information gain. This approach bends the "naive scaling laws" that predict diminishing returns from simply increasing data volume or compute.
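A small worked example shows what "bending" a scaling law can mean. The sketch below assumes the commonly used power-law form loss(D) = E + A / D^alpha, where D is the amount of training data. Every constant is invented for illustration, and modeling curation as a larger exponent is an assumption of this sketch, not a result reported in the interview.

```python
def loss(data_size, A, alpha, E):
    """Assumed power-law scaling curve: loss falls as data_size grows."""
    return E + A / data_size ** alpha


def data_to_reach(target_loss, A, alpha, E):
    """Invert the curve: how much data is needed to hit target_loss."""
    if target_loss <= E:
        raise ValueError("target loss is at or below the irreducible floor E")
    return (A / (target_loss - E)) ** (1.0 / alpha)


# Illustrative constants only; none of these are measured values.
A, E = 10.0, 1.5
ALPHA_NAIVE = 0.30    # exponent for uncurated data (assumed)
ALPHA_CURATED = 0.35  # slightly steeper curve for curated data (assumed)

target = 2.0
d_naive = data_to_reach(target, A, ALPHA_NAIVE, E)
d_curated = data_to_reach(target, A, ALPHA_CURATED, E)
speedup = d_naive / d_curated

print(f"data needed, naive:   {d_naive:,.0f}")
print(f"data needed, curated: {d_curated:,.0f}")
print(f"data reduction factor: {speedup:.1f}x")
```

Because the exponent sits inside a power, even a small change in it compounds into a large difference in the data needed to reach the same loss, which is why data quality can matter more than raw volume under this model.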

The implications are significant for founders, VCs, and AI professionals. Instead of pouring resources into increasingly marginal gains from architectural tweaks or larger compute clusters, the true competitive advantage now lies in optimizing the "food" models consume. This shift towards data efficiency promises not only superior model performance but also substantial cost reductions, making advanced AI development more accessible and sustainable.

The Bitter Lesson: Why Data Curation is AI's Underinvested Frontier
