Inception Labs, led by CEO Stefano Ermon, has unveiled Mercury, a suite of large language models that leverage a diffusion architecture to achieve unprecedented speed and efficiency. Its initial offering, Mercury Coder, runs up to 10x faster than existing speed-optimized models while maintaining comparable quality, signaling a potential paradigm shift in the competitive LLM landscape.
Ermon, an Associate Professor at Stanford University, recently discussed these advancements with Alessio Fanelli, Partner and CTO at Decibel, and Swyx, Founder of Smol AI, on the Latent Space Podcast. The conversation illuminated Inception Labs' journey, rooted in generative model research since 2014, and their strategic pivot towards diffusion-based language models.
Diffusion models for text and code trace back to 2019, evolving from successes in image generation. While initial attempts to adapt these models to discrete data proved challenging, Ermon and his team made a crucial breakthrough. "We were able to show that discrete diffusion models were competitive on language generation with auto-regressive models up to the GPT-2 kind of scale," Ermon noted, highlighting the early promise. This fundamental research laid the groundwork for Inception Labs, founded last year, to scale these capabilities commercially.
The core innovation lies in how diffusion LLMs generate content. Unlike traditional auto-regressive models that produce text token by token in a sequential, left-to-right manner, diffusion models employ an iterative refinement process. "Diffusion models, on the other hand, they work by generating objects kind of like in a coarse-to-fine way. You start with a rough guess of what the answer should be, and then you refine it... by modifying essentially multiple tokens in parallel," Ermon explained. This parallel refinement is the key to their superior speed: Mercury Coder Mini and Mercury Coder Small, for instance, reach throughputs of 1,109 and 737 tokens/sec, respectively, on NVIDIA H100 GPUs.
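To make the contrast concrete, here is a minimal, purely illustrative Python sketch. The `model_fill` stub is a hypothetical stand-in for a real model's prediction, and nothing here reflects Mercury's actual implementation; the point is the control flow: one forward pass per token for autoregressive decoding versus a handful of passes that each commit many tokens for diffusion-style, coarse-to-fine decoding.

```python
# Illustrative sketch only (not Mercury's implementation). `model_fill` is a
# hypothetical stub for a model's prediction at one position; the comparison
# is about how many forward passes each decoding style needs.
import random

ANSWER = "def add(a, b): return a + b".split()
MASK = "<mask>"

def model_fill(position):
    """Stub for the model's prediction at one position."""
    return ANSWER[position]

def autoregressive_decode(length):
    """One forward pass per generated token, strictly left to right."""
    seq, passes = [], 0
    for i in range(length):
        seq.append(model_fill(i))
        passes += 1
    return seq, passes

def diffusion_decode(length, steps=3):
    """Start fully masked; each pass commits several positions in parallel."""
    seq, passes = [MASK] * length, 0
    masked = list(range(length))
    for step in range(steps):
        passes += 1
        k = -(-len(masked) // (steps - step))   # even share per pass (ceiling)
        for i in random.sample(masked, k):
            seq[i] = model_fill(i)              # many tokens fixed in one pass
            masked.remove(i)
    return seq, passes

tokens = len(ANSWER)
print(autoregressive_decode(tokens))   # needs `tokens` passes (one per token)
print(diffusion_decode(tokens))        # needs only `steps` passes
```

In the toy loop above, cutting the number of forward passes from one-per-token to a small fixed number of refinement steps is exactly where the speed advantage comes from; a production model does this with learned denoising rather than a lookup stub.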
This architectural difference also confers advantages beyond raw speed. Diffusion models are inherently non-causal, meaning they can consider the entire context of a sequence (both left and right) during generation, which is beneficial for tasks like code in-filling. Furthermore, their inference efficiency is a significant differentiator. "For the same throughput, we can get better latency, or for the same latency, we can get better throughput," Ermon explained, directly addressing the critical cost and performance metrics for production AI. This efficiency makes Mercury particularly well-suited for latency-sensitive applications such as voice agents, integrated development environments (IDEs), and real-time coding assistants. Inception Labs offers its models via API and has no immediate plans for open-sourcing; its focus remains on pushing the boundaries of what's possible with efficient, high-quality language generation.
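A rough back-of-the-envelope sketch helps make the latency/throughput trade-off concrete. Every number below is an assumption chosen for illustration, not a measured Mercury or competitor figure; the point is that when each pass commits many tokens, far fewer passes are needed to finish a response of the same length.

```python
# Back-of-the-envelope illustration of the latency/throughput trade-off.
# All values are assumptions made up for the arithmetic, not benchmarks.

response_tokens = 500            # assumed length of one completion

# Autoregressive: one forward pass per token.
ar_pass_time_s = 0.010           # assumed time per pass
ar_latency_s = response_tokens * ar_pass_time_s

# Diffusion-style: a small number of refinement passes, each touching
# many tokens; a single pass may cost more, but far fewer are needed.
diffusion_passes = 20            # assumed number of refinement steps
diffusion_pass_time_s = 0.020    # assumed time per (heavier) pass
diffusion_latency_s = diffusion_passes * diffusion_pass_time_s

for name, latency in [("autoregressive", ar_latency_s),
                      ("diffusion", diffusion_latency_s)]:
    print(f"{name:>15}: {latency:.2f}s to finish, "
          f"~{response_tokens / latency:.0f} tokens/sec")
```

Under these assumed numbers the diffusion-style decoder finishes the same 500-token response in a fraction of the time, which is the shape of the trade-off Ermon describes: fewer, heavier passes can yield both lower latency and higher effective throughput.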

