NVIDIA has once again swept the MLPerf Training v5.1 benchmarks, demonstrating unparalleled performance across critical AI workloads. The company's Blackwell Ultra architecture made a formidable debut, setting new records for large language model (LLM) training. This comprehensive victory underscores NVIDIA's continued leadership in scaling AI performance through hardware and software breakthroughs.
The GB300 NVL72 rack-scale system, powered by Blackwell Ultra, delivered significant generational leaps. It achieved over 4x faster Llama 3.1 405B pretraining and nearly 5x faster Llama 2 70B LoRA fine-tuning compared to the prior-generation Hopper architecture, using the same number of GPUs. These gains stem from Blackwell Ultra's architectural enhancements, including new Tensor Cores offering 15 petaflops of NVFP4 AI compute and 279 GB of HBM3e memory. The NVIDIA Quantum-X800 InfiniBand platform also debuted, doubling scale-out networking bandwidth.
A pivotal innovation behind these results is the adoption of NVFP4 precision for calculations, a first in MLPerf Training history. Performing computations with fewer bits dramatically increases compute throughput, though it demands meticulous design to maintain accuracy. NVIDIA's teams worked across the entire stack to apply FP4 precision to LLM training effectively. Blackwell Ultra raises the FP4 calculation rate to 3x that of FP8, enabling substantially greater AI compute. According to the announcement, NVIDIA is the only platform to have submitted MLPerf Training results using FP4 precision while meeting the benchmark's strict accuracy requirements.
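To see why fewer bits demand careful design, consider what block-scaled 4-bit quantization looks like. The sketch below is illustrative only, not NVIDIA's implementation: it assumes E2M1 elements (eight representable magnitudes, max 6.0) and one shared scale per block of 16 values, following NVIDIA's public description of the NVFP4 format.

```python
import numpy as np

# Magnitudes representable by a 4-bit E2M1 float, the element type NVFP4 uses.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_dequantize_fp4(x, block_size=16):
    """Simulate NVFP4-style block quantization: each block of values shares
    one scale, chosen so the block's largest magnitude maps to 6.0 (the E2M1
    maximum), then every element is rounded to the nearest representable
    value. Returns the dequantized array so the rounding error is visible."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    out = np.empty_like(blocks)
    for i, blk in enumerate(blocks):
        amax = np.abs(blk).max()
        scale = amax / 6.0 if amax > 0 else 1.0
        scaled = np.abs(blk) / scale
        # Round each scaled magnitude to the nearest E2M1 value.
        nearest = E2M1_VALUES[np.argmin(np.abs(scaled[:, None] - E2M1_VALUES), axis=1)]
        out[i] = np.sign(blk) * nearest * scale
    return out.reshape(-1)[:x.size]
```

In hardware, the per-block scale is itself stored in a compact format (FP8 for NVFP4, per NVIDIA's documentation); the key accuracy challenge is that every element in a block inherits the rounding granularity set by the block's largest value, which is why full-stack care is needed to train at this precision.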
Blackwell's Scaling Prowess
NVIDIA established a new Llama 3.1 405B time-to-train record of just 10 minutes, utilizing over 5,000 Blackwell GPUs. This achievement was 2.7x faster than the previous Blackwell-based record, driven by efficient scaling and the performance boost from NVFP4 precision. To illustrate per-GPU performance, a submission with 2,560 Blackwell GPUs completed training in 18.79 minutes, a 45% improvement over a similar GPU count in the last round.
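To put the quoted figures side by side, scaling efficiency can be computed as achieved speedup divided by ideal (linear) speedup. The GPU count of 5,120 below is an assumption for illustration; the article says only "over 5,000".

```python
def scaling_efficiency(gpus_a, minutes_a, gpus_b, minutes_b):
    """Ratio of achieved speedup to ideal linear speedup when scaling
    from configuration A to configuration B."""
    achieved = minutes_a / minutes_b   # how much faster B finished
    ideal = gpus_b / gpus_a            # how much faster B would be if linear
    return achieved / ideal

# 2,560 GPUs at 18.79 min vs. an assumed 5,120 GPUs at 10 min.
eff = scaling_efficiency(2560, 18.79, 5120, 10.0)
print(f"{eff:.0%}")  # roughly 94% under the assumed GPU count
```

Under that assumption, doubling the GPU count retained about 94% of linear scaling, which is consistent with the article's emphasis on efficient scaling at this size.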
The latest MLPerf round also introduced new benchmarks, where NVIDIA continued its record-setting trend. The Llama 3.1 8B model, a modern compact LLM replacing BERT-large, saw NVIDIA set a 5.2-minute training record with 512 Blackwell Ultra GPUs. Similarly, for FLUX.1, a new image generation model replacing Stable Diffusion v2, NVIDIA was the sole platform to submit results, achieving a 12.5-minute training time with 1,152 Blackwell GPUs.
NVIDIA's consistent one-year innovation cycle is rapidly accelerating AI performance across all stages, from pretraining to inference. This relentless pace, combined with the maturity of its CUDA software stack and a broad partner ecosystem, is not merely about winning benchmarks. It signifies a fundamental shift in what's possible for AI development, paving the way for more sophisticated models and faster AI adoption across industries.