
Nvidia's Ziv Ilan on Faster Diffusion Models

Nvidia's Ziv Ilan explains how to reduce diffusion model latency using quantization, caching, and distillation, plus the new FastGen library.

Jun 16 at 2:23 PM · 9 min read

Ziv Ilan of Nvidia presenting on diffusion models at AI Engineer Europe. Image: AI Engineer

Ziv Ilan, an AI Labs researcher at Nvidia, presented a talk titled "You Might Not Need 50 Diffusion Steps" at AI Engineer Europe. The presentation focused on optimizing diffusion models to reduce their computational demands and improve inference speed, making them more practical for real-time applications.


Visual TL;DR: Diffusion models require a high step count, which Nvidia's Ziv Ilan addresses with three optimization strategies: quantization, caching, and distillation. Each one reduces latency, and the FastGen library brings the strategies together.


Diffusion Models: iteratively denoising random noise to generate images or videos
High Step Count: typically 20-50 steps, leading to high computational demands
Nvidia's Ziv Ilan: researcher presenting optimization strategies for diffusion models
Optimization Strategies: quantization, caching, and distillation techniques
Quantization: making each diffusion step computationally cheaper
Caching: skipping redundant computations across diffusion steps
Distillation: compressing many steps into far fewer (1-8)
Reduced Latency: faster inference for practical real-time applications
FastGen Library: new library enabling faster diffusion model inference


Understanding Diffusion Models

Ilan began by explaining the fundamental concept of diffusion models. These models generate images or videos by iteratively denoising random noise. At each step, a neural network predicts the noise in the current sample and removes part of it, refining the output. Output quality comes from these repeated refinement passes, and a typical generation uses 20 to 50 of them. He highlighted that models like FLUX.2, FLUX.2.3, and Wan 2.7 are currently leading the charge in this domain, powering applications from text-to-image generation to scientific modeling.
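
The loop described above can be sketched in a few lines of Python. This is a toy illustration, not code from the talk: `predict_noise` is a hypothetical stand-in for the trained network, and the update rule is deliberately simplified. The point it shows is that the number of network forward passes equals the number of steps.

```python
import random

def predict_noise(x, t):
    # Hypothetical stand-in for the trained network. A real model is a large
    # neural net; here the predicted "noise" is simply x scaled by t.
    return [t * v for v in x]

def generate(num_steps, dim=4, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    start = list(x)
    network_calls = 0
    for i in range(num_steps):
        t = 1.0 - i / num_steps       # noise level runs from 1.0 down toward 0
        noise = predict_noise(x, t)   # one forward pass per step
        network_calls += 1
        # Remove a fraction of the predicted noise (simplified update rule).
        x = [v - n / num_steps for v, n in zip(x, noise)]
    return start, x, network_calls

start, sample, calls = generate(num_steps=50)
print(calls)  # one network call per diffusion step
```

Because the cost scales with `network_calls`, there are two levers: make each call cheaper, or make fewer calls.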

The Problem with 50 Steps

The primary challenge Ilan addressed is the high number of diffusion steps required by these models. He identified three key barriers blocking diffusion models from reaching their full potential:

The Latency Wall: Generating an image or video with 50 steps can take 30-60 seconds. Real-time applications need each output in under a second, so this is a significant bottleneck that keeps diffusion models out of live broadcasting, interactive apps, and other time-sensitive scenarios.
Use Case Enablement: Many potential applications, such as real-time image editing, live video stylization, and interactive world models, are currently unviable at 50 diffusion steps. Reducing these steps is essential for unlocking these use cases.
The Mature AI Ecosystem: Compared to other AI models like Large Language Models (LLMs), which have achieved near-instantaneous inference through optimization techniques, diffusion models lag behind. LLMs have achieved one-forward-pass-per-token efficiency, while diffusion models still require extensive iterative refinement. The goal is to bridge this gap by bringing diffusion inference to a similar maturity level as LLMs.

Closing the Gap: Optimization Strategies

To address these challenges, Ilan discussed three key optimization strategies: quantization, caching, and distillation.

Quantization: Make Each Step Cheaper

Quantization reduces the numerical precision of the model's parameters, which speeds up computation and lowers memory usage while often preserving quality. Ilan explained two main approaches:

Post-training quantization (PTQ): This method quantizes a pre-trained model without further training. It involves using calibration data to determine the quantization parameters and then running inference.
Quantization-aware training (QAT): This approach introduces quantization during the training process itself. The model simulates the quantization process during training, which helps to reduce the accuracy drop that can occur with quantization.
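
The two approaches can be sketched as follows. This is a minimal illustration under stated assumptions, not Nvidia's implementation: it uses symmetric per-tensor int8 quantization because it is easy to show in plain Python, whereas the talk discussed floating-point formats. All function names are hypothetical.

```python
def calibrate_scale(calibration_values, num_bits=8):
    # PTQ step 1: use calibration data to pick the quantization scale.
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, scale, num_bits=8):
    # Map floats to integers in [-qmax, qmax], rounding to the nearest level.
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    return [q * scale for q in q_values]

def fake_quantize(values, scale, num_bits=8):
    # QAT simulates this round trip in the forward pass during training,
    # so the model learns weights that tolerate the rounding error.
    return dequantize(quantize(values, scale, num_bits), scale)

weights = [0.5, -1.27, 0.03, 1.0]
scale = calibrate_scale(weights)              # PTQ: no further training needed
q_weights = quantize(weights, scale)
restored = dequantize(q_weights, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q_weights, max_error)
```

For values inside the calibrated range, the rounding error per value is at most half the scale, which is why a well-chosen calibration set matters for PTQ.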

Nvidia has demonstrated that quantizing models like FLUX.2 to lower precisions, such as FP16 or FP8, preserves quality at every precision level while delivering significant speedups.

Caching: Skip Redundant Computation

Caching reuses previously computed results when the input has not changed significantly. Ilan highlighted TeaCache (CVPR 2025) as an example of intra-request caching. The technique monitors how much the timestep embedding changes between consecutive steps. When the change is small, the model reuses the cached transformer output and skips the computation. Reported results include a 2x speedup from skipping 16 of 50 steps with less than 0.07% quality loss. TeaCache is also integrated into TensorRT-LLM, ComfyUI, and vLLM-Omni.
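
A simplified sketch of this kind of step-skipping cache is shown below. It is not the TeaCache implementation: the accumulated relative L1 distance and the fixed `threshold` are assumptions chosen to illustrate the idea, and `transformer` is a hypothetical callable standing in for the expensive model.

```python
def relative_l1(current, previous):
    # Relative L1 change between two embeddings.
    num = sum(abs(c - p) for c, p in zip(current, previous))
    den = sum(abs(p) for p in previous) or 1.0
    return num / den

def run_with_cache(embeddings, transformer, threshold):
    cached_output = None
    previous = None
    accumulated = 0.0      # change accumulated since the last real computation
    computed = 0
    outputs = []
    for emb in embeddings:
        if previous is not None:
            accumulated += relative_l1(emb, previous)
        if cached_output is None or accumulated >= threshold:
            cached_output = transformer(emb)   # expensive path
            computed += 1
            accumulated = 0.0
        # Otherwise the cached output is reused and the step is skipped.
        outputs.append(cached_output)
        previous = emb
    return outputs, computed

embeddings = [[1.0], [1.001], [1.002], [2.0]]
outputs, computed = run_with_cache(
    embeddings, transformer=lambda e: [2 * v for v in e], threshold=0.1
)
print(computed, "of", len(embeddings), "steps computed")
```

Accumulating the change, rather than comparing only adjacent steps, prevents many small drifts from going unnoticed. The threshold trades speed against fidelity: a higher value skips more steps.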

Distillation: Compress Steps to 1-8

Distillation trains a "student" model to mimic the behavior of a "teacher" model. For diffusion, the idea is that a 50-step teacher can be approximated by a student that takes only a few steps (1-8) to achieve similar results. There are two main families of distillation methods:

Trajectory-based (path regression): The student model learns to approximate the teacher's denoising trajectory. This is best for preserving fine-grained details.
Distribution-based (output matching): The student model learns to match the teacher's output distribution directly. This is best for one-step generation and creative applications.
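
As a toy illustration of the teacher-student idea, the sketch below regresses a one-step student onto the endpoint of a 50-step teacher. Everything here is an assumption made for illustration: the teacher is a simple Euler solver rather than a real diffusion model, the student is a single scalar weight, and the fit is closed-form least squares rather than gradient-based training. It is closest in spirit to the trajectory-based family, since the student learns where the teacher's path ends.

```python
import random

def teacher_sample(x0, num_steps=50):
    # Toy teacher: 50 small Euler steps along the velocity field dx/dt = -x.
    x = x0
    dt = 1.0 / num_steps
    for _ in range(num_steps):
        x = x - dt * x          # one "network call" per step
    return x

def fit_student(train_inputs):
    # Least-squares fit of a one-step map x -> w * x to the teacher's outputs.
    targets = [teacher_sample(x) for x in train_inputs]
    num = sum(x * y for x, y in zip(train_inputs, targets))
    den = sum(x * x for x in train_inputs)
    return num / den

rng = random.Random(0)
train_inputs = [rng.gauss(0.0, 1.0) for _ in range(256)]
w = fit_student(train_inputs)

def student_sample(x0):
    return w * x0               # a single step replaces the teacher's 50

print(teacher_sample(2.0), student_sample(2.0))
```

The toy teacher is linear, so the student matches it almost exactly. Real diffusion teachers are highly nonlinear, which is why practical distillation needs a full network as the student and substantial training time.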

The trade-off is that distillation requires significant post-training time (hours to days), but the payoff is substantial: reducing 50 steps to just 2-4 can yield a 12-25x speedup.
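
The 12-25x range follows directly from the ratio of step counts. The short calculation below assumes every step costs the same and ignores fixed overhead such as text encoding or decoding, so it is an upper-bound estimate rather than a measured result.

```python
def step_speedup(teacher_steps, student_steps):
    # Idealized speedup: assumes equal cost per step and no fixed overhead.
    return teacher_steps / student_steps

print(step_speedup(50, 4))  # 12.5
print(step_speedup(50, 2))  # 25.0
```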

Nvidia FastGen: An Open-Source Solution

To facilitate these optimizations, Nvidia has released FastGen, a unified library for all these methods. FastGen is designed to be network-agnostic and supports various models and tasks. It offers scalable FSDP2 support for models of 14B+ parameters. The library has demonstrated impressive results, including 25x speedups on image and video generation and 23x speedups on CorrDiff atmospheric downscaling. Notably, a 14B model was distilled in 16 hours on 64 H100s. The library also boasts faster-than-real-time performance, with examples like generating a 10-second 720p video in 6 seconds on a DGX Station.
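
To put the reported figures in perspective, the arithmetic below converts them into GPU-hours and a real-time factor. The helper names are hypothetical and the inputs are the numbers quoted in the talk, not independent measurements.

```python
def gpu_hours(wall_clock_hours, num_gpus):
    # Total compute spent on the distillation run.
    return wall_clock_hours * num_gpus

def realtime_factor(video_seconds, generation_seconds):
    # Values above 1.0 mean the video is generated faster than it plays back.
    return video_seconds / generation_seconds

print(gpu_hours(16, 64))                  # 1024 GPU-hours
print(round(realtime_factor(10, 6), 2))   # 1.67
```

The distillation cost is paid once, while the faster-than-real-time factor applies to every generation afterward.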

Ilan concluded by encouraging the audience to try these techniques themselves, highlighting the availability of FastGen and other related tools on GitHub, along with blog posts and documentation for further exploration.

© 2026 StartupHub.ai. All rights reserved.

