AI agents boost GPU kernels 38%

AI agents autonomously optimize GPU kernels, achieving 38% speedup and demonstrating potential for complex software development.

A multi-agent AI system autonomously optimized NVIDIA GPU kernels, achieving significant performance gains. Source: Cursor Blog

A new multi-agent AI system has autonomously worked through 235 CUDA kernel optimization problems for NVIDIA's Blackwell GPUs in three weeks, delivering a geometric mean speedup of 38%. The results, detailed by Cursor and NVIDIA researchers, show how AI can accelerate critical components of AI model training and inference.

These CUDA kernels are the bedrock of AI computations on NVIDIA hardware. Faster kernels translate directly to better GPU efficiency, reduced power consumption, lower latency, and ultimately, decreased costs. This efficiency allows for the deployment of larger, more capable AI models to a wider user base.

The multi-agent system tackled complex kernel optimization problems, achieving levels of performance improvement that typically demand months or even years of specialized human engineering. The system's ability to address a long tail of previously intractable kernel issues highlights its potential for future software development.

AI tackles kernel optimization

Kernel optimization presents a unique challenge for AI systems. Unlike tasks with a single known solution, optimizing kernels involves navigating a vast solution space with measurable objectives. Traditional methods often break down complex operations into smaller, manageable parts, which can leave performance gains on the table due to a lack of holistic optimization.

This experiment aimed to see if a multi-agent system could overcome these limitations, exploring a broader solution space to generate faster kernels. The system's capability to autonomously optimize GPU kernel performance is a significant step forward.

SOL-ExecBench fuels the AI

NVIDIA provided the SOL-ExecBench platform, which generated 235 real-world optimization problems from over 124 open-source models. These problems mimicked constraints found in actual AI training and inference workloads across various architectures, including LLMs, diffusion models, and more.

The platform also served as the benchmarking environment, ensuring that AI-generated solutions adhered to hardware limits and did not employ deceptive tactics like caching. This rigorous testing methodology validated the system's performance on 27 NVIDIA Blackwell 200 GPUs.

Autonomous optimization in action

The multi-agent system operated autonomously for three weeks, coordinating a planner agent with worker agents to distribute and rebalance tasks based on performance metrics. The entire coordination protocol was defined within a single markdown file, specifying rules and testing procedures.

Crucially, the system learned to interact with the benchmarking pipeline independently, creating a continuous loop of testing, debugging, and optimization without human intervention.
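The planner and worker loop described above can be pictured with a small scheduling sketch. This is an illustration only, not the protocol from the article: the `Planner` class, the `target` threshold, and the rule of always handing out the problem with the lowest best score are all assumptions made for the example.

```python
import heapq


class Planner:
    """Hypothetical planner: hands out the open problem with the lowest best score."""

    def __init__(self, problem_ids, target=0.9):
        self.target = target  # assumed "good enough" score, not from the article
        self.best = {pid: 0.0 for pid in problem_ids}
        # Min-heap of (best score so far, problem id); lowest score pops first.
        self._queue = [(0.0, pid) for pid in problem_ids]
        heapq.heapify(self._queue)

    def next_task(self):
        """Return the least-optimized open problem, or None when none are queued."""
        if not self._queue:
            return None
        _, pid = heapq.heappop(self._queue)
        return pid

    def report(self, pid, score):
        """Record a worker's benchmark score and requeue unfinished problems."""
        self.best[pid] = max(self.best[pid], score)
        if self.best[pid] < self.target:
            heapq.heappush(self._queue, (self.best[pid], pid))


planner = Planner(["attention", "gemm", "moe"])
task = planner.next_task()   # a worker picks up a problem
planner.report(task, 0.95)   # above the target, so it is not requeued
```

Keying the queue on the best score so far is one simple way to rebalance effort toward the problems with the most headroom. A real system would also need to handle failed builds and concurrent workers.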

To test its adaptability, the system was asked to write solutions in both CUDA C with inline PTX, which gives low-level hardware access, and CuTe DSL, which provides higher-level abstractions. This tested its ability to reason across levels of abstraction and to learn novel APIs.

Quantifiable performance gains

The results show the multi-agent system outperforming baseline PyTorch implementations in 63% of the problems, achieving a geometric mean speedup of 1.38x (38%). Notably, 19% of the optimizations resulted in speedups exceeding 2x.
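A geometric mean is the usual way to average speedup ratios, because a 2x gain and a 0.5x slowdown cancel out instead of averaging to 1.25x. The sketch below shows how the three headline figures (geometric mean, share of problems beating the baseline, share above 2x) could be computed from per-problem speedups. The function names and the sample data are made up for illustration and are not the experiment's results.

```python
import math


def geometric_mean(values):
    """Geometric mean via the mean of logs (all values must be positive)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def summarize(speedups):
    """Summarize per-problem speedups, where speedup = baseline_time / new_time."""
    n = len(speedups)
    return {
        "geomean": geometric_mean(speedups),
        "win_rate": sum(s > 1.0 for s in speedups) / n,  # beat the baseline
        "over_2x": sum(s > 2.0 for s in speedups) / n,   # more than 2x faster
    }


# Illustrative data only: one slowdown, one tie, two wins.
stats = summarize([0.5, 1.0, 2.0, 4.0])
print(stats)
```

For this sample the arithmetic mean would be 1.875, while the geometric mean is the square root of 2, about 1.41, which is why the choice of average matters when reading a figure like 1.38x.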

The system was also measured with "Speed-of-Light" (SOL) scores, which indicate how close a solution comes to theoretical hardware limits. The median SOL score of 0.56 suggests substantial room for further improvement.
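The article does not give the exact SOL scoring formula, so the following is only a sketch of the general idea, assuming a simple roofline bound: a kernel can finish no faster than its arithmetic allows at peak compute, or its memory traffic allows at peak bandwidth, whichever is slower. The function names and the hardware numbers are placeholders, not Blackwell specifications or the benchmark's definition.

```python
def speed_of_light_seconds(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Roofline lower bound on runtime: the slower of compute and memory limits."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return max(compute_time, memory_time)


def sol_ratio(measured_seconds, flops, bytes_moved, peak_flops, peak_bandwidth):
    """Assumed score: bound / measured, so 1.0 means running at the hardware limit."""
    bound = speed_of_light_seconds(flops, bytes_moved, peak_flops, peak_bandwidth)
    return bound / measured_seconds


# Placeholder workload and hardware figures, chosen only to keep the numbers round.
ratio = sol_ratio(
    measured_seconds=4e-3,
    flops=2e12,            # floating-point operations in the kernel
    bytes_moved=1e9,       # bytes read from and written to memory
    peak_flops=1e15,       # hypothetical peak compute, FLOP/s
    peak_bandwidth=1e12,   # hypothetical peak memory bandwidth, bytes/s
)
print(ratio)
```

Under this assumed definition, a kernel that takes twice as long as the bound scores 0.5. The benchmark's own scale may be anchored differently.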

Diverse strategies for diverse problems

The multi-agent system demonstrated adaptability by employing distinct optimization strategies for different kernel types.

BF16 Grouped Query Attention with Paged Prefill

For Grouped Query Attention, a key operation in LLM inference, the system used CUDA C++ to optimize memory loading, math operations, and scheduling. It achieved an 84% geomean speedup over a human-optimized baseline, reaching an SOL score of 0.9722.

NVFP4 MoE Linear with Gating

In optimizing Mixture-of-Experts (MoE) models with NVFP4 quantization, the AI identified quantization as the bottleneck. It fused scaling and rounding operations and employed pre-computed threshold buckets for efficient FP32 to FP4 conversion, yielding a 39% geomean speedup.
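The threshold-bucket idea can be sketched in a few lines. A 4-bit E2M1 float has only eight magnitudes, so rounding reduces to counting how many pre-computed midpoints a value exceeds. This is a plain Python illustration, not the generated CUDA kernel: the function names are hypothetical, and the per-block scale is kept as an ordinary float rather than being encoded the way a real NVFP4 format would store it.

```python
# Magnitudes representable by a 4-bit E2M1 float, indexed by their 3-bit code.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Midpoints between neighbouring values, computed once up front.
THRESHOLDS = [(a + b) / 2 for a, b in zip(FP4_VALUES, FP4_VALUES[1:])]


def fp4_code(x):
    """Map a float to a 4-bit code: sign bit plus the bucket its magnitude falls in."""
    mag = min(abs(x), FP4_VALUES[-1])  # saturate at the largest magnitude
    code = 0
    for i, t in enumerate(THRESHOLDS):
        # Exact ties go to the neighbour with the even code (round-to-nearest-even).
        if mag > t or (mag == t and i % 2 == 1):
            code += 1
    sign = 1 if x < 0 else 0
    return (sign << 3) | code


def fp4_decode(code):
    """Inverse of fp4_code: turn a 4-bit code back into a float."""
    mag = FP4_VALUES[code & 0b111]
    return -mag if code & 0b1000 else mag


def quantize_block(block):
    """Scale a block so its largest magnitude maps to 6.0, then bucket each value."""
    amax = max(abs(v) for v in block)
    scale = amax / FP4_VALUES[-1] if amax > 0 else 1.0
    return scale, [fp4_code(v / scale) for v in block]


scale, codes = quantize_block([0.0, 3.0, -12.0, 6.0])
restored = [fp4_decode(c) * scale for c in codes]
```

Because the thresholds are fixed, the comparison chain has no data-dependent arithmetic beyond the scale, which is the property that makes this style of conversion attractive on a GPU.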

BF16 Matrix Multiplication

For matrix multiplication, an operation that is notoriously hard to optimize, the AI generated a specialized CUDA C++ kernel. This kernel reached 86% of the performance of NVIDIA's cuBLAS library by leveraging Blackwell-specific instructions and optimizing memory access. The result suggests AI could soon match or surpass domain experts in highly specialized areas.

While the median SOL score of 0.56 indicates potential for further gains, the experiment validates the efficacy of multi-agent systems for complex, open-ended software development tasks.

© 2026 StartupHub.ai. All rights reserved.