AI Agents Supercharge GPU Kernel Development

LinkedIn is leveraging AI agents to automate complex GPU kernel engineering for its Liger Kernel project, accelerating AI model performance.

Figure: Agentic workflows automate tasks in the Liger Kernel engineering pipeline. (Image: LinkedIn Engineering)

As large language models balloon in size and complexity, the efficiency of GPU kernels has become a critical bottleneck. While custom kernels can bridge performance gaps left by standard libraries like PyTorch, their creation demands scarce, specialized expertise. LinkedIn's open-source Liger Kernel project aims to democratize these optimizations.

Liger Kernel delivers substantial gains, boasting a 20% throughput improvement and 60% memory reduction across nearly 40 model architectures. It integrates seamlessly with popular tools like HuggingFace Transformers and works with Flash Attention, PyTorch FSDP, and DeepSpeed. The project has seen strong adoption, with over 7 million downloads and contributions from 100+ companies.

However, maintaining such an extensive project presents its own hurdles. Developing new kernels, optimizing existing ones, and integrating support for new models each require significant expert time, and that pace struggles to keep up with rapid model innovation.

To address this, LinkedIn is deploying AI agents to automate the heavy lifting of kernel engineering. This initiative, detailed in a recent post, applies the philosophy of "AI helping build better AI" to GPU kernel development.

Agentic Workflows for Kernel Engineering

The development of Liger Kernel follows well-defined patterns: analysis, implementation, testing, and benchmarking. These repeatable steps are ideal for agentic automation, but the complexity of arbitrary shapes, multiple precision modes, and diverse model architectures necessitates sophisticated workflows.

LinkedIn developed agentic workflows that encode Liger-specific domain knowledge into repeatable, agent-driven processes. Packaged as reusable agent skills, these workflows automate complex, multi-step engineering tasks through a three-stage pipeline with human review checkpoints.

  • Understand: The agent analyzes input (code, URLs, descriptions), reasons about the problem, and generates a structured profile for human review.
  • Act: Using the confirmed profile and existing code as reference, the agent generates or modifies files according to project conventions.
  • Verify: The agent runs correctness checks and benchmarks, blocking progress on hard failures and flagging soft failures for review.
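
The three stages above can be sketched as a small driver function. This is a minimal illustration, not LinkedIn's implementation: `run_pipeline`, `CheckResult`, `Severity`, and `PipelineOutcome` are hypothetical names, and the stage callables are stubs standing in for the agent and the human reviewer.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    HARD = "hard"  # a failed check of this kind blocks the pipeline
    SOFT = "soft"  # a failed check of this kind is only flagged for review


@dataclass
class CheckResult:
    name: str
    passed: bool
    severity: Severity = Severity.HARD


@dataclass
class PipelineOutcome:
    status: str  # "rejected", "blocked", or "ready_for_review"
    files: dict = field(default_factory=dict)
    flagged: list = field(default_factory=list)


def run_pipeline(task, understand, approve, act, verify):
    # Understand: build a structured profile, then pause for a human.
    profile = understand(task)
    if not approve(profile):
        return PipelineOutcome(status="rejected")

    # Act: generate or modify files from the confirmed profile.
    files = act(profile)

    # Verify: hard failures block progress, soft failures are flagged.
    failed = [r for r in verify(files) if not r.passed]
    hard = [r.name for r in failed if r.severity is Severity.HARD]
    soft = [r.name for r in failed if r.severity is Severity.SOFT]
    if hard:
        return PipelineOutcome(status="blocked", files=files, flagged=hard)
    return PipelineOutcome(status="ready_for_review", files=files, flagged=soft)


# Stub stages (all hypothetical) to show the control flow.
outcome = run_pipeline(
    task="relu_squared",
    understand=lambda t: {"op": t, "tier": 1},
    approve=lambda profile: True,  # stand-in for the human checkpoint
    act=lambda profile: {profile["op"] + ".py": "# generated kernel"},
    verify=lambda files: [
        CheckResult("correctness", passed=True),
        CheckResult("benchmark_noise", passed=False, severity=Severity.SOFT),
    ],
)
print(outcome.status, outcome.flagged)
```

With these stubs the run ends in `ready_for_review` with the soft failure listed for the reviewer. Changing the failed check to `Severity.HARD` would return `blocked` instead.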

These workflows have already shipped real contributions, including new kernels, model integrations, and performance optimizations.

Automating Kernel Creation: liger-kernel-dev

The liger-kernel-dev agent converts PyTorch operations into optimized Triton kernels. It classifies operations into complexity tiers (element-wise, reduction, fused/complex) to guide downstream decisions like tiling strategy and memory management. Using existing Liger kernels as references ensures generated code follows proven patterns.
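
To make the tiering idea concrete, here is a hedged sketch of how an operation might be sorted into the three tiers. The source does not describe the agent's actual rules, so the op vocabularies, the `classify` heuristic, and the tiling hints below are all assumptions made for illustration.

```python
from enum import IntEnum


class Tier(IntEnum):
    ELEMENTWISE = 1
    REDUCTION = 2
    FUSED = 3


# Hypothetical op vocabularies, chosen only for this example.
ELEMENTWISE_OPS = {"add", "mul", "relu", "sigmoid", "square", "exp"}
REDUCTION_OPS = {"sum", "mean", "max", "rms_norm", "layer_norm"}

# Hypothetical downstream hints keyed by tier.
TILING_HINT = {
    Tier.ELEMENTWISE: "1D blocks over the flattened tensor",
    Tier.REDUCTION: "one program per row, reduce along the last dimension",
    Tier.FUSED: "needs a custom tiling and memory plan",
}


def classify(ops):
    """Assign a tier to an operation described as a list of primitive op names."""
    if not ops:
        raise ValueError("expected at least one primitive op")
    known = ELEMENTWISE_OPS | REDUCTION_OPS
    unknown = [op for op in ops if op not in known]
    reductions = [op for op in ops if op in REDUCTION_OPS]
    # Anything unrecognized, or with several reductions, is treated conservatively.
    if unknown or len(reductions) > 1:
        return Tier.FUSED
    if len(reductions) == 1:
        return Tier.REDUCTION
    return Tier.ELEMENTWISE


for ops in (["relu", "square"], ["square", "mean", "mul"], ["matmul", "add"]):
    tier = classify(ops)
    print(ops, "->", tier.name, "|", TILING_HINT[tier])
```

The conservative default (unknown ops fall into the highest tier) is a design choice of this sketch: misclassifying a complex operation as simple would be the more costly mistake.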

A real-world example involved the ReLU Squared activation function. The agent classified it as Tier 1, generated all necessary files, and validated them. The resulting kernel achieved significant speedups: 1.9x for the forward pass and 3.2x for the backward pass, with a 37.5% memory reduction compared to PyTorch. This task, which would typically take days of expert effort, required only human review before merging.
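
For readers unfamiliar with the operation, ReLU Squared computes max(x, 0)^2, whose derivative is 2 * max(x, 0). The plain-Python reference below is not the generated Triton kernel. It only shows the math a forward and backward pass must reproduce, plus the kind of finite-difference check a correctness test might use. The function names are hypothetical.

```python
def relu_squared(x):
    # Forward: max(x, 0) squared.
    r = max(x, 0.0)
    return r * r


def relu_squared_grad(x, grad_out=1.0):
    # Backward: d/dx max(x, 0)^2 = 2 * max(x, 0), scaled by the upstream gradient.
    return 2.0 * max(x, 0.0) * grad_out


def finite_difference(f, x, eps=1e-6):
    # Central difference approximation of f'(x).
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


# Compare the analytic gradient against the numerical one at a few points.
max_error = max(
    abs(relu_squared_grad(x) - finite_difference(relu_squared, x))
    for x in (-2.0, -0.5, 0.5, 3.0)
)
print("largest gradient error:", max_error)
```

A fused kernel can compute both passes in a single read of the input, which is the kind of saving an element-wise Triton kernel can offer over composing separate framework ops.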

Adding Model Support: liger-autopatch

The liger-autopatch agent streamlines the addition of Liger optimization support for new HuggingFace Transformers models. Model integration is complex due to subtle architectural differences in normalization, casting, activation functions, and more, where any error can lead to silent numerical divergence.

This agent resolves 12 architectural decisions by analyzing model source code, capturing them in a structured profile for human review. It then generates or modifies up to 13 files, including convergence tests across multiple configurations.
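
A convergence test of the kind mentioned above can be pictured as comparing the loss curve of a patched model against an unpatched reference. The sketch below is an assumption about the general shape of such a check, not the project's test code: `first_divergence`, the tolerances, and the loss values are all made up for illustration.

```python
import math


def first_divergence(reference, patched, rel_tol=1e-3, abs_tol=1e-5):
    """Return the first step where the two loss curves disagree, or None."""
    if len(reference) != len(patched):
        raise ValueError("both runs must log the same number of steps")
    for step, (ref, new) in enumerate(zip(reference, patched)):
        if not math.isclose(ref, new, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None


# Illustrative per-step losses (invented numbers, not measured results).
reference_run = [2.310, 1.870, 1.520, 1.310]
faithful_patch = [2.3101, 1.8699, 1.5202, 1.3099]
drifting_patch = [2.3101, 1.8699, 1.5900, 1.4500]

print(first_divergence(reference_run, faithful_patch))
print(first_divergence(reference_run, drifting_patch))
```

Here the faithful run never diverges and the drifting run is caught at step 2. In practice the tolerances would likely depend on precision, with looser bounds for bfloat16 than for float32.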

The agent successfully added support for the Nemotron and Mistral models, both requiring only human review before merging with no manual code changes. Validation checks passed on H100 hardware.

Optimizing Existing Kernels: liger-kernel-perf

The liger-kernel-perf agent focuses on accelerating already functional kernels. This requires expertise in GPU profiling and hardware-specific bottlenecks.

It employs an autonomous optimization loop. The agent profiles kernels, detects GPU architecture, and classifies bottlenecks. It then generates versioned optimization variants, starting with parameter tuning and progressing to techniques like register pressure reduction or memory coalescing. Learning accumulates across iterations, with guardrails preventing regressions.
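
The loop and its guardrails can be sketched as follows. This is a simplified model under stated assumptions: real variants would be benchmarked on a GPU, whereas here each hypothetical `Variant` carries a pre-measured latency and a correctness flag, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    latency_ms: float
    correct: bool


def optimize(baseline_ms, candidates, min_gain=0.02):
    """Keep the fastest correct variant, rejecting regressions and noise-level gains."""
    best_name, best_ms = "baseline", baseline_ms
    history = []  # accumulated record of what was tried and whether it was kept
    for v in candidates:
        # Guardrails: the variant must be numerically correct and must beat
        # the current best by at least min_gain (relative).
        accepted = v.correct and v.latency_ms < best_ms * (1.0 - min_gain)
        history.append((v.name, v.latency_ms, accepted))
        if accepted:
            best_name, best_ms = v.name, v.latency_ms
    return best_name, best_ms, history


# Invented measurements, ordered from parameter tuning to deeper changes.
candidates = [
    Variant("v1_block_size", 8.0, correct=True),
    Variant("v2_num_warps", 8.5, correct=True),         # slower than v1: rejected
    Variant("v3_coalesced_loads", 6.0, correct=False),  # fast but wrong: rejected
    Variant("v4_fewer_registers", 5.0, correct=True),
]
best_name, best_ms, history = optimize(10.0, candidates)
print(best_name, best_ms)
for entry in history:
    print(entry)
```

Comparing each candidate against the current best rather than the original baseline is what prevents a later variant from silently undoing an earlier gain.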

For the fused_add_rms_norm backward kernel, the agent diagnosed and addressed severe underutilization on an H100 GPU. It applied four targeted optimizations, resulting in a 3.35x backward speedup at a hidden dimension of 16384 and a 59% full-pass speedup with no memory impact. The result shows that agents can handle complex performance tuning, improving on the performance of standard PyTorch GPU kernels.

Internal Integration with torch.compile

Beyond open-source contributions, LinkedIn integrates agent-generated or optimized kernels directly into its training infrastructure using a custom compiler-based selection library. This library extends torch.compile to automatically identify operations for fusion, select the best kernel from a registry, and replace operations via custom graph passes.
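
The source does not show the selection library, so the following is only a sketch of the "select the best kernel from a registry" step. `KernelEntry`, `KernelRegistry`, the kernel names, and the latency estimates are hypothetical. In a real integration the lookup would be called from a graph pass over the captured graph, which is omitted here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class KernelEntry:
    name: str
    op: str                            # the operation pattern this kernel replaces
    supports: Callable[[dict], bool]   # predicate over call-site metadata
    est_latency_ms: float              # assumed to come from offline benchmarks


class KernelRegistry:
    def __init__(self):
        self._entries = []

    def register(self, entry):
        self._entries.append(entry)

    def select(self, op, meta) -> Optional[KernelEntry]:
        # Among kernels that match the op and support this call site,
        # pick the one with the lowest estimated latency.
        eligible = [e for e in self._entries if e.op == op and e.supports(meta)]
        return min(eligible, key=lambda e: e.est_latency_ms, default=None)


registry = KernelRegistry()
registry.register(KernelEntry(
    "rms_norm_triton", "rms_norm",
    supports=lambda m: m.get("dtype") in {"bfloat16", "float16"},
    est_latency_ms=0.4,
))
registry.register(KernelEntry(
    "rms_norm_eager", "rms_norm",
    supports=lambda m: True,
    est_latency_ms=1.0,
))

print(registry.select("rms_norm", {"dtype": "bfloat16"}).name)
print(registry.select("rms_norm", {"dtype": "float64"}).name)
print(registry.select("mean_pool", {"dtype": "bfloat16"}))
```

Returning `None` when nothing matches lets the caller leave the original operation in place, so an incomplete registry degrades to default behavior rather than failing.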

One notable internal result is a batched partitioned mean pooling kernel for a recommendation model, which reduced encoder step time by 10x (400ms to 40ms) and average training step time by roughly 3x (1.12s to 0.39s), saving 64.7% of GPU hours.
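
The quoted ratios can be cross-checked from the reported timings alone. The short calculation below uses only the numbers in the paragraph above, and the variable names are illustrative.

```python
encoder_before_ms, encoder_after_ms = 400.0, 40.0
step_before_s, step_after_s = 1.12, 0.39

encoder_speedup = encoder_before_ms / encoder_after_ms  # exactly 10x
step_speedup = step_before_s / step_after_s             # about 2.87x
step_time_saved = 1.0 - step_after_s / step_before_s    # about 65.2%

print(f"encoder speedup: {encoder_speedup:.1f}x")
print(f"training step speedup: {step_speedup:.2f}x")
print(f"step time saved: {step_time_saved * 100:.1f}%")
```

The derived step-time reduction of about 65.2% is close to, but not identical to, the reported 64.7% GPU-hour saving. The source does not explain the gap. It may come from rounding in the quoted step times or from how GPU hours were accounted.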

These agentic workflows demonstrate a powerful paradigm shift in AI development, where AI not only models data but actively participates in building and optimizing the underlying infrastructure.

© 2026 StartupHub.ai. All rights reserved.