LinkedIn's Generative Recommender Speed-Up

LinkedIn engineers drastically improved Generative Recommender training efficiency, cutting GPU hours by up to 65% through system-level optimizations.

May 28 at 9:04 PM · 7 min read

Image: Abstract visualization of data flow and computational nodes representing AI training. Caption: System optimizations significantly enhanced Generative Recommender training efficiency at LinkedIn. Credit: LinkedIn Engineering

Visual TL;DR: Generative Recommender (GR) leads to Scaling Hurdles, which are addressed by System Optimizations. These include a Data Pipeline Overhaul, Compute Enhancements, and Training Lifecycle improvements. The result is a GPU Hours Cut, which the summary links to Increased Session Time.

Generative Recommender (GR): new model for richer user behavior understanding
Scaling Hurdles: advanced models present significant engineering challenges
System Optimizations: drastically improved training efficiency
Data Pipeline Overhaul: key part of efficiency improvements
Compute Enhancements: further boosted system performance
Training Lifecycle: improvements made to the entire process
GPU Hours Cut: reduced by up to 65%
Increased Session Time: tangible benefit of 2.10% increase


LinkedIn is pushing the boundaries of recommendation systems, moving beyond traditional models to embrace generative sequential architectures. This shift, exemplified by its Generative Recommender (GR), promises a more nuanced understanding of user behavior over time. However, scaling these advanced models presents significant engineering hurdles.

The move to GR, which models user activity as token sequences, offers richer long-context personalization than older Deep Learning Recommendation Models (DLRM). This upgrade was crucial as user interactions on the platform became more dynamic and sequence-driven. In LinkedIn Engineering's own production deployments, the GR system demonstrated tangible benefits, including a 2.10% increase in session time spent.

Traditional DLRMs focus on per-user activity, while GRs leverage a user's entire history as ordered token streams. This means GRs use a broader time window (360 days versus 90 days) and employ transformer-based architectures, which leads to larger models and more complex data handling.
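
As a rough illustration of the "ordered token stream" idea, the sketch below turns an interaction log into a chronologically ordered token list restricted to a lookback window. The event schema, the `action:item` token format, and the function name are assumptions made for illustration; the article does not describe LinkedIn's actual tokenization.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    # Hypothetical event schema; not taken from the article.
    timestamp: datetime
    action: str   # e.g. "click", "like"
    item_id: str  # e.g. a post identifier


def build_token_sequence(events: list[Event], now: datetime, window_days: int) -> list[str]:
    """Return tokens for events inside the lookback window, oldest first."""
    cutoff = now - timedelta(days=window_days)
    recent = [e for e in events if cutoff <= e.timestamp <= now]
    recent.sort(key=lambda e: e.timestamp)
    return [f"{e.action}:{e.item_id}" for e in recent]


now = datetime(2025, 1, 1)
events = [
    Event(now - timedelta(days=200), "click", "post_7"),
    Event(now - timedelta(days=10), "like", "post_9"),
    Event(now - timedelta(days=400), "click", "post_1"),
]
long_context = build_token_sequence(events, now, window_days=360)
short_context = build_token_sequence(events, now, window_days=90)
```

With the wider window the sequence keeps the 200-day-old click that the 90-day window drops, which is the sense in which a longer history gives the model more context to work with.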

Engineering Hurdles at Scale

Training these sophisticated GR models at LinkedIn's scale introduced unique challenges. Variable-length sequences and large embedding tables strained memory, while data ingestion faced I/O bottlenecks. Skewed sequence lengths led to compute waste, and the need for custom attention masks complicated efficient kernel implementations.

Furthermore, GRs required frequent retraining on the latest user data, a process made more complex by the shift to listwise data. This contrasted with the simpler incremental updates of older models.

System Optimizations Drive Efficiency

To tackle these issues, LinkedIn engineers implemented a suite of system-level optimizations. The primary goal was to improve Generative Recommender training efficiency without sacrificing model quality. Total GPU hours served as the key metric for success.

Data Pipeline Overhaul

Significant I/O bottlenecks were traced to the native data loader and row-level transformations. A custom C++ fused loader was developed to consolidate padding, truncating, packing, and batching into a single PyTorch operation. This reduced training step time by approximately 50%.
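
To make the idea of consolidating row-level transformations concrete, here is a minimal pure-Python sketch of truncating, padding, and batching in one call with a single output allocation. It is not LinkedIn's C++ loader: the function name, the choice to keep the most recent tokens when truncating, and the `pad_id` default are all assumptions, and packing is left out for brevity.

```python
def fused_collate(sequences: list[list[int]], max_len: int, pad_id: int = 0):
    """Truncate, pad, and batch raw sequences in a single call.

    Returns a (batch x width) token grid plus the true length of each row,
    where width is the longest truncated sequence in the batch.
    """
    lengths = [min(len(s), max_len) for s in sequences]
    width = max(lengths, default=0)
    # Allocate the output once, pre-filled with padding.
    batch = [[pad_id] * width for _ in sequences]
    for row, (seq, n) in enumerate(zip(sequences, lengths)):
        # Assumption: when truncating, keep the n most recent tokens.
        batch[row][:n] = seq[len(seq) - n:]
    return batch, lengths


batch, lengths = fused_collate([[1, 2, 3, 4, 5], [6, 7], [8]], max_len=3)
```

The point of the design is that no intermediate per-row objects are created between steps; a native implementation can apply the same idea while writing directly into a tensor buffer.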

Compute and Kernel Enhancements

Inefficiencies in attention kernels, particularly with dynamic sequence lengths and custom masks, were addressed by adopting FlashAttention-3 and FlexAttention. These advanced kernels minimize memory traffic and handle variable lengths more effectively. An in-house compiler backend automatically selects the optimal kernel for the runtime environment.
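
One common kind of custom mask arises when several sequences share a packed row: each token should attend only to earlier tokens of its own sequence. The sketch below expresses that rule as a predicate with the `(b, h, q_idx, kv_idx)` argument order that FlexAttention uses for mask functions. It is written in plain Python for readability; the specific mask LinkedIn uses is not described in the article, and `doc_ids` and the function names are illustrative assumptions.

```python
def make_document_causal_mask(doc_ids: list[int]):
    """Build a predicate for one packed row: attend only within the same
    sequence, and only to the current or earlier positions."""
    def mask_mod(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
        same_doc = doc_ids[q_idx] == doc_ids[kv_idx]
        causal = kv_idx <= q_idx
        return same_doc and causal
    return mask_mod


# Two sequences packed into one row: the first has 3 tokens, the second has 2.
doc_ids = [0, 0, 0, 1, 1]
mask_mod = make_document_causal_mask(doc_ids)
size = len(doc_ids)
dense_mask = [[mask_mod(0, 0, q, k) for k in range(size)] for q in range(size)]
```

In PyTorch the indices would be tensors and the predicate would use elementwise operators rather than `and`; the benefit of a predicate-based kernel is that the dense mask built here for inspection never needs to be materialized.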

This switch resulted in up to a 25% training speed increase for specific GR models. Metrics calculation, which previously added a 15% overhead to each step, was optimized with a fused custom CUDA kernel. The kernel cut end-to-end metric update time from milliseconds to microseconds, contributing to a 22% GPU hour saving.

Training Lifecycle Improvements

Optimizer performance was boosted by enabling the fused flag in Adam, consolidating CUDA kernel launches and fusing the GradScaler. This cut optimizer time by about 50%, yielding a 15% GPU hour saving for Feed GR training.

Fused embedding table lookups combined multiple small lookups into a single kernel, improving cache locality and reducing memory traffic. This yielded a 10% training time improvement for Ads GR.
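
The fused lookup can be pictured as concatenating the tables once and translating each (table, row) pair into an index in the combined table, so that a single gather replaces many small ones. The following is a simplified sketch under the assumption that all tables share one embedding dimension; the table names and helper functions are hypothetical, and a production kernel would operate on GPU memory rather than Python lists.

```python
def fuse_tables(tables: list[list[list[float]]]):
    """Concatenate several embedding tables into one, recording row offsets."""
    fused, offsets = [], []
    for table in tables:
        offsets.append(len(fused))  # first row of this table in the fused one
        fused.extend(table)
    return fused, offsets


def fused_lookup(fused, offsets, table_ids: list[int], row_ids: list[int]):
    """Resolve (table, row) pairs with a single gather over the fused table."""
    flat_ids = [offsets[t] + r for t, r in zip(table_ids, row_ids)]
    return [fused[i] for i in flat_ids]


member_table = [[0.1, 0.2], [0.3, 0.4]]
item_table = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
fused, offsets = fuse_tables([member_table, item_table])
vectors = fused_lookup(fused, offsets, table_ids=[0, 1, 1], row_ids=[1, 0, 2])
```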

Evaluation, traditionally interleaved with training, was parallelized. By saving all checkpoints and evaluating them post-training, LinkedIn achieved a 16% reduction in GPU hours for Feed GR training.
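
The pattern of saving checkpoints during training and evaluating them afterwards can be sketched as follows. Everything here is a stand-in: the toy training loop, the placeholder metric, and the thread pool (which substitutes for whatever separate workers a real system would use) are assumptions, not details from the article.

```python
from concurrent.futures import ThreadPoolExecutor


def train(num_steps: int, checkpoint_every: int) -> list[dict]:
    """Toy training loop that saves checkpoints and never pauses to evaluate."""
    checkpoints = []
    weight = 0.0
    for step in range(1, num_steps + 1):
        weight += 0.1  # placeholder for a real optimizer step
        if step % checkpoint_every == 0:
            checkpoints.append({"step": step, "weight": weight})
    return checkpoints


def evaluate(checkpoint: dict) -> tuple[int, float]:
    """Placeholder metric: distance of the toy weight from a target of 1.0."""
    return checkpoint["step"], abs(1.0 - checkpoint["weight"])


checkpoints = train(num_steps=20, checkpoint_every=5)

# Evaluate all saved checkpoints concurrently once training has finished.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, checkpoints))

best_step, best_error = min(results, key=lambda r: r[1])
```

Because the training loop no longer waits on evaluation, the training accelerators stay busy, and checkpoint selection happens in a separate pass.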

The handling of variable sequence lengths was fundamentally improved through packed sequences, which significantly reduced padding ratios and the associated compute and memory waste. This led to a GPU hour reduction of over 30% and a GPU memory reduction of 40% for Feed GR training. Dynamic batching also offered a GPU time reduction of more than 50% by grouping similar sequence lengths before padding.
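
A toy calculation shows why skewed lengths waste compute and how grouping or packing helps. The sketch below compares the padding ratio of arrival-order batches, length-sorted batches, and greedily packed rows. The sequence lengths, batch size, row capacity, and the first-fit-decreasing heuristic are all illustrative assumptions; the resulting ratios are properties of this toy data, not LinkedIn's measurements.

```python
def padding_ratio(batches: list[list[int]]) -> float:
    """Fraction of token slots that are padding when every batch is padded
    to its own longest sequence. Each batch is a list of sequence lengths."""
    real = sum(sum(b) for b in batches)
    slots = sum(len(b) * max(b) for b in batches)
    return 1 - real / slots


def chunk(lengths: list[int], batch_size: int) -> list[list[int]]:
    """Split lengths into consecutive batches of at most batch_size."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def pack_first_fit(lengths: list[int], capacity: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing into rows of fixed capacity."""
    rows: list[list[int]] = []
    for n in sorted(lengths, reverse=True):
        for row in rows:
            if sum(row) + n <= capacity:
                row.append(n)
                break
        else:
            rows.append([n])  # no existing row had room
    return rows


lengths = [512, 16, 480, 24, 500, 8, 32, 470]  # skewed toy lengths
capacity = 512

naive = chunk(lengths, batch_size=4)             # arrival order
bucketed = chunk(sorted(lengths), batch_size=4)  # similar lengths together
packed = pack_first_fit(lengths, capacity)

naive_ratio = padding_ratio(naive)
bucketed_ratio = padding_ratio(bucketed)
packed_ratio = 1 - sum(lengths) / (len(packed) * capacity)
```

On this toy data, roughly half of the slots in the arrival-order batches are padding, sorting by length brings that down to a few percent, and packing leaves the rows almost full. Packed rows do require a mask that keeps sequences from attending to each other.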

Collectively, these system optimizations reduced end-to-end GPU hours by up to 65% in internal production workloads, demonstrating a powerful approach to scaling advanced recommendation systems.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.
