Together AI Slashes RL Training Time

Together AI's new distribution-aware speculative decoding cuts RL rollout time by up to 50%, tackling a major bottleneck in LLM post-training.

Image: Together AI's new technique speeds up LLM training processes. (Together AI)

Large language model training is getting a speed boost. Together AI has unveiled distribution-aware speculative decoding (DAS), a new framework designed to drastically cut the time spent on reinforcement learning (RL) post-training.

RL fine-tuning has become critical for enhancing LLM reasoning, but the rollout phase—where models generate responses for training—presents a significant bottleneck. This process can consume up to 70% of total training time, primarily due to the long-tail nature of response generation, where a few slow generations can delay the entire batch and leave expensive GPUs idle.
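The cost of that long tail comes from synchronous batching: the whole batch waits for its slowest generation. A minimal sketch (with hypothetical, Pareto-distributed generation times) makes the idle-GPU effect concrete:

```python
import random

def batch_rollout_time(gen_times):
    """Synchronous batched rollout: the batch finishes only when the
    slowest generation does, so fast finishers leave GPUs idle."""
    wall = max(gen_times)                     # wall-clock time for the batch
    idle = sum(wall - t for t in gen_times)   # total GPU-seconds spent waiting
    return wall, idle

# Hypothetical long-tailed generation times (seconds) for one batch.
random.seed(0)
times = [random.paretovariate(2.0) for _ in range(8)]
wall, idle = batch_rollout_time(times)
```

With a heavy-tailed distribution, a single outlier sets the wall-clock time while every other worker sits idle, which is exactly the waste DAS targets.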


Tackling the Rollout Bottleneck

DAS directly addresses this by optimizing the rollout process. It achieves up to a 50% speedup in RL rollouts, a crucial improvement for large-scale AI training.

The framework cleverly exploits two key properties of RL rollouts: the reuse of prompts across training epochs and the long-tail distribution of generation times. Unlike standard inference, RL training revisits the same prompts repeatedly, providing a rich history that DAS can leverage.

DAS employs an adaptive suffix tree drafter that learns from recent rollouts, allowing it to stay synchronized with the evolving model weights without requiring constant retraining. This training-free approach continuously adapts to the changing policy.
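To see how a training-free drafter can track a moving policy, here is a minimal sketch: an n-gram suffix table (an illustrative stand-in for DAS's adaptive suffix tree, not Together AI's implementation) is updated online from finished rollouts and proposes draft tokens by longest-suffix matching:

```python
from collections import defaultdict, Counter

class SuffixDrafter:
    """Training-free drafter sketch: a suffix table built from recent
    rollouts proposes draft tokens for speculative decoding."""

    def __init__(self, max_order=4):
        self.max_order = max_order
        self.table = defaultdict(Counter)  # suffix tuple -> next-token counts

    def observe(self, tokens):
        # Update counts from a finished rollout, so the drafter tracks
        # the evolving policy without any gradient-based retraining.
        for n in range(1, self.max_order + 1):
            for i in range(len(tokens) - n):
                self.table[tuple(tokens[i:i + n])][tokens[i + n]] += 1

    def draft(self, context, k=4):
        # Propose up to k tokens by matching the longest known suffix of
        # the context and taking its most frequent successor each step.
        out, ctx = [], list(context)
        for _ in range(k):
            nxt = None
            for n in range(min(self.max_order, len(ctx)), 0, -1):
                counts = self.table.get(tuple(ctx[-n:]))
                if counts:
                    nxt = counts.most_common(1)[0][0]
                    break
            if nxt is None:
                break
            out.append(nxt)
            ctx.append(nxt)
        return out

drafter = SuffixDrafter()
drafter.observe(["the", "answer", "is", "42", "."])
proposal = drafter.draft(["the", "answer"], k=3)  # -> ["is", "42", "."]
```

The target model then verifies the drafted tokens in a single forward pass, accepting the matching prefix; because the table is rebuilt from recent rollouts, repeated prompts across epochs keep it well aligned with the current weights.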

Intelligent Scheduling for Efficiency

Complementing the drafter is a length-aware scheduling strategy. This system balances the workload across GPUs, preventing long generations from monopolizing resources. It also dynamically allocates speculation budgets within each GPU, granting longer requests larger budgets early in generation to avoid costly late-stage stragglers.
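The cross-GPU balancing idea can be sketched with a classic greedy heuristic (longest-first onto the least-loaded GPU); the request names and lengths below are hypothetical, and this is an illustration of the balancing principle rather than Together AI's actual scheduler:

```python
import heapq

def assign_rollouts(predicted_lengths, n_gpus):
    """Greedy length-aware load balancing: place each request on the
    currently least-loaded GPU, longest requests first, so no single
    GPU ends up holding all of the long generations."""
    heap = [(0.0, g) for g in range(n_gpus)]  # (current load, gpu id)
    heapq.heapify(heap)
    placement = {}
    for req, length in sorted(predicted_lengths.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)
        placement[req] = gpu
        heapq.heappush(heap, (load + length, gpu))
    makespan = max(load for load, _ in heap)  # slowest GPU's total work
    return placement, makespan

# Hypothetical predicted generation lengths (tokens) for six rollouts.
lengths = {"r0": 900, "r1": 850, "r2": 120, "r3": 100, "r4": 90, "r5": 60}
placement, makespan = assign_rollouts(lengths, 2)
```

Splitting the two long requests across GPUs yields a makespan of 1060 tokens of work per GPU here, versus 1750 if both landed on one device; naive round-robin placement can easily produce the latter.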

Experimental results on math reasoning and code generation tasks demonstrate that DAS achieves its speedup without any degradation in reward quality, preserving the training signal entirely.

This technique offers a compelling solution for cutting compute costs in RL post-training. As models grow larger and tasks more complex, the rollout bottleneck will only intensify, making DAS a valuable tool for practitioners seeking efficiency without sacrificing performance.

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.