Together AI Supercharges LLM Inference

Together AI unveils ATLAS, accelerating LLM inference up to 4x with adaptive speculative decoding, tackling the growing cost challenge for AI-native companies.

Together AI's advancements in LLM inference aim to make AI models run faster and more efficiently. (Image: Together AI)

The AI industry’s focus is rapidly shifting from model training to efficient, scalable inference. For AI-native companies, inference costs represent the vast majority of lifetime expenses, shaping unit economics and product viability.

This is where Together AI is doubling down. The company is leveraging its foundational research to accelerate LLM inference, announcing ATLAS, a system that uses runtime-learning accelerators to deliver speedups of up to 4x. This adaptive speculative decoding approach learns from live traffic, outperforming static methods.

The Inference Imperative

Jensen Huang, NVIDIA CEO, highlighted that users pay for work, not just information, underscoring the shift towards agentic systems that demand reliable, low-latency inference. This makes inference a complex optimization challenge, balancing latency, throughput, model evolution, and concurrency.

Together AI’s approach tackles these challenges through a compounded stack of research, systems engineering, and hardware expertise.

Research That Ships

The company's research arm has yielded significant advancements, including FlashAttention-4, which delivers up to 1.3x faster inference performance on NVIDIA Blackwell GPUs compared to cuDNN. These innovations are integrated swiftly into its production systems.

Adaptive speculative decoding, exemplified by ATLAS and Aurora, is key. While standard speculative decoding offers 1.5-3x speedups, Together AI's systems learn and adapt in real time, which is crucial for unpredictable production workloads.
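For readers unfamiliar with the technique, the sketch below illustrates the core idea of speculative decoding in its simplest (greedy) form: a cheap draft model proposes several tokens, and the large target model verifies them all in a single forward pass. The toy numpy "models" and function names here are illustrative assumptions, not Together AI's ATLAS implementation, which additionally adapts its drafting to live traffic.

```python
import numpy as np

# Toy stand-ins for a large target model and a cheap draft model.
VOCAB, DIM = 50, 16
rng = np.random.default_rng(0)
EMBED = rng.normal(size=(VOCAB, DIM))
W_TARGET = rng.normal(size=(DIM, VOCAB))
W_DRAFT = W_TARGET + 0.05 * rng.normal(size=(DIM, VOCAB))  # draft ~ noisy copy of target

def target_logits(tokens):
    """'Large' target model: next-token logits for every position in `tokens`."""
    return EMBED[tokens] @ W_TARGET

def draft_next(tokens, k):
    """'Small' draft model: greedily propose k candidate tokens after `tokens`."""
    out = list(tokens)
    for _ in range(k):
        out.append(int(np.argmax(EMBED[out[-1]] @ W_DRAFT)))
    return out[len(tokens):]

def speculative_step(prefix, k=4):
    """One decoding step: k cheap guesses, verified by a single target pass."""
    draft = draft_next(prefix, k)
    logits = target_logits(prefix + draft)    # one pass scores all k draft positions
    accepted = []
    for i, tok in enumerate(draft):
        pred = int(np.argmax(logits[len(prefix) + i - 1]))
        accepted.append(pred)                 # always emit the target's own choice
        if pred != tok:                       # first disagreement ends the free run
            break
    return prefix + accepted                  # up to k tokens per expensive target pass

print(speculative_step([1, 2, 3], k=4))
```

The better the draft model matches the target, the more tokens survive verification per expensive pass; that is exactly the quantity an adaptive system can improve by learning from live traffic.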

Full-Stack Optimization

Running on the latest NVIDIA hardware, like the GB200 NVL72, requires custom parallelism strategies and advanced quantization techniques. Together AI builds these full-stack solutions, enabling rapid deployment and handling strict latency Service Level Agreements (SLAs).
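As a simplified illustration of what quantization buys, the snippet below applies per-tensor symmetric int8 quantization to a weight matrix: the weights shrink roughly 4x relative to fp32 at the cost of a small reconstruction error. This is a generic textbook scheme offered for context; the article does not specify which quantization techniques Together AI uses.

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 plus a scale factor (roughly 4x smaller than fp32)."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, s = quantize_int8(w)
print(np.max(np.abs(w - dequantize(q, s))))   # small quantization error
```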

Intelligent scheduling and dynamic batching are also critical. Together AI's inference engine optimizes request routing and batching in real time to maximize GPU utilization without compromising user experience.
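To make "dynamic batching" concrete, here is a minimal scheduler sketch: requests queue up and are dispatched to the GPU together either when the batch is full or when the oldest request is about to exceed a latency budget. The thresholds and the `run_batch` stand-in are assumptions for illustration, not Together AI's actual inference engine.

```python
import time
from collections import deque

MAX_BATCH = 8          # cap batch size to bound per-request latency
MAX_WAIT_MS = 5        # never hold a request longer than this to fill a batch

queue = deque()        # (arrival_time, request) pairs

def run_batch(requests):
    """Stand-in for one batched forward pass on the GPU."""
    return [f"output for {r}" for r in requests]

def schedule_once(now=None):
    """Dispatch a batch if it is full or its oldest request is aging out."""
    if not queue:
        return []
    now = time.monotonic() if now is None else now
    oldest_age_ms = (now - queue[0][0]) * 1000
    if len(queue) >= MAX_BATCH or oldest_age_ms >= MAX_WAIT_MS:
        batch = [queue.popleft()[1] for _ in range(min(MAX_BATCH, len(queue)))]
        return run_batch(batch)
    return []          # keep waiting: a fuller batch improves GPU utilization

# Usage: enqueue requests as they arrive, call schedule_once() in a serving loop.
queue.extend((time.monotonic(), f"req-{i}") for i in range(3))
time.sleep(0.01)
print(schedule_once())
```

The tension the thresholds encode is the same one the article describes: larger batches raise throughput and GPU utilization, while the wait cap keeps tail latency within SLA.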

Economics of Efficiency

Inference costs have plummeted, but total spend is rising as AI adoption expands. Together AI focuses on optimizing the entire hardware and software stack to improve profitability for customers.

This efficiency translates directly to business growth, enabling more customers and previously unviable use cases.

Together AI positions itself as the AI Native Cloud, offering a comprehensive platform for serverless and dedicated inference, accelerated compute, and model shaping.
