Together AI Supercharges LLM Inference

Together AI unveils ATLAS, accelerating LLM inference up to 4x with adaptive speculative decoding, tackling the growing cost challenge for AI-native companies.

Together AI's advancements in LLM inference aim to make AI models run faster and more efficiently. (Image: Together AI)

The AI industry’s focus is rapidly shifting from model training to efficient, scalable inference. For AI-native companies, inference costs represent the vast majority of lifetime expenses, shaping unit economics and product viability.

This is where Together AI is doubling down. The company is leveraging its foundational research to accelerate LLM inference, announcing ATLAS, a system that uses runtime-learning accelerators to deliver speedups of up to 4x. This adaptive speculative decoding approach learns from live traffic, outperforming static methods.

The Inference Imperative

Jensen Huang, NVIDIA CEO, highlighted that users pay for work, not just information, underscoring the shift towards agentic systems that demand reliable, low-latency inference. This makes inference a complex optimization challenge, balancing latency, throughput, model evolution, and concurrency.

Together AI’s approach tackles these challenges through a compounded stack of research, systems engineering, and hardware expertise.

Research That Ships

The company's research arm has yielded significant advancements, including FlashAttention-4, which delivers up to 1.3x faster inference performance on NVIDIA Blackwell GPUs compared to cuDNN. These innovations are integrated swiftly into its production systems.

Adaptive speculative decoding, exemplified by ATLAS and Aurora, is key. While standard speculative decoding offers 1.5-3x speedups, Together AI's systems learn and adapt in real time, which is crucial for unpredictable production workloads.
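For readers unfamiliar with the technique, the sketch below illustrates the core idea of speculative decoding in its simplest (greedy) form: a cheap draft model proposes several tokens, and the large target model verifies them all in a single forward pass. The toy numpy "models" and function names here are illustrative assumptions, not Together AI's ATLAS implementation, which additionally adapts its drafting to live traffic.

```python
import numpy as np

# Toy stand-ins for a large target model and a cheap draft model.
VOCAB, DIM = 50, 16
rng = np.random.default_rng(0)
EMBED = rng.normal(size=(VOCAB, DIM))
W_TARGET = rng.normal(size=(DIM, VOCAB))
W_DRAFT = W_TARGET + 0.05 * rng.normal(size=(DIM, VOCAB))  # draft ~ noisy copy of target

def target_logits(tokens):
    """'Large' target model: next-token logits for every position in `tokens`."""
    return EMBED[tokens] @ W_TARGET

def draft_next(tokens, k):
    """'Small' draft model: greedily propose k candidate tokens after `tokens`."""
    out = list(tokens)
    for _ in range(k):
        out.append(int(np.argmax(EMBED[out[-1]] @ W_DRAFT)))
    return out[len(tokens):]

def speculative_step(prefix, k=4):
    """One decoding step: k cheap guesses, verified by a single target pass."""
    draft = draft_next(prefix, k)
    logits = target_logits(prefix + draft)    # one pass scores all k draft positions
    accepted = []
    for i, tok in enumerate(draft):
        pred = int(np.argmax(logits[len(prefix) + i - 1]))
        accepted.append(pred)                 # always emit the target's own choice
        if pred != tok:                       # first disagreement ends the free run
            break
    return prefix + accepted                  # up to k tokens per expensive target pass

print(speculative_step([1, 2, 3], k=4))
```

The better the draft model matches the target, the more tokens survive verification per expensive pass; that is exactly the quantity an adaptive system can improve by learning from live traffic.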

Full-Stack Optimization

Running on the latest NVIDIA hardware, like the GB200 NVL72, requires custom parallelism strategies and advanced quantization techniques. Together AI builds these full-stack solutions, enabling rapid deployment and handling strict latency Service Level Agreements (SLAs).
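As a simplified illustration of what quantization buys, the snippet below applies per-tensor symmetric int8 quantization to a weight matrix: the weights shrink roughly 4x relative to fp32 at the cost of a small reconstruction error. This is a generic textbook scheme offered for context; the article does not specify which quantization techniques Together AI uses.

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 plus a scale factor (roughly 4x smaller than fp32)."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights at inference time."""
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, s = quantize_int8(w)
print(np.max(np.abs(w - dequantize(q, s))))   # small quantization error
```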

Intelligent scheduling and dynamic batching are also critical. Together AI's inference engine optimizes request routing and batching in real time to maximize GPU utilization without compromising user experience.
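To make "dynamic batching" concrete, here is a minimal scheduler sketch: requests queue up and are dispatched to the GPU together either when the batch is full or when the oldest request is about to exceed a latency budget. The thresholds and the `run_batch` stand-in are assumptions for illustration, not Together AI's actual inference engine.

```python
import time
from collections import deque

MAX_BATCH = 8          # cap batch size to bound per-request latency
MAX_WAIT_MS = 5        # never hold a request longer than this to fill a batch

queue = deque()        # (arrival_time, request) pairs

def run_batch(requests):
    """Stand-in for one batched forward pass on the GPU."""
    return [f"output for {r}" for r in requests]

def schedule_once(now=None):
    """Dispatch a batch if it is full or its oldest request is aging out."""
    if not queue:
        return []
    now = time.monotonic() if now is None else now
    oldest_age_ms = (now - queue[0][0]) * 1000
    if len(queue) >= MAX_BATCH or oldest_age_ms >= MAX_WAIT_MS:
        batch = [queue.popleft()[1] for _ in range(min(MAX_BATCH, len(queue)))]
        return run_batch(batch)
    return []          # keep waiting: a fuller batch improves GPU utilization

# Usage: enqueue requests as they arrive, call schedule_once() in a serving loop.
queue.extend((time.monotonic(), f"req-{i}") for i in range(3))
time.sleep(0.01)
print(schedule_once())
```

The tension the thresholds encode is the same one the article describes: larger batches raise throughput and GPU utilization, while the wait cap keeps tail latency within SLA.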

Economics of Efficiency

Inference costs have plummeted, but total spend is rising as AI adoption expands. Together AI focuses on optimizing the entire hardware and software stack to improve profitability for customers.

This efficiency translates directly to business growth, enabling more customers and previously unviable use cases.

Together AI positions itself as the AI Native Cloud, offering a comprehensive platform for serverless and dedicated inference, accelerated compute, and model shaping.
