Running large language models efficiently in production demands constant optimization, but traditional methods often fall short. Speculative decoding, which speeds up inference by having a small draft model propose tokens that the larger target model then verifies, frequently underperforms when the draft model goes stale and can't keep pace with shifts in live traffic. Together AI aims to solve this with its new open-source framework, Aurora.
Aurora is built on a reinforcement learning (RL) foundation, enabling it to learn directly from live inference traces and continuously update its draft models without interrupting service. This creates a self-improving flywheel for LLM inference optimization.
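To make the staleness problem concrete, here is a minimal toy sketch in Python. It is not Aurora's code or API. It relies only on the standard speculative sampling result that a drafted token is accepted with probability equal to the sum over tokens of min(p, q), where p is the target model's distribution and q is the draft's. The function names, the three-token vocabulary, and the simple interpolation update (a stand-in for whatever RL update Aurora actually performs) are all assumptions made for illustration.

```python
def acceptance_rate(target, draft):
    """Expected chance that one drafted token is accepted under
    standard speculative sampling: sum over tokens of min(p, q)."""
    return sum(min(p, q) for p, q in zip(target, draft))


def update_draft(draft, observed_target, lr=0.5):
    """Hypothetical online update (NOT Aurora's algorithm): nudge the
    draft distribution toward the target distribution seen in live traces."""
    mixed = [(1 - lr) * q + lr * p for p, q in zip(observed_target, draft)]
    total = sum(mixed)
    return [m / total for m in mixed]  # renormalize to guard against float drift


# Toy next-token distributions over a three-token vocabulary (made-up numbers).
target = [0.7, 0.2, 0.1]  # what the large model produces on current traffic
stale = [0.1, 0.2, 0.7]   # a draft model tuned for older traffic

before = acceptance_rate(target, stale)

draft = stale
for _ in range(3):  # three rounds of learning from observed traces
    draft = update_draft(draft, target)

after = acceptance_rate(target, draft)
print(f"acceptance before: {before:.3f}, after: {after:.3f}")
```

In this toy setup the acceptance rate rises as the draft distribution moves toward the target, which is the effect a continuously updated draft model is meant to deliver: more accepted draft tokens per verification step, and therefore faster decoding.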
