Traditional inference benchmarks often fail to reflect how AI systems behave in production. Together AI has released a new benchmark designed to stress-test large language models (LLMs) under the demanding conditions of coding agent workloads. It measures performance not only at peak but also under sustained, high-traffic load.
The core challenge is simulating how dozens or hundreds of concurrent requests interact as they compete for shared resources such as KV cache, memory bandwidth, and GPU cycles. What matters most is how much each user's performance degrades as the system approaches its limits.
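To make the idea of per-user degradation concrete, here is a minimal sketch of a toy model, not the benchmark's actual methodology. It assumes a server with a fixed aggregate decode throughput that is split evenly among active requests, with each request capped at a single-stream rate. The function names and the throughput figures are hypothetical and chosen only for illustration.

```python
def per_user_tokens_per_sec(concurrency: int,
                            aggregate_tps: float,
                            single_stream_tps: float) -> float:
    """Toy model: decode throughput one user sees at a given concurrency."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    # Below saturation, each user gets the full single-stream rate.
    # Past saturation, the aggregate capacity is divided evenly.
    return min(single_stream_tps, aggregate_tps / concurrency)


def sweep(levels, aggregate_tps: float, single_stream_tps: float) -> dict:
    """Evaluate the toy model at each concurrency level."""
    return {
        n: per_user_tokens_per_sec(n, aggregate_tps, single_stream_tps)
        for n in levels
    }


if __name__ == "__main__":
    # Assumed figures: 2000 tokens/s aggregate, 100 tokens/s per stream.
    for n, tps in sweep([1, 20, 50, 200], 2000.0, 100.0).items():
        print(f"{n:>4} concurrent requests: {tps:.1f} tokens/s per user")
```

Under these assumed numbers, every user keeps the full single-stream rate up to 20 concurrent requests, after which per-user throughput falls in proportion to the load. A real system degrades less evenly, which is why measuring it under realistic traffic matters.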
The Coding Agent Workload
Coding agent requests carry large input contexts, often tens of thousands of tokens, made up of files, conversation history, and retrieved code snippets. Output lengths are typically bounded, but the sheer number of concurrent requests still puts significant pressure on the serving system.
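A rough back-of-the-envelope calculation shows why long contexts strain the KV cache. The sketch below uses the standard estimate of one key vector and one value vector per layer per KV head for each token. The model dimensions, the 32,768-token context, and the 40 GiB cache budget are all hypothetical values picked for illustration. They do not describe any model or hardware used in the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions, for illustration only."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for fp16 or bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Factor of 2: one key and one value vector per layer per KV head.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_value)


def max_concurrent_requests(shape: ModelShape,
                            context_tokens: int,
                            kv_budget_bytes: int) -> int:
    """How many full-length contexts fit in the cache budget at once."""
    per_request = kv_bytes_per_token(shape) * context_tokens
    return kv_budget_bytes // per_request


# Assumed example configuration (not taken from the benchmark).
shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128,
                   bytes_per_value=2)
GIB = 2 ** 30

if __name__ == "__main__":
    print(kv_bytes_per_token(shape))
    print(max_concurrent_requests(shape, 32_768, 40 * GIB))
```

With these assumed dimensions, each token occupies 128 KiB of cache, so a single 32,768-token context needs 4 GiB and a 40 GiB budget holds only ten such requests at once. This ignores optimizations such as prefix sharing or cache quantization, which change the numbers but not the basic pressure.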
