Current LLM serving systems optimize for throughput with paged or radix KV caches, which manage only one fragment of execution state: the KV cache. That falls short in low-latency, small-batch, on-device physical-AI serving, where interactive agents, speech systems, and robotic policies must branch, reset, and re-enter frequently under strict responsiveness budgets. An arXiv preprint introduces a solution aimed at this opposite regime.
Execution-State Capsules: A Granular Checkpoint for Dynamic AI
The core idea is the execution-state capsule, a graph-bound checkpoint and restore mechanism that captures the complete restorable state at committed boundaries. Where previous methods reuse token-level KV cache fragments, this approach treats the entire execution context as a single restorable unit. FlashRT, the runtime that implements it, is a white-box kernel runtime whose NVIDIA CUDA backend executes captured graph plans over static buffers, eliminating indirection and enabling efficient state management. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the entire execution boundary: KV cache, recurrent state, convolution state, MTP state, and metadata. Reuse therefore shifts from token-addressed fragments to graph-bound execution-state boundaries.
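To make the snapshot, restore, and fork operations concrete, here is a minimal host-side sketch in Python. It is not FlashRT's API: the `Session` and `Capsule` classes, the `graph_id` field, and the buffer sizes are all hypothetical, and plain `bytearray` objects stand in for the static GPU buffers the preprint describes. The only idea it borrows from the article is that the live state is a closed set of named, fixed-size buffers, so a checkpoint can copy all of them and a restore can write them back in place.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Capsule:
    """Hypothetical capsule: a frozen copy of every live buffer at one boundary."""
    graph_id: str               # assumed tag tying the capsule to one captured graph
    buffers: Dict[str, bytes]   # buffer name -> immutable contents


class Session:
    """Toy stand-in for a runtime whose live state is a closed set of named buffers."""

    def __init__(self, graph_id: str, sizes: Dict[str, int]):
        self.graph_id = graph_id
        # Fixed-size buffers, allocated once; bytearray stands in for device memory.
        self.buffers = {name: bytearray(n) for name, n in sizes.items()}

    def snapshot(self) -> Capsule:
        # Copy every buffer, not only the KV cache.
        return Capsule(self.graph_id, {k: bytes(v) for k, v in self.buffers.items()})

    def restore(self, cap: Capsule) -> None:
        if cap.graph_id != self.graph_id:
            raise ValueError("capsule is bound to a different graph")
        if cap.buffers.keys() != self.buffers.keys():
            raise ValueError("capsule does not cover the same buffer set")
        for name, data in cap.buffers.items():
            if len(data) != len(self.buffers[name]):
                raise ValueError(f"size mismatch for buffer {name!r}")
            # Overwrite in place so the buffer object (its "address") is unchanged.
            self.buffers[name][:] = data

    def fork(self, cap: Capsule) -> "Session":
        # A fork is a fresh session initialized from the same capsule.
        child = Session(self.graph_id, {k: len(v) for k, v in cap.buffers.items()})
        child.restore(cap)
        return child


# Usage: checkpoint, do speculative work, then roll back.
sizes = {"kv_cache": 8, "recurrent_state": 4, "conv_state": 4,
         "mtp_state": 2, "metadata": 2}
session = Session("decode_graph", sizes)
session.buffers["kv_cache"][0] = 7
kv_buffer_before = session.buffers["kv_cache"]

capsule = session.snapshot()
session.buffers["kv_cache"][0] = 99        # speculative branch mutates state
session.buffers["recurrent_state"][1] = 5
session.restore(capsule)                   # roll back to the committed boundary

branch = session.fork(capsule)             # independent copy of the same boundary
branch.buffers["metadata"][0] = 1          # does not affect the parent session
```

Restoring in place matters in this sketch because a captured graph plan would presumably refer to fixed buffer locations; swapping in a new buffer object would break that binding, whereas overwriting the contents does not.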
Sub-Millisecond Restore Drives Latency-First Serving
The reported results are significant. On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decode. Ablation studies show that recurrent state is load-bearing, which is why a capsule must manage more than the KV cache. GPU-resident snapshot and restore complete in under a millisecond. Relative to cold prefill, the time-to-first-token (TTFT) speedup grows from 3.9x at 2k tokens to 27x at 16k tokens. The benefits hold on other hardware platforms, including Jetson AGX Thor and DGX Spark, which suggests broad applicability for latency-first AI serving. The researchers emphasize that execution-state capsules do not replace high-throughput KV-cache systems; they define a complementary serving point optimized for explicit execution-state reuse.
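The ablation finding can be illustrated with a toy model. The sketch below is not the preprint's experiment: the `step` function, its arithmetic, and the prompt are arbitrary inventions, chosen only so that the next token depends on both a KV-like history and a recurrent accumulator. It compares a full-state restore against a restore that brings back the KV history but resets the recurrent state.

```python
import copy


def step(state, token):
    """One toy decode step; the next token depends on KV length AND recurrent state."""
    state["kv"].append(token)
    state["recurrent"] = (state["recurrent"] * 31 + token) % 101
    state["next"] = (state["recurrent"] + len(state["kv"])) % 50


def prefill(prompt):
    """Cold prefill: build the whole state from scratch, one token at a time."""
    state = {"kv": [], "recurrent": 0, "next": None}
    for token in prompt:
        step(state, token)
    return state


def greedy_decode(state, n):
    """Deterministic decode: always emit the single 'next' token the state holds."""
    out = []
    for _ in range(n):
        token = state["next"]
        out.append(token)
        step(state, token)
    return out


prompt = [3, 1, 4, 1, 5]

# Reference: cold prefill followed directly by decoding.
reference = greedy_decode(prefill(prompt), 4)

# Full capsule: every part of the state is saved and restored.
capsule = copy.deepcopy(prefill(prompt))
restored = greedy_decode(copy.deepcopy(capsule), 4)

# Ablation: KV history restored, recurrent state reset to its initial value.
kv_only_state = copy.deepcopy(capsule)
kv_only_state["recurrent"] = 0
ablated = greedy_decode(kv_only_state, 4)

print("full restore matches reference:", restored == reference)
print("KV-only restore matches reference:", ablated == reference)
```

In this toy, the full restore reproduces the reference tokens and the KV-only restore diverges, which is the sense in which a state component is load-bearing: dropping it changes the output even though the KV history is intact.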