FlashRT: Execution State for Latency-First AI

Current LLM serving systems optimize for throughput using paged or radix KV caches, which manage only one fragment of execution state. That falls short in low-latency, small-batch, on-device physical-AI serving, where interactive agents, speech systems, and robotic policies must branch, reset, and re-enter frequently under strict responsiveness budgets. The arXiv preprint targets this opposite regime.

Visual TL;DR: Current LLM serving is mismatched with on-device AI needs. Those needs call for execution-state capsules, a granular checkpoint implemented by the FlashRT runtime. The FlashRT runtime enables sub-millisecond restore, which drives latency-first AI serving.

Current LLM Serving: optimizes for high throughput via paged/radix KV caches
On-Device AI Needs: low-latency, small-batch, frequent branching, resetting, re-entry
Execution-State Capsules: graph-bound checkpoint capturing complete restorable state
FlashRT Runtime: white-box kernel runtime with NVIDIA CUDA backend
Sub-Millisecond Restore: enables rapid state restoration for dynamic AI
Latency-First AI Serving: significant TTFT speedups for critical applications
Granular Checkpoint: treats entire execution context as unified, restorable unit

Execution-State Capsules: A Granular Checkpoint for Dynamic AI

The core idea is the execution-state capsule, a graph-bound checkpoint and restore mechanism that captures the complete restorable state at committed boundaries. Unlike earlier methods built around token-level KV cache fragments, it treats the entire execution context as a single restorable unit. FlashRT, the runtime that implements capsules, is a white-box kernel runtime. Its NVIDIA CUDA backend executes captured graph plans over static buffers, which eliminates indirection. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the entire execution boundary: KV cache, recurrent state, convolution state, MTP state, and metadata. Reuse therefore shifts from token-addressed fragments to graph-bound execution-state boundaries.
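The snapshot, restore, fork, and roll-back operations over a closed set of named buffers can be sketched in a few lines. This is a minimal host-side illustration, not FlashRT's API: the `Capsule` and `LiveState` names, the byte-array buffers standing in for static GPU buffers, and the buffer sizes are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Capsule:
    """Hypothetical capsule: an immutable copy of every named buffer at a committed boundary."""
    buffers: Dict[str, bytes]


class LiveState:
    """Hypothetical live state: a closed set of fixed-size named buffers."""

    def __init__(self, sizes: Dict[str, int]):
        # Buffers are allocated once; later operations never replace these objects.
        self._buffers = {name: bytearray(size) for name, size in sizes.items()}

    def view(self, name: str) -> bytearray:
        return self._buffers[name]

    def snapshot(self) -> Capsule:
        # Copy every buffer, not only the KV cache.
        return Capsule({name: bytes(buf) for name, buf in self._buffers.items()})

    def restore(self, capsule: Capsule) -> None:
        if capsule.buffers.keys() != self._buffers.keys():
            raise ValueError("capsule does not match this state's buffer set")
        for name, saved in capsule.buffers.items():
            if len(saved) != len(self._buffers[name]):
                raise ValueError(f"size mismatch for buffer {name!r}")
            # Copy in place so the buffer object (its "address") stays the same.
            self._buffers[name][:] = saved

    def fork(self) -> "LiveState":
        # A fork is a fresh buffer set initialised from a snapshot of this one.
        child = LiveState({name: len(buf) for name, buf in self._buffers.items()})
        child.restore(self.snapshot())
        return child


# Usage: commit a boundary, run a speculative step, then roll back.
state = LiveState({"kv_cache": 8, "recurrent_state": 4, "conv_state": 4,
                   "mtp_state": 2, "metadata": 2})
state.view("kv_cache")[0] = 7
committed = state.snapshot()
state.view("recurrent_state")[0] = 9   # speculative work after the boundary
state.restore(committed)               # roll back to the committed boundary
branch = state.fork()                  # independent branch from the same boundary
```

Restoring in place rather than swapping buffer objects mirrors the constraint described above: a plan captured over static buffers keeps pointing at the same storage, so a restore has to overwrite contents rather than reallocate.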

Sub-Millisecond Restore Drives Latency-First Serving

On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decode. Ablation studies show that recurrent state is load-bearing, so managing the KV cache alone is not enough. GPU-resident snapshot and restore complete in under a millisecond. The resulting Time-To-First-Token (TTFT) speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. These benefits are consistent across hardware platforms, including Jetson AGX Thor and DGX Spark. The researchers emphasize that execution-state capsules do not replace high-throughput KV-cache systems; they define a complementary serving point optimized for explicit execution-state reuse.
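The two correctness claims, token-identical replay after a full restore and divergence when recurrent state is left out, can be illustrated with a toy deterministic decoder. Everything here is an assumption made for the example: the update rule in `step`, the three state fields, and the prompt are invented and are not the model or the ablation setup from the preprint.

```python
import copy

VOCAB = 10


def step(state, token):
    """Toy update: append to the KV list and fold the token into the recurrent state."""
    state["kv"].append(token)
    state["rec"] = (state["rec"] * 3 + token + 1) % 101
    # Deterministic stand-in for greedy argmax over logits.
    return (state["rec"] + len(state["kv"])) % VOCAB


def prefill(prompt):
    state = {"kv": [], "rec": 0, "next": None}
    for token in prompt:
        state["next"] = step(state, token)
    return state


def greedy_decode(state, n):
    out = []
    for _ in range(n):
        token = state["next"]
        out.append(token)
        state["next"] = step(state, token)
    return out


def restore(state, capsule, keys):
    """Copy only the listed fields back from the capsule."""
    for key in keys:
        state[key] = copy.deepcopy(capsule[key])


state = prefill([3, 1, 4])
capsule = copy.deepcopy(state)          # snapshot at the committed boundary
reference = greedy_decode(state, 4)

restore(state, capsule, ["kv", "rec", "next"])
assert state == capsule                 # stored state matches the capsule exactly
replay = greedy_decode(state, 4)

restore(state, capsule, ["kv", "next"])  # ablation: recurrent state left stale
ablated = greedy_decode(state, 4)

print(reference, replay, ablated)
```

In this toy, the full restore replays the same four tokens, while the KV-only restore emits the same first token and then diverges, because the stale recurrent value feeds every later step.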
