FlashRT: Execution State for Latency-First AI

FlashRT introduces execution-state capsules for on-device AI serving, enabling sub-millisecond state restoration and time-to-first-token (TTFT) speedups of 3.9x to 27x over cold prefill for latency-critical applications.

[Figure: Conceptual overview of execution-state capsules for dynamic AI serving.]

Current LLM serving systems optimize for throughput with paged or radix KV caches, which manage only one fragment of execution state. That falls short for low-latency, small-batch, on-device physical-AI serving, where interactive agents, speech systems, and robotic policies must branch, reset, and re-enter frequently under strict responsiveness budgets. An arXiv preprint introduces FlashRT, a runtime that targets this opposite regime.

Visual TL;DR. Current LLM Serving is mismatched with On-Device AI Needs. On-Device AI Needs require Execution-State Capsules. Execution-State Capsules are implemented by the FlashRT Runtime. The FlashRT Runtime enables Sub-Millisecond Restore. Sub-Millisecond Restore drives Latency-First AI Serving. An Execution-State Capsule is a Granular Checkpoint.


  1. Current LLM Serving: optimizes for high throughput via paged/radix KV caches
  2. On-Device AI Needs: low-latency, small-batch, frequent branching, resetting, re-entry
  3. Execution-State Capsules: graph-bound checkpoint capturing complete restorable state
  4. FlashRT Runtime: white-box kernel runtime with NVIDIA CUDA backend
  5. Sub-Millisecond Restore: enables rapid state restoration for dynamic AI
  6. Latency-First AI Serving: significant TTFT speedups for critical applications
  7. Granular Checkpoint: treats entire execution context as unified, restorable unit

Execution-State Capsules: A Granular Checkpoint for Dynamic AI

The core innovation is the execution-state capsule, a graph-bound checkpoint-and-restore mechanism that captures the complete restorable state at committed boundaries. Unlike earlier methods that reuse token-level KV cache fragments, it treats the entire execution context as a single restorable unit. FlashRT, the runtime that implements it, is a white-box kernel runtime whose NVIDIA CUDA backend executes captured graph plans over static buffers, which removes indirection and keeps state management efficient. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the entire execution boundary: KV cache, recurrent state, convolution state, MTP state, and metadata. Reuse therefore shifts from token-addressed fragments to graph-bound execution-state boundaries.
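To make the "closed set of named buffers" idea concrete, here is a minimal host-side sketch in Python. It is not FlashRT's API: the names `LiveState`, `Capsule`, `snapshot`, `restore`, and `fork`, the buffer names, and the use of a `graph_id` string to model "graph-bound" are all assumptions made for illustration, and plain `bytearray` objects stand in for static GPU buffers.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical buffer names, mirroring the state kinds the article lists.
BUFFER_NAMES = ("kv_cache", "recurrent_state", "conv_state", "mtp_state", "metadata")


class LiveState:
    """A closed set of named, fixed-size buffers (stand-ins for static GPU buffers)."""

    def __init__(self, sizes: Dict[str, int]) -> None:
        if set(sizes) != set(BUFFER_NAMES):
            raise ValueError("live state must hold exactly the declared buffers")
        self.buffers: Dict[str, bytearray] = {n: bytearray(sizes[n]) for n in BUFFER_NAMES}


@dataclass(frozen=True)
class Capsule:
    """Immutable copy of every buffer, tagged with the graph it was taken under."""
    graph_id: str  # assumption: "graph-bound" modeled as a simple identifier check
    data: Dict[str, bytes]


def snapshot(state: LiveState, graph_id: str) -> Capsule:
    # Copy all buffers at once so the capsule covers the whole execution boundary.
    return Capsule(graph_id, {n: bytes(b) for n, b in state.buffers.items()})


def restore(state: LiveState, capsule: Capsule, graph_id: str) -> None:
    # Refuse to restore a capsule into a state driven by a different graph.
    if capsule.graph_id != graph_id:
        raise ValueError("capsule is bound to a different graph")
    for name, blob in capsule.data.items():
        buf = state.buffers[name]
        if len(buf) != len(blob):
            raise ValueError(f"size mismatch for buffer {name!r}")
        buf[:] = blob  # in-place copy: the buffer object itself never changes


def fork(capsule: Capsule) -> LiveState:
    # A fork is a fresh live state initialized from the same capsule.
    child = LiveState({n: len(b) for n, b in capsule.data.items()})
    restore(child, capsule, capsule.graph_id)
    return child
```

In this sketch a rollback is simply a `restore` from an earlier capsule, and a fork leaves the parent state untouched because the child owns its own buffers.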

Sub-Millisecond Restore Drives Latency-First Serving

On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decoding. Ablation studies show that recurrent state is load-bearing, so managing the KV cache alone is not enough. GPU-resident snapshot and restore complete in under a millisecond. As a result, the time-to-first-token (TTFT) speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. These benefits hold across other hardware platforms, including Jetson AGX Thor and DGX Spark. The researchers emphasize that execution-state capsules do not replace high-throughput KV-cache systems; they define a complementary serving point optimized for explicit execution-state reuse.
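The two correctness claims, byte-exact restore and token-identical greedy decoding, can be illustrated with a toy model. The Python sketch below is purely hypothetical and says nothing about FlashRT's kernels or the preprint's actual ablation: `ToyState`, `prefill`, `decode`, and the arithmetic inside them are invented so that the next token depends on both a KV-like history and a recurrent value. It shows why restoring only the KV part can diverge from the reference while a full restore does not.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyState:
    kv: bytearray = field(default_factory=bytearray)  # stand-in for the KV cache
    rec: int = 0                                       # stand-in for recurrent state


def prefill(prompt: List[int]) -> ToyState:
    # "Cold prefill": fold every prompt token into both parts of the state.
    state = ToyState()
    for tok in prompt:
        state.rec = (state.rec * 31 + tok) % 97
        state.kv.append(tok)
    return state


def decode(state: ToyState, steps: int) -> List[int]:
    # Deterministic "greedy" decode: each token depends on rec AND the kv history.
    out = []
    for _ in range(steps):
        state.rec = (state.rec * 31 + state.kv[-1] + len(state.kv)) % 97
        tok = (state.rec + sum(state.kv)) % 50
        state.kv.append(tok)
        out.append(tok)
    return out


def snapshot(state: ToyState) -> ToyState:
    # Copy both parts so later decoding cannot mutate the capsule.
    return ToyState(bytearray(state.kv), state.rec)


base = prefill([3, 1, 4, 1, 5])
capsule = snapshot(base)
reference = decode(base, 8)

# Full restore: every part of the state comes back, so decoding repeats exactly.
full_tokens = decode(snapshot(capsule), 8)

# KV-only restore: the recurrent value is lost and reset to its initial value.
kv_only_tokens = decode(ToyState(bytearray(capsule.kv), 0), 8)

print("full restore matches reference:", full_tokens == reference)
print("kv-only restore matches reference:", kv_only_tokens == reference)
```

The comparison `snapshot(capsule) == capsule` plays the role of the byte-exact check at the stored-state level, while the token lists play the role of the greedy-decode check.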

© 2026 StartupHub.ai. All rights reserved. Do not enter, scrape, copy, reproduce, or republish this article in whole or in part. Use as input to AI training, fine-tuning, retrieval-augmented generation, or any machine-learning system is prohibited without written license. Substantially-similar derivative works will be pursued to the fullest extent of applicable copyright, database, and computer-misuse laws. See our terms.