KV-Fold: Unlocking Transformer Long Context

KV-Fold enables training-free, stable long-context inference up to 128K tokens with 100% retrieval accuracy, overcoming prior limitations.

Figure: Conceptual overview of the KV-Fold long-context inference protocol, showing key-value pairs accumulating across sequence chunks.

The quest for extended context windows in large language models has historically demanded significant computational resources or architectural modifications. A recent arXiv preprint introduces KV-Fold, a novel, training-free approach to long-context inference that ingeniously repurposes the key-value (KV) cache.

Visual TL;DR: Long context challenge → Introducing KV-Fold → Functional folding → Stable recurrence → Robustness observed; Stable recurrence → High-fidelity retrieval → 128K token context.


  1. Long context challenge: extending context windows has demanded heavy compute or architectural changes
  2. Introducing KV-Fold: a novel, training-free approach that repurposes the key-value cache
  3. Functional folding: the KV cache is treated as the accumulator in a left fold across sequence chunks
  4. Stable recurrence: chunk-to-chunk updates mirror foldl; per-step drift quickly saturates
  5. Robustness observed: behavior is stable across numerical precisions, chunk sizes, and model families
  6. High-fidelity retrieval: enables long-range retrieval with 100% exact-match accuracy
  7. 128K token context: achieves stable inference up to 128,000 tokens

Stable Recurrence via Functional Folding

KV-Fold treats the KV cache as the accumulator in a left fold across sequence chunks. At each step, the model processes a new chunk conditioned on the cache accumulated from all previous chunks; the newly generated keys and values are then appended to that cache, and the enlarged cache is passed forward. This simple one-step update mirrors the behavior of `foldl` in functional programming, establishing a stable chunk-to-chunk recurrence. The researchers observed that per-step drift quickly saturates rather than compounding, and that this stability holds across numerical precisions, chunk sizes, and model families.
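The recurrence above can be sketched as an ordinary left fold. This is an illustrative toy, not the paper's implementation: `process_chunk` stands in for a real model forward pass, and the toy "keys" and "values" are just placeholder numbers.

```python
from functools import reduce

def process_chunk(cache, chunk):
    """One fold step: condition on `cache`, then append this chunk's KV pairs.

    In the real protocol this would be a model forward pass over `chunk`
    that attends to the accumulated cache; here we fabricate toy KV pairs.
    """
    new_kv = [(tok, tok * 2) for tok in chunk]  # stand-in for real keys/values
    return cache + new_kv  # accumulator only grows, exactly like foldl

def kv_fold(chunks):
    """foldl(process_chunk, [], chunks): chunked, training-free inference."""
    return reduce(process_chunk, chunks, [])

chunks = [[1, 2], [3, 4], [5]]
cache = kv_fold(chunks)
print(len(cache))  # 5: one KV pair per token across all chunks
```

Because each step is an append rather than a rewrite, the fold never revisits earlier state, which is what makes the chunk-to-chunk recurrence easy to reason about.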

Unlocking High-Fidelity Long-Range Retrieval

The practical impact of KV-Fold is demonstrated through its performance on challenging long-context tasks. On a needle-in-a-haystack benchmark, the method achieved 100% exact-match retrieval across numerous trials, spanning contexts from 16K to 128K tokens. Crucially, this was accomplished with a Llama-3.1-8B model while staying within the memory constraints of a single 40GB GPU and supporting chain depths up to 511. This contrasts with streaming methods, which often sacrifice fidelity for memory boundedness. By maintaining long-range retrieval through a series of tractable forward passes, KV-Fold makes a compelling case for training-free long-context inference.

© 2026 StartupHub.ai. All rights reserved.