The quest for models that can process and retain information from ever-growing sequences has hit a fundamental theoretical wall. A recent arXiv preprint, authored by Yan Zhou, formalizes a critical dilemma inherent in the design of long sequence models: the impossibility of achieving efficiency, compactness, and extensive recall simultaneously.
The Unifying Theory of Sequence Processing
Zhou introduces the 'Online Sequence Processor' abstraction, a framework that elegantly unifies diverse architectures including Transformers, state space models, and linear recurrent networks. This abstraction serves as the bedrock for proving a fundamental trade-off in long sequence models. The core insight is that any model achieving both per-step computation independent of sequence length (Efficiency) and a state size independent of sequence length (Compactness) is inherently limited in Recall: it cannot reliably retrieve an amount of historical information that grows with sequence length. The limitation is quantified: such models can recall at most O(poly(d)/log V) key-value pairs, where 'd' is the model dimension and 'V' is the vocabulary size. This bound, derived using the Data Processing Inequality and Fano's Inequality, has profound implications for the scalability and capability of current and future models.
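To see why such a bound cannot depend on sequence length, it helps to walk through a rough counting sketch. The snippet below is our own illustration of the Fano-style arithmetic, not code from the paper: it assumes the compact state holds roughly poly(d) parameters, that each parameter carries a fixed number of bits, and that reliably recovering one value from a vocabulary of size V costs about log2(V) bits. The function name, parameter count, and bits-per-parameter figure are all illustrative assumptions.

```python
import math

def recall_upper_bound(d: int, vocab_size: int, bits_per_param: int = 16,
                       state_params_exponent: int = 2) -> float:
    """Back-of-the-envelope bound on recallable key-value pairs.

    Illustrative assumptions (not taken from the paper):
      - the compact state holds roughly d**state_params_exponent parameters,
      - each parameter carries at most bits_per_param bits of information,
      - recovering one value from a vocabulary of size V needs ~log2(V) bits.
    A Fano-style counting argument then caps the number of reliably
    recallable pairs at (state bits) / log2(V) = O(poly(d)/log V),
    with no dependence on sequence length.
    """
    state_bits = (d ** state_params_exponent) * bits_per_param
    bits_per_pair = math.log2(vocab_size)
    return state_bits / bits_per_pair

# Example: a hypothetical d=4096 model with a 50k-token vocabulary.
print(f"{recall_upper_bound(4096, 50_000):.2e}")
```

The point of the arithmetic is not the particular constant it prints, but that the ceiling is fixed by d and V alone: once the sequence contains more key-value pairs than the state can encode, something must be forgotten.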