Language Models' Hidden Number Hierarchy

Language models exhibit a two-tiered hierarchy in number representation: many learn Fourier-periodic features, but only some organize them into geometrically separable ones, and that separability is the key differentiator.

[Figure: conceptual graphic illustrating the two-tiered hierarchy of number features learned by language models.]

The intricate ways language models process numerical information remain a frontier in AI research. While it's known that models trained on natural text develop internal representations for numbers, the depth and structure of this understanding are far from uniform. A recent study published on arXiv delves into this phenomenon, revealing a nuanced hierarchy in how these models learn to represent numbers.

Beyond Fourier Sparsity: The Geometric Separability Divide

Researchers observed that a diverse range of models (Transformers, Linear RNNs, LSTMs, and classical word embeddings) all exhibit periodic features with dominant periods at $T = 2, 5, 10$ in their Fourier-domain representations. A critical distinction emerges, however: only a subset of these models learn geometrically separable features. In other words, while many models encode periodic structure, fewer organize it so that a linear classifier can recover numbers modulo $T$. The paper proves that Fourier-domain sparsity, common to all these models, is a necessary but not sufficient condition for this geometric separability.
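To make that distinction concrete, here is a minimal sketch (not the paper's code, and using synthetic features rather than real model embeddings): both feature maps below are Fourier-sparse with a single dominant period $T = 5$, yet only the two-dimensional one lets a linear probe classify residues mod $T$.

```python
# Minimal sketch: Fourier sparsity does not guarantee linear separability mod T.
import numpy as np
from sklearn.linear_model import LogisticRegression

T = 5
n = np.arange(1000)
labels = n % T

# 1-D cosine feature: a single Fourier period T, but cos is an even function,
# so residues such as 1 and 4 (mod 5) collapse onto the same value.
X_cos = np.cos(2 * np.pi * n / T).reshape(-1, 1)

# 2-D circle feature: the same single Fourier period, but the T residue
# classes land on distinct vertices of the unit circle, and each vertex can
# be cut off from the rest by a line.
X_circle = np.stack([np.cos(2 * np.pi * n / T),
                     np.sin(2 * np.pi * n / T)], axis=1)

for name, X in [("cos only", X_cos), ("cos + sin", X_circle)]:
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"{name}: linear probe accuracy = {probe.score(X, labels):.2f}")
# Expected: "cos only" stays well below 1.00 (classes overlap in 1-D),
# while "cos + sin" reaches 1.00 -- sparsity alone is not separability.
```

Both feature maps have identical Fourier spectra; only the geometry of how the residue classes are laid out differs, which is exactly the gap the paper formalizes.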


Convergent Evolution in Numerical Feature Acquisition

The study meticulously investigates the factors influencing the development of geometrically separable features, identifying that the data, model architecture, optimizer, and even the tokenizer all play significant roles. Notably, models can acquire these representations through two distinct pathways. One route leverages complementary co-occurrence signals within general language data, such as text-number interactions and cross-number relationships. The other emerges from training on multi-token addition problems, though not single-token ones. This finding underscores a principle of convergent evolution in machine learning: disparate training signals and conditions can lead to similar, effective numerical feature learning.
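One way to read the multi-token versus single-token distinction is through tokenization granularity. The sketch below (illustrative helpers, not the paper's dataset) shows that under digit-level tokenization, single-digit operands occupy one token each, while multi-digit operands span several tokens, so solving addition forces the model to compose per-digit features.

```python
# Illustrative sketch of single-token vs. multi-token addition data.
import random

def addition_example(max_value: int) -> str:
    """Render a random addition problem as a training string."""
    a, b = random.randint(0, max_value), random.randint(0, max_value)
    return f"{a}+{b}={a + b}"

def digit_tokenize(text: str) -> list[str]:
    # Assumed tokenizer: every digit and operator is its own token.
    return list(text)

random.seed(0)
single = addition_example(9)     # operands fit in one token each
multi = addition_example(9999)   # operands span multiple digit tokens
print(digit_tokenize(single))    # e.g. ['6', '+', '6', '=', '1', '2']
print(digit_tokenize(multi))     # e.g. ['3', '2', '4', '9', '+', ...]
```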
