The intricate ways language models process numerical information remain a frontier in AI research. While it's known that models trained on natural text develop internal representations for numbers, the depth and structure of this understanding are far from uniform. A recent study published on arXiv delves into this phenomenon, revealing a nuanced hierarchy in how these models learn to represent numbers.
Beyond Fourier Sparsity: The Geometric Separability Divide
Researchers observed that a diverse range of models, including Transformers, Linear RNNs, LSTMs, and classical word embeddings, all exhibit periodic features with dominant periods $T = 2, 5, 10$ in the Fourier domain. A critical distinction emerges, however: only a subset of these models learn geometrically separable features, that is, representations from which a linear classifier can recover a number's residue modulo $T$. Many models pick up the periodic patterns, but fewer organize them so that the residue classes occupy linearly separable regions of embedding space. The paper proves that Fourier-domain sparsity, while common to all of these models, is a necessary but not sufficient condition for geometric separability in number representations.
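To make the two notions concrete, here is a minimal sketch in Python (using NumPy and scikit-learn) of the kind of diagnostics involved: an FFT over number embeddings to find dominant periods, and a linear probe for residues modulo $T$. The embeddings here are synthetic stand-ins, and the dimensions and probe setup are illustrative assumptions, not the paper's actual experimental code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for learned number embeddings: each integer n in
# [0, 1000) gets a vector mixing periodic components at the periods
# reported in the paper (T = 2, 5, 10) with Gaussian noise.
N, D = 1000, 32
ns = np.arange(N)
emb = rng.normal(scale=0.1, size=(N, D))
for d, T in enumerate([2, 5, 10]):
    emb[:, 2 * d] += np.cos(2 * np.pi * ns / T)
    emb[:, 2 * d + 1] += np.sin(2 * np.pi * ns / T)

# Diagnostic 1: Fourier sparsity. FFT each embedding dimension across
# the integer axis and read off its dominant period.
spectrum = np.abs(np.fft.rfft(emb - emb.mean(0), axis=0))
freqs = np.fft.rfftfreq(N)
for d in range(6):
    k = spectrum[1:, d].argmax() + 1  # skip the DC component
    print(f"dim {d}: dominant period ~ {1 / freqs[k]:.1f}")

# Diagnostic 2: geometric separability. Train a linear probe to
# classify n mod T from the embeddings; high held-out accuracy means
# the residue classes are (approximately) linearly separable.
for T in [2, 5, 10]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        emb, ns % T, test_size=0.25, random_state=0)
    probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    print(f"mod-{T} probe accuracy: {probe.score(X_te, y_te):.2f}")
```

In this synthetic setup both checks pass by construction. The divide the paper describes shows up when a real model's features are Fourier-sparse (the first check succeeds) yet no linear probe can recover residues from them (the second fails), which is exactly why sparsity alone cannot guarantee separability.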