Two persistent phenomena in Transformer language models, massive activations and attention sinks, have long been observed to co-occur, often at the same tokens. Prior work noted this correlation, but the functional roles of the two phenomena and the causal relationship between them remained elusive. New research published on arXiv systematically dissects both behaviors and finds them to be largely artifacts of modern Transformer design rather than semantic necessities.
Massive Activations as Implicit Parameters
Massive activations, extreme outlier values that appear in a handful of channels at a few tokens, operate globally within the model. The study shows that they induce near-constant hidden representations that persist across layers. Because these representations are effectively input-independent, the activations function as implicit parameters: they shape the model's behavior pervasively even though no training objective ever targets them directly.
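To make this concrete, here is a minimal sketch of how one might probe for massive activations: scan a model's hidden states for values that dwarf the typical magnitude. The model name (gpt2) and the 1000x-median threshold are illustrative assumptions, not values taken from the paper; the effect is usually reported on larger models.

```python
# Sketch: locating extreme outlier activations in a causal LM's hidden states.
# The model and the outlier criterion below are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any decoder-only LM slots in here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

inputs = tok("Massive activations concentrate on a few tokens.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

for layer, h in enumerate(out.hidden_states):  # each is (1, seq_len, d_model)
    mags = h[0].abs()
    median = mags.median()
    mask = mags > 1000 * median  # heuristic outlier criterion (assumption)
    if mask.any():
        toks, chans = mask.nonzero(as_tuple=True)
        for t, c in zip(toks.tolist(), chans.tolist()):
            print(f"layer {layer}: token position {t}, channel {c}, "
                  f"value {h[0, t, c].item():.1f}")
```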
Attention Sinks: Local Modulation of Dependencies
Attention sinks, in contrast, operate locally. They attract a disproportionate share of attention mass regardless of semantic relevance and modulate attention outputs across heads, biasing individual heads toward short-range dependencies. So although sinks typically arise at the same tokens as massive activations, the research indicates they play a distinct role in shaping where the attention mechanism focuses.
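A similarly hedged sketch of measuring sink behavior: for each head, average the attention weight assigned to the first token across query positions. Treating position 0 as the sink is the common convention, assumed here rather than taken from the paper; eager attention is requested so that the weights are actually returned.

```python
# Sketch: how much attention mass each head assigns to a putative sink token.
# Position 0 as the sink is an assumption, as is the choice of model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

inputs = tok("Attention sinks attract mass regardless of meaning.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

for layer, attn in enumerate(out.attentions):  # each is (1, n_heads, seq, seq)
    # For every head, average over query positions the weight placed on token 0,
    # skipping query 0 itself (under causal masking it can only attend to itself).
    sink_mass = attn[0, :, 1:, 0].mean(dim=-1)
    print(f"layer {layer}: per-head sink mass = "
          + ", ".join(f"{m:.2f}" for m in sink_mass.tolist()))
```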
The Pre-Norm Configuration as the Enabler
The key architectural choice enabling the co-occurrence of massive activations and attention sinks is the pre-norm configuration. Through systematic ablations, the researchers show that removing this configuration decouples the two phenomena. The finding suggests that changes to normalization strategy offer a concrete lever for disentangling these effects, with potential gains in model efficiency and interpretability.
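The mechanism is easy to see in code. In a pre-norm block the residual stream bypasses LayerNorm entirely, so an extreme value written into it is carried forward layer after layer; in a post-norm block the normalization rescales the sum at every layer. The toy modules below are a sketch of that structural contrast, not the paper's experimental setup.

```python
# Sketch: pre-norm vs. post-norm residual blocks. Module shapes are illustrative.
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        # The residual path is never normalized: an outlier in x persists.
        return x + self.sublayer(self.norm(x))

class PostNormBlock(nn.Module):
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        # The whole sum is normalized, so extreme values are rescaled each layer.
        return self.norm(x + self.sublayer(x))

if __name__ == "__main__":
    d = 16
    x = torch.zeros(1, 4, d)
    x[0, 0, 3] = 1e3  # plant an outlier in the residual stream
    pre = PreNormBlock(d, nn.Linear(d, d))
    post = PostNormBlock(d, nn.Linear(d, d))
    print(pre(x)[0, 0, 3].item())   # outlier survives, plus a small update
    print(post(x)[0, 0, 3].item())  # outlier squashed back to O(1) by LayerNorm
```

Running the toy demo makes the asymmetry visible: the pre-norm output still carries a value near 1e3 at the planted channel, while the post-norm output is on the order of a few units, which is consistent with the paper's claim that pre-norm is what lets massive activations persist across depth.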