Two persistent phenomena in Transformer language models, massive activations and attention sinks, have long been observed to co-occur, often at the same tokens. While prior work noted this correlation, the functional roles and causal relationship behind it remained elusive. This new research, published on arXiv, systematically dissects both behaviors and finds them to be largely artifacts of modern Transformer design rather than semantic necessities.
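To make the co-occurrence concrete, the sketch below measures, per layer, how much attention mass queries place on the first token (the usual sink position) alongside the peak activation magnitude at that same token. This is a minimal illustration, not the paper's setup: the model name "gpt2" and the probe sentence are assumptions chosen for convenience.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative probe; "gpt2" is a stand-in, not necessarily a model studied in the paper.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
# Eager attention so the model returns attention weights.
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True, output_hidden_states=True)

# For each layer: average attention mass on token 0 (the typical sink position),
# and the largest activation magnitude at that same token.
for layer, attn in enumerate(out.attentions):          # each attn: [batch, heads, query, key]
    sink_mass = attn[0, :, 1:, 0].mean().item()        # exclude query 0, which can only see itself
    peak_act = out.hidden_states[layer + 1][0, 0].abs().max().item()
    print(f"layer {layer:2d}: attention mass on token 0 = {sink_mass:.2f}, "
          f"peak |activation| at token 0 = {peak_act:.1f}")
```

If sinks and massive activations do coincide, layers with high sink mass on token 0 should also show outsized activation magnitudes at that position.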
Massive Activations as Implicit Parameters
Massive activations, a handful of tokens whose hidden states contain extreme outlier values in a few specific channels, operate globally within the model. The study demonstrates that these activations induce near-constant hidden representations that persist across layers. This persistence means they effectively function as implicit parameters: near-constant quantities that pervasively shape the model's computation, even though training never explicitly targeted these activations.
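The persistence claim can be probed directly: if a massive-activation token's hidden state is near-constant, its cosine similarity across consecutive layers should stay close to 1, unlike an ordinary token's. The following sketch checks this under assumptions: "gpt2", the probe sentence, and the token positions are placeholders for whatever models and data the paper actually evaluates.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of the persistence check; all concrete choices here are assumptions.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("Massive activations act like fixed biases.", return_tensors="pt")
with torch.no_grad():
    hs = model(**inputs, output_hidden_states=True).hidden_states  # (L+1) tensors of [1, T, D]

sink_pos, regular_pos = 0, 3  # the first token is the usual massive-activation site
for layer in range(1, len(hs) - 1):
    # Cross-layer cosine similarity of the same token's hidden state.
    sim_sink = F.cosine_similarity(hs[layer][0, sink_pos], hs[layer + 1][0, sink_pos], dim=0).item()
    sim_reg = F.cosine_similarity(hs[layer][0, regular_pos], hs[layer + 1][0, regular_pos], dim=0).item()
    peak = hs[layer + 1][0, sink_pos].abs().max().item()
    print(f"layers {layer:2d}->{layer + 1:2d}: sink cos = {sim_sink:.3f} "
          f"(peak |act| = {peak:.1f})  vs. regular cos = {sim_reg:.3f}")
```

A near-constant representation at the sink position, contrasted with an ordinary token that keeps changing layer to layer, is what licenses reading these activations as implicit parameters rather than ordinary token features.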