We observe an empirical phenomenon in Large Language Models (LLMs): very few activations exhibit significantly larger values than the others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find that their values largely stay constant regardless of the input, and that they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities on their corresponding tokens and, further, to implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers.
Massive activations exist widely in LLMs, across various model sizes and families.
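Below is a minimal sketch of how one could look for massive activations with a Hugging Face causal LM: collect hidden states and flag activations that are both large in absolute terms and far above the median magnitude of their hidden state. The model name ("gpt2" as a stand-in), the prompt, and the >100 / 1,000x-median thresholds are illustrative assumptions, not the paper's exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper studies LLaMA-family and other LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tok("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: one (1, seq_len, hidden_dim) tensor per layer (plus the embeddings)
for layer_idx, h in enumerate(out.hidden_states):
    mags = h[0].abs()                                # (seq_len, hidden_dim)
    median = mags.median().item()
    top_val, flat_idx = mags.flatten().max(dim=0)
    tok_idx, dim_idx = divmod(flat_idx.item(), mags.shape[1])
    # illustrative criterion: large in absolute value and far above the median magnitude
    if top_val.item() > 100 and top_val.item() > 1000 * median:
        print(f"layer {layer_idx}: activation {top_val.item():.1f} "
              f"(median {median:.4f}) at token {tok_idx}, feature dim {dim_idx}")
```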
We conduct intervention analyses on massive activations and find that their values remain largely constant across input sequences. Nevertheless, they play a crucial role in the internal computation of LLMs.
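As a concrete illustration of this kind of intervention, the sketch below pins a single activation at one layer's output to a fixed value (here zero) via a forward hook and compares the language-modeling loss before and after. The layer index, token position, and feature dimension are hypothetical placeholders; in practice they would be the locations identified by the detection step above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER_IDX, TOKEN_IDX, DIM_IDX = 2, 0, 123            # assumed location of a massive activation

def pin_activation(module, args, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    hidden = output[0].clone()
    hidden[:, TOKEN_IDX, DIM_IDX] = 0.0              # set to zero; pin to an empirical mean instead if desired
    return (hidden,) + tuple(output[1:])

inputs = tok("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    baseline = model(**inputs, labels=inputs["input_ids"]).loss.item()
    handle = model.transformer.h[LAYER_IDX].register_forward_hook(pin_activation)
    edited = model(**inputs, labels=inputs["input_ids"]).loss.item()
    handle.remove()

print(f"loss before: {baseline:.3f}, after intervention: {edited:.3f}")
```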
Intriguingly, massive activations are closely connected to the self-attention mechanism. As soon as they emerge within LLMs, attention probabilities concentrate on their corresponding tokens.
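One way to observe this concentration with a Hugging Face model is to request attention maps and measure how much probability later queries place on a given key position. The sketch below uses "gpt2" as a stand-in model and position 0 as an assumed massive-activation token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", attn_implementation="eager"              # eager attention so attention maps are returned
).eval()

inputs = tok("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

SINK_POS = 0   # assumed position of a token carrying massive activations
for layer_idx, attn in enumerate(out.attentions):    # each: (1, num_heads, q_len, k_len)
    # average probability that queries after SINK_POS place on it, across heads
    mass = attn[0, :, SINK_POS + 1 :, SINK_POS].mean().item()
    print(f"layer {layer_idx}: mean attention on position {SINK_POS} = {mass:.2f}")
```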
Delving into the computation within self-attention, we show that LLMs use massive activations to implement an implicit form of bias in the attention output.
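The toy example below (not the paper's analysis code) illustrates the resulting bias-like effect: when most attention mass sits on one token whose value vector is roughly input-independent, the attention output is approximately that fixed vector plus a small input-dependent remainder. All numbers are made up for illustration.

```python
import torch

torch.manual_seed(0)
dim, n_other = 8, 5
v_sink = torch.randn(dim)                     # value vector of the massive-activation token (roughly constant)
v_rest = torch.randn(n_other, dim)            # value vectors of the remaining tokens

p_sink = 0.97                                 # attention mass concentrated on that token (illustrative)
p_rest = torch.full((n_other,), (1 - p_sink) / n_other)

attn_out = p_sink * v_sink + p_rest @ v_rest  # attention output for one query
bias_term = p_sink * v_sink                   # dominant, input-independent component

rel_residual = (attn_out - bias_term).norm() / attn_out.norm()
print(f"relative size of the input-dependent part: {rel_residual.item():.3f}")
```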
Furthermore, we find that explicit attention biases eliminate the need for LLMs to learn massive activations during pretraining.
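The sketch below shows one way such explicit attention biases could be implemented: a multi-head attention module that appends a learnable key/value pair as an extra "virtual" position, so queries can place probability mass there instead of forcing a real token to act as a bias. This is an assumed, simplified implementation for illustration (causal masking omitted), not the paper's code; the class name and shapes are ours.

```python
import math
import torch
import torch.nn as nn

class AttentionWithExplicitBias(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # learnable per-head key/value biases, appended to every sequence
        self.k_bias = nn.Parameter(torch.zeros(num_heads, 1, self.head_dim))
        self.v_bias = nn.Parameter(torch.zeros(num_heads, 1, self.head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (b, heads, n, head_dim)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        # append the explicit bias key/value as an extra "virtual" position
        k = torch.cat([self.k_bias.expand(b, -1, -1, -1), k], dim=2)
        v = torch.cat([self.v_bias.expand(b, -1, -1, -1), v], dim=2)
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

x = torch.randn(2, 5, 64)
print(AttentionWithExplicitBias(dim=64, num_heads=8)(x).shape)  # torch.Size([2, 5, 64])
```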
Massive activations exist in many Vision Transformers (ViTs), e.g., CLIP ViT-L and DINOv2 ViT-L. They appear at fixed feature dimensions but at seemingly random patch tokens.
For more results and analysis on massive activations in LLMs and ViTs, please take a look at our full paper.
@article{sun2024massive,
title = {Massive Activations in Large Language Models},
author = {Sun, Mingjie and Chen, Xinlei and Kolter, J. Zico and Liu, Zhuang},
year = {2024},
journal = {arXiv preprint arXiv:2402.17762}
}
We thank Sachin Goyal, Jeremy Cohen, Timothée Darcet, Koustuv Sinha and Mike Rabbat for valuable discussions. Mingjie Sun was supported by funding from the Bosch Center for Artificial Intelligence.