Massive Activations in Large Language Models

Mingjie Sun1       Xinlei Chen 2       J. Zico Kolter1,3       Zhuang Liu2
1Carnegie Mellon University       2Meta AI Research       3Bosch Center for AI

[Teaser figure]

We discover and study massive activations in Large Language Models. These activations are extremely rare, yet have significantly larger magnitudes than other activations. For LLaMA2-7B, they appear in two fixed feature dimensions (1415, 2533) and in two types of tokens: the starting token, and the first period "." or newline token "\n".
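
To make the phenomenon concrete, here is a minimal sketch (not the authors' code) of how one might probe for massive activations in LLaMA2-7B with Hugging Face transformers; the checkpoint name, layer index, and 100x-median cutoff are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; requires access to the gated LLaMA2 weights.
model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"  # device_map needs accelerate
)
model.eval()

inputs = tok("Summer is warm. Winter is cold.", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Residual-stream activations at one intermediate layer (index chosen for illustration).
hidden = out.hidden_states[20][0].float()   # (seq_len, hidden_dim)
mags = hidden.abs()
threshold = 100 * mags.median()             # heuristic cutoff
tok_idx, dim_idx = torch.nonzero(mags > threshold, as_tuple=True)
for t, d in zip(tok_idx.tolist(), dim_idx.tolist()):
    token_str = tok.decode([inputs["input_ids"][0, t].item()])
    print(f"token {t} ({token_str!r}), dim {d}, value {hidden[t, d].item():.1f}")

If the observation above holds, the printed indices for LLaMA2-7B should fall on dimensions 1415 and 2533, at the starting token and at the first "." or "\n" token.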




Abstract

We observe an empirical phenomenon in Large Language Models (LLMs): very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find that their values largely stay constant regardless of the input, and that they function as indispensable bias terms in LLMs. Third, we show that these massive activations lead to the concentration of attention probabilities on their corresponding tokens and, further, to implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers.


Results

Massive activations exist widely in LLMs, across various model sizes and families.



We conduct intervention analysis on massive activations. We find that their values remain largely constant across input sequences. Nonetheless, they play a crucial role in the internal computation of LLMs.
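
The sketch below illustrates one such intervention, reusing model and inputs from the sketch above: zero out the two massive-activation features at one layer's output and compare the language-modeling loss. This is a simplified stand-in for the paper's protocol; the layer index is an illustrative assumption.

import torch

MASSIVE_DIMS = [1415, 2533]   # the fixed feature dimensions observed for LLaMA2-7B

def zero_massive(module, args, output):
    # LLaMA decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden[..., MASSIVE_DIMS] = 0.0    # in-place edit of the residual stream
    return output

# Hook an early decoder layer (index chosen for illustration only).
handle = model.model.layers[3].register_forward_hook(zero_massive)
with torch.no_grad():
    loss_zeroed = model(**inputs, labels=inputs["input_ids"]).loss
handle.remove()

with torch.no_grad():
    loss_original = model(**inputs, labels=inputs["input_ids"]).loss
print(f"original loss {loss_original.item():.3f} | zeroed loss {loss_zeroed.item():.3f}")

Zeroing just these few values probes how much the model depends on them; replacing them with fixed, input-independent values (not shown) probes how constant they are.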




Intriguingly, massive activations are closely connected to the self-attention mechanism. Their emergence within LLMs immediately leads to attention probabilities concentrating on the tokens where they appear.
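
One rough way to see this, again reusing model and inputs from above (the layer index is illustrative, and loading the model with attn_implementation="eager" may be needed for attention weights to be returned):

import torch

with torch.no_grad():
    attentions = model(**inputs, output_attentions=True).attentions  # one (1, heads, q_len, k_len) tensor per layer

layer = 20                                  # illustrative choice
probs = attentions[layer][0]                # (num_heads, q_len, k_len)
# Average probability mass that later queries place on the starting token (key index 0).
mass_on_start = probs[:, 1:, 0].mean().item()
print(f"avg attention on the starting token at layer {layer}: {mass_on_start:.2f}")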




Delving into the computation within self-attention, we show that LLMs use massive activations to implement implicit bias terms in the attention output.
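
In our paraphrase of this idea (with S the set of tokens carrying massive activations, p_{t,k} the attention probability from query t to token k, and v_k the value vectors), the attention output splits into a near-constant contribution from S plus an input-dependent remainder, so the first term behaves like an additive bias:

Attn(q_t) = \sum_{k} p_{t,k} v_k
          = \sum_{k \in S} p_{t,k} v_k      (nearly constant across inputs: an implicit bias)
          + \sum_{k \notin S} p_{t,k} v_k   (input-dependent part)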




Furthermore, we find that explicit attention biases eliminate the need for LLMs to learn massive activations during pretraining.
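
The sketch below shows the general idea of such an explicit bias (an illustration, not the authors' exact formulation): self-attention augmented with learnable per-head key/value vectors k', v' that every query can attend to, so no token has to be repurposed to carry a bias.

import torch
import torch.nn.functional as F
from torch import nn

class AttentionWithKVBias(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # Learnable bias key/value (k', v'), one per head, prepended to every sequence.
        self.k_bias = nn.Parameter(torch.zeros(1, num_heads, 1, self.head_dim))
        self.v_bias = nn.Parameter(torch.zeros(1, num_heads, 1, self.head_dim))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)  # (B, H, T, d)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = torch.cat([self.k_bias.expand(B, -1, -1, -1), k], dim=2)     # (B, H, T+1, d)
        v = torch.cat([self.v_bias.expand(B, -1, -1, -1), v], dim=2)
        # Causal masking omitted for brevity.
        out = F.scaled_dot_product_attention(q, k, v)                    # (B, H, T, d)
        return self.proj(out.transpose(1, 2).reshape(B, T, C))

x = torch.randn(2, 16, 512)
print(AttentionWithKVBias(512, 8)(x).shape)   # torch.Size([2, 16, 512])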



Vision Transformers

Massive activations exist in many Vision Transformers (ViTs), e.g., CLIP ViT-L and DINOv2 ViT-L. They are located in fixed feature dimensions and random patch tokens.
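
A rough sketch of the analogous probe on a ViT, using DINOv2 ViT-L from torch.hub; the random input, block index, and 100x-median cutoff are stand-ins for illustration.

import torch

vit = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14").eval()

acts = {}
hook = vit.blocks[20].register_forward_hook(lambda m, i, o: acts.update(out=o))
with torch.no_grad():
    vit(torch.randn(1, 3, 224, 224))      # stand-in for a real, normalized image
hook.remove()

tokens = acts["out"][0]                   # (1 + num_patches, dim); CLS token first
mags = tokens.abs()
tok_idx, dim_idx = torch.nonzero(mags > 100 * mags.median(), as_tuple=True)
print(sorted(set(dim_idx.tolist())))      # feature dimensions hosting the outliers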





For more results and analysis on massive activations in LLMs and ViTs, please take a look at our full paper.

BibTeX

@article{sun2024massive,
        title    = {Massive Activations in Large Language Models}, 
        author   = {Sun, Mingjie and Chen, Xinlei and Kolter, J. Zico and Liu, Zhuang},
        year     = {2024},
        journal  = {arXiv preprint arXiv:2402.17762}
}

Acknowledgments

We thank Sachin Goyal, Jeremy Cohen, Timothée Darcet, Koustuv Sinha and Mike Rabbat for valuable discussions. Mingjie Sun was supported by funding from the Bosch Center for Artificial Intelligence.

Website adapted from the following template.