On Vanishing Variance in Transformer Length Generalization
Abstract
It is a widely known issue that Transformers, when trained on shorter sequences, fail to generalize robustly to longer ones at test time. This raises the question of whether Transformer models are real reasoning engines, despite their impressive abilities in mathematical problem solving and code synthesis. In this paper, we offer a vanishing variance perspective on this issue. To the best of our knowledge, we are the first to demonstrate that even for today’s frontier models, a longer sequence length results in a decrease in variance in the output of the multi-head attention modules. On the retrieval and dictionary lookup tasks, our experiments show that applying layer normalization after the attention outputs leads to significantly better length generalization. Our analyses attribute this improvement to a reduction—though not a complete elimination—of the distribution shift caused by vanishing variance. Project page: ruiningli.com/vanishing-variance.

1 Background: Vanishing Variance
It is no exaggeration to say that the Transformer (Vaswani et al., 2017) is the most important architecture in modern deep learning. It is widely adopted in almost every domain, ranging from natural language (Vaswani et al., 2017; Devlin et al., 2019; Brown et al., 2020) and vision (Dosovitskiy et al., 2021; Peebles & Xie, 2023) to audio (Radford et al., 2023) and protein design (Jumper et al., 2021). Despite these successes, recent studies (Press et al., 2022; Zhou et al., 2023; 2024; Kazemnejad et al., 2024; Veličković et al., 2024) of large language models (LLMs) have shown that Transformer-based models often struggle with length generalization, the ability to generalize to sequences longer than those seen during training. Several prior works propose to either refine position encodings (Ruoss et al., 2023; Zhou et al., 2024; Kazemnejad et al., 2024) or adapt the softmax function (Press et al., 2022; Veličković et al., 2024) to improve length generalization. However, these methods are largely ad hoc and lack interpretability, making it more of an art than a science to understand when and why they work.
In this paper, we study the distribution shift that occurs in the intermediate outputs when an attention module trained on shorter sequences is subsequently exposed to longer ones in a zero-shot manner. We hope that our findings will encourage future research on network architectures that are provably robust (e.g., invariant) to varying sequence lengths.
Background and notations.
At the core of Transformers is the attention mechanism (Vaswani et al., 2017). The attention first projects the input sequence $X \in \mathbb{R}^{n \times d}$, where $n$ is the sequence length and each item has $d$ features, into keys $K = X W_K$ and values $V = X W_V$ using learnable weight matrices $W_K, W_V \in \mathbb{R}^{d \times d_k}$. Similarly, the query sequence $Y \in \mathbb{R}^{m \times d}$ is projected into queries $Q = Y W_Q$ using $W_Q \in \mathbb{R}^{d \times d_k}$. The attention then computes $O = \operatorname{softmax}\big(Q K^\top / \sqrt{d_k}\big) V$ and projects it using another weight matrix $W_O$ to yield the final result $O W_O$. In this paper, we use the term “attention weights” to refer to the softmax score, i.e., $A = \operatorname{softmax}\big(Q K^\top / \sqrt{d_k}\big)$, “attention outputs” to refer to the intermediate $O = A V$, and “final result” to refer to $O W_O$.
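To make the notation concrete, the following is a minimal PyTorch sketch of a single attention call; the variable names mirror the symbols above, and the shapes are illustrative rather than taken from any particular model.

```python
import torch
import torch.nn.functional as F

def attention(X, Y, W_K, W_V, W_Q, W_O):
    """Single-head attention in the notation above.

    X: (n, d) input sequence; Y: (m, d) query sequence.
    W_K, W_V, W_Q: (d, d_k) projection matrices; W_O: (d_k, d) output projection.
    """
    K, V, Q = X @ W_K, X @ W_V, Y @ W_Q                      # keys, values, queries
    A = F.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)      # attention weights, (m, n)
    O = A @ V                                                # attention outputs, (m, d_k)
    return O @ W_O                                           # final result, (m, d)
```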
Our main observation is the vanishing variance problem: as the sequence length $n$ increases, the variance of the attention outputs (computed over multiple input sequences of length $n$) decreases. We formalize this as Proposition 1.
Proposition 1 (The vanishing variance problem).
Consider a trained attention module with weights $W_Q$, $W_K$, and $W_V$. Let $X = (x_1, \dots, x_n)$ denote an input sequence of length $n$. If (1) $x_1, \dots, x_n \overset{\text{i.i.d.}}{\sim} \mathcal{D}$, a distribution over a finite vocabulary, and (2) the values are centered, i.e., $\mathbb{E}_{x \sim \mathcal{D}}[W_V x] = \mathbf{0}$, then for a fixed query $q$ and a fixed feature index $f$,
$$\lim_{n \to \infty} \operatorname{Var}_X\big[(A V)_{1,f}\big] = 0,$$
where $A$ and $V$ are intermediate results in the attention computation: the attention weights and the values, respectively.
Informally, for a fixed component of the attention outputs, its variance over input sequences of length $n$, where each sequence consists of independently and identically distributed (i.i.d.) tokens, vanishes as $n \to \infty$.
Proof.
Please refer to Appendix A. ∎
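As a sanity check of Proposition 1, the following Monte-Carlo simulation (our own toy setup: random weights, a random embedding table standing in for a finite vocabulary, and arbitrary dimensions and sample counts) estimates the variance of one attention-output feature over many i.i.d. length-$n$ sequences; the estimate shrinks as $n$ grows.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, d_k, vocab_size = 16, 16, 100
E = torch.randn(vocab_size, d)                         # embeddings of a finite vocabulary
W_K, W_V, W_Q = (torch.randn(d, d_k) / d ** 0.5 for _ in range(3))
q = torch.randn(1, d)                                  # a fixed query

for n in [8, 32, 128, 512]:
    samples = []
    for _ in range(2000):                              # 2000 random length-n sequences
        X = E[torch.randint(vocab_size, (n,))]         # i.i.d. tokens
        A = F.softmax((q @ W_Q) @ (X @ W_K).T / d_k ** 0.5, dim=-1)
        samples.append((A @ (X @ W_V))[0, 0])          # a fixed attention-output feature
    print(n, torch.stack(samples).var().item())        # variance decreases with n
```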
Note that the assumptions of Proposition 1 are violated in practice. In particular, the independence assumption does not hold in LLMs because of (1) the introduction of positional encoding and, more significantly, (2) the nature of language, where preceding words provide important context for those that follow. In addition, assumption (2) is not strictly enforced. Nevertheless, we find that even for today’s frontier LLMs, the decay in attention output variance, as established in Proposition 1, remains pronounced. In Fig. 1, we plot the standard deviation $\sigma$ of a fixed component of the attention outputs from the first layer of Llama-3.2-1B (AI@Meta, 2024) as a function of input sequence length $n$. $\sigma$ is computed over length-$n$ sequences sampled randomly with three strategies: (i) Random Tokens w/o P.E.: we sample single tokens i.i.d. uniformly at random from the tokenizer’s vocabulary and remove the positional encoding for inference; (ii) Random Tokens w/ P.E.: we still sample single tokens i.i.d. uniformly at random, but keep the positional encoding at inference time; (iii) Sentences w/ P.E.: we sample consecutive sentences from a long paragraph (obtained from https://github.com/dscape/spell/blob/master/test/resources/big.txt) and truncate the token sequences to length $n$; such sequences lie within the LLM’s training distribution. As can be seen in the log-log plot, for Random Tokens w/o P.E., where the independence assumption does hold, $\sigma$ decays with input sequence length roughly as a power law. For Random Tokens w/ P.E. and Sentences w/ P.E., where this assumption no longer holds, the downward trend is still evident.
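The measurement in Fig. 1 can be reproduced along the following lines. This sketch covers only the Random Tokens w/ P.E. strategy and captures the attention outputs with a forward pre-hook on the attention output projection; the module path (`model.model.layers[0].self_attn.o_proj`), sample counts, and sequence lengths are assumptions tied to the current Hugging Face LLaMA implementation and may need adjusting.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-1B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

captured = []
def grab_attention_outputs(module, args):
    # args[0]: input to o_proj, i.e. the attention outputs AV, shape (batch, seq_len, hidden)
    captured.append(args[0][:, -1, 0].detach())        # one fixed feature at the last position

hook = model.model.layers[0].self_attn.o_proj.register_forward_pre_hook(grab_attention_outputs)

for n in [64, 256, 1024]:
    captured.clear()
    for _ in range(128):                               # 128 random length-n token sequences
        ids = torch.randint(tok.vocab_size, (1, n))
        with torch.no_grad():
            model(input_ids=ids)
    print(n, torch.cat(captured).std().item())         # standard deviation decays with n

hook.remove()
```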
2 Layer Normalization for Length Generalization
Given that the variance of attention outputs vanishes as sequences grow longer, we next investigate whether and how this contributes to the performance degradation observed in LLMs. To this end, we perform a toy study on the statistical behavior of attention output values.
For simplicity, we consider a one-layer Transformer with single-head attention, omitting residual connections and normalization, following Veličković et al. (2024). We adopt this architecture as our Baseline. The model receives a single query token and an input sequence of varying length $n$, and must perform simple algorithmic tasks. To eliminate confounds from positional encodings, we focus on order-invariant tasks, where the output depends only on the multiset (not the order) of input tokens, namely retrieval and dictionary lookup. Our goal is to evaluate models trained on shorter sequences using longer (i.e., out-of-distribution in length) sequences to study length generalization. More details of the model architecture and synthetic data generation are provided in Appendix B.
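For concreteness, below is a minimal PyTorch sketch of the Baseline; the hyper-parameters (dimensions, MLP width, number of output classes) are illustrative and do not necessarily match our exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Baseline(nn.Module):
    """One attention layer, single head, no positional encoding, no residual
    connection, no normalization; an MLP maps the attention output to class logits."""

    def __init__(self, d_model=64, d_k=64, n_classes=16):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_k, bias=False)
        self.W_K = nn.Linear(d_model, d_k, bias=False)
        self.W_V = nn.Linear(d_model, d_k, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d_k, 4 * d_k), nn.ReLU(),
                                 nn.Linear(4 * d_k, n_classes))

    def attention_outputs(self, query, seq):
        # query: (batch, d_model); seq: (batch, n, d_model)
        Q = self.W_Q(query).unsqueeze(1)                                  # (batch, 1, d_k)
        K, V = self.W_K(seq), self.W_V(seq)                               # (batch, n, d_k)
        A = F.softmax(Q @ K.transpose(1, 2) / K.shape[-1] ** 0.5, dim=-1)
        return (A @ V).squeeze(1)                                         # (batch, d_k)

    def forward(self, query, seq):
        return self.mlp(self.attention_outputs(query, seq))               # class logits
```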
In Fig. 2, we visualize the distribution of individual components of the attention outputs across multiple input sequences of increasing length, obtained with a model checkpoint trained only on shorter sequences. As can be seen in the top row, testing on out-of-distribution sequence lengths leads to vanishing variance, causing a distribution shift where each individual component of the attention outputs becomes more concentrated around its mean.

While this distribution shift of individual features is expected according to Proposition 1, we are more interested in the distribution shift of the entire feature vector of attention outputs, as the whole vector is subsequently input to an MLP to predict the final result. As the input sequence length increases, each feature is less likely to take extreme values (as its distribution becomes more concentrated). Consequently, the global feature variance, defined as the variance of all entries of the attention outputs pooled over samples and feature dimensions, i.e., $\frac{1}{B d_k} \sum_{b=1}^{B} \sum_{f=1}^{d_k} (O_{b,f} - \mu)^2$, where $O \in \mathbb{R}^{B \times d_k}$ collects the attention outputs of $B$ samples and $\mu$ is the global mean, also decreases. We illustrate this observation in Fig. 3 (right), where the global variance decays as $n$ increases. In Fig. 3 (left), we show that, in addition to the global variance, the global mean also exhibits drift. Such a distribution shift in attention outputs (and thus MLP inputs) hinders generalization, since the MLP is only trained on features with a larger global variance and a different global mean (Zhou et al., 2022).
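Continuing the Baseline sketch above, the global statistics plotted in Fig. 3 can be estimated by pooling the attention outputs over both samples and feature dimensions; `sample_batch` is a hypothetical task-specific data sampler returning `(query, seq, target)`.

```python
def global_stats(model, sample_batch, seq_len, n_batches=50):
    """Global mean and global feature variance of the attention outputs at a given length."""
    outs = []
    with torch.no_grad():
        for _ in range(n_batches):
            query, seq, _ = sample_batch(seq_len)
            outs.append(model.attention_outputs(query, seq))     # (batch, d_k)
    O = torch.cat(outs)                                          # pool over samples
    mu = O.mean()                                                # global mean
    return mu.item(), ((O - mu) ** 2).mean().item()              # global feature variance
```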

To mitigate this distribution shift, we explore applying layer normalization (Ba et al., 2016) immediately after the attention outputs, i.e., $\tilde{O}_b = \gamma \odot \frac{O_b - \mu_b}{\sigma_b} + \beta$, where $O$ is the batched attention outputs, $\mu_b$ and $\sigma_b$ are the mean and standard deviation of sample $b$'s features, and $\gamma, \beta \in \mathbb{R}^{d_k}$ are learnable scale and shift parameters. While the variance decay in individual features is inevitable (bottom row of Fig. 2), standardization together with the learnable scale and shift parameters helps stabilize the feature distribution. This adjustment preserves the global mean and variance more effectively as sequence length increases (Fig. 3), which enhances length generalization, as discussed next.
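In the sketch from Section 2, the “+ LN” variant is a small change: wrap the attention outputs in an `nn.LayerNorm` before the MLP. The class below extends the hypothetical `Baseline` sketch above.

```python
class BaselineWithLN(Baseline):
    """Baseline (+ LN): layer normalization applied to the attention outputs."""

    def __init__(self, d_model=64, d_k=64, n_classes=16):
        super().__init__(d_model, d_k, n_classes)
        self.post_ln = nn.LayerNorm(d_k)          # learnable scale (gamma) and shift (beta)

    def forward(self, query, seq):
        O = self.attention_outputs(query, seq)    # (batch, d_k)
        return self.mlp(self.post_ln(O))          # standardize per sample, then rescale and shift
```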
3 Experiments
We consider two tasks: retrieval and dictionary lookup. The former has been considered by Veličković et al. (2024). The latter closely resembles the core function of the attention mechanism (i.e., to retrieve the most relevant information based on the similarity between queries and keys). As detailed in Appendix B, the order of input tokens does not affect the target output in either task. By deliberately selecting such tasks, we isolate and examine the length generalization capabilities of the attention mechanism itself, independent of any effects introduced by positional encodings (Zhou et al., 2024). We generate synthetic data with short input sequences to train the models, and evaluate them on substantially longer sequences.
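The evaluation protocol is train-short / test-long; the loop below sketches it with placeholder lengths (not the exact values used in our experiments) and a hypothetical task sampler `sample_batch` returning `(query, seq, target)`.

```python
def accuracy(model, sample_batch, seq_len, n_batches=50):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for _ in range(n_batches):
            query, seq, target = sample_batch(seq_len)
            pred = model(query, seq).argmax(dim=-1)
            correct += (pred == target).sum().item()
            total += target.numel()
    return correct / total

TRAIN_MAX_LEN = 16                    # placeholder cap on training sequence length
TEST_LENS = [16, 32, 64, 128, 256]    # placeholder evaluation lengths
# After training `model` on sequences of length <= TRAIN_MAX_LEN:
# for n in TEST_LENS:
#     print(n, accuracy(model, sample_batch, n))
```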
3.1 Results and Analysis
The results presented in Table 1 and Table 2 indicate that applying layer normalization to the attention outputs leads to consistently better accuracy on out-of-distribution sequence lengths, with statistical significance confirmed by a paired $t$-test over training runs initialized with different random seeds.
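The test pairs the two models by training seed; a sketch using SciPy, with made-up accuracy values for illustration only (not results from our experiments):

```python
from scipy.stats import ttest_rel

# One accuracy per training seed, at a fixed out-of-distribution test length.
# The numbers below are placeholders for illustration, not results from the paper.
acc_baseline = [0.61, 0.58, 0.64, 0.60, 0.63]
acc_with_ln  = [0.82, 0.79, 0.85, 0.80, 0.84]

t_stat, p_value = ttest_rel(acc_with_ln, acc_baseline)   # paired: same seed, two models
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```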
Test-time adaptation and fine-tuning are common techniques for improving length generalization in transformers (Anil et al., 2022; Veličković et al., 2024). To show that the benefits of layer normalization are orthogonal to these techniques, we implement the adaptive temperature method from Veličković et al. (2024) in both architectures, with and without layer normalization. Combined with test-time adaptation, layer normalization still yields a significant improvement. In Fig. 4, we demonstrate that layer normalization also mitigates dispersion (Veličković et al., 2024).
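As a rough illustration of what inference-time sharpening looks like, the function below rescales the logits with a temperature that decreases as the entropy of the initial attention weights grows. This is a simplified stand-in written for illustration, not the exact adaptive temperature procedure of Veličković et al. (2024); the entropy-to-temperature mapping and `max_sharpening` are arbitrary choices of ours.

```python
def sharpened_attention_weights(logits, max_sharpening=4.0):
    # logits: (batch, 1, n) pre-softmax attention scores for a single query
    A = F.softmax(logits, dim=-1)
    entropy = -(A * (A + 1e-9).log()).sum(dim=-1, keepdim=True)            # (batch, 1, 1)
    max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))         # entropy of uniform weights
    theta = 1.0 / (1.0 + (max_sharpening - 1.0) * entropy / max_entropy)   # in (1/max_sharpening, 1]
    return F.softmax(logits / theta, dim=-1)                               # sharper than the original A
```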

[Table 1. Rows: Baseline, Baseline (+ LN), $p$-value (w.o. test-time adaptation); Adaptive, Adaptive (+ LN), $p$-value (w. test-time adaptation). Numeric entries omitted.]
[Table 2. Same layout as Table 1: Baseline, Baseline (+ LN), $p$-value (w.o. test-time adaptation); Adaptive, Adaptive (+ LN), $p$-value (w. test-time adaptation). Numeric entries omitted.]
Does layer normalization alleviate distribution shift?
Layer normalization does alleviate—but not eliminate—distribution shift. With layer normalization, the global mean and global variance remain more stable on out-of-distribution sequence lengths (Fig. 3). However, the variance of fixed components in attention outputs still decays, regardless of layer normalization (Fig. 2).
3.2 Ablations
In addition to layer normalization, we explore an alternative normalization strategy in which we standardize (“std.” in Table 3) the attention outputs across the features, without the learnable scale and shift parameters present in LN, i.e., $\tilde{O}_b = \frac{O_b - \mu_b}{\sigma_b}$, where $\mu_b$ and $\sigma_b$ are computed in the same manner as in layer normalization.
As shown in Table 3, which reports the relative accuracy gain over the Baseline on the retrieval task, standardization improves length generalization even though it strictly constrains model capacity. This underscores the importance (and potential benefits) of addressing the observed distribution shift. LN outperforms standardization, as confirmed by the paired $t$-test. Similar ablation results on the dictionary lookup task can be found in Section B.3.
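The two variants differ only in the learnable parameters; a sketch, continuing the Baseline sketch above (where $d_k = 64$):

```python
def standardize(O, eps=1e-5):
    """'std.' variant: per-sample standardization over features, no learnable parameters."""
    mu = O.mean(dim=-1, keepdim=True)
    var = O.var(dim=-1, unbiased=False, keepdim=True)
    return (O - mu) / torch.sqrt(var + eps)

layer_norm = nn.LayerNorm(64)   # '+ LN': the same standardization plus learnable gamma and beta
```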
[Table 3. Relative accuracy gain over the Baseline on the retrieval task. Rows: (+ std.), (+ LN), $p$-value. Numeric entries omitted.]
4 Related Work
Positional encoding for length generalization.
Many works have attributed the inability of Transformers to extrapolate to longer sequences to positional encoding. Several alternatives to the sinusoidal positional encoding originally introduced by Vaswani et al. (2017) have been proposed to enhance the performance of Transformer-based models on natural language processing (NLP) tasks, including relative positional encoding (Shaw et al., 2018; Dai et al., 2019), rotary positional encoding (Su et al., 2024), no positional encoding (Haviv et al., 2022), and randomized positional encoding (Ruoss et al., 2023). Several studies have examined the impact of different variants of positional encoding on length generalization (Chi et al., 2022; Ruoss et al., 2023; Kazemnejad et al., 2024; Li et al., 2024; Peng et al., 2024; Zhou et al., 2024). Unlike prior work that explores positional encoding for length generalization, we focus on algorithmic tasks that are order-invariant, and we present a vanishing variance perspective on length generalization that is orthogonal to the extrapolation ability of positional encodings.
Alternatives to softmax attention.
The output of the softmax has been utilized to interpret the inner workings of Transformers (Xu et al., 2015; Choi et al., 2016; Martins & Astudillo, 2016). More recently, Veličković et al. (2024) demonstrated that the attention weights output by the softmax disperse as sequence length increases, and identified this dispersion as a cause of the Transformer’s limited capability in length generalization. In this paper, we show that this dispersion leads to the vanishing variance problem in the intermediate attention outputs. While many variants of softmax attention have been introduced (Correia et al., 2019; Press et al., 2022; Tan et al., 2024; Ye et al., 2024), they are motivated mostly by interpretability rather than by the distribution of attention outputs and its effect on length generalization. To the best of our knowledge, none of the existing works fundamentally eliminates the vanishing variance problem presented in this paper. We hope our study can motivate designs of network architectures that are provably invariant to sequence length variations.
5 Conclusion
In this paper, we have introduced the vanishing variance problem and provided both theoretical analysis and empirical evidence demonstrating its role in inducing distribution shift in attention outputs. This shift hinders the ability of Transformers to generalize effectively to out-of-distribution sequence lengths. We demonstrated that mitigating this distribution shift through techniques like layer normalization and standardization—despite potential trade-offs in model expressiveness—significantly improves length generalization in attention models.
Future work.
We conduct our experiments using a single-layer, single-head attention architecture for simplicity, while real-world models typically use multi-layer, multi-head attention; our conclusions may therefore not fully generalize to these more complex architectures. Future work may validate the normalization strategies on larger benchmarks such as CLRS (Veličković et al., 2022) and on real-world LLMs. Moreover, layer normalization only partially mitigates the distribution shift presented in this paper, and is already widely adopted in Transformers (though not immediately after the attention outputs). Future work may design architectures that are provably invariant to sequence length variations.
References
- AI@Meta (2024) AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
- Anil et al. (2022) Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. Advances in Neural Information Processing Systems, 35:38546–38556, 2022.
- Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
- Chi et al. (2022) Ta-Chung Chi, Ting-Han Fan, Peter J Ramadge, and Alexander Rudnicky. KERPLE: Kernelized relative positional embedding for length extrapolation. NeurIPS, 2022.
- Choi et al. (2016) Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism. NeurIPS, 2016.
- Correia et al. (2019) Gonçalo M Correia, Vlad Niculae, and André FT Martins. Adaptively sparse transformers. arXiv preprint arXiv:1909.00015, 2019.
- Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, 2019.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
- Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Haviv et al. (2022) Adi Haviv, Ori Ram, Ofir Press, Peter Izsak, and Omer Levy. Transformer language models without positional encodings still learn positional information. In EMNLP, 2022.
- Jumper et al. (2021) John M. Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A Kohl, Andy Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman, Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W. Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis. Highly accurate protein structure prediction with alphafold. Nature, 2021.
- Kazemnejad et al. (2024) Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. NeurIPS, 2024.
- Li et al. (2024) Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. Functional interpolation for relative positions improves long context transformers. In ICLR, 2024.
- Martins & Astudillo (2016) Andre Martins and Ramon Astudillo. From softmax to sparsemax: A sparse model of attention and multi-label classification. In ICML, 2016.
- Peebles & Xie (2023) William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
- Peng et al. (2024) Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In ICLR, 2024.
- Press et al. (2022) Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In ICLR, 2022.
- Radford et al. (2023) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In ICML, 2023.
- Ruoss et al. (2023) Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, and Joel Veness. Randomized positional encodings boost length generalization of transformers. In ACL, 2023.
- Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
- Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 2024.
- Tan et al. (2024) Shawn Tan, Yikang Shen, Songlin Yang, Aaron Courville, and Rameswar Panda. Stick-breaking attention. arXiv preprint arXiv:2410.17980, 2024.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
- Veličković et al. (2022) Petar Veličković, Adrià Puigdomènech Badia, David Budden, Razvan Pascanu, Andrea Banino, Misha Dashevskiy, Raia Hadsell, and Charles Blundell. The CLRS algorithmic reasoning benchmark. In ICML, 2022.
- Veličković et al. (2024) Petar Veličković, Christos Perivolaropoulos, Federico Barbero, and Razvan Pascanu. softmax is not enough (for sharp out-of-distribution). arXiv preprint arXiv:2410.01104, 2024.
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
- Ye et al. (2024) Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei. Differential transformer. arXiv preprint arXiv:2410.05258, 2024.
- Zhou et al. (2023) Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, and Preetum Nakkiran. What algorithms can transformers learn? a study in length generalization. arXiv preprint arXiv:2310.16028, 2023.
- Zhou et al. (2022) Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. PAMI, 2022.
- Zhou et al. (2024) Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, and Denny Zhou. Transformers can achieve length generalization but not robustly. arXiv preprint arXiv:2402.09371, 2024.
Appendix A Proof of Proposition 1
Proposition 1 (The vanishing variance problem).
Consider a trained attention module with weights $W_Q$, $W_K$, and $W_V$. Let $X = (x_1, \dots, x_n)$ denote an input sequence of length $n$. If (1) $x_1, \dots, x_n \overset{\text{i.i.d.}}{\sim} \mathcal{D}$, a distribution over a finite vocabulary, and (2) the values are centered, i.e., $\mathbb{E}_{x \sim \mathcal{D}}[W_V x] = \mathbf{0}$, then for a fixed query $q$ and a fixed feature index $f$,
$$\lim_{n \to \infty} \operatorname{Var}_X\big[(A V)_{1,f}\big] = 0,$$
where $A$ and $V$ are intermediate results in the attention computation: the attention weights and the values, respectively.
Informally, for a fixed component of the attention outputs, its variance over input sequences of length $n$, where each sequence consists of independently and identically distributed (i.i.d.) tokens, vanishes as $n \to \infty$.
Proof.
Let $v_i := (W_V x_i)_f$ denote the $f$-th feature of the value of the $i$-th token. Firstly, we argue that $v_i$ and $v_j$ are independent for $i \neq j$. Let $\pi_f$ be the projection onto the $f$-th coordinate, namely $\pi_f(u) = u_f$ for $u \in \mathbb{R}^{d_k}$. Observe that $v_i = \pi_f(W_V x_i)$ and $v_j = \pi_f(W_V x_j)$. By assumption (1), $x_i$ and $x_j$ are independent for $i \neq j$. Since $v_i$ and $v_j$ are measurable functions of independent random variables, they are independent for $i \neq j$. By assumption, $x_i$ and $x_j$ are identically distributed for every $i, j$, so $v_i$ and $v_j$ are identically distributed. Thus, $\operatorname{Var}(v_i)$ depends only on $f$, not on $i$. Set $\sigma^2 := \operatorname{Var}(v_1)$; it is finite since the vocabulary is finite and thus compact and bounded.
Let $a \in \mathbb{R}^{1 \times n}$ denote the attention weights; since the query sequence consists of a single item $q$, $a$ is a single row, and we write $a_i$ for its $i$-th element, so that $(AV)_{1,f} = \sum_{i=1}^{n} a_i v_i$. We expand $\operatorname{Var}_X\big[\sum_{i=1}^{n} a_i v_i\big]$ using the fact that the attention weights are non-negative and sum to one, the independence of $v_i$ and $v_j$ for $i \neq j$, and assumption (2), which gives $\mathbb{E}[v_i] = 0$ for every $i$. Since the tokens come from a finite dictionary and the logit of each token is a continuous function of that token, which ranges over a compact (finite) set, the logits are bounded: they are the continuous image of a compact set, and every compact set on the real line is closed and bounded. By Lemma 2.1 of Veličković et al. (2024), there exist a constant $c > 0$ and an integer $n_0$ such that for every $n \geq n_0$, $\max_i a_i \leq c / n$. Combining the above yields, for every $n \geq n_0$, an upper bound on $\operatorname{Var}_X\big[(AV)_{1,f}\big]$ that decays to zero as $n$ grows.
Let $\epsilon > 0$. There exists $n_1 \geq n_0$ such that for every $n \geq n_1$, this upper bound is smaller than $\epsilon$. Hence $\operatorname{Var}_X\big[(AV)_{1,f}\big] < \epsilon$ for every $n \geq n_1$, which proves the claim.
∎
[Table 4. Rows: Baseline, Baseline (+ LN), $p$-value. Numeric entries omitted.]
[Table 5. Rows: Baseline, Baseline (+ LN), $p$-value. Numeric entries omitted.]
[Table 6. Rows: (+ std.), (+ LN), $p$-value. Numeric entries omitted.]
Appendix B More Experimental Details
B.1 Implementation Details
retrieval.
lookup.
The network architecture is the same as for the retrieval task. We generate data for training and evaluation in the following way: for each item of the length-$n$ sequence, we sample a value class i.i.d. at random; each item also has a key class drawn from a fixed set of key classes. The key classes of all items in the sequence are sampled without replacement. The numbers of key and value classes are fixed across our experiments.
The feature vector of each item is defined as the concatenation of the embeddings of its key class and its value class. The embedding vectors for each (key and value) class are optimized jointly with the attention network.
The query sequence in our case is guaranteed to be of length $1$: we sample a key class present in the input sequence and use its embedding vector as the query.
For this task, we found that the optimization usually converges early in training. We train the attention network, together with the embedding vectors, in PyTorch with the same hyper-parameter setup as the retrieval task.
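A sketch of the dictionary-lookup data generation described above; `e_key` and `e_value` are the learnable `nn.Embedding` tables, and the batch size and the zero-padding of the query to the item dimension are implementation choices of this sketch rather than our exact setup.

```python
def sample_lookup_batch(e_key, e_value, seq_len, batch_size=128):
    n_key, d = e_key.num_embeddings, e_key.embedding_dim
    n_value = e_value.num_embeddings                     # requires seq_len <= n_key
    # Key classes: sampled without replacement within each sequence.
    keys = torch.stack([torch.randperm(n_key)[:seq_len] for _ in range(batch_size)])
    # Value classes: sampled i.i.d. for every item.
    values = torch.randint(n_value, (batch_size, seq_len))
    # Item features: concatenation of key and value embeddings.
    seq = torch.cat([e_key(keys), e_value(values)], dim=-1)          # (batch, seq_len, 2d)
    # Query: the embedding of a key class present in the sequence (zero-padded to 2d here).
    pick = torch.randint(seq_len, (batch_size,))
    query_keys = keys[torch.arange(batch_size), pick]
    query = torch.cat([e_key(query_keys), torch.zeros(batch_size, d)], dim=-1)
    target = values[torch.arange(batch_size), pick]                  # value stored under that key
    return query, seq, target
```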
B.2 Results When Training on More Diverse Sequence Lengths
To validate the utility of normalization when the length gap between the training sequences and the test sequences is smaller, we follow the same experimental setup as in Section 3, but sample longer sequences during training. We found it beneficial to gradually increase the length of the sampled sequences throughout training, as is commonly done during pre-training of frontier LLMs (AI@Meta, 2024). The results are reported in Table 4 and Table 5. With layer normalization, the accuracies on out-of-distribution sequence lengths are significantly higher than without, on both tasks, demonstrating the importance of normalization for length generalization across training settings.
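A minimal sketch of such a gradually increasing length cap; the schedule, minimum, and maximum lengths are illustrative rather than our exact values.

```python
def sample_train_length(step, total_steps, min_len=4, max_len=64):
    """Linearly raise the length cap over the first 80% of training, then keep it at max_len."""
    progress = min(1.0, step / (0.8 * total_steps))
    cap = min_len + int((max_len - min_len) * progress)
    return torch.randint(min_len, cap + 1, (1,)).item()   # sample a length up to the current cap
```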
B.3 Ablations on the Dictionary Lookup Task
Ablation results on the dictionary lookup task are shown in Table 6, which are consistent with the results on the retrieval task presented in Section 3.2. However, on this task, the performance of standardization and layer normalization is more similar, as indicated by the larger -values, suggesting weaker statistical evidence for a significant difference.