Dispersion loss counteracts embedding condensation and improves generalization in small language models

*Equal contribution
ICML 2026

One-liner summary

What makes LLMs better than small LMs? Data? Parameters? Geometry might play a role!

What is embedding condensation?

Every Transformer layer of a language model represents each input token as a vector in a high-dimensional embedding space. We notice that as those vectors progress through Transformer layers, they often behave as if they were confined to a narrow cone: they point in increasingly similar directions, as measured by pairwise cosine similarity. We call this geometric phenomenon embedding condensation. This phenomenon is:

Feature 1

More severe in smaller models than in larger counterparts (Figure 2).

Feature 2

Reproducible under confounder-controlled settings (Figure 3).

Feature 3

Present at model initialization and alleviated by pre-training (Figure 4).

Feature 4

Not resolved by knowledge distillation from a larger model (Figure 5).
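
The quantity behind all four observations is the pairwise cosine similarity between token embeddings at a given layer. A minimal NumPy sketch of that measurement follows. It is an illustration rather than the paper's analysis code: the function names are hypothetical, and the "layers" are synthetic arrays built to mimic condensation, not real model outputs.

```python
import numpy as np

def pairwise_cosine_similarity(embeddings):
    """Return the (tokens x tokens) cosine similarity matrix for one layer."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def mean_offdiag_cosine(embeddings):
    """Average cosine similarity over all pairs of distinct tokens."""
    sim = pairwise_cosine_similarity(embeddings)
    n = sim.shape[0]
    return float(sim[~np.eye(n, dtype=bool)].mean())

# Synthetic stand-in for per-layer hidden states (assumption, not model output).
rng = np.random.default_rng(0)
num_tokens, dim = 32, 64
base = rng.normal(size=(num_tokens, dim))
shared_direction = rng.normal(size=(1, dim))

# Each successive "layer" mixes in more of one shared direction,
# which pulls all token vectors toward the same cone.
layers = [base + strength * shared_direction for strength in (0.0, 1.0, 4.0)]
per_layer = [mean_offdiag_cosine(h) for h in layers]
print(per_layer)  # values grow as the shared direction dominates
```

With a real model, the per-layer arrays would be the hidden states of a single input sequence (Hugging Face Transformers, for instance, returns them when `output_hidden_states=True` is passed to the forward call).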

A 5-minute intro to this paper

This paper presents an observation-driven improvement on language model training.

We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in smaller language models. We then design a training objective called dispersion loss to counteract the effect.

Figure 1. Illustration of the embedding condensation phenomenon. In pre-trained language models, embeddings of all tokens from the same input sequence condense into a narrow cone after being processed by many Transformer layers. This phenomenon is substantially more pronounced in smaller models than in larger models within the same family, which motivates our hypothesis in Section 3.3.

Feature 1: Larger model, less condensation.
Within the same model family, smaller models exhibit more severe embedding condensation, with token embeddings collapsing toward near-parallel directions, while larger models resist this collapse.

Figure 2. Qualitative and quantitative observations of the embedding condensation phenomenon. a. The cosine similarity heatmaps demonstrate that smaller models (e.g., GPT2, Qwen3-0.6B) are susceptible to condensation, since token cosine similarities become increasingly positive as the embeddings proceed to deeper layers. In contrast, larger models (e.g., GPT2-xl, Qwen3-32B) are more resistant to embedding condensation. b. Quantifications using Spearman correlation and Kendall’s Tau demonstrate a consistent trend of “larger model, less condensation” across multiple families of language models. Additional results can be found in Figure S1.
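
One plausible way to turn "similarities become increasingly positive with depth" into a single number is a rank correlation between layer index and mean cosine similarity. The sketch below assumes that reading; the exact variables correlated in the paper may differ. The helper names are hypothetical, the per-layer values are made-up toy numbers, and the rank helper assumes no ties.

```python
import numpy as np

def ranks(values):
    """Ranks 0..n-1 of the values (assumes no ties, true for the toy data below)."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(np.asarray(x)), ranks(np.asarray(y)))[0, 1])

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return float(total / (n * (n - 1) / 2))

layer_index = np.arange(6)
# Toy per-layer mean cosine similarities (illustrative numbers, not results).
condensing = np.array([0.05, 0.12, 0.30, 0.45, 0.70, 0.85])  # rises with depth
resistant = np.array([0.05, 0.09, 0.04, 0.10, 0.06, 0.08])   # no clear trend

for name, series in [("condensing", condensing), ("resistant", resistant)]:
    print(name, spearman(layer_index, series), kendall_tau(layer_index, series))
```

Under this reading, a model whose similarity climbs steadily with depth scores near 1 on both statistics, while a model that resists condensation scores much lower.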

This effect is also quite robust to the choice of input datasets.

Figure S2. The embedding condensation effect is consistent regardless of the input text dataset. Results are shown for four datasets, namely (a) wikitext, (b) pubmed_qa, (c) imdb, and (d) squad.

Feature 2: Reproducible when controlling for confounders.
To isolate the effect of model size from other confounding factors, we conduct a controlled experiment where we pre-train GPT2-like models, varying only the MLP dimension while keeping all other components fixed, including the number of layers, embedding dimension, dataset, and training settings. The same phenomenon is observed.

Figure 3. In a highly controlled experiment, we reproduced the observation of “larger model, less condensation”. We pre-trained four GPT2-like models of varying sizes that differ only in MLP dimension, while keeping all other factors fixed, including the number of layers, embedding dimension, dataset, and training configuration. The resulting models exhibit consistent trends in embedding condensation, shown qualitatively (panel a) and quantitatively (panel b). Horizontal dashed lines are added to panel a for easier visual comparison.
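
To make the controlled setup concrete, here is a small sketch of a size sweep in which only the MLP dimension changes. The config class, field names, and every numeric value are hypothetical placeholders chosen for illustration; they are not the configurations used in the paper. The parameter count covers weight matrices only.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class GPT2LikeConfig:
    # Placeholder values for illustration, not the paper's settings.
    n_layer: int = 12
    d_model: int = 768
    d_mlp: int = 3072
    vocab_size: int = 50257
    n_positions: int = 1024

def approx_param_count(cfg):
    """Weight-matrix parameters only (biases and LayerNorm omitted for brevity)."""
    attention = 4 * cfg.d_model * cfg.d_model   # Q, K, V and output projections
    mlp = 2 * cfg.d_model * cfg.d_mlp           # up-projection and down-projection
    embeddings = (cfg.vocab_size + cfg.n_positions) * cfg.d_model
    return cfg.n_layer * (attention + mlp) + embeddings

base = GPT2LikeConfig()
# Four models that differ only in MLP width; everything else is shared.
sweep = [replace(base, d_mlp=m) for m in (768, 1536, 3072, 6144)]
counts = [approx_param_count(c) for c in sweep]

for cfg, count in zip(sweep, counts):
    print(f"d_mlp={cfg.d_mlp:5d}  approx params={count / 1e6:.1f}M")
```

Because depth and embedding dimension stay fixed, any difference in condensation across such a sweep cannot be attributed to those two factors.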

Feature 3: Condensation occurs early on.
The embedding condensation phenomenon emerges at model initialization and is gradually mitigated, not exacerbated, by pre-training.

Figure 4. Embedding condensation is observed immediately after model initialization. We analyze checkpoints of Olmo-3-1025-7B spanning initialization, intermediate pre-training stages, and the final base model. Each checkpoint is annotated by its training stage and the number of training tokens.

Feature 4: Distillation is not a solution.
Knowledge distillation from a larger model does not transfer the desired resistance to embedding condensation.

Figure 5. Knowledge distillation is not a remedy to embedding condensation, shown qualitatively (panel a) and quantitatively (panel b).

Dispersion loss
Embedding condensation reduces the expressivity of Transformers by collapsing token embedding vectors into narrow cones, under-utilizing the representation space. We hypothesize that by dispersing embeddings during training, smaller models can achieve representational qualities more similar to larger models, thus narrowing the performance gap without increasing the number of parameters.

Figure 6. Illustration of how dispersion loss and its alternative formulations promote embedding dispersion. a. Dispersion loss enforces uniform angular dispersion by spreading out all pairs along the unit hypersphere. b. Decorrelation loss encourages different feature dimensions to remain uncorrelated. c. ℓ2-repel loss increases pairwise Euclidean distance, while the norm regularization prevents unbounded expansion. d. Orthogonalization loss spreads out vectors forming acute angles while leaving obtuse ones unchanged.
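
The three alternative formulations in panels b to d can be sketched as below. This is one possible reading of the caption, not the exact formulas from the paper: the function names, the temperature `tau`, the weight `norm_weight`, and the squared hinge are all assumptions, and the angular distance is taken to be arccos(cosine) / pi so that 0.5 means orthogonal.

```python
import numpy as np

def unit_rows(z):
    return z / np.clip(np.linalg.norm(z, axis=1, keepdims=True), 1e-12, None)

def offdiag(m):
    """Flatten the off-diagonal entries of a square matrix."""
    return m[~np.eye(m.shape[0], dtype=bool)]

def log_mean_exp(x):
    """Numerically stable log(mean(exp(x))) via the max-shift trick."""
    m = np.max(x)
    return float(m + np.log(np.mean(np.exp(x - m))))

def decorrelation_loss(z):
    """Penalize correlation between different feature dimensions."""
    zc = z - z.mean(axis=0, keepdims=True)
    zc = zc / np.clip(zc.std(axis=0, keepdims=True), 1e-12, None)
    corr = (zc.T @ zc) / z.shape[0]
    return float(np.mean(offdiag(corr) ** 2))

def l2_repel_loss(z, tau=1.0, norm_weight=0.01):
    """Reward large pairwise Euclidean distances; the norm term bounds expansion."""
    sq_dist = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    repel = log_mean_exp(-offdiag(sq_dist) / tau)
    return float(repel + norm_weight * np.mean(np.sum(z ** 2, axis=1)))

def orthogonalization_loss(z, margin=0.5):
    """Hinge on angular distance: only pairs at acute angles are penalized."""
    u = unit_rows(z)
    cos = np.clip(u @ u.T, -1.0, 1.0)
    ang = np.arccos(offdiag(cos)) / np.pi  # 0 parallel, 0.5 orthogonal, 1 opposite
    return float(np.mean(np.maximum(0.0, margin - ang) ** 2))

orthogonal = np.eye(4)
parallel = np.tile(np.array([[1.0, 0.0, 0.0, 0.0]]), (4, 1))
print(orthogonalization_loss(orthogonal), orthogonalization_loss(parallel))
print(l2_repel_loss(orthogonal), l2_repel_loss(parallel))
```

In this sketch, fully condensed (parallel) vectors incur a positive orthogonalization penalty and a higher ℓ2-repel value than mutually orthogonal vectors, while pairs at obtuse angles contribute nothing to the hinge.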

Our dispersion loss is inspired by the "Diffuse and Disperse" paper with practical modifications.

Table 1. Our dispersion loss and its alternative formulations. Main implementation differences from Diffuse and Disperse are highlighted in teal and magenta. Including or excluding diagonal terms yields identical gradients and is therefore cosmetic. For dispersion loss and ℓ2-repel, we adopt the log-sum-exp trick for numerical stability, which differs from log(mean(exp(·))) only by an additive constant. For ℓ2-repel, we include a norm regularization term to prevent unbounded expansion of embeddings. For Orthogonalization, the distance margin is fixed to 1/2 since we use angular distance, where 1/2 corresponds to orthogonality and thus serves as the ideal margin.
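
A minimal sketch of a dispersion loss in the spirit of this caption is shown below. The exact formula from Table 1 is not reproduced here, so treat the details as assumptions: the function names, the temperature `tau`, the use of off-diagonal pairs only, and the angular distance defined as arccos(cosine) / pi. The last lines check numerically that log-sum-exp and log(mean(exp(·))) differ only by the constant log N.

```python
import numpy as np

def angular_distance(z):
    """Pairwise angular distance in [0, 1]; 0.5 corresponds to orthogonal vectors."""
    unit = z / np.clip(np.linalg.norm(z, axis=1, keepdims=True), 1e-12, None)
    cos = np.clip(unit @ unit.T, -1.0, 1.0)
    return np.arccos(cos) / np.pi

def logsumexp(x):
    """Numerically stable log(sum(exp(x))) via the max-shift trick."""
    m = np.max(x)
    return float(m + np.log(np.sum(np.exp(x - m))))

def dispersion_loss(z, tau=0.5):
    """Log-sum-exp of negative pairwise angular distances; lower is more dispersed."""
    d = angular_distance(z)
    pairs = d[~np.eye(d.shape[0], dtype=bool)]  # off-diagonal pairs only
    return logsumexp(-pairs / tau)

rng = np.random.default_rng(0)
direction = rng.normal(size=(1, 16))
condensed = direction + 0.05 * rng.normal(size=(8, 16))  # nearly parallel vectors
dispersed = np.eye(8, 16)                                # mutually orthogonal vectors
print(dispersion_loss(condensed), dispersion_loss(dispersed))

# log(mean(exp(x))) equals logsumexp(x) minus the constant log(N).
x = rng.normal(size=20)
gap = logsumexp(x) - np.log(np.mean(np.exp(x)))
print(gap, np.log(x.size))
```

When used as a regularizer, a term like this would be added to the usual language-modeling loss with a weighting coefficient. The coefficient and the layers it is applied to are choices specified in the paper and are not assumed here.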

Dispersion loss counteracts the embedding condensation effect during mid-training and pre-training. A qualitative result is shown below, while more quantitative results can be found in the paper.

Figure 7. Dispersion loss counteracts the embedding condensation phenomenon. a. Starting from condensed embeddings (gray dashed box), mid-training with the default loss has a limited impact (green box). b. In contrast, mid-training with our dispersion loss as a regularizer substantially mitigates embedding condensation (blue box).

Conclusion
Larger language models are better than smaller language models, but perhaps not merely because they have more parameters. Part of the gap may be attributable to how they organize information in their latent representations. We hope to see future efforts along this interesting direction.

Disclaimers

If you are thinking about reproducing this work or borrowing pieces of it, here are my two cents.

  • Embedding condensation. This part has behaved well in our hands. It is consistently observed in many model families and input datasets and under controlled settings. The trends might be stronger or weaker depending on the model family, but nothing we put in the key observations required cherry-picked runs. We cannot guarantee that all model families have this phenomenon, but you can surely try it on your favorite ones.
  • Dispersion loss. Treat this part as more exploratory. The gains are modest: the improvement is subtle enough that it is hard to separate from noise without formal statistical tests (and the ones we ran are very elementary, as we are not good statisticians). After the paper was accepted, a friend with more experience in math reasoning pointed out that our mid-training recipe is not very standard: the common practice is to strengthen domain-specific capabilities rather than to continue training on wikitext. Our pre-training experiments are also thin because large runs are expensive. If you are interested in this method, I would recommend trying it at small scale with your team’s normal playbook before routing a big training budget through it.

Future directions

I personally highlight a few directions that seem potentially meaningful.

  • Better regularizers. The dispersion loss is a very simple and straightforward solution, likely with benefits and drawbacks. A more carefully designed method to counteract embedding condensation might be more helpful.
  • Beyond pre-training. Track how embedding condensation evolves in later stages of training, such as supervised fine-tuning (SFT) and reinforcement learning (RL). It remains unclear whether condensation re-emerges, stabilizes, or interacts differently with alignment objectives.
  • Mechanism and causality. Pin down the root causes of embedding condensation and establish stronger causal links between condensation and downstream behavior such as generalization.
  • Better architectures. Design model families or modules that inherently resist condensation, complementing or replacing purely loss-based regularization.
  • Better initialization. Develop initialization schemes that start models in a less condensed regime, potentially reducing the burden on the training objective to counteract the geometric collapse.

Citation

@inproceedings{liu2026dispersion,
  title={Dispersion loss counteracts embedding condensation and improves generalization in small language models},
  author={Liu, Chen and Sun, Xingzhi and Xiao, Xi and Van Tassel, Alexandre and Xu, Ke and Reimann, Kristof and Liao, Danqi and Gerstein, Mark and Wang, Tianyang and Wang, Xiao and Krishnaswamy, Smita},
  booktitle={International conference on machine learning},
  year={2026},
  organization={PMLR}
}

Acknowledgements

  1. This work was initially motivated by the paper “A mathematical perspective on Transformers”. We started this project in early Apr 2025 after watching a talk on that paper. We were intrigued by a theoretical result in that paper stating that if we stack Transformer layers infinitely, all the token embeddings will cluster to the same point, and we were curious whether that behavior can be observed empirically. That led to the key observations of embedding condensation in our paper.
  2. The design of the dispersion loss was largely inspired by Runqian and Kaiming’s paper “Diffuse and Disperse: Image Generation with Representation Regularization”. Their paper came out right after we completed the initial observations of embedding condensation and were thinking about ways to mitigate the phenomenon. When I read that paper, I immediately felt it was highly relevant and could be used as a reasonable solution.