Study shows tokens in deep encoder-only transformers concentrate quickly in a low‑temperature, large‑token limit
This paper analyzes how tokens move through deep encoder-only transformers during inference and shows that, in a suitable limit, the token distribution quickly concentrates onto a simple transformed version of the initial distribution. The authors work in a mean‑field setting, where many tokens behave like interacting particles, and in the low‑temperature regime, that is, when the temperature β^{-1} is small (equivalently, the inverse temperature β is large). In plain terms, they find that attention makes tokens cluster in a predictable way when there are many tokens and the temperature is low.
Concretely, the researchers describe token evolution by a mean‑field continuity equation, a deterministic equation that approximates how the token distribution evolves when the number of tokens is large. The limiting clustered distribution is the push‑forward of the initial token distribution under a projection map, meaning the distribution obtained by applying that map to every token. The projection map is determined by the transformer’s key, query, and value matrices, the learned linear parts of the self‑attention mechanism.
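To make the particle picture concrete, here is a minimal sketch of a finite-token version of such dynamics. It assumes a commonly used simplified model in which token x_i moves with velocity sum_j w_ij V x_j, where the weights w_ij are a softmax over j of β⟨Q x_i, K x_j⟩. This model, the explicit Euler time stepping, and the names attention_velocity and simulate_tokens are illustrative assumptions, not the paper's exact equation or code.

```python
import numpy as np

def attention_velocity(x, Q, K, V, beta):
    """Velocity of every token under softmax self-attention (assumed toy model).

    x has shape (n_tokens, d); Q, K, V have shape (d, d).
    """
    # scores[i, j] = beta * <Q x_i, K x_j>
    scores = beta * (x @ Q.T) @ (x @ K.T).T
    # Subtract the row maximum so the exponentials cannot overflow.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # each row sums to 1
    # Row i is sum_j weights[i, j] * (V x_j).
    return weights @ (x @ V.T)

def simulate_tokens(x0, Q, K, V, beta, t_final, n_steps):
    """Explicit Euler integration of dx_i/dt = attention_velocity(x)_i."""
    x = np.array(x0, dtype=float)
    dt = t_final / n_steps
    for _ in range(n_steps):
        x = x + dt * attention_velocity(x, Q, K, V, beta)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(50, 2))      # hypothetical initial tokens
    identity = np.eye(2)                   # placeholder Q, K, V matrices
    final = simulate_tokens(tokens, identity, identity, identity,
                            beta=10.0, t_final=1.0, n_steps=200)
    print(final.shape)
```

Rerunning the sketch with larger values of beta on the same initial tokens is one way to see the attention weights sharpen. How closely this toy model tracks the paper's setting is not something the summary settles.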
The main quantitative statement bounds the Wasserstein distance, a standard measure of how far apart two probability distributions are, between the actual token distribution and the limiting one. The bound scales like sqrt(log(β+1)/β) * exp(C t) + exp(−c t), where β is the inverse temperature, t is the inference time, and C and c are positive constants. The first term is small as long as exp(C t) stays well below sqrt(β / log(β+1)), which holds for large β (the low‑temperature limit) and times up to order log β. The second term reflects the exponential decay inherited from the zero‑temperature dynamics and is small once t is large compared with 1/c. Together, the two terms single out a window of times in which the distribution sits close to the identified limit, which gives a time scale for this metastability.
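The trade-off between the two terms can be explored numerically. The sketch below evaluates the stated scaling and finds the time at which it is smallest. The summary does not give the constants, so C = c = 1 are placeholders, and the function names concentration_bound and balancing_time are hypothetical.

```python
import math

def concentration_bound(beta, t, C=1.0, c=1.0):
    """Evaluate sqrt(log(beta+1)/beta) * exp(C*t) + exp(-c*t).

    C and c are placeholders: the summary only says they are positive.
    """
    low_temp_term = math.sqrt(math.log(beta + 1.0) / beta) * math.exp(C * t)
    decay_term = math.exp(-c * t)
    return low_temp_term + decay_term

def balancing_time(beta, C=1.0, c=1.0):
    """Time t >= 0 at which the expression above is smallest.

    Setting the t-derivative to zero gives a*C*exp(C*t) = c*exp(-c*t)
    with a = sqrt(log(beta+1)/beta), hence t = log(c/(a*C)) / (C + c).
    The result is clipped at 0 because the expression is convex in t.
    """
    a = math.sqrt(math.log(beta + 1.0) / beta)
    return max(0.0, math.log(c / (a * C)) / (C + c))

if __name__ == "__main__":
    for beta in (1e2, 1e4, 1e6):
        t_star = balancing_time(beta)
        value = concentration_bound(beta, t_star)
        print(f"beta={beta:.0e}  t*={t_star:.3f}  bound={value:.4f}")
```

With these placeholder constants the balancing time grows roughly in proportion to log β, consistent with the order-log-β time scale mentioned above.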
To obtain these results, the authors borrow ideas from the analysis of interacting particle systems. They derive Lyapunov‑type estimates for the zero‑temperature equation to control stability and to identify its long‑time limit. They then combine a stability estimate in Wasserstein space with a quantitative Laplace principle (a tool for approximating an integral by its dominant contribution) to couple the low‑temperature dynamics to the zero‑temperature behavior. These steps give a quantitative link between the particle picture and the mean‑field limit.
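The idea behind the Laplace principle can be seen in a small numerical experiment: an average weighted by exp(β f) concentrates on the maximizer of f as β grows. The sketch below is a generic illustration on a one-dimensional grid, not the quantitative version proved in the paper. The function name gibbs_average and the test function f(x) = −(x − 0.3)^2 are assumptions made for the example.

```python
import numpy as np

def gibbs_average(points, values, beta):
    """Average of `points` with weights proportional to exp(beta * values)."""
    # Shift by the maximum so the largest exponent is 0 (avoids overflow).
    logits = beta * (values - values.max())
    weights = np.exp(logits)
    weights /= weights.sum()
    return weights @ points

if __name__ == "__main__":
    grid = np.linspace(-1.0, 1.0, 201)
    f = -(grid - 0.3) ** 2          # hypothetical function, maximized at 0.3
    for beta in (1.0, 10.0, 100.0, 1000.0):
        print(beta, gibbs_average(grid, f, beta))
```

At β = 0 the weights are uniform and the result is the plain mean of the grid. As β increases, the weighted average moves toward the maximizer, which mirrors how low-temperature attention approaches its zero-temperature counterpart.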