Statistical ML

How generative models move from memorising examples to producing similar new images — and why 'convergence' can miss the main features

May 24, 2026 | arXiv: 2605.21402v1

Researchers analysed when simple generative models stop memorising their training examples and when independently trained models start producing similar outputs. Using an exact calculation for linear models, they show that memorisation dominates when the dataset is very small, while a form of “convergence” between independently trained models appears once the number of examples grows in proportion to the data dimension. Crucially, that convergence mostly reflects learning the bulk of the data’s variability, and does not necessarily mean the models have recovered the main underlying factors that structure the data.

To make the problem tractable, the authors study the simplest useful setting: zero‑mean Gaussian data whose covariance is a low‑rank perturbation of the identity. In plain terms, the data look mostly like noise but have one or a few dominant directions (latent factors) that carry real signal. They train a Gaussian generative model that uses the square root of the empirical covariance to make new samples. The key control parameter is the sample complexity γ = n/d, the number of training examples n divided by the input dimension d. The analysis is asymptotic, meaning it becomes exact in the limit of large dimension.
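The setting described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the rank-one choice, the signal strength `snr` and the ridge-style `reg` term are all assumptions made here for concreteness.

```python
import numpy as np

def sample_spiked_data(n, d, u, snr, rng):
    """Draw n zero-mean Gaussian samples with covariance I + snr * u u^T.

    Assumes u is a unit vector; snr (hypothetical name) sets the strength
    of the single latent factor.
    """
    noise = rng.standard_normal((n, d))       # isotropic part (identity covariance)
    latent = rng.standard_normal((n, 1))      # one latent factor per sample
    return noise + np.sqrt(snr) * latent * u  # rank-one perturbation of the identity

def fit_sqrt_covariance(x, reg=0.0):
    """Symmetric square root of the empirical covariance, plus an optional ridge term."""
    n, d = x.shape
    cov = x.T @ x / n + reg * np.eye(d)
    evals, evecs = np.linalg.eigh(cov)
    evals = np.clip(evals, 0.0, None)         # guard against tiny negative round-off
    return (evecs * np.sqrt(evals)) @ evecs.T

def generate(sqrt_cov, z):
    """Map rows of latent noise z to generated samples (sqrt_cov is symmetric)."""
    return z @ sqrt_cov
```

With `d = 20` and `n = 50`, for example, the sample complexity is γ = 2.5; feeding `generate` standard normal noise then yields samples whose covariance matches the fitted one.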

The paper defines three measures to separate behaviours. Memorisation (m) is how closely a generated sample matches an actual training example. Convergence (q) is how similar two samples are when two models are trained on different datasets but started from the same random seed. Latent recovery (Q) measures whether the dominant latent direction in the true data is learned. They find that m falls quickly as n grows, and that regularisation reduces it further. Convergence q stays near zero for tiny datasets but rises continuously towards one as n becomes of order d. The authors even give an exact formula for q in that regime that depends on the limiting spectrum of the empirical covariance (the Marchenko–Pastur law) and on the regularisation strength.
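To make the three measures concrete, here is a small numerical sketch. The summary does not give the paper's exact definitions, so the cosine-similarity proxies below are assumptions: m is taken as the largest absolute cosine similarity between a generated sample and any training example, q as the cosine similarity between two models' outputs for the same latent noise, and Q as the squared overlap between the top empirical eigenvector and the true direction. The function name and default parameters are likewise hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def run_experiment(n, d, snr=4.0, reg=0.0, seed=0):
    """Return illustrative proxies (m, q, Q) for one pair of independently trained models."""
    rng = np.random.default_rng(seed)
    u = np.zeros(d)
    u[0] = 1.0                                 # true latent direction (known in this toy setup)

    def draw(count):
        # Zero-mean Gaussian data with covariance I + snr * u u^T.
        noise = rng.standard_normal((count, d))
        latent = rng.standard_normal((count, 1))
        return noise + np.sqrt(snr) * latent * u

    def fit(x):
        # Square root of the (optionally regularised) empirical covariance.
        cov = x.T @ x / len(x) + reg * np.eye(d)
        evals, evecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
        root = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
        return root, evecs[:, -1]              # last column is the top eigenvector

    x1, x2 = draw(n), draw(n)                  # two independent training sets
    root1, top1 = fit(x1)
    root2, _ = fit(x2)

    z = rng.standard_normal(d)                 # shared latent noise ("same seed")
    g1, g2 = root1 @ z, root2 @ z

    m = max(abs(cosine(g1, xi)) for xi in x1)  # memorisation proxy
    q = cosine(g1, g2)                         # convergence proxy
    Q = float((top1 @ u) ** 2)                 # latent-recovery proxy
    return m, q, Q
```

Calling `run_experiment(n=2, d=200)` and `run_experiment(n=2000, d=20)` contrasts a tiny dataset with a large one. Under these proxies the first should show high m and low q, and the second q close to one, though a single draw is noisy and says nothing about the paper's exact asymptotic formula.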