As the number of experts grows, a mixture-of-experts model settles into a predictable continuum — even when each expert is a quantum circuit
This paper studies what happens when you train a mixture-of-experts model that averages many small models sharing a single architecture, and then let the number of experts grow without bound. The authors prove that, under gradient-flow training, the empirical distribution of the expert parameters behaves like a smooth probability distribution. In other words, the many random experts become well described by a single deterministic object that evolves in time according to a nonlinear continuity equation.
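In symbols (the notation here is chosen for concreteness and may differ from the paper's own conventions), the model is the uniform average of N copies of one expert function, each copy carrying its own parameters:

```latex
% N experts sharing one architecture f, each with its own parameters theta_i
f_N(x) \;=\; \frac{1}{N}\sum_{i=1}^{N} f(x;\theta_i)
```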
Technically, the work treats each expert as a particle with parameters θ and defines the mixture model as the average of N identical expert functions. The parameters are initialized independently and uniformly on a d-dimensional torus and then updated by continuous-time gradient flow on the usual squared loss. The main theorem shows a precise form of “propagation of chaos”: the empirical measure of the N trained parameters is close, in Wasserstein distance, to a probability measure μ_t that solves a specific nonlinear continuity equation. The drift appearing in that equation depends on the training data, the model function of a single expert, and the current average behavior of the ensemble, as sketched below.
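The following is only the generic shape of such a mean-field result, written in standard notation for a squared loss over data (x_j, y_j); the paper's exact constants, scaling conventions, and regularity assumptions may differ:

```latex
% Rescaled gradient flow on the empirical squared loss:
\dot{\theta}_i(t) \;=\; -\,N\,\nabla_{\theta_i}
   \frac{1}{2n}\sum_{j=1}^{n}\bigl(f_N(x_j)-y_j\bigr)^{2}.
% The empirical measure of the trained particles,
\hat{\mu}^{N}_{t} \;=\; \frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_i(t)},
% stays Wasserstein-close to the solution \mu_t of the nonlinear
% continuity equation
\partial_t \mu_t + \nabla_{\theta}\!\cdot\!\bigl(\mu_t\, v[\mu_t]\bigr) \;=\; 0,
\qquad
v[\mu](\theta) \;=\; -\,\frac{1}{n}\sum_{j=1}^{n}
   \nabla_{\theta} f(x_j;\theta)
   \Bigl(\int f(x_j;\theta')\,\mu(\mathrm{d}\theta') - y_j\Bigr).
```

The factor N in the flow is the usual mean-field time rescaling that compensates for the 1/N inside the average, and the drift v[μ] visibly combines the three ingredients named above: the data, the single-expert model f, and the ensemble average of f against the current measure.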
At a high level the argument is a mean-field limit. When N is large one can treat the discrete collection of experts as a cloud of interacting particles. As N→∞ the interactions average out and the cloud is described by a single density that evolves by a partial differential equation. The authors quantify this closeness with an explicit convergence rate that depends only on N and on the dimension of the parameter space. They measure distances between probability laws with the Wasserstein metric, which captures how much mass must be moved to turn one distribution into another.
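To make the particle picture concrete, here is a minimal numerical sketch of the rescaled gradient flow, discretized with Euler steps. The single-expert model f(x; θ) = cos(x − θ), the synthetic data, and the step size are all invented for illustration and are not the paper's setup:

```python
# Toy simulation of the interacting-particle (mean-field) picture: N experts
# with parameters on the circle, trained by a discretized, N-rescaled gradient
# flow on the squared loss. All modeling choices here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def expert(x, theta):
    """One expert f(x; theta) = cos(x - theta): bounded, parameter on the torus."""
    return np.cos(x[:, None] - theta[None, :])      # shape (n_data, N)

# Synthetic regression data on [0, 2*pi)
n_data = 64
x = rng.uniform(0.0, 2 * np.pi, size=n_data)
y = 0.5 * np.cos(x - 1.0)                           # a target the mixture can fit

# N experts, parameters initialized i.i.d. uniform on the torus
N = 500
theta = rng.uniform(0.0, 2 * np.pi, size=N)

dt, steps = 0.05, 1000
for _ in range(steps):
    f_N = expert(x, theta).mean(axis=1)             # ensemble average, (n_data,)
    residual = f_N - y
    # d f_N / d theta_i = (1/N) * sin(x - theta_i); the factor N from the
    # mean-field time rescaling cancels it, leaving a plain mean over data:
    grad = (residual[:, None] * np.sin(x[:, None] - theta[None, :])).mean(axis=0)
    theta = (theta - dt * grad) % (2 * np.pi)       # Euler step, stay on the torus

print("final loss:", 0.5 * np.mean((expert(x, theta).mean(axis=1) - y) ** 2))
```

As N grows, histograms of theta from independent runs of this kind of simulation concentrate around one deterministic density, which is exactly the empirical-measure convergence the theorem quantifies.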
The paper also applies these results when each expert is a parametric quantum circuit. In that case the mixture is a classical average of quantum experts and may be hard to simulate classically if many of the experts are themselves hard to simulate. The mean-field description gives a way to study the collective training dynamics of such hybrid quantum–classical models. The authors note that they work in a regime where the model outputs are uniformly bounded; this avoids the “lazy training” behavior studied in some earlier quantum-neural-network limits and is intended to permit representation learning.
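To show what a classical average of quantum experts means operationally, here is a self-contained one-qubit sketch; the circuit layout (an RY rotation, then an RZ encoding the input, then another RY) is an assumption made purely for illustration, not the paper's architecture. Each expert's output is an expectation value in [−1, 1], so the mixture's outputs are automatically uniformly bounded, matching the regime described above:

```python
# Minimal sketch of a classical mixture of quantum experts: each expert is a
# one-qubit parametric circuit simulated exactly with 2x2 matrices, and the
# model output is the plain classical average of the experts' <Z> values.
# The circuit layout below is an illustrative assumption.
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=complex)

def ry(a):
    """Rotation about Y by angle a."""
    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
                     [np.sin(a / 2),  np.cos(a / 2)]], dtype=complex)

def rz(a):
    """Rotation about Z by angle a."""
    return np.array([[np.exp(-1j * a / 2), 0],
                     [0, np.exp(1j * a / 2)]], dtype=complex)

def quantum_expert(x, theta):
    """<Z> after RY(theta[0]) -> RZ(x) -> RY(theta[1]) applied to |0>.
    Always lies in [-1, 1], so the mixture is uniformly bounded."""
    psi = ry(theta[1]) @ rz(x) @ ry(theta[0]) @ np.array([1, 0], dtype=complex)
    return float(np.real(psi.conj() @ Z @ psi))

def mixture(x, thetas):
    """Classical average of N quantum experts: the model discussed above."""
    return np.mean([quantum_expert(x, th) for th in thetas])

rng = np.random.default_rng(1)
thetas = rng.uniform(0, 2 * np.pi, size=(200, 2))   # N = 200 two-parameter experts
print(mixture(0.3, thetas))                          # one bounded real output
```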