Polynomial speedup for diffusion-model sampling using a multilevel Euler–Maruyama trick
Researchers propose a simple change to how diffusion models draw samples that can cut the compute needed to generate images. The new Multilevel Euler–Maruyama (ML-EM) method mixes many calls to cheap, small neural networks with a few calls to expensive, large ones while simulating a stochastic process. Under a reasonable technical assumption about how hard the denoising function is to learn, this mixture yields a provable polynomial speedup. In experiments on 64×64 CelebA images the authors report up to fourfold faster sampling, and they measure a hardness parameter γ ≈ 2.5 that fits their theory.
What the team did. Diffusion-based generators work by running a stochastic differential equation (SDE), or a related ordinary differential equation (ODE), from noise to an image. The expensive part is evaluating a learned "drift" or denoiser, usually implemented as a large UNet (a common convolutional neural network). The authors adapt an idea from Multilevel Monte Carlo: keep a ladder of denoisers f1, …, fk of increasing size and accuracy, then use the small denoisers most of the time and the large ones only rarely, chosen at random. The overall update combines the levels so that the expected step matches an update with the best denoiser, but at a much lower average cost per step.
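The ladder idea above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the "denoisers" are scalar functions standing in for UNets of growing size, and the function names (`make_ladder`, `ml_drift`, `mlem_step`) and level probabilities are made up for the example. The key property is that the cheapest denoiser is always evaluated, while the difference between consecutive levels is added only with some probability, importance-weighted so the expected drift equals the finest denoiser's output.

```python
import math
import random

# Toy ladder of denoisers f1..fk of increasing accuracy (hypothetical
# stand-ins for UNets of growing size): level l approximates the true
# drift -x with an error that halves at each level.
def make_ladder(k):
    return [lambda x, l=l: -x * (1.0 - 0.5 ** (l + 1)) for l in range(k)]

def ml_drift(x, ladder, probs, rng):
    """Randomized multilevel drift estimate (sketch).

    Always evaluate the cheapest denoiser; with probability probs[l-1],
    also evaluate levels l and l-1 and add their importance-weighted
    difference, so the *expected* drift equals the finest level's output.
    """
    drift = ladder[0](x)
    for l in range(1, len(ladder)):
        if rng.random() < probs[l - 1]:
            drift += (ladder[l](x) - ladder[l - 1](x)) / probs[l - 1]
    return drift

def mlem_step(x, ladder, probs, dt, rng):
    """One Euler-Maruyama step driven by the multilevel drift estimate."""
    return x + ml_drift(x, ladder, probs, rng) * dt + rng.gauss(0.0, math.sqrt(dt))
```

Averaging `ml_drift` over many random draws recovers the finest denoiser's value, which is the unbiasedness property the method relies on; the price is extra variance in any single step.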
How it works at a high level. The method relies on two assumptions the paper states and tests: (1) smaller networks are much cheaper to run, and (2) error shrinks predictably as network size grows (a "scaling law"). ML-EM forms a telescoping sum of differences between successive levels and randomly samples which differences to include, with chosen probabilities. This keeps the bias low while cutting the number of expensive network calls. In practice the authors either fix the sampling probabilities or learn coefficients; they also run a short search (best of 15 trials) to pick a favorable random sampling pattern, since the randomized level selection can produce samples of variable quality.
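The cost savings follow from simple accounting. The sketch below uses a hypothetical cost model (per-level costs and sampling probabilities are made up for illustration, not taken from the paper): each step always pays for the cheapest denoiser, and pays for a pair of adjacent levels only when that level's difference is sampled.

```python
# Hypothetical cost model: the level-l denoiser costs 4**l units (e.g.
# FLOPs growing with network size). A multilevel step always pays for
# level 0 and, with probability probs[l-1], additionally evaluates
# levels l and l-1 to form their difference.
def expected_step_cost(costs, probs):
    total = costs[0]
    for l in range(1, len(costs)):
        total += probs[l - 1] * (costs[l] + costs[l - 1])
    return total

costs = [1.0, 4.0, 16.0, 64.0]   # four-level ladder, 4x cost per level
probs = [0.4, 0.15, 0.05]        # touch the expensive levels rarely
ml = expected_step_cost(costs, probs)
baseline = costs[-1]             # always running the largest denoiser
print(ml, baseline, baseline / ml)
```

Under these illustrative numbers the multilevel step costs 10 units in expectation versus 64 for always calling the largest network, roughly a 6.4× reduction per step; the paper's provable speedup comes from choosing the probabilities optimally given the scaling-law exponent γ.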