Why a simple genetic algorithm can behave like gradient descent in very large models
Researchers show that a basic form of genetic algorithm can act like a noisy version of gradient descent when mutations are small. They study the elitist (1+M) genetic algorithm: start from one parent, make M Gaussian-perturbed offspring, and keep the best. In the small-mutation limit the algorithm’s average motion follows the loss gradient direction without ever calculating derivatives. The difference from exact gradient descent is an anisotropic Gaussian noise that slows progress in directions that are not aligned with the gradient.
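The procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the quadratic toy loss, the function names, and the values of M, the mutation scale sigma, and the dimension are all assumptions chosen for readability.

```python
import random


def one_plus_m_step(parent, loss, m, sigma, rng):
    """One generation: draw M Gaussian-perturbed offspring and keep the
    best of parent and offspring (elitist selection)."""
    best, best_loss = parent, loss(parent)
    for _ in range(m):
        child = [x + sigma * rng.gauss(0.0, 1.0) for x in parent]
        child_loss = loss(child)
        if child_loss < best_loss:
            best, best_loss = child, child_loss
    return best, best_loss


def quadratic_loss(x):
    # Hypothetical toy loss with curvature k + 1 along coordinate k.
    return 0.5 * sum((k + 1) * xi * xi for k, xi in enumerate(x))


def run(generations=200, m=8, sigma=0.05, dim=10, seed=0):
    """Run the (1+M) algorithm and return the loss after each generation."""
    rng = random.Random(seed)
    x = [1.0] * dim
    history = [quadratic_loss(x)]
    for _ in range(generations):
        x, current = one_plus_m_step(x, quadratic_loss, m, sigma, rng)
        history.append(current)
    return history
```

Calling `run()` returns a loss history that never increases, because the parent is replaced only by a strictly better offspring. No derivative of the loss is evaluated anywhere in the loop.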
To reach this view the authors expand the algorithm’s master equation for small mutation steps. That expansion gives a Fokker–Planck equation and an equivalent Langevin (stochastic differential) equation. The effective drift term points along the normalized gradient of the loss (this is called “clipped” gradient descent). The random part of the dynamics is Gaussian white noise with different sizes along and across the gradient. In plain terms: random mutation plus selection tends to move downhill, but with noise that blurs motion sideways.
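The drift and the direction-dependent noise can be checked numerically in the simplest setting. The sketch below assumes a loss that is exactly linear near the parent, so the loss change of a mutation is its projection on the gradient direction. The gradient direction `g`, the perpendicular direction `t`, and all parameter values are hypothetical; the sketch illustrates the qualitative picture and is not a reproduction of the paper's derivation.

```python
import random
import statistics


def selected_step(grad_unit, m, sigma, rng):
    """Best of M Gaussian mutations under a linearized loss.
    Returns the zero step if no offspring improves on the parent."""
    dim = len(grad_unit)
    best_step, best_change = [0.0] * dim, 0.0
    for _ in range(m):
        step = [sigma * rng.gauss(0.0, 1.0) for _ in range(dim)]
        # First-order loss change, in units of the gradient norm.
        change = sum(g * s for g, s in zip(grad_unit, step))
        if change < best_change:
            best_step, best_change = step, change
    return best_step


def drift_and_noise(m=4, sigma=0.01, samples=20000, seed=1):
    """Mean and variance of the selected step along and across the
    gradient, in units of sigma and sigma squared."""
    rng = random.Random(seed)
    g = [0.6, 0.8, 0.0]   # unit gradient direction (hypothetical)
    t = [0.8, -0.6, 0.0]  # unit direction perpendicular to g
    along, across = [], []
    for _ in range(samples):
        s = selected_step(g, m, sigma, rng)
        along.append(sum(a * b for a, b in zip(g, s)) / sigma)
        across.append(sum(a * b for a, b in zip(t, s)) / sigma)
    return {
        "mean_along": statistics.fmean(along),
        "mean_across": statistics.fmean(across),
        "var_along": statistics.pvariance(along),
        "var_across": statistics.pvariance(across),
    }
```

In this toy setting the mean step along the gradient is negative (downhill), the mean step across it is close to zero, and the two variances differ, which is the anisotropy described above.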
A key consequence is that the speed penalty from this noise does not necessarily grow with the total number of parameters N. Instead it is controlled by the effective rank of the Hessian of the loss. The Hessian is the matrix of second derivatives that describes the local curvature of the loss. If the Hessian’s eigenvalues fall off as a power law, λ_k ∝ k^{-a}, the paper shows that the effective rank grows like N^{1−a} for a<1, like log N for a=1, and does not grow with N for a>1. Many neural-network losses have an exponent a in the range 0.8–1.3, so the slowdown of the genetic algorithm can be much less severe than the simple 1/N scaling often assumed.
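The three growth regimes can be reproduced with a short calculation. The text above does not spell out the definition of effective rank, so this sketch assumes a common one, the trace of the Hessian divided by its largest eigenvalue; with a power-law spectrum that ratio has the scaling quoted above. The function name and the chosen sizes are illustrative.

```python
def effective_rank(n, a):
    """tr(H) / lambda_max for eigenvalues lambda_k = k**(-a), k = 1..n.
    This definition is an assumption, not taken from the paper."""
    eigenvalues = [k ** (-a) for k in range(1, n + 1)]
    return sum(eigenvalues) / max(eigenvalues)


# Compare growth with n below, at, and above the critical exponent a = 1.
for a in (0.5, 1.0, 1.5):
    ranks = [round(effective_rank(n, a), 2) for n in (10**3, 10**4, 10**5)]
    print(f"a = {a}: {ranks}")
```

For a = 0.5 the values keep growing roughly like the square root of n, for a = 1 they grow only logarithmically, and for a = 1.5 they level off at a constant.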
The authors also give a few other concrete results. Using more offspring (M>1) speeds up progress per generation by a factor of about √(ln M) compared with hill climbing (M=1), but progress per loss evaluation scales like √(ln M)/M, so in serial computation plain hill climbing can be more efficient. Averaging many independent runs makes the mean loss follow the noiseless clipped gradient-descent law. The small-mutation approximation breaks down when the gradient becomes comparable to the mutation scale times the largest Hessian eigenvalue, so the theory fails near critical points of the loss.
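The trade-off between offspring count and evaluation cost can be illustrated with a small Monte Carlo estimate. The sketch assumes a locally linear loss, so the gain of one generation along the gradient, in units of the mutation scale, is the best of M standard normal draws, or zero if none improves on the parent. The function names, sample size, and values of M are assumptions, and the √(ln M) behavior is an asymptotic trend rather than an exact law at small M.

```python
import random
import statistics


def expected_gain(m, samples=10000, seed=2):
    """Monte Carlo mean of max(0, best of M standard normals): the
    per-generation gain along the gradient in units of sigma."""
    rng = random.Random(seed)
    gains = []
    for _ in range(samples):
        best = max(rng.gauss(0.0, 1.0) for _ in range(m))
        gains.append(max(0.0, best))  # elitism: the parent is kept if no child improves
    return statistics.fmean(gains)


def efficiency_table(ms=(1, 2, 4, 16, 64)):
    """Rows of (M, gain per generation, gain per loss evaluation)."""
    rows = []
    for m in ms:
        gain = expected_gain(m)
        rows.append((m, gain, gain / m))
    return rows
```

In this toy model the gain per generation rises slowly with M while the gain per loss evaluation falls, so M=1 gives the most progress per evaluation when evaluations run one after another.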