Random reshuffling (a common way to run SGD) is provably better than standard SGD for smooth convex problems
Many machine learning methods use Stochastic Gradient Descent (SGD) to minimize a sum of many small functions, one per data point. In practice people rarely sample with replacement. Instead they shuffle the data each pass (an “epoch”) and then process it in order. This common recipe is called Random Reshuffling (RR). The authors prove that RR is not just a useful trick. For smooth convex problems, RR provably outperforms standard SGD under any reasonable learning rate and after any finite number of epochs.
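The difference between the two sampling schemes can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names, the scalar least-squares objective, and the `run` helper are assumptions made for the example.

```python
import random

def with_replacement_indices(n, rng):
    """One epoch of standard SGD: n independent draws, repeats allowed."""
    return [rng.randrange(n) for _ in range(n)]

def reshuffled_indices(n, rng):
    """One epoch of Random Reshuffling: a fresh permutation of 0..n-1."""
    order = list(range(n))
    rng.shuffle(order)
    return order

def run(a, b, stepsize, epochs, index_fn, seed=0):
    """Minimize (1/n) * sum_i 0.5 * (a[i] * x - b[i])**2 over a scalar x.

    index_fn decides the order in which data points are visited each epoch.
    """
    rng = random.Random(seed)
    n = len(a)
    x = 0.0
    for _ in range(epochs):
        for i in index_fn(n, rng):
            grad_i = a[i] * (a[i] * x - b[i])  # gradient of the i-th component
            x -= stepsize * grad_i
    return x
```

Calling `run(a, b, stepsize, epochs, reshuffled_indices)` visits every data point exactly once per epoch, while passing `with_replacement_indices` may repeat some points and skip others within the same epoch.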
What the researchers did. They analyze Shuffling SGD in which a fresh, independent random permutation is drawn each epoch (Random Reshuffling). They work under the standard smooth convex assumptions: each component function is convex and has a Lipschitz-continuous gradient (a technical smoothness condition). They give a new convergence theorem (their Theorem 1) that holds when the stepsize (the learning rate) is at most about 1/(16 times the largest smoothness constant). This upper bound on the stepsize is much less restrictive than older theory, which required a stepsize proportional to 1/n, where n is the number of data points.
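To get a feel for how much looser the new stepsize condition is, here is a small sketch. The constant 16 comes from the summary above ("about 1/(16 times the largest smoothness constant)"). The older 1/(n times the smoothness constant) scale is shown without its exact constant, which is not stated here. Both function names are hypothetical.

```python
def rr_stepsize_cap(smoothness_constants):
    """Approximate cap from the new theorem: 1 / (16 * L_max)."""
    return 1.0 / (16.0 * max(smoothness_constants))

def older_stepsize_scale(smoothness_constants):
    """Order of magnitude required by older theory: 1 / (n * L_max).

    The exact constant in the older results is not given in the text,
    so only the scaling with n is illustrated here.
    """
    n = len(smoothness_constants)
    return 1.0 / (n * max(smoothness_constants))
```

With one million data points that all have smoothness constant 1, the first function allows a stepsize of 1/16, while the second gives a scale of one millionth: the new cap does not shrink as the dataset grows.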
How the new bound works, in plain terms. The bound splits the error into two parts. One part is an “optimization” term that shrinks quickly with the total amount of work and is divided by n, the number of data points. The other part measures the effect of noise in gradients at the true solution (they call its size σ*²). The noise term in their bound is expressed so that it is never worse than the corresponding term for standard SGD when the stepsize is in the allowed range. In short, for any reasonable constant stepsize RR converges in expectation and its bound is at least as good as SGD’s and strictly better in many regimes.
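The noise quantity can be made concrete on a toy problem. The sketch below assumes the definition commonly used in this literature, namely the average squared norm of the component gradients at the minimizer, which may differ in detail from the paper's exact definition. The scalar least-squares setup and both function names are illustrative assumptions.

```python
def least_squares_minimizer(a, b):
    """Minimizer of (1/n) * sum_i 0.5 * (a[i] * x - b[i])**2 over a scalar x."""
    return sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)

def gradient_noise_at_solution(a, b):
    """Average squared component gradient at the minimizer (assumed definition)."""
    x_star = least_squares_minimizer(a, b)
    grads = [ai * (ai * x_star - bi) for ai, bi in zip(a, b)]
    return sum(g * g for g in grads) / len(grads)
```

When every data point is fit exactly by the same x (for example `a = [1, 2, 3]`, `b = [2, 4, 6]`), each component gradient vanishes at the solution and the noise is zero. When the data points disagree (for example `a = [1, 1]`, `b = [0, 2]`), the component gradients at the solution cancel only on average and the noise is positive.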
Why this matters. Before this work, theory explained RR’s advantage only when the learning rate was tiny (on the order of 1/(nL), where L is a smoothness constant). That left a big gap with practice, where people use much larger constant learning rates and see strong benefits from reshuffling. This paper closes that gap for smooth convex problems. It also sharpens the best possible error rates when the stepsize is tuned, and it shows that RR attains the faster 1/(nK) dependence on the number of epochs K in important cases, including the zero-noise case where all component functions share the same minimizer.