New algorithm VISOR improves finite-sample performance in noisy convex optimization
This paper studies a common problem: finding the minimizer of a smooth, strongly convex loss when we only observe noisy samples. The authors show that two standard approaches — sample average approximation (SAA), which minimizes the average observed loss, and averaged stochastic approximation (SA), which averages stochastic-gradient iterates — can behave poorly at realistic, moderate sample sizes rather than asymptotically large ones. They propose a new variance-reduction method called VISOR that achieves much better accuracy from the same number of samples. An accelerated version of VISOR is shown to be instance-optimal up to logarithmic factors while also having optimal oracle complexity (a measure of how many gradient queries it needs).
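To make the two baselines concrete, here is a minimal sketch (our own toy instance, not the paper's experiment or VISOR itself) of SAA and Polyak–Ruppert averaged SGD on a noisy quadratic; the matrix `H`, noise scale, and step-size schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance: minimize f(x) = 0.5 * x^T H x, with noisy gradient
# samples g_i(x) = H x + noise_i. H and the noise level are made up.
H = np.diag([1.0, 0.05])          # ill-conditioned curvature
x_star = np.zeros(2)              # true minimizer
n = 5000
noise = rng.normal(scale=0.5, size=(n, 2))

# SAA: minimize the average of the n sampled losses. For this quadratic
# the SAA solution solves H x = -mean(noise_i), a linear system.
x_saa = np.linalg.solve(H, -noise.mean(axis=0))

# Averaged SA (Polyak-Ruppert): run SGD with decaying steps and return
# the running average of the iterates, not the last iterate.
x = np.ones(2)                    # arbitrary starting point
avg = np.zeros(2)
for k in range(1, n + 1):
    g = H @ x + noise[k - 1]      # one stochastic gradient query
    x = x - (0.5 / np.sqrt(k)) * g
    avg += (x - avg) / k          # incremental mean of iterates

err_saa = np.linalg.norm(x_saa - x_star)
err_sa = np.linalg.norm(avg - x_star)
print(f"SAA error: {err_saa:.4f}, averaged-SA error: {err_sa:.4f}")
```

Both estimators use the same n samples; the paper's point is that at moderate n neither need be close to the error level the asymptotic theory promises.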
The paper focuses on a setting where the stochastic oracle that provides data can introduce both additive and multiplicative noise. Classical asymptotic results say that SAA and the averaged stochastic-gradient estimator become optimal as the sample size n goes to infinity. But the authors point out that these asymptotic guarantees can be misleading in practice. They give a concrete two-dimensional quadratic example where the averaged stochastic-gradient estimator only looks asymptotically normal after roughly one million samples. In other words, finite-sample errors can be much larger than the asymptotic theory predicts.
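One way to picture an oracle with both noise types is the following sketch (our illustration; the specific form `(H + E) x + z` is an assumption, not the paper's exact model): the multiplicative part perturbs the curvature and so scales with the distance from the minimizer, while the additive part persists even at the optimum.

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic f(x) = 0.5 * x^T H x with minimizer at the origin.
H = np.array([[1.0, 0.0],
              [0.0, 0.1]])

def noisy_gradient(x, mult_scale=0.3, add_scale=0.5):
    """Return one stochastic gradient sample with two noise sources."""
    E = rng.normal(scale=mult_scale, size=(2, 2))  # multiplicative: scales with x
    z = rng.normal(scale=add_scale, size=2)        # additive: present everywhere
    return (H + E) @ x + z

# Far from the minimizer the multiplicative term E @ x dominates;
# exactly at x = 0 only the additive noise z remains.
g_far = noisy_gradient(np.array([10.0, 10.0]))
g_at_min = noisy_gradient(np.zeros(2))
```

This is the regime where finite-sample behavior can diverge from the asymptotic picture: early iterates sit far from the minimum, where the multiplicative noise is large.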
To make these observations precise, the authors prove finite-sample, information-theoretic lower bounds. These are local minimax limits that describe how hard a specific problem instance can be, not just the worst case over many problems. A central quantity in these bounds is the trace of a matrix Λ that combines the curvature of the loss at its minimum and the covariance of the sample gradients. The bounds show there is a problem-dependent sample-size threshold: below that threshold no algorithm can reach the small errors predicted by asymptotic theory, and above it the expected error has a lower bound proportional to trace(Λ). For a family of quadratic problems in the paper, this threshold grows like the square of a condition parameter, so some problems need many samples before asymptotic behavior kicks in.
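In the classical asymptotic theory this kind of quantity is the sandwich matrix Λ = H⁻¹ Σ H⁻¹, with H the Hessian at the minimizer and Σ the gradient covariance there; assuming that standard form (the paper's exact Λ may differ in details), a short sketch shows how trace(Λ) blows up with a condition parameter κ:

```python
import numpy as np

# Assumed classical form: Lambda = H^{-1} Sigma H^{-1}, where H is the
# Hessian at the minimizer and Sigma the covariance of sampled gradients.
def trace_lambda(H, Sigma):
    H_inv = np.linalg.inv(H)
    return np.trace(H_inv @ Sigma @ H_inv)

# Illustrative ill-conditioned 2-D quadratic with condition parameter kappa.
kappa = 100.0
H = np.diag([1.0, 1.0 / kappa])
Sigma = np.eye(2)                 # isotropic gradient noise (illustrative)

# Here trace(Lambda) = 1 + kappa^2, so the problem-dependent error floor
# grows like the square of the condition parameter.
tr = trace_lambda(H, Sigma)
print(tr)  # 1 + kappa^2 = 10001.0
```

This matches the paper's observation for its quadratic family: the sample-size threshold for asymptotic behavior scales like the square of a condition parameter, so ill-conditioned instances need many samples before trace(Λ)-level accuracy is even attainable.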