Proximal‑gradient style sampler for distributions that mix smooth and non‑smooth parts
This paper proposes a practical way to draw samples from probability densities of the form exp(−f(x)−g(x)), where f is a smooth function whose gradient we can evaluate and g is a convex function that may be non‑smooth (for example, the indicator of a constraint set or an ℓ1 penalty). The method adapts the proximal gradient method from optimization into a sampling algorithm that needs only gradient access to f plus a sampling subroutine for g called a restricted Gaussian oracle (RGO). The authors show that this procedure reaches high accuracy while keeping the dependence on the problem dimension almost as good as that of the best known methods for the purely smooth case.
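To make the setup concrete, here is a minimal 1‑D instance of such a composite target, assuming a quadratic smooth part and an ℓ1 penalty; the function names and the weight LAM are illustrative choices, not from the paper:

```python
import math

# Hypothetical composite target exp(-f(x) - g(x)) in one dimension:
# f is smooth (gradient available), g is convex but non-smooth at 0.
LAM = 2.0  # l1 penalty weight (illustrative)

def f(x):
    """Smooth part: a simple quadratic, gradient exists everywhere."""
    return 0.5 * x * x

def grad_f(x):
    """Gradient of the smooth part, the only access the sampler needs to f."""
    return x

def g(x):
    """Non-smooth convex part: lam*|x|, with a kink at x = 0."""
    return LAM * abs(x)

def potential(x):
    """The target density is proportional to exp(-potential(x))."""
    return f(x) + g(x)
```

The sampler described next never differentiates g; it only ever needs to sample from g combined with a Gaussian term, which is exactly what the RGO provides.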
At a high level, the algorithm lifts the target to a joint distribution on two variables (x,y) whose x‑marginal equals the original target and whose conditional distribution of y given x is a Gaussian centered at x. The sampler alternates (Gibbs sampling) between drawing y from that Gaussian and updating x conditional on y. The conditional update for x would normally require sampling from a density proportional to exp(−f(x)−g(x)−(1/2h)||x−y||^2). Instead of sampling this density directly, the authors linearize the smooth part f around y and use the RGO for g to draw a proposal from exp(−⟨∇f(y),x−y⟩−g(x)−(1/2h)||x−y||^2). They then apply an independent Metropolis–Hastings correction step to remove the bias introduced by the linearization.
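The alternation above can be sketched in code. This is a toy 1‑D instance, not the paper's implementation: it takes f(x) = (x−M)²/2 and g the indicator of the box [LO, HI], so the RGO reduces to a truncated Gaussian, which is sampled here by simple rejection. The constants M, LO, HI, H and the helper names are all assumptions made for illustration.

```python
import math
import random

# Toy target: exp(-f(x) - g(x)) with f(x) = 0.5*(x - M)**2 smooth and
# g the indicator of the box [LO, HI] (convex, non-smooth).
M, LO, HI = 0.5, -1.0, 1.0
H = 0.25  # step size h (illustrative; the paper tunes h on the order of 1/(beta*sqrt(d)))

def f(x):
    return 0.5 * (x - M) ** 2

def grad_f(x):
    return x - M

def rgo(y):
    """RGO for g plus the linearized f: sample the Gaussian
    N(y - h*grad_f(y), h) truncated to [LO, HI], by rejection."""
    mean = y - H * grad_f(y)
    while True:
        z = random.gauss(mean, math.sqrt(H))
        if LO <= z <= HI:
            return z

def step(x):
    # 1) Gibbs step: draw y ~ N(x, h), the Gaussian perturbation of x.
    y = random.gauss(x, math.sqrt(H))
    # 2) Proposal from the RGO applied to the linearized conditional.
    z = rgo(y)
    # 3) Independent Metropolis-Hastings correction for the linearization:
    #    the g terms and the quadratic (1/2h)||x - y||^2 terms cancel in the
    #    ratio, leaving only the f-versus-linearization mismatch.
    log_ratio = f(x) - f(z) + grad_f(y) * (z - x)
    if math.log(random.random()) < log_ratio:
        return z   # accept the proposal
    return x       # reject: keep the current point

random.seed(0)
x, samples = 0.0, []
for i in range(20000):
    x = step(x)
    if i >= 2000:          # discard burn-in
        samples.append(x)
mean = sum(samples) / len(samples)
```

Because one Metropolis–Hastings step leaves the exact conditional invariant, the chain has the composite target as its stationary x‑marginal; in this toy case the long‑run mean should approach the mean of N(M, 1) truncated to the box (about 0.14 here), pulled below M by the constraint.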
The main theoretical claim (Theorem 3.1) is that when f+g is strongly convex and f is smooth, the sampler achieves ε error in total variation distance in about Õ(κ √d log^4(1/ε)) iterations. Here κ = β/α is the condition number, with β the smoothness constant of f and α the strong convexity parameter of f+g, and d is the dimension; the Õ notation hides logarithmic factors. This √d dependence on the dimension matches recent state‑of‑the‑art results for the non‑composite case g = 0, while avoiding any smoothness assumption on g. The proof hinges on a step size h of order 1/(β√d), chosen to control a Rényi divergence between the proposal and the target.
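As a back‑of‑the‑envelope reading of the stated bound, the following sketch evaluates the iteration count κ √d log^4(1/ε) and the step size h ≈ 1/(β√d); the constants and the extra logarithmic factors hidden by the Õ notation are ignored, so these numbers are order‑of‑magnitude illustrations only:

```python
import math

def step_size(beta, d):
    """Step size h on the order of 1/(beta*sqrt(d)), ignoring constants."""
    return 1.0 / (beta * math.sqrt(d))

def iteration_estimate(beta, alpha, d, eps):
    """Iterations ~ kappa * sqrt(d) * log^4(1/eps), with kappa = beta/alpha."""
    kappa = beta / alpha
    return kappa * math.sqrt(d) * math.log(1.0 / eps) ** 4
```

For example, doubling the smoothness constant β doubles κ (and the iteration estimate) while halving the step size h, and tightening ε only inflates the count polylogarithmically.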