Nesterov's Smoothing
We introduced the proximal gradient algorithm to solve composite problems of the form:
\[
\min_x \; F(x) = f(x) + h(x),
\]
where \(h\) is convex with a simple proximal operator.
When \(f\) is smooth, accelerated proximal gradient descent has a convergence rate of \(O(1/t^2)\). What if the objective is not smooth? For example, the square-root Lasso:
\[
\min_{\beta} \; \|y - \mathbb{X}\beta\|_2 + \lambda\|\beta\|_1.
\]
The \(\ell_2\)-norm \(\|x\|_2\) is not differentiable at \(0\). Nesterov's smoothing idea involves approximating the non-smooth objective function with a smooth one and minimizing the smooth approximation using gradient descent.
Definition: A convex function \(f\) is \((\alpha, \beta)\)-smoothable if for any \(\mu > 0\), there exists a convex approximation \(f_{\mu}\) such that:
- \(f_{\mu}(x) \leq f(x) \leq f_{\mu}(x) + \beta\mu\), for all \(x\).
- \(f_{\mu}\) is \(\frac{\alpha}{\mu}\)-smooth.
Example: \(\ell_1\)-norm
Approximate the absolute value \(f(z) = |z|\) using the Huber loss:
\[
h_{\mu}(z) =
\begin{cases}
\dfrac{z^2}{2\mu}, & |z| \leq \mu, \\[4pt]
|z| - \dfrac{\mu}{2}, & |z| > \mu.
\end{cases}
\]
The Huber loss is \(\frac{1}{\mu}\)-smooth, so \(|z|\) is \((1, \frac{1}{2})\)-smoothable. The \(\ell_1\)-norm \(f(z) = \|z\|_1\) is then approximated by \(\sum_{i=1}^d h_{\mu}(z_i)\), which makes it \((1, \frac{d}{2})\)-smoothable.
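For concreteness, here is a minimal NumPy sketch of the Huber smoothing and the resulting smooth approximation of the \(\ell_1\)-norm; the function names are illustrative and not part of the notes.

```python
import numpy as np

def huber(z, mu):
    """Huber smoothing of |z|: quadratic for |z| <= mu, linear otherwise."""
    return np.where(np.abs(z) <= mu, z**2 / (2 * mu), np.abs(z) - mu / 2)

def l1_smooth(z, mu):
    """Smooth approximation of ||z||_1: sum of coordinate-wise Huber losses.
    Satisfies l1_smooth(z, mu) <= ||z||_1 <= l1_smooth(z, mu) + d * mu / 2."""
    return huber(z, mu).sum()

def l1_smooth_grad(z, mu):
    """Gradient of the smoothed l1-norm; it is (1/mu)-Lipschitz."""
    return np.clip(z / mu, -1.0, 1.0)
```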
Example: \(\ell_2\)-norm
Approximate the \(\ell_2\)-norm \(f(x) = \|x\|_2\) by:
\[
f_{\mu}(x) = \sqrt{\|x\|_2^2 + \mu^2} - \mu.
\]
This makes \(\|x\|_2\) \((1,1)\)-smoothable, with dimension-free smoothing parameters.
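A corresponding sketch for the \(\ell_2\)-norm, assuming the smoothing \(f_{\mu}(x) = \sqrt{\|x\|_2^2 + \mu^2} - \mu\) written above:

```python
import numpy as np

def l2_smooth(x, mu):
    """Smooth approximation of ||x||_2: satisfies f_mu(x) <= ||x||_2 <= f_mu(x) + mu."""
    return np.sqrt(x @ x + mu**2) - mu

def l2_smooth_grad(x, mu):
    """Gradient of the smoothed l2-norm; its Lipschitz constant is 1/mu."""
    return x / np.sqrt(x @ x + mu**2)
```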
The following theorem shows the convergence rate using Nesterov's smoothing idea:
Theorem: Given \(F(x) = f(x) + h(x)\), where \(f\) is \((\alpha, \beta)\)-smoothable and \(h\) is convex, let \(f_\mu\) be the \(\frac{\alpha}{\mu}\)-smooth approximation to \(f\). Applying accelerated proximal gradient descent to \(F_\mu(x) = f_{\mu}(x) + h(x)\) with \(\mu = \epsilon/(2\beta)\) achieves \(\epsilon\)-accuracy if \(t \gtrsim \frac{\sqrt{\alpha\beta}}{\epsilon}\).
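A quick sanity check of the rate (constants are only indicative): since \(f_\mu \leq f \leq f_\mu + \beta\mu\), for any iterate \(x_t\),
\[
F(x_t) - F^{\star} \;\leq\; \big(F_\mu(x_t) - F_\mu^{\star}\big) + \beta\mu \;\lesssim\; \frac{\alpha/\mu}{t^2} + \beta\mu.
\]
Setting \(\mu = \epsilon/(2\beta)\) makes the second term \(\epsilon/2\), and the first term drops below \(\epsilon/2\) once \(t \gtrsim \sqrt{\alpha/(\mu\epsilon)} \asymp \sqrt{\alpha\beta}/\epsilon\) (hiding the dependence on the initial distance to the optimum).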
For the square-root Lasso, applying accelerated proximal gradient descent to the smoothed objective
\[
F_{\mu}(\beta) = \sqrt{\|y - \mathbb{X}\beta\|_2^2 + \mu^2} - \mu + \lambda\|\beta\|_1
\]
achieves \(\epsilon\)-accuracy within \(O(1/\epsilon)\) steps, assuming the maximum singular value of the design matrix \(\mathbb{X}\) is bounded.
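To make the recipe concrete, here is a minimal FISTA-style sketch of accelerated proximal gradient descent on the smoothed square-root Lasso. The step size, function names, and synthetic data are assumptions for illustration, not part of the notes; soft-thresholding is the proximal operator of the \(\ell_1\) term.

```python
import numpy as np

def smoothed_sqrt_loss_grad(beta, X, y, mu):
    """Gradient of f_mu(beta) = sqrt(||y - X beta||_2^2 + mu^2) - mu."""
    r = X @ beta - y
    return X.T @ r / np.sqrt(r @ r + mu**2)

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def smoothed_sqrt_lasso_fista(X, y, lam, eps, beta_param, n_steps):
    """Accelerated proximal gradient on F_mu = f_mu + lam * ||.||_1.

    beta_param is the 'beta' from the smoothability bound; for the l2-norm
    smoothing used here it equals 1, so mu = eps / 2.
    """
    mu = eps / (2.0 * beta_param)            # smoothing level from the theorem
    L = np.linalg.norm(X, 2) ** 2 / mu       # smoothness constant of f_mu
    b = z = np.zeros(X.shape[1])
    t = 1.0
    for _ in range(n_steps):
        b_new = soft_threshold(z - smoothed_sqrt_loss_grad(z, X, y, mu) / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        z = b_new + (t - 1.0) / t_new * (b_new - b)   # Nesterov momentum step
        b, t = b_new, t_new
    return b

# Tiny synthetic example (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
y = X[:, :3] @ np.ones(3) + 0.1 * rng.standard_normal(50)
beta_hat = smoothed_sqrt_lasso_fista(X, y, lam=0.5, eps=1e-2, beta_param=1.0, n_steps=500)
```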