Duality and ADMM¶
Composite Objective Function¶
We have introduced proximal gradient descent to solve optimization problems of the form \(\min_x f(x) + g(x)\), where \(f\) is smooth but \(g\) is not differentiable. A key step in proximal gradient descent is evaluating the proximal operator:
For example, when \(g(x) = \lambda \|x\|_1\), the proximal operator is soft-thresholding. For the fused Lasso problem:
where \(D\) is a matrix representing the difference operator. Applying proximal gradient descent requires solving:
Unlike the \(\ell_1\)-norm case, this proximal problem has no closed-form solution, which makes proximal gradient descent inefficient for the fused Lasso.
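For concreteness, the proximal operator of \(g(x) = \lambda\|x\|_1\) has the well-known componentwise closed form (soft-thresholding), whereas no such formula is available for \(\lambda\|Dx\|_1\) with a general \(D\):
$$ \big[\mathrm{prox}_{\lambda\|\cdot\|_1}(x)\big]_i = \arg\min_{z_i}\ \tfrac{1}{2}(z_i - x_i)^2 + \lambda|z_i| = \mathrm{sign}(x_i)\,\max\{|x_i| - \lambda,\, 0\}. $$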
We discuss solving composite objective functions. Generally, we are interested in the following composite optimization problem:
for some convex \(f\) and \(g\). The fused Lasso can be reduced to this form by letting \(f(\beta) = \|Y - X\beta\|_2^2\), \(g(y) = \lambda \|y\|_1\), so:
Another example is basis pursuit:
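One way to cast basis pursuit in the composite form above (a sketch; the exact splitting may differ from the one used here, and the notation follows the Lasso setup with design \(X\) and response \(Y\)) is to let \(f\) be the indicator of the affine constraint and \(g\) the \(\ell_1\)-norm, coupled by \(y = \beta\):
$$ f(\beta) = \begin{cases} 0 & \text{if } X\beta = Y,\\ +\infty & \text{otherwise,} \end{cases} \qquad g(y) = \|y\|_1, \qquad \text{subject to } y = \beta. $$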
Duality¶
To solve the composite optimization problem, we start by reviewing duality in optimization. We define the Lagrange function (Lagrangian) of the problem as:
This allows us to convert the primal problem to the dual problem:
Duality is crucial in optimization. The following theorem implies we can solve the primal problem via the dual problem.
Theorem (Strong Duality): Given convex functions \(f,g\), the primal problem has the solution:
Consider the dual problem \(\lambda^* = \arg\max_{\lambda} h(\lambda)\), where \(h(\lambda) = \min_{x,y}L(x,y, \lambda)\). We can solve \((x^*, y^*)\) via:
As \(L(x,y, \lambda^*)\) is decomposable, we further have:
The strong duality property is useful for converting a linearly constrained problem into an unconstrained dual problem. The following example shows how to find the dual problem of Lasso.
Example: Duality of Lasso¶
As Lasso is an unconstrained problem, we first convert it to the standard form:
The Lagrange function is:
Notice that \(L(\beta, z, \mu)\) is decomposable:
The second problem is quadratic, and we have:
Applying the first-order optimality condition to the first problem, we get:
As \(\|g\|_\infty \le 1\), the equation \(0 = -X^\top \mu + \lambda g\) can hold only if \(\lambda \geq \|X^\top \mu\|_\infty\); otherwise the first problem is unbounded below. Moreover, if \(\lambda \geq \|X^\top \mu\|_\infty\), taking the inner product of both sides of \(0 = -X^\top \mu + \lambda g\) with \(\beta^*\) gives:
Therefore, we have:
Combining the results, the dual problem becomes:
Therefore, the dual problem of Lasso is:
This is equivalent to:
By duality, the dual variable \(\mu^* = z^* = Y - X\beta^*\) is the residual. This gives a new geometric insight into Lasso: like OLS, the Lasso fit can be viewed as a projection.
Alternating Direction Method of Multipliers (ADMM)¶
Consider an equivalent form of the problem by adding a quadratic term:
We introduce the augmented Lagrangian:
To solve the dual problem \(\max_\lambda \min_{x,y} L_{\rho} (x,y,\lambda)\), we propose updating the primal and dual variables alternately:
Primal step:
Dual step:
We add a proximal term in the dual step because the augmented Lagrangian \(L_{\rho}(x_{t+1}, y_{t+1}, \lambda)\) is linear in \(\lambda\); without it, \(\max_{\lambda} L_{\rho}(x_{t+1}, y_{t+1}, \lambda) = \infty\) whenever the constraint is not exactly satisfied. The proximal term prevents \(\lambda_{t+1}\) from deviating too much from \(\lambda_t\).
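To see where the explicit multiplier update comes from, carry out the maximization in the dual step (this assumes the constraint is written as \(Ax + By = c\) and the proximal term is \(\frac{1}{2\rho}\|\lambda - \lambda_t\|_2^2\), the standard choices):
$$ \lambda_{t+1} = \arg\max_{\lambda}\ \Big\{ \langle \lambda,\, Ax_{t+1} + By_{t+1} - c\rangle - \tfrac{1}{2\rho}\|\lambda - \lambda_t\|_2^2 \Big\} = \lambda_t + \rho\,(Ax_{t+1} + By_{t+1} - c). $$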
Plugging the augmented Lagrangian into the primal and dual steps, we have the following algorithm.
Alternating Direction Method of Multipliers (ADMM)
If \(f\) and \(g\) are closed convex functions, the ADMM algorithm converges.
ADMM is not a first-order method
ADMM is not a first-order method. The sub-problems in the primal step use the functions \(f\) and \(g\) themselves, rather than just their gradients as in proximal gradient descent. Consequently, ADMM may converge faster than first-order methods in terms of the number of iterations, but its computational cost per iteration can be higher.
Example: Fused Lasso¶
Applying ADMM to the fused Lasso:
The updating rule is:
This yields the algorithm:
When \(D = I\), the algorithm reduces to ADMM for Lasso. Unlike FISTA, ADMM involves matrix inversion, which may be time-consuming when \(d\) is large.
Here is an implementation of ADMM for the fused Lasso in PyTorch. Notice that, following the tips in Linear Algebra, we pre-compute the Cholesky decomposition of the matrix \((X^\top X + \rho D^\top D)\) outside the loop to make the iterations more efficient.
import torch
# Soft-thresholding function
def soft_thresholding(y, threshold):
    return torch.sign(y) * torch.clamp(torch.abs(y) - threshold, min=0)
# Define problem parameters
torch.manual_seed(42)
n, d = 10, 5 # Number of samples (n) and features (d)
X = torch.randn(n, d) # Design matrix
Y = torch.randn(n) # Response vector
# Define difference matrix D (for fused lasso)
D = torch.eye(d)[:-1] - torch.eye(d)[1:]  # (d-1) x d first-order difference matrix
# Initialize variables
beta = torch.zeros(d) # Coefficients
z = torch.zeros(d - 1) # Auxiliary variable
lam = torch.zeros(d - 1) # Lagrange multiplier
# ADMM parameters
lambda_ = 0.1 # Regularization parameter
rho = 1.0 # ADMM penalty parameter
num_iters = 100 # Number of iterations
# Compute Cholesky decomposition of (X^T X + rho D^T D)
XtX = X.T @ X
DtD = D.T @ D
A = XtX + rho * DtD # Regularized system matrix
# Cholesky decomposition (more stable than direct inversion)
L = torch.linalg.cholesky(A) # Compute Cholesky factor L
for t in range(num_iters):
    # Solve for beta using the Cholesky decomposition (two triangular solves)
    b_rhs = X.T @ Y + rho * D.T @ z - D.T @ lam  # Right-hand side
    w = torch.linalg.solve_triangular(L, b_rhs.unsqueeze(1), upper=False)  # Forward substitution
    beta = torch.linalg.solve_triangular(L.T, w, upper=True).squeeze(1)  # Backward substitution
    # Update z using soft-thresholding
    z = soft_thresholding(D @ beta + lam / rho, lambda_ / rho)
    # Update lambda (dual variable)
    lam += rho * (D @ beta - z)
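As a quick sanity check (a minimal sketch, assuming the \(\tfrac{1}{2}\)-scaled squared loss that matches the \(\beta\)-update above), we can evaluate the fused-Lasso objective at the final iterate:
# Evaluate the fused-Lasso objective at the final iterate (sanity check)
obj = 0.5 * torch.sum((Y - X @ beta) ** 2) + lambda_ * torch.sum(torch.abs(D @ beta))
print(f"Fused-Lasso objective after {num_iters} iterations: {obj.item():.4f}")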
When \(d\) is large, we can further improve the efficiency by using the Woodbury matrix identity to solve the linear system: $$ (X^\top X + \rho D^\top D)^{-1} = (\rho D^\top D)^{-1} - (\rho D^\top D)^{-1} X^\top \big(I_n + X(\rho D^\top D)^{-1}X^\top\big)^{-1} X (\rho D^\top D)^{-1}. $$ The computation becomes much more efficient, especially in the Lasso case \(D = I\), where \((\rho D^\top D)^{-1} = I/\rho\).
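Below is a minimal sketch of the Woodbury trick for the Lasso case \(D = I\) (so \(\rho D^\top D = \rho I\)). The right-hand side b is a placeholder, and the comparison against a direct solve is only a correctness check; with \(n > d\) in this toy example the trick brings no speedup, since it pays off when \(d \gg n\).
# Woodbury solve for (X^T X + rho I) x = b, reducing a d x d system to an n x n one
b = torch.randn(d)                                # placeholder right-hand side
M = rho * torch.eye(n) + X @ X.T                  # n x n system matrix
x_woodbury = b / rho - X.T @ torch.linalg.solve(M, X @ b) / rho
x_direct = torch.linalg.solve(X.T @ X + rho * torch.eye(d), b)
print(torch.allclose(x_woodbury, x_direct, atol=1e-5))  # should print True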
Example: Graphical Lasso¶
Given i.i.d. samples \(X_1, \ldots, X_n \sim N(0, \Sigma)\), we estimate the precision matrix \(\Theta = \Sigma^{-1}\) via graphical Lasso:
Applying ADMM, we get:
where the Frobenius norm \(\|A\|_{\rm F}^2 = \sum_{jk}A_{jk}^2\). This simplifies to:
where, for the spectral decomposition \(X = U\,{\rm diag}\{\lambda_i\}\,U^\top\), \(\mathcal{F}_{\rho}(X) = U\,{\rm diag}\Big\{\frac{\lambda_i + \sqrt{\lambda_i^2 + 4/\rho}}{2}\Big\}U^\top\).
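A minimal PyTorch sketch of this eigenvalue-wise update (the function name F_rho and its argument are illustrative; it assumes the sub-problem reduces to solving \(\theta_i - \frac{1}{\rho\,\theta_i} = \lambda_i\) for each eigenvalue, as above):
# Eigenvalue-wise update used in the Theta-step of ADMM (illustrative sketch)
def F_rho(X_mat, rho):
    lam_eig, U = torch.linalg.eigh(X_mat)  # spectral decomposition X = U diag(lam) U^T
    theta = (lam_eig + torch.sqrt(lam_eig**2 + 4.0 / rho)) / 2  # positive root of theta - 1/(rho*theta) = lam
    return U @ torch.diag(theta) @ U.T  # reassemble the positive definite iterate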
Example: Consensus Optimization¶
We design a divide-and-conquer ADMM for the Lasso to handle massive data sets where \(n\) is extremely large. The Lasso problem:
can be formulated in the following general block-separable form:
Applying ADMM, we get:
ADMM for Consensus Optimization
In the first step, we update each \(x_i^{t+1}\) in parallel, then gather all local iterates, and finally broadcast the updated dual variables back to each core.
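A minimal sketch of this scheme for the Lasso, reusing X, Y, soft_thresholding, lambda_, rho, and num_iters from the fused-Lasso snippet above; the number of blocks K and the use of scaled dual variables u_i are illustrative choices rather than the exact formulation in the text.
# Consensus ADMM for Lasso: split rows across K "cores" (scaled dual form)
K = 2
Xs, Ys = X.chunk(K), Y.chunk(K)             # local data blocks
x_loc = [torch.zeros(d) for _ in range(K)]  # local copies of the coefficients
u_loc = [torch.zeros(d) for _ in range(K)]  # scaled dual variables
z = torch.zeros(d)                          # global (consensus) variable
for t in range(num_iters):
    # Local updates (each block can run on its own core)
    for i in range(K):
        A_i = Xs[i].T @ Xs[i] + rho * torch.eye(d)
        x_loc[i] = torch.linalg.solve(A_i, Xs[i].T @ Ys[i] + rho * (z - u_loc[i]))
    # Gather: soft-threshold the average of the local iterates
    avg = torch.stack([x_loc[i] + u_loc[i] for i in range(K)]).mean(dim=0)
    z = soft_thresholding(avg, lambda_ / (rho * K))
    # Broadcast: update the dual variable on each core
    for i in range(K):
        u_loc[i] = u_loc[i] + x_loc[i] - z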