
Duality and ADMM

Composite Objective Function

We have introduced proximal gradient descent to solve optimization problems of the form $\min_x f(x) + g(x)$, where f is smooth but g is not differentiable. A key step in proximal gradient descent is solving the proximal operator:

$$\operatorname{prox}_g(x) = \arg\min_{z \in \mathbb{R}^d} \left\{ \frac{1}{2}\|z - x\|_2^2 + g(z) \right\}$$

For example, when $g(x) = \lambda\|x\|_1$, the proximal operator is soft-thresholding. For the fused Lasso problem:

$$\min_\beta \|Y - X\beta\|_2^2 + \lambda\|D\beta\|_1$$

where D is a matrix representing a difference operator, applying proximal gradient descent requires solving:

$$\arg\min_{z \in \mathbb{R}^d} \left\{ \frac{1}{2}\|z - x\|_2^2 + \lambda\|Dz\|_1 \right\}$$

Unlike the plain $\ell_1$-norm, this problem lacks a closed-form solution, making proximal gradient descent inefficient for the fused Lasso.
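For reference, the soft-thresholding map mentioned above has the explicit coordinate-wise form

$$\left[\operatorname{prox}_{\lambda\|\cdot\|_1}(x)\right]_i = \operatorname{sign}(x_i)\,\big(|x_i| - \lambda\big)_+,$$

whereas $\min_z \frac{1}{2}\|z - x\|_2^2 + \lambda\|Dz\|_1$ admits no such coordinate-wise solution, because $D$ couples the coordinates.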

We discuss solving composite objective functions. Generally, we are interested in the following composite optimization problem:

$$\min_{x,y} f(x) + g(y), \quad \text{s.t. } Ax + By = c$$

for some convex f and g. The fused Lasso can be reduced to this form by letting $f(\beta) = \|Y - X\beta\|_2^2$, $g(y) = \lambda\|y\|_1$, so:

$$\min_\beta \|Y - X\beta\|_2^2 + \lambda\|D\beta\|_1 \quad\Longleftrightarrow\quad \min_{\beta, y} f(\beta) + g(y), \quad \text{s.t. } D\beta - y = 0$$

Another example is basis pursuit:

$$\min_\beta \|\beta\|_1 \quad \text{s.t. } X\beta - Y = 0$$
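One way to put basis pursuit into the composite form above (our own splitting, one of several possible) is to introduce an auxiliary variable for the fitted values and an indicator function:

$$\min_{\beta, z} f(\beta) + g(z), \quad f(\beta) = \|\beta\|_1, \quad g(z) = \begin{cases} 0, & z = Y; \\ +\infty, & \text{otherwise}, \end{cases} \quad \text{s.t. } X\beta - z = 0$$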

Duality

To solve the composite optimization problem, we start by reviewing duality in optimization. We define the Lagrange function of the problem as:

$$L(x, y, \lambda) = f(x) + g(y) + \lambda^\top(Ax + By - c)$$

This allows us to convert the primal problem to the dual problem:

$$\max_\lambda h(\lambda), \quad \text{where } h(\lambda) = \min_{x,y} L(x, y, \lambda)$$

Duality is crucial in optimization. The following theorem implies we can solve the primal problem via the dual problem.

Theorem (Strong Duality): Given convex functions f, g, let the primal solution be

$$(x^*, y^*) = \arg\min_{x,y} f(x) + g(y), \quad \text{s.t. } Ax + By = c,$$

and consider the dual problem $\lambda^* = \arg\max_\lambda h(\lambda)$, where $h(\lambda) = \min_{x,y} L(x, y, \lambda)$. Then we can recover $(x^*, y^*)$ via:

$$(x^*, y^*) = \arg\min_{x,y} L(x, y, \lambda^*)$$

As $L(x, y, \lambda^*)$ is decomposable, we further have:

$$x^* = \arg\min_x f(x) + (\lambda^*)^\top A x \quad \text{and} \quad y^* = \arg\min_y g(y) + (\lambda^*)^\top B y$$

The strong duality property is useful for converting a linearly constrained problem into an unconstrained dual problem. The following example shows how to find the dual problem of Lasso.

Example: Duality of Lasso

As Lasso is an unconstrained problem, we first convert it to the standard form:

$$\min_\beta \frac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1 \quad\Longleftrightarrow\quad \min_{\beta, z} \frac{1}{2}\|z\|_2^2 + \lambda\|\beta\|_1, \quad \text{s.t. } Y - X\beta = z$$

The Lagrange function is:

$$L(\beta, z, \mu) = \frac{1}{2}\|z\|_2^2 + \lambda\|\beta\|_1 + \mu^\top(Y - X\beta - z)$$

Notice that L(β,z,μ) is decomposable:

$$\min_{\beta, z} L(\beta, z, \mu) = \min_\beta \left\{ \lambda\|\beta\|_1 - \mu^\top X\beta \right\} + \min_z \left\{ \frac{1}{2}\|z\|_2^2 - \mu^\top z \right\} + \mu^\top Y$$

The second problem is quadratic, and we have:

$$z^* = \arg\min_z \left\{ \frac{1}{2}\|z\|_2^2 - \mu^\top z \right\} \quad\Longrightarrow\quad z^* - \mu = 0$$

Applying the first-order optimality condition to the first problem, we get:

$$\beta^* = \arg\min_\beta \left\{ \lambda\|\beta\|_1 - \mu^\top X\beta \right\} \quad\Longrightarrow\quad 0 = -X^\top\mu + \lambda g, \quad \text{for some } g \in \partial\|\beta^*\|_1$$

As $\|g\|_\infty \le 1$, the equality $0 = -X^\top\mu + \lambda g$ has a finite solution only if $\lambda \ge \|X^\top\mu\|_\infty$. Moreover, if $\lambda \ge \|X^\top\mu\|_\infty$, multiplying $\beta^*$ on both sides of $0 = -X^\top\mu + \lambda g$ gives:

$$0 = -(\beta^*)^\top X^\top\mu + \lambda\|\beta^*\|_1$$

Therefore, we have:

$$\min_\beta \left\{ \lambda\|\beta\|_1 - \mu^\top X\beta \right\} = \begin{cases} 0, & \text{if } \lambda \ge \|X^\top\mu\|_\infty; \\ -\infty, & \text{if } \lambda < \|X^\top\mu\|_\infty. \end{cases}$$

Combining the results, the dual problem becomes:

$$\max_\mu \min_{\beta, z} L(\beta, z, \mu) = \max_\mu \left\{ -\frac{1}{2}\|\mu\|_2^2 + \mu^\top Y \right\} \quad \text{s.t. } \lambda \ge \|X^\top\mu\|_\infty$$

Therefore, the dual problem of Lasso is:

$$\max_\mu L(\beta^*, z^*, \mu) = \max_\mu -\frac{1}{2}\|\mu\|_2^2 + \mu^\top Y \quad \text{s.t. } \|X^\top\mu\|_\infty \le \lambda$$

This is equivalent to the following, by completing the square: $-\frac{1}{2}\|\mu\|_2^2 + \mu^\top Y = -\frac{1}{2}\|\mu - Y\|_2^2 + \frac{1}{2}\|Y\|_2^2$.

**Duality of Lasso**: $$\min_\mu \|Y - \mu\|_2^2, \quad \text{s.t. } \|X^\top\mu\|_\infty \le \lambda$$

[Figure: Duality of Lasso]

By the duality, the dual variable $\mu^* = z^* = Y - X\beta^*$ is the residual. This gives a new geometric insight into Lasso, illustrating that Lasso is a projection, similar to OLS.
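As a quick sanity check of this projection view, take the special case $X = I$ (an assumption made only for this illustration): the Lasso solution is soft-thresholding of $Y$, and the residual $Y - \beta^*$ is exactly the coordinate-wise clipping of $Y$ to $[-\lambda, \lambda]$, i.e., the Euclidean projection of $Y$ onto the dual feasible set $\{\mu : \|X^\top\mu\|_\infty \le \lambda\}$.

import torch

torch.manual_seed(0)
lam = 0.5
Y = torch.randn(8)

# Primal Lasso solution when X = I: soft-thresholding of Y
beta_hat = torch.sign(Y) * torch.clamp(torch.abs(Y) - lam, min=0)

# Dual solution when X = I: projection of Y onto {mu : ||mu||_inf <= lam}
mu_hat = torch.clamp(Y, min=-lam, max=lam)

# The dual variable equals the primal residual Y - beta_hat
print(torch.allclose(mu_hat, Y - beta_hat))  # True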

Alternating Direction Method of Multipliers (ADMM)

Consider an equivalent form of the problem by adding a quadratic term:

$$\left\{\begin{array}{ll} \min_{x,y} & f(x) + g(y) \\ \text{s.t.} & Ax + By = c, \end{array}\right. \quad\Longleftrightarrow\quad \left\{\begin{array}{ll} \min_{x,y} & f(x) + g(y) + \frac{\rho}{2}\|Ax + By - c\|_2^2 \\ \text{s.t.} & Ax + By = c. \end{array}\right.$$

We introduce the augmented Lagrangian:

$$L_\rho(x, y, \lambda) = f(x) + g(y) + \lambda^\top(Ax + By - c) + \frac{\rho}{2}\|Ax + By - c\|_2^2$$

To solve the dual problem $\max_\lambda \min_{x,y} L_\rho(x, y, \lambda)$, we propose updating the primal and dual variables alternately:

Primal step:

$$\begin{cases} x^{t+1} = \arg\min_x L_\rho(x, y^t, \lambda^t); \\ y^{t+1} = \arg\min_y L_\rho(x^{t+1}, y, \lambda^t); \end{cases}$$

Dual step:

$$\lambda^{t+1} = \arg\max_\lambda L_\rho(x^{t+1}, y^{t+1}, \lambda) - \frac{1}{2\rho}\|\lambda - \lambda^t\|_2^2$$

We add a proximal term in the dual step because the augmented Lagrangian $L_\rho(x^{t+1}, y^{t+1}, \lambda)$ is linear with respect to $\lambda$, making $\max_\lambda L_\rho(x^{t+1}, y^{t+1}, \lambda) = \infty$. The proximal term prevents $\lambda^{t+1}$ from deviating too much from $\lambda^t$.
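Indeed, writing $r^{t+1} = Ax^{t+1} + By^{t+1} - c$, the dual step is a concave quadratic in $\lambda$; setting its gradient to zero gives the simple dual update that appears in the algorithm below:

$$r^{t+1} - \frac{1}{\rho}(\lambda - \lambda^t) = 0 \quad\Longrightarrow\quad \lambda^{t+1} = \lambda^t + \rho\, r^{t+1}$$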

Plugging the augmented Lagrangian into the primal and dual steps, we have the following algorithm.

Alternating Direction Method of Multipliers (ADMM)

$$\begin{aligned}
x^{t+1} &= \arg\min_x f(x) + \frac{\rho}{2}\|Ax + By^t - c + \lambda^t/\rho\|_2^2 \\
y^{t+1} &= \arg\min_y g(y) + \frac{\rho}{2}\|Ax^{t+1} + By - c + \lambda^t/\rho\|_2^2 \\
\lambda^{t+1} &= \lambda^t + \rho(Ax^{t+1} + By^{t+1} - c)
\end{aligned}$$

If f and g are closed convex functions, the ADMM algorithm converges.

ADMM is not a first-order method

The sub-problems in the primal step use the functions f and g themselves, rather than just their gradients as in proximal gradient descent. ADMM may therefore converge in fewer iterations than first-order methods, but the computational cost per iteration can be higher.
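Conceptually, the updates only require two sub-problem solvers, so the whole method fits in a short loop. Below is a minimal generic sketch in PyTorch (the function names x_update, y_update and this interface are our own illustrative choices, not a standard API); the fused Lasso example below writes out the same loop with closed-form sub-problem solutions.

import torch

def admm(x_update, y_update, A, B, c, rho=1.0, num_iters=100):
    # Generic ADMM loop; x_update(y, lam) and y_update(x, lam) are user-supplied
    # oracles that solve the two primal sub-problems of the augmented Lagrangian:
    #   x_update(y, lam) = argmin_x f(x) + rho/2 * ||A x + B y - c + lam/rho||_2^2
    #   y_update(x, lam) = argmin_y g(y) + rho/2 * ||A x + B y - c + lam/rho||_2^2
    y = torch.zeros(B.shape[1])
    lam = torch.zeros(A.shape[0])
    for _ in range(num_iters):
        x = x_update(y, lam)                    # primal step for x
        y = y_update(x, lam)                    # primal step for y
        lam = lam + rho * (A @ x + B @ y - c)   # dual ascent step
    return x, y, lam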

Example: Fused Lasso

Applying ADMM to the fused Lasso:

$$\min_\beta \frac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|D\beta\|_1 \quad\Longleftrightarrow\quad \min_{\beta, z} \frac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|z\|_1 \quad \text{s.t. } D\beta - z = 0$$

The updating rule is:

$$\begin{aligned}
\beta^{t+1} &= \arg\min_\beta \frac{1}{2}\|Y - X\beta\|_2^2 + \frac{\rho}{2}\|D\beta - z^t + \lambda^t/\rho\|_2^2 \\
z^{t+1} &= \arg\min_z \lambda\|z\|_1 + \frac{\rho}{2}\|D\beta^{t+1} - z + \lambda^t/\rho\|_2^2 \\
\lambda^{t+1} &= \lambda^t + \rho(D\beta^{t+1} - z^{t+1})
\end{aligned}$$

This yields the algorithm:

$$\begin{aligned}
\beta^{t+1} &= (X^\top X + \rho D^\top D)^{-1}(X^\top Y + \rho D^\top z^t - D^\top\lambda^t) \\
z^{t+1} &= \operatorname{SoftThreshold}(D\beta^{t+1} + \lambda^t/\rho,\; \lambda/\rho) \\
\lambda^{t+1} &= \lambda^t + \rho(D\beta^{t+1} - z^{t+1})
\end{aligned}$$
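To see where the first line comes from, set the gradient of the β-subproblem to zero:

$$-X^\top(Y - X\beta) + \rho D^\top(D\beta - z^t + \lambda^t/\rho) = 0 \quad\Longrightarrow\quad (X^\top X + \rho D^\top D)\,\beta^{t+1} = X^\top Y + \rho D^\top z^t - D^\top\lambda^t$$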

When $D = I$, the algorithm reduces to ADMM for Lasso. Unlike FISTA, ADMM involves matrix inversion, which may be time-consuming when $d$ is large.

Here is an implementation of ADMM for the fused Lasso in PyTorch. Notice that we can apply the tips from Linear Algebra and pre-compute the Cholesky decomposition of the matrix $(X^\top X + \rho D^\top D)$ to make the iterations more efficient.

import torch

# Soft-thresholding function
def soft_thresholding(y, threshold):
    return torch.sign(y) * torch.clamp(torch.abs(y) - threshold, min=0)

# Define problem parameters
torch.manual_seed(42)
n, d = 10, 5  # Number of samples (n) and features (d)
X = torch.randn(n, d)  # Design matrix
Y = torch.randn(n)  # Response vector

# Define difference matrix D (for fused lasso)
D = torch.eye(d)[:-1] - torch.eye(d)[1:]  # First-order difference matrix of shape (d-1, d)

# Initialize variables
beta = torch.zeros(d)  # Coefficients
z = torch.zeros(d - 1)  # Auxiliary variable
lam = torch.zeros(d - 1)  # Lagrange multiplier

# ADMM parameters
lambda_ = 0.1  # Regularization parameter
rho = 1.0  # ADMM penalty parameter
num_iters = 100  # Number of iterations

# Compute Cholesky decomposition of (X^T X + rho D^T D)
XtX = X.T @ X
DtD = D.T @ D
A = XtX + rho * DtD  # Regularized system matrix

# Cholesky decomposition (more stable than direct inversion)
L = torch.linalg.cholesky(A)  # Compute Cholesky factor L

for t in range(num_iters):
    # Solve for beta using Cholesky decomposition
    b_rhs = (X.T @ Y + rho * D.T @ z - D.T @ lam).unsqueeze(1)  # Right-hand side as a column vector
    w = torch.linalg.solve_triangular(L, b_rhs, upper=False)  # Forward substitution: L w = b_rhs
    beta = torch.linalg.solve_triangular(L.T, w, upper=True).squeeze(1)  # Backward substitution: L^T beta = w

    # Update z using soft-thresholding
    z = soft_thresholding(D @ beta + lam / rho, lambda_ / rho)

    # Update lambda (dual variable)
    lam += rho * (D @ beta - z)

When d is large, we can further improve the efficiency by using the Woodbury matrix identity to handle the linear system: $$ (X^\top X + \rho D^\top D)^{-1} = (\rho D^\top D)^{-1} - (\rho D^\top D)^{-1} X^\top \left( I_n + X (\rho D^\top D)^{-1} X^\top \right)^{-1} X (\rho D^\top D)^{-1} $$ The computation becomes much more efficient, especially for the Lasso case $D = I$, since only an $n \times n$ matrix needs to be inverted.
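As a quick numerical sanity check of this identity in the Lasso case $D = I$ (with made-up dimensions chosen only for illustration), the Woodbury form replaces a $d \times d$ inverse by an $n \times n$ one:

import torch

torch.manual_seed(0)
n, d, rho = 20, 500, 1.0
X = torch.randn(n, d, dtype=torch.float64)

# Direct inverse of the d x d system matrix (X^T X + rho I)
direct = torch.linalg.inv(X.T @ X + rho * torch.eye(d, dtype=torch.float64))

# Woodbury form with D = I: only an n x n matrix is inverted
inner = torch.linalg.inv(torch.eye(n, dtype=torch.float64) + X @ X.T / rho)
woodbury = torch.eye(d, dtype=torch.float64) / rho - (X.T @ inner @ X) / rho**2

print(torch.allclose(direct, woodbury))  # True (up to numerical error)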

Example: Graphical Lasso

Given i.i.d. samples $X_1, \ldots, X_n \sim N(0, \Sigma)$, we estimate the precision matrix $\Theta = \Sigma^{-1}$ via the graphical Lasso:

$$\min_\Theta -\log\det\Theta + \operatorname{tr}(\Theta\hat{\Sigma}) + \lambda\|\Theta\|_{1,1} \quad\Longleftrightarrow\quad \min_{\Theta, \Psi} -\log\det\Theta + \operatorname{tr}(\Theta\hat{\Sigma}) + \lambda\|\Psi\|_{1,1} \quad \text{s.t. } \Theta = \Psi$$

Applying ADMM, we get:

$$\begin{aligned}
\Theta^{t+1} &= \arg\min_\Theta -\log\det\Theta + \operatorname{tr}(\Theta\hat{\Sigma}) + \frac{\rho}{2}\|\Theta - \Psi^t + \Lambda^t/\rho\|_F^2 \\
\Psi^{t+1} &= \arg\min_\Psi \lambda\|\Psi\|_{1,1} + \frac{\rho}{2}\|\Theta^{t+1} - \Psi + \Lambda^t/\rho\|_F^2 \\
\Lambda^{t+1} &= \Lambda^t + \rho(\Theta^{t+1} - \Psi^{t+1})
\end{aligned}$$

where the Frobenius norm is $\|A\|_F^2 = \sum_{j,k} A_{jk}^2$. This simplifies to:

$$\begin{aligned}
\Theta^{t+1} &= F_\rho(\Psi^t - \Lambda^t/\rho - \hat{\Sigma}/\rho) \\
\Psi^{t+1} &= \operatorname{SoftThreshold}(\Theta^{t+1} + \Lambda^t/\rho,\; \lambda/\rho) \\
\Lambda^{t+1} &= \Lambda^t + \rho(\Theta^{t+1} - \Psi^{t+1})
\end{aligned}$$

where, for the spectral decomposition $X = U \operatorname{diag}\{\lambda_i\} U^\top$, $F_\rho(X) = U \operatorname{diag}\left\{ \frac{\lambda_i + \sqrt{\lambda_i^2 + 4/\rho}}{2} \right\} U^\top$.
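A minimal PyTorch sketch of this operator, assuming a symmetric input matrix (as in the update above, where the argument is $\Psi^t - \Lambda^t/\rho - \hat{\Sigma}/\rho$); the function name is our own:

import torch

def F_rho(M, rho):
    # Solve rho * Theta - Theta^{-1} = rho * M for symmetric M via eigendecomposition:
    # if M = U diag(lam_i) U^T, then Theta = U diag((lam_i + sqrt(lam_i^2 + 4/rho)) / 2) U^T.
    lams, U = torch.linalg.eigh(M)
    theta_eigs = (lams + torch.sqrt(lams**2 + 4.0 / rho)) / 2.0
    return U @ torch.diag(theta_eigs) @ U.T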

Example: Consensus Optimization

We design a divide-and-conquer ADMM for Lasso with massive data, where $n$ is ultra-large. The Lasso:

$$\min_\beta \sum_{i=1}^n (Y_i - X_i^\top\beta)^2 + \lambda\|\beta\|_1$$

can be formulated as the general block separable form:

$$\min_x \sum_{i=1}^n f_i(x) \quad\Longleftrightarrow\quad \min_{x_i, z} \sum_{i=1}^n f_i(x_i) \quad \text{s.t. } x_i = z \text{ for all } i = 1, \ldots, n$$

Applying ADMM, we get:

ADMM for Consensus Optimization

$$\begin{aligned}
\text{Divide: } & x_i^{t+1} = \arg\min_{x_i} f_i(x_i) + \frac{\rho}{2}\|x_i - z^t + \lambda_i^t/\rho\|_2^2, \quad i = 1, \ldots, n \\
\text{Gather: } & z^{t+1} = \frac{1}{n}\sum_{i=1}^n \left( x_i^{t+1} + \lambda_i^t/\rho \right) \\
\text{Broadcast: } & \lambda_i^{t+1} = \lambda_i^t + \rho(x_i^{t+1} - z^{t+1}), \quad i = 1, \ldots, n
\end{aligned}$$

In the first step, we update each $x_i^{t+1}$ in parallel, then gather all local iterates, and finally broadcast the updated dual variables back to each core.
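Below is a minimal PyTorch sketch of these three steps for a plain least-squares consensus problem, splitting the samples into a few blocks rather than one block per sample; the ℓ1 penalty is dropped to keep the sketch short, and all names and dimensions are illustrative.

import torch

torch.manual_seed(0)
n, d, n_blocks = 300, 5, 3
rho, num_iters = 1.0, 200
X, beta_true = torch.randn(n, d), torch.randn(d)
Y = X @ beta_true + 0.1 * torch.randn(n)

# Split the data into blocks; each block keeps a local copy x_i of the parameter
X_blocks, Y_blocks = X.chunk(n_blocks), Y.chunk(n_blocks)
x = [torch.zeros(d) for _ in range(n_blocks)]
lam = [torch.zeros(d) for _ in range(n_blocks)]
z = torch.zeros(d)

for t in range(num_iters):
    # Divide: each block solves its own regularized least-squares sub-problem (parallelizable)
    for i in range(n_blocks):
        Xi, Yi = X_blocks[i], Y_blocks[i]
        x[i] = torch.linalg.solve(2 * Xi.T @ Xi + rho * torch.eye(d),
                                  2 * Xi.T @ Yi + rho * z - lam[i])
    # Gather: average the local iterates plus scaled dual variables
    z = torch.stack([x[i] + lam[i] / rho for i in range(n_blocks)]).mean(dim=0)
    # Broadcast: update each block's dual variable
    for i in range(n_blocks):
        lam[i] = lam[i] + rho * (x[i] - z)

ols = torch.linalg.solve(X.T @ X, X.T @ Y)  # full-data least-squares solution for reference
print(torch.norm(z - ols).item())           # shrinks toward 0 as the iterations proceed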