Mirror Descent¶
In the previous lecture, we introduced the proximal perspective of gradient descent. To minimize a differentiable function $f$, each iteration solves

$$x_{t+1} = \arg\min_{x} \left\{ f(x_t) + \langle \nabla f(x_t),\, x - x_t \rangle + \frac{1}{2\eta_t} \|x - x_t\|_2^2 \right\}.$$

This is composed of the first-order Taylor expansion and a proximal term. For constrained optimization over a set $\mathcal{C}$, we simply restrict the minimization:

$$x_{t+1} = \arg\min_{x \in \mathcal{C}} \left\{ f(x_t) + \langle \nabla f(x_t),\, x - x_t \rangle + \frac{1}{2\eta_t} \|x - x_t\|_2^2 \right\}.$$

If there are no constraints, i.e., $\mathcal{C} = \mathbb{R}^d$, the update recovers the familiar gradient descent step $x_{t+1} = x_t - \eta_t \nabla f(x_t)$.
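To see the equivalence, drop the terms that do not depend on $x$ and complete the square:

$$\arg\min_{x} \left\{ \langle \nabla f(x_t),\, x \rangle + \frac{1}{2\eta_t} \|x - x_t\|_2^2 \right\} = \arg\min_{x} \frac{1}{2\eta_t} \left\| x - \big(x_t - \eta_t \nabla f(x_t)\big) \right\|_2^2 = x_t - \eta_t \nabla f(x_t).$$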
Example: Quadratic Optimization¶
Consider the quadratic optimization problem

$$\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{2} x^\top Q x,$$

where $Q \succ 0$ is an ill-conditioned positive definite matrix, i.e., its largest and smallest eigenvalues differ by orders of magnitude.

Using the squared $\ell_2$-norm proximal term $\frac{1}{2\eta_t}\|x - x_t\|_2^2$, we obtain the standard gradient descent iterates.

In the figure above, the trajectory of gradient descent zigzags. This zigzag pattern occurs because the Euclidean ball in the proximal term does not match the elongated level sets of $f$: the negative gradient is perpendicular to the level set rather than pointing at the minimizer.

Replacing the proximal term with the $Q$-norm $\frac{1}{2\eta_t}\|x - x_t\|_Q^2$, where $\|v\|_Q^2 = v^\top Q v$, adapts each step to the geometry of $f$.

In the figure above, the descent direction directly points to the minimizer $x^\star = 0$; with $\eta_t = 1$, the iterate reaches it in a single step.
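A small numerical sketch of this comparison is below; the matrix $Q$, the starting point, and the step size are illustrative choices, not values from the lecture.

```python
import torch

# An ill-conditioned quadratic: condition number κ(Q) = 100
Q = torch.diag(torch.tensor([1.0, 100.0]))

def f(x):
    return 0.5 * x @ Q @ x  # ∇f(x) = Qx, minimizer x* = 0

x_gd = torch.tensor([100.0, 1.0])  # plain gradient descent iterate
x_md = x_gd.clone()                # Q-norm proximal-term iterate

eta = 0.009  # must satisfy eta < 2/λ_max(Q) = 0.02 for GD to be stable
for t in range(50):
    x_gd = x_gd - eta * (Q @ x_gd)                 # Euclidean proximal term
    x_md = x_md - torch.linalg.solve(Q, Q @ x_md)  # Q-norm term with η_t = 1

print(x_gd)  # still far from 0 along the flat (λ = 1) direction
print(x_md)  # reaches the minimizer 0 after the first step
```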
Bregman Divergence¶
The previous example shows the need for a distance that fits the problem's geometry better than the Euclidean one. This motivates the definition of the Bregman divergence.
Definition (Bregman Divergence): Let $\psi : \Omega \to \mathbb{R}$ be a continuously differentiable, strictly convex function on a convex set $\Omega$. The Bregman divergence induced by $\psi$ is

$$D_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y),\, x - y \rangle.$$

By convexity, $\psi(x) \ge \psi(y) + \langle \nabla \psi(y),\, x - y \rangle$, so $D_\psi(x, y) \ge 0$, with equality if and only if $x = y$ by strict convexity. Note that $D_\psi$ is in general not symmetric in $x$ and $y$, so it is a divergence rather than a metric.
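The definition translates directly into code. Below is a minimal generic implementation of $D_\psi$ using autograd; the function name `bregman` and its interface are our own choices for illustration.

```python
import torch

def bregman(psi, x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <∇psi(y), x - y>, via autograd."""
    y = y.detach().requires_grad_(True)
    psi_y = psi(y)
    (grad_y,) = torch.autograd.grad(psi_y, y)  # ∇ψ(y)
    return psi(x) - psi_y.detach() - grad_y @ (x - y)

# Sanity check: ψ(x) = ½‖x‖² recovers the squared Euclidean distance
psi = lambda v: 0.5 * (v @ v)
x, y = torch.tensor([1.0, 2.0]), torch.tensor([0.0, 1.0])
print(bregman(psi, x, y))  # tensor(1.) = ½‖x − y‖²
```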
The table below shows some common Bregman divergences.
| Name | $\psi(x)$ | $\operatorname{dom} \psi$ | $D_\psi(x, y)$ |
|---|---|---|---|
| Squared norm | $\frac{1}{2}x^2$ | $(-\infty, +\infty)$ | $\frac{1}{2}(x-y)^2$ |
| Shannon entropy | $x \log x - x$ | $[0, +\infty)$ | $x \log\frac{x}{y} - x + y$ |
| Bit entropy | $x \log x + (1-x)\log(1-x)$ | $[0, 1]$ | $x \log\frac{x}{y} + (1-x)\log\frac{1-x}{1-y}$ |
| Burg entropy | $-\log x$ | $(0, +\infty)$ | $\frac{x}{y} - \log\frac{x}{y} - 1$ |
| Hellinger | $-\sqrt{1-x^2}$ | $[-1, 1]$ | $\frac{1-xy}{\sqrt{1-y^2}} - \sqrt{1-x^2}$ |
| Exponential | $e^x$ | $(-\infty, +\infty)$ | $e^x - e^y - (x-y)e^y$ |
| Inverse | $\frac{1}{x}$ | $(0, +\infty)$ | $\frac{1}{x} + \frac{x}{y^2} - \frac{2}{y}$ |

(The entries are written for scalar $x$; for vectors, $\psi$ and $D_\psi$ are applied coordinate-wise and summed.)
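Any row of the table can be verified against the definition; here is a self-contained check of the Burg entropy row with arbitrary test values.

```python
import torch

# Check the Burg entropy row against the Bregman definition (scalar case)
x, y = torch.tensor(0.5), torch.tensor(2.0)
psi = lambda v: -torch.log(v)                    # ψ(x) = −log x on (0, ∞)
d_def = psi(x) - psi(y) - (-1.0 / y) * (x - y)   # definition, with ∇ψ(y) = −1/y
d_table = x / y - torch.log(x / y) - 1           # closed form from the table
print(d_def, d_table)                            # both ≈ 0.6363
```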
Mirror Descent¶
Replacing the squared Euclidean distance in the proximal term with a Bregman divergence gives the update

$$x_{t+1} = \arg\min_{x \in \mathcal{C}} \left\{ \langle \nabla f(x_t),\, x \rangle + \frac{1}{\eta_t} D_\psi(x, x_t) \right\}.$$

This leads to the mirror descent algorithm:

Mirror Descent Algorithm: for $t = 0, 1, 2, \dots$

1. Compute the gradient $g_t = \nabla f(x_t)$.
2. Take a gradient step in the mirror (dual) space: $\nabla \psi(y_{t+1}) = \nabla \psi(x_t) - \eta_t g_t$.
3. Project back onto $\mathcal{C}$ with the Bregman divergence: $x_{t+1} = \arg\min_{x \in \mathcal{C}} D_\psi(x, y_{t+1})$.

When $\psi(x) = \frac{1}{2}\|x\|_2^2$, we have $D_\psi(x, y) = \frac{1}{2}\|x - y\|_2^2$ and mirror descent reduces to projected gradient descent.
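Below is a minimal sketch of these steps in the unconstrained case $\mathcal{C} = \mathbb{R}^d$ (so the projection step is the identity), assuming the mirror map $\nabla\psi$ and its inverse are supplied as functions; the name `mirror_descent` and the interface are illustrative, not from the lecture.

```python
import torch

def mirror_descent(grad_f, mirror_map, inverse_map, x0, lr=0.1, num_iters=100):
    x = x0
    for t in range(num_iters):
        z = mirror_map(x) - lr * grad_f(x)  # gradient step in the dual space
        x = inverse_map(z)                  # map back to the primal space
    return x

# With ψ(x) = ½‖x‖², both maps are the identity and we recover plain
# gradient descent, here on f(x) = ‖x‖² with ∇f(x) = 2x:
x_star = mirror_descent(grad_f=lambda x: 2 * x,
                        mirror_map=lambda x: x,
                        inverse_map=lambda z: z,
                        x0=torch.tensor([1.0, -2.0]))
print(x_star)  # ≈ 0
```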
The previous example shows the importance of choosing the Bregman divergence according to the objective function's geometry. In most cases, however, we need a Bregman divergence suited to the constraint set's geometry. The following example illustrates choosing the right Bregman divergence for a specific constraint.
Example: Probability Simplex¶
In many statistical problems, we estimate the probability mass function of a discrete distribution. The parameters are constrained to the probability simplex:

$$\Delta_d = \left\{ x \in \mathbb{R}^d : x_i \ge 0,\ \sum_{i=1}^d x_i = 1 \right\}.$$

A widely used Bregman divergence for the probability simplex uses the negative entropy

$$\psi(x) = \sum_{i=1}^d x_i \log x_i.$$

The corresponding Bregman divergence, restricted to the simplex where $\sum_i x_i = \sum_i y_i = 1$, becomes the Kullback–Leibler (KL) divergence:

$$D_\psi(x, y) = \sum_{i=1}^d x_i \log\frac{x_i}{y_i} = D_{\mathrm{KL}}(x \,\|\, y).$$

This is also known as the relative entropy between the distributions $x$ and $y$.
Therefore, for the constrained optimization problem

$$\min_{x \in \Delta_d} f(x),$$

the mirror descent algorithm with the KL-divergence is

$$x_{t+1} = \arg\min_{x \in \Delta_d} \left\{ \langle \nabla f(x_t),\, x \rangle + \frac{1}{\eta_t} D_{\mathrm{KL}}(x \,\|\, x_t) \right\},$$

and it has a closed-form solution, the exponentiated gradient (multiplicative weights) update:

$$x_{t+1, i} = \frac{x_{t, i} \exp\!\big(-\eta_t [\nabla f(x_t)]_i\big)}{\sum_{j=1}^d x_{t, j} \exp\!\big(-\eta_t [\nabla f(x_t)]_j\big)}.$$
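To see where the closed form comes from, write the Lagrangian of the sub-problem (ignoring the nonnegativity constraints, which turn out to be inactive):

$$\mathcal{L}(x, \lambda) = \langle \nabla f(x_t),\, x \rangle + \frac{1}{\eta_t} \sum_{i=1}^d x_i \log\frac{x_i}{x_{t,i}} + \lambda \Big( \sum_{i=1}^d x_i - 1 \Big).$$

Setting $\partial \mathcal{L} / \partial x_i = 0$ gives $[\nabla f(x_t)]_i + \frac{1}{\eta_t}\big(\log\frac{x_i}{x_{t,i}} + 1\big) + \lambda = 0$, hence $x_i \propto x_{t,i} \exp\!\big(-\eta_t [\nabla f(x_t)]_i\big)$; the multiplier $\lambda$ is fixed by the constraint $\sum_i x_i = 1$, which produces the normalization above.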
Mirror Descent for Probability Simplex
The algorithm can be implemented as:
```python
import torch

def f(x):  # define the objective function
    ...

dim = 10  # dimension d of the probability simplex

# Initialize x at the uniform distribution. torch.full keeps x a leaf
# tensor, so autograd populates x.grad (dividing a leaf by dim would not).
x = torch.full((dim,), 1.0 / dim, requires_grad=True)

# Mirror descent parameters
lr = 0.1         # learning rate (η_t)
num_iters = 100  # number of iterations

for t in range(num_iters):
    # Compute function value at x_t
    loss = f(x)
    # Compute gradient ∇f(x_t) using autograd
    loss.backward()
    with torch.no_grad():
        # Exponentiated gradient update in numerically stable softmax form:
        # x_{t+1} ∝ x_t · exp(−η_t ∇f(x_t))
        x_new = torch.softmax(torch.log(x) - lr * x.grad, dim=0)
        # In-place update so x remains a leaf tensor across iterations
        x.copy_(x_new)
        # Zero out gradients for the next iteration
        x.grad.zero_()
```
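As a quick usage sketch, one can plug a concrete objective into the loop above; the quadratic below is an illustrative choice, not from the lecture. With `dim = 3`, the iterates converge to `p`, and every iterate stays on the simplex because the softmax output is positive and sums to one.

```python
# Illustrative objective for the loop above: minimize ‖x − p‖² over the
# simplex; since p lies in the simplex, the minimizer is p itself.
p = torch.tensor([0.7, 0.2, 0.1])

def f(x):
    return ((x - p) ** 2).sum()
```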
Summary for Constrained Optimization¶
Now consider the constrained optimization problem

$$\min_{x \in \mathcal{C}} f(x).$$

We have seen three algorithms, which differ in the proximal term and hence in the sub-problem solved at each step (a numerical comparison follows the list):

- Frank-Wolfe Algorithm: essentially mirror descent with a linear $\psi$, so that $D_\psi \equiv 0$ and only the linear approximation remains. It involves solving a linear programming sub-problem over $\mathcal{C}$.
- Projected Gradient Descent: uses the squared $\ell_2$-norm as the proximal term, involving a quadratic programming sub-problem (a Euclidean projection onto $\mathcal{C}$).
- Mirror Descent: using the KL-divergence $D_{\mathrm{KL}}(x \,\|\, x_t)$ on the probability simplex, the sub-problem has a closed-form solution (the exponentiated gradient update).
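To make the contrast between the last two sub-problems concrete, the sketch below runs one projected-gradient step on the simplex, using the standard sort-based Euclidean projection, next to one mirror-descent step; the gradient vector and step size are illustrative.

```python
import torch

def project_simplex(v):
    # Euclidean projection onto the probability simplex (sort-based method)
    u, _ = torch.sort(v, descending=True)
    css = torch.cumsum(u, dim=0)
    k = torch.arange(1, len(v) + 1, dtype=v.dtype)
    rho = torch.nonzero(u + (1 - css) / k > 0).max()
    tau = (1 - css[rho]) / (rho + 1)
    return torch.clamp(v + tau, min=0)

x_t = torch.full((3,), 1.0 / 3.0)
g = torch.tensor([3.0, -1.0, 0.5])  # illustrative gradient at x_t
eta = 0.5

x_pgd = project_simplex(x_t - eta * g)                 # QP sub-problem
x_md = torch.softmax(torch.log(x_t) - eta * g, dim=0)  # closed form
print(x_pgd, x_md)  # both lie on the simplex
```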