Stochastic Gradient Descent¶
Stochastic Gradient Descent (SGD) is a popular optimization algorithm in machine learning and deep learning.
The SGD problem is to solve:

\[
\min_{x \in \mathbb{R}^d} F(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x),
\]

where \(n\) is the sample size, which is typically too large to compute the full gradient \(\nabla F(x)\) in one go.
Recalling the idea of randomized linear algebra, we can replace the full gradient with an unbiased estimator \(\nabla f_{\xi}(x)\), where \(\xi\) is uniformly sampled from \(\{1, 2, \cdots, n\}\).
This motivates the basic SGD algorithm:

\[
x_{t+1} = x_t - \eta_t g_t,
\]

where, in the rest of this note, \(g_t\) is used to denote any unbiased estimator of \(\nabla F(x_t)\), i.e.,

\[
\mathbb{E}[g_t \mid x_t] = \nabla F(x_t).
\]
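As a quick numerical sanity check of this unbiasedness (an illustrative sketch using a least-squares loss, not part of the pipeline below), the average of all per-sample gradients recovers the full gradient, so a uniformly sampled per-sample gradient is an unbiased estimator:

```python
import torch

torch.manual_seed(0)
# Least-squares objective: F(x) = (1/n) * sum_i (a_i^T x - b_i)^2
n, d = 100, 3
A = torch.randn(n, d)
b = torch.randn(n)
x = torch.randn(d)

# Full gradient: grad F(x) = (2/n) * A^T (A x - b)
full_grad = 2 / n * A.T @ (A @ x - b)

# Per-sample gradients: grad f_i(x) = 2 * (a_i^T x - b_i) * a_i, one row per sample
per_sample = 2 * (A @ x - b).unsqueeze(1) * A

# Their average equals the full gradient, so a uniformly sampled
# grad f_xi(x) satisfies E[grad f_xi(x)] = grad F(x)
print(torch.allclose(per_sample.mean(dim=0), full_grad, atol=1e-5))
```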
Mini-Batch Gradient Descent¶
Mini-batch gradient descent is a variant of SGD that uses a small batch of data points to compute the gradient, balancing the efficiency of SGD with the stability of full-batch gradient descent.
Mini-Batch Gradient Descent Algorithm
For a mini-batch \(\mathcal{B}_t \subset \{1, 2, \cdots, n\}\) sampled uniformly at random, update the parameters as follows:

\[
x_{t+1} = x_t - \frac{\eta_t}{|\mathcal{B}_t|} \sum_{i \in \mathcal{B}_t} \nabla f_i(x_t).
\]
Compared with single-sample SGD, mini-batch gradient descent reduces the variance of the gradient estimate, leading to more stable convergence.
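The variance reduction can be seen concretely in a small experiment (an illustrative sketch with a synthetic least-squares problem): the mean squared error of the mini-batch gradient around the full gradient shrinks roughly like \(1/|\mathcal{B}|\).

```python
import torch

torch.manual_seed(0)
n, d = 1000, 2
A = torch.randn(n, d)
b = torch.randn(n)
x = torch.randn(d)

# Per-sample gradients of f_i(x) = (a_i^T x - b_i)^2, shape (n, d)
grads = 2 * (A @ x - b).unsqueeze(1) * A
full_grad = grads.mean(dim=0)

def batch_grad_mse(batch_size, trials=2000):
    """Mean squared error of the mini-batch gradient vs. the full gradient."""
    errs = []
    for _ in range(trials):
        idx = torch.randint(0, n, (batch_size,))  # sample with replacement
        errs.append(((grads[idx].mean(dim=0) - full_grad) ** 2).sum())
    return torch.stack(errs).mean()

mse_1, mse_16 = batch_grad_mse(1), batch_grad_mse(16)
print(mse_1 / mse_16)  # roughly 16: variance scales like 1/|B|
```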
PyTorch Optimizer Pipeline¶
Mini-batch gradient descent can be implemented in PyTorch via the `DataLoader` class, which loads data from a dataset in batches. The stochastic gradient descent update itself is then provided by the `torch.optim.SGD` optimizer class.
PyTorch Dataset and DataLoader¶
In order to use the `DataLoader` class, you first need to create a dataset class that is a subclass of `torch.utils.data.Dataset`. For this purpose, you need to implement the following methods in your dataset class:

- `__len__`: Return the length of the dataset.
- `__getitem__`: Return the item at the given index.
Here is an example that generates a dataset from the linear model \(y = 3x + 2 + \epsilon\) with Gaussian noise \(\epsilon \sim \mathcal{N}(0, 2^2)\).
```python
import torch
from torch.utils.data import Dataset

class LinearModelDataset(Dataset):
    def __init__(self, num_samples=100):
        self.x = torch.randn(num_samples, 1)
        self.y = 3 * self.x + 2 + torch.randn(self.x.size()) * 2

    def __len__(self):  # Return the length of the dataset
        return len(self.x)

    def __getitem__(self, idx):  # Return the item at the given index
        # Add any preprocessing here if needed
        return self.x[idx], self.y[idx]
```
Then you can use `torch.utils.data.DataLoader` to load the dataset for mini-batch gradient descent by setting the `batch_size` and `shuffle` parameters.
```python
from torch.utils.data import DataLoader

# Define the dataset
dataset = LinearModelDataset(128)
# Define the DataLoader
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
```
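For this simple case, the same dataset can equivalently be built with the built-in `torch.utils.data.TensorDataset` (an alternative sketch, not required for the custom class above); iterating over the dataloader then yields mini-batches whose shapes we can check:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
x = torch.randn(128, 1)
y = 3 * x + 2 + torch.randn(128, 1) * 2
dataset = TensorDataset(x, y)  # built-in alternative to a custom Dataset subclass

dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
x_batch, y_batch = next(iter(dataloader))
print(x_batch.shape, y_batch.shape)  # torch.Size([16, 1]) torch.Size([16, 1])
```

With 128 samples and `batch_size=16`, the loader produces 8 mini-batches per epoch; `shuffle=True` re-permutes the samples at the start of each epoch.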
PyTorch Model Class¶
Then you can define the model class as a subclass of `torch.nn.Module`. You need to implement the following methods in your model class:

- `__init__`: Define and initialize the parameters of the model. In this method, you need to call `super().__init__()` to invoke the constructor of the parent class.
- `forward(self, x)`: Define the output of the model with respect to the input `x`.
```python
import torch
import torch.nn as nn

# Define the model class. Here we use a simple linear regression model as an example.
class LinearRegressionModel(nn.Module):
    def __init__(self, input_dim, output_dim):
        """
        This is used to define and initialize the parameters of the model.
        All parameters in the model should be nn.Parameter.
        Here we define a linear regression model with a weight and a bias.
        We will discuss other types of model classes in the next chapter on deep learning.
        """
        super().__init__()  # Call the constructor of the parent class
        # Initialize weight and bias as learnable parameters
        self.weight = nn.Parameter(torch.randn(output_dim, input_dim))
        self.bias = nn.Parameter(torch.randn(output_dim))

    def forward(self, x):
        """
        This is used to define the mathematics of the model.
        """
        return x @ self.weight.T + self.bias
```
PyTorch Optimizer¶
PyTorch provides a lot of optimizers in `torch.optim`. For mini-batch gradient descent, you can use `torch.optim.SGD` with the following parameters:

- `params`: Pass all the trainable parameters from your model to the optimizer.
- `lr`: Set the learning rate (step size).
You can then set up the optimizer for the model with the following code:

```python
model = LinearRegressionModel(1, 1)  # Define the 1d linear regression model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```

Here `model.parameters()` returns all the trainable parameters of the model.
Now we are ready to assemble all the components to implement mini-batch gradient descent:

```python
# Define the loss function, e.g., MSELoss
loss_fn = nn.MSELoss()

num_epochs = 100  # Number of passes over the full dataset

# Define the training loop
for epoch in range(num_epochs):
    for x_batch, y_batch in dataloader:
        # Forward pass
        outputs = model(x_batch)          # Compute the output of the model
        loss = loss_fn(outputs, y_batch)  # Compute the loss
        # Backward pass
        optimizer.zero_grad()  # Zero the gradients of the model parameters
        loss.backward()        # Compute the gradient of the loss w.r.t. the model parameters
        optimizer.step()       # Update the model parameters: x_{t+1} = x_t - lr * g_t
```
The core of the code above is the three steps of mini-batch gradient descent:

- `optimizer.zero_grad()`: Zero the gradients of the model parameters.
- `loss.backward()`: Compute the gradient of the loss with respect to the model parameters.
- `optimizer.step()`: Update the model parameters.

Most of the time, you can simply use these three lines to apply any optimizer from `torch.optim`.
We can also write the mini-batch gradient descent update ourselves instead of calling `optimizer.step()`:

```python
lr = 0.01  # Set the learning rate

# 1. Manually zero the gradients
for param in model.parameters():
    if param.grad is not None:
        param.grad.zero_()

# 2. Compute gradients (assumes `loss` has been computed by a forward pass)
loss.backward()

# 3. Update parameters using the explicit SGD rule
with torch.no_grad():  # Prevent tracking history for parameter updates
    for param in model.parameters():
        if param.grad is not None:
            # The core SGD update: x_{t+1} = x_t - lr * g_t
            param -= lr * param.grad
```
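As a sanity check (an illustrative sketch using the built-in `nn.Linear` as a stand-in model), the manual rule produces exactly the same parameters as one `torch.optim.SGD` step:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lr = 0.01
x = torch.randn(8, 1)
y = 3 * x + 2

# Two copies of the same model
m1 = nn.Linear(1, 1)
m2 = nn.Linear(1, 1)
m2.load_state_dict(m1.state_dict())
loss_fn = nn.MSELoss()

# Update m1 with torch.optim.SGD
opt = torch.optim.SGD(m1.parameters(), lr=lr)
opt.zero_grad()
loss_fn(m1(x), y).backward()
opt.step()

# Update m2 with the manual rule
loss_fn(m2(x), y).backward()
with torch.no_grad():
    for param in m2.parameters():
        param -= lr * param.grad

same = all(torch.allclose(p1, p2) for p1, p2 in zip(m1.parameters(), m2.parameters()))
print(same)  # True: optimizer.step() applies the same update
```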
Adaptive Gradient Descent¶
Adaptive gradient descent algorithms adjust the learning rate for each parameter individually, allowing for more efficient optimization. If the partial derivative \(\partial f(x) / \partial x_j\) has been small, the \(j\)-th entry \(x_{t,j}\) has barely been updated, so we tend to take a larger step along that coordinate. On the other hand, if the partial derivative \(\partial f(x) / \partial x_j\) has been large, \(x_{t,j}\) has already been updated significantly, so we tend to take a smaller step. The idea of adaptive gradient descent is to measure the magnitude of the historical partial derivatives and scale the per-coordinate learning rate accordingly.
Adagrad¶
Adagrad adapts the learning rate based on the accumulated squares of the historical gradients. Let \(x_{t,j}\) be the \(j\)-th component of \(x_t\); then Adagrad updates each entry of \(x_t\) as follows:

\[
x_{t+1,j} = x_{t,j} - \frac{\eta}{\sqrt{G_{t,j}} + \epsilon} g_{t,j}, \qquad G_{t,j} = \sum_{s=1}^{t} g_{s,j}^2,
\]

for all \(j=1,2,\cdots,d\), where \(G_{t,j}\) is the sum of the squares of past gradients, and \(\epsilon\) is a small constant to prevent division by zero. For notational simplicity, we will write the above Adagrad update as:

\[
G_t = G_{t-1} + g_t^2, \qquad x_{t+1} = x_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t,
\]

where \(g_t^2\) is the entry-wise square of \(g_t\), and the square root and division are also entry-wise.
Adagrad can be implemented in a few lines of PyTorch.
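For instance, here is a minimal sketch of the Adagrad update applied to a toy quadratic (PyTorch also ships a built-in implementation as `torch.optim.Adagrad`):

```python
import torch

def adagrad_step(x, g, G, lr=0.5, eps=1e-10):
    """One Adagrad step: G_t = G_{t-1} + g_t^2, x_{t+1} = x_t - lr/(sqrt(G_t)+eps) * g_t."""
    G = G + g ** 2                     # accumulate entry-wise squared gradients
    x = x - lr / (G.sqrt() + eps) * g  # per-coordinate effective step size lr / sqrt(G_t)
    return x, G

# Minimize F(x) = ||x||^2 / 2, whose gradient at x is x itself
x = torch.tensor([1.0, -2.0])
G = torch.zeros_like(x)
for _ in range(300):
    x, G = adagrad_step(x, g=x, G=G)
print(x)  # both coordinates have shrunk toward 0
```

Note that the coordinate with the larger gradient history (\(x_2\)) accumulates a larger \(G\) and therefore takes proportionally smaller steps, which is exactly the adaptive behavior described above. The equivalent built-in usage is `optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)`.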
RMSprop¶
RMSprop modifies Adagrad by introducing a decay factor to control the accumulation of past gradients:

\[
G_t = \gamma G_{t-1} + (1 - \gamma) g_t^2, \qquad x_{t+1} = x_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t,
\]

where \(\gamma\) is the decay rate.
RMSprop can be implemented in a few lines of PyTorch.
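For instance, a minimal sketch of the RMSprop update on the same toy quadratic (the built-in implementation is `torch.optim.RMSprop`, whose `alpha` parameter plays the role of \(\gamma\)):

```python
import torch

def rmsprop_step(x, g, G, lr=0.01, gamma=0.9, eps=1e-8):
    """One RMSprop step: G_t = gamma * G_{t-1} + (1 - gamma) * g_t^2."""
    G = gamma * G + (1 - gamma) * g ** 2  # exponentially decaying accumulation
    x = x - lr / (G.sqrt() + eps) * g
    return x, G

# Minimize F(x) = ||x||^2 / 2 (gradient at x is x itself)
x = torch.tensor([1.0, -2.0])
G = torch.zeros_like(x)
for _ in range(500):
    x, G = rmsprop_step(x, g=x, G=G)
print(x)  # near 0, oscillating with magnitude on the order of lr
```

Because \(\sqrt{G_t}\) tracks only the recent gradient magnitudes, the effective step stays close to `lr` rather than decaying to zero as in Adagrad, so the iterates hover around the minimum instead of stalling.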
Adam¶
Adam (Adaptive Moment Estimation) combines the ideas of momentum and RMSprop, maintaining an exponentially decaying average of past gradients and squared gradients.
Adam Algorithm

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t, \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \\
x_{t+1} &= x_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \odot \hat{m}_t,
\end{aligned}
\]

where \(m_t\) and \(v_t\) are the exponentially decaying averages of past gradients and squared gradients (initialized to zero), \(\beta_1, \beta_2 \in [0, 1)\) are their decay rates, and \(\hat{m}_t, \hat{v}_t\) are the bias-corrected estimates.
Adam is widely used due to its robustness and efficiency in training deep neural networks. It can be implemented in a few lines of PyTorch.
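For instance, a minimal sketch of the Adam update on a toy quadratic (in practice you would simply use the built-in `torch.optim.Adam(model.parameters(), lr=0.001)`):

```python
import torch

def adam_step(x, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias correction (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g       # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2  # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    x = x - lr / (v_hat.sqrt() + eps) * m_hat
    return x, m, v

# Minimize F(x) = ||x||^2 / 2 (gradient at x is x itself)
x = torch.tensor([1.0, -2.0])
m = torch.zeros_like(x)
v = torch.zeros_like(x)
for t in range(1, 2001):
    x, m, v = adam_step(x, g=x, m=m, v=v, t=t)
print(x)  # both coordinates end up near 0
```

The bias correction matters early on: since \(m_0 = v_0 = 0\), the raw averages underestimate the moments for small \(t\), and dividing by \(1 - \beta^t\) compensates for that.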