Exploring The Mathematics Of Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a powerful optimization algorithm used throughout machine learning to minimize objective functions. It is the workhorse of training neural networks and other large-scale models, and a key building block of modern deep learning. This post provides an overview of the mathematics behind SGD and how it works in practice.

At its core, SGD revolves around an optimization objective of the form:

$$\min_{\theta} \; J(\theta)$$

where $\theta$ is the set of model parameters (a vector of weights and biases) and $J(\theta)$ is the objective (or cost) function. The goal of SGD is to find the vector $\theta^*$ that minimizes the objective.
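
In machine learning, $J(\theta)$ is typically an average of per-example losses over a dataset of $n$ points, for instance:

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i; \theta),\, y_i\big)$$

where $f(x_i; \theta)$ is the model's prediction for input $x_i$, $y_i$ is the corresponding target, and $L$ is a per-example loss. This structure is what makes a "stochastic" gradient possible: the gradient of a single term, or of a small batch of terms, is an unbiased estimate of the full gradient.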

At each iteration $t$, SGD updates the parameter vector using:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta J(\theta_t)$$

where $\eta$ is the learning rate and $\nabla_\theta J(\theta_t)$ is the gradient of the objective at iteration $t$. Because the objective depends on $\theta$ through the model output $f$, this gradient is computed using the chain rule:

$$\nabla_\theta J = \frac{\partial J}{\partial f} \cdot \frac{\partial f}{\partial \theta}$$
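
As a minimal sketch of this update rule in NumPy (the names sgd_step and grad_J below are illustrative, not from any particular library):

import numpy as np

def sgd_step(theta, grad_J, learning_rate):
    # One SGD update: theta_{t+1} = theta_t - eta * grad J(theta_t)
    return theta - learning_rate * grad_J(theta)

# Toy example: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
for t in range(100):
    theta = sgd_step(theta, lambda th: 2.0 * th, learning_rate=0.1)
print(theta)  # approaches [0, 0]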

Repeated application of this update drives the parameters toward a minimum of the objective. In practice, it is usually preferable to estimate the gradient on a mini-batch of data points (i.e., a small subset of the dataset) at each iteration: this keeps the cost per step low while reducing the variance of the gradient estimate compared to using a single point. Additionally, SGD can be combined with various forms of regularization, such as weight decay (an L2 penalty on the weights), which helps alleviate overfitting. Both ideas are sketched below.
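
Here is a minimal sketch of mini-batch SGD with weight decay, assuming a user-supplied grad_fn that returns the gradient of the loss on a batch (the function and argument names are illustrative):

import numpy as np

def minibatch_sgd(theta, grad_fn, X, y, learning_rate=0.01,
                  batch_size=32, weight_decay=0.0, n_epochs=10):
    # Mini-batch SGD with optional L2 regularization (weight decay).
    n = len(X)
    for epoch in range(n_epochs):
        perm = np.random.permutation(n)            # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]   # indices of this mini-batch
            grad = grad_fn(theta, X[idx], y[idx])  # gradient of the loss on the batch
            grad = grad + weight_decay * theta     # gradient of the L2 penalty
            theta = theta - learning_rate * grad   # SGD update
    return theta

With batch_size=1 this reduces to plain per-sample SGD, and with batch_size=n it becomes ordinary (full-batch) gradient descent.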

To illustrate the use of SGD, let's look at a simple linear regression model. The objective function (mean squared error) can be written as:

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - x_i^\top \theta\big)^2$$

where $y_i$ is the target value and $x_i$ is the input vector for data point $i$. Applying the chain rule, the gradient of the objective is:

$$\nabla_\theta J(\theta) = -\frac{2}{n} \sum_{i=1}^{n} x_i \big(y_i - x_i^\top \theta\big)$$

We can then use SGD to update the parameters of the linear regression model, using the gradient of a single data point at each step:

import numpy as np

# x_train: (n_samples, n_features) array of input data points
# y_train: (n_samples,) array of target values
# theta:   (n_features,) parameter vector
# learning_rate: the learning rate (eta)
n_samples = len(x_train)  # number of data points
for t in range(n_samples):
    # per-sample estimate of the gradient at iteration t
    residual = y_train[t] - np.dot(x_train[t], theta)
    gradient = -2 * x_train[t] * residual
    # step against the gradient
    theta = theta - learning_rate * gradient
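
To sanity-check the loop, here is a small synthetic-data example (the data and the true parameter values are made up purely for illustration):

import numpy as np

np.random.seed(0)
n_samples, n_features = 200, 3
true_theta = np.array([1.5, -2.0, 0.5])
x_train = np.random.randn(n_samples, n_features)
y_train = x_train @ true_theta + 0.1 * np.random.randn(n_samples)

theta = np.zeros(n_features)
learning_rate = 0.01
for epoch in range(10):                     # several passes over the data
    for t in range(n_samples):
        residual = y_train[t] - np.dot(x_train[t], theta)
        theta = theta - learning_rate * (-2 * x_train[t] * residual)

print(theta)  # should end up close to true_theta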

SGD is a simple yet remarkably effective algorithm for a wide variety of optimization objectives. It is fast, scales to large datasets, and easily accommodates more complex loss functions and regularization techniques. As with any iterative method, though, the learning rate must be chosen with care: too large a step can cause the updates to oscillate, diverge, or settle on a poor solution, while too small a step makes convergence painfully slow.
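
One common way to manage this trade-off is to decay the learning rate over the course of training; a minimal sketch using an inverse-time schedule (the schedule and its constants are illustrative choices, not the only option):

def inverse_time_lr(eta0, t, decay=0.01):
    # Inverse-time decay: eta_t = eta0 / (1 + decay * t)
    return eta0 / (1 + decay * t)

# Inside the SGD loop, use the decayed rate instead of a fixed one, e.g.:
#   theta = theta - inverse_time_lr(0.1, t) * gradient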