
Machine Learning Optimization - Gradient Descent

Table of Contents

0: Introduction

1: Crash Course: Gradient Descent

2: Stochastic Gradient Descent

3: Variance Reduction Techniques

4: SARAH and ADAM Algorithms

Crash Course: Gradient Descent

This section provides some entry-level background on machine learning concepts, so feel free to skip it if you already have a grasp of them.

The goal in machine learning optimization is to train the parameters of whatever model we choose to use, which will ultimately depend on the application. We will use the simple but very commonly used linear model as an example.

Example: Least Squares Regression

We are interested in training a model using $n$ data samples with $p$ features and known outputs for each data sample. The model we will use as an example is a simple linear system.

$$\begin{aligned} A\boldsymbol{\theta}=\boldsymbol{b} \end{aligned}$$

Matrix $A \in \mathbb{R}^{n \times p}$ represents our data, and $\boldsymbol{b} \in \mathbb{R}^{n \times 1}$ is the response vector with entries $b_i$ corresponding to the known output of each row of data $a_i \in A$, $i \in 1,...,n$. Our model uses a vector of parameters $\theta_j \in \boldsymbol{\theta}$, $j \in 1,...,p$, where $\boldsymbol{\theta} \in \mathbb{R}^{p \times 1}$. This is a linear model, where the product $A\boldsymbol{\theta}$ should ideally produce exactly $\boldsymbol{b}$.

In a typical application, however, this system is overdetermined ($n>p$, i.e. more data samples than there are features that describe the data; column rank$(A)=p$), and is inconsistent ($A\boldsymbol{\theta}=\boldsymbol{b}$ has no solution).

Since we cannot find an exact solution to the model, we want to optimize each of our parameters $\theta_j$ such that the product $A\boldsymbol{\theta}$ produces the best approximate solution to the system. In other words, we want to train the model to predict the known outputs as well as it can by adjusting each of the $\theta_j \in \boldsymbol{\theta}$.

To do so, we aim to minimize a quantity we call the "loss" of the model: a measure of error between the predictions $A\boldsymbol{\theta}$ and the actual label values $\boldsymbol{b}$, expressed as a function of how the $\theta_j \in \boldsymbol{\theta}$ are selected (or in our case, how we optimize them).

"Loss" will be quantified as a function of the θjθ\theta_j \in \boldsymbol{\theta} expressed as f(θ)f(\boldsymbol{\theta}), also sometimes called the "objective function". So, the "loss" will depend on the parameters θ\boldsymbol{\theta} that we choose.

A common loss function $f$ that we see in machine learning is a simple least-squares regression fit, where $f(\boldsymbol{\theta})$ is the following.

$$\begin{aligned} f(\boldsymbol{\theta}) = ||A\boldsymbol{\theta}-\boldsymbol{b}||^{2} \end{aligned}$$

Here, $A\boldsymbol{\theta}$ generates a vector of predictions, and the loss function $f$ calculates the squared Euclidean distance between the prediction vector $A\boldsymbol{\theta}$ and the true label vector $\boldsymbol{b}$.

Our optimized model with our trained parameters satisfies the following.

$$\begin{aligned} \boldsymbol{\theta}^{*}\in\underset{\boldsymbol{\theta}}{\text{arg min}}\ ||A\boldsymbol{\theta}-\boldsymbol{b}||^{2} \end{aligned}$$
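To make this concrete, here is a minimal NumPy sketch (the toy matrix, seed, and variable names are my own, not from the post) that evaluates this loss for a candidate $\boldsymbol{\theta}$ and compares it against the exact least-squares minimizer.

```python
import numpy as np

def least_squares_loss(A, b, theta):
    """Squared Euclidean distance between predictions A @ theta and labels b."""
    residual = A @ theta - b
    return residual @ residual  # same as np.linalg.norm(residual) ** 2

# Hypothetical toy system: n = 5 samples, p = 2 features (overdetermined).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))
b = rng.normal(size=5)

theta_guess = np.zeros(2)
print("loss at theta = 0:", least_squares_loss(A, b, theta_guess))

# For reference, the minimizer of ||A theta - b||^2 can be computed directly.
theta_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print("loss at least-squares solution:", least_squares_loss(A, b, theta_star))
```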

Gradient Descent in Machine Learning

The gradient descent algorithm is well suited to this kind of optimization problem. It is an iterative algorithm that updates the parameters at each step, changing each parameter by some amount in such a way that the "loss" decreases.

We must note that the least squares objective function $f$ is convex, from which it follows that $f$ must have a global minimum (it must reach a "smallest" value). Not all problems in machine learning use convex loss functions, but many common applications do in practice.

NOTE

This post will assume, unless otherwise stated, that the objective function $f$ is convex (i.e. we will be covering convex optimization).

Common examples of convex objective functions include logistic regression, data fitting, SVMs, Lasso and Group Lasso. Convexity is detailed more formally in the appendix.

In basic terms, the gradient descent algorithm uses the gradient of the loss function $f(\boldsymbol{\theta})$ (denoted $\nabla f(\boldsymbol{\theta})$), i.e. the vector of partial derivatives $\frac{\partial f}{\partial \theta_j}$ for each component, evaluated at the current parameter values $\theta_j$, to determine the "steepest" direction in which the loss is increasing. It's worth taking a look in the appendix at the details of the gradient operation if you are not familiar.

The negative gradient will indicate the steepest direction in which our loss function $f(\boldsymbol{\theta})$ is decreasing, and if we move the values of the $\theta_j$s in that direction, we will decrease the loss $f(\boldsymbol{\theta})$. Ultimately, we intend to optimize $f(\boldsymbol{\theta})$ by minimizing it, so we hope to adjust the $\theta_j$s until our loss is minimized ("minimized" is usually defined by criteria we select).

So, if we start with some initial parameters $\boldsymbol{\theta}_0$, we can subtract $\nabla f(\boldsymbol{\theta}_0)$ to "step" the parameter tuning in the direction that will minimize the loss function. However, we need to control the size of the steps that we take, so we include a parameter $\alpha$ that we can tune on our own (called a "hyperparameter") to scale $\nabla f$ and control the size of the step we take in that direction. The parameter $\alpha$ is commonly referred to as the "learning rate".

Since we likely will not arrive at the solution after one step, we can take the resulting weights $\boldsymbol{\theta}_1$ and evaluate $\nabla f(\boldsymbol{\theta}_1)$ to take another step. We continue onward by recursively plugging parameter values back into the algorithm, and in general after $t$ steps, we have the gradient descent algorithm.

$$\begin{aligned} \theta_{t+1}=\theta_{t}-\alpha \nabla f(\theta_{t}) \end{aligned}$$

The algorithm iterates over discrete timesteps $t$, stepping toward the minimum of $f(\theta_t)$ with a step size proportional to the learning rate $\alpha$.
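As a sketch of what this update rule looks like in code (the function names and the $\frac{1}{n}$ scaling convention are assumptions on my part, not from the post):

```python
import numpy as np

def gradient_descent(grad_f, theta0, alpha, num_steps):
    """Iterate theta_{t+1} = theta_t - alpha * grad_f(theta_t)."""
    theta = theta0.copy()
    for _ in range(num_steps):
        theta = theta - alpha * grad_f(theta)
    return theta

# Example gradient for f(theta) = (1/n) * ||A theta - b||^2.
# (The 1/n averaging is one common convention; drop it to match ||A theta - b||^2 exactly.)
def make_least_squares_grad(A, b):
    n = A.shape[0]
    return lambda theta: (2.0 / n) * A.T @ (A @ theta - b)
```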

Now, picking up from the previous section, we will use a simple least squares regression of some data we generate as an example. We will generate 100 random points $x_i\in X$, $i \in 1,...,100$, where $X \sim \text{Unif}(0,2)$, and assign labels to these points where $y_i = 2x_i + 4 + z_i$, where $z_i \in Z \sim N(0,1)$ is a standard Gaussian noise term to make our $y_i$ values noisy. We only want to fit a first-order polynomial line to the data in this example, so the only two "features" that our parameters will fit are slope and intercept, $\boldsymbol\theta = \left[\begin{array}{c} \theta_{slope}\\ \theta_{intercept} \end{array}\right]$.

We then initialize a slope and intercept $\boldsymbol{\theta}_0 = \left[\begin{array}{c} \theta_{slope,0}\\ \theta_{intercept,0} \end{array}\right]$ (in this example just $\left[\begin{array}{c} 0\\ 0 \end{array}\right]$), select a learning rate $\alpha=0.1$ (we will discuss how to more carefully select this hyperparameter later), and perform the gradient descent algorithm for $t=100$ steps (iterations). Below we animate the process, where the top plot is the line being fitted, the middle plot is the value of our loss gradually being minimized, and the last plot shows the values of the slope and intercept.

GD Least Squares fit

Since we know how we generated the data, we know that the true optimum is $\boldsymbol\theta^* = \left[\begin{array}{c} 2\\ 4 \end{array}\right]$, which we can see visually in the first plot.
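For reference, a minimal NumPy sketch of this experiment might look like the following; the random seed is arbitrary, and I track the mean of the squared residuals as the loss so that it plateaus near the noise variance, as in the middle plot.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Generate 100 points x_i ~ Unif(0, 2) and noisy labels y_i = 2*x_i + 4 + z_i, z_i ~ N(0, 1).
n = 100
x = rng.uniform(0.0, 2.0, size=n)
y = 2.0 * x + 4.0 + rng.normal(size=n)

# Design matrix: one column for the slope feature, one for the intercept.
A = np.column_stack([x, np.ones(n)])
theta = np.zeros(2)          # [slope, intercept] initialized to zero
alpha = 0.1                  # learning rate
loss_history = []

for t in range(100):
    residual = A @ theta - y
    loss_history.append(residual @ residual / n)   # mean squared error
    grad = (2.0 / n) * A.T @ residual
    theta = theta - alpha * grad

print("fitted [slope, intercept]:", theta)   # should approach [2, 4]
print("final loss:", loss_history[-1])       # levels off near the noise variance (about 1)
```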

There are a few observations we can note here.

  • The parameters are converging to the optimal solution, taking large steps early on but slowing down as they approach the optimum.
  • Loss also falls steeply initially, but after around 10 steps it levels off around 1 (the variance of the noise in the data itself).
  • 100 steps is not enough to get all the way to the true optimal solution.

In short, we are seeing the parameters converge toward the optimal solution as the algorithm iterates, but it doesn't get us quite to the exact solution underlying the noise. When training a machine learning model, this isn't necessarily a bad thing. In fact, models that are not "overfit" to training data, i.e. fit "too closely" to the training data, often generalize better to unseen data than models that match the training data too closely. We may only care that the model weights we use are within some guaranteed tolerance of the true solution (an example is included in the third plot), which we typically express in convergence rates in terms of a tolerance $\epsilon$. We will see $\epsilon$-accuracy convergence properties, or convergence properties expressed in terms of total iterations $T$, for the algorithms we cover in this post.
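To make the tolerance idea concrete, here is one possible sketch (the function name and the gradient-norm criterion are my own choices, not from the post) of a stopping rule built into the descent loop:

```python
import numpy as np

def gradient_descent_with_tolerance(grad_f, theta0, alpha, epsilon, max_steps=10_000):
    """Run gradient descent until the gradient norm drops below epsilon
    (one common stopping criterion) or the step budget is exhausted."""
    theta = theta0.copy()
    for t in range(max_steps):
        g = grad_f(theta)
        if np.linalg.norm(g) < epsilon:
            return theta, t          # converged to the requested tolerance
        theta = theta - alpha * g
    return theta, max_steps          # budget exhausted before reaching tolerance
```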

Limitations of Regular Gradient Descent

In a theoretical sense, regular gradient descent is robust: it is guaranteed to converge for convex objective functions, and it guarantees we are within a tolerance of the true solution (or, at least, we will know how close we are after some number of iterations). However, in practice it has limitations.

Regular gradient descent works well for a very simple example like this (small sample size, only two features), but in practical machine learning applications with large datasets, it has glaring computational deficiencies.

We turn our attention to the gradient calculation itself, and consider a gradient calculation at timestep $t$, $\nabla f(\theta_t)$ for $\theta_t \in \mathbb{R}^{p \times 1}$, for a dataset with $p$ features and $n$ samples. At each timestep, we are computing the full gradient. It is helpful to write out what the gradient calculation really looks like for our least squares loss example.

$$\begin{aligned} \nabla f (\boldsymbol{\theta_t}) = \frac{1}{n}\sum_{i=1}^{n}{\nabla f_i(\boldsymbol{\theta_t})} \end{aligned}$$

$$\begin{aligned} \nabla f_i (\boldsymbol{\theta_t}) = \frac{d}{d\boldsymbol{\theta}} \left(||a_i\boldsymbol{\theta_t}-b_i||^{2}\right) \end{aligned}$$

See appendix for more details on full and stochastic gradient calculation.
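As an illustration of why the full gradient costs on the order of $n \cdot p$ work per step, here is a sketch of the same quantity computed two ways, once as an explicit average of per-sample gradients and once vectorized (the $\frac{1}{n}$ scaling follows the sum above; the function names are mine):

```python
import numpy as np

def full_gradient_loop(A, b, theta):
    """Average of per-sample gradients: (1/n) * sum_i grad f_i(theta),
    where f_i(theta) = (a_i . theta - b_i)^2."""
    n, p = A.shape
    grad = np.zeros(p)
    for i in range(n):                      # n per-sample gradients ...
        residual_i = A[i] @ theta - b[i]
        grad += 2.0 * residual_i * A[i]     # ... each costing O(p) work
    return grad / n

def full_gradient_vectorized(A, b, theta):
    """Same quantity computed with matrix operations (still O(n * p) per call)."""
    n = A.shape[0]
    return (2.0 / n) * A.T @ (A @ theta - b)
```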

In terms of complexity, each step requires on the order of $p \cdot n$ calculations. It can be shown that, since $f$ is convex and has a Lipschitz continuous gradient, for $\alpha$ selected carefully using what is called the Lipschitz constant $L$ (you can learn more about Lipschitz continuity here or in the appendix), after $T$ iterations the following will hold.

$$\begin{aligned} f(\theta_T)-f(\theta^*) \leq \frac{L}{2T} ||\theta_0 - \theta^*||^2 \text{ for } T \geq 1 \end{aligned}$$

Where $\theta^*$ is the true optimal solution ($\boldsymbol\theta^* = \left[\begin{array}{c} 2\\ 4 \end{array}\right]$ from our earlier example), and $\theta_0$ is our initialization ($\boldsymbol{\theta}_0 = \left[\begin{array}{c} \theta_{slope,0}\\ \theta_{intercept,0} \end{array}\right] = \left[\begin{array}{c} 0\\ 0 \end{array}\right]$, also from our earlier example).

We say that this converges at a rate $\mathcal{O}(\frac{1}{T})$. In other words, we can choose the number of gradient descent iterations $T$ and know for certain that the gap between the loss at our current parameters and the loss at the optimal parameters is bounded above by a quantity proportional to $\frac{1}{T}$ (more on "big O notation" here).
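As a rough sketch of how this bound could be evaluated for our toy problem (assuming the averaged loss $f(\boldsymbol{\theta}) = \frac{1}{n}||A\boldsymbol{\theta}-\boldsymbol{b}||^2$, whose Hessian is $\frac{2}{n}A^TA$; the exact constant depends on how the loss is scaled, and the function names are mine):

```python
import numpy as np

def lipschitz_constant(A):
    """Smoothness constant of f(theta) = (1/n) * ||A theta - b||^2.
    The Hessian is (2/n) * A^T A, so L is its largest eigenvalue."""
    n = A.shape[0]
    return (2.0 / n) * np.linalg.eigvalsh(A.T @ A).max()

def suboptimality_bound(L, theta0, theta_star, T):
    """Upper bound f(theta_T) - f(theta*) <= L * ||theta_0 - theta*||^2 / (2T)."""
    return L * np.linalg.norm(theta0 - theta_star) ** 2 / (2 * T)

# For the earlier example (theta_0 = [0, 0], theta* = [2, 4]), the bound shrinks like 1/T:
# suboptimality_bound(lipschitz_constant(A), np.zeros(2), np.array([2.0, 4.0]), T=100)
```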

This means that we can guarantee we are within some tolerance of the true optimal solution after $T$ iterations. But, if we are calculating $\nabla f (\boldsymbol{\theta_t})$ at each of those $T$ iterations, and our dataset is large, it can become computationally prohibitive to get a solution within a desirable tolerance.

In machine learning optimization, we are always seeking faster ways to train our models. When approaching an optimization problem we might ask ourselves questions like,

  • How can I optimize my parameters (train the model) in the fewest steps?
  • How can I improve the computational cost of my steps?
  • When can I terminate the algorithm, and what are my criteria for stopping?

These are the kinds of questions that might motivate a search for more efficient algorithms. A common approach to addressing them is to use approximations, and stochasticity turns out to have some desirable theoretical properties that help us here.

Next: Stochastic Gradient Descent