
Machine Learning Optimization - Gradient Descent

Table of Contents

0: Introduction

1: Crash Course: Gradient Descent

2: Stochastic Gradient Descent

3: Variance Reduction Techniques

4: SARAH and ADAM Algorithms

Crash Course: Gradient Descent

This section provides some entry-level background on machine learning concepts, so feel free to skip it if you already have a grasp of them.

The goal in machine learning optimization is to train the parameters of whatever model we choose to use, which will ultimately depend on the application. We will use the simple but very commonly used linear model as an example.

Example: Least Squares Regression

We are interested in training a model using $n$ data samples with $p$ features and known outputs for each data sample. The model we will use as an example is a simple linear system.

$$\begin{aligned} A\boldsymbol{\theta}=\boldsymbol{b} \end{aligned}$$

Matrix $A \in \mathbb{R}^{n \times p}$ represents our data, and $\boldsymbol{b} \in \mathbb{R}^{n \times 1}$ is the response vector with entries $b_i$ corresponding to the known output of each row of data $a_i \in A$, $i \in 1,...,n$. Our model uses a vector of parameters $\theta_j \in \boldsymbol{\theta}$, $j \in 1,...,p$, where $\boldsymbol{\theta} \in \mathbb{R}^{p \times 1}$. This is a linear model, where the product $A\boldsymbol{\theta}$ should ideally produce exactly $\boldsymbol{b}$.

In a typical application, however, this system is overdetermined ($n>p$, i.e. more data samples than there are features that describe the data; column rank$(A)=p$), and is inconsistent ($A\boldsymbol{\theta}=\boldsymbol{b}$ has no solution).

Since we cannot find an exact solution to the model, we want to optimize each of our parameters $\theta_j$ such that the product $A\boldsymbol{\theta}$ produces the best approximate solution to the system. In other words, we want to train the model to predict the known outputs as well as it can by adjusting each of the $\theta_j \in \boldsymbol{\theta}$.

To do so, we aim to minimize a quantity we call the "loss" of the model: a measure of error between the predictions $A\boldsymbol{\theta}$ and the actual label values $\boldsymbol{b}$, expressed as a function of how the $\theta_j \in \boldsymbol{\theta}$ are selected (or in our case, how we optimize them).

"Loss" will be quantified as a function of the θjθ\theta_j \in \boldsymbol{\theta} expressed as f(θ)f(\boldsymbol{\theta}), also sometimes called the "objective function". So, the "loss" will depend on the parameters θ\boldsymbol{\theta} that we choose.

A common loss function $f$ that we see in machine learning is a simple least-squares regression fit, where $f(\boldsymbol{\theta})$ is the following.

$$\begin{aligned} f(\boldsymbol{\theta}) = ||A\boldsymbol{\theta}-\boldsymbol{b}||^{2} \end{aligned}$$

Here, $A\boldsymbol{\theta}$ generates a vector of predictions, and the loss function $f$ calculates the squared Euclidean distance between the prediction vector $A\boldsymbol{\theta}$ and the true label vector $\boldsymbol{b}$.

Our optimized model with our trained parameters satisfies the following.

$$\begin{aligned} \boldsymbol{\theta}^{*}\in\underset{\boldsymbol{\theta}}{\text{arg min}}\ ||A\boldsymbol{\theta}-\boldsymbol{b}||^{2} \end{aligned}$$
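To make this concrete, here is a minimal NumPy sketch (the toy matrix, seed, and variable names are my own, not from the post) that evaluates this loss for a candidate $\boldsymbol{\theta}$ and compares it against the exact least-squares minimizer.

```python
import numpy as np

def least_squares_loss(A, b, theta):
    """Squared Euclidean distance between predictions A @ theta and labels b."""
    residual = A @ theta - b
    return residual @ residual  # same as np.linalg.norm(residual) ** 2

# Hypothetical toy system: n = 5 samples, p = 2 features (overdetermined).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))
b = rng.normal(size=5)

theta_guess = np.zeros(2)
print("loss at theta = 0:", least_squares_loss(A, b, theta_guess))

# For reference, the minimizer of ||A theta - b||^2 can be computed directly.
theta_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print("loss at least-squares solution:", least_squares_loss(A, b, theta_star))
```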

Gradient Descent in Machine Learning

The gradient descent algorithm is well suited to this kind of optimization problem. It is an iterative algorithm that updates the parameters at each step, changing each parameter by some amount in such a way that the "loss" decreases.

We must note that the least squares objective function $f$ is convex, from which it follows that $f$ must have a global minimum (it must reach a "smallest" value). Not all problems in machine learning use convex loss functions, but many common applications do in practice.

NOTE

This post will assume, unless otherwise stated, that the objective function $f$ is convex (i.e. we will be covering convex optimization).

Common examples of convex objective functions include logistic regression, data fitting, SVMs, Lasso and Group Lasso. Convexity is detailed more formally in the appendix.

In basic terms, the gradient descent algorithm uses the gradient of the loss function $f(\boldsymbol{\theta})$ (denoted $\nabla f(\boldsymbol{\theta})$), i.e. the vector of partial derivatives $\frac{\partial f}{\partial \theta_j}$ for each component, evaluated at the current parameter values $\theta_j$, to determine the "steepest" direction in which the loss is increasing. It's worth taking a look in the appendix at the details of the gradient operation if you are not familiar.

The negative gradient will indicate the steepest direction in which our loss function $f(\boldsymbol{\theta})$ is decreasing, and if we move the values of the $\theta_j$s in that direction, we will decrease the loss $f(\boldsymbol{\theta})$. Ultimately, we intend to optimize $f(\boldsymbol{\theta})$ by minimizing it, so we hope to adjust the $\theta_j$s until our loss is minimized ("minimized" is usually defined by criteria we select).

So, if we start with some initial parameters $\boldsymbol{\theta}_0$, we can subtract $\nabla f(\boldsymbol{\theta}_0)$ to "step" the parameter tuning in the direction that will minimize the loss function. However, we need to control the size of the steps that we take, so we include a parameter $\alpha$ that we can tune on our own (called a "hyperparameter") to scale $\nabla f$ and control the size of the step we take in that direction. The parameter $\alpha$ is commonly referred to as the "learning rate".

Since we likely will not arrive at the solution after one step, we can take the resulting weights $\boldsymbol{\theta}_1$ and evaluate $\nabla f(\boldsymbol{\theta}_1)$ to take another step. We continue onward by recursively plugging parameter values back into the algorithm, and in general after $t$ steps, we have the gradient descent algorithm.

$$\begin{aligned} \theta_{t+1}=\theta_{t}-\alpha \nabla f(\theta_{t}) \end{aligned}$$

The algorithm iterates over discrete timesteps $t$, stepping toward the minimum of $f(\theta_t)$ with a step size proportional to the learning rate $\alpha$.
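As a sketch of what this update rule looks like in code (the function names and the $\frac{1}{n}$ scaling convention are assumptions on my part, not from the post):

```python
import numpy as np

def gradient_descent(grad_f, theta0, alpha, num_steps):
    """Iterate theta_{t+1} = theta_t - alpha * grad_f(theta_t)."""
    theta = theta0.copy()
    for _ in range(num_steps):
        theta = theta - alpha * grad_f(theta)
    return theta

# Example gradient for f(theta) = (1/n) * ||A theta - b||^2.
# (The 1/n averaging is one common convention; drop it to match ||A theta - b||^2 exactly.)
def make_least_squares_grad(A, b):
    n = A.shape[0]
    return lambda theta: (2.0 / n) * A.T @ (A @ theta - b)
```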

Now, picking up from the previous section, we will use a simple least squares regression of some data we generate as an example. We will generate 100 random points $x_i\in X$, $i \in 1,...,100$, where $X \sim \text{Unif}(0,2)$, and assign labels to these points where $y_i = 2x_i + 4 + z_i$, where $z_i \in Z \sim N(0,1)$ is a standard Gaussian noise term to make our $y_i$ values noisy. We only want to fit a first-order polynomial line to the data in this example, so the only two "features" that our parameters will fit are slope and intercept, $\boldsymbol\theta = \left[\begin{array}{c} \theta_{slope}\\ \theta_{intercept} \end{array}\right]$.

We then initialize a slope and intercept $\boldsymbol{\theta}_0 = \left[\begin{array}{c} \theta_{slope,0}\\ \theta_{intercept,0} \end{array}\right]$ (in this example just $\left[\begin{array}{c} 0\\ 0 \end{array}\right]$), select a learning rate $\alpha=0.1$ (we will discuss how to more carefully select this hyperparameter later), and perform the gradient descent algorithm for $t=100$ steps (iterations). Below we animate the process, where the top plot is the line being fitted, the middle plot is the value of our loss gradually being minimized, and the last plot shows the values of the slope and intercept.

GD Least Squares fit

Since we know how we generated the data, we know that the true optimum is $\boldsymbol\theta^* = \left[\begin{array}{c} 2\\ 4 \end{array}\right]$, which we can see visually in the first plot.
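For reference, a minimal NumPy sketch of this experiment might look like the following; the random seed is arbitrary, and I track the mean of the squared residuals as the loss so that it plateaus near the noise variance, as in the middle plot.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Generate 100 points x_i ~ Unif(0, 2) and noisy labels y_i = 2*x_i + 4 + z_i, z_i ~ N(0, 1).
n = 100
x = rng.uniform(0.0, 2.0, size=n)
y = 2.0 * x + 4.0 + rng.normal(size=n)

# Design matrix: one column for the slope feature, one for the intercept.
A = np.column_stack([x, np.ones(n)])
theta = np.zeros(2)          # [slope, intercept] initialized to zero
alpha = 0.1                  # learning rate
loss_history = []

for t in range(100):
    residual = A @ theta - y
    loss_history.append(residual @ residual / n)   # mean squared error
    grad = (2.0 / n) * A.T @ residual
    theta = theta - alpha * grad

print("fitted [slope, intercept]:", theta)   # should approach [2, 4]
print("final loss:", loss_history[-1])       # levels off near the noise variance (about 1)
```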

There are a few observations we can note here.

  • The parameters are converging to the optimal solution, taking large steps early on but slowing down as they approach the optimum.
  • Loss also falls steeply initially, but after around 10 steps it levels off around 1 (the variance of the noise in the data itself).
  • 100 steps is not enough to get all the way to the true optimal solution.

In short, we are seeing the parameters converge toward the optimal solution as the algorithm iterates, but it doesn't get us quite to the exact solution underlying the noise. When training a machine learning model, this isn't necessarily a bad thing. In fact, models that are not "overfit" to training data, i.e. fit "too closely" to the training data, often generalize better to unseen data than models that match the training data too closely. We may only care that the model weights we use are within some guaranteed tolerance of the true solution (an example is included in the third plot), which we typically express in convergence rates in terms of a tolerance $\epsilon$. We will see $\epsilon$-accuracy convergence properties, or convergence properties expressed in terms of total iterations $T$, for the algorithms we cover in this post.
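To make the tolerance idea concrete, here is one possible sketch (the function name and the gradient-norm criterion are my own choices, not from the post) of a stopping rule built into the descent loop:

```python
import numpy as np

def gradient_descent_with_tolerance(grad_f, theta0, alpha, epsilon, max_steps=10_000):
    """Run gradient descent until the gradient norm drops below epsilon
    (one common stopping criterion) or the step budget is exhausted."""
    theta = theta0.copy()
    for t in range(max_steps):
        g = grad_f(theta)
        if np.linalg.norm(g) < epsilon:
            return theta, t          # converged to the requested tolerance
        theta = theta - alpha * g
    return theta, max_steps          # budget exhausted before reaching tolerance
```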

Limitations of Regular Gradient Descent

In a theoretical sense, regular gradient descent is robust: it is guaranteed to converge for convex objective functions, and it guarantees we are within a tolerance of the true solution (or, at least, we will know how close we are after some number of iterations). However, in practice it has limitations.

Regular gradient descent works well for a very simple example like this (small sample size, only two features), but in practical machine learning applications with large datasets, it has glaring computational deficiencies.

We turn our attention to the gradient calculation itself, and consider a gradient calculation at timestep $t$, $\nabla f(\theta_t)$ for $\theta_t \in \mathbb{R}^{p \times 1}$, for a dataset with $p$ features and $n$ samples. At each timestep, we are computing the full gradient. It is helpful to write out what the gradient calculation really looks like for our least squares loss example.

$$\begin{aligned} \nabla f (\boldsymbol{\theta_t}) = \frac{1}{n}\sum_{i=1}^{n}{\nabla f_i(\boldsymbol{\theta_t})} \end{aligned}$$

$$\begin{aligned} \nabla f_i (\boldsymbol{\theta_t}) = \frac{d}{d\boldsymbol{\theta}} \left(||a_i\boldsymbol{\theta_t}-b_i||^{2}\right) \end{aligned}$$

See appendix for more details on full and stochastic gradient calculation.
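As an illustration of why the full gradient costs on the order of $n \cdot p$ work per step, here is a sketch of the same quantity computed two ways, once as an explicit average of per-sample gradients and once vectorized (the $\frac{1}{n}$ scaling follows the sum above; the function names are mine):

```python
import numpy as np

def full_gradient_loop(A, b, theta):
    """Average of per-sample gradients: (1/n) * sum_i grad f_i(theta),
    where f_i(theta) = (a_i . theta - b_i)^2."""
    n, p = A.shape
    grad = np.zeros(p)
    for i in range(n):                      # n per-sample gradients ...
        residual_i = A[i] @ theta - b[i]
        grad += 2.0 * residual_i * A[i]     # ... each costing O(p) work
    return grad / n

def full_gradient_vectorized(A, b, theta):
    """Same quantity computed with matrix operations (still O(n * p) per call)."""
    n = A.shape[0]
    return (2.0 / n) * A.T @ (A @ theta - b)
```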

In terms of complexity, each step requires on the order of $p \cdot n$ calculations. It can be shown that, since $f$ is convex and has a Lipschitz continuous gradient, for $\alpha$ selected carefully using what is called the Lipschitz constant $L$ (you can learn more about Lipschitz continuity here or in the appendix), after $T$ iterations the following will hold.

$$\begin{aligned} f(\theta_T)-f(\theta^*) \leq \frac{L}{2T} ||\theta_0 - \theta^*||^2 \text{ for } T \geq 1 \end{aligned}$$

Where $\theta^*$ is the true optimal solution ($\boldsymbol\theta^* = \left[\begin{array}{c} 2\\ 4 \end{array}\right]$ from our earlier example), and $\theta_0$ is our initialization ($\boldsymbol{\theta}_0 = \left[\begin{array}{c} \theta_{slope,0}\\ \theta_{intercept,0} \end{array}\right] = \left[\begin{array}{c} 0\\ 0 \end{array}\right]$, also from our earlier example).

We say that this converges at a rate $\mathcal{O}(\frac{1}{T})$. In other words, we can choose the number of gradient descent iterations $T$ and know for certain that the gap between the loss at our current parameters and the loss at the optimal parameters is bounded above by a quantity proportional to $\frac{1}{T}$ (more on "big O notation" here).
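As a rough sketch of how this bound could be evaluated for our toy problem (assuming the averaged loss $f(\boldsymbol{\theta}) = \frac{1}{n}||A\boldsymbol{\theta}-\boldsymbol{b}||^2$, whose Hessian is $\frac{2}{n}A^TA$; the exact constant depends on how the loss is scaled, and the function names are mine):

```python
import numpy as np

def lipschitz_constant(A):
    """Smoothness constant of f(theta) = (1/n) * ||A theta - b||^2.
    The Hessian is (2/n) * A^T A, so L is its largest eigenvalue."""
    n = A.shape[0]
    return (2.0 / n) * np.linalg.eigvalsh(A.T @ A).max()

def suboptimality_bound(L, theta0, theta_star, T):
    """Upper bound f(theta_T) - f(theta*) <= L * ||theta_0 - theta*||^2 / (2T)."""
    return L * np.linalg.norm(theta0 - theta_star) ** 2 / (2 * T)

# For the earlier example (theta_0 = [0, 0], theta* = [2, 4]), the bound shrinks like 1/T:
# suboptimality_bound(lipschitz_constant(A), np.zeros(2), np.array([2.0, 4.0]), T=100)
```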

This means that we can guarantee we are within some tolerance of the true optimal solution after $T$ iterations. But, if we are calculating $\nabla f (\boldsymbol{\theta_t})$ at each of those $T$ iterations, and our dataset is large, it can become computationally prohibitive to get a solution within a desirable tolerance.

In machine learning optimization, we are always seeking faster ways to train our models. When approaching an optimization problem we might ask ourselves questions like,

  • How can I optimize my parameters (train the model) in the fewest steps?
  • How can I improve the computational cost of my steps?
  • When can I terminate the algorithm, and what are my criteria for stopping?

These are the kinds of questions that might motivate a search for more efficient algorithms. A common approach to addressing them is to use approximations, and stochasticity turns out to have some desirable theoretical properties that help us here.

Next: Stochastic Gradient Descent