Machine Learning Optimization - Stochastic Gradient Descent
Table of Contents
0: Introduction
1: Crash Course: Gradient Descent
2: Stochastic Gradient Descent
3: Variance Reduction Techniques
4: SARAH and ADAM Algorithms
Stochastic Gradient Descent
Stochastic methods are common in many areas of computational mathematics and are desirable because they improve computational performance by leveraging random operations. A stochastic approach to gradient descent is a natural extension: we want to speed up an inherently slow but robust method for training our machine learning models.
The core concept of stochastic gradient descent is straightforward -- rather than performing a full gradient calculation at each timestep of gradient descent, we can sample individual gradients $\nabla f_i(w)$ (where $i$ is selected uniformly at random from $\{1, \dots, n\}$) in lieu of computing the full gradient $\nabla F(w)$. We use a randomly sampled $\nabla f_i(w)$ at each step as a general approximation for the full gradient in the descent algorithm to step toward the solution at significantly reduced computational cost, with the concession that the step is an approximation.
See appendix for more details on stochastic gradient calculation.
So, we would like to show that this randomly sampled gradient is a valid estimator of the true gradient.
SGD Algorithm
For stochastic descent, we want to consider the following problem, minimizing the average of $n$ functions $f_i$, where $F$ is our objective function, a function of our model parameters $w$.

$$\min_{w} F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$$
Taking the gradient of this objective function gives

$$\nabla F(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(w)$$
Stochastic gradient descent (SGD) is defined as the following.

$$w_{t+1} = w_t - \eta_t \nabla f_{i_t}(w_t)$$

where $i_t$ is an index chosen uniformly at random from $\{1, \dots, n\}$ at iteration $t$.
NOTE
SGD cannot converge with a fixed step size; instead it must use a diminishing step size $\eta_t$ that satisfies $\sum_{t=0}^{\infty} \eta_t = \infty$ and $\sum_{t=0}^{\infty} \eta_t^2 < \infty$. Hence the learning rate is denoted $\eta_t$, with an iteration subscript. For example, $\eta_t = \eta_0/(t+1)$ satisfies both conditions.
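To make the update concrete, here is a minimal Python sketch of SGD on a toy least-squares problem with $f_i(w) = \frac{1}{2}(x_i^\top w - y_i)^2$. This is an illustration, not part of the original derivation: the data, the names, and the $\eta_t = \eta_0/(t+1)$ schedule are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem (assumed for illustration):
# f_i(w) = 0.5 * (x_i @ w - y_i)**2, so grad f_i(w) = x_i * (x_i @ w - y_i)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def grad_i(w, i):
    """Gradient of the single component f_i at w."""
    return X[i] * (X[i] @ w - y[i])

def F(w):
    """Full objective: average of the component losses."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def sgd(w0, eta0=0.2, iters=5000):
    w = w0.copy()
    for t in range(iters):
        i = rng.integers(n)           # i_t drawn uniformly from {0, ..., n-1}
        eta = eta0 / (t + 1)          # diminishing step size from the note above
        w = w - eta * grad_i(w, i)    # w_{t+1} = w_t - eta_t * grad f_{i_t}(w_t)
    return w

w0 = np.zeros(d)
w_hat = sgd(w0)
print(F(w0), F(w_hat))  # the objective should drop substantially from its start
```

Each iteration touches a single data point, so the cost per step is independent of $n$; only the iteration count, not the dataset size, drives the total work.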
We want to determine whether $\nabla f_i(w)$ is a valid estimator for the true gradient $\nabla F(w)$, so we are interested in its expectation. Since $i$ is selected uniformly at random, the following expectation holds.

$$\mathbb{E}_i\left[\nabla f_i(w)\right] = \sum_{i=1}^{n} \frac{1}{n} \nabla f_i(w) = \nabla F(w)$$
So, an individual gradient is an unbiased estimator for the full gradient, a fundamental principle of stochastic gradient descent.
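As a quick sanity check of this identity, we can reuse the toy problem from the sketch above and confirm numerically that averaging all $n$ component gradients recovers the full gradient exactly:

```python
# Unbiasedness check, reusing X, y, n, d, rng, and grad_i from the sketch above.
w = rng.normal(size=d)
full_grad = X.T @ (X @ w - y) / n                 # grad F(w)
avg_grad = np.mean([grad_i(w, i) for i in range(n)], axis=0)
print(np.allclose(full_grad, avg_grad))           # True: E_i[grad f_i(w)] = grad F(w)
```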
This presents a great computational advantage -- since we select a single gradient $\nabla f_{i_t}(w_t)$ at random for each iteration of SGD, it doesn't matter what the sample size of our data is. The sample size could be very large, yet an SGD iteration will "cost" the same: one component gradient, roughly $n$ times cheaper than a full gradient evaluation.
[Figure: SGD iterate trajectories stepping toward the solution (left) and compared against variance-reduced methods (right).]
On the left, we see the SGD optimizer algorithm stepping the parameters toward the solution in some obviously approximate ways. On the right, we overlay, for comparison, other algorithms that use variance reduction techniques; we will revisit these later.
Since our descent algorithm now takes steps using approximations of the gradient, each step moves only approximately in the direction of the optimum.
SGD Convergence and Variance
The inexactness of the steps that we have now introduced to gradient descent is variance. Variance in our algorithm's steps can slow the algorithm down in its own right (each step moves only approximately in the right direction), so we must give additional consideration to how we develop SGD algorithms. First, we must be able to quantify the variance.
An analysis of convergence of SGD reveals how we quantify the variance in the algorithm's steps. The SGD algorithm can be rewritten as the following.

$$w_{t+1} = w_t - \eta_t \left( \nabla F(w_t) + e_t \right)$$
We assume $F$ is convex and that $\nabla F$ is Lipschitz continuous. First, we define $e_t = \nabla f_{i_t}(w_t) - \nabla F(w_t)$. $e_t$ quantifies the error between the full gradient and a stochastic gradient at iteration $t$.
Next, we take the conditional expectation of $e_t$ given parameters $w_t$,

$$\mathbb{E}[e_t \mid w_t] = \mathbb{E}\left[\nabla f_{i_t}(w_t) \mid w_t\right] - \nabla F(w_t) = \nabla F(w_t) - \nabla F(w_t) = 0$$
This establishes that $e_t$ is conditionally unbiased. Now we take the conditional variance and use this conditional unbiasedness for the following.

$$\mathrm{Var}[e_t \mid w_t] = \mathbb{E}\left[\|e_t\|^2 \mid w_t\right] - \left\|\mathbb{E}[e_t \mid w_t]\right\|^2 = \mathbb{E}\left[\|e_t\|^2 \mid w_t\right]$$
We can now use the Cauchy-Schwarz inequality to show

$$\mathbb{E}\left[\|e_t\|^2 \mid w_t\right] \le 2\,\mathbb{E}\left[\|\nabla f_{i_t}(w_t)\|^2 \mid w_t\right] + 2\,\|\nabla F(w_t)\|^2$$

So, provided the individual gradients remain bounded along the trajectory, our conditional variance must be bounded by some constant variance $\sigma^2$, i.e. $\mathbb{E}[\|e_t\|^2 \mid w_t] \le \sigma^2$.
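We can also estimate this conditional variance empirically. The snippet below, reusing the toy problem from the sketch above, computes $\mathbb{E}[\|e_t\|^2 \mid w_t]$ exactly at a fixed point $w$ by averaging over all $n$ choices of the index (illustrative only):

```python
# Conditional variance of the stochastic gradient at a fixed w,
# reusing X, y, n, d, rng, and grad_i from the sketch above.
w = rng.normal(size=d)
full_grad = X.T @ (X @ w - y) / n
errors = np.stack([grad_i(w, i) - full_grad for i in range(n)])  # e_t for each i
var_at_w = np.mean(np.sum(errors ** 2, axis=1))  # E[||e_t||^2 | w_t = w]
print(var_at_w)  # any sigma^2 dominating this over the iterates works in the analysis
```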
Next, we want to observe a sequence of iterates $\{w_t\}_{t=0}^{T}$ generated by SGD after $T$ iterations. Let $w^* = \arg\min_w F(w)$ and let $\bar{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$ be the average iterate.
Under these assumptions, we show the following convergence for SGD.

$$\mathbb{E}\left[F(\bar{w}_T)\right] - F(w^*) \le \frac{\|w_0 - w^*\|^2 + \sigma^2 \sum_{t=0}^{T-1} \eta_t^2}{2 \sum_{t=0}^{T-1} \eta_t}$$
If we select $\eta_t$ on the order of $1/\sqrt{t}$, we achieve a convergence rate of $\mathcal{O}(1/\sqrt{T})$ in expectation. Per iteration this is slower than the $\mathcal{O}(1/T)$ rate of full gradient descent on convex smooth problems, but each SGD iteration is roughly $n$ times cheaper, so the total computational cost to reach a given accuracy can still be lower. Note here, $\eta_t$ decays as we step, so the algorithm "slows down" as we proceed. The variance in the steps requires that our learning rate decays, which is generally undesirable.
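One standard way to see where the rate comes from (a worked calculation under the bound above, not reproduced from the original post): fix a horizon $T$ and use the constant-for-the-horizon step size $\eta_t = c/\sqrt{T}$. Then $\sum_{t=0}^{T-1} \eta_t = c\sqrt{T}$ and $\sum_{t=0}^{T-1} \eta_t^2 = c^2$, so

$$\mathbb{E}\left[F(\bar{w}_T)\right] - F(w^*) \le \frac{\|w_0 - w^*\|^2 + \sigma^2 c^2}{2 c \sqrt{T}} = \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right)$$

The variance term $\sigma^2$ sits directly in the numerator, which is exactly why reducing it is worthwhile.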
We are now able to quantify a bound on the variance of SGD's steps, and we can start to explore the different approaches we might take to reduce it.