Machine Learning Optimization - Stochastic Descent, Variance Reduction Techniques, and the Bias-Variance Tradeoff
Introduction
The intent of this post is to explore some examples of variance reduction techniques in gradient descent optimization methods for machine learning, with some focus on their underlying statistical concepts. We will use the ADAM optimizer, a popular algorithm in neural network optimization and deep learning, and the less prolific SARAH optimizer (a variant of SVRG) as examples of algorithms in practice, with additional background on the methodologies from which these stochastic optimizers are derived including SGD, SAG, SAGA, and SVRG. In addition to exploring the techniques themselves, we will lend some focused attention to some bias-variance tradeoffs unique to variance reduction techniques, which are unrelated to the typical overfitting/overparameterization bias-variance tradeoff traditionally examined in machine learning. There are many resources available online for all of these topics, and I will provide some links to my favorites, but my aim is to cover the basic underlying concepts needed to build some more common and contemporary algorithms.
This post will cover variance reduction techniques with some technical statistical details, and assumes the reader has a basic understanding of machine learning principles and gradient descent (here's a 3Blue1Brown video if you need a refresher, but I also dedicated a section of this post to my own refresher). But, my intent is to make these topics accessible for an audience looking for some enhanced understanding of techniques in machine learning optimization. I will include links and additional details wherever I feel it's warranted, but I am open to any feedback you may have.
Information for this post is largely drawn from my final project for an optimization course I took in the Spring of 2024, for which the paper and presentation are available below.
Please refer to these documents for additional formal details; the post may reference portions of them for clarity.
Background
Machine Learning Optimization and Stochastic Optimizers
Optimization as a field has become critical to recent progress in machine learning as the field expands its applications into ever greater swathes of data. We are using machine learning to approach increasingly large datasets, and to do so effectively we need model optimization techniques that are both fast and guaranteed to converge. Stochastic optimizers are a class of algorithms that have become prevalent in recent years because they pair theoretical convergence guarantees with the desirable efficiency of stochastic processes. Being stochastic methods, they require special techniques to address and control the variance in their computations in order to realize any theoretical convergence advantage over other common methods.
When optimizing machine learning model parameters, first-order gradient descent algorithms utilizing stochastic gradients (an individual gradient selected uniformly at random) or minibatch gradients (subsets of the full gradient whose elements are selected uniformly at random) have become core staples in machine learning frameworks due to their advantageous computational efficiency on datasets of both large sample size and high dimensionality. But the randomness (variance) that stochasticity introduces into the steps of these gradient descent methods tends to slow down or inhibit convergence, so it is in our interest to implement techniques to curb this variance.
This is a visual example of the topics and algorithms we intend to cover. The graph depicts the parameters of a two-parameter model (training to correlated noise) as they iteratively update toward the true parameters under stochastic gradient descent (SGD, yellow line) and several of its variants, each of which employs a different technique to reduce variance in the parameter updates. Variance, as we will discuss in this post, is the quantity describing the magnitude of the randomness in the algorithms' steps toward the true solution. SGD clearly has the highest variance, and the variance reduction variants take visibly steadier steps.
This example uses some contrived data, but the settings for each algorithm are approximately analogous (we'll see what "approximately" means, as the algorithms differ in methodology), and they run for the same number of iterations. We can see that the variants of SGD progress closer to the solution, each with qualitatively smaller variance than SGD, which encapsulates our general motivation for utilizing variance reduction techniques in stochastic gradient descent style algorithms. If we reduce variance in the steps, we can converge to the true solution more efficiently. We will dive into some of these techniques and their underlying theory to better understand how we can balance variance reduction with convergence properties, computational efficiency, and the unintended bias that can be a byproduct of these methodologies.
Crash Course: Gradient Descent
This section is some entry-level background to machine learning concepts, so please skip if you already have a grasp.
The goal in machine learning optimization is to train model parameters for a machine learning model we choose to use, which will ultimately depend on the application. We will use the simple but very commonly used linear model as an example.
Example: Least Squares Regression
We are interested in training a model using data samples with features and known outputs for each data sample. The model we will use as an example is a simple linear system.
The data matrix holds one row per data sample, and the response vector holds the known output for each of those rows. Our model uses a vector of parameters, one per feature. This is a linear model, where the product of the data matrix and the parameter vector should ideally reproduce the response exactly.
In a typical application, however, this system is overdetermined (there are more data samples than features describing them, and the data matrix has full column rank) and inconsistent (no choice of parameters satisfies every equation exactly).
Since we cannot find an exact solution to the model, we want to optimize our parameters such that the model's output is the best approximate solution to the system. In other words, we want to train the model to predict the known outputs as well as it can by adjusting each of the parameters.
To do so, we aim to minimize a quantity we call the "loss" of the model: a measure of error between predictions and the actual label values as a function of however the parameters are selected (or in our case, how we optimize them).
The loss is quantified as a function of the parameters, also sometimes called the "objective function." So, the loss will depend on the parameters that we choose.
A common loss function that we see in machine learning is the simple least-squares regression fit.
Here, the model generates a vector of predictions, and the loss function calculates the squared Euclidean distance between the prediction vector and the true label vector.
Our optimized model with our trained parameters satisfies the following.
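Written out explicitly (a sketch using notation I'll assume here: a data matrix $A \in \mathbb{R}^{n \times d}$, response vector $b \in \mathbb{R}^{n}$, and parameter vector $x \in \mathbb{R}^{d}$; the original symbols may differ), the least squares loss and the trained parameters are:

$$
f(x) = \lVert Ax - b \rVert_2^2, \qquad x^\star = \arg\min_{x \in \mathbb{R}^d} \lVert Ax - b \rVert_2^2 .
$$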
Gradient Descent in Machine Learning
The gradient descent algorithm is ideal for optimization of this kind of problem. It is an iterative algorithm that is designed to update parameters by changing each parameter by some amount, in such a way that the "loss" decreases.
We must note that the least squares objective function is convex, from which it follows that it must have a global minimum (it must reach a "smallest" value). Not all problems in machine learning utilize convex loss functions, but many common applications do in practice.
NOTE
This post will assume, unless otherwise stated, that the objective function is convex (i.e. we will be covering convex optimization).
Common examples of convex objective functions include logistic regression, data fitting, SVMs, Lasso and Group Lasso. Convexity is detailed more formally in the appendix.
In basic terms, the gradient descent algorithm uses the gradient of the loss function, the vector of partial derivatives with respect to each parameter evaluated at the current parameter values, to determine the "steepest" direction in which the loss is increasing. It's worth taking a look in the appendix at the details of the gradient operation if you are not familiar.
The negative gradient will indicate the steepest direction in which our loss function is decreasing, and if we move the parameter values in that direction, we will decrease the loss. Ultimately, we intend to optimize the loss by minimizing it, so we hope to adjust the parameters until the loss is minimized ("minimized" is usually defined by criteria we select).
So, if we start with some initial parameters, we can subtract the gradient to "step" the parameter tuning in the direction that will minimize the loss function. However, we need to control the size of the steps that we take, so we include a parameter that we tune ourselves (called a "hyperparameter") to scale and control the size of the step we take in that direction. This parameter is commonly referred to as the "learning rate."
Since we likely will not arrive at the solution after one step, we can use the resulting weights, evaluate the gradient again, and take another step. We continue onward by recursively plugging parameter values back into the update, and in general, after some number of steps, we have the gradient descent algorithm.
The algorithm iterates over discrete timesteps, stepping toward the minimum of the loss at a pace proportional to the learning rate.
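In the notation assumed above, with a learning rate $\alpha$ and iteration counter $k$ (also assumed symbols), one standard form of the update is:

$$
x_{k+1} = x_k - \alpha \, \nabla f(x_k), \qquad k = 0, 1, 2, \ldots
$$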
Now, picking up from the previous section, we will use a simple least squares regression of some data we generate as an example. We will generate 100 random points and assign labels to them using a linear function plus a standard Gaussian noise term to make the values noisy. We only want to fit a first-order polynomial (a line) to the data in this example, so the only two "features" our parameters will fit are the slope and the intercept.
We then initialize a slope and intercept, select a learning rate (we will discuss how to more carefully select this hyperparameter later), and perform the gradient descent algorithm for 100 steps (iterations). Below we animate the process, where the top plot is the line being fitted, the middle plot is the value of our loss gradually minimizing, and the last plot is the values for the slope and intercept.
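A minimal sketch of this experiment in Python is below; the specific true slope and intercept, noise scale, initialization, and learning rate are illustrative assumptions rather than the exact values behind the plots.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate 100 noisy points from an assumed "true" line y = 2x + 1
n = 100
x_data = rng.uniform(-3, 3, size=n)
y_data = 2.0 * x_data + 1.0 + rng.standard_normal(n)  # standard Gaussian noise

A = np.column_stack([x_data, np.ones(n)])  # feature columns: slope, intercept
params = np.zeros(2)                       # [slope, intercept], assumed initialization
alpha = 0.1                                # learning rate (assumed)

def loss(p):
    """Mean squared error; scaling by 1/n is a convenience assumption."""
    return np.mean((A @ p - y_data) ** 2)

for step in range(100):
    grad = 2 / n * A.T @ (A @ params - y_data)  # gradient of the mean squared loss
    params = params - alpha * grad              # the gradient descent update

print("fitted [slope, intercept]:", params, " final loss:", loss(params))
```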
Since we know how we generated the data, we know the true optimal slope and intercept, which we can see visually in the first plot.
There are a few observations we can note here.
- The parameters are converging to the optimal solution, taking large steps early on but slowing down as they approach the optimum.
- Loss also falls steeply initially, but after around 10 steps it levels off around 1 (the variance of the noise in the data itself).
- 100 steps is not enough to get all the way to the true optimal solution.
In short, we see the parameters converge toward the optimal solution as the algorithm iterates, but it doesn't get us quite to the exact solution underlying the noise. When training a machine learning model, this isn't necessarily a bad thing. In fact, models that are not "overfit" to the training data, i.e. fit "too closely" to it, often generalize better to unseen data. We may only care that the model weights we use are within some guaranteed tolerance of the true solution (an example is included in the third plot), which we typically express through convergence rates in terms of a tolerance. For the algorithms we cover in this post, we will see convergence properties expressed either in terms of a target accuracy or in terms of the total number of iterations.
Limitations of Regular Gradient Descent
In a theoretical sense, regular gradient descent is robust, guaranteed to converge for convex objective functions, and will guarantee we are within a tolerance of the true solution (or we will know how close we are after some number of iterations). However, in practice it has limitations.
Regular gradient descent works well for a very simple example like this (small sample size, only two features), but in practical machine learning applications with large datasets, it has glaring computational deficiencies.
We turn our attention to the gradient calculation itself, and consider the gradient calculation at a single timestep for a dataset with many features and many samples. At each timestep, we are computing the full gradient. It is helpful to write out what the gradient calculation really looks like for our least squares loss example.
See appendix for more details on full and stochastic gradient calculation.
In terms of complexity, each step requires a calculation over every sample and every feature. It can be shown that, since the loss is convex and its gradient is Lipschitz continuous, for a learning rate selected carefully using what is called the Lipschitz constant (you can learn more about Lipschitz continuity here or in the appendix), after a given number of iterations the following will hold.
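One standard form of this guarantee (a sketch, writing $k$ for the iteration count, $L$ for the Lipschitz constant of the gradient, $x^\star$ for the optimum, $x_0$ for the initialization, and assuming a learning rate of $1/L$) is:

$$
f(x_k) - f(x^\star) \;\le\; \frac{L \lVert x_0 - x^\star \rVert_2^2}{2k}.
$$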
Here the bound compares the loss at the current parameters to the loss at the true optimal solution (the underlying slope and intercept from our earlier example), in terms of how far our initialization (also from our earlier example) started from that optimum.
We say that this converges at a rate on the order of one over the number of iterations, i.e. $O(1/k)$. In other words, we can choose the number of gradient descent iterations and know with certainty that the upper bound on the gap between the current loss and the optimal loss shrinks proportionally with that choice (more on "big O notation" here).
This means that we can guarantee we are within some tolerance of the true optimal solution after enough iterations. But if we are calculating the full gradient at each of those iterations and our dataset is large, it can become computationally prohibitive to get a solution within a desirable tolerance.
In machine learning optimization, we are always seeking faster ways to train our models. When approaching an optimization problem we might ask ourselves questions like,
- How can I optimize my parameters (train the model) in the fewest steps?
- How can I improve the computational cost of my steps?
- When can I terminate the algorithm, and what are my criteria for stopping?
These are the kinds of questions that might motivate a search for more efficient algorithms. A common approach to address such motives is to use approximations, for which stochasticity has some desirable theoretical properties to help us.
Stochastic Gradient Descent
Stochastic methods are common in many areas of computational mathematics and are desirable for improving computational performance by leveraging random operations. A stochastic approach to gradient descent is a natural extension; we want to speed up an inherently slow but robust method for training our machine learning models.
The core concept of stochastic gradient descent is straightforward -- rather than performing a full gradient calculation at each timestep of gradient descent, we can sample individual component gradients (with the index selected uniformly at random from the data samples) in lieu of computing the full gradient. We use one randomly sampled gradient at each step as a general approximation of the full gradient, stepping toward the solution at significantly reduced computational cost, with the concession that each step is an approximation.
See appendix for more details on stochastic gradient calculation.
So, we would like to show that this randomly sampled gradient is a valid estimator of the true gradient.
SGD Algorithm
For stochastic descent, we want to consider the following problem: minimizing an objective function that is the average of many component functions, each a function of our model parameters.
Taking the gradient of this objective function gives the corresponding average of the component gradients.
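In assumed notation, with $n$ component functions $f_i$ (one per data sample) and parameters $x$:

$$
\min_{x} \; f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad \nabla f(x) = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x).
$$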
Stochastic gradient descent (SGD) is then defined by replacing the full gradient in the descent update with a single component gradient whose index is chosen uniformly at random at each iteration.
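A standard way to write this update, keeping the notation above and a step size $\alpha_k$, is:

$$
x_{k+1} = x_k - \alpha_k \, \nabla f_{i_k}(x_k), \qquad i_k \sim \mathrm{Uniform}\{1, \ldots, n\}.
$$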
NOTE
SGD cannot converge with a fixed step size; instead it must use a diminishing step size whose sum over iterations diverges while the sum of its squares stays finite. Hence the learning rate is written with an iteration subscript.
We want to determine whether a randomly sampled gradient is a valid estimator for the true gradient, so we are interested in its expectation. Since the index is selected uniformly at random, the following expectation holds.
So, an individual gradient is an unbiased estimator for the full gradient, a fundamental principle to stochastic gradient descent.
This presents a great computational advantage -- since we are selecting an individual gradient at random for each iteration of SGD, it doesn't matter what the sample size of our data is. The sample size could be very large and an SGD iteration will "cost" the same.
On the left, we see the SGD optimizer stepping the parameters toward the solution in some obviously approximate ways. On the right, we also impose other algorithms, which we will revisit later, that use variance reduction techniques for comparison.
Since we are now taking steps in our descent algorithm using approximations of the gradient, we are going to be taking steps only approximately in the direction of the optimum.
SGD Convergence and Variance
The inexactness of the steps that we have now introduced to gradient descent is variance. Variance in our algorithm's steps can slow the algorithm down in its own right (we are only approximately stepping in the right direction at each step), so we must give additional consideration to how we develop SGD algorithms. First, we must be able to quantify the variance.
An analysis of convergence of SGD reveals how we quantify the variance in the algorithm's steps. The SGD algorithm can be rewritten as the following.
We assume the objective is convex and that its gradient is Lipschitz continuous. First, we define an error term that quantifies the difference between the full gradient and the sampled stochastic gradient at a given iteration.
Next, we take the conditional expectation of this error term given the current parameters.
We establish, as an assumption, that this error term is conditionally unbiased. Now we take the conditional variance and use the conditional unbiasedness assumption for the following.
We can now use the Cauchy-Schwarz Inequality to show
So our conditional variance must be bounded by some constant.
Next, we want to observe the sequence of iterates generated by SGD after a number of iterations, along with a suitably defined average of those iterates and of the step sizes.
Under these assumptions, we show the following convergence for SGD.
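One standard version of this result, sketched under the stated assumptions (convexity, a constant $\sigma^2$ bounding the conditional variance, and $\bar{x}_k$ denoting a step-size-weighted average of the iterates; exact constants depend on the particular analysis), is:

$$
\mathbb{E}\big[f(\bar{x}_k)\big] - f(x^\star) \;\le\; \frac{\lVert x_0 - x^\star \rVert_2^2 + \sigma^2 \sum_{j=0}^{k-1} \alpha_j^2}{2 \sum_{j=0}^{k-1} \alpha_j}.
$$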
If we select step sizes that decay on the order of $1/\sqrt{k}$, we achieve a convergence rate of $O(1/\sqrt{k})$; each iteration makes less progress than a full gradient descent step, but it is also vastly cheaper, which is where the practical improvement over regular gradient descent comes from. Note that the learning rate decays as we step, so the algorithm "slows down" as we proceed. The variance in the steps requires that our learning rate decays, which is generally undesirable.
We are able to quantify a bound for the variance of SGD, and we can now start to explore what different approaches we may take to reduce it.
Variance Reduction Techniques
The Bias-Variance Tradeoff
You may already be familiar with a bias-variance tradeoff scenario in machine learning -- our model is not producing desirable training error, so we choose to increase its complexity to better fit the training data. However, in doing so we may find that we "overfit" the model, creating a prediction model that is tuned to the details and minutiae of the particular training sample. When we apply the model to test data that it has not previously seen, it performs more poorly than our simpler model.
This is commonly referred to as "overfitting." As we increase the complexity of our model we reduce its bias (systematic error from a model too simple to capture the signal), but we increase its variance (sensitivity to the particular training data), and it is this variance that degrades predictions on unseen test data.
The bias-variance tradeoff we see in variance reduction techniques for SGD algorithms is fundamentally different than our familiar example.
Model complexity bias-variance tradeoffs affect the error in predictions made by the model itself -- but in the context of stochastic optimization, bias and variance impact the error in the parameter updates.
The practical issue with this variance ends up being the rate at which we converge, or whether we converge at all. Variance that overwhelms the steps can adversely impact the rate at which we converge to an acceptable solution, and if our algorithm takes too long to execute, we have defeated the purpose of using stochastic methods. In all, our goal becomes finding a balance: keeping the computational advantages of stochastic processes while keeping variance (and, as we will see, bias) under control in order to reap the advantages of faster algorithms.
NOTE
Image from Hastie et al. (2004), p. 38; a general illustration of the bias-variance tradeoff.
MSE, Variance and Bias
Before we get into some techniques that can address variance in stochastic descent, we need to observe one of the core principles of statistical analysis.
Part of our due diligence in statistics is to analyze the quality of an estimator. To quantify the accuracy of an estimator, we measure its error relative to the true value of the parameter it is estimating using the Mean Squared Error (MSE).
MSE is a quantity defined as the expected value of the square difference between the parameter estimate and the true parameter value. Skipping some steps in between, it can be decomposed into both a variance component and a bias component (a link to a helpful video about MSE).
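Concretely, for an estimator $\hat{\theta}$ of a true parameter $\theta$ (symbols assumed here), the decomposition is:

$$
\mathrm{MSE}(\hat{\theta}) \;=\; \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] \;=\; \mathrm{Var}(\hat{\theta}) + \big(\mathrm{Bias}(\hat{\theta})\big)^2, \qquad \mathrm{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta.
$$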
In statistics, we often measure the quality of an estimator by whether it is both unbiased and consistent. We won't dive into consistency of estimators further here (see appendix for some detail), but we do want to take note that in general, unbiased estimators are preferred estimators to those that are biased in some way. In terms of MSE, if our estimator is unbiased, then its MSE is equal to its variance.
We saw an example of an unbiased estimator in the previous section.
There, we assumed our conditional estimate of the approximation error for a stochastic gradient is unbiased (its conditional expectation is zero). So, the MSE for that estimator is only its variance.
We showed that this conditional variance is bounded, so the MSE of the stochastic gradient's estimation error, for a given set of parameters, is the following.
which is bounded by a constant variance term.
It is important to be cognizant of MSE as the measure for error of an estimator rather than just the variance. We don't just care about variance -- bias is a nontrivial and undesirable quantity if we are not careful.
As we explore variance reduction techniques, we will see that if we attempt to curb error due to variance but do not reduce overall MSE of the estimator in doing so, that MSE information is only "transferred" to the bias component. Compounding effects of bias in recursive descent algorithms can become especially problematic, so as we continue, we will give special attention to ensuring we also take it into account.
Minibatch SGD
Since variance in the algorithm's steps is a consequence of using estimates of the full gradient, it is pragmatic to utilize improved approximations. One natural statistical approach to improving an approximation is simple: use more data. We can do so by taking random batch samples of the full data -- minibatches.
Minibatches are an important variance reduction technique commonly seen in the gradient descent class of algorithms. Rather than sampling a random individual gradient from the full gradient at each step, we can improve our gradient approximation by sampling a random subset of the data (a batch smaller than the full sample) and modifying the SGD algorithm to average over the "minibatch."
See appendix for example calculation details
We first take the conditional expectation of the minibatch gradient estimate, modified from SGD. Note that, as with SGD, the gradient estimate is unbiased.
NOTE
The above calculation uses the identity that the variance of an average of independent, identically distributed terms is the individual variance divided by the number of terms averaged.
We can reduce the variance bound by a factor of the batch size. Since the estimator is unbiased, we do not need to be concerned that we have merely displaced MSE into a bias term. Intuitively this makes sense: we are using more information for each gradient calculation, and thus we get a better approximation of the gradient at each step. Note also that this result does not require a diminishing learning rate, so it can be applied to gradient descent algorithms with constant or adaptive learning rates.
The tradeoff here, though, is a gradient computation that is more expensive by a factor of the batch size at each timestep. No free lunch! A convergence analysis shows that minibatch SGD converges at the same rate as SGD, but at greater computational expense per step due to the batch size. It's largely considered a compromise between SGD and regular gradient descent.
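To make the tradeoff concrete, here is a minimal minibatch SGD sketch on a synthetic least squares problem; the problem sizes, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least squares problem (assumed sizes and true parameters)
n, d = 1000, 2
A = rng.standard_normal((n, d))
x_true = np.array([3.0, -1.0])
b = A @ x_true + rng.standard_normal(n)

def minibatch_sgd(batch_size, alpha=0.05, iters=500):
    """SGD where each step averages the gradients over a random minibatch."""
    x = np.zeros(d)
    for _ in range(iters):
        idx = rng.choice(n, size=batch_size, replace=False)  # sample a minibatch
        Ab, bb = A[idx], b[idx]
        grad = 2 * Ab.T @ (Ab @ x - bb) / batch_size  # averaged minibatch gradient
        x -= alpha * grad
    return x

for batch in (1, 10, 100):
    x_hat = minibatch_sgd(batch)
    print(f"batch={batch:3d}  error={np.linalg.norm(x_hat - x_true):.3f}")
```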
The ADAM optimizer, which we will see in greater detail later, utilizes batch gradients. We are fitting the same correlated noise data from above, which has two features. The graphs above demonstrate the effect of variance reduction across different batch sizes: one plot uses a batch spanning the full sample (which is essentially regular gradient descent with ADAM's boost), while the others use smaller batches. Clearly, we get closest to our optimal solution using bigger batches to reduce variance in the steps, but what this does not capture is the additional computational cost of each step. Ultimately, we tune parameters like batch size to find a balance between performance and results.
Stochastic Average Gradient (SAG) and SAGA
SAG and SAGA utilize important variance reduction techniques and are often treated as benchmarks for comparison with stochastic and batch gradient descent algorithms. Minibatches serve well in reducing variance; however, as we increase the batch size we also increase the per-iteration computational cost, and as the batch size approaches the full sample size we end up back where we started with regular gradient descent. Our goal is to achieve a convergence rate comparable to full gradient descent at a cost of only one gradient evaluation per iteration, which motivates SAG.
Stochastic Average Gradient (SAG)
The framework for SAG uses gradient descent in the following form, taking the average gradient at each iteration.
where the average is taken over a stored gradient for each data sample.
For SAG, we will update only one of these stored gradients at each step, for a uniformly randomly chosen index, and keep all of the others at their previous values.
To accomplish this, the procedure maintains a table containing the most recently evaluated gradient for each data sample. This table does require additional storage, proportional to the size of the data.
We first initialize the parameters and set a starting value for every entry of the gradient table.
At each step we select an index uniformly at random and update that individual stored gradient, keeping all other values of the table the same.
We then perform the per iteration update.
We now have a new estimator for the gradient, which can be written as the following.
If we define random variables for the newly sampled gradient and for the stored gradient it replaces, noting how each enters the average, the estimator can be rewritten as the following.
As with our previous analysis of estimators, we take the expectation of this gradient estimator.
Our expectation for our estimator is nonzero, so our gradient estimator is biased!
Omitting the details, SAG can be shown to converge under our usual convexity and Lipschitz-continuous-gradient assumptions, for an appropriately chosen constant learning rate and the following initialization.
we achieve the following convergence.
Comparing directly to SGD:
- This is the desired $O(1/k)$ rate (more precisely, the constant depends on the sample size), which is overall an improvement as compared to SGD
- Converges with constant learning rate
- SAG does not require the assumption of conditional boundedness of the variance
- BUT this algorithm is biased
- It's a drawback computationally to need to store a gradient table
SAGA
The SAGA algorithm is only a slightly modified version of SAG, with the only difference being a modification of the gradient estimate itself. We will still maintain our table of gradients, use the same initialization procedure, and pick/replace a random gradient in the same fashion.
However, instead of the SAG gradient estimate, we use the following gradient estimate for SAGA.
So, our per iteration updates are of the following form.
SAGA can be shown to have the same convergence rate, but can use a slightly larger learning rate. It also does not require the bound on the conditional variance. The starkest difference between SAGA and SAG is that SAGA is unbiased. It still has the less-than-desirable need to store a gradient table.
Using the same definitions for the random variables as before, we have the following.
So, the SAGA estimator is an unbiased estimator for the gradient, and it follows that the bias term vanishes for this method.
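The sketch below contrasts the two gradient estimates in code for a least squares setup; the data, learning rate, and zero-initialized gradient table are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 200, 2
A = rng.standard_normal((n, d))
b = A @ np.array([2.0, -3.0]) + 0.1 * rng.standard_normal(n)

def grad_i(x, i):
    """Gradient of the i-th least squares component (a_i^T x - b_i)^2."""
    return 2 * A[i] * (A[i] @ x - b[i])

def run(method, alpha=0.002, iters=5000):
    x = np.zeros(d)
    table = np.zeros((n, d))          # stored gradient for each data sample
    table_mean = table.mean(axis=0)   # running average of the table
    for _ in range(iters):
        i = rng.integers(n)
        g_new, g_old = grad_i(x, i), table[i].copy()
        if method == "SAG":           # biased: the refreshed gradient gets weight 1/n
            estimate = (g_new - g_old) / n + table_mean
        else:                         # SAGA, unbiased: the refreshed gradient gets weight 1
            estimate = (g_new - g_old) + table_mean
        table[i] = g_new
        table_mean += (g_new - g_old) / n   # keep the table average current
        x -= alpha * estimate
    return x

print("SAG :", run("SAG"))
print("SAGA:", run("SAGA"))
```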
SAG & SAGA Variance comparison
Recall the gradient estimates for each.
SAG
SAGA
For two correlated random variables (the newly sampled gradient and its stored counterpart), we have the general form.
Here the scaling constant is 1 for SAGA and one over the sample size for SAG.
Taking the variance, we have the following.
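Written out (a sketch with assumed symbols: $X$ the newly sampled gradient, $Y$ its stored counterpart, and $c$ the scaling constant mentioned above):

$$
\hat{\theta}_c = c\,(X - Y) + \mathbb{E}[Y], \qquad
\mathbb{E}\big[\hat{\theta}_c\big] = c\,\mathbb{E}[X] + (1 - c)\,\mathbb{E}[Y],
$$

$$
\mathrm{Var}\big(\hat{\theta}_c\big) = c^2\,\big[\mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)\big].
$$

With $c = 1$ the estimator is unbiased (SAGA); with $c$ equal to one over the sample size, the variance shrinks dramatically but a bias toward $\mathbb{E}[Y]$ appears (SAG).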
So, we can compare the methods.
SAGA's modified estimate makes the iterations unbiased, but in doing so it increases the variance of the steps. SAG and SAGA trade bias against variance, yet ultimately both converge at the same rate. In terms of MSE, this means neither method has an obvious advantage over the other beyond the larger allowable learning rate of SAGA.
Other Common Variance Reduction Techniques
We have covered some basics of stochastic gradient descent algorithms, as well as some basic approaches to variance reduction, but I also wanted to include some supplemental material for further reading on these topics. If you are interested in learning more about some common techniques, I recommend this paper on variance reduction techniques in machine learning (Gower et al., 2020).
SARAH
SARAH, or StochAstic Recursive grAdient algoritHm, is a stochastic gradient descent method that utilizes recursive gradient information to reduce variance in the steps. The algorithm is a modification of the method SVRG (Stochastic Variance Reduced Gradient), which we will briefly cover, and is widely used as one of the benchmark algorithms for comparison to new gradient descent style methods. As with other methods we have seen until now, some of its core assumptions for convergence require a convex objective function (or strongly-convex) with Lipschitz continuous gradient. SARAH is a good example of an algorithm that is actually biased, but can be an improvement over SVRG if conditions are considered carefully.
Stochastic Variance Reduced Gradient (SVRG)
To understand the SARAH algorithm, we need to first conceptually understand SVRG. First, we present the SVRG algorithm.
The algorithm consists of an outer loop and an inner loop, both with iteration counts specified as hyperparameters: the number of epochs (each "epoch" is one pass over all samples in the dataset) for the outer loop, and the number of inner loop iterations.
Each outer loop iteration performs one full gradient computation at the current outer parameters (hence the outer loop "epochs"). Those parameters, together with the full gradient, are then used to initialize the inner loop updates.
The inner loop then proceeds with a stochastic gradient approach in combination with an iteratively updated gradient estimate, utilizing the outer loop full gradient, the stochastic gradient computed at the most recent parameter update, and the same stochastic gradient evaluated at the outer loop snapshot. The gradient estimate is then used for a normal gradient descent style parameter update.
All parameter updates occur within the inner loop, and the resulting parameters from the inner loop get fed back into the outer loop as the new parameter initialization.
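A compact sketch of this outer/inner loop structure is below; the least squares components and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 200, 2
A = rng.standard_normal((n, d))
b = A @ np.array([1.0, 4.0]) + 0.1 * rng.standard_normal(n)

def grad_i(x, i):
    return 2 * A[i] * (A[i] @ x - b[i])

def full_grad(x):
    return 2 * A.T @ (A @ x - b) / n

def svrg(epochs=10, inner=2 * n, alpha=0.01):
    x_snap = np.zeros(d)
    for _ in range(epochs):                    # outer loop: one full gradient per epoch
        mu = full_grad(x_snap)                 # the "snapshot" full gradient
        x = x_snap.copy()
        for _ in range(inner):                 # inner loop: cheap stochastic steps
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(x_snap, i) + mu   # variance-reduced estimate
            x -= alpha * v
        x_snap = x                             # re-initialize the next outer loop
    return x_snap

print("SVRG estimate:", svrg())
```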
SVRG Gradient Estimator
We want to analyze our gradient estimate, so we take its expectation conditioned on knowledge of the current inner loop iterate.
So, the SVRG gradient estimator for the inner loop updates is unbiased.
SVRG Variance Reduction
The variance reduction for SVRG is accomplished within the inner loops by the convergence of both the inner iterates and the outer snapshot to the same optimal parameters. Omitting the detailed proof (Johnson & Zhang, 2013) (link), the following holds for convex (and strongly convex) objective functions with Lipschitz continuous gradients.
So, as the inner iterates and the snapshot both approach the optimum, the bound on the gradient estimate's variance goes to zero.
This means that the variance of the gradient estimates approaches zero as the parameter updates converge to the optimum, hence the inner loops serve as a variance reduction technique. But in order for the overall variance in the steps to decrease, we must update the snapshot (so that it too approaches the optimum), meaning that we must execute additional full gradient computations (epochs). We will revisit this point in the analysis of the variance reduction of SARAH.
SVRG Convergence
For convex objective functions with Lipschitz continuous gradients, SVRG can be shown to converge with complexity $O(n + \sqrt{n}/\epsilon)$, which is an improvement over SGD's $O(1/\epsilon^2)$. It also has better theoretical convergence than SAG and SAGA, whose complexity is $O(n + n/\epsilon)$ (note that convergence here is expressed in terms of a specified tolerance $\epsilon$ rather than a number of iterations).
For a strongly convex objective function, SVRG converges per the following.
Assume the inner loop size is sufficiently large such that the following holds.
then,
The resulting complexity is comparable to SAG/SAGA. This also gives us criteria for selecting the inner loop size and learning rate.
Unlike SGD, we do not need a decaying learning rate for SVRG to achieve this convergence rate. The decaying learning rate required for SGD is a consequence of the variance in the steps (even when that variance is bounded), which is a good example of how variance can impact the convergence rate; SVRG explicitly reduces the variance in the steps of SGD.
The SARAH Algorithm
Methodology in this section is drawn from SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient (Nguyen, Liu, Scheinberg, and Takáč, 2017) (link).
SARAH is very closely related to SVRG. It utilizes the same outer/inner loop approach, which functionally serves the same purpose: the outer loop computes a full gradient (a "snapshot" of the full gradient), and the inner loop performs SGD style steps with a variance reduction design. The key difference between SVRG and SARAH is in the gradient estimate update step of the inner loop.
SVRG's gradient estimate is designed to use the outer loop "snapshot" to generate an unbiased gradient estimate at each step, while simultaneously reducing the variance that is a consequence of SGD. The "snapshot" is what stabilizes the algorithm in terms of bias (the gradient estimate is unbiased) and allows the variance in the stochastic steps to be reduced relative to the parameters from which the outer loop was initialized.
SARAH's gradient estimate, however, leans more heavily into the variance-reduction aspect of the algorithm by utilizing recursive gradient information to reduce variance, rather than solely using the gradient estimate from the outer loops. This improves the stochastic variance reduction power of the algorithm, but at the expense of introducing bias.
SARAH Gradient Estimator
The gradient estimate (also referred to as the "SARAH update") utilizes a similar framework to SVRG. Recall the SVRG gradient estimate.
The SARAH estimator modifies the gradient estimate by replacing the terms evaluated at the outer loop initialization with a recursive gradient update.
We note that there is a shift in index: we now need the randomly chosen component gradient evaluated at both the most recent parameters and the parameters from the previous step. This also means that we must perform one parameter update in the outer loop before we enter the inner loop.
Each gradient estimate uses every previous gradient estimate from the inner loop for each inner loop update, hence a "recursive" update framework as opposed to each update using the outer loop initialization only.
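Side by side, in assumed notation (inner iterates $w_t$, inner-loop gradient estimate $v_t$, random index $i_t$, and snapshot $\tilde{w}$ from the outer loop), the two estimates are:

$$
\text{SVRG:} \quad v_t = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(\tilde{w}) + \nabla f(\tilde{w}),
$$

$$
\text{SARAH:} \quad v_t = \nabla f_{i_t}(w_t) - \nabla f_{i_t}(w_{t-1}) + v_{t-1}, \qquad v_0 = \nabla f(w_0).
$$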
Let's take the conditional expectation of the gradient estimate given all previous iterates, i.e. conditioned on the sigma-algebra generated by the initial parameters and all subsequently chosen random indices in the inner loop.
The steps are omitted, but the result is notable as it is not an unbiased estimator for the gradient. However, the total expectation holds.
So, an exact full pass over the data for an inner loop is unbiased.
SARAH Variance Reduction
Recall that for SVRG, in order for the overall variance in the steps to shrink, we must execute additional outer loops; the snapshot must be updated via an outer loop computation.
In the analysis of SVRG we showed that stochastic gradients are unbiased estimators for the full gradient, and that the estimator's variance bound shrinks as the iterates approach the optimum. Since SARAH's inner loop gradient estimate only utilizes stochastic gradients, the inner loops will reduce variance without requiring an outer loop, as there are no terms in the estimator that depend upon the outer loop update.
SARAH's variance bound in the strongly convex case can be shown to be less than the variance bound of SVRG for the same objective function.
The variance reduction of the inner loops is at the expense of bias; SARAH reduces variance in the inner loops but now has bias that is itself stochastic. This bias can be somewhat "corrected" for, however, by executing an outer loop to reinitialize the parameters by pointing them in the correct direction. We will examine some intelligent ways to execute outer loops when necessary using SARAH+.
SARAH Convergence
For convex objective functions with Lipschitz continuous gradients, SARAH can be shown to converge with total complexity $O\big((n + 1/\epsilon)\log(1/\epsilon)\big)$. Better conditioned objectives that are strongly convex converge with complexity $O\big((n + \kappa)\log(1/\epsilon)\big)$, where $\kappa$ is the condition number.
Comparing these rates directly to SVRG,
Convexity | SVRG | SARAH |
---|---|---|
Convex | $O(n + \sqrt{n}/\epsilon)$ | $O\big((n + 1/\epsilon)\log(1/\epsilon)\big)$ |
Strongly convex | $O\big((n + \kappa)\log(1/\epsilon)\big)$ | $O\big((n + \kappa)\log(1/\epsilon)\big)$ |
Similar to SVRG, we must select our inner loop size and learning rate such that the following holds.
Recall from above that the variance bound for SARAH in the strongly convex case is smaller than that of SVRG. So, while their convergence complexity is the same under strong convexity, SARAH can use a larger maximum learning rate than SVRG.
We revisit a side by side comparison of SARAH and SVRG, both run with the same number of inner and outer loops with the same hyperparameter settings. The first image is zoomed in to highlight how each algorithm handles variance reduction. SARAH has very little variance, but requires outer loop executions to correct for its inner loop bias (outer loops happen at the "kinks"). SVRG starts moving toward the solution sooner, but has higher variance in its steps until we execute more outer loops, and it smooths as more outer loops are executed (our gradient estimate becomes better). In the same number of iterations, however, they are fairly comparable. This is only a very simple example to illustrate their general behaviors, but reinforces that they are very similar algorithms in nature and performance. So, there are some advantages to SARAH over SVRG, but not necessarily overwhelmingly so.
SARAH+ Algorithm
Nguyen et al. (2017) (link) present this variant of the SARAH algorithm, which provides an automatic and adaptive inner loop size. Numerical experiments in the paper indicate that a particular choice of the stopping threshold shows benefits in convergence rate, but the authors do not elaborate on the theory.
This modification introduces an automated methodology for breaking the inner loop such that inner loop size can be adaptive. It also serves as automation to correct for bias in the inner loops with the execution of a full outer loop gradient.
Applications of SARAH
In practice, SARAH can be a direct substitute for SVRG with only some updated parameter tuning considerations, so it can be utilized as an SVRG replacement in a very large variety of applications. However, it does not seem to be used as widely in practice as SVRG. While speculation as to why is not worthwhile, it is noteworthy that SARAH's variance reduction technique utilizes a biased methodology, where SVRG is unbiased. SVRG is a widely utilized benchmark algorithm, and many machine learning packages have SVRG built in; the same cannot be said for SARAH, and while it does have theoretical advantages, it does not seem to have penetrated the infrastructure of optimization algorithms in practice. Still, there is a fair amount of research available on further modifications of SARAH.
ADAM
Methodology in this section is drawn from Adam: A Method for Stochastic Optimization (Kingma & Ba, 2017) (link).
ADAM is a widely popular optimizer in machine learning. It is a staple for deep learning, neural networks, natural language processing, and generative adversarial networks applications, to name a few. ADAM has some unique strengths against other optimizers; it is fast, requires little hyperparameter tuning, performs well with sparse gradients, and can be applied to stochastic objective functions. It is also an adaptive algorithm that uses adaptive step sizes according to data it has seen to speed up convergence.
The ADAM optimizer is a stochastic/batch gradient descent variant that utilizes two techniques leveraging previous gradient information (conceptually adjacent to the recursion we saw in SARAH): Momentum and RMSProp (Root Mean Squared Propagation). Momentum acts to dampen oscillations by using moving averages of previous gradients to decrease the variance in the steps. RMSProp adaptively scales the learning rate based upon an exponentially weighted average of the magnitude of recent gradients, so the algorithm can "speed up" and "slow down" such that it does not diverge from the local data. Additionally, ADAM can utilize batch gradients rather than only individual stochastic gradients, which acts as variance reduction by increasing the sample size.
The Algorithm
NOTE
Squaring a gradient in the algorithm is equivalent to element-wise multiplication of the gradient with itself.
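Below is a minimal sketch of the update rule as described in Kingma & Ba (2017), written against a generic stochastic gradient function; the default hyperparameters follow the paper, while the toy problem and its settings are assumptions for illustration.

```python
import numpy as np

def adam(grad_fn, x0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
    """ADAM: exponentially weighted first/second moment estimates with bias correction."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)   # first moment (momentum) estimate
    v = np.zeros_like(x)   # second moment (RMSProp-style) estimate
    for t in range(1, iters + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # update biased first moment
        v = beta2 * v + (1 - beta2) * g * g    # update biased second moment (element-wise square)
        m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
        x -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Toy usage: minibatch least squares gradients (assumed problem)
rng = np.random.default_rng(3)
A = rng.standard_normal((500, 2))
b = A @ np.array([2.0, -1.0]) + 0.1 * rng.standard_normal(500)

def grad_fn(x, batch=32):
    idx = rng.choice(len(b), size=batch, replace=False)
    return 2 * A[idx].T @ (A[idx] @ x - b[idx]) / batch

print(adam(grad_fn, np.zeros(2), alpha=0.01, iters=2000))
```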
Variance Reduction
Momentum
Polyak's momentum is a technique that uses both the current gradient and the history of gradients from previous steps to accelerate convergence. The momentum term is incorporated into gradient descent style algorithms in general and uses a hyperparameter to scale its weight.
If we expand this term,
If these recent gradients are similar, the algorithm gets an additional "boost", or "momentum". Earlier momentum terms can be expanded similarly, and we get higher order terms as the algorithm iterates. This favors the most recent gradients, since the momentum hyperparameter is less than one and its higher powers quickly dampen out previous iterates.
We see the "boost" in the animation for the ADAM algorithm compared to other stochastic methods, moving more rapidly toward the solution.
In ADAM, the momentum term is written as follows.
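A sketch of this first moment update, in the notation of Kingma & Ba (2017), with the gradient at step $t$ written $g_t$ and decay hyperparameter $\beta_1$:

$$
m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t, \qquad m_0 = 0.
$$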
Momentum helps to accelerate convergence, particularly over the contours of objective functions, while also acting as variance reduction in the steps. The technical details of variance reduction in stochastic methods are a little more involved and are covered in (Arnold et al., 2019) (link). For an in-depth and beautifully interactive blog about momentum, I also suggest this post.
Minibatches
Earlier in this post we covered the variance reduction properties of using stochastic minibatches for variance reduction in the steps. ADAM utilizes minibatches, and the batch size is a tuneable hyperparameter. This gives us the ability to use better gradient approximations for our steps and balance convergence with computation time.
RMSProp
Root mean squared propagation (RMSProp) is a technique that uses a moving average of the magnitude of recent squared gradients to normalize the gradient estimate in the descent algorithm. Its implementation in descent algorithms makes the learning rate adaptive to the data, rather than treating it as a constant tuneable hyperparameter. In essence, the learning rate will speed up or slow down depending on recent gradients.
The RMSProp update is similar to momentum, using the squared gradient term.
And the update rule in a gradient descent algorithm is written as follows.
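A standard form of these updates (assumed symbols: squared gradients taken element-wise, decay parameter $\beta_2$, learning rate $\alpha$, and a small stabilizing constant $\epsilon$):

$$
v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2, \qquad
x_{t+1} = x_t - \frac{\alpha}{\sqrt{v_t} + \epsilon}\, g_t.
$$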
The small constant in the denominator only serves to stabilize the algorithm as the squared-gradient average becomes small (the steps would become large if the denominator were near zero), preventing overshooting close to the optimum.
Bias and Bias Correction
The bias-correction steps in ADAM are enhancements to the momentum and RMSProp estimates (which are themselves biased) and can be credited with much of its advantageous performance on sparse data. ADAM is biased toward its initialization, which in general should be chosen to be the zero vector.
We will follow the derivation of the bias terms as in Kingma & Ba (2017)
Our momentum term, the first moment estimate, is meant to estimate the gradient, so we want to find the expectation of the first moment estimate.
Let's first consider running ADAM for some number of timesteps, with one gradient observed at each timestep. We assume the gradients are drawn from a fixed underlying distribution. At a given timestep, the first moment estimate is the following.
We initialize the first moment estimate to zero, and expanding this term by substituting in each previous timestep, we get
Now, we can take the expectation of this term.
We can rewrite the expectation of each past gradient by adding and subtracting the expectation of the current gradient, giving the following.
Recent gradients will have expectations close to the current one, and gradients further in the past are quickly dampened by the exponentially decaying weights, so the resulting correction term can be treated as small. So, our expectation can be written as the following.
Expanding the sum and cancelling terms results in the following.
The calculation steps for the second moment estimate follow identical logic, and we get an analogous expectation for it.
Clearly, the first moment estimate is a biased estimator for the gradient's expectation, and the second moment estimate is likewise biased for the expectation of the squared gradient.
We can approximately correct for the bias in each term, given that the moments are initialized to zero, by dividing each estimate by one minus the corresponding decay parameter raised to the power of the timestep, giving approximately unbiased first and second moment estimates.
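In symbols (following Kingma & Ba, 2017), the expectations and the corrected estimates are approximately:

$$
\mathbb{E}[m_t] \approx \mathbb{E}[g_t]\,(1 - \beta_1^t), \qquad
\mathbb{E}[v_t] \approx \mathbb{E}[g_t^2]\,(1 - \beta_2^t),
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.
$$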
ADAM Convergence
ADAM requires only a basic convexity assumption on the objective function, and a similar but more relaxed assumption related to Lipschitz continuity called smoothness (Wang et al., 2022) (link). Forgoing the details, ADAM converges at a rate of $O(1/\sqrt{k})$ (the detailed proof of ADAM's convergence can be found in the original paper).
Recall that this is the same theoretical convergence rate as SGD; however, ADAM is able to use a fixed learning rate rather than a diminishing step size. While its rate is less desirable than other algorithms we have seen, it is still a widely desirable algorithm for its performance in practice.
Applications of ADAM
ADAM is a very widely used optimizer, particularly for neural network, deep learning, and natural language processing applications, among others. One of its advantages is how well it handles sparse data, which is commonly characteristic of these types of tasks. While it is not the fastest method in terms of theoretical convergence, the methodologies it utilizes balance stochastic variance while also addressing inherent estimator bias, making it a trustworthy workhorse.
There is existing infrastructure for ADAM among some common machine learning tools like Keras, scikit-learn, PyTorch, and TensorFlow, to name some prolific packages in industry and research applications, so it has been embraced by the community as an effective optimizer. Overall it is a robust optimizer with reliable performance for convex (as well as nonconvex, He et al. (2023)) objectives.
Example Numerical Experiments
NOTE
This section was lifted directly from my coursework. Note that this section uses a different symbol for the model parameter vector than we have been using previously (different choices may be seen in practice).
The purpose of the numerical experiment is to evaluate the general performance of ADAM and SARAH on real data and to illustrate a bias-variance tradeoff between SVRG and SARAH. We use the "Wine Quality" dataset from the UC Irvine Machine Learning Repository (Cortez et al., 2009). The data are measurements of wine characteristics (features) used to predict wine quality scores, with one instance per wine sample. The experiments use a simple least squares linear regression objective, assuming a linear system relating the data matrix and the response vector.
The least squares objective was chosen because it is strongly convex. The reference optimum was computed separately using scipy.optimize.lsq_linear with a tight tolerance.
See Figures 1 and 2 for settings and computation time. The two examples chosen illustrate some high level takeaways. In terms of per-iteration computational performance, both ADAM and SARAH are virtually identical to SVRG, though ADAM can become slightly slower for increased batch sizes. SARAH and SVRG are limited in their flexibility for the learning rate on this data, since the Lipschitz constant is large and the learning rate must scale inversely with it. Both are descending, but are moving extremely slowly toward the optimum, and the larger learning rate for SARAH hardly presents an advantage. We can see the benefit of ADAM's adaptive learning rate in both Exp. A and B; it moves toward the optimum much faster and continues to descend as the iterations increase.
Comparing SARAH and SVRG, we run the inner loop for half an epoch in Exp. A and for 30 epochs in Exp. B. The parameter choice in Exp. B adheres to SARAH's variance bound requirement. Note that for this data, the number of iterations required for accurate convergence with SARAH, for a reasonable choice of learning rate, was prohibitive. In both Exp. A and B, the loss trajectory is clearly smoother for SARAH than for SVRG; however, we observe that the number of effective passes over the data in the inner loop can matter. In both experiments' plots, SARAH strays away from the optimal solution due to bias after around 4 epochs. It is ultimately corrected by the second outer loop step, but we can see how the biased steps of the inner loop can cause SARAH to drift, whereas SVRG continues to descend. As is the reality of statistics, we are trading variance for bias.
Supplemental Resources
- Gradient descent, how neural networks learn | Chapter 2, Deep learning
- 3Blue1Brown video covering the basics of gradient descent in deep learning out of his Deep Learning series. Of course, I strongly recommend any of his excellent videos.
- Stochastic Gradient Descent, Clearly Explained!!!
- A fun StatQuest video about stochastic gradient descent.
- Adam Optimizer Explained in Detail | Deep Learning - Coding Lane
- A good simplified breakdown of ADAM optimization
- Goh, "Why Momentum Really Works", Distill, 2017. http://doi.org/10.23915/distill.00006
- A really nice (deep dive) post about Momentum, a stochastic variance reduction technique
- Lipschitz Functions: Intro and Simple Explanation for Usefulness in Machine Learning
- Learn about Lipschitz continuous functions and their uses in machine learning
- L20.4 On the Mean Squared Error of an Estimator
- MIT Open Courseware video about MSE
- Notes on "Big O" notation
- "Big O" is standard notation to describe convergence for machine learning algorithms
Bibliography
Reproducibility
All code for diagrams and accompanying paper is available here.
Appendix
Some supplemental information for the topics we have covered.
Convexity
Convex Function
A function $f$ is convex if it satisfies the following.
- $f$ is convex if and only if for all $x, y$ and $\lambda \in [0, 1]$, $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y)$
- $f$ is strictly convex if and only if the inequality above is strict for all $x \ne y$ and $\lambda \in (0, 1)$
- $f$ is $\mu$-strongly convex if and only if there exists a $\mu > 0$ such that for all $x, y$ and $\lambda \in [0, 1]$, $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y) - \tfrac{\mu}{2}\lambda(1-\lambda)\lVert x - y \rVert^2$
- If the objective function is convex, then any local minimum is a global minimum.
- If the objective function is strictly convex, then the minimum is unique.
- If the objective function is strongly convex, then the minimum is unique.
Return to
Lipschitz Continuity
For the following convex optimization problem,
- $f$ has a Lipschitz continuous gradient if there exists an $L > 0$ such that $\lVert \nabla f(x) - \nabla f(y) \rVert \le L \lVert x - y \rVert$ for all $x, y$
- If $f$ is twice differentiable with $\lVert \nabla^2 f(x) \rVert \le L$ for all $x$ and some $L > 0$, then $f$ has a Lipschitz continuous gradient with constant $L$.
Return to
Consistency
Informally, a consistent estimator just means that if we use more data to estimate a parameter, it is guaranteed to converge to the true value of the parameter.
Gradient Calculation
For simplicity, we are using the least squares loss objective function.
Recall that our data has a fixed number of features and total samples, so our data matrix has one row per sample and one column per feature. Our labels (the "response" vector) have a corresponding entry for each data sample. Lastly, our parameter vector has one parameter for each of the features.
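A combined sketch of the three calculations detailed below, assuming the averaged least squares form $f(x) = \tfrac{1}{n}\sum_{i=1}^{n} (a_i^\top x - b_i)^2$ with $a_i^\top$ the $i$-th row of $A$ (the $1/n$ scaling is an assumed convention; other normalizations only change constants):

$$
\nabla f(x) = \frac{2}{n} A^\top (Ax - b), \qquad
\nabla f_i(x) = 2\, a_i\,(a_i^\top x - b_i), \qquad
\nabla f_B(x) = \frac{1}{\lvert B \rvert} \sum_{i \in B} \nabla f_i(x),
$$

where $B$ is the sampled batch of indices.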
Full gradient calculation
Return to
Stochastic gradient calculation
Select a random index uniformly from the sample indices.
Compute the stochastic gradient (approximation of full gradient).
Return to
Batch gradient calculation
We first select a random subset of indices, smaller than the full sample size. In this example we select a small handful of indices.
We compute the batch gradient.
Which results in the average of the selected gradients.
Return to