Machine Learning Optimization - Stochastic Descent, Variance Reduction Techniques, and the Bias-Variance Tradeoff

Introduction

The intent of this post is to explore some examples of variance reduction techniques in gradient descent optimization methods for machine learning, with some focus on their underlying statistical concepts. We will use the ADAM optimizer, a popular algorithm in neural network optimization and deep learning, and the less prolific SARAH optimizer (a variant of SVRG) as examples of algorithms in practice, with additional background on the methodologies from which these stochastic optimizers are derived including SGD, SAG, SAGA, and SVRG. In addition to exploring the techniques themselves, we will lend some focused attention to some bias-variance tradeoffs unique to variance reduction techniques, which are unrelated to the typical overfitting/overparameterization bias-variance tradeoff traditionally examined in machine learning. There are many resources available online for all of these topics, and I will provide some links to my favorites, but my aim is to cover the basic underlying concepts needed to build some more common and contemporary algorithms.

This post will cover variance reduction techniques with some technical statistical details, and assumes the reader has a basic understanding of machine learning principles and gradient descent (here's a 3Blue1Brown video if you need a refresher, but I also dedicated a section of this post to my own refresher). But, my intent is to make these topics accessible for an audience looking for some enhanced understanding of techniques in machine learning optimization. I will include links and additional details wherever I feel it's warranted, but I am open to any feedback you may have.

Information for this post is largely drawn from my final project for an optimization course I took in the Spring of 2024, for which the paper and presentation are available below.

Please refer to these documents for additional formal details; the post may reference portions of them for clarity.

Background

Machine Learning Optimization and Stochastic Optimizers

Optimization as a field has become critical to recent progress in machine learning as machine learning expands into ever greater swathes of data. We are applying machine learning to increasingly large datasets, and to do so effectively we need model optimization techniques that are both fast and guarantee convergence. Stochastic optimizers are a class of algorithms that have become prevalent in recent years because they pair desirable theoretical convergence properties with the efficiency of stochastic computation. Being stochastic methods, they require special techniques to address and control the variance in their computations in order to realize any convergence advantage over other common methods.

When optimizing machine learning model parameters, first-order gradient descent algorithms utilizing stochastic gradients (individual gradients selected uniformly at random) or minibatch gradients (subsets of the full gradient whose elements are selected uniformly at random) have become core staples in machine learning frameworks due to their computational efficiency on datasets of both large sample size and high dimensionality. However, the randomness (variance) that stochasticity introduces into the steps of these gradient descent methods tends to slow down or inhibit convergence, so it is in our interest to implement techniques to curb this variance.

stochastics

This is a visual example of the topics and algorithms we intend to cover. The graph depicts model parameters for a two-parameter model (training to correlated noise) as they iteratively update toward the true parameters under stochastic gradient descent (SGD, yellow line) and several variants that each employ different techniques to reduce variance in the parameter updates. Variance, as we will discuss in this post, is the quantity describing the magnitude of the randomness in the algorithms' steps toward the true solution. SGD clearly has the highest variance, and the variance reduction variants take visibly steadier steps.

This example uses some contrived data, but the settings for each algorithm are approximately analogous (we'll see what "approximately" means, as the algorithms differ in methodology), and they are running for the same number of iterations. We can see that the variants of SGD progress closer to the solution, each with qualitatively smaller variance than SGD, which encapsulates our general motivation for utilizing variance reduction techniques in stochastic gradient descent style algorithms. If we reduce variance in the steps, we can be more efficient about converging to the true solution. We will dive into some of these techniques and their underlying theory to better understand how we can balance variance reduction with convergence properties, computational efficiency, and the unintended bias that can be a byproduct of these methodologies.

Crash Course: Gradient Descent

This section is some entry-level background to machine learning concepts, so please skip if you already have a grasp.

The goal in machine learning optimization is to train model parameters for a machine learning model we choose to use, which will ultimately depend on the application. We will use the simple but very commonly used linear model as an example.

Example: Least Squares Regression

We are interested in training a model using $n$ data samples with $p$ features and known outputs for each data sample. The model we will use as an example is a simple linear system.

$$\begin{aligned} A\boldsymbol{\theta}=\boldsymbol{b} \end{aligned}$$

Matrix $A \in \mathbb{R}^{n \times p}$ represents our data, and $\boldsymbol{b} \in \mathbb{R}^{n \times 1}$ is the response vector with entries $b_i$ corresponding to the known output of each row of data $a_i \in A$, $i \in 1,...,n$. Our model uses a vector of parameters $\theta_j \in \boldsymbol{\theta}$, $j \in 1,...,p$, where $\boldsymbol{\theta} \in \mathbb{R}^{p \times 1}$. This is a linear model, where the product $A\boldsymbol{\theta}$ should ideally produce exactly $\boldsymbol{b}$.

In a typical application, however, this system is overdetermined ($n>p$, i.e. more data samples than there are features that describe the data; column rank$(A)=p$) and inconsistent ($A\boldsymbol{\theta}=\boldsymbol{b}$ has no solution).

Since we cannot find an exact solution to the model, we want to optimize each of our parameters $\theta_j$ such that the product $A\boldsymbol{\theta}$ produces the best approximate solution to the system. In other words, we want to train the model to predict the known outputs as well as it can by adjusting each of the $\theta_j \in \boldsymbol{\theta}$.

To do so, we aim to minimize a quantity we call the "loss" of the model; a measure of error between the predictions $A\boldsymbol{\theta}$ and the actual label values $\boldsymbol{b}$ as a function of how the $\theta_j \in \boldsymbol{\theta}$ are selected (or in our case, how we optimize them).

"Loss" will be quantified as a function of the θjθ\theta_j \in \boldsymbol{\theta} expressed as f(θ)f(\boldsymbol{\theta}), also sometimes called the "objective function". So, the "loss" will depend on the parameters θ\boldsymbol{\theta} that we choose.

A common loss function $f$ that we see in machine learning is a simple least-squares regression fit, where $f(\boldsymbol{\theta})$ is the following.

$$\begin{aligned} f(\boldsymbol{\theta}) = ||A\boldsymbol{\theta}-\boldsymbol{b}||^{2} \end{aligned}$$

Here, $A\boldsymbol{\theta}$ generates a vector of predictions, and the loss function $f$ calculates the squared Euclidean distance between the prediction vector $A\boldsymbol{\theta}$ and the true label vector $\boldsymbol{b}$.

Our optimized model with our trained parameters satisfies the following.

$$\begin{aligned} \boldsymbol{\theta}^{*}\in\underset{\boldsymbol{\theta}}{\text{argmin}}\,||A\boldsymbol{\theta}-\boldsymbol{b}||^{2} \end{aligned}$$

Gradient Descent in Machine Learning

The gradient descent algorithm is ideal for optimization of this kind of problem. It is an iterative algorithm that is designed to update parameters by changing each parameter by some amount, in such a way that the "loss" decreases.

We must note that the least squares objective function $f$ is convex, from which it follows that $f$ must have a global minimum (it must reach a "smallest" value). Not all problems in machine learning utilize convex loss functions, but many common applications do in practice.

NOTE

This post will assume, unless otherwise stated, that the objective function $f$ is convex (i.e. we will be covering convex optimization).

Common examples of convex objective functions include logistic regression, data fitting, SVMs, Lasso and Group Lasso. Convexity is detailed more formally in the appendix.

In basic terms, the gradient descent algorithm uses the gradient of the loss function $f(\boldsymbol{\theta})$ (denoted $\nabla f(\boldsymbol{\theta})$), the vector of partial derivatives $\frac{\partial f}{\partial \theta_j}$ evaluated at the current parameter values, to determine the "steepest" direction in which the loss is increasing. It's worth taking a look in the appendix at the details of the gradient operation if you are not familiar.

The negative gradient indicates the steepest direction in which our loss function $f(\boldsymbol{\theta})$ is decreasing, and if we move the values of the $\theta_j$s in that direction, we will decrease the loss $f(\boldsymbol{\theta})$. Ultimately, we intend to optimize $f(\boldsymbol{\theta})$ by minimizing it, so we hope to adjust the $\theta_j$s until our loss is minimized ("minimized" is usually defined by criteria we select).

So, if we start with some initial parameters $\boldsymbol{\theta_0}$, we can subtract $\nabla f(\boldsymbol{\theta_0})$ to "step" the parameter tuning in the direction that will minimize the loss function. However, we need to control the size of the steps that we take, so we include a parameter $\alpha$ that we tune on our own (called a "hyperparameter") to scale $\nabla f$ and control the size of the step we take in that direction. The parameter $\alpha$ is commonly referred to as the "learning rate".

Since we likely will not arrive at the solution after one step, we can use the resulting weights $\boldsymbol{\theta_1}$ and evaluate $\nabla f(\boldsymbol{\theta_1})$ to take another step. Continuing onward by recursively plugging parameter values back into the update, after $t$ steps we have the gradient descent algorithm.

$$\begin{aligned} \theta_{t+1}=\theta_{t}-\alpha \nabla f(\theta_{t}) \end{aligned}$$

The algorithm iterates over discrete timesteps $t$, stepping toward the minimum of $f(\theta_t)$ proportionally to the learning rate $\alpha$.

Now, picking up from the previous section, we will use a simple least squares regression on some data we generate as an example. We will generate 100 random points $x_i\in X$, $i \in 1,...,100$ where $X \sim \text{Unif}(0,2)$, and assign labels $y_i = 2x_i + 4 + z_i$, where $z_i \in Z \sim N(0,1)$ is a standard Gaussian noise term that makes our $y_i$ values noisy. We only want to fit a first-order polynomial (a line) to the data in this example, so the only two "features" our parameters fit are slope and intercept, $\boldsymbol\theta = \left[\begin{array}{c} \theta_{slope}\\ \theta_{intercept} \end{array}\right]$.

We then initialize a slope and intercept $\boldsymbol{\theta_0} = \left[\begin{array}{c} \theta_{slope,0}\\ \theta_{intercept,0} \end{array}\right]$ (in this example just $\left[\begin{array}{c} 0\\ 0 \end{array}\right]$), select a learning rate $\alpha=0.1$ (we will discuss how to more carefully select this hyperparameter later), and perform the gradient descent algorithm for $t=100$ steps (iterations). Below we animate the process, where the top plot is the line being fitted, the middle plot is the value of our loss gradually minimizing, and the last plot is the values for the slope and intercept.

GD Least Squares fit

Since we know how we generated the data, we know that the true optimum is $\boldsymbol\theta = \left[\begin{array}{c} 2\\ 4 \end{array}\right]$, which we can see visually in the first plot.
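For concreteness, here is a minimal sketch of this example in Python/NumPy. It is an illustrative reconstruction rather than the exact code behind the plots: the data generation uses the slope of 2 implied by the stated optimum, and the loss is averaged over the samples so the gradient matches the $\frac{1}{n}$ convention used later in the post.

```python
import numpy as np

# Generate the toy data described above: x ~ Unif(0, 2), y = 2x + 4 + noise.
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 2, n)
y = 2 * x + 4 + rng.normal(0, 1, n)

# Design matrix A: one column for the slope feature, one column of ones for the intercept.
A = np.column_stack([x, np.ones(n)])
b = y

def full_gradient(theta):
    """Gradient of f(theta) = (1/n) * ||A theta - b||^2."""
    return (2 / n) * A.T @ (A @ theta - b)

theta = np.zeros(2)        # theta_0 = [0, 0]
alpha = 0.1                # learning rate
for t in range(100):       # T = 100 full-gradient steps
    theta = theta - alpha * full_gradient(theta)

print(theta)               # approaches the true parameters [2, 4]
```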

There are a few observations we can note here.

  • The parameters are converging to the optimal solution, taking large steps early on but slowing down as they approach the optimum.
  • Loss also falls steeply initially, but after around 10 steps it levels off around 1 (the variance of the noise in the data itself).
  • 100 steps is not enough to get all the way to the true optimal solution.

In short, we see the parameters converge toward the optimal solution as the algorithm iterates, but it doesn't get us quite to the exact solution underlying the noise. When training a machine learning model, this isn't necessarily a bad thing. In fact, models that are not "overfit" to training data, i.e. fit "too closely" to the training data, often generalize better to make predictions for unseen data than models that match the training data too closely. We may only care that the model weights we use are within some guaranteed tolerance of the true solution (an example is included in the third plot), which we typically express as convergence rates in terms of a tolerance $\epsilon$. We will see $\epsilon$-accuracy convergence properties, or convergence properties expressed in terms of total iterations $T$, for the algorithms we cover in this post.

Limitations of Regular Gradient Descent

In a theoretical sense, regular gradient descent is robust, guaranteed to converge for convex objective functions, and will guarantee we are within a tolerance of the true solution (or we will know how close we are after some number of iterations). However, in practice it has limitations.

Regular gradient descent works well for a very simple example like this (small sample size, only two features), but in practical machine learning applications with large datasets, it has glaring computational deficiencies.

We turn our attention to the gradient calculation itself, and consider a gradient calculation at timestep $t$, $\nabla f(\theta_t)$ for $\theta_t \in \mathbb{R}^{p \times 1}$, for a dataset with $p$ features and $n$ samples. At each timestep, we are computing the full gradient. It is helpful to write out what the gradient calculation really looks like for our least squares loss example.

$$\begin{aligned} \nabla f (\boldsymbol{\theta_t}) = \frac{1}{n}\sum_{i=1}^{n}{\nabla f_i(\boldsymbol{\theta_t})} \end{aligned}$$

$$\begin{aligned} \nabla f_i (\boldsymbol{\theta_t}) = \frac{d}{d\boldsymbol{\theta}} \left( ||a_i\boldsymbol{\theta_t}-b_i||^{2} \right) \end{aligned}$$

See appendix for more details on full and stochastic gradient calculation.

In terms of complexity, each step requires $p \cdot n$ calculations. It can be shown that since $f$ is convex and $\nabla f$ is Lipschitz continuous, for $\alpha$ selected carefully using what is called the Lipschitz constant $L$ (you can learn more about Lipschitz continuity here or in the appendix), after $T$ iterations the following will hold.

$$\begin{aligned} f(\theta_T)-f(\theta^*) \leq \frac{L}{2T} ||\theta_0 - \theta^*||^2 \text{ for } T \geq 1 \end{aligned}$$

Where $\theta^*$ is the true optimal solution ($\boldsymbol\theta^* = \left[\begin{array}{c} 2\\ 4 \end{array}\right]$ from our earlier example), and $\theta_0$ is our initialization ($\boldsymbol{\theta_0} = \left[\begin{array}{c} \theta_{slope,0}\\ \theta_{intercept,0} \end{array}\right] = \left[\begin{array}{c} 0\\ 0 \end{array}\right]$, also from our earlier example).

We say that this converges at a rate $\mathcal{O}(\frac{1}{T})$. In other words, we can choose the number of gradient descent iterations $T$ and know with certainty that the upper bound on the loss gap between the current parameters and the optimal parameters is proportional to $\frac{1}{T}$ (more on "big O notation" here).

This means that we can guarantee we are within some tolerance of the true optimal solution after $T$ iterations. But, if we are calculating $\nabla f (\boldsymbol{\theta_t})$ at each of those $T$ iterations and our dataset is large, it can become computationally prohibitive to get a solution within a desirable tolerance.

In machine learning optimization, we are always seeking faster ways to train our models. When approaching an optimization problem we might ask ourselves questions like,

  • How can I optimize my parameters (train the model) in the fewest steps?
  • How can I improve the computational cost of each step?
  • When can I terminate the algorithm, and what are my criteria for stopping?

These are the kinds of questions that might motivate a search for more efficient algorithms. A common approach to address such motives is to use approximations, for which stochasticity has some desirable theoretical properties to help us.

Stochastic Gradient Descent

Stochastic methods are common in many areas of computational mathematics and are desirable for improving computational performance by leveraging random operations. A stochastic approach to gradient descent is a natural extension; we want to speed up an inherently slow but robust method for training our machine learning models.

The core concept of stochastic gradient descent is straightforward -- rather than performing a full gradient calculation at each timestep of gradient descent, we can sample individual gradients $\nabla f_i$ (where $i$ is selected uniformly at random from $1,...,n$) in lieu of computing the full gradient $\nabla f$. We use a randomly sampled $\nabla f_i$ at each step as a general approximation for the full gradient $\nabla f$ in the descent algorithm to step toward the solution at significantly reduced computational cost, with the concession that the step is an approximation.

See appendix for more details on stochastic gradient calculation.

So, we would like to show that this randomly sampled gradient is a valid estimator of the true gradient.

SGD Algorithm

For stochastic descent, we want to consider the following problem, minimizing an average of functions, where $f$ is our objective function, a function of our model parameters $\boldsymbol{\theta}$.

$$\begin{aligned} \underset{\boldsymbol{\theta}}{\text{min}}\ f(\boldsymbol{\theta})\overset{\Delta}{=}\frac{1}{n}\sum_{i=1}^{n}{f_{i}(\boldsymbol{\theta})} \end{aligned}$$

Taking the gradient of this objective function $f$ gives

$$\begin{aligned} \nabla f(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^{n}{\nabla f_{i}(\boldsymbol{\theta})} \end{aligned}$$

Stochastic gradient descent (SGD) is defined as the following.

$$\begin{aligned} \boldsymbol{\theta}_{k+1}=\boldsymbol{\theta}_{k}-\alpha_{k} \nabla f_{i_{k}}(\boldsymbol{\theta}_{k}) \end{aligned}$$

where $i_{k} \in \{1,...,n\}$ is an index chosen uniformly at random at iteration $k$.

NOTE

SGD cannot converge with a fixed step size; instead it must use a diminishing step size that satisfies $\underset{k \rightarrow \infty}{\lim}\,\alpha_{k} = 0$ and $\overset{\infty}{\underset{k=1}{\sum}}{\alpha_{k}} = \infty$. Hence the learning rate is denoted $\alpha_{k}$.
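As a rough illustration, here is a minimal SGD sketch for the least squares example above; the schedule $\alpha_k = \alpha_0/\sqrt{k+1}$ is just one assumed choice that satisfies the conditions in the note.

```python
import numpy as np

def sgd(A, b, alpha0=0.1, iters=1000, seed=0):
    """Minimal SGD sketch for the loss f(theta) = (1/n) * ||A theta - b||^2."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    theta = np.zeros(p)
    for k in range(iters):
        i = rng.integers(n)                          # i_k chosen uniformly at random
        g_i = 2 * A[i] * (A[i] @ theta - b[i])       # stochastic gradient of f_{i_k} at theta_k
        alpha_k = alpha0 / np.sqrt(k + 1)            # diminishing learning rate
        theta = theta - alpha_k * g_i
    return theta
```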

We want to determine whether $\nabla f_{i_{k}}(\boldsymbol{\theta}_{k})$ is a valid estimator for the true gradient $\nabla f(\boldsymbol{\theta}_{k})$, so we are interested in its expectation. Since $i_k$ is selected uniformly at random, the following expectation holds.

$$\begin{aligned} \mathbb{E}[\nabla f_{i_{k}}(\boldsymbol{\theta}_{k})] = \sum_{i=1}^{n}{P(i_{k}=i) \nabla f_{i}(\boldsymbol{\theta}_{k})} = \frac{1}{n} \sum_{i=1}^{n}{ \nabla f_{i}(\boldsymbol{\theta}_{k})} = \nabla f(\boldsymbol{\theta}_{k}) \end{aligned}$$

So, an individual gradient is an unbiased estimator for the full gradient, a fundamental principle to stochastic gradient descent.

This presents a great computational advantage -- since we are selecting an individual gradient at random, $\nabla f_{i_{k}}$, $i_{k} \in \{1,...,n\}$, for each iteration of SGD, it doesn't matter what the sample size of our data is. The sample size $n$ could be very large and an SGD iteration will "cost" the same.

SGD
Compare

On the left, we see the SGD optimizer stepping the parameters toward the solution in some obviously approximate ways. On the right, we also overlay other algorithms, which we will revisit later, that use variance reduction techniques for comparison.

Since we are now taking steps in our descent algorithm using approximations of the gradient, we are going to be taking steps only approximately in the direction of the optimum.

SGD Convergence and Variance

The inexactness of the steps that we have now introduced to gradient descent is variance. Variance in our algorithm's steps can in fact slow the algorithm down in its own right (we are only approximately stepping in the right direction at each step), so we must give additional consideration to how we approach the development of SGD algorithms. First, we must be able to quantify the variance.

An analysis of convergence of SGD reveals how we quantify the variance in the algorithm's steps. The SGD algorithm can be rewritten as the following.

$$\begin{aligned} \boldsymbol{\theta}_{k+1}=\boldsymbol{\theta}_{k}-\alpha_{k} [\nabla f(\boldsymbol{\theta}_{k})-\underset{w_{k}}{\underbrace{(\nabla f(\boldsymbol{\theta}_{k})-\nabla f_{i_{k}}(\boldsymbol{\theta}_{k}))}}] \end{aligned}$$

We assume $f$ is convex and that $\nabla f$ is Lipschitz continuous. First, we define $\boldsymbol{w_{k}} \overset{\Delta}{=} \nabla f(\boldsymbol{\theta_{k}}) - \nabla f_{i_{k}}(\boldsymbol{\theta_{k}})$. $\boldsymbol{w_{k}}$ quantifies the error between the full gradient and a stochastic gradient at iteration $k$.

Next, we take the conditional expectation of $\boldsymbol{w_{k}}$ given the parameters $\boldsymbol{\theta_{k}}$,

$$\begin{aligned} \mathbb{E} [\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}] = \mathbb{E} [ \nabla f(\boldsymbol{\theta_{k}}) - \nabla f_{i_{k}}(\boldsymbol{\theta_{k}}) | \boldsymbol{\theta_{k}}] = \nabla f(\boldsymbol{\theta_{k}}) - \underset{\nabla f (\boldsymbol \theta_{k})}{\underbrace{\mathbb{E} [\nabla f_{i_{k}}(\boldsymbol{\theta_{k}}) | \boldsymbol{\theta_{k}}]}} = 0 \end{aligned}$$

This establishes that $\boldsymbol{w_{k}}$ is conditionally unbiased. Now we take the conditional variance and use the conditional unbiasedness for the following.

$$\begin{aligned} \text{Var}(\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}) = \mathbb{E} [\boldsymbol{w_{k}} \boldsymbol{w_{k}}^T | \boldsymbol{\theta_{k}}] - \underset{=0}{\underbrace{\mathbb{E} [\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}]^2 }} \end{aligned}$$

We can now use the Cauchy-Schwarz Inequality to show

$$\begin{aligned} || \text{Var}(\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}) || \leq \mathbb{E} [||\boldsymbol{w_{k}}||^2 | \boldsymbol{\theta_{k}} ] \leq \sigma^2 \end{aligned}$$

So, the conditional variance is controlled by $\mathbb{E} [||\boldsymbol{w_{k}}||^2 | \boldsymbol{\theta_{k}}]$, which we assume is bounded by some constant variance $\sigma^2$.

Next, we want to observe a sequence of $\boldsymbol{\theta_{k}}$ generated by SGD after $T$ iterations, $k \in \{1,...,T\}$. Let $\bar{\boldsymbol{\theta}}_T = \frac{1}{T} \sum_{k=0}^{T-1}{\boldsymbol{\theta_{k+1}}}$ and $\alpha \leq \frac{1}{2L}$.

Under these assumptions, we show the following convergence for SGD.

$$\begin{aligned} \mathbb{E}[f(\bar{\boldsymbol{\theta}}_T)-f(\boldsymbol{\theta}^*)] \leq \frac{1}{2 \alpha_k T}|| \boldsymbol{\theta_0} - \boldsymbol{\theta^*} ||^2 + \alpha_k \sigma^2 \end{aligned}$$

If we select $\alpha_k = \frac{1}{\sqrt{T}}$, we achieve a convergence rate of $\mathcal{O}(\frac{1}{\sqrt{T}})$. This is a slower rate than regular gradient descent's $\mathcal{O}(\frac{1}{T})$, but each iteration is dramatically cheaper. Note also that $\alpha_k$ decays as we step, so the algorithm "slows down" as we proceed. The variance in the steps requires that our learning rate decays, which is generally undesirable.

We are able to quantify a bound for the variance of SGD, and we can now start to explore what different approaches we may take to reduce it.

Variance Reduction Techniques

The Bias-Variance Tradeoff

You may already be familiar with a bias-variance tradeoff scenario in machine learning -- our model is not producing desirable training error, so we choose to increase its complexity to better fit the training data. However, in doing so we may find that we "overfit" the model to the training data and create a prediction model that is biased to the details and minutiae of the training data. When we apply the model to test data that it has not previously seen, it performs more poorly than our simpler model.

This is commonly referred to as "overfitting" and is an example of a biased model. As we increase the complexity of our model, we are trading the magnitude of prediction error to the test data (variance) for bias to the training data itself.

The bias-variance tradeoff we see in variance reduction techniques for SGD algorithms is fundamentally different than our familiar example.

Model complexity bias-variance tradeoffs affect the error in predictions made by the model itself -- but in the context of stochastic optimization, bias and variance impact the error in parameter updates.

The issue with this variance generally ends up being the rate at which we converge, or whether we converge at all. Variance that overwhelms the steps can adversely impact the rate at which we converge to an acceptable solution, and if our algorithm takes too long to execute, we have defeated the purpose of using stochastic methods. In all, our goal becomes finding a balance: keeping the computational advantages of stochastic processes while keeping variance (and, as we will see, bias) under control in order to reap the advantages of faster algorithms.

Bias-Variance tradeoff

NOTE

Image from Hastie et al. (2004), p. 38; a general illustration of the bias-variance tradeoff.

MSE, Variance and Bias

Before we get into some techniques that can address variance in stochastic descent, we need to observe one of the core principles of statistical analysis.

Part of our due diligence in statistics is to analyze the quality of an estimator. To quantify the accuracy of an estimator $\hat{\boldsymbol{\theta}}$, we measure the error of the estimator compared to the true value of the parameter $\boldsymbol{\theta}$ it is estimating by using Mean Squared Error (MSE).

MSE is a quantity defined as the expected value of the square difference between the parameter estimate and the true parameter value. Skipping some steps in between, it can be decomposed into both a variance component and a bias component (a link to a helpful video about MSE).

$$\begin{aligned} \text{MSE}(\hat{\boldsymbol{\theta}}) = \mathbb{E}_{\theta}[(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})^2] = \text{Bias}(\hat{\boldsymbol{\theta}})^2 + \text{Var}(\hat{\boldsymbol{\theta}}) \end{aligned}$$
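For completeness, the skipped steps are the standard add-and-subtract of $\mathbb{E}[\hat{\boldsymbol{\theta}}]$; a quick sketch of the decomposition (the cross term vanishes because $\mathbb{E}[\hat{\boldsymbol{\theta}}-\mathbb{E}[\hat{\boldsymbol{\theta}}]]=0$):

$$\begin{aligned} \text{MSE}(\hat{\boldsymbol{\theta}}) &= \mathbb{E}\big[(\hat{\boldsymbol{\theta}} - \mathbb{E}[\hat{\boldsymbol{\theta}}] + \mathbb{E}[\hat{\boldsymbol{\theta}}] - \boldsymbol{\theta})^2\big] \\ &= \mathbb{E}\big[(\hat{\boldsymbol{\theta}} - \mathbb{E}[\hat{\boldsymbol{\theta}}])^2\big] + 2\,\underset{=0}{\underbrace{\mathbb{E}\big[\hat{\boldsymbol{\theta}} - \mathbb{E}[\hat{\boldsymbol{\theta}}]\big]}}\,\big(\mathbb{E}[\hat{\boldsymbol{\theta}}] - \boldsymbol{\theta}\big) + \big(\mathbb{E}[\hat{\boldsymbol{\theta}}] - \boldsymbol{\theta}\big)^2 \\ &= \text{Var}(\hat{\boldsymbol{\theta}}) + \text{Bias}(\hat{\boldsymbol{\theta}})^2 \end{aligned}$$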

In statistics, we often measure the quality of an estimator by whether it is both unbiased and consistent. We won't dive into consistency of estimators further here (see appendix for some detail), but we do want to take note that in general, unbiased estimators are preferred estimators to those that are biased in some way. In terms of MSE, if our estimator is unbiased, then its MSE is equal to its variance.

The previous section gave an example of an unbiased estimator.

There, we showed that our conditional estimator of the approximation error for a gradient, $\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}$, is unbiased, i.e. $\mathbb{E} [\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}] = 0$, where $\boldsymbol{w_{k}} \overset{\Delta}{=} \nabla f(\boldsymbol{\theta_{k}}) - \nabla f_{i_{k}}(\boldsymbol{\theta_{k}})$. So, the MSE for the estimator $\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}$ is only its variance.

We showed $\text{Var}(\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}) = \mathbb{E} [\boldsymbol{w_{k}} \boldsymbol{w_{k}}^T | \boldsymbol{\theta_{k}}] \leq \mathbb{E} [||\boldsymbol{w_{k}}||^2 | \boldsymbol{\theta_{k}}]$, so the MSE of $\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}$, the estimate of stochastic gradient estimation error for a given $\boldsymbol{\theta_{k}}$, is the following.

$$\begin{aligned} \text{MSE}(\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}) = \mathbb{E} [\boldsymbol{w_{k}} \boldsymbol{w_{k}}^T | \boldsymbol{\theta_{k}}] \end{aligned}$$

which is bounded by a constant variance term.

$$\begin{aligned} || \text{MSE}(\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}) || \leq \sigma^2 \end{aligned}$$

It is important to be cognizant of MSE as the measure for error of an estimator rather than just the variance. We don't just care about variance -- bias is a nontrivial and undesirable quantity if we are not careful.

As we explore variance reduction techniques, we will see that if we attempt to curb error due to variance but do not reduce overall MSE of the estimator in doing so, that MSE information is only "transferred" to the bias component. Compounding effects of bias in recursive descent algorithms can become especially problematic, so as we continue, we will give special attention to ensuring we also take it into account.

Minibatch SGD

Since variance in the algorithm's steps is a consequence of using estimates of the full gradient, it is pragmatic to utilize improved approximations. One natural statistical approach to improving an approximation is simple: use more data. We can do so by taking random batch samples of the full data -- minibatches.

Minibatches are an important variance reduction technique commonly seen in the class of gradient descent algorithms. Rather than sampling a random individual gradient from the full gradient at each step, we can improve our gradient approximation by sampling a random subset $I_k \subset \{ 1,...,n \}$ where $|I_k|=b \ll n$, and modify the SGD algorithm to average over the "minibatch".

$$\begin{aligned} \boldsymbol{\theta}_{k+1}=\boldsymbol{\theta}_{k}- \frac{\alpha_{k}}{b} \underset{i \in I_k}{\sum} \nabla f_{i}(\boldsymbol{\theta}_{k}) \end{aligned}$$

See appendix for example calculation details

We first take the conditional expectation, modified from SGD. Note that, as with SGD, the gradient estimate is unbiased: $\mathbb{E}[\frac{1}{b} \underset{i \in I_k}{\sum} \nabla f_{i}(\boldsymbol{\theta}_{k})] = \nabla f(\boldsymbol{\theta}_{k})$.

$$\begin{aligned} || \text{MSE}(\boldsymbol{w_{k}} | \boldsymbol{\theta_{k}}) || = \mathbb{E}[||\boldsymbol{w_k}||^2 | \boldsymbol{\theta_k}] = \mathbb{E}[|| \frac{1}{b} \underset{i \in I_k}{\sum} (\nabla f_{i}(\boldsymbol{\theta}_{k}) - \nabla f(\boldsymbol{\theta}_{k})) ||^2 | \boldsymbol{\theta_k}] \end{aligned}$$

$$\begin{aligned} \leq \frac{1}{b^2} \cdot b \underset{i \in I_k}{\sum} \mathbb{E}[ || \nabla f_{i}(\boldsymbol{\theta}_{k}) - \nabla f(\boldsymbol{\theta}_{k}) ||^2 | \boldsymbol{\theta_k} ] \leq \frac{\sigma^2}{b} \end{aligned}$$

NOTE

The above calculation uses the inequality $||\overset{n}{\underset{i=1}{\sum}} a_i||^2 \leq n \overset{n}{\underset{i=1}{\sum}} ||a_i||^2$, which is where the factor of $b$ comes from.

We reduce the variance bound by a factor of the batch size $b$. Since the estimator is unbiased, we do not need to be concerned that the MSE is merely displaced into a bias term. Intuitively this makes sense: we are using more information for each gradient calculation, and thus we get a better approximation of the gradient $\nabla f(\boldsymbol{\theta}_{k})$ at each step $k$. Note also that this result does not require a diminishing learning rate, so it can be applied to gradient descent algorithms with constant or adaptive learning rates.

The tradeoff here, though, is a more expensive gradient computation by a factor of $b$ at each timestep. No free lunch! A convergence analysis will show that minibatch SGD converges at a rate $\mathcal{O}(\frac{1}{\sqrt{T}})$, which is the same as SGD but at greater expense computationally due to the batch size. It's largely considered a compromise between SGD and regular gradient descent.
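A minimal sketch of the minibatch update, continuing the least squares example (the batch size, constant step size, and iteration count are illustrative assumptions):

```python
import numpy as np

def minibatch_sgd(A, b, batch_size=10, alpha=0.1, iters=500, seed=0):
    """Minibatch SGD sketch for the loss f(theta) = (1/n) * ||A theta - b||^2."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    theta = np.zeros(p)
    for k in range(iters):
        I_k = rng.choice(n, size=batch_size, replace=False)   # random index set I_k
        residual = A[I_k] @ theta - b[I_k]                     # per-sample residuals on the batch
        g = (2 / batch_size) * A[I_k].T @ residual             # averaged (unbiased) minibatch gradient
        theta = theta - alpha * g                              # constant step size is fine here
    return theta
```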

batchsize=100
batchsize=10
batchsize=2

The ADAM optimizer, which we will see in greater detail later, utilizes batch gradients. We are fitting to the same correlated noise data from above, which has $n=100$ samples and two features $\theta_1$ and $\theta_2$. The above graphs demonstrate the effects of variance reduction for increasing batch sizes. The first plot uses a batch size of $b=100$ (which is essentially regular gradient descent with ADAM's boost), the second $b=10$, and the last $b=2$. Clearly, we get closest to our optimal solution using bigger batches to reduce variance in the steps, but what this does not capture is the additional computational cost of each step. Ultimately, we tune parameters like batch size to find a balance between performance and results.

Stochastic Average Gradient (SAG) and SAGA

SAG and SAGA utilize important variance reduction techniques and are often treated as benchmarks for comparison with stochastic and batch gradient descent algorithms. Minibatches serve well in reducing variance; however, as we increase the batch size we are also increasing the per-iteration computational cost, and as $b$ approaches $n$ we end up back where we started with regular gradient descent. Our goal is to achieve a convergence rate of $\mathcal{O}(\frac{1}{T})$ with a cost of one gradient per iteration, motivating SAG.

Stochastic Average Gradient (SAG)

The framework for SAG uses gradient descent in the following form, taking the average gradient at each iteration.

$$\begin{aligned} \boldsymbol{\theta}_{k+1}=\boldsymbol{\theta}_{k}- \frac{\alpha}{n} \sum_{i=1}^{n}{g_k^{(i)}} \end{aligned}$$

where $g_k^{(i)} = \nabla f_i(\boldsymbol{\theta}_{k})$.

For SAG, we will update only one gradient at step $k$, $g_k^{(i_k)} = \nabla f_{i_k}(\boldsymbol{\theta}_{k})$ for a uniformly random chosen $i_k$, and keep all other $g_k^{(i)}$ at their previous values.

To accomplish this, the procedure maintains a table containing $g_k^{(i)} = \nabla f_{i}(\boldsymbol{\theta})$, $i=1,...,n$. This table requires additional storage proportional to the size of the data.

We first initialize $\boldsymbol{\theta}_{0}$ and set $g=0$ and $g^{(i)}=0$ for all $i=1,...,n$.

At each step $k=1,2,3,...$ we select $i_k \in \{1,...,n \}$ uniformly at random and update that individual gradient, keeping all other values of $g_k^{(i)}$ the same.

$$\begin{aligned} g_{k}^{(i)} = \begin{cases} \nabla f_{i}(\boldsymbol{\theta}_{k-1}) & i=i_{k}\\ g_{k-1}^{(i)} & i\neq i_{k} \end{cases} \end{aligned}$$

We then perform the per iteration update.

$$\begin{aligned} \boldsymbol{\theta}_{k}=\boldsymbol{\theta}_{k-1}- \alpha \left ( \frac{1}{n} \sum_{i=1}^{n}{g_k^{(i)}} \right ) \end{aligned}$$
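A minimal sketch of SAG for the running least squares example (the step size and iteration count are illustrative assumptions):

```python
import numpy as np

def sag(A, b, alpha=0.01, iters=2000, seed=0):
    """SAG sketch for the loss f(theta) = (1/n) * ||A theta - b||^2."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    theta = np.zeros(p)
    grad_table = np.zeros((n, p))   # stored gradients g^(i), initialized to 0
    grad_sum = np.zeros(p)          # running sum of the table rows
    for k in range(iters):
        i = rng.integers(n)                            # i_k chosen uniformly at random
        g_new = 2 * A[i] * (A[i] @ theta - b[i])       # fresh gradient of f_i at the current iterate
        grad_sum += g_new - grad_table[i]              # keep the running sum consistent
        grad_table[i] = g_new                          # refresh only entry i_k
        theta = theta - alpha * (grad_sum / n)         # step with the table average
    return theta
```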

We now have a new estimator for the gradient, which we will call $\hat{g}$, and which can be written as the following.

$$\begin{aligned} \hat{g} = \frac{1}{n} \left( g_k^{(i_k)} - g_{k-1}^{(i_k)} \right) + \frac{1}{n} \sum_{i=1}^{n}{g_{k-1}^{(i)}} \end{aligned}$$

If we set $X = g_{k}^{(i_k)}$ and $Y = g_{k-1}^{(i_k)}$, noting that $\mathbb{E}[Y] = \frac{1}{n} \overset{n}{\underset{i=1}{\sum}}{g_{k-1}^{(i)}}$, $\hat{g}$ can be rewritten as the following.

$$\begin{aligned} \hat{g} = \frac{1}{n} \left( X - Y \right) + \mathbb{E}[Y] \end{aligned}$$

As with our previous analysis of estimators, we take the expectation of $\hat{g}$,

$$\begin{aligned} \mathbb{E}[\hat{g}] = \frac{1}{n} \left( \mathbb{E}[X] - \mathbb{E}[Y] \right) + \mathbb{E}[Y] = \frac{1}{n} \mathbb{E}[X] + \left(1-\frac{1}{n}\right)\mathbb{E}[Y] \neq \nabla f(\boldsymbol{\theta}_{k}) \end{aligned}$$

The expectation of our estimator is not, in general, equal to the full gradient $\nabla f(\boldsymbol{\theta}_{k})$, so our gradient estimator $\hat{g}$ is biased!

Omitting the details, SAG can be shown to converge under our usual assumptions of convexity and Lipschitz continuity of the gradient, per the following. For $\alpha = \frac{1}{16L}$ and the initialization

$$\begin{aligned} g_{0}^{(i)} = \nabla f_i(\boldsymbol{\theta}_{0}) - \nabla f(\boldsymbol{\theta}_{0}) \text{, } i=1,...,n \end{aligned}$$

we achieve the following convergence.

$$\begin{aligned} \mathbb{E}[f(\bar{\boldsymbol{\theta}}_T)-f(\boldsymbol{\theta}^*)] \leq \frac{48n}{T} \left( f(\boldsymbol{\theta_0}) - f(\boldsymbol{\theta^*}) \right) + \frac{128L}{T}|| \boldsymbol{\theta_0} - \boldsymbol{\theta^*} ||^2 \end{aligned}$$

Comparing directly to SGD:

  • This is the desired $\mathcal{O}(\frac{1}{T})$ (more accurately $\mathcal{O}(\frac{n+L}{T})$), which is overall an improvement over SGD's $\mathcal{O}(\frac{1}{\sqrt{T}})$
  • Converges with a constant learning rate
  • SAG does not require the assumption of conditional boundedness of the variance ($\mathbb{E}[ || w_k ||^2 | \boldsymbol{\theta_k} ] \leq \sigma^2$)
  • BUT this algorithm is biased
  • It's a drawback computationally to need to store a gradient table

SAGA

The SAGA algorithm is only a slightly modified version of SAG, with the only difference being a modification of the gradient estimate itself. We will still maintain our table of gradients, use the same initialization procedure, and pick/replace a random gradient in the same fashion.

Instead, however, of the SAG gradient estimate $\hat{g} = \frac{1}{n} \left( g_k^{(i_k)} - g_{k-1}^{(i_k)} \right) + \frac{1}{n} \sum_{i=1}^{n}{g_{k-1}^{(i)}}$, we use the following gradient estimate for SAGA.

$$\begin{aligned} \hat{g} = g_k^{(i_k)} - g_{k-1}^{(i_k)} + \frac{1}{n} \sum_{i=1}^{n}{g_{k-1}^{(i)}} \end{aligned}$$

So, our per iteration updates are of the following form.

$$\begin{aligned} \boldsymbol{\theta}_{k}=\boldsymbol{\theta}_{k-1}- \alpha \left ( g_k^{(i_k)} - g_{k-1}^{(i_k)} + \frac{1}{n} \sum_{i=1}^{n}{g_{k-1}^{(i)}} \right ) \end{aligned}$$
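Reusing the gradient-table bookkeeping from the SAG sketch above, only the step itself changes. A sketch of a single SAGA update (the helper name and argument list are assumptions for illustration):

```python
def saga_step(theta, A, b, i, grad_table, grad_sum, alpha, n):
    """One SAGA update; grad_table and grad_sum are maintained exactly as in the SAG sketch."""
    g_new = 2 * A[i] * (A[i] @ theta - b[i])                 # fresh gradient of f_i at theta_{k-1}
    g_old = grad_table[i].copy()                             # stored gradient g_{k-1}^(i_k)
    theta = theta - alpha * (g_new - g_old + grad_sum / n)   # unbiased SAGA gradient estimate
    grad_sum += g_new - g_old                                # then refresh the running sum
    grad_table[i] = g_new                                    # and the table entry
    return theta
```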

SAGA can be shown to have the same $\mathcal{O}(\frac{1}{T})$ convergence, but can use a slightly better learning rate $\alpha = \frac{1}{3L}$. It also does not require the bound on conditional variance. The most stark difference between SAGA and SAG is that SAGA is unbiased. It still has the less-than-desirable need to store a gradient table.

Using the same definitions for the random variables $X$ and $Y$ as before, we have the following.

$$\begin{aligned} \mathbb{E}[\hat{g}] = \mathbb{E}[X] - \mathbb{E}[Y] + \mathbb{E}[Y] = \mathbb{E}[X] = \nabla f(\boldsymbol{\theta_k}) \end{aligned}$$

So, $\hat{g}$ is an unbiased estimator for the gradient, and it follows that the bias term $\mathbb{E}[w_k | \boldsymbol{\theta}_{k}] = 0$ for this method.

SAG & SAGA Variance comparison

Recall the gradient estimates for each.

SAG

$$\begin{aligned} \hat{g} = \frac{1}{n} \left( g_k^{(i_k)} - g_{k-1}^{(i_k)} \right) + \frac{1}{n} \sum_{i=1}^{n}{g_{k-1}^{(i)}} \end{aligned}$$

SAGA

$$\begin{aligned} \hat{g} = g_k^{(i_k)} - g_{k-1}^{(i_k)} + \frac{1}{n} \sum_{i=1}^{n}{g_{k-1}^{(i)}} \end{aligned}$$

For $X = g_{k}^{(i_k)}$ and $Y = g_{k-1}^{(i_k)}$, where $X$ and $Y$ are correlated, we have the general form.

$$\begin{aligned} \hat{g}_{\alpha} = \alpha (X-Y) + \mathbb{E}[Y] \end{aligned}$$

Where $\alpha = 1$ for SAGA and $\alpha = \frac{1}{n}$ for SAG (here $\alpha$ is just a weighting constant, not the learning rate).

Taking the variance, we have the following.

$$\begin{aligned} \text{Var}(\hat{g}_{\alpha}) = \alpha^2 (\text{Var}[X] + \text{Var}[Y] - 2 \text{Cov}[X,Y]) \end{aligned}$$

So, we can compare the methods.

$$\begin{aligned} \textbf{SAG: } \text{Var}(\hat{g}_{\alpha}) = \frac{1}{n^2} (\text{Var}[X] + \text{Var}[Y] - 2 \text{Cov}[X,Y]) \end{aligned}$$

$$\begin{aligned} \textbf{SAGA: } \text{Var}(\hat{g}_{\alpha}) = \text{Var}[X] + \text{Var}[Y] - 2 \text{Cov}[X,Y] \end{aligned}$$

SAGA's modified estimator makes the iterations unbiased, but in doing so it increases the variance in the steps. SAG and SAGA make different tradeoffs between bias and variance, but ultimately both converge at the same rate. In terms of MSE, this means neither method has an obvious advantage, other than SAGA's larger permissible learning rate.

Other Common Variance Reduction Techniques

We have covered some basics of stochastic gradient descent algorithms, as well as some basic approaches to variance reduction, but I also wanted to include some supplemental material for further reading. If you are interested in learning more about common techniques, I recommend this paper on variance reduction techniques in machine learning (Gower et al., 2020).

SARAH

SARAH, or StochAstic Recursive grAdient algoritHm, is a stochastic gradient descent method that utilizes recursive gradient information to reduce variance in the steps. The algorithm is a modification of the method SVRG (Stochastic Variance Reduced Gradient), which we will briefly cover, and is widely used as one of the benchmark algorithms for comparison to new gradient descent style methods. As with other methods we have seen until now, some of its core assumptions for convergence require a convex objective function (or strongly-convex) with Lipschitz continuous gradient. SARAH is a good example of an algorithm that is actually biased, but can be an improvement over SVRG if conditions are considered carefully.

Stochastic Variance Reduced Gradient (SVRG)

To understand the SARAH algorithm, we need to first conceptually understand SVRG. First, we present the SVRG algorithm.

SVRG algorithm

The algorithm consists of an outer loop and an inner loop, both with iteration counts specified as hyperparameters. We select the number of epochs $T$ (each "epoch" is one pass over all samples in the dataset) for the outer loop, and the number of inner loop iterations $m$.

The outer loop performs one full gradient computation $\tilde{g}$ at the snapshot parameters $\tilde{\boldsymbol{\theta}}$ (initialized as $\tilde{\boldsymbol{\theta}}_0$), hence the outer loop "epochs". $\tilde{\boldsymbol{\theta}}$ is then used to initialize the inner loop iterates $\boldsymbol{\theta}_0$ and is fed into the inner loop along with $\tilde{g}$.

The inner loop then begins, using a stochastic gradient approach in combination with an iteratively updated gradient estimate $v_{t-1} = \nabla f_{i_{t}} (\boldsymbol{\theta}_{t-1}) - \nabla f_{i_{t}} (\tilde{\boldsymbol{\theta}}) + \tilde{g}$, utilizing the outer loop full gradient $\tilde{g}$, the stochastic gradient computed from the most recent parameter update $\nabla f_{i_{t}} (\boldsymbol{\theta}_{t-1})$, and the corresponding stochastic gradient at the snapshot $\nabla f_{i_{t}} (\tilde{\boldsymbol{\theta}})$. The gradient estimate is then used for a normal gradient descent style parameter update.

All parameter updates occur within the inner loop, and the resulting parameters from the inner loop get fed back into the outer loop as the new parameter initialization.
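A minimal sketch of this outer/inner loop structure, again on the least squares example (the learning rate, epoch count, and inner loop size are illustrative assumptions):

```python
import numpy as np

def svrg(A, b, alpha=0.05, epochs=10, m=200, seed=0):
    """SVRG sketch for the loss f(theta) = (1/n) * ||A theta - b||^2."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    theta_tilde = np.zeros(p)
    for s in range(epochs):
        full_grad = (2 / n) * A.T @ (A @ theta_tilde - b)       # outer loop snapshot gradient g~
        theta = theta_tilde.copy()                              # inner loop starts from the snapshot
        for t in range(m):
            i = rng.integers(n)
            g_i = 2 * A[i] * (A[i] @ theta - b[i])              # grad f_i at the current iterate
            g_i_tilde = 2 * A[i] * (A[i] @ theta_tilde - b[i])  # grad f_i at the snapshot
            v = g_i - g_i_tilde + full_grad                     # variance-reduced gradient estimate
            theta = theta - alpha * v
        theta_tilde = theta                                     # feed result back to the outer loop
    return theta_tilde
```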

SVRG Gradient Estimator

We want to analyze our gradient estimate, so we take its expectation conditioned on knowledge of $\boldsymbol{\theta}_{t-1}$.

$$\begin{aligned} \mathbb{E} [v_{t-1}|\boldsymbol{\theta}_{t-1}] = \mathbb{E}[ \nabla f_{i_{t}} (\boldsymbol{\theta}_{t-1}) - \nabla f_{i_{t}} (\tilde{\boldsymbol{\theta}}) + \tilde{g} ] = \mathbb{E}[ \nabla f_{i_{t}} (\boldsymbol{\theta}_{t-1}) ] - \mathbb{E}[ \nabla f_{i_{t}} (\tilde{\boldsymbol{\theta}}) ] + \tilde{g} \end{aligned}$$

$$\begin{aligned} = \underset{\nabla f (\boldsymbol{\theta}_{t-1})}{\underbrace{\frac{1}{n} \sum_{i=1}^{n}{\nabla f_{i}(\boldsymbol{\theta}_{t-1})}}} - \underset{\tilde{g}}{\underbrace{\frac{1}{n} \sum_{i=1}^{n}{\nabla f_{i} (\tilde{\boldsymbol{\theta}})}}} + \tilde{g} = \nabla f (\boldsymbol{\theta}_{t-1}) \end{aligned}$$

So, the SVRG gradient estimator for the inner loop updates is unbiased.

SVRG Variance Reduction

The variance reduction for SVRG is accomplished within the inner loops by the convergence of both $\boldsymbol{\theta}_{t}$ and $\tilde{\boldsymbol{\theta}}$ to the same optimal parameter $\boldsymbol{\theta}^{*}$. Omitting the detailed proof (Johnson & Zhang, 2013) (link), the following holds for convex (and strongly convex) objective functions with Lipschitz continuous gradients.

$$\begin{aligned} \tilde{g} = \frac{1}{n} \sum_{i=1}^{n}{\nabla f_{i} (\tilde{\boldsymbol{\theta}})} \underset{\tilde{\theta}\rightarrow\theta^{*}}{\rightarrow} \frac{1}{n} \sum_{i=1}^{n}{\nabla f_{i} (\boldsymbol{\theta}^*)} \rightarrow 0 \end{aligned}$$

So if $\nabla f_{i} (\tilde{\boldsymbol{\theta}}) \rightarrow \nabla f_{i} (\boldsymbol{\theta}^*)$, then

$$\begin{aligned} v_{t-1} = \nabla f_{i_{t}} (\boldsymbol{\theta}_{t-1}) - \nabla f_{i_{t}} (\tilde{\boldsymbol{\theta}}) + \tilde{g} \underset{\tilde{\theta}\rightarrow\theta^{*}\text{, } \theta_{t-1}\rightarrow\theta^{*}}{\rightarrow} \nabla f_{i_{t}} (\boldsymbol{\theta}^{*}) - \nabla f_{i_{t}} (\boldsymbol{\theta}^{*}) \rightarrow 0 \end{aligned}$$

This means that the variance of the gradient estimates approaches zero as the parameter updates converge to the optimum $\boldsymbol{\theta}^{*}$, so the inner loops serve as a variance reduction technique. However, for the overall variance in the steps to decrease we must keep updating $\tilde{\boldsymbol{\theta}}$ (so that $\nabla f_{i} (\tilde{\boldsymbol{\theta}}) \rightarrow \nabla f_{i} (\boldsymbol{\theta}^*)$), meaning that we must execute additional full gradient computations (epochs). We will revisit this point in the analysis of SARAH's variance reduction.

SVRG Convergence

For convex objective functions $f_i$ with Lipschitz continuous gradients, SVRG can be shown to converge at a rate of $\mathcal{O} (\frac{1}{T})$, which is an improvement over SGD's $\mathcal{O} (\frac{1}{\sqrt{T}})$. It also has better theoretical complexity, $\mathcal{O} (n\log\frac{1}{\epsilon} + \frac{L}{\epsilon})$, than SAG and SAGA ($\mathcal{O} (\frac{n+L}{\epsilon})$) (note convergence here is in terms of a specified tolerance $\epsilon$ rather than number of iterations $T$).

For a $\mu$-strongly convex objective function, SVRG converges as follows.

Assume $m$ is sufficiently large such that the following holds.

$$\begin{aligned} \beta_m = \frac{1}{\mu \alpha (1-2L \alpha)m} + \frac{2L \alpha}{1-2L \alpha} < 1 \end{aligned}$$

then,

$$\begin{aligned} \mathbb{E} [f(\tilde{\boldsymbol{\theta}}_s)] - f(\boldsymbol{\theta}^*) \leq \beta_m^s (f(\tilde{\boldsymbol{\theta}}_0) - f(\boldsymbol{\theta}^*)) \end{aligned}$$

The complexity is $\mathcal{O}((n+\frac{L}{\mu}) \log(\frac{1}{\epsilon}))$, which is comparable to SAG/SAGA. This also gives us criteria by which to select the inner loop size $m$ and learning rate $\alpha$.

Unlike SGD, we do not need a decaying learning rate for SVRG to achieve the comparable convergence rate. The decaying learning rate required for SGD is a consequence of the variance in the steps (even though variance can be unbounded), which is a good example of how variance can impact convergence rate; SVRG explicitly reduces the variance in the steps of SGD.

The SARAH Algorithm

Methodology in this section is drawn from SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient (Nguyen et al., 2017) (link)


SARAH algorithm

SARAH is very closely related to SVRG. It utilizes the same outer/inner loop approach, which functionally serves the same purpose: the outer loop is used to compute a full gradient (a "snapshot" of the full gradient), and the inner loop performs SGD-style steps with a variance reduction design. The key difference between SVRG and SARAH is in the gradient estimate update step in the inner loop.

SVRG's gradient estimate is designed to use the outer loop "snapshot" to generate an unbiased gradient estimate at each step, while simultaneously reducing the variance that is a consequence of SGD. The "snapshot" is what stabilizes the algorithm in terms of bias (the gradient estimate is unbiased) and allows the variance in the stochastic steps to be reduced using the parameters from which the outer loop was initialized.

SARAH's gradient estimate, however, leans more heavily into the variance-reduction aspect of the algorithm by utilizing recursive gradient information to reduce variance, rather than solely using the gradient estimate from the outer loops. This improves the stochastic variance reduction power of the algorithm, but at the expense of introducing bias.

SARAH Gradient Estimator

The gradient estimate (also referred to as the "SARAH update") utilizes a similar framework to SVRG. Recall the SVRG gradient estimate.

The SARAH estimator modifies the gradient estimate by replacing the estimate's outer loop initialization terms with a recursive gradient update.

$$\begin{aligned} v_{t-1} = \nabla f_{i_{t}} (\boldsymbol{\theta}_{t-1}) \underset{\text{outer loop terms}}{\underbrace{ {}- \nabla f_{i_{t}} (\tilde{\boldsymbol{\theta}}) + \tilde{g}}} \end{aligned}$$

$$\begin{aligned} v_{t}=\nabla f_{i_{t}}(\boldsymbol{\theta}_{t})-\nabla f_{i_{t}}(\boldsymbol{\theta}_{t-1})+v_{t-1} \end{aligned}$$

We note that there is a shift in index as we now need to use both the most recent random gradient as well as the random gradient from the previous step. This also means that we must perform one parameter update in the outer loop before we enter the inner loop.

Each gradient estimate uses every previous gradient estimate from the inner loop for each inner loop update, hence a "recursive" update framework as opposed to each update using the outer loop initialization only.
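A minimal sketch of the SARAH loop structure on the same least squares example (hyperparameter values are illustrative assumptions):

```python
import numpy as np

def sarah(A, b, alpha=0.05, epochs=10, m=200, seed=0):
    """SARAH sketch for the loss f(theta) = (1/n) * ||A theta - b||^2."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    theta = np.zeros(p)
    for s in range(epochs):
        v = (2 / n) * A.T @ (A @ theta - b)       # v_0: full gradient at the outer loop snapshot
        theta_prev = theta.copy()
        theta = theta - alpha * v                 # one parameter update before the inner loop
        for t in range(m):
            i = rng.integers(n)
            g_curr = 2 * A[i] * (A[i] @ theta - b[i])        # grad f_i(theta_t)
            g_prev = 2 * A[i] * (A[i] @ theta_prev - b[i])   # grad f_i(theta_{t-1})
            v = g_curr - g_prev + v                          # recursive SARAH gradient update
            theta_prev = theta.copy()
            theta = theta - alpha * v
    return theta
```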

Let's take the conditional expectation of the gradient estimate given all previous iterates of $\boldsymbol{\theta}_{t}$. We will call this $\mathcal{F}_t$, the $\sigma$-algebra generated by the initial $\boldsymbol{\theta}_{0}$ and all subsequent randomly chosen indices for the inner loop, $\mathcal{F}_t = \sigma(\boldsymbol{\theta}_{0}, i_1,...,i_{t-1})$. Note that $\mathbb{E}[v_{t}|\mathcal{F}_t] = \mathbb{E}_{i_t}[v_{t}]$.

$$\begin{aligned} \mathbb{E}_{i_t}[v_{t}] = \nabla f(\theta_{t}) - \nabla f(\theta_{t-1}) + v_{t-1} \neq \nabla f(\theta_{t}) \end{aligned}$$

The steps are omitted, but the result is notable as it is not an unbiased estimator for the gradient. However, the total expectation holds.

$$\begin{aligned} \mathbb{E}[v_t] = \nabla f(\theta_{t}) \end{aligned}$$

So, an exact full pass over the data for an inner loop is unbiased.

SARAH Variance Reduction

Recall that for SVRG, in order for the overall variance in the steps to decrease, we must execute additional outer loops; in order for $\nabla f_{i} (\tilde{\boldsymbol{\theta}}) \rightarrow \nabla f_{i} (\boldsymbol{\theta}^*)$ to hold, we must update $\tilde{\boldsymbol{\theta}}$ via an outer loop computation.

In the analysis of SVRG we showed that stochastic gradients are unbiased estimators for the full gradient, and that $\nabla f_{i_{t}} (\boldsymbol{\theta}^{*}) - \nabla f_{i_{t}} (\boldsymbol{\theta}^{*}) \rightarrow 0$. Since SARAH's inner loop terms only utilize stochastic gradients for the gradient estimate, the inner loops will reduce variance without requiring an outer loop, as there are no terms in the estimator dependent upon the outer loop update.

SARAH's variance bound in the $\mu$-strongly convex case can be shown to be smaller than the variance bound of the same objective function with SVRG.

$$\begin{aligned} \sigma_{m}\stackrel{def}{=}\frac{1}{\mu\eta(m+1)}+\frac{\eta L}{2-\eta L} < \frac{1}{\mu\eta(1-2L\eta)m}+\frac{1}{\frac{1}{2\eta L}-1}=\beta_m \end{aligned}$$

The variance reduction of the inner loops is at the expense of bias; SARAH reduces variance in the inner loops but now has bias that is itself stochastic. This bias can be somewhat "corrected" for, however, by executing an outer loop to reinitialize the parameters by pointing them in the correct direction. We will examine some intelligent ways to execute outer loops when necessary using SARAH+.

SARAH Convergence

For convex objective functions $f_i$ with Lipschitz continuous gradients, SARAH can be shown to converge with complexity $\mathcal{O}((n+\frac{1}{\epsilon}) \log(\frac{1}{\epsilon}))$. Better-conditioned $f_i$ that are $\mu$-strongly convex converge with complexity $\mathcal{O}((n+\frac{L}{\mu}) \log(\frac{1}{\epsilon}))$.

Comparing these rates directly to SVRG,

| Convexity | SVRG | SARAH |
| --- | --- | --- |
| Convex | $\mathcal{O} (n\log\frac{1}{\epsilon} + \frac{L}{\epsilon})$ | $\mathcal{O}((n+\frac{1}{\epsilon}) \log(\frac{1}{\epsilon}))$ |
| Strongly convex | $\mathcal{O}((n+\frac{L}{\mu}) \log(\frac{1}{\epsilon}))$ | $\mathcal{O}((n+\frac{L}{\mu}) \log(\frac{1}{\epsilon}))$ |

Similar to SVRG, we must select our inner loop size $m$ and learning rate such that the following holds.

$$\begin{aligned} \sigma_{m}\stackrel{def}{=}\frac{1}{\mu\eta(m+1)}+\frac{\eta L}{2-\eta L} < 1 \end{aligned}$$

Recall from above that the variance bound for SARAH in the strongly convex case is smaller than that of SVRG. So, while their convergence complexity is the same under strong convexity, SARAH can use a larger learning rate than SVRG: SVRG must use a learning rate of $\eta < \frac{1}{4L}$ whereas SARAH can use $\eta < \frac{1}{L}$.

SARAH and SVRG zoomed
SARAH and SVRG run same inner and outer loops

We revisit a side by side comparison of SARAH and SVRG, both run with the same number of inner and outer loops with the same hyperparameter settings. The first image is zoomed in to highlight how each algorithm handles variance reduction. SARAH has very little variance, but requires outer loop executions to correct for its inner loop bias (outer loops happen at the "kinks"). SVRG starts moving toward the solution sooner, but has higher variance in its steps until we execute more outer loops, and it smooths as more outer loops are executed (our gradient estimate becomes better). In the same number of iterations, however, they are fairly comparable. This is only a very simple example to illustrate their general behaviors, but reinforces that they are very similar algorithms in nature and performance. So, there are some advantages to SARAH over SVRG, but not necessarily overwhelmingly so.

SARAH+ Algorithm

SARAH+ algorithm

Nguyen et al. (2017) (link) present this variant of the SARAH algorithm, which provides an automatic and adaptive inner loop size. Numerical experiments in the paper indicate that a choice of $\gamma=1/8$ shows benefits in convergence rate, but the authors do not elaborate on any supporting theory.

This modification introduces an automated methodology for breaking the inner loop, so that the inner loop size $m$ can be adaptive. It also serves as automation to correct for bias in the inner loops with the execution of a full outer loop gradient.

Applications of SARAH

In practice, SARAH can be a direct substitute for SVRG with only some updated parameter tuning considerations, so it can be utilized for a very large variety of applications as an SVRG replacement. However, it does not seem to be used as widely in practice as SVRG. While speculation as to why is not worthwhile, it is noteworthy that SARAH's variance reduction technique utilizes a biased methodology, where SVRG is unbiased. SVRG is a widely utilized benchmark algorithm, and many machine learning packages have SVRG built in; the same cannot be said for SARAH, and while it does have theoretical advantages, it does not seem to have penetrated the infrastructure of optimization algorithms in practice. Still, there is a fair amount of research available on further modifications of SARAH.

ADAM

Methodology in this section is drawn from Adam: A Method for Stochastic Optimization (Kingma & Ba, 2017) (link).

ADAM is a widely popular optimizer in machine learning. It is a staple for deep learning, neural networks, natural language processing, and generative adversarial networks applications, to name a few. ADAM has some unique strengths against other optimizers; it is fast, requires little hyperparameter tuning, performs well with sparse gradients, and can be applied to stochastic objective functions. It is also an adaptive algorithm that uses adaptive step sizes according to data it has seen to speed up convergence.

The ADAM optimizer is a stochastic/batch gradient descent variant that utilizes two techniques that leverage previous gradient information (conceptually adjacent to the recursion we saw in SARAH): Momentum and RMSProp (Root Mean Squared Propagation). Momentum acts to dampen oscillations by using moving averages of previous gradients to decrease the variance in the steps. RMSProp adaptively scales the learning rate based upon an exponentially weighted average of the magnitude of recent gradients, so it can "speed up" and "slow down" such that it does not diverge from the local data. Additionally, ADAM can utilize batch gradients rather than only individual stochastic gradients, which can act as variance reduction by increasing sample size.

The Algorithm

ADAM algorithm

NOTE

g_t^2 is equivalent to element-wise multiplication of the gradient with itself, g_t \odot g_t

Variance Reduction

Momentum

Polyak's momentum is a technique that uses both the current gradient as well as the history of gradients from previous steps to accelerate convergence. The momentum term is incorporated into general gradient descent style algorithms and uses a hyperparameter \beta \in [0,1) to scale its weight.

\begin{aligned} \theta_{k+1} = \theta_{k} - \alpha \nabla f(\theta_k) + \beta(\theta_{k} - \theta_{k-1}) \end{aligned}

If we expand this term,

\begin{aligned} \theta_{k+1} = \theta_{k} - \alpha \nabla f(\theta_k) + \beta( \theta_{k-1} - \alpha \nabla f(\theta_{k-1}) + \beta(\theta_{k-1} - \theta_{k-2}) - \theta_{k-1}) \end{aligned}
\begin{aligned} = \theta_{k} - \alpha \nabla f(\theta_k) - \beta \alpha \nabla f(\theta_{k-1}) + \underset{\text{higher order } \beta \text{ small}}{\underbrace{\beta^2(\theta_{k-1} - \theta_{k-2})}} \end{aligned}

If these recent gradients are similar, the algorithm gets an additional "boost", or "momentum". \theta_{k-1} and so on can be expanded similarly, and we get higher order \beta terms as the algorithm iterates. This favors the most recent gradients since \beta \in [0,1) and higher order terms quickly dampen out previous iterates.
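
A minimal sketch of gradient descent with the Polyak ("heavy-ball") momentum update above; grad_f is a hypothetical gradient function for whatever objective is being minimized.

```python
import numpy as np

def heavy_ball(grad_f, theta0, alpha=0.01, beta=0.9, iters=100):
    """Gradient descent with a Polyak momentum term beta * (theta_k - theta_{k-1})."""
    theta_prev = theta0.copy()
    theta = theta0.copy()
    for _ in range(iters):
        step = -alpha * grad_f(theta) + beta * (theta - theta_prev)
        theta_prev, theta = theta, theta + step
    return theta

# Example: minimize f(x) = ||x||^2, whose gradient is 2x
x_min = heavy_ball(lambda x: 2 * x, np.array([5.0, -3.0]))
print(x_min)   # should approach the zero vector
```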

Momentum
Compare ADAM momentum

We see the "boost" in the animation for the ADAM algorithm compared to other stochastic methods, moving more rapidly toward the solution.

In ADAM, the momentum term is written as follows.

\begin{aligned} m_{t} = \beta_{1} \cdot m_{t-1} + (1 - \beta_{1}) \cdot g_{t} \end{aligned}

Momentum helps to accelerate convergence, particularly over the contours of objective functions, while also acting as variance reduction in the steps. The technical details of variance reduction in stochastic methods are a little more involved and are covered in (Arnold et al., 2019) (link). For an in-depth and beautifully interactive blog about momentum, I also suggest this post.

Minibatches

Earlier in this post we covered the variance reduction properties of stochastic minibatches. ADAM utilizes minibatches, and the batch size is a tuneable hyperparameter. This gives us the ability to use better gradient approximations for our steps and balance convergence with computation time.

RMSProp

Root-mean-squared propagation (RMSProp) is a technique that uses the moving average of the magnitude of recent squared gradients to normalize the gradient estimate in the descent algorithm. Its implementation in descent algorithms makes the learning rate adaptive to the data, rather than treating it as a constant tuneable hyperparameter. In essence, the learning rate will speed up or slow down depending on recent gradients.

The RMSProp update is similar to momentum, using the squared gradient term.

\begin{aligned} v_{t} = \beta_{2} \cdot v_{t-1} + (1 - \beta_{2}) \cdot g_{t}^2 \end{aligned}

And the update rule in a gradient descent algorithm is written as follows.

\begin{aligned} \theta_{t} = \theta_{t-1} - \alpha \frac{\nabla f(\theta_{t-1})}{\sqrt{v_{t}}+\epsilon} \end{aligned}

The \epsilon term only serves to stabilize the algorithm as v_t becomes small (the steps would become large if the denominator is small), preventing overshooting close to the optimum.
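
A minimal sketch of the RMSProp update under these definitions; grad_f is again a hypothetical gradient function, and the raw learning rate alpha is rescaled element-wise by the running average of squared gradients.

```python
import numpy as np

def rmsprop(grad_f, theta0, alpha=0.01, beta2=0.999, eps=1e-8, iters=1000):
    """RMSProp: adaptive scaling of the step by a moving average of squared gradients."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(iters):
        g = grad_f(theta)
        v = beta2 * v + (1.0 - beta2) * g ** 2      # running average of squared gradients
        theta = theta - alpha * g / (np.sqrt(v) + eps)
    return theta
```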

Bias and Bias Correction

The bias-correction steps in ADAM are enhancements to the momentum and RMSProp techniques (which are themselves biased) and contribute to its advantageous performance with sparse gradients. ADAM's moment estimates are biased toward their initialization, which in general is chosen to be the \textbf{0} vector.

We will follow the derivation of the bias terms as in Kingma & Ba (2017).

Our momentum term m_t, or the first moment estimator, is meant to estimate the gradient g_t, so we want to find the expectation of the first moment estimate \mathbb{E}[m_t].

Let's first consider running ADAM for T timesteps, and let g_1,...,g_T be the gradients at each timestep. We assume the gradients are distributed as g_t \sim p(g_t). At timestep t, the first moment estimate term is the following.

\begin{aligned} m_t = \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t \end{aligned}

We initialize the first moment estimate by setting m_{0}=\textbf{0}. Expanding this term by substituting m_{t-1} with each previous timestep, we get

\begin{aligned} m_t = (1-\beta_1) \sum_{i=1}^{t}{\beta_1^{t-i} \cdot g_i } \end{aligned}

Now we can take the expectation of m_t.

\begin{aligned} \mathbb{E}[m_{t}] = \mathbb{E} \left[ (1-\beta_1) \sum_{i=1}^{t}{\beta_1^{t-i} \cdot g_i} \right] \end{aligned}

We can rewrite \sum_{i=1}^{t}{\beta_1^{t-i} \cdot g_i} by adding and subtracting g_t inside the sum, i.e. \sum_{i=1}^{t}{\beta_1^{t-i} \cdot (g_i + g_t - g_t)}, which gives the following.

\begin{aligned} \sum_{i=1}^{t}{\beta_1^{t-i} \cdot g_t} + \underset{\zeta}{\underbrace{\sum_{i=1}^{t}{\beta_{1}^{t-i}\cdot(g_{i}-g_{t})}}} \end{aligned}

Recent gradients will be close to g_t, and gradients further in the past are quickly dampened by the exponentially decaying weight \beta_1^{t-i}, so the \zeta term can be treated as small. So, our expectation can be written as the following.

\begin{aligned} \mathbb{E}[m_{t}] = \mathbb{E} \left[ (1-\beta_1) \sum_{i=1}^{t}{\beta_1^{t-i} \cdot g_t} + \zeta \right] \end{aligned}
\begin{aligned} = \mathbb{E}[g_t] \cdot (1-\beta_1) \sum_{i=1}^{t}{\beta_1^{t-i} } + \zeta \end{aligned}

Expanding the geometric sum and cancelling \beta_1 terms results in the following.

\begin{aligned} \mathbb{E}[m_{t}] = \mathbb{E}[g_t] \cdot (1-\beta_1^t) + \zeta \end{aligned}
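
A quick numerical sanity check of this factor: if every gradient happens to equal the same constant g, then \zeta = 0 and the recursion should reproduce g \cdot (1-\beta_1^t) exactly, illustrating the bias toward the zero initialization.

```python
# With a constant gradient g, the recursion m_t = beta1 * m_{t-1} + (1 - beta1) * g
# should equal g * (1 - beta1**t) exactly (zeta vanishes in this special case).
beta1, g, T = 0.9, 2.0, 10
m = 0.0
for t in range(1, T + 1):
    m = beta1 * m + (1 - beta1) * g
    print(t, m, g * (1 - beta1 ** t))   # the last two columns should agree
```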

The calculation steps for the second moment estimate follow identical logic, and we get the following expectation for v_{t}.

\begin{aligned} \mathbb{E}[v_{t}]=\mathbb{E}[g_{t}^{2}]\cdot(1-\beta_{2}^{t})+\zeta \end{aligned}

Clearly, m_t is a biased estimator for g_t, and v_t is biased for g_t^2.

We can approximately correct for the bias introduced by the zero initialization of the moments by dividing the respective estimates by (1-\beta_{1}^t) and (1-\beta_{2}^t) for the first and second moment estimates respectively, giving approximately \mathbb{E}[\hat{m}_{t}]=\mathbb{E}[g_{t}] and \mathbb{E}[\hat{v}_{t}]=\mathbb{E}[g_{t}^{2}].
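
Putting the pieces together, here is a minimal sketch of the ADAM update with bias-corrected moment estimates. grad_f is a hypothetical (stochastic or minibatch) gradient function, and the default hyperparameters follow the commonly cited values from Kingma & Ba (2017).

```python
import numpy as np

def adam(grad_f, theta0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, iters=1000):
    """Sketch of ADAM with bias-corrected first and second moment estimates."""
    theta = theta0.copy()
    m = np.zeros_like(theta)   # first moment estimate, initialized at 0
    v = np.zeros_like(theta)   # second moment estimate, initialized at 0
    for t in range(1, iters + 1):
        g = grad_f(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # correct the bias toward the zero initialization
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```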

ADAM Convergence

ADAM requires only a basic convexity assumption on the objective function, and a similar but more relaxed assumption than Lipschitz continuity called (L_0,L_1) smoothness (Wang et al., 2022) (link). Foregoing the details, ADAM converges at a rate of \mathcal{O}(\frac{1}{\sqrt{T}}) (the detailed proof for ADAM's convergence can be found in the original paper).

Recall that this is the same theoretical convergence rate as SGD; however, ADAM is able to use a fixed learning rate rather than a diminishing step size. While its rate is less desirable than other algorithms we have seen, it is still a widely desirable algorithm for its performance in practice.

Applications of ADAM

ADAM is a very widely used optimizer, particularly for neural network, deep learning, and natural language processing applications, amongst others. One of its advantages is how well it handles sparse data, which are commonly characteristic of the data for these types of tasks. While it is not the fastest method in terms of convergence, the methodologies it utilizes balance stochastic variance while also addressing inherent estimator bias, making this method a trustworthy workhorse.

There is existing infrastructure for ADAM among some common machine learning tools like Keras, scikit-learn, PyTorch, and TensorFlow, to name some prolific packages in industry and research applications, so it has been embraced by the community as an effective optimizer. Overall it is a robust optimizer with reliable performance for convex (as well as nonconvex, He et al. (2023)) objectives.

Example Numerical Experiments

NOTE

This section was lifted directly from my coursework. Note this section uses x for the model parameter vector instead of the \theta we have been using previously (different choices may be seen in practice).

The purpose of the numerical experiment will be to evaluate the general performance of ADAM and SARAH on real data and to illustrate a bias-variance tradeoff between SVRG and SARAH. We use the "Wine Quality" dataset from the UC Irvine Machine Learning Repository (Cortez et al., 2009). The data are measurements of wine characteristics (features) used to predict wine quality scores. There are n=6497 instances and m=11 features. The experiments will use a simple least squares linear regression objective, assuming a system with data A and response b of the form Ax=b.

\begin{aligned} \underset{x}{\text{min}}||Ax-b||^{2} \end{aligned}

The least squares objective was chosen because it is strongly convex. x^* was computed separately using scipy.optimize.lsq_linear with tolerance set to 1e-10.
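
For reference, a sketch of how such a solution can be obtained (a small random system stands in here for the Wine Quality feature matrix A and response b).

```python
import numpy as np
from scipy.optimize import lsq_linear

# Compute a reference solution x* for min ||Ax - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((6497, 11))   # stand-in for the n x m feature matrix
b = rng.standard_normal(6497)         # stand-in for the quality scores
result = lsq_linear(A, b, tol=1e-10)
x_star = result.x
print(np.linalg.norm(A @ x_star - b))  # residual norm at the computed optimum
```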

Numerical Experiment Figure 1
Numerical Experiment Figure 2

See Figures 1 and 2 for settings and computation time. The two examples chosen illustrate some high level takeaways. In terms of per-iteration computational performance, both ADAM and SARAH are virtually identical to SVRG, though ADAM can become slightly slower for increased batch sizes. SARAH and SVRG are limited in flexibility for learning rate with this data since L is large and the learning rate is \mathcal{O}(\frac{1}{L}). Both are descending, but are moving extremely slowly toward x^*, and the larger learning rate for SARAH hardly presents an advantage. We can see the benefit of the adaptive learning rate of ADAM for both Exp. A and B; it moves toward x^* much faster and continues to descend as iterations increase (||x_t-x^*|| vs. epochs).

Comparing SARAH and SVRG, we run the inner loop for half an epoch in Exp. A and for 30 epochs in Exp. B. The parameter choice in Exp. B adheres to SARAH's variance bound requirement \sigma_{m}\stackrel{def}{=}\frac{1}{\mu\eta(m+1)}+\frac{\eta L}{2-\eta L} < 1. Note that for this data, the number of iterations required for an \epsilon-accurate convergence with SARAH, for a reasonable choice of \epsilon, was prohibitive. In both Exp. A and B, ||x_t - x_{t-1}|| is clearly smoother for SARAH than SVRG; however, we observe that the number of effective passes over the data in the inner loop can matter. In both the ||Ax_t-b|| and ||x_t-x^*|| plots, SARAH strays away from the optimal solution due to bias after around 4 epochs. It is ultimately corrected by the second outer loop step, but we can see how the biased steps of the inner loop can cause SARAH to drift, whereas SVRG continues to descend. As is the reality of statistics, we are trading variance for bias.

Supplemental Resources

Bibliography

Arnold, S. M. R., Manzagol, P.-A., Babanezhad, R., Mitliagkas, I., & Roux, N. L. (2019). Reducing the variance in online optimization by transporting past gradients. https://arxiv.org/abs/1906.03532
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine Quality. UCI Machine Learning Repository. https://doi.org/https://doi.org/10.24432/C56S3T
Gower, R. M., Schmidt, M., Bach, F., & Richtarik, P. (2020). Variance-Reduced Methods for Machine Learning. https://arxiv.org/abs/2010.00892
Hastie, T., Tibshirani, R., Friedman, J., & Franklin, J. (2004). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Math. Intell., 27, 38. https://doi.org/10.1007/BF02985802
He, M., Liang, Y., Liu, J., & Xu, D. (2023). Convergence of Adam for Non-convex Objectives: Relaxed Hyperparameters and Non-ergodic Case.
Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, 315–323.
Kingma, D. P., & Ba, J. (2017). Adam: A Method for Stochastic Optimization.
Nguyen, L. M., Liu, J., Scheinberg, K., & Takáč, M. (2017). SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient. https://arxiv.org/abs/1703.00102
Wang, B., Zhang, Y., Zhang, H., Meng, Q., Ma, Z.-M., Liu, T.-Y., & Chen, W. (2022). Provable Adaptivity in Adam.

Reproducibility

All code for diagrams and accompanying paper is available here.

Appendix

Some supplemental information for the topics we have covered.

Convexity

Convex Function

A function f: \mathbb{R}^n \rightarrow \mathbb{R} is convex if it satisfies the following.

\begin{aligned} f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y) \text{, } \forall \alpha \in (0,1),x,y \in \mathbb{R}^n \end{aligned}
  • f is convex if and only if for all x, y,
\begin{aligned} f(y)\ge f(x)+\nabla f(x){}^{T}(y-x) \end{aligned}
  • f is strictly convex if and only if for all x \neq y,
\begin{aligned} f(y) > f(x)+\nabla f(x){}^{T}(y-x) \end{aligned}
  • f is \mu-strongly convex if and only if there exists a \mu>0 such that for all x, y,
\begin{aligned} f(y) \geq f(x)+\nabla f(x){}^{T}(y-x) + \frac{\mu}{2} ||x-y||^2 \end{aligned}
  • If the objective function f is convex, then any local minimum is a global minimum.
  • If the objective function f is strictly convex, then the minimum is unique.
  • If the objective function f is strongly convex, then the minimum is unique.

Return to

Lipschitz Continuity

For the following convex optimization problem,

\begin{aligned} \min_x f(x) \end{aligned}
  • f has a Lipschitz-continuous gradient if there exists L>0 such that
\begin{aligned} || \nabla f(x) - \nabla f(y) || \leq L ||x-y|| \text{, } \forall x,y \in \mathbb{R}^n \end{aligned}
  • If f is twice differentiable such that || \nabla^2 f(x) || \leq L for all x and some L>0, then f has a Lipschitz continuous gradient with constant L (a short numerical sketch for the least squares example follows below).
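
As a concrete example of the last point, for the least squares objective f(\theta)=||A\theta-b||^{2} the Hessian is the constant matrix 2A^{T}A, so a Lipschitz constant L for the gradient (and a strong convexity constant \mu, when A has full column rank) can be read off its extreme eigenvalues. A sketch with a random stand-in matrix:

```python
import numpy as np

# For f(theta) = ||A theta - b||^2 the Hessian is 2 A^T A: its largest eigenvalue
# gives a gradient Lipschitz constant L, and its smallest a strong convexity
# constant mu (positive when A has full column rank).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 11))
eigvals = np.linalg.eigvalsh(2 * A.T @ A)   # symmetric matrix, so eigvalsh applies
L, mu = eigvals.max(), eigvals.min()
print(L, mu)
```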

Return to

Consistency

Informally, a consistent estimator is one that converges (in probability) to the true value of the parameter as we use more and more data to estimate it.
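
A small illustration of the idea, using the sample mean of i.i.d. draws as the estimator (a toy example, not tied to the optimizers above):

```python
import numpy as np

# The sample mean is a consistent estimator of the true mean: its estimates
# concentrate around the true value as the sample size grows.
rng = np.random.default_rng(0)
true_mean = 3.0
for n in (10, 1_000, 100_000):
    sample = rng.normal(loc=true_mean, scale=2.0, size=n)
    print(n, sample.mean())
```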

Gradient Calculation

For simplicity, we are using the least squares loss objective function f.

\begin{aligned} f(\boldsymbol{\theta}) = ||A\boldsymbol{\theta}-\boldsymbol{b}||^{2} \end{aligned}

Recall that our data has p features and n total samples, so our matrix A has dimensions n \times p. Our labels b (the "response" vector) have a corresponding entry for each data sample, so its dimensions are n \times 1. Lastly, our parameter vector has a parameter for each of the p features, so its dimensions are p \times 1.

Full gradient calculation

\begin{aligned} \nabla f\left(\boldsymbol{\theta}\right)=\begin{bmatrix} a_{1,1} & a_{1,2} & ... & a_{1,p-1} & a_{1,p}\\ a_{2,1} & a_{2,2} & ... & a_{2,p-1} & a_{2,p}\\ ... & ... & ... & ... & ...\\ a_{n,1} & a_{n,2} & ... & a_{n,p-1} & a_{n,p} \end{bmatrix}^{T} \left(\begin{bmatrix} a_{1,1} & a_{1,2} & ... & a_{1,p-1} & a_{1,p}\\ a_{2,1} & a_{2,2} & ... & a_{2,p-1} & a_{2,p}\\ ... & ... & ... & ... & ...\\ a_{n,1} & a_{n,2} & ... & a_{n,p-1} & a_{n,p} \end{bmatrix} \begin{bmatrix} \theta_{1}\\ \theta_{2}\\ ...\\ \theta_{p} \end{bmatrix}-\begin{bmatrix}b_{1}\\ b_{2}\\ ...\\ b_{n} \end{bmatrix}\right)=\begin{bmatrix} \nabla_{full}f\left(\boldsymbol{\theta}\right)_{1}\\ \nabla_{full}f\left(\boldsymbol{\theta}\right)_{2}\\ ...\\ \nabla_{full}f\left(\boldsymbol{\theta}\right)_{p} \end{bmatrix} \end{aligned}

Return to

Stochastic gradient calculation

Select a random index i from \{ 1,...,n \}.

f(θ)=[a1,1a1,2...a1,p1a1,p...............ai,1ai,2...ai,p1ai,p...............an,1an,2...an,p1an,p]T([a1,1a1,2...a1,p1a1,p...............ai,1ai,2...ai,p1ai,p...............an,1an,2...an,p1an,p][θ1...θi...θp][b1...bi...bn])\begin{aligned} \nabla f\left(\boldsymbol{\theta}\right)=\begin{bmatrix} a_{1,1} & a_{1,2} & ... & a_{1,p-1} & a_{1,p}\\ ... & ... & ... & ... & ...\\ \color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}... & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\ ... & ... & ... & ... & ...\\ a_{n,1} & a_{n,2} & ... & a_{n,p-1} & a_{n,p} \end{bmatrix}^{T} \left(\begin{bmatrix} a_{1,1} & a_{1,2} & ... & a_{1,p-1} & a_{1,p}\\ ... & ... & ... & ... & ...\\ \color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}... & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\ ... & ... & ... & ... & ...\\ a_{n,1} & a_{n,2} & ... & a_{n,p-1} & a_{n,p} \end{bmatrix} \begin{bmatrix} \theta_{1}\\ ...\\ \color{green}\theta_{i}\\ ...\\ \theta_{p} \end{bmatrix}-\begin{bmatrix} b_{1}\\ ...\\ \color{green}b_{i}\\ ...\\ b_{n} \end{bmatrix}\right) \end{aligned}

Compute the stochastic gradient (approximation of full gradient).

\begin{aligned} \nabla f_{i}\left(\boldsymbol{\theta}\right)=\begin{bmatrix} a_{i,1} & a_{i,2} & ... & a_{i,p-1} & a_{i,p} \end{bmatrix}^{T} \left(\begin{bmatrix} a_{i,1} & a_{i,2} & ... & a_{i,p-1} & a_{i,p} \end{bmatrix} \begin{bmatrix} \theta_{1}\\ \theta_{2}\\ ...\\ \theta_{p} \end{bmatrix}-\begin{bmatrix} b_{i} \end{bmatrix}\right)=\begin{bmatrix} \nabla_{stochastic}f_{i}\left(\boldsymbol{\theta}\right)_{1}\\ \nabla_{stochastic}f_{i}\left(\boldsymbol{\theta}\right)_{2}\\ ...\\ \nabla_{stochastic}f_{i}\left(\boldsymbol{\theta}\right)_{p} \end{bmatrix} \end{aligned}

Return to

Batch gradient calculation

We first select a subset of indices I_k \subset \{ 1,...,n \} where |I_k|=b \ll n. In this example we select i, j, k.

f(θ)=[a1,1a1,2...a1,p1a1,p...............ai,1ai,2...ai,p1ai,p...............aj,1aj,2...aj,p1aj,p...............ak,1ak,2...ak,p1ak,p...............an,1an,2...an,p1an,p]T([a1,1a1,2...a1,p1a1,p...............ai,1ai,2...ai,p1ai,p...............aj,1aj,2...aj,p1aj,p...............ak,1ak,2...ak,p1ak,p...............an,1an,2...an,p1an,p][θ1...θi...θj...θk...θp][b1...bi...bj...bk...bn])\begin{aligned} \nabla f\left(\boldsymbol{\theta}\right)=\begin{bmatrix} a_{1,1} & a_{1,2} & ... & a_{1,p-1} & a_{1,p}\\ ... & ... & ... & ... & ...\\ \color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}... & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\ ... & ... & ... & ... & ...\\ \color{green}a_{j,1} & \color{green}a_{j,2} & \color{green}... & \color{green}a_{j,p-1} & \color{green}a_{j,p}\\ ... & ... & ... & ... & ...\\ \color{green}a_{k,1} & \color{green}a_{k,2} & \color{green}... & \color{green}a_{k,p-1} & \color{green}a_{k,p}\\ ... & ... & ... & ... & ...\\ a_{n,1} & a_{n,2} & ... & a_{n,p-1} & a_{n,p} \end{bmatrix}^{T} \left(\begin{bmatrix} a_{1,1} & a_{1,2} & ... & a_{1,p-1} & a_{1,p}\\ ... & ... & ... & ... & ...\\ \color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}... & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\ ... & ... & ... & ... & ...\\ \color{green}a_{j,1} & \color{green}a_{j,2} & \color{green}... & \color{green}a_{j,p-1} & \color{green}a_{j,p}\\ ... & ... & ... & ... & ...\\ \color{green}a_{k,1} & \color{green}a_{k,2} & \color{green}... & \color{green}a_{k,p-1} & \color{green}a_{k,p}\\ ... & ... & ... & ... & ...\\ a_{n,1} & a_{n,2} & ... & a_{n,p-1} & a_{n,p} \end{bmatrix} \begin{bmatrix} \theta_{1}\\ ...\\ \color{green}\theta_{i}\\ ...\\ \color{green}\theta_{j}\\ ...\\ \color{green}\theta_{k}\\ ...\\ \theta_{p} \end{bmatrix}-\begin{bmatrix} b_{1}\\ ...\\ \color{green}b_{i}\\ ...\\ \color{green}b_{j}\\ ...\\ \color{green}b_{k}\\ ...\\ b_{n} \end{bmatrix}\right) \end{aligned}

We compute the batch gradient.

\begin{aligned} \nabla f_{batch}\left(\boldsymbol{\theta}\right)=\begin{bmatrix} a_{i,1} & a_{i,2} & ... & a_{i,p-1} & a_{i,p} \\ a_{j,1} & a_{j,2} & ... & a_{j,p-1} & a_{j,p} \\ a_{k,1} & a_{k,2} & ... & a_{k,p-1} & a_{k,p} \end{bmatrix}^{T} \left(\begin{bmatrix} a_{i,1} & a_{i,2} & ... & a_{i,p-1} & a_{i,p} \\ a_{j,1} & a_{j,2} & ... & a_{j,p-1} & a_{j,p} \\ a_{k,1} & a_{k,2} & ... & a_{k,p-1} & a_{k,p} \end{bmatrix} \begin{bmatrix} \theta_{1}\\ \theta_{2}\\ ...\\ \theta_{p} \end{bmatrix}-\begin{bmatrix} b_{i} \\ b_{j} \\ b_{k} \end{bmatrix}\right)=\begin{bmatrix} \nabla_{avg}f_{batch}\left(\boldsymbol{\theta}\right)_{1}\\ \nabla_{avg}f_{batch}\left(\boldsymbol{\theta}\right)_{2}\\ ...\\ \nabla_{avg}f_{batch}\left(\boldsymbol{\theta}\right)_{p} \end{bmatrix} \end{aligned}

This produces the sum of the selected gradients; dividing by the batch size |I_k| gives the average of the selected gradients.
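
A compact numpy sketch of the three gradient computations above, written in the same un-normalized form as the matrix expressions (constant factors and averaging conventions vary between references); a small random system stands in for real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
theta = rng.standard_normal(p)

# Full gradient: uses every sample
full_grad = A.T @ (A @ theta - b)

# Stochastic gradient: one index chosen uniformly at random
i = rng.integers(n)
a_i = A[i]
stoch_grad = a_i * (a_i @ theta - b[i])

# Minibatch gradient: |I_k| = 3 indices (e.g. i, j, k), summed then averaged
batch = rng.choice(n, size=3, replace=False)
A_b, b_b = A[batch], b[batch]
batch_sum = A_b.T @ (A_b @ theta - b_b)
batch_avg = batch_sum / len(batch)
print(full_grad, stoch_grad, batch_avg)
```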

Return to