
Machine Learning Optimization - SARAH and ADAM Algorithms

Table of Contents

0: Introduction

1: Crash Course: Gradient Descent

2: Stochastic Gradient Descent

3: Variance Reduction Techniques

SARAH

SARAH, or StochAstic Recursive grAdient algoritHm, is a stochastic gradient descent method that utilizes recursive gradient information to reduce variance in the steps. The algorithm is a modification of SVRG (Stochastic Variance Reduced Gradient), which we will briefly cover and which is widely used as a benchmark for comparison against new gradient descent style methods. As with other methods we have seen so far, its core assumptions for convergence require a convex (or strongly convex) objective function with a Lipschitz continuous gradient. SARAH is a good example of an algorithm that is actually biased, but can be an improvement over SVRG if conditions are considered carefully.

Stochastic Variance Reduced Gradient (SVRG)

To understand the SARAH algorithm, we first need a conceptual understanding of SVRG, so we begin by presenting the SVRG algorithm.

SVRG algorithm

The algorithm consists of an outer loop and an inner loop, both with iterations specified as hyperparameters. We select the number of epochs $T$ (each "epoch" is one pass over all samples in the dataset) for the outer loop, and the number of inner loop iterations $m$.

The outer loop performs one full gradient computation $\tilde{g}$ at the current outer iterate $\tilde{\boldsymbol{\theta}}$ (starting from the initialized parameter $\tilde{\boldsymbol{\theta}}_0$), hence the outer loop "epochs". $\tilde{\boldsymbol{\theta}}$ is then used to initialize the inner loop iterate $\boldsymbol{\theta}_0$, and is fed into the inner loop along with $\tilde{g}$.

The inner loop then begins, using a stochastic gradient approach in combination with an iteratively updated gradient estimate $v_{t-1} = \nabla f_{i_{t}}(\boldsymbol{\theta}_{t-1}) - \nabla f_{i_{t}}(\tilde{\boldsymbol{\theta}}) + \tilde{g}$, utilizing the outer loop full gradient $\tilde{g}$, the stochastic gradient computed at the most recent parameter update $\nabla f_{i_{t}}(\boldsymbol{\theta}_{t-1})$, and the gradient of the same sample evaluated at the snapshot, $\nabla f_{i_{t}}(\tilde{\boldsymbol{\theta}})$. The gradient estimate is then used for a normal gradient descent style parameter update.

All parameter updates occur within the inner loop, and the resulting parameters from the inner loop get fed back into the outer loop as the new parameter initialization.
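
To make the loop structure concrete, here is a minimal NumPy sketch of SVRG as described above. The helper callables `full_grad(theta)` (average gradient over all samples) and `grad_i(theta, i)` (gradient of a single sample $f_i$), as well as the default hyperparameter values, are assumptions for illustration rather than part of any library.

```python
import numpy as np

def svrg(full_grad, grad_i, theta0, n, T=10, m=100, alpha=0.01, rng=None):
    """Minimal SVRG sketch: the outer loop computes a full-gradient snapshot,
    the inner loop takes variance-reduced stochastic steps."""
    rng = rng or np.random.default_rng(0)
    theta_tilde = theta0.copy()
    for _ in range(T):                      # outer loop ("epochs")
        g_tilde = full_grad(theta_tilde)    # full gradient at the snapshot
        theta = theta_tilde.copy()          # initialize the inner iterate
        for _ in range(m):                  # inner loop
            i = rng.integers(n)             # sample an index uniformly
            # variance-reduced gradient estimate
            v = grad_i(theta, i) - grad_i(theta_tilde, i) + g_tilde
            theta = theta - alpha * v       # gradient descent style update
        theta_tilde = theta                 # re-initialize the outer loop
    return theta_tilde
```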

SVRG Gradient Estimator

We want to analyze our gradient estimate, so we take its expectation conditioned on knowledge of $\boldsymbol{\theta}_{t-1}$.

$$\begin{aligned} \mathbb{E}[v_{t-1} \mid \boldsymbol{\theta}_{t-1}] &= \mathbb{E}[\nabla f_{i_{t}}(\boldsymbol{\theta}_{t-1}) - \nabla f_{i_{t}}(\tilde{\boldsymbol{\theta}}) + \tilde{g}] = \mathbb{E}[\nabla f_{i_{t}}(\boldsymbol{\theta}_{t-1})] - \mathbb{E}[\nabla f_{i_{t}}(\tilde{\boldsymbol{\theta}})] + \tilde{g} \\ &= \underbrace{\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\boldsymbol{\theta}_{t-1})}_{\nabla f(\boldsymbol{\theta}_{t-1})} - \underbrace{\frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\tilde{\boldsymbol{\theta}})}_{\tilde{g}} + \tilde{g} = \nabla f(\boldsymbol{\theta}_{t-1}) \end{aligned}$$

So, the SVRG gradient estimator for the inner loop updates is unbiased.

SVRG Variance Reduction

The variance reduction for SVRG is accomplished within the inner loops by the convergence of both $\boldsymbol{\theta}_{t}$ and $\tilde{\boldsymbol{\theta}}$ to the same optimal parameter $\boldsymbol{\theta}^{*}$. Omitting the detailed proof (Johnson & Zhang, 2013), the following holds for convex (and strongly convex) objective functions with Lipschitz continuous gradients.

$$\begin{aligned} \tilde{g} = \frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\tilde{\boldsymbol{\theta}}) \;\underset{\tilde{\boldsymbol{\theta}}\rightarrow\boldsymbol{\theta}^{*}}{\longrightarrow}\; \frac{1}{n}\sum_{i=1}^{n}\nabla f_{i}(\boldsymbol{\theta}^{*}) = 0 \end{aligned}$$

So if $\nabla f_{i}(\tilde{\boldsymbol{\theta}}) \rightarrow \nabla f_{i}(\boldsymbol{\theta}^*)$, then

$$\begin{aligned} v_{t-1} = \nabla f_{i_{t}}(\boldsymbol{\theta}_{t-1}) - \nabla f_{i_{t}}(\tilde{\boldsymbol{\theta}}) + \tilde{g} \;\underset{\tilde{\boldsymbol{\theta}}\rightarrow\boldsymbol{\theta}^{*},\ \boldsymbol{\theta}_{t-1}\rightarrow\boldsymbol{\theta}^{*}}{\longrightarrow}\; \nabla f_{i_{t}}(\boldsymbol{\theta}^{*}) - \nabla f_{i_{t}}(\boldsymbol{\theta}^{*}) \rightarrow 0 \end{aligned}$$

This means that the variance of the gradient estimates approaches zero as the parameter updates converge to the optimum $\boldsymbol{\theta}^{*}$, so the inner loops serve as a variance reduction technique. However, for the overall variance in the steps to decrease we must update $\tilde{\boldsymbol{\theta}}$ (so that $\nabla f_{i}(\tilde{\boldsymbol{\theta}}) \rightarrow \nabla f_{i}(\boldsymbol{\theta}^*)$), meaning that we must execute additional full gradient computations (epochs). We will revisit this point in the analysis of SARAH's variance reduction.

SVRG Convergence

For convex objective functions $f_i$ with Lipschitz continuous gradients, SVRG can be shown to converge at a rate of $\mathcal{O}(\frac{1}{T})$, which is an improvement over SGD's $\mathcal{O}(\frac{1}{\sqrt{T}})$. It also has better theoretical complexity, $\mathcal{O}(n\log\frac{1}{\epsilon} + \frac{L}{\epsilon})$, than SAG and SAGA ($\mathcal{O}(\frac{n+L}{\epsilon})$); note that complexity here is in terms of a specified tolerance $\epsilon$ rather than the number of iterations $T$.

For a $\mu$-strongly convex objective function, SVRG satisfies the following convergence result.

Assume $m$ is sufficiently large such that the following holds.

$$\begin{aligned} \beta_m = \frac{1}{\mu \alpha (1-2L\alpha)m} + \frac{2L\alpha}{1-2L\alpha} < 1 \end{aligned}$$

then,

$$\begin{aligned} \mathbb{E}[f(\tilde{\boldsymbol{\theta}}_s)] - f(\boldsymbol{\theta}^*) \leq \beta_m^{\,s}\left(f(\tilde{\boldsymbol{\theta}}_0) - f(\boldsymbol{\theta}^*)\right) \end{aligned}$$

The complexity is $\mathcal{O}((n+\frac{L}{\mu})\log(\frac{1}{\epsilon}))$, which is comparable to SAG/SAGA. This also gives us criteria for selecting the inner loop size $m$ and learning rate $\alpha$.
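
As a quick illustration of using this criterion, the snippet below evaluates $\beta_m$ for some made-up values of $\mu$, $L$, $\alpha$, and $m$; the numbers are illustrative, not prescriptive.

```python
# Illustrative check of the SVRG convergence criterion beta_m < 1.
# mu, L, alpha, and m below are made-up values for demonstration only.
mu, L = 1.0, 10.0          # strong convexity and Lipschitz constants
alpha = 0.1 / L            # learning rate, O(1/L)
m = 2000                   # inner loop size

beta_m = 1.0 / (mu * alpha * (1 - 2 * L * alpha) * m) \
         + (2 * L * alpha) / (1 - 2 * L * alpha)
print(beta_m)              # ~0.31 here, so the bound beta_m < 1 is satisfied
```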

Unlike SGD, we do not need a decaying learning rate for SVRG to achieve a comparable convergence rate. The decaying learning rate required for SGD is a consequence of the variance in its steps (which can even be unbounded), a good example of how variance can impact convergence rate; SVRG explicitly reduces the variance in the steps of SGD.

The SARAH Algorithm

Methodology in this section is drawn from SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient (Nguyen et al., 2017).


SARAH algorithm

SARAH is very closely related to SVRG. It utilizes the same outer/inner loop approach, which functionally serves the same purpose: the outer loop is used to compute a full gradient (a "snapshot" of the full gradient), and the inner loop performs SGD style steps with a variance reduction design. The key difference between SVRG and SARAH is in the gradient estimate update step in the inner loop.

SVRG's gradient estimate is designed to use the outer loop "snapshot" to generate an unbiased gradient estimate at each step, while simultaneously reducing the variance that is a consequence of SGD. The "snapshot" is what stabilizes the algorithm in terms of bias (the gradient estimate is unbiased) and allows the variance in the stochastic steps to be reduced using the parameters from which the outer loop was initialized.

SARAH's gradient estimate, however, leans more heavily into the variance-reduction aspect of the algorithm by utilizing recursive gradient information to reduce variance, rather than solely using the gradient estimate from the outer loops. This improves the stochastic variance reduction power of the algorithm, but at the expense of introducing bias.

SARAH Gradient Estimator

The gradient estimate (also referred to as the "SARAH update") utilizes a similar framework to SVRG. Recall the SVRG gradient estimate.

The SARAH estimator modifies the gradient estimate by replacing the estimate terms from the outer loop initialization with a recursive gradient update.

$$\begin{aligned} v_{t-1} = \nabla f_{i_{t}}(\boldsymbol{\theta}_{t-1}) \underbrace{-\; \nabla f_{i_{t}}(\tilde{\boldsymbol{\theta}}) + \tilde{g}}_{\text{outer loop terms}} \end{aligned}$$

$$\begin{aligned} v_{t} = \nabla f_{i_{t}}(\boldsymbol{\theta}_{t}) - \nabla f_{i_{t}}(\boldsymbol{\theta}_{t-1}) + v_{t-1} \end{aligned}$$

We note that there is a shift in index as we now need to use both the most recent random gradient as well as the random gradient from the previous step. This also means that we must perform one parameter update in the outer loop before we enter the inner loop.

Each gradient estimate depends on every previous gradient estimate from the inner loop, hence a "recursive" update framework, as opposed to each update using only the outer loop initialization.
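
For comparison with the SVRG sketch above, here is a minimal sketch of SARAH under the same assumed `full_grad`/`grad_i` helpers. Note the single parameter update in the outer loop before the inner loop begins, and that the estimate $v$ is updated recursively rather than rebuilt from the snapshot; restarting the outer loop from the last inner iterate is one simple choice made here for illustration.

```python
import numpy as np

def sarah(full_grad, grad_i, theta0, n, T=10, m=100, alpha=0.01, rng=None):
    """Minimal SARAH sketch: the inner-loop estimate v_t is built recursively
    from v_{t-1} instead of from the outer-loop snapshot."""
    rng = rng or np.random.default_rng(0)
    theta_prev = theta0.copy()
    for _ in range(T):                        # outer loop
        v = full_grad(theta_prev)             # full gradient, v_0
        theta = theta_prev - alpha * v        # one update before the inner loop
        for _ in range(m):                    # inner loop
            i = rng.integers(n)
            # recursive (SARAH) gradient estimate
            v = grad_i(theta, i) - grad_i(theta_prev, i) + v
            theta_prev, theta = theta, theta - alpha * v
        theta_prev = theta                    # re-initialize the outer loop
    return theta_prev
```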

Let's take the conditional expectation of the gradient estimate, given all previous iterates of $\boldsymbol{\theta}_{t}$. We will call this $\mathcal{F}_t$, the $\sigma$-algebra generated by the initial $\boldsymbol{\theta}_{0}$ and all subsequent randomly chosen indices for the inner loop, $\mathcal{F}_t = \sigma(\boldsymbol{\theta}_{0}, i_1, \ldots, i_{t-1})$. Note that $\mathbb{E}[v_{t} \mid \mathcal{F}_t] = \mathbb{E}_{i_t}[v_{t}]$.

$$\begin{aligned} \mathbb{E}_{i_t}[v_{t}] = \nabla f(\boldsymbol{\theta}_{t}) - \nabla f(\boldsymbol{\theta}_{t-1}) + v_{t-1} \neq \nabla f(\boldsymbol{\theta}_{t}) \end{aligned}$$

The steps are omitted, but the result is notable as it is not an unbiased estimator for the gradient. However, the total expectation holds.

$$\begin{aligned} \mathbb{E}[v_t] = \nabla f(\boldsymbol{\theta}_{t}) \end{aligned}$$

So, an exact full pass over the data for an inner loop is unbiased.

SARAH Variance Reduction

Recall for SVRG that in order for the overall variance in the steps to decrease, we must execute additional outer loops; in order for $\nabla f_{i}(\tilde{\boldsymbol{\theta}}) \rightarrow \nabla f_{i}(\boldsymbol{\theta}^*)$ to hold, we must update $\tilde{\boldsymbol{\theta}}$ via an outer loop computation.

In the analysis of SVRG we showed that stochastic gradients are unbiased estimators for the full gradient, and that $\nabla f_{i_{t}}(\boldsymbol{\theta}^{*}) - \nabla f_{i_{t}}(\boldsymbol{\theta}^{*}) \rightarrow 0$. Since the inner loop terms only utilize stochastic gradients for the gradient estimate, the inner loops will reduce variance without requiring an outer loop, as there are no terms in the estimator dependent upon the outer loop update.

SARAH's variance bound in the $\mu$-strongly convex case can be shown to be smaller than the corresponding SVRG bound for the same objective function.

$$\begin{aligned} \sigma_{m} \overset{\text{def}}{=} \frac{1}{\mu\eta(m+1)} + \frac{\eta L}{2-\eta L} \;<\; \frac{1}{\mu\eta(1-2L\eta)m} + \frac{1}{\frac{1}{2\eta L}-1} = \beta_m \end{aligned}$$

The variance reduction of the inner loops comes at the expense of bias; SARAH reduces variance in the inner loops but now has bias that is itself stochastic. This bias can be somewhat "corrected", however, by executing an outer loop to reinitialize the parameters, pointing them back in the correct direction. We will examine an intelligent way to execute outer loops only when necessary using SARAH+.

SARAH Convergence

For convex objective functions $f_i$ with Lipschitz continuous gradients, SARAH can be shown to converge at a rate of $\mathcal{O}((n+\frac{1}{\epsilon})\log(\frac{1}{\epsilon}))$. Better conditioned $f_i$ that are $\mu$-strongly convex converge at a rate of $\mathcal{O}((n+\frac{L}{\mu})\log(\frac{1}{\epsilon}))$.

Comparing these rates directly to SVRG,

| Convexity | SVRG | SARAH |
| --- | --- | --- |
| Convex | $\mathcal{O}(n\log\frac{1}{\epsilon} + \frac{L}{\epsilon})$ | $\mathcal{O}((n+\frac{1}{\epsilon})\log(\frac{1}{\epsilon}))$ |
| Strongly convex | $\mathcal{O}((n+\frac{L}{\mu})\log(\frac{1}{\epsilon}))$ | $\mathcal{O}((n+\frac{L}{\mu})\log(\frac{1}{\epsilon}))$ |

Similar to SVRG, we must select our inner loop size $m$ and learning rate such that the following holds.

$$\begin{aligned} \sigma_{m} \overset{\text{def}}{=} \frac{1}{\mu\eta(m+1)} + \frac{\eta L}{2-\eta L} < 1 \end{aligned}$$

Recall from above that the variance bound for SARAH in the strongly convex case is smaller than that of SVRG. So, while their convergence complexity is the same under strong convexity, SARAH can use a larger learning rate than SVRG: SVRG must use a learning rate of $\eta < \frac{1}{4L}$, whereas SARAH can use $\eta < \frac{1}{L}$.

SARAH and SVRG zoomed
SARAH and SVRG run with the same inner and outer loops

We revisit a side by side comparison of SARAH and SVRG, both run with the same number of inner and outer loops with the same hyperparameter settings. The first image is zoomed in to highlight how each algorithm handles variance reduction. SARAH has very little variance, but requires outer loop executions to correct for its inner loop bias (outer loops happen at the "kinks"). SVRG starts moving toward the solution sooner, but has higher variance in its steps until we execute more outer loops, and it smooths as more outer loops are executed (our gradient estimate becomes better). In the same number of iterations, however, they are fairly comparable. This is only a very simple example to illustrate their general behaviors, but reinforces that they are very similar algorithms in nature and performance. So, there are some advantages to SARAH over SVRG, but not necessarily overwhelmingly so.

SARAH+ Algorithm

SARAH+ algorithm

Nguyen et al. (2017) present this variant of the SARAH algorithm, which provides an automatic and adaptive inner loop size. Numerical experiments in the paper indicate that a choice of $\gamma = 1/8$ yields benefits in convergence rate, but the authors do not elaborate on any supporting theory.

This modification introduces an automated methodology for breaking the inner loop, so that the inner loop size $m$ can be adaptive. It also serves as an automated correction for bias in the inner loops, triggering the execution of a full outer loop gradient.
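
As a rough sketch of the idea (not the authors' exact pseudocode), the inner loop below is assumed to terminate once $\|v_t\|^2$ falls below $\gamma\|v_0\|^2$, at which point control returns to the outer loop for a fresh full gradient; the helper names and default values are illustrative.

```python
import numpy as np

def sarah_plus_inner(full_grad, grad_i, theta_prev, n, gamma=1/8,
                     alpha=0.01, m_max=10_000, rng=None):
    """Sketch of a SARAH+ style inner loop with an adaptive stopping rule
    (assumed form: stop when ||v_t||^2 <= gamma * ||v_0||^2, or after m_max steps)."""
    rng = rng or np.random.default_rng(0)
    v = full_grad(theta_prev)                 # v_0 from the outer loop
    v0_sq = np.dot(v, v)
    theta = theta_prev - alpha * v            # one update before the inner loop
    for _ in range(m_max):
        i = rng.integers(n)
        v = grad_i(theta, i) - grad_i(theta_prev, i) + v
        theta_prev, theta = theta, theta - alpha * v
        if np.dot(v, v) <= gamma * v0_sq:     # adaptive break condition
            break
    return theta                              # fed back as the next outer iterate
```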

Applications of SARAH

In practice, SARAH can be a direct substitute for SVRG with only some updated parameter tuning considerations, so it can be utilized for a very large variety of applications as an SVRG replacement. However, it does not seem to be used as widely in practice as SVRG. While speculation as to why is not worthwhile, it is noteworthy that SARAH's variance reduction technique utilizes a biased methodology, whereas SVRG's is unbiased. SVRG is a widely utilized benchmark algorithm, and many machine learning packages have SVRG built in; the same cannot be said for SARAH, and while it has theoretical advantages, it does not seem to have penetrated the infrastructure of optimization algorithms in practice. Still, there is a fair amount of research available on further modifications of SARAH.

ADAM

Methodology in this section is drawn from Adam: A Method for Stochastic Optimization (Kingma & Ba, 2017).

ADAM is a widely popular optimizer in machine learning. It is a staple for deep learning, neural network, natural language processing, and generative adversarial network applications, to name a few. ADAM has some unique strengths compared to other optimizers: it is fast, requires little hyperparameter tuning, performs well with sparse gradients, and can be applied to stochastic objective functions. It is also an adaptive algorithm, adjusting its step sizes according to the data it has seen to speed up convergence.

The ADAM optimizer is a stochastic/batch gradient descent variant that utilizes two techniques that leverage previous gradient information (conceptually adjacent to the recursion we saw in SARAH): Momentum and RMSProp (Root Mean Squared Propagation). Momentum acts to dampen oscillations by using moving averages of previous gradients to decrease the variance in the steps. RMSProp adaptively scales the learning rate based upon an exponentially weighted average of the magnitude of recent gradients, so it can "speed up" and "slow down" such that it does not diverge from the local data. Additionally, ADAM can utilize batch gradients rather than only individual stochastic gradients, which acts as variance reduction by increasing sample size.

The Algorithm

ADAM algorithm

NOTE

$g_t^2$ denotes the element-wise multiplication of the gradient with itself, $g_t \odot g_t$.
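
Below is a minimal NumPy sketch of the ADAM loop described by the algorithm above, using the default hyperparameters suggested in the paper ($\alpha=0.001$, $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$); the callable `grad_fn` returning a (mini)batch gradient is an assumption for illustration.

```python
import numpy as np

def adam(grad_fn, theta0, n_steps=1000, alpha=0.001,
         beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimal ADAM sketch: momentum (m), RMSProp-style scaling (v),
    and bias correction of both moment estimates."""
    theta = theta0.copy()
    m = np.zeros_like(theta)              # first moment estimate, m_0 = 0
    v = np.zeros_like(theta)              # second moment estimate, v_0 = 0
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)                        # (mini)batch gradient g_t
        m = beta1 * m + (1 - beta1) * g           # momentum update
        v = beta2 * v + (1 - beta2) * g**2        # element-wise squared gradients
        m_hat = m / (1 - beta1**t)                # bias-corrected first moment
        v_hat = v / (1 - beta2**t)                # bias-corrected second moment
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta
```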

Variance Reduction

Momentum

Polyak's momentum is a technique that uses both the current gradient and the history of gradients from previous steps to accelerate convergence. The momentum term is incorporated into gradient descent style algorithms in general, and uses a hyperparameter $\beta \in [0,1)$ to scale its weight.

$$\begin{aligned} \theta_{k+1} = \theta_{k} - \alpha\nabla f(\theta_k) + \beta(\theta_{k} - \theta_{k-1}) \end{aligned}$$

If we expand this term,

$$\begin{aligned} \theta_{k+1} = \theta_{k} - \alpha\nabla f(\theta_k) + \beta\left(\theta_{k-1} - \alpha\nabla f(\theta_{k-1}) + \beta(\theta_{k-1} - \theta_{k-2}) - \theta_{k-1}\right) \end{aligned}$$

$$\begin{aligned} = \theta_{k} - \alpha\nabla f(\theta_k) - \beta\alpha\nabla f(\theta_{k-1}) + \underbrace{\beta^2(\theta_{k-1} - \theta_{k-2})}_{\text{higher order } \beta \text{ small}} \end{aligned}$$

If these recent gradients are similar, the algorithm gets an additional "boost", or "momentum". $\theta_{k-1}$ and so on can be expanded similarly, and we get higher order $\beta$ terms as the algorithm iterates. This favors the most recent gradients, since $\beta \in [0,1)$ and the higher order terms quickly dampen out previous iterates.
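
A small sketch of the heavy-ball update above; `grad_fn` and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def heavy_ball(grad_fn, theta0, alpha=0.1, beta=0.9, n_steps=100):
    """Polyak (heavy-ball) momentum: reuse the previous step direction
    scaled by beta in addition to the current gradient step."""
    theta_prev = theta0.copy()
    theta = theta0.copy()
    for _ in range(n_steps):
        step = -alpha * grad_fn(theta) + beta * (theta - theta_prev)
        theta_prev, theta = theta, theta + step
    return theta
```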

Momentum
Compare ADAM momentum

We see the "boost" in the animation for the ADAM algorithm compared to other stochastic methods, moving more rapidly toward the solution.

In ADAM, the momentum term is written as follows.

$$\begin{aligned} m_{t} = \beta_{1}\cdot m_{t-1} + (1-\beta_{1})\cdot g_{t} \end{aligned}$$

Momentum helps to accelerate convergence, particularly over the contours of objective functions, while also acting as variance reduction in the steps. The technical details of variance reduction in stochastic methods are a little more involved and are covered in Arnold et al. (2019). For an in-depth and beautifully interactive blog post about momentum, I also suggest this post.

Minibatches

Earlier in this post we covered the variance reduction properties of stochastic minibatches. ADAM utilizes minibatches, and the batch size is a tunable hyperparameter. This gives us the ability to use better gradient approximations for our steps and balance convergence with computation time.
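
As a small illustration, the sketch below forms a minibatch gradient estimate for a least squares objective (the same form used in the numerical experiments later); the arrays `A` and `b`, the batch size, and the function name are assumed for demonstration.

```python
import numpy as np

def minibatch_grad(A, b, theta, batch_size, rng):
    """Average the per-sample gradients of 0.5*||A theta - b||^2 over a
    uniformly sampled minibatch of rows."""
    idx = rng.choice(A.shape[0], size=batch_size, replace=False)
    A_b, b_b = A[idx], b[idx]
    return A_b.T @ (A_b @ theta - b_b) / batch_size
```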

RMSProp

Root mean squared propagation (RMSProp) is a technique that uses a moving average of the magnitude of recent squared gradients to normalize the gradient estimate in the descent algorithm. Its implementation in descent algorithms makes the learning rate adaptive to the data, rather than treating it as a constant tunable hyperparameter. In essence, the learning rate will speed up or slow down depending on recent gradients.

The RMSProp update is similar to momentum, using the squared gradient term.

$$\begin{aligned} v_{t} = \beta_{2}\cdot v_{t-1} + (1-\beta_{2})\cdot g_{t}^2 \end{aligned}$$

And the update rule in a gradient descent algorithm is written as follows.

$$\begin{aligned} \theta_{t} = \theta_{t-1} - \alpha\,\frac{\nabla f(\theta_{t-1})}{\sqrt{v_{t}}+\epsilon} \end{aligned}$$

The $\epsilon$ term only serves to stabilize the algorithm as $v_t$ becomes small (the steps would become large if the denominator were small), preventing overshooting close to the optimum.
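
A minimal sketch of a standalone RMSProp loop matching the update above; `grad_fn` and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def rmsprop(grad_fn, theta0, alpha=0.01, beta2=0.9, eps=1e-8, n_steps=1000):
    """RMSProp sketch: scale each step by a running average of squared gradients."""
    theta = theta0.copy()
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta)
        v = beta2 * v + (1 - beta2) * g**2        # moving average of squared grads
        theta = theta - alpha * g / (np.sqrt(v) + eps)
    return theta
```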

Bias and Bias Correction

The bias-correction steps in ADAM are enhancements to the momentum and RMSProp algorithms (which are themselves biased) and are credited with its advantageous performance on sparse data. ADAM is biased toward its initialization, which in general should be chosen to be the $\mathbf{0}$ vector.

We will follow the derivation of the bias terms as in Kingma & Ba (2017).

Our momentum term $m_t$, or the first moment estimate, is meant to estimate the gradient $g_t$, so we want to find the expectation of the first moment estimate, $\mathbb{E}[m_t]$.

Let's first consider running ADAM for $T$ timesteps, and let $g_1, \ldots, g_T$ be the gradients at each timestep. We assume the gradients are distributed as $g_t \sim p(g_t)$. At timestep $t$, the first moment estimate term is the following.

$$\begin{aligned} m_t = \beta_1\cdot m_{t-1} + (1-\beta_1)\cdot g_t \end{aligned}$$

We initialize the first moment estimate by setting $m_0 = \mathbf{0}$. Expanding this term by substituting $m_{t-1}$ at each previous timestep, we get

$$\begin{aligned} m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}\cdot g_i \end{aligned}$$

We can then take the expectation of $m_t$.

$$\begin{aligned} \mathbb{E}[m_{t}] = \mathbb{E}\left[(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}\cdot g_i\right] \end{aligned}$$

We can rewrite $\sum_{i=1}^{t}\beta_1^{t-i}\cdot g_i$ by adding and subtracting $g_t$ inside the sum, $\sum_{i=1}^{t}\beta_1^{t-i}\cdot(g_i + g_t - g_t)$, which gives the following.

$$\begin{aligned} \sum_{i=1}^{t}\beta_1^{t-i}\cdot g_t + \underbrace{\sum_{i=1}^{t}\beta_{1}^{t-i}\cdot(g_{i}-g_{t})}_{\zeta} \end{aligned}$$

Recent gradients will be close to $g_t$, and gradients further in the past are quickly dampened by the $\beta_1^{t-i}$ factor, so the $\zeta$ term can be treated as small. Our expectation can then be written as the following.

$$\begin{aligned} \mathbb{E}[m_{t}] = \mathbb{E}\left[(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i}\cdot g_t + \zeta\right] \end{aligned}$$

$$\begin{aligned} = \mathbb{E}[g_t]\cdot(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} + \zeta \end{aligned}$$

Expanding the sum and cancelling the $\beta_1$ terms (the geometric series gives $(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} = 1-\beta_1^t$) results in the following.

$$\begin{aligned} \mathbb{E}[m_{t}] = \mathbb{E}[g_t]\cdot(1-\beta_1^t) + \zeta \end{aligned}$$

The calculation steps for the second moment estimate follow identical logic, and we get the following expectation for $v_t$.

$$\begin{aligned} \mathbb{E}[v_{t}] = \mathbb{E}[g_{t}^{2}]\cdot(1-\beta_{2}^{t}) + \zeta \end{aligned}$$

Clearly, $m_t$ is a biased estimator for $g_t$, and $v_t$ is biased for $g_t^2$.

Since the moments are initialized to zero, we can approximately correct for the bias in each term by dividing the first and second moment estimates by $(1-\beta_{1}^t)$ and $(1-\beta_{2}^t)$ respectively, giving approximately $\mathbb{E}[\hat{m}_{t}] = \mathbb{E}[g_{t}]$ and $\mathbb{E}[\hat{v}_{t}] = \mathbb{E}[g_{t}^{2}]$.
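
To see why the correction matters early on, consider $t=1$ with the common choice $\beta_1 = 0.9$ (the values below are illustrative): $m_1 = 0.1\,g_1$, so dividing by $(1-\beta_1^1) = 0.1$ recovers $\hat{m}_1 = g_1$; without the correction the first steps would be roughly an order of magnitude too small.

```python
import numpy as np

beta1, g1 = 0.9, np.array([1.0, -2.0])   # illustrative values
m1 = beta1 * 0 + (1 - beta1) * g1        # m_0 = 0, so m_1 = 0.1 * g_1
m1_hat = m1 / (1 - beta1**1)             # bias correction at t = 1
print(m1, m1_hat)                        # [ 0.1 -0.2 ]  vs.  [ 1. -2. ]
```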

ADAM Convergence

ADAM requires only a basic convexity assumption on the objective function, and a similar but more relaxed assumption than Lipschitz continuity called $(L_0, L_1)$ smoothness (Wang et al., 2022). Foregoing the details, ADAM converges at a rate of $\mathcal{O}(\frac{1}{\sqrt{T}})$ (the detailed proof of ADAM's convergence can be found in the original paper).

Recall that this is the same theoretical convergence rate as SGD; however, ADAM is able to use a fixed learning rate rather than a diminishing step size. While its rate is less desirable than those of other algorithms we have seen, it is still a widely desirable algorithm for its performance in practice.

Applications of ADAM

ADAM is a very widely used optimizer, particularly for neural network, deep learning, and natural language processing applications, amongst others. One of its advantages is how well it handles sparse data, which is commonly characteristic of the data for these types of tasks. While it is not the fastest method in terms of convergence, the methodologies it utilizes balance stochastic variance while also addressing inherent estimator bias, making it a trustworthy workhorse.

There is existing infrastructure for ADAM among some common machine learning tools like Keras, scikit-learn, PyTorch, and TensorFlow, to name some prolific packages in industry and research applications, so it has been embraced by the community as an effective optimizer. Overall it is a robust optimizer with reliable performance for convex (as well as nonconvex, He et al. (2023)) objectives.

Example Numerical Experiments

NOTE

This section was lifted directly from my coursework. Note that this section uses $x$ for the model parameter vector instead of the $\theta$ we have been using previously (different choices may be seen in practice).

The purpose of the numerical experiment is to evaluate the general performance of ADAM and SARAH on real data and to illustrate a bias-variance tradeoff between SVRG and SARAH. We use the "Wine Quality" dataset from the UC Irvine Machine Learning Repository (Cortez et al., 2009). The data are measurements of wine characteristics (features) used to predict wine quality scores. There are $n=6497$ instances and $m=11$ features. The experiments use a simple least squares linear regression objective, assuming a system with data $A$ and response $b$ of the form $Ax=b$.

$$\begin{aligned} \underset{x}{\min}\;\|Ax-b\|^{2} \end{aligned}$$

The least squares objective was chosen because it is strongly convex. $x^*$ was computed separately using scipy.optimize.lsq_linear with the tolerance set to 1e-10.
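
For reference, here is a sketch of how $x^*$ could be computed; only the scipy.optimize.lsq_linear call mirrors what is described above, while the data-loading step (file name, separator, column names) is an illustrative assumption about the UCI CSV layout.

```python
import pandas as pd
from scipy.optimize import lsq_linear

# Illustrative loading step: assumes the UCI wine-quality data have been
# combined into one local semicolon-separated CSV; the path is a placeholder.
df = pd.read_csv("winequality.csv", sep=";")
A = df.drop(columns=["quality"]).to_numpy()   # feature matrix (n x 11)
b = df["quality"].to_numpy()                  # wine quality scores

# Reference solution x* for min_x ||Ax - b||^2, as described in the text.
res = lsq_linear(A, b, tol=1e-10)
x_star = res.x
```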

Numerical Experiment Figure 1
Numerical Experiment Figure 2

See Figures 1 and 2 for settings and computation time. The two examples chosen illustrate some high level takeaways. In terms of per-iteration computational performance, both ADAM and SARAH are virtually identical to SVRG, though ADAM can become slightly slower for increased batch sizes. SARAH and SVRG are limited in flexibility for the learning rate with this data, since $L$ is large and the learning rate is $\mathcal{O}(\frac{1}{L})$. Both are descending, but are moving extremely slowly toward $x^*$, and the larger learning rate for SARAH hardly presents an advantage. We can see the benefit of the adaptive learning rate of ADAM in both Exp. A and B; it moves toward $x^*$ much faster and continues to descend as iterations increase ($\|x_t - x^*\|$ vs. epochs).

Comparing SARAH and SVRG, we run the inner loop for half an epoch in Exp. A and for 30 epochs in Exp. B. The parameter choice in Exp. B adheres to SARAH's variance bound requirement $\sigma_{m} \overset{\text{def}}{=} \frac{1}{\mu\eta(m+1)} + \frac{\eta L}{2-\eta L} < 1$. Note that for this data, the number of iterations required for an $\epsilon$-accurate convergence with SARAH, for a reasonable choice of $\epsilon$, was prohibitive. In both Exp. A and B, $\|x_t - x_{t-1}\|$ is clearly smoother for SARAH than for SVRG; however, we observe that the number of effective passes over the data in the inner loop can matter. In both the $\|Ax_t - b\|$ and $\|x_t - x^*\|$ plots, SARAH strays away from the optimal solution due to bias after around 4 epochs. It is ultimately corrected by the second outer loop step, but we can see how the biased steps of the inner loop can cause SARAH to drift, whereas SVRG continues to descend. As is the reality of statistics, we are trading variance for bias.

Arnold, S. M. R., Manzagol, P.-A., Babanezhad, R., Mitliagkas, I., & Roux, N. L. (2019). Reducing the variance in online optimization by transporting past gradients. https://arxiv.org/abs/1906.03532
Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Wine Quality. UCI Machine Learning Repository. https://doi.org/10.24432/C56S3T
He, M., Liang, Y., Liu, J., & Xu, D. (2023). Convergence of Adam for Non-convex Objectives: Relaxed Hyperparameters and Non-ergodic Case.
Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, 315–323.
Kingma, D. P., & Ba, J. (2017). Adam: A Method for Stochastic Optimization. https://arxiv.org/abs/1412.6980
Nguyen, L. M., Liu, J., Scheinberg, K., & Takáč, M. (2017). SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient. https://arxiv.org/abs/1703.00102
Wang, B., Zhang, Y., Zhang, H., Meng, Q., Ma, Z.-M., Liu, T.-Y., & Chen, W. (2022). Provable Adaptivity in Adam.