### Author: Lili Jiang

Step-by-step illustration of gradient descent algorithm.

# Momentum

The gradient descent with momentum algorithm (or Momentum for short) borrows the idea from physics. Imagine rolling down a ball inside of a frictionless bowl. Instead of stopping at the bottom, the momentum it has accumulated pushes it forward, and the ball keeps rolling back and forth.

We can apply the concept of momentum to our vanilla gradient descent algorithm. In each step, in addition to the regular gradient, it also adds on the movement from the previous step. Mathematically, it is commonly expressed as:

delta = – learning_rate * gradient + previous_delta * decay_rate (eq. 1)

theta += delta (eq. 2)

I found it more intuitive if I massage this equation a little and keep track of the (decayed) cumulative sum of gradient instead. This will also make things easier when we introduce the Adam algorithm later.

delta = -learning_rate * sum_of_gradient (eq. 4)

theta += delta (eq. 5)

(What I did was factoring out -learning_rate. To see the mathematical equivalence, you can substitute delta with -learning_rate * sum_of_gradient in eq. 1 to get eq. 3.)

Step-by-step illustration of momentum descent. Watch live animation in the app. For the rest of this post, I sloppily use gradient x and gradient y in the visualization; in reality, because it’s gradient *descent*, it’s actually the negative of the gradient.

Let’s consider two extreme cases to understand this decay rate parameter better. If the decay rate is 0, then it is exactly the same as (vanilla) gradient descent. If the decay rate is 1 (and provided that the learning rate is reasonably small), then it rocks back and forth endlessly like the frictionless bowl analogy we mentioned in the beginning; you do not want that. Typically the decay rate is chosen around 0.8–0.9 — it’s like a surface with a little bit of friction so it eventually slows down and stops.

This property allows AdaGrad (and other similar gradient-squared-based methods like RMSProp and Adam) to escape a saddle point much better. AdaGrad will take a straight path, whereas gradient descent (or relatedly, Momentum) takes the approach of “let me slide down the steep slope first and maybe worry about the slower direction later”. Sometimes, vanilla gradient descent might just stop at the saddle point where gradients in both directions are 0 and be perfectly content there.

# RMSProp

The problem of AdaGrad, however, is that it is incredibly slow. This is because the sum of gradient squared only grows and never shrinks. RMSProp (for Root Mean Square Propagation) fixes this issue by adding a decay factor.
theta += delta

More precisely, the sum of gradient squared is actually the decayed sum of gradient squared. The decay rate is saying only recent gradient² matters, and the ones from long ago are basically forgotten. As a side note, the term “decay rate” is a bit of a misnomer. Unlike the decay rate we saw in momentum, in addition to decaying, the decay rate here also has a scaling effect: it scales down the whole term by a factor of (1 – decay_rate). In other words, if the decay_rate is set at 0.99, in addition to decaying, the sum of gradient squared will be sqrt(1 – 0.99) = 0.1 that of AdaGrad, and thus the step is on the order of 10x larger for the same learning rate.

To see the effect of the decaying, in this head-to-head comparison, AdaGrad white) keeps up with RMSProp (green) initially, as expected with the tuned learning rate and decay rate. But the sums of gradient squared for AdaGrad accumulate so fast that they soon become humongous (demonstrated by the sizes of the squares in the animation). They take a heavy toll and eventually AdaGrad practically stops moving. RMSProp, on the other hand, has kept the squares under a manageable size the whole time, thanks to the decay rate. This makes RMSProp faster than AdaGrad.

Last but not least, Adam (short for Adaptive Moment Estimation) takes the best of both worlds of Momentum and RMSProp. Adam empirically works well, and thus in recent years, it is commonly the go-to choice of deep learning problems.

Let’s take a look at how it works:

theta += delta

Beta1 is the decay rate for the first moment, sum of gradient (aka momentum), commonly set at 0.9. Beta 2 is the decay rate for the second moment, sum of gradient squared, and it is commonly set at 0.999.

Adam gets the speed from momentum and the ability to adapt gradients in different directions from RMSProp. The combination of the two makes it powerful.

# Closing Words

Now that we have discussed all the methods, let’s watch a few races of all the descent methods we talked about so far! (There is some inevitable cherry-picking of parameters. The best way to get a taste is to play around yourself.)

In this terrain, there are two little hills blocking the way to the global minimum. Adam is the only one able to find its way to the global minimum. Whichever way the parameters are tuned, from this starting position at least, none of the other methods can get there. This means neither momentum nor adaptive gradient alone can do the trick. It’s really the combination of the two: first, momentum takes Adam beyond the local minimum where all the other balls stop; then, the adjustment from the sum of gradient squared pulls it sideway, because it is the direction less explored, leading to its final victory.

Here is another race. In this terrain, there is a flat region (plateau) surrounding the global minimum. With some parameter tuning, Momentum and Adam (thanks to its momentum component) can make it to the center, while the other methods can’t.
In summary, gradient descent is a class of algorithms that aims to find the minimum point on a function by following the gradient. Vanilla gradient descent just follows the gradient (scaled by learning rate). Two common tools to improve gradient descent are the sum of gradient (first moment) and the sum of the gradient squared (second momentum). The Momentum method uses the first moment with a decay rate to gain speed. AdaGrad uses the second moment with no decay to deal with sparse features. RMSProp uses the second moment by with a decay rate to speed up from AdaGrad. Adam uses both first and second moments, and is generally the best choice. There are a few other variations of gradient descent algorithms, such as Nesterov accelerated gradient, AdaDelta, etc., that are not covered in this post.
Lastly I shall leave you with this momentum descent with no decay. Its path makes up a fun pattern. I see no practical use (yet) but present it here just for the funsies. [Edit: I take my word back about no practical use. Read more about this curve at https://en.wikipedia.org/wiki/Lissajous_curve.]