Visualising Stochastic Optimisers
Aug 9, 2017
2 minutes read

This post is inspired by Alec Radford’s visualisation of stochastic optimisers. That visualisation shows up almost everywhere in deep learning classes, including CS231n.

It is indeed a really good visualisation. However, one thing doesn’t look right: I think the author mistakenly labelled GD (gradient descent) as SGD.

I reproduced Alec’s animation, fixed the typo, and added the Adam optimiser to see how it behaves. I included Adam because it is very popular and is expected to work well by combining the benefits of AdaGrad (which works well with sparse gradients) and RMSProp (which works well with noisy gradients). You can read more about Adam in this paper.
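Concretely, Adam keeps an exponential moving average of the gradients (the momentum-like part) and of the squared gradients (the RMSProp-like part), with a bias correction for both. Below is a minimal NumPy sketch of one update step using the default hyperparameters from the paper; the function name and argument layout are my own, not code from the original visualisation.

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum-style first moment plus RMSProp-style second moment."""
    m = beta1 * m + (1 - beta1) * grads          # moving average of gradients (momentum-like)
    v = beta2 * v + (1 - beta2) * grads ** 2     # moving average of squared gradients (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                 # bias correction for the second moment
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```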

The purpose of this post is to give some intuition about how the optimisers behave. Keep in mind that the visualisation is not meant to compare the optimisers; such a comparison would be unfair anyway, since the results depend heavily on the chosen learning rates.

Visualising the optimisers

Long valley

The animation below portrays their behaviour around a saddle point. Notice that SGD gets stuck and has a very hard time breaking the symmetry, while Nesterov and Momentum oscillate for a while. On the other hand, optimisers that scale their steps using gradient information, such as Adam, break the symmetry very quickly and descend along the saddle direction. A small sketch of this kind of experiment follows the animation.

(Animation: optimisers around a saddle point)
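As a rough sketch of the experiment, here is how one could trace SGD and Adam trajectories on a simple saddle surface f(x, y) = x² − y². The surface, starting point, and hyperparameters are my own choices for illustration, not necessarily the ones used in the original animation.

```python
import numpy as np

def saddle_grad(p):
    # Gradient of f(x, y) = x**2 - y**2, which has a saddle point at the origin.
    x, y = p
    return np.array([2 * x, -2 * y])

def run_sgd(p, lr=0.01, steps=300):
    path = [p.copy()]
    for _ in range(steps):
        p = p - lr * saddle_grad(p)
        path.append(p.copy())
    return np.array(path)

def run_adam(p, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=300):
    m, v = np.zeros_like(p), np.zeros_like(p)
    path = [p.copy()]
    for t in range(1, steps + 1):
        g = saddle_grad(p)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
        path.append(p.copy())
    return np.array(path)

# Start close to the symmetry axis: almost no gradient along the escape (y) direction.
start = np.array([-0.5, 1e-4])
print(run_sgd(start)[-1])   # y grows only slowly: the tiny gradient keeps SGD near the ridge
print(run_adam(start)[-1])  # y grows much faster: the rescaled step ignores the gradient's magnitude
```

Plotting the two returned paths on top of the surface gives a static version of the animation above.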

Beale’s function

Velocity-based optimisers overshoot and bounce around because of the large initial gradient. Again, optimisers that scale their steps based on gradient information find the minimum easily.

(Animation: optimisers on Beale’s function)
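For reference, Beale’s function itself is easy to write down. Below is a minimal sketch with a finite-difference gradient that could be plugged into loops like the ones above; the step size h and the helper names are my own choices.

```python
import numpy as np

def beale(p):
    """Beale's function; global minimum f = 0 at (x, y) = (3, 0.5)."""
    x, y = p
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

def beale_grad(p, h=1e-6):
    """Central-difference gradient; an analytic gradient works just as well."""
    g = np.zeros_like(p, dtype=float)
    for i in range(len(p)):
        d = np.zeros_like(p, dtype=float)
        d[i] = h
        g[i] = (beale(p + d) - beale(p - d)) / (2 * h)
    return g

print(beale(np.array([3.0, 0.5])))  # ~0.0 at the global minimum
```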

