It is indeed a really good visualisation. However, one thing doesn't look right: I think the author mistakenly labelled GD (gradient descent) as SGD.
I reproduced Alec’s animation, fixed the typo, and added the Adam optimiser to see its behaviour. I added Adam because it is very popular and is expected to work well, combining the benefits of both AdaGrad (which works well with sparse gradients) and RMSProp (which works well with noisy gradients). You can read about Adam in this paper.
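For concreteness, the Adam update can be sketched in a few lines of numpy. This is my own illustrative code (the function name and signature are mine), not the code behind the animations; the default hyperparameters are the ones recommended in the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba). Returns the new parameters and state."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: momentum-like average
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment: RMSProp-like scaling
    m_hat = m / (1 - beta1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

The per-coordinate division by `sqrt(v_hat)` is what gives Adam its adaptive step sizes: coordinates with consistently small gradients still take steps of roughly `lr` in magnitude.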
The purpose of this post is to give intuitions about the optimisers’ behaviour. Note that the visualisation is not intended to compare the optimisers; such a comparison would be unfair anyway, since it depends heavily on the learning rates chosen.
Visualising the optimisers
The animation below portrays their behaviour at a saddle point. Notice that SGD gets stuck and has a very hard time breaking the symmetry, while Nesterov and Momentum exhibit oscillations for a moment. On the other hand, optimisers that scale their steps using gradient information, such as Adam, break the symmetry quickly and descend along the saddle direction.
Velocity-based optimisers overshoot and bounce around because of the large initial gradient. Again, optimisers that scale their step sizes based on gradient information find the minimum easily.