The last years in deep learning research have given rise to a plethora of different optimization algorithms, such as SGD, AdaGrad, Adam, LARS, LAMB, etc. which all claim to have their special peculiarities and advantages. In general, all algorithms modify two major things: The (implicit) learning rate schedule, and a correction to the gradient direction. This paper introduces grafting, which allows to transfer the induced learning rate schedule of one optimizer to another one. In that, the paper shows that much of the benefits of adaptive methods (e.g. Adam) are actually due to this schedule, and not necessarily to the gradient direction correction. Grafting allows for more fundamental research into differences and commonalities between optimizers, and a derived version of it makes it possible to computes static learning rate corrections for SGD, which potentially allows for large savings of GPU memory.
0:00 - Rant about Reviewer #2
6:25 - Intro & Overview
12:25 - Adaptive Optimization Methods
20:15 - Grafting Algorithm
26:45 - Experimental Results
31:35 - Static Transfer of Learning Rate Ratios
35:25 - Conclusion & Discussion
Paper (OpenReview): https://openreview.net/forum?id=FpKgG…
Old Paper (Arxiv): [2002.11803] Disentangling Adaptive Gradient Methods from Learning Rates
Our Discord: https://discord.gg/4H8xxDF
In the empirical science of training large neural networks, the learning rate schedule is a notoriously challenging-to-tune hyperparameter, which can depend on all other properties (architecture, optimizer, batch size, dataset, regularization, …) of the problem. In this work, we probe the entanglements between the optimizer and the learning rate schedule. We propose the technique of optimizer grafting, which allows for the transfer of the overall implicit step size schedule from a tuned optimizer to a new optimizer, preserving empirical performance. This provides a robust plug-and-play baseline for optimizer comparisons, leading to reductions to the computational cost of optimizer hyperparameter search. Using grafting, we discover a non-adaptive learning rate correction to SGD which allows it to train a BERT model to state-of-the-art performance. Besides providing a resource-saving tool for practitioners, the invariances discovered via grafting shed light on the successes and failure modes of optimizers in deep learning.
Authors: Anonymous (Under Review)