Batch Normalization is a core component of modern deep learning. It enables training at larger batch sizes, prevents mean shift, provides implicit regularization, and allows networks to reach higher performance than they would without it. However, BatchNorm also has disadvantages, such as its dependence on batch size and its computational overhead, especially in distributed settings. Normalizer-Free Networks (NFNets), developed at Google DeepMind, are a class of CNNs that achieve state-of-the-art classification accuracy on ImageNet without batch normalization. This is achieved by using adaptive gradient clipping (AGC), combined with a number of improvements to the general network architecture. The resulting networks train faster, are more accurate, and provide better transfer learning performance. Code is provided in JAX.
0:00 - Intro & Overview
2:40 - What’s the problem with BatchNorm?
11:00 - Paper contribution Overview
13:30 - Beneficial properties of BatchNorm
15:30 - Previous work: NF-ResNets
18:15 - Adaptive Gradient Clipping
21:40 - AGC and large batch size
23:30 - AGC induces implicit dependence between training samples
28:30 - Are BatchNorm’s problems solved?
30:00 - Network architecture improvements
31:10 - Comparison to EfficientNet
33:00 - Conclusion & Comments
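
For readers who want a concrete picture of the adaptive gradient clipping (AGC) mentioned in the description, here is a minimal JAX sketch. It is not the authors' released code. It assumes the usual formulation of AGC: each unit's gradient is rescaled whenever the ratio of its gradient norm to its weight norm exceeds a threshold. The function names (unitwise_norm, adaptive_grad_clip), the default values of clipping and eps, and the assumption that the leading axis of each weight array indexes the output units are all illustrative choices made for this sketch.

```python
import jax
import jax.numpy as jnp


def unitwise_norm(x):
    """L2 norm per unit. Assumption: the leading axis indexes the units."""
    if x.ndim <= 1:
        # Scalars and vectors (biases, gains): treat the whole array as one unit.
        return jnp.sqrt(jnp.sum(x ** 2))
    axes = tuple(range(1, x.ndim))
    return jnp.sqrt(jnp.sum(x ** 2, axis=axes, keepdims=True))


def adaptive_grad_clip(grads, params, clipping=0.01, eps=1e-3):
    """Rescale each unit's gradient so that ||g|| <= clipping * max(||w||, eps)."""

    def clip_one(g, p):
        # eps keeps zero-initialized weights from having their gradients clipped to zero.
        max_norm = clipping * jnp.maximum(unitwise_norm(p), eps)
        g_norm = unitwise_norm(g)
        # Guard against division by zero for all-zero gradients.
        scale = max_norm / jnp.maximum(g_norm, 1e-6)
        return jnp.where(g_norm > max_norm, g * scale, g)

    # Apply the clipping leaf by leaf over matching gradient/parameter pytrees.
    return jax.tree_util.tree_map(clip_one, grads, params)


# Toy usage: the first row has a large gradient and is clipped,
# the second row has a small gradient and is left unchanged.
params = {"w": jnp.ones((2, 3))}
grads = {"w": jnp.array([[3.0, 0.0, 4.0], [0.001, 0.0, 0.0]])}
clipped = adaptive_grad_clip(grads, params, clipping=0.1)
```

Unlike standard gradient clipping, which compares the gradient norm to a fixed constant, the threshold here scales with the size of the weights, so a single value of clipping can be shared across layers whose weights differ in magnitude.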