Self-Supervised Learning is the final frontier in Representation Learning: getting useful features without any labels. Facebook AI's new system, DINO, combines advances in Self-Supervised Learning for Computer Vision with the new Vision Transformer (ViT) architecture and achieves impressive results without any supervision. Its attention maps can be directly interpreted as segmentation maps, and the learned representations can be used for image retrieval and zero-shot k-nearest-neighbor (k-NN) classification.
0:00 - Intro & Overview
6:20 - Vision Transformers
9:20 - Self-Supervised Learning for Images
13:30 - Self-Distillation
15:20 - Building the teacher from the student by moving average
16:45 - DINO Pseudocode
23:10 - Why Cross-Entropy Loss?
28:20 - Experimental Results
33:40 - My Hypothesis why this works
38:45 - Conclusion & Comments
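
The self-distillation loop discussed around 13:30–16:45 can be sketched roughly as follows. This is a minimal NumPy toy, not the authors' code: the "networks" are plain linear maps standing in for ViTs, the student update is a random placeholder rather than real backpropagation, and the momentum value `m = 0.996` and the temperature/centering recipe are taken from the paper's description.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 8, 16                      # output dims, input dims
student = 0.1 * rng.normal(size=(K, D))
teacher = student.copy()          # teacher starts as a copy of the student
center = np.zeros(K)              # running center applied to teacher outputs

def softmax(z, temp):
    z = (z - z.max()) / temp      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dino_loss(x, t_temp=0.04, s_temp=0.1):
    # Teacher output is centered and sharpened (low temperature); in the
    # real method, gradients flow only through the student branch.
    p_t = softmax(teacher @ x - center, t_temp)
    p_s = softmax(student @ x, s_temp)
    return -np.sum(p_t * np.log(p_s + 1e-12))   # cross-entropy loss

# One "training step": nudge the student (placeholder for a gradient step),
# then build the teacher as an exponential moving average of the student.
x = rng.normal(size=D)
loss = dino_loss(x)
student -= 0.01 * rng.normal(size=student.shape)
m = 0.996                                        # EMA momentum
teacher = m * teacher + (1 - m) * student
center = 0.9 * center + 0.1 * (teacher @ x)      # update running center
```

The EMA update is why the teacher slowly lags behind the student instead of collapsing onto it, and the centering plus low teacher temperature are the tricks the paper uses to avoid trivial (constant) solutions.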