Paper Explained - DINO: Emerging Properties in Self-Supervised Vision Transformers (Facebook AI Full Video Analysis)

Self-Supervised Learning is the final frontier in Representation Learning: getting useful features without any labels. Facebook AI’s new system, DINO, combines advances in Self-Supervised Learning for Computer Vision with the new Vision Transformer (ViT) architecture and achieves impressive results. Its attention maps can be directly interpreted as segmentation maps, and the learned representations can be used for image retrieval and for k-nearest neighbor (k-NN) classifiers on frozen features, with no fine-tuning.
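
To make the k-NN idea concrete, here is a minimal sketch of classifying a query by majority vote among its nearest neighbors in feature space. It assumes the features have already been extracted by a frozen backbone. The names `cosine_similarity` and `knn_predict` and the toy vectors are hypothetical, and the plain majority vote is a simplification, not necessarily the exact evaluation protocol of the paper.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, bank_features, bank_labels, k=3):
    """Predict a label by majority vote over the k most similar stored features."""
    # Rank the labeled feature bank by similarity to the query, most similar first.
    ranked = sorted(
        zip(bank_features, bank_labels),
        key=lambda pair: cosine_similarity(query, pair[0]),
        reverse=True,
    )
    # Count the labels of the top-k neighbors and return the most common one.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy feature bank (made-up 2-D "features", purely for illustration).
bank_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
bank_labels = ["cat", "cat", "dog", "dog"]
prediction = knn_predict([1.0, 0.05], bank_features, bank_labels, k=3)
```

Note that the backbone is never updated here: the labels are only used to vote, which is why this kind of evaluation is a direct test of how good the self-supervised features are.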

0:00 - Intro & Overview
6:20 - Vision Transformers
9:20 - Self-Supervised Learning for Images
13:30 - Self-Distillation
15:20 - Building the teacher from the student by moving average
16:45 - DINO Pseudocode
23:10 - Why Cross-Entropy Loss?
28:20 - Experimental Results
33:40 - My Hypothesis why this works
38:45 - Conclusion & Comments
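
The self-distillation, moving-average teacher, and cross-entropy chapters above can be sketched roughly as follows. This is a simplified illustration in plain Python on lists, not the code from the repository. The function names are hypothetical, the temperature and momentum values are illustrative defaults, and the sketch omits multi-crop augmentation and the running update of the center.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - peak - log_total for x in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student output distributions."""
    # The teacher output is centered, then sharpened with a low temperature.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    # H(teacher, student) = -sum_i p_teacher[i] * log p_student[i].
    # In training, gradients would flow only through the student.
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher weight a small step toward the student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Toy check: the loss is small when the student agrees with the teacher
# and large when it puts its mass on a different output dimension.
center = [0.0, 0.0, 0.0]
loss_agree = dino_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], center)
loss_disagree = dino_loss([0.0, 2.0, 0.0], [2.0, 0.0, 0.0], center)
```

The teacher is never trained by gradient descent in this scheme: it only follows the student through `ema_update`, which makes it a slowly changing target for the student to match.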

Blog: Facebook…​
Code: GitHub - facebookresearch/dino: PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO