Large-scale pre-training followed by fine-tuning is a common recipe for success with transformer models in machine learning. However, such transfer learning is usually done within a single modality: the model is pre-trained on the same or a very similar modality as the final task to be solved. This paper demonstrates that transformers can be fine-tuned for completely different modalities, for example from language to vision. Moreover, the authors show that this can be done while freezing all attention layers, tuning less than 0.1% of all parameters. The paper further claims that language modeling is a superior pre-training task for such cross-domain transfer, and supports these claims with various ablation studies.
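To make the setup concrete, here is a minimal sketch of the Frozen Pretrained Transformer idea, assuming a Hugging Face GPT-2 backbone: all self-attention and feed-forward weights are frozen, while the layer norms, positional embeddings, and newly added input/output layers remain trainable. The class name, input dimension, and head are illustrative assumptions, not the paper's exact code, and the trainable fraction printed at the end depends on which pieces are counted.

```python
import torch
import torch.nn as nn
from transformers import GPT2Model


class FrozenPretrainedTransformer(nn.Module):
    """Hypothetical FPT-style wrapper around a language-pretrained GPT-2."""

    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")  # pre-trained on language

        # Freeze everything, then re-enable only the layer norms ("ln_")
        # and positional embeddings ("wpe") as trainable parameters.
        for name, param in self.backbone.named_parameters():
            param.requires_grad = "ln_" in name or "wpe" in name

        hidden = self.backbone.config.n_embd
        # New, trainable input embedding and output head for the target modality.
        self.input_proj = nn.Linear(input_dim, hidden)
        self.output_head = nn.Linear(hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim), e.g. flattened image patches.
        h = self.input_proj(x)
        h = self.backbone(inputs_embeds=h).last_hidden_state
        return self.output_head(h[:, -1])  # predict from the final token


model = FrozenPretrainedTransformer(input_dim=16, num_classes=10)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4%}")
```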
0:00 - Intro & Overview
2:00 - Frozen Pretrained Transformers
4:50 - Evaluated Tasks
10:05 - The Importance of Training LayerNorm
17:10 - Modality Transfer
25:10 - Network Architecture Ablation
26:10 - Evaluation of the Attention Mask
27:20 - Are FPTs Overfitting or Underfitting?
28:20 - Model Size Ablation
28:50 - Is Initialization All You Need?
31:40 - Full Model Training Overfits
32:15 - Again the Importance of Training LayerNorm
33:10 - Conclusions & Comments