Paper Explained - Pretrained Transformers as Universal Computation Engines (Full Video Analysis)

Large-scale pre-training followed by fine-tuning is a common recipe for success with transformer models in machine learning. However, most such transfer learning is done when a model is pre-trained on the same or a very similar modality to the final task to be solved. This paper demonstrates that transformers can be fine-tuned on completely different modalities, for example transferring from language to vision. Moreover, the authors show that this works even when the self-attention and feedforward layers are kept frozen, so that less than 0.1% of all parameters are fine-tuned. The paper further claims that language modeling is a particularly good pre-training task for such cross-domain transfer, and it supports these points with various ablation studies.
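To make the frozen-parameter setup concrete, here is a minimal PyTorch sketch of a Frozen Pretrained Transformer, assuming a Hugging Face `GPT2Model` backbone. The class name `FrozenPretrainedTransformer`, the `input_dim`/`num_classes` arguments, and the linear input/output projections are illustrative choices rather than the paper's exact implementation; the parameter-name checks (`ln_`, `wpe`) rely on the naming used in the Hugging Face GPT-2 module.

```python
import torch.nn as nn
from transformers import GPT2Model


class FrozenPretrainedTransformer(nn.Module):
    """GPT-2 with frozen self-attention/feedforward blocks; only the layer
    norms, positional embeddings, and new input/output layers are trained."""

    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")

        # Freeze everything, then re-enable only the LayerNorm parameters
        # ("ln_1", "ln_2", "ln_f") and the positional embeddings ("wpe").
        for name, param in self.gpt2.named_parameters():
            param.requires_grad = ("ln_" in name) or ("wpe" in name)

        hidden = self.gpt2.config.n_embd
        # Task-specific input/output projections, trained from scratch.
        self.embed_in = nn.Linear(input_dim, hidden)
        self.read_out = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (batch, seq_len, input_dim) -- raw task tokens, not text.
        h = self.embed_in(x)                              # project into GPT-2's embedding space
        h = self.gpt2(inputs_embeds=h).last_hidden_state  # frozen transformer blocks
        return self.read_out(h[:, -1])                    # predict from the last position


if __name__ == "__main__":
    model = FrozenPretrainedTransformer(input_dim=8, num_classes=2)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable} of {total}")
```

In this sketch only the layer norms, positional embeddings, and the new projections receive gradients; the self-attention and feedforward weights stay at their language-pretrained values.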

OUTLINE:
0:00 - Intro & Overview
2:00 - Frozen Pretrained Transformers
4:50 - Evaluated Tasks
10:05 - The Importance of Training LayerNorm
17:10 - Modality Transfer
25:10 - Network Architecture Ablation
26:10 - Evaluation of the Attention Mask
27:20 - Are FPTs Overfitting or Underfitting?
28:20 - Model Size Ablation
28:50 - Is Initialization All You Need?
31:40 - Full Model Training Overfits
32:15 - Again the Importance of Training LayerNorm
33:10 - Conclusions & Comments

Paper: https://arxiv.org/abs/2103.05247
Code: https://github.com/kzl/universal-comp
