Paper Explained - TransGAN: Two Transformers Can Make One Strong GAN (Full Video Analysis)

Generative Adversarial Networks (GANs) hold the state of the art in image generation. However, while the rest of computer vision is slowly being taken over by transformers and other attention-based architectures, all working GANs to date contain some form of convolutional layers. This paper changes that and builds TransGAN, the first GAN where both the generator and the discriminator are transformers. The discriminator is adopted from ViT ("An Image is Worth 16x16 Words"), and the generator uses PixelShuffle to progressively up-sample the generated image. Three tricks make training work: data augmentation with DiffAugment, an auxiliary super-resolution task, and a localized initialization of self-attention. Their largest model reaches performance competitive with the best convolutional GANs on CIFAR-10, STL-10, and CelebA.
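The PixelShuffle up-sampling mentioned above simply rearranges channels into spatial positions: a feature map of shape (C·r², H, W) becomes (C, H·r, W·r). Here is a minimal NumPy sketch of that rearrangement (an illustration of the operation, not the authors' code; PyTorch provides it as `nn.PixelShuffle`):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r).

    Each block of r*r channels is scattered into an r x r spatial
    neighborhood, trading channel depth for spatial resolution.
    """
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    # (C, r, r, H, W) -> (C, H, r, W, r) -> (C, H*r, W*r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)
    return x.reshape(c, h * r, w * r)

# 4 channels, 1x1 spatial -> 1 channel, 2x2 spatial
x = np.arange(4).reshape(4, 1, 1)
print(pixel_shuffle(x, 2))  # [[[0, 1], [2, 3]]]
```

Because no parameters or interpolation are involved, the generator can double resolution at each stage while the transformer blocks only ever operate on the (smaller) pre-shuffle token grid.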

0:00 - Introduction & Overview
3:05 - Discriminator Architecture
5:25 - Generator Architecture
11:20 - Upsampling with PixelShuffle
15:05 - Architecture Recap
16:00 - Vanilla TransGAN Results
16:40 - Trick 1: Data Augmentation with DiffAugment
19:10 - Trick 2: Super-Resolution Co-Training
22:20 - Trick 3: Locality-Aware Initialization for Self-Attention
27:30 - Scaling Up & Experimental Results
28:45 - Recap & Conclusion

Paper: https://arxiv.org/abs/2102.07074 - TransGAN: Two Transformers Can Make One Strong GAN
Code: https://github.com/VITA-Group/TransGAN