After dominating Natural Language Processing, Transformers have recently taken over Computer Vision with the advent of Vision Transformers. However, the attention mechanism's quadratic complexity in the number of tokens means that Transformers do not scale well to high-resolution images. XCiT is a new Transformer architecture built around XCA, a transposed version of attention that attends over feature channels instead of tokens, reducing the complexity from quadratic to linear in the number of tokens, and, at least on image data, it appears to perform on par with other models. What does this mean for the field? Is this even a Transformer? What really matters in deep learning?
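A minimal NumPy sketch of the core idea, hedged: this is a simplified single-head version (the actual XCiT additionally uses a learned temperature, multiple heads, and extra token-interaction layers). Standard self-attention builds an N×N map over tokens; XCA instead builds a d×d map over channels, which is what makes the cost linear in the number of tokens N.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Standard self-attention: an (N, N) attention map over tokens,
    so the cost is O(N^2 * d) -- quadratic in the number of tokens."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (N, N) token-attention map
    return A @ V                                 # (N, d) output

def xca(Q, K, V, tau=1.0):
    """Cross-covariance attention (XCA), simplified single head.

    Q, K, V: (N, d) arrays of N tokens with d channels each.
    The attention map is (d, d) -- over channels, not tokens --
    so the cost is O(N * d^2): linear in the number of tokens."""
    # L2-normalize each channel (column) of Q and K across tokens,
    # as in the paper, to keep the cross-covariance well-scaled.
    Qh = Q / (np.linalg.norm(Q, axis=0, keepdims=True) + 1e-8)
    Kh = K / (np.linalg.norm(K, axis=0, keepdims=True) + 1e-8)
    A = softmax(Kh.T @ Qh / tau, axis=-1)  # (d, d) channel-attention map
    return V @ A                            # (N, d) output
```

Doubling the number of tokens N doubles the work in `xca` but quadruples it in `self_attention`, which is why XCA scales to the many tokens of a high-resolution image.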
0:00 - Intro & Overview
3:45 - Self-Attention vs Cross-Covariance Attention (XCA)
19:55 - Cross-Covariance Image Transformer (XCiT) Architecture
26:00 - Theoretical & Engineering considerations
30:40 - Experimental Results
33:20 - Comments & Conclusion