Do we even need Attention? FNets completely drop the Attention mechanism in favor of a simple Fourier transform. They perform almost as well as Transformers, while drastically reducing parameter count, as well as compute and memory requirements. This highlights that a good token mixing heuristic could be as valuable as a learned attention matrix.
0:00 - Intro & Overview
0:45 - Giving up on Attention
5:00 - FNet Architecture
9:00 - Going deeper into the Fourier Transform
11:20 - The Importance of Mixing
22:20 - Experimental Results
33:00 - Conclusions & Comments