*Another great episode by @ykilcher*

## Full Title: Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

Deep neural networks are often said to discover useful representations of the data. This paper challenges that prevailing view and suggests that, rather than learning representations, deep neural networks store superpositions of the training examples in their weights and act as kernel machines at inference time. The paper is theoretical, with a main theorem and an understandable proof, and the result has many interesting implications for the field.
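To make the claim concrete, here is a minimal sketch (my own illustration, not code from the paper or the video) of what "acting as a kernel machine at inference time" means: the prediction for a new input is a weighted sum of kernel similarities to the stored training points, $y(x) = \sum_i a_i K(x, x_i) + b$. The RBF kernel and the toy data below are arbitrary choices for demonstration.

```python
import numpy as np

def rbf_kernel(x, xi, gamma=1.0):
    """Gaussian (RBF) kernel: similarity between query x and training point xi."""
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def kernel_machine_predict(x, train_X, coeffs, bias=0.0):
    """Predict by comparing x against every stored training example."""
    return sum(a * rbf_kernel(x, xi) for a, xi in zip(coeffs, train_X)) + bias

# Toy setup: the "model" is literally the training set plus per-example weights.
train_X = np.array([[0.0], [1.0], [2.0]])
coeffs = np.array([1.0, -0.5, 0.25])

print(kernel_machine_predict(np.array([1.0]), train_X, coeffs))
```

The paper's point is that a network trained by gradient descent is approximately of this form, where the kernel (the "path kernel") measures how similarly the network's weights respond to two inputs along the training trajectory.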

OUTLINE:

0:00 - Intro & Outline

4:50 - What is a Kernel Machine?

10:25 - Kernel Machines vs Gradient Descent

12:40 - Tangent Kernels

22:45 - Path Kernels

25:00 - Main Theorem

28:50 - Proof of the Main Theorem

39:10 - Implications & My Comments

Paper: [2012.00152] Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

ERRATA: I simplify a bit too much when I pit kernel methods against gradient descent. Of course, you can learn kernel machines using GD as well; the two are not mutually exclusive. It is also not quite true that you "don't need a model" with kernel machines, since a kernel machine usually still contains learned parameters.
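The errata's first point can be sketched in a few lines (again my own toy example, not from the video): the coefficients of a kernel machine can themselves be fit by plain gradient descent on a squared loss, so "kernel machine" and "trained by GD" are perfectly compatible.

```python
import numpy as np

def rbf(x, xi, gamma=1.0):
    """Gaussian (RBF) kernel, vectorized over arrays of points."""
    return np.exp(-gamma * (x - xi) ** 2)

# Toy 1-D regression problem.
train_X = np.array([-1.0, 0.0, 1.0])
train_y = np.array([0.0, 1.0, 0.0])

# Gram matrix K[i, j] = K(x_i, x_j); predictions are y_hat = K @ a.
K = rbf(train_X[:, None], train_X[None, :])

# Fit the kernel coefficients a by gradient descent on 0.5 * ||K a - y||^2.
a = np.zeros(3)
lr = 0.5
for _ in range(500):
    grad = K.T @ (K @ a - train_y)
    a -= lr * grad

print(np.max(np.abs(K @ a - train_y)))  # residual shrinks toward zero
```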