Transformers on images - Facebook's DeiT

Any opinions about using transformers on images? Is it really worth switching over from CNNs?

1 Like

I believe in something in between. E.g., SuperGlue is a transformer.

2 Likes

I agree. To me, Transformer is the name of a specific architecture that combines low-level features to solve a higher-level task.

Another question could be whether the Transformer architecture will become the new “vgg” for high-level tasks. One can extract low-level features with a CNN (VGG, ResNet, etc.) and use a Transformer to reason about a specific high-level task.
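A minimal NumPy sketch of that hybrid pipeline, just to make the idea concrete. The tiny convolution, the shapes, and the single attention layer are all illustrative assumptions, not any particular published model:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- CNN stage: one toy convolution extracts a low-level feature map ---
def conv2d(img, kern):
    H, W = img.shape
    k = kern.shape[0]
    out = np.empty((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + k, j:j + k] * kern).sum()
    return out

img = rng.standard_normal((8, 8))
feat = conv2d(img, rng.standard_normal((3, 3)))   # (6, 6) feature map

# --- Transformer stage: treat feature-map cells as tokens, mix globally ---
tokens = feat.reshape(-1, 1)                      # (36, 1) token per cell
d = 4
Wq, Wk, Wv = (rng.standard_normal((1, d)) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)                 # (36, 36) attention map
out = A @ V                                       # every cell attends to every cell
```

The point of the split is that the cheap, local inductive bias lives in the conv stage, while the attention stage is free to relate any two locations for the high-level task.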

As you pointed out, SuperGlue can be seen as a transformer (without the name branding :wink:), but there are other examples that I believe fit this same definition, for example FlowNet, which combines features to estimate optical flow.

@Tomasz any intuition about this ?

1 Like

For me, ConvNets are like a little army of SIFT descriptors, building up representations whose receptive field grows as you proceed through more and more layers. But attention mechanisms let far-away features talk to each other, and this is necessary for high-level vision tasks.
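The contrast between a local receptive field and global attention can be shown directly: perturb the last position of a sequence and check whether the first output changes. This is a toy NumPy experiment; the shapes and random weights are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

def local_conv(x, kern):
    # Each output depends only on a small neighbourhood of the input.
    return np.convolve(x, kern, mode="same")

def self_attention(x, Wq, Wk, Wv):
    # Each output is a weighted sum over *all* positions.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(Wk.shape[1])
    A = np.exp(s - s.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

n, d = 16, 4
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# Perturb only the last token.
x2 = x.copy()
x2[-1] += 1.0

conv_out  = local_conv(x[:, 0],  np.ones(3) / 3)
conv_out2 = local_conv(x2[:, 0], np.ones(3) / 3)
att_out  = self_attention(x,  Wq, Wk, Wv)
att_out2 = self_attention(x2, Wq, Wk, Wv)
```

One conv layer cannot propagate the perturbation across the sequence (you would need to stack many layers to grow the receptive field), while a single attention layer couples the two ends immediately.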

In my team’s SuperGlue paper, we showed that attention mechanisms can be used to improve the quality of feature matching.

I don’t think Transformers and SuperGlue-like networks will replace CNNs. It is interesting to see how far a transformer-only approach can go, but the combination of CNN+Transformer works extremely well today. I would still like to see something like our SuperGlue paper, but using the attention mechanism to learn the underlying local features, basically using attention to self-supervise the CNN part and having something like SuperPoint++ emerge from the process.

3 Likes

I really liked Vladlen K’s (Intel) talk at last CVPR’20. He posed the question: “Could we have reached this point in time (the surge of interest in AI & computer vision :rocket:) without ConvNets?”

He mentioned an interesting hypothesis about open- vs. closed-loop feedback when comparing attention with plain convolution. BTW, there has been progress on closing the loop with convolutions, such as dynamic convolution.
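The core idea of dynamic convolution is that the kernel is no longer fixed: a small gating network looks at the input and mixes a bank of kernel "experts" into one input-conditioned kernel. Below is a minimal NumPy sketch; the gating features (mean and std of the input) and all shapes are my own illustrative assumptions, not the implementation from any specific paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def dynamic_conv1d(x, experts, gate):
    """Mix a bank of kernels with input-dependent weights, then convolve."""
    logits = gate @ np.array([x.mean(), x.std()])  # tiny gating "network"
    alpha = np.exp(logits - logits.max())
    alpha /= alpha.sum()                           # (K,) mixture weights
    kernel = alpha @ experts                       # one kernel per input
    return np.convolve(x, kernel, mode="same"), kernel

K, k = 4, 3
experts = rng.standard_normal((K, k))   # bank of K kernel experts
gate = rng.standard_normal((K, 2))      # gating parameters

y1, k1 = dynamic_conv1d(rng.standard_normal(32),         experts, gate)
y2, k2 = dynamic_conv1d(rng.standard_normal(32) * 5 + 1, experts, gate)
```

Because the effective kernel depends on input statistics, the convolution itself becomes a (weak) closed loop: the input influences the filter that processes it.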

I see machine learning (at least) as a triad of data, model & training, and compute. The computational edge could be decisive: the Transformer is highly amenable to our current :racehorse:. Will that still hold for the :racehorse: of our kids, mobile & edge? Or will it be a hybrid computational model, off-loading different parts of the architecture?