Any opinions about using Transformers on images? Is it really worth switching over from CNNs?
I believe in something in between. E.g., SuperGlue is a Transformer :)
I agree. To me, Transformer is the name of a specific architecture that combines low-level features to solve a higher-level task.
Another question is whether the Transformer architecture will become the new “VGG” for high-level tasks. One can extract low-level features with a CNN (VGG, ResNet, etc.) and use a Transformer to reason about a specific high-level task.
As you pointed out, SuperGlue can be seen as a Transformer (without the name branding), but there are other examples that I believe fit this same definition, for example FlowNet, which combines features to estimate optical flow.
@Tomasz any intuition about this?
For me ConvNets are like a little army of SIFT-descriptors, building up representations which have a growing receptive field as you proceed through more and more layers. But attention mechanisms let far away features talk to each other, and this is necessary for high-level vision tasks.
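To make the receptive-field point concrete, here is a toy sketch (illustrative formula only, assuming stride-1 convolutions with no pooling): stacking k×k convs grows the receptive field linearly, while one attention layer is already global.

```python
# Receptive field of a stack of stride-1 k x k convolutions grows linearly:
# rf(L) = 1 + L * (k - 1). One attention layer connects all positions directly.
def conv_receptive_field(num_layers, k=3):
    """Receptive-field width after num_layers stride-1 convs with k x k kernels."""
    return 1 + num_layers * (k - 1)

for L in (1, 5, 10, 50):
    print(L, conv_receptive_field(L))
# With 3x3 convs you need ~50 layers before two pixels 100 apart can interact;
# attention lets them "talk" in a single layer.
```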
In my team’s SuperGlue paper, we showed that attention mechanisms can be used to improve the quality of feature matching.
I don’t think Transformers and SuperGlue-like networks will replace CNNs. It is interesting to see how far a transformer-only approach can go, but the combination of CNN+Transformer works extremely well today. I would still like to see something like our SuperGlue paper, but using the attention mechanism to learn the underlying local features, basically using attention to self-supervise the CNN part and having something like SuperPoint++ emerge from the process.
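The CNN+Transformer combination mentioned above can be sketched in a few lines of numpy: a naive convolution produces local features, whose spatial positions then become tokens for a single self-attention step. This is a minimal illustration, not SuperGlue’s or any paper’s actual architecture; all names and shapes are made up.

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2-D convolution: each output sees only a local window."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

def attend(tokens):
    """Single-head self-attention (no learned projections, for illustration):
    global, pairwise reasoning on top of the local CNN features."""
    s = tokens @ tokens.T / np.sqrt(tokens.shape[-1])   # all-pairs scores
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                  # softmax over positions
    return w @ tokens

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
# "CNN" stage: two random 3x3 filters -> a 2-channel 6x6 feature map
fmap = np.stack([conv2d_valid(img, rng.normal(size=(3, 3))) for _ in range(2)],
                axis=-1)
tokens = fmap.reshape(-1, 2)   # 36 spatial positions become tokens
out = attend(tokens)           # "Transformer" stage reasons globally
print(out.shape)  # (36, 2)
```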
I really liked Vladlen K.’s (Intel) talk at last CVPR’20. He posed the question: “Could we have reached this point in time (the interest in AI & computer vision) without ConvNets?”
He mentioned an interesting hypothesis about open- vs. closed-loop feedback, comparing attention with plain convolution. BTW, there has been progress on closing the loop with convolutions, such as dynamic convolution.
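The dynamic-convolution idea can be sketched like this: the input itself chooses a softmax mixture over K candidate kernels, so the effective kernel depends on the data (a small feedback loop a fixed kernel lacks). This is a toy 1-D version; the function names, shapes, and the tiny mean/std input summary are all illustrative, not the published method’s exact form.

```python
import numpy as np

def dynamic_conv1d(x, kernels, w_att):
    """Toy dynamic convolution.
    x: (N,) signal; kernels: (K, k) candidate kernels; w_att: (K, 2) attention
    parameters scoring the kernels from a tiny global summary of x."""
    summary = np.array([x.mean(), x.std()])   # input-dependent descriptor
    logits = w_att @ summary                  # (K,) one score per candidate
    a = np.exp(logits - logits.max())
    a /= a.sum()                              # softmax -> mixture weights
    kernel = a @ kernels                      # input-conditioned kernel, (k,)
    k = kernel.shape[0]                       # plain 'valid' 1-D convolution
    return np.array([(x[i:i + k] * kernel).sum() for i in range(len(x) - k + 1)])

rng = np.random.default_rng(0)
x = rng.normal(size=16)
y = dynamic_conv1d(x, kernels=rng.normal(size=(4, 3)), w_att=rng.normal(size=(4, 2)))
print(y.shape)  # (14,)
```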
I see machine learning (at least) as a triad of data, model & training, and compute. The computational edge could be decisive. The Transformer is highly amenable to our current hardware. Will that hold for the hardware of our kids, mobile & edge? Or will it be a hybrid computational model, off-loading different parts of the architecture?