DeiT: Training data-efficient image transformers & distillation through attention

[Summary]

DeiT shows that vision transformers can be trained to competitive accuracy on ImageNet-1k alone, without external pre-training data, and introduces a distillation-through-attention procedure in which a dedicated distillation token lets the student learn from a teacher's predictions.

[Strengths]

Interestingly, with our distillation, image transformers learn more from a convnet than from another transformer with comparable performance.

We have observed that using a convnet teacher gives better performance than using a transformer.

Furthermore, when DeiT is distilled from a relatively weaker RegNetY teacher to produce DeiT⚗, it outperforms EfficientNet.

[Weaknesses]


[Interesting things]

The transformer block described above is invariant to the order of the patch embeddings, and thus does not consider their relative position.
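To make this concrete, here is a minimal PyTorch sketch (module and parameter names are my own, not the authors' code) of how a learned positional embedding is added to the patch embeddings so the blocks are no longer blind to patch order:

```python
import torch
import torch.nn as nn

class PatchEmbedWithPosition(nn.Module):
    """Sketch: project non-overlapping patches and add a learned positional embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patchify via a strided convolution: one embedding per 16x16 patch.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        # One learned vector per patch position; this is what reintroduces order information.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x):
        x = self.proj(x)                  # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (B, N, D) sequence of patch embeddings
        return x + self.pos_embed         # without this, attention sees an unordered set
```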

Touvron et al. [47] show that it is desirable to use a lower training resolution and fine-tune the network at the larger resolution.
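One practical consequence of fine-tuning at a larger resolution is that the patch grid grows, so the learned positional embeddings have to be resized over the 2-D grid. A rough sketch (the function name and the omission of the class/distillation token rows are simplifications of mine):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Adapt a (1, old_grid**2, D) positional embedding to a larger patch grid
    by bicubic interpolation, e.g. when fine-tuning at a higher resolution."""
    _, _, dim = pos_embed.shape
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)  # (1, D, g, g)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# Example: 224px training (14x14 patches of 16px) fine-tuned at 384px (24x24 patches).
pos_224 = torch.zeros(1, 14 * 14, 768)
pos_384 = resize_pos_embed(pos_224, old_grid=14, new_grid=24)  # (1, 576, 768)
```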

Hard-label distillation

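The objective behind this (reconstructed here; the notation $Z_s$, $Z_t$, $\psi$ for student logits, teacher logits, and softmax is assumed to match the paper) treats the teacher's hard decision $y_t$ as a second ground-truth label:

```latex
\mathcal{L}_{\mathrm{global}}^{\mathrm{hardDistill}}
  = \tfrac{1}{2}\,\mathcal{L}_{\mathrm{CE}}\big(\psi(Z_s),\, y\big)
  + \tfrac{1}{2}\,\mathcal{L}_{\mathrm{CE}}\big(\psi(Z_s),\, y_t\big),
\qquad
y_t = \arg\max_{c} Z_t(c)
```

Compared with the usual soft distillation, this variant is parameter-free: there is no temperature or balancing coefficient to tune.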


[Citation]

Paper link: https://arxiv.org/abs/2012.12877

Original blog: https://ai.facebook.com/blog/data-efficient-image-transformers-a-promising-new-technique-for-image-classification/