A New Technique To Train Computer Vision Models: Facebook DeiT
Deepfakes. Self-driving cars. Colorized photos. Movement tracking. How are all of these possible? Last month, Facebook AI announced a new technique for training computer vision models called Data-efficient image Transformers, or DeiT for short.
Facebook said it achieved 84.2% top-1 accuracy on the ImageNet benchmark without any external training data, a result competitive with cutting-edge CNNs.
Wait a second.
What is a CNN? What is DeiT? Let's find out.
From the early stages of artificial intelligence research, scientists have tried to build computers that could see the world the way humans do. Computer vision is the result of this effort. In its early days, everything had to be defined by hand by the developers. The problem was that not every visual task could be captured by explicit programming, so researchers turned to an alternative.
That alternative was machine learning. Unlike the earlier methods, machine learning algorithms could learn from examples on their own. However, early machine learning algorithms still required engineers to hand-craft the features that characterize an image.
In 2012, one of the most important developments came to light: AlexNet.
AlexNet's original paper was titled "ImageNet Classification with Deep Convolutional Neural Networks." The network is named after its first author, Alex Krizhevsky.
AlexNet uses a structure called a convolutional neural network, or CNN. A CNN is an end-to-end AI model that develops its own unique features. It learns simple patterns in its lower layers and combines them in higher layers into more complex, abstract representations. A well-trained CNN automatically recognizes simple edges and corners and builds them up, hierarchically, into complex features such as faces, chairs, cars, and dogs.
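To make the "simple patterns in lower layers" concrete, here is a toy sketch of the core operation a CNN layer performs: sliding a small kernel over an image. The kernel and image below are made up for illustration; a real CNN learns its kernel values from data rather than having them hand-written.

```python
# Toy 2D convolution (cross-correlation), the building block of a CNN layer.
# The vertical-edge kernel below is illustrative; real CNNs learn kernels.

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation of a 2D list with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = 0
            for di in range(kh):
                for dj in range(kw):
                    s += image[i + di][j + dj] * kernel[di][dj]
            row.append(s)
        out.append(row)
    return out

# A vertical-edge kernel: responds where brightness changes left-to-right.
edge_kernel = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

# A 5x5 image: dark left half, bright right half.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

print(conv2d(image, edge_kernel))
# → [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
```

The large responses sit exactly where the dark-to-bright edge is, which is the sense in which a low CNN layer "detects edges"; deeper layers combine many such responses into more abstract features.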
In fact, CNNs had already been introduced in the 1990s, but there were many cases where training did not work properly. Years later, technical solutions began to appear one by one, such as the ReLU activation function, which made CNN training much faster.
Today, thanks to large computational clusters, hardware, and vast amounts of data, CNNs have been able to develop many useful applications for image classification and object recognition.
However, CNNs struggle with changes in viewpoint caused by rotation or scaling. One proposed way to solve this is to train models on 4D or 6D maps and then detect objects.
“But that method has astronomical costs,” said Hinton.
Last month, Facebook AI announced a new technique called Data-efficient image Transformers (DeiT) to train computer vision models. DeiT requires far less data and far fewer computing resources to produce a high-performance image classification model. Training a DeiT model on just a single 8-GPU server over three days, Facebook AI achieved 84.2% top-1 accuracy on the ImageNet benchmark without any external training data. The result is competitive with cutting-edge CNNs, which have been the principal approach to image classification until now.
This work is expected to extend Transformers to new use cases and make them more accessible to researchers and engineers who lack the large-scale systems needed to train massive AI models.
Image classification is easy for humans but hard for machines. It is especially difficult for convolution-free Transformers like DeiT, because these systems have few built-in statistical priors about images. They typically have to "see" many example images to learn to classify different objects. DeiT, however, can be trained with approximately 1.2 million images rather than hundreds of millions.
So, how does it work?
The first key ingredient of DeiT is its training strategy. The researchers used data augmentation, optimization, and regularization to simulate training on a much larger data set, as is done for CNNs.
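The idea behind augmentation is that each labeled image can be turned into many slightly different training samples. As a minimal sketch (real DeiT training uses a much richer pipeline, e.g. RandAugment and Mixup, which are not shown here), a random flip plus a random crop already multiplies the effective data set:

```python
# Minimal data-augmentation sketch: random horizontal flip + random crop.
# Images are plain 2D lists here; real pipelines operate on tensors.
import random

def horizontal_flip(image):
    """Mirror each row of a 2D image (list of lists)."""
    return [list(reversed(row)) for row in image]

def random_crop(image, size, rng):
    """Crop a size x size patch at a random position."""
    i = rng.randint(0, len(image) - size)
    j = rng.randint(0, len(image[0]) - size)
    return [row[j:j + size] for row in image[i:i + size]]

def augment(image, rng):
    """One randomized view of the image: maybe flip, then crop."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, len(image) - 1, rng)

rng = random.Random(0)
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
view = augment(img, rng)  # a different 2x2 view each call, same label
```

Because every call to `augment` yields a different view with the same label, the model effectively sees far more distinct samples than the raw data set contains.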
They also modified the Transformer architecture to allow native distillation. Distillation is the process by which one neural network learns from the output of another network. Here, a CNN was used as the teacher model for the Transformer.
Naive distillation, however, can hamper performance: the student model learns from two different sources that may diverge, the labeled data set and the teacher.
To alleviate this, a distillation token is introduced: a learned vector that flows through the network along with the transformed image data and cues the model to produce a distillation output, which can differ from the class token's output. This improved distillation method is specific to Transformers.
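The two outputs can then be trained against their two different sources. Below is a sketch of the hard-distillation objective described in the DeiT paper: the class token's logits are matched to the ground-truth label, while the distillation token's logits are matched to the teacher CNN's predicted label, with equal weight. The logit values are invented for illustration; a real implementation would backpropagate this loss through the Transformer.

```python
# Sketch of DeiT-style hard distillation: class token learns from the
# ground-truth label, distillation token learns from the teacher's hard
# prediction. Logits below are illustrative placeholders.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target_index])

def deit_hard_distillation_loss(cls_logits, dist_logits, label, teacher_logits):
    # The teacher's "hard" label is just its argmax prediction.
    teacher_label = max(range(len(teacher_logits)),
                        key=teacher_logits.__getitem__)
    return (0.5 * cross_entropy(cls_logits, label)
            + 0.5 * cross_entropy(dist_logits, teacher_label))

# Example with 3 classes, where the teacher disagrees with the ground truth:
cls_logits = [2.0, 0.5, 0.1]    # output read at the class token
dist_logits = [1.5, 1.0, 0.2]   # output read at the distillation token
loss = deit_hard_distillation_loss(cls_logits, dist_logits, label=0,
                                   teacher_logits=[0.2, 3.0, 0.1])
```

Because the two tokens are supervised by different targets, the network can reconcile the labeled data and the teacher even when they disagree, which is exactly the divergence problem the distillation token addresses.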
That's how DeiT works.
Facebook said DeiT is an important step forward to advance computer vision.
They also said this work will help democratize AI research: DeiT shows that developers with limited access to data and computing resources can train or use these new models, and Facebook hopes it will foster advances by a larger community of researchers.
You can see more details on the AI Network YouTube channel. Please don't forget to subscribe!