A Vision Transformer (ViT) is a transformer that is targeted at vision processing tasks such as image classification. Transformers found their initial applications in natural language processing (NLP), as demonstrated by language models such as BERT and GPT-3. By contrast, the typical image processing system uses a convolutional neural network (CNN); well-known examples include Xception, ResNet, EfficientNet, DenseNet, and Inception.

Transformers measure the relationships between pairs of input tokens (words in the case of text strings), an operation termed attention, whose cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel, but computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT computes relationships among small sections of the image (e.g., 16x16-pixel patches), at a drastically reduced cost. Each patch is flattened into a vector and multiplied by a learnable embedding matrix; positional embeddings, which are themselves learnable vectors, are added, and the resulting sequence is fed to the transformer.

As in BERT, a fundamental role in classification tasks is played by the class token: a special token whose output representation, having been influenced by all the others through attention, is used as the only input to the final MLP head. The architecture for image classification is the most common and uses only the Transformer encoder to transform the input tokens; however, there are also other applications in which the decoder part of the traditional transformer architecture is used as well.

Transformers were introduced in 2017 in the well-known paper "Attention Is All You Need" and quickly spread through the field of natural language processing, becoming one of the most widely used and promising neural network architectures there. In 2020 they were adapted to computer vision tasks as Vision Transformers, with the paper "An Image is Worth 16x16 Words".
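The pipeline above (patch embedding, class token, positional embeddings, encoder, MLP head) can be made concrete with a short sketch. The following is a minimal illustration in PyTorch, assuming ViT-Base-style hyperparameters (16x16 patches, embedding dimension 768, 12 layers); the class name and layout are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """Minimal ViT-style classifier sketch (illustrative, not the official model)."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3,
                 dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196

        # Patch embedding: a strided convolution is equivalent to flattening
        # each 16x16 patch and multiplying it by a learned embedding matrix.
        self.patch_embed = nn.Conv2d(in_channels, dim,
                                     kernel_size=patch_size, stride=patch_size)

        # Learnable class token and learnable positional embeddings.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

        # Encoder only -- no decoder, as in ViT for image classification.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # The MLP head reads only the class token's output representation.
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                 # images: (B, 3, 224, 224)
        x = self.patch_embed(images)           # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, dim) patch sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend the class token
        x = x + self.pos_embed                 # add positional embeddings
        x = self.encoder(x)                    # self-attention over all tokens
        return self.head(x[:, 0])              # classify from the class token

# Usage: logits = SimpleViT()(torch.randn(2, 3, 224, 224))  # shape (2, 1000)
```

Note how the quadratic attention cost applies to the 197 tokens (196 patches plus the class token) rather than to the 50,176 pixels of a 224x224 image, which is what makes the approach tractable.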
