Machine learning-based attention is a mechanism that mimics cognitive attention. It calculates "soft" weights for each word, or more precisely for its embedding, in the context window. These weights can be computed either in parallel (as in transformers) or sequentially (as in recurrent neural networks). "Soft" weights can change at each runtime, in contrast to "hard" weights, which are (pre-)trained and fine-tuned and remain frozen afterwards. Transformer-based large language models use multiple attention heads.
Predecessors of the mechanism were used in recurrent neural networks, which, however, calculated "soft" weights sequentially and, at each step, considered the current word and the other words within the context window. They were known as multiplicative modules, sigma-pi units, and hypernetworks. They have been used in LSTMs, multi-sensory data processing (sound, images, video, and text) in perceivers, the memory of fast weight controllers, and reasoning tasks in differentiable neural computers and neural Turing machines.
Correlating the different parts within a sentence or a picture can help capture its structure and meaning. In the sentence "see that girl run", the attention weights originating from the word "that" are calculated by the Q and K sub-networks of a single "attention head". As a result, the highest soft weight (or attention) is given to the word "girl".
The query vector is compared (via dot product) with the key vector of each word. This helps the model discover the word most relevant to the query word; in this case, "girl" was determined to be the most relevant word for "that". The resulting vector of scores (size 4 in this case) is run through the softmax function, producing a vector of size 4 whose entries sum to 1. Multiplying this against the value matrix amplifies the signal for the most important words in the sentence and diminishes the signal for the less important ones.
The structure of the input data is captured in the Qw and Kw weights, and the Vw weights express that structure in terms of more meaningful features for the task being trained for.
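The following is a minimal sketch of single-head scaled dot-product attention for the "see that girl run" example. The embedding size and the random Qw, Kw, Vw projection matrices are illustrative assumptions, not trained values, so the printed weights will not match the example above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, d_head = 8, 8                   # embedding and head dimensions (assumed)
tokens = ["see", "that", "girl", "run"]
X = rng.normal(size=(4, d_model))        # one embedding per token (assumed random)

Qw = rng.normal(size=(d_model, d_head))  # query projection weights
Kw = rng.normal(size=(d_model, d_head))  # key projection weights
Vw = rng.normal(size=(d_model, d_head))  # value projection weights

Q, K, V = X @ Qw, X @ Kw, X @ Vw

# Attention weights originating from the word "that" (index 1):
scores = Q[1] @ K.T / np.sqrt(d_head)    # one dot product per key, size 4
weights = softmax(scores)                # "soft" weights, summing to 1
output = weights @ V                     # weighted mix of the value vectors

for t, w in zip(tokens, weights):
    print(f"{t}: {w:.3f}")
```

With trained weights, the row of attention weights for "that" would concentrate on "girl", which is what multiplying against the value matrix then amplifies.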
Figure: diagram of the general architecture of a transformer. A transformer (or self-attention model) is a deep learning model introduced in 2017, used primarily in the field of natural language processing (NLP). Since 2020, transformers have also found applications in computer vision through vision transformers (ViT).
Deep learning (also deep structured learning or hierarchical learning) is a subfield of artificial intelligence that uses neural networks to solve complex tasks through architectures composed of multiple non-linear transformations. These techniques have enabled significant and rapid progress in the analysis of audio and visual signals, notably in facial recognition, speech recognition, computer vision, and natural language processing.
A long short-term memory (LSTM) network is a recurrent neural network (RNN) designed to address the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is an advantage over other RNNs, hidden Markov models, and other sequence-learning methods. It aims to provide RNNs with a short-term memory that can last thousands of timesteps, hence the name "long short-term memory".
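A minimal sketch of running an LSTM over a toy sequence, assuming PyTorch is available; the dimensions and the random input are illustrative only.

```python
import torch
import torch.nn as nn

# One-layer LSTM: 10 input features per timestep, 20 hidden units.
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, batch_first=True)

x = torch.randn(1, 50, 10)       # (batch, timesteps, features)
output, (h_n, c_n) = lstm(x)     # the cell state c_n carries long-range memory

print(output.shape)  # torch.Size([1, 50, 20]) - one hidden state per timestep
print(h_n.shape)     # torch.Size([1, 1, 20])  - final hidden state
```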
Real-world engineering applications must cope with a large dataset of dynamic variables, which cannot be well approximated by classical or deterministic models. This course gives an overview of method ...
The Deep Learning for NLP course provides an overview of neural network based methods applied to text. The focus is on models particularly suited to the properties of human language, such as categori ...
This course aims to introduce the basic principles of machine learning in the context of the digital humanities. We will cover both supervised and unsupervised learning techniques, and study and imple ...
Explores the history, models, training, convergence, and limitations of neural networks, including the backpropagation algorithm and universal approximation.
Explores the evolution of attention mechanisms toward transformers in modern NLP, highlighting the importance of self-attention and cross-attention.
Provides an overview of transformers, self-attention, multi-head attention, and the transformer encoder and decoder.
In this PhD manuscript, we explore optimisation phenomena which occur in complex neural networks through the lens of 2-layer diagonal linear networks. This rudimentary architecture, which consists of a two layer feedforward linear network with a diagonal ...
In the past few years, Machine Learning (ML) techniques have ushered in a paradigm shift, allowing the harnessing of ever more abundant sources of data to automate complex tasks. The technical workhorse behind these important breakthroughs arguably lies in ...
EPFL, 2024
In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size γ and momentum parameter β that allows u ...