Concept

Attention (machine learning)

Summary
Machine learning-based attention is a mechanism that mimics cognitive attention. It calculates "soft" weights for each word, or more precisely for its embedding, in the context window. These weights can be computed either in parallel (as in transformers) or sequentially (as in recurrent neural networks). "Soft" weights can change at each runtime, in contrast to "hard" weights, which are (pre-)trained, fine-tuned, and remain frozen afterwards. Transformer-based large language models use multiple attention heads in parallel.

Predecessors of the mechanism appeared in recurrent neural networks, which computed "soft" weights sequentially, considering at each step the current word together with the other words in the context window. They were known as multiplicative modules, sigma-pi units, and hypernetworks. They have been used in LSTMs, in multi-sensory data processing (sound, images, video, and text) in perceivers, in the memory of fast weight controllers, and for reasoning tasks in differentiable neural computers and neural Turing machines.

Correlating the different parts within a sentence or an image helps capture its structure and meaning. In the sentence "see that girl run", the attention weights originating from the word "that" are calculated by the Q and K sub-networks of a single attention head. As a result, the largest soft weight (or attention) is given to the word "girl". The query vector for "that" is compared, via dot product, with the key of every word in the sentence, which lets the model discover the word most relevant to the query; here, "girl" is determined to be the most relevant word for "that". The resulting score vector (of size 4 in this case) is passed through the softmax function, producing a vector of size 4 whose entries are probabilities summing to 1. Multiplying this vector against the value matrix amplifies the signal for the most important words in the sentence and diminishes the signal for the less important ones. The structure of the input data is captured in the Qw and Kw weights, while the Vw weights express that structure in terms of features that are more meaningful for the task being trained.
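The steps above (project embeddings with Qw, Kw, Vw, take dot products of the query with the keys, apply softmax, then weight the values) can be sketched in a few lines of NumPy. This is a minimal single-head illustration only: the sentence, embedding sizes, and random weight matrices are illustrative assumptions, not trained values, so with untrained weights the largest soft weight will not necessarily fall on "girl" as in the example.

```python
# Minimal single-head dot-product attention sketch (NumPy).
# The embeddings and the Qw, Kw, Vw matrices below are random toy
# values standing in for trained parameters.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

tokens = ["see", "that", "girl", "run"]
d_model, d_head = 8, 4

rng = np.random.default_rng(0)
X  = rng.normal(size=(len(tokens), d_model))   # toy word embeddings
Qw = rng.normal(size=(d_model, d_head))        # query projection
Kw = rng.normal(size=(d_model, d_head))        # key projection
Vw = rng.normal(size=(d_model, d_head))        # value projection

Q, K, V = X @ Qw, X @ Kw, X @ Vw

# The query vector for "that" is compared (dot product) with every key;
# scaling by sqrt(d_head) keeps the scores in a reasonable range.
scores  = Q[tokens.index("that")] @ K.T / np.sqrt(d_head)  # shape (4,)
weights = softmax(scores)                                  # soft weights, sum to 1

# Weighted sum of the value vectors: words with larger soft weights
# contribute more to the output representation for "that".
output = weights @ V

print(dict(zip(tokens, weights.round(3))))  # per-word attention weights
print(output)                               # attended representation of "that"
```

In a transformer, the same computation is done for every query position at once (Q, K, and V become full matrices), and several such heads with their own Qw, Kw, Vw run in parallel before their outputs are concatenated.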