The softmax function, also known as softargmax or the normalized exponential function, converts a vector of K real numbers into a probability distribution over K possible outcomes. It is a generalization of the logistic function to multiple dimensions and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network, normalizing the network's output to a probability distribution over predicted output classes; this interpretation rests on Luce's choice axiom.

The softmax function takes as input a vector z of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative or greater than one, and they might not sum to 1; after applying softmax, each component lies in the interval (0, 1) and the components add up to 1, so they can be interpreted as probabilities. Furthermore, larger input components correspond to larger probabilities.

The standard (unit) softmax function $\sigma : \mathbb{R}^K \to (0,1)^K$, where $K \ge 1$, is defined by the formula

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \qquad \text{for } i = 1, \ldots, K \text{ and } \mathbf{z} = (z_1, \ldots, z_K) \in \mathbb{R}^K.$$

In words, it applies the standard exponential function to each element of the input vector and normalizes these values by dividing by the sum of all the exponentials; the normalization ensures that the components of the output vector sum to 1.

The term "softmax" derives from the amplifying effect of the exponential on any maxima in the input vector. For example, the standard softmax of (1, 2, 8) is approximately (0.001, 0.002, 0.997), which amounts to assigning almost all of the total unit weight to the position of the vector's maximal element (8).

In general, instead of e a different base b > 0 can be used. If b > 1, larger input components result in larger output probabilities, and increasing b concentrates the distribution around the positions of the largest input values; if 0 < b < 1, smaller input components result in larger output probabilities, and decreasing b concentrates the distribution around the positions of the smallest input values.
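As a minimal sketch of the definition above, the following Python snippet (assuming NumPy and a hypothetical helper named softmax, neither of which appears in the original text) computes the standard softmax and the base-b variant. Shifting the inputs by their maximum before exponentiating is the usual log-sum-exp trick: it leaves the result unchanged because the common factor cancels in the ratio.

import numpy as np

def softmax(z, base=np.e):
    # Exponentiate each component and normalize by the sum of exponentials.
    # Subtracting the maximum does not change the result (the common factor
    # cancels) and avoids overflow for large inputs when base > 1.
    z = np.asarray(z, dtype=float)
    exps = base ** (z - z.max())
    return exps / exps.sum()

# Example from the text: nearly all of the unit weight goes to the largest entry (8).
print(softmax([1.0, 2.0, 8.0]))            # ~ [0.001, 0.002, 0.997]

# With a base 0 < b < 1 the effect reverses: the smallest entry dominates.
print(softmax([1.0, 2.0, 8.0], base=0.5))

Running the snippet reproduces the example distribution quoted above and illustrates how the choice of base controls which input positions receive the most probability mass.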