The softmax function, also known as softargmax or the normalized exponential function, converts a vector of K real numbers into a probability distribution over K possible outcomes. It is a generalization of the logistic function to multiple dimensions and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network, normalizing the network's output to a probability distribution over predicted output classes; this interpretation rests on Luce's choice axiom.

The softmax function takes as input a vector z of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative or greater than one, and they might not sum to 1; after applying softmax, each component lies in the interval (0, 1) and the components add up to 1, so they can be interpreted as probabilities. Furthermore, larger input components correspond to larger probabilities.

The standard (unit) softmax function $\sigma : \mathbb{R}^K \to (0,1)^K$, where $K \ge 1$, is defined by the formula

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \qquad \text{for } i = 1, \ldots, K \text{ and } \mathbf{z} = (z_1, \ldots, z_K) \in \mathbb{R}^K.$$

In words, it applies the standard exponential function to each element of the input vector and normalizes these values by dividing by the sum of all the exponentials; the normalization ensures that the components of the output vector sum to 1.

The term "softmax" derives from the amplifying effect of the exponential on any maxima in the input vector. For example, the standard softmax of (1, 2, 8) is approximately (0.001, 0.002, 0.997), which amounts to assigning almost all of the total unit weight to the position of the vector's maximal element (8).

In general, instead of e a different base b > 0 can be used. If b > 1, larger input components result in larger output probabilities, and increasing b concentrates the distribution around the positions of the largest input values; if 0 < b < 1, smaller input components result in larger output probabilities, and decreasing b concentrates the distribution around the positions of the smallest input values.
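As a minimal sketch of the definition above, the following Python snippet (assuming NumPy and a hypothetical helper named softmax, neither of which appears in the original text) computes the standard softmax and the base-b variant. Shifting the inputs by their maximum before exponentiating is the usual log-sum-exp trick: it leaves the result unchanged because the common factor cancels in the ratio.

import numpy as np

def softmax(z, base=np.e):
    # Exponentiate each component and normalize by the sum of exponentials.
    # Subtracting the maximum does not change the result (the common factor
    # cancels) and avoids overflow for large inputs when base > 1.
    z = np.asarray(z, dtype=float)
    exps = base ** (z - z.max())
    return exps / exps.sum()

# Example from the text: nearly all of the unit weight goes to the largest entry (8).
print(softmax([1.0, 2.0, 8.0]))            # ~ [0.001, 0.002, 0.997]

# With a base 0 < b < 1 the effect reverses: the smallest entry dominates.
print(softmax([1.0, 2.0, 8.0], base=0.5))

Running the snippet reproduces the example distribution quoted above and illustrates how the choice of base controls which input positions receive the most probability mass.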