Two distinct limits for deep learning have been derived as the network width $h \to \infty$, depending on how the weights of the last layer scale with $h$. In the neural tangent kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel $\Theta$ (the NTK). By contrast, in the mean-field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, which follows a partial differential equation. In this work we consider deep networks whose last-layer weights scale as $\alpha h^{-1/2}$ at initialization. By varying $\alpha$ and $h$, we probe the crossover between the two limits. We observe the two previously identified regimes of 'lazy training' and 'feature training'. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time, and thus learns some features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find that: (i) the two regimes are separated by a value $\alpha^*$ that scales as $1/\sqrt{h}$. (ii) Network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks generally perform better in the lazy-training regime, unlike convolutional networks. (iii) In both regimes, the fluctuations $\delta F$ induced on the learned function by the initial conditions decay as $\delta F \sim 1/\sqrt{h}$, leading to a performance that increases with $h$. The same improvement can also be obtained at intermediate width by ensemble-averaging several networks trained independently. (iv) In the feature-training regime we identify a time scale $t_1 \sim \sqrt{h}\,\alpha$, such that for $t < t_1$ the dynamics is linear. At $t \sim t_1$, the output has grown by a magnitude $\sqrt{h}$ and the change of the tangent kernel $\|\Delta\Theta\|$ becomes significant. Ultimately, it follows $\|\Delta\Theta\| \sim (\sqrt{h}\,\alpha)^{-a}$ for ReLU and Softplus activation functions, with $a < 2$ and $a \to 2$ as depth grows. We provide scaling arguments supporting these findings.
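The $\alpha h^{-1/2}$ output scaling and the evolution of the tangent kernel can be made concrete in a few lines of code. The sketch below is our own illustration, not the authors' code: it builds a width-$h$ fully-connected ReLU network whose readout is multiplied by $\alpha/\sqrt{h}$, trains it on toy random data with a hinge loss, and compares the empirical NTK before and after training. The class and function names (ScaledNet, empirical_ntk), the architecture, the data, and all hyperparameters are assumptions chosen only to exhibit the lazy-versus-feature crossover.

```python
# Hedged sketch (illustrative only): a width-h network with alpha / sqrt(h) readout
# scaling, plus an empirical tangent-kernel estimate used to track how much the NTK
# moves during training. All choices below are assumed, not taken from the paper's code.
import torch
import torch.nn as nn

torch.manual_seed(0)

d_in, h, alpha = 10, 512, 1.0  # input dimension, hidden width, output scale (assumed values)

class ScaledNet(nn.Module):
    """Two-hidden-layer ReLU net; the readout is multiplied by alpha / sqrt(h)."""
    def __init__(self, d_in, h, alpha):
        super().__init__()
        self.alpha, self.h = alpha, h
        self.body = nn.Sequential(nn.Linear(d_in, h), nn.ReLU(),
                                  nn.Linear(h, h), nn.ReLU())
        self.readout = nn.Linear(h, 1, bias=False)

    def forward(self, x):
        return self.alpha / self.h ** 0.5 * self.readout(self.body(x)).squeeze(-1)

def empirical_ntk(model, X):
    """Theta_ij = <grad_w f(x_i), grad_w f(x_j)>, computed row by row (small X only)."""
    grads = []
    for x in X:
        model.zero_grad()
        model(x.unsqueeze(0)).sum().backward()
        grads.append(torch.cat([p.grad.reshape(-1) for p in model.parameters()]).clone())
    G = torch.stack(grads)
    return G @ G.T

# Toy data: random inputs with +/-1 labels (placeholder for MNIST-like data).
X = torch.randn(32, d_in)
y = torch.sign(torch.randn(32))

model = ScaledNet(d_in, h, alpha)
theta0 = empirical_ntk(model, X)

# Plain gradient descent on a hinge loss, as a stand-in for the training setup.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for step in range(200):
    opt.zero_grad()
    loss = torch.relu(1.0 - y * model(X)).mean()
    loss.backward()
    opt.step()

theta1 = empirical_ntk(model, X)
rel_change = (theta1 - theta0).norm() / theta0.norm()
print(f"relative NTK change ||dTheta|| / ||Theta||: {rel_change:.3f}")
# Large sqrt(h)*alpha keeps this ratio small (lazy training, nearly frozen kernel);
# small sqrt(h)*alpha lets the kernel evolve (feature training).
```

Sweeping $\alpha$ (or $h$) in this sketch gives the qualitative picture described above: the kernel stays nearly frozen when $\sqrt{h}\,\alpha$ is large and evolves appreciably when it is small.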