Are you an EPFL student looking for a semester project?
Work with us on data science and visualisation projects, and deploy your project as an app on top of Graph Search.
Supervised deep learning involves the training of neural networks with a large number N of parameters. For large enough N, in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsitybased arguments would suggest that the generalization error increases as N grows past a certain threshold N*. Instead, empirical studies have shown that in the over-parametrized regime, generalization error keeps decreasing with N. We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations parallel to f(N) - < f(N)>parallel to similar to N-1/4 of the neural net output function f(N) around its expectation < f(N)>. These affect the generalization error epsilon(f(N)) for classification: under natural assumptions, it decays to a plateau value epsilon(f(infinity)) in a power-law fashion similar to N-1/2. This description breaks down at a so-called jamming transition N = N*. At this threshold, we argue that parallel to f(N)parallel to diverges. This result leads to a plausible explanation for the cusp in test error known to occur at N*. Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond N*, and averaging their outputs.
Jean-Paul Richard Kneib, Emma Elizabeth Tolley, Tianyue Chen, Michele Bianco
The capabilities of deep learning systems have advanced much faster than our ability to understand them. Whilst the gains from deep neural networks (DNNs) are significant, they are accompanied by a growing risk and gravity of a bad outcome. This is tr ...