
# Stefano Spigler



Courses taught by this person

No results


Related units (1)

Related research domains (10)

General Dynamics Corporation (GD) is an American publicly traded aerospace and defense corporation headquartered in Reston, Virginia. As of 2020, it was the fifth-largest defense contractor in the world…

A neural network can refer to a neural circuit of biological neurons (sometimes also called a biological neural network) or a network of artificial neurons or nodes in the case of an artificial neural network…

In physics, physical chemistry and engineering, fluid dynamics is a subdiscipline of fluid mechanics that describes the flow of fluids (liquids and gases). It has several subdisciplines, including aerodynamics…

People doing similar research (122)

Related publications (6)

Mario Geiger, Stefano Spigler, Matthieu Wyart

Two distinct limits for deep learning have been derived as the network width $h \to \infty$, depending on how the weights of the last layer scale with $h$. In the neural tangent kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel $\Theta$ (the NTK). By contrast, in the mean-field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, which follows a partial differential equation. In this work we consider deep networks where the weights in the last layer scale as $\alpha h^{-1/2}$ at initialization. By varying $\alpha$ and $h$, we probe the crossover between the two limits. We observe the two previously identified regimes of 'lazy training' and 'feature training'. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time and thus learns some features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find that: (i) the two regimes are separated by an $\alpha^*$ that scales as $1/\sqrt{h}$. (ii) Network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks generally perform better in the lazy-training regime, unlike convolutional networks. (iii) In both regimes, the fluctuations $\delta F$ induced on the learned function by initial conditions decay as $\delta F \sim 1/\sqrt{h}$, leading to a performance that increases with $h$. The same improvement can also be obtained at an intermediate width by ensemble-averaging several networks that are trained independently. (iv) In the feature-training regime we identify a time scale $t_1 \sim \sqrt{h}\,\alpha$, such that for $t < t_1$ the dynamics is linear. At $t \sim t_1$, the output has grown by a magnitude $\sqrt{h}$ and the changes of the tangent kernel $\|\Delta\Theta\|$ become significant. Ultimately, it follows $\|\Delta\Theta\| \sim (\sqrt{h}\,\alpha)^{-a}$ for ReLU and Softplus activation functions, with $a < 2$ and $a \to 2$ as depth grows. We provide scaling arguments supporting these findings.
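
The sketch below is a minimal illustration, not the authors' code, of the $\alpha h^{-1/2}$ last-layer scaling described in this abstract. It assumes PyTorch and a toy regression task, and the `ScaledMLP` and `empirical_ntk` names are illustrative. Varying `alpha` and `h` and comparing the empirical tangent kernel before and after training gives a crude probe of the lazy-versus-feature crossover.

```python
# Minimal sketch (assumes PyTorch; toy data; illustrative names) of the
# alpha * h^{-1/2} output scaling discussed in the abstract above.
import torch
import torch.nn as nn


class ScaledMLP(nn.Module):
    """Fully connected net whose output is multiplied by alpha / sqrt(h)."""

    def __init__(self, d_in: int, h: int, alpha: float, depth: int = 3):
        super().__init__()
        self.alpha, self.h = alpha, h
        layers = [nn.Linear(d_in, h), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Linear(h, h), nn.ReLU()]
        self.body = nn.Sequential(*layers)
        self.last = nn.Linear(h, 1, bias=False)

    def forward(self, x):
        # Large alpha pushes training toward the lazy (NTK-like) regime,
        # small alpha toward the feature-learning regime.
        return self.alpha * self.h ** -0.5 * self.last(self.body(x))


def empirical_ntk(model, x):
    """Theta_ij = grad_w f(x_i) . grad_w f(x_j) on a small batch."""
    grads = []
    for xi in x:
        model.zero_grad()
        model(xi.unsqueeze(0)).sum().backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    g = torch.stack(grads)
    return g @ g.T


if __name__ == "__main__":
    torch.manual_seed(0)
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    model = ScaledMLP(d_in=10, h=256, alpha=10.0)
    theta0 = empirical_ntk(model, x)

    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(200):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()

    theta1 = empirical_ntk(model, x)
    # In the lazy regime the tangent kernel barely moves during training;
    # in the feature regime it evolves appreciably.
    print("relative NTK change:", ((theta1 - theta0).norm() / theta0.norm()).item())
```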


In this paper we first recall the recent result that in deep networks a phase transition, analogous to the jamming transition of granular media, delimits the over- and under-parametrized regimes where fitting can or cannot be achieved. The analysis leading to this result supports that, for proper initialization and architectures, poor minima of the loss are not encountered during training anywhere in the over-parametrized regime, because the number of constraints that hinder the dynamics is insufficient to allow for the emergence of stable minima. Next, we study systematically how this transition affects the generalization properties of the network (i.e. its predictive power). As we increase the number of parameters of a given model, starting from an under-parametrized network, we observe with gradient descent that the generalization error displays three phases: (i) an initial decay, (ii) an increase up to the transition point, where it displays a cusp, and (iii) a slow decay toward an asymptote as the network width diverges. However, if early stopping is used, the cusp signaling the jamming transition disappears. We thereby identify the region where the classical phenomenon of over-fitting takes place as the vicinity of the jamming transition, as well as the region where the model keeps improving as the number of parameters increases, thus organizing previous empirical observations made in modern neural networks.
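
As a rough illustration of the width sweep described in this abstract (a sketch under assumptions, not the paper's experimental setup): the snippet below trains small networks of increasing width on toy data, assuming PyTorch, and records the test error both at the last step and at its best value over training as a stand-in for early stopping. Near the under/over-parametrized transition, the final-error curve is where one would look for the cusp.

```python
# Sketch of a width sweep on toy data (assumes PyTorch; not the paper's setup).
import torch
import torch.nn as nn


def make_data(n, d, seed=0):
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(n, d, generator=g)
    y = torch.sign(x[:, :1])          # simple teacher: sign of the first coordinate
    return x, y


def train_width(h, x, y, x_test, y_test, steps=2000, lr=0.05):
    model = nn.Sequential(nn.Linear(x.shape[1], h), nn.ReLU(), nn.Linear(h, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    best = float("inf")
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
        with torch.no_grad():
            test_err = (torch.sign(model(x_test)) != y_test).float().mean().item()
        best = min(best, test_err)    # proxy for early stopping (a real run would use a validation set)
    return test_err, best             # (final error, early-stopped error)


if __name__ == "__main__":
    torch.manual_seed(0)
    x, y = make_data(256, 20, seed=0)
    x_test, y_test = make_data(2048, 20, seed=1)
    for h in [2, 4, 8, 16, 32, 64, 128]:
        final, stopped = train_width(h, x, y, x_test, y_test)
        print(f"h={h:4d}  final={final:.3f}  early-stopped={stopped:.3f}")
```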

Franck Raymond Gabriel, Mario Geiger, Clément Hongler, Levent Dogus Sagun, Stefano Spigler, Matthieu Wyart

Supervised deep learning involves the training of neural networks with a large number $N$ of parameters. For large enough $N$, in the so-called over-parametrized regime, one can essentially fit the training data points. Sparsity-based arguments would suggest that the generalization error increases as $N$ grows past a certain threshold $N^*$. Instead, empirical studies have shown that in the over-parametrized regime, the generalization error keeps decreasing with $N$. We resolve this paradox through a new framework. We rely on the so-called Neural Tangent Kernel, which connects large neural nets to kernel methods, to show that the initialization causes finite-size random fluctuations $\|f_N - \langle f_N \rangle\| \sim N^{-1/4}$ of the neural net output function $f_N$ around its expectation $\langle f_N \rangle$. These affect the generalization error $\epsilon(f_N)$ for classification: under natural assumptions, it decays to a plateau value $\epsilon(f_\infty)$ in a power-law fashion $\sim N^{-1/2}$. This description breaks down at a so-called jamming transition $N = N^*$. At this threshold, we argue that $\|f_N\|$ diverges. This result leads to a plausible explanation for the cusp in test error known to occur at $N^*$. Our results are confirmed by extensive empirical observations on the MNIST and CIFAR image datasets. Our analysis finally suggests that, given a computational envelope, the smallest generalization error is obtained using several networks of intermediate sizes, just beyond $N^*$, and averaging their outputs.
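
A minimal sketch of the ensemble averaging suggested at the end of this abstract, assuming PyTorch and toy data (not the paper's setup): several networks of intermediate size are trained independently from different initializations and their outputs are averaged, which suppresses the initialization-induced fluctuations of $f_N$.

```python
# Sketch of output ensemble averaging over independent initializations
# (assumes PyTorch; toy data; not the paper's setup).
import torch
import torch.nn as nn


def train_one(seed, x, y, h=32, steps=1500, lr=0.05):
    torch.manual_seed(seed)           # different seed -> different initialization
    model = nn.Sequential(nn.Linear(x.shape[1], h), nn.ReLU(), nn.Linear(h, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    return model


if __name__ == "__main__":
    g = torch.Generator().manual_seed(0)
    x = torch.randn(256, 20, generator=g)
    y = torch.sign(x[:, :1])
    x_test = torch.randn(2048, 20, generator=g)
    y_test = torch.sign(x_test[:, :1])

    # Train several intermediate-size networks independently and average their outputs.
    models = [train_one(seed, x, y) for seed in range(5)]
    with torch.no_grad():
        avg_out = torch.stack([m(x_test) for m in models]).mean(dim=0)
        errs = [(torch.sign(m(x_test)) != y_test).float().mean().item() for m in models]
        ens_err = (torch.sign(avg_out) != y_test).float().mean().item()
    print("individual test errors:", [round(e, 3) for e in errs])
    print("ensemble test error:   ", round(ens_err, 3))
```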