
# Berfin Simsek


Related publications (2)


Deep learning has achieved remarkable success in challenging tasks such as generating images from natural language or engaging in lengthy conversations with humans. This success in practice stems from the ability to train massive neural networks on massive datasets. This thesis studies the theoretical foundations of the simplest architecture, a deep (feedforward) neural network, with a particular emphasis on the role of width.

We first focus on a simple model of finite-width neural networks to study generalization, a central inquiry in machine learning and statistical learning theory. We derive the expected generalization error of a Gaussian random features model in terms of the number of features, the number of data points, the kernel it approximates, and the input distribution. Our formulas closely match numerical experiments.

Next, we explore another simplification of finite-width neural networks to study their training dynamics. We assume a linear activation function, resulting in a linear predictor; the training dynamics nevertheless remain non-trivial. In particular, the loss function is non-convex: the orthogonal symmetry gives rise to manifolds of saddle points at various loss levels. These saddle points exhibit a unique arrangement, wherein the escape direction of one saddle channels the trajectory towards a subsequent saddle. By gluing the local trajectories between saddles, we describe a so-called saddle-to-saddle dynamics that provably kicks in for very small initializations.

To study finite-width neural networks without devising a simplified model, we shift our focus to the structure of network parameterization and the permutation symmetry among hidden neurons. We identify a neuron-splitting technique that maps a critical point of a network to a manifold of symmetry-induced critical points of a wider network. By considering all possible neuron partitions and their permutations, we establish the precise scaling law for the number of critical manifolds. The scaling law behaves as $e^{c(\alpha)} m^m$ for large $m$, where $m$ is the width of the wider network and $\alpha$ is the shrinkage factor, i.e., the ratio of the number of distinct neurons to $m$. Notably, the maximum of $c(\alpha)$ is attained at $\alpha^* = \frac{1}{2 \log(2)}$; this is hence the shrinkage factor inducing the most numerous symmetry-induced critical manifolds.

We then give an application of this scaling law to overparameterized networks. The key question is: can we give a rule of thumb for how much overparameterization is needed to ensure reliable convergence to a zero-loss solution? Our approach is based on studying the geometry and topology of zero-loss solutions in overparameterized neural networks. We prove that *all* zero-loss solution manifolds are identical up to neuron splitting, zero-neuron addition, and permutation for input distributions with full support. Additionally, we give the scaling law of the zero-loss manifolds. The ratio between the two scaling laws yields a measure of landscape complexity that decays with overparameterization. We observe that the complexity decreases rapidly until reaching an overparameterization factor of approximately $2\log(2)$, beyond which the complexity becomes smaller than one. Overall, we recommend an overparameterization factor of at least $2$ to $4$ to ensure reliable convergence to a zero-loss solution.
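The neuron-splitting idea can be checked concretely on a minimal two-layer ReLU network. The sketch below is illustrative, assuming a network of the form $f(x) = a^\top \mathrm{relu}(Wx)$; the names `forward`, `split_neuron`, and the fraction `beta` are not notation from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(W, a, X):
    # f(x) = a^T relu(W x), evaluated column-wise on a batch X
    return a @ np.maximum(W @ X, 0.0)

d, r, n = 3, 4, 10                     # input dim, width, sample count
W = rng.standard_normal((r, d))        # incoming weights
a = rng.standard_normal(r)             # outgoing weights
X = rng.standard_normal((d, n))        # a batch of inputs

def split_neuron(W, a, i, beta):
    # Duplicate hidden neuron i and split its outgoing weight into
    # fractions beta and 1 - beta: the network function is unchanged,
    # so one parameter point of the narrow network maps to a whole
    # family (over beta) of equivalent parameters in the wider network.
    W_new = np.vstack([W, W[i]])
    a_new = np.append(a, (1.0 - beta) * a[i])
    a_new[i] = beta * a[i]
    return W_new, a_new

W2, a2 = split_neuron(W, a, i=0, beta=0.3)
assert W2.shape == (r + 1, d)
assert np.allclose(forward(W, a, X), forward(W2, a2, X))
```

Because the output is preserved for every value of `beta`, a single critical point of the width-$r$ network yields a continuum of parameters in the width-$(r+1)$ network, which is the mechanism behind the critical manifolds counted above.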

Johanni Michael Brea, Wulfram Gerstner, Clément Hongler, Berfin Simsek, Francesco Spadaro

We study how permutation symmetries in overparameterized multi-layer neural networks generate 'symmetry-induced' critical points. Assuming a network with $L$ layers of minimal widths $r_1^*, \ldots, r_{L-1}^*$ reaches a zero-loss minimum at $r_1^*! \cdots r_{L-1}^*!$ isolated points that are permutations of one another, we show that adding one extra neuron to each layer is sufficient to connect all these previously discrete minima into a single manifold. For a two-layer overparameterized network of width $r^* + h =: m$ we explicitly describe the manifold of global minima: it consists of $T(r^*, m)$ affine subspaces of dimension at least $h$ that are connected to one another. For a network of width $m$, we identify the number $G(r, m)$ of affine subspaces containing only symmetry-induced critical points that are related to the critical points of a smaller network of width $r < r^*$. Via a combinatorial analysis, we derive closed-form formulas for $T$ and $G$ and show that the number of symmetry-induced critical subspaces dominates the number of affine subspaces forming the global minima manifold in the mildly overparameterized regime (small $h$), and vice versa in the vastly overparameterized regime ($h \gg r^*$). Our results provide new insights into the minimization of the non-convex loss function of overparameterized neural networks.
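The permutation symmetry behind the $r_1^*! \cdots r_{L-1}^*!$ count can be verified directly on a toy two-layer ReLU network; the sizes and names below are illustrative assumptions, not the paper's setup:

```python
import numpy as np
from itertools import permutations
from math import factorial

rng = np.random.default_rng(1)

def forward(W, a, X):
    # f(x) = a^T relu(W x); a stand-in two-layer network
    return a @ np.maximum(W @ X, 0.0)

d, r, n = 2, 3, 5
W = rng.standard_normal((r, d))
a = rng.standard_normal(r)
X = rng.standard_normal((d, n))

base = forward(W, a, X)
copies = 0
for p in permutations(range(r)):
    idx = list(p)
    # Permuting the hidden neurons (rows of W together with the
    # matching entries of a) leaves the network function unchanged.
    assert np.allclose(forward(W[idx], a[idx], X), base)
    copies += 1
assert copies == factorial(r)  # r! parameter points realize the same function
```

Each of the $r!$ permutations is a distinct point in parameter space with identical loss, so an isolated minimum of the function appears $r!$ times per layer, matching the count in the abstract.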

2021