
# What can linearized neural networks actually say about generalization?

Abstract

For certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization, but for the networks used in practice, the empirical NTK only provides a rough first-order approximation. Still, a growing body of work keeps leveraging this approximation to successfully analyze important deep learning phenomena and design algorithms for new applications. In our work, we provide strong empirical evidence to determine the practical validity of such an approximation by conducting a systematic comparison of the behavior of different neural networks and their linear approximations on different tasks. We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks, even when they achieve very different performances. However, in contrast to what was previously reported, we discover that neural networks do not always perform better than their kernel approximations, and reveal that the performance gap heavily depends on architecture, dataset size, and training task. We discover that networks overfit to these tasks mostly due to the evolution of their kernel during training, thus revealing a new type of implicit bias.
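
To make the linearization concrete: the linear approximation discussed here is the first-order Taylor expansion of the network around its initial weights, f_lin(x) = f(x; w0) + ∇_w f(x; w0) · (w - w0), whose training dynamics are governed by the empirical NTK at w0. The sketch below (PyTorch; the model, sizes, and names are illustrative assumptions, not the paper's implementation) builds that expansion for a toy network.

```python
# Minimal sketch (not the paper's code): a network's first-order Taylor
# expansion around its initialization, evaluated at perturbed weights.
import torch
from torch.func import functional_call, jvp  # requires PyTorch >= 2.0

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
w0 = {k: v.detach().clone() for k, v in model.named_parameters()}

def f(params, x):
    # stateless forward pass of `model` with the given parameters
    return functional_call(model, params, (x,))

def f_lin(params, x):
    # f(x; w0) + J(x; w0) (params - w0), via a jacobian-vector product
    dparams = {k: params[k] - w0[k] for k in params}
    out, jvp_out = jvp(lambda p: f(p, x), (w0,), (dparams,))
    return out + jvp_out

x = torch.randn(5, 10)
# a small perturbation of the initial weights
params = {k: v + 0.01 * torch.randn_like(v) for k, v in w0.items()}
print((f(params, x) - f_lin(params, x)).abs().max())  # small residual
```

Training this linearized model with gradient descent corresponds to learning with the empirical NTK at initialization, which is the kernel approximation the abstract compares the real networks against.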


Related concepts (15)

Neural network

A neural network can refer to a neural circuit of biological neurons (sometimes also called a biological neural network) or to a network of artificial neurons or nodes in the case of an artificial neural network.

Algorithm

Figure: an algorithm for cutting an arbitrary polygon into triangles (triangulation).
An algorithm is a finite and unambiguous sequence of instructions and operations for solving a class of problems.

Artificial neural network

An artificial neural network is a system whose design was originally schematically inspired by the functioning of biological neurons.

Related publications (74)


Deep neural networks have been empirically successful in a variety of tasks, yet their theoretical understanding is still poor. In particular, modern deep neural networks have many more parameters than training data, so in principle they should overfit the training samples and generalize poorly to the complete data distribution. Counterintuitively, however, they manage to achieve both high training accuracy and high testing accuracy. One can certify generalization using a validation set, but this is difficult when training samples are limited, and it provides no information about why deep neural networks generalize well. Another approach is to estimate the complexity of the deep neural network: the hypothesis is that a network with high training accuracy and high complexity has memorized the data, while one with low complexity has learned generalizable patterns.

In the first part of this thesis we explore spectral complexity, a measure of complexity that depends on combinations of norms of the weight matrices of the deep neural network. For a dataset that is difficult to classify, with no underlying model and/or no recurring pattern (for example, one where the labels have been chosen randomly), spectral complexity is large, reflecting that the network needs to memorize the labels and will not generalize well. Putting back the real labels, the spectral complexity becomes lower, reflecting that some structure is present and the network has learned patterns that might generalize to unseen data. Spectral complexity results in vacuous estimates of the generalization error (the difference between the training and testing accuracy), and we show that it can lead to counterintuitive results when comparing the generalization error of different architectures.

In the second part of the thesis we explore non-vacuous estimates of the generalization error. In Chapter 2 we analyze the PAC-Bayes setting, where a posterior distribution over the weights of a deep neural network is learned using stochastic variational inference and the generalization-error bound involves the KL divergence between this posterior and a prior distribution. We find that a common approximation in which the posterior is constrained to be Gaussian with diagonal covariance, known as the mean-field approximation, significantly limits any gains in bound tightness. We also find that choosing the prior mean to be the random network initialization tightens the generalization-error estimate significantly. In Chapter 3 we explore an existing approach to learning the PAC-Bayes prior mean from the training set. Specifically, we use differential privacy, which ensures that the training samples contribute only a limited amount of information to the prior, making it distribution-dependent rather than training-set-dependent. In this way the prior should generalize well to unseen data (as it has not memorized individual samples), and at the same time any posterior distribution that is close to it in KL divergence will also exhibit good generalization.
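
As a rough illustration of the norm-based measure discussed in the first part: spectral complexity (Bartlett et al., 2017) is built from norms of the weight matrices, and its leading factor is the product of the layers' spectral norms. The sketch below computes only that product (the full measure also includes per-layer (2,1)-norm correction terms); the network and sizes are illustrative assumptions, not the thesis's setup.

```python
# Simplified sketch of a norm-based complexity proxy: the product of the
# spectral norms of a network's Linear weight matrices. This is only the
# leading factor of spectral complexity, not the full measure.
import torch

def spectral_norm_product(model):
    prod = 1.0
    for m in model.modules():
        if isinstance(m, torch.nn.Linear):
            # largest singular value of the layer's weight matrix
            prod *= torch.linalg.matrix_norm(m.weight, ord=2).item()
    return prod

net = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
print(spectral_norm_product(net))  # crude capacity proxy for this net
```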

A big challenge in algorithmic composition is to devise a model that is both easily trainable and able to reproduce the long-range temporal dependencies typical of music. Here we investigate how artificial neural networks can be trained on a large corpus of melodies and turned into automated music composers able to generate new melodies coherent with the style they have been trained on. We employ gated recurrent unit (GRU) networks, which have been shown to be particularly efficient at learning complex sequential activations with arbitrarily long time lags. Our model processes rhythm and melody in parallel while modeling the relation between these two properties. Using such an approach, we were able to generate interesting complete melodies or suggest possible continuations of a melody fragment that are coherent with the characteristics of the fragment itself.
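
As a hedged sketch of what processing rhythm and melody in parallel could look like (the class name, vocabulary sizes, and layer widths below are assumptions, not the paper's architecture): two GRU streams consume pitch and duration tokens separately, and their hidden states are concatenated so each prediction head can model the relation between the two properties.

```python
# Illustrative two-stream GRU for melody generation (not the paper's code).
import torch
import torch.nn as nn

class MelodyGRU(nn.Module):
    def __init__(self, n_pitch=128, n_dur=32, emb=64, hidden=256):
        super().__init__()
        self.pitch_emb = nn.Embedding(n_pitch, emb)
        self.dur_emb = nn.Embedding(n_dur, emb)
        self.pitch_gru = nn.GRU(emb, hidden, batch_first=True)
        self.dur_gru = nn.GRU(emb, hidden, batch_first=True)
        # concatenated states let each head condition on both properties
        self.pitch_head = nn.Linear(2 * hidden, n_pitch)
        self.dur_head = nn.Linear(2 * hidden, n_dur)

    def forward(self, pitches, durations):
        hp, _ = self.pitch_gru(self.pitch_emb(pitches))
        hd, _ = self.dur_gru(self.dur_emb(durations))
        h = torch.cat([hp, hd], dim=-1)
        return self.pitch_head(h), self.dur_head(h)

model = MelodyGRU()
pitches = torch.randint(0, 128, (2, 16))   # batch of toy pitch tokens
durations = torch.randint(0, 32, (2, 16))  # matching duration tokens
pitch_logits, dur_logits = model(pitches, durations)
print(pitch_logits.shape, dur_logits.shape)  # (2, 16, 128) (2, 16, 32)
```

At generation time, one would sample the next pitch and duration from the two heads and feed them back in autoregressively.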

2016

In this thesis, we propose new algorithms to solve inverse problems in the context of biomedical images. Due to ill-posedness, solving these problems requires some prior knowledge of the statistics of the underlying images. Traditional algorithms in the field assume prior knowledge related to the smoothness or sparsity of these images. Recently, they have been outperformed by second-generation algorithms, which harness the power of neural networks to learn the required statistics from training data. Even more recently, latest-generation deep-learning-based methods have emerged which require neither training nor training data. This thesis devises algorithms that progress through these generations, extending them to novel formulations and applications while bringing more robustness. In parallel, it also progresses in terms of complexity, from algorithms for problems with 1D data and an exactly known forward model to ones with 4D data and an unknown parametric forward model.

We introduce five main contributions; the last three propose deep-learning-based latest-generation algorithms that require no prior training.

1) We develop algorithms to solve the continuous-domain formulation of inverse problems with both classical Tikhonov and total-variation regularizations. We formalize the problems, characterize the solution set, and devise numerical approaches to find the solutions.

2) We propose an algorithm that improves upon end-to-end neural-network-based second-generation algorithms. In our method, a neural network is first trained as a projector on a training set and is then plugged in as the projector inside projected gradient descent (PGD). Since the problem is nonconvex, we relax the PGD to ensure convergence to a local minimum under some constraints. This method outperforms all previous-generation algorithms for computed tomography (CT).

3) We develop a novel time-dependent deep-image-prior algorithm for modalities that involve a temporal sequence of images. We parameterize the images as the output of an untrained neural network fed with a sequence of latent variables. To impose temporal directionality, the latent variables are assumed to lie on a 1D manifold. The network is then tuned to minimize the data fidelity. We obtain state-of-the-art results in dynamic magnetic resonance imaging (MRI) and even recover intra-frame images.

4) We propose a novel reconstruction paradigm for cryo-electron microscopy (CryoEM) called CryoGAN. Motivated by generative adversarial networks (GANs), we reconstruct a biomolecule's 3D structure such that its CryoEM measurements resemble the acquired data in a distributional sense. The algorithm is free of pose and likelihood estimation, needs no ab initio model, and is proven to have a theoretical guarantee of recovery of the true structure.

5) We extend CryoGAN to reconstruct continuously varying conformations of a structure from heterogeneous data. We parameterize the conformations as the output of a neural network fed with latent variables on a low-dimensional manifold. The method is shown to recover continuous protein conformations and their energy landscape.
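
As an illustration of the projector-based reconstruction in contribution 2, here is a minimal sketch of projected gradient descent with a pluggable projector. The relaxation the thesis uses to guarantee convergence is omitted, and the function names, operator, and sizes are illustrative assumptions; in the thesis the projector would be a trained network, whereas here it is a placeholder.

```python
# Minimal sketch (illustrative, not the thesis code) of projected gradient
# descent: x <- P(x - gamma * A^T (A x - y)), where P is ideally a network
# trained to project onto the set of plausible images.
import torch

def pgd_reconstruct(A, y, projector, gamma=1e-3, steps=300):
    x = torch.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ x - y)          # gradient of 0.5 * ||Ax - y||^2
        x = projector(x - gamma * grad)   # project onto the learned image set
    return x

# Toy usage: with the identity as projector, PGD reduces to plain
# gradient descent on the data-fidelity term.
A = torch.randn(30, 100)   # toy linear forward model (e.g. a CT-like operator)
y = A @ torch.randn(100)   # simulated measurements
x_hat = pgd_reconstruct(A, y, projector=lambda x: x)
print(torch.linalg.vector_norm(A @ x_hat - y))  # residual after fitting
```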