**Are you an EPFL student looking for a semester project?**

Work with us on data science and visualisation projects, and deploy your project as an app on top of GraphSearch.

Publication# Erdos Goes Neural: an Unsupervised Learning Framework for Combinatorial Optimization on Graphs

Abstract

Combinatorial optimization (CO) problems are notoriously challenging for neural networks, especially in the absence of labeled instances. This work proposes an unsupervised learning framework for CO problems on graphs that can provide integral solutions of certified quality. Inspired by Erdos' probabilistic method, we use a neural network to parametrize a probability distribution over sets. Crucially, we show that when the network is optimized w.r.t. a suitably chosen loss, the learned distribution contains, with controlled probability, a low-cost integral solution that obeys the constraints of the combinatorial problem. The probabilistic proof of existence is then derandomized to decode the desired solutions. We demonstrate the efficacy of this approach to obtain valid solutions to the maximum clique problem and to perform local graph clustering. Our method achieves competitive results on both real datasets and synthetic hard instances.

Official source

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Related concepts

Loading

Related publications

Loading

Related concepts (18)

Neural network

A neural network can refer to a neural circuit of biological neurons (sometimes also called a biological neural network), a network of artificial neurons or nodes in the case of an artificial neur

Probability

Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number

Probability distribution

In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mat

Related publications (83)

Loading

Loading

Loading

Deep neural networks have been empirically successful in a variety of tasks, however their theoretical understanding is still poor. In particular, modern deep neural networks have many more parameters than training data. Thus, in principle they should overfit the training samples and exhibit poor generalization to the complete data distribution. Counter intuitively however, they manage to achieve both high training accuracy and high testing accuracy. One can prove generalization using a validation set, however this can be difficult when training samples are limited and at the same time we do not obtain any information about why deep neural networks generalize well. Another approach is to estimate the complexity of the deep neural network. The hypothesis is that if a network with high training accuracy has high complexity it will have memorized the data, while if it has low complexity it will have learned generalizable patterns. In the first part of this thesis we explore Spectral Complexity, a measure of complexity that depends on combinations of norms of the weight matrices of the deep neural network. For a dataset that is difficult to classify, with no underlying model and/or no recurring pattern, for example one where the labels have been chosen randomly, spectral complexity has a large value, reflecting that the network needs to memorize the labels, and will not generalize well. Putting back the real labels, the spectral complexity becomes lower reflecting that some structure is present and the network has learned patterns that might generalize to unseen data. Spectral complexity results in vacuous estimates of the generalization error (the difference between the training and testing accuracy), and we show that it can lead to counterintuitive results when comparing the generalization error of different architectures. In the second part of the thesis we explore non-vacuous estimates of the generalization error. In Chapter 2 we analyze the case of PAC-Bayes where a posterior distribution over the weights of a deep neural network is learned using stochastic variational inference, and the generalization error is the KL divergence between this posterior and a prior distribution. We find that a common approximation where the posterior is constrained to be Gaussian with diagonal covariance, known as the mean-field approximation, limits significantly any gains in bound tightness. We find that, if we choose the prior mean to be the random network initialization, the generalization error estimate tightens significantly. In Chapter 3 we explore an existing approach to learning the prior mean, in PAC-Bayes, from the training set. Specifically, we explore differential privacy, which ensures that the training samples contribute only a limited amount of information to the prior, making it distribution and not training set dependent. In this way the prior should generalize well to unseen data (as it hasn't memorized individual samples) and at the same time any posterior distribution that is close to it in terms of the KL divergence will also exhibit good generalization.

Humans and some other animals are able to perform tasks that require coordination of movements across multiple temporal scales, ranging from hundreds of milliseconds to several seconds. The fast timescale at which neurons naturally operate, on the order of tens of milliseconds, is well-suited to support motor control of rapid movements. In contrast, to coordinate movements on the order of seconds, a neural network should produce reliable dynamics on a similarly âslowâ timescale. Neurons and synapses exhibit biophysical mechanisms whose timescales range from tens of milliseconds to hours, which suggests a possible role of these mechanisms in producing slow reliable dynamics. However, how such mechanisms influence network dynamics is not yet understood. An alternative approach to achieve slow dynamics in a neural network consists in modifying its connectivity structure. Still, the limitations of this approach and in particular to what degree the weights require fine-tuning, remain unclear. Understanding how both the single neuron mechanisms and the connectivity structure might influence the network dynamics
to produce slow timescales is the main goal of this thesis.
We first consider the possibility of obtaining slow dynamics in binary networks by tuning their connectivity. It is known that binary networks can produce sequential dynamics. However, if the sequences consist of random patterns, the typical length of the longest sequence that can be produced grows linearly with the number of units. Here, we show that we can overcome this limitation by carefully designing the sequence structure. More precisely, we obtain a constructive proof that allows to obtain sequences whose length scales exponentially with the number of units. To achieve this however, one needs to exponentially fine-tune the connectivity matrix.
Next, we focus on the interaction between single neuron mechanisms and recurrent dynamics. Particular attention is dedicated to adaptation, which is known to have a broad range of timescales and is therefore particularly interesting for the subject of this thesis. We study the dynamics of a random network with adaptation using mean-field techniques, and we show that the network can enter a state of resonant chaos. Interestingly, the resonance frequency of this state is independent of the connectivity strength and depends only on the properties of the single neuron model. The approach used to study networks with adaptation can also be applied when considering linear rate units with an arbitrary number of auxiliary variables. Based on a qualitative analysis of the mean-field theory for a random network whose neurons are described by a D -dimensional rate model, we conclude that the statistics of the chaotic dynamics are strongly influenced by the single neuron model under investigation.
Using a reservoir computing approach, we show preliminary evidence that slow adaptation can be beneficial when performing tasks that require slow timescales. The positive impact of adaptation on the network performance is particularly strong in the presence of noise. Finally, we propose a network architecture in which the slowing-down effect due to adaptation is combined with a hierarchical structure, with the purpose of efficiently generate sequences that require multiple, hierarchically organized timescales.

With ever greater computational resources and more accessible software, deep neural networks have become ubiquitous across industry and academia.
Their remarkable ability to generalize to new samples defies the conventional view, which holds that complex, over-parameterized networks would be prone to overfitting.
This apparent discrepancy is exacerbated by our inability to inspect and interpret the high-dimensional, non-linear, latent representations they learn, which has led many to refer to neural networks as

`black-boxes''. The Law of Parsimony states that `

simpler solutions are more likely to be correct than complex ones''. Since they perform quite well in practice, a natural question to ask, then, is in what way are neural networks simple?
We propose that compression is the answer. Since good generalization requires invariance to irrelevant variations in the input, it is necessary for a network to discard this irrelevant information. As a result, semantically similar samples are mapped to similar representations in neural network deep feature space, where they form simple, low-dimensional structures.
Conversely, a network that overfits relies on memorizing individual samples. Such a network cannot discard information as easily.
In this thesis we characterize the difference between such networks using the non-negative rank of activation matrices. Relying on the non-negativity of rectified-linear units, the non-negative rank is the smallest number that admits an exact non-negative matrix factorization.
We derive an upper bound on the amount of memorization in terms of the non-negative rank, and show it is a natural complexity measure for rectified-linear units.
With a focus on deep convolutional neural networks trained to perform object recognition, we show that the two non-negative factors derived from deep network layers decompose the information held therein in an interpretable way. The first of these factors provides heatmaps which highlight similarly encoded regions within an input image or image set. We find that these networks learn to detect semantic parts and form a hierarchy, such that parts are further broken down into sub-parts.
We quantitatively evaluate the semantic quality of these heatmaps by using them to perform semantic co-segmentation and co-localization. In spite of the convolutional network we use being trained solely with image-level labels, we achieve results comparable or better than domain-specific state-of-the-art methods for these tasks.
The second non-negative factor provides a bag-of-concepts representation for an image or image set. We use this representation to derive global image descriptors for images in a large collection. With these descriptors in hand, we perform two variations content-based image retrieval, i.e. reverse image search. Using information from one of the non-negative matrix factors we obtain descriptors which are suitable for finding semantically related images, i.e., belonging to the same semantic category as the query image. Combining information from both non-negative factors, however, yields descriptors that are suitable for finding other images of the specific instance depicted in the query image, where we again achieve state-of-the-art performance.