Êtes-vous un étudiant de l'EPFL à la recherche d'un projet de semestre?
Travaillez avec nous sur des projets en science des données et en visualisation, et déployez votre projet sous forme d'application sur Graph Search.
This thesis is devoted to information-theoretic aspects of community detection. The importance of community detection is due to the massive amount of scientific data today that describes relationships between items from a network, e.g., a social network. Items from this network can be inherently partitioned into a known number of communities, but the partition can only be inferred from the data. To estimate the underlying partition, data scientists can apply any type of advanced statistical techniques; but the data could be very noisy, or the number of data is inadequate. A fundamental question here is about the possibility of weak recovery: does the data contain a sufficient amount of information that enables us to produce a non-trivial estimate of the partition?
For the purpose of mathematical analysis, the above problem can be formulated as Bayesian inference on generative models. These models, including the stochastic block model (SBM) and censored block model (CBM), consider a random graph generated based on a hidden partition that divides the nodes in the graph into labelled groups. In the SBM, nodes are connected with a probability depending on the labels of the endpoints. Whereas, in the CBM, hidden variables are measured through a noisy channel, and the measurement outcomes form a weighted graph. In both models, inference is the task of recovering the hidden partition from the observed graph. The criteria for weak recovery can be studied via an information-theoretic quantity called mutual information. Once the asymptotic mutual information is computed, phase transitions for the weak recovery can be located.
This thesis pertains to rigorous derivations of single-letter variational expressions for the asymptotic mutual information for models in community detection. These variational expressions, known as the replica predictions, come from heuristic methods of statistical physics. We present our development of new rigorous methods for confirming the replica predictions. These methods are based on extending the recently introduced adaptive interpolation method.
We prove the replica prediction for the SBM in the dense-graph regime with two groups of asymmetric size. The existing proofs in the literature are indirect, as they involve mapping the model to an external problem whose mutual information is determined by a combination of methods. Here, on the contrary, we provide a self-contained and direct proof.
Next, we extend this method to sparse models. Before this thesis, adaptive interpolation was known for providing a conceptually simple proof for replica predictions for dense graphs. Whereas, for a sparse graph, the replica prediction involves a more complicated variational expression, and rigorous confirmations are often lacking or obtained by rather complicated methods. Therefore, we focus on a simple version of CBM on sparse graphs, where hidden variables are measured through a binary erasure channel, for which we fully prove the replica prediction by the adaptive interpolation.
The key for extending the adaptive interpolation to a broader class of sparse models is a concentration result for the so-called "multi-overlaps". This concentration forms the basis of the replica "symmetric" prediction. We prove this concentration result for a related sparse model in the context of physics. This provides inspiration for further development of the adaptive interpolation.