**Are you an EPFL student looking for a semester project?**

Work with us on data science and visualisation projects, and deploy your project as an app on top of GraphSearch.

Publication# A Functional Perspective on Information Measures

Abstract

Since the birth of Information Theory, researchers have defined and exploited various information measures, as well as endowed them with operational meanings. Some were born as a "solution to a problem", like Shannon's Entropy and Mutual Information. Others were the fruit of generalisation and the mathematical genius of bright minds like Rényi, Csizsár and Sibson. These powerful objects allow us to manipulate probabilities intuitively and seem always to be somehow connected to concrete settings in communication, coding or estimation theory. A common theme is: take a problem in one of these areas, try to control (upper or lower-bound) the expected value of some function of interest (often, probabilities of error) and, with enough work, an information measure appears as a fundamental limit of the problem. The most striking example of this is in Shannon's seminal paper in 1948: his purpose was to characterise the smallest possible expected length of a uniquely decodable encoding that compresses the realisations of a random variable. As he brilliantly proved, the smallest expected length one can hope for is the Entropy of the random variable. In establishing this connection, another quantity needed to be implicitly controlled: the Kraft's sum of the code. Seemingly unrelated before, these three objects joined forces in harmony to provide a beautiful and fundamental result. But why are they related? The answer seems to be: duality. Duality is an abstract notion commonly used in linear algebra and functional analysis. It has been expanded and generalised over the years. Several incarnations have been discovered throughout mathematics. One particular instance of this involves vector spaces: given two vector spaces and a "duality pairing" one can jump from one space to the other (its dual) through Legendre-Fenchel-like transforms. In the most common settings in Information Theory, the two spaces and the pairing are, respectively: 1) the space of (probability)measures defined on X; 2) the space of bounded functions defined on X; 3) the Lebesgue integral of the function (the expected value of the function if the measure is a probability measure). Once these are set, Legendre-Fenchel-like transforms allow us to connect a) a functional acting on the space described in 1), b) a functional acting on the space described in 2) and the anchor point is c) the (expected) value described in 3).These three pieces (a), b) and c)) represent the actors of many of the results provided in Information Theory. Once they are found, one usually bounds the functional described in b) and obtains a bound connecting the expected value and the functional of measures (e.g., an information measure). Going back to Shannon's result, fixed a random variable (and thus, a probability measure) and selected the function to be the length of a code: the functional a) is the Shannon Entropy of the source; the functional b) is the Kraft sum of the code; the pairing c) is the expected length of the code. We explore this connection and this pattern throughout the thesis. We will see how it can be found in notable results like Coding Theorems for one-to-one codes, Campbell's Coding Theorem, Arikan's Guessing Theorem, Fano-like and Transportation-Cost Inequalities and so on. Moreover, unearthing the pattern allows us to generalise it to other information measures and apply the technique in a variety of fields, including Learning Theory, Estimation Theory and Hypothesis Testing.

Official source

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Related concepts

Loading

Related publications

Loading

Related concepts (39)

Related publications (24)

Information theory

Information theory is the mathematical study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley, in

Mutual information

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies th

Random variable

A random variable (also called random quantity, aleatory variable, or stochastic variable) is a mathematical formalization of a quantity or object which depends on random events. The term 'random va

Loading

Loading

Loading

We are living in the era of "Big Data", an era characterized by a voluminous amount of available data. Such amount is mainly due to the continuing advances in the computational capabilities for capturing, storing, transmitting and processing data. However, it is not always the volume of data that matters, but rather the "relevant" information that resides in it.
Exactly 70 years ago, Claude Shannon, the father of information theory, was able to quantify the amount of information in a communication scenario based on a probabilistic model of the data. It turns out that Shannon's theory can be adapted to various probability-based information processing fields, ranging from coding theory to machine learning. The computation of some information theoretic quantities, such as the mutual information, can help in setting fundamental limits and devising more efficient algorithms for many inference problems.
This thesis deals with two different, yet intimately related, inference problems in the fields of coding theory and machine learning. We use Bayesian probabilistic formulations for both problems, and we analyse them in the asymptotic high-dimensional regime. The goal of our analysis is to assess the algorithmic performance on the first hand and to predict the Bayes-optimal performance on the second hand, using an information theoretic approach. To this end, we employ powerful analytical tools from statistical physics.
The first problem is a recent forward-error-correction code called sparse superposition code. We consider the extension of such code to a large class of noisy channels by exploiting the similarity with the compressed sensing paradigm. Moreover, we show the amenability of sparse superposition codes to perform
joint distribution matching and channel coding.
In the second problem, we study symmetric rank-one matrix factorization, a prominent model in machine learning and statistics with many applications ranging from community detection to sparse principal component analysis. We provide an explicit expression for the normalized mutual information and the minimum mean-square error of this model in the asymptotic limit. This allows us to prove the optimality of a certain iterative algorithm on a large set of parameters.
A common feature of the two problems stems from the fact that both of them are represented on dense graphical models. Hence, similar message-passing algorithms and analysis tools can be adopted. Furthermore, spatial coupling, a new technique introduced in the context of low-density parity-check (LDPC) codes, can be applied to both problems. Spatial coupling is used in this thesis as a "construction technique" to boost the algorithmic performance and as a "proof technique" to compute some information theoretic quantities.
Moreover, both of our problems retain close connections with spin glass models studied in statistical mechanics of disordered systems. This allows us to use sophisticated techniques developed in statistical physics. In this thesis, we use the potential function predicted by the replica method in order to prove the threshold saturation phenomenon associated with spatially coupled models. Moreover, one of the main contributions of this thesis is proving that the predictions given by the "heuristic" replica method are exact. Hence, our results could be of great interest for the statistical physics community as well, as they help to set a rigorous mathematical foundation of the replica predictions.

Claude Elwood Shannon in 1948, then of the Bell Labs, published one of the ground breaking papers in the history of engineering [1]. This paper (”A Mathematical Theory of Communication”, Bell System Tech. Journal, Vol. 27, July and October 1948, pp. 379 - 423 and pp. 623 - 656) laid the groundwork of an entirely new scientific discipline, information Theory, that enabled engineers for the first time to deal quantitatively with the elusive concept of information”. In his celebrated work, Shannon cleanly laid the foundation for transmission and storage of information. Using a probabilistic model, his theory helped to get further insight into the achievable limits of information transfer over perturbed medium called channel. Indeed the very same concept is used to predict the limits on data compression and achievable transmission rate on a probabilistic channel.These underlying concepts can be thought of as inequalities involving measures of probability distributions. Shannon defined several such basic measures in his original work. The field of Information Theory grew with researchers finding more results and insights into the fundamental problem of transmission of and storage using probabilistic models. By nature of the subject itself, the results obtained are usually inequalities involving basic Shannon’s measures such as entropies. Some of them are elementary, some rather complicated expressions. In order to prove further theorems as well it required to check whether certain expressions are true in an Information Theoretic sense. This motivated researchers to seek a formal method to check all possible inequalities. Raymond Yeung [2] in 1998 came out with a remarkable framework, which could verify many of the inequalities in this field. His framework thus enabled to verify all inequalities, derived from the basic Shannon measure properties. A central notion of Information Theory is entropy, which Shannon defines as measure of information itself. Given a set of jointly distributed random variables X1, X2, . . . , Xn, we can consider entropies of all random variables H(Xi), entropies of all pairs H (Xi , Xj ), etc. (2n − 1 entropy values for all nonempty subsets of {X1 , X2 , ..., Xn }). For every n-tuple of random variables we get a point in R2n−1, representing entropies of the given distribution. Following [2] we call a point in R2n−1 constructible if it represents entropy values of some collection of n random variables. The set of all constructible points is denoted b y Γ ∗n It is hard to characterize Γ∗n for an arbitrary n (for n ≥ 3, it is not even closed [?]). A more feasible (but also highly non- trivial) problem is to describe the closure Γ ̄∗n of the set Γ∗n. The set Γ ̄∗n n is a convex cone [?], and to characterize it we should describe the class of all linear inequalities of the form λ1H(X1) + . . . + λnH(Xn) + λ1,2H(X1X2) + . . . + λ1,2,3H(X1, X2, X3) + . . . + λ1,2,3,...,nH(X1, X2, X3, . . . , Xn) which are true for any random variables X1, X2, . . . , Xn (λi are real coefficients). Information inequalities are widely used for proving converse coding theorems in Information Theory. Recently interesting applications of information inequalities beyond Information Theory were found [10],[12],[14]. So investigation of the class of all valid information inequalities is an interesting problem. We refer the reader to [15] for a comprehensive treatment of the subject. Yeung’s framework thus helped to verify all the Shannon type inequalities. Yeung and Yan have also developed a software, to computationally verify such inequalities. Since the software is rather outdated, we have made an attempt to make a more efficient and user friendly implementation of the software, hinging from the original work of Yeung. The software, which we call information inequality solver (iis) is freely available for download from EPFL website. The new software suit has the added advantage that it is freed of dependencies on any licensed products such as Matlab (or toolboxes).

2007This thesis is devoted to information-theoretic aspects of community detection. The importance of community detection is due to the massive amount of scientific data today that describes relationships between items from a network, e.g., a social network. Items from this network can be inherently partitioned into a known number of communities, but the partition can only be inferred from the data.
To estimate the underlying partition, data scientists can apply any type of advanced statistical techniques; but the data could be very noisy, or the number of data is inadequate. A fundamental question here is about the possibility of weak recovery: does the data contain a sufficient amount of information that enables us to produce a non-trivial estimate of the partition?
For the purpose of mathematical analysis, the above problem can be formulated as Bayesian inference on generative models. These models, including the stochastic block model (SBM) and censored block model (CBM), consider a random graph generated based on a hidden partition that divides the nodes in the graph into labelled groups. In the SBM, nodes are connected with a probability depending on the labels of the endpoints. Whereas, in the CBM, hidden variables are measured through a noisy channel, and the measurement outcomes form a weighted graph. In both models, inference is the task of recovering the hidden partition from the observed graph. The criteria for weak recovery can be studied via an information-theoretic quantity called mutual information. Once the asymptotic mutual information is computed, phase transitions for the weak recovery can be located.
This thesis pertains to rigorous derivations of single-letter variational expressions for the asymptotic mutual information for models in community detection. These variational expressions, known as the replica predictions, come from heuristic methods of statistical physics. We present our development of new rigorous methods for confirming the replica predictions. These methods are based on extending the recently introduced adaptive interpolation method.
We prove the replica prediction for the SBM in the dense-graph regime with two groups of asymmetric size. The existing proofs in the literature are indirect, as they involve mapping the model to an external problem whose mutual information is determined by a combination of methods. Here, on the contrary, we provide a self-contained and direct proof.
Next, we extend this method to sparse models. Before this thesis, adaptive interpolation was known for providing a conceptually simple proof for replica predictions for dense graphs. Whereas, for a sparse graph, the replica prediction involves a more complicated variational expression, and rigorous confirmations are often lacking or obtained by rather complicated methods. Therefore, we focus on a simple version of CBM on sparse graphs, where hidden variables are measured through a binary erasure channel, for which we fully prove the replica prediction by the adaptive interpolation.
The key for extending the adaptive interpolation to a broader class of sparse models is a concentration result for the so-called "multi-overlaps". This concentration forms the basis of the replica "symmetric" prediction. We prove this concentration result for a related sparse model in the context of physics. This provides inspiration for further development of the adaptive interpolation.