
Mutual information

Summary

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" (in units such as shannons (bits), nats or hartleys) obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.
Not limited to real-valued random variables and linear dependence like the correlation coefficient, MI is more general and determines how different the joint distribution of the pair (X,Y) is from the product of the marginal distributions of X and Y. MI is the expected value of the pointwise mutual information (PMI).
The quantity was defined and analyzed by Claude Shannon in his landmark 1948 paper "A Mathematical Theory of Communication", although he did not call it "mutual information".
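The definition above (MI as the divergence between the joint distribution of (X,Y) and the product of the marginals, and equivalently as the expected value of the PMI) can be sketched in a few lines. This is an illustrative example, not part of the page:

```python
# Minimal sketch (hypothetical example): mutual information of a discrete
# pair (X, Y) computed directly from its joint distribution. Base-2 logs,
# so the result is in bits (shannons).
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) )."""
    px = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (m, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, n)
    mask = joint > 0                        # 0 * log 0 = 0 by convention
    pmi = np.log2(joint[mask] / (px @ py)[mask])  # pointwise mutual information
    return float((joint[mask] * pmi).sum())       # MI = E[PMI]

# X = Y uniform on {0, 1}: knowing Y fully reveals X, so I(X;Y) = H(X) = 1 bit.
print(mutual_information(np.array([[0.5, 0.0],
                                   [0.0, 0.5]])))  # 1.0

# Independent fair coins: joint = product of marginals, so I(X;Y) = 0.
print(mutual_information(np.full((2, 2), 0.25)))   # 0.0
```

The second example makes the defining property concrete: MI vanishes exactly when the joint distribution equals the product of the marginals, i.e. when the variables are independent.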



Related concepts (37)

Information theory is the mathematical study of the quantification, storage, and communication of information. The field was originally established by the works of Harry Nyquist and Ralph Hartley in the 1920s, and Claude Shannon in the 1940s.

In information theory, the entropy of a random variable is the average level of "information", "surprise", or "uncertainty" inherent to the variable's possible outcomes. Given a discrete random variable X distributed according to a probability mass function p, the entropy is defined as H(X) = -Σ_x p(x) log p(x).
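As a small illustration of the definition just stated (again not part of the page), the entropy of a discrete distribution can be computed directly:

```python
# Minimal sketch (hypothetical example): Shannon entropy of a discrete
# distribution p, H(X) = -sum_x p(x) log2 p(x), measured in bits.
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]                            # 0 * log 0 = 0 by convention
    return float(-(p * np.log2(p)).sum())

print(entropy(np.array([0.5, 0.5])))        # 1.0  (fair coin: one bit)
print(entropy(np.array([1.0])))             # 0.0  (deterministic: no surprise)
print(entropy(np.full(8, 1/8)))             # 3.0  (uniform over 8 outcomes)
```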

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

Related courses (27)

COM-406: Foundations of Data Science

We discuss a set of topics that are important for the understanding of modern data science but that are typically not taught in an introductory ML course. In particular we discuss fundamental ideas and techniques that come from probability, information theory as well as signal processing.

BIO-369: Randomness and information in biological data

Biology is becoming more and more a data science, as illustrated by the explosion of available genome sequences. This course aims to show how we can make sense of such data and harness it in order to understand biological processes in a quantitative way.

MGT-302: Data driven business analytics

This course focuses on methods and algorithms needed to apply machine learning, with an emphasis on applications in business analytics.


Related publications (100)


This thesis focuses on two kinds of statistical inference problems in signal processing and data science. The first problem is the estimation of a structured informative tensor from the observation of a noisy tensor in which it is buried. The structure comes from the possibility to decompose the informative tensor as the sum of a small number of rank-one tensors (small compared to its size). Such structure has applications in data science where data, organized into arrays, can often be explained by the interaction between a few features characteristic of the problem. The second problem is the estimation of a signal input to a feedforward neural network whose output is observed. It is relevant for many applications (phase retrieval, quantized signals) where the relation between the measurements and the quantities of interest is not linear.
We look at these two statistical models in different high-dimensional limits corresponding to situations where the amount of observations and size of the signal become infinitely large. These asymptotic regimes are motivated by the ever-increasing computational power and storage capacity that make possible the processing and handling of large data sets. We take an information-theoretic approach in order to establish fundamental statistical limits of estimation in high-dimensional regimes. In particular, the main contributions of this thesis are the proofs of exact formulas for the asymptotic normalized mutual information associated with these inference problems. These are low-dimensional variational formulas that can nonetheless capture the behavior of a large system where each component interacts with all the others. Owing to the relationship between mutual information and Bayesian inference, we use the solutions to these variational problems to rigorously predict the asymptotic minimum mean-square error (MMSE), the error achieved by the (Bayes) optimal estimator. We can thus compare algorithmic performances to the statistical limit given by the MMSE.
Variational formulas for the mutual information are referred to as replica symmetric (RS) ansätze, owing to the predictions of the heuristic replica method from statistical physics. In the past decade, proofs of the validity of these predictions have started to emerge.
The general strategy is to show that the RS ansatz is both an upper and a lower bound on the asymptotic normalized mutual information. The present work leverages the adaptive interpolation method, which provides a unified way to prove the two bounds. We extend the adaptive interpolation to situations where the order parameter of the problem is not a scalar but a matrix, and to high-dimensional regimes that differ from the one for which the RS formula is usually conjectured. Our proofs also demonstrate the modularity of the method. Indeed, using statistical models previously studied in the literature as building blocks of more complex ones (e.g., an estimated signal with a model of structure), we derive RS formulas for the normalized mutual information associated with estimation problems that are relevant to modern applications.

Jean François Emmanuel Barbier, Clément Dominique Luneau, Nicolas Macris

We consider a statistical model for finite-rank symmetric tensor factorization and prove a single-letter variational expression for its mutual information when the tensor is of even order. The proof uses the adaptive interpolation method, for which rank-one matrix factorization is one of the first problems to which it was successfully applied. We show how to extend the adaptive interpolation to finite-rank symmetric tensors of even order, which requires new ideas with respect to the proof for the rank-one case. We also underline where the proof falls short when dealing with odd-order tensors.

Since the birth of Information Theory, researchers have defined and exploited various information measures, as well as endowed them with operational meanings. Some were born as a "solution to a problem", like Shannon's Entropy and Mutual Information. Others were the fruit of generalisation and the mathematical genius of bright minds like Rényi, Csiszár and Sibson. These powerful objects allow us to manipulate probabilities intuitively and always seem to be somehow connected to concrete settings in communication, coding or estimation theory. A common theme is: take a problem in one of these areas, try to control (upper- or lower-bound) the expected value of some function of interest (often, a probability of error) and, with enough work, an information measure appears as a fundamental limit of the problem. The most striking example of this is in Shannon's seminal 1948 paper: his purpose was to characterise the smallest possible expected length of a uniquely decodable encoding that compresses the realisations of a random variable. As he brilliantly proved, the smallest expected length one can hope for is the Entropy of the random variable. In establishing this connection, another quantity needed to be implicitly controlled: the Kraft sum of the code. Seemingly unrelated before, these three objects joined forces in harmony to provide a beautiful and fundamental result.

But why are they related? The answer seems to be: duality. Duality is an abstract notion commonly used in linear algebra and functional analysis. It has been expanded and generalised over the years, and several incarnations have been discovered throughout mathematics. One particular instance involves vector spaces: given two vector spaces and a "duality pairing", one can jump from one space to the other (its dual) through Legendre-Fenchel-like transforms.
In the most common settings in Information Theory, the two spaces and the pairing are, respectively: 1) the space of (probability) measures defined on X; 2) the space of bounded functions defined on X; 3) the Lebesgue integral of the function (the expected value of the function if the measure is a probability measure). Once these are set, Legendre-Fenchel-like transforms allow us to connect a) a functional acting on the space described in 1) and b) a functional acting on the space described in 2), where the anchor point is c) the (expected) value described in 3). These three pieces (a, b and c) represent the actors of many of the results provided in Information Theory. Once they are found, one usually bounds the functional described in b) and obtains a bound connecting the expected value and the functional of measures (e.g., an information measure). Going back to Shannon's result: having fixed a random variable (and thus a probability measure) and selected the function to be the length of a code, the functional a) is the Shannon Entropy of the source; the functional b) is the Kraft sum of the code; the pairing c) is the expected length of the code. We explore this connection and this pattern throughout the thesis. We will see how it can be found in notable results like coding theorems for one-to-one codes, Campbell's Coding Theorem, Arikan's Guessing Theorem, Fano-like and transportation-cost inequalities, and so on. Moreover, unearthing the pattern allows us to generalise it to other information measures and apply the technique in a variety of fields, including Learning Theory, Estimation Theory and Hypothesis Testing.
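The duality pattern the abstract describes can be made concrete with one standard instance (the Donsker-Varadhan, or Gibbs, variational principle, stated here as an illustration rather than quoted from the abstract): the relative entropy D(P||Q), a functional on measures, and the log-partition functional, a functional on bounded functions, are Legendre-Fenchel conjugates of each other under the pairing ⟨P, f⟩ = E_P[f].

```latex
% Donsker--Varadhan / Gibbs variational principle: the two conjugate
% functionals, anchored by the pairing <P, f> = E_P[f].
\[
  D(P \,\|\, Q) \;=\; \sup_{f}
    \Big\{ \mathbb{E}_P[f] - \log \mathbb{E}_Q\big[e^{f}\big] \Big\},
  \qquad
  \log \mathbb{E}_Q\big[e^{f}\big] \;=\; \sup_{P}
    \Big\{ \mathbb{E}_P[f] - D(P \,\|\, Q) \Big\}.
\]
```

Here a) is the relative entropy, b) is the log-partition functional, and c) is the expected value E_P[f], matching the three actors enumerated in the abstract.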