
# Pointwise mutual information

Summary

In statistics, probability theory and information theory, pointwise mutual information (PMI), or point mutual information, is a measure of association. It compares the probability of two events occurring together to what this probability would be if the events were independent.
PMI (especially in its positive pointwise mutual information variant) has been described as "one of the most important concepts in NLP", where it "draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in [a] corpus than we would have a priori expected them to appear by chance."
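To make this weighting scheme concrete, here is a minimal sketch (the word list and co-occurrence counts are toy values, not drawn from any corpus) of how a positive PMI matrix can be built from word–word co-occurrence counts:

```python
import numpy as np

# Toy word-word co-occurrence counts (rows/columns = words); values are illustrative.
words = ["ice", "steam", "solid", "gas"]
counts = np.array([
    [0, 1, 8, 1],
    [1, 0, 1, 7],
    [8, 1, 0, 1],
    [1, 7, 1, 0],
], dtype=float)

total = counts.sum()
p_xy = counts / total                      # joint probabilities p(w1, w2)
p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of the row word
p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of the column word

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_xy / (p_x * p_y))      # pointwise mutual information

ppmi = np.maximum(pmi, 0.0)                # positive PMI: clip negatives (and -inf) to 0
print(np.round(ppmi, 3))
```

Clipping negative values to zero is what distinguishes PPMI from plain PMI: word pairs that never co-occur would otherwise receive infinitely negative scores.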
The concept was introduced in 1961 by Robert Fano under the name of "mutual information", but today that term is instead used for a related measure of dependence between random variables: The mutual information (MI) of two discrete random variables refers to the average PMI of all possible events.
The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically:

$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x\mid y)}{p(x)} = \log\frac{p(y\mid x)}{p(y)}$$

(with the latter two expressions being equal to the first by Bayes' theorem). The mutual information (MI) of the random variables X and Y is the expected value of the PMI (over all possible outcomes).
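As a quick numerical check that the three expressions agree, here is a small sketch using illustrative probabilities (the values of p_xy, p_x, and p_y are placeholders):

```python
import math

# Illustrative probabilities for a single pair of outcomes (x, y).
p_xy = 0.7      # joint probability p(x, y)
p_x = 0.8       # marginal p(x)
p_y = 0.75      # marginal p(y)

pmi_joint = math.log2(p_xy / (p_x * p_y))          # log p(x,y) / (p(x) p(y))
pmi_x_given_y = math.log2((p_xy / p_y) / p_x)      # log p(x|y) / p(x)
pmi_y_given_x = math.log2((p_xy / p_x) / p_y)      # log p(y|x) / p(y)

assert math.isclose(pmi_joint, pmi_x_given_y) and math.isclose(pmi_joint, pmi_y_given_x)
print(round(pmi_joint, 6))   # ~0.222392
```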
The measure is symmetric ($\operatorname{pmi}(x;y)=\operatorname{pmi}(y;x)$). It can take positive or negative values, but is zero if X and Y are independent. Note that even though PMI may be negative or positive, its expected outcome over all joint events (MI) is non-negative. PMI maximizes when X and Y are perfectly associated (i.e. $p(x\mid y)=1$ or $p(y\mid x)=1$), yielding the following bounds:

$$-\infty \le \operatorname{pmi}(x;y) \le \min\bigl(-\log p(x),\, -\log p(y)\bigr).$$

Finally, $\operatorname{pmi}(x;y)$ will increase if $p(x\mid y)$ is fixed but $p(x)$ decreases.
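For instance, the upper bound is attained under perfect association with $p(x,y) = p(x) = p(y)$:

$$\operatorname{pmi}(x;y) = \log\frac{p(x)}{p(x)\,p(x)} = -\log p(x) = \min\bigl(-\log p(x),\, -\log p(y)\bigr).$$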
Here is an example to illustrate:

| x | y | p(x, y) |
|---|---|---------|
| 0 | 0 | 0.1 |
| 0 | 1 | 0.7 |
| 1 | 0 | 0.15 |
| 1 | 1 | 0.05 |

Using this table we can marginalize to get the following additional table for the individual distributions:

| value | p(x) | p(y) |
|-------|------|------|
| 0 | 0.8 | 0.25 |
| 1 | 0.2 | 0.75 |

With this example, we can compute four values for $\operatorname{pmi}(x;y)$. Using base-2 logarithms:

$$\operatorname{pmi}(x=0;y=0) = -1$$
$$\operatorname{pmi}(x=0;y=1) \approx 0.222392$$
$$\operatorname{pmi}(x=1;y=0) \approx 1.584963$$
$$\operatorname{pmi}(x=1;y=1) \approx -1.584963$$

(For reference, the mutual information $\operatorname{I}(X;Y)$ would then be 0.2141709.)
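These values can be reproduced with a few lines of code; the following sketch simply recomputes the tables above from the joint distribution:

```python
import math

# Joint distribution p(x, y) from the example above.
p = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.15, (1, 1): 0.05}

# Marginals obtained by summing out the other variable.
p_x = {x: sum(v for (xx, _), v in p.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(v for (_, yy), v in p.items() if yy == y) for y in (0, 1)}

pmi = {(x, y): math.log2(p[(x, y)] / (p_x[x] * p_y[y])) for (x, y) in p}
mi = sum(p[(x, y)] * pmi[(x, y)] for (x, y) in p)   # MI = expected PMI

for (x, y), value in sorted(pmi.items()):
    print(f"pmi(x={x}; y={y}) = {value:.6f}")
print(f"I(X; Y) = {mi:.7f}")   # ~0.2141709
```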



Related publications (9)

We discuss some properties of generative models for word embeddings. Namely, Arora et al. (2016) proposed a latent discourse model implying the concentration of the partition function of the word vectors. This concentration phenomenon led to an asymptotic linear relation between the pointwise mutual information (PMI) of pairs of words and the scalar product of their vectors. Here, we first revisit this concentration phenomenon and prove it under slightly weaker assumptions, for a set of random vectors symmetrically distributed around the origin. Second, we empirically evaluate the relation between PMI and scalar products of word vectors satisfying the concentration property. Our empirical results indicate that, in practice, this relation does not hold with arbitrarily small error. This observation is further supported by two theoretical results: (i) the error cannot be exactly zero because the corresponding shifted PMI matrix cannot be positive semidefinite; (ii) under mild assumptions, there exist pairs of words for which the error cannot be close to zero. We deduce that either natural language does not follow the assumptions of the considered generative model, or the current word vector generation methods do not allow the construction of the hypothesized word embeddings.
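As a rough illustration of the positive-semidefiniteness argument mentioned in this abstract (a toy numeric sketch, not the construction used in the publication; the counts and the shift constant are made up), one can build a symmetric PMI matrix from co-occurrence counts, subtract a shift, and inspect its eigenvalues:

```python
import numpy as np

# Toy symmetric co-occurrence counts; values are illustrative, not from the paper.
counts = np.array([
    [2., 8., 1., 1.],
    [8., 2., 1., 1.],
    [1., 1., 2., 8.],
    [1., 1., 8., 2.],
])

p_xy = counts / counts.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
pmi = np.log(p_xy / (p_x * p_x.T))   # PMI matrix (symmetric here, all counts > 0)

shift = np.log(5.0)                  # arbitrary shift constant, purely illustrative
shifted_pmi = pmi - shift

eigenvalues = np.linalg.eigvalsh(shifted_pmi)
print(np.round(eigenvalues, 3))
# Any negative eigenvalue means the shifted PMI matrix is not positive semidefinite,
# so it cannot be written exactly as a Gram matrix V @ V.T of word vectors.
```

A negative eigenvalue is the kind of obstruction the abstract points to: a matrix that is not positive semidefinite cannot be factored exactly as inner products of word vectors.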

This thesis focuses on two kinds of statistical inference problems in signal processing and data science. The first problem is the estimation of a structured informative tensor from the observation of a noisy tensor in which it is buried. The structure comes from the possibility to decompose the informative tensor as the sum of a small number of rank-one tensors (small compared to its size). Such structure has applications in data science where data, organized into arrays, can often be explained by the interaction between a few features characteristic of the problem. The second problem is the estimation of a signal input to a feedforward neural network whose output is observed. It is relevant for many applications (phase retrieval, quantized signals) where the relation between the measurements and the quantities of interest is not linear.
We look at these two statistical models in different high-dimensional limits corresponding to situations where the amount of observations and size of the signal become infinitely large. These asymptotic regimes are motivated by the ever-increasing computational power and storage capacity that make possible the processing and handling of large data sets. We take an information-theoretic approach in order to establish fundamental statistical limits of estimation in high-dimensional regimes. In particular, the main contributions of this thesis are the proofs of exact formulas for the asymptotic normalized mutual information associated with these inference problems. These are low-dimensional variational formulas that can nonetheless capture the behavior of a large system where each component interacts with all the others. Owing to the relationship between mutual information and Bayesian inference, we use the solutions to these variational problems to rigorously predict the asymptotic minimum mean-square error (MMSE), the error achieved by the (Bayes) optimal estimator. We can thus compare algorithmic performances to the statistical limit given by the MMSE.
Variational formulas for the mutual information are referred to as replica symmetric (RS) ansätze due to the predictions of the heuristic replica method from statistical physics. In the past decade, proofs of the validity of these predictions have started to emerge.
The general strategy is to show that the RS ansatz is both an upper and a lower bound on the asymptotic normalized mutual information. The present work builds on the adaptive interpolation method, which provides a unified way to prove the two bounds. We extend the adaptive interpolation to situations where the order parameter of the problem is not a scalar but a matrix, and to high-dimensional regimes that differ from the one for which the RS formula is usually conjectured. Our proofs also demonstrate the modularity of the method. Indeed, using statistical models previously studied in the literature as building blocks of more complex ones (e.g., an estimated signal with a model of structure), we derive RS formulas for the normalized mutual information associated with estimation problems that are relevant to modern applications.


In this work, the probability of an event under some joint distribution is bounded by measuring it with the product of the marginals instead (which is typically easier to analyze), together with a measure of the dependence between the two random variables. These results find applications in adaptive data analysis, where multiple dependencies are introduced, and in learning theory, where they can be employed to bound the generalization error of a learning algorithm. Bounds are given in terms of Sibson’s Mutual Information, α-Divergences, Hellinger Divergences, and f-Divergences. A case of particular interest is the Maximal Leakage (or Sibson’s Mutual Information of order infinity), since this measure is robust to post-processing and composes adaptively. The corresponding bound can be seen as a generalization of classical bounds, such as Hoeffding’s and McDiarmid’s inequalities, to the case of dependent random variables.
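The publication's bounds are stated in terms of Sibson's mutual information and related divergences, which are not reproduced here; the following sketch only illustrates the underlying change-of-measure idea using the PMI defined on this page and the joint distribution of the worked example above: since $p(x,y) = p(x)\,p(y)\,2^{\operatorname{pmi}(x;y)}$, every event E satisfies $P_{XY}(E) \le 2^{\max \operatorname{pmi}} \cdot (P_X \otimes P_Y)(E)$.

```python
import math

# Joint distribution from the worked example above.
p = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.15, (1, 1): 0.05}
p_x = {x: sum(v for (xx, _), v in p.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(v for (_, yy), v in p.items() if yy == y) for y in (0, 1)}

pmi = {xy: math.log2(p[xy] / (p_x[xy[0]] * p_y[xy[1]])) for xy in p}
max_pmi = max(pmi.values())

event = {(0, 1), (1, 1)}                                   # an arbitrary event E
prob_joint = sum(p[xy] for xy in event)                    # P_XY(E)
prob_product = sum(p_x[x] * p_y[y] for (x, y) in event)    # (P_X x P_Y)(E)

# Change of measure: p(x,y) = p(x) p(y) 2**pmi(x;y) <= p(x) p(y) 2**max_pmi.
assert prob_joint <= 2 ** max_pmi * prob_product + 1e-12
print(prob_joint, 2 ** max_pmi * prob_product)
```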
