Concept

Information bottleneck method

Summary
The information bottleneck method is a technique in information theory introduced by Naftali Tishby, Fernando C. Pereira, and William Bialek. It is designed to find the best tradeoff between accuracy and complexity (compression) when summarizing (e.g. clustering) a random variable X, given a joint probability distribution p(X,Y) between X and an observed relevant variable Y. Its authors describe it as providing "a surprisingly rich framework for discussing a variety of problems in signal processing and learning". Applications include distributional clustering and dimension reduction, and more recently it has been suggested as a theoretical foundation for deep learning.

The method generalizes the classical notion of minimal sufficient statistics from parametric statistics to arbitrary distributions, not necessarily of exponential form. It does so by relaxing the sufficiency condition to capture some fraction of the mutual information with the relevant variable Y.

The information bottleneck can also be viewed as a rate distortion problem, with a distortion function that measures how well Y is predicted from a compressed representation T compared to its direct prediction from X. This interpretation provides a general iterative algorithm for solving the information bottleneck trade-off and calculating the information curve from the distribution p(X,Y). Let the compressed representation be given by the random variable T. The algorithm minimizes the following functional with respect to the conditional distribution p(t|x):

$$\min_{p(t|x)} \; I(X;T) - \beta \, I(T;Y)$$

where I(X;T) and I(T;Y) are the mutual information of X and T, and of T and Y, respectively, and β is a Lagrange multiplier.

It has been mathematically proven that controlling the information bottleneck is one way to control generalization error in deep learning. Namely, the generalization error is proven to scale as

$$\tilde{O}\left(\sqrt{\frac{I(X;T) + 1}{n}}\right)$$

where n is the number of training samples, X is the input to a deep neural network, and T is the output of a hidden layer.
This generalization bound scales with the degree of information bottleneck, unlike other generalization bounds, which scale with the number of parameters, VC dimension, Rademacher complexity, stability, or robustness.
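
The iterative algorithm mentioned above can be illustrated for small discrete distributions. The following is a minimal sketch, not a reference implementation: it assumes p(X,Y) is given as a nested list with every row having positive mass, uses the standard self-consistent updates p(t|x) ∝ p(t) exp(−β KL[p(y|x) ‖ p(y|t)]), and measures information in nats. The function names (mutual_information, iterative_ib, ib_information), the random initialization, and the fixed iteration count are illustrative choices, not part of the original description.

```python
import math
import random

def mutual_information(p_ab):
    """Mutual information I(A;B) in nats for a joint distribution (nested list)."""
    p_a = [sum(row) for row in p_ab]
    p_b = [sum(col) for col in zip(*p_ab)]
    mi = 0.0
    for i, row in enumerate(p_ab):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (p_a[i] * p_b[j]))
    return mi

def iterative_ib(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Return a soft assignment p(t|x) as a list of rows, one row per value of x.

    Assumes every row of p_xy has positive total mass (p(x) > 0).
    """
    rng = random.Random(seed)
    n_x, n_y = len(p_xy), len(p_xy[0])
    eps = 1e-12  # guards against log(0) and division by zero
    p_x = [sum(row) for row in p_xy]
    p_y_given_x = [[p / p_x[x] for p in row] for x, row in enumerate(p_xy)]

    # Random soft initialization of p(t|x); each row is normalized to sum to 1.
    q = []
    for _ in range(n_x):
        w = [rng.random() + 1e-3 for _ in range(n_clusters)]
        s = sum(w)
        q.append([v / s for v in w])

    for _ in range(n_iter):
        # p(t) = sum_x p(x) p(t|x)
        p_t = [sum(p_x[x] * q[x][t] for x in range(n_x)) for t in range(n_clusters)]
        # p(y|t) = sum_x p(x,y) p(t|x) / p(t)
        p_y_given_t = [
            [sum(p_xy[x][y] * q[x][t] for x in range(n_x)) / max(p_t[t], eps)
             for y in range(n_y)]
            for t in range(n_clusters)
        ]
        # p(t|x) is proportional to p(t) * exp(-beta * KL[p(y|x) || p(y|t)])
        new_q = []
        for x in range(n_x):
            logits = []
            for t in range(n_clusters):
                kl = sum(
                    pyx * math.log(pyx / max(p_y_given_t[t][y], eps))
                    for y, pyx in enumerate(p_y_given_x[x]) if pyx > 0
                )
                logits.append(math.log(max(p_t[t], eps)) - beta * kl)
            m = max(logits)  # subtract the max for numerical stability
            w = [math.exp(v - m) for v in logits]
            s = sum(w)
            new_q.append([v / s for v in w])
        q = new_q
    return q

def ib_information(p_xy, q):
    """Return (I(X;T), I(T;Y)) for a joint p(x,y) and an assignment q = p(t|x)."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(q[0])
    p_x = [sum(row) for row in p_xy]
    p_xt = [[p_x[x] * q[x][t] for t in range(n_t)] for x in range(n_x)]
    p_ty = [[sum(p_xy[x][y] * q[x][t] for x in range(n_x)) for y in range(n_y)]
            for t in range(n_t)]
    return mutual_information(p_xt), mutual_information(p_ty)

# Toy joint distribution: four values of X, two values of Y.
p_xy = [[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]]
q = iterative_ib(p_xy, n_clusters=2, beta=5.0)
i_xt, i_ty = ib_information(p_xy, q)
print("I(X;T) =", i_xt, "I(T;Y) =", i_ty, "I(X;Y) =", mutual_information(p_xy))
```

Running the sketch for a range of β values and recording the resulting (I(X;T), I(T;Y)) pairs is one way to trace an approximation of the information curve. With β = 0 the update ignores Y entirely, so the representation carries no information about X. Like other alternating schemes of this kind, the iteration may converge to a local rather than global optimum, so results can depend on the initialization.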