In information theory, the cross-entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution $q$, rather than the true distribution $p$.
The cross-entropy of the distribution $q$ relative to a distribution $p$ over a given set is defined as follows:
$$H(p, q) = -\operatorname{E}_p[\log q],$$
where $\operatorname{E}_p[\cdot]$ is the expected value operator with respect to the distribution $p$.
The definition may be formulated using the Kullback–Leibler divergence $D_{\mathrm{KL}}(p \parallel q)$, the divergence of $p$ from $q$ (also known as the relative entropy of $p$ with respect to $q$):
$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q),$$
where $H(p)$ is the entropy of $p$.
For discrete probability distributions $p$ and $q$ with the same support $\mathcal{X}$ this means
$$H(p, q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x).$$
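To make the discrete formula concrete, here is a minimal sketch in Python (NumPy assumed available; the distributions $p$ and $q$ below are made-up examples) that computes $H(p, q)$ directly and checks the decomposition $H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$.

import numpy as np

# Two made-up discrete distributions over the same support {0, 1, 2, 3}.
p = np.array([0.4, 0.3, 0.2, 0.1])       # "true" distribution
q = np.array([0.25, 0.25, 0.25, 0.25])   # "estimated" distribution

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x), in bits."""
    return -np.sum(p * np.log2(q))

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), in bits."""
    return -np.sum(p * np.log2(p))

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log2(p(x) / q(x)), in bits."""
    return np.sum(p * np.log2(p / q))

print(cross_entropy(p, q))               # 2.0 bits, since q is uniform over 4 outcomes
print(entropy(p) + kl_divergence(p, q))  # same value: H(p) + D_KL(p || q)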
The situation for continuous distributions is analogous. We have to assume that $p$ and $q$ are absolutely continuous with respect to some reference measure $r$ (usually $r$ is a Lebesgue measure on a Borel σ-algebra). Let $P$ and $Q$ be probability density functions of $p$ and $q$ with respect to $r$. Then
$$-\int_{\mathcal{X}} P(x)\, \log Q(x)\, \mathrm{d}r(x) = \operatorname{E}_p[-\log Q],$$
and therefore
$$H(p, q) = -\int_{\mathcal{X}} P(x)\, \log Q(x)\, \mathrm{d}r(x).$$
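As an illustration of the continuous case, the sketch below (Python with NumPy/SciPy assumed; the Gaussian parameters are arbitrary choices) approximates $H(p, q)$ for two univariate normal densities by numerical integration and compares the result with the standard closed form $\tfrac{1}{2}\log(2\pi\sigma_q^2) + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2}$, in nats.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Arbitrary example parameters for the "true" density p and the "model" density q.
mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 0.5, 1.5

# H(p, q) = -integral of p(x) ln q(x) dx, approximated numerically (natural log -> nats).
integrand = lambda x: -norm.pdf(x, mu_p, sigma_p) * norm.logpdf(x, mu_q, sigma_q)
h_numeric, _ = quad(integrand, -np.inf, np.inf)

# Closed form for the cross-entropy of two univariate Gaussians (in nats).
h_closed = 0.5 * np.log(2 * np.pi * sigma_q**2) \
    + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)

print(h_numeric, h_closed)  # the two values should agree to numerical precision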
NB: The notation $H(p, q)$ is also used for a different concept, the joint entropy of $p$ and $q$.
In information theory, the Kraft–McMillan theorem establishes that any uniquely decodable coding scheme for coding a message to identify one value $x_i$ out of a set of possibilities $\{x_1, \ldots, x_n\}$ can be seen as representing an implicit probability distribution $q(x_i) = \left(\tfrac{1}{2}\right)^{\ell_i}$ over $\{x_1, \ldots, x_n\}$, where $\ell_i$ is the length of the code for $x_i$ in bits. Therefore, cross-entropy can be interpreted as the expected message-length per datum when a wrong distribution $q$ is assumed while the data actually follows a distribution $p$. That is why the expectation is taken over the true probability distribution $p$ and not $q$. Indeed the expected message-length under the true distribution $p$ is
$$\operatorname{E}_p[\ell] = -\operatorname{E}_p[\log_2 q(x)] = -\sum_{i} p(x_i) \log_2 q(x_i) = H(p, q).$$
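A short sketch of this interpretation (Python; the symbol probabilities and codeword lengths are made-up): given code lengths $\ell_i$, the implied distribution is $q(x_i) = 2^{-\ell_i}$, and the expected message length under the true distribution $p$ coincides with $H(p, q)$ in bits.

import numpy as np

# Made-up example: four symbols with true probabilities p, encoded with a code whose
# codeword lengths were chosen for the uniform distribution (2 bits each).
p = np.array([0.5, 0.25, 0.125, 0.125])   # true distribution
lengths = np.array([2, 2, 2, 2])          # codeword lengths of the (mismatched) code

q = 2.0 ** (-lengths)                     # implicit distribution q(x_i) = 2**(-l_i)

expected_length = np.sum(p * lengths)     # average message length per symbol under p
cross_entropy = -np.sum(p * np.log2(q))   # H(p, q) in bits -- the same quantity
entropy = -np.sum(p * np.log2(p))         # H(p), the optimum achievable length

print(expected_length, cross_entropy, entropy)  # 2.0, 2.0, 1.75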
There are many situations where cross-entropy needs to be measured but the distribution $p$ is unknown. An example is language modeling, where a model is created based on a training set, and then its cross-entropy is measured on a test set to assess how accurate the model is in predicting the test data.
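As a rough sketch of how this is done in practice (Python; the toy corpus, the unigram model, and the smoothing choice are invented for illustration), the cross-entropy of a model on a test set is estimated as the average negative log-probability the model assigns to the held-out observations.

import math
from collections import Counter

# Toy example: fit a unigram word model on a training corpus, then estimate its
# cross-entropy (average negative log2-probability per word) on a held-out test set.
train_words = "the cat sat on the mat the cat ate".split()
test_words = "the cat sat on the rug".split()

counts = Counter(train_words)
vocab_size = len(counts)

def unigram_prob(word, alpha=1.0):
    # Add-alpha smoothing so unseen test words ("rug") get nonzero probability.
    return (counts[word] + alpha) / (len(train_words) + alpha * (vocab_size + 1))

cross_entropy_bits = -sum(math.log2(unigram_prob(w)) for w in test_words) / len(test_words)
print(cross_entropy_bits)  # estimated bits per word; lower means better predictions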
The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, or more correctly the shannon, based on the binary logarithm.
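For instance, the same entropy value can be expressed in different units simply by changing the logarithm base; the small Python sketch below (the biased coin is an arbitrary example) computes the entropy of a coin in shannons (bits), nats, and hartleys.

import math

p = [0.9, 0.1]  # a biased coin

def entropy(p, log=math.log2):
    return -sum(pi * log(pi) for pi in p)

print(entropy(p, math.log2))   # ~0.469 shannons (bits), binary logarithm
print(entropy(p, math.log))    # ~0.325 nats, natural logarithm
print(entropy(p, math.log10))  # ~0.141 hartleys, decimal logarithm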
In information theory, the joint entropy is a measure of the amount of information contained in a system of two (or more) random variables. Like the other entropies, the joint entropy is measured in bits or in nats, depending on the base of the logarithm used. If each pair of possible states $(x, y)$ of the random variables $X$ and $Y$ has a probability $p(x, y)$, then the joint entropy of $X$ and $Y$ is defined by
$$H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y),$$
where $\log_2$ is the base-2 logarithm function.
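A small sketch (Python; the joint probability table is a made-up example) of computing the joint entropy from a table of pair probabilities:

import numpy as np

# Made-up joint distribution of two binary random variables X and Y.
# Rows index x, columns index y; the entries are p(x, y) and sum to 1.
joint = np.array([[0.25, 0.25],
                  [0.40, 0.10]])

def joint_entropy(pxy):
    """H(X, Y) = -sum_{x,y} p(x,y) log2 p(x,y), in bits (0 log 0 treated as 0)."""
    p = pxy[pxy > 0]
    return -np.sum(p * np.log2(p))

print(joint_entropy(joint))  # ~1.86 bits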
The principle of maximum entropy consists, when one wants to represent imperfect knowledge of a phenomenon by a probability distribution, in: identifying the constraints that this distribution must satisfy (mean, etc.); choosing, among all the distributions satisfying these constraints, the one with the greatest entropy in the sense of Shannon. Of all these distributions, the one with maximum entropy is indeed the one containing the least information, and it is therefore, for this reason, the least arbitrary of all those one could use.
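To illustrate, the sketch below (Python with NumPy/SciPy assumed; the support and the target mean are arbitrary choices) finds the maximum-entropy distribution on $\{1, \ldots, 6\}$ with a prescribed mean by exploiting the known exponential (Gibbs) form of the solution, $q(x) \propto e^{\lambda x}$, and solving for $\lambda$ numerically.

import numpy as np
from scipy.optimize import brentq

# Maximum-entropy distribution on the support {1, ..., 6} subject to a mean constraint.
# The solution has the form q(x) proportional to exp(lambda * x); we look for the
# lambda whose resulting mean matches the target.
support = np.arange(1, 7)
target_mean = 4.5  # arbitrary target; 3.5 would give back the uniform distribution

def mean_for(lam):
    w = np.exp(lam * support)
    q = w / w.sum()
    return np.dot(q, support)

lam = brentq(lambda l: mean_for(l) - target_mean, -10.0, 10.0)
q = np.exp(lam * support)
q /= q.sum()

print(q)                        # the max-entropy distribution with mean 4.5
print(-np.sum(q * np.log2(q)))  # its Shannon entropy in bits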
This dataset contains a collection of ultrafast ultrasound acquisitions from nine volunteers and the CIRS 054G phantom. For a comprehensive understanding of the dataset, please refer to the paper: Viñals, R.; Thiran, J.-P. A KL Divergence-Based Loss for In ...