In information theory, the cross-entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution $q$, rather than the true distribution $p$.

The cross-entropy of the distribution $q$ relative to a distribution $p$ over a given set is defined as follows:

$$H(p, q) = -\mathbb{E}_p[\log q],$$

where $\mathbb{E}_p[\cdot]$ is the expected value operator with respect to the distribution $p$.

The definition may be formulated using the Kullback–Leibler divergence $D_{\mathrm{KL}}(p \parallel q)$, the divergence of $p$ from $q$ (also known as the relative entropy of $p$ with respect to $q$):

$$H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q),$$

where $H(p)$ is the entropy of $p$.

For discrete probability distributions $p$ and $q$ with the same support $\mathcal{X}$, this means

$$H(p, q) = -\sum_{x \in \mathcal{X}} p(x) \log q(x).$$

The situation for continuous distributions is analogous. We have to assume that $p$ and $q$ are absolutely continuous with respect to some reference measure $r$ (usually $r$ is the Lebesgue measure on a Borel σ-algebra). Let $P$ and $Q$ be the probability density functions of $p$ and $q$ with respect to $r$. Then

$$-\int_{\mathcal{X}} P(x) \log Q(x)\, \mathrm{d}r(x) = \mathbb{E}_p[-\log Q],$$

and therefore

$$H(p, q) = -\int_{\mathcal{X}} P(x) \log Q(x)\, \mathrm{d}r(x).$$

NB: the notation $H(p, q)$ is also used for a different concept, the joint entropy of $p$ and $q$.

In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value $x_i$ out of a set of possibilities $\{x_1, \ldots, x_n\}$ can be seen as representing an implicit probability distribution $q(x_i) = \left(\tfrac{1}{2}\right)^{\ell_i}$ over $\{x_1, \ldots, x_n\}$, where $\ell_i$ is the length of the code for $x_i$ in bits. Therefore, cross-entropy can be interpreted as the expected message length per datum when a wrong distribution $q$ is assumed while the data actually follow a distribution $p$. That is why the expectation is taken over the true probability distribution $p$ and not $q$. Indeed, the expected message length under the true distribution $p$ is

$$\mathbb{E}_p[\ell] = -\mathbb{E}_p\!\left[\log_2 q(x)\right] = -\sum_{x_i} p(x_i) \log_2 q(x_i) = H(p, q).$$

There are many situations where cross-entropy needs to be measured but the true distribution $p$ is unknown. An example is language modeling, where a model is created based on a training set $T$, and then its cross-entropy is measured on a test set to assess how accurately the model predicts the test data.
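To make the discrete-case formulas concrete, here is a minimal Python/NumPy sketch. The example distributions `p` and `q` and the helper names `entropy`, `kl_divergence`, and `cross_entropy` are illustrative choices, not taken from the text above; it computes $H(p)$, $D_{\mathrm{KL}}(p \parallel q)$, and $H(p, q)$ in bits and checks the identity $H(p, q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$.

```python
import numpy as np

def entropy(p, base=2.0):
    """Entropy H(p) = -sum_x p(x) log p(x), skipping zero-probability events."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

def kl_divergence(p, q, base=2.0):
    """Relative entropy D_KL(p || q) = sum_x p(x) log(p(x) / q(x)).

    Assumes q(x) > 0 wherever p(x) > 0 (absolute continuity in the discrete case).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz])) / np.log(base)

def cross_entropy(p, q, base=2.0):
    """Cross-entropy H(p, q) = -sum_x p(x) log q(x)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz])) / np.log(base)

# Illustrative distributions over the same four events:
p = [0.5, 0.25, 0.125, 0.125]   # true distribution
q = [0.25, 0.25, 0.25, 0.25]    # assumed (coding) distribution

print(cross_entropy(p, q))                   # 2.0 bits
print(entropy(p) + kl_divergence(p, q))      # also 2.0 bits: H(p,q) = H(p) + D_KL(p||q)
```

Under the coding interpretation, the uniform $q$ corresponds to giving every event a 2-bit codeword, so the expected message length is 2 bits per event, whereas a code matched to the true $p$ would need only $H(p) = 1.75$ bits on average.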

Related courses (32)
COM-406: Foundations of Data Science
We discuss a set of topics that are important for the understanding of modern data science but that are typically not taught in an introductory ML course. In particular we discuss fundamental ideas an
PHYS-105: Advanced physics II (thermodynamics)
This course presents thermodynamics as a theory that allows the description of a large number of important phenomena in physics, chemistry, and engineering, as well as transport effects. An introduction …
BIO-369: Randomness and information in biological data
Biology is becoming more and more a data science, as illustrated by the explosion of available genome sequences. This course aims to show how we can make sense of such data and harness it in order to
Related lectures (242)
Quantum Information
Explores the CHSH operator, self-testing, eigenstates, and quantifying randomness in quantum systems.
Introduction to Data Science
Introduces the basics of data science, covering decision trees, machine learning advancements, and deep reinforcement learning.
Information Measures: Entropy and Information Theory
Explains how entropy measures uncertainty in a system based on possible outcomes.
Related concepts (13)
Quantities of information
The mathematical theory of information is based on probability theory and statistics, and measures information with several quantities of information. The choice of logarithmic base in the following formulae determines the unit of information entropy that is used. The most common unit of information is the bit, or more correctly the shannon, based on the binary logarithm.
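As a quick illustration of how the choice of logarithmic base fixes the unit, the following small sketch (the fair-coin distribution is just an illustrative example, not taken from the text above) evaluates the same entropy in shannons/bits (base 2), nats (base e), and hartleys (base 10).

```python
import math

# Entropy of a fair coin flip, p = (1/2, 1/2), expressed in three common units.
p = [0.5, 0.5]
h_shannons = -sum(pi * math.log2(pi) for pi in p)    # 1.0 shannon (bit)
h_nats     = -sum(pi * math.log(pi) for pi in p)     # ~0.693 nat (natural log)
h_hartleys = -sum(pi * math.log10(pi) for pi in p)   # ~0.301 hartley (base-10 log)
print(h_shannons, h_nats, h_hartleys)
```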
Joint entropy
In information theory, joint entropy is a measure of the uncertainty associated with a set of variables. The joint Shannon entropy (in bits) of two discrete random variables $X$ and $Y$ with images $\mathcal{X}$ and $\mathcal{Y}$ is defined as

$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) \log_2 P(x, y),$$

where $x$ and $y$ are particular values of $X$ and $Y$, respectively, $P(x, y)$ is the joint probability of these values occurring together, and $P(x, y) \log_2 P(x, y)$ is defined to be 0 if $P(x, y) = 0$. For more than two random variables $X_1, \ldots, X_n$ this expands to

$$H(X_1, \ldots, X_n) = -\sum_{x_1} \cdots \sum_{x_n} P(x_1, \ldots, x_n) \log_2 P(x_1, \ldots, x_n),$$

where $x_1, \ldots, x_n$ are particular values of $X_1, \ldots, X_n$, respectively, $P(x_1, \ldots, x_n)$ is the probability of these values occurring together, and $P(x_1, \ldots, x_n) \log_2 P(x_1, \ldots, x_n)$ is defined to be 0 if $P(x_1, \ldots, x_n) = 0$.
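A minimal sketch, assuming the joint distribution is given as a 2-D table of probabilities `pxy` (an illustrative name, not from the text above), of how the double sum is evaluated with the $0 \log 0 = 0$ convention:

```python
import numpy as np

def joint_entropy(pxy, base=2.0):
    """Joint Shannon entropy H(X, Y) = -sum_{x,y} P(x,y) log P(x,y), with 0 log 0 := 0."""
    pxy = np.asarray(pxy, dtype=float)
    nz = pxy > 0
    return -np.sum(pxy[nz] * np.log(pxy[nz])) / np.log(base)

# Illustrative joint distributions of two binary variables X (rows) and Y (columns).
independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]])   # independent fair bits
correlated  = np.array([[0.5, 0.0],
                        [0.0, 0.5]])     # X and Y perfectly correlated

print(joint_entropy(independent))  # 2.0 bits = H(X) + H(Y) for independent variables
print(joint_entropy(correlated))   # 1.0 bit = H(X), since Y adds no extra uncertainty
```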
Principle of maximum entropy
The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge about a system is the one with largest entropy, in the context of precisely stated prior data (such as a proposition that expresses testable information). Another way of stating this: Take precisely stated prior data or testable information about a probability distribution function. Consider the set of all trial probability distributions that would encode the prior data.
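To illustrate the principle numerically, the sketch below compares the entropies of several trial distributions that all encode the same testable information. The die example, the trial distributions, and the mean-3.5 constraint are illustrative assumptions, not from the text above; under that constraint the uniform distribution is the maximum-entropy choice and comes out on top.

```python
import numpy as np

values = np.arange(1, 7)  # faces of a six-sided die

def entropy_bits(p):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# Trial distributions that all encode the same testable information: mean = 3.5.
trials = {
    "uniform":           np.full(6, 1 / 6),
    "only 1 and 6":      np.array([0.5, 0, 0, 0, 0, 0.5]),
    "only 3 and 4":      np.array([0, 0, 0.5, 0.5, 0, 0]),
    "symmetric, peaked": np.array([0.25, 0.10, 0.15, 0.15, 0.10, 0.25]),
}

for name, p in trials.items():
    assert abs(values @ p - 3.5) < 1e-9      # every trial satisfies the prior data
    print(f"{name:18s}  H = {entropy_bits(p):.3f} bits")
# The uniform distribution attains the largest entropy (log2 6 ≈ 2.585 bits),
# so the maximum-entropy principle selects it given only the mean constraint.
```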
