**Are you an EPFL student looking for a semester project?**

Work with us on data science and visualisation projects, and deploy your project as an app on top of GraphSearch.

Concept# Multinomial distribution

Summary

In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for each side of a k-sided dice rolled n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.
When k is 2 and n is 1, the multinomial distribution is the Bernoulli distribution. When k is 2 and n is bigger than 1, it is the binomial distribution. When k is bigger than 2 and n is 1, it is the categorical distribution. The term "multinoulli" is sometimes used for the categorical distribution to emphasize this four-way relationship (so n determines the suffix, and k the prefix).
The Bernoulli distribution models the outcome of a single Bernoulli trial. In other words, it models whether flipping a (possibly biased) coin one time will result in either a success (obtaining a head) or failure (obtaining a tail). The binomial distribution generalizes this to the number of heads from performing n independent flips (Bernoulli trials) of the same coin. The multinomial distribution models the outcome of n experiments, where the outcome of each trial has a categorical distribution, such as rolling a k-sided die n times.
Let k be a fixed finite number. Mathematically, we have k possible mutually exclusive outcomes, with corresponding probabilities p1, ..., pk, and n independent trials. Since the k outcomes are mutually exclusive and one must occur we have pi ≥ 0 for i = 1, ..., k and . Then if the random variables Xi indicate the number of times outcome number i is observed over the n trials, the vector X = (X1, ..., Xk) follows a multinomial distribution with parameters n and p, where p = (p1, ..., pk). While the trials are independent, their outcomes Xi are dependent because they must be summed to n.

Official source

This page is automatically generated and may contain information that is not correct, complete, up-to-date, or relevant to your search query. The same applies to every other page on this website. Please make sure to verify the information with EPFL's official sources.

Related publications (2)

Related courses (10)

MATH-408: Regression methods

General graduate course on regression methods

DH-406: Machine learning for DH

This course aims to introduce the basic principles of machine learning in the context of the digital humanities. We will cover both supervised and unsupervised learning techniques, and study and imple

CS-101: Advanced information, computation, communication I

Discrete mathematics is a discipline with applications to almost all areas of study. It provides a set of indispensable tools to computer science in particular. This course reviews (familiar) topics a

Related concepts (24)

Conjugate prior

In Bayesian probability theory, if the posterior distribution is in the same probability distribution family as the prior probability distribution , the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function . A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise, numerical integration may be necessary. Further, conjugate priors may give intuition by more transparently showing how a likelihood function updates a prior distribution.

Multinomial distribution

In probability theory, the multinomial distribution is a generalization of the binomial distribution. For example, it models the probability of counts for each side of a k-sided dice rolled n times. For n independent trials each of which leads to a success for exactly one of k categories, with each category having a given fixed success probability, the multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.

Poisson distribution

In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event. It is named after French mathematician Siméon Denis Poisson ('pwɑːsɒn; pwasɔ̃). The Poisson distribution can also be used for the number of events in other specified interval types such as distance, area, or volume.

Related lectures (88)

Types of Variables and Multinomial DistributionMATH-233: Probability and statistics

Introduces types of variables, multinomial distribution, data characteristics, shapes of densities, correlation, and data visualization methods.

Fundamental Limits of Gradient-Based LearningMATH-232: Probability and statistics

Delves into the fundamental limits of gradient-based learning on neural networks, covering topics such as binomial theorem, exponential series, and moment-generating functions.

Data Analysis: Correlation MeasuresMATH-233: Probability and statistics

Covers the basics of data analysis, focusing on statistical concepts and correlation measures.

Consider a two-class classification problem where the number of features is much larger than the sample size. The features are masked by Gaussian noise with mean zero and covariance matrix Sigma, where the precision matrix Omega = Sigma(-1) is unknown but is presumably sparse. The useful features, also unknown, are sparse and each contributes weakly (i.e., rare and weak) to the classification decision. By obtaining a reasonably good estimate of Omega, we formulate the setting as a linear regression model. We propose a two-stage classification method where we first select features by the method of Innovated Thresholding (IT), and then use the retained features and Fisher's LDA for classification. In this approach, a crucial problem is how to set the threshold of IT. We approach this problem by adapting the recent innovation of Higher Criticism Thresholding (HCT). We find that when useful features are rare and weak, the limiting behavior of HCT is essentially just as good as the limiting behavior of ideal threshold, the threshold one would choose if the underlying distribution of the signals is known (if only). Somewhat surprisingly, when Omega is sufficiently sparse, its off-diagonal coordinates usually do not have a major influence over the classification decision. Compared to recent work in the case where Omega is the identity matrix [Proc. Natl. Acad. Sci. USA 105 (2008) 14790-14795; Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 367 (2009) 4449-4470], the current setting is much more general, which needs a new approach and much more sophisticated analysis. One key component of the analysis is the intimate relationship between HCT and Fisher's separation. Another key component is the tight large-deviation bounds for empirical processes for data with unconventional correlation structures, where graph theory on vertex coloring plays an important role.

Olfactometer experiments are used to determine the effect of odours on the behaviour of organisms such as insects or nematodes, and typically result in data comprising many groups of small counts, overdispersed relative to the multinomial distribution. Overdispersion reflects a lack of independence or heterogeneity among individuals and can lead to statistics having larger variances than expected and possible losses of efficiency. In this thesis, some distributions which consist of generalisations of the multinomial distribution have been developed. These models are based on non-homogeneous Markov chain theory, take the overdispersion into account, and potentially provide a physical interpretation of the overdispersion seen in olfactometer data. Some inference aspects are considered, including comparison of the asymptotic relative efficiencies of three different sampling schemes. The fact that the empirical distributions well approximate the corresponding asymptotic distributions is checked. Observable differences in parameter estimates between data generated under different hypotheses are also studied. Finally, different models intended to shed light on various aspects of the data and/or the experiment procedure, are applied to three real olfactometer datasets.