Zipf's law (zɪf, ts͡ɪpf) is an empirical law that often holds, approximately, when a list of measured values is sorted in decreasing order. It states that the value of the nth entry is inversely proportional to n.
The best known instance of Zipf's law applies to the frequency table of words in a text or corpus of natural language: it is usually found that the most common word occurs approximately twice as often as the next most common one, three times as often as the third most common, and so on. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Zipf's law is often used in the following form, called the Zipf-Mandelbrot law:

$$\text{frequency} \propto \frac{1}{(\text{rank} + b)^{a}}$$

where $a$ and $b$ are fitted parameters, with $a \approx 1$ and $b \approx 2.7$.
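The Brown Corpus figures quoted above can be checked against the pure 1/rank rule with a few lines of Python. This is only an illustrative sketch: the function names `zipf_prediction` and `zipf_mandelbrot_weight` are made up for this example, and the default values of `a` and `b` are assumptions standing in for parameters that would normally be fitted to the data. The Zipf-Mandelbrot weight is taken in its usual form, 1/(rank + b)^a, left unnormalized.

```python
# Observed counts quoted in the text for the Brown Corpus, in rank order.
observed = {"the": 69_971, "of": 36_411, "and": 28_852}


def zipf_prediction(top_count: float, rank: int) -> float:
    """Count expected at `rank` if counts fall off exactly as 1/rank."""
    return top_count / rank


def zipf_mandelbrot_weight(rank: int, a: float = 1.0, b: float = 2.7) -> float:
    """Unnormalized Zipf-Mandelbrot weight 1 / (rank + b)**a.

    The defaults for a and b are illustrative assumptions, not fitted values.
    """
    return 1.0 / (rank + b) ** a


top_count = observed["the"]
for rank, (word, count) in enumerate(observed.items(), start=1):
    predicted = zipf_prediction(top_count, rank)
    ratio = count / predicted
    print(f"rank {rank} {word!r}: observed {count}, "
          f"1/rank predicts {predicted:.0f}, observed/predicted = {ratio:.2f}")
```

The printed ratios show how far each observed count sits from the idealized 1/rank value; a ratio near 1 indicates close agreement, which is all the "approximately" in the law promises.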
This "law" is named after the American linguist George Kingsley Zipf, and is still an important concept in quantitative linguistics. It has been found to apply to many other types of data studied in the physical and social sciences.
In mathematical statistics, the concept has been formalized as the Zipfian distribution: a family of related discrete probability distributions whose rank-frequency distribution is an inverse power law relation. They are related to Benford's law and the Pareto distribution.
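As a rough illustration of what such a distribution looks like, the sketch below builds a probability mass function over a finite number of ranks, with the probability of rank k proportional to 1/k^s. The function name `zipf_pmf`, the finite support `n_items`, and the exponent parameter `s` are assumptions chosen for this example; they represent one common member of the family, not the only formalization.

```python
def zipf_pmf(n_items: int, s: float = 1.0) -> list[float]:
    """Return P(rank = k) for k = 1..n_items, with P(k) proportional to 1 / k**s."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    weights = [1.0 / k**s for k in range(1, n_items + 1)]
    total = sum(weights)  # normalizing constant (a generalized harmonic number)
    return [w / total for w in weights]


pmf = zipf_pmf(5)
for rank, p in enumerate(pmf, start=1):
    print(f"rank {rank}: probability {p:.3f}")
```

With `s = 1`, the probability of rank 1 is twice that of rank 2 and three times that of rank 3, mirroring the word-frequency pattern described earlier.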
Some sets of time-dependent empirical data deviate somewhat from Zipf's law. Such empirical distributions are said to be quasi-Zipfian.
In 1913, the German physicist Felix Auerbach observed an inverse proportionality between the population sizes of cities and their ranks when sorted in decreasing order of population.
Zipf's law had been discovered before Zipf: first by the French stenographer Jean-Baptiste Estoup, in his Gammes Stenographiques (4th ed.) in 1916, and also by G. Dewey in 1923 and by E. Condon in 1928.