Relationship to thermodynamic entropy The inspiration for adopting the word
entropy in information theory came from the close resemblance between Shannon's formula and very similar known formulae from
statistical mechanics. In
statistical thermodynamics the most general formula for the thermodynamic
entropy of a
thermodynamic system is the
Gibbs entropy S = - k_\text{B} \sum_i p_i \ln p_i \,, where is the
Boltzmann constant, and is the probability of a
microstate. The
Gibbs entropy was defined by
J. Willard Gibbs in 1878 after earlier work by
Ludwig Boltzmann (1872). The Gibbs entropy translates over almost unchanged into the world of
quantum physics to give the
von Neumann entropy introduced by
John von Neumann in 1927: S = - k_\text{B} \,{\rm Tr}(\rho \ln \rho) \,, where ρ is the
density matrix of the quantum mechanical system and Tr is the
trace. At an everyday practical level, the links between information entropy and thermodynamic entropy are not evident. Physicists and chemists are apt to be more interested in
changes in entropy as a system spontaneously evolves away from its initial conditions, in accordance with the
second law of thermodynamics, rather than an unchanging probability distribution. As the minuteness of the Boltzmann constant indicates, the changes in for even tiny amounts of substances in chemical and physical processes represent amounts of entropy that are extremely large compared to anything in
data compression or
signal processing. In classical thermodynamics, entropy is defined in terms of macroscopic measurements and makes no reference to any probability distribution, which is central to the definition of information entropy. The connection between thermodynamics and what is now known as information theory was first made by Boltzmann and expressed by his
equation: S=k_\text{B} \ln W, where S is the thermodynamic entropy of a particular macrostate (defined by thermodynamic parameters such as temperature, volume, energy, etc.), is the number of microstates (various combinations of particles in various energy states) that can yield the given macrostate, and is the Boltzmann constant. It is assumed that each microstate is equally likely, so that the probability of a given microstate is . When these probabilities are substituted into the above expression for the Gibbs entropy (or equivalently times the Shannon entropy), Boltzmann's equation results. In information theoretic terms, the information entropy of a system is the amount of "missing" information needed to determine a microstate, given the macrostate. In the view of
Jaynes (1957), thermodynamic entropy, as explained by
statistical mechanics, should be seen as an
application of Shannon's information theory: the thermodynamic entropy is interpreted as being proportional to the amount of further Shannon information needed to define the detailed microscopic state of the system, that remains uncommunicated by a description solely in terms of the macroscopic variables of classical thermodynamics, with the constant of proportionality being just the Boltzmann constant. Adding heat to a system increases its thermodynamic entropy because it increases the number of possible microscopic states of the system that are consistent with the measurable values of its macroscopic variables, making any complete state description longer. (See article:
maximum entropy thermodynamics).
Maxwell's demon can (hypothetically) reduce the thermodynamic entropy of a system by using information about the states of individual molecules; but, as
Landauer (from 1961) and co-workers have shown, to function the demon himself must increase thermodynamic entropy in the process, by at least the amount of Shannon information he proposes to first acquire and store; and so the total thermodynamic entropy does not decrease (which resolves the paradox).
Landauer's principle imposes a lower bound on the amount of heat a computer must generate to process a given amount of information, though modern computers are far less efficient.
Data compression Shannon's definition of entropy, when applied to an information source, can determine the minimum channel capacity required to reliably transmit the source as encoded binary digits. Shannon's entropy measures the information contained in a message as opposed to the portion of the message that is determined (or predictable). Examples of the latter include redundancy in language structure or statistical properties relating to the occurrence frequencies of letter or word pairs, triplets etc. The minimum channel capacity can be realized in theory by using the
typical set or in practice using
Huffman,
Lempel–Ziv or
arithmetic coding. (See also
Kolmogorov complexity.) In practice, compression algorithms deliberately include some judicious redundancy in the form of
checksums to protect against errors. The
entropy rate of a data source is the average number of bits per symbol needed to encode it. Shannon's experiments with human predictors show an information rate between 0.6 and 1.3 bits per character in English; the
PPM compression algorithm can achieve a compression ratio of 1.5 bits per character in English text. If a
compression scheme is lossless – one in which you can always recover the entire original message by decompression – then a compressed message has the same quantity of information as the original but is communicated in fewer characters. It has more information (higher entropy) per character. A compressed message has less
redundancy.
Shannon's source coding theorem states a lossless compression scheme cannot compress messages, on average, to have
more than one bit of information per bit of message, but that any value
less than one bit of information per bit of message can be attained by employing a suitable coding scheme. The entropy of a message per bit multiplied by the length of that message is a measure of how much total information the message contains. Shannon's theorem also implies that no lossless compression scheme can shorten
all messages. If some messages come out shorter, at least one must come out longer due to the
pigeonhole principle. In practical use, this is generally not a problem, because one is usually only interested in compressing certain types of messages, such as a document in English, as opposed to gibberish text, or digital photographs rather than noise, and it is unimportant if a compression algorithm makes some unlikely or uninteresting sequences larger. A 2011 study in
Science estimates the world's technological capacity to store and communicate optimally compressed information normalized on the most effective compression algorithms available in the year 2007, therefore estimating the entropy of the technologically available sources. The authors estimate humankind technological capacity to store information (fully entropically compressed) in 1986 and again in 2007. They break the information into three categories—to store information on a medium, to receive information through one-way
broadcast networks, or to exchange information through two-way
telecommunications networks. A diversity index is a quantitative statistical measure of how many different types exist in a dataset, such as species in a community, accounting for ecological
richness,
evenness, and
dominance. Specifically, Shannon entropy is the logarithm of , the
true diversity index with parameter equal to 1. The Shannon index is related to the proportional abundances of types.
Entropy of a sequence There are a number of entropy-related concepts that mathematically quantify information content of a sequence or message: • the
self-information of an individual message or symbol taken from a given probability distribution (message or sequence seen as an individual event), • the
joint entropy of the symbols forming the message or sequence (seen as a set of events), • the
entropy rate of a
stochastic process (message or sequence is seen as a succession of events). (The "rate of self-information" can also be defined for a particular sequence of messages or symbols generated by a given stochastic process: this will always be equal to the entropy rate in the case of a
stationary process.) Other
quantities of information are also used to compare or relate different sources of information. It is important not to confuse the above concepts. Often it is only clear from context which one is meant. For example, when someone says that the "entropy" of the English language is about 1 bit per character, they are actually modeling the English language as a stochastic process and talking about its entropy
rate. Shannon himself used the term in this way. If very large blocks are used, the estimate of per-character entropy rate may become artificially low because the probability distribution of the sequence is not known exactly; it is only an estimate. If one considers the text of every book ever published as a sequence, with each symbol being the text of a complete book, and if there are published books, and each book is only published once, the estimate of the probability of each book is , and the entropy (in bits) is . As a practical code, this corresponds to assigning each book a
unique identifier and using it in place of the text of the book whenever one wants to refer to the book. This is enormously useful for talking about books, but it is not so useful for characterizing the information content of an individual book, or of language in general: it is not possible to reconstruct the book from its identifier without knowing the probability distribution, that is, the complete text of all the books. The key idea is that the complexity of the probabilistic model must be considered.
Kolmogorov complexity is a theoretical generalization of this idea that allows the consideration of the information content of a sequence independent of any particular probability model; it considers the shortest
program for a
universal computer that outputs the sequence. A code that achieves the entropy rate of a sequence for a given model, plus the codebook (i.e. the probabilistic model), is one such program, but it may not be the shortest. The Fibonacci sequence is 1, 1, 2, 3, 5, 8, 13, .... treating the sequence as a message and each number as a symbol, there are almost as many symbols as there are characters in the message, giving an entropy of approximately . The first 128 symbols of the Fibonacci sequence has an entropy of approximately 7 bits/symbol, but the sequence can be expressed using a formula [ for , , ] and this formula has a much lower entropy and applies to any length of the Fibonacci sequence.
Limitations of entropy in cryptography In
cryptanalysis, entropy is often roughly used as a measure of the unpredictability of a cryptographic key, though its real
uncertainty is unmeasurable. For example, a 128-bit key that is uniformly and randomly generated has 128 bits of entropy. It also takes (on average) 2^{127} guesses to break by brute force. Entropy fails to capture the number of guesses required if the possible keys are not chosen uniformly. Instead, a measure called
guesswork can be used to measure the effort required for a brute force attack. Other problems may arise from non-uniform distributions used in cryptography. For example, a 1,000,000-digit binary
one-time pad using exclusive or. If the pad has 1,000,000 bits of entropy, it is perfect. If the pad has 999,999 bits of entropy, evenly distributed (each individual bit of the pad having 0.999999 bits of entropy) it may provide good security. But if the pad has 999,999 bits of entropy, where the first bit is fixed and the remaining 999,999 bits are perfectly random, the first bit of the ciphertext will not be encrypted at all.
Data as a Markov process A common way to define entropy for text is based on the
Markov model of text. For an order-0 source (each character is selected independent of the last characters), the binary entropy is: \Eta(\mathcal{S}) = - \sum_i p_i \log p_i , where is the probability of . For a first-order
Markov source (one in which the probability of selecting a character is dependent only on the immediately preceding character), the
entropy rate is: \Eta(\mathcal{S}) = - \sum_i p_i \sum_j \ p_i (j) \log p_i (j) , where is a
state (certain preceding characters) and p_i(j) is the probability of given as the previous character. For a second order Markov source, the entropy rate is \Eta(\mathcal{S}) = -\sum_i p_i \sum_j p_i(j) \sum_k p_{i,j}(k)\ \log p_{i,j}(k) . ==Efficiency (normalized entropy)==