Entropy (information theory)

The core idea of information theory is that the "informational value" of a communicated message depends on the degree to which the content of the message is surprising. If a highly likely event occurs, the message carries very little information. On the other hand, if a highly unlikely event occurs, the message is much more informative. For instance, the knowledge that some particular number will not be the winning number of a lottery provides very little information, because any particular chosen number will almost certainly not win. However, knowledge that a particular number will win a lottery has high informational value because it communicates the occurrence of a very low probability event.

The information content, also called the surprisal or self-information, of an event E is a function that increases as the probability p(E) of the event decreases. When p(E) is close to 1, the surprisal of the event is low, but if p(E) is close to 0, the surprisal of the event is high. This relationship is described by the function \log\left(\frac{1}{p(E)}\right) , where \log is the logarithm, which gives 0 surprise when the probability of the event is 1. In fact, \log is the only function that satisfies a specific set of conditions. Hence, we can define the information, or surprisal, of an event E by I(E) = \log\left(\frac{1}{p(E)}\right) , or equivalently, I(E) = -\log(p(E)) .

Entropy measures the expected (i.e., average) amount of information conveyed by identifying the outcome of a random trial. This implies that rolling a fair die has higher entropy than tossing a fair coin, because each outcome of a single die roll has a smaller probability (p = 1/6) than each outcome of a coin toss (p = 1/2).

Consider a coin with probability p of landing on heads and probability 1 - p of landing on tails. The maximum surprise is when p = 1/2, for which one outcome is not expected over the other.
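The definitions of surprisal and entropy given above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `surprisal` and `entropy` are chosen here for the example, logarithms are taken to base 2 so that results are in bits, and outcomes with probability 0 are assumed to contribute nothing to the sum.

```python
import math


def surprisal(p: float) -> float:
    """Information content I(E) = log2(1 / p(E)), in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return math.log2(1.0 / p)


def entropy(probs) -> float:
    """Expected surprisal of a discrete distribution, in bits."""
    probs = list(probs)
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p = 0 are skipped (assumed to contribute 0 bits).
    return sum(p * surprisal(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    fair_coin = [1 / 2, 1 / 2]
    fair_die = [1 / 6] * 6
    biased_coin = [0.9, 0.1]  # hypothetical bias, for illustration only
    print("fair coin  :", entropy(fair_coin), "bits")
    print("fair die   :", entropy(fair_die), "bits")
    print("biased coin:", entropy(biased_coin), "bits")
```

Running the script shows the ordering described in the text: the fair die has higher entropy than the fair coin, and a biased coin has lower entropy than a fair one.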
For p = 1/2, a coin flip has an entropy of one bit (similarly, one trit with equiprobable values contains \log_2 3 (about 1.58496) bits of information because it can have one of three values). The minimum surprise is when p = 0 (impossibility) or p = 1 (certainty), and the entropy is zero bits. When the entropy is zero, there is no uncertainty at all – no freedom of choice – no information. Other values of p give entropies between zero and one bit.

==Example==

Information theory is useful for calculating the smallest amount of information required to convey a message, as in data compression. For example, consider the transmission of sequences comprising the 4 characters 'A', 'B', 'C', and 'D' over a binary channel. If all 4 letters are equally likely (25%), one cannot do better than using two bits to encode each letter. 'A' might code as '00', 'B' as '01', 'C' as '10', and 'D' as '11'. However, if the probabilities of the letters are unequal, say 'A' occurs with 70% probability, 'B' with 26%, and 'C' and 'D' with 2% each, one could assign variable-length codes. In this case, 'A' would be coded as '0', 'B' as '10', 'C' as '110', and 'D' as '111'. With this representation, 70% of the time only one bit needs to be sent, 26% of the time two bits, and only 4% of the time 3 bits. On average, 0.70 × 1 + 0.26 × 2 + 0.04 × 3 = 1.34 bits are required per character, fewer than 2, since the entropy is lower (owing to the high prevalence of 'A' followed by 'B', together 96% of characters). The sum of probability-weighted log probabilities measures and captures this effect.

English text, treated as a string of characters, has fairly low entropy; i.e. it is fairly predictable. We can be fairly certain that, for example, 'e' will be far more common than 'z', that the combination 'qu' will be much more common than any other combination with a 'q' in it, and that the combination 'th' will be more common than 'z', 'q', or 'qu'. After the first few letters one can often guess the rest of the word.
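The four-letter coding example above can be checked numerically. The sketch below uses the probabilities and codewords stated in the text; the helper names (`entropy_bits`, `average_code_length`, `is_prefix_free`) are introduced here for illustration only. It compares the average codeword length with the entropy of the source, which is the lower bound that any uniquely decodable binary code can approach.

```python
import math

# Letter probabilities and variable-length code from the example in the text.
PROBS = {"A": 0.70, "B": 0.26, "C": 0.02, "D": 0.02}
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}


def entropy_bits(probs: dict) -> float:
    """Entropy of the source in bits per character."""
    return sum(p * math.log2(1.0 / p) for p in probs.values() if p > 0.0)


def average_code_length(probs: dict, code: dict) -> float:
    """Expected number of bits sent per character under the given code."""
    return sum(probs[symbol] * len(code[symbol]) for symbol in probs)


def is_prefix_free(code: dict) -> bool:
    """True if all codewords are distinct and none is a prefix of another."""
    words = list(code.values())
    if len(set(words)) != len(words):
        return False
    return not any(a != b and b.startswith(a) for a in words for b in words)


if __name__ == "__main__":
    uniform = {symbol: 0.25 for symbol in PROBS}
    print("entropy, uniform letters :", entropy_bits(uniform))
    print("entropy, skewed letters  :", entropy_bits(PROBS))
    print("average code length      :", average_code_length(PROBS, CODE))
    print("code is prefix-free      :", is_prefix_free(CODE))
```

With the skewed probabilities, the entropy is about 1.09 bits per character and the variable-length code averages 1.34 bits, so the code sits between the entropy bound and the 2 bits needed by the fixed-length code.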
English text has between 0.6 and 1.3 bits of entropy per character of the message.

==Definition==