===Fair coin toss===
Consider the Bernoulli trial of tossing a fair coin X. The probabilities of the events of the coin landing as heads \text{H} and tails \text{T} (see fair coin and obverse and reverse) are one half each, p_X{(\text{H})} = p_X{(\text{T})} = \tfrac{1}{2} = 0.5. Upon measuring the variable as heads, the associated information gain is \operatorname{I}_X(\text{H}) = -\log_2 {p_X{(\text{H})}} = -\log_2\!{\tfrac{1}{2}} = 1, so the information gain of a fair coin landing as heads is 1 shannon. Likewise, the information gain of measuring tails \text{T} is \operatorname{I}_X(\text{T}) = -\log_2 {p_X{(\text{T})}} = -\log_2 {\tfrac{1}{2}} = 1 \text{ Sh}.
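These values are straightforward to check numerically. The following is a minimal sketch in Python, assuming only the standard library; the helper name information_content is illustrative, not from any established API.

```python
from math import log2

def information_content(p: float) -> float:
    """Information content (surprisal) of an event with probability p, in shannons."""
    return -log2(p)

# Fair coin: heads and tails each have probability 1/2.
print(information_content(1/2))  # 1.0 Sh, for heads and likewise for tails
```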
===Fair die roll===
Suppose we have a fair six-sided die. The value of a die roll is a discrete uniform random variable X \sim \mathrm{DU}[1, 6] with probability mass function p_X(k) = \begin{cases} \frac{1}{6}, & k \in \{1, 2, 3, 4, 5, 6\} \\ 0, & \text{otherwise}. \end{cases} The probability of rolling a 4 is p_X(4) = \frac{1}{6}, as for any other valid roll. The information content of rolling a 4 is thus \operatorname{I}_{X}(4) = -\log_2{p_X{(4)}} = -\log_2{\tfrac{1}{6}} \approx 2.585 \text{ Sh} of information.
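A similar self-contained check for the die, again a sketch using only the standard library:

```python
from math import log2

# Fair six-sided die: each face has probability 1/6.
p = 1/6
print(-log2(p))  # ≈ 2.584962500721156 Sh
```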
===Two independent, identically distributed dice===
Suppose we have two independent, identically distributed random variables X,\, Y \sim \mathrm{DU}[1, 6], each corresponding to an independent fair 6-sided die roll. The joint distribution of X and Y is \begin{align} p_{X, Y}\!\left(x, y\right) & {} = \Pr(X = x,\, Y = y) = p_X\!(x)\,p_Y\!(y) \\ & {} = \begin{cases} \displaystyle{1 \over 36}, \ & x, y \in [1, 6] \cap \mathbb{N} \\ 0, & \text{otherwise.} \end{cases} \end{align} The information content of the random variate (X, Y) = (2,\, 4) is \begin{align} \operatorname{I}_{X, Y}{(2, 4)} &= -\log_2\!{\left[p_{X,Y}{(2, 4)}\right]} = \log_2\!{36} = 2 \log_2\!{6} \\ & \approx 5.169925 \text{ Sh}, \end{align} and can also be calculated by additivity of events: \begin{align} \operatorname{I}_{X, Y}{(2, 4)} &= -\log_2\!{\left[p_{X,Y}{(2, 4)}\right]} = -\log_2\!{\left[p_X(2)\right]} - \log_2\!{\left[p_Y(4)\right]} \\ & = 2\log_2\!{6} \\ & \approx 5.169925 \text{ Sh}. \end{align}
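A small numerical sketch of this additivity property, assuming independence as above: the surprisal of the joint outcome equals the sum of the individual surprisals.

```python
from math import isclose, log2

p_x, p_y = 1/6, 1/6              # marginal probabilities of X = 2 and Y = 4
joint = -log2(p_x * p_y)         # surprisal of the joint outcome (X, Y) = (2, 4)
summed = -log2(p_x) - log2(p_y)  # sum of the two individual surprisals
print(joint, summed)             # both ≈ 5.169925 Sh
assert isclose(joint, summed)    # additivity for independent events
```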
===Information from frequency of rolls===
If we receive information about the value of the dice without knowledge of which die had which value, we can formalize the approach with so-called counting variables C_k := \delta_k(X) + \delta_k(Y) = \begin{cases} 0, & \neg\, (X = k \vee Y = k) \\ 1, & \quad X = k\, \veebar \, Y = k \\ 2, & \quad X = k\, \wedge \, Y = k \end{cases} for k \in \{1, 2, 3, 4, 5, 6\}. Then \sum_{k=1}^{6}{C_k} = 2 and the counts have the multinomial distribution \begin{align} f(c_1,\ldots,c_6) & {} = \Pr(C_1 = c_1 \text{ and } \dots \text{ and } C_6 = c_6) \\ & {} = \begin{cases} { \displaystyle {1\over{18}}{1 \over c_1!\cdots c_6!}}, \ & \text{when } \sum_{i=1}^6 c_i = 2 \\ 0, & \text{otherwise,} \end{cases} \\ & {} = \begin{cases} {1 \over 18}, \ & \text{when two of the } c_k \text{ are } 1 \\ {1 \over 36}, \ & \text{when exactly one } c_k = 2 \\ 0, & \text{otherwise.} \end{cases} \end{align}

To verify this, the 6 outcomes (X, Y) \in \left\{(k, k)\right\}_{k = 1}^{6} = \left\{ (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6) \right\} correspond to the event C_k = 2 and a total probability of 6 \cdot \tfrac{1}{36} = \tfrac{1}{6}. These are the only outcomes for which the counts still determine which die rolled which value, because both values are the same. Without knowledge to distinguish the dice, the other \binom{6}{2} = 15 combinations correspond to one die rolling one number and the other die rolling a different number, each combination having probability \tfrac{2}{36} = \tfrac{1}{18}. Indeed, 6 \cdot \tfrac{1}{36} + 15 \cdot \tfrac{1}{18} = 1, as required.

Unsurprisingly, the information content of learning that both dice were rolled as the same particular number is more than the information content of learning that one die was one number and the other was a different number. Take for examples the events A_k = \{(X, Y) = (k, k)\} and B_{j, k} = \{C_j = 1\} \cap \{C_k = 1\} for j \ne k, 1 \leq j, k \leq 6. For example, A_2 = \{X = 2 \text{ and } Y = 2\} and B_{3, 4} = \{(3, 4), (4, 3)\}. The information contents are \operatorname{I}(A_2) = -\log_2\!{\tfrac{1}{36}} = 5.169925 \text{ Sh} and \operatorname{I}\left(B_{3, 4}\right) = - \log_2 \! \tfrac{1}{18} = 4.169925 \text{ Sh}.

Let \text{Same} = \bigcup_{i = 1}^{6}{A_i} be the event that both dice rolled the same value and \text{Diff} = \overline{\text{Same}} be the event that the dice differed. Then \Pr(\text{Same}) = \tfrac{1}{6} and \Pr(\text{Diff}) = \tfrac{5}{6}. The information contents of the events are \operatorname{I}(\text{Same}) = -\log_2\!{\tfrac{1}{6}} = 2.5849625 \text{ Sh} and \operatorname{I}(\text{Diff}) = -\log_2\!{\tfrac{5}{6}} = 0.2630344 \text{ Sh}.
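These probabilities and information contents can be verified by brute-force enumeration of all 36 equally likely ordered outcomes; a minimal sketch:

```python
from fractions import Fraction
from itertools import product
from math import log2

outcomes = list(product(range(1, 7), repeat=2))  # all 36 ordered (X, Y) pairs
n = len(outcomes)

# Event A_2: both dice show 2; event B_{3,4}: one die shows 3 and the other 4.
p_a2 = Fraction(sum(1 for o in outcomes if o == (2, 2)), n)           # 1/36
p_b34 = Fraction(sum(1 for o in outcomes if sorted(o) == [3, 4]), n)  # 2/36 = 1/18
p_same = Fraction(sum(1 for x, y in outcomes if x == y), n)           # 6/36 = 1/6

for p in (p_a2, p_b34, p_same, 1 - p_same):
    print(p, -log2(p))  # each probability with its information content in Sh
```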
===Information from sum of dice===
The probability mass or density function (collectively probability measure) of the sum of two independent random variables is the convolution of each probability measure. In the case of independent fair 6-sided die rolls, the random variable Z = X + Y has probability mass function p_Z(z) = \left(p_X * p_Y\right)(z) = {6 - |z - 7| \over 36}, \quad z \in \{2, 3, \dots, 12\}, where * represents the discrete convolution. The outcome Z = 5 has probability p_Z(5) = \frac{4}{36} = {1 \over 9}. Therefore, the information content of observing Z = 5 is \operatorname{I}_Z(5) = -\log_2{\tfrac{1}{9}} = \log_2{9} \approx 3.169925 \text{ Sh}.
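A quick sketch that builds p_Z by discrete convolution and checks it against the closed form above:

```python
from fractions import Fraction
from math import log2

p = {k: Fraction(1, 6) for k in range(1, 7)}  # pmf of one fair die

# Discrete convolution: p_Z(z) = sum over x of p_X(x) * p_Y(z - x).
p_z = {z: sum(p[x] * p.get(z - x, Fraction(0)) for x in p) for z in range(2, 13)}

assert all(p_z[z] == Fraction(6 - abs(z - 7), 36) for z in p_z)  # closed form
print(p_z[5], -log2(p_z[5]))  # 1/9, ≈ 3.169925 Sh
```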
===General discrete uniform distribution===
Generalizing the example above, consider a general discrete uniform random variable (DURV) X \sim \mathrm{DU}[a, b]; \quad a, b \in \mathbb{Z}, \ b \ge a. For convenience, define N := b - a + 1. The probability mass function is p_X(k) = \begin{cases} \frac{1}{N}, & k \in [a, b] \cap \mathbb{Z} \\ 0, & \text{otherwise}. \end{cases} In general, the values of the DURV need not be integers, or for the purposes of information theory even uniformly spaced; they need only be equiprobable. The information gain of any observation X = k is \operatorname{I}_X(k) = -\log_2{\frac{1}{N}} = \log_2{N} \text{ Sh}.
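For instance, in a sketch where the bounds a and b are arbitrary illustrative values:

```python
from math import log2

a, b = 1, 20    # arbitrary integer bounds with b >= a
N = b - a + 1   # number of equiprobable values
print(log2(N))  # ≈ 4.322 Sh for any observed value of X
```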
===Special case: constant random variable===
If b = a above, X degenerates to a constant random variable with probability distribution deterministically given by X = b and probability measure the Dirac measure p_X(k) = \delta_{b}(k). The only value X can take is deterministically b, so the information content of any measurement of X is \operatorname{I}_X(b) = - \log_2{1} = 0. In general, there is no information gained from measuring a known value.
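This is the N = 1 case of the previous sketch:

```python
from math import log2

N = 1           # a constant random variable has a single possible value
print(log2(N))  # 0.0 Sh: a certain outcome carries no information
```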
===Categorical distribution===
Generalizing all of the above cases, consider a categorical discrete random variable with support \mathcal{S} = \bigl\{s_i\bigr\}_{i=1}^{N} and probability mass function given by p_X(k) = \begin{cases} p_i, & k = s_i \in \mathcal{S} \\ 0, & \text{otherwise}. \end{cases} For the purposes of information theory, the values s \in \mathcal{S} do not have to be numbers; they can be any mutually exclusive events on a measure space of finite measure that has been normalized to a probability measure p.
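A minimal sketch for an arbitrary categorical pmf; the non-numeric support labels and probabilities here are illustrative assumptions:

```python
from math import log2

# Hypothetical categorical distribution over non-numeric outcomes.
pmf = {"red": 0.5, "green": 0.25, "blue": 0.25}
assert abs(sum(pmf.values()) - 1.0) < 1e-12  # must be a probability measure

for outcome, p in pmf.items():
    print(outcome, -log2(p))  # red 1.0 Sh, green 2.0 Sh, blue 2.0 Sh
```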
Without loss of generality, we can assume the categorical distribution is supported on the set [N] = \left\{1, 2, \dots, N \right\}; the mathematical structure is isomorphic in terms of probability theory and therefore information theory as well. The information content of the outcome X = x is given by \operatorname{I}_X(x) = -\log_2{p_X(x)}. From these examples, it is possible to calculate the information content of any set of independent DRVs with known distributions by additivity.

==Derivation==