The Shannon index has been a popular diversity index in the ecological literature, where it is also known as '''Shannon's diversity index''', '''Shannon–Wiener index''', and (erroneously) '''Shannon–Weaver index'''. The measure was originally proposed by Claude Shannon in 1948 to quantify the entropy (hence ''Shannon entropy'', related to Shannon information content) in strings of text. The idea is that the more letters there are, and the closer their proportional abundances in the string of interest, the more difficult it is to correctly predict which letter will be the next one in the string. The Shannon entropy quantifies the uncertainty (entropy or degree of surprise) associated with this prediction. It is most often calculated as follows: H' = -\sum_{i=1}^R p_i \ln(p_i) where p_i is the proportion of characters belonging to the i-th type of letter in the string of interest. In ecology, p_i is often the proportion of individuals belonging to the i-th species in the dataset of interest. Then the Shannon entropy quantifies the uncertainty in predicting the species identity of an individual that is taken at random from the dataset. Although the equation is here written with natural
logarithms, the base of the logarithm used when calculating the Shannon entropy can be chosen freely. Shannon himself discussed logarithm bases 2, 10 and e, and these have since become the most popular bases in applications that use the Shannon entropy. Each log base corresponds to a different measurement unit, which has been called binary digits (bits), decimal digits (decits), and natural digits (nats) for the bases 2, 10 and e, respectively. Comparing Shannon entropy values that were originally calculated with different log bases requires converting them to the same log base: change from base a to base b is obtained with multiplication by \log_b(a). The Shannon index (H') is related to the
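The base conversion just described can be sketched as follows (the function name is illustrative): an entropy value in bits (base 2) is converted to nats (base e) by multiplying by \log_e(2) = \ln(2).

```python
import math

def convert_entropy_base(h, old_base, new_base):
    # Change from base a to base b: multiply by log_b(a)
    return h * math.log(old_base, new_base)

h_bits = 2.0                                       # entropy of 4 equally common types, in bits
h_nats = convert_entropy_base(h_bits, 2, math.e)   # same entropy in nats
print(h_nats)  # ≈ 1.386, i.e. ln(4)
```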
weighted geometric mean of the proportional abundances of the types. Specifically, it equals the logarithm of true diversity as calculated with q = 1: H' = -\sum_{i=1}^R p_i \ln(p_i) = -\sum_{i=1}^R \ln\left(p_i^{p_i}\right) This can also be written \begin{align} H' &= -\left[\ln\left(p_1^{p_1}\right) +\ln\left(p_2^{p_2}\right) +\ln\left(p_3^{p_3}\right) + \cdots + \ln\left(p_R^{p_R}\right)\right] \\[1ex] &= -\ln\left(p_1^{p_1}p_2^{p_2}p_3^{p_3} \cdots p_R^{p_R}\right) = \ln \left ( {1 \over p_1^{p_1}p_2^{p_2}p_3^{p_3} \cdots p_R^{p_R}} \right ) \\ &= \ln \left ( {1 \over {\prod_{i=1}^R p_i^{p_i}}} \right ) \end{align} Since the sum of the p_i values equals 1 by definition, the denominator equals the weighted geometric mean of the p_i values, with the p_i values themselves being used as the weights (exponents in the equation). The term within the parentheses hence equals true diversity {}^1\!D, and H' equals \ln({}^1\!D). When all types in the dataset of interest are equally common, all p_i values equal 1/R, and the Shannon index hence takes the value \ln(R). The more unequal the abundances of the types, the larger the weighted geometric mean of the p_i values, and the smaller the corresponding Shannon entropy. If practically all abundance is concentrated in one type, and the other types are very rare (even if there are many of them), the Shannon entropy approaches zero. When there is only one type in the dataset, the Shannon entropy exactly equals zero (there is no uncertainty in predicting the type of the next randomly chosen entity). In machine learning the Shannon index is also known as information gain.
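The relationship above can be checked numerically: the exponential of the Shannon entropy equals the reciprocal of the weighted geometric mean of the p_i values, which is true diversity (helper names here are illustrative):

```python
import math

def shannon_entropy(p):
    # H' = -sum(p_i * ln(p_i)), skipping zero proportions
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def weighted_geometric_mean(p):
    # Weighted geometric mean of the p_i values, with the p_i themselves as weights
    return math.prod(pi ** pi for pi in p if pi > 0)

p = [0.5, 0.3, 0.2]
true_diversity = 1 / weighted_geometric_mean(p)   # effective number of types
print(math.isclose(shannon_entropy(p), math.log(true_diversity)))  # True

# Equal abundances: H' reaches its maximum ln(R)
print(math.isclose(shannon_entropy([0.25] * 4), math.log(4)))  # True
```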
==Rényi entropy==
The Rényi entropy is a generalization of the Shannon entropy to other values of q than 1. It can be expressed: {}^qH = \frac{1}{1-q} \; \ln\left ( \sum_{i=1}^R p_i^q \right ) which equals {}^qH = \ln\left ( {1 \over \sqrt[q-1]{{\sum_{i=1}^R p_i p_i^{q-1}}}} \right ) = \ln({}^q\!D) This means that taking the logarithm of true diversity based on any value of q gives the Rényi entropy corresponding to the same value of q.

==Simpson index==