Several variations on mutual information have been proposed to suit various needs. Among these are normalized variants and generalizations to more than two variables.
=== Metric ===
Many applications require a metric, that is, a distance measure between pairs of points. The quantity
:\begin{align} d(X,Y) &= \Eta(X,Y) - \operatorname{I}(X;Y) \\ &= \Eta(X) + \Eta(Y) - 2\operatorname{I}(X;Y) \\ &= \Eta(X\mid Y) + \Eta(Y\mid X) \\ &= 2\Eta(X,Y) - \Eta(X) - \Eta(Y) \end{align}
satisfies the properties of a metric (triangle inequality, non-negativity, indiscernibility and symmetry), where equality X = Y is understood to mean that X can be completely determined from Y. This distance metric is also known as the variation of information.

If X, Y are discrete random variables then all the entropy terms are non-negative, so 0 \le d(X,Y) \le \Eta(X,Y) and one can define a normalized distance
:D(X,Y) = \frac{d(X, Y)}{\Eta(X, Y)} \le 1.
Plugging in the definitions shows that
:D(X,Y) = 1 - \frac{\operatorname{I}(X; Y)}{\Eta(X, Y)}.
This is known as the Rajski distance. In a set-theoretic interpretation of information (see the figure for conditional entropy), this is effectively the Jaccard distance between X and Y. Finally,
:D^\prime(X, Y) = 1 - \frac{\operatorname{I}(X; Y)}{\max\left\{\Eta(X), \Eta(Y)\right\}}
is also a metric.
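The following is a minimal numerical sketch of these quantities for discrete variables, assuming NumPy is available and that the joint distribution is supplied as a probability table; the function names are purely illustrative.
<syntaxhighlight lang="python">
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array; zero entries are ignored."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def variation_of_information(pxy):
    """Return d(X,Y) = H(X,Y) - I(X;Y) and its normalized form D = d / H(X,Y),
    computed from a joint probability table pxy[i, j] = p(X = i, Y = j)."""
    pxy = np.asarray(pxy, dtype=float)
    hx = entropy(pxy.sum(axis=1))   # H(X), from the row marginal
    hy = entropy(pxy.sum(axis=0))   # H(Y), from the column marginal
    hxy = entropy(pxy.ravel())      # H(X,Y)
    mi = hx + hy - hxy              # I(X;Y)
    d = hxy - mi                    # variation of information
    return d, d / hxy               # (d, normalized distance D)

# Example: a noisy relationship between two binary variables.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
d, D = variation_of_information(pxy)
print(d, D)   # D = 1 - I(X;Y)/H(X,Y), always between 0 and 1
</syntaxhighlight>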
=== Conditional mutual information ===
Sometimes it is useful to express the mutual information of two random variables conditioned on a third.
:\operatorname{I}(X;Y|Z) = \mathbb{E}_Z [D_{\mathrm{KL}}( P_{(X,Y)|Z} \| P_{X|Z} \otimes P_{Y|Z} )]
For jointly discrete random variables this takes the form
: \operatorname{I}(X;Y|Z) = \sum_{z\in \mathcal{Z}} \sum_{y\in \mathcal{Y}} \sum_{x\in \mathcal{X}} {p_Z(z)\, p_{X,Y|Z}(x,y|z) \log\left[\frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)\,p_{Y|Z}(y|z)}\right]},
which can be simplified as
: \operatorname{I}(X;Y|Z) = \sum_{z\in \mathcal{Z}} \sum_{y\in \mathcal{Y}} \sum_{x\in \mathcal{X}} p_{X,Y,Z}(x,y,z) \log \frac{p_{X,Y,Z}(x,y,z)\,p_{Z}(z)}{p_{X,Z}(x,z)\,p_{Y,Z}(y,z)}.
For jointly continuous random variables this takes the form
: \operatorname{I}(X;Y|Z) = \int_{\mathcal{Z}} \int_{\mathcal{Y}} \int_{\mathcal{X}} {p_Z(z)\, p_{X,Y|Z}(x,y|z) \log\left[\frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)\,p_{Y|Z}(y|z)}\right]} \,dx\,dy\,dz,
which can be simplified as
: \operatorname{I}(X;Y|Z) = \int_{\mathcal{Z}} \int_{\mathcal{Y}} \int_{\mathcal{X}} p_{X,Y,Z}(x,y,z) \log \frac{p_{X,Y,Z}(x,y,z)\,p_{Z}(z)}{p_{X,Z}(x,z)\,p_{Y,Z}(y,z)} \,dx\,dy\,dz.
Conditioning on a third random variable may either increase or decrease the mutual information, but it is always true that
:\operatorname{I}(X;Y|Z) \ge 0
for discrete, jointly distributed random variables X, Y, Z. This result has been used as a basic building block for proving other inequalities in information theory.
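As a concrete illustration of the discrete case, the sum above can be evaluated directly from a three-dimensional joint probability table. The sketch below assumes NumPy and uses illustrative function names; it is not tied to any particular library implementation.
<syntaxhighlight lang="python">
import numpy as np

def conditional_mutual_information(pxyz, base=2.0):
    """I(X;Y|Z) from a joint probability table pxyz[x, y, z], in units of log 'base'."""
    pxyz = np.asarray(pxyz, dtype=float)
    pz = pxyz.sum(axis=(0, 1))   # p(z)
    pxz = pxyz.sum(axis=1)       # p(x, z)
    pyz = pxyz.sum(axis=0)       # p(y, z)
    cmi = 0.0
    for x, y, z in np.ndindex(*pxyz.shape):
        p = pxyz[x, y, z]
        if p > 0:
            cmi += p * np.log(p * pz[z] / (pxz[x, z] * pyz[y, z]))
    return cmi / np.log(base)

# Example: X and Y are both driven by Z but are conditionally independent given Z,
# so I(X;Y) > 0 while I(X;Y|Z) = 0.
pxyz = np.zeros((2, 2, 2))
for z in (0, 1):
    px_given_z = np.array([0.8, 0.2]) if z == 0 else np.array([0.3, 0.7])
    pxyz[:, :, z] = 0.5 * np.outer(px_given_z, px_given_z)
print(conditional_mutual_information(pxyz))   # approximately 0
</syntaxhighlight>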
=== Interaction information ===
Several generalizations of mutual information to more than two random variables have been proposed, such as total correlation (or multi-information) and dual total correlation. The expression and study of multivariate higher-degree mutual information was achieved in two seemingly independent works: McGill (1954), who called these functions "interaction information", and Hu Kuo Ting (1962). Interaction information is defined for one variable as follows:
:\operatorname{I}(X_1) = \Eta(X_1)
and for n > 1,
: \operatorname{I}(X_1;\,...\,;X_n) = \operatorname{I}(X_1;\,...\,;X_{n-1}) - \operatorname{I}(X_1;\,...\,;X_{n-1}\mid X_n),
where (as above) we define
:\operatorname{I}(X_1;\,...\,;X_{n-1}\mid X_n) = \mathbb{E}_{X_n} \bigl[\operatorname{I}(X_1;\,...\,;X_{n-1})\mid X_n\bigr].
Some authors reverse the order of the terms on the right-hand side of the preceding equation, which changes the sign when the number of random variables is odd (and in this case, the single-variable expression becomes the negative of the entropy). The interaction information can be positive, negative, or zero. In this sense, the condition \operatorname{I}(X_1; \ldots; X_k) = 0 can be used as a refined statistical independence criterion.
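For three variables the recursion above expands into an inclusion-exclusion sum of joint entropies, \operatorname{I}(X;Y;Z) = \Eta(X) + \Eta(Y) + \Eta(Z) - \Eta(X,Y) - \Eta(X,Z) - \Eta(Y,Z) + \Eta(X,Y,Z), which is straightforward to evaluate numerically. The sketch below assumes NumPy and a joint probability table; the XOR example yields a negative value, the classic purely "synergistic" case.
<syntaxhighlight lang="python">
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def interaction_information_3(pxyz):
    """I(X;Y;Z) = I(X;Y) - I(X;Y|Z), written as a sum of joint entropies."""
    pxyz = np.asarray(pxyz, dtype=float)
    hx = entropy(pxyz.sum(axis=(1, 2)))
    hy = entropy(pxyz.sum(axis=(0, 2)))
    hz = entropy(pxyz.sum(axis=(0, 1)))
    hxy = entropy(pxyz.sum(axis=2))
    hxz = entropy(pxyz.sum(axis=1))
    hyz = entropy(pxyz.sum(axis=0))
    hxyz = entropy(pxyz)
    return hx + hy + hz - hxy - hxz - hyz + hxyz

# Example: Z = X XOR Y with X, Y independent fair bits; any two variables are
# pairwise independent, yet the three together are fully determined, giving -1 bit.
pxyz = np.zeros((2, 2, 2))
for x in (0, 1):
    for y in (0, 1):
        pxyz[x, y, x ^ y] = 0.25
print(interaction_information_3(pxyz))   # -1.0
</syntaxhighlight>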
==== Applications ====
For three variables, Brenner et al. applied multivariate mutual information to neural coding and called its negativity "synergy", and Watkinson et al. applied it to genetic expression. For arbitrary k variables, Tapia et al. applied multivariate mutual information to gene expression. Mutual information is also used in the area of signal processing as a measure of similarity between two signals. For example, the FMI metric is an image fusion performance measure that uses mutual information to measure the amount of information that the fused image contains about the source images. Matlab code for this metric is available. A Python package for computing all multivariate mutual informations, conditional mutual information, joint entropies, total correlations, and information distance in a dataset of n variables is available.
=== Directed information ===
Directed information, \operatorname{I}\left(X^n \to Y^n\right), measures the amount of information that flows from the process X^n to Y^n, where X^n denotes the vector X_1, X_2, ..., X_n and Y^n denotes Y_1, Y_2, ..., Y_n. The term directed information was coined by James Massey and is defined as
: \operatorname{I}\left(X^n \to Y^n\right) = \sum_{i=1}^n \operatorname{I}\left(X^i; Y_i\mid Y^{i-1}\right) .
Note that if n = 1, the directed information reduces to the mutual information. Directed information has many applications in problems where causality plays an important role, such as the capacity of channels with feedback.
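A small self-contained sketch of this definition, assuming the joint distribution of the two length-n sequences is available as a Python dictionary; the function names are illustrative, and the example uses a noiseless identity channel.
<syntaxhighlight lang="python">
from collections import defaultdict
from itertools import product
from math import log2

def cmi(joint):
    """I(A;B|C) in bits from a pmf given as a dict {(a, b, c): probability}."""
    pac, pbc, pc = defaultdict(float), defaultdict(float), defaultdict(float)
    for (a, b, c), p in joint.items():
        pac[(a, c)] += p
        pbc[(b, c)] += p
        pc[c] += p
    return sum(p * log2(p * pc[c] / (pac[(a, c)] * pbc[(b, c)]))
               for (a, b, c), p in joint.items() if p > 0)

def directed_information(pxy, n):
    """I(X^n -> Y^n) = sum_i I(X^i; Y_i | Y^{i-1}) from a joint pmf
    pxy[(x_seq, y_seq)] over pairs of length-n tuples."""
    total = 0.0
    for i in range(1, n + 1):
        joint = defaultdict(float)
        for (xs, ys), p in pxy.items():
            joint[(xs[:i], ys[i - 1], ys[:i - 1])] += p   # (A, B, C) = (X^i, Y_i, Y^{i-1})
        total += cmi(joint)
    return total

# Toy example: Y is a noiseless copy of X and X_1, X_2 are i.i.d. fair bits,
# so the directed information equals H(Y^2) = 2 bits.
n = 2
pxy = {(xs, xs): 0.25 for xs in product((0, 1), repeat=n)}
print(directed_information(pxy, n))   # 2.0
</syntaxhighlight>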
=== Normalized variants ===
Normalized variants of the mutual information are provided by the coefficients of constraint, uncertainty coefficient or proficiency:
: C_{XY} = \frac{\operatorname{I}(X;Y)}{\Eta(Y)} ~~~~\mbox{and}~~~~ C_{YX} = \frac{\operatorname{I}(X;Y)}{\Eta(X)}.
The two coefficients have values ranging in [0, 1], but are not necessarily equal; the measure is therefore not symmetric. If one desires a symmetric measure, one may consider the following redundancy measure:
:R = \frac{\operatorname{I}(X;Y)}{\Eta(X) + \Eta(Y)}
which attains a minimum of zero when the variables are independent and a maximum value of
:R_\max = \frac{\min\left\{\Eta(X), \Eta(Y)\right\}}{\Eta(X) + \Eta(Y)}
when one variable becomes completely redundant with the knowledge of the other. See also Redundancy (information theory).

Another symmetrical measure is the symmetric uncertainty, given by
:U(X, Y) = 2R = 2\frac{\operatorname{I}(X;Y)}{\Eta(X) + \Eta(Y)}
which represents the harmonic mean of the two uncertainty coefficients C_{XY}, C_{YX}.

Another normalization, the information quality ratio (IQR), relates the mutual information to the joint entropy:
: IQR(X, Y) = \operatorname{E}[\operatorname{I}(X;Y)] = \frac{\operatorname{I}(X;Y)}{\Eta(X, Y)} = \frac{\sum_{x \in X} \sum_{y \in Y} p(x, y) \log {p(x)p(y)}}{\sum_{x \in X} \sum_{y \in Y} p(x, y) \log {p(x, y)}} - 1

There exists a normalization which derives from first thinking of mutual information as an analogue to covariance (thus Shannon entropy is analogous to variance). Then the normalized mutual information is calculated akin to the Pearson correlation coefficient,
: \frac{\operatorname{I}(X;Y)}{\sqrt{\Eta(X)\Eta(Y)}}\; .
A naive normalization may lead to biased interpretation and introduce spurious dependencies.
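The sketch below gathers these normalizations in one place for a discrete joint distribution; it assumes NumPy, and the dictionary keys are shorthand for the quantities defined above rather than any standard API.
<syntaxhighlight lang="python">
import numpy as np

def normalized_mi_variants(pxy):
    """Several normalizations of I(X;Y) from a joint probability table pxy[x, y]."""
    pxy = np.asarray(pxy, dtype=float)
    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    hx, hy, hxy = H(pxy.sum(axis=1)), H(pxy.sum(axis=0)), H(pxy.ravel())
    mi = hx + hy - hxy
    return {
        "C_XY": mi / hy,                     # coefficient of constraint / uncertainty coefficient
        "C_YX": mi / hx,
        "R": mi / (hx + hy),                 # symmetric redundancy measure
        "U": 2 * mi / (hx + hy),             # symmetric uncertainty
        "IQR": mi / hxy,                     # information quality ratio
        "corr_like": mi / np.sqrt(hx * hy),  # covariance/correlation-style normalization
    }

pxy = np.array([[0.35, 0.15],
                [0.05, 0.45]])
print(normalized_mi_variants(pxy))
</syntaxhighlight>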
=== Weighted variants ===
In the traditional formulation of the mutual information,
: \operatorname{I}(X;Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)},
each event or object specified by (x, y) is weighted by the corresponding probability p(x, y). This assumes that all objects or events are equivalent apart from their probability of occurrence. However, in some applications it may be the case that certain objects or events are more significant than others, or that certain patterns of association are more semantically important than others.

For example, the deterministic mapping \{(1,1),(2,2),(3,3)\} may be viewed as stronger than the deterministic mapping \{(1,3),(2,1),(3,2)\}, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values, and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation (showing agreement on all variable values) be judged stronger than the latter relation, then it is possible to use the following weighted mutual information:
: \operatorname{I}(X;Y) = \sum_{y \in Y} \sum_{x \in X} w(x,y) p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)},
which places a weight w(x,y) on the probability of each variable value co-occurrence, p(x,y). This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant holistic or Prägnanz factors. In the above example, using larger relative weights for w(1,1), w(2,2), and w(3,3) would have the effect of assessing greater informativeness for the relation \{(1,1),(2,2),(3,3)\} than for the relation \{(1,3),(2,1),(3,2)\}, which may be desirable in some cases of pattern recognition, and the like. This weighted mutual information is a form of weighted KL divergence, which is known to take negative values for some inputs, and there are examples where the weighted mutual information also takes negative values.
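The example below illustrates the point numerically, assuming NumPy: with a uniform weight matrix the two deterministic mappings receive the same score, while a weight matrix that emphasizes the diagonal cells favours the mapping that agrees on all variable values. The particular weights are purely illustrative.
<syntaxhighlight lang="python">
import numpy as np

def weighted_mutual_information(pxy, w):
    """Weighted MI in bits: sum over x, y of w(x,y) p(x,y) log[p(x,y) / (p(x) p(y))]."""
    pxy, w = np.asarray(pxy, dtype=float), np.asarray(w, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(w[mask] * pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask]))

# The two deterministic mappings of the example, over 3 equiprobable values.
identity = np.eye(3) / 3                          # {(1,1),(2,2),(3,3)}
permuted = np.roll(np.eye(3), -1, axis=1) / 3     # {(1,3),(2,1),(3,2)}

uniform_w  = np.ones((3, 3))                      # reduces to the ordinary MI
diagonal_w = np.ones((3, 3)) + 2 * np.eye(3)      # emphasizes agreement on values

print(weighted_mutual_information(identity, uniform_w),
      weighted_mutual_information(permuted, uniform_w))    # equal (log2 3 each)
print(weighted_mutual_information(identity, diagonal_w),
      weighted_mutual_information(permuted, diagonal_w))   # the identity mapping now scores higher
</syntaxhighlight>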
=== Adjusted mutual information ===
A probability distribution can be viewed as a partition of a set. One may then ask: if a set were partitioned randomly, what would the distribution of probabilities be? What would the expectation value of the mutual information be? The adjusted mutual information or AMI subtracts the expectation value of the MI, so that the AMI is zero when two different distributions are random, and one when two distributions are identical. The AMI is defined in analogy to the adjusted Rand index of two different partitions of a set.
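A brief illustration using scikit-learn's implementation of the AMI (assuming scikit-learn is available); the inputs are two labelings, i.e. two partitions of the same set of items.
<syntaxhighlight lang="python">
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
a = rng.integers(0, 3, size=1000)        # a partition of 1000 items into 3 blocks
b = rng.integers(0, 3, size=1000)        # an independently drawn random partition

print(adjusted_mutual_info_score(a, a))  # identical partitions -> 1.0
print(adjusted_mutual_info_score(a, b))  # unrelated partitions -> close to 0 (may be slightly negative)
</syntaxhighlight>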
=== Absolute mutual information ===
Using the ideas of Kolmogorov complexity, one can consider the mutual information of two sequences independent of any probability distribution:
: \operatorname{I}_K(X;Y) = K(X) - K(X\mid Y).
To establish that this quantity is symmetric up to a logarithmic factor (\operatorname{I}_K(X;Y) \approx \operatorname{I}_K(Y;X)) one requires the chain rule for Kolmogorov complexity. Approximations of this quantity via compression can be used to define a distance measure to perform a hierarchical clustering of sequences without having any domain knowledge of the sequences.
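Since Kolmogorov complexity is not computable, practical applications replace it with the length of the output of a real compressor. The sketch below shows one common surrogate in this spirit, the normalized compression distance, using Python's zlib; it is an approximation rather than the exact quantity in the formula above.
<syntaxhighlight lang="python">
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when the sequences share information."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy cat " * 20
c = bytes(range(256)) * 4
print(ncd(a, b))   # small: the two sequences share most of their structure
print(ncd(a, c))   # larger: little shared structure
</syntaxhighlight>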
=== Linear correlation ===
Unlike correlation coefficients, such as the product moment correlation coefficient, mutual information contains information about all dependence, linear and nonlinear, and not just linear dependence as the correlation coefficient measures. However, in the narrow case that the joint distribution for X and Y is a bivariate normal distribution (implying in particular that both marginal distributions are normally distributed), there is an exact relationship between \operatorname{I} and the correlation coefficient \rho:
:\operatorname{I} = -\frac{1}{2} \log\left(1 - \rho^2\right)
The equation above can be derived as follows for a bivariate Gaussian (with the natural logarithm, so that \operatorname{I} is measured in nats):
:\begin{align} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} &\sim \mathcal{N} \left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \Sigma \right),\qquad \Sigma = \begin{pmatrix} \sigma^2_1 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma^2_2 \end{pmatrix} \\ \Eta(X_i) &= \frac{1}{2}\log\left(2\pi e \sigma_i^2\right) = \frac{1}{2} + \frac{1}{2}\log(2\pi) + \log\left(\sigma_i\right), \quad i\in\{1, 2\} \\ \Eta(X_1, X_2) &= \frac{1}{2}\log\left[(2\pi e)^2|\Sigma|\right] = 1 + \log(2\pi) + \log\left(\sigma_1 \sigma_2\right) + \frac{1}{2}\log\left(1 - \rho^2\right) \\ \end{align}
Therefore,
: \operatorname{I}\left(X_1; X_2\right) = \Eta\left(X_1\right) + \Eta\left(X_2\right) - \Eta\left(X_1, X_2\right) = -\frac{1}{2}\log\left(1 - \rho^2\right)
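This identity is easy to check numerically. The sketch below assumes NumPy and scikit-learn; mutual_info_regression uses a nearest-neighbour estimator and reports values in nats, matching the natural logarithm used in the derivation.
<syntaxhighlight lang="python">
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rho = 0.8
closed_form = -0.5 * np.log(1 - rho**2)          # nats

rng = np.random.default_rng(0)
cov = [[1.0, rho], [rho, 1.0]]
samples = rng.multivariate_normal([0.0, 0.0], cov, size=20000)
estimate = mutual_info_regression(samples[:, [0]], samples[:, 1],
                                  n_neighbors=5, random_state=0)[0]
print(closed_form, estimate)   # the estimate should be close to -0.5*ln(1 - rho^2)
</syntaxhighlight>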
=== For discrete data ===
When X and Y are limited to a discrete number of states, the observed data are summarized in a contingency table, with row variable X (or i) and column variable Y (or j). Mutual information is one of the measures of association or correlation between the row and column variables. Other measures of association include Pearson's chi-squared test statistic, the G-test statistic, etc. In fact, with the same log base, mutual information will be equal to the G-test log-likelihood statistic divided by 2N, where N is the sample size.
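A short numerical check of this relationship, assuming NumPy; the contingency table counts are made up for illustration.
<syntaxhighlight lang="python">
import numpy as np

def mi_and_g_statistic(counts):
    """Plug-in mutual information (nats) and G-test statistic from a contingency table."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    pxy = counts / n
    expected = np.outer(pxy.sum(axis=1), pxy.sum(axis=0))   # p(x) p(y) under independence
    mask = pxy > 0
    mi = np.sum(pxy[mask] * np.log(pxy[mask] / expected[mask]))
    g = 2.0 * np.sum(counts[mask] * np.log(pxy[mask] / expected[mask]))
    return mi, g, n

counts = np.array([[30, 10],
                   [ 5, 55]])
mi, g, n = mi_and_g_statistic(counts)
print(mi, g / (2 * n))   # identical: I = G / (2N) when the same log base is used
</syntaxhighlight>

== Applications ==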