Chow and Liu show how to select second-order terms for the product approximation so that, among all such second-order approximations (first-order dependency trees), the constructed approximation P^{\prime} has the minimum Kullback–Leibler divergence to the actual distribution P, and is thus the closest approximation in the classical information-theoretic sense. The Kullback–Leibler divergence between a second-order product approximation and the actual distribution is shown to be

: D(P\parallel P^{\prime}) = -\sum I(X_{i};X_{j(i)}) + \sum H(X_{i}) - H(X_{1},X_{2},\ldots,X_{n}),

where I(X_{i};X_{j(i)}) is the mutual information between variable X_{i} and its parent X_{j(i)}, and H(X_{1},X_{2},\ldots,X_{n}) is the joint entropy of the variable set \{X_{1},X_{2},\ldots,X_{n}\}. Since the terms \sum H(X_{i}) and H(X_{1},X_{2},\ldots,X_{n}) are independent of the dependency ordering in the tree, only the sum of the pairwise mutual informations, \sum I(X_{i};X_{j(i)}), determines the quality of the approximation. Thus, if every branch (edge) of the tree is given a weight corresponding to the mutual information between the variables at its vertices, then the tree that provides the optimal second-order approximation to the target distribution is just the maximum-weight tree.
The equation above also highlights the role of the dependencies in the approximation: when no dependencies are specified, the first term of the equation is absent and we have only an approximation based on first-order marginals, so the distance between the approximation and the true distribution is due to the redundancies that are not accounted for when the variables are treated as independent. As we specify second-order dependencies, we begin to capture some of that structure and reduce the distance between the two distributions. Chow and Liu provide a simple algorithm for constructing the optimal tree: at each stage of the procedure the algorithm simply adds the maximum mutual information pair to the tree, provided the new branch does not form a cycle. See the original paper for full details.
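A minimal sketch of this greedy construction, assuming the data are given as an array of discrete observations (Python with NumPy; the function names and the synthetic chain example are illustrative, not from the original paper): every candidate edge is weighted by the empirical mutual information of its endpoints, and edges are accepted in decreasing order of weight whenever they do not close a cycle, i.e. a Kruskal-style maximum-weight spanning tree.

<syntaxhighlight lang="python">
import numpy as np
from collections import Counter

def empirical_mi(x, y):
    """Estimate I(X;Y) in nats from two aligned arrays of discrete samples."""
    n = len(x)
    joint, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def chow_liu_tree(data):
    """Return the edges (i, j) of a maximum-weight spanning tree whose edge
    weights are the empirical pairwise mutual informations of the columns of
    `data`, an (n_samples, n_variables) array of discrete observations."""
    n_vars = data.shape[1]
    weights = [(empirical_mi(data[:, i], data[:, j]), i, j)
               for i in range(n_vars) for j in range(i + 1, n_vars)]
    parent = list(range(n_vars))          # union-find forest for cycle detection
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = []
    # Greedy step: repeatedly add the highest-weight pair that does not form a cycle.
    for w, i, j in sorted(weights, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            edges.append((i, j))
            if len(edges) == n_vars - 1:
                break
    return edges

# Illustrative use: samples from a noisy chain X1 -> X2 -> X3 recover the chain edges.
rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=5000)
x2 = (x1 + (rng.random(5000) < 0.1)) % 2
x3 = (x2 + (rng.random(5000) < 0.1)) % 2
print(chow_liu_tree(np.column_stack([x1, x2, x3])))   # e.g. [(0, 1), (1, 2)]
</syntaxhighlight>

The edges returned here are undirected; to read off the parent function X_{j(i)} of the product approximation, any vertex can be chosen as the root and the edges oriented away from it.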
A more efficient tree-construction algorithm for the common case of sparse data has also been outlined. Chow and Wagner proved in a later paper that the learning of the Chow–Liu tree is consistent given samples (or observations) drawn i.i.d. from a tree-structured distribution; in other words, the probability of learning an incorrect tree decays to zero as the number of samples tends to infinity. The main idea in the proof is the continuity of the mutual information in the pairwise marginal distribution. More recently, the exponential rate at which this error probability converges to zero has also been characterized.

==Variations on Chow–Liu trees==