General mixture model

A typical finite-dimensional mixture model is a hierarchical model consisting of the following components:

• N random variables that are observed, each distributed according to a mixture of K components, with the components belonging to the same parametric family of distributions (e.g., all normal, all Zipfian, etc.) but with different parameters. However, it is also possible to have a finite mixture model whose components belong to different parametric families of distributions, for example a mixture of a multivariate normal distribution and a generalized hyperbolic distribution.
• N random latent variables specifying the identity of the mixture component of each observation, each distributed according to a K-dimensional categorical distribution.
• A set of K mixture weights, which are probabilities that sum to 1.
• A set of K parameters, each specifying the parameter of the corresponding mixture component. In many cases, each "parameter" is actually a set of parameters. For example, if the mixture components are Gaussian distributions, there will be a mean and variance for each component. If the mixture components are categorical distributions (e.g., when each observation is a token from a finite alphabet of size V), there will be a vector of V probabilities summing to 1.

In addition, in a Bayesian setting, the mixture weights and parameters will themselves be random variables, and prior distributions will be placed over them. In such a case, the weights are typically viewed as a K-dimensional random vector drawn from a Dirichlet distribution (the conjugate prior of the categorical distribution), and the parameters will be distributed according to their respective conjugate priors.

Mathematically, a basic parametric mixture model can be described as follows:

\begin{array}{lcl}
K &=& \text{number of mixture components} \\
N &=& \text{number of observations} \\
\theta_{i=1 \dots K} &=& \text{parameter of distribution of observation associated with component } i \\
\phi_{i=1 \dots K} &=& \text{mixture weight, i.e., prior probability of a particular component } i \\
\boldsymbol\phi &=& K\text{-dimensional vector composed of all the individual } \phi_{1 \dots K} \text{; must sum to 1} \\
z_{i=1 \dots N} &=& \text{component of observation } i \\
x_{i=1 \dots N} &=& \text{observation } i \\
F(x|\theta) &=& \text{probability distribution of an observation, parametrized on } \theta \\
z_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N}|z_{i=1 \dots N} &\sim& F(\theta_{z_i})
\end{array}

In a Bayesian setting, all parameters are associated with random variables, as follows:

\begin{array}{lcl}
K,N &=& \text{as above} \\
\theta_{i=1 \dots K}, \phi_{i=1 \dots K}, \boldsymbol\phi &=& \text{as above} \\
z_{i=1 \dots N}, x_{i=1 \dots N}, F(x|\theta) &=& \text{as above} \\
\alpha &=& \text{shared hyperparameter for component parameters} \\
\beta &=& \text{shared hyperparameter for mixture weights} \\
H(\theta|\alpha) &=& \text{prior probability distribution of component parameters, parametrized on } \alpha \\
\theta_{i=1 \dots K} &\sim& H(\theta|\alpha) \\
\boldsymbol\phi &\sim& \operatorname{Symmetric-Dirichlet}_K(\beta) \\
z_{i=1 \dots N}|\boldsymbol\phi &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N}|z_{i=1 \dots N},\theta_{i=1 \dots K} &\sim& F(\theta_{z_i})
\end{array}

This characterization uses
F and H to describe arbitrary distributions over observations and parameters, respectively. Typically H will be the conjugate prior of F. The two most common choices of F are Gaussian aka "normal" (for real-valued observations) and categorical (for discrete observations). Other common possibilities for the distribution of the mixture components are:

• Binomial distribution, for the number of "positive occurrences" (e.g., successes, yes votes, etc.) given a fixed number of total occurrences
• Multinomial distribution, similar to the binomial distribution, but for counts of multi-way occurrences (e.g., yes/no/maybe in a survey)
• Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
• Poisson distribution, for the number of occurrences of an event in a given period of time, for an event that is characterized by a fixed rate of occurrence
• Exponential distribution, for the time before the next event occurs, for an event that is characterized by a fixed rate of occurrence
• Log-normal distribution, for positive real numbers that are assumed to grow exponentially, such as incomes or prices
• Multivariate normal distribution (aka multivariate Gaussian distribution), for vectors of correlated outcomes that are individually Gaussian-distributed
• Multivariate Student's t-distribution, for vectors of heavy-tailed correlated outcomes
• A vector of Bernoulli-distributed values, corresponding, e.g., to a black-and-white image, with each value representing a pixel; see the handwriting-recognition example below.
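The two-stage generative process defined above — draw a component identity z_i from the mixture weights, then draw x_i from that component's distribution F — can be sketched directly in code. The following Python sketch uses hypothetical weights and a Gaussian F, but any of the distributions listed above could be substituted:

```python
import random

# Hypothetical two-component mixture, illustrating the generative process:
# z_i ~ Categorical(phi), then x_i | z_i ~ F(theta_{z_i}).
phi = [0.3, 0.7]                    # mixture weights; must sum to 1
theta = [(-2.0, 1.0), (3.0, 0.5)]   # (mean, std dev) for each Gaussian component

def sample_mixture(n, seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # latent variable: which component generated this observation
        z = rng.choices(range(len(phi)), weights=phi)[0]
        mu, sigma = theta[z]
        samples.append(rng.gauss(mu, sigma))   # observation drawn from F(theta_z)
    return samples

xs = sample_mixture(1000)
```

With these made-up parameters, the sample mean approaches the weighted mean of the component means, 0.3·(−2) + 0.7·3 = 1.5.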
Specific examples

Gaussian mixture model

In the accompanying plate diagram, smaller squares indicate fixed parameters; larger circles indicate random variables; filled-in shapes indicate known values; the indication [K] means a vector of size K.

A typical non-Bayesian Gaussian mixture model looks like this:

\begin{array}{lcl}
K,N &=& \text{as above} \\
\phi_{i=1 \dots K}, \boldsymbol\phi &=& \text{as above} \\
z_{i=1 \dots N}, x_{i=1 \dots N} &=& \text{as above} \\
\theta_{i=1 \dots K} &=& \{ \mu_{i=1 \dots K}, \sigma^2_{i=1 \dots K} \} \\
\mu_{i=1 \dots K} &=& \text{mean of component } i \\
\sigma^2_{i=1 \dots K} &=& \text{variance of component } i \\
z_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim& \mathcal{N}(\mu_{z_i}, \sigma^2_{z_i})
\end{array}

A Bayesian version of a Gaussian mixture model is as follows (its plate diagram uses the same conventions as above):

\begin{array}{lcl}
K,N &=& \text{as above} \\
\phi_{i=1 \dots K}, \boldsymbol\phi &=& \text{as above} \\
z_{i=1 \dots N}, x_{i=1 \dots N} &=& \text{as above} \\
\theta_{i=1 \dots K} &=& \{ \mu_{i=1 \dots K}, \sigma^2_{i=1 \dots K} \} \\
\mu_{i=1 \dots K} &=& \text{mean of component } i \\
\sigma^2_{i=1 \dots K} &=& \text{variance of component } i \\
\mu_0, \lambda, \nu, \sigma_0^2, \beta &=& \text{shared hyperparameters} \\
\mu_{i=1 \dots K} &\sim& \mathcal{N}(\mu_0, \lambda\sigma_i^2) \\
\sigma_{i=1 \dots K}^2 &\sim& \operatorname{Inverse-Gamma}(\nu, \sigma_0^2) \\
\boldsymbol\phi &\sim& \operatorname{Symmetric-Dirichlet}_K(\beta) \\
z_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim& \mathcal{N}(\mu_{z_i}, \sigma^2_{z_i})
\end{array}

In the accompanying animation of the fitting process, the histograms of the clusters are shown in different colours: during parameter estimation, new clusters are created and grow on the data, and the legend shows the cluster colours and the number of datapoints assigned to each cluster.
Multivariate Gaussian mixture model

A Bayesian Gaussian mixture model is commonly extended to fit a vector of unknown parameters (denoted in bold), or multivariate normal distributions. In a multivariate distribution (i.e., one modelling a vector \boldsymbol{x} with
N random variables) one may model a vector of parameters (such as several observations of a signal or patches within an image) using a Gaussian mixture model prior distribution on the vector of estimates given by

p(\boldsymbol{\theta}) = \sum_{i=1}^K \phi_i \mathcal{N}(\boldsymbol{\theta}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i),

where the ith mixture component is a normal distribution with weight \phi_i, mean \boldsymbol{\mu}_i and covariance matrix \boldsymbol{\Sigma}_i. To incorporate this prior into a Bayesian estimation, the prior is multiplied with the known distribution p(\boldsymbol{x} | \boldsymbol{\theta}) of the data \boldsymbol{x} conditioned on the parameters \boldsymbol{\theta} to be estimated. With this formulation, the posterior distribution p(\boldsymbol{\theta} | \boldsymbol{x}) is also a Gaussian mixture model of the form

p(\boldsymbol{\theta} | \boldsymbol{x}) = \sum_{i=1}^K \tilde{\phi}_i \mathcal{N}(\boldsymbol{\theta}; \tilde{\boldsymbol{\mu}}_i, \tilde{\boldsymbol{\Sigma}}_i),

with new parameters \tilde{\phi}_i, \tilde{\boldsymbol{\mu}}_i and \tilde{\boldsymbol{\Sigma}}_i that are updated using the
EM algorithm. Although EM-based parameter updates are well-established, providing the initial estimates for these parameters is currently an area of active research. Note that this formulation yields a closed-form solution to the complete posterior distribution. Estimates of the random variable \boldsymbol{\theta} may then be obtained via one of several estimators, such as the mean or maximum of the posterior distribution. Such mixture distributions are useful, for example, for modelling the patch-wise structure of images and clusters. In the case of image representation, each Gaussian may be tilted, expanded, and warped according to the covariance matrices \boldsymbol{\Sigma}_i. One Gaussian distribution of the set is fit to each patch (usually of size 8×8 pixels) in the image. Notably, any distribution of points around a cluster (see k-means) may be accurately modelled given enough Gaussian components, but rarely more than K=20 components are needed to accurately model a given image distribution or cluster of data.
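To make the closed-form posterior concrete, consider a hypothetical one-dimensional analogue of the vector formulation above, with likelihood x | θ ~ N(θ, s²): each component's mean and variance are updated by standard Gaussian conjugacy, and each weight is re-scaled by that component's marginal likelihood N(x; μ_i, σ_i² + s²). A sketch under those assumptions:

```python
import math

# Hypothetical 1-D illustration of the closed-form posterior: a Gaussian-mixture
# prior p(theta) = sum_i phi_i N(theta; mu_i, var_i) combined with a Gaussian
# likelihood x | theta ~ N(theta, s2) yields a posterior that is again a
# Gaussian mixture. All numbers below are made up for the example.

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gmm_posterior(phi, mu, var, x, s2):
    """Return the weights, means and variances of p(theta | x)."""
    # per-component conjugate (precision-weighted) update
    new_var = [1.0 / (1.0 / v + 1.0 / s2) for v in var]
    new_mu = [nv * (m / v + x / s2) for nv, m, v in zip(new_var, mu, var)]
    # each weight is scaled by the component's marginal likelihood
    # N(x; mu_i, var_i + s2), then the weights are renormalised
    w = [p * normal_pdf(x, m, v + s2) for p, m, v in zip(phi, mu, var)]
    total = sum(w)
    return [wi / total for wi in w], new_mu, new_var

# an observation near the second prior component shifts nearly all
# of the posterior weight onto that component
post_phi, post_mu, post_var = gmm_posterior(
    phi=[0.5, 0.5], mu=[-2.0, 3.0], var=[1.0, 1.0], x=2.5, s2=0.5)
```

The same per-component update applies in the multivariate case, with precision matrices in place of the scalar precisions.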
Categorical mixture model

In the accompanying plate diagrams, smaller squares indicate fixed parameters; larger circles indicate random variables; filled-in shapes indicate known values; the indication [K] means a vector of size K, and likewise [V] a vector of size V.

A typical non-Bayesian mixture model with categorical observations looks like this:

• K,N: as above
• \phi_{i=1 \dots K}, \boldsymbol\phi: as above
• z_{i=1 \dots N}, x_{i=1 \dots N}: as above
• V: dimension of categorical observations, e.g., size of word vocabulary
• \theta_{i=1 \dots K, j=1 \dots V}: probability for component i of observing item j
• \boldsymbol\theta_{i=1 \dots K}: vector of dimension V, composed of \theta_{i,1 \dots V}; must sum to 1

The random variables:

\begin{array}{lcl}
z_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\theta_{z_i})
\end{array}

A typical Bayesian mixture model with
categorical observations looks like this:

• K,N: as above
• \phi_{i=1 \dots K}, \boldsymbol\phi: as above
• z_{i=1 \dots N}, x_{i=1 \dots N}: as above
• V: dimension of categorical observations, e.g., size of word vocabulary
• \theta_{i=1 \dots K, j=1 \dots V}: probability for component i of observing item j
• \boldsymbol\theta_{i=1 \dots K}: vector of dimension V, composed of \theta_{i,1 \dots V}; must sum to 1
• \alpha: shared concentration hyperparameter of \boldsymbol\theta for each component
• \beta: concentration hyperparameter of \boldsymbol\phi

The random variables:

\begin{array}{lcl}
\boldsymbol\phi &\sim& \operatorname{Symmetric-Dirichlet}_K(\beta) \\
\boldsymbol\theta_{i=1 \dots K} &\sim& \operatorname{Symmetric-Dirichlet}_V(\alpha) \\
z_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\phi) \\
x_{i=1 \dots N} &\sim& \operatorname{Categorical}(\boldsymbol\theta_{z_i})
\end{array}
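Ancestral sampling from this Bayesian categorical mixture follows the dependency order directly: first the Dirichlet-distributed weights and component parameters, then the latent identities, then the observations. A sketch with made-up sizes and hyperparameters, drawing each symmetric Dirichlet variate by normalising independent Gamma draws:

```python
import random

# Hypothetical ancestral-sampling sketch of the Bayesian categorical mixture:
# phi ~ Symmetric-Dirichlet_K(beta), theta_i ~ Symmetric-Dirichlet_V(alpha),
# then z_i ~ Categorical(phi) and x_i ~ Categorical(theta_{z_i}).
# The sizes and hyperparameters below are made up for the example.

def symmetric_dirichlet(dim, conc, rng):
    # a Dirichlet draw is a normalised vector of independent Gamma(conc, 1) draws
    g = [rng.gammavariate(conc, 1.0) for _ in range(dim)]
    s = sum(g)
    return [v / s for v in g]

def sample_categorical_mixture(N, K, V, alpha, beta, seed=0):
    rng = random.Random(seed)
    phi = symmetric_dirichlet(K, beta, rng)                          # mixture weights
    theta = [symmetric_dirichlet(V, alpha, rng) for _ in range(K)]   # item probabilities per component
    zs = rng.choices(range(K), weights=phi, k=N)                     # latent component identities
    xs = [rng.choices(range(V), weights=theta[z])[0] for z in zs]    # observed tokens
    return zs, xs

zs, xs = sample_categorical_mixture(N=100, K=3, V=5, alpha=0.5, beta=1.0)
```

A small concentration α < 1 makes each component's item distribution sparse, concentrating its probability mass on a few of the V items.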
Examples