=== The symbols ===
Given a statistical model which generates a set \mathbf{X} of observed data, a set of unobserved latent data or missing values \mathbf{Z}, and a vector of unknown parameters \boldsymbol\theta, along with a likelihood function L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z}\mid\boldsymbol\theta), the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximizing the marginal likelihood of the observed data

\begin{align}
L(\boldsymbol\theta; \mathbf{X}) = p(\mathbf{X}\mid\boldsymbol\theta) &= \int p(\mathbf{X},\mathbf{Z} \mid \boldsymbol\theta) \, d\mathbf{Z} \\
&= \int p(\mathbf{X} \mid \mathbf{Z}, \boldsymbol\theta) \, p(\mathbf{Z} \mid \boldsymbol\theta) \, d\mathbf{Z}
\end{align}

However, this quantity is often intractable, since \mathbf{Z} is unobserved and the distribution of \mathbf{Z} is unknown before \boldsymbol\theta is estimated.
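When \mathbf{Z} is discrete, the integral above reduces to a sum over the possible latent values, so the marginal likelihood can sometimes be evaluated directly. Below is a minimal Python sketch, assuming a hypothetical two-component Gaussian mixture; the function name, toy data, and parameter values are illustrative, not part of any standard API:

<syntaxhighlight lang="python">
# Sketch: evaluating the marginal log-likelihood log p(X | theta) when Z is
# discrete, for a hypothetical two-component Gaussian mixture. The integral
# over Z becomes a sum over the two component labels.
import numpy as np
from scipy.stats import norm

def marginal_log_likelihood(x, pi, mu, sigma):
    # p(x_i, Z = k | theta) = p(Z = k | theta) * p(x_i | Z = k, theta)
    joint = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)])
    # Sum out Z (the discrete analogue of the integral), then take logs.
    return np.sum(np.log(joint.sum(axis=0)))

x = np.array([-1.2, 0.3, 4.1, 5.0])                         # toy observed data
print(marginal_log_likelihood(x,
                              pi=np.array([0.5, 0.5]),      # mixing weights p(Z | theta)
                              mu=np.array([0.0, 5.0]),      # component means
                              sigma=np.array([1.0, 1.0])))  # component std devs
</syntaxhighlight>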
=== The EM algorithm ===
The EM algorithm seeks to find the maximum likelihood estimate of the marginal likelihood by iteratively applying these two steps:
Expectation step (E step): Define Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) as the expected value of the log likelihood function of \boldsymbol\theta, with respect to the current conditional distribution of \mathbf{Z} given \mathbf{X} and the current estimate of the parameters \boldsymbol\theta^{(t)}:

Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z} \sim p(\cdot \mid \mathbf{X},\boldsymbol\theta^{(t)})}\left[ \log p(\mathbf{X},\mathbf{Z} \mid \boldsymbol\theta) \right] := \int \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta) \, p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{(t)}) \, d\mathbf{Z}
Maximization step (M step): Find the parameters that maximize this quantity:

\boldsymbol\theta^{(t+1)} = \mathop{\arg\max}_{\boldsymbol\theta} Q(\boldsymbol\theta\mid\boldsymbol\theta^{(t)})

More succinctly, the two steps can be written as one equation:

\boldsymbol\theta^{(t+1)} = \mathop{\arg\max}_{\boldsymbol\theta} \operatorname{E}_{\mathbf{Z} \sim p(\cdot \mid \mathbf{X},\boldsymbol\theta^{(t)})} \left[ \log p(\mathbf{X},\mathbf{Z} \mid \boldsymbol\theta) \right]
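For concreteness, here is a minimal Python sketch of one E step and one M step, assuming a hypothetical two-component Gaussian mixture, for which the argmax of Q has a closed form; the names, toy data, and iteration count are illustrative:

<syntaxhighlight lang="python">
# Sketch: one full E/M cycle for a hypothetical two-component Gaussian
# mixture. The E step computes responsibilities r[i, k] = p(Z_i = k | x_i,
# theta^(t)), the conditional distribution under which Q is an expectation;
# the M step is the closed-form argmax of Q for this model.
import numpy as np
from scipy.stats import norm

def em_step(x, pi, mu, sigma):
    # E step: posterior responsibilities under the current estimate theta^(t).
    joint = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)], axis=1)
    r = joint / joint.sum(axis=1, keepdims=True)   # shape (n, 2)

    # M step: weighted-average updates maximize Q(theta | theta^(t)).
    n_k = r.sum(axis=0)                            # effective count per component
    pi_new = n_k / len(x)
    mu_new = (r * x[:, None]).sum(axis=0) / n_k
    sigma_new = np.sqrt((r * (x[:, None] - mu_new) ** 2).sum(axis=0) / n_k)
    return pi_new, mu_new, sigma_new

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])  # toy data
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):   # iterate E and M steps until (approximate) convergence
    pi, mu, sigma = em_step(x, pi, mu, sigma)
print(pi, mu, sigma)
</syntaxhighlight>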
=== Interpretation of the variables ===
The typical models to which EM is applied use \mathbf{Z} as a latent variable indicating membership in one of a set of groups:

* The observed data points \mathbf{X} may be discrete (taking values in a finite or countably infinite set) or continuous (taking values in an uncountably infinite set). Associated with each data point may be a vector of observations.
* The missing values (aka latent variables) \mathbf{Z} are discrete, drawn from a fixed number of values, with one latent variable per observed unit.
* The parameters are continuous, and are of two kinds: parameters that are associated with all data points, and parameters associated with a specific value of a latent variable (i.e., associated with all data points whose corresponding latent variable has that value).

However, it is possible to apply EM to other sorts of models.

The motivation is as follows. If the value of the parameters \boldsymbol\theta is known, usually the value of the latent variables \mathbf{Z} can be found by maximizing the log-likelihood over all possible values of \mathbf{Z}, either simply by iterating over \mathbf{Z} or through an algorithm such as the Viterbi algorithm for hidden Markov models. Conversely, if we know the value of the latent variables \mathbf{Z}, we can find an estimate of the parameters \boldsymbol\theta fairly easily, typically by grouping the observed data points according to the value of the associated latent variable and averaging the values, or some function of the values, of the points in each group. This suggests an iterative algorithm for the case where both \boldsymbol\theta and \mathbf{Z} are unknown (a hard-assignment sketch follows the list):

# First, initialize the parameters \boldsymbol\theta to some random values.
# Compute the probability of each possible value of \mathbf{Z}, given \boldsymbol\theta.
# Then, use the just-computed values of \mathbf{Z} to compute a better estimate for the parameters \boldsymbol\theta.
# Iterate steps 2 and 3 until convergence.

The algorithm as just described monotonically improves its objective at each iteration, so it approaches a local optimum (equivalently, a local minimum of the corresponding cost function); convergence to a global maximum of the likelihood is not guaranteed.
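As an illustration of this alternating scheme (not EM proper, which replaces the hard choice of \mathbf{Z} with an expectation under the E-step posterior), here is a minimal Python sketch assuming a hypothetical two-component Gaussian mixture with equal, known variances; under those assumptions step 2 reduces to picking the nearest mean and step 3 to averaging each group, in the style of k-means:

<syntaxhighlight lang="python">
# Sketch: the hard-assignment alternation described above (k-means-like, not
# EM proper) for a hypothetical two-component Gaussian mixture with equal,
# known variances. Assumes neither group becomes empty during the iteration.
import numpy as np

def hard_assign_step(x, mu):
    # Step 2: with parameters fixed, the likeliest Z for each point is the
    # nearest mean (variances are shared, so likelihood is monotone in distance).
    z = np.argmin(np.abs(x[:, None] - mu), axis=1)
    # Step 3: with Z fixed, re-estimate each mean by averaging its group.
    return np.array([x[z == k].mean() for k in range(len(mu))])

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])  # toy data
mu = np.array([-1.0, 1.0])   # step 1: initialize the parameters
for _ in range(20):          # step 4: iterate steps 2 and 3
    mu = hard_assign_step(x, mu)
print(mu)
</syntaxhighlight>

== Properties ==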