Introduction There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression. This can make it difficult to compare different treatments of the subject in different texts. The article on
logistic regression presents a number of equivalent formulations of simple logistic regression, and many of these have analogues in the multinomial logit model. The idea behind all of them, as in many other
statistical classification techniques, is to construct a
linear predictor function that computes a score from a set of weights that are
linearly combined with the explanatory variables (features) of a given observation using a
dot product: :\operatorname{score}(\mathbf{X}_i,k) = \boldsymbol\beta_k \cdot \mathbf{X}_i, where
Xi is the vector of explanatory variables describing observation
i,
βk is a vector of weights (or
regression coefficients) corresponding to outcome
k, and score(
Xi,
k) is the score associated with assigning observation
i to category
k. In
discrete choice theory, where observations represent people and outcomes represent choices, the score is considered the
utility associated with person
i choosing outcome
k. The predicted outcome is the one with the highest score. The difference between the multinomial logit model and numerous other methods, models, algorithms, etc. with the same basic setup (the
perceptron algorithm,
support vector machines,
linear discriminant analysis, etc.) is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. In particular, in the multinomial logit model, the score can directly be converted to a probability value, indicating the
probability of observation
i choosing outcome
k given the measured characteristics of the observation. This provides a principled way of incorporating the prediction of a particular multinomial logit model into a larger procedure that may involve multiple such predictions, each with a possibility of error. Without such means of combining predictions, errors tend to multiply. For example, imagine a large
predictive model that is broken down into a series of submodels where the prediction of a given submodel is used as the input of another submodel, and that prediction is in turn used as the input into a third submodel, etc. If each submodel has 90% accuracy in its predictions, and there are five submodels in series, then the overall model has only 0.9^5 ≈ 59% accuracy. If each submodel has 80% accuracy, then overall accuracy drops to 0.8^5 ≈ 33% accuracy. This issue is known as
error propagation and is a serious problem in real-world predictive models, which are usually composed of numerous parts. Predicting probabilities of each possible outcome, rather than simply making a single optimal prediction, is one means of alleviating this issue.
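To make the compounding arithmetic concrete, here is a minimal sketch (in Python; the function name is hypothetical and the accuracy figures are just the illustrative values from the example above, assuming each stage errs independently):

```python
# Compounded accuracy of a chain of submodels, assuming each stage is
# correct independently with the stated per-stage accuracy (an
# illustrative simplification, not a property of any particular model).
def chained_accuracy(per_stage_accuracy: float, n_stages: int) -> float:
    return per_stage_accuracy ** n_stages

print(chained_accuracy(0.9, 5))  # ~0.590, i.e. about 59% overall accuracy
print(chained_accuracy(0.8, 5))  # ~0.328, i.e. about 33% overall accuracy
```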
Setup The basic setup is the same as in
logistic regression, the only difference being that the
dependent variables are
categorical rather than
binary, i.e. there are
K possible outcomes rather than just two. The following description is somewhat shortened; for more details, consult the
logistic regression article.
Data points Specifically, it is assumed that we have a series of
N observed data points. Each data point
i (ranging from 1 to
N) consists of a set of
M explanatory variables
x1,i ... xM,i (also known as
independent variables, predictor variables, features, etc.), and an associated
categorical outcome
Yi (also known as
dependent variable, response variable), which can take on one of
K possible values. These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to
K. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of
N "experiments" — although an "experiment" may consist of nothing more than gathering data. The goal of multinomial logistic regression is to construct a model that explains the relationship between the explanatory variables and the outcome, so that the outcome of a new "experiment" can be correctly predicted for a new data point for which the explanatory variables, but not the outcome, are available. In the process, the model attempts to explain the relative effect of differing explanatory variables on the outcome. Some examples: • The observed outcomes are different variants of a disease such as
hepatitis (possibly including "no disease" and/or other related diseases) in a set of patients, and the explanatory variables might be characteristics of the patients thought to be pertinent (sex, race, age,
blood pressure, outcomes of various liver-function tests, etc.). The goal is then to predict which disease is causing the observed liver-related symptoms in a new patient. • The observed outcomes are the party chosen by a set of people in an election, and the explanatory variables are the demographic characteristics of each person (e.g. sex, race, age, income, etc.). The goal is then to predict the likely vote of a new voter with given characteristics.
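As a concrete (purely illustrative) view of this setup, such data are commonly held as an N × M array of explanatory variables together with a length-N vector of integer-coded outcomes; the sketch below uses Python/NumPy with hypothetical names X and y and randomly generated values:

```python
import numpy as np

N, M, K = 1000, 4, 3                 # observations, explanatory variables, outcome categories

rng = np.random.default_rng(0)
X = rng.normal(size=(N, M))          # row i holds the M explanatory variables x_{1,i} ... x_{M,i}
y = rng.integers(1, K + 1, size=N)   # Y_i is one of the K categories, arbitrarily coded 1..K

# The numeric codes 1..K are arbitrary labels for nominal categories; they carry no ordering.
```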
Linear predictor As in other forms of linear regression, multinomial logistic regression uses a
linear predictor function f(k,i) to predict the probability that observation
i has outcome
k, of the following form: :f(k,i) = \beta_{0,k} + \beta_{1,k} x_{1,i} + \beta_{2,k} x_{2,i} + \cdots + \beta_{M,k} x_{M,i}, where \beta_{m,k} is a
regression coefficient associated with the
mth explanatory variable and the
kth outcome. As explained in the
logistic regression article, the regression coefficients and explanatory variables are normally grouped into vectors of size
M + 1, so that the predictor function can be written more compactly: :f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i, where \boldsymbol\beta_k is the set of regression coefficients associated with outcome
k, and \mathbf{x}_i (a row vector) is the set of explanatory variables associated with observation
i, prepended by a 1 in entry 0.
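A minimal sketch of this predictor in Python/NumPy, under the assumption that the coefficients for all K outcomes are stacked in a hypothetical array beta and that x holds a single observation's explanatory variables:

```python
import numpy as np

def linear_predictor(beta: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Return f(k, i) = beta_k . x_i for every outcome k.

    beta has shape (K, M + 1): one row of coefficients per outcome,
    with the intercept beta_{0,k} stored in column 0.
    x has shape (M,): the explanatory variables of a single observation.
    """
    x_aug = np.concatenate(([1.0], x))  # prepend the constant 1 in entry 0
    return beta @ x_aug                 # vector of K scores, one per outcome
```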
As a set of independent binary regressions To arrive at the multinomial logit model, one can imagine, for
K possible outcomes, running
K independent binary logistic regression models, in which one outcome is chosen as a "pivot" and then the other
K − 1 outcomes are separately regressed against the pivot outcome. If outcome
K (the last outcome) is chosen as the pivot, the
K − 1 regression equations are: : \ln \frac{\Pr(Y_i=k)}{\Pr(Y_i=K)} \,=\, \boldsymbol\beta_k \cdot \mathbf{X}_i, \;\;\;\;\;\;1\leq k < K. This formulation is also known as the
additive log ratio (ALR) transform commonly used in compositional data analysis. In other applications it is referred to as “relative risk”. If we exponentiate both sides and solve for the probabilities, we get: : \Pr(Y_i=k) \,=\, {\Pr(Y_i=K)}\;e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \;\;\;\;\;\;1\leq k < K. Using the fact that all
K of the probabilities must sum to one, we find: : \begin{align} \Pr(Y_i=K) = {} & 1- \sum_{j=1}^{K-1} \Pr (Y_i = j) \\ = {} & 1 - \sum_{j=1}^{K-1}{\Pr(Y_i=K)}\;e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} \;\;\Rightarrow\;\; \Pr(Y_i=K) \\ = {} & \frac{1}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}. \end{align} We can use this to find the other probabilities: : \Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \;\;\;\;\;\;1\leq k < K. The fact that we run multiple regressions reveals why the model relies on the assumption of
independence of irrelevant alternatives described above.
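The following sketch (Python/NumPy, with hypothetical names) evaluates these formulas: the coefficient vectors of the K − 1 non-pivot outcomes are stacked in beta, and the pivot outcome K has no explicit coefficients:

```python
import numpy as np

def pivot_probabilities(beta: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Probabilities of all K outcomes from K - 1 regressions against pivot outcome K.

    beta has shape (K - 1, M + 1); x has shape (M + 1,) with a leading 1.
    """
    odds = np.exp(beta @ x)            # e^{beta_k . x_i} for k = 1..K-1
    denom = 1.0 + odds.sum()
    return np.concatenate((odds / denom, [1.0 / denom]))  # last entry is Pr(Y_i = K)
```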
Estimating the coefficients The unknown parameters in each vector
βk are typically jointly estimated by
maximum a posteriori (MAP) estimation, which is an extension of
maximum likelihood using
regularization of the weights to prevent pathological solutions (usually a squared regularizing function, which is equivalent to placing a zero-mean
Gaussian prior distribution on the weights, but other distributions are also possible). The solution is typically found using an iterative procedure such as
generalized iterative scaling,
iteratively reweighted least squares (IRLS), or by means of gradient-based optimization algorithms such as L-BFGS.
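As a rough sketch of this approach (one reasonable implementation among many; the function and parameter names below are hypothetical), the penalized negative log-likelihood corresponding to a zero-mean Gaussian prior can be handed to a generic gradient-based optimizer such as SciPy's L-BFGS routine:

```python
import numpy as np
from scipy.optimize import minimize

def map_fit(X: np.ndarray, y: np.ndarray, K: int, alpha: float = 1.0) -> np.ndarray:
    """MAP estimate of the coefficient vectors under a zero-mean Gaussian prior.

    X has shape (N, M + 1) with a leading column of ones; y holds labels coded 1..K.
    Only K - 1 coefficient vectors are free; the last is pinned to zero, mirroring
    the pivot formulation above.
    """
    N, D = X.shape

    def neg_log_posterior(flat_beta: np.ndarray) -> float:
        beta = np.vstack((flat_beta.reshape(K - 1, D), np.zeros(D)))   # pivot outcome fixed at 0
        scores = X @ beta.T                                            # (N, K) linear predictors
        log_norm = np.log(np.exp(scores).sum(axis=1))                  # log of the partition function
        log_lik = scores[np.arange(N), y - 1] - log_norm
        return -log_lik.sum() + 0.5 * alpha * np.sum(flat_beta ** 2)   # squared (Gaussian) penalty

    result = minimize(neg_log_posterior, np.zeros((K - 1) * D), method="L-BFGS-B")
    return np.vstack((result.x.reshape(K - 1, D), np.zeros(D)))
```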
As a log-linear model The formulation of binary logistic regression as a
log-linear model can be directly extended to multi-way regression. That is, we model the
logarithm of the probability of seeing a given output using the linear predictor as well as an additional
normalization factor, the logarithm of the
partition function: : \ln \Pr(Y_i=k) = \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z, \;\;\;\;\;\;1\leq k \le K. As in the binary case, we need an extra term - \ln Z to ensure that the whole set of probabilities forms a
probability distribution, i.e. so that they all sum to one: :\sum_{k=1}^{K} \Pr(Y_i=k) = 1 The reason why we need to add a term to ensure normalization, rather than multiply as is usual, is that we have taken the logarithm of the probabilities. Exponentiating both sides turns the additive term into a multiplicative factor, so that the probability is just the
Gibbs measure: : \Pr(Y_i=k) = \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}, \;\;\;\;\;\;1\leq k \le K. The quantity
Z is called the
partition function for the distribution. We can compute the value of the partition function by applying the above constraint that requires all probabilities to sum to 1: : 1 = \sum_{k=1}^{K} \Pr(Y_i=k) \;=\; \sum_{k=1}^{K} \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;=\; \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}. Therefore :Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}. Note that this factor is "constant" in the sense that it is not a function of
Yi, which is the variable over which the probability distribution is defined. However, it is definitely not constant with respect to the explanatory variables, or crucially, with respect to the unknown regression coefficients
βk, which we will need to determine through some sort of
optimization procedure. The resulting equations for the probabilities are : \Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}, \;\;\;\;\;\;1\leq k \le K. The following function: :\operatorname{softmax}(k,s_1,\ldots,s_K) = \frac{e^{s_k}}{\sum_{j=1}^K e^{s_j}} is referred to as the
softmax function. The reason is that the effect of exponentiating the values s_1,\ldots,s_K is to exaggerate the differences between them. As a result, \operatorname{softmax}(k,s_1,\ldots,s_K) will return a value close to 0 whenever s_k is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a
weighted average that behaves as a
smooth function (which can be conveniently
differentiated, etc.) and which approximates the
indicator function :f(k) = \begin{cases} 1 & \textrm{if } \; k = \operatorname{\arg\max}_j s_j, \\ 0 & \textrm{otherwise}. \end{cases} Thus, we can write the probability equations as :\Pr(Y_i=k) = \operatorname{softmax}(k, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i) The softmax function thus serves as the equivalent of the
logistic function in binary logistic regression. Note that not all of the \boldsymbol{\beta}_k vectors of coefficients are uniquely
identifiable. This is due to the fact that all probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only K-1 separately specifiable probabilities, and hence K-1 separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector to all of the coefficient vectors, the equations are identical: : \begin{align} \frac{e^{(\boldsymbol\beta_k + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{(\boldsymbol\beta_j + \mathbf{C}) \cdot \mathbf{X}_i}} &= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i} e^{\mathbf{C}\cdot \mathbf{X}_i}} \\ &= \frac{e^{\mathbf{C}\cdot \mathbf{X}_i} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}} \\ &= \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j\cdot \mathbf{X}_i}} \end{align} As a result, it is conventional to set \mathbf{C}= -\boldsymbol\beta_K (or alternatively, one of the other coefficient vectors). Essentially, we set the constant so that one of the vectors becomes \boldsymbol 0, and all of the other vectors get transformed into the difference between those vectors and the vector we chose. This is equivalent to "pivoting" around one of the
K choices, and examining how much better or worse all of the other
K − 1 choices are, relative to the choice we are pivoting around. Mathematically, we transform the coefficients as follows: : \begin{align} \boldsymbol\beta'_k &= \boldsymbol\beta_k - \boldsymbol\beta_K, \;\;\;\;1\leq k < K \\ \boldsymbol\beta'_K &= \boldsymbol 0 \end{align} This leads to the following equations: : \Pr(Y_i=k) = \frac{e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}{1 + \sum_{j=1}^{K-1} e^{\boldsymbol\beta'_j \cdot \mathbf{X}_i}}, \;\;\;\;\;\;1\leq k \le K Other than the prime symbols on the regression coefficients, this is exactly the same as the form of the model described above, in terms of
K − 1 independent two-way regressions.
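The sketch below (Python/NumPy, with hypothetical names) computes softmax probabilities from a vector of scores and illustrates the invariance just described: shifting every score by the same constant, which is what adding a common vector C to all coefficient vectors does, leaves the probabilities unchanged. Subtracting the maximum score before exponentiating exploits this same invariance for numerical stability.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """softmax(k, s_1, ..., s_K) for all k at once, given the vector of scores s."""
    shifted = scores - scores.max()   # allowed by the shift invariance; also avoids overflow
    exp_s = np.exp(shifted)
    return exp_s / exp_s.sum()

scores = np.array([2.0, 1.0, -0.5])
print(softmax(scores))
print(softmax(scores + 7.3))          # identical probabilities after a uniform shift
```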
As a latent-variable model It is also possible to formulate multinomial logistic regression as a latent variable model, following the
two-way latent variable model described for binary logistic regression. This formulation is common in the theory of
discrete choice models, and makes it easier to compare multinomial logistic regression to the related
multinomial probit model, as well as to extend it to more complex models. Imagine that, for each data point
i and possible outcome
k = 1,2,...,
K, there is a continuous
latent variable Y_{i,k}^{\ast} (i.e. an unobserved
random variable) that is distributed as follows: : Y_{i,k}^{\ast} = \boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k \;\;\;\;,\;\;k \le K where \varepsilon_k \sim \operatorname{EV}_1(0,1), i.e. a standard type-1
extreme value distribution. This latent variable can be thought of as the
utility associated with data point
i choosing outcome
k, where there is some randomness in the actual amount of utility obtained, which accounts for other unmodeled factors that go into the choice. The value of the actual variable Y_i is then determined in a non-random fashion from these latent variables (i.e. the randomness has been moved from the observed outcomes into the latent variables), where outcome
k is chosen
if and only if the associated utility (the value of Y_{i,k}^{\ast}) is greater than the utilities of all the other choices, i.e. if the utility associated with outcome
k is the maximum of all the utilities. Since the latent variables are
continuous, the probability of two having exactly the same value is 0, so we ignore the scenario. That is: : \begin{align} \Pr(Y_i = 1) &= \Pr(Y_{i,1}^{\ast} > Y_{i,2}^{\ast} \text{ and } Y_{i,1}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,1}^{\ast} > Y_{i,K}^{\ast}) \\ \Pr(Y_i = 2) &= \Pr(Y_{i,2}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,2}^{\ast} > Y_{i,3}^{\ast}\text{ and } \cdots \text{ and } Y_{i,2}^{\ast} > Y_{i,K}^{\ast}) \\ & \,\,\,\vdots \\ \Pr(Y_i = K) &= \Pr(Y_{i,K}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,K}^{\ast} > Y_{i,2}^{\ast}\text{ and } \cdots \text{ and } Y_{i,K}^{\ast} > Y_{i,K-1}^{\ast}) \\ \end{align} Or equivalently: : \Pr(Y_i = k) \;=\; \Pr(\max(Y_{i,1}^{\ast},Y_{i,2}^{\ast},\ldots,Y_{i,K}^{\ast})=Y_{i,k}^{\ast}) \;\;\;\;,\;\;k \le K Let's look more closely at the first equation, which we can write as follows: : \begin{align} \Pr(Y_i = 1) &= \Pr(Y_{i,1}^{\ast} > Y_{i,k}^{\ast}\ \forall\ k=2,\ldots,K) \\ &= \Pr(Y_{i,1}^{\ast} - Y_{i,k}^{\ast} > 0\ \forall\ k=2,\ldots,K) \\ &= \Pr(\boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 - (\boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k) > 0\ \forall\ k=2,\ldots,K) \\ &= \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_k) \cdot \mathbf{X}_i > \varepsilon_k - \varepsilon_1\ \forall\ k=2,\ldots,K) \end{align} There are a few things to realize here: • In general, if X \sim \operatorname{EV}_1(a,b) and Y \sim \operatorname{EV}_1(a,b) then X - Y \sim \operatorname{Logistic}(0,b). That is, the difference of two
independent identically distributed extreme-value-distributed variables follows the
logistic distribution, where the first parameter is unimportant. This is understandable since the first parameter is a
location parameter, i.e. it shifts the mean by a fixed amount, and if two values are both shifted by the same amount, their difference remains the same. This means that all of the relational statements underlying the probability of a given choice involve the logistic distribution, which makes the initial choice of the extreme-value distribution, which seemed rather arbitrary, somewhat more understandable. • The second parameter in an extreme-value or logistic distribution is a
scale parameter, such that if X \sim \operatorname{Logistic}(0,1) then bX \sim \operatorname{Logistic}(0,b). This means that the effect of using an error variable with an arbitrary scale parameter in place of scale 1 can be compensated simply by multiplying all regression vectors by the same scale. Together with the previous point, this shows that the use of a standard extreme-value distribution (location 0, scale 1) for the error variables entails no loss of generality over using an arbitrary extreme-value distribution. In fact, the model is
nonidentifiable (no single set of optimal coefficients) if the more general distribution is used. • Because only differences of vectors of regression coefficients are used, adding an arbitrary constant to all coefficient vectors has no effect on the model. This means that, just as in the log-linear model, only
K − 1 of the coefficient vectors are identifiable, and the last one can be set to an arbitrary value (e.g. 0). Actually finding the values of the above probabilities is somewhat difficult, and is a problem of computing a particular
order statistic (the first, i.e. maximum) of a set of values. However, it can be shown that the resulting expressions are the same as in the above formulations, i.e. the two are equivalent.
==Estimation of intercept==