In linear regression, the model specification is that the dependent variable, y_i, is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling n data points there is one independent variable, x_i, and two parameters, \beta_0 and \beta_1:

:straight line: y_i=\beta_0 +\beta_1 x_i +\varepsilon_i,\quad i=1,\dots,n.

In multiple linear regression, there are several independent variables or functions of independent variables. Adding a term in x_i^2 to the preceding regression gives:

:parabola: y_i=\beta_0 +\beta_1 x_i +\beta_2 x_i^2+\varepsilon_i,\quad i=1,\dots,n.

This is still linear regression; although the expression on the right-hand side is quadratic in the independent variable x_i, it is linear in the parameters \beta_0, \beta_1 and \beta_2. In both cases, \varepsilon_i is an error term and the subscript i indexes a particular observation.

Returning our attention to the straight-line case: given a random sample from the population, we estimate the population parameters and obtain the sample linear regression model:

:\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_1 x_i.

The residual, e_i = y_i - \widehat{y}_i, is the difference between the true value of the dependent variable, y_i, and the value predicted by the model, \widehat{y}_i.
One method of estimation is ordinary least squares. This method obtains parameter estimates that minimize the sum of squared residuals, SSR:

:SSR=\sum_{i=1}^n e_i^2.

Minimization of this function results in a set of normal equations, a set of simultaneous linear equations in the parameters, which are solved to yield the parameter estimators, \widehat{\beta}_0 and \widehat{\beta}_1. In the case of simple regression, the formulas for the least squares estimates are

:\widehat{\beta}_1=\frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}
:\widehat{\beta}_0=\bar{y}-\widehat{\beta}_1\bar{x}

where \bar{x} is the mean (average) of the x values and \bar{y} is the mean of the y values.

Under the assumption that the population error term has a constant variance, the estimate of that variance is given by:

:\hat{\sigma}^2_\varepsilon = \frac{SSR}{n-2}.

This is called the mean square error (MSE) of the regression. The denominator is the sample size reduced by the number of model parameters estimated from the same data: (n-p) for p regressors, or (n-p-1) if an intercept is used. In this case, p=1, so the denominator is n-2.
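To make these formulas concrete, here is a minimal sketch, assuming Python with NumPy and a small made-up dataset, that computes the least squares estimates, the residuals, the SSR, and the MSE:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical sample data (invented for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Least squares estimates for the straight-line model.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals, sum of squared residuals, and mean square error.
e = y - (beta0_hat + beta1_hat * x)
ssr = np.sum(e ** 2)
mse = ssr / (n - 2)  # denominator n - 2: one regressor plus an intercept

print(beta0_hat, beta1_hat, ssr, mse)
</syntaxhighlight>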
The standard errors of the parameter estimates are given by

:\hat\sigma_{\beta_1}=\hat\sigma_{\varepsilon} \sqrt{\frac{1}{\sum(x_i-\bar x)^2}}
:\hat\sigma_{\beta_0}=\hat\sigma_\varepsilon \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum(x_i-\bar x)^2}}=\hat\sigma_{\beta_1} \sqrt{\frac{\sum x_i^2}{n}}.

Under the further assumption that the population error term is normally distributed, the researcher can use these estimated standard errors to create confidence intervals and conduct hypothesis tests about the population parameters.
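Continuing the sketch above (same made-up data), the standard errors and a 95% confidence interval for the slope might be computed as follows; the t critical value assumes SciPy is available:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Same hypothetical data as in the previous sketch.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

x_bar = x.mean()
beta1_hat = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0_hat = y.mean() - beta1_hat * x_bar
e = y - (beta0_hat + beta1_hat * x)
sigma_hat = np.sqrt(np.sum(e ** 2) / (n - 2))  # square root of the MSE

# Standard errors from the formulas above.
se_beta1 = sigma_hat * np.sqrt(1.0 / np.sum((x - x_bar) ** 2))
se_beta0 = sigma_hat * np.sqrt(1.0 / n + x_bar ** 2 / np.sum((x - x_bar) ** 2))

# 95% confidence interval for the slope, using the t distribution with
# n - 2 degrees of freedom (valid under normally distributed errors).
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_slope = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
print(se_beta0, se_beta1, ci_slope)
</syntaxhighlight>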
===General linear model===

In the more general multiple regression model, there are p independent variables:

:y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,

where x_{ij} is the i-th observation on the j-th independent variable. If the first independent variable takes the value 1 for all i, x_{i1} = 1, then \beta_1 is called the regression intercept.

The least squares parameter estimates are obtained from p normal equations. The residual can be written as

:e_i=y_i - \hat\beta_1 x_{i1} - \cdots - \hat\beta_p x_{ip}.

The normal equations are

:\sum_{i=1}^n \sum_{k=1}^p x_{ij}x_{ik}\hat \beta_k=\sum_{i=1}^n x_{ij}y_i,\quad j=1,\dots,p.

In matrix notation, the normal equations are written as

:(\mathbf{X}^\top \mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}^\top \mathbf{Y},

where the ij element of \mathbf{X} is x_{ij}, the i element of the column vector \mathbf{Y} is y_i, and the j element of \hat{\boldsymbol{\beta}} is \hat\beta_j. Thus \mathbf{X} is n \times p, \mathbf{Y} is n \times 1, and \hat{\boldsymbol{\beta}} is p \times 1. The solution is

:\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{Y}.
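A short numeric sketch, assuming Python with NumPy and an invented design matrix whose first column is all ones (so \hat\beta_1 plays the role of the intercept), of solving the normal equations; np.linalg.solve is used rather than forming the inverse explicitly, which is the numerically preferable route:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical design matrix: the first column of ones gives an intercept.
X = np.array([[1.0, 0.5, 1.2],
              [1.0, 1.5, 0.7],
              [1.0, 2.0, 2.3],
              [1.0, 3.5, 1.9],
              [1.0, 4.0, 3.1]])
y = np.array([1.1, 2.0, 3.4, 4.2, 5.8])

# Solve the normal equations (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent to the closed-form (X^T X)^{-1} X^T y above, but computing
# the explicit inverse is less numerically stable:
beta_hat_inv = np.linalg.inv(X.T @ X) @ X.T @ y

print(beta_hat)
</syntaxhighlight>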
===Diagnostics===

Once a regression model has been constructed, it may be important to confirm the goodness of fit of the model and the statistical significance of the estimated parameters. Commonly used checks of goodness of fit include the R-squared, analyses of the pattern of residuals, and hypothesis testing. Statistical significance can be checked by an F-test of the overall fit, followed by t-tests of individual parameters.

Interpretations of these diagnostic tests rest heavily on the model's assumptions. Although examination of the residuals can be used to invalidate a model, the results of a t-test or F-test are sometimes more difficult to interpret if the model's assumptions are violated. For example, if the error term does not have a normal distribution, then in small samples the estimated parameters will not follow normal distributions, which complicates inference. With relatively large samples, however, a central limit theorem can be invoked such that hypothesis testing may proceed using asymptotic approximations.
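As an illustration, here is a sketch, assuming Python with NumPy and SciPy and reusing the invented data from the previous example, of computing the R-squared and an F-test of the overall fit:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Hypothetical data: design matrix with an intercept column, as above.
X = np.array([[1.0, 0.5, 1.2],
              [1.0, 1.5, 0.7],
              [1.0, 2.0, 2.3],
              [1.0, 3.5, 1.9],
              [1.0, 4.0, 3.1]])
y = np.array([1.1, 2.0, 3.4, 4.2, 5.8])
n, p = X.shape  # p counts the intercept column here
k = p - 1       # number of non-intercept regressors

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

# R-squared: proportion of the variation in y explained by the model.
ss_res = np.sum(e ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# F-test of overall fit: H0 is that all non-intercept coefficients are zero.
f_stat = (r_squared / k) / ((1.0 - r_squared) / (n - p))
p_value = stats.f.sf(f_stat, k, n - p)

print(r_squared, f_stat, p_value)
</syntaxhighlight>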
===Limited dependent variables===

Limited dependent variables, which are response variables that are categorical or constrained to fall only in a certain range, often arise in econometrics. The response variable may be non-continuous ("limited" to lie on some subset of the real line). For binary (zero or one) variables, if analysis proceeds with least-squares linear regression, the model is called the linear probability model. Nonlinear models for binary dependent variables include the probit and logit model. The multivariate probit model is a standard method of estimating a joint relationship between several binary dependent variables and some independent variables. For categorical variables with more than two values there is the multinomial logit. For ordinal variables with more than two values, there are the ordered logit and ordered probit models.
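As one concrete case, here is a minimal sketch of fitting a logit model by Newton-Raphson maximum likelihood, assuming Python with NumPy and invented binary data; in practice an established statistics package would be used instead:

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical data: intercept column plus one regressor, binary response.
# The classes must overlap; under perfect separation the logit MLE diverges.
X = np.array([[1.0, 0.2], [1.0, 0.8], [1.0, 1.5],
              [1.0, 2.3], [1.0, 3.1], [1.0, 3.9]])
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)

beta = np.zeros(X.shape[1])
for _ in range(25):  # Newton-Raphson iterations
    p = 1.0 / (1.0 + np.exp(-X @ beta))  # logistic probabilities
    W = p * (1.0 - p)                    # observation weights
    grad = X.T @ (y - p)                 # gradient of the log-likelihood
    hess = X.T @ (X * W[:, None])        # negative Hessian: X^T W X
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print(beta)  # estimated logit coefficients
</syntaxhighlight>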
Censored regression models may be used when the dependent variable is only sometimes observed, and Heckman correction type models may be used when the sample is not randomly selected from the population of interest. An alternative to such procedures is linear regression based on polychoric correlation (or polyserial correlations) between the categorical variables. Such procedures differ in the assumptions made about the distribution of the variables in the population. If the dependent variable takes only small non-negative integer values, representing counts of the occurrences of an event, then count models such as Poisson regression or the negative binomial model may be used.

==Nonlinear regression==