This section introduces MARS using a few examples. We start with a set of data: a matrix of input variables x, and a vector of the observed responses y, with a response for each row in x. For example, the data could be: Here there is only one independent variable, so the x matrix is just a single column. Given these measurements, we would like to build a model which predicts the expected y for a given x.

A
linear model for the above data is

\widehat{y} = -37 + 5.1 x

The hat on the \widehat{y} indicates that \widehat{y} is estimated from the data. The figure on the right shows a plot of this function: a line giving the predicted \widehat{y} versus x, with the original values of y shown as red dots.

The data at the extremes of x indicates that the relationship between y and x may be non-linear (look at the red dots relative to the regression line at low and high values of x). We thus turn to MARS to automatically build a model taking into account non-linearities. MARS software constructs a model from the given x and y as follows:

\begin{align}
\widehat{y} = &\ 25 \\
& {} + 6.1 \max(0, x - 13) \\
& {} - 3.1 \max(0, 13 - x)
\end{align}

The figure on the right shows a plot of this function: the predicted \widehat{y} versus
x, with the original values of y once again shown as red dots. The predicted response is now a better fit to the original y values.

MARS has automatically produced a kink in the predicted y to take into account non-linearity. The kink is produced by hinge functions. The hinge functions are the expressions starting with \max (where \max(a,b) is a if a > b, else b). Hinge functions are described in more detail below.

In this simple example, we can easily see from the plot that y has a non-linear relationship with x (and might perhaps guess that y varies with the square of x). However, in general there will be multiple independent variables, and the relationship between
y and these variables will be unclear and not easily visible by plotting. We can use MARS to discover that non-linear relationship.

An example MARS expression with multiple variables is

\begin{align}
\mathrm{ozone} = &\ 5.2 \\
& {} + 0.93 \max(0, \mathrm{temp} - 58) \\
& {} - 0.64 \max(0, \mathrm{temp} - 68) \\
& {} - 0.046 \max(0, 234 - \mathrm{ibt}) \\
& {} - 0.016 \max(0, \mathrm{wind} - 7) \max(0, 200 - \mathrm{vis})
\end{align}

This expression models air pollution (the ozone level) as a function of the temperature and a few other variables. Note that the last term in the formula (on the last line) incorporates an interaction between \mathrm{wind} and \mathrm{vis}.

The figure on the right plots the predicted \mathrm{ozone} as \mathrm{wind} and \mathrm{vis} vary, with the other variables fixed at their median values. The figure shows that wind does not affect the ozone level unless visibility is low. We see that MARS can build quite flexible regression surfaces by combining hinge functions.

To obtain the above expression, the MARS model building procedure automatically selects which variables to use (some variables are important, others not), the positions of the kinks in the hinge functions, and how the hinge functions are combined.

== The MARS model ==
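
As a concrete illustration of the example expressions in the introduction above, the following minimal Python sketch evaluates them directly. It only hard-codes the coefficients and knots quoted in the text; it does not fit a MARS model. The function names (`hinge`, `linear_model`, `mars_model`, `ozone_model`) and the input values in the demonstration loop are assumptions chosen for illustration, not part of any MARS software.

```python
def hinge(value):
    """Hinge function max(0, value)."""
    return max(0.0, value)


def linear_model(x):
    """Linear fit quoted in the text: y_hat = -37 + 5.1 x."""
    return -37 + 5.1 * x


def mars_model(x):
    """Single-variable MARS fit quoted in the text, with its kink at x = 13."""
    return 25 + 6.1 * hinge(x - 13) - 3.1 * hinge(13 - x)


def ozone_model(temp, ibt, wind, vis):
    """Multi-variable MARS expression for ozone quoted in the text."""
    return (
        5.2
        + 0.93 * hinge(temp - 58)
        - 0.64 * hinge(temp - 68)
        - 0.046 * hinge(234 - ibt)
        # Interaction term: non-zero only when wind > 7 AND vis < 200.
        - 0.016 * hinge(wind - 7) * hinge(200 - vis)
    )


# Illustrative x values on either side of the knot (hypothetical inputs).
for x in (10, 13, 16):
    print(x, linear_model(x), mars_model(x))

# Hypothetical inputs: vary wind at high and at low visibility.
for vis in (250, 100):
    for wind in (3, 12):
        print(vis, wind, ozone_model(temp=70, ibt=200, wind=wind, vis=vis))
```

Below x = 13 only the term with \max(0, 13 - x) is active, and above it only the term with \max(0, x - 13), which is what produces the kink. In the ozone expression, the product of two hinges vanishes whenever \mathrm{vis} is at least 200, so changing \mathrm{wind} has no effect on the prediction in that region.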