== Geometric interpretation ==

Lasso can set coefficients to zero, while the superficially similar ridge regression cannot. This is due to the difference in the shape of their constraint boundaries. Both lasso and ridge regression can be interpreted as minimizing the same objective function \min_{ \beta_0, \beta } \left\{ \frac{1}{N} \left\| y - \beta_0 - X \beta \right\|_2^2 \right\} but with respect to different constraints: \| \beta \|_1 \leq t for lasso and \| \beta \|_2^2 \leq t for ridge. The figure shows that the constraint region defined by the \ell^1 norm is a square rotated so that its corners lie on the axes (in general a cross-polytope), while the region defined by the \ell^2 norm is a circle (in general an n-sphere), which is rotationally invariant and therefore has no corners. As seen in the figure, a convex object that lies tangent to the boundary, such as the line shown, is likely to encounter a corner (or a higher-dimensional equivalent) of the cross-polytope, for which some components of \beta are identically zero, whereas in the case of an n-sphere, the points on the boundary at which some components of \beta are zero are not distinguished from the others, and the convex object is no more likely to contact a point at which some components of \beta are zero than one at which none of them are.
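The same point can be made algebraically in a simple special case (an illustration, not part of the figure): for a single standardized regressor, the lasso and ridge solutions reduce, up to a rescaling of \lambda , to soft thresholding and proportional shrinkage, respectively, \hat\beta_\text{lasso} = \operatorname{sgn}(\hat\beta_\text{OLS}) \max(|\hat\beta_\text{OLS}|-\lambda, 0) and \hat\beta_\text{ridge} = \hat\beta_\text{OLS}/(1+\lambda), so the lasso estimate is exactly zero whenever |\hat\beta_\text{OLS}| \leq \lambda , while the ridge estimate merely shrinks toward zero without reaching it.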
== Making λ easier to interpret with an accuracy-simplicity tradeoff ==

The lasso can be rescaled so that it becomes easy to anticipate and influence the degree of shrinkage associated with a given value of \lambda . It is assumed that X is standardized with z-scores and that y is centered (zero mean). Let \beta_0 represent the hypothesized regression coefficients and let b_\text{OLS} refer to the data-optimized ordinary least squares solutions. We can then define the Lagrangian as a tradeoff between the in-sample accuracy of the data-optimized solutions and the simplicity of sticking to the hypothesized values. This results in

\min_{ \beta \in \mathbb{R}^p } \left\{ \frac{(y-X\beta)'(y-X\beta)}{(y-X\beta_0)'(y-X\beta_0)} + 2\lambda \sum_{i=1}^p \frac{|\beta_i-\beta_{0,i}|}{q_i} \right\}

where q_i is specified below and the "prime" symbol stands for transpose. The first fraction represents relative accuracy, the second fraction relative simplicity, and \lambda balances between the two.

Given a single regressor, relative simplicity can be defined by specifying q_i as |b_\text{OLS}-\beta_0|, which is the maximum amount of deviation from \beta_0 when \lambda=0 . Assuming that \beta_0=0, the solution path can be defined in terms of R^2:

b_{\ell_1} = \begin{cases} (1-\lambda/R^2)b_\text{OLS} & \text{if } \lambda \leq R^2, \\ 0 & \text{if } \lambda > R^2. \end{cases}

If \lambda=0, the ordinary least squares (OLS) solution is used. The hypothesized value \beta_0=0 is selected if \lambda is larger than R^2. Furthermore, if R^2=1, then \lambda represents the proportional influence of \beta_0=0. In other words, \lambda\times100\% measures in percentage terms the minimal amount of influence of the hypothesized value relative to the data-optimized OLS solution. If an \ell_2-norm is used to penalize deviations from zero given a single regressor, the solution path is given by

b_{\ell_2} = \left(1+\frac{\lambda}{R^2(1-\lambda)}\right)^{-1} b_\text{OLS}.

Like b_{\ell_1}, b_{\ell_2} moves in the direction of the point (\lambda = R^2, b=0) when \lambda is close to zero; but unlike b_{\ell_1}, the influence of R^2 diminishes in b_{\ell_2} if \lambda increases (see figure).

Given multiple regressors, the moment that a parameter is activated (i.e. allowed to deviate from \beta_0) is also determined by a regressor's contribution to R^2 accuracy. First,

R^2=1-\frac{(y-Xb)'(y-Xb)}{(y-X\beta_0)'(y-X\beta_0)}.

An R^2 of 75% means that in-sample accuracy improves by 75% if the unrestricted OLS solutions are used instead of the hypothesized \beta_0 values. The individual contribution of deviating from each hypothesis can be computed with the p \times p matrix

R^{\otimes}=(X'\tilde y_0)(X'\tilde y_0)'(X'X)^{-1}(\tilde y_0'\tilde y_0)^{-1},

where \tilde y_0=y-X\beta_0. If b=b_\text{OLS} when R^2 is computed, then the diagonal elements of R^{\otimes} sum to R^2. The diagonal R^{\otimes} values may be smaller than 0 or, less often, larger than 1. If regressors are uncorrelated, then the i^{th} diagonal element of R^{\otimes} simply corresponds to the r^2 value between x_i and y.

A rescaled version of the adaptive lasso can be obtained by setting q_{\text{adaptive lasso},i}=|b_{\text{OLS},i}-\beta_{0,i}|. If regressors are uncorrelated, the moment that the i^{th} parameter is activated is given by the i^{th} diagonal element of R^{\otimes}. Assuming for convenience that \beta_0 is a vector of zeros,

b_i = \begin{cases} (1-\lambda/R_{ii}^{\otimes})b_{\text{OLS},i} & \text{if } \lambda \leq R_{ii}^{\otimes}, \\ 0 & \text{if } \lambda > R_{ii}^{\otimes}. \end{cases}

That is, if regressors are uncorrelated, \lambda again specifies the minimal influence of \beta_0. Even when regressors are correlated, the first time that a regression parameter is activated occurs when \lambda equals the highest diagonal element of R^{\otimes}.

These results can be compared to a rescaled version of the lasso by defining q_{\text{lasso},i}=\frac{1}{p}\sum_l |b_{\text{OLS},l}-\beta_{0,l}|, which is the average absolute deviation of b_\text{OLS} from \beta_0. Assuming that regressors are uncorrelated, the moment of activation of the i^{th} regressor is given by

\tilde\lambda_{\text{lasso},i} = \frac{1}{p}\sqrt{R^{\otimes}_{ii}}\sum_{l=1}^p\sqrt{R^{\otimes}_{ll}}.

For p=1, the moment of activation is again given by \tilde\lambda_{\text{lasso},1}=R^2. If \beta_0 is a vector of zeros and a subset of p_B relevant parameters are equally responsible for a perfect fit of R^2=1, then this subset is activated at a \lambda value of \frac{1}{p}. The moment of activation of a relevant regressor then equals \frac{1}{p}\frac{1}{\sqrt{p_B}}p_B\frac{1}{\sqrt{p_B}}=\frac{1}{p}. In other words, the inclusion of irrelevant regressors delays the moment that relevant regressors are activated by this rescaled lasso. The adaptive lasso and the lasso are special cases of a '1ASTc' estimator. The latter only groups parameters together if the absolute correlation among regressors is larger than a user-specified value.
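As an illustration with hypothetical numbers, suppose that p=2 regressors are uncorrelated, that \beta_0 is a vector of zeros, and that R^{\otimes} has diagonal elements 0.6 and 0.15, so that R^2=0.75. The rescaled adaptive lasso then activates the first regressor for \lambda \leq 0.6 and the second for \lambda \leq 0.15, while the rescaled lasso activates them at \tilde\lambda_{\text{lasso},1}=\tfrac{1}{2}\sqrt{0.6}\,(\sqrt{0.6}+\sqrt{0.15})=\tfrac{1}{2}(0.6+0.3)=0.45 and \tilde\lambda_{\text{lasso},2}=\tfrac{1}{2}\sqrt{0.15}\,(\sqrt{0.6}+\sqrt{0.15})=\tfrac{1}{2}(0.3+0.15)=0.225, respectively.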
== Bayesian interpretation ==

Just as ridge regression can be interpreted as linear regression whose coefficients have been assigned normal prior distributions, lasso can be interpreted as linear regression whose coefficients have Laplace prior distributions. The Laplace distribution is sharply peaked at zero (its first derivative is discontinuous at zero) and it concentrates its probability mass closer to zero than does the normal distribution. This provides an alternative explanation of why lasso tends to set some coefficients to zero, while ridge regression does not.
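As a sketch of this correspondence (with \sigma^2 denoting the error variance and \tau the prior scale, both introduced here only for illustration), if y = X\beta+\varepsilon with \varepsilon \sim N(0,\sigma^2 I) and the coefficients have independent Laplace priors p(\beta_j)\propto\exp(-|\beta_j|/\tau), then the negative log-posterior is, up to an additive constant, \frac{1}{2\sigma^2}\|y-X\beta\|_2^2+\frac{1}{\tau}\|\beta\|_1, so the maximum a posteriori estimate coincides with a lasso solution for \lambda proportional to \sigma^2/\tau; a normal prior \propto\exp(-\beta_j^2/(2\tau^2)) instead yields the ridge penalty \|\beta\|_2^2.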
== Convex relaxation interpretation ==

Lasso can also be viewed as a convex relaxation of the best subset selection regression problem, which is to find the subset of \leq k covariates that results in the smallest value of the objective function for some fixed k \leq n , where n is the total number of covariates. The "\ell^0 norm", \| \cdot \|_0 (the number of nonzero entries of a vector), is the limiting case of the "\ell^p norms", of the form \textstyle \| x \|_p = \left( \sum_{i=1}^n | x_i |^p \right)^{1/p} (where the quotation marks signify that these are not really norms for p < 1, since \| \cdot \|_p is not convex for p < 1, so the triangle inequality does not hold). Therefore, since p = 1 is the smallest value for which the "\ell^p norm" is convex (and therefore actually a norm), lasso is, in some sense, the best convex approximation to the best subset selection problem, since the region defined by \| x \|_1 \leq t is the convex hull of the region defined by \| x \|_p \leq t for p < 1.
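For example, for the illustrative vector x = (3, 0, -2) one has \| x \|_0 = 2 and \| x \|_1 = 5, while \| x \|_p^p = 3^p + 2^p \to 2 as p \to 0^+, which is the sense in which the "\ell^0 norm" counts the nonzero entries and arises as the limiting case of the "\ell^p norms".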
== Generalizations ==