The Lasso and its Utility

The latest update from the second week of my research into Valid Post-Selection Inference, with a particular focus on the statistical tool known as the "lasso".

Continuing from the start of my research last week, the intrigue of Valid Post-Selection Inference comes from how we can draw conclusions and work with data "post selection" (i.e. after we have chosen the relevant variables to examine). However, a critical part of this process, and one which is pivotal to understand, is how we actually select these variables from the potentially millions of pieces of data which we have been given.

The usual way to proceed with making statistical statements is to consider the least squares estimate. This is a very useful tool: it essentially gives an estimate for our given model in which the discrepancies between the model and the observed data have been minimised. The least squares estimate has low bias (so the estimate will not be systematically skewed in one direction over another), but its variance is often very high, meaning the estimates we get can be very spread out and hence not very accurate. Another downside is that if we start with millions of pieces of data, we will end up with the same number (so millions) of estimated coefficients. This is not very practical, as it is impossible to examine such a large number of estimates by eye. Hence it is useful to consider a smaller subset of estimates, as this is more interpretable. The only problem is that we need this subset to be fully representative of the millions of pieces of data we were originally given. To this end, we need a method which reduces the number of estimates we make whilst maintaining all of the key information about our model. There are two methods which I have chosen to focus on: ridge regression and the lasso.

The differences between the two methods are best shown by considering the equations which define the estimates we wish to compute. The first step in both methods is to consider the linear regression model given by the equation below (Hastie, Tibshirani, Friedman 2008):
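\[ Y = X\beta + \varepsilon \]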

Here, Y is the response variable vector (the 'result' of the experiment, for example the price of a stock or the consequence of a gene mutation), X is the matrix of covariates (the data from the experiment), the βi (i = 1, ..., p) are the coefficients of the problem which we wish to estimate, and ε is the error term, which usually has a normal distribution. Within this model, the unknown quantity we wish to estimate is the vector of coefficients β.

First, we look at the equation for ridge regression. In this case, the estimate for β is the solution to the following minimisation problem (Hastie, Tibshirani, Friedman 2008):
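\[ \hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \Big\{ (Y - X\beta)^T (Y - X\beta) + \lambda \sum_{i=1}^{p} \beta_i^2 \Big\} \]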

The key part of the equation above is that it includes an l2-penalty of the form:
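\[ \lambda \sum_{i=1}^{p} \beta_i^2 = \lambda \lVert \beta \rVert_2^2 \]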

This term is important because it is differentiable, which means we can differentiate the objective and solve the ridge equation above to find an explicit solution (Hastie, Tibshirani, Friedman 2008):
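\[ \hat{\beta}^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T Y \]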

which we can use to estimate β. This is useful, as it gives us an explicit formula with which to carry out ridge regression.

The lasso method is very similar, but there is one key difference, which is best seen by looking at the lasso equation below (Hastie, Tibshirani, Friedman 2008):
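\[ \hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \Big\{ (Y - X\beta)^T (Y - X\beta) + \lambda \sum_{i=1}^{p} |\beta_i| \Big\} \]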

Here, we see from the last term that instead of having an l2-penalty as in ridge regression, the lasso has an l1-penalty of the form:
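\[ \lambda \sum_{i=1}^{p} |\beta_i| = \lambda \lVert \beta \rVert_1 \]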

This l1-penalty makes a big difference: it is not differentiable, so there is no nice closed-form expression for the solution as there is in ridge regression, and the only way to compute the lasso estimates is numerically on a computer. The lasso is thus a much more difficult way to estimate β. An explicit way to calculate the lasso (Tibshirani 1996) is given by the Least Angle Regression (LAR) algorithm (Hastie, Tibshirani, Friedman 2008). This algorithm is quite complex, and so requires the statistical programme R to run. Another key part of solving the lasso is choosing the tuning parameter λ. This is important, as setting the tuning parameter too large will cause all variables to be set to zero (so we have nothing to analyse), whilst setting it too small will set no variables to zero, meaning we have not shrunk the number of variables as desired. Thankfully, a computational way of picking the optimal parameter value is given in Negahban et al. (2012), and involves using 10-fold cross-validation: the optimal value of the parameter is the one which minimises the cross-validation error.
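To give a feel for how this works in practice, here is a minimal sketch of choosing λ by 10-fold cross-validation. It is written in Python (rather than R) using scikit-learn's LassoCV, and the simulated data and variable names are purely illustrative assumptions rather than part of my actual analysis:

```python
# Minimal sketch: choosing the lasso tuning parameter by 10-fold
# cross-validation. The simulated data and names are illustrative only.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 50                                  # n observations, p covariates
X = rng.normal(size=(n, p))                     # matrix of covariates
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, 4.0]           # only the first 5 coefficients are non-zero
y = X @ beta + rng.normal(scale=1.0, size=n)    # response = signal + noise

# LassoCV fits the lasso along a grid of tuning parameter values
# (called "alpha" in scikit-learn) and keeps the one with the
# smallest 10-fold cross-validation error.
model = LassoCV(cv=10).fit(X, y)

print("chosen tuning parameter:", model.alpha_)
print("number of non-zero coefficients:", np.count_nonzero(model.coef_))
```

Running a sketch like this typically returns a small tuning parameter and a fit in which most of the p coefficients have been set exactly to zero, which is exactly the shrinkage behaviour described above.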

From the above, it is clear that the lasso is more difficult to calculate. However, the lasso is also a more useful construct. An interesting way to see this is by considering the diagram below (Hastie, Tibshirani, Friedman 2008):
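The diagram plots the contours of the least squares error together with the constraint regions of the two estimators, which correspond to writing them in their equivalent constrained forms (Hastie, Tibshirani, Friedman 2008):

\[ \hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} (Y - X\beta)^T (Y - X\beta) \ \text{ subject to } \ \sum_{i=1}^{p} |\beta_i| \le t, \]
\[ \hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} (Y - X\beta)^T (Y - X\beta) \ \text{ subject to } \ \sum_{i=1}^{p} \beta_i^2 \le t. \]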

Here, the image on the left represents the lasso, and the image on the right represents ridge regression. The solution to each problem is the point where the red contours of the least squares error first touch the boundary of the blue shape, which represents the full set of lasso or ridge possibilities respectively. The key point to take away is that in the ridge picture both parameters are estimated as non-zero, whilst in the lasso picture the contours touch the constraint region at a corner, where β1 is equal to zero. Hence we see that the lasso will set some coefficients exactly to zero, and this allows us to remove those variables from our original model, as they were clearly not as important as the other variables in explaining the data. This is incredibly useful, as we now have a method with which we can shrink the number of variables, solving the interpretability problem described earlier. Furthermore, by shrinking the coefficients we sacrifice some of the unbiasedness we had with the usual least squares estimate, but in return we lose a lot of variance, meaning our estimates will be much more accurate. These properties make the lasso an extremely useful tool, and hence it is the method which I will implement going forward in order to shrink data and make relevant inferences and mathematical claims as my research progresses.

References:

Hastie, Tibshirani, Friedman 2008: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, Springer, New York.

Tibshirani 1996: Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, Series B, 58(1):267-288.

Negahban et al. 2012: "A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers", Statistical Science, 27(4):538-557. JSTOR, http://www.jstor.org/stable/41714783. Accessed 6 Jul. 2022.
