Continuing on from the start of my research last week, the intrigue of Valid Post Selection Inference lies in how we can draw conclusions and work with data "post selection" (i.e. after we have chosen the relevant variables to examine). However, a critical part of this process, and one which is pivotal to understand, is how we actually select these variables from the potentially millions of pieces of data we have been given.
The usual way to proceed with making statistical statements is to consider the Least Squares Estimate. This is a very useful tool, and essentially allows us to create an estimate for our given model in which the discrepancies between the model and the observed data have been minimised. The Least Squares Estimate has low bias (so it will not be systematically skewed to favour one direction over another); however, its variance is often very high, meaning the estimates we get can be very spread out and hence not very accurate. Another downside is that if we start with millions of covariates, we end up with the same number (so millions) of estimated coefficients. This is not very practical, as it is impossible to examine such a large number of estimates by eye. Hence it is useful to consider a smaller subset of estimates, as this is far easier to interpret. The only problem is that we need this subset to fully represent the information contained in the data we were originally given. To this end, we need a method which will reduce the number of estimates we make whilst also maintaining the key information about our model. There are two such methods which I have chosen to focus on: ridge regression and the lasso.
The differences between the two methods are best shown by considering the equations which define the estimates we wish to calculate. The first thing we do in both methods is consider the linear regression model (Hastie, Tibshirani, Friedman 2008).
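In matrix form, this model is usually written as

$$
Y = X\beta + \varepsilon,
$$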
where Y is the response variable vector (the 'result' of the experiment, for example the price of a stock or the consequence of a gene mutation), X is the matrix of covariates (the data from the experiment), β_i (i = 1, ..., p) are the coefficients of the problem, and ε is the error term, which is usually assumed to have a normal distribution. Within this model, the vector of coefficients β is the only unknown quantity, and this is what we wish to estimate.
First we look at ridge regression. In this case, the estimate for β is defined as the solution to a minimisation problem (Hastie, Tibshirani, Friedman 2008).
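In the notation above, this can be written (in matrix form, equivalent to the component-wise version given in the reference) as

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \left\{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\},
$$

where λ ≥ 0 is a tuning parameter controlling how strongly the coefficients are shrunk towards zero.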
The key part of the equation above is that it includes an l2-penalty term.
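Written out explicitly, this penalty is

$$
\lambda \|\beta\|_2^2 = \lambda \sum_{j=1}^{p} \beta_j^2 .
$$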
This term is important because it is differentiable, which means we can differentiate the ridge criterion above with respect to β and solve for an explicit solution (Hastie, Tibshirani, Friedman 2008).
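Setting the derivative to zero gives the closed-form estimate (with I the p × p identity matrix)

$$
\hat{\beta}^{\text{ridge}} = (X^{T}X + \lambda I)^{-1} X^{T} Y,
$$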
which we can use to estimate β. This is useful, as it means we have an explicit formula with which to carry out ridge regression.
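To make this concrete, here is a minimal sketch of the closed-form computation in Python. It is purely illustrative (not taken from the reference): the data are randomly generated placeholders and the value of λ is an arbitrary choice.

```python
import numpy as np

# Hypothetical example: n = 50 observations, p = 10 covariates,
# of which only the first two have non-zero true coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.array([3.0, -2.0] + [0.0] * 8)
Y = X @ beta_true + rng.normal(scale=0.5, size=50)

lam = 1.0  # tuning parameter lambda (arbitrary choice for illustration)

# Closed-form ridge solution: (X^T X + lambda * I)^{-1} X^T Y
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

print(np.round(beta_ridge, 3))  # coefficients are shrunk, but none is exactly zero
```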
The lasso method is very similar, but there is one key difference, which is best seen in the lasso equation below (Hastie, Tibshirani, Friedman 2008).
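In the same notation, the lasso estimate can be written as

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}.
$$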
Here, we see in the last term that, instead of the l2-penalty used in ridge regression, the lasso has an l1-penalty.
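Written out explicitly, this penalty is

$$
\lambda \|\beta\|_1 = \lambda \sum_{j=1}^{p} |\beta_j| .
$$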
Unlike the l2-penalty, the l1-penalty is not differentiable at zero, so the lasso does not have an explicit formula of the kind we found for ridge regression, and its solution has to be computed numerically. Despite this, the lasso is the more useful construct for our purposes: it tends to set some of the coefficients exactly to zero, so it carries out the variable selection we were looking for, whereas ridge regression only shrinks the coefficients towards zero. An interesting way to see why this happens is by considering the diagram in Hastie, Tibshirani, Friedman (2008), which compares the diamond-shaped constraint region of the lasso with the circular constraint region of ridge regression: the lasso solution often lands on a corner of the diamond, where some coefficients are exactly zero.
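To illustrate the contrast numerically, here is a short sketch comparing the two fits on the same simulated data. It is again illustrative only: it assumes scikit-learn is available, the data are randomly generated placeholders, and the penalty strengths are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Hypothetical data: 50 observations, 10 covariates, only two truly relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.array([3.0, -2.0] + [0.0] * 8)
Y = X @ beta_true + rng.normal(scale=0.5, size=50)

ridge = Ridge(alpha=1.0).fit(X, Y)   # alpha plays the role of lambda
lasso = Lasso(alpha=0.1).fit(X, Y)

print("ridge:", np.round(ridge.coef_, 3))  # all coefficients shrunk, none exactly zero
print("lasso:", np.round(lasso.coef_, 3))  # many irrelevant coefficients set exactly to zero
```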
References:
Hastie, T., Tibshirani, R., Friedman, J. (2008): The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, Springer.
Tibshirani, R. (1996): Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, Series B, 58, 267-288.