STEM

An Investigation Into Valid Post-Selection Inference

This is a proposal for the research I aim to undertake as part of my summer research project in 2022.

Statistics as a subject relies upon the fundamental axioms of mathematics in order to correctly estimate, analyse and draw inferences from data.

In common statistical practice it is often necessary to choose a model with which to analyse data after the data has first been seen. For low-dimensional problems (those with only a small number of variables to estimate, for example a problem with two parameters) this situation is well understood: changes and results can easily be seen, interpreted and analysed, and valid conclusions drawn. However, this general approach breaks down when we scale the problem to higher dimensions, with thousands or millions of parameters to estimate and analyse. With so many possible data values and parameters to monitor, it becomes nearly impossible to spot inconsistencies and interesting changes in the data. For example, when analysing the genetics of 100 hospital patients, each patient carries thousands of genes, and finding specific genes which are mutated or harmful is very difficult without relying on luck.

As a result, I aim to research and develop a statistical method for analysing such problems via a new technique called 'valid post-selection inference' [1]. The way in which I aim to do this is by using a method being developed by Gao, Ahmed and Feng [2], whereby we shrink the data into smaller chunks which we are able to analyse more easily using valid statistical theory. This process is called 'shrinkage', and as yet very little is known about it. The general idea, however, is that any quantities which we predict to be unimportant are shrunk, so that we are left only with the data relevant to our investigation. The current method for conducting this shrinkage uses a branch of statistics known as linear regression [3]. This framework involves the reduction of the complex, multivariate problem into a model which is linear.
I will complement this with a technique called 'the lasso' [4], another method currently under development, with less successful results so far.
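To make the idea of shrinkage concrete, the sketch below is my own illustration (not taken from the cited papers), using the scikit-learn library's `Lasso` estimator on simulated data. In a problem with far more parameters than observations, the lasso shrinks most coefficient estimates exactly to zero, leaving only a small set of selected variables.

```python
# Minimal illustration of lasso shrinkage on a simulated high-dimensional
# problem. Assumes numpy and scikit-learn are available; the data, the
# penalty level alpha=0.5, and all variable names are hypothetical choices
# made purely for illustration.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 1000                # far more parameters than observations
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 3.0                  # only the first 5 variables truly matter
y = X @ beta + rng.standard_normal(n)

model = Lasso(alpha=0.5).fit(X, y)
selected = np.flatnonzero(model.coef_)  # variables with non-zero estimates
print(f"non-zero coefficients: {len(selected)} of {p}")
```

The point of the sketch is only that shrinkage turns an intractable 1000-parameter problem into a small selected submodel; the open question this project addresses is how to perform valid inference *after* that selection has been made.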

My main aim with this project is to use the ideas of shrinkage and linear regression to develop a new method for solving these high-dimensional problems. I aim to further the research being completed by other teams across the world in order to find a solution which is both mathematically correct and applicable to real-world situations. Once this method has been developed, I believe it will be important to develop the common statistical constructs for these problems, including confidence intervals [5] and standard deviations (general statistical quantities which help to describe data and test hypotheses). I propose to present this research in the form of a mathematical report or paper, so that it can be cross-referenced and checked by experts in the field across the world, allowing me to refine my methodology and propositions as I complete the research.

The overall upshot of this project is that it would allow the large swathes of data which currently exist in the world to be analysed correctly. Currently, such data (as held by healthcare providers, banks, pharmaceutical companies and utility providers) is being incorrectly analysed and interpreted using the common statistical practices outlined above. It has been shown that these methods do not remain valid for such large quantities of data, and hence these organisations are making misinformed decisions based on the results that they have. This is a serious problem, as these decisions could have long-lasting and potentially harmful effects on both individuals and the world as a whole. To address this problem, I believe that this project will not only provide solutions for specific case studies, datasets and examples, but will also provide a basis from which other mathematicians, researchers and individuals from other disciplines can develop their own thought and theories too.
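The failure of common statistical practice after model selection can be seen in a short simulation. The example below is my own hypothetical illustration (not drawn from the cited papers): every true mean is zero, but if we first use the data to *select* the most extreme observation and then build an ordinary 95% confidence interval for it, the interval covers the true value far less than 95% of the time.

```python
# Simulation of why naive inference breaks after selection. All true means
# are 0, so a valid 95% interval should cover 0 in ~95% of repetitions.
# Selecting the largest |Z| first destroys that guarantee. The sample
# sizes and seed are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
p, reps = 20, 20000
covered = 0
for _ in range(reps):
    z = rng.standard_normal(p)   # p measurements, every true mean is 0
    j = np.argmax(np.abs(z))     # data-driven selection of one parameter
    # naive 95% interval z[j] +/- 1.96 covers the truth (0) iff |z[j]| <= 1.96
    covered += abs(z[j]) <= 1.96

coverage = covered / reps
print(f"naive coverage after selection: {coverage:.3f}")  # roughly 0.36, not 0.95
```

Analytically the coverage here is about 0.95^20 ≈ 0.36, since the naive interval only succeeds when *all* twenty observations happen to be small. Valid post-selection inference aims to construct intervals whose coverage guarantee survives this kind of data-driven selection.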

References:

[1]. "Valid post-selection inference" - Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao.

[2]. "Post selection shrinkage estimation for high-dimensional data analysis" - Xiaoli Gao, S.E. Ahmed and Yang Feng.

[3]. "Post-selection estimation and testing following aggregate association tests" - Ruth Heller, Amit Meir and Nilanjan Chaterjee.

[4]. "Exact post-selection inference, with application to the lasso" - Jason D. Lee, Dennis L. Sun, Yuekai Sun and Jonathan E. Taylor.

[5]. "Valid confidence intervals for post-model-selection predictors" - François Bachoc, Hannes Leeb and Benedikt M. Pötscher.