Approved Research

Application of a scalable perturbation algorithm for variable selection in UK Biobank data

Columbia University, New York

Lay summary

We aim to (1) identify genetic loci that are associated with quantitative health-related measurements (e.g. BMI, cardiovascular traits, dementia score), and (2) identify the relationship between physical activity and health-related outcomes (e.g. type 2 diabetes, cancer), using UK Biobank data and our proposed algorithm for large datasets. Large datasets such as UK Biobank are ideal for analyzing these associations; however, conventional statistical methods lack the scalability to process such large samples, because the full sample may well exceed the physical memory of an ordinary computer. We propose a novel algorithm based on subsampling that yields a robust and computationally efficient estimator for variable selection and statistical inference in the analysis of big data. The new method will improve our ability to process the full cohort with limited computing resources, and the analyses can provide more precise results and a better understanding of the associations between genes and health status and between human activities and health status, as well as their underlying biological processes. Our project will last for about 3 years.