Approved Research

Model-X statistical methods and uncertainty-aware machine learning algorithms for powerful and reliable analyses of GWAS data

University of Southern California

Lay summary

The goal of this project is to develop new data analysis tools that can help scientists discover how human genes influence genetic diseases or other measurable personal traits of medical relevance. In particular, we aim to improve the flexibility of existing analysis tools by accounting for the possibility that genes may have different biological effects in different people. As we try to find out if and how a particular gene affects a trait within a certain group of people, it is important to limit the number of incorrect discoveries, because incorrect discoveries would be confusing and misleading for scientists. Unfortunately, incorrect discoveries are quite likely when working with genetic data, so this project will need to develop sophisticated methods of statistical analysis to make sure most of the reported findings are scientifically valid.

A second goal of this project is to develop new data analysis tools that can help physicians predict which people are at risk of developing a certain disease based on information about their genes. In particular, this project will improve the ability of existing tools for genetic risk prediction to account for uncertainty. Accounting for uncertainty is of vital importance in genetic risk prediction because many diseases are only partly influenced by our genes, not pre-determined by them. Therefore, any genetics-based risk prediction should carefully take into account the fact that the disease or trait of interest may also be affected by many other random variables, such as environmental conditions and lifestyle factors, that are not necessarily measured.

The UK Biobank resource contains one of the largest sets of genetic data, which provides a perfect opportunity to test and apply the analysis tools developed by this project. The expected duration of this project is 3 years. The data analysis tools developed by this project may lead to novel scientific discoveries in genetics, helping scientists develop new risk assessment protocols, personalized therapies, and targeted drugs.