Approved Research

Novel Statistical Methods for Disease Prediction, Causal Inference and Discovery in Genetic and Genomic Studies

University of Toronto

Lay summary

Aims and Scientific Rationale. The overall aim of our study is to use the UK Biobank data resource to develop powerful methods that can identify genetic variants and other (biological, lifestyle, environmental and mental) risk factors for a number of complex diseases, including asthma, COVID-19, cancer, dementia, depression, diabetes, heart attack, high blood pressure, mental disorders, stroke, and Parkinson's disease.

Genetic studies of complex diseases such as COVID-19 and cancer have a direct impact on public health, because these studies advance our understanding of the effects of genetic variation and its interaction with environmental factors. The success of these genetic studies, however, depends on the availability of good statistical methods that can reliably extract meaningful and actionable knowledge from data.

Leveraging our statistical expertise, we propose to develop powerful analytical methods to advance different genetic studies, ranging from discovery and treatment to prevention. Our research mission aligns directly with one of the UK Biobank missions: the development of reliable assessments of the different causes of disease.

The proposed research covers two main themes: 1) improve the methods available to scientists for studying the involvement of genetic factors in complex diseases, by developing more powerful methods that include previously overlooked genetic variants, and 2) develop statistical methods to identify risk factors for diseases such as COVID-19, and further understand how risk factors affect various diseases. The results are likely to help doctors develop effective individualized treatment rules according to patient characteristics.

Project Duration. The initial duration of the project is 3 years, with possible extensions depending on the progress of the research projects. We will follow the UK Biobank data usage agreement, report relevant publications and findings, and return all required items to UK Biobank.

Public Health Impact. The proposed research improves public health by providing more powerful analytical toolsets to researchers who need them to study complex and heritable traits. These studies include disease prediction, treatment and prevention. We will also publish our findings in scientific journals, release open-source software packages, and present the work at international conferences, universities and research hospitals to broadly disseminate and translate the knowledge gained through the use of the UK Biobank data resource.

Scope extension:

We aim to develop new statistical methodologies for genetic studies of complex diseases to fill some of the analytical gaps in the existing literature. Research questions and aims include: i) use polygenic risk scores (PRS) to identify risk factors, and determine the effects of these risk factors on the diagnosis, severity and/or death rate of asthma, COVID-19, cancer, dementia, depression, diabetes, heart attack, high blood pressure, mental disorders, stroke, and Parkinson's disease; ii) develop effective individualized treatment recommendation rules for the diseases listed in i); iii) further improve existing PRS methodologies by considering (indirect) interaction effects, rare variants and X-chromosomal variants; iv) study genetic correlation via copula methods; v) investigate the behavior of p-values under both the null and alternative hypotheses to enhance multiple hypothesis testing methodologies; vi) identify genetic confounders among millions of genetic covariates to determine the effects of risk factors for the diseases listed in i); vii) develop methods for robust PRS tools in Mendelian randomization studies; viii) use PRS for genetic imaging mediation analysis, with the goal of early diagnosis and enhanced prediction of Parkinson's disease, Alzheimer's disease and dementia.
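
For readers unfamiliar with the term, a polygenic risk score in its simplest additive form is a weighted sum of an individual's risk-allele counts, with per-variant weights usually taken from an earlier association study. The following is a minimal Python sketch of that basic definition only, not of the improved methods proposed here; the function name, the toy genotype matrix and the weights are all hypothetical and are not UK Biobank data.

```python
import numpy as np


def polygenic_risk_score(genotypes, weights):
    """Return one additive score per individual.

    genotypes: (n_individuals, n_variants) array of allele counts (0, 1 or 2).
    weights:   (n_variants,) array of per-variant effect sizes.
    Both inputs are hypothetical illustrations, not real study data.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if genotypes.ndim != 2 or weights.ndim != 1:
        raise ValueError("expected a 2-D genotype matrix and a 1-D weight vector")
    if genotypes.shape[1] != weights.shape[0]:
        raise ValueError("number of variants and number of weights differ")
    # Score for individual i is sum_j weights[j] * genotypes[i, j].
    return genotypes @ weights


# Toy example: three individuals, three variants (made-up values).
genotypes = np.array([
    [0, 1, 2],
    [1, 1, 0],
    [2, 0, 1],
])
weights = np.array([0.10, -0.05, 0.20])

scores = polygenic_risk_score(genotypes, weights)
print(scores)
```

Aims such as iii) would extend this basic sum, for example with interaction terms or X-chromosomal variants, which the sketch deliberately leaves out.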

We also aim to use the UKB data to study synthetic surrogate analysis for genome-wide association studies of a partially observed phenotype. Within large population biobanks such as UKB, certain target phenotypes are incompletely measured or partially missing, for example because they require specialized imaging modalities. Machine learning models may allow accurate prediction of the missing target phenotype from observed phenotypes. However, simply imputing missing values of the target phenotype can invalidate subsequent inference. In this project, we propose a suite of methods for jointly analyzing the partially missing target phenotype and its predicted value within a bivariate outcome framework, with the predicted phenotype serving as a synthetic surrogate for the potentially unobserved target phenotype. We will develop computationally efficient estimation and inference procedures to perform genome-wide association studies of partially missing phenotypes from the UK Biobank.
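
One way to write down such a bivariate outcome framework is a bivariate normal regression. This is an illustrative assumption, not necessarily the model the project will adopt. Here $Y_i$ is the target phenotype (observed for only some participants), $\hat{Y}_i$ is its predicted value (available for all participants), $G_i$ is the genotype at the variant being tested, and $X_i$ is a vector of covariates; all of these symbols are introduced for illustration.

```latex
\begin{pmatrix} Y_i \\ \hat{Y}_i \end{pmatrix}
=
\begin{pmatrix} \beta_T \\ \beta_S \end{pmatrix} G_i
+
\begin{pmatrix} \alpha_T^{\top} \\ \alpha_S^{\top} \end{pmatrix} X_i
+
\begin{pmatrix} \varepsilon_{T,i} \\ \varepsilon_{S,i} \end{pmatrix},
\qquad
\begin{pmatrix} \varepsilon_{T,i} \\ \varepsilon_{S,i} \end{pmatrix}
\sim N\!\left(0, \Sigma\right)
```

Under this sketch, the association test targets $\beta_T$, the genetic effect on the target phenotype. Participants whose $Y_i$ is missing still contribute information through $\hat{Y}_i$ and the residual correlation encoded in $\Sigma$, rather than having a predicted value substituted for $Y_i$ and treated as if it had been observed.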