Some of the most promising avenues for translating genetic epidemiological findings into new approaches to medical care and new therapies are genetic risk prediction and genetic instrument-based causal inference (Mendelian randomization; MR). However, several challenges have been recognized in the analysis of large biobank data. We will develop new statistical and computational methods to address the following:
(i) Compromised genetic risk score performance when the target population differs from the discovery population, for example when populations of distinct ancestries are involved, or when score development is affected by mismatches in linkage disequilibrium (LD) between the LD reference panel and the discovery population.
(ii) Lack of condition-specific genetic risk scores, as existing scores have been developed based on genome-wide association studies in the general population, largely ignoring specificity with respect to age, sex, and various pre-existing medical conditions.
(iii) Missing data in genetic risk scores when the scores are applied to the target population, resulting from discrepancies between genotyping arrays, imputation panels, or sequencing technologies.
(iv) Lack of statistical methods for de-biasing MR, where biases can arise from non-random selection, assortative mating, indirect genetic effects, artificial stratification, and other sources.
(v) Over-simplified characterization of complex biological mechanisms in MR. These mechanisms likely involve a multi-layered interplay between various traits, and profiling them better requires more flexible machine learning methods.
We will use the UK Biobank resources for method development and application, with the goals of (1) generating more generalizable and accurate genetic risk scores to improve existing clinical risk factor-based predictors, and (2) better identifying novel biomarkers and potential drug targets while elucidating the underlying biological mechanisms through genetics-guided causal inference.