Rationale: Predicated on the common disease-common variant model, genome-wide association studies (GWAS) aim to associate common genetic variants with the risk of complex disease. Characterizing the role of GWAS variants in disease etiology is complicated by their small individual effect sizes and by the fact that the majority of single nucleotide polymorphisms (SNPs) associated with complex disease lie in non-coding regions. Although these non-coding variants often regulate other loci, traditional statistical approaches require large sample sizes to identify interactions between loci, also known as non-additive genetic effects. Non-additive genetic effects are estimated to account for 11-36% of the heritability of human traits, yet GWAS and polygenic risk scores (PRS) capture only additive genetic effects, in which each risk allele contributes linearly to disease risk. Even for traits influenced by pathogenic variants of large effect size, such as risk of recurrence in cancer, the predictive power of PRS is limited by their inability to account for interactions between loci. Finally, environmental factors also play an important role in individual disease risk. Thus, new approaches are needed to quantify the impact of genetic variants on human disease risk.
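To make the distinction between additive and non-additive effects concrete, the following minimal sketch contrasts a purely additive score with one that includes a single pairwise interaction term. It is an illustration only, not a method from this proposal: the function names, the toy dosages, the per-SNP weights, and the interaction coefficient `gamma` are all hypothetical.

```python
import numpy as np

def additive_prs(dosages, weights):
    """Additive score: weighted sum of risk-allele dosages (0, 1, or 2) per individual."""
    return dosages @ weights

def prs_with_pairwise_interaction(dosages, weights, i, j, gamma):
    """Additive score plus one hypothetical interaction term between loci i and j."""
    return additive_prs(dosages, weights) + gamma * dosages[:, i] * dosages[:, j]

# Toy data: 3 individuals x 3 SNPs. All values are made up for illustration.
dosages = np.array([[0, 1, 2],
                    [1, 1, 0],
                    [2, 0, 1]], dtype=float)
weights = np.array([0.10, 0.20, 0.05])  # hypothetical per-SNP effect sizes

additive = additive_prs(dosages, weights)
with_interaction = prs_with_pairwise_interaction(dosages, weights, i=0, j=1, gamma=0.5)

print(additive)          # individuals 0 and 1 receive the same additive score
print(with_interaction)  # the interaction term separates them
```

In this toy example the first two individuals are indistinguishable under the additive score, and only the interaction term between the first two loci separates them, which is the kind of signal a purely additive PRS cannot represent.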
Objectives: With the advent of foundation models, it may be possible to detect long-range interactions between loci and decipher the role of non-additive genetic effects in human disease. Our primary research objective is to develop a foundation model trained on genomic sequence data from a large population in order to learn representations of human genetic variation.
Research Questions: We aim to establish whether unsupervised pretraining on a large population genomics dataset improves performance on downstream tasks such as calculating PRS, identifying endotypes of complex diseases, and predicting biomarkers of drug response.