Approved Research

Machine learning-based phenotyping and methods development for the identification, characterization, and validation of disease susceptibility loci from high-throughput sequencing and microarray data.

Vanderbilt University Medical Center

Lay summary

The goal of this project is to study the impact of genetic variation on disease risk. Genomic segments shared due to relatedness represent an untapped resource for mapping disease genes and identifying people likely to carry rare mutations. We have developed some of the most popular and powerful tools for accurate relatedness detection. We propose to build on our tools and others' to create large-scale repositories of shared-segment data from DNA databanks linked to electronic health records (EHRs) and to identify genes that impact disease risk, phenome-wide. In addition, many phenotypes are poorly captured in EHR and survey-based data. Therefore, we propose to apply innovative machine learning approaches that will allow us to better characterize disease risk. Specifically, we propose to use UK Biobank data to 1) train and model co-occurring health information via machine learning; 2) validate known genetic risk factors and identify new ones; 3) improve characterization of causal genetic effects; 4) use methods that leverage relatedness and genetic sharing patterns to find new genes that underlie human health; 5) jointly consider genetic risk factors from across the genome in disease development and progression; and 6) develop and refine new methods and software to identify and characterize disease susceptibility loci. Each aim represents an innovative approach to data utilization in large EHR-linked DNA databanks and creates resources that will fuel future research. Collectively, our aims map a path towards efficient and affordable discovery of novel disease genes using innovative approaches to the analysis of existing data.