Machine learning prediction of common disease risks and disease network construction using multi-source data
Principal Investigator:
Dr Lu Zhang
Approved Research ID:
60434
Approval date:
May 4th 2020
Lay summary
Quantifying an individual's risk for common diseases is an important goal of precision health. Genetic variants contribute significantly to disease prediagnosis and prevention, but they have not previously received enough attention. The polygenic risk score (PRS), which aggregates candidate disease alleles, has recently emerged as a standard approach to identifying high-risk individuals, but it has two major limitations. First, PRS is a linear model and does not account for nonlinear genetic interactions. Second, PRS does not model lifestyle and environmental (L&E) factors, patient medical history, lab tests or clinical images, which often play significant roles in disease risk prediction. In addition, the target diseases or populations may not have sufficient sample sizes for model training. Together, these issues make it a substantial challenge to identify high-risk incident cases in the general population by integrating multi-source data.
In this project, we plan to design a computational framework for common disease prediction and disease network construction that integrates multi-source data. To integrate genetic and L&E factors, we will design a deep learning model that captures the nonlinear relationships among predictors and between predictors and disease outcome. Dimensionality reduction methods will be used to represent the huge number of genetic variants in a low-dimensional space, rather than pre-selecting variants by hard thresholds. We will further extend the multi-dimensional Hawkes model to integrate disease risk scores, genetic association-based disease similarities, medical history, lab tests and clinical images, in order to predict which individuals are at high risk of disease, their time of diagnosis, and the disease similarity network. To address the poor predictive power caused by insufficient sample sizes in target populations or diseases, we will design a transfer learning method that fine-tunes the weights of a deep learning model trained on source populations or diseases with sufficiently large sample sizes.
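To illustrate why the summary describes PRS as a linear model, the following minimal Python sketch computes a score as a weighted sum of allele counts. It is an illustration under stated assumptions, not the project's method: the function name, the allele dosages and the effect sizes are all hypothetical, and a real PRS would take its weights from a genome-wide association study.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Return the weighted sum of allele dosages for one individual.

    dosages: risk-allele counts per variant (0, 1 or 2).
    effect_sizes: per-variant weights (hypothetical values here).
    """
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Hypothetical effect sizes for three variants (not real estimates).
effect_sizes = [0.10, -0.05, 0.20]

# Hypothetical allele dosages for two individuals.
individuals = [
    [0, 1, 2],
    [2, 0, 1],
]

scores = [polygenic_risk_score(d, effect_sizes) for d in individuals]
for i, score in enumerate(scores):
    print(f"individual {i}: PRS = {score:.2f}")
```

Because each variant contributes independently to the sum, carrying two risk variants together always adds exactly their separate contributions. An interaction in which the combined effect is larger or smaller than that sum cannot be expressed in this form, which is the limitation the nonlinear model is intended to address.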
This project could significantly improve the ability to identify individuals at high risk of disease in the general population ahead of clinical diagnosis. Our model could provide a prediagnosis within a given time period, so that doctors can offer professional advice on lifestyle or pre-treatment in the early stages of disease. The project is planned for 3 years.