Development of complex disease risk prediction algorithm on obesity, cancers, alopecia areata, and sleep pattern-related genetic factors using microarray genome data

Last updated:: 2 July 2025

ID:: 63372
Start date:: 8 February 2021
Project status:: Current
Principal investigator:: Dr Youngah Shin
Lead institution:: ichrogene, Inc., Korea (South)

– Study aims
Our primary aim is to develop complex disease risk prediction methods using machine learning (ML), that is, computer algorithms that learn from large datasets and improve their predictions as they do so. To this end, the prediction methods will be developed on our direct-to-consumer (DTC) DNA data (>5,000, multi-ethnic) and then tested on two large-scale biobank genome datasets: the Korea Biobank Array (>65,000, East Asian) and the UK Biobank array (European).

– Scientific rationale
Machine learning (ML) and polygenic risk scoring (PRS, an estimate of the combined effect of many genetic variants on an individual’s phenotype) are the primary approaches for developing complex disease risk prediction methods that estimate an individual’s risk of disease.
Both approaches have achieved reasonable success in predicting the risk of type 2 diabetes (T2D) and cardiovascular diseases, and the predictive power of both can be improved consistently by increasing the size of the training datasets.
For the proposed study, we will develop ML-based disease risk prediction methods for obesity, cancers, alopecia areata, and sleep pattern-related genetic factors. The models will be tested on the large-scale biobank datasets (UKB, Korea Biobank Array). Furthermore, the performance of the developed models will be compared with that of a PRS-based prediction model.
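
As an illustration of the PRS baseline mentioned above, the following minimal sketch computes a score as a weighted sum of risk-allele counts and evaluates it with the area under the ROC curve (AUC). The project summary does not name an evaluation metric, so the use of AUC is an assumption, and the effect sizes, genotypes, and case/control labels are invented purely for demonstration.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of risk-allele dosages (0, 1, or 2 per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical per-variant effect sizes (e.g. log odds ratios); not real estimates.
weights = [0.5, 0.25, -0.25]
# Hypothetical genotypes: risk-allele counts per variant for four individuals.
genotypes = [[2, 1, 0], [0, 0, 2], [1, 2, 1], [0, 1, 1]]
# Hypothetical outcomes: 1 = case, 0 = control.
labels = [1, 0, 1, 0]

scores = [polygenic_risk_score(g, weights) for g in genotypes]
prs_auc = auc(scores, labels)
```

The same `auc` function could be applied to the risk estimates of an ML model on the same individuals, which gives one simple way to put the two approaches on a common scale.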

– Project duration: 2021-2023 (36 months)

– Public health impact
ML models coupled with large new population datasets with high-quality phenotype data can help classify individual disease risk with high precision. Such machine learning methods could support cost-effective and proactive healthcare services.