Approved research

Development and evaluation of machine learning approaches for genetic-based disease prediction in the UK Biobank.

Principal Investigator: Dr James Cook
Approved Research ID: 45340
Approval date: March 12th 2019

Lay summary

The field of genetic epidemiology focuses on the discovery of genetic risk factors contributing to health and disease in human populations. Genetic association studies aim to discover DNA variants associated with complex binary (such as case-control status in disease) and continuous (such as height) traits, in order to better understand the causal mechanisms of these conditions and enable the development of new treatments and preventative interventions. One of the early aims of genetic association studies was to use associated variants to enable personalised medicine on the basis of an individual's genotypes. However, complex traits are typically affected by many underlying genetic variants, each contributing a small effect to the trait, which limits the utility of individual variants in risk prediction. Work has been undertaken to combine the effects of multiple genetic variants to produce a single predictive measure of genetic risk (known as a 'genetic risk score'). However, these methods have limitations, such as only being able to incorporate common genetic variants and using crude methodology for choosing which variants to include in the model. This project will investigate the use of machine learning methods in building risk prediction models using genetic and non-genetic data. These methods are not limited in the same way as traditional risk scores, and have additional advantages such as the ability to learn from experience and incorporate non-genetic information in the model. We will begin by building simple neural networks including a small number of genetic variants and, over the duration of the project, will extend these networks to include large numbers of variants and relevant non-genetic information, such as clinical information and biomarker data.
As the project progresses, we will also apply the neural network approach to sample groups from different ethnicities within UK Biobank, and investigate different methods to incorporate rare genetic variants in the network. This project is expected to last for three years. We expect that the predictive models produced during this project will have greater predictive accuracy than current models, and may be of greater clinical utility in predicting an individual's risk of developing complex diseases. The main public health impact of this research will therefore be in producing predictive models which can be used clinically to enable personalised medicine on a genetic basis.
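
For readers unfamiliar with the 'genetic risk score' mentioned in the summary, the idea can be illustrated with a minimal sketch. It assumes the common additive form, in which each variant's risk-allele count (0, 1 or 2) is multiplied by an effect size and the products are summed. The function name, the effect sizes and the genotypes below are all hypothetical and are not taken from the project or from UK Biobank data.

```python
def genetic_risk_score(genotypes, effect_sizes):
    """Return the weighted sum of risk-allele counts across variants.

    genotypes: risk-allele counts per variant (0, 1 or 2).
    effect_sizes: one weight per variant, e.g. a log odds ratio
                  estimated in an association study (assumed here).
    """
    if len(genotypes) != len(effect_sizes):
        raise ValueError("one effect size is needed per variant")
    return sum(g * w for g, w in zip(genotypes, effect_sizes))


# Hypothetical effect sizes for three variants (illustration only).
effect_sizes = [0.10, 0.05, 0.20]

# Hypothetical risk-allele counts for two individuals.
person_a = [2, 0, 1]
person_b = [0, 1, 0]

score_a = genetic_risk_score(person_a, effect_sizes)
score_b = genetic_risk_score(person_b, effect_sizes)
print(score_a > score_b)  # person_a carries more weighted risk alleles
```

Because every variant enters the score independently and linearly, a score of this kind cannot capture interactions between variants or non-genetic factors, which is the limitation the neural network approach described above is intended to address.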