Using Machine Learning to uncover disease patterns in large-scale cohort data

Last updated: 2 July 2025

ID: 206855
Start date: 4 February 2025
Project status: Current
Principal investigator: Professor Gerton Lunter
Lead institution: University Medical Center Groningen, Netherlands

The UK Biobank contains a wealth of information about the health and lifestyle of half a million people, as well as their genetic information. These data have been instrumental in identifying genetic markers that influence a person’s susceptibility to a disease, and this information has been helpful in developing novel drugs for those diseases.

Most studies, however, focus on a single disease (such as type 2 diabetes) or a single characteristic (such as blood pressure). Yet an underlying health problem often manifests itself in multiple ways: a person with type 2 diabetes might never have been diagnosed by a doctor, but may, for instance, experience changes in their weight.

Modern machine-learning methods make it possible to learn such patterns directly from the UK Biobank data. This will help to describe the health of participants in a more refined way than a simple yes/no disease diagnosis. That, in turn, will help to find more genetic markers that either predispose to or protect against health problems, which should eventually lead to a better understanding of disease and to the development of more effective drugs.

In this project we extend an earlier, successful effort to identify such patterns. In the previous project we used an existing machine-learning method that could handle only binary (yes/no) variables; even so, it allowed us to identify meaningful patterns and novel genetic markers for disease. In the current project we will extend the method to include questionnaire data as well as physical measurements such as blood pressure. Since many variables in UK Biobank are of these two types, we expect the new method to identify even more patterns across disease, lifestyle and physical measurements, which will allow us to identify additional genetic markers for disease.