Validation of AI-based phenotypes and supervised predictive analyses in major chronic conditions and multi-morbidities

Last updated:: 2 July 2025

ID:: 79216
Start date:: 21 June 2022
Project status:: Current
Principal investigator:: Dr Laura Pasea
Lead institution:: University College London, Great Britain

Since the COVID-19 pandemic, the volume of health data accessible to the research community has increased. The UK government has provided anonymised national data for research in a secure Trusted Research Environment (TRE). While these datasets provide an unparalleled research opportunity, they appear to present several challenges in use. During a patient visit, the healthcare provider records the disease or treatment by selecting from a list of codes. One complicating issue is that different healthcare systems do not share a common standard for defining these codes. When data from various systems are combined, finding all patients with a particular condition becomes difficult. One approach is to use predefined rules, called phenotypes, to specify diseases and medications. However, the research community has not reached a consensus on phenotype definitions.
We aim to study the most common chronic diseases in the UK (e.g. heart conditions, kidney disease, diabetes, lung disease) and long COVID in the context described above. Our methods are based on machine learning and statistical analysis. We aim to use UK Biobank to validate our studies conducted in TREs and on Clinical Practice Research Datalink (CPRD) data. UK Biobank contains complete test results, questionnaire data, primary care data, hospital episode data and, most importantly, genomic data. We will use UK Biobank data to:
1. Assess the associations between test results, existing chronic diseases, prescribed medicines, and potential risk factors in people with common chronic diseases.
2. Validate, in UK Biobank, phenotypes discovered using machine learning.
3. Find any similarities between phenotypes and patterns in genomic data. This would be invaluable input for precision medicine research.
4. Analyse how particular chronic diseases, treatment plans, or quality of life progress over time.
We aim to use the whole UK Biobank cohort for 36 months. Our analysis programs will be provided to UK Biobank as open-source code. Additionally, all validated phenotypes, generated data, and publications will be shared openly with UK Biobank.
The most important contributions of this study would be:
a) Improving the quality of phenotypes as validated in UK Biobank.
b) A better understanding of the most common chronic diseases in the UK.
c) Providing input for public health policies and clinical guidelines regarding the phenotyping of chronic diseases and long COVID.
d) Providing higher-quality evidence for precision medicine, validated across TREs and UK Biobank.