Author(s):
Ana Torralbo, Jonathan M. Davitte, Damien C. Croteau-Chonka, Cai Ytsma, Chris Tomlinson, Natalie K. Fitzpatrick, Sheng-Chia Chung, Ghazaleh Fatemifar, Adrian S. Cortes, Tom G. Richardson, Matthew Barclay, Julia Carrasco-Zanini, Chris Finan, Harry Hemingway, Aroon D. Hingorani, Valerie Kuan, Claudia Langenberg, Georgios Lyratzopoulos, R. Thomas Lumbers, Maik Pietzner, Anoop D. Shah, Johan H. Thygesen, Natalie Zelenka, John C. Whittaker, Margaret G. Ehm, Spiros Denaxas
Publish date:
9 July 2025
Journal:
Scientific Reports
PubMed ID:
40634319

Abstract

Accurate and reproducible phenotyping is essential for large-scale biomedical research. However, developing robust phenotype definitions in biobanks is challenging due to diverse data sources and varying medical ontologies. As a result, the current phenotyping landscape is fragmented. We developed a computational framework to harmonize electronic health record (EHR) data, participant questionnaires, and clinical registry information, defining 313 disease phenotypes among 502,356 UK Biobank (UKB) participants. Our method integrated four medical ontologies (Read v2, CTV3, ICD-10, OPCS-4) across seven data sources, including primary care, hospital admissions, cancer and death registries, and self-reported data on diseases, procedures, and medication. Phenotypes underwent multi-layered validation, assessing data source concordance, age-sex incidence and prevalence patterns, external comparison to a representative UK EHR dataset, modifiable risk factor associations, and genetic correlations with external genome-wide association studies (GWAS). Results indicated consistent disease distributions by age and sex, high correlation with non-selected general population data prevalence estimates, confirmed risk factor associations, and significant genetic correlations with external GWAS for nine of ten evaluated diseases. Our approach establishes comprehensive disease validation profiles, improving phenotype generalizability despite inherent UKB demographic biases. The modular, reproducible framework can be extended to additional diseases and populations, supporting federated analyses across diverse biobanks, and facilitating research in underrepresented populations.

Related projects

Overall success rates for bringing novel medicines to patients are low. Reasons for failure in drug discovery and clinical development are many and complex, including…

Institution:
GlaxoSmithKline, United States of America

Our understanding of human disease and the different factors which influence our health changes all the time, but the manner in which we define…

Institution:
University College London, Great Britain
