Disease areas:
  • cancer and other tissue growths
  • nutrition and metabolism
Author(s):
Ziqi Yang, Ziyang Song, Shadi Zabad, Marc-André Legault, Yue Li
Publish date:
7 January 2026
Journal:
Briefings in Bioinformatics
PubMed ID:
41627341

Abstract

Phenome-wide association studies rely on disease definitions derived from diagnostic codes, often failing to leverage the full richness of electronic health records (EHR). We present MixEHR-SAGE, a PheCode-guided multi-modal topic model that integrates diagnoses, procedures, and medications to enhance phenotyping from large-scale EHRs. By combining expert-informed priors with probabilistic inference, MixEHR-SAGE identifies over 1000 interpretable phenotype topics from UK Biobank data. Applied to 350 000 individuals with high-quality genetic data, MixEHR-SAGE-derived risk scores accurately predict incident type 2 diabetes (T2D) and leukemia diagnoses. Subsequent genome-wide association studies using these continuous risk scores uncovered novel disease-associated loci, including PPP1R15A for T2D and JMJD6/SRSF2 for leukemia, that were missed by traditional binary case definitions. These results highlight the potential of probabilistic phenotyping from multi-modal EHRs to improve genetic discovery. The MixEHR-SAGE software is publicly available at: https://github.com/li-lab-mcgill/MixEHR-SAGE.

Institution:
McGill University, Canada
