Clinically relevant stratification through deep learning analysis of registry data and extensive genetic data, and identification of functionally relevant disease-associated focus genes

Last updated: 2 July 2025

ID: 532555
Start date: 4 February 2025
Project status: Current
Principal investigator: Professor Kasper Lage
Lead institution: Sidera Bio ApS, Denmark

We aim to replicate, validate and build upon findings from Danish data, where we analyse deep phenotype data (clinical data from the Health Data Authority registries) to create clinically relevant subgroups (clusters). After identifying the drivers of these clusters, we will replicate the clustering in UK Biobank (UKBB) data and then analyse whole-genome sequencing (WGS) data within the clusters found.
Deep learning methods based on neural networks can capture non-linear relationships and thereby represent and identify biologically relevant information. We have developed deep learning models that can stratify patients and identify patterns in disease onset, disease progression, treatment effect, comorbidity, and disease burden. In the Danish data, GWAS data are available only for some patients and subgroups, so replicating the clusters in UKBB data will substantially strengthen the project.

Pilot case conditions: Cardiometabolic disease, Schizophrenia.
Both are complex diseases in which the impact of known variants is poorly characterized. Better stratification can contribute to earlier diagnosis and better treatment, both directly, through professional treatment guidelines, and indirectly, through a better understanding of the genetics and causal mechanisms and through the discovery of new drug targets.
The project will use data from sources in three countries: the UK (UKBB), Finland and Denmark.

Methods: We use self-supervised learning, in which neural networks based on variational autoencoders (VAEs) are trained to reconstruct the input data. VAEs filter out redundancy and noise while learning higher-level latent features that capture complex variation among individuals. Cluster analysis of this latent space can then be used to stratify individuals into groups with common traits and to explore associations between features, such as genomics and health history. We also use transformers to capture temporality, including disease progression.
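
As an illustration only, the two ingredients named above (a VAE training objective and cluster analysis of the latent space) can be sketched in plain Python. This is not the project's implementation: the function names are hypothetical, the squared-error reconstruction term assumes a Gaussian decoder, and k-means with farthest-point initialization is an assumed stand-in for whatever clustering method the project actually uses. The encoder and decoder networks themselves are omitted; the sketch starts from the latent means and log-variances an encoder would produce.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps, eps ~ N(0, 1) (the reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Negative ELBO: reconstruction error plus a weighted KL regularizer.

    The squared-error term is an assumption (Gaussian decoder); beta is a
    hypothetical weight, with beta=1.0 giving the standard VAE objective.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + beta * kl_to_standard_normal(mu, logvar)


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest existing centroid.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
    return labels, centroids


# Made-up latent means for six individuals (illustrative values, not real data).
latent_means = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.0],
                [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]]
labels, centroids = kmeans(latent_means, k=2)
z = reparameterize([1.0, -1.0], [-100.0, -100.0], random.Random(0))
```

In a real pipeline, `mu` and `logvar` would come from a trained encoder, and the cluster labels would then be related back to the input features (for example diagnoses or genetic variants) to characterize each subgroup.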