Last updated:
ID:
129198
Start date:
6 December 2023
Project status:
Current
Principal investigator:
Professor Christian Conrad
Lead institution:
Charité - Universitätsmedizin Berlin, Germany

Recent developments in artificial intelligence make it possible to encode large and varied collections of data points, to interrogate distinct parameters in life science and patient care, and to identify which measurements are needed to predict therapy outcome. At Charité - Universitätsmedizin Berlin, novel data collections are being generated in imaging, spatial genetics and sequencing. In our oncology departments, the collections of whole-body CT scans and MRI images are not large enough on their own to build accurate large AI models.

In long-term infectious disease research, for example on long COVID or multi-organ fibrosis, new technologies are being established, including spatial sequencing and long-read sequencing. Spatial sequencing visualises expressed gene sequences directly in the histopathological tissue, labelling each cell with a gene 'barcode'. Long-read sequencing reads the whole gene, whereas current short-read sequencing identifies only the first part; understanding variation across the whole gene is essential to bring genetics and disease into context. Another use case is cardiovascular patients with differing symptoms and events, where diagnoses remain unclear or effects go undetected.

All of these data sets are highly specific and, individually, too small to build large generative models like ChatGPT. By linking our cohorts to the large UK Biobank data sets across the different disease categories, our AI models can be expanded and improved. Notably, unlike large language models, our AI models would integrate different data 'modalities': sequences, images, blood measurements and metabolites. This 'embedding' of different data sets will also make it possible to generate missing data points, or to advise which data set is most crucial for predictions. Most importantly, it enables self-supervised training, which does not require any manual labelling of the data sets.
Instead, the data sets are large enough that randomly picked data points can be labelled automatically. As with ChatGPT-style models, the main manual work lies in the pre-training and fine-tuning of the data. We envision merging and curating our different data sets with the UK Biobank in this extended life science data space, so that clinical 'questions' receive correct 'answers', without building a real chat assistant.
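The self-supervised idea described above can be illustrated with a minimal sketch: randomly mask a fraction of the entries in a data matrix and train a model to reconstruct the masked values from the visible ones, so the data provides its own labels. The synthetic "patient" matrix, the 25% masking rate and the simple linear model below are all illustrative assumptions, not the project's actual data or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multimodal feature matrix: each row is a synthetic "patient", each
# column a hypothetical measurement (e.g. a blood value or image feature).
# Purely illustrative; no real Charité or UK Biobank data is involved.
n_patients, n_features = 500, 8
X = rng.normal(size=(n_patients, n_features))
# Make the last four columns depend on the first four, so that masked
# entries are predictable from the rest (as correlated modalities would be).
X[:, 4:] = X[:, :4] @ rng.normal(size=(4, 4)) * 0.5 \
    + 0.1 * rng.normal(size=(n_patients, 4))

# Self-supervised objective: randomly hide 25% of the entries. The hidden
# values themselves are the training targets, so no manual labelling occurs.
mask = rng.random(X.shape) < 0.25
X_in = np.where(mask, 0.0, X)  # masked entries zeroed out in the input

# Fit a linear reconstruction map W by ridge-regularised least squares.
W = np.linalg.solve(X_in.T @ X_in + 1e-2 * np.eye(n_features), X_in.T @ X)

# Evaluate reconstruction quality only on the entries the model never saw.
X_hat = X_in @ W
mse = float(np.mean((X_hat[mask] - X[mask]) ** 2))
print(f"reconstruction MSE on masked entries: {mse:.3f}")
```

A large model would replace the linear map with a deep network and the toy matrix with real multimodal cohorts, but the training signal is the same: reconstruct randomly hidden data points from the data itself. The same reconstruction step also shows how missing data points could be generated for patients with incomplete records.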