This project aims to develop a foundation model trained on spatial transcriptomics (ST) data to capture both cell-cell and gene-gene interactions in human tissues. The research question is: Can we build a scalable representation learning framework that integrates spatial and transcriptional information to support downstream health-related analyses, such as disease stratification and tissue pathology characterization?
The objective is to pretrain a large-scale model on ST data to learn rich embeddings of cellular and tissue organization. These representations will be evaluated on health-relevant downstream tasks, including cell type classification, spatial domain identification, and reconstruction of tissue architecture. Ultimately, we aim to apply this model to datasets such as those in UK Biobank that include histological images, omics data, and disease phenotypes.
The rationale draws on recent breakthroughs in foundation models in fields such as natural language processing and imaging, where scaling model size and pretraining on large datasets have led to generalizable representations. ST data provides a biologically structured substrate that reflects both spatial and molecular context, which makes it well suited to capturing disease-relevant patterns at the tissue level.
This research is health-related and in the public interest: it will advance methods for understanding tissue-level alterations in diseases such as cancer, neurodegeneration, and inflammatory disorders. The resulting model will be shared openly with the academic community, enabling researchers to improve biomarker discovery, tissue classification, and mechanistic disease studies using UK Biobank resources.
We have read and will comply with the UK Biobank AI policy in all aspects of data use and model development.