ID:
1015365
Start date:
11 November 2025
Project status:
Current
Principal investigator:
Dr Guilhem Faure
Lead institution:
Broad Institute, United States of America

We aim to uncover novel relationships between genomic variation in intragenic regions and human biomarkers (phenotypes and disease states), and to use these relationships as a proxy for discovering functional non-coding RNAs (ncRNAs) and short peptides. We will do this by leveraging large-scale genomic embeddings and machine learning.
Key research questions:
1. Can embeddings derived from large language models (LLMs) capture signatures of specific human phenotypes, disease states, and population-specific variants?
2. Which variants are located in regions predicted to encode ncRNAs or small peptides, and which of them are most strongly linked to physiological or pathological changes?
3. Can we predict the roles and molecular mechanisms of these molecules, and connect them to specific physiological or pathological states?
Objectives:
· Train and evaluate multiple embedding models on genomic sequences to test whether signals align with UK Biobank disease phenotypes and population biomarkers.
· Predict ncRNAs and peptides that are associated with specific conditions and that overlap the genomic locations of variants.
· Use agentic AI systems to reason over sequence-condition relationships and generate testable hypotheses.
· Functionally annotate candidate molecules and predict mechanisms using integrative molecular modeling.
· Prioritize molecules predicted to regulate physiological states for downstream experimental validation (in vitro and in vivo in mice).
Rationale:
Most human variation lies in predicted non-coding regions, yet the contribution of these regions to health and disease remains poorly understood. By combining UK Biobank’s unparalleled genotype-phenotype dataset with state-of-the-art LLM embeddings, we can systematically detect hidden molecular signals, accelerate discovery of novel regulatory molecules, and open new avenues for therapeutic development.

Keywords: non-coding RNA, bioactive peptides, genomic embeddings, large language models, human phenotypes, disease mechanisms