Building AI models to uncover variant-disease-target relationships

Last updated: 24 February 2026

ID: 1075051
Start date: 24 February 2026
Project status: Current
Principal investigator: Professor Marinka Zitnik
Lead institution: Harvard Medical School, United States of America

This project will develop AI models that combine genetics and multi-omics datasets to better understand how genetic changes lead to disease. Specifically, we will (1) train AI models on a multimodal genetics-omics dataset, use them to predict DNA regulatory effects such as changes in gene expression, and evaluate these predictions against UK Biobank data; and (2) train AI models that integrate prior knowledge from biomedical graphs with curated genetic and proteomic datasets to predict causal disease drivers and therapeutic targets, validated with UK Biobank evidence.

Research Questions
* Can large-scale AI models trained on genetic sequences predict how variants affect gene activity and disease, with evidence from UK Biobank?
* How can proteomic and clinical data be combined to distinguish putative causal drivers from downstream effects?
* Can a biomedical AI system link genetic variants, molecular changes, and disease outcomes in an explainable way?

Objectives
* Develop hypothesis-driven AI models incorporating UK Biobank variant data and other modalities (cross-trait, protein-protein, and gene-context interactions).
* Interpret genetics-protein-disease relationships via biological pathways and regulatory networks to identify putative causal mechanisms.
* Leverage multimodal AI to improve power for discovering actionable variants in minority populations and validate findings using UK Biobank data.

Scientific Rationale
We will train AI models on DNA sequences, gene expression, and chromatin data to learn how DNA changes alter gene activity. Using UK Biobank, we will test whether individuals who carry specific variants show the predicted expression, protein, and disease outcomes. Integrating pathway, protein-interaction, and disease databases will help explain the molecular mechanisms through which variants cause disease. We will statistically test the resulting putative causal hypotheses using UK Biobank genetic and health data.
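
To illustrate the validation step described above, the following is a minimal sketch of one way to test whether carriers of a variant differ from non-carriers in a measured outcome such as a protein level. It is an assumption for illustration, not the project's actual method: the function name `carrier_permutation_test`, the choice of a permutation test, and the toy values are all hypothetical, and no UK Biobank data are used. A real analysis would also need to adjust for covariates such as age, sex, and genetic ancestry.

```python
import random
from statistics import mean


def carrier_permutation_test(values, is_carrier, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in mean outcome
    between variant carriers and non-carriers (hypothetical helper).

    Returns (observed_difference, p_value).
    """
    carriers = [v for v, c in zip(values, is_carrier) if c]
    others = [v for v, c in zip(values, is_carrier) if not c]
    if not carriers or not others:
        raise ValueError("need at least one carrier and one non-carrier")

    observed = mean(carriers) - mean(others)
    n_carriers = len(carriers)
    rng = random.Random(seed)
    pooled = list(values)

    # Count how often a random relabelling of carrier status gives a
    # difference at least as large as the observed one.
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_carriers]) - mean(pooled[n_carriers:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value


# Toy, hand-written numbers for illustration only (not UK Biobank data).
protein_level = [1.0, 1.1, 0.9, 1.2, 1.0, 2.0, 2.1, 1.9, 2.2, 2.0]
carries_variant = [False] * 5 + [True] * 5
diff, p = carrier_permutation_test(protein_level, carries_variant)
print(f"mean difference = {diff:.2f}, permutation p-value = {p:.4f}")
```

A permutation test is used here only because it makes few distributional assumptions and needs no external libraries; regression-based association tests would be the more usual choice at biobank scale.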