Computational strategies for translational pan-genomics

Last updated:: 2 July 2025

ID:: 304181
Start date:: 28 May 2025
Project status:: Current
Principal investigator:: Professor Alexander Schönhuth
Lead institution:: Bielefeld University, Germany

The volume of available genome data has now reached a critical mass
that makes it amenable to advanced artificial intelligence (AI)
approaches. Such analyses can pinpoint crucial individual genetic
variations, supporting clinical diagnoses and helping to determine
appropriate treatment protocols with a level of accuracy unprecedented
in the history of medicine.

However, effectively harnessing this mass of data requires advanced
analysis techniques. Here, we propose organizing the data using
methods from “computational pan-genomics,” an area of research focused
on the efficient and compact arrangement of genomes from entire
populations. We further aim to exploit the resulting organized
(evolutionarily and genetically coherent) data within advanced AI
frameworks.

In pursuit of these objectives, we encounter two specific challenges.

First, we aim to leverage the fundamental knowledge gleaned from
large, disease-unspecific collections of genomes for the targeted
analysis of rarer diseases. This strategy involves two steps. In the
first step, commonly referred to as “pre-training,” one learns
everything possible about (evolutionarily related) genomes in general.
In the second step, commonly known as “fine-tuning,” the focus shifts
to the specific rare disease of interest. Despite the scarcity of data
for rarer diseases,
various recent protocols demonstrate the success of this step-wise
strategy. Pursuing such strategies will help mitigate biases that
hinder the study of rare diseases, as more frequent diseases often
receive disproportionate attention, potentially overshadowing research
on rarer conditions.

Our second objective is to devise knowledge exploitation strategies
that safeguard the privacy of individuals contributing their genome
records for general usage. Publishing successful privacy-preserving
strategies will encourage individuals to make their genome data
accessible to researchers.

Both goals rely on the latest advances in computational pan-genomics
and machine learning. The integration of these two domains is commonly
referred to as “translational pan-genomics.” We are confident that the
synergy between massive data, pan-genomic data organization
techniques, and modern AI frameworks such as large language models
holds breakthrough potential in the near to mid-term future.