Statistical methods and scalable cloud-based toolset for integrative analysis of biobank-scale sequencing studies and biobanks for complex diseases

Last updated:: 2 July 2025

ID:: 211447
Start date:: 11 November 2024
Project status:: Current
Principal investigator:: Dr Zilin Li
Lead institution:: Northeast Normal University, China

Our goal is to develop a collection of powerful, scalable and resource-efficient statistical methods for analyzing sequencing data at biobank scale, focusing on functionally informed approaches. We plan to integrate the proposed methods into an all-in-one analysis pipeline and make this software accessible through the UK Biobank Research Analysis Platform (RAP). These tools will be designed to pinpoint genetic factors influencing health and disease, elucidate the biological mechanisms driving different outcomes, and predict disease risk based on genetic and environmental factors.

Many prevalent human diseases, such as cancers and cardiovascular diseases, have a complex genetic basis with multiple risk factors, yet many genetic variants influencing these diseases remain unidentified. Large-scale whole genome sequencing studies present a unique opportunity to investigate the impact of common and rare variants on human diseases and traits, especially in the noncoding genome. Our project’s primary goal is to discover these unknown genetic variants associated with various human traits, aiming to enhance our understanding and treatment of numerous diseases. Using the vast data from sources like the UK Biobank, we have been developing a variety of powerful and scalable statistical association tests based on functionally informed analysis, which integrates genetic data with functional annotations derived from genomic data. We plan to use these methods to identify new risk variants across a range of common diseases, particularly rare variants and variants in the noncoding genome.

Our second goal is to perform gene-environment interaction analyses to understand how these genetic factors interact with environmental exposures in contributing to disease. By integrating genetic information (from sequencing data) and genomic information (from functional annotations) with detailed environmental information supplied by the UK Biobank, we will uncover how these interactions impact disease manifestation and progression, potentially leading to more personalized prevention and treatment strategies. We will initially focus on cardiovascular disease.

The third goal of our work is to develop novel risk prediction models for a variety of outcomes by combining genetic data, genomic data, and environmental information. These models aim to identify individuals at high risk, enabling them to start treatment or prevention programs earlier, potentially improving health outcomes and quality of life.

Our project aims to enhance our understanding of the genetic etiology of complex diseases. This knowledge will assist health professionals in advancing treatment and prevention strategies for various diseases. We expect the work outlined here to take around three years.