Approved Research

Causal genetic inference for risk prediction and discovery

Invitae Corporation

Lay summary

By measuring the genomes of a large number of people and recording whether or not they have a disease, it is possible to identify genetic variations correlated with disease risk. Because an individual's genome is set at birth, such information would enable clinicians to identify at-risk individuals long before the disease is likely to occur. They could then prioritize testing and early interventions for the individuals who are most at risk.

However, correlation does not imply causation. With genetic risk prediction, correlations between genetic variations and disease risk often do not apply outside of the narrow population in which they were measured. For example, if genetic associations were measured for blonde hair in a Scandinavian population, genetic variants that actually caused another prevalent trait in that population, such as blue eyes, would be strongly associated with the blonde hair trait. This could hurt prediction in populations where the correlation between blonde hair and blue eyes is weaker. Empirically, genetic predictions learned from individuals of European descent are less accurate when applied to individuals of non-European descent.

By inferring causal relationships, we avoid the limitations of correlational analyses. Causal relationships hold across different groups of individuals, while spurious correlations do not. Here, causal inference requires a statistical model of how ancestry, environmental factors, and genetic variation interact. Building this model, connecting it to existing knowledge about human biology, and rigorously testing it are the major research aims of this project.

Inferring causal relationships is a more significant theoretical and computational challenge than measuring correlations, so our approach relies on recent breakthroughs in machine learning, statistics, and high-performance computing. Our aim is to test our system head-to-head against existing state-of-the-art genetic prediction algorithms and report whether accuracy and robustness are improved with our approach.

We expect the project to last two years. Our project aims to produce more accurate, robust, and equitable tools for genetic prediction of disease risk. As a basic methodological tool, it will be applicable to research in public health and biomedicine. In the clinic, it could enable doctors to stratify individuals by disease risk and prioritize screening and early interventions for complex diseases such as breast cancer, type 2 diabetes, and heart disease.