Approved Research

Targeted Learning for causal and associational inference in population genomics

University of Edinburgh

Lay summary

The UK Biobank's genomic data, together with information about individuals' health, preferences and lifestyle, allow researchers to find DNA patterns and exposures in humans that are linked to, say, a particular disease. One of the strengths of the UK Biobank is its size, as it contains data from more than half a million individuals. It is commonly understood that more data leads to more precision, allowing researchers to maximise the insights they extract from the data. However, if the mathematical, statistical, and machine learning techniques employed do not accurately reflect the data's complexity, they can lead to invalid conclusions. This is the Curse of Big Data: as cohort size increases, the uncertainty (variance) in estimates shrinks, thereby laying bare errors (bias) in the proposed mathematical model. For instance, the researcher may have selected a statistical model that is wrong because it omits relevant variables, resulting in an oversimplification of a far more complex truth. Our collective success in creating rich databases of genomic and health data therefore requires us to employ equally cutting-edge mathematical theory in order to obtain valid and realistic estimates.
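
The Curse of Big Data described above can be illustrated with a small simulation. The sketch below is not part of the project's methods and does not use Targeted Learning; it only shows, under invented assumptions, how a model that omits a relevant variable gives an increasingly confident but still wrong answer as the cohort grows. The function name `naive_slope`, the variable names, the effect sizes (a true effect of 1.0) and the sample sizes are all hypothetical choices made for illustration. With these particular settings the naive estimate settles near 1.5 rather than 1.0, while its standard error keeps shrinking.

```python
import numpy as np


def naive_slope(n, rng, beta=1.0, gamma=1.0):
    """Regress y on x alone, omitting the confounder u.

    Returns the estimated slope and its standard error.
    The true effect of x on y is `beta` (an assumed value).
    """
    u = rng.normal(size=n)               # omitted variable (hypothetical confounder)
    x = u + rng.normal(size=n)           # exposure, partly driven by u
    y = beta * x + gamma * u + rng.normal(size=n)  # outcome depends on x and u

    # Ordinary least squares for a single predictor, on centred data.
    xc = x - x.mean()
    yc = y - y.mean()
    slope = (xc @ yc) / (xc @ xc)
    resid = yc - slope * xc
    se = np.sqrt((resid @ resid) / (n - 2) / (xc @ xc))
    return slope, se


rng = np.random.default_rng(0)
true_effect = 1.0
results = {}
for n in (100, 10_000, 500_000):
    slope, se = naive_slope(n, rng, beta=true_effect)
    lower, upper = slope - 1.96 * se, slope + 1.96 * se
    results[n] = (slope, se, lower, upper)
    covers = lower <= true_effect <= upper
    print(f"n={n:>7}: slope={slope:.3f}, se={se:.4f}, "
          f"95% interval=({lower:.3f}, {upper:.3f}), covers truth: {covers}")
```

In this toy setting the interval around the estimate narrows as the sample size increases, but it narrows around the biased value, so at the largest sample size it no longer contains the true effect.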

In this project we aim to find genetic changes, molecular traits and environmental exposures that affect people's risk of disease. To do so, we will develop and apply state-of-the-art mathematical estimation techniques, called Targeted Learning, that come with mathematical guarantees and provide realistic answers to these complex questions. With these techniques, we will be ideally positioned to take advantage of the UK Biobank's large-scale genetic data and information about individuals' health, without suffering from the Curse of Big Data.

This study could directly help find causes of diseases and traits, as well as determine the (causal) effect of various molecular exposures on traits. We will make our findings publicly accessible in a searchable web interface, both for researchers and the general public, together with detailed explanations and examples of mathematical and biomedical notions. Furthermore, once the mathematical and computational tools are in place, we expect to apply them to a vast number of future applications, including other large-scale biobanks such as the Million Veteran Program and All of Us datasets currently being generated in the US. The project work is expected to take around two years for four or five full-time PhD students and four cross-disciplinary PIs (one each from genetics, cancer, machine learning, and mathematics).