Approved Research

Methodology and tools development for modelling variation of complex traits and disease

University of Edinburgh

Lay summary

The project aims to develop computational tools for understanding how the environment and genes influence the risk of disease. Understanding this requires large datasets, but the computational tools to analyse such data are not yet available. We will develop such tools and demonstrate their use to understand how the environment and genome of UK Biobank participants shape their risk of disease. Only by understanding the complex interplay of genetic and environmental risk factors will we be able to develop medicines targeted to the subgroups of individuals most likely to benefit, and to guide public health interventions.

Genome-wide association studies have been instrumental in identifying genes that determine disease risk. However, many genes remain to be identified because scientists have assayed only a small proportion of the genetic variation through genotyping arrays. To overcome this problem, the full DNA of UK Biobank participants will now be examined, that is, sequenced. This raises two scientific challenges: (1) the number of statistical tests that scientists perform will increase massively, and (2) the volume of data will make the current model, in which researchers download the data to their own institutions, impossible (it would take years to download the data for analysis). The model will therefore change: the analysis tools will move to a common informatics platform where researchers can run their analyses. The tool will also address a major issue in research, reproducibility. It will store the metadata of the data used, the sequence in which the data were selected and the statistical model used, so that any other researcher (or the same researcher) can reproduce exactly the same results (participant withdrawals aside).

The tool will also allow sub-setting of the data. Researchers are sometimes interested in stratified analyses (for instance, of post-menopausal females), but extracting the data and keeping track of all the files is cumbersome and prone to error. The tool will streamline and rationalise this type of analysis.
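The reproducibility and sub-setting ideas above can be illustrated with a minimal sketch. It is not the project's tool: the function names, the field names, the release label and the model string are all hypothetical. It assumes that a subset is defined by an ordered list of filters and that the filters, a data-release label and the model formula are stored together with a checksum, so that the same selection can be re-applied later.

```python
import hashlib
import json


def apply_filters(rows, filters):
    """Keep rows matching every (field, value) filter, applied in order."""
    selected = rows
    for field, value in filters:
        selected = [row for row in selected if row.get(field) == value]
    return selected


def build_manifest(data_release, filters, model):
    """Record what is needed to re-run an analysis, plus a checksum of it."""
    record = {
        "data_release": data_release,           # hypothetical release label
        "filters": [list(f) for f in filters],  # selection order is preserved
        "model": model,                         # hypothetical model formula
    }
    canonical = json.dumps(record, sort_keys=True)
    record["checksum"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record


# Toy records with made-up field names, for illustration only.
participants = [
    {"id": 1, "sex": "female", "menopause": "post"},
    {"id": 2, "sex": "female", "menopause": "pre"},
    {"id": 3, "sex": "male", "menopause": None},
]
filters = [("sex", "female"), ("menopause", "post")]

subset = apply_filters(participants, filters)
manifest = build_manifest("example-release", filters, "disease ~ exposure + age")
```

Because the manifest holds the selection rules rather than an extracted copy of the data, re-running `apply_filters` with the stored filters reproduces the subset without keeping track of intermediate files.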

Finally, the tool will support standard epidemiological analyses for studying environmental risk factors, such as logistic regression and Cox regression.
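As an illustration of the kind of analysis meant here, the following is a minimal sketch of logistic regression with an intercept and a single risk factor, fitted by Newton-Raphson. It is a teaching example under stated assumptions, not the project's implementation: the function name and the counts are invented, and a real analysis would use an established statistics package and adjust for covariates.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient (g) and negated Hessian (h) of the log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system h * delta = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Invented counts: 30 of 100 exposed and 10 of 100 unexposed people are cases.
x = [1] * 100 + [0] * 100
y = [1] * 30 + [0] * 70 + [1] * 10 + [0] * 90

b0, b1 = fit_logistic(x, y)
odds_ratio = math.exp(b1)
```

With a single binary exposure, the fitted slope is the log of the odds ratio from the corresponding two-by-two table, which gives a simple check on the fit.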

Scope extension:

The project's main aims are:

To develop scalable statistical methodology and computation tools to model genetic variation, environmental risk factors, gene by environment and gene by gene interactions for sequenced and genotyped cohorts.
To make the tool available to UK Biobank researchers (only software not data).
To internally apply the methodology to gene by environment and gene by gene interactions across all phenotypes available in UK Biobank, using genotype and sequence data.

The extension of scope includes these three additional points:

Study of multi-morbidity

We would like to extend the current proposal to the study of multimorbidity. Multimorbidity is a major challenge to health systems internationally, but previous research examining how morbidities cluster within individuals has been limited in scope, has not been replicated, largely ignores social and geographical context, and has seldom translated into improvements in care. This research aims to use artificial intelligence and state-of-the-art data science, social science, genomics, and health service research methods to understand clustering of morbidities within individuals, within communities, and in key clinical contexts.

The extension has two steps.

Objective 1: To examine clustering of morbidities in individuals using General Practice Research Database (GPRD) data, with unsupervised and supervised machine learning including multilayer networks. We will identify morbidity clusters that are consistent, operationalisable, stable, and reproducible across multiple methods and datasets. We will use these clusters in objective 2 along with clusters defined a priori based on theory or domain knowledge.

Objective 2: To examine the genomics of morbidity clustering in individuals to support validation of objective 1 cluster solutions, and examine aetiology. We will map objective 1 clusters into UK Biobank data as traits and identify genetic loci that explain why individuals belong to one or more clusters, using polygenic models to predict an individual's cluster membership. This will underpin future mechanistic and potentially interventional research, based on mapping potential causal pathways of morbidity clustering.
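One simple form of polygenic model is an additive score: a weighted sum of the number of copies of each variant a person carries. The sketch below shows that calculation only. The variant names and weights are hypothetical, and the source does not say which polygenic model will be used.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1 or 2 copies) over scored variants."""
    return sum(weights[v] * d for v, d in dosages.items() if v in weights)


# Hypothetical per-variant weights, e.g. effect estimates from an association study.
weights = {"variant_a": 0.10, "variant_c": -0.05}
# Hypothetical dosages for one person; variant_b carries no weight and is ignored.
dosages = {"variant_a": 2, "variant_b": 0, "variant_c": 1}

score = polygenic_score(dosages, weights)
```

In a prediction setting, a score of this kind would be one input to a model of cluster membership, alongside non-genetic factors.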

Estimation of penetrance of rare variants for complex disease with an emphasis on cancer

The overarching aim of this proposal is to systematically assess the penetrance of whole exome and whole genome genetic variants linked to cancer risk and other diseases in UK Biobank. To do that, we will first start with cancer:

- Extract genetic variants associated with cancer from public databases that hold medically assayed human genetic variation.

- Estimate risk and penetrance, i.e. the probability that a carrier of a mutation has cancer by a given age, for each of these variants and each cancer site.

- Test whether the combined effect of common susceptibility variants modifies the cancer risk and penetrance of rare pathogenic variants.

- Develop a web interface where the user can query and download our prevalence estimates flexibly and easily, for instance by tumour site, organ, gene or genetic variant. The website will include a penetrance predictor for polygenic risk scores of common variants.
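Penetrance as defined above, the probability that a carrier has cancer by a given age, can be estimated with a Kaplan-Meier calculation when some carriers are observed only up to a certain age without a diagnosis. The sketch below is one possible approach, not necessarily the one the project will use. The function name and the ages are invented, and it ignores complications such as competing risks and the fact that participants enter the study as adults.

```python
def penetrance_by_age(records, age):
    """Kaplan-Meier estimate of P(diagnosed by `age`) among carriers.

    records: list of (age_at_diagnosis_or_last_follow_up, diagnosed) pairs.
    """
    event_ages = sorted({t for t, diagnosed in records if diagnosed and t <= age})
    undiagnosed = 1.0
    for t in event_ages:
        at_risk = sum(1 for u, _ in records if u >= t)
        events = sum(1 for u, diagnosed in records if u == t and diagnosed)
        undiagnosed *= 1.0 - events / at_risk
    return 1.0 - undiagnosed


# Invented carriers: True means diagnosed at that age, False means last seen
# at that age without a diagnosis.
carriers = [(45, True), (50, False), (55, True), (60, False), (70, True), (75, False)]

penetrance_60 = penetrance_by_age(carriers, 60)
```

Carriers last seen before a given age still contribute to the estimate for the years in which they were observed, rather than being discarded or counted as unaffected.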

In a next phase we'll extend these aims to other diseases and genetic variants of known or unknown functional significance.

Estimation of genetic and environmental contributions to complex trait variation

This has a main part and 2 different arms and consolidates the aims of project 788 to improve project management within my group.

Heritability of liability to disease is often estimated using twin pairs. However, estimates obtained from twins have three major limitations: (1) twins are not a representative sample of the population, (2) estimates are inflated by common environmental factors and (3) the sample size is often small. UK Biobank offers a unique opportunity to overcome these limitations.

Moreover, the heritability of disease liability sets the potential discriminative ability of disease classifiers based on genetic markers and determines the familial recurrence risk of disease. Combining information from pedigree data (family history) and genetic markers from the general population in a unified approach may help to improve current models of risk based on genetic markers. However, obtaining detailed, deep family history is often time consuming and expensive. The shallow family history data obtained by UK Biobank likely sets the limits of what is practically feasible. We will use simulations to investigate the benefit of genotyping relatives of UK Biobank participants and of combining UK Biobank data with case-control genome-wide association studies.
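For context, the classical twin estimate referred to above is often computed with Falconer's formula, which compares the correlation in liability between identical (MZ) and non-identical (DZ) twin pairs. The sketch below shows that textbook calculation with invented correlations. It is background only and not the method proposed for UK Biobank.

```python
def falconer(r_mz, r_dz):
    """Split trait variance using MZ and DZ twin correlations.

    Returns (h2, c2, e2): heritability, shared environment, unique environment.
    """
    h2 = 2.0 * (r_mz - r_dz)
    c2 = 2.0 * r_dz - r_mz
    e2 = 1.0 - r_mz
    return h2, c2, e2


# Invented twin correlations, for illustration only.
h2, c2, e2 = falconer(r_mz=0.6, r_dz=0.4)
```

The formula assumes that MZ and DZ pairs share their environment to the same degree. When that assumption fails, environmental similarity is counted as genetic, which is one way twin estimates become inflated.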

Arm 1

Request: a derived data field denoting cohabitation of participants.

Justification: This is very important for the current project because we need to distinguish what is due to genetics from what is due to the environment. We have been working on this project for a while and have found an association in disease risk between the fathers and mothers of UK Biobank participants. This is very important in relation to the estimates of heritability. We believe some of the correlation of disease among family members arises because they share environmental risk factors as well as genetics. We would like to replicate our finding by comparing the prevalence of disease within couples recruited to UK Biobank (who share the environment but not the genetics).
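One way to quantify whether disease clusters within couples is an odds ratio from a two-by-two table of partner disease status. The sketch below is an assumption about how such a comparison could be made, not a description of the analysis actually performed. The function name and the counts are invented.

```python
import math


def couple_odds_ratio(both, only_a, only_b, neither):
    """Odds ratio and Wald 95% interval for disease concordance within couples."""
    odds_ratio = (both * neither) / (only_a * only_b)
    se = math.sqrt(1 / both + 1 / only_a + 1 / only_b + 1 / neither)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se)
    return odds_ratio, lower, upper


# Invented counts of couples: both partners affected, only one partner
# affected (split by which one), and neither affected.
odds_ratio_couples, lower, upper = couple_odds_ratio(
    both=40, only_a=160, only_b=160, neither=1640
)
```

An odds ratio above 1 in partners who are not genetically related would point to shared environment, or to partner choice, rather than shared genes.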

Arm 2

The aims of the project extension are to:

1. Estimate heritability, genetic and environmental correlations for these traits.
2. Perform a GWAS for these traits.
3. Present the summary statistics in a database for the research community to query and download. We have currently analysed the traits described in the attached Excel file.
4. Impute tissue-specific intermediate traits (e.g. using a reference panel of gene expression and methylation we have generated or obtained from public resources) and correlate them with disease status (yet to be done).
5. Present the summary statistics of the results in (4) in a database for the research community to query and download.
6. Write one or two papers with a few descriptive statistics of the data (similar to the one shown in the paper seen by UK Biobank).
7. Return to UK Biobank the results from 3, 4, and 5.