Approved Research

Identification and analysis of genetic variants associated to multivariate phenotypes

Centre for Genomic Regulation (CRG)

Lay summary

Measuring a trait in a cohort of genotyped individuals makes it possible to identify genomic loci that are statistically associated with it. Depending on the nature of the phenotype, this is the basis of genome-wide association studies (GWAS) and quantitative trait loci (QTL) mapping analyses. However, while genotypes are well-defined biological entities, phenotypes are usually defined more subjectively and may be related to a wide variety of biological processes. Indeed, multi-trait phenotypes are widespread in biology: levels of blood lipids (LDL, HDL, triglycerides), cellular composition of a tissue, traits that define a given neurological disorder, body measures (height and weight), expression of the genes in the same pathway, abundances of the splicing isoforms of a gene, etc. Nonetheless, despite the multivariate nature of many biological phenotypes, GWAS and QTL analyses are generally performed one phenotype at a time. This approach does not take into account the correlation structure of the studied traits, which often translates into a lack of power to detect true associations. In addition, although some of the currently available multivariate methods have occasionally been applied to GWAS analyses, they present several limitations (increased complexity, lack of interpretability, strong model assumptions, large amount of computation required, etc.) that hinder their broad usage by the community. In this context, we have developed a fast, non-parametric method for multivariate distance matrix regression, extending the statistical framework originally proposed by Anderson (DOI: 10.1111/j.1442-9993.2001.01070.pp.x and 10.1111/1467-842X.00285). It makes it possible to assess the significance of the association between a quantitative multivariate response and a set of explanatory variables using the asymptotic null distribution of the test statistic.
To evaluate our approach, although we are potentially interested in all kinds of multivariate phenotypes, we have chosen neuroimaging phenotypes (e.g. volumes of brain areas, connectivity, presence of lesions such as white matter hyperintensities, etc.), which are intrinsically multivariate, for a proof-of-concept GWAS analysis. Over the course of the project, planned to last up to 3 years, our goal is to assess the performance of our approach, compare it with other univariate and multivariate strategies, and identify genetic variants that alter human brain structures, which may reveal new biological mechanisms underlying cognition and neuropsychiatric disorders.
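
As an illustration of the kind of distance-based test described above, the following is a minimal sketch of Anderson's pseudo-F statistic for a one-way design, computed directly from a distance matrix. It is not the extended method developed in this project: significance is assessed here by permutation, as in Anderson's original framework, rather than with the asymptotic null distribution. The toy phenotype vectors, the genotype groups, and all function names are hypothetical and chosen only for the example.

```python
import math
import random
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two phenotype vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pseudo_f(dist, groups):
    """Anderson's pseudo-F for a one-way design, from a distance matrix."""
    n = len(groups)
    levels = sorted(set(groups))
    a = len(levels)
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares: same quantity computed inside each group.
    ss_within = 0.0
    for level in levels:
        idx = [i for i, g in enumerate(groups) if g == level]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(idx, 2)) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permutation_p_value(dist, groups, n_perm=999, seed=0):
    """Permutation p-value obtained by shuffling the group labels."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: two-trait phenotypes for six individuals,
# grouped by genotype at a single variant (0 or 1 copies of an allele).
phenotypes = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1),
              (3.0, 4.1), (3.2, 3.9), (2.9, 4.0)]
genotypes = [0, 0, 0, 1, 1, 1]

dist = [[euclidean(u, v) for v in phenotypes] for u in phenotypes]
f_stat, p_value = permutation_p_value(dist, genotypes)
print(f"pseudo-F = {f_stat:.1f}, permutation p = {p_value:.3f}")
```

Because the statistic depends only on pairwise distances, any dissimilarity suited to the phenotype can replace the Euclidean distance used here. The permutation loop is the costly step at GWAS scale, which is the motivation for an asymptotic null distribution.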

Scope extension:

We aim to identify common genetic variants associated with human multivariate phenotypes, accounting for additional relevant covariates. We have developed an extended version of Anderson's multivariate distance matrix regression method (DOI: 10.1111/j.1442-9993.2001.01070.pp.x and 10.1111/1467-842X.00285), a non-parametric analogue of multivariate analysis of variance (MANOVA) that makes it possible to assess the significance of the association between a quantitative multivariate response and a set of explanatory variables. As a proof of concept, we are interested in applying it to intrinsically multivariate MRI phenotypes, such as (but not restricted to) those derived from neuroimaging (volumes of the different brain areas, connectivity, lesions, etc.). Identifying genetic variants that alter human brain structures may reveal new biological mechanisms underlying cognition and neuropsychiatric disorders. Specifically, we plan to:

- Evaluate the performance of our multivariate approach.

- Compare it with the usual single-phenotype (univariate) strategy.

- Compare it with other multivariate methods, such as canonical correlation analysis (CCA), MANOVA, generalized linear mixed models (GLMMs), etc.

- Perform GWAS analyses using multivariate MRI (brain, heart, etc.) phenotypes.

To complement this analysis, we plan to explore a basic model in which the interaction of two genetic variants generates a particular phenotype. We will first study simple phenotypes, such as those defined in less frequent disorders. A digenic contribution should be detectable as a deficit of healthy co-carriers of rare and common variants. We will compare the observed and expected frequencies of carriers of each pair of variants, with the expected frequency determined by the individual frequency of each variant. In this analysis, we will use the UK Biobank dataset as a control cohort, in which the number of individuals with a given combination of variants should match the number expected under a neutral model.
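
The observed-versus-expected comparison can be sketched as follows. This is only an illustration under stated assumptions: the two variants are taken to be carried independently under the neutral model, so the expected co-carrier frequency is the product of the two carrier frequencies, and a one-sided exact binomial test is used to quantify a deficit. The choice of test, the cohort size, the carrier frequencies, the observed count, and the function names are all hypothetical and not taken from the project description.

```python
import math


def expected_cocarriers(n_individuals, carrier_freq_a, carrier_freq_b):
    """Expected co-carrier count if the two variants are carried independently."""
    return n_individuals * carrier_freq_a * carrier_freq_b


def binomial_lower_tail(k_observed, n_individuals, p_joint):
    """P(X <= k_observed) for X ~ Binomial(n_individuals, p_joint).

    Computed in log space so that large cohorts do not overflow.
    """
    log_p = math.log(p_joint)
    log_q = math.log1p(-p_joint)
    total = 0.0
    for k in range(k_observed + 1):
        log_pmf = (math.lgamma(n_individuals + 1)
                   - math.lgamma(k + 1)
                   - math.lgamma(n_individuals - k + 1)
                   + k * log_p
                   + (n_individuals - k) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Hypothetical numbers, for illustration only.
n_individuals = 10_000      # healthy individuals in the control cohort
carrier_freq_a = 0.02       # fraction carrying variant A
carrier_freq_b = 0.05       # fraction carrying variant B
observed = 3                # individuals carrying both variants

expected = expected_cocarriers(n_individuals, carrier_freq_a, carrier_freq_b)
p_deficit = binomial_lower_tail(observed, n_individuals,
                                carrier_freq_a * carrier_freq_b)
print(f"expected = {expected:.1f}, observed = {observed}, "
      f"P(X <= observed) = {p_deficit:.4f}")
```

A small lower-tail probability indicates fewer healthy co-carriers than independence would predict. In a real analysis, linkage disequilibrium, population structure, and multiple testing across variant pairs would also need to be handled, none of which this sketch addresses.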