Approved research

Characterizing the contribution of short tandem repeats to human phenotypes.

Principal Investigator: Mr Richard Yanicky
Approved Research ID: 46122
Approval date: April 10th 2019

Lay summary

Short Tandem Repeats (STRs) are a class of genetic variation consisting of repeated short sequences of DNA in the genome. Several dozen STRs are known to contribute to human diseases, including Huntington's Disease and Fragile X Syndrome. However, there are more than 1 million STRs in the human genome, most of which remain uncharacterized. Studying the role of STRs has traditionally been difficult because they are complex to analyze and are not directly captured by most genetics studies. We recently developed a resource that allows STRs to be analyzed in large datasets where they were not directly genotyped, using a technique known as imputation. Here, we will leverage this resource to identify the contribution of STRs to a variety of traits in humans. We expect that imputing STRs in the UK Biobank data and performing association tests will take up to 1 year, with 1-2 years of follow-up work to evaluate and interpret our results. We expect our study will identify a novel class of genetic variation with widespread impact on a variety of human traits.

Scope extension: We will additionally analyze the contribution of other complex variants, including variable number tandem repeats (VNTRs) and HLA haplotypes, to complex traits. The same rationale that supports the inclusion of STRs alongside SNPs in genetic analyses encourages the inclusion of other complex variant types, and there is already evidence that they play a role (e.g. Mukamel, et al. 2021). We will call these other variant types both via imputation and by direct calling from the 450k whole exome sequences. Among other methods, we will continue to look at length-based associations for VNTRs; for HLA haplotypes we will look for haplotype associations.

Scope extension:

Traditional GWAS based on Single Nucleotide Polymorphisms (SNPs) fail to explain a majority of heritability for most traits, likely due to the underlying causal variants not being tagged by common SNPs. We recently reported thousands of STRs whose lengths are strongly associated with expression of nearby genes (Fotsing, et al. 2018), supporting the hypothesis that STRs likely contribute to a variety of complex traits.

The goal of this project is to perform a comprehensive analysis of the contribution of STRs to a wide range of phenotypes. We will first impute STRs into the UK Biobank genetic data leveraging a phased SNP-STR haplotype panel we recently developed (Saini, et al. 2018). We will then test each STR for association with different phenotypes by modeling the relationship between allele length and trait. Finally, we will apply fine-mapping techniques to determine loci for which the STR is likely the causal variant.
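The association step described above can be illustrated with a minimal sketch. It is not the project's actual pipeline. It assumes a quantitative trait, an additive model in which the predictor is the summed repeat length of an individual's two alleles, and a large sample, so that a normal approximation to the test statistic is reasonable. The function name `str_length_association` and its parameters are hypothetical.

```python
import math
import numpy as np


def str_length_association(length_sum, trait, covariates=None):
    """Regress a quantitative trait on summed STR allele length.

    length_sum : per-individual sum of the two allele lengths (assumed encoding)
    trait      : per-individual quantitative phenotype
    covariates : optional (n_individuals, n_covariates) array, e.g. age or PCs

    Returns (effect size per repeat unit, standard error, two-sided p-value).
    The p-value uses a normal approximation, which assumes a large sample.
    """
    length_sum = np.asarray(length_sum, dtype=float)
    trait = np.asarray(trait, dtype=float)
    n = length_sum.shape[0]

    # Design matrix: intercept, STR length, then any covariates.
    columns = [np.ones(n), length_sum]
    if covariates is not None:
        covariates = np.asarray(covariates, dtype=float).reshape(n, -1)
        columns.extend(covariates.T)
    X = np.column_stack(columns)

    # Ordinary least squares fit.
    beta, _, _, _ = np.linalg.lstsq(X, trait, rcond=None)
    residuals = trait - X @ beta
    dof = n - X.shape[1]
    sigma2 = float(residuals @ residuals) / dof

    # Standard error of the STR length coefficient (column index 1).
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)
    se = math.sqrt(cov_beta[1, 1])

    z = beta[1] / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return float(beta[1]), se, p_value
```

In practice, imputed STR calls carry uncertainty, so an analysis might replace the hard-called length sum with an expected length computed from imputation probabilities; that refinement is omitted here.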

We will initially focus on traits for which preliminary analysis has identified STRs predicted to be causally driving GWAS signals, including body size and blood traits. We are requesting additional traits in order to more broadly assess the contribution of repeats to a range of complex traits.

We will additionally analyze the contribution of other complex variants, including variable number tandem repeats (VNTRs) and HLA haplotypes, to complex traits. The same rationale that supports the inclusion of STRs alongside SNPs in genetic analyses encourages the inclusion of other complex variant types, and there is already evidence that they play a role (e.g. Mukamel, et al. 2021). We will call these other variant types both via imputation and by direct calling from the 450k whole exome sequences. Among other methods, we will continue to look at length-based associations for VNTRs; for HLA haplotypes we will look for haplotype associations.

We will use trait associations of SNPs, STRs, and other variant types identified in analyses described above to build polygenic risk scores (PRSs) that collectively consider the effects of multiple variant types as well as complex effects such as interactions between variants. We will additionally evaluate performance of our association testing, fine-mapping, and polygenic risk score analyses across different ancestry groups and quantify the contribution of different variant types to traits of interest. Analyses will be updated to include variant calls based on newly available whole genome sequencing data.
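As a rough illustration of a polygenic risk score that combines several variant types, the sketch below computes a purely additive score: each variant contributes its per-unit effect size multiplied by the individual's genotype value (allele dosage for SNPs, summed repeat length for STRs or VNTRs). It assumes effect sizes are already estimated and leaves out the interaction effects mentioned above. The function name `polygenic_score` and the dictionary layout are hypothetical.

```python
import numpy as np


def polygenic_score(genotypes_by_type, weights_by_type):
    """Additive polygenic score across multiple variant types.

    genotypes_by_type : dict mapping a variant-type label (e.g. "snp", "str")
                        to an (n_individuals, n_variants) array
    weights_by_type   : dict mapping the same labels to per-variant effect sizes

    Returns (total score per individual, dict of per-type contributions).
    """
    contributions = {}
    total = 0.0
    for variant_type, genotypes in genotypes_by_type.items():
        g = np.asarray(genotypes, dtype=float)
        w = np.asarray(weights_by_type[variant_type], dtype=float)
        if g.ndim != 2 or g.shape[1] != w.shape[0]:
            raise ValueError(f"shape mismatch for variant type {variant_type!r}")
        # Weighted sum over the variants of this type, one value per individual.
        contributions[variant_type] = g @ w
        total = total + contributions[variant_type]
    return total, contributions
```

Returning the per-type contributions separately makes it straightforward to compare how much of the score comes from each variant type, for example within each ancestry group.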