Distribution and correlation free high-dimensional signal region detection with applications to whole genome association studies
Approved Research ID: 79237
Approval date: March 25th 2022
Canela-Xandri et al. (2018) built an atlas of genetic associations for 660 binary traits using ~452,000 related and unrelated UK Biobank participants of European descent. The genotypes of the UK Biobank participants were assayed on two genotyping arrays, the Affymetrix UK BiLEVE Axiom array and the Affymetrix UK Biobank Axiom array, which help identify common variants associated with traits or disease outcomes through genome-wide association studies (GWAS). However, these common variants make up only a fraction of the variants in the human genome; the vast majority are rare. Whole genome sequencing (WGS) association studies make it possible to study rare variant effects. It is therefore of substantial interest to conduct WGS studies of these 660 binary traits to identify additional genetic associations from rare variants.
We propose a new algorithm based on binary search that scans the whole genome using a high-dimensional distribution and correlation-free (DCF) two-sample test (Xue et al., 2019). We show analytically that the proposed method asymptotically controls the family-wise error rate and consistently selects exactly the true signal segments under some regularity conditions. In simulation studies, we demonstrate that our procedure is computationally faster than the Q-scan procedure (Li et al., 2020) while achieving better power. We will apply this new algorithm to the 660 binary traits (Canela-Xandri et al., 2018), which include 657 binary phenotypes generated from self-reported disease status (UK Biobank field 20002), ICD10 codes from hospitalization events (UK Biobank fields 41202 and 41204), and ICD10 codes from cancer registries (UK Biobank field 40006), as well as a further 3 binary phenotypes (UK Biobank fields 1777-0.0, 1707-0.0, 3079-0.0) from across the UK Biobank. Based on the identified genetic basis, we can better understand the causes of diseases or traits and provide more suitable treatments. This project will be undertaken over 36 months.
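The general idea of a binary-search scan over a genome can be sketched as follows. This is only an illustration under stated assumptions, not the proposed procedure: the function names (max_mean_diff_test, binary_scan, merge_adjacent), the fixed threshold, and the splitting and merging rules are all hypothetical, and the simple standardized mean-difference statistic merely stands in for the DCF two-sample test, whose calibration is not reproduced here. Rows of X and Y are the individuals in the two groups (for example, cases and controls) and columns are variants.

```python
import numpy as np

def max_mean_diff_test(X, Y, threshold=4.0):
    """Hypothetical stand-in for a two-sample test (NOT the DCF test).

    Rejects when the largest standardized mean difference across the
    columns of the segment exceeds a fixed, assumed threshold.
    """
    n, m = X.shape[0], Y.shape[0]
    diff = X.mean(axis=0) - Y.mean(axis=0)
    se = np.sqrt(X.var(axis=0, ddof=1) / n + Y.var(axis=0, ddof=1) / m)
    se = np.where(se > 0, se, np.inf)  # columns with zero variance are ignored
    return bool(np.max(np.abs(diff) / se) > threshold)

def binary_scan(X, Y, lo, hi, test=max_mean_diff_test, min_len=1):
    """Recursively halve the column range [lo, hi), keeping significant pieces."""
    if not test(X[:, lo:hi], Y[:, lo:hi]):
        return []                      # no evidence of signal: prune this segment
    if hi - lo <= min_len:
        return [(lo, hi)]              # smallest resolvable segment
    mid = (lo + hi) // 2
    found = (binary_scan(X, Y, lo, mid, test, min_len)
             + binary_scan(X, Y, mid, hi, test, min_len))
    # If neither half is significant on its own, report the parent segment.
    return found if found else [(lo, hi)]

def merge_adjacent(segments):
    """Join half-open segments that touch, e.g. (5, 6) and (6, 9) -> (5, 9)."""
    merged = []
    for lo, hi in sorted(segments):
        if merged and merged[-1][1] == lo:
            merged[-1] = (merged[-1][0], hi)
        else:
            merged.append((lo, hi))
    return merged

# Toy data: 4 individuals per group, 16 variants, signal in columns 5 to 8.
X = np.tile(np.array([[0.0], [1.0]]), (2, 16))
Y = X.copy()
Y[:, 5:9] += 10.0
segments = merge_adjacent(binary_scan(X, Y, 0, X.shape[1]))
print(segments)  # [(5, 9)]
```

Because a segment without evidence of signal is pruned without examining its sub-segments, a scan of this kind needs far fewer tests than an exhaustive search over all candidate windows when signal regions are sparse.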