Approved research

Running Principal Component Analysis and feature selection methods on Biobank scale genotypes

Principal Investigator: Professor Petros Drineas
Approved Research ID: 41297
Approval date: May 20th 2019

Lay summary

The project objectives are twofold.

1) One of the key exploratory tools for any data is Principal Component Analysis (PCA), which gives a general idea of the structure of the data. This holds for genotype data as well, where PCA paints a picture of the genetic variation among the samples. Being a linear dimensionality reduction technique, PCA can extract the fundamental features of a dataset without complex computational modeling. When used with genotype data, it can identify a set of markers, or Single Nucleotide Polymorphisms (SNPs), associated with genetic loci under selective pressure, as well as deleterious mutations. Advances in sequencing and genotyping technologies have created one of the growing challenges in genetics: biobanks now hold terabytes of data from large cohorts. PCA's major drawback is that it does not scale well as data sizes grow to terabytes. To address this, we propose TeraPCA, a multithreaded C++ package based on Intel's MKL (or any other BLAS/LAPACK distribution). It is an out-of-core algorithm based on the randomized subspace iteration method, which computes an invariant subspace associated with the largest eigenvalues of a square matrix. TeraPCA outperforms the current standard software suite, FlashPCA2, by a factor of 8 or more and can also run on a laptop with a limited amount of RAM. In addition to its computational advantage over other software suites, it produces results accurate to five or six digits in the eigenvalues and their corresponding eigenvectors when compared with MATLAB's SVD algorithm, a widely used standard.

2) In case-control datasets, one of the main goals is to identify the biomarkers that discriminate cases from controls. We will study various approaches to identifying such markers and implement methods from the randomized numerical linear algebra community to achieve that goal.
We will also compare different statistical machine learning techniques for feature selection on the original dataset, after which we can run supervised machine learning methods to classify cases and controls. Combining multiple SNPs in linkage disequilibrium (LD) may recover more power to detect correlated latent causal SNPs than single-SNP analysis. We will therefore test these methods and validate them against standard GWAS results on the UK Biobank data. This would provide a faster alternative to standard GWAS approaches.
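
The randomized subspace iteration mentioned under the first objective can be illustrated with a short Python sketch. This is not TeraPCA's implementation: it keeps the whole matrix in memory instead of working out-of-core, and the function name `randomized_subspace_iteration`, its parameters (`oversample`, `n_iter`, `seed`) and the toy two-population genotype data are assumptions made purely for illustration.

```python
import numpy as np


def randomized_subspace_iteration(A, k, oversample=10, n_iter=10, seed=0):
    """Approximate the k largest eigenpairs of a symmetric PSD matrix A.

    Hypothetical helper for illustration only; not the TeraPCA code.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    # Start from a random block with a few extra columns (oversampling).
    Q, _ = np.linalg.qr(rng.standard_normal((n, k + oversample)))
    for _ in range(n_iter):
        # Multiply by A and re-orthonormalize; the block drifts toward the
        # invariant subspace of the largest eigenvalues.
        Q, _ = np.linalg.qr(A @ Q)
    # Rayleigh-Ritz step: solve the small projected eigenproblem.
    T = Q.T @ A @ Q
    T = (T + T.T) / 2.0  # remove rounding asymmetry before calling eigh
    evals, evecs = np.linalg.eigh(T)
    order = np.argsort(evals)[::-1][:k]
    return evals[order], Q @ evecs[:, order]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy genotype matrix (assumed data): 200 samples x 1000 SNPs holding
    # 0/1/2 allele counts, drawn from two populations with different
    # allele frequencies.
    freqs = rng.uniform(0.1, 0.9, size=(2, 1000))
    labels = np.repeat([0, 1], 100)
    X = rng.binomial(2, freqs[labels]).astype(float)
    X -= X.mean(axis=0)        # center each SNP
    G = X @ X.T / X.shape[1]   # square, symmetric sample-by-sample matrix
    evals, pcs = randomized_subspace_iteration(G, k=2)
    print("leading eigenvalues:", evals)
    print("principal component matrix shape:", pcs.shape)
```

In this toy setting the first returned eigenvector plays the role of the first principal component of the samples. How quickly the iteration converges depends on how fast the eigenvalues decay, so `n_iter` and `oversample` would need tuning on real data.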