Research Questions:
Can a streaming feature selection method that combines multiple statistical indicators effectively identify informative SNPs while substantially reducing memory requirements for large-scale genomic data?
Does a stacking ensemble learning approach integrating diverse base learners improve phenotype prediction accuracy compared to traditional genomic selection methods?
What is the optimal combination of feature selection parameters and base learners for different complex traits?
Objectives:
Develop a memory-efficient feature selection method that reads SNP data as a stream and scores each variant by a weighted combination of its variance and its Pearson correlation coefficient with the trait, in order to identify trait-associated genetic variants.
Implement a stacking machine learning framework combining ridge regression, random forest, and kernel ridge regression to capture both linear and nonlinear relationships between genotypes and phenotypes.
Systematically evaluate prediction accuracy, computational efficiency, and memory usage using UK Biobank genomic and phenotypic data, benchmarking against established methods including GBLUP.
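As a rough illustration of the stacking framework named in the objectives, the sketch below combines ridge regression, random forest, and kernel ridge regression with scikit-learn's StackingRegressor. The simulated genotypes, all hyperparameters, the ridge meta-learner, and the five-fold internal cross-validation are assumptions made for this example and are not specified by the proposal. Pearson correlation between observed and predicted phenotypes is used here as one common accuracy measure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Simulated data (assumption): 300 individuals, 50 SNPs coded as 0/1/2 dosages,
# with the first 5 SNPs having a true effect on the phenotype.
rng = np.random.default_rng(0)
n_samples, n_snps = 300, 50
X = rng.integers(0, 3, size=(n_samples, n_snps)).astype(float)
effects = np.zeros(n_snps)
effects[:5] = 1.0
y = X @ effects + rng.normal(scale=0.5, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Base learners from the objectives; hyperparameters are illustrative only.
base_learners = [
    ("ridge", Ridge(alpha=1.0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("krr", KernelRidge(kernel="rbf", alpha=1.0)),
]

# The meta-learner (assumed here to be ridge regression) is trained on
# out-of-fold predictions of the base learners from 5-fold cross-validation.
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_train, y_train)

y_pred = stack.predict(X_test)
# Prediction accuracy as the correlation between observed and predicted values.
accuracy = np.corrcoef(y_test, y_pred)[0, 1]
print(f"held-out prediction accuracy (Pearson r): {accuracy:.3f}")
```

In a real evaluation the held-out split would be replaced by the study's own validation design, and the same accuracy measure would be computed for GBLUP on identical splits so that the comparison is like for like.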
Scientific Rationale:
Genomic selection has transformed breeding and disease risk prediction since its introduction by Meuwissen et al. (2001). However, traditional methods such as GBLUP assume linear SNP effects, which limits their accuracy for traits with complex genetic architectures. In addition, high-dimensional genomic data are subject to the “curse of dimensionality,” and analyzing them demands substantial computational resources, which limits accessibility.
A single machine learning model often captures only part of the structure in the data. Ensemble methods that integrate multiple learners can model diverse aspects of SNP-phenotype relationships more comprehensively. Current feature selection approaches typically rely on a single indicator and require loading the entire dataset into memory, which restricts scalability for biobank-scale data.
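To make the contrast with whole-dataset loading concrete, here is a minimal sketch of streaming feature selection using only the Python standard library. It assumes a SNP-major text file (one SNP per line: an identifier followed by 0/1/2 dosages), a score of the form w_var * variance + w_corr * |r|, and a fixed number of SNPs to keep. The file layout, the weights, the function names, and the top-k rule are hypothetical choices for illustration, not the proposal's specification.

```python
import heapq
import io
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def stream_select(lines, phenotype, top_k, w_var=0.3, w_corr=0.7):
    """Score SNPs one line at a time and keep only the top_k best.

    Memory use is bounded by one SNP's dosages plus a heap of size top_k,
    independent of the total number of SNPs in the stream.
    """
    heap = []  # min-heap of (score, snp_id); heap[0] is the weakest kept SNP
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        snp_id = fields[0]
        dosages = [float(v) for v in fields[1:]]
        mean = sum(dosages) / len(dosages)
        # Population variance of 0/1/2 dosages is at most 1, so both terms
        # of the score lie in [0, 1] and can be weighted directly.
        variance = sum((v - mean) ** 2 for v in dosages) / len(dosages)
        score = w_var * variance + w_corr * abs(pearson(dosages, phenotype))
        if len(heap) < top_k:
            heapq.heappush(heap, (score, snp_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, snp_id))
    return sorted(heap, reverse=True)


# Toy input; io.StringIO stands in for an open file handle on disk.
phenotype = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
snp_file = io.StringIO(
    "rs_a 0 0 1 1 2 2\n"
    "rs_b 1 1 1 1 1 1\n"
    "rs_c 2 0 1 0 2 1\n"
)
selected = stream_select(snp_file, phenotype, top_k=2)
for score, snp_id in selected:
    print(f"{snp_id}\t{score:.4f}")
```

Because an open file handle yields one line at a time, the same function can be passed a file on disk directly. Only the phenotype vector, the current SNP, and the top_k heap are ever held in memory.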