Approved research

Sub-phenotyping for prediction and genetic association testing

University of California San Francisco

Lay summary

Many diseases are heterogeneous amalgams of complex biological processes, which impedes both patient treatment and biological insight. Medically, this heterogeneity leads to disease misclassification and oversimplification, because treatment efficacy can depend strongly on covariates such as genetic background, environmental exposures and disease subtype. Biologically, disease heterogeneity creates statistical noise that obscures the link between cellular processes and disease. We aim to robustly model disease heterogeneity using the complex phenotypic patterns observed in the phenotype-rich UK Biobank dataset. We will use this model to improve and specialize medical predictions and to uncover the clearest and simplest latent biological signals for further analysis.
Both goals of our research are important steps towards improving healthcare. First, we aim to directly improve medical predictions and classifications by offering richer disease characterizations. Such sub-phenotyping has already proven crucial in the treatment of, for example, breast cancer, which is now stratified by the presence of estrogen receptors. Second, we aim to extend genetic association studies, which have already uncovered thousands of genotype-phenotype associations, by testing latent biological phenotypes to improve power and interpretability.
The primary technique we will use is dimensionality reduction. This is based on the premise that a parsimonious set of endophenotypes underlies the thousands of phenotypes in the UK Biobank. For example, repeat blood pressure measurements can be seen as estimates of a very simple endophenotype: baseline blood pressure. Ideally, dimensionality reduction preserves signal while removing noise. The average blood pressure, for instance, retains the target signal despite being simpler than the set of all measurements. By reducing large, generic phenotype sets to simple endophenotypes, we can study the endophenotypes in greater detail and with greater power.
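
The blood pressure example above can be illustrated with a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the function names, the number of people and repeats, and the noise level are hypothetical, no UK Biobank data are used, and principal component analysis stands in for dimensionality reduction generally rather than for the specific method the project will use. The sketch compares how well a single reading, the average of all readings, and the first principal component each track the simulated baseline.

```python
import numpy as np


def simulate_repeat_measurements(n_people=2000, n_repeats=4, noise_sd=10.0, seed=0):
    """Simulate a latent baseline blood pressure and noisy repeat readings.

    All parameter values are illustrative assumptions, not UK Biobank figures.
    """
    rng = np.random.default_rng(seed)
    baseline = rng.normal(120.0, 15.0, size=n_people)          # latent endophenotype
    noise = rng.normal(0.0, noise_sd, size=(n_people, n_repeats))
    return baseline, baseline[:, None] + noise                  # shape (n_people, n_repeats)


def first_principal_component(measurements):
    """Project the column-centered measurements onto their first principal axis."""
    centered = measurements - measurements.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


baseline, measurements = simulate_repeat_measurements()

single = measurements[:, 0]            # one raw reading per person
average = measurements.mean(axis=1)    # simplest possible reduction
pc1 = first_principal_component(measurements)

r_single = corr(single, baseline)
r_average = corr(average, baseline)
r_pc1 = abs(corr(pc1, baseline))       # the sign of a principal component is arbitrary

print(f"single reading vs baseline: r = {r_single:.3f}")
print(f"average reading vs baseline: r = {r_average:.3f}")
print(f"first PC vs baseline:        r = {r_pc1:.3f}")
```

In this simulation the average and the first principal component should both track the baseline more closely than any single reading does, which is the sense in which a reduced phenotype can carry more signal than the raw measurements it summarizes.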
We intend to use the entire cohort. Because our approach is based on machine learning techniques that are subtler than standard genetic association approaches, it is particularly important that we maximize our sample size. Further, because we aim to compare and dissect diseases, it is essential that we maximize the number of diseases that reach a minimum number of cases.