Approved Research

Machine learning methods for problems in biology, health, and inequality

Cornell University

Lay summary

The goals of this project are (1) to develop novel methods in artificial intelligence for massive biological datasets and (2) to use them to study problems in biology and health, including: (a) the mathematical modeling of genetic sequences, (b) understanding how genetic mutations and the environment interact to determine human traits, (c) predicting disease risk and other phenotypes from genetic, medical, and environmental data, (d) understanding the bias and fairness of machine learning models in a health context, and (e) quantifying and understanding inequality in a health context. This effort will involve adapting and extending methods for problems such as imputation, phasing, low-pass sequencing, genome-wide association analysis, and polygenic risk scoring. Our research will lead to practical technological improvements, including reducing the cost and improving the accuracy of genomic assays, improving risk prediction, and improving the accuracy of medical machine learning models on underrepresented groups. This research will also yield basic knowledge on the structure and relatedness of genetic populations, on the role of genetics and environment in disease, and on the challenges posed by inequality in a health context.