Approved Research

Complex trait prediction from multi-source data using machine learning methods

ETH Zurich

Lay summary

Can we predict a participant's height from their genetics, from their environmental conditions, or from some combination of the two? Does their health status or history strengthen or weaken the influence of these factors?

These are the questions we try to answer in our proposed three-year project. Height is a so-called complex trait: a trait that cannot be explained by simple rules of genetic inheritance and cannot be easily predicted. Complex traits are influenced by multiple factors, such as a person's genetic makeup and environmental conditions. For example, a person's height is influenced by the height of their parents (genetic makeup), but also by other factors such as their diet (environmental conditions). Understanding a participant's health status and history can further help explain a complex trait like height.

Each of these factors provides a different, incomplete view of a participant. However, the degree to which these factors, individually and in combination, affect a complex trait like height is still an open question in the scientific and medical community.

As health data become increasingly comprehensive, machine learning can be used to investigate the relationship between these data types and complex traits by identifying patterns in large amounts of data that are predictive of a complex trait of interest. Our goal is to develop and use a machine learning algorithm that can incorporate these different snapshots of a participant and determine which phenotypes each data type influences most.

Identifying the factors driving a particular complex trait is relevant for public health because it can inform the decision making and recommendations of medical practitioners, and specifically indicate which levers should be pulled to change a health outcome. For complex traits that are affected mostly by genetics, early screening may be critical so that treatment begins at onset; if, on the other hand, particular environmental factors have the greatest effect on a phenotype, preventive public health interventions could be impactful. The common thread underlying each of these scenarios is understanding how these different factors play a role in a participant's health.

The UK Biobank is an ideal partner for us in this research, with its vast troves of genotypic, health status and environmental data. We will predict complex traits by extracting meaningful patterns from these data sources using machine learning. In doing so, we will take a critical step towards personalized medicine.