Approved Research

Machine learning from complex and genetic data

Broad Institute

Lay summary

People take medical tests because doctors can diagnose disease based on the patterns seen in those tests. Some tests give simple results that can be easily interpreted by an expert. For example, a cardiologist can use lab tests that look at the levels of heart enzymes to better understand if a patient with chest pain is having a heart attack. This type of lab test gives back just one number.

Other tests give back data that are more complicated. The ECG, which takes electrical measurements of the heart, was invented in 1895 and standardized around 1942. This type of test returns a number representing the strength of the electrical signal at various parts of the body. The device does this hundreds of times per second. When these numbers are graphed, they give a picture of the heart's electrical activity. These graphs can be interpreted by a cardiologist to diagnose many different diseases. However, there may still be diseases that could be diagnosed by ECG that we do not yet know about. For example, in 1992, 50 years after the ECG was standardized in its modern form, two brothers recognized a new pattern in the ECG that went along with sudden cardiac death. The pattern that they found is a very memorable one, but there may be numerous less memorable patterns that we might be able to use to predict disease.

Newer tests give back even more complicated data. Cardiac MRI images contain thousands of times more data than an ECG. Important patterns have been recognized in these images, such as certain patterns that are seen after a heart attack. We think that there are probably many more important patterns left to discover. Finding them is a challenging task for humans, but we think that machines that are exposed to tens of thousands of these images may help find some of these patterns.

We would like to understand whether some of these patterns, in combination with genetics and traditional risk factors, can help us identify people who would benefit from taking action: for example, getting medical images taken more frequently, or taking a drug before they develop a disease.

Scope extension: Complex imaging, electrocardiographic, and accelerometric data contain features that associate with disease. We ask (1) whether machine learning techniques will permit discovery of novel features from these data; (2) which of these features lie along the causal pathway to disease; and (3) whether machine learning models trained on complex data can augment predictive models that use genetic data and known risk factors such as diagnostic codes, blood-based biomarkers, and demographic and survey data. We seek to understand the epidemiological relationships and genetic bases of these complex traits, as well as those of classical phenotypes, risk factors, and diseases. We aim to derive and evaluate machine learning-based, classical epidemiological-based, and genetic risk-based models.

Further, we seek to evaluate emerging risk factors such as proteomic biomarkers, telomere length, and clonal hematopoiesis broadly to understand their development and their impact on downstream disease risk and quantitative phenotypes. We also seek to study the epidemiology, genetics, and phenotypes linked to cardiovascular disease and cardiometabolic-linked risk including body composition, hepatic and renal disease, stroke, and dementia.