Approved research

Survival analysis with an explainable tree-based machine learning model

University of Washington

Lay summary

Aims: Our first aim is to use a computer to mathematically model mortality in a large group of individuals from the UK BioBank data. Our second aim is to use an "explainable" method that helps us better understand the model to find important attributes related to mortality. This method gives a number for each attribute (e.g., age, weight, etc.) for each individual that represents how important that attribute is for predicting mortality (i.e., importance values). Public Health Impact: Importance values give us insight that can positively impact public health in several ways. First, they can help us identify important risk factors (e.g., smoking, obesity, etc.) for mortality, which can help guide public health recommendations. Second, importance values allow us to order features by how important they are within groups of individuals. This information allows doctors to pay special attention to people with asthma, diabetes, or even dementia. Third, importance values teach us what attributes are important for predicting mortality. This means that in order to predict mortality, sometimes a simple questionnaire can be almost as effective as a blood test. Rather than spending a lot of money to collect patient information, importance values can help doctors know what features are the most important to spend their precious time collecting. For instance, for individuals with diabetes, it may be particularly important to know their diet history, whereas for other individuals it may not matter. Scientific Rationale: The traditional method (called linear models) that is often used for predicting mortality makes many assumptions that are often untrue in the real world. One example is that linear models assume that the attributes and mortality have a specific type of relationship (linear). In contrast, the model we use is able to capture more complex relationships between attributes and mortality. This means it is generally better at making predictions in comparison to linear models. In order to model mortality, we hypothesize that it is better to use a model that is capable of capturing these more complex relationships that often arise in the real world.