Explainable machine learning identifies a polygenic risk score as a key predictor of pancreatic cancer risk in the UK Biobank

Last updated:: 2 July 2025

Author(s):: Giulia Peduzzi, Alessio Felici, Roberto Pellungrini, Daniele Campa
Publish date:: 3 December 2024
Journal:: Digestive and Liver Disease
PubMed ID:: 39632152
DOI:: 10.1016/j.dld.2024.11.010

Abstract

BACKGROUND: Predicting the risk of developing pancreatic ductal adenocarcinoma (PDAC) is of paramount importance, given its high mortality rate. Current PDAC risk prediction models rely on a limited number of variables, do not include genetics, and achieve only modest accuracy.

AIM: This study aimed to develop an interpretable PDAC risk prediction model, based on machine learning (ML).

METHODS: Five ML models (Adaptive Boosting, eXtreme Gradient Boosting, CatBoost, Deep Forest and Random Forest) built on 56 exposome variables and a polygenic risk score (PRS) were tested in 654 PDAC cases and 1,308 controls of the UK Biobank. Additionally, SHapley Additive exPlanation (SHAP) and Global model Interpretation via Recursive Partitioning (Girp) were employed to explain the models.
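To make the idea behind SHAP concrete, the following is a minimal sketch of exact Shapley value attribution on a toy scoring function. It is not the study's pipeline or the `shap` library: the function `toy_risk`, its coefficients, and the `baseline` and `person` values are all hypothetical, chosen only to show how a prediction is split among features by averaging each feature's marginal contribution over every ordering.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over all orderings in which features switch from baseline to instance."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)          # start from the reference individual
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # reveal one feature at a time
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(x):
    # Hypothetical score with one interaction term; NOT the study's model.
    return (0.03 * x["age"] + 0.5 * x["prs"] + 0.8 * x["pancreatitis"]
            + 0.2 * x["prs"] * x["pancreatitis"])


# Hypothetical reference individual and individual to explain.
baseline = {"age": 55, "prs": 0.0, "pancreatitis": 0}
person = {"age": 65, "prs": 1.0, "pancreatitis": 1}

phi = shapley_values(toy_risk, person, baseline)
for name, value in phi.items():
    print(f"{name}: {value:.3f}")
```

The attributions sum to the difference between the score for `person` and the score for `baseline`, and the interaction term is shared equally between the two features involved. Exact enumeration grows factorially with the number of features, so with 57 inputs a practical explainer would rely on approximations or tree-specific algorithms instead.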

RESULTS: All models performed similarly, but CatBoost achieved the highest recall (77.10%). SHAP highlighted age and the PRS as primary contributors across all models. Girp derived rules to discern cases from controls, with age, PRS, and pancreatitis appearing in most of the rules.
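Recall, the metric used above to rank the models, is the fraction of true cases that a model flags: TP / (TP + FN). The sketch below computes it from label lists. The labels are hypothetical toy data, not the study's predictions.

```python
def recall(y_true, y_pred):
    """Fraction of actual positives (1) that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if true_pos + false_neg == 0:
        raise ValueError("recall is undefined when there are no positive cases")
    return true_pos / (true_pos + false_neg)


# Hypothetical toy labels: 1 = case, 0 = control.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]

print(recall(y_true, y_pred))  # 3 of 4 cases detected -> 0.75
```

Recall ignores false positives (the control flagged at index 6 does not affect it), which is why it suits a screening setting where missing a case is costlier than a false alarm.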

CONCLUSION: The predictive models tested performed well, indicating their potential for clinical application in the near future. As the explainers demonstrate, the PRS plays a key role in identifying high-risk individuals.