External Validation of Machine Learning Models for Bi-directional Cancer Risk Stratification Using Routine Health Examination Data in the UK Biobank Population

Last updated: 4 August 2025

ID: 676746
Start date: 4 August 2025
Project status: Current
Principal investigator: Dr Erez Hasnis
Lead institution: Rambam Health Care Campus, Israel

Background: We have developed a machine learning model that uses routine health examination data to stratify cancer risk in both directions. It identifies a high-risk population (30% 10-year risk, including a small very-high-risk subgroup with 75% risk) and a low-risk population (1.9% 10-year risk). The model incorporates 53 features, including basic demographics, anthropometric measurements, and routine laboratory parameters. Following TRIPOD-AI guidelines, external validation in a diverse population is crucial for establishing the model’s generalizability and clinical utility.

Research Questions:
1. Can our model’s performance in bi-directional cancer risk stratification be replicated in the UK Biobank population?
2. How do the predictive features identified in our original cohort compare to those in the UK population?
3. Does the model maintain predictive performance while ensuring participant privacy?

Objectives:
1. Validate the model’s ability to identify extreme risk groups (both high and low risk) in a different population
2. Compare the relative importance of predictive features between the original and validation cohorts
3. Assess model performance across different cancer types in a larger, more diverse population
4. Generate aggregate outputs while protecting participant confidentiality
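
As an illustration of objectives 1 and 4, the following minimal Python sketch shows one way observed 10-year cancer incidence could be summarized per predicted risk stratum while releasing only aggregate counts. It is not the project's actual pipeline: the cut-offs `LOW_CUTOFF` and `HIGH_CUTOFF`, the suppression threshold `MIN_CELL_SIZE`, and the function names are all hypothetical, since the summary does not specify them.

```python
# Hypothetical cut-offs on the predicted 10-year risk; not stated in the summary.
LOW_CUTOFF = 0.05
HIGH_CUTOFF = 0.20
# Assumed minimum group size below which a stratum is not reported.
MIN_CELL_SIZE = 10


def assign_stratum(score):
    """Map a predicted risk score in [0, 1] to a risk stratum label."""
    if score < LOW_CUTOFF:
        return "low"
    if score >= HIGH_CUTOFF:
        return "high"
    return "intermediate"


def aggregate_by_stratum(scores, outcomes, min_cell=MIN_CELL_SIZE):
    """Return per-stratum counts and observed risk, with no row-level data.

    scores   -- predicted 10-year risk per participant
    outcomes -- 1 if cancer was observed within 10 years, else 0
    Strata with fewer than `min_cell` participants are reported as None.
    """
    if len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must have the same length")

    # counts[label] = [number of participants, number of observed events]
    counts = {"low": [0, 0], "intermediate": [0, 0], "high": [0, 0]}
    for score, outcome in zip(scores, outcomes):
        cell = counts[assign_stratum(score)]
        cell[0] += 1
        cell[1] += 1 if outcome else 0

    summary = {}
    for label, (n, events) in counts.items():
        if n < min_cell:
            summary[label] = None  # suppressed: group too small to report
        else:
            summary[label] = {
                "n": n,
                "events": events,
                "observed_risk": events / n,
            }
    return summary
```

Comparing the `observed_risk` of the "high" and "low" strata in the validation cohort against the figures from the original cohort would be one simple check of whether the stratification replicates. A real analysis would also need to handle censoring and follow-up time, which this sketch ignores.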

Scientific Rationale:
Our initial study demonstrated that readily available data can effectively identify both high- and low-risk groups for cancer development. The UK Biobank provides an ideal validation cohort with its standardized data collection. The validated model will generate only aggregate outputs, with derived parameters returned to UK Biobank. This study will determine whether routinely collected parameters maintain their predictive value across different healthcare systems, potentially enabling cost-effective risk stratification.