Li et al. (Nat Med 2025) trained their LLM-based biological-age models on ~490 k UK Biobank (UKB) participants and validated them internally. However, the published analyses drew both the training and the test splits from the full UKB within the same pipeline; no true external hold-out set has been examined. We therefore propose a strict replication that reserves an untouched UKB sub-cohort to test whether the results reproduce under alternate sampling and variable preprocessing.
Research questions: (1) In a freshly sampled UKB hold-out (n ≈ 50 k), does the released Llama3-70B prompt reproduce the reported C-indices for all-cause mortality (0.757) and major incident diseases (CHD, stroke, COPD, T2D, arthritis)? (2) Are the age-gap hazard ratios (HR ≈ 1.05 per year) and proteomic associations (LEP, FGF21, IGFBP1/2) stable across different baseline exclusion criteria and follow-up censoring rules? (3) Does model performance degrade when primary-care text fields are removed, simulating low-resource settings?
Objectives: Download the same UKB fields used by Li et al. (baseline phenome, HES, death registry, Olink proteomics). Generate a new random 90/10 split whose hold-out is disjoint from the original training indices. Apply the exact prompt template and code released by the authors; recalculate overall and organ-specific age gaps. Evaluate discrimination (C-index) and calibration (Grønnesby-Borgan χ²) for 36 outcomes, comparing against the published estimates. Repeat the proteomic differential-abundance analysis between top- and bottom-decile age-gap groups. Provide 95 % CIs for the difference between replication and original performance metrics.
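The split, discrimination, and confidence-interval objectives above can be sketched in a few lines of Python using only NumPy. This is an illustrative sketch, not the authors' released code. The function names (draw_holdout, harrell_c, bootstrap_c_difference) are hypothetical, the hold-out is assumed to be drawn from participant IDs that are absent from the original training indices, tied follow-up times are ignored, and the published C-index is treated as a fixed constant, so the interval reflects sampling uncertainty in the replication cohort only.

```python
import numpy as np


def draw_holdout(all_ids, original_train_ids, frac=0.10, seed=0):
    """Sample a hold-out set from IDs not present in the original training indices."""
    rng = np.random.default_rng(seed)
    eligible = np.setdiff1d(all_ids, original_train_ids)  # sorted, unique IDs
    n_holdout = int(round(frac * len(eligible)))
    return rng.choice(eligible, size=n_holdout, replace=False)


def harrell_c(time, event, risk):
    """Harrell's C-index; tied risk scores count 0.5, tied times are ignored."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant, comparable = 0.0, 0
    for i in np.flatnonzero(event):
        # A pair (i, j) is comparable if i had the event and j was followed longer.
        later = time > time[i]
        comparable += int(later.sum())
        concordant += (risk[i] > risk[later]).sum() + 0.5 * (risk[i] == risk[later]).sum()
    return concordant / comparable


def bootstrap_c_difference(time, event, risk, published_c, n_boot=200, seed=0):
    """Point estimate and percentile 95 % CI for (replication C - published C)."""
    time, event, risk = np.asarray(time), np.asarray(event), np.asarray(risk)
    rng = np.random.default_rng(seed)
    n = len(time)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample participants with replacement
        diffs[b] = harrell_c(time[idx], event[idx], risk[idx]) - published_c
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return harrell_c(time, event, risk) - published_c, lower, upper
```

In use, risk would be the model-predicted age (or age gap) for each hold-out participant, and published_c the corresponding estimate from Li et al., for example 0.757 for all-cause mortality. At hold-out sizes near 50 k the pairwise loop becomes slow, so a dedicated survival-analysis library would be the practical choice for the full analysis.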
Scientific rationale: A fully independent hold-out replication guards against overfitting and publication bias, and quantifies the expected performance drop when the LLM is deployed in future UKB sub-studies or clinical implementations.