We are developing disease risk prediction methods that apply large language models (LLMs) to multi-layered biomedical data within a digital twin framework. We aim to integrate diverse data sources, including genomic sequences, polygenic risk scores, epigenomic and metabolomic profiles, physiological measurements such as electrocardiograms and neural activity, and longitudinal health checkup information. By adapting LLMs to genomic variation and fine-tuning them on time-series health and lifestyle data, we seek to capture both the genetic predispositions and the dynamic environmental influences that contribute to disease onset and progression.
A central aspect of our approach is the use of UK Biobank data. UK Biobank provides a large-scale cohort with harmonized genomic, clinical, lifestyle, and imaging data, enabling us to validate our algorithms beyond the Japanese cohorts that we primarily work with. By applying our LLM-based methods to UK Biobank, we can test how well our disease risk prediction models generalize across populations, assess the transferability of polygenic risk scores, and identify shared and distinct gene-environment interactions between European and East Asian populations. This comparative analysis makes our models more robust and informs the design of population-specific and cross-population preventive strategies.
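As an illustration of what assessing polygenic risk score transferability can involve, the following minimal sketch computes a score as a weighted sum of allele dosages and compares its discrimination (AUC) between two cohorts using the same weights. It is not the project's actual pipeline: the function names (`polygenic_score`, `auc`, `transferability`), the use of AUC as the metric, and the AUC ratio as a summary are all assumptions made for illustration.

```python
import numpy as np

def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, or 2 copies) per individual."""
    return np.asarray(dosages, dtype=float) @ np.asarray(weights, dtype=float)

def auc(scores, labels):
    """Probability that a random case outscores a random control (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    cases, controls = scores[labels], scores[~labels]
    # Pairwise case-minus-control differences; fine for a sketch, not for large cohorts.
    diff = cases[:, None] - controls[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def transferability(dosages_a, labels_a, dosages_b, labels_b, weights):
    """AUC of the same score weights in cohorts A and B, plus the ratio B/A (hypothetical summary)."""
    auc_a = auc(polygenic_score(dosages_a, weights), labels_a)
    auc_b = auc(polygenic_score(dosages_b, weights), labels_b)
    return auc_a, auc_b, auc_b / auc_a
```

In this sketch, a ratio well below 1 would indicate that weights derived in cohort A discriminate less well in cohort B. A real analysis would also need to account for ancestry-specific allele frequencies, linkage disequilibrium, and covariates, none of which are modeled here.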
By combining UK Biobank’s extensive resource with our own multi-layered datasets, we will construct digital twin systems capable of simulating future health trajectories under varying lifestyle or environmental scenarios. The intended outcome is a platform that delivers accurate, personalized risk predictions for a broad spectrum of diseases, supports preventive interventions tailored to individuals, and contributes to global efforts in precision medicine and public health.