Approved Research

Synthetic data for lung cancer prognostic modelling

University College London

Lay summary

Risk prediction models are used to estimate how likely it is that a particular event will happen to an individual. In the context of cancer screening, this might mean predicting whether someone has a higher chance of developing cancer, and so whether they should be offered screening.

To develop risk prediction models, researchers need access to large datasets. In fact, multiple datasets are necessary so that a model can be tested in different scenarios and its accuracy checked.

Such datasets are rare, and where they are available, they can be restricted or difficult to access, which can prevent research.

Synthetic data may offer a way to tackle this problem of access to multiple medical datasets. These are data created using machine learning that mimic real data, but do not present privacy risks. As an example, a synthetic version of a dataset would contain lots of people with a realistic combination of medical problems. However, every person in the synthetic dataset has been created by machine learning.

Synthetic data could have many uses in medicine by making it easier to access research datasets without risking patient privacy. First, however, we need to test the machine learning algorithms that generate synthetic data, and to better understand how these generated data can be used.

In this research, we aim to compare the performance of risk prediction models for lung cancer developed using a synthetic copy of the UK Biobank against risk prediction models developed on the 'real' UK Biobank dataset. In doing so, this research will contribute to our understanding of how synthetic data can be used in medical research, and therefore to improvements in patient care.

We estimate that this project will last 3 years.