Privacy-preserving Synthetic Twins for Medical Data With a Focus on Covid-19 Studies

Last updated: 2 July 2025

ID: 65101
Start date: 4 September 2020
Project status: Closed
Principal investigator: Professor Samuel Kaski
Lead institution: Aalto University, Finland

Machine learning enables rapid discovery of important features in data, which is especially valuable in times of crisis such as the ongoing SARS-CoV-2 pandemic. In general, these methods provide better utility when applied to large amounts of data. However, the data are often spread across multiple parties and subject to strict privacy regulations because of the sensitive information they contain.

We aim to make widespread sharing of research data possible by developing techniques for releasing a synthetic twin instead of the original sensitive data. The synthetic twin shares the statistical properties of the original data, while preserving the anonymity of the individuals in the original data set. Our approach is based on differentially private learning, which prevents re-identification from the learning outcomes by limiting the extent of identifying characteristics learned from the data. We generate the synthetic data from a probabilistic model trained using such techniques.
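To make the idea concrete, the following is a minimal illustrative sketch and not the project's actual method. It assumes a single categorical attribute, a category list fixed in advance rather than read from the data, and add/remove-one-record neighbouring data sets. The "probabilistic model" is a histogram made differentially private with the Laplace mechanism, and the synthetic twin is sampled from it. All function names, the example categories, and the value of epsilon are hypothetical.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with rate 1/scale
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(records, categories, epsilon, rng):
    """Return noisy per-category counts using the Laplace mechanism.

    Assumption: neighbouring data sets differ by adding or removing one
    record, so the L1 sensitivity of the histogram is 1.
    """
    counts = Counter(records)
    scale = 1.0 / epsilon
    noisy = {c: counts.get(c, 0) + laplace_noise(scale, rng) for c in categories}
    # Clipping is post-processing and does not weaken the privacy guarantee.
    return {c: max(v, 0.0) for c, v in noisy.items()}


def sample_synthetic_twin(noisy_counts, n, rng):
    """Draw n synthetic records from the categorical model given by the noisy counts."""
    categories = list(noisy_counts)
    weights = [noisy_counts[c] for c in categories]
    if sum(weights) <= 0:
        weights = [1.0] * len(categories)  # fall back to a uniform model
    return rng.choices(categories, weights=weights, k=n)


# Hypothetical toy data: one sensitive attribute per individual.
rng = random.Random(0)
categories = ["negative", "positive", "inconclusive"]
original = ["negative"] * 700 + ["positive"] * 250 + ["inconclusive"] * 50

noisy = dp_histogram(original, categories, epsilon=1.0, rng=rng)
synthetic = sample_synthetic_twin(noisy, n=len(original), rng=rng)
print(Counter(synthetic))
```

Only the noisy counts are used when sampling, so the synthetic records inherit the differential privacy guarantee of the histogram. A realistic pipeline would need a richer model covering many attributes jointly, and a noise sampler hardened against floating-point attacks.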

Developing and refining our methods will take approximately one year. The project has the potential to significantly improve data accessibility, opening up possibilities for a wide variety of new medical research.