Privacy-preserving Synthetic Twins for Medical Data With a Focus on Covid-19 Studies
Approved Research ID: 65101
Approval date: September 4th 2020
Machine learning allows rapid discovery of important features in data, which is especially valuable during times of crisis such as the ongoing SARS-CoV-2 pandemic. In general, these methods provide better utility when applied to large amounts of data. However, the data are often spread across multiple parties and subject to strict privacy regulations because of the sensitive information they contain.
We aim to make widespread sharing of research data possible by developing techniques for releasing a synthetic twin instead of the original sensitive data. The synthetic twin shares the statistical properties of the original data, while preserving the anonymity of the individuals in the original data set. Our approach is based on differentially private learning, which prevents re-identification from the learning outcomes by limiting the extent of identifying characteristics learned from the data. We generate the synthetic data from a probabilistic model trained using such techniques.
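As a rough illustration of the idea, the sketch below releases a synthetic twin of a single categorical column. It is not the project's method: it swaps in a much simpler technique, the Laplace mechanism applied to a histogram, in place of differentially private learning of a richer probabilistic model. The function name `dp_synthetic_twin`, its parameters, and the toy data are all hypothetical, and the noise scale assumes that neighbouring data sets differ by adding or removing one individual.

```python
import numpy as np


def dp_synthetic_twin(values, categories, epsilon, n_synthetic, rng):
    """Sample a synthetic categorical column from an epsilon-DP histogram.

    Hypothetical helper for illustration only. `categories` must be a fixed,
    publicly known list, not one derived from the sensitive data.
    """
    # Count how often each category occurs in the sensitive column.
    counts = np.array([np.sum(values == c) for c in categories], dtype=float)

    # Laplace mechanism: adding or removing one individual changes exactly
    # one count by 1 (L1 sensitivity 1), so the noise scale is 1 / epsilon.
    noisy = counts + rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)

    # Post-processing does not weaken the privacy guarantee:
    # clip negative counts and normalise into a probability vector.
    noisy = np.clip(noisy, 0.0, None)
    if noisy.sum() == 0.0:
        probs = np.full(len(categories), 1.0 / len(categories))
    else:
        probs = noisy / noisy.sum()

    # The synthetic twin is drawn from the noisy model, never from the raw data.
    return rng.choice(categories, size=n_synthetic, p=probs)


# Toy usage with made-up data.
rng = np.random.default_rng(0)
original = np.array(["negative", "positive", "negative", "negative", "positive", "negative"])
categories = np.array(["negative", "positive"])
twin = dp_synthetic_twin(original, categories, epsilon=1.0, n_synthetic=10, rng=rng)
print(twin)
```

A smaller `epsilon` adds more noise and gives stronger privacy at the cost of a twin that follows the original statistics less closely.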
Developing and refining our methods will take approximately one year. This project has the potential to significantly boost data accessibility, which would open up possibilities for a wide variety of new medical research.