Approved research

Analysing the effects of privacy protection methods on data utility and data research generalisability in observational studies: A case study in asthma

University College London

Lay summary

Health data is the raw material for data-driven observational studies. The results of these studies guide new clinical processes and public health policymaking. However, releasing health data for research poses a risk to the privacy of individuals. Data holders remove data elements that directly identify individuals, such as names and other personal information. This measure alone is inadequate to protect privacy, because a combination of the remaining attributes (e.g. age, sex, occupation) can still reveal a person's identity. Other measures are therefore applied to protect privacy further. Some examples are:

1) Removing the column on smoking habits.
2) Deleting the records of patients older than 80.
3) Aggregating data by providing only the average blood pressure for males and females of all ages.
4) Manipulating the data so that people with similar characteristics are placed in groups of a minimum size, for example a group of at least ten males aged 20-30 and another such group for females.

In all these examples, the amount of precise information available for research is reduced. For example, removing the smoking column would render the data less useful for finding a correlation between the disease and smoking. Another issue is that the manipulations applied to the data to protect privacy are not disclosed to the researcher. Hence, researchers are uncertain how closely the anonymised data resembles the original data in the information relevant to their research. This can lead to biased research results, which in the long term affect data-driven medical decisions and public health policymaking.

This research aims to acquire anonymised asthma data from Biobank and to apply further techniques used for anonymisation (e.g. deleting columns, aggregating values, creating synthetic records). This data will be used to analyse the effects of anonymisation on the accuracy of observational methods in asthma.
The research also aims to develop methods for measuring and characterising bias and the loss of data usefulness. We will further investigate whether informing the researcher about the bias introduced by anonymisation would improve the quality of research. It should be emphasised that the aim of the research is not to re-identify data subjects or to analyse the strength of the anonymity of the acquired dataset.
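
The four example privacy measures described in the summary can be illustrated with a minimal Python sketch. It is not the project's method: the toy records, the field names (age, sex, smoker, systolic_bp), the function names and the ten-year age bands are all assumptions made up for this illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records; every field name and value is invented for illustration.
RECORDS = [
    {"age": 23, "sex": "M", "smoker": True, "systolic_bp": 120},
    {"age": 27, "sex": "M", "smoker": False, "systolic_bp": 130},
    {"age": 25, "sex": "F", "smoker": False, "systolic_bp": 110},
    {"age": 85, "sex": "F", "smoker": True, "systolic_bp": 150},
]


def drop_column(records, column):
    """Measure 1: remove one attribute (e.g. smoking habits) from every record."""
    return [{k: v for k, v in r.items() if k != column} for r in records]


def drop_older_than(records, max_age):
    """Measure 2: delete the records of patients older than max_age."""
    return [r for r in records if r["age"] <= max_age]


def mean_by_group(records, group_key, value_key):
    """Measure 3: release only the average of a value per group (e.g. per sex)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r[value_key])
    return {group: mean(values) for group, values in groups.items()}


def group_with_minimum_size(records, k, band_width=10):
    """Measure 4: replace exact age with an age band, then keep only
    (age band, sex) groups that contain at least k records."""
    generalised = []
    for r in records:
        low = (r["age"] // band_width) * band_width
        g = dict(r)
        g["age"] = f"{low}-{low + band_width - 1}"  # assumed band labels, e.g. "20-29"
        generalised.append(g)

    sizes = defaultdict(int)
    for g in generalised:
        sizes[(g["age"], g["sex"])] += 1

    # Records in groups smaller than k are suppressed entirely.
    return [g for g in generalised if sizes[(g["age"], g["sex"])] >= k]
```

Each function returns less detail than it receives, which is the loss of usefulness the summary describes. For instance, after drop_column(RECORDS, "smoker") no analysis of smoking is possible, and group_with_minimum_size(RECORDS, 2) silently discards the records that fall in groups smaller than two.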