Exploration of tSNE and DBScan for the identification of patient subpopulations in Electronic Health Records using International Classification of Disease (ICD) codes.
Principal Investigator: Mr Marc Maurits
Approved Research ID: 54888
Approval date: January 15th 2020
Patients who look similar can be very different. Currently we treat each disease as if it were a single entity. However, one disease may consist of multiple subtypes with different causal factors and different long-term outcomes. We believe that by combining all available health information on patients, we can identify subgroups among patients with similar symptoms of similar diseases. For instance, many people have hypertension, but its causes differ; when a complete medical history is known, the cause of the hypertension could be identified. For some diseases, such as rheumatoid arthritis (RA), there are strong indications that different factors influence disease development and disease course (e.g. smoking is a risk factor, but not all RA patients smoke, and medication is often effective in only a subgroup of patients). Currently we are unable to detect these expected subgroups. This is because 1) we lack extensive health information from before the onset of disease and 2) our existing analytical methods fail to identify valuable clusters. In the current era of big data, novel techniques have been developed. We aim to explore whether these techniques can help us resolve the heterogeneity within diseases such as RA. We would therefore first like to test the techniques on the complete patient data and, if they work, use them to find similar RA patients in order to improve their diagnosis and treatment. We already have evidence from our own data that these techniques work, but would very much like to test whether this holds up in the UK Biobank.