Semi-supervised learning and integration of prior knowledge to define biomechanistically related phenotypes
Understanding the molecular mechanisms of disease is essential for effective and targeted treatment. The study of these disease mechanisms and genetic risk factors using genomic data requires a large enough sample size. While biobanks offer sufficiently large sample sizes, the scope of possible research is typically limited by the available medical information.
A patient's medical history generally cannot show that the patient never had a certain disease: the absence of a record does not mean the absence of the disease. One example is gastroenteritis caused by norovirus infection. If the patient did not seek treatment for the infection, the case is not reported and therefore not recorded in the patient file. Such gaps make it difficult to determine which participants are in fact resistant to a certain disease.
On the other hand, patients who do have a record of a certain disease may have different types of that disease. One example is Parkinson's disease, where the diagnosis is based primarily on symptoms and not necessarily on the cause of the disease, even though different causes may require different treatments.
In both examples, if the patients were simply grouped by whether they have a record of the disease, the resulting case and control groups would be inaccurate: controls would include unrecorded cases, and cases would mix distinct disease types. This can be mitigated by learning the clinical and genetic similarities between patients and grouping them accordingly. These groups can then be compared to identify risk factors or markers associated with the different disease mechanisms.
To learn these groups, we will explore different semi-supervised machine learning methods. Semi-supervised learning can combine a small amount of data for which the groups are known with a large amount of data for which they are not. The available biological and medical knowledge about the diseases will be integrated into this approach to improve the learning of the patient groups from the clinical and genomic data.
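As a purely illustrative sketch of the semi-supervised idea, the following Python example spreads a few known group labels to unlabeled patients through a similarity graph (a simple form of label propagation). It is not the method of this project, which has yet to be chosen; the function name `propagate_labels`, the two-dimensional toy features, and the parameters `sigma` and `n_iter` are all assumptions made for the example.

```python
import math

def propagate_labels(features, labels, n_classes, sigma=1.0, n_iter=50):
    """Spread known labels to unlabeled points over an RBF similarity graph.

    labels[i] is a class index for a labeled point and None for an unlabeled one.
    Hypothetical helper for illustration only.
    """
    n = len(features)

    # Pairwise RBF similarities; a point is not its own neighbour.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
                weights[i][j] = math.exp(-d2 / (2 * sigma ** 2))

    # Labeled points start one-hot, unlabeled points start uniform.
    scores = [
        [1.0 if lab == c else 0.0 for c in range(n_classes)]
        if lab is not None
        else [1.0 / n_classes] * n_classes
        for lab in labels
    ]

    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            total = sum(weights[i])
            if labels[i] is not None or total == 0.0:
                # Known labels are clamped; isolated points keep their scores.
                new_scores.append(scores[i])
                continue
            # Unlabeled points take the similarity-weighted average of the others.
            new_scores.append([
                sum(weights[i][j] * scores[j][c] for j in range(n)) / total
                for c in range(n_classes)
            ])
        scores = new_scores

    return [max(range(n_classes), key=lambda c: row[c]) for row in scores]

# Toy data (invented): six "patients" with two features each,
# of which only one per group has a known label.
features = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
            (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
labels = [0, None, None, 1, None, None]

groups = propagate_labels(features, labels, n_classes=2)
print(groups)
```

In this toy setting, each unlabeled patient ends up in the group of the labeled patient it most resembles. Prior biological knowledge could enter such a scheme through the choice of features or of the similarity measure, although how best to do so is one of the open questions of the project.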
Using this framework, we will then aim to identify why patients are at a higher or lower risk of having a certain type of disease and which molecular mechanisms are likely to be involved in that disease.