The phenotype of a biological organism results from interactions between the environment and a range of complex biological processes. These processes start with the copying of DNA into RNA (transcription) and include the production of proteins (translation). From a biochemical point of view, and in downstream order, the following molecular sets interact with each other before a phenotype is defined: the genome, epigenome, transcriptome, proteome, and metabolome. The metabolome is the complete set of small-molecule chemicals found within an organism. As the final downstream product of transcription and translation, the metabolome is the set of biomarkers closest to the phenotype. As such, it closely reflects the current state of an organism, including in particular the presence or absence of disease. Indeed, machine learning (ML) models that assess disease risk from metabolite concentrations have been demonstrated for multiple diseases. However, these studies are usually based on relatively small datasets of tens to 1,000 samples. ML models trained on small datasets have been found not to generalize to larger populations. This is likely due to confounding factors, whose effects cannot be separated from relevant disease patterns in small datasets.
We propose taking advantage of both the general description of the phenotype provided by the metabolome and its demonstrated capability for diagnosing diseases. To disentangle the effects of confounding factors (endemic in small-dataset studies), we will train a deep neural network on the large UK Biobank biomarker dataset of 100,000+ samples. This network will then be used to derive specialized models for the diagnosis of specific diseases. These models will be analyzed with perturbative methods to identify the metabolites most important for particular diseases. These metabolites and their related biological pathways may expose possible targets for disease therapy. The project is expected to span 3 years.
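As an illustration of what a perturbative analysis could look like, the sketch below uses permutation importance: each input feature is shuffled across samples in turn, and the resulting increase in prediction loss is taken as that feature's importance. This is only one possible perturbative method, chosen here for illustration; the proposal does not commit to it. The fixed-weight logistic model, the synthetic data, and all names (`toy_predict`, `permutation_importance`, `weights`) are hypothetical stand-ins for a trained disease model and real metabolite concentrations.

```python
import numpy as np


def log_loss(y_true, p):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in log loss when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = log_loss(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link to the label while
            # preserving its marginal distribution.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(log_loss(y, predict(X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return importances


# Hypothetical stand-in for a trained model: a logistic model with fixed
# weights over four synthetic "metabolites". Features 1 and 3 are irrelevant.
weights = np.array([3.0, 0.0, 1.0, 0.0])


def toy_predict(X):
    return 1.0 / (1.0 + np.exp(-(X @ weights)))


# Synthetic data (assumption): standardized concentrations and labels
# sampled from the toy model itself.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(2000, 4))
y = (data_rng.random(2000) < toy_predict(X)).astype(float)

scores = permutation_importance(toy_predict, X, y)
ranking = np.argsort(scores)[::-1]  # feature indices, most important first
print("importance per feature:", scores)
print("ranking:", ranking)
```

In this toy setting the features the model actually relies on receive the largest scores, while features it ignores score zero. With real metabolite data, correlated features can share or mask importance, so such rankings would need careful interpretation.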
Potential outcomes of the proposed research include trained models capable of reliably diagnosing multiple diseases in the general population. This opens the door to mass disease screening of entire populations, which may particularly benefit marginalized low-income citizens. For example, early diagnosis allows treatment to start in a timely manner, which is crucial for certain diseases such as various cancers. In addition, insights extracted from trained disease predictors may inform the development of treatments for multiple diseases.