Biobank-scale datasets offer huge potential for improved diagnoses and more effective treatments across a wide range of diseases. There are, however, a number of challenges in extracting useful insights from such data. For example, the UK Biobank contains many different data types, including imaging, genomic, environmental and clinical variables. While there are a number of methods to analyse each data type separately, combining information from the different data sources is not straightforward. Statistical methods must also be computationally scalable; a procedure originally developed for a few hundred data points may not be feasible for several million.
In this project, we will develop, implement and evaluate a range of statistical methods capable of analysing large-scale healthcare datasets. These will include new approaches that combine information across multiple data modalities and are capable of leveraging the full extent of biobank-scale data. By extracting novel biological insights from the data, these methods have the potential to enable improved diagnoses, timely interventions and more effective treatments. We will also develop freely available, open-source software so that other researchers in the health sciences can apply our methods to extract insights from UK Biobank or other large healthcare datasets.