Correcting fine-scale population stratification using haplotype sharing to improve association study and polygenic risk score accuracy
Any two humans share at least one ancestor at some point in the past. This shared ancestry may be reflected in their genomes as shared stretches of DNA, and the number of these ancestral segments shared between two individuals reflects how closely their ancestry is related. A recent study in Britain ("The People of the British Isles"), which examined patterns of ancestral segment sharing across the island, identified subtly distinct genetic subgroups of people that correspond to geographic regions of Britain. For example, individuals sampled from Devon form a subtly distinct subgroup from individuals sampled from Cornwall. Studies in Ireland, Finland, Japan, Italy, the Netherlands, France and Spain have since revealed subtle genetic subgroups within these countries, again based on patterns of ancestral segment sharing. The existence of these ancestrally enriched subgroups has important implications for the design of genetic association studies performed on the UK Biobank and other datasets, because individuals within these subgroups are expected to share slightly more genetic variation than expected by chance due to their shared ancestry. If we search for a mutation associated with an outcome such as a disease without accounting for this ancestral similarity, we may falsely identify a mutation that is shared simply because of common ancestry.
Current methods for detecting and correcting for underlying shared ancestry in association studies have lower resolution for subtle shared ancestry than methods based on ancestral segments, and have failed to identify within-country subgroups. It is therefore possible that applying new methods that leverage segment sharing to correct for shared ancestry in association studies will make results more robust and reduce false associations.
Our study aims to explore the use of a fast and scalable method for detecting shared ancestral segments across individuals in the UK Biobank. We will compare the results of association studies corrected using this method with those corrected using standard methods, and will investigate the degree of inflation due to population structure in each using an established method called LD score regression. Our project should take between 1 and 2 years and will provide a reusable resource for all Biobank researchers. If successful, our project should also reduce the rate of false positive associations, giving greater confidence in results from the Biobank and better informing targets for drug development. It may also improve the accuracy of genetic prediction, which may have clinical applications in the future.