Last updated:
Author(s):
Jerome Kelleher, Yan Wong, Anthony W. Wohns, Chaimaa Fadil, Patrick K. Albers, Gil McVean
Publish date:
2 September 2019
Journal:
Nature Genetics
PubMed ID:
31477934

Abstract

Inferring the full genealogical history of a set of DNA sequences is a core problem in evolutionary biology, because this history encodes information about the events and forces that have influenced a species. However, current methods are limited, and the most accurate techniques are able to process no more than a hundred samples. As datasets that consist of millions of genomes are now being collected, there is a need for scalable and efficient inference methods to fully utilize these resources. Here we introduce an algorithm that is able to not only infer whole-genome histories with comparable accuracy to the state-of-the-art but also process four orders of magnitude more sequences. The approach also provides an ‘evolutionary encoding’ of the data, enabling efficient calculation of relevant statistics. We apply the method to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the inferred genealogies are rich in biological signal and efficient to process.

Related projects

Each individual carries many rare genetic variants (frequency less than 1 in 1000), several of which affect gene function and may affect disease risk. Moreover,…

Institution:
University of Oxford, Great Britain

All projects