Scalable formats for storing and exchanging genetic variation data
Approved Research ID: 57353
Approval date: October 7th 2020
Comparing the genome sequences of humans is one of the primary ways in which researchers discover genes that are involved in disease and traits. The current approaches for storing and exchanging genetic variation information for human samples were designed almost ten years ago for small projects such as the 1000 Genomes Project. As part of that effort, the VCF format became the de facto standard and has served the community well. In recent years, however, the limitations of these formats have become apparent as research projects have grown and VCF is now used to store data from large studies involving the genetic data of tens of thousands of individuals. As designed, VCF files will become impossibly large to handle when the number of individuals nears 1 million.
To resolve this, an update to the VCF format needs to be developed. As members of the Global Alliance for Genomics and Health (GA4GH) consortium, we have developed several candidate updates but currently lack a single dataset to which all parties involved have access for testing our proposals. For the new format to become a standard, researchers from many different institutions around the world need to be able to see how it performs on the same data. The UK Biobank is the largest project to date that can provide the amount and type of data required for this experiment. We will use the UK Biobank data as a single resource, agreed upon by the different researchers, to develop a data format that the entire global community can accept. The new standard will unlock the potential to create and study even larger cohorts and empower scientists to make discoveries at a truly global scale.