Author(s):
Rick Wertenbroek, Simone Rubinacci, Ioannis Xenarios, Yann Thoma, Olivier Delaneau
Publish date:
24 June 2022
Journal:
Bioinformatics
PubMed ID:
35748697

Abstract

MOTIVATION: Generation of genotype data has been growing exponentially over the last decade. The large size of recent datasets brings a storage and computational burden with ever-increasing costs. To reduce this burden, we propose XSI, a file format with a reduced storage footprint that also allows computation on the compressed data, and we show how this can improve future analyses.

RESULTS: We show that xSqueezeIt (XSI) allows for a file size reduction of 4-20× compared with compressed BCF and demonstrate its potential for ‘compressive genomics’ on the UK Biobank whole-genome sequencing genotypes, with 8× faster loading times, 5× faster runs-of-homozygosity computation, 30× faster dot product computation and 280× faster allele counts.

AVAILABILITY AND IMPLEMENTATION: The XSI file format specifications, API and command line tool are released under an open-source (MIT) license and are available at https://github.com/rwk-unil/xSqueezeIt.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Related projects

With the decreasing cost of DNA sequencing, large databases of human genomes are being collected in order to boost health-related research, leading to the…

Institution:
University of Lausanne, Switzerland
