Modern DNA sequencing is used as a readout for diverse assays, with the count of aligned sequences (read depth) representing the quantitative signal for each underlying cellular phenomena. Existing data formats for quantitative genomics assays are, however, limited in either the analysis speeds they enable, the disk space they require or both. We have developed the dense depth data dump (D4) format and tool suite, with the goal of balancing improved analysis speeds with file size. The D4 format is adaptive in that it profiles a random sample of aligned sequence depth from the input sequence file to determine an optimal encoding that enables fast data access. We demonstrate that the D4 format offers substantial speed improvements over existing formats for random access, aggregation and summarization, while also achieving better or comparable file sizes. This performance enables scalable downstream analyses that would be otherwise difficult.
Subscribe to Journal
Get full journal access for 1 year
only $8.25 per issue
All prices are NET prices.
VAT will be added later in the checkout.
Tax calculation will be finalised during checkout.
Rent or Buy article
Get time limited or full article access on ReadCube.
All prices are NET prices.
The 1000 Genomes data were downloaded from http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20190425_NYGC_GATK/. Bulk RNA-seq samples were downloaded from the ENCODE project website (www.encode.org) and a list of URLs to download all samples used is available at the D4utils GitHub repository (https://github.com/38/d4-format). The WGS dataset is sample HG002 from the Genome in a Bottle Project, and the RNA-seq dataset is sample ENCFF976QSN from the ENCODE project16. Source data are provided with this paper.
The code underlying the software described and the data analyzed in this study are available at: D4utils, https://github.com/38/d4-format; D4 Rust API, https://docs.rs/d4; D4 Python API, https://github.com/38/pyd4. Code has also been deposited on Zenodo17 at https://zenodo.org/record/4684595#.YJQ272ZKidZ.
Sasani, T. A. et al. Large, three-generation human families reveal post-zygotic mosaicism and variability in germline mutation accumulation. Elife 8, e46922 (2019).
Robinson, M. D., McCarthy, D. J. & Smyth, G. K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010).
Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
2 Chromatin patterns at transcription factor binding sites. Nature https://doi.org/10.1038/nature28171 (2019).
Pedersen, B. S., Collins, R. L., Talkowski, M. E. & Quinlan, A. R. Indexcov: fast coverage quality control for whole-genome sequencing. Gigascience 6, 1–6 (2017).
Kent, W. J., Zweig, A. S., Barber, G., Hinrichs, A. S. & Karolchik, D. bigWig and bigBed: enabling browsing of large distributed datasets. Bioinformatics 26, 2204–2207 (2010).
Frequently asked questions: data file formats. Genome Browser https://genome.ucsc.edu/FAQ/FAQformat.html (2021).
Koranne, S. Handbook of Open Source Tools 191–200 (Springer, 2011); https://doi.org/10.1007/978-1-4419-7719-9_10
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Fritz, M. H.-Y., Leinonen, R., Cochrane, G. & Birney, E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21, 734–740 (2011).
Shao, Z., Reppy, J. H. & Appel, A. W. Unrolling lists. SIGPLAN Lisp Pointers VII, 185–195 (1994).
Pedersen, B. S. & Quinlan, A. R. mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics 34, 867–868 (2018).
The SAM/BAM Format Specification Working Group Sequence Alignment/Map Format Specification (GitHub, 2021); http://samtools.github.io/hts-specs/SAMv1.pdf
Wang, Z., Weissman, T. & Milenkovic, O. smallWig: parallel compression of RNA-seq WIG files. Bioinformatics 32, btv561 (2015).
Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851 (2014).
ENCODE Project Consortium An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Hou, H., Quinlan, A. & Pedersen, B. Efficient analysis of quantitative genomics data with the D4 format. Zenodo https://doi.org/10.5281/ZENODO.4684595 (2021).
We acknowledge helpful comments from members of the Quinlan laboratory, as well as funding from the National Institutes of Health (NIH) grants HG006693, HG009141 and GM124355 awarded to A.Q.
The authors declare no competing interests.
Peer review information Nature Computational Science thanks Christopher Lee, Mikel Hernaez, Christoph Lange and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Fernando Chirigati, in collaboration with the Nature Computational Science team.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Hou, H., Pedersen, B. & Quinlan, A. Balancing efficient analysis and storage of quantitative genomics data with the D4 format and d4tools. Nat Comput Sci 1, 441–447 (2021). https://doi.org/10.1038/s43588-021-00085-0