Assessing the reproducibility, accuracy and utility of massively parallel DNA sequencing platforms remains an ongoing challenge. Here the Association of Biomolecular Resource Facilities (ABRF) Next-Generation Sequencing Study benchmarks the performance of a set of sequencing instruments (HiSeq/NovaSeq/paired-end 2 × 250-bp chemistry, Ion S5/Proton, PacBio circular consensus sequencing (CCS), Oxford Nanopore Technologies PromethION/MinION, BGISEQ-500/MGISEQ-2000 and GS111) on human and bacterial reference DNA samples. Among short-read instruments, HiSeq 4000 and X10 provided the most consistent, highest genome coverage, while BGI/MGISEQ provided the lowest sequencing error rates. The long-read instrument PacBio CCS had the highest reference-based mapping rate and lowest non-mapping rate. The two long-read platforms PacBio CCS and PromethION/MinION showed the best sequence mapping in repeat-rich areas and across homopolymers. NovaSeq 6000 using 2 × 250-bp read chemistry was the most robust instrument for capturing known insertion/deletion events. This study serves as a benchmark for current genomics technologies, as well as a resource to inform experimental design and next-generation sequencing variant calling.
The genome sequences in this study are available as EBV-immortalized B lymphocyte cell lines (from Coriell) as well as from DNA (from Coriell and NIST). The data in this study were derived from the batch of DNA from the NIST Reference Materials. All data generated within this study from these genomes are publicly available on the NCBI Sequence Read Archive under the BioProject PRJNA646948, within accessions SRR12898279–SRR12898354.
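As a convenience for retrieving these runs, the accession range above can be expanded into individual run identifiers. The following is a minimal sketch; it assumes that every number in the stated range corresponds to a run in this BioProject, and the helper name `sra_accessions` is hypothetical and not part of the study's repository.

```python
def sra_accessions(first: str, last: str) -> list[str]:
    """Expand an inclusive SRA run-accession range into a list of accessions.

    Assumption: the range is contiguous and both ends share a 3-letter prefix.
    """
    prefix = first[:3]
    if last[:3] != prefix:
        raise ValueError("both accessions must share the same prefix")
    start, end = int(first[3:]), int(last[3:])
    if end < start:
        raise ValueError("range end precedes range start")
    width = len(first) - len(prefix)  # preserve any zero padding
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


runs = sra_accessions("SRR12898279", "SRR12898354")
print(len(runs), runs[0], runs[-1])  # 76 SRR12898279 SRR12898354
```

The resulting list could then be passed, one accession at a time, to whichever SRA download tool is in use.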
All code used within this study is publicly available at https://www.github.com/jfoox/abrfngs2. This repository includes directories containing scripts for primary analyses such as alignment and variant calling (SLURM/), shell scripts to perform post-processing calculations (bin/) and R scripts used to create figures (Rmds/). All tables used to generate figures are provided in a tables/ directory.
We thank Illumina and ThermoFisher for providing reagents allowing the study to take place. We also thank NIST for providing the GIAB DNA samples necessary to carry out the study. We acknowledge the HudsonAlpha Institute of Biotechnology for expert assistance in Illumina DNA library preparation. The Association of Biomolecular Resource Facilities (ABRF) also provided funding, logistical support and project oversight. We thank the ABRF NGS Study members, who contributed to the design and execution of this project. We are particularly grateful for the assistance provided by multiple core facilities spending their own time and resources to participate in this research. We thank the Epigenomics Core Facility and Scientific Computing Unit at Weill Cornell Medicine, as well as the Starr Cancer Consortium (I9-A9-071), and acknowledge funding from the Irma T. Hirschl and Monique Weill-Caulier Charitable Trusts, Bert L and N Kuggie Vallee Foundation, the WorldQuant Foundation, The Pershing Square Sohn Cancer Research Alliance, NASA (NNX14AH50G, NNX17AB26G), the National Institutes of Health (R25EB020393, R01NS076465, R01AI125416, R01ES021006, 1R21AI129851, 1R01MH117406), the Bill and Melinda Gates Foundation (OPP1151054), TRISH (NNX16AO69A:0107, NNX16AO69A:0061), the Leukemia and Lymphoma Society grants (LLS 9238-16, Mak, LLS-MCL-982, Chen-Kiang) and the Alfred P. Sloan Foundation (G-2015-13964). Certain commercial equipment, instruments or materials are identified to adequately specify experimental conditions or reported results. Such identification implies neither recommendation nor endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. F.J.S. and M.M. are supported by the NIH (UM1 HG008898).
G.P.S. is employed by Illumina Inc. X.Z., W.Z., F.T., Y.Z. and H.L. are employees of MGI Inc. All other authors declare no competing interests.
Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
(a) The insert size distribution of every replicate, stratified by sequencing instrument. (b) The percentage of total reads that were mapped to decoy contigs within the GRCh38 reference genome.
Heatmap showing the distribution of read counts per library (rows) by GC content (columns) across human whole genome and exome samples. Read count values are normalized by total reads per replicate, such that a value of 1 matches maximum value for a given replicate. Annotation tracks on the right indicate the sequencing platform and cell line genome for that replicate.
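The normalization described in this caption can be read in more than one way. The sketch below shows one plausible interpretation: each replicate's per-GC-bin read counts are divided by that replicate's total reads and then rescaled so that the largest bin equals 1. The function name `normalize_gc_profile` and the input counts are hypothetical, and this is not necessarily the exact procedure used in the study's R scripts.

```python
def normalize_gc_profile(counts: list[float]) -> list[float]:
    """Scale one replicate's per-GC-bin read counts so the peak bin equals 1.

    Assumed procedure: convert counts to fractions of total reads,
    then divide by the largest fraction.
    """
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)  # no reads: nothing to scale
    fractions = [c / total for c in counts]
    peak = max(fractions)
    return [f / peak for f in fractions]


# Hypothetical read counts for three GC-content bins in one replicate.
profile = normalize_gc_profile([10, 40, 20])
print(profile)
```

Because each row is scaled to its own maximum, heatmap rows are comparable in shape even when replicates differ greatly in sequencing depth.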
Comparisons for every platform within each UCSC RepeatMasker region. Blue bars indicate >50% of shared sites are better represented in the given platform (column) versus all other platforms (rows). Red bars indicate that the other platform out-covered the given platform.
Precision and sensitivity scores as derived from rtg vcfeval analysis, stratified by regions in (a) the CLINVAR database and (b) the OMIM database. For each of the cell lines, genes from each database were overlapped with high confidence regions for variant calling. (c) Scores stratified by regions in the exome, as defined by the AmpliSeq target capture regions file. For each of the cell lines, exomic regions were overlapped with high confidence regions for variant calling.
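For readers less familiar with these metrics, the sketch below shows how precision and sensitivity follow from true-positive, false-positive and false-negative counts. rtg vcfeval reports separate true-positive counts for the baseline and call representations, so the sketch keeps them as separate arguments; the function name and the example counts are hypothetical and are not results from this study.

```python
def precision_sensitivity(tp_baseline: int, tp_call: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, sensitivity) from variant comparison counts.

    precision   = TP_call / (TP_call + FP)
    sensitivity = TP_baseline / (TP_baseline + FN)
    """
    called = tp_call + fp
    truth = tp_baseline + fn
    precision = tp_call / called if called else 0.0
    sensitivity = tp_baseline / truth if truth else 0.0
    return precision, sensitivity


# Hypothetical counts for one replicate within one stratification region.
precision, sensitivity = precision_sensitivity(tp_baseline=90, tp_call=90, fp=10, fn=30)
print(round(precision, 3), round(sensitivity, 3))  # 0.9 0.75
```

Restricting the comparison to a region set (for example, database genes overlapped with high confidence regions) changes which variants are counted, not the formulas.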
Heatmap of genotype (GT) of variant alleles on chromosome 1, across all human replicates within sequencing platforms, as measured against the Genome in a Bottle high confidence variant call sets for each genome. Heterozygous variant alleles are shaded in orange (0.5), homozygous variants in red (1), missing data in blue (0), and inapplicable sites (sites outside of the GIAB high confidence region in one cell line but present in another) in gray. Hierarchical clustering reveals strong grouping by cell line, followed by less clear grouping within platforms and inter- and intra-lab replicates.
UpSet intersections of Mendelian violations. Each plot is stratified by variant type (SNPs on top, followed by INDELs; INS_5 = insertions 0-5 bp in size, INS_6to15 = insertions 6 to 15 bp in size, INS_15 = insertions >15 bp in size; same for deletions, ‘DEL’). Events were recorded within high confidence regions for the Ashkenazi Son (HG002).
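The size bins named in this caption can be illustrated with a small classifier over VCF-style REF/ALT alleles. This is a sketch under stated assumptions: event size is taken as the difference in allele lengths, deletion labels mirror the insertion labels as the caption indicates, and the `OTHER` label for equal-length multi-base substitutions is hypothetical and does not appear in the source.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Assign a variant to SNP or to a size-binned insertion/deletion class."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    size = len(alt) - len(ref)
    if size == 0:
        return "OTHER"  # hypothetical label for multi-base substitutions
    kind = "INS" if size > 0 else "DEL"
    size = abs(size)
    if size <= 5:
        return f"{kind}_5"
    if size <= 15:
        return f"{kind}_6to15"
    return f"{kind}_15"  # events larger than 15 bp


print(classify_variant("A", "G"))            # SNP
print(classify_variant("A", "ATTT"))         # INS_5
print(classify_variant("ACGTACGTAC", "A"))   # DEL_6to15
print(classify_variant("A", "A" + "T" * 20)) # INS_15
```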
Comparison between the identified SVs in the six replicates from long-read sequencing instruments, showing agreement of 6,980 SVs between samples (green column).
Coverage, insert size, and read length mean and standard deviation across total SVs in sequencing runs.
(a) Insights into SV variability by caller. First the strategy used to examine SV caller variability after stratifying for platforms, replicates and centers variability; next the SV call set sizes and overlap with the GIAB SV call set for the SV caller variability set of HG002; finally the types and sizes of SVs in the SV caller variability set of HG002 (translocations are set to size 50 by default in the SURVIVOR parameters for visualization purposes). (b) Insights into SV variability by platform. Diagrams utilize sequencing runs from HiSeqX10, HiSeq2000 and HiSeq4000 while the final two characterize all platforms available. First the strategy used to examine platform variability after stratifying for SV callers, centers and replicates variability; next, SV call set sizes and overlaps with the GIAB SV call set for the platform variability SV call set of HG002; next, types and sizes of SVs in the platform variability SV call set of HG002. Final two panels include HiSeqX10, HiSeq2000, HiSeq4000, NovaSeq, BGI and MGI for visualization purposes. The NovaSeq, BGI and MGI SV call sets were not integrated into the analysis strategy because sequencing runs with replicates for each sample at different centers on different platforms were not available. On top, SV call set sizes and overlap with the GIAB SV call set for the platform variability SV call set of HG002. Below, types and sizes of SVs in the platform variability SV call set of HG002. (Translocations are set to size 50 by default in the SURVIVOR parameters for visualization purposes.)
(a) Heatmap showing the distribution of read counts per library (rows) by GC content (columns) across bacterial genomes and the metagenomic mixture. Read count values are normalized by total reads per replicate, such that a value of 1 matches maximum value for a given replicate. Annotation tracks on the right indicate the sequencing platform and cell line genome for that replicate. (b) Calculations of entropy per genome/metagenomic mixture. Entropy was measured across all GC windows for all replicates for a given sample, rowSums(-(p * log(p))).
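The entropy expression in panel (b) is Shannon entropy computed per row. A minimal sketch for a single replicate follows; it assumes that p is the fraction of that replicate's reads falling in each GC window and uses the natural logarithm, which is the default for `log` in R. The function name `gc_entropy` and the input counts are hypothetical.

```python
import math


def gc_entropy(counts: list[float]) -> float:
    """Shannon entropy (natural log) of one replicate's reads across GC windows."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Empty windows contribute nothing, since p * log(p) tends to 0 as p tends to 0.
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


# Reads spread evenly over four windows give the maximum entropy, log(4).
even = gc_entropy([25, 25, 25, 25])
# Reads concentrated in a single window give zero entropy.
peaked = gc_entropy([100, 0, 0, 0])
print(even, peaked)
```

Higher values indicate reads spread more evenly across GC windows, while lower values indicate reads concentrated in a narrow GC range.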
Cite this article
Foox, J., Tighe, S.W., Nicolet, C.M. et al. Performance assessment of DNA sequencing platforms in the ABRF Next-Generation Sequencing Study. Nat Biotechnol 39, 1129–1140 (2021). https://doi.org/10.1038/s41587-021-01049-5