
Performance assessment of DNA sequencing platforms in the ABRF Next-Generation Sequencing Study

Abstract

Assessing the reproducibility, accuracy and utility of massively parallel DNA sequencing platforms remains an ongoing challenge. Here the Association of Biomolecular Resource Facilities (ABRF) Next-Generation Sequencing Study benchmarks the performance of a set of sequencing instruments (HiSeq/NovaSeq/paired-end 2 × 250-bp chemistry, Ion S5/Proton, PacBio circular consensus sequencing (CCS), Oxford Nanopore Technologies PromethION/MinION, BGISEQ-500/MGISEQ-2000 and GS111) on human and bacterial reference DNA samples. Among short-read instruments, HiSeq 4000 and X10 provided the most consistent, highest genome coverage, while BGI/MGISEQ provided the lowest sequencing error rates. The long-read instrument PacBio CCS had the highest reference-based mapping rate and lowest non-mapping rate. The two long-read platforms PacBio CCS and PromethION/MinION showed the best sequence mapping in repeat-rich areas and across homopolymers. NovaSeq 6000 using 2 × 250-bp read chemistry was the most robust instrument for capturing known insertion/deletion events. This study serves as a benchmark for current genomics technologies, as well as a resource to inform experimental design and next-generation sequencing variant calling.


Fig. 1: Experimental design and mapping results.
Fig. 2: Distribution of genomic coverage across sequencing technologies for all replicates.
Fig. 3: Estimating rates of sequencing error per platform.
Fig. 4: Validating SNPs and INDEL events from short-read datasets against the GIAB high-confidence truth set as determined by RTG vcfeval.
Fig. 5: Assessing variability for the son (HG002) across HiSeq X10, 2000 and 4000, platforms that had more than one replicate per cell line to enable this analysis.
Fig. 6: Reproducibility of sequencing of bacterial genomes in a complex metagenomic mixture.

Data availability

The genomes sequenced in this study are available as EBV-immortalized B-lymphocyte cell lines (from Coriell) and as extracted DNA (from Coriell and NIST). The data in this study were derived from the batch of DNA from the NIST Reference Materials. All data generated within this study from these genomes are publicly available on the NCBI Sequence Read Archive under BioProject PRJNA646948, accessions SRR12898279–SRR12898354.
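For readers retrieving the data, the accession range above can be expanded into individual run IDs programmatically. This helper is an illustrative sketch, not part of the study's code; the resulting accessions can then be fetched with standard SRA tooling.

```python
def expand_srr_range(start: str, end: str) -> list[str]:
    """Expand an SRA run range, e.g. SRR12898279..SRR12898354,
    into the full list of individual run accessions."""
    prefix = start[:3]                      # "SRR"
    lo, hi = int(start[3:]), int(end[3:])   # numeric parts of the range
    return [f"{prefix}{n}" for n in range(lo, hi + 1)]

# The range stated in the Data availability section:
runs = expand_srr_range("SRR12898279", "SRR12898354")
print(len(runs), runs[0], runs[-1])  # 76 runs in total
```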

Code availability

All code used within this study is publicly available at https://www.github.com/jfoox/abrfngs2. This repository includes directories containing scripts for primary analyses such as alignment and variant calling (SLURM/), shell scripts to perform post-processing calculations (bin/) and R scripts used to create figures (Rmds/). All tables used to generate figures are provided in a tables/ directory.


Acknowledgements

We thank Illumina and ThermoFisher for providing reagents allowing the study to take place. We also thank NIST for providing the GIAB DNA samples necessary to carry out the study. We acknowledge the HudsonAlpha Institute of Biotechnology for expert assistance in Illumina DNA library preparation. The Association of Biomolecular Resource Facilities (ABRF) also provided funding, logistical support and project oversight. We thank the ABRF NGS Study members, who contributed to the design and execution of this project. We are particularly grateful for the assistance provided by multiple core facilities spending their own time and resources to participate in this research. We thank the Epigenomics Core Facility and Scientific Computing Unit at Weill Cornell Medicine, as well as the Starr Cancer Consortium (I9-A9-071), and acknowledge funding from the Irma T. Hirschl and Monique Weill-Caulier Charitable Trusts, Bert L and N Kuggie Vallee Foundation, the WorldQuant Foundation, The Pershing Square Sohn Cancer Research Alliance, NASA (NNX14AH50G, NNX17AB26G), the National Institutes of Health (R25EB020393, R01NS076465, R01AI125416, R01ES021006, 1R21AI129851, 1R01MH117406), the Bill and Melinda Gates Foundation (OPP1151054), TRISH (NNX16AO69A:0107, NNX16AO69A:0061), the Leukemia and Lymphoma Society grants (LLS 9238-16, Mak, LLS-MCL-982, Chen-Kiang) and the Alfred P. Sloan Foundation (G-2015-13964). Certain commercial equipment, instruments or materials are identified to adequately specify experimental conditions or reported results. Such identification implies neither recommendation nor endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments or materials identified are necessarily the best available for the purpose. F.J.S. and M.M. are supported by the NIH (UM1 HG008898).

Author information

Affiliations

Authors

Contributions

C.E.M., S.W.T., C.M.N. and D.A.B. conceived and designed the study. C.E.M., A.A., S.W.T., Z.T.H., W.F., G.S.G., S.L., P.K.L., D.W., X.Z., W.Z., F.T., Y.Z., J.X., J.J. and H.L. implemented the protocols. J.M.Z., W.E.C., M.B.-B. and G.N. assisted with analysis design. J.F. aggregated and processed data, led data analysis and figure generation, and wrote the manuscript. F.J.S., W.E.C., M.B.-B., G.N., M.M.K., M.M. and S.W.T. performed data analysis, figure generation and manuscript editing. G.P.S. performed experimental planning, support and data analysis.

Corresponding authors

Correspondence to Fritz J. Sedlazeck or Don A. Baldwin or Christopher E. Mason.

Ethics declarations

Competing interests

G.P.S. is employed by Illumina Inc. X.Z., W.Z., F.T., Y.Z. and H.L. are employees of MGI Inc. All other authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Quality Control and Decoy Capture.

(a) Insert size distribution of every replicate, stratified by sequencing instrument. (b) Percentage of total reads mapped to decoy contigs within the GRCh38 reference genome.

Extended Data Fig. 2 Normalized Genomic Coverage.

Heatmap showing the distribution of read counts per library (rows) by GC content (columns) across human whole-genome and exome samples. Read count values are normalized by total reads per replicate, such that a value of 1 corresponds to the maximum value for a given replicate. Annotation tracks on the right indicate the sequencing platform and cell line genome for that replicate.
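The per-replicate scaling described in this caption can be sketched in a few lines. This is an illustrative helper under the stated normalization (each replicate's GC-binned read counts scaled so its maximum bin equals 1), not the study's actual pipeline code:

```python
def normalize_per_replicate(counts: dict[str, list[float]]) -> dict[str, list[float]]:
    """Scale each replicate's per-GC-bin read counts so its maximum bin = 1.

    `counts` maps replicate name -> read counts per GC-content bin.
    """
    out = {}
    for rep, bins in counts.items():
        peak = max(bins)
        # Guard against empty replicates (all-zero bins).
        out[rep] = [b / peak if peak > 0 else 0.0 for b in bins]
    return out

demo = {"rep1": [10, 40, 20], "rep2": [5, 5, 25]}
print(normalize_per_replicate(demo))
```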

Extended Data Fig. 3 All-versus-all Genomic Coverage Comparison.

Comparisons for every platform within each UCSC RepeatMasker region. Blue bars indicate >50% of shared sites are better represented in the given platform (column) versus all other platforms (rows). Red bars indicate that the other platform out-covered the given platform.

Extended Data Fig. 4 Variant Detection by Context.

Precision and sensitivity scores derived from RTG vcfeval analysis, stratified by regions in (a) the ClinVar database and (b) the OMIM database. For each cell line, genes from each database were overlapped with high-confidence regions for variant calling. (c) Scores stratified by exome regions, as defined by the AmpliSeq target capture regions file. For each cell line, exome regions were overlapped with high-confidence regions for variant calling.

Extended Data Fig. 5 Genomic Variant Heatmap.

Heatmap of the genotype (GT) of variant alleles on chromosome 1, across all human replicates within sequencing platforms, as measured against the Genome in a Bottle high-confidence variant call sets for each genome. Heterozygous variant alleles are shaded in orange (0.5), homozygous variants in red (1), missing data in blue (0) and inapplicable sites (sites outside the GIAB high-confidence region in one cell line but present in another) in gray. Hierarchical clustering reveals strong grouping by cell line, followed by less distinct grouping within platforms and across inter- and intra-laboratory replicates.
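The color encoding described in this caption can be sketched as a mapping from VCF GT fields to heatmap values. The helper below and its handling of genotype separators are illustrative assumptions, not the study's code; homozygous-reference sites map to 0 alongside missing data for simplicity:

```python
def gt_to_heatmap_value(gt: str) -> float:
    """Map a VCF GT field to the caption's heatmap encoding:
    heterozygous (e.g. 0/1) -> 0.5, homozygous alt (1/1) -> 1.0,
    missing (./.) -> 0.0. Phased '|' separators are treated like '/'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return 0.0                      # missing data
    alt = sum(1 for a in alleles if a != "0")
    if alt == len(alleles):
        return 1.0                      # homozygous variant
    return 0.5 if alt > 0 else 0.0      # heterozygous, else hom-ref

print([gt_to_heatmap_value(g) for g in ["0/1", "1|1", "./.", "0/0"]])
```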

Extended Data Fig. 6 Mendelian Violation Detection Per Context.

UpSet intersections of Mendelian violations. Each plot is stratified by variant type (SNPs on top, followed by INDELs; INS_5 = insertions 0-5 bp in size, INS_6to15 = insertions 6 to 15 bp in size, INS_15 = insertions >15 bp in size; same for deletions, ‘DEL’). Events were recorded within high confidence regions for the Ashkenazi Son (HG002).
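The size bins named in this caption (INS_5, INS_6to15, INS_15, and the DEL equivalents) can be expressed as a small classifier. The function name and the REF/ALT length convention are illustrative assumptions, not the study's code:

```python
def indel_bin(ref: str, alt: str) -> str:
    """Assign the caption's size bins from a variant's REF and ALT alleles:
    *_5 = 1-5 bp, *_6to15 = 6-15 bp, *_15 = >15 bp, for INS and DEL alike.
    Length-neutral changes are labeled SNP for this sketch."""
    diff = len(alt) - len(ref)
    if diff == 0:
        return "SNP"
    kind = "INS" if diff > 0 else "DEL"
    size = abs(diff)
    if size <= 5:
        return f"{kind}_5"
    if size <= 15:
        return f"{kind}_6to15"
    return f"{kind}_15"

print(indel_bin("A", "ACGT"), indel_bin("ACGTACGTAC", "A"))
```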

Extended Data Fig. 7 Structural Variants per Instrument.

Comparison between the identified SVs in the six replicates from long-read sequencing instruments, showing agreement of 6,980 SVs between samples (green column).

Extended Data Fig. 8 Structural Variant Metrics.

Coverage, insert size, and read length mean and standard deviation across total SVs in sequencing runs.

Extended Data Fig. 9 SV Agreement between Callers and Instruments.

(a) SV variability by caller. Shown first is the strategy used to examine SV caller variability after stratifying for platform, replicate and center variability; next, the SV call set sizes and their overlap with the GIAB SV call set for the HG002 caller-variability set; finally, the types and sizes of SVs in that set (translocations are set to size 50 by default by SURVIVOR for visualization purposes). (b) SV variability by platform. The first diagrams use sequencing runs from HiSeq X10, HiSeq 2000 and HiSeq 4000; the final two panels additionally include NovaSeq, BGI and MGI for visualization purposes. Shown first is the strategy used to examine platform variability after stratifying for SV caller, center and replicate variability; next, the SV call set sizes and overlaps with the GIAB SV call set for the HG002 platform-variability set; next, the types and sizes of SVs in that set. The NovaSeq, BGI and MGI call sets were not integrated into the analysis strategy because sequencing runs with replicates for each sample at different centers on different platforms were not available. In the final two panels, the top shows SV call set sizes and overlap with the GIAB SV call set, and the bottom shows the types and sizes of SVs (translocations are again set to size 50 by SURVIVOR for visualization purposes).

Extended Data Fig. 10 Metagenomic Bacterial Sequencing Distribution.

(a) Heatmap showing the distribution of read counts per library (rows) by GC content (columns) across bacterial genomes and the metagenomic mixture. Read count values are normalized by total reads per replicate, such that a value of 1 corresponds to the maximum value for a given replicate. Annotation tracks on the right indicate the sequencing platform and genome for that replicate. (b) Entropy per genome/metagenomic mixture. Entropy was measured across all GC windows for all replicates of a given sample as rowSums(-(p * log(p))).
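The R expression rowSums(-(p * log(p))) in panel (b) is Shannon entropy with the natural logarithm. A plain-Python sketch, assuming p holds a replicate's normalized read-count fractions across GC windows (zero bins are skipped, since -p*log(p) tends to 0 as p tends to 0):

```python
import math

def gc_entropy(p: list[float]) -> float:
    """Shannon entropy (natural log) over GC-window fractions,
    mirroring the R expression rowSums(-(p * log(p)))."""
    return sum(-x * math.log(x) for x in p if x > 0)

# A uniform distribution over 4 windows gives entropy log(4) ~ 1.386294.
print(round(gc_entropy([0.25, 0.25, 0.25, 0.25]), 6))
```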

Supplementary information

Supplementary Information

Supplementary Methods, Results and Tables 1–9.

Reporting Summary

Supplementary Data 1

FastQC reports.

Supplementary Data 2

Satellite regions.


About this article


Cite this article

Foox, J., Tighe, S.W., Nicolet, C.M. et al. Performance assessment of DNA sequencing platforms in the ABRF Next-Generation Sequencing Study. Nat Biotechnol 39, 1129–1140 (2021). https://doi.org/10.1038/s41587-021-01049-5

