  • Protocol

Use of synthetic DNA spike-in controls (sequins) for human genome sequencing

Abstract

Next-generation sequencing (NGS) has been widely adopted to identify genetic variants and investigate their association with disease. However, the analysis of sequencing data remains challenging because of the complexity of human genetic variation and confounding errors introduced during library preparation, sequencing and analysis. We have developed a set of synthetic DNA spike-ins—termed ‘sequins’ (sequencing spike-ins)—that are directly added to DNA samples before library preparation. Sequins can be used to measure technical biases and to act as internal quantitative and qualitative controls throughout the sequencing workflow. This step-by-step protocol explains the use of sequins for both whole-genome and targeted sequencing of the human genome. This includes instructions regarding the dilution and addition of sequins to human DNA samples, followed by the bioinformatic steps required to separate sequin- and sample-derived sequencing reads and to evaluate the diagnostic performance of the assay. These practical guidelines are accompanied by a broader discussion of the conceptual and statistical principles that underpin the design of sequin standards. This protocol is suitable for users with standard laboratory and bioinformatic experience. The laboratory steps require ~1–4 d and the bioinformatic steps (which can be performed with the provided example data files) take an additional day.

Fig. 1: Schematic showing the design and use of sequins in NGS experiments.
Fig. 2: Sequin design principles.
Fig. 3: Compatibility of sequins with targeted sequencing.
Fig. 4: Overview of protocol for sequin use in human genome sequencing.
Fig. 5: Calibration of sequin coverage to matched human genome regions.
Fig. 6: Example traces measuring DNA fragment size and abundance.
Fig. 7: Example qPCR assessment of target enrichment for ALK, BRAF, PIK3CA, PTEN and TP53.
Fig. 8: Example of sequin and corresponding human variants.
Fig. 9: Alignment-free comparison of quantitative accuracy between libraries.
Fig. 10: Impact of sequence context on NGS performance.
Fig. 11: Performance evaluation of somatic variant calling by anaquin.
Fig. 12: Comparison of expected human and sequin variants analyzed by targeted sequencing.

Data availability

All next-generation sequencing libraries and associated data files, including synthetic sequences and variant annotations, are available for download at http://www.sequinstandards.com/resources/#nature_protocols. Please see the ‘Equipment setup’ section and Supplementary Notes 1 and 2 for further details.

Code availability

Anaquin source code is available from https://github.com/sequinstandards/RAnaquin.

Acknowledgements

The authors would like to thank the following funding sources: Australian National Health and Medical Research Council (NHMRC) Australia Fellowships (1062470 to T.R.M.), APP1108254 (to B.S.K.) and APP1114016 (to J.B.). I.W.D. is supported by a Cancer Institute NSW Early Career Fellowship (2018/ECF013). T.R.M. and T.W. are supported by a Paramor Family Fellowship. S.A.H. is supported by an Australian Postgraduate Award scholarship. A.L.M.R. is supported by a University of New South Wales Sydney Tuition Fee Scholarship. The contents of the published material are solely the responsibility of the administering institution, a participating institution or individual authors and do not reflect the views of the NHMRC.

Author information

Contributions

J.B., B.S.K. and C.B. contributed materials. J.B. performed the experiments. T.W., I.W.D., S.A.H. and A.L.M.R. carried out the bioinformatic analysis. J.B., T.W., I.W.D. and T.R.M. wrote the manuscript. All authors conceived the study and contributed to manuscript preparation.

Corresponding authors

Correspondence to Ira W. Deveson or Tim R. Mercer.

Ethics declarations

Competing interests

The authors declare competing interests: the Garvan Institute of Medical Research has filed patents covering aspects of sequencing controls.

Additional information

Journal peer review information: Nature Protocols thanks Justin Zook and other anonymous reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Related links

Key references using this protocol

Deveson, I. W. et al. Nat. Commun. 10, 1342 (2019): https://doi.org/10.1038/s41467-019-09272-0

Hardwick, S. A., Deveson, I. W. & Mercer, T. R. Nat. Rev. Genet. 18, 473–484 (2017): https://doi.org/10.1038/nrg.2017.44

Deveson, I. W. et al. Nat. Methods 13, 784–791 (2016): https://doi.org/10.1038/nmeth.3957

Integrated supplementary information

Supplementary Figure 1 Example of sequin calibration.

Genome browser views show sequencing alignments within a single sequin standard before (upper) and after (middle) coverage calibration, performed using anaquin ‘calibrate’. During calibration, sequin alignments are down-sampled to achieve coverage matched to that of the human sample DNA (lower) within sequin regions. This example also shows artifactual enrichment of read-pairs at sequin termini, which occurs with some library preparation methods. Anaquin ‘calibrate’ automatically removes these terminal alignments before calibration. Sequin edge regions (550 bp, by default) are also excluded during the calibration process, as well as during downstream anaquin analyses (germline/somatic).
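The calibration idea described in this caption can be illustrated with a minimal Python sketch. This is not anaquin's implementation: the function names, the representation of alignments as half-open (start, end) tuples, and the assumption that sequin and human alignments are expressed in the same coordinates for the matched region are all hypothetical choices made for illustration. Only the 550-bp default edge width comes from the caption.

```python
import random

EDGE_BP = 550  # default edge width quoted in the caption


def interior(region, edge=EDGE_BP):
    """Drop `edge` bp from each end of a (start, end) region; None if nothing is left."""
    start, end = region
    inner = (start + edge, end - edge)
    return inner if inner[0] < inner[1] else None


def mean_coverage(reads, region):
    """Mean per-base depth over a half-open region, from (start, end) alignments."""
    r_start, r_end = region
    covered = sum(max(0, min(e, r_end) - max(s, r_start)) for s, e in reads)
    return covered / (r_end - r_start)


def subsample_fraction(sequin_reads, human_reads, region, edge=EDGE_BP):
    """Fraction of sequin alignments to keep so that interior coverage matches the human sample."""
    inner = interior(region, edge)
    if inner is None:
        return 0.0  # region too short once edges are excluded
    sequin_cov = mean_coverage(sequin_reads, inner)
    if sequin_cov == 0:
        return 0.0  # nothing to down-sample
    # Never up-sample: cap the fraction at 1.0.
    return min(1.0, mean_coverage(human_reads, inner) / sequin_cov)


def calibrate(sequin_reads, human_reads, region, edge=EDGE_BP, seed=0):
    """Randomly keep each sequin alignment with probability equal to the fraction above."""
    fraction = subsample_fraction(sequin_reads, human_reads, region, edge)
    rng = random.Random(seed)  # seeded so that the sketch is reproducible
    return [read for read in sequin_reads if rng.random() < fraction]


# Toy example: sequin coverage is four times the human coverage in the interior.
sequin = [(600, 1400)] * 100
human = [(600, 1400)] * 25
print(subsample_fraction(sequin, human, (0, 2000)))
print(len(calibrate(sequin, human, (0, 2000))))
```

Because each alignment is kept independently at random, the calibrated coverage matches the human coverage only on average. A real workflow would operate on aligned read pairs in BAM files rather than on tuples.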

Supplementary information

Supplementary Figure 1

Reporting Summary

Supplementary Information

Supplementary Notes 1 and 2

About this article

Cite this article

Blackburn, J., Wong, T., Madala, B.S. et al. Use of synthetic DNA spike-in controls (sequins) for human genome sequencing. Nat Protoc 14, 2119–2151 (2019). https://doi.org/10.1038/s41596-019-0175-1

  • DOI: https://doi.org/10.1038/s41596-019-0175-1
