Detecting ultralow-frequency mutations by Duplex Sequencing

A Corrigendum to this article was published on 20 November 2014

This article has been updated

Abstract

Duplex Sequencing (DS) is a next-generation sequencing methodology capable of detecting a single mutation among >1 × 10⁷ wild-type nucleotides, thereby enabling the study of heterogeneous populations and very-low-frequency genetic alterations. DS can be applied to any double-stranded DNA sample, but it is ideal for small genomic regions of <1 Mb in size. The method relies on the ligation of sequencing adapters harboring random yet complementary double-stranded nucleotide sequences to the sample DNA of interest. Individually labeled strands are then PCR-amplified, creating sequence 'families' that share a common tag sequence derived from the two original complementary strands. Mutations are scored only if the variant is present in the PCR families arising from both of the two DNA strands. Here we provide a detailed protocol for efficient DS adapter synthesis, library preparation and target enrichment, as well as an overview of the data analysis workflow. The protocol typically takes 1–3 d.

Figure 1: Overview of Duplex Sequencing.
Figure 2: Schematic of the basic computational workflow for Duplex Sequencing.
Figure 3: Quality control of the sequencing adapters at each step of synthesis.
Figure 4: Representative tag family size distributions.
Figure 5: Optimal peak family size.
Figure 6: Example Agilent TapeStation 2200 electropherograms.

Change history

  • 22 October 2014

     In the version of this article initially published online, the sequence for the MWS21 oligonucleotide was incorrectly listed in Table 3 as 5′-CCAGCAGAAGACGGCATACGAGATXXXXXXGTGACTGGAGTTCAGACGTGTGC-3′. The second nucleotide in the sequence should be A, not C, and the correct sequence is 5′-CAAGCAGAAGACGGCATACGAGATXXXXXXGTGACTGGAGTTCAGACGTGTGC-3′. The error has been corrected in the PDF and HTML versions of this article.

References

  1. Glenn, T.C. Field guide to next-generation DNA sequencers. Mol. Ecol. Resour. 11, 759–769 (2011).

  2. Loeb, L.A., Springgate, C.F. & Battula, N. Errors in DNA replication as a basis of malignant change. Cancer Res. 34, 2311–2321 (1974).

  3. Gerlinger, M. et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N. Engl. J. Med. 366, 883–892 (2012).

  4. Navin, N. et al. Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94 (2011).

  5. Bolli, N. et al. Heterogeneity of genomic evolution and mutational profiles in multiple myeloma. Nat. Commun. 5, 2997 (2014).

  6. Lawrence, M.S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214–218 (2013).

  7. Kreso, A. et al. Variable clonal repopulation dynamics influence chemotherapy response in colorectal cancer. Science 339, 543–548 (2013).

  8. Johnson, B.E. et al. Mutational analysis reveals the origin and therapy-driven evolution of recurrent glioma. Science 343, 189–193 (2014).

  9. Schmitt, M.W., Prindle, M.J. & Loeb, L.A. Implications of genetic heterogeneity in cancer. Ann. NY Acad. Sci. 1267, 110–116 (2012).

  10. Liu, Z. et al. Patterns of diversifying selection in the phytotoxin-like scr74 gene family of Phytophthora infestans. Mol. Biol. Evol. 22, 659–672 (2005).

  11. Srivatsan, A. et al. High-precision, whole-genome sequencing of laboratory strains facilitates genetic studies. PLoS Genet. 4, e1000139 (2008).

  12. Holt, K.E. et al. High-throughput sequencing provides insights into genome variation and evolution in Salmonella typhi. Nat. Genet. 40, 987–993 (2008).

  13. Loh, E., Salk, J.J. & Loeb, L.A. Optimization of DNA polymerase mutation rates during bacterial evolution. Proc. Natl. Acad. Sci. USA 107, 1154–1159 (2010).

  14. Budowle, B. & van Daal, A. Extracting evidence from forensic DNA analyses: future molecular biology directions. Biotechniques 46, 339–340, 342–350 (2009).

  15. Hofreiter, M., Jaenicke, V., Serre, D., von Haeseler, A. & Paabo, S. DNA sequences from multiple amplifications reveal artifacts induced by cytosine deamination in ancient DNA. Nucleic Acids Res. 29, 4793–4799 (2001).

  16. Knapp, M. & Hofreiter, M. Next-generation sequencing of ancient DNA: requirements, strategies and perspectives. Genes 1, 227–243 (2010).

  17. Carlson, C.A. et al. Decoding cell lineage from acquired mutations using arbitrary deep sequencing. Nat. Methods 9, 78–80 (2012).

  18. Besaratinia, A. et al. A high-throughput next-generation sequencing-based method for detecting the mutational fingerprint of carcinogens. Nucleic Acids Res. 40, e116 (2012).

  19. Schmitt, M.W. et al. Detection of ultra-rare mutations by next-generation sequencing. Proc. Natl. Acad. Sci. USA 109, 14508–14513 (2012).

  20. Kinde, I., Wu, J., Papadopoulos, N., Kinzler, K.W. & Vogelstein, B. Detection and quantification of rare mutations with massively parallel sequencing. Proc. Natl. Acad. Sci. USA 108, 9530–9535 (2011).

  21. Hiatt, J.B., Patwardhan, R.P., Turner, E.H., Lee, C. & Shendure, J. Parallel, tag-directed assembly of locally derived short sequence reads. Nat. Methods 7, 119–122 (2010).

  22. McCloskey, M.L., Stoger, R., Hansen, R.S. & Laird, C.D. Encoding PCR products with batch-stamps and barcodes. Biochem. Genet. 45, 761–767 (2007).

  23. Jabara, C.B., Jones, C.D., Roach, J., Anderson, J.A. & Swanstrom, R. Accurate sampling and deep sequencing of the HIV-1 protease gene using a primer ID. Proc. Natl. Acad. Sci. USA 108, 20166–20171 (2011).

  24. Kivioja, T. et al. Counting absolute numbers of molecules using unique molecular identifiers. Nat. Methods 9, 72–74 (2012).

  25. Casbon, J.A., Osborne, R.J., Brenner, S. & Lichtenstein, C.P. A method for counting PCR template molecules with application to next-generation sequencing. Nucleic Acids Res. 39, e81 (2011).

  26. Shiroguchi, K., Jia, T.Z., Sims, P.A. & Xie, X.S. Digital RNA sequencing minimizes sequence-dependent bias and amplification noise with optimized single-molecule barcodes. Proc. Natl. Acad. Sci. USA 109, 1347–1352 (2012).

  27. Zong, C., Lu, S., Chapman, A.R. & Xie, X.S. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338, 1622–1626 (2012).

  28. Wang, J., Fan, H.C., Behr, B. & Quake, S.R. Genome-wide single-cell analysis of recombination activity and de novo mutation rates in human sperm. Cell 150, 402–412 (2012).

  29. Wang, Y. et al. Clonal evolution in breast cancer revealed by single-nucleus genome sequencing. Nature 512, 155–160 (2014).

  30. Acevedo, A., Brodsky, L. & Andino, R. Mutational and fitness landscapes of an RNA virus revealed through population sequencing. Nature 505, 686–690 (2014).

  31. Lou, D.I. et al. High-throughput DNA sequencing errors are reduced by orders of magnitude using circle sequencing. Proc. Natl. Acad. Sci. USA 110, 19872–19877 (2013).

  32. Diegoli, T.M., Farr, M., Cromartie, C., Coble, M.D. & Bille, T.W. An optimized protocol for forensic application of the PreCR repair mix to multiplex STR amplification of UV-damaged DNA. Forensic Sci. Int. Genet. 6, 498–503 (2012).

  33. Kennedy, S.R., Salk, J.J., Schmitt, M.W. & Loeb, L.A. Ultra-sensitive sequencing reveals an age-related increase in somatic mitochondrial mutations that are inconsistent with oxidative damage. PLoS Genet. 9, e1003794 (2013).

  34. Yost, S.E. et al. Identification of high-confidence somatic mutations in whole genome sequence of formalin-fixed breast cancer specimens. Nucleic Acids Res. 40, e107 (2012).

  35. Kerick, M. et al. Targeted high throughput sequencing in clinical cancer settings: formaldehyde fixed-paraffin embedded (FFPE) tumor tissues, input amount and tumor heterogeneity. BMC Med. Genomics 4, 68 (2011).

  36. Spencer, D.H. et al. Comparison of clinical targeted next-generation sequence data from formalin-fixed and fresh-frozen tissue specimens. J. Mol. Diagn. 15, 623–633 (2013).

  37. Li, H. & Durbin, R. Fast and accurate short-read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

  38. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  39. McKenna, A. et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

  40. DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).

Acknowledgements

We thank D. Crispin and A. Lawrence for testing the protocol as written and for helping to clarify several important steps of the protocol. We also thank A. Herr for helpful comments and discussion. Support for this research was provided by the following US National Institutes of Health grants from the National Institute on Aging (NIA) and the National Cancer Institute (NCI): NIA P01-AG001751, NCI P01-CA77852, NCI R01-CA160674 and NCI R01-CA102029 to L.A.L. S.R.K. was further supported by the Genetic Approaches to Aging Training grant (NIA T32-AG000057).

Author information

Contributions

S.R.K., M.W.S., J.J.S. and E.J.F. developed the original protocol. S.R.K., M.W.S. and B.F.K. developed the data analysis software. S.R.K., M.W.S., E.J.F., M.J.P., E.H.A., J.-C.S., K.J.K. and R.-A.R. optimized the protocol. S.R.K., M.W.S., J.J.S., B.F.K. and L.A.L. wrote the paper.

Corresponding author

Correspondence to Lawrence A Loeb.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Algorithm for tag_to_header.py.

The tag_to_header.py program proceeds through the two raw FASTQ files in parallel, processing the in-register reads from the read 1 and read 2 files simultaneously. The first 17 nt of each read consist of a 12-nt random tag followed by an invariant 5-nt spacer sequence (Fig. 1a). Before the tag sequence is extracted, reads can optionally be filtered for the presence of a valid adapter sequence (i.e. a spacer with the sequence TGACT) (Step 2). If one or both reads have a spacer that fails this optional filter, the entire read pair is ignored. It should be noted that using this option on data from a low-quality sequencing run can lead to a significant loss of data. We rarely use this option in our data analysis pipeline. Next, the two 12-nt tag sequences and the invariant 5-nt spacer sequence are parsed from each of the paired reads (Step 3). The tags are checked to ensure that they are composed only of valid nucleotides (A, G, C or T) (Step 4). If a tag fails this filter, both reads of the pair are discarded. If both tags pass this basic quality filter, the final duplex tag is formed by concatenating the 12-nt tag from read 2 (i.e. seq2.fq) to the 12-nt tag from read 1 (seq1.fq) (Step 5). For example, if the seq1.fq barcode sequence is AGCT (the remaining bases of the tag are omitted for clarity) and the seq2.fq barcode sequence is CCAT, the resulting duplex tag is AGCTCCAT. The read is then written out to a new FASTQ file (seq1.fq.smi) with the duplex tag appended to the header region (Step 6).
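The tag-extraction steps described above can be illustrated with a minimal Python sketch. This is not the tag_to_header.py source; the function name make_duplex_tag, its parameters and the decision to return the reads with the tag and spacer trimmed off are assumptions made for illustration.

```python
VALID_BASES = set("ACGT")


def make_duplex_tag(seq1, seq2, tag_len=12, spacer="TGACT", check_spacer=False):
    """Build a duplex tag from one read pair (hypothetical helper).

    Returns (duplex_tag, trimmed_seq1, trimmed_seq2), or None if the pair
    is rejected by one of the filters.
    """
    spacer_end = tag_len + len(spacer)

    # Optional spacer filter (Step 2): both reads must carry the spacer.
    if check_spacer:
        if seq1[tag_len:spacer_end] != spacer or seq2[tag_len:spacer_end] != spacer:
            return None

    # Parse the random tags (Step 3).
    tag1, tag2 = seq1[:tag_len], seq2[:tag_len]
    if len(tag1) < tag_len or len(tag2) < tag_len:
        return None

    # Tags may contain only A, C, G or T (Step 4).
    if not (set(tag1) <= VALID_BASES and set(tag2) <= VALID_BASES):
        return None

    # Read 1 tag followed by read 2 tag (Step 5). Trimming the tag and
    # spacer from the returned reads is an assumption of this sketch.
    return tag1 + tag2, seq1[spacer_end:], seq2[spacer_end:]
```

With 4-nt tags, as in the shortened example in the text, make_duplex_tag("AGCTTGACTGGGG", "CCATTGACTTTTT", tag_len=4) yields the duplex tag AGCTCCAT.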

Supplementary Figure 2 Algorithm for ConsensusMaker.py.

ConsensusMaker.py takes a sorted, paired-end *.bam file generated by SAMtools as input. The script attempts to load all reads that map to the same genomic position (i.e. reads sharing the same POS field) into memory, but applies several filters while doing so. Importantly, all PCR duplicates needed for the creation of an SSCS molecule will map to the same position in the reference genome and will share the same tag sequence. Reads can be filtered based on the value of their bitwise flag (i.e. FLAG field) (Step 2). We currently group and filter reads by four main classes: 1) reads forming a proper read pair (FLAG values 83, 99, 147 and 163), 2) paired, but not properly paired, reads (FLAG values 65, 81, 97, 113, 129, 145, 161 and 177), 3) single-mapping reads (FLAG values 69, 73, 89, 117, 133, 137, 153 and 181), and 4) non-mapping read pairs (FLAG values 77 and 141). We recommend filtering out the non-mapping read pairs and any other FLAG values not indicated by the other three classes of reads. Reads not passing the FLAG filter are saved in a file (SSCS_NM.bam) in case they are needed for troubleshooting purposes. The filtered reads sharing the identical tag sequence are further sorted and sub-grouped by their CIGAR string (Step 3). Reads can also be optionally filtered to remove reads that overlap at their 3′ end or harbor soft-clipped bases (Step 4). Overlapping reads are found by first adding the read length to the value in the POS field and then comparing this sum to the value in the MPOS field. If the sum is greater than the MPOS field value, the read is considered overlapping. Soft-clipped reads are identified by the presence of an 'S' in the CIGAR string. Reads failing these filters can be written to a separate file (SSCS_NM.bam) or discarded from further analysis. Within each family, the number of reads with each CIGAR string is checked to find the most common CIGAR string.
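As a rough illustration of the read filters described above, the following Python sketch classifies a read by its FLAG value and applies the overlap and soft-clip checks. The FLAG values are taken from the text; the function and constant names are hypothetical and do not come from ConsensusMaker.py.

```python
# FLAG classes as listed in the text.
PROPER_PAIR = {83, 99, 147, 163}
IMPROPER_PAIR = {65, 81, 97, 113, 129, 145, 161, 177}
SINGLE_MAPPING = {69, 73, 89, 117, 133, 137, 153, 181}
NON_MAPPING = {77, 141}


def flag_class(flag):
    """Return the name of the read class for a SAM FLAG value."""
    if flag in PROPER_PAIR:
        return "proper_pair"
    if flag in IMPROPER_PAIR:
        return "improper_pair"
    if flag in SINGLE_MAPPING:
        return "single_mapping"
    if flag in NON_MAPPING:
        return "non_mapping"
    return "other"  # any FLAG value not named in the four classes


def passes_flag_filter(flag):
    """Keep only the three recommended classes of reads."""
    return flag_class(flag) in {"proper_pair", "improper_pair", "single_mapping"}


def is_overlapping(pos, read_length, mpos):
    """A read overlaps its mate if POS + read length exceeds MPOS."""
    return pos + read_length > mpos


def is_soft_clipped(cigar):
    """Soft-clipped reads carry an 'S' operation in the CIGAR string."""
    return "S" in cigar
```

For example, a read with FLAG 99 passes the filter, whereas reads with FLAG 77 or with a CIGAR string such as 10S74M would be set aside.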
Only reads sharing the most common CIGAR string are considered for consensus making; all others can be written out to a separate file (SSCS_LCC.bam) or discarded (Step 6). For example, if a family has four reads, three of which have a CIGAR string of 84M and one of which has a CIGAR string of 30M1D54M, only those reads with a CIGAR string of 84M will be considered as members of the family for consensus making. Those reads that remain are then checked to see if the number of reads in a tag family is greater than a minimum cutoff value (Step 7). We currently use a minimum membership of three reads. In addition, we have also instituted a maximum membership cutoff that limits the family size and reduces the computational time needed to process large families. We currently set this limit to 1,000 members. If the number of reads is below the minimum cutoff, the entire tag family is discarded and a consensus is not made. If the family size is greater than the maximum cutoff value, then the number of reads corresponding to the maximum cutoff value are randomly selected from the family and used to make the consensus. The rest of the reads in the family are ignored. After all the filters are applied, reads that share the same tag sequence and CIGAR string are subjected to consensus making (Step 8). The consensus sequence at each position in a read is determined by a majority-rules algorithm. Specifically, for each position index, the proportion of A's, T's, C's and G's is calculated from every read in a tag family (e.g. # A's / # reads in tag family). The consensus for each position index is the base that occurs with a proportion greater than a user-defined minimum value. If no proportion is greater than the minimum cutoff, the position is considered undefined and an 'N' is used as the consensus value for that position. Following consensus making, the SSCS may optionally be filtered based on the proportion of positions containing N's (Step 9).
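The CIGAR selection, family-size limits and majority-rules consensus described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the ConsensusMaker.py implementation: the function name make_sscs, the (sequence, CIGAR) tuple input and the default cutoff of 0.7 are assumptions, and a family of exactly three reads is treated as passing the minimum membership.

```python
import random
from collections import Counter


def make_sscs(reads, min_family=3, max_family=1000, cutoff=0.7, rng=random):
    """Build a single-strand consensus from one tag family (hypothetical helper).

    reads: list of (sequence, cigar) tuples sharing a tag and position.
    Returns the consensus string, or None if the family is too small.
    """
    if not reads:
        return None

    # Keep only reads with the most common CIGAR string.
    top_cigar, _ = Counter(cigar for _, cigar in reads).most_common(1)[0]
    seqs = [seq for seq, cigar in reads if cigar == top_cigar]

    # Minimum membership: families below the cutoff are discarded.
    if len(seqs) < min_family:
        return None

    # Maximum membership: randomly subsample very large families.
    if len(seqs) > max_family:
        seqs = rng.sample(seqs, max_family)

    # Majority rules at each position; reads sharing a CIGAR string are
    # assumed to have equal length, so zip() lines up the columns.
    consensus = []
    for column in zip(*seqs):
        counts = Counter(base for base in column if base in "ACGT")
        if counts:
            base, count = counts.most_common(1)[0]
            if count / len(seqs) > cutoff:
                consensus.append(base)
                continue
        consensus.append("N")  # no base exceeds the cutoff
    return "".join(consensus)
```

With a cutoff above 0.5, at most one base can exceed the threshold at any position, so taking the most common base is equivalent to testing each of the four proportions.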
At this stage, the duplex tag from the header field, the genomic position, the CIGAR string, the flag and the mate position are retained from the original reads, whereas the quality scores, mapping quality and optional tag fields can either be replaced with dummy values or discarded. The final SSCS is written to a file (SSCS.bam) (Step 10). If additional tag families are available for consensus making that map to the same genomic position (i.e. share the same value for the POS field), then they are submitted for consensus making using the same filtering and consensus-making algorithms (Steps 11 & 12). If no more families are available, all the reads mapping to the next position in the reference genome are loaded into memory, and the process begins again (Steps 13 & 14).

Supplementary Figure 3 Using a lower consensus cutoff does not affect mutation frequency.

The plot shows the comparison of two different consensus cutoff values (0.9 vs 0.7) used to process the same sample of mtDNA. The original Duplex Sequencing paper (Schmitt et al. 2012; ref. 19) used a consensus cutoff of 0.9 (i.e. 90% of the bases at a position have to agree in order for a consensus to be called). Reducing this cutoff to 0.7 does not result in a significant change in the mutation frequency. Error bars represent the 95% confidence interval (Wilson score interval).
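For readers who wish to compute comparable error bars, a Wilson score interval for a mutation frequency can be calculated as in the sketch below. The function name and the use of mutant and total nucleotide counts as inputs are assumptions for illustration; the legend does not specify how the authors computed the interval.

```python
import math


def wilson_interval(mutations, total, z=1.959964):
    """Wilson score interval for a proportion (z = 1.959964 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = mutations / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

Unlike the simple normal approximation, this interval remains informative when the observed count is zero or very small, which is the usual situation for rare mutations.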

Supplementary Figure 4 Algorithm for DuplexMaker.py.

DuplexMaker.py takes the SSCS.bam file generated by ConsensusMaker.py as input. The script begins by loading all SSCS reads that map to the same genomic position (i.e. reads sharing the same POS field) into memory. All SSCSs needed for the creation of the final DCS read will map to the same position in the reference genome. The 24-nt duplex tag is parsed from the read header (i.e. QNAME field). The script then attempts to find the matching read by transposing the two 12-nt halves of the tag sequence and comparing this new tag to every remaining tag associated with reads mapping to the same position. For example, if the original tag has the sequence AGTC, the transposed form of the tag would be TCAG. The TCAG tag will be compared to all the remaining tags mapping at the same genomic coordinates. If an SSCS for the transposed tag (TCAG in this case) is present at the given genomic position, the original read (tag AGTC) and the read associated with the transposed tag (tag TCAG) are compared to one another at each nucleotide position. If the two positions match, the given nucleotide base is used in the resultant DCS. If the positions do not match, an 'N' is placed in the resultant DCS. After a DCS is created, the two tags are removed from further consideration and are no longer used to form DCS reads. Following duplex making, the DCS reads may be optionally filtered based on the proportion of positions containing N's. The sequence of any final DCS read with a FLAG value of 65, 69, 73, 97, 99, 129, 133, 137, 161 or 163 is changed to its reverse complement. Once a read's mate (i.e. the read sharing the same tag sequence) has been processed, the two reads are written, in FASTQ format, to two files (one read is written to one file and its mate is written to the other FASTQ file), as well as to a .bam file. Waiting to write DCS reads to their respective files is done in order to keep the mate pairs in register in the FASTQ files, thus maintaining proper read pairing.
If one read of a read-pair fails to form a DCS, then a dummy pair consisting of N's is written in its place. This step is done to ensure that reads remain properly paired. Once all tags at a given position have been examined, the program moves on to the next reference position and repeats the process.
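The tag-transposition and base-by-base comparison described above can be sketched as follows. This is a simplified illustration, not the DuplexMaker.py source: the function names are hypothetical, the input is assumed to be a dictionary of SSCS sequences for a single genomic position, and skipping tags that are identical to their own transposition is an assumption of the sketch. Reverse complementing, mate pairing and file output are omitted.

```python
def transpose_tag(tag):
    """Swap the two halves of a duplex tag (e.g. AGTC -> TCAG)."""
    half = len(tag) // 2
    return tag[half:] + tag[:half]


def make_dcs(sscs_by_tag):
    """Pair SSCS reads at one position into duplex consensus sequences.

    sscs_by_tag: dict mapping duplex tag -> SSCS sequence.
    Returns a dict mapping the first tag of each pair -> DCS sequence.
    """
    remaining = dict(sscs_by_tag)
    duplexes = {}
    for tag in list(sscs_by_tag):
        if tag not in remaining:
            continue  # already consumed as the partner of an earlier tag
        partner = transpose_tag(tag)
        if partner == tag or partner not in remaining:
            continue  # no SSCS from the complementary strand
        seq_a = remaining.pop(tag)
        seq_b = remaining.pop(partner)
        # Keep a base only where both strands agree; otherwise write N.
        duplexes[tag] = "".join(a if a == b else "N" for a, b in zip(seq_a, seq_b))
    return duplexes
```

For example, make_dcs({"AGTC": "ACGT", "TCAG": "ACGA", "GGCC": "AAAA"}) pairs the AGTC and TCAG families into the DCS ACGN and leaves the unpaired GGCC family without a duplex consensus.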

Supplementary Figure 5 dT-tailing is less efficient than dA-tailing.

(a) Sequence of the oligonucleotides used to test tailing efficiency. (b) To test the tailing efficiency when the terminal base is a dG, 10 pmol of JS154 was first radiolabeled with γ-32P-ATP using T4 polynucleotide kinase and then annealed to 10 pmol of JS156 by heating to 95 °C for 1 min and allowing the mixture to slowly cool to room temperature (23 °C). The annealed oligonucleotides were either dA-tailed or dT-tailed with 1 mM dATP or dTTP, respectively, at 37 °C using 15 U Klenow exo- (New England Biolabs) according to the manufacturer's instructions. After 1 h, the reaction was heat-inactivated at 65 °C for 20 min and run on a 14% acrylamide/8 M urea PAGE gel. The radioactive bands were quantified by densitometry, and the percent extension was calculated by dividing the amount of radioactivity in the extended band by the total amount of radioactivity in the lane. The tailing efficiency when the terminal base is a dT was determined in the same manner, with the exception that JS157 was radiolabeled with γ-32P-ATP and then subsequently annealed to JS155. (c) Table showing the percent of dA- and dT-tailing. dT-tailing was only slightly less efficient than dA-tailing when the terminal base was a dG (JS154); however, when the terminal base was a dT (JS157), no detectable dT-tailing was observed, whereas ~40% was dA-tailed.

Supplementary Figure 6 Removal of PCR contaminants by AMPure XP bead purification.

(a) Electropherogram and associated size profile showing the presence of a PCR contaminant before the bead cleanup step. (b) Bead cleanup of the same sample removes the PCR contaminant. The arrow denotes the contaminant. We typically use a 1.2× bead:DNA ratio (vol:vol) to select against these low-molecular-weight products.

Supplementary information

Supplementary Figure 1

Algorithm for tag_to_header.py. (PDF 120 kb)

Supplementary Figure 2

Algorithm for ConsensusMaker.py. (PDF 175 kb)

Supplementary Figure 3

Using a lower consensus cutoff does not affect mutation frequency. (PDF 120 kb)

Supplementary Figure 4

Algorithm for DuplexMaker.py. (PDF 167 kb)

Supplementary Figure 5

dT-tailing is less efficient than dA-tailing. (PDF 128 kb)

Supplementary Figure 6

Removal of PCR contaminants by AMPure XP bead purification. (PDF 212 kb)

Supplementary Table 1

Redesigned dA-tailed adapters result in increased DS data yield. (PDF 67 kb)

Supplementary Table 2

Relative lane fraction needed to achieve optimal data on a sample with a peak family size that is too small. (PDF 91 kb)

About this article

Cite this article

Kennedy, S., Schmitt, M., Fox, E. et al. Detecting ultralow-frequency mutations by Duplex Sequencing. Nat Protoc 9, 2586–2606 (2014). https://doi.org/10.1038/nprot.2014.170

  • DOI: https://doi.org/10.1038/nprot.2014.170
