DADA2: High-resolution sample inference from Illumina amplicon data

Journal name:
Nature Methods
Volume:
13
Pages:
581–583
DOI:
doi:10.1038/nmeth.3869

We present the open-source software package DADA2 for modeling and correcting Illumina-sequenced amplicon errors (https://github.com/benjjneb/dada2). DADA2 infers sample sequences exactly and resolves differences of as little as 1 nucleotide. In several mock communities, DADA2 identified more real variants and output fewer spurious sequences than other methods. We applied DADA2 to vaginal samples from a cohort of pregnant women, revealing a diversity of previously undetected Lactobacillus crispatus variants.

At a glance

Figures

  1. Figure 1: Comparison of sequence variants inferred by DADA2 with OTUs constructed by UPARSE.

    (a–c) The merged sequences output by DADA2 are plotted for three Illumina amplicon data sets: (a) Balanced, (b) HMP, and (c) Extreme. Frequency is plotted on the y-axis; Hamming distance to the closest more abundant sequence is plotted on the x-axis. Shapes represent accuracy (Online Methods). When variants are well separated from other members of the community, the sequence variants inferred by DADA2 largely coincide with the OTUs output by UPARSE (black). However, DADA2 resolves additional variation (blue), especially within UPARSE's OTU radius (dashed line), while outputting fewer spurious sequences (One Off and Other).

  2. Figure 2: L. crispatus sequence variants in the human vaginal community during pregnancy.

    DADA2 identified six L. crispatus 16S rRNA sequence variants that were present in multiple samples and accounted for a significant fraction of all reads (L1, 19.7%; L2, 11.1%; L3, 6.5%; L4, 3.1%; L5, 1.3%; and L6, 0.4%). (a) The frequency of L1–L6 in each sample. Black bars at the bottom link samples from the same subject. (b,c) The frequency of (b) L1 vs. L2 and (c) L3 vs. L1, by sample. The dashed line indicates a total frequency of 1.
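
The x-axis quantity in Figure 1, the Hamming distance from each sequence to the closest more abundant sequence, can be illustrated with a short sketch. This is not the DADA2 implementation; the function names, the toy sequences and counts, and the 3% OTU radius (the conventional 97% identity threshold) are assumptions made for illustration only.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def distance_to_closest_more_abundant(
    counts: Dict[str, int],
) -> Dict[str, Optional[int]]:
    """For each sequence, the smallest Hamming distance to any strictly more
    abundant sequence; None when no more abundant sequence exists."""
    result: Dict[str, Optional[int]] = {}
    for seq, n in counts.items():
        dists = [hamming(seq, other) for other, m in counts.items() if m > n]
        result[seq] = min(dists) if dists else None
    return result


def within_otu_radius(distance: int, length: int, radius: float = 0.03) -> bool:
    """True if the mismatch fraction is within the OTU radius.
    The 3% default is an assumption, not a value taken from the caption."""
    return distance / length <= radius


# Hypothetical toy data (sequence -> read count); not taken from the paper.
toy_counts = {
    "ACGTACGTAC": 1000,
    "ACGTACGTAT": 120,  # one mismatch from the most abundant sequence
    "TTGTACGAAC": 40,   # three mismatches from the most abundant sequence
}

distances = distance_to_closest_more_abundant(toy_counts)
for seq, d in distances.items():
    print(seq, toy_counts[seq], d)
```

Under these assumptions, a variant that differs from a more abundant sequence by a single nucleotide over a 250-nucleotide amplicon has a mismatch fraction of 0.4%, so `within_otu_radius(1, 250)` is true: a 3% radius would merge such a variant into its neighbor's OTU, which is the region where the caption reports that DADA2 resolves additional variation.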

Accession codes

Primary accessions

Sequence Read Archive

Author information

Affiliations

  1. Department of Statistics, Stanford University, Stanford, California, USA.

    • Benjamin J Callahan &
    • Susan P Holmes
  2. Second Genome, South San Francisco, California, USA.

    • Paul J McMurdie,
    • Andrew W Han &
    • Amy Jo A Johnson
  3. Department of Applied Physics, Stanford University, Stanford, California, USA.

    • Michael J Rosen

Contributions

B.J.C. and S.P.H. designed the research; B.J.C., P.J.M., and M.J.R. implemented the algorithm; B.J.C. performed the analysis; B.J.C., P.J.M., M.J.R., and S.P.H. wrote the paper; and A.W.H. and A.J.A.J. generated the Extreme data set designed by B.J.C., P.J.M., and A.W.H.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (1,852 KB)

    Supplementary Figures 1–8, Supplementary Tables 1–3 and Supplementary Notes 1 and 2

Zip files

  1. Supplementary Software (1,344 KB)

    DADA2 software package and scripts for benchmarking and analysis
