Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing

High-throughput sequencing has revolutionized microbial ecology, but read quality remains a considerable barrier to accurate taxonomy assignment and α-diversity assessment for microbial communities. We demonstrate that high-quality read length and abundance are the primary factors differentiating correct from erroneous reads produced by Illumina GAIIx, HiSeq and MiSeq instruments. We present guidelines for user-defined quality-filtering strategies, enabling efficient extraction of high-quality data and facilitating interpretation of Illumina sequencing results.

Acknowledgements

We thank G. Giannoukos (Broad Institute of MIT and Harvard), I. Rasolonjatovo (Illumina), M. Gebert (University of Colorado, Boulder) and L. Wegener Parfrey (University of Colorado, Boulder) for contributing mock community sequencing data used in this study, and S. Huse and A. Gonzalez for useful feedback and discussions of this manuscript. This work was supported in part by grants from the US National Institutes of Health (NIH DK78669 to J.I.G., NIH R01HD059127 to D.A.M. and NIH U54HG004969 to D.G.), the Juvenile Diabetes Research Fund (D.G.), the Crohn's and Colitis Foundation of America (J.I.G. and D.G.), and the Howard Hughes Medical Institute. N.A.B. was supported by the 2012–2013 Dannon Probiotics Fellow Program (The Dannon Company) and a Wine Spectator scholarship.

Author information


  1. Department of Viticulture and Enology, University of California, Davis, Davis, California, USA.

    • Nicholas A Bokulich & David A Mills

  2. Department of Food Science and Technology, University of California, Davis, Davis, California, USA.

    • Nicholas A Bokulich & David A Mills

  3. Foods for Health Institute, University of California, Davis, Davis, California, USA.

    • Nicholas A Bokulich & David A Mills

  4. Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, Missouri, USA.

    • Sathish Subramanian, Jeremiah J Faith & Jeffrey I Gordon

  5. Microbial Systems & Communities, Genome Sequencing and Analysis Program, Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA.

    • Dirk Gevers

  6. Department of Chemistry and Biochemistry, University of Colorado, Boulder, Colorado, USA.

    • Rob Knight

  7. Howard Hughes Medical Institute, Boulder, Colorado, USA.

    • Rob Knight

  8. Institute for Genomics and Systems Biology, Argonne National Laboratory, Argonne, Illinois, USA.

    • J Gregory Caporaso

  9. Department of Computer Science, Northern Arizona University, Flagstaff, Arizona, USA.

    • J Gregory Caporaso

Contributions

N.A.B., J.G.C., D.A.M. and R.K. conceived and designed the experiments; N.A.B. performed the experiments and data analysis. All authors contributed sequencing data sets and wrote the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to J Gregory Caporaso.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–16, Supplementary Tables 1–9, Supplementary Note

