Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2

To the Editor: Rapid advances in DNA-sequencing and bioinformatics technologies in the past two decades have substantially improved understanding of the microbial world. This growing understanding relates to the vast diversity of microorganisms; how microbiota and microbiomes affect disease (ref. 1) and medical treatment (ref. 2); how microorganisms affect the health of the planet (ref. 3); and the nascent exploration of the medical (ref. 4), forensic (ref. 5), environmental (ref. 6) and agricultural (ref. 7) applications of microbiome biotechnology. Much of this work has been driven by marker-gene surveys (for example, bacterial/archaeal 16S rRNA genes, fungal internal-transcribed-spacer regions and eukaryotic 18S rRNA genes), which profile microbiota with varying degrees of taxonomic specificity and phylogenetic information. The field is now transitioning to integrate other data types, such as metabolite (ref. 8), metaproteome (ref. 9) or metatranscriptome (refs. 9,10) profiles.

The QIIME 1 microbiome bioinformatics platform has supported many microbiome studies and gained a broad user and developer community. Interactions with QIIME 1 users in our online support forum, our workshops and direct collaborations have shown the platform’s potential to serve an increasingly diverse array of microbiome researchers in academia, government and industry. Here, we present QIIME 2, a completely reengineered and rewritten system that is expected to facilitate reproducible and modular analysis of microbiome data to enable the next generation of microbiome science.

QIIME 2 was developed on the basis of a plugin architecture (Supplementary Fig. 1) that allows third parties to contribute functionality (https://library.qiime2.org). QIIME 2 plugins exist for latest-generation tools for sequence quality control from different sequencing platforms (DADA2 (ref. 11) and Deblur (ref. 12)), taxonomy assignment (ref. 13) and phylogenetic insertion (ref. 14), which quantitatively improve the results over QIIME 1 and other tools (as detailed in the corresponding tool-specific publications). The plugins also support qualitatively new functionality, including microbiome paired-sample and time-series analysis (ref. 15) (which are critical for studying the effects of treatments on the microbiome), and machine learning (ref. 16). Trained machine learning models can be saved for application to new data and interrogated to identify important microbiome features. Several recently released plugins, including q2-cscs (ref. 17), q2-metabolomics (ref. 18), q2-shogun (ref. 19), q2-metaphlan2 (ref. 20) and q2-picrust2 (ref. 21), provide initial support for analysis of metabolomics and shotgun metagenomics data. We are currently working with teams developing bioinformatics tools for metatranscriptomics and metaproteomics, and we expect to add new plugins supporting these data types to the ecosystem shortly. Additionally, many of the existing ‘downstream’ analysis tools, such as q2-sample-classifier (ref. 16), can already work with these data types individually or in combination if they are provided in a feature table. Thus, QIIME 2 has the potential to serve not only as a marker-gene analysis tool but also a multidimensional and powerful data science platform that can be rapidly adapted to analyze diverse microbiome features.
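The plugin idea described above can be illustrated with a very small sketch. This is not the QIIME 2 plugin API: `PluginRegistry`, `register`, `run` and the `toy-stats` plugin are hypothetical names chosen only to show how independently written functions might be contributed to a shared registry without changes to the core code.

```python
from typing import Callable, Dict, List


class PluginRegistry:
    """Hypothetical registry mapping 'plugin.action' names to callables."""

    def __init__(self) -> None:
        self._actions: Dict[str, Callable] = {}

    def register(self, plugin: str, action: str) -> Callable:
        """Return a decorator that stores a function under 'plugin.action'."""
        def decorator(func: Callable) -> Callable:
            key = f"{plugin}.{action}"
            if key in self._actions:
                raise ValueError(f"action already registered: {key}")
            self._actions[key] = func
            return func
        return decorator

    def run(self, name: str, *args, **kwargs):
        """Look up a registered action by name and call it."""
        if name not in self._actions:
            raise KeyError(f"unknown action: {name}")
        return self._actions[name](*args, **kwargs)

    def available(self) -> List[str]:
        """List every registered action name in sorted order."""
        return sorted(self._actions)


registry = PluginRegistry()


# A third party could contribute this function from a separate package.
@registry.register("toy-stats", "relative_abundance")
def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw feature counts for one sample into proportions."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no observations")
    return {feature: n / total for feature, n in counts.items()}


if __name__ == "__main__":
    print(registry.available())
    print(registry.run("toy-stats.relative_abundance", {"f1": 3, "f2": 1}))
```

The core only needs to know the registry interface; everything else arrives through registration, which is the property that lets an ecosystem of plugins grow independently.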

QIIME 2 provides many new interactive visualization tools facilitating exploratory analyses and result reporting. Static versions of interactive visualizations resulting from four worked examples are provided in Fig. 1. QIIME 2 View (https://view.qiime2.org) is a unique new service (Supplementary Methods) that allows users to securely share and interact with results without installing QIIME 2. The QIIME 2 visualizations presented in Fig. 1 are provided in Supplementary File 1 to allow readers to interact with QIIME 2 View. Corresponding worked QIIME 2 example code is provided in the Supplementary Methods.

Fig. 1: QIIME 2 provides many interactive visualization tools.

The products of four worked examples are presented here, and interactive versions of these screen captures are available in Supplementary File 1 and at https://github.com/qiime2/paper1. Detailed descriptions and methods, including the commands used to generate each of these visualizations, are provided in Supplementary Methods. a, Unweighted UniFrac principal coordinate analysis plot containing 37,680 samples, illustrating the scalability of QIIME 2. Colors indicate sample type, as described by the Earth Microbiome Project ontology (EMPO). b, Interactive taxonomic composition bar plot illustrating the phylum-level composition of microbial-mat samples collected along a temperature gradient in Yellowstone National Park Hot Spring outflow channels (Steep Cone Geyser). The many interactive controls available in this plot vastly decrease the burden of exploratory analysis over QIIME 1. c, Feature volatility plot (https://msystems.asm.org/content/3/6/e00219-18) illustrating the change in Bifidobacterium abundance over time in breast-fed and formula-fed infants. Temporally interesting features can be interactively discovered with this visualization. Bar charts rank the importance (predictive power for time point) and mean abundance of all microbial features. These bar charts provide an interface for visualizing volatility plots (line plots) of individual features in the context of their importance and abundance; clicking on a bar will display the volatility plot of that feature and highlight in blue that feature’s importance and abundance in the bar charts below. d, Molecular cartography of the human skin surface. Colored spots represent the abundance of the small-molecule cosmetic ingredient sodium laureth sulfate on the human skin. Sample data can be interactively visualized in three-dimensional models, thus supporting the discovery of spatial patterns.

Reproducibility, transparency and clarity of microbiome data science are guiding principles in QIIME 2 design. To this end, QIIME 2 includes a decentralized data-provenance tracking system: details of all analysis steps with references to intermediate data are automatically stored in the results. Users can thus retrospectively determine exactly how any result was generated (Fig. 2 illustrates a simplified provenance graph derived from the data provenance of Fig. 1b). QIIME 2 also detects when results have been corrupted, which indicates that the provenance is no longer reliable and that the results no longer contain the information needed for reproducibility. The provenance of the visualizations presented in Fig. 1 can be interactively reviewed by loading the contents of Supplementary File 1 with QIIME 2 View, providing far more detailed information than can typically be provided in Methods text. QIIME 2 results are also semantically typed (Fig. 2), and actions indicate acceptable input types, clarifying the data that actions should be applied to and making complex workflows less error prone. Complex workflows can be created and shared by using Jupyter Notebooks (ref. 22) or Common Workflow Language (CWL; ref. 23), and support for other workflow engines is currently in development.
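To make the idea of semantically typed results more concrete, the following is a minimal sketch in which each result carries a type label and each action declares the labels it accepts, so that a mismatched input is rejected before any computation runs. It is an illustration under stated assumptions, not the QIIME 2 implementation: `TypedResult`, `action` and `to_relative` are hypothetical names, and the type labels are used here as plain strings.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class TypedResult:
    """Hypothetical result: a payload tagged with a semantic type label."""
    semantic_type: str
    data: Any


def action(inputs: Dict[str, str], output: str) -> Callable:
    """Declare the type labels an action accepts and the label it produces."""
    def decorator(func: Callable) -> Callable:
        def wrapper(**kwargs: TypedResult) -> TypedResult:
            # Check every declared input before running the action.
            for name, expected in inputs.items():
                if name not in kwargs:
                    raise TypeError(f"missing input: {name}")
                actual = kwargs[name].semantic_type
                if actual != expected:
                    raise TypeError(
                        f"input '{name}' must be {expected}, got {actual}")
            payload = func(**{k: v.data for k, v in kwargs.items()})
            return TypedResult(output, payload)
        return wrapper
    return decorator


@action(inputs={"table": "FeatureTable[Frequency]"},
        output="FeatureTable[RelativeFrequency]")
def to_relative(table: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, float]]:
    """Normalize each sample's counts so that they sum to 1."""
    result = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        result[sample] = {
            f: (n / total if total else 0.0) for f, n in counts.items()}
    return result


if __name__ == "__main__":
    counts = TypedResult("FeatureTable[Frequency]", {"s1": {"f1": 3, "f2": 1}})
    print(to_relative(table=counts))
    taxonomy = TypedResult("FeatureData[Taxonomy]", {"f1": "k__Bacteria"})
    try:
        to_relative(table=taxonomy)
    except TypeError as err:
        print("rejected:", err)
```

Because the check happens at the boundary of the action, a workflow that wires the wrong result into a step fails immediately with a readable message instead of producing a plausible but meaningless output.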

Fig. 2: QIIME 2 iteratively records data provenance, ensuring bioinformatics reproducibility.

This simplified diagram illustrates the automatically tracked information regarding the creation of the taxonomy bar plot presented in Fig. 1b. QIIME 2 results (circles) contain network diagrams illustrating the data provenance stored in the result. Actions (quadrilaterals) are applied to QIIME 2 results and generate new results. Arrows indicate the flow of QIIME 2 results through actions. TaxonomicClassifier and FeatureData[Sequence] inputs contain independent provenance (red and blue, respectively) and are provided to a classify action (yellow), which taxonomically annotates sequences. The result of the classify action, a FeatureData[Taxonomy] result, integrates the provenance of both inputs with the classify action. This result is then provided to the barplot action with a FeatureTable[Frequency] input, which shares some provenance with the FeatureData[Sequence] input, because they were generated from the same upstream analysis. The resulting visualization (Fig. 1b) has the complete data provenance and correctly identifies shared processing of inputs. This simplified representation was created manually from the complete provenance graph for the purpose of illustration. An interactive and complete version of this provenance graph (as well as those for other Fig. 1 panels) can be accessed through Supplementary File 1.
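The way provenance accumulates in this diagram can be sketched in a few lines. The code below is an illustration only and does not reflect how QIIME 2 stores provenance internally: `ProvenanceNode`, `Result`, `run_action`, `shared_ancestors`, the action names and the `RawReads` label are all hypothetical. The assumption being demonstrated is simply that each result carries its whole history, and that a new result's history is the union of its inputs' histories plus one new step.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class ProvenanceNode:
    """One recorded step: the action that ran and the ids of its inputs."""
    result_id: str
    action: str
    parents: Tuple[str, ...]


@dataclass(frozen=True)
class Result:
    """Hypothetical result that carries its complete history with it."""
    semantic_type: str
    result_id: str
    provenance: Dict[str, ProvenanceNode]


def run_action(action: str, output_type: str, *inputs: Result) -> Result:
    """Produce a new result and record how it was made.

    The new provenance is the union of every input's provenance plus one
    node for this step. Ancestors shared by several inputs are keyed by
    the same id, so they appear only once in the merged graph.
    """
    new_id = str(uuid.uuid4())
    merged: Dict[str, ProvenanceNode] = {}
    for res in inputs:
        merged.update(res.provenance)
    merged[new_id] = ProvenanceNode(
        new_id, action, tuple(r.result_id for r in inputs))
    return Result(output_type, new_id, merged)


def shared_ancestors(a: Result, b: Result) -> Set[str]:
    """Ids of steps that appear in the history of both results."""
    return set(a.provenance) & set(b.provenance)


def build_example() -> Dict[str, Result]:
    """Build a graph shaped like Fig. 2 (action names are made up)."""
    raw = run_action("import-reads", "RawReads")
    table = run_action("denoise-table", "FeatureTable[Frequency]", raw)
    seqs = run_action("denoise-seqs", "FeatureData[Sequence]", raw)
    classifier = run_action("import-classifier", "TaxonomicClassifier")
    taxonomy = run_action(
        "classify", "FeatureData[Taxonomy]", classifier, seqs)
    barplot = run_action("barplot", "Visualization", table, taxonomy)
    return {"raw": raw, "table": table, "seqs": seqs,
            "classifier": classifier, "taxonomy": taxonomy,
            "barplot": barplot}


if __name__ == "__main__":
    results = build_example()
    print(len(results["barplot"].provenance), "steps recorded in the bar plot")
    print(shared_ancestors(results["table"], results["taxonomy"]))
```

In this toy graph the feature table and the taxonomy both descend from the same upstream step, and the final visualization records that step once, which mirrors the shared processing described in the caption.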

Finally, QIIME 2 provides a software-development kit (https://dev.qiime2.org) that can be used to integrate it as a component of other systems (such as Qiita (ref. 24) or Illumina BaseSpace) and to develop interfaces targeted toward users with different levels of computational sophistication (Supplementary Fig. 2). QIIME 2 provides the QIIME 2 Studio graphical user interface and QIIME 2 View, interfaces designed for end-user biologists, clinicians and policy-makers; the QIIME 2 application programming interface, designed for data scientists who want to automate workflows or work interactively in Jupyter Notebooks (ref. 22); and q2cli and q2cwl, providing a command-line interface and CWL (ref. 23) wrappers for QIIME 2, designed for experts in high-performance computing. At present, computationally expensive steps support parallel computing at the individual-action level (for example, many actions including de-noising and taxonomy assignment support multiple threads). We are currently developing deeper integration with parallelism strategies available in third-party workflow engines, and workflow-level parallelism is currently possible through CWL.

There are many other powerful open-source software tools for microbiome data science, including mothur (ref. 25), phyloseq (ref. 26) and related tools available through Bioconductor (ref. 27), and the biobakery suite (refs. 20,21,28). The microbiome bioinformatics platform mothur is often compared to QIIME 1 and QIIME 2. A major difference between mothur and QIIME 2 lies in the interactive visualizations: QIIME 2 provides many interactive visualization tools (several examples are provided in Fig. 1), whereas mothur focuses on generating data that can be easily loaded and visualized with other tools. The phyloseq tool focuses on microbiome statistical analysis and generating publication-ready visualizations but, unlike QIIME 2, begins with a feature or operational-taxonomic-unit table, leaving ‘upstream’ processing steps, such as sequence demultiplexing and quality control, to other processing pipelines, many of which (like phyloseq) are available through Bioconductor. The biobakery suite provides analytic functionality that complements that of QIIME 2, and we are actively working with biobakery developers to support interoperability by making their tools accessible as QIIME 2 plugins (for example, the q2-metaphlan2 plugin allows users to run MetaPhlAn2 through QIIME 2). QIIME 2 provides the only Python-based microbiome data-science platform that supports retrospective data-provenance tracking to ensure reproducibility, multi-omics analysis support, interfaces geared toward different user types to enhance usability and an extensibility-focused design through the plugin architecture and software-development kit. We share feedback from users of QIIME 2 on these and other features in Supplementary Methods.

The tools described in the preceding paragraph are all interoperable through plugins, exchange of files in standard formats or using multi-language environments, such as Jupyter Notebooks (ref. 22). For example, the BIOM format (ref. 29) is supported by all of them. A diverse ecosystem of interoperable software is beneficial for the field, because it allows both experienced users to obtain multiple perspectives on their data and novice bioinformaticians to work in the programming environments that they are most comfortable with (for example, phyloseq allows users to work in R, whereas QIIME 2 allows users to work in Python). We plan to continue working with the developers of these tools, and with organizations such as the Genomic Standards Consortium, on plugins and standards to ensure interoperability, as well as developing tools to automatically import data from microbiome data-sharing platforms such as Qiita, the European Bioinformatics Institute (EBI) European Nucleotide Archive and the National Center for Biotechnology Information (NCBI) Sequence Read Archive.
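As a small illustration of why a shared file format eases interoperability, the sketch below reads a tiny feature table written in the JSON layout of BIOM 1.0 using only the Python standard library. It assumes the sparse JSON form of the format (later versions are HDF5-based and would call for a dedicated library); the table contents, `EXAMPLE` and `load_sparse_biom` are made up for this example.

```python
import json
from typing import Dict

# A toy table in the BIOM 1.0 JSON layout. Sparse entries are stored as
# [row index, column index, value]; rows are features, columns are samples.
EXAMPLE = """
{
  "id": null,
  "format": "Biological Observation Matrix 1.0.0",
  "format_url": "http://biom-format.org",
  "type": "OTU table",
  "generated_by": "example",
  "date": "2019-01-01T00:00:00",
  "rows": [{"id": "f1", "metadata": null}, {"id": "f2", "metadata": null}],
  "columns": [{"id": "s1", "metadata": null}, {"id": "s2", "metadata": null}],
  "matrix_type": "sparse",
  "matrix_element_type": "int",
  "shape": [2, 2],
  "data": [[0, 0, 5], [1, 0, 2], [1, 1, 7]]
}
"""


def load_sparse_biom(text: str) -> Dict[str, Dict[str, float]]:
    """Return {sample id: {feature id: count}} from sparse BIOM 1.0 JSON."""
    doc = json.loads(text)
    if doc.get("matrix_type") != "sparse":
        raise ValueError("this sketch only handles sparse tables")
    features = [row["id"] for row in doc["rows"]]
    samples = [col["id"] for col in doc["columns"]]
    # Start from an all-zero table, then fill in the stored entries.
    table = {s: {f: 0 for f in features} for s in samples}
    for row, col, value in doc["data"]:
        table[samples[col]][features[row]] = value
    return table


if __name__ == "__main__":
    for sample, counts in load_sparse_biom(EXAMPLE).items():
        print(sample, counts)
```

Any tool that can read this layout can consume the same table, regardless of the language it is written in.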

Advances in microbiome research promise to improve many aspects of health and the world, and QIIME 2 will help drive those advances by enabling accessible, community-driven microbiome data science.

Data availability

Data for the analyses presented in Fig. 1 are available as follows: Earth Microbiome Project data in Fig. 1a were obtained from ftp://ftp.microbio.me/emp/release1, and the American Gut Project (AGP) data were obtained from Qiita (http://qiita.microbio.me) study ID 10317. Sequence data in Fig. 1b are available in Qiita under study ID 10249 and the EBI under accession number ERP016173. Sequence data in Fig. 1c are available in Qiita under study ID 925 and the EBI under accession number ERP022167. Data in Fig. 1d are available in the q2-ili GitHub repository (https://github.com/biocore/q2-ili). Interactive versions of the Fig. 1 visualizations can be accessed at https://github.com/qiime2/paper1.

Code availability

QIIME 2 is open source and free for all use, including commercial. It is licensed under a BSD three-clause license. Source code is available at https://github.com/qiime2. Help for QIIME 2 is provided at https://forum.qiime2.org.

Change history

  • 09 August 2019

    An amendment to this paper has been published and can be accessed via a link at the top of the paper.

References

1. Smith, M. I. et al. Science 339, 548–554 (2013).
2. Gopalakrishnan, V. et al. Science 359, 97–103 (2018).
3. Gehring, C. A., Sthultz, C. M., Flores-Rentería, L., Whipple, A. V. & Whitham, T. G. Proc. Natl Acad. Sci. USA 114, 11169–11174 (2017).
4. Lee, K., Pletcher, S. D., Lynch, S. V., Goldberg, A. N. & Cope, E. K. Front. Cell. Infect. Microbiol. 8, 168 (2018).
5. Metcalf, J. L. et al. Science 351, 158–162 (2016).
6. Rubin, R. L. et al. Ecol. Appl. 28, 1594–1605 (2018).
7. Pineda, A., Kaplan, I. & Bezemer, T. M. Trends Plant Sci. 22, 770–778 (2017).
8. Kapono, C. A. et al. Sci. Rep. 8, 3669 (2018).
9. Verberkmoes, N. C. et al. ISME J. 3, 179–189 (2009).
10. Barr, T. et al. Gut Microbes 9, 338–356 (2018).
11. Callahan, B. J. et al. Nat. Methods 13, 581–583 (2016).
12. Amir, A. et al. mSystems 2, e00191–16 (2017).
13. Bokulich, N. A. et al. Microbiome 6, 90 (2018).
14. Janssen, S. et al. mSystems 3, e00021–18 (2018).
15. Bokulich, N. A. et al. mSystems 3, e00219–18 (2018).
16. Bokulich, N. et al. J. Open Source Softw. 3, 934 (2018).
17. Sedio, B. E., Rojas Echeverri, J. C., Boya, P. C. A. & Wright, S. J. Ecology 98, 616–623 (2017).
18. Wang, M. et al. Nat. Biotechnol. 34, 828–837 (2016).
19. Hillmann, B. et al. mSystems 3, e00069–18 (2018).
20. Truong, D. T. et al. Nat. Methods 12, 902–903 (2015).
21. Langille, M. G. I. et al. Nat. Biotechnol. 31, 814–821 (2013).
22. Kluyver, T. et al. Positioning and power in academic publishing: players, agents and agendas. in Proc. 20th International Conference on Electronic Publishing (eds Loizides, F. & Schmidt, B.) 87–90 (IOS Press, 2016).
23. Amstutz, P. et al. https://doi.org/10.6084/m9.figshare.3115156.v2 (2016).
24. Gonzalez, A. et al. Nat. Methods 15, 796–798 (2018).
25. Schloss, P. D. et al. Appl. Environ. Microbiol. 75, 7537–7541 (2009).
26. McMurdie, P. J. & Holmes, S. PLoS One 8, e61217 (2013).
27. Huber, W. et al. Nat. Methods 12, 115–121 (2015).
28. Franzosa, E. A. et al. Nat. Methods 15, 962–968 (2018).
29. McDonald, D. et al. Gigascience 1, 7 (2012).


Acknowledgements

QIIME 2 development was primarily funded by NSF Awards 1565100 to J.G.C. and 1565057 to R.K. Partial support was also provided by the following: grants NIH U54CA143925 (J.G.C. and T.P.) and U54MD012388 (J.G.C. and T.P.); grants from the Alfred P. Sloan Foundation (J.G.C. and R.K.); ERCSTG project MetaPG (N.S.); the Strategic Priority Research Program of the Chinese Academy of Sciences QYZDB-SSW-SMC021 (Y.B.); the Australian National Health and Medical Research Council APP1085372 (G.A.H., J.G.C., Von Bing Yap and R.K.); the Natural Sciences and Engineering Research Council (NSERC) to D.L.G.; and the State of Arizona Technology and Research Initiative Fund (TRIF), administered by the Arizona Board of Regents, through Northern Arizona University. All NCI coauthors were supported by the Intramural Research Program of the National Cancer Institute. S.M.G. and C. Diener were supported by the Washington Research Foundation Distinguished Investigator Award. Thanks to the Yellowstone Center for Resources for research permit no. 5664 to J.R.S. for Yellowstone access and sample collection. We thank P. J. McMurdie for helpful discussion on the relationships between QIIME 2 and phyloseq. We would like to thank the users of QIIME 1 and 2, whose invaluable feedback has shaped QIIME 2. In particular, we would like to thank A. Abdelfattah (Stockholm University, Sweden), R. C. T. Boutin (University of British Columbia, Canada), D. J. Bradshaw II (Florida Atlantic University Harbor Branch Oceanographic Institute, USA), L. Bullington (MPG Ranch, USA), J. W. Debelius (Karolinska Institutet, Sweden), C. Duvallet (Massachusetts Institute of Technology, USA), E. Korzune Ganda (Cornell University, USA), A. Mahnert (Medical University of Graz, Austria), M. C. Melendrez (St. Cloud State University, USA), D. O’Rourke (University of New Hampshire, USA), A. R. Rivers (USDA ARS, USA), B. Sen (Tianjin University, China), S. Tangedal (Haukeland University Hospital and University of Bergen, Norway), P. J. Torres (San Diego State University, USA) and J. Warren (National Laboratory Service, UK) for writing end-user reviews included in the Supplementary Methods.

Author information

E.B., J.R.R., M.R.D., N.A.B., Y.B., J.E.B., C.J.B., A.M.C.-R., E.K.C., C. Diener, R.D., C.F.E., M. Ernst, M. Estaki, A.G., J.M.G., D.L.G., S.M.G., A.K.J., K.B.K., S.T.K., I.K., T.K., J.L., Y.-X.L., A.V.M., J.L.M., L.F.N., S.B.O., D.P., A.S., S.J.S., A.D.S., L.R.T., P. J. Torres, P. J. Turnbaugh, S.U.-H., F.V., J.W., R.K. and J.G.C. developed documentation, educational materials and/or user/developer support content. E.B., J.R.R., M.R.D., N.A.B., R.K. and J.G.C. wrote the manuscript; all authors assisted with revision of the manuscript. E.B., J.R.R., M.R.D., N.A.B. and J.G.C. designed and developed the QIIME 2 framework. D.M.D., A.G., R.L., E.L., S.C.M., R.S., J.R.S., W.W., C.H.D.W. and R.K. contributed data used in the manuscript and/or testing of QIIME 2. C.C.A., C.T.B., E.K.C., P.C.D., S.H., P.K., E.L., T.P., R.S., E.V., Y.W. and R.K. contributed to the design of analytical methods. E.B., J.R.R., M.R.D., N.A.B., G.A.A.-G., H.A., E.J.A., M.A., F.A., K.B., A.B., B.J.C., J.C., G.M.D., C. Duvallet, M. Ernst, J.F., A.G., K.G., J.G., S.M.G., B.H., H.H., C.H., G.H., S.J., L.J., B.D.K., C.R.K., D.K., J.K., M.G.I.L., C.L., M.M., C.M., B.D.M., D.M., L.J.M., J.T.M., A.T.N., J.A.N.-M., S.L.P., M.L.P., E.P., L.B.R., A.R., M.S.R., P.R., N.S., M.S., P.T., A.T., J.J.J.v.d.H., Y.V.-B., M.V., M.W., K.C.W., A.D.W., Z.Z.X., J.R.Z., Y.Z., Q.Z. and J.G.C. contributed software to QIIME 2 plugins, interfaces, framework and/or build and test systems.

Correspondence to J. Gregory Caporaso.

Additional information

Editor’s Note: This paper has been peer-reviewed.

Supplementary information

Supplementary Information

Supplementary Figs. 1–3 and Supplementary Methods

Supplementary File 1

Interactive versions of the visualizations presented in Fig. 1. These can be viewed by using QIIME 2, for example at https://view.qiime2.org.
