Abstract
Metagenomics projects collect DNA from uncharacterized environments that may contain thousands of species per sample. A central challenge in metagenomic analysis is the phylogenetic classification of raw sequence reads into groups representing the same or similar taxa, which is a prerequisite both for genome assembly and for analyzing the biological diversity of a sample. New sequencing technologies have made metagenomics easier by making sequencing faster, but also more difficult by producing shorter reads than previous technologies. Classification of reads as short as 100 base pairs has until now been relatively inaccurate, requiring researchers to use older, long-read technologies. We present Phymm, a classifier for metagenomic data that has been trained on 539 complete, curated genomes and can accurately classify reads as short as 100 base pairs, a substantial improvement over previous composition-based classification methods. We also describe how combining Phymm with sequence alignment algorithms improves accuracy.
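To make the idea of composition-based classification concrete, the following is a minimal sketch rather than Phymm itself. It assumes a plain fixed-order Markov model with add-one smoothing in place of the interpolated Markov models that Phymm uses, and the function names (`train_markov_model`, `score_read`, `classify`), the model order, and the toy genomes are all hypothetical choices made for illustration. Each reference genome gets its own model, and a read is assigned to the genome whose model gives it the highest log-likelihood.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov_model(genome, order=3):
    """Estimate log P(next base | previous `order` bases) from one genome.

    Simplification (assumption): a single fixed order with add-one
    smoothing, not an interpolated Markov model.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(order, len(genome)):
        context, base = genome[i - order:i], genome[i]
        # Skip windows containing ambiguous characters such as N.
        if base in ALPHABET and all(c in ALPHABET for c in context):
            counts[context][base] += 1

    model = {}
    for context, next_counts in counts.items():
        total = sum(next_counts.values()) + len(ALPHABET)
        model[context] = {
            b: math.log((next_counts.get(b, 0) + 1) / total) for b in ALPHABET
        }
    return model


def score_read(read, model, order=3):
    """Sum the log-probabilities of each base in the read given its context."""
    uniform = math.log(1.0 / len(ALPHABET))  # fallback for unseen contexts
    score = 0.0
    for i in range(order, len(read)):
        context, base = read[i - order:i], read[i]
        if base not in ALPHABET:
            continue
        score += model.get(context, {}).get(base, uniform)
    return score


def classify(read, models, order=3):
    """Return the label of the model that scores the read highest."""
    return max(models, key=lambda label: score_read(read, models[label], order))


# Toy reference "genomes" (hypothetical data, far smaller than real genomes).
toy_genomes = {
    "at_rich": "ATATTATAATTA" * 50,
    "gc_rich": "GCGGCCGCGGCC" * 50,
}
toy_models = {label: train_markov_model(seq) for label, seq in toy_genomes.items()}

at_read_label = classify("TATAATTAATAT", toy_models)
gc_read_label = classify("GCGGCCGCGG", toy_models)
print(at_read_label, gc_read_label)
```

A fixed-order model keeps the sketch short, but it has to trade off context length against the amount of training data available per context. Interpolated Markov models address that trade-off by combining predictions from several orders, which this sketch does not attempt.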
Acknowledgements
We thank A. Delcher for helpful discussions regarding IMM configuration. This work was supported in part by US National Institutes of Health grants R01-LM006845 and R01-GM083873.
Author information
Contributions
A.B. performed the experiments and subsequent analysis. A.B. and S.L.S. designed the experiments and wrote the paper.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–10 and Supplementary Tables 1–11 (PDF 702 kb)
Supplementary Software
Open-source installer package for Phymm/PhymmBL, including all algorithms used during setup and scoring. (ZIP 5454 kb)
About this article
Cite this article
Brady, A., Salzberg, S. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods 6, 673–676 (2009). https://doi.org/10.1038/nmeth.1358
DOI: https://doi.org/10.1038/nmeth.1358