Metagenome studies have retrieved vast amounts of sequence data from a variety of environments, leading to new discoveries and insights into the uncultured microbial world. Except for very simple communities, the encountered diversity has made fragment assembly and the subsequent analysis a challenging problem. A taxonomic characterization of metagenomic fragments is required for a deeper understanding of shotgun-sequenced microbial communities, but success has mostly been limited to sequences containing phylogenetic marker genes. Here we present PhyloPythia, a composition-based classifier that combines higher-level generic clades from a set of 340 completed genomes with sample-derived population models. Extensive analyses on synthetic and real metagenome data sets showed that PhyloPythia allows the accurate classification of most sequence fragments across all considered taxonomic ranks, even for unknown organisms. The method requires no more than 100 kb of training sequence for the creation of accurate models of sample-specific populations and can assign fragments ≥1 kb with high specificity.
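As a rough illustration of what "composition-based" can mean in this setting, the sketch below turns a DNA fragment into a vector of normalized oligonucleotide (k-mer) frequencies, the kind of feature a classifier could be trained on. The function name `kmer_frequencies`, the default k = 4, and the handling of ambiguous bases are assumptions made for illustration only; this is not the PhyloPythia implementation, whose exact features and parameters are not stated in the abstract.

```python
from collections import Counter
from itertools import product
from typing import Dict


def kmer_frequencies(sequence: str, k: int = 4) -> Dict[str, float]:
    """Return normalized k-mer frequencies for a DNA fragment.

    Hypothetical helper: windows containing anything other than
    A, C, G or T (for example N) are skipped rather than counted.
    """
    alphabet = "ACGT"
    seq = sequence.upper()
    counts: Counter = Counter()

    # Slide a window of width k over the fragment, one base at a time.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if all(base in alphabet for base in window):
            counts[window] += 1

    total = sum(counts.values())

    # Emit every one of the 4**k possible k-mers in a fixed order so that
    # vectors from different fragments are directly comparable.
    return {
        "".join(kmer): (counts["".join(kmer)] / total if total else 0.0)
        for kmer in product(alphabet, repeat=k)
    }
```

Calling `kmer_frequencies(fragment, k=4)` yields a 256-entry profile per fragment. Normalizing by the number of valid windows keeps profiles from fragments of different lengths on the same scale, which matters when fragments range from about 1 kb upward.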
We thank N. Ivanova, V. Kunin and F. Warnecke for help with selection of CAP and Thiothrix-specific training sets and for validation analyses of the metagenomic data-set binning, L. Krause for providing the SEED data, T. Huynh for implementing the web interface, and S. Polonsky for comments and discussion. The work of H.G.M. and P.H. was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program; the University of California, Lawrence Livermore National Laboratory, under contract W-7405-Eng-48; Lawrence Berkeley National Laboratory under contract DE-AC03-76SF00098; and Los Alamos National Laboratory under contract W-7405-ENG-36. PhyloPythia's results were incorporated in the US Department of Energy Joint Genome Institute Integrated Microbial Genomes & Metagenomes (IMG/M) experimental system (http://www.jgi.doe.gov).
The authors declare no competing financial interests.
Assignment accuracy for differently sized genomic fragments and coding sequences from unknown organisms at the level of the class.
Wn parameter search for the sequence composition space with the highest classification accuracy for 15 kb fragments of unknown organisms at different phylogenetic levels.
Evaluation of the relation of genomic fragment length used for model creation and classification accuracy for genomic fragments of unknown organisms and different lengths.
Assignments at the domain level with PhyloPythia for 50 kb genomic fragments from unknown organisms.
Comparison of classification accuracy for 3 kb fragments and 3 kb fragments carrying ribosomal proteins with PhyloPythia.
Clades at different depths of the phylogenetic tree that are sufficiently represented by genomes of the 340 organisms for composition-based modeling.
Wn parameter search for the sequence composition space with the highest classification accuracy.
Classification accuracy of the SVM with a gaussian versus a linear kernel.
Classification accuracy of PhyloPythia for genomic fragments of unknown organisms at different taxonomic ranks.
Phylogenetic classification accuracy of PhyloPythia for genomic fragments of known organisms at different taxonomic ranks.
Search for the best parameter settings for the SOM and TETRA-method.
Comparison of PhyloPythia to the SOM-phylotype associations and tetranucleotide-based binning of the dominant sample populations for the contigs ≥1kb of the Sargasso Sea sample.
Evaluation of PhyloPythia's classification accuracy for genome fragments of different Prochlorococcus strains.
Cite this article
McHardy, A., Martín, H., Tsirigos, A. et al. Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 4, 63–72 (2007). https://doi.org/10.1038/nmeth976