
Use of simulated data sets to evaluate the fidelity of metagenomic processing methods


Metagenomics is a rapidly emerging field of research for studying microbial communities. To evaluate methods presently used to process metagenomic sequences, we constructed three simulated data sets of varying complexity by combining sequencing reads randomly selected from 113 isolate genomes. These data sets were designed to model real metagenomes in terms of complexity and phylogenetic composition. We assembled sampled reads using three commonly used genome assemblers (Phrap, Arachne and JAZZ), and predicted genes using two popular gene-finding pipelines (fgenesb and CRITICA/GLIMMER). The phylogenetic origins of the assembled contigs were predicted using one sequence similarity–based (blast hit distribution) and two sequence composition–based (PhyloPythia, oligonucleotide frequencies) binning methods. We explored the effects of the simulated community structure and method combinations on the fidelity of each processing step by comparison to the corresponding isolate genomes. The simulated data sets are available online to facilitate standardized benchmarking of tools for metagenomic analysis.
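As a rough illustration of how a simulated data set of this kind might be assembled, the sketch below draws reads at random from per-genome read pools and keeps the source genome as a ground-truth label for later benchmarking. The function name `build_simulated_dataset`, the `reads_to_draw` mapping and the toy read pools are hypothetical; the number of reads actually drawn per genome in the article is not stated here and is left as an input.

```python
import random


def build_simulated_dataset(reads_by_genome, reads_to_draw, seed=0):
    """Randomly select reads from each isolate genome's read pool.

    reads_by_genome: dict mapping genome id -> list of read sequences
    reads_to_draw:   dict mapping genome id -> number of reads to sample
                     (hypothetical input; controls each genome's abundance)
    Returns a shuffled list of (genome_id, read) pairs, so the true origin
    of every read stays known.
    """
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    dataset = []
    for genome_id in sorted(reads_by_genome):
        pool = reads_by_genome[genome_id]
        # Never ask for more reads than the pool contains.
        k = min(reads_to_draw.get(genome_id, 0), len(pool))
        # Sample without replacement so no read position is picked twice.
        for read in rng.sample(pool, k):
            dataset.append((genome_id, read))
    rng.shuffle(dataset)  # interleave genomes, as in a real shotgun library
    return dataset


if __name__ == "__main__":
    # Toy pools: one dominant organism and one rare one.
    pools = {
        "dominant": ["ACGT" * 5 for _ in range(100)],
        "rare": ["TTGA" * 5 for _ in range(100)],
    }
    simulated = build_simulated_dataset(pools, {"dominant": 80, "rare": 5})
    print(len(simulated), "reads sampled")
```

Varying the per-genome counts is one simple way to move between a community dominated by a single organism and one with no dominant population.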


Figure 1: Quality of assembly.
Figure 2: Gene prediction in data sets.
Figure 3: Hierarchical clustering of genes assigned to COGs in the simulated data sets.
Figure 4: Specificity and sensitivity values for selected binning methods.
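
Because the origin of every contig in a simulated data set is known, binning predictions can be scored directly against the truth. The sketch below assumes the textbook definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), computed for one clade at a time; the article may define these quantities differently, and the function name `bin_metrics` and the toy labels are hypothetical.

```python
def bin_metrics(true_labels, predicted_labels, clade):
    """Score binning predictions for a single clade.

    true_labels:      known origin of each contig
    predicted_labels: bin assigned to each contig (None = left unassigned)
    Returns (sensitivity, specificity); a value is None when undefined.
    """
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == clade and pred == clade:
            tp += 1  # contig from the clade, correctly binned
        elif truth == clade:
            fn += 1  # contig from the clade, missed or misplaced
        elif pred == clade:
            fp += 1  # foreign contig wrongly placed in the clade
        else:
            tn += 1  # foreign contig kept out of the clade
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


if __name__ == "__main__":
    truth = ["A", "A", "B", "B", "B"]
    predicted = ["A", None, "A", "B", "B"]
    print(bin_metrics(truth, predicted, "A"))
```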




We thank A. Lykidis and I. Anderson from the Genome Biology Program at DOE-JGI for their feedback and comments on this manuscript. This work was performed under the auspices of the US Department of Energy's Office of Science, Biological and Environmental Research Program, and the University of California, Lawrence Livermore National Laboratory under contract number W-7405-Eng-48, Lawrence Berkeley National Laboratory under contract number DE-AC02-05CH11231 and Los Alamos National Laboratory under contract number W-7405-ENG-36.

Author information




K.M. and N.I. performed the analysis. K.B., H.S. and E.G. performed assemblies with Phrap, JAZZ and Arachne, respectively. A.C.M. performed binning with PhyloPythia. A.S. performed gene predictions with fgenesb, and developed and performed binning with BLAST distr. F.K. developed and performed binning with kmer. M.L. performed gene prediction with the GLIMMER/CRITICA pipeline. A.L., I.G., P.R. and I.R. supported the project. P.H. and N.C.K. supported the project and contributed conceptually. K.M., P.H. and N.C.K. wrote the manuscript.

Corresponding author

Correspondence to Konstantinos Mavromatis.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Fig. 1

Enlarged versions of panels in Figure 1b,c. (PDF 1337 kb)

Supplementary Fig. 2

Relative abundance of Alpha and Gamma proteobacteria as derived from binning results for the simLC and simMC data sets. (PDF 118 kb)

Supplementary Table 1

Organisms used for the simulated data sets. (PDF 79 kb)

Supplementary Table 2

Binning summary for contigs larger than 8 kb and containing more than 10 reads. (PDF 54 kb)

Supplementary Methods (PDF 67 kb)


About this article

Cite this article

Mavromatis, K., Ivanova, N., Barry, K. et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods 4, 495–500 (2007).
