Strain-level microbial epidemiology and population genomics from shotgun metagenomics

Abstract

Identifying microbial strains and characterizing their functional potential is essential for pathogen discovery, epidemiology and population genomics. We present pangenome-based phylogenomic analysis (PanPhlAn; http://segatalab.cibio.unitn.it/tools/panphlan), a tool that uses metagenomic data to achieve strain-level microbial profiling resolution. PanPhlAn recognized outbreak strains, produced the largest strain-level population genomic study of human-associated bacteria and, in combination with metatranscriptomics, profiled the transcriptional activity of strains in complex communities.

Figure 1: PanPhlAn validation and comparison with assembly.
Figure 2: PanPhlAn profiling of E. coli from metagenomics samples.
Figure 3: Large-scale population genomics study of E. rectale and A. muciniphila.

Accession codes

Primary accessions

Sequence Read Archive

Referenced accessions

Sequence Read Archive

Acknowledgements

We gratefully thank the members of the Segata lab for insightful discussions on the method, K. Schibler for his contribution to the preterm infant cohort study, and V. De Sanctis and R. Bertorelli for help in sequencing the skin samples. This work was supported by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under contract number HHSN272200900018C (D.V.W., A.L.M.). The work was also supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement number PCIG13-GA-2013-618833 (N.S.), by startup funds from the Centre for Integrative Biology, University of Trento (N.S.), by MIUR “Futuro in Ricerca” RBFR13EWWI_001 (N.S.), by the Fondazione Caritro–2013 (N.S.) and by 'Terme di Comano' (N.S.).

Author information

Contributions

N.S. supervised the work and originally conceived the project. M.S. and D.V.W. contributed to the conception and design of the work. M.S. and T.T. implemented, validated, tested, and documented the software. M.S. and E.P. performed the experiments. A.T., D.V.W. and A.L.M. performed and provided the metagenomics sequencing. M.Z., F.A. and D.T.T. provided computational tools and performed comparative analyses. N.S. and M.S. wrote the manuscript. All authors provided feedback, edited, and approved the manuscript.

Corresponding author

Correspondence to Nicola Segata.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Flowchart of the PanPhlAn method

Illustration of the PanPhlAn gene-family profiling. (i) Metagenomic and metatranscriptomic samples are mapped against reference genomes. (ii) Single-gene coverage is merged into gene-family coverage. (iii) Based on a uniform DNA coverage level, PanPhlAn detects the unique gene set of a particular strain in a sample. For metatranscriptomics, the gene-families that are uniquely associated with the strain present in a sample are used to recruit reads from the metatranscriptome. The resulting RNA coverage levels are then converted into the logarithm of the median-normalized RNA/DNA ratios.
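Steps (ii) and (iii) can be illustrated with a minimal Python sketch. It is not the PanPhlAn implementation: summing member-gene coverage per family, estimating the strain coverage level as the median of non-zero family coverages, and the `min_fraction` cutoff are all assumptions made for illustration, as are the function names and the toy data.

```python
from collections import defaultdict
from statistics import median


def family_coverage(gene_cov, gene_to_family):
    """Merge per-gene coverage into per-family coverage (assumed here: a sum)."""
    fam = defaultdict(float)
    for gene, cov in gene_cov.items():
        fam[gene_to_family[gene]] += cov
    return dict(fam)


def present_families(fam_cov, min_fraction=0.5):
    """Call a family present if its coverage is close to the strain's coverage level.

    The strain coverage level is approximated by the median of the non-zero
    family coverages; min_fraction is a hypothetical cutoff, not a PanPhlAn default.
    """
    nonzero = [c for c in fam_cov.values() if c > 0]
    if not nonzero:
        return set()
    plateau = median(nonzero)
    return {f for f, c in fam_cov.items() if c >= min_fraction * plateau}


# Toy data: five genes grouped into four gene-families.
gene_cov = {"g1": 4.0, "g2": 6.0, "g3": 9.0, "g4": 0.5, "g5": 11.0}
gene_to_family = {"g1": "famA", "g2": "famA", "g3": "famB", "g4": "famC", "g5": "famD"}

fam_cov = family_coverage(gene_cov, gene_to_family)
present = present_families(fam_cov)
print(sorted(present))
```

In this toy example famC falls well below the coverage level shared by the other families and is therefore treated as absent from the strain.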

Supplementary Figure 2 True positive rates of PanPhlAn validation

True positive rates of PanPhlAn on semi-synthetic validation data for different strains of (a–c) E. coli, (d) S. aureus and (e) S. epidermidis. PanPhlAn accurately detected strain-specific gene-families. At high target-strain genome coverage (>10×), >98% of the gene-families were correctly detected. For coverages as low as 2×, the true positive rate was >92%. At 1×, the majority (avg. 86.95%, s.d. 1.95) of the strain-specific genes could still be retrieved.

Supplementary Figure 3 Comparison PanPhlAn versus MetaPhlAn

(a,b) Strain signature comparison of MetaPhlAn2 versus PanPhlAn based on 12 synthetic metagenomes generated from six reference genomes, three of which were not included in the database of either tool. (a) Strain identification by MetaPhlAn2 (based on a set of 621 marker genes) showed some limitations in resolving closely related strains (e.g., the two Bacteroides vulgatus strains G000699705 and G000699845). (b) PanPhlAn, in contrast, distinguished these two strains owing to a much larger number of pangenome gene-families (6,646 for Bacteroides vulgatus). (c,d) Comparison between MetaPhlAn and PanPhlAn in terms of ROC curves for two species, (c) Bacteroides vulgatus and (d) Bacteroides fragilis. ROC curves were constructed using the distance between all sample pairs as the classification threshold. A pair is considered ‘positive’ if both synthetic samples were generated from the same genome, and ‘negative’ if the samples were based on different genomes. For both species, the ROC curves were better for PanPhlAn because its distances discriminated more clearly between sample pairs containing the same strain and pairs containing different strains.
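The ROC construction described in (c,d) can be sketched as follows, assuming each sample pair has a distance and a same-genome label. The function names and the toy distances are hypothetical; a pair is predicted ‘positive’ when its distance is at or below the threshold being swept.

```python
def roc_points(distances, same_genome):
    """Return (FPR, TPR) points obtained by sweeping a distance threshold."""
    pos = sum(1 for s in same_genome if s)
    neg = len(same_genome) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    points = [(0.0, 0.0)]
    for t in sorted(set(distances)):
        # Pairs with distance <= t are predicted to contain the same strain.
        tp = sum(1 for d, s in zip(distances, same_genome) if d <= t and s)
        fp = sum(1 for d, s in zip(distances, same_genome) if d <= t and not s)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: two same-genome pairs (small distances), two different-genome pairs.
distances = [0.05, 0.10, 0.40, 0.60]
same_genome = [True, True, False, False]

points = roc_points(distances, same_genome)
print(points, auc(points))
```

With perfectly separated distances, as in this toy input, the curve reaches a true positive rate of 1 before any false positive occurs.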

Supplementary Figure 4 PanPhlAn profiles from synthetic metagenomes

PanPhlAn profiling results for synthetic metagenomes generated from (a) E. coli strains not present in the pangenome database and (b) S. aureus strains present in the database. In both cases, PanPhlAn enabled high discriminative resolution even among closely related genomes while simultaneously providing whole-genome strain characterization and profiling.

Supplementary Figure 5 PanPhlAn profiling of the German 2011 E. coli outbreak (including the available outbreak reference genomes)

Heatmap clustering based on an E. coli reference database of 113 reference genomes that, compared with Fig. 2, additionally included three O104:H4 genomes: the German 2011 outbreak strain and two similar isolates from 2009. As in Fig. 2, most of the 12 strains detected in metagenomic outbreak samples clustered together owing to almost identical profiles. The three additional O104:H4 reference genomes fell exactly in the center of the main cluster of detected strains, confirming that the detected gene-family profiles are outbreak strain profiles. Samples outside the cluster differed in their gene-family profiles because additional dominant E. coli strains overlaid the target outbreak strain.

Supplementary Figure 6 Coverage of the German 2011 E. coli outbreak samples

Plots showing the coverage depth of the E. coli O104:H4 outbreak strain 2011C-3493 in metagenomic samples from the German outbreak in 2011. (a) Samples of the main cluster in Fig. 2 were proven to contain the outbreak strain by a genome-wide uniform coverage depth. (b) Samples outside the cluster showed gaps of lower coverage levels, thereby confirming the presence of an additional E. coli strain overlying the outbreak strain, and hence dominating the gene-family profile.

Supplementary Figure 7 E. coli strain diversity and similarity network across four different datasets

The E. coli genomic diversity in the healthy gut of American, Chinese, and European cohorts is shown as heatmap clustering and as a strain similarity network based on PanPhlAn profiles of 1,316 metagenomes. (a) PanPhlAn detected E. coli strains in a total of 114 samples and provided presence-absence gene-family profiles for all of them. (b) E. coli strain similarity network to complement Fig. 2c. Most German outbreak samples cluster together with all three O104:H4 reference genomes. The outbreak cluster also includes one sample from the Chinese Diabetes dataset, which PanPhlAn confirmed to be an O104:H4-like strain without the enterohemorrhagic genes. Network edge width is inversely proportional to the Jaccard distance between gene-family profiles, and nodes connected by short edges reflect high genomic similarity (single disconnected nodes are removed).
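The Jaccard distance underlying the network can be sketched on presence-absence profiles represented as sets of gene-family identifiers. The `max_distance` cutoff used to decide which pairs are connected is a hypothetical parameter, and the profile names and contents are toy data.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of present gene-families."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def similarity_edges(profiles, max_distance=0.6):
    """Return (node, node, distance) for pairs closer than a hypothetical cutoff."""
    edges = []
    for u, v in combinations(sorted(profiles), 2):
        d = jaccard_distance(profiles[u], profiles[v])
        if d <= max_distance:
            edges.append((u, v, d))
    return edges


# Toy presence-absence profiles for two samples and one reference genome.
profiles = {
    "sample1": {"famA", "famB", "famC"},
    "sample2": {"famB", "famC", "famD"},
    "ref1": {"famX", "famY"},
}

edges = similarity_edges(profiles)
connected = {n for u, v, _ in edges for n in (u, v)}
print(edges, sorted(connected))
```

Nodes that share no edge under the cutoff (here `ref1`) do not appear in `connected`, mirroring the removal of single disconnected nodes mentioned in the caption.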

Supplementary Figure 8 Outbreak strain coverage of the Chinese sample T2D-063

Coverage analysis of the Chinese sample T2D-063 in the outbreak cluster (Fig. 2c and Supplementary Fig. 7) to investigate its genomic similarity to the German 2011 outbreak strain. The almost uniform genome-wide coverage depth confirmed high similarity to the outbreak strain, including the presence of plasmids. However, the missing Shiga-toxin-encoding region suggests that the sample contained a similar O104:H4 strain that was not identical to the German outbreak strain. This is consistent with the absence of Shiga-toxin genes in PanPhlAn’s gene-family profile for this sample.

Supplementary Figure 9 Multiple strain detection in E. coli samples

PanPhlAn can yield spurious strain profiles when multiple strains of the same species are present at comparable abundance. For this reason, PanPhlAn implements a quality control procedure to identify cases in which it suspects the presence of multiple strains. The figure shows the same PCoA plot of E. coli profiles as in Fig. 2c, with samples identified by PanPhlAn as “multistrain” additionally marked with an “x”. Analysis of the 12 samples with such a warning confirms that the gene repertoires predicted in these cases, despite being a true reflection of the overall E. coli gene content in the sample, do not accurately represent those of single E. coli strains.

Supplementary Figure 10 PCoA showing strain diversity of E. rectale (3 reference genomes)

Large-scale population genomics study of E. rectale based on 1830 gut metagenomic samples from 8 cohorts. In this plot we considered three reference genomes for E. rectale instead of the single genome used in Fig. 3a. In both cases the clustering result was very similar and resolved E. rectale strains into three geographically distinct clades.

Supplementary Figure 11 Retrieved gene comparison PanPhlAn versus assembly

Comparison between PanPhlAn and assembly-based approaches in terms of the number of strain-specific genes retrieved for two species in the gut samples from the HMP study: (a) E. rectale (129 positive samples) and (b) A. muciniphila (56 positive samples). For both species, PanPhlAn detected a higher number of genes in most of the samples tested. The difference was most pronounced when the target organism was at low coverage. Specifically, more than 1,000 genes were detected exclusively by PanPhlAn when the relative abundance of the target organism was around 1% for both (c) E. rectale and (d) A. muciniphila.

Supplementary Figure 12 Strain detection comparison PanPhlAn versus ConStrains and assembly

Strain detection comparison PanPhlAn versus ConStrains and assembly for the three gut samples (SRS014235, SRS050925, SRS048870) of the HMP dataset having the highest coverage for E. rectale. (a) The heatmap of the PanPhlAn profiling results highlighted distinct strains in the three samples. On the other hand, (b) ConStrains associated the same predominant strain to all the samples. Assembly in conjunction with phylogeny reconstruction (c) and core gene sequence divergence (d) confirmed the PanPhlAn results by detecting distinct strains in the three considered samples.

Supplementary Figure 13 Strain similarity networks of B. ovatus, B. fragilis, S. epidermidis, and N. meningitidis

PanPhlAn multi-cohort strain-strain similarity networks of (a) B. ovatus and (b) B. fragilis in human gut samples; (c) S. epidermidis in skin samples; and (d) N. meningitidis in throat samples. Each node represents a strain, either captured from a metagenomic sample or taken from the available reference genomes. Edge width is inversely proportional to the Jaccard distance between gene-family profiles, and nodes connected by short edges reflect high genomic similarity.

Supplementary Figure 14 Marine environmental strain-level comparative genomics study

Marine population genomics based on 1,246 samples. Heatmaps show hierarchical clustering of PanPhlAn profiles for the rarely sequenced Roseobacter species (a) Pelagibaca bermudensis (1 ref. genome), (b) Roseovarius nubinhibens (1 ref. genome), (c) Roseovarius TM1035 (1 ref. genome), and (d) Sulfitobacter (3 ref. genomes); and for two better characterized marine species, (e) Prochlorococcus marinus (17 ref. genomes) and (f) Pelagibacter ubique (5 ref. genomes). Different marine regions are marked by different colours. Strains of all species were broadly present in many marine regions, partly forming regional clusters of strain-specific gene content, especially in locally isolated areas such as the Baltic Sea and the North Sea. (g) PCoA plot based on PanPhlAn profiles of Pelagibacter ubique highlighting differences between strains from different marine regions. Strains present in the Baltic Sea could be clearly distinguished from North Sea strains, and strains detected in samples from the Trondheimsfjord in Norway also clustered together.

Supplementary Figure 15 PanPhlAn inference of strain-specific in vivo transcriptional activity

Pangenome-wide coverage depth of both metagenomic and metatranscriptomic data from a healthy infant gut sample. (a) Genes are sorted by DNA coverage. Transcript coverages are then normalized by the corresponding gene coverages, and the resulting ratios are median-normalized, log-transformed and re-scaled. (b) Hierarchical clustering of strain-specific transcription profiles from gut samples of 5 healthy infants. (c) Functional analysis of the highest overall expressed pathway modules reporting KEGG modules sorted by Gene Set Enrichment Analysis score.
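The normalization in (a) can be sketched as follows. This is an illustration under stated assumptions rather than the PanPhlAn code: the logarithm base (2), the exclusion of families with zero DNA or RNA coverage, and the omission of the final re-scaling step are choices made here, and the function name and toy coverages are hypothetical.

```python
import math
from statistics import median


def log_expression(dna_cov, rna_cov):
    """Log2 of median-normalized RNA/DNA coverage ratios per gene-family."""
    # Ratio of transcript coverage to gene coverage; skip families without DNA coverage.
    ratios = {g: rna_cov.get(g, 0.0) / d for g, d in dna_cov.items() if d > 0}
    positive = {g: r for g, r in ratios.items() if r > 0}
    if not positive:
        return {}
    med = median(positive.values())
    # Dividing by the median centres a typical family at 0 after the log transform.
    return {g: math.log2(r / med) for g, r in positive.items()}


# Toy coverages for three gene-families of one strain.
dna_cov = {"famA": 10.0, "famB": 10.0, "famC": 5.0}
rna_cov = {"famA": 20.0, "famB": 5.0, "famC": 5.0}

expr = log_expression(dna_cov, rna_cov)
print(expr)
```

Positive values indicate gene-families transcribed above the strain's median RNA/DNA ratio, and negative values indicate families transcribed below it.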

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15, Supplementary Tables 2, 4, and 5, and Supplementary Notes 1–8 (PDF 13998 kb)

Supplementary Table 1

Synthetic and semi-synthetic metagenomes used for PanPhlAn validation. (XLSX 13 kb)

Supplementary Table 3

German 2011 E. coli outbreak specific gene set (Fisher exact test). (XLSX 57 kb)

Supplementary Table 6

Top 100 transcribed genes of E. coli in gut samples of healthy infants. (XLSX 8 kb)

Supplementary Table 7

Bottom 100 transcribed genes of E. coli in gut samples of healthy infants. (XLSX 9 kb)

Supplementary Table 8

Active pathway modules of E. coli in five gut samples of healthy infants. (XLSX 8 kb)

Supplementary Software

Software tool PanPhlAn for strain detection and characterization (version 1.2). (ZIP 40 kb)

Cite this article

Scholz, M., Ward, D., Pasolli, E. et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nat Methods 13, 435–438 (2016). https://doi.org/10.1038/nmeth.3802

  • DOI: https://doi.org/10.1038/nmeth.3802
