Abstract
Identifying microbial strains and characterizing their functional potential is essential for pathogen discovery, epidemiology and population genomics. We present pangenome-based phylogenomic analysis (PanPhlAn; http://segatalab.cibio.unitn.it/tools/panphlan), a tool that uses metagenomic data to achieve strain-level microbial profiling resolution. PanPhlAn recognized outbreak strains, produced the largest strain-level population genomic study of human-associated bacteria and, in combination with metatranscriptomics, profiled the transcriptional activity of strains in complex communities.
Accession codes
Primary accessions
Sequence Read Archive
Referenced accessions
Sequence Read Archive
Acknowledgements
We gratefully thank the members of the Segata lab for insightful discussions on the method, K. Schibler for his contribution to the preterm infant cohort study, and V. De Sanctis and R. Bertorelli for help in sequencing the skin samples. This work was supported by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under contract number HHSN272200900018C (D.V.W., A.L.M.). The work was also supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement number PCIG13-GA-2013-618833 (N.S.), by startup funds from the Centre for Integrative Biology, University of Trento (N.S.), by MIUR “Futuro in Ricerca” RBFR13EWWI_001 (N.S.), by the Fondazione Caritro–2013 (N.S.) and by 'Terme di Comano' (N.S.).
Author information
Contributions
N.S. supervised the work and originally conceived the project. M.S. and D.V.W. contributed to the conception and design of the work. M.S. and T.T. implemented, validated, tested, and documented the software. M.S. and E.P. performed the experiments. A.T., D.V.W. and A.L.M. performed and provided the metagenomics sequencing. M.Z., F.A. and D.T.T. provided computational tools and performed comparative analyses. N.S. and M.S. wrote the manuscript. All authors provided feedback, edited, and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Flowchart of the PanPhlAn method
Illustration of PanPhlAn gene-family profiling. (i) Metagenomic and metatranscriptomic samples are mapped against reference genomes. (ii) Single-gene coverage is merged into gene-family coverage. (iii) Based on a uniform DNA coverage level, PanPhlAn detects the unique gene set of a particular strain in a sample. For metatranscriptomics, the gene-families uniquely associated with the strain present in a sample are used to recruit reads from the metatranscriptome. The resulting RNA coverage levels are then converted into log-transformed, median-normalized RNA/DNA ratios.
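Steps (ii) and (iii) can be illustrated with a minimal Python sketch. It is not the PanPhlAn implementation: the function names, the choice to sum gene coverage within a family, and the 0.5 × median presence threshold are all assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import median


def family_coverage(gene_cov, gene_to_family):
    """Merge per-gene coverage into per-family coverage (assumed here: a plain sum)."""
    fam = defaultdict(float)
    for gene, cov in gene_cov.items():
        fam[gene_to_family[gene]] += cov
    return dict(fam)


def present_families(fam_cov, min_fraction=0.5):
    """Call a family present when its coverage reaches a fraction of the plateau.

    The plateau is approximated by the median coverage of all covered families;
    min_fraction is a hypothetical threshold, not a documented PanPhlAn default.
    """
    covered = [c for c in fam_cov.values() if c > 0]
    if not covered:
        return set()
    plateau = median(covered)
    return {f for f, c in fam_cov.items() if c >= min_fraction * plateau}


# Toy input: famA has two gene copies, famC is only weakly covered.
gene_cov = {"g1": 10.0, "g2": 0.0, "g3": 9.0, "g4": 1.0, "g5": 11.0}
gene_to_family = {"g1": "famA", "g2": "famA", "g3": "famB", "g4": "famC", "g5": "famD"}

fam_cov = family_coverage(gene_cov, gene_to_family)
strain_profile = present_families(fam_cov)
```

In this toy input the plateau is 9.5×, so famA, famB and famD are called present and the weakly covered famC is not.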
Supplementary Figure 2 True positive rates of PanPhlAn validation
True positive rates of PanPhlAn on semi-synthetic data for different strains of E. coli (a–c), S. aureus (d) and S. epidermidis (e). PanPhlAn accurately detected strain-specific gene-families. Specifically, at high target-strain genome coverage (>10×), we correctly detected >98% of the gene-families. For coverages as low as 2×, we obtained a true positive rate >92%. At 1×, the majority (mean 86.95%, s.d. 1.95) of the strain-specific genes could still be retrieved.
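As a reminder of how such a rate is computed, the following Python sketch scores a predicted gene-family set against a known one. The function name and the toy sets are hypothetical and do not come from the validation data.

```python
def true_positive_rate(predicted, truth):
    """Fraction of the true strain-specific gene-families that were recovered."""
    if not truth:
        raise ValueError("truth set must not be empty")
    return len(predicted & truth) / len(truth)


# Toy example: three of four true families recovered, plus one false call.
truth = {"famA", "famB", "famC", "famD"}
predicted = {"famA", "famB", "famC", "famX"}
tpr = true_positive_rate(predicted, truth)
```

False calls such as famX do not affect the true positive rate; they would be counted in a separate false positive measure.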
Supplementary Figure 3 Comparison PanPhlAn versus MetaPhlAn
(a,b) Strain signature comparison of MetaPhlAn2 versus PanPhlAn based on 12 synthetic metagenomes generated from six reference genomes, three of which were not included in the database of either tool. (a) Strain identification by MetaPhlAn2 (based on a set of 621 marker genes) was limited in its ability to resolve closely related strains (e.g., the two Bacteroides vulgatus strains G000699705 and G000699845). In contrast, (b) PanPhlAn distinguished these two strains because it draws on a much larger number of pangenome gene-families (i.e., 6646 for Bacteroides vulgatus). (c,d) Comparison between MetaPhlAn and PanPhlAn in terms of ROC curves for two species, (c) Bacteroides vulgatus and (d) Bacteroides fragilis. ROC curves were constructed by using the distance between all sample pairs as the classification threshold. A pair is considered ‘positive’ if both synthetic samples were generated from the same genome, and ‘negative’ if the samples are based on different genomes. For both tested species, the ROC curves showed a better result for PanPhlAn, which discriminated samples containing the same strain from samples containing different strains more reliably on the basis of distance.
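The distance-thresholded ROC construction described in (c,d) can be sketched in Python as follows. This is an illustrative reconstruction under the stated definition of positive and negative pairs; the function names and toy distances are assumptions, not values from the study.

```python
def roc_points(distances, same_strain):
    """Sweep a distance threshold; a pair is predicted 'same strain' if distance <= threshold.

    Returns a list of (false positive rate, true positive rate) points.
    """
    pos = sum(1 for s in same_strain if s)
    neg = len(same_strain) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative pair")
    points = [(0.0, 0.0)]
    for t in sorted(set(distances)):
        tp = sum(1 for d, s in zip(distances, same_strain) if d <= t and s)
        fp = sum(1 for d, s in zip(distances, same_strain) if d <= t and not s)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy pairs: same-strain pairs are closer than different-strain pairs.
distances = [0.05, 0.10, 0.40, 0.60]
same_strain = [True, True, False, False]
points = roc_points(distances, same_strain)
area = auc(points)
```

When every same-strain pair is closer than every different-strain pair, as in this toy input, the curve reaches a true positive rate of 1 before any false positive appears.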
Supplementary Figure 4 PanPhlAn profiles from synthetic metagenomes
PanPhlAn profiling results for synthetic metagenomes generated from (a) E. coli strains not present in the pangenome database and (b) S. aureus strains present in the database. In both cases, PanPhlAn enabled high discriminative resolution even among closely related genomes while simultaneously providing whole-genome strain characterization and profiling.
Supplementary Figure 5 PanPhlAn profiling of the German 2011 E. coli outbreak (including the available outbreak reference genomes)
Heatmap clustering based on an E. coli reference database of 113 reference genomes that, in addition to those used in Fig. 2, included three O104:H4 genomes: the German 2011 outbreak strain and two similar isolates from 2009. As in Fig. 2, most of the 12 strains detected in metagenomic outbreak samples clustered together because of their almost identical profiles. The three additional O104:H4 reference genomes fell exactly in the center of the main cluster of detected strains, thereby confirming that the detected gene-family profiles are outbreak strain profiles. Samples outside the cluster differed in their gene-family profiles owing to the presence of additional dominant E. coli strains overlying the target outbreak strain.
Supplementary Figure 6 Coverage of the German 2011 E. coli outbreak samples
Plots showing the coverage depth of the E. coli O104:H4 outbreak strain 2011C-3493 in metagenomic samples from the German outbreak in 2011. (a) Samples of the main cluster in Fig. 2 were confirmed to contain the outbreak strain by their uniform genome-wide coverage depth. (b) Samples outside the cluster showed gaps of lower coverage, thereby confirming the presence of an additional E. coli strain overlying the outbreak strain and hence dominating the gene-family profile.
Supplementary Figure 7 E. coli strain diversity and similarity network across four different datasets
The E. coli genomic diversity in the healthy gut of American, Chinese, and European cohorts is shown as a heatmap clustering and as a strain similarity network based on PanPhlAn profiles of 1316 metagenomes. (a) PanPhlAn detected E. coli strains in a total of 114 samples and provided presence-absence gene-family profiles for all of them. (b) E. coli strain similarity network to complement Fig. 2c. Most German outbreak samples cluster together with all three O104:H4 reference genomes. The outbreak cluster also includes one sample from the Chinese Diabetes dataset, which PanPhlAn confirmed to be an O104:H4-like strain without the enterohemorrhagic genes. Network edge width is inversely proportional to the Jaccard distance between gene-family profiles, and nodes connected by short edges reflect high genomic similarity (single disconnected nodes are removed).
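The Jaccard distance underlying such a network can be sketched in Python as below. The distance definition is standard; the max_distance cutoff used to decide which edges to keep, the function names, and the toy profiles are assumptions for illustration, since the caption does not state how edges were selected.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of present gene-families."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def similarity_edges(profiles, max_distance=0.5):
    """Return (node, node, distance) for every pair within a hypothetical cutoff.

    Nodes that appear in no edge are the 'single disconnected nodes'
    that would be dropped from the drawn network.
    """
    names = sorted(profiles)
    edges = []
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            d = jaccard_distance(profiles[u], profiles[v])
            if d <= max_distance:
                edges.append((u, v, d))
    return edges


# Toy presence-absence profiles: s1 and s2 share most families, s3 shares none.
profiles = {
    "s1": {"famA", "famB", "famC"},
    "s2": {"famA", "famB", "famC", "famD"},
    "s3": {"famX", "famY"},
}
edges = similarity_edges(profiles)
```

Here only s1 and s2 are connected (distance 0.25), and s3 would be removed as a disconnected node.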
Supplementary Figure 8 Outbreak strain coverage of the Chinese sample T2D-063
Coverage analysis of the Chinese sample T2D-063 in the outbreak cluster (Fig. 2c and Supplementary Fig. 7) to investigate the genomic similarity with the German 2011 outbreak strain. The almost uniform genome-wide coverage depth confirmed high similarity with the outbreak strain, including the presence of plasmids. However, the missing Shiga-toxin-encoding region suggests that the sample contained a similar O104:H4 strain that was not identical to the German outbreak strain. This is consistent with the absence of Shiga-toxin genes in PanPhlAn’s gene-family profile.
Supplementary Figure 9 Multiple strain detection in E. coli samples
PanPhlAn can yield spurious strain profiles when multiple strains of the same species are present at comparable abundance. For this reason, PanPhlAn implements a quality-control procedure to identify cases in which it suspects the presence of multiple strains. The figure shows the same PCoA plot of E. coli profiles as in Fig. 2c, with samples identified by PanPhlAn as “multistrain” additionally marked with an “x”. Analysis of the 12 samples with this warning confirms that the gene repertoires predicted in these cases, despite being a true reflection of the overall E. coli gene content in the sample, do not accurately represent those of single E. coli strains.
Supplementary Figure 10 PCoA showing strain diversity of E. rectale (3 reference genomes)
Large-scale population genomics study of E. rectale based on 1830 gut metagenomic samples from 8 cohorts. In this plot we considered three reference genomes for E. rectale instead of the single genome used in Fig. 3a. In both cases the clustering result was very similar and resolved E. rectale strains into three geographically distinct clades.
Supplementary Figure 11 Retrieved gene comparison PanPhlAn versus assembly
Comparison between PanPhlAn and assembly-based approaches in terms of the number of strain-specific genes retrieved for two species in the gut samples from the HMP study: (a) E. rectale (129 positive samples) and (b) A. muciniphila (56 positive samples). For both tested species, PanPhlAn detected a higher number of genes in most of the samples tested. This held especially when the target organism was at low coverage. Specifically, more than 1000 genes were detected exclusively by PanPhlAn when the relative abundance of the target organism was around 1% for both (c) E. rectale and (d) A. muciniphila.
Supplementary Figure 12 Strain detection comparison PanPhlAn versus ConStrains and assembly
Comparison of strain detection by PanPhlAn, ConStrains and assembly for the three gut samples (SRS014235, SRS050925, SRS048870) of the HMP dataset with the highest coverage for E. rectale. (a) The heatmap of the PanPhlAn profiling results highlighted distinct strains in the three samples. In contrast, (b) ConStrains assigned the same predominant strain to all three samples. Assembly in conjunction with phylogeny reconstruction (c) and core gene sequence divergence (d) confirmed the PanPhlAn results by detecting distinct strains in the three samples.
Supplementary Figure 13 Strain similarity networks of B. ovatus, B. fragilis, S. epidermidis, and N. meningitidis
PanPhlAn multi-cohort strain-strain similarity networks of (a) B. ovatus and (b) B. fragilis in human gut samples; (c) S. epidermidis in skin samples; and (d) N. meningitidis in throat samples. Each node represents a strain either captured from a metagenomic sample or taken from an available reference genome. Edge width is inversely proportional to the Jaccard distance between gene-family profiles, and nodes connected by short edges reflect high genomic similarity.
Supplementary Figure 14 Marine environmental strain-level comparative genomics study
Marine population genomics based on 1,246 samples. Heatmaps show hierarchical clustering of PanPhlAn profiles for the rarely sequenced Roseobacter species (a) Pelagibaca bermudensis (1 ref. genome), (b) Roseovarius nubinhibens (1 ref. genome), (c) Roseovarius TM1035 (1 ref. genome), and (d) Sulfitobacter (3 ref. genomes); and two better characterized marine species (e) Prochlorococcus marinus (17 ref. genomes) and (f) Pelagibacter ubique (5 ref. genomes). Different marine regions are marked by different colours. Strains of all species showed a broad presence in many marine regions, partly as regional clusters of strain-specific gene content, especially for locally isolated areas such as the Baltic Sea and the North Sea. (g) PCoA plot based on PanPhlAn profiles of Pelagibacter ubique highlights differences between strains from different marine regions. Strains present in the Baltic Sea could be clearly distinguished from North Sea strains, and strains detected in samples from the Trondheimsfjord in Norway also clustered together.
Supplementary Figure 15 PanPhlAn inference of strain-specific in vivo transcriptional activity
Pangenome-wide coverage depth of both metagenomic and metatranscriptomic data from a healthy infant gut sample. (a) Genes are sorted by DNA coverage. Transcript coverages are then normalized by the corresponding gene coverages, and the resulting ratios are median-normalized, log-transformed and re-scaled. (b) Hierarchical clustering of strain-specific transcription profiles from gut samples of 5 healthy infants. (c) Functional analysis of the highest overall expressed pathway modules reporting KEGG modules sorted by Gene Set Enrichment Analysis score.
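The normalization in (a) can be sketched in Python as follows. This is a simplified illustration rather than the published procedure: the base-2 logarithm, the exclusion of genes with zero DNA or RNA coverage, the omission of the final re-scaling step, and the function name are all assumptions.

```python
import math
from statistics import median


def transcription_scores(dna_cov, rna_cov):
    """Log2 of median-normalized RNA/DNA coverage ratios per gene-family.

    Families with zero DNA coverage (not part of the strain) or zero RNA
    coverage (no finite logarithm) are left out in this sketch.
    """
    ratios = {g: rna_cov.get(g, 0.0) / d for g, d in dna_cov.items() if d > 0}
    positive = {g: r for g, r in ratios.items() if r > 0}
    if not positive:
        return {}
    med = median(positive.values())
    return {g: math.log2(r / med) for g, r in positive.items()}


# Toy coverages: uniform DNA coverage, RNA coverage varying twofold around the median.
dna_cov = {"famA": 10.0, "famB": 10.0, "famC": 10.0, "famD": 0.0}
rna_cov = {"famA": 5.0, "famB": 10.0, "famC": 20.0, "famD": 3.0}
scores = transcription_scores(dna_cov, rna_cov)
```

With this convention a score of 0 marks a gene-family transcribed at the median level of the strain, and positive or negative scores mark transcription above or below it.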
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–15, Supplementary Tables 2, 4, and 5, and Supplementary Notes 1–8 (PDF 13998 kb)
Supplementary Table 1
Synthetic and semi-synthetic metagenomes used for PanPhlAn validation. (XLSX 13 kb)
Supplementary Table 3
German 2011 E. coli outbreak specific gene set (Fisher exact test). (XLSX 57 kb)
Supplementary Table 6
Top 100 transcribed genes of E. coli in gut samples of healthy infants. (XLSX 8 kb)
Supplementary Table 7
Bottom 100 transcribed genes of E. coli in gut samples of healthy infants. (XLSX 9 kb)
Supplementary Table 8
Active pathway modules of E. coli in five gut samples of healthy infants. (XLSX 8 kb)
Supplementary Software
Software tool PanPhlAn for strain detection and characterization (version 1.2). (ZIP 40 kb)
About this article
Cite this article
Scholz, M., Ward, D., Pasolli, E. et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nat Methods 13, 435–438 (2016). https://doi.org/10.1038/nmeth.3802