Strain-level microbial epidemiology and population genomics from shotgun metagenomics

Scholz, Matthias; Ward, Doyle V; Pasolli, Edoardo; Tolio, Thomas; Zolfo, Moreno; Asnicar, Francesco; Truong, Duy Tin; Tett, Adrian; Morrow, Ardythe L; Segata, Nicola

doi:10.1038/nmeth.3802

Brief Communication
Published: 21 March 2016

Matthias Scholz¹^na1,
Doyle V Ward²^na1,
Edoardo Pasolli¹^na1,
Thomas Tolio¹,
Moreno Zolfo¹,
Francesco Asnicar¹,
Duy Tin Truong ORCID: orcid.org/0000-0002-4169-7727¹,
Adrian Tett¹,
Ardythe L Morrow³ &
…
Nicola Segata ORCID: orcid.org/0000-0002-1583-5794¹

Nature Methods volume 13, pages 435–438 (2016)

Abstract

Identifying microbial strains and characterizing their functional potential is essential for pathogen discovery, epidemiology and population genomics. We present pangenome-based phylogenomic analysis (PanPhlAn; http://segatalab.cibio.unitn.it/tools/panphlan), a tool that uses metagenomic data to achieve strain-level microbial profiling resolution. PanPhlAn recognized outbreak strains, produced the largest strain-level population genomic study of human-associated bacteria and, in combination with metatranscriptomics, profiled the transcriptional activity of strains in complex communities.

**Figure 1: PanPhlAn validation and comparison with assembly.**

**Figure 2: PanPhlAn profiling of *E. coli* from metagenomics samples.**

**Figure 3: Large-scale population genomics study of *E. rectale* and *A. muciniphila*.**

Accession codes

Primary accessions

Sequence Read Archive

Referenced accessions

Sequence Read Archive

Acknowledgements

We gratefully thank the members of the Segata lab for insightful discussions on the method, K. Schibler for his contribution to the preterm infant cohort study, and V. De Sanctis and R. Bertorelli for help in sequencing the skin samples. This work was supported by the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under contract number HHSN272200900018C (D.V.W., A.L.M.). The work was also supported by the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement number PCIG13-GA-2013-618833 (N.S.), by startup funds from the Centre for Integrative Biology, University of Trento (N.S.), by MIUR “Futuro in Ricerca” RBFR13EWWI_001 (N.S.), by the Fondazione Caritro–2013 (N.S.) and by 'Terme di Comano' (N.S.).

Author information

Matthias Scholz, Doyle V Ward and Edoardo Pasolli: These authors contributed equally to this work.

Authors and Affiliations

Centre for Integrative Biology, University of Trento, Trento, Italy
Matthias Scholz, Edoardo Pasolli, Thomas Tolio, Moreno Zolfo, Francesco Asnicar, Duy Tin Truong, Adrian Tett & Nicola Segata
Center for Microbiome Research, University of Massachusetts Medical School, Worcester, Massachusetts, USA
Doyle V Ward
Department of Pediatrics, Perinatal Institute, Cincinnati, Ohio, USA
Ardythe L Morrow

Contributions

N.S. supervised the work and originally conceived the project. M.S. and D.V.W. contributed to the conception and design of the work. M.S. and T.T. implemented, validated, tested, and documented the software. M.S. and E.P. performed the experiments. A.T., D.V.W. and A.L.M. performed and provided the metagenomics sequencing. M.Z., F.A. and D.T.T. provided computational tools and performed comparative analyses. N.S. and M.S. wrote the manuscript. All authors provided feedback, edited, and approved the manuscript.

Corresponding author

Correspondence to Nicola Segata.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Flowchart of the PanPhlAn method

Illustration of the PanPhlAn gene-family profiling. (i) Metagenomic and metatranscriptomic samples are mapped against reference genomes. (ii) Single gene coverage is merged into gene-family coverage. (iii) Based on a uniform DNA coverage level, PanPhlAn detects the unique gene set of a particular strain in a sample. For metatranscriptomics, the set of gene-families that are uniquely associated to the strain present in a sample are considered for recruiting reads from the metatranscriptome. The obtained RNA coverage levels are then converted into logarithm of the median normalized RNA/DNA ratios.
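Steps (ii) and (iii) of the flowchart can be illustrated with a minimal sketch. It is not the PanPhlAn implementation. The function names, summing per-gene coverage to get family coverage, and the `min_fraction` threshold relative to the median coverage are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def family_coverage(gene_cov, gene_to_family):
    """Merge per-gene coverage into per-family coverage (here: by summing, an assumption)."""
    fam = defaultdict(float)
    for gene, cov in gene_cov.items():
        fam[gene_to_family[gene]] += cov
    return dict(fam)


def present_families(fam_cov, min_fraction=0.5):
    """Call a family present if its coverage reaches min_fraction of the
    median coverage of all covered families (hypothetical rule)."""
    covered = [c for c in fam_cov.values() if c > 0]
    if not covered:
        return set()
    plateau = median(covered)  # stand-in for the uniform strain coverage level
    return {f for f, c in fam_cov.items() if c >= min_fraction * plateau}


# Toy data: five genes grouped into four gene families.
gene_cov = {"g1": 10.0, "g2": 0.0, "g3": 9.0, "g4": 1.0, "g5": 11.0}
gene_to_family = {"g1": "famA", "g2": "famA", "g3": "famB", "g4": "famC", "g5": "famD"}

fam_cov = family_coverage(gene_cov, gene_to_family)
strain_profile = present_families(fam_cov)
print(sorted(strain_profile))  # famC falls well below the plateau and is called absent
```

The key idea is that gene families belonging to the strain in the sample should sit near one common coverage level, while families covered far below that level are treated as absent.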

Supplementary Figure 2 True positive rates of PanPhlAn validation

True positive rates of PanPhlAn for its validation on semi-synthetic data by considering different strains of E. coli (a)-(c), S. aureus (d), and S. epidermidis (e). PanPhlAn accurately detected strain specific gene-families. Specifically, at high target strain genome coverage (>10×), we correctly detected >98% of the gene-families. For coverages as low as 2×, we obtained a true positive rate >92%. At 1×, the majority (avg. 86.95% s.d. 1.95) of the strain-specific genes could still be successfully retrieved.

Supplementary Figure 3 Comparison PanPhlAn versus MetaPhlAn

(a,b) Strain signature comparison of MetaPhlAn2 versus PanPhlAn based on 12 synthetic metagenomes generated from six reference genomes, three of which were not included in the database of both tools. (a) Strain identification by MetaPhlAn2 (based on a set of 621 marker genes) exhibited some limitation in resolving closely related strains (e.g., for the two strains of Bacteroides vulgatus G000699705 and G000699845). On the other hand, (b) PanPhlAn distinguished these two strains due to a much larger number of pangenome gene-families (i.e., 6646 for Bacteroides vulgatus). (c,d) Comparison between MetaPhlAn and PanPhlAn in terms of ROC curves for two species (c) Bacteroides vulgatus and (d) Bacteroides fragilis. ROC curves were constructed using distance as classification thresholds between all sample-pairs. A pair is considered ‘positive’ if both synthetic samples are generated from the same genomes, and ‘negative’ if samples are based on different genomes. For both tested species, the ROC curves showed a better result for PanPhlAn due to a better distance-based discrimination of samples containing the same strain from samples containing different strains.
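The pair-based ROC evaluation described in (c,d) can be sketched as follows. This is a hedged illustration, not the authors' code. The caption does not name the distance measure, so the symmetric-difference distance, the function names and the toy profiles are assumptions. The area under the ROC curve is computed through its rank interpretation: the probability that a same-genome pair has a smaller distance than a different-genome pair.

```python
from itertools import combinations


def set_distance(a, b):
    """Number of gene families present in exactly one of the two profiles."""
    return len(a ^ b)


def pairwise_auc(profiles, genome_of, distance=set_distance):
    """AUC for separating same-genome ('positive') from different-genome
    ('negative') sample pairs when the pairwise distance is the threshold."""
    pos, neg = [], []
    for a, b in combinations(sorted(profiles), 2):
        d = distance(profiles[a], profiles[b])
        (pos if genome_of[a] == genome_of[b] else neg).append(d)
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    # Count positive/negative pairings ranked correctly; ties count as half.
    wins = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: four synthetic samples generated from two genomes.
profiles = {
    "s1": {"a", "b", "c"},
    "s2": {"a", "b", "c"},
    "s3": {"a", "d"},
    "s4": {"a", "d", "e"},
}
genome_of = {"s1": "G1", "s2": "G1", "s3": "G2", "s4": "G2"}

auc = pairwise_auc(profiles, genome_of)
print(auc)
```

An AUC close to 1 means that samples containing the same strain are consistently closer to each other than samples containing different strains.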

Supplementary Figure 4 PanPhlAn profiles from synthetic metagenomes

PanPhlAn profiling results for synthetic metagenomes generated from (a) E. coli strains not present in the pangenome database and (b) S. aureus strains present in the database. In both cases, PanPhlAn enabled high discriminative resolution even among closely related genomes while simultaneously providing whole-genome strain characterization and profiling.

Supplementary Figure 5 PanPhlAn profiling of the German 2011 E. coli outbreak (including the available outbreak reference genomes)

Heatmap clustering based on an E. coli reference database of 113 reference genomes that, in addition to those used in Fig. 2, included three O104:H4 genomes: the German 2011 outbreak strain and two similar isolates from 2009. As in Fig. 2, most of the 12 strains detected in metagenomics outbreak samples clustered together due to almost identical profiles. The three additional O104:H4 reference genomes fell exactly in the center of the main cluster of the detected strains, thereby confirming the correctness of the detected gene-family profiles as outbreak strain profiles. Samples outside the cluster differed in their gene-family profiles due to the presence of additional dominant E. coli strains overlying the target outbreak strain.

Supplementary Figure 6 Coverage of the German 2011 E. coli outbreak samples

Plots showing the coverage depth of the E. coli O104:H4 outbreak strain 2011C-3493 in metagenomic samples from the German outbreak in 2011. (a) Samples of the main cluster in Fig. 2 were proven to contain the outbreak strain by a genome-wide uniform coverage depth. (b) Samples outside the cluster showed gaps of lower coverage levels, thereby confirming the presence of an additional E. coli strain overlying the outbreak strain, and hence dominating the gene-family profile.

Supplementary Figure 7 E. coli strain diversity and similarity network across four different datasets

The E. coli genomic diversity in the healthy gut of American, Chinese, and European cohorts is shown as a heatmap clustering and as a strain similarity network based on PanPhlAn profiles of 1316 metagenomes. (a) PanPhlAn detected E. coli strains in a total of 114 samples and provided presence-absence gene-family profiles for all of them. (b) E. coli strain similarity network to complement Fig. 2c. Most German outbreak samples cluster together with all three O104:H4 reference genomes. The outbreak cluster also includes one sample from the Chinese Diabetes dataset, which PanPhlAn confirmed to be an O104:H4-like strain without the enterohemorrhagic genes. Network edge width is inversely proportional to Jaccard distance between gene-family profiles and nodes connected by short edges reflect high genomic similarity (single disconnected nodes are removed).
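The Jaccard distance underlying the network can be sketched in a few lines. The distance definition is standard. The `max_distance` cutoff for drawing an edge, the function names and the toy profiles are assumptions for illustration, not values taken from the paper.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of present gene families."""
    union = a | b
    if not union:
        return 0.0  # two empty profiles are treated as identical
    return 1.0 - len(a & b) / len(union)


def similarity_edges(profiles, max_distance=0.5):
    """Return (node, node, distance) edges for all pairs within max_distance
    (hypothetical cutoff); nodes without any edge are simply not listed."""
    edges = []
    for u, v in combinations(sorted(profiles), 2):
        d = jaccard_distance(profiles[u], profiles[v])
        if d <= max_distance:
            edges.append((u, v, d))
    return edges


# Toy presence/absence profiles, given as sets of gene-family identifiers.
profiles = {
    "sample_1": {1, 2, 3, 4},
    "sample_2": {2, 3, 4, 5},
    "reference": {7, 8, 9},
}

edges = similarity_edges(profiles)
print(edges)  # only sample_1 and sample_2 are similar enough to be linked
```

A small distance means the two strains share most of their gene families, which the figure draws as a wide, short edge.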

Supplementary Figure 8 Outbreak strain coverage of the Chinese sample T2D-063

Coverage analysis of the Chinese sample T2D-063 in the outbreak cluster (Fig. 2c and Supplementary Fig. 7) to investigate the genomic similarity with the German 2011 outbreak strain. The almost uniform genome-wide coverage depth confirmed high similarity with the outbreak strain including the presence of plasmids. However, the missing Shiga-toxin-encoding region suggests that the sample contained a similar O104:H4 strain that was not identical to the German outbreak strain. This coincided with the absence of Shiga-toxin genes in PanPhlAn’s gene-family profile.

Supplementary Figure 9 Multiple strain detection in E. coli samples

PanPhlAn can yield spurious strain profiling when multiple strains of the same species are present at a comparable abundance. For this reason, PanPhlAn implements a quality control procedure to identify cases in which it suspects the presence of multiple strains. The figure shows the same PCoA plot of E. coli profiles as in Fig. 2c, but in addition samples identified by PanPhlAn as “multistrain” are marked with an “x”. Analysis of the 12 samples with such a warning confirms that the gene repertoires predicted in these cases, despite being a true reflection of the overall E. coli gene content in the sample, do not accurately represent those of single E. coli strains.

Supplementary Figure 10 PCoA showing strain diversity of E. rectale (3 reference genomes)

Large-scale population genomics study of E. rectale based on 1830 gut metagenomic samples from 8 cohorts. In this plot we considered three reference genomes for E. rectale instead of the single genome used in Fig. 3a. In both cases the clustering result was very similar and resolved E. rectale strains into three geographically distinct clades.

Supplementary Figure 11 Retrieved gene comparison PanPhlAn versus assembly

Comparison between PanPhlAn and assembly-based approaches in terms of number of strain-specific retrieved genes for two species in the gut samples from the HMP study: (a) E. rectale (129 positive samples) and (b) A. muciniphila (56 positive samples). For both tested species, PanPhlAn detected a higher number of genes for most of the samples tested. This was especially evident when the target organism was at low coverage. Specifically, more than 1000 genes were detected exclusively by PanPhlAn when the relative abundance of the target organism was around 1% for both (c) E. rectale and (d) A. muciniphila.

Supplementary Figure 12 Strain detection comparison PanPhlAn versus ConStrains and assembly

Strain detection comparison PanPhlAn versus ConStrains and assembly for the three gut samples (SRS014235, SRS050925, SRS048870) of the HMP dataset having the highest coverage for E. rectale. (a) The heatmap of the PanPhlAn profiling results highlighted distinct strains in the three samples. On the other hand, (b) ConStrains associated the same predominant strain to all the samples. Assembly in conjunction with phylogeny reconstruction (c) and core gene sequence divergence (d) confirmed the PanPhlAn results by detecting distinct strains in the three considered samples.

Supplementary Figure 13 Strain similarity networks of B. ovatus, B. fragilis, S. epidermidis, and N. meningitidis

PanPhlAn multi-cohort strain-strain similarity networks of (a) B. ovatus and (b) B. fragilis in human gut samples; (c) S. epidermidis in skin samples; and (d) N. meningitidis in throat samples. Each node represents a strain either captured from a metagenomic sample or taken from the available reference genomes. Edge width is inversely proportional to Jaccard distance between gene-family profiles and nodes connected by short edges reflect high genomic similarity.

Supplementary Figure 14 Marine environmental strain-level comparative genomics study

Marine population genomics based on 1,246 samples. Heatmaps showed hierarchical clustering of PanPhlAn profiles for the rarely sequenced Roseobacter species (a) Pelagibaca bermudensis (1 ref. genome), (b) Roseovarius nubinhibens (1 ref. genome), (c) Roseovarius TM1035 (1 ref. genome), and (d) Sulfitobacter (3 ref. genomes); and two better characterized marine species (e) Prochlorococcus marinus (17 ref. genomes) and (f) Pelagibacter ubique (5 ref. genomes). Different marine regions are marked by different colours. Strains of all species showed a broad presence in many marine regions, partly as regional clusters of strain-specific gene content, especially for locally isolated areas like the Baltic Sea and the North Sea. (g) PCoA plot based on PanPhlAn profiles of Pelagibacter ubique highlights differences between strains from different marine regions. Strains present in the Baltic Sea could be clearly distinguished from North Sea strains, and strains detected in samples of the Trondheimsfjord in Norway also clustered together.

Supplementary Figure 15 PanPhlAn inference of strain-specific in vivo transcriptional activity

Pangenome-wide coverage depth of both metagenomic and metatranscriptomic data from a healthy infant gut sample. (a) Genes are sorted by DNA coverage. Transcript coverages are then normalized by the corresponding gene coverages, and the resulting ratios are median-normalized, log-transformed and re-scaled. (b) Hierarchical clustering of strain-specific transcription profiles from gut samples of 5 healthy infants. (c) Functional analysis of the highest overall expressed pathway modules reporting KEGG modules sorted by Gene Set Enrichment Analysis score.
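The normalization in (a) can be sketched as below, under stated assumptions. The function name is hypothetical, log base 2 is an arbitrary choice, families without DNA coverage are skipped, untranscribed families are reported as `None`, and the final re-scaling step is omitted because the caption does not specify it.

```python
import math
from statistics import median


def log_expression(dna_cov, rna_cov):
    """Median-normalized log2 RNA/DNA ratio per gene family.

    Families with zero DNA coverage are skipped (not part of the strain);
    families with zero RNA coverage get None instead of log(0).
    """
    ratios = {
        fam: rna_cov.get(fam, 0.0) / dna
        for fam, dna in dna_cov.items()
        if dna > 0
    }
    positive = [r for r in ratios.values() if r > 0]
    if not positive:
        return {fam: None for fam in ratios}
    center = median(positive)  # ratios are expressed relative to the median
    return {
        fam: (math.log2(r / center) if r > 0 else None)
        for fam, r in ratios.items()
    }


# Toy DNA and RNA coverage values for four gene families.
dna_cov = {"famA": 10.0, "famB": 10.0, "famC": 20.0, "famD": 0.0}
rna_cov = {"famA": 20.0, "famB": 5.0, "famC": 20.0}

expr = log_expression(dna_cov, rna_cov)
print(expr)  # famA above, famB below, famC at the median transcription level
```

Dividing by DNA coverage removes the effect of strain abundance, so positive values indicate families transcribed above the strain's median level and negative values indicate families transcribed below it.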

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15, Supplementary Tables 2, 4, and 5, and Supplementary Notes 1–8 (PDF 13998 kb)

Supplementary Table 1

Synthetic and semi-synthetic metagenomes used for PanPhlAn validation. (XLSX 13 kb)

Supplementary Table 3

German 2011 E. coli outbreak specific gene set (Fisher exact test). (XLSX 57 kb)
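A Fisher exact test for one gene family can be sketched with the standard library alone. The table caption only names the test, so the one-sided (enrichment) form, the function name and the counts in the example are assumptions for illustration.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher exact test on a 2x2 table.

    a: outbreak samples with the gene family     b: outbreak samples without it
    c: other samples with the gene family        d: other samples without it
    Returns P(at least a outbreak samples carry the family) under independence.
    """
    n_outbreak = a + b
    n_with = a + c
    n_total = a + b + c + d
    total = comb(n_total, n_outbreak)
    upper = min(n_outbreak, n_with)
    tail = sum(
        comb(n_with, k) * comb(n_total - n_with, n_outbreak - k)
        for k in range(a, upper + 1)
    )
    return tail / total


# Toy counts: family present in all 3 outbreak samples and in none of 3 others.
p_value = fisher_enrichment_p(3, 0, 0, 3)
print(p_value)
```

In a screen over all gene families, each family would get its own 2x2 table, and the resulting p-values would normally need correction for multiple testing.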

Supplementary Table 6

Top 100 transcribed genes of E. coli in gut samples of healthy infants. (XLSX 8 kb)

Supplementary Table 7

Bottom 100 transcribed genes of E. coli in gut samples of healthy infants. (XLSX 9 kb)

Supplementary Table 8

Active pathway modules of E. coli in five gut samples of healthy infants. (XLSX 8 kb)

Supplementary Software

Software tool PanPhlAn for strain detection and characterization (version 1.2). (ZIP 40 kb)

About this article

Cite this article

Scholz, M., Ward, D., Pasolli, E. et al. Strain-level microbial epidemiology and population genomics from shotgun metagenomics. Nat Methods 13, 435–438 (2016). https://doi.org/10.1038/nmeth.3802

Received: 08 July 2015
Accepted: 16 February 2016
Published: 21 March 2016
Issue Date: May 2016
DOI: https://doi.org/10.1038/nmeth.3802
