Metaphylogenetic analysis of global sewage reveals that bacterial strains associated with human disease show less degree of geographic clustering

Knowledge about the difference in the global distribution of pathogens and non-pathogens is limited. Here, we investigate this difference using a multi-sample metagenomics phylogeny approach based on short-read metagenomic sequencing of sewage from 79 sites around the world. For each metagenomic sample, bacterial template genomes were identified in a non-redundant database of whole genome sequences. Reads were mapped to the templates identified in each sample. Phylogenetic trees were constructed for each template identified in multiple samples. The countries from which the samples were taken were grouped according to different definitions of world regions. For each tree, the tendency for regional clustering was determined. Phylogenetic trees representing 95 unique bacterial templates were created, covering 4 to 71 samples. Varying degrees of regional clustering were observed. The clustering was most pronounced for environmental bacterial species and human commensals, and less so for colonizing opportunistic pathogens, opportunistic pathogens and pathogens. No pattern of significant difference in clustering between any of the organism classifications and country groupings according to income was observed. Our study suggests that while the same bacterial species might be found globally, there is a geographical regional selection or barrier to spread for individual clones of environmental and human commensal bacteria, whereas this is the case to a lesser degree for strains and clones of human pathogens and opportunistic pathogens.


SUPPLEMENTARY FIGURE LEGENDS
Supplementary Figure S1: Quantitative overview after major steps of the pipeline. Overview of the number of bacterial templates and unique bacterial templates left after each of the four major steps in the pipeline.
Supplementary Figure S2: Unknown bases in the consensus sequences. A bar chart of the percentage of unknown bases in the consensus sequences. The X-axis shows the percentage of unknown bases and the Y-axis shows the number of consensus sequences. The dotted line shows the cut-off at 40%, which was used in the analysis.

Out of 11,691 consensus sequences, only 1,504 had a proportion of unknown bases lower than or equal to 40%. Therefore, a cut-off of 40% unknown bases was chosen and applied in the phylogenetic analysis.

Supplementary Table S1: Country and regional information on each of the 80 samples from the Global Sewage project. Only 79 samples were used in the final results; the sample from Hungary was removed due to insufficient reads.
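The 40% cut-off on unknown bases described above can be illustrated with a minimal sketch. It assumes that unknown bases are coded as "N" (or "n") in the consensus sequences; the function names and the toy sequences are hypothetical and are not taken from the pipeline used in the study.

```python
def unknown_fraction(seq: str) -> float:
    """Fraction of positions in a consensus sequence that are unknown.

    Assumption: unknown bases are written as 'N' or 'n'.
    """
    if not seq:
        return 1.0  # treat an empty sequence as fully unknown
    unknown = sum(1 for base in seq if base in "Nn")
    return unknown / len(seq)


def passes_cutoff(seq: str, max_unknown: float = 0.40) -> bool:
    """Keep a consensus sequence if at most 40% of its bases are unknown."""
    return unknown_fraction(seq) <= max_unknown


# Toy consensus sequences (hypothetical, for illustration only).
consensus = {
    "sample_A": "ACGTNNACGT",  # 20% unknown
    "sample_B": "NNNNNACGTN",  # 60% unknown
    "sample_C": "ACGTACGTAC",  # 0% unknown
}

kept = {name for name, seq in consensus.items() if passes_cutoff(seq)}
```

With these toy sequences, only sample_A and sample_C would be carried forward to tree construction.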

Classifications of bacterial templates
Here follows a step-by-step description of how the two classification schemes, EID2 plus (EID2p) and the Five class classification (5CC), were made. EID2p is made by steps 1-3 and is mostly built on the EID2 database [1], but with an additional step (step 3), hence the name EID2 plus. The 5CC is made by adding step 4 to the EID2p scheme.

Step 1 -Annotate for human interaction
The database was scraped for information about interactions where the cargo taxa were bacteria and the carrier taxa were any possible taxon.
The scrape output was then separated according to carrier taxa, keeping only information where the carrier was at rank 'species' and the cargo was at rank 'no rank' or 'species'. This resulted in files containing carrier information about bacteria, invertebrates, mammals, plants, primates, rodents, and vertebrates.
Lastly, the information about Homo sapiens interactions with bacteria was extracted from the file containing information about primate interactions with bacteria.
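The filtering in step 1 can be sketched as follows. The record layout (the keys carrier, carrier_rank, cargo and cargo_rank) and the example rows are hypothetical stand-ins for the scraped data; the actual EID2 export format is not described here, so this is an illustration of the filter logic only.

```python
# Hypothetical scraped interaction records (illustrative rows only).
records = [
    {"carrier": "Homo sapiens", "carrier_rank": "species",
     "cargo": "Escherichia coli", "cargo_rank": "species"},
    {"carrier": "Homo sapiens", "carrier_rank": "species",
     "cargo": "Escherichia coli K-12", "cargo_rank": "no rank"},
    {"carrier": "Mus musculus", "carrier_rank": "species",
     "cargo": "Helicobacter hepaticus", "cargo_rank": "species"},
    {"carrier": "Primates", "carrier_rank": "order",
     "cargo": "Shigella flexneri", "cargo_rank": "species"},
]


def human_cargo(records):
    """Collect bacterial cargo names reported for Homo sapiens, split by cargo rank."""
    by_rank = {"no rank": set(), "species": set()}
    for rec in records:
        if rec["carrier_rank"] != "species":
            continue  # carrier must be at rank 'species'
        if rec["cargo_rank"] not in by_rank:
            continue  # cargo must be at rank 'no rank' or 'species'
        if rec["carrier"].lower() != "homo sapiens":
            continue  # keep only human interactions
        by_rank[rec["cargo_rank"]].add(rec["cargo"])
    return by_rank


human_lists = human_cargo(records)
```

Keeping the two cargo ranks in separate sets mirrors the two-stage lookup used in step 2.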

Step 2 -Lookup of intersection between the human interaction list and the template list
Next, the EID2 Homo sapiens information was used to assign each template organism to either 'human interaction' or 'environmental'.
First, a match between the template organism name and a cargo at rank 'no rank' was searched for. If a match was found, the template was classified as 'human interaction'. If no match was found at the 'no rank' level, a match at species level was searched for. If a match was found, the organism was classified as 'human interaction'; otherwise it was classified as 'environmental'.
When a match was identified, the name from EID2 was noted down. If no match was found, NA was noted.
The conditions for which classification was assigned can be seen in the flow diagram below.
Information can be found in the columns:
-EID2.interaction.step2 Listing the classification, either 'human interaction' or 'environmental'
-EID2.closest.match.interactions.step2 Listing the closest match between the template organism name and the EID2 database of bacterial cargos in humans
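The two-stage lookup in step 2 can be sketched as below. The function name and example names are hypothetical. In particular, deriving the species name from the first two words of the template name is an assumption made for illustration; how the species-level match was actually performed is not specified here.

```python
def classify_step2(template_name, no_rank_cargo, species_cargo):
    """Return (classification, closest EID2 match or None) for one template.

    Assumption: the species name is the first two words of the template name.
    """
    # First try an exact match against cargo at rank 'no rank'.
    if template_name in no_rank_cargo:
        return "human interaction", template_name
    # Otherwise fall back to a species-level match.
    species = " ".join(template_name.split()[:2])
    if species in species_cargo:
        return "human interaction", species
    # No match at either level: None stands in for the NA noted in the table.
    return "environmental", None


# Hypothetical human cargo lists (illustrative only).
no_rank = {"Escherichia coli K-12"}
species = {"Escherichia coli", "Klebsiella pneumoniae"}

result_strain = classify_step2("Escherichia coli K-12", no_rank, species)
result_species = classify_step2("Klebsiella pneumoniae subsp. pneumoniae", no_rank, species)
result_none = classify_step2("Psychrobacter cryohalolentis K5", no_rank, species)
```

The second element of the returned tuple corresponds to the EID2.closest.match.interactions.step2 column, with None in place of NA.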

Step 3 -Lookup for pathogenesis in the list from the article: Risk factors for human disease emergence (2001), Taylor, L. H. et al. [2]
A match was searched for between the list of pathogens from Taylor et al. [2] and the template organisms. The template organisms were then further classified as commensal, pathogen, or environmental.
If the template has a match in the Taylor list, it is classified as 'pathogen' regardless of its classification in step 2. If the template does not have a match in the Taylor list, it is classified as 'commensal' if it was classified as 'human interaction' in step 2. If it was classified as 'environmental' in step 2, it keeps its 'environmental' classification.
Note that if Taylor et al. have listed only a genus (e.g. Lactobacillus sp. and Megasphaera sp.) as pathogenic, this is not accepted as evidence for classifying the relevant templates as 'pathogen'.
The flow diagram below gives an overview of the decision flow.
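The decision flow of step 3 can also be written as a short sketch. The function names and the example list are hypothetical. Genus-level entries are assumed to be recognisable by a trailing " sp.", as in the examples given in the note above.

```python
def has_taylor_match(species, taylor_list):
    """True if the species appears in the Taylor list as a species-level entry.

    Genus-only entries (assumed to end in ' sp.') are not accepted as evidence.
    """
    species_entries = {entry for entry in taylor_list if not entry.endswith(" sp.")}
    return species in species_entries


def classify_step3(step2_class, taylor_match):
    """Combine the step 2 classification with the Taylor list lookup."""
    if taylor_match:
        return "pathogen"  # overrides the step 2 classification
    if step2_class == "human interaction":
        return "commensal"
    return "environmental"


# Hypothetical excerpt of a pathogen list (illustrative only).
taylor = {"Salmonella enterica", "Lactobacillus sp."}

path = classify_step3("environmental", has_taylor_match("Salmonella enterica", taylor))
comm = classify_step3("human interaction", has_taylor_match("Lactobacillus gasseri", taylor))
env = classify_step3("environmental", has_taylor_match("Psychrobacter cryohalolentis", taylor))
```

Here a Lactobacillus template is not promoted to 'pathogen' by the genus-level entry and falls back to its step 2 classification.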
Information can be found in the column:
-classification.pathogen.Taylor

Step 4 -Five class classification (5CC)
Hand-curated classification, made by web search for pathogenesis grouping as well as confirming commensal and environmental status. In the paper, pathogens are divided into three groups:
Colonizing opportunistic pathogens (COPs)
Simple opportunistic pathogens (SOPs)
Frank pathogens

Definitions listed by the paper:
Colonizing opportunistic pathogens (COPs) are microbes that asymptomatically colonize the human body and, when the conditions are right, can cause infections.
The broad category of opportunistic pathogens can be divided into two distinct subgroups: the COPs and the noncolonizing, simple opportunistic pathogens (SOP).
The defining feature of all opportunistic pathogens is their capacity to cause disease when they are introduced into a susceptible body site or when hosts are immunologically compromised.
Whereas SOPs, such as Vibrio vulnificus, Mycobacterium marinum, and Legionella pneumophila, are only present in environmental reservoirs, COPs can also take up long-term residence in/on the human body as part of the "normal" human microbiome.
Frank pathogens can cause acute, chronic, or latent infections that can be symptomatic or asymptomatic.
The detection of frank pathogens is often associated with a diseased status, whether active or latent, and identifying cases of active or recent infections is usually enough to trace transmission routes.
If an article search shows no evidence of human interaction, or the optimal growth conditions are far removed from those of the human body, e.g. a psychrophilic bacterium with an optimal growth temperature below that of the human body, the organism is classified as environmental.

Groupings used in our study
Opportunistic pathogen definition: Can cause disease in immunocompromised people. Does not colonize the human microbiome; normally acts as an environmental bacterium except when causing infections.
Pathogen definition: Obligate/frank pathogen; causes infection when in contact with humans, in both immunocompetent and immunocompromised people. There should be no evidence of the bacterium colonizing the human microbiome; it normally acts as an environmental bacterium when not infecting humans.

Columns
-Web.search.step4 Listing the classification: commensal, COP, environmental, opportunistic pathogen, pathogen
-Search.notes.step4 Notes taken while searching for information regarding each organism
-Ref1.step4 Link to paper with information regarding the organism
-Ref2.step4 Additional link to paper with information regarding the organism
-Ref3.step4 Additional link to paper with information regarding the organism