Setting a baseline for global urban virome surveillance in sewage

The rapid development of megacities, and their growing connectedness across the world is becoming a distinct driver for emerging disease outbreaks. Early detection of unusual disease emergence and spread should therefore include such cities as part of risk-based surveillance. A catch-all metagenomic sequencing approach of urban sewage could potentially provide an unbiased insight into the dynamics of viral pathogens circulating in a community irrespective of access to care, a potential which already has been proven for the surveillance of poliovirus. Here, we present a detailed characterization of sewage viromes from a snapshot of 81 high density urban areas across the globe, including in-depth assessment of potential biases, as a proof of concept for catch-all viral pathogen surveillance. We show the ability to detect a wide range of viruses and geographical and seasonal differences for specific viral groups. Our findings offer a cross-sectional baseline for further research in viral surveillance from urban sewage samples and place previous studies in a global perspective.


Scientific Reports
| (2020) 10:13748 | https://doi.org/10.1038/s41598-020-69869-0 www.nature.com/scientificreports/ fatality rate but overall high population level impact has been tested successfully to monitor the progress of the global polio-elimination program, particularly in regions where non-replicating polio-virus vaccines are used 15,16 . The huge potential of environmental surveillance was illustrated when a silent epidemic of wild-poliovirus type 1 in Israel was detected, which led to a mop-up vaccination campaign and resolution of the epidemic, without a single case of paralytic poliomyelitis 17 . In addition, small-scale studies have already shown the potential for using metagenomic sequencing of sewage extracts for the detection of a range of virus families [18][19][20] (Table 1 in  Appendix). While these studies have largely focused on viruses with a replication phase in the gastro-intestinal tract, the fecal and/or urinary shedding of, for instance, measles virus, yellow fever virus, Zika virus, West Nile virus, Ebola virus, SARS coronavirus, and MERS coronavirus suggests the potential utility of sewage testing to capture circulation of these pathogens as well [21][22][23][24][25] . Moreover, metagenomic sequencing has the potential to detect any viral genomic material in the sample, without targeting a specific viral pathogen or limiting for only known viral pathogens. In this study, we pilot the use of metagenomics to describe a comparative snapshot of the virome from sewage samples of high-density urban areas across all continents. We provide a critical appraisal of technical and analytical biases and discuss the potential utility for human and animal disease monitoring and surveillance, as well as the additional steps needed to go towards routine implementation.

Results
Data quality evaluation. Urban sewage samples and associated metadata (Supp. File 1) were obtained from 62 countries across all continents between January and April 2016 from the influent of wastewater treatment plants prior to treatment or from open sewage systems in low-and middle-income countries. All samples were previously processed for the detection of bacterial antimicrobial resistance genes using DNA metagenomics 26 .
Here we focus solely on viral DNA and RNA metagenomics (methods) and the analysis of the viral data. Sewage samples are highly variable in terms of composition and DNA abundance and therefore potential biases that might impact the final read abundance and diversity of the sewage virome were evaluated. Initially, an extensive evaluation of the technical factors that may impact the resulting data to gain a deeper understanding of potential pitfalls was performed. First, read abundance was evaluated as a proxy for viral abundance. Sequencing protocols for virome analysis in sewage typically require an amplification step to provide enough DNA input for sequencing, which can result in artificial duplication of sequence reads and thereby impact the quantitative interpretation of the data substantially (Fig. 1a). Indeed, the observed viral species richness was negatively correlated with the number of amplification cycles needed to obtain enough DNA as input for sequencing (Fig. 1b), while the average fold replication of a read was positively correlated (Fig. 1c). The impact of dereplication on the individual species level read counts varied greatly within a sample. Especially in samples with a low number of reads after dereplication (Fig. 1d) the decrease in read counts for a species ranges from 600 to fivefold . These differences have a profound effect on the species distribution within the sample, and thus the interpretation thereof. The effect of dereplication is much less variable between species in samples with a high number of reads after dereplication (Fig. 1e). Therefore, the optimal use of virome sequencing depends on the initial abundance of viral sequences in the sample and extra amplification may only increase the coverage of the same viruses, but does not increase the richness of the virome, which needs to be carefully considered when designing and interpreting sewage metagenomics studies. Besides the influence of read replication on read abundance, the richness of the virome can be impacted by the presence of non-viral sequences. Typically, the metagenomic data contain a large fraction of unknown reads, and, despite the virus specific sample preparation, non-viral reads, including archaeal, bacterial, and eukaryote DNA.
While the overall proportion of reads for the different domains was comparable in most samples, multidimensional scaling of the non-viral read counts showed that some samples were very divergent from the central cluster and were manually marked as outliers (Fig. 2a, dashed line). Viral read abundance was low in these outlier samples (Fig. 2b, right panel). There was no significant correlation between the concentration of human or bacterial read fractions with any of the measured sample characteristics, such as pH, conductivity, and type of sewer system.

Exploration of the sewage virome.
Based on the data quality assessment, we analyzed viral diversity in the samples after dereplication and following annotation by both Kajiu and Centrifuge as described. Between 0.09% and 22% of the reads could be annotated as viral (median of 6%), with high abundances of bacteriophages, plant-and insect viruses (Fig. 3). Most abundant were bacteriophages, representing on average 77% (ranging from 9 to 94%) of the annotated viral reads in the sewage. In particular Microviridae (median of 18%, range 0.5-51% of reads), Siphoviridae (median of 17%, range 0.22-67% of reads), Myoviridae (median of 9%, range 0.08-41% of reads), and Podoviridae (median of 4%, range 0.02-25% of reads), were highly abundant. These bacteriophage families could be found around the globe without obvious regional differences when using read annotations at this taxonomic level. Although specific bacteriophages have been studied extensively as potential indicators of human fecal pollution, bacteriophage taxonomy is relatively poorly defined, making accurate classification challenging at genus and species level 27,28 . Hence, geographical patterns at a more fine-grained level of annotation may be lost in our analysis. Moreover, interpretation of patterns of bacteriophage abundance could be obscured by the fact that bacteriophages can encounter bacterial hosts in the sewage in which they can multiply. As described elsewhere, the analysis of the bacterial resistomes of the same samples showed clear segregation of sequences from Africa and Asia versus those from Europe and the US 26 . A more detailed analysis is needed to assess if there is a relation between specific bacteriophages and the resistomes, as environmental viromes have been shown to be a potential reservoir for antimicrobial resistance genes 29 .

Scientific Reports
| (2020) 10:13748 | https://doi.org/10.1038/s41598-020-69869-0 www.nature.com/scientificreports/ viruses. On average, more than 84% of these reads belonged to the Virgaviridae family. Especially viral species related to infections of cucumber, tomato, tobacco and pepper plants could be detected in sewage, as indicated by species level taxonomy (Fig. 4b). Apart from a sample from Kenya, the abundance of vegetable-consumptionrelated viruses was higher in samples from Europe and North America compared to samples from the rest of the world (Fig. 4a) (Welch's t-test, p-value = 0.06). The global presence and high abundance of plant viruses has led to the proposal that they may be good indicators for human fecal contamination alike specific bacteriophage populations 30 . However, this remains to be validated given the geographic variation observed in our dataset, which could reflect differences in diet and/or agricultural practices in these countries. A median of 1.4% (ranging 0.1-74%) of the sewage virome consisted of viruses associated with insects, comprising mainly species from the genera Ambidensovirus, Cripavirus, and Brevidensovirus (Fig. 4d), known to infect a range of crickets, cockroaches, fruit flies, and mosquitos 31 . In the global distribution there was an increased abundance of insect viruses in samples from around the equator, mainly in samples from Africa ( Fig. 4c) (Welch's t-test, p-value = 0.0004). One exception was the sample from Finland, which had a high abundance of insect virus reads (13.7%) in comparison with samples from other European countries (1.5%). Several reads were found to be annotated as "Aedes albopictus densovirus 2", "Aedes aegypti Thai densovirus", and "Anopheles gambiae densonucleosis virus". There is some evidence that these densoviruses may be associated with Aedes aegypti, Aedes albopictus and Culex mosquitos 32,33 . Current data are not sufficient to meet the requirements for sewage surveillance, but these findings show the potential to track mosquitos by looking for mosquito specific viruses. www.nature.com/scientificreports/ A selection of viral taxa was analyzed containing human pathogenic viruses from the Astro-, Entero-Noro-, Sapo-, Adeno-and Rotaviridae families that are known to be abundant across the world as causes of diarrheal disease (Fig. 5b). Most abundant and widespread were the astroviruses. Enteroviruses were present to a lesser extent but could be detected in sewage samples from across the globe as well. Members of the noro-, sapo-, adeno-, and rotaviruses were only sporadically detected. Further investigation of samples with high human astrovirus content showed mostly evidence of the classic Human Astrovirus 1, 2 and 4 that are common causes of diarrheal disease, and sporadic detection of other clades such as Human Astrovirus MLB and Human Astrovirus VA for which less is known regarding clinical impact 35 . Mapping of human enterovirus reads resulted in 102 small contiguous sequences which were typed using the enterovirus typing tool 36 . Mainly Enterovirus C (46%) and B (9%) were detected. Further subtyping of for instance poliovirus was not possible because of a lack of coverage of the standardized genotyping region VP1. The same mapping was done for norovirus, resulting in 13 contigs of 84 to 962 nucleotides in length. Most norovirus sequences were typed as either GII, with capsid type 6, 10 and 17, and www.nature.com/scientificreports/ GIV, all viruses that are commonly found in outbreak based surveillance 37 . Sapovirus sequences, all belonging to type GI, were found in seven of the samples. Adenovirus and rotavirus hits were sporadically detected across all sampling sites and upon further investigation showed mainly adenovirus C and rotavirus A hits. It is known that noroviruses, astroviruses and rotaviruses follow a winter seasonality and enteroviruses follows a summer seasonality pattern [38][39][40] . The time of sampling of the sewage was in a 3-month timeframe between January and March, which corresponds to the winter period in the northern hemisphere, therefore a higher prevalence of winter seasonal viruses was expected in those. When looking at the global distribution of viruses, the average abundance of astro-and noroviruses was higher in the northern hemisphere, and the reverse pattern was observed for enteroviruses, with higher average abundance in the southern hemisphere during the sampling period (Fig. 5c). Given the cross-sectional nature of our study we acknowledge that these seasonal patterns will have to be confirmed using longitudinal sampling which would allow for meaningful statistical analysis, but our first observations align with what is generally expected at that time of the year.

Discussion
This global sewage study gives, for the first time, a catch-all metagenomic comparison of the urban sewage virome of major cities across the world. We show that it is possible to detect a wide diversity of viruses in sewage samples and we identify geographical and seasonal differences in abundance for specific viral groups, including those that are currently targeted by surveillance for diarrheal and neurological disease, as well as viruses that could be used as indicators for presence of specific mosquito species. In addition, we provide the global scientific community with a geographically very broad resource for searching for novel virus sequences as novel pathogens continue to emerge. The pilot study also highlights some important challenges that need to be addressed to take the technology forward, such as how to deal with low input samples and the overabundance of phages, plant, and insect viruses in the sample. Metagenomic sequencing of viruses is a complex and evolving technology which is currently far from being standardized. Differences in sample preprocessing, sequencing technology, and data analysis can have a major impact on the viral read abundance, diversity, and the proportion of sequences that are annotated 41,42 . In our study, we eliminated lab-to-lab variability by performing all sample preparation, sequencing and analysis at the same location, which, apart from the analysis, is obviously not feasible for global surveillance. Further work is ongoing, including the development of fieldable sample treatment and sequencing protocols, comparison of effects of sample preparation on viral richness and further exploration of applicability, by longitudinal sampling and sampling in the presence of known ongoing outbreaks.
A critical challenge of using metagenomic sequencing for surveillance purposes remains the interpretation of sequence annotations. With the development of high-speed k-mer based annotation tools such as the ones used in this study, annotation can be performed rapidly and with few false negatives. However, erroneous and mis-annotated entries in public databases, together with inconsistency in the sequence-based taxonomic classification of viruses, make annotation to the species level challenging. Major steps have been taken to create a more consistent sequence based viral taxonomy 27,43 , but these approaches have not yet been integrated in fast viral annotation tools. Also, deposits of large volumes of virus sequences without a clear host association or pathogenicity data in public databases 44 make it difficult to interpret the relevance of such findings. In our data, many of these "environmental viruses" could be identified. Given the increase in virus diversity in reference databases, it is striking how many sequence reads can remain unclassified with the currently used methods. This is in line with previous observations, where 40-90% of the sequence reads could not be classified 45 . It can very well be that the currently unclassified sequence reads represent potential new viruses, including novel pathogens.
In conclusion, we show the potential of global viral surveillance using metagenomic sequencing of sewage without ignoring the complexity of the approach. However, with improvements in sample preprocessing, sequencing methods and interpretability of viral sequence annotation this potential will increase.

Methods
Urban sewage sample and metadata collection. Samples were obtained from 62 countries from all continents as previously described 26 . All samples were taken before wastewater treatment. A questionnaire was filled in with information on sampling site, sample consistency and sample temperature, including transport time, storage time, and temperature before shipping. All samples were taken in a timeframe of 3 months from January until March 2016. In addition to sample specific data, additional metadata (Supp. File 1) was collected such as demographics, type of industry in the surrounding area, weather conditions and catchment area of the sewer. Upon arrival, samples were thawed at room temperature and 250 ml of the raw sewage was taken and centrifuged at 10,000 g for 10 min. The pellet was removed for bacterial content determination and DNA metagenomic sequencing 26 and the supernatant was used to perform the virus specific sample pretreatment and sequencing.
Sample processing for sequencing. Viral extraction was performed on 40 ml of sewage supernatant as previously described 46 . In short, the conductivity was measured to exceed 2000 µs and the pH of the samples was adjusted to pH 4. Afterwards 10 ml PEG 6,000 was added and the samples were incubated overnight at 4˚C under agitation.
After incubation the samples were centrifuged a 13,500 g for 1.5 h at 4 °C. The supernatant was removed, the pellet was dissolved in warm glycine buffer and 1 mL of chloroform-butanol (50/50) was added. After mixing, the sample was centrifuged for 5 min at 13,000 g at 4 °C. The filtrate was collected through a series of filters with 5 µm, 1.2 µm, 0.45 µm and 0.22 µm pore size.
Unprotected free DNA was removed by incubation with Ambion Turbo DNase for 30 min at 37 °C. Total nucleic acid content was extracted using Roche NA isolation kit and cDNA was made using superscript III Scientific Reports | (2020) 10:13748 | https://doi.org/10.1038/s41598-020-69869-0 www.nature.com/scientificreports/ (Invitrogen) using random hexamers that avoids amplification of human rRNA 47 . dsDNA was made using Klenow (NEB) and samples were sheared using Ion Shear Plus Enzyme Mix II. Libraries were amplified for 15 cycles using High Fidelity Platinum PCR reaction. The library concentration was determined using Ion Torrent quantification kit (Thermo Fisher). If the concentration was below 20 nM, extra amplification cycles were performed. Sequencing was performed on the Ion Torrent S5XL platform to generate around 10 million sequence reads per sample.
Data preprocessing. Raw fastq files were quality trimmed using FastP 48 . Read ends were trimmed to mean quality 25 with a sliding window of 5. Reads were trimmed to 400 nucleotides by default because the chemistry of Ion Torrent sequencing technology allows for reads of maximally 400 nucleotides long and longer reads were observed to contain high Phred score non-sense repetitive patterns in the tail region. Reads shorter than 50 nucleotides were discarded as well as reads with an average Phred score below 25. Duplicate reads were removed using CD-HIT 49 by clustering reads that start at the exact same position in the genome and have over 90% sequence identity in the first 50 nucleotides of the read, because of variable read length and observed insertion and deletion errors in the beginning of the reads.

Read based analysis.
Due to the expected high diversity of viruses present in the sewage samples, a read based annotation of the data was chosen, contrary to an assembly-based approach. Annotation was performed using two taxonomic annotation tools: Kaiju 50 and Centrifuge 51 . Kaiju performs taxonomic annotation based on an amino acid (AA) level which provides a higher sensitivity. This is especially important for the annotation of viral sequences given the high mutation rate of viruses 52 compared to other organisms. In parallel with Kaiju, Centrifuge was run, which uses nucleotide (nt) identity for taxonomic annotation. Combining a nucleotide and an amino acid based matching approach ensures that both coding and non-coding read sequences can be annotated. In addition, the combination of two read annotation tools with different annotation strategies was chosen to give more robust mapping results. The databases used for taxonomic annotation consisted of archaeal, bacterial and human RefSeq sequences and were extended with all viral and phage entries in GenBank version 230 53 because of the limited viral and phage sequence diversity in the RefSeq database.
Recommended quality thresholds and parameters for metagenomic data were used for both Kaiju and Centrifuge. Kaiju was run in greedy mode with a score cutoff of 70 and an error of 5. Centrifuge was run with a score threshold of 300 and a hit length cutoff of 50. If neither method produced a hit the read was annotated as "Unknown". BASTA 54 was used to determine the last common ancestor (LCA) of each hit given by both methods without restrictions on hit quality.
The final read counts passing QC were determined by the sum of read annotations at a certain taxonomic level and were normalized by total dereplicated read count to adjust for differences in sequencing depth and data quality [55][56][57] . The LCA taxon was used if the annotation at a certain taxonomic level was absent. Manual regrouping of taxonomic levels was performed to calculate read counts of human pathogenic viruses and read counts by host group. For sample comparison, read counts were normalized by Hellinger transformation 58 . Sample-wise comparison was done by calculating the Bray-Curtis dissimilarity between the normalized read counts using the R package Vegan 59 . Further investigation of the annotation of specific viral species was performed by mapping the reads against a redundant set of reference genomes using KMA with default parameters 60 . The maps of global read distribution were created using the continent subdivision from the "rnaturalearthdata" R package and the Köppen-Geiger climate classification 61 .

Data availability
Raw sequence data that support the findings of this study have been deposited in the European Nucleotide Archive with the study accession code PRJEB23496.