Detecting SARS-CoV-2 lineages and mutational load in municipal wastewater and a use-case in the metropolitan area of Thessaloniki, Greece

The COVID-19 pandemic represents an unprecedented global crisis necessitating novel approaches for, amongst others, early detection of emerging variants relating to the evolution and spread of the virus. Recently, the detection of SARS-CoV-2 RNA in wastewater has emerged as a useful tool to monitor the prevalence of the virus in the community. Here, we propose a novel methodology, called lineagespot, for the monitoring of mutations and the detection of SARS-CoV-2 lineages in wastewater samples using next-generation sequencing (NGS). Our proposed method was tested and evaluated using NGS data produced by the sequencing of 14 wastewater samples from the municipality of Thessaloniki, Greece, covering a 6-month period. The results showed the presence of SARS-CoV-2 variants in wastewater data. lineagespot was able to record the evolution and rapid domination of the Alpha variant (B.1.1.7) in the community, and allowed the correlation between the mutations evident through our approach and the mutations observed in patients from the same area and time periods. lineagespot is an open-source tool, implemented in R, and is freely available on GitHub and registered on bio.tools.

wastewater sample (corresponding to the 05-11 February 2021 time period), using NC_045512 as the reference genome, and for which three different variant callers were assessed: (1) freebayes 14 , (2) mpileup 15 and (3) GATK Mutect2 16 (cancer only mode). In terms of parameters, freebayes was used with a low mutation frequency parameter of 0.01, mpileup reported every position (either reference, or mutation), and GATK Mutect2 was used with the default parameters. The full bioinformatic pipeline used to identify the mutations is described in the "Raw data analysis" Section. Additionally, the proposed variant calling pipeline (with all set parameters) is also provided in the [lineagespot GitHub repository] (https:// github. com/ Bioda taAna lysis Group/ linea gespot/ blob/ master/ inst/ scrip ts/ raw-data-analy sis. md).
An example of the output produced by the comparison, is shown in Table 1. In this table the number of common mutations (N t ) between the lineage definition (retrieved either as a Pangolin rule or through the outbreak. info definition) and the mutations of the input dataset is captured for each lineage.
The number of common mutations are compared pairwise for the three variant callers. For each lineage, the absolute values of the differences between the N t metrics are calculated. As an example, for lineage B.1.1.7, the output produced by using the freebayes variant caller tool in the first step of the methodology, returns N t = 10 mutations satisfied, while the output of the GATK Mutect2 tool returns N t = 4 mutations satisfied. As a result, the two outputs exhibit a difference of N freebayes t − N gatk t = 6 ( Table 2). In addition, Table 2 gathers the differences between the number of common mutations across the three variant callers, which are consequently used for their overall comparison ( Figure 1A). In Table 3, a summary of the  www.nature.com/scientificreports/ differences is provided containing the total number of differences for the compared lineages and the maximum N t differences for each pair of files. Based on the above comparison, we consider that the most productive and informative approach is to utilize freebayes as the variant calling tool. The rest of the results shown below, are based only on the freebayes tool output.
Evaluating lineage-specific mutations across time periods. Sensitivity. In order to investigate how specific amino acid substitutions evolve over time, all VCF files were merged into a combined table in www.nature.com/scientificreports/ which all the detected nucleotide mutations along with the corresponding amino acid substitutions that have been identified are stored. Moreover, information about the allele frequency, and the overlapping gene is also provided for each mutation. Table 4 gives an example of the overall information.
Having a structured format, we first examined if the number of reads that were produced during PCR amplification is affecting the number of mutations that are identified in every sample. For this reason, two replicates of the same biological specimen were produced. The first replicate (Replicate A) contained 181,880 reads while the second replicate (Replicate B) contained 69,706 reads. The two replicates were analyzed as described in the "Raw data analysis" section and two VCF files were produced which contained 401 and 293 mutations respectively; of these, 59 mutations were common in both replicates, 342 mutations were unique for Replicate A and 234 unique for Replicate B.
A Student's t-test was performed in order to compare the mean values of the allele frequency between the two replicates and an F test to compare the variances. All tests resulted in p values higher than the 0.05 threshold, meaning that no statistically significant difference was found between the two replicates. On the contrary, when comparing two samples coming from different time periods there was a significant difference on the mutations' allele frequency (p values from Student's t-test and F test were below 0.05).
Moreover, for the same replicates, 6 different pairs of VCF files were generated in order to investigate how the number of mutations located changes while increasing the low mutation frequency parameter of 0.01 that was used at freebayes tool during the raw data analysis. The thresholds that were chosen were 0.01, 0.1, 0.2, 0.3, 0.4 and 0.5. In Fig. 1B, the evolution of the number of mutations found is shown while Fig. 1C gives the corresponding allele frequency.
Mutational load detection. The proposed methodology was applied on the wastewater samples, across 14 time periods. Information regarding the total number of reads, the number of reads mapping to SARS-CoV-2 reference genome etc. are provided in Supplementary Table 1.
To this end, Table 4 was collapsed at the gene and amino acid substitution level, therefore reducing any data noise that is introduced by nucleotide mutations that correspond to the same amino acid change. In Fig Detection of variants of concern/variants of interest. Furthermore, the evolution of lineage-specific mutations over the 14 time periods was studied. To this end, two Variants of Concern were selected; the Alpha (B.1.1.7) and the Beta (B.1.351) variants which are shown in Fig. 3.
Moreover, we examined the prevalence rate of the two variants of concern by calculating the average allele frequency of their mutations. The results for the 14 time periods are presented in Table 5 and show that it does not lead to 100% sum per time period. The latter is caused due to overlapping mutations among the lineages and implies that the average value of all the mutations is not a reliable metric to characterize a specific lineage's presence.
Further in the analysis, we attempted to examine the minimal lineage support per time period. In this step we calculated the average allele frequency of the mutations unique to each variant of concern and produced a table which is more comparable with the clinical data metric. In this context, the minimum allele frequency of the present (greater than zero) mutations indicates the level of our confidence. The results of the two metrics are shown in Table 5.
Assessing lineage assignment. Qualitative assessment between the major lineages found in the targeted clinical samples, retrieved from the ENA study ID: PRJEB44141 (ERP128154), and across the 14 time periods. The clinical samples (specific ENA entries listed in Supplementary Information file) exhibit the lineage distribution shown in Supplementary Table 2.
Based on the clinical results, as derived from targeted and non-randomized sampling of the Thessaloniki area, we can perform a direct comparison between the level of presence in a particular variant of concern (Alpha variant-B.1.1.7) in clinical and wastewater data (Fig. 4). www.nature.com/scientificreports/

Discussion
Analyzing wastewater, i.e. used water that goes through the drainage system to a treatment facility, is a way that researchers and surveillance systems can track pathogens, such as SARS-CoV-2, or biomarkers that are excreted in urine or feces. Monitoring effluents could be a reliable and more effective tool to estimate SARS-CoV-2 spread compared to sampling and testing the population, because wastewater surveillance can account for those who have not been tested and have only mild or no symptoms. Moreover, an effective and reliable methodology able to detect viral load and SARS-CoV-2 variants from municipal wastewater samples could decrease the cost of virus variant detection in the general population based on whole genome sequencing, since only a few samples must be processed and analyzed.
In this manuscript we present and validate a methodology named lineagespot, making use of NGS data, able to detect lineages and mutational load of SARS-CoV-2. The methodology aims to aid the epidemiological system for the monitoring of COVID-19 pandemic in urban areas. www.nature.com/scientificreports/  Table 5. Allele frequency metrics computed for the comparison with the clinical data. To this end the average allele frequency of all mutations, the average allele frequency of the unique mutations and the minimum allele frequency were calculated for each time period. Three metrics were used to quantify the presence of each lineage. Firstly, the "Average allele frequency of the mutation" which is the average allele frequency of all amino acid mutation of a lineage, the "Average allele frequency of the unique mutation" which is the average allele frequency of the unique (no shared with another lineage) amino acids mutations of a lineage, and finally the "Minimum allele frequency of the present (non-zero) unique mutation" which is the minimum allele frequency of the non-zero unique mutation of lineage.  www.nature.com/scientificreports/ The method has been tested in different time point samples taken from the main Municipal Wastewater Treatment Plant of Thessaloniki-Greece, where effluents from approx. 750,000 inhabitants are collected. The lineagespot method demonstrated to be sensitive enough to identify and quantify differences in the mutational load, across various time points and was capable of recording the evolution of the B.1.1.7 lineage in the community. The method is capable of identifying SARS-CoV-2 mutations and characterize the abundance of already defined variants (Alpha, Beta variant etc.) within the same sample, assuming that mutations detected in short reads are within that set of mutations already sequenced. In addition, lineagespot allows the identification of amino acid substitutions which are not linked with any variant of concern but are dominant in particular time periods in a specific area. Moreover, the quantitative data obtained using lineagespot are in accordance with the trends of well-known mutations (such as D614G) in the same period with the overall epidemiological status of the municipal area. The application of lineagespot in such complex samples, like those from Wastewater Treatment Plants, was able to assign lineages and in agreement with the trend of the major lineages detected in the area of Thessaloniki, in the 14 time points by whole genome sequencing of samples from the general population.
Overall, the method developed herein was proven superior compared to other methodologies (Sanger sequencing) 4,7 , since it was more informative and sensitive enough to detect mutations with low frequency and able to assign with good approximation the correct lineage present in the municipality.

Methods
Sampling and isolation. Wastewater samples were collected from the entrance of the main Municipal Wastewater Treatment Plant of the city which accommodates sewerage of about 750,000 inhabitants. Wastewater entering this plant refers exclusively to citizens from urban districts of the city. Typical values of certain physicochemical properties of wastewater samples tested in this study are displayed in Table 6. These properties demonstrate, among others, the existence of suspended solids, dissolved organic matter, dissolved oxygen and salinity that may have strong impact on viral adsorption and decay because of oxidation and increased metabolic activity of microorganisms in wastewater. The residence time of wastewater until the entrance of the Plant is between 2 and 7 h (depending on the area), which is more than enough to allow viral adsorption and decay. Identification of mutations may be hindered by viral adsorption and decay and for this reason the present effort is particularly significant.
Sampling and handling of the wastewater samples were performed according to Petala et al. 3 . Briefly, samples were obtained using a refrigerated autosampler (6712 Teledyne ISCO) programmed to deliver a 24-h composite www.nature.com/scientificreports/ sample by mixing consecutive half-hour samples. Samples were transported to the lab on ice and were processed immediately. Three 50-mL aliquots of each untreated wastewater sample were subjected to centrifugation at 4000×g for 30 min. Afterwards, a composite sample was obtained from supernatants and pH was adjusted to 4 using 2 M HCl solution. Then, three aliquots of 40 mL each, were filtered through respective 0.45-μm pore-size, 47-mm diameter electronegative membranes (HAWP04700; Merck Millipore, Ireland) for Sars-CoV-2 nucleocapsid binding. Each membrane filter was rolled into a Falcon 15-mL conical centrifuge tube with the top side facing inward, and was subjected to RNA extraction real-time RT-PCR and testing for SARS-CoV-2 quantification in a separate facility where no clinical samples are processed to exclude the possibility of contamination. For every round of wastewater samples, a blank negative water control was handled and processed by the same steps through the whole procedure.

RNA extraction and SARS-CoV-2 quantification.
Each electronegative membrane was subjected to RNA extraction process based on a phenol-chloroform-method 17 coupled with magnetic bead binding. The following reagents were added sequentially: (a) 900 μL of guanidinium isothiocyanate-based "Lysis buffer I" [ . After the magnetic separation of beads, the supernatant was removed and the procedure was repeated by adding the remaining 1.65 mL of the mixture. The beads were washed according to the manufacturer's protocol and RNA was eluted in 60 μL buffer. Concentrated RNAs were subjected to real-time RT-PCR testing for SARS-CoV-2 quantification, utilizing the N2 protocol proposed by the Centers for Disease Control and Prevention (CDC) for the diagnosis of COVID-19 in humans (CDC, 2020). The assay was performed on a CFX96 Touch Real-Time PCR Detection System (Bio-Rad Laboratories, Hercules, CA, USA). Calibration curves were generated using the synthetic single-stranded Table 6. Main quality characteristics of wastewater samples.  www.nature.com/scientificreports/ RNA standard "EURM-019" (Joint Research Centre, European Commission) and SARS-CoV-2 viral loads were expressed as genome copies per μL of RNA extract.
Library preparation and sequencing. The targeted sequencing method was applied by preparing 400 nt amplicons using the ARTIC v3 protocol developed by Wellcome Sanger Institute 18 , with some modifications. First, cDNA synthesis was prepared from 10 μL of RNA using Super Script II reverse transcriptase (Invitrogen-Thermo Fisher Scientific, USA) and 50 ng/μL of random primers according to the protocol guidelines. For subsequent cDNA amplification, 2.5 μL of the generated cDNA was used instead of 6 μL, using ARTIC PCR primer pools (v3). Depending on the sequencing platform, libraries were submitted or not to the FS DNA Workflow for Illumina by a fragmentation step for 30 min at 37 °C, using the NEBNext Ultra II FS reagents (NEB#7658). Finally, the NEBNext adaptor (New England Biolabs, US, #7335) was used in the ligation reaction, diluted with adaptor dilution buffer at 10 μΜ final concentration. All purification steps were performed according to the ARTIC protocol. The samples were paired-end, sequenced either on a MiSeq or a NextSeq500 platform (Illumina, USA) with a read length of 2 × 300 and 2 × 150 bp, respectively. Library construction and sequencing was performed at the premises of CERTH INAB. The diagnostic facility at CERTH is an ISO 15189 certified unit. All steps are performed in dedicated physically separated areas with the appropriate measures. All the appropriate controls, both negative and positive ones according to Good Laboratory Practices and the adopted SOPs, have been used during the entire process.
Raw data analysis. The initial phase of the bioinformatics analysis is to produce an alignment of the sequencing reads, while maintaining extremely strict criteria, in order to remove any potential contaminants and/or sequencing errors. The first step is the adaptor removal process, where any adaptors were removed from the raw fastq sequences, with the cleaned reads mapped to the SARS-CoV-2 reference genome (Wuhan variant, NC_045512), using Minimap2 tool 19 with a minimal chaining score (matching bases minus log gap penalty) equal to 40. From this process, only the paired-end sequences were retained, while any other (unmatched, multiple mappings, etc.) were removed. In the next step, two different computational workflows were employed corresponding to the two different sequence lengths of 150-base or 300-base paired-end, used depending on the NGS platform. For the first seven samples and the last three (2 December 2020-18 March 2021 and 16 April 2021-06 May 2021), the primer sequences are excluded using the iVar tool 20 , setting a minimum of 200 length in nucleotides for a read to be retained after trimming, and a minimum threshold for sliding window of 15 quality to pass (width of sliding window equal to 4). The final sequences are then remapped to the same reference genome (minimal chaining score equal to 40). For the four samples between 19 March 2021 and 15 April 2021 primer trimming and remapping to reference genome was not applied, as the updated protocol used did not necessitate this step. Finally, in order to be able to detect low frequency mutations, the freebayes variant caller was used with a low mutation frequency parameter of 0.01. Ultimately, all identified mutations were annotated using the SnpEff tool 21 and the NC_045512.2 (version 5.0) database. The proposed variant calling pipeline (with all set parameters) is provided on the GitHub repository.

Downstream analysis of lineages detection.
In order to identify and assign different SARS-CoV-2 lineages based on the mutations detected from a single wastewater sample, we implemented the proposed methodology in a tool named lineagespot.
The tool accepts as input a VCF file, which contains all nucleotide (and the corresponding amino acid) mutations identified in a sample, along with a file containing all lineage-assignment mutations. After analyzing all inputs, a tab-delimited file (TSV file) is produced containing the identified mutations that are related to SARS-CoV-2 lineages. In addition, we compute average allele frequencies to quantify lineage abundance metrics. Figure 5 shows an overview of the tool's functionalities.
lineagespot is dependent on the source used for retrieving lineage definitions. In our case, the two potential sources are (1) lineage-characteristic mutation profiles pre-calculated by outbreak.info and (2) lineage-characteristic mutation profiles derived from the trained Pangolin models. Both outbreak.info and Pangolin are available resources (https:// cov-linea ges. org/ resou rces. html) for monitoring the SARS-CoV-2 outbreak, and have their data provided from GISAID. In this section we will describe the process for both cases; however, given the fact that the Pangolin definitions are based on a trained machine learning model that is not necessarily interpretable, lineagespot is using by default the first option (i.e. outbreak.info as the source of definitions): i. Outbreak.info as the source of lineage definitions By default, lineagespot uses outbreak.info to retrieve information about lineage assignments. In order to do this, either an API is used, from which information regarding variants of concern or user specified variants and the related amino acid substitutions are retrieved, or user provided tab-delimited files can be given as an input to the tool in case a user is interested in specific mutations or lineages. Based on these inputs lineagespot identifies potential SARS-CoV-2 variants in the input samples and produces metrics to quantify the presence of each lineage. ii. Pangolin as the source of definitions Initially, a vector of all genome positions is created, for which each position is set to be equal to the reference genome nucleotides. Then, every VCF file is read and a set of new rules is formed by setting each position of the file with the reported mutation or multiple mutations (in case there is more than one reported mutation at the same position). It should be noted that positions that have been detected with more than one mutation, should include all of them at the VCF's ALT column in a comma-delimited format. The second phase aims to compare the mutations derived from the VCF file with a lineage's mutations. Specifically, all decision rules are read from the input Pangolin file, and for each lineage, the related rules are compared with the final rule vector.
It should be noted that having pangolin as a source of lineages' definitions relies on the following assumption. Given a group of reads that satisfy a rule A of lineage L, and another group of reads that satisfy rule B from the same lineage L, then the lineage L is incorrectly assigned. As an example, suppose that a group of reads satisfy only the first two rules from lineage B.1.177.17 (28,931! = ' A' , 25,613! = 'G'), and another group of reads satisfy the next two rules from the same lineage (407! = '-' , 22,087! = ' A'). Based on the method description above, lineage B.1.177.17 will be marked as an identified lineage, even though none of the reads satisfy all of the lineage's rules. In order to mitigate this risk, we are taking into consideration a number of different indicators that reflect the number of total rules satisfied, the number of rules that are satisfied based on the detected mutations, and the overall number of reads that support both reference and allele for each of the rules. Three metrics are computed and stored in the output file; the total number of rules leading to the related lineage, the number of rules satisfied