Introduction

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has caused a worldwide pandemic with severe socio-economic impacts, indicating the importance of monitoring this pandemic at the population level1. SARS-CoV-2 variants may have different characteristics, e.g., in transmission, mortality, and the effectiveness of vaccines, indicating the importance of variant detection at the population level2.

Recently, it has been reported that subvariants of SARS-CoV-2 are escaping neutralizing antibodies, causing risk of reinfection and indicating the importance of variant tracing3. Wastewater-based surveillance of SARS-CoV-2 RNA fragments has been shown to be an effective way to monitor COVID-19 pandemic trends at the population level4. Wastewater-based surveillance has also indicated the potential to detect mutations that have not been detected in clinical samples, and so it could be a valuable complement for clinical surveillance5,6.

SARS-CoV-2 variants have been detected from wastewater by using next-generation sequencing widely over the world, e.g., in Europe7,8,9 as well as in North and South America10,11. Also, various NGS library methods such as enrichment panels12, amplicon panels11,13, and metagenomic shotgun sequencing12 have been utilized. Frequently, part of the genome does not have sufficient coverage, which may be related to wastewater being a complex sample matrix containing environmental factors and PCR inhibitors, possibly leading to the coverage problem14. This and a highly variable number of variant-defining mutations between known SARS-CoV-2 variants makes variant detection in wastewater samples difficult, especially when part of the variant-defining mutations are in the genomic region with low or null coverage.

In this study, we present next-generation sequencing utilizing the ARTIC Network’s primer set, including the bioinformatics pipeline to follow temporal and spatial trends of SARS-CoV-2 variants in Finland. The pipeline outcome includes (1) sequence quality control with parameters of uniformity, coverage, and the number of mapped reads, (2) a probability calculation with a hypergeometric test to evaluate the existence of a variant, even if part of the genome is not covered, and (3) the evaluation of proportions of SARS-CoV-2 variants in wastewater.

Materials and methods

Wastewater samples

For the purpose of variant detection, 24-h composite wastewater samples were collected, as described in Hokajärvi et al.15 and delivered within 24 h to a laboratory for analysis. Sixteen influent wastewater samples from 8 November 2021 to 28 February 2022 were collected weekly, following the standard biosafety precautions for handling untreated wastewater, from Viikinmäki Wastewater Treatment Plant (WWTP) in Helsinki, Finland. Viikinmäki WWTP serves the municipalities of Helsinki, Sipoo, Kerava, Tuusula, Järvenpää, Pornainen, Mäntsälä, and partly also Vantaa, covering a total population of 860 000 inhabitants. The steps of the analysis pipeline starting from wastewater samples collected from the wastewater treatment plant are summarized in Fig. 1.

Figure 1
figure 1

Flowchart presenting the pipeline to identify SARS-CoV-2 variants from wastewater in Finland. The blue boxes present the laboratory tasks and the green boxes bioinformatics tasks.

Extraction of nucleic acids

Nucleic acids were extracted from wastewater, as described in Hokajärvi et al. and Tiwari et al.15,16. In brief, 70 ml of wastewater was pre-centrifugated and supernatants were concentrated with Centricon Plus-70 centrifugal filters (#UFC701008, Merck, Germany) with a concentration time of 25 min in 3000g, producing 240–930 µl of concentrate. The concentrate volume less than 700 µl was normalized to 700 μl for all filtrated supernatants with flow-through wastewater. Nucleic acid was extracted from 300 µl of concentrate using a Chemagic-360D (Perkin-Elmer, MA, USA) instrument with a Chemagic Viral300 DNA/RNA extraction kit (#CMG-1033-S, Perkin-Elmer), and eluted to a 50- or 60-µl kit elution buffer.

Generation of amplicon libraries and next-generation sequencing

Nucleic acids extracted from wastewater samples may contain PCR inhibitors causing poor assay sensitivity, which also need to be taken into account in NGS-based methods14. To maximize the odds of ensuring high-quality sequencing data by minimizing the risk of inhibition, undiluted and diluted (1:5 or 1:10) RNA was used for cDNA synthesis with LunaScript RT SuperMix (#E3010L, New England Biolabs, MA, USA) in 20-µl reaction volume in accordance with the manufacturer’s protocol.

To generate the amplicon sequencing library, multiplex PCR was performed by using ARTIC.v4 (samples from 8 and 21 November 2021) or ARTIC.v4.1 primers (samples for any other date)17. An ARTIC Illumina library constructed by using 15 µl of amplified cDNA was purified with QuickStep™ 2 SOPE Resin and EdgeBio Optima DTR 96-well Plate by Edge Biosystems (Edge Biosystems Inc, CA, USA). Library preparation was performed in accordance with COVID-19 ARTIC v3 Illumina library construction and sequencing protocol v.4. Library preparation was miniaturized to × 0.25 from the original reaction volume. Unique Dual Index UMI oligos by IDT (Integrated DNA Technologies, IA, USA) were used as ligation adapters. Illumina-specific p5 and p7 primers were introduced in library amplification PCR. An equivolume pool was formed from amplified libraries and purified from adapter-dimers using Agencourt AMPure XP SPRI paramagnetic bead chemistry (Beckman Coulter, Indianapolis, IN, USA). The library pool was quantified for sequencing using LabChip GX Touch HT High Sensitivity assay (PerkinElmer, Waltham, MA, USA). Sequencing was performed with the Illumina NovaSeq 6000 system using an SP flow cell with a lane divider (Illumina, San Diego, CA, USA) with paired-end 251 bp reads.

Identification of variant specific mutations

Specific mutations of the variants were identified from a CoV-Spectrum platform (https://cov-spectrum.org) based on the Global Initiative on Sharing All Influenza Data (GISAID)18,19. From CoV-Spectrum, lists of mutations and mutation frequencies in variant sequences of each variant likely to be present in the population—in this case Delta, Omicron BA.1, and Omicron BA.2—were downloaded. Using in-house R-script, mutations that were present in more than 80% of the sequences were identified and then compared to mutations of other variants. Only mutations unique to each variant were left on the mutation reference list.

To use this pipeline in other time frames of interest, a new list of variant specific mutations would need to be generated. In addition, other strategies for generating the reference list can be used, for example including mutations which might affect transmission or disease severity.

Mapping of the reads and quality control

Reads were mapped to reference the SARS-CoV-2 genome (NC_045512.2) with the Burrows-Wheeler Alignment Tool (BWA-MEM)20 using the default settings. Quality control of the raw reads and mapped reads was performed using FastQC and MultiQC21,22. To evaluate the quality of sequencing and mapping to the reference genome, we calculated the on-target coverage percentage of each nucleotide in the sample when the depth of reads was > 100 and > 1, as well as uniformity. Uniformity was calculated thus:

$$Uniformity=\frac{Total\;number\;of\;bases\;with\;coverage\ge 20\%\;of\;mean}{Total\;number\;of\;targeted\;bases}$$

For further analysis, sequencing data from RNA dilution producing the highest uniformity was used.

Identification of mutations from wastewater samples

ARTIC primers were trimmed from the BAM file with iVar using the ivar trim command, using the default settings (min-length 30; min-quality 20; sliding-window-width 4)23. Mutations, deletions, and insertions were identified from the trimmed BAM file with ivar variants by using the default settings (minimum quality 20; minimum frequency threshold 0.03, minimum depth 0)23,24.

By means of iVar output filing, mutations in the samples were divided into three categories:

  • SARS-CoV-2 detected in the sample: > 20 reads were mapped across the reference genome (NC_045512.2).

  • Mutation in SARS-CoV-2 in a specific locus can be detected: > 20 read mapped mutation reads per nucleotide25.

  • Allelic frequency of the SARS-CoV-2 variants in the sample can be calculated: > 20 mutation reads, coverage > 100 mapped reads per nucleotide, and the p-value of a Fisher’s exact test from iVar for mutations < 0.05, which ensure that allelic frequency in a given position is higher than the mean error rate23.

Since the capability to detect mutations in the sample may vary in accordance with sequencing coverage and uniformity of the reads across the genome, and so all the nucleotides of the genome may not be sequenced as high coverage, a hypergeometric test was applied to justify the existence of the variant in the sample. The hypergeometric test was performed using the phyper function in R (version 4.3.1), utilizing the total number of mutations in the sample, the total number of variant specific mutations found in the sample, the total number of variant specific mutations in the reference, and the total number of nonspecific mutations found in the sample. Based on the results, the variants with the p-value < 0.05 were considered detected.

Proportions of variants in wastewater

To define proportions of the SARS-CoV-2 variants in the sample, average and standard deviation from the nucleotide allelic frequencies were calculated from the samples and variant specific nucleotide mutations, which fulfilled the allelic frequency calculation criteria described above.

Results

SARS-CoV-2 genomic coverage of sequenced samples

Variation in the genomic coverage, uniformity, and number of mapped reads between sequenced samples was observed (Fig. 2). The median of genomic coverage when the depth was > 100 reads per nucleotide was 92.0% (range 66.3–97.1%), the mean of uniformity was 85.5% (range 72.2–90.9%), and the mean of mapped reads 1.916 M (range 0.032–5.403 M) when the highest coverage from two sample dilutions of the same sample was used to identify variants from the given sample. As genomic coverage in some of the samples was < 100%, not all potential mutations could be reliably identified (Table 1, Fig. 2).

Figure 2
figure 2

Coverage plots of one sample with two RNA dilution before cDNA synthesis. On Y-axis depth and X-axis SARS-CoV-2 genomic location. (A) Sample from 31 January 2022 in which undiluted RNA was used for cDNA synthesis. Coverage 90.0%, uniformity 85.5%, and 5.40 M mapped reads. (B) Sample from 31 January 2022 in which 1:5 diluted RNA was used for cDNA synthesis. Coverage 54.3%, uniformity 51.9%, and 1.18 M mapped reads (solid line) indicated a depth of 100 and dashed-line median of the coverage. As coverage in some parts of the genome was less than 100× (area below the black line), the proportions of mutations could not be reliably detected.

Table 1 Quality parameters of Artic amplicon sequencing and mapping of the reads to SARS-CoV-2 reference genome. The best result by coverage from two sequenced nucleic acid dilutions is presented in the table. Data from Viikinmäki Wastewater Treatment Plant, Helsinki, Finland.

Identification of variant specific mutations from CoV-Spectrum

SARS-CoV-2 variant-specific mutations were identified from the CoV-Spectrum platform (cov-spectrum.org [accessed 24.03.2023]), which is based on sequences of SARS-CoV-2 clinical samples in GISAID18. Variant specific mutations were used to recognize SARS-CoV-2 variants from wastewater. To use this pipeline in other time frames where some other variants are present, the list of reference needs to be updated accordingly.

For Delta B.1.617.2, 33 variant specific nucleotide mutations were identified. In total, 21 of these were nucleotide substitutions, of which 18 led to change of amino acid, two were synonymous, and one was an intergenic mutation. A total of 12 nucleotide deletions were identified, from which 11 led to the deletion of amino acids D119-, F120-, E156-, and F157-, and one nucleotide deletion was identified as an amino acid mutation (Table 2).

Table 2 Variant specific mutations identified for Delta B.1.617.2, Omicron BA.1, and Omicron BA.2 variants of the SARS-CoV-2 CoV-Spectrum platform, used to identify variants from wastewater samples.

For Omicron BA.1, a total of 37 variant-specific nucleotide mutations were identified. A total of 16 of these were nucleotide substitutions, of which 12 led to the mutation of amino acid, and four were synonymous mutations. The rest (n = 21) were nucleotide deletions, and 18 of these were mutations leading to amino acid deletion, while three were associated with amino acid change (Table 2).

A total of 53 Omicron BA.2 variant-specific nucleotide mutations were identified. Of these, 24 were nucleotide substitutions, of which 17 led to amino acid mutation, and seven were synonymous mutations. Twenty-nine nucleotide deletions led to amino acid deletion and one to amino acid change. A total of 17 were intergenic mutations (Table 2).

Detected SARS-CoV-2 variants in wastewater and their proportions

Based on the detected variant specific mutations, Delta B.1.617.2 was found in wastewater samples from the start of the current study (8.11.2021) until 17.1.2022, according to a hypergeometric test (p < 0.05, Table 3). Delta B.1.617.2 was the most dominant variant in the Helsinki wastewater for the whole year (2022) until 20.12.2022 when Omicron BA.1 became the variant with the highest proportion (p < 0.05) (Fig. 3). Omicron BA.1 was observed for the first time in the influent wastewater of Viikinmäki WWTP jn Helsinki in a composite sample taken on 12–13.12.2021. Omicron BA.2 was observed for the first time on 24.1.2022 (p < 0.05). Both Omicron BA.1 and BA.2 were observed until the end of this study period (28.2.2022). The Omicron BA.2 variant became more common on 28.2.2022 than Omicron BA.1.

Table 3 Detected SARS-CoV-2 mutations found in Helsinki, Viikinmäki WWTP wastewater samples between November 2021 and February 2022. The P-value of hypergeometric distribution when comparing found mutations to known mutations of the variants in the sample. When the p-value of the hypergeometric test was < 0.05, the variant was considered found. Underlining represents synonymous nucleotide mutations.
Figure 3
figure 3

Proportions of the Delta B.1.617.2, Omicron BA.1, and Omicron BA.2 variants in Helsinki wastewater, 8.11.2021 to 28.2.2022. Delta B.1.617.2 was the most common variant until 13.12.2021. Omicron BA.1 became the most common on 20.12.2021. On 28.2.2022, Omicron BA.2 became the most common. The standard deviation of the mutation allelic frequency is in the lighter shaded area.

Notably, the Delta variant was detected by hypergeometric tests in the wastewater sample collected on 8.11.2022, even if only 22 of the 33 variant specific mutations were found (p < 0.05). This may be related to 76.9% uniformity and 89.0% coverage, indicating that part of the genome does not have enough reads to assess Delta mutations. This indicates that a hypergeometric test is able to justify the existence of the variant in the sample, even with partial genome coverage (Tables 1 and 3). Interestingly, 13 samples out of 16 contained mutations that were not identified to Delta, Omicron BA.1, or Omicron BA.2 variants (Table 3), indicating that method also identified mutations that are not included in the list of variant specific mutations (Table 1).

Discussion

The present study introduces a complete protocol to identify variants of the SARS-CoV-2 virus and its proportions in wastewater samples. The protocol was tested with 24 h composite samples from influent derived from Helsinki Viikinmäki Wastewater Treatment Plant, Finland. The composite wastewater samples were concentrated with ultrafiltration columns prior to nucleic acid extraction followed by cDNA synthesis. cDNA was used to generate the ARTIC amplicon library, which was then sequenced with Illumina NGS platform. Bioinformatics of NGS data contained the mapping of reads to the SARS-CoV-2 reference genome with BWA-MEM, following primer trimming, detection of mutations, and calculation of allelic frequencies with iVar following quality control along with uniformity, coverage, and the number of mapped reads. Finally, the existence of the variant was evaluated with hypergeometric distribution following the calculation of variant proportions in the wastewater sample. The protocol presented here was used in Finland to assess virus variants and their proportions in the population, and could also be used to evaluate variants in wastewater elsewhere.

To identify the variant specific mutations, we utilized the CoV-Spectrum platform which contains the percentage proportion of amino acid and nucleotide mutations of the clinical sample sequences submitted to database18. The CoV-Spectrum has been used as a source to identify variant specific mutations in order to compare clinical and wastewater data and the identification of variants from wastewater studies26,27,28. This way, we could systematically identify variant specific mutations and use them to recognize SARS-CoV-2 variants from wastewater sequencing data in the time frame of interest and update list of specific mutations to recognize other variants. To extend the pipeline to follow other properties of the virus in the population, mutations which might affect transmission or disease severity could be included in the reference list.

As wastewater is a difficult sample matrix affected by variations in the amount and composition of environmental factors and PCR inhibitors, this results in a situation where part of the target genome may not have sufficient coverage14. As the data presented herein shows, the whole genome may not be covered, even when uniformity percentage and genome coverage are at a sufficient level. Our strategy of using two RNA dilutions before cDNA synthesis from the same sample resulted in over 90% genomic coverage and over 85% uniformity on average, indicating a high success rate but also gaps in the sequencing data. In the optimal case, when sequencing reads are evenly distributed to the target genome and have at least 100× coverage, it should be possible to detect a virus proportion of about 3–6%23. To evaluate the quality of sequencing, we utilized genome coverage percentage and uniformity as quality parameters. The use of one of these parameters alone could lead to a situation where the sample has sufficient mapped reads, but they are unevenly distributed; or reads are evenly distributed but the coverage is too low to identify mutations and evaluate the allelic frequency.

Similar to our study, many others have used iVar for primer clipping and identification of nucleotide mutations in SARS-CoV-2 wastewater studies9,23,29,30,31. Identified mutations in the samples were systematically categorized into three categories in accordance with the number of mapped reads, sequencing depth, and iVar p-value to evaluate their capability to identify variants and calculate variant proportions in the samples. For the detection of mutations, we used the threshold of 20 mapped reads, as suggested by World Health Organization instructions for clinical samples to detect mutations25. Since wastewater samples might contain more diverse mutations than clinical samples5, and some errors in sequencing are possible, the use of any lower threshold might lead to the detection of erroneous mutations. However, to calculate allelic frequency and later in evaluating proportions of the variant, we used > 100 reads per mutation as a coverage threshold to ensure sufficient coverage and reliable value for allelic frequency, as previously described by Rios et al.9. This coverage should be enough to detect a frequency of about 10%. However, higher coverage would reduce variation in the estimation of allelic frequency23.

Evaluation of the presence of variants may be difficult if there are gaps in genome coverage and a variation in the number of variant specific mutations, which have been recognized as challenges in variant detection from wastewater5. To evaluate the presence of variants in the sample, we used a hypergeometric test, which has been previously used widely in bioinformatics analysis to evaluate, e.g., enriched pathways in gene expression data32,33. With this strategy, the evaluation of the presence of variants in the wastewater sample is feasible, even if the sequencing data does not cover all genome areas.

During the period of the study, Delta B.1.617.2, Omicron BA.1, and Omicron BA.2 variants were identified from wastewater influent of Helsinki wastewater. Overall, identified variants and their proportions show positive agreement with clinical samples in the Helsinki area, indicating that this pipeline follows temporal and spatial variation of variants at the population level when visually comparing trends34. We also found some mutations that were not on the list of variant-specific mutations. These mutations may be associated with the founder effect of some variant in Finland, or represent traits of new variants in the population35. Also, those mutations may be shared with two or more variants, or are mutations which have frequency below 80% and are thus not includedin the list of variant-specific mutations. This indicates that the limitation of our strategy is the identification of novel variants, since variants are now recognized by predefined mutations. Discovering novel variants from wastewater may also be challenging due to several virus variants/types in sample and short sequencing reads5,12,35.

To conclude, the pipeline presented herein is suitable for detecting variants of SARS-CoV-2 from wastewater samples to follow spatial and temporal trends in the population. Also, this pipeline could be easily modified to detect variants of some other pathogen, e.g., influenza or the RS virus, when novel amplicon panels are designed with an appropriate tool36 and by replacing reference genome and primer clipping files of this protocol with the corresponding primer panel in use.