Virus–pathogen interactions improve water quality along the Middle Route of the South-to-North Water Diversion Canal

Bacterial pathogens and viruses are the leading causes of global waterborne diseases. Here, we discovered an interesting natural paradigm of water “self-purification” through virus–pathogen interactions over a 1432 km continuum along the Middle Route of the South-to-North Water Diversion Canal (MR-SNWDC) in China, the largest water transfer project in the world. Due to the extremely low total phosphorus (TP) content (ND-0.02 mg/L) in the MR-SNWDC, the whole canal has experienced long-lasting phosphorus (P) limitation since its operation in 2015. Based on 4443 metagenome-assembled genomes (MAGs) and 40,261 nonredundant viral operational taxonomic units (vOTUs) derived from our recent monitoring campaign, we found that residential viruses experiencing extreme P constraints had to adopt special adaptive strategies by harboring smaller genomes to minimize nucleotide replication, DNA repair, and posttranslational modification costs. With the decreasing P supply downstream, bacterial pathogens showed repressed environmental fitness and growth potential, and a weakened capacity to maintain P acquisition, membrane formation, and ribonucleotide biosynthesis. Consequently, the unique viral predation effects under P limitation, characterized by enhanced viral lytic infections and an increased abundance of ribonucleotide reductase (RNR) genes linked to viral nuclear DNA replication cycles, led to unexpectedly lower health risks from waterborne bacterial pathogens in the downstream water-receiving areas. These findings highlighted the great potential of water self-purification associated with virus–pathogen dynamics for water-quality improvement and sustainable water resource management.

DRAM-v [11] to perform the parallel AMG identification. Genes with M/F flag assignments and auxiliary scores of ≤ 3 were regarded as putative AMGs.
To avoid false positives, only AMGs located between two virus-associated or viral hallmark genes, or located alongside such genes, were selected for further analysis [12].
Phyre2 [13] was applied to identify tertiary protein structures with confidence > 90% and coverage > 70%. PROSITE [14] was used to analyze conserved regions and active sites of putative AMGs based on the PROSITE collection of motifs. Genome maps for AMG-containing viral contigs were visualized based on COG, VIBRANT, VirSorter2, and DRAM-v annotations.
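The AMG selection rules described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the field names `amg_flags`, `auxiliary_score`, and `viral` are hypothetical stand-ins for annotation columns, "M/F flag" is read as a flag string containing M or F, and "alongside" is read as immediately adjacent on the contig.

```python
def is_putative_amg(gene):
    """Score/flag rule: auxiliary score <= 3 and an M or F flag (assumed reading)."""
    flags = gene.get("amg_flags", "")
    return gene["auxiliary_score"] <= 3 and ("M" in flags or "F" in flags)


def select_amgs(contig_genes):
    """Return ids of putative AMGs flanked by, or adjacent to, viral genes.

    contig_genes: list of gene dicts in genomic order along one contig.
    Each dict uses hypothetical keys 'id', 'amg_flags', 'auxiliary_score',
    and 'viral' (True for virus-associated or viral hallmark genes).
    """
    selected = []
    for i, gene in enumerate(contig_genes):
        # Assumption: a gene that is itself viral is not treated as an AMG.
        if gene["viral"] or not is_putative_amg(gene):
            continue
        upstream = any(g["viral"] for g in contig_genes[:i])
        downstream = any(g["viral"] for g in contig_genes[i + 1:])
        adjacent = (i > 0 and contig_genes[i - 1]["viral"]) or (
            i + 1 < len(contig_genes) and contig_genes[i + 1]["viral"]
        )
        if (upstream and downstream) or adjacent:
            selected.append(gene["id"])
    return selected


# Hypothetical contig: g2 and g4 pass both rules; g3 fails the score cutoff.
contig = [
    {"id": "g1", "amg_flags": "V", "auxiliary_score": 1, "viral": True},
    {"id": "g2", "amg_flags": "M", "auxiliary_score": 2, "viral": False},
    {"id": "g3", "amg_flags": "MF", "auxiliary_score": 4, "viral": False},
    {"id": "g4", "amg_flags": "M", "auxiliary_score": 3, "viral": False},
    {"id": "g5", "amg_flags": "V", "auxiliary_score": 1, "viral": True},
]
print(select_amgs(contig))
```

Under this reading, a candidate with viral genes on only one side is kept only when one of them is its direct neighbor.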

Comparisons of viral sequences in the MR-SNWDC and other freshwater ecosystems
Viral contigs with over 90% completeness were selected from the freshwater sources in the IMG/VR database [15] for subsequent viral clustering analysis with vOTUs in the MR-SNWDC. Each reported viral sequence was assigned to a specific ecosystem subtype (lake, lentic, groundwater, sediment, wetlands, river, ice, creek, lotic, pond, and drinking water). The protein sequences predicted by Prodigal v2.6.3 [6] were used for gene-sharing network analysis through vConTACT2 v0.9.19 [16]. DIAMOND [17] was applied to estimate protein-protein similarity. Protein clusters were calculated with the Markov Cluster Algorithm (MCL), and viral clusters (VCs) were subsequently generated using ClusterONE [18].

Relationship between vOTUs in the MR-SNWDC and publicly reported viral sequences in the IMG/VR database
Gene-sharing network analysis was performed to evaluate the relationship between 40,261 vOTUs in the MR-SNWDC and 37,364 viral sequences (>90% completeness) from a broader diversity of freshwater ecosystems in the IMG/VR database [15]. About half of the vOTUs in the MR-SNWDC were assigned to 7,389 viral clusters (VCs) at the genus level, and 68.2% of these VCs included no viruses from any other ecosystem in the IMG/VR database (Fig. S10A). Only 9.1% of identified vOTUs were clustered with publicly reported viruses, suggesting that the MR-SNWDC was an endemic pool of diverse and novel freshwater viruses. Among the 3,670 vOTUs that shared VCs with publicly available viruses, over 85% were clustered with viral sequences from lakes. In addition, about one third of lake-derived viral genera were clustered with vOTUs in the MR-SNWDC, the highest proportion among all freshwater sources (Fig. S10B), which highlighted the role of the Danjiangkou Reservoir (lake-like) in shaping the viral communities across the canal.
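As a quick consistency check on the proportions reported above, the shared fraction can be recomputed from the stated counts. The snippet below is only an arithmetic sketch: the number of VCs without viruses from other ecosystems is back-calculated from the rounded 68.2% figure, so it is approximate.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


total_votus = 40261   # vOTUs in the MR-SNWDC
shared_votus = 3670   # vOTUs sharing VCs with publicly available viruses
total_vcs = 7389      # genus-level viral clusters containing MR-SNWDC vOTUs

shared_pct = percent(shared_votus, total_votus)
# Approximate count implied by the rounded 68.2% value.
endemic_vcs = round(0.682 * total_vcs)

print(f"shared vOTUs: {shared_pct:.1f}%")
print(f"VCs without viruses from other ecosystems: ~{endemic_vcs}")
```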

Fig. S1 Sketch map of the MR-SNWDC. Sampling sites are distributed at 32 monitoring stations along the water canal (see Table S1). The length of the canal (1,432 km) is measured by the sum of the dendritic distances between each pair of adjacent sampling sites, as an indication of canal network density, rather than the straight-line distance between the water source area and the canal end. Sampling campaigns are carried out at the same sites in August 2020 and March 2021.

Fig. S2 Regional similarity of viral communities in autumn (A) and spring (B). Sørensen similarity is calculated from the relative abundances of vOTUs. The width of each curve represents the similarity value between the paired regions. Source data are provided in the Source Data file.
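For readers unfamiliar with the index, an abundance-based similarity of this kind can be sketched as follows. The caption does not state the exact formula, so this assumes the quantitative Sørensen form, 2·Σmin(a, b) / (Σa + Σb), which equals one minus the Bray-Curtis dissimilarity. The function name and the example profiles are hypothetical.

```python
def sorensen_similarity(a, b):
    """Quantitative Sørensen similarity between two abundance profiles.

    a, b: equal-length sequences of non-negative abundances for the same
    ordered set of vOTUs. Returns a value between 0 and 1.
    """
    if len(a) != len(b):
        raise ValueError("profiles must cover the same vOTUs")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("at least one profile must be non-empty")
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 2.0 * shared / total


# Hypothetical relative abundances of three vOTUs in two regions.
region_1 = [0.5, 0.5, 0.0]
region_2 = [0.5, 0.0, 0.5]
print(sorensen_similarity(region_1, region_2))
```

Identical profiles give 1, and profiles with no vOTU in common give 0.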

Fig. S3 Spatiotemporal distribution of bacterial communities in autumn and spring. Nonmetric multidimensional scaling (NMDS) analyses visualize the temporal variation of bacterial β-diversity (A) as well as the distinct partition of bacterial communities into four ecological regions in autumn (B) and spring (C), based on the Bray-Curtis dissimilarity matrix calculated from the relative abundances of prokaryotic MAGs. The stress value denotes the ordination fitness of each NMDS plot. Each group is encircled by an ellipse at the 95% confidence level. One outlier sample (06A) is excluded from subsequent analyses. D The richness of observed bacterial species transported from the water source area (Reg 1) to downstream regions (Reg 2–4) in autumn (upper panel) and spring (lower panel). Source data are provided in the Source Data file.

Fig. S4 Changes in the N:P ratio (molar) along the canal in autumn and spring. The goodness-of-fit R² value and the statistical significance are presented for each linear regression (****: p < 0.0001). Source data are provided in the Source Data file.
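The molar N:P ratio plotted here can be derived from mass concentrations by dividing each by its atomic mass. The sketch below assumes inputs in mg/L of elemental N and P. The function name and the nitrogen value in the example are hypothetical, while 0.02 mg/L is the upper end of the TP range reported for the canal.

```python
N_ATOMIC_MASS = 14.007  # g/mol
P_ATOMIC_MASS = 30.974  # g/mol


def molar_np_ratio(n_mg_per_l, p_mg_per_l):
    """Convert N and P mass concentrations (mg/L) to a molar N:P ratio.

    Returns None when P is zero (e.g. not detected), because the ratio
    is undefined in that case.
    """
    if p_mg_per_l <= 0:
        return None
    n_mmol = n_mg_per_l / N_ATOMIC_MASS
    p_mmol = p_mg_per_l / P_ATOMIC_MASS
    return n_mmol / p_mmol


# Hypothetical nitrogen value of 1.2 mg/L with TP at 0.02 mg/L.
ratio = molar_np_ratio(1.2, 0.02)
print(f"molar N:P = {ratio:.1f}")
```

Because P is more than twice as heavy as N per mole, the molar ratio is roughly 2.2 times the mass ratio.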

Fig. S5 Changes in the N:P ratio (molar) along the main canal during 2015–2021. Source data are provided in the Source Data file.

Fig. S6 Changes in P flux along the canal in autumn and spring (concentration × flow rate, g/s). Each linear regression is annotated with its goodness-of-fit R² value and significance level (****: p < 0.0001). Source data are provided in the Source Data file.
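The flux definition in this caption (concentration × flow rate) works out to g/s directly when concentration is in mg/L and flow is in m³/s, because 1 mg/L equals 1 g/m³. A minimal sketch, assuming those input units; the function name and the flow rate in the example are hypothetical.

```python
def p_flux_g_per_s(tp_mg_per_l, flow_m3_per_s):
    """Phosphorus flux in g/s from TP concentration and volumetric flow.

    1 mg/L = 1 g/m^3, so (g/m^3) * (m^3/s) gives g/s with no extra factor.
    """
    if tp_mg_per_l < 0 or flow_m3_per_s < 0:
        raise ValueError("concentration and flow must be non-negative")
    return tp_mg_per_l * flow_m3_per_s


# Hypothetical flow of 350 m^3/s at the upper TP bound of 0.02 mg/L.
flux = p_flux_g_per_s(0.02, 350.0)
print(f"P flux: {flux:.2f} g/s")
```

With these illustrative inputs the flux is about 7 g/s.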

Fig. S7 Relevance of environmental factors and heterotrophic bacterial communities. A Correlation between environmental factors and heterotrophic bacterial communities in autumn and spring. Pairwise Pearson's coefficients are denoted by color gradients. Edge width represents Mantel's r correlation coefficients. Edge color represents the significance level of the p value based on 999 permutations. B Random forest importance of each environmental factor for heterotrophic bacterial communities in the two seasons. Environmental factors are ranked by their importance index, measured as the increase in node purity. The significance of each environmental factor is shown by asterisks (**: p < 0.01; *: p < 0.05). Source data are provided in the Source Data file.

Fig. S8 Co-occurrence network of environmental factors and MAGs/vOTUs. The size of each dot marking environmental factors is proportional to the number of connections. Dots with different colors denote different modules in the networks. Source data are provided in the Source Data file.

Fig. S9 Molecular properties of viral genomes in the MR-SNWDC and the IMG/VR database. Differences in GC content (A) and specific amino acid frequencies (B) are estimated by the Bonferroni-adjusted Wilcoxon test. The statistical significance is marked by asterisks (****: p ≤ 0.0001). Source data are provided in the Source Data file.

Fig. S10 Comparison of viral species in the MR-SNWDC and the IMG/VR database. A Shared viral clusters (VCs) among different datasets. Viral sequences from 11 freshwater ecosystems are selected from the IMG/VR database. Each source of VCs is defined as a set. The bars on the left represent the total number of VCs in each set. Dots with interconnecting vertical black lines represent the intersections, where black dots represent sets that are within the intersection and unfilled light gray dots represent sets that are not part of the intersection. The bars on the top right represent the number of VCs within each intersection. B Proportion of viral genera from diverse freshwater sources in the IMG/VR database that are clustered with vOTUs in the MR-SNWDC. Source data are provided in the Source Data file.

Fig. S11 Changes in average copy number of bacteria-encoded genes involved in key P-associated metabolic processes along the canal. The goodness-of-fit R² value and the significance level of the p value are presented for each linear regression (****: p ≤ 0.0001). Source data are provided in the Source Data file.

Fig. S12 Dynamics and P-associated functions of bacteria with high-quality genomes (completeness > 90%, contamination < 5%), as well as their relationships with viruses. A Changes in the richness (line charts) and growth potential (pie charts, see Materials and Methods) of bacteria along the canal in autumn and spring. B Changes in average copy number of key bacteria-encoded genes of four metabolic processes associated with P acquisition and utilization. C Virus-host abundance ratios show a notable increase with water flow in both seasons. Each linear regression is annotated with its goodness-of-fit R² value and the significance level of the p value. Source data are provided in the Source Data file.

Fig. S13 Changes in average copy number of functional genes involved in carbohydrate, energy, nitrogen, sulfur, and nucleotide metabolism along the canal in autumn and spring. The average gene copy is normalized by each KEGG pathway/module. Source data are provided in the Source Data file.

Fig. S14 Correlation between abundances of viruses and their hosts for each phylum in autumn and spring. Host phyla linked to relatively fewer viruses are categorized into "Others". Each linear regression is annotated with its goodness-of-fit R² value and the significance level of the p value. Source data are provided in the Source Data file.

Fig. S15 Virus-host abundance ratios (VHR) for each bacterial phylum. Source data are provided in the Source Data file.

Fig. S16 Co-occurrence network of virus-host interactions. Nodes with or without black outlines represent MAGs or vOTUs, respectively. Each edge marks a specific virus-host linkage. The modularity of the network is calculated using the community detection algorithm built into Gephi. The top five modules are shown in different colors. Source data are provided in the Source Data file.

Fig. S18 Outlook on the practical utility of natural bacteriophage therapy in natural aquatic ecosystems compared with the targeted bacteriophage therapy used in the human body. Targeted infections by specific viruses have enabled precise treatment of pathogen-induced human diseases in clinical practice. Future studies are expected to widen the application prospects of natural bacteriophage therapy for eliminating waterborne pathogens.