Prokaryotic viruses impact functional microorganisms in nutrient removal and carbon cycle in wastewater treatment plants

As one of the largest biotechnological applications, activated sludge (AS) systems in wastewater treatment plants (WWTPs) harbor enormous viruses, with 10-1,000-fold higher concentrations than in natural environments. However, the compositional variation and host-connections of AS viruses remain poorly explored. Here, we report a catalogue of ~50,000 prokaryotic viruses from six WWTPs, increasing the number of described viral species of AS by 23-fold, and showing the very high viral diversity which is largely unknown (98.4-99.6% of total viral contigs). Most viral genera are represented in more than one AS system with 53 identified across all. Viral infection widely spans 8 archaeal and 58 bacterial phyla, linking viruses with aerobic/anaerobic heterotrophs, and other functional microorganisms controlling nitrogen/phosphorous removal. Notably, Mycobacterium, notorious for causing AS foaming, is associated with 402 viral genera. Our findings expand the current AS virus catalogue and provide reference for the phage treatment to control undesired microorganisms in WWTPs.

1. The AS sample from each wastewater treatment plant essentially represents large pooled environmental sources (municipality or region). The authors used de novo assembly to generate contigs for viral classification. De novo assembly is prone to creating artificial chimeric contigs that do not exist, particularly when the metagenomic dataset is from pooled environmental sources. Most of the analyses are based on the credibility of these contigs. This may also result in an overestimation the data. Were assemblies assessed for chimeras or validated? 2. The 16S rRNA amplicon sequencing (line 109) seems abrupt. The rationale for this experiment is unclear. 16S analysis methodology is not described and 16S bacterial community data presented is fairly superficial (PCoA, Fig 1D). Authors describe that 16S rRNA sequencing data was used to determine phylum abundance in Line 159, referencing Supplementary Table 2. However, Supplementary Table 2 provided for review does not match that -" Table S2. Host prediction for shared viral genera between Sample AS and IMG/VR AS (phylum level)". It seems that the authors use the 16S data as a validation to the CRISPR-Cas host identification approach (line 157). Therefore, it would be important to show this data to support their claim. I was very pleased to have the opportunity to review this manuscript and to read of this description of the viral diversity in 6 Activated Sludge plants in Hong Kong. The editors will hopefully be aware that these seemingly prosaic environments have been critical test beds in microbial ecology and this is a paper very much in that tradition.
The paper contributes to the growing realisation that viruses are much much more important than microbial ecologists have perhaps hitherto accepted. The ongoing revolution in genomics is part of the story: the discovery of prophage and CRISPR sequences in many genomes are a clear indication that the viruses are impinging on the ecology of a wide variety of organisms. This paper, and other papers of this nature are starting to complement this picture by assembling sequences of viral DNA into plausible genomes. There is inevitably a degree of pure observation work of this kind. But, to the authors credit, this manuscript goes further by linking the viruses to the CRISPR sequence in known genomes. This is not a perfect approach, to linking predator and prey, but it is certainly a start and a step in the right direction. However, I was wondering if the authors had considered the possibility that the sequences they are detecting are conserved in viruses and are not necessarily unique to the viruses they have sequenced? There are also naturally some limitations, some of which the authors are aware of and highlight in the manuscript: the use of 0.2 micrometre filters and DNA sequencing natural confines their observations to small DNA viruses.

Responses:
Thanks for the valuable comments. We agree that there are some limitations as we described in our manuscript.
However, the authors are making a common error in metagenomics of equating abundance with proportional abundance. Whilst this is not an egregious error in the context of this manuscript, it is an insidious one for the field as a lack of quantitative tools can make progress difficult.

Responses:
Thanks for the valuable comments. We agree with the reviewer that absolute quantification approach should be developed in this field. In this study, we have changed all the expressions of "abundance" into "relative abundance" in our revised manuscript (Line 106,112,152).
A case in point is that in this manuscript the authors have perhaps overlooked the fact that a significant fraction of the viruses in the mixed liquor will have come in with the wastewater and thus may be of little or no ecological significance in the reactor. This might account for the presence of viruses thought to prey on methanogens or sulphate reducing bacteria. Line 112 "from viral abundance data" There is no viral abundance data in this paper. Only proportional abundance. Both data formats have their uses, but one should not confuse them.

Responses:
Thanks for the valuable comments. We revised all the word "abundance" to "relative abundance" to avoid confusion as suggested (Line 106,112,152).
Line 116 "high abundance" There is no viral abundance data in this paper.

Responses:
Thanks for the valuable comments. We revised all the word "abundance" to "relative abundance" as suggested (Line 106,112,152).
Line 131 "Overall, these results suggest that the virome across WWTPs consists of many shared genera. The lack of detection of specific genera in one AS virome may be primarily due to either detection limit or temporal dynamics. Hence, it is possible that AS common viromes may be more diverse than those observed in this study." I am not 100% sure that I understand this statement. But as we expect different plant to share the same bacterial taxa, surely expect them to share some viral taxa as well?

Response to comments of Reviewer 2
Reviewer #2 (Remarks to the Author): Chen et al present a study of prokaryotic viruses of activated sludge samples from 6 wastewater treatment plants in Hong Kong collected in 2018. They performed metagenomic sequencing and analyzed the virome using de novo assembled contigs. The authors identified 50,037 viral contigs that would be a substantial addition to existing IMG/VR database. They compared communities using viral genera and found that viral genera were largely shared across the wastewater treatment plants. The authors predict host-association by querying CRISPR-Cas spacer sequences and in turn, infer community functions indirectly or directly through auxiliary metabolic genes. There are a few major points that need to be addressed.

Responses: Thanks for the general comments.
Major comments: 1. The AS sample from each wastewater treatment plant essentially represents large pooled environmental sources (municipality or region). The authors used de novo assembly to generate contigs for viral classification. De novo assembly is prone to creating artificial chimeric contigs that do not exist, particularly when the metagenomic dataset is from pooled environmental sources. Most of the analyses are based on the credibility of these contigs.
This may also result in an over-estimation the data. Were assemblies assessed for chimeras or validated? 2. The 16S rRNA amplicon sequencing (line 109) seems abrupt. The rationale for this experiment is unclear. 16S analysis methodology is not described and 16S bacterial community data presented is fairly superficial (PCoA, Fig 1D). Authors describe that 16S rRNA sequencing data was used to determine phylum abundance in Line 159, referencing Supplementary   There are some inconsistencies in the manuscript where the authors loosely use "reads" and "contigs" interchangeably which makes it confusing. They are quite different in terms of data.
For example, Line 90: "Comparison with NCBI RefSeq viral genome database showed that across the six AS systems, only between 0.4-1.6% of total viral reads could be assigned to a known viral family". Is this 0.4-1.6% of the total viral contigs? Methods do not describe reads to RefSeq. "Reads were then mapped to contigs by CLC Genomics Workbench with length fraction 0.8 and similarity fraction 0.9 to obtain contig coverage." I was very grateful for the opportunity to reevaluate this manuscript in the light of the comments that I and other reviewers have made. I am delighted to see that the authors have taken these comments to heart and addressed my queries in a cogent and comprehensive manner. I am sorry to have taken a little while to communicate this good opinion.

Responses
Reviewer #2 (Remarks to the Author): Authors have made substantial clarifications that improved the manuscript. Regarding the major comments from the first review, most have been addressed sufficiently except for one remaining major concern: 1. Regarding host assignment issues (previous major comment #3), is the basis for predicting connections between the virus to metabolic functions ("Viruses may impact key functional microorganisms in WWTPS section, beginning from new line 180). Authors provided clarification to the approach that it is handled in a fairly simplistic way (counting both each -new lines 188-194). Reviewer notes that while this is not "inaccurate"; however, this approach is not particularly innovative either. No molecular validation was provided to substantiate the multi-host interactions. This could potentially lead to over-estimation (as in line 198, 204, 214). Hence, the connections noted by the authors could be considered speculative ( Figure 4). In this section, the authors used words such as "may", "might", "suggest" throughout which is in line with the descriptive nature. Whereas reviewer/readers are expecting more substantial insights into these interactions. Overall, this is the weakest section of the manuscript in terms of impact.
Comments that authors have resolved: The in silico validation using mock communities (in response to previous major comment #1), was assessed thoroughly. Reviewer finds the data convincing. Major comment #1 is resolved.
(Previous major comment #2) Regarding 16S data. Authors removed that section in this revision, which reviewer agrees is appropriate. Major comment #2 is resolved.
(Previous major comment #4) Regarding site-specific contigs. Authors clarified the methodology. Major comment #4 is resolved.
Reviewer #3 (Remarks to the Author): I was contacted to review only the quality of the CRISPR based host prediction.
I think that the approach seems correct, but there is no statistical assessment of spacer matches to viruses. In other words, how many hits would be expected randomly, using the same approach, taking into consideration the size and number of all sequences? How does this number compare to the actual number of predicted pairs? I think this information is important to assess the results.
One way of doing that could be the e-value threshold in the blastn searches. The e-value threshold multiplied by the number of queries could give an estimation of the virus-host pairs expected only by chance.
As you use different requirements apart from an e-value, you could be interested in a different approach. A good way is to create databases with the same number and sizes of your sequences but created randomly. And then use the same criteria for the selection of hits.
This expected number of random pairs must be taking into account for certain conclusions, i.e. is 135 viruses infecting different domains of life far from expected randomly? Of course, it is possible to make two sets (or more) of requirements (one set more restrictive than the other) that can help to support or discard the predictions made. Also, it is possible to calculate, for instance, the expected number of random hits of the subset of archaeal spacers to the subset of viruses predicted to infect bacteria.
Another issue that I would recommend is to turn on the blast filter for redundant sequences. If this filter was used, which is probably true if you only obtained unique matches to spacers, describe it in the methods.
When considering the effect of viruses in hosts with a particular metabolism, it could also be useful to compare the number of spacers in hosts with the given metabolism. That could allow the evaluation of metabolic pathways that are usually not associated with many CRISPR spacers. Or the opposite, find that some metabolisms are associated with a disproportionally large number of unique spacers. Responses: Thanks for the positive comments.

Response to comments of Reviewer 2
Reviewer #2 (Remarks to the Author): Authors have made substantial clarifications that improved the manuscript. Regarding the major comments from the first review, most have been addressed sufficiently except for one remaining major concern: 1. Regarding host assignment issues (previous major comment #3), is the basis for predicting connections between the virus to metabolic functions ("Viruses may impact key functional microorganisms in WWTPS section, beginning from new line 180). Authors provided clarification to the approach that it is handled in a fairly simplistic way (counting both each -new lines 188-194). Reviewer notes that while this is not "inaccurate"; however, this approach is not particularly innovative either. No molecular validation was provided to substantiate the multi-host interactions. This could potentially lead to over-estimation (as in line 198, 204, 214).
Hence, the connections noted by the authors could be considered speculative (Figure 4). In this section, the authors used words such as "may", "might", "suggest" throughout which is in line with the descriptive nature. Whereas reviewer/readers are expecting more substantial insights into these interactions. Overall, this is the weakest section of the manuscript in terms of impact.
Responses: Thanks for the valuable comments.
We acknowledge that over-estimation of virus-host interactions might happen in our study and possibility of the over-estimation could be further assessed in another research, using those emerging methods to verify the virus-host relationships, including single-cell genome sequencing and chromosome conformation capture method like Hi-C. We hope that the reviewer recognizes that this work is beyond the current manuscript as it focuses on documenting the extensive diversity of the viruses in WWTPs.
Nonetheless, we want to provide a putative virus-host connections catalog for the potential application of phage treatment in WWTPs. Without this catalog, it would still be hard to deconvolute Hi-C data.
We have added the limitations of our work and some future possible developments in the manuscript (Line 334 to 337): It will be important to supplement this approach with culturing-based approaches, single-cell genome sequencing or chromosome conformation capture method to experimentally verify virus-host links as well as to identify hosts that do not possess CRISPR-Cas defense systems.

Comments that authors have resolved:
The in silico validation using mock communities (in response to previous major comment #1), was assessed thoroughly. Reviewer finds the data convincing. Major comment #1 is resolved.
Responses: Thanks for the positive comments.
(Previous major comment #2) Regarding 16S data. Authors removed that section in this revision, which reviewer agrees is appropriate. Major comment #2 is resolved.
Responses: Thanks for the positive comments.
(Previous major comment #4) Regarding site-specific contigs. Authors clarified the methodology. Major comment #4 is resolved.
Responses: Thanks for the positive comments.

Response to comments of Reviewer 3
Reviewer #3 (Remarks to the Author): I was contacted to review only the quality of the CRISPR based host prediction.
I think that the approach seems correct, but there is no statistical assessment of spacer matches to viruses. In other words, how many hits would be expected randomly, using the same approach, taking into consideration the size and number of all sequences? How does this number compare to the actual number of predicted pairs? I think this information is important to assess the results.
One way of doing that could be the e-value threshold in the blastn searches. The e-value threshold multiplied by the number of queries could give an estimation of the virus-host pairs expected only by chance.
Responses: Thanks for the valuable comments.
As for the e-value, actually we used the blastn default e-value 10. From NCBI website (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYP E=FAQ), it is said that virtually identical short alignments have relatively high E values, which is exactly the case in our study since we use blastn-short function.
We set the coverage to be stable as 90%, identity as 97%, mismatch as 1 and analyzed different e-values (from 1e-10 to 10). Then we tested the precision and recall for the NCBI Refseq Prokaryotic virus dataset (V91). Our results showed that stricter e-value parameter will not largely improve the precision rate (from 93.4% to 98.6%), but it will significantly decrease the virus recall (from 22.4% to 13.5%) in the test dataset. For example, in the real ST1812 sample, if we lower the e-value to 1e-10, we could only get 9 matched host genera compared with 414 host genera found at default e-value. Therefore, considering the trade-off between precision and recall, we mainly controlled other parameters like identity, coverage and mismatch.
We have added the precision and recall results regarding e-value, precision, coverage parameters for the NCBI Refseq Prokaryotic virus dataset (V91) in Supplementary Table 7.
As you use different requirements apart from an e-value, you could be interested in a different approach. A good way is to create databases with the same number and sizes of your sequences but created randomly. And then use the same criteria for the selection of hits.
This expected number of random pairs must be taking into account for certain conclusions, i.e. is 135 viruses infecting different domains of life far from expected randomly? Of course, it is possible to make two sets (or more) of requirements (one set more restrictive than the other) that can help to support or discard the predictions made. Also, it is possible to calculate, for instance, the expected number of random hits of the subset of archaeal spacers to the subset of viruses predicted to infect bacteria.
Responses: Thanks for the valuable comments.
Following your suggestion, we have created a database randomly using RSAT-random sequence (http://rsat.sb-roscoff.fr/random-seq_form.cgi) which contain the same number (N=50037) and the same size (~614MB) of our total viral sequences. Then we used the same criteria to select best hit. Results showed that 340/50037 (0.68%) random DNA sequences have hits to our curated CRISPR-Cas spacer database. However, none of them could link spacers from different domains of life.
Therefore, our conclusion about 135 viruses infecting different domains of life is unlikely to happen simply by chance.
We have added this part in the methods (Line 476-481): We have created a database randomly using RSAT-random sequence (http://rsat.sbroscoff.fr/random-seq_form.cgi) which contain the same number (N=50037) and the same size (~614MB) of our total viral sequences. Then we used the same criteria to select best hit. Results showed that 340/50037 (0.68%) random DNA sequences have hits to our curated CRISPR-Cas spacer database. However, none of them could link spacers from different domains of life.
Another issue that I would recommend is to turn on the blast filter for redundant sequences. If this filter was used, which is probably true if you only obtained unique matches to spacers, describe it in the methods.

Responses: Thanks for the valuable comments.
Yes, we turned on the blast filter. Following your suggestion, we have described it in the methods (Line 470-472): Viral contigs identified for all six WWTPs were searched against manually curated CRISPR-Cas spacer database using BLASTn-short to link viruses to their hosts with 97% identity, 90% coverage ,1 mismatch and 1 maximum target sequence.
When considering the effect of viruses in hosts with a particular metabolism, it could also be useful to compare the number of spacers in hosts with the given metabolism. That could allow the evaluation of metabolic pathways that are usually not associated with many CRISPR spacers. Or the opposite, find that some metabolisms are associated with a disproportionally large number of unique spacers.

Responses: Thanks for the valuable comments.
We have checked the number of spacers in all Midas genera (functional microorganisms in WWTPs) in our study. Results showed that these Midas genera contain on average 44 spacers in their genomes. Notably, for those genera that have more than 10 reference genomes in our database, Methanosarcina, Sorangium, Desulfotomaculum and Methanoculleus have more than 100 spacers in their genomes, indicating their strong potential virus-host interactions (Supplementary Table 8).
We have added this part in the revised manuscript (Line 226-231) and Supplementary Table 8: We have also checked the number of spacers in all Midas genera (functional microorganisms in WWTPs) in our study. Results showed that these Midas genera contain on average 44 spacers in their genome. Notably, for those genera that have more than 10 reference genomes in our curated database, Methanosarcina, Sorangium, Desulfotomaculum and Methanoculleus have more than 100 spacers in their genomes, indicating their strong potential virus-host interactions (Supplementary Table 8).
Lines 462-463: Can more details about this curated database be explained? Did the authors use another not cited database or construct a new one? What were the general criteria for curating CRISPR-Cas spacers? What microorganisms were included in the database? Does the database represent all prokaryotic diversity or just WWTP genomes? How many organisms were analyzed for the database? How many of those organisms have CRISPR spacers?
Responses: Thanks for your question which we need to further explain.

Did the authors use another not cited database or construct a new one?
We have constructed a new database on our own.

What were the general criteria for curating CRISPR-Cas spacers?
We used the default parameter of CRT 1 to predict CRISPR-Cas spacers for all 190078 prokaryotic genomes. For CRISPRs with Cas genes, CRT could achieve 99% precision and 99% recall.
What microorganisms were included in the database? Does the database represent all prokaryotic diversity or just WWTP genomes? How many organisms were analyzed for the database?
We downloaded all assembled bacterial and archaeal genomes (N=190078) from NCBI assembly website (https://www.ncbi.nlm.nih.gov/assembly), so we think this database could represent all prokaryotic diversity.

How many of those organisms have CRISPR spacers?
Of 190078 genomes, 90858 (47.8%) could be predicted by CRT to contain CRISPR spacers.
Can more details about this curated database be explained?
We have added these details in the revised manuscript to give clear information (Line 458 to 467): All bacterial and archaeal assembled genomes (N=190,078) from NCBI Assembly (https://www.ncbi.nlm.nih.gov/assembly) were retrieved to manually curate a CRISPR-Cas spacer database predicted by CRISPR Recognition Tool (CRT). Of 190078 genomes, 90858 (47.8%) could be predicted by CRT to contain CRISPR spacers. BLASTn-short was performed between reference viral genomes and spacers. Different coverage, identity and e-value were tested to achieve a balance between recall percentage and precision percentage. Finally, coverage 90%, identity 97%, mismatch 1 and default e-value could result in precision of 93.4% and recall of 22.4% (Supplementary Table 7). These parameters were applied for all the datasets in our study.
In the prior round of comments to the authors, Reviewer's only concern was the descriptive nature of the virus-host interactions. As explained, there are concerns of over-estimation in these data from wastewater treatment plant samples. No molecular validation was provided to substantiate the virus-host interactions. In this revision, Authors have decided to address this concern by stating that this is a limitation of their study in the Discussion (lines 334-337). No further new analyses/experiments were done to address this concern.
In the Reviewer Assessment, Reviewer is asked to evaluate: "What are the major claims of the paper? Are they novel and will they be of interest to others in the community and the wider field?" The reporting of 50,037 viral contigs from environmental sampling is interesting, but not substantive by itself. The potentially novel insights from virus-host interactions would have clearly elevated this manuscript to be "of interest to others in the community and the wider field". Reviewer also considered whether the methodology could justify. The overall approach is standard, but not particularly novel -to be clear, this is not a major critique of the study.
Based on these, Reviewer's opinion is that the current form of the manuscript may be of limited interest "to others in the community and the wider field".
Reviewer #3 (Remarks to the Author): The authors have made good progress towards heeding my suggestions, however certain small issues remain.
Lines 470-472. Contrary to what the response to the review says, there is no mention of having turned on the blast filter for redundant sequences.
Lines 266-231. My comment was about how the biases due to an uneven distribution of CRISPR in prokaryotic groups can affect the results. The abundance of CRISPR spacers doesn´t necessarily correlate with how often prokaryotes are attacked by viruses or the number of different viruses attacking the prokaryote. Therefore, I don´t think that the phrase "strong potential virus-host interactions" is as adequate, instead there is a higher probability to detect virus-host interactions. If we applied this to particular metabolisms, for some it will be easier to find corresponding viruses. Yet, it is possible to find many viruses targeting a given metabolism and still have a lot of un-affected strains. The opposite can also be true.
Line 461: This is the place to include, about CRT, something as "using the default parameters".
Line 479: Compare the 0.68% of recall in random sequences to your recall in the datasets for the same real sequences. Is real recall 11-22.6% (line 143)? It could also be stated in line 143 that this implies that the number of erroneous associations can be between 6% (100x0.68/11) and 3% (100x0.68/22.6). Also, the process of testing against random sequences is better if more than one replicate is used.
-Lines 480-481." However, none of them could link spacers from different domains of life." The fact that, with random viral sequences, no domains of life were linked is not representative for the exposed results. As with random sequences the number of associations is expectedly decreased, it is to be expected that the random chance of associating spacers from different domains of life is greatly reduced. For instance, if recall is 22,4% of viruses compared to 0,68%, there is 32 times more opportunities to have an incorrect association to an already paired viral genome in your actual data.
A much adequate approach for that purpose would be to take only the recalled viral genomes, transform them into random sequences, and see recall to the set of spacers from the other domain. It is a good practice to repeat this process a few times, not just one. And then compare to the number of cross-domain pairings obtained. However, this approach has a few pitfalls (viruses that were recalled for two domains and a percentage of the recalls may be incorrect).
Another possibility is to take the estimated percentage of erroneous host identifications (see commentary about line 479). For example, let´s assume an estimate of 3% errors and every virus predicted to infect 2 domains that has two identifications (one for each domain). So the possibility that at least one of these identifications is false equals P(f)=1-(1-0.03)^2= 0,059=5,9%; Therefore, of 135 viruses infecting 2 domains of live, about 8.0 (5.9%) would be expected to have happened just by chance.

Response to comments of Reviewer 2
Reviewer #2 (Remarks to the Author): In the prior round of comments to the authors, Reviewer's only concern was the descriptive nature of the virus-host interactions. As explained, there are concerns of over-estimation in these data from wastewater treatment plant samples. No molecular validation was provided to substantiate the virus-host interactions. In this revision, Authors have decided to address this concern by stating that this is a limitation of their study in the Discussion (lines 334-337). No further new analyses/experiments were done to address this concern.
In the Reviewer Assessment, Reviewer is asked to evaluate: "What are the major claims of the paper? Are they novel and will they be of interest to others in the community and the wider field?" The reporting of 50,037 viral contigs from environmental sampling is interesting, but not substantive by itself. The potentially novel insights from virus-host interactions would have clearly elevated this manuscript to be "of interest to others in the community and the wider field". Reviewer also considered whether the methodology could justify. The overall approach is standard, but not particularly novel -to be clear, this is not a major critique of the study.
2 Based on these, Reviewer's opinion is that the current form of the manuscript may be of limited interest "to others in the community and the wider field".

Responses: Thanks for the valuable comments.
To address the only concern raised by Reviewer #2 regarding the molecular validation of the CRISPR-based host assignment for viruses, we did an additional sampling in 2020-12 at Shatin WWTP (one of the sites mentioned in the paper) and used the novel high throughput chromosome conformation capture (Hi-C) method to validate virus-host connections in the sample. To validate the virus-host connections predicted by our CRISPR-based methods, we also used Illumina and Nanopore sequencing to obtain viral contigs and host genome bins in this sample (Figure 7).

Hi-C validation of virus-host interactions in AS system
High throughput chromosome conformation capture (Hi-C) method was used to validate the virus-host connections predicted by our CRISPR-based methods using an additional sample in December 2020 at ST WWTP, by referring to viral contigs and host genome bins obtained from direct sequencing using Illumina and Nanopore metagenomic sequencing (Figure 7). -Lines 480-481." However, none of them could link spacers from different domains of life." The fact that, with random viral sequences, no domains of life were linked is not representative for the exposed results. As with random sequences the number of associations is expectedly decreased, it is to be expected that the random chance of associating spacers from different domains of life is greatly reduced. For instance, if recall is 22,4% of viruses compared to 0,68%, there is 32 times more opportunities to have an incorrect association to an already paired viral genome in your actual data.
A much adequate approach for that purpose would be to take only the recalled viral genomes, transform them into random sequences, and see recall to the set of spacers from the other domain. It is a good practice to repeat this process a few times, not just one. And then compare to the number of cross-domain pairings obtained. However, this approach has a few pitfalls (viruses that were recalled for two domains and a percentage of the recalls may be incorrect).
Another possibility is to take the estimated percentage of erroneous host identifications (see commentary about line 479). For example, let´s assume an estimate of 3% errors and every virus predicted to infect 2 domains that has two identifications (one for each domain). So the possibility that at least one of these identifications is false equals P(f)=1-(1-0.03)^2= 0,059=5,9%; Therefore, of 135 viruses infecting 2 domains of live, about 8.0 (5.9%) would be expected to have happened just by chance.

Responses: Thanks for the valuable comments.
Most of my comments have been addressed. Also, there has been a miscommunication issue (Line 512). When I mentioned to filter "redundant sequences" I actually meant to turn on the BLAST filter for low complexity regions. But as default BLASTn parameters (-dust) filters them out and, it is unlikely that this will change, it is not something that I care much about now.
However, there are a couple of concerns that I´d like the authors to address.
Line 178. "at least" should be replaced by "each".
Lines 233-239. There is still the issue of the uneven distribution of CRISPR throughout prokaryotes. This is not a novelty at all. The point is that the number of phage-prey interactions that can be detected using this approach is a minority, and does not always represent present scenarios. Even phages infecting a CRISPR-positive strain may leave no trace in the form of spacers. Therefore, the conclusions that derive from these predictions must be handled with caution. Phages are most probably affecting all organisms and metabolisms, and it would be surprising if they weren´t. So, what I would like the authors to do is, to include a disclaimer about these limitations of their method. Something similar to what they do about Hi-C in lines 322 to 326.
Reviewer #4 (Remarks to the Author): Authors have addressed concerns previously raised by reviewer 2 including additional experiments to substantiate their hypothesis. Additionally, limitations of their study have been included in the discussion section.
interactions that can be detected using this approach is a minority, and does not always represent present scenarios. Even phages infecting a CRISPR-positive strain may leave no trace in the form of spacers. Therefore, the conclusions that derive from these predictions  Table 8).
It should be noticed that the number of phage-prey interactions that can be detected using this approach is a minority and does not always represent present scenarios. Even phages infecting a CRISPR-positive strain may leave no trace in the form of spacers. Therefore, the conclusions that derive from these predictions must be handled with caution.

Response to comments of Reviewer 4
Reviewer #4 (Remarks to the Author): Authors have addressed concerns previously raised by reviewer 2 including additional experiments to substantiate their hypothesis. Additionally, limitations of their study have been included in the discussion section.

Responses:
Thanks for the positive comments and your efforts in handling this manuscript.