Analysis of viromes and microbiomes from pig fecal samples reveals that phages and prophages rarely carry antibiotic resistance genes

Understanding the transmission of antibiotic resistance genes (ARGs) is critical for human health. For this, it is necessary to identify which type of mobile genetic elements is able to spread them from animal reservoirs into human pathogens. Previous research suggests that in pig feces, ARGs may be encoded by bacteriophages. However, convincing proof for phage-encoded ARGs in pig viromes is still lacking, because of bacterial DNA contaminating issues. We collected 14 pig fecal samples and performed deep sequencing on both highly purified viral fractions and total microbiota, in order to investigate phage and prophage-encoded ARGs. We show that ARGs are absent from the genomes of active, virion-forming phages (below 0.02% of viral contigs from viromes), but present in three prophages, representing 0.02% of the viral contigs identified in the microbial dataset. However, the corresponding phages were not detected in the viromes, and their genetic maps suggest they might be defective. We conclude that among pig fecal samples, phages and prophages rarely carry ARG. Furthermore, our dataset allows for the first time a comprehensive view of the interplay between prophages and viral particles, and uncovers two large clades, inoviruses and Oengus-like phages.

bands (see Figure A below). The lower band was the viral band, which was collected from the bottom of the tube and dialyzed against SM buffer (4 h) in two successive 1-L volumes (final volumes 3-5 ml after dialysis). Samples (0.9 mL) were then treated with DNAseI (2.5 U, 1h30 with Turbo DnaseI, Ambion) to remove residual free DNA.

Figure A: Iodixanol gradients of the 14 pig samples (sample number is below each tube). A * indicates VLP samples that were filtered at 0,2µm, after gradient purification.
To check for bacterial DNA contamination of the virome preparations prior to DNA extraction, qPCR was performed with universal 16S primers and compared to an E. coli total DNA standard. Total nanoparticle concentrations were also estimated with a Videodrop apparatus (Myriade), an apparatus that detects nanoparticles using light interferometry 2 . Most samples had less than 10 ng/mL of bacterial DNA for every 10 10 nanoparticles and were not further treated prior to gradient loading. Filtration (0.2 µM pore size, polyethersulfone filters) was performed on four samples (VP4, VP5, VP15 and VP17) that were considered too heavily contaminated with bacteria. For viral DNA extraction, 1 mL of viral sample was added to a 1-mL solution of 50% phenol/50% chloroform-iso-amyl alcohol (24:1), manually mixed for 1 min, and centrifuged for 20 min at 12 000 g. The soluble phase was recovered and treated a second time with phenol-chloroform. The soluble phase was then added to an equal volume of chloroform-isoamyl alcohol (24:1), emulsified and centrifuged for 5 min at 12 000 g. DNA from the soluble phase was finally precipitated by adding sodium acetate (0.3 M), 1 µg of glycogen as the DNA carrier, and 0.6 volume of isopropanol. The DNA pellet was collected by centrifugation, washed with 70% ethanol, dried, and resuspended in 11 µL of 10 mM Tris pH 8. DNA was quantified using Qubit. Due to the insufficient total amount of DNA obtained (20 to 250 ng out of the 600 ng required), a multiple displacement amplification step (MDA, with kit Genomiphi2) was performed on 1 µL of each sample (90 min at 37°C), after microdialysis (0.02 µM VWPS filters, Millipore) to remove potential inhibitors. This step allowed for the inclusion of ssDNA viruses in the analysis. However, RNA viruses (which are the less abundant among known bacterial viruses) were excluded. Libraries were prepared using the Truseq kit and sequencing was performed on an Illumina Hiseq platform (2 x 150 nt, pair end), with an average depth of 50 million total reads per sample.

Virome read cleaning, assembly and contig treatments to constitute the VP reference dataset
Reads from each sample were trimmed using Trimmomatic 3 (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:100), and dereplicated with the fastx-unique tool from the free version of the Usearch9 suite (R. Edgar, http://drive5.com/usearch). Reads were then paired again using the fastq_pair program (Edwards Lab), and unique reads (~30% of total) were kept as well. Assemblies were run separately on each sample using SPAdes 4 (version 3.13, parameters: --only-assembler -k 21,33,55,77,99,127) to include both paired and unique reads. The meta option in SPAdes, which does not permit inclusion of unique reads, was disregarded in order to keep as many reads as possible for assemblies, and improve contig lengths. A clustering step of viral contigs from all samples was performed at 95% nt identity using CDhit (psi-cd-hit.pl -c 0.95 -G 0 -g 1 -aS 0.9 -prog blastn) and only contigs that were equal to or above 2 kb were retained for further analysis. The final viral dataset contained 7755 virome-de-porc (VP) contigs (out of the original 8090 contigs prior to clustering).
To examine the assembly completeness of the VP contigs dataset, dereplicated reads were mapped back on the 7755 contigs using the Bowtie2 5 with default parameters (allows multiple matches, with random affiliation). On average, 75% of total virome reads were mapped back on the VP collection. To generate abundance matrices, mapping result were treated with tools from the Samtool suite 6 (view, sort, index, idxstat), and the number of reads found on each contig was finally converted into reads per length of the contig (in kb) per million (RPKM). The VP matrix contained high RPKM values, ranging from 5 to 91 000.

Microbiota DNA sample preparation and sequencing
Total microbiota DNA extraction was performed according to protocol H of the international standards for human fecal samples (IHMS, http://microbiome-standards.org/index.php?id=Sop&num=007), starting from 0.2 g of frozen fecal sample. In short, purification steps involved guanidine thiocyanate/SDS sarkosyl treatment for 1h at 70°C, followed by bead beating to break open all bacteria, then repeated polyvinylpolypyrrolidone applications to remove polyphenols. DNA was then precipitated with isopropanol, washed and dried. This protocol should also break open the viral capsids, thanks to the initial guanidine thiocyanate protein denaturation step, although the question has never been specifically addressed, to our knowledge. However, the absence of any step to convert ssDNA into dsDNA prevented sequencing of ssDNA viruses. Libraries were prepared using the Truseq kit and sequencing was performed on an Illumina Hiseq platform (2x150 nt, pair end), with an average depth of 61 million total reads per sample.

Microbiota read cleaning, assembly and contig treatments to constitute the P reference dataset
The same steps applied for viromes were followed for total microbiota, with some simplifications.
Briefly, reads of each sample were trimmed with Trimmomatic (same parameters as virome reads, except MINLEN=125), and then directly assembled with SPAdes (same options as for viromes). A clustering step of of each individual microbial sample was performed at 95% nt identity with CDhit, and only contigs of a size equal to or above 2 kb were retained for further analysis. The microbial dataset comprised 220,000 pig (P) contigs (from 271,163 contigs prior clustering).
To examine the assembly completeness, reads were mapped back onto the P contig dataset using Bowtie2 on default parameters. On average, 69% of microbial reads were mapped back on the P contigs collection. To generate abundance matrices, mapping results were treated with tools from the Samtools suite, and the number of reads found on each contig was finally converted into RPKM. The P contigs matrix contained RPKM values that were 10-times lower than the VP matrix, ranging from 0.71 to 6890.
To estimate the main phyla present in the microbiome samples, a taxonomic affiliation was determined using Kaiju 7 for the top 10 585 P contigs with a global abundance above 10 RPKM. To sum up contig abundances by phyla, those which might belong to the same microbial species were binned into clusters based on kmer composition and co-occurrence with VAMB 8 . A single contig per VAMB cluster was conserved in the abundance matrix, which ended up containing 5862 contigs (~species).
Moreover, in the context of ARG analyses, Kaiju-based taxonomic assignment of interesting contigs was complemented with a BLASTn against the nt database (retaining the best hit with E-value below 10 -50 , when present).

Bacterial contamination level of the viral-enriched DNA
The amount of bacterial DNA contamination of the virome reads was estimated using two methods.
First, reads were mapped onto the 16S SILVA collection 9 with Bowtie2. All samples had a ratio of 16S matching reads over total reads (prior dereplication) of 2 x 10 -4 , which is a cut-off for viromes with satisfactory-level purity 10 . The maximal 16S read ratio was 3 x 10 -5 in sample VP2 (see Suppl. Table 2).
Second, reads were submitted to the viromeQC program 11 , which estimates a viral enrichment yield. This is done by comparing the frequency of reads matching with 16S, 23S and ribosomal protein encoding genes, to that expected for global microbial metagenomics samples. Enrichment was 100fold, except for 3 samples (lowest value 7-fold, sample VP2, Suppl. Table 2). ViromeQC results matched well with the 16S method, and showed that the pig viromes were essentially viral in content.

Contig annotations and ARG search
VP and P contigs were annotated using Prokka 12 (version 1.13 with --addgenes and -rnamer options).
Following annotation, three ARG searches were compared for the VP contigs: (i) hmmscan against ResFam profiles ( 13 -cut_ga option), as in our previous report 14  BLASTn hits (≥ 99% identity, 100% coverage), 41 were Clostridiales, and 11 Bifidobacteriaceae complete genomes. In the vicinity of the tet gene, genes annotated as traG, mobC and relaxase were found, suggesting the presence of an ICE.

Elements of VP contig classification
-single-stranded (ss) DNA viruses: A total of 2807 ssDNA viruses were recognized. Petitvirales were recognized either because they were specifically mentioned ("non-Caudovirales gene enrichment" score) in the Virsorter output, or by searching for terms 'VP1', 'VP4', and 'Pilot' in the annotated contigs of the VIBRANT output. The ssDNA eukaryotic viruses were mostly detected thanks to Megablast searches against the viral section of nt, and therefore directly affiliated. A 'REP_TLCV' (rep gene of tomato-leaf curl virus) annotation in VIBRANT files allowed to detect some additional eukaryotic viruses. Tubulavirales were detected thanks to Inovirus-detector 17 . Most (272 of 344 total inoviruses) were circular, and these were analyzed into more details. A vCONTACT2 clustering of these 252 circular contigs revealed 50 clusters with a quality above 0.5, involving altogether 65% (179 phages) of all input genomes. Pairwise ANI (average nt identity) were also computed for all genomes sharing the same gpI gene, a gene typical of filamentous phages and needed for virion morphogenesis. Species membership was defined for pairs with ANI above 0.95, and genus boundary ANI > 0.75. A summary of the 272 circular genome properties is listed in Suppl. Table 3. (> 95% nt identity) or affiliated by Kaiju as viral, were added (total 16 940 viral P contigs). We reasoned that this set of viral contigs represented two very different situations, either virulent or temperate phages replicating and forming virions (category "active phage"), or dormant prophages that do not manifest viral activity (ie no capacity to form a free virion, at least in the ecosystem studied, category "prophage"). We took advantage of the virome reads to partition this set into either "active phage" if the viral P contig was matched by virome reads, or "prophage" if not, as described below.
We were also interested in the overlap between VP contigs and these viral P contigs. In a first step, therefore, the virome VP contigs were placed together with microbiome viral P contigs and dereplicated (95% identity, 90% coverage), generating a set of 23 181 non redundant viral contigs.
Among them, 6658 were found only in the VP set (54 Mb cumulative length). There were 660 contigs in the "overlap" category, where contigs are representing both viral P and VP segments, with in some cases many small P contigs matching a large VP contig, and also the reverse situation, indicating that viral assembly was incomplete in both cases, unsurprisingly (24 Mb). The last 15 873 contigs were found only in the P assembly (178 Mb).
In a second step, virome reads were mapped onto the complete collection of 23 181 non redundant viral contigs (Suppl. Figure 2b). We reasoned that the modest overlap of the VP contigs with P ones (only 14% of VP have a match to P) could be the result of partial assembly, so that parts of the same phage may be assembled in each set. Therefore, it was expected that some of the P contigs having no overlap with any VP contig might still be corresponding to active phages. In this mapping, the VP contig with lowest cumulative abundance across the 14 samples had a RPKM value of 2.6. We then considered as "active viral contigs" all P contigs with global RPKM abundance above 2.6. This added 3049 P contigs as active viral contigs (39 Mb). In total, of the 16 940 starting viral P set, 4021 contigs (23.7%) belonged to the active phage category (corresponding to these 3049, together with 972 P contigs present or included into VP contigs of the "overlap" set), while the rest was considered inactive prophages.
Finally, to confirm the partial assembly hypothesis, the reciprocal mapping of microbiota P reads onto the same contig dataset was performed (Suppl. Figure 2c). In this mapping, the viral-P contig with lowest abundance across the 14 samples had a RPKM value of 0.45, so that all VP contigs above this value were counted as covered. The matrix revealed indeed that 3264 contigs of the VP set, representing 20% of the total RPKM of this matrix, were covered by P reads. Inspection of the remaining, not-covered VP contigs (3394) revealed that 70% were ssDNA viruses.

Host predictions for interesting phage clades
For the 7755 VP contigs, host prediction was first attempted by searching for matching CRISPR spacers, using the recent spacer database from Dion et al. 24 and the default parameters (a maximum of two mismatches allowed). A host was predicted for 469 of these contigs (6% of total contigs). Next, a set of spacers was extracted from our P contigs (of all sizes, including those below 2 kb). For these spacers, CRISPR arrays were first identified using CRISPRDetect v2.2 25 . Spacers were then extracted from the CRISPRDetect GFF output file along with the following metadata: contig ID, start and end positions on the contig, length, sequence, orientation and position inside the locus. Spacers were assigned a unique ID and a custom BLAST database was generated (n = 14 941 spacers). Next, BLASTn v2.9.0 was used to perform pairwise alignments between the P contig spacers and the selected viral contigs, applying the same filters as previously described 24 . A total of 1075 (14%) VP contigs matched the in-house spacer collection. For 148 of the P contigs containing such matching spacers, a taxonomical affiliation could be proposed, based on the Kaiju indication followed by a BLASTn search (E-value < 10 -16 ) against the Bacteria section of the nt database. As expected, affiliation was more successful using the in-house spacer collection, although in most cases, taxonomical affiliation of these in-house P contigs was unsuccessful (many very short contigs).
For the two clades of phages mostly focused in our study, 272 inoviruses and 94 Oengus-like phages, 15 and three hosts were predicted by this first method, respectively. Three additional tests were performed on this subset: (1) submission to the spacer collection at JGI (https://img.jgi.doe.gov/cgibin/vr/main.cgi) and applying the two-mismatch cutoff as above, (2) BLASTn search against NCBI nt (Evalue < 10 -4 and coverage > 10%), and (3) WiSH 26 . WiSH computes conditional probabilities of all 9mers in bacterial genomes of the database (2155 bacterial genomes meaningful for the ecosystem were taken) and compares them to that of the query phage. The first output is the probability (expressed as log-likelihood) that the given phage will infect each of the bacteria in this collection. The closer the log-likelihood is to 0, the more likely the host is correct. A log-likelihood is also calculated for the phages that do not infect the bacteria of this same collection. By comparing these two loglikelihoods, a second statistic (a p-value) gives the confidence of the WiSH prediction. Four inoviruses and four phages of the Oengus cluster with known hosts were tested as controls. Only the five best log-likelihood with a p-value inferior or equal to 0.05 for each phage contig were considered as putative hosts. Next, a filter was applied on the log-likelihood values, which had to be below those obtained for the control phages. Finally, the bacteriophages with 4 to 5 predictions from the same taxonomic order were considered sure and kept.

Huge phage search
We attempted to detect any so-called "huge phage" (size > 200 kb) 27 among virome VP contigs. VP homologs to the reported 361 huge phage genomes 27 were searched by a simple BLASTn comparison (E-val < 10 -5 ) to the huge phages. Very few hits were found (30) and none with a query coverage above 14%. Then a more sensitive and more restricted search was done with tBLASTx on a subset of large virome VP contigs and microbiota P contigs. These included the 21 VP contigs of a size above 150 kb, a 381 kb P contig considered viral due to its homology to a VP contig by BLASTn, and 5 VAMB bins having both a Kaiju annotation as viral and a cumulated size above 180 kb. These contigs were compared to the 35 closed_circular_curated huge phages 27

and the 15 complete
Prevotella megaphages 19 with cutoffs placed at 30% query cover, 65% (amino-acid) identity and Evalue 10 -3 . Four distant relatives of complete phage AOT2015-SM01_PERU_trim_clean_scaffold_10 (PERU for short, 201 kb) were found. For the first of these, a long contig was present both in the microbiome and virome set. In the microbiome, this contig (P12_NODE_16) was 183 kb long, and its central section matched perfectly with the virome contig VP12_NODE3 (123 kb), confirming its viral nature. The three additional contigs (175-154 kb) came from three viromes VP2, VP9 and VP12. A Clostridium host was predicted for one of them (VP2_NODE1) because of a CRISPR spacer 24 . The largest of these 4 contigs was manually annotated, and all 4 were aligned with PERU using tBLASTx and Easyfig (see figure A below).
A fraction of the structural module (head genes) and the whole replication module genes were shared between these phages. Interestingly, P12_NODE_16 had long tail genes (absent from PERU and the 3 other contigs) and no detectable sheath encoding gene, suggesting that it is a siphovirus. A few ribosomal genes and eukaryote-like ribonucleoprotein genes (major vault-like protein-encoding genes) were also predicted.