Cryptic inoviruses revealed as pervasive in bacteria and archaea across Earth’s biomes

Roux, Simon; Krupovic, Mart; Daly, Rebecca A.; Borges, Adair L.; Nayfach, Stephen; Schulz, Frederik; Sharrar, Allison; Matheus Carnevali, Paula B.; Cheng, Jan-Fang; Ivanova, Natalia N.; Bondy-Denomy, Joseph; Wrighton, Kelly C.; Woyke, Tanja; Visel, Axel; Kyrpides, Nikos C.; Eloe-Fadrosh, Emiley A.

doi:10.1038/s41564-019-0510-x

Article
Open access
Published: 22 July 2019

Nature Microbiology volume 4, pages 1895–1906 (2019)

An Author Correction to this article was published on 11 February 2020

This article has been updated

Abstract

Bacteriophages from the Inoviridae family (inoviruses) are characterized by their unique morphology, genome content and infection cycle. One of the most striking features of inoviruses is their ability to establish a chronic infection whereby the viral genome resides within the cell in either an exclusively episomal state or integrated into the host chromosome and virions are continuously released without killing the host. To date, a relatively small number of inovirus isolates have been extensively studied, either for biotechnological applications, such as phage display, or because of their effect on the toxicity of known bacterial pathogens including Vibrio cholerae and Neisseria meningitidis. Here, we show that the current 56 members of the Inoviridae family represent a minute fraction of a highly diverse group of inoviruses. Using a machine learning approach leveraging a combination of marker gene and genome features, we identified 10,295 inovirus-like sequences from microbial genomes and metagenomes. Collectively, our results call for reclassification of the current Inoviridae family into a viral order including six distinct proposed families associated with nearly all bacterial phyla across virtually every ecosystem. Putative inoviruses were also detected in several archaeal genomes, suggesting that, collectively, members of this supergroup infect hosts across the domains Bacteria and Archaea. Finally, we identified an expansive diversity of inovirus-encoded toxin–antitoxin and gene expression modulation systems, alongside evidence of both synergistic (CRISPR evasion) and antagonistic (superinfection exclusion) interactions with co-infecting viruses, which we experimentally validated in a Pseudomonas model. Capturing this previously obscured component of the global virosphere may spark new avenues for microbial manipulation approaches and innovative biotechnological applications.

Main

Inoviruses, bacteriophages from the Inoviridae family, exhibit unique morphological and genetic features. While the vast majority of known bacteriophages carry double-stranded DNA (dsDNA) genomes encapsidated into icosahedral capsids, inoviruses are instead characterized by rod-shaped or filamentous virions, circular single-stranded DNA genomes of ~5–15 kb and a chronic infection cycle^1,2,3 (Fig. 1a). Owing to their unique morphology and simple genome amenable to genetic engineering, several inoviruses are widely used for biotechnological applications, including phage display or as drug delivery nanocarriers^4,5,6,7. Ecologically, cultivated inoviruses are known to infect hosts from only 5 bacterial phyla and 10 genera but can have a significant effect on the growth and pathogenicity of their host^8,9,10. For instance, an inovirus prophage, CTXphi, encodes and expresses the major virulence factor of toxigenic Vibrio cholerae^11,12, whereas in other bacterial hosts, including Pseudomonas, Neisseria and Ralstonia, inovirus infections indirectly influence pathogenicity by altering biofilm formation and host colonization abilities^8,13,14,15,16.

**Fig. 1: Overview of inovirus infection cycle, diversity and sequence detection process.**

Despite these remarkable properties, their elusive life cycle and peculiar genomic and morphological properties have hampered systematic discovery of additional inoviruses: to date, only 56 inovirus genomes have been described¹⁷. Most inoviruses do not elicit negative effects on the growth of their hosts when cultivated in the laboratory and can thus easily evade detection. Furthermore, established computational approaches for the detection of virus sequences in whole-genome shotgun sequencing data are not efficient for inoviruses because of their unique and diverse gene content^18,19,20 (Fig. 1b). Finally, inoviruses are probably undersampled in viral metagenomes due to their long, flexible virions with low buoyant density^21,22.

Here, we unveil a substantial diversity of 10,295 inovirus sequences, derived from a broad range of bacterial and archaeal hosts and identified through an exhaustive search of 56,868 microbial genomes and 6,412 shotgun metagenomes with a custom computational approach for detecting putative inovirus genomes. These sequences reveal that inoviruses are far more widespread, diverse and ecologically pervasive than previously appreciated, and provide a robust foundation to further characterize their biology across multiple hosts and environments.

Results

Inoviruses are highly diverse and globally prevalent

To evaluate the global diversity of inoviruses, an analysis of all publicly available inovirus genomes was first conducted to identify characteristic traits that would enable automatic discovery of divergent inovirus sequences (Supplementary Table 1). Across the 56 known Inoviridae genomes, the gene encoding the morphogenesis (pI) protein, an ATPase of the FtsK–HerA superfamily, represented the only conserved marker gene (Fig. 1a,b and Supplementary Fig. 1). However, three additional features specific to inovirus genomes could be defined: (1) short structural proteins (30–90 amino acids) with a single predicted transmembrane domain (TMD; Supplementary Table 1), (2) genes either functionally uncharacterized or similar to other inoviruses, and (3) shorter genes than those in typical bacterial or archaeal genomes (Supplementary Fig. 2A). These features were used to automatically detect inovirus sequences through a two-step process (Fig. 1b). First, pI-like proteins were detected through a standard hidden Markov model (HMM)-based similarity search. Then, a random forest classifier trained on genomes of isolate inoviruses and manually curated prophages used these genome features to distinguish inovirus sequences from the background host genome. This approach yielded 92.5% recall and 99.8% precision on our manually curated reference set (Fig. 1c, Supplementary Fig. 2 and Supplementary Notes).

This detection approach was applied to 56,868 bacterial and archaeal genomes and 6,412 metagenomes publicly available from the Integrated Microbial Genomes (IMG) database²³ (Supplementary Table 2). After manual curation of edge cases and removal of detections not based on a clear inovirus-like ATPase, a total of 10,295 sequences were recovered (Fig. 1d, Supplementary Fig. 3 and Supplementary Notes). From these, 5,964 distinct species were identified using genome-wide average nucleotide identity (ANI), and only 38 of these included isolate inovirus genomes. About one-third of these species (30%) encoded an ‘atypical’ morphogenesis gene, with an amino-terminal instead of carboxy-terminal TMD (Supplementary Fig. 3). Although this atypical domain organization has been observed in four isolate species currently classified as inoviruses, some of these inovirus-like sequences might eventually be considered as entirely separate groups of viruses. Sequence accumulation curves did not reach saturation, highlighting the large diversity of inoviruses yet to be sampled (Supplementary Fig. 4).

Inovirus sequences were identified in 6% of bacterial and archaeal genomes (3,609 of 56,868) and 35% of metagenomes (2,249 of 6,412). More than half of the species (n = 3,675) were exclusively composed of sequences assembled from metagenomes. These revealed that inoviruses are found in every major microbial habitat whether aquatic, soil or human associated, and throughout the entire globe (Fig. 2 and Supplementary Notes). Hence, inoviruses are much more diverse than previously estimated and globally distributed.

**Fig. 2: Geographical and biome distribution of inovirus sequences detected in metagenomes.**

Inoviruses infect a broad diversity of bacterial hosts

To examine the host range of these inoviruses, we focused on the 2,284 inovirus species directly associated with a host, that is, proviruses derived from a microbial genome (Fig. 3). The majority (82%) of these species were associated with Gammaproteobacteria and Betaproteobacteria, from which most known inoviruses were previously isolated (Supplementary Table 1). However, the range of host genera within these groups was vastly expanded, including clinically and ecologically relevant microorganisms such as Azotobacter, Haemophilus, Kingella or Nitrosomonas (Supplementary Table 3). The remaining 412 species strikingly increased the potential host range of inoviruses to 22 additional phyla, including the Candidate Phyla Radiation (Fig. 3). For three of these (Acidobacteria, Chlamydiae and Spirochaetes), only short inovirus contigs were detected, lacking the host flanking regions that would provide confident host linkages. Hence, these contigs could potentially derive from sample contamination (for example, from reagents), and inovirus presence within these phyla remains uncertain (Supplementary Table 4). The notable host expansion is consistent with reported experimental observations of filamentous virus particles induced from a broad range of bacteria, for example, Mesorhizobium, Clostridium, Flavobacterium, Bacillus and Arthrobacter^24,25 (Fig. 3).

**Fig. 3: Phylum-wide distribution of inovirus detections across microbial genomes.**

This large-scale detection of inovirus sequences in microbial genomes also enabled a comprehensive assessment of co-infection, both between different inoviruses and with other types of viruses. In the majority of cases, a single inovirus sequence was detected per genome, with multiple detections mostly found within Gammaproteobacteria, Betaproteobacteria and Spiroplasma genomes (Supplementary Fig. 5). Conversely, inovirus prophages were frequently detected alongside, and sometimes colocalized with, Caudovirales prophages, suggesting that these two types of phages frequently co-infect the same host cell (Supplementary Fig. 5 and Supplementary Notes). Overall, the broad range of bacteria and archaea infected by inoviruses, combined with their propensity to co-infect a microbial cell with other viruses and their global distribution, indicates that inoviruses probably play an important ecological role in all types of microbial ecosystems.

Inoviruses sporadically transferred from bacterial to archaeal hosts

Although no archaea-infecting inoviruses have been reported so far²⁶, some inovirus sequences were associated with members of two archaeal phyla (Euryarchaeota and Aenigmarchaeota), which suggests that inoviruses infect hosts across the entire prokaryotic diversity (Fig. 3). These putative archaeal proviruses encoded the full complement of genes expected in an active inovirus (Fig. 4a and Supplementary Notes). Using PCR, we further confirmed the presence of a circular, excised form of the complete inovirus genome for the provirus identified in the Methanolobus profundi MobM genome (Fig. 4b, Supplementary Fig. 6 and Supplementary Notes). This indicates that our predictions in archaeal genomes are probably genuine inoviruses.

**Fig. 4: Characterization of archaea-associated inoviruses.**

Few groups of viruses include both bacteriophages and archaeoviruses. Such evolutionary relationships between viruses infecting hosts from different domains of life might signify either descent from an ancestral virus that infected the common ancestor of bacteria and archaea, or horizontal virus transfer from one host domain to the other^26,27,28. Here, the four archaea-associated inoviruses were clearly distinct from most other inoviruses and clustered only with metagenomic sequences in pI phylogeny (Fig. 4c). In addition, they were classified into two different proposed families (see below) corresponding to the two host groups, reflecting clear differences in their gene content (Fig. 4a,c and Supplementary Notes). The high genetic diversity of these archaea-associated inoviruses, combined with the lack of similarity to bacteria-infecting species, suggests that they are not derived from a recent host switch event.

A possible scenario would involve an ancestral group of inoviruses infecting the common ancestor of archaea, as postulated for the double-jelly-roll virus lineage²⁸. However, to be confirmed, this hypothesis would require the detection of additional inoviruses in other archaeal clades or an explanation as to why inoviruses were retained only in a handful of archaeal hosts. Instead, on the basis of the current data, a more likely scenario involves ancient and rare events of interdomain inovirus transfer from bacteria to archaea, including possibly to a Methanosarcina host for which substantive horizontal transfers of bacterial genes have already been reported²⁹.

Gene content classification reveals six distinct inovirus families

The vast increase of inovirus sequences provided a great opportunity for re-evaluation of the inovirus classification and the development of an expanded taxonomic framework for the large number of inovirus species identified. Similar to other bacterial viruses, especially temperate phages³⁰, inovirus genomes display modular organization and are prone to recombination and horizontal gene transfers³¹ (Supplementary Fig. 7). Hence, we opted to apply a bipartite network approach, in which genomes are connected to gene families, enabling a representation and clustering of the diversity based on shared gene content. A similar approach has been previously employed for the analysis of DNA and RNA viruses, and was shown to be efficient in cases in which the genomes to be clustered share only a handful of genes^26,32,33,34. Here, this approach yielded 6 distinct groups of genomes divided into 212 subgroups (Fig. 5a and Supplementary Table 3).
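As a rough illustration of the bipartite representation described above, the following sketch links genomes to gene families and counts how many families each pair of genomes shares. It is not the authors' pipeline: the genome and family names are hypothetical, and the actual network clustering step is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def shared_family_counts(genome_to_families):
    """Count gene families shared by each pair of genomes.

    genome_to_families maps a genome name to the set of gene-family
    identifiers it encodes (one side of the bipartite network each).
    """
    # Invert the mapping: gene family -> genomes connected to it.
    family_to_genomes = defaultdict(set)
    for genome, families in genome_to_families.items():
        for family in families:
            family_to_genomes[family].add(genome)

    # Every family contributes one shared link to each pair of its genomes.
    shared = defaultdict(int)
    for genomes in family_to_genomes.values():
        for a, b in combinations(sorted(genomes), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Hypothetical toy input, for illustration only.
toy = {"g1": {"pI", "coat"}, "g2": {"pI", "rep"}, "g3": {"rep"}}
print(shared_family_counts(toy))
```

Genomes that share only one or two families still end up connected in such a network, which is why this kind of representation suits groups with few universally conserved genes.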

**Fig. 5: Inovirus genome sequence space and gene content.**

A comparison of marker gene conservation between these groups and established viral taxa suggested that the former Inoviridae family should be reclassified as an order, provisionally divided into 6 candidate families and 212 candidate subfamilies, with few shared genes across candidate families (Fig. 5a, Supplementary Fig. 7 and Supplementary Notes). Beyond gene content, these proposed families also displayed clearly distinct host ranges as well as specific genome features, particularly in terms of genome size and coding density (Supplementary Fig. 7). Thus, we propose to establish these as candidate families named ‘Protoinoviridae’, ‘Vespertilinoviridae’, ‘Amplinoviridae’, ‘Paulinoviridae’, ‘Densinoviridae’ and ‘Photinoviridae’, on the basis of their isolate members and characteristics (see Supplementary Notes). If confirmed, and compared with currently recognized inoviruses, the genomes reported here would increase diversity by 3 families and 198 subfamilies.

The host envelope organization seems to play an important role in the evolution of inoviruses, which is reflected in their classification: members of the ‘Protoinoviridae’ and ‘Amplinoviridae’ are associated with diderm hosts—that is, Gram-negative bacteria with an outer membrane—whereas the other candidate families are associated with monoderm hosts or hosts without a cell wall (Supplementary Fig. 7). Conversely, no structuring by biome was observed and all proposed families were broadly detected across multiple types of ecosystems. Hence, we propose here a classification of inovirus diversity into six families based on gene content with coherent host ranges and specific genomic features, which strongly suggests that they represent ecologically and evolutionarily meaningful units.

Inovirus genomes encode an extensive functional repertoire

The extended catalogue of inovirus genomes offers an unprecedented window into the diversity of their genes and predicted functions. Overall, 68,912 proteins were predicted and clustered into 3,439 protein families and 13,714 singletons. This is on par with the functional diversity observed in known Caudovirales genomes, the largest order of dsDNA viruses, for which the same number of proteins clustered into 12,285 protein families but only 8,552 singletons (see Methods). A putative function was predicted for 1,133 of the 3,439 inovirus protein families (iPFs). Most of these (>95%) could be linked to virion structure, virion extrusion, DNA replication and integration, toxin–antitoxin systems or transcription regulation (Supplementary Table 5). A total of 51 and 47 distinct iPFs could be annotated as major and minor coat proteins, respectively, with an additional 934 iPFs identified as potentially structural based on their size and presence of a TMD (see Methods). Notably, each candidate inovirus family seemed to be associated with a specific set of structural proteins, including distinct major coat iPFs (Supplementary Fig. 8). Conversely, genome replication and integration-associated iPFs were broadly shared across candidate families (Fig. 5b). This confirms that replication-associated and integration-associated genes are among the most frequently exchanged among viral genomes and with other mobile genetic elements, especially in small single-stranded DNA viruses³⁵.

In addition, 15 distinct sets of iPFs representing potential toxin–antitoxin pairs were identified across 181 inovirus genomes, including 10 unaffiliated iPFs that were predicted as putative antitoxins through co-occurrence with a toxin iPF (Fig. 5b and Supplementary Table 5; see Methods). These genes typically stabilize plasmids or prophages in host cell populations, although alternative roles in stress response and transcription regulation have been reported³⁶. In addition, toxin–antitoxin systems often affect host cell phenotypes, such as motility or biofilm formation¹. Here, similar toxin proteins could be associated with distinct and seemingly unrelated antitoxins and vice versa, suggesting that gene shuffling and lateral transfer occur even within these tightly linked gene pairs (Supplementary Fig. 9). All but one of the toxin–antitoxin pairs were detected in proteobacteria-associated inoviruses, most likely because of a database bias. Thus, numerous uncharacterized iPFs across other candidate families of inoviruses may also encode previously undescribed toxin–antitoxin systems and, more generally, host manipulation mechanisms.

Inoviruses can both leverage and restrict co-infecting viruses

Finally, we investigated potential interactions between persistently infecting inoviruses, other co-infecting viruses, and the host clustered regularly interspaced short palindromic repeats (CRISPR)–CRISPR-associated (Cas) immunity systems. CRISPR–Cas systems typically target bacteriophages, plasmids and other mobile genetic elements³⁷. We detected 1,150 inovirus-matching CRISPR spacers across 42 bacterial and 1 archaeal families. These spacers were associated with three types and eight subtypes of CRISPR–Cas systems, indicating that inoviruses are broadly targeted by antiviral defences (Fig. 6a, Supplementary Table 6 and Supplementary Notes). Several host groups, most notably Neisseria meningitidis, were clear outliers, that is, they displayed a particularly high ratio of inovirus-derived spacers suggesting a uniquely high level of spacer acquisition and inovirus infection (Fig. 6a). This is particularly notable because inoviruses were recently suggested to increase N. meningitidis pathogenicity¹³ and hints at conflicting host–inovirus interactions in this specific group.

**Fig. 6: Interaction of inoviruses with CRISPR–Cas systems and co-infecting viruses.**

Next, we examined instances of ‘self-targeting’, that is, CRISPR spacers matching an inovirus integrated in the same host genome. Among the 1,429 genomes that included both a CRISPR–Cas system and an inovirus prophage, only 45 displayed one or more spacer matches to a resident prophage (Supplementary Table 6), suggesting that self-targeting of these integrated elements is lethal and strongly counter-selected³⁸. This was confirmed experimentally using the Pseudomonas aeruginosa strain PA14 harbouring an integrated inovirus prophage (Pf1), for which the introduction of a plasmid carrying Pf1-targeting CRISPR spacers was lethal (Supplementary Fig. 10a). In the 45 cases of observed self-targeting, the corresponding CRISPR–Cas system is thus probably non-functional or inhibited via an anti-CRISPR (acr) locus, as recently described in dsDNA phages³⁸. We first evaluated ten hypothetical proteins, and hence candidate Acr proteins, from self-targeted inoviruses infecting P. aeruginosa; however, none showed Acr activity (Supplementary Notes and Supplementary Fig. 10b). Alternatively, inoviruses could leverage the Acr activity of a co-integrated virus. This hypothesis was reinforced by the fact that 43 of the 45 self-targeted inoviruses were detected alongside co-infecting dsDNA phages, with 5 of these encoding known acr genes (Supplementary Table 6). We experimentally confirmed cross-protection by trans-acting Acr in the P. aeruginosa PA14 model, and observed that co-infection with an acr-encoding dsDNA bacteriophage rescued the lethality caused by self-targeted inoviruses (Supplementary Notes and Supplementary Fig. 10a).

While this represents an instance of beneficial co-infection for inoviruses, we also uncovered evidence of antagonistic interactions between inoviruses and dsDNA bacteriophages. Specifically, 2 of the 10 inovirus-encoded hypothetical proteins tested strongly limited infection of Pseudomonas cells by different bacteriophages (Fig. 6b, Supplementary Figs. 10c and 12 and Supplementary Notes). This superinfection exclusion effect was found to be host and virus strain dependent, which could drive intricate tripartite coevolution dynamics. Thus, these preliminary observations indicate that inoviruses may not only evade CRISPR–Cas immunity by leveraging the Acr activity of co-integrated phages, but also significantly influence the infection dynamics of unrelated co-infecting viruses through superinfection exclusion (Fig. 6c). Multiple effects of virus–virus interactions on host ecology and evolution have been recently highlighted or proposed, and are the main focus of a nascent ‘sociovirology’ field³⁹. Given their broad host range (Fig. 3), frequent detection alongside non-inovirus prophages (Supplementary Fig. 5), extended host cell residence time and the experimental results presented here, inoviruses could be driving many of these interactions and are undeniably important to consider in this framework.

Discussion

Taken together, the results presented here call for a complete re-evaluation of the diversity and role of inoviruses in nature. Collectively, inoviruses are distributed across all biomes and display an extremely broad host range spanning both prokaryotic domains of life. Comparative genomics revealed evidence of longstanding virus–host codiversification, leading to strong partitioning of inovirus diversity by host taxonomy, high inovirus prevalence in several microbial groups, including major pathogens, and potential interdomain transfer. Even though small (5–20 kb), their genomes encode a large functional diversity shaped by frequent gene exchange with unrelated groups of viruses, plasmids and transposable elements. Some of the many uncharacterized inovirus genes probably encode molecular mechanisms at the interface of virus–host and virus–virus interactions, such as modulators of the CRISPR–Cas systems, superinfection exclusion genes or toxin–antitoxin modules. This expanded and restructured catalogue of 5,964 distinct inovirus genomes thus provides a renewed framework for further investigation of the different effects that inoviruses have on microbial ecosystems, and exploration of their unique potential for biotechnological applications and manipulation of microorganisms.

Methods

Construction of an Inoviridae genome reference set

Genome sequences affiliated to Inoviridae and ≥2.5 kb were downloaded from NCBI Genbank and RefSeq on 14 July 2017 (refs. ^40,41). These were clustered at 98% ANI to remove duplicates and screened for cloning vectors and partial genomes (Supplementary Table 1). Two of these genomes (Stenotrophomonas phage phiSMA9, NC_007189, and Ralstonia phage RSS30, NC_021862) presented an unusually long section (≥1 kb) without any predicted gene, associated with a lack of short genes that are typical of Inoviridae. For these, genes were predicted de novo using Glimmer⁴² trained on their host genomes (NC_010943 for phiSMA9 and NC_003295 for RSS30) with standard genetic code. Similarly, genes for Acholeplasma phage MV-L1 (NC_001341) were predicted de novo using Glimmer with genetic code 4 (Mycoplasma/Spiroplasma) and trained on the host genome (NC_010163), followed by a manual curation step to integrate both RefSeq-annotated genes and these newly predicted CDS.

Protein clusters (PCs) were computed from these genomes from an all-versus-all blastp of predicted CDS (thresholds: e ≤ 0.001, bit score ≥ 30) and clustered with InfoMap³³. Sequences from these PCs were then aligned with MUSCLE⁴³, transformed into an HMM profile and compared with each other using HHSearch⁴⁴ (cut-offs: probability ≥ 90% and coverage ≥ 50%, or probability ≥ 99%, coverage ≥ 20% and hit length ≥ 100). The larger clusters generated through this second step are designated here as iPFs. Only ten PCs were clustered into larger iPFs, but these were consistent with the functional annotation of these proteins. For instance, one iPF combined two PCs both composed of replication initiation proteins.
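The two alternative HHSearch cut-offs quoted above can be expressed as a single predicate. This is a minimal sketch, assuming each hit has already been parsed into a probability (%), a coverage (%) and a hit length; the argument names are illustrative and are not HHSearch output column names.

```python
def passes_profile_cutoff(probability, coverage, hit_length):
    """Return True if a profile-profile hit meets either quoted cut-off.

    probability and coverage are percentages (0-100); hit_length is the
    length of the aligned region.
    """
    # Cut-off 1: probability >= 90% and coverage >= 50%.
    broad_hit = probability >= 90 and coverage >= 50
    # Cut-off 2: probability >= 99%, coverage >= 20% and hit length >= 100.
    confident_partial_hit = (
        probability >= 99 and coverage >= 20 and hit_length >= 100
    )
    return broad_hit or confident_partial_hit


print(passes_profile_cutoff(92.0, 60.0, 40))    # meets cut-off 1
print(passes_profile_cutoff(99.5, 25.0, 120))   # meets cut-off 2
print(passes_profile_cutoff(99.5, 25.0, 80))    # meets neither
```

The second branch lets long, near-certain matches through even when they cover only a small part of either profile.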

Marker genes were identified from a bipartite network linking Inoviridae genomes to iPFs (Supplementary Fig. 1). Only the genes encoding the morphogenesis (pI) protein represented good candidates for a universally conserved gene across all members of the Inoviridae, and HMM profiles were built for the three pI iPFs. To optimize these profiles, sequences were first clustered at 90% amino acid identity with cd-hit⁴⁵, then aligned with MUSCLE⁴³ and the profile generated with hmmbuild⁴⁶.

These reference genomes were also used to evaluate the detection of the Inoviridae structural proteins based on protein features beyond sequence similarity (see Supplementary Notes). Here, signal peptides were predicted using SignalP in both Gram-positive and Gram-negative modes⁴⁷, and TMDs were identified with TMHMM⁴⁸.

Search for inovirus in microbial genomes and metagenomes

Proteins predicted from 56,868 microbial genomes publicly available in the IMG as of October 2017 (Supplementary Table 2) were compared with the reference morphogenesis (pI) proteins with hmmsearch⁴⁶ (hmmer.org, score ≥ 30 and e ≤ 0.001) for the pI-like iPFs and blastp⁴⁹ (bit score ≥ 50) for the singleton pI protein (Acholeplasma phage MV-L1). These included 54,405 bacterial genomes, 1,304 archaeal genomes and 1,149 plasmid sequences. A total of 6,819 hits were detected, from which 795 corresponded to complete inovirus genomes. These included 213 circular contigs, that is, likely complete genomes, and 582 integrated prophages with canonical attachment (att) sites, that is, direct repeats of ≥10 bp in a tRNA or outside of an integrase gene. All sequences were manually inspected to verify that these were plausible inovirus genomes (see Supplementary Notes). The predicted pI proteins from the curated genomes were then added to the references to generate new improved HMM models. Using these improved models, an additional set of 639 putative pI proteins was identified. New models were built from these proteins and used in a third round of searches, which did not yield any additional genuine inovirus sequence after manual inspection.
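The direct-repeat part of the att site criterion can be illustrated with a simple longest-common-substring search between the two regions flanking a candidate prophage. This is only a sketch under stated assumptions: the function name and flank inputs are hypothetical, the tool actually used for this step is not named in this section, and the additional tRNA or integrase context required for canonical att sites is not checked here.

```python
def longest_direct_repeat(left_flank, right_flank, min_len=10):
    """Return the longest sequence present in both flanks, or None.

    Uses a row-by-row dynamic programme for the longest common substring.
    Only repeats of at least min_len bases are reported.
    """
    best_len, best_end = 0, 0
    previous = [0] * (len(right_flank) + 1)
    for i in range(1, len(left_flank) + 1):
        current = [0] * (len(right_flank) + 1)
        for j in range(1, len(right_flank) + 1):
            if left_flank[i - 1] == right_flank[j - 1]:
                # Extend the match ending at the previous base of both flanks.
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len, best_end = current[j], i
        previous = current
    if best_len < min_len:
        return None
    return left_flank[best_end - best_len:best_end]


# Hypothetical flanks sharing a 14-base repeat.
left = "CCCCC" + "GATTACAGATTACA" + "TTTTT"
right = "AAAAA" + "GATTACAGATTACA" + "GGGGG"
print(longest_direct_repeat(left, right))
```

The quadratic search is adequate for short flanking windows; a k-mer index would be a more natural choice for long regions.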

An automatic classifier was trained on this extended inovirus genome catalogue, that is, the reference genomes and the 795 manually curated genomes, to detect putative inovirus fragments around pI-like genes, based on 10 distinctive features of inovirus genomes (Supplementary Fig. 2 and Supplementary Notes). These 795 manually curated genomes were identified from 17 host phyla (or class for Proteobacteria) and were later classified into 5 proposed families and 245 proposed subfamilies (see below ‘Gene-content-based clustering of inovirus genomes’). Three types of classifiers were tested: random forest (function randomForest from R package randomForest⁵⁰ using 2,000 trees, other parameters left as default), random forest with conditional inference (function cforest from R package party⁵¹ using 2,000 trees, other parameters left as default) and a generalized linear model with lasso regularization (function glmnet from R package glmnet⁵²). The efficiency of classifiers was evaluated via a tenfold cross-validation in which the input data set was partitioned into ten equal-sized subsamples, with one retained for validation and the other nine used for training through the ten possible permutations. Results were visualized as a ROC curve generated with ggplot2 (refs. ^53,54). The importance of features in the random forest classifier was evaluated using the function ‘importance’, from the R package randomForest.
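The tenfold cross-validation scheme described above can be sketched as follows. The authors worked in R; this Python version is a stand-in that only shows how the data are partitioned, with a hypothetical function name and an arbitrary random seed.

```python
import random


def kfold_indices(n_items, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation.

    Items are shuffled once, split into k folds whose sizes differ by at
    most one, and each fold serves as the validation set exactly once.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        training = [
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        ]
        yield training, validation


for training, validation in kfold_indices(25):
    print(len(training), len(validation))
```

Each classifier would be trained on the nine training folds and scored on the held-out fold, and the ten sets of scores would then be pooled to draw the ROC curve.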

On the basis of the inflection point observed on the ROC curves, the random forest classifier was selected as the optimal method as it provided the highest true-positive rate (>92%) for false-positive rates of <1% (Supplementary Fig. 2). This model was then used to classify all putative inovirus fragments that had not been identified as complete genomes previously, using a sliding window approach (up to 30 genes around the putative pI protein), and looking for the fragment with the maximum score in the random forest model (if >0.9). For the predicted integrated prophages, putative non-canonical att sites were next searched as direct repeats (10 bp or longer) around the fragment. Overall, 3,908 additional putative inovirus sequences were detected, including 738 prophages flanked by direct repeats.
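The sliding-window search can be sketched as an exhaustive scan over gene windows that contain the pI-like gene. Several points are assumptions made for illustration: windows are taken to extend up to 30 genes on either side of pI, `score_fn` is a hypothetical stand-in for the trained random forest, and ties are resolved in favour of the first window found.

```python
def best_window(n_genes, pi_index, score_fn, max_flank=30, min_score=0.9):
    """Return (start, end, score) for the best-scoring window, or None.

    Windows are inclusive gene-index ranges on one contig that contain
    pi_index. score_fn(start, end) returns a classifier score in [0, 1].
    Only windows scoring above min_score are considered.
    """
    best = None
    first = max(0, pi_index - max_flank)
    last = min(n_genes - 1, pi_index + max_flank)
    for start in range(first, pi_index + 1):
        for end in range(pi_index, last + 1):
            score = score_fn(start, end)
            if score > min_score and (best is None or score > best[2]):
                best = (start, end, score)
    return best


# Toy scoring function: only the window spanning genes 3-7 scores highly.
def toy_score(start, end):
    return 0.95 if (start, end) == (3, 7) else 0.5


print(best_window(12, 5, toy_score))
```

With at most 31 candidate starts and 31 candidate ends, the scan evaluates fewer than a thousand windows per pI-like gene.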

A similar approach was used to search for inovirus sequences in 6,412 metagenome assemblies (Supplementary Table 2). Predicted proteins were compared with the 4 HMM profiles as well as to the Acholeplasma phage MV-L1 singleton sequence, which led to 27,037 putative pI proteins using the same thresholds as for isolate genomes. The final data set of inovirus sequences predicted from these metagenome assemblies consisted of 6,094 sequences, including 922 circular contigs, 44 prophages with canonical att sites (direct repeats of 10 bp or longer in a tRNA or next to an integrase) and 994 prophages with non-canonical att sites (direct repeats of 10 bp or longer).

Clustering of inovirus genomes in putative species

Next, we sought to cluster these putative inovirus genomes along with the previously collected reference genomes to remove duplicated sequences and to select only one representative per species. This clustering was conducted according to the latest guidelines submitted to the International Committee on Taxonomy of Viruses (ICTV) for Inoviridae, that is, “95% DNA sequence identity as the criterion for demarcation of species”⁵⁵ (https://talk.ictvonline.org/files/ictv_official_taxonomy_updates_since_the_8th_report/m/prokaryote-official/6774/download), and included our 10,295 sequences alongside the 56 reference genomes. Notably, however, predictions spanning multiple tandemly integrated inovirus prophages had to be processed separately, otherwise they could lead to clusters gathering multiple species. To detect these cases of tandem insertions, we searched for and clustered separately all predictions with multiple pI proteins, as this gene is expected to be present in single copy in inoviruses (n = 800 sequences).

All non-tandem sequences were first clustered incrementally with priority given to complete genomes over partial genomes as well as fragments identified in microbial genomes over fragments from metagenomes. First, circular contigs and prophages with canonical att sites identified in a microbial genome were clustered, and all other fragments were affiliated to these seed sequences. Next, unaffiliated fragments detected in microbial genomes and with non-canonical att sites (that is, simple direct repeat) were clustered together, and other fragments were affiliated to this second set of seed sequences. Finally, the remaining unaffiliated sequences detected in microbial genomes were clustered together. This allowed us to use the more ‘certain’ predictions (that is, circular sequences and prophages with identified att sites) preferentially as seeds of putative species.
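
The seed-priority logic can be approximated by a single greedy pass in which sequences are visited from the most to the least reliable tier. This is a simplification of the stepwise procedure described above, offered only as a sketch: the tier numbering, the function name `cluster_by_priority` and the `same_species` callable (standing in for the 95% ANI comparison) are assumptions.

```python
from typing import Callable, Dict, List, Tuple


def cluster_by_priority(
    seqs: List[Tuple[str, int]],
    same_species: Callable[[str, str], bool],
) -> Dict[str, str]:
    """Map each sequence id to the id of its seed (cluster representative).

    `seqs` holds (sequence_id, tier) pairs; a lower tier means a more
    reliable prediction (e.g. 0 = circular or canonical att site,
    1 = non-canonical att site, 2 = other fragments).
    """
    seeds: List[str] = []
    assignment: Dict[str, str] = {}
    # sorted() is stable, so input order is kept within a tier.
    for seq_id, _tier in sorted(seqs, key=lambda item: item[1]):
        for seed in seeds:
            if same_species(seed, seq_id):
                assignment[seq_id] = seed   # affiliate to an existing seed
                break
        else:
            seeds.append(seq_id)            # no match: becomes a new seed
            assignment[seq_id] = seq_id
    return assignment
```

Visiting the most reliable tier first ensures that a fragment never becomes a seed when a circular genome or a prophage with att sites of the same species is available.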

A similar approach was used to cluster sequences identified from metagenomes, as well as to separately cluster putative tandem fragments, that is, those including multiple pI proteins. All the clustering and affiliation was done with a threshold of 95% ANI on 100% of alignment fraction (according to the ICTV guidelines), with sequence similarity computed using mummer⁵⁶. Accumulation curves were calculated for 100 random orderings of input sequences using a custom Perl script and plotted with ggplot2 (refs. ⁵³,⁵⁴).
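
An accumulation curve over random orderings can be computed as in the sketch below. The original analysis used a custom Perl script that is not reproduced here; the Python function `accumulation_curves`, its arguments and the fixed random seed are illustrative assumptions.

```python
import random
from typing import Dict, Hashable, List


def accumulation_curves(
    cluster_of: Dict[str, Hashable],
    n_orderings: int = 100,
    seed: int = 0,
) -> List[List[int]]:
    """For each random ordering of the input sequences, return the number
    of distinct clusters (putative species) seen after each sequence."""
    rng = random.Random(seed)   # seeded for reproducibility
    ids = list(cluster_of)
    curves = []
    for _ in range(n_orderings):
        rng.shuffle(ids)
        seen = set()
        curve = []
        for seq_id in ids:
            seen.add(cluster_of[seq_id])
            curve.append(len(seen))
        curves.append(curve)
    return curves
```

Each curve is non-decreasing and ends at the total number of clusters; averaging the curves position by position gives the mean accumulation curve.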

Clustering of predicted proteins from non-redundant inovirus sequences

Predicted proteins from the representative genome of each putative species were next clustered using the same approach as for the reference genomes. A clustering into PCs was first achieved through an all-versus-all blastp using hits with e ≤ 0.001 and bit score ≥ 50 or bit score ≥ 30 if both proteins are ≤70 amino acids. HMM profiles were constructed for the 5,142 PCs and these were compared all-versus-all using HHSearch, keeping hits with ≥90% probability and ≥50% coverage or ≥99% probability, ≥20% coverage and hit length of ≥100. This resulted in 4,008 protein families (iPFs).
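
The two sets of thresholds above can be written as simple predicates. This sketch assumes that the E-value cut-off applies to both bit score regimes and that probability and coverage are expressed as percentages; the function names are hypothetical.

```python
def keep_blastp_hit(
    evalue: float, bit_score: float, len_query: int, len_subject: int
) -> bool:
    """blastp filter: e <= 0.001 and bit score >= 50, relaxed to >= 30
    when both proteins are <= 70 amino acids long."""
    if evalue > 0.001:
        return False
    both_short = len_query <= 70 and len_subject <= 70
    return bit_score >= (30 if both_short else 50)


def keep_hhsearch_hit(
    probability: float, coverage: float, hit_length: int
) -> bool:
    """HHsearch filter: (>= 90% probability and >= 50% coverage) or
    (>= 99% probability, >= 20% coverage and hit length >= 100)."""
    broad_match = probability >= 90 and coverage >= 50
    long_partial_match = (
        probability >= 99 and coverage >= 20 and hit_length >= 100
    )
    return broad_match or long_partial_match
```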

The PCs were subsequently used for taxonomic classification of the inovirus sequences (see below), while iPFs were primarily used for functional affiliation. iPF functions were predicted based on the affiliation of iPF members against PFAM v30 (score ≥ 30), as well as manual inspection of individual iPFs using HHPred⁵⁷.

PCs containing pI-like proteins were also further evaluated to identify potential false positives stemming from a related ATPase encoded by another type of virus or mobile genetic element (see Supplementary Notes). The criteria used to determine genuine inovirus pI-like PCs were: the PC members' closest known functional domain was Zot (based on the hmmsearch against PFAM), the proteins contained one or two TMD (either N-terminal or C-terminal), at least half of the sequences encoding this PC also included other genes expected in an inovirus sequence such as replication initiation proteins, and no significant similarity could be identified to any other type of ATPase using HHpred⁵⁷.

Gene-content-based clustering of inovirus genomes

A bipartite network was built in which genomes and PCs (as nodes) are connected by an edge when a predicted protein from the genome is a member of the PC. This network was then used to classify inovirus sequences as done previously for dsDNA viruses³². PCs were used instead of iPFs as they offer a higher resolution. Sequences with two pI proteins (that is, tandem prophages) were excluded from this network-based classification as these could lead to improper connections between unrelated genomes. Singleton proteins were also excluded, and only PCs with at least 2 members were used to build the network. This network had a very low density (0.05%) reflecting the fact that most PCs were restricted to a minor fraction of the genomes. Nevertheless, this type of network can still be organized into meaningful groups through information theoretic approaches: here, sequence clusters were obtained through InfoMap, with default parameters and a two-level clustering (that is, genomes can be associated with a group and a subgroup).
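
The construction of the genome–PC network and its density can be sketched as below. The sketch assumes that density is the number of edges divided by the number of possible genome–PC pairs, which is one common definition for bipartite graphs and may differ from the one used for the 0.05% figure; the function names and the input layout are hypothetical. The InfoMap clustering step is not shown.

```python
from collections import Counter
from typing import List, Set, Tuple


def build_bipartite(
    protein_table: List[Tuple[str, str, str]], min_pc_size: int = 2
) -> Set[Tuple[str, str]]:
    """Return the set of (genome, PC) edges.

    `protein_table` lists (genome_id, protein_id, pc_id) rows. PCs with
    fewer than `min_pc_size` member proteins (singletons) are dropped.
    """
    pc_size = Counter(pc for _, _, pc in protein_table)
    return {
        (genome, pc)
        for genome, _, pc in protein_table
        if pc_size[pc] >= min_pc_size
    }


def bipartite_density(edges: Set[Tuple[str, str]]) -> float:
    """Edges divided by the number of possible genome-PC pairs."""
    if not edges:
        return 0.0
    genomes = {g for g, _ in edges}
    pcs = {pc for _, pc in edges}
    return len(edges) / (len(genomes) * len(pcs))
```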

A summarized representation of the network was generated by displaying each subgroup (level 2) as a node with a size proportional to the number of species in the subgroup, and drawing an edge to a PC if >50% of the subgroup sequences encode this PC, except for the larger group (‘Protoinoviridae’:Subfamily_1) where connections are drawn for PCs found in >25% of the sequences. The network was then visualized using Cytoscape⁵⁸, with nodes from the same group (level 1) first gathered manually, and node placement within each group automatically generated using the Prefuse force-directed layout (default spring length of 200).
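
The edge rule of the summarized network can be expressed as a short function. This is a sketch under assumed data structures (dictionaries keyed by hypothetical subgroup and genome identifiers); the `overrides` argument is an illustrative way to apply the lower 25% cut-off to a single large subgroup.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple


def summary_edges(
    subgroup_genomes: Dict[str, List[str]],
    genome_pcs: Dict[str, Set[str]],
    min_fraction: float = 0.5,
    overrides: Optional[Dict[str, float]] = None,
) -> Set[Tuple[str, str]]:
    """Return (subgroup, PC) edges for PCs encoded by more than
    `min_fraction` of the genomes in the subgroup."""
    overrides = overrides or {}
    edges = set()
    for subgroup, genomes in subgroup_genomes.items():
        if not genomes:
            continue
        cutoff = overrides.get(subgroup, min_fraction)
        # Count each PC at most once per genome.
        counts = Counter(
            pc for g in genomes for pc in set(genome_pcs.get(g, ()))
        )
        for pc, n in counts.items():
            if n / len(genomes) > cutoff:
                edges.add((subgroup, pc))
    return edges
```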

To evaluate the taxonomic rank to which these groups and subgroups would correspond, we calculated pairwise amino acid identity percentage of pI proteins for genomes (1) between groups and (2) within groups but between subgroups, using Sequence Demarcation Tool⁵⁹. These were then compared with the pairwise amino acid identity calculated with the same approach for established viral groups, namely, Caudovirales order using the terminase large subunit (TerL) as a marker protein, Microviridae using the major capsid protein (VP1) as a marker protein and Circoviridae using the replication initiation protein (Rep) as a marker protein (see Supplementary Notes).

Distribution of inovirus sequences by host and biome

The distribution of hosts for inovirus sequences was based on detections in IMG draft and complete genomes, that is, excluding all metagenome-derived detections but including detections in metagenome-assembled genomes (published draft genomes assembled from metagenomes). Host taxonomic classification was extracted from the IMG database. For visualization purposes, a set of 56 universal single-copy marker proteins⁶⁰,⁶¹ was used to build phylogenetic trees for bacteria and archaea based on all available microbial genomes in IMG²³ (genomes downloaded on 27 October 2017) and about 8,000 metagenome-assembled genomes from the Genome Taxonomy Database⁶² (downloaded on 18 October 2017). Marker proteins were identified with hmmsearch (version 3.1b2, hmmer.org) using a specific HMM for each of the markers. Genomes lacking a substantial proportion of marker proteins (>28) or which had additional copies of >3 single-copy markers were removed from the data set.
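
The genome quality filter based on marker proteins can be illustrated as follows. The sketch assumes that marker detection has already been summarized as a copy number per marker, and that the dictionary keys are drawn from the 56 markers; `keep_genome` is a hypothetical name.

```python
from typing import Dict


def keep_genome(
    marker_counts: Dict[str, int],
    n_markers: int = 56,
    max_missing: int = 28,
    max_duplicated: int = 3,
) -> bool:
    """Keep a genome unless it lacks more than `max_missing` markers or
    has extra copies of more than `max_duplicated` single-copy markers.

    `marker_counts` maps marker name -> number of copies detected.
    """
    present = sum(1 for copies in marker_counts.values() if copies >= 1)
    duplicated = sum(1 for copies in marker_counts.values() if copies > 1)
    missing = n_markers - present
    return missing <= max_missing and duplicated <= max_duplicated
```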

To reduce redundancy and to enable a representative taxon sampling, DNA-directed RNA polymerase β-subunit 160 kDa (COG0086) was identified using hmmsearch (hmmer 3.1b2) and the HMM of COG0086 (ref. ⁶³). Protein hits were then extracted and clustered with cd-hit⁴⁵ at 65% sequence similarity, resulting in 99 archaeal and 837 bacterial clusters. Genomes with the greatest number of different marker proteins were selected as cluster representatives. For every marker protein, alignments were built with MAFFT⁶⁴ (v7.294b) and subsequently trimmed with BMGE (v1.12) using BLOSUM30 (ref. ⁶⁵). Single-protein alignments were then concatenated, resulting in an alignment of 11,220 sites for the archaea and 16,562 sites for the bacteria. Maximum-likelihood phylogenies were inferred with FastTree2 (v2.1.9 SSE3, OpenMP)⁶⁶ using the options: -spr 4 -mlacc 2 -slownni -lg.

A distribution of inovirus sequences across biomes was obtained by compiling ecosystems and sampling location of all metagenomes where at least one inovirus sequence was detected. This information was extracted from the GOLD database⁶⁷, and the map was generated using the Basemap toolkit for the matplotlib Python library⁶⁸.

Estimation of inovirus prevalence and co-infection patterns

Prevalence and co-infection patterns were evaluated from the set of sequences identified in complete and draft microbial genomes from the IMG database, that is, excluding detections from metagenome assemblies. To control for the presence of near-identical genomes in the database, prevalence and co-infection frequencies were calculated after clustering host genomes based on pairwise ANI (cut-offs: 95% nucleotide identity on 95% alignment fraction). Prevalence was calculated at the host genus rank as the number of genomes with one or more inovirus sequences detected. Co-occurrence of inoviruses was evaluated based on the detections of distinct species in single-host genomes. Finally, we evaluated the rate of bacteria and archaea co-infected by an inovirus and a member of the Caudovirales order, the group of dsDNA viruses including most of the characterized bacteriophages (both lytic and temperate) as well as several archaeoviruses. To identify Caudovirales infections, we used the gene encoding the terminase large subunit as a marker gene, and searched the same genomes from the IMG database for hits to the PFAM domains terminase_1, terminase_3, terminase_6 and terminase_GpA (hmmsearch, score ≥ 30).

Phylogenetic trees of inovirus sequences

Phylogenies of inovirus sequences were based on multiple alignment of pI protein sequences. To obtain informative multiple alignments, an all-versus-all blastp⁴⁹ of all pI proteins was computed and used to identify the nearest neighbours of sequences of interest. For sequences detected in archaeal genomes, an additional 10 most closely related sequences with e ≤ 0.001, bit score ≥ 50 and a blast hit covering ≥50% of the query sequence were recruited for each archaea-associated sequence to help populate the tree. A similar approach was used for the tree based on the integrase genes from archaea-associated inoviruses: the protein sequences for the three integrase genes were compared with the NCBI nr database with blastp⁴⁹ (bit score ≥ 50, e ≤ 0.001) to gather their closest neighbours across archaeal and bacterial genomes.

Resulting data sets were first filtered for partial sequences as follows: the average sequence length was calculated excluding the top and bottom 10%, and all sequences shorter than half of this average were excluded. These protein sequences were next aligned with MUSCLE (v3.8.1551)⁴³, automatically trimmed with trimAl (v1.4.rev15)⁶⁹ (option gappyout), and trees were constructed using IQ-TREE (v1.5.5) with an automatic detection of optimal model⁷⁰ and displayed using iTOL⁷¹. The optimal substitution model, selected based on the Bayesian information criterion, was VT + F + R5 for the pI phylogeny of archaeal inoviruses, and LG + R4 for the integrase phylogeny of archaeal inoviruses. Annotated trees are available at http://itol.embl.de/shared/Siroux (project ‘Inovirus’).
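
The length filter for partial sequences amounts to a trimmed mean followed by a cut-off at half of that mean. A minimal sketch, assuming sequence lengths are available in a dictionary and that the 10% trimmed from each end is rounded down to a whole number of sequences (`filter_partial` is a hypothetical name):

```python
from typing import Dict, Set


def filter_partial(lengths: Dict[str, int], trim_percent: int = 10) -> Set[str]:
    """Return ids of sequences at least half as long as the trimmed
    average length (top and bottom `trim_percent`% excluded)."""
    if not lengths:
        return set()
    values = sorted(lengths.values())
    k = len(values) * trim_percent // 100     # sequences dropped per end
    core = values[k:len(values) - k]
    trimmed_mean = sum(core) / len(core)
    return {sid for sid, n in lengths.items() if n >= trimmed_mean / 2}
```

Trimming before averaging keeps a few very long or very short outliers from shifting the cut-off.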

Functional affiliation of iPFs

An automatic functional affiliation of all iPFs was generated by compiling the annotation of all members based on a comparison to PFAM (data extracted from the IMG). To refine these annotations for functions of interest, namely, replication initiation proteins, integration proteins, DNA methylases and toxin–antitoxin systems, individual iPF alignments were submitted to the HHPred website⁵⁷, and the alignments were visually inspected for conserved residues and/or motifs (Supplementary Table 5, motifs extracted from refs. ⁷²,⁷³ and the PFAM database v30 (ref. ⁷⁴)).

To identify toxin–antitoxin protein partners, all inovirus sequences were screened for co-occurring genes including an iPF annotated as toxin and/or antitoxin, and the list of putative pairs was next manually curated (Supplementary Table 5). This enabled the identification of putative antitoxin proteins detected as conserved uncharacterized iPF frequently observed next to a predicted toxin iPF.

Finally, putative structural proteins and DNA-interacting proteins were specifically searched for. Putative structural proteins were predicted as described above for the isolate reference genomes, that is, as sequences of 30–90 amino acids, after in silico removal of signal peptide, if detected, and displaying 1 or 2 TMD. For the most abundant iPFs predicted as major coat proteins, the secondary structure was predicted with Phyre2 (ref. ⁷⁵). For DNA-interacting proteins, PFAM annotations were screened for HTH, RHH, Zn-binding and Zn-ribbon domains. In addition, HHsearch was used to compare the iPFs to 3 conserved HTH domains from the SMART database⁷⁶: Bac_DnaA_C, HTH_DTXR and HTH_XRE (probability ≥ 90).

CRISPR spacer matches and CRISPR–Cas systems identification

All inovirus sequences were compared with the IMG CRISPR spacer database with blastn, using options adapted for short sequences (-task blastn-short -evalue 1 -word_size 7 -gapopen 10 -gapextend 2 -penalty -1 -dust no). Only cases with zero or one mismatch were further considered. Next, the genome context of these spacers was explored to identify the ones with a clear associated CRISPR–Cas system and to affiliate these systems to the different types described. Only spacers for which a cas gene could be identified in a region of ±10 kb were retained. The CRISPR–Cas system affiliation was based on the set of cas genes identified around the spacer and performed following the guidelines from ref. ⁷⁷.
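
The spacer filtering steps can be sketched as two predicates applied to parsed blastn results. The option list repeats the settings quoted above; the requirement that the alignment cover the full spacer without gaps is an assumption of this sketch, as are the function names and the use of single coordinates for spacer and cas gene positions.

```python
from typing import Iterable

# blastn options quoted in the text, as an argument list for a subprocess call.
BLASTN_SHORT_OPTIONS = [
    "-task", "blastn-short", "-evalue", "1", "-word_size", "7",
    "-gapopen", "10", "-gapextend", "2", "-penalty", "-1", "-dust", "no",
]


def spacer_match_ok(
    spacer_len: int, aln_len: int, mismatches: int, gap_opens: int,
    max_mismatches: int = 1,
) -> bool:
    """Keep hits with zero or one mismatch. Assumed here: the alignment
    must span the whole spacer and contain no gaps."""
    return (
        aln_len == spacer_len
        and gap_opens == 0
        and mismatches <= max_mismatches
    )


def has_cas_nearby(
    spacer_pos: int, cas_gene_positions: Iterable[int], window: int = 10_000
) -> bool:
    """True if any cas gene lies within +/- `window` bp of the spacer."""
    return any(abs(pos - spacer_pos) <= window for pos in cas_gene_positions)
```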

For host genomes with a self-targeting spacer, additional (that is, non-inovirus) prophages were detected using VirSorter²⁰. The number of distinct prophages was also estimated using the detection of large terminase subunits (hmmsearch against PFAM database, score ≥ 30). Putative Acr and anti-CRISPR-associated (Aca) proteins were first detected through similarity to previously described Acr systems³⁸ (blastp, e ≤ 0.001 and score ≥ 50). Putative Acr and Aca proteins were identified by searching for HTH-domain-containing proteins identified based on HTH domains in the SMART database (see above) in inovirus sequences displaying a match to a CRISPR spacer extracted from the same host genome.

Microscopy and PCR investigation of a predicted provirus in M. profundi MobM

M. profundi strain MobM cells were grown in anaerobic DSMZ medium 479 at 37 °C with 5 mM methanol added as a methanogenic substrate instead of trimethylamine⁷⁸. After 35 h of growth, anaerobic mitomycin C was added to the culture at a final concentration of 1.0 μg ml⁻¹ to induce the provirus. Samples were collected before and 4 h after induction and were filtered with 0.22-μm pore size polyethersulfone filters (Millipore, Fisher Scientific) to obtain a ‘cellular’ (≥0.22 μm) and a ‘viral’ (<0.22 µm) fraction.

The four types of samples (with or without induction, cellular and viral fractions) were prepared and imaged at the Molecular and Cellular Imaging Center, Ohio State University, Wooster, OH, USA. An equal volume of 2× fixative (6% glutaraldehyde and 2% paraformaldehyde in 0.1 M potassium phosphate buffer pH 7.2) was added directly to the culture post-induction. Of the medium, 30 μl was applied to a Formvar and carbon-coated copper grid for 5 min, blotted and then stained with 2% uranyl acetate for 1 min. Samples were examined with a Hitachi H7500 electron microscope and imaged with the SIA-L12C (16 megapixels) digital camera.

PCRs were initially run for induced and non-induced samples on both size fractions with three pairs of primers: one internal to the predicted provirus (B primers), one spanning the insertion site (P primers) and one spanning the junction of the predicted excised circular genome (C primers). The reactions were conducted for 35 cycles with denaturation, annealing and extension cycles of 0.5, 0.5 and 1.0 min at 95.0, 52.0 and 72.0 °C, respectively. For C primers, numerous nonspecific amplification products were obtained with these conditions, and another set of PCRs was conducted with higher annealing temperatures of 56.5 °C and 57.5 °C, both in triplicate. The PCR products were then cleaned to remove polymerase, free dNTPs and primers (Zymo Research) and subsequently used as templates for Sanger sequencing. The resulting chromatograms were analysed using the R⁵⁴ packages sangerseqR⁷⁹, sangeranalyseR⁸⁰ and readr⁸¹. The extracted primary sequences were aligned to the MobM genome using blastn⁴⁹ and MUSCLE⁴³, and the alignment was visualized with Jalview⁸².

Experimental characterization of hypothetical proteins from self-targeted Pseudomonas inoviruses

Hypothetical proteins predicted on inovirus prophages, which were (1) found in Pseudomonas genomes, (2) predicted to be targeted by at least one CRISPR spacer from the same genome, and (3) for which no acr locus could be identified anywhere else in the same genome, were selected for further functional characterization. The ten candidate genes were first codon optimized for expression in Pseudomonas using an empirically derived codon usage table. Codon optimization and removal of vendor-defined synthesis constraints were performed using BOOST⁸³. Synthetic DNA constructs were obtained from Thermo Fisher Scientific and cloned between the SacI and PstI sites of an Escherichia–Pseudomonas broad host range expression vector, pHERD30T⁸⁴. All gene constructs were sequence-verified before testing.

P. aeruginosa strains (PAO1::pLac I-C CRISPR–Cas, PA14 and 4386) were cultured on LB agar or liquid media at 37 °C. The pHERD30T plasmids were electroporated into P. aeruginosa strains, and LB was supplemented with 50 µg ml⁻¹ gentamicin to maintain the pHERD30T plasmid. Phages DMS3m, JBD30, D3, 14-1, Luz7 and KMV were amplified on PAO1, and phage JBD44a was amplified on PA14. All phages were stored in SM buffer at 4 °C in the presence of chloroform.

For phage titring, a bacterial lawn was first generated by spreading 6 ml of top agar seeded with 200 µl host bacteria on a LB agar plate supplemented with 10 mM MgSO₄, 50 µg ml⁻¹ gentamicin and 0.1% arabinose. The I-C cas genes in strain PAO1 were induced with 1 mM isopropyl-β-d-1-thiogalactopyranoside. Three microlitres of phage serially diluted in SM buffer was then spotted onto the lawn and incubated at 37 °C for 16 h. Growth rates were similar between cells transformed with an empty vector and cells transformed with a vector including a candidate gene, except for the two cases where no growth was observed after transformation (see Supplementary Notes).

Experimental confirmation of self-targeting lethality and trans-acting Acr activity from a co-infecting phage in a P. aeruginosa model

The effect of CRISPR targeting of an integrated inovirus prophage was assessed in the P. aeruginosa strain PA14, which naturally encodes an intact Pf1 inovirus prophage, and for which both natural CRISPR arrays were deleted (strain PA14 ∆CRISPR1/∆CRISPR2 (Pf1)). Host cells were transformed with plasmids encoding CRISPR spacers either targeting the Pf1 coat gene or without a target in the host genome. To generate these plasmids, complementary single-stranded oligos (IDT) were annealed and ligated into a linearized derivative of shuttle vector pHERD30T bearing I-F direct repeats in the multiple cloning site downstream of the pBAD promoter. PA14 lysogens were electroporated with 100 ng plasmid DNA, allowed to recover for 1 h in LB at 37 °C and plated on LB agar plates supplemented with 50 μg ml⁻¹ gentamicin and 0.1% arabinose. Colonies were enumerated after growth for 14 h at 37 °C. Transformation efficiency (TE) was calculated as colonies per microgram DNA, and the percentage TE was calculated by normalizing the TE of the CRISPR RNA-expressing plasmids to the TE of an empty vector.
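
The transformation efficiency calculation reduces to two ratios, shown here as a worked sketch. The colony counts in the usage note are invented for illustration and are not results from this study; the function names are hypothetical.

```python
def transformation_efficiency(colonies: int, dna_ng: float) -> float:
    """Colonies per microgram of plasmid DNA (1 ug = 1,000 ng)."""
    return colonies * 1000.0 / dna_ng


def percent_te(te_crrna_plasmid: float, te_empty_vector: float) -> float:
    """TE of a CRISPR RNA-expressing plasmid as a percentage of the TE
    of the empty vector."""
    return 100.0 * te_crrna_plasmid / te_empty_vector
```

For example, with 100 ng of DNA per electroporation, hypothetical counts of 20 colonies for a targeting plasmid and 400 colonies for the empty vector give efficiencies of 200 and 4,000 colonies per microgram, that is, a percentage TE of 5%.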

To evaluate the effect of an acr locus from a co-infecting prophage on self-targeted inoviruses, strain PA14 ∆CRISPR1/∆CRISPR2 (Pf1) was lysogenized with phage DMS3m_acrIF1 by streaking out cells from a solid plate infection and screening for colonies resistant to superinfection by DMS3m_acrIF1. Lysogeny was confirmed by prophage induction. The same plasmid transformation approach was then used to assess the effect of inovirus self-targeting on host cell viability.

Quantification and statistical analysis

Sequence similarity searches were conducted with thresholds of E-value ≤ 0.001 and bit score ≥ 30 or 50, the former being used mainly for short proteins. The different classifiers (random forest, conditional random forest and generalized linear model) used to identify inovirus sequences were evaluated using a tenfold cross-validation approach. For all boxplots, the lower and upper hinges correspond to the first and third quartiles, respectively, and the whiskers extend no further than ±1.5 times the interquartile range.
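
The boxplot convention described above can be reproduced numerically as in the sketch below. It assumes quartiles computed by linear interpolation (Python's `statistics.quantiles` with `method="inclusive"`), which may differ slightly from the hinge definition of the plotting software for small samples; `boxplot_stats` is a hypothetical helper.

```python
import statistics
from typing import Dict, Sequence


def boxplot_stats(values: Sequence[float]) -> Dict[str, float]:
    """Hinges (first and third quartiles), median and whisker ends.

    Whiskers reach the most extreme data points lying within
    1.5 x IQR of the hinges. Requires at least two values.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    whisker_low = min(v for v in data if v >= q1 - 1.5 * iqr)
    whisker_high = max(v for v in data if v <= q3 + 1.5 * iqr)
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": whisker_low, "whisker_high": whisker_high,
    }
```

Points beyond the whisker ends, such as the value 100 in the series 1 to 9 plus 100, would be drawn as individual outliers.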

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The following files are available at https://genome.jgi.doe.gov/portal/Inovirus/Inovirus.home.html: Gb_files_inoviruses.zip: GenBank files of all representative genomes for each inovirus species; Ref_PCs_inoviruses.zip: PCs from the references (raw fasta, alignment fasta and hmm profile); iPFs_inoviruses.zip: protein families from the extended inovirus data set (raw fasta, alignment fasta and hmm profile); MobM_C_primer_amplicon.fasta: multiple sequence alignment of the C primer products with the Methanolobus MobM genome (NZ_FOUJ01000007), confirming that C primer products span the junction of the excised genome. Accession numbers of all inovirus sequences used as reference are listed in Supplementary Table 1. Accession numbers of all genomes and metagenomes mined, including detailed information for each (meta)genome in which some inovirus sequences were detected are available in Supplementary Table 2. Finally, the list of all inovirus genome accession numbers, along with taxonomic and environmental distribution information, is provided in Supplementary Table 3.

Code availability

The set of scripts and models used to detect inovirus sequences is available at https://bitbucket.org/srouxjgi/inovirus/src/master/Inovirus_detector/.

Change history

11 February 2020
An amendment to this paper has been published and can be accessed via a link at the top of the paper.

References

Rakonjac, J., Bennett, N. J., Spagnuolo, J., Gagic, D. & Russel, M. Filamentous bacteriophage: biology, phage display and nanotechnology applications. Curr. Issues Mol. Biol. 13, 51–76 (2011).
Fauquet, C. M. The diversity of single stranded DNA. Virus Biodivers. 7, 38–44 (2006).
Marvin, D. A., Symmons, M. F. & Straus, S. K. Structure and assembly of filamentous bacteriophages. Prog. Biophys. Mol. Biol. 114, 80–122 (2014).
Bradbury, A. R. M. & Marks, J. D. Antibodies from phage antibody libraries. J. Immunol. Methods 290, 29–49 (2004).
Nam, K. T. et al. Stamped microbattery electrodes based on self-assembled M13 viruses. Proc. Natl Acad. Sci. USA 105, 17227–17231 (2008).
Ju, Z. & Sun, W. Drug delivery vectors based on filamentous bacteriophages and phage-mimetic nanoparticles. Drug Deliv. 24, 1898–1908 (2017).
Henry, K. A., Arbabi-Ghahroudi, M. & Scott, J. K. Beyond phage display: non-traditional applications of the filamentous bacteriophage as a vaccine carrier, therapeutic biologic, and bioconjugation scaffold. Front. Microbiol. 6, 755 (2015).
Ilyina, T. S. Filamentous bacteriophages and their role in the virulence and evolution of pathogenic bacteria. Mol. Genet. Microbiol. Virol. 30, 1–9 (2015).
Shapiro, J. W. & Turner, P. E. Evolution of mutualism from parasitism in experimental virus populations. Evolution 72, 707–712 (2018).
Sweere, J. M. et al. Bacteriophage trigger anti-viral immunity and prevent clearance of bacterial infection. Science 363, eaat9691 (2019).
Waldor, M. K. & Mekalanos, J. J. Lysogenic conversion by a filamentous phage encoding cholera toxin. Science 272, 1910–1914 (1996).
Faruque, S. M. & Mekalanos, J. J. Pathogenicity islands and phages in Vibrio cholerae evolution. Trends Microbiol. 11, 505–510 (2003).
Bille, E. et al. A virulence-associated filamentous bacteriophage of Neisseria meningitidis increases host-cell colonisation. PLoS Pathog. 13, e1006495 (2017).
Rice, S. A. et al. The biofilm life cycle and virulence of Pseudomonas aeruginosa are dependent on a filamentous prophage. ISME J. 3, 271–282 (2009).
Rakonjac, J. Filamentous bacteriophages: biology and applications. eLS https://doi.org/10.1002/9780470015902.a0000777 (2012).
Varani, A. M., Monteiro-Vitorello, C. B., Nakaya, H. I. & Van Sluys, M.-A. The role of prophage in plant-pathogenic bacteria. Annu. Rev. Phytopathol. 51, 429–451 (2013).
Mai-Prochnow, A. et al. ‘Big things in small packages: the genetics of filamentous phage and effects on fitness of their host’. FEMS Microbiol. Rev. 39, 465–487 (2015).
Páez-Espino, D. et al. Uncovering Earth’s virome. Nature 536, 425–430 (2016).
Páez-Espino, D., Pavlopoulos, G. A., Ivanova, N. N. & Kyrpides, N. C. Nontargeted virus sequence discovery pipeline and virus clustering for metagenomic data. Nat. Protoc. 12, 1673–1682 (2017).
Roux, S., Enault, F., Hurwitz, B. L. & Sullivan, M. B. VirSorter: mining viral signal from microbial genomic data. PeerJ 3, e985 (2015).
Brum, J. R. & Sullivan, M. B. Rising to the challenge: accelerated pace of discovery transforms marine virology. Nat. Rev. Microbiol. 13, 147–159 (2015).
Vega Thurber, R. V. et al. Laboratory procedures to generate viral metagenomes. Nat. Protoc. 4, 470–483 (2009).
Chen, I. M. A. et al. IMG/M: integrated genome and metagenome comparative data analysis system. Nucleic Acids Res. 45, D507–D516 (2017).
Kimura, M., Wang, G., Nakayama, N. & Asakawa, S. in Biocommunication in Soil Microorganisms (ed. Witzany, G.) 189–213 (Springer, 2011).
Kim, A. Y. & Blaschek, H. P. Isolation and characterization of a filamentous virus-like particle from Clostridium acetobutylicum NCIB-6444. J. Bacteriol. 173, 530–535 (1991).
Iranzo, J., Koonin, E. V., Prangishvili, D. & Krupovic, M. Bipartite network analysis of the archaeal virosphere: evolutionary connections between viruses and capsid-less mobile elements. J. Virol. 90, 11043–11055 (2016).
Prangishvili, D., Bamford, D. H., Forterre, P. & Iranzo, J. The enigmatic archaeal virosphere. Nat. Rev. Microbiol. 15, 724–739 (2017).
Krupovic, M., Cvirkaite-Krupovic, V., Iranzo, J., Prangishvili, D. & Koonin, E. V. Viruses of archaea: structural, functional, environmental and evolutionary genomics. Virus Res. 244, 181–193 (2018).
Garushyants, S. K., Kazanov, M. D. & Gelfand, M. S. Horizontal gene transfer and genome evolution in Methanosarcina. BMC Evol. Biol. 15, 102 (2015).
Mavrich, T. N. & Hatfull, G. F. Bacteriophage evolution differs by host, lifestyle and genome. Nat. Microbiol. 2, 17112 (2017).
Krupovic, M., Prangishvili, D., Hendrix, R. W. & Bamford, D. H. Genomics of bacterial and archaeal viruses: dynamics within the prokaryotic virosphere. Microbiol. Mol. Biol. Rev. 75, 610–635 (2011).
Iranzo, J., Krupovic, M. & Koonin, E. V. The double-stranded DNA virosphere as a modular hierarchical network of gene sharing. mBio 7, e00978-16 (2016).
Rosvall, M. & Bergstrom, C. T. Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems. PLoS ONE 6, e18209 (2011).
Wolf, Y. I. et al. Origins and evolution of the global RNA virome. mBio 9, e02329-18 (2018).
Koonin, E. V., Dolja, V. V. & Krupovic, M. Origins and evolution of viruses of eukaryotes: the ultimate modularity. Virology 479–480, 2–25 (2015).
Song, S. & Wood, T. K. Post-segregational killing and phage inhibition are not mediated by cell death through toxin/antitoxin systems. Front. Microbiol. 9, 814 (2018).
Marraffini, L. A. CRISPR–Cas immunity in prokaryotes. Nature 526, 55–61 (2015).
Borges, A. L., Davidson, A. R. & Bondy-Denomy, J. The discovery, mechanisms, and evolutionary impact of anti-CRISPRs. Annu. Rev. Virol. 4, 37–59 (2017).
Díaz-Muñoz, S. L., Sanjuán, R. & West, S. Sociovirology: conflict, cooperation, and communication among viruses. Cell Host Microbe 22, 437–441 (2017).
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
Brister, J. R., Ako-Adjei, D., Bao, Y. & Blinkova, O. NCBI viral genomes resource. Nucleic Acids Res. 43, D571–D577 (2015).
Delcher, A. L., Bratke, K. A., Powers, E. C. & Salzberg, S. L. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 23, 673–679 (2007).
Edgar, R. C. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113 (2004).
Remmert, M., Biegert, A., Hauser, A. & Söding, J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173–175 (2011).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
Petersen, T. N., Brunak, S., Von Heijne, G. & Nielsen, H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat. Methods 8, 785–786 (2011).
Krogh, A., Larsson, B., von Heijne, G. & Sonnhammer, E. L. L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567–580 (2001).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Liaw, A. & Wiener, M. Classification and regression by randomForest. R News 2, 18–22 (2002).
Strobl, C., Boulesteix, A. L., Kneib, T., Augustin, T. & Zeileis, A. Conditional variable importance for random forests. BMC Bioinformatics 9, 307 (2008).
Simon, N., Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J. Stat. Softw. 39, 1–13 (2011).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).
R Core Team R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2018).
Adriaenssens, E. M., Krupovic, M. & Knezevic, P. Taxonomy of prokaryotic viruses: 2016 update from the ICTV bacterial and archaeal viruses subcommittee. Arch. Virol. 162, 1153–1157 (2017).
Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
Article PubMed PubMed Central Google Scholar
Alva, V., Nam, S.-Z., Söding, J. & Lupas, A. N. The MPI bioinformatics toolkit as an integrative platform for advanced protein sequence and structure analysis. Nucleic Acids Res. 44, W410–W415 (2016).
Article CAS PubMed PubMed Central Google Scholar
Demchak, B. et al. Cytoscape: the network visualization tool for GenomeSpace workflows. F1000Res. 3, 151 (2014).
Article PubMed PubMed Central Google Scholar
Muhire, B. M., Varsani, A. & Martin, D. P. SDT: a virus classification tool based on pairwise sequence alignment and identity calculation. PLoS ONE 9, e108277 (2014).
Article PubMed PubMed Central CAS Google Scholar
Eloe-Fadrosh, E. A. et al. Global metagenomic survey reveals a new bacterial candidate phylum in geothermal springs. Nat. Commun. 7, 10476 (2016).
Article CAS PubMed PubMed Central Google Scholar
Yu, F. B. et al. Microfluidic-based mini-metagenomics enables discovery of novel microbial lineages from complex environmental samples. eLife 6, e26580 (2017).
Article PubMed PubMed Central Google Scholar
Parks, D. H. et al. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat. Microbiol. 2, 1533–1542 (2017).
Article CAS PubMed Google Scholar
Tatusov, R. L. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28, 33–36 (2000).
Article CAS PubMed PubMed Central Google Scholar
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Article CAS PubMed PubMed Central Google Scholar
Criscuolo, A. & Gribaldo, S. BMGE (block mapping and gathering with entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol. Biol. 10, 210 (2010).
Article PubMed PubMed Central CAS Google Scholar
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
Article PubMed PubMed Central CAS Google Scholar
Mukherjee, S. et al. Genomes Online Database (GOLD) v.6: data updates and feature enhancements. Nucleic Acids Res. 45, D446–D456 (2017).
Article CAS PubMed Google Scholar
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Article Google Scholar
Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Article PubMed PubMed Central CAS Google Scholar
Nguyen, L. T., Schmidt, H. A., Von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
Article CAS PubMed Google Scholar
Letunic, I. & Bork, P. Interactive Tree of Life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 44, W242–W245 (2016).
Article CAS PubMed PubMed Central Google Scholar
Krupovic, M. Networks of evolutionary interactions underlying the polyphyletic origin of ssDNA viruses. Curr. Opin. Virol. 3, 578–586 (2013).
Article CAS PubMed Google Scholar
Carr, S. B., Phillips, S. E. V. & Thomas, C. D. Structures of replication initiation proteins from staphylococcal antibiotic resistance plasmids reveal protein asymmetry and flexibility are necessary for replication. Nucleic Acids Res. 44, 2417–2428 (2016).
Article CAS PubMed PubMed Central Google Scholar
Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–D285 (2016).
Article CAS PubMed Google Scholar
Kelley, L. A., Mezulis, S., Yates, C., Wass, M. & Sternberg, M. The Phyre2 web portal for protein modelling, prediction, and analysis. Nat. Protoc. 10, 845–858 (2015).
Article CAS PubMed PubMed Central Google Scholar
Letunic, I. SMART 4.0: towards genomic data integration. Nucleic Acids Res. 32, 142D–144D (2004).
Article Google Scholar
Makarova, K. S. et al. An updated evolutionary classification of CRISPR–Cas systems. Nat. Rev. Microbiol. 13, 722–736 (2015).
Article CAS PubMed PubMed Central Google Scholar
Mochimaru, H. et al. Methanolobus profundi sp. nov., a methylotrophic methanogen isolated from deep subsurface sediments in a natural gas field. Int. J. Syst. Evol. Microbiol. 59, 714–718 (2009).
Article CAS PubMed Google Scholar
Hill, J. T. et al. Poly peak parser: method and software for identification of unknown indels using sanger sequencing of polymerase chain reaction products. Dev. Dyn. 43, 1632–1636 (2014).
Article CAS Google Scholar
Lanfear, R. sangeranalyseR: a suite of functions for the analysis of Sanger sequence data in R v.1.20.0 (2015).
Wickham, H., Hester, J. & Francois, R. readr: read rectangular text data v.1.3.1 (2017).
Waterhouse, A. M., Procter, J. B., Martin, D. M. A., Clamp, M. & Barton, G. J. Jalview version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics 25, 1189–1191 (2009).
Article CAS PubMed PubMed Central Google Scholar
Oberortner, E., Cheng, J. F., Hillson, N. J. & Deutsch, S. Streamlining the design-to-build transition with build-optimization software tools. ACS Synth. Biol. 6, 485–496 (2017).
Article CAS PubMed Google Scholar
Qiu, D., Damron, F. H., Mima, T., Schweizer, H. P. & Yu, H. D. PBAD-based shuttle vectors for functional analysis of toxic and highly regulated genes in Pseudomonas and Burkholderia spp. and other bacteria. Appl. Environ. Microbiol. 74, 7422–7426 (2008).
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

The MobM strain was provided by D. J. Ferguson, Miami University, Oxford, OH, USA. Its genome was sequenced and assembled by the US Department of Energy Joint Genome Institute through a Community Science Program initiative to K.C.W. (CSP no. 1777). T. Meulia at the Molecular and Cellular Imaging Center, Ohio State University, Wooster, OH, USA, performed the transmission electron microscopy of MobM samples. We gratefully acknowledge the contributions of many principal investigators who sent extracted DNA for isolate genome and metagenome sequencing as part of the Department of Energy Joint Genome Institute Community Science Program, and allowed us to include in our study the inovirus sequences detected in these publicly available data sets regardless of publication status (the complete list of data sets in which inovirus sequences were detected, including principal investigators, is available in Supplementary Table 2). This work was conducted by the US Department of Energy Joint Genome Institute, a Department of Energy Office of Science User Facility, under contract no. DE-AC02–05CH11231 and used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the US Department of Energy under contract no. DE-AC02–05CH11231. R.A.D. and K.C.W. were partially supported by funding from the National Science Foundation Dimensions of Biodiversity (Award 1342701). M.K. was supported by l’Agence Nationale de la Recherche (France) project ENVIRA (ANR-17-CE15-0005-01). The Bondy-Denomy lab (A.L.B. and J.B.-D.) is supported by the UCSF Program for Breakthrough in Biomedical Research, funded in part by the Sandler Foundation, the NIH Office of the Director (DP5-OD021344) and NIGMS (R01GM127489). Research of P.B.M.C. was funded by the US Department of Energy award DE-AC02–05CH11231. Funding for A.S. was provided by the National Science Foundation grant EAR 1331940 (the Eel River Critical Zone Observatory).

Author information

Authors and Affiliations

DOE Joint Genome Institute, Walnut Creek, CA, USA
Simon Roux, Stephen Nayfach, Frederik Schulz, Jan-Fang Cheng, Natalia N. Ivanova, Tanja Woyke, Axel Visel, Nikos C. Kyrpides & Emiley A. Eloe-Fadrosh
Department of Microbiology, Institut Pasteur, Paris, France
Mart Krupovic
Department of Soil and Crop Sciences, Colorado State University, Fort Collins, CO, USA
Rebecca A. Daly & Kelly C. Wrighton
Department of Microbiology and Immunology, University of California, San Francisco, San Francisco, CA, USA
Adair L. Borges & Joseph Bondy-Denomy
Department of Earth & Planetary Sciences, University of California, Berkeley, Berkeley, CA, USA
Allison Sharrar & Paula B. Matheus Carnevali
Quantitative Biosciences Institute, University of California, San Francisco, San Francisco, CA, USA
Joseph Bondy-Denomy

Authors

Simon Roux
Mart Krupovic
Rebecca A. Daly
Adair L. Borges
Stephen Nayfach
Frederik Schulz
Allison Sharrar
Paula B. Matheus Carnevali
Jan-Fang Cheng
Natalia N. Ivanova
Joseph Bondy-Denomy
Kelly C. Wrighton
Tanja Woyke
Axel Visel
Nikos C. Kyrpides
Emiley A. Eloe-Fadrosh

Contributions

S.R., M.K. and E.A.E.-F. conceived the study. R.A.D. and A.L.B. designed the archaeal inovirus induction and functional characterization of inovirus genes in Pseudomonas experiments, respectively. A.S. and P.B.M.C. contributed unpublished metagenomic data. S.R. and M.K. performed the data and metadata curation. S.R. developed the computational tools. S.N. and F.S. contributed additional computational analyses. R.A.D., J.-F.C. and A.L.B. performed the experiments. S.R., M.K., R.A.D., A.L.B., S.N., F.S., J.-F.C., N.N.I., J.B.-D., A.V., N.C.K. and E.A.E.-F. designed and wrote the manuscript. All authors reviewed and corrected the final manuscript.

Corresponding authors

Correspondence to Simon Roux or Emiley A. Eloe-Fadrosh.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes, Supplementary Figs. 1–12, Supplementary Table legends and Supplementary References.

Reporting Summary

Supplementary Table 1

List and characteristics of reference inovirus genomes used in this study.

Supplementary Table 2

List of genomes and metagenomes mined.

Supplementary Table 3

Classification of inovirus sequences into species, proposed families and proposed subfamilies.

Supplementary Table 4

Additional indication of inovirus infection for 20 phylum-level putative host groups.

Supplementary Table 5

Functional annotation of protein families (iPFs).

Supplementary Table 6

List of matches between inovirus sequences and IMG CRISPR spacer database.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Roux, S., Krupovic, M., Daly, R.A. et al. Cryptic inoviruses revealed as pervasive in bacteria and archaea across Earth’s biomes. Nat Microbiol 4, 1895–1906 (2019). https://doi.org/10.1038/s41564-019-0510-x

Received: 12 February 2019
Accepted: 05 June 2019
Published: 22 July 2019
Issue Date: November 2019
DOI: https://doi.org/10.1038/s41564-019-0510-x
