Multiple origins of prokaryotic and eukaryotic single-stranded DNA viruses from bacterial and archaeal plasmids

Single-stranded (ss) DNA viruses are a major component of the Earth's virome. In particular, the circular, Rep-encoding ssDNA (CRESS-DNA) viruses show high diversity and abundance in various habitats. By combining sequence similarity network and phylogenetic analyses of the replication proteins (Rep) belonging to the HUH endonuclease superfamily, we show that the replication machinery of the CRESS-DNA viruses evolved, on three independent occasions, from the Reps of bacterial rolling circle-replicating plasmids. The CRESS-DNA viruses emerged via recombination between such plasmids and cDNA copies of capsid genes of eukaryotic positive-sense RNA viruses. Similarly, the rep genes of prokaryotic DNA viruses appear to have evolved from HUH endonuclease genes of various bacterial and archaeal plasmids. Our findings also suggest that eukaryotic polyomaviruses and papillomaviruses with dsDNA genomes have evolved via parvoviruses from CRESS-DNA viruses. Collectively, our results shed light on the complex evolutionary history of a major class of viruses, revealing its polyphyletic origins.


SUPPLEMENTARY NOTE 1
Analysis of bacterial mobile genetic elements encoding viral-like Reps
The majority of the Reps from pCRESS7 and pCRESS9, both represented in members of the phylum Tenericutes, are encoded by extrachromosomal plasmids (Supplementary table 1). By contrast, only 6 of the 237 (2.5%) Reps found in the other groups are plasmid-borne (Supplementary table 1), with the rest being encoded in bacterial chromosomes. To characterize the provenance and potential function of these Reps, we performed a detailed genomic context analysis of selected genes from each Rep group. In the majority of cases in which the Reps were encoded on sufficiently large genomic contigs, the rep gene was located in the vicinity of a gene encoding an integrase of the tyrosine recombinase superfamily (Figure 2C). Further analysis showed that the loci encompassing the two genes, as well as a variable number of other genes, were flanked by direct repeats corresponding to the attachment sites (Supplementary figure 3a-i, Supplementary table 1), a typical feature of integration of circular dsDNA molecules mediated by tyrosine recombinases 1. The majority of the analyzed mobile genetic elements were integrated into diverse tRNA genes. However, we also identified several elements integrated into protein-coding genes. For instance, elements encoding Reps from pCRESS2, -6 and -8 have recombined with the 3'-distal region of the gene encoding 30S ribosomal protein S9 (Supplementary table 1).
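The attachment-site signature described above, an integrated element bounded by a pair of direct repeats, can be illustrated with a minimal sketch. The function name `find_flanking_direct_repeats`, the minimum repeat length, and the toy sequence are assumptions made for illustration only; they are not the procedure, parameters, or data used in this study.

```python
def find_flanking_direct_repeats(seq, min_len=8):
    """Return (repeat, left_start, right_start) for the longest exact direct
    repeat that occurs twice without overlapping, or None if there is none.

    Brute-force search intended for short toy sequences only.
    """
    seq = seq.upper()
    n = len(seq)
    # Try the longest candidates first; two non-overlapping copies of a
    # repeat cannot be longer than half of the sequence.
    for k in range(n // 2, min_len - 1, -1):
        first_seen = {}
        for i in range(n - k + 1):
            kmer = seq[i:i + k]
            j = first_seen.setdefault(kmer, i)  # position of first copy
            if i - j >= k:  # second copy does not overlap the first
                return kmer, j, i
    return None


# Hypothetical toy locus: flank + att + element body + att + flank.
ATT = "TTCGAATCCGTA"
toy_locus = "CAGT" + ATT + "ATGAAACGCTGGCATTAG" + ATT + "GACC"
hit = find_flanking_direct_repeats(toy_locus)
if hit is not None:
    repeat, left, right = hit
    print(f"repeat {repeat} at {left} and {right}; "
          f"element spans {left}..{right + len(repeat)}")
```

In practice, attachment sites are often imperfect repeats, so an exact-match search of this kind would only serve as a first pass.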
The Rep-encoding elements carry from 2 (e.g., plasmid pRGRH0065) to over 20 genes (e.g., element LacLac-E1; Supplementary figure 3). Despite this variability, based on the shared content of signature genes, the elements could be assigned to one of three families: (i) elements from the Rep-based groups pCRESS1-3, -5, -8 and certain members of pCRESS6 share genes encoding FtsK-like DNA segregation ATPases (cl28087), single-stranded DNA binding (SSB) proteins and, occasionally, a Sec10/PgrA surface exclusion domain protein, which specifically inhibits the ability of cells to take up homologous plasmids (IPR027607); (ii) elements from pCRESS4, -6 and -7 encode a conserved MOBV-family plasmid mobilization protein (PF01076); (iii) all elements carried by phytoplasma, irrespective of their placement in the Rep-based phylogeny, encode plasmid copy number control proteins 2, a distinct SSB protein (PRK06752), and a conserved protein of unknown function (Figure 2C). Notably, none of the elements encoded any homologs of currently known viral structural proteins. Such a pattern of gene sharing and its incongruence with the Rep-based grouping (Figure 1) are consistent with recombination and horizontal spread of these elements between different bacterial species. The latter conclusion is supported by phylogenetic analysis of the Rep sequences from each of the 9 pCRESS groups (Supplementary figure 3a-i). A notable case of horizontal plasmid transfer is observed in pCRESS7, where two elements found in alpha-proteobacteria and gamma-proteobacteria are nested among elements from Clostridia. Collectively, these observations indicate that viral-like Reps in bacteria are encoded by diverse extrachromosomal and integrated plasmids.
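The family assignment by signature genes can be sketched as a simple lookup on domain accessions. This is a hedged illustration, not the annotation pipeline used here: the names `SIGNATURES` and `assign_family`, the family labels, and the rule of using one accession per family are assumptions, and only the accessions quoted in the text (cl28087, PF01076, PRK06752) are used.

```python
# One signature accession per family, taken from the text above.
# Using a single accession per family is a simplifying assumption.
SIGNATURES = {
    "family_i": {"cl28087"},     # FtsK-like DNA segregation ATPase
    "family_ii": {"PF01076"},    # MOBV-family mobilization protein
    "family_iii": {"PRK06752"},  # distinct SSB protein (phytoplasma elements)
}


def assign_family(domain_hits):
    """Assign an element to a family from the set of domain accessions
    detected among its genes (hypothetical input format)."""
    hits = set(domain_hits)
    matches = sorted(name for name, sig in SIGNATURES.items() if sig & hits)
    if len(matches) == 1:
        return matches[0]
    # More than one signature suggests a mosaic element; none means the
    # element carries no signature gene that this sketch recognizes.
    return "ambiguous" if matches else "unassigned"


# Hypothetical elements described only by their domain hits.
print(assign_family({"cl28087", "IPR027607"}))
print(assign_family({"PF01076"}))
print(assign_family(set()))
```

Returning "ambiguous" rather than picking one family reflects the gene sharing between groups noted above, where an element could in principle carry signatures of more than one family.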

Sequence motifs shared by bacterial and CRESS-DNA virus Reps
A detailed comparison of the conserved motifs in the nuclease and helicase domains of the viral-like bacterial and CRESS-DNA virus Reps (Figure 3) supports the inferences made from the clustering analysis (Figure 1). An asparagine in motif C of the helicase domain, which interacts with the γ-phosphate of ATP and a nucleophilic water molecule 3, is conserved across all known CRESS-DNA viruses as well as the bacterial Reps of pCRESS2 and pCRESS3, but not in pCRESS1 or the YLxH supergroup. pCRESS3 bacterial Reps are most similar to those of smacoviruses, nanoviruses and circoviruses. Most notably, these proteins share a unique modification of motif II, HUQ, in the nuclease domain, which is not found in other known prokaryotic plasmid or virus Reps. By contrast, the Reps of algae-infecting bacilladnaviruses are more similar to the bacterial pCRESS2 Reps, especially within the helicase domain, where the conserved aspartate residues in the Walker B motif are replaced by glutamates. The Reps of uncultivated gastropod-associated circular DNA viruses (GasCSV) appear to be chimeric with respect to the nuclease and helicase domains. Whereas the former is more similar to the nuclease domain of circoviruses, the latter shows the highest similarity to the helicase domain of pCRESS2 bacterial Reps. A similar recombination hotspot between the two domains has been previously observed in many uncultivated CRESS-DNA viruses 4,5. Sequence motifs of pCRESS9 and P. pulchra plasmid Reps are most similar to those of geminiviruses and genomoviruses. Furthermore, all Rep sequences in this assemblage share the GRS motif located between motifs II and III (Figure 3), which is not found in other CRESS-DNA viruses and is thought to enable the appropriate spatial arrangement of motifs II and III 6,7.
Another synapomorphic character shared by geminiviruses, genomoviruses and pCRESS9 plasmids is the replacement of the arginine finger, which is conserved in the helicase domain of other CRESS-DNA viruses 8, with an asparagine residue. Notably, in P. pulchra plasmids, the arginine finger motif is well conserved, suggesting an ancestral position of this group with respect to geminiviruses, genomoviruses and pCRESS9.
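As an illustration of how the motif II variants discussed above (canonical HUH versus the HUQ modification) could be distinguished in a sequence segment, a minimal sketch follows. It assumes that "U" stands for a hydrophobic residue and uses an arbitrary residue set for it. The function name `classify_motif_ii` and the toy peptide segments are hypothetical; the segments are not real Rep sequences.

```python
import re

# Assumption: residues accepted at the "U" (hydrophobic) position.
HYDROPHOBIC = "AVILMFWYC"

MOTIF_II_VARIANTS = {
    "HUH": re.compile(f"H[{HYDROPHOBIC}]H"),  # canonical motif II
    "HUQ": re.compile(f"H[{HYDROPHOBIC}]Q"),  # variant with Gln in place of the second His
}


def classify_motif_ii(segment):
    """Report which motif II variant occurs in a short protein segment.

    The segment is expected to be the aligned motif II region; scanning a
    full-length protein this way would produce spurious matches.
    """
    segment = segment.upper()
    found = [name for name, pattern in MOTIF_II_VARIANTS.items()
             if pattern.search(segment)]
    if len(found) == 1:
        return found[0]
    return "ambiguous" if found else "none"


# Hypothetical toy segments, for illustration only.
for toy in ("TPHLHVL", "TPHLQGY", "GGSGG"):
    print(toy, classify_motif_ii(toy))
```

The same pattern-matching approach could be extended to the other characters mentioned in the text, such as the Walker B acidic residues or the arginine finger, given aligned motif regions.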

Further phylogenetic and statistical validation of the Rep tree topology
To test the robustness of the PhyML tree, we performed the following additional analyses: (i) maximum likelihood phylogenies were constructed using other methods, namely, RAxML and IQ-Tree, with alternative branch support methods, including the classical bootstrap and the more recently introduced ultrafast bootstrap procedures; (ii) phylogeny was reconstructed using the 20-profile mixture model which, similar to Bayesian CAT models but in the maximum likelihood framework, allows 20 substitution models along the sequences in the alignment 9; (iii) statistical analysis of the unconstrained and three constrained tree topologies was performed. The IQ-Tree and RAxML trees had topologies nearly identical to that of the PhyML tree, although the branch support values estimated with the bootstrap procedure for the RAxML tree were slightly lower than the aBayes and ultrafast bootstrap values for the PhyML and IQ-Tree trees, respectively. To account for potential differences in site-specific amino acid replacement patterns, we used the C20 mixture model, which yielded a topology nearly identical to that obtained in the single-model maximum likelihood analyses (Figure 5 and Figure S5). To further scrutinize the robustness of the phylogenetic tree, we constructed a set of constrained trees with alternative topologies and compared these to the unconstrained tree using several statistical tests, including the approximately unbiased test 10. All tests rejected the trees with alternative topologies (Supplementary table 2). Collectively, these results indicate that the obtained tree topology is highly robust and is likely to accurately reflect the evolutionary history of Reps encoded by CRESS-DNA viruses and plasmids.
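One simple way to quantify how close two tree topologies are is to compare the sets of clades they contain. The sketch below does this for small rooted trees written as nested tuples. It is an illustrative assumption, not the method used here: it is unrelated to the approximately unbiased test and the other statistical tests reported above, and the function names, the tree encoding, and the leaf labels A to E are all hypothetical.

```python
def clades(tree):
    """Return the set of clades (frozensets of leaf names) defined by the
    internal nodes of a rooted tree given as nested tuples of strings."""
    found = set()

    def leaves(node):
        if isinstance(node, str):  # a leaf
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def shared_clade_fraction(tree_a, tree_b):
    """Fraction of clades shared by two trees (Jaccard index of clade sets);
    1.0 means the rooted topologies are identical."""
    a, b = clades(tree_a), clades(tree_b)
    return len(a & b) / len(a | b)


# Hypothetical toy trees over the same five leaves.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = (("E", "D"), ("C", ("B", "A")))  # same topology, branches rotated
t3 = ((("A", "C"), "B"), ("D", "E"))  # B and C swapped
print(shared_clade_fraction(t1, t2))
print(shared_clade_fraction(t1, t3))
```

A clade-based comparison ignores branch lengths and support values, so it complements rather than replaces likelihood-based topology tests.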

SUPPLEMENTARY TABLES
Supplementary table 1. Characterization of the integrated and extrachromosomal plasmids from pCRESS1-9, including information on their size, integration coordinates, integration targets, and size of the attachment sites.

figure 6. Maximum likelihood phylogenetic trees of Rep proteins using different sequence sampling: a) pE194/pMV158 cluster is absent; b) pE194/pMV158 cluster is present. The trees were constructed using PhyML.