Introduction

Large DNA viruses carry genes for their own DNA repair apparatus to enhance the accuracy of genome replication (Furuta et al., 1997; Srinivasan and Tripathy, 2005; Redrejo-Rodriguez et al., 2009; Bogani et al., 2010). The amoeba-infecting Mimivirus (Acanthamoeba polyphaga mimivirus, APMV), which has the largest genome (1.2 Mb) of all known viruses, carries eight putative genes for DNA repair enzymes capable of correcting mismatches or errors induced by oxidation, UV irradiation and alkylating agents (Raoult et al., 2004). Most of these genes had never been found in a viral genome before their discovery in Mimivirus. One of them is a MutS homolog (open reading frame (ORF) L359) predicted to function in DNA mismatch repair (MMR) or recombination. MMR recognizes and corrects base–base mismatches and small insertion or deletion loops introduced during replication, leading to a 50- to 1000-fold enhancement of replication fidelity in cellular organisms (Schofield and Hsieh, 2003; Iyer et al., 2006). The best-studied MMR system is the Escherichia coli MutS–MutL–MutH pathway. In the first step of this pathway, the MutS homodimer binds the site of a mismatch (or a loop) in double-stranded DNA. The MutS protein then recruits the ‘linker protein’ MutL, and together they activate the endonuclease MutH, which specifically nicks the newly synthesized DNA strand to initiate the DNA excision and resynthesis pathway. Homologs of E. coli MutS have been found in many species of bacteria, archaea and eukaryotes, and are together classified in the MutS family (Eisen, 1998; Lin et al., 2007). In viruses, MutS homologs have only been found in Mimivirus, the closely related Mamavirus (La Scola et al., 2008; Yutin et al., 2009), and more recently in the giant marine virus, Cafeteria roenbergensis virus (CroV) with a 730-kb genome (Fischer et al., 2010).

The phyletic distribution of the close homologs of Mimivirus MutS is notable. The Mimivirus MutS homolog is most closely related to the homologs found in the mitochondrial genomes of a group of animals (that is, octocorals) and several genomes of the ɛ-Proteobacteria such as Sulfurimonas, Nitratiruptor and Arcobacter (Claverie et al., 2006, 2009). Octocorals (phylum Cnidaria, class Anthozoa, subclass Octocorallia) include diverse species of corals (for example, soft corals, sea fans, sea pens), representing important members of marine communities from shallow tropical coral reefs to the deep sea (McFadden et al., 2006). A mutS homolog has been found encoded in the mitochondria of all octocorals, including the three major orders Alcyonacea, Helioporacea and Pennatulacea, but not in the mitochondrial genomes of any other eukaryotes, including those of the sister subclass Hexacorallia (for example, stony corals, sea anemones) (Pont-Kingdon et al., 1995; Brugler and France, 2008). The ɛ-proteobacteria Sulfurimonas and Nitratiruptor are sulfur-oxidizing chemoautotrophs and are often found in deep-sea hydrothermal vents or coastal sediments (Nakagawa et al., 2007; Sievert et al., 2008). Arcobacter includes species of water-borne pathogens and is taxonomically close to Campylobacter jejuni and Helicobacter pylori (Miller et al., 2007). The common origin of these MutS homologs is further suggested by their atypical domain organization. Distinct from all other MutS family proteins, the MutS homologs in Mimivirus, octocorals and the ɛ-proteobacteria are fused with a C-terminal HNH nicking endonuclease domain (Malik and Henikoff, 2000; Claverie et al., 2009). The domain fusion was predicted to make these enzymes a ‘self-contained’ single polypeptide having both mismatch recognition (MutS) and nicking (MutH) functions (Malik and Henikoff, 2000). The distribution of these unique MutS homologs is thus limited to a few totally unrelated lineages (that is, Mimivirus, a single subclass of animals, and the ɛ-Proteobacteria) and suggests the occurrence of gene transfer between their ancestors (Claverie et al., 2009). We introduce the MutS7 subfamily to denote this specific group of MutS proteins.

DNA viruses with genomes greater than 300 kb up to 1.2 Mb are being discovered with increasing frequency from diverse ecosystems, with many of them now being subject to genome sequencing analysis (La Scola et al., 2010; Van Etten et al., 2010). The double-stranded DNA (dsDNA) genomes of these giant viruses (often called ‘giruses’ (Claverie et al., 2006; Claverie and Ogata, 2009)) show a high coding potential, with several hundred or more genes densely packed in their genomes. To investigate the presence of Mimivirus-like mutS genes in other giruses, we have undertaken a genomic sequencing survey of four giruses previously isolated from marine environments. The four giruses investigated are Pyramimonas orientalis virus (PoV-01B, 560-kb genome), Phaeocystis pouchetii virus (PpV-01B, 485-kb genome), Chrysochromulina ericina virus (CeV-01B, 510-kb genome) and Heterocapsa circularisquama virus (HcDNAV, 356-kb genome) (Jacobsen et al., 1996; Sandaa et al., 2001; Tarutani et al., 2001). The hosts of these viruses are phylogenetically distant and ecologically distinct unicellular marine algae. C. ericina and P. pouchetii are haptophytes classified in different orders, that is, Prymnesiales and Phaeocystales, respectively. C. ericina has a worldwide distribution; it occurs most commonly in low numbers but has been observed to form blooms together with other Chrysochromulina species (Simonsen and Moestrup, 1997). P. pouchetii occurs both as free-swimming flagellated cells and as non-flagellated cells embedded in gelatinous colonies that form dense blooms in polar and sub-polar regions. P. orientalis is a non-blooming prasinophyte belonging to the green algae (Chlorophyta). H. circularisquama is a small thecate dinoflagellate that frequently forms large-scale red tides in Japan, causing mass mortality of shellfish (Tarutani et al., 2001). These four giruses are all lytic viruses belonging to the nucleo-cytoplasmic large DNA virus (NCLDV) superfamily (Yutin et al., 2009). Phylogenetic analysis of DNA polymerase and major capsid protein sequences has revealed that PoV, PpV and CeV form a monophyletic clade that clusters together with Mimivirus (Larsen et al., 2008; Monier et al., 2008). In contrast, the DNA polymerase sequence of HcDNAV has been found to be closely related to that of African swine fever virus (Ogata et al., 2009), suggesting that HcDNAV is phylogenetically distant from the other viruses included in this study.

With the accumulation of genome sequences and the ensuing phylogenetic studies during the last decade, significant advances have been made in the classification of diverse MutS proteins (Eisen, 1998; Lin et al., 2007). In this study, we use the following naming of the MutS subfamilies, which is adapted from the recent study by Lin et al. (2007). In total, 12 subfamilies were previously described to compose the MutS family: ‘MutS1/MSH1’ including E. coli MutS and the mitochondria-targeted fungal MutS homolog 1 (MSH1); ‘MutS2’, known to inhibit recombination in H. pylori (Pinto et al., 2005) and to possess a C-terminal endonuclease domain called the small MutS-related (Smr) domain (Moreira and Philippe, 1999; Fukui et al., 2008); ‘MSH2’, ‘MSH3’, ‘MSH4’, ‘MSH5’ and ‘MSH6/7’, found in most eukaryotes (with the exception of MSH7 being a plant-specific paralogous group of MSH6 (Wu et al., 2003)); another plant-specific MSH1 (called ‘plt-MSH1’ hereafter) with the GIY-YIG endonuclease domain at its C-terminus (Abdelnoor et al., 2006); ‘MutS3’, ‘MutS4’ and ‘MutS5’, recently described but functionally uncharacterized prokaryotic homologs (Lin et al., 2007); and the above-mentioned ‘MutS7’ subfamily represented by the Mimivirus MutS homolog.

In this report, we analyze the MutS sequences newly identified in four giruses, and assess the abundance of their homologs in environmental sequence databases.

Materials and methods

Girus MutS sequences

As part of an ongoing genome-sequencing project, we obtained assembled contigs for three previously isolated dsDNA viruses, Pyramimonas orientalis virus (PoV-01B, 141 contigs), Phaeocystis pouchetii virus (PpV-01B, 287 contigs) and Chrysochromulina ericina virus (CeV-01B, 234 contigs) (Larsen et al., 2008). This sequence information will be published elsewhere. In the present study, we scanned these girus contigs for the presence of MutS homologs. Two complete MutS ORFs were readily identified in each of the PoV and PpV contig sets. Part of a contig corresponding to the CeV MutS ORF was targeted for PCR amplification using overlapping sets of primers and re-sequenced to resolve ambiguities in the contig (see Supplementary Table S1). A fragmented ORF for the HcDNAV MutS was identified in the previously described low-coverage shotgun sequencing data (Ogata et al., 2009). We obtained a complete ORF for the HcDNAV MutS after several trials of TAIL-PCR and sequencing. These sequences were submitted to public DNA databases (DDBJ: AB587728; EMBL: FR691705-FR691709). The MutS sequence from CroV (crov486, YP_003970119), which became available only recently, was partially included in our analysis.

Bioinformatics analysis

Reference MutS sequences, except the girus sequences determined in this study, were retrieved from the UniProt protein sequence database (as of April 27, 2010) (UniProtConsortium, 2010). Sequences were selected to maximize the coverage of diverse MutS subfamilies, referring to previous publications (Eisen, 1998; Lin et al., 2007), through an iterative process involving clustering by BLASTCLUST (Altschul et al., 1997), inspection of sequence alignments and phylogenetic reconstruction. We used T-Coffee version 8.06 (Notredame et al., 2000) for multiple sequence alignment and ClustalX (Larkin et al., 2007) for the visualization of alignments. All gap-containing sites were removed from the alignments before the phylogenetic analyses. Maximum-likelihood phylogenetic analyses were performed using PhyML version 3.0 (Guindon and Gascuel, 2003) with the LG substitution matrix (Le and Gascuel, 2008) and a gamma law (four rate categories). We used ProtTest version 2.2 (Abascal et al., 2005) to determine the best substitution model (that is, LG) for our phylogenetic reconstruction based on the MutS domain V sequences. Phylogenetic trees were drawn using MEGA version 4 (Kumar et al., 2008). For the delineation of the sequence domains, we used HMMER/HMMSEARCH version 2.3.2 (Eddy, 1996) and PSI-BLAST (Altschul et al., 1997). The assignment of environmental sequences on the MutS7 and MutS8 subtrees was performed using a maximum-likelihood method implemented in the ‘phylogenetic placement’ software developed by Matsen et al. (pplacer version 1.0; http://matsen.fhcrc.org/pplacer/). The results were visualized using Archaeopteryx version 0.957 (http://www.phylosoft.org/archaeopteryx/) (Han and Zmasek, 2009). Correspondence analysis of codon usages was performed using CodonW version 1.3 (http://codonw.sourceforge.net/).
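
The removal of gap-containing sites can be illustrated with a minimal Python sketch. This is not the script used in this study; the function name `strip_gap_columns` and the toy sequences are hypothetical, and the alignment is assumed to be held in memory as a mapping from sequence name to aligned string.

```python
def strip_gap_columns(aligned, gap_chars="-."):
    """Return a copy of the alignment without any column that contains a gap.

    `aligned` maps sequence names to aligned strings of equal length.
    """
    seqs = list(aligned.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column only if no sequence has a gap character at that position.
    keep = [i for i in range(length)
            if all(s[i] not in gap_chars for s in seqs)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in aligned.items()}


# Toy alignment (hypothetical sequences, for illustration only).
toy = {
    "seqA": "MK-LFE",
    "seqB": "MKALFE",
    "seqC": "MRAL-E",
}
stripped = strip_gap_columns(toy)
for name, seq in stripped.items():
    print(name, seq)
```

Complete-deletion filtering of this kind shortens the alignment but ensures that every retained site is represented in all sequences.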

Results

Two types of MutS homologs in giruses

We identified six ORFs similar to known MutS family proteins in the analyzed viral genomic sequences. These ORFs were classified into two groups according to their length and sequence similarity. The first group of ORFs was relatively long and was found in all the analyzed giruses (PoV, 910 amino-acid residues (aa); PpV, 1004 aa; CeV, 1043 aa; HcDNAV, 953 aa). When searched against the NCBI non-redundant protein sequence database using BLAST (Altschul et al., 1997), these girus MutS homologs showed the most significant sequence similarities to the MutS7 homologs in Mimivirus (amino-acid sequence identity 31–38%; alignment coverage 34–99%; E-value = 10^−63 to 10^−120), ɛ-proteobacteria (29–37%; 95–99%; 10^−100 to 10^−153) and octocorals (26–28%; 96–99%; 10^−67 to 10^−84). Like previously reported MutS7 homologs, these four predicted proteins were found to possess a C-terminal HNH endonuclease domain (Supplementary Figure S1). The second group of shorter ORFs similar to MutS proteins was found in PoV (539 aa) and PpV (600 aa). These PoV and PpV MutS homologs showed the most significant sequence similarity to sequences from ‘Candidatus Amoebophilus asiaticus’ (Aasi_0916; amino-acid sequence identity 38%; alignment coverage 39%; E-value = 2 × 10^−26) and Clostridium perfringens (YP_694765.1; 34%; 32%; 2 × 10^−17), respectively.

Girus MutS homologs correspond to two distinct subfamilies

To classify the newly identified girus MutS homologs, we compiled a reference sequence set containing 150 MutS homologs, representing diverse MutS subfamilies, and performed phylogenetic analyses. Our analyses revealed 15 distinct clades, 12 of which corresponded to the previously described MutS subfamilies (Figure 1a and Supplementary Figure S2). The newly identified four girus MutS homologs of the first group (that is, those with longer amino-acid sequences) were found within the MutS7 group (Figure 1b). The other MutS homologs of the second group (with shorter amino-acid sequences) were grouped in none of the previously documented subfamilies but with two paralogous sequences from ‘Amoebophilus asiaticus’ (Figure 1c). This bacterium is an obligate intracellular amoeba symbiont belonging to the Bacteroidetes. We use MutS8 to denote this new group of MutS homologs. In addition, we identified two previously undescribed subfamilies found only in bacteria. These subfamilies are referred to as MutS6 and MutS9.

Figure 1

Maximum-likelihood phylogenetic tree of MutS family proteins. (a) Phylogenetic tree covering diverse MutS subfamilies including the newly identified MutS6, MutS8 and MutS9. The tree is based on the alignment of the MutS domain V sequences. (b) Phylogenetic tree of MutS7 homologs based on the conserved sequences between MutS1 and MutS7. (c) Phylogenetic tree of MutS8 homologs based on the conserved sequences between MutS1 and MutS8. The trees in panels b and c are rooted with MutS1 sequences as the outgroup. Branches with bootstrap values >75% are indicated by black dots. The color code for branches and sequence names is as follows: Bacteria (blue), Archaea (light blue), Eukaryotes (green), Giruses (red). Scale bars correspond to 0.5 substitutions per site. The recently described CroV MutS was only included in b.

Next, we determined the sequence domain architecture of MutS subfamilies with the use of position-specific scoring matrices corresponding to eight domains known to be present in MutS homologs (Figure 2 and Supplementary Figure S3). Sequence length and domain architecture were found to be comparable within individual MutS subfamilies but could differ greatly across subfamilies. Outside of these identified domains, no residual similarity was found between different subfamilies (BLAST E-value <10^−5), corroborating the classification of MutS proteins based on our phylogenetic analysis.

Figure 2

Domain architecture of MutS family proteins. The drawing represents the typical sequence domain organizations of MutS subfamilies (approximately scaled). A larger set of sequences is depicted in Supplementary Figure S3. Position-specific scoring matrices used for the delineation of sequence domains are as follows: MutS domain I (pfam01624), II (pfam05188), III (pfam05192), IV (pfam05190), V (pfam00488), GIY-YIG endonuclease (pfam01541), Smr (pfam01713) and HNH-endonuclease (pfam01844).

MutS7 was found to contain at least five known domains including the N-terminal MutS domain I. Domain I of bacterial MutS1 is known to directly interact with and recognize mismatched bases. Mismatch recognition by domain I involves a phenylalanine residue (Phe 36 in E. coli) and a glutamic acid residue (Glu 38 in E. coli) in a conserved motif ‘FXE’ within this domain (Natrajan et al., 2003). The MutS domain I is also present in the eukaryotic MSH1, MSH2, MSH3, MSH6 and plt-MSH1 subfamilies. They also exhibit conserved residues at the same location, albeit with patterns different from ‘FXE’ for MSH2 and MSH3 (Culligan et al., 2000). Remarkably, all members of the MutS7 subfamily were found to show the conserved ‘FXE’ motif (that is, ‘FYE’ for Mimivirus, HcDNAV and octocorals; ‘FHE’ for CroV; ‘FFE’ for PoV, PpV, CeV and ɛ-proteobacteria) (Supplementary Figure S4). This suggests that MutS7 may be involved in MMR rather than DNA recombination. We noted that the Mimivirus mutS gene showed the same intermediate expression pattern as other genes involved in DNA replication (with the highest level of expression between 3 and 5 h after infection) (Legendre et al., 2010). The newly identified MutS8, MutS6 and MutS9 lack the MutS domain I but possess domains III and V. A similar domain configuration can be seen in the members of the previously described MutS3 subfamily of unknown function.
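
A simple pattern scan shows what checking for the ‘FXE’ motif amounts to computationally. The sketch below is illustrative only: the function name `find_fxe` and the toy peptide are hypothetical, and a raw pattern match is much weaker evidence than the alignment-based comparison with E. coli Phe 36 and Glu 38 reported above, because short motifs of this kind also occur by chance elsewhere in a protein.

```python
import re

# 'F', any residue, then 'E'. The lookahead makes overlapping matches visible.
FXE_PATTERN = re.compile(r"(?=(F[A-Z]E))")


def find_fxe(protein):
    """Return (1-based position, motif) for every F-X-E tripeptide."""
    return [(m.start() + 1, m.group(1))
            for m in FXE_PATTERN.finditer(protein.upper())]


# Toy peptide (hypothetical, not a real MutS sequence).
toy_peptide = "MSAFYEKLFHEA"
hits = find_fxe(toy_peptide)
print(hits)
```

In practice the candidate positions would be retained only if they fall in the alignment column corresponding to the E. coli motif.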

MutS7 and MutS8 are abundant in marine metagenomic sequence data sets

We next used the 150 reference MutS sequences to assess the abundance of the MutS subfamilies in a standard protein sequence database (that is, UniProt), as well as in an environmental sequence collection (that is, NCBI/Env_Nr) using BLAST. We first collected MutS homologs from UniProt using position-specific scoring matrices corresponding to the MutS domain V sequences extracted from the reference sequence set. This resulted in a set of 4028 MutS homologs including the six MutS homologs from PoV, PpV, CeV and HcDNAV. These 4028 sequences were searched against the 150 reference sequences with BLASTP (E-value <10^−5), and best hits were used for subfamily assignment. The relative abundance of the predicted subfamilies is shown in Figure 3 and Supplementary Table S2. Consistent with its ubiquitous presence in prokaryotes, the most abundant subfamily was MutS1/MSH1 (45%), followed by MutS2, representing 27% of MutS homologs in UniProt. Each of the remaining 13 subfamilies accounted for less than 5% of the total MutS subfamily assignments. The two subfamilies containing viral homologs, MutS7 and MutS8, ranked twelfth (0.7%) and fifteenth (0.1%), respectively. This analysis also confirmed the presence of MutS7 exclusively in giruses, the ɛ-Proteobacteria and octocoral mitochondria. The MutS8 subfamily was found to contain only PpV, PoV and ‘Amoebophilus asiaticus’ sequences. MutS6 was found exclusively in the Bacteroidetes (Bacteroides, Chitinophaga, Dyadobacter, Pedobacter, Sphingobacterium). MutS9 was found in the Bacteroidetes, Firmicutes (Clostridia), Fusobacteria, Thermotogae and ‘Candidatus Cloacamonas (candidate division WWE1)’. Eukaryotic MutS sequences were found in nine subfamilies (that is, MutS1/MSH1, MSH2, MSH3, MSH4, MSH5, MSH6/7, plt-MSH1, MutS2, MutS7). Bacterial sequences were present in eight subfamilies (that is, MutS1/MSH1, MutS2, MutS3, MutS4, MutS6, MutS7, MutS8, MutS9). Archaeal MutS sequences were found in three subfamilies (that is, MutS1/MSH1, MutS4, MutS5).
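
The best-hit subfamily assignment described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: it assumes BLAST results in the common 12-column tab-separated layout (query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, E-value, bit score), and the function name, sequence identifiers and hit lines are hypothetical.

```python
from collections import Counter


def assign_subfamilies(blast_lines, ref_to_subfamily, max_evalue=1e-5):
    """Assign each query to the subfamily of its highest-scoring reference hit.

    Hits with an E-value not below `max_evalue` are ignored, so a query
    without any significant hit stays unassigned.
    """
    best = {}  # query -> (bit score, reference id)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bits = float(fields[10]), float(fields[11])
        if evalue >= max_evalue:
            continue
        if query not in best or bits > best[query][0]:
            best[query] = (bits, subject)
    return {q: ref_to_subfamily[s] for q, (_, s) in best.items()}


# Hypothetical reference map and hit table, for illustration only.
reference_map = {"CeV_MutS7": "MutS7", "PpV_MutS8": "MutS8",
                 "Ecoli_MutS1": "MutS1/MSH1"}
toy_hits = [
    "env1\tCeV_MutS7\t45.0\t300\t160\t5\t1\t300\t10\t309\t1e-60\t230",
    "env1\tEcoli_MutS1\t28.0\t250\t170\t10\t5\t254\t400\t649\t1e-12\t70",
    "env2\tPpV_MutS8\t40.0\t200\t115\t5\t1\t200\t1\t200\t1e-30\t130",
    "env3\tEcoli_MutS1\t25.0\t100\t70\t5\t1\t100\t1\t100\t0.01\t30",
]
assignments = assign_subfamilies(toy_hits, reference_map)
abundance = Counter(assignments.values())
total = sum(abundance.values())
for subfamily, count in abundance.most_common():
    print(f"{subfamily}: {count} ({100.0 * count / total:.0f}%)")
```

Dividing each subfamily count by the total number of assigned sequences gives relative abundances of the kind reported in Figure 3.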

Figure 3

Representation of the different MutS subfamilies in the curated UniProt database (left panel) versus the environmental sequence data set, NCBI/Env_Nr (right panel).

As the current database is highly biased towards model organisms that have been cultured and targeted for genomic analysis, we applied the same procedure to an environmental protein sequence data set (NCBI/Env_Nr) to reduce such a bias. The position-specific scoring matrices corresponding to the MutS domain V identified 1568 MutS homologs in NCBI/Env_Nr. The subfamily assignments of these environmental sequences are shown in Figure 3 and Supplementary Table S2. Again, the MutS1/MSH1 (62%) and MutS2 (15%) subfamilies were the most highly represented groups. However, the MutS7 and MutS8 subfamilies, which include girus MutS homologs, now ranked third (176 environmental protein sequences; 11%) and fourth (106 environmental protein sequences; 7%), respectively. Each of the remaining 11 subfamilies accounted for less than 2% of the total assignments. The environmental protein sequences classified in MutS7 or MutS8 were all from a marine microbial metagenomic study, the global ocean sampling expedition (GOS) (Rusch et al., 2007). The GOS reads associated with these protein sequences (441 reads for MutS7; 262 reads for MutS8) were found to originate from different geographical sampling sites (38 sites for MutS7; 35 sites for MutS8; Supplementary Table S3). Therefore, the MutS7 and MutS8 subfamily members are relatively abundant in marine microbial communities, and are presently underrepresented in the curated sequence database (that is, UniProt).

Environmental MutS7 and MutS8 are likely of ‘girus-origin’

An inspection of the BLAST results of the MutS7-like or MutS8-like environmental sequences immediately suggests that most of them are likely of girus origin. Of the 176 environmental MutS7 homologs, 152 (86%) sequences showed their BLAST best hit to girus MutS7 sequences (79 sequences to CeV; 48 to PoV; 18 to PpV; 5 to HcDNAV; 2 to Mimivirus). The remaining 24 sequences showed their best hit to MutS7 sequences from ɛ-proteobacteria. No environmental sequence had a best hit to the octocoral MutS7 group. Of the 106 environmental MutS8 homologs, 95 (89%) sequences showed their best hit to girus MutS8 (69 sequences to PpV; 26 to PoV). The remaining 11 sequences best matched ‘Amoebophilus asiaticus’. To verify the evolutionary relatedness between the environmental sequences and girus MutS homologs, we used a maximum-likelihood method implemented in the ‘phylogenetic placement’ software developed by Matsen et al. (pplacer; http://matsen.fhcrc.org/pplacer/). Again, a majority (88% for MutS7 and 96% for MutS8) of the environmental sequences were positioned on the branches leading to giruses in the reference MutS7 and MutS8 phylogenetic trees (Figure 4). Finally, we compared the nucleotide compositions of these MutS homologs. Most of the mutS7 and mutS8 genes were found to be A+T-rich (girus-MutS7: A+T=64–82%; ɛ-proteobacteria-MutS7: 58–73%; octocoral-MutS7: 74–78%; girus-MutS8: 64–74%; Amoebophilus-MutS8: 64–66%). The environmental sequences assigned to these subfamilies were also found to be A+T-rich on average: 69% for MutS7 and 71% for MutS8. Despite this similarity in nucleotide composition, however, a correspondence analysis of the codon usages revealed that a large proportion of environmental sequences showed codon usages close to those of girus sequences for both MutS7 and MutS8 (Supplementary Figure S5). Overall, these results suggest that most of the MutS7 and MutS8 homologs in the GOS metagenomic data set probably belong to marine giruses.
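
The A+T percentages quoted above correspond to a simple base count, sketched below. The function name `at_content` and the toy sequence are hypothetical; ignoring ambiguous bases such as N is an assumption made here for illustration and is not stated in the methods.

```python
def at_content(seq):
    """Return the A+T percentage of a nucleotide sequence.

    Only unambiguous bases (A, C, G, T) are counted; anything else,
    such as N, is ignored (an assumption of this sketch).
    """
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    at = sum(base in "AT" for base in counted)
    return 100.0 * at / len(counted)


# Toy sequence (hypothetical), four A/T bases out of six counted.
toy_gene = "ATATGC"
print(f"A+T = {at_content(toy_gene):.1f}%")
```

Because nucleotide composition alone did not separate the groups, the codon-usage correspondence analysis provides the more discriminating comparison.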

Figure 4

Taxonomic placement of the environmental MutS7 (a) and MutS8 (b) homologs. The number of environmental sequences mapped on each branch is indicated. The width of the branch is proportional to the number of mapped sequences.

Discussion

The recent accumulation of genomic and metagenomic sequence data has revolutionized our understanding of the diversity and evolution of genes in microorganisms. With over 1000 sequenced genomes from cells and over 1500 genomes from DNA viruses, the available sequence data now cover a wide spectrum of species and have already helped advance our understanding of the functions and evolution of protein families such as the MutS family (Eisen, 1998; Lin et al., 2007). However, given the huge diversity of girus genomes (Ogata and Claverie, 2007), they still seem to be underrepresented in this sequencing effort (Claverie et al., 2006; Claverie and Abergel, 2010); out of the 1500 available viral genomes, only a handful exceed 350 kb (for example, Mimivirus (1.2 Mb), CroV (730 kb), Emiliania huxleyi virus (407 kb), Paramecium bursaria Chlorella virus NY2A (369 kb), Marseillevirus (368 kb), Canarypox virus (360 kb)). In this study, we analyzed four distantly related marine giruses representing a relatively large class of giruses with estimated genome sizes from 356 kb up to 560 kb and identified new MutS homologs in all four of them.

We showed that these girus-encoded MutS proteins fell into two subfamilies: MutS7 and MutS8. The recently reported MutS sequence from the largest marine girus, CroV, was classified in the MutS7 subfamily (Figure 1b) and was found to share the typical domain organization of this subfamily. Most unexpectedly, close homologs of the girus-encoded MutS7 and MutS8 were found to be highly abundant in marine metagenomic sequence data sets. Giruses thus seem to represent one of the major sources of the diversity of MutS family proteins. Our phylogenetic reconstruction strongly suggests the occurrence of horizontal gene transfers between giruses and cellular organisms for both the MutS7 and MutS8 subfamilies. The abundance of ‘girus-like’ MutS7 in the marine environment favors the previously proposed scenario that an ancestor of marine giruses had a central role in transferring MutS7 to the octocoral mitochondrial genome. Consistently, the branch to the octocoral MutS7 sequences was placed within the girus clade in the MutS7 phylogenetic tree (Figure 1b). The self-contained nature of the mutS7 gene (with both recognition and cutting functions) might have facilitated such a gene transfer between distantly related organisms. Similar gene transfer from a virus to the ancestor of mitochondria has been proposed for the mitochondrial RNA/DNA polymerases and DNA primase; in this case, the source is likely to be a cryptic prophage (related to T3/T7) and the mitochondrial enzymes are encoded in the nuclear genome (Filee and Forterre, 2005). The possible gene transfer for MutS8 (found in PoV, PpV and the obligate intracellular amoeba-symbiont ‘Amoebophilus asiaticus’ isolated from lake sediment) reinforces the previously proposed idea that amoebae (or other phagocytic protists) function as ‘genetic melting pots’ to enhance the evolution of intracellular bacteria and viruses infecting these eukaryotes by providing ample opportunities for gene exchanges (Ogata et al., 2006). Given the apparent specificity of virally encoded MutS for viruses with the largest genomes, these MutS sequences will be useful to probe metagenomic sequences for the presence of unknown giruses.

DNA viruses show a tremendous variation in genome size, from a few kilobases for the oncogenic polyomaviruses to more than a megabase for the giant Mimivirus (Monier et al., 2007). Drake's rule states that the mutation rate per genome per strand copying is roughly constant across DNA-based microorganisms including bacteria, unicellular eukaryotes and DNA viruses (Drake, 1991; Sanjuan et al., 2010). The mutation rate per nucleotide per replication is thus negatively correlated with genome size. In fact, the loss of DNA repair functions is a common trend in bacteria with reduced genomes, which exhibit higher mutation rates than bacteria with larger genomes (Moran and Wernegreen, 2000; Moran et al., 2009). Experimental mutation-rate data are currently unavailable for giruses. However, given the large amount of coding DNA that they need to protect from mutations, giruses may be under a specific selective pressure for efficient DNA repair systems (such as MutS7), which may be less crucial for smaller viruses. The identification of MutS homologs in all four giruses tested in this study, as well as a wealth of other DNA repair genes in Mimivirus and CroV, is consistent with this view.
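
The inverse relationship implied by Drake's rule can be made concrete with a short calculation. In the sketch below, the genomic rate of 0.003 mutations per genome per replication is an assumed placeholder of the order usually quoted for DNA-based microbes, not a value measured for any girus; the 5-kb genome is an illustrative stand-in for a small DNA virus, and the function name is hypothetical. The 356-kb and 1.2-Mb sizes are those of HcDNAV and Mimivirus given in the text.

```python
# Assumed placeholder: roughly constant mutations per genome per replication.
ASSUMED_GENOMIC_RATE = 0.003


def per_site_rate(genome_size_bp, genomic_rate=ASSUMED_GENOMIC_RATE):
    """Per-nucleotide mutation rate implied by a fixed per-genome rate."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return genomic_rate / genome_size_bp


genomes = [
    ("small DNA virus (5 kb, illustrative)", 5_000),
    ("HcDNAV (356 kb)", 356_000),
    ("Mimivirus (1.2 Mb)", 1_200_000),
]
for name, size in genomes:
    print(f"{name}: {per_site_rate(size):.2e} per site per replication")
```

Under this assumption, a 1.2-Mb genome would need a per-site error rate 240 times lower than a 5-kb genome, which illustrates why efficient repair may matter more for giruses.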

Although the organisms with MutS7 or MutS8 had many opportunities to exchange these genes (Claverie et al., 2009), the reason for the sporadic and limited phyletic distribution of these MutS subfamilies still remains unclear. One might presume that the functions of these MutS proteins are somehow associated with A+T-rich genomes. However, the presence of ɛ-proteobacteria with A+T-rich genomes (such as H. pylori, A+T=62%) lacking these MutS subfamily members contradicts this hypothesis. E. coli MutH distinguishes the nascent DNA strand from the template DNA strand through the hemi-methylation of bases. It would be interesting to examine the presence of hemi-methylated bases in girus genomes and octocoral mitochondrial genomes. We have started to clone and purify the Mimivirus MutS7 for functional characterization.

The unique presence of a MutS homolog in Mimivirus was already noticed during the initial genome annotation (Raoult et al., 2004). We then recognized the surprising relationship between the Mimivirus MutS and its homologs uniquely found in the mitochondria of all octocorals (Claverie et al., 2009), all belonging to the newly defined MutS7 subfamily. The finding of these MutS homologs in CroV, CeV, PoV, PpV and HcDNAV definitely confirms their association with large DNA viruses in marine environments. These findings strongly suggest that the presence of MutS in Mimivirus is not merely an example of an eccentric lateral gene transfer, but probably requires a more subtle explanation. We believe that much deeper experimental investigation of these girus MutS homologs would help provide a holistic view on the evolution of gene families in the light of evolutionary interactions between the viral and cellular gene pools.