Introduction

The Roseobacter clade is a monophyletic subfamily-level lineage in the Alphaproteobacteria that represents up to 20% of bacterial cells in marine environments (Moran et al., 2007). Members of this clade are known for their large and G+C-rich genomes equipped with versatile metabolic capabilities that allow rapid response to transient, nutrient-enriched patches (Moran et al., 2004; Buchan et al., 2005; Newton et al., 2010). Several recent metagenomic analyses, however, suggest that views of this lineage may be biased by the characteristics of the cultured strains. For example, natural populations of Roseobacters in surface ocean waters have a significantly lower proportion of genes for adhesion, cellular capsule formation and signal transduction (Luo et al., 2012), as well as for motility and quorum sensing (Newton et al., 2010), compared with the numerous genomes available from cultured strains. Moreover, one cultured Roseobacter member branching at the base of the phylogenomic tree, strain HTCC2255, has a small and G+C-poor genome with evidence for gene loss biased toward functions related to cell–cell interaction, transcriptional regulation and motility (Luo et al., 2013). Collectively, these studies suggest that genome content and life history strategies of Roseobacter clade members are more diverse than originally thought, with potential implications for their role in heterotrophic processing of fixed carbon in the ocean (Luo et al., 2013).

The genome streamlining hypothesis has been invoked to explain the reduced genomes characteristic of many marine bacterial taxa (Dufresne et al., 2005; Giovannoni et al., 2005; Dupont et al., 2012), based on the idea that selection acts to reduce genome size because of the metabolic costs of maintaining DNA with no adaptive value. Streamlined marine bacteria are also often small in cell size (potentially conferring a surface-to-volume ratio advantage; Dufresne et al., 2005), have genomes poor in G+C pairs (which require more nitrogen than A+T pairs; Grzymski and Dussaq, 2012; Swan et al., 2013), have a low proportion of noncoding DNA (Giovannoni et al., 2005; Swan et al., 2013) and maintain fewer genes involved in transcription and signal transduction (Swan et al., 2013). The primary mechanism theorized to underlie genome streamlining in marine bacteria has been selection for low metabolic costs in the oligotrophic ocean environment (Dufresne et al., 2005; Giovannoni et al., 2005).

Purifying selection, which acts to eliminate deleterious mutations, is considered the most prevalent form of natural selection at the molecular level (Nei, 1987; Hughes, 1999, 2008; Nei et al., 2010). Evidence for purifying selection on protein-coding genes is typically provided by the observation that the number of synonymous (silent) substitutions per synonymous site (dS) exceeds the number of nonsynonymous (amino acid changing) substitutions per nonsynonymous site (dN) in a vast majority of genes, as nonsynonymous mutations are more likely to be deleterious than synonymous mutations (Hughes, 2008). Purifying selection may become relatively inefficient, however, in taxa where the effective population size is reduced. This can occur frequently in natural populations (for example, owing to population bottlenecks or founder effects) and it results in chance fixation of slightly deleterious mutations by genetic drift (Ohta, 1992), a mechanism referred to as relaxation of purifying selection (Hughes and Friedman, 2009; Hughes, 2009, 2013).

The purpose of this study is to use molecular evolutionary genetic approaches to understand the evolution of oceanic Roseobacters that show evidence of genome streamlining. Our focus is on the mechanisms giving rise to differences in their genome traits compared with the better-studied members with large, G+C-rich genomes equipped for exploiting nutrient-rich environments (Moran et al., 2004; Buchan et al., 2005; Newton et al., 2010). We turned to single-cell amplified genomes (SAGs) obtained by flow cytometric sorting of single bacterial cells from seawater for this analysis (Swan et al., 2013), as SAGs are not biased toward traits required for growth on rich, solid media and, unlike metagenomic assemblies, represent a single genome appropriate for population genetic analysis at the nucleotide and amino acid level. Our results indicate that genetic drift cannot be ruled out as a dominant evolutionary force giving rise to the ecologically important Roseobacter lineages containing smaller and G+C-poor genomes.

Materials and methods

Sampling site and procedure

Water samples for single-cell analyses were collected using Niskin bottles from the South Atlantic (10 m; 27 November 2007; 12°29′41.40″ S, 4°59′55.20″ W), HOT station ALOHA in the North Pacific (25 m; 9 September 2009; Cruise KN192-5 station 11; 22°45′00″ N, 158°00′00″ W) and the Gulf of Maine (1 m; 16 September 2009; 43°50′39.87″ N, 69°38′27.49″ W). Replicate aliquots of water (1 ml volume) were cryopreserved with 6% glycine betaine (Sigma) and stored at −80 °C (Cleland et al., 2004). Uncultivated cells SCGC AAA015-O19 (South Atlantic), SCGC AAA298-K06 and SCGC AAA300-J04 (North Pacific) and SCGC AAA076-C03 (Gulf of Maine) were sequenced using methods in Swan et al. (2013).

Estimating genome size

The size of single-cell and isolate genomes was estimated using a polynomial regression model by regressing the number of nucleotides sequenced against the ratio of the number of conserved single-copy genes universally present in five fully sequenced Roseobacter genomes to the number of predicted protein-coding genes in each genome. Although the multiple displacement amplification (MDA) method used to amplify the SAG genomes has been shown to be biased, the regression approach is nonetheless appropriate provided that the bias is random, which has been shown to be the case (Woyke et al., 2009). The average genome size of Roseobacters in the Global Ocean Survey (GOS) metagenomes (Rusch et al., 2007) was obtained by dividing the number of Roseobacter nucleotides by the number of Roseobacter genomes occurring in GOS. The former value was determined based on sequences with best hits to a Roseobacter gene using the RAPsearch2 software (Zhao et al., 2012) with an E-value of 10⁻³ to search all GOS reads against the NCBI nonredundant (nr) database. This software is more efficient than BLAST while achieving similar accuracy (Zhao et al., 2012). In this way, we identified 85 842 reads or 90 555 430 bp as belonging to the Roseobacter clade. The number of genomes was approximated by the total number of Roseobacter GOS nucleotides coding for one of the five single-copy genes (atpD, dnaK, gyrB, recA and rpoB) divided by the average length of that gene in Roseobacter isolate genomes. GOS reads associated with Roseobacters coding for these five genes were verified using RPS-BLAST searches (Altschul et al., 1997) against the Clusters of Orthologous Groups database (Tatusov et al., 1997), producing an estimate of 31.6 Roseobacter genome equivalents in the GOS database. The number of Roseobacter bases was divided by this value to estimate genome size. To obtain a distribution of genome size in GOS, we applied a resampling technique (with replacement) to the Roseobacter reads identified by RAPsearch2. Replicates were resampled 1000 times, and the average genome size was estimated for each as described above.
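The genome-equivalent calculation described above can be illustrated with a short Python sketch. It uses only the two totals reported in this section (90 555 430 bp of Roseobacter reads and 31.6 genome equivalents); the function name is hypothetical, and the code is an illustration of the arithmetic rather than the pipeline used in the study.

```python
def average_genome_size(total_clade_bp: float, genome_equivalents: float) -> float:
    """Average genome size = clade nucleotides / number of genome equivalents."""
    if genome_equivalents <= 0:
        raise ValueError("genome_equivalents must be positive")
    return total_clade_bp / genome_equivalents


# Totals reported in the text for Roseobacter reads in GOS.
size_bp = average_genome_size(90_555_430, 31.6)
print(f"{size_bp / 1e6:.2f} Mb")
```

With these inputs the sketch gives roughly 2.87 Mb, in line with the average size reported for the oceanic genomes in the Results.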

Estimating the fraction of noncoding DNA

The fraction of noncoding DNA in single-cell genomes and isolate genomes was estimated by dividing the total length of predicted noncoding DNA sequences by the number of nucleotides sequenced. To obtain the fraction of noncoding DNA in GOS Roseobacters, we used the 5608 Roseobacter reads (including the paired-end reads; 5 941 949 bp) sampled by the dN pipeline software and validated by phylogenetic reconstruction (Luo et al., 2012). When these high-confidence Roseobacter reads were searched against the NCBI RefSeq database (Pruitt et al., 2009) using the BLASTX software (Altschul et al., 1997), we determined that 5 491 733 bp hit within a coding region using a cutoff of E-value <0.1 and bit score >40. The remaining base pairs were taken to represent noncoding DNA in the GOS Roseobacters. To obtain a distribution of noncoding DNA in GOS, we applied a resampling technique (with replacement) to the 5608 Roseobacter reads. A total of 1000 resamplings were carried out, and the fraction of noncoding DNA was calculated for each.
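A minimal Python sketch of this calculation follows. The point estimate uses the two totals given above; the resampling function illustrates bootstrapping reads with replacement. The per-read (length, coding bp) tuples in `toy_reads` are invented placeholders, and the function names and the fixed random seed are assumptions made for the example, not details of the original analysis.

```python
import random


def noncoding_fraction(reads):
    """reads: list of (read_length_bp, coding_bp) tuples."""
    total = sum(length for length, _ in reads)
    coding = sum(coding_bp for _, coding_bp in reads)
    return (total - coding) / total


def bootstrap_fractions(reads, n_resamples=1000, seed=0):
    """Resample reads with replacement and recompute the fraction each time."""
    rng = random.Random(seed)
    n = len(reads)
    return [noncoding_fraction(rng.choices(reads, k=n)) for _ in range(n_resamples)]


# Point estimate from the totals reported in the text.
point = (5_941_949 - 5_491_733) / 5_941_949

# Hypothetical reads, used only to exercise the resampling step.
toy_reads = [(1000, 930), (1100, 1000), (950, 900), (1050, 960)]
dist = bootstrap_fractions(toy_reads, n_resamples=1000)
print(round(point, 3), len(dist))
```

The point estimate from the reported totals is about 0.076, matching the noncoding fraction given for the GOS Roseobacters in the Results.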

Estimating the relative abundance of the uncultivated Roseobacters in GOS

The fraction of Roseobacter reads in the GOS data set that were affiliated with the SAG-O19 clade (consisting of O19, K06 and J04) was estimated in two ways. First, we constructed a custom reference database by adding the predicted amino acid sequences of the four SAGs to the NCBI microbial RefSeq database (Pruitt et al., 2009). GOS reads were queried against this database using the RAPsearch2 software (Zhao et al., 2012) with a cutoff E-value of 10⁻³, and sequences with a best hit to a Roseobacter gene were retained. To account for the lower genome coverage of the SAGs, we focused on 1396 orthologous gene families that covered at least one member of the SAG-O19 clade genomes and one of the remaining Roseobacter genomes, with orthologous families identified using the Clusters of Orthologous Groups database (Tatusov et al., 1997). The proportion of best hits to SAG-O19 clade members in this subset was used as an indication of clade representation in the GOS data set. Second, because not all marine Roseobacter lineages are currently represented by public genome sequences, we also compared the 16S rRNA gene sequences between the SAG-O19 clade members and the GOS Roseobacters. The metagenomic reads that harbored a 16S rRNA gene with a best hit to a Roseobacter 16S rRNA gene were retrieved from the GOS metagenome. Those with a sequence identity >99% to the 16S rRNA gene of a SAG-O19 clade member were considered members of the clade.

Ortholog identification, alignment construction, model adequacy test

The four SAGs and 40 isolate sequences were used in genome tree construction with four outgroup genomes in other Alphaproteobacteria lineages. Orthologous genes were identified by a reciprocal BLAST search (Altschul et al., 1997) for all possible genome pairs at the amino acid level using an E-value of 10⁻⁶. Then, shared orthologous genes among the 48 genomes were identified by the MSOAR software (Chen et al., 2005); MSOAR assigns orthologs between a pair of genomes by considering sequence similarity among homologs and minimizing the number of gene rearrangements (Chen et al., 2005). Next, one genome was picked at random as the reference genome, and pairwise orthologs were linked to the reference genome. In this way, orthologs shared by all the genomes were identified (Luo et al., 2011a, 2011b, 2012, 2013). To identify false positive orthologous sets, each member in an orthologous set was searched against the Clusters of Orthologous Groups database (Tatusov et al., 1997) using the RPS-BLAST software (Altschul et al., 1997). Orthologous sets including members hitting to different Clusters of Orthologous Groups were not used further. Following this procedure, we identified 52 single-copy orthologous protein families that are shared by the 48 genomes.
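The final filtering step, in which orthologous sets whose members hit different Clusters of Orthologous Groups are dropped, can be sketched in Python as follows. The data structures, gene identifiers and COG labels are hypothetical placeholders. The treatment of members with no COG hit (here, the whole set is discarded) is an assumption, as the text does not specify it.

```python
def filter_consistent_families(families, cog_of):
    """Keep only families whose members all map to one and the same COG.

    families: dict mapping family id -> list of gene ids
    cog_of:   dict mapping gene id -> best-hit COG id (genes without a hit are absent)
    """
    kept = {}
    for family_id, members in families.items():
        cogs = {cog_of.get(gene) for gene in members}
        # Assumption: a family with any unassigned member is treated as unreliable.
        if len(cogs) == 1 and None not in cogs:
            kept[family_id] = members
    return kept


# Hypothetical example: fam2 hits two COGs, fam3 has a member with no hit.
families = {"fam1": ["g1", "g2"], "fam2": ["g3", "g4"], "fam3": ["g5", "g6"]}
cog_of = {"g1": "COG_A", "g2": "COG_A", "g3": "COG_B", "g4": "COG_C", "g5": "COG_D"}
consistent = filter_consistent_families(families, cog_of)
print(sorted(consistent))
```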

As member genes in a family can be divergent, no single alignment method worked well for all of the data sets. Therefore, we used four different algorithms to align the member genes at the amino acid level. These four methods are Mafft (Katoh et al., 2005), T-Coffee (Notredame et al., 2000), Muscle (Edgar, 2004) and Prank (Löytynoja and Goldman, 2008). Each alignment was trimmed using the trimAl software (Capella-Gutiérrez et al., 2009) with the parameters ‘-automated1 -resoverlap 0.55 -seqoverlap 60’. The quality of four trimmed alignments for each family was evaluated using a consistency score calculated by trimAl (Capella-Gutiérrez et al., 2009), with a higher consistency score showing that the alignment produced by one method was more consistent than alignments produced by other methods. Overall, Mafft was selected for 37 data sets, and one of the other methods was chosen for each of the remaining data sets. The best alignments were concatenated for genome-based phylogenetic construction, resulting in a concatenated data set with 18 825 sites.

Model adequacy with respect to amino acid composition was assessed for each of the 52 protein families using the χ² homogeneity statistic on posterior distributed samples, which were generated by posterior predictive simulation using the P4 Bayesian phylogenetic software package (Foster, 2004).

Genome-based phylogenetic construction using a composition-heterogeneous Bayesian model

To account for a systematic error associated with compositional bias in the amino acid sequences, the concatenated protein sequence was recoded into the following six Dayhoff groups that correspond to PAM and most other matrices (Hrdy et al., 2004): (cysteine), (alanine, serine, threonine, proline, glycine), (asparagine, aspartic acid, glutamic acid, glutamine), (histidine, arginine, lysine), (methionine, isoleucine, leucine, valine), (phenylalanine, tyrosine, tryptophan). This recoding scheme has been used to improve topological estimation in the presence of compositional heterogeneity in phylogenomic studies (Cox et al., 2008; Foster et al., 2009; Nesnidal et al., 2010). We applied Bayesian nonstationary models including the node-discrete composition heterogeneity (NDCH) model and the node-discrete rate matrix heterogeneity (NDRH) model. The NDCH model allows different composition on different branches, and the NDRH model allows different rate matrices on different branches (Foster, 2004). These models are computationally expensive but specifically designed to account for compositional heterogeneity. They are implemented in the P4 Bayesian phylogenetic software package (Foster, 2004).
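The six-state recoding can be illustrated with a short Python sketch. The group memberships follow the list given above; the numeric state labels (0 to 5), the function name and the handling of gaps or ambiguous residues (passed through as '-') are arbitrary choices made for this example and are not taken from P4.

```python
# Six Dayhoff groups as listed in the text; state labels 0-5 are arbitrary.
DAYHOFF_GROUPS = {
    "C": "0",      # cysteine
    "ASTPG": "1",  # alanine, serine, threonine, proline, glycine
    "NDEQ": "2",   # asparagine, aspartic acid, glutamic acid, glutamine
    "HRK": "3",    # histidine, arginine, lysine
    "MILV": "4",   # methionine, isoleucine, leucine, valine
    "FYW": "5",    # phenylalanine, tyrosine, tryptophan
}
AA_TO_GROUP = {aa: state for aas, state in DAYHOFF_GROUPS.items() for aa in aas}


def dayhoff_recode(sequence: str, unknown: str = "-") -> str:
    """Recode a protein sequence into six Dayhoff states.

    Gaps and ambiguous residues are mapped to `unknown` (an assumption).
    """
    return "".join(AA_TO_GROUP.get(aa, unknown) for aa in sequence.upper())


print(dayhoff_recode("CASTND"))
```

Recoding reduces the 20-letter alphabet to six states, so substitutions within a group become invisible to the model, which is how compositional signal is dampened.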

The Dayhoff-recoded concatenated data set was analyzed using multiple configurations of NDCH and NDRH, a general time-reversible substitution matrix plus four Gamma-distributed rate categories and the polytomy prior (Lewis et al., 2005). Ten replicate runs were performed for each configuration of NDCH and NDRH. In each replicate run, one cold and three heated MCMC chains were run for a total of 1 500 000 generations, with trees sampled every 1000 generations. The first 500 000 generations were discarded as ‘burn-in’. The model adequacy with respect to composition was assessed using the χ² homogeneity statistic on posterior distributed samples, which were generated by posterior predictive simulation in P4. This test rejected the stationary model (1 composition vector plus 1 general time-reversible rate matrix across the tree). Model selection was performed using the Bayes factor, which is the ratio of the marginal likelihoods of the two models being compared. Here it is reported on a logarithmic scale, as twice the difference in the logs of the marginal likelihoods. A value >10 is considered significant to favor the better model (Kass and Raftery, 1995). The marginal likelihoods were estimated using equation 16 in Newton and Raftery (1994) based on the harmonic mean of the likelihoods from posterior distributions. This test selected a nonstationary model of NDCH(8)+NDRH(2), that is, a model with eight composition vectors and two general time-reversible rate matrices. The average standard deviation of split support was <0.01, suggesting that convergence was reached. A majority-rule consensus tree was constructed from the post-burn-in trees. This conservative method resolved three SAGs (O19, K06 and J04) into a novel clade (designated SAG-O19 clade) in which no cultured representatives were found.
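The model comparison can be sketched in Python under simplifying assumptions. The code below uses the plain harmonic mean estimator of the log marginal likelihood, computed with a log-sum-exp step for numerical stability; it may differ in detail from the specific equation of Newton and Raftery cited above, and the function names and input values are hypothetical.

```python
import math


def log_marginal_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of sampled likelihoods.

    ln m = ln(n) - ln(sum_i exp(-lnL_i)), evaluated with log-sum-exp.
    """
    n = len(log_likelihoods)
    if n == 0:
        raise ValueError("need at least one posterior sample")
    negated = [-x for x in log_likelihoods]
    peak = max(negated)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in negated))
    return math.log(n) - log_sum


def two_ln_bayes_factor(log_ml_model1, log_ml_model0):
    """Twice the difference in log marginal likelihoods (model 1 vs model 0)."""
    return 2.0 * (log_ml_model1 - log_ml_model0)


# Hypothetical log marginal likelihoods for two competing models.
score = two_ln_bayes_factor(-1000.0, -1010.0)
print(score, score > 10)
```

Under the threshold described above, a score above 10 would be read as favoring model 1. The harmonic mean estimator is known to be unstable, so the sketch is meant only to make the arithmetic concrete.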

Estimating the rates of conservative and radical amino acid replacement

Orthologous gene families among the 48 genomes were identified using the OrthoMCL software (Li et al., 2003). In each family, duplicates in the same genomes were discarded such that each family presumably contains single-copy orthologous genes. Families that contained <30 genomes of Roseobacter isolates and fewer than two of the three SAGs in the SAG-O19 clade were also discarded. This data filtering procedure resulted in 585 families for subsequent analyses. Genes were aligned using Mafft (Katoh et al., 2005) at the amino acid level, and DNA sequences were imposed on the alignments.

The 20 amino acids were classified using two widely accepted schemes based on their physicochemical properties: (I) charge (Hughes et al., 1990), and (II) polarity and volume (Miyata et al., 1979). Conservative and radical amino acid replacements were defined as amino acid substitutions within and between groups, respectively. Rates of conservative (dC) and radical (dR) replacements, that is, the number of conservative nonsynonymous nucleotide substitutions per conservative nonsynonymous site and the number of radical nonsynonymous nucleotide substitutions per radical nonsynonymous site, between two orthologous genes were determined using the HON-new software (Zhang, 2000). To account for the transitional bias, the transition/transversion ratio, determined separately for each gene family using the MEGA-CC software (Kumar et al., 2012), was incorporated. In dC and dR calculations, the two genomes in the basal HTCC2255 clade were not included, because this lineage has distinct genome characteristics and does not share an evolutionary path with all other Roseobacters (Luo et al., 2013).
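The distinction between conservative and radical replacements under the charge scheme can be illustrated with a short Python sketch. The group membership used here (R, H and K as positive; D and E as negative; all others as neutral) is a commonly used assignment and is an assumption, since Table 2 is not reproduced in this section. The sketch only counts replacement types in a pairwise alignment. It does not compute dC or dR, which additionally require normalizing by the numbers of conservative and radical sites as done by HON-new.

```python
POSITIVE = set("RHK")  # assumed positively charged residues
NEGATIVE = set("DE")   # assumed negatively charged residues


def charge_group(aa: str) -> str:
    aa = aa.upper()
    if aa in POSITIVE:
        return "positive"
    if aa in NEGATIVE:
        return "negative"
    return "neutral"


def replacement_type(aa1: str, aa2: str) -> str:
    """Classify an aligned residue pair as identical, conservative or radical."""
    if aa1.upper() == aa2.upper():
        return "identical"
    same_group = charge_group(aa1) == charge_group(aa2)
    return "conservative" if same_group else "radical"


def count_replacements(seq1: str, seq2: str) -> dict:
    """Count replacement types across two aligned sequences, skipping gaps."""
    counts = {"conservative": 0, "radical": 0}
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            continue
        kind = replacement_type(a, b)
        if kind in counts:
            counts[kind] += 1
    return counts


print(count_replacements("MKDA-", "MRDEL"))
```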

Results and Discussion

Roseobacter single-cell genome characterization

Partial genomes of four SAGs were recovered from the Pacific and Atlantic Oceans (Table 1), represented by 0.6–2.0 Mbp of assembled DNA per genome. These SAGs were determined to represent cells with reduced genome sizes (estimated to be 2.6–3.5 Mbp) using a polynomial regression model based on the presence of single-copy genes conserved in cultured Roseobacters (Figure 1). One of the SAGs, C03, shares an identical 16S rRNA gene and >97% average nucleotide identity with the strain HTCC2255, which branches at the base of the Roseobacter phylogeny and has genome content consistent with a free-living strategy (Luo et al., 2013). Genome content of the other three SAGs (O19, K06, J04) is also significantly different from the cultured, presumably patch-adapted Roseobacters (χ² test; P<0.001). In particular, genes involved in transcriptional regulation and replication/recombination/repair are significantly less common in these three SAGs (Xipe analysis, Rodriguez-Brito et al., 2006; 99% confidence interval, 1000 replicates), functional categories found to be underrepresented in streamlined marine bacterial genomes (Lauro et al., 2009; Yooseph et al., 2010; Luo et al., 2013; Swan et al., 2013). They also have an atypically low G+C content, averaging 39% (±1%) compared with a mean of 60% (±4%) for cultured Roseobacters (Figure 2b), and a smaller fraction of noncoding DNA (0.075±0.004) compared with the cultured Roseobacters (0.109±0.022) (Figure 2c), two characteristics frequently linked with bacterial genome reduction (Giovannoni et al., 2005; McCutcheon and Moran, 2012).

Table 1 Characteristics of single-cell amplified genomes
Figure 1

Regression model for Roseobacter SAG genome size estimation based on genome statistics for 40 cultured Roseobacter strains. The x axis shows the ratio of the number of conserved single-copy genes universally present in fully sequenced Roseobacter genomes to the number of predicted protein-coding genes in a genome; the y axis is the number of nucleotides sequenced. The data were fit to a polynomial regression model (R²=0.93), and the model was used to estimate the genome sizes of the four SAGs (C03, O19, K06 and J04), which were found to be 2.64, 3.10, 3.50 and 2.65 Mb, respectively. The prediction interval (PI) is also shown.

Figure 2

Genomic characteristics of the three SAG-O19 clade members compared with metagenomic Roseobacter sequences and isolates. Distribution of G+C content sampled from Global Ocean Survey (GOS) metagenomic data sets (a) and cultured Roseobacter genomes (b). For panel a, the GOS Roseobacter reads (n=5608) were identified using the dN pipeline software, which assigns metagenomic reads to a microbial clade with high confidence (Luo et al., 2012). For panel b, the cultured Roseobacter genomes were randomly sheared to generate an in silico metagenome, and then analyzed as described for panel a. The gray arrow in panel b indicates the average G+C content of the SAGs. The fraction of noncoding DNA (c) and estimated average genome size (d) in Roseobacter GOS reads, SAGs and isolates. The box-and-whiskers plots indicate the median (horizontal line), the boundary of the first (Q1) and third quartile (Q3) (box) and the range from Q1−1.5(Q3−Q1) to Q3+1.5(Q3−Q1) (whiskers); circles indicate outliers. The details for noncoding DNA fraction and genome size estimation can be found in the Materials and methods.

We compared characteristics of oceanic Roseobacter sequences obtained from metagenomic data with the SAG and isolate genomes. Over 5600 Roseobacter reads retrieved from the GOS metagenome with high confidence using the dN pipeline software (Luo et al., 2012) had a bimodal distribution of G+C content, with a major peak centered at 42% and a secondary peak at 54% (Figure 2a). The oceanic genomes had an estimated size of 2.87 (±0.15) Mb (Figure 2d) and a noncoding DNA fraction of 0.076 (±0.002) (Figure 2c). All three of these characteristics are consistent with those of the SAGs while divergent from the isolates, and suggest a cultivation bias against Roseobacter clade members with streamlined genomes. Thus, the SAG genomes provide an opportunity to explore the evolution of a fraction of this lineage that has previously been largely inaccessible to molecular genetic analysis. We note, however, that the size fractionation scheme used in the GOS study (0.1–0.8 μm size range; Rusch et al., 2007) was biased against large and particle-associated Roseobacter cells, and these may better resemble the cultured strains. Roseobacters in the surface ocean are estimated to be about equally divided between free-living and particle-associated cells (Luo et al., 2012).

Phylogenomic tree construction

A robust phylogenomic tree was constructed to address the evolutionary history of the Roseobacter clade. The initial RAxML-based trees were incongruent depending on whether or not SAGs O19, K06 and J04 were included (Supplementary Figure S1), despite using an identical set of amino acid characters in both cases. Inconsistencies in the trees included an unresolved branching order of the five major Roseobacter clades (Newton et al., 2010), as shown by low bootstrap support, and an extraordinarily long branch for SAG C03 in the HTCC2255 clade (Supplementary Figure S1A), conflicting with its identical 16S rRNA gene sequence and >97% average nucleotide identity to the HTCC2255 genome.

A wide range in G+C content such as that found in the Roseobacter genomes (37–70%) is known to cause compositional bias in the amino acid sequences (Gu et al., 1998; Foster and Hickey, 1999; Singer and Hickey, 2000; Swan et al., 2013). Such compositional bias introduces systematic errors in phylogenetic reconstruction using standard approaches that assume homogeneity of amino acid (or nucleotide) composition among taxa (Galtier and Gouy, 1995; Jermiin et al., 2004; Herbeck et al., 2005; Foster et al., 2009; Nesnidal et al., 2010). Possible disparity in amino acid composition across the lineages was explored using a posterior predictive simulation, and subsequent χ² homogeneity tests showed that half of the 52 orthologous protein groups shared by all Roseobacter genomes significantly violated the homogeneous assumption with respect to amino acid composition across the tree (Supplementary Table S1). To address the tree-wide compositional variation, the P4 Bayesian phylogenetic software package (Foster, 2004) was selected for phylogenomic tree construction as it implements the NDCH model, which allows for varying composition over the tree, and the NDRH model, which allows different rate matrices on different branches. The P4 phylogenomic tree resolved the earlier issues and was supported by high posterior probabilities on all internal branches (Figure 3). In this tree, SAGs O19, K06 and J04 formed a new clade without any cultured representatives (Figure 3; hereafter, the SAG-O19 clade). In addition, members of the SAG-O19 clade show extraordinarily long branches in the P4 tree (Figure 3), suggesting an enhanced evolutionary rate of this uncultivated clade.

Figure 3

Bayesian phylogenomic tree of the Roseobacter clade using a composition-heterogeneous model in the P4 software package based on a concatenation of 52 single-copy orthologous protein sequences. The scale bar indicates the number of substitutions per site. The value near each internal branch is the posterior probability for that branch. The tree is rooted using four genomes from sister clades in the Alphaproteobacteria (Hyphomonadaceae, Caulobacterales, Rhizobiales); the outgroups are not shown. The SAG clade is highlighted in blue, an example control clade is highlighted in orange and the remaining members used to calculate dR/dC ratios are indicated in green. The basal lineage consisting of HTCC2255 and SAG C03 was not included in the analyses. G+C content and genome size are indicated in the two right columns, the latter estimated using a regression model for incomplete genomes (Figure 1). Plasmids, if any, are not included in genome size calculation.

Enhanced evolutionary rate of dissimilar amino acid substitutions in the uncultivated clade

Using approaches based on genome binning and 16S rRNA gene analysis (see Materials and methods), the SAG-O19 clade is estimated to account for 35% of the free-living Roseobacter cells (<0.8 μm) in GOS metagenomic samples of surface ocean waters (Supplementary Figure S2). Comparing nucleotide substitution patterns may shed light on the nature of the evolutionary forces giving rise to this abundant lineage. A common approach to diagnosing the efficiency of purifying selection compares the genomic average of the ratio of dN to dS among different lineages (Clark et al., 1999; Wernegreen and Moran, 1999; Hu and Blanchard, 2009; Kuo et al., 2009). Because nonsynonymous mutations are more likely to be deleterious than synonymous mutations, an elevated genome-wide value indicates that more deleterious mutations have accumulated owing to relaxation of purifying selection (Ohta, 1992). A major limitation of this approach, however, is that it only applies to closely related lineages; for more divergent sequences, synonymous sites are saturated with substitutions, making the estimate of dS unreliable (Ochman et al., 1999; Luo and Hughes, 2012). In the case of the Roseobacter SAG-O19 clade, dN/dS is not applicable owing to saturation at the synonymous sites.

Using an alternative approach, we examined the type of nonsynonymous substitutions in the SAG-O19 clade. Conservative substitutions involving amino acids with similar physicochemical properties are more common than radical substitutions involving amino acids with substantial differences in physicochemical properties (Zuckerkandl and Pauling, 1965; Dayhoff et al., 1972). This bias is attributed to a higher intensity of purifying selection on changes that affect protein function. The radical (dR) and conservative (dC) substitution rates in Roseobacter genomes were computed by classifying amino acids into three groups according to charge (positive, negative and neutral; Hughes et al., 1990), and computing the number of radical (those between charge groups) nonsynonymous substitutions per radical nonsynonymous site (dR) and the number of conservative (within a charge group) nonsynonymous substitutions per conservative nonsynonymous site (dC) (Hughes et al., 1990; Zhang, 2000). The genomic average of the dR/dC ratio has been used previously to compare the efficiency of purifying selection among different lineages, with an inflated mean value indicative of relaxed purifying selection (Zhang, 2000; Eyre-Walker et al., 2002; Hughes and Friedman, 2009; Wernegreen, 2011). The advantages of this approach are that it is applicable to data sets with sequences that are too divergent for synonymous site analysis and that it provides a population genetic framework for addressing ancient evolutionary events.

The pattern of nonsynonymous substitution in the three SAG-O19 clade genomes was compared with control isolate clades similarly consisting of three genomes. Six different control clades were designated, all of which yielded qualitatively the same results (Figure 4); detailed results are discussed only for the MED193/SK209-2-6/Y4I control group here. For over 500 orthologs, the mean dC and dR were computed for the following two sets of comparisons: (1) between the SAG-O19 clade and the remaining genomes (excluding the control clade and the basal HTCC2255 lineage; see Materials and methods and Figure 3) and (2) between the control clade and the remaining genomes (excluding the SAG-O19 clade and the basal HTCC2255 lineage). As anticipated, the mean dC was significantly greater than mean dR (0.297±0.004 vs 0.287±0.004 for the comparisons involving the SAG-O19 clade; 0.209±0.003 vs 0.192±0.003 for comparisons involving the control clade; paired t-test P<0.001 for both), confirming the expected bias toward conservative amino acid changes. However, the mean ratio dR/dC was significantly greater in the SAG-O19 clade (0.970±0.007) compared with the control clades (0.916±0.007 for the MED193/SK209-2-6/Y4I comparison; paired t-test; P<0.001; Figure 4), suggesting that the bias toward conservative amino acid changes is weaker for the SAG-O19 clade. We conducted a similar analysis that classified amino acids by volume and polarity (Table 2; Miyata et al., 1979) rather than charge, and this also showed a greater rate of physicochemically dissimilar amino acid changes in the SAG-O19 clade relative to the cultured Roseobacter clades (data not shown). Therefore, analyses of dR/dC are consistent with the hypothesis that relaxation of purifying selection is a dominant evolutionary force giving rise to the clade.
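The paired comparison of dR/dC ratios across gene families can be sketched in Python as follows. The sketch computes only the paired t statistic and its degrees of freedom from per-family ratios; the input values are invented placeholders, the function name is hypothetical, and obtaining a P value would require a t distribution from a statistics library.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Paired t statistic for matched observations (e.g. one pair per gene family).

    Returns (t, degrees_of_freedom).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    diffs = [a - b for a, b in zip(x, y)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    standard_error = sd / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error, len(diffs) - 1


# Hypothetical dR/dC ratios for four gene families in two sets of comparisons.
focal_clade = [1.0, 1.2, 0.9, 1.1]
control_clade = [0.9, 1.0, 0.8, 1.0]
t_value, dof = paired_t_statistic(focal_clade, control_clade)
print(round(t_value, 3), dof)
```

Pairing by gene family removes between-family variation in overall constraint, so the test focuses on the within-family difference between the two sets of comparisons.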

Figure 4

The ratio of radical (dR) to conservative (dC) substitution rates based on physicochemical properties (shown here, charge) of amino acids for the SAG-O19 clade and six control clades. Bars indicate one standard deviation of the mean. For the control clades, only strain names are shown; full names can be found in Figure 3.

Table 2 Classification of amino acids by two independent schemes based on physicochemical properties

Enhanced evolutionary rate of G+C content change in the uncultivated clade

As members of the SAG-O19 clade have a substantially lower G+C content compared with most other Roseobacter lineages, we examined whether this difference is caused by a change in selective pressure. The 20 amino acids were classified into three groups based on the G+C content at the first two sites of a codon (as natural selection largely operates on nucleotide substitutions resulting in the replacement of amino acids): high G+C, low G+C and neutral (Table 3). Using the same analysis design as above, we compared the pattern of codon replacement for the SAG-O19 clade and for each of the six different control clades with the remaining Roseobacter genomes. In the comparisons involving the SAG-O19 clade, mean dwithin (within group; 0.323±0.004) was significantly greater than mean dbetween (between group; 0.267±0.004; paired t-test; P<0.001). Similarly, in the comparisons involving the MED193/SK209-2-6/Y4I control clade, mean dwithin (0.242±0.003) was significantly greater than mean dbetween (0.174±0.003; paired t-test; P<0.001). Thus, in both sets of comparisons, amino acid changes had a preference for codons with similar G+C content. However, the mean ratio dbetween/dwithin was significantly greater in the comparisons involving the SAG-O19 clade (0.811±0.005) than those involving all six control clades (0.708±0.004 for the MED193/SK209-2-6/Y4I comparison; paired t-test; P<0.001), suggesting a greater rate of substitution toward G+C-poor codons in the SAG-O19 clade.
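To make the basis of this grouping concrete, the following Python sketch computes, for each amino acid, the mean G+C fraction at the first two codon positions under the standard genetic code. It does not reproduce Table 3: no thresholds for the high, low and neutral groups are assumed, and the variable and function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def gc12_by_amino_acid():
    """Mean G+C fraction at codon positions 1 and 2 for each amino acid."""
    per_aa = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # skip stop codons
            continue
        gc_fraction = sum(base in "GC" for base in codon[:2]) / 2
        per_aa.setdefault(aa, []).append(gc_fraction)
    return {aa: sum(values) / len(values) for aa, values in per_aa.items()}


gc12 = gc12_by_amino_acid()
for aa in sorted(gc12, key=gc12.get):
    print(aa, round(gc12[aa], 2))
```

Amino acids such as alanine, glycine and proline are encoded only by codons with G or C at both of the first two positions, whereas lysine, asparagine and phenylalanine are encoded only by codons with A or T at those positions, so a shift in genomic G+C content is expected to shift amino acid usage between such groups.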

Table 3 Grouping of amino acids based on the G+C content of the first two positions in a codon
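The grouping idea can be illustrated with a minimal Python sketch that scores each amino acid by the average number of G or C bases at the first two positions of its codons in the standard genetic code. The cutoffs (`high_cutoff`, `low_cutoff`) and the function names are hypothetical choices made for illustration; they are not taken from the study and may not reproduce Table 3 exactly.

```python
from collections import defaultdict

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... following the TCAG base order above.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def gc_groups(high_cutoff=1.5, low_cutoff=0.5):
    """Assign each amino acid to 'high', 'low' or 'neutral'.

    The score is the mean number of G/C bases at codon positions 1 and 2,
    averaged over all codons for that amino acid. Cutoffs are assumptions.
    """
    gc_counts = defaultdict(list)
    index = 0
    for b1 in BASES:
        for b2 in BASES:
            for _b3 in BASES:  # third position is ignored by design
                aa = AMINO_ACIDS[index]
                index += 1
                if aa != "*":  # skip stop codons
                    gc_counts[aa].append((b1 in "GC") + (b2 in "GC"))

    groups = {}
    for aa, counts in gc_counts.items():
        mean_gc = sum(counts) / len(counts)
        if mean_gc > high_cutoff:
            groups[aa] = "high"
        elif mean_gc < low_cutoff:
            groups[aa] = "low"
        else:
            groups[aa] = "neutral"
    return groups


def replacement_type(aa_from, aa_to, groups):
    """Label a replacement as 'within' or 'between' G+C groups."""
    return "within" if groups[aa_from] == groups[aa_to] else "between"


if __name__ == "__main__":
    groups = gc_groups()
    for label in ("high", "neutral", "low"):
        members = sorted(aa for aa, g in groups.items() if g == label)
        print(label, "".join(members))
```

Under these assumed cutoffs, a replacement such as alanine to glycine counts as within group, whereas alanine to lysine counts as between groups. Rates such as dwithin and dbetween would then be estimated separately for each class of replacement; that estimation step is not shown here.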

Acceleration of replacement from G+C-rich to G+C-poor codons in the uncultivated SAG-O19 clade could have both selection and drift explanations. The low G+C content of marine bacterioplankton genomes has frequently been attributed to adaptation via selection for nitrogen conservation in the oligotrophic ocean, achieved by lowering cell quotas through the preferential use of nitrogen-poor A/T nucleobases in the genome and nitrogen cost minimization in the proteome (Bragg and Hyder, 2004; Martin-Cuadrado et al., 2008; Grzymski and Dussaq, 2012). On the other hand, there is an inherent mutational bias from G/C to A/T in bacteria (Lind and Andersson, 2008; Balbi et al., 2009; Hershberg and Petrov, 2010; Hildebrand et al., 2010). Thus, the presence of lineages with high G+C content suggests that purifying selection has acted to maintain the G+C content (Balbi et al., 2009; Hershberg and Petrov, 2010; Hildebrand et al., 2010; Raghavan et al., 2012), and, conversely, low G+C content in other lineages could result from random fixation of mutations following relaxation of purifying selection. In the case of Roseobacters, the mean ratio dbetween/dwithin was significantly greater in the comparisons involving the G+C-poor SAG-O19 clade than in those involving each of the six control clades, a result that could be explained by relaxation of purifying selection in the SAG-O19 clade.

If low G+C content is under selection in this clade, the attribution of a drift component to the high dR/dC is less strongly supported because selective pressure to decrease G+C content may interfere with selection for conservation of amino acid physicochemical properties. In other words, the fitness cost of radical changes could be balanced by the benefit of decreasing G+C content. However, the assumption that low G+C content in marine bacteria is favored by selection is controversial. For instance, members of the LD12 clade are found in freshwater where N is not limiting, yet have the same low G+C content as their marine SAR11 relatives (30%; Giovannoni et al., 2005), suggesting that factors other than selection for N thrift must have acted to shape these G+C-poor genomes (Ghai et al., 2011).

Conclusion

Streamlined genomes are found in several other dominant marine bacterial lineages, such as SAR11 (Giovannoni et al., 2005), SAR86 (Dupont et al., 2012) and Prochlorococcus (Rocap et al., 2003). Although there is no direct estimate of effective population sizes for these lineages, it has generally been assumed that they are quite large, thereby minimizing the power of genetic drift (Morris et al., 2012) and implicating selective forces in shaping characteristics such as genome size (Dufresne et al., 2005; Giovannoni et al., 2005; Hu and Blanchard, 2009; Grote et al., 2012; Morris et al., 2012) and nucleotide composition (Giovannoni et al., 2005; Martin-Cuadrado et al., 2008). Indeed, small genomes may enhance metabolic efficiency and lead to smaller cells with increased surface-to-volume ratios, both of which are advantageous to bacteria living in oligotrophic oceans (Dufresne et al., 2005; Giovannoni et al., 2005). Further, cells with G+C-poor genomes use less nitrogen in both genomes and proteomes than cells with G+C-rich genomes (Bragg and Hyder, 2004; Grzymski and Dussaq, 2012). Other explanations for genome streamlining in marine bacteria include changes in bacterial lifestyle that make previously necessary genes dispensable. This relaxation of purifying selection on nonessential genes (Luo et al., 2011a, 2013) does not require a reduction in effective population size, and thus does not conflict with the selection hypothesis.

Here, we addressed the evolutionary history of a streamlined clade in the Roseobacter lineage through analysis of genomes from single, uncultivated cells, finding evidence for accelerated evolution toward nonconservative amino acid changes and low G+C content that suggests a possible role for genetic drift. The seemingly contradictory explanations of drift and selection can be reconciled if these evolutionary forces acted at different periods during the clade’s evolutionary history. The conceptual differences between these evolutionary processes, including whether genomic traits become fixed through random fixation of neutral or even slightly deleterious mutations or through directional evolution toward fixation of advantageous variants, are fundamental to our understanding of microbial adaptation in the ocean.