Introduction

Archaea and bacteria encode diverse defense systems that protect these microbes from viruses, plasmids and other selfish genetic elements. One of the most efficient and widespread types of defense system is CRISPR (clustered regularly interspaced short palindromic repeats)–Cas (CRISPR-associated genes). CRISPR–Cas systems provide acquired, heritable immunity to bacteria and archaea against both viral (Barrangou et al., 2007; Manica et al., 2011) and plasmid (Marraffini and Sontheimer, 2008; Gudbergsdottir et al., 2011) invasion. CRISPR–Cas systems are more common in archaea than in bacteria, and are more abundant and contain more spacers in thermophiles than in mesophiles (Anderson et al., 2011; Weinberger et al., 2012b).

The CRISPR–Cas loci encompass arrays of direct, often partially palindromic repeats that are interspersed by short unique DNA segments (20–50 bp long), termed spacers, and multiple cas genes that encode proteins involved in the immune response. A seminal discovery that led to the characterization of the unique mechanisms of CRISPR–Cas action has been that some of the spacers are (nearly) identical to sequences of selfish elements (in this case, known as protospacers). Subsequently, it has been shown that a complex of Cas1 and Cas2 proteins excises protospacer DNA from invading elements and integrates it into the CRISPR arrays (Barrangou et al., 2007; Deveau et al., 2008; Li et al., 2014). One or several spacer arrays in a prokaryotic cell can be transcribed and processed into small CRISPR RNA (crRNA) molecules by a complex of several Cas proteins known as Cascade. The Cascade complex then mediates the formation of a duplex between the crRNA and the cognate protospacer sequence in an invading nucleic acid molecule, triggering degradation of the latter (Haurwitz et al., 2010; Semenova et al., 2011).

The fact that the CRISPR–Cas systems are able to continuously acquire new spacers enables partial reconstruction of the history of past selfish-element infections (Tyson and Banfield, 2008; Denef et al., 2010; Held et al., 2010; Stern et al., 2012; Weinberger et al., 2012a). In the absence of parasitic elements, spacers can be easily lost owing to the deletion bias of prokaryotic genome evolution (Kuo and Ochman, 2009) and the presumed cost of maintaining CRISPR systems (Weinberger et al., 2012a).

The source(s) of this putative cost of the CRISPR–Cas system are not fully clear. One possibility involves autoimmunity, that is, erroneous incorporation of protospacers from the host genome into a CRISPR cassette, resulting in damage to the host chromosome (Stern et al., 2010). Another potentially major source of cost could be the curtailment of horizontal gene transfer (HGT) and hence prevention of genetic novelty acquisition by bacteria and archaea resulting from the CRISPR–Cas activity (Marraffini, 2013; Hatoum-Aslan and Marraffini, 2014). Indeed, it has been shown that multidrug-resistant clinical isolates of enterococci lack CRISPR–Cas, unlike sensitive strains, presumably because having the system hinders resistance plasmid acquisition (Palmer and Gilmore, 2010). The balance between spacer gain and loss could thus be affected by the relative selective pressures exerted on the one hand by viruses and on the other hand by beneficial plasmids (Jiang et al., 2013). Furthermore, CRISPR–Cas can prevent nonparasitic gene acquisition events such as natural transformation with naked DNA (Bikard et al., 2012). It therefore appears that there is a trade-off between active acquired immunity (that is, the constant integration of new spacers and maintenance of large CRISPR arrays) and the ability to acquire new genes and novel functions via HGT. Here we test this hypothesis using available genomic databases and come to the conclusion that on evolutionary timescales, the inhibitory effect of CRISPR–Cas on HGT is undetectable.

Materials and methods

Data

The CRISPR spacer counts were obtained from the September 2013 version of the CRISPRdb (Grissa et al., 2007) and included spacers from all partitions in the genome in question (that is, both chromosome- and plasmid-encoded spacers). The spacer number ranged from 0 to 736.

The updated version of the ATGC database clusters ~60% of proteins and 62% of genomes from RefSeq (as of June 2013) into 269 groups of closely related genomes of bacteria and archaea (Novichkov et al., 2009; Puigbo et al., 2014). Clusters of Orthologous Groups (COGs) shared between genomes in each ATGC were identified (Tatusov et al., 1997; Kristensen et al., 2010), leaving singletons defined as those genes that were not shared with any of the closely related genomes included in the given ATGC. These singletons are most likely to have arisen via recent HGT that occurred after these closely related organisms diverged. Only clusters of three or more taxa were used for detection of singletons. For the detection of putative HGT in archaea, the established set of arCOGs (Wolf et al., 2012; December 2012 release) was employed.
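The singleton definition above can be illustrated with a minimal Python sketch. It assumes a hypothetical input format in which each genome of one ATGC is a list with one ortholog-cluster identifier per gene; the function name, the identifiers and the `min_taxa` parameter are introduced here for illustration and are not taken from the ATGC or arCOG pipelines.

```python
from typing import Dict, List


def singleton_fractions(
    genomes: Dict[str, List[str]], min_taxa: int = 3
) -> Dict[str, float]:
    """Fraction of singleton genes per genome within one group of close relatives.

    `genomes` maps a genome name to the ortholog-cluster identifier of each of
    its genes (hypothetical format). A gene counts as a singleton when its
    cluster is not found in any other genome of the group.
    """
    # Groups with fewer than three taxa are skipped, as in the text.
    if len(genomes) < min_taxa:
        return {}

    fractions: Dict[str, float] = {}
    for name, clusters in genomes.items():
        # Every cluster identifier seen in the other genomes of the group.
        seen_elsewhere = set()
        for other_name, other_clusters in genomes.items():
            if other_name != name:
                seen_elsewhere.update(other_clusters)
        n_singletons = sum(1 for c in clusters if c not in seen_elsewhere)
        fractions[name] = n_singletons / len(clusters) if clusters else 0.0
    return fractions


# Toy group of three genomes; identifiers are made up.
example = {
    "g1": ["A", "B", "C", "X"],
    "g2": ["A", "B", "C"],
    "g3": ["A", "B", "Y", "Z"],
}
print(singleton_fractions(example))
```

In this toy group, gene X is unique to g1 and genes Y and Z are unique to g3, so only those three genes would be treated as candidates for recent acquisition.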

Prophage regions within each genome were identified by PhageFinder (Fouts, 2006). Counts of laterally acquired genes based on unusual dinucleotide composition were obtained from Popa et al. (2011). Growth temperatures for microbial genomes were downloaded from BacDive (Sohngen et al., 2014) in April 2014.

Statistical analysis

The available data were compiled for 1399 microbial genomes. Decimal logarithms for the fraction of singletons in arCOGs or ATGC COGs, fraction of horizontally transferred genes estimated using the dinucleotide pattern and the fraction of phage-related genes in bacteria were used in the analysis as independent measures of HGT (for correlations between HGT measures, see Supplementary Table 1). The number of CRISPR cassette spacers (S) was used in the log10(S+1) form. Growth temperatures were in degrees Celsius. The linear model function lm() of R was used to estimate the parameters of linear predictors; goodness of fit of alternative models was compared using the anova() function.
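The two transformations described here can be sketched in a few lines of Python. The +1 offset keeps genomes with no spacers (S = 0) defined on the log scale. The function names are illustrative, and the text does not say how genomes with a zero fraction of acquired genes were handled, so this sketch simply rejects them.

```python
import math


def log_spacer_count(spacers: int) -> float:
    """Return log10(S + 1), so that a genome with zero spacers maps to 0."""
    if spacers < 0:
        raise ValueError("spacer count cannot be negative")
    return math.log10(spacers + 1)


def log_fraction(count: int, total_genes: int) -> float:
    """Return log10(count / total_genes) for one HGT measure.

    Assumption: zero counts are rejected because their logarithm is
    undefined; the source does not describe how such genomes were treated.
    """
    if count <= 0 or total_genes <= 0:
        raise ValueError("count and total_genes must be positive")
    return math.log10(count / total_genes)


print(log_spacer_count(0))     # genome without CRISPR spacers
print(log_spacer_count(736))   # largest spacer count reported in the data
print(log_fraction(10, 1000))  # 1% of genes flagged as recently acquired
```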

Results

Correlating the number of CRISPR spacers and gene acquisition via HGT

There is currently no accurate proxy for CRISPR–Cas activity that can be applied to all microbial genomes. In the absence of a direct measure, spacer count, although not perfect, is the best available genomic proxy for CRISPR activity. It has been established that active CRISPR–Cas systems continuously acquire additional spacers, resulting in longer CRISPR arrays (Tyson and Banfield, 2008). Conversely, spacers can also be quickly deleted (Deveau et al., 2008; Horvath et al., 2008; Tyson and Banfield, 2008), in the absence of selective pressure for maintaining CRISPR–Cas activity (Weinberger et al., 2012a), because prokaryotic genomes have a bias toward deletions (Kuo and Ochman, 2009), and owing to frequent recombination within repeat loci. Thus, inactive CRISPR arrays will tend to shrink, whereas active ones will expand or at least maintain more or less constant size. Furthermore, even if a CRISPR–Cas system has been recently inactivated in a genome, but this genome still carries a large CRISPR array, it appears more appropriate to view its recent evolutionary history as one where CRISPR–Cas has had an effect.

We therefore used genomic spacer count as a proxy for the CRISPR activity and correlated it with the three available measures of gene acquisition via HGT, namely fraction of prophage genes in bacteria, fraction of singletons in the ATGC COGs and arCOGs, and the fraction of the recently acquired genes inferred on the basis of dinucleotide composition. These relationships show a complex pattern (Tables 1 and 2 and Figure 1), with the magnitude and even the sign of the correlation differing between archaea and bacteria, between CRISPR-positive and CRISPR-negative genomes and between thermophiles and mesophiles. For example, CRISPR-negative bacteria on average encode many fewer prophage-encoded proteins than CRISPR-positive genomes (66.85 vs 114.883, respectively). However, in the CRISPR-positive bacteria, the number of CRISPR spacers negatively correlates with the fraction of prophage genes (Table 2), in agreement with the established role of CRISPR systems in preventing lysogenization (Edgar and Qimron, 2010). In bacteria, the association between spacer number and the fraction of singletons was nonmonotonic, and surprisingly, the most spacer-rich species also had the highest average fraction of singletons (Figure 1), and the presence of spacers in a genome positively correlated with the fraction of recently acquired genes (both singletons and genes of unusual dinucleotide signature, see Table 2).

Table 1 Estimated levels of HGT in CRISPR–spacer-containing and CRISPR–spacer-lacking genomes
Table 2 Nonparametric correlations between CRISPR spacer count and HGT measures
Figure 1

The fraction of singleton genes in bacterial genomes binned by CRISPR spacer counts.

In archaeal mesophiles, the fraction of singletons positively correlated with the number of CRISPR spacers (ρ=0.137, N=46), but in the archaeal thermophiles the relationship seems to be reversed (ρ=−0.213, N=65; although both correlations fail to reach statistical significance owing to the small sample size, with P-values of 0.3639 and 0.0887, respectively). In addition, all measures of HGT increase with the genome size, which is a typical dependence for most classes of nonhousekeeping genes (Koonin and Wolf, 2008). In contrast, the number of CRISPR spacers is independent of genome size but has been shown to correlate with the optimal growth temperature (Weinberger et al., 2012b).
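The ρ values above are nonparametric correlations. As an illustration only, the sketch below assumes they are Spearman rank correlations, which the ρ notation suggests but the text does not state, and computes the coefficient as the Pearson correlation of tie-averaged ranks. All function names are hypothetical, and the P-value calculation is omitted.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based mean rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values: spacer counts and singleton fractions for five genomes.
spacers = [0, 12, 40, 40, 150]
singleton_fraction = [0.05, 0.02, 0.04, 0.01, 0.03]
print(spearman_rho(spacers, singleton_fraction))
```

Working on ranks makes the measure insensitive to the strongly skewed distribution of spacer counts, which is one reason a rank-based coefficient is a natural choice for data of this kind.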

Factors affecting gene acquisition

To disentangle the multidimensional network of possibly nonmonotonic relationships, we constructed a linear model to test the predictive power of several variables that could affect the fraction of recently acquired genes. These variables included domain affiliation (Bacteria or Archaea), genome size (the number of protein-coding genes), the growth temperature (either the optimal growth temperature or a binary thermophile vs mesophile classification) and the CRISPR–Cas activity approximated by the number of spacers in CRISPR cassettes (see Supplementary Methods for details). The analysis was started with a model containing all four predictor variables and all their pairwise combinations. The least significant combinations or variables were removed, and the reduced model was compared with the original one using the ANOVA. If the reduced model did not significantly differ from the original full model, the reduction was accepted and the reduced model tested further by removal of the next least significant contributor. Data for 1399 genomes with at least one proxy measure for HGT (fraction of singletons in the context of arCOGs or ATGC COGs, dinucleotide pattern based estimates, number of prophage-related genes in bacteria) available were compiled. Optimal growth temperatures were available for 1034 organisms.
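The stepwise reduction described above rests on comparisons between nested linear models. As a sketch, the starting model and the test statistic can be written as follows, assuming the F-test that R's anova() applies by default to two nested lm() fits (the text names neither the test nor the significance threshold). The notation is introduced here for illustration: y is the log-transformed fraction of recently acquired genes, and x_1 to x_4 stand for domain affiliation, genome size, growth temperature and log10(S+1). Whether genome size was transformed is not stated in this section.

```latex
% Full starting model: four predictors plus all pairwise interaction terms
y = \beta_0 + \sum_{i=1}^{4} \beta_i x_i
    + \sum_{1 \le i < j \le 4} \beta_{ij}\, x_i x_j + \varepsilon

% F statistic comparing a reduced model (r) with the fuller model (f)
F = \frac{\left(\mathrm{RSS}_r - \mathrm{RSS}_f\right) / \left(p_f - p_r\right)}
         {\mathrm{RSS}_f / \left(n - p_f\right)}
```

Here RSS_r and RSS_f are the residual sums of squares of the reduced and fuller models, p_r < p_f are their numbers of estimated coefficients, and n is the number of genomes. Under the null hypothesis that the dropped terms have zero coefficients, F follows an F distribution with (p_f − p_r, n − p_f) degrees of freedom, so a nonsignificant F corresponds to accepting the reduction.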

In cases where the growth temperatures were available and included in the model, the number of CRISPR spacers did not contribute significantly to the prediction of the genomic impact of recent HGT, by either method. The optimal predictor for the fraction of singletons included the combined contribution from genome size and growth temperatures, with different offsets for archaea and bacteria (because archaea tend to have many more spacers than bacteria, Figure 2). Altogether, these variables explained ~0.3 of the original variance in the fraction of singletons. When growth temperatures were classified in a binary form (thermophile or mesophile), the number of spacers contributed to the prediction of the number of singletons, along with the genome size, thermophilic status and domain affiliation (Figure 3). Given the strong correlation between the number of CRISPR spacers and the optimal growth temperature (linear correlation coefficient of 0.64), it is likely that the number of spacers is not linked to the extent of HGT directly, but rather serves as a proxy for the temperature. Qualitatively similar results were obtained for the fraction of foreign genes based on dinucleotide pattern and for the fraction of phage-related genes in bacteria (Supplementary Figure 1). Thus, on the scale of evolution of relatively close bacteria and archaea that comprise the ATGC, there is no evidence of a connection between the activity of CRISPR–Cas and HGT.

Figure 2

A predictive model for the fraction of singletons in genomes of archaea and bacteria. Surfaces (top: archaea, bottom: bacteria) indicate the expected fraction of singletons given the optimum growth temperature and genome size.

Figure 3

Predicted and observed fraction of singletons in genomes of archaea and bacteria. Data points correspond to 261 genomes with available optimum growth temperature and the number of singletons. The y=x line is shown.

Discussion

Because CRISPR–Cas systems can prevent plasmid conjugation (Marraffini and Sontheimer, 2008), prophage integration (Edgar and Qimron, 2010) and transformation with naked DNA (Bikard et al., 2012), these immune systems are thought to represent a barrier for HGT that might be detrimental for microbial populations. Observations on human pathogens (Palmer and Gilmore, 2010) as well as a combination of experiments and models (Jiang et al., 2013), have indeed demonstrated that under selective pressure for gene acquisition (for example, exposure to antibiotics, such that acquired resistance is key for survival or replication), CRISPR–Cas systems are often inactivated or lost. Nevertheless, the question of whether indeed there is a general trade-off between possessing efficient protection against selfish genetic elements and access to genetic novelty has not been addressed on a large scale. Here we tested the association between (relatively) recent gene acquisition and CRISPR–Cas activity and observed that overall there is little if any support for the trade-off hypothesis. Indeed, there was no significant negative correlation between CRISPR–Cas activity and the fraction of recently acquired genes (either singletons or genes with unusual dinucleotide composition) in bacteria, whereas the negative correlation observed in archaea seems to merely reflect the anticorrelation between the growth temperature and HGT rate.

The apparent lack of trade-off between CRISPR–Cas activity and gene acquisition via HGT demonstrated in our analysis can be attributed to several nonmutually exclusive evolutionary mechanisms. The simplest explanation is that CRISPR–Cas systems and CRISPR arrays are often themselves mobile so that their presence–absence (inactivity) in any extant genome is not indicative of their longer-term impact. In other words, the relevant events in the evolution of resistance to mobile elements and proclivity for HGT, in which CRISPR–Cas systems have an important role, occur on the population level rather than on the evolutionary scale. Second, in some environments, microbes could experience such high exposure to mobile genetic elements that any CRISPR-mediated resistance to incoming DNA would be ‘a drop in a bucket’. Thus, large CRISPR arrays could indicate genomes that actively acquire new spacers and are under selection to maintain this activity, because they undergo a barrage of selfish DNA elements. In addition, recent evidence has shown that some CRISPR–Cas systems require transcription of the foreign DNA for interference, and thus allow lysogenization, and by inference, at least some forms of HGT, while preventing lytic infection (Goldberg et al., 2014). Third, spacer acquisition in both bacteria and archaea is far from random, with arrays preferentially acquiring new spacers from genomes that contain DNA sequences matching pre-existing spacers, at least partially, a phenomenon known as priming (Datsenko et al., 2012; Swarts et al., 2012; Li et al., 2014). Priming inevitably generates a strong bias toward frequently encountered invasive genetic elements, typically highly infecting viruses, rather than the full spectrum of the exogenous DNA, some of which is potentially beneficial. 
Indeed, extensive surveys of CRISPR spacers have demonstrated that the majority of matches are to widespread archaeal viruses and bacteriophages, and that multiple spacers potentially matching the same viral genome are often present within the same array (Brodt et al., 2011; Stern et al., 2012). Thus, priming is probably a common mechanism in CRISPR–Cas systems and is likely not only to contribute to more efficient resistance against common viruses but also to limit the novelty-restricting effect of CRISPR–Cas.

What would initially appear like a CRISPR–Cas-mediated barrier to gene transfer in thermophiles and hyperthermophiles is strongly suggested by our statistical model to reflect a negative association between the growth temperature of an organism and its propensity for taking up novel genes. The barriers to HGT have been intensively studied (Sorek et al., 2007; Wellner et al., 2007; Wellner and Gophna, 2008; Gophna, 2009; Omer et al., 2010; Popa et al., 2011; Naor et al., 2012) but to our knowledge, this work is the first to demonstrate the effect of the growth temperatures on the estimated frequency of gene acquisition. Paradoxically, genomes of extremophilic archaea and bacteria have been known for the large impact of HGT on their evolution, often involving interdomain transfer (Koonin et al., 2001; Gophna et al., 2005). However, most of the HGT events associated with those lineages are ancient ones, often apparently dating back to the emergence of major groups such as Halobacteria, Thermoplasmatales and Archaeoglobales (Nelson-Sathi et al., 2012, 2014). In contrast, our present analysis focused on relatively recent gene transfer events that occurred after the divergence of archaeal and bacterial families and genera. Thus, a major influx of foreign genes in the past of a particular lineage does not imply promiscuous HGT at present. Although it appears reasonable to assume that transformation could be anticorrelated with temperature because naked DNA would degrade faster in hotter environments, natural competence has been demonstrated in hyperthermophiles such as Pyrococcus furiosus (Lipscomb et al., 2011) with optimal growth temperatures exceeding 95 °C. In the absence of molecular evidence for the effects of increased temperature, one is left with ecological rationalizations.
Higher temperature environments, in particular extreme ones, can be extremely harsh and therefore encompass a lower diversity of microbes (Kemp and Aller, 2004; Miller et al., 2009; López-López et al., 2013) and tighter functional constraints on protein structures (Friedman et al., 2004; Zeldovich et al., 2007; Drake, 2009) both resulting in a smaller gene pool available for HGT.