Homologous recombination substantially delays sequence but not gene content divergence of prokaryotic populations

Evolution of bacterial and archaeal genomes is a highly dynamic process that involves extensive gain and loss of genes. Therefore, phylogenetic trees of prokaryotes can be constructed both by the traditional sequence-based methods (gene trees) and by comparison of gene compositions (genome trees). Comparing the branch lengths in gene and genome trees with identical topologies for 34 clusters of closely related bacterial and archaeal genomes, we found that the terminal branches of gene trees were systematically compressed compared to those of genome trees. Thus, sequence evolution seems to be significantly delayed with respect to genome evolution by gene gain and loss. The extent of this delay widely differs among bacterial and archaeal lineages. We develop and explore mathematical models demonstrating that the delay of sequence divergence can be explained by sequence homogenization that is caused by homologous recombination. Once evolving genomes become isolated by barriers that impede homologous recombination, gene and genome evolution processes settle into parallel trajectories, and genomes diverge, resulting in speciation. This model of prokaryotic genome evolution gives a mechanistic explanation of our previous finding that archaeal genomes contain a class of genes that turn over rapidly, before significant sequence divergence occurs, and provides a framework for correcting phylogenetic trees, to make them consistent with the dynamics of gene turnover.


Introduction 27
Evolution of bacterial and archaeal genomes is a highly dynamic process that involves extensive gain and 28 loss of genes, with turnover rates comparable to if not exceeding the rate of nucleotide substitution 1-3 . 29 Gene gain and loss occur through insertion and deletion of genome segments of variable size, including 30 large genomic islands, via mechanisms of non-homologous recombination, often involving mobile 31 genetic elements 4,5 . The gene gain and loss events can be used to generate "gene content trees" that 32 reflect the evolution of microbial pangenomes and complement traditional phylogenetic trees 33 constructed from sequence alignments of highly conserved marker genes. From such trees, a gene 34 turnover clock can be defined. The gene turnover clock ticks at a rate that does not necessarily correlate 35 with the rate of the traditional, sequence-based molecular clock. 36 Evolution of prokaryotic populations is strongly affected by homologous recombination, which is 37 regarded as a major contributor to maintaining genetic cohesion by preventing sequence divergence via 38 gene conversion 6,7 . However, because barriers to recombination do not necessarily affect the whole 39 genome, bacterial strains can diverge at some loci while remaining cohesive at others 8 . The rate of 40 population divergence and, eventually, speciation thus depends on the dynamics of recombination 41 barrier emergence across genomes. 42 Efficient homologous recombination between two genomes requires the presence of (nearly) identical 43 nucleotide sequences flanking the exchanged genomic regions. The minimum length of these flanks 44 depends on the species, with typical values around 25-100 nucleotides 9,10 . As genomes diverge, the 45 probability to find fully conserved flanking sequences decreases, and so does the efficiency of 46 recombination 11,12 . The existence of genetic barriers to homologous recombination was initially 47 observed in experimental studies which have shown that sequence divergence of over 5% can prevent 48 most recombination events in some bacteria [13][14][15] . Subsequently, comparative genomic analyses have 49 confirmed that barriers to homologous recombination are widespread and can be used to define 50 biological species in bacteria and archaea [16][17][18] . Mechanistic modeling of the molecular processes involved 51 in homologous recombination has shown that barriers to recombination can build up spontaneously if 52 the balance of mutation and recombination favors the sequence variability in the population 19-21 . 53 Barriers to recombination can also arise after the acquisition of new genes 8,22 , especially if the newly 54 acquired genes are involved in niche specialization 23,24 . In this case, the barriers seem to result, in part, 55 from selection against gene conversion events that would lead to the loss of the recently acquired, 56 beneficial genes. 57 Here, we investigate how the fraction of genes shared by closely related bacterial and archaeal genomes 58 decays with the phylogenetic distance and show that molecular and gene turnover clocks are 59 incongruent at short evolutionary times. To elucidate the origin of this discrepancy, we develop a 60 mathematical model of genome evolution that describes the dynamics of sequence divergence in the 61 presence of gene conversion. The model predicts the existence of a recombination-driven delay in the 62 molecular clock, the magnitude of which corresponds to the time required for barriers to recombination 63 to spread across the genome. By fitting the model to genomic data, we obtain estimates of such 64 recombination-driven delays in 34 groups of closely related bacteria and archaea, and show that the 65 incongruence between the molecular and gene turnover clocks disappears if the former is corrected to 66 account for the estimated delay. Finally, we investigate the tempo and the factors that contribute to the 67 establishment of barriers to recombination in the populations of diverse bacteria and archaea. 68 69 70

71
Discrepancy between the molecular clock and the gene turnover clock 72 The evolution of gene content in closely related genomes has been recently investigated by 73 mathematical modeling and comparative genomics 25-28 . These analyses have shown that the fraction of 74 genes shared by a pair of genomes decays exponentially with time as the genomes diverge, and that 75 genes can be classified in two categories based on their turnover rates 26 . Here we extend these 76 approaches to sets of 3 and more genomes. For illustration, let us consider a simple scenario in which all 77 genes have the same turnover rate . The fraction of genes shared by a pair of genomes is then 78 where 2 is the total evolutionary tree distance, which in the case of 2 genomes is equal to twice the 80 distance from the last common ancestor. We show in the Methods that a similar formula describes the 81 divergence in gene content for groups of 3 or more genomes. Specifically, the fraction of genes shared 82 by genomes decays as 83 where is the total evolutionary distance spanned by those genomes. Given a phylogenetic tree, the 85 total distance is the sum of branch lengths of the subtree that includes the genomes. The most 86 notable aspect of this result is that the dynamics of gene content divergence is independent of the 87 number of genomes considered ( ). As a result, plots of the fraction of shared genes ( ) as a function of 88 the total evolutionary distance ( ) for different sample sizes collapse into a single curve (Fig. 1a). This 89 property remains valid under very general models of gene turnover, including the case where the rates 90 of gene gain and loss differ across gene families (see Methods), under the condition that the tree 91 distance is proportional to the rate of the gene turnover clock. 92 To test this theoretical prediction, we analyzed the profiles of gene sharing in 34 groups of closely 93 related genomes from Bacteria and Archaea. As a proxy for the evolutionary time, we used the branch 94 lengths of high-resolution sequence similarity phylogenetic trees built from concatenated alignments of 95 single-copy core genes. Then, we sampled subsets of genomes and represented the fraction of shared 96 genes as a function of the total tree distance. At odds with the theoretical expectation, we found that 97 gene-sharing decay curves depend on the number of sampled genomes: as more genomes are added, 98 the curves shift down and the fraction of shared genes becomes smaller than expected (left panel on 99 Fig. 1c shows a representative case; see also Supplementary Fig. S1). As illustrated by Fig. 1b, such a  100 non-overlapping pattern could be easily explained if the lengths of the terminal branches in the 101 phylogenetic trees were systematically underestimated. In more general terms, the curves in Fig. 1c  102 involve two different clocks: the molecular clock, that is used to infer the branch lengths in the 103 phylogenetic tree; and the gene turnover clock, that governs the stochastic process of gene loss and, 104 with it, the decay in the fraction of shared genes. Thus, the absence of overlap among gene sharing 105 curves reflects a non-linear relationship between these two clocks, or more specifically, a delay in the 106 molecular clock relative to the gene turnover clock. 107 To further explore the differences between the molecular clock and the gene turnover clock, we 108 generated an alternative set of phylogenetic trees by rescaling branch lengths so that they are 109 proportional to the number of gene gains and losses occurred during the evolution of a lineage ( Fig.  110 2a,b; see Methods for details). Such "gene content trees" are topologically congruent with the gene 111 turnover clock, as long as the genomes are close enough to assume that gene turnover rates have 112 remained approximately constant since the last common ancestor (this approach does not require, 113 however, that gain and loss rates are homogeneous across genes). The non-linear relationship between 114 the molecular clock and the gene turnover clock becomes evident when comparing pairwise distances 115 among leaves in both sets of trees (Fig. 2c) as soon as barriers to recombination are established, the affected loci start diverging from the ancestral 129 population at a rate that is proportional to the substitution rate. As a result, the overall sequence 130 divergence at a given time results from the contribution of all loci that are isolated by barriers to 131 recombination, weighted by the time elapsed since the respective locus crossed the barrier. 132 We show in the Methods that homologous recombination within a population induces a delay in the 133 molecular clock, such that the average divergence of a genome with respect to a member of the same 134 population grows as 135 where is the time since the last common ancestor. Mathematically, the delay ( ) is a concave and 137 saturating function, the detailed form of which depends on the dynamics of the evolution of 138 recombination barriers (for exact expressions for some simple scenarios, see Supplementary  139 Information). For sufficiently long times, this term reaches a constant value ∞ , which is the long-term 140 evolutionary delay of the molecular clock induced by homologous recombination. The delay in the 141 molecular clock accounts for the amount of unobserved variation that is erased by gene conversion 142 during the early phases of divergence from the ancestral population. 143 A notable consequence of the delay in the molecular clock is that the terminal branches of phylogenetic 144 trees inferred from sequence analysis appear shorter than expected given the actual evolutionary times. 145 Specifically, terminal branches are shortened by a distance ( ), whereas internal branches are 146 shortened by a distance ( ) − ( ), where and are consecutive branching times measured 147 from the tips towards the root. Because both ( ) and ( ) tend to the same value ∞ as time passes, 148 the recombination-driven delays cancel out at long evolutionary times and deep internal branches 149 remain approximately unchanged. 150 To show that recombination-driven delays are the plausible cause for the lack of linearity between the 151 molecular and gene turnover clocks, we used the recombination barrier model to correct the branches 152 of sequence-based trees, accounting for the effects of recombination (see Methods). Then, we sampled 153 subsets of genomes and reassessed the divergence in gene content as a function of the corrected tree 154 distances. Remarkably, correction of the sequence trees led to gene-sharing decay curves that do not 155 depend on the sample size, as predicted by the theory (right panel on Fig. 1c shows a representative 156 case). When extending the same approach to all groups of genomes using taxa-specific delays (see 157 below) we found that the mean separation among the curves obtained for different sample sizes 158 decreased by 30% (permutation test p = 0.012; Supplementary Fig. S2). These results are compatible 159 with the existence of a recombination-driven delay in the molecular clock, which causes a systematic 160 shortening of the terminal branches of phylogenetic trees built from sequence alignments. 161

162
The dynamics of escape from recombination 163 We leveraged the differences between the substitution and gene turnover clocks to obtain quantitative 164 estimates of the recombination-driven delays in different taxa. Starting from (uncorrected) sequence 165 similarity trees (Fig. 2a) and gene content trees (Fig. 2b), we retrieved all pairwise distances among 166 leaves and plotted the distances from the sequence similarity tree against those from the gene content 167 tree (Fig. 2c). The recombination-driven delays were obtained by fitting the recombination barrier 168 model to such plots. To better understand how barriers to recombination spread along the genome, we 169 evaluated several scenarios for the temporal establishment of such barriers (see Methods). We found 170 that the data in 18 out of the 34 studied groups are best explained by an "autocatalytic" scenario, in 171 which the rate at which barriers spread accelerates as more and more loci become isolated 172 (Supplementary Table S1). The autocatalytic scenario also provides good fits in 13 more groups, 173 although the inclusion of an extra parameter required in this scenario is not statistically justified if the 174 delays are very small. Under the autocatalytic scenario, the fraction of sites susceptible to homologous 175 recombination follows a sigmoidal curve in time, with a relatively sharp transition (with few exceptions) 176 from a state of fully recombining genomes to a state in which all sites freely diverge. 177 The estimates of the long-term evolutionary delay ∞ range from less than 0.01 to more than 0.5 178 underreported substitutions per site, with broad variation among prokaryotic groups and sometimes 179 even within the same genus ( Fig. 3 and Supplementary Table S1). In approximately half of the groups, 180 molecular evolution appears to be strongly delayed, possibly by the pull of recombination, as indicated 181 by the fact that ∞ is larger than the depth of the sequence similarity tree. In these cases, there are few 182 pairs of genomes that have reached the regime of free (linear) divergence, and the estimation of the 183 upper 95% confidence bound for the long-term evolutionary delay becomes unfeasible. The 5 groups of 184 Firmicutes included in the analysis (covering bacilli, clostridia and streptococci) belong to this category. 185 In contrast, representatives from the genus Pseudomonas are characterized by a linear divergence 186 regime, with little or no signs of recombination-driven delay. 187 The magnitude of the recombination-driven delay is tightly linked to the time frame over which 188 sequence divergence takes place. In taxa with little or no delay, variations in the time at which different 189 genes cross the recombination barrier are negligible; from the perspective of divergence times, all genes 190 in these taxa start diverging roughly at the same time (Fig. 4a, left). Conversely, in taxa with lengthy 191 delays, between-gene variations in divergence times can be comparable to the total evolutionary depth 192 of the taxon (Fig. 4a, right). Accordingly, it can be expected that differences in sequence divergence 193 across genes will be larger in taxa with long recombination-driven delays. To test whether genomic data 194 are compatible with this prediction, we first calculated, for a set of 100 nearly universal gene families, 195 the gene family-and taxa-corrected evolutionary rates (i.e. the gene-specific substitution rates 196 normalized by the gene family-and taxon-averaged substitution rates). It can be shown that the 197 standard deviation of evolutionary rates corrected in this way is equal to the coefficient of variation of 198 the times over which genes have been diverging (see Supplementary Information). As predicted by the 199 model, there is a significant positive correlation between the recombination-driven delays and the 200 variance of evolutionary rates (R = 0.64, p < 0.001). Moreover, when comparing taxa with long and short 201 delays (relative to the evolutionary depth), we found that variance of evolutionary rates is significantly 202 greater in the taxa with long delays ( Fig. 4b; p = 0.002, Student's T-test). 203 To elucidate potential causes for the large variation in the recombination-driven delays found in the 204 data, we searched for associations with genomic and ecological features, such as genome size, number 205 of mobile genetic elements, gene turnover rate, lifestyle (free-living or host-associated), natural 206 competence for transformation, and effective population size ( Supplementary Fig. S3). Among those, we 207 only found a strong negative correlation between the recombination-driven delay and the relative rate 208 of gene turnover with respect to substitutions (Fig. 4c, Spearman's rho = -0.86, p<0.001), which suggests 209 that fast gene turnover facilitates the spread of barriers to recombination. The statistical power for the 210 analysis of ecological traits (lifestyle and effective population size) was low, and therefore, it cannot be 211 ruled out that these factors contribute to the spread of recombination barriers as well. 212 213 Discussion 214 Evolution of prokaryotes occurs at two levels: at the sequence level, through substitutions and small 215 indels, and at the genome level, through the transfer and loss of genes and groups of genes 4,30-32 . 216 Whereas evolution at the sequence level is traditionally used to determine phylogenetic relationships 217 among prokaryotes and to assign new genomes to taxonomic groups, it is the gene repertoire (and 218 therefore evolution at the genome level) which determines metabolic capacities, ecological properties 219 and pathogenicity of bacterial strains 33-35 . By studying sequence and gene content divergence in closely 220 related groups of prokaryotes, we found that sequence evolution is often delayed with respect to 221 genome evolution, although the magnitude of the delay broadly varies across bacterial and archaeal 222 lineages. We show that the delay in sequence evolution is likely to result from gene conversion that 223 homogenizes the core genome while, at the same time, accessory genes can be gained and lost. These 224 results are fully compatible with our previous findings indicating that archaeal genomes contain a subset 225 of genes that turn over extremely rapidly, before detectable sequence divergence occurs 26 . The delay 226 on sequence divergence caused by homologous recombination provides a mechanistic explanation for 227 the previously observed "instantaneous" gene turnover in prokaryotes 36 . These results are also 228 compatible with the observation that genes are gained and lost at higher rates on the tips of 229 phylogenetic trees 2 . 230 Notwithstanding the long-standing debate on the applicability of the species concept in prokaryotes, 231 several recent studies strongly suggest that speciation does occur in bacteria and is a crucial factor 232 shaping the earth microbiome 16,37 . The establishment of barriers to recombination is a pivotal step in the 233 early divergence of closely related prokaryotic strains that eventually leads to speciation 19,23 . In the 234 absence of such barriers, sequence divergence is prevented by the cohesive effect of intra-strain 235 homologous recombination; only after the barriers arise and recombination ceases, substitutions start 236 to accumulate at an appreciable rate. By modeling sequence evolution in the presence of gene 237 conversion, we show here that the temporal dynamics for the spread of barriers to recombination 238 directly affects the molecular clock. Specifically, homologous recombination sets back the molecular 239 clock by an amount of time that, in the long-term, equals the average waiting time for the establishment 240 of recombination barriers. Homologous recombination during early divergence also leads to the 241 compression of the tips of phylogenetic trees. Our model provides a framework for correcting such 242 trees, to make them consistent with the dynamics of gene turnover. 243 The causes that underlie the spread of barriers to recombination are complex and likely involve genomic 244 and ecological factors. Theoretical models show that barriers can simply emerge if mutations generate 245 diversity faster than recombination erodes it. However, it is a matter of debate whether mutation and 246 recombination alone can explain the formation of non-recombining species in nature 20,21 . Our finding 247 that recombination-driven delays negatively correlate with the rate of gene turnover supports the 248 hypothesis that gene gain and loss facilitates the establishment of barriers to homologous 249 recombination, by promoting niche differentiation 18 and/or by interfering with gene conversion at 250 flanking loci 22-24 . Moreover, the autocatalytic (or sigmoidal) dynamics that best describes the fraction of 251 recombination-free loci in our model is consistent with the previous findings indicating that barriers to 252 recombination are initiated by the acquisition of lineage-specific genes and subsequently spread from 253 the vicinity of those genes towards the rest of the genome 8 . 254 Recombination-driven delays are indicative of the time frame over which different genes start to diverge 255 during the split of prokaryotic lineages. Lineages with short delays are characterized by low variance in 256 the gene-specific (gene family-and taxon-corrected) evolutionary rates, which implies that most of their 257 genes began diverging over a brief period of time. In contrast, the higher variances observed in lineages 258 with long delays imply that, in those lineages, divergence of genes occurred in an asynchronous way and 259 spanned longer periods of time. To illustrate this point, it has been estimated that the split between 260 Escherichia and Salmonella took place around 140 million years ago and spanned over 70 million years 8 , 261 which is in close agreement with our finding that the delay in the Escherichia/Salmonella group is 262 approximately half of its evolutionary depth. We adopted a phenomenological approach to modeling the effect of intra-population homologous 274 recombination on the genetic diversity of a population. Under this approach, each genomic region is 275 either subject to periodical recombination with other members of the population or has reached a point 276 where homologous recombination is not possible anymore. In the latter case, the respective genomic 277 region is assumed to have crossed (diverged beyond) the recombination barrier. There was no attempt 278 to explicitly model the molecular mechanisms that underlie barriers to recombination, and therefore, 279 the present model is, in principle, applicable to scenarios in which ecological factors, rather than 280 accumulated sequence divergence, drive the isolation of evolving populations. Instead, given a pair of 281 genomes, we used a general function ( ) to describe the fraction of each genome that is subject to 282 recombination. The probability that a region of the genome crosses the recombination barrier exactly at 283 time (where the time starts counting at the last common ancestor of the two genomes) is 284 Considering that only the regions that have crossed the recombination barrier make a long-term 286 contribution to sequence divergence, and that the number of substitutions in those regions is 287 proportional to the time elapsed since the recombination barrier was crossed, the overall sequence 288 divergence between a pair of genomes becomes 289 where is the average substitution rate. Integration by parts leads to the final result 291 where ( ) = ∫ ( ) 0 represents a recombination-driven delay in the molecular clock. 293 To contrast the model with genomic data, we explored three more specific scenarios by assigning 294 explicit functional forms to the rate at which regions of the genome cross the recombination barrier. 295 Thus, we considered a power law time-dependency of the rate ( ) = (which includes the case of a 296 constant rate); a linear time-dependency plus a constant term ( ) = 0 + 1 ; and an "autocatalytic" 297 scenario ( ) = 0 − 1 in which the rate increases as more and more regions of the genome cross 298 the recombination barrier. Given a functional expression for ( , ), the fraction of the genome subject 299 to recombination is obtained by solving the differential equation / = ( , ) (see Supplementary 300 Information for further details). 301

Model of genome content divergence 302
The number of genes, , in a prokaryotic genome was modeled as a stochastic birth-death process, with 303 genome-wide gain rate + and loss rate − 25 . Under this model, the number of genes in a genome is 304 described by the differential equation 305 To facilitate comparison of gene content across genomes, each genome was represented by a vector 307 with elements that assume values of 1 or 0. Each entry represents a gene, i.e. an ATGC-COG, where 1 or 308 0 indicate the presence or absence, respectively, of that ATGC-COG in the genome. Genome size is 309 then given by the sum of all elements in . 310 The number of shared genes in a pair of genomes, or pairwise intersection, 2 is defined as 311 where 1 and 2 are vectors that represent the two genomes, the angled brackets indicate averaging 313 over multiple realizations of the stochastic process, and the dot operation stands for a scalar product. 314 The dynamics of pairwise intersections is given by 315 where we used the fact that both averages are equal Assuming that 317 genes are acquired from an (effectively) infinite gene pool 38 , we have 318 and substituting this relation into the differential equation for pairwise intersections of Eq.(9), we 320 obtain 321 The solution of this equation shows that pairwise genome intersections decay exponentially as 323 The intersection of genomes, , is the number of orthologous genes that are shared by all genomes. 330 It is formally defined as 331 Similar to pairwise genome intersections, the time derivative of k-intersections is given by 333 Solving the differential equation, we obtain an exponential decay for the k-intersections 335 If time is inferred from a tree 337 where is proportional to − ⁄ and is the sum of branch lengths in the tree that encompasses the 339 genomes (see section 2 of the Supplementary Information for formal derivation). This expression can be 340 extended to genomes composed of fast and slow evolving genes, and becomes 341 where 1 and 2 are the average numbers of genes of each class. 343

Genomic data 344
We used the Alignable Tight Genomic Clusters (ATGC) database 39 to define groups of closely related 345 bacterial and archaeal genomes. By construction, ATGCs are independent of taxonomic affiliation and 346 meet the objective criteria of high synteny and low divergence (synonymous substitution rate dS<1.5 in 347 protein-coding genes). We selected 36 ATGCs that matched the following criteria: i) maximum pairwise 348 tree distance is at least 0.1 substitutions per site, and ii) the phylogenetic tree contains more than two 349 clades, such that pairwise tree distances are centered around more than two typical values. Two of the 350 36 genome clusters were identified as outliers and were excluded from the dataset. The 34 genome 351 clusters analyzed in this study are listed in Supplementary Table S1. To facilitate computational analysis, 352 we subsampled large ATGCs to keep at most 20 representative genomes per ATGC. ATGC-specific 353 Clusters of Orthologous Genes (ATGC-COGs) were downloaded from the ATGC database and 354 postprocessed to obtain finer grain gene families by reconstructing approximate phylogenetic trees 355 from original ATGC COG alignments and splitting them into subtrees with minimum paralogy. ATGC-356 specific phyletic profiles were built by registering, as a binary matrix, the presence or absence of each 357 ATGC-COG in each genome within the ATGC (multiple genes from a single genome that belong to the 358 same ATGC-COG were counted once). 359

Tree construction 360
High-resolution phylogenetic trees based on the concatenated alignments of single-copy core genes 361 were downloaded from the ATGC database 39 . We refer to these trees as sequence similarity-based 362 trees. The phylogenomic reconstruction software Gloome 40 was used to obtain trees based on the gene 363 content similarity among the members of each ATGC. As the input for Gloome, we used the phyletic 364 profiles for the presence or absence of each ATGC-COG, and the sequence similarity-based trees from 365 the ATGC database; options were set to optimize the tree branch lengths under a genome evolution 366 model with 4 categories of gamma-distributed gain and loss rates. This procedure resulted in 2 trees per 367 ATGC, both with the same topology but with different branch lengths (one based on sequence 368 divergence, the other on gene content divergence). All trees were inspected for extremely long and 369 short branches, and clades responsible for such branches (typically 1 or 2 genomes in 5 out of the 34 370 ATGCs) were manually removed to avoid possible artifacts in the following steps. 371 Tree comparison and model fitting 372 For each ATGC, we computed all pairwise distances among tree leaves in the sequence similarity-and 373 gene content-based trees. Then, we compared the observed relationship between both sets of distances 374 with the expectations of the recombination barrier model under four scenarios for the recombination 375 barrier crossing rate (constant, linearly increasing with time, linearly increasing plus a constant term, 376 and proportional to the fraction of the genome that has already crossed, which leads to an 377 "autocatalytically" accelerated crossing rate). For each scenario, we fitted the model parameters using 378 non-linear least-squares optimization (implemented in Matlab R2018b), with sequence similarity-based 379 tree distances as independent variable and gene content-tree distances as dependent variable. The 380 choice of sequence similarity-based tree distances as the independent variable was motivated by the 381 need to fulfil the assumption of homoscedasticity in the non-linear regression model. Additionally, we 382 studied the fit of a heuristic power law model = .
To compare the goodness of fit provided by 383 different models, we calculated the Akaike Information Criterion (AIC) as = 2 + 384 (ln(2 / ) + 1), where is the number of parameters, is the number of observations, and 385 is the residual sum of squares 41 . The 95% confidence interval for the delay parameter ∞ was obtained 386 by finding the values of ∞ such that the residual sum of squares becomes = * (1 + 387 1.96 2 /( − 1)), where * is the residual sum of squares produced by the best fit of the model 42 . 388 Correction of tree branch lengths 389 To account for unobserved variation, tree branch lengths were corrected by using the autocatalytic 390 barrier spread model with the parameters inferred in the previous step. In that model, the observable 391 divergence increases with time according to the function 392 In the simplest case of an ultrametric tree, the corrected height (the distance from the tip) of a node , 396 ℎ � , can be calculated by applying the inverse function to the original height ℎ , that is 397 Branch lengths would then be obtained by subtracting the depths of the parent and child nodes. 399 Extending this idea to non-ultrametric trees, we first defined the parental height of a node, , as the 400 distance between the node's parent and the tip, that is, = ℎ + , where is the length of the 401 branch that connects node to its parent node. Parental heights were calculated as weighted averages 402 from the tips to the root, such that = for the leaves and 403 = 1 1 + 2 2 1 + 2 + [22] 404 for internal nodes. Subindices 1 and 2 refer to the child nodes of node ; weights were computed as 405 = for leaves and = 1 + 2 + for internal nodes (thus, the weight of a node is equal to the 406 total length of the subtree that contains that node as its root plus the length of the branch that connects 407 it to its parent node). Next, corrected values of the parental heights were obtained by computing, 408 numerically (MATLAB R2018b), the inverse function � = −1 ( ). The corrected branch lengths for the 409 tips are simply � = � . Finally, we proceeded from tips to root and obtained the corrected branch 410 lengths associated with internal nodes as 411 where the new weights � 1 and � 2 were recalculated at each step using the corrected branch lengths. 413

Analysis of gene content 414
For a set of genomes that belong to the same ATGC, the gene content overlap was calculated as the 415 number of ATGC-COGs shared by all the genomes divided by the mean number of ATGC-COGs per 416 genome. The total sequence divergence of a set of genomes was calculated as the sum of all branch 417 lengths in the sequence similarity subtree that results from selecting the corresponding leaves in the 418 whole-ATGC tree. Curves for the temporal decay of the fraction of shared genes were obtained by 419 plotting the gene content overlap against the total sequence divergence for all possible combinations of 420 genomes within the ATGC. Smooth curves were obtained by fitting a cubic spline model with 5 knots 421 (placed in both extremes and in the 25-, 50-, and 75-percentiles of the data x-values) using the SLM tool 422 (D'Errico, August 10, 2017(http://www.mathworks.com/matlabcentral/fileexchange/24443)) in MATLAB 423 R2018b. The mean separation among the curves obtained for different values of was calculated as 424 instances of gene replacement, which manifest as a non-linear relationship between sequence and tree 449 pairwise distances, we discarded the ATGC-COGs with the fit to the regression model 2 < 0.9. To 450 account for COG-specific evolutionary rates, the relative evolutionary rate of each ATGC-COG was 451 divided by the mean of all ATGC-COGs that match the same COG. The result is the ATGC-COG residual 452 evolutionary rate, that is, the ATGC-COG evolutionary rate corrected by COG-and ATGC-wise averages. 453 For each ATGC, the dispersion of evolutionary rates was quantified as the standard deviation of the 454 residual evolutionary rates of its ATGC-COGs. 455

Figure 1. Homologous recombination leads to compression of terminal branches in sequence-based 574
phylogenetic trees that can be detected through the analysis of gene content decay curves. 575 a) If tree distances are proportional to the true evolutionary time, the fraction of genes shared by a 576 subset of genomes will decay with the total length of the subtree, and the decay curves will be the same 577 regardless of the number of genomes in the subset. For illustration purposes, three subsets of 2 578 genomes are highlighted in brown, and two subsets of 4 genomes are highlighted in green. 579 b) Homologous recombination between pairs of closely related genomes erases recent sequence 580 divergence which results in an underestimation of the evolutionary times associated with terminal tree 581 branches. Such underestimation leads to gene content decay curves that depend on the number of 582 genomes included in the subset. Accordingly, the decay curve of subsets of 4 genomes is different from 583 the decay curve of subsets of 2 genomes. 584 c) The gene content decay curves of the Bacillus thuringiensis/cereus/anthracis group are compatible 585 with a scenario of recombination-driven shortening of the terminal tree branches (left plot, based on 586 the tree from Fig. 2a). On the right, if the recombination model is used to correct for unobserved 587 variation (fit in Fig. 2c Black circles indicate the best fit of the long-term delay parameter ∞ based on the comparison of 604 sequence similarity and gene content trees. Blue lines show the 95% confidence intervals. Red triangles 605 indicate the total depth of the sequence similarity tree. Values of the delay above or very close to the 606 total tree depth imply that most genomes in a group are strongly bound by homologous recombination; 607 an upper 95% confidence bound cannot be calculated in those cases. 608 609 610