Gene transfers can date the Tree of Life

Biodiversity has always been predominantly microbial and the scarcity of fossils from Bacteria, Archaea and microbial Eukaryotes has prevented a comprehensive dating of the tree of life. Here we show that patterns of lateral gene transfer deduced from the analysis of modern genomes encode a novel and abundant source of information about the temporal coexistence of lineages throughout the history of life. We use state-of-the-art species tree aware phylogenetic methods to reconstruct the history of thousands of gene families and demonstrate that dates implied by gene transfers are consistent with estimates from relaxed molecular clocks in Bacteria, Archaea and Eukaryotes. We present the order of speciations according to LGT calibrated to geological time for three datasets comprised of 40 genomes for Cyanobacteria, 60 genomes for Archaea and 60 genomes for Fungi. An inspection of discrepancies between transfers and clocks and a comparison with mammal fossils show that gene transfer in microbes is potentially as informative for dating the tree of life as the geological record in macroorganisms.


Posterior samples for estimating CCPs
Conditional Clade Probabilities 5,6 7 (CCPs) were estimated from a distribution of gene trees sampled by PhyloBayes. Gene family alignments of the different datasets were taken from the original publications 3,4,8 . For each alignment an MCMC sample was obtained using PhyloBayes (v3.2e) 9 using an LG+Γ4+I substitution model 10 with a burn-in of 1000 samples followed by at least 3000 samples 7,8 . We included all the gene families used in the original publications (Szöllősi et al 2 , Williams et al 3 and Nagy et al 4 ). The .ale files containing the CCPs for each gene family can be found in ftp://pbil.univ-lyon1.fr/pub/datasets/davin2017/CCPs

Inferring transfers
To infer transfer events we use a hierarchical probabilistic model that models both the evolution of gene phylogenies along the species tree as well as the evolution of sequences along gene trees 11 . We use a joint likelihood that takes into account i) the probability of a given gene tree topology according to a probabilistic model of gene family evolution (a birth-death process that models the duplication, transfer and loss of genes) and ii) the probability of the sequence alignment according to a substitution model. Using such a joint likelihood approach allows one to take into account uncertainty in both gene tree topology and gene tree-species tree reconciliation explicitly during the inference of transfers. Reconciled gene trees, which explicitly imply transfer events used in the downstream reconstruction of relative ages, are sampled according to their joint likelihood using amalgamated likelihood estimation (ALE), a probabilistic approach to exhaustively explore all reconciled gene trees that can be amalgamated as a combination of clades observed in a sample of gene trees. In   7 we demonstrated using simulations that gene trees reconstructed using the above described joint likelihood are substantially more accurate than those reconstructed using sequences alone. Reconciled gene trees were sampled using the ALE undated 2 12 , 7 software (available from the ALE git repository: https://github.com/ssolo/ALE.git) 7 . 100 reconciled gene trees were sampled using ALEml_undated for each family. This allows us to assign a posterior probability to each event (D,T,L) as the fraction of families in which a given event is found. Transfers detected with a posterior probability <0.05 were discarded. Then, for a given transfer between species tree node X and species tree node Y, a global frequency was computed by summing family-wise posterior probabilities across all families. The .uml_recs files containing the reconciled gene trees can be found in ftp://pbil.univ-lyon1.fr/pub/datasets/davin2017/Reconciliations An example command is: ALEml_undated CyanoSpeciesTree Family00001.ale 100 _

Relative age constraints implied by transfers
The horizontal transfer of genes occurs between contemporaneous species, but this does not mean that the branches inferred as the donor and recipient in reconciled gene trees have to be contemporary 12 (Fig. S8). A transfer leaving the species tree from branch a and arriving on branch b will establish a constraint between the father node of branch a and the daughter node of branch b . This implies that not all transfers carry dating information (Fig. S10). Transfers arriving at leaves or transfers that correspond to constraints implied by the topology ( i.e. from a node to one of its own descendants) do not carry information on the relative age of nodes in the species tree, and were discarded from the analysis. The data was handled with scripts written in Python 2.7 using extensively the Python library ETE3 13 Relative age constraints implied by fossil calibrations To obtain relative age constraints from fossil calibrations, we used the 26 calibrations from dos Reis et al 1 . These calibrations specify maximum (upper bounds) or minimum age calibrations (lower bounds), or both, for nodes of the species tree. The first step to obtain relative constraints between speciation nodes was propagating lower and upper calibrations up and down the tree respectively. For instance, if a node has no calibration but one of its daughter nodes has a lower calibration then this calibration was propagated up. If a node had an upper calibration, but the daughter node did not, then this was passed down. We then evaluated all pairs of nodes i,j that were not in a direct ancestral relationship and recorded an "older" relative constraint if the lower bound calibration of node i was older than the upper bound calibration of j .

MaxTiC
The MaxTiC 14 algorithm is a heuristic for inferring the relative ages of nodes on a species tree, based on the time orders implied by the largest set of consistent transfers. A set of transfers is called consistent if there exists a time order of the nodes of the species tree with which all transfers from this set are compatible (meaning none of them goes back in time, S9 b). The algorithm takes as inputs a species tree and a set of time-informative transfers (constraints) and outputs a time ranking of speciation nodes that corresponds to the time orders implied by the largest consistent subset of transfers. The sizes of these consistent subsets recovered by MaxTiC in the different datasets are shown Table S2. The python implementation of MaxTiC (available from the ALE github repository at https://github.com/ssolo/ALE.git) was run with the following commands: python maxtic.py species_tree constraint_file ls=180 The time orders inferred using MaxTiC were then compared with the ranking of speciation nodes given by the different molecular clock estimates to obtain Figure 3.

Molecular clock estimates of the chronograms
Chronograms were sampled using PhyloBayes 3.3 9 under different clock models (autocorrelated lognormal 9,15 , LN; Uncorrelated Gamma Multipliers 16 , UGAM; White-noise 17 , WN; Strict Clock 9 , CL) and two types of prior on the divergence times (Uniform and Birth-death processes) 17 . We used two different models of protein evolution (LG 10 and GTR 18 ). An example command is: phylobayes3.3f/exe_lin64/pb -d ./Cyano_alignment.phy -T ./CyanoTree -x 1 15000 -ugam -unitime -lg -rp 1000 1000 Cyano_UGAM_LG_UNITIME In Archaea we used the alignment of Williams et al 3 , which consisted of 10738 amino acid positions, to obtain the species chronogram. We used chronograms sampled from five different chains to obtain the 5000 chronograms used in the Figure 3 (discarding 1000 as burn-in from each chain). 2,7,8 In the other datasets, where the alignments were much longer, we sampled 4000 columns randomly and used this sampled alignment as an input for PhyloBayes. A single chain was used for each dataset. 5000 chronograms from the beginning of the chain were discarded as burn-in and we sampled every second iteration of the remaining 10000, to obtain a distribution of 5000 chronograms.
In the main figure the model of protein evolution is LG and the prior on divergence times is Uniform. The age of the root was arbitrarily set to 1000. The default options for prior on substitution rates were used. All alignments can be found in ftp://pbil.univ-lyon1.fr/pub/datasets/davin2017/Alignments .

Comparison of the clade-to-tip divergence between donors and recipients
To calculate the clad-to-tip divergence between donors and recipients, we fixed the topology of the species tree and then computed branch-lengths in terms of amino acidic substitutions using the concatenates of nearly universal families and a LG+Γ4+I model of protein evolution. These calculations were made using IQ-TREE 19 . Then, we traversed the trees in postorder and computed the divergence of each inner node as the divergence assigned to its children nodes plus the mean branch length of the subtending branches (leaves divergence was set to 0). Then, we took all the constraints found in each data set and separated them into those constraints that had been retained by the MaxTiC algorithm and those that had been discarded. We computed the difference between the estimated clade-to-tip divergence of the donor and recipient clades for each constraint. The trees can be found in: ftp://pbil.univ-lyon1.fr/pub/datasets/davin2017/SpeciesTreeSubstitutions Correlation between fossil-based relative age constraints in mammals and inferred substitutions per site To calculate the correlation between fossil-based relative age constraints in mammals and inferred substitutions per site, we used the 12 calibration points as given by dos Reis et al. 1 that have both lower and upper bounds. We excluded the calibration point at the root, to allow direct comparison with the rankings for transfers (since the ranking of the root is always equal to 0). We also removed the second oldest calibration point because of its high leverage. This resulted in 10 calibration points that were used in the analysis. For each calibration we took the mean point between the upper and lower bounds to obtain the y coordinates. To obtain the substitutions per site (x axis), we took the total branch length separating two species according to the strict molecular clock consensus chronogram and multiplied it by the mean evolutionary rate ν (substitution per site / time) estimated by PhyloBayes. The strict molecular clock chronogram was computed on a DNA alignment of universal gene families (used to obtain the species tree in the original paper) without using any calibration points. 15000 chronograms were sampled from a single chain, removing the first 5000 as the burn-in and then resampling one of every two of the remaining 10000 chronograms.
C orrelation between the speciation time order and the expected number of substitutions per site in Cyanobacteria To calculate the correlation between the speciation time order and the expected number of substitutions per site in Cyanobacteria, we measured the speciation ranking given by the MaxTiC algorithm, assigning a value of 1 to the speciation node that is closer to the leaves. The number of substitutions per site was estimated the same way as for the mammalian dataset using the sampled amino-acid alignment described in "Molecular clocks estimates of the chronograms", the LG model of protein evolution and the rooted species tree topology from Szöllősi et al. 8 We sampled 10 speciation nodes, to obtain a correlation comparable to the one previously described in mammals where 10 calibrations had been used. The Spearman's rank correlation was calculated between the speciation ranking and the expected number of substitutions per site. This procedure was repeated 10000 times. The points, p-value and Spearman's rho represented in the main figure correspond to the correlation with the median Spearman's rho of the 10000 correlations. The distribution of p-values can be seen in S13 (lower panel).

Robustness analysis
To assess the support of the transfer-based constraints, we measured the support of each of the relative age constraints using a jackknife approach, sampling without replacement half of the gene families that contain transfers. We repeated this procedure 1000 times and then calculated a set of compatible transfers for each sample. This method also provides us with a measure of the sensitivity of the MaxTiC algorithm to the choice of different gene families. We quantified support for each constraint as the fraction of times that it is observed in the 1000 different random samples. A large fraction (between 40% and 47%) of the constraints appear in at least 95% of the samples (S20-S22 upper panel), indicating that transfer-based constraints are generally robust to variation in the selection of families. We studied how different support thresholds affect the agreement with molecular clock estimates. Varying the threshold we observed a gradual tendency of increasing agreement for higher thresholds (S20-S22, lower panels).

Robustness of inferred relative node ages to an alternative root in Cyanobacteria
Phylogenetic studies using outgroup rooting agree that Gloeobacter violaceus is the earliest-diverging lineage within Cyanobacteria. 20-22 9,39,40 This rooting, however, is sensitive to the choice of out-group species 23 . In contrast to these results, recent studies using genome-scale data and alternative rooting methods 8,24 have suggested that Gloeobacter violaceus may be a derived lineage within Cyanobacteria. In Figs 3 and 4 we use the root position proposed by Szöllősi et al.10 based on transfers, but we also analyzed the outgroup rooting (see Fig.S24). Comparing the two rootings we found that over 95% (183 out of 191) of the supported relative age constraints under the outgroup root (involving clades found under our alternative root) were also supported under the alternative rooting of Szöllősi et al. In particular, we still recover a recent divergence for the Prochlorococcus-Synechococcus clade. These results, taken together with our simulations 14 , suggest that relative age constraints inferred from transfers are robust to uncertainty in the position of the root.

Comparison of information on relative dates from different sources in cyanobacteria
The clade representing heterocyst-forming cyanobacteria (green clade in Fig. S25)  Under a strict molecular clock, including the fossil date as a minimum constraint produces unrealistically narrow confidence intervals and drastically overestimates the age of the unconstrained blue clade, which has been estimated to have diverged 1,020 -640 Mya based on a dataset with broader taxonomic sampling and additional fossil calibrations (cf. Table 3 in Blank & Sanchez-Baracaldo, Geobiology 2010). This demonstrates that it is necessary to relax the assumption of the strict molecular clock, i.e. to assume that changes in evolutionary rate occur along the phylogeny, with a corresponding increase in model complexity.
The blue and green clade shown in Fig. S25 are particularly relevant because the transfer-based relative age constraints we derive for cyanobacteria in our manuscript constrain the blue clade as well as its three ancestral nodes to be younger than the green clade. Examining the different relaxations of the molecular clock considered in the manuscript we find that the transfer-based constraints are met to different extents by different models (see Table S3 below). Table S3 demonstrates two important effects: first, as shown in Figure 2a. on the next page using the full agreement score, introducing the internal fossil calibration in the cyanobacterial dataset together with appropriate relaxed clock models can significantly increase agreement with relative transfer-based constraints. This increase in agreement, in fact, provides evidence of congruent dating information in fossil and transfer based constraints in cyanobacteria. Second, however, despite the significant increase in agreement, as can be seen in the last row of Table S3, and more generally in Figure S26a., introducing all available fossil based calibrations and using the best fitting relaxed molecular clock method still does not allow us to sample trees that completely agree with transfer-based constraints.
To determine the limits of agreement between relaxed molecular clocks and transfer-based relative constraints that can be reached under realistic runtimes we launched 10 independent MCMC runs under the best-fitting lognormal model and ran the chains until 30,000 samples were obtained per chain after discarding burn in (approximately one week of computation per chain on 3.7 GHz Intel Xeon processors with PhyloBayes v4.1). Unfortunately, and as shown in Figure  S26b., current approaches are limited to sampling chronograms well below 100% agreement. Similar results were obtained for the Fungi and Archaea datasets under extensive sampling (Fungi and Archaea exhibit similar agreement with a median of 0.78 and 0.72 respectively).

Estimating trees calibrated to geological time that carry partial information from transfer-based relative age constraints
To obtain an initial sample of chronograms we used Phylobayes as described above with two additional modifications. First, we added fossil calibrations where available (see Table S4 below). Second, we introduced a subset of the transfer-based relative constraints which involve internally calibrated nodes in the phylogeny by the following procedure: i) for minimum age calibrations all nodes that are constrained to be older by transfer-based relative age constraints were also assigned the same or higher minimum age constraint; ii) conversely, for maximum age constraints that are constrained to be younger by transfer-based relative age constraints we assigned the same or lower maximum age constraint. Introducing this calibration propagation scheme in our datasets we found that it resulted in significantly higher agreement with transfer based constraints in Fungi (median agreement increased from 0.76 to 0.78), but not Cyanobacteria (median 0.729 vs 0.73). For Archaea the lack of internal calibrations prevented us from applying the method. To estimate trees calibrated to geological time that carry partial information from transfer-based relative age constraints we constructed consensus chronograms of the subset of 5% of the sampled chronograms with the highest agreement with transfer-based relative a ge constraints. Consensus chronograms are shown in Figs. S25-27. The complete set of sampled chronograms and their agreement score can be downloaded from ftp://pbil.univ-lyon1.fr/pub/datasets/davin2017/ .        illustrative, because in real data the proportion of extinct and unrepresented clades compared to sampled diversity is much higher 12 . An LGT event is indicated by the arrow. The donor lineage for the LGT becomes extinct, but the LGT survives in descendants of the recipient lineage. Note that the apparent donor species lineage on the represented tree is not contemporaneous to the recipient species tree branch where the transfer arrives. This transfer event implies the constraint that the yellow node is older than the blue node.  11 . In contrast to dated DTL methods, where transfers can only occur between branches that are contemporary, in the undated method transfers can occur between any two branches in the tree (e.g. transfer in b is allowed), with the exception of transfer where the recipient branch is an ancestor of the donor branch 2,25 (e.g. transfer in a is forbidden) Fig. S10: Informative vs. non-informative transfer. The informative transfer above (panel a, same transfer as in Fig. S8) indicates that the yellow node must be older than the blue node. Non-informative transfers (b) fall in two categories. The non-informative transfer in brown is transferred into a leaf; since leaves have no descendants, such transfers do not imply a relative age constraint. The transfer highlighted in purple is non-informative because it implies a constraint that is already established by the position of the root. The transfers in a are contradictory because they imply different orders of speciations. The brown transfer indicates that the pink node must be older than the orange node. The purple transfer indicates that the orange node must be older than the green node. Transfers in b , on the other hand, are consistent because there exists an ordering of speciations that is compatible with the constraints implied by the transfers. The purple transfer implies that the yellow node must be older than the blue node and the brown transfer implies that the orange node must be older than the pink node.

Fig. S12: Agreement between Molecular Clock estimates and the constraints established by transfers selected (left column) and discarded (right column) by the MaxTiC algorithm.
The left column shows the same data as Figure 2, and is used here as a comparison on the same scale as the right column. The agreement score of the prior distribution (blue) changes between the two sets of constraints because on the left the set of selected transfers is consistent by construction ( i.e. agreeing with at least one chronogram), while on the right the set of discarded transfers is not necessarily consistent. Taking this into account, it appears that molecular clocks tend to agree with the transfers selected by the MaxTiC (left) and disagree, or at least not significantly agree, with the discarded transfers (right).

Fig. S17: Agreement between clocks and transfer-based constraints in Cyanobacteria.
Different substitution models (LG and GTR) were used as well as different priors (uniform~ UNITIME and Birth-Death~BD). The most noticeable effect seems to be a slightly different behaviour of the strict clock when using different models of protein evolution, but the results remain qualitatively similar. Colors correspond to the different types of clocks, as in Fig. 3 (orange strict clock, purple lognormal, green uncorrelated gamma-multipliers and gray white-noise). The area delimited by the dashed lines corresponds to the 95% confidence interval determined by the prior.      For each gene family the number of inferred transfers per branch is computed. In this graph we can see that for a wide range in the frequency of LGT we recover a tree very close to the real one. The levels of similarity however never reach 1, suggesting that a small fraction of transfers inferred by ALE are spurious. The boxplots correspond to the frequency of transfers found in two real datasets of Fungi and Cyanobacteria. More details on this analysis can be found in the original paper.    . Introducing the internal fossil calibration increases agreement with relative transfer-based constraints. a) Agreement with transfer-based relative constraints for different relaxed molecular clock models. As before "root" indicates molecular clock models with the root constrained to be between 3,850 Mya and 2,450 Mya, while "root&internal" indicates a constraint on the age of the green clade at 1,957 Mya in addition to the root constraints. b) Agreement with transfer-based relative constraints from extensive sampling of over 300,000 chronograms under the best fitting log-normal model with "root&internal" calibrations.  Table S4. The complete set of sampled chronograms and their agreement score can be downloaded from ftp://pbil.univ-lyon1.fr/pub/datasets/davin2017/  Table S4. The complete set of sampled chronograms and their agreement score can be downloaded from ftp://pbil.univ-lyon1.fr/pub/datasets/davin2017/   Table S1. Number of transfers and constraints. Each transfer is associated with a frequency that is used to weight the transfer. In a the total weight of transfers found after reconciling all families with their species tree. In b, the total weight of transfers that remain after applying the cutoff of 0.05 to remove the less supported transfers. In c the total weight of the time-informative transfers. In d, the total weight of the time informative transfers that are discarded by the MaxTiC algorithm. In e, the total weight of the time-informative transfers that are retained by the MaxTiC algorithm.