A New Phylogenomic Approach For Quantifying Horizontal Gene Transfer Trends in Prokaryotes

It is well established nowadays that among prokaryotes, various families of orthologous genes exhibit conflicting evolutionary history. A prime factor for this conflict is horizontal gene transfer (HGT) - the transfer of genetic material not via vertical descent. Thus, the prevalence of HGT is challenging the meaningfulness of the classical Tree of Life concept. Here we present a comprehensive study of HGT representing the entire prokaryotic world. We mainly rely on a novel analytic approach for analyzing an aggregate of gene histories, by means of the quartet plurality distribution (QPD) that we develop. Through the analysis of real and simulated data, QPD is used to reveal evidence of a barrier against HGT, separating the archaea from the bacteria and making HGT between the two domains, in general, quite rare. In contrast, bacteria’s confined HGT is substantially more frequent than archaea’s. Our approach also reveals that despite intensive HGT, a strong tree-like signal can be extracted, corroborating several previous works. Thus, QPD, which enables one to analytically combine information from an aggregate of gene trees, can be used for understanding patterns and rates of HGT in prokaryotes, as well as for validating or refuting models of horizontal genetic transfers and evolution in general.


The simulation procedure
We produced ten random, edge-weighted, simulated species trees over n taxa (for n = 10, 20, . . . 100). Each simulated species tree was produced based on the Yule process (see [1]). Subsequently, we generated ten families of "gene" trees for each simulated species tree, each family consisting of 2500 gene trees. Each gene tree simulates a tree that is created when a species tree is subjected to a certain HGT process -a series of HGT events, consistent with a Poisson process of a constant rate. Ten different HGT rates were used, namely λ = 0.1, 0.2, . . . , 1.0. The rate of the HGT process remained constant within each gene family, but varied from one gene family to another. The simulated species trees and gene trees were created using our own scripts. The details of how the model trees and the gene trees were created appear below.

Generating a simulated species tree
Our simulated species trees were generated based on the Yule process. The Yule process is a process in which a binary ultrametric tree is generated. Generating a Yule tree with n leaves is a process that has n − 1 recursive stages, where in a preliminary stage, a node that represents the root of the tree is created with a time signature of zero.
In the first stage of the recursion, two exponentially distributed numbers are sampled.
Let us denote them as T 1 and T 2 . We choose the exponential distribution to have a parameter λ = 1, which means that for every ε > 0 we have P (T 1 < ε) = P (T 2 < ε) = 1 − e −ε . After T 1 and T 2 are set, two nodes are created with time signatures T 1 and T 2 . They are then connected to the root as two descendant leaves. (We can think of the time signature as representing the time of that node's creation. Thus, nodes with small time signatures are considered as "created" before nodes with large ones.) This completes the first stage. Note that once this stage is completed, there are two leaves in our tree.
Let us assume that the (k −1)-th stage in the recursion has been completed, resulting in a tree with k leaves. We carry out the k-th stage as follows: Two exponentially distributed numbers are sampled, let us denote them as T (k) 1 and T (k) 2 , and two nodes are created. These two nodes are connected as descendant leaves to the leaf of the k − 1 stage that has the smallest time signature (the leaf that was "created" first, which now becomes an inner node of the tree), and their time signatures are set at T plus the time signature of their immediate ancestor. (Naturally, the time signature of each node on the Yule tree may be regarded as representing its distance from the root, or the length of the path from it to the root.) It is easy to see that at this point the generated tree has k + 1 leaves. This completes the k-th stage in the recursive process.
Once n − 1 stages have been completed and we have generated a tree with n leaves, the time signatures of all the leaves are reset to equal the smallest time signature among the leaves. This final act is what makes the generated tree ultrametric.

Generating a simulated gene tree
Based on a given species tree, we generated a set of gene trees by subjecting the species tree to a Poisson process of HGTs with a constant rate of events λ. This means that for each gene tree, the number of HGT events, denoted by #HGT , was a random variable for which P r(#HGT = m) = e −λL (λL) m m! (where L denotes the total length of the species tree). Furthermore, once the value of #HGT was fixed, the m recipients of HGT events were distributed randomly and uniformly on the species tree. We remark that this is equivalent to assuming that the time between two subsequent HGT events along each lineage is an exponentially distributed random variable (like speciation events in the Yule model). An HGT event was simulated by a subtree pruning and regrafting (SPR) operation (see [4]), from the recipient of genetic material to another, randomly chosen point, at the same depth from the root in the species tree (i.e., a contemporaneous species) -the donor. When all HGTs for a gene were generated, the resulted gene tree was ready.
2 Summary of the real data phylogeny reconstruction As explained in the main body of the paper, we used two sets of plurality quartets, one based on the NUTs and the other based on the entire gene pool, as input for wQMC in order to construct the hypothesized phylogeny of the species in our study. The resulting trees can be found, in Newick format, in Appendices C and D below. The full list of species names and abbreviations can be found in [2] (additional data file 1). Tables 1 and 2 summarize the results of the phylogeny reconstruction. They show that both trees satisfy approximately 90% of the input quartets, and that the percentage of satisfied quartets increases with the plurality rates of those quartets.  Table 1: A summary of the phylogeny reconstruction based on the entire gene pool. The reconstructed tree satisfies almost 90% of the input quartets, including almost all input quartets whose plurality rates are 70% or more.  Table 2: A summary of the phylogeny reconstruction based on the NUTs. The reconstructed tree satisfies more than 90% of the input quartets, and almost all input quartets whose plurality rates are 70% or more.

Real data QPD -an extension
Similarly to what was done in the main body of the paper, we divided the plurality quartets into groups based on the number of archaea (or bacteria) they have. In the main body of the text we considered the plurality quartets that were derived from the entire gene pool, while here we consider the plurality quartets that were derived from the NUTs alone. Again, we divide the quartets into five groups (0 archaea, 4 bacteria; 1 archaea, 3 bacteria; 2 archaea, 2 bacteria; 3 archaea, 1 bacteria; 4 archaea, 0 bacteria) and show ( Figure 2) how each of these groups contribute to the overall QPD of the NUTs (presented in Figure 1). We also present the independent QPDs of the five groups ( Figure 3). As in the main body of the paper, in Figure 2 we see two peaks that correspond to quartets with 2 archaea and 2 bacteria and to quartets with 1 archaea and 3 bacteria, and in Figure 3 we again see that if a quartet contains one or zero archaea (three or four bacteria) then it is more likely to have a low plurality rate, compared to a quartet with three or four archaea (one or zero bacteria). These results are in agreement with what we found when we analyzed the plurality quartets of the entire gene pool, and they strengthen our conclusions regarding a high probability of bacteria-to-bacteria HGT, compared to archaea-to-archaea and to a greater extent when compared to the probability of HGTs involving both archaea and bacteria.

QPDs of simulated uniform HGT gene trees
Here we present the QPDs of the simulated data (Figures 4 through 13). The plots suggest that the shape of the QPD is dependent primarily on the rate of HGT events, and not on the number pf leaves in the species tree. Figure 1: Real data QPD, based on the NUTs alone. We clearly see two local maxima, a phenomenon which is absent from the QPDs pertaining to the simulated data.

QPDs of simulated biased HGT gene trees
In an attempt to reconcile the real data QPD in the main body of the paper with our simulation process, we modified the way in which gene trees were generated. Specifically, we generated a simulated "species" tree with 100 leaves and found a node with 41 descendant leaves. That node and all of its descendants (leaves and internal nodes) were defined as "archaea". The remaining leaves and internal nodes were defined as "bacteria". HGT events were divided into four categories: archaea to archaea, archaea   We present here the resulting QPD graphs thus generated pertaining to an intraarchaea rate of 0.6 (Figures 14, 17, 20, 23). Evidently, an inter-domain HGT rate of 0.1, at least twofold smaller than the smallest intra-domain HGT rate simulated, is necessary for the creation of a clear bimodal QPD graph. In addition, dealing only with the bimodal simulated QPDs, the plurality quartets were divided into five groups (0 archaea, 4 bacteria; 1 archaea, 3 bacteria; 2 archaea, 2 bacteria; 3 archaea, 1 bacteria; 4 archaea, 0 bacteria). We show, as in Figures 2 and 3, how each of these groups contribute to the overall QPD and how they compare among themselves when plotted independently. We remark that Figures 4 through 25 were plotted using randomly sampled quartet sets of size 50,000.

Remarks about a previously reported archaea-bacteria separation
In [3], Puigbò et al. report on an archaea-bacteria separation, derived from a "tree distance" they defined. We claim that this separation may result, at least in part, from the underlying structure of the species' phylogeny. To prove that claim, we generated ten simulated species trees with 100 leaves in each tree. As before, a node with 41 descendant leaves was identified in each species tree. This node and all of its descendants were defined as "archaea". The remaining nodes were defined as "bacteria". The pairwise distance matrices induced by each species tree were subsequently computed, using the "tree distance" defined in [3], and the resulting heatmaps were then plotted ( Figures   26 through 35). This is equivalent to having no HGTs affect gene evolution at all. All heatmaps thus generated show some level of archaea-bacteria separation, which is clearly induced by the structure of the underlying phylogeny alone.