Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer

Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution.

Average Common Substring. The Average Common Substring (acs) method 12 uses the concept of matching statistics 13 . Instead of decomposing the concatenation of two sequences, this method searches for the longest match in sequence A starting at every position in sequence B. Unlike the Lempel-Ziv factorisation method, here the longest matches can overlap.
Shortest unique substring. Instead of looking for the longest matches between two sequences, kr looks for the longest common substrings extended by one, known as the longest substrings between two mutations, i.e. the SHortest Unique subSTRING, shustring 14 .

k-Mismatch Average Common Substring (kmacs).
This method is a variant of the acs method, using the longest common substring with k-mismatches (the number of mismatches is noted mm in our study) 9,15 .

Optimisation of parameter settings
Six of the nine AF methods used in this study require the specification of a key parameter that would influence the estimated distance among a set of sequences, thus the resulting trees. 2 , ffp and cvt require a value to be set for k (i.e. k-mer length), co-phylog requires a half-context length K, spaced a number of patterns n, and kmacs a number of mismatches mm. To assess the optimal parameter setting corresponding to each of these methods in inferring phylogenies across the simulated data, we ran the program across a range of values and inferred a phylogenetic tree. We used k = [14-26] for 2 , ffp and cvt, K = 6, 7, 8 and 9 for co-phylog, and based on authors' recommendations 15 , n = 60, 70, 80, 90 and 100 for spaced, and mm = [12][13][14][15][16][17][18][19][20] for kmacs.
We consider, in each scenario, the parameter that yielded the minimum average RF (i.e. the tree topology that is the most congruent to reference) as optimal. The other three methods, gram, kr and acs were run using the default parameters.
Analysis of genome divergence. Supplementary Fig. S1 shows the mean RF observed using the six AF methods across different parameter settings and the mutation rate m.  Analysis of genome rearrangements. Supplementary Fig. S4 shows the mean RF obtained using five of the six AF methods across different parameter settings and the rearrangement rate r. Owing to the highly consistent results observed in spaced using different values of n (as described above), we did not assess the setting of n in spaced in this case. In all methods, the observed variation of RF is very little between the best parameters, e.g. for cvt, RF ranges between 0.068 (r = 1.00) and 0.070 (r = 0.10) at k = 14 ( Supplementary Fig. S4c).
To visualise the extent of inverted translocation across these datasets, we computed the MSA of these datasets using progressiveMauve 16 . Supplementary Fig. S5 shows an example of part of the whole genome alignments at each r. The number of locally collinear blocks 17 in Mauve alignments increases proportionately with r, illustrating the complexity in multiply aligning these genomes, particularly when r ≥ 0.1.

Analysis of genome divergence in empirical data
To estimate the divergence among the 143 genomes, we first generated a phylogenetic tree using the 2 method 18 at k = 26 ( Supplementary Fig. S6). Because 2 is a simple dissimilarity measure, the branch lengths of this tree indicate divergence (although not directly interpretable) of the species. The tree inferred by 2 ( Supplementary Fig. S6) shows that Archaea are separated (with long branch lengths within clade) from the Bacteria. The internal branch lengths are relatively shorter than the external branches, suggesting that multiple speciation events (i.e. an adaptive radiation process) occurred in a short timeframe after the establishment of corresponding niches (e.g. bacterial groupings).
We also compute the percentage of shared k-mers for each genome pair across all empirical dataset (Supplementary Table S1

Selection of jackknife rate for pseudo-replicates generation
Previous studies suggested that a jackknife "rate" (proportion)  = 37.5-40% is optimal for the jackknife technique 19,20 , but this was based on aligned sequences or gene-order data. To determine which  was optimal for our jackknife (JK) analysis we used a range of 20-80% cut-off to generate 100 pseudo-replicates for each empirical dataset. We followed the same technique described in the main Methods to generate JK support values at each rate; the results are shown in Supplementary Fig. S8

Assessment of jackknife pseudo-replicates in different AF methods
To assess the robustness of jackknife support in each of the AF methods, we independently accessed topological difference, measured as RF (normalised Robinson-Foulds distance), between (a) a supertree that is summarised from 100 trees (independently generated from each JK pseudo-replicate of the same data), and (b) the tree generated using the AF method from the original (non-jackknifed) data (144 prokaryote genomes). Here we used three supertree methods: the maximum representation of parsimony (MRP) 21 , the fast subtree Prune-and-Regraph (fast-SPR) 22  We observed the lowest RF for 2 (RF = 0.04) and the highest for kmacs (RF = 0.45); there was little variation among the supertrees as generated using each of the three methods. Our results suggest that some AF methods (particularly 2 , ffp and cvt) are more robust to data truncation and therefore more appropriate for the calculation of jackknife support.