Muscle5: High-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny

Multiple sequence alignments are widely used to infer evolutionary relationships, enabling inferences of structure, function, and phylogeny. Standard practice is to construct one alignment by some preferred method and use it in further analysis; however, undetected alignment bias can be problematic. I describe Muscle5, a novel algorithm which constructs an ensemble of high-accuracy alignments with diverse biases by perturbing a hidden Markov model and permuting its guide tree. Confidence in an inference is assessed as the fraction of the ensemble which supports it. Applied to phylogenetic tree estimation, I show that ensembles can confidently resolve topologies with low bootstrap confidence according to standard methods, and conversely that some topologies with high bootstraps are incorrect. Applied to the phylogeny of RNA viruses, ensemble analysis shows that recently adopted taxonomic phyla are probably polyphyletic. Ensemble analysis can improve confidence assessment in any inference from an alignment.

In PPP, this calculation is parallelized by observing that the contributions to M_ij from different pairs of sequences in Eq. 1 are independent, and can therefore be calculated on separate threads.
The profile alignment is determined by using dynamic programming to maximise the sum of M_ij over alignment columns. I call this method for aligning a pair of profiles PPP-pa.

PPP profile alignment with subsampling (PPP-ps)
PPP-pa has complexity O(N_X N_Y) in the number of sequences N_X and N_Y in X and Y respectively. For large N, this cost can be prohibitive. Super5 implements a faster approximation to PPP-pa by selecting random subsets X* and Y* of the sequences in X and Y respectively. The subsampled matrix M* is then computed by restricting the sum in Eq. 1 to pairs drawn from X* and Y*. The maximum posterior alignment is then calculated using M* rather than M. I call this method PPP-ps. Super5 aligns profiles using PPP-pa if N_X N_Y ≤ n_max (default n_max = 2,000); otherwise a random subset of sequence pairs is selected and PPP-ps is used.
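As a rough sketch of the PPP-pa/PPP-ps decision (not Muscle5's implementation), the choice and the averaging over sequence pairs might look as follows; `pair_posterior` is a hypothetical callable standing in for the HMM posterior calculation, with its output assumed to be already mapped to profile columns:

```python
import random

def subsampled_posterior_matrix(X, Y, pair_posterior, n_max=2000, rng=None):
    """Sketch of the choice between PPP-pa (all pairs) and PPP-ps (subsample).

    X, Y: lists of sequence identifiers in the two profiles.
    pair_posterior(x, y): hypothetical callable returning the posterior match
    matrix for one sequence pair, mapped to profile columns (rows = columns
    of X, entries = columns of Y).
    Returns the pair-averaged matrix M (PPP-pa) or M* (PPP-ps).
    """
    rng = rng or random.Random(0)
    pairs = [(x, y) for x in X for y in Y]
    if len(pairs) > n_max:              # too costly: subsample the pairs
        pairs = rng.sample(pairs, n_max)
    acc = None
    for x, y in pairs:
        P = pair_posterior(x, y)
        if acc is None:
            acc = [[0.0] * len(P[0]) for _ in P]
        for i, row in enumerate(P):     # accumulate per-column posteriors
            for j, p in enumerate(row):
                acc[i][j] += p
    n = len(pairs)
    return [[v / n for v in row] for row in acc]
```

The dynamic-programming step that maximises the sum of the matrix entries over alignment columns is unchanged between the two variants; only the matrix construction differs.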

Expected-error distance
The expected error rate EE (fraction of incorrect columns) in the posterior decoding alignment A of sequences x and y is calculated from the HMM posteriors as described in [1]. Sequences that can be more accurately aligned (according to the HMM) have smaller EE, which can therefore be considered an approximate distance measure defined on pairs of sequences (it is not strictly a distance because the triangle inequality is not necessarily satisfied). Assuming that EE correctly predicts accuracy, the total number of errors can be greedily minimised by aligning the closest pair at each iteration, i.e. by progressive alignment using a UPGMA guide tree. Both PPP and Super5 construct UPGMA guide trees using EE distances. Super5 also uses EE-based clustering to reduce redundancy and divide sequences into smaller sets that are tractable for PPP (described in more detail below).
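The greedy closest-pair joining can be illustrated with a minimal UPGMA sketch over an EE distance matrix; this is an illustration of the clustering described above, not Muscle5's implementation:

```python
def upgma_ee(dist, labels):
    """Minimal UPGMA sketch over an expected-error distance matrix.

    dist maps frozenset({a, b}) -> EE distance for every label pair.
    Returns the guide tree as nested tuples, joining the closest
    (lowest-EE, i.e. most accurately alignable) pair at each step.
    """
    d = dict(dist)
    size = {lab: 1 for lab in labels}       # cluster -> leaf count
    while len(size) > 1:
        pair = min(d, key=d.get)            # closest pair first
        a, b = tuple(pair)
        sa, sb = size.pop(a), size.pop(b)
        node = (a, b)
        for c in list(size):
            # size-weighted (UPGMA) average distance to the merged cluster
            dc = (sa * d.pop(frozenset({a, c})) +
                  sb * d.pop(frozenset({b, c}))) / (sa + sb)
            d[frozenset({node, c})] = dc
        del d[pair]
        size[node] = sa + sb
    return next(iter(size))
```

For example, with EE(A,B) = 0.1, EE(A,C) = 0.4 and EE(B,C) = 0.5, the pair (A, B) is aligned first and C is joined last, which is exactly the order that greedily minimises expected errors.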

Guide tree variants
Both PPP and Super5 use progressive alignment. Variant guide trees are constructed as follows.
The goal is to identify three large subtrees a, b, c which are joined in all three possible orders ((a, b), c), ((a, c), b) and ((b, c), a). This is achieved by considering all possible bifurcations of the original tree (which is temporarily considered to be unrooted), and identifying the edge which most closely approximates dividing the tree into subtrees with one third and two thirds of the sequences, respectively. The smaller subtree is a. The larger subtree bc is divided into two equal-sized (or approximately equal-sized) subtrees b and c by a similar search for the best edge, as shown in Fig. S2. Including the original guide tree, this gives a total of four variant guide trees for generating ensembles.
This design accomplishes a number of objectives. Close to the root, there should be substantial variation in the joining order of large groups, so as to induce substantive variation in any bias caused by progressive alignment. Close to the leaves, the joining order should be preserved as far as possible because alignment accuracy correlates inversely with divergence; closely related sequences should therefore be aligned before more distantly related sequences, and radical changes to the joining order will tend to reduce accuracy. In practice, guide trees are often highly imbalanced, such that many guide tree nodes join daughter subtrees where one is much larger than the other.
The smaller subtree is often a single sequence; Fig. S2 shows an extreme (but not unusual) example where all nodes are 1 : n, i.e. join a single sequence to all others aligned so far. In such a tree, none of the pair-wise alignments joins two large groups. The construction described here guarantees that large groups are joined close to the root even when the tree is highly unbalanced.
When the tree is more balanced, more or all of the joining order is preserved close to the leaves, as illustrated in Fig. S3. When multiple guide tree permutations are applied to the same HMM parameters, e.g. in a stratified ensemble, this design achieves a useful speed optimisation because the alignments of a, b and c can be computed once, leaving only three pair-wise alignments per permutation to complete the MSA. If parameters are perturbed, the guide tree is always recomputed before applying a permutation to account for changes to the EE distance matrix.
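The reuse of the three subtree alignments across permutations can be sketched as below; `profile_align` is a stand-in for a pair-wise profile aligner (PPP-pa in Muscle5), and the keys follow the three join orders ((a, b), c), ((a, c), b) and ((b, c), a) described above:

```python
def permuted_joins(msa_a, msa_b, msa_c, profile_align):
    """Join the three subtree alignments a, b, c in all three orders.

    The MSAs of a, b and c are computed once and reused across the
    permutations; only the final profile-profile joins differ, which is
    the speed optimisation described in the text for stratified ensembles.
    """
    return {
        "abc": profile_align(profile_align(msa_a, msa_b), msa_c),  # ((a,b),c)
        "acb": profile_align(profile_align(msa_a, msa_c), msa_b),  # ((a,c),b)
        "bca": profile_align(profile_align(msa_b, msa_c), msa_a),  # ((b,c),a)
    }
```

Each permutation therefore costs only a small, fixed number of profile-profile alignments rather than a full progressive alignment from scratch.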

Progressive alignment variants compared to previous methods
Neighbour-Joining [3] (N-J) is sometimes used to construct guide trees, e.g. by ClustalW [4] and T-Coffee [5]. To the extent that more closely related sequences can be aligned more accurately, a UPGMA guide tree tends to give a more accurate progressive alignment because nodes in an N-J tree are predicted evolutionary neighbours, which are not necessarily closest neighbours. I have previously shown that UPGMA guide trees give higher average alignment accuracy than N-J in Muscle v3 [6].
The heads or tails (HoT) method [7] generates variant progressive alignments from a single guide tree by aligning reversed sequences ("tails") at some nodes and unreversed sequences ("heads") at others. It is not obvious that this method will generate any variation, because parameters and the guide tree are held fixed. The pair-wise alignment at a node is usually constructed by dynamic programming which maximises the alignment score (sum of substitution scores and gap penalties), and this score does not change if the alignment is reversed. If an alignment of reversed sequences has the highest score, the equivalent alignment of unreversed sequences will also have the highest score. However, reversed sequences may nevertheless give different alignments due to issues such as floating-point rounding and tie-breaking in cases where more than one alignment has the highest possible score. Any method for generating variants potentially has merit and is worth considering in a larger framework, but it is self-evident that the variations produced by HoT can explore only a tiny fraction of the space of equally plausible alignments compared to Muscle5.
Unistrap [8] induces variant guide trees by changing the order of input sequences to a chosen aligner, e.g. MAFFT or Clustal-Omega. As with HoT, it is not immediately obvious that changes will be induced because the guide tree is calculated from a matrix of all pair-wise distances, and pair-wise distances are the same regardless of input ordering. However, computational artefacts such as tie-breaking may nevertheless cause the guide tree to change. This method has similar limitations to HoT: variation in the guide tree may be absent or marginal, and most or all of the ensemble may then reflect consistent bias towards identical or similar guide trees.
GUIDANCE2 generates variant guide trees starting from an initial MSA generated with a default guide tree. Re-sampled MSAs are generated by sampling columns with replacement, and an N-J tree is generated from each re-sampled MSA. HoT variant alignments are generated from each guide tree. Compared to the Muscle5 procedure, this has a number of disadvantages. As explained above, there is compelling theoretical and empirical evidence that N-J guide trees are generally inferior to UPGMA trees. There is no lower bound on the variation in guide trees.
Systematic bias due to the guide tree in the initial MSA may be reflected in the re-sampled N-J trees and hence propagate to the final ensemble.
Muscle5 improves on previous methods by using UPGMA, applying minimal modifications consistent with guaranteeing substantial variation in progressive joining order, especially close to the root where progressive bias is most likely to manifest and most likely to degrade phylogenetic tree inference.

Substitution matrix variation compared to previous methods
To the best of my knowledge, all previous methods for generating variant alignments have used a fixed substitution matrix, except for a 1995 study of arthropod phylogeny [9] which assessed the effects of varying the transition-transversion ratio. Here, Muscle5 introduces an important innovation by varying each substitution score in the matrix independently of the others. This can be motivated and interpreted as follows. Choosing a substitution matrix is equivalent to choosing parameters for an evolutionary model; e.g., a PAM matrix is equivalent to parameters for the JTT model [10]. If the substitution matrix is held fixed, this is equivalent to assuming that all sites evolve according to the same model with the same parameters. Of course, both of these assumptions are highly unrealistic; in fact, constraints vary greatly between sites, and tractable models are drastic simplifications of selective constraints in vivo. Even if a supposedly ideal algorithm could optimise per-site parameters simultaneously with likelihoods of the estimated alignment and tree, the model would remain a drastic simplification and the predicted alignment and tree could still be badly wrong. Therefore, rather than striving towards an unattainable "optimal" solution which may be biologically incorrect, it is better to assess the consequences of using the best tractable simplified model.
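As a purely illustrative sketch (not Muscle5's actual perturbation scheme or parameter values), independent per-score perturbation might look like this, where the multiplicative noise model and the magnitude `sigma` are assumptions for illustration:

```python
import random

def perturb_matrix(scores, sigma=0.1, seed=0):
    """Perturb each substitution score independently.

    scores: dict mapping a symbol pair to its substitution score.
    Each score is scaled by an independent random factor in
    [1 - sigma, 1 + sigma]; the noise model and sigma are illustrative
    assumptions, not the values used by Muscle5.
    """
    rng = random.Random(seed)  # seed gives reproducible replicates
    return {pair: s * (1.0 + rng.uniform(-sigma, sigma))
            for pair, s in scores.items()}
```

Varying the seed then yields a family of perturbed matrices, each an equally plausible parameterisation of the simplified evolutionary model.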

Super5 algorithm
The Super5 algorithm was designed to scale PPP by introducing divide-and-conquer heuristics. A sketch of the algorithm workflow is given in Fig. S4.

Redundancy reduction
The first step of Super5 aims to identify clusters of two or more highly similar sequences in the input data. For each cluster, a representative sequence is identified. Representatives are propagated for subsequent processing while the remaining sequences are set aside and added back into the representative alignment in the final step. This strategy is effective in reducing computational cost if the number of representatives is substantially smaller than the number of input sequences, which is often the case in practice. Clusters are constructed using a greedy list removal strategy similar to the UCLUST algorithm [13]. Input sequences are sorted by decreasing length, with the goal of ensuring that shorter fragments do not become representatives. A k-mer index on the representatives is used to prioritise sequence comparisons by U-sorting [13].
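The greedy list-removal strategy can be sketched as follows; `is_similar` is a hypothetical predicate standing in for the k-mer-indexed, U-sort-accelerated comparison:

```python
def greedy_cluster(seqs, is_similar):
    """Sketch of UCLUST-style greedy list-removal clustering.

    Sequences are processed longest-first so that short fragments do not
    become representatives. Each sequence joins the first existing
    representative it is similar to; otherwise it founds a new cluster.
    Returns a dict mapping each representative to its member list.
    """
    reps, members = [], {}
    for s in sorted(seqs, key=len, reverse=True):  # longest first
        for r in reps:
            if is_similar(s, r):
                members[r].append(s)   # set aside; re-added in the final step
                break
        else:
            reps.append(s)             # new cluster representative
            members[s] = []
    return members
```

Only the representatives are propagated to the coarse clustering and alignment stages; the member lists are held back until the final MSA is assembled.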

Coarse clustering
The next step divides representatives into clusters small enough to be tractable for PPP, i.e. a few hundred sequences. A first-draft set of clusters is obtained by UCLUST-EE with EE < 0.3.
Clusters which are bigger than the maximum size (default 500) are sub-divided by UCLUST-EE with EE < 0.1. Any remaining clusters which are still too large are sub-divided at random.

Intra-cluster alignment and consensus
Each coarse cluster is aligned by PPP, and the consensus sequence for each MSA is calculated by taking the highest-frequency symbol from each column, deleting any positions where this symbol is a gap.
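The consensus rule described above is simple enough to state directly in code; this is a straightforward transcription of the text, not Muscle5's implementation:

```python
from collections import Counter

def consensus(msa):
    """Consensus of an MSA (list of equal-length rows): take the
    highest-frequency symbol in each column, then delete positions
    where that symbol is a gap."""
    out = []
    for col in zip(*msa):                        # iterate over columns
        top = Counter(col).most_common(1)[0][0]  # most frequent symbol
        if top != "-":                           # drop gap-majority columns
            out.append(top)
    return "".join(out)
```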

Guide tree construction
An all-vs-all EE distance matrix is calculated from the consensus sequences, and a biased UPGMA tree [6] is constructed from the distance matrix.

Representative MSA
MSAs for coarse clusters are combined by progressive alignment following the guide tree. Profile alignment is performed by PPP-pa when the number of sequence pairs is at most the threshold (2,000 by default); otherwise a random subset of pairs up to the threshold is selected and PPP-ps is used. This yields an MSA of all representative sequences.

Final MSA
The final MSA is constructed by re-introducing the non-representative sequences set aside in the first step, using the previously constructed pair-wise alignments to their corresponding representatives. These pair-wise alignments imply transitive alignments of non-representative sequences to the representative MSA, which are used to construct the final MSA.
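The transitive step can be sketched as below for a single set-aside sequence; this simplified sketch assumes every member position maps to a column of its representative (member insertions relative to the representative are ignored), and the argument names are illustrative:

```python
def insert_member(member, pair_cols, rep_row):
    """Place a set-aside sequence into the representative MSA transitively.

    member:    the set-aside sequence (string).
    pair_cols: dict mapping each member position to a position of its
               representative, taken from the saved pair-wise alignment.
    rep_row:   the representative's gapped row in the representative MSA.
    Returns the member's gapped row. Member positions with no match to
    the representative are dropped in this simplified sketch.
    """
    # MSA column index for each ungapped position of the representative
    rep_to_col = [i for i, c in enumerate(rep_row) if c != "-"]
    row = ["-"] * len(rep_row)
    for m_pos, r_pos in pair_cols.items():
        row[rep_to_col[r_pos]] = member[m_pos]
    return "".join(row)
```

For example, with representative row `A-CG` and a pair-wise alignment matching member position 0 to representative position 0 and member position 1 to representative position 2, the member `TT` is placed as `T--T`.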

Assessing uncertainty in predictions from simplified evolutionary models
Muscle5 introduces perturbations into its HMM using a simple, ad hoc procedure which may appear unjustified or unnatural to readers versed in the application of mathematical models to evolution. Here, I offer some comments on conventional approaches and my thinking behind the design of Muscle5.
Probabilistic models of evolution are used to predict alignments and/or phylogenetic trees by attempting to maximise likelihood (in a frequentist framework) or posteriors (Bayesian).
Conventional wisdom dictates that an ideal algorithm would simultaneously optimise model parameters together with the alignment and tree. However, evolutionary models are not realistic; on the contrary, they are drastic simplifications compared to the complex reality of biological sequence evolution. Even if simultaneous optimisation were tractable, the predicted alignments and trees would surely be prone to errors due to simplifications in the model and the occurrence of low-probability events, and we therefore should not expect that ideally optimised predictions would always or usually be correct. In fact, we should not necessarily even expect that the correct alignment is in the set of optimal alignments that are reachable by a given model. This can be illustrated by a toy example. Consider two species X and Y whose most recent common ancestor had sequence MQTF, and suppose that I was independently inserted after the T in X (giving MQTIF) and L after the T in Y (giving MQTLF). The correct alignment is then (MQTI-F, MQT-LF), with gaps in opposite sequences in adjacent columns. However, standard pair-wise alignment algorithms will never predict gaps in opposite sequences in adjacent columns because a single substitution has higher probability than two insertions. The predicted alignment will be (MQTIF, MQTLF), which is not correct. If this alignment is given to a maximum likelihood tree algorithm, it will incorrectly infer that the (I, L) column contains a substitution and therefore calculate an incorrect likelihood by the standards of its own model.
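The preference in the toy example can be checked with a simple column-wise scorer; the match, mismatch and gap values below are illustrative toy numbers, not Muscle5's parameters, but under almost any standard scoring a single substitution beats two gaps:

```python
def aln_score(a, b, match=1, mismatch=-1, gap=-2):
    """Column-wise score of a given pair-wise alignment (toy values)."""
    s = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            s += gap        # gap in either sequence
        elif x == y:
            s += match      # identical symbols
        else:
            s += mismatch   # substitution
    return s
```

The substitution alignment (MQTIF, MQTLF) scores 4 matches and 1 mismatch (total 3), while the correct alignment (MQTI-F, MQT-LF) scores 4 matches and 2 gaps (total 0), so the aligner reports the incorrect alignment.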
Finally, I am sceptical that there is enough information in a typical set of input sequences to effectively optimise a model with many parameters, especially if parameters are allowed to vary by site, as should clearly be the case with substitution rates.
I suspect that the best algorithms we have today achieve (in)accuracy comparable to that which would be achieved by a hypothetical ideal algorithm, in which case a point of diminishing returns has been reached from a biological perspective (as opposed to a computer science perspective where the optimisation problem is interesting on its own terms). In the design of Muscle5, I therefore explicitly rejected the goal of improved optimisation and instead sought to determine the sensitivity of predicted alignments to perturbations in model parameters as a proxy for assessing uncertainty arising from the use of simplified models and the likely presence of unlikely evolutionary events.

RdRp alignment and tree
I re-aligned the 4,617 RNA-dependent RNA polymerase (RdRp) sequences from Wolf2018 using the Super5 algorithm, generating a stratified ensemble. I reduced replicate MSAs to a tractable size (n = 250 sequences) for maximum-likelihood tree estimation by constructing the none.0 alignment of all sequences, estimating an approximate tree T_FT by FastTree [16], identifying a set of edges which divide T_FT into n subtrees of approximately equal size, and selecting one representative sequence at random from each subtree; a similar protocol was used by Wolf2018.
Representative sequences were extracted from the stratified ensemble and corresponding maximum-likelihood trees were estimated by RAxML [17], following Wolf2018. Varying the random number seed d for representative sequence selection provides another mechanism for generating replicate MSAs and hence replicate trees; these are denoted d.perm.s. Phylum branching order was inferred by identifying the best-fit subtree for each phylum (Methods); bootstrap confidences were assigned from the highest-confidence edges linking phylum subtrees, with confidence measured by transfer bootstrap expectation.

Ensemble
Collection of alternative MSAs of the same sequences.

H-ensemble
Ensemble of high-accuracy MSAs such that no particular MSA from this ensemble (or by any other method) is preferred a priori.

Stratified ensemble
H-ensemble constructed to have subsets (strata) where a potential source of systematic error is held fixed, e.g. guide tree topology or gap penalty values. Enables detection of particular types of bias / systematic error in an inference made from an MSA.

Figure S3. Example of splitting a balanced tree into three approximately equal-sized subtrees. Note that all pair-wise leaf alignments are preserved in all permutations, in contrast to the example in Fig. S2 where I + J is preserved but the order is changed for all other leaves, showing that a maximally unbalanced tree is a worst-case scenario for this approach.

Figure S4. Workflow of the Super5 algorithm. Super5 applies a divide-and-conquer strategy to PPP, enabling scaling to larger datasets.

Figure S6. Condensing a tree and calculating monophyly.
This hypothetical example shows how a tree is condensed from species to family. Leaves are species, belonging to three families: Coronaviridae, Arteriviridae and Mesoniviridae. For each family, the best-fit subtree is identified as the node which gives fewest errors (FP + FN). Arteriviridae is monophyletic (no errors), but one coronavirus species is misplaced, inducing an FP for Mesoniviridae and an FN for Coronaviridae. The tree is condensed by deleting all edges under the best-fit subtree nodes and labeling these nodes, which are now leaves, with the family names. This simplifies the tree and eliminates the error. Monophyly is defined as m = TP/(TP + FP + FN). In this example, Arteriviridae has TP = 4, FP = 0, FN = 0, hence m = 1; Coronaviridae has TP = 3, FP = 0, FN = 1, m = 3/4 = 0.75; and Mesoniviridae has TP = 3, FP = 1, FN = 0, m = 3/4 = 0.75. For calculating monophyly of a higher rank, e.g. phylum, errors at the lower rank used for assessment (e.g. class) are not counted; rather, the tree is first condensed to class, then phylum monophyly is assessed using the best-fit nodes for class.
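The monophyly arithmetic in this example reduces to a one-line function:

```python
def monophyly(tp, fp, fn):
    """Monophyly as defined above: m = TP / (TP + FP + FN)."""
    return tp / (tp + fp + fn)
```

With the counts from the example, Arteriviridae (TP = 4, FP = 0, FN = 0) gives m = 1, while Coronaviridae (TP = 3, FN = 1) and Mesoniviridae (TP = 3, FP = 1) each give m = 0.75.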