Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein families databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. What matters is that its implementation is highly optimized and parallelized to make the most of modern computer platforms. Thanks to the above, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. Quality does not compromise on time or memory requirements, which are an order of magnitude lower than those in the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.
Multiple sequence alignment (MSA) is one of the most important analyzes in molecular biology. Most algorithms use progressive heuristics1 to solve the MSA problem. The scheme consists of three stages: (I) calculation of a similarity matrix for investigated sequences, (II) a guide tree construction, (III) greedy alignment according to the order given by the tree. Pairwise similarities can be established in various ways. Some algorithms use accurate although time-consuming methods, such as calculating pairwise alignments of the highest probability2 or maximum expected accuracy3. Others employ approximated, yet faster approaches, e.g. tuple matching4,5. As the sizes of the protein families to be analyzed continue to increase, the necessity to calculate all pairwise similarities has become a bottleneck for alignment algorithms. Therefore, several attempts have been made to accelerate this stage. Kalign6 and Kalign27 employ Wu-Manber8 and Muth-Manber9 fast string matching algorithms, respectively, for similarity measurements. This allows thousands of sequences to be aligned in a reasonable timespan. The idea was further extended by the authors of the presented research in Kalign-LCS10, which introduced the longest common subsequence to Kalign2 for similarity measurement. This improved both the alignment quality and the execution time. Nevertheless, in view of the most recent developments in high throughput sequencing, biologists are required to align protein families containing tens of thousands of members. Progressive algorithms which calculate and store all pairwise similarities could not be applied to the problems of such a size due to excessive time and memory requirements.
PartTree, a divisive sequence clustering algorithm for building a guide tree without calculating all pairwise similarities11, was one of the ideas to tackle the problem. With average time complexity of O(k log k) and space complexity of O(k) (k being the number of sequences in the input set), PartTree was successfully adopted by MAFFT 6 package12. As a result tens of thousands of sequences could be aligned on a typical desktop computer. A different approach is presented in Clustal Omega13. It uses mBed, an algorithm for embedding sequences into a lower-dimensional space14, which requires only O(k log k) exact similarity values to approximate other values. Embedding is combined with sequence clustering with the use of K-means algorithm, which prevents the storage of the whole similarity matrix, and keeps memory requirements under control. Both MAFFT and Clustal Omega use tuple matching for similarity calculation.
A different approach to large-scale alignments was introduced in UPP15. The algorithm creates a set of profile models on the basis of a subset of sequences referred to as a backbone. The backbone is then used in the process of aligning the remaining elements of the input set.
While MAFFT, Clustal Omega, and UPP are computationally applicable for families of even 100 000 proteins, we show that the quality of the results for such problems is often unsatisfactory. This article presents FAMSA, a progressive multiple alignment algorithm, particularly suitable for large sets of sequences. Pairwise similarities are established, by analogy to Kalign-LCS, on the basis of the longest common subsequences (LCS). Unlike MAFFT and Clustal Omega, FAMSA calculates all pairwise similarities. The process is highly efficient due to the multithreaded, bit-parallel LCS algorithm suited for AVX extensions16 of modern processors. Employing a memory-saving, single-linkage algorithm17 for guide tree construction, reduces first-stage memory requirements to O(k). A novel, in-place algorithm of profile alignment which prevents memory reallocations during the progressive stage is an important factor contributing to the computational scalability of FAMSA. As a result, FAMSA is the fastest and most memory-efficient alignment software when large protein families are taken into account. FAMSA’s superiority was observed for sets ranging from thousands to half a million of sequences.
FAMSA is not only efficient, but also very accurate thanks to a number of algorithmic features. They include LCS for similarity measurement, MIQS substitution matrix18, and a correction of gap penalties inspired by MUSCLE4. The penalties are additionally adjusted to the set size, which is a novel technique in alignment software, particularly useful in large sets of sequences. Misalignments during the progressive stage are fixed with a refinement scheme similar to the one included in QuickProbs 219. Consequently, when sets of a few thousand or more sequences are considered, FAMSA is significantly more accurate than any other algorithm. Interestingly enough, the difference increases along with the growing number of sequences. For instance, for sets exceeding 25 000 proteins, FAMSA aligned 25–45% more columns in a correct manner than UPP or the most accurate variants of MAFFT or Clustal Omega. When the largest benchmark family containing 415 519 sequences was investigated, the difference was even more remarkable—FAMSA successfully restored three times more columns than its competitors, at a fraction of the required time and memory.
The scalability of FAMSA was assessed on extHomFam, a new benchmark generated by analogy to HomFam13, by enriching Homstrad20 with families from Pfam database21. It contains 380 sets of sizes ranging from 218 to 415 519 sequences. The abundance of numerous protein families (k > 10000) makes extHomFam particularly representative for large-scale alignment problems, which are of crucial importance in the context of recent advances in high throughput sequencing.
Like other progressive algorithms, FAMSA consists of four stages:
Calculation of pairwise similarities,
Determination of a guide tree,
Progressive profile merging according to the guide tree order,
Optional iterative refinement of the final profile.
Detailed descriptions of the algorithm stages together with analyzes of time and space complexities are provided in the following subsections.
Calculation of pairwise similarities
The length of a longest common subsequence (LCS) is used in order to determine the pairwise similarities of sequences in the input set. The choice was motivated by the promising results of LCS application in previous studies10,22. Given two sequences, A and B, the length of an LCS is the maximal number of perfectly matching columns. This can be considered as an estimation of true pairwise alignment. To compensate the effect of LCS length being larger for longer sequences, the value is normalized by the indel distance (the number of single-symbol insertions and deletions necessary to transform one sequence into another). This distance approximates the misalignment cost, i.e. the number of gaps in the alignment, in which only perfect matches are allowed:
The LCS length can be computed using a straightforward dynamic programming (DP) rule23. Owing to the internal properties of the DP matrix, the calculation can be made using the bit-parallel manner. In this approach, w cells are computed at a time (w is a computer word size equal to 64 in modern architectures)24. The indel distance for the sequences A and B can be directly derived from the LCS length according to the formula:
where |S| denotes the length of the S sequence. The time complexity of the pairwise similarity calculation is:
under a reasonable assumption that w is comparable to or smaller than the longer sequence length.
As modern computers are equipped with multi-core processors, FAMSA distributes the calculation of LCS lengths for different pairs of sequences to several computing threads. Moreover, the presented software makes use of vector operations provided by technologies like SSE, AVX, AVX 216, which are supported by contemporary processors. This allows multiple pairs of sequences to be processed simultaneously by the same thread. Assuming that there are t processing threads, a words in a single AVX vector (2 for AVX, 4 for AVX 2, and 8 for the announced AVX-512), and that n denotes a sequence length, the total time complexity of the first stage can be expressed as:
Massively parallel architectures have become widespread in computationally demanding tasks. As FAMSA was designed for analyzing large protein families, it allows massively parallel devices, such as graphics processors, to be employed for the calculation of pairwise similarities. The procedure is implemented in OpenCL25, therefore it is suitable for GPUs manufactured by all main vendors, including NVidia and AMD. The distribution of LCS computation over thousands of graphics processor threads further increases the throughput of the first FAMSA stage. Yet, as is shown in the experimental part of the article, even without OpenCL, FAMSA is able to process hundreds of thousands of proteins in a very short time.
Determination of the guide tree
it can be computed incrementally, i.e. without storing the complete similarity matrix,
it is very fast, i.e. it can be completed in O(k2) time using the SLINK algorithm17,
To benefit from the incremental property of SLINK, the first two stages of FAMSA are performed simultaneously, which restricts the memory footprint. In particular, tree generation requires only O(k) space in contrast to O(k2), required by other guide tree construction algorithms like UPGMA. This is of crucial importance when huge protein families are investigated.
Progressive construction of the alignment
The progressive construction stage requires O(k) profile alignments, each computed by means of dynamic programming. At least half of these alignments are degenerated cases in which one or both profiles consist of a single sequence. Dynamic programming implementation can be simplified in those cases, hence specialized variants of the general DP procedure were prepared. This resulted in remarkable computation time savings for huge datasets, in which most profile alignments are made against a single sequence due to the structure of a guide tree.
Several improvements in the classical computation rule were made in FAMSA to upgrade the quality of alignment as well as the processing speed. They were possible owing to the internal profile representation composed of three arrays storing:
occurrence counters of each alphabet symbol in consecutive columns (32n* computer words, with n* being the profile length),
costs of alignment of consecutive columns to each possible alphabet symbol (also 32n* computer words),
sequences in the gapped representation.
While two former components were previously employed by alignment algorithms, e.g. Kalign, the gapped representation is, to the best of our knowledge, a novel technique. In this representation, two equal-sized arrays are stored for each sequence: (i) sequence symbols, (ii) the number of gaps present before the corresponding sequence symbol. Moreover, to quickly localize a symbol in a column, as well as to insert or remove gaps, dynamic position statistics are stored in an additional array. The space for the gapped sequence is approximately 13 times the length of the sequence (see Fig. 1 for example). The proposed profile representation allows a dynamic programming matrix to be computed rapidly and is memory frugal. The DP computation step for a pair of profiles takes
time, where n1, n2 are the input profile lengths.
The presence of gapped representation is especially useful when families containing tens of thousands of proteins are investigated. Other aligners construct a new profile by copying the sequences symbol by symbol, with occasional gap insertions, which tends to be a bottleneck for large-scale analyzes. This is not the case in FAMSA, where entire sequences are moved from the input profiles to the new profile and gaps are rapidly inserted by updating gap counters in corresponding arrays. The construction time of a new profile is:
where k1 and k2 are the number of sequences in both profiles, ko = k1 + k2, and no is the resulting profile length. Thus, the overall time of all profile constructions is:
where nf is the final profile length.
By adding the time of DP matrix calculation, the total time of this stage is obtained:
As profile alignments in the bottom part of the guide tree are independent, they can be performed in parallel. Therefore, to improve the computation time, FAMSA distributes profile alignments over multiple threads. It would also be possible to parallelize the dynamic programming computation and construction of a single profile. This is expected to be particularly beneficial for families of a million and more proteins. Nevertheless, this was not done in the current FAMSA version due to implementation complications and unavailability of such large sets in existing databases.
Determination of gap types and costs
Among numerous amino acid substitution matrices for dynamic programming calculation, MIQS was selected due to superior results reported in the recent study18. The gap costs are determined according to the classic affine penalty function, with a distinction between terminal and non-terminal gap open and gap extension costs, by analogy to MUSCLE4. In particular, four types of gaps are used:
terminal gap open (To)—opens a sequence at the left end or opens a contiguous series of gaps at the right end of a sequence,
terminal gap extension (Te)—extends a series of gaps inserted at the beginning or end of a sequence,
gap open (Go)—opens a contiguous series of gaps inserted within a sequence,
gap extension (Ge)—extends a contiguous series of gaps inserted within a sequence.
While the determination of the number of gaps and their types is straightforward in a pairwise alignment, it becomes problematic in MSA. Due to the fact that before aligning two profiles their sequences may have already contained gaps, the insertion of a column of gaps (either a single one or as the first one in a contiguous series of columns with gaps) does not always mean that exclusively gap opens have been inserted. Inserting only gap opens would result in an overestimation of their number. That is why the types of gaps within a column should be corrected.
Figure 2 demonstrates an example of the alignment of two profiles, X and Y. A column of gaps is to be inserted into profile X (the left part of the figure). The proper types of gaps together with the corrected gaps in the neighboring column are shown in the right part of the figure. The following situations must be considered during correction of the gaps:
S1: there is a terminal gap open at the right side of the inserted one; hence, the inserted gap should be a terminal gap open, and the following gap should be transformed into a terminal gap extension,
S2: there is a terminal gap extension at the left side of the inserted one; hence, the inserted gap should also be a terminal gap extension,
S3: the inserted gap will be placed into the gap series, so it should be a gap extension,
S4: there is a gap open at the right side of the inserted one; hence, the inserted gap should be a gap open, and to prevent the occurrence of two gap opens one after the other, the second gap should turn into a gap extension,
S5: the inserted gap is to be placed within the series of residues as the only gap, so it should be gap open,
S6: there is a terminal gap open at the left side of the inserted one; hence, the inserted gap should be a terminal gap extension.
The optimization of gap parameters and recognition of their influence on alignment accuracy is still the subject of intensive studies30. Various techniques have been proposed, e.g. adding a bonus score to a gap cost to enforce the alignment of distantly related sequences7. In our research, all gap costs (i.e. gap opens and gap extensions, both terminal and non-terminal) are multiplied by a factor related to the number of sequences in the input collection. This prevents unnecessary widening of alignments of large collections. The scaling factor is calculated as:
where gl and gd are two constants set by default to 45 and 7 (values chosen experimentally).
The application of gap corrections and scaling leads to another modification of the traditional approach. It is usually assumed that the insertion of a gap column in to the first profile cannot be immediately followed by the insertion of a gap column in to the second profile. If some assumptions about the gap costs and substitution matrix values are made, it can be proved to be reasonable, i.e. such a situation never leads to the optimal alignment. Nevertheless, this does not hold true if the gap correction is applied. Therefore, it is checked whether consecutive insertions of gap columns in to both profiles render a higher-scored alignment. (The profile alignment score is the summed alignment score of each sequence pair).
The idea of an iterative refinement is to correct misalignments made in the early phase of the profile alignment. Several algorithms were proposed for this task, like REFINER31 or the methods implemented in MMSA22, MSAProbs32. This problem was investigated in our recent paper19. It was shown that the classical methods do not work for sufficiently large collections of sequences. A column-oriented refinement was also proposed to improve the quality of alignments for collections up to 1000 sequences. In this approach, the algorithm scans the profile to localize columns that contain at least one gap. Then, it randomly selects one of such columns and splits the profile into two subprofiles, depending on the gap presence in the selected column. Empty columns are removed afterwards, and subprofiles are realigned. Finally, if new alignment is scored higher than the original one, it is accepted as the current solution.
To simplify the time complexity analysis of the refinement, the input and the output profiles are assumed to be of comparable lengths (which is usually the case). A single refinement iteration requires
time, with n* being the length of the shorter of the two profiles obtained after splitting the original profile.
Preliminary analyzes showed the refinement to be particularly beneficial for smaller sets of sequences. For this reason, and to improve the processing time of large protein families, the refinement is applied only for k ≤ 1000. The number of iterations was experimentally set to 100.
An assessment of MSA algorithms was performed using benchmark datasets. The presence of high-quality, manually curated reference alignments allowed supervised accuracy measures to be calculated. They were sum-of-pairs (SP) and total-column (TC) scores, defined as fractions of correctly aligned symbol pairs and columns, respectively. The scores were determined with a use of QSCORE software33.
Our aim was to propose an efficient and robust algorithm for the alignment of thousands of proteins. The largest available benchmarks contain sets of at most hundreds of sequences, with an exception of HomFam introduced by Sievers et al.13. HomFam consists of 92 families constructed by extending Homstrad reference alignments (only those having 5 or more sequences were taken into account) with corresponding families from Pfam database. This protocol results in large benchmark sets: 18 of them consist of more than 10 000 members (the number of reference sequences ranges from a few to a few tens, though).
Since 2011, when HomFam was introduced, Pfam and Homstrad databases have demonstrated significant growth. Hence, we present the new benchmark, extHomFam, to carry out more extensive experiments. It was constructed according to the HomFam generation protocol, with several modifications. 399 high-quality alignments containing at least 3 proteins were selected from Homstrad (ver. 1 Apr 2015). By decreasing the threshold from 5, the larger benchmark could be obtained than the original HomFam. If two-protein families were also taken into account, extHomFam would be increased to 1013 sets, at the cost of positively biasing the TC score (with two reference sequences it becomes equal to SP). Therefore, pairwise-only alignments were excluded. Next, selected Homstrad sets were enriched with the corresponding Pfam (ver. 28) families. After removing duplicated sequences, sets of less than 200 proteins were filtered out to provide the final benchmark of 380 families. For convenience extHomFam was divided at thresholds k = 4000, 10000, and 25000 to obtain subsets named small, medium, large, and extra-large. Note, that sets of ~1000 sequences are usually referred to in the literature as large—our naming convention is intended to show size diversity. ABC_tran, the most numerous set in extHomFam, contains 415 519 sequences, which is the largest benchmark protein family available.
The scalability of algorithms was evaluated on the 53 largest extHomFam families, containing at least 30000 sequences each. These sets were recursively downsampled to the desired sizes, with a guarantee of preserving sequences from reference alignments. This scheme has a valuable property: smaller sets are contained in larger ones, which reduces results variability.
To evaluate the performance of the presented algorithm on smaller alignment problems, classic benchmarks, i.e. BAliBASE34, PREFAB4, OXBench-X35, and SABmark36, were also considered in the experiments.
Competitive algorithms and system setup
From among numerous sequence alignment algorithms, only those able to handle families of thousands of sequences were investigated on HomFam and extHomFam. They were MUSCLE4, Kalign27, Kalign-LCS10, UPP15, Clustal Omega13 and MAFFT37. MAFFT was analyzed in default configuration in which it calculates O(k2) pairwise similarities, as well as -parttree and -dpparttree modes, especially suited for large sets of sequences due to lower computational requirements. Clustal Omega was executed with default parameters and with two combined iterations (-iter2), which delivered superior results in the previous studies13,38. MUSCLE in default mode was unfeasible for immense protein families, hence -maxiters2 variant was also considered. Details on execution parameters and program versions are given in the Supplementary material.
A workstation equipped with two 12-core Intel Xeon E5-2670v3 processors (clocked at 2.3 GHz), Nvidia Quadro M6000 graphic card (3072 cores clocked at 1.0 GHz), and 128 GB RAM was used for the experiments. To investigate the behavior of the algorithms on modern workstations and servers containing from a few to several tens of cores, all methods were run with 8 computing threads, unless stated otherwise. FAMSA was run in the CPU mode, except for the experiment on the algorithm scalability w.r.t. the number of CPU cores, where the GPU variant was additionally investigated.
HomFam and extHomFam benchmark evaluation
Following Sievers et al.13, HomFam was divided into three parts depending on the family size. As is shown in Table 1, for k ≤ 3000, FAMSA was inferior only to both Clustal Omega configurations. For k > 3000, the presented algorithm took the lead in both measures, revealing its potential for large protein families. More importantly still, FAMSA was from several to hundreds of times faster than its competitors. For instance, it processed the entire HomFam in less than 12 minutes while Clustal-default and MAFFT-default required, respectively, 8 h 40 m and 2 h 30 m. An even greater difference was observed for Clustal-iter2, which completed the analyzes in over 51 hours. Interestingly enough, MAFFT-parttree and -dpparttree were also inferior to FAMSA. This is especially noteworthy because they calculate only selected pairwise similarities (usually O(klog k)) instead of the full matrix (O(k2)).
The experiments on extHomFam confirmed superior accuracy and execution time of FAMSA to scale well with the number of sequences (Fig. 3; more detailed results are given in Supplementary material). FAMSA was inferior to Clustal-iter2 by a small margin, and only on the small subset. For k > 4000, it became the best aligner and, depending on the subset and quality measure, was followed by Clustal, MAFFT, or UPP. As MUSCLE and MAFFT -parttree rendered inferior results, they were excluded from Fig. 3. Kalign2, Kalign-LCS, and MUSCLE did not complete the analyzes on extra-large due to excessive memory or time requirements. Clustal Omega and MAFFT-default failed to process, respectively, one and four largest extHomFam families (the missing MAFFT results were taken from -dpparttree variant, though). Advances in SP and TC measures of FAMSA over the competing software on medium, large, and extra-large subsets were assessed statistically with the use of the Wilcoxon signed-rank test with the Bonferroni-Holm correction for multiple testing. The differences are significant at α = 0.005; p-values for all pairwise comparisons can be found in Table 2.
As can be seen in Fig. 3, the quality advance of the presented software over other algorithms increased for consecutive subsets. For instance, on extra-large, FAMSA aligned in a proper manner approximately 25% more columns than UPP—the second best algorithm. A more detailed analysis of FAMSA accuracy compared to the competitors is given in Fig. 4. Four extHomFam categories were further divided into 11 subsets of approximately 35 families. Selected statistical indicators (median, mean, 12.5th and 87.5th percentile) of absolute differences in SP and TC measures between FAMSA and other algorithms were plotted for each interval at k axis. Clearly, the number of test cases for which the presented software was superior to that made by the competitors, as well as the absolute advance in terms of quality, increases with the growing set size. This observation is supported by the scalability analysis performed on the 53 largest families (k ≥ 30000), randomly resampled to obtain less numerous sets. Figure 5 shows that FAMSA outrun the competitors when the number of proteins exceeded 5000. Importantly, the performance was hardly affected when more sequences were added. This might be caused by the bias of the guide trees towards reference sequences. Indeed, the following section shows that the reference sequences were slightly closer to each other in the guide trees than suggested by the random model. Nevertheless, this held for all analyzed algorithms, therefore can be considered as a property of the benchmark.
The abundance of extremely large protein families makes extHomFam the most demanding benchmark in terms of computational resources. Apart from FAMSA, only MAFFT-dpparttree and -parttree were able to process all of its sets. Other algorithms either crashed due to memory requirements or were terminated on purpose when the processing time of a family was longer than 24 hours. (An exception was made only for Clustal-iter2 due to its superior quality results.) While MAFFT-default, MAFFT-dpparttree, UPP, and Clustal-default required from 50 to 188 hours, FAMSA finished the computations in less than 7 hours, which corresponds to 7- to 26-fold advance. Clustal-iter2 was an extreme case; it needed almost 1000 hours, and showed its combined iterations to be inapplicable for very large protein families. A more detailed analysis of the computational scalability of the presented algorithm is provided in Fig. 5. It confirms that FAMSA is faster than MAFFT-default, Clustal Omega, and UPP by 1–2 orders of magnitude. The efficiency of the presented algorithm is due to the fast bit-parallel similarity computation and the in-place profile joining. Yet, as FAMSA calculates more distances than Clustal Omega and MAFFT-dpparttree (O(k2) instead of O(k log k)), it might be expected to exceed competitor execution times for a sufficiently large k. To verify this, the algorithms were compared on ABC_tran, the largest family in extHomFam with 415 519 proteins. FAMSA processed this set in less than 2 hours. Clustal Omega crashed due to excessive memory requirements after 55 hours of calculations suggesting that the algorithm is dominated by stages other than similarity computation. The situation was different for MAFFT-dpparttree. Its execution time scaled better with the number of sequences, though it was still inferior to FAMSA by a factor of 2.5. Importantly enough, FAMSA required below 8 GB of RAM, while MAFFT-dpparttree needed 47 GB. For comparison, MAFFT-default and Clustal Omega failed to run on a 128 GB machine (the former demanded 318 GB just for storing the similarity matrix). In conclusion, the calculation of all pairwise similarities performed by FAMSA did not prevent it from being the fastest and most memory efficient aligner in the comparison–even for immense protein families.
As FAMSA was designed to use all the available computational power, it takes advantage of multi-core architectures of contemporary computers. The ten largest protein families from extHomFam (all which contain at least 100 000 sequences: ABC_tran, gtp, HATPase_c, helicase_NC, kinase, mdd, response_reg, rvp, sdr, TyrKc) were selected to investigate the scalability of the algorithm stages with respect to the number of computing threads. The experiments also considered the variant of FAMSA in which similarity calculation was adapted for massively parallel architectures with a use of OpenCL. For convenience, processing times of ABC_tran were marked separately. As can be seen in Fig. 6, when FAMSA was run serially, more than 90% of the execution time was related to stages I and II (the algorithm performs them simultaneously). Nevertheless, as pairwise similarities can be calculated independently, these stages scale noticeably better with the number of threads than the progressive construction. In particular, when more than 12 cores were involved, stage III of the algorithm started to be the bottleneck. This was also the case for the GPU FAMSA variant.
Impact of guide tree computation method
A single linkage method for guide tree determination was used in FAMSA owing to low memory requirements of the SLINK algorithm and superior quality results reported in the previous studies. Nevertheless, as the tree structure was shown to be of crucial importance for the analysis of large protein families41, alternative methods were examined. The first one was UPGMA27, which can be computed in O(k2) time and space. The memory consumption is actually close to 2k2 bytes. This is equivalent to about 345 GB for the largest family (ABC_tran) making UPGMA unfeasible for immense sets of sequences. The performance of trees produced by Clustal Omega was also investigated in the experiments (FAMSA provides the user with the possibility to import external trees in the Newick format). Finally, we examined chained guide trees41,42,43. This method was the fastest, as it did not require calculation of sequence similarities.
The comparison of the results for extHomFam is given in Table 3. As UPGMA was non-computable for ABC_tran, the results of the single linkage were considered in this case. The experimental results for chained trees were averaged for 21 trials.
The experiments confirmed that the single linkage was superior in terms of alignment quality. The smallest, yet statistically significant advance (pSP = 0.000011, pTC = 0.000019), was observed when compared with UPGMA. This, together with memory efficiency, made the authors choose the single linkage for FAMSA. To provide a deeper insight into the structure of the trees rendered by different strategies, the Sackin index44 was used, defined as the sum of heights of all leaves in the tree. Figure 7 shows the comparison of the normalized Sackin indexes (i.e. the Sackin indexes divided by the number of sequences in the family) for trees produced by FAMSA + single linkage, FAMSA + UPGMA, and Clustal Omega. The lines corresponding to perfectly balanced and imbalanced (i.e. chained) trees are also presented for convenience. It can be seen that the indexes for single linkage trees are noticeably higher than those for UPGMA and Clustal Omega. Interestingly enough, the normalized Sackin indexes for UPGMA and Clustal Omega trees are approximately twice as large as in the perfectly balanced case.
The last experimental step involved the analysis of the guide tree structure in terms of the reference alignments coverage (Table 4). For each family the fraction of a guide tree covering reference sequences was calculated. It was defined as the number of leaves in the smallest part of the tree containing all reference sequences divided by the family size. The results were averaged over all families in the benchmark. The obtained fractions were then compared with a random model determined by performing a Monte Carlo simulation (1000 trials) for each guide tree. The trial involved a random selection of a sequence subset with the same cardinality as the reference alignment.
Interestingly enough, the fractions for the single linkage were smaller than those for UPGMA and Clustal Omega. This was probably caused by single linkage trees being noticeably more imbalanced. It can also be seen that for all tree generation algorithms, fractions containing reference sequences were smaller than the corresponding Monte Carlo results (when analyzing the entire extHomFam, the difference varied from 0.14 to 0.19). This means that the reference sequences were located in the guide trees closer to each other than suggested by the random model.
The same analysis was performed on the resampled sets employed for scalability experiments. The results coincide with the observations made for the entire benchmark.
Classic benchmark evaluation
For completeness, the accuracy of algorithms was investigated on classic benchmarks with families ranging from a few to approximately one hundred sequences (Table 5). As expected, consistency-based methods (QuickProbs 2, MSAProbs, and GLProbs) were superior to the competitors. When it comes to the non-consistency approaches, FAMSA was characterized by moderate performance on the majority of the benchmarks, except SABmark, on which it was the best. The analysis of execution times confirms that FAMSA is one of the fastest algorithms for low and moderately-sized sets, like those contained in the investigated benchmarks.
The abundance of protein families containing hundreds of thousands of members imposes the development of algorithms computationally capable of aligning immense sets of sequences. The traditional progressive scheme was successfully modified by Clustal Omega and MAFFT aligners to eliminate its greatest bottleneck in large-scale analyzes—calculation of all pairwise similarities. Nevertheless, the experiments with FAMSA show that the computation of the entire similarity matrix with the use of LCS measure, combined with a memory-efficient single-linkage tree construction and in-place profile alignment, is orders of magnitude faster than the competing solutions. Importantly enough, this comes with superior alignment quality—FAMSA was significantly more accurate than Clustal Omega and MAFFT on sets of a few thousand and more sequences. ABC_tran, the largest among the investigated families containing 415 519 sequences, reveals the potential of the presented software. The set was processed by FAMSA within 2 hours in less than 8 GB of RAM, which is suitable for a typical laptop. In contrast, Clustal Omega crashed after 2 days of computations on a 128 GB machine due to excessive memory requirements. MAFFT in memory-efficient mode completed the analysis in 5 hours allocating 47 GB of RAM, yet only 5.7% of columns were successfully aligned, while FAMSA restored as much as 16.8%. No columns were aligned properly by UPP.
The scalability of the presented algorithm in terms of the alignment quality as well as time and memory requirements, makes it applicable for protein families even of a million sequences—a no-go area for the competing software. Such families will likely be present in the Pfam database in the near future, as a consequence of advances in sequencing technologies. Importantly enough, the efficiency of FAMSA has the potential to be further improved. The natural option is the parallelization of the dynamic programming procedure at the profile construction stage; this appeared to be a bottleneck in the scalability tests. Another possibility could be better utilization of massively parallel architectures by optimizing the OpenCL code for GPUs or adapting it for Intel Xeon Phi co-processors.
An alternative development direction concerns the alignment quality. Iterative refinement is one of the numerous techniques designed to improve accuracy. For computational reasons, it is performed by FAMSA on families of fewer than 1000 sequences, though. Some limited, less time-consuming refinement scheme could be applied also for larger sets of sequences. Other ideas include the introduction of profile Markov models or consistency. Until recently the latter was found infeasible for large families because of excessive computational requirements. However, our latest research19 shows that consistency applied to a small, carefully selected fraction of sequences, may improve the alignment quality without compromising on the execution time. The experiments involved sets of up to a thousand of sequences. Accordingly, the scalability of the presented ideas to families two orders of magnitude larger is an open question. Moreover, designing a consistency scheme suitable for FAMSA is by no means an easy task.
Another issue related to large-scale analyzes is an accuracy assessment, particularly the unavailability of reference sequences. Evaluating the quality of the alignment of 10 000 or more proteins on the basis of a reference containing only a small fraction of the members is the largest flaw of the experimental pipeline used in the current research. The authors believe that the progress in multiple alignment domain is likely to be facilitated with the development of new benchmark datasets containing more reference sequences.
An interesting attempt into this direction is the recent work by Fox et al.45. The proposed ContTest benchmark predicts a contact map for some protein that has a known three-dimensional structure on the ground of the evaluated multiple sequence alignment. Then the contact map is compared with the known contact map for the same protein. The benchmark contains families up to 44 thousand sequences.
FAMSA executables together with the source code are available at https://github.com/refresh-bio/FAMSA; extHomFam can be downloaded from http://dx.doi.org/10.7910/DVN/BO2SVW. Web service for remote analyzes is under development.
How to cite this article: Deorowicz, S. et al. FAMSA: Fast and accurate multiple sequence alignment of huge protein families. Sci. Rep. 6, 33964; doi: 10.1038/srep33964 (2016).
The work was supported by Polish National Science Centre under the projects DEC-2011/03/B/ST6/01588, DEC-2012/05/N/ST6/03158 and DEC-2015/17/B/ST6/01890 and by Silesian University of Technology under the project BK-263/RAu2/2015, performed using the infrastructure supported by POIG.02.03.01-24-099/13 grant: ‘GeCONiI—Upper Silesian Center for Computational Science and Engineering’.