Large multiple sequence alignments with a root-to-leaf regressive method

Garriga, Edgar; Di Tommaso, Paolo; Magis, Cedrik; Erb, Ionas; Mansouri, Leila; Baltzis, Athanasios; Laayouni, Hafid; Kondrashov, Fyodor; Floden, Evan; Notredame, Cedric

doi:10.1038/s41587-019-0333-6

Letter
Published: 02 December 2019

Nature Biotechnology volume 37, pages 1466–1470 (2019)

Abstract

Multiple sequence alignments (MSAs) are used for structural¹,² and evolutionary predictions¹,², but the complexity of aligning large datasets requires the use of approximate solutions³, including the progressive algorithm⁴. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, from leaf to root, based on a guide tree. Their accuracy declines substantially as the number of sequences is scaled up⁵. We introduce a regressive algorithm that enables MSA of up to 1.4 million sequences on a standard workstation and substantially improves accuracy on datasets larger than 10,000 sequences. Our regressive algorithm works the other way around from the progressive algorithm and begins by aligning the most dissimilar sequences. It uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity. Our approach will enable analyses of extremely large genomic datasets such as the recently announced Earth BioGenome Project, which comprises 1.5 million eukaryotic genomes⁶.
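
The root-to-leaf, divide-and-conquer idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out the individual steps, so everything below is an assumption made for illustration: the guide tree is a nested tuple of sequence names, each subtree is represented by its longest sequence, at most about `n` representatives are aligned at a time, and sub-alignments are merged through the representative they share with their parent. The names `toy_aligner`, `representative`, `collect_children`, `merge` and `regressive` are hypothetical, and `toy_aligner` only pads sequences with trailing gaps as a stand-in for a real third-party aligner.

```python
GAP = "-"

def leaves(node):
    """Leaf names under a guide-tree node (leaf = str, inner node = tuple)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]

def representative(node, seqs):
    """Assumed rule: longest sequence under the node, ties broken by name."""
    return max(leaves(node), key=lambda name: (len(seqs[name]), name))

def collect_children(node, n):
    """Expand inner nodes, oldest first, until about n subtrees are collected."""
    frontier = [node]
    while len(frontier) < n:
        inner = next((s for s in frontier if not isinstance(s, str)), None)
        if inner is None:  # only leaves remain
            break
        frontier.remove(inner)
        frontier.extend(inner)
    return frontier

def toy_aligner(names, seqs):
    """Placeholder for a third-party aligner: right-pads with gaps only."""
    width = max(len(seqs[name]) for name in names)
    return {name: seqs[name].ljust(width, GAP) for name in names}

def merge(parent, child, shared):
    """Merge two MSAs (dict name -> row) through a sequence present in both."""
    only_child = [name for name in child if name not in parent]
    out = {name: [] for name in list(parent) + only_child}
    p_len, c_len = len(parent[shared]), len(child[shared])
    i = j = 0
    while i < p_len or j < c_len:
        if i < p_len and parent[shared][i] == GAP:
            # Column exists only in the parent: gap out the child-only rows.
            for name in parent:
                out[name].append(parent[name][i])
            for name in only_child:
                out[name].append(GAP)
            i += 1
        elif j < c_len and child[shared][j] == GAP:
            # Column exists only in the child: gap out the parent rows.
            for name in parent:
                out[name].append(GAP)
            for name in only_child:
                out[name].append(child[name][j])
            j += 1
        else:
            # Same residue of the shared sequence: fuse the two columns.
            for name in parent:
                out[name].append(parent[name][i])
            for name in only_child:
                out[name].append(child[name][j])
            i += 1
            j += 1
    return {name: "".join(row) for name, row in out.items()}

def regressive(node, seqs, n=2, aligner=toy_aligner):
    """Align representatives at the root first, then recurse towards the leaves."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if isinstance(node, str):
        return {node: seqs[node]}
    children = collect_children(node, n)
    reps = [representative(child, seqs) for child in children]
    msa = aligner(reps, seqs)
    for child, rep in zip(children, reps):
        if not isinstance(child, str):
            msa = merge(msa, regressive(child, seqs, n, aligner), rep)
    return msa

seqs = {"a": "MKTAYIAK", "b": "MKTAYI", "c": "MKAYIAK", "d": "MKT", "e": "MKTAY"}
tree = ((("a", "b"), "c"), ("d", "e"))
alignment = regressive(tree, seqs, n=2)
```

In this sketch the aligner is only ever called on at most about `n` sequences, whatever the total number of sequences, which is the property that lets an arbitrary aligner be plugged in; swapping `toy_aligner` for a wrapper around a real tool would be the natural next step.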

**Fig. 1: Regressive algorithm overview.**

**Fig. 2: Relative performances of alternative MSA algorithm combinations.**

**Fig. 3: CPU requirements of the regressive algorithm on HomFam datasets containing more than 10,000 sequences.**

Data availability

All data, analyses and results are available from Zenodo (https://doi.org/10.5281/zenodo.3271452).

Code availability

The regressive alignment algorithm has been implemented in T-Coffee and is available at the T-Coffee website (http://www.tcoffee.org) and on GitHub (https://github.com/cbcrg/tcoffee). A GitHub repository containing the Nextflow workflow²⁵ and Jupyter notebooks²⁶ to replicate the analysis is available at https://github.com/cbcrg/dpa-analysis (release v.1.2).

References

1. Uguzzoni, G. et al. Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis. Proc. Natl Acad. Sci. USA 114, E2662–E2671 (2017).
2. Mirarab, S., Bayzid, M. S., Boussau, B. & Warnow, T. Statistical binning enables an accurate coalescent-based estimation of the avian tree. Science 346, 1250463 (2014).
3. Wang, L. & Jiang, T. On the complexity of multiple sequence alignment. J. Comput. Biol. 1, 337–348 (1994).
4. Hogeweg, P. & Hesper, B. The alignment of sets of sequences and the construction of phylogenetic trees. An integrated method. J. Mol. Evol. 20, 175–186 (1984).
5. Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).
6. Lewin, H. A. et al. Earth BioGenome Project: sequencing life for the future of life. Proc. Natl Acad. Sci. USA 115, 4325–4333 (2018).
7. Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002).
8. Chatzou, M. et al. Multiple sequence alignment modeling: methods and applications. Brief. Bioinform. 17, 1009–1023 (2015).
9. Breen, M. S., Kemena, C., Vlasov, P. K., Notredame, C. & Kondrashov, F. A. Epistasis as the primary factor in molecular evolution. Nature 490, 535–538 (2012).
10. Liu, K., Raghavan, S., Nelesen, S., Linder, C. R. & Warnow, T. Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science 324, 1561–1564 (2009).
11. Mirarab, S. et al. PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J. Comput. Biol. 22, 377–386 (2015).
12. Collins, K. & Warnow, T. PASTA for proteins. Bioinformatics 34, 3939–3941 (2018).
13. Nguyen, N.-P. D., Mirarab, S., Kumar, K. & Warnow, T. Ultra-large alignments using phylogeny-aware profiles. Genome Biol. 16, 124 (2015).
14. Yamada, K. D., Tomii, K. & Katoh, K. Application of the MAFFT sequence alignment program to large data-reexamination of the usefulness of chained guide trees. Bioinformatics 32, 3246–3251 (2016).
15. Minh, B. Q., Klaere, S. & von Haeseler, A. Phylogenetic diversity within seconds. Syst. Biol. 55, 769–773 (2006).
16. Stebbings, L. A. & Mizuguchi, K. HOMSTRAD: recent developments of the homologous protein structure alignment database. Nucleic Acids Res. 32, D203–D207 (2004).
17. Blackshields, G., Sievers, F., Shi, W., Wilm, A. & Higgins, D. G. Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol. Biol. 5, 21 (2010).
18. Katoh, K. & Toh, H. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23, 372–374 (2007).
19. Katoh, K., Kuma, K.-I., Toh, H. & Miyata, T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33, 511–518 (2005).
20. Greenacre, M. J. Biplots in Practice (Fundacion BBVA, 2010).
21. Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230 (2013).
22. Herlihy, M. & Shavit, N. The Art of Multiprocessor Programming 1st edn (Morgan Kaufmann, 2012).
23. Kahn, S. D. On the future of genomic data. Science 331, 728–729 (2011).
24. Thompson, J. D., Plewniak, F. & Poch, O. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15, 87–88 (1999).
25. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
26. Perkel, J. M. Why Jupyter is data scientists’ computational notebook of choice. Nature 563, 145–146 (2018).

Acknowledgements

We thank G. Riddihough for revisions and comments on the manuscript and O. Gascuel for suggestions. This project was supported by the Centre for Genomic Regulation, the Spanish Plan Nacional, the Spanish Ministry of Economy and Competitiveness, ‘Centro de Excelencia Severo Ochoa’ (E.G., P.T., C.M., I.E., L.M., A.B., F.K., E.F. and C.N.) and an ERC Consolidator Grant from the European Commission, grant agreement no. 771209 ChrFL (F.K.).

Author information

Authors and Affiliations

Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Barcelona, Spain
Edgar Garriga, Paolo Di Tommaso, Cedrik Magis, Ionas Erb, Leila Mansouri, Athanasios Baltzis, Evan Floden & Cedric Notredame
Institut de Biologia Evolutiva (UPF-CSIC), Universitat Pompeu Fabra, Barcelona, Spain
Hafid Laayouni
Bioinformatics Studies, ESCI-UPF, Barcelona, Spain
Hafid Laayouni
Institute of Science and Technology, Klosterneuburg, Austria
Fyodor Kondrashov
Universitat Pompeu Fabra (UPF), Barcelona, Spain
Cedric Notredame

Authors

Edgar Garriga
Paolo Di Tommaso
Cedrik Magis
Ionas Erb
Leila Mansouri
Athanasios Baltzis
Hafid Laayouni
Fyodor Kondrashov
Evan Floden
Cedric Notredame

Contributions

C.N. designed and implemented the algorithm. E.F., E.G., L.M., A.B. and P.D.T. designed the validation procedure and carried out the validation. I.E. performed statistical and CCA analyses. E.F., C.N., E.G., C.M., L.M., A.B., P.D.T., I.E., F.K. and H.L. wrote and edited the manuscript.

Corresponding authors

Correspondence to Evan Floden or Cedric Notredame.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Effect of size of N on HomFam average TC score and relative CPU time.

Summary of Total Column (TC) score and CPU usage values collected over HomFam datasets for various values of N using either (A) ClustalO with mBed trees or (B) Mafft-FFTNS1 with PartTree trees. Four datasets (mmp, kunitz, hormone_rec and peroxidase) were excluded from the analysis because of incomplete CPU time reports across the runs. For each combination of alignment method and tree method, n = 90 independent MSA samples.

Supplementary Figure 2 Relative performances of alternative MSA algorithm combinations.

(A) The relative accuracy is defined as the difference between the TC score measured on the projection of embedded sequences and the TC score measured on the direct alignment of these same sequences with the considered method. The three alignment protocols all use PartTree guide trees combined with the following aligners: Fftns1 in non-regressive mode (red), Fftns1 in regressive mode (green) and Gins1 in regressive mode (blue). The envelope is the standard deviation measured on the averaged values. (B) Similar comparison between the regressive deployment of Sparsecore using an mBed guide tree (blue) and the default, non-regressive deployment of this same aligner (red). (C) Similar comparison on UPP using an mBed guide tree for the regressive deployment (blue) and UPP default mode for the non-regressive deployment (red). (D) Similar display for Fftns1 using an mBed guide tree for the regressive (blue) and non-regressive (red) deployments. (E) Similar analysis using ClustalO as an aligner and PartTree guide trees in regressive (blue) and non-regressive (red) modes. For each combination of alignment method and tree method, n = 94 independent MSA samples.
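
The relative accuracy defined in panel (A) can be made concrete with a short Python sketch. It assumes the simplest common reading of the Total Column score, namely the fraction of reference columns that appear unchanged in the test alignment once the test alignment is restricted to the reference sequences; benchmark tools often add refinements (core blocks, minimum column occupancy) that are not modelled here. The function names `columns`, `tc_score` and `relative_accuracy` and the example sequences are hypothetical.

```python
def columns(msa):
    """Turn an MSA (dict name -> gapped row) into a list of columns.

    Each column is a tuple holding, per sequence (in sorted name order),
    the index of the residue it contains, or None for a gap.
    """
    names = sorted(msa)
    counters = {name: 0 for name in names}
    width = len(msa[names[0]])
    cols = []
    for k in range(width):
        col = []
        for name in names:
            if msa[name][k] == "-":
                col.append(None)
            else:
                col.append(counters[name])
                counters[name] += 1
        cols.append(tuple(col))
    return cols

def tc_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test MSA."""
    # Project the test alignment onto the reference sequences only.
    projected = {name: test[name] for name in reference}
    ref_cols = [c for c in columns(reference) if any(x is not None for x in c)]
    test_cols = set(columns(projected))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)

def relative_accuracy(embedded, direct, reference):
    """TC of the projected large alignment minus TC of the direct alignment."""
    return tc_score(embedded, reference) - tc_score(direct, reference)

reference = {"s1": "AC-GT", "s2": "ACTGT"}
# Large alignment in which the reference sequences are embedded with an extra one.
embedded = {"s1": "AC-GT-", "s2": "ACTGT-", "x": "ACTGTA"}
# Direct alignment of the reference sequences alone.
direct = {"s1": "ACGT-", "s2": "ACTGT"}

print(tc_score(embedded, reference))   # all 5 reference columns recovered
print(tc_score(direct, reference))     # 2 of 5 reference columns recovered
print(relative_accuracy(embedded, direct, reference))
```

A positive value means that the reference sequences are aligned better when embedded in the large dataset than when aligned on their own; a negative value means the opposite.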

Supplementary information

Supplementary Materials

Supplementary Figs. 1 and 2, Notes 1 and 2 and Tables 1–7.

Reporting Summary

About this article

Cite this article

Garriga, E., Di Tommaso, P., Magis, C. et al. Large multiple sequence alignments with a root-to-leaf regressive method. Nat Biotechnol 37, 1466–1470 (2019). https://doi.org/10.1038/s41587-019-0333-6

Received: 26 February 2019
Accepted: 29 October 2019
Published: 02 December 2019
Issue Date: December 2019
DOI: https://doi.org/10.1038/s41587-019-0333-6
