Protein sequence alignment is a key component of most bioinformatics pipelines for studying the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. We show that, once trained, DEDAL improves alignment correctness on remote homologs by up to two- or threefold over existing methods and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks that rely on sequence alignment in structural and functional genomics.
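For orientation, pairwise local alignment can be sketched with the classical Smith-Waterman dynamic program. This is a minimal illustration of what an alignment score is, not DEDAL itself: the function name and the fixed match, mismatch and gap values are hypothetical placeholders, whereas a learned model would supply such scores from data.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Classical Smith-Waterman with a linear gap penalty. The scoring
    values are illustrative assumptions, not parameters from the paper.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Identical four-residue sequences: four matches at +2 each.
    print(smith_waterman_score("ACDE", "ACDE"))
```

With these placeholder scores, unrelated sequences such as "AAA" and "CCC" score 0, because a local alignment is never forced to include mismatches.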
The UniRef50 dataset is freely available under the Creative Commons Attribution (CC BY 4.0) license from https://www.uniprot.org. We used the 2018-03 release available at ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2018_03/uniref/. Pfam is freely available under the Creative Commons Zero (‘CC0’) license from https://pfam.xfam.org. We used the Pfam-A seed dataset available at http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0/Pfam-A.seed.gz. Source data are provided with this paper.
The code used in this study is freely available under an Apache v.2.0 license at https://github.com/google-research/google-research/tree/master/dedal. All analysis was carried out in Python v.3.6, with the following libraries: absl-py 0.7, gin-config 0.4, numpy 1.18.4, tensorflow 2.3, tensorflow_datasets 3.0, tensorflow_probability 0.1 and tf-models-official 2.6.
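To reproduce this environment, the listed versions could be turned into pip requirement specifiers, as in the sketch below. The version strings are copied from the statement above. Everything else is an assumption made for illustration: the helper names are hypothetical, and a version with fewer than three components is treated as a release series and pinned with a trailing wildcard.

```python
# Library versions exactly as listed in the code availability statement.
LISTED_VERSIONS = [
    ("absl-py", "0.7"),
    ("gin-config", "0.4"),
    ("numpy", "1.18.4"),
    ("tensorflow", "2.3"),
    ("tensorflow_datasets", "3.0"),
    ("tensorflow_probability", "0.1"),
    ("tf-models-official", "2.6"),
]


def to_requirement(name, version):
    """Render one pip requirement specifier.

    Assumption: a version with fewer than three components names a
    release series, so it is pinned with a wildcard (e.g. 2.3.*).
    """
    if version.count(".") < 2:
        return "{}=={}.*".format(name, version)
    return "{}=={}".format(name, version)


def requirement_lines(listed=LISTED_VERSIONS):
    """Return one requirement line per listed library, in listed order."""
    return [to_requirement(name, version) for name, version in listed]


if __name__ == "__main__":
    # Print lines suitable for a requirements file.
    print("\n".join(requirement_lines()))
```

Whether these specifiers resolve together on a given platform is not guaranteed; the repository linked above remains the authoritative source for installation instructions.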
We thank D. Belanger, D. Dohan and A. Gane for preprocessing the data files and implementing the input pipelines for UniRef50, and M. Bileschi for providing feedback to improve the manuscript and guidance for navigating the Pfam database.
This study was funded by Google LLC. F.L.-L., Q.B., M.B., O.T. and J.-P.V. were employees of Google LLC and owned Alphabet stock as part of the standard compensation package during the making of this study.
Peer review information
Nature Methods thanks Sean Eddy and Rama Ranganathan for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team. Peer reviewer reports are available.
Cite this article
Llinares-López, F., Berthet, Q., Blondel, M. et al. Deep embedding and alignment of protein sequences. Nat Methods 20, 104–111 (2023). https://doi.org/10.1038/s41592-022-01700-2