Deep embedding and alignment of protein sequences

Abstract

Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. We show that, once trained, DEDAL improves alignment correctness on remote homologs by up to two- or threefold over existing methods and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.
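To make the idea of a differentiable alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes that substitution scores are plain dot products between per-residue embeddings, that gaps carry a single linear penalty, and that the maximum in the Smith-Waterman recursion is replaced by a temperature-scaled log-sum-exp. The function names and parameters (`soft_max`, `smooth_local_alignment_score`, `gap`, `temperature`) are hypothetical.

```python
import numpy as np


def soft_max(values, temperature):
    """Smooth maximum: temperature * logsumexp(values / temperature)."""
    v = np.asarray(values, dtype=float) / temperature
    m = v.max()  # subtract the max for numerical stability
    return temperature * (m + np.log(np.exp(v - m).sum()))


def smooth_local_alignment_score(emb_x, emb_y, gap=1.0, temperature=1.0):
    """Smoothed Smith-Waterman score between two embedded sequences.

    emb_x has shape (n, d) and emb_y has shape (m, d). The dot-product
    scoring and the linear gap penalty are assumptions of this sketch.
    """
    sub = emb_x @ emb_y.T  # (n, m) substitution scores
    n, m = sub.shape
    h = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i, j] = soft_max(
                [
                    0.0,                                  # start a new local alignment
                    h[i - 1, j - 1] + sub[i - 1, j - 1],  # match or mismatch
                    h[i - 1, j] - gap,                    # gap in y
                    h[i, j - 1] - gap,                    # gap in x
                ],
                temperature,
            )
    # Smooth maximum over all possible end positions of the local alignment.
    return soft_max(h[1:, 1:].ravel(), temperature)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))  # toy embeddings for a length-5 sequence
    y = rng.normal(size=(7, 8))  # toy embeddings for a length-7 sequence
    print(smooth_local_alignment_score(x, y, gap=1.0, temperature=0.5))
```

As the temperature approaches zero, the smooth maximum approaches the ordinary maximum, so the score approaches the classical Smith-Waterman score; at positive temperatures the score is a smooth function of the embeddings, which is what allows gradient-based training in this kind of setup.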

Fig. 1: Overview of DEDAL.
Fig. 2: Alignment performance.
Fig. 3: Example of pairwise alignment of two protein domain sequences from Pfam-A seed.
Fig. 4: Homology detection performance.

Data availability

The Uniref50 dataset is freely available under the Creative Commons Attribution (CC BY 4.0) License from https://www.uniprot.org. We used the 2018-03 release available at ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2018_03/uniref/. Pfam is freely available under the Creative Commons Zero (‘CC0’) licence from https://pfam.xfam.org. We used the Pfam-A seed dataset available at http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0/Pfam-A.seed.gz. Source data are provided with this paper.
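As an illustration of how pairwise alignments might be read out of the Pfam-A seed file, here is a minimal sketch that assumes the file is in Stockholm format (families terminated by `//`, annotation lines starting with `#`, gaps written as `.` or `-`). The helper names `read_stockholm_families` and `aligned_pairs` are hypothetical, and this excerpt does not state how the authors derived their training pairs.

```python
def read_stockholm_families(lines):
    """Yield one {sequence_name: aligned_row} dict per Stockholm family."""
    family = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            # End of the current family.
            if family:
                yield family
            family = {}
        elif line and not line.startswith("#"):
            # Sequence line: "<name> <aligned row>"; rows may be interleaved,
            # so repeated names are concatenated.
            name, aligned = line.split(None, 1)
            family[name] = family.get(name, "") + aligned.strip()
    if family:
        # Tolerate a final family without a terminating "//".
        yield family


def aligned_pairs(row_a, row_b, gap_chars=".-"):
    """Return (i, j) residue index pairs that share a column in both rows."""
    pairs = []
    i = j = 0
    for a, b in zip(row_a, row_b):
        a_is_residue = a not in gap_chars
        b_is_residue = b not in gap_chars
        if a_is_residue and b_is_residue:
            pairs.append((i, j))
        i += a_is_residue
        j += b_is_residue
    return pairs


if __name__ == "__main__":
    toy = [
        "# STOCKHOLM 1.0",
        "#=GF ID   Toy",
        "seqA/1-4   AC.GT",
        "seqB/1-4   A-TGT",
        "//",
    ]
    for fam in read_stockholm_families(toy):
        print(aligned_pairs(fam["seqA/1-4"], fam["seqB/1-4"]))
```

For the real download, the lines could come from something like `gzip.open("Pfam-A.seed.gz", "rt", encoding="latin-1")`; the local file name and the encoding are assumptions here and should be checked against the actual file.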

Code availability

The code used in this study is freely available under an Apache v.2.0 license at https://github.com/google-research/google-research/tree/master/dedal. All analysis was carried out in Python v.3.6, with the following libraries: absl-py 0.7, gin-config 0.4, numpy 1.18.4, tensorflow 2.3, tensorflow_datasets 3.0, tensorflow_probability 0.1 and tf-models-official 2.6.

Acknowledgements

We thank D. Belanger, D. Dohan and A. Gane for preprocessing the data files and implementing the input pipelines for UniRef50 and M. Bileschi for providing feedback to improve the manuscript and guidance for navigating the Pfam database.

Author information

Contributions

F.L.-L. contributed to the development of the method, implemented most of the code, designed and ran most of the experiments and drafted the manuscript. Q.B. contributed to the development of the method, to the code and to the experiments. M.B. contributed to the development of the method and to the code. O.T. designed and implemented large parts of the code and contributed to the experiments. J.-P.V. designed the initial project, contributed to the method development and drafted the manuscript. All authors provided regular feedback on all aspects of the work, reviewed code and contributed to the writing of the final manuscript.

Corresponding author

Correspondence to Jean-Philippe Vert.

Ethics declarations

Competing interests

This study was funded by Google LLC. F.L.-L., Q.B., M.B., O.T. and J.-P.V. were employees of Google LLC and owned Alphabet stock as part of the standard compensation package during the making of this study.

Peer review

Peer review information

Nature Methods thanks Sean Eddy and Rama Ranganathan for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information.

Reporting Summary.

Peer Review File.

Supplemental Data

Statistical Source Data for Supplementary Figs. 2, 3, 5, 6 and 8–10.

Source data

Source Data Fig. 2

Statistical Source Data for Fig. 2.

Source Data Fig. 4

Statistical Source Data for Fig. 4.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Llinares-López, F., Berthet, Q., Blondel, M. et al. Deep embedding and alignment of protein sequences. Nat Methods 20, 104–111 (2023). https://doi.org/10.1038/s41592-022-01700-2

  • DOI: https://doi.org/10.1038/s41592-022-01700-2
