Deep embedding and alignment of protein sequences

Llinares-López, Felipe; Berthet, Quentin; Blondel, Mathieu; Teboul, Olivier; Vert, Jean-Philippe

doi:10.1038/s41592-022-01700-2

Article
Published: 15 December 2022

Nature Methods volume 20, pages 104–111 (2023)

Abstract

Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. We show that, once trained, DEDAL improves alignment correctness on remote homologs by up to two- or threefold over existing methods and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements in many downstream tasks that rely on sequence alignment in structural and functional genomics.
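The abstract does not spell out how an alignment can be made differentiable. One common approach, shown here only as a minimal sketch and not as DEDAL's implementation, is to take the classical Smith-Waterman local alignment recursion and replace each maximum with a temperature-scaled log-sum-exp, so that the score varies smoothly with the scoring parameters. The function names and the `match`, `mismatch`, `gap` and `temperature` values below are hypothetical placeholders chosen for illustration; a learned model would instead produce such scores from data.

```python
import math


def smooth_max(values, temperature):
    """Return max(values) if temperature == 0, else a smooth upper bound on it.

    The smooth version is temperature * log(sum(exp(v / temperature))),
    computed in a numerically stable way by factoring out the maximum.
    """
    top = max(values)
    if temperature == 0.0:
        return top
    total = sum(math.exp((v - top) / temperature) for v in values)
    return top + temperature * math.log(total)


def local_alignment_score(x, y, match=2.0, mismatch=-1.0, gap=-2.0,
                          temperature=0.0):
    """Smith-Waterman local alignment score with a linear gap penalty.

    temperature == 0 gives the classical hard-max recursion; temperature > 0
    replaces every max by smooth_max. All scoring values are illustrative
    assumptions, not parameters taken from the article.
    """
    n, m = len(x), len(y)
    # H[i][j] scores the best local alignment ending at x[i-1], y[j-1].
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    cells = [0.0]  # the empty alignment always scores 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = smooth_max(
                [0.0,                    # start a new local alignment
                 H[i - 1][j - 1] + s,    # align x[i-1] with y[j-1]
                 H[i - 1][j] + gap,      # gap in y
                 H[i][j - 1] + gap],     # gap in x
                temperature,
            )
            cells.append(H[i][j])
    # The local alignment may end in any cell.
    return smooth_max(cells, temperature)
```

With `temperature=0.0` the function returns the usual Smith-Waterman score. With a small positive temperature it returns a slightly larger, smooth approximation, which is what makes gradient-based training of the scoring parameters possible in principle.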

**Fig. 3: Example of pairwise alignment of two protein domain sequences from Pfam-A seed.**

**Fig. 4: Homology detection performance.**

Data availability

The UniRef50 dataset is freely available under the Creative Commons Attribution (CC BY 4.0) License from https://www.uniprot.org. We used the 2018-03 release available at ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2018_03/uniref/. Pfam is freely available under the Creative Commons Zero (‘CC0’) licence from https://pfam.xfam.org. We used the Pfam-A seed dataset available at http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0/Pfam-A.seed.gz. Source data are provided with this paper.
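Pfam seed alignments are distributed in Stockholm format, so a downloaded copy of Pfam-A.seed.gz can be read with a few lines of standard-library Python. The following is a minimal sketch rather than the input pipeline used in the study: the helper names `read_stockholm` and `open_pfam_seed` are hypothetical, the local file path is whatever the reader chose when downloading, and the `latin-1` encoding is an assumption made so that decoding never fails on non-ASCII annotation text.

```python
import gzip
import io


def read_stockholm(handle):
    """Yield (accession, rows) for each alignment in a Stockholm text stream.

    rows maps each sequence name (for example "name/start-end") to its
    aligned sequence, gaps included. Only the "#=GF AC" annotation is kept;
    all other markup lines are skipped.
    """
    accession, rows = None, {}
    for raw in handle:
        line = raw.rstrip()
        if line.startswith("#=GF AC"):
            accession = line.split()[2]
        elif line == "//":
            # "//" terminates one alignment record.
            yield accession, rows
            accession, rows = None, {}
        elif line and not line.startswith("#"):
            name, aligned = line.split()[:2]
            # Concatenate in case an alignment is written in several blocks.
            rows[name] = rows.get(name, "") + aligned


def open_pfam_seed(path):
    """Open a gzip-compressed Stockholm file as text (encoding is an assumption)."""
    return gzip.open(path, "rt", encoding="latin-1")


# Tiny made-up record used only to illustrate the format.
SAMPLE = """# STOCKHOLM 1.0
#=GF ID   Example
#=GF AC   PF00000.1
seqA/1-5   AC.GT
seqB/3-7   ACDGT
//
"""

families = list(read_stockholm(io.StringIO(SAMPLE)))
```

On a real download, `read_stockholm(open_pfam_seed(path))` would iterate over families one at a time without loading the whole file into memory.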

Code availability

The code used in this study is freely available under an Apache v.2.0 license at https://github.com/google-research/google-research/tree/master/dedal. All analysis was carried out in Python v.3.6, with the following libraries: absl-py 0.7, gin-config 0.4, numpy 1.18.4, tensorflow 2.3, tensorflow_datasets 3.0, tensorflow_probability 0.1 and tf-models-official 2.6.

Acknowledgements

We thank D. Belanger, D. Dohan and A. Gane for preprocessing the data files and implementing the input pipelines for UniRef50 and M. Bileschi for providing feedback to improve the manuscript and guidance for navigating the Pfam database.

Author information

Authors and Affiliations

Brain Team, Google Research, Paris, France
Felipe Llinares-López, Quentin Berthet, Mathieu Blondel, Olivier Teboul & Jean-Philippe Vert

Contributions

F.L.-L. contributed to the development of the method, implemented most of the code, designed and ran most of the experiments and drafted the manuscript. Q.B. contributed to the development of the method, to the code and to the experiments. M.B. contributed to the development of the method and to the code. O.T. designed and implemented large parts of the code and contributed to the experiments. J.-P.V. designed the initial project, contributed to the method development and drafted the manuscript. All authors provided regular feedback on all aspects of the work, reviewed code and contributed to the writing of the final manuscript.

Corresponding author

Correspondence to Jean-Philippe Vert.

Ethics declarations

Competing interests

This study was funded by Google LLC. F.L.-L., Q.B., M.B., O.T. and J.-P.V. were employees of Google LLC and owned Alphabet stock as part of the standard compensation package during the making of this study.

Peer review

Peer review information

Nature Methods thanks Sean Eddy and Rama Ranganathan for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information.

Reporting Summary.

Peer Review File.

Supplemental Data

Statistical Source Data for Supplementary Figs. 2, 3, 5, 6 and 8–10.

Source data

Source Data Fig. 2

Statistical Source Data for Fig. 2.

Source Data Fig. 4

Statistical Source Data for Fig. 4.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Llinares-López, F., Berthet, Q., Blondel, M. et al. Deep embedding and alignment of protein sequences. Nat Methods 20, 104–111 (2023). https://doi.org/10.1038/s41592-022-01700-2

Received: 30 November 2021
Accepted: 24 October 2022
Published: 15 December 2022
Issue Date: January 2023
DOI: https://doi.org/10.1038/s41592-022-01700-2
