Expanding functional protein sequence spaces using generative adversarial networks

Repecka, Donatas; Jauniskis, Vykintas; Karpus, Laurynas; Rembeza, Elzbieta; Rokaitis, Irmantas; Zrimec, Jan; Poviloniene, Simona; Laurynenas, Audrius; Viknander, Sandra; Abuajwa, Wissam; Savolainen, Otto; Meskys, Rolandas; Engqvist, Martin K. M.; Zelezniak, Aleksej

doi:10.1038/s42256-021-00310-5

Article
Published: 04 March 2021

Nature Machine Intelligence volume 3, pages 324–333 (2021)

A preprint version of the article is available at bioRxiv.

Abstract

De novo protein design for catalysis of any desired chemical reaction is a long-standing goal in protein engineering because of the broad spectrum of technological, scientific and medical applications. However, mapping protein sequence to protein function is currently neither computationally nor experimentally tangible. Here, we develop ProteinGAN, a self-attention-based variant of the generative adversarial network that is able to ‘learn’ natural protein sequence diversity and enables the generation of functional protein sequences. ProteinGAN learns the evolutionary relationships of protein sequences directly from the complex multidimensional amino-acid sequence space and creates new, highly diverse sequence variants with natural-like physical properties. Using malate dehydrogenase (MDH) as a template enzyme, we show that 24% (13 out of 55 tested) of the ProteinGAN-generated and experimentally tested sequences are soluble and display MDH catalytic activity in the tested conditions in vitro, including a highly mutated variant of 106 amino-acid substitutions. ProteinGAN therefore demonstrates the potential of artificial intelligence to rapidly generate highly diverse functional proteins within the allowed biological constraints of the sequence space.
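As a quick check on the reported success rate, the sketch below recomputes the fraction of active variants from the counts given in the abstract (13 of 55). The Wilson score interval is our own illustration of the uncertainty for a sample of this size; it is not an analysis reported by the authors, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Counts taken from the abstract: 13 active variants out of 55 tested.
active, tested = 13, 55
fraction = active / tested
low, high = wilson_interval(active, tested)

print(f"active fraction: {fraction:.1%}")  # rounds to the 24% quoted in the abstract
print(f"approximate 95% Wilson interval: {low:.1%} to {high:.1%}")
```

With only 55 tested sequences the interval is wide, so the 24% figure is best read as an estimate for the tested conditions rather than a precise rate.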

**Fig. 1: ProteinGAN learns the intrinsic relationships between natural protein sequences.**

**Fig. 2: ProteinGAN expands the functional MDH sequence space.**

Data availability

All training data files, including ProteinGAN running examples, have been deposited to the Zenodo repository and are available at https://doi.org/10.5281/zenodo.4068040. Source data are provided with this paper.

Code availability

The implementation of ProteinGAN can be accessed at https://github.com/Biomatter-Designs/ProteinGAN.

Acknowledgements

We thank G. Stonyte, J. Nainys and C. Correia-Melo for comments on the manuscript. We also thank A. Repecka and L. Petkevicius for their valuable and constructive suggestions for improving the model. L.K. and R.M. were supported by the Agency for Science, Innovation and Technology (Lithuania) grant no. 31V-59/(1.78)SU-1687. J.Z. and A.Z. were supported by SciLifeLab fellow programme funding. S.V. was supported by VR starting grant no. 2019-05356. The computations were enabled with resources provided by the Swedish National Infrastructure for Computing (SNIC) at C3SE, partially funded by the Swedish Research Council through grant agreement no. 2018-05973. M. Öhman and T. Svedberg at C3SE are acknowledged for technical assistance in making the code run on Vera C3SE resources.

Author information

These authors contributed equally: Donatas Repecka, Vykintas Jauniskis, Laurynas Karpus.

Authors and Affiliations

Biomatter Designs, Vilnius, Lithuania
Donatas Repecka, Vykintas Jauniskis, Laurynas Karpus, Irmantas Rokaitis & Audrius Laurynenas
Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
Vykintas Jauniskis, Elzbieta Rembeza, Jan Zrimec, Sandra Viknander, Martin K. M. Engqvist & Aleksej Zelezniak
Institute of Biochemistry, Life Sciences Center, Vilnius University, Vilnius, Lithuania
Simona Poviloniene, Audrius Laurynenas & Rolandas Meskys
Chalmers Mass Spectrometry Infrastructure, Chalmers University of Technology, Gothenburg, Sweden
Wissam Abuajwa & Otto Savolainen
Science for Life Laboratory, Stockholm, Sweden
Aleksej Zelezniak

Contributions

D.R. implemented the method, contributed principal analysis and wrote the first draft. V.J. contributed principal analysis, designed experiments and wrote the first draft. L.K. contributed principal analysis, designed experiments and wrote the first draft. E.R. performed laboratory experiments. I.R. contributed principal analysis and wrote the first draft. J.Z. contributed principal analysis, performed laboratory experiments and wrote the first draft. S.P. performed laboratory experiments. A.L. contributed principal analysis. S.V. contributed principal analysis. W.A. performed laboratory experiments. O.S. contributed principal analysis and supervised the mass spectrometry work. R.M. supervised the study and designed the experiments. M.K.M.E. supervised the study, designed experiments, contributed principal analysis and wrote the manuscript. A.Z. supervised the study, designed experiments, contributed principal analysis, financed the experiments and wrote the manuscript. All authors contributed to the writing of the paper and read the final manuscript.

Corresponding author

Correspondence to Aleksej Zelezniak.

Ethics declarations

Competing interests

L.K., V.J., D.R., I.R. and R.M. are shareholders of the company Biomatter Designs. The company has submitted a patent application for the technology described in the Article. The other authors declare no competing interests.

Additional information

Peer review information: Nature Machine Intelligence thanks Frances Arnold and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Methods, Figs. 1–25 and Tables 1–7.

Supplementary Table 2

Supplementary Table 2. Generated sequences and their identities to the closest real sequence.

Source data

Source Data Fig. 1

SDS–PAGE gels of purified proteins. a, Batch 1, protocol 1 (Methods). b, Batches 2 and 3, protocol 1 (Methods). c, Batch 1, protocol 2. T, total lysate; S, soluble lysate; E, elution after affinity column use. d, Batch 2, protocol 2. e, Batch 3, protocol 2 (Methods). The results are summarized in Supplementary Table 3.

About this article

Cite this article

Repecka, D., Jauniskis, V., Karpus, L. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat Mach Intell 3, 324–333 (2021). https://doi.org/10.1038/s42256-021-00310-5

Received: 26 June 2020
Accepted: 01 February 2021
Published: 04 March 2021
Issue Date: April 2021
DOI: https://doi.org/10.1038/s42256-021-00310-5
