Skip to main content

Thank you for visiting You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Expanding functional protein sequence spaces using generative adversarial networks

A preprint version of the article is available at bioRxiv.


De novo protein design for catalysis of any desired chemical reaction is a long-standing goal in protein engineering because of the broad spectrum of technological, scientific and medical applications. However, mapping protein sequence to protein function is currently neither computationally nor experimentally tangible. Here, we develop ProteinGAN, a self-attention-based variant of the generative adversarial network that is able to ‘learn’ natural protein sequence diversity and enables the generation of functional protein sequences. ProteinGAN learns the evolutionary relationships of protein sequences directly from the complex multidimensional amino-acid sequence space and creates new, highly diverse sequence variants with natural-like physical properties. Using malate dehydrogenase (MDH) as a template enzyme, we show that 24% (13 out of 55 tested) of the ProteinGAN-generated and experimentally tested sequences are soluble and display MDH catalytic activity in the tested conditions in vitro, including a highly mutated variant of 106 amino-acid substitutions. ProteinGAN therefore demonstrates the potential of artificial intelligence to rapidly generate highly diverse functional proteins within the allowed biological constraints of the sequence space.

This is a preview of subscription content, access via your institution

Relevant articles

Open Access articles citing this article.

Access options

Rent or buy this article

Get just this article for as long as you need it


Prices may be subject to local taxes which are calculated during checkout

Fig. 1: ProteinGAN learns the intrinsic relationships between natural protein sequences.
Fig. 2: ProteinGAN expands the functional MDH sequence space.

Data availability

All training data files, including ProteinGAN running examples, have been deposited to the Zenodo repository and are available at Source data are provided with this paper.

Code availability

The implementation of ProteinGAN can be accessed at


  1. Romero, P. A. & Arnold, F. H. Exploring protein fitness landscapes by directed evolution. Nat. Rev. Mol. Cell Biol. 10, 866–876 (2009).

    Article  Google Scholar 

  2. Keefe, A. D. & Szostak, J. W. Functional proteins from a random-sequence library. Nature 410, 715–718 (2001).

    Article  Google Scholar 

  3. Taverna, D. M. & Goldstein, R. A. Why are proteins marginally stable? Proteins 46, 105–109 (2002).

    Article  Google Scholar 

  4. Axe, D. D. Estimating the prevalence of protein sequences adopting functional enzyme folds. J. Mol. Biol. 341, 1295–1315 (2004).

    Article  Google Scholar 

  5. Hansson, L. O., Bolton-Grob, R., Massoud, T. & Mannervik, B. Evolution of differential substrate specificities in Mu class glutathione transferases probed by DNA shuffling. J. Mol. Biol. 287, 265–276 (1999).

    Article  Google Scholar 

  6. Crameri, A., Raillard, S. A., Bermudez, E. & Stemmer, W. P. DNA shuffling of a family of genes from diverse species accelerates directed evolution. Nature 391, 288–291 (1998).

    Article  Google Scholar 

  7. Bloom, J. D., Labthavikul, S. T., Otey, C. R. & Arnold, F. H. Protein stability promotes evolvability. Proc. Natl Acad. Sci. USA 103, 5869–5874 (2006).

    Article  Google Scholar 

  8. Guo, H. H., Choe, J. & Loeb, L. A. Protein tolerance to random amino acid change. Proc. Natl Acad. Sci. USA 101, 9205–9210 (2004).

    Article  Google Scholar 

  9. Rennell, D., Bouvier, S. E., Hardy, L. W. & Poteete, A. R. Systematic mutation of bacteriophage T4 lysozyme. J. Mol. Biol. 222, 67–88 (1991).

    Article  Google Scholar 

  10. Axe, D. D., Foster, N. W. & Fersht, A. R. A search for single substitutions that eliminate enzymatic function in a bacterial ribonuclease. Biochemistry. 37, 7157–7166 (1998).

    Article  Google Scholar 

  11. Shafikhani, S., Siegel, R. A., Ferrari, E. & Schellenberger, V. Generation of large libraries of random mutants in Bacillus subtilis by PCR-based plasmid multimerization. Biotechniques 23, 304–310 (1997).

    Article  Google Scholar 

  12. Rockah-Shmuel, L., Tóth-Petróczy, Á. & Tawfik, D. S. Systematic mapping of protein mutational space by prolonged drift reveals the deleterious effects of seemingly neutral mutations. PLoS Comput. Biol. 11, e1004421 (2015).

    Article  Google Scholar 

  13. Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).

    Article  Google Scholar 

  14. Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).

    Article  Google Scholar 

  15. AlQuraishi, M. End-to-end differentiable learning of protein structure. Cell Syst. 8, 292–301 (2019).

    Article  Google Scholar 

  16. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Preprint at bioRxiv (2020).

  17. Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).

    Article  Google Scholar 

  18. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-only deep representation learning. Preprint at bioRxiv (2019).

  19. Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proc. Natl Acad. Sci. USA 110, E193–E201 (2013).

    Article  MathSciNet  MATH  Google Scholar 

  20. Eddy, S. R. Profile hidden Markov models. Bioinformatics 14, 755–763 (1998).

    Article  Google Scholar 

  21. Söding, J. Protein homology detection by HMM–HMM comparison. Bioinformatics 21, 951–960 (2005).

    Article  Google Scholar 

  22. Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Preprint at bioRxiv (2020).

  23. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

    Article  Google Scholar 

  24. Balakrishnan, S., Kamisetty, H., Carbonell, J. G., Lee, S.-I. & Langmead, C. J. Learning generative models for protein fold families. Proteins 79, 1061–1078 (2011).

    Article  Google Scholar 

  25. Boomsma, W. et al. A generative, probabilistic model of local protein structure. Proc. Natl Acad. Sci. USA 105, 8932–8937 (2008).

    Article  Google Scholar 

  26. Krogh, A., Brown, M., Mian, I. S., Sjölander, K. & Haussler, D. Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. 235, 1501–1531 (1994).

    Article  Google Scholar 

  27. Tubiana, J., Cocco, S. & Monasson, R. Learning protein constitutive motifs from sequence data. eLife 8, e39397 (2019).

    Article  Google Scholar 

  28. Riesselman, A. J., Shin, J. E., Kollasch, A. W. & McMahon, C. Accelerating protein design using autoregressive generative models. Preprint at bioRxiv (2019).

  29. Greener, J. G., Moffat, L. & Jones, D. T. Design of metalloproteins and novel protein folds using variational autoencoders. Sci. Rep. 8, 16189 (2018).

    Article  Google Scholar 

  30. Anand, N. & Huang, P. Generative modeling for protein structures. In Advances in Neural Information Processing Systems Vol. 31 (eds Bengio, S. et al.) 7494–7505 (Curran Associates, 2018).

  31. Killoran, N., Lee, L. J., Delong, A., Duvenaud, D. & Frey, B. J. Generating and designing DNA with deep generative models. Preprint at (2017).

  32. Amimeur, T., Shaver, J. M., Ketchem, R. R. & Taylor, J. A. Designing feature-controlled humanoid antibody discovery libraries using generative adversarial networks. Preprint at bioRxiv (2020)

  33. Gupta, A. & Zou, J. Feedback GAN for DNA optimizes protein functions. Nat. Mach. Intell. 1, 105–111 (2019).

    Article  Google Scholar 

  34. Goodfellow, I. et al. Generative adversarial nets. In Advances in Neural Information Processing Systems Vol. 27 (eds Ghahramani, Z. et al.) 2672–2680 (Curran Associates, 2014).

  35. Bai, S., Kolter, J. Z. & Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. Preprint at (2018).

  36. Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6, e28766 (2011).

    Article  Google Scholar 

  37. Zhang, H., Goodfellow, I., Metaxas, D. & Odena, A. Self-attention generative adversarial networks. Preprint at (2018).

  38. Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).

    Article  MathSciNet  Google Scholar 

  39. Sonnhammer, E. L., Eddy, S. R., Birney, E., Bateman, A. & Durbin, R. Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res. 26, 320–322 (1998).

    Article  Google Scholar 

  40. Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Preprint at (2019).

  41. Santoni, D., Felici, G. & Vergni, D. Natural vs random protein sequences: discovering combinatorics properties on amino acid words. J. Theor. Biol. 391, 13–20 (2016).

    Article  MathSciNet  MATH  Google Scholar 

  42. El-Gebali, S. et al. The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432 (2019).

    Article  Google Scholar 

  43. Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).

    MATH  Google Scholar 

  44. Dawson, N. L. et al. CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res. 45, D289–D295 (2017).

    Article  Google Scholar 

  45. Rosano, G. L. & Ceccarelli, E. A. Recombinant protein expression in Escherichia coli: advances and challenges. Front. Microbiol. 5, 172 (2014).

    Article  Google Scholar 

  46. Huang, H. et al. Panoramic view of a superfamily of phosphatases through substrate profiling. Proc. Natl Acad. Sci. USA 112, E1974–E1983 (2015).

    Article  Google Scholar 

  47. Pertusi, D. A., Stine, A. E., Broadbelt, L. J. & Tyo, K. E. J. Efficient searching and annotation of metabolic networks using chemical similarity. Bioinformatics 31, 1016–1024 (2015).

    Article  Google Scholar 

  48. Mashiyama, S. T. et al. Large-scale determination of sequence, structure and function relationships in cytosolic glutathione transferases across the biosphere. PLoS Biol. (2014).

  49. Lockless, S. W. & Ranganathan, R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science 286, 295–299 (1999).

    Article  Google Scholar 

  50. Socolich, M. et al. Evolutionary information for specifying a protein fold. Nature 437, 512–518 (2005).

    Article  Google Scholar 

  51. Russ, W. P., Lowery, D. M., Mishra, P., Yaffe, M. B. & Ranganathan, R. Natural-like function in artificial WW domains. Nature 437, 579–583 (2005).

    Article  Google Scholar 

  52. Pervez, M. T. et al. Evaluating the accuracy and efficiency of multiple sequence alignment methods. Evol. Bioinform. Online 10, 205–217 (2014).

    Article  Google Scholar 

  53. Nuin, P. A. S., Wang, Z. & Tillier, E. R. M. The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics 7, 471 (2006).

    Article  Google Scholar 

  54. Karras, T., Laine, S. & Aila, T. A style-based generator architecture for generative adversarial networks. Preprint at (2018).

  55. van den Oord, A. et al. WaveNet: a generative model for raw audio. Preprint at (2016).

  56. Bloom, J. D. et al. Thermodynamic prediction of protein neutrality. Proc. Natl Acad. Sci. USA 102, 606–611 (2005).

    Article  Google Scholar 

  57. Neylon, C. Chemical and biochemical strategies for the randomization of protein encoding DNA sequences: library construction methods for directed evolution. Nucleic Acids Res. 32, 1448–1459 (2004).

    Article  Google Scholar 

  58. Voigt, C. A., Martinez, C., Wang, Z.-G., Mayo, S. L. & Arnold, F. H. Protein building blocks preserved by recombination. Nat. Struct. Biol. 9, 553–558 (2002).

    Google Scholar 

  59. Chen, T. & Romesberg, F. E. Directed polymerase evolution. FEBS Lett. 588, 219–229 (2014).

    Article  Google Scholar 

  60. Truppo, M. D. Biocatalysis in the pharmaceutical industry: the need for speed. ACS Med. Chem. Lett. 8, 476–480 (2017).

    Article  Google Scholar 

  61. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Preprint at (2015).

  62. Ioffe, S. & Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. Preprint at (2015).

  63. Maas, A. L. Rectifier nonlinearities improve neural network acoustic models. In Proc. 30th International Conference on Machine Learning Vol. 30 (ACM, 2013).

  64. Mescheder, L., Geiger, A. & Nowozin, S. Which training methods for GANs do actually converge? Preprint at (2018).

  65. Miyato, T., Kataoka, T., Koyama, M. & Yoshida, Y. Spectral normalization for generative adversarial networks. Preprint at (2018).

  66. UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).

    Article  Google Scholar 

  67. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).

    Article  Google Scholar 

  68. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at (2014).

  69. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

    Article  Google Scholar 

  70. Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).

    Article  Google Scholar 

  71. Sievers, F. & Higgins, D. G. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci. 27, 135–145 (2018).

    Article  Google Scholar 

  72. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

    MathSciNet  MATH  Google Scholar 

  73. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res 28, 235–242 (2000).

    Article  Google Scholar 

  74. Berman, H. M. et al. The Protein Data Bank. Acta Crystallogr. D Biol. Crystallogr. 58, 899–907 (2002).

    Article  Google Scholar 

  75. Eswar, N. et al. Comparative protein structure modeling using MODELLER. Curr. Protoc. Protein Sci. 2, 2.9 (2006).

    Google Scholar 

  76. Sievers, F., Wilm, A., Dineen, D. & Gibson, T. J. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 7, 539 (2011).

    Article  Google Scholar 

  77. Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).

    Article  Google Scholar 

  78. Sanner, M. F., Olson, A. J. & Spehner, J. C. Reduced surface: an efficient way to compute molecular surfaces. Biopolymers 38, 305–320 (1996).

    Article  Google Scholar 

  79. McCloskey, D. & Ubhi, B. K. Quantitative and qualitative metabolomics for the investigation of intracellular metabolism. SCIEX Tech Note 1–11 (2014).

  80. Wilbur, W. J. & Lipman, D. J. Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl Acad. Sci. USA 80, 726–730 (1983).

    Article  Google Scholar 

Download references


We thank G. Stonyte, J. Nainys and C. Correia-Melo for comments on the manuscript. We also thank A. Repecka and L. Petkevicius for their valuable and constructive suggestions for improving the model. L.K. and R.M. were supported by the Agency for Science, Innovation and Technology (Lithuania) grant no. 31V-59/(1.78)SU-1687. J.Z. and A.Z. were supported by SciLifeLab fellow programme funding. S.V. was supported by VR starting grant no. 2019-05356. The computations were enabled with resources provided by the Swedish National Infrastructure for Computing (SNIC) at C3SE, partially funded by the Swedish Research Council through grant agreement no. 2018-05973. M. Öhman and T. Svedberg at C3SE are acknowledged for technical assistance in making the code run on Vera C3SE resources.

Author information

Authors and Affiliations



D.R. implemented the method, contributed with principal analysis and wrote the first draft. V.J. contributed principal analysis, designed experiments and wrote the first draft. L.K. contributed principal analysis, designed experiments and wrote the first draft. E.R. performed laboratory experiments. I.R. contributed principal analysis and wrote the first draft. J.Z. contributed principal analysis, performed laboratory experiments and wrote the first draft. S.P. performed laboratory experiments. A.L. contributed principal analysis. S.V. contributed principal analysis. W.A. performed laboratory experiments. O.S. contributed principal analysis and supervised the mass spectrometry work. R.M. supervised the study and designed the experiments. M.K.M.E. supervised the study, designed experiments, contributed principal analysis and wrote the manuscript. A.Z. supervised the study, designed experiments, contributed principal analysis, financed the experiments and wrote the manuscript. All authors contributed to writing of the paper and read the final manuscript.

Corresponding author

Correspondence to Aleksej Zelezniak.

Ethics declarations

Competing interests

L.K., V.J., D.R., I.R. and R.M. are shareholders of the company Biomatter Designs. The company has submitted a patent application for the technology described in the Article. The other authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks Frances Arnold and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Methods, Figs. 1–25 and Tables 1–7.

Supplementary Table 2

Supplementary Table 2. Generated sequences and their identities to the closest real sequence.

Source data

Source Data Fig. 1

SDS–PAGE gels of purified proteins. a, Batch 1, protocol 1 (Methods). b, Batches 2 and 3, protocol 1 (Methods). c, Batch 1, protocol 2. T, total lysate; S, soluble lysate; E, elution after affinity column use. d, Batch 2, protocol 2. e, Batch 3, protocol 2 (Methods). The results are summarized in Supplementary Table 3.

Rights and permissions

Reprints and Permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Repecka, D., Jauniskis, V., Karpus, L. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat Mach Intell 3, 324–333 (2021).

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI:

This article is cited by


Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing