Expanding functional protein sequence spaces using generative adversarial networks

Abstract

De novo protein design for catalysis of any desired chemical reaction is a long-standing goal in protein engineering because of the broad spectrum of technological, scientific and medical applications. However, mapping protein sequence to protein function is currently neither computationally nor experimentally tangible. Here, we develop ProteinGAN, a self-attention-based variant of the generative adversarial network that is able to ‘learn’ natural protein sequence diversity and enables the generation of functional protein sequences. ProteinGAN learns the evolutionary relationships of protein sequences directly from the complex multidimensional amino-acid sequence space and creates new, highly diverse sequence variants with natural-like physical properties. Using malate dehydrogenase (MDH) as a template enzyme, we show that 24% (13 out of 55 tested) of the ProteinGAN-generated and experimentally tested sequences are soluble and display MDH catalytic activity in the tested conditions in vitro, including a highly mutated variant of 106 amino-acid substitutions. ProteinGAN therefore demonstrates the potential of artificial intelligence to rapidly generate highly diverse functional proteins within the allowed biological constraints of the sequence space.

A preprint version of the article is available at bioRxiv.

Fig. 1: ProteinGAN learns the intrinsic relationships between natural protein sequences.
Fig. 2: ProteinGAN expands the functional MDH sequence space.

Data availability

All training data files, including ProteinGAN running examples, have been deposited to the Zenodo repository and are available at https://doi.org/10.5281/zenodo.4068040. Source data are provided with this paper.

Code availability

The implementation of ProteinGAN can be accessed at https://github.com/Biomatter-Designs/ProteinGAN.

Acknowledgements

We thank G. Stonyte, J. Nainys and C. Correia-Melo for comments on the manuscript. We also thank A. Repecka and L. Petkevicius for their valuable and constructive suggestions for improving the model. L.K. and R.M. were supported by the Agency for Science, Innovation and Technology (Lithuania) grant no. 31V-59/(1.78)SU-1687. J.Z. and A.Z. were supported by SciLifeLab fellow programme funding. S.V. was supported by VR starting grant no. 2019-05356. The computations were enabled with resources provided by the Swedish National Infrastructure for Computing (SNIC) at C3SE, partially funded by the Swedish Research Council through grant agreement no. 2018-05973. M. Öhman and T. Svedberg at C3SE are acknowledged for technical assistance in making the code run on Vera C3SE resources.

Author information

Contributions

D.R. implemented the method, contributed principal analysis and wrote the first draft. V.J. contributed principal analysis, designed experiments and wrote the first draft. L.K. contributed principal analysis, designed experiments and wrote the first draft. E.R. performed laboratory experiments. I.R. contributed principal analysis and wrote the first draft. J.Z. contributed principal analysis, performed laboratory experiments and wrote the first draft. S.P. performed laboratory experiments. A.L. contributed principal analysis. S.V. contributed principal analysis. W.A. performed laboratory experiments. O.S. contributed principal analysis and supervised the mass spectrometry work. R.M. supervised the study and designed the experiments. M.K.M.E. supervised the study, designed experiments, contributed principal analysis and wrote the manuscript. A.Z. supervised the study, designed experiments, contributed principal analysis, financed the experiments and wrote the manuscript. All authors contributed to the writing of the paper and read the final manuscript.

Corresponding author

Correspondence to Aleksej Zelezniak.

Ethics declarations

Competing interests

L.K., V.J., D.R., I.R. and R.M. are shareholders of the company Biomatter Designs. The company has submitted a patent application for the technology described in the Article. The other authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks Frances Arnold and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Methods, Figs. 1–25 and Tables 1–7.

Supplementary Table 2. Generated sequences and their identities to the closest real sequence.
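As a rough illustration of what "identity to the closest real sequence" can mean, the sketch below computes percent identity between pre-aligned sequences and picks the natural sequence with the highest score. This is not the authors' pipeline: the function names `percent_identity` and `closest_real`, the toy sequences, and the choice of denominator (all columns where at least one sequence has a residue) are assumptions made for this example, and the paper may define identity differently.

```python
from typing import Iterable, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned, equal-length sequences.

    Assumption: columns where both sequences carry a gap ('-') are skipped,
    and identity = matching columns / compared columns.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # shared gap column carries no information
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def closest_real(generated: str, real: Iterable[str]) -> Tuple[str, float]:
    """Return the real sequence with the highest identity and that identity."""
    scored = ((r, percent_identity(generated, r)) for r in real)
    return max(scored, key=lambda pair: pair[1])


# Toy, hypothetical sequences (not taken from the paper's data).
generated_seq = "MKVAVLGA"
natural_seqs = ["MKIAVLGA", "MRVSILGS"]
best_seq, best_identity = closest_real(generated_seq, natural_seqs)
print(best_seq, best_identity)
```

For real data the sequences would first need to be aligned with an alignment tool; the sketch assumes that step has already been done.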

Source data

Source Data Fig. 1

SDS–PAGE gels of purified proteins. a, Batch 1, protocol 1 (Methods). b, Batches 2 and 3, protocol 1 (Methods). c, Batch 1, protocol 2. T, total lysate; S, soluble lysate; E, elution after affinity column use. d, Batch 2, protocol 2. e, Batch 3, protocol 2 (Methods). The results are summarized in Supplementary Table 3.

Cite this article

Repecka, D., Jauniskis, V., Karpus, L. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat Mach Intell (2021). https://doi.org/10.1038/s42256-021-00310-5
