Machine-learning-guided directed evolution for protein engineering


Protein engineering through machine-learning-guided directed evolution enables the optimization of protein functions. Machine-learning approaches predict how sequence maps to function in a data-driven manner without requiring a detailed model of the underlying physics or biological pathways. Such methods accelerate directed evolution by learning from the properties of characterized variants and using that information to select sequences that are likely to exhibit improved properties. Here we introduce the steps required to build machine-learning sequence–function models and to use those models to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to the use of machine learning for protein engineering, as well as the current literature and applications of this engineering paradigm. We illustrate the process with two case studies. Finally, we look to future opportunities for machine learning to enable the discovery of unknown protein functions and uncover the relationship between protein sequence and function.
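The core loop described above (learn from characterized variants, then select sequences likely to be improved) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used in the review or its case studies: the one-hot encoding, the closed-form ridge regression, the helper names (`one_hot`, `fit_ridge`, `propose`) and the toy sequences and fitness values are all hypothetical choices made for the example.

```python
import itertools
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Encode a sequence as a flat one-hot vector of length len(seq) * 20."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression on mean-centered labels.

    Returns (weights, intercept), where the intercept is the label mean.
    """
    y_mean = y.mean()
    A = X.T @ X + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(A, X.T @ (y - y_mean))
    return w, y_mean

def propose(train_seqs, train_y, candidates, n_pick=3, alpha=1.0):
    """Fit a sequence-function model, then rank uncharacterized candidates."""
    X = np.array([one_hot(s) for s in train_seqs])
    w, b = fit_ridge(X, np.asarray(train_y, dtype=float), alpha)
    seen = set(train_seqs)
    pool = [s for s in candidates if s not in seen]
    preds = np.array([one_hot(s) @ w + b for s in pool])
    order = np.argsort(-preds)[:n_pick]  # highest predicted fitness first
    return [(pool[i], float(preds[i])) for i in order]

# Hypothetical toy data: four characterized variants of a 3-residue site
# with made-up fitness values (assumption, for illustration only).
train_seqs = ["AAA", "AAV", "AVA", "VAA"]
train_y = [1.0, 2.0, 1.5, 0.5]

# Candidate library: every A/V combination at the three positions.
candidates = ["".join(p) for p in itertools.product("AV", repeat=3)]

picks = propose(train_seqs, train_y, candidates, n_pick=2)
for seq, pred in picks:
    print(seq, round(pred, 3))
```

In a real campaign the selected variants would be synthesized and measured, and their measurements added to the training set before the next round. A linear model on one-hot features cannot capture epistasis, so it serves here only to show the shape of the loop.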


Fig. 1: Directed evolution with and without machine learning.
Fig. 2
Fig. 3: GP-UCB algorithm.
Fig. 4: Directed evolution using PLS regression.
Fig. 5: Directed evolution using GPs and Bayesian optimization.
Fig. 6: Autoencoder.




Acknowledgements

The authors thank Y. Chen, K. Johnston, B. Wittmann, and H. Yang for comments on early versions of the manuscript, as well as members of the Arnold lab, J. Bois, and Y. Yue for general advice and discussions on protein engineering and machine learning. This work was supported by the US Army Research Office Institute for Collaborative Biotechnologies (W911F-09-0001 to F.H.A.), the Donna and Benjamin M. Rosen Bioengineering Center (to K.K.Y.), and the National Science Foundation (GRF2017227007 to Z.W.).

Author information




K.K.Y., Z.W., and F.H.A. conceptualized the project. K.K.Y. wrote the manuscript with input and editing from all authors.

Corresponding author

Correspondence to Frances H. Arnold.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nina Vogt was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Yang, K.K., Wu, Z. & Arnold, F.H. Machine-learning-guided directed evolution for protein engineering. Nat Methods 16, 687–694 (2019).
