Abstract
Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature from modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model performs best overall, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
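To make the approach concrete, here is a minimal sketch of the augmented ridge regression described above, assuming fixed-length variant sequences and precomputed evolutionary density scores. The toy data, variable names and regularization strength are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from sklearn.linear_model import Ridge

AAS = "ACDEFGHIKLMNPQRSTVWY"
AA_IDX = {aa: i for i, aa in enumerate(AAS)}

def one_hot(seq):
    """Site-specific amino acid features: one indicator per (site, amino acid) pair."""
    x = np.zeros((len(seq), len(AAS)))
    for i, aa in enumerate(seq):
        x[i, AA_IDX[aa]] = 1.0
    return x.ravel()

def augmented_features(seqs, density_scores):
    """Concatenate one-hot features with a single evolutionary density feature,
    e.g., the log-probability of each sequence under a VAE or Potts model
    fit to related natural sequences."""
    X = np.stack([one_hot(s) for s in seqs])
    return np.hstack([X, np.asarray(density_scores, dtype=float)[:, None]])

# toy assay-labeled variants of a 4-residue wild type (all values made up)
train_seqs = ["ACDE", "ACDF", "GCDE", "ACHE"]
train_fitness = [1.0, 0.4, 0.2, 0.7]
train_density = [-3.1, -5.2, -6.0, -4.4]  # log p(x) from an evolutionary model

model = Ridge(alpha=1.0).fit(augmented_features(train_seqs, train_density), train_fitness)
print(model.predict(augmented_features(["GCDF"], [-5.5])))
```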
Data availability
All protein fitness data were publicly available through citations provided in the paper. A processed version of these data and our evaluation results are available on Dryad at https://doi.org/10.6078/D1K71B. All protein structures used in the study are publicly available under PDB IDs 2WUR, 6R5K and 2KR4.
Code availability
The code to reproduce the results is available at https://github.com/chloechsu/combining-evolutionary-and-assay-labelled-data.
Acknowledgements
We thank A. Aghazadeh, P. Almhjell, F. Arnold, A. Busia, D. Brookes, M. Jagota, K. Johnston, L. Schaus, N. Thomas, Y. Wang and B. Wittmann for helpful discussions. We also thank P. Barrat-Charlaix, S. Biswas, J. Meier and Z. Shamsi for providing helpful details about their methods and implementations. Partial support was provided by the US Department of Energy, Office of Biological and Environmental Research, Genomic Science Program, Lawrence Livermore National Laboratory's Secure Biosystems Design Scientific Focus Area under grant award no. SCW1710 (J.L., C.H.), the Chan Zuckerberg Investigator program (J.L.) and C3.ai (J.L., H.N.). Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under grant award no. T32LM012417 (H.N.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This material is based on work supported by the National Science Foundation Graduate Research Fellowship Program under grant no. DGE 2146752 (C.F.).
Author information
Contributions
C.H. and J.L. conceptualized the study and developed the methodology. C.H. implemented models and analyzed data, with contributions from H.N. and C.F. All authors wrote the paper.
Ethics declarations
Competing interests
J.L. is on the Scientific Advisory Board of Patch Biosciences and Foresite Laboratories. The remaining authors declare no competing interests.
Peer review information
Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Performance of existing methods and the augmented Potts model with NDCG.
Analog of Fig. 2, but using NDCG instead of Spearman correlation. (a) Average performance across all 19 data sets, as measured by NDCG. The horizontal axis shows the number of supervised training examples used. Error bars are centered at the mean and indicate bootstrapped 95% confidence intervals estimated from 20 random splits of training and test data. Asterisks (*) indicate that P < 0.01 among all two-sided Mann-Whitney U tests that the augmented Potts model has different performance from each other method, at a given sample size. In particular, the largest such P value for each training set size was, respectively, P = 3.9 × 10−2, 6.9 × 10−7, 2.2 × 10−7, 7.9 × 10−8, 7.7 × 10−4, 6.8 × 10−8, 6.8 × 10−8 and 6.8 × 10−8, and P = 7.7 × 10−4 for the 80-20 split. (b) Average performance across all three data sets containing double mutant sequences (sequences that are two mutations away from the wild type), and restricted to testing on only double mutants.
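As a reminder of how NDCG scores a predicted ranking, here is a common formulation (a sketch; the paper's exact gain definition, for example any shifting of negative fitness values, is an assumption here, and gains are assumed non-negative):

```python
import numpy as np

def ndcg(y_true, y_pred):
    """Normalized discounted cumulative gain: rank variants by predicted
    fitness, discount the true-fitness 'gains' logarithmically by rank, and
    normalize by the score of a perfect ordering."""
    order = np.argsort(y_pred)[::-1]                    # predicted ranking, best first
    gains = np.asarray(y_true, dtype=float)[order]
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = np.sum(gains * discounts)
    ideal = np.sum(np.sort(y_true)[::-1] * discounts)   # best achievable DCG
    return dcg / ideal

print(ndcg([0.1, 0.9, 0.5], [0.2, 0.8, 0.3]))           # 1.0 for a perfect ranking
```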
Extended Data Fig. 3 Performance on individual data sets when trained on 80% of the data.
A breakdown of the averaged Spearman correlation results presented in the right-side mini-panel of Fig. 2a, on 80-20 splits, by individual data set. See Supplementary Fig. 2 for the analogous plot using NDCG. Error bars indicate bootstrapped 95% confidence intervals from 20 random data splits. Box-and-whisker plots show the first and third quartiles as well as median values. The upper and lower whiskers extend from the hinge to the largest or smallest value no further than 1.5× the interquartile range from the hinge.
Extended Data Fig. 4 Augmented approach using different probability density models, with NDCG.
Analogous to Fig. 3, but using NDCG. Methods are compared with their augmented counterpart, using matching colors on each pair. Flat, horizontal lines represent evolutionary density models that do not have access to assay-labeled data. Dashed lines indicate existing methods. Error bars are centered at the mean and indicate bootstrapped 95% confidence interval from 20 random data splits.
Extended Data Fig. 5 Performance on individual data sets with NDCG.
Analogous to Fig. 3, but using NDCG. Methods are compared with their augmented counterpart, using matching colors on each pair. Flat, horizontal lines represent evolutionary density models that do not have access to assay-labeled data. Dashed lines indicate existing methods. Error bars are centered at the mean and indicate bootstrapped 95% confidence intervals from 20 random data splits.
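The effective MSA size referenced below (Extended Data Fig. 5a) can be sketched with the standard sequence-reweighting scheme at an 80% identity cutoff; the treatment of gap characters here is an assumption:

```python
import numpy as np

def effective_msa_size(msa, identity_cutoff=0.8):
    """Effective number of sequences: each sequence is down-weighted by the
    number of MSA sequences (itself included) sharing at least 80% identity
    with it, and the weights are summed."""
    X = np.array([list(s) for s in msa])
    n, L = X.shape
    weights = np.zeros(n)
    for i in range(n):
        identity = (X == X[i]).mean(axis=1)      # fraction identical per sequence
        weights[i] = 1.0 / np.sum(identity >= identity_cutoff)
    return weights.sum()

msa = ["ACDE", "ACDE", "ACDE", "GHIK"]           # toy alignment
print(effective_msa_size(msa))                   # 2.0: three duplicates count once
```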
Extended Data Fig. 6 The distribution of best model(s) on each data set.
Analogous to Fig. 4b, but varying the number of assay-labeled training examples. (a) Summary of how often each modeling strategy had maximal Spearman correlation. Such strategies were determined by first identifying the top-performing strategy for any given scenario, and then also identifying any other strategy that came within the 95% confidence interval of the top performer. Four settings are used: with no assay-labeled data, when training on 48 or 240 assay-labeled single-mutant examples, and in the 80-20 train-test split setting. (b) Summary of how often each modeling strategy had maximal NDCG.
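The 'best model(s)' selection rule described in (a) might look like this in code (a sketch assuming per-split scores for each strategy; the bootstrap details are assumptions):

```python
import numpy as np

def best_strategies(scores, n_boot=1000, seed=0):
    """Return the top-performing strategy plus any strategy whose mean score
    falls within the bootstrapped 95% CI of the top performer. `scores` maps
    strategy name -> per-split scores (e.g., Spearman from 20 random splits)."""
    rng = np.random.default_rng(seed)
    means = {name: np.mean(v) for name, v in scores.items()}
    top = max(means, key=means.get)
    top_scores = np.asarray(scores[top])
    boot_means = [rng.choice(top_scores, size=len(top_scores), replace=True).mean()
                  for _ in range(n_boot)]
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return [name for name, m in means.items() if name == top or lo <= m <= hi]
```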
Extended Data Fig. 7 Extrapolation performance from single and double mutants to higher-order mutants.
Analogous to Fig. 5, but training on a random sample from both single and double mutants. Each column shows the performance when training on randomly sampled single and double mutants and then separately testing on single, double or triple mutants, none of which were in the training data. The total size (TS) indicates the total number of mutants of a particular order in all of the data. For example, 'TS=613' for single mutants means there were 613 total single mutants in the data set that we sampled from. Error bars are centered at the mean and indicate bootstrapped 95% confidence intervals from 20 random data splits. See Supplementary Fig. 6 for the analogous plot using NDCG.
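The extrapolation splits used here amount to a partition by mutation order relative to the wild type, roughly as sketched below (helper names are illustrative, not the authors' code):

```python
def mutation_order(seq, wild_type):
    """Number of substitutions relative to the wild type
    (sequences assumed aligned and of equal length)."""
    return sum(a != b for a, b in zip(wild_type, seq))

def extrapolation_split(seqs, wild_type, train_orders=(1, 2), test_order=3):
    """Train on lower-order mutants, test on strictly higher-order mutants,
    so no test sequence appears in training."""
    train = [s for s in seqs if mutation_order(s, wild_type) in train_orders]
    test = [s for s in seqs if mutation_order(s, wild_type) == test_order]
    return train, test
```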
Extended Data Fig. 8 Edit distance from wild-type sequence as a predictive model (UBE4B U-box domain).
Analogous to Fig. 6, but on the UBE4B U-box domain data set. We compared the performance of non-augmented evolutionary density models to two predictive models that use only the edit distance of a sequence from the wild type. In one version, the edit distance is defined as the number of mutations away from the wild type. In the other version, we used BLOSUM62 to compute the distance from the wild type, which thus accounts not only for the number of mutations but also for the type of mutation. Each dot represents a UBE4B U-box domain sequence, with darker colors indicating larger distances from the wild type.
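The two distance baselines can be sketched as follows, assuming Biopython for the BLOSUM62 matrix. The exact form of the BLOSUM62-weighted distance is our guess at a plausible definition, not necessarily the paper's:

```python
from Bio.Align import substitution_matrices  # Biopython

BLOSUM62 = substitution_matrices.load("BLOSUM62")

def n_mutations(seq, wt):
    """Version 1: edit distance as the number of mutations from wild type."""
    return sum(a != b for a, b in zip(wt, seq))

def blosum_distance(seq, wt):
    """Version 2: a BLOSUM62-weighted distance. Each mutated site contributes
    the drop in substitution score relative to conserving the wild-type
    residue, so conservative substitutions add less distance."""
    return sum(BLOSUM62[a, a] - BLOSUM62[a, b] for a, b in zip(wt, seq) if a != b)

print(n_mutations("ACDE", "ACDF"), blosum_distance("ACDE", "ACDF"))  # 1 9.0
```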
Extended Data Fig. 9 FoldX predictions as additional features in augmented models.
Each column shows the performance of augmented models with a single FoldX-derived stability feature added, when training on randomly sampled single mutants and then separately testing on single, double or triple mutants. It also shows the result of augmenting with two density models at once, without FoldX ('Augmented VAE + Potts'). Error bars are centered at the mean and indicate bootstrapped 95% confidence intervals from 20 random data splits. See Supplementary Fig. 7 for the analogous evaluation with NDCG.
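Feature-wise, adding a FoldX stability prediction (or a second density model) amounts to stacking extra scalar columns onto the augmented design matrix, roughly as sketched below; how the paper scales or signs the FoldX feature is an assumption here:

```python
import numpy as np

def multi_augmented_features(onehot, vae_logp, potts_logp, foldx_ddg=None):
    """Stack several scalar columns onto the one-hot features at once: two
    evolutionary density scores ('Augmented VAE + Potts') plus, optionally,
    a FoldX stability prediction per variant."""
    cols = [np.asarray(onehot, dtype=float)]
    for extra in (vae_logp, potts_logp, foldx_ddg):
        if extra is not None:
            cols.append(np.asarray(extra, dtype=float)[:, None])
    return np.hstack(cols)
```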
Extended Data Fig. 10 Performance of linear model using only one feature per site (not per amino acid at each site).
In addition to the linear model with one-hot encoded, site-specific amino acid features, we also evaluated a simpler linear model with position-only features that encode which sites are mutated. The evaluation uses Spearman correlation. Each column shows the performance when training on randomly sampled single mutants and then separately testing on single, double, or triple mutants, none of which were in the training data. Error bars are centered at the mean and indicate bootstrapped 95% confidence interval from 20 random data splits.
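The position-only encoding contrasts with the one-hot encoding as follows (a minimal sketch):

```python
import numpy as np

def position_features(seq, wild_type):
    """One feature per site: 1 if that site is mutated, 0 otherwise. Unlike
    the one-hot encoding (20 features per site), this cannot distinguish
    which amino acid was introduced at a mutated site."""
    return np.array([float(a != b) for a, b in zip(wild_type, seq)])

print(position_features("GCDE", "ACDE"))  # [1. 0. 0. 0.]
```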
Supplementary information
Supplementary Information
Supplementary Figs. 1–16 and Table 1.
Cite this article
Hsu, C., Nisonoff, H., Fannjiang, C. et al. Learning protein fitness models from evolutionary and assay-labeled data. Nat Biotechnol 40, 1114–1122 (2022). https://doi.org/10.1038/s41587-021-01146-5