Learning protein fitness models from evolutionary and assay-labeled data

Abstract

Machine learning-based models of protein fitness typically learn from either unlabeled, evolutionarily related sequences or variant sequences with experimentally measured labels. For regimes where only limited experimental data are available, recent work has suggested methods for combining both sources of information. Toward that goal, we propose a simple combination approach that is competitive with, and on average outperforms, more sophisticated methods. Our approach uses ridge regression on site-specific amino acid features combined with one probability density feature obtained by modeling the evolutionary data. Within this approach, we find that a variational autoencoder-based probability density model gives the best overall performance, although any evolutionary density model can be used. Moreover, our analysis highlights the importance of systematic evaluations and sufficient baselines.
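For concreteness, the core of this approach can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code (see Code availability below): the `density_model` callable stands in for any evolutionary density model, for example the log-probability from a profile HMM or Potts model, or the ELBO of a VAE, and all names are illustrative.

```python
# Minimal sketch of the augmented approach: ridge regression on one-hot,
# site-specific amino acid features plus one evolutionary density feature.
import numpy as np
from sklearn.linear_model import Ridge

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """One indicator feature per (site, amino acid) pair."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.ravel()

def augmented_features(seqs, density_model):
    """Concatenate one-hot features with a single density feature, where
    density_model(seq) is, e.g., log p(seq) under an evolutionary model."""
    onehot = np.stack([one_hot(s) for s in seqs])
    density = np.array([[density_model(s)] for s in seqs])
    return np.hstack([onehot, density])

def fit_augmented_model(seqs, y, density_model, alpha=1.0):
    """Fit ridge regression on assay-labeled variants (y: measured fitness)."""
    model = Ridge(alpha=alpha)
    model.fit(augmented_features(seqs, density_model), y)
    return model
```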


Fig. 1: Machine learning methods for protein fitness prediction.
Fig. 2: Performance of existing methods and the augmented Potts model.
Fig. 3: Augmented approach using different probability density models.
Fig. 4: Performance on individual data sets.
Fig. 5: Extrapolative performance from single mutants to higher-order mutants.
Fig. 6: Edit distance from wild-type sequence as a predictive model.


Data availability

All protein fitness data are publicly available through the citations in the paper. A processed version of these data and our evaluation results are available on Dryad at https://doi.org/10.6078/D1K71B. All protein structures used in the study are publicly available under PDB IDs 2WUR, 6R5K and 2KR4.

Code availability

The code to reproduce the results is available at https://github.com/chloechsu/combining-evolutionary-and-assay-labelled-data.


Acknowledgements

We thank A. Aghazadeh, P. Almhjell, F. Arnold, A. Busia, D. Brookes, M. Jagota, K. Johnston, L. Schaus, N. Thomas, Y. Wang and B. Wittmann for helpful discussions. We also thank P. Barrat-Charlaix, S. Biswas, J. Meier and Z. Shamsi for providing helpful details about their methods and implementations. Partial support was provided by the US Department of Energy, Office of Biological and Environmental Research, Genomic Science Program, Lawrence Livermore National Laboratory's Secure Biosystems Design Scientific Focus Area under grant award no. SCW1710 (J.L., C.H.), the Chan Zuckerberg Investigator program (J.L.) and C3.ai (J.L., H.N.). Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under grant award no. T32LM012417 (H.N.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This material is based on work supported by the National Science Foundation Graduate Research Fellowship Program under grant no. DGE 2146752 (C.F.).

Author information


Contributions

C.H. and J.L. conceptualized the study and developed the methodology. C.H. implemented models and analyzed data, with contributions from H.N. and C.F. All authors wrote the paper.

Corresponding authors

Correspondence to Chloe Hsu or Jennifer Listgarten.

Ethics declarations

Competing interests

J.L. is on the Scientific Advisory Board of Patch Biosciences and Foresite Laboratories. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance of existing methods and the augmented Potts model with NDCG.

Analog of Fig. 2, but using NDCG instead of Spearman correlation. (a) Average performance across all 19 data sets, as measured by NDCG. The horizontal axis shows the number of supervised training examples used. Error bars are centered at the mean and indicate bootstrapped 95% confidence intervals estimated from 20 random splits of training and test data. Asterisks (*) indicate P < 0.01 for all two-sided Mann-Whitney U tests of whether the augmented Potts model performs differently from each other method at a given sample size. In particular, the largest such P value for each training set size was, respectively, P = 3.9 × 10⁻², 6.9 × 10⁻⁷, 2.2 × 10⁻⁷, 7.9 × 10⁻⁸, 7.7 × 10⁻⁴, 6.8 × 10⁻⁸, 6.8 × 10⁻⁸ and 6.8 × 10⁻⁸, and P = 7.7 × 10⁻⁴ for the 80-20 split. (b) Average performance across all three data sets containing double-mutant sequences (sequences that are two mutations away from the wild type), with testing restricted to double mutants only.
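As a reference for the metric and test used here, the sketch below shows one standard way to compute NDCG from predicted and measured fitness values, together with the two-sided Mann-Whitney U test used for the comparisons above. Shifting fitness values to be nonnegative gains is an assumption for illustration, not necessarily the authors' exact transform.

```python
# Sketch of the evaluation in this figure: NDCG of a predicted ranking
# against measured fitness, and a two-sided Mann-Whitney U test between
# two methods' per-split scores.
import numpy as np
from scipy.stats import mannwhitneyu

def ndcg(y_true, y_pred):
    """Normalized discounted cumulative gain of predictions vs. labels."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    order = np.argsort(y_pred)[::-1]                  # rank by prediction
    gains = y_true - y_true.min()                     # nonnegative gains (assumed transform)
    discounts = 1.0 / np.log2(np.arange(2, len(y_true) + 2))
    dcg = np.sum(gains[order] * discounts)
    ideal = np.sum(np.sort(gains)[::-1] * discounts)  # best possible DCG
    return dcg / ideal if ideal > 0 else 0.0

def compare_methods(scores_a, scores_b):
    """P value of a two-sided Mann-Whitney U test on per-split scores."""
    return mannwhitneyu(scores_a, scores_b, alternative="two-sided").pvalue
```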

Extended Data Fig. 2 Performance on individual data sets when trained on limited labeled data.

A breakdown of the averaged Spearman correlation results presented in Fig. 2a by individual data set. See Supplementary Fig. 1 for the analogous plot using NDCG. Error bands are centered at the mean and indicate bootstrapped 95% confidence intervals from 20 random data splits.

Extended Data Fig. 3 Performance on individual data sets when trained on 80% of the data.

A breakdown of the averaged Spearman correlation results presented in the right-side mini-panel of Fig. 2a, on 80-20 splits, by individual data set. See Supplementary Fig. 2 for the analogous plot using NDCG. Error bars indicate bootstrapped 95% confidence intervals from 20 random data splits. Box-and-whisker plots show the first and third quartiles as well as median values. The upper and lower whiskers extend from the hinge to the largest or smallest value no further than 1.5× the interquartile range from the hinge.

Extended Data Fig. 4 Augmented approach using different probability density models, with NDCG.

Analogous to Fig. 3, but using NDCG. Methods are compared with their augmented counterparts, using matching colors for each pair. Flat, horizontal lines represent evolutionary density models that do not have access to assay-labeled data. Dashed lines indicate existing methods. Error bars are centered at the mean and indicate bootstrapped 95% confidence intervals from 20 random data splits.

Extended Data Fig. 5 Performance on individual data sets with NDCG.

Analogous to Fig. 4, but using NDCG. (a) Other than the EVmutation Potts model, the DeepSequence VAE and the Profile HMM, none of which use supervised data, all other methods here used 240 labeled training sequences. Each colored dot is the average NDCG from 20 random train-test splits. Random horizontal jitter was added for display purposes. The bottom row of black dots indicates the effective MSA size, determined by accounting for sequence similarity with sample reweighting at an 80% identity cutoff. (b) Summary of how often each modeling strategy had maximal NDCG. These strategies were determined by first identifying the top-performing strategy for any given scenario and then also including any other strategy that came within the 95% confidence interval of the top performer.
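The effective MSA size in (a) can be illustrated with a simple O(N²) reweighting sketch: each aligned sequence is down-weighted by its number of neighbors at 80% identity or higher, and the effective size is the sum of the weights. The integer encoding and names are illustrative, and the authors' implementation may differ in detail.

```python
# Illustrative computation of effective MSA size via sequence reweighting.
import numpy as np

def effective_msa_size(msa, identity_cutoff=0.8):
    """msa: (N, L) array of integer-encoded aligned sequences."""
    msa = np.asarray(msa)
    weights = np.zeros(msa.shape[0])
    for i in range(msa.shape[0]):
        # Fraction of identical positions between sequence i and every sequence.
        identity = (msa == msa[i]).mean(axis=1)
        # Number of neighbors at >= 80% identity (always >= 1: includes self).
        weights[i] = 1.0 / np.sum(identity >= identity_cutoff)
    return float(weights.sum())
```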

Extended Data Fig. 6 The distribution of best model(s) on each data set.

Analogous to Fig. 4b, but varying the number of assay-labeled training examples. (a) Summary of how often each modeling strategy had maximal Spearman correlation. These strategies were determined by first identifying the top-performing strategy for any given scenario and then also including any other strategy that came within the 95% confidence interval of the top performer. Four settings are used: no assay-labeled data, training on 48 or 240 assay-labeled single-mutant examples, and the 80-20 train-test split setting. (b) Summary of how often each modeling strategy had maximal NDCG.
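The selection rule described in (a) and (b) amounts to the following sketch, assuming a dictionary mapping each strategy to its per-split scores; the bootstrap helper and all names are illustrative.

```python
# Sketch of the "best model(s)" rule: the top mean performer, plus any
# strategy whose mean falls within the top performer's bootstrapped 95% CI.
import numpy as np

def bootstrap_ci(x, n_boot=10000, seed=0):
    """Bootstrapped 95% confidence interval for the mean of x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, float)
    means = [rng.choice(x, size=len(x), replace=True).mean()
             for _ in range(n_boot)]
    return np.percentile(means, [2.5, 97.5])

def best_strategies(scores):
    """scores: dict mapping strategy name -> per-split scores."""
    means = {name: np.mean(v) for name, v in scores.items()}
    top = max(means, key=means.get)
    lo, hi = bootstrap_ci(scores[top])
    return [name for name, m in means.items() if lo <= m <= hi]
```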

Extended Data Fig. 7 Extrapolation performance from single and double mutants to higher-order mutants.

Analogous to Fig. 5, but training on a random sample drawn from both single and double mutants. Each column shows the performance when training on randomly sampled single and double mutants and then separately testing on single, double or triple mutants, none of which were in the training data. The total size (TS) indicates the total number of mutants of a particular order in all of the data. For example, 'TS=613' for single mutants means there were 613 single mutants in total in the data set that we sampled from. Error bars are centered at the mean and indicate bootstrapped 95% confidence intervals from 20 random data splits. See Supplementary Fig. 6 for the analogous plot using NDCG.
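The train-test construction used for these extrapolation experiments can be sketched as follows: group variants by mutation order (Hamming distance from the wild type), sample a fixed number of low-order variants for training, and hold out the rest by order for testing. This is an illustrative reconstruction under those assumptions, not the authors' exact code.

```python
# Sketch of an extrapolation split by mutation order.
import numpy as np

def extrapolation_split(seqs, wt, train_orders=(1, 2), n_train=240, seed=0):
    """Return training indices and per-order held-out test indices."""
    rng = np.random.default_rng(seed)
    # Mutation order = Hamming distance from the wild-type sequence.
    order = np.array([sum(a != b for a, b in zip(s, wt)) for s in seqs])
    pool = np.flatnonzero(np.isin(order, train_orders))
    train = rng.choice(pool, size=min(n_train, len(pool)), replace=False)
    test_by_order = {
        k: np.setdiff1d(np.flatnonzero(order == k), train)
        for k in np.unique(order)
    }
    return train, test_by_order
```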

Extended Data Fig. 8 Edit distance from wild-type sequence as a predictive model (UBE4B U-box domain).

Analogous to Fig. 6, but on the UBE4B U-box domain data set. We compared the performance of non-augmented evolutionary density models to two predictive models that use only the edit distance of a sequence from the wild type. In one version, the edit distance is defined as the number of mutations away from the wild type. In the other version, we used BLOSUM62 to compute the distance from the wild type, which thus accounts not only for the number of mutations but also for the type of each mutation. Each dot represents a UBE4B U-box domain sequence, with darker colors indicating larger distances from the wild type.
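Both distance-only baselines can be written in a few lines. The BLOSUM62-based distance below, which penalizes each mutated site by how dissimilar BLOSUM62 scores the substitution, is one plausible formulation assumed for illustration rather than taken from the authors' code; it uses Biopython's bundled substitution matrices.

```python
# Sketch of the two distance-from-wild-type baselines.
from Bio.Align import substitution_matrices

BLOSUM62 = substitution_matrices.load("BLOSUM62")

def hamming_distance(seq, wt):
    """Edit distance as the number of mutations away from the wild type."""
    return sum(a != b for a, b in zip(seq, wt))

def blosum_distance(seq, wt):
    """Distance that also accounts for the type of each mutation:
    identical residues add 0; dissimilar substitutions add more."""
    return sum(BLOSUM62[w, w] - BLOSUM62[w, s]
               for w, s in zip(wt, seq) if w != s)

# Negating either distance yields a simple fitness predictor.
```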

Extended Data Fig. 9 FoldX predictions as additional features in augmented models.

Each column shows the performance of augmented models with a single FoldX-derived stability feature added, when training on randomly sampled single mutants and then separately testing on single, double or triple mutants. It also shows a model augmented with two density features at once (VAE and Potts), without FoldX, labeled 'Augmented VAE + Potts'. Error bars are centered at the mean and indicate bootstrapped 95% confidence intervals from 20 random data splits. See Supplementary Fig. 7 for the analogous evaluation with NDCG.

Extended Data Fig. 10 Performance of linear model using only one feature per site (not per amino acid at each site).

In addition to the linear model with one-hot encoded, site-specific amino acid features, we also evaluated a simpler linear model with position-only features that encode which sites are mutated. The evaluation uses Spearman correlation. Each column shows the performance when training on randomly sampled single mutants and then separately testing on single, double or triple mutants, none of which were in the training data. Error bars are centered at the mean and indicate bootstrapped 95% confidence intervals from 20 random data splits.
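The difference between the two encodings is easiest to see in code. In this minimal sketch, `wt` is the wild-type sequence and the names are illustrative: the position-only encoding keeps one indicator per site, discarding the identity of the substituted amino acid that the site-specific one-hot encoding retains.

```python
# Position-only features: 1 at each mutated site, regardless of which amino
# acid was substituted (contrast with one feature per site x amino acid).
import numpy as np

def position_only_features(seq, wt):
    return np.array([float(a != b) for a, b in zip(seq, wt)])
```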

Supplementary information

Supplementary Information

Supplementary Figs. 1–16 and Table 1.

Reporting Summary


About this article


Cite this article

Hsu, C., Nisonoff, H., Fannjiang, C. et al. Learning protein fitness models from evolutionary and assay-labeled data. Nat Biotechnol 40, 1114–1122 (2022). https://doi.org/10.1038/s41587-021-01146-5

