
Persistent spectral theory-guided protein engineering

A preprint version of the article is available at bioRxiv.

Abstract

Protein engineering, which iteratively optimizes protein fitness by screening a gigantic mutational space, is constrained by experimental capacity, but various machine learning models have substantially expedited it. Three-dimensional protein structures promise further advantages, yet their intricate geometric complexity hinders their application in deep mutational screening. Persistent homology, an established algebraic topology tool for reducing protein structural complexity, fails to capture the homotopic shape evolution during filtration of given data. Here we introduce a Topology-offered Protein Fitness (TopFit) framework to complement protein sequence and structure embeddings. Equipped with an ensemble regression strategy, TopFit integrates persistent spectral theory, a new topological Laplacian, with two auxiliary sequence embeddings to capture mutation-induced topological invariants, shape evolution and sequence disparity in the protein fitness landscape. The performance of TopFit is assessed on 34 benchmark datasets with 128,634 variants, covering a wide variety of protein structure acquisition modalities and training set sizes.
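
As an illustrative, hedged sketch of the workflow outlined above (not the authors' implementation), the following Python snippet concatenates a structure-based embedding with a sequence-based embedding and averages a small ensemble of regressors to predict fitness, scoring the result with the Spearman correlation; the array names, regressor choices and synthetic data are assumptions for illustration only.

    # Minimal sketch of an embedding-plus-ensemble fitness regression,
    # loosely following the workflow described in the abstract.
    # `pst_embedding`, `seq_embedding` and `fitness` are hypothetical
    # placeholders, not files shipped with the paper.
    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.linear_model import Ridge
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_variants = 500
    pst_embedding = rng.normal(size=(n_variants, 64))   # structure-based features
    seq_embedding = rng.normal(size=(n_variants, 128))  # sequence-based features
    fitness = rng.normal(size=n_variants)               # assay-labelled fitness

    X = np.hstack([pst_embedding, seq_embedding])       # concatenate embeddings
    X_tr, X_te, y_tr, y_te = train_test_split(X, fitness, train_size=240, random_state=0)

    # Average the predictions of a few simple regressors
    # (a stand-in for an ensemble regression strategy).
    models = [Ridge(alpha=1.0), GradientBoostingRegressor(random_state=0)]
    preds = np.mean([m.fit(X_tr, y_tr).predict(X_te) for m in models], axis=0)

    rho, _ = spearmanr(preds, y_te)  # rank correlation on held-out variants
    print(f"Spearman rho = {rho:.3f}")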

Fig. 1: Conceptual diagram of the TopFit method.
Fig. 2: PST for topological persistence and homotopic shape evolution.
Fig. 3: Prediction from single embedding on fitness landscape measured by Spearman correlations.
Fig. 4: Comparisons between TopFit embeddings and other methods for fitness prediction.
Fig. 5: Comparisons between TopFit and other regression models for fitness predictions using Spearman correlation.

Data availability

There are 34 DMS datasets with experimentally measured fitness used in this work, including: 32 DeepSequence datasets (ref. 8), the avGFP dataset (ref. 41) and the GB1 dataset (ref. 42). The original data sources of the 32 DeepSequence datasets are provided in Supplementary Data 1 and Supplementary Note 7. Structure data were obtained from the PDB database (ref. 22) and AlphaFold (AF; ref. 38), and the specific entry IDs are provided in Supplementary Data 1. The data analyzed and generated in this work, including sequence-to-fitness datasets, optimized structure data, MSAs, fine-tuned parameters for eUniRep models, predictions from evolutionary scores for individual mutations and sequence- and structure-based embeddings, are available at https://github.com/WeilabMSU/TopFit (ref. 65) and our lab server https://weilab.math.msu.edu/Downloads/TopFit/. Source data for Figs. 3–5 and Extended Data Figs. 1, 3 and 5–10 are available with this paper. Source data for Extended Data Figs. 2 and 4 are available in Supplementary Data 2.

Code availability

All source codes and models are publicly available at https://github.com/WeilabMSU/TopFit (ref. 65).

References

  1. Narayanan, H. et al. Machine learning for biologics: opportunities for protein engineering, developability, and formulation. Trends Pharmacol. Sci. 42, 151–165 (2021).

  2. Arnold, F. H. Design by directed evolution. Acc. Chem. Res. 31, 125–131 (1998).

  3. Karplus, M. & Kuriyan, J. Molecular dynamics and protein function. Proc. Natl Acad. Sci. USA 102, 6679–6685 (2005).

  4. Wittmann, B. J., Johnston, K. E., Wu, Z. & Arnold, F. H. Advances in machine learning for directed evolution. Curr. Opin. Struct. Biol. 69, 11–18 (2021).

  5. Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).

  6. Hopf, T. A. et al. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics 35, 1582–1584 (2019).

  7. Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).

  8. Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).

  9. Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).

  10. Rao, R. M. et al. MSA transformer. In International Conference on Machine Learning 8844–8856 (PMLR, 2021).

  11. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480–D489 (2021).

  12. Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).

  13. Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations (2018).

  14. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

  15. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

  16. Elnaggar, A. et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).

  17. Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).

  18. Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021).

  19. Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In International Conference on Machine Learning 16990–17017 (PMLR, 2022).

  20. Hsu, C., Nisonoff, H., Fannjiang, C. & Listgarten, J. Learning protein fitness models from evolutionary and assay-labeled data. Nat. Biotechnol. 40, 1114–1122 (2022).

  21. Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021).

  22. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).

  23. Schymkowitz, J. et al. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388 (2005).

  24. Leman, J. K. et al. Macromolecular modeling and design in Rosetta: recent methods and frameworks. Nat. Methods 17, 665–680 (2020).

  25. Edelsbrunner, H. & Harer, J. Computational Topology: An Introduction (American Mathematical Society, 2010).

  26. Zomorodian, A. & Carlsson, G. Computing persistent homology. Discrete Comput. Geom. 33, 249–274 (2005).

  27. Cang, Z. & Wei, G.-W. Integration of element specific persistent homology and machine learning for protein–ligand binding affinity prediction. Int. J. Numer. Methods Biomed. Eng. 34, e2914 (2018).

  28. Wang, M., Cang, Z. & Wei, G.-W. A topology-based network tree for the prediction of protein–protein binding affinity changes following mutation. Nat. Mach. Intell. 2, 116–123 (2020).

  29. Wang, R., Nguyen, D. D. & Wei, G.-W. Persistent spectral graph. Int. J. Numer. Methods Biomed. Eng. 36, e3376 (2020).

  30. Mémoli, F., Wan, Z. & Wang, Y. Persistent Laplacians: properties, algorithms and implications. SIAM J. Math. Data Sci. 4, 858–884 (2022).

  31. Meng, Z. & Xia, K. Persistent spectral-based machine learning (PerSpect ML) for protein–ligand binding affinity prediction. Sci. Adv. 7, eabc5329 (2021).

  32. Wittmann, B. J., Yue, Y. & Arnold, F. H. Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Syst. 12, 1026–1045 (2021).

  33. Horak, D. & Jost, J. Spectra of combinatorial Laplace operators on simplicial complexes. Adv. Math. 244, 303–336 (2013).

  34. Chung, F. R. K. & Graham, F. C. Spectral Graph Theory (American Mathematical Society, 1997).

  35. Brouwer, A. E. & Haemers, W. H. Spectra of Graphs (Springer, New York, 2011).

  36. Eckmann, B. Harmonische funktionen und randwertaufgaben in einem komplex. Comment. Math. Helv. 17, 240–255 (1944).

  37. Kac, M. Can one hear the shape of a drum? Am. Math. Mon. 73, 1–23 (1966).

  38. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  39. Livesey, B. J. & Marsh, J. A. Using deep mutational scanning to benchmark variant effect predictors and identify disease mutations. Mol. Syst. Biol. 16, e9380 (2020).

  40. Qiu, Y., Hu, J. & Wei, G.-W. Cluster learning-assisted directed evolution. Nat. Comput. Sci. 1, 809–818 (2021).

  41. Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).

  42. Olson, C. A., Wu, N. C. & Sun, R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Curr. Biol. 24, 2643–2651 (2014).

  43. Klesmith, J. R., Bacik, J.-P., Michalczyk, R. & Whitehead, T. A. Comprehensive sequence-flux mapping of a levoglucosan utilization pathway in E. coli. ACS Synth. Biol. 4, 1235–1243 (2015).

  44. Bubenik, P. et al. Statistical topological data analysis using persistence landscapes. J. Mach. Learn. Res. 16, 77–102 (2015).

  45. Adams, H. et al. Persistence images: a stable vector representation of persistent homology. J. Mach. Learn. Res. 18, 1–35 (2017).

  46. Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proc. Natl Acad. Sci. USA 110, E193–E201 (2013).

  47. Qiu, Y. & Wei, G.-W. Clade 2.0: evolution-driven cluster learning-assisted directed evolution. J. Chem. Inf. Model. 62, 4629–4641 (2022).

  48. Rollins, N. J. et al. Inferring protein 3D structure from deep mutation scans. Nat. Genet. 51, 1170–1176 (2019).

  49. Georgiev, A. G. Interpretable numerical descriptors of amino acid space. J. Comput. Biol. 16, 703–723 (2009).

  50. Kawashima, S. & Kanehisa, M. AAIndex: amino acid index database. Nucleic Acids Res. 28, 374–374 (2000).

  51. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).

  52. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1, 4171–4186 (2019).

  53. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).

  54. Yu, F., Koltun, V. & Funkhouser, T. Dilated residual networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 472–480 (IEEE, 2017).

  55. Humphrey, W., Dalke, A. & Schulten, K. VMD: visual molecular dynamics. J. Mol. Graph. 14, 33–38 (1996).

  56. Xiang, Z. & Honig, B. Extending the accuracy limits of prediction for side-chain conformations. J. Mol. Biol. 311, 421–430 (2001).

  57. Maria, C., Boissonnat, J.-D., Glisse, M. & Yvinec, M. The GUDHI library: simplicial complexes and persistent homology. In International Congress on Mathematical Software 167–174 (Springer, 2014).

  58. Wang, R. et al. HERMES: persistent spectral graph software. Found. Data Sci. 3, 67 (2021).

  59. Pedregosa, F. et al. scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  60. Bergstra, J., Yamins, D. & Cox, D. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning 115–123 (PMLR, 2013).

  61. Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (ACM, 2016).

  62. Cheng, H.-T. et al. Wide and deep learning for recommender systems. In Proc. 1st Workshop on Deep Learning for Recommender Systems 7–10 (ACM, 2016).

  63. Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).

  64. Järvelin, K. & Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 422–446 (2002).

  65. Qiu, Y. YuchiQiu/TopFit: Nature Computational Science publication accompaniment (v1.0.0). Zenodo https://doi.org/10.5281/zenodo.7450235 (2022).

Download references

Acknowledgements

This work was supported in part by NIH grants R01GM126189 and R01AI164266; NSF grants DMS-2052983, DMS-1761320 and IIS-1900473; NASA grant 80NSSC21M0023; Michigan Economic Development Corporation; MSU Foundation; Bristol-Myers Squibb 65109 and Pfizer. We thank C. Hsu and J. Listgarten for helpful discussions.

Author information

Authors and Affiliations

Authors

Contributions

All authors conceived this work, and contributed to the original draft, review and editing. Y.Q. performed experiments and analyzed data. G.-W.W. provided supervision and resources and acquired funding.

Corresponding author

Correspondence to Guo-Wei Wei.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Fernando Chirigati, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 The average performance of various models over 34 datasets.

a,b, This is a supplement for Fig. 3a. a, Line plots show the same data as Fig. 3a, with additional data for two TopFit strategies. b, Results are evaluated by NDCG. a,b, Ensemble regression is used, except ridge regression for the Georgiev and one-hot embeddings. Absolute values of ρ are shown for evolutionary scores. The width of the shade shows the 95% confidence interval from n = 20 repeats.

Source data
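
A minimal sketch of the summary statistics reported here, assuming a normal-approximation confidence interval over per-repeat Spearman correlations (the authors' exact interval construction may differ):

    # Mean Spearman rho and a 95% confidence interval over repeated
    # train/test splits (normal approximation over per-repeat values).
    import numpy as np
    from scipy.stats import spearmanr

    def summarize(rhos, z=1.96):
        rhos = np.asarray(rhos, dtype=float)
        mean = rhos.mean()
        half_width = z * rhos.std(ddof=1) / np.sqrt(len(rhos))  # normal approximation
        return mean, (mean - half_width, mean + half_width)

    # Synthetic per-repeat correlations standing in for n = 20 repeats.
    rng = np.random.default_rng(1)
    rhos = [spearmanr(rng.normal(size=50), rng.normal(size=50))[0] for _ in range(20)]
    mean_rho, ci = summarize(rhos)
    print(f"mean rho = {mean_rho:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")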

Extended Data Fig. 2 Spearman correlation for various models on individual datasets.

This is a supplement for Fig. 3a. Each line plot shows the average ρ for each dataset over n = 20 repeats. Training data sizes are 24, 96, 168 and 240. The width of the shade shows the 95% confidence interval.

Extended Data Fig. 3 The frequency that an embedding is ranked as the best across 34 datasets using Spearman correlation.

a–d, This is a supplement for Fig. 3b including comparisons over different strategies. Histograms show the frequency that an embedding is ranked as the best across 34 datasets with 24, 96, 168 and 240 training data, respectively. For each dataset, the best embedding has an average ρ over n = 20 repeats within the 95% confidence interval of the embedding with the highest average ρ. Comparisons were performed for: a, sequence-based embeddings; b, structure- and sequence-based embeddings; c, structure-based embeddings, sequence-based embeddings and evolutionary scores; and d, structure-based embeddings, sequence-based embeddings, evolutionary scores and two sets of TopFit (VAE+PST+ESM and VAE+PST+eUniRep). Absolute values of the Spearman correlation are used for evolutionary scores.

Source data

Extended Data Fig. 4 NDCG for various models on individual datasets.

This is an analog of Extended Data Fig. 2 but using NDCG. Each line plot shows the average NDCG for each dataset over n = 20 repeats. Training data sizes are 24, 96, 168 and 240. The width of the shade shows the 95% confidence interval.
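
NDCG, the ranking metric used here (following ref. 64), rewards placing high-fitness variants near the top of the predicted ordering. A minimal sketch, assuming true fitness values as gains and the standard logarithmic discount (the paper's exact convention may differ):

    # Normalized discounted cumulative gain for a predicted fitness ranking.
    # Gains are the true fitness values; the log2 discount is one common
    # convention and may differ in detail from the paper's normalization.
    import numpy as np

    def dcg(gains):
        gains = np.asarray(gains, dtype=float)
        discounts = np.log2(np.arange(2, gains.size + 2))
        return np.sum(gains / discounts)

    def ndcg(y_true, y_pred):
        y_true = np.asarray(y_true, dtype=float)
        order = np.argsort(y_pred)[::-1]   # rank variants by predicted fitness
        ideal = np.sort(y_true)[::-1]      # best possible ordering
        return dcg(y_true[order]) / dcg(ideal)

    y_true = np.array([0.1, 0.9, 0.4, 0.7])
    y_pred = np.array([0.2, 0.8, 0.3, 0.9])
    print(f"NDCG = {ndcg(y_true, y_pred):.3f}")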

Extended Data Fig. 5 The frequency that an embedding is ranked as the best across 34 datasets using NDCG.

a–d, This is an analog of Extended Data Fig. 3 but measured by NDCG. Histograms show the frequency that an embedding is ranked as the best across 34 datasets with 24, 96, 168 and 240 training data, respectively. For each dataset, the best embedding has an average NDCG over n = 20 repeats within the 95% confidence interval of the embedding with the highest average NDCG. Comparisons were performed for: a, sequence-based embeddings; b, structure- and sequence-based embeddings; c, structure-based embeddings, sequence-based embeddings and evolutionary scores; and d, structure-based embeddings, sequence-based embeddings, evolutionary scores and two sets of TopFit (VAE+PST+ESM and VAE+PST+eUniRep). Absolute values of NDCG are used for evolutionary scores.

Source data

Extended Data Fig. 6 Relationships between quality of wild-type protein structure and PST performance.

a,b, This is a supplement for Fig. 3c. Boxplots show the distribution of a, percentages of coils in the protein structure over 34 datasets and b, the third quartile (Q3) of B factors at alpha carbons over 26 X-ray datasets. Datasets were classified into two classes depending on whether the PST embedding is the best embedding. Scatter plots show the same data as the boxplots but for individual datasets. A one-sided Mann–Whitney U-test examines the statistical significance that the two classes have different values. Boxplots display the five-number summary: the center line shows the median, the upper and lower limits of the box show the upper and lower quartiles, and the upper and lower whiskers show the maximum and the minimum after excluding “outliers” outside the interquartile range. In a, sample sizes for PST ranked as the best model are n = 21, n = 15, n = 18 and n = 19 for training data sizes 24, 96, 168 and 240, respectively. Sample sizes for PST not ranked as the best model are n = 13, n = 19, n = 16 and n = 15 for training data sizes 24, 96, 168 and 240, respectively. The P values are 0.01, 3 × 10−5, 1 × 10−3 and 1 × 10−3 for training data sizes 24, 96, 168 and 240, respectively. In b, sample sizes for PST ranked as the best model are n = 8, n = 12, n = 14 and n = 15 for training data sizes 24, 96, 168 and 240, respectively. Sample sizes for PST not ranked as the best model are n = 18, n = 14, n = 12 and n = 11 for training data sizes 24, 96, 168 and 240, respectively. The P values are 0.03, 0.02, 0.07 and 0.02 for training data sizes 24, 96, 168 and 240, respectively.

Source data
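
A minimal sketch of the one-sided Mann–Whitney U-test used above, with hypothetical coil-percentage values for the two classes of datasets and an assumed test direction:

    # One-sided Mann-Whitney U-test comparing a structural property between
    # datasets where PST is (or is not) the best embedding. The two groups
    # are hypothetical values, and the alternative hypothesis direction is
    # an assumption for illustration.
    import numpy as np
    from scipy.stats import mannwhitneyu

    coil_best = np.array([0.18, 0.22, 0.25, 0.30, 0.21])      # PST ranked best
    coil_not_best = np.array([0.35, 0.41, 0.28, 0.39, 0.44])  # PST not ranked best

    stat, p_value = mannwhitneyu(coil_best, coil_not_best, alternative="less")
    print(f"U = {stat:.1f}, one-sided P = {p_value:.4f}")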

Extended Data Fig. 7 Model occurrence in ensemble regression.

This is a supplement for Fig. 3e to show model occurrence on individual datasets. For each repeat, the top N = 3 regressors were picked and counted. Histograms count the model occurrence over 20 repeats.

Source data
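
A minimal sketch of an ensemble strategy of this kind, selecting the top N = 3 regressors by validation Spearman correlation and averaging their predictions; the candidate model pool and synthetic data are assumptions for illustration, not the exact pool used in the paper:

    # Pick the top N = 3 regressors by validation Spearman correlation and
    # average their test predictions.
    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.linear_model import Ridge, Lasso
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 32))
    y = X[:, 0] + 0.1 * rng.normal(size=300)        # synthetic fitness signal
    X_tr, X_val, X_te = X[:160], X[160:220], X[220:]
    y_tr, y_val, y_te = y[:160], y[160:220], y[220:]

    candidates = [Ridge(), Lasso(alpha=0.01), RandomForestRegressor(random_state=0),
                  GradientBoostingRegressor(random_state=0), KNeighborsRegressor()]

    scored = []
    for model in candidates:
        model.fit(X_tr, y_tr)
        rho = spearmanr(model.predict(X_val), y_val)[0]  # validation performance
        scored.append((rho, model))

    top3 = [m for _, m in sorted(scored, key=lambda t: t[0], reverse=True)[:3]]
    ensemble_pred = np.mean([m.predict(X_te) for m in top3], axis=0)
    print(f"test Spearman = {spearmanr(ensemble_pred, y_te)[0]:.3f}")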

Extended Data Fig. 8 Comparisons between TopFit and other methods for mutation effects prediction using Spearman correlation.

a,b, This is an analog of Fig. 4a,b, but TopFit combines the VAE score, eUniRep embedding and PST embedding. All supervised models use 240 labeled training data. Results are evaluated by the Spearman correlation ρ. DeepSequence VAE takes the absolute value of ρ. The average ρ from n = 20 repeats is shown. All 34 datasets are categorized by the structure modality used: X-ray, nuclear magnetic resonance (NMR), AlphaFold (AF) and cryogenic electron microscopy (EM). a, Dot plots show results across the 34 datasets. b, Dot plots show a pairwise comparison between TopFit and one other method in each plot. Medians of the difference in average Spearman correlation, Δρ, across all datasets are shown. A one-sided rank-sum test determines the statistical significance that TopFit performs better than the VAE score, eUniRep embedding and PST embedding, with P values 3 × 10−7, 2 × 10−7 and 4 × 10−7, respectively.

Source data

Extended Data Fig. 9 Comparisons between TopFit and other methods. TopFit consists of the VAE score, ESM embedding and PST embedding.

This is a supplement for Fig. 4b to include results with various numbers of training data. The average Spearman correlation from n = 20 repeats is shown, and all datasets are categorized by the structure modality used: X-ray, nuclear magnetic resonance (NMR), AlphaFold (AF) and cryogenic electron microscopy (EM). A one-sided rank-sum test determines the statistical significance that TopFit performs better than the other strategies, except that the null hypothesis that TopFit performs worse than VAE is used for 24 training data. The P values are shown in the corresponding subfigures. They are: (1) TopFit versus VAE: P = 4 × 10−6, 2 × 10−5 and 1 × 10−6; (2) TopFit versus ESM: P = 2 × 10−7, 2 × 10−7 and 2 × 10−7; and (3) TopFit versus PST: P = 2 × 10−7, 2 × 10−7 and 2 × 10−7 for training data sizes 24, 96 and 168, respectively.

Source data

Extended Data Fig. 10 Comparisons between TopFit and other methods. TopFit consists of the VAE score, eUniRep embedding and PST embedding.

This is a supplement for Extended Data Fig. 8b to include results with various numbers of training data. The average Spearman correlation from n = 20 repeats is shown, and all datasets are categorized by the structure modality used: X-ray, nuclear magnetic resonance (NMR), AlphaFold (AF) and cryogenic electron microscopy (EM). A one-sided rank-sum test determines the statistical significance that TopFit performs better than the other strategies, except that the null hypothesis that TopFit performs worse than VAE is used for 24 training data. The P values are shown in the corresponding subfigures. They are: (1) TopFit versus VAE: P = 3 × 10−6, 3 × 10−5 and 8 × 10−7; (2) TopFit versus eUniRep: P = 4 × 10−7, 2 × 10−7 and 2 × 10−7; and (3) TopFit versus PST: P = 3 × 10−7, 2 × 10−7 and 3 × 10−7 for training data sizes 24, 96 and 168, respectively.

Source data

Supplementary information

Supplementary Information

Supplementary Figs. 1–16, Tables 1–5 and Notes 1–7.

Reporting Summary

Peer Review file

Supplementary Data 1

Dataset information. List of datasets, including their UniProt ID, the structure data ID, reference, sequence region for mutations and so on.

Supplementary Data 2

Raw data for computational results. Each folder is named after a dataset. Please refer to Supplementary Data 1 for the correspondence between the dataset naming convention used in this file and that used in the figures. Datasets with the additional suffix “_AF” or “_NMR” indicate that an AF or NMR structure was used for the computations. In each folder, results are saved in “.csv” files with eight possible names for different tasks. “ridge.csv”, “ensemble.csv” and “evolutionary_scores.csv” are results from random train/test splits using ridge regression, ensemble regression and evolutionary scores, respectively. “unseen_ensemble.csv” and “unseen_evolutionary_scores.csv” are results from train/test splits for unseen mutational sites using ensemble regression and evolutionary scores, respectively. “ridge_n_mut_1.csv”, “ensemble_n_mut_1.csv” and “evolutionary_scores_n_mut_1.csv” are results from the extrapolation task in which single mutations are used to predict multiple mutations, for ridge regression, ensemble regression and evolutionary scores, respectively. Explanations of the column names are available in “README.txt”. This file provides raw data for all figures, Extended Data figures and Supplementary figures generated in this work, excluding Fig. 5, which is available from its own source data. Statistical data for Extended Data Figs. 2 and 4 are available directly from this data.
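
A minimal sketch for browsing this layout with pandas; the folder name below is a placeholder, and the actual dataset names and column meanings should be taken from Supplementary Data 1 and README.txt:

    # Load result files following the Supplementary Data 2 layout described
    # above. "BLAT_ECOLX" is a placeholder dataset-folder name; consult
    # Supplementary Data 1 and README.txt for the real names and columns.
    from pathlib import Path
    import pandas as pd

    data_root = Path("SupplementaryData2")   # wherever the archive was unpacked
    dataset_dir = data_root / "BLAT_ECOLX"   # placeholder folder name

    for name in ["ridge.csv", "ensemble.csv", "evolutionary_scores.csv"]:
        path = dataset_dir / name
        if path.exists():
            df = pd.read_csv(path)
            print(name, df.shape, list(df.columns)[:5])  # peek at the first columns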

Source data

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Fig. 5

Statistical source data.

Source Data Extended Data Fig. 1

Statistical source data.

Source Data Extended Data Fig. 3

Statistical source data.

Source Data Extended Data Fig. 5

Statistical source data.

Source Data Extended Data Fig. 6

Statistical source data.

Source Data Extended Data Fig. 7

Statistical source data.

Source Data Extended Data Fig. 8

Statistical source data.

Source Data Extended Data Fig. 9

Statistical source data.

Source Data Extended Data Fig. 10

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Qiu, Y. & Wei, G.-W. Persistent spectral theory-guided protein engineering. Nat. Comput. Sci. 3, 149–163 (2023). https://doi.org/10.1038/s43588-022-00394-y

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s43588-022-00394-y
