Abstract
Protein engineering, which iteratively optimizes protein fitness by screening a gigantic mutational space, is constrained by experimental capacity, yet various machine learning models have substantially expedited it. Three-dimensional protein structures promise further advantages, but their intricate geometric complexity hinders their application in deep mutational screening. Persistent homology, an established algebraic topology tool for reducing protein structural complexity, fails to capture the homotopic shape evolution during the filtration of given data. Here we introduce a Topology-offered Protein Fitness (TopFit) framework to complement protein sequence and structure embeddings. Equipped with an ensemble regression strategy, TopFit integrates persistent spectral theory, a new topological Laplacian, with two auxiliary sequence embeddings to capture mutation-induced topological invariants, shape evolution and sequence disparity in the protein fitness landscape. The performance of TopFit is assessed on 34 benchmark datasets with 128,634 variants, covering a wide variety of protein structure acquisition modalities and training set sizes.
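As a rough illustration of the ingredients named above, the sketch below (a simplified stand-in, not the released TopFit pipeline) computes graph-Laplacian spectra of a point cloud over a small filtration grid, which captures part of what the zero-dimensional persistent Laplacian encodes, concatenates them with a placeholder sequence embedding and fits a ridge regressor. The radii, eigenvalue count, embedding dimension and toy data are illustrative assumptions.

```python
# Simplified sketch (not the released TopFit code): Laplacian spectra of
# neighborhood graphs at several filtration radii as a crude stand-in for
# persistent spectral features, concatenated with a sequence embedding and
# fed to ridge regression.  Radii, eigenvalue count and embedding dimension
# are hypothetical choices.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.linear_model import Ridge

def laplacian_spectrum_features(coords, radii=(4.0, 6.0, 8.0, 10.0), k=5):
    """Smallest k non-trivial graph-Laplacian eigenvalues at each radius."""
    dist = cdist(coords, coords)
    feats = []
    for r in radii:
        adj = (dist <= r).astype(float) - np.eye(len(coords))  # radius-r graph
        lap = np.diag(adj.sum(axis=1)) - adj                    # graph Laplacian
        evals = np.sort(np.linalg.eigvalsh(lap))
        feats.extend(evals[1:k + 1])                            # drop the zero mode
    return np.asarray(feats)

# Toy data: each "variant" gets mutation-site coordinates and a placeholder
# 32-dimensional sequence embedding (in practice, e.g. an ESM embedding).
rng = np.random.default_rng(0)
struct_feats = np.stack(
    [laplacian_spectrum_features(rng.normal(size=(30, 3)) * 5) for _ in range(64)]
)
seq_embed = rng.normal(size=(64, 32))
X = np.hstack([struct_feats, seq_embed])
y = rng.normal(size=64)                      # measured fitness would go here

model = Ridge(alpha=1.0).fit(X[:48], y[:48])
print(model.predict(X[48:]).shape)           # predictions for held-out variants
```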
Data availability
There are 34 DMS datasets with experimentally measured fitness used in this work, including the 32 DeepSequence datasets8, the avGFP dataset41 and the GB1 dataset42. The original data sources of the 32 DeepSequence datasets are provided in Supplementary Data 1 and Supplementary Note 7. Structure data were obtained from the PDB database22 and AF38, and the specific entry IDs are provided in Supplementary Data 1. The data analyzed and generated in this work, including sequence-to-fitness datasets, optimized structure data, MSAs, fine-tuned parameters for eUniRep models, predictions from evolutionary scores for individual mutations and sequence- and structure-based embeddings, are available at https://github.com/WeilabMSU/TopFit (ref. 65) and on our lab server at https://weilab.math.msu.edu/Downloads/TopFit/. Source data for Figs. 3–5 and Extended Data Figs. 1, 3, 5–10 are available with this paper. Source data for Extended Data Figs. 2 and 4 are available in Supplementary Data 2.
Code availability
All source codes and models are publicly available at https://github.com/WeilabMSU/TopFit (ref. 65).
References
Narayanan, H. et al. Machine learning for biologics: opportunities for protein engineering, developability, and formulation. Trends Pharmacol. Sci. 42, 151–165 (2021).
Arnold, F. H. Design by directed evolution. Acc. Chem. Res. 31, 125–131 (1998).
Karplus, M. & Kuriyan, J. Molecular dynamics and protein function. Proc. Natl Acad. Sci. USA 102, 6679–6685 (2005).
Wittmann, B. J., Johnston, K. E., Wu, Z. & Arnold, F. H. Advances in machine learning for directed evolution. Curr. Opin. Struct. Biol. 69, 11–18 (2021).
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
Hopf, T. A. et al. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics 35, 1582–1584 (2019).
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).
Rao, R. M. et al. MSA transformer. In International Conference on Machine Learning 8844–8856 (PMLR, 2021).
The UniProt Consortium. UniProt: the universal protein knowledge base in 2021. Nucleic Acids Res. 49, D480–D489 (2021).
Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).
Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations (2018).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Elnaggar, A. et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-n protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021).
Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In International Conference on Machine Learning 16990–17017 (PMLR, 2022).
Hsu, C., Nisonoff, H., Fannjiang, C. & Listgarten, J. Learning protein fitness models from evolutionary and assay-labeled data. Nat. Biotechnol. 40, 1114–1122 (2022).
Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
Schymkowitz, J. et al. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388 (2005).
Leman, J. K. et al. Macromolecular modeling and design in Rosetta: recent methods and frameworks. Nat. Methods 17, 665–680 (2020).
Edelsbrunner, H. & Harer, J. Computational Topology: An Introduction (American Mathematical Society, 2010).
Zomorodian, A. & Carlsson, G. Computing persistent homology. Discrete Comput. Geom. 33, 249–274 (2005).
Cang, Z. & Wei, G.-W. Integration of element specific persistent homology and machine learning for protein–ligand binding affinity prediction. Int. J. Numer. Methods Biomed. Eng. 34, e2914 (2018).
Wang, M., Cang, Z. & Wei, G.-W. A topology-based network tree for the prediction of protein–protein binding affinity changes following mutation. Nat. Mach. Intell. 2, 116–123 (2020).
Wang, R., Nguyen, D. D. & Wei, G.-W. Persistent spectral graph. Int. J. Numer. Methods Biomed. Eng. 36, e3376 (2020).
Mémoli, F., Wan, Z. & Wang, Y. Persistent Laplacians: properties, algorithms and implications. SIAM J. Math. Data Sci. 4, 858–884 (2022).
Meng, Z. & Xia, K. Persistent spectral-based machine learning (PerSpect ML) for protein–ligand binding affinity prediction. Sci. Adv. 7, eabc5329 (2021).
Wittmann, B. J., Yue, Y. & Arnold, F. H. Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Systems 12, 1026–1045 (2021).
Horak, D. & Jost, J. Spectra of combinatorial Laplace operators on simplicial complexes. Adv. Math. 244, 303–336 (2013).
Chung, F. R. K. & Graham, F. C. Spectral Graph Theory (American Mathematical Society, 1997).
Brouwer, A. E. & Haemers, W. H. Spectra of Graphs (Springer, New York, 2011).
Eckmann, B. Harmonische funktionen und randwertaufgaben in einem komplex. Comment. Math. Helv. 17, 240–255 (1944).
Kac, M. Can one hear the shape of a drum? Am. Math. Mon. 73, 1–23 (1966).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Livesey, B. J. & Marsh, J. A. Using deep mutational scanning to benchmark variant effect predictors and identify disease mutations. Mol. Syst. Biol. 16, e9380 (2020).
Qiu, Y., Hu, J. & Wei, G.-W. Cluster learning-assisted directed evolution. Nat. Comput. Sci. 1, 809–818 (2021).
Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).
Olson, C. A., Wu, N. C. & Sun, R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Curr. Biol. 24, 2643–2651 (2014).
Klesmith, J. R., Bacik, J.-P., Michalczyk, R. & Whitehead, T. A. Comprehensive sequence-flux mapping of a levoglucosan utilization pathway in E. coli. ACS Synth. Biol. 4, 1235–1243 (2015).
Bubenik, P. et al. Statistical topological data analysis using persistence landscapes. J. Mach. Learn. Res. 16, 77–102 (2015).
Adams, H. et al. Persistence images: a stable vector representation of persistent homology. J. Mach. Learn. Res. 18, 1–35 (2017).
Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proc. Natl Acad. Sci. USA 110, E193–E201 (2013).
Qiu, Y. & Wei, G.-W. Clade 2.0: evolution-driven cluster learning-assisted directed evolution. J. Chem. Inf. Model. 62, 4629–4641 (2022).
Rollins, N. J. et al. Inferring protein 3D structure from deep mutation scans. Nat. Genet. 51, 1170–1176 (2019).
Georgiev, A. G. Interpretable numerical descriptors of amino acid space. J. Comput. Biol. 16, 703–723 (2009).
Kawashima, S. & Kanehisa, M. AAindex: amino acid index database. Nucleic Acids Res. 28, 374–374 (2000).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1, 4171–4186 (2019).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Yu, F., Koltun, V. & Funkhouser, T. Dilated residual networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 472–480 (IEEE, 2017).
Humphrey, W., Dalke, A. & Schulten, K. VMD: visual molecular dynamics. J. Mol. Graph. 14, 33–38 (1996).
Xiang, Z. & Honig, B. Extending the accuracy limits of prediction for side-chain conformations. J. Mol. Biol. 311, 421–430 (2001).
Maria, C., Boissonnat, J.-D., Glisse, M. & Yvinec, M. The GUDHI library: simplicial complexes and persistent homology. In International Congress on Mathematical Software 167–174 (Springer, 2014).
Wang, R. et al. HERMES: persistent spectral graph software. Found. Data Sci. 3, 67 (2021).
Pedregosa, F. et al. scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Bergstra, J., Yamins, D. & Cox, D. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning 115–123 (PMLR, 2013).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (ACM, 2016).
Cheng, H.-T. et al. Wide and deep learning for recommender systems. In Proc. 1st Workshop on Deep Learning for Recommender Systems 7–10 (ACM, 2016).
Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
Järvelin, K. & Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 422–446 (2002).
Qiu, Y. YuchiQiu/TopFit: Nature Computational Science publication accompaniment (v1.0.0). Zenodo https://doi.org/10.5281/zenodo.7450235 (2022).
Acknowledgements
This work was supported in part by NIH grants R01GM126189 and R01AI164266; NSF grants DMS-2052983, DMS-1761320 and IIS-1900473; NASA grant 80NSSC21M0023; Michigan Economic Development Corporation; MSU Foundation; Bristol-Myers Squibb 65109 and Pfizer. We thank C. Hsu and J. Listgarten for helpful discussions.
Author information
Contributions
All authors conceived this work, and contributed to the original draft, review and editing. Y.Q. performed experiments and analyzed data. G.-W.W. provided supervision and resources and acquired funding.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Fernando Chirigati, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 The average performance of various models over 34 datasets.
a,b, This is a supplement to Fig. 3a. a, Line plots show the same data as Fig. 3a, with additional data for two TopFit strategies. b, Results are evaluated by NDCG. a,b, Ensemble regression is used, except ridge regression for the Georgiev and one-hot embeddings. For evolutionary scores, absolute values of the corresponding quantities (ρ in a and NDCG in b) are shown. The width of the shading shows the 95% confidence interval from n = 20 repeats.
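For readers unfamiliar with the ranking metric used in b, the following is a minimal NDCG sketch, assuming the standard cumulated-gain formulation of Järvelin and Kekäläinen: variants are ordered by the model's predicted score, measured fitness values serve as gains, and the sum is normalized by the ideal (fitness-sorted) ordering. The exact gain transform used in the paper may differ from this illustration.

```python
# Minimal NDCG sketch (illustrative; the paper's exact gain transform may differ).
import numpy as np

def ndcg(y_true, y_pred):
    order = np.argsort(y_pred)[::-1]                    # rank variants by prediction
    gains = y_true[order]                               # measured fitness as gain
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = np.sum(gains * discounts)
    ideal = np.sum(np.sort(y_true)[::-1] * discounts)   # best possible ordering
    return dcg / ideal

y_true = np.array([0.9, 0.1, 0.5, 0.7])   # toy fitness values
y_pred = np.array([0.8, 0.2, 0.4, 0.9])   # toy model predictions
print(round(ndcg(y_true, y_pred), 3))
```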
Extended Data Fig. 2 Spearman correlation for various models on individual datasets.
This is a supplement to Fig. 3a. Each line plot shows the average ρ for each dataset over n = 20 repeats. Training data sizes are 24, 96, 168 and 240. The width of the shading shows the 95% confidence interval.
Extended Data Fig. 3 The frequency that an embedding is ranked as the best across 34 datasets using Spearman correlation.
a–d, This is a supplement to Fig. 3b, including comparisons over different strategies. Histograms show the frequency that an embedding is ranked as the best across 34 datasets with 24, 96, 168 and 240 training data, respectively. For each dataset, the best embedding has an average ρ over n = 20 repeats within the 95% confidence interval of the embedding with the highest average ρ. Comparisons were performed for a, sequence-based embeddings; b, structure- and sequence-based embeddings; c, structure-based embeddings, sequence-based embeddings and evolutionary scores; and d, structure-based embeddings, sequence-based embeddings, evolutionary scores and two sets of TopFit (VAE+PST+ESM and VAE+PST+eUniRep). Absolute values of the Spearman correlation were shown and used for evolutionary scores.
Extended Data Fig. 4 NDCG for various models on individual datasets.
This is an analog of Extended Data Fig. 2 but using NDCG. Each line plot shows the average NDCG for each dataset over n = 20 repeats. Training data sizes are 24, 96, 168 and 240. The width of the shading shows the 95% confidence interval.
Extended Data Fig. 5 The frequency that an embedding is ranked as the best across 34 datasets using NDCG.
a–d, This is an analog of Extended Data Fig. 3 but measured by NDCG. Histograms show the frequency that an embedding is ranked as the best across 34 datasets with 24, 96, 168 and 240 training data, respectively. For each dataset, the best embedding has an average NDCG over n = 20 repeats within the 95% confidence interval of the embedding with the highest average NDCG. Comparisons were performed for a, sequence-based embeddings; b, structure- and sequence-based embeddings; c, structure-based embeddings, sequence-based embeddings and evolutionary scores; and d, structure-based embeddings, sequence-based embeddings, evolutionary scores and two sets of TopFit (VAE+PST+ESM and VAE+PST+eUniRep). Absolute values of NDCG were shown and used for evolutionary scores.
Extended Data Fig. 6 Relationships between quality of wild-type protein structure and PST performance.
a,b, This is a supplement to Fig. 3c. Boxplots show the distributions of a, percentages of coils in the protein structure over 34 datasets and b, the third quartile (Q3) of B factors at alpha carbons over 26 X-ray datasets. Datasets were classified into two classes depending on whether the PST embedding is the best embedding. Scatter plots show the same data as the boxplots but for individual datasets. A one-sided Mann–Whitney U-test examines the statistical significance that the two classes have different values. Boxplots display the five-number summary, where the center line shows the median, the upper and lower limits of the box show the upper and lower quartiles, and the upper and lower whiskers show the maximum and the minimum after excluding "outliers" outside the interquartile range. In a, sample sizes for PST ranked as the best model are n = 21, n = 15, n = 18 and n = 19 for training data sizes 24, 96, 168 and 240, respectively. Sample sizes for PST not ranked as the best model are n = 13, n = 19, n = 16 and n = 15 for training data sizes 24, 96, 168 and 240, respectively. The P values are 0.01, 3 × 10−5, 1 × 10−3 and 1 × 10−3 for training data sizes 24, 96, 168 and 240, respectively. In b, sample sizes for PST ranked as the best model are n = 8, n = 12, n = 14 and n = 15 for training data sizes 24, 96, 168 and 240, respectively. Sample sizes for PST not ranked as the best model are n = 18, n = 14, n = 12 and n = 11 for training data sizes 24, 96, 168 and 240, respectively. The P values are 0.03, 0.02, 0.07 and 0.02 for training data sizes 24, 96, 168 and 240, respectively.
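The kind of one-sided test described in this caption can be reproduced with SciPy as in the sketch below; the two arrays of coil percentages are hypothetical placeholders, not source data, and the one-sided direction shown is only one possible choice.

```python
# Illustrative one-sided Mann-Whitney U test of the kind described above:
# do datasets where PST is the best embedding have lower coil percentages
# than the rest?  The values below are placeholders, not source data.
import numpy as np
from scipy.stats import mannwhitneyu

coil_pct_pst_best = np.array([12.0, 18.5, 22.0, 15.3, 9.8])       # hypothetical
coil_pct_pst_not_best = np.array([35.2, 28.9, 41.0, 30.5, 26.7])  # hypothetical

# One-sided alternative: the first class tends to have smaller values.
stat, p_value = mannwhitneyu(coil_pct_pst_best, coil_pct_pst_not_best,
                             alternative="less")
print(f"U = {stat:.1f}, one-sided P = {p_value:.3g}")
```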
Extended Data Fig. 7 Model occurrence in ensemble regression.
This is a supplement to Fig. 3e showing the model occurrence on individual datasets. For each repeat, the top N = 3 regressors were picked and counted. Histograms count the model occurrence over 20 repeats.
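The top-N selection step can be pictured with the simplified sketch below (a stand-in for the paper's ensemble regression, not its released implementation): each candidate regressor is scored by the Spearman correlation of its out-of-fold predictions on the training split, the top N = 3 are refit, and their occurrences are what the histograms count. The candidate list and scoring details are illustrative assumptions.

```python
# Sketch of top-N regressor selection for an averaged ensemble (illustrative;
# the candidate set and scoring procedure are assumptions, not the paper's code).
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def top_n_regressors(X, y, n_top=3):
    candidates = [Ridge(alpha=1.0), Lasso(alpha=0.01),
                  RandomForestRegressor(n_estimators=100, random_state=0),
                  GradientBoostingRegressor(random_state=0)]
    # Score each regressor by Spearman rho of its out-of-fold predictions.
    scores = [spearmanr(y, cross_val_predict(m, X, y, cv=5))[0]
              for m in candidates]
    top = [candidates[i] for i in np.argsort(scores)[::-1][:n_top]]
    for m in top:
        m.fit(X, y)               # refit the selected regressors on all data
    return top                    # counting these over repeats gives the histograms

rng = np.random.default_rng(0)
X, y = rng.normal(size=(240, 16)), rng.normal(size=240)   # toy features/fitness
print([type(m).__name__ for m in top_n_regressors(X, y)])
```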
Extended Data Fig. 8 Comparisons between TopFit and other methods for mutation effects prediction using Spearman correlation.
a,b, This is an analog of Fig. 4a,b, but TopFit combines the VAE score, eUniRep embedding and PST embedding. All supervised models use 240 labeled training data. Results are evaluated by the Spearman correlation ρ. DeepSequence VAE takes the absolute value of ρ. The average ρ from n = 20 repeats is shown. All 34 datasets are categorized by the structure modality used: X-ray, nuclear magnetic resonance (NMR), AlphaFold (AF) and cryogenic electron microscopy (EM). a, Dot plots show results across 34 datasets. b, Dot plots show a pairwise comparison between TopFit and one method in each plot. Medians of the difference in average Spearman correlation Δρ across all datasets are shown. A one-sided rank-sum test determines the statistical significance that TopFit performs better than the VAE score, eUniRep embedding and PST embedding, with P values of 3 × 10−7, 2 × 10−7 and 4 × 10−7, respectively.
Extended Data Fig. 9 Comparisons between TopFit and other methods. TopFit consists of the VAE score, ESM embedding and PST embedding.
This is a supplement to Fig. 4b including results with various numbers of training data. Average Spearman correlations from n = 20 repeats are shown, and all datasets are categorized by the structure modality used: X-ray, nuclear magnetic resonance (NMR), AlphaFold (AF) and cryogenic electron microscopy (EM). A one-sided rank-sum test determines the statistical significance that TopFit performs better than the other strategies, except that, with 24 training data, the null hypothesis is that TopFit performs worse than VAE. The P values are shown in the corresponding subfigures: (1) TopFit versus VAE, P = 4 × 10−6, 2 × 10−5 and 1 × 10−6; (2) TopFit versus ESM, P = 2 × 10−7, 2 × 10−7 and 2 × 10−7; and (3) TopFit versus PST, P = 2 × 10−7, 2 × 10−7 and 2 × 10−7, for training data sizes 24, 96 and 168, respectively.
Extended Data Fig. 10 Comparisons between TopFit and other methods. TopFit consists of the VAE score, eUniRep embedding and PST embedding.
This is a supplement to Extended Data Fig. 8b including results with various numbers of training data. Average Spearman correlations from n = 20 repeats are shown, and all datasets are categorized by the structure modality used: X-ray, nuclear magnetic resonance (NMR), AlphaFold (AF) and cryogenic electron microscopy (EM). A one-sided rank-sum test determines the statistical significance that TopFit performs better than the other strategies, except that, with 24 training data, the null hypothesis is that TopFit performs worse than VAE. The P values are shown in the corresponding subfigures: (1) TopFit versus VAE, P = 3 × 10−6, 3 × 10−5 and 8 × 10−7; (2) TopFit versus eUniRep, P = 4 × 10−7, 2 × 10−7 and 2 × 10−7; and (3) TopFit versus PST, P = 3 × 10−7, 2 × 10−7 and 3 × 10−7, for training data sizes 24, 96 and 168, respectively.
Supplementary information
Supplementary Information
Supplementary Figs. 1–16, Tables 1–5 and Notes 1–7.
Supplementary Data 1
Dataset information. List of datasets, including their UniProt ID, the structure data ID, reference, sequence region for mutations and so on.
Supplementary Data 2
Raw data for computational results. Each folder is named after its dataset. Please refer to Supplementary Data 1 for the correspondence between the dataset naming convention used in this file and that in the figures. Datasets with the additional suffix "_AF" or "_NMR" indicate that an AF or NMR structure was used for the computations. In each folder, results are saved as ".csv" files with eight possible names for different tasks. "ridge.csv", "ensemble.csv" and "evolutionary_scores.csv" are results from random train/test splits using ridge regression, ensemble regression and evolutionary scores, respectively. "unseen_ensemble.csv" and "unseen_evolutionary_scores.csv" are results from train/test splits for unseen mutational sites using ensemble regression and evolutionary scores, respectively. "ridge_n_mut_1.csv", "ensemble_n_mut_1.csv" and "evolutionary_scores_n_mut_1.csv" are results from the extrapolation task, in which single mutations are used to predict multiple mutations, for ridge regression, ensemble regression and evolutionary scores, respectively. Explanations of the column names are available in "README.txt". This file provides raw data for all figures, Extended Data figures and Supplementary figures generated in this work, excluding Fig. 5, which is available from its own source data. Statistical data for Extended Data Figs. 2 and 4 are available directly from this data.
Source data
Source Data Fig. 3
Statistical source data.
Source Data Fig. 4
Statistical source data.
Source Data Fig. 5
Statistical source data.
Source Data Extended Data Fig. 1
Statistical source data.
Source Data Extended Data Fig. 3
Statistical source data.
Source Data Extended Data Fig. 5
Statistical source data.
Source Data Extended Data Fig. 6
Statistical source data.
Source Data Extended Data Fig. 7
Statistical source data.
Source Data Extended Data Fig. 8
Statistical source data.
Source Data Extended Data Fig. 9
Statistical source data.
Source Data Extended Data Fig. 10
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qiu, Y. & Wei, G.-W. Persistent spectral theory-guided protein engineering. Nat. Comput. Sci. 3, 149–163 (2023). https://doi.org/10.1038/s43588-022-00394-y