Abstract
Protein engineering, which iteratively optimizes protein fitness by screening a gigantic mutational space, is constrained by experimental capacity, yet various machine learning models have substantially expedited it. Three-dimensional protein structures promise further advantages, but their intricate geometric complexity hinders their application in deep mutational screening. Persistent homology, an established algebraic topology tool for reducing protein structural complexity, fails to capture the homotopic shape evolution during the filtration of given data. Here we introduce a Topology-offered Protein Fitness (TopFit) framework to complement protein sequence and structure embeddings. Equipped with an ensemble regression strategy, TopFit integrates persistent spectral theory, a new topological Laplacian, with two auxiliary sequence embeddings to capture mutation-induced topological invariants, shape evolution and sequence disparity in the protein fitness landscape. The performance of TopFit is assessed on 34 benchmark datasets with 128,634 variants, covering a wide variety of protein structure acquisition modalities and training set sizes.
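As a rough illustration of the ingredients named above, the sketch below (a simplified stand-in, not the released TopFit pipeline) computes graph-Laplacian spectra of a point cloud over a small filtration grid, which captures part of what the zero-dimensional persistent Laplacian encodes, concatenates them with a placeholder sequence embedding and fits a ridge regressor. The radii, eigenvalue count, embedding dimension and toy data are illustrative assumptions.

```python
# Simplified sketch (not the released TopFit code): Laplacian spectra of
# neighborhood graphs at several filtration radii as a crude stand-in for
# persistent spectral features, concatenated with a sequence embedding and
# fed to ridge regression.  Radii, eigenvalue count and embedding dimension
# are hypothetical choices.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.linear_model import Ridge

def laplacian_spectrum_features(coords, radii=(4.0, 6.0, 8.0, 10.0), k=5):
    """Smallest k non-trivial graph-Laplacian eigenvalues at each radius."""
    dist = cdist(coords, coords)
    feats = []
    for r in radii:
        adj = (dist <= r).astype(float) - np.eye(len(coords))  # radius-r graph
        lap = np.diag(adj.sum(axis=1)) - adj                    # graph Laplacian
        evals = np.sort(np.linalg.eigvalsh(lap))
        feats.extend(evals[1:k + 1])                            # drop the zero mode
    return np.asarray(feats)

# Toy data: each "variant" gets mutation-site coordinates and a placeholder
# 32-dimensional sequence embedding (in practice, e.g. an ESM embedding).
rng = np.random.default_rng(0)
struct_feats = np.stack(
    [laplacian_spectrum_features(rng.normal(size=(30, 3)) * 5) for _ in range(64)]
)
seq_embed = rng.normal(size=(64, 32))
X = np.hstack([struct_feats, seq_embed])
y = rng.normal(size=64)                      # measured fitness would go here

model = Ridge(alpha=1.0).fit(X[:48], y[:48])
print(model.predict(X[48:]).shape)           # predictions for held-out variants
```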
Data availability
There are 34 DMS datasets with experimentally measured fitness used in this work, including the 32 DeepSequence datasets8, the avGFP dataset41 and the GB1 dataset42. The original data sources of the 32 DeepSequence datasets are provided in Supplementary Data 1 and Supplementary Note 7. Structure data were obtained from the PDB database22 and AF38, and the specific entry IDs are provided in Supplementary Data 1. The data analyzed and generated in this work, including sequence-to-fitness datasets, optimized structure data, MSAs, fine-tuned parameters for eUniRep models, predictions from evolutionary scores for individual mutations and sequence- and structure-based embeddings, are available at https://github.com/WeilabMSU/TopFit (ref. 65) and on our lab server at https://weilab.math.msu.edu/Downloads/TopFit/. Source data for Figs. 3–5 and Extended Data Figs. 1, 3, 5–10 are available with this paper. Source data for Extended Data Figs. 2 and 4 are available in Supplementary Data 2.
Code availability
All source codes and models are publicly available at https://github.com/WeilabMSU/TopFit (ref. 65).
References
Narayanan, H. et al. Machine learning for biologics: opportunities for protein engineering, developability, and formulation. Trends Pharmacol. Sci. 42, 151–165 (2021).
Arnold, F. H. Design by directed evolution. Acc. Chem. Res. 31, 125–131 (1998).
Karplus, M. & Kuriyan, J. Molecular dynamics and protein function. Proc. Natl Acad. Sci. USA 102, 6679–6685 (2005).
Wittmann, B. J., Johnston, K. E., Wu, Z. & Arnold, F. H. Advances in machine learning for directed evolution. Curr. Opin. Struct. Biol. 69, 11–18 (2021).
Yang, K. K., Wu, Z. & Arnold, F. H. Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694 (2019).
Hopf, T. A. et al. The EVcouplings Python framework for coevolutionary sequence analysis. Bioinformatics 35, 1582–1584 (2019).
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91–95 (2021).
Rao, R. M. et al. MSA transformer. In International Conference on Machine Learning 8844–8856 (PMLR, 2021).
The UniProt Consortium. UniProt: the universal protein knowledge base in 2021. Nucleic Acids Res. 49, D480–D489 (2021).
Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).
Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations (2018).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Elnaggar, A. et al. ProtTrans: towards cracking the language of life's code through self-supervised deep learning and high performance computing. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-n protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. Adv. Neural Inf. Process. Syst. 34, 29287–29303 (2021).
Notin, P. et al. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In International Conference on Machine Learning 16990–17017 (PMLR, 2022).
Hsu, C., Nisonoff, H., Fannjiang, C. & Listgarten, J. Learning protein fitness models from evolutionary and assay-labeled data. Nat. Biotechnol. 40, 1114–1122 (2022).
Luo, Y. et al. ECNet is an evolutionary context-integrated deep learning framework for protein engineering. Nat. Commun. 12, 5743 (2021).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
Schymkowitz, J. et al. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388 (2005).
Leman, J. K. et al. Macromolecular modeling and design in Rosetta: recent methods and frameworks. Nat. Methods 17, 665–680 (2020).
Edelsbrunner, H. & Harer, J. Computational Topology: An Introduction (American Mathematical Society, 2010).
Zomorodian, A. & Carlsson, G. Computing persistent homology. Discrete Comput. Geom. 33, 249–274 (2005).
Cang, Z. & Wei, G.-W. Integration of element specific persistent homology and machine learning for protein–ligand binding affinity prediction. Int. J. Numer. Methods Biomed. Eng. 34, e2914 (2018).
Wang, M., Cang, Z. & Wei, G.-W. A topology-based network tree for the prediction of protein–protein binding affinity changes following mutation. Nat. Mach. Intell. 2, 116–123 (2020).
Wang, R., Nguyen, D. D. & Wei, G.-W. Persistent spectral graph. Int. J. Numer. Methods Biomed. Eng. 36, e3376 (2020).
Mémoli, F., Wan, Z. & Wang, Y. Persistent Laplacians: properties, algorithms and implications. SIAM J. Math. Data Sci. 4, 858–884 (2022).
Meng, Z. & Xia, K. Persistent spectral-based machine learning (PerSpect ML) for protein–ligand binding affinity prediction. Sci. Adv. 7, eabc5329 (2021).
Wittmann, B. J., Yue, Y. & Arnold, F. H. Informed training set design enables efficient machine learning-assisted directed protein evolution. Cell Systems 12, 1026–1045 (2021).
Horak, D. & Jost, J. Spectra of combinatorial Laplace operators on simplicial complexes. Adv. Math. 244, 303–336 (2013).
Chung, F. R. K. & Graham, F. C. Spectral Graph Theory (American Mathematical Society, 1997).
Brouwer, A. E. & Haemers, W. H. Spectra of Graphs (Springer, New York, 2011).
Eckmann, B. Harmonische funktionen und randwertaufgaben in einem komplex. Comment. Math. Helv. 17, 240–255 (1944).
Kac, M. Can one hear the shape of a drum? Am. Math. Mon. 73, 1–23 (1966).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Livesey, B. J. & Marsh, J. A. Using deep mutational scanning to benchmark variant effect predictors and identify disease mutations. Mol. Syst. Biol. 16, e9380 (2020).
Qiu, Y., Hu, J. & Wei, G.-W. Cluster learning-assisted directed evolution. Nat. Comput. Sci. 1, 809–818 (2021).
Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).
Olson, C. A., Wu, N. C. & Sun, R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Curr. Biol. 24, 2643–2651 (2014).
Klesmith, J. R., Bacik, J.-P., Michalczyk, R. & Whitehead, T. A. Comprehensive sequence-flux mapping of a levoglucosan utilization pathway in E. coli. ACS Synth. Biol. 4, 1235–1243 (2015).
Bubenik, P. et al. Statistical topological data analysis using persistence landscapes. J. Mach. Learn. Res. 16, 77–102 (2015).
Adams, H. et al. Persistence images: a stable vector representation of persistent homology. J. Mach. Learn. Res. 18, 1–35 (2017).
Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proc. Natl Acad. Sci. USA 110, E193–E201 (2013).
Qiu, Y. & Wei, G.-W. Clade 2.0: evolution-driven cluster learning-assisted directed evolution. J. Chem. Inf. Model. 62, 4629–4641 (2022).
Rollins, N. J. et al. Inferring protein 3D structure from deep mutation scans. Nat. Genet. 51, 1170–1176 (2019).
Georgiev, A. G. Interpretable numerical descriptors of amino acid space. J. Comput. Biol. 16, 703–723 (2009).
Kawashima, S. & Kanehisa, M. AAindex: amino acid index database. Nucleic Acids Res. 28, 374–374 (2000).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1, 4171–4186 (2019).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Yu, F., Koltun, V. & Funkhouser, T. Dilated residual networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 472–480 (IEEE, 2017).
Humphrey, W., Dalke, A. & Schulten, K. VMD: visual molecular dynamics. J. Mol. Graph. 14, 33–38 (1996).
Xiang, Z. & Honig, B. Extending the accuracy limits of prediction for side-chain conformations. J. Mol. Biol. 311, 421–430 (2001).
Maria, C., Boissonnat, J.-D., Glisse, M. & Yvinec, M. The GUDHI library: simplicial complexes and persistent homology. In International Congress on Mathematical Software 167–174 (Springer, 2014).
Wang, R. et al. HERMES: persistent spectral graph software. Found. Data Sci. 3, 67 (2021).
Pedregosa, F. et al. scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Bergstra, J., Yamins, D. & Cox, D. Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning 115–123 (PMLR, 2013).
Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (ACM, 2016).
Cheng, H.-T. et al. Wide and deep learning for recommender systems. In Proc. 1st Workshop on Deep Learning for Recommender Systems 7–10 (ACM, 2016).
Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
Järvelin, K. & Kekäläinen, J. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 422–446 (2002).
Qiu, Y. YuchiQiu/TopFit: Nature Computational Science publication accompaniment (v1.0.0). Zenodo https://doi.org/10.5281/zenodo.7450235 (2022).
Acknowledgements
This work was supported in part by NIH grants R01GM126189 and R01AI164266; NSF grants DMS-2052983, DMS-1761320 and IIS-1900473; NASA grant 80NSSC21M0023; Michigan Economic Development Corporation; MSU Foundation; Bristol-Myers Squibb 65109 and Pfizer. We thank C. Hsu and J. Listgarten for helpful discussions.
Author information
Contributions
All authors conceived this work, and contributed to the original draft, review and editing. Y.Q. performed experiments and analyzed data. G.-W.W. provided supervision and resources and acquired funding.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Fernando Chirigati, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 The average performance of various models over 34 datasets.
a,b, This is a supplement to Fig. 3a. a, Line plots show the same data as Fig. 3a, with additional data for two TopFit strategies. b, Results are evaluated by NDCG. a,b, Ensemble regression is used, except ridge regression for the Georgiev and one-hot embeddings. For evolutionary scores, absolute values of the corresponding quantities (ρ in a and NDCG in b) are shown. The width of the shading shows the 95% confidence interval from n = 20 repeats.
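For readers unfamiliar with the ranking metric used in b, the following is a minimal NDCG sketch, assuming the standard cumulated-gain formulation of Järvelin and Kekäläinen: variants are ordered by the model's predicted score, measured fitness values serve as gains, and the sum is normalized by the ideal (fitness-sorted) ordering. The exact gain transform used in the paper may differ from this illustration.

```python
# Minimal NDCG sketch (illustrative; the paper's exact gain transform may differ).
import numpy as np

def ndcg(y_true, y_pred):
    order = np.argsort(y_pred)[::-1]                    # rank variants by prediction
    gains = y_true[order]                               # measured fitness as gain
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = np.sum(gains * discounts)
    ideal = np.sum(np.sort(y_true)[::-1] * discounts)   # best possible ordering
    return dcg / ideal

y_true = np.array([0.9, 0.1, 0.5, 0.7])   # toy fitness values
y_pred = np.array([0.8, 0.2, 0.4, 0.9])   # toy model predictions
print(round(ndcg(y_true, y_pred), 3))
```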
Extended Data Fig. 2 Spearman correlation for various models on individual datasets.
This is a supplement to Fig. 3a. Each line plot shows the average ρ for each dataset over n = 20 repeats. Training data sizes are 24, 96, 168 and 240. The width of the shading shows the 95% confidence interval.
Extended Data Fig. 3 The frequency that an embedding is ranked as the best across 34 datasets using Spearman correlation.
a–d, This is a supplement to Fig. 3b, including comparisons over different strategies. Histograms show the frequency that an embedding is ranked as the best across 34 datasets with 24, 96, 168 and 240 training data, respectively. For each dataset, the best embedding has an average ρ over n = 20 repeats within the 95% confidence interval of the embedding with the highest average ρ. Comparisons were performed for a, sequence-based embeddings; b, structure- and sequence-based embeddings; c, structure-based embeddings, sequence-based embeddings and evolutionary scores; and d, structure-based embeddings, sequence-based embeddings, evolutionary scores and two sets of TopFit (VAE+PST+ESM and VAE+PST+eUniRep). Absolute values of the Spearman correlation were shown and used for evolutionary scores.
Extended Data Fig. 4 NDCG for various models on individual datasets.
This is an analog of Extended Data Fig. 2 but using NDCG. Each line plot shows the average NDCG for each dataset over n = 20 repeats. Training data sizes are 24, 96, 168 and 240. The width of the shading shows the 95% confidence interval.
Extended Data Fig. 5 The frequency that an embedding is ranked as the best across 34 datasets using NDCG.
a–d, This is an analog of Extended Data Fig. 3 but measured by NDCG. Histograms show the frequency that an embedding is ranked as the best across 34 datasets with 24, 96, 168 and 240 training data, respectively. For each dataset, the best embedding has an average NDCG over n = 20 repeats within the 95% confidence interval of the embedding with the highest average NDCG. Comparisons were performed for a, sequence-based embeddings; b, structure- and sequence-based embeddings; c, structure-based embeddings, sequence-based embeddings and evolutionary scores; and d, structure-based embeddings, sequence-based embeddings, evolutionary scores and two sets of TopFit (VAE+PST+ESM and VAE+PST+eUniRep). Absolute values of NDCG were shown and used for evolutionary scores.
Extended Data Fig. 6 Relationships between quality of wild-type protein structure and PST performance.
a,b, This is a supplement to Fig. 3c. Boxplots show the distributions of a, percentages of coils in the protein structure over 34 datasets and b, the third quartile (Q3) of B factors at alpha carbons over 26 X-ray datasets. Datasets were classified into two classes depending on whether the PST embedding is the best embedding. Scatter plots show the same data as the boxplots but for individual datasets. A one-sided Mann–Whitney U-test examines the statistical significance that the two classes have different values. Boxplots display the five-number summary, where the center line shows the median, the upper and lower limits of the box show the upper and lower quartiles, and the upper and lower whiskers show the maximum and the minimum after excluding "outliers" outside the interquartile range. In a, sample sizes for PST ranked as the best model are n = 21, n = 15, n = 18 and n = 19 for training data sizes 24, 96, 168 and 240, respectively. Sample sizes for PST not ranked as the best model are n = 13, n = 19, n = 16 and n = 15 for training data sizes 24, 96, 168 and 240, respectively. The P values are 0.01, 3 × 10−5, 1 × 10−3 and 1 × 10−3 for training data sizes 24, 96, 168 and 240, respectively. In b, sample sizes for PST ranked as the best model are n = 8, n = 12, n = 14 and n = 15 for training data sizes 24, 96, 168 and 240, respectively. Sample sizes for PST not ranked as the best model are n = 18, n = 14, n = 12 and n = 11 for training data sizes 24, 96, 168 and 240, respectively. The P values are 0.03, 0.02, 0.07 and 0.02 for training data sizes 24, 96, 168 and 240, respectively.
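The kind of one-sided test described in this caption can be reproduced with SciPy as in the sketch below; the two arrays of coil percentages are hypothetical placeholders, not source data, and the one-sided direction shown is only one possible choice.

```python
# Illustrative one-sided Mann-Whitney U test of the kind described above:
# do datasets where PST is the best embedding have lower coil percentages
# than the rest?  The values below are placeholders, not source data.
import numpy as np
from scipy.stats import mannwhitneyu

coil_pct_pst_best = np.array([12.0, 18.5, 22.0, 15.3, 9.8])       # hypothetical
coil_pct_pst_not_best = np.array([35.2, 28.9, 41.0, 30.5, 26.7])  # hypothetical

# One-sided alternative: the first class tends to have smaller values.
stat, p_value = mannwhitneyu(coil_pct_pst_best, coil_pct_pst_not_best,
                             alternative="less")
print(f"U = {stat:.1f}, one-sided P = {p_value:.3g}")
```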
Extended Data Fig. 7 Model occurrence in ensemble regression.
This is a supplement to Fig. 3e showing the model occurrence on individual datasets. For each repeat, the top N = 3 regressors were picked and counted. Histograms count the model occurrence over 20 repeats.
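The top-N selection step can be pictured with the simplified sketch below (a stand-in for the paper's ensemble regression, not its released implementation): each candidate regressor is scored by the Spearman correlation of its out-of-fold predictions on the training split, the top N = 3 are refit, and their occurrences are what the histograms count. The candidate list and scoring details are illustrative assumptions.

```python
# Sketch of top-N regressor selection for an averaged ensemble (illustrative;
# the candidate set and scoring procedure are assumptions, not the paper's code).
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def top_n_regressors(X, y, n_top=3):
    candidates = [Ridge(alpha=1.0), Lasso(alpha=0.01),
                  RandomForestRegressor(n_estimators=100, random_state=0),
                  GradientBoostingRegressor(random_state=0)]
    # Score each regressor by Spearman rho of its out-of-fold predictions.
    scores = [spearmanr(y, cross_val_predict(m, X, y, cv=5))[0]
              for m in candidates]
    top = [candidates[i] for i in np.argsort(scores)[::-1][:n_top]]
    for m in top:
        m.fit(X, y)               # refit the selected regressors on all data
    return top                    # counting these over repeats gives the histograms

rng = np.random.default_rng(0)
X, y = rng.normal(size=(240, 16)), rng.normal(size=240)   # toy features/fitness
print([type(m).__name__ for m in top_n_regressors(X, y)])
```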
Extended Data Fig. 8 Comparisons between TopFit and other methods for mutation effects prediction using Spearman correlation.
a,b, This is an analog of Fig. 4a,b, but TopFit combines the VAE score, eUniRep embedding and PST embedding. All supervised models use 240 labeled training data. Results are evaluated by the Spearman correlation ρ. DeepSequence VAE takes the absolute value of ρ. The average ρ from n = 20 repeats is shown. All 34 datasets are categorized by the structure modality used: X-ray, nuclear magnetic resonance (NMR), AlphaFold (AF) and cryogenic electron microscopy (EM). a, Dot plots show results across 34 datasets. b, Dot plots show a pairwise comparison between TopFit and one method in each plot. Medians of the difference in average Spearman correlation Δρ across all datasets are shown. A one-sided rank-sum test determines the statistical significance that TopFit performs better than the VAE score, eUniRep embedding and PST embedding, with P values of 3 × 10−7, 2 × 10−7 and 4 × 10−7, respectively.
Extended Data Fig. 9 Comparisons between TopFit and other methods. TopFit consists of the VAE score, ESM embedding and PST embedding.
This is a supplement to Fig. 4b including results with various numbers of training data. Average Spearman correlations from n = 20 repeats are shown, and all datasets are categorized by the structure modality used: X-ray, nuclear magnetic resonance (NMR), AlphaFold (AF) and cryogenic electron microscopy (EM). A one-sided rank-sum test determines the statistical significance that TopFit performs better than the other strategies, except that, with 24 training data, the null hypothesis is that TopFit performs worse than VAE. The P values are shown in the corresponding subfigures: (1) TopFit versus VAE, P = 4 × 10−6, 2 × 10−5 and 1 × 10−6; (2) TopFit versus ESM, P = 2 × 10−7, 2 × 10−7 and 2 × 10−7; and (3) TopFit versus PST, P = 2 × 10−7, 2 × 10−7 and 2 × 10−7, for training data sizes 24, 96 and 168, respectively.
Extended Data Fig. 10 Comparisons between TopFit and other methods. TopFit consists of the VAE score, eUniRep embedding and PST embedding.
This is a supplement to Extended Data Fig. 8b including results with various numbers of training data. Average Spearman correlations from n = 20 repeats are shown, and all datasets are categorized by the structure modality used: X-ray, nuclear magnetic resonance (NMR), AlphaFold (AF) and cryogenic electron microscopy (EM). A one-sided rank-sum test determines the statistical significance that TopFit performs better than the other strategies, except that, with 24 training data, the null hypothesis is that TopFit performs worse than VAE. The P values are shown in the corresponding subfigures: (1) TopFit versus VAE, P = 3 × 10−6, 3 × 10−5 and 8 × 10−7; (2) TopFit versus eUniRep, P = 4 × 10−7, 2 × 10−7 and 2 × 10−7; and (3) TopFit versus PST, P = 3 × 10−7, 2 × 10−7 and 3 × 10−7, for training data sizes 24, 96 and 168, respectively.
Supplementary information
Supplementary Information
Supplementary Figs. 1–16, Tables 1–5 and Notes 1–7.
Supplementary Data 1
Dataset information. List of datasets, including their UniProt ID, the structure data ID, reference, sequence region for mutations and so on.
Supplementary Data 2
Raw data for computational results. Each folder is named after its dataset. Please refer to Supplementary Data 1 for the correspondence between the dataset naming convention used in this file and that in the figures. Datasets with the additional suffix "_AF" or "_NMR" indicate that an AF or NMR structure was used for the computations. In each folder, results are saved as ".csv" files with eight possible names for different tasks. "ridge.csv", "ensemble.csv" and "evolutionary_scores.csv" are results from random train/test splits using ridge regression, ensemble regression and evolutionary scores, respectively. "unseen_ensemble.csv" and "unseen_evolutionary_scores.csv" are results from train/test splits for unseen mutational sites using ensemble regression and evolutionary scores, respectively. "ridge_n_mut_1.csv", "ensemble_n_mut_1.csv" and "evolutionary_scores_n_mut_1.csv" are results from the extrapolation task, in which single mutations are used to predict multiple mutations, for ridge regression, ensemble regression and evolutionary scores, respectively. Explanations of the column names are available in "README.txt". This file provides raw data for all figures, Extended Data figures and Supplementary figures generated in this work, excluding Fig. 5, which is available from its own source data. Statistical data for Extended Data Figs. 2 and 4 are available directly from this data.
Source data
Source Data Fig. 3
Statistical source data.
Source Data Fig. 4
Statistical source data.
Source Data Fig. 5
Statistical source data.
Source Data Extended Data Fig. 1
Statistical source data.
Source Data Extended Data Fig. 3
Statistical source data.
Source Data Extended Data Fig. 5
Statistical source data.
Source Data Extended Data Fig. 6
Statistical source data.
Source Data Extended Data Fig. 7
Statistical source data.
Source Data Extended Data Fig. 8
Statistical source data.
Source Data Extended Data Fig. 9
Statistical source data.
Source Data Extended Data Fig. 10
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Qiu, Y. & Wei, G.-W. Persistent spectral theory-guided protein engineering. Nat. Comput. Sci. 3, 149–163 (2023). https://doi.org/10.1038/s43588-022-00394-y