Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model

Abstract

Signal peptides (SPs) are essential to target and transfer transmembrane and secreted proteins to the correct positions. Many existing computational tools for predicting SPs disregard the extreme data imbalance problem and rely on additional group information of proteins. Here we introduce Unbiased Organism-agnostic Signal Peptide Network (USPNet), an SP classification and cleavage-site prediction deep learning method. Extensive experimental results show that USPNet substantially outperforms previous methods on classification performance by 10%. An SP-discovering pipeline with USPNet is designed to explore unprecedented SPs from metagenomic data. It reveals 347 SP candidates, with the lowest sequence identity between our candidates and the closest SP in the training dataset at only 13%. In addition, the template modeling scores between candidates and SPs in the training set are mostly above 0.8. The results showcase that USPNet has learnt the SP structure with raw amino acid sequences and the large protein language model, thereby enabling the discovery of unknown SPs.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: USPNet workflow for predicting SP and cleavage site.
Fig. 2: USPNet shows robust performance across different SP types and organism groups.
Fig. 3: Embedding and ablation study performance analysis of USPNet compared with alternative models.
Fig. 4: Performance of USPNet on domain-shift data.
Fig. 5: The exploration of metagenomics data for SP discovery.

Similar content being viewed by others

Data availability

All the datasets we used, including training data, benchmark data, independent test data, proteome-wide study results and metagenomic study results are listed in Methods and are available at https://doi.org/10.17605/OSF.IO/NH3CF ref. 49. Source data are provided with this paper.

Code availability

The open-source codes of USPNet can be found at https://github.com/ml4bio/USPNet and the Code Ocean software capsule https://doi.org/10.24433/CO.8184163.v1 ref. 50.

References

  1. von Heijne, G. Life and death of a signal peptide. Nature 396, 111–113 (1998).

    Article  Google Scholar 

  2. Heijne, G. V. The signal peptide. J. Membr. Biol. 115, 195–201 (1990).

    Article  Google Scholar 

  3. Bradshaw, N., Neher, S. B., Booth, D. S. & Walter, P. Signal sequences activate the catalytic switch of SRP RNA. Science 323, 127–130 (2009).

    Article  Google Scholar 

  4. von Heijne, G. Patterns of amino acids near signal-sequence cleavage sites. Eur. J. Biochem. 133, 17–21 (1983).

    Article  Google Scholar 

  5. Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).

  6. Frank, K. & Sippl, M. J. High-performance signal peptide prediction based on sequence alignment techniques. Bioinformatics 24, 2172–2176 (2008).

    Article  Google Scholar 

  7. Petersen, T. N., Brunak, S., Von Heijne, G. & Nielsen, H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat. Methods 8, 785–786 (2011).

    Article  Google Scholar 

  8. Savojardo, C., Martelli, P. L., Fariselli, P. & Casadio, R. DeepSig: deep learning improves signal peptide detection in proteins. Bioinformatics 10, 1690–1696 (2017).

    Google Scholar 

  9. Armenteros, J. J. A. et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat. Biotechnol. 37, 420–423 (2019).

    Article  Google Scholar 

  10. Teufel, F. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat. Biotechnol. 40, 1023–1025 (2022).

  11. Juncker, A. S. et al. Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci. 12, 1652–1662 (2003).

    Article  Google Scholar 

  12. Bagos, P. G., Tsirigos, K. D., Liakopoulos, T. D. & Hamodrakas, S. J. Prediction of lipoprotein signal peptides in Gram-positive bacteria with a hidden Markov model. J. Proteome Res. 7, 5082–5093 (2008).

    Article  Google Scholar 

  13. Bendtsen, J. D., Nielsen, H., Widdick, D., Palmer, T. & Brunak, S. Prediction of twin-arginine signal peptides. BMC Bioinformatics 6, 167 (2005).

    Article  Google Scholar 

  14. Pasolli, E. et al. Accessible, curated metagenomic data through experimenthub. Nat. Methods 14, 1023–1024 (2017).

    Article  Google Scholar 

  15. Sczyrba, A. et al. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).

    Article  Google Scholar 

  16. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

    Article  Google Scholar 

  17. Rao, R. M. et al. MSA transformer. In Proc. 38th International Conference on Machine Learning, Proc. Machine Learning Research Vol. 139 (eds Meila, M. & Zhang, T.) 8844–8856 (PMLR, 2021).

  18. Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).

    Article  Google Scholar 

  19. Thireou, T. & Reczko, M. Bidirectional long short-term memory networks for predicting the subcellular localization of eukaryotic proteins. IEEE/ACM Trans. Comput. Biol. Bioinform. 4, 441–446 (2007).

    Article  Google Scholar 

  20. Cao, K., Wei, C., Gaidon, A., Arechiga, N. & Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. Adv. Neural Inf. Process. Syst. 32, 1567–1578 (2019).

    Google Scholar 

  21. Mnih, V. et al. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 27, 2204–2212 (2014).

    Google Scholar 

  22. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

    Article  MathSciNet  Google Scholar 

  23. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. In Proc. IEEE International Conference on Computer Vision 2980–2988 (IEEE, 2017).

  24. Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2021).

    Article  Google Scholar 

  25. Armenteros, J. J. A. et al. Detecting sequence signals in targeting peptides using deep learning. Life Sci. Alliance 2, e201900429 (2019).

    Article  Google Scholar 

  26. Ma, Y. et al. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Na. Biotechnol. 40, 921–931 (2022).

    Article  Google Scholar 

  27. Han, S. et al. Novel signal peptides improve the secretion of recombinant Staphylococcus aureus alpha toxinH35L in Escherichia coli. AMB Express 7, 93 (2017).

    Article  Google Scholar 

  28. Consortium, T. U. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2022).

    Article  Google Scholar 

  29. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

    Article  Google Scholar 

  30. Consortium, U. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).

    Article  Google Scholar 

  31. Sigrist, C. J. et al. New and continuing developments at prosite. Nucleic Acids Res. 41, D344–D347 (2012).

    Article  Google Scholar 

  32. Dobson, L., Lango, T., Reményi, I. & Tusnády, G. E. Expediting topology data gathering for the TOPDB database. Nucleic Acids Res. 43, D283–D289 (2015).

    Article  Google Scholar 

  33. Gíslason, M. H., Nielsen, H., Armenteros, J. J. A. & Johansen, A. R. Prediction of GPI-anchored proteins with pointer neural networks. Curr. Res. Biotechnol. 3, 6–13 (2021).

    Article  Google Scholar 

  34. Li, W. & Godzik, A. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).

    Article  Google Scholar 

  35. Youngblut, N. D. et al. Large-scale metagenome assembly reveals novel animal-associated microbial genomes, biosynthetic gene clusters, and other genetic diversity. mSystems 5, e01045-20 (2020).

    Article  Google Scholar 

  36. Looft, T., Bayles, D., Alt, D. & Stanton, T. Complete genome sequence of Coriobacteriaceae strain 68-1-3, a novel mucus-degrading isolate from the swine intestinal tract. Genome Announc. 3, e01143-15 (2015).

    Article  Google Scholar 

  37. Zhou, S. et al. Characterization of metagenome-assembled genomes and carbohydrate-degrading genes in the gut microbiota of Tibetan pig. Front. Microbiol. 11, 595066 (2020).

    Article  Google Scholar 

  38. Chen, C. et al. Prevotella copri increases fat accumulation in pigs fed with formula diets. Microbiome 9, 175 (2021).

    Article  Google Scholar 

  39. Groussin, M. et al. Elevated rates of horizontal gene transfer in the industrialized human microbiome. Cell 184, 2053–2067 (2021).

    Article  Google Scholar 

  40. Tilocca, B. et al. Dietary changes in nutritional studies shape the structural and functional composition of the pigs’ fecal microbiome—from days to weeks. Microbiome 5, 144 (2017).

    Article  Google Scholar 

  41. Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).

    Article  Google Scholar 

  42. Steinegger, M., Mirdita, M. & Söding, J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods 16, 603–606 (2019).

    Article  Google Scholar 

  43. Mirdita, M. et al. UniCclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017).

    Article  Google Scholar 

  44. Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019).

    Article  Google Scholar 

  45. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://doi.org/10.48550/arXiv.1802.03426 (2018).

  46. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).

    Article  Google Scholar 

  47. DeLano, W. L. et al. PyMOL: an open-source molecular graphics tool. CCP4 Newsl. Protein Crystallogr. 40, 82–92 (2002).

    Google Scholar 

  48. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).

    Article  Google Scholar 

  49. Shen, J. et al. Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model. OSF https://doi.org/10.17605/OSF.IO/NH3CF (2023).

  50. Shen, J. et al. Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model. Code Ocean https://doi.org/10.24433/CO.8184163.v1 (2023).

Download references

Acknowledgements

Special thanks to the people who suggested that we evaluate models on the 40% cut-off benchmark set. The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (project number CUHK 24204023, to Y.L.) and a grant from Innovation and Technology Commission of the Hong Kong Special Administrative Region, China (project number GHP/065/21SZ, to Y.L.). The work was partially supported by the National Key R&D Program of China (NO.2022ZD0160101).

Author information

Authors and Affiliations

Authors

Contributions

Y.L., J.S. and S.C. designed the computational method. J.S., Q.Y. and S.C. implemented the main algorithm. J.S., Q.Y., S.C., Q.T. and J.L. did the experiments. J.S. and Q.Y. performed the analysis. J.S., Q.Y. and S.C. wrote the paper. Y.L. supervised the project. All authors read and approved the paper.

Corresponding author

Correspondence to Yu Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Rita Casadio and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–7, Supplementary Figs. 1–6 and Tables 1–21.

Reporting Summary

Peer Review File

Source data

Source Data Fig. 2

Statistical source data.

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Fig. 5

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Shen, J., Yu, Q., Chen, S. et al. Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model. Nat Comput Sci 4, 29–42 (2024). https://doi.org/10.1038/s43588-023-00576-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s43588-023-00576-2

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing