SignalP 6.0 predicts all five types of signal peptides using protein language models

Signal peptides (SPs) are short amino acid sequences that control protein secretion and translocation in all living organisms. SPs can be predicted from sequence data, but existing algorithms are unable to detect all known types of SPs. We introduce SignalP 6.0, a machine learning model that detects all five SP types and is applicable to metagenomic data.


Supplementary note 2: Signal peptide region prediction results
Signal peptides are traditionally described as consisting of three distinct regions with different biochemical properties (1,2). While it is possible to segment an SP into its regions by manual inspection of the sequence given sufficient domain knowledge, there is no clear set of rules for doing so universally. Rather, expert annotators work on a case-by-case basis to arrive at a probable solution.
However, this approach does not scale to proteome-level data, which has so far prevented detailed analysis of SP structure. As there is also no known experimental technique that identifies region borders accurately, SignalP 6.0 instead learns SP structure directly from unlabeled data in a weakly supervised fashion (Online methods). The lack of labeled region data for training also implies that no labeled data is available for measuring performance. Thus, we benchmark our region identification by comparing the properties of predicted regions to known properties from the literature (2). Specifically, we evaluate the predicted lengths, hydrophobicities and net charges of n-, h-, and c-regions in all SP types.
We find that the predictions of SignalP 6.0 match all expected properties. For n-regions, the model correctly recovers the average length and the differences between organism groups (17) (Supplementary Fig. 5). Predicted h-regions are less hydrophobic in Tat-translocated SPs than in Sec SPs, a property that is known to contribute to the selectivity of the pathways (3). While the c-region is generally uncharged, in Tat SPs it can contain positively charged residues to avoid recognition by the Sec system, a feature called the Sec avoidance motif (4). SignalP 6.0 accurately captures this property, with the majority of Tat/SPI c-regions having a net charge of 0 or +1. The model also predicts negatively charged Tat/SPI c-regions, suggesting that negative charges may also be suitable to hinder recognition by the Sec pathway.
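The property comparison described above can be illustrated with a minimal Python sketch that computes the length, net charge and mean hydrophobicity of the n-, h- and c-regions of a single SP. This is not the evaluation code used for SignalP 6.0. The Kyte-Doolittle hydropathy scale, the simple charge rule (+1 for K/R, -1 for D/E, termini and histidine ignored), the function names and the hand-picked region borders are all assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the source does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def net_charge(region):
    """Assumed charge rule: +1 per K/R, -1 per D/E; termini and His ignored."""
    positive = sum(region.count(aa) for aa in "KR")
    negative = sum(region.count(aa) for aa in "DE")
    return positive - negative


def mean_hydrophobicity(region):
    """Mean Kyte-Doolittle value of a region, or None for an empty region."""
    if not region:
        return None
    return sum(KYTE_DOOLITTLE[aa] for aa in region) / len(region)


def region_properties(sp, n_end, h_end):
    """Split an SP at two borders (0-based, end-exclusive) and describe each region."""
    regions = {"n": sp[:n_end], "h": sp[n_end:h_end], "c": sp[h_end:]}
    return {
        name: {
            "length": len(seq),
            "net_charge": net_charge(seq),
            "hydrophobicity": mean_hydrophobicity(seq),
        }
        for name, seq in regions.items()
    }


# Illustrative sequence with hand-picked borders (not a model prediction).
example = region_properties("MKKTAIAIAVALAGFATVAQA", n_end=3, h_end=16)
for name, props in example.items():
    print(name, props)
```

Applied to the predicted regions of many SPs, summaries of these per-region values could then be compared between SP types and organism groups.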
To further evaluate the region prediction capability, we run predictions on a library of synthetic SPs that were found to be either functional or non-functional in Bacillus subtilis (5). In the original work, the authors did not find any discriminating properties between the two groups using traditional sequence analysis.

Supplementary note 3: Proteome-wide predictions
To gain insight into the diversity of SP usage throughout evolution, we ran predictions on all reference proteomes available in UniProt (6) (Supplementary Tables 10-11). The predictions confirm the exceptionally high Tat/SPII frequencies in Halobacteria that were previously reported (7), even though the training dataset only contains 3 such sequences. Moreover, our predictions also revealed bacterial species with high Tat/SPII and Sec/SPIII frequencies, highlighting the importance of including these two often overlooked modes of protein export when investigating proteomes. In general, the number of secretory proteins is predicted to be highly variable among different species. Among all 9,915 organisms present in the data, the only species without predicted SPs are bacterial endosymbionts, indicating that protein translocation and export are indispensable to free-living organisms.
It is possible to estimate the number of proteins with SPs in an organism using proteomic methods. Ivankov et al. (8) reported an estimated 11.5% of the E. coli proteome to have SPs. SignalP 6.0 predicts a total of 12.2%, indicating that the model's genome-scale predictions yield estimates of plausible size.
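The proteome-level percentage discussed above is a simple proportion of predicted labels. The following Python sketch shows one way to compute it, together with per-type frequencies. The function name, the label strings and the toy input are hypothetical and do not reflect the actual output format of SignalP 6.0 or any real proteome.

```python
from collections import Counter


def sp_type_frequencies(predictions, negative_label="Other"):
    """Return (per-type frequencies, overall SP fraction) for one proteome.

    `predictions` maps protein IDs to predicted class labels; the label
    strings and the name of the negative class are assumptions.
    """
    counts = Counter(predictions.values())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no predictions given")
    frequencies = {
        label: n / total for label, n in counts.items() if label != negative_label
    }
    sp_fraction = (total - counts[negative_label]) / total
    return frequencies, sp_fraction


# Toy input for illustration only (made-up IDs and labels).
toy_predictions = {
    "p01": "Sec/SPI", "p02": "Other", "p03": "Other", "p04": "Sec/SPII",
    "p05": "Other", "p06": "Other", "p07": "Other", "p08": "Other",
    "p09": "Other", "p10": "Other",
}
frequencies, sp_fraction = sp_type_frequencies(toy_predictions)
print(f"{100 * sp_fraction:.1f}% of proteins predicted to carry an SP")
```

Running the same computation per organism and then averaging within organism groups would give group-level frequencies of each SP type.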
When aggregating results on the organism group level, we observe further diversity in translocation strategies (Supplementary Table 11). Tat/SPII SPs are predicted to be the most frequent in Archaea, together with a low frequency of Sec/SPII, suggesting that lipoproteins are mostly exported via the Tat pathway. We further find that the overall frequency of secretory pathway proteins is the highest in Gram-negative bacteria. Notably, these results might be biased by the selection of organisms for which reference proteomes are available, as it is unclear how well they represent their complete organism group.

Sec/SPI    146   36   61   15
Sec/SPII     -    9  257  120
Tat/SPI      -    9   51   18
Tat/SPII     -    5    5    3
Other     5581   81  133   81
Supplementary Table 11. Composition of the SignalP 5.0 benchmark dataset after 1) removal of all sequences that were not retained in the new homology partitioning and 2) reclassification of Gram-negative and Tat/SPI samples to Gram-positive and Tat/SPII.