Introduction

Six secretion systems have been identified in pathogenic and endosymbiotic Gram-negative bacteria1,2,3,4,5,6. The type III secretion system (T3SS) mediates a wide range of bacterial infections in humans, animals and plants7. This system comprises a hollow needle-like structure on the surface of bacterial cells that injects specific bacterial proteins, termed effectors, directly into the cytoplasm of a host cell3. During infection, effectors act in concert to divert host resources to the bacterium’s advantage and promote pathogenicity8. While the elements of the T3SS are conserved among different pathogens, effector proteins are not7,9,10.

Although next-generation sequencing techniques yield an ever-growing number of bacterial genome sequences11, the experimental verification needed to identify type III effectors remains very expensive and time-consuming. Considering the central role these proteins play in pathogenicity and symbiosis, there is a need for computational tools that predict and prioritize type III effector proteins. To address this need, various machine-learning algorithms12,13,14,15 have been developed to identify type III effectors in silico. As input, these methods use similarities to experimentally known effectors in gene GC content, protein amino acid composition, secondary structure and solvent accessibility. Most often only the protein N-terminus is considered, as it is assumed to be most informative for the translocation of effectors through the type III secretion process16. An independent benchmark revealed that state-of-the-art methods predict type III effectors at similar levels of up to 80% accuracy at 80% coverage17; thus, there is still room for substantial improvement.

We built pEffect, a method that combines two components: sequence similarity-based inference (PSI-BLAST18) and de novo prediction using Support Vector Machines (SVM19). Our method attains 87 ± 7% accuracy at 95 ± 5% coverage in predicting type III effectors, significantly outperforming each of its components. It also provides a score reflecting the strength of each prediction, allowing users to focus on the most relevant results. When tested on sequence fragments similar in length to peptides translated from shotgun sequencing reads, pEffect’s performance was not significantly different. This result suggests that the information required for distinguishing effectors is not confined to any particular part of the amino acid sequence.

We applied pEffect to complete proteomes of over 900 prokaryotic species. pEffect’s high prediction accuracy and ubiquitous applicability raise an interesting question about its predictions of effectors in Gram-positive bacteria and archaea, which are not known to utilize type III secretion. For bacteria, these findings may hint at shared ancestry between flagellar and type III secretion systems9. Gene genealogies20 and protein network analysis approaches21 suggest evolution of both systems from a common ancestor. For archaea, common ancestry is less clear. However, the predominance of predicted effectors in Gram-negative bacteria over those in Gram-positive bacteria and archaea suggests repurposing of effector-like proteins independent of organism secretory abilities. Beyond evolutionary inference, the quantities of predicted effectors correlate as expected with evolutionary time and T3SS completeness, which is reassuring with regard to our method’s performance.

Our method provides a basis for rapid identification of T3SS-utilizing bacteria and their exported effector proteins as targets for future therapeutic treatments. The method also proposes interesting directions in which the evolution of bacterial pathogenicity can be further explored. Finally, we suggest using pEffect as a starting point for studies of interactions within microbial communities, detected directly from metagenomic reads and without the need for individual genome assembly.

Results

pEffect succeeded in linking homology-based and de novo predictions

Most functional annotations of new proteins originate from homology-based transfer, i.e. on the basis of shared ancestry with experimentally characterized proteins22. For type III effector prediction, homology-based inference implies finding a sequence-similar experimentally annotated type III effector (‘Methods’ section), e.g. via PSI-BLAST.

The accuracy of homology-based inference by PSI-BLAST was comparable to that of our de novo prediction method on the cross-validation Development set (Table 1: 91% vs. 92%). However, at this level of accuracy, the coverage of PSI-BLAST was significantly higher (Table 1: 84% vs. 60%). This result encouraged combining these two approaches as introduced in our recent work, LocTree323: use PSI-BLAST when sequence similarity suffices (e-value ≤ 10⁻³; Table 1: F1 = 0.87 on the complete set) and the SVM otherwise (Table 1: F1 = 0.67 on the subset of proteins without a PSI-BLAST hit). The combined method, pEffect, outperformed both its components, reaching an F1 measure of 0.91 (Table 1).
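The decision rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `predict_effector`, its arguments and the SVM decision threshold of 0.0 are hypothetical and are not taken from the pEffect implementation; only the e-value cutoff comes from the text.

```python
from typing import Optional, Tuple

# E-value cutoff quoted in the text for accepting a PSI-BLAST hit.
E_VALUE_CUTOFF = 1e-3


def predict_effector(
    best_effector_evalue: Optional[float],
    svm_score: float,
    svm_threshold: float = 0.0,  # assumed decision boundary, not from the paper
) -> Tuple[bool, str]:
    """Return (is_effector, source) for one protein.

    best_effector_evalue: e-value of the best PSI-BLAST hit against known
        type III effectors, or None if no hit was found (hypothetical input).
    svm_score: de novo score for the same protein (hypothetical input).
    """
    # Homology-based inference takes precedence when similarity suffices.
    if best_effector_evalue is not None and best_effector_evalue <= E_VALUE_CUTOFF:
        return True, "PSI-BLAST"
    # Otherwise fall back to the de novo (SVM) prediction.
    return svm_score > svm_threshold, "SVM"
```

Keeping the source of each call alongside the label makes it possible to separate homologs of known effectors from de novo candidates later on.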

Table 1 Performance of pEffect and its components on the Development set.

pEffect outperformed other methods

We compared pEffect to publicly available methods: BPBAac13, EffectiveT312, T3_MM24, Modlab15 and BEAN 2.014. BPBAac, EffectiveT3, T3_MM and Modlab focus exclusively on N-terminal features, while BEAN 2.0 and pEffect are not confined to those regions (Methods, Supplementary S2 Text). BPBAac, T3_MM and Modlab rely solely on amino acid composition; EffectiveT3 combines amino acid composition and secondary structure information; BEAN 2.0 uses BLAST18 and PFAM25 domain searches, evolutionary information encoded in N- and C-termini, as well as information from an intermediate sequence region. We compared performance for UniProt26 proteins that had NOT been used to develop any method and for T3DB11 proteins, some of which all methods (incl. pEffect) had used for development. In our hands, pEffect significantly outperformed its competitors on UniProt sets containing eukaryotic proteins (Fig. 1, Supplementary Table S1). The F1 performance of pEffect exceeded that of the other methods by at least 0.35 (∆F1(pEffect, BEAN 2.0) = 0.35 for both UniProt sets, Supplementary Table S1). On the bacterial T3DB data sets, pEffect performed within one standard error of the prediction performance achieved by its best performing competitor, BEAN 2.0. Thus, pEffect performed as well as or better than existing tools in distinguishing type III effectors from bacterial proteins (F1 > 0.64) and from eukaryotic proteins (F1 > 0.88). This improvement is particularly important for, e.g., annotating results from contaminated metagenomic studies27.

Figure 1

Method performance comparison on independent test sets and protein fragments.

Performance (Supplementary S1 Text: F1 measure, Eqn. 3; ‘±’ standard error, Eqn. 4) was measured for the BPBAac, EffectiveT3, T3_MM, Modlab and BEAN 2.0 methods (Supplementary S2 Text). We also computed F1 for de novo (SVM-based) predictions alone, PSI-BLAST homology-based look up alone and pEffect: a combination of PSI-BLAST (if a hit is available) and de novo (otherwise). Panel (a) shows performance on evaluation data sets (Methods) including (1) UniProt’15HVAL0 (51 effectors and 691 non-effector bacterial and eukaryotic proteins, added to UniProt after 2014_02 release, sequence homology reduced at HVAL < 0), (2) UniProt’15Full (498 effectors and 1,509 non-effector bacterial and eukaryotic proteins added to UniProt after 2014_08 release, NOT homology reduced), (3) T3DBHVAL0 (66 effectors and 128 non-effector bacterial proteins from T3DB database, sequence homology reduced at HVAL < 0) and (4) T3DBFull (218 effectors and 831 non-effector bacterial proteins from T3DB database, NOT homology reduced). Note: T3_MM was not able to produce results for the UniProt’15HVAL0 set during manuscript preparation. Panel (b) shows performance on fragments produced from (3) T3DBHVAL0 (Methods) including (5) approach i: 30 N-terminal amino acids cleaved off, (6) approach ii: 30 C-terminal amino acids cleaved off, (7) approach iii: randomly selected two thirds of the protein sequence and (8) approach iv: randomly selected sequence fragments of typical translated read length (average 110 amino acids, Supplementary Fig. S1).

pEffect excelled even for protein fragments

To evaluate pEffect’s ability to annotate effectors from incomplete or erroneous genomic assemblies, we fragmented the proteins from the homology reduced T3DBHVAL0 set containing bacterial proteins only. We started with protein rather than gene sequence fragments, because we did not expect incorrect gene translations of DNA reads, even if sufficiently long, to trigger incorrect effector predictions from any method. Four different approaches were used to generate protein fragments: (i) remove the first 30 residues (N-terminus) from the full protein sequence, (ii) remove the last 30 residues (C-terminus), (iii) randomly remove one third of residues and (iv) randomly choose from each protein a single fragment of a typical translated read length (Supplementary Fig. S1).
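The four fragmentation approaches can be illustrated with the following sketch. It is not the authors’ code: all function names are hypothetical, approach (iii) is assumed here to keep one contiguous window covering two thirds of the sequence (the text does not say whether the removed residues are contiguous), and approach (iv) uses a fixed length of 110 residues, the average translated read length quoted in the Figure 1 legend.

```python
import random

TERMINUS_LENGTH = 30   # residues removed in approaches (i) and (ii)
READ_LENGTH = 110      # average translated read length (Figure 1 legend)


def remove_n_terminus(seq: str, n: int = TERMINUS_LENGTH) -> str:
    """Approach (i): drop the first n residues."""
    return seq[n:]


def remove_c_terminus(seq: str, n: int = TERMINUS_LENGTH) -> str:
    """Approach (ii): drop the last n residues."""
    return seq[: max(len(seq) - n, 0)]


def random_window(seq: str, length: int, rng: random.Random) -> str:
    """Return a random contiguous window; the whole sequence if it is shorter."""
    if length >= len(seq):
        return seq
    start = rng.randrange(len(seq) - length + 1)
    return seq[start:start + length]


def random_two_thirds(seq: str, rng: random.Random) -> str:
    """Approach (iii), assumed contiguous: keep two thirds of the residues."""
    return random_window(seq, (2 * len(seq)) // 3, rng)


def random_read_fragment(seq: str, rng: random.Random) -> str:
    """Approach (iv): one fragment of typical translated read length."""
    return random_window(seq, READ_LENGTH, rng)
```

Passing an explicit `random.Random` instance keeps the fragment sets reproducible across runs.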

pEffect compared favourably to all other methods for all fragment sets (i–iv). Most methods performed best on fragments with C-terminal cleavage (set ii, Fig. 1, Supplementary Table S2). Performance was lowest for random fragments of typical read lengths (set iv). For pEffect it dipped least (F1 = 0.59 ± 0.14 on set iv vs. F1 = 0.64 ± 0.14 on full length, Supplementary Table S1). For all fragment sets, performances of homology-based approaches, i.e. of PSI-BLAST, pEffect and BEAN 2.0 were within the standard error of the performance obtained when using full-length sequences (T3DBHVAL0 set; Fig. 1, Supplementary Table S1). These results suggest that the features distinguishing type III effectors are spread over the entire protein sequence and can be picked up by local alignment.

Reliability index identified confident predictions

pEffect provides a reliability index (RI) to measure the confidence of a prediction; the value of RI ranges from 0 (uncertain) to 100 (most reliable). For PSI-BLAST searches, RIs are normalized values of percentage pairwise sequence identities read off the alignments. For de novo predictions, RIs are values corresponding to SVM scores (Methods). Including predictions with low RIs gives many trusted results at reduced accuracy. Higher accuracy predictions are obtained by sampling at higher RIs, thus reducing the total number of trusted samples. For example, at the threshold of RI ≥ 50, over 87% of all predictions of type III effectors are correct and 95% of all effectors in our set are identified (Supplementary Fig. S2: black arrow). On the other hand, at RI > 80 effector predictions are correct 96% of the time, but only 78% of all effectors in the set are identified (Supplementary Fig. S2: gray arrow). Thus, users can choose the most appropriate threshold for a given study. Users can also focus on previously unidentified effectors (de novo predictions) or, vice versa, on validated homologs of known effectors (PSI-BLAST matches; Supplementary Fig. S3).
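The trade-off between accuracy and coverage at different RI thresholds can be made concrete with a short sketch. It assumes that accuracy corresponds to precision, that coverage corresponds to recall and that F1 is their harmonic mean (the exact definitions are in Supplementary S1 Text, which is not reproduced here); the function names and the input format are hypothetical. Under these assumptions, the two operating points quoted above correspond to F1 values of roughly 0.91 (RI ≥ 50) and 0.86 (RI > 80).

```python
from typing import Iterable, Tuple


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision (accuracy) and recall (coverage)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate_at_threshold(
    predictions: Iterable[Tuple[float, bool]],
    n_true_effectors: int,
    ri_min: float,
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for effector predictions with RI >= ri_min.

    predictions: (RI, is_true_effector) pairs for proteins predicted as
        effectors (hypothetical input format).
    n_true_effectors: number of known effectors in the evaluated set.
    """
    kept = [is_effector for ri, is_effector in predictions if ri >= ri_min]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / n_true_effectors if n_true_effectors else 0.0
    return precision, recall, f1(precision, recall)


# Operating points quoted in the text, under the assumptions stated above.
f1_at_ri_50 = f1(0.87, 0.95)
f1_at_ri_80 = f1(0.96, 0.78)
```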

Application of pEffect: scanning proteomes for type III effectors

We used pEffect to annotate type III effectors in 862 bacterial (274 Gram-positive, 588 Gram-negative bacteria) and 90 archaeal proteomes from the European Bioinformatics Institute (EBI: http://www.ebi.ac.uk/genomes/; predictions available on the pEffect website). Each bacterium was predicted to have some type III effectors, with a minimum of 0.8% of the proteome (2 out of all 240 proteins) identified as effectors (Fig. 2, Supplementary Table S3). For some Gram-negative bacteria, over 750 type III effectors were predicted (Supplementary Table S3), e.g. 870 effectors in S. aurantiaca DW4/3-1, which is indeed known to have a T3SS and effectors28.

Figure 2

Percentage of predicted effectors in full proteomes.

The figure shows the box-plot-and-instance representation of percentages of pEffect-predicted type III effectors (Y-axis) in 90 archaeal, 274 Gram-positive and 588 Gram-negative bacterial organisms (X-axis), which are shown as dots. In all but 11 organisms in our set, at least 50% of effector predictions were made de novo. In the figure, the colour represents the percentage of de novo predictions for each organism: from green (50% de novo, 50% PSI-BLAST) to blue (100% de novo, 0% PSI-BLAST). While effectors predicted in archaea and Gram-positive bacteria are often picked up by PSI-BLAST, effectors in Gram-negative bacteria are mostly de novo predictions (mostly blue dots).

Overall, the number of predicted type III effectors was 1–10% of the whole proteome in Gram-positive bacteria and 1–15% in Gram-negative bacteria (Fig. 2, Supplementary Table S3). To further understand our predictions, we retrieved UniProt keywords of predicted effectors. Their annotations varied widely. The most common for both types of bacteria were transferase (a large class of enzymes responsible for the transfer of specific functional groups from one molecule to another), nucleotide-binding (a common functionality of effector proteins), ATP-binding (an essential component of the T3SS) and kinase (necessary for the expression of T3SS genes). About one fourth (26–29% per proteome) of predicted type III effectors are functionally ‘unknown’ (Supplementary Table S4).

We also predicted type III effectors in all archaeal proteomes, with over 100 effectors identified in the proteomes of H. turkmenica DSM 5511 and M. acetivorans C2A (126 and 105 effectors, respectively; Supplementary Table S3). On average, fewer effectors were predicted in archaea than in bacteria: the overall per-organism percentage is 1.9% for archaea vs. 3.4% for Gram-positive and 4.6% for Gram-negative bacteria (Fig. 2). The most frequent annotations of predicted archaeal effectors were similar to those for predicted bacterial effectors, namely ‘unknown’, nucleotide-binding, ATP-binding and transferase (Supplementary Table S4).

Evaluation of predictions for proteomes

We BLASTed proteins representative of five T3DB Ortholog clusters (e-value ≤ 10⁻³; Supplementary Table S5) against the full proteomes of our set of 862 bacteria and 90 archaea. We thus aimed to identify those proteomes likely equipped with the T3SS machinery (Fig. 3).

Figure 3

Proteomes encoding some of the five components of T3SS machinery.

(a) 90 archaea proteomes, (b) 274 Gram-positive bacteria and (c) 588 Gram-negative bacteria were scanned for the presence of T3SS and are shown as dots in the figure. The percentage of type III effectors predicted by pEffect (Y-axis) is compared to the number of type III secretion machinery components (max. five T3 Ortholog clusters; Methods) identified in these proteomes (X-axis). Note that effector predictions are computationally completely independent of machinery component identifications. While type III effectors compose up to 3.7% of an archaeal proteome (mean 1.9%, blue horizontal line), this number is much larger for bacteria, reaching up to 10.1% of an entire proteome for Gram-positive bacteria (mean 3.4%) and 14.9% for Gram-negative bacteria (mean 4.6%; for those with five T3SS components, mean 4.8%). Note that six Gram-negative bacterial species did not contain detectable homologs of any of the required machinery components (not even ATPases), indicating that their genomes are further diverged than those of other species.

We found that, as expected, archaea never contain a full T3SS (maximum three out of five components). In Gram-negative bacteria, the number of predicted effectors correlated much better with the number of type III machinery components (Pearson correlation r = 0.37) than in Gram-positive bacteria (r = 0.13). The combination of a high percentage of predicted type III effectors and a high number of conserved type III machinery components provides strong evidence for the presence of the type III secretion abilities (Fig. 3). As a rule of thumb, based on our observations in archaea and Gram-positive bacteria, we suggest that these abilities can be reliably identified by the presence of the complete T3SS and ≥5% of the genome dedicated to effectors. With these cutoffs, 20% (115 species) of the Gram-negative bacteria in our set are identified as type III secreting. We randomly picked ten species from these 20% and found evidence in the literature for T3SS presence for seven of them (Supplementary Table S6). No archaeal species and only five Gram-positive bacteria fit these cutoffs. Note that our rule does not imply that organisms with full T3SS and over 5% predicted effectors necessarily have the complete ability to use the system. Instead, we suggest that organisms without the necessary components cannot use the system. Overall, our results indicate that the experimental annotation of the type III secretion in isolated and cultured organisms is incomplete, leaving significant room for improvement, possibly with assistance from pEffect.
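The rule of thumb stated above amounts to a two-part test per proteome. The sketch below is a minimal illustration; the function name and its arguments are hypothetical, and only the two cutoffs (five machinery components and at least 5% predicted effectors) are taken from the text.

```python
N_CORE_COMPONENTS = 5          # complete T3SS as defined in the Methods
MIN_EFFECTOR_FRACTION = 0.05   # at least 5% of proteins predicted as effectors


def likely_type3_secreting(
    n_components_found: int,
    n_predicted_effectors: int,
    proteome_size: int,
) -> bool:
    """Apply the rule of thumb to one proteome (hypothetical helper).

    Returns True only if all five conserved machinery components were
    detected AND the predicted effectors make up at least 5% of the proteome.
    """
    if proteome_size <= 0:
        raise ValueError("proteome_size must be positive")
    effector_fraction = n_predicted_effectors / proteome_size
    return (
        n_components_found >= N_CORE_COMPONENTS
        and effector_fraction >= MIN_EFFECTOR_FRACTION
    )
```

As the text notes, a positive result does not prove that an organism uses the system; the test is better read as a filter that excludes organisms lacking the necessary components.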

Finally, we extracted from the HAMAP database29 the available annotations of pathogenicity and symbiotic relationships for the 115 Gram-negative bacteria in our set with a complete T3SS and ≥5% of the genome dedicated to effectors. We compared the number of predicted effectors in organisms that infect eukaryotic cells in general and mammalian cells in particular with those that are currently not known to be symbiotic or pathogenic. Note that further manual curation of the currently unannotated bacteria still points to the possibility of type III secretion for a large fraction of them30,31,32. Our analysis showed that while the distributions of numbers of effectors across the different types of bacteria were not significantly different, mammalian pathogens carried, on average, more effectors than pathogens of other taxa. Those, in turn, carried more effectors than bacteria not currently annotated as pathogenic or symbiotic (Supplementary Fig. S4). Thus, we believe that pEffect can be used to pinpoint newly sequenced organisms for future exploration of type III secretion-mediated pathogenicity.

Discussion

pEffect successfully combines complementary approaches for the prediction of type III effector proteins: homology-based and de novo. Specifically, it uses PSI-BLAST for a high accuracy (precision) mode of prediction and SVM for improved coverage (recall). The resulting single method pEffect outperforms both of its individual components (Table 1) and other methods (Fig. 1, Supplementary Tables S1 and S2). When tested on samples contaminated with eukaryotic proteins, pEffect predicts effectors with a performance level that is significantly higher than that of any other currently available method (Fig. 1, Supplementary Tables S1 and S7). Similar to the results of Arnold et al.12, we find that there is no significant difference in performance across different species of bacteria (pEffect: F1 = 0.54 ± 0.31 on a data set with no proteins of the same species shared between training and test sets vs. F1 = 0.91 ± 0.08 on the Development set). pEffect was trained on a data set that was sequence homology reduced at HVAL = 0 (i.e. there is no pair of sequences in our data set with over 20% sequence identity that have over 250 amino acid residues aligned) and that, to our knowledge, represents the largest and most complete set of effector proteins currently available. The data set can be downloaded from pEffect’s website.

pEffect uses information stored in the entire protein sequence and performs on sequence fragments just as well as on full-length protein sequences (Fig. 1, Supplementary Table S2). This result led us to conclude that signals discriminating effector proteins are distributed across the entire protein sequence and are not confined to the N-terminus, as is currently assumed. This finding was surprising and is highly relevant for the analysis of metagenomic read data. Deep sequencing (or NGS) produces immense amounts of DNA reads, which need to be assembled and annotated to be useful. Erroneous (chimeric) gene assemblies or wrong gene predictions are common in sequencing projects33. To bypass assembly errors when identifying type III secretion activity in a particular metagenomic sample, it would help to annotate effectors from peptides translated directly from the DNA reads. pEffect facilitates this type of direct analysis of metagenomic sequence data, establishing the level of type III secretion activity and, by proxy, the endosymbiotic interactions and the potential presence or absence of pathogenic organisms in a particular environment.

We applied pEffect to over 900 prokaryotic proteomes with the aim of annotating those organisms that are likely to utilize a T3SS. We validated our results using three different metrics: (i) percentage of predicted effector proteins per proteome, (ii) evolutionary age of an organism and (iii) the number of conserved T3SS elements. As expected, pEffect predicted a higher percentage of effector proteins per proteome in Gram-negative bacteria with a full T3SS (five conserved T3SS elements) than in Gram-positive bacteria and archaea, which are not known to utilize the system (Figs 2 and 3). This indicates a possible acquisition of a larger effector repertoire in Gram-negative bacteria, which was unnecessary for other organisms. Incorporating the independently established evolutionary age estimate, effector proteins of T3SS-using Gram-negative bacteria appear to further diversify with increasing evolutionary distance from the last common ancestor (Fig. 4a). This correlation would not be expected at random, as the age of bacteria and their effector quantities are independently established and do not correlate for other organisms.

Figure 4

pEffect’s whole proteome predictions in Gram-negative bacteria.

(a) pEffect predicted type III effector proteins in the proteomes of 294 Gram-negative bacteria. The proteomes are shown as red and purple dots. Purple dots indicate proteomes with five type III machinery components (full T3SS) and red dots are proteomes with fewer components. For each proteome, the evolutionary distance from the last common ancestor (X-axis), extracted from Lang et al.52, is plotted against the percentage of proteins predicted as effectors (Y-axis). While there is a correlation between the age and the quantities of effectors in proteomes of organisms with full T3SS (purple trend-line), the same appears not to be the case for organisms with less than five components. (b) Proteomes with full T3SS identified by source. Green dots are the percentage of proteins predicted as effectors by homology searches (PSI-BLAST) and blue dots are de novo predictions. While PSI-BLAST appears to consistently pick up ~1% of each proteome of all organisms (green horizontal trend-line), the effectors in Gram-negative bacteria diversify further over evolutionary distance, as indicated by the increase in the number of de novo predictions.

Interestingly, homology searches have identified roughly equal numbers of effectors (on average, 1% of each respective proteome; Supplementary Table S3) across both types of bacteria. As their percentage per proteome remains stable over time (Supplementary Table S3) and as they are found in almost all organisms with PSI-BLAST, we suggest that these effectors are the older ones that had the time to spread throughout different species. On the other hand, the increasing number of new effectors recognized by the SVM in relation to organism age (as long as the organism uses the T3SS, Fig. 4b) indicates likely new “inventions” that accumulate over the time of T3SS use. These results are in line with the potential ancestral presence of an early complete secretory system10,34, including the machinery and the secreted proteins, and with further diversification of effectors exclusively in T3SS-utilizing Gram-negative bacteria.

The set of de novo-identified effectors found across bacteria is a good starting point for further investigation into effector origins. Due to T3SS significance in pathogenicity of Gram-negative bacteria, the de novo identified effectors are also potentially interesting as drug targets.

pEffect’s high prediction accuracy raises an interesting question about its false positive predictions of effectors in Gram-positive bacteria, which are not known to utilize the T3SS. Roughly one fourth of these predicted effectors are of yet-unknown function. Those that are annotated include enzymes necessary for flagellar motility (Supplementary Table S6). This finding is in line with evidence of shared ancestry between bacterial flagellar and type III secretion systems9. Gene genealogies20 and protein network analysis approaches21 suggest independent evolution of both systems from a common ancestor, comprising a set of proteins forming a membrane-bound complex. The fact that the flagellar system can also secrete proteins35 suggests that this ancestor may have played a secretory role9. The pervasiveness of the flagellar apparatus across the bacterial space also suggests that the ancestral complex existed prior to the split of the cell-walled and double-membrane organisms, indicated by the differences in Gram staining. Thus, it is not surprising that we find T3SS component homology in Gram-positive bacteria even in the absence of type III secretion functionality. Curiously, our results show that the loss of the type III secretion functionality, indicated by the loss of the complete T3SS, has proceeded at a roughly similar rate in Gram-positive and Gram-negative bacteria (Fig. 5a); i.e. once the T3SS becomes incomplete (4 components) and, arguably, non-functional, further loss of components consistently follows. Notably, a complete T3SS is only visible in early Gram-positive bacteria, but is preserved across time in Gram-negative bacteria (Fig. 5b), further confirming the likely presence of the ancestral secretory complex in the last common bacterial ancestor.

Figure 5

Loss of T3SS functionality differentiates Gram-positive and -negative bacteria.

274 Gram-positive bacteria (blue dots) and 588 Gram-negative bacteria (red dots) are screened for the number of conserved components of T3SS (max. 5 T3DB Ortholog clusters; Material and Methods) in their genomes (Y-axis) and plotted against the evolutionary distance from the most recent common ancestor (X-axis). Once the T3SS is lost (a), i.e. fewer than 5 components are present, the rate of further loss of components is the same for all bacteria. The number of Gram-negative bacteria with the complete system (b), i.e. all 5 components present, however, remains constant across evolutionary time, while the number of Gram-positive bacteria declines.

pEffect also predicts a significant number of false positive effectors in archaea, inspiring the question: did T3SS exist before the archaea/bacteria split? Unfortunately, the presence of the beginnings of T3SS in the common ancestor of bacteria and archaea is neither directly supported nor negated by our results. Archaeal flagella have little or no structural similarities to bacterial flagella and none of the archaea that we tested had the complete T3SS (Fig. 2). If the common ancestor of archaea and bacteria did encode the core ancestral complex, the latter observation would indicate a loss of functionality in archaea. Another possibility is that the T3SS in bacteria may have been built over time from duplicated and diversified paralogous genes of the core complex after the archaea/bacteria split. In both of these scenarios, the prediction of type III effectors in archaea would indicate re-purposing of the proteins secreted by the ancestral complex. In fact, 0.5% of an average archaeal genome is identified by homology to known effectors and another 0.9% de novo-identified proteins are homologous (PSI-BLAST e-value ≤ 10⁻³) to de novo-identified effectors of Gram-negative bacteria. These proteins must have been re-purposed in modern archaea; in fact, they are annotated with a range of molecular functionalities (Supplementary Table S6). The use of an additional 0.5% of the archaeal proteome that is picked up by pEffect de novo and has no homologs in bacteria remains an enigma. While similarity between archaeal proteins and bacterial type III effectors and machinery is insufficient to draw definitive conclusions regarding common ancestry, it is significant for further exploration; i.e. if roughly one tenth of the identified effectors of Gram-negative bacteria and half of the machinery have homologs in archaea, could there have been a common ancestral secretion complex that has developed early on in evolutionary time and has given root to many systems observed today?

pEffect immediately and importantly contributes to the study of type III secretion mechanisms. It allows for rapid identification of type III secretion abilities within unassembled genomic and metagenomic read data. Moreover, the quantity of identified effectors seems to correspond with bacterial pathogenicity, potentially contributing to the tracking of infectious strains. We believe that pEffect will facilitate future experimental insights in microbiological research and will significantly contribute to our understanding and management of infectious disease.

Methods

Development data sets

Our positive data set of known type III effector proteins was extracted from scientific publications12,36,37,38,39,40,41,42,43 and the Pseudomonas-Plant Interaction web site (http://www.pseudomonas-syringae.org/). We additionally queried Swiss-Prot with keywords ‘type III effector’, ‘type three effector’ and ‘T3SS effector’ and manually curated the results for experimentally identified effectors (removing entries with “potential”, “probable” and “by similarity” annotations). All corresponding amino acid sequences were taken from UniProt26, 2012_01 release. In total, our positive (effector) data set contained 1,388 proteins.

To compile our negative data set of non-type III effectors we used the experimentally annotated Swiss-Prot proteins from the 2012_01 UniProt release. We extracted all bacterial proteins that were NOT annotated as type III effectors and had no significant sequence similarity (BLAST e-value > 10) to any type III effector in our positive set. We also added all eukaryotic proteins applying no sequence similarity filters. Our negative set thus contained roughly 470,000 proteins.

We removed from our sets all proteins that were annotated as ‘uncharacterized’, ‘putative’, or ‘fragment’. We reduced sequence redundancy independently in each set using UniqueProt44, ascertaining that no pair of proteins in one set had alignment length of less than 35 residues or a positive HSSP-value45,46 (HVAL ≥ 0). After redundancy reduction our sequence-unique sets contained 115 type III effector proteins from 43 different bacterial species and 3,460 non-effector proteins (of which 37% were bacterial). Note that proteins from positive and negative sets were sometimes similar as homology reduction was only applied within sets and not across sets. Here, this set of sequences (positive and negative sets together) is termed the Development set. All pEffect performance results were compiled on stratified cross-validation of this Development set (five-fold cross-validation, i.e. we split the entire set into five similarly-sized subsets and trained five models, each on a different combination of four of these subsets, testing each model on the held-out fifth subset so that every subset was used for testing exactly once).
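The stratified five-fold cross-validation scheme described above can be sketched as follows. This is a generic illustration rather than the authors’ procedure: the function names, the round-robin assignment of shuffled proteins to folds and the fixed random seed are assumptions. Stratification here means that effectors and non-effectors are distributed over the folds separately, so each fold has a similar class ratio.

```python
import random
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled members of each class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validation_splits(
    labels: Sequence[int], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices); each fold is held out exactly once."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test
```

For a set with 115 effectors and 3,460 non-effectors, this scheme places 23 effectors and 692 non-effectors in each test fold.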

Additional data sets

Comparing pEffect performance to that of other methods using our cross-validation approach has only limited value due to the possible overlap between our testing and other methods’ training sets and can lead to an overestimate of other methods’ performance. A more meaningful way is to use non-redundant sets of effector and non-effector proteins that have never been used for the development of any method. Toward this end, we extracted the following data sets:

  1. We collected all type III effectors added to UniProt after the 2014_02 release and non-type III bacterial and eukaryotic proteins added to Swiss-Prot after the same release. These were redundancy-reduced at HVAL < 0 to produce the UniProt’15HVAL0 set (51 effectors and 691 non-effectors, of which 53% were of bacterial origin). Note that additionally reducing this set to be sequence-dissimilar to the Development set would retain only 10 type III effectors, too few for reliable performance estimates. However, even for this smaller and completely independent set, the performance of pEffect was higher than that of other tools, making pEffect a uniquely reliable method for determining new effectors (Supplementary Table S7).

  2. To answer the question “how well will pEffect perform on protein sequences added to databases within the next six months?”, we collected the proteins added to UniProt (type III effectors) and Swiss-Prot (non-effector bacterial and eukaryotic sequences) after the 2014_08 release, producing the UniProt’15Full set (498 effectors and 1,509 non-effectors).

  3. We also extracted all bacterial type III effectors from the T3DB database11, giving the T3DBFull set (218 effectors and 831 non-effectors). We deliberately kept the redundancy in this set (up to HVAL = 66, i.e. over 85% pairwise sequence identity over 450 residues aligned). Note that some proteins from this set are contained in the training sets of all compared methods, including pEffect.

  4. Finally, we redundancy-reduced the T3DB set at HVAL < 0. This gave the T3DBHVAL0 set (66 effectors and 128 non-effectors).
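
The HVAL thresholds used for these sets can be illustrated with a short sketch. The HSSP-value is not defined in this section; the code below assumes the commonly cited form of the HSSP curve (length-dependent identity threshold, with HVAL as the distance of the observed percentage identity from that curve). The function names are hypothetical. Under this assumption, the example given above (HVAL = 66 for alignments longer than 450 residues) corresponds to 85.5% pairwise sequence identity.

```python
import math


def hssp_threshold(aligned_length):
    """Identity threshold (in %) of the assumed HSSP curve for a given alignment length."""
    if aligned_length <= 11:
        return 100.0
    if aligned_length <= 450:
        exponent = -0.32 * (1.0 + math.exp(-aligned_length / 1000.0))
        return 480.0 * aligned_length ** exponent
    return 19.5


def hssp_value(percent_identity, aligned_length):
    """HVAL: distance of the observed percentage identity from the HSSP curve."""
    return percent_identity - hssp_threshold(aligned_length)


def is_redundant(percent_identity, aligned_length, max_hval=0.0):
    """A pair counts as redundant if its HVAL reaches the chosen cutoff (assumed >=)."""
    return hssp_value(percent_identity, aligned_length) >= max_hval
```

Short alignments therefore need a much higher percentage identity than long ones before a pair is treated as similar.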

T3DB Ortholog clusters of the type III secretion system (T3SS) machinery

T3DB is a database of experimentally annotated T3SS-related proteins in 36 bacterial taxa. Proteins of the same function and the same evolutionary origin are clustered in T3DB into T3 Ortholog clusters (http://biocomputer.bio.cuhk.edu.hk/T3DB/T3-ortholog-clusters.php). The proteins of these clusters form ten components of the T3SS. Proteins of five of these components (export apparatus, inner membrane ring, outer membrane ring, cytoplasmic ring and ATPase) are present in all 36 taxa in T3DB (Supplementary Table S2). We therefore took these five components as the minimum set necessary for the formation of the T3SS machinery. With the exception of the outer membrane ring, these components have previously been defined as the core9.
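
As a minimal sketch of how this five-component criterion could be applied to one organism, assume that the T3SS components detected in its proteome are already available as a collection of component names. How components are assigned to proteins is not covered here, and the function names are hypothetical.

```python
# The five components found in all 36 T3DB taxa, as listed in the text.
CORE_COMPONENTS = frozenset({
    "export apparatus",
    "inner membrane ring",
    "outer membrane ring",
    "cytoplasmic ring",
    "ATPase",
})


def has_core_t3ss(components_found):
    """True if all five core components are among the detected components."""
    return CORE_COMPONENTS <= set(components_found)


def missing_core_components(components_found):
    """Core components that were not detected, sorted for stable output."""
    return sorted(CORE_COMPONENTS - set(components_found))
```

An organism lacking any one of the five components would not be considered to encode a complete machinery under this definition.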

Prediction methods

We tested several ideas for prediction, including the following:

Homology-based inference

We transferred type III effector annotations by homology using PSI-BLAST18 alignments. For every query sequence, we generated a PSI-BLAST profile (two iterations, inclusion threshold e-value ≤ 10⁻³) using an 80% non-redundant database combining UniProt47 and PDB48. We then aligned this profile (inclusion e-value ≤ 10⁻³) against all type III effectors extracted from the literature and the UniProt 2012_01 release. For known effectors, we excluded the PSI-BLAST self-hits. We transferred annotation to the query protein from the hit with the highest pairwise sequence identity of all retrieved alignments.
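
The selection step of this procedure can be sketched as follows, assuming the PSI-BLAST alignments have already been parsed into simple records. The `Hit` record, its field names and the function name are hypothetical; running and parsing PSI-BLAST itself is not shown.

```python
from typing import NamedTuple, Optional, Sequence


class Hit(NamedTuple):
    """One alignment of the query profile against a known effector (hypothetical record)."""
    effector_id: str
    e_value: float
    percent_identity: float


def transfer_annotation(
    query_id: str, hits: Sequence[Hit], max_e_value: float = 1e-3
) -> Optional[Hit]:
    """Return the hit the annotation is transferred from, or None if there is none."""
    candidates = [
        hit for hit in hits
        # Keep hits within the e-value threshold and drop self-hits of known effectors.
        if hit.e_value <= max_e_value and hit.effector_id != query_id
    ]
    if not candidates:
        return None
    # The annotation comes from the hit with the highest pairwise sequence identity.
    return max(candidates, key=lambda hit: hit.percent_identity)
```

A return value of `None` means that no known effector is sufficiently similar to the query, so no annotation is transferred by homology.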

De novo prediction

We used the WEKA49 Support Vector Machine (SVM)19 implementation to discriminate between type III effector and non-effector proteins. For each protein sequence, we created a PSI-BLAST profile (as described above) and applied the Profile Kernel function50,51 to map the profile to a vector indexed by all possible subsequences of length k from the alphabet of amino acids; we found that k = 4 amino acids gave the best results. Each element in the vector represents one particular k-mer, and its value is the number of occurrences of this k-mer whose score falls below a user-defined threshold σ; we found that σ = 7 gave the best results. This score is calculated as the ungapped cumulative substitution score in the corresponding sequence profile. Thus, the dot product between two k-mer vectors reflects the similarity of two protein sequence profiles. Essentially, the method identifies those stretches of k adjacent residues in profiles of type III effectors that are most informative for prediction and matches these to the profile of a query protein. The parameters for the SVM and the kernel function were determined separately for each fold in our five-fold cross-validation and, thus, were never optimized for the test sets.
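
The k-mer mapping and the resulting dot product can be illustrated with a brute-force sketch. It assumes that a profile is given as a list of per-position residue probabilities and that the substitution score of a residue is its negative log probability; both are assumptions of this example, as are the function names. The sketch enumerates every k-mer per window, which is only practical for illustration.

```python
import math
from collections import Counter
from itertools import product


def profile_costs(profile):
    """Turn per-position residue probabilities into scores (assumed: -log p)."""
    return [
        {aa: -math.log(p) for aa, p in column.items() if p > 0}
        for column in profile
    ]


def kmer_features(profile, k=4, sigma=7.0):
    """Count, for every k-mer, the windows in which its cumulative score is below sigma."""
    costs = profile_costs(profile)
    features = Counter()
    for start in range(len(costs) - k + 1):
        window = costs[start:start + k]
        # Enumerate all k-mers that can be formed from the residues in this window.
        for combo in product(*(column.items() for column in window)):
            total = sum(cost for _, cost in combo)  # ungapped cumulative score
            if total < sigma:
                features["".join(aa for aa, _ in combo)] += 1
    return features


def profile_kernel(features_a, features_b):
    """Dot product of two k-mer count vectors (features_b must be a Counter)."""
    return sum(count * features_b[kmer] for kmer, count in features_a.items())


# Toy profile over a two-letter alphabet, with k = 2 and sigma = 1 for readability.
toy_profile = [
    {"A": 0.9, "C": 0.1},
    {"A": 0.5, "C": 0.5},
    {"A": 0.1, "C": 0.9},
]
toy_features = kmer_features(toy_profile, k=2, sigma=1.0)
```

In the toy profile, only k-mers built from well-supported residues stay below the threshold, so two profiles share many counted k-mers only if they favour similar residues at adjacent positions.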

pEffect

Our final method, pEffect, combined sequence similarity-based and de novo predictions. To avoid over-fitting, we used the simplest possible combination: if any known type III effector is sequence-similar to the query, the similarity-based prediction is used; otherwise, the de novo prediction is used.
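
This decision rule can be written down in a few lines. The sketch below assumes two hypothetical callables: `find_similar_effector`, which returns a hit or `None`, and `predict_de_novo`, which returns the SVM decision for a sequence. Neither is part of a published interface.

```python
def combine_predictions(sequence, find_similar_effector, predict_de_novo):
    """Return (is_effector, source, hit) following the similarity-first rule."""
    hit = find_similar_effector(sequence)
    if hit is not None:
        # A sequence-similar known effector exists: use the homology-based inference.
        return True, "similarity", hit
    # No similar known effector: fall back to the de novo (SVM) prediction.
    return bool(predict_de_novo(sequence)), "de novo", None
```

Because the rule has no tunable parameters of its own, combining the two components adds nothing that could be fitted to the test data.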

Reliability index

The strength of a pEffect prediction is represented by a reliability index (RI) ranging from 0 (weak prediction) to 100 (strong prediction). For de novo predictions, we computed the RI by multiplying the SVM output by 100 for positive (type III effector) predictions and by subtracting this score from 100 for negative predictions. For sequence similarity-based inferences, the RI is the percentage of pairwise sequence identity normalized to the interval [50, 100], to agree with the SVM prediction range.
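
A minimal sketch of the RI computation follows. It assumes that the SVM output is a probability-like score in [0, 1] with 0.5 as the decision boundary, and that the normalization of sequence identity is a linear rescaling from [0, 100] to [50, 100]; neither detail is stated above, and the function names are hypothetical.

```python
def ri_de_novo(svm_output):
    """RI for a de novo prediction (assumed: svm_output in [0, 1], positive if >= 0.5)."""
    if not 0.0 <= svm_output <= 1.0:
        raise ValueError("svm_output is assumed to lie in [0, 1]")
    score = 100.0 * svm_output
    # Positive predictions keep the score; negative ones use its distance from 100.
    return score if svm_output >= 0.5 else 100.0 - score


def ri_similarity(percent_identity):
    """RI for a similarity-based inference (assumed linear map of [0, 100] onto [50, 100])."""
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must lie in [0, 100]")
    return 50.0 + percent_identity / 2.0
```

Under these assumptions, both kinds of prediction report an RI between 50 and 100, so scores from the two components can be compared directly.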

Evolutionary distances

For the discovery of novel type III effectors in entirely sequenced organisms, we extracted evolutionary distances from the phylogenetic tree of 2,966 bacterial and archaeal taxa, inferred from 38 concatenated genes and available in the Newick format52.

Additional Information

How to cite this article: Goldberg, T. et al. Computational prediction shines light on type III secretion origins. Sci. Rep. 6, 34516; doi: 10.1038/srep34516 (2016).