Introduction

The secretome refers to the set of proteins that are excreted/secreted by a given cell, including extracellular-matrix (ECM) proteins, vesicle proteins (e.g., from microsomal vesicles) and proteins shed from the cell membrane1. These Excretory/Secretory (ES) proteins play important roles in development, adhesion, proteolysis and extracellular matrix organization of the organism. In parasitic organisms, ES proteins act as virulence factors and as immune regulators that control host immune recognition during infection. ES proteins are crucial for parasite survival inside and outside the host, and their expression usually changes in response to environmental stimuli1. Because ES proteins are involved in the clinical manifestations observed in the host, they represent attractive drug targets for the development of novel therapeutic strategies2. Moreover, ES proteins are an important source of immunogenic proteins because they are accessible to the host immune system. Thus, considerable attention has been paid to ES proteins as biomarkers to detect the presence of a parasite and/or the status of the infection in different infectious diseases3,4,5,6. The prediction of ES proteins from sequenced genomes is a novel strategy used to prioritize the experimental study of new therapeutic and immunodiagnostic targets for human parasitic diseases2. The increasing availability of whole parasite genomes provides the opportunity to screen in silico, using bioinformatics approaches, for the encoded secretomes and for the most probable antigenic proteins before undertaking confirmatory experiments.

Echinococcosis (hydatid disease) and cysticercosis, caused by the proliferation of larval tapeworms in vital organs, are important neglected tropical diseases7. Cysticercosis is a tissue infection caused by the parasite Taenia solium (known as the pork tapeworm). The life cycle includes the pig as intermediate host and the human as definitive host. The tapeworm is the adult stage of T. solium; it infects the human intestine and releases eggs into the human feces. The intermediate host becomes infected by ingesting vegetation contaminated with eggs; the oncospheres subsequently hatch, penetrate the intestinal wall and circulate to the musculature. The oncospheres develop into the larval stage (cysticerci) in muscle and the central nervous system (CNS). The life cycle is completed when humans ingest raw or undercooked infected meat and develop the adult tapeworm in the intestine8,9. However, humans can also accidentally ingest the eggs and develop cysticerci. In humans, the cysticerci establish predominantly in the CNS, causing neurocysticercosis (NC), which is the most common tapeworm infection of the brain worldwide and is endemic in developing countries10,11. NC causes symptoms that range from headache and dizziness to epilepsy and severe intracranial hypertension, affecting the social and economic development of the affected communities11,12.

Tapeworms (Platyhelminthes, Cestoda) secrete several ES molecules to regulate the host immune system for parasite survival13,14,15,16,17,18,19. ES proteins involved in the uptake and sequestration of host hydrophobic molecules20 and in mediating the host immune response to parasite infection21 have been experimentally characterized in different life cycle stages of T. solium22. Several ES proteins with peptidase activities have also been reported23,24,25. However, since neither a curated protein database nor a genome sequence for T. solium was available at the time, those studies produced only partial lists of ES proteins. Recently, the T. solium genome was published26, giving us the opportunity to characterize the ES proteins encoded in the genome and to screen in silico for the most probable protective antigens before undertaking confirmatory experiments. Predicting the number of antigenic regions per protein at a genome-wide level can help in the design of vaccine components and immunodiagnostic reagents. There are many bioinformatics methods to predict antigenic regions from a protein sequence. The classical approach to epitope prediction uses amino acid properties including hydrophobicity27, hydrophilicity28, surface accessibility29, flexibility30 and antigenicity31. In addition, there are methods that use machine learning algorithms such as Hidden Markov Models (HMM)32, Artificial Neural Networks (ANN)33 and Support Vector Machines (SVM)34 to locate antigenic epitopes. However, normalizing the epitope density by sequence length has never been considered as a way to measure the antigenic potential of protein sequences at a genome-wide level.

In the present study, we predicted the ES proteins encoded in the T. solium genome and annotated them functionally in terms of similarity to other known proteins, biochemical pathways, gene ontologies, protein families and domains. ES proteins were also analyzed for the number of antigenic regions using three different bioinformatics algorithms and searched for structural homologues using fold recognition algorithms. We developed a novel genomic measurement to evaluate the potential antigenicity of a secretome using the sequence length and the number of antigenic regions of ES proteins. This measurement was formalized as the Abundance of Antigenic Regions (AAR) value. We also determined the AAR value for a set of 46 experimentally determined antigenic proteins of T. solium and for previously reported ES proteins of 12 parasitic helminth species. We believe that our genome-wide exploration of ES proteins is a valuable resource for future experimental studies of the T. solium secretome. Our work represents a starting point for the characterization of the parasite secretome and should contribute to a better understanding of host-parasite interactions.

Results

Prediction of Excretory/Secretory (ES) proteins of T. solium genome

The bioinformatics pipeline is summarized in Figure 1. Of the 12,902 proteins encoded in the T. solium genome26, we annotated a total of 731 proteins as classical secretory proteins using SignalP35 and 543 proteins as non-classical secretory proteins using SecretomeP36. The classical and non-classical secretory proteins were merged, yielding a set of 1190 different proteins because 84 proteins were shared between both predictions (see Venn diagram in Figure 1). The 1190 proteins were subsequently analyzed by TargetP37 to identify mitochondrial proteins; 98 proteins were predicted as mitochondrial and were removed from the set. The remaining 1092 proteins were scanned using TMHMM38, and transmembrane regions were predicted for 254 proteins. These transmembrane proteins were also removed from the dataset. Finally, a total of 838 sequences were predicted as ES proteins by our bioinformatics pipeline (Figure 1). The 838 ES proteins represent 6.5% of the total sequences of the T. solium genome. The ES proteins were searched against the RNAseq and EST libraries from T. solium to determine the percentage of ES proteins that are supported at the RNA level. Access to the RNA data was kindly provided by the T. solium consortium (unpublished data). Interestingly, we found RNA support for 347 ES proteins, representing 41.4% of the total T. solium secretome.
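The counts reported in this paragraph follow from simple set arithmetic. As a quick check, the short sketch below reproduces them using only the numbers stated above; the variable names are ours and are purely illustrative.

```python
# Counts taken from the text; variable names are illustrative only.
total_proteins = 12902
classical = 731        # SignalP predictions
non_classical = 543    # SecretomeP predictions
shared = 84            # predicted by both tools

merged = classical + non_classical - shared   # inclusion-exclusion
after_targetp = merged - 98                   # remove predicted mitochondrial proteins
secretome = after_targetp - 254               # remove predicted transmembrane proteins

fraction_of_genome = 100 * secretome / total_proteins
rna_supported = 100 * 347 / secretome

print(merged, after_targetp, secretome)                       # 1190 1092 838
print(round(fraction_of_genome, 1), round(rna_supported, 1))  # 6.5 41.4
```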

Figure 1

Bioinformatics pipeline to identify and annotate the ES proteins in T. solium genome.

Functional annotation of T. solium secretome

ES protein identification

Of the 838 ES proteins, 654 (81.6%) proteins showed significant BLASTP matches with proteins deposited in the non-redundant (nr) database and 63 (7.5%) proteins showed significant BLASTP matches with hypothetical protein homologs. According to the sequence descriptions of the protein homologs, several ES proteins were identified as diagnostic antigen gp50 (14 proteins), cysteine-rich secretory protein (9 proteins), chorion class high cysteine protein (6 proteins), oncosphere antigen a (5 proteins) and others.

Gene Ontology analysis

ES proteins were annotated for Biological Process, Molecular Function and Cellular Component with Gene Ontology (GO) terms. Out of 838 ES proteins, 349 (41.6%) proteins were annotated with GO terms using Blast2GO39,40. In an effort to obtain more sequences with annotations, the 488 unannotated proteins were subjected to GO term annotation using Argot241. The advantage of Argot2 is that it exploits HMMER searches in addition to the typical BLAST searches and combines the clustering of GO terms based on their semantic similarities with a weighting scheme to annotate the query sequences41. Using Argot2, we could annotate 276 of the 488 proteins left unannotated by Blast2GO39,40. In summary, of the 838 ES proteins, 625 (74.6%) proteins were annotated with 1429 different GO terms (835 for Biological Process, 231 for Cellular Component and 363 for Molecular Function) using the two annotation programs. The 12,064 non-ES proteins of the T. solium genome were also analyzed for GO term annotation, and a total of 10,218 (84.7%) proteins were mapped to GO terms. The distribution of GO terms at the second level is provided in Figure 2 for ES and non-ES proteins from the T. solium genome.

Figure 2

Gene Ontology distribution of ES proteins and non-ES proteins from T. solium.

Distribution of Gene Ontology terms at level 2 for: (A) Molecular Function, (B) Cellular Component and (C) Biological Process.

The most represented GO terms in the 838 ES proteins in the Molecular Function category (Figure 2A) were binding (42%) and catalytic activity (37%). The molecular function regulator and catalytic activity terms are overrepresented in the ES proteins as compared to the distribution of the same terms for the non-ES proteins of the T. solium genome (Figure 2A). In contrast, the transporter activity and binding terms are underrepresented in the secretome as compared to the distribution of the same terms for the non-ES proteins. At the third level, the binding term predominantly includes the ion binding (13%), protein binding (11%), organic cyclic compound binding (11%), heterocyclic compound binding (11%) and small molecule binding (10%) terms. At the third level, the catalytic activity term predominantly includes the hydrolase activity (10%), transferase activity (10%), oxidoreductase activity (3%), isomerase activity (1%), ligase activity (0.5%) and lyase activity (0.5%) terms.

The most represented GO terms in the ES proteins at Cellular Component category (Figure 2B) were: cell (28%), organelle (21%), membrane (21%), macromolecular complex (10%), extracellular region (9%) and membrane enclosed lumen (4%) terms. The extracellular matrix, extracellular region and membrane terms show an overrepresentation in the secretome as compared to the distribution of the same terms for the non-ES proteins (Figure 2B). The most represented GO terms in the 838 ES proteins at Biological Process category (Figure 2C) were: cellular process (18%), metabolic process (16%), single-organism process (14%), biological regulation (10%), response to stimulus (7%) and multicellular organism process (5%) terms. The biological adhesion, biological regulation and metabolic process terms show an overrepresentation in the secretome as compared with the distribution of the same terms for the non-ES proteins of Taenia solium genome.

Gene Ontology terms enrichment

We analyzed whether any GO term shows a significant enrichment in the secretome as compared to the GO term distribution expected for the whole T. solium genome (Figure 3). In the Molecular Function category, a significant enrichment was found for terms related to the regulation of peptidase activities, extracellular matrix structural constituent and oxidoreductase activity (Figure 3A). In the Cellular Component category, terms related to extracellular components, endoplasmic reticulum lumen and components anchored to membrane show a significant enrichment (Figure 3B). In the Biological Process category, the significantly enriched terms were related to the regulation of peptidase and hydrolase activity, proteolysis and extracellular structure organization (Figure 3C). The complete lists of significant GO enrichments assigned to ES proteins are provided in Supplementary Tables S1–S3.

Figure 3

Gene Ontology enrichment of ES proteins as compared to the total proteins from T. solium genome.

Significant enrichments of Gene Ontology terms for: (A) Molecular Function, (B) Cellular Component and (C) Biological Process.

Pathway mapping

We used KAAS42,43,44 to map ES proteins to biochemical pathways. A total of 384 (45.8%) ES proteins were associated with 166 KEGG pathways. The most represented KEGG pathways are shown in Table 1 and full annotations are available in Supplementary Table S4. The two most frequently mapped KEGG pathways were protein processing in endoplasmic reticulum and lysosome. Interestingly, four proteins were predicted to be involved in antigen processing and presentation (ranking 23), which might play critical roles in host-parasite interactions.

Table 1 Top 15 most represented KEGG pathways in T. solium secretome

Enzyme Code Distribution

We classified the enzymes contained in the ES proteins and in the non-ES proteins according to the six Enzyme Commission (EC) classes (Figure 4). The results show an overrepresentation of hydrolases, oxidoreductases and ligases in the ES proteins as compared to the same enzyme types in the non-ES proteins of the T. solium genome (Figure 4A). Hydrolases represented 43% of the enzymes in the ES proteins, but 31% of the enzymes in the non-ES proteins (Figure 4A). Oxidoreductases represented 16% of the enzymes in the ES proteins, but only 9% of the enzymes in the non-ES proteins (Figure 4A). The three most represented EC subclasses of hydrolases were: acting on peptide bonds (peptide hydrolases) (18 proteins), acting on ester bonds (8) and glycosylases (6) (Figure 4B). The three most represented EC subclasses of transferases were: transferring phosphorus-containing groups (13 proteins), glycosyltransferases (5) and acyltransferases (4) (Figure 4C). Finally, the most represented EC subclasses of oxidoreductases are shown in Figure 4D.
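The EC class of an enzyme is given by the first digit of its EC number, so a class distribution like the one in Figure 4A can be tallied directly from a list of EC numbers. The following is a minimal sketch of that tally; the function names and the example EC numbers are hypothetical and are not data from this study.

```python
from collections import Counter

# The six top-level Enzyme Commission classes referred to in the text.
EC_CLASSES = {
    "1": "Oxidoreductases",
    "2": "Transferases",
    "3": "Hydrolases",
    "4": "Lyases",
    "5": "Isomerases",
    "6": "Ligases",
}

def ec_class(ec_number):
    """Return the top-level class name for an EC number such as '3.4.21.4'."""
    first = ec_number.strip().split(".")[0]
    return EC_CLASSES.get(first)  # None for malformed or unknown codes

def class_distribution(ec_numbers):
    """Percentage of enzymes falling in each EC class."""
    counts = Counter(ec_class(ec) for ec in ec_numbers if ec_class(ec))
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()} if total else {}

# Hypothetical input, for illustration only (not data from this study).
example = ["3.4.21.4", "3.1.1.3", "1.11.1.9", "2.7.11.1"]
print(class_distribution(example))
```

Running the same tally separately on the ES and non-ES enzyme lists would give the two distributions that are compared in the text.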

Figure 4

Enzyme commission classes and subclasses distribution of T. solium ES proteins.

(A) EC classes for ES and non-ES proteins, (B) EC hydrolase subclasses for ES proteins, (C) EC transferase subclasses for ES proteins and (D) oxidoreductase subclasses for ES proteins.

Analysis of protein domains and motifs

The annotation of ES proteins using InterProScan45,46 resulted in 491 protein families and domains. The most represented InterPro domains are shown in Table 2. The three most represented protein domains were the Immunoglobulin-like fold, CAP domain and fibronectin type III. Interestingly, the Immunoglobulin-like domains are involved in a variety of functions, including cell-cell recognition, cell-surface receptors, muscle structure and the immune system. The Taeniidae antigen was also overrepresented (ranking 14).

Table 2 Top 15 most represented protein domains in T. solium secretome

Functional analyses of the specific T. solium secretome

We compared the 838 ES proteins against the genomes of E. multilocularis (Family: Taeniidae) and H. microstoma (Family: Hymenolepididae) to discard the ES proteins with homologues in these genomes. These two species are the evolutionarily closest relatives of T. solium whose genomes have been sequenced to date26. From these analyses, we retrieved 121 ES proteins without homologues in either genome (threshold e-value of 1 E−3). These 121 ES proteins were also BLASTed against all the non-redundant (nr) proteins of NCBI and we did not find any related protein homologue (threshold e-value of 1 E−3). Thus, these 121 proteins constitute the specific secretome of the T. solium genome and can be used as specific targets for T. solium infections. Mapping the set of 121 ES proteins to the InterPro and KEGG databases did not yield any functional annotation. Nonetheless, we annotated 39 sequences with 83 different GO terms using Argot2. However, the GO term enrichment analysis of these 39 sequences did not show statistically significant results as compared with the GO distribution for the whole T. solium genome. In an effort to obtain more functional information for this set of ES proteins, we subjected the 121 sequences to a fold recognition analysis using the Phyre2 algorithm47. Phyre2 has recently been used as an alternative approach for the functional annotation of novel protein sequences: if the structure predicted for the query protein is confident, the functions of the template protein can be tentatively assigned to the query protein. The minimum cut-off for the Phyre2 confidence score was set at 55%, and the proteins with confidence scores equal to or higher than this cut-off are shown in Table 3. The protein 08062.0.1 has a high structural similarity with the UPLC1 protein. Interestingly, the UPLC1 protein is an important regulator of cancer cell migration/invasion and of actin-based cytoskeletal remodeling48.
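The selection of the specific secretome described above amounts to keeping the ES proteins that have no significant BLAST hit in any of the searched databases. The sketch below illustrates this under two assumptions that are ours: the BLAST results are available in the standard tabular format (`-outfmt 6`, with the query ID in column 1 and the E-value in column 11), and a hit counts as significant when its E-value is at or below the cut-off. The function names and the example records are hypothetical.

```python
def queries_with_hits(blast_tabular_lines, evalue_cutoff=1e-3):
    """Collect query IDs having at least one hit at or below the E-value cut-off.

    Assumes BLAST tabular output (-outfmt 6), where column 1 is the query ID
    and column 11 is the E-value.
    """
    hits = set()
    for line in blast_tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= evalue_cutoff:
            hits.add(fields[0])
    return hits

def specific_secretome(es_ids, *blast_results, evalue_cutoff=1e-3):
    """ES proteins without a significant hit in any of the searched databases."""
    remaining = set(es_ids)
    for lines in blast_results:
        remaining -= queries_with_hits(lines, evalue_cutoff)
    return remaining

# Hypothetical records, for illustration only.
es = ["p1", "p2", "p3"]
emu = ["p1\tEmuA\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200"]
hmi = ["p2\tHmiB\t30.0\t50\t30\t2\t1\t50\t1\t50\t0.5\t25"]
print(sorted(specific_secretome(es, emu, hmi)))  # ['p2', 'p3']
```

In this example p1 is discarded because of its significant hit, whereas p2 is kept because its only hit has an E-value above the cut-off.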

Table 3 Phyre2 confident predictions found in the T. solium specific secretome

The Abundance of Antigenic Regions (AAR) value

To evaluate the antigenic potential of the T. solium secretome, the number of antigenic regions in each protein sequence was obtained using three different bioinformatics algorithms: the method reported by Kolaskar and Tongaonkar31, CBTOPE34 and BepiPred32. The Kolaskar31 method is a classical approach that uses the antigenic propensity and physicochemical properties of amino acids to predict antigenic regions. The BepiPred32 method combines the hydrophilicity of amino acids with a Hidden Markov Model (HMM) to predict B-cell epitopes. The CBTOPE34 method predicts conformational B-cell epitopes using the amino acid composition as an input feature for a Support Vector Machine (SVM) model. To normalize the number of antigenic regions by sequence length, we introduce the Abundance of Antigenic Regions (AAR) value (see Methods). This normalization was applied to the results of the three methods used for antigenic prediction. The AAR value gives the average number of amino acids per antigenic region in a sequence. Hence, a low AAR value means that a protein has more antigenic regions per unit length (a higher epitope density). We determined the AAR value for the 838 ES proteins and found on average one antigenic region every 26.2 amino acids using the Kolaskar method (Table 4), while the AAR values using the CBTOPE34 and BepiPred32 methods were 105.7 and 93.6, respectively (Table 4). The three different AAR values obtained for ES proteins are due to the different numbers of antigenic regions predicted by each method. However, all three methods show a consistent AAR difference between the ES and non-ES proteins (Table 4). Hence, we use the AAR values obtained by the Kolaskar31 method for comparisons between protein datasets.
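Whatever the predictor, the quantity that enters the AAR is the number of antigenic regions found in a sequence. As an illustration only, the sketch below counts regions as runs of consecutive residues whose score reaches a threshold, assuming that a predictor yields one score per residue. The threshold, the minimum run length and the scores are hypothetical and are not the settings of the Kolaskar, CBTOPE or BepiPred tools.

```python
def count_antigenic_regions(scores, threshold, min_length):
    """Count runs of at least `min_length` consecutive residues scoring >= threshold."""
    regions = 0
    run = 0
    for s in scores:
        if s >= threshold:
            run += 1
        else:
            if run >= min_length:
                regions += 1
            run = 0
    if run >= min_length:   # close a run that reaches the end of the sequence
        regions += 1
    return regions

# Hypothetical per-residue scores, for illustration only.
scores = [0.2, 1.1, 1.2, 1.3, 0.4, 1.5, 1.0, 0.1, 1.2, 1.2, 1.2, 1.2]
print(count_antigenic_regions(scores, threshold=1.0, min_length=3))  # 2
```

Here the two-residue run in the middle is too short to count, so two regions are reported for twelve residues.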

Table 4 Abundance of Antigenic Regions (AAR) for different T. solium protein datasets

The AAR value for the 347 ES proteins that are supported at the RNA level was 26.2 (Table 4). The AAR value for the set of 121 ES proteins that is specific to the T. solium genome was 28.9, while the non-ES proteins have on average one antigenic region every 42.1 amino acids (Table 4). The AAR value for the 48 ES proteins that are both supported at the RNA level and specific to the T. solium secretome was 28.3. Interestingly, all ES protein datasets had about 1.5- to 1.6-fold more antigenic regions per unit length than the non-ES proteins of the T. solium genome (Table 4). Hence, the epitope density in ES proteins is higher than in non-ES proteins. To validate the biological significance of the AAR values, we calculated this value for a dataset of experimentally derived ES proteins of T. solium compiled from the literature (see Methods). This set contained 46 protein sequences that have been experimentally reported to be useful in the diagnosis of human taeniasis or neurocysticercosis (Supplementary Table S5). Interestingly, the AAR value for this antigenic protein dataset was 21.8, which is close to the value calculated for the secretome (Table 4). In contrast, the non-ES proteins showed an AAR value of 42.1. Moreover, 44 (95.6%) of the 46 diagnostic proteins were found in our secretome (Supplementary Table S5), and we also found RNA support for these 44 proteins (Supplementary Table S5). To test whether our AAR values are similar to those of other known secretomes, we selected the secretomes of 12 helminth species that were recently reported in the Helminth Secretome Database (HSD)2 and calculated their AAR values. Table 5 contains the AAR values for the 12 helminth secretomes (4 nematodes, 4 trematodes and 4 cestodes). Interestingly, the AAR values obtained for known helminth secretomes were very similar to that obtained for the T. solium secretome reported in this study (Table 5).
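Because the AAR is the number of residues per antigenic region, epitope density is its reciprocal, and the fold difference in density between two datasets is the inverse ratio of their AAR values. The short calculation below makes this explicit using only the Kolaskar-based AAR values quoted in this paragraph; the dictionary keys are our own labels.

```python
# AAR values quoted in the text (Table 4, Kolaskar method).
aar_values = {
    "ES proteins": 26.2,
    "ES proteins, T. solium specific": 28.9,
    "ES proteins, specific and RNA supported": 28.3,
    "diagnostic antigens": 21.8,
}
non_es = 42.1

# Density is the reciprocal of AAR, so the density ratio is non_es / AAR.
fold = {name: non_es / value for name, value in aar_values.items()}
for name, ratio in fold.items():
    print(f"{name}: {ratio:.2f}-fold the non-ES density")
```

The ratios range from about 1.5 for the specific secretome to about 1.9 for the experimentally determined diagnostic antigens.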

Table 5 Abundance of Antigenic Regions (AAR) for different known helminth secretomes

Discussion

Cysticercosis is a neglected zoonotic infection caused by the parasite T. solium. It is included in the WHO's list of neglected tropical diseases and is the most prevalent human tapeworm infection. We have applied different bioinformatics approaches to identify and annotate all the predicted ES proteins encoded in the T. solium genome. To the best of our knowledge, the present study is the most comprehensive in silico collection of the T. solium secretome, which represents 6.5% of the total proteins encoded in the genome. This proportion of ES proteins is in agreement with secretomes previously reported for other species2,26. ES proteins can circulate in the extracellular space of an organism, making them attractive targets for novel therapeutics because they may be more accessible to drugs than other proteins. Our T. solium secretome provides a rich source of potential drug targets, vaccine candidates or diagnostic proteins for developing new treatment and diagnostic strategies. In addition, our study contributes to the knowledge of the molecular mechanisms of host-parasite interaction, as well as to the identification of novel proteins with immunomodulatory properties that could be used as targets to control inflammatory processes in non-infectious diseases.

Functional information on the T. solium secretome was obtained through the analysis of Gene Ontology (GO) annotations of the 838 ES proteins. The top 10 enriched GO terms showed a statistical overrepresentation in the ES proteins of biological activities that are strongly related to the typical functions of secreted proteins (Figure 3). The GO terms related to extracellular matrix, endoplasmic reticulum lumen and anchored to membrane showed a significant enrichment in the Cellular Component category. The secretome of an organism includes all proteins secreted by the cell, including those of the extracellular matrix, proteins shed from the cell membrane and vesicle proteins such as those of microsomal vesicles1,49,50. The enrichment of GO terms related to the endoplasmic reticulum lumen suggests that, even with a correctly predicted signal peptide, some proteins may be residents of the endoplasmic reticulum. The top 10 enriched GO terms for Biological Process and Molecular Function showed a statistical overrepresentation in the ES proteins of peptidase activity, extracellular organization and cell adhesion terms. Proteins with peptidase domains have previously been reported to be involved in virulence in several helminth species51. Several ES proteins were predicted to be involved in the antigen processing and presentation pathway. Interestingly, there is evidence that glycoantigens secreted by cysticerci can modulate the host inflammatory response through the activation of dendritic cells in experimental murine cysticercosis caused by T. crassiceps52. However, the relevance of ES proteins to the modulation of host-parasite relationships has not been studied in human cysticercosis, although it is well known that helminth ES proteins can modulate the host immune system during infection to favor parasite survival13,14,15.

The functional annotations found in the T. solium secretome by GO term enrichment, pathway mapping, enzyme code distribution and protein domain analysis support the usefulness of our bioinformatics workflow for predicting secretomes in other genomes. However, it is clear that integrating bioinformatics strategies with RNAseq data can improve the identification of expressed secretomes. Interestingly, 41.4% of our secretome was supported at the RNA level (unpublished data). The 121 ES proteins specific to the T. solium secretome represent potential novel drug or vaccine targets for therapeutic strategies and highlight the importance of future experimental research to characterize this protein dataset. The proteins of this dataset are not shared with other sequenced organisms, suggesting that they can be explored as diagnostic proteins specific for T. solium infections. T. solium is unable to synthesize the amino acid lysine, and among the secreted proteins we found enzymes able to degrade lysine-containing peptides. This finding is an example of the complex host-parasite interactions. The presence of lytic proteins in our secretome suggests that these proteins can be used to break down nutrients, making them more accessible to the parasite, or to break down immune response-related molecules that could induce parasite damage53,54,55,56,57,58. Interestingly, the hydrolases and oxidoreductases were overrepresented in the secretome as compared to the non-ES proteins of the T. solium genome. This is in agreement with the considerable enrichment of these enzyme types found in other experimentally determined secretomes50,59,60.

It was previously suggested that a high epitope density in a single protein molecule significantly enhances its antigenicity and immunogenicity61. Here, we found that experimentally determined antigenic proteins have a higher antigenic density, measured by normalizing the number of antigenic regions by sequence length (AAR values in Tables 4 and 5). The AAR is thus a manageable metric that reflects the epitope density of a protein. To our knowledge, the AAR is the first measurement that combines the number of antigenic regions and the sequence length to estimate the antigenicity of proteins at a genome-wide level. Nearly 40% of predicted ES proteins remain unannotated in the Helminth Secretome Database (HSD)2. The sequence annotation results obtained for the T. solium specific secretome, which were based on BLAST and HMMER searches, fold recognition strategies and AAR values, suggest that these strategies can be used to enhance the annotations of known secretomes. The AAR value for the T. solium secretome (Table 4) showed that these proteins are enriched in antigenic regions as compared to the non-ES proteins. Interestingly, the AAR values for the ES proteins were very similar to that obtained for the diagnostic proteins, suggesting their potential use in the diagnosis of T. solium infections (Table 4). In addition, the AAR values obtained for known helminth secretomes were very similar to that obtained for the T. solium secretome (Table 5). These results demonstrate the utility of the AAR value as a novel genomic measurement to evaluate the potential antigenicity of ES proteins at a genome-wide level. The traditional cloning of proteins for immunization purposes is clearly not feasible on a genomic scale. The AAR approach is cost effective and can guide a genome-wide search for antigenic proteins of therapeutic, diagnostic and immunological interest.

The use of different algorithms to predict antigenic regions could potentially improve the predictions. In this work, we obtained the AAR values using the number of antigenic regions predicted by three independent algorithms: CBTOPE34, which is based on a Support Vector Machine (SVM) model; BepiPred32, which is based on a Hidden Markov Model (HMM); and Kolaskar31, which uses the antigenic propensity and physicochemical properties of amino acids. Although the Kolaskar31 method predicts more antigenic regions per protein (lower AAR values) than CBTOPE34 and BepiPred32, there is a consistent difference in AAR values between ES and non-ES proteins for each method (Tables 4 and 5).

The T. solium ES proteins could be used as antigens to capture antibodies from infected patients. Subsequently, the antibodies can be used to directly detect the ES antigens in infected patients through a sandwich ELISA. Currently, NC diagnosis in humans lacks the sensitivity and specificity needed to establish a definitive diagnosis in patients with neurological diseases. The HP10 monoclonal antibody is one of the best reagents used for immunodiagnosis. However, HP10 is only effective for the detection and follow-up of the most severe forms of NC (that is, when vesicular cysticerci are located in the subarachnoid space at the base62). Novel ES antigens from the oncosphere stage have recently been suggested for NC diagnosis25,63. However, immunoassays in pigs using T. solium ES or total antigens have shown low sensitivity and many false positives and false negatives64. The experimental study of the ES proteins identified in this work will confirm which proteins are candidates for the development of new diagnostic tests and new disease treatments. However, protein functions are strongly context-dependent and further experimental analyses are needed to improve the reliability of the functional interpretation of our results. Additionally, further studies at the proteomic level are highly desirable to confirm the predicted secretome reported herein.

Methods

Prediction of Excretory/Secretory (ES) proteins of T. solium genome

The bioinformatics pipeline is summarized in Figure 1. We started with the 12,902 protein sequences of the T. solium genome26. The SignalP (version 4.1)35 and SecretomeP (version 2.0)36 algorithms were applied to all of these proteins. SignalP was used to predict classically secreted proteins, selecting the option for eukaryotic organisms and a positional limit of 70 residues for truncating each sequence before submitting it to the neural network algorithm. Input sequences were allowed to include TM regions and the D-cutoff values were left at their defaults. SecretomeP was used to predict the non-classical secreted proteins using the default options for mammalian organisms. All the classical and non-classical secretory proteins were merged and the resulting list was scanned by TargetP37 to predict mitochondrial proteins, using a specificity of 95% and the default options for non-plant organisms. The mitochondrial proteins predicted by TargetP were discarded from the protein data set. The remaining proteins were subsequently scanned for the presence of transmembrane helices by TMHMM (version 2.0)38, and protein sequences exhibiting transmembrane helices were also excluded from the final protein data set.
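Once each tool has been run, the filtering logic of the pipeline reduces to a union followed by two set differences. The sketch below shows only this combination step, assuming that the predictions of each tool have already been parsed into sets of protein identifiers; the function name and the identifiers are hypothetical, and running or parsing the tools themselves is not covered.

```python
def predict_secretome(signalp, secretomep, targetp_mito, tmhmm_tm):
    """Combine per-tool predictions (sets of protein IDs) into a secretome.

    Each argument is the set of IDs flagged by the corresponding tool;
    running the tools themselves is outside the scope of this sketch.
    """
    candidates = set(signalp) | set(secretomep)   # classical OR non-classical
    candidates -= set(targetp_mito)               # drop predicted mitochondrial proteins
    candidates -= set(tmhmm_tm)                   # drop predicted transmembrane proteins
    return candidates

# Hypothetical protein IDs, for illustration only.
secretome = predict_secretome(
    signalp={"p1", "p2", "p3"},
    secretomep={"p3", "p4", "p5"},
    targetp_mito={"p4"},
    tmhmm_tm={"p2", "p9"},
)
print(sorted(secretome))  # ['p1', 'p3', 'p5']
```

Because set difference is order independent, removing the mitochondrial and the transmembrane predictions in either order gives the same final set.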

Functional annotation and comparative analysis of ES proteins

The ES proteins were functionally annotated using several bioinformatics tools. To identify homologous proteins, ES proteins were BLASTed (BLASTP) against the non-redundant (nr) database using the Blast2GO package. The E-value cut-off was set at 1.0 E−3. Using Blast2GO39,40,65,66, ES proteins were mapped to GO terms and annotated with the following parameters: E-Value-Hit-Filter: 1.0 E−3; Annotation cut-off: 55; GO weight: 5; Hsp-Hit Coverage cut-off: 0.

The ES proteins were also mapped to Gene Ontology terms using Argot241, setting the Total Score (TS) to ≥200. Additionally, ES proteins were associated with protein families and domains through InterProScan45,46. Blast2GO was used to identify the statistically enriched GO terms represented in the ES proteins, setting the term filter value to 0.05 and the term filter mode to FDR. KAAS was used to map ES proteins to KEGG pathways and to KEGG BRITE objects, using the BBH (bi-directional best hit) method to assign orthologs and the representative gene data set for eukaryotes42,43,44.
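For readers unfamiliar with term enrichment, the sketch below illustrates the kind of test involved. It is not the Blast2GO implementation: we assume here a one-sided Fisher's exact test (hypergeometric upper tail) for each GO term, followed by a Benjamini-Hochberg correction to control the FDR at 0.05. The function names are ours and the term counts in the example are hypothetical; only the dataset sizes (838 and 12,902) come from the text.

```python
from math import comb

def fisher_right_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw: n ES proteins sampled from N
    genome proteins, K of which carry the GO term; k ES proteins carry it."""
    denom = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / denom

def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical term counts: (ES proteins with term, genome proteins with term).
terms = {"term_A": (12, 40), "term_B": (3, 45), "term_C": (20, 300)}
pvals = [fisher_right_tail(k, 838, K, 12902) for k, K in terms.values()]
for name, q in zip(terms, benjamini_hochberg(pvals)):
    print(name, "enriched" if q < 0.05 else "not enriched", q)
```

Exact integer binomial coefficients are used so that the tail probability does not depend on any external statistics library.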

Functional analyses of the specific T. solium secretome

The 838 ES proteins were searched for sequence similarity against the Hymenolepis microstoma (Family: Hymenolepididae) and E. multilocularis (Family: Taeniidae) genomes26 using BLASTP (E-value cut-off set at 1.0 E−3) to obtain the specific secretome of T. solium. The number of antigenic regions was calculated for each protein using the methods of Kolaskar and Tongaonkar31, CBTOPE34 and BepiPred32. The Abundance of Antigenic Regions (AAR) was calculated as follows for each method:

$$X_p = \frac{L_p}{A_p}$$

$X_p$: The relative abundance of antigenic regions in protein p

$L_p$: The sequence length of protein p

$A_p$: The number of antigenic regions in protein p

The AAR value was introduced to define the number of amino acids per antigenic region for each protein. This value was scored as the ratio of the sequence length to the number of predicted antigenic regions for each protein. Hence, the final value gives the number of amino acids that are needed, on average, to find one antigenic region in the corresponding sequence. The dataset of experimentally determined proteins used to diagnose human T. solium infections was compiled from a search of the NCBI database. We found 46 proteins, different at the sequence level, that have been experimentally reported to be useful for the diagnosis of human taeniasis or neurocysticercosis (Supplementary Table S5). The ES protein sequences were also submitted to the Phyre2 program47 using the default options, and the twenty top scoring matches (if any) were retained for each protein. The Phyre2 result is based on secondary structure prediction coupled to fold recognition and three-dimensional structure prediction47.
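As a worked illustration of the ratio described above, the following sketch computes the AAR of individual proteins and an average over a dataset. Two choices are our assumptions rather than part of the published method: proteins with no predicted antigenic region are treated as undefined and skipped, and the dataset value is taken as the mean of the per-protein values. The function names and the example proteins are hypothetical.

```python
def aar(sequence_length, n_regions):
    """AAR for one protein: residues per predicted antigenic region (L_p / A_p)."""
    if n_regions <= 0:
        return None   # assumption: undefined when no region is predicted
    return sequence_length / n_regions

def mean_aar(proteins):
    """Average AAR over (length, n_regions) pairs, skipping undefined values."""
    values = [v for v in (aar(l, a) for l, a in proteins) if v is not None]
    return sum(values) / len(values) if values else None

# Hypothetical proteins as (length, predicted regions); not data from this study.
proteins = [(300, 12), (150, 5), (420, 0)]
print(aar(300, 12))        # 25.0
print(mean_aar(proteins))  # 27.5
```

A protein of 300 residues with 12 predicted regions thus has one antigenic region every 25 amino acids, and a lower value indicates a higher epitope density.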

Additional information

How to cite this article: Gomez, S. et al. Genome analysis of Excretory/Secretory proteins in Taenia solium reveals their Abundance of Antigenic Regions (AAR). Sci. Rep. 5, 9683; DOI:10.1038/srep09683 (2015).