Genome-wide inference of the Camponotus floridanus protein-protein interaction network using homologous mapping and interacting domain profile pairs

Apart from some model organisms, the interactome of most organisms is largely unidentified. High-throughput experimental techniques to determine protein-protein interactions (PPIs) are resource intensive and highly susceptible to noise. Computational methods of PPI determination can accelerate biological discovery by identifying the most promising interacting pairs of proteins and by assessing the reliability of identified PPIs. Here we present a first in-depth study describing a global view of the ant Camponotus floridanus interactome. Although several ant genomes have been sequenced in the last eight years, studies exploring and investigating PPIs in ants are lacking. Our study attempts to fill this gap and the presented interactome will also serve as a template for determining PPIs in other ants in future. Our C. floridanus interactome covers 51,866 non-redundant PPIs among 6,274 proteins, including 20,544 interactions supported by domain-domain interactions (DDIs), 13,640 interactions supported by DDIs and subcellular localization, and 10,834 high confidence interactions mediated by 3,289 proteins. These interactions involve and cover 30.6% of the entire C. floridanus proteome.


Results and Discussion
Generating the interactome of ant C. floridanus. PPIs are typically mediated by interactions between domains that are often evolutionary conserved across species 15 and form stable interactions 16,17 . PPI (protein-protein interaction) maps from experiments on D. melanogaster were collected and augmented by PPI data from the DIP database (Database of Interacting Proteins). This provided a basis for interaction predictions according to interologs from C. floridanus: conserved proteins compared to Drosophila should also be conserved in their interactions 6,18 (see Materials and methods for details).
Optimally, for such predictions several methods are combined 19 (Fig. 1). We combined the orthology prediction methods InParanoid 20 and OrthoMCL 21 . This did yield a first estimate of the C. floridanus interactome with 6274 nodes and 51866 edges 22 . However, the preliminary ant PPI network could have several false positive interactions acquired from the interologs of template data as shown previously in similar other studies 5,10,23,24 , including transfer to curated databases 25 . To reduce false predictions, we counter-checked all our data by domain-domain interactions (DDI). DDI are often used as an approach independent from sequence homology-based methods to predict protein-protein interaction networks and thus strongly reduce the number of false positives 7,26,27 . Generally, some of the PPIs are achieved via interactions between short motifs that are often transient interactions 28 . On the other hand, conserved interactions are mediated by conserved interaction domains across species 6 . Moreover, many signals and processes in the cell rely on conserved interacting protein domains 16,29 . There were 51866 conserved proteins (interologs) and 20544 ant protein-protein interactions that also were associated with DDI pairs, yielding a curated C. floridanus interactome with 4589 nodes and 20544 edges. For final curation of the interactome we used the subcellular localization of ant proteins: interacting proteins have to share the same subcellular localization (summarized in Table 1), predicted interactions between proteins not in the same location were removed. This led to a consolidated ant interactome consisting of 3914 nodes and 13640 edges. The highest proportion of interactions were identified in the cytoplasm followed by nucleus and plasma membrane respectively. A closer inspection of the interactions that were enriched across subcellular compartments (such as Golgi apparatus-cytoplasm) showed that in numerous cases at least one of the interacting proteins was alternatively localized to a compartment other than its major site of localization and thus the interacting proteins did indeed share a common compartment. For instance, in 482 interaction pairs (Table 1) at least one protein showed both the Golgi apparatus localization and cytoplasmic localization. It should be noted that these interaction partners are multiple localized proteins and may also appear in other cellular compartments. This is not an uncommon situation, as > 50% of proteins of our final interactome network annotated with predicted subcellular localization information are, in fact, localized at two or more compartments.
As a final step of network reduction, isoforms of proteins are shown as a single node. These steps of successive filtering ultimately reduce the complexity of the network and increase the confidence of the C. floridanus interactome. Figure 1 summarizes the C. floridanus protein-protein interaction databases, our workflow, pruning steps and resulting ant network. It consists of 3289 nodes and 10834 edges (more details in 22 ). The complete four networks are provided in the Datasheets 1-4 in Supplementary Material. We also identified several novel interactions predicted to be present in C. floridanus. For instance, an interaction was observed between S-phase kinase-associated protein 1 (SkpA, Cflo_N_g10272) and immune receptor peptidoglycan-recognition protein LC (PGRP-LC, Cflo_N_g10272). As an important component of ubiquitin-proteasome pathway SkpA is involved in Immune Deficiency (IMD) pathway regulation in D. melanogaster 30 . Since PGRP-LC is also a regulator of the ant IMD pathway 3 , the interaction we identified suggests that SkpA can modulate the IMD pathway by the interaction with PGRP-LC. Not only the interaction between protein complexes such as laminin subunit beta-1 (Cflo_N_g14102) and laminin subunit gamma-1 (Cflo_N_g9869) but also the interaction between Cflo_N_ g14102 and C-type lectin precursor (Cflo_N_g765) was resolved (see Datasheets 3 in Supplementary Material for all the interactions).
To further supplement the proposed ant interactome, we performed a topology-based scoring of the network. The method CAPPIC 31 used the intrinsic modularity of PPI network for assessing the confidence of individual interactions. 88.5% of the total interactions are high confidence (Fig. 2) while 9.65% were assigned to medium confidence and 1.8% to low confidence.
We applied the Mann-Whitney test to compare the average confidence scores of all four PPI networks and observed significant increase of confidence score for the first three steps from the preliminary network through DDI mediated filtering and localization-based filtering ( Supplementary Fig. 1). The mean confidence score of the final interactome, after the isoform merging, did not change much. This is because the merging of this last step also eliminated some high confidence PPIs mediated by the isoforms. Nevertheless, the comparison of the proportions of high-confidence PPIs in the preliminary interactome and the final ant interactome indicates that it has a significantly increased number of high confidence interactions (in the preliminary network these are 78%, in the final 89%; Fisher's exact test p-value < 2.2e- 16). Note that the applied filtering steps also eliminated most of the low confidence PPIs (see low confidence zone in Supplementary Fig. 1). To further confirm the elimination after successive filtering steps, we compared the low confidence PPIs proportions in all four interactomes in a pairwise way with Fisher's exact test and show a significant decrease in the number of the low confidence PPIs between the preliminary, DDI-filtered and localization-filtered interactomes (in preliminary 4.5%, in DDI-filtered 2,2%, in localization-filtered 1.6% with maximum p-value < 3.4e-05). These analyses clearly demonstrate the improvement of network quality after filtering steps.
www.nature.com/scientificreports www.nature.com/scientificreports/ Network analysis of C. floridanus interactome and accuracy assessment. The resulting PPI summarizes the whole network and reveals central connecting nodes. The final high confidence ant interactome showed a clustering coefficient of 0.094 with a mean shortest path length of 4.359, network diameter 14 and an average degree of 6.970. As a typical biological network [32][33][34] it shows small-world connectivity and scale-free topology.
We further tested whether the proposed interactome aligns with the properties of a real biological network. To assess this, we derived three independent datasets and compared their topological properties with the proposed network. The average z-statistic value (Datasheet 5 in Supplementary Material) clearly indicates comparatively less variation of the ant interactome from the 'Barabási-Albert scale free model' (z-statistic = 23.06, −5.28) in terms of clustering coefficient and mean shortest path. However, the differences were high while comparing that of with 'Watts-Strogatz small world graph model' (z-statistic = −30.95, −58.49) and 'Erdős-Rényi flat-random' network model (z-statistic = 171.03, −52.72). Scale-free networks have been often observed in biological systems such as PPI and gene regulatory networks 35 , therefore the bias towards such a network is an indicator of the equality of the reconstructed ant interactome. To test another factor, the degree distribution of the ant interactome was www.nature.com/scientificreports www.nature.com/scientificreports/ much closer to Watts-Strogatz model (z-statistic = 0.49), although the differences with Barabási-Albert model was not too high (z-statistic = 2.45). The nodes in the network obey a power-law distribution indicating a typical, biological small-world and scale-free network.

Gene ontology (GO) enrichment analysis. The molecular function GO term over-representation analysis
indicates enriched protein functions in the ant networks (FDR <0.05; Table 2 and Datasheet 6 in Supplementary Material). Over-represented functional categories include the term 'binding' as to be expected from the PPI construction and a validation criterion. Out of 2804 proteins annotated as GO term GO:0005488 'binding' in C. floridanus proteome, 46.11% proteins are present in the final interactome. In total, 64 binding-related GO terms were identified constituting 34.97% of all over-represented GO terms. We only found the under-representation of two GO terms: GO:0003964 'RNA-directed DNA polymerase activity' and GO:0034061 'DNA polymerase activity' . This indicates during the filtering we did not lose most of the functional proteins that are involved in molecular binding.
We further compared the semantic similarity scores of the interacting pairs with the random networks of non-interacting proteins. We first assigned the level-4 GO annotations (for molecular function) to all the proteins coded by the ant genome using Blast2GO 36 . Next, we used the GOGO algorithm 37 to measure the semantic similarity scores of the high confidence interacting pairs in the proposed ant interactome. We further generated 30  The distribution of the different confidence levels were computed with CAPPIC 31 . Score distributions were separated into low, medium and high confidence category and the density for each category was plotted. In the three subsets scores range between 0 and 0.3 for subset 1 (green, low confidence), 0.3 and 0.7 for subset 2 (blue, medium confidence) and 0.7 and 1 for subset 3 (red, high confidence www.nature.com/scientificreports www.nature.com/scientificreports/ random networks each with 100 random interactions among the proteins that were assigned to level-4 molecular function GO annotations using a custom-made Perl script which can be accessed from the GitHub repository (https://github.com/ShishirGupta-Wu/ant_ppi). We made sure the random networks did not contain any proteins pairs apparent in the preliminary interactome. Using the GOGO algorithm 37 semantic similarity scores were also assigned to the random networks (non-PPIs) and these scores were further compared with the interacting proteins in a pairwise way using the Mann-Whitney U test. We observed that the interacting protein set had not only the highest average score of 0,47, this was also well separated and significantly higher than the average score in all the 30 non-PPI sets (Fig. 3). This comparison demonstrates the interactions in our calculated ant interactome are functionally relevant and clearly different from random networks.
C. floridanus interactome protein conservation compared with seven organisms. Proteins that perform essential functions are expected to be evolutionary conserved. We further investigated the evolutionary conservation of ant interactome proteins. Higher degree proteins are generally evolutionary better conserved 38 , some caveats are discussed in 39,40 . To analyze this, node degree and the fraction of proteins present in the ant interactome that are conserved in different model organisms were compared. It turns out that in general the interactions are conserved and supported by most species tested and not just by one (Fig. 4). There was a positive correlation between degree and conservation in the evolutionary closest analyzed species A. gambiae (Spearman's rank r = 0.62, p-value = 3.5e-09). Similar correlations are observed between ant and human (r = 0.60), and mouse (r = 0.51). Between ant and worm the correlation was weak (r = 0.33), while no significant correlation is observed between ant and A. thaliana, P. falciparum, and yeast. An ortholog table is provided in Datasheet 7 in Supplementary Material.

Overall conservation and infection induced hubs and bottlenecks in the ant interactome.
We also evaluated the overall conservation of all the ant proteins with the other seven model organisms and compared the relatedness of the ant interactome proteins using the chi-square test. The analysis indicated the relatedness of corresponding proportions with p-value < 0.05 in each case. The differences in the number of orthologs can be clearly visualized (Fig. 5a) in case of ant comparison with protozoan parasite, yeast and plant.
Due to the large phylogenetic distance to these three organisms there are less orthologs but these are well conserved (chi-square test).  www.nature.com/scientificreports www.nature.com/scientificreports/ The remaining set of the other four organisms including insect, human, mouse and worm together consists of/ contains higher number of orthologs in comparison to the ant proteins ( Fig. 5b and Datasheet 7 in Supplementary Material). 187 proteins of the ant interactome are ant-specific in this comparison: they do not have orthologs in any of the analysed organisms (Fig. 5b). The analysis of central topological properties of a PPI network helps to identify key multifunctional components of the network 41 . Infection induced proteins of C. floridanus are conserved in related organisms including key interactions. The degree of the node 42 and the betweenness centrality 43   www.nature.com/scientificreports www.nature.com/scientificreports/ We applied Fisher's exact test to compare the proportion of multi-localized proteins in hubs and bottlenecks to non-hubs and non-bottlenecks, respectively. Supplementary Fig. 2 shows differences between the localization of bottlenecks and hubs of the ant interactome. For bottleneck proteins, 70% were found to be multi-localized (versus 56% for non-bottleneck proteins; significant difference; p = 9.6 × 10 −10 ). On the other hand, 62% of the hub proteins had multiple localizations (versus 56% for non-hub proteins; significant difference, p = 0.001575).

GO ID GO molecular function term FDR
Integration of the RNASeq data 3 with the ant interactome revealed differentially expressed infection-induced hubs and bottlenecks during the bacterial infection of C. floridanus (Fig. 5c). These include also well-known key proteins involved in C. floridanus immune response such as nuclear factor NF-kappa-B p110 (Relish, Cflo_N_ g6082), acidic mammalian chitinase (Cflo_N_g2277), as well as stress-related protein cytochromes P450 6A1 (Cflo_N_g11706) 3 . Given the high importance of hubs and bottlenecks in PPI networks and their differential expression during bacterial infection, all the identified proteins are expected to participate in the defense against bacterial pathogen, and hence can also be examined for decoding immune mechanisms. The insect peritrophic membrane (PM) imposes protective physical barriers over the midgut epithelium 44 . The PM related proteins have shown their potential as targets for pest control 45,46 . Therefore, the important ant peritrophic membrane protein 1 (Cflo_N_g4555) (Fig. 5c) with no human homology could be further tested as a potential pest target. However, differential expression does not guarantee a protein to be the best target 47,48 and therefore, other topologically important proteins in the network without human homology (Datasheet 7 in Supplementary Material) should also be considered as potential pest targets in future.

conclusions
Our curated ant interactome is the first large-scale PPI network of an ant. It allows besides numerous analysis of network biology to study how different cellular processes connect to each other including hub proteins and different types of crosstalk, for instance in immunity.
Similarly, the PPI maps of other sequenced ants can be reliably predicted using the interologs of the reconstructed high-confidence C. floridanus interactome. Moreover, detailed cross-validation, comparison with random networks, GO annotation, and conservation analysis support the high quality of the resulting ant interactome and its construction steps. The network analysis including evolutionary conserved network proteins further suggest that topologically important proteins could also be exploited as future pest targets. For instance, cytochrome P450 6A1 (Cflo_N_g11706), peritrophic membrane protein 1 (Cflo_N_g4555), flexible cuticle protein 12 (Cflo_N_g6859), endocuticle structural glycoprotein SgAbd-1 (Cflo_N_g7775) were identified as topologically important differentially expressed proteins with no human orthologs. Nevertheless, specific interactions highlighted from our global analysis will need individual follow up by detailed investigations.

Materials and Methods
Reconstructing protein-protein interaction map of C. floridanus. We compiled the list of experimentally verified high-confidence PPIs available in Database of interacting proteins (DIP) 49 , D. melanogaster PPIs from DroID 50 database which includes data from different studies including interactions from high throughput Gal4 proteome-wide yeast two-hybrid (Y2H) screens 32 , LexA Y2H system screens [51][52][53] , PPIs from fly protein interaction map 54 , interactions determined in large-scale co-affinity purification (co-AP)/MS screens 55,56 , interactions from BIND 57 , BioGRID 58 , MINT 59 , IntAct 60 , and databases available in DroID v2014_10.
The C. floridanus interologs of the entire template PPIs were determined using orthology predictions from the software InParanoid 20,61 and OrthoMCL 21 . These were further customized using own perl and bash scripts. For DIP interactors we used the default parameters of InParanoid. For the fly data orthology was determined using the stricter Blosum80 matrix. For the OrthoMCL based interologs mapping a Blast e-value of 1e-05 was used and the MCL inflation index set to 1.5. InParanoid distinguished seed orthologs with co-orthologs and left fewer possibilities of mixing outparalogs in orthologous clusters. Consensus predictions of InParanoid and OrthoMCL were added to InParanoid seed orthologs to create a set of interologs.
pruning ppis with domain-domain interactions. The amino acid sequences of non-redundant preliminary PPIs were extracted and domains were assigned to them using Pfam version 27.0 62 . The list of non-redundant domain-domain interactions was prepared from the meta-databases Domine 63 , DIMA 3.0 64 and IDDI database 65 . These use complexes available in the Protein Data Bank (PDB) 66 to identify by interacting domains the Pfam families containing these domains. These Pfam families are then predicted to be interacting. This list was used to parse the template PPIs. All interactions were categorized whether they are supported (good interactions, used for further filtering steps) or not by domain-domain interactions (DDIs). Subcellular localization filtering. The subcellular localization of C. floridanus proteins was determined with orthology to Swiss-Prot proteins and the extended version of KnowPredsite 67 available at UniLoc server (bioapp.iis.sinica.edu.tw/UniLoc/), a knowledge-based classifier for protein subcellular localization. If in a binary interaction, both proteins do not share the same localization or at least one compartment in multiple localized proteins, the interaction was ruled out as probable not occurring.
Isoform filtering. The information on C. floridanus protein isoforms and their function was extracted from our previous publication of C. floridanus re-annotation and transcriptome sequencing 3,68 . To reduce network complexity and noise, isoforms of any specific protein present in the network were represented as a single node. Although, the data files for all the networks are provided in the Supplementary Tables (1-5) which allow interested readers to analyze the network of their choice further if they wish.
here s and t are network nodes different from node n, σ st is the number of shortest paths from s to t, and σ st (n) gives the number of shortest paths from s to t that goes through node n. Hubs and bottlenecks in the network were identified with cytoHubba 75 . Hubs were defined as proteins connecting with ≥5 proteins. Moreover, top 20% of bottlenecks and hubs were considered for mapping of the RNASeq expression data which was collected from our previous publication 3 .

Random networks.
We generated random networks following the Erdős-Rényi Model 76 , Barabási-Albert Model 77 and randomized the proposed (final) ant interactome while preserving the total number of interactome nodes using the Network Randomizer plugin 78 of Cytoscape 73 . A total of 1000 random simulation were employed to generate the undirected random graphs. For all three network sets we computed topological parameters, mean shortest path, degree distribution and clustering coefficient and compared their differences to the native ant interactome using the statistical Z-test 79 . functional annotation. Blast2GO 36 was used to annotate the Gene Ontology (GO) terms of proteins involved in the reconstructed interactome. Over-representation analyses of GO terms was performed using the Gossip package 80 of the Blast2GO suite. A two-tailed Fisher's exact test followed by false discovery rate (FDR) correction for multiple testing 81 was applied to see the functional difference of ant interactome proteins annotations (foreground set) and full C. floridanus proteome annotations 3 (background set). Only differences having an adjusted p-value < 0.05 were considered significant.
Orthology analysis. InParanoid 20 was used to identify the orthologs of topologically important nodes in seven model organisms: Anopheles gambiae, Arabidopsis thaliana, Caenorhabditis elegans, Homo sapiens, Mus musculus, Plasmodium falciparum, and Saccharomyces cerevisiae. Only the ortholog with 100% bootstrap support was considered as true ortholog. As a note of caution, the conservation was calculated rather conservatively demanding double orthology relations. Hence, the absence of an ortholog (Suppl. Datasheet 7) only indicates that the highly restrictive threshold was not met. Generally, a sequence related protein may still be found by less restrictive algorithms (e.g. BLAST).
For exact quantification of the degree of conservation of ant PPIs we did not check the possible restricted conservation of the binary ant PPIs, but more general the conservation of proteins that are present in the ant interactome and have orthologs in seven other species. After calculation of the orthology relationships between ant and other organisms we identified for every degree the occurrence value of the ant interactome and how many orthologs are present in other species. For each organism the fraction of proteins at a particular ant interactome degree is considered as the number of ant protein orthologs at that particular degree and greater divided by the number of proteins in the set.

Data availability
All data generated and analysed during this study are included in this published article (and its supplementary Information files). The dataset and codes for random network generation are also available at https://github.com/ ShishirGupta-Wu/ant_ppi.