Context-dependent prediction of protein complexes by SiComPre

Most cellular processes are regulated by groups of proteins interacting together to form protein complexes. Protein compositions vary between different tissues or disease conditions enabling or preventing certain protein−protein interactions and resulting in variations in the complexome. Quantitative and qualitative characterization of context-specific protein complexes will help to better understand context-dependent variations in the physiological behavior of cells. Here, we present SiComPre 1.0, a computational tool that predicts context-specific protein complexes by integrating multi-omics sources. SiComPre outperforms other protein complex prediction tools in qualitative predictions and is unique in giving quantitative predictions on the complexome depending on the specific interactions and protein abundances defined by the user. We provide tutorials and examples on the complexome prediction of common model organisms, various human tissues and how the complexome is affected by drug treatment.

The simulation algorithm has been implemented in C and CUDA for GPU computation. All the remaining scripts as well the GUI have been implemented in Python 2.7 using numpy, matplotlib, pubsub and wx modules. Executable of the GUI have been generated with PyInstaller. We also provide a python script to execute the whole pipeline from command line. This is particularly useful for users who can access high performance computing facilities.
SiComPre 1.0 is available both for Windows and Linux operating systems.
Execution of this pipeline with one simulation only on a desktop with intel core i7, GeForce GTX 760 and 16 GBs RAM takes proximally 4 hours for models generated from yeast proteome and 24 hours from human proteome data. The example provided within SiComPre 1.0 (Supplementary   Text 1 and Supplementary Table 6) can be completed without GPU in few minutes.

Compartment definition and simulation
Compartments can be defined by users, but we still provide a default compartment organization.
Protein assigned to a compartment can only diffuse in SVs that are part of the compartment. In our default model, we assign a group of SVs to each compartment. It is also possible to assign more compartments to a single SV in order to generate an overlap where proteins of different compartment can meet. This is useful to predict transmembrane protein complexes. From a simulation point of view proteins can interact with every protein in the same SV, however in the model used as example we didn't generate an interaction for proteins that are not in the same compartment except if the compartment is a membrane. As an example, cytoplasm and Golgi are in contact and they share some of the assigned SVs. A third compartment is used to define the membrane between Golgi and Cytoplasm. Proteins in the cytoplasm can reach the membrane as well those in the Golgi. However, they cannot interact because in our model we didn't allow the interaction between two compartments. Instead, proteins in the cytoplasm and Golgi can interact with proteins in the membrane. Thus, through membrane protein we can model transmembrane protein complexes. Size of compartments has been chosen according the square root of the volume reported for liver cells [1]. Next, small adjustments were adopted to better fit the size to the number of proteins in the relative compartment.

Controllable parameters
The simulation results are dependent on multiple parameters that can be tweaked by users according their needs. Below we summarize the parameter functionalities divided by their stage in SiComPre 1.0 pipeline: -Compartments: • Grid dimensions: the number of sub-volumes (SVs) used to discretize the simulation space.
-Model generation: • Binding strategy: the strategy used to resolve ambiguity in decomposing PPIs into DDIs. Possible values are Fictitious, Non Fictitious, and Functions.
• Cell volume: protein abundances given in the input file are multiplied by this number.
• Abundance reduction: is a value between 0 and 1; protein abundances after cell volume multiplication are raised to the power of this number. -Simulation: • Maximum concentration: number of proteins that a single sub-volume can simultaneously contain.
• Simulations: how many times a model will be simulated.
• Maximum simulation time: if defined, SiComPre stops when the simulation time reaches the indicated value.
-Refine complexes: • Merging threshold: threshold to merge similar complexes when the refined list of complexes is generated.

Parameter tuning
We tested SiComPre 1.0 on data from the yeast Saccharomyces cerevisiae with various assumptions: -We tried setting the protein binding limit (BL) to 10 or to 100, meaning that in one case the maximal number of binding partners of each protein was assumed to be 10 or 100 (quasi no limit). This was done to see whether this feature has an impact on the prediction of protein complexes.
-To see if there is any difference in the predicted complexes varying the number of SVs used in the simulation we run the simulations multiple times with SV=4096 and SV=1024.
-We also tested if restrictions on the compartmental localization of each protein has an effect on the results, thus we ran simulations with and without considering compartments.
-We set the number of simulations to 3 for each combination of parameters and as there is no proteome-wide information available on it, we set the binding and unbinding rates to 1.
Predicted complexes were compared to a reference dataset [2] using the composite score of recall, maximum matching ratio (MMR) and accuracy [3] and removing complexes whose proteins were not found in the initial PPI, in order to assess the method rather than the initial PPI dataset. We tested our simulations with multiple parameter choices (Supplementary Table 1).
Interestingly the consideration of compartments reduced the matching scores on the simulated complexes (SCs). Setting the binding limit of each protein to 10 increased the composite score, compared to BL=100 (minimal steric restriction on binding partners). Simulations with 1024 SVs reduced the composite score compared to SV = 4096. All of these scores were calculated removing complexes whose proteins were not found in the initial PPI, in order to assess the method rather than the initial PPI dataset. According to these scores, we selected as best option SV=4096 and BL=10, because with the same composite score it has a higher precision (0.4648 vs 0.4576) when compartments are not considered and it has a higher composite score with compartments. Note that we didn't run the whole pipeline for each combination, since our purpose was to see how the simulation performs with each parameter.
Next, we ran our simulated complex refinement pipeline with different settings and finally we checked the score of complexes enriched in the same compartment (Supplementary Table 2).
Without compartments SiComPre 1.0 can get a better composite score, however the resulting complexes are not consistent with compartments. Including the split step, the score improves by 0.5 without using compartments and 0.2 using compartments. The maximum composite score is reached with the splitting algorithm, but without compartments (2.38). In order to assess the method rather than the initial PPI dataset, all scores were calculated removing complexes whose constituents were missing from the initial PPI dataset.
Split is based on Markov clustering (MCL) with the inflation parameter equal 1.4. We observed that with this parameter MCL allows to split complexes without breaking single subunits and we can split protein complexes that are bounded by a weak interaction ( Fig 1B). Next, we merged complexes according to their similarity (Overlap score > 0.5). These two steps can be performed once the simulations are over. Although the process of merging similar complexes decreases the overall composite score, it is a necessary step to reduce the noise in our qualitative prediction and to have a reliable quantitative prediction, because many multicomponent complexes has a large number of variants, but in the predictions these should be considered the same complex instead of many different low abundant predictions. Users with interest in all the alternative forms can stop the pipeline at the simulation step and investigate those results.
As suggested by Nepusz et al. [3], due to the lack of completeness of reference dataset of protein complexes, a predicted complex without a match in this reference dataset can still be a true protein complex. Therefore, for precise prediction of protein complexes, we have to consider if interacting proteins (with complementary DDIs) can appear in the same cellular compartment. More precisely, if the predicted protein complex is located in multiple compartments, then either one of the members is a transmembrane protein or we consider it as false positive. As a further advancement, we tested the score after removing those complexes that are composed of proteins located in more than 2 compartments. To understand whether a complex composition is indeed consistent with the cellular compartment where its components were shown to be localized, we first checked what is the fraction of proteins of each protein complex are assigned in the same compartment in the CYC08 dataset [2]. About 80% of protein complexes show a fraction of 1, meaning that all their proteins are localized in the same compartment. Therefore, we removed predicted complexes which proteins are not in the same compartment. In the case without considering compartments the scores were reduced by 0.1 and 0.14, while in the case with compartments only by 0.01. Finally, we checked if other methods could achieve a higher score after removing complexes formed by proteins that are not located in the same compartment.
ClusterOne [3] -one of the best tools available to predict protein complexes, but without quantitative information -had a score of 2.17 (0.21 lower than our best composite score), however once complexes not in the same compartment are removed, it drops down to 1.76 which is 0.37 lower than our best prediction.

Algorithm to merge similar complexes and reach Refined Complexes
The merging step requires to measure all pairwise similarity score for each pair of complexes, and this can be highly memory demanding when it needs to be performed for every SCs. SiComPre 1.0 simulations generate a large number of SCs more than 100,000 with human data), therefore to reduce the number of complexes and therefore the amount of memory required for this step, we first run this heuristic algorithm to merge similar complexes.
Next, we compute a similarity matrix where element i,j represents the overlap between complex i and j.
While the maximum overlap in the similarity matrix is greater than 0.5, we take the pair with the maximum overlap and we merge them, then we recompute the overlap for the resulting new complex. Finally, we get the new maximum and we repeat the process. However, in this final step the overlap is not considered as we considered so far, but we increase the overlap linearly with the size of complexes merged.
Where a is the product of the complex sizes, length(c1) and length(c2), while the other parameters have been chosen arbitrarily to reduce as much as possible the number of redundant complexes.
The meaning of this step is to let larger protein complexes to bind each other with a smaller overlap compared to small complexes. We want this because large complexes have a higher number of variants compared to smaller one, moreover they tend to have a large number of attachments with a smaller fraction of proteins in the core of the complex.
The maximum threshold of 0.5 has been considered because for small complexes we believe that protein complexes with smaller overlap are different enough to be considered two separated complexes, e.g. three protein long complexes that share two proteins have an overlap of 0.44 and we want to keep them separated, in the same time we want to have large complexes that share a smaller fraction of the proteins to be merged together.
The algorithm used to merge complexes is illustrated below: where complexes is a list of protein complexes ordered by size and Thr is the overlapping threshold used for merging.

Compare simulation results from different conditions
The algorithm used to compare results builds a network where each node is a protein complex in a given condition, edges represent the overlap between the two nodes (complexes). This network is clustered with MCL in order to find group of complexes that are similar among the different conditions (e.g. the same complex in two conditions will be clustered together). Finally, the sum of protein complex abundances within each cluster for each condition is used to determine the abundance of the given cluster in the given condition. These can be used to evaluate any difference between protein complexes in different condition. Quantitative variation will be shown as a difference between the abundance of the cluster in different condition; complexes with different proteins (qualitative variation) will not be clustered together resulting again in clusters that have different abundance (i.e. 0 in a condition and a positive number in the other condition).

Performance with reduced protein abundances
SiComPre can be slower than existing methods to predict protein complexes since it requires to perform simulations with multiple copies of each proteins considering multiple binding sites and a large number of sub-volumes. This combinatorial problem can generate about 100,000,000 possible interactions. Some users might not have access to a machine equipped with the necessary hardware to run such a complex simulation. Therefore, we provide the score of the qualitative prediction in yeast based on reduced protein abundances and without compartments (Supplementary Table 4). We keep at 3 the abundances of those proteins which would fall below 3 by this transformation. Again, protein complexes after refinement step (RC) were compared to CYC08 [2], this time without removing complexes that are not consistent with protein localization.
Despite a good qualitative prediction, the refined complexes quantitative predictions cannot be reliable when protein abundances are reduced to low values.

3.
Step-by-step tutorial to execute SiComPre 1.0 In this part, we are going to describe the 6 simple steps to predict protein complexes with

Step 1. Requirements
SiComPre can be executed on Windows and Linux without any hardware or software requirements. The presented example can be simulated on almost any state of the art computers, however proteome-wide models used in this study needs a lot of memory (up to 4-8 GB) and require a many computation hours. Therefore, it is strongly suggested to accelerate the computation using GPU computing and this requires a CUDA device (https://developer.nvidia.com/cuda-gpus) and updated graphics card drivers.
To run the whole pipeline from command line it is also necessary to have Python (only version 2.7 have been tested) with numpy package installed. This is provided as part of our installation package.

Step 2. Define compartments
In this step it is possible to design or import the compartment organization for the simulation. Users can first select the grid size (1), save the current compartment model for later usage (2)  Once imported, SiComPre will visualize the space.
Selecting the desired compartment one can assign lattices to a desired compartment (4) selecting cells in the drawing panel. It is also possible to define new compartments with button "Add" (5).

Step 3. Import input files and generate the model
The tool provides the option to load a few examples with all input files required. Here we explain how to modify these or add data from other input files.
The tool requires to import the desired Protein-Protein Interactions dataset (1), Domain-Domain Interactions dataset (2), protein localizations (3), protein-domain associations (4), protein-function associations (5), protein information (abundances, diffusion rates, binding limits) (6). These text files require the following format. Three binding strategies are available as described in the related publication and in our previous study [4]: • Full: If a DDI is not found between interacting proteins of a PPI then specific fictitious interacting domains are added to these proteins.
• noFictitious: A PPI is considered only if a corresponding DDI is found. This strategy yields less false positives, but the false negatives increase. The remaining models, requires more memory and hybrid GPU and CPU computation to be executed, however by reducing the abundance of the proteins in the simulation (parameter 8 and 9) it is possible to execute these models on a less powerful machine (see section 2.1).

Step 4. Run simulations
Now the model is ready to be simulated, select number of simulations to run (1), max simulation time (2) which is typically between 0.5 and 2.0, number of CPU threads to use (3) and the maximum number of molecules that can be located in the same sub-volume at the same time (4).

Step 5. Refine complexes
In this part users can execute the steps needed to refine protein complexes as explained in the main article. Users only need to define the inflation parameter used by Markov Clustering (1) and the merging threshold (2) used to merge similar complexes. Markov Clustering with high inflation will break a complex in more pieces compared to a smaller value. The higher is the merging threshold the larger will be the generated protein complex. This process will be applied to each simulation in the session directory and resulting protein complexes are stored in session_directory/simulation_directory/refined_complexes. Finally, all refined complexes in different simulations will be merged together with the same algorithm in a single file located at session_directory/refined_complexes.txt

Step 6. Check and export results
Users can visualize the distribution of protein in the simulation space and how they diffuse over time (1). This is done for each simulation that can be selected with (2). It's also possible to restrict the visualization to a smaller set of proteins, listed in the textbox (7). It's possible to add proteins selecting them from the list of available proteins (4) and click "TRACK PROTEIN" (5) or, if users want to add multiple proteins without selecting them one by one, it is possible to load a file containing a list of proteins to be tracked (6). In this file, each protein name needs to be written in a different line. Through search panel (3) users can find search for protein in the list (4). Buttons (8-9) allow to remove proteins from (7).
Results are now ready to be exported and analyzed with other tools. CSV files, generated clicking button (1), can be imported in software such as MS Excel where it is possible to visualize the predicted protein complex abundance among the simulations, known complexes matching a predicted complex and the link to DAVID analysis chart for the involved proteins. Once terminated, the file will be available in the session directory with the name "prediction.csv". When comparing results (2), users need to select the directory of a previous session containing the same number of simulations, next the algorithm described in the main text will be executed. Results are stored in the session directory. Users can also generate a network of a specific complex by defining the ID indicated in the "prediction.csv" file generate with button (1). Network files can be imported in Cytoscape to generate networks like those used in this manuscript (3). However, it is not possible to generate the network of complexes that shares at least one common protein as their relative network would be merged together in Cytoscape. Finally, it is possible to compare the qualitative prediction to a set of reference complexes loaded with (4). Matching threshold (5) is used as minimum overlapping score necessary to match a reference complex. PPI filter (6) filters out reference complexes that were not represented in the PPI file used to generate the model.

Save and Load sessions
SiComPre assign to each session a directory that contains all results and file generate through the pipeline. When a new session is started, the default directory is SICOMPRE_PATH/tmp. Each simulation creates a sub-directory named with the simulation number. Users can control session through the File menu.
Renaming the session will rename the directory "tmp" with a different name. This will prevent to be overwritten by a new session. To load a previous session, users need to select the directory of a previous session, in this way SiComPre will change the session directory to the one selected by the user. In case of multiple GPUs in the system, users can indicate the ID of the GPU device to be used by editing the line

Execute SiComPre from the command line
In the runPipeLine.py. ID=0 usually indicate the default GPU. It is possible to disable GPU computation by setting this value to -1.

Run a test
To run a simple test at the scope to test the installation, we provided some example files located in the "Data/Example" directory (Supplementary Table 6). Execution time of this test should take only few seconds. "Protein -Location association" and "Compartment model" can be omitted. A more realistic test can be performed using the file "collins2007rRNA.txt" and tweaking the CellVolume value in order to run a simulation with a small or high number of proteins.
Alternatively, it is possible to use the "load example -> basic test" function in the file menu to automatically import a demo model. More complicated models used in this paper can be imported too, but they will require more computational power to be executed.

Utils provided as separate scripts
We also provide some scripts useful to convert external database file such as Biogrid [5], MOPED [6], Pax-DB [7] or Gene Ontology (GO) to a tabular format that can be read by SiComPre.

Conversion from GO compartments
The convertCompartments.py script converts protein localization associated to GO terms to organelles (e.g. mitochondria and integral to membrane are converted to "cm" which stands for cytoplasm mitochondria membrane). All localization related to complexes, such as ribosome, are deleted.
Execution: python convertCompartments.py GO_compartments The script uses as input a tab separated file download from GO database containing two columns.
The first one indicates the protein ID and the second one the name of the compartment.

Conversion from MOPED
The abundanceMOPED.py script converts files exported from MOPED database to the right format needed for SiComPre.
Execution: python abundanceMOPED.py file_downloaded_from_moped condition The first parameter is the filed downloaded from moped, and the second parameter is the condition used to filter proteins (column 9 of the input file).

Conversion from Pax-DB
The abundancePAXDB.py script converts files exported from PAX-DB database to the right format needed for SiComPre. Typically, protein ID on PAX-DB is different from Uniprot which is used in the example data provided with SiComPre, thus it is necessary to provide a mapping file to convert protein IDs.
Execution: python abundancePAXDB.py mapping_file file_downloaded_from_paxdb Where mapping file contains two columns separated with a tab. The first column indicates the ID of the protein in the Pax-DB database and the second column the protein IDs to be used in the output file.

Selection of the best PPI for human studies
SicomPre 1.0 works with context-specific information; therefore it is natural to ask if we should use a context-specific PPI (e.g. a tissue-specific PPI). In a context-specific network we have binary interactions that are observed only in particular cell type or condition. However, if some interaction is observed only in specific tissues it is probably due an emergent behavior arises from the different abundance levels. Instead, with SiComPre 1.0 we want to use PPI as a context-free source, containing interactions where proteins that are in the proximity of each other can interact.
Using protein abundances, instead, we will limit the possibility of protein interactions. Therefore, the behavior of context-specific interactions will be intrinsically embedded in the simulations. Under this idea, we need to have a PPI that reflects multiple conditions, increasing the probability to identify protein interactions that are possible to observe only in specific tissue (e.g. protein X is expressed only in liver).
In this study, we aim to have the highest possible coverage of the human proteome, therefore we want to test multiple PPI available in the literature in order to find the one that can better predict protein complexes from a reference dataset [8].
We tested different PPIN with different coverage, for these experiments we run our simulation using Pancreas and B cells protein abundances which are the cell type and tissue among those with the highest number of protein expressed in the Kim et al. experiments [9], therefore yielding to more reliable results. To make our findings not dependent purely on SiComPre we also tested these PPI with ClusterOne [3].
As shown in Supplementary Table 5, Hippie with a threshold of 0.8 shows a better coverage with the second best precision. LitBM13 also reaches similar results. Indeed, in none of these networks we observed important complexes such as Ribosomes and Proteasomes. A more precise network has been proposed by Havugimana et al. [10] containing 13992 interactions, however the coverage is very limited, probably because it has been generated from specific cell lines where many interactions are improbable to appear. In this dataset, we observed nearly all the important complexes, but many other were missing leading to a poor recall. However, what is important for us is also the ability of predicting important complexes such as ribosome and proteasome.
We didn't find any correlation between the number of PPIs or proteins covered with the Recall; suggesting that high ratio of FP interactions not only leads to a low precision but could influence the formation of complexes which proteins are covered by the PPI.
As a results of this analysis, we used Hippie [11] with 0.8 as threshold for the prediction of protein complexes in human. ID mappings are performed with UniProt IDmapping webtool. In case of multiple mapped IDs the interaction/protein has been duplicated.

Estimate protein-rRNA interactions and rRNA abundances
We were not able to find protein-rRNA interactions for eukaryotic ribosome. To solve this issue, we where A is the chain corresponding to a rRNA molecule.
In this way, we selected all the chains (proteins or other rRNA) which have at least one group within 3.0 A of distance from the desired rRNA molecule. We chose 3.0 A, because the length of all molecular bounds in ribosome should be shorter than 3.0 A. Therefore, this process gives us a good approximation of which proteins are in contact with the rRNA. Next, since ribosome is well conserved among eukaryotic organisms, we found all the homologous protein in Homo Sapiens.

Supplementary Figures and Tables
Supplementary reference_dataset_cyc08.csv Reference Complexes CYC08 reference complexes [2] needed to evaluate the results. To assess the quality of the prediction in Human, predicted complexes from multiple methods [3,[21][22][23][24][25] were compared to the CORUM reference dataset using Hippie PPI. When methods allowed the use of PPI weights we used the full network, otherwise a filter was used to exclude low confidence interactions (weight > 0.8). SiComPre was executed using protein abundance from 16 cell types.