Integrative analysis of extracellular and intracellular bladder cancer cell line proteome with transcriptome: improving coverage and validity of –omics findings

Characterization of disease-associated proteins improves our understanding of disease pathophysiology. Obtaining a comprehensive coverage of the proteome is challenging, mainly due to limited statistical power and an inability to verify hundreds of putative biomarkers. In an effort to address these issues, we investigated the value of parallel analysis of compartment-specific proteomes with an assessment of findings by cross-strategy and cross-omics (proteomics-transcriptomics) agreement. The validity of the individual datasets and of a “verified” dataset based on cross-strategy/omics agreement was defined following their comparison with published literature. The proteomic analysis of the cell extract, Endoplasmic Reticulum/Golgi apparatus and conditioned medium of T24 vs. its metastatic subclone T24M bladder cancer cells allowed the identification of 253, 217 and 256 significant changes, respectively. Integration of these findings with transcriptomics resulted in 253 “verified” proteins based on the agreement of at least 2 strategies. This approach revealed findings of higher validity, as supported by a higher level of agreement in the literature data than those of individual datasets. As an example, the coverage and shortlisting of targets in the IL-8 signalling pathway are discussed. Collectively, an integrative analysis appears a safer way to evaluate -omics datasets and ultimately generate models from valid observations.


Results
Proteomic data assessment. The high-resolution proteomic analysis was performed on CE and on samples enriched in secreted proteins (CM and ER/Golgi fractions), with the aim of increasing proteome coverage. The respective workflow is depicted in Fig. 1. The results from 5 independent experiments per cell compartment indicate high resolution and good reproducibility of the applied procedures. As shown in Table 1, an average (± SD) of 10,062 (± 466), 7,298 (± 490) and 6,053 (± 1,407) peptides, corresponding to 1,944 (± 85), 1,515 (± 75) and 1,116 (± 164) proteins, were identified in CE, ER/Golgi and CM, respectively. Detailed lists of proteins identified per individual MS run (including/excluding single-peptide IDs) are provided in Supplementary Table S1. To increase the reliability of protein identification and differential expression analysis, only proteins identified based on at least 2 unique peptides (in each individual run) and in at least 3/5 replicates in each cell line were considered for further analysis. Reproducibility was high, with an average overlap among replicates of 77% (CE), 73% (ER/Golgi) and 76% (CM) of proteins detected in at least 3/5 replicates in each case (Supplementary Fig. S1). These corresponded to a total of 1,359, 1,062 and 816 non-redundant proteins from CE, ER/Golgi and CM, respectively, considered for further differential expression analysis (Supplementary Dataset S1). To estimate the enrichment efficiency for secreted proteins, the SignalP algorithm, which predicts the presence of signal peptides (defining "secreted" proteins), was employed 22 (Supplementary Dataset S1). Overall, 30% of proteins in CM were predicted to have a signal peptide, compared to 14% in the ER/Golgi fraction and 9% in CE, indicating the relative efficiency of the enrichment strategies.
The normalized signal intensity of these 'secreted' proteins corresponded on average to 49.80%, 16.72% and 6.67% of the total intensity for CM, ER/Golgi and CE, respectively. Moreover, enrichment efficiency was assessed based on the normalized average intensity values of specific proteins being representative for each fraction (Supplementary Fig. S2). Actin cytoplasmic 1 and histone H2B type 1-K (protein markers for CE) were highly expressed in CE, whereas their abundance was reduced in CM (by approximately 2-fold for Actin, 10-fold for H2B type 1-K), and for H2B type 1-K also reduced (approximately 4-fold) in ER/Golgi (Supplementary Fig. S2). On a similar note, calumenin and 78 kDa glucose regulated protein (markers for ER/Golgi) levels were higher in ER/Golgi (by about 2-fold) compared to CE and CM. Cathepsin B and Proactivator polypeptide (markers for CM) levels were found increased (by at least 5-fold) in CM compared to CE and ER/Golgi (Supplementary Fig. S2). Taken together, these results support that the different strategies provided to some extent complementary information. However, large overlaps could also be observed (described below) allowing for investigation of consistencies among the differentially expressed proteins per method, as a means to increase confidence in individual observations.

Complementarity of proteomic profiles. Comparative analysis of the 1,359 proteins identified in the CE, the 1,062 proteins detected in ER/Golgi and the 816 proteins found in CM revealed an overlap of 498 proteins (Fig. 2). This "core proteome" included multiple enzymes, ribosomal and cytoskeletal proteins, some signalling proteins and also abundant chromosomal proteins (such as histones; Supplementary Dataset S1). Each experimental approach also enabled the identification of multiple proteins not detected by the other two methods (408 for CE, 166 for ER/Golgi and 219 for CM; Fig. 2).
The CE-specific proteins included various nuclear and transcription factors and mitochondrial enzymes; the ER/Golgi fraction contained multiple protein synthesis-related proteins (Protein Niban, ribosomal proteins, DnaJ homolog subfamily C member, etc.) and signalling proteins (RAS-related proteins, kinases, cell membrane receptors such as EGFR, etc.); and the CM fraction included various growth factors, interleukins, matricellular proteins and proteases, indicating a good degree of complementarity between these strategies (Supplementary Dataset S1).
Proteins exhibiting a nominal significant change (p < 0.05, Mann Whitney test) in their expression levels (>1.5 fold change) between the two cell lines in respective subcellular fractions were defined as differentially expressed. Based on these requirements, 253 (144-up and 109-down regulated), 217 (116-up and 101-down regulated) and 256 (169-up and 87-down regulated) proteins were considered as significantly altered among CE, ER/Golgi and CM, respectively in T24M vs. T24 cells (Supplementary Dataset S1). Upon Benjamini-Hochberg correction and considering the adjusted p-value (p < 0.05) and the fold change threshold ( >1.5), a total of 171 and 206 proteins were defined as differentially expressed in CE and CM, respectively (Supplementary Dataset S1); whereas none of the ER/Golgi differentially expressed proteins remained significant upon application of FDR correction. This indicates higher variability of this specific dataset, likely being a consequence of the applied multi-step enrichment protocol. Considering the low number of samples analyzed (n = 5 per group) as well as the observed consistency in expression trends among different fractions (as explained below), we further focused on the differentially expressed proteins (>1.5 fold change) defined using unadjusted p-value.
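The selection rule described above (Mann-Whitney p < 0.05 combined with a >1.5 fold change, optionally followed by Benjamini-Hochberg correction) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the dictionary-based input and the use of the ratio of group means as the fold change are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    if n == 0:
        return p
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out


def differential_proteins(t24, t24m, fold=1.5, alpha=0.05):
    """Flag proteins as changed in T24M vs. T24.

    t24, t24m: dict mapping protein name -> list of replicate intensities
    (hypothetical input layout). Fold change is the ratio of group means,
    which is an assumption of this sketch.
    """
    names, pvals, ratios = [], [], []
    for protein in sorted(set(t24) & set(t24m)):
        a = np.asarray(t24[protein], dtype=float)
        b = np.asarray(t24m[protein], dtype=float)
        if a.mean() == 0 and b.mean() == 0:
            continue  # nothing measured in either cell line
        _, p = mannwhitneyu(b, a, alternative="two-sided")
        ratio = np.inf if a.mean() == 0 else b.mean() / a.mean()
        names.append(protein)
        pvals.append(p)
        ratios.append(ratio)

    adjusted = benjamini_hochberg(pvals)
    results = {}
    for name, p, q, ratio in zip(names, pvals, adjusted, ratios):
        changed = ratio > fold or ratio < 1.0 / fold
        results[name] = {
            "direction": "up" if ratio > 1 else "down",
            "nominal": bool(changed and p < alpha),   # unadjusted p-value
            "adjusted": bool(changed and q < alpha),  # after BH correction
        }
    return results
```

Keeping the nominal and the adjusted calls side by side mirrors the comparison made in the text, where the unadjusted list is retained and its reliability is assessed through agreement across fractions.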
To obtain an initial insight into the biological function of the differentially abundant proteins per approach (i.e. the aforementioned 253, 217 and 256 proteins identified in the CE, ER/Golgi fraction and CM analysis, respectively), gene ontology information deposited in protein databases (UniProt 23,24 and neXtProt 25) was investigated. Comparative analysis revealed that the percentage of differentially expressed proteins involved in metabolic processes, intracellular transport of various compounds (e.g. proteins, ions, lipids), protein folding, redox reactions and response to stress was higher in CE than in CM and the ER/Golgi fraction (Fig. 3), whereas differentially expressed proteins implicated in proteolytic events, regulation of endopeptidase activity, extracellular matrix organization/remodelling, migration, angiogenesis as well as signal transduction and cell proliferation were more prominent in CM vs. CE and ER/Golgi. In addition, the percentage of differentially abundant proteins associated with mRNA processing and splicing, protein synthesis as well as organization of the actin cytoskeleton was increased in ER/Golgi when compared to the other samples (Fig. 3). These findings further indicate the complementarity of the applied enrichment strategies. Consolidation of the differentially abundant proteins from all experimental approaches (CE, ER/Golgi and CM) resulted in a total of 614 non-redundant changes (Supplementary Dataset S2). Some proteins (n = 19) were found by all 3 proteomics strategies to be differentially expressed, with similar trends of expression (up or down) in the T24M vs. T24 cells (Table 2). These included proteins involved in actin binding such as gelsolin and plastin-3, proteases (cytosol aminopeptidase), and also various enzymes (Glucose-6-phosphate 1-dehydrogenase, NAD(P)H dehydrogenase [quinone] 1, phospholipase D3, and others).
An additional 70 proteins belonging to various protein families [signaling molecules such as Signal transducer and activator of transcription 1-alpha/beta; metabolic enzymes such as Fatty acid synthase, UDP-glucose 6-dehydrogenase, Aldehyde dehydrogenase, and others; proteins involved in cell interactions such as Annexins (ANXA2 and ANXA6), etc.] were found to be de-regulated, with similar trends of expression, by two fractionation strategies. Collectively, agreement in trends of expression and statistical significance between the strategies increases the confidence in individual observations: a total of 89 differentially expressed proteins supported by at least two fractionation strategies may be considered "cross-validated", hence of higher confidence. In addition, as shown in Supplementary Dataset S2, some significant changes supported by one fractionation strategy were also suggested by other strategies, with the same trend of expression in the T24M versus T24 cells, but did not pass the applied thresholds (at least 1.5-fold change and/or p < 0.05) in the latter. This observation (applying to approximately 40 proteins per strategy) further facilitates prioritization and establishing confidence in individual findings.
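The cross-validation criterion used here (a change counts as higher confidence when at least two strategies report it with the same trend) can be sketched as follows. The data layout, the function name and the handling of conflicting calls are assumptions for illustration; the source does not describe how the agreement was computed.

```python
from collections import defaultdict


def cross_validated(calls_by_strategy, min_support=2):
    """Return proteins whose direction of change is supported by at least
    `min_support` strategies.

    calls_by_strategy: dict mapping a strategy name (e.g. "CE") to a dict
    of protein -> "up" or "down" (hypothetical input layout).
    Proteins with enough support in both directions are left out, which is
    a choice made for this sketch only.
    """
    support = defaultdict(lambda: defaultdict(set))
    for strategy, calls in calls_by_strategy.items():
        for protein, direction in calls.items():
            support[protein][direction].add(strategy)

    result = {}
    for protein, by_direction in support.items():
        candidates = [
            (direction, strategies)
            for direction, strategies in by_direction.items()
            if len(strategies) >= min_support
        ]
        if len(candidates) == 1:
            direction, strategies = candidates[0]
            result[protein] = (direction, sorted(strategies))
    return result


# Placeholder protein names; not data from the study.
calls = {
    "CE": {"P1": "up", "P2": "down", "P3": "up"},
    "ER/Golgi": {"P1": "up", "P3": "down"},
    "CM": {"P1": "up", "P2": "down"},
}
print(cross_validated(calls))
```

The same function also covers the cross-omics case: adding a fourth entry for the transcriptomic calls extends the agreement check from fractionation strategies to -omics levels.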
Assessment of the validity of proteomic findings by mRNA sequencing analysis. To further assess the validity of the observed proteomic changes, mRNA sequencing data were obtained from the studied cell lines using different biological replicates. Corresponding transcripts for the vast majority of proteins existed. Specifically, corresponding mRNA sequences were found for 1,358 out of 1,359 proteins detected in CE (>99%); 1,061 out of 1,062 proteins in ER/Golgi; and 811 out of 816 proteins detected in CM (>99%).
Among the 253 differentially expressed proteins detected in CE, 98 were also detected with a fold change above 1.5 at the mRNA level. Of the 217 differentially abundant proteins from the ER/Golgi fraction, 85 were also found to be changed at the mRNA level (fold change >1.5), while among the 256 differentially abundant proteins obtained in CM, 84 were also found to be differentially expressed at the mRNA level. When combined, a total of 210 proteomic changes can be considered as "verified" via agreement with the transcriptomics results (Supplementary Dataset S2). These 210 "verified" findings included various proteins which were defined as differentially expressed in at least two proteomic experiments (Supplementary Dataset S2; proteins marked in red or blue with asterisk) and also proteins which were predicted to be differentially expressed by only one proteomic approach (CE or ER/Golgi or CM; Supplementary Dataset S2; proteins marked in green with asterisk). This increased the total number of "verified" findings based on data cross-validation from 89 (agreement of at least two protein fractionation strategies) to 253 (agreement of at least two -omics strategies, i.e. any of the protein fractionation approaches and/or transcriptomics; Supplementary Dataset S2). These "verified" features represent a variety of protein families including multiple signalling molecules (e.g. protein kinase C and casein kinase substrate in neurons protein 2, RAS suppressor protein 1, Tyrosine-protein kinase Yes, Interleukin-8, Macrophage colony-stimulating factor 1, Interleukin-6, Vascular endothelial growth factor C and others), proteases (Cathepsin L1, Cytosol aminopeptidase, Carboxypeptidase A4 and others), components of extracellular matrix (such as Fibronectin type III domain-containing protein 3B, Collagen alpha-1(XVIII or XII) chain, Laminin subunit gamma-1 or beta-1, Metalloproteinase inhibitor 3, and others) and also various enzymes (such

Assessment of the validity of the multi-omics approach and its potential application. The validity of the differentially expressed proteins reported in individual proteomics experiments (CE, ER/Golgi, CM) as well as in the integrated "verified" dataset (the abovementioned 253 proteins) was evaluated in the context of existing literature. Molecular features associated with BCa invasion or metastasis were retrieved using two independent approaches, i.e. the BcCluster database 20 (n = 627) and the GLAD4U 21 tool (n = 671; Supplementary Dataset S3), as described in Methods. Validity was assessed based on the overlap between our experimental data and literature findings (listed in Supplementary Dataset S3). As presented in Table 3, the percentage of overlapping features between the "compiled" dataset (CE, CM, ER/Golgi, i.e. 614 proteins) and literature was 8.8% (BcCluster) and 11.6% (GLAD4U), whereas for the verified findings (i.e. 253 proteins) the agreement with the literature data was generally higher (overlap range: 13.0-15.8%, depending on the comparison; Table 3).
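The overlap percentages quoted above are simple set intersections expressed relative to the size of the experimental dataset. A minimal sketch, with a hypothetical function name and placeholder identifiers, is:

```python
def literature_overlap(experimental, literature):
    """Return the shared features and their percentage of the experimental set."""
    experimental = set(experimental)
    shared = experimental & set(literature)
    percent = 100.0 * len(shared) / len(experimental) if experimental else 0.0
    return shared, percent


# Placeholder identifiers, not entries from Supplementary Dataset S3.
shared, percent = literature_overlap(
    ["A", "B", "C", "D"], ["B", "D", "X", "Y", "Z"]
)
print(sorted(shared), round(percent, 1))
```

With the counts reported elsewhere in the text for the 253 "verified" proteins, 33 shared features give 100 × 33 / 253 ≈ 13.0% and 40 shared features give 100 × 40 / 253 ≈ 15.8%.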
Considering the increased validity of the latter dataset, these 253 proteins were mapped to pathways using the Ingenuity software. The predicted statistically significant de-regulated pathways (p < 0.05, Fisher exact test) were shortlisted based on their significance level, and the 15 pathways with the lowest p-values are summarized in Table 4. As a representative example, we present the IL-8 signalling pathway, which notably was the only one found among the top 15 significant pathways predicted based on the literature data and also significant in each individual proteomics dataset. The graphical representation of the IL-8 signalling pathway is shown in Fig. 4, with the differentially expressed features, as detected per individual -omics method, highlighted. As presented, the molecular coverage of the IL-8 pathway increases through the integrative analysis, further reflecting the complementarity of the different approaches. Furthermore, as shown, the vast majority of molecular changes are considered "verified" (Fig. 4, red frame). Based on this scheme, the chances that the observed "non-verified" changes (e.g. changes supported by only one -omics approach; purple frame) are valid increase, based on their biological relevance. To test this hypothesis, the differential expression of the Vasodilator-stimulated phosphoprotein (VASP) was investigated in a set of invasive and non-invasive BCa tissue specimens by western blot. As shown in Supplementary Fig. S3, in line with the ER/Golgi proteomics analysis, a decrease in the level of this protein in invasive versus non-invasive tumors is suggested.

Discussion
Omics datasets are a mine of information; nevertheless, only a limited part of it is finally extracted and further investigated, mainly due to challenges associated with: a) establishing reliability of findings (typically large numbers of differentially abundant proteins of low statistical power are identified per omics experiment); and b) developing targeted assays for further measurement of individual features, as a means for their verification. In particular, the frequent lack of specific antibodies and the associated costs of performing immuno-based assays result in only a small number of proteomics findings being ultimately confirmed (typically fewer than 10 per experiment). These verified findings, even though of high value, are not sufficient to comprehensively describe a disease at the molecular level. However, such a comprehensive description is required for the successful application of holistic "systems biology" approaches 12. In the case of proteomics studies more specifically, comprehensiveness and proteome coverage depend on the applied technique, with different subcellular fractions requiring different enrichment strategies for their efficient resolution. The presented approach, involving the use of different enrichment strategies as well as transcriptomics, addressed the added value of cross-strategy and cross-omics comparisons, and of the respective investigation of consistency in trends of expression, in increasing confidence in individual findings per omics dataset.
We focused on the analysis of BCa metastasis using a cell line model for the specific phenotype. This constitutes a clinically relevant question, as limited therapeutic options are available for patients with metastatic BCa, highlighting the need for development of novel therapeutic targets 19. We placed special emphasis on the investigation of the secreted/extracellular matrix proteome, considered of high relevance in cancer invasion and metastasis 11. In parallel to the classical analysis of CM, we also investigated the ER/Golgi fraction, representing the path of proteins on their way to be secreted 26. As recently demonstrated and also shown in our analysis, this latter method is of lower enrichment efficiency for bona fide secreted proteins in comparison to the analysis of CM; nevertheless, it can provide new information and results highly complementary to those from CM 27. Even though one cannot rule out the possibility that some of the observed differences (or overlaps) between protein identifications from different fractions may reflect sub-optimal enrichment and/or differences in starting protein amounts (e.g. 5 μg for CE, 3.5 μg for ER/Golgi and 3.75 μg for CM analyzed by LC-MS/MS; see Methods), the overall specificity of the employed techniques is supported by the SignalP analysis and the respective analysis of protein abundance per fraction (Supplementary Fig. S2). Furthermore, the overall enrichment efficiency in our study is in line with previously published reports 28,29. Several proteomics studies have been published involving characterization of changes underlying BCa invasion either at the total cell 30-32 or extracellular proteome 33-35 level using various BCa cell line models. A high overlap between the reported identifications in our study and the existing literature was observed, further supporting the validity of the reported findings.
Specifically, our shotgun analysis enabled detection of the majority (at least 69%) of proteins identified in previous proteomics studies of total cell proteome from T24M vs. T24 31 and T24T vs. T24 30 cells. Along the same lines, the majority of proteins previously identified in CM from T24M and/or T24 cells 33,35 were also found in our analysis. These multiple existing studies serve as reference points, nevertheless their findings remain disparate and any potential added value from the parallel proteomic analysis of different cell compartments can be assessed with moderate confidence only.
As the first step in this direction, and to establish the relevance of each individual proteomics dataset, we evaluated our main findings in the context of the existing literature. We used a manually curated database of features (genes, transcripts, proteins) associated with BCa invasion/ progression (BcCluster) 20 . Importantly, BcCluster lists molecules highlighted from studies with sample size of at least 50, suggesting a high validity of the collected features. The second dataset contains the list of the BCa-associated molecules retrieved using the GLAD4U 21 tool, without any sample size selection criteria. The two datasets appear to be highly complementary, with an overlap of 179 features, (corresponding to over 25% of features from each literature set), further supporting the assumption that these two approaches provide a good and comprehensive reflection of the current knowledge. It should be noted, that these literature data used as reference in our study include entries reported from different -omics (genomics, proteomics, transcriptomics) as well as non-omics (e.g. immunohistochemistry) studies, apparently collected under different applied methodologies. Investigation of the inter-laboratory variability reflected in these databases would be out of the realm of this study, nevertheless it is expected that this exists. Even though the latter clearly compromises comparability of different studies, on a positive note, it may also be used as a means to increase confidence in individual findings, based on their detection in multiple studies and under different protocols. Along these lines, multiple observed protein changes included in the individual datasets (CE, ER/Golgi, CM) had already been reported in the context of BCa invasion/ progression or metastasis. Independently of the source of literature data, as aforementioned (Results), overlap ranged from approximately 7 to 15% depending on the applied proteomics strategy.
Taking this one step further, an integration of -omics datasets from different molecular levels (proteomics-transcriptomics) was also performed. For almost all proteins identified by LC-MS/MS, we were able to obtain the corresponding mRNA, which strongly supports the reliability of the protein identification process. Even though in general only a moderate correlation between mRNA and protein expression is reported 18,36, the regulation trend was well supported by the transcriptomic analysis for many of the differentially expressed proteins (210 out of 614 (34%); notably, for 344 transcripts a >1.5-fold change was not reached, whereas only 57 exhibited the opposite trend of expression in the T24M vs. T24 cells). It should be noted that the presented transcriptomic analysis has some limitations, mostly as a result of the high costs of next generation sequencing, which led to a low number of analysed samples (n = 2 per cell line) and hampered the application of proper statistical analysis.
Through the application of transcriptomics, which complements but also verifies proteomics findings, an increase in the number (from 89 up to 253) of cross-validated features obtained in the three individual proteomics experiments could be achieved.
The reliability of these latter "cross-validated" proteins was further evaluated in the context of available literature. An improvement in the agreement with existing literature data is observed (as described in Table 3), indicating the applicability and value of such a multi-omics approach to verify large scale proteomics data. Of these 253 features, 33 (13.0%; BcCluster) 20 or 40 (15.8%; GLAD4U) 21 have been associated with BCa/BCa invasion or metastasis. This corresponds to an increase of approximately 4 percentage points in the overlapping features (from 8.8% to 13.0% for BcCluster and from 11.6% to 15.8% for GLAD4U) when compared with the respective overlap of all 614 differentially abundant proteins identified across the three proteomics experiments. This increase appears to be significant, considering that the "verified" dataset consists of a lower number of proteins (253) compared to the combined "all differentially expressed proteins" dataset (614). In other words, the presented strategy facilitates shortlisting of more confident findings, which currently range from the small number (regularly fewer than 10) of findings verified via typical targeted analysis to the whole list of differentially expressed features per omics experiment (regularly prone to many false positives). The described cross-omics comparison offers a valuable intermediate step between these two extremes, maximizing the extraction of features of increased confidence for their further use as input data in systems biology approaches.
As an example in this direction, pathway analysis was conducted. IL-8 signaling was selected because it was predicted (at high significance levels) to be affected based on all datasets, both the literature-mined datasets and the individual proteomics datasets (CE, ER/Golgi and CM). As presented in Fig. 4, the integrative analysis of -omics data provided a fairly comprehensive molecular phenotype underlying the pleiotropic effects of IL-8 function: the up-regulation of IL-8 in the T24M cells was associated with an up-regulation of matrix metalloproteases (MMP2), implicated in tumor invasion 37, as well as VEGFC and ICAM1, factors implicated in angiogenesis 38,39 (Fig. 4). Interestingly, the overexpression of MMP2 was accompanied by the down-regulation of TIMP metallopeptidase inhibitor 3 (identified in CM analysis), further supporting the activation of MMPs in the context of BCa invasion. Moreover, data integration from the different preparation methods (CE, ER/Golgi, CM) links disparate observations, revealing events that in some cases have not yet been associated with BCa. As shown in Fig. 4, formation of a chemosynapse is predicted based on the observed proteomics changes (involvement of VASP, LASP-1), with anticipated impact on focal adhesion and cell migration 40,41. In addition, interestingly, involvement of PLD3, a non-classical member (as it lacks lipase activity) of the phospholipase D family of enzymes 42, is predicted. PLD enzymes have been implicated as key components of HRAS signaling in cancer cells 43, with, notably, HRAS also detected at different levels in T24M versus T24 cells, based on the proteomics analysis (Fig. 4). In addition, PLD3 has been recently shown to be involved in hypoxia-induced lipid metabolism in colorectal cancer cells 44, suggesting collectively that it merits further investigation in BCa.

Scientific Reports | 6:25619 | DOI: 10.1038/srep25619
In parallel to these effects, IL-8 signaling also occurs through G protein coupled receptors, specifically in our system, through Guanine nucleotide-binding protein subunit gamma-12 (GNG12), not studied in BCa yet. Impacts on regulation of calcium channels are expected 45,46 . Of note, some calcium channels were found at differential levels in the T24M versus T24 cells based on the proteomics analysis e.g. Plasma membrane calcium-transporting ATPase 1 (CE and ER/Golgi), Calcium-binding mitochondrial carrier protein SCaMC-1 (CE)-Supplementary Dataset S2.
Collectively, through the proposed combined analysis of multiple cellular fractions and molecular levels, these multi-level pleiotropic effects of IL-8, previously described in different publications (reviewed by Waugh et al) 47, can be better reflected at the molecular level, encompassing changes from the extracellular space (e.g. IL-8 differential abundance) all the way to the nucleus (e.g. changes in Bax; Fig. 4). Multiple missing links undoubtedly still exist; nevertheless, such an approach increases coverage (hence confidence) and also facilitates the definition of targets for further verification. To better explain this point, the example of VASP, a protein involved in cytoskeleton remodeling 41 and not yet associated with BCa, was provided. Being differentially expressed in the ER/Golgi fraction only, VASP was not included in the shortlisted proteins (i.e. the 253 cross-verified findings). Nevertheless, based on its biological relevance to the IL-8 pathway, the chances that this finding from the ER/Golgi analysis was not a false association increased. Indeed, using western blot analysis, our preliminary results further supported the down-regulation of VASP in muscle invasive BCa, a finding which we are currently investigating further.
In conclusion, our study collectively shows that comparative and in parallel analysis of multiple -omics (in our case: proteins identified in CE, ER/Golgi and CM and also at a different omics level -transcriptomics) has added value on two very important aspects; it can improve proteome coverage and fill missing links, through the complementarity of different techniques. Even more, it can increase validity of individual observations, by cross-omics correlations, facilitating prioritization of findings and ultimately knowledge extraction. Considering the general low statistical power of individual -omics investigations (high number of variables, small sample sizes) such a cross-omics and platform analysis appears a safe way forward particularly towards development of disease molecular models based on valid experimental observations.

Methods
Sample preparation. Cell culture. T24 and T24M 31 BCa cells were employed as described in Makridakis et al 31. Briefly, cells were cultivated in DMEM medium (High Glucose, GlutaMAX™, Pyruvate) supplemented with 10% FBS and 1% Penicillin-Streptomycin (P/S) and harvested using 0.05% trypsin/0.02% EDTA and centrifugation (1,000 × g, 5 min, room temperature). Cell pellets were washed twice with PBS and stored at −80 °C until further processing. Each experiment was repeated in five replicates (five different flasks with cells originating from the same initial stock) per condition.
Collection of secreted proteins from conditioned medium (CM). CM was collected as described previously 27,35 from 10 × 10^6 cells after 24 h incubation in serum-deprived medium. Protein extraction was performed as described in Latosinska et al 27. 75 μg of proteins were processed by the Filter Aided Sample Preparation (FASP) method, as described below.
Enrichment in Endoplasmic Reticulum/Golgi Fraction. 20 × 10^6 cells were used in order to enrich for ER/Golgi as described by Sarkar et al. 26 with minor modifications 27. Sequentially, samples were depleted of nuclei (3,000 × g, 10 min) and mitochondria (10,000 × g for 10 min), leading to enrichment for ER/Golgi (16,000 × g for 30 min). The pellet containing the final fraction was dissolved in buffer containing 7 M urea, 2 M thiourea, 4% CHAPS, 100 mM DTE and 1% ampholytes. 70 μg of proteins were processed by FASP.
Preparation of total cell extract. 4 × 10^6 cells were harvested and the cell pellet was re-suspended in 200 μL of lysis buffer (7 M urea, 2 M thiourea, 4% CHAPS, 100 mM DTE, 1% ampholytes). Cells were disrupted by water bath sonication for 10 min followed by centrifugation (16,000 × g, 10 min, RT). 100 μg of proteins were processed by FASP.

using a data-dependent acquisition (top 40). Changes between MS1 (MS) and MS2 (MS/MS) modes were done at 60,000 and 7,500 resolution, respectively. Parent ions were fragmented at an energy of 40 by higher energy collision-induced dissociation (HCD).

Data processing. The analysis of the raw MS data files was performed using Proteome Discoverer (PD) v. 1.4.0.288 (Thermo Scientific). An event detection node was used at a setting of 2 ppm. The Human Swiss-Prot Database 24,50 with 20,277 canonical sequences only (downloaded on 30/10/2013) and the Sequest search engine 51 were employed. The following criteria were applied: a) precursor mass tolerance 10 ppm, b) fragment mass tolerance 0.05 Da, c) fixed modifications: carbamidomethylation of cysteine, d) variable modifications: oxidation of methionine and proline, and e) allowed missed cleavages: one. The false discovery rate (FDR) evaluation was performed by using the Percolator node 52 (PD 1.4).
Protein identification and label-free quantification. Protein identification was based on the rank 1 peptides, allowing for mass deviation below 5 ppm and FDR below 1%. Only proteins identified with at least 2 unique peptides in individual samples were included for further analysis. The label-free quantification was based on the peak area (i.e. area under the curve), determined from the extracted ion chromatogram (Precursor Ions Area Detector node in PD). Quantification at the protein level was based on the top three peptides per protein, as calculated by PD. For the few cases where the protein area was not calculated by the software, as a consequence of lack of integration of the peptide area (a software error), the average area for the particular protein per studied group (T24, T24M) was assigned. In the case of proteins not identified in a particular sample, the missing value was replaced by zero. Twelve proteins derived from the FBS 53 or reagents used for MS were excluded from analysis as potential contaminants (Supplementary Table S2).

Briefly, mRNA isolation was performed using oligo-dT magnetic beads followed by mRNA fragmentation and cDNA synthesis. For the latter, the quality and yield were measured via Lab-on-a-Chip analysis (expected product size: 200-500 bp). Clustering and DNA sequencing were performed using Illumina cBot and HiSeq2500 in line with the manufacturer's instructions at a DNA concentration of 16 pM. Image analysis, base calling and the quality check were conducted using the Illumina data analysis pipeline RTA v1.18.64 and Bcl2fastq v1.8.4. Data obtained from the HiSeq2500 in fastq format were used as the source for the downstream data analysis. Alignment of fastq reads was performed using TopHat version 2.0.12 54 against the assembled human genome GRCh37.p13 with the corresponding Ensembl release 75 annotation 55 (http://grch37.ensembl.org/index.html).
The alignment run involved default parameters but allowing for a genome multihit search and transcriptome build and mapping. Alignment quality metrics were collected using Qualimap version 2.0.1 56 . Quantification of feature alignments was performed using HTSeq-counts from HTSeq framework version 0.6.1p1 57 . Default parameters were used for a non stranded RNA-seq library using the intersection non empty algorithm. Normalization of the count data and statistical analysis for the differential expression was performed with DESeq2 package version 1.6.3 58 for R statistical computing software 59 .
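As an illustration of the count normalization step, the sketch below implements the median-of-ratios size-factor estimation that DESeq2 applies by default. It is a simplified Python re-expression for explanatory purposes, not the DESeq2 code itself (the analysis described here was run in R); the function names and the toy count matrix are hypothetical.

```python
import math
from statistics import median


def size_factors(counts: dict[str, list[int]]) -> list[float]:
    """Median-of-ratios size factors; counts maps gene -> raw counts per sample."""
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene; genes with a zero count in any sample are skipped.
    log_geo_means = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    if not log_geo_means:
        raise ValueError("every gene has at least one zero count")
    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count to its geometric mean, on the log scale.
        log_ratios = [math.log(counts[g][j]) - lgm for g, lgm in log_geo_means.items()]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts: dict[str, list[int]]) -> dict[str, list[float]]:
    """Divide each sample's raw counts by that sample's size factor."""
    sf = size_factors(counts)
    return {g: [c / s for c, s in zip(row, sf)] for g, row in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy data: the second sample was sequenced twice as deeply.
    toy = {"geneA": [10, 20], "geneB": [100, 200], "geneC": [5, 10], "geneD": [0, 3]}
    print(size_factors(toy))
    print(normalize(toy)["geneA"])
```

Using the median of per-gene ratios rather than the total read count keeps a small number of highly expressed genes from dominating the scaling.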
Western Blot. BCa tissue specimens were collected in Germany (Department of Urology and Urological Oncology, Hannover Medical School) from patients undergoing resection of the bladder. All individuals gave written informed consent. All experimental protocols for tissue sample collection were approved by the Hannover Medical School Ethics committee (case number: 614-2009) and experiments were performed according to relevant guidelines. Specimens from non-muscle invasive BCa (n = 3), muscle invasive BCa (n = 3) and negative biopsies (n = 3) were analyzed. Tissue lysis was performed as described earlier 49. For each extract, 20 μg of total protein were separated on a NuPAGE® 4-12% gradient gel under reducing conditions and electroblotted onto a nitrocellulose membrane (LG), as presented elsewhere 60. Membranes were incubated overnight at 4 °C with the primary mouse anti-VASP antibody (Enzo LifeScience, ALX-804-177-C050, dilution 1:500) or with an anti-β-actin antibody conjugated to HRP (Santa Cruz, sc-47778 HRP, dilution 1:4,000). In the first case, this was followed by incubation with an anti-mouse HRP-conjugated secondary antibody (Santa Cruz; dilution 1:2,000) for 2 h at room temperature. The target protein was detected by Enhanced Chemiluminescence (Perkin-Elmer LAS, Inc.).
Literature mining. Molecules (proteins and transcripts) associated with BCa invasion/progression were retrieved from the BCa database (http://bccluster.org/) 20. GLAD4U 21 was also employed to retrieve relevant features from the MEDLINE database using the following keywords: ("bladder cancer" or "urothelial cancer" or "transitional cell carcinoma" or "urothelial cancer") and ("invasion" or "progression" or "invasiveness" or "aggressiveness" or "metastasis"), with undefined threshold settings for gene prioritization. database 25. In parallel, differentially expressed proteins that were considered "verified" were mapped to pathways using QIAGEN's Ingenuity® Pathway Analysis (IPA®, QIAGEN Redwood City, www.qiagen.com/ingenuity). Statistical analysis was conducted using the right-tailed Fisher's exact test. Pathways with a p-value below 0.05 were considered significant.
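For readers unfamiliar with the enrichment statistic, the sketch below shows how a right-tailed Fisher's exact test can be computed for a single pathway as the upper tail of the hypergeometric distribution. It illustrates the test named above and is not IPA's internal implementation; the function name, the choice of background set and all numbers in the example are hypothetical.

```python
from math import comb


def fisher_right_tailed(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when drawing n proteins from a background of N, K of which are in the pathway.

    k: submitted proteins that map to the pathway
    n: total number of submitted proteins
    K: pathway members present in the background
    N: size of the background (reference) set
    """
    upper = min(n, K)
    total = comb(N, n)
    # Sum the hypergeometric probabilities of overlaps at least as large as k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Hypothetical example: 8 of 250 submitted proteins fall in a pathway
    # that has 40 members in an assumed background of 5,000 proteins.
    p = fisher_right_tailed(k=8, n=250, K=40, N=5000)
    print(p, p < 0.05)
```

The test is right-tailed because only over-representation of pathway members among the submitted proteins counts as evidence of enrichment.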