SWATH-based proteomics identified carbonic anhydrase 2 as a potential diagnosis biomarker for nasopharyngeal carcinoma

Nasopharyngeal carcinoma (NPC) is a serious threat to public health, and biomarker discovery is urgently needed. Sequential window acquisition of all theoretical fragment-ion spectra (SWATH) mass spectrometry (MS), which is based on data-independent acquisition (DIA), has proved to be precise in protein quantitation and efficient for cancer biomarker research. In this study, we performed the first SWATH-MS analysis comparing NPC and normal tissues. Spike-in stable isotope labeling by amino acids in cell culture (super-SILAC) MS was used as a shotgun reference. We identified and quantified 1414 proteins across all SWATH-MS analyses. We found that SWATH-MS preferentially detected proteins with smaller molecular weights than either super-SILAC MS or the human proteome background. With SWATH-MS, 29 significantly differentially expressed proteins (DEPs) were identified. Among them, carbonic anhydrase 2 (CA2) was selected for further validation on the basis of novelty, MS quality and other supporting rationale. In the tissue microarray analysis, CA2 had an AUC of 0.94 in differentiating NPC from normal tissue samples. In conclusion, SWATH-MS has unique features in proteome analysis, and it led to the identification of CA2 as a potentially new diagnostic biomarker for NPC.

Scientific Reports | 7:41191 | DOI: 10.1038/srep41191
high-throughput methods, including proteomics and next generation sequencing, are rapidly moving the field forward 15 . For example, Chen et al. employed isobaric tags for relative and absolute quantitation (iTRAQ)-based mass spectrometry (MS) to compare motile and non-motile NPC cells, and found and validated the biomarker potential of RAN, SQSTM1 and TRIM29 16 . Besides iTRAQ-MS, shotgun MS approaches are primarily based on the data-dependent acquisition (DDA) mode, which is known for its deep proteome coverage. In principle, shotgun MS identification is probability based, so controlling false discoveries is an unavoidable challenge, especially when searching against large reference databases 17 .
As a major advance in the field of proteomics, the data-independent acquisition (DIA)-based approach of sequential window acquisition of all theoretical fragment-ion spectra (SWATH)-MS has attracted broad attention 18 . The basic principle of SWATH is to acquire all theoretical fragment-ion spectra and assign them back to their parent ions to achieve high-throughput identification and quantification 19 . This is an extraordinary and complementary expansion of the relatively low-throughput selected/multiple reaction monitoring (SRM/MRM) 20,21 .
SWATH has been shown to be valuable for biomarker discovery in several cancer types, such as prostate cancer 22 , colorectal cancer 23 , gastric cancer 24 , and lung cancer 25 . However, its application in the NPC field is very limited. In this study, we performed the first SWATH-MS analysis comparing NPC and normal tissues, in which we successfully identified and verified carbonic anhydrase 2 (CA2) as a potentially new diagnostic biomarker of NPC.

SWATH and super-SILAC MS analyses.
To quantify the proteome changes in NPC tissues, we took advantage of the DIA feature of SWATH, and used super-SILAC MS as a general shotgun MS verification. Tissue lysates from 9 normal and 9 NPC subjects were pooled by group and analyzed by super-SILAC MS. In addition, 6 normal and 5 NPC subjects from the same batch of donors used for the super-SILAC MS were individually analyzed by SWATH-MS. Six more clinical samples were obtained for the subsequent verification steps. No statistically significant difference was observed in donor gender or age (P > 0.05, Fisher's exact test, Supplementary Table S1).
We identified and quantified 1414 proteins across all 11 samples in the SWATH-MS analysis (Supplementary Table S3). In the super-SILAC MS analysis of the pooled samples, 4065 proteins were quantified in both the normal and the NPC groups with protein FDR < 1% (Supplementary Table S4). Comparing the two methods, 1321 proteins were quantified by both (Fig. 1a). All MS raw data are available in iProX (accession number: IPX00080100).
We next focused on the physical-chemical characteristics of the 1414 proteins identified by SWATH-MS, in comparison with the same number (1414) of proteins randomly selected from the super-SILAC identifications and from the neXtProt PE1 proteins. Both MS methods preferentially detected proteins with lower isoelectric points (pI) (Fig. 1b) and higher charges (Fig. 1c) than the human proteome background. MS is known to favor the identification of more acidic proteins. In addition, more charges lead to easier ionization of peptides, which is favorable for MS analyses. Hence, these two features can be expected. Nonetheless, we found that SWATH-MS tended to identify smaller proteins, with a median molecular weight (MW) of 43.7 kDa, significantly lower than that of the super-SILAC MS identifications (median MW = 52.5 kDa) and the background PE1 proteins (median MW = 51.3 kDa) (Fig. 1d).
Differentially expressed proteins in SWATH-MS. We next used the power law global error model (PLGEM) algorithm 26 to determine differentially expressed proteins (DEPs). We and others have found that this method has the advantage of accounting for the global error when fitting proteome abundance data for statistical tests 27,28 . We found that our SWATH-MS data could be well fitted by PLGEM, with a slope of 0.886 and an adjusted r² of 0.996 (Pearson r = 0.961) (Fig. 2a). The residuals between the modeled and the measured standard deviation (SD) generally followed a normal distribution (Fig. 2b). These results suggested that PLGEM worked properly on the SWATH-MS data.
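The power-law relationship between abundance and variability that underlies this fit can be sketched in a few lines. The following is a minimal illustration, assuming a plain least-squares fit of ln(SD) against ln(mean); the PLGEM package itself uses a more elaborate fitting procedure, and the function name and toy data are hypothetical.

```python
import math


def fit_power_law(means, sds):
    """Fit ln(sd) = slope * ln(mean) + intercept by ordinary least squares.

    Returns (slope, intercept, r_squared). Illustrative only; this is not
    the fitting routine of the PLGEM package.
    """
    xs = [math.log(m) for m in means]
    ys = [math.log(s) for s in sds]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Toy data generated from an exact power law (hypothetical values).
toy_means = [10.0, 100.0, 1000.0, 10000.0]
toy_sds = [0.5 * m ** 0.886 for m in toy_means]
slope, intercept, r_squared = fit_power_law(toy_means, toy_sds)
```

On real data the points scatter around the fitted line, and the residuals of that fit are what Fig. 2b examines.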
It has been shown that PLGEM provides a more powerful signal-to-noise (STN) ratio, especially in low-abundance protein evaluations, by incorporating the PLGEM-derived SD 26,27 . In this study, we observed that the PLGEM-STN was significantly correlated with the relative protein fold changes, with a Spearman r of 0.99 (Fig. 2c). Based on PLGEM-STN, we identified 29 proteins as DEPs (P < 0.01), accounting for ~2% of the total proteins (Fig. 2d, Supplementary Table S2). These DEPs showed a degree of reproducibility between the super-SILAC and SWATH-MS quantitation (Fig. 2e). These 29 DEPs could not completely separate the NPC tissues from the normal tissues in cluster analysis, potentially because of the small sample size (Fig. 2f). Nevertheless, 4 subgroups could be differentiated, and no subgroup contained both NPC and normal tissues (Fig. 2f).
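As a rough illustration of the statistic, and assuming the signal-to-noise form described in the PLGEM literature (difference of group means divided by the sum of the model-derived SDs), the computation can be sketched as follows. The function names, the intercept and the abundance values are hypothetical; only the slope of 0.886 comes from the fit reported in this study.

```python
import math


def modeled_sd(mean, slope, intercept):
    """SD predicted by the power-law model: sd = exp(intercept) * mean**slope."""
    return math.exp(intercept) * mean ** slope


def plgem_stn(mean_case, mean_control, slope, intercept):
    """Signal-to-noise ratio using model-derived SDs instead of measured SDs."""
    noise = (modeled_sd(mean_case, slope, intercept)
             + modeled_sd(mean_control, slope, intercept))
    return (mean_case - mean_control) / noise


# Hypothetical protein abundances; the intercept of 0.0 is an assumption.
stn_up = plgem_stn(mean_case=300.0, mean_control=100.0,
                   slope=0.886, intercept=0.0)
```

Because the noise term comes from the global model rather than from a handful of replicates, the statistic stays stable for low-abundance proteins whose measured SDs are unreliable.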
Common upstream regulator detected for SWATH and super-SILAC DEPs. The upstream analysis module of Ingenuity Pathway Analysis (IPA) suggested that interferon gamma (IFNG) was a significant, unbiased upstream activator of 10 of the DEPs detected by SWATH-MS (z-score = 2.282, P = 1.44 × 10⁻⁶; Fig. 3a). For the DEPs detected by super-SILAC, IFNG was consistently predicted (Fig. 3b); in addition, other known regulators relevant to viral infection, such as IFNA2, IFNB1 and TLR, were also deemed upstream activators by IPA (Fig. 3b). Within the mechanistic network of IFNG, 5 common DEPs, such as STAT1 and IFIT1, were shared by SWATH-MS (Fig. 3c) and super-SILAC MS (Fig. 3d). Other proteins were unique to one method; for example, PRDX2 was only detected in SWATH-MS (Fig. 3c). These results support the complementary nature of different MS methods. In addition, the similar upstream regulator prediction suggested that, even though we used MS approaches based on different acquisition modes (DDA or DIA), the bioinformatic results were reproducible, which supports the validity of our general research strategy.
SWATH spectra of CA2. Among the 29 DEPs detected by SWATH-MS, thirteen were also found to be up-regulated in the super-SILAC MS analysis (Table 1). After applying a 1.5-fold threshold based on the super-SILAC quantitation, seven proteins were left for further evaluation. We prioritized CA2 for further evaluation for the following reasons: 1) CA2 had 4 unique peptides identified in the shotgun mode (Fig. 4a, and Supplementary Fig. S2), and the SWATH results passed manual spectra inspection; 2) Wu et al. previously reported that CA2 was up-regulated in the serum of NPC xenograft mice 29 ; and 3) the diagnostic power of CA2 had not been reported in the NPC field.
Fig. 4 shows the spectrum of a unique peptide of CA2 identified in the shotgun mode; it had 8 consecutive y-ions labeled with mass errors of less than 0.5 Da (Fig. 4a). When analyzed in the SWATH mode, 6 y-ions were found at a retention time of ~97 min (black arrow, Fig. 4b), which was consistent with the parent-ion retention time recorded in the shotgun mode. These 6 y-ions could be found in the MS2 spectrum of the SWATH analysis (Fig. 4c). These results indicated that the SWATH identification of CA2 was of high confidence.
CA2 has diagnostic value in NPC. We next used more clinical samples to verify the CA2 changes by immunoblotting (IB). Although the NPC group tended to have higher CA2 expression than the normal group, no significant difference was observed (Fig. 5a). The raw IB images can be found in Supplementary Fig. S3. We reasoned that the sample size was not sufficient for statistical analysis of the IB results. Thus, we employed a commercial tissue microarray (NPC donor n = 52 and normal donor n = 13) to evaluate the diagnostic power of CA2. We observed remarkably higher CA2 expression in some NPC tissues, such as the example shown in Fig. 5b. However, within-group variation was still visually apparent; the immunohistochemistry (IHC) results of all tissue points can be found in Supplementary Fig. S1 and Table S5. Despite the variation, the KS-test results showed that the CA2 Histologic Scores of the NPC tissues were significantly higher than those of the normal tissues (Fig. 5c). We noted that the sample size was considerably different between the two groups. As such, we used bootstrap analysis to further verify the conclusion. After 10000 bootstrap resamplings of each group, the 95% confidence intervals of the two groups did not overlap, suggesting that the mean values of the two groups were statistically different (Fig. 5d). With receiver operating characteristic (ROC) analysis of the Histologic Score, we found that CA2 showed high diagnostic power, with an Area Under Curve (AUC) of 0.94 (Fig. 5e).
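The bootstrap comparison can be illustrated with a short sketch. It assumes a percentile bootstrap of the group mean, which is one common variant and may differ from the MATLAB procedure used in this study; the function name and the scores are hypothetical, and only the group sizes (52 and 13) and the 10000 resamples follow the text.

```python
import random


def bootstrap_mean_ci(values, n_resamples=10000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, keeping the original group size.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical Histologic Scores, not the study data.
npc_scores = [150 + (i * 7) % 100 for i in range(52)]
normal_scores = [20 + (i * 11) % 60 for i in range(13)]

npc_ci = bootstrap_mean_ci(npc_scores)
normal_ci = bootstrap_mean_ci(normal_scores)
intervals_overlap = npc_ci[0] <= normal_ci[1] and normal_ci[0] <= npc_ci[1]
```

Resampling each group at its own size keeps the smaller group's wider uncertainty visible in its interval, which is the point of using the bootstrap when group sizes are unbalanced.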

Discussion
In this study, we demonstrated that SWATH-MS in the DIA mode worked properly for NPC biomarker discovery, with the unique feature of preferentially detecting proteins with low MWs. With this approach, we identified and verified CA2 as a potentially new diagnostic biomarker with high statistical power to differentiate NPC from normal tissues.
In our past work, we found that human cells preferentially translate shorter mRNAs, and that when mRNA length is taken into account, translating mRNAs and proteins are highly correlated in their abundances 30 . We further proposed a computational model showing that such length-dependent translation preference is a survival strategy of human cells to maintain a functional proteome by avoiding erroneous protein products 31 . These findings suggest that the stoichiometry of translating mRNA into protein is correlated with mRNA length 30,31 . A recent study from Huang et al. has shown that SWATH-MS signal intensities have a precisely linear correlation with sample loading abundances in a label-free analysis 32 . SWATH scanning is generally referenced against a shotgun library, which is usually based on single-injection MS analysis. In such a scenario, proteins with higher molar concentrations have a higher probability of being identified by shotgun MS. In sum, SWATH-MS tends to focus on fragment ions derived from protein products with shorter amino acid sequences, while it has outstanding quantitation precision. This feature of SWATH-MS should partially explain its complementarity to other MS approaches, such as the super-SILAC results demonstrated in this study. Its capacity to identify new biomarkers should therefore not be a surprise.
In humans, the carbonic anhydrase family has 16 enzymes that catalyze the reversible conversion of carbon dioxide and water to bicarbonate and protons. They have diverse associations with cancer, autoimmune disease and viral infection 33,34 . Viikila et al. found that CA2 and CA12 have prognostic power in colorectal carcinomas 35 ; and Kurono et al. reported that CA2 expression in breast cancer is significantly higher than in normal tissues 36 . Solid tumors are known to be characterized by the tumor microenvironment, with aberrant activation of numerous immune cells 28 . Interestingly, CA2 and other carbonic anhydrases can cause autoimmune reactions by activating mast cells 37 and plasma cells 38,39 . Thus, the diagnostic power of CA2 in NPC has comparable evidence from other solid tumor types, and it potentially reflects the inflammatory tumor microenvironment.
Furthermore, we demonstrated that the DEPs detected by SWATH-MS shared IFNG as a significant common upstream regulator, which implicated the anti-viral response in NPC subjects. IFNG is an essential cytokine produced by both innate and adaptive immune cells, and it is highly active against viral, certain bacterial and protozoal infections. The IFNG level rises after primary EBV infection in humans 40 ; and the serum IFNG level increases along with the EBV viral load in NPC patients 41 . These observations support the biological relevance of the SWATH-MS DEPs to NPC or EBV infection. Favorably, among these DEPs, other groups have proven VTN 42,43 and MIF 44,45 to be potential NPC biomarkers. Therefore, as the viral response contributes to NPC-associated inflammation, these findings suggest that proteins participating in this biological process constitute a resource of potential NPC diagnostic biomarkers.

Methods
Human nasopharyngeal tissue samples. The nasopharyngeal tissue samples were acquired from The First Affiliated Hospital of Jinan University via biopsy. The scientific and ethics review committees of Jinan University approved this study, and written informed consent was obtained from all of the study participants. All methods were performed in accordance with the relevant guidelines and regulations.
Cell culture. Human cell lines CNE-1 and CNE-2 were acquired from the American Type Culture Collection (ATCC, Rockville, MD). All cells were maintained in complete Dulbecco's modified Eagle's medium (DMEM), supplemented with 10% fetal bovine serum (FBS), 1% penicillin/streptomycin and 10 μg/mL ciprofloxacin.
SILAC labeling. CNE-1 and CNE-2 were subjected to SILAC labeling as we previously described 28,46 . In brief, cells were cultured in the heavy SILAC medium, DMEM containing 73 mg/L ¹³C₆¹⁵N₂-L-lysine (Lys8) and 42 mg/L ¹³C₆¹⁵N₄-L-arginine (Arg10) (Cambridge Isotope Laboratories, Andover, MA, USA), supplemented with 10% dialyzed FBS (Life Technologies), 1% pen/strep and various forms of essential amino acids (Cambridge). After at least 8 passages, cells were lysed, and a pooled cell lysate was used as the spike-in standard for the subsequent super-SILAC-based shotgun MS analysis as developed by Mann's group 47 .
Protein extraction and digestion. Post-biopsy, tissue samples were immediately treated with 1% SDS lysis buffer (Beyotime, Nanjing, China), supplemented with 1 mM phenylmethanesulfonyl fluoride (PMSF) and 2% (v:v) protease inhibitor (Roche, Shanghai, China), followed by grinding in liquid nitrogen. The tissue lysate was sonicated and centrifuged at 17,000 × g for 30 min. Supernatants were collected, and the protein concentration was determined with a BCA kit (ThermoFisher Scientific, Shanghai, China). For cell sample lysis, we followed our reported procedure 48 .
We employed in-solution protein digestion with a filter-aided sample preparation (FASP) method 49 , as we described previously 48 . Briefly, samples were subjected to reduction (8 M urea and 50 mM DTT at 37 °C, 1 h) and alkylation (100 mM IAA, at room temperature, 30 min) in 30 kDa ultracentrifugal filters (Sartorius Stedim Biotech, Shanghai, China). After 15 min of centrifugation (12,000 × g, 4 °C), two sequential buffer changes were performed using 8 M urea and 50 mM NH₄HCO₃, respectively. Trypsin was then added to the filter at a mass ratio of 1:30 for 8 h at 37 °C. Peptides were collected by centrifugation at 12,000 × g, 4 °C, for 15 min.
Super-SILAC-based shotgun mass spectrometry. Tissue protein extracts were mixed with the SILAC labeled spike-in standard at a 1:1 mass ratio prior to the in-solution digestion. Peptides were fractionated with an approach using SAX StageTip and C18 StageTips, and 6 peptide fractions were collected by using serial

SWATH mass spectrometry.
To generate the ion library for the SWATH-MS analysis, peptides from different nasopharyngeal tissue samples were pooled and analyzed in the shotgun mode with a TripleTOF® 5600 MS (AB). In detail, peptides were analyzed in the high sensitivity IDA mode; the precursor ion selection range was 350–1500 m/z, using 0.25 s accumulation. For each precursor ion with 50 ms minimum accumulation time in the range of 100–1500 m/z, the maximum precursor number per cycle was set to 40. Dynamic exclusion was applied. Next, peptides of each tissue sample were individually analyzed in the SWATH mode 18 . Specifically, a 14 m/z precursor isolation window was used for consecutive data-independent acquisition across the range of 400–1150 m/z. The accumulation time was set to 50 ms, and the total cycle time was ~3.2 s.
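The acquisition parameters above imply a window count and a minimum MS2 time per cycle, which the following sketch works out. It assumes contiguous, non-overlapping windows; real SWATH methods often overlap adjacent windows slightly, which the text does not specify. The function and variable names are hypothetical.

```python
import math


def swath_windows(mz_start, mz_end, width):
    """Contiguous isolation windows covering [mz_start, mz_end].

    Assumes no overlap between adjacent windows (an assumption; the
    original method may have used a small overlap).
    """
    n_windows = math.ceil((mz_end - mz_start) / width)
    return [
        (mz_start + i * width, min(mz_start + (i + 1) * width, mz_end))
        for i in range(n_windows)
    ]


windows = swath_windows(400.0, 1150.0, 14.0)
# MS2 time per cycle at 50 ms accumulation per window.
ms2_time_s = len(windows) * 0.050
```

Under these assumptions the range needs 54 windows, or about 2.7 s of MS2 accumulation per cycle. The remainder of the reported ~3.2 s cycle would then come from the survey scan and instrument overhead, which the text does not itemize.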
The SWATH-MS raw data were searched by ProteinPilot with parameters identical to those used for the SILAC searches, except that the Sample Type was set as Identification.
Physical and Chemical Features of Proteins. The amino acid sequence and mass were obtained from the Swiss-Prot human protein database. The isoelectric point (pI) and the charge at physiological conditions (pH = 7.4) of each protein were calculated with the MATLAB bioinformatics toolbox (MathWorks, Natick, MA, USA).
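For readers without MATLAB, the two quantities can be approximated from sequence alone with the Henderson-Hasselbalch equation. The sketch below is an illustration under stated assumptions: the pKa table is one commonly used set and may differ from the values in the MATLAB toolbox, the function names are hypothetical, and the example peptides are toy sequences rather than proteins from this study.

```python
# Assumed side-chain and terminal pKa values (one commonly used table).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence, ph=7.4):
    """Net charge of a protein sequence at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))    # N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))   # C-terminal carboxyl
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence, tol=1e-4):
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Net charge decreases monotonically as pH increases.
        if net_charge(sequence, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


acidic_pi = isoelectric_point("DDDD")   # toy acidic peptide
basic_pi = isoelectric_point("KKKK")    # toy basic peptide
```

Bisection is sufficient here because the net charge is a monotonically decreasing function of pH, so it crosses zero exactly once.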

Ingenuity Pathway Analysis.
We performed the core analysis of IPA on the fold changes of DEPs from each MS method as we described previously 28,50 . The Upstream Analysis was used to identify upstream regulators of DEPs with activation/inhibition status predictions 51 .
Immunoblotting analysis. IB was performed as we previously described 46,48,51 . Antibodies used were rab-

Immunohistochemistry.
A commercial tissue array (array number: NPC1503, Super Biotek, Shanghai, China) was used to analyze the diagnostic power of CA2. IHC was performed using rabbit anti-CA2 pAb (1:100, Sino Biological Inc., Beijing, China) and the SuperPicture™ 3rd Gen IHC Detection Kit (Thermo). The tissue array was incubated with the primary antibody, followed by visualization with DAB staining and hematoxylin counter-staining. The staining intensity was scored as 0 (negative), 1 (weak), 2 (moderate) or 3 (strong), independently evaluated by at least two professional pathologists. A Histologic Score (intensity score × percentage of stained cells) was used to evaluate the CA2 expression.
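The scoring rule can be written out as a small helper. This is a minimal sketch assuming the percentage is expressed on a 0 to 100 scale, which gives scores from 0 to 300; the text does not state the scale, and the function name and example values are hypothetical.

```python
def histologic_score(intensity, percent_stained):
    """Histologic Score = intensity score x percentage of stained cells.

    `intensity` is 0, 1, 2 or 3; `percent_stained` is assumed to be on a
    0-100 scale (an assumption, not stated in the text).
    """
    if intensity not in (0, 1, 2, 3):
        raise ValueError("intensity must be 0, 1, 2 or 3")
    if not 0 <= percent_stained <= 100:
        raise ValueError("percent_stained must be between 0 and 100")
    return intensity * percent_stained


# Hypothetical tissue point: moderate staining in 60% of cells.
example_score = histologic_score(2, 60)
```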
Statistics. DEPs were determined using the R package PLGEM 27,28 , and statistical significance was accepted when P < 0.01. All of the correlation analyses were shown with Spearman r. Histologic Scores of the IHC results were examined by the unpaired Kolmogorov-Smirnov test (KS-test) using GraphPad Prism version 6.02 (GraphPad Software, Inc., San Diego, CA, USA), with P < 0.05 deemed significantly different. In addition, we employed bootstrap analysis to compensate for the difference in sample sizes provided by the tissue microarray. Bootstrap resampling (10000 iterations) was performed for each group to statistically compare the Histologic Scores of the NPC and the normal groups 46,48 . Both the bootstrap and the cluster analyses were performed in MATLAB version R2016a. The ROC curve and the AUC were generated and computed with GraphPad Prism.
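For intuition about the AUC, it equals the probability that a randomly chosen NPC score exceeds a randomly chosen normal score, with ties counted as one half. The sketch below uses that rank-based equivalence rather than the GraphPad Prism procedure used in this study; the function name and the scores are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. Equivalent to the Mann-Whitney U statistic
    divided by the number of pairs.
    """
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical Histologic Scores, not the study data.
toy_auc = roc_auc([3, 4], [1, 3])
```

An AUC of 0.5 corresponds to no discrimination between the groups and 1.0 to perfect separation.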