Use tumor suppressor genes as biomarkers for diagnosis of non-small cell lung cancer

Lung cancer is the leading cause of death worldwide. Especially, non-small cell lung cancer (NSCLC) has higher mortality rate than the other cancers. The high mortality rate is partially due to lack of efficient biomarkers for detection, diagnosis and prognosis. To find high efficient biomarkers for clinical diagnosis of NSCLC patients, we used gene differential expression and gene ontology (GO) to define a set of 26 tumor suppressor (TS) genes. The 26 TS genes were down-expressed in tumor samples in cohorts GSE18842, GSE40419, and GSE21933 and at stages 2 and 3 in GSE19804, and 15 TS genes were significantly down-expressed in tumor samples of stage 1. We used S-scores and N-scores defined in correlation networks to evaluate positive and negative influences of these 26 TS genes on expression of other functional genes in the four independent cohorts and found that SASH1, STARD13, CBFA2T3 and RECK were strong TS genes that have strong accordant/discordant effects and network effects globally impacting the other genes in expression and hence can be used as specific biomarkers for diagnosis of NSCLC cancer. Weak TS genes EXT1, PTCH1, KLK10 and APC that are associated with a few genes in function or work in a special pathway were not detected to be differentially expressed and had very small S-scores and N-scores in all collected datasets and can be used as sensitive biomarkers for diagnosis of early cancer. Our findings are well consistent with functions of these TS genes. GSEA analysis found that these 26 TS genes as a gene set had high enrichment scores at stages 1, 2, 3 and all stages.

Lung cancer is the leading cause of death worldwide 1 which accounts for approximately 20% of deaths caused by cancer in Europe 2 and has a high risk of recurrence 3 . In 2010, about 222,520 new cases of lung cancer were diagnosed but only 15% of patients were estimated to be alive after 5 years 4 . Based on histopathological analysis, lung cancer is divided into four major histological subtypes: small cell lung cancer, squamous cell carcinoma, adenocarcinoma and large cell carcinoma. The latter three are collectively referred to as non-small cell lung cancer (NSCLC) and account for 80% of lung cancer 5,6 . About 25 ~ 30% of patients with NSCLC had stage 1 disease and received surgical intervention alone. Despite undergoing curative surgery, more than 25% of NSCLC patients at stage 1 will die from recurrent disease within 5 years 7 . Risk factors for lung cancer identified include smoking 3 , radiation, chemical exposure, and other exposure factors 3 . Lung diseases such as chronic bronchitis, emphysema, pneumonia and tuberculosis 8 , familial tumor history 8 , and diet 9,10 may also be considered as risk factors causing lung cancer 11 . In Western countries, 70-90% of lung cancers are attributed to cigarette smoking, whereas in Taiwan, only 7% of female lung cancer cases are associated with smoking 12,13 . Over the past 30 years, lung cancer mortality rate has increased by 464.84% in the Chinese mainland, which is much higher than the worldwide average 14 . The high mortality rate of NSCLC is partially due to lack of early detection, unclear molecular mechanism, and therapeutic methods. Also reliable clinical and molecular diagnostic and prognostic factors as well as guidelines for treatment of recurrent NSCLC stage1 have not yet been well elucidated. Identification of gene signatures and molecular pathways that are critical for development of metastasis could lead to improved therapy 7 . Advances in human genomics and proteomics have generated lists of candidate biomarkers with potential clinical values. For example, genes such as TP53 15,16 , EGFR 17,18 , KRAS 19 , PIK3CA 20 , and EML4-ALK 21 have been identified as biomarkers for diagnosing and predicting survival outcome of lung cancers. However, most single genes are pretty unstable in expression, and hence single genes used as biomarkers do not reliably predict early lung cancers or accurately diagnose lung cancer stages 22 . Recently, many studies tried to screen gene Differential gene-expression profiles. We then performed a ranking analysis of microarray (RAM) 25 on microarray data of stages 1, 2, and 3. Figure 1A summarizes numbers of DE genes identified at stages 1, 2, and 3. Among 54,675 gene probes, only 195 (0.35%) were found to be differentially expressed at stage 1. Heatmap at stage 1 in Fig. 2 shows the differential expression of these 195 gene probes between cancer and normal samples. The result indicates that not many genes had expression change at tumor stage 1. At stage 2, however, 3352 gene  www.nature.com/scientificreports/ probes (6%) were identified to have differential expression at FDR < 0.01, of which 2517 (75%) were down-regulated and the others (25%) were up-regulated. Compared to stage-1 heatmap, stage-2 heatmap (Fig. 2) clearly shows differential expression of these 3352 gene probes between normal lung and cancer cells. From stage 1 to stage 2, total DE gene probes increased 1700%, down-regulated gene probes increased 2029% and up-regulated gene probes increased 1176%. At stage 3, 4424 DE gene probes (8.1%) were identified, of which 2644 (60%) were down-regulated and 1780 (40%) were up-regulated. The differential expression of the 4424 gene probes between normal and cancer samples at stage 3 is shown in stage-3 heatmap (Fig. 2). From stage 2 to stage 3, total DE gene probes increased 131%, down-regulated genes increased 105% and up-regulated genes increased 213%. Among the up-regulated gene probes, we found 484 gene probes only at stage 2, 1299 only at stage 3, and 351 at both stages ( Fig. 1C). For down-regulated gene probes, 1009 gene probes were detected only at stage 2, 1054 only at stage 3, and 1508 at both stages (Fig. 1D). These results indicate that abnormal expression of genes for tumor progression may start with stage 2.  Table S1).

Definition of tumor suppressor genes.
Classification of tumor samples. We also performed principal component analysis (PCA) of the three tumor stages using data of these 26 TS genes. Figure 1B shows that normal samples (green) are clustered to the right side of Z axis (the third component) along X axis (the first component) and tumor samples at stages 1-3 are grouped into the left side of Z axis along X. Figure 1B also demonstrates that tumor samples at stage 1 (yellow) are separated from the tumor samples at stages 2-3 in the direction of normal samples by these 26 TS genes but tumor samples at substages 2A, 2B, 3A and 3B cannot be clearly separated.
Expression of tumor suppressor genes in NSCLC. We used our method (see "Methods") to define 26 tumor suppressor (TS) genes. Since TS genes are directly associated with tumor development, it is interested in exploring dynamical changes of these TS genes in differential expression along with development of tumors. Figure 3 shows the differential expression of TS genes EXT1, LIMD1, DAB2IP, DDX5, SASH1 and MCC from tumor stage 1 to stage 3. These 6 TS genes showed much lower expression in the tumor samples than in the normal samples at stages 2 and 3. Except for that EXT1 and DDX5 were not found to be differentially expressed between the normal and tumor samples at stage1, the other 4 TS genes were down-expressed in tumor with In addition, NBL1 and PTCH1 were not detected to be differentially expressed between the normal and tumor samples at stages1 and 2 but were suppressed in the tumor samples at stage3 (not shown in Supplementary Fig.S2). Therefore, LIMD1, DAB2IP, SASH1, MCC, RASSF2, RECK, CBFA2T3, GPC3, KLK10, STARD13, CDKN1C, LATS2, RAP1A, FOXP1 and DCC can be used for diagnosis of early NSCLC patients. We chose two different data to validate these TS genes selected. Because GSE18842 had the same microarray platform with GSE19084, we used probeid to retrieve expression data of 26 TS genes from GSE18842 and made a heatmap in R environment (https ://www.r-proje ct.org). The result was shown in Supplementary Figure S3. The heatmap shows that these selected TS genes were really also down-expressed in the 45 cancer samples ( Supplementary  Fig. S3). Interestingly, RNA-seq RPKM data retrieved from cohort GSE40419 also demonstrates that these TS genes were down-expressed in either smoking or non-smoking NSCLC samples ( Supplementary Fig. S3).

Evaluation of TS genes as biomarkers for diagnosis of NSCLC cancer.
S-score analysis. We here used correlation of TS genes found at a tumor stage with the DE genes identified at the same stage to define Sscores (see "Methods") and used S-scores to explore the role of a TS gene in regulating expression of functional genes. With correlation coefficients > T (T is a threshold for selection of correlation coefficients with Bonferroni adjusted p-values < 0.05, here T = 0.62 for stage 1, 0.765 for stage 2, and 0.75 for stage 3), we calculated S + of 26 TS genes at stages 1, 2 and 3. The results were plotted in Fig. 4. At stage 1, Fig. 4 shows that DCC had the largest S + , indicating that DCC was the strongest TS gene to positively impact on gene down-expression at stage 1. Besides DCC, CBFA2T3, FOXP1, SASH1, LATS2, and DAB2IP had S + > 0, suggesting that the six defined TS genes had roles in regulating down-expression of the other functional genes in early lung cancer. As seen in the boxplots, these 6 TS genes were significantly down-expressed in lung cancer ( Fig. 3 and Supplementary Fig. S2); hence, their expression may be used to diagnose early NSCLC patients. At stage2, SASH1 had the strongest positive network effect (network effect is such an effect that a TS gene impacts on a set of the other functional genes in expression by correlation network) on gene down-expression in cancer and STARD13, RECK, CBFA2T3, ARHGEF12, CDKN1C, and RASSF2 also had larger S + , implicating that these 6 TS genes were stronger TS and RECK had S + over 0.5. At stage 3, RASSF2 and ARHGEF12 had S + less than 0.3, while KCNRG, DCC, TRIM13 and LIMD1 increased S + from less than 0.2 at stage 2 to more than 0.3 (Fig. 4). APC, EXT1 and DDX5 still had small S + . These suggest that RAP1A, STARD13, SASH1, RECK, and CBFA2T3 had big positive network effects on down-expression of genes in stage-3 cancer. LATS2, MCC, DCC, RHOB, TRIM13, LIMD1, and GPC3 also strongly and positively regulated gene down-expression in cancer. APC, PTCH1, EXT1, TBRG1, NBL1, PIK3CA, and KLK10 still had very weak network effects. S − was given by setting correlation coefficients < -T (see "Methods") and hence used to define a negative network effect of a TS gene on regulating gene up-expression. At stage 1, only DCC, DAB2IP, LIMID1 and RAP1A had weak negative network effects on up-expression of genes. These four TS genes were very significantly downexpressed in cancer at stage 1 with p-value ≤ 0.009 ( Fig. 3 and Supplementary Fig. S2c). Figure 4 shows that both stages 2 and 3 had very similar TS gene S − profiles. First, TBRG1, PICH1, NBL1, TRIM13 and APC did not impact gene up-expression in cancer. Second, SASH1 was the strongest TS gene acting on gene up-expression in cancer. Next, RECK, STARD13 and RASSF2 were stronger TS genes to negatively regulate gene up-expression. MCC, CBFA2T3, CDKN1C, ARHGEF12, PIK3CA and RHOB had approximately negative expression association with the other functional genes.
S-score is defined by the sum of correlation coefficients of a TS gene larger than T or smaller than -T ("Methods"). If a TS gene has S > 0, then the TS gene has larger positive network effect on gene down-expression than negative one on gene up-expression in cancer. If a TS gene has S < 0, then it has larger negative network effect on gene up-expression than positive one on gene down-expression in cancer. If a TS gene has S = 0, then it does not act on gene expression or balances between up-and down-regulations of gene expression. At stage 1, LIMD1, RAP1A and DAB2IP had negative network effects on gene up-expression in early cancer, while SASH1, . Histogram plots for S + , S − and S of TS genes. Correlation coefficients were used to measure roles of TS genes in regulation of expression of the other genes. TS genes with negative correlation coefficients ≤ -T (T is a threshold value with Bonferroni adjusted p < 0.05, see "Methods" for calculation of T value) have a negative role that up-regulates expression of genes while those with positive correlation coefficients ≥ T have a positive role that down-regulates expression of the other genes. S-score is defined as sum of all correlation coefficients larger than or equal to a given positive threshold or less than or equal to a given negative threshold for a TS gene i: S i = m j=1 r ij ||r| ≥ T where j=1, 2, ..., m and i = 1, 2, ..., n. Here n and m are numbers of DE and TS genes, respectively. S + is proportion of sum of all correlation coefficients larger than or equal to a given threshold to sums of correlation coefficients larger than zero for a TS gene i: www.nature.com/scientificreports/ CBFA2T3, LATS2, RECK, FOXP1 and DCC had positive network effects on gene down-expression, in other words, down-expression of these 6 TS genes resulted in or associated with down-expression of genes in early cancer. Interestingly, 9 TS genes with S > 0 or S < 0 were found to be indeed significantly down-expressed in cancer at stage 1, while the other TS genes such as EXT1, KLK10, NBL1, RHOB, PTCH1 that were not found to be down-expressed in early cancer had S of zero, that is, these of being not differentially expressed between normal and cancer samples at stage 1 were not found to be correlated with other functional genes in expression in early cancer. Indeed, except for that PTCH1, KLK10, TBRG1, NBL1 and EXT1 did not act on gene expression in stage-2 cancer, the other 21 TS gene had larger positive network effects on gene expression than negative network effects in cancer. At stage 3, however, all these 26 defined TS genes have S > 0, suggesting that all these 26 TS genes had larger positive network effects on gene down-expression in cancer than negative network effects on gene up-expression even though APC, EXT1 and DDX5 had very small S-scores.
In order to validate impacts of TS genes on gene expression in cancer, we used T = 0.474 to calculate their S + , S − and S in cohort GSE18842 27 . The results were summarized in supplementary Figure S4. Similarly, STARD13, RECK, RAP1A, LATS2, LIMD1, RASSF2, CBFA2T3, SASH1, DDX5, ARHGEF12, CDKN1C, and RHOB had S + > 0.4 and S − > 0.3, indicating that in cohort GSE18842 these TS genes had larger positive and negative network effects on expression of genes in cancers. PIK3CA, APC, KLK10 and PTCH1 still weakly acted on gene downexpression and did not impact gene up-expression. Except for that GPC3 had negative S-score, all the other 25 TS genes had positive S-score, demonstrating that in cohort GSE18842 27 these TS genes also had much larger positive network effects on gene expression than negative ones. In particular, CBFA2T3, ARHGEF12, SASH1 and STARD13 still had the strongest impact on gene down-expression. Interestingly, in the RNA-seq data from cohort GSE40419 28 without smoking history, S + , S − and S obtained from correlation coefficients selected by T = 0.538 or T = − 0.538 show similar results ( Supplementary Fig.S4). Likewise, all these 26 TS genes had S + > 0. Similarly, KLK10, KCHRG, EXT1, PIK3CA, and PTCH1 also had S − = 0. Except for that RAP1A and RASSF2 had negative S, all the other 24 defined TS genes had positive S-scores, demonstrating that in no smoking cohort GSE40419 28 , TS genes had much larger positive network effects on gene expression than negative ones, or, down-expressed genes due to down-expression of these TS genes were many more than up-expressed genes resulted from their down-expression in cancers. Supplementary Figure S4 displays S + , S − and S profiles of these 26 TS genes in smoking cohort GSE40419 28 . Compared to these score profiles of TS genes in no smoking cohort, we found that smoking strongly impacted S + , S − and S scores. For example, GPC3, CDKN1C, LIMD1 had S + larger than 0.4 in no smoking cohort but decreased to less than 0.2 in smoking cohort. APC, DDX5, CDKN1C, LIMD1, TBRG1, FOXP1 decreased S − to 0.05 in the smoking cohort, while MCC, RAP1A, STARD13, NBL1, increased S − from 0.15 in no smoking cohort to 0.35 in smoking cohort. However, except for that LATS2, RASSF2, RAP1A, NBL1 and RHOB had negative S-scores, the other 21 TS genes also had positive S-score, demonstrating that in smoking cohort GSE40419, TS genes had much larger positive network effects on gene expression than negative ones, or, down-expressed genes due to down-expression of these TS genes were many more than up-expressed genes resulted from their down-expression in cancer.
N-score analysis. As examples, we here employed correlation coefficients of three strong TS genes and three weak TS genes defined by using S-scores with DE genes identified at stage 2 in cohort GSE19804 1 to respectively construct positive and negative correlation networks (Fig. 5). Strong TS genes SASH1, STARD13 and CDKN1C had very complicated positive and negative networks (Fig. 5A,C) in which these TS genes commonly shared many more DE genes (double-line or multi-line nodes) than they unshared DE genes (single-line nodes), while weak TS genes DDX5, KCNRG and CDKN2C had very simple positive and negative correlation networks (Fig. 5B,D), that is, these three weak TS genes commonly shared less DE genes (double-line or multi-line nodes) than they unshared DE genes(single-line nodes). Summarily, strong TS genes had not only stronger network effects (strongly connected more DE genes) but also stronger synergy effects (commonly shared more DE genes) than weak TS genes. To evaluate TS genes in synergy effects on commonly regulating gene expression, we here proposed N-scores (or network scores) to compare these TS genes (see "Methods"). Figure 6 Fig. S5a). Inversely, stronger TS genes RAP1A, and MCC at stage 3 in cohort GSE19804 became very weak TS genes with N + < 0.05 and N − <0.01. N + profile in cohort GSE40419 28 (Supplementary Fig. S5b) is pretty similar to that at stage 3 in GSE19804. For example, EXT1, PTCH1, KLK10 and APC were weak TS genes with small N + in these two cohorts, while strong TS genes RECK, SASH1, CBFA2T3, STARD13, DCC, and MCC in GSE19804 were still strong TS genes in GSE40419 with N + > 0.3 (Supplementary Fig. S5b)

Score accumulation profiles.
To globally compare scores of negative and positive networks, we calculated score accumulations from the smallest value to each ordered value ( X a i = i k=1 S a k where S a 1 ≤ S a 2 ≤ · · · ≤ S a n or Y a i = i k=1 N a k where N a 1 ≤ N a 2 ≤ · · · ≤ N a n , i is an ordered number, i=1, 2, ..., N and a=+/−) and plotted these score accumulations to form an accumulation curve along numbers of TS genes. The plots are shown in Fig. 7 where we found that at stage 1 in cohort GSE19804, S + and S − accumulations are zero when numbers of TS genes < 18 and the largest difference between S + and S − accumulations is 0.32. At stage 2, S + and S − accumulations were almost the same along numbers of TS genes < 18. Over 18 TS genes, S + accumulation became larger than S − accumulation and the largest difference between both is 0.75. At stage 3, however, S + accumulation was larger than S − accumulation at all points of which numbers of TS genes are larger than 6 and difference between them became larger and larger as number of TS genes increased and the largest difference is 3.0. This means that at stage 3, more than 6 TS genes commonly connected many more down-expressed genes than they commonly connected up-expressed genes in cancer. Supplementary Figure S7 shows that in GSE40419 non smoking cohort, S + and S − accumulation curve profile is very similar to that at stage 3 in GSE19804 and the largest difference between S + and S − accumulations is 4.8. This allows us to infer that these lung tumor samples in GSE40419 were in between tumor stages 3B and 4. In cohort GSE18842, S + and S − accumulation curve profile is in between stages 2 and 3, that is, the largest difference between S + and S − accumulation curves is 1.0, larger than 0.75 at stage 2 but much smaller than 3.0 at stage 3, inferring that most of tumor patients in cohort The results, as we expected, shows that S + and S − accumulation curve profile is between those in GSE18842 and at stage 2 in GSE19804 (Fig. 7), and the largest difference between S + and S − accumulations is 1.5, demonstrating the above inference of tumor stages of patients in GSE18842. From Fig. 7 and Supplementary Figure S7, we found that N + and N − accumulation curve profile is pretty similar to S + and S − accumulation curve profile in all cohorts though S-score accumulation scale is much larger than N-score accumulation scale.
Gene set enrichment analysis (GSEA). To furthermore demonstrate impact of the defined 26 TS genes on development of NSCLC cancer, we performed GSEA 30 to analyze enrichment of the 26 TS genes as a gene set on gene expression data GSE19804 at tumor development stages 1, 2 and 3, and all stages, respectively. We separately performed permutations among samples and among genes to calculate p-value for enrichment analysis. The results obtained from 1000 permutations among samples and among genes are shown in Fig. 8 and Supplementary Figure S8, respectively. All the GSEA results show very similar enrichment profiles at tumor development stages 1-3 and all stages. All the defined TS genes were enriched on positive correlation (down expression of TS genes) and the maximum enrichment score was 0.8 with p-value < 0.002. The results indicate that the 26 TS genes are of strong information for biomarkers to predict or diagnose NSCLC.   Table S2). These TS genes globally influence on expression of genes in cancer, which is supported by gene function annotation and protein structure knowledge. SASH1 and STARD13 contain SAM (sterile alpha motif) domain 31,32 while the weak TS genes do not have this domain. SAM is an interaction module presented in a wide variety of various proteins and involved in many biological processes. It has homo-and hetero-oligomerise forming multiple self-association architectures and binding to various SAM-domain and non-SAM domain proteins 33 , DNA, and RNA 34 . Proteins with one or more SAM domains broadly regulate RNA transcription and protein translation. SASH1 possessing of two SAM domains helps us to understand why SASH1 was so strong TS gene in all collected data including GSE21933 ( Supplementary Fig. S6 and Supplementary Table S2). The effects of strong TS genes SASH1 and STARD13 suggest that if a TS gene has functions in transcription, it could play a global role in expression regulation. This is supported by also another strong TS gene CBFA2T3 (Supplementary Table S2) that encodes a member of the myeloid translocation gene family which interacts with DNA-bound transcription factors and recruits a range of corepressors to facilitate transcriptional repression. These strong TS genes were down-regulated either due to histone methylation (HM) or due to loss of heterozygosity (LOH) 35,36 . Our results show that the strong TS genes SASH1, STARD13, and CBFA2T3 were negatively correlated with up-regulated histone methylation genes SMYD3 and SMYD5 at stage 2 ( Supplementary Fig. S9). Furthermore, SASH1 and STARD13 were negatively correlated with all these four up-regulated histone methylation genes detected at stage 3 ( Supplementary  Fig. S10), and CBFA2T3 was connected with SMYD3, SMYD5 and SUV39H2. Hence, the strong TS genes SASH1, STARD13, RECK, CBFA2T3, RASSF2, RAP1A, and ARHGEF12 can be used as specific biomarkers for diagnosis of NSCLC cancer. Expression correlations of EXT1 with DE genes provide further evidence for this assumption. EXT1 (Exostosin glycosyltransferase 1) encodes a protein that is an endoplasmic reticulum-resident type II transmembrane glycosyltransferase involved in the chain elongation step of heparan sulfate biosynthesis and hence it does not globally regulate expression of the other functional genes. Our results show that it had small S-scores and N-scores (Supplementary Table S2) in all these collected data and hence EXT1 is a weak TS gene. Interestingly, EXT1 was not connected with histone methylation (HM) genes except for WHSC1 at stage Figure 7. S + and S − and N + and N − accumulation profiles. Accumulations of S-scores and N-scores are calculated by X a 1 = N a 1 , X a 2 = N a 1 + N a 2 , …, X a n = N a 1 + N a 2 + · · · + N a n . Y a 1 = S a 1 , Y a 2 = S a 1 + S a 2 , …, Y a n = S a 1 + S a 2 + · · · + S a n where a = " + " or "-", N a 1 ≤ N a 2 ≤ · · · ≤ N a n and S a 1 ≤ S a 2 ≤ · · · ≤ S a n are ranked N-score and S-score lists from the smallest to the largest. Here, 1, 2, …, n in X or Y are code for n TS genes. For example, from Fig. 4, S + 1 = S + PTCH1 and S (+) 26 = S + SASH1 at stage 2. X a 1 is N-score for the first TS gene, X a 2 is sum of N-scores for the first and second TS genes, and so on. Plot these score accumulations along numbers of TS genes to form an accumulation curve. Stages 2 and 3 are tumor stages 2 and 3. The data for calculating S-scores and N-scores are from stage2 and 3, respectively. Difference between X + and X − or between Y + and Y − may be related to development of tumor stage.  Table S2). Biological study shows that PTCH1 is a protein receptor for ligand Sonic Hedgehog. PTCH1 and Sonic Hedgeog matching triggers signals that prevents cells from growing and dividing (proliferating) 37 . So, unlike SASH1, PTCH1 does not impact on expression of multiple other functional genes in mechanics. Likewise, PTCH1 also did not have association with histone methylation genes SMYD3 and SMYD5 at stage 2( Supplementary Fig.S9). These implicate that S-scores and N-scores can allow one to find another type of biomarkers such as EXT1, PTCH1, KLK10, APC, TRIM13 that have strong information sensitive to early cancer. All results show that stronger TS genes were down-expressed in cancer patients in all collected cohorts and at all tumor stages and had larger S + and N + . This property lets stronger TS genes be used as ideal biomarkers for precision diagnosis and prognosis of cancer patients. www.nature.com/scientificreports/ of which 7 patients were diagnosed to be at stage 1, 3 at stage 2, 6 at stage 3 and 5 at stage 4 and 11 patients were adenocarcinoma lung cancer and 10 were squamous lung cancer. GSE21933 has 30,968 human genome probes but only 17,819 probes have gene annotation. The RNA transcriptomic sequences analysis was conducted in cancer samples and matching adjacent normal samples 29 . Since our study used published cohort data available in the public domain, we did not necessarily seek specific ethical reviews and/or consents from the patients of the original studies.

Methods
Quality control of the microarray data. We used two-way scatter plot to visualize data of the two replicate tumor samples and used Pearson correlation coefficient to evaluate the quality of the data.

Differential expression (DE) analysis.
At stage 4 only one patient was recruited and hence the microarray data at stage 4 were removed out from our statistical differential expression analyses. Since in Taiwan female cohort some substages have too few patients, our differential analysis was done at stage level, not at substage level. We here employed RAM 25 and Bejamini Hochberg multiple test procedure 38 to compare differences in expressions of genes between the tumor and normal tissue sample at stages 1-3, respectively, and at all these three stages. All differentially expressed genes were identified at FDR < 0.01.
Classification of tumor stage samples. The principal component analysis (PCA), which reduces higherdimensional data into three-dimensional components, was used to classify lung cancer stages by using data of 26 TS genes and heatmap was used to visualize differential expression of DE genes between two conditions. The PCA and heatmaps were conducted by using Genesis 1.7.7.
Heatmaps. In cohort GSE18842, the expression data of the 26 TS genes were retrieved from microarray using probe id. In cohort GSE40419, since the dataset does not have gene probe ids, we used gene names (gene office symbols) to retrieve expression data of 55 RNA isoforms of these 25 TS genes from RPKM dataset. These expression data of TS genes or isoforms were transformed into z-score data. Heatmaps for these z-score data were made by using heatmap.2 in R-environment.

S-scores.
If a gene plays an important or central role in tumor development, then the gene would be correlated with many other functional genes in expression. For example, transcriptional regulation factor or posttranscriptional regulation factor regulates expression of a lot of other genes and hence it correlates with these genes in expression. A TS gene that is negatively correlated with a set of genes in expression may have a negative role in expression regulation of these genes while a TS gene that is positively correlated with a set of genes in expression has positive effect on expression of these genes. Based on this principle, we here proposed S-scores to measure role of a TS gene in tumor development. Briefly, S-scores are represented by S + ,S − and S . S + is defined as proportion of sum of correlation coefficients larger than or equal to a given threshold T under Bonferroni adjusted α = 0.05 to total of sums of correlation coefficients > 0: where m are numbers of DE genes detected in differential expression test. Similarly, S − is defined as S-score is S-score depends on algebra sum of positive and negative correlation coefficients and hence its domain is (−∞, 0, +∞) . S > 0 shows that the TS gene has larger positive effect on down-regulation of genes (positive correlation expression or both a TS gene and DE genes are down-expressed in cancer) than negative effect on up-regulation of genes (negative correlation expression or a TS gene is down-expressed but the DE genes are up-expressed in cancer), inversely, S < 0 indicates that the negative effect on up-regulation of gene expression is larger than the positive effect and S = 0 implicates that a TS gene has no role in gene expression regulation or balances up-expression and down-expression of the other genes in cancer. T = |t| √ t 2+(n−2) is a threshold value where t = qt(p,df) where qt is a function converting p to t-value with df where p= α m (Bonferroni adjusted cutoff probability) and df is degree of freedom (df = (n-2)). In cohort GSE19804, T was calculated to be 0.58 at stage 1, 0.765 at stage 2, and 0.75 at stage 3. In cohort GSE40419, T = 0.538 for nonsmoking status, and 0.45 for smoking status. In cohort GSE18842, T = 0.474.

N-scores.
We here define network score or N-score as ratio of a number of nodes shared with a TS gene and all other TS genes in a network constructed at a given significance level to that in a network constructed at insignificance level for a TS gene. Let R be correlation matrix n × m with element r ij , i = 1, 2, …, n and j = 1, 2 ,…, m www.nature.com/scientificreports/ where n is number of TS genes defined and m is number of DE genes identified in differential analysis. Let G be a gene network matrix n × m with element g ij where g ij =1 if r ij > T or g ij = 0 otherwise. Thus, G is a bivariable matrix constructed with "0" and "1" where "1" presents a node connecting a TS gene to a DE gene and "0" presents that there is no node between ST and DE genes, that is, a TS gene is not connected to a DE gene at T significance level. In order to know how many nodes a TS gene shares with another TS gene, we need to compare them across all DE genes. Let A i be a vector of TS gene i from G and B k be another vector of TS gene k from G where i = k . Let C + ik = A i B k , that is, c + ikj = a ij b kj = 1 if a ij = 1 and b kj = 1 or c + ikj = 0 otherwise where j = 1, …, m. Therefore, C + ik is a node vector shared with TS genes i and k. Repeat this procession from k = 1,…, k = i ,… k = n and take maximum c + ikj over all m DE genes to form a shared node vector for TS gene i:C + i = m j=1 max m k� =i c + ikj . In a similar way, we get C +0 i = m j=1 max m k� =i c +0 ikj under insignificance level where g ij =1 if r ij > 0 or g ij =0 if r ij ≤ 0 . The positive N-score is defined as in which g ij =1 if r ij < -T or g ij = 0 otherwise under significance level of α for defining C − i and g ij =1 if r ij < 0 or g ij = 0 if r ij ≥0 for defining C −0 i .

Gene sets enrichment analysis (GSEA).
To explore if the 26 TS genes used as a gene set co-work or coact in biological functions or pathways, we performed GSEA analysis 30 of the 26 TS genes using microarray from Taiwan female cohort (GES19804) as gene expression dataset associated with the fold changes corresponding to their differential expression between normal and tumors samples. In GSEA analysis, we setup these 26 TS genes selected from GO analysis of differentially expressed genes as a gene set data, used normal and tumor samples to make phenotype dataset. We selected Human_AFFY_HG_U133_ MSigDB.v7.1.chip as annotation. GSEA analysis was respectively performed on microarray datasets at stages 1, 2, 3 and all stages. Phenotype and gene permutations were respectively performed each for 1000 times for calculating p-value.