Recurrence-associated gene signature in patients with stage I non-small-cell lung cancer

Recurrent gene mutations and fusions in cancer patients are likely to be associated with cancer progression or recurrence (Vogelstein et al., Science 340, 1546–1558 (2013)). In this study, we investigated gene mutations and fusions that occurred recurrently in patients with stage I non-small-cell lung cancer (NSCLC). Targeted exome sequencing was performed to profile the variants, and their fidelity was confirmed at the gene and pathway levels through comparison with data for stage I lung cancer patients obtained from The Cancer Genome Atlas (TCGA). Next, we identified prognostic gene mutations (ATR, ERBB3, KDR, and MUC6), fusions (GOPC-ROS1 and NTRK1-SH2D2A), and the VEGF signaling pathway as associated with cancer recurrence. To infer the functional implication of the recurrent variants in early-stage cancers, we investigated their selection pattern and found that they were under positive selection, implying a selective advantage for cancer progression. Specifically, high selection scores were observed in the variants with significantly high risks for recurrence. Taken together, the results of this study identify recurrent gene mutations and fusions in a stage I NSCLC cohort and demonstrate that they are under positive selection, with implications for cancer recurrence.


Landscape of gene mutations and fusions in stage I NSCLC patients.
We performed targeted exome-seq covering 323 cancer genes for 141 stage I NSCLC patients to profile the landscape of gene mutations and fusions. Detection of gene fusions was highly dependent on the processing of mapped reads. Thus, a variety of processing parameters were tested through comparison with the fusion genes reported previously in the TCGA pancancer cohort 8. The occurrence of gene fusion events in our cohort was similar to that in the TCGA cohort after adjusting for sample size (Fig. 1A, top, r = 0.923; Pearson's correlation coefficient). At most two fusion genes were observed in a single patient, and most patients (115/141, 82%) had no fusion gene events (Fig. 1B). We also profiled somatic mutations and their distribution by patient and by gene. After normalization by total sample size, as in the fusion gene analysis, the distribution pattern of mutations was similar to that of the TCGA cohort (Fig. 1A, bottom, r = 0.949; Pearson's correlation coefficient). Most patients (133/141, 94%) had more than 2 mutations, and the median mutation frequency was 4 (Fig. 1C, left).
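The cohort comparison above can be illustrated with a minimal sketch: per-category event counts are divided by cohort size and the two resulting profiles are compared with Pearson's correlation coefficient. The function names and all counts below are hypothetical and are not taken from the study data.

```python
import math

def normalized_frequency(counts, n_samples):
    """Convert raw event counts into per-patient frequencies."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return [c / n_samples for c in counts]

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two profiles of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical counts of patients carrying 0, 1, or 2 events in two cohorts.
cohort_a = normalized_frequency([80, 15, 5], n_samples=100)
cohort_b = normalized_frequency([330, 50, 20], n_samples=400)
r = pearson_r(cohort_a, cohort_b)
```

Pearson's r is unchanged by rescaling either profile, so in this sketch the normalization mainly puts the two cohorts on a comparable scale for plotting.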

Recurrent mutations and fusion genes of stage I NSCLC patients and comparison with the TCGA cohort.
Mutations and fusion genes occurring in more than one patient were identified in our stage I NSCLC cohort. We discovered 32 mutations and 6 fusion genes whose frequency was greater than two. There were 8 genes that were mutated in over 10% of our cohort (Fig. 2A). EGFR showed a higher frequency of truncating mutations than other genes, which predominantly exhibited missense mutations. ROS1 had both missense mutations and gene fusions (Fig. 2B). Three of these genes (KDR, EGFR, and TP53) were known to be significantly mutated or associated with key pathways in lung cancer 9. We further compared the recurrently mutated genes in our cohort with those in the TCGA early-stage (stage I) and late-stage (stage IV) lung cancer cohorts. Notably, the recurrently mutated genes of the TCGA stage I cohort matched our cohort more closely than did those of the TCGA stage IV cohort (Fig. 2A, top). We used a permutation test to assess how significantly the genes of each stage matched our identified genes. The recurrently mutated genes of our cohort matched those of the TCGA stage I cohort more significantly than those of the TCGA stage IV cohort (Supplementary Fig. 1). We obtained 6 recurrent fusion events, which did not share any genes with one another; therefore, the events comprised 12 genes. The ALK-EML4 fusion showed the second highest frequency of 2.8% (4/141) and had been reported in a previous study 8 of a TCGA stage I lung cancer cohort (LUAD; frequency 1.0%). In particular, the recurrent fusion genes of TCGA stage IV lung cancer patients did not overlap with the fusion genes of our cohort (Fig. 2A, bottom). The remaining fusion events, other than ALK-EML4 and KIF5B-RET, had not been reported previously, although several of the individual genes involved are known to play a role in lung cancer development.
We further investigated how the recurrent mutations and fusion genes were enriched in biological pathways (Fig. 2C). The pathway terms for the retrieved genes were enriched for cancer-related biological functions. These pathway enrichment results were compared with the pathway enrichment findings for TCGA early- and late-stage lung cancer cohorts. As a result, the Rap1 signaling pathway and the PI3K-Akt signaling pathway, which were enriched in our cohort, were also specifically enriched in the TCGA stage I cohort (Fig. 2C). In contrast, the pathways of the TCGA stage IV cohort were not shown to be concordant with the pathways of our cohort.
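The enrichment method is not specified in this section. One common choice for this kind of analysis, shown here purely as an illustrative assumption, is a one-sided hypergeometric test that asks how likely it is to see at least the observed overlap between a gene list and a pathway by chance. The function name and all numbers are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_pathway, n_selected, n_overlap):
    """One-sided p-value P(X >= n_overlap) for pathway over-representation.

    n_universe: genes that could have been selected (e.g. the panel genes)
    n_pathway:  genes in the universe annotated to the pathway
    n_selected: genes in the list being tested
    n_overlap:  selected genes that belong to the pathway
    """
    if not (0 <= n_pathway <= n_universe and 0 <= n_selected <= n_universe):
        raise ValueError("pathway and selection sizes must fit in the universe")
    max_overlap = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical example: 300-gene universe, 20 pathway genes,
# 30 recurrently altered genes, 6 of which fall in the pathway.
p_value = hypergeom_enrichment_p(300, 20, 30, 6)
```

In practice the p-values across all tested pathways would also need multiple-testing correction, which this sketch omits.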
Gene mutations and fusions correlated with recurrence in stage I NSCLC patients.
We investigated whether there were signatures of gene mutations or fusions associated with recurrence. Genomic alteration in the early stage could be crucial in driving cancer recurrence because many variants in the later stage have a strong possibility of being passenger mutations due to cancer genome instability. The 32 somatic mutations and 6 fusion genes that recurrently occurred in our stage I NSCLC cohort were selected to analyze recurrence-free survival (RFS). Twenty-two somatic mutations and 4 fusion genes were found to increase the risk of recurrence in patients carrying them, but only 4 of the mutated genes and 2 of the fusion genes had a significant correlation with RFS (Fig. 3A-C). The mutation status of ATR, ERBB3, KDR, and MUC6 was significantly associated with patient recurrence, although their frequencies were not notably high; specifically, they were 9 (6.3%) for ATR, 4 (2.8%) for ERBB3, 14 (10%) for KDR, and 2 (1.4%) for MUC6. The second and fourth most frequent fusion genes, GOPC-ROS1 and NTRK1-SH2D2A, showed a significantly high risk of recurrence in the patient group with those fusion events. Many notable mutations occurred at a low frequency in our cohort. Therefore, we performed RFS analysis for those mutations with an extended gene set associated with a relevant pathway. The gene set enriched for the VEGF signaling pathway in our mutation profiling was tested, and its mutation status showed a significant correlation with cancer recurrence (Fig. 3D). KDR, SH2D2A, SRC, and KRAS were included in the VEGF signaling pathway, and their variants were associated with cancer recurrence when assessed together. These 4 genes strongly interacted with each other in physical interactions, cell signaling pathways, and genetic interactions (Supplementary Fig. 2A and B).
To examine whether the 4 genes of the VEGF pathway independently contribute to cancer recurrence, the mutual exclusivity of variant occurrence among those 4 genes was profiled. Only one case of coexisting KDR and KRAS mutations was observed in the same patient, and no other cases of co-occurrence were observed (Supplementary Fig. 2C). We further assessed complementary recurrence patterns of the 4 interacting genes, as described in the previous method 10, for each pair of genes. We calculated variant complementarity, which is the frequency of variants for the gene set of interacting pairs. As shown in Supplementary Fig. 2D, the results showed higher
Figure 3 (partial caption). Kaplan-Meier curves of the RFS tests for 2 fusion genes showing a significant hazard ratio. All fusion events, including the above two fusion genes, were also assessed by RFS tests. (D) RFS tests for the gene set of the VEGF signaling pathway. We combined the status of the variants when they were enriched together in a specific cancer-related pathway and their frequency was more than two. Only the VEGF signaling pathway showed statistical significance in the RFS test among the cancer-related pathways. KDR, KRAS, SRC, and SH2D2A were included in the VEGF signaling pathway with variants in our cohort.

Selective functional implications of gene mutations and fusions associated with cancer recurrence.
To infer the functional implication of the recurrent variants identified from our cohort, we examined whether the recurrent gene mutations and fusions had a pattern of positive selection. Recently, selection patterns across a large number of cancer patients have been identified 11,12 by calculating the ratio of nonsynonymous to synonymous mutations at the gene level. When a gene is under positive selection, a significantly higher frequency of nonsynonymous mutations is observed than expected from the background rate of synonymous mutations. Thus, this concept can be used to discover candidate genes carrying cancer driver mutations in addition to passenger mutations. We obtained the selection scores that were previously calculated for each gene by Bayesian inference 11 and a statistical model for covariates (dNdScv) 12 based on the mutation patterns observed in the TCGA data. Using these scores, we examined the degree of positive selection on the genes that were recurrently mutated or fused. As a result, both Bayesian inference 11 and dNdScv 12 indicated positive selection on recurrent gene mutations and fusions, in contrast to non-recurrent gene mutations and fusions in our cohort (Fig. 4A, left). These signatures were validated by a simulation in which the same number of recurrently mutated genes was tested with permutation (Supplementary Fig. 3). We also calculated the selection score for the recurrently mutated genes in the TCGA early-stage (stage I) and late-stage (stage IV) lung cancer cohorts. The selection score difference between recurrently and non-recurrently mutated genes was consistent with the difference observed in our cohort, as recurrently mutated genes were under more positive selection than non-recurrently mutated genes (Fig. 4A and Supplementary Fig. 4A, middle and right).
Furthermore, the higher selection scores of recurrently mutated genes were more significant for stage I than for stage IV, regardless of the measurement method used in the permutation test (Supplementary Fig. 3). Taken together, these results indicated that mutations and fusions in the early tumor stage tended to be more subject to positive selection and therefore to have more functional implications in cancer progression. In our study of tumor recurrence, a significant correlation with RFS was observed for gene mutations (ATR, ERBB3, KDR, and MUC6) and fusions (GOPC-ROS1 and NTRK1-SH2D2A). Interestingly, the significant association with RFS for two gene mutations (ERBB3 and KDR) was observed only in our early-stage cohort and not in the TCGA stage IV cohort (Supplementary Fig. 5). While the role of ERBB3 mutation in cancer progression and therapeutic resistance has been previously studied 13,14, KRAS 15,16, KDR 17, and ATR 18,19 mutations and the GOPC-ROS1 20,21 fusion gene have also been widely studied to determine their functions in cancer biology. Therefore, we investigated whether the above genes were under positive selection in lung cancer patients (Fig. 4B and Supplementary Fig. 4B). As expected, most genes showed a selection score over 1, indicating positive selection. Specifically, the Q61H mutation in KRAS, which occurred 3 times in our cohort, was reported for its prognostic and predictive value in advanced NSCLC 22 and its role as a predictor of resistance to treatment with tyrosine kinase inhibitors in advanced NSCLC 23 (Fig. 4C). The G12D/C mutation, which occurred 8 times in our cohort, is also well known for its oncogenic role in proliferation and widespread neoplastic and developmental defects in lung cancer 24 (Fig. 4C, top). In addition, the Q472H mutation in KDR, which occurred 11 times in our cohort, has been reported to occur in tyrosine kinase inhibitor-resistant NSCLCs.
These previous reports on variants found in our cohort support their oncogenic relevance to lung cancer recurrence.
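The idea behind the selection scores can be illustrated with a deliberately simplified dN/dS calculation: the nonsynonymous mutation rate per nonsynonymous site divided by the synonymous rate per synonymous site. This is only a sketch of the concept and is not the Bayesian inference 11 or dNdScv 12 model, both of which account for background mutation rates and covariates. The function name and counts are hypothetical.

```python
def dnds_ratio(n_nonsyn, n_syn, nonsyn_sites, syn_sites):
    """Crude gene-level dN/dS ratio.

    n_nonsyn, n_syn:         observed nonsynonymous / synonymous mutation counts
    nonsyn_sites, syn_sites: numbers of sites where each type could occur
    A ratio above 1 suggests an excess of nonsynonymous mutations
    (positive selection); a ratio near 1 suggests neutrality.
    """
    if n_syn <= 0:
        raise ValueError("at least one synonymous mutation is required")
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    dn = n_nonsyn / nonsyn_sites  # nonsynonymous rate per site
    ds = n_syn / syn_sites        # synonymous rate per site
    return dn / ds

# Hypothetical gene: 60 nonsynonymous and 10 synonymous mutations
# over 300 nonsynonymous and 100 synonymous sites.
score = dnds_ratio(60, 10, 300, 100)
is_positively_selected = score > 1
```

The threshold of 1 mirrors the interpretation used above, where a selection score over 1 is read as positive selection.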

Discussion
We conducted one of the largest studies of selection and functional implication restricted to stage I NSCLC patients in an Asian population. Gene mutations and fusions were analyzed by targeted exome-seq (panel-seq), which covered 323 curated genes. Despite the limited resolution of targeted sequencing, this method successfully identified the genes reported in the previous study of a large lung cancer cohort (TCGA); these genes were KDR, BRCA1, BRCA2, EGFR, ERBB2, and TP53 for mutations and ALK-EML4 and KIF5B-RET for gene fusions. These gene mutations and fusions were compared to the TCGA lung cancer cohort according to stage. We confirmed that a set of recurrent gene mutations and fusions in our cohort was reproduced in the TCGA stage I cohort, rather than the stage IV cohort, at the gene and pathway levels, thereby supporting the reliability of our analysis. Furthermore, we investigated whether the recurrently mutated genes are under positive selection as an indicator of cancer progression. The genes that were recurrently mutated in our cohort were under more positive selection than non-recurrently mutated genes, and the difference was more significant in early-stage lung cancer patients in the TCGA cohort. This result indicates that genetic variants identified in early-stage cancer by targeted sequencing can reveal candidate drivers of cancer progression. Our primary aim was to unravel the signature related to cancer recurrence in stage I lung cancer patients. Four gene mutations (ATR, ERBB3, KDR and MUC6) and two gene fusions (GOPC-ROS1 and NTRK1-SH2D2A) were obtained as markers that had a significant correlation with RFS. Little is known regarding the implications of the identified variants in lung cancer recurrence, except for ERBB3.
The feature of positive selection for the variants may also be a "proxy signature" of cancer recurrence because gene mutations and fusions would likely be under positive selection if they brought gain-of-function in cancer recurrence. The identified markers substantially exhibited the signature of positive selection, as expected. In the comparison between early- and late-stage cancer patients in the TCGA cohort, the genes detected in the early stage exhibited higher selection scores. These results imply that genetic variants in the early cancer stage have higher probabilities of cancer recurrence. Clearly, all these findings need to be confirmed by data from larger cohorts. We further attempted to identify the signature associated with cancer recurrence at the system level to overcome limited detection power due to the small sample size. The combinatorial mutation status of the genes in the VEGF signaling pathway showed a significant correlation with cancer recurrence in RFS analysis. In conclusion, recurrent gene mutations and fusions in stage I NSCLC patients have functional potential in cancer progression and recurrence, which we were able to identify with high fidelity through targeted sequencing.

Materials and methods
Stage I NSCLC patients.
A total of 141 NSCLC patients with surgically resected primary lung cancer were prospectively enrolled from Asan Medical Center between 2009 and December 2018. All patients provided prior written informed consent, and this study was conducted with the approval of the Institutional Review Board of Asan Medical Center. All research was performed in accordance with relevant guidelines and regulations, and informed consent was obtained from all participants and/or their legal guardians. The clinicopathological data and survival outcomes were collected prospectively. Surgical tumor tissue sections (at least 4 µm thick) were collected for next-generation sequencing analysis.

Detection of mutations and fusion genes.
Targeted exome-seq was performed with OncoPanel AMC version 4, which was designed by Asan Medical Center through SureDesign. The OncoPanel was composed of 225 genes for entire exons, 6 genes for rearrangements, and 99 hotspots. The sequencing reads were aligned to the hg19 reference genome by using Burrows-Wheeler Aligner (BWA) (http://bio-bwa.sourceforge.net) 25. The Genome Analysis Toolkit (GATK) (https://software.broadinstitute.org/gatk) 26 was applied for base quality score recalibration, indel realignment, and duplicate removal. The BAM files produced after these processes were subjected to MuTect (https://software.broadinstitute.org/cancer/cga/mutect) 27 for the calling of single nucleotide variants (SNVs) and small insertions and deletions (indels). ANNOVAR 28 was then used for the annotation of the called SNVs and indels. Data processing and analysis were performed with default parameters. To estimate fusion genes, a fusion event detection tool specific for whole-exome and short-read sequencing platforms, LUMPY 29, was used. Gene pairs within 1000 bp of each other on the same chromosome were skipped in the analysis because their proximity can cause them to be falsely called as a fusion gene.
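The proximity filter described above can be sketched as follows. This is an illustrative assumption about how such a filter might be written, not the pipeline actually used: the `Gene` type, the function names, the coordinates, and the choice to treat "within 1000 bp" as a gap of at most 1000 bp are all hypothetical.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int

def gap_between(a: Gene, b: Gene) -> int:
    """Distance in bp between two genes on one chromosome (0 if they overlap)."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))

def is_too_close(a: Gene, b: Gene, max_gap: int = 1000) -> bool:
    """True when both genes lie on the same chromosome within max_gap bp."""
    return a.chrom == b.chrom and gap_between(a, b) <= max_gap

def filter_fusion_candidates(pairs, max_gap: int = 1000):
    """Keep only candidate gene pairs that pass the proximity filter."""
    return [(a, b) for a, b in pairs if not is_too_close(a, b, max_gap)]

# Hypothetical candidates with made-up coordinates.
g1 = Gene("GENE_A", "chr1", 100, 200)
g2 = Gene("GENE_B", "chr1", 700, 900)    # 500 bp from GENE_A: skipped
g3 = Gene("GENE_C", "chr1", 5000, 6000)  # 4800 bp from GENE_A: kept
g4 = Gene("GENE_D", "chr2", 150, 250)    # different chromosome: kept
kept = filter_fusion_candidates([(g1, g2), (g1, g3), (g1, g4)])
```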
TCGA data processing. We
Selection score analysis.
Two types of selection scores were used to investigate functional implications for our identified genes that were mutated or fused. Both scoring methods calculate selection patterns at the gene level based on the ratio of nonsynonymous to synonymous mutations across a large number of cancer patients 11,12. However, the methods differ in how they compute the statistical significance of the difference between nonsynonymous and synonymous mutations: one adopts Bayesian inference 11 and the other a statistical model for covariates 12. A higher selection score from either method indicates stronger positive selection because a gene under positive selection will carry an extra complement of driver mutations in addition to neutral (passenger) mutations. Additionally, the statistical significance of the landscape of positive selection was tested nonparametrically by randomly selecting the same number of genes as the gene mutations (fusions) from the human genome and then comparing the mean and median of selection scores between the original list and the permuted list. This permutation was repeated 1,000 times to generate a null distribution.
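The permutation procedure described above can be sketched as follows, using the mean as the test statistic. The function name, the fixed random seed, and the add-one correction to the p-value are assumptions made for illustration; the scores are made up.

```python
import random
from statistics import mean

def permutation_p_value(observed_scores, background_scores, n_perm=1000, seed=0):
    """Empirical one-sided p-value for the mean selection score of a gene list.

    Draws n_perm random gene sets of the same size from the background and
    counts how often their mean score reaches the observed mean.
    """
    k = len(observed_scores)
    if k == 0 or k > len(background_scores):
        raise ValueError("gene list must be non-empty and fit in the background")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed_mean = mean(observed_scores)
    null_means = [mean(rng.sample(background_scores, k)) for _ in range(n_perm)]
    n_extreme = sum(1 for m in null_means if m >= observed_mean)
    # Add-one correction (an assumption here) avoids reporting p = 0.
    return (n_extreme + 1) / (n_perm + 1)

# Hypothetical scores: a background of 200 genes and 5 recurrent genes.
background = [0.5 + 0.005 * i for i in range(200)]  # values from 0.5 to 1.495
recurrent = [3.0, 2.5, 4.0, 3.5, 2.8]
p_value = permutation_p_value(recurrent, background)
```

The same function can be run with the median in place of the mean to match the second statistic mentioned above.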
Recurrence-free survival analysis.
We utilized gene mutations and fusions as predictors to perform patient recurrence-free survival analysis. For our clinical data, the cases in which the genes were mutated or fused were separated from the controls and subjected to survival analysis. Patients who died for reasons other than cancer were excluded from the analysis. The log-rank test (Mantel-Cox test) was used to determine the significance of differences between two groups.
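For the recurrence-free survival comparison, a two-group log-rank (Mantel-Cox) test can be sketched in plain Python as below. This is a minimal illustration rather than the analysis code of the study: the function name and the follow-up times are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    times_*:  follow-up time for each patient
    events_*: True/1 if recurrence was observed, False/0 if censored
    """
    data = [(t, bool(e), 0) for t, e in zip(times_a, events_a)]
    data += [(t, bool(e), 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for ti, _, g in data if ti >= t and g == 0)
        at_risk_b = sum(1 for ti, _, g in data if ti >= t and g == 1)
        events_at_t_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        events_at_t = sum(1 for ti, e, _ in data if ti == t and e)
        n = at_risk_a + at_risk_b
        observed_minus_expected += events_at_t_a - events_at_t * at_risk_a / n
        if n > 1:  # hypergeometric variance of the group-A event count
            variance += (events_at_t * (at_risk_a / n) * (at_risk_b / n)
                         * (n - events_at_t) / (n - 1))
    if variance == 0:
        raise ValueError("no informative event times")
    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom
    return chi2, p_value

# Hypothetical follow-up times (months): variant carriers recur earlier.
carriers = [1, 2, 3, 4, 5]
controls = [10, 11, 12, 13, 14]
chi2, p = logrank_test(carriers, [1] * 5, controls, [1] * 5)
```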