Validating a targeted next-generation sequencing assay and profiling somatic variants in Chinese non-small cell lung cancer patients

Non-small cell lung cancer (NSCLC) is featured with complex genomic alterations. Molecular profiling of large cohort of NSCLC patients is thus a prerequisite for precision medicine. We first validated the detection performance of a next-generation sequencing (NGS) cancer hotspot panel, OncoAim, on formalin-fixed paraffin-embedded (FFPE) samples. We then utilized OncoAim to delineate the genomic aberrations in Chinese NSCLC patients. Overall detection performance was powerful for mutations with allele frequency (MAF) ≥ 5% at >500 × coverage depth, with >99% sensitivity, high specificity (positive predictive value > 99%), 94% accuracy and 96% repeatability. Profiling 422 NSCLC FFPE samples revealed that patient characteristics, including gender, age, lymphatic spread, histologic grade and histologic subtype were significantly associated with the mutation incidence of EGFR and TP53. Moreover, RTK signaling pathway activation was enriched in adenocarcinoma, while PI(3)K pathway activation, oxidative stress pathway activation, and TP53 pathway inhibition were more prevalent in squamous cell carcinoma. Additionally, novel co-existence (e.g., variants in BRAF and PTEN) and mutual-exclusiveness (e.g., alterations in EGFR and NFE2L2) were found. Finally, we revealed distinct mutation spectrum in TP53, as well as a previously undervalued PTEN aberration. Our findings could aid in improving diagnosis, prognosis and personalized therapeutic decisions of Chinese NSCLC patients.

The effects of sequencing coverage depth and MAF on INDELs detection performance were assessed by subsampling data (Table S5a,c). The INDELs detection sensitivity increased with increasing coverage depth at MAF ≥ 5%, and high sensitivity was observed up to 700× median coverage depth. At 700×, >99% of INDELs at MAF ≥ 5% (15/15) were successfully detected. We observed a fluctuating trend of detection sensitivity with varying coverage depth at low MAF (<5%) (Fig. 1B). Moreover, PPV achieved was 85-100% across the full coverage depth (Table S6). The correlation between measured and expected MAF for INDELs was high (Fig. 1D).

Concordance between OncoAim NGS test and orthogonal ARMS-PCR test.
We tested the EGFR mutation status in 253 FFPE NSCLC specimens (a subset of 452 total cases) using both OncoAim and ARMS-PCR (Amplification refractory mutation system-Polymerase Chain Reaction) technique. In total, two methods detected 126 EGFR mutations with 94% concordance (119 aberrations were detected by both platforms) (Fig. 2). Six variants were detected by ARMS-PCR but not by NGS (Table S7), likely due to the low MAF of these variants and/or tumor heterogeneity. NGS, but not ARMS-PCR, detected one mutation that was confirmed as TP by Sanger sequencing (Table S7).

Reproducibility of NGS test on FFPE samples.
We assessed the precision (reproducibility and repeatability) of OncoAim test with 3 FFPE tumor samples that possess 5 known alterations (SNVs and INDELs including KRAS p.G12V, KRAS p.G12D, TP53 p.R248W, TP53 p.R333fs*12, EGFR p.L858R) in total. Among these 5 known alterations, mutations of KRAS and EGFR were detected previously by ARMS-PCR method, and TP53 mutations were detected previously by NGS (Illumina Miseq platform). The samples were analyzed 5 times in 3 different experiments to evaluate the inter-run and intra-run reproducibility. Concordance between replicates was 96% (Table S8), with no significant differences between inter-and intra-run replicates, demonstrating the robustness of the NGS test.  Concordance between OncoAim and ARMS-PCR approaches on FFPE specimens. Overlap of positive variant calls by NGS (OncoAim) and ARMS-PCR methods at tested sites in 253 FFPE clinical cancer specimens. Note: among the 119 common variants, 106 variants with MAF ≥ 5% was called by OncoAim when the bioinformatics analysis pipeline used 5% as cut-off for variant calling. The other 13 variants missed by OncoAim analysis pipeline (5% cut-off) had MAF < 5%, which were detected but filtered out. These 13 variants were also counted as detected by OncoAim here for better comparison of two methods. www.nature.com/scientificreports www.nature.com/scientificreports/ Mutational profiling of Chinese NSCLC patients. A total of 422 cases of the 452 FFPE samples was successfully sequenced for subsequent variation analysis (Fig. S1), and 30 cases were unsuccessful for several reasons, such as poor DNA quality, low library concentration, and low median read coverage and uniformity. We then utilized OncoAim to profile the mutational landscape of 422 Chinese NSCLC patients. The clinicopathological characteristics of patients were summarized in Table 1.
Missense mutation was the most prevalent form in almost all the top 10 mutated genes, with PTEN as the exception in which in-frame-insertion was the major type ( Fig. 4). In-frame-deletion and nonsense mutation ranked in 2 nd place in EGFR and TP53 mutations, respectively. All mutations in KRAS, PIK3CA, nuclear factor erythroid-derived 2-like 2 (NFE2L2) and Catenin, beta-1 (CTNNB1) were missense type (Fig. 4).
Mutational hotspots in the top 5 most frequently mutated genes. The mutational hotspots of the 5 most frequently mutated genes are summarized in Fig. 5.

Characteristics
Number of patients    www.nature.com/scientificreports www.nature.com/scientificreports/ EGFR is a transmembrane tyrosine kinase belonging to HER/erbB protein family. Its activation can activate downstream PI3K-AKT-mTOR and RAS-RAF-MEK-ERK signaling pathways 20 . The most observed mutations in EGFR were L858R missense mutation (exon 21) and an in-frame deletion from E746 to A751 (exon 19) (Fig. 5A), consistent with previous reports 21 . These mutations confer increased sensitivity to both first-generation and second-generation EGFR Tyrosine kinase inhibitors (TKIs) [21][22][23] . Notably, the T790M variant, which is often induced by TKIs treatment and thus a resistance marker to first-generation TKIs (erlotinib and gefitinib), was detected 8 times (approximately 4% of the 211 EGFR mutations) (Fig. 5A). T790M mutation has been suggested as the primary mutation in some NSCLC patients 24 .
TP53 is one of the best-known tumor suppressor genes. Since TP53 protein forms a tetramer for binding to cis-elements, some of its missense mutants may have dominant-negative effects on wild-type TP53 via oligomerization 25,26 . We noticed 73.83% of TP53 aberrations were the missense type. TP53 mutations were clustered in the DNA binding domain (amino acid 101-300), with R273 (7.09%), G266 (5.67%), R213 (4.96%), R175 www.nature.com/scientificreports www.nature.com/scientificreports/ (4.26%), and R280 (4.26%) as the top 5 most frequently mutated sites (Fig. 5B). Two of these 5 residues, R273 and R175, belong to the previously reported 6 mutational hotspots of TP53 (i.e., R175, G245, R248, R249, R273 and R282) 26 . Among these 6 reported residues, R248 and R282 also had high frequency in the present study, both at 3.55%. However, G245 and R249 showed up at low frequency, 1.42% and 0.71%, respectively, indicating they were not prevalent in Chinese NSCLC patients. R213* nonsense mutation was the most common nonsense mutant (Fig. 5B), consistent with a previous study 26 . However, the high mutation frequency of R213* in our study has not previously been reported. This result, together with the observed high mutation frequency of G266 and R280, might represent the unique features of TP53 mutations in Chinese NSCLC patients.
KRAS is a member of the small GTPase superfamily whose activating mutations can constitutively activate downstream RAF-mediated signaling pathway, leading to uncontrolled cell proliferation 27 . NSCLC patients carrying KRAS mutations develop primary resistance to EGFR-targeted drugs such as cetuximab and gefitinib 28 . G12 was the most dominant mutated residue in our study (Fig. 5C), consistent with other reports 1 . Lung cancer patients with KRAS G12C and G12V mutations have shorter PFS (progression-free survival) compared with those patients carrying other KRAS mutations or KRAS wild type 29 .
PIK3CA is one of the catalytic subunits of phosphatidyl 3-kinases (PI3K), modulating various cellular processes 30 . We identified E542, E545 (exon 10, the helical domain) and H1047 (exon 21, the kinase domain) as the most frequently mutated sites (Fig. 5D). These PIK3CA mutations are regarded as oncogenic variants and can be targeted for drug development 31 .
PTEN acts as a tumor suppressor, and its best-known role is to antagonize the PI3K signaling pathway through its lipid phosphatase activity 32 . A333*fs10 (COSM5346961) in the C2 domain was the most frequently observed PTEN aberration in our cohort (Fig. 5E). The C2 domain is crucial for anchoring the PTEN catalytic domain onto the membrane.

Association of patient clinicopathological characteristics with mutations. Associations of muta-
tions with the following clinicopathogical characteristics, gender, age, smoking etc were evaluated ( Table 2).
Mutations in lung cancer can be associated with gender 1 . We observed that female patients had significantly higher frequency of EGFR mutations (P = 1.41E-11, Fisher's exact test) but a significantly lower mutation frequency of TP53 (P = 8.912E-08) and PTEN (P = 3.2E-2) ( Table 2 and Fig. S2), consistent with previous reports 5, 33,34 . Interestingly, PTEN mutations were more prevalent in male patients.
Compared with the middle-aged group (45-65 years old), the young-aged group (18-45 years old) had a significantly lower EGFR mutation rate (P = 1.16E-03, adjusted by Bonferroni's correction) ( Table 2 and Fig. S3), which agrees with other studies 35,36 . Mutations in TP53 were less frequent in the young-aged group as opposed to both the middle-aged and older group (>65 years), with P values of 2.45E-06 and 2.59E-04, respectively (Table 2).
Smoking is a crucial risk factor for lung cancer 1,9 . The increased frequency of KRAS mutation in smokers (P = 1.7E-02) ( Table 2 and Fig. S4) is consistent with other reports 24,27 . We observed a positive correlation of PIK3CA aberrations with smoking status (P = 1.7E-02) ( Table 2), inconsistent with a previous study in Japanese lung cancer patients 37 . This discrepancy might result from ethnic variations.
Lymph node metastasis is a strong independent predictor of poor prognosis 38 . The lung tumors with lymphatic spreading were enriched with mutations in EGFR (P = 4E-03) and TP53 (P = 1.3E-02) ( Table 2 and Fig. S5), indicating that tumors with EGFR or TP53 mutations were more likely to develop metastatic tumors and might have poorer prognosis. These findings were consolidated by a previous report showing that EGFR expression level is higher in lung cancer patients with lymph node metastasis 39 .
Lung cancer, at later developing stages, tends to spread to other body parts forming metastatic lung tumors. Compared with primary lung tumors, metastatic lung tumors had higher mutational frequency of EGFR (P = 1.6E-02, Table 2 and Fig. S6). Cautious screening and monitoring of potential metastasis should be performed for patients with primary lung tumors that harbor EGFR driver mutations.
Histologic grade of lung cancer is an independent predictor of patient survival 40 . We observed that the well differentiated tumor has significantly lower mutation frequency of EGFR (P = 5.25E-06, adjusted by Bonferroni's correction) and TP53 (P = 3.54E-04, adjusted by Bonferroni's correction) than moderately differentiated tumors ( Table 2 and Fig. S7). The well-differentiated tumors also have lower EGFR and TP53 mutation frequencies compared with poorly differentiated tumors, but these differences did not reach a significant level (P = 3.99E-01 and 1.14E-01, adjusted by Bonferroni's correction, respectively). The low mutation frequency of EGFR and TP53 in well-differentiated tumors might represent a marker for better prognosis.
Adenocarcinoma (ACA) and squamous cell carcinoma (SCC) are the two major histological subtypes of NSCLC, and they are featured with distinct gene expression profiles with different clinical implications 41 . While EGFR mutations were more prevalent in ACA (P = 9.12E-07), the mutations of TP53, PIK3CA, and NFE2L2 were enriched in SCC, with P values of 1.56E-08, 1.8E-02, and 2E-03, respectively (Table 2 and Fig. S8). These results indicated that RTK (Receptor Tyrosine Kinase) signaling pathway activation was enriched in ACA, whereas PI(3) K pathway activation, oxidative stress pathway activation, and TP53 pathway inhibition were more prevalent in SCC. Although such correlations for each gene have been previously reported [42][43][44][45] , this study is the first to reveal the histological subtype association of these 4 genes in one lung cancer cohort study. The transcription factor NFE2L2 plays central roles in modulating the expression of genes involved in antioxidant and stress-response 46 , and its D29Y, D29H and R34Q variants (on exon 2) detected in this study might indicate poor prognosis of lung cancer 42,47 .
We didn't find differentially mutated genes in different tumor sites (left lung vs. right lung) ( Table 2 and www.nature.com/scientificreports www.nature.com/scientificreports/ (BRAF) (Fig. 6). The mutual exclusiveness between EGFR and KRAS or BRAF has been well-known 20,48,49 . This is the first evidence showing NFE2L2 mutations did not coexist with EGFR mutations, consolidating the oncogenic nature of NFE2L2 mutations 47 . Interestingly, TP53 mutations were mutually exclusive with EGFR mutations to a significant level (Fig. 6). This new observation was, to some extent, supported by the finding that RAS and TP53 show mutual exclusiveness in acute myeloid leukemia 50 , suggesting that inactivating TP53 alone may be sufficient for lung cancer cells to proliferate and circumvent apoptosis.
We observed that PTEN mutations tended to coincide with BRAF mutations (Fig. 6). The PTEN mutations result in activation of the PI(3)K pathway. Studies in lung cancer have not been reported on the co-occurrence of mutations in PTEN and BRAF, but the PIK3CA mutations have been associated with BRAF mutations 51 . In melanoma, however, loss-of-function PTEN mutations and BRAF activation mutations coexisted 52 Table 2. Association Analysis of Clinicopathological Characteristics with Mutations. Except for BRAF gene in smoking status, only statistically significant mutated genes were shown. For two nominal variables, twosided Fisher's exact test was performed. For analysis involving more than 2 groups (age and histologic grade), following two-sided Fisher's exact test (P value was provided in the table), pairwise comparison was done with P value adjusted with Bonferonni correction. The adjust P value for statistically significant pair was specified in the main text where appropriate. "/" for tumor site indicates no significantly differentially expressed genes. ACA, Adenocarcinoma; SCC, Squamous cell carcinoma. *P ≤ 0.05; **P ≤ 0.01; ***P ≤ 0.001. (2020) 10:2070 | https://doi.org/10.1038/s41598-020-58819-5 www.nature.com/scientificreports www.nature.com/scientificreports/ findings indicate that in some types of lung cancer, the PI(3)K pathway activation through loss of inhibition due to PTEN mutations can cooperate with BRAF-dependent RTK signaling pathway activation to promote cancer development.

Discussion
With deeper understanding of the underlying genetic mutations revolutionized by NGS technology, molecular testing has become an indispensable tool for lung cancer diagnosis 1,10 . In this study, we first performed an extensive performance evaluation of a NGS panel, OncoAim, on FFPE samples, then explored the genetic variants in Chinese NSCLC patients.
The turnaround time of the entire NGS process is about 2-3 days. We demonstrated that the targeted NGS had high sensitivity, specificity, accuracy and precision for both SNVs and INDELs. False-negative calls were predominantly low-frequency variants (MAF < 5%). Detection sensitivity rose up as sequencing coverage depth increased, up to 100% for MAF ≥ 5% variants at a certain coverage depth (500× for SNVs and 700× for INDELs) (Fig. 1A,B). High specificity was obtained across the whole sequencing coverage range (50-1000×). OncoAim test exhibited high concordance (94%) with ARMS-PCR approach for MAF ≥ 5% variants (Fig. 2), indicative of its high detection accuracy. Taken together, we concluded that the NGS-based OncoAim test had robust performance in FFPE samples. OncoAim panel covers common mutations (6000 hotspots in 59 cancer genes) in six prevalent Chinese cancer types (Oesophagus, stomach, liver, lung, breast and colon). Thus, in addition to lung cancer, it may be applied to other cancers. Further prospective studies of this panel on lung cancer and other cancer types should be performed to establish its application in clinical assay.
In five cases, EGFR exon 19 deletion was detected by ARMS-PCR, but not by NGS, which accounted for the major discordances observed between these two techniques. ARMS-PCR is super sensitive and can robustly identify alterations with 1% MAF 53 , whereas NGS can only reliably call variants with MAF ≥ 5%, particularly for INDELs 54 . Therefore, the difference in analytical sensitivity between these two methods may lead to inconsistencies in the test results. The low MAF could be caused by intratumoral heterogeneity, that is, the same tumor may possess cells that harbor different subclones with distinct mutations 55,56 .
We didn't perform side-by-side comparison of OncoAim panel with previous genetic profiling tests. However, we noticed that most of our discoveries are consistent with previous reports 1, 10 . For example, EGFR, TP53, KRAS, and PIK3CA were the top most frequently mutated genes in NSCLC. EGFR mutations were more common in female patients, and smokers tended to have higher mutation incidence of KRAS, and EGFR mutations were mutually exclusive with KRAS mutations. This validated the quality and effectiveness of our panel and bioinformatics pipeline, also reflected the robustness of these molecular signatures in NSCLC across populations with distinct demographic and racial background. In addition, our test sensitivity (>99% for variants with MAF ≥ 5% at >500X coverage depth) was comparable with that (95-99%) reported by Frampton, G.M. 54 .
We have also identified novel mutational patterns and novel correlations of genomic aberrations with patient characteristics in Chinese NSCLC. One intriguing finding is high mutation incidence of R213, G266 and R280 but low mutation incidence of G245 in TP53 (Fig. 5B). These mutations together may aid in examining tumorigenesis, epidemiology, and therapeutic decisions of NSCLC in Chinese population 57 . The high frequency of PTEN A333*fs10 represents a previously undervalued genomic variant in Chinese NSCLC, and the future functional www.nature.com/scientificreports www.nature.com/scientificreports/ characterization of this variant for clinical diagnosis and drug development is well justified. For the first time, we revealed the significant correlation of mutations in 4 genes, including EGFR, TP53, NFE2L2 and PIK3CA, with the specific histologic subtype of NSCLC in a single cohort study (Table 2), emphasizing the value of utilizing these molecular markers for subclassifying NSCLC patients and unearthing the distinct potential tumorigenesis mechanisms for NSCLC histologic subtypes. Moreover, OncoAim uncovered the previously unknown mutual exclusiveness of NFE2L2 mutations with EGFR mutations, which highlighted the oncogenic nature of NFE2L2.
As a retrospective study using archived FFPE specimens, one caveat of the current study is that not all the patients' clinicopathological information was available. The small number of patients in certain characteristic groups might limit the power of statistical analysis. For example, BRAF mutations are more prevalent in nonsmokers than in smokers (P = 0.019) 58 . Although we detected BRAF mutations only in nonsmokers, statistical analysis failed to show difference between nonsmokers and smokers ( Table 2), likely because only 13 smokers existed in our cohort.
In summary, the present findings based on NGS test could aid in subdividing NSCLC patients according to specific molecular signatures, improving diagnosis and prognosis, and implementing precision and personalized treatment for Chinese NSCLC patients.  (3) tumor cell content ≥20%. A 4-μm section of a hematoxylin and eosin-stained slide was reviewed by a pathologist to ensure a sample volume ≥1 mm 3 , nucleated cellularity ≥80% or ≥30000 cells and tumor cell content ≥20%. Clinicopathological information was gathered for association analysis.

Methods
DNA extraction, library preparation and sequencing. DNA of FFPE samples was extracted using the QIAamp DNA FFPE Tissue Kit (Qiagen, Hilden, Germany) strictly according to the manufacturer's protocol. The extracted DNA was quantified using the Qubit dsDNA HS Assay kit and a Qubit 3.0 fluorimeter (Life Technologies). Libraries were constructed from 20 ng DNA and the OncoAim DNA Panel using the Ion AmpliSeq Library kit v2.0-96LV (Life technologies). The panel covers more than 6000 highly frequent mutation hotspots in 59 cancer genes (Table S1). Libraries were quantified with Qubit 3.0 fluorometer (Life technologies). The individual libraries were diluted and then pooled for generating an 8 pM library to amplify on Ion sphere particles (ISP) on the Ion One Touch 2 instrument (Life technologies). ISP templates were enriched, loaded on an Ion 318 chip and sequenced on the PGM sequencer (Life Technologies).
Sequencing data analysis. OncoAim panel pipeline (OncoAim version 7.2) was used for sequencing data analysis. Briefly, the quality of raw reads (fastq files) was evaluated with FastQC (version 0.9.5, Babraham Bioinformatics, Cambridge, UK). High-quality reads were aligned against the human reference genome (hg19). The Burrows-Wheeler Aligner algorithm (https://github.com/lh3/bwa) was utilized for alignment, using default parameters. Insertions and deletions in sequence alignment files were left-aligned using a custom software tool, and left-aligned reads were processed using Freebayes (https://github.com/ekg/freebayes) for variant calling. The median coverage per locus was 500-1000x to ensure confident variant calling. The minimum mutation allele frequency (MAF) for SNVs and INDELs was set to 5%. Variants were annotated for effect prediction and clinical practice guidance. All the variants were manually checked on the Integrative Genomics Viewer (https://www. broadinstitute.org/software/igv/home).

Performance validation.
For SNV and indel validation, the six reference standards mentioned above were used. We mixed some of the reference standards by a 1:1 ratio to obtain more mutation sites and a wide MAF range. Then, individual standard and mixed standards were sequenced, targeting to >1000× coverage depth. Each bam file of sequencing data exported from the sequencer was randomly sampled (random selection of subsets of reads) to examine performance over a wide coverage depth range (50-1000×). We sampled 10 times for 50-500× and 3 times for 550-1000×. These sampled data were analyzed to identify variants.
For sensitivity analysis, all tested variants were assigned either a true positive (TP) if detected in the reference standards or false negative (FN) if not detected. Sensitivity at each test site was calculated as detected times/sampling times. For specificity analysis, each called variant was classified as a TP if the variant was a known mutation in the reference standards or a parental cell line, or a false positive (FP) if the variant was not a known mutation. Positive predictive value (PPV) was calculated as TP/ (TP + FP). (2020) 10:2070 | https://doi.org/10.1038/s41598-020-58819-5 www.nature.com/scientificreports www.nature.com/scientificreports/ For accuracy analyses, we compared OncoAim panel with AmoyDx EGFR Mutations Detection Kit that uses the principle of Amplified Refractory Mutation System (ARMS) (Amoy Diagnostics, Xiamen, China). For inconsistent results, Sanger sequencing was performed using the specified primers (Table S2).
We validated test reproducibility by examining mutation calls in replicates of clinical FFPE specimens. The samples were analyzed 5 times (5 independent library preparations starting from the same extracted DNA) in three different experiments (including three replicates in a single run). Statistical analysis. The statistical analysis was done using R (R version 3.4.1). For two nominal variables, two-sided Fisher's exact test was performed. For analysis involving more than 2 groups, following two-sided Fisher's exact test, pairwise comparison with Bonferroni's correction was performed. All tests were two-sided, and P < 0.05 was considered significant. Mutual exclusivity of variants was analyzed with R package 'maftools' (https://github.com/PoisonAlien/maftools), which performed Fisher's exact test for mutual exclusive events.

Data availability
All data generated or analyzed during this study are included in this article (and its Supplementary Information Files).