Landscape of somatic alterations in large-scale solid tumors from an Asian population

Extending the benefits of tumor molecular profiling to all cancer patients requires a comprehensive analysis of tumor genomes across distinct patient populations worldwide. In this study, we perform deep next-generation DNA sequencing (NGS) of tumor tissues and matched blood specimens from over 10,000 patients in China using a 450-gene comprehensive assay, developed and implemented under international clinical regulations. We perform a comprehensive comparison of somatically altered genes, the distribution of tumor mutational burden (TMB), gene fusion patterns, and the spectrum of various somatic alterations between Chinese and American patient populations. Here, we show that 64% of cancers from Chinese patients in this study have clinically actionable genomic alterations, which may affect clinical decisions related to targeted therapy or immunotherapy. These findings describe the similarities and differences between tumors from Chinese and American patients, providing valuable information for personalized medicine.

Cancer morbidity and mortality remain a major challenge to public health in China, with over two million cancer deaths per year 1,2. In recent years, precision oncology has enabled individual diagnosis, prognosis, and treatment based on increasingly accurate and high-resolution molecular stratification of cancers, largely focused on genome-targeted therapies 3,4.
Next-generation sequencing (NGS) technology, with the advantage of high throughput, can identify all classes of genomic alterations in hundreds of genes, including single nucleotide variants, insertions/deletions, copy number variations, and fusions/rearrangements, across multiple samples simultaneously. Considering the complexity of NGS technology and the rigorous requirements of clinical practice, strict quality assurance and validation are necessary. For instance, accreditation by the College of American Pathologists (CAP) or certification under the Clinical Laboratory Improvement Amendments (CLIA) is a standard for effective verification. A previous study has shown that the targeted NGS CSYS assay has been strictly validated for clinical practice 5. Comparison with the F1CDx assay of Foundation Medicine, which has been approved by the FDA, also verifies the reliability of the CSYS assay 6. The recognition of these panel-based NGS technologies makes large-scale genomic characterization of cancer patients possible.
Patient ethnicity can also be a factor in cancer diagnostics and treatment, since differences in cancer gene alterations exist between populations of various ethnicities [7][8][9]. A number of large-scale NGS pan-cancer studies of Western populations have been reported and displayed on cBioPortal 10. Comparable pan-cancer data are lacking for Asian populations, although many studies focusing on particular tumor types have been performed. In this work, we collected both tissue and blood samples from over 10,000 solid tumor patients, identified genomic alterations using the previously validated clinical NGS panel, and comprehensively compared the genomic characteristics of Eastern and Western tumor patients. This is a large-scale molecular profiling study of Asian solid tumor patients based on deep sequencing of hundreds of cancer genes from both tissue and blood samples in a validated laboratory, with clinical interpretation of the detected genomic alterations for precision medicine.

Results and discussion
Description of the cohort. To explore the genomic landscape of Chinese patients with solid tumors as encountered in clinical practice, we collected tumor specimens and matched peripheral blood specimens from 11,553 individuals encompassing 25 principal tumor types and more than 100 tumor subtypes. After excluding samples (n = 1359) with insufficient tumor content, insufficient DNA yield, or subsequent technical failure (Supplementary Fig. 1a), we successfully sequenced 10,194 (88%) tumor samples. To reduce statistical bias, cancer types with <50 cases were excluded from the analysis. Summaries of the clinical characteristics of the patients' specimens and the median sequencing target coverage of samples are presented in Supplementary Data 1 and 2 and Supplementary Figs. 2 and 3. A total of 31 ethnicities were represented in our cohort, with Han being the most frequent (92%, 9382/10,194). The majority of patients in this study were from eastern and southern provinces of China ("East China" and "South China" as defined in Wikipedia) (41% and 29%, respectively). In terms of tumor stage, 55% (5652/10,194) of patients had advanced-stage cancers (stage III/IV), while 35% (3579/10,194) had early-stage cancers (pre-cancers or stage I/II). Across the entire cohort, the majority (76%) of patients were treatment-naive, and patients with previous treatments accounted for 16%. The remaining 8% of patients did not have confirmed or available treatment history information (Supplementary Fig. 1b). The major tumor types were non-small cell lung cancer (NSCLC; 20%), colorectal carcinoma (CRC; 12%), liver hepatocellular carcinoma (LIHC; 11%), gastric cancer (GC; 8%), esophageal carcinoma (ESCA; 6%), soft tissue sarcoma (STS; 6%), intrahepatic cholangiocarcinoma (ICC; 5%), pancreatic cancer (PAC; 5%), extrahepatic cholangiocarcinoma (ECC; 3%), and breast carcinoma (BRCA; 3%) (Fig. 1a).
In general, the distribution of these predominant tumor types such as liver cancer (LIHC, ICC, and ECC) and lung cancer (NSCLC and SCLC) represented the distribution of tumors and mortality encountered in clinical practice in China 1 .
Comparison of the frequency of somatic gene mutations.
To assess the characteristics of cancer genomes from Chinese patients in a global context, we compared genomic alterations with those in the largest published cancer genomic study, the Memorial Sloan Kettering Cancer Center (MSK) IMPACT study 12, which includes 10,366 cases, mostly advanced cancer specimens. The 266 genes common to the two platforms were compared across 15 comparable advanced-stage tumor types between the advanced OrigiMed (OM) cohort (aOM, n = 2820) and the MSK cohort (n = 2820). To limit the bias of comparisons, we subdivided NSCLC in the two cohorts into lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), and we used propensity score matching (PSM) to balance the available clinical confounding factors between cohorts, such as primary/metastatic/recurrent tumor specimen, sampling method, gender, and smoking status. Overall, only 12 tumor type: gene pairs presented significant differences in the frequency of gene variants between the aOM cohort and the MSK cohort (FDR < 0.05) (Fig. 2c, d and Supplementary Data 9), suggesting that the frequencies of the most commonly mutated genes and the tumor-type distribution in the aOM cohort were highly consistent with those in the MSK cohort, such as CRC: APC (71.9 vs. 72.7%, FDR = 1) and SCLC: RB1 (84.2 vs. 71.1%, FDR = 1). The significant differences between the two cohorts were mainly found in lung adenocarcinoma and hepatobiliary tumors, such as LUAD: EGFR and ICC: KRAS. Moreover, several gene fusions and CNVs also showed differences between the aOM cohort and the MSK cohort.
To further confirm the similarities and differences between the OM and MSK studies observed in advanced cancers, we also compared the aOM data with genomic data of advanced-stage cases from The Cancer Genome Atlas studies (aTCGA). Because of heterogeneous methodologies (including detection platform, algorithm, and variant reporting criteria), only SNVs, InDels, and truncating mutations were considered in this comparison. In 9 comparable tumor types and 266 genes, we identified a total of 6 tumor type: gene pairs with significant differences between the aOM cohort (n = 1008) and the aTCGA cohort (n = 1008) (FDR < 0.05) (Fig. 2e, f and Supplementary Data 10), of which 3 showed trends consistent with those in the comparison between the aOM cohort and the MSK cohort, including higher frequencies of CRC: TP53 and LUAD: EGFR and a lower frequency of LUAD: KEAP1 in the aOM cohort compared with the other two cohorts. Altogether, these multiple comparisons revealed both the similarities and the distinctive features of genomic alterations across these cohorts.
Immunotherapy-related biomarkers. In addition to targeted therapy, the recent clinical success of immune checkpoint blockade [13][14][15][16] makes the comparison of immunotherapy-related mutations and signatures across cancers from patients in different countries another important question. Hence, we analyzed the distribution of tumor mutational burden (TMB) within tumor types. Even though no consensus has yet been reached on an algorithm for evaluating TMB in routine clinical practice 17, an individual's TMB has been shown to predict patient outcomes after immunotherapy [13][14][15][16]. Here, we classified samples as TMB high (TMB-H) or TMB low (TMB-L) according to the TMB-high definition from the KEYNOTE-158 study (≥10 Muts/Mb vs. <10 Muts/Mb) 16. As shown in Fig. 3a and Supplementary Data 11, median TMB values in nearly half of the tumor types in the aOM cohort differed from those in the MSK cohort (Supplementary Fig. 10). Overall, the pattern of TMB distribution in the aOM cohort was similar to that in the MSK cohort, characterized by a "tail" that includes 119 samples with TMB ≥ 40 (Fig. 3b). We further analyzed the distribution of 186 samples harboring MSI-H in our cohort and found that the overall proportion of patients with MSI-H was 2%, with most cases occurring in CRC (55%, 102/186) (Fig. 3c). Previous studies have suggested that TMB and PD-L1 expression are two independent biomarkers, and that there is no significant correlation between PD-L1 expression and TMB in most cancer subtypes 18,19. However, because MSI-H and TMB-H have recently been recognized as biomarkers for response to immune checkpoint blockades (anti-PD-1/PD-L1) 13 (Fig. 3d and Supplementary Fig. 11), suggesting the possibility of a high proportion of Chinese patients with lung cancer benefitting from immunotherapy.
In addition, recent evidence has suggested that somatic amplification of the gene for programmed cell death ligand 1 (PD-L1/CD274) is a response biomarker to immunotherapy in solid tumors, even in the absence of MSI-H, PD-L1 overexpression, or TMB-H 20. Herein, we identified a total of 85 (1%) tumors with CD274 amplification (copy number ≥6) in the OM cohort, a proportion consistent with a previous study 21 (Supplementary Fig. 12a). Furthermore, in 30 evaluable samples with CD274 amplification tested for PD-L1 expression, the PD-L1 positive rate was 70% (Supplementary Fig. 12b). Subsequently, we also examined the mutational landscape of the 85 samples with CD274 amplification and found co-occurrence of CD274 amplification with amplification of the adjacent genes PDCD1LG2 and JAK2 (89% and 82%, respectively), which lie nearby on chromosome 9p24.3-9p22.2 and are associated with advanced stage and poorer outcome 21. A high frequency of TP53 mutations (78%) was also observed in these tumors (Supplementary Fig. 12c).
Fig. 2 Analysis of somatic altered genes. a Numbers of correlated altered genes with six clinical features across tumor types. Only genes with significant differences (FDR < 0.05) between two groups of clinical features were calculated. The "age" feature included the younger patient group and the older patient group, separated by the median initial diagnosis age of patients of each tumor type. The "stage" feature included the early-stage cancer group and the advanced-stage cancer group. The "smoke" feature, including the smoker group (current smokers and former smokers) and the nonsmoker group (never-smokers), was analyzed in lung cancers (NSCLC and SCLC) and head and neck cancers (HNC). The "treatment" feature included the treatment-naive group and the pretreated group. The "sample type" feature included the primary sample group and the metastatic/recurrent sample group. b Correlation between Tier 1 Cancer Gene Census genes and clinical features. Genes with significant differences (FDR < 0.05, number of each group >60, and sum of variation frequencies >10%) between two feature groups were shown. The group with a higher variation frequency in each clinical feature was labeled in orange. c Frequency of altered genes in 15 comparable tumor types between the aOM cohort and the MSK cohort. d Comparison of significantly different altered genes (FDR < 0.05) between the aOM cohort (left) and the MSK cohort (right). Altered genes whose sum of frequencies in the two cohorts were displayed. The alteration frequencies (%) of specific genes were shown in the "aOM" and "MSK" columns. e Frequency of altered genes in 9 comparable tumor types between the aOM cohort and the aTCGA cohort. f Comparison of significantly different altered genes (FDR < 0.05) between the aOM cohort (left) and the aTCGA cohort (right). Altered genes whose sum of frequencies in the two cohorts were displayed.
Clinically actionable alterations. To assess the potential clinical impact of the somatic alterations found in our cohort, we used the MSK criteria 12 To further investigate whether the remaining 3,696 patients without OncoKB Level 1-4 variants in the OM cohort had an actionable biomarker, we analyzed PD-L1 expression. We found that 4% of these patients were PD-L1 positive (Supplementary Fig. 13b), suggesting that those patients could be candidates for treatment with immune checkpoint inhibitors even if their tumors did not meet the criteria for Level 1-4. A higher ratio of Level 1 was observed mainly in NSCLC, BRCA, SCLC, and UC, compared with other cancer types (Fig. 4b). Level 1 was predominantly represented by TMB-H and EGFR mutations in NSCLC, including EGFR L858R (20%; the proportion of samples in the tumor type with the variant), exon 19 deletion (19%), and G719 (3%) mutations. Others included ALK fusions (7%) in NSCLC, PIK3CA mutations (31%) and ERBB2 amplification (24%) in BRCA, and MSI-H in CRC (8%) (Fig. 4c). At the population level, the most common actionable variants were KRAS, EGFR, and PIK3CA SNVs/InDels, ERBB2 amplification, and ALK fusions, which was consistent with reports in the MSK cohort (Supplementary Fig. 13a).
Interestingly, in NSCLC, TMB-H was negatively associated with fusion positivity (3% vs. 13% fusion frequency in the TMB-H and TMB-L groups, respectively, P = 1.31E−11), with fusions mostly involving ALK. In contrast, MSI-H showed a positive association with fusion positivity (6% vs. 1% fusion frequency in the MSI-H and MSS groups, respectively, P = 0.04), with fusions mostly involving NTRK. This hints that the clinical benefit of combining fusion-based targeted therapy with immunotherapy may differ between cancer types, a finding that requires confirmation in future studies. All these findings suggest the relevance of the mutational landscape to the treatment of Chinese tumor patients.
In conclusion, we report herein the somatic mutation landscape of over 10,000 solid tumors in Chinese patients. To our knowledge, this is the largest and most comprehensive mutational landscape analysis of solid tumors in an Asian population. This report provides a highly reliable dataset and resources for cancer medicine. More importantly, this population-level comparative analysis has comprehensively revealed similarities and differences in somatic alterations and actionable variants between Chinese and other ethnic populations with solid tumors, and has important implications for the selection of patients for clinical trials of molecularly targeted therapies.
Unique tumor samples and matched normal blood samples of each patient were collected by standardized protocols. All tumor samples were formalin-fixed and paraffin-embedded (FFPE). Hematoxylin and eosin (H&E)-stained sections of tumor samples were reviewed by senior pathologists for the estimation of tumor cellularity. For each tumor sample, 15 to 25 eligible unstained sections were collected for DNA extraction. According to multiple quality control metrics, 825 (7%) samples with insufficient tumor content (<10%), 321 (3%) samples with inadequate extracted DNA yield (<50 ng), and 213 (2%) samples with a sequencing technical failure (unique mean coverage lower than 300×, biased coverage distribution, or sample contamination) were excluded. In total, 10,194 (88%) samples were successfully included in the final analysis (Supplementary Fig. 1).
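The exclusion criteria above can be expressed as a short sequential filter. The following Python sketch is illustrative only: the thresholds are those quoted in the text, while the `Sample` record, its field names, and the order in which the checks are applied are assumptions, not the authors' pipeline.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; names and record layout are hypothetical.
MIN_TUMOR_CONTENT = 0.10     # minimum tumor content (fraction)
MIN_DNA_YIELD_NG = 50.0      # minimum extracted DNA yield (ng)
MIN_UNIQUE_COVERAGE = 300.0  # minimum unique mean coverage (x)


@dataclass
class Sample:
    sample_id: str
    tumor_content: float
    dna_yield_ng: float
    unique_mean_coverage: float
    other_technical_failure: bool = False  # biased coverage or contamination


def qc_status(sample: Sample) -> str:
    """Return the first failed QC step, or 'pass' (check order is an assumption)."""
    if sample.tumor_content < MIN_TUMOR_CONTENT:
        return "insufficient_tumor_content"
    if sample.dna_yield_ng < MIN_DNA_YIELD_NG:
        return "inadequate_dna_yield"
    if (sample.unique_mean_coverage < MIN_UNIQUE_COVERAGE
            or sample.other_technical_failure):
        return "sequencing_technical_failure"
    return "pass"


# Consistency check on the counts reported in the text.
excluded = 825 + 321 + 213
included = 11_553 - excluded
```

With the reported counts, `excluded` evaluates to 1359 and `included` to 10,194, matching the figures given for the cohort.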
Sequencing workflow. The laboratory and bioinformatics protocols of the CSYS panel have been described and validated in a previous study 8 (Supplementary Fig. 14). DNA extracted from tumor tissues and matched normal peripheral blood was fragmented to ~250 bp and subjected to library construction using KAPA HyperPrep Kits (KAPA Biosystems), followed by hybridization capture using custom xGen Lockdown Probes and Reagents (Integrated DNA Technologies). As the main component, the custom hybridization capture panel targets ~2.6 Mb of the human genome containing all coding exons of 450 genes (Supplementary Data 12), as well as the promoter of TERT and select introns of 39 genes frequently rearranged in cancer. Post-capture libraries were mixed, denatured, and diluted to 1.5-1.8 pM (NextSeq 500) or 200-230 pM (NovaSeq 6000) and subsequently sequenced on NextSeq 500 or NovaSeq 6000 sequencers (Illumina). Paired-end sequencing was done following the manufacturer's protocols. Tumor samples were sequenced to a median unique coverage of 1202× (Supplementary Fig. 3) and matched normal blood samples were sequenced to a mean unique coverage of 300×. Data quality was inspected and controlled, followed by a suite of customized bioinformatics pipelines for variant calling. SNVs, InDels, and CNVs were identified using MuTect, Pindel, and EXCAVATOR, respectively. Gene rearrangements were detected using an algorithm developed in-house. At least 5 unique supporting reads were required to call an SNV or InDel. All variants were manually reviewed in the Integrative Genomics Viewer (IGV) and in custom visualization software to avoid false positives. Test results, including somatic variants and inherited pathogenic variants, were returned to patients and their physicians based on their needs.
Microsatellite instability (MSI) and tumor mutational burden (TMB). MSI status and TMB of tumor samples were determined according to bioinformatics approaches developed in-house 8. Microsatellite instability-high (MSI-H) is defined as more than 15% of selected microsatellite loci being unstable in the tumor compared with matched peripheral blood. The TMB score of each tumor sample is calculated by counting the number of somatic SNVs and InDels per megabase (Mb) in the targeted coding region of the genome. Noncoding mutations, hotspot mutations, and known germline polymorphisms in the U.S. National Center for Biotechnology Information's Single Nucleotide Polymorphism Database (dbSNP) are not counted. In this study, 10 Muts/Mb was adopted as the threshold value for differentiating TMB high (TMB-H) from TMB low (TMB-L).
Fig. 3 Correlation of tumor mutational burden (TMB), microsatellite instability high (MSI-H), and PD-L1 expression in OM cohort. a The tumor-type-specific distribution of TMB (excluding samples with TMB of 0) between the aOM cohort (light red) and the MSK cohort (light blue). Tumor types were sorted from left to right based on median TMB values (y-axis). The total number of samples was shown for each tumor type. P values were labeled on the top of corresponding tumor types in which TMB was significantly different between cohorts and calculated using a two-sided Wilcoxon rank-sum test. The boxplot elements indicate the maxima, 75th percentile, median, 25th percentile, and minima. Notches are used to compare groups; if the notches of two boxes do not overlap, this suggests that the medians are significantly different. b Distribution of TMB density between the aOM cohort (light red) and the MSK cohort (light blue). c The tumor-type-specific distribution for 186 samples with MSI-H. d The analysis for the cohort-level or tumor type-specific correlation of TMB, MSI, and PD-L1 expression in 2,723 samples with available information on MSI, TMB, and PD-L1 expression.
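To make the two definitions concrete, the sketch below computes a TMB score and assigns TMB and MSI status using the thresholds stated in the text (10 Muts/Mb and 15% unstable loci). It is a minimal illustration, not the in-house algorithm: the `Variant` record, its flags, and the `coding_region_mb` parameter are hypothetical, and the size of the targeted coding region is left as an input because the text does not state it.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Variant:
    kind: str         # "SNV" or "InDel"
    is_coding: bool   # falls in the targeted coding region
    is_hotspot: bool  # known hotspot mutation (excluded from TMB)
    in_dbsnp: bool    # known germline polymorphism (excluded from TMB)


def tmb_score(variants: Iterable[Variant], coding_region_mb: float) -> float:
    """Somatic SNVs and InDels per Mb after the exclusions described in the text."""
    counted = sum(
        1 for v in variants
        if v.kind in ("SNV", "InDel")
        and v.is_coding and not v.is_hotspot and not v.in_dbsnp
    )
    return counted / coding_region_mb


def tmb_status(score: float, threshold: float = 10.0) -> str:
    """TMB-H if the score reaches the threshold (Muts/Mb), else TMB-L."""
    return "TMB-H" if score >= threshold else "TMB-L"


def msi_status(unstable_loci: int, total_loci: int, cutoff: float = 0.15) -> str:
    """MSI-H if more than 15% of the selected loci are unstable."""
    return "MSI-H" if unstable_loci / total_loci > cutoff else "non-MSI-H"
```

For example, three countable variants over a hypothetical 0.25 Mb coding region give 12 Muts/Mb, which is classified as TMB-H.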
Overall comparative analysis pipeline. The available full data (mutation results and clinical information) of the MSK-IMPACT and TCGA (PanCancer Atlas and ovarian cancer, Nature 2011) studies were downloaded from cBioPortal (https://www.cbioportal.org/). Tumor types with >60 patients in each cohort were considered comparable. Somatic variants of the OM and MSK datasets were comparatively analyzed, including somatic SNVs, InDels, deletions of tumor suppressor genes, amplifications of oncogenes, and functional fusions/rearrangements. Considering the differences in detection methods between the OM and TCGA studies, only somatic SNVs and InDels were compared for the TCGA dataset. These variants were in coding regions, exon-intron flanks, and 5′ flanks (TERT gene) of 266 comparable cancer-related genes. All variants were divided into several subtypes, including SNVs/InDels, truncation, amplification, deletion, and fusion/rearrangement. The chi-squared test (χ²) and Fisher's exact test were used to compare the frequencies of gene variants between two cohorts, and P values were then corrected with the Benjamini-Hochberg (BH) method. Genes whose cohort-level alteration frequency differed with a false discovery rate (FDR) < 0.05 were reported as significant. We reduced the influence of confounding factors by performing propensity score matching (PSM) before comparing frequencies. Restricted by the available clinical information, only gender, smoking status (for LUAD, LUSC, and HNC only), sampling method, and primary/metastatic/recurrent tumor specimen were considered between the aOM and MSK cohorts, and only gender, age, and primary/metastatic/recurrent tumor specimen between the aOM and aTCGA cohorts.
We tried to balance as many available confounding factors (primary/metastatic/recurrent tumor specimen, sampling method, gender, smoking status, age, stage, grade, tumor subtype, depth of sequencing coverage, tumor purity, and patient ancestry) as possible. Given the limited availability of these factors, this study will inevitably have some biases.
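The text does not specify the matching algorithm used for PSM. As one common possibility, the sketch below shows greedy 1:1 nearest-neighbor matching on precomputed propensity scores with a caliper. Everything here is an assumption for illustration: the function name, the caliper value, and the premise that scores were already estimated (for instance by a logistic regression on gender, smoking status, sampling method, and specimen type).

```python
from typing import Dict, List, Tuple


def match_one_to_one(
    cohort_a: Dict[str, float],
    cohort_b: Dict[str, float],
    caliper: float = 0.05,
) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbor matching on propensity scores.

    cohort_a and cohort_b map sample IDs to propensity scores in [0, 1].
    Each sample in cohort_b is used at most once; pairs whose score
    difference exceeds the caliper are dropped.
    """
    available = dict(cohort_b)  # copy so the caller's dict is untouched
    pairs: List[Tuple[str, str]] = []
    for a_id, a_score in sorted(cohort_a.items(), key=lambda kv: kv[1]):
        if not available:
            break
        b_id = min(available, key=lambda k: abs(available[k] - a_score))
        if abs(available[b_id] - a_score) <= caliper:
            pairs.append((a_id, b_id))
            del available[b_id]  # match without replacement
    return pairs


# Hypothetical scores, for illustration only.
om = {"om1": 0.30, "om2": 0.70}
msk = {"msk1": 0.31, "msk2": 0.69, "msk3": 0.95}
matched = match_one_to_one(om, msk)
```

Matching without replacement yields equally sized matched cohorts, which is consistent with the equal sample sizes reported for the compared cohorts, although the authors' actual procedure may differ.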
Programmed death-ligand 1 (PD-L1) immunohistochemistry staining assay. We performed immunohistochemistry (IHC) staining of FFPE tissue sections for PD-L1 protein using an anti-PD-L1 antibody (clone 28-8; Cat#ab205921; Abcam; 1:300). Briefly, slides were incubated at 60°C, deparaffinized in xylenes, and rehydrated with graded ethanol. Antigen retrieval was performed using the Universal HIER antigen retrieval reagent (Cat#ab208572; Abcam; 1:10) in a steamer. Non-specific binding was blocked with the Dako EnVision FLEX Peroxidase-Blocking Reagent. All other staining was performed primarily with Dako series reagents (Cat#K8002; Dako; Undiluted). All slides were counterstained with hematoxylin. Specimens were scored by the pathologist using the Tumor Proportion Score (TPS), which is the percentage of viable tumor cells with partial or complete membrane staining at any intensity. PD-L1 positivity in the study was defined as TPS ≥ 1%, and specimens with TPS from 1% to <50% and TPS ≥ 50% were scored as weakly and strongly positive, respectively.
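The scoring rule can be written as a small helper. This Python sketch follows the cutoffs stated above (TPS ≥ 1% for positivity, ≥ 50% for strong positivity); the function names and category labels are hypothetical, and assigning a TPS of exactly 50% to the strongly positive group follows from the "≥50%" wording.

```python
def tumor_proportion_score(stained_tumor_cells: int, viable_tumor_cells: int) -> float:
    """TPS: percentage of viable tumor cells with membrane staining."""
    if viable_tumor_cells <= 0:
        raise ValueError("viable_tumor_cells must be positive")
    return 100.0 * stained_tumor_cells / viable_tumor_cells


def pdl1_category(tps_percent: float) -> str:
    """Map a TPS value (0-100) to the categories used in the text."""
    if not 0.0 <= tps_percent <= 100.0:
        raise ValueError("TPS must be between 0 and 100")
    if tps_percent < 1.0:
        return "negative"
    if tps_percent < 50.0:
        return "weakly positive"
    return "strongly positive"
```

For instance, 30 stained cells out of 200 viable tumor cells give a TPS of 15%, which falls in the weakly positive category.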
Clinical utility evaluation. We used previously reported criteria from the MSK study to assess the clinical actionability of variants. The OncoKB knowledge base (August 31, 2021, http://oncokb.org/) was used to annotate and classify variants into different levels: Food and Drug Administration (FDA)-recognized biomarkers (Level 1), variants that predict response to standard-of-care therapies (Level 2), variants that predict response to investigational agents in clinical trials (Level 3), or variants that predict response to investigational agents in preliminary, preclinical studies (Level 4). These levels were also subdivided according to evidence within or between tumor types: Level A (1, 2A, 3A, 4) for the same tumor type, and Level B (2B, 3B) for different tumor types. Although wild-type KRAS was defined as a level-related factor, we excluded wild-type KRAS in CRC in this study when establishing the subset of Level 1. MSI-H was considered an independent predictive biomarker with Level 1 evidence, regardless of the tumor type. If a variant was assigned to more than one level, the highest level was chosen for further analysis according to the rank: Level 1 > 2 > 3A > 3B > 4. The final level of each patient was defined as the highest evidence level of all variants detected in the patient. Information about drugs was obtained from the U.S. Food and Drug Administration (FDA) (http://www.fda.gov) and the National Medical Products Administration (NMPA) of China (http://www.nmpa.gov.cn).
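The level-assignment rule described above (take the highest level per variant, then the highest level across a patient's variants) can be sketched as follows. The ranking order is taken from the text; the function names, the input layout, and the variant labels in the example are hypothetical and do not represent OncoKB's API or real annotations.

```python
from typing import Dict, Iterable, List, Optional

# Rank order stated in the text: Level 1 > 2 > 3A > 3B > 4.
LEVEL_ORDER: List[str] = ["1", "2", "3A", "3B", "4"]
LEVEL_RANK: Dict[str, int] = {level: i for i, level in enumerate(LEVEL_ORDER)}


def highest_level(levels: Iterable[str]) -> Optional[str]:
    """Return the highest-ranked level, or None if no known level is present."""
    known = [lv for lv in levels if lv in LEVEL_RANK]
    return min(known, key=lambda lv: LEVEL_RANK[lv]) if known else None


def patient_level(variant_levels: Dict[str, List[str]]) -> Optional[str]:
    """Final level of a patient: the highest level over all detected variants."""
    per_variant = [highest_level(lvs) for lvs in variant_levels.values()]
    return highest_level(lv for lv in per_variant if lv is not None)


# Made-up annotations, for illustration only.
example = {"variant_a": ["3B", "4"], "variant_b": ["2", "3A"]}
final = patient_level(example)
```

In this made-up example, `variant_a` resolves to Level 3B and `variant_b` to Level 2, so the patient is assigned Level 2; a patient with no annotated variants gets `None`.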
Statistical analysis. The chi-squared test (χ²) and Fisher's exact test, with P values corrected using the Benjamini-Hochberg (BH) method, were used to compare gene alteration frequencies between two cohorts and to evaluate the association between clinical characteristics and significantly altered genes. A two-sided Wilcoxon rank-sum test was performed within each tumor type to compare the tumor mutational burden (TMB) between the MSK and OM cohorts. Differences were considered significant at P < 0.05 or FDR < 0.05.
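As an illustration of the frequency comparison, the following standard-library Python sketch implements a two-sided Fisher's exact test on a 2×2 table (altered vs. unaltered counts in two cohorts) and the Benjamini-Hochberg adjustment. It is a minimal sketch under stated assumptions, not the software used in the study; the function names are hypothetical and the counts in the example are made up.

```python
from math import comb
from typing import List


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return BH-adjusted P values (FDR) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up counts: altered/unaltered samples in two cohorts for one gene.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

In practice one P value would be computed per tumor type: gene pair and the whole list passed to the adjustment, with pairs reported as significant when the adjusted value is below 0.05.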