Genomic analysis of oesophageal squamous-cell carcinoma identifies alcohol drinking-related mutation signature and genomic alterations

Approximately half of the world's 500,000 new oesophageal squamous-cell carcinoma (ESCC) cases each year occur in China. Here, we show whole-genome sequencing of DNA and RNA in 94 Chinese individuals with ESCC. We identify six mutational signatures (E1–E6), and Signature E4 is unique in ESCC linked to alcohol intake and genetic variants in alcohol-metabolizing enzymes. We discover significantly recurrent mutations in 20 protein-coding genes, 4 long non-coding RNAs and 10 untranslational regions. Functional analyses show six genes that have recurrent copy-number variants in three squamous-cell carcinomas (oesophageal, head and neck and lung) significantly promote cancer cell proliferation, migration and invasion. The most frequently affected genes by structural variation are LRP1B and TTC28. The aberrant cell cycle and PI3K-AKT pathways seem critical in ESCC. These results establish a comprehensive genomic landscape of ESCC and provide potential targets for precision treatment and prevention of the cancer.

O esophageal squamous-cell carcinoma (ESCC) ranks the fourth leading cause of cancer death and approximately half of the world's 500,000 new ESCC cases each year occur in China 1,2 . Epidemiological studies have suggested that alcohol intake, tobacco smoking, micronutrient deficiency and dietary carcinogen exposure may cause ESCC [3][4][5] . However, the mutational signatures in ESCC associated with lifestyle or environmental aetiological factors have not been explored. There are currently no specific molecule-targeting agents for ESCC treatment and therefore the long-term outcome of this cancer is still dismal, with 5-year survival rate around 30% (refs 6,7).
Several studies on whole-exome sequencing (WES) in ESCC in Chinese and Japanese populations have been published recently.
These studies reported an extremely high frequency of TP53 mutations and low prevalence but statistically significant singlenucleotide variations (SNVs) in several other genes, including CDKN2A, NOTCH1, RB1 and PIK3CA [8][9][10][11][12][13][14][15] . All these WES studies only took into account somatic variations in the protein-coding regions; however, the protein-coding components of the genome account for only B2% of the total sequences and previous studies indicated that the non-coding regions are more frequently affected by mutations compared with the coding regions 16 . The effects of the non-coding variations on ESCC development have not been characterized at the whole-genome level. Although both copy-number variations (CNVs) and structural variations (SVs) in ESCC have been reported, the results were based on low- C>T    coverage whole-genome sequencing (WGS) and/or singlenucleotide polymorphism (SNP) array with relatively small sample size [9][10][11][12][13][14][15]17 . The apparent limitations in the previous studies warrant performing more comprehensive studies.
Here we report an integrated analysis of genomic signatures, SNVs, CNVs, SVs and their correlations with mRNA expression in ESCC from Chinese individuals. We have comprehensively characterized the genomic landscape features in ESCC that may provide clues to environmental aetiological factors and identified potential targets that may guide to develop precision treatment and prevention of this malignancy.  Fig. 2a,b).

WGS
We found a genome-wide mutation rate of 2.1 mutations per megabase (Mb). There were 6,797 (1.1%) SNVs or indels with 1.98 mutations per Mb in the protein-coding regions and 610,832 (98.9%) SNVs or indels with 2.13 mutations per Mb in the noncoding regions, respectively ( Supplementary Fig. 2a). The amount of SNVs was positively associated with mutational status of TP53 (P ¼ 1.34 Â 10 À 4 , unpaired t-test; Supplementary Table 1). We found that 9.6% (9/94) of ESCC samples had occasional kataegis loci (range ¼ 1-3), which are localized substitution hypermutations and associated with genomic rearrangement, and 4 of these kataegis had SVs nearby (Supplementary Fig. 2c and Supplementary Table 2). We counted the number of SNVs in the matrix of 96 possible mutations occurring in a trinucleotide context in each ESCC sample and found that the predominant mutations were the C4T and C4G transitions at TpCpW trinucleotide sites (Fig. 1a), which is similar to the signature of APOBEC enzyme family-mediated mutagenesis 18 .
The mutational signatures of ESCC. Mutational signatures in cancer genome might reflect and trace DNA damage caused by DNA-damaging agents to which cells have been exposed. However, the previous WES studies did not address this important issue in ESCC probably due to the limited statistical power. In this study, we combined our WGS data with WES data obtained from published studies 9,11-15 including 94 ESCC samples from The Cancer Genome Atlas (TCGA) database to increase sample size to 704 for analysing SNV profile in the protein-coding regions. By using 1,000 iterations of non-negative matrix factorization 19 , we identified six mutational signatures (Signatures E1-E6) in ESCC that are of high stability and low reconstruction error ( Supplementary Fig. 3).
We then performed cosine similarity analysis to compare the mutational signatures in ESCC with current Catalogue of Somatic Mutations in Cancer (COSMIC) 19 (Supplementary Table 3) and found that five ESCC signatures are highly similar to COSMIC signatures. Signature E1 is highly similar to COSMIC Signatures 2 and 13 (cosine similarity 0.819 and 0.813, respectively), which is believed to be due to over activity of the APOBEC, a family of cytidine deaminases. APOBEC enzymes convert cytidine to uracil that is usually coupled with the activities of base excision repair and DNA replication machineries 20 . In our samples, the expression levels of APOBEC3B and APOBEC3C were significantly higher in tumours than in paired normal tissues (P ¼ 8.81 Â 10 À 31 and P ¼ 3.21 Â 10 À 5 , respectively, unpaired t-test; Supplementary Fig. 2d) and strongly correlated with the total number of the C4T (P ¼ 0.022, Spearman's r ¼ 0.24) and C4G (P ¼ 0.019, Spearman's r ¼ 0.24) mutations at TpCpW trinucleotide sites, respectively ( Supplementary Fig. 2d). Signature E2 showed high similarity to COSMIC Signature 4 (cosine similarity 0.860), which was previously identified to be associated with smoking status in lung cancer patients 19 . This signature is only significantly associated with individuals' smoking status in our 94 ESCC samples (P ¼ 0.027, unpaired Wilcoxon rank-sum test; Supplementary Fig. 4c) but not in other WES studies ( Supplementary Fig. 4b,c). Signature E3 (cosine similarity 0.829 and 0.783) and Signature E5 (cosine similarity 0.838 and 0.892) are highly similar to COSMIC Signatures 1A and 1B, both of which were identified as age-dependent mutational signatures (Supplementary Table 3). Signature E3 and Signature E5 were both associated with age at diagnosis of 699 ESCC who had information on age (P ¼ 3.20 Â 10 À 4 and P ¼ 0.006, unpaired Wilcoxon rank-sum test; Supplementary Fig. 4a). We found that the average percentage of the T4C mutation per sample was significantly higher in drinkers (13.6%) than in non-drinkers (9.7%; P ¼ 3.60 Â 10 À 10 , unpaired t-test) and Signature E4, specifically characterized by the T4C mutation, is similar to COSMIC Signature 16 (cosine similarity 0.873) of which the origin was unknown before. In the present study, we identified Signature E4 significantly associated with individuals' smoking and drinking status (P ¼ 0.002 and P ¼ 9.69 Â 10 À 28 , unpaired Wilcoxon rank-sum test; Fig. 1d and Supplementary Fig. 4b). Our previous genome-wide association studies have identified two functional SNPs, rs671 in ALDH2 on 4q23 and rs1229984 in ADH1B on 12q24, that are significantly associated with the risk of ESCC in a manner of interactions with alcohol drinking and tobacco smoking status 21,22 . Therefore, we further analysed the mutation profiles in the function of ALDH2 and ADH1B genotypes. We found that the frequency of Signature E4 mutations in ESCC from drinkers with the risk ALDH2 genotype (rs671-AG/-AA) was significantly higher than that in ESCC from drinkers with the non-risk genotype (rs671-GG; P ¼ 0.003, unpaired Wilcoxon rank-sum test). It was also significantly higher than that in ESCC from non-drinkers with the rs671-AG/-AA genotype (P ¼ 0.006, unpaired Wilcoxon ranksum test) or ESCC from non-drinkers with the rs671-GG genotype (P ¼ 0.001, unpaired Wilcoxon rank-sum test) (Fig. 1e). The significant differences between the frequencies of Signature E4 mutations were also seen when the analysis was stratified by the ADH1B rs1229984 genotypes (Fig. 1e). Interestingly, the similar results were also observed in ESCC in Japanese 14 , whose ALDH2 and ADH1B genotype data are available (Fig. 1f). We also found that individuals with Signature E4 mutations had significant transcriptional strand bias for the T4C change ( Supplementary Fig. 5), which is believed to be caused by the differences in repair efficiency of DNA damage and maintenance processes between transcribed and untranscribed strands of genes 19 . Signature E6 (Fig. 1b), characterized specifically by the T4A and T4G mutations, showed low similarity to any of the COSMIC signatures (all cosine similarity o0.8) and appears to be a new mutational signature in ESCC. expressed and 41, 46 and 53% mutations were confirmed if the locus has at least 10, 5 and 1 read/s, respectively, with the mutant nucleotide by RNA sequencing. In addition, we selected 105 putative somatic mutations for further validation and 95 (90.5%) selected mutations were confirmed by PCR-based Sanger sequencing. Using the MutSigCV 23 , we identified six significantly mutated genes, including CDKN2A, TP53, NOTCH1, FBXW7, NFE2L2 and AJUBA (all false discovery rate (FDR) qo0.5; Supplementary Fig. 6). All of them have previously been reported in ESCC and two other types of squamous cell carcinomas, head and neck squamous cell carcinomas (HNSCC) and lung squamous cell carcinomas (LUSCC) (Supplementary Data 4).
To increase the statistical power to detect the driver mutations, we analysed the data in combined sample of 704 ESCC, which provides 90% power to detect genes mutated in 2% individuals 24 . As a result, we detected 20 putative driver genes and found that 93% (657/704) of ESCC had at least one somatic mutation in these genes ( Fig. 2 and Supplementary Figs 7 and 8).
Recapitulating previous reports, we confirmed mutations in 12 genes, including MLL2, FAT1, PIK3CA, EP300, ZNF750, CREBBP, NOTCH3, PTCH1, RB1, KDM6A, TGFBR2 and PTEN. Furthermore, we identified two novel mutated genes, CUL3 and RBPJ, in ESCC ( Fig. 2 and Supplementary Data 4). We found that knockdown of CUL3 or RBPJ significantly promoted malignant phenotypes of ESCC cells such as proliferation, migration and invasion; but overexpression of these genes significantly suppressed these malignant phenotypes ( Supplementary Fig. 9). In addition, we observed that the rate of PTCH1 mutation was significantly higher in ESCC in smokers compared with that in nonsmokers (P ¼ 0.030, one-sided Fisher's exact test), while the rates of TP53, EP300, PTCH1, NOTCH3, TGFBR2 and ZNF750 mutations were significantly higher in ESCC in drinkers than that in non-drinkers (all Po0.05, one-sided Fisher's exact test). ESCC at late stage had significantly more NOTCH1 mutations compared with ESCC patients at early stage (P ¼ 0.001, onesided Fisher's exact test; Supplementary Data 5).
We then compared the mutational profiles in ESCC, HNSCC, LUSCC or oesophageal adenocarcinoma (EAC) and found that the 20 genes significantly mutated in ESCC were also frequently mutated in HNSCC and LUSCC; however, in EAC, only TP53, CDKN2A and PIK3CA were frequently mutated (Supplementary Data 4). Clustering analysis showed that the mutational profile of ESCC is similar to HNSCC and LUSCC but dissimilar to EAC ( Supplementary Fig. 10). These results indicate that cancers originated from the same cell type might have the same driver mutations but cancers in the same organ generated from different cell type might have distinct driver mutations, reflecting different aetiology and the molecular mechanisms of pathogenesis 15,25 .
Recurrent mutations in the non-coding elements. We next analysed the mutations in the non-coding regions, including the gene promoters, untranslated regions (UTRs) and long noncoding RNAs (lncRNAs) annotated by GENCODE v24 (ref. 26). Using the regional recurrence testing approach 16 , we found 15 regions that had significantly higher mutation rates than that expected by chance (qo0.05; Supplementary Table 4 and Supplementary Fig. 11  LINC01360, PART1 and CTA-280A3.2), 1 promoter (FCMR) and 10 UTRs (SMAD2, YIPF4, ZNF605, FOXJ3, NABP1, CLOCK, YTHDC1, TMEM178B, TNFRSF11A and GALR1). For instance, 6.4% ESCC samples had mutations in NEAT1, an lncRNA that is associated with malignant phenotypes of several types of cancer including ESCC [27][28][29][30] . By integrating analysis of WGS and RNA sequencing data, we found that the mutations in the 3 0 UTR of FOXJ3 were significantly correlated with the overexpression of this gene, whereas the mutations in the 3 0 UTR of CLOCK were significantly associated with downregulation of this gene ( Supplementary Fig. 12).
Through additional cluster analysis of RNA expression data, we found that the RNA expression pattern in ESCC is also similar to that in HNSCC and LUSCC but differs from that in EAC and STAD ( Supplementary Fig. 13c,d). The similar SNV and CNV profiles along with the similar RNA expression pattern support that ESCC belongs to the 'squamous' molecular subtype as identified by Hoadley et al. 31 We also identified 23 focal regions with recurrent gain of copy number and 34 regions with recurrent loss of copy number (both qo0.25; Fig. 3b We then retrieved these 57 focal CNV peaks and examined whether the gene with copy-number gain or loss has substantial effect on its mRNA expression measured by RNA sequencing. The results showed that, among the 6,432 genes located in these CNV regions, 538 (33.8%) with copy-number gain and 1,416 (29.3%) with copy-number loss had mRNA levels that are significantly correlated with gene copy number (Spearman's r40.3 and Po0.05; Supplementary Fig. 14 and Supplementary Data 9). Through the analysis using a gene-drug interaction database 32 , we found that, among the genes that show aberrations in DNA copy number and mRNA expression, 184 (including EGFR, CCND1, PIK3CA and MAPK1) have the potential to interact with at least one drug (Supplementary Data 10).
We also identified SVs in FHIT, RAD51B, MYH9 and CDKN2A in a frequency of 45% samples (Supplementary Data 12). Since SVs may alter gene expression due to the change of geneenhancer distance or alterations of gene copy number or structure, we therefore examined the mRNA expression of genes with or without SVs in ESCC (Fig. 4b). Interestingly, we found 23 genes whose expression were significantly altered by SV (Po0.05, unpaired t-test; Supplementary Data 12) and among them were well-known cancer-related genes CDKN2A (SV in 5 samples), PRKCA (SV in 3 samples) and EP300 (SV in 3 samples). We further screened gene fusion events in our samples, which were validated by PCR-based Sanger sequencing (Supplementary Table 5). We identified ERC1 that was fused with different partners, PRAF2, WNK1 or RAD51, respectively, in three samples. ERC1 is a member of RIM-binding protein family that regulates neurotransmitter release and has been shown to be fused to the RET in papillary thyroid carcinoma 39 . We also detected and validated NRG1-ZCCHC7 and WT1-MRPL19 fusions, respectively, in two samples (Fig. 4c).
Functionally aberrant pathways in ESCC. We integrated the above described 20 genes having driver mutations and 586 genes with CNV into pathway analysis to identify the functionally aberrant pathways in ESCC. We selected the genes with CNV for the analysis based on both copy number and mRNA expression changes in tumour tissues compared with normal tissues. As a result, we identified a highly interconnected network of   aberrations targeting the receptor tyrosine kinases (RTK/RAS/ phosphoinositide-3 kinase (PI3K)), cell cycle regulators, WNT, NOTCH and TP53 pathways (Fig. 5, Supplementary Data 13 and 14). Several receptor tyrosine kinases (that is, EGFR, ERBB2 and ERBB3), their downstream signal transducer PIK3CA and other components in the RTK/PI3K pathway exhibited recurrent focal amplification and/or SNV. We found 14.9% (14/94) samples that had EGFR amplification while SNV in this gene was only 1.1% (1/ 94). The amplification rate for PIK3CA, a known oncogene in many types of human cancer, was 38.3% (36/94) and only two samples with PIK3CA amplification displayed concurrent mutations. We found a significant positive correlation between PIK3CA copy-number gain and elevated mRNA expression. The deletion rate of PIK3R1 in our samples was 4.3% (4/94) and the mRNA expression levels of this gene were significantly lower in tumours than in normal tissues (P ¼ 3.71 Â 10 À 24 , paired t-test).

Discussion
Several studies on whole-exon sequencing of ESCC have recently been published; however, the results are limited to somatic mutations in exonic regions of protein-coding genes, the approximately 2% of DNA sequences in human genome. In the present study, we performed an integrative analysis of WGS and RNA sequencing of 94 ESCC samples. Based on the systematical analysis of whole-genome mutational signatures, SNVs, CNVs,  ARTICLE SVs and mRNA expression and combined analysis with previous WES data, we have established a more comprehensive genomic landscape of ESCC. Several novel findings have been achieved in this study. First, we identified six mutational signatures in ESCC and found that Signature E4, characterized by the T4C mutation in the transcribed strand, is associated with alcohol intake and germline variants in two alcohol-metabolizing genes. Second, we identified recurrent mutations in 20 protein-coding genes, among them were CUL3 and RBPJ, two genes whose recurrent mutations have never been reported before in ESCC. We also identified for the first time the significant mutations in 15 non-coding regions, including that produces lncRNA NEAT1 (refs 27-30) and 3 0 UTR of FOXJ3 and CLOCK. Third, we identified 57 focal regions with recurrent CNVs and 6 genes in these regions had aberrant mRNA expression, which significantly promoted proliferation, migration and/or invasion of SCC cells. Furthermore, we identified that LRP1B and TTC28 are the most frequently affected genes by SV in ESCC. We also found frequent aberrations of cell cycle pathway and PI3K-AKT signaling pathway in ESCC, suggesting that these may be the most critical pathways involved in the development and progression of ESCC.
In the present study, we combined available ESCC WES data with our WGS results (total N ¼ 704) for analysis of mutational signatures and SNVs in coding regions 9,11-15 . We did not include the first whole-exonic sequencing study 8 for combined analysis because of small sample size of only 12 ESCC. Using combined analysis with increased sample size, we not only confirmed 18 significant SNVs but also identified two novel recurrent mutations in the CUL3 and RBPJ genes. CUL3 encodes cullin 3, a member of a ubiquitin ligase complex that functions in the oxidative stress response pathway. This gene displays frequent deletions or mutations in SCC 47,48 . RBPJ encodes a transcriptional factor regulating Notch signaling pathway, which has been shown to be an effective therapeutic target in various tumours 45,49,50 . By using this large sample size, we generated clinical background-related mutational signatures in ESCC. More importantly, we found that, among the six mutational signatures, Signature E4 occurred mostly in the transcribed DNA strand and was correlated with patients' drinking status and the variant genotypes in ALDH2 and ADH1B that have been associated with increased susceptibility to ESCC by genome-wide association studies 21,22 . The correlations between mutations in Signature E4 in ESCC and drinking and alcohol-metabolizing genotypes further support an important role of alcohol intake in aetiology of ESCC 4,5 .
In this study, we also performed functional assays to validate the novel findings. We found that disturbing CUL3 and RBPJ expression substantially altered the malignant phenotypes, such as proliferation, migration and invasion in vitro in SCC cells. Therefore, it is biologically plausible to define these two mutations as potential driver mutations in ESCC. We also demonstrated that knockdown of EGFR, BRD9 and PPFIA1, which showed frequently copy-number gain at the genomic level, significantly inhibited cancer cell proliferation, indicating that these genes may play important oncogenic roles in ESCC. Moreover, we found that knockdown of expression of CLPTM1L, ATP9B, PQLC1 as well as EGFR significantly inhibited cancer cell migration and invasion. While the effect of CLPTM1L on the progression of cancer has been previously reported 51 , the functions of ATP9B and PQLC1 in SCC cell migration and invasion have never been reported before. These findings provide evidence that gain of function of these genes in ESCC might serve as therapeutic targets for ESCC treatment.
Non-coding DNA elements comprise the majority of the genome, and mutations in these regions have not been well characterized in ESCC. Accumulating evidence has indicated that lncRNAs play important roles in cancer 52 . Mutations in lncRNAs might cause RNA structure change or influence binding affinity to proteins. It has been shown that NEAT1, which forms SWI/SNF chromatin-remodelling complex 53 , participates in carcinogenesis of various types of cancer [27][28][29][30] , consistent with our finding that NEAT1 is significantly mutated in ESCC. On the other hand, mutations in gene regulatory regions might change the binding affinity of miRNAs or transcriptional factors and result in alterations of gene expression. In this study, we indeed found that the mRNA expression levels in samples with mutations in 3 0 UTR of FOXJ3 and CLOCK had significantly altered compared with samples without the mutations. FOXJ3 and CLOCK have been reported to be involved in tumorigenesis 54,55 , which is in line with our findings. However, due to relative small sample size and low coverage of WGS in this study, further validation of the identified mutations in noncoding regions is needed.
Our present study has some limitations. First, the functional roles of the four mutated noncoding RNAs and their mutations are not clear yet and need to be addressed by further functional studies. Second, we identified several genes (for example, LRPIB, FHIT and TTC28) with high frequency of SV but the expression of these genes was not significantly changed. Although this phenomenon has also been observed in the WGS study of hepatocellular carcinoma and other WES studies 28,38,56 , further studies are warranted to elucidate the potential reasons, such as whether the SV occurs only in one allele but the gene expression is complemented by the other allele, whether there exist other mechanism such as SNV, CNV or DNA methylation, all of which may influence the expression of the affected genes, and whether these SVs may occur only in part of the whole gene region leading to a change of structure but not the expression of the gene.
In conclusion, this study characterized the genomic alterations in ESCC through WGS and integrative analysis. We have not only confirmed some previously reported findings but also identified several novel genomic alterations in ESCC. These results establish a comprehensive genomic landscape of ESCC and provide potential targets for precision treatment and prevention of the cancer.

Methods
Collection of biospecimens. ESCC were obtained from individuals (N ¼ 94) recruited at the Cancer Hospital, Chinese Academy of Medical Sciences (Beijing, China) and Zhejiang Cancer Hospital (Hangzhou, China) between 2010 and 2014. All the individuals with ESCC were unrelated Han Chinese whose demography characteristics and clinical information were obtained from medical records. Informed consent was obtained from each participant, and this study was approved by the Institutional Review Board of the Chinese Academy of Medical Sciences, Cancer Institute and Zhejiang Cancer Hospital. All the individuals underwent oesophagectomy and received no chemotherapy or radiotherapy before surgery. The biospecimens, including tumour tissues, adjacent normal oesophageal tissues (Z5 cm from tumour site) and blood samples from each individual, were collected at the time of treatment.
Collection of clinical data. Clinical data for each individual were collected, which included gender, age at diagnosis, ethnicity, smoking status, drinking status, tumour site in the oesophagus, tumour stage and tumour grade. Individuals who smoked an average of o1 cigarette per day and for o1 year in their lifetime were defined as nonsmokers; otherwise, they were defined as smokers. Individuals were classified as drinkers if they drank at least twice a week and continuously for at least 1 year during their lifetime; otherwise, they were defined as nondrinkers. Tumour TNM staging components including Tumour, Node and Metastasis were reviewed by at least three pathologists and defined according to the American Joint Committee on Cancer Seventh edition. Previous studies have investigated numerous molecular and clinical risk factors without validation for potential clinical applications. Therefore, few patients had documentation for any molecular biomarkers for further analyses. Distributions of selected characteristics among the 94 patients are shown in Supplementary Table 6.
Verification of histopathologically diagnosis. All the tumours were histopathologically classified as ESCC. Tumour samples and distant normal tissues were ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/ncomms15290 embedded in optimal cutting temperature medium and histological sections stained with haematoxylin and eosin were reviewed by at least two pathologists for quality assurance that tumour specimens contained an average of 75% tumour cell nuclei with o20% necrosis, whereas normal specimens contained no tumour cells.
Processing of biospecimens. DNA and RNA from tissue samples were extracted using the Allprep DNA/RNA Kit (Qiagen) while blood DNA was extracted using QiaAmp Blood Midi Kit (Qiagen). DNA samples were quantified by Picogreen fluorescence assay and resolved by 1% agarose gel electrophoresis to confirm high molecular weight fragments. A custom Sequenom SNP panel was utilized to verify that tumour DNA and germline DNA was derived from the same patient. One microgram of DNA from each tumour and normal tissue was sent to Qiagen for REPLI-g whole-genome amplification using a 100 mg reaction scale. RNA was quantified by Abs260 using an UV spectrophotometer and then analysed via the RNA6000 nano assay (Agilent) for determination of an RNA Integrity Number. We discarded samples with RNA Integrity Number o7.0 ( Supplementary Fig. 1). Aliquots of DNA and RNA were shipped to the sequencing centre for all subsequent testing.
Genomic mutation detection and analysis. High-coverage WGS sequencing of DNA samples from 94 ESCC was performed on the Illumina HiSeq X Ten System. The sequence reads were aligned to the genome (hg19) using the bwa mem (v0.7.4) 57 with default parameters. Duplicates were removed using Picard. The somatic mutations were called by Strelka 58 . Final somatic calls were filtered to require a quality score of 430 and were further annotated by snpEff and snpSift 59 . Non-negative matrix factorization was applied to the 96-substitution pattern and the contribution of each signature to each mutation catalogue was estimated by an algorithm from the Wellcome Trust Sanger Institute 19 . We performed PCR-based Sanger sequencing on a total of 105 putative somatic mutations identified in this study. PCR primers were designed using Primer3 to amplify a 100-to 700-bp region containing each mutation (Supplementary Data 15).
We carried out significance analysis of mutations using the MutSigCV algorithm 22 , as elaborated in several genome projects, with qo0.5 being considered significant in our data and qo0.1 being considered significant in the combined analysis. Genes mutated in o1% of samples or expressed (RSEM value of RNA sequencing is not zero) in o10% of samples were considered non-ESCC driver genes. For mutations in the non-coding regions, we performed analysis on promoters and UTRs of coding genes (N ¼ 20,014) and lncRNAs (N ¼ 7,747) annotated by Gencode v24 (ref. 26). The promoter is defined as the region between upstream 2 kb and downstream 250 bp from any transcription start site of a coding transcript gene and excluded any overlap with the coding regions. A regional recurrence testing approach was used to search for significant regions that were mutated more frequently than expected by chance 16 . To avoid false positive, only regions with FDR qo0.05 and 43% mutation samples were considered significant. Mutual exclusive analysis was performed using gitools (v2.2.2) 60 . All statistical analyses were performed in R using standard implementation.
We analysed copy number using SynthEx (unpublished, http://chenmengjie. github.io/SynthEx). In brief, 100 kb bin counts data (segment data) were generated using BEDTools 61 . The read ratios were calculated using the 'synthetic normal' strategy described in SynthEx by averaging normal samples sequenced by the same protocol to reduce unwanted variation contributed by local characteristics and factors in sequencing process. A trending filter procedure was applied to segment the genome. The log2 ratios for segments were finally corrected by purity estimate from SynthEx. To infer recurrently focal peak in amplified or deleted genomic regions, we used the GISTIC algorithm 62 with copy numbers in 1 kb windows as markers instead of SNP array probes. G-scores were calculated for genomic and gene-coding regions based on the frequency and amplitude of each gene with amplification or deletion. A significant CNV region was defined as having amplification or deletion with an FDR q value o0.25. For the arm-level analysis, chromosomal arms were considered to be altered if at least 66% of the arm (all_thresholded.by_genes.txt file from the GISTIC output) was lost or gained.
We used Meerkat 63 and Delly2 (ref. 64) to detect somatic SVs. Since the mean insert size is o300 bp and our WGS data are 150 bp paired-end reads, we first truncated the reads to 100 bp to avoid overlapping of the read pairs. Possible false positives were then filtered using a realignment procedure based on a hidden Markov Model and realigns discordant reads against a set of candidate sequences, including the local reference sequence and sequences obtained by inserting SNPs and indels implied in the short reads mapped nearby 65 . We then merged Meerkat and Delly2 predictions: if the breakpoints of a SV are within 500 bp of those of another SV, the two SVs are merged and viewed as one single SV. Complex rearrangements are called by ShatterSeek 66 and a genome having complex rearrangement is called if P value o0.01. All genes were annotated and enriched with Kyoto Encyclopedia of Genes and Genomes pathway using DAVID (v6.8) 67 .
RNA expression analysis. Methods for RNA sequencing and data processing were based on the TCGA RNA-Seq protocol 38 . Briefly, total RNA of each sample was extracted, prepared into Illumina TruSeq mRNA libraries and sequenced by Illumina HiSeq2000 with a total of 10 Gb data. RNA reads were aligned to the hg19 genome assembly using Mapsplice 68 . To confirm SNVs reported in the WGS DNA samples, we applied UNCeqR 69 to call somatic variants from RNA-Seq data. Gene expression using RSEM 47 was quantified for the transcript models corresponding to the TCGA GAF2.1 and normalized to a fixed upper quartile of total reads within the sample. For further gene-level analysis, expression values were the RSEM value added by 1 and then log2 transformed. Principal component analysis (PCA) with R standard implementation was conducted in ESCC and paired normal tissues and batches. We were blinded to the group allocation during the experiments. The plots of the first two principal components are shown in Supplementary Fig. 22. Based on these figures, the samples are clearly divided into two groups, tumour and normal tissue, according to the second component. It is reassuring that none of the tumour or normal tissues had been contaminated, swapped or misplaced. None of the batches in tumour or normal tissues showed batch effects in PCA plots. We further performed regression analysis using batch ID and first 10 PCs and none of the regression coefficients was significant (all P40.05), suggesting that technical batch effects in our data sets were reasonably small and thus unlikely to influence high-level analyses. Genes with Po0.05 (paired t-test) and expression fold change (41.2 or o0.8) in tumours comparing with normal tissues were considered significantly differently expressed. To estimate the effect of copy-number change of each gene on its mRNA expression, Spearman's correlations were calculated between gene copy number and its corresponding mRNA expression. Correlations were considered significant and positive when Po0.05 and r40.3.
Comparison of ESCC with other types of cancer. To compare SNV pattern of ESCC with other types of cancers, WES data of 384 HNSCC and 179 LUSCC samples from the TCGA project and 141 EAC samples from the TumorPortal database 70 were obtained. To compare CNV and RNA patterns, CNV and RNA data in 94 ESCC, 519 HNSCC, 512 LUSCC, 439 STAD and 79 EAC samples were obtained from the TCGA project. We used the Hierarchical Clustering to assess the SNV, CNV and RNA differences across different cancer types. PCA method was also used to assess the CNV and RNA differences. For the Hierarchical Clustering, mutation frequency or arm-level data were hierarchically clustered in R based on Euclidean distance using the Ward's method and RNA data were hierarchically clustered using the average linkage algorithm with 1 minus Pearson correlation coefficient as the dissimilarity measure.
Cell lines. ESCC cell lines, KYSE30, KYSE150 and LUSCC cell line, NCI-H520, were cultured in RPMI 1640 medium supplemented with 10% foetal bovine serum (FBS). HNSCC cell line FaDu was cultured in Eagle's Minimum Essential Medium supplemented with 10% FBS. All cell lines were cultured in an atmosphere containing 5% CO 2 at 37°C. The KYSE series were generous gifts from Dr Y. Shimada at the Kyoto University. The LUSCC and HNSCC cell lines were purchased from the Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences (Beijing, China). All cell lines used in this study were authenticated by short tandem repeat profiling and tested for absence of mycoplasma contamination (MycoAlert, Lonza Rockland).
siRNA transfection. We mixed three siRNAs targeting the same gene as the siRNA-mix. Cells were transfected with siRNA-mix and Lipofectamine 2000 (Invitrogen) according to the manufacturer's instructions. In brief, 24 h before transfection, cells were plated onto a six-well plate (Corning, USA) at 30-50% confluence. The next day, cells with about 90% confluence were transfected with 2 ml Lipofectamine 2000 and 50 nM siRNA control or corresponding siRNA mix. After transfection, cells were incubated at 37°C for 6 h and the medium was then replaced with fresh medium containing 10% FBS. After 48-72 h, cells were collected for further experiments. All siRNA oligos were purchased from GenePharma and their sequences are shown in Supplementary Data 16.
Quantitative real-time PCR. Total RNA from cells was extracted with TRIzol (Invitrogen) according to the manufacturer's instruction. First-strand cDNA was synthesized using the PrimeScript 1st Strand cDNA Synthesis Kit (Takara). Relative RNA expression levels determined by quantitative real-time PCR were measured using the SYBR Green method on an ABI Prism 7,900 sequence detection system (Applied Biosystems) with SYBR Premix Ex Taq (Takara, Japan), 50 ng cDNA and 1 mM gene specific primers in a 25 ml reaction mixture. Reactions were performed with an initial 30-s denaturation step at 95°C, followed by 40 cycles of 95°C for 5 s and 60°C for 30 s. Independent experiments were done in triplicate. GAPDH was employed as an internal control. All gene specific primers are shown in Supplementary Table 7.
Cell proliferation and migration or invasion assays. Cells were plated in 96-well plates and each well contains 2,000 cells in 200 ml cell suspension after transfection with siRNA-mix. Cell numbers were measured with CCK-8 (Dojindo) at 24, 48, 72 and 96 h after incubation. The migration and invasion ability of each cell line was evaluated, respectively, using the 24-well chambers (8 mm pore size; Corning). For migration assays, 5-10 Â 10 4 cells in 150 ml serum-free RPMI 1640 medium were added into the upper chamber. For invasion assays, cells were added after coating filters with matrigel. RPMI 1640 medium with 20% FBS was used as a chemoattractant in the lower chamber. After 24-h incubation at 37°C, cells were fixed in methanol and stained with 0.5% crystal violet. After removing cells on the upper chamber, cells that migrated to the bottom side of the membrane were counted. Comparison with siRNA control was performed with unpaired t-test.
Data availability. The raw sequencing data of this study have been deposited in the Genome Sequence Archive of Beijing Institute of Genomics, Chinese Academy of Sciences (http://gsa.big.as.cn/) with accession number PRJCA000354. Data that support the findings of this study are available from TCGA database (http://cancergenome.nih.gov) and TumorPortal (http://www.tumorportal.org)