Landscape of somatic single nucleotide variants and indels in colorectal cancer and impact on survival

Colorectal cancer (CRC) is a biologically heterogeneous disease. To characterize its mutational profile, we conduct targeted sequencing of 205 genes for 2,105 CRC cases with survival data. Our data shows several findings in addition to enhancing the existing knowledge of CRC. We identify PRKCI, SPZ1, MUTYH, MAP2K4, FETUB, and TGFBR2 as additional genes significantly mutated in CRC. We find that among hypermutated tumors, an increased mutation burden is associated with improved CRC-specific survival (HR = 0.42, 95% CI: 0.21–0.82). Mutations in TP53 are associated with poorer CRC-specific survival, which is most pronounced in cases carrying TP53 mutations with predicted 0% transcriptional activity (HR = 1.53, 95% CI: 1.21–1.94). Furthermore, we observe differences in mutational frequency of several genes and pathways by tumor location, stage, and sex. Overall, this large study provides deep insights into somatic mutations in CRC, and their potential relationships with survival and tumor features.

C olorectal cancer (CRC) is the third most common cancer and the second leading cause of cancer deaths worldwide 1 . CRC is a complex disease caused by multiple environmental, lifestyle, and genetic risk factors. Exposures to exogenous and endogenous factors, aberrant DNA editing, and defective DNA maintenance cause mutations and epigenetic alterations, which confer cellular transformation and growth, leading to the development of CRC.
Next-generation sequencing (NGS) has identified a diversity of driver mutations in genes and altered signaling pathways in CRC [2][3][4][5] . Limitations of these studies are the scarcity of clinical data, as well as the inability to achieve statistical significance due to small sample size. Advances in DNA extraction from archived formalin-fixed paraffin-embedded (FFPE) tissues and sequencing have enabled us to utilize a sizeable collection of CRC cases with available clinical data from the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO) and the Colon Cancer Family Registry (CCFR). Here, we present a systematic mutation analysis of 2105 CRC cases using targeted, deep sequencing data. We constructed a custom AmpliSeq panel of 205 genes, prioritized from the analyses of CRC mutation datasets and literature review 2,6,7 , and conducted targeted deep sequencing on DNA from FFPE tumors and matching normal tissues from five-wellcharacterized studies. The profiling of mutations in the largest population-based CRC sequencing study to date provides a deep insight into the mutational landscape of CRC and associations with survival.

Results
Targeted sequencing. To construct an AmpliSeq panel, we prioritized a list of 205 genes selected based on analysis of whole exome sequencing data for CRC from TCGA 2,6 , and two prospective cohort studies, the Health Professionals' Follow-Up Study (HPFS) and the Nurses' Health Study (NHS) 7 , as well as a literature search. We included homopolymer repeats to evaluate microsatellite stability (see Supplementary Methods for panel design). The DNA extraction, sequencing, and mutation calling are described in detail in "Methods" and Supplement sections. We successfully sequenced tumor DNA (from FFPE tissue) and normal DNA (primarily from blood) from 2105 CRC cases recruited across five observational studies that collected survival data, CORSA, OFCCR, SCCFR, CPS-II, and DACHS. The mean sequencing coverage of tumor and normal DNA was 857× and 302×, respectively.
Overall, we observed that HM tumors were less likely to be diagnosed at stage IV than non-hypermutated (NHM) tumors (4% vs. 10%, respectively), and more likely to arise in right-sided CRC (76% vs. 24%). We also observed that CRC-specific survival was significantly more favorable among individuals with HM tumors than among those with NHM CRC (HR = 0.36, 95% CI: 0.24-0.54). This association was consistent regardless of stage at diagnosis or tumor site, and was not impacted by adjustment for these variables. Associations with survival were also consistent among both those with (HR = 0.24, 95% CI: 0.10-0.58) and without somatic POLE exonuclease domain mutations (HR = 0.41, 95% CI: 0.26-0.65).
Frequently mutated genes. We defined gene mutations based on the presence of non-silent mutations. As expected, we observed substantial differences in the mutational frequency of genes between NHM tumors and HM tumors (Fig. 2). In NHM, the most frequently mutated genes based on non-silent mutations included APC, TP53, KRAS, SYNE1, PIK3CA, FBXW7, SOX9, RYR1, and SMAD4 (Fig. 2a). These genes also harbored nonsilent mutations in the HM tumors, but were mutated at different frequencies in the HM set. Among the HM tumors, SYNE1, RYR1, RNF43, and KMT2D were the most commonly mutated genes. Several of the frequently mutated and some of the less frequent mutated genes in NHM and HM tumors were classified as significantly mutated by MutSigCV (q < 0.1; Fig. 2a). In addition to previous known genes, we identified PRKCI, SPZ1, MUTYH, MAP2K4, FETUB, and TGFBR2 as significantly mutated by MutSigCV, which had not been reported in previous studies 2,4,6,7 , suggesting putative driver status of these genes. Validation of 84 mutations from these genes by Sanger sequencing showed 98.8% correct calls.
In analyses across all CRC cases, we examined associations between gene-level mutations and CRC-specific survival, accounting for multiple testing and restricting our analyses to genes with at least 10 CRC deaths in those with non-silent mutations. Median survival time ranged from 60.6 to 194.1 months across the included studies (Supplementary  Table 1), and the study-specific proportion of participants who died from CRC ranged from 14.8 to 27.0%. No gene-level mutations were significantly associated with CRC-specific survival after adjusting for age, sex, mutational burden and study (Supplementary Data 2). Similarly, we did not find any gene significantly associated with CRC-specific survival when we stratified the analyses by hypermutation status, although such stratified analyses were based on smaller numbers.
Alterations in main signaling pathways implicated in CRC. We conducted a detailed analysis of primary pathways implicated in CRC 2 . A total of 82% of tumors carried non-silent mutations in genes belonging to more than one signaling pathway, though ARTICLE NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17386-z differences in mutation frequencies were observed between HM and NHM tumors (Fig. 2b). In NHM tumors, 77% of tumors have mutations in the WNT/beta-catenin pathway, followed by TP53/ ATM (62%), receptor tyrosine kinases/RAS (RTK/RAS, 50%), transforming growth factor-beta (TGF-beta, 21%), and IGF2/ phosphatidylinositide 3-kinases (PI-3-kinase, 17%) pathways. In HM tumors, 97% of tumors were mutated in WNT/beta-catenin signaling genes, followed by TGF-beta (80%), RTK/RAS (74%), TP53/ATM (48%), and IGF2/PI-3-kinase (46%) pathways. Contributions of mutated genes in individual pathways are shown in Fig. 2c and Supplementary Fig. 3. Overall, 96% of NHM and 100% of HM tumors displayed at least one non-silent mutation in     a gene belonging to the main signaling pathways implicated in CRC. Compared to the NHM, HM tumors had more alterations in multiple pathways. WNT/CTNNB1(beta-catenin) signaling pathway: In the WNTsignaling pathway, APC most frequently carried non-silent mutations. Among APC mutated tumors, 99% of NHM and 85% of HM tumors harbored truncating mutations occurring within the first 1600 codons, for which truncating mutations are predicted to have the most functional consequences 8 (Supplementary Fig. 4). Approximately 19% of all NHM and 22% of all HM tumors carried two or more non-silent mutations in APC.
We did not observe a significant association between having any WNT-signaling pathway gene mutation and CRC survival (Supplementary Data 2) or between CTNNB1 hotspot mutations and CRC survival.
TP53/ATM pathway: TP53 was the second most commonly mutated gene in NHM tumors (60%), but less often mutated in HM tumors (32%). In contrast, the ATM gene was more frequently mutated in HM tumors (23%) than NHM tumors (4%; Fig. 2 and Supplementary Fig. 3). In TP53 76% of the mutations were missense which were predominantly in the DNA binding domain, and 23% of the mutations were truncating which were distributed along the entire length of the protein ( Supplementary  Fig. 4). All in-frame indels in TP53 (1%) were found in the DNA binding domain. Approximately 5% of TP53 mutated tumors carried two or more non-silent mutations in TP53.
The presence of a non-silent somatic mutation in TP53 was associated with modestly poorer CRC survival in all cases combined (HR = 1.27, 95% CI: 1.01-1.59, Supplementary Data 2). When categorizing somatic mutations by deleteriousness based on residual transcriptional activity, the association with CRCspecific survival was more pronounced (Table 1). TP53 residual activity was determined using the International Agency for Research on Cancer (IARC) TP53 database 9 (See "Methods"). Among individuals with a somatic mutation in TP53, those with 0% predicted residual TP53 activity had significantly poorer CRC-specific survival compared to those with mutations causing >5% predicted residual activity or those with no mutations (HR = 1.53, 95% CI: 1.21-1.94) after adjusting for age, sex, mutational burden, and study. This association was primarily driven by results among NHM cases (HR = 1.52, 95% CI: 1.19-1.94), among whom the observed association was modestly impacted by adjustment for stage at diagnosis (HR = 1.36, 95% CI: 1.06-1.95). TP53 mutations that resulted in minimal residual activity (>0% but <5%) were also associated with a non-statistically significant poorer CRC-specific survival (HR = 1.38, 95% CI: 0.89-2.13). TP53 mutations were less common in right-sided vs. left-sided CRC, and the proportion of tumors in right-sided CRC decreased with decreasing TP53 residual activity. However, adjustment for tumor site did not impact the observed association of TP53 mutation status with CRC survival, nor did these associations vary substantially by tumor site.
Receptor tyrosine kinase (RTK)/RAS pathway: Among genes analyzed in this pathway, KRAS was the most frequently mutated, followed by BRAF and NRAS in NHM tumors and BRAF, KRAS, and ERBB3 in HM tumors ( Fig. 2 and Supplementary Fig. 3). As expected, the frequently mutated codons in KRAS and NRAS were G12, G13, and Q61 and in BRAF was V600 (Supplementary Figs. 4 and 5). In the RTK/RAS pathway, we demonstrate that most mutations in KRAS, NRAS, BRAF, ERBB2, and ERBB3 are mutually exclusive (Supplementary Fig. 3). The presence of a mutation in the RTK/RAS pathway was not significantly associated with CRC survival overall, or when stratified by HM status (Supplementary Data 2). More specifically, the presence of a BRAF V600E mutation was not associated with CRC survival, even among cases diagnosed with distant metastatic disease, and the presence of known oncogenic mutations in KRAS was only modestly associated with poorer survival (Supplementary Data 2). Although these observations are somewhat contrary to prior studies 10 , they are consistent with results based on a large meta-analysis on previously measured tumor marker data including studies that are included here (Phipps et al., in press Gastroenterology).
TGF-beta superfamily pathway: Members of the TGF-beta superfamily are frequently mutated in sporadic and hereditary CRC causing loss of TGF-beta signaling and its anti-proliferative effects 11 . We sequenced genes encoding for the ligand GDF5, cell membrane-anchored receptors (ACVR2A, ACVR1B, BMPR1A, BMPR2, TGFBR1, and TGFBR2), intracellular receptor-regulated R-Smads (SMAD2, SMAD3), and SMAD4, which is a mediator of signal transduction to the nucleus to regulate expression of target genes. While multiple TGF-beta signaling genes often harbored non-silent mutations in HM tumors, mutated genes appear to be mutually exclusive only in NHM tumors ( Supplementary Fig. 3).
Overall, the presence of a mutation in the TGF-beta pathway was associated with less favorable survival for those with NHM CRC (HR = 1.40, 95% CI: 1.06-1.86), but the opposite was true for individuals with HM CRC (HR = 0.44, 95% CI: 0.18-1.06). Among those with NHM tumors, the presence of a somatic mutation in SMAD4 was most strongly associated with poorer survival (HR = 1.58, 95% CI: 1.13-2.22). However, none of these associations remain significant after adjusting for multiple comparisons.
IGF2/PI-3-kinase pathway: In the IGF2/PI-3-kinase pathway, mutually exclusive non-silent mutations were found in the PIK3CA, PIK3R1, and PTEN genes in NHM and HM tumors ( Supplementary Fig. 3). 97% of mutations in PIK3CA, the catalytic subunit of Phosphoinositide-3-kinase (PI3K), are missense or in-frame indels and 51% of mutations in the regulatory subunit, PIK3R1, and 49% of mutations in the negative regulator, PTEN, are frameshift, splicing, and truncating mutations. The most mutated gene in this pathway, PIK3CA, showed recurrent mutations in codons R88, E542, Q546, H1047. The oncogenic gain of function mutations in codons E542 and H1047 have been described to activate the AKT pathway 12 .
The presence of a mutation in the IGF2/PI-3-kinase pathway was not significantly associated with CRC survival overall or when stratified by HM status (Supplementary Data 2).
MSS-HM tumors without non-silent mutations in POLE and POLD1. Approximately 3% of tumors (n = 64) were MSS-HM and without non-silent mutations in POLE or POLD1. These tumors were frequently mutated in APC (73%), TP53 (50%), and KRAS (45%) (Supplementary Fig. 6). The most frequently mutated pathways in these tumors included the WNT-signaling pathway, RTK/RAS signaling pathway, TGF-beta superfamily signaling pathway, and IGF2-PI3K signaling pathway. Slightly more tumors occurred on the right side (53.1%). Across hypermutated tumors, MSS tumors with and without mutations in POLE/POLD1 and tumors exhibiting MSI showed similar frequency of different types of mutations, such as exonic, intronic, UTR, and intergenic regions ( Supplementary Fig. 7).

Mutated genes by tumor characteristics and sex.
We observed what has been well described by others, that MSI status differs by tumor site and sex, with MSI occurring more frequently in females and in right-sided CRC ( Table 2). After accounting for multiple comparisons and adjusting for age, sex, mutational burden, MSI status and study, we observed statistically significant differences in mutation status among right-sided versus left-sided tumors for several genes ( Table 2 and Supplementary Data 3), including the KRAS, TP53, BRAF, BCL9, AMER1, and FBXW7, as well as for several pathways, including RTK/RAS, TP53/ATM, TGF-beta, and IGF2/PI-3-kinase. When stratified by tumor stage, we found that genes mutated in the IGF2/PI3K pathway occurred more frequently in stages 2 and 3 compared to stage 1 tumors (OR = 1.48, p-value 7.0 × 10 −3 ; Supplementary Data 4). Whereas, genes mutated in the WNT-signaling pathway were less frequent in stages 2 and 3 compared to stage 1 tumors (OR = 0.67, p-value 6.6 × 10 −3 ). In addition, results were suggestive for an increased frequency of SMAD4 mutations in stages 2 and 3 tumors compared with stage 1 tumors (OR = 1.91, p-value 1.8 × 10 −3 ; Supplementary Data 4). Furthermore, mutations in BRAF and the RTK/RAS pathway occurred more frequently among females (BRAF: OR = 0.37, p-value 2.0 × 10 −6 ; RTK/RAS pathway mutation: OR = 0.76, p-value 2.5 × 10 −3 ; Supplementary Data 5).

Discussion
In this study, we provide a detailed look at the mutational profile and its link to survival in over 2000 CRC cases. As expected, the most frequently non-silent mutated genes belong to the WNT, TP53/ATM, receptor tyrosine kinase, TGF-beta, and PI-3-kinase pathways; however, mutational frequency varied substantially by hypermutation status. We not only found that HM tumors were associated with improved survival, but also that an increased number of mutations within HM tumors was associated with improved survival. When looking at specific genes while accounting for multiple comparisons, we found that mutations in TP53 with 0% predicted transcriptional activity were associated with poorer survival. Furthermore, we observed differences in mutational frequency of several genes and pathways by tumor location, stage and sex.
As tumors with a large number of mutations are linked to response to immunotherapy due to the larger number of potential neoantigens 13-16 , we had a closer look at the HM tumors which account for almost 20% of all CRC. About two-thirds of HM tumors are MSI which occur through defective DNA repair by germline and somatic mutations or promoter methylation in MMR genes that leads to an increased mutational burden. Among the MSS tumors that account for the remaining one-third of HM tumors we observed an enrichment of POLE mutations frequently resulting in an ultra-mutated phenotype. Given our large sample size we were not only able to show that HM tumors were associated with improved survival but further able to show that within the HM tumors an increasing number of mutations impacted survival positively. This is consistent with the observation that an elevated neoantigen load is associated with high-lymphocytic infiltration and improved CRC-specific survival 7 . In phase II trials, MMR deficient tumors were responsive to the immune checkpoint inhibition using pembrolizumab, nivolumab, and the combination of nivolumab and ipilimumab 13,14,17 which has led to FDA approvals of Opdivo (nivolumab) with or without Yervoy (ipilimumab) and Keytruda (pembrolizumab), for treating tumors with MSI or deficient MMR. While currently such treatment is limited to cases with MSI or MMR deficiency, our data show that these cases only account for about two-thirds of all HM tumors and that nearly one-third of all HM tumors are MSS. These HM-MSS cases may also benefit from checkpoint inhibitor and neoantigen vaccination immunotherapies, particularly as a sizable fraction is even ultra-hypermutated. Accordingly, immunotherapy studies should investigate if the subset of MSS tumors with a large number of mutations would also benefit from this promising treatment. Accordingly, the ongoing clinical trial, NCT01876511 (https://clinicaltrials.gov), extended inclusion to HM-MSS cases. Immunotherapy benefits are also observed in lung cancer with a high tumor mutational burden 18 .
Somatic mutations in TP53, which are present in the majority of cancers 3 , are associated with poorer clinical outcomes in several cancer types, including CRC 19 , consistent with our finding. However, analyses of TP53 somatic mutations and p53 function in relation to cancer outcomes, including CRC, have resulted in inconsistent findings [19][20][21] . This may be due, in part, to discrepant approaches to defining mutation status, sample sizes, methodologies, and population characteristics. Importantly, when we account for transcriptional activity in the definition of gene mutation status, the relationship between TP53 mutation status and CRC survival became substantially more pronounced. These results demonstrate that while the classification of a mutation as silent or non-silent is relevant for determining the functional effect of a mutation, additional functional annotations may aid in more accurate characterization of the putative effect of mutations on clinical outcomes. Consistent with our finding, a study of CRC and a study of glioma and gastric adenocarcinoma found differences in survival outcomes when accounting for p53 transcriptional activity 22,23 as defined using the IARC TP53 database 9 . Furthermore, a CRC study found that the type of TP53 DNA binding domain mutation affected CRC survival outcome 24 ; while using a different functional definition and analyzing a subset of patients, this finding is in line with ours. Mutations in TP53 could result in a gain of oncogenic function, reduced degradation, and dominant-negative effect on the wild-type protein. As such, heterozygous mutations without the loss of heterozygosity could further reduce the activity of p53 protein from wild-type allele by altering the ratios of mutant and wildtype p53 proteins and by generating p53 tetramers with reduced p53 activity. As more information about the functional effects of mutations is gained across other genes, it can be expected that future studies will be able to better characterize mutation phenotypes and clinical impacts of specific somatic mutations. Fortunately, higher throughput functional assays are becoming increasingly available to make such detailed analyses possible.
Our findings suggested that SMAD4 mutations are more frequent in stage 2/3 compared to stage 1 CRC, and are modestly associated with poorer survival; although results were not significant after adjusting for multiple comparisons. In line with this, SMAD4 loss, as determined by immunohistochemistry, was recently reported to be associated with worse CRC survival 25 . The study further showed that SMAD4 loss was associated with resistance to chemotherapy, and decreased tumor immune  25 . Large deletions in chromosome 18q were not measured in the present study. Somatic differences in CRC along the colonic axis from caecum to rectum have been well established. Our findings for differences in MSI, KRAS, BRAF, TP53, and FBXW7 mutations are in line with previous reports of tumor-site differences in these somatic mutations 26,27 . In addition, one study of 516 patients with stage 2 and 3 tumors identified site-specific differences in RTK/RAS, PI-3kinase, and TGF-beta pathways, which is consistent with our findings, though they did not find statistically significant differences in the mutational frequencies in the TP53/ATM pathway 27 . However, it appears that several studies evaluating gene and pathway mutation frequencies by tumor site did not account for MSI status and mutational burden. Many genes that we identified as having different mutational frequencies by tumor site, while accounting for MSI status and mutational burden, are not as well described in the literature, and additional studies are needed to confirm these. Consistent with our findings, BRAF mutations have been reported to be more prevalent in females 28 . In summary, our findings provide a more detailed understanding of the tumor heterogeneity and how that heterogeneity pertains to prognosis.
In cancers, mutations in CTNNB1 are often found in the hotspot region encoding for codons D32 to S45 2,29,30 . This region contains phosphorylation sites needed for the removal of excess beta-catenin 31 . The beta-catenin degradation complex is formed by APC, AXIN1, and AXIN2, which recruits casein kinase 1 alpha to phosphorylate beta-catenin at S45, and subsequent phosphorylation by GSK3-beta at T41, S37, and S33 32,33 . While S45 phosphorylation priming and subsequent phosphorylation of T41, S37, and S33 are required for the degradation of beta-catenin, the D32 and G34 are also essential for its interaction with BTRCP, a specificity component of ubiquitination machinery 33 . Mutations in this region can stabilize beta-catenin resulting in its cytoplasmic accumulation, nuclear translocation, and transcription of cell cycle-and growth-related genes. Interestingly, we observed that CTNNB1 hotspot mutations were enriched in tumors without inactivating mutations in APC. Treatment of these tumors will require direct inhibition of the beta-catenin/ TCF transcriptional complex.
We identified PRKCI, SPZ1, MUTYH, MAP2K4, FETUB, and TGFBR2 as significantly mutated new genes. These genes have been implicated in cancer. Reduced levels of PRKCI correlated with the worst survival in CRC patients, and its deficiency is linked to inflammation and tumorigenesis in mice 34 . SPZ1 promotes epithelial-mesenchymal transition and oncogenesis in liver cancer 35 . This transcription factor functions in the MAPK signaling pathway to stimulate cell proliferation and oncogenesis 36 . MUTYH is involved in oxidative DNA damage repair by removing improperly paired adenine with 8-oxoG 37 . Its deficiency is associated with MUTYH-Associated Polyposis syndrome. As MUTYH is transcriptionally regulated by TP53 38 , in tumors with defective TP53, decreased levels of MUTYH could affect DNA repair resulting in accumulation of mutations in cancer genes. The MUTYH mutational signatures persistent with mispaired adenine and 8-oxoG occur frequently in CRC genes including, APC, KRAS, PIK3CA, and SMAD4 37 . Downregulation of MAP2K4, a key mediator of Jun N-terminal kinase signaling, was associated with poor prognosis in CRC patients 39 . FETUB encodes for a cysteine protease inhibitor. Its expression in CRC cell lines and its tumor-suppressing activity in vivo in mice have been described 40 . TGFBR2 is an essential receptor in the TGFbeta signaling pathway, which is often mutated in CRC 2,3,5-7 . Its disruption resulted in invasive intestinal tumors in inflamed mucosa in mice 41 . Our large dataset and a robust mutation calling method have allowed identification of these genes as significantly mutated driver genes.
Independent of activation of EGFR, mutations that activate KRAS, BRAF, or NRAS relay signal to the nucleus to promote cell growth and survival through the RAS-RAF-MAPK pathway. Tumors with these mutations exhibit resistance to anti-EGFR therapy with panitumumab and cetuximab 10,42 . Among the EGF receptor family members, activating mutations in ERBB2 and its dimerization partner, ERBB3, have also been described in CRC [43][44][45] . A subset of cetuximab-resistant tumors with deregulated ERBB2 pathway was responsive to concurrent treatments with cetuximab and the ERBB2 inhibitor trastuzumab 46 . The oncogenic S310F mutation in ERBB2 responded to inhibition by neratinib, afatinib, and trastuzumab 43 . Treatments of cells harboring V104M mutation in ERBB3 with ERBB antibodies and other inhibitors blocked oncogenic signaling 45 . These two somatic mutations in ERBB2 (S310F, 6% of ERBB2 mutated) and ERBB3 (V104M, 10% of ERBB3 mutated) are found in our CRC cases. These results show that in patients with failed response to EGFR blockade, in addition to KRAS, BRAF, and NRAS sequencing of ERBB2 and ERBB3 is important, as these patients may benefit by ERBB signaling blockade using antibodies or small molecule inhibitors. Furthermore, while most BRAF mutated tumors contain the well-known V600E oncogenic mutation, we also found mutations in neighboring codons D594, L597, and K601 that can lead to resistance to anti-EGFR therapy. Accordingly, extensive mutation profiling of RTK/RAS signaling will allow to identify cases with better treatment options.
The MSS-HM tumors without non-silent mutations in POLE and POLD1 genes appear to have a distinct mutation profile than HM tumors with MSI or POLE/POLD1 mutations. Although most frequently mutated in APC, TP53, and KRAS, resembling MSS-NHM tumors, they are also mutated in other genes with higher frequencies. This subset of tumors comprises 3% of all tumors, and 16% of all HM tumors. These tumors are also likely candidates for immunotherapy like other HM tumors with MSI or mutations in POLE/POLD1 genes.
Our study has some limitations and strengths. We only conducted targeted sequencing on 205 genes. However, these genes were selected based on whole exome sequencing of 1211 CRCs. This large sample allowed us to identify any significantly mutated genes with a variant allele frequency of ≥2.5% 4 . Thus, we expect to have coverage of all important common and even infrequently mutated genes in our panel. While our chosen sequencing technology worked well for DNA from FFPE, we were not able to reliably measure copy number alterations with this technology. Given budget constraints, we were also not able to measure other mutational features, such as epigenetic changes or RNA-Seq, which would have allowed us to more comprehensively capture the various tumor characteristics. Our sample size also limited our ability to examine associations or patterns of alterations in smaller subgroups (e.g., stage strata). However, we were able to conduct the targeted sequencing analysis in over 2000 cases, making this one of the largest studies of somatic mutations in CRC to date. We were able to sequence at a very high coverage, conduct extensive quality control analyses that enable us to account for expected FFPE artifacts and validate mutations using orthogonal methods demonstrating the high quality of our data (for details see "Methods" and Supplementary Methods).
In summary, our study provides insights into the mutational profile of CRC and its potential link to survival. We are providing several findings which lay the foundation to advance better strategies, including personalized approaches for prevention, diagnosis, and treatment for CRC.

Methods
Study populations. We performed targeted sequencing on 2105 cases within GECCO and CCFR. GECCO is an international collaboration to study 130,000 CRC cases from 70 studies from North America, Australia, Asia, and Europe. GECCO focuses on the identification and characterization of genetic risk factors, gene-environment interactions, and the impact of germline genetic, environmental, and lifestyle risk factors on the tumor genome, microbiome, immune response, and survival. The CCFR is a National Cancer Institute-supported consortium consisting of six centers dedicated to the establishment of a comprehensive collaborative infrastructure for interdisciplinary studies in the genetic epidemiology of CRC. The CCFR includes data from~42,500 total subjects in 15,000 families (10,500 probands, 26 Targeted sequencing. We extracted tumor DNA from FFPE sections and isolated matching normal DNA from the blood, buccal, saliva, or adjacent normal colonic FFPE tissues. Tumor tissue was macrodissected from slides guided by a H&E stained slide marked for the tumor regions. All tumors underwent a pathology review to confirm that the tumor was a primary colorectal carcinoma. We extracted DNA from FFPE tissue using the QIAamp DNA Mini or QIAamp DNA FFPE tissue kits and normal DNA from other tissues using standard DNA extraction methods. DNA concentrations were determined by Quant-iT PicoGreen dsDNA Assay or the Qubit dsDNA HS Assay kits. DNA extracted from FFPE tissues was subjected to repair by using the PreCR Repair Mix (New England BioLabs, Ipswich, MA). AmpliSeq target amplification was performed using 20 ng of genomic DNA for each of the 2 AmpliSeq primer pools. Following removal of primers, PCR products from each pool were combined and subjected to end repair and A-tailing using the KAPA HyperPrep Kit (Roche). Adapter ligation was performed using the NEXTflex DNA barcodes Kit (PerkinElmer) and libraries were analyzed on High Sensitivity TapeStation and submitted for cluster generation. Barcoded DNA sequence libraries were pooled using 48 samples for tumors and 48 or 192 samples for normal DNA. Paired-end sequencing was performed on HiSeq 2500 using the Illumina Genome Analyzer operating procedure. Paired-end reads were aligned to the reference human genome (GRCh37/hg19) using Burrows-Wheeler Aligner (BWA-MEM version 0.7.9a). Local realignments and base quality recalibrations were performed on aligned data. Only reads aligned uniquely to the reference human GRCh37/hg19 genome assembly were used in downstream analysis.
Calling somatic variants and significantly mutated genes. We called somatic single nucleotide variants (SNV) using Strelka v1.0.15 47 and MuTect v1.1.7 48 , and took the intersection of mutation calls. We annotated somatic mutation calls by ANNOVAR. A set of additional filters were used such as strand bias, minor allele frequency in Exome Aggregation Consortium (ExAC), read-depth, alternative read-depth, and clustered read position. We further applied an amplicon artifact filtering to remove cases where mutant allele frequency varied across read clusters. We obtained indel calls using majority votes from VarScan2 v2.4.3 49 , VarDict (Feb 2017) 50 , and Strelka v1.0.15 47 . After initial filtering of indels based on coverage and mutant allele frequency, we noticed some background signals of alternative reads in normal samples. Thus, we used read counts from tumors and normal samples to construct a background filter to remove indel calls in a subset of samples where signals were not significantly higher than background. We used MutSigCV 51 to define significantly mutated genes. For more details on the quality control, calling and analysis see Supplementary methods, Supplementary Fig. 8, and Supplementary Tables 2 and 3. To define hypermutation status, we plotted point mutations for all samples and observed two very distinct peaks. The minimum value between the two peaks is 23 point mutations per sample (17 mutations per million bases), which we used as a cut-off for defining hypermutation status (Supplementary Fig. 9).
Validation of point mutations and indels. We evaluated calls for randomly selected 96 point mutations and 91 indels by Sanger sequencing. Results were used to improve mutation calling accuracy. Subsequently, we conducted a validation study using Sequenom as an orthogonal technology for point mutations and indels. For point mutation calls, we observed false positive and false negative rates of 0.3% and 4.1%, respectively, with a sensitivity of 95.9% and a specificity of 99.7%. As the validation for indels showed room for improvement, we used the data to further tune our calling algorithms. Subsequent Sanger sequencing for another validation of 109 indels showed 93.6% correct calls. In HM-MSS tumors without non-silent mutations in POLE and POLD1 validation of randomly selected mutations (n = 63) by Sanger sequencing showed 95% correct calls.
MSI status calling. We called MSI status using mSINGS 52 . Briefly, we established a baseline reference using control samples from peripheral blood and for each of the 169 microsatellite loci included in our panel design, we quantified and compared the number of differently sized repeats in tumor samples to the baseline for the same locus. A locus was considered unstable if the number of mutated alleles exceeded the baseline reference by three times the standard deviation. To define an MSI positive tumor, we evaluated the fraction of unstable microsatellite loci out of the total number of loci analyzed as well as a qualitative separation of samples (a cutoff fraction of 10% unstable loci, Supplementary Fig. 10). Of 2105 tumor samples, we assigned 310 MSI positive with a mean fraction of 0.27 unstable loci (range = 0.13-0.45; SD = 0.066) and 1795 MSI-negative with a mean fraction of 0.04 unstable loci (range = 0.01-0.12; SD = 0.019). As an additional way to validate calls, we compared classification of MSI status results for participants that had both existing tumor marker data and determined tumor characteristics from the targeted sequencing data. The classifications from orthogonal approaches were highly concordant with 98.6% concordance for the 1534 individuals with information on MSI status from both sources.
Definition of mutated gene and pathways. Mutations: We defined gene mutations based on the presence of non-silent mutations as determined by ANNO-VAR 53 refGene annotations. A SNV was considered to be non-silent if it was annotated as exonic and nonsynonymous, stop-gain, stop-loss, or splicing. An indel was considered to be non-silent if it was annotated as exonic and a frameshift deletion, frameshift insertion, in-frame deletion, in-frame insertion, stop-gain, or stop-loss. For a subset of genes, we further refined definitions based on known annotations regarding functional effects of mutations (e.g., V600E mutation in BRAF).
Pathways: We defined primary pathways implicated in CRC 2 for downstream analyses (see Supplementary Fig. 3 for list of genes included in each pathway). A pathway was considered mutated if any gene within that pathway had a non-silent mutation.
Statistical analyses. We used Bonferroni corrected p-values to assess statistical significance, accounting for the number of genes or pathways tested in each type of analysis.
Survival analyses: We used Cox proportional hazards regression to estimate adjusted HRs and 95% CIs for the association of mutated genes and pathways with CRC-specific survival. Person time accrued from the date of diagnosis to the date of death or the end of follow-up. Cases who died from causes other than CRC were censored at the date of death. We examined proportional hazards assumptions by testing for a nonzero slope of the scaled Schoenfeld residuals as a function of survival time. All analyses were adjusted for age at diagnosis, sex, and study. Primary analyses were conducted without adjustment for hypermutation status, to allow for the possibility of effect modification by this attribute; however, in instances in which associations were observed to be similar across HM and NHM cases, we conducted additional analyses adjusting for hypermutation status. Primary analyses of survival were also not adjusted for stage at diagnosis, to account for the fact that meaningful associations between somatic mutations and survival could operate via an impact on disease aggressiveness (and therefore stage). In instances in which we observed significant differences in the prevalence of mutated genes or pathways by stage at diagnosis or tumor site, we also carried out analyses stratified by and adjusted for these attributes.
Case only tumor comparison: We used a χ 2 test to compare gene and pathway mutation frequencies in HM and NHM tumors. To evaluate potential differences in mutated genes and pathways by tumor site, tumor stage, and sex, we ran unconditional case-only logistic regression analyses adjusting for MSI status. We defined left-sided tumors as those occurring in the splenic flexure, descending colon, sigmoid colon, rectosigmoid junction, and rectum, and right-sided tumor as transverse colon, hepatic flexure, ascending colon, and cecum. We excluded appendix and anal cancer cases from all analyses. We did not stratify further by rectum as the TCGA analysis did not observe substantial differences between colon vs rectum 2 . In analyses comparing mutation frequency by tumor site, we used leftsided tumors as the referent category. Stage 1 cases served as the referent category in analyses of mutation frequency by stage at diagnosis. In analyses comparing mutation frequency by sex, we used females as the referent category.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All data generated or analyzed during this study are included in this published article (and its supplementary information files). The original sequencing data are available at the database of Genotypes and Phenotypes (dbGaP, accession phs002050.v1.p1). IARC TP53 data are available at https://p53.iarc.fr. NATURE COMMUNICATIONS | https://doi.org/10.1038/s41467-020-17386-z ARTICLE NATURE COMMUNICATIONS | (2020) 11:3644 | https://doi.org/10.1038/s41467-020-17386-z | www.nature.com/naturecommunications Received: 13 June 2019; Accepted: 23 June 2020; policies of the National Cancer Institute or any of the collaborating centers in the Colon Cancer Family Registry (CCFR), nor does mention of trade names, commercial products, or organizations imply endorsement by the US Government, any cancer registry, or the CCFR. PMH study is funded by National Institutes of Health (R01 CA076366 to P.A.N).