Genome-wide association study identifies tumor anatomical site-specific risk variants for colorectal cancer survival

Identification of new genetic markers may improve the prediction of colorectal cancer prognosis. Our objective was to examine genome-wide associations of germline genetic variants with disease-specific survival in an analysis of 16,964 cases of colorectal cancer. We analyzed genotype and colorectal cancer-specific survival data from a consortium of 15 studies. Approximately 7.5 million SNPs were examined under the log-additive model using Cox proportional hazards models, adjusting for clinical factors and principal components. Additionally, we ran secondary analyses stratifying by tumor site and disease stage. We used a genome-wide p-value threshold of 5 × 10–8 to assess statistical significance. No variants were statistically significantly associated with disease-specific survival in the full case analysis or in the stage-stratified analyses. Three SNPs were statistically significantly associated with disease-specific survival for cases with tumors located in the distal colon (rs698022, HR = 1.48, CI 1.30–1.69, p = 8.47 × 10–9) and the proximal colon (rs189655236, HR = 2.14, 95% CI 1.65–2.77, p = 9.19 × 10–9 and rs144717887, HR = 2.01, 95% CI 1.57–2.58, p = 3.14 × 10–8), whereas no associations were detected for rectal tumors. Findings from this large genome-wide association study highlight the potential for anatomical-site-stratified genome-wide studies to identify germline genetic risk variants associated with colorectal cancer-specific survival. Larger sample sizes and further replication efforts are needed to more fully interpret these findings.

The global incidence of colorectal cancer (CRC) has been increasing while the mortality rate has been decreasing 1,2 . Advances in scientific knowledge, treatment modalities, and medical screening programs are considered among the major factors contributing to improved survival from this disease 1,3,4 . In the USA, Canada, Australia, and Europe, 5-year survival is around 65% [5][6][7][8] . Currently, CRC prognostication relies primarily on clinicopathological features with a primary focus on tumor characteristics, such as stage. Several additional factors, in relation to both the tumor (e.g. KRAS and BRAF mutations) and the individual, have been associated with survival times and clinical outcomes [9][10][11][12][13][14][15][16][17][18] . Germline genetic variants are commonly investigated as candidate prognostic markers; they are abundant in the human genome, are polymorphic among patients, are thought to remain unchanged over time, and may biologically modify disease characteristics and risk of progression or clinical outcomes [19][20][21] . These characteristics of germline genetic variants, therefore, make them attractive for cancer research studies.
There is now a large body of research on genetic variants in relation to CRC incidence, yet no variants have been confidently associated with CRC survival or used in clinical practice. Many approaches have been attempted in survival outcomes studies, such as candidate SNP, gene, or pathway analyses 22,23 , including the examination of associations with patient outcomes for variants identified in susceptibility studies. Compared with other study designs, genome-wide association studies (GWAS) offer a comprehensive, agnostic approach. Some CRC survival GWAS have been performed [24][25][26][27] , which have identified a small number of genetic variants at the genome-wide significance level 25,26 . While these studies have advanced the knowledge of the genetic basis of CRC survival, they have also been limited by relatively small number of cases, restricting the ability to identify modest associations or low frequency risk variants, which may only be apparent when large case cohorts are examined.
In this study, we evaluated germline genetic loci associated with CRC-specific survival using data from 16,964 CRC participants included in an international consortium comprising 15 epidemiologic and clinical studies. As a secondary goal, we evaluated stage-and tumor site-specific associations between genetic loci and CRC survival.

Results
Participant demographic and clinical characteristics are provided in Table 1. Median follow-up time after diagnosis was 13.8 years. Overall, 6,033 (36%) CRC cases died during follow up, of which 4,010 deaths (66%) were attributed to CRC. As expected, participants with stage 4 tumors at diagnosis were more likely to die from CRC (of those who died, 51% were stage 4 compared with 6% stage 1). Participants were ~ 50% female with a median age of 67 years (range 20-94 years).
No substantial systemic inflation was identified from quantile-quantile (QQ) plots ( Supplementary Fig. 1). No variants reached genome-wide significance for the primary GWAS (  Table 3). Two variants were statistically significant at a genome-wide P-value threshold among proximal colon tumors and one among distal colon tumors (Table 2, Fig. 2). No variants reached genome-wide significance among rectal tumors. The significant variants identified among proximal colon tumors were located on chromosomes 12 (rs189655236, hazard ratio [HR] = 2.14, 95% confidence interval [CI]: 1.65-2.77, p = 9.19 × 10 -9 ) and 14 (rs144717887, HR = 2.01, 95% CI: 1.57-2.58, p = 3.14 × 10 -8 ). Both variants were low frequency (MAF 1.4% and 1.5%, respectively) and neither were in linkage disequilibrium (LD; defined as R 2 > 0.6) with nearby variants. The rs189655236 variant was located within the intronic region of BORCS5 and rs144717887 was located in an intergenic region. The variant significantly associated with CRC survival among distal colon tumors was located in an intergenic region on chromosome 14 (rs698022, HR = 1.48, CI 1.30-1.69, p = 8.47 × 10 -9 ), was common with a minor allele frequency (MAF) of 11%, and was not in LD (R 2 < 0.6) with nearby variants. The two chromosome 14 variants identified in proximal and distal colon analyses were not in linkage disequilibrium with each other. None of the statistically significant variants were predicted to have regulatory effects (ranking scores = 4 or 5) as reported in the RegulomeDB database.

Discussion
In this analysis of common genetic variants in a sizeable study population, we did not identify any SNPs associated with CRC prognosis at the genome-wide significance level. We also found no SNPs associated with survival for specific tumor stages. However, our results suggest that there may be variants that predict CRC-survival for distal and proximal colon cancer cases.
The distal and proximal regions of the colon differ biologically and in terms of tumor incidence rates 28,29 . In addition, research shows that tumors located in these anatomical subsites display differences in molecular alterations involved in tumorigenesis, and are characterized by different disease progression and prognosis [30][31][32] . The identification of different sets of variants with survival for cases with distal and proximal colon tumors in this study is, therefore, not surprising. Using eQTLGen, rs189655236 was predicted to be in cis-eQTL with DUSP16, which has been associated with chemotherapy resistance in colorectal cancer 33 . However, none of the identified SNPs were predicted to have putative regulatory functions using RegulomeDB, and according to the www.nature.com/scientificreports/ dbSNP database, they are within intronic or intergenic sequences 34 . Also, the two significant variants for proximal colon tumors are very low frequency. These considerations lead us to interpret our findings with caution pending further investigation. This study has several strengths. By leveraging data from 15 population-based and clinical studies, we were able to confidently examine with good statistical power genetic associations with CRC survival. Covariates were well characterized with detailed information on epidemiologic and clinical factors which allowed us to conduct subgroup analysis by stage at diagnosis and tumor anatomical location. In addition, we had a relatively long follow-up period and cause of deaths were uniformly ascertained. We also used an agnostic discovery-based approach to identify variants associated with CRC survival.
Our study also has some limitations. Although we had a large enough sample size to identify significant SNPs in CRC cases with European ancestry, we were unable to evaluate other ancestry groups. Additionally, we were unable to evaluate other tumor markers that might be associated with survival in our population. Another www.nature.com/scientificreports/ limitation inherent to the GWAS approach is the high likelihood of false-negative findings due to the stringent P-value threshold for genome-wide significance. This threshold is set to account for multiple testing and is designed to reduce the number of false-positive findings; however, a consequence of this stringency is that some important SNP-survival associations may have been missed. Finally, no replication analysis or functional follow-up was conducted.
In summary, in this largest yet GWAS for CRC specific survival, our analyses indicate that genetic variants in the form of SNPs are unlikely to explain variable risk of death from colorectal cancer in people of European ancestry. However, a few SNPs were identified that may be prognostic markers for distal or proximal colon cancers and these should be further examined in other populations, including cases from other ancestry groups.

Study population. Analyses utilized data from the International Survival Analysis in Colorectal Cancer
Consortium (ISACC), a compilation of participants with incident, invasive CRC obtained from clinical trials, case-control, and cohort studies from around the world. Study participants included people of European genetic ancestry diagnosed with invasive CRC and with available genotyping and CRC-specific survival data (as described in the Supplementary Methods). The following 15 ISACC studies were included: the Cancer Prevention Study-II (CPS-II) 35 , the German Darmkrebs: Chancen der Verhutung durch Screening Study (DACHS) 36 , the Diet Activity and Lifestyle Study (DALS) 37 , the Early Detection Research Network (EDRN) 38 , the European Prospective Investigation into Cancer (EPIC) 39 , the Health Professionals Follow-up Study (HPFS) 40     www.nature.com/scientificreports/ Ascertainment of CRC-specific survival. Protocols for survival outcomes assessment in this study population have been described previously [35][36][37][38][43][44][45]49,50,53,[55][56][57][58][59][60][61] . Briefly, studies ascertained vital status via linkage to the National Death Index, state cancer registries, state death records, or population registers with cause of death verified by death certificates (CPSII, DACHS, DALS, EPIC, MCCS, UKB, VITAL), or via active follow-up (CCFR, HPFS, NHS, PHS, PLCO, WHI, N9741) with dates and cause of death confirmed via regional mortality databases, review of death certificates and/or medical records by trained adjudicators. In all studies, cases alive at the most recent study follow-up or data linkage were censored on that date. In VITAL, individuals who moved outside of Washington State were censored at their date of move. CRC-specific survival was calculated as days from diagnosis to CRC-related death or end of follow-up. Individuals who died from causes other than CRC were censored at the time of death.
Tumor stage and location classification. Tumor stage was obtained from pathology and registry reports at the time of diagnosis. The Surveillance, Epidemiology, and End Results (SEER) summary stage categorizations of localized, regional, and distant were used, also incorporating extent of disease information when available. Additionally, the American Joint Committee on Cancer (AJCC) TNM classification of malignant tumors (TNM) categorizations were utilized to assign values I through IV.
Tumor location was obtained from registry and pathology reports. Location was grouped based on ICD-9 codes as follows: (1)  Genotype data. Genotyping methods have been reported previously [62][63][64][65][66] . Briefly, genomic DNA was extracted from blood or buccal samples using conventional methods, and samples were genotyped using the platforms listed in Supplementary Table 1. Each genotyping platform dataset underwent standard quality control analyses, including exclusion of samples and SNPs with low call rates (< 97% and < 98%, respectively), exclusion of variants departing from Hardy-Weinberg Equilibrium (p < 1 × 10 -4 ), exclusion of individual with discrepant reported and genotyped sex based on X chromosome heterozygosity, and exclusion of duplicates and individuals that were second-degree or more closely related based on identity by descent (IBD) calculations. Additionally, we inferred genetic ancestry using principal components analysis and excluded individuals of non-European ancestry from analyses due to small sample sizes. Participants with a value within one standard deviation of the median for the first and second eigenvectors were categorized as European genetic ancestry and included in the analysis (Supplementary Methods). A total of 16,964 individuals passed quality control filtering. Only variants passing quality control analyses and with missing call rates ≤ 2% were used for imputation.
Phasing and imputation were performed on each pooled set of studies with the same or similar genotyping platforms. Autosomal variants were phased using SHAPEIT2 and imputed to the Haplotype Reference Consortium panel release 1.1 (~ 39 million variants) using the University of Michigan Imputation Server [67][68][69] . Genotype probabilities were converted to allelic dosages. Evaluation was restricted to variants with MAF ≥ 1% and imputation accuracy R 2 > 0.3. A total of 7,829,749 genetic variants were included in the analyses. All imputed and cleaned individual-level genotype data were pooled for survival analyses.
We used PLINK (v1.9) for principal components analysis on pruned sets of autosomal variants obtained by removing regions with extensive long-range linkage disequilibrium. The first five principal components were used as coviariates to account for population substructure in analysis.
Statistical analysis. We used Cox proportional hazards regression to estimate HRs and 95% CIs for associations of each genetic variant with CRC-specific survival. A log-additive model was used, relating variant genotype dosage to CRC-specific survival. All models were adjusted for age at diagnosis, sex, a categorical variable encompassing genotyping platform and study, and five principal components to account for population substructure. The proportional hazards assumptions for age and sex were evaluated by testing for a non-zero slope of the scaled Schoenfeld residuals on ranked failure time 70 . The tests for both age and sex were statistically significant (p < 0.05), suggesting the proportionality assumption may not hold. For age, the non-proportionality was resolved by including both a continuous and categorical variable (dichotomized at median age; ≤ 57 years versus > 57 years). We stratified our analysis by sex.
We evaluated QQ plots of log-transformed p-values and calculated genomic control coefficients to assess for possible systemic inflation. We produced Manhattan plots and specified a genome-wide statistical significance level of p ≤ 5 × 10 -8 . We performed statistical analyses using R version 3.5.2.
In silico analyses. The NCI 'LDassoc' web tool (https:// ldlink. nci. nih. gov/) was used to evaluate LD (defined as R 2 > 0.6) in 1000 Genomes Phase 3 'EUR' population) for SNPs of interest 71 . The putative functional effects of variants were inferred based on information in the RegulomeDB database 72 . This database ranks variants ranging from 1-7 such that lower ranks represent variants with greater predicted regulatory impact (https:// regul omedb. org/ regul ome-help/). For example, eQTLs (expression quantitative trait loci) have a rank of 1; variants locate in a transcription factor binding motif and DNAse peak have a rank of 4; and variants that locate in a