Introduction

Colorectal cancer is the third most commonly diagnosed cancer and the second leading cause of cancer death worldwide, with over 1.8 million new cases and 0.9 million deaths in 20201. Remarkably, colorectal cancer is also the most common cause of cancer death in six countries and ranks among the top three leading causes of cancer death in 104 countries2. Therefore, there is an urgent clinical need to provide more effective survival prediction tools to reduce colorectal cancer mortality and improve patients’ outcome. It is well known that clinical and pathologic characteristics (e.g., clinical stage) are important prognostic factors in predicting survival outcomes3,4. In addition, recent studies have suggested that genetic biomarkers also play vital roles in determining the risk of cancer outcomes5; for example, one study demonstrated the clinical ability of genetic variants for predicting the recurrence and death of renal cell carcinoma6.

To date, genome-wide association studies (GWASs) have identified over 200 single nucleotide polymorphisms (SNPs) associated with the risk of colorectal cancer7,8. Interestingly, these risk-associated variants have contributed to the development of polygenic risk score (PRS), a valuable method that aggregates the modest effect of each SNP, which has been demonstrated to be effective in identifying high-risk individuals of developing colorectal cancer9,10,11. However, the genetic architecture of colorectal cancer survival outcome has not been widely estimated. Noteworthily, survival probability is another critical indicator, that can reflect the tumor burden and prognosis of disease patients12. In particular, our previous study demonstrated the limited clinical utility of risk-based PRS in predicting cancer survival, emphasizing that a polygenic prognostic score (PPS) is needed instead for determining the genetic risk of death among colorectal cancer patients13.

Notably, recent prospective studies have indicated that a healthy lifestyle (e.g., healthy diet) could significantly influence the risk of death among patients with colorectal cancer14,15. For example, Zutphen et al. found that improving individual lifestyle after colorectal cancer diagnosis could reduce the risk of all-cause mortality by approximately 20%15. However, whether there is a joint effect of pathologic characteristics, genetic risk, and healthy lifestyle on colorectal cancer progression remains unclear.

In this study, we performed a genome-wide survival association meta-analysis of colorectal cancer in East Asian (EAS) and European (EUR) populations; and developed a robust PPS that can be used to stratify colorectal cancer survival; and further evaluated the benefit of adherence to a healthy lifestyle in reducing the risk of death, particularly in the subset of patients with a high pathologic stage or grade, and a high genetic risk.

Results

Study design

Here, a three-stage study design was applied (Fig. 1). In the first derivation stage, leveraging two independent colorectal cancer survival GWAS datasets (i.e., NJCRC and UK Biobank cohorts), we performed a meta-analysis to identify survival-associated genetic loci, as well as eight candidate PPSs with different approaches. In the second validation stage, we assessed the discriminatory accuracy of each PPS in an independent longitudinal cohort from The Cancer Genome Atlas (TCGA) to determine an optimal PPS framework for 5-year overall survival prediction. In the third testing stage, using the external ZJCRC cohort and Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial, we further estimated the efficacy of the optimal PPS in colorectal cancer survival prediction, and evaluated the joint effect of pathologic stage or grade, genetic risk and healthy lifestyle (Supplementary Table 1) on the prognosis of colorectal cancer patients.

Fig. 1: Summary of the study design.
figure 1

QC quality control, MAF minor allele frequency, HWE Hardy-Weinberg Equilibrium, LD linkage disequilibrium, LASSO least absolute shrinkage and selection operator, TCGA The Cancer Genome Atlas, AUC area under the curve, PLCO Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial.

Meta-analysis of colorectal cancer survival GWASs

In the derivation stage, leveraging the genetic and clinical data of colorectal cancer patients from NJCRC (1082 cases of EAS ancestry) and UK Biobank (2621 cases of EUR ancestry; Supplementary Fig. 1) cohorts (Table 1), we performed a meta-analysis to identify genetic variants associated with colorectal cancer overall survival (Supplementary Fig. 2A). No residual population stratification was observed (lambda = 1.027; Supplementary Fig. 2B).

Table 1 Basic characteristics of study subjects

Notably, we found two independent variants that were significantly associated with colorectal cancer overall survival beyond the suggestive genome-wide significance (PCox < 5 × 10−6), namely the rs10967103 [9p21.2; hazard ratio (HR)meta = 1.70, Pmeta = 4.05 × 10−6] and rs79067806 (12q12; HRmeta = 1.89, Pmeta = 4.14 × 10−6; Supplementary Table 2; Supplementary Fig. 2C, D). However, there were no SNP-gene expression associations reported in the Genotype-Tissue Expression (GTEx) project for rs10967103 and rs79067806. In addition, although these two SNPs were located nearby previously reported risk-related regions, they were not observed to be associated with the risk of colorectal cancer in a previous GWAS meta-analysis of case-control studies9 [35,145 cases and 288,934 controls; rs10967103: odds ratio (OR)meta = 1.02, Pmeta = 0.449; rs79067806: ORmeta = 1.00, Pmeta = 0.955; Supplementary Table 3].

Construction and validation of PPSs with multiple approaches

Subsequently, we aimed to construct and validate a solid PPS for colorectal cancer survival prediction. Among the eight candidate PPSs (Table 2), seven were significantly associated with an increased risk of all-cause death in the TCGA cohort (470 patients) of EUR ancestry, with HR per standard deviation (SD) increase ranging from 1.47 (P = 0.001) for the clumping and P value thresholding (i.e., C + T) method (parameter of P value: 1 × 10−4) to 1.99 (P = 1.76 × 10−8) for the random survival forest (RSF) method.

Table 2 Performance of polygenic prognostic scores derived from different approaches in the TCGA cohort

Notably, the RSF approach-based PPS that harbored 287 SNPs (defined as PPS287; Supplementary Data 1) achieved the optimal discriminatory ability for 5-year overall survival prediction, with a time-dependent area under the receiver operating characteristics (ROC) curve (AUC) of 0.652. We then divided the patients into high- and low-PPS groups, with the median score of PPS287 as a cut-off value. Compared to patients in the low-PPS group, those carried with high-PPS had shorter overall survival (log-rank P < 0.001) in the validation (i.e., TCGA cohort; Supplementary Fig. 3A) datasets. In addition, the calibration and time-dependent ROC curves of the PPS287 model showed good agreement between the predicted and observed 5-year survival probability (Supplementary Fig. 3B), as well as excellent performance in 5-year survival prediction (Supplementary Fig. 3C).

Testing the optimal PPS in external cohorts

We further evaluated the performance of PPS287, the optimal PPS, in two external cohorts, namely the ZJCRC cohort (543 patients of EAS ancestry) and PLCO cohort (713 patients of EUR ancestry). As expected, PPS287 was significantly associated with an increased risk of all-cause death in both the ZJCRC (HR per SD = 1.90, P = 3.21 × 10−14) and PLCO (HR per SD = 1.80, P = 1.11 × 10−9; Supplementary Table 4) cohorts. Similar associations were also found between PPS287 and 3-year or 5-year colorectal cancer overall survival. The AUCs at 5-year were 0.649 in the ZJCRC cohort and 0.658 in the PLCO cohort, which were similar with the predictive accuracy in the validation cohort (i.e., TCGA).

In addition, using the median score as a cut-off to divide the low- and high-PPS subgroups, patients in the high-PPS group had poorer overall survival than patients carried with low-PPS in the two cohorts (ZJCRC: log-rank P = 7.68 × 10−9; PLCO: log-rank P = 3.82 × 10−5; Fig. 2A). Interestingly, when stratified by clinical factors (e.g., sex, age, smoking status and drinking status), the high-PPS was still broadly and significantly associated with poorer prognosis in the two cohorts (HR > 1; Supplementary Fig. 4A, B). Similar results were also observed in the sensitivity analyses (Supplementary Table 5).

Fig. 2: Prognostic evaluation of the optimal polygenic prognostic score (i.e., PPS287) in the ZJCRC and PLCO cohorts.
figure 2

A Kaplan–Meier curves for overall survival probability stratified by different levels of PPS (based on median value) in the ZJCRC and PLCO cohorts. B Calibration curve of different prognostic models for predicting 5-year survival probability in the ZJCRC and PLCO cohorts. The vertical error bars denote the 95% CI. C Time-dependent ROC curves of different prognostic models regarding 5-year survival probability in the ZJCRC and PLCO cohorts. The traditional model included sex, age, smoking status and drinking status for the ZJCRC cohort; and sex, age, smoking status, drinking status, stage and grade for the PLCO cohort. The combined model included both traditional factors and PPS. The sample sizes of ZJCRC and PLCO cohorts are 543 and 713 cases. Note: PLCO Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial, PPS polygenic prognostic score, ROC receiver operating characteristics, AUC area under the curve, 95% CI 95% confidence interval.

Additional benefits of PPS to the clinical prognostic model

In the ZJCRC and PLCO cohorts, several clinical factors associated with the overall survival of colorectal cancer were identified (Supplementary Tables 6 and 7), including age (ZJCRC: HR = 1.05, P = 8.33 × 10−10; PLCO: HR = 1.05, P = 5.21 × 10−5), stage (PLCO: HRtrend = 2.82, Ptrend = 4.69 × 10−34) and grade (PLCO: HRtrend = 2.53, Ptrend = 2.48 × 10−11). After adjusting for these clinical variables with a multivariate Cox regression analysis, higher PPS287 remained to be an independent prognostic factor for predicting overall survival (ZJCRC: HR = 3.24, P = 1.05 × 10−10; PLCO: HR = 2.25, P = 2.72 × 10−5) in the two cohorts.

To evaluate the additional prognostic value of PPS287 to the traditional clinical model, we constructed a combined Cox regression model by integrating PPS287 with several common clinical factors for each cohort (ZJCRC: sex, age, smoking status and drinking status; PLCO: sex, age, smoking status, drinking status, stage and grade). Compared to the traditional model, the calibration curve of the combined model showed better agreement between the predicted and observed 5-year overall survival (Fig. 2B).

In addition, the AUCs at 5-year overall survival prediction of the traditional prognostic model were 0.644 in the ZJCRC cohort and 0.807 in the PLCO cohort, while those of the combined model were 0.699 and 0.834, respectively (Fig. 2C), indicating that the predictive accuracy of the combined prognostic model was significantly higher than that of the PPS or traditional models alone in the two cohorts (PAUC < 0.01; Supplementary Table 8). Similar results were also observed using more evaluation metrics (e.g., Harrell’s C index and Royston and Sauerbrei’s R2D; Supplementary Table 9), as well as the decision curve analysis (DCA; Supplementary Fig. 5A, B), demonstrating the additional value of PPS in colorectal cancer survival prediction.

Joint effects of pathologic characteristics, genetic risk and healthy lifestyle on overall survival of colorectal cancer

Subsequently, given that the PLCO cohort included sufficient lifestyle information, we calculated an integrated healthy lifestyle score and aimed to evaluate the joint effect of pathologic stage or grade, genetic risk and healthy lifestyle on the prognosis of colorectal cancer patients in the PLCO cohort (Supplementary Table 10). Broadly, there was a notable dose-response manner on decreasing overall survival probability in the pattern of higher stage/grade, higher genetic risk (higher PPS), and unfavorable lifestyle (lower lifestyle score) (log-rank P = 4.86 × 10−19; Fig. 3A), but no second-order multiplicative interaction between them was observed (Pinteraction = 0.145). In particular, patients with a high stage/grade, a high genetic risk and an unfavorable lifestyle had a 27-fold increased risk of death than those with a low stage/grade, a low genetic risk and a favorable lifestyle (HR = 28.15, P = 3.68 × 10−9; Fig. 3B).

Fig. 3: The overall survival probability of colorectal cancer patients according to different levels of pathologic stage or grade, genetic risk, and healthy lifestyle in the PLCO cohort.
figure 3

A Kaplan–Meier curves for overall survival probability stratified by different levels of pathologic stage or grade, genetic risk and healthy lifestyle. B The association of pathologic stage or grade, genetic risk and healthy lifestyle with overall survival of colorectal cancer patients. The HR and 95% CI were derived from the Cox regression model with the adjustment of sex, age, research center, arm and top 10 principal components. The number in the bracket indicates the number of deaths/number of all cases. The horizontal error bars denote the 95% CI. The sample size of PLCO cohort is 713 cases. Note: PLCO Prostate, Lung, Colorectal and Ovarian Cancer Screening Trial, HR hazard ratio, 95% CI 95% confidence interval.

Interestingly, when stratifying patients by the categories of stage/grade and genetic risk, although few significant associations were observed, patients with colorectal cancer who maintained a healthy lifestyle could experience a lower risk of death (HR < 1; Table 3) than those who followed an unfavorable lifestyle. Especially, among patients with a low stage/grade and a low genetic risk, the overall survival rate ranged from 65.78% (unfavorable lifestyle) to 92.90% (favorable lifestyle; P = 0.042). Notably, among patients with a high stage/grade and a high genetic risk, the 5-year overall survival rate of those with an unfavorable lifestyle decreased to 41.9%, which could be increased to 49.52% among those with a favorable lifestyle (difference = 7.62%).

Table 3 The association of pathologic stage or grade, genetic risk and healthy lifestyle with overall survival of colorectal cancer patients in the PLCO cohort

Clinical application of the integrated prognostic model

To further apply the integrated model including clinical stage/grade, PPS287 and healthy lifestyle score in clinical practice, we developed a ColoRectal Cancer Survival Prediction System (CRC-SPS, http://njmu-edu.cn:3838/CRC-SPS/), including (i) “Colorectal cancer survival summary statistics” and (ii) “Colorectal cancer survival prediction” modules. The “About” page provides more details about the functions of this web server.

On the “Colorectal cancer survival summary statistics” page, when users enter a batch of SNP IDs, or enter a genetic region, a table [with chromosome ID, SNP ID, SNP genomic position, SNP alleles (A1: effect allele; A2: reference allele), effect allele frequency (EAF), beta, standard error (SE) in NJCRC and UK Biobank cohorts, and corresponding associations of meta-analysis] will be built. Users can download the results by clicking the “Download” button. Besides, users can select one SNP-survival pair and click the ‘Plot’ button, the diagrams of Kaplan–Meier plot will be provided to display the associations among the two cohorts.

On the “Colorectal cancer survival prediction” page, CRC-SPS can help users estimate individual 5-year overall survival probability, with the PLCO cohort as a reference dataset. In brief, users can easily input their sex, age, lifestyle information (e.g., smoking status) and clinical characteristics (e.g., clinical stage) along with the genotypes of 287 SNPs to obtain an estimated 5-year survival probability. In addition, we provided the 5-year survival probability (i.e., 77.1%) in the PLCO cohort as a reference threshold, to stratify the population into subgroups with high and low risk of death. For example, the colorectal cancer patient with a predicted 65.8% of 5-year survival probability was grouped as having a high risk of death.

Discussion

In the current study, we performed an EAS-EUR meta-analysis of colorectal cancer survival GWASs and found two suggestive genome-wide significant genetic loci (9p21.2 and 12q12) associated with colorectal cancer overall survival. Furthermore, we constructed and validated a robust PPS framework (PPS287), independent of clinical factors, that could effectively stratify colorectal cancer survival in three independent longitudinal cohorts. Notably, the detrimental effect of pathologic characteristics and genetic risk on the prognosis of colorectal cancer could be attenuated by adherence to a healthy lifestyle.

Although previous GWASs have identified multiple SNPs associated with colorectal cancer risk, few studies have focused on the genetic architecture of survival outcomes16,17,18. For example, Wills et al. performed a survival GWAS among 1926 patients with advanced colorectal cancer, and supported rs79612564 (2q34) in ERBB4 as a predictive biomarker of survival, as evidenced by the replication stage of independent colorectal cancer patients17. Here, leveraging the meta-analysis of EAS and EUR populations, we uncovered two variants, rs10967103 (9p21.2) and rs79067806 (12q12), linked to overall survival in colorectal cancer with substantial effect sizes (both HRs >1.5). Interestingly, these two prognostic variants were not associated with colorectal cancer susceptibility, indicating the diverse genetic background between the initiation and progression of colorectal cancer, which was consistent with previous findings13,19. Therefore, it will be necessary to identify variants carried with stronger effect sizes and increased statistical power among larger longitudinal populations, and to systematically decode the inconsistent features of the genetic architecture underlying the susceptibility and progression of colorectal cancer.

In recent decades, cumulative evidence has suggested the clinical utility of genetic biomarkers in estimating the risk of cancer death and improving patients’ survival outcomes5,20,21. It is noteworthy that inherited germline variants (i.e., SNPs) are fixed at conception and do not change over time; therefore, they are considered as robust and cost-efficient biomarkers for personalized medicine. Currently, PRS, defined as a weighted sum of a set of risk-associated SNPs, has been demonstrated to be effective in identifying individuals at high risk of developing diseases22,23. For example, we ever developed a EAS-EUR PRS framework derived from genome-wide SNPs that can effectively predict colorectal cancer risk in EAS and EUR populations, indicating the potential application of PRS in colorectal cancer risk stratification9. However, there was no significant association between PRS and the increased risk of cancer mortality among cancer patients, as evidenced by several prospective studies19,24 and our previous findings13. Therefore, considering the limited clinical utility of PRS in disease survival evaluation, we proposed a robust PPS287 framework, independent of clinical factors, that could be used for colorectal cancer survival stratification in EAS and EUR populations, as evidenced by three independent cohorts. Notably, compared to low-PPS287 patients, the subgroup with high PPS287 showed poorer prognosis, and these patients could be recommended for colorectal cancer personalized therapy.

Importantly, by integrating different categories of pathologic characteristics (i.e., clinical stage or grade), genetic risk and healthy lifestyle, we developed an analytical framework for colorectal cancer survival stratification. Interestingly, adherence to a healthy lifestyle could attenuate the risk of death, especially evident among patients with low stage/grade and low genetic risk (P < 0.05). Notably, among patients with a high stage/grade and a high genetic risk, the 5-year overall survival rate of an unfavorable lifestyle could be increased by 7.62% with adherence to a favorable lifestyle, further emphasizing the public notion that a healthy lifestyle among colorectal cancer patients can lead to an evident reduction in death14,15.

Our study has several strengths. First, we performed a EAS-EUR meta-analysis of colorectal cancer survival GWASs and identified two significant variants associated with overall survival of colorectal cancer. Second, we proposed and validated a robust PPS framework that could be effectively used for colorectal cancer survival stratification among EAS and EUR populations. Third, leveraging the information of pathologic characteristics, genetic risk and lifestyle, we developed a user-friendly web server to generate a customized estimate of 5-year survival probability for colorectal cancer patients, for use as a potential tool in personalized survival prediction. Nevertheless, we also need to acknowledge some limitations. First, we only included a total of 3703 colorectal cancer patients (i.e., NJCRC and UK Biobank cohorts) for the survival-based meta-analysis, with the limitation of statistical power for detecting genome-wide significant loci; thus, more datasets should be included when available in the future. Second, clinical stage and grade, as important prognostic factors, are not available in some cohorts (i.e., UK Biobank and ZJCRC), which should be further included for survival evaluation; besides, additional survival outcome-related factors (e.g., treatment) are also needed to be considered. Third, the lifestyle or other confounding factors were derived from the baseline questionnaire in the PLCO cohort, which could not reflect the dynamic changes during the follow-up after colorectal cancer diagnosis; thus, more detailed surveillance is also needed. Fourth, only EAS and EUR ancestry groups were included for PPS construction, other ethnic groups (e.g., African Americans and Hispanics), as well as more sophisticated methods should be considered in the future work. In addition, the model performance and benefit of healthy lifestyle maintenance need to be further validated using a larger longitudinal population with sufficient follow-up time and sample size.

In conclusion, leveraging the colorectal cancer survival GWAS meta-analysis and multi-center cohorts, we constructed and validated a robust PPS framework that could effectively predict colorectal cancer survival among EAS and EUR populations. Importantly, we also provided further evidence that a healthy lifestyle could attenuate the detrimental impact of pathologic characteristics and genetic risk on colorectal cancer progression, which could shed additional light on precision clinical management of colorectal cancer.

Methods

Study subjects

Derivation stage

NJCRC cohort of EAS ancestry

The subjects in the NJCRC cohort were recruited from the National ColoRectal Cancer Cohort (NCRCC), including 1082 Chinese patients, being part of the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO). Detailed information can be found in the Supplementary Methods9,25.

UK Biobank cohort of EUR ancestry

The UK Biobank cohort (https://www.ukbiobank.ac.uk/) is a prospective, population-based study that recruited 502,528 adults aged 40–69 years from the general population between April 2006 and December 201026. After applying individual-level filtering criteria (Supplementary Methods), a total of 2621 incident colorectal cancer cases of EUR ancestry were retained for our analysis27. This study was conducted using the UK Biobank Resource under Application #45611.

Validation stage

TCGA cohort of EUR ancestry

TCGA (https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga) is a joint cancer genomics program of the National Cancer Institute and National Human Genome Research Institute that began in 200628. Over the past decade, TCGA has collected more than 20,000 primary cancer and matched normal samples from over 10,000 cases across 33 cancer types. Here, a total of 470 individuals of EUR ancestry with colorectal cancer were retained for further analysis13.

Testing stage

ZJCRC cohort of EAS ancestry

The 543 Chinese colorectal cancer cases in the ZJCRC cohort were derived from the Jiashan Institute of Cancer Prevention and Treatment. The population details were described in the Supplementary Methods9.

PLCO cohort of EUR ancestry

The PLCO cancer screening trial is a cohort study that aims to evaluate the accuracy and reliability of screening methods for prostate, lung, colorectal, and ovarian cancer29. Based on the filtering criteria, a total of 713 white individuals of EUR ancestry with colorectal cancer remained in the subsequent analysis. Detailed information was described in the Supplementary Methods30. This study was approved by the ethics committees of the PLCO consortium providers (#PLCO-84).

The basic information of each cohort has been described in the Table 1, and the distribution of genetic ancestry is shown in the Supplementary Fig. 1. All participants provided written informed consent prior to data collection. Our study was approved by the ethics committee of Nanjing Medical University.

Genotyping, imputation and quality control (QC)

For each cohort, the detailed information about genotyping and imputation process is described in the Supplementary Methods. Subsequently, the imputed SNPs located in autosomal chromosomes were removed if they had (i) minor allele frequency (MAF) < 0.01; (ii) call rate <95%; (iii) Hardy-Weinberg equilibrium (HWE) P value < 1 × 10−6 and (iv) information metric (info score) <0.3.

Definition of overall survival

The follow-up time of overall survival was calculated from the date of colorectal cancer diagnosis to the date of death from any cause or the end of the follow-up period for censoring.

Meta-analysis of colorectal cancer survival GWAS

We used the Cox proportional hazards model to calculate HR and 95% confidence interval (CI) for the association between each SNP and colorectal cancer survival, separately for the NJCRC and UK Biobank cohorts, with the adjustment of corresponding covariates [NJCRC: sex, age, smoking status, drinking status, grade, stage and first 10 principal components; UK Biobank: sex, age, body mass index (BMI), smoking status, drinking status and first 10 principal components].

Furthermore, leveraging the summary statistics of the two survival GWASs (totally 3703 cases), a meta-analysis in an inverse variance-weighted fixed-effects model was performed to identify survival-associated variants across EAS and EUR ancestries, implemented by METAL software31. We then retained SNPs for subsequent analysis if they (i) passed filters in both the EAS (i.e., NJCRC cohort) and EUR (i.e., UK Biobank cohort) populations; (ii) did not show substantial heterogeneity among studies (P value for heterogeneity test ≥0.01); and (iii) harbored a significant association with colorectal cancer survival (P value for meta-analysis ≤0.001). Finally, also considering that the consistency of SNPs in at least one external dataset, a total of 300 independent SNPs (linkage disequilibrium, LD r2 < 0.1) were kept, and variants at P value < 5 × 10−6 were considered to be suggestively genome-wide significant.

In addition, we applied a colorectal cancer GWAS meta-analysis of case-control studies to evaluate the risk effect of genome-wide significant prognostic variants9. The meta-analysis was performed with totally 35,145 cases and 288,934 controls of EAS and EUR ancestries, derived from NJCRC (1316 cases and 2207 controls; EAS), BJCRC (932 cases and 966 controls; EAS), SHCRC (1116 cases and 1054 controls; EAS), ZJCRC (1046 cases and 1184 controls; EAS), BioBank Japan Project (BBJ; 7062 cases and 195,745 controls; EAS), GECCO (21,608 cases and 20,278 controls; EUR) and PLCO (2065 cases and 67,500 controls; EUR) GWASs.

Calculation of PPS

To aggregate the weak effect of individual SNPs, we calculated PPS using the following formula: \({{{{{\rm{PPS}}}}}}=\mathop{\sum }\nolimits_{i=1}^{n}{\beta }_{i}{{{\mbox{SNP}}}}_{{{\mbox{i}}}}\), where n is the number of selected SNPs, SNPi and βi are the number of effect alleles (i.e., 0, 1, 2) and weight corresponding to the i-th SNP, respectively. Using the genotype data of 300 independent SNPs, we constructed eight candidate PPSs for colorectal cancer survival prediction through four approaches, including classic clumping and P value thresholding32 (i.e., C + T, 3 scores), LASSO33 (2 scores), RSF34 (1 score), and CoxBoost35 (2 scores) methods. The details are described in the Supplementary Methods.

Calculation of healthy lifestyle score

The construction of healthy lifestyle score was based on our previous study9, of which included common lifestyle factors, and we kept lifestyle factors with low missing rate for analysis. Briefly, we calculated healthy lifestyle scores based on five common lifestyle factors in the PLCO cohort, derived from the baseline questionnaire and diet history questionnaire (DHQ), including BMI, tobacco smoking, alcohol consumption, red and processed meat intake, and vegetable and fruit intake. Each lifestyle factor was given a score of 0 or 1, with 1 representing the healthy behavior category, and the sum of the five scores was used as the healthy lifestyle score. The detailed information is shown in the Supplementary Table 1.

Statistical analysis

The Manhattan plot and quantile-quantile plot based on the -log10 (P value) were created by using R package qqman. The heterogeneity was measured using Cochran’s Q statistics and I2.

In the validation (i.e., TCGA) and testing (i.e., ZJCRC and PLCO) cohorts, we used the Cox proportional hazards model to estimate the HRs and 95% CIs for the association of PPS with colorectal cancer survival after adjusting for corresponding confounding factors. All datasets were analyzed underlying complete case analysis. The discriminatory ability of the prognostic model (i.e., Cox regression model) was evaluated using the time-dependent ROC curve [the optimal estimation of sensitivity and specificity was based on the Index of Union (IU) method36] using R package survivalROC, with a bootstrap method of 10,000 iterations for calculating 95% CI and ROC comparison. In addition, the Harrell’s C index and Royston and Sauerbrei’s R2D in Cox proportional hazards models were also used for evaluating model performance37. The DCA plot was also used to demonstrate the clinical benefit of different models at 5 years of follow-up, using R package dcurves. Participants were then classified into two genetic-risk subgroups (including low-PPS and high-PPS) according to the median value of PPS for group comparison. The Kaplan–Meier curve and log-rank test were used to evaluate the difference in overall survival probability stratified by different levels of PPS. In addition, to assess the robustness of the PPS in survival prediction, we performed the following sensitivity analyses: (i) excluded colorectal cancer patients who died during the first year of follow-up; (ii) evaluated the associations using ancestry-corrected PPS (briefly, fit a linear regression model using the first ten principal components of ancestry to predict PPS, and the residual from this model was used to create ancestry-corrected PPS)9.

In the PLCO cohort, participants were further classified into low stage/grade [i.e., low stage (stage I and stage II) and low grade (G1 and G2)] and high stage/grade [i.e., high stage (stage III and stage IV) or high grade (G3 and G4)] subgroups, as well as unfavorable (i.e., 0 and 1 lifestyle score) and favorable (i.e., ≥ 2 lifestyle score) subgroups. The log-rank test and Cox proportional hazards model were used to evaluate the association of different levels of pathologic stage/grade, genetic risk or healthy lifestyle with overall survival probability of colorectal cancer. The R package Shiny was used to construct the colorectal cancer survival prediction web server, which was freely available and open source.

All statistical analyses were performed using R software (version 4.0.3), and a two-sided P value less than 0.05 indicated statistical significance.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.