Folate is a naturally occurring, water-soluble B vitamin that cannot be produced by the human body and plays a key role in DNA formation and is necessary for cellular division and tissue differentiation. It is found abundantly in green leafy vegetables, legumes, fruits, and its more potent form, folic acid, is found in supplements and fortified foods1. Supplementary folic acid is routinely prescribed during pregnancy as an evidence-based intervention to prevent neural tube defects in utero2,3. Dietary deficiency is typically found in persons subsisting on inadequate diets, as well as chronic alcoholics with diminished absorption4. Fortification of grains with folic acid began in the early 1990s to prevent nutritional deficiencies5,6. To date, 71 countries have legislative mandates for including folate in the fortification of milled grains5. Results pre- and post-fortification and risk of CRC have been somewhat inconsistent7,8,9,10,11,12,13, suggesting that folate might play a more complex role in colorectal carcinogenesis through various interactions14,15,16. Given the complexity of the relationship between CRC and folate, there is a need to elucidate the underlying biological mechanisms and possible differential risk based on individual genetics15.

Increased folic acid consumption is known to lower circulating levels of homocysteine, a common amino acid that has been associated with numerous diseases6,17,18. The absence of folic acid leads to impaired DNA synthesis and disturbances in red blood cell maturation19. Due to its role as a carrier of one-carbon groups and in folate-mediated one-carbon metabolism (FOCM), insufficient folate consumption has been implicated as a possible cause of cancer12,20,21,22,23. Consistent with this hypothesis previous studies have shown evidence that greater folate intake is associated with a reduced risk of colorectal adenomas and cancers (CRC)11,21,24. A pooled analysis of 13 prospective studies in 2010 observed a modest effect, estimating a 2% risk reduction for CRC per 100 μg/day increase in total folate consumption25.

Candidate gene approaches targeting FOCM-related genes have shown associations with CRC risk24,26,27. This has raised interest in studying interactions between folate and genetic variants23,28. As such, it has been hypothesized that germline mutations to the enzyme 5,10-methylenetetrahydrofolate reductase (MTHFR) would be a driver of the effects on folate on CRC risk11,29,30. A common mutation, 677TT in MTHFR has been associated with a greater decreased risk of CRC in high consumers of folate and low alcohol consumption27,29,31,32 compared to lower folate consumers. However, such analyses have relied on the assumption that FOCM-related genes are the driving genetic force on the pathway from folate consumption to CRC development. A genome-wide approach has the potential to identify novel genes that may modify the folate–CRC association.

To this end, we conducted a novel set-based genome-wide analysis to test interactions between genes and total folate intake on CRC risk. By using a set-based approach we may increase the power to detect associations, which is a common issue in traditional gene–environment interaction studies. We incorporate functional annotation based weights from PrediXcan, a transcriptome prediction tool33.


Study participants

We used epidemiological and genetic data from studies included in three international CRC consortia: the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO), the Colorectal Transdisciplinary Study (CORECT) and the Colon Cancer Family Registry (CCFR). Full details have been published previously34,35, and the demographic characteristics of study participants are summarized in Table 1. We describe the study designs in Supplementary Table 1A and present results for the study design specific effects of total folate on CRC for study designs in Supplementary Table 1B. In case–control study designs, included cases were ascertained using population-based sampling and age-matched controls. In prospective cohorts, cases were identified through linkage to cancer registries. Participants with non-European ancestry were excluded due to small sample sizes among those with genetic data. Informed consent was given by all participants, and studies were approved by their respective Institutional Review Boards and complies with all relevant ethical regulations.

Table 1 Characteristics of participants from all studies by colorectal cancer case/control status and Chi-square p-values for statistical differences.

Genotype data

Details on genotyping and imputation have been reported previously36. In brief, DNA was mostly obtained from blood samples, with some from buccal swabs. Several platforms (the Illumina HumanHap 300k, 240k, 550k and OncoArray 610k BeadChip Array system, or Affymetrix platform) were used for genotyping37,38. Samples were excluded on the basis of sample call rate ≤ 97%, heterozygosity, unexpected duplicates or relative pairs, gender discrepancy and principal component analysis (PCA) outlier of HapMap2 CEU cluster. SNPs were excluded on the basis of inconsistency across platforms, call rate < 98%, and out of Hardy–Weinberg equilibrium (HWE) in controls (p < 0.0001)37. SNPs were imputed to the CEU population in Haplotype Reference Consortium (HRC version r1.0) if not directly genotyped39, and restricted by imputation accuracy (R2 > 0.3).

Genetically predicted gene expression

The sets of genetic variants and weights for predicting gene expression were downloaded from the publicly available PredictDB Repository ( The weights for the predicted gene expression were obtained by an elastic net penalized regression approach using the genome-wide variant data and transcriptome data from 169 colon tissue samples from the GTEx project (GTEx v6)40 (Supplementary material). We restricted GTEx data to the transverse colon as it included the entire colonic wall and as such the epithelial layer in the mucosa most relevant to CC development while the GTEx sigmoid colon data only included the muscle layer. Genes for which SNPs explained at least 1% of the variation in CRC risk were selected for interaction analyses. A total of 4839 genes were included.

Exposure assessment

Basic demographics and environmental risk factors were collected using in-person interviews and/or structured questionnaires35,41,42,43,44,45,46,47,48,49. For these data, we carried out a multi-step data harmonization procedure, reconciling each study’s unique protocols and data-collection instruments as discussed previously34. Folate and folic acid intake were assessed at the reference time using food frequency questionnaires (FFQs). For cohort studies, the reference time was time of enrollment or blood collection. Folate and folic acid intake in each study were determined based on micrograms per day (mcg/day) of folate from foods (i.e., dietary folate) and mcg/day of folic acid from supplements (single or multivitamins) when available. Only two of the 23 studies with dietary folate intake did not capture information regarding supplemental folate. To account for the higher bioavailability of synthetic folic acid vs. natural folate in foods, we calculated total folate intake as dietary folate equivalents (total mcg DFE = mcg of dietary folate + 1.7 × mcg folic acid from supplements)50. Because the time of enrollment for some studies overlapped or followed the period of folic acid fortification (1996–1998), these studies accounted for folic acid fortification when calculating dietary folate intake and entered dietary folate intake as mcg of natural food folate + 1.7 × mcg folic acid from fortified food (see Supplementary Table 1A). Two studies (OFCCR, DALS) entered supplement data as regular user vs. nonuser; for these, we assumed regular use was 400 mcg/day or 400 mcg/tablet (for multivitamins), which corresponds to the generic dose in supplements25,51. The primary analysis used sex-study specific quartiles of total folate using controls based on the calculated daily dietary and supplemental intake, if available. By using categorical sex-study specific quartiles we reduce the influence of outliers and skewed distributions and is consistent with the Cancer Cohort Pooling Project52. To further explore the differences in bioavailability, secondary analyses we explored sex-study specific quartiles of dietary folate and binary (yes/no) supplemental folate separately.

Statistical analysis

We used the Mixed effects Score Tests for interaction (MiSTi)53, a mixed effects score test for gene-based interaction test with folate consumption on CRC risk, to conduct a pooled analysis across all studies. MiSTi modeled the gene–environmental interaction effect by two components. The fixed effects component incorporates variant functional information from PrediXcan as weights with our genotype data to calculate the genetically predicted gene expression and then assess its interaction with folate consumption. The random effects component involves residual interaction effects that have not been accounted for by the fixed effects. We used sex- and study-specific quantiles of folate consumption. p-values were calculated separately for fixed and random effects interaction terms, after adjusting for age, sex, study, sex-study specific quartiles of total energy consumption in kcal, and principal components to account for population stratification. We used the MiSTi data-adaptive weighted combination approach to combine the fixed and random effects components.

Genes with p-values less than the Bonferroni correction (0.05/4839 = 1.03E−5) were considered genome-wide statistically significant for an interaction with folate. p-values that reached false discovery rate (FDR) at 20% were considered having suggestive evidence of interaction as it is less stringent than a Bonferroni threshold. We conducted follow-up analyses based on the fixed and random effects p-values. For associations driven by the fixed effects, we investigated the direction and magnitude of these interactions using the generalized linear model, which included all covariates in the original model, folate, standardized predicted gene expression, and an interaction term for folate and predicted gene expression. Genes for which the signal was driven by the random effects component were further investigated to identify individual variants of the gene set as drivers using the same approach with interactions for individual variants and folate while adjusting for all other variants in the gene set. Due to some of the variants having high collinearity, we pruned variants by R2 < 0.9.

All analyses were performed using R version 4.0.154.

We performed these additional follow-up analyses for MTHFR, as prior candidate gene studies have shown variants, specifically the C677T mutation, alter the association between folate and CRC31,32,55,56,57,58. We additionally include the results of the gene–environment interaction between rs1801133 (C677T mutation) per additional effect allele with sex-study specific quantiles of total folate consumption on colorectal cancer.


The final sample included 13,498 cases and 13,918 controls with both folate and energy consumption measures available from 23 studies. We present demographic characteristics of all samples and report on measures for factors associated with CRC risk for study participants by case–control status in Table 1. Cases were more likely to be male, have higher BMI, and report consuming less folate daily and more calories daily compared to controls. Multivariable logistic regression estimated a reduced risk of CRC per-quartile increase in total folate intake, adjusting for sex, age at reference, and total energy intake, and study (OR = 0.91, 95% CI: 0.89, 0.93, p-trend < 0.001, Supplementary Table 1B). Sensitivity analyses included further adjustment for smoking and alcohol consumption, which had little effect on the estimates for total folate and CRC risk.

We found no suggestion of interaction between predicted gene expression for the MTHFR gene and sex-study specific folate on risk of CRC in our analysis. Supplementary Table 2A–C present follow-up analyses conducted to test the interaction per standard deviation change in predicted gene expression within sex-study specific quantiles, allowing for a non-linear relationship between folate quantiles, as well as individual variant weights used in the modeling of predicted gene expression to capture the C677T mutation. In the snp-environment interaction analysis for the rs1801133 variant (C677T mutation), no interaction was show between each additional effect allele with sex-study specific quantiles of total folate consumption on risk of CRC (ratio of odds ratio = 1.02; 95% CI = 0.98, 1.06; interaction p-value = 0.235).

The median number of SNPs included in the gene sets was 25 (minimum: 1, inter-quartile range [IQR]: 13–43, maximum: 277). Figure 1 displays the quantile–quantile plot for the G × E test that combined both fixed and random effects using adaptive weight. While there was no G × E interaction that reached the Bonferroni threshold (0.05/4839), three did surpass the false discover rate (FDR) of 0.2.

Figure 1
figure 1

MiSTi results for Adaptive Weight test of predicted gene expression interactions with total folate intake. Dashed line is the Bonferroni corrected threshold, solid is the false discovery rate < 0.2 threshold for p-value significance.

We present the findings with p-values that surpassed the FDR threshold for gene interactions with total folate consumption and CRC risk in Table 2. We observed suggestive evidence of interactions between total folate intake and 3 independent gene sets on risk of CRC at FDR < 0.2, including Glutathione S-Transferase Alpha 1 (GSTA1; p = 4.3E−4), Tonsuko Like, DNA Repair Protein (TONSL; p = 4.3E−4), and Aspartylglucosaminidase (AGA; p = 4.5E−4). In follow-up analyses for these three genes we observed positive interactions for GST1A and AGA, showing greater risk for CRC associated with higher gene expression and increasing folate consumption (Table 3). As the signal for TONSL primarily came from the random effects, indicating one or a few variants were drivers of the association, we investigated the individual interactions of variants with sex-study specific folate. We see two variants as possible drivers of the signal in our main analysis, 8:144964455_T/C and 8:144965104, as shown in Table 4.

Table 2 Top genes from MiSTi for gene-based interactions with total folate for colorectal cancer based on Adaptive Weight Test p-values with FDR < 0.2.
Table 3 Estimated odds ratios (ORs) and 95% confidence intervals (CIs) of colorectal cancer risk per standard deviation change in predicted gene expression stratified by sex-study specific quantiles of total folate consumption.
Table 4 Variant specific ratio of odds ratios associated with per quantile increase in sex-study specific quantiles of total folate consumption on colorectal cancer risk for variants included in TONSL*.


In this sizable analysis including a large number of studies we harmonized data on folate consumption and genome-wide genetic data to investigate interactions between folate intake and variants in genes on CRC risk. We observed an inverse association between folate intake and CRC risk across 23 studies. Using our novel statistical set-based G × E mixed effects score tests, MiSTi, we identified 3 genes with suggestive interactive effects with total folate consumption on CRC risk: GSTA1, TONSL, and AGA.

We observed a positive interaction between the predicted gene expression of GSTA1 and folate for CRC risk. GSTA1 located at 6p12.2 encodes for an enzyme that functions in cellular detoxification of electrophilic compounds through glutathione metabolism. Electrophilic compounds include carcinogens, therapeutic drugs, environmental toxins, and products of oxidative stress. Glutathione is a product of homocysteine metabolism, a key amino acid correlated with folate intake, and is bound to free radicals by GSTA159. Our results suggest that folate consumption may increase remethylation of homocysteine to methionine, thus reducing the production of glutathione need for DNA repair. Mutations in GSTA1 could feasibly alter the binding affinity of glutathione to carcinogenic compounds, leading to variation in cancer susceptibility. Of the 20 SNPs included in our analyses of GSTA1, three of the alternative alleles result in missense mutations to the gene60. Compromised function of glutathione as an antioxidant due to mutations in GSTA1 in conjunction with depleted levels of glutathione due to lower homocysteine levels may be a pathway to tumorigenesis6,22. Candidate gene studies have shown no association between GSTA1 and colorectal cancer or adenoma risk61,62. However, previous studies have shown interactions between diet, such as cruciferous vegetable consumption, and GSTA1 genotypes, supporting that associations between this gene and CRC are likely driven by dietary exposures63,64,65.

TONSL in the 8q24.3 region codes for a 1378 amino acid protein component of the MMS22L-TONSL complex, which functions in recovery of damaged replication forks66. Numerous mutations in TONSL are considered pathogenic60. Low levels of the MMS22L-TONSL complex result in increased frequency of DNA double-strand breaks and compromised DNA integrity66. In combination with increased DNA damage due to deficiencies in folic acid, impaired functionality of the MMS22L-TONSL due to functional mutations may be a pathway to increase tumorigenesis. Follow-up analyses further suggested that possible associations may be primarily driven by a small subset of variants included in our gene set in the main analysis.

We observed increasing risk of CRC per standard deviation increase in predicted gene expression of AGA with increasing folate consumption. The AGA gene is in the 4q34.3 region and codes for a 346 amino acid protein that functions in pathways related to the innate immune system and asparagine degradation67. Once the protein is processed into the mature enzyme it takes part in the catabolism of N-linked oligosaccharides, cleaving asparagine from N-acetylglucosamines in one of the final steps in the lysosomal breakdown of glycoproteins. Mutations in the AGA gene are known to cause the lysosomal storage disease aspartylglycosaminuria, eventually resulting in neurodegeneration. Previous research has not indicated a link to cancer for this gene.

While we have many strengths in performing the largest investigation of gene–folate interactions to date using a powerful set-based approach that allows to account for functional prediction, some limitations should be considered when interpreting these findings. Approximately half of the studies in our consortium ascertained cases using a cohort study design which may have resulted in earlier and more frequent detection of tumors. Most cohort studies in our consortium used population-based registries for case ascertain. However, one study, The Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial, was a randomized trial to determine the effectiveness of screening. While we have adjusted for study in our approach there may be unknown residual effects of this design. Our study population was limited to those of European descent. As gene expression levels may differ across populations of different ancestry, our results may not be generalizable to populations of non-European ancestry. The studies included in our analysis occurred over a range of time and geographic locations. Fortification with folate occurred in different places at different times and we used adjusted dietary equivalents to account for these differences (see Supplementary Table 1A). Study designs also varied. We looked at the effect size of folate on CRC by case/control versus cohort study designs and did not find a substantive difference to justify stratified analyses (see Supplementary Table 1B). Lastly, studies in our consortium generally ascertained folate consumption through standard questionnaires. However, previous work has shown self-reported measures of folate intake to be positively and moderately correlated with plasma levels of folate, particularly when dietary supplement use was included as was generally the case in studies included in our analyses68.

We utilized colon-specific gene expression data, specifically transverse colon tissue captured by the GTex Project40. One limitation of this data is the diversity of cell types aside from epithelial cells of the mucosa of the colon, from which CRC derives given that the entire colonic wall was sampled. The impact of this would cause a dilution of gene expression for the tissue most relevant for CRC. However, we expect this to be an improvement over alternative tissue types including blood or sigmoid colon tissues in GTEx, which were collected from muscle tissues only and would not represent the gene expression profile of interest.

Although MiSTi is a powerful statistical tool, which accounts for both fixed- and random-effects of the gene–folate interaction, none of our findings reached the Bonferroni corrected threshold, which can be overly conservative as many genes are co-expressed. We did not perform independent replication and thus follow-up investigations are warranted, as a FDR of 0.2 should be considered liberal53. The previously suggested MTHFR gene was not identified in our analysis27. However, in using the penalized elastic net to create our predicted gene expression the C677T was not included in the variant weights due to the insignificant contribution to regulation of gene expression. While it was also not seen in the gene–environment interaction analysis either, we believe these results to be representative of an agnostic approach which has not been shown before, as opposed to candidate gene studies.

Our analysis was conducted in the largest pooled analysis of a well characterized and harmonized consortium of CRC with comprehensive genetic data which enabled a hypothesis-free genome-wide investigation of interactions with folate consumption on CRC risk. An extensive number of genes evaluated in prior candidate gene–folate interaction studies, including MTHFR, were included among the 4839 genes examined. However, none of those previously hypothesized genes were found to interact with folate consumption in our analysis31,57,69. We conducted additional follow-up analysis for MTHFR using indicator terms for sex-study-specific folate quantiles and interaction terms for all quantiles with predicted gene expression were null (see Supplementary Table 2A–C). No previous study has agnostically tested for genetic interactions with folate for cancer. Our statistical approach was potentially improved by incorporating functional variant weights and testing gene-sets rather than individual SNPs reducing the penalty for multiple testing. In the end, we found three genes that were suggestive of interacting with folate consumption on risk of CRC, supporting the hypothesis that associations of folate with CRC may be modified by common genetic variation.

The biological functions of our top genes serve to primarily prevent or repair DNA damage. The combined effects of increased DNA damage due to folate deficiencies and compromised functionality of these genes may be an important pathway in CRC tumorigenesis. These findings, particularly for GSTA1, warrant follow-up in future studies with comprehensive genetic and data on folate intake in order to confirm the potential role of these genes in interacting with folate on CRC risk.