One-carbon metabolism (1CM) is a metabolic network centered around the folate and methionine cycles, essential for methylation and nucleotide synthesis (Fig. 1). 1CM is vital for genome stability and function and is the target of antimetabolite/antifolate chemotherapy. An important biological role for 1CM in cancer development is therefore highly plausible1.

Figure 1: One-carbon metabolism.

Graphical representation of the main aspects of one-carbon metabolism centered around the folate and methionine cycles. Abbreviations: BHMT, betaine homocysteine S-methyltransferase; CBS, cystathionine β-synthase; CH2THF, 5,10-methylenetetrahydrofolate; CH3THF, 5-methyltetrahydrofolate; CHOTHF, formyltetrahydrofolate; CHTHF, methenyltetrahydrofolate; CTH, cystathionine γ-lyase (also abbreviated CSE); DHF, dihydrofolate; DHFR, dihydrofolate reductase; dTMP, deoxythymidine 5′-monophosphate; dUMP, deoxyuridine 5′-monophosphate; FOLR, folate receptor; MTHFD, methylenetetrahydrofolate dehydrogenase; MTHFR, 5,10-methylenetetrahydrofolate reductase; MTR, methionine synthase; MTRR, methionine synthase reductase; RFC, reduced folate carrier; SAM, S-adenosylmethionine; SAH, S-adenosylhomocysteine; SHMT, serine hydroxymethyltransferase; TCN2, transcobalamin II; THF, tetrahydrofolate; TYMS, thymidylate synthase.

The role of 1CM in colorectal cancer (CRC) development has been extensively studied. Findings include a possible dual role for the B-vitamin folate, depending on the dose and timing of exposure (i.e., protecting healthy mucosa but promoting undiagnosed lesions2,3). For other B-vitamins, the most notable finding is an inverse association between plasma concentrations of vitamin B6 (pyridoxal 5′-phosphate, PLP) and CRC risk4, whereas results have been inconclusive for vitamin B2 (riboflavin) and B12 (cobalamin) status5. For metabolites in the transsulfuration pathway, such as homocysteine and cysteine, results have been inconclusive6,7,8,9,10,11. For metabolites primarily involved in methylation, such as methionine and factors in the choline oxidation pathway, inverse associations between colorectal adenoma and CRC risk have been observed12,13,14,15. To our knowledge, dietary intake or circulating levels of serine and glycine, important one-carbon group donors to tetrahydrofolate (THF) in the folate cycle (Fig. 1), have not been studied in relation to CRC risk, but implications in cancer cell proliferation have been observed in vitro16. Several single nucleotide polymorphisms (SNPs) in genes coding for enzymes in 1CM have also been studied17, with the most important finding being a reduced risk of CRC in TT genotype carriers of the methylenetetrahydrofolate reductase (MTHFR) 677C > T polymorphism18. Randomized clinical intervention trials on the effects of folic acid, vitamin B6, or vitamin B12 supplementation on colorectal adenoma or cancer occurrence have been inconclusive19,20,21.

The varying presence, strength, and even direction of the observed associations between compounds of 1CM and CRC risk may have several explanations, for instance, variation in levels and timing of exposure between study populations18. Furthermore, the commonly used method of univariate modeling of single variables, typically in multivariable models adjusting for potential confounders, may miss higher-order interactions and mediating effects. This is particularly important for 1CM, given the complexity of input (diet, supplementation, and fortification), interrelationships, gene-environment interactions, and output (nucleotide synthesis, methylation, inflammation, oxidation, and energy metabolism22,23,24).

To account for the complexity of 1CM in molecular epidemiology, mathematical or pathway-based modeling based on prior biochemical knowledge has been successfully applied25. However, no empirical study using observational data has so far addressed the complex interplay of plasma markers of 1CM, related SNPs, and environmental factors in relation to cancer risk. A Bayesian network (BN) is a graphical representation showing all independent relations among a set of variables (a network of nodes, representing variables, connected by lines referred to as edges, representing independent relations between variables). A BN can be estimated, or learned, from data using machine learning algorithms26. This methodology has previously been applied in studies of complex systems in several scientific disciplines26, including epidemiology27,28. Using BNs in studies of 1CM and cancer could provide a more comprehensive understanding of the relation between 1CM and carcinogenesis.

In this study of 613 colorectal cancer cases with prediagnostic blood samples and 1190 matched controls from the population-based Northern Sweden Health and Disease Study (NSHDS), we used Bayesian network learning to investigate, simultaneously, the relative contributions of and interplay among a comprehensive panel of 14 prediagnostic plasma one-carbon metabolites, 17 SNPs involved in 1CM, and a set of other environmental factors, in relation to CRC risk.


Baseline characteristics

Baseline characteristics for case participants and matched control participants, and clinical characteristics for the cases, are presented in Table 1. There was a slightly larger proportion of ex-smokers and a lower proportion of never smokers among cases compared to controls. Body mass index (BMI), alcohol intake, physical activity (occupational and recreational), and B-vitamin intakes were similar for cases and controls. Vitamin supplement usage was low among both cases and controls. The median age at diagnosis was 65.2 years. Median follow-up time between blood sampling of the cases and their CRC diagnosis was 8.2 years. The tumors were roughly equally distributed by site (30% proximal, 35% distal, and 35% rectal) and stage (53% I/II and 47% III/IV).

Table 1 Baseline characteristics.

Baseline plasma metabolite concentrations differed between cases and controls for some metabolites, but only vitamin B2 differed significantly after Bonferroni correction for multiple testing (adjusted significance threshold: 0.05/14 ≈ 0.004, Table 1). Spearman correlations between plasma concentrations of one-carbon metabolites are presented in Supplementary Fig. S1. Four groups of more highly correlated metabolites were apparent: (1) metabolites in the choline pathway (choline, betaine, dimethylglycine (DMG), and sarcosine), (2) B-vitamins (folate, vitamin B6, B2, and B12), (3) serine and glycine, and (4) methionine and metabolites in the transsulfuration pathway (methionine, homocysteine, cystathionine, and cysteine). Correlations were highest between directly related metabolites, such as choline and betaine (r = 0.40) and glycine and serine (r = 0.52). Total homocysteine was negatively correlated with both folate (r = −0.37) and vitamin B12 (r = −0.28). Methionine and cystathionine were also correlated with metabolites in the choline pathway (r = 0.17–0.28). The correlations were essentially the same for cases and controls.

Genotype distributions did not differ between cases and controls for any SNP (Supplementary Table S1). No SNPs showed significant deviations from Hardy-Weinberg equilibrium (adjusted significance threshold: 0.05/34 ≈ 0.0015, Supplementary Table S1).
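The two Bonferroni-adjusted thresholds quoted above follow from dividing the nominal significance level by the number of tests. The short Python sketch below reproduces that arithmetic; the function name is illustrative, and reading the 34 Hardy-Weinberg tests as 17 SNPs tested in cases and controls separately is our assumption.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# 14 plasma metabolites compared between cases and controls
metabolite_threshold = bonferroni_threshold(0.05, 14)

# Assumption: 17 SNPs x 2 groups (cases, controls) = 34 Hardy-Weinberg tests
hwe_threshold = bonferroni_threshold(0.05, 34)

print(f"metabolites: {metabolite_threshold:.4f}")  # about 0.0036, reported as 0.004
print(f"HWE tests:   {hwe_threshold:.4f}")         # about 0.0015
```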

Bayesian network learning

The combined BN estimated from data using three different algorithms, including all metabolites, SNPs, and other variables in relation to CRC, is presented in Fig. 2a. An edge (i.e., drawn line) between two variables implies an association independent of all other variables in the network. The networks estimated using the Hill-climbing (HC) algorithm had more edges compared to the networks estimated using the Incremental Association Markov Blanket (IAMB) and Min-Max Hill-climbing (MMHC) algorithms. Regarding independent associations between CRC and other variables, the overall pattern was the same for all algorithms but with slightly stronger associations for more variables in the HC networks (measured by edge confidence, i.e., the frequency of the edge in the 1000 bootstrap networks) (Fig. 2b). The edge confidence significance thresholds that needed to be met for a relation to be included in the networks were essentially the same (HC = 49%, IAMB = 49%, MMHC = 50%).

Figure 2

Bayesian network learning results (a) Bayesian network of plasma one-carbon metabolites divided into quartiles, related SNPs, and other environmental variables in relation to colorectal cancer (CRC) estimated with the HC algorithm. Analyses were made on 560 cases and 1090 controls (after excluding 53 cases and 100 controls with incomplete 1CM data). Edges in black were also present in IAMB and/or MMHC networks, whereas gray edges were present only in the HC network. Thicker edges indicate higher confidence (i.e., the frequency of the relation in the 1000 bootstrap networks). The estimated confidence thresholds for inclusion in the networks were: HC = 49%, IAMB = 50%, MMHC = 51%. The strongest independent associations with CRC risk, with edge confidences consistently higher compared to other variables for all algorithms, are marked with dashed edges. (b) Edge confidences of relations between CRC and 1CM variables for networks learned using the HC, IAMB, and MMHC algorithms. A higher edge confidence indicates a stronger independent association. Abbreviations: PA, physical activity; eGFR, estimated glomerular filtration rate; KTr, kynurenine/tryptophan ratio.

In the BNs, plasma concentrations of the metabolites were related to each other mainly according to known biochemical relationships (Fig. 2a). Homocysteine levels were related to the MTHFR 677C > T polymorphism. No other SNP was strongly associated with the plasma concentrations of any of the metabolites. SNPs within the same genes were associated, suggesting linkage disequilibrium. Some independent associations between environmental factors and metabolites were present. For instance, vitamin B6 was related to smoking and cysteine was related to BMI. The relation between sampling year and sarcosine (manifested as slightly higher levels in participants sampled in later years) was likely an artifact stemming from spurious amounts of sarcosine in the EDTA tubes used during that period. The BNs also picked up associations between background variables inherent to the study design (e.g., between cohort and sex, age, fasting status, and sampling year).

Folate, vitamin B6, and vitamin B2 had the strongest independent associations with CRC risk, with edge confidences consistently higher compared to other variables for all algorithms (Fig. 2b). Yet, the edge confidences were generally not above the estimated significance thresholds. The RFC1 80G > A polymorphism displayed a higher edge confidence to CRC compared to other SNPs, though it did not meet the threshold (Fig. 2b). Removing the metabolites from the BNs did not markedly affect the associations for SNPs.

Edge confidences for the strongest independent associations between 1CM variables and CRC risk (BNs estimated with the HC algorithm) are presented for subgroups based on sex, follow-up, and tumor site and stage in Supplementary Table S2. In sex-specific BNs, the most apparent difference was a stronger relation between folate and CRC in men (Pheterogeneity = 0.004). Folate was directly associated mainly with stage III and IV cancers (Pheterogeneity = 0.04). Vitamin B12 was associated with rectal cancer in tumor site-specific BNs, though the test for heterogeneity was not significant (Pheterogeneity = 0.15). The remaining structure did not markedly differ between subgroup networks, and the variables with the strongest relation to CRC or CRC subgroup were largely the same regardless of algorithm.

Univariate and interaction analyses

Plasma vitamin B2 concentrations were inversely related to CRC risk (highest vs. lowest quartile OR: 0.63, 95% CI: 0.46–0.85, Ptrend = 0.004, Fig. 3). The corresponding average absolute risk reduction was approximately 300 cases per 100 000 in the highest versus lowest quartile. Adjusting for potential confounders, including folate and vitamin B6, did not markedly change the risk estimates. Univariate analyses of the other variables with the strongest relationship to CRC, i.e. folate and vitamin B6, have either been published (lower CRC risk at lower plasma folate concentrations6,11) or submitted to a scientific journal (higher CRC risk at lower plasma concentrations of vitamin B6, data not shown here).

Figure 3: Risk of CRC by vitamin B2 status.

Odds ratios (OR) were calculated by conditional logistic regression. Absolute risk differences (RD) were determined using weighted maximum likelihood estimation. Quartiles of plasma concentrations of vitamin B2 (riboflavin, nmol/l) were based on the distribution among the control participants. Confidence intervals for the RDs were calculated by bootstrapping. Crude OR and RD estimates were adjusted only for the matching variables, using risk set stratification in conditional logistic regression and by including them as covariates in the weighted maximum likelihood models, respectively. Adjusted estimates were additionally adjusted for BMI, smoking status, occupational and recreational activity, alcohol intake, and plasma folate and vitamin B6 (PLP) concentrations. Ptrend was calculated by modeling log-transformed plasma concentrations in conditional logistic regression models.

We investigated 2-way interactions between the most influential variables: folate, vitamin B6, and vitamin B2. We observed no interaction between folate and vitamin B6 or folate and vitamin B2 (Pinteraction = 0.29 and 0.16, respectively), whereas vitamin B2 and B6 exhibited a significant interaction (Pinteraction = 0.004). Table 2 contains ORs for CRC risk by combinations of vitamins B2 and B6 levels estimated with the fitted parameters of the BN and conditional logistic regression including interaction terms (with plasma concentrations divided in tertiles to avoid spurious associations). The inverse association between vitamin B2 and CRC risk was attenuated at higher levels of vitamin B6, with the highest risk observed in the low-low category. ORs calculated from the BN were approximately the same as ORs from conditional logistic regression models.

Table 2 Risk of CRC by vitamin B2 and B6 status.

Sensitivity analyses

Since categorization of plasma metabolites may result in loss of information, we estimated BNs using finer categorization (septiles, representing a balance between increasing the number of categories and maintaining adequate numbers in each category). This analysis did not markedly change the resulting networks, with the exception of a moderate increase in confidence for the independent association between vitamin B2 and CRC.

Since undiagnosed cancer may affect plasma metabolite levels at the time of sampling (reverse causation), we estimated BNs excluding cases diagnosed within 1 year (25 cases) or 2 years (60 cases) of sampling, and their corresponding matched controls. This analysis did not markedly change the resulting networks.

As plasma metabolite concentrations can vary by fasting status, we estimated BNs excluding participants with fasting status less than 4 hours (141 cases and 272 controls). This analysis did not markedly change the resulting networks. In univariate analyses, the associations for folate, vitamin B6, and vitamin B2 did not differ by fasting status (Pheterogeneity = 0.35, 0.80, and 0.28, respectively).


In this population-based case-control study, a comprehensive panel of metabolites and SNPs involved in 1CM, along with several environmental factors, was analyzed simultaneously by Bayesian network learning to study interrelations and relative contributions to CRC risk. The associations represented in the estimated networks largely corresponded to plausible biochemical relationships. Plasma concentrations of folate, vitamin B6 (PLP), and vitamin B2 (riboflavin) had the strongest independent relations to CRC risk. In multivariable, univariate analyses, vitamin B2 demonstrated a linear inverse association with CRC risk. Vitamin B6 significantly modified this relation.

This is the first time Bayesian network learning, or any similar multivariate statistical approach, has been applied to investigate 1CM in relation to any cancer. Bayesian network learning allows many variables to be modeled simultaneously to study all relations in a system. In our study, the estimated BNs identified known and biologically plausible associations between factors, which underscores the validity of the method. Bayesian network learning does not replace traditional methods but is a valuable exploratory tool for understanding independent associations among multiple variables and for selecting the variables to consider in further univariate analyses. This is particularly relevant in studies of biological systems with many highly interrelated environmental and genetic factors, such as 1CM.

The independent associations between prediagnostic plasma concentrations of folate, vitamin B6, and vitamin B2 and CRC risk suggest that they may be the 1CM components of greatest importance in colorectal tumorigenesis. Low plasma folate concentrations were associated with a decreased CRC risk (previously published6,11), whereas low plasma concentrations of vitamins B6 and B2 were associated with an increased CRC risk. These observations are consistent with previous findings from univariate modeling, both in our data6,11 and other studies5. The relative importance and interconnectedness of B-vitamins in cancer development are also consistent with results from animal studies and mathematical modeling of 1CM29. Folate and vitamins B2 and B6 are involved in DNA synthesis and methylation, biological processes important for genome stability and repair1. Since these functions are critical in both the healthy colorectum and in tumorous lesions, a cancer-promoting effect has been proposed for folate2,3, consistent with the direct association with CRC risk observed in this study. As cofactors in the kynurenine pathway23,24,30, both B6 and B2 are linked to inflammation, a process known to influence cancer development31. However, the biomarkers of systemic inflammation available in this study (plasma neopterin and the kynurenine/tryptophan ratio) were not strongly associated with either B-vitamin, nor did they alter the relation between the B-vitamins and CRC risk. Vitamin B6 and B2 are also cofactors in a large number of other coenzyme reactions in macronutrient metabolism5, and vitamin B6 has been suggested to reduce oxidative stress, colon cancer cell proliferation, and angiogenesis32. The inverse associations between vitamin B6 and B2 and CRC risk in this study may, therefore, reflect mechanisms other than 1CM.

There were some differences in the most influential variables when we estimated BNs on stratified data. Independent associations between plasma folate concentrations and CRC were observed in men and for stage III and IV CRC, which is not entirely consistent with our previous findings based on multivariable, univariate modeling6,11. The independent association between plasma vitamin B12 concentrations and rectal cancer risk supports our previous findings6,11,33. Structural learning algorithms are less efficient in smaller sample sizes, especially in networks with many interrelations34. The results of the subgroup analyses must, therefore, be verified in larger data sets.

Among the examined SNPs, the only independent association observed was between the MTHFR 677C > T polymorphism and plasma homocysteine levels. Interestingly, none of the polymorphisms exhibited a strong independent relation to CRC risk. In previous univariate analyses of the same data, we found a small CRC risk reduction in individuals with the variant CT or TT genotype of the MTHFR 677C > T polymorphism11,35. The largely null findings for the SNPs in this study might, therefore, reflect a mediating effect through altered metabolite levels. This would be consistent with the premise of Mendelian randomization studies, for which MTHFR 677C > T is a commonly studied example. However, our results were not markedly affected by removing the plasma metabolites from the BNs.

The main limitation of this study was the analysis of only one blood sample from each participant. On the other hand, issues of storage stability and reproducibility of the included biomarkers are well studied and unlikely to have impacted the results markedly36,37. Although common practice, categorizing continuous variables (e.g., dividing plasma concentrations into quartiles) results in loss of information26. However, in a sensitivity analysis, BNs estimated using septile categories yielded similar results. Alcohol intake and physical activity were only available for the VIP cohort, which may have caused residual confounding. The majority of the data were from VIP (78%), and we included several other related environmental factors (e.g., BMI, smoking status, and inflammatory markers). A large impact of residual confounding on the main findings is, therefore, unlikely. Last, we were not able to validate the estimated network in an independent data set. However, we evaluated the robustness of our associations by a bootstrapping approach, and both known biochemical relationships and associations between background variables inherent to the study design were largely picked up by the Bayesian networks. Furthermore, the variables with the strongest independent relations to CRC risk in the networks (folate, B6, and B2), demonstrated associations and interrelationships consistent with previous reports, and were also significant in traditional univariate logistic regression models. Taken together, these observations support the validity of our findings.

The main strength of our study was the application, for the first time in a study of cancer risk, of multivariate statistical methods to a large panel of well-characterized circulating one-carbon metabolites, SNPs, and environmental factors. We tested several structural learning algorithms, yielding similar results regarding the strongest relations to CRC risk. In the overall network structure, networks estimated with the HC algorithm resulted in more edges than the IAMB and MMHC algorithms. This is likely explained by the higher sensitivity and better overall performance of the HC algorithm compared to the IAMB and MMHC algorithms, previously demonstrated in simulations34. Another strength of the study was the use of prediagnostic blood samples of high quality with respect to the collection, handling, and storage, including a majority of fasting participants. Follow-up time from sampling to CRC diagnosis was long (median 8.2 years), which minimized the risk of reverse causation. Furthermore, the study population from northern Sweden is generally characterized by low folate levels38,39,40. This allowed us to study the effects of much lower plasma folate concentrations in relation to CRC risk compared to other studies40,41.

In conclusion, this is the first study to address the complexity of 1CM in cancer risk in humans. We used multivariate Bayesian network learning to estimate, simultaneously, the associations between a comprehensive panel of prediagnostic plasma metabolites and SNPs involved in 1CM and the risk of CRC. The associations between components of 1CM and CRC risk were mainly determined by variation in folate, vitamin B6, and vitamin B2 status, suggesting that these may be the elements of 1CM with the greatest potential impact for CRC prevention strategies. Our study demonstrates the importance of incorporating these B-vitamins in future studies of 1CM in colorectal cancer development, and the usefulness of Bayesian network learning in studies of complex biological systems in relation to disease.


Study design and cohorts

The present work is based on a nested case-control study within the Northern Sweden Health and Disease Study (NSHDS). Two population-based cohorts were used, the Västerbotten Intervention Programme (VIP, 78% of the study participants, men and women) and the Mammography Screening Project in Västerbotten (MSP, 22% of the study participants, all women). Both cohorts have previously been described in detail42. As of March 31, 2009, the final date for case identification for the present study, the VIP included 83 621 individuals and 114 793 blood samples, and the MSP 28 802 women and 54 787 blood samples. Selection bias in the VIP has been found to be low43, and the population-based nature of the VIP cohort is supported by comparisons of cancer incidence rates44.

Study participants

CRC cases diagnosed between October 17, 1986, and March 31, 2009, who had donated prediagnostic blood samples, were identified by linkage with the Cancer Registry of Northern Sweden (ICD-10 18.0 and 18.2–18.9 for colon, 19.9 and 20.9 for rectum), with essentially complete inclusion. All cases, as well as tumor data, were verified by a single pathologist specialized in gastrointestinal pathology. Patient records were used to verify tumor site. Exclusion criteria included: previous cancer diagnosis other than non-melanoma skin cancer, insufficient volume of plasma sample available, sample prioritized for other studies, location of primary tumor outside the colon/rectum, serious infectious diseases (for lab staff safety, one case excluded), or no matching control obtainable.

Two controls were randomly selected for each case, matched by sex, age at and year of blood sampling and data collection, fasting status, and cohort. The exclusion criteria for the controls were the same as for cases, with the additional requirement that all controls had to be alive and with no diagnosed cancer other than non-melanoma skin cancer at the time of diagnosis of their index cases.

A total of 613 cases and 1190 controls were included in the study after exclusions. In total, 127 participants were excluded (81 cases and 46 controls), mainly due to insufficient blood sample volume or to the sample being prioritized for other studies. A detailed description of the exclusions is available elsewhere35.

The subjects in the present study have previously been separately analyzed for eight of the plasma metabolites (folate, cobalamin, homocysteine, methionine, choline, betaine, dimethylglycine, and sarcosine), four of the polymorphisms (MTHFR 677C > T and 1298A > C, BHMT 742G > A, and MTR 2756A > G)6,11,33,35, and for subjects with index case diagnosis 1986–2003, also the RFC1 80G > A and FOLR1 1413G > A polymorphisms35,45, in relation to colorectal cancer (CRC) risk. A total of 17 CRC cases and 33 controls in the present study were also included in previous studies within the European Prospective Investigation into nutrition and Cancer (EPIC)14,46.

Ethical considerations

The study protocol was approved by the Research Ethics Committee of Umeå University, Umeå, Sweden. All participants gave a written informed consent. All analyses were conducted in accordance with relevant guidelines and regulations.

Blood sampling and laboratory analyses

Plasma from venous blood samples in the NSHDS is aliquoted and cryopreserved at −80 °C within one hour of collection, or at −20 °C for at most one week prior to long-term storage at −80 °C. In the VIP cohort, samples are collected in the morning, and only 34 of 1410 participants (2%) had fasted less than 4 hours and 295 (21%) less than 8 hours. In the MSP cohort, samples were collected throughout the day, and 379 of 393 participants (96%) had fasted less than 4 hours. Thus, in the total material, 60% of the participants had fasted for more than 8 hours, 17% had fasted 4–8 hours, and 23% had fasted less than 4 hours. Concentrations of 1CM metabolites in EDTA plasma and polymorphisms involved in 1CM were analyzed at Bevital AS (Bergen, Norway)47. Plasma concentrations of cystathionine, vitamin B2 (riboflavin), vitamin B6 (PLP), methionine, choline, betaine, dimethylglycine, creatinine, neopterin, and tryptophan were measured with liquid chromatography–mass spectrometry methods (between-day coefficient of variation (CV): 3–13%)48. Plasma concentrations of total homocysteine, total cysteine, serine, glycine, sarcosine, and kynurenine were measured using an isotope dilution gas chromatography–mass spectrometry method (between-day CV: 2–9%)49. Folate and vitamin B12 (cobalamin) concentrations were determined with a microbiological method using Lactobacillus casei and Lactobacillus leichmannii, respectively, which was adapted to a microtiter plate format and carried out by a robotic workstation (between-day CV: 5%)50,51. Single nucleotide polymorphisms were determined using MALDI-TOF mass spectrometry (estimated average error rate of ≤0.1% in duplicated samples)52. The genotyping method has previously been independently verified using RFLP or Taqman real-time PCR52. Samples were analyzed in case-control sets, with random positioning of the case. The investigators and laboratory staff were blinded to case and control status.


Plasma concentrations of 14 metabolites and 17 SNPs in 13 genes involved in, or related to, 1CM were considered for the Bayesian network learning. The panel was designed based on previous studies of 1CM and CRC risk5,17, and to capture a wide array of aspects of one-carbon metabolism while maintaining an adequate marker stability and reproducibility36,37. Included metabolites were: folate, vitamin B6 (PLP), vitamin B2 (riboflavin), vitamin B12 (cobalamin), homocysteine, cystathionine, cysteine, glycine, serine, methionine, choline, betaine, dimethylglycine, and sarcosine. Included SNPs were: MTHFR 677C > T and 1298A > C, CBS 844ins68 and 699C > T, MTR 2756A > G, MTRR 66A > G and 524C > T, BHMT 742G > A, TCN2 67A > G and 776C > G, RFC1 80G > A, FOLR1 1413G > A, MTHFD1 1958G > A, CTH 1364G > T, SHMT1 1420C > T, DHFR 19 deletion, and TYMS 6 deletion. Other environmental factors or background information included were: cohort (VIP or MSP), age at and year of blood sampling (quartiles), sex (male or female), fasting status (<4, 4–8, ≥8 hours), smoking status (current, ex-, never smoker), body mass index (BMI) measured by a health professional (<25, 25–30, ≥30 kg/m2), estimated glomerular filtration rate (eGFR) calculated by the Cockcroft-Gault formula (based on plasma creatinine levels, age, sex and body weight, quartiles), and plasma concentrations of neopterin and the kynurenine/tryptophan ratio (KTr), both markers of immune activation53,54 (quartiles). 
For the VIP cohort, we also had self-reported alcohol intake (zero intake, above/below sex-specific median of self-reported, g/day), recreational physical activity (regular exercise frequency on a scale from 1–5, where 1: never; 2: every now and then - not regularly; 3: 1–2 times/week; 4: 2–3 times/week; 5: more than 3 times/week), and occupational physical activity (on a scale from 1–5, where 1: sedentary or standing work; 2: light but partly physically active; 3: light and physically active; 4: sometimes physically strenuous; 5: physically strenuous most of the time). For these VIP-only variables, observations within the MSP cohort were assigned to a separate “missing” category.
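For readers unfamiliar with the Cockcroft-Gault formula mentioned above, the following Python sketch shows its standard form (creatinine clearance in mL/min from age, body weight, sex, and creatinine in mg/dL). The function names and the example participant are hypothetical, and the unit conversion assumes creatinine reported in µmol/L; the exact implementation used in the study is not described here.

```python
UMOL_L_PER_MG_DL = 88.4  # creatinine: 1 mg/dL corresponds to 88.4 umol/L


def creatinine_umol_to_mg_dl(creatinine_umol_l: float) -> float:
    """Convert a creatinine concentration from umol/L to mg/dL."""
    return creatinine_umol_l / UMOL_L_PER_MG_DL


def cockcroft_gault(age_years: float, weight_kg: float,
                    creatinine_mg_dl: float, female: bool) -> float:
    """Estimated creatinine clearance (mL/min) by the Cockcroft-Gault formula."""
    if creatinine_mg_dl <= 0:
        raise ValueError("creatinine must be positive")
    clearance = (140 - age_years) * weight_kg / (72 * creatinine_mg_dl)
    # The standard formula multiplies the result by 0.85 for women
    return clearance * 0.85 if female else clearance


# Hypothetical participant: 60-year-old man, 72 kg, creatinine 88.4 umol/L
creatinine = creatinine_umol_to_mg_dl(88.4)
print(cockcroft_gault(60, 72.0, creatinine, female=False))
```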

Plasma concentration variables were analyzed in quartile groups (cut-offs based on the distribution of the controls). The CBS 844ins68, TCN2 67A > G, and FOLR1 1413G > A SNPs were analyzed in two groups, common and variant genotype, because of low allele frequencies (4, 36, and 3 individuals with the homozygous variant genotype respectively). All other SNPs were analyzed in three categories: common, heterozygous, and homozygous variant genotype.
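As an illustration of control-based categorization, the Python sketch below derives quartile cut-offs from control values only and then assigns any participant (case or control) to a quartile. The function names and the toy concentrations are hypothetical, and the quantile interpolation rule (the default of `statistics.quantiles`) is an assumption; the study does not state which rule was used.

```python
import bisect
import statistics


def quartile_cutoffs(control_values):
    """Three cut-offs splitting the control distribution into quartiles."""
    return statistics.quantiles(control_values, n=4)


def assign_quartile(value, cutoffs):
    """Return quartile 1-4 for a value, given cut-offs derived from controls."""
    return bisect.bisect_right(cutoffs, value) + 1


# Toy control concentrations (hypothetical, arbitrary units)
controls = [1, 2, 3, 4, 5, 6, 7, 8]
cutoffs = quartile_cutoffs(controls)

# Cases are categorized with the same control-based cut-offs
case_values = [0.5, 4.0, 9.0]
print([assign_quartile(v, cutoffs) for v in case_values])
```

Deriving the cut-offs from controls keeps the category boundaries independent of case status, so each quartile holds roughly a quarter of the controls while the case distribution is free to differ.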

Missing values for plasma metabolites and SNPs were assumed to be missing completely at random and were therefore omitted from the analyses (0–3% missing per variable). Missing values for the environmental factors were assigned to separate categories. Thus, the Bayesian network learning was conducted on 560 cases and 1090 controls with complete 1CM data.

Statistical analyses

All computations were conducted in R v.3.2.4 (ref. 55). Network visualizations were created using Cytoscape v.3.2.1 (ref. 56). All statistical tests were two-sided with a significance threshold of 0.05.

Mann-Whitney U tests or chi-square tests were used to test for differences in variable distributions between cases and controls. Correlations between plasma metabolite variables in all subjects were calculated with Spearman’s correlation coefficient on pairwise complete observations. A hierarchical cluster analysis of the metabolites was conducted using correlation distances with complete linkage. Pearson’s χ²-test was used to check if SNPs were in Hardy-Weinberg equilibrium for cases and controls separately when the expected cell count was above 5, otherwise Fisher’s exact test was used. The significance thresholds for the tests were corrected for multiple testing with the Bonferroni method.
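
The Hardy-Weinberg check and the Bonferroni correction described above can be sketched as follows. This is an illustrative version of the Pearson χ² test with 1 degree of freedom only; the exact-test fallback for small expected counts is not implemented, and the function names are hypothetical.

```python
from math import erfc, sqrt


def hwe_chi_square(n_common, n_het, n_variant):
    """Pearson chi-square test for Hardy-Weinberg equilibrium (1 df).

    Returns (statistic, p_value, smallest expected count). The caller can
    use the smallest expected count to decide whether an exact test is
    needed instead.
    """
    n = n_common + n_het + n_variant
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_common + n_het) / (2 * n)  # frequency of the common allele
    q = 1 - p
    observed = (n_common, n_het, n_variant)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = erfc(sqrt(stat / 2))  # chi-square survival function for 1 df
    return stat, p_value, min(expected)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests
```

With 17 SNPs tested in one group, for example, `bonferroni_threshold(0.05, 17)` gives the per-test threshold against which each p-value would be compared.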

The BNs were estimated on discrete data with a model-averaging approach based on bootstrapping34. In 1000 bootstrap samples, BNs were estimated with three different machine learning algorithms using the boot.strength function in the bnlearn R-package. Then, the final networks were obtained by averaging over the 1000 bootstrap networks using the function. An edge was included if its edge confidence, defined as the frequency of occurrence of that relation among the 1000 bootstrap networks, was above a threshold based on observed confidence levels34. The three machine learning algorithms used in each bootstrap sample were the score-based Hill-climbing (HC), the constraint-based Incremental Association Markov Blanket (IAMB), and the hybrid Min-Max Hill-climbing (MMHC) algorithms. The scoring function for the HC and MMHC algorithms was the Akaike information criterion (AIC) score, and the conditional independence test for the IAMB and MMHC algorithms was the asymptotic χ² mutual information test.
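
The model-averaging step described above can be illustrated with a simplified sketch. It assumes each bootstrap network is already available as a collection of directed edges, treats edge confidence as the plain frequency of each directed edge, and takes the inclusion threshold as a given parameter rather than deriving it from the observed confidence levels. It does not reproduce the bnlearn implementation, and the function names are hypothetical.

```python
from collections import Counter


def edge_confidence(bootstrap_networks):
    """Frequency of each directed edge across the bootstrap networks.

    bootstrap_networks: list of iterables of (from_node, to_node) tuples.
    """
    n = len(bootstrap_networks)
    if n == 0:
        raise ValueError("at least one bootstrap network is required")
    counts = Counter()
    for edges in bootstrap_networks:
        counts.update(set(edges))  # count each edge once per network
    return {edge: count / n for edge, count in counts.items()}


def averaged_network(confidence, threshold):
    """Edges whose confidence is strictly above the threshold."""
    return {edge for edge, conf in confidence.items() if conf > threshold}
```

For instance, an edge found in 750 of 1000 bootstrap networks has a confidence of 0.75 and would be kept under any threshold below that value.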

Univariate risk estimates in the form of odds ratios (ORs) for the 1CM variables with a pronounced relation to CRC in the BNs, and for which we have not previously published results or submitted results to a scientific journal, were computed with conditional logistic regression. Linear trends for the metabolites were tested by modeling log-transformed plasma concentrations. Absolute risk estimates, defined as marginal risk differences (RDs), were computed with a weighted maximum likelihood estimator using cumulative incidence data from the study cohort at large, and within groups defined by sampling year, age, sex, and cohort (cumulative incidence of CRC in the study cohort was 830 per 100 000 over the period 1987–2009)35. We present risk estimates both from unadjusted models and from models adjusted for potential confounders. Adjusted estimates were adjusted for BMI, smoking status, occupational and recreational activity, alcohol intake, and plasma concentrations of the B-vitamins folate and vitamin B6 (PLP). We evaluated 2-way interactions between variables demonstrating the strongest independent relations to CRC risk in the BNs. Conditional probabilities calculated from estimated parameters of the networks were used to determine ORs over combinations of the variables with the cpquery function in the bnlearn R-package. ORs from conditional logistic regression models fitted with interaction terms were also calculated. The overall significance of the interactions was evaluated by fitting interaction terms using log-transformed metabolite concentrations or treating SNPs as continuous variables (labeled 0, 1, and 2, representing copies of the less common allele).
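
To make the step from conditional probabilities to ORs concrete, the following sketch converts two conditional probabilities of case status, such as those returned by a conditional probability query on a fitted network, into an odds ratio. The probabilities in the usage note are hypothetical and the function names are illustrative.

```python
def odds(probability):
    """Odds corresponding to a probability strictly between 0 and 1."""
    if not 0 < probability < 1:
        raise ValueError("probability must lie strictly between 0 and 1")
    return probability / (1 - probability)


def odds_ratio(p_case_given_combination, p_case_given_reference):
    """OR of case status for a variable combination versus a reference."""
    return odds(p_case_given_combination) / odds(p_case_given_reference)
```

For example, hypothetical probabilities of 0.5 for a given combination and 0.2 for the reference combination give odds of 1 and 0.25, and hence an OR of about 4.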

Heterogeneity of the associations was evaluated by estimating BNs on data stratified by sex, follow-up time from blood sampling to diagnosis (above or below median follow-up of 8.2 years), tumor site (proximal colon, distal colon or rectum), and tumor stage (I&II or III&IV). For variables that appeared to differ among the stratified BNs, we further evaluated heterogeneity with likelihood ratio tests using conditional logistic regression. The likelihood ratio tests compared a conditional logistic regression model in which the risk association could vary across endpoints to a model in which all associations were held constant (or, for interactions with sex, a model with product terms to a model without)57.
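
The likelihood ratio comparison described above can be sketched as follows, given the maximized log-likelihoods of the two nested models. This illustration covers only 1 or 2 degrees of freedom, where the χ² survival function has a simple closed form. The function name is hypothetical and the log-likelihood values in the usage note are made up.

```python
from math import erfc, exp, sqrt


def likelihood_ratio_test(loglik_reduced, loglik_full, df):
    """Likelihood ratio statistic and p-value for nested models.

    df is the number of additional parameters in the full model; only
    df = 1 and df = 2 are supported in this sketch.
    """
    stat = max(2 * (loglik_full - loglik_reduced), 0.0)
    if df == 1:
        p_value = erfc(sqrt(stat / 2))
    elif df == 2:
        p_value = exp(-stat / 2)
    else:
        raise ValueError("this sketch supports df = 1 or df = 2 only")
    return stat, p_value
```

For example, hypothetical log-likelihoods of -100.0 (constant association) and -97.0 (association varying across three endpoints, df = 2) give a statistic of 6.0 and a p-value just below 0.05.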

Additional Information

How to cite this article: Myte, R. et al. Untangling the role of one-carbon metabolism in colorectal cancer risk: a comprehensive Bayesian network analysis. Sci. Rep. 7, 43434; doi: 10.1038/srep43434 (2017).
