Epigenome-wide association meta-analysis of DNA methylation with coffee and tea consumption

Coffee and tea are extensively consumed beverages worldwide which have received considerable attention regarding health. Intake of these beverages is consistently linked to, among others, reduced risk of diabetes and liver diseases; however, the mechanisms of action remain elusive. Epigenetics is suggested as a mechanism mediating the effects of dietary and lifestyle factors on disease onset. Here we report the results from epigenome-wide association studies (EWAS) on coffee and tea consumption in 15,789 participants of European and African-American ancestries from 15 cohorts. EWAS meta-analysis of coffee consumption reveals 11 CpGs surpassing the epigenome-wide significance threshold (P-value <1.1×10−7), which annotated to the AHRR, F2RL3, FLJ43663, HDAC4, GFI1 and PHGDH genes. Among them, cg14476101 is significantly associated with expression of the PHGDH and risk of fatty liver disease. Knockdown of PHGDH expression in liver cells shows a correlation with expression levels of genes associated with circulating lipids, suggesting a role of PHGDH in hepatic-lipid metabolism. EWAS meta-analysis on tea consumption reveals no significant association, only two CpGs annotated to CACNA1A and PRDM16 genes show suggestive association (P-value <5.0×10−6). These findings indicate that coffee-associated changes in DNA methylation levels may explain the mechanism of action of coffee consumption in conferring risk of diseases.

levels. Coffee and tea consumption data were collected from previously validated 389 item foodfrequency questionnaire [14]. The data on tea consumption was collected on black, green and herbal. We combined black and green tea as previously explained in the Methodology section of the manuscript. For the purpose of this study, data was collected from two separate cohorts: The third visit of RS-II and the first and second visit of RS-III. In RS-III, a selection of participants had DNA methylation data collected at the first visit (RS-III-1) and other part (not overlapping) had DNA methylation data collected at second visit (RS-III-2). However, in RS-III-2 the data for coffee and tea consumption was collected several years prior to DNA methylation data (at RS-III-1 visit). As it is case for ASLPAC and CHS cohorts, difference adjustment was additionally added in the model and all the other covariates were collected at the same time point as DNA methylation.
TwinsUK Study: The TwinsUK registry was initiated in 1992 to recruit healthy participants who were either monozygotic or dizygotic same-sex twins, aged over 18 [15]. In total, there are more than 14,000 participants recruited across the UK, where the majority being adult female of European descent. The number of females who had both whole blood DNA methylation profiles and data on coffee and tea consumption for the purpose of this study was 552 subjects. DNA for methylation assessment was extracted from whole blood and stored in EDTA tubes. The Infinium HumanMethylation450 BeadCHips (Illumina Inc, San Diego, CA) was used to measure DNA methylation levels, as previously described [16]. Data on coffee and tea consumption in the TwinsUK cohort was collected following the EPIC-Norfolk guidelines [17]. Data on coffee and tea consumptions was collected using the following options: never or less than once a month, 1-3 per a month, once a week, 2-44 per a week, 5-6 per a week, once a day, 2-3 per a day, 4-5 per a day and 6+ per a day. These data were subsequently used to calculate the average intake per a day (cups/day).
Ethical approval was granted by the National Research Ethics Service London-Westminster, the St Thomas' Hospital Research Ethics Committee (EC04/015 and 07/H0802/84). All twins provided written informed consent prior to taking part in research activities.

DNA methylation profiling
All participating cohorts measured DNA methylation in peripheral blood using the Infinium Human Methylation 450K Bead-Chip (Ilumina, San Diego, CA, USA) except Airwave cohort, where the Infinium Methylation EPIC (850K) Bead-Chip was used [18]. Genomic DNA was bisulfite-converted for methylation measurement. Following bisulfite-conversion of DNA, samples underwent wholegenome amplification and were fragmented and hybridized on Bead-Chip with their complementary probe sequences. DNA methylation status was assessed through a single-base extension step. Arrays were imaged with a high precision scanner (iScan system, Illumina Inc.) and the signal intensities were extracted by usage of a software package (GenomeStudio Software, Illumina Inc.). DNA methylation status was calculated with the β value -signal from the methylated probe divided by the overall signal intensity. The methylation percentage of CpG sites was reported as a continuous β value range between 0 (no methylation) and 1 (full methylation). As commonly performed, DNA methylation data preprocessing was conducted independently in different cohorts and β values were normalized by studyspecific methods. DNA methylation sites were annotated with the information provided by Illumina and the University of California Santa Cruz (UCSC) database (GRCh37/hg19).

EWAS of coffee and tea consumption
DNA methylation was considered as the dependent variable with coffee or tea consumption as predictors of interest (cups/day), each. Conventionally, each participating cohort performed an EWAS as a set of mixed effects liner-regression models, one CpG site at a time. In total, two linear mixed effects regression models were computed for each of the two exposures of interest. In the basic model (Model 1): we included age, sex, smoking status (never, former, and current), white blood cells (either measured directly or imputed based on Houseman algorithm [19]) as fixed effects and technical covariates as random effects to control for batch effects. In the second model (Model 2), we additionally adjusted for body mass index (BMI, kg/m 2 ) and alcohol consumption (g/day). All of the potential confounders were collected at the same time point of blood sampling for DNA methylation. Genetic principal components were included as covariates to account for population stratification if required.
For a subset of cohorts (ARIC_AA, ARIC_EA, ALSPAC, FHS, EPIC_IARC) surrogate variables were calculated and adjusted for in the modelling, due to the batch effects not controlled adequately by other modelling techniques. Airwave cohort did not have alcohol in gr/day, but the categorical variable (never drinker, current drinker, and ex-drinker) for which they adjusted in the analysis. FHS cohort had current and ever smokers (never and former combined), while ALSPAC cohort had only smoking variable defined as smokers (current and former combined) and non-smokers. The EPIC_IARC cohort includes prospective breast cancer cases and matched controls and EPIC_ITALY had in total 3 case-control studies nested in the study on breast cancer, lung cancer and colorectal cancer, and they adjusted for cases status. In addition, EPIC_ITALY had participants recruited from different cities in Italy, where center of recruitment was used as a random effect. For cohorts (RS-III-2, ALSPAC, CHS_EA and CHS_AA) that did not have coffee and tea consumptions measured at the same time as DNA methylation, additional time difference adjustment was introduced in the cohort specific adjustments. The findings from model 2 were considered as a primary results, as it is the most conservative model.

Mendelian Randomization (MR) study
We implemented a two-sample Mendelian randomization (MR) approach to evaluate the potential causal effect of coffee consumption on the identified CpGs, investigating whether the DNA methylation changes are a consequence of coffee consumption ( Figure S3). We used 50 independent SNPs as instrumental variables (IVs) of coffee consumption, units of cups of coffee per day (including drinkers and non-coffee drinkers) (Table S2) [20,21]. In addition, we assessed the potential causal association between coffee-consumption related CpG and cardiometabolic traits including type 2 diabetes, body mass index, waist-hip ratio, lipid traits (HDL-C, LDL-C, total cholesterol, triglycerides) and coronary heart disease (CHD). For each CpG, we chose instrumental variables for DNA methylation levels based on methylation quantitative trait loci (cis-meQTL) obtained from FHS cohort (N=4170) [22]. The IVs were selected from independent cis-meQTL SNPs pruned by LD r2 < 0.01. Genetic association data of cardiometabolic traits were obtained from publicly available GWAs namely DIAGRAM Consortium Two methods were used to explore causality. First, a weighted genetic risk score (GRS) was constructed for coffee consumption by multiplying the number of effect alleles at each locus by the corresponding reported β coefficient from the GWAS and then summing the products. The total score was then divided by the average effect size multiplied by 100 to rescale the scores and standardize them to a range between 0 and 100. The other MR approaches were performed using the summary statistics for genetic association of the selected SNPs with coffee consumption and with the CpG site of interest. The causal effect estimate was obtained through inverse variance weighting (IVW). We further used two sensitivity analyses, the weighted median and MR-Egger, to investigate potential effect of pleiotropic variants on the estimates. The effect sizes and standard errors for SNPs-CpG were obtained from meta-analyzing GWAS summary statistics from the RS and FHS (n=5,371) [22]. We used MR-PRESSO (Mendelian randomization pleitropy residual sum and outlier) to identify horizontal pleiotropic outliers in multiinstrument summary-level MR testing (https://github.com/rondolab/MR-PRESSO) [27]. All MR methods for multiple genetic instruments were conducted using "MendelianRandomization", a statistical package running under R [28].

Funding and Acknowledgements
ALSPAC: The UK Medical Research Council and Wellcome (Grant ref: 217065/Z/19/Z) and the University of Bristol provide core support for ALSPAC. This publication is the work of the authors and Srikant Ambatipudi will serve as guarantor for the contents of this paper. We are extremely grateful to all the families who took part in this study, the midwives for their help in recruiting them, and the whole ALSPAC team, which includes interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists and nurses. CR is funded by the MRC Integrative Epidemiology Unit (MC_UU_12013_2)

ARIC:
The Atherosclerosis Risk in Communities study has been funded in whole or in part with federal funds from the National Heart, Lung, and Blood Institute, National Institutes of Health, Department of Health and Human Services (contract numbers HHSN268201700001I, HHSN268201700002I, HHSN268201700003I, HHSN268201700004I and HHSN268201700005I). Funding was also supported by 5RC2HL102419 and R01NS087541. The authors thank the staff and participants of the ARIC study for their important contributions.

CHS:
The CHS research was supported by NHLBI contracts HHSN268201200036C, HHSN268200800007C, HHSN268201800001C, N01HC55222, N01HC85079, N01HC85080, N01HC85081, N01HC85082, N01HC85083, N01HC85086; and NHLBI grants U01HL080295, R01HL087652, R01HL105756, R01HL103612, R01HL120393, R01HL085251, and U01HL130114 with additional contribution from the National Institute of Neurological Disorders and Stroke (NINDS). Additional support was provided through R01AG023629 from the National Institute on Aging (NIA). A full list of principal CHS investigators and institutions can be found at CHS-NHLBI.org.   confidence intervals are illustrated, where line width is proportional to the weight assigned to the study in the meta-analysis. As expected, the cohorts with smaller sample size have a wider SE. Rectangular error bars are used to display confidence intervals, as well as the relative meta-analytic weight (height of the error bar) of each study. A.
B. heart disease (n=547261) (G). The Y axis represents the genetic association estimates with outcome. The X axis represents the genetic association estimate with exposure (cg11550064). Lines depict the causal estimates from the different Mendelian randomization methods   In the Discovery phase, P-value trehold was considered at P-value <1.1×10 -7 (Bonferroni adjusted 0.05/450,000). While for replication phase, it corresponds to number of significantly associated CpG sites from disovery phase divided by 0.05 (Bonferroni adjusted).    Excessive alcohol consumption was defined as >14 units/week for women and >21 units/week for men, where the unit would correspond to 10 grams. P value treshold = 0.016, adjusted for multiple testing of coffee with 3 liver enzymes (Bonferroni adjusted 0.05/3)