A systematic comprehensive longitudinal evaluation of dietary factors associated with acute myocardial infarction and fatal coronary heart disease

Environmental factors, and diet in particular, are known to play a key role in the development of coronary heart disease. Many of these factors were unveiled by detailed nutritional epidemiology studies focusing on the role of a single nutrient or food at a time. Here, we apply an Environment-Wide Association Study approach to Nurses' Health Study data to explore comprehensively and agnostically the association of 257 nutrients and 117 foods with coronary heart disease risk (acute myocardial infarction and fatal coronary heart disease). After accounting for multiple testing, and while adjusting for potential confounding and control variables such as physical activity, smoking, calorie intake, and medication use, we identify 16 food items and 37 nutrients that show statistically significant associations; 38 of these associations were validated in Nurses' Health Study II. Our implementation of the Environment-Wide Association Study successfully reproduces known diet-coronary heart disease associations from the epidemiological literature, and helps us detect new associations that had been only marginally studied, opening potential avenues for further extensive experimental validation. We also show that the Environment-Wide Association Study allows us to identify a bipartite food-nutrient network, highlighting which foods drive the associations of specific nutrients with coronary heart disease risk.


Supplementary Tables
Supplementary Table 1 [28]. In our analysis, we covered a wide range of polyphenols. Among 44 polyphenol-related exposures, 17 were found to be significantly associated with lower CHD risk at a 0.05 P-value threshold. However, only isorhamnetin and apigenin remained significant after correcting for multiple testing (HR < 0.92; P-value < 2 × 10^-4), in line with [29,30]. P-values are associated with two-sided Wald tests.

Supplementary Figure 2: Comparing the linear model with the non-linear model where a spline of degree four for age is included. In Supplementary Figures 2-a and 2-b, we compared the hazard ratios and P-values from the linear model with those from the non-linear model in which we included a spline of degree four for age. In both cases, the points define a straight line of unit slope, implying that the two models are almost identical in terms of the estimated hazard ratios and P-values. In Supplementary Figure 2-c, we plotted the AIC obtained from each version of the model. The AIC from the non-linear models is only slightly smaller than that of the linear models. In Supplementary Figure 2-d, we report the distribution of the percentage difference between the AICs in the two modeling frameworks. We found that the average percentage difference in AIC is -0.00225, with a very small variance. Therefore, the change in AIC is not sufficiently large to justify choosing the non-linear model over the linear model and adding to the complexity of the analysis, which would result in a significantly higher run-time for each analysis and each iteration of the permutation process. Source data are provided in Source Data - Supplementary Figure 2.

Supplementary Figure 3: Correlation Matrix. We looked at the Spearman correlation between the 53 significant associations found in NHS, calculating 1,378 pairwise correlations. Similar to [27], we used a permutation-based approach to estimate the two-sided P-value of significance for each pair of correlations.
Given a pair of variables X and Y, we shuffled the values of X and computed the correlation with Y, repeating this procedure for all pairs. The P-value for a correlation was the fraction of correlations from the permuted dataset with a greater absolute value (equation 1). Next, we controlled for multiple testing using the Benjamini-Hochberg procedure.
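This shuffling scheme can be sketched in a few lines of Python (a minimal illustration under our own helper names, not the authors' actual code; ties in the ranks are ignored for simplicity):

```python
import numpy as np

def spearman(x, y):
    """Spearman rho via rank transformation (assumes no ties)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

def permutation_corr_pvalue(x, y, n_perm=1000, seed=0):
    """Two-sided permutation P-value for the correlation of x and y.

    Shuffling x breaks any dependence between x and y; the P-value is the
    fraction of shuffled correlations whose absolute value is at least as
    large as the observed one (the +1 terms avoid a P-value of zero).
    """
    rng = np.random.default_rng(seed)
    obs = abs(spearman(x, y))
    hits = sum(
        abs(spearman(rng.permutation(x), y)) >= obs for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)
```

With 1,378 pairs, the same routine is simply applied to every pair of significant exposures.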

We show the resulting correlations in a correlation matrix, where non-significant correlations are indicated with a cross. Nutrient labels are shown in black and food-item labels in gray. Source data are provided in Source Data - Supplementary Figure 3.xlsx.
Supplementary Figure 4: Relation between the original coefficients (β_no-SES) and those estimated by including census tract family median income (β_SES). The study population in NHS comprises registered nurses, a group with relatively high homogeneity in educational attainment and socioeconomic status (SES). Nevertheless, we investigated the role of SES by including census tract family median income in our analysis [31,32]. Given the zip code of each individual over time, we were able to assign the median income of families living in that area [33]. The results are consistent with our previous analysis, confirming 43 out of the 53 significant exposures originally found without SES, with only the marginal associations affected by the additional correction. We show the relationship between the original coefficients (β_no-SES) and those estimated considering SES (β_SES). The observed linearity indicates that the two analyses are consistent over the whole panel of nutrients and foods. Indeed, the exposures are ranked in a similar fashion according to the estimated P-value, with a Spearman correlation coefficient of 0.9733. The agreement further improves when the exposures with the strongest signals are taken into account (orange). The 10 exposures whose significance is affected by SES (purple) were among the weakest in the previous analysis (β_no-SES → 0). Despite the marginal effect on the significance of dietary exposures, SES strongly influences CHD outcomes, with an estimated hazard ratio across the 347 exposures in NHS1 fluctuating in the narrow range 0.9035 ± 0.0017 (see Supplementary Figure 5). Source data are provided in Source Data - Supplementary Figure 4.

We implemented another approach to explore the associations between dietary factors and CHD risk.
Instead of using the cumulative average of food intakes from baseline to the start of each 2-year follow-up interval (which represents long-term usual intake and reduces random within-person variation), we investigated the time-dose effect of the dietary exposures, the so-called "latest diet". We show the results of this time-dose analysis in Supplementary Table 4. After using the permutation procedure to account for multiple testing, we found 29 significant associations, comprising 6 food items and 23 nutrients. There are 28 exposures common to the original analysis and the time-dose analysis. In the latter, the proportional hazards (PH) assumption is violated for three exposures (the P-value for the PH assumption test is less than 0.05). The hazard ratios (HRs) obtained using the cumulative averages are consistent with the HRs calculated with the time-dose effect (Spearman correlation 0.65), and the directionality of exposures already found statistically significant in our original analysis is conserved. Overall, the time-dose analysis yielded weaker associations between dietary exposures and CHD risk. Source data are provided in Source Data - Supplementary Figure 6.xlsx.

Participant Selection
From the 121,527 participants in NHS, we excluded 47,949 from the analysis: those who dropped out of the study before 1986, whose reported average energy intake was less than 600 or more than 3,500 kcal/day, who left more than 70 questions in the FFQ unanswered, or whose demographic data at baseline were missing. Moreover, our ascertainment of CHD follows the guidelines offered by the WHO and has been applied in other cohorts as well, such as the Women's Health Initiative (WHI) Study [1], the Atherosclerosis Risk in Communities (ARIC) Study [2], and PREDIMED [3]. While our definition is one of the most widely used in the literature, other, less stringent criteria have been used in cohorts such as EPIC [4] and the Jackson Heart and Framingham Offspring Cohort Studies [5].
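These exclusion criteria amount to a simple per-participant filter; a sketch follows, in which the field names are hypothetical, not the cohort's actual variable names:

```python
def eligible(participant):
    """Return True if a participant passes the exclusion criteria above.

    Hypothetical fields: 'dropout_year' (None if the participant never
    dropped out), 'energy_kcal' (reported average daily energy intake),
    'ffq_blank_count' (number of unanswered FFQ questions), and
    'baseline_demographics_complete'.
    """
    stayed = (
        participant["dropout_year"] is None
        or participant["dropout_year"] >= 1986
    )
    plausible_intake = 600 <= participant["energy_kcal"] <= 3500
    complete_ffq = participant["ffq_blank_count"] <= 70
    return (
        stayed
        and plausible_intake
        and complete_ffq
        and participant["baseline_demographics_complete"]
    )
```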

Cumulative Average of Dietary Exposures
The use of cumulative average measurements takes advantage of all prior data and helps reduce random within-person variation, thus offering a statistically more powerful test of an association with cumulative exposure [6]. The use of the cumulative average model to examine the association between disease incidence and repeated measurements of exposures in cohort studies dates back to the 1960s [7], and it has been used by many researchers [8][9][10][11][12][13].
It has also been shown that analyses using cumulative averages to reflect the dietary records yield stronger associations than those using either only baseline diet or most recent diet [14].
In [14], a study analyzing the association between dietary fat and CHD, Hu et al. argued that the stronger associations might be due to the fact that cumulative averages reduce measurement error from intraindividual variation over time. It is also possible that cumulative averages, which reflect long-term diet, are more relevant etiologically than either the most remote (baseline) or the most recent diet.
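The cumulative-average exposure itself is straightforward to compute; a minimal sketch (with a hypothetical helper name) is:

```python
def cumulative_averages(intakes):
    """Cumulative average of repeated dietary measurements.

    intakes: reported intakes of one nutrient or food across successive
    FFQ waves. Returns, for each follow-up interval, the mean of all
    measurements collected up to and including that wave.
    """
    out, total = [], 0.0
    for k, value in enumerate(intakes, start=1):
        total += value
        out.append(total / k)
    return out

# Example: three FFQ waves; the exposure used for the third follow-up
# interval is the average of all three reports.
# cumulative_averages([10, 20, 30]) -> [10.0, 15.0, 20.0]
```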

Multiple Testing
Since in EWAS we conduct multiple tests, we need to control the number of false positives to extract the truly significant associations. In the GWAS setting, the Bonferroni correction is used to control Type I errors, guaranteeing the family-wise error rate (FWER). The Bonferroni-corrected significance threshold is computed by dividing the original significance threshold α by the number of tests conducted on the dependent variable, and a null hypothesis is rejected when its P-value falls below this corrected threshold. This method assumes all tests are independent of each other, which is not always the case. With correlated tests, the Bonferroni threshold is too conservative and results in a high rate of false negatives, in turn reducing the power of detection [15]. To relax the assumption of independence among tests, one can use the number of effective tests in the Bonferroni adjustment, usually determined by principal component analysis [16][17][18]. However, applying both approaches to our results, we find the same 31 tests to be statistically significant with a threshold of 0.05.
An alternative approach is to control the false discovery rate (FDR), the expected proportion of significant results that are actually false positives. FDR-controlling methods, such as Benjamini-Hochberg, are less conservative than FWER procedures; they therefore have more power of detection, at the cost of a higher Type I error rate [19,20]. The Benjamini-Hochberg procedure also assumes that the individual tests are independent of each other. We also applied this procedure, finding 81 tests to be statistically significant while controlling the FDR at 0.05. Other, more sophisticated techniques, such as [21], have been developed for controlling the FDR and may be more appropriate when the independence assumption is violated.
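For reference, the Benjamini-Hochberg step can be sketched as follows (our own minimal implementation, not the code used in the study): sort the m P-values, find the largest rank k with P_(k) ≤ (k/m)·α, and reject every hypothesis whose P-value is at most P_(k).

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg FDR procedure; returns a boolean rejection mask."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    threshold = None  # largest P-value satisfying the step-up criterion
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            threshold = pvalues[i]
    return [threshold is not None and p <= threshold for p in pvalues]
```

Note that hypotheses with P-values below the largest qualifying P_(k) are rejected even if their own rank did not satisfy the criterion; that is what makes the procedure step-up rather than a simple per-test threshold.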
Estimating the FDR through permutations of the dependent variable is preferred when the variables are correlated [22][23][24]. In the EWAS context, where the independence assumption is rarely met, we compute an empirical estimate of the FDR derived by permuting the phenotype multiple times, effectively creating a null distribution of test statistics. Since the estimate of the FDR uses the data itself, it inherently accounts for the correlated structure of the data [23][24][25], an important quality given the dense correlation of environmental factors. The FDR is the ratio of the proportion of results called significant at a given level α in the null distribution to the proportion of results called significant in our real tests.
Here, we discuss the procedure we used to create a distribution of null P-values. Sampling from an unbiased urn follows a central hypergeometric distribution. In our study, because of the confounding effect of the adjusting variables, certain subjects have greater odds of developing CHD than others. Sampling from such a population resembles sampling from a biased urn, which follows a non-central hypergeometric distribution. In our permutation scheme, we maintain the confounding role of the adjusting variables even though the association between exposures and the disease is broken [26]. We use Cox regression to estimate the probability of developing CHD for each subject at each time, and we then permute the disease status according to the estimated probabilities. Assume N_0 is the number of non-case subjects and N_1 is the number of case subjects. Let D_j be a binary variable denoting the disease status of subject j, and let θ̂_j denote the estimated CHD odds for subject j. For permutation k, the permuted disease-status vector is

r_k = (r_k1, r_k2, ..., r_kN)^T    (2)

r_k follows a multivariate hypergeometric distribution. We can sample r_k using Fisher's noncentral hypergeometric distribution with non-centrality parameter θ̂ = (θ̂_1, θ̂_2, ..., θ̂_N). Because such sampling is computationally expensive, we simplify the sampling process: we generate a uniform random number, compare it with the spectrum of cumulative CHD odds, and assign CHD status to a subject accordingly, continuing to draw random numbers until N_1 subjects are picked. We then re-run the analysis and collect the null P-values, repeating the whole procedure 1,000 times.
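The simplified cumulative-odds sampler can be sketched as follows (an illustration with our own function names; in the actual analysis the odds come from a Cox model fitted on the adjusting covariates):

```python
import random

def permute_case_status(odds, n_cases, seed=0):
    """Draw n_cases 'case' labels without replacement, with probability
    proportional to each subject's estimated CHD odds.

    Each draw generates a uniform random number and scans the spectrum of
    cumulative odds of the subjects still in the urn, mimicking Fisher's
    noncentral hypergeometric sampling while staying cheap to compute.
    Returns a 0/1 list of permuted disease statuses.
    """
    rng = random.Random(seed)
    remaining = list(range(len(odds)))
    weights = [float(w) for w in odds]
    status = [0] * len(odds)
    for _ in range(n_cases):
        u = rng.random() * sum(weights)
        cum, chosen = 0.0, len(remaining) - 1  # fallback guards rounding
        for pos, w in enumerate(weights):
            cum += w
            if u <= cum:
                chosen = pos
                break
        status[remaining.pop(chosen)] = 1
        weights.pop(chosen)
    return status
```

Because the confounder-driven odds are preserved while the exposure values are not used at all, re-running the regressions on these permuted statuses yields P-values from the null distribution of no exposure-disease association.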
Assuming n tests and j permutations, the raw FDR at the P-value of test i is the ratio of the number of null P-values in the permuted data that fall below that P-value, averaged over the j permutations, to the number of real P-values that fall below it.

FDR Calculation
The raw FDR is calculated using equation (3):

r_i = { [I(p_1,1 ≤ P_i) + ... + I(p_n,j ≤ P_i)] / j } / [I(P_1 ≤ P_i) + ... + I(P_n ≤ P_i)]    (3)

where P_m is the observed P-value of test m, p_m,k is the null P-value of test m in permutation k, and I(·) is the indicator function. As the FDR should be a monotonically increasing function of the P-value, we ensure that the FDR for a given P-value is the minimum of the raw FDRs over all P-values equal to or greater than it:

f_i = min { r_m : P_m ≥ P_i }

The null hypothesis for test i is rejected if f_i is smaller than an FDR threshold.
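Putting the raw-FDR ratio of equation (3) and the monotonicity pass together, a minimal sketch (our own function names, operating on plain lists of P-values) is:

```python
def empirical_fdr(observed, null_matrix):
    """Permutation-based FDR estimate for each observed P-value.

    observed: the n real P-values, one per test.
    null_matrix: j lists, each holding the n null P-values from one
    permutation of the phenotype.
    For each observed P_i, the raw FDR is the average number of null
    P-values <= P_i per permutation, divided by the number of observed
    P-values <= P_i; a final pass enforces monotonicity in P.
    """
    j = len(null_matrix)
    flat_null = [p for perm in null_matrix for p in perm]
    raw = []
    for p_i in observed:
        null_hits = sum(q <= p_i for q in flat_null) / j
        real_hits = sum(q <= p_i for q in observed)
        raw.append(null_hits / real_hits if real_hits else 0.0)
    # FDR must increase with the P-value: assign each test the minimum
    # raw FDR over all tests with an equal or larger P-value.
    order = sorted(range(len(observed)), key=lambda i: observed[i], reverse=True)
    fdr, running = [0.0] * len(observed), float("inf")
    for i in order:
        running = min(running, raw[i])
        fdr[i] = running
    return fdr
```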