Abstract
Recent studies have used the lifestyle risk score (LRS), an aggregation of multiple lifestyle exposures, to identify gene–lifestyle interactions associated with disease traits. However, not all cohorts have data on all lifestyle factors, which increases heterogeneity in the environmental exposure in collaborative meta-analyses. We compared and evaluated four approaches (Naïve, Safe, Complete and Moderator Approaches) for handling this missingness in LRS-stratified meta-analyses under various scenarios. Compared to “benchmark” results with all lifestyle factors available for all cohorts, the Complete Approach, which included only cohorts with all lifestyle components, was underpowered due to lower sample size, and the Naïve Approach, which utilized all available data and ignored the missingness, yielded slightly inflated results. The Safe Approach, which used all data in the LRS-exposed group and only included cohorts with all lifestyle factors available in the LRS-unexposed group, and the Moderator Approach, which handled missingness via moderator meta-regression, were both slightly conservative and yielded almost identical p values. We also evaluated the performance of the Safe Approach under different scenarios. We observed that the larger the proportion of cohorts without missingness included, the closer the results were to the “benchmark” results. In conclusion, we generally recommend the Safe Approach, a straightforward and non-inflated approach, to handle heterogeneity among cohorts in LRS-based genome-wide interaction meta-analyses.
Introduction
Many large-scale, collaborative genome-wide association studies (GWAS) have successfully identified genetic determinants that explain part of the pathophysiological mechanisms underlying a wide range of traits. Despite these efforts and increased sample sizes, the explained variability of many traits is relatively small, and only a small proportion of the familial heritability can be explained by the candidate variants found [1,2,3].
In addition to genetics, environmental factors and gene–environment interactions may contribute to this unexplained trait heritability [3, 4]. Recently, genome-wide gene–environment interaction studies have been conducted to further explore the potential mechanisms underlying an array of diseases or disease traits of interest [5,6,7,8,9]. Thus far, these collaborative efforts have largely focused on a single environmental factor, such as smoking [7, 10,11,12,13], physical activity [8, 14], alcohol intake [15], educational attainment [6], and others [5, 16]. By accounting for the environmental risk factor, these efforts have identified novel loci beyond those identified by the traditional main effects-only GWAS. However, multiple environmental factors may simultaneously modify the genetic effects of loci [17]. In addition, single lifestyle variables may not capture the spectrum of relevant environmental variation, resulting in biased effect estimation and false-negative results due to reduced statistical power.
Lifestyle factors, such as smoking, physical inactivity and alcohol consumption, all contribute independently to the risk of developing cardiovascular diseases, and composite lifestyle risk scores (LRS) have been used previously to assess the combined effect of multiple lifestyle factors on cardiovascular disease development [18,19,20]. However, when applying LRS methodology in large collaborative consortium settings, challenges arise because not all lifestyle components in the LRS are available in all participating cohorts, and those that are available may not be measured using the same instrument. If ignored, these issues may introduce substantial measurement error and heterogeneity, reducing statistical power and potentially biasing results. In the present study, we explore different approaches for incorporating cohort-wide missingness of individual lifestyle components in a meta-analysis of genome-wide gene–environment interaction on systolic blood pressure in four European-ancestry (EA) cohorts.
Methods
Participating cohorts and subject inclusion
In this study, we included data from four cohorts: the Atherosclerosis Risk in Communities Study (ARIC), the Framingham Heart Study (FHS), the Hypertension Genetic Epidemiology Network (HyperGEN), and the Netherlands Epidemiology of Obesity Study (NEO). For cohorts with data collected from multiple visits, we chose the single visit that maximized the sample size with non-missing data. We included a total of 24,048 EA individuals aged 18–80 with non-missing genotype and phenotype information, including age, sex, systolic blood pressure (SBP), antihypertensive medications, body mass index (BMI), and the four lifestyle factors (smoking status, alcohol consumption, education level, and physical activity). The individual-level data for ARIC and FHS are available on dbGaP (dbGaP ID: phs000280.v7.p1 for ARIC and phs000007.v2.p1 for FHS), and HyperGEN and NEO are in-house data.
Phenotype and covariates
Resting SBP (mmHg) was calculated by taking the average of all available BP readings at the same clinical visit, and further adjusted by adding 15 mmHg for subjects with antihypertensive medication use [21]. SBP values that were more than six standard deviations away from the mean were winsorized to exactly six standard deviations from the mean, in order to reduce the potential influence of outliers. Other covariates included age, sex, field center (if appropriate), and principal components to account for population stratification.
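The phenotype preparation described above can be sketched as follows. This is a minimal illustration, not the cohorts' actual pipeline: the function name and arguments are hypothetical, and it assumes that the mean and the (sample) standard deviation used for winsorizing are computed on the medication-adjusted values.

```python
from statistics import mean, stdev

def adjust_and_winsorize_sbp(readings, on_medication, n_sd=6.0, med_offset=15.0):
    """Prepare SBP (mmHg) for a list of subjects.

    readings:      one list of SBP readings per subject (same clinical visit)
    on_medication: one bool per subject (antihypertensive medication use)
    """
    # Average all available readings, then add 15 mmHg for medication users.
    sbp = [mean(r) + (med_offset if med else 0.0)
           for r, med in zip(readings, on_medication)]
    # Assumption: winsorizing bounds come from the adjusted values.
    mu, sd = mean(sbp), stdev(sbp)
    lower, upper = mu - n_sd * sd, mu + n_sd * sd
    # Pull values beyond n_sd standard deviations back to the bound.
    return [min(max(x, lower), upper) for x in sbp]

values = adjust_and_winsorize_sbp(
    readings=[[120, 124], [140], [130, 130, 136]],
    on_medication=[False, True, False],
)
print(values)
```

With only three subjects no value lies six standard deviations from the mean, so the example output differs from the raw averages only by the medication offset.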
Genotyping and Quality Control (QC)
Genotyping was performed separately within each cohort using Affymetrix (Santa Clara, CA, USA) or Illumina (San Diego, CA, USA) genotyping arrays (Supplementary Table S1). Each cohort performed imputations with IMPUTE2 [22] or MaCH [23], using the cosmopolitan reference panel from the 1000 Genomes Project Phase 1 Integrated Release Version 3 Haplotypes (2010–11 data freeze, 2012-03-14 haplotypes) [24]. SNPs were excluded if they were non-autosomal, had minor allele frequency < 1% or had low imputation quality (r² < 0.1). We applied further quality control filters centrally during the meta-analysis.
Lifestyle Risk Score
We considered four lifestyle factors: smoking status (never/former/current smoker), current alcohol intake (abstinence/modest/heavy), educational attainment beyond high school (none/some college/college degree), and physical activity (inactive/active). Current alcohol intake included three groups (abstinence: 0 drinks/week; modest: 1–7 drinks/week; heavy: >7 drinks/week). We classified participants as “college degree” if they completed at least a 4-year college degree, as “some college” if they received any education beyond high school, including vocational school, but did not complete a 4-year college degree, and as “none” if they received no education beyond high school [6]. Physical activity was expressed in metabolic equivalents (MET; 1 MET = 1 kcal/kg/h). Inactive individuals were defined as those with <225 MET-minutes per week of moderate-to-vigorous leisure-time or commuting physical activity, or those in the lower quartile (25%) of the physical activity distribution within their cohort. The detailed definitions of active and inactive physical activity followed a previous study on gene–physical activity interaction [14].
We constructed the LRS in two steps. First, each lifestyle factor, treated as an individual lifestyle component, was categorized into no risk (value of 0), low risk (value of 1) and high risk (value of 2) based on its presumed effect on BP or cardiovascular health, except physical activity, which only had no-risk and low-risk categories [17]. A higher risk value indicates a category that is more strongly linked to unfavorable cardiovascular health outcomes. Note that we categorized modest alcohol intake as no risk and abstinence as low risk because moderate alcohol consumption has consistently been associated with a decreased risk of type 2 diabetes [25] and coronary artery disease [26] compared with abstention or excessive consumption [27]. Table 1 details the LRS component definitions.
Second, the “Complete” Quantitative LRS (QLRS-C) was calculated by summing up all four components, ranging from 0 to 7. We also calculated the “Partially Missing” Quantitative LRS (QLRS-M) using only 2–3 components pre-selected for each cohort by design to simulate real cohort-level missingness, as described in Table 2. For example, for ARIC, we included three lifestyle components (smoking, education, and physical activity) when constructing QLRS-M. QLRS-M ranges from 0 to 4 or 5, depending on the inclusion of lifestyle components for each cohort.
After constructing the Quantitative LRS, we further created Dichotomous LRS for the “Complete” (DLRS-C) and the “Partially Missing” (DLRS-M) summary scores. We gave a value of 0 (unexposed group) if the corresponding Quantitative LRS < 2 and a value of 1 (exposed group) if Quantitative LRS ≥ 2 (i.e., at least one risk component classified as high risk or at least two components classified as low risk). These dichotomized LRS measures are used to define exposed and unexposed strata in our analyses.
Note that participants in cohorts with partially missing lifestyle components have an LRS equal to or lower than the “true” score they would have had if all lifestyle components had been observed. This leads to potential misclassification when dichotomizing the LRS into exposed and unexposed groups. However, a participant can only be misclassified as unexposed, never as exposed, so the resulting heterogeneity affects the unexposed group only.
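The scoring and dichotomization can be illustrated with a short sketch. The component coding below is an assumption made for illustration (only the alcohol ordering and the two-level physical activity component are stated in the text; Table 1 is authoritative), and the function names are hypothetical. The component subset used for the “Partially Missing” score follows the ARIC example given above.

```python
# Assumed risk coding per component (illustrative; see Table 1 for the
# definitions actually used).
RISK = {
    "smoking":   {"never": 0, "former": 1, "current": 2},
    "alcohol":   {"modest": 0, "abstinence": 1, "heavy": 2},
    "education": {"college degree": 0, "some college": 1, "none": 2},
    "activity":  {"active": 0, "inactive": 1},
}
ALL_COMPONENTS = ("smoking", "alcohol", "education", "activity")

def quantitative_lrs(person, components=ALL_COMPONENTS):
    """Sum the risk values of the components available in a cohort."""
    return sum(RISK[c][person[c]] for c in components)

def dichotomous_lrs(qlrs, threshold=2):
    """1 = exposed (score >= 2), 0 = unexposed (score < 2)."""
    return 1 if qlrs >= threshold else 0

person = {"smoking": "former", "alcohol": "abstinence",
          "education": "college degree", "activity": "active"}

qlrs_c = quantitative_lrs(person)                                        # all four components
qlrs_m = quantitative_lrs(person, ("smoking", "education", "activity"))  # alcohol unavailable
print(qlrs_c, dichotomous_lrs(qlrs_c))  # 2 1
print(qlrs_m, dichotomous_lrs(qlrs_m))  # 1 0
```

Because every component contributes a non-negative value, dropping a component can only lower the score, so this participant moves from exposed to unexposed but the reverse can never happen.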
Statistical analysis
Overview
We conducted a two-stage analysis procedure. In Stage 1, each cohort performed LRS-stratified genome-wide association analysis on SBP using the main effect model E(Y) = β_0 + β_G SNP + β_C C, where Y is SBP, SNP is the imputed additive dosage value of the genetic variant, and C is the vector of covariates. This model was run in the DLRS-C exposed and DLRS-C unexposed strata separately, and then repeated in the DLRS-M exposed and DLRS-M unexposed strata. In Stage 2, we performed meta-analysis within each stratum, and then evaluated the joint effects of main and interaction effects by calculating the p values for the 2-degree-of-freedom joint test. Under Stage 2, we considered four different meta-analysis approaches for handling cohort-level missingness of lifestyle components (Naïve, Safe, Complete and Moderator Approaches). We evaluated the performance of the four approaches under four scenarios that create various patterns of missingness among the cohorts.
Stage 1: Cohort-specific stratified analysis and QC of association results
For Stage 1, each cohort performed four genome-wide association analyses on SBP using the main effect model: two strata (exposed/unexposed) × two LRS (DLRS-C/DLRS-M). Association analyses were implemented using either ProbABEL [28] for studies with unrelated samples or MMAP (https://mmap.github.io/) for studies with family relatedness. Relatedness in families was accounted for using a kinship matrix in the linear mixed model. Each cohort provided robust estimates of the stratum-specific genetic main effect and the corresponding robust standard error for all four analyses. Cohort-specific details are presented in Supplementary Table S1.
We performed extensive quality control (QC) using the R package EasyQC [29] on each of the cohort-specific association results centrally, which contained ~8–9 million variants. We restricted to SNPs with the imputation quality score ≥ 0.5 and the product of the imputation quality and minor allele count ≥ 20. Details of central QC can be found in Supplementary Note I.
Since QC and filtering were performed separately within each stratum, the set of variants remaining in each stratum differed slightly. Thus, we further harmonized the set of variants between the exposed and unexposed strata within each LRS, to ensure that the set of variants was identical between strata. After QC, the number of variants in each association result was between 5.3 million and 8.2 million.
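The two central filters and the between-stratum harmonization can be sketched as below. This is not the EasyQC implementation (EasyQC is an R package); it is a plain-Python illustration with hypothetical field names, and it assumes that the minor allele count is computed as 2 × N × MAF within the stratum.

```python
def passes_qc(variant, min_info=0.5, min_info_mac=20.0):
    """Apply the two central QC thresholds to one stratum-specific result."""
    # Assumption: minor allele count = 2 * sample size * minor allele frequency.
    mac = 2 * variant["n"] * variant["maf"]
    return variant["info"] >= min_info and variant["info"] * mac >= min_info_mac

def harmonize_strata(exposed, unexposed):
    """Keep only variants that pass QC in both strata."""
    keep_exposed = {v["id"] for v in exposed if passes_qc(v)}
    keep_unexposed = {v["id"] for v in unexposed if passes_qc(v)}
    shared = keep_exposed & keep_unexposed
    return ([v for v in exposed if v["id"] in shared],
            [v for v in unexposed if v["id"] in shared])

exposed = [
    {"id": "rs1", "n": 1000, "maf": 0.20, "info": 0.9},  # passes
    {"id": "rs2", "n": 1000, "maf": 0.01, "info": 0.6},  # info * MAC = 12, fails
    {"id": "rs3", "n": 1000, "maf": 0.30, "info": 0.4},  # info below 0.5, fails
]
unexposed = [
    {"id": "rs1", "n": 500, "maf": 0.20, "info": 0.9},   # passes
    {"id": "rs2", "n": 5000, "maf": 0.01, "info": 0.6},  # passes here only
]
kept_exposed, kept_unexposed = harmonize_strata(exposed, unexposed)
print([v["id"] for v in kept_exposed], [v["id"] for v in kept_unexposed])
```

The variant rs2 illustrates why harmonization is needed: it passes in the larger unexposed stratum but not in the exposed stratum, so it is dropped from both.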
Stage 2: Meta-analysis
After obtaining cohort-specific GWAS results using DLRS-C and DLRS-M, we first performed meta-analyses within each stratum (exposed/unexposed) using the results obtained from analyses using DLRS-C, and considered this set of meta-analyzed results as a “benchmark”, as there is no missing lifestyle component in each cohort’s LRS construction.
Then, to mimic the real-life situation in which some cohorts provide GWAS association results obtained from analyses using DLRS-C (referred to as “Complete” results) while others can only provide results using DLRS-M (referred to as “Partially Missing” results), we further performed the meta-analyses using a mixture of results obtained from cohort-specific analyses conducted with DLRS-C and DLRS-M. We considered four scenarios using different cohort mixture patterns by changing each cohort’s contribution of lifestyle components, in order to better utilize the data. The setting of each scenario is presented in Table 3. For example, Scenario 1 uses “Complete” results from ARIC, and “Partially Missing” results from HyperGEN, FHS and NEO.
As mentioned in the LRS section, cohort-level missingness in lifestyle components will cause misclassification when dichotomizing LRS into exposed and unexposed groups, hence leading to heterogeneity in the unexposed group only. To account for this heterogeneity, we considered four different meta-analysis approaches of utilizing “Complete” and “Partially Missing” results under various scenarios discussed above.
(1) Naïve Approach. This approach simply takes all association results contributed by each participating cohort, regardless of whether their LRS includes all lifestyle components, for both exposed and unexposed groups.
(2) Safe Approach. Since heterogeneity only occurs in the unexposed group, it is “safe” to take association results only from cohorts with LRS-C for the unexposed group analysis, while including results from all cohorts, whether or not their LRS has missing components, for the exposed group analysis.
(3) Complete Approach. This approach only uses association results from cohorts with “Complete” LRS data in meta-analysis, for both exposed and unexposed groups.
(4) Moderator Approach. This approach uses all the contributed data from cohorts without regard to their missingness in lifestyle components. It utilizes the framework of meta-regression, including moderator terms that indicate the missing LRS components across cohorts in the design matrix of the meta-regression to account for missingness during meta-analysis. Technical details of this approach are available in the Supplementary Method.
Table 3 also shows the inclusion of association results in the meta-analysis using each of the approaches described above under Scenarios 1–4. Note that the scenarios serve to illustrate different patterns of missingness across the cohorts, and that cohort inclusion varies only for the Safe and Complete Approaches, since both the Naïve and Moderator Approaches utilize all the cohort-contributed association results in the meta-analysis. Here we take Scenario 1 as an example. For the Naïve Approach, we analyze exposed and unexposed groups separately using “Complete” results from ARIC and “Partially Missing” results from HyperGEN, FHS and NEO, without differentiating “Complete” from “Partially Missing”. For the Safe Approach, we include ARIC results alone for the unexposed group and ignore the “Partially Missing” results contributed by the other cohorts; for the exposed group, we analyze all four cohorts using “Complete” results from ARIC and “Partially Missing” results from HyperGEN, FHS and NEO. For the Complete Approach, we analyze exposed and unexposed groups separately, but only use “Complete” results from ARIC with no other cohorts included. For the Moderator Approach, we take “Complete” results from ARIC and “Partially Missing” results from HyperGEN, FHS and NEO for both exposed and unexposed groups as input to the meta-regression.
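The inclusion rules of the four approaches can be summarized in a few lines of code. The sketch below is only an illustration of the selection logic described above; the function name and the result labels are hypothetical, and the Moderator Approach is shown as receiving the same inputs as the Naïve Approach, with the difference lying in the downstream meta-regression rather than in the selection.

```python
COHORTS = ("ARIC", "FHS", "HyperGEN", "NEO")

def select_results(approach, complete_cohorts, cohorts=COHORTS):
    """Return, per stratum, the (cohort, result type) pairs to meta-analyze."""
    complete = set(complete_cohorts)
    # Every cohort contributes its "Complete" result if it has one,
    # otherwise its "Partially Missing" result.
    everyone = [(c, "Complete" if c in complete else "Partially Missing")
                for c in cohorts]
    only_complete = [(c, "Complete") for c in cohorts if c in complete]

    if approach in ("naive", "moderator"):
        return {"exposed": everyone, "unexposed": everyone}
    if approach == "safe":
        # Misclassification affects the unexposed stratum only.
        return {"exposed": everyone, "unexposed": only_complete}
    if approach == "complete":
        return {"exposed": only_complete, "unexposed": only_complete}
    raise ValueError(f"unknown approach: {approach}")

# Scenario 1: only ARIC provides "Complete" results.
for name in ("naive", "safe", "complete", "moderator"):
    selection = select_results(name, complete_cohorts=["ARIC"])
    print(name, len(selection["exposed"]), len(selection["unexposed"]))
```

Under Scenario 1 this prints four exposed and four unexposed inputs for the Naïve and Moderator Approaches, four and one for the Safe Approach, and one and one for the Complete Approach.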
For the “benchmark” meta-analysis and the first three approaches (Naïve, Safe and Complete), we used METAL [30] to perform meta-analyses within each stratum and EasyStrata [31] to calculate the two degree of freedom joint p values. For the Moderator Approach, we used the Moderator Web App and R code developed by Dr. RJ Waken (https://rjwaken.shinyapps.io/missing_lrs_meta/).
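For readers unfamiliar with these tools, the stratum-wise meta-analysis followed by the joint test can be sketched as below. This is not the METAL or EasyStrata code. It assumes a fixed-effect inverse-variance weighted meta-analysis within each stratum and a stratified joint statistic equal to the sum of the squared stratum-specific Z-scores, referred to a chi-square distribution with 2 degrees of freedom; the function names and the numbers in the example are hypothetical.

```python
import math

def ivw_meta(estimates):
    """Fixed-effect inverse-variance weighted meta-analysis.

    estimates: list of (beta, standard_error) pairs, one per cohort.
    Returns the pooled (beta, standard_error).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return beta, math.sqrt(1.0 / total)

def joint_2df_p(exposed, unexposed):
    """P value of the assumed 2-degree-of-freedom stratified joint test.

    exposed, unexposed: pooled (beta, standard_error) for each stratum.
    """
    chi2 = (exposed[0] / exposed[1]) ** 2 + (unexposed[0] / unexposed[1]) ** 2
    # The chi-square survival function with 2 df has the closed form exp(-x / 2).
    return math.exp(-chi2 / 2.0)

# Hypothetical stratum-specific SNP effects (mmHg per allele) from three cohorts.
exposed = ivw_meta([(0.9, 0.30), (1.1, 0.40), (0.7, 0.50)])
unexposed = ivw_meta([(0.2, 0.35), (0.1, 0.45), (0.3, 0.60)])
print(exposed, unexposed, joint_2df_p(exposed, unexposed))
```

Under the Safe Approach, the two calls to `ivw_meta` would simply receive different cohort lists: all cohorts for the exposed stratum and only the cohorts with “Complete” results for the unexposed stratum.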
Results
Sample characteristics
Sample characteristics are presented in Supplementary Tables S2, S3, and S4. ARIC had the largest sample size (N = 9426) and HyperGEN had the fewest participants (N = 1249). All cohorts had similar distributions of age, sex and BMI, except that FHS and HyperGEN had a wider age range than ARIC and NEO (Supplementary Table S2). As shown in Supplementary Tables S3 and S4, the exposed group had a slightly higher SBP level than the unexposed group in all four cohorts when exposure was defined using DLRS-C. However, the difference in SBP levels between exposed and unexposed groups was smaller when we defined exposure groups using DLRS-M. The proportion of subjects in the exposed group was smaller when using DLRS-M compared to DLRS-C, indicating potential misclassification.
Results comparison between approaches
Since Scenarios 2–4 presented patterns similar to Scenario 1, both in comparison with the “benchmark” results and in the within-scenario comparison between approaches, we focus on Scenario 1 in the remainder of this subsection. The detailed comparison results of Scenarios 2–4 are available in Supplementary Figs. S1–S9.
Figure 1 presents the results of the four meta-analysis approaches compared to the “benchmark” results. Among variants that reached the genome-wide significance level (p value < 5 × 10⁻⁸), we observed that the Complete Approach yielded much larger p values than the “benchmark” results, reflecting lower statistical power due to lower sample size. The Naïve Approach was able to detect the same set of genome-wide significant variants as the “benchmark” results, but with slightly smaller p values. The Safe and Moderator Approaches led to slightly larger p values than the “benchmark” results. The Q–Q plot (Fig. 2) also shows that the Complete Approach obtained the most deflated p values among the four approaches (λ for Complete vs. benchmark = 0.972). The Safe Approach and Moderator Approach yielded similar, slightly conservative results (λ for Safe vs. benchmark = λ for Moderator vs. benchmark = 0.985), while the results of the Naïve Approach were slightly inflated (λ for Naïve vs. benchmark = 1.004).
Figure 3 shows the pair-wise comparison among the four meta-analysis approaches. The Safe and Moderator Approaches yielded similar but slightly larger p values than the Naïve Approach, and the degree of similarity increased with significance. Notably, the results of the Safe Approach and Moderator Approach were almost identical, but the number of variants included in the analyses for the Moderator Approach (5,258,666 variants) was much smaller than for the Safe Approach (8,181,669 variants), because the analysis of the Moderator Approach was restricted to SNPs with association results present in all four cohorts.
Result comparison between scenarios for Safe Approach
Here we further evaluated the performance of the same meta-analysis approach under different scenarios. Since we were generally concerned with false-positive results, we focused our attention on the non-inflated Safe Approach. Figure 4 presents the scatterplot of association results between the “benchmark” and the Safe Approach for each of the four scenarios, for variants with p value < 1 × 10⁻⁶ in at least one of the two result sets being compared. We observed that for SNPs reaching genome-wide significance (p value < 5 × 10⁻⁸) in the “benchmark” results, the points of Scenarios 3 and 4 lay almost along the diagonal line, while the points of Scenarios 1 and 2 deviated somewhat from the diagonal. This indicated that the Safe Approach under Scenarios 3 and 4 more accurately identified positive signals than under Scenarios 1 and 2.
The Q–Q plot (Supplementary Fig. S10) shows that when p values were large (>10⁻⁵), Scenario 4, which had less missingness, provided a p value distribution closer to the “benchmark” results (λ for Scenario 4 vs. benchmark = 0.991) than Scenario 1 (λ for Scenario 1 vs. benchmark = 0.983) and Scenario 3 (λ for Scenario 3 vs. benchmark = 0.984). Although Scenario 2 seemed to perform very well on large p values (λ for Scenario 2 vs. benchmark = 0.994), it provided substantially deflated results toward the tail when reaching genome-wide significance. Meanwhile, Scenarios 3 and 4 had similar p value distributions, and both were very close to the “benchmark” distribution when p values were small. The p values of Scenario 1 were closer to the diagonal line than those of Scenario 2 when p values were small, which may be due to the sample size of the cohort with “Complete” results in Scenario 1 (ARIC, N = 9426) being greater than that in Scenario 2 (FHS, N = 7638).
In general, we consider that Scenario 4 performed better than Scenario 3, which in turn performed better than Scenarios 1 and 2. This meets our expectation, as Scenario 4 had the smallest proportion of cohorts using “Partially Missing” results and was therefore expected to bring the most comprehensive information into the meta-analysis.
Discussion
In this study, we evaluated four different strategies for handling the cohort-level missingness of individual lifestyle components in the meta-analysis of gene–lifestyle interaction using LRS-stratified summary statistics from participating cohorts. We aimed to find the best way to leverage the available data while appropriately handling the heterogeneity due to missing data in the LRS, and thereby improve the power to identify novel loci for the trait of interest. Utilizing only data contributed by the cohorts without missingness in any lifestyle components (the Complete Approach) has lower statistical power due to lower sample size, while freely meta-analyzing all the association results contributed by the cohorts, even those with missing components in the LRS (the Naïve Approach), yields slightly inflated results. The Safe Approach and Moderator Approach are both slightly conservative and their p values are almost identical to each other. We also observed that, as expected, the more cohorts with non-missing lifestyle components we used in meta-analysis, the more accurate the results. This result confirms our primary hypothesis.
A risk score is a commonly used approach to evaluate the combined effects of risk factors, and it may play an important role in personalized medicine. In the past, the scientific community has proposed several well-known risk scores. For example, the Framingham Risk Score [32] is a sex-specific score used to estimate the 10-year cardiovascular risk, and the diabetes risk score [33] is a screening tool for identifying subjects at high risk for type 2 diabetes. The LRS has also become popular as interest grows in the clinical implications of the joint effects of individual lifestyle factors on a specific trait, disease, or time-to-event outcome. Meanwhile, the genetic risk score (GRS) has become a widely used tool to improve identification of persons who are at risk for common complex diseases [34, 35].
Some prior studies have combined GRS and LRS to explore their joint behavior on the risk of CVD [19] and colorectal cancer [36]. Specifically, these studies divided study samples into subgroups based on the combination of GRS level and LRS level, and found that within and across genetic risk groups, adherence to a poor behavioral lifestyle was associated with increased risk of disease, and there was no interaction effect between genetic risk and lifestyle risk. This might seem discouraging regarding whether genetic information could add much to risk prediction studies using LRS. However, it is important to note that the GRS was calculated based on variants reported from previous standard genome-wide significant analyses without taking their potential modification by lifestyle into consideration; variants whose effects differ by level of LRS might therefore be missed by standard GWAS screening. Moreover, an LRS may have a different modification effect on each variant, so instead of looking at the aggregated GRS only, interaction with one variant at a time should also be evaluated. Our study looked into the combination of genetic and lifestyle information by performing meta-analysis of gene-by-lifestyle interaction in order to find novel loci for complex disease traits, and those potential novel loci may provide additional information for computing a GRS, which could increase the power of such studies.
Handling missing data in the aggregation of risk factors is challenging, yet important and worth exploring in further detail. Owing to the properties of genetic architecture, a GRS can be computed using imputed or proxy SNPs when the originally reported variants are not available, based on widely available reference panels such as the 1000 Genomes Project [37]. Thus, it is more flexible than the LRS in terms of dealing with missingness. Several methods have been proposed to impute phenotypes using the correlation structure between phenotypes, family structure or information from other cohorts [38,39,40], but these methods rarely dealt with the case in which one phenotype is completely unavailable for all the individuals in one particular cohort contributing to a large meta-analysis, which is what we encountered in our study. In the context of summary statistics in meta-analysis, a previous study [41] tried to deal with the issue of missingness by restricting the study sample to cohorts with at least three out of five lifestyle behaviors available, reducing sample size and thus power to a great extent, with the issue of heterogeneity unresolved. Our study proposes making the best use of the available data gathered from cohorts to obtain accurate combined effects of risk factors, thereby providing a novel perspective for LRS-based meta-analysis in future research.
Our study examined the Moderator Approach, which is a novel method of accounting for missingness via meta-regression in the gene-by-environment interaction field. Instead of performing stratum-specific meta-analyses and then evaluating the interaction, this approach can achieve the final goal in one step via meta-regression, with the cohort-specific results of both exposure groups as input. However, due to the meta-regression setting, the Moderator Approach requires that the number of cohorts with GWAS results available for a SNP (4 in our study) be greater than the number of predictors divided by two (which is [one main effect + one interaction effect + four missingness effects]/2 = 3 in our study). It therefore restricted the analyses to the SNPs present in the GWAS results of all four cohorts, thereby eliminating a large number of SNPs from the analyses and possibly missing positive signals. In addition, the design matrix of the meta-regression model in the Moderator Approach should be treated with caution because, under some patterns of missingness, the design matrix suffers from multicollinearity and the least-squares estimates cannot be obtained. Since the Safe Approach provides results almost identical to those of the Moderator Approach but places no restriction on the missingness pattern or on the number of cohorts and predictors, we recommend using the Safe Approach to handle missingness during meta-analysis. Future work could further investigate the Moderator Approach and evaluate the performance of the Safe Approach and Moderator Approach in large-scale meta-analyses.
Although our general suggestion is to use the Safe Approach, we also provide some implementation suggestions for each of the other approaches. Since the Naïve Approach can be strongly affected by sample misclassification and produces inflated results, it is only applicable when misclassification is a minor issue and researchers are more concerned about type II error than type I error. The Complete Approach is underpowered due to sample size reduction, so it can be an option if the sample size of cohorts with “Complete” results is large enough to provide adequate power. The Moderator Approach utilizes all the available data but was restricted to SNPs present in all four cohorts in our study, so it is also a good choice when the variants available from the cohorts’ association results are consistent.
In gene-by-lifestyle interaction analysis, power depends on the effect size of the genetic variant, the effect size of the interaction and the minor allele frequency. These factors may lead to differences in power when the approaches are applied to other traits, owing to potentially different underlying genetic mechanisms. Our work mainly focuses on the inclusion strategies for participating cohorts based on their availability of risk factors of interest and uses BP as an illustrative phenotype to demonstrate the concept. Therefore, we do not expect the observed pattern to change in the analysis of other heritable phenotypes. Similarly, we believe that our conclusions may be generalizable to other types of risk scores or environmental variables where misclassification may occur and affect the performance of the meta-analysis results. On the other hand, researchers may be interested in treating the LRS as a continuous variable rather than performing LRS-stratified analyses. In this case, a stringent assumption of a linear interaction effect is made, and the Naïve, Safe and Complete Approaches, which all rely on misclassification occurring only in the unexposed group, will not work. However, the meta-regression framework has the potential to deal with a continuous LRS by incorporating moderator terms indicating the missingness. Further investigations are needed in this regard.
Our study has several important strengths. To our knowledge, this is the first study to explore how to deal with cohort-level missingness in individual lifestyle components in order to improve the power for identifying novel genetic loci for complex disease traits through collaborative meta-analysis. Our study performed thorough comparisons between four meta-analysis approaches via various cohort mixture scenarios, thus providing comprehensive information for investigators to refer to.
Although this study has several strengths as an innovative work on missingness in gene-by-lifestyle interaction, it has some limitations. Our empirical evidence serves as a first step in exploring the effect of missingness in lifestyle factors and generates hypotheses; however, systematic experiments are still needed to disentangle the underlying mechanisms, as many uncertain factors cannot be controlled due to the nature of empirical data. For example, it remains unclear whether a stronger effect in the unexposed group leads to inflated or deflated results through the missingness and the resulting misclassification in the unexposed group. Also, although we considered various settings, we were not able to cover every possible pattern, such as alternative missingness assignments in the LRS-M calculation or alternative cohort mixture scenarios with LRS-C or LRS-M. This kind of design may lose some flexibility and consequently fail to capture all the information during the comparison. Moreover, our study mainly evaluated the performance of the different approaches in terms of joint effects instead of focusing on the interaction effect. We did not manage to capture a clear pattern when comparing the interaction effect between different meta-analysis approaches, due to the small sample size of our study. Comparing the interaction effect itself among the approaches by incorporating more cohorts is a worthwhile next step.
In summary, we evaluated four approaches for handling cohort-level missingness of lifestyle components in the meta-analysis of gene-by-lifestyle interaction. Based on our results, we generally recommend the Safe Approach, since it is straightforward to implement and yields non-inflated results. Handling the missingness of individual lifestyle components appropriately can increase the statistical power of gene-by-lifestyle interaction meta-analyses to identify novel loci for complex traits.
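The cohort inclusion rule behind the Safe Approach can be sketched in a few lines: every cohort contributes to the LRS-exposed stratum, while only cohorts with all lifestyle components available contribute to the LRS-unexposed stratum. The example below is a minimal sketch, assuming fixed-effect inverse-variance weighting within each stratum; the data layout, the function names `ivw_meta` and `safe_approach`, and all numbers are hypothetical and are not taken from the study.

```python
import math

def ivw_meta(estimates):
    """Fixed-effect inverse-variance weighted meta-analysis of (beta, se) pairs."""
    if not estimates:
        raise ValueError("no cohorts available for this stratum")
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return beta, math.sqrt(1.0 / total)

def safe_approach(cohorts):
    """Apply the Safe Approach inclusion rule, then meta-analyze each stratum."""
    # Exposed stratum: use all cohorts, with or without missing components
    exposed = [c["exposed"] for c in cohorts]
    # Unexposed stratum: misclassification can arise here, so keep complete cohorts only
    unexposed = [c["unexposed"] for c in cohorts if c["complete"]]
    return {"exposed": ivw_meta(exposed), "unexposed": ivw_meta(unexposed)}

# Hypothetical stratum-specific (beta, se) summary statistics for one variant
cohorts = [
    {"name": "A", "complete": True, "exposed": (0.40, 0.10), "unexposed": (0.10, 0.10)},
    {"name": "B", "complete": True, "exposed": (0.60, 0.10), "unexposed": (0.30, 0.10)},
    {"name": "C", "complete": False, "exposed": (0.50, 0.10), "unexposed": (0.45, 0.10)},
]
result = safe_approach(cohorts)
```

In this toy input, cohort C lacks at least one lifestyle component, so its unexposed-stratum estimate is excluded while its exposed-stratum estimate is retained. This is why the Safe Approach gives up less sample size than restricting both strata to complete cohorts.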
Acknowledgements
This project was largely supported by a grant from the U.S. National Heart, Lung, and Blood Institute (NHLBI), the National Institutes of Health, R01HL118305. The full list of acknowledgments appears in the Supplementary Note II.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Xu, H., Schwander, K., Brown, M.R. et al. Lifestyle Risk Score: handling missingness of individual lifestyle components in meta-analysis of gene-by-lifestyle interactions. Eur J Hum Genet 29, 839–850 (2021). https://doi.org/10.1038/s41431-021-00808-x