An integrative analysis of genomic and exposomic data for complex traits and phenotypic prediction

Zhou, Xuan; Lee, S. Hong

doi:10.1038/s41598-021-00427-y

Article
Open access
Published: 02 November 2021

Xuan Zhou^1,2,3 &
S. Hong Lee^1,2,3

Scientific Reports volume 11, Article number: 21495 (2021) Cite this article

Abstract

Complementary to the genome, the concept of exposome has been proposed to capture the totality of human environmental exposures. While there has been some recent progress on the construction of the exposome, few tools exist that can integrate the genome and exposome for complex trait analyses. Here we propose a linear mixed model approach to bridge this gap, which jointly models the random effects of the two omics layers on phenotypes of complex traits. We illustrate our approach using traits from the UK Biobank (e.g., BMI and height for N ~ 35,000) with a small fraction of the exposome that comprises 28 lifestyle factors. The joint model of the genome and exposome explains substantially more phenotypic variance and significantly improves phenotypic prediction accuracy, compared to the model based on the genome alone. The additional phenotypic variance captured by the exposome includes its additive effects as well as non-additive effects such as genome–exposome (gxe) and exposome–exposome (exe) interactions. For example, 19% of variation in BMI is explained by additive effects of the genome, while additional 7.2% by additive effects of the exposome, 1.9% by exe interactions and 4.5% by gxe interactions. Correspondingly, the prediction accuracy for BMI, computed using Pearson’s correlation between the observed and predicted phenotypes, improves from 0.15 (based on the genome alone) to 0.35 (based on the genome and exposome). We also show, using established theories, that integrating genomic and exposomic data can be an effective way of attaining a clinically meaningful level of prediction accuracy for disease traits. In conclusion, the genomic and exposomic effects can contribute to phenotypic variation via their latent relationships, i.e. genome-exposome correlation, and gxe and exe interactions, and modelling these effects has a potential to improve phenotypic prediction accuracy and thus holds a great promise for future clinical practice.

Introduction

Both genetic and environmental factors underlie phenotypic variance of complex traits. Understanding the influences of these factors not only helps explain why individuals differ from one another in phenotypes but also helps predict future phenotypes, such as disease diagnoses. The proliferation of genotypic data in the past decades, along with developments in relevant analytic tools, has already contributed a great deal to understanding phenotypic variations of complex traits^{1,2,3,4,5,6,7,8,9}, and has enabled phenotypic predictions at a level of accuracy for potential use in clinical settings^10,11,12. However, these understandings and predictions are bounded by the heritability of the traits, and for many complex traits, a large proportion of phenotypic variation remains unexplained, suggesting substantial environmental contributions to phenotypic variance.

Complementary to the genome, the concept of exposome has been proposed to capture the totality of human environmental exposures, encompassing external as well as internal environments over the lifetime of a given individual^13,14,15. Similar to genotypes, exposomic variables can be standardised across cohorts¹⁶. Since the inception of the concept, considerable efforts have been made to assess and characterise the exposome¹⁷. For example, the Human Early-Life Exposome project is a European collaborative effort established to characterize the early-life exposome, which includes all environmental hazards that mothers and children are exposed to¹⁸. Despite the progress in the construction of the exposome, few analytic tools exist to date that can integrate genomic and exposomic data for complex trait analyses. We hypothesize that exposomic variables not only affect phenotypes on their own but also interact with each other^19,20 and with genotypes^20,21. In addition, the estimation of exposomic effects and genomic effects on phenotypes could be biased if these effects are correlated but the estimation model assumes otherwise²². Hence, tools that integrate genomic and exposomic data are required to capture variance as well as covariance components of phenotypes.

Here we propose a versatile linear mixed model that fulfils these requirements. The proposed approach jointly models the random effects of the genome and exposome and can be extended to capture genome-exposome and exposome-exposome interactions and genome-exposome correlations in the phenotypic analysis of a complex trait. It also allows us to model exposomic effects modulated by one or a few specific environmental variables. We demonstrate the proposed approach using 11 complex traits from the UK Biobank and 28 lifestyle exposures that were measured using a standard protocol.

Results

Method overview

We used a novel linear mixed model (LMM) to jointly model the effects of the genome and exposome on the phenotypes of a complex trait. The exposome here is restricted to 28 lifestyle exposures that were measured using a standard protocol (see “Methods”). Our model has three key features. First, it allows estimation of the correlation between genomic and exposomic effects, relaxing the assumption of independence between those effects as in a conventional LMM²². Second, the model can capture both additive and non-additive effects of the exposome and genome, i.e. pairwise interactions between exposomic variables (exe interactions; e.g.¹⁹) and interactions between exposomic variables and genotypes (i.e., gxe interactions; e.g.²¹). Third, the model can handle correlated exposomic variables (see Simulations in “Methods”) that may cause biased variance estimations of exposomic variables (e.g.²⁰).

To illustrate the use of the model with real data, we selected 11 complex traits from the UK Biobank with heritability estimates above 0.05, including BMI, sitting height and years of education etc. (https://nealelab.github.io/UKBB_ldsc/), along with 28 lifestyle variables, including alcohol use, smoking, physical activity and dietary composition (see “Methods” for a detailed description). We performed the following analyses. First, for each trait, we used various models to estimate variance components of the additive and non-additive effects of the exposome and genome, including exe interactions and gxe interactions. The significance of the variance components was determined through a series of model comparisons using likelihood ratio tests (Table 1). Second, we extended the proposed model to examine the extent to which exposomic effects are modulated by covariates such as age, sex and socio-economic status (i.e., exc interactions). Third, we used fivefold cross validation to show that the prediction accuracy increased significantly after accounting for the exposomic effects and exe interactions. Finally, we explored the potential clinical use of the proposed integrative analysis of genomic and exposomic data, by projecting its prediction accuracy for a disease trait in terms of the area under the receiver operating characteristic curve (AUC). The projection was based on well-established theories^{23,24,25,26,27,28,29,30} that express AUC as a function of sample size, proportions of variance explained by genomic and exposomic effects and the population prevalence of the disease.
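The model comparisons by likelihood ratio test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `lrt_pvalue` and the log-likelihood values are hypothetical, and the 50:50 mixture of chi-square distributions with 0 and 1 degrees of freedom is a common convention for testing a single variance component on the boundary of its parameter space, which the text does not confirm was used here.

```python
import math


def lrt_pvalue(loglik_full, loglik_reduced, on_boundary=True):
    """P-value of a likelihood ratio test between two nested models
    that differ by one parameter (hypothetical helper).

    on_boundary=True  : the null value lies on the boundary of the
                        parameter space (e.g. a variance equal to zero);
                        assumes a 0.5*chi2(0) + 0.5*chi2(1) null.
    on_boundary=False : ordinary chi-square test with 1 df
                        (e.g. a covariance equal to zero).
    """
    # Likelihood ratio statistic, clipped at zero to absorb numerical noise.
    lrt = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    if lrt == 0.0:
        return 1.0
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x/2)).
    p_chi2 = math.erfc(math.sqrt(lrt / 2.0))
    return 0.5 * p_chi2 if on_boundary else p_chi2


# Hypothetical log-likelihoods of y = g + e + eps (full) and y = g + eps (reduced).
p_var = lrt_pvalue(-1000.0, -1001.92)
p_cov = lrt_pvalue(-1000.0, -1001.92, on_boundary=False)
print(p_var, p_cov)
```

Under these assumptions, a variance component such as ${\upsigma }_{\mathrm{e}}^{2}$ would be tested with `on_boundary=True`, whereas a covariance such as ${\upsigma }_{\mathrm{ge}}$, which can take either sign, would be tested with `on_boundary=False`.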

Table 1 P-values for estimated variance components of selected traits.

Exposomic effects on phenotypes

In line with previous estimation (https://nealelab.github.io/UKBB_ldsc/), we found significant SNP-based heritability for all selected traits, with estimates ranging between 0.08 (years of education) and 0.52 (standing height; Fig. 1). We detected significant additive effects of the lifestyle-exposome on phenotypes of all traits (see Fig. 1 for e and Table 1 for p-values under H₀ ${\upsigma }_{\mathrm{e}}^{2}=0$). The magnitude of these additive effects, however, varied across traits. For example, the exposome accounted for 8.5% of the phenotypic variance of waist circumference, but less than 2.5% for height, standing height, heel bone mineral density and fluid intelligence. Importantly, the additive exposomic effects were mostly uncorrelated with the genetic effects (see Table 1 for p-values under H₀ ${\upsigma }_{\mathrm{ge}}=0$; see Supplementary Table 1 for covariance estimates), which was notably different from the genome-transcriptome correlation²².

The estimated variance component of non-additive effects of the lifestyle-exposome (exe) was highly significant for 7 out of 11 traits (Table 1), although it only accounted for ~ 1% to 2% of phenotypic variance (see Fig. 1 and Supplementary Table 2). By contrast, significant gxe interactions were only evident for BMI, weight and years of education (Table 1), but they could account for up to 9% of total phenotypic variance (years of education; Fig. 1 and Supplementary Table 2). The scarcity of gxe signals is probably due to the relatively low power to detect gxe interactions, which is caused by the large number of gxe interaction terms to be estimated in the model, i.e. 28 (number of exposomic variables) × ~ 1.1 million (number of SNPs; see “Methods”) in this study. In addition, the identified gxe and exe interactions are largely independent of each other. This is evidenced by the fact that both gxe and exe remained significant when jointly modelled (see p-values under H₀ ${\upsigma }_{\mathrm{gxe}|\mathrm{exe}}=0$ and under H₀ ${\upsigma }_{\mathrm{exe}|\mathrm{gxe}}=0$).

By extending the proposed model to a reaction norm model (RNM; see “Methods”), we examined whether the additive exposomic effects on phenotype vary depending on specific covariates, which would be evidenced by the presence of significant exc interactions. Using single-covariate RNMs, we identified several significant exc interactions (Supplementary Table 3). Most of the covariates involved are lifestyle related, which is in line with the exe interactions found above. For each trait, we then fitted an RNM that simultaneously includes all significant exc interactions identified from the single-covariate RNM analyses. The variance estimates of exc interactions from the joint analyses are presented in Supplementary Table 4.

It is important to note that the estimation of exposomic effects is sensitive to the correlation structure of exposomic variables. Specifically, multicollinearity between exposomic variables would bias the estimate of ${\upsigma }_{\mathrm{e}}^{2}$ (see simulations in “Methods”); and by extension, correlated exe interaction terms and gxe interaction terms (model equations v and iv in Table 2, respectively) could bias the estimates of ${\upsigma }_{\mathrm{exe}}^{2}$ and ${\upsigma }_{\mathrm{gxe}}^{2}$, as empirically observed in the simulations (see “Methods”). Because the true values of the variance components are unknown, transforming exposomic variables and interaction terms using a principal component analysis (see “Methods”) seems necessary prior to model fitting in order to avoid estimation bias due to multicollinearity. While transforming the exposomic variables and the exe interaction terms is computationally trivial, transforming the gxe interaction terms is computationally infeasible (28 × ~ 1.1 million variables). Nonetheless, the variance of gxe interactions is small in general, suggesting that using the gxe interaction terms without the transformation (i.e., derived from $\mathbf{G} \odot \mathbf{E}$ in model equation iv of Table 2) is generally free from the estimation bias due to multicollinearity. Note that the largest variance estimate of gxe interactions in this study is ~ 0.09.

Table 2 Model equations and their assumed sample variance–covariance matrices.

Validation of exposomic effects

Figure 2a shows the phenotypic prediction accuracy based on genetic data alone. Using fivefold cross-validation, we found that including additive (e) and non-additive effects (exe) of the exposome, which were significant in the discovery dataset, could improve the phenotypic prediction accuracy in the target dataset. In general, the larger the variance estimates, the greater the prediction improvements (Fig. 2b,c), which indicates that the additive effects of the exposomic variables and exe interactions are genuine. Similarly, we also validated the exposomic effects modulated by specific covariates, by showing that the larger the total variance estimates of exc interactions, the greater the improvement in prediction accuracy (Fig. 3). The validated exc interactions would in part explain the phenotypic variance due to residual × covariate interactions found in our previous studies^31,32,33.

By contrast, although gxe interactions contribute to the phenotypic variance of BMI, weight and years of education (Table 1), the contribution did not lead to significant gains in phenotypic prediction accuracy (Supplementary Fig. 1). This was most likely due to a lack of power, i.e., the size of the discovery samples was insufficient to accurately estimate an extremely large number of parameters, namely the best linear unbiased predictions (BLUPs) of gxe interaction effects^23,27,28,34. This is further verified using simulations (see Simulations in “Methods” and Supplementary Fig. 2).

Given the sample sizes of the discovery data sets (~ 28,000), the prediction accuracies of the model y = g + ε for the selected traits are only between 1/3 and 1/2 of the theoretical maximums (i.e., square root of heritability; Supplementary Fig. 3). They can improve, in theory, by increasing the sample size of discovery sets (Supplementary Fig. 3); or, as shown above, by accounting for the additive effects of the exposome and exe interactions (Fig. 2b,c). To examine the prediction efficiency of the latter, we projected the observed prediction accuracies of the models y = g + e + ε and y = g + e + exe + ε onto the theoretical trajectory of prediction accuracies of the model y = g + ε as a function of the sample sizes of discovery datasets (Supplementary Fig. 3). As such, the use of exposomic information could improve phenotypic prediction accuracy to the same extent as a 1.2 to 14-fold increase in sample size, depending on the significance of the exposomic effects and their interactions (Fig. 4). Given the substantial costs and efforts associated with increasing sample size, the improvement in predictive accuracy achieved by the models y = g + e + ε and y = g + e + exe + ε is considerable, despite the fact that the proportion of phenotypic variance explained by the exposome is small (see the x-axis of Fig. 2b,c).
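The projection onto an equivalent sample size can be illustrated with a short sketch. It assumes the widely used approximation r² = h² · N h² / (N h² + Mₑ) for the accuracy of a genome-only predictor, where Mₑ is the effective number of independent SNPs; this is a simplification and not necessarily the exact expression used in this study. The function names and the value of Mₑ are hypothetical; the BMI figures (h² = 0.19, accuracy 0.15 and 0.35) are taken from the Abstract.

```python
import math


def max_accuracy(h2):
    """Theoretical maximum accuracy of y = g + eps: sqrt(heritability)."""
    return math.sqrt(h2)


def equivalent_sample_size(r, h2, m_e):
    """Discovery sample size at which a genome-only predictor would reach
    accuracy r, assuming r^2 = h2 * N*h2 / (N*h2 + m_e) (an approximation).
    """
    if not 0.0 < r * r < h2:
        raise ValueError("r must satisfy 0 < r^2 < h2")
    # Solving the approximation for N gives N = m_e * r^2 / (h2 * (h2 - r^2)).
    return m_e * r * r / (h2 * (h2 - r * r))


h2_bmi = 0.19        # SNP-based heritability of BMI (Abstract)
r_genome = 0.15      # accuracy, genome alone (Abstract)
r_joint = 0.35       # accuracy, genome and exposome (Abstract)
m_e = 50_000         # hypothetical effective number of SNPs

ratio_to_max = r_genome / max_accuracy(h2_bmi)
fold = (equivalent_sample_size(r_joint, h2_bmi, m_e)
        / equivalent_sample_size(r_genome, h2_bmi, m_e))
print(f"fraction of maximum accuracy: {ratio_to_max:.2f}")
print(f"equivalent fold increase in N: {fold:.1f}")
```

Under this approximation the fold increase does not depend on Mₑ, because Mₑ cancels in the ratio. For the BMI figures it comes to roughly 13.5, which falls within the 1.2 to 14-fold range reported above.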

Quantification of clinical relevance

We quantified the clinical relevance of the proposed model by exploring its prediction accuracy for quantitative traits and disease traits. For quantitative traits, we expressed the prediction accuracy of the model y = g + e + ε (i.e., correlation coefficient between the true and predicted phenotypes) as a function of the sample size of the discovery dataset, variances explained by the genome and exposome, and effective numbers of (independent) SNPs and exposomic variables (see “Methods”), using previous theoretical derivations^{27,28,29,30,34}. Based on the derived expression [Eq. (6)], we computed the expected prediction accuracies for the quantitative traits used in this study and found that they agreed well with the observed prediction accuracies from the fivefold cross validation (Supplementary Fig. 4). We then extended the derived expression to disease traits in terms of the area under the receiver operating characteristic curve [AUC; see Eq. (10) in “Methods” for details] using well-established theories^23,24,25,26. AUC is a widely used measure of how well a prediction model discriminates diseased from non-diseased individuals. An AUC between 0.7 and 0.8 is considered acceptable, 0.8 to 0.9 excellent, and above 0.9 outstanding³⁵. Figure 5 shows the expected AUC of the proposed integrative analysis of genomic and exposomic data for disease traits of different values of population prevalence (k), assuming different amounts of variance (on the liability scale) explained by the genome and exposome and different discovery sample sizes. For simplicity, we use ${\upsigma }_{\mathrm{e}.\mathrm{tot}}^{2}$ to denote the total variance in disease liability explained by additive effects of the exposome and exe interactions as a whole.

When setting ${\upsigma }_{\mathrm{e}.\mathrm{tot}}^{2}$ to 0 (that is, using no exposomic information at all), varying the heritability of disease liability h² from 0 to 0.3 improves AUC from 0.5 to ~ 0.6 when the sample size of the discovery set is 50 k. This is in contrast to a gain twice as large, from 0.5 to ~ 0.7, when the sample size is 500 k. Thus, genomic prediction accuracy heavily relies on sample size, such that for a disease trait with a moderate heritability, a clinically meaningful level of accuracy (AUC ≥ 0.7) may not be attainable unless the sample size of the discovery dataset is very large (≥ 500 k). On the other hand, the benefit of using exposomic information for disease prediction can be realised with a relatively small discovery sample. This is evidenced by the fact that when setting h² to 0 (i.e., using no genomic information at all), increasing the value of ${\upsigma }_{\mathrm{e}.\mathrm{tot}}^{2}$ has the same effect on AUC whether using a discovery sample of 50 k or 500 k individuals. Importantly, AUC consistently improves with increasing ${\upsigma }_{\mathrm{e}.\mathrm{tot}}^{2}$ in all scenarios (Fig. 5). Thus, incorporating exposomic data is not only an efficient but also an effective way of improving on the prediction accuracy achieved with genomic data alone. Taken together, genomic prediction accuracy for disease traits is constrained by sample size; with a relatively small sample at hand, a desired level of prediction accuracy may only be achieved by combining genomic and exposomic information.
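The dependence of AUC on population prevalence and on the variance in liability captured by a predictor can be sketched with the standard liability-threshold expression. This is an illustration only: it takes as input the variance in liability actually captured by the predictor (`r2_liability`, a hypothetical name), whereas Eq. (10) of this study additionally accounts for the discovery sample size, which is not modelled here.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def auc_from_liability_r2(r2_liability, prevalence):
    """Expected AUC of a predictor that explains a proportion r2_liability
    of the variance in disease liability, under a liability-threshold
    model with the given population prevalence (sketch).
    """
    if not 0.0 <= r2_liability < 1.0:
        raise ValueError("r2_liability must be in [0, 1)")
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be in (0, 1)")
    if r2_liability == 0.0:
        return 0.5  # an uninformative predictor
    t = _STD_NORMAL.inv_cdf(1.0 - prevalence)   # liability threshold
    z = _STD_NORMAL.pdf(t)                      # normal density at threshold
    i_case = z / prevalence                     # mean liability of cases
    i_ctrl = -z / (1.0 - prevalence)            # mean liability of controls
    # Variance of the predictor within cases and within controls
    # (each expressed relative to r2_liability).
    var_case = 1.0 - r2_liability * i_case * (i_case - t)
    var_ctrl = 1.0 - r2_liability * i_ctrl * (i_ctrl - t)
    x = ((i_case - i_ctrl) * r2_liability
         / math.sqrt(r2_liability * (var_case + var_ctrl)))
    return _STD_NORMAL.cdf(x)


# Hypothetical scenarios: prevalence 10%, increasing variance captured.
for r2 in (0.0, 0.05, 0.1, 0.3):
    print(r2, round(auc_from_liability_r2(r2, 0.10), 3))
```

Because the predictor in practice captures less than the full h² and ${\upsigma }_{\mathrm{e}.\mathrm{tot}}^{2}$ at finite sample sizes, the values from this sketch are upper bounds on what a given discovery sample would achieve.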

Comparison with existing models

The key model parameters of the proposed integrative analysis of genomic and exposomic data (IGE) compared to existing linear mixed models that incorporate genetic and environmental effects on phenotypes are outlined in Table 3. In general, IGE offers thus far the most detailed partition of phenotypic variance.

Table 3 Comparisons of methods (software packages) on the genomic and exposomic analysis of complex traits.

Both IGE and GxEMM³⁶ are whole-genome approaches to the estimation of heritability and gxe interactions. However, IGE is more comprehensive and versatile: it also models the variances explained by additive effects of exposomic variables, by exposome × exposome interactions, and by exposome × covariate (such as demographics) interactions, as well as the covariance between genetic effects and exposomic effects (Table 3). Further, bivariate or multivariate IGE (i.e., simultaneously including two or more traits) can be performed using mtg2 version 2.18 (https://sites.google.com/site/honglee0707/mtg2).

In contrast, StructLMM has been developed primarily for a genome-wide by environment interaction study (GWEIS)²⁰ that examines one SNP at a time with a focus on association tests (providing p-values) for G × E interactions between the SNP genotypes and multiple exposomic variables. Using the well-established SNP BLUP method^2,37,38, IGE can also provide GWEIS summary statistics, including estimated allele substitution effects of all SNPs across environments, their standard errors and p-values. Note that SNP BLUP implemented in IGE can model all SNPs jointly (a whole-genome approach). Nonetheless, one of the main aims of this study is to provide unbiased estimates of exposomic variances, e.g., ${\upsigma }_{\mathrm{e}}^{2}$, which is common to both StructLMM and IGE (Table 3 and Supplementary Note 1). Importantly, correlated exposomic variables would cause biased estimation of ${\upsigma }_{\mathrm{e}}^{2}$ (Supplementary Table 5) unless they are transformed to independent variables via a principal component analysis (“Methods”). To our knowledge, this transformation has not yet been implemented in any existing methods including StructLMM. Using results from simulations, we show that ${\upsigma }_{\mathrm{e}}^{2}$ estimates by StructLMM are prone to bias due to correlated environments (Supplementary Table 5). The other model parameters such as ${\upsigma }_{\mathrm{exe}}^{2}$, ${\upsigma }_{\mathrm{exc}}^{2}$, and ${\upsigma }_{\mathrm{g},\mathrm{ e}}$ cannot be estimated by StructLMM (Table 3).

Discussion

Using our approach, we demonstrate the importance of the exposome for understanding individual differences in phenotypes. Although the ‘exposome’ constructed in this study comprises only 28 lifestyle factors, when integrated with genomic data, it explained between 2 and 10% additional phenotypic variance and significantly improved phenotypic prediction accuracy to a level equivalent to a 1.2 to 14-fold increase in sample size. The additional phenotypic variance is not only from additive effects of the exposome but also from its non-additive effects (exe) and genome–exposome interactions (gxe). We expect that as the construction of the exposome continues to progress, more phenotypic variance will be explained and greater improvements in phenotypic prediction accuracy will be gained. This would be particularly promising for phenotypic analysis and prediction of traits with a small heritability component, such as ovarian and colorectal cancer³⁹.

We noted that when exposomic variables are correlated, the variance estimate of additive effects of exposomic variables is biased unless these variables are transformed using a principal component analysis (i.e. E in Table 2 should be based on transformed variables). By extension, this would apply to exe interaction terms and gxe interaction terms, unless the proportions of phenotypic variance explained by these interaction effects are small (< 10%), as shown in our simulations. These observations have important implications for modelling environmental effects in LMMs. Recently, Moore et al.²⁰ proposed the structured linear mixed model (StructLMM) that incorporates random effects of multiple environments in order to study the interactions between these environments and genotypes of a single SNP (i.e., gxe interactions). However, the environmental variables in StructLMM are not transformed, even though they are very likely correlated, which would have biased the variance estimate of environmental effects. Consequently, the extent to which this estimation bias affects the StructLMM-based test statistics for detecting gxe interactions remains uncertain.

Depending on the research question at hand, the construction of the exposome may be guided by causal analyses. A meaningful exposome may only contain causal information. Examples may include lifestyles that potentially alter the molecular pathways or the pathogenesis of the main trait, or biomarkers that potentially explain possible molecular pathways underlying the phenotypes. By contrast, in our BMI analysis, for example, it is not useful to include weight and height as part of the exposome, even though they would explain a large amount of phenotypic variance. This is because variations in these traits inform nothing other than the fact that they are correlated with the trait.

Heritability estimates were slightly reduced after including more variance components (result not shown). We considered two possibilities. First, the exposome may mediate part of the additive genetic effects on phenotypes. For example, some SNPs affect smoking status, which in turn affects BMI. A model that simultaneously includes genetic and exposomic data would account for smoking status and its genetic effects, and hence gives rise to reduced heritability estimates. Second, there is a genuine correlation between exposomic and genomic effects through some latent mechanism. We note that some correlation estimates were marginally significant, but none remained significant after Bonferroni correction. Such correlation may arise because people who have similar genotypes may somehow share similar exposures, i.e. genotype-environment correlation⁴⁰.

In conclusion, the genomic and exposomic effects can contribute to phenotypic variation via their latent relationships, i.e. genome-exposome correlation, and gxe and exe interactions, for which our proposed method can provide reliable estimates. We show that the integrative analysis of genomic and exposomic data has a great potential for understanding genetic and environmental contributions to complex traits and for improving phenotypic prediction accuracy, and thus holds a great promise for future clinical practice.

Methods

Ethics statement

We used data from the UK Biobank (http://www.ukbiobank.ac.uk/) for our analyses. The UK Biobank’s scientific protocol has been reviewed and approved by the North West Multi-centre Research Ethics Committee (MREC), National Information Governance Board for Health & Social Care (NIGB), and Community Health Index Advisory Group (CHIAG). UK Biobank has obtained informed consent from all participants. Our access to the UK Biobank data was under the reference number 14575. The research ethics approval of the current study was obtained from the University of South Australia Human Research Ethics Committee. All methods were performed in accordance with the relevant guidelines and regulations.

Genotype data

The UK Biobank contains health-related data from ~ 500,000 participants aged between 40 and 69, who were recruited throughout the UK between 2006 and 2010⁴¹. Prior to data analysis, we applied stringent quality control to exclude unreliable genotypic data. We excluded SNPs with an INFO score (used to indicate the quality of genotype imputation) < 0.6, a MAF < 0.01, a Hardy–Weinberg equilibrium p-value < 1e−4, or a call rate < 0.95. We then selected HapMap phase III SNPs, which are known to yield reliable estimates of SNP-based heritability^42,43,44, for downstream analyses. We excluded individuals who had a genotype-missing rate > 0.05, were not of white British ancestry, or had the first or second ancestry principal components outside six standard deviations of the population mean. We also applied quality control on the degree of relatedness between individuals by excluding one of any pair of individuals with a genomic relationship > 0.025. From the remaining individuals, we selected those who were included in both the first and second release of UK Biobank genotype data. Eventually, 288,837 individuals and 1,133,273 SNPs passed the quality control of genotype data. Among these, 38,921 individuals had no missing data for any of the exposomic variables used in the present study and were used in the main analysis. Depending on the missingness of the main phenotypic data, sample size varies across traits (see Table 1 for N).

Phenotype data

We chose eleven UK Biobank traits available to us that have a heritability estimate (by an independent open source; https://nealelab.github.io/UKBB_ldsc/) greater than 0.05. These traits are standing height, sitting height, body mass index, heel bone mineral density, fluid intelligence, weight, waist circumference, hip circumference, waist-to-hip ratio, diastolic blood pressure and years of education.

Prior to model fitting, phenotypic data were prepared using R (v3.4.3) in three sequential steps: (1) adjustment for confounders such as age, sex, birth year, socio-economic status (by Townsend Deprivation Index), population structure (by the first ten principal components of the genomic relationship matrix estimated using PLINK v1.9), assessment centre, and genotype batch using linear regression; (2) standardization; and (3) removal of data points outside ± 3 standard deviations from the mean.

Exposomic variables

We deliberately selected lifestyle-related variables that are known to affect some of the selected traits to construct the exposome in this study. These variables include smoking, alcohol intake, physical activity, and dietary composition. Details of these variables are listed in Supplementary Table 6. Our aim here is not to cover a comprehensive set of exposomic variables, but to demonstrate the potential use of the proposed integrative analysis of genomic and exposomic data for partitioning phenotypic variance and phenotypic prediction.

Statistical models

We used multiple random-effects LMMs to simultaneously model the effects of the genome and the exposome (model equation ii in Table 2). Genome-exposome correlation was also modelled (model equation iii in Table 2), where the kernel matrix for genome-exposome correlation was explicitly constructed using Cholesky decompositions of G and E²². In these models, a genomic relationship matrix (G) was constructed using an n × m₁ genotype coefficient matrix (A) as $\mathbf{G} = \mathbf{A}\mathbf{A}^{\mathrm{t}}/\mathrm{m}_{1}$, where n is the number of participants and m₁ is the number of SNPs. Similarly, an exposomic relationship matrix (E) was constructed using an n × m₂ exposomic variable matrix (Ω) as $\mathbf{E} = \boldsymbol{\Omega}\boldsymbol{\Omega}^{\mathrm{t}}/\mathrm{m}_{2}$, where m₂ is the number of exposomic variables (Table 2). These relationship matrices were used to estimate the additive effects of the genome and the exposome. In addition, interaction effects, including gxe, exe and exc, were also considered in these multiple random-effects models (Table 2). The kernel matrices for the interaction terms were derived from the Hadamard product of G and E or of E and E (model equations iv, v and vi in Table 2). A reaction norm model^31,32 was used to estimate exc (model equation vi in Table 2). All variance components were estimated using restricted maximum likelihood (REML)³.
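The construction of the relationship and interaction kernels described above can be sketched with NumPy on toy data. This is a minimal illustration under stated assumptions: the columns of the genotype and exposomic matrices are assumed to be standardized before forming the cross-products, the data are randomly generated, and the helper names `standardize` and `relationship_matrix` are hypothetical. Model fitting by REML is not shown.

```python
import numpy as np


def standardize(x):
    """Centre each column and scale it to unit variance (assumed step)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


def relationship_matrix(x):
    """X X^t / m for an n x m matrix of standardized variables."""
    return x @ x.T / x.shape[1]


rng = np.random.default_rng(0)
n, m1, m2 = 200, 500, 28                     # toy dimensions (m2 = 28 exposures)
genotypes = rng.binomial(2, 0.3, size=(n, m1)).astype(float)  # toy SNP dosages
exposures = rng.normal(size=(n, m2))                          # toy exposomic data

A = standardize(genotypes)       # n x m1 genotype coefficient matrix
Omega = standardize(exposures)   # n x m2 exposomic variable matrix

G = relationship_matrix(A)       # genomic relationship matrix
E = relationship_matrix(Omega)   # exposomic relationship matrix

# Interaction kernels as element-wise (Hadamard) products.
GxE = G * E
ExE = E * E

print(G.shape, E.shape, GxE.shape, ExE.shape)
```

With standardized columns, the diagonal of each relationship matrix averages one, which keeps the variance components on the scale of the standardized phenotype.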

Simulations

Using simulations, we identified two conditions that can cause biased variance estimates of additive effects of exposomic variables and exe interactions, which are correlations between exposomic variables and skewed distributions of exposomic variables. To show the impact of the correlation structure of exposomic variables on variance estimates of exposomic effects, we simulated, for 5,000 individuals, a set of ten orthogonal exposomic variables and another set of ten correlated exposomic variables, each from a multivariate normal distribution. Based on each set of exposomic variables, we then simulated phenotypes using the model y = e + exe + ε with ${\upsigma }_{\mathrm{e}}^{2}$, ${\upsigma }_{\mathrm{exe}}^{2}$, and ${\upsigma }_{\upvarepsilon }^{2}$ being set to 0.4, 0.5, and 0.1 respectively. The simulated exe effect was based on all possible interaction terms between exposomic variables (as specified in model v of Table 2). The simulation was repeated 100 times, resulting in 100 replicates, each with phenotypes for 5,000 individuals. For each replicate, we fitted the model y = e + exe + ε and averaged variance component estimates across replicates.
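The phenotype simulation described above can be sketched as follows. The sketch makes several assumptions that the text does not specify: the correlated set uses a common pairwise correlation of 0.5, the interaction terms are the products of distinct pairs of variables, effects are drawn from a standard normal distribution, and each component is rescaled so that its sample variance equals the target value. The function names are hypothetical, and model fitting is not shown.

```python
from itertools import combinations

import numpy as np


def scaled_component(design, rng, target_var):
    """Draw random effects for the columns of `design` and rescale the
    resulting component to the target sample variance (a simplification)."""
    effects = rng.normal(size=design.shape[1])
    raw = design @ effects
    return raw * np.sqrt(target_var / raw.var())


def simulate_phenotype(exposures, rng, var_e=0.4, var_exe=0.5, var_eps=0.1):
    """Simulate y = e + exe + eps from an n x m matrix of exposures."""
    pairs = combinations(range(exposures.shape[1]), 2)
    # Pairwise interaction terms between distinct exposomic variables.
    interactions = np.column_stack(
        [exposures[:, j] * exposures[:, k] for j, k in pairs]
    )
    e = scaled_component(exposures, rng, var_e)
    exe = scaled_component(interactions, rng, var_exe)
    eps = rng.normal(scale=np.sqrt(var_eps), size=exposures.shape[0])
    return e + exe + eps


rng = np.random.default_rng(1)
n, m = 5000, 10
# Hypothetical correlation structure: unit variances, pairwise correlation 0.5.
cov = np.full((m, m), 0.5) + 0.5 * np.eye(m)

orthogonal = rng.normal(size=(n, m))
correlated = rng.multivariate_normal(np.zeros(m), cov, size=n)

y_orth = simulate_phenotype(orthogonal, rng)
y_corr = simulate_phenotype(correlated, rng)
print(y_orth.var(), y_corr.var())
```

Repeating the two calls to `simulate_phenotype` 100 times would give the replicates to which the model y = e + exe + ε is then fitted.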

Variance estimates of phenotypes that were simulated using correlated and uncorrelated exposomic variables are summarized in Supplementary Table 5. When exposomic variables are orthogonal, all variance-component estimates are unbiased. By contrast, when exposomic variables are correlated, ${\upsigma }_{\mathrm{e}}^{2}$ is overestimated, although the estimate of ${\upsigma }_{\mathrm{exe}}^{2}$ is unbiased. To remedy the effect of correlated exposomic variables on the ${\upsigma }_{\mathrm{e}}^{2}$ estimate, we used all principal components (PCs) of the correlated exposomic variables to construct the kernel matrix for estimating ${\upsigma }_{\mathrm{e}}^{2}$, and used all pair-wise interaction terms of these PCs to construct the kernel matrix for estimating ${\upsigma }_{\mathrm{exe}}^{2}$. Importantly, while retaining all information of the original exposomic variables, the PCs are orthogonal to each other (Jolliffe, 1982). We found that variance estimation based on the PCs of the correlated exposomic variables is unbiased (last column of Supplementary Table 5).

To show the impact of skewness of the distributions of exposomic variables on variance component estimation, we repeated the above simulations using 10 exposomic variables from the UK Biobank with skewed distributions. We also noted that these exposomic variables are correlated. Results are presented in Supplementary Table 7. Estimation based on these exposomic variables is biased for both ${\upsigma }_{\mathrm{e}}^{2}$ and ${\upsigma }_{\mathrm{exe}}^{2}$. Using the PCs of these exposomic variables did not completely eliminate the bias, indicating that skewness of the distributions of exposomic variables affects variance estimation independently of the correlation structure of exposomic variables. As a remedy, we reduced the skewness by removing outliers outside 3 standard deviations from the mean. We found that after this quality control procedure the estimate of ${\upsigma }_{\mathrm{exe}}^{2}$ became unbiased, but the estimate of ${\upsigma }_{\mathrm{e}}^{2}$ remained biased. These results indicate that the estimation of ${\upsigma }_{\mathrm{e}}^{2}$ is sensitive to the correlation structure of exposomic variables, while the estimation of ${\upsigma }_{\mathrm{exe}}^{2}$ is sensitive to the skewness of the distributions of exposomic variables. When using all principal components of the skewness-corrected exposomic variables, all variance estimates became unbiased. Taken together, to avoid biased variance estimation of exposomic effects, it is necessary to (1) conduct quality control on the exposomic variables, removing values outside 3 standard deviations from the mean; and (2) transform the quality-controlled exposomic variables using a principal component analysis.
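The two-step recommendation (quality control, then a principal component transformation) can be sketched as below. The text does not say whether outlying values are set to missing or whether whole individuals are dropped; this sketch assumes listwise removal of individuals. The function name `qc_and_pc_transform` and the log-normal toy data are hypothetical.

```python
import numpy as np

def qc_and_pc_transform(W, n_sd=3.0):
    """Drop individuals with any value beyond n_sd SDs, then return standardised PCs."""
    z = (W - W.mean(axis=0)) / W.std(axis=0)
    keep = (np.abs(z) <= n_sd).all(axis=1)          # assumption: listwise removal
    Wq = W[keep]
    Wq = (Wq - Wq.mean(axis=0)) / Wq.std(axis=0)    # re-standardise after QC
    # Rows of Vt are the right singular vectors; the PCs are Omega = Wq V.
    _, _, Vt = np.linalg.svd(Wq, full_matrices=False)
    omega = Wq @ Vt.T
    omega = omega / omega.std(axis=0)               # column-standardise the PCs
    return omega, keep

# Toy data: 10 skewed (log-normal) variables that share a common factor.
rng = np.random.default_rng(7)
shared = rng.standard_normal((2000, 1))
W = np.exp(0.5 * shared + 0.5 * rng.standard_normal((2000, 10)))

omega, keep = qc_and_pc_transform(W)
pc_corr = np.corrcoef(omega, rowvar=False)          # close to the identity matrix
```

The columns of `omega` are mutually uncorrelated by construction, which is the property the transformation relies on.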

We also tested the effect of the correlation structure of exposomic variables on the ${\upsigma }_{\mathrm{gxe}}^{2}$ estimate. To do so, we simulated phenotypes based on ten correlated (but quality-controlled) exposomic variables for 5,000 individuals using the model y = g + e + gxe + ε with ${\upsigma }_{\mathrm{g}}^{2}$, ${\upsigma }_{\mathrm{e}}^{2}$, ${\upsigma }_{\mathrm{gxe}}^{2}$, and ${\upsigma }_{\upvarepsilon }^{2}$ being set to 0.3, 0.3, 0.3, and 0.1 respectively. The simulated genetic effect was based on 10 K SNPs that were selected randomly from the 1.1 M HapMap3 SNPs used for real data analyses, and the simulated gxe effect was based on all possible pairwise interactions between causal SNPs and exposomic variables (as specified in model iv of Table 2). We repeated the simulation 100 times, resulting in 100 replicates. For each replicate, we fitted the model y = g + e + gxe + ε to the genetic data (i.e., 1.1 M HapMap3 SNPs) and the exposomic data selected for the simulation to estimate variance components, and averaged variance estimates across replicates.

Similar to ${\upsigma }_{\mathrm{e}}^{2}$, the estimation of ${\upsigma }_{\mathrm{gxe}}^{2}$ is affected by the correlation structure of exposomic variables. As shown in Supplementary Table 8, all variance components are biased when the estimation is based on the correlated exposomic variables. Using PCs of the correlated variables corrected the bias for ${\upsigma }_{\mathrm{e}}^{2}$ and ${\upsigma }_{\mathrm{gxe}}^{2}$ (see ‘pc1’ in Supplementary Table 8). This observation holds for simulations under a different parameter setting (see Supplementary Table 9) and for simulations based on 10 correlated exposomic variables whose values were simulated from a multivariate normal distribution with a variance–covariance matrix that contains non-zero off-diagonal entries (see Supplementary Table 10 for results).

In Results, we reported for the real data that accounting for significant gxe interactions did not lead to phenotypic prediction accuracy improvements. We hypothesized that the power of phenotypic prediction based on gxe interactions is low; subsequently we used simulations to investigate the power of gxe-based phenotypic prediction. Specifically, we examined, for a sample of 10,000 individuals, of which 80% serves as the training set and 20% as the target set, the extent to which varying effect size of gxe interactions can improve phenotypic prediction accuracy. To do so, we simulated phenotypes using the model y = g + e + gxe + ε with ${\upsigma }_{\mathrm{gxe}}^{2}$ set to 0.2, 0.05, and 0.025, respectively. Each setting has 100 replicates, and each replicate contains phenotypes of 10,000 individuals. We randomly divided each replicate into a training set (n = 8000) and a target set (n = 2000) and subsequently computed the phenotypic prediction accuracy of two estimation models, y = g + e + ε (i.e., null model) and y = g + e + gxe + ε (i.e., full model) for each replicate.

Supplementary Fig. 2 presents the prediction accuracies of the two models by simulation setting (2a) and changes in prediction accuracy from the null model to the full model (2b). Despite the presence of genuine gxe interactions, little prediction accuracy is gained from accounting for these interactions, and this observation holds even under the setting with the largest gxe interactions (i.e., ${\upsigma }_{\mathrm{gxe}}^{2}$ = 0.2). This observation aligns with our results from real data analyses (Supplementary Fig. 1) and indicates that the power of phenotypic predictions based on gxe interactions is low.

Principal component-based transformed variables for E

If the degree of correlation among variables is high, it can cause biased estimates when the variables are fitted in a model, i.e. the multicollinearity problem. Such bias is also problematic when using correlated exposomic variables to construct E to be fitted in an LMM to estimate the proportion of the variance explained by the variables (R² = ${\upsigma }_{\mathrm{e}}^{2}$ when phenotypes are standardised with mean zero and variance one). The R² can also be obtained from a linear model, i.e., as the coefficient of determination. For problematically correlated variables, principal component regression has been introduced⁴⁵.

A linear model can be written as

$$\mathbf{y}=\mathbf{W}{\varvec{\upbeta}}+{\varvec{\upvarepsilon}}$$

(1)

where y is an n-vector of phenotypes, W is a column-standardised n × m matrix containing correlated exposomic variables, β is the vector of their effects and ε is a vector of residuals.

When exposomic variables in W are highly correlated, estimated exposomic effects (β-hat) are inflated due to the multicollinearity problem.

From the singular value decomposition, W can be expressed as

$$\mathbf{W}=\mathbf{U}\mathbf{D}{\mathbf{V}}^{\mathbf{t}}$$

where U is a matrix whose columns contain the left singular vectors of W, D is a diagonal matrix whose diagonal elements are the singular values of W and V is a unitary matrix (i.e. VV^t = I⁴⁵) whose columns contain the right singular vectors of W.

V can also be obtained from the eigendecomposition of ${\mathbf{W}}^{\mathbf{t}}\mathbf{W}$, which is proportional to the covariance matrix of the column-standardised variables.

The principal component regression approach⁴⁵ proposes to transform W into a column-orthogonal matrix, Ω, by post-multiplying W by V, which can be written as

$${{\varvec{\Omega}}} \, = \, {\mathbf{WV}}$$

Now, we can replace W with Ω in the model as

$$\mathbf{y}={\varvec{\Omega}}{\varvec{\upgamma}}+{\varvec{\upvarepsilon}}$$

(2)

The estimated regression coefficients from model (2) are $\widehat{{\varvec{\upgamma}}}={\left({{\varvec{\Omega}}}^{\mathbf{t}}{\varvec{\Omega}}\right)}^{-1}{{\varvec{\Omega}}}^{\mathbf{t}}\mathbf{y}={\left({\mathbf{V}}^{\mathbf{t}}{\mathbf{W}}^{\mathbf{t}}\mathbf{W}\mathbf{V}\right)}^{-1}{\mathbf{V}}^{\mathbf{t}}{\mathbf{W}}^{\mathbf{t}}\mathbf{y}={\mathbf{V}}^{\mathbf{t}}{\left({\mathbf{W}}^{\mathbf{t}}\mathbf{W}\right)}^{-1}\mathbf{V}{\mathbf{V}}^{\mathbf{t}}{\mathbf{W}}^{\mathbf{t}}\mathbf{y}={\mathbf{V}}^{\mathbf{t}}{\left({\mathbf{W}}^{\mathbf{t}}\mathbf{W}\right)}^{-1}{\mathbf{W}}^{\mathbf{t}}\mathbf{y}={\mathbf{V}}^{\mathbf{t}}\widehat{{\varvec{\upbeta}}}$, where the inverse is expanded using ${\mathbf{V}}^{-1}={\mathbf{V}}^{\mathbf{t}}$, and VV^t = I (identity matrix) cancels out because V is a unitary matrix⁴⁵.

Therefore, R² values obtained from models (1) and (2) are equivalent:

$${\mathrm{R}}^{2}=\frac{\sum {\left[\overline{\mathbf{y} }-\widehat{{\mathbf{y}}_{\mathrm{i}}}\right]}^{2}}{\sum {\left[\overline{\mathbf{y} }-{\mathbf{y}}_{\mathbf{i}}\right]}^{2}}=\frac{\sum {\left[\overline{\mathbf{y} }-{(\varvec{\Omega}\widehat{\varvec{\upgamma} })}_{\mathrm{i}}\right]}^{2}}{\sum {\left[\overline{\mathbf{y} }-{\mathbf{y}}_{\mathrm{i}}\right]}^{2}}=\frac{\sum {\left[\overline{\mathbf{y} }-{(\varvec{\Omega} {\mathbf{V}}^{t}\widehat{\varvec\upbeta } )}_{\mathrm{i}}\right]}^{2}}{\sum {\left[\overline{\mathbf{y} }-{\mathbf{y}}_{\mathrm{i}}\right]}^{2}}=\frac{\sum {\left[\overline{\mathbf{y} }-{(\mathbf{W}\widehat{\varvec\upbeta })}_{\mathrm{i}}\right]}^{2}}{\sum {\left[\overline{\mathbf{y} }-{\mathbf{y}}_{\mathbf{i}}\right]}^{2}}$$

Unlike model (1), however, model (2) avoids collinearity among the variables. Model (2) can therefore be extended to a linear mixed model in which the covariance structure is constructed from Ω, i.e. ΩΩ^t/m, where Ω is column-standardised.
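The relationship between models (1) and (2) can be checked numerically. The sketch below uses simulated toy data (the pairwise correlation of 0.8 and the sample size are arbitrary assumptions) and ordinary least squares via `numpy.linalg.lstsq`; it illustrates the algebra only and is not part of the authors' analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 5
cov = np.full((m, m), 0.8)                      # assumed strong pairwise correlation
np.fill_diagonal(cov, 1.0)
W = rng.multivariate_normal(np.zeros(m), cov, size=n)
W = (W - W.mean(axis=0)) / W.std(axis=0)        # column-standardised W
y = W @ rng.standard_normal(m) + rng.standard_normal(n)
y = y - y.mean()

# W = U D V^t; the transformed matrix is Omega = W V.
_, _, Vt = np.linalg.svd(W, full_matrices=False)
V = Vt.T
omega = W @ V

beta_hat, *_ = np.linalg.lstsq(W, y, rcond=None)        # model (1)
gamma_hat, *_ = np.linalg.lstsq(omega, y, rcond=None)   # model (2)

def r_squared(X, coef, y):
    """Coefficient of determination, written as in the R^2 expression above."""
    fitted = X @ coef
    return np.sum((fitted - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)

r2_w = r_squared(W, beta_hat, y)
r2_omega = r_squared(omega, gamma_hat, y)
gram = omega.T @ omega                          # diagonal, so columns of Omega are orthogonal
```

In this sketch `gamma_hat` equals `V.T @ beta_hat` and the two R² values coincide up to floating-point error, while `gram` has negligible off-diagonal entries.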

Suppose an LMM of the form

$$\mathbf{y}=\mathbf{W}{\varvec{\upbeta}}+{\varvec{\upvarepsilon}}$$

(3)

where y is a vector of phenotypes for n individuals; W is an n × m₂ matrix that contains m₂ exposomic variables; β is a vector of random exposomic effects, each assumed normal with mean zero and variance ${\upsigma }_{\mathrm{e}}^{2}/{\mathrm{m}}_{2}$; and ε is a vector of residuals, each assumed normal with mean zero and variance ${\upsigma }_{\upvarepsilon }^{2}$.

Under this model, phenotypic variance is partitioned as

$$\mathrm{var}\left(\mathbf{y}\right)={\upsigma }_{\mathrm{e}}^{2}\mathbf{W}{\mathbf{W}}^{\mathbf{t}}/{\mathrm{m}}_{2}+{\upsigma }_{\upvarepsilon }^{2}\mathbf{I}$$

where I is the n × n identity matrix.

When exposomic variables are highly correlated, a transformed W, denoted as Ω, should be used, to avoid biased ${\widehat{\upsigma }}_{\mathrm{e}}^{2}$.

In a similar manner to the linear models (1) and (2), LMM (3) can be rewritten as

$$\mathbf{y}=\mathbf{U}\mathbf{D}{\mathbf{V}}^{\mathbf{t}}{\varvec{\upbeta}}+{\varvec{\upvarepsilon}}$$

Since ${\mathbf{V}}^{\mathbf{t}}\mathbf{V}=\mathbf{I}$

$$\mathbf{y}=\mathbf{U}\mathbf{D}{(\mathbf{V}}^{\mathbf{t}}\mathbf{V}){\mathbf{V}}^{\mathbf{t}}{\varvec{\upbeta}}+{\varvec{\upvarepsilon}}=(\mathbf{U}\mathbf{D}{\mathbf{V}}^{\mathbf{t}})\mathbf{V}{(\mathbf{V}}^{\mathbf{t}}{\varvec{\upbeta}})+{\varvec{\upvarepsilon}}=\mathbf{W}\mathbf{V}({\mathbf{V}}^{\mathbf{t}}{\varvec{\upbeta}})+{\varvec{\upvarepsilon}}={\varvec{\Omega}}({\mathbf{V}}^{\mathbf{t}}{\varvec{\upbeta}})+{\varvec{\upvarepsilon}}$$

Then

$$\mathrm{var}\left(\mathbf{y}\right)={\varvec{\Omega}}\mathrm{var}\left({\mathbf{V}}^{\mathbf{t}}{\varvec{\upbeta}}\right){{\varvec{\Omega}}}^{\mathbf{t}}+{\upsigma }_{\upvarepsilon }^{2}\mathbf{I}={\varvec{\Omega}}{\mathbf{V}}^{\mathbf{t}}\mathrm{var}\left({\varvec{\upbeta}}\right)\mathbf{V}{{\varvec{\Omega}}}^{\mathbf{t}}+{\upsigma }_{\upvarepsilon }^{2}\mathbf{I}={\upsigma }_{\mathrm{e}}^{2}{\varvec{\Omega}}{\mathbf{V}}^{\mathbf{t}}\mathbf{V}{{\varvec{\Omega}}}^{\mathbf{t}}/{\mathrm{m}}_{2}+{\upsigma }_{\upvarepsilon }^{2}\mathbf{I}={\upsigma }_{\mathrm{e}}^{2}{\varvec{\Omega}}\mathbf{I}{{\varvec{\Omega}}}^{\mathbf{t}}/{\mathrm{m}}_{2}+{\upsigma }_{\upvarepsilon }^{2}\mathbf{I}={\upsigma }_{\mathrm{e}}^{2}{\varvec{\Omega}}{{\varvec{\Omega}}}^{\mathbf{t}}/{\mathrm{m}}_{2}+{\upsigma }_{\upvarepsilon }^{2}\mathbf{I}$$

Therefore, using column-standardized principal components of exposomic variables as W for Eq. (3) can avoid biased ${\widehat{\upsigma }}_{\mathrm{e}}^{2}$. This is further verified using simulations.
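A kernel matrix built from column-standardised PCs, as described above, might be constructed as in the following sketch. The function name `exposomic_kernel` and the toy data are hypothetical, and the resulting matrix would still need to be passed to LMM software (e.g. MTG2) in whatever format that software expects, which is not shown here.

```python
import numpy as np

def exposomic_kernel(W):
    """Return Omega Omega^t / m, with Omega the column-standardised PCs of W."""
    W = (W - W.mean(axis=0)) / W.std(axis=0)
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    omega = W @ Vt.T                               # Omega = W V
    omega = omega / omega.std(axis=0)              # column-standardise the PCs
    return omega @ omega.T / omega.shape[1]

# Toy data: 10 correlated variables for 300 individuals (assumed correlation 0.6).
rng = np.random.default_rng(3)
n, m = 300, 10
cov = np.full((m, m), 0.6)
np.fill_diagonal(cov, 1.0)
W = rng.multivariate_normal(np.zeros(m), cov, size=n)

E_pc = exposomic_kernel(W)

# For comparison: the kernel built directly from the standardised original variables.
Ws = (W - W.mean(axis=0)) / W.std(axis=0)
E_raw = Ws @ Ws.T / m
```

Because VV^t = I, ΩΩ^t equals WW^t before any rescaling; in this sketch the two kernels therefore differ only through the column standardisation of Ω, which gives every PC equal weight.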

Estimation of exc interactions

We extend the proposed model to a reaction norm model (RNM^31,32,33,46) by introducing exc interaction terms, where e is the additive effects of exposomic variables and c is a covariate. Given the significant additive effects found above, the interest in fitting RNMs is to determine whether these effects vary depending on covariates, which would be evidenced by the presence of significant exc interactions.

While estimates of ${\upsigma }_{\mathrm{exe}}^{2}$ inform the phenotypic variance explained by the sum of all possible combinations of pairwise interactions between lifestyle-exposomic variables, it may also be of interest to estimate the modulated exposomic effects specific to particular covariates, using the RNM^31,32,33,46. The covariates include alcohol intake, smoking, energy intake, physical activity, sex, socio-economic status (indexed by Townsend deprivation index), age and ethnicity measured using the first two ancestry principal components. For each covariate, we fitted the RNM that allows the covariate to interact with exposomic effects and compared the fit of this model with a null model that assumes no exc interactions (see Supplementary Table 3 for p-values). Significant covariates were then included in a subsequent RNM analysis that fitted multiple covariates simultaneously. We reported the total variance of exc interaction effects in Supplementary Table 4.

Five-fold cross-validation

Using fivefold cross validation, we (1) validate significant variance components identified above (Table 1) and (2) evaluate the extent to which the inclusion of these variance components improves phenotypic prediction. For every trait, we randomly split the sample into a discovery set (~ 80%) and a target set (~ 20%) and iterated this process five times in a manner such that target sets did not overlap across iterations (see Fig. 6 for an outline). We derived the prediction accuracy of each model by averaging the Pearson’s correlation coefficients between the observed and predicted phenotypes across target sets; then compared prediction accuracies between models (e.g., y = g + ε vs. y = g + e + ε) to determine phenotypic prediction improvements gained by the inclusion of a given variance component [e.g., var(e)]. For each variance component, we regressed prediction accuracy improvements on estimates of the variance component and declared the variance component valid if the slope differs from zero.
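The cross-validation scheme above can be sketched as follows. The predictor used here is ordinary least squares, a stand-in chosen only to keep the example self-contained; it is not the LMM-based prediction used in the study. The names `five_fold_accuracy` and `ols_fit_predict` are hypothetical.

```python
import numpy as np

def five_fold_accuracy(X, y, fit_predict, n_folds=5, seed=0):
    """Average Pearson correlation between observed and predicted y over target sets."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)   # non-overlapping target sets
    accuracies = []
    for target in folds:
        discovery = np.setdiff1d(np.arange(len(y)), target)    # remaining ~80%
        y_hat = fit_predict(X[discovery], y[discovery], X[target])
        accuracies.append(np.corrcoef(y[target], y_hat)[0, 1])
    return float(np.mean(accuracies)), folds

def ols_fit_predict(X_train, y_train, X_test):
    """Placeholder predictor (ordinary least squares), NOT the model used in the paper."""
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return X_test @ coef

# Toy data: five predictors with equal effects plus unit-variance noise.
rng = np.random.default_rng(11)
X = rng.standard_normal((1000, 5))
y = X @ np.ones(5) + rng.standard_normal(1000)

accuracy, folds = five_fold_accuracy(X, y, ols_fit_predict)
```

Comparing `accuracy` between two competing predictors on the same `folds` gives the prediction improvement attributable to the extra model component.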

Theoretical prediction accuracy for quantitative traits

Suppose we predict phenotypes of a quantitative trait (e.g., BMI) with SNP-based heritability h² using a discovery dataset of N individuals. Following previous theoretical derivations^23,27,28,29,30,34, the genomic prediction accuracy based on the model y = g + ε can be written as

$${\mathrm{r}}_{\mathrm{g}}=\sqrt{{\mathrm{h}}^{2 }\cdot \frac{{\mathrm{h}}^{2}}{{\mathrm{h}}^{2} +{\mathrm{M}}_{1}/\mathrm{N}}}$$

(4)

where M₁ is the effective number of chromosome segments, which is a function of the effective population size and can be estimated using the inverse of the variance of off-diagonal elements of genomic relationships (i.e., G in Table 2) between the discovery and target samples^27,28,29,30.

Assuming that phenotypes are standardized to have mean zero and variance one, if the total amount of phenotypic variance explained by the exposome is ${\upsigma }_{\mathrm{e}}^{2}$, Eq. 4 can be adapted to describe the prediction accuracy of the model y = e + ε in the form

$${\mathrm{r}}_{\mathrm{e}}=\sqrt{{\upsigma }_{\mathrm{e}}^{2}\cdot \frac{{\upsigma }_{\mathrm{e}}^{2}}{{\upsigma }_{\mathrm{e}}^{2} +{\mathrm{M}}_{2}/\mathrm{N}}}$$

(5)

where M₂ is analogous to M₁ and can be thought of as the effective number of (independent) exposomic variables. Similar to M₁, M₂ can be estimated using the inverse of the variance of the off-diagonal elements of exposomic relationships (E in Table 2) between the discovery and target samples.
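The estimator of the effective number described above (the inverse of the variance of relationships between discovery and target individuals) can be sketched as below. The relationship matrix is formed here as ZZ^t/m from a column-standardised matrix Z, which is an assumption about how the kernel is built; `effective_number` is a hypothetical helper name.

```python
import numpy as np

def effective_number(Z_discovery, Z_target):
    """M = 1 / var(relationships between discovery and target individuals)."""
    m = Z_discovery.shape[1]
    K = Z_discovery @ Z_target.T / m       # n_discovery x n_target relationship block
    return 1.0 / K.var()

# Toy check with 20 independent, standardised variables.
rng = np.random.default_rng(5)
Z = rng.standard_normal((2500, 20))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
M_hat = effective_number(Z[:2000], Z[2000:])
```

With independent variables, as in this toy example, the estimate should be close to the number of variables; correlation among variables would be expected to lower it.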

Upon establishing an agreement between expected accuracies, based on Eqs. (4) and (5), and observed accuracies for the 11 traits in this study (Supplementary Fig. 4), we proceeded to derive the prediction accuracy of the proposed integrative analysis of genomic and exposomic data.

Assuming that the genomic and exposomic effects on phenotypes are uncorrelated, the prediction accuracy of the model y = g + e + ε can be written as

$$\mathrm{r}=\sqrt{{\mathrm{r}}_{\mathrm{g}}^{2} + {\mathrm{r}}_{\mathrm{e}}^{2}}$$

(6)

Equation (6) is verified by an agreement between the expected and observed prediction accuracies for the 11 traits in this study (Supplementary Fig. 4).
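Equations (4) to (6) are straightforward to evaluate. The sketch below plugs in the BMI variance components quoted in the Abstract (0.19 for the genome, 0.072 for additive exposomic effects) together with M₁ = 50,000 and M₂ = 28 as reported elsewhere in Methods; the discovery sample size N = 28,000 (about 80% of ~35,000) is an assumption made for illustration, so the output should not be read as a reproduction of the reported accuracies.

```python
import math

def component_accuracy(variance, M, N):
    """Eqs. (4)/(5): expected accuracy for one variance component."""
    return math.sqrt(variance * variance / (variance + M / N))

def joint_accuracy(r_g, r_e):
    """Eq. (6): accuracy of y = g + e + eps, assuming uncorrelated g and e."""
    return math.sqrt(r_g ** 2 + r_e ** 2)

N = 28_000                                            # assumed discovery sample size
r_g = component_accuracy(0.19, M=50_000, N=N)         # genome alone
r_e = component_accuracy(0.072, M=28, N=N)            # exposome alone
r = joint_accuracy(r_g, r_e)                          # genome and exposome combined
print(f"r_g = {r_g:.3f}, r_e = {r_e:.3f}, r = {r:.3f}")
```

Because M₂/N is far smaller than M₁/N, the exposomic component loses much less accuracy to estimation error than the genomic component at the same sample size.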

Theoretical prediction accuracy for disease traits

Considering a disease trait with a population prevalence k, we derived the expected prediction accuracy of the model y = g + e + ε for the disease in terms of the correlation coefficient between the true underlying disease liability and predicted values from the model^23,28,34,47, which can then be converted to an AUC value^23,24,25.

Similar to r_g and r_e, the expected prediction accuracies for the disease on the liability scale, denoted as ${\mathrm{r}}_{\mathrm{g}}^{\mathrm{^{\prime}}}$ (for y = g + ε) and ${\mathrm{r}}_{\mathrm{e}}^{\mathrm{^{\prime}}}$ (for y = e + ε), can be computed using previous derivations^23,28,34,47 as follows.

$${\mathrm{r}}_{\mathrm{g}}^{\mathrm{^{\prime}}}= \sqrt{{\mathrm{h}}^{2}\cdot \frac{{\mathrm{h}}^{2}{\mathrm{z}}^{2}}{{\mathrm{h}}^{2}{\mathrm{z}}^{2} + {[\mathrm{k}(1-\mathrm{k})]}^{2} \cdot {\mathrm{M}}_{1}/[\mathrm{p}(1-\mathrm{p})\cdot \mathrm{ N}]}}$$

(7)

where h² is the SNP-based heritability on the liability scale, N is the discovery sample size, k is the population prevalence, p is the ratio of cases in the discovery sample, and z is the density at the threshold on the standard normal distribution curve.

$${\mathrm{r}}_{\mathrm{e}}^{\mathrm{^{\prime}}}= \sqrt{{\upsigma }_{\mathrm{e}.\mathrm{tot}}^{2}\cdot \frac{{\upsigma }_{\mathrm{e}.\mathrm{tot}}^{2}{\mathrm{z}}^{2}}{{\upsigma }_{\mathrm{e}.\mathrm{tot}}^{2}{\mathrm{z}}^{2} + {[\mathrm{k}(1-\mathrm{k})]}^{2} \cdot {\mathrm{M}}_{2}/[\mathrm{p}(1-\mathrm{p})\cdot \mathrm{ N}]}}$$

(8)

where ${\upsigma }_{\mathrm{e}.\mathrm{tot}}^{2}$ is the total amount of variance explained by the exposome on the liability scale (i.e., ${\upsigma }_{\mathrm{e}}^{2}$+${\upsigma }_{\mathrm{exe}}^{2}$). Note ${\upsigma }_{\mathrm{e}.\mathrm{tot}}^{2} = {\upsigma }_{\mathrm{e}}^{2}$ when ${\upsigma }_{\mathrm{exe}}^{2} = 0$.

As in Eq. (6), we combined ${\mathrm{r}}_{\mathrm{g}}^{\mathrm{^{\prime}}}$ and ${\mathrm{r}}_{\mathrm{e}}^{\mathrm{^{\prime}}}$ to derive the expected prediction accuracy on the liability scale for the disease, denoted as ${\mathrm{r}}^{\mathrm{^{\prime}}}$, under the assumption that the genetic effects and exposomic effects are uncorrelated.

$${\mathrm{r}}^{\mathrm{^{\prime}}}=\sqrt{{\mathrm{r}}_{\mathrm{g}}^{\mathrm{^{\prime}}2}+ {\mathrm{r}}_{\mathrm{e}}^{\mathrm{^{\prime}}2}}$$

(9)
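Equations (7) to (9) can be evaluated with the Python standard library alone. In the sketch below, the threshold is taken as the upper-tail quantile of the standard normal distribution for prevalence k; all numerical inputs (variance components, prevalence, sample size) are hypothetical values chosen for illustration, and `liability_accuracy` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def liability_accuracy(variance, M, N, k, p):
    """Eqs. (7)/(8): expected accuracy on the liability scale for one component."""
    t = NormalDist().inv_cdf(1.0 - k)       # threshold truncating a proportion k
    z = NormalDist().pdf(t)                 # normal density at the threshold
    signal = variance * z ** 2
    noise = (k * (1.0 - k)) ** 2 * M / (p * (1.0 - p) * N)
    return math.sqrt(variance * signal / (signal + noise))

# Hypothetical inputs: prevalence 5%, unascertained sample (p = k), N = 100,000.
k = p = 0.05
N = 100_000
r_g_liab = liability_accuracy(0.20, M=50_000, N=N, k=k, p=p)   # genome
r_e_liab = liability_accuracy(0.10, M=28, N=N, k=k, p=p)       # exposome (sigma^2_e.tot)
r_liab = math.sqrt(r_g_liab ** 2 + r_e_liab ** 2)              # Eq. (9)
```

As the noise term shrinks (large N or small M), each accuracy approaches the square root of the corresponding variance component.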

Following a well-established theory^23,24,25,28 that has been verified by a comprehensive analysis of real data²⁶, we converted ${\mathrm{r}}^{\mathrm{^{\prime}}}$ to the area under the receiver operating characteristic curve (AUC) as

$$\mathrm{AUC}\approx {\Phi}\left[\frac{(\mathrm{i}-{\mathrm{i}}_{2}){\mathrm{r}}^{\mathrm{^{\prime}}2}}{\sqrt{{\mathrm{r}}^{\mathrm{^{\prime}}2}\left\{[1-{\mathrm{r}}^{\mathrm{^{\prime}}2}\mathrm{i}(\mathrm{i}-\mathrm{t})] + [1-{\mathrm{r}}^{\mathrm{^{\prime}}2}{\mathrm{i}}_{2}({\mathrm{i}}_{2}-\mathrm{t})]\right\}}}\right]$$

(10)

where i (= z/k) is the mean liability for diseased individuals, i₂ (= − ik/(1 − k)) is the mean liability for non-diseased individuals, t is the threshold on the normal distribution that truncates the proportion of disease prevalence k and Ф is the cumulative distribution function of the normal distribution.

To derive the AUC values shown in Fig. 5, we set p = k, M₁ to 50,000 and M₂ to 28. M₁ (50,000) was estimated from the inverse of the variance of genomic relationships (G) between the discovery and target samples^27,29,30. Similarly, M₂ (28) was estimated from the inverse of the variance of exposomic relationships (E) between the discovery and target samples, which agrees with the number of transformed exposomic variables by a principal component analysis in this study (see the correlated exposomic variables section in “Methods”). Note that setting M₂ up to 100 would not yield expected prediction accuracies that notably differ from those from setting M₂ = 28.
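The conversion in Eq. (10) can be sketched with the standard library as follows. The function name `auc_from_liability_accuracy` and the example inputs are hypothetical; the code transcribes Eq. (10) using the definitions of i, i₂, t and z given above, and is not taken from the authors' software.

```python
import math
from statistics import NormalDist

def auc_from_liability_accuracy(r_prime, k):
    """Eq. (10): approximate AUC from liability-scale accuracy r' and prevalence k."""
    nd = NormalDist()
    t = nd.inv_cdf(1.0 - k)                 # liability threshold
    z = nd.pdf(t)                           # density at the threshold
    i = z / k                               # mean liability of diseased individuals
    i2 = -i * k / (1.0 - k)                 # mean liability of non-diseased individuals
    r2 = r_prime ** 2
    spread = (1.0 - r2 * i * (i - t)) + (1.0 - r2 * i2 * (i2 - t))
    return nd.cdf((i - i2) * r2 / math.sqrt(r2 * spread))

# Hypothetical example: prevalence 5%, two levels of liability-scale accuracy.
auc_low = auc_from_liability_accuracy(0.3, k=0.05)
auc_high = auc_from_liability_accuracy(0.6, k=0.05)
```

The function is undefined at r′ = 0 (division by zero); as r′ approaches zero, the AUC approaches 0.5, i.e. no discrimination.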

Code availability

The source code for MTG2 v2.18 and example code, along with related files for fitting the IGE model, can be accessed without any restrictions from https://sites.google.com/site/honglee0707/mtg2 or from https://github.com/honglee0707/IGE.

References

Bulik-Sullivan, B. K. et al. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295. https://doi.org/10.1038/ng.3211 (2015).
Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: A tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82. https://doi.org/10.1016/j.ajhg.2010.11.011 (2011).
Lee, S. H. & van der Werf, J. H. J. MTG2: An efficient algorithm for multivariate linear mixed model analysis based on genomic information. Bioinformatics 32, 1420–1422. https://doi.org/10.1093/bioinformatics/btw012 (2016).
Speed, D. et al. Reevaluation of SNP heritability in complex human traits. Nat. Genet. 49, 986–992. https://doi.org/10.1038/ng.3865 (2017).
International Human Genome Sequencing. Finishing the euchromatic sequence of the human genome. Nature 431, 931–945. https://doi.org/10.1038/nature03001 (2004).
Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351. https://doi.org/10.1126/science.1058040 (2001).
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921. https://doi.org/10.1038/35057062 (2001).
Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752. https://doi.org/10.1038/nature08185 (2009).
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569. https://doi.org/10.1038/ng.608 (2010).
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224. https://doi.org/10.1038/s41588-018-0183-z (2018).
Inouye, M. et al. Genomic risk prediction of coronary artery disease in 480,000 adults: Implications for primary prevention. J. Am. Coll. Cardiol. 72, 1883–1893. https://doi.org/10.1016/j.jacc.2018.07.079 (2018).
Truong, B. et al. Efficient polygenic risk scores for biobank scale data by exploiting phenotypes from inferred relatives. Nat. Commun. 11, 3074. https://doi.org/10.1038/s41467-020-16829-x (2020).
Wild, C. P. The exposome: From concept to utility. Int. J. Epidemiol. 41, 24–32. https://doi.org/10.1093/ije/dyr236 (2012).
Vermeulen, R., Schymanski, E. L., Barabási, A.-L. & Miller, G. W. The exposome and health: Where chemistry meets biology. Science 367, 392–396. https://doi.org/10.1126/science.aay3164 (2020).
Jiang, C. et al. Dynamic human environmental exposome revealed by longitudinal personal monitoring. Cell 175, 277-291.e231. https://doi.org/10.1016/j.cell.2018.08.060 (2018).
Agier, L. et al. Early-life exposome and lung function in children in Europe: An analysis of data from the longitudinal, population-based HELIX cohort. Lancet Planetary Health 3, e81–e92. https://doi.org/10.1016/S2542-5196(19)30010-5 (2019).
Burkett, J. P. & Miller, G. W. Using the exposome to understand environmental contributors to psychiatric disorders. Neuropsychopharmacology 46, 263–264. https://doi.org/10.1038/s41386-020-00851-0 (2021).
Maitre, L. et al. Human early life exposome (HELIX) study: A European population-based exposome cohort. BMJ Open 8, e021311. https://doi.org/10.1136/bmjopen-2017-021311 (2018).
Zammit, S., Lewis, G., Dalman, C. & Allebeck, P. Examining interactions between risk factors for psychosis. Br. J. Psychiatry 197, 207–211. https://doi.org/10.1192/bjp.bp.109.070904 (2010).
Moore, R. et al. A linear mixed-model approach to study multivariate gene–environment interactions. Nat. Genet. 51, 180–186. https://doi.org/10.1038/s41588-018-0271-0 (2019).
Robinson, M. R. et al. Genotype–covariate interaction effects and the heritability of adult body mass index. Nat. Genet. 49, 1174–1181. https://doi.org/10.1038/ng.3912 (2017).
Zhou, X., Im, H. K. & Lee, S. H. CORE GREML for estimating covariance between random effects in linear mixed models for complex trait analyses. Nat. Commun. 11, 4208. https://doi.org/10.1038/s41467-020-18085-5 (2020).
Dudbridge, F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 9, e1003348. https://doi.org/10.1371/journal.pgen.1003348 (2013).
Wray, N. R., Yang, J., Goddard, M. E. & Visscher, P. M. The genetic interpretation of area under the ROC curve in genomic profiling. PLoS Genet. 6, e1000864. https://doi.org/10.1371/journal.pgen.1000864 (2010).
Lee, S. H., Goddard, M. E., Wray, N. R. & Visscher, P. M. A better coefficient of determination for genetic profile analysis. Genet. Epidemiol. 36, 214–224. https://doi.org/10.1002/gepi.21614 (2012).
Ripke, S. et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427. https://doi.org/10.1038/nature13595 (2014).
Lee, S. H., Clark, S. & van der Werf, J. H. J. Estimation of genomic prediction accuracy from reference populations with varying degrees of relationship. PLoS ONE 12, e0189775. https://doi.org/10.1371/journal.pone.0189775 (2017).
Lee, S. H., Weerasinghe, W. M. S. P., Wray, N. R., Goddard, M. E. & van der Werf, J. H. J. Using information of relatives in genomic prediction to apply effective stratified medicine. Sci. Rep. 7, 42091. https://doi.org/10.1038/srep42091 (2017).
Goddard, M. Genomic selection: Prediction of accuracy and maximisation of long term response. Genetica 136, 245–257. https://doi.org/10.1007/s10709-008-9308-0 (2009).
Goddard, M. E., Hayes, B. J. & Meuwissen, T. H. E. Using the genomic relationship matrix to predict the accuracy of genomic selection. J. Anim. Breed. Genet. 128, 409–421. https://doi.org/10.1111/j.1439-0388.2011.00964.x (2011).
Ni, G. et al. Genotype–covariate correlation and interaction disentangled by a whole-genome multivariate reaction norm model. Nat. Commun. 10, 2239. https://doi.org/10.1038/s41467-019-10128-w (2019).
Zhou, X. et al. Whole-genome approach discovers novel genetic and nongenetic variance components modulated by lifestyle for cardiovascular health. J. Am. Heart Assoc. 9, e015661. https://doi.org/10.1161/JAHA.119.015661 (2020).
Shin, J. et al. Lifestyle modifies the diabetes-related metabolic risk, conditional on individual genetic differences. MedRxiv https://doi.org/10.1101/2020.11.22.20236505 (2020).
Daetwyler, H. D., Villanueva, B. & Woolliams, J. A. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE 3, e3395. https://doi.org/10.1371/journal.pone.0003395 (2008).
Mandrekar, J. N. Receiver operating characteristic curve in diagnostic test assessment. J. Thorac. Oncol. 5, 1315–1316. https://doi.org/10.1097/JTO.0b013e3181ec173d (2010).
Dahl, A. et al. A robust method uncovers significant context-specific heritability in diverse complex traits. Am. J. Hum. Genet. 106, 71–91. https://doi.org/10.1016/j.ajhg.2019.11.015 (2020).
Maier, R. et al. Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am. J. Hum. Genet. 96, 283–294. https://doi.org/10.1016/j.ajhg.2014.12.006 (2015).
VanRaden, P. M. Efficient methods to compute genomic predictions. J. Dairy Sci. 91, 4414–4423. https://doi.org/10.3168/jds.2007-0980 (2008).
Jiang, X. et al. Shared heritability and functional enrichment across six solid cancers. Nat. Commun. 10, 431. https://doi.org/10.1038/s41467-018-08054-4 (2019).
Jaffee, S. R. & Price, T. S. Gene–environment correlations: A review of the evidence and implications for prevention of mental illness. Mol. Psychiatry 12, 432–442. https://doi.org/10.1038/sj.mp.4001950 (2007).
Sudlow, C. et al. UK biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Lee, S. H. et al. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nat. Genet. 45, 984 (2013).
Ripke, S. et al. Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nat. Genet. 45, 1150 (2013).
Lee, S. H. et al. Estimation of SNP heritability from dense genotype data. Am. J. Hum. Genet. 93, 1151–1155 (2013).
Jolliffe, I. T. A note on the use of principal components in regression. J. R. Stat. Soc. C 31, 300–303. https://doi.org/10.2307/2348005 (1982).
Shin, J. & Lee, S. H. GxEsum: a novel approach to estimate the phenotypic variance explained by genome-wide GxE interaction based on GWAS summary statistics for biobank-scale data. Genome Biol. https://doi.org/10.1186/s13059-021-02403-1 (2021).
Lee, S. H. & Wray, N. R. Novel genetic analysis for case-control genome-wide association studies: Quantification of power and genomic prediction accuracy. PLoS ONE 8, e71494. https://doi.org/10.1371/journal.pone.0071494 (2013).


Acknowledgements

This research is supported by the Australian Research Council (DP190100766, FT160100229). The UK Biobank is funded by the UK Department of Health, the Medical Research Council, the Scottish Executive, and the Wellcome Trust medical research charity. Funding bodies had no role in the study design, the collection, analysis, and interpretation of the data, and the writing of the manuscript. We thank the staff and participants of the UK Biobank for their important contributions. Work was performed using computational resources provided by the Australian Government through Gadi under the National Computational Merit Allocation Scheme (NCMAS).

Author information

Authors and Affiliations

Australian Centre for Precision Health, University of South Australia, Adelaide, SA, 5000, Australia
Xuan Zhou & S. Hong Lee
UniSA Allied Health and Human Performance, University of South Australia, Adelaide, SA, 5000, Australia
Xuan Zhou & S. Hong Lee
South Australian Health and Medical Research Institute, Adelaide, SA, 5000, Australia
Xuan Zhou & S. Hong Lee

Authors

Xuan Zhou
S. Hong Lee

Contributions

X.Z. and S.H.L. conceived the idea. S.H.L. directed the study. X.Z. performed the analyses. X.Z. and S.H.L. drafted the manuscript. All authors contributed to the editing and approval of the final manuscript.

Corresponding author

Correspondence to S. Hong Lee.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhou, X., Lee, S.H. An integrative analysis of genomic and exposomic data for complex traits and phenotypic prediction. Sci Rep 11, 21495 (2021). https://doi.org/10.1038/s41598-021-00427-y

Received: 11 March 2021
Accepted: 12 October 2021
Published: 02 November 2021
DOI: https://doi.org/10.1038/s41598-021-00427-y
