Meta-analysis of SNP-environment interaction with heterogeneity for overlapping data

Jin, Qinqin; Shi, Gang

doi:10.1038/s41598-021-82336-8


Article
Open access
Published: 28 January 2021


Scientific Reports volume 11, Article number: 2590 (2021)


Abstract

Meta-analysis is a popular method used in genome-wide association studies, by which the results of multiple studies are combined to identify associations. Combining studies in this way can introduce heterogeneity. Recently, we proposed a random effect model meta-regression method (MR) to study the effect of single nucleotide polymorphism (SNP)-environment interactions. This method takes heterogeneity into account and produces high power. We also proposed a fixed effect model overlapping MR in which the overlapping data is taken into account. In the present study, a random effect model overlapping MR that simultaneously considers heterogeneity and overlapping data is proposed. This method is based on the random effect model MR and the fixed effect model overlapping MR. A new way of computing the logarithm of the determinant of covariance matrices in likelihood functions is also provided. Likelihood ratio tests of the SNP-environment interaction effect and of the SNP and SNP-environment joint effects are given. In our simulations, null distributions and type I error rates were examined to verify the suitability of our method, and power was evaluated to assess its advantage. Our findings indicate that this method is effective in cases of overlapping data with high heterogeneity.


Introduction

Genome-wide association studies (GWASs) are effective for the identification of single nucleotide polymorphisms (SNPs) associated with complex traits or disease^1,2,3. Meta-analysis^4,5,6,7,8, which combines the results of multiple studies, is a common method used to increase the sample size^5,7,9, which can reduce false positive results, increase power, and increase the probability of finding new associations. The fixed effect model is commonly used for meta-analysis, in which the effects between studies are assumed to be equal. However, in recent studies, meta-analyses have been employed using new designs, by combining different related traits or diseases^10,11, environments¹², populations¹³, tissues¹⁴, and cancer types^15,16. These combinations lead to different effect sizes between studies, which is called heterogeneity¹⁷. In such cases, the fixed effect model is not suitable. The traditional random effect model⁴ takes heterogeneity into account, but implicitly assumes a conservative null hypothesis model. As a result, its power is even lower than that of the fixed effect model. A modified random effect model method¹⁸ was proposed to overcome this problem and has been widely used in various analyses^{12,14,17,19,20}.

However, in practice, there are often many overlapping individuals between studies. The overlap may be introduced unintentionally or intentionally by the researchers. If overlapping individuals exist but are ignored, spurious associations may occur^21,22. In recent years, researchers have proposed several methods for overlapping data^{16,17,21,22,23,24,25}. These methods are all used to test the main effects of SNPs. Lin²¹ proposed a correlation matrix and applied it to the fixed effect model method for overlapping data among studies. Han²² transformed the covariance structure of the data into a diagonal matrix. This transformation makes Lin’s method more flexible, so that it can be applied within other meta-analysis methods, such as the random effect model. Based on the modified random effect model method¹⁸, Lee¹⁷ proposed a new method for overlapping data. This method combines the fixed effect model and the random effect model, and gives a higher power regardless of the heterogeneity.

The meta-regression method (MR)²⁶ is a powerful and robust method under a fixed effect model. This method has two steps. First, individuals in each study are divided into several groups based on the distribution of the environmental variable, where the number of individuals in each group is equal. In each group, linear regression is used to estimate the main effect of the SNP and its standard error, and the mean of the environmental variable is calculated. Second, the researcher collects all the above results and performs a meta-regression to investigate interactions between the SNP and the environment. MR can be used to test for the main effects of the SNP, SNP-environment interactions, and joint effects. In addition, this method is considered to be robust when confounding effects exist, such as interactions between covariates and genetic effects or interactions between covariates and environmental factors²⁷. Based on Lin’s method²¹ and Han’s method²², we extended MR to overlapping MR (OMR)²⁸. This method is designed for SNP-environment interactions under the fixed effect model, as well as for overlapping data. We also extended MR to account for heterogeneity; that is, we added random effects of the SNP and SNP-environment interactions to the fixed effect SNP-environment interaction model. This method is denoted as the random effect MR (RMR)²⁹, which gives a higher power than MR under the fixed effect model when heterogeneity exists. The Q-Q plot of the null distribution obtained by MR shifts upward when overlapping data exists, and the more the data overlap, the more obvious the deviation is. The fixed effect model OMR²⁸ controls spurious associations caused by overlapping data. When heterogeneity exists and is large, the power of RMR is higher than that of MR. Similarly, when both overlapping data and heterogeneity exist, the power of OMR is reduced by the heterogeneity. However, no study has yet considered this condition.

In this paper, inspired by OMR and by Lee’s method¹⁷, which was proposed for testing the SNP main effect with overlapping data, we propose the random effect overlapping MR (ROMR), a new method that accounts for overlapping data based on the RMR²⁹. Our method is designed to test the SNP-environment interaction effect or the SNP and SNP-environment joint effects with overlapping data. This paper is organized as follows. In the Materials and Methods section, we introduce the correlation matrix into the RMR. We also present a new method to calculate the likelihood function. In the Results section, we carry out simulations to examine the null distribution, type I error rate, and power of our method. We also compare our method with the OMR. In the Discussion and Conclusion sections, the results of this paper are analyzed and used to draw conclusions.

Materials and methods

Fixed effect overlapping MR

OMR is an extension of the fixed effect MR, which is a powerful and robust method under the condition of independent data. This method has two procedures. First, each study is divided into several groups according to the distribution of a continuous or dichotomous environmental exposure. In this process, each group is a subset of the study, and percentiles of the environmental exposure can be used to divide the study into several groups with approximately the same sample sizes. Then, in each group, the coefficient and variance of the main effect of the SNP are estimated. Second, the main effect of the SNP and its corresponding standard error in each group are collected for regression analysis. Then, either the overall mean SNP-environment effect and its variance are estimated, or the mean SNP and SNP-environment joint effect vector and its variance matrix are estimated.
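The first procedure can be sketched in Python as follows. This is a minimal illustration rather than the authors' code: the function name `group_summaries`, the use of NumPy, and the choice to regress the trait on an intercept and the genotype only within each group are assumptions made for the example.

```python
import numpy as np

def group_summaries(y, g, e, n_groups=5):
    """Split one study into equal-size groups by exposure and summarise each.

    Hypothetical helper: returns, per group, the estimated SNP main effect,
    its standard error, and the mean environmental exposure.
    """
    order = np.argsort(e)                        # sort individuals by exposure
    betas, ses, e_means = [], [], []
    for idx in np.array_split(order, n_groups):  # near-equal group sizes
        # Ordinary least squares of the trait on an intercept and the genotype
        # (assumed regression model for this sketch).
        X = np.column_stack([np.ones(idx.size), g[idx]])
        coef, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
        resid = y[idx] - X @ coef
        sigma2 = resid @ resid / (idx.size - 2)  # residual variance
        cov = sigma2 * np.linalg.inv(X.T @ X)    # covariance of the estimates
        betas.append(coef[1])
        ses.append(np.sqrt(cov[1, 1]))
        e_means.append(e[idx].mean())
    return np.array(betas), np.array(ses), np.array(e_means)
```

The three returned arrays correspond to the per-group quantities that the second procedure collects for the regression analysis.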

Assume that the environmental exposure is continuous and let ${\widehat{\beta }}_{ij}$ be the estimate of the main effect of the SNP in the j-th group of the i-th study, where the subscript $i=1,2,\dots ,n$ indexes the studies ($n$ is the number of studies) and the subscript $j=1,2,\dots ,{n}_{i}$ indexes the groups in the i-th study (${n}_{i}$ is the number of groups). ${\widehat{e}}_{ij}$ and ${E}_{ij}$ are the standard error and mean environmental exposure of the j-th group of the i-th study, respectively. Under the second OMR procedure, the model for the environment-dependent SNP effect $\widehat{\boldsymbol{\beta}}$ can be expressed in the following form:

$$\widehat{{\varvec{\beta}}}={\varvec{X}}{\varvec{\upalpha}}+{\varvec{\varepsilon}}$$

where

$$\begin{aligned} \widehat{\boldsymbol{\beta}} & = \left(\begin{array}{c}{\widehat{\boldsymbol{\beta}}}_{1}\\ {\widehat{\boldsymbol{\beta}}}_{2}\\ \vdots \\ {\widehat{\boldsymbol{\beta}}}_{n}\end{array}\right),\quad {\widehat{\boldsymbol{\beta}}}_{i}=\left(\begin{array}{c}{\widehat{\beta }}_{i1}\\ {\widehat{\beta }}_{i2}\\ \vdots \\ {\widehat{\beta }}_{i{n}_{i}}\end{array}\right),\quad \boldsymbol{X}=\left(\begin{array}{c}{\boldsymbol{X}}_{1}\\ {\boldsymbol{X}}_{2}\\ \vdots \\ {\boldsymbol{X}}_{n}\end{array}\right),\quad {\boldsymbol{X}}_{i}=\left(\begin{array}{cc}1& {E}_{i1}\\ 1& {E}_{i2}\\ \vdots & \vdots \\ 1& {E}_{i{n}_{i}}\end{array}\right), \\ \boldsymbol{\varepsilon} & = \left(\begin{array}{c}{\boldsymbol{\varepsilon}}_{1}\\ {\boldsymbol{\varepsilon}}_{2}\\ \vdots \\ {\boldsymbol{\varepsilon}}_{n}\end{array}\right),\quad {\boldsymbol{\varepsilon}}_{i}=\left(\begin{array}{c}{\varepsilon }_{i1}\\ {\varepsilon }_{i2}\\ \vdots \\ {\varepsilon }_{i{n}_{i}}\end{array}\right), \\ \boldsymbol{\alpha} & = \left(\begin{array}{c}{\alpha }_{0}\\ {\alpha }_{1}\end{array}\right),\quad \boldsymbol{\Sigma}=\left(\begin{array}{ccc}{\boldsymbol{\Sigma}}_{1}& \cdots & 0\\ \vdots & \ddots & \vdots \\ 0& \cdots & {\boldsymbol{\Sigma}}_{n}\end{array}\right),\quad {\boldsymbol{\Sigma}}_{i}=\left(\begin{array}{ccc}{\widehat{e}}_{i1}& \cdots & 0\\ \vdots & \ddots & \vdots \\ 0& \cdots & {\widehat{e}}_{i{n}_{i}}\end{array}\right) \end{aligned}$$

and ${\varepsilon }_{ij}\sim N(0,{\widehat{e}}_{ij})$, $i=1,2,\dots ,n$, $j=1,2,\dots ,{n}_{i}$.

Let ${\varvec{C}}$ be the correlation matrix. The element of this matrix is

$${\gamma }_{{i}_{h}{j}_{k}}\approx {n}_{{i}_{h}{j}_{k}}/\sqrt{{n}_{{i}_{h}}{n}_{{j}_{k}}}$$

where ${n}_{{i}_{h}}$ and ${n}_{{j}_{k}}$ are the sizes of the $h$-th group of study $i$ and the $k$-th group of study $j$, respectively, and ${n}_{{i}_{h}{j}_{k}}$ is the number of individuals shared by these two groups.
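As an illustration of this formula, the correlation matrix can be assembled from the membership of each group. The sketch below assumes that every group is available as a Python set of individual identifiers, listed in the same order as the entries of $\widehat{\boldsymbol{\beta}}$; the function name `overlap_correlation` is hypothetical.

```python
import numpy as np

def overlap_correlation(groups):
    """Build the correlation matrix C from group membership.

    groups: list of sets of individual IDs, one set per study group
    (assumed input format for this sketch).
    """
    m = len(groups)
    C = np.eye(m)  # each group is perfectly correlated with itself
    for a in range(m):
        for b in range(a + 1, m):
            shared = len(groups[a] & groups[b])  # overlapping individuals
            C[a, b] = C[b, a] = shared / np.sqrt(len(groups[a]) * len(groups[b]))
    return C
```

Groups drawn from the same study share no individuals, so their off-diagonal entries are zero under this construction.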

The covariance matrix of this method is

$$\boldsymbol{\Omega }={{\varvec{\Sigma}}}^{1/2}{\varvec{C}}{\boldsymbol{\Sigma }}^{1/2}$$

It can also be written in the form

$$\boldsymbol{\Omega }={\mathrm{diag}\left({\boldsymbol{e}}^{\prime}{\left({\boldsymbol{\Sigma }}^{1/2}\boldsymbol{C}{\boldsymbol{\Sigma }}^{1/2}\right)}^{-1}\right)}^{-1}$$

where $\boldsymbol{e}=\left(1,1,\dots ,1\right)$ and its length is the total number of groups across all studies, $\sum_{i=1}^{n}{n}_{i}$.

The linear unbiased estimator $\widehat{\boldsymbol{\alpha }}$ and its covariance matrix $\mathrm{Cov}\left(\widehat{\boldsymbol{\alpha }}\right)$ are expressed as follows:

$$\widehat{\boldsymbol{\alpha }}={\left({{\varvec{X}}}^{\mathrm{^{\prime}}}{\boldsymbol{\Omega }}^{-1}{\varvec{X}}\right)}^{-1}{{\varvec{X}}}^{\mathrm{^{\prime}}}{\boldsymbol{\Omega }}^{-1}\widehat{{\varvec{\beta}}}$$

$${\widehat{\alpha }}_{2}=\left(0,1\right)\widehat{\boldsymbol{\alpha }}$$

$$\mathrm{Cov}\left(\widehat{\boldsymbol{\alpha }}\right)={\left({\boldsymbol{X}}^{\prime}{\boldsymbol{\Omega }}^{-1}\boldsymbol{X}\right)}^{-1}$$

$${\mathrm{Cov}\left(\widehat{\boldsymbol{\alpha }}\right)}_{22}=\left(0,1\right)\mathrm{Cov}\left(\widehat{\boldsymbol{\alpha }}\right)\left(\begin{array}{c}0\\ 1\end{array}\right)$$

Under the null hypotheses of no SNP-environment interaction (the parameter estimated by ${\widehat{\alpha }}_{2}$ is zero) and of no joint effect ($\boldsymbol{\alpha }=0$), the Wald statistics for the SNP-environment interaction effect and for the SNP and SNP-environment joint effects follow ${\upchi }^{2}$ distributions with 1 and 2 degrees of freedom (df), respectively.
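The estimator and the two Wald tests can be sketched as below. Two assumptions are made for the example and are not stated in this form in the text: $\boldsymbol{\Sigma}$ is taken to hold the squared standard errors, so that ${\boldsymbol{\Sigma}}^{1/2}$ is the diagonal matrix of standard errors, and the Wald statistics are written in their usual quadratic form. The function name `omr_wald` is hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def omr_wald(beta_hat, se, e_mean, C):
    """Generalised least squares fit and Wald tests (illustrative sketch)."""
    X = np.column_stack([np.ones_like(e_mean), e_mean])
    # Omega = Sigma^{1/2} C Sigma^{1/2}, assuming Sigma = diag(se^2).
    omega = np.outer(se, se) * C
    omega_inv = np.linalg.inv(omega)
    cov_alpha = np.linalg.inv(X.T @ omega_inv @ X)
    alpha = cov_alpha @ X.T @ omega_inv @ beta_hat
    # Interaction test: second element of alpha, 1 df.
    w_int = alpha[1] ** 2 / cov_alpha[1, 1]
    # Joint test: both elements of alpha, 2 df.
    w_joint = alpha @ np.linalg.solve(cov_alpha, alpha)
    return alpha, cov_alpha, chi2.sf(w_int, 1), chi2.sf(w_joint, 2)
```

With `C` equal to the identity matrix, this reduces to an ordinary inverse-variance weighted meta-regression for independent studies.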

Random effect overlapping MR

This method is an extension of the OMR²⁸ and the recently proposed RMR²⁹. Under this method, the random effects for the SNP main effect and the SNP-environment interaction are denoted as $\boldsymbol{\gamma}$. The environment-dependent SNP effect $\widehat{\boldsymbol{\beta}}$ is modelled as follows:

$$\widehat{\boldsymbol{\beta}}=\boldsymbol{X}\boldsymbol{\alpha}+\boldsymbol{Z}\boldsymbol{\gamma}+\boldsymbol{\varepsilon}$$

where

$$\boldsymbol{Z}=\left(\begin{array}{ccc}{\boldsymbol{Z}}_{1}& \cdots & 0\\ \vdots & \ddots & \vdots \\ 0& \cdots & {\boldsymbol{Z}}_{n}\end{array}\right),\quad \boldsymbol{\gamma}=\left(\begin{array}{c}{\boldsymbol{\gamma}}_{1}\\ {\boldsymbol{\gamma}}_{2}\\ \vdots \\ {\boldsymbol{\gamma}}_{n}\end{array}\right)$$

and

$${{\varvec{\gamma}}}_{i}=\left(\begin{array}{c}{\gamma }_{i0}\\ {\gamma }_{i1}\end{array}\right)$$

Here, ${\gamma }_{i0}$ denotes the random main effect of the SNP in the i-th study and ${\gamma }_{i1}$ denotes the random effect of the SNP-environment interaction in the i-th study. The vector ${\boldsymbol{\gamma}}_{i}$ follows a bivariate normal distribution, $\left(\begin{array}{c}{\gamma }_{i0}\\ {\gamma }_{i1}\end{array}\right)\sim \mathrm{N}(0,{\boldsymbol{D}}_{\mathrm{s}})$. The vector $\widehat{\boldsymbol{\beta}}$ then follows a multivariate normal distribution:

$$\widehat{{\varvec{\beta}}}\sim \mathrm{N}\left({\varvec{X}}{\varvec{\upalpha}},{\varvec{V}}\right)$$

where, in the overlapping condition, $\boldsymbol{V}=\boldsymbol{Z}\boldsymbol{D}{\boldsymbol{Z}}^{\prime}+{\boldsymbol{\Sigma }}^{1/2}\boldsymbol{C}{\boldsymbol{\Sigma }}^{1/2}$ is a real symmetric matrix. Denote by ${\lambda }_{1},{\lambda }_{2},\dots ,{\lambda }_{M}$, where $M=\sum_{i=1}^{n}{n}_{i}$, the eigenvalues of $\boldsymbol{V}$, and by ${\xi }_{1},{\xi }_{2},\dots ,{\xi }_{M}$ the corresponding orthonormal eigenvectors, so that the matrix $\left({\xi }_{1},{\xi }_{2},\dots ,{\xi }_{M}\right)$ is orthogonal, that is, ${\left({\xi }_{1},{\xi }_{2},\dots ,{\xi }_{M}\right)}^{\prime}={\left({\xi }_{1},{\xi }_{2},\dots ,{\xi }_{M}\right)}^{-1}$. Then $\left|\boldsymbol{V}\right|={\lambda }_{1}{\lambda }_{2}\cdots {\lambda }_{M}$.
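A possible construction of $\boldsymbol{V}$ is sketched below. The text does not spell out ${\boldsymbol{Z}}_{i}$ or $\boldsymbol{D}$ for the full model, so the example assumes that ${\boldsymbol{Z}}_{i}$ has the same form as ${\boldsymbol{X}}_{i}$ (a column of ones and the group exposure means), that $\boldsymbol{D}$ is block diagonal with one copy of ${\boldsymbol{D}}_{\mathrm{s}}$ per study, and that $\boldsymbol{\Sigma}$ holds squared standard errors. The function name `romr_covariance` is hypothetical.

```python
import numpy as np
from scipy.linalg import block_diag

def romr_covariance(e_means_by_study, se, C, D_s):
    """Assemble V = Z D Z' + Sigma^{1/2} C Sigma^{1/2} (illustrative sketch).

    e_means_by_study: list of 1-D arrays, group exposure means per study.
    se: standard errors of all groups, stacked in the same order.
    """
    # Assumed form of Z_i: intercept column plus exposure means.
    Z_blocks = [np.column_stack([np.ones(len(e)), e]) for e in e_means_by_study]
    Z = block_diag(*Z_blocks)
    # Assumed form of D: the same 2x2 matrix D_s repeated for every study.
    D = np.kron(np.eye(len(Z_blocks)), D_s)
    return Z @ D @ Z.T + np.outer(se, se) * C
```

Setting `D_s` to the zero matrix removes the heterogeneity term and leaves only the overlap-adjusted sampling covariance.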

Minus twice the log-likelihood under this model can be written as

$${l}_{1}={\sum }_{i=1}^{M}\mathrm{ln}\left|{\lambda }_{i}\right|+{\left(\widehat{\boldsymbol{\beta}}-\boldsymbol{X}\boldsymbol{\alpha}\right)}^{\prime}{\boldsymbol{V}}^{-1}\left(\widehat{\boldsymbol{\beta}}-\boldsymbol{X}\boldsymbol{\alpha}\right)+{\sum }_{i=1}^{n}{n}_{i}\mathrm{ln}\left(2\pi \right)$$

The parameters are estimated by minimizing this function, using the minimum variance quadratic unbiased estimator (MIVQUE(0))^30,31 and the Newton–Raphson algorithm^32,33,34. The detailed procedure is given in Jin²⁹.
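For fixed parameter values, the function ${l}_{1}$ can be evaluated with the eigenvalue form of the log-determinant. The sketch below covers only this evaluation step, not the MIVQUE(0) or Newton–Raphson optimization; the function name `neg2_loglik` is hypothetical.

```python
import numpy as np

def neg2_loglik(beta_hat, X, alpha, V):
    """Evaluate -2 log-likelihood for given alpha and V (illustrative sketch)."""
    lam = np.linalg.eigvalsh(V)                # eigenvalues of symmetric V
    log_det = np.sum(np.log(np.abs(lam)))      # ln|V| as a sum of ln|lambda_i|
    resid = beta_hat - X @ alpha
    quad = resid @ np.linalg.solve(V, resid)   # (b - Xa)' V^{-1} (b - Xa)
    return log_det + quad + beta_hat.size * np.log(2 * np.pi)
```

`np.linalg.eigvalsh` is used because it is intended for symmetric matrices; summing logarithms of eigenvalues also avoids the overflow or underflow that a direct determinant can suffer for large matrices.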

Test of SNP-environment interaction

Under this test, we suppose that there is no interaction effect and no interaction heterogeneity, that is, ${\alpha }_{1}=0$ and ${{\varvec{D}}}_{\mathrm{s}}=\left(\begin{array}{cc}{u}_{1}^{2}& 0\\ 0& 0\end{array}\right)$. The reduced model can then be given as follows:

$$\widehat{\boldsymbol{\beta}}=\boldsymbol{X}{\alpha }_{0}+\boldsymbol{Z}{\gamma }_{0}+\boldsymbol{\varepsilon}$$

where

$$\boldsymbol{X}=\boldsymbol{Z}=\left(\begin{array}{c}{\boldsymbol{X}}_{1}\\ {\boldsymbol{X}}_{2}\\ \vdots \\ {\boldsymbol{X}}_{n}\end{array}\right),\quad {\boldsymbol{X}}_{i}={\left(1,1,\cdots ,1\right)}^{\prime}$$

and the dimension of ${\boldsymbol{X}}_{i}$ is ${n}_{i}$. In this model, $\widehat{\boldsymbol{\beta}}\sim \mathrm{N}\left(\boldsymbol{X}{\alpha }_{0},\boldsymbol{V}\right)$, where the covariance matrix is $\boldsymbol{V}=\boldsymbol{Z}\boldsymbol{D}{\boldsymbol{Z}}^{\prime}+{\boldsymbol{\Sigma }}^{1/2}\boldsymbol{C}{\boldsymbol{\Sigma }}^{1/2}$. Denote by ${\lambda }_{1}^{1},{\lambda }_{2}^{1},\dots ,{\lambda }_{M}^{1}$, where $M=\sum_{i=1}^{n}{n}_{i}$, the eigenvalues of $\boldsymbol{V}$; then $\left|\boldsymbol{V}\right|={\lambda }_{1}^{1}{\lambda }_{2}^{1}\cdots {\lambda }_{M}^{1}$.

Minus twice the log-likelihood for this model is

$${l}_{2}={\sum }_{i=1}^{M}\mathrm{ln}\left|{\lambda }_{i}^{1}\right|+{\left(\widehat{\boldsymbol{\beta}}-\boldsymbol{X}{\alpha }_{0}\right)}^{\prime}{\boldsymbol{V}}^{-1}\left(\widehat{\boldsymbol{\beta}}-\boldsymbol{X}{\alpha }_{0}\right)+{\sum }_{i=1}^{n}{n}_{i}\mathrm{ln}(2\pi )$$

As in Jin²⁹, the likelihood ratio statistic for the test of SNP-environment interaction is given as follows:

$${\mathrm{L}}_{\mathrm{I}}={\widehat{l}}_{2}-{\widehat{l}}_{1}$$

where ${\widehat{l}}_{1}$ is the minimum of ${l}_{1}$ and ${\widehat{l}}_{2}$ is the minimum of ${l}_{2}$. The statistic ${\mathrm{L}}_{\mathrm{I}}$ asymptotically follows an equal mixture of a 2 df ${\upchi }^{2}$ distribution and a 3 df ${\upchi }^{2}$ distribution. Its p-value is calculated as $0.5\left(P({\upchi }_{2}^{2}>{\mathrm{L}}_{\mathrm{I}})+P({\upchi }_{3}^{2}>{\mathrm{L}}_{\mathrm{I}})\right)$²⁹.
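The mixture p-value for the interaction test can be computed directly from the two minimized values. This sketch uses the survival function of the ${\upchi }^{2}$ distribution from SciPy; the function name `interaction_pvalue` is hypothetical.

```python
from scipy.stats import chi2

def interaction_pvalue(l2_min, l1_min):
    """P-value of L_I under a 0.5:0.5 mixture of chi-square(2) and chi-square(3)."""
    L_I = l2_min - l1_min                 # likelihood ratio statistic
    # chi2.sf(x, df) is P(chi-square with df degrees of freedom > x).
    return 0.5 * (chi2.sf(L_I, 2) + chi2.sf(L_I, 3))
```

Because the 3 df tail is heavier than the 2 df tail, the mixture p-value always lies between the two single-distribution p-values.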

Joint test of SNP and SNP-environment

Under this test, we suppose that $\boldsymbol{\alpha }=0$ and ${\boldsymbol{D}}_{\mathrm{s}}=\left(\begin{array}{cc}0& 0\\ 0& 0\end{array}\right)$, that is to say, there are no fixed effects of the SNP main effect or the SNP-environment interaction, and no corresponding heterogeneity. The null model can be given as follows:

$$\widehat{{\varvec{\beta}}}={\varvec{\varepsilon}}$$

Then, $\widehat{{\varvec{\beta}}}\sim \mathrm{N}\left(0,{\varvec{V}}\right)$ and ${\varvec{V}}={{\varvec{\Sigma}}}^{1/2}\mathbf{C}{{\varvec{\Sigma}}}^{1/2}$. The eigenvalues of the covariance matrix ${\varvec{V}}$ are denoted as ${\uplambda }_{1}^{0},{\uplambda }_{2}^{0},\dots ,{\uplambda }_{M}^{0}$, where $M=\sum_{i=1}^{n}{n}_{i}$. Then $\left|{\varvec{V}}\right|={\lambda }_{1}^{0}*{\lambda }_{2}^{0}*\dots *{\lambda }_{M}^{0}$.

Minus twice the log-likelihood for this model, in which ${\alpha }_{0}=0$ under the null, is

$${l}_{0}={\sum }_{i=1}^{M}\mathrm{ln}\left|{\lambda }_{i}^{0}\right|+{\left(\widehat{{\varvec{\beta}}}-{\varvec{X}}{\mathrm{\alpha }}_{0}\right)}^{^{\prime}}{{\varvec{V}}}^{-1}\left(\widehat{{\varvec{\beta}}}-{\varvec{X}}{\mathrm{\alpha }}_{0}\right)+{\sum }_{i=1}^{n}{n}_{i}\mathrm{ln}(2\pi )$$

The likelihood ratio statistic for the joint test of the SNP main and SNP-environment is given as follows:

$${\mathrm{L}}_{\mathrm{J}}={\widehat{l}}_{0}-{\widehat{l}}_{1}$$

where ${\widehat{l}}_{0}$ is the evaluated value of ${l}_{0}$. The statistic ${\mathrm{L}}_{\mathrm{J}}$ asymptotically follows a $\upxi$:0.5:(0.5-$\upxi$) mixture of 3 df ${\upchi }^{2}$ distribution, 4 df ${\upchi }^{2}$ distribution, and 5 df ${\upchi }^{2}$ distribution. The value of $\upxi$ depends on the given data and is obtained from the information matrix. The p-value is calculated as (0.5-$\upxi$) P(${\upchi }_{3}^{2}>{\mathrm{L}}_{\mathrm{J}}$) + 0.5 P(${\upchi }_{4}^{2}>{\mathrm{L}}_{\mathrm{J}}$) + $\upxi$ P(${\upchi }_{5}^{2}>{\mathrm{L}}_{\mathrm{J}}$)²⁹.

Ethics approval

The authors have no ethical conflicts to disclose.

Results

Simulation

The relationship among the quantitative trait $Y$, the genotype of SNP $G$, and the environmental variable $E$ is presented as follows:

$$Y=({\beta }_{G}+{\gamma }_{G})G+({\beta }_{G\times E}+{\gamma }_{G\times E})G\times E+{\beta }_{E}E+\varepsilon$$

The quantitative trait $Y$ was simulated to follow a standard normal distribution, that is, with a mean of 0 and a variance of 1. The SNP was assumed to have an additive genetic effect, and its minor allele frequency was 0.3. $G$ was coded as the number of minor alleles. In each study, 1000 points following a standard uniform distribution were generated. If a point fell in $[0,{0.3}^{2}]$, then $G$ was set to 2. If it fell in $[{0.3}^{2},{0.3}^{2}+2\cdot 0.3\cdot (1-0.3)]$, then $G$ was set to 1; otherwise $G$ was set to 0. The environmental variable $E$ was also simulated to follow a standard normal distribution. Ten percent of the variation in $Y$ was explained by the environmental term ${\beta }_{E}E$. The fixed effects ${\beta }_{G}$, ${\beta }_{G\times E}$ and the random effects ${\gamma }_{G}$, ${\gamma }_{G\times E}$ varied across simulation datasets. The random error $\varepsilon$ was normally distributed, with a zero mean and a variance chosen so that the variance of $Y$ was 1. In our simulation, we generated 1000 replications; each replication had 12 studies, each study had 1000 unrelated individuals, and each individual had one quantitative trait $Y$, one environmental variable $E$, and one SNP $G$. Two overlap settings were considered, with 100 and 400 overlapping individuals between studies. Before analysis, the individuals in each study were divided into five groups according to the distribution of $E$. The main effects of the SNP and their standard errors were estimated by linear regression. The mean of the environmental variable $E$ in each group was then calculated.
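The genotype-generation step described above can be sketched as follows. The function name `simulate_genotypes` and the use of NumPy's random generator are assumptions for the example; the interval boundaries follow the text.

```python
import numpy as np

def simulate_genotypes(n, maf=0.3, rng=None):
    """Draw additive genotype codes (0, 1, 2) from uniform random points."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=n)            # standard uniform points
    p2 = maf ** 2                      # probability of two minor alleles
    p1 = 2 * maf * (1 - maf)           # probability of one minor allele
    g = np.zeros(n, dtype=int)         # default: no minor allele
    g[u < p2 + p1] = 1                 # points in [maf^2, maf^2 + 2*maf*(1-maf)]
    g[u < p2] = 2                      # points in [0, maf^2]
    return g
```

With `maf=0.3`, the expected genotype frequencies are 0.09, 0.42 and 0.49 for codes 2, 1 and 0, matching Hardy-Weinberg proportions.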

To examine the null distribution of the statistic for the SNP-environment interaction, we assumed that both the SNP-environment interaction and its corresponding heterogeneity were zero. That is to say, ${\beta }_{G\times E}=0$ and ${\gamma }_{G\times E}=0$. The main effect of the SNP ${\beta }_{G}$ was set to the square root of 0.1, and the corresponding random effect was set to be normally distributed, with a mean of 0 and a variance of 0.02. We minimized the functions ${l}_{1}$ and ${l}_{2}$ to obtain the statistic ${\mathrm{L}}_{\mathrm{I}}$. Finally, empirical P-values were calculated using a 0.5:0.5 mixture of 2 and 3 df ${\upchi }^{2}$ distributions, which was the theoretical distribution of the interaction test. These were then compared with expected values following a uniform distribution between 0 and 1, and a Q–Q plot was drawn from the two sets of P-values. To examine the null distribution of the statistic for the SNP and SNP-environment joint effects, we assumed that all the fixed effects and random effects of the SNP and SNP-environment interaction were zero. That is, ${\beta }_{G}={\beta }_{G\times E}={\gamma }_{G}={\gamma }_{G\times E}=0$. We calculated the likelihood ratio statistic ${\mathrm{L}}_{\mathrm{J}}$ from ${\widehat{l}}_{0}$ and ${\widehat{l}}_{1}$. The empirical P-values were calculated as a $\upxi$:0.5:(0.5-$\upxi$) mixture of 3 df ${\upchi }^{2}$ distribution, 4 df ${\upchi }^{2}$ distribution, and 5 df ${\upchi }^{2}$ distribution. The value of $\upxi$ was data-dependent and calculated using the Fisher information.

To evaluate the power of the test of the SNP-environment interaction effect, we set the fixed effects of the SNP main effect ${\beta }_{G}$ and the SNP-environment interaction ${\beta }_{G\times E}$ to $\sqrt{0.002}$. The random effect of the SNP main effect was normally distributed, ${\gamma }_{G}\sim \mathrm{N}(0,0.015)$, and the variance of the random effect of the SNP-environment interaction ranged from 0.005 to 0.025 in steps of 0.005. A test was considered statistically significant if its P-value was less than 0.05. Experiments were repeated 1000 times, and the proportion of significant results was taken as the empirical power. The OMR was also tested under this simulation. To evaluate the power of the test of the SNP and SNP-environment interaction joint effects, the fixed effects ${\beta }_{G}$ and ${\beta }_{G\times E}$ were set to the square root of 0.002. The random effects ${\gamma }_{G}$ and ${\gamma }_{G\times E}$ were normally distributed with a mean of 0 and a variance ranging from 0.005 to 0.025.

Null distribution

As shown in Fig. 1A,B, the points lie close to the diagonal line with 100 and 400 overlapping individuals between studies. This verifies that the method presented provides suitable distributions. In Fig. 1C,D, the empirical P-values are close to the expected ones with 100 and 400 overlapping individuals between any two studies, demonstrating the suitability of our distributions.

Type I error rate

To better illustrate the performance of our method, we considered three different scenarios. In scenario 1, two different sample sizes were considered: (1) a study with 1000 individuals and (2) a study with 2000 individuals. In scenario 2, two different significance levels were considered: (1) 0.01 and (2) 0.05. In scenario 3, two different main effects of the SNP were considered: (1) the square root of 0.1 and (2) the square root of 0.2. Table 1 presents the type I error rates in the different scenarios for the test of the SNP-environment interaction with 10% overlapping data. Table 2 presents the type I error rates in the different scenarios for the test of the SNP-environment interaction with 40% overlapping data. Table 3 presents the type I error rates in the different scenarios for the test of the SNP and SNP-environment joint effects with 10% overlapping data. Table 4 presents the type I error rates in the different scenarios for the test of the SNP and SNP-environment joint effects with 40% overlapping data. For the 1000 replications, the 95% confidence intervals for the estimated type I error rates at nominal levels 0.05 and 0.01 are (0.036, 0.064) and (0.002, 0.018), respectively. For the 2000 replications, the 95% confidence intervals for the estimated type I error rates at nominal levels 0.05 and 0.01 are (0.040, 0.060) and (0.004, 0.016), respectively. From these tables, we can see that all of the estimated type I error rates fall within the confidence intervals for both the interaction tests and the joint tests, which indicates that our method is valid.

Table 1 The values of type I error rates at different scenarios for the test of SNP-environment interaction with 10% overlapping data.


Table 2 The values of type I error rates at different scenarios for the test of SNP-environment interaction with 40% overlapping data.


Table 3 The values of type I error rates at different scenarios for the test of SNP and SNP-environment joint effects with 10% overlapping data.


Table 4 The values of type I error rates at different scenarios for the test of SNP and SNP-environment joint effects with 40% overlapping data.


Statistical power

The powers of ROMR and OMR are compared in Fig. 2A,B. With 100 and 400 overlapping individuals between any two studies, OMR gives higher power when heterogeneity is low. The power of OMR decreases slightly as heterogeneity increases, whereas the power of ROMR increases rapidly. When the heterogeneity is high, ROMR gives higher power. The reason is that, when the heterogeneity is low, most of the statistical evidence in ${\text{L}}_{\text{I}}$ comes from the fixed effect of the interaction, and the ROMR test statistic is penalized by its higher degrees of freedom, yielding less power. When the heterogeneity is high, OMR tests the fixed effect only, so the genetic effect tested by ROMR is much larger than that tested by OMR. Thus, ROMR gives higher power²⁹. As shown in Fig. 2C,D, the joint tests show a similar trend to the interaction tests but generally achieve higher power. This is because both the SNP main effect and the SNP-environment interaction are tested, thereby including more effects than the test for the interaction only.

Tables 5 and 6 present the powers under different levels of heterogeneity with different numbers of overlapping individuals. Table 5 shows the powers for the test of the SNP-environment interaction; the power decreased as the number of overlapping individuals increased. For ROMR, the greatest drop was 0.299; for OMR, the greatest drop was 0.127. Nevertheless, when the heterogeneity was large, ROMR always gave a higher power than OMR. Table 6 shows the powers for the test of the SNP and SNP-environment interaction joint effects. As in the case of the SNP-environment interaction, the power of ROMR decreased faster than that of OMR as the number of overlapping individuals increased. However, when the heterogeneity was large, ROMR gave a higher power than OMR. To illustrate the impact of the amount of overlap more intuitively, Fig. 3 is given. We selected one set of parameters from Tables 5 and 6: the variance of heterogeneity for the SNP-environment interaction was fixed at 0.02 and the numbers of overlapping individuals were 100, 200, 300 and 400, as in the tables. As can be seen from Fig. 3A,B, the powers of the two methods decrease gradually as the number of overlapping individuals increases. The powers for the SNP-environment interaction effect and for the SNP and SNP-environment interaction joint effects show the same trend.

Table 5 The powers under different heterogeneity with different overlapping data for the test of SNP-environment interaction.


Table 6 The powers under different heterogeneity with different overlapping data for the test of SNP and SNP-environment joint effects.


Figure 4 presents the powers for different numbers of studies. The variance of heterogeneity for the SNP-environment interaction was fixed at 0.02 and the numbers of studies were 9, 12, 15, 18, and 21. Figure 4A,B present the powers for the SNP-environment interaction effect with 100 and 400 overlapping individuals. Figure 4C,D present the powers for the SNP and SNP-environment interaction joint effects with 100 and 400 overlapping individuals. As can be seen from these figures, the powers of the two methods increase gradually as the number of studies increases.

Discussion

In contrast to the calculation of the likelihood function performed by Lee¹⁷, we only changed the calculation of $\mathrm{ln}\left|V\right|$. The covariance matrix $V$ is a real symmetric matrix and can therefore be diagonalized by a similarity transformation. That is to say, $V$ can be written in the form ${P}^{-1}\Lambda P$, where $P$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues. Then, $\mathrm{ln}\left|V\right|=\sum \mathrm{ln}\left|{\lambda }_{i}\right|$, where ${\lambda }_{i}$ is the i-th eigenvalue of the covariance matrix $V$. The other terms of the likelihood function are the same as in the random effect model MR. Thus, the computational complexity of our method is much less than that of Lee’s method. We can also compute $\mathrm{ln}\left|V\right|$ using OMR or by combining both of the above mentioned methods.
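The determinant identity used here follows in one line from the diagonalization. As a brief check, writing $V={P}^{-1}\Lambda P$ and using $\left|{P}^{-1}\right|\left|P\right|=1$:

$$\left|V\right|=\left|{P}^{-1}\right|\left|\Lambda \right|\left|P\right|=\left|\Lambda \right|={\prod }_{i=1}^{M}{\lambda }_{i},\qquad \mathrm{ln}\left|V\right|={\sum }_{i=1}^{M}\mathrm{ln}\left|{\lambda }_{i}\right|$$

When $V$ is positive definite, all eigenvalues are positive, so the absolute values inside the logarithms can be dropped.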

As in Lee¹⁷, our method can also be combined with OMR to provide a higher power regardless of the level of heterogeneity. The combined method focuses on heterogeneity and defines the statistic as follows:

$$\mathrm{L}=\left\{\begin{array}{ll}{\mathrm{L}}_{\mathrm{R}}& \text{if } {p}_{R}\le {p}_{F}\\ 0& \text{if } {p}_{R}>{p}_{F}\end{array}\right.$$

where ${\mathrm{L}}_{\mathrm{R}}$ is the likelihood ratio statistic for the SNP-environment interaction effect under the random effect model with overlapping data, and ${p}_{R}$ and ${p}_{F}$ are the P-values for the test of the SNP-environment interaction effect obtained with ROMR and OMR, respectively. The P-value for this statistic is calculated in a similar way to that used in Lee¹⁷.

When the data between studies are independent, the correlation matrix $\mathrm{C}$ becomes an identity matrix; that is to say, $\mathrm{C}=\mathrm{I}$. In this context, the method reduces to RMR. However, because the correlation matrix must additionally be handled, the computational cost of this method is higher. Therefore, the use of RMR is recommended when the data between studies are independent, while ROMR is recommended when the data overlap or when it is not certain whether there is overlapping data.

We performed a simulation in which the number of overlapping individuals was increased from 100 to 200, 300, and 400. Simulations with 500 or more overlapping individuals were not performed because there are 1000 individuals in each of our studies. That is, if there are 500 or more overlapping individuals between studies, no study contains individuals that are not shared with other studies; thus, the correlation matrix cannot be guaranteed to be strictly diagonally dominant. In this case, the non-singularity of the variance matrix cannot be guaranteed, so this situation is not considered.
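
Strict diagonal dominance, the condition mentioned above, is a sufficient but not necessary condition for a matrix to be non-singular, and it is simple to check numerically. The following Python sketch is an illustration under assumptions that are not taken from this section: it uses three studies and sets every between-study correlation equal to the overlapping fraction (for example, 400/1000 = 0.4). The helper names `is_strictly_diagonally_dominant` and `equicorrelation` are hypothetical.

```python
import numpy as np


def is_strictly_diagonally_dominant(C):
    """Check |c_ii| > sum_{j != i} |c_ij| for every row i."""
    C = np.asarray(C, dtype=float)
    diagonal = np.abs(np.diag(C))
    off_diagonal_sum = np.sum(np.abs(C), axis=1) - diagonal
    return bool(np.all(diagonal > off_diagonal_sum))


def equicorrelation(n_studies, rho):
    """Correlation matrix with ones on the diagonal and rho elsewhere.

    Assumption for illustration: all study pairs share the same
    correlation rho.
    """
    C = np.full((n_studies, n_studies), float(rho))
    np.fill_diagonal(C, 1.0)
    return C


# Hypothetical setting: 3 studies, correlation = overlapping fraction.
dominant_at_400 = is_strictly_diagonally_dominant(equicorrelation(3, 0.4))
dominant_at_500 = is_strictly_diagonally_dominant(equicorrelation(3, 0.5))
print(dominant_at_400, dominant_at_500)
```

Under these assumptions the check passes at a correlation of 0.4 and fails at 0.5, where each off-diagonal row sum equals the diagonal entry. Failing the check does not by itself mean the matrix is singular; it only means that dominance no longer guarantees non-singularity.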

In the present study, we simultaneously evaluated the fixed effect and random effect of the SNP-environment interaction, or the fixed effect and random effect of the SNP and SNP-environment interaction. When heterogeneity is high and overlapping data exist, our method retains valid type I error rates and provides high power. However, our method also has some limitations. First, the computational cost of our ROMR is much higher than that of the OMR. Second, more than one environmental variable may interact with the genetic effect being tested. Here, only one environmental variable was chosen for the interaction analysis, but other environmental variables can be entered as covariates.

Conclusion

This study generalized the RMR proposed in our previous paper to account for overlapping data. This method was designed to test the SNP-environment interaction effect or the SNP and SNP-environment joint effects with overlapping data. To this end, a correlation matrix was introduced into our random effect model. In addition, a new method to solve the likelihood function was proposed, which allowed the solution to $\mathrm{ln}\left|V\right|$ to be obtained more easily. By simulation, we verified that our method was suitable under the conditions of the random effect model of the SNP-environment interaction with overlapping data. In these simulations, our ROMR obtained a higher power than OMR when the heterogeneity was high. In practice, when the data in large-scale meta-analyses differ in ethnicity, environment, phenotype, or other factors, heterogeneity across studies is likely to exist, and our proposed ROMR can be applied.

References

MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucl. Acids Res. 45, D896–D901 (2017).
Manolio, T. A. Genomewide association studies and assessment of the risk of disease. N. Engl. J. Med. 363, 166–176 (2010).
Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucl. Acids Res. 42, D1001–D1006 (2014).
DerSimonian, R. & Laird, N. Meta-analysis in clinical trials. Control Clin. Trials 7, 177–188 (1986).
Evangelou, E. & Ioannidis, J. P. Meta-analysis methods for genome-wide association studies and beyond. Nat. Rev. Genet. 14, 379–389 (2013).
Borenstein, M., Hedges, L. V., Higgins, J. P. T. & Rothstein, H. R. Introduction to Meta-Analysis 3–14 (Wiley, Hoboken, 2009).
Fleiss, J. The statistical basis of meta-analysis. Stat. Methods Med. Res. 2, 121–145 (1993).
Field, A. P. The problems in using fixed-effects models of meta-analysis on real-world data. Underst. Stat. 2, 105–124 (2003).
Zeggini, E. & Ioannidis, J. P. Meta-analysis in genome-wide association studies. Pharmacogenomics 10, 191–201 (2009).
Lee, J. H. et al. Genetic susceptibility for chronic bronchitis in chronic obstructive pulmonary disease. Respir. Res. 15, 113 (2014).
Kiryluk, K. et al. Geographic differences in genetic susceptibility to IgA nephropathy: GWAS replication study and geospatial risk analysis. PLoS Genet. 8, e1002765. https://doi.org/10.1371/journal.pgen.1002765 (2012).
Kang, E. Y. et al. Meta-analysis identifies gene-by-environment interactions as demonstrated in a study of 4,965 mice. PLoS Genet. 10, e1004022. https://doi.org/10.1371/journal.pgen.1004022 (2014).
Liu, J. Z. et al. Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations. Nat. Genet. 47, 979–986 (2015).
Sul, J. H., Han, B., Ye, C., Choi, T. & Eskin, E. Effectively identifying eQTLs from multiple tissues by combining mixed model and meta-analytic approaches. PLoS Genet. 9, e1003491. https://doi.org/10.1371/journal.pgen.1003491 (2013).
Petersen, G. M. et al. A genome-wide association study identifies pancreatic cancer susceptibility loci on chromosomes 13q22.1, 1q32.1 and 5p15.33. Nat. Genet. 42, 224–228 (2010).
Bhattacharjee, S. et al. A subset-based approach improves power and interpretation for the combined analysis of genetic association studies of heterogeneous traits. Am. J. Hum. Genet. 90, 821–835 (2012).
Lee, C. H., Eskin, E. & Han, B. Increasing the power of meta-analysis of genome-wide association studies to detect heterogeneous effects. Bioinformatics 33, i379–i388 (2017).
Han, B. & Eskin, E. Random-effects model aimed at discovering associations in meta-analysis of genome-wide association studies. Am. J. Hum. Genet. 88, 586–598 (2011).
Keller, M. F. et al. Trans-ethnic meta-analysis of white blood cell phenotypes. Hum. Mol. Genet. 23, 6944–6960 (2014).
Hibar, D. P. et al. Genome-wide association identifies genetic variants associated with lentiform nucleus volume in N = 1345 young and elderly subjects. Brain Imaging Behav. 7, 102–115 (2013).
Lin, D. Y. & Sullivan, P. F. Meta-analysis of genome-wide association studies with overlapping subjects. Am. J. Hum. Genet. 85, 862–872 (2009).
Han, B., Duong, D., Sul, J. H., de Bakker, P. I., Eskin, E. & Raychaudhuri, S. A general framework for meta-analyzing dependent studies with overlapping subjects in association mapping. Hum. Mol. Genet. 25, 1857–1866 (2016).
Zaykin, D. V. & Kozbur, D. O. P-value based analysis for shared controls design in genome-wide association studies. Genet. Epidemiol. 34, 725–738 (2010).
Wen, X. Bayesian model selection in complex linear systems, as illustrated in genetic association studies. Biometrics 70, 73–83 (2014).
Kim, E. E. et al. FOLD: a method to optimize power in meta-analysis of genetic association studies with overlapping subjects. Bioinformatics 33, 3947–3954 (2017).
Xu, X., Shi, G. & Nehorai, A. Meta-regression of gene-environment interaction in genome-wide association studies. IEEE Trans. Nanobiosci. 12, 354–362 (2013).
Shi, G. & Nehorai, A. Robustness of meta-analyses in finding gene × environment interactions. PLoS ONE 12, e0171446 (2017).
Jin, Q. & Shi, G. Meta-Analysis of SNP-Environment Interaction with Overlapping Data. Front. Genet. 10, 1400. https://doi.org/10.3389/fgene.2019.01400 (2019).
Jin, Q. & Shi, G. Meta-analysis of SNP-environment interaction with heterogeneity. Hum. Hered. 84, 117–126 (2019).
Wolfinger, R., Tobias, R. & Sall, J. Computing Gaussian likelihoods and their derivatives for general linear mixed models. SIAM J. Sci. Comput. 15, 15–17 (1994).
Rao, C. R. Estimation of variance and covariance components in linear models. J. Am. Stat. Assoc. 67, 112–115 (1972).
Gumedze, F. N. & Dunne, T. T. Parameter estimation and inference in the linear mixed model. Linear Algebra Appl. 435, 1920–1944 (2011).
Jennrich, R. I. & Schluchter, M. D. Repeated-measures models with structured covariance matrices. Biometrics 4, 805–820 (1986).
Lindstrom, M. J. & Bates, D. M. Newton–Raphson and EM algorithms for linear mixed-effects models for repeated measures data. J. Am. Stat. Assoc. 404, 1014–1022 (1988).

Funding

This work was supported by the national Thousand Youth Talents Plan.

Author information

Authors and Affiliations

State Key Laboratory of Integrated Services Networks, Xidian University, 2 South Taibai Road, Xi’an, 710071, Shaanxi, China
Qinqin Jin & Gang Shi
Applied Science College, Taiyuan University of Science and Technology, Taiyuan, 030024, Shanxi, China
Qinqin Jin

Contributions

Q.J.: conceived the concept, designed and conducted the simulation studies, and drafted the manuscript. G.S.: conceived the concept, supervised the work, reviewed and revised the manuscript.

Corresponding author

Correspondence to Qinqin Jin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Jin, Q., Shi, G. Meta-analysis of SNP-environment interaction with heterogeneity for overlapping data. Sci Rep 11, 2590 (2021). https://doi.org/10.1038/s41598-021-82336-8

Received: 24 August 2020
Accepted: 18 January 2021
Published: 28 January 2021
DOI: https://doi.org/10.1038/s41598-021-82336-8
