Meta-analysis of SNP-environment interaction with heterogeneity for overlapping data

Meta-analysis is a popular method used in genome-wide association studies, by which the results of multiple studies are combined to identify associations. This process generates heterogeneity. Recently, we proposed a random effect model meta-regression method (MR) to study the effect of single nucleotide polymorphism (SNP)-environment interactions. This method takes heterogeneity into account and produces high power. We also proposed a fixed effect model overlapping MR in which the overlapping data is taken into account. In the present study, a random effect model overlapping MR that simultaneously considers heterogeneity and overlapping data is proposed. This method is based on the random effect model MR and the fixed effect model overlapping MR. A new way of solving the logarithm of the determinant of covariance matrices in likelihood functions is also provided. Tests for the likelihood ratio statistic of the SNP-environment interaction effect and the SNP and SNP-environment joint effects are given. In our simulations, null distributions and type I error rates were proposed to verify the suitability of our method, and powers were applied to evaluate the superiority of our method. Our findings indicate that this method is effective in cases of overlapping data with a high heterogeneity.

www.nature.com/scientificreports/ between covariates and genetic effects or interactions between covariates and environmental factors 27 . Based on Lin's method 21 and Han's method 22 , we extended MR to overlapping MR (OMR) 28 . This method is designed for SNP-environment interactions under the fixed effect model, as well as for overlapping data. We also extended MR to account for heterogeneity; that is, we added random effects of the SNP and SNP-environment interactions to the fixed-effect SNP-environment interaction model. This method is denoted as the random effect MR (RMR) 29 , which gives a higher power than MR under the fixed-effect model when heterogeneity exists. The Q-Q plot of the null distribution obtained by MR will shift upward when overlapping data exists. The more overlapping the data, the more obvious the deviation will be. The fixed effect model OMR 28 controls spurious associations caused by overlapping data. When heterogeneity exists and is large, the power of RMR is higher than that of MR. Similarly, when overlapping data and heterogeneity exist, the power obtained by OMR will also be affected by heterogeneity, and the power it provides will be reduced. However, no study has yet considered this condition.
In this paper, inspired by OMR and Lee's method 17 which is proposed for testing SNP main effect with overlapping data, we propose random effect overlapping MR (ROMR) which is a new method to consider overlapping data based on the RMR 29 . Our method is designed to test the SNP-environment interaction effect or the SNP and SNP-environment joint effects with overlapping data. This paper is organized as follows. In the Materials and Methods section, we introduce the correlation matrix into the RMR. We also present a new method to calculate the likelihood function. In the Results section, we carry out simulations to examine the null distribution, type I error rate, and power of our method. We also compare our method with the OMR. In the Discussion and Conclusion section, the results of this paper are analyzed and used to draw conclusions.

Materials and methods
Fixed effect overlapping MR. OMR is a method that extends from fixed effect MR, which is a powerful and robust method under the condition of independent data. This method has two procedures. First, by continuous or dichotomous environmental exposure distribution, each study is divided into several groups. In this process, each group is a subset of the study, and percentiles of the environmental exposure can be used to divide the study into several groups with approximately the same sample sizes. Then, in each group, the coefficient and variance of the main effects of SNPs are estimated. Second, the main effect of SNP and its corresponding standard deviation in each group are collected for regression analysis. Then, either the overall mean SNP-environment effect and its variance is estimated, or the mean SNP and SNP-environment joint effect vector and its variance matrix are estimated.
Assume that the environmental exposure is continuous and β ij is set to be the estimation of the main effects of SNPs in the i-th study and j-th group, where subscript i = 1, 2, . . . , n is the sample size of studies and subscript j = 1, 2, . . . , n i is the sample size of the groups in the i-th study. e ij and E ij are the standard error and mean environment exposure of the i-th study and j-th group, respectively. Under the second OMR procedure, the formula for the environment-dependent SNP effect β can be expressed in the following form: where and ε ij ∼ N(0, e ij ) i = 1, 2, . . . , n, j = 1, 2, . . . , n i .
Let C be the correlation matrix. The element of this matrix is where the subscript n i h and n j k are presented as the size of the h-th group of study i and k-th group of study j, respectively, and n i h j k is the size of the overlap individuals between the h-th and k-th group. The covariance matrix of this method is It has another form as www.nature.com/scientificreports/ where e = (1, 1, . . . , 1) and its length is the sum of all group sizes. The formula of the linear unbiased estimators α and Cov α are expressed as follows Under null distribution α 2 = 0 and α = 0 , the Wald statistic of the SNP-environment interaction effect and the SNP and SNP-environment joint effects follow 1 and 2 degrees of freedom (df) χ 2 distribution, respectively.
The likelihood function under this model can be written as The estimation of this likelihood function is given by the minimum variance quadratic unbiased estimator (MIVQUE(0)) 30,31 and Newton-Raphson algorithms [32][33][34] . The detailed process is given in Jin 29 .

Test of SNP-environment interaction.
Under this test, we suppose that there is no interaction effect and no interaction heterogeneity, that is, α 1 = 0 and D s = u 2 1 0 0 0 . The reduced model can then be given as follows: where www.nature.com/scientificreports/ and the dimension of X i is n i . In this model, β ∼ N(Xα 0 , V ) , the covariance matrix is V = ZDZ ′ + 1/2 C 1/2 . Denote 1 1 , 1 2 , . . . , 1 M where M = n i=1 n i as the eigenvalues of matrix V , then |V | = 1 1 * 1 2 * · · · * 1 M . The -2 times of the log likelihood for this model is As in Jin 29 , the likelihood ratio statistic for the test of SNP-environment interaction is given as follows: where l 1 is the minimum of l 1 and l 2 is the minimum of l 2 . The statistic L I asymptotically follows an equal mixture of 2 df χ 2 distribution and 3 df χ 2 distribution. Its p-value is calculated by 0.5 (P ( χ 2 2 > L I ) + P(χ 2 3 > L I )) 29 .

Joint test of SNP and SNP-environment.
Under this test, we suppose that α = 0 and D s = 0 0 0 0 , that is to say, no SNP main, SNP-environment interaction fixed effects, and no corresponding heterogeneity. The null model can be given as follows:

The -2 times of the log likelihood for this model is
The likelihood ratio statistic for the joint test of the SNP main and SNP-environment is given as follows: where l 0 is the evaluated value of l 0 . The statistic L J asymptotically follows a ξ:0.5:(0.5-ξ ) mixture of 3 df χ 2 distribution, 4 df χ 2 distribution, and 5 df χ 2 distribution. The value of ξ depends on the given data and is solved by the information matrix. The p-value is calculated by(0.5-ξ ) P ( Ethics approval. The authors have no ethical conflicts to disclose.

Results
Simulation. The relationship among the quantitative trait Y , the genotype of SNP G , and the environmental variable E is presented as follows: The quantitative trait Y was simulated as a standardized normal distribution, that is, with a mean of 0 and a variance of 1. SNP was assumed as an additive genetic effect, and its minor allele frequency was 0.3. G was coded as the number of minor alleles. In each study, 1000 points following a standard uniform distribution were generated. If these points fell in [0, 0.3 2 ] , then G was set to 2. If these points fell in [0.
, then G was set to 1, else G was set to 0. The environmental variable E was also simulated as a standardized normal distribution. A 10% variation in Y was explained by the environmental term β E E . Fixed effects β G , β G×E and random effects γ G , γ G×E changed in simulation datasets. The random error ε was normally distributed, with a zero mean and a variance chosen so that the variance of Y was 1. In our simulation, we generated 1000 replications, each replication had 12 studies, each study had 1000 unrelated individuals, and each individual had one quantitative trait Y , one environmental variable E , and one SNP G . Across all the studies, 100 and 400 overlapping individuals were observed. Before analysis, the individuals in each study were divided into five groups according to the distribution of E . The main effects of SNP and standard errors were estimated by linear regression. The mean of the environmental variables E of each stratum was then calculated.
To test the null distribution of statistics for the SNP-environment interaction, we assumed that both the SNP-environment interaction and its corresponding heterogeneity were zero. That is to say, β G×E = 0 and γ G×E = 0 . The main effect of SNP β G was set to a square root of 0.1, and the random effect of this effect was set to be normally distributed, with a mean of 0 and a variance of 0.02. We calculated the minimum estimates of the likelihood functions l 1 and l 2 , such that the statistic L I could be obtained. Finally, empirical P-values were calculated using a 0.5:0.5 mixture of 2 and 3 df χ 2 distributions, which was the theoretical distribution of the interaction test. Then, these were compared with expected values following a uniform distribution between 0 and 1. A Q-Q plot was drawn through the two types of P-values. To test the null distribution of statistics for the www.nature.com/scientificreports/ SNP and SNP-environment joint effects, we assumed that all the fixed effects and random effects of the SNP and SNP-environment interaction were zero. That is, β G =β G×E =γ G =γ G×E =0.We calculated the likelihood ratio statistic L J by estimating the minimum of the likelihood function l 0 and l 1 . The empirical P-values were calculated as a ξ:0.5:(0.5-ξ ) mixture of 3 df χ 2 distribution, 4 df χ 2 distribution, and 5 df χ 2 distribution. The value of ξ was data-dependent and calculated using Fisher information.
To test the powers of the SNP-environment interaction effect, we set the fixed effects of the SNP main β G and SNP-environment interaction β G×E to √ 0.002 . The random effect of SNP main γ G was normally distributed, followed E ∼ N(0, 0.015) , variance of random effect of SNP-environment interaction ranging from 0.005 to 0.025, where each increased by 0.005. If the P-value of the test is less than 0.05, it was considered statistically significant. Experiments were repeated 1000 times, and the proportion of statistical significance was called empirical power. The OMR was also tested under this simulation. To test for the SNP and SNP-environment interaction joint effects, the fixed effects of β G and β G×E were set to a square root of 0.002. The random effects of γ G and γ G×E were set normally distributed with a mean of 0 and a variance ranging from 0.005 to 0.025. Fig. 1A,B, these points are nearly standing on the diagonal line with 100 and 400 overlapping individuals between studies. This verifies that the method presented provides suitable distribu- Figure 1. Q-Q plots of the null distributions in the test for SNP-environment interaction effects and the joint test for SNP and SNP-environment interaction effects. (A, B) The tests for SNP-environment interaction effects with 100 and 400 overlapping individuals between any two studies. (C, D) The joint tests for SNP and SNPenvironment interaction effects with 100 and 400 overlapping individuals between any two studies. The vertical axis is the-log10 (observed P value) from data analyzed under the null hypothesis, and the horizontal axis is the-log10 (expected P value). www.nature.com/scientificreports/ tions. In Fig. 1C,D, the empirical P-values are close to the expected ones with 100 and 400 overlapping individuals between any two studies, demonstrating the suitability of our distributions.

Null distribution. As shown in
Type I error rate. To better illustrate the performance of our method, we considered three different scenarios. In scenario 1, two different sample sizes were considered: (1) a study with 1000 individuals and (2) a study with 2000 individuals. In scenario 2, two different significance levels were considered: (1) 0.01 and (2) 0.05. In scenario 3, two different main effects of SNP were considered: (1) a square root of 0.1 and (2) a square root of 0.2. Table 1 presents the values of the type I error rates in the different scenarios for the test of the SNPenvironment interaction with 10% overlapping data. Table 2 presents the values of the type I error rates in the different scenarios for the test of the SNP-environment interaction with 40% overlapping data. Table 3 presents the values of the type I error rates in the different scenarios for the test of the SNP and SNP-environment joint effects with 10% overlapping data. Table 4 presents the values of the type I error rates in the different scenarios for the test of the SNP and SNP-environment joint effects with 40% overlapping data. For the 1000 replications, the 95% confidence intervals for the estimated type I error rates of nominal levels 0.05 and 0.01are (0.036, 0.064) and (0.002, 0.018) respectively. For the 2000 replications, the 95% confidence intervals for the estimated type I error rates of nominal levels 0.05 and 0.01are (0.040, 0.060) and (0.004, 0.016) respectively. From these tables, we can see that all of the estimated type I error rates are in the confidence intervals for interaction tests and joint tests, this indicates that our method is valid.     Fig. 2A,B. The OMR gives higher powers when heterogeneity is low with 100 and 400 overlapping individuals between any two studies. The powers of this method decrease slightly with increase in heterogeneity, however, the powers of the ROMR increase rapidly with increase in heterogeneity. When the heterogeneity is high, the ROMR gives higher powers. This is due to the fact that when the heterogeneity is low, most of the statistical evidence of L I is obtained from the fixed effect of the interaction. In fact, the test statistics of ROMR are penalized by high degrees of freedom, yielding less power. When the heterogeneity is high, the OMR tested for the fixed effect only, the genetic effect tested by the ROMR is much larger than that of OMR. Thus, ROMR gives higher power 29 . As shown in Fig. 2C,D, although our method provides a similar tendency to interaction simulation, joint tests generally obtain higher results. This is because both the SNP and SNP-environment interaction are tested, thereby including more effects than the test for the interaction only. Table 5 and 6 present the powers under different levels of heterogeneity with different overlapping data. Table 5 shows the powers of the SNP-environment interaction, showing that the power decreased with an increase in the number of overlapping data. For ROMR, the greatest drop was 0.299; for OMR, the greatest drop was 0.127. However, in any case, when the heterogeneity was large, ROMR gave a higher power than OMR. Table 6 shows the powers of the SNP and SNP-environment interaction joint effects. Similar to the case of SNP-environment interaction, as the number of overlapping data increased, the power of ROMR was reduced faster than OMR. However, when the heterogeneity was large, ROMR gave a higher power than OMR. In order to more www.nature.com/scientificreports/ intuitively understand the impact of different overlapping data, Fig. 3 is given. We selected one set of parameters from Table 5 and 6. The variance of heterogeneity for SNP-environment interaction was fixed as 0.02 and the overlapping individuals were 100, 200, 300 and 400 as in the tables. As can be seen from Fig. 3A,B, with the increase of overlapping individuals, the powers of the two methods are decreasing gradually. The powers of www.nature.com/scientificreports/ SNP-environment interaction effects and the powers of the SNP and SNP-environment interaction joint effects have the same variation tendency. Figure 4 present the powers under different number of studies. The variance of heterogeneity for SNPenvironment was fixed as 0.02 and the numbers of studies were 9, 12, 15, 18, 21. Figure 4A,B present the powers of SNP-environment interaction effects under 100 and 400 overlapping individuals. Figure 4C,D present the powers of SNP and SNP-environment interaction joint effects under 100 and 400 overlapping individuals. As can be seen from these figures, with the increase of the numbers of studies, the powers of the two methods are increasing gradually.

Discussion
In contrast to the calculation of the likelihood function performed by Lee 17 , we only changed the calculation of ln|V| . The covariance matrix V is a real symmetric matrix that can be diagonalized in a similar manner. That is to say, V can be written in the form of P −1 P , where P is the matrix of eigenvectors and is the vector of eigenvalues. Then, ln|V| = ln| i | , where i is the eigenvalue of the covariance matrix V . The other terms of the likelihood function are the same as in the random effect model MR. Thus, the computational complexity of our method is much less than that of Lee's method. We can also compute ln|V| using OMR or by combining both of the above mentioned methods.
As in Lee 17 , our method can also be combined with OMR, providing a higher power, regardless of the level of heterogeneity. This method focuses on heterogeneity, it designs statistic as follows Figure 4. Statistical power of the test for SNP-environment interaction effects and the joint test for SNP and SNP-environment interaction effects with different number of studies (9,12,15,18,21) and with 100 and 400 overlapping individuals under a fixed heterogeneity. (A, B) The tests for SNP-environment interaction effects with different number of studies (9,12,15,18,21) and with 100 and 400 overlapping individuals under a fixed heterogeneity. (C, D) The joint tests for SNP and SNP-environment interaction effects with different number of studies (9,12,15,18,21) and with 100 and 400 overlapping individuals under a fixed heterogeneity. www.nature.com/scientificreports/ where L R is the likelihood ratio statistic for the SNP-environment interaction effect under the random effect model with overlapping data, and p R and p F are the P-values for test of the SNP-environment interaction effect applying the ROMR and OMR. The P-value for this statistic is similar to that used in Lee 17 .
When the data between studies are independent, the correlation matrix C becomes an identity matrix; that is to say, C = I . In this context, the method becomes RMR. However, as a result of the additional judgement process of the correlation matrix, the calculation amount of this method is increased. Therefore, the use of RMR is recommended when the data between studies are independent, while ROMR is recommended when the data is overlapping or it is not certain whether there is overlapping data.
We performed a simulation where the number of overlapping individuals were increased from 100 to 200, 300, and 400. Simulations with 500 or more overlapping individuals were not performed because there are 1000 individuals in our studies. That is, if there are 500 or more overlapping individuals between studies, none of the studies have individuals that are not applied to other studies, thus the correlation matrix cannot be guaranteed to be strictly diagonally dominant. In this case, the non-singularity of the variance matrix cannot be guaranteed; thus, this situation is not considered.
In the present study, we simultaneously evaluated the fixed effect and random effect of the SNP-environment interaction or the fixed effect and random effect of the SNP and SNP-environment interaction. When heterogeneity is high and overlapping data exists, our method provides accurate and valid power. However, our method also has some limitations. First, the calculation cost of our ROMR is much higher than that of the OMR. Second, more than one environmental variable may interact with the genetic effect being tested. Here, only one environmental variable was chosen for the interaction analysis, but other environmental variables can be entered as covariates.

Conclusion
This study generalized the RMR proposed in our previous paper to account for overlapping data. This method was designed to test the SNP-environment interaction effect or the SNP and SNP-environment joint effects with overlapping data. To this end, a correlation matrix was introduced into our random effect model. In addition, a new method to solve the likelihood function was proposed, which allowed the solution to ln|V| to be obtained more easily. By simulation, we verified that our method was suitable under the conditions of the random effect model of the SNP-environment interaction with overlapping data. As a result, our ROMR obtained a higher power than OMR when the heterogeneity was high. In practice, when data from large-scale meta-analyses originate from different factors, including ethnicities, environments, phenotypes, or some other factors, heterogeneity across studies is likely to exist, our proposed ROMR can be applied.