Goodness-of-fit test for meta-analysis

Meta-analysis is a very useful tool to combine information from different sources. Fixed effect and random effect models are widely used in meta-analysis. Despite their popularity, they may give us misleading results if the models don’t fit the data but are blindly used. Therefore, like any statistical analysis, checking the model fitting is an important step. However, in practice, the goodness-of-fit in meta-analysis is rarely discussed. In this paper, we propose some tests to check the goodness-of-fit for the fixed and random effect models with assumption of normal distributions in meta-analysis. Through simulation study, we show that the proposed tests control type I error rate very well. To demonstrate the usefulness of the proposed tests, we also apply them to some real data sets. Our study shows that the proposed tests are useful tools in checking the goodness-of-fit of the normal models used in meta-analysis.


Method
Suppose we have statistics y i (e.g., mean difference, effect size, log odds ratio, etc.) and its variance, v i , for study i (i = 1, 2, …, K) of K related but independent studies. We want to combine the information from those K studies using the following random effect model: where the independent study-specific effects, β i 's, represent a random sample from a grand normal population with overall mean μ (e.g, μ = ln(OR) with OR being the overall OR across studies for the binary trait, or the average mean difference across studies for the quantitative trait) and between-study variance τ 2 . In other words, we assume β β β , , ,  K 1 2 are independently and identically distributed as µ τ ( , ) N 2 . We also assume the study error terms are independent and follow a normal distribution, i.e, ∼ ( , ) e N v 0 i i . Note that here the variance v i for each individual study may not be the same.   Table 2. Estimated power for each method under settings with different number of studies (K), and different values of (a, b) from 1,000 replicates at significant level 0.05. Here β ′s i are generated from the uniform distribution U(−1, 1) and the e i 's from normal distribution N(0, v i ).
When τ 2 = 0, the random effect model (1) becomes a fixed effect model. Therefore, fixed effect model is a special case of the random effect model. The parameters µ and τ 2 in (1) are usually estimated through a two-step approach. In the first step, τ 2 will be estimated using one of many different estimators 7-12 . In general, those approaches give similar results and none of them performs uniformly the best. In addition, among those estimators, the DerSimonian-Laird (DL) method is most commonly used. Therefore, in this paper the DL estimator is used.
The parameter, τ 2 , which measures the between-study variance, is estimated by 10 : and Q is the statistic of the Cochran's test for homogeneity 13,14 where the null hypothesis is . Q is defined as follows:  Table 3. Estimated power for each method under settings with different number of studies (K), and different values of (a, b) from 1,000 replicates at significant level 0.05. Here β ′s i are generated from the log-normal distribution LN(0, 1) and the e i 's from normal distribution N(0, v i ).  Table 4. Estimated power for each method under settings with different number of studies (K), and different values of (a, b) from 1,000 replicates at significant level 0.05. Here β ′s i are generated from the double exponential distribution DE(0, 1) and the e i 's from normal distribution N(0, v i ).
Note that under the normality assumption in model (1) and the null hypothesis H 0 for the Cochran's test, Q has a chi-square distribution with degrees of freedom (df) equal to K-1. A large value of Q (or a small p-value from the Cochran's test) indicates the lack of fit of the fixed effect model; a random effect model as (1) is then usually performed.
In the second step, the overall effect of µ is then estimated by the weighted mean µ β , and its variance is estimated by Var µ . Then the hypothesis is tested for overall effect across all studies as: 2 asymptotically under H 0 , is the standard test of the hypothesis about the overall mean.
Like any statistical approach, model checking is a critical step as the results from a model with lack of fit may be misleading. Several goodness-of-fit tests have been proposed in the literature for the generalized linear mixed models 2-6 , which include the random effect models as special cases. However, unlike the ordinary random effect models where the within cluster error terms are assumed to be independently and identically distributed (iid) as a normal distribution, the RE model (1) in meta-analysis assumes the error terms ′ e s i have known and possible different variances. Therefore, those existing goodness-of-fit tests designed for the generalized linear mixed models are inapplicable for the random effect model (1) in meta-analysis.    . If the Cochran test statistic Q is large, from (2) we know that τ 2 is expected to be large as well. On the other hand, if τ 2 is large enough, the covariance matrix Σ can be approximated by a matrix with all diagonal elements equal to a constant. In this case the y i 's can be seen as approximately K iid samples from a normal distribution. Therefore, the normality tests can be used to check the goodness-of-fit for model (1). In this paper, we propose several goodness-of-fit tests for model (1) based on the following popular and powerful normality tests: the Anderson-Darling (AD) test 15,16 , the Cramer-von Mises (CvM) test 15,17,18 , and the Shapiro-Wilk (SW) test 19 . In general, from model (1) the y i 's are samples from K independent but not identical normal distributions; the p-values from the normality tests are not accurate. However, to overcome this difficulty we can estimate those p-values based on resampling approaches, such as parametric bootstrap. Specifically, the proposed tests are conducted by the following steps: Step 1. For given , y i (i = 1, 2, …, K), calculate the statistics ad 0 , cvm 0 , and sw 0 , from the AD, CvM, and SW tests, respectively; Step 2. Estimate τ τ , by 2 2 using (2); Step 3. Resample B sub-samples from ( , Σ) MVN 0 , where Σ is the estimate of Σ with τ 2 being replaced by its estimate τ 2 . For each sample j (j = 1, 2, …, B), calculate statistics, ad j , cvm j , and sw j using the AD test, the CvM test, and the SW test, respectively; Step 4. The p-values are estimated by the portions of ad j , cvm j , and sw j , which are greater than ad 0 , cvm 0 , and sw 0 , respectively.   Note that the parameter µ in model (1) has no effect on the normality tests, and therefore, in Step 3 we can simulate data from a multivariate normal distribution with mean vector equal to 0, instead of μ .
To assess the performance of the proposed tests, we estimate their type I error rates and the detecting powers using simulated data. We also illustrate the use of the proposed tests through two real data applications.

Results
Simulation results. In the simulation study, we assume there are K (K = 5, 10, 15, 20, 30, 50) independent studies. We also assume the variances (v i ) of the K effects are random samples from the uniform distribution U (a, b) with different values of a and b. To estimate the type I error rate, for the null hypothesis the parameter τ 2 takes three values 0, 0.1, and 1.0. The pair of (a, b) take values (0.1, 1.0), (0.01, 0.1), and (0.001, 0.01). Without loss of generality, β ′s i are assumed iid samples from the normal distribution N(0, τ 2 ). In our simulation study, we use B = 1,000 bootstraps to estimate p-value. The significance level 0.05 and 1,000 replicates are used to estimate the type I error rate and power. Table 1 reports the estimated type I error rates for each of the three methods. It clearly shows that the proposed tests control type I error rate around the nominal level 0.05 no matter whether the effects are homogeneous (τ 2 = 0), somehow heterogeneous (τ 2 is relatively small, e.g., τ 2 = 0.1 while a = 0.1, b = 1), or highly heterogeneous (τ 2 is relatively large, e.g., τ 2 = 0.1 while a = 0.001 and b = 0.01).
Tables 2-6 report the estimated power values from each method under different settings. It can be seen that when study size (K) is small, all methods have relatively low power values. However, the detecting power increases when the sample size increases. In general, the power decreases when the variances v i 's and their heterogeneity increase (e.g, b increases). From the simulation results, we can observe that these three methods have similar performance, and none of them is uniformly more powerful than the other two under all conditions. Real data application. We use two real data sets of meta-analysis to illustrate the usefulness of the proposed tests. Both of the original data sets gave the estimated odds ratio (OR) and its 95% confidence interval (CI) for each study included in the meta-analyses. Table 7 lists the data set 1 from a meta-analysis of 12 trials that exam the effect of patient rehabilitation designed for geriatric patients on functional outcome improvement. The data were taken from Figure 4 of Riley et al. 20 , part of the Figure 2 of Bachmann et al. 21 . The p-value from the Cochran's test for homogeneity was 0.021. Therefore, the authors ran the random effect model and estimated the overall OR as 1.36 with 95% CI (1.08, 1.71) 20,21 . It should be pointed out that with small sample size like K = 12 here, the Cochran test is anti-conservative 22 . Table 8 lists the data set 2, which were taken from Figure 2 of Danese and Tan 23 . In data set 2 there are 44 studies investigating the association between childhood maltreatment and obesity. The p-value from the Cochran's test for homogeneity was very small (< 0.0001); the authors then chose random effect model in their meta-analysis. The overall OR was estimated as 1.36 with 95% CI (1.26, 1.74).
In order to use the proposed tests, for data of OR and its 95% CI, as in Tables 7-8, we first take the logarithm of the OR to get log(OR) and then use log(U/L)/3.92 to get the standard error (square root of v i ) for each individual study, where U and L are the upper and lower limits of the 95% CI for OR. The above estimates of log(OR) and v i are appropriate as the 95% CI of the odds ratio is usually calculated from ( , ) (  For data set 1, B = 10 5 bootstraps were used, and the p-values were 0.025, 0.018, and 0.039 from the three proposed goodness-of-fit tests AD, CvM, and SW, respectively. Those small p-values suggest that the random effect model may not be adequate for this data set. For data set 2, the p-values were 0.047, 0.037, and 0.048 from the AD test, the CvM test, and the SW test, respectively. The goodness-of-fit of the random effect model in the meta-analysis was also rejected at significance level 0.05 by all of the three proposed tests.

Discussion
Meta-analysis is an important and useful tool for combining information from related studies. However, blindly using the tool may result in misleading results. Model checking is a critical step and should not be ignored. The goodness-of-fit tests proposed in this paper provide useful tool to check model adequacy in meta-analysis.
In practice, the Cochran's test for homogeneity is usually performed in meta-analysis to see whether or not the fixed effect model fits the data. In general, if the hypothesis of homogeneity is rejected, the assumption of the fixed effect model is questionable. As a result, the random effect model is then usually used. However, little has been found in the literature about the adequacy of the random effect model in meta-analysis. Our proposed goodness-of-fit tests fill the gap.
It should be pointed out that the Cochran's test for homogeneity was not a formally goodness-of-fit test for the fixed effect model, though in many situations rejecting the hypothesis of homogeneity indicates the inadequacy of the fixed effect model. This is because the Cochran's test assumes individual effects are from normal distributions. When the sample size for each individual study is large, this assumption may be valid. However, when sample sizes are small for some studies, the effects from those studies may not follow the normal distribution. Therefore, it is possible that in a meta-analysis the Cochran's test does not reject the homogeneity assumption, but the goodness-of-fit tests indicate the lack of fit for both fixed effect and random effect models.
In a meta-analysis, if there is evidence of lack of fit for the random effect model, which includes fixed effect model as a special case, what should we do next? Viechtbauer agued that even if the homogeneity assumption was rejected, it is still valid to use weighted or unweighted mean to estimate the overall effect for the K studies 1 . In addition, many methods based on combining p-values can be used as well in this situation [24][25][26][27] .
Lastly, the proposed tests can be easily extended to check the normality assumption of the random effects in generalized linear mixed models. Furthermore, the proposed tests can also be used to check the random effect assumption in the meta-regression analysis, which is a special case of the generalized linear mixed model and extends the random effect model (1) by adding some fixed effects of study-level covariates.