Goodness-of-fit testing for meta-analysis of rare binary events

Random-effects (RE) meta-analysis is a crucial approach for combining results from multiple independent studies that exhibit heterogeneity. Recently, two frequentist goodness-of-fit (GOF) tests were proposed to assess the fit of RE model. However, they tend to perform poorly when assessing rare binary events. Under a general binomial-normal framework, we propose a novel GOF test for the meta-analysis of rare events. Our method is based on pivotal quantities that play an important role in Bayesian model assessment. It further adopts the Cauchy combination idea proposed in a 2019 JASA paper, to combine dependent p-values computed using posterior samples from Markov Chain Monte Carlo. The advantages of our method include clear conception and interpretation, incorporation of all data including double zeros without the need for artificial correction, well-controlled Type I error, and generally improved ability in detecting model misfits compared to previous GOF methods. We illustrate the proposed method via simulation and three real data applications.

Meta-analysis is a valuable technique used in various fields, including medicine, biology, social sciences, and ecology, to combine information from multiple studies to increase inference reliability.A random-effects model (REM) is a popular choice in a meta-analysis, which assumes that the actual effect sizes of component studies θ i s follow a normal distribution with an overall mean θ 0 and variance τ 2 (often referred to as the heterogeneity parameter).When τ 2 = 0 , a REM is reduced to a fixed-effect model (FEM) with θ i ≡ θ 0 .REMs are preferred over FEMs in most scenarios because they account for the heterogeneity among studies and are, therefore, applicable to a broader range of scenarios 1 .
Among REMs, a generic model widely employed for binary and continuous outcomes uses the normal-normal hierarchical structure.For study i, let y i be the observed effect size (for binary data, y i is typically the log odds ratio), and σ 2 i denotes the within-study variance (i.e., the sampling variation in study i).The generic model specifies y i |θ i , σ 2 i ∼ N θ i , σ 2 i and θ i ∼ N θ 0 , τ 2 .However, for rare binary outcomes, the normal approximation for y i given θ i and σ 2 i may not work well due to the sparsity or small sample sizes.Alternatively, the binomial-normal (BN) hierarchal structure is a popular substitute for the normal approximation.It assumes that the number of observed events in the treatment (control) group for study i, denoted by x i2 (x i1 ) , follows a binomial distribution with the total number of subjects n i2 (n i1 ) and event probability p i2 p i2 .The logit transformed probabilities are then assumed to be distributed normally in the second hierarchy, where the log odds scale measures the effect size θ i .Several variations of the BN framework have been proposed.Bhaumik et al. 2 assumed logit p i1 = µ i and logit p i2 = µ i + θ i , where µ i ∼ N(µ 0 , σ 2 ) denotes logit-transformed background incidence rate for study i. Smith, Spiegelhalter, and Thomas 3 considered the equal variance between the control and treatment group by defining logit p i1 = µ i − θ i /2 and logit p i2 = µ i + θ i /2 .Li and Wang 4 proposed a more flexible model by defining logit p i1 = µ i − ωθ i and logit p i2 = µ i + (1 − ω)θ i , where the new parameter ω adjusts the variance ratio between two arms and the previous two models can be viewed as special cases by assigning ω = 0 and 1/2, respectively.To avoid the assumption of independency between µ i and θ i , Houweilingen, Zwinderman and Stijnen 5 proposed the use of a bivariate normal distribution for modeling (logit p i1 , logit p i2 ) , which allows any correlation structure between logit p i1 and logit p i2 in order to test the effects of each variable.
All the models discussed above make a common assumption that the true effect sizes θ ′ i s follow a normal distribution θ i ∼ N θ 0 , τ 2 .While this assumption is convenient for mathematical purposes, it may not always hold in reality as the distribution of true effect sizes across different studies could have any shape.Therefore, conducting a goodness-of-fit (GOF) test is crucial before drawing conclusions or making inferences, since a misspecified model may yield misleading results 6,7 .Researchers have come up with various solutions to test the normality of models.Recently, Chen, Zhang, and Li 8 proposed a parametric bootstrap-type GOF test, mainly focused on the generic REM.Subsequently, Wang and Lee 9 developed a standardization framework to evaluate the normality assumption.It avoids the need to generate reference distributions and is therefore computationally efficient.However, their methods require continuity corrections when encountering single or double-zero studies, which can impact both Type I error rates and statistical power.Furthermore, those who previously proposed methods did not investigate their approaches numerically under different background incidence rates, especially when dealing with rare binary outcomes.This is an interesting and important aspect to explore in meta-analyses of binary outcomes.
In terms of Bayesian alternatives, no Bayesian approach has been considered for GOF testing in meta-analysis to our knowledge.We propose a novel GOF test for meta-analysis that utilizes the pivotal quantity (PQ) methodology proposed for Bayesian model assessment 10,11 , and adapts the Cauchy combination test 12 , which combines dependent p values computed using posterior draws from Markov chain Monte Carlo (MCMC), to inform the final conclusion.A pivotal quantity is a function of data and model parameters whose distribution does not depend on unknown parameters.For instance, suppose θ = (µ, σ ) ∼ π , and θ 0 = (µ 0 , σ 0 ) is a random vector drawn from density π , which generates the normal data x = {x 1 , ..., is a pivotal quantity.Let μ and σ be samples from the corresponding posterior distribution p(µ, σ |x) .The PQ method is constructed based on the fact that f (x, µ 0 , σ 0 ) and f (x, μ, σ ) are identically distributed; that is, f (x, μ, σ ) ∼ χ 2 n .Our proposed method, called Improved Pivotal Quantities (IPQ), can detect a model failure at all levels in hierarchical models without extra computational cost.Additionally, it can be easily incorporated into standard Bayesian implementations and automatically accounts for all available data without requiring artificial corrections for rare binary events.While our method is suitable for general purposes, we focus primarily on its application for meta-analyses of rare binary outcomes.
The rest of this article is organized as follows.We first review Bayesian techniques that assess model adequacy, followed with a brief introduction to pivotal quantities and the Cauchy combination test.We then introduce our proposed method based on the generalized REM in Houweilingen, Zwinderman and Stijnen 5 for meta-analysis of binary events.We also describe the Bayesian implementation of our method, including adapting the proposed IPQ method within the MCMC algorithm and considering different bivariate covariance priors.In the simulation section, we conduct simulation studies to evaluate our method's performance in terms of Type I error rates and statistical power and compare it with other existing GOF methods.We also evaluate four different covariance priors based on estimating the overall treatment effect, the inter-study heterogeneity, and the correlation coefficient.In data examples section, we illustrate our method using three real data examples.The first example utilizes handedness and eye-dominance data from 54 studies, the second one employs Type 2 diabetes mellitus and gestational diabetes data from 20 studies, and the third uses GSTP1 gene and lung cancer data from 44 studies.We then end this paper with conclusions and discussions.

Review of related bayesian work
In current practice, Bayesian model diagnostics mainly fall into three categories: prior predictive, posterior predictive, and pivotal quantity-based approaches.See Figure 1 for illustration.

Prior and posterior predictive checks
Suppose x has a distribution function specified by p(x|θ) , where θ represents the parameters of a model (say M ) under study.Let x obs denote the observed data and x rep denote the replicated data that are generated to mimic real data.Box 13 recommended using the prior predictive distribution, p(x) = p(x|θ )p(θ )dθ , as a reference distribution to generate x rep for comparing with x obs .The steps to obtain the prior predictive distribution are illustrated in Figure 1A.Given θ rep j drawn from the prior p(θ ) for j = 1, .., R , we draw x rep j from the sampling distribution p x|θ rep j .We then use T(x, θ ) or T(x) , a function of data and model parameters or a function of data alone, to measure the discrepancy between data and model assumptions.Here, we take T(x) as an example for simplicity, evaluated at both x obs and x rep j for all j.The model misfit can be concluded if T x obs is unlikely from the reference distribution formed by T x rep j 's.However, prior predictive checks might be problematic when using improper or weakly-informative priors, which are commonly used in practice 14 .
Gelman, Meng, and Stern 15 proposed model assessment using the posterior predictive distribution, defined as p x rep |x obs = p(x rep |θ)p θ |x obs dθ .As shown in Figure 1B, replicated data x rep j are generated using θ rep j from the posterior distribution p θ|x obs .Then, the reference distribution based on the chosen discrepancy function T(x) can be computed.The Bayesian posterior predictive p value can be obtained as P T(x rep ) ≥ T x obs |x obs to quantitatively detect the model misfit.
The posterior predictive check has gained increasing popularity in Bayesian model checking due to its straightforward implementation via Monte Carlo Markov Chain (MCMC) algorithms.However, there are two major limitations associated with this type of approach.Firstly, unlike the traditional p value, the posterior predictive p value does not follow a uniform distribution under the null hypothesis of no lack of fit, making it difficult to interpret and assess the level of evidence against the null hypothesis 16 .Secondly, the method has almost no power to detect failures from the second or deeper layers in hierarchical models 11,17 .
To overcome the non-uniformity problem, a potential solution is to calibrate the posterior predictive p value so that the calibrated p value follows a uniform distribution asymptotically 18 .However, the statistical power of using the calibrated p value has not been investigated yet.To avoid this issue, Bayarri and Berger 19 proposed two new types of p values: the conditional predictive p values and the partial posterior predictive p values.Bayarri and Castellanos 16 further extended the partial posterior predictive method to test the second layer of hierarchical models, which avoids "using data twice." However, as mentioned in Johnson 20 , the partial posterior strategy is typically not straightforward to implement beyond normal-family problems.More recently, Gosselin 21 and Zhang 22 recommended randomly drawing a single value θ single from the posterior distribution π θ|x obs to generate x rep , namely sampled posterior check (Figure 1C).The corresponding p values is distributed uniformly when the data model is correctly specified, and the approach achieved higher power than the original posterior predictive check for detecting model misfit 21 .

Pivotal quantity methodology
Johnson 10 pioneered the use of pivotal quantities (PQ) to detect model misfit, and Yuan and Johnson 11 extended upon the methodology so that it can be applied to any level of hierarchical models.Since it does not involve replicated data, there is no need to distinguish x obs and x rep , and x is directly used for observed data.
A pivotal quantity, denoted by D(x; θ ) , is a function of both data x and model parameters θ .It possesses a sampling distribution F that is both known and invariant when evaluated at θ 0 , the "true" (data-generating) value of θ ; that is, D x; θ 0 ∼ F .Johnson 10 shows that D x; θ and D x; θ 0 are identically distributed, where θ is www.nature.com/scientificreports/drawn from the posterior distribution p(θ |x) .Based on this result, the approach to model assessment involves two main steps 10 .The first is to select a pivotal discrepancy measure D(x; θ ) with a known reference distribution F, and, the second step is to evaluate the model fit by determining whether D x; θ can be considered as a draw from F. However, when conducting a GOF test for the second or deeper layers in hierarchical models, one may encounter difficulties since D(x; θ ) depends on x , but these layers usually involve no data.For this reason, Yuan and Johnson 11 extended the method by defining the pivotal quantity D as a function of model parameters only and further showed that D θ 0 and D θ have identical distributions.This allows the application of pivotal quantities and the corresponding reference distributions to diagnose model inadequacy at any level of a hierarchical model.
As shown in Figure 1D, after drawing θ(i) from p(θ |x) , D(x; θ ) is evaluated at θ(i) for i = 1, ..., M .Then, each D x, θ(i) has the same distribution as D x, θ 0 .For example, suppose D x, θ 0 ∼ N(0, 1) , then under the null hypothesis of no lack of fit, D x, θ(i) ∼ N(0, 1) for each i marginally.To test normality, Johnson 10 suggested using a formal approach such as the Shapiro-Wilks test, and different p values from the tests on x, θ (i) are calculated for i = 1, ...M.However, combining those p values is not as straightforward as using Fisher's combina- tion test.This is because the p values are derived from posterior samples using the same dataset and so are dependent with an unknown covariance structure.
To address this issue, Johnson 10 suggested that one could avoid generating multiple draws from the same dataset by utilizing the prior-predictive distribution from Dey et al. 17 , which suggested generating 1000 replicated datasets, x rep i for i = 1, ..., 1000 , from p(x) as illustrated in Figure 1A.For each replicated dataset x rep i , Bayesian data analysis is performed to obtain the corresponding posterior distribution p(θ|x rep i ) , then a single θrep i is randomly sampled from the posterior, and one p values from testing the normality using the pivotal quantity is computed.This results in 1000 independent p values.Standard approaches, such as Fisher's test, can then be employed to draw a conclusion.However, this method may suffer from two limitations.Firstly, using 1000 replicated datasets can be computationally intensive since the same MCMC procedure needs to run 1000 times to draw independent posterior samples.Secondly, non-informative priors may not necessarily generate reasonable datasets.Considering these difficulties, Johnson 10 recommended finding probabilistic bounds on dependent p values using the properties of order statistics derived from Gascuel and Caraux 23 and Rychlik 24 .
Let x (1) , .., x (M) denote order statistics from a dependent sample of random variables, where each has distri- bution function F, and let F k:M denote the distribution function for the k-th order statistic out of M.Then, the bound of F k:M can be written as Let p 1 , ..., p M be dependent p values for m = 1, ..., M .Under the null, each p values should be distributed uni- formly on (0, 1) , implying F(t) = t in Eq. (1).Let x i = −p i , Li, Wu and Feng 25 showed that the Eq. ( 1) becomes which means that a p value upper bound for the observed k-th order statistic p obs To avoid choosing the value of k, they suggested reporting the minimum upper bound such that p min = min min 1, p obs . Yuan and Johnson 11 advocated using the rule-of-thumb value of 0.25 as a cutoff for declaring the model misfit in practice; that is, reject the null hypothesis H 0 if p min < 0.25 .However, the proposal may be liberal, and our simulation studies in the simulation section show that 0.25 is not necessarily a good choice and it is hard to select an optimal cutoff to balance the trade-off between Type I error and power.

Method
The generalized REM for meta-analysis of binary events Suppose a meta-analysis contains I independent studies, and for the i th study, let x i1 (x i2 ) be the number of observed events in the control (treatment) group, which follows a binomial distribution with the total number of subjects n i1 (n i2 ) and corresponding event probability p i1 p i2 .Let φ i1 (φ i2 ) denote the logit-transformed p i1 p i2 , i.e., φ ij ≡ ln Then the generalized binomial-normal REM in Houweilingen, Zwinderman and Stijnen 5 can be written as where (φ i1 , φ i2 ) is modeled by a bivariate normal distribution with an arbitrary covariance structure.We further define the treatment effect θ i = φ i2 − φ i1 for study i, which follows a univariate normal distribution with an overall mean effect θ 0 = µ 2 − µ 1 and the heterogeneity (1) The generalized REM builds a strong connection to many well-established models 26 .For example, the model in Li and Wang 4 and Zhang et al. 27 is a special case of model ( 2), yielding where in (2), As mentioned in the introduction, we can let ω be 0 or 0.5, which further reduces the model to the one in Bhaumik et al. 2 or Smith, Spiegelhalter, and Thomas 3 , respectively.Thus, model ( 2) is regarded as the most generalized binomial-normal model with fewer assumptions, so we choose it as the basis to design the GOF test for detecting non-normality of θ i 's.

The proposed GOF test
Inspired by previous research, we propose a novel GOF test and demonstrate its applicability in the context of meta-analysis of (rare) binary events.Our approach involves defining the null hypothesis, denoted by H 0 , which assumes normality for the true effect sizes θ i s, a prevailing assumption made in meta-analysis.The alternative hypothesis, denoted by H 1 , is formulated as any departure from H 0 .In other words, we aim to detect this specific departure from the bivariate normal model assumed for the second layer of the generalized REM, where we can draw our conclusion about the presence of an overall treatment effect θ 0 and between-study heterogeneity τ 2 .
2 , ρ * be data-generating parameter values for study i, and ˜ (m) i be the corresponding m th draw from the joint posterior distribution p( |X) for m = 1, ..., M .We define a discrepancy measure to capture the deviation from H 0 , namely which is a pivotal quantity and follows a standard normal distribution under H 0 .Furthermore, as pointed out by one of our reviewers, this measure can be viewed as the study-specific effect in units of standard deviations, for each study and each posterior draw.Then, according to Yuan and Johnson 11 , has a distribution identical to D * i ; that is, D ˜ i (m) ∼ N(0, 1) for every m marginally.
Conducting a standard normality test using the pivotal quantities in (3) based on a single draw is straightforward, but this sampled posterior approach can be problematic since the vagaries of randomness can produce a sample that seems unwise.Alternatively, combining multiple MCMC draws to draw a conclusion was recommended in Johnson 10 and Yuan and Johnson 11 , where probabilistic bounds of the order statistics of the p values are used to combine the dependent p values.Here, we propose to use the Cauchy combination idea 12 to combine the dependent p values.
Consider p i as the p values obtained from the i-th statistical test, and ω i as the corresponding nonnegative weight that sums up to 1. Liu and Xie 12 introduced the Cauchy combination test and demonstrated that, subject to certain regularity conditions, the tail of a test statistic that linearly combines individual transformed p values can be well approximated by a standard Cauchy distribution under the null hypothesis.Specifically, if there are k p values, then the test statistic is given by T = k i=1 ω i tan 0.5 − p i π , where the weight ω i is typically set to 1/k in the absence of any prior information.The Cauchy combination test has several salient features.Firstly, the test, by leveraging the Cauchy distribution, the test has a simple analytical formula to compute the p value.Next, unlike classical Fisher's test 28 or other common tests for combining p values, such as the minimum p value test 29 , the Berk-Jones test 30 and the higher criticism test 31 , the Cauchy combination test handles p values from correlated statistical tests and remains valid for arbitrary correlation structures.Finally, the test works well even if one main assumption required for the test, the bivariate normality between any two test statistics generating the p values, , www.nature.com/scientificreports/ is not satisfied.Thus, p i s can be from non-normal typed tests (i.e., those with test statistics that are not normally distributed), such as the Shapiro-Wilk test 32 , the Cramer-von Mises test [33][34][35] and the Anderson-Darling test 35,36 .In summary, the proposed GOF test, namely Improved Pivotal Quantities (IPQ), can be outlined by the following steps: Step 1: Given I independent studies, randomly sample ˜ (m) i from the joint posterior distribution p( |X) via MCMC for i = 1, ..., I and m = 1, ...M. .Then, we will reject the H 0 if p * < α with a pre-specified significance level (e.g., α = 0.01, 0.05, 0.1).

Bayesian implementation with different covariance priors
We now pivot the discussion to prior specification and the Bayesian implementation.We use a Hamiltonian Monte Carlo (HMC) algorithm via Stan (version 2.19.1) 37 in conjunction with R 38 to fit models with different priors discussed below.For each dataset, we run the algorithm with 5000 burn-in iterations and 5000 additional sampling iterations.The convergence of MCMC chains is detected using the Gelman-Rubin diagnostic 39 .
We start with the prior choices for logit-transformed mean effects µ 1 and µ 2 for the control and treatment groups, where we consider diffuse uniform priors such that µ j ∼ U L µ j , U µ j for j = 1, 2 .To define the range, we get rough estimates μij for all I studies, μij = ln 5 .Then, we define the lower bound L µ j = min i,j μij − c and upper bound U µ j = max i,j μij + c , where we let c = 5 as in Bai et al. 40 so that the priors are conservative enough to contain all plausible values.
Regarding the prior for the covariance matrix , several commonly used conjugate priors are available, including the independent prior (IND) that assumes mutual independence a priori among the elements of 41 , the inverse Wishart prior (IW) 42 and the hierarchical inverse Wishart prior (HIW) 43 .Other alternatives include the scaled inverse Wishart prior (SIW) 44 and the prior based on the separation strategy (SS) 45 .The Bayesian inference of a covariance matrix is highly sensitive to different choices of priors, and several studies have compared the performance of various priors.For example, Alvarez, Niemi and Simpson 46 compared four different priors (IW, HIW, SIW and SS) in the multivariate normal model and found that the IW prior performed the worst among all the four, especially when the true variances were small.Rúa, Mazumdar and Strawderman 41 conducted extensive simulation by comparing 38 priors, including IW, HIW, and IND, with different hyper-parameter specifications in multivariate Bayesian meta-analysis models.They found that the IW prior had overall poor performance, while the HIW prior had much more consistent performance across all scenarios examined.Akinc and Vandebroek 47 focused on the same priors used in Alvarez, Niemi and Simpson 46 and investigated Bayesian inference of the covariance matrix in mixed logit models.They suggested using different priors to check the robustness of the results but recommended avoiding the IW prior.To the best of our knowledge, the impact of different covariance priors on BN models in the context of meta-analysis of rare binary events has not been investigated.Thus, we aim to address the gap and access how these priors perform under rare binary settings.Below we briefly review four classes of priors, including IW, HIW, SS , and SIW, and their performance will be assessed in the simulation section.

Inverse Wishart prior
Due to the conjugacy property, the IW prior is often used as a default choice for covariance matrices.The density function of the IW prior IW(ν, H) is defined as p( , where ν > 0 is the number of degrees of freedom and H is a symmetric scale matrix with two dimensions.The marginal distribution of the correlation parameter ρ in is p when H is a diagonal matrix.If ν = 3 , ρ follows U(−1, 1) 45 .For our model, the conditional posterior distribution of is given by p �|φ 1 , φ 2 , µ 1 , µ 2 , x ∼ IW ν + I, H + � µ , where While the IW prior is a popular choice in Bayesian analysis due to its mathematical convenience, it also has limitations.One issue is that selecting the appropriate degrees of freedom ν and scaled matrix H can be chal- lenging.Although these are often set to default values of 3 and an identity matrix, respectively, recent studies by Rúa et al. 41 and Akinc and Vandebroek 47 have shown that these choices may not always be suitable.Another limitation is that the IW prior implies a strong relationship between variance and correlation, which can bias inference.Specifically, smaller variances are associated with correlation coefficients ρ around 0, while larger variances correspond to ρ approaching −1 or 1.proposed a two-layer hierarchical prior that builds upon the work of Wand et al. 48and Armagan, Artin et al. 49 , who showed that a half-t distribution can be expressed as a scale mixture of an inverse gamma distribution.In our case, the dimension of is two, so that their hierarchical prior is defined as p( |a 1 , a 2 ) ∼ IW(ν + 1, H * ) , where H * = 2νdiag(1/a 1 , 1/a 2 ) , a j ∼ Inverse-Gamma 1/2, 1/A 2 j for j = 1, 2 , ν > 0 , and A j > 0 is typically assigned a large value (e.g. 10 5 ) to indicate non-informativeness.They also showed that the marginal distribution of the correlation coefficient ρ is uniform on (−1, 1) for bivariate cases when ν = 2 .Compared to the IW prior, the HIW prior provides increased flexibility in the choice of the scaled matrix while retaining the conjugacy properties.In our model, the conditional posterior distribution of and a j for j = 1, 2 now become where −1 jj denotes the j, j entry of −1 , and we set ν = 2 .However, Alvarez, Niemi and Simpson 46 pointed out that, compared to the IW prior, the HIW prior is capable of reducing, but not eliminating, the dependency between variance and correlation. 45introduced a prior class known as the separation strategy (SS) that decomposes a covariance matrix into a diagonal matrix S of standard deviations (SDs) and a correlation matrix R , resulting in = SRS .Specifically, for bivariate data, S = diag(σ 1 , σ 2 ) and R = ρ 1 1 ρ .The SS prior assigns independent priors for the SDs and correlations, which eliminates the association between variance and correlation, setting it apart from the IW and HIW priors.Posterior computation with the SS prior is usually done via the Hamiltonian Monte Carlo (HMC) algorithm 50 , which was later improved by the No-U-Turn sampler 51 in Stan.In the Stan manual 37 , the recommended hyperprior settings for the SS prior are σ j ∼ Cauchy(0, 2.5) constrained by σ j > 0 for j = 1, 2 and R ∼ LKJCorr(1) , where LKJCorr(1) denotes the LKJ prior from Lewandowski et al. 52 with a shape parameter of 1.However, implementing the specific SS prior still requires intensive posterior computation.On the other hand, the IND prior is the simplest among the SS class, which assigns independent priors on σ 2 1 , σ 2 2 , ρ for the bivariate case, where σ 2 j ∼ IG(0.01, 0.01) for j = 1, 2 and ρ ∼ U(−1, 1) to reflect our lack of information about these terms.The posterior computation involved in the IND prior is much less compared to the SS prior suggested in the Stan manual and can be done via a Gibbs sampler.Thus, throughout our simulation and real data analyses, the IND prior was used for this SS class for computational efficiency. 44developed a scaled inverse Wishart (SIW) prior that decomposes a covariance matrix differently such that = Q , where for the bivariate case, Q is an two dimensional unscaled matrix with the i, j element Q ij and � = diag(δ 1 , δ 2 ) .The SIW prior is defined as Q ∼ IW(ν, H) and log δ j ind ∼ N b j , ζ 2 j , which implies that standard deviation σ j = δ j Q jj for j = 1, 2 and ρ =

Scaled inverse Wishart prior O'Malley and Zaslavsky
. Compared to the SS prior, the SIW prior avoids problematic transformation steps, yielding a more efficient sampling process.Following the specifications in Gelman and Hill 53 and Akinc and Vandebroek 47 , we set H = I , ν = 3,b j = 0 and ζ 2 j = 1 for j = 1, 2.

Simulation
We conducted two simulation studies focusing on meta-analysis of rare binary events: the first is to compare the performance of our Bayesian model under various covariance prior choices on the estimation of key model parameters in terms of bias, mean squared error (MSE) and coverage; the second is to assess the performance of our proposed IPQ method in comparison to existing GOF tests in terms of Type I error rates and statistical power.Our code is publicly available at https:// github.com/ chris zhangm/ MetaG OF.In our numerical experiments, we used the default continuity correction factor of 0.5 for all frequentist methods unless otherwise stated.On the other hand, we did not adopt any continuity correction or eliminate studies containing zero events for Bayesian methods since they handle such studies automatically via incorporation of prior information into data analysis.
In Figure 2, we report results based on 200 replicates for each setting with I = 20 , in which the three rows give bias, MSE and coverage results, and the three columns correspond to θ 0 , τ 2 and ρ , respectively.We observe that the performance of IPQ in estimating θ 0 seems to be insensitive to the choices of different priors.For τ 2 and ρ , the IND prior generally outperforms other choices as it produces smaller bias and MSE as well as coverage closer to the nominal level 95% in most scenarios.Among the other three priors, although IW tends to do better in point estimation, it gives the worst coverage for both τ 2 and ρ especially when τ 2 is small or |ρ| is close to 1.This is not surprising, as mentioned before, the IW prior induces dependency between variance and correlation, which can bias the inference.For I = 50 or 80 (results omitted here for brevity), while the discrepancies of the bias and coverage results using different priors become less salient, it is worth noting that IW still yields unsatisfactory coverage.
For GOF testing using the proposed IPQ method in this paper, the IND prior was used due to its demonstrated better performance and its simplicity.

Performance evaluation of GOF testing
Our interest lies in conducting the GOF test to detect departures from a common assumption in meta-analysis that the true effect sizes θ i s of component studies are normally distributed.Using the generalized REM, for null cases, we generated φ i1 from normal distributions; for non-null cases, we generated φ i1 from one of four pre- specified non-normal distributions: (i) an exponential distribution with a rate of 1; (ii) a gamma distribution with a shape parameter of 4 and a scale parameter of 0.5, a unimodal right-skewed distribution; (iii) a mixture of two equal-weighted normal distributions: N(1, 0.5) and N(4, 0.5) ; (iv) a t distribution with number of degrees of freedom of 4.Then, with z ∼ N(0, 1) , φ i2 given φ i1 was generated by φ i2 = ρ σ 2 σ 1 φ i1 + σ 2 1 − ρ 2 z + µ 2 − ρ σ 2 σ 1 µ 1 so that the correlation coefficient between (φ i1 , φ i2 ) is ρ and the mean and variance of φ ij are µ j and σ 2 j for j = 1, 2 , respectively.In other words, the conditional distribution of φ i2 given φ i1 is set to be normal in our simulation.Note that the four distributions of φ i1 cover different types of violation of the normality assumption, of which the first two cover skewness, the next covers multimodality, and the last covers heavy-tailedness.Generating the data in the above way would pass these types of violation to the distribution of θ i 's.
We compared our proposed method IPQ to six other approaches, including three frequentist-based approaches: the Naïve method that conducts the Shapiro-Wilk test on estimated effect sizes (log odds ratios) directly, the parametric bootstrap method (PB 8 ) and the standardization method (STD 9 ), and three Bayesian methods: the pivotal quantities method (PQ 10,11 ) using two cutoffs of 0.25 and 0.1, the posterior predictive check (PPC 15 ) using the discrepancy function recommended in Sinharay and Stern 54 , defined as , and the sampled posterior check (SPC 21, 22 ) using the same discrepancy function.We set the significance level α = 0.05 .For all frequentist approaches (PB, STD and Naive) and the proposed IPQ, we reject the normality assumption if the p value is less than 0.05.For PPC or SPC, we reject the null when the posterior predictive p value (PPP) is below 0.025 or above 0.975 22 ; for PQ, if the minimum p value upper bound p min is less than the chosen cutoff (0.25 or 0.1), we reject the null.As mentioned earlier, for either PPC or PQ, the reference distribution of PPP or p min is not uniform(0,1) even in an asymptotic sense, and so we do not expect that they maintain the Type I error rate.However, the cutoff value 0.1 for p min in the PQ method was chosen via preliminary simulation because it can offer error rates much closer to 0.05 in most of the simulation scenarios, compared to the rule-of-thumb value of 0.25.We simulated 200 replicates for each setting, and reported Type I error rates for data from null cases and statistical power otherwise.
Figure 3 reports the Type I error rates for all methods in various settings.The IPQ, STD, and SPC methods demonstrate superior performance, as they maintain an error rate close to the nominal value of 0.05 regardless of {I, µ, ρ} .Conversely, PQ-0.25 (the PQ method with the recommended cutoff of 0.25 11 ) frequently produces severely inflated Type I error rates, particularly as the event of interest becomes rarer, while PQ-0.1 performs much better in general.Therefore, we exclude PQ-0.25 from our power results below.Among the remaining three methods, PPC and PB are often conservative for rarer events (i.e., µ = −5 ), exhibiting Type I error rates below 0.05, while Naive tends to have inflated rates for less rare events.
Figures 4, 5, 6, 7 display power results with different underlying distributions of φ i1 .We observe that all approaches tend to report higher power as I increases or ρ decreases.Also, the differences in power among the methods become smaller as µ goes up.This is perhaps because some methods in the bottom group such as PB improve significantly while the proposed IPQ, as the best overall method, appears to be much less sensitive to the change of µ .Figures 4 and 5 present power results for skewed distributions (i.e., exponential and gamma distributions).IPQ is the best in nearly all scenarios, followed by PQ-0.1 and STD.Naive often stands somewhere in the middle among all.PB tends to perform poorly, except for larger µ and smaller ρ .SPC reports slightly better results than PPC, while both methods provide the worst overall results, particularly with large ρ .Figure 6 presents outcomes for a multimodal distribution (i.e., normal mixture).IPQ is a clear winner and provides the highest power in all settings.Among the others, STD and PQ-0.1 usually perform better, followed by Naïve and PB.SPC and PPC consistently give the worst results that show almost no power.Figure 7 displays power results for t 4 , a symmetric and heavy-tailed distribution, where again, IPQ outperforms other methods while PPC and SPC tend to perform the worst.
To summarize, the proposed IPQ maintains the Type I error rate at the target level well and offers the highest statistical power for various departures from the normality assumption compared to alternative approaches.

Data examples
We applied our IPQ method, along with six other methods (PB, STD, Naïve, PPC, SPC and PQ-0.1) to three real data sets of meta-analysis, for testing the normality assumption about the distribution of true effect sizes across component studies.The first involves hand-eye dominance data, the second involves diabetes data, and the third involves lung cancer data.Bourassa 55 conducted a meta-analysis of 54 studies to investigate the hand-eye dominance association (see Table A1 for detailed data in Supplementary Material).The study found that the hand-eye concordance was larger than one, indicating left-handed people tended to have left-eyed dominance, and the same was true for righthanded people.The meta-analysis included 54,087 subjects, summarized in 2 × 2 tables of four categories: left- handed/left-eyed, left-handed/right-eyed, right-handed/left-eyed and right-handed/right-eyed.We considered the event of interest to be " left-handed, " with the control and case groups being " left-eyed" and " right-eyed, " respectively.The overall incident rates for the control and case groups are about 6% and 18.5% , which are −2.75 and −1.48 on a logit scale equivalently.The left panel of Figure 8 displays the histogram, density and quantile- quantile plots of the observed log odds ratio, revealing a left-skewed distribution.
Bellamy et al. 56 conducted a meta-analysis of 20 studies to investigate the association between Type 2 diabetes mellitus and gestational diabetes (see Table A2 for detailed data in Supplementary Material).The analysis revealed that women with gestational diabetes had an increased risk of developing type 2 diabetes.The study included 675,455 subjects, of which 31,867 had Type 2 diabetes.Among the control groups (no gestational diabetes), 6,862 subjects had Type 2 diabetes, indicating an overall incident rate of ~ 1.1% (or −4.53 on a logit scale).For the case groups (with gestational diabetes), 3997 of them had Type 2 diabetes, resulting in an incident rate of ~ 12.5% (or −1.94 on a logit scale).The middle panel of Figure 8 shows the histogram, density, and quantile-quantile plots of the observed log odds ratio, suggesting a unimodal, symmetric, and bell-shaped curve.
Feng et al. 57 conducted a meta-analysis of 44 studies to evaluate the association between GSTP1 gene polymorphism and the risk of lung cancer (see Table A3 for detailed data in Supplementary Material).The event of interest is considered the GG genotype of GSTP1.The study included 26,516 subjects, of which 2763 had the GG genotype.Among the control (no lung cancer) and case (lung cancer) groups, 1406 and 1357 subjects had the GG genotype, implying overall incident rates of 10.0 (or −2.19 on a logit scale) and 10.8 (or −2.11 on a logit Table 1 shows that for the hand-eye dominance data, at the significance level α = 0.05 , all methods except for STD reject the null hypothesis, indicating a departure from the assumed normality.Note that among the existing methods, STD was quite competitive.Nevertheless, it failed in this specific example.On the other hand, PPC and SPC tend to be conservative in rejecting the null but worked here.For the diabetes data, Table 1 shows that all methods have the same conclusion: there is no evidence against the normality.
For the GSTP1 gene polymorphism and lung cancer data, Table 1 shows that Naïve, PQ-0.1 and IPQ reject the null hypothesis while other methods do not provide evidence against normality.As shown in Figure 7, given the symmetric and heavy-tailed distribution, IPQ offered the highest power across all the cases.When µ = −2 , similar to the overall incidence rates in this dataset, Naïve and PQ-0.1 performed relatively well.However, STD, PPC, and SPC performed poorly under the cases of µ = −2 .Therefore, we recommend avoiding the normality assumption in this example.
In summary, IPQ performs consistently well across all three real data examples, while PB, STD, PPC, and SPC sometimes fail.Although Naïve and PQ-0.1 also demonstrate good performance here, they are less satisfactory in our simulation studies.As such, IPQ is the recommended method of choice in this context.

Discussion
Meta-analysis commonly assumes that actual effect sizes from component studies follow a normal distribution for mathematical convenience, despite a lack of formal justification for this assumption.In practice, however, this assumption can be frequently violated, potentially leading to inaccurate conclusions.To address this issue, we propose a novel goodness-of-fit (GOF) test called Improved Pivotal Quantities (IPQ) for testing this assumption in the context of meta-analysis of rare binary outcomes, where the effect size is measured by log odds ratio.
The proposed IPQ method builds upon the strengths of the original PQ approach 10 , which is conceptually simple and efficient in detecting model misfit at any level of a hierarchical model without additional computational costs.However, the original PQ method employs the probability bound as a criterion to determine model misfit, which can result in inflated Type I error rates when used with the rule-of-thumb cutoff of 0.25.This highlights the need for selecting new cutoff values that are tailored to different applications.To address this limitation, our IPQ method improves the decision-making process of PQ by adopting the Cauchy combination idea 12 to account for dependent p value.In addition, given sparse data such as tables with zero events, IPQ naturally incorporates all data, without requiring artificial corrections due to its Bayesian model formulation.
In fact, IPQ is a hybrid approach.It adopts the frequentist framework for hypothesis testing, since it uses the Cauchy combination test to obtain a p value, from which the final conclusion is drawn.On the other hand, it constructs the test statistics by incorporating a Bayesian idea through Markov Chain Monte Carlo methods.We further note that, because of the use of pivotal quantities, the sampling distribution of the proposed test statistics, evaluated at posterior samples, is known and invariant (i.e., N(0,1)) under the null hypothesis.The set of posterior draws used to construct the test statistics is from the same data (i.e., the true observed data rather than any "fake" data).
Simulation results indicate that IPQ maintains well-controlled Type I error rates while achieving higher statistical power than other approaches in most scenarios.To demonstrate the effectiveness of our method, we provide examples of three real datasets.Specifically, our results suggest that the normality assumption should be avoided for the hand-eye dominance dataset 55 and the GSTP1 gene polymorphism and lung cancer dataset 57 , while it is likely to hold for the diabetes dataset 56 .In situations where the normality assumption does not hold, it becomes imperative to explore alternative distributions, such as those characterized by heavy tails (e.g., t distributions) or skewness (e.g., gamma distributions), in order to more accurately capture the characteristics of observed data.Alternatively, one can employ nonparametric methods for estimating treatment effects 58 and for estimating heterogeneity [59][60][61] .Furthermore, in scenarios where a meta-analysis involves a small number of studies, a situation commonly encountered in practice, alternative frameworks such as Bayesian model averaging may yield more reliable outcomes.
Although our focus is primarily on rare binary events, the IPQ method is directly applicable to meta-analysis of any binary data.However, we believe that the gain in performance for common binary events may not be as significant as that for rare binary events.As demonstrated in our simulation studies, the differences in power between our method and other approaches diminish when increasing the background incidence rate.Moreover, IPQ can be extended beyond testing normality to other scenarios where an appropriate test statistic can be designed to measure the discrepancy.In conclusion, our IPQ method is useful for detecting model misfits and selecting appropriate statistical models for different applications, particularly in scenarios where sparse data are present or when the normality assumption is in question.Table 1.P values of the GOF tests for three meta-analyses involving (i) hand-eye dominance data, (ii) type 2 diabetes mellitus and gestational diabetes data, and (iii) GSTP1 gene polymorphism and lung cancer data.Note that for PPC or SPC, the posterior predictive p value is reported and for PQ-0.1, the minimum p value upper bound p min is reported.

Figure 1 .
Figure 1.Schematic diagrams of different model diagnostic methods.

Figure 2 .
Figure 2. Comparison of bias, MSE and coverage results for Bayesian estimates of θ 0 , τ 2 and ρ using the Inverse-Wishart prior (IW), the Hierarchical Inverse-Wishart prior (HIW), the Scaled Inverse-Wishart prior (SIW) and the independent prior (IND) for meta-analysis of rare binary events with I = 20 studies.For each setting, results were generated using 200 replicates and the nominal coverage level was set as 95%.

Figure 8 .
Figure 8.The histogram and density plots (top) and Quantile-Quantile plot (bottom) of the observed effect sizes measured by log odds ratio (lnOR).The left panel is for hand-eye dominance data, the middle panel is for diabetes data and the right panel is for lung cancer data.