Editorial

The American Journal of Gastroenterology (2005) 100, 1233–1236; doi:10.1111/j.1572-0241.2005.50107.x

Metaanalyses Are Observational Studies: How Lack of Randomization Impacts Analysis

Eloise E Kaizar MS1

1Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania

Correspondence: Eloise Kaizar, MS, Department of Statistics, 132 Baker Hall, Carnegie Mellon University, Pittsburgh, PA 15213

Received 22 January 2005; Revised  0000; Accepted 7 March 2005.

Abstract

Although metaanalyses are often syntheses of many experiments, they are themselves observational studies. As such, when performing or reading metaanalyses, we must consider the sources of bias that we usually expect in an observational study. The two main sources of bias for metaanalysis are publication bias and systematic heterogeneity. I consider the nature of both of these sources, methods to detect them, and ways to correct for the resulting bias.

An increased focus on evidence-based medicine and the several advantages that metaanalysis offers over clinical trials have caused metaanalyses to become more popular among researchers. Unfortunately, while most researchers are proficient at analyzing clinical trials, many do not understand how clinical trial methods differ from those required for metaanalysis.

The main difference between metaanalyses and clinical trials lies in randomization. Typical clinical trials have either a parallel group or a cross-over design. In parallel group designs, each subject is randomly assigned to treatment or control. This random assignment eliminates hidden subject differences as possible causes of observed outcomes. In a cross-over design, over the course of the experiment, each subject will be studied under both treatment and control conditions. Under cross-over designs, potentially confounding individual characteristics are held constant, thus eliminating them as possible causes of any observed outcomes.

As in a cross-over design, many study-level variables are held constant for all of a single experiment's subjects (regardless of the design used). For example, all the subjects are recruited by the same process and they all participate in the same experimental design. Once we collect many experiments for a metaanalysis, the study-level homogeneity is lost, and yet we do not replicate a randomization (as in a parallel group design) that eliminates the possibility that hidden study-level factors may cause (or influence) the observed outcomes and thus bias the results. Here, I discuss the two main ways study-level factors influence metaanalysis, publication bias and systematic heterogeneity, using as an example the Cremonini, et al. metaanalysis reported in this issue of The American Journal of Gastroenterology (1). For further information about other sorts of bias, model-fitting issues, and other statistical features of metaanalysis, I recommend The Handbook of Research Synthesis, edited by Cooper and Hedges (2).

PUBLICATION BIAS

It is commonly known that studies that are large or have statistically significant results are more likely to be published than those that are small or have insignificant results (3). This phenomenon is generally referred to as publication bias (see Note 1). Although placing such importance on p-values is increasingly being recognized by both publishers and authors as poor science (4), biased publishing persists. I note that statistical significance is not the only reason to withhold publication, but for these comments, I consider other reasons to be negligible.

If publication bias is present but not detected, the estimated overall effect will likely be too large (due to the exclusion of the small effects) and the investigator will be overly confident in the biased value (due to the smaller spread of the effects, and thus smaller variance). The most common method for detection is to use a funnel plot (5). In the absence of publication bias, a plot of variation versus outcome measure should produce a funnel shape (3). Variation is usually measured with the square root of the precision (the inverse of the standard error), but the sample size may also be used. The funnel shape is based on statistical theory. Most outcome measures are approximately distributed according to a bell shape with mean equal to the true outcome and variance inversely proportional to the sample size. Because of this inverse relationship, smaller studies have a larger variance and are therefore more likely to fall further from the true outcome than larger studies with smaller variance, producing a funnel shape.
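The variance argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true effect of 0.2, the per-subject standard deviation of 1, the study sizes, and the crude "publish only if z > 1.96" filter are all hypothetical choices made for illustration, not values taken from Cremonini, et al.

```python
import random
import statistics

def simulate_studies(true_effect, n_per_study, n_studies, sigma=1.0, seed=0):
    """Return (effect, standard_error) pairs.

    Each estimated effect is drawn from a bell-shaped (normal) distribution
    centered on the true effect, with standard error sigma / sqrt(n).
    """
    rng = random.Random(seed)
    se = sigma / n_per_study ** 0.5
    return [(rng.gauss(true_effect, se), se) for _ in range(n_studies)]

TRUE_EFFECT = 0.2  # hypothetical true outcome

small = simulate_studies(TRUE_EFFECT, n_per_study=25, n_studies=2000, seed=1)
large = simulate_studies(TRUE_EFFECT, n_per_study=400, n_studies=2000, seed=2)

# Wide mouth of the funnel (small studies) versus narrow neck (large studies).
spread_small = statistics.stdev([effect for effect, _ in small])
spread_large = statistics.stdev([effect for effect, _ in large])

# Hypothetical publication filter: a small study is "published" only if its
# result is statistically significant in the positive direction.
published_small = [effect for effect, se in small if effect / se > 1.96]
mean_published_small = statistics.mean(published_small)

print("spread of small studies:", round(spread_small, 3))
print("spread of large studies:", round(spread_large, 3))
print("mean of published small studies:", round(mean_published_small, 3))
```

Under these assumptions the small studies scatter more widely than the large ones, and the small studies that survive the filter average well above the true effect, which is the overestimation described above.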

The funnel shape is shown in Figure 4B of the Cremonini, et al. paper, in which the studies with larger sample size (or larger precision) fall close to the true outcome, and those with smaller sample size (or smaller precision) fall in a range further from the true outcome. As Cremonini, et al. note, their data seem to have fallen victim to publication bias, since the funnel plot of their data (Figure 4A) seems to be missing the right-hand portion of the bell of the funnel, where the confidence intervals are more likely to include a relative risk of one (no effect). If the missing right-hand smaller studies had been included in their metaanalysis, the overall effect would have been estimated closer to a relative risk of one (no effect), and their confidence interval for this effect would have been larger; thus, if the researchers had been able to overcome the publication bias for their studies, they may have found no significant improvement due to proton pump inhibitors.

In addition to this visual method for detecting publication bias, several tests are available. One simple method, used by Cremonini, et al., is to turn the funnel on its side and perform a regression with the precision used as the response variable and the effect size used as the predictor variable. If the funnel shape is present, we would expect an approximately horizontal line through the center of the funnel. However, if there is publication bias, we would expect the regression to result in a sloped line, indicating a significant association between precision and effect size. Another popular test is Begg's rank correlation test (6).
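A minimal sketch of this regression, following the description above (precision as the response, effect size as the predictor), might look like the following. The data are hypothetical log relative risks and precisions invented for illustration, and ordinary least squares is my assumption about the fitting method; the exact procedure used by Cremonini, et al. may differ.

```python
def ols_slope(x, y):
    """Ordinary least squares fit of y on x.

    Returns (slope, intercept, standard error of the slope, t statistic).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    residual_variance = sum(r * r for r in residuals) / (n - 2)
    slope_se = (residual_variance / sxx) ** 0.5
    return slope, intercept, slope_se, slope / slope_se

# Hypothetical symmetric funnel: at each precision, effects are balanced
# around a common center of -0.3 on the log relative risk scale.
effects_sym = [-0.9, 0.3, -0.6, 0.0, -0.4, -0.2, -0.3]
precision_sym = [2, 2, 4, 4, 8, 8, 16]

# Hypothetical asymmetric funnel: the two imprecise studies nearest
# "no effect" (log relative risk of 0) are missing.
effects_asym = [-0.9, -0.6, -0.4, -0.2, -0.3]
precision_asym = [2, 4, 8, 8, 16]

slope_sym, _, _, t_sym = ols_slope(effects_sym, precision_sym)
slope_asym, _, _, t_asym = ols_slope(effects_asym, precision_asym)

print("symmetric funnel slope:", round(slope_sym, 6))
print("asymmetric funnel slope:", round(slope_asym, 3))
```

To turn the t statistic into a test, it would be compared with a t distribution on k - 2 degrees of freedom, where k is the number of studies; with so few studies such a test has little power.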

The best way to address publication bias is to prevent it by using all available resources to collect both published and unpublished studies; much has been written about this process (7). But, if you have performed an exhaustive search, and still find evidence of publication bias, what can you do?

Two broad approaches have been proposed to estimate the effects of the unobserved studies and the true effect size based on the observed studies. The first method is to model the publication bias process. If the model is correct, this method would be quite effective, but it can be quite complex and has been criticized for its subjective nature (8). The second approach uses the symmetry of the expected funnel and is called "trim and fill" (9). The "trim" portion involves removing the studies that make up the observed rim of the funnel until the symmetry has been restored, and the location of the center of the funnel can be determined. The "fill" portion restores all the observed studies, and then uses the funnel shape to simulate possible unobserved studies. Unfortunately, to be effective, both of these methods require a large number of studies, and so Cremonini, et al. could not use either to estimate the unbiased effect size of proton pump inhibitors.

HETEROGENEITY

Heterogeneity exists when "a given collection [of studies] is more heterogeneous than would be expected on the basis of sampling variation alone" (2). It implies that each of the studies is measuring an effect that is different from the others, where the variance comes from different study designs or populations, in addition to the usual sampling variance. Ignoring heterogeneity results in an underestimate of the variance and the size of the confidence interval, usually inflating the apparent significance of any difference between the treatment and the control well beyond what the data support. Several tests for heterogeneity exist; all have notoriously weak power, but the best seems to be the test proposed by Higgins and Thompson (10). If we assume the heterogeneity is random, we can correct for the heterogeneity by using random-effects rather than fixed-effects methods, as done by Cremonini, et al.
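The difference between fixed-effects and random-effects pooling can be sketched as follows. The sketch computes Cochran's Q, the I-squared statistic associated with Higgins and Thompson (10), and a random-effects estimate using the DerSimonian-Laird moment estimator of the between-study variance. The choice of DerSimonian-Laird is my assumption for illustration, since the text above does not say which random-effects estimator was used, and the effects and variances are hypothetical.

```python
def pool(effects, variances):
    """Inverse-variance fixed-effects and DerSimonian-Laird random-effects pooling."""
    k = len(effects)
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total_w

    # Cochran's Q: weighted squared deviations from the fixed-effects mean.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = k - 1

    # I^2: share of total variation attributed to heterogeneity (floored at 0).
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0

    # DerSimonian-Laird moment estimate of the between-study variance tau^2.
    c = total_w - sum(w * w for w in weights) / total_w
    tau2 = max(0.0, (q - df) / c)

    re_weights = [1.0 / (v + tau2) for v in variances]
    total_re_w = sum(re_weights)
    random_effect = sum(w * y for w, y in zip(re_weights, effects)) / total_re_w

    return {
        "fixed": fixed,
        "se_fixed": (1.0 / total_w) ** 0.5,
        "q": q,
        "i2": i2,
        "tau2": tau2,
        "random": random_effect,
        "se_random": (1.0 / total_re_w) ** 0.5,
    }

# Hypothetical log relative risks and their sampling variances.
result = pool([-0.9, -0.1, -0.7, 0.1, -0.5], [0.04, 0.05, 0.06, 0.04, 0.05])

# Identical effects: no heterogeneity, so both models agree.
homogeneous = pool([-0.4, -0.4, -0.4], [0.04, 0.05, 0.06])

for name, value in result.items():
    print(name, round(value, 4))
```

When the estimated between-study variance is positive, the random-effects standard error is larger than the fixed-effects one, which is the widening of the confidence interval that ignoring heterogeneity would hide.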

However, heterogeneity is often not random. The observational nature of metaanalysis opens the door to systematic effects. Systematic effects are study-level variables, such as lead investigator and study design, that may systematically affect the outcome of each study. For example, Cremonini, et al. analyzed seven studies; four used the proton pump inhibitor omeprazole, two used lansoprazole, and one used rabeprazole (1). If the different drugs are not equally effective, then not adjusting for the type of proton pump inhibitor will bias the results. This possibility is displayed in Figure 1, where the effects of lansoprazole do not appear to be significantly different from placebo, while those of omeprazole and rabeprazole do seem to have a significant positive effect.

Figure 1. Cremonini, et al. Figure 2, separated by drug.

Metaregression is often used to identify and correct systematic effects (11). Cremonini, et al. are correct in noting that with only seven studies, a metaregression may be underpowered, but potential sources of heterogeneity should still be explored. In fact, a standard metaregression of these data shows that lansoprazole is indeed significantly less effective than the other two drugs (p < 0.001), but the support for its less effective status decreases when a less popular, but improved, test is used (p = 0.10) (12). These analyses were performed using the Stata software package (StataCorp, College Station TX).

Hierarchical models may also be employed in identifying systematic effects, and may more appropriately model all of the variation, but hierarchical models appropriate for all outcome measures have not been developed or implemented in readily-available software.

While the estimated sensitivity and specificity do not seem to differ according to drug (Figure 2), we should be especially concerned with heterogeneity in enrollment criteria when pooling measures of diagnostic value. If patients likely to respond to PPIs (those with baseline symptoms suggestive of GER) are enrolled more or less actively than those who are not likely to respond, the pooled sensitivity and specificity will be biased (13). The size of this type of bias, commonly referred to as verification or spectrum bias, depends on the magnitude of the difference in enrollment; Knottnerus suggests several methods for bias correction (13). I am especially concerned with the Chambers study, which seems to be particularly different from the others with only 17% of the patients showing baseline characteristics suggestive of GER (Cremonini, et al. Table 2).

Figure 2. Cremonini, et al. Figure 6, separated by drug.

A NOTE ON CROSS-OVER STUDIES AND METAANALYSIS

Cremonini, et al.'s metaanalysis includes both cross-over and parallel studies, as is quite appropriate, especially if study design is considered as a potential source of heterogeneity. However, one must be careful to appropriately estimate the measure of effect (here relative risk) and its variance, since in a cross-over trial a subject's responses to treatment and control are correlated. Cremonini, et al. are correct to note that ignoring this correlation may cause some of their results to be erroneous.

In cross-over designs (a special case of a matched pair design), the variance of the relative risk is calculated differently than in parallel designs (14); unfortunately, the cross-over variance is not automatically calculated in popular metaanalysis packages such as Stata (StataCorp, College Station TX). For the correct calculations, we must know how many patients responded positively to both treatment and control, how many responded negatively to both, how many responded positively to treatment but negatively to control, and how many responded negatively to treatment but positively to control. Unfortunately, these data are not consistently reported. Just as CONSORT has standardized reporting for parallel trials (15), standards for reporting matched pair studies must be established so that these studies can be properly incorporated into metaanalysis. For Cremonini, et al.'s metaanalysis, only two of the five cross-over studies reported enough information to correctly calculate estimates and standard errors.
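The role of the four counts can be seen in a short sketch. It uses the standard large-sample (delta-method) variance of the log relative risk for matched pairs and compares it with the variance that would be obtained by wrongly treating the two arms as independent groups. I am assuming this is the kind of calculation intended by reference 14; the function name and the counts below are hypothetical.

```python
import math

def paired_log_relative_risk(both_pos, treat_only, control_only, both_neg):
    """Log relative risk and its variance for a matched-pair (cross-over) table.

    both_pos:     responded positively to both treatment and control
    treat_only:   responded positively to treatment, negatively to control
    control_only: responded negatively to treatment, positively to control
    both_neg:     responded negatively to both
    """
    n = both_pos + treat_only + control_only + both_neg
    treat_pos = both_pos + treat_only      # positive responses under treatment
    control_pos = both_pos + control_only  # positive responses under control

    log_rr = math.log(treat_pos / control_pos)

    # Matched-pair variance: only the discordant pairs appear in the numerator.
    var_paired = (treat_only + control_only) / (treat_pos * control_pos)

    # Variance if the same margins were wrongly treated as independent groups.
    var_unpaired = 1 / treat_pos - 1 / n + 1 / control_pos - 1 / n

    return log_rr, var_paired, var_unpaired

# Hypothetical cross-over study of 40 patients.
log_rr, var_paired, var_unpaired = paired_log_relative_risk(10, 15, 5, 10)

print("relative risk:", round(math.exp(log_rr), 3))
print("paired variance of log RR:", round(var_paired, 5))
print("unpaired variance of log RR:", round(var_unpaired, 5))
```

With these counts the within-subject responses are positively correlated, so the paired variance is the smaller of the two. The marginal totals alone cannot distinguish the two cases, which is why the full set of four counts must be reported.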

CONCLUSIONS

As metaanalysts, we must be concerned about the repercussions of using a nonrandom sample of studies to draw an overall conclusion. We must be tireless in our search for all appropriate published and unpublished studies, and use corrective methods if publication bias persists. In addition, we must use random-effects models if heterogeneity is present, and make concerted efforts to discover and correct for the systematic effects.

However, under the weight of all these cautions, we must not lose sight of the usefulness of metaanalysis. Done correctly, metaanalysis allows us to draw clinically relevant conclusions where a mega-trial is impractical. Metaanalyses can be complicated to analyze, but we should not become daunted or forget their irreplaceable role in medical exploration and discovery.

I applaud Cremonini, et al. for their careful consideration of the role of publication bias and heterogeneity in their meta-analysis, but strongly agree with their conclusion that more data are needed. Their estimate of PPI efficacy is most likely to be too high (due to publication bias), and there is some evidence that PPIs are not uniformly effective. More data would shed light on the overall efficacy of PPIs, and would be especially helpful in discovering differences in efficacy across drugs. Further, their conclusions regarding diagnostic utility require qualification, as their pooled estimates of sensitivity and specificity are based on specialized patient populations, and therefore, may not be applicable in general practice.

Notes

1 Publication bias is often the culprit in "small study effects," in which small studies tend to report more significant findings than their larger counterparts.

References

  1. Cremonini, F, Wise, J, Moayyedi, P, et al. Diagnostic and therapeutic use of proton pump inhibitors in non-cardiac chest pain: A metaanalysis. Am J Gastroenterol 2005;100: 1226–1232.
  2. Cooper, H, Hedges, LV, eds. The handbook of research synthesis. Russell Sage Foundation, 1994.
  3. Light, RJ, Pillemer, DB. Summing Up: The Science of Reviewing Research. Harvard University Press, 1984.
  4. Connor, JT. The value of a p-valueless paper. Am J Gastroenterol 2004;99: 1638–1640.
  5. Macaskill, P, Walter, SD, Irwig, L. A comparison of methods to detect publication bias in meta-analysis. Stat Med 2001;20: 641–654.
  6. Begg, CB, Mazumdar, M. Operating characteristics of a rank correlation test for publication bias. Biometrics 1994;50: 1088–1101.
  7. Section 5: Locating and selecting studies. In: Alderson, P, Green, S, Higgins, JPT, eds. Cochrane reviewers handbook 4.2.2 [updated March 2004], Issue 1. John Wiley & Sons, Ltd., 2004.
  8. Begg, CB. Publication bias. In: Cooper, H, Hedges, LV, eds. The handbook of research synthesis. Russell Sage Foundation, 1994.
  9. Duval, S, Tweedie, R. Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics 2000;56: 455–463.
  10. Higgins, JPT, Thompson, SG. Quantifying heterogeneity in a meta-analysis. Stat Med 2002;21: 1539–1558.
  11. Thompson, SG, Sharp, SJ. Explaining heterogeneity in meta-analysis: A comparison of methods. Stat Med 1999;18: 2693–2708.
  12. Knapp, G, Hartung, J. Improved tests for a random effects meta-regression with a single covariate. Stat Med 2003;22: 2693–2710.
  13. Knottnerus, JA. The effects of disease verification and referral on the relationship between symptoms and diseases. Med Decis Making 1987;7: 139–148.
  14. Lachin, JM. Biostatistical methods: The assessment of relative risks. John Wiley and Sons, 2000.
  15. Moher, D, Schulz, KF, Altman, DG for the CONSORT Group. The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomized trials. Lancet 2001;357: 1191–1194.

Acknowledgements

The author thanks Dr. Joel Greenhouse, Jason Connor, Cari Kaufman, and Kary Myers for their helpful comments.
