INTRODUCTION

Efficacy of antidepressant drugs for treatment of acute, unipolar, major depressive episodes continues to be a research and clinical topic of considerable interest. There have been major changes in clinical practice involving antidepressants since the early 1990s (Healy, 1997; Baldessarini, 2005; Ghaemi, 2008). Currently favored drugs include the serotonin reuptake inhibitors (SRIs) introduced since the late 1980s, and a series of additional modern, ‘second-generation’ antidepressants. These include agents with mixed inhibitory actions on the neuronal-uptake and inactivation of serotonin and norepinephrine (SNRIs, including desvenlafaxine, duloxetine, milnacipran, venlafaxine, and others), and ‘atypical’ agents with other actions (such as bupropion, nefazodone, mirtazapine, and vilazodone). These modern or ‘second-generation’ antidepressants have largely displaced older antidepressants including tricyclics (TCAs) and monoamine oxidase (MAO) inhibitors (Baldessarini, 2005, 2012).

The superiority of most clinically employed antidepressants over placebos in controlled trials has been modest in adult patients diagnosed with major depression, even lower in juvenile depressed patients, and probably has declined in recent years (Walsh et al, 2002; Baldessarini, 2005; Cipriani et al, 2007; Papakostas et al, 2007; Gartlehner et al, 2008; Kirsch et al, 2008; Tsapakis et al, 2008; Bridge et al, 2009; Wooley et al, 2009; Masi et al, 2010; Pigott et al, 2010; Khin et al, 2011). Evident decline in superiority of drugs over placebos has occurred despite evidence of selective reporting of positive findings of potential commercial interest from therapeutic trials (Ioannidis, 2008; Turner et al, 2008).

Moreover, there is little evidence that one antidepressant or pharmacological class of antidepressants is clearly and convincingly more effective than others (Anderson, 2001; Baldessarini, 2005, 2012; Cipriani et al, 2007; Papakostas et al, 2007; Gartlehner et al, 2008; Kirsch et al, 2008; Khin et al, 2011). In part, this lack of clear differentiation may arise from the modest drug–placebo differences in many controlled trials of antidepressants, which, in turn, may reflect clinical heterogeneity arising from the current broad concept of ‘major depression’ (Healy, 1997; Ghaemi, 2008). Possible differentiation of efficacy among antidepressants was a lively question soon after introduction of the SRIs and SNRIs (Healy, 1997; Baldessarini, 2005, 2012). However, the popularity of most modern antidepressants owes far more to their perceived safety, relative ease of use, and broad clinical utility than to well-demonstrated superior efficacy in major depressive disorder compared with older agents (Baldessarini, 2005, 2012; Cipriani et al, 2007; Papakostas et al, 2007; Wooley et al, 2009; Pigott et al, 2010; Khin et al, 2011).

Given current uncertainties regarding the relative efficacy of specific drugs and pharmacological classes of antidepressants, we carried out a systematic, meta-analytic review of peer-reviewed, placebo-controlled trials, reported since 1980, limiting inclusion to drugs with regulatory approval for major depression that are currently employed clinically in the United States. Specific aims were to: (a) compare the efficacy of older and modern antidepressants relative to placebo; (b) further test the apparently widely held assumption that modern agents are at least equivalent in efficacy to older antidepressants (specifically TCAs); and (c) examine factors associated with anticipated declining differences in drug- vs placebo-associated responses in randomized, placebo-controlled trials of antidepressants.

MATERIALS AND METHODS

Search Strategy

We conducted a computerized literature search of the Medline, CINAHL, Cochrane Library, and PsycINFO databases with the following search-terms: ‘antidepressant, amitriptyline, amoxapine, bupropion, citalopram, clomipramine, desipramine, depression (or major depression), desmethylvenlafaxine, duloxetine, escitalopram, fluoxetine, imipramine, isocarboxazid, mirtazapine, maprotiline, monoamine oxidase (or MAO) inhibitors, nortriptyline, phenelzine, paroxetine, S-citalopram, selegiline, sertraline, tranylcypromine, trazodone, tricyclic antidepressants, trimipramine, and venlafaxine,’ alone and in various combinations. Also, reference lists of articles and reviews on antidepressant efficacy were hand-searched for relevant reports. The search was limited to peer-reviewed, published, randomized, placebo-controlled trials (RCTs) in acute episodes of adult major depressive disorder diagnosed by standardized criteria, and reported from 1980 through August 2011.

Eligibility Criteria

Included were reports of randomized, double-blind, placebo-controlled trials in adults in an acute, apparently unipolar, major depressive episode (or with ≤10% identified cases of bipolar depression or diagnoses other than major depression) based on DSM-III, III-R, or -IV, ICD-9 or -10, or RDC diagnostic criteria, and with at least 20 subjects per arm. We excluded trials of drugs that are not US FDA-approved and indicated in the United States for treatment of acute episodes of major depressive disorder, as well as reports involving special populations, such as juvenile or geriatric patients, treatment-resistant depression or depression associated with major neuromedical or other psychiatric disorders. Only monotherapy trials were included; antidepressant doses could be fixed or flexible, with or without supplemental sedative or hypnotic agents in low doses (below the approximate equivalent of 2 mg/day of lorazepam; Baldessarini, 2005). For 35 trials with three randomized treatment conditions involving two active agents and a placebo arm, we compared each drug–placebo pair separately; in some three-arm trials involving an experimental agent and a standard comparator, we considered only the marketed agent vs placebo. When an active agent was used in different doses in the same trial, we calculated mean doses and outcome measures, all considered as a single drug-arm. Total daily drug doses (mg/day) were converted to approximate imipramine-equivalents (IMI-eq), based on the median of the range of clinical doses recommended by the manufacturers as summarized elsewhere (Baldessarini et al, 2010), so as to permit comparisons of agents of dissimilar potency.
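The dose conversion described above might be sketched as follows. This is only one plausible reading of the procedure: it assumes that a dose is scaled by the ratio of the midpoint of the imipramine dose range to the midpoint of the test drug's range. The function name and both dose ranges are hypothetical placeholders, not values taken from the cited summary or from any product labeling.

```python
from statistics import median


def imi_equivalent(dose_mg, drug_range, imipramine_range):
    """Scale a daily dose (mg/day) to approximate imipramine-equivalents.

    Assumption: equivalence is defined by the midpoints (medians) of the
    recommended clinical dose ranges of the two drugs.
    """
    drug_mid = median(drug_range)
    imipramine_mid = median(imipramine_range)
    return dose_mg * imipramine_mid / drug_mid


# Hypothetical ranges (mg/day), for illustration only.
IMIPRAMINE_RANGE = (100.0, 200.0)
DRUG_X_RANGE = (20.0, 40.0)

# A 30 mg/day dose of the hypothetical drug X sits at the midpoint of its
# range, so it maps to the midpoint of the imipramine range.
print(imi_equivalent(30.0, DRUG_X_RANGE, IMIPRAMINE_RANGE))  # 150.0
```

Scaling by range midpoints puts agents of very different potency on one axis, at the cost of assuming that the recommended ranges are themselves comparable across drugs.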

Outcome Measures

The primary outcome measure was categorical ‘response,’ usually defined as ≥50% reduction in initial depression rating-scale scores. Most often, ratings were based on the Hamilton (HDRS) or Montgomery–Åsberg (MADRS) Depression Rating Scales (Hamilton, 1960; Montgomery and Åsberg, 1979), or Clinical Global Impression (CGI) ratings (Guy, 1976) when these measures were not available. Scores employed for analyses, including initial depression severity and its change by end-point, were standardized as the percentage of the maximum attainable score on each rating scale (eg, 48 for 17-item HDRS, 60 for the MADRS, and 61 for 21-item HDRS). When the number of items in the HDRS was not specified by the investigators, we considered it to be the most commonly employed 17-item version. When more than one depression rating scale was employed, we gave priority to results obtained with the HDRS for greater comparability. Continuous measures of change in depression ratings with drug vs placebo were considered as secondary measures, because lack of variance measures in most trials precluded formal meta-analysis. We considered factors that might influence outcomes, including numbers of subjects and collaborating sites, percentage women, initial depression ratings, IMI-eq daily drug doses, trial-duration, dropout rates, specific drugs and types, and year of reporting. As manufacturers of the drugs involved sponsored almost all trials, sources of support were not further considered.
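The score standardization and responder definition can be illustrated with a minimal sketch. The scale maxima are those stated in the text; the function names, the dictionary keys, and the example scores are hypothetical, and the responder threshold assumes the conventional criterion of at least a 50% reduction from baseline.

```python
# Maximum attainable scores as stated in the text.
SCALE_MAX = {"HDRS17": 48, "MADRS": 60, "HDRS21": 61}


def percent_of_max(score, scale="HDRS17"):
    """Express a raw rating as a percentage of the scale maximum."""
    return 100.0 * score / SCALE_MAX[scale]


def is_responder(baseline, endpoint, threshold=0.5):
    """True when the proportional reduction from baseline meets the threshold."""
    return (baseline - endpoint) / baseline >= threshold


# Hypothetical subject: baseline 24 and end-point 12 on the 17-item HDRS.
print(percent_of_max(24, "HDRS17"))  # 50.0
print(is_responder(24, 12))          # True
```

Expressing ratings as a percentage of the scale maximum allows severity on different scales to be compared on one 0 to 100 axis.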

Data Analysis

Averaged data are means with SD, unless stated otherwise. Meta-analyses, based on Stata metan programs, used random-effects modeling to limit effects of inter-trial variance; responder rates for each drug–placebo pair yielded pooled rate ratios (RRs) and rate differences (RDs) with their computed 95% confidence intervals (CIs) (Tsapakis et al, 2008; Yildiz et al, 2011a, 2011b). Percentage-improvement in depression for drug–placebo pairs was compared by paired-t testing and averaged to provide overall estimates of response differences (RDs). We also carried out bivariate and multiple linear regression modeling from these analyses to evaluate associations of selected covariates with reporting year. Correlations employed nonparametric Spearman rank methods (rs) to avoid effects of non-normally distributed data and potential nonlinear relationships. The primary study-hypothesis was that all marketed antidepressants would be statistically more effective than placebo, on average, with only minor differences among specific drugs or types. Analyses were based on standard commercial software (Stata.8; StataCorp, College Station, TX; Statview.5; SAS Institute, Cary, NC).
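For readers unfamiliar with random-effects pooling of rate ratios, the following is a minimal sketch using the DerSimonian–Laird estimator on the log scale. It is an assumption that this estimator approximates the pooling performed here; it does not reproduce the Stata metan routines, and the trial counts shown are hypothetical.

```python
import math


def log_rr_and_var(e1, n1, e0, n0):
    """Log rate ratio and its approximate variance for one drug-placebo pair.

    e1/n1: responders/subjects on drug; e0/n0: responders/subjects on placebo.
    Assumes all event counts are non-zero.
    """
    log_rr = math.log((e1 / n1) / (e0 / n0))
    var = 1 / e1 - 1 / n1 + 1 / e0 - 1 / n0
    return log_rr, var


def pool_random_effects(trials):
    """Return (pooled RR, lower 95% limit, upper 95% limit)."""
    ys, vs = zip(*(log_rr_and_var(*t) for t in trials))
    w = [1 / v for v in vs]  # inverse-variance (fixed-effect) weights
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))  # Cochran's Q
    df = len(trials) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-trial variance, truncated at zero.
    tau2 = max(0.0, (q - df) / c) if df > 0 and c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in vs]  # random-effects weights
    y = sum(wi * yi for wi, yi in zip(w_star, ys)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return math.exp(y), math.exp(y - 1.96 * se), math.exp(y + 1.96 * se)


# Hypothetical trials: (drug responders, drug n, placebo responders, placebo n).
HYPOTHETICAL_TRIALS = [(54, 100, 37, 100), (60, 120, 40, 118), (45, 90, 36, 92)]
rr, lower, upper = pool_random_effects(HYPOTHETICAL_TRIALS)
print(f"pooled RR = {rr:.2f} (95% CI: {lower:.2f} to {upper:.2f})")
```

Adding the between-trial variance to each trial's own variance flattens the weights, so that large trials dominate the pooled estimate less than they would under a fixed-effect model.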

RESULTS

Trial Characteristics

Initially, we screened >2000 potentially relevant reports appearing between 1980 and 2011. Based on reviewing abstracts, 179 reports appeared to meet selection criteria and not to include multiple reports of the same trials. Exclusions (71/179) were as follows: (a) 17 studies involved <20 patients per arm; (b) another 17 included >10% of subjects with diagnoses other than major depressive episode; (c) 11 studies involved special populations; (d) 4 trials did not include a placebo control arm; (e) another 22 reports were excluded for various other reasons, including outcomes that were not quantified or did not include responder rates or improvement in depression ratings, represented subpopulations of larger trials already considered, or involved unapproved drugs. Detailed review of entire reports led to inclusion of 107; they involved 142 drug–placebo comparisons (Table 1), owing to 35 trials arising from studies with three randomized arms (for which a total of 3677 placebo-treated subjects were considered twice). There were 27 127 non-duplicated adult subjects (17 059 randomized to an antidepressant, 9925 to placebo), of average age 40 years (62.0±9.9% women). Antidepressants tested (n=19) ranked by trial-count as: imipramine (23 trials), fluoxetine (17), venlafaxine (15), paroxetine (14), amitriptyline (12), duloxetine (10), bupropion (9), desvenlafaxine (8), sertraline (8), R,S-citalopram (7), S-citalopram (5), mirtazapine (4), selegiline (3), desipramine (2), clomipramine (1), nortriptyline (1), phenelzine (1), tranylcypromine (1), and trazodone (1). Types of antidepressants ranked: SRIs (52 trials (36.6%)), TCAs (38 (26.8%)), SNRIs (33 (23.2%)), atypical agents (bupropion, mirtazapine, trazodone; 14 (9.9%)), and MAO-inhibitors (5 (3.5%)).
Subjects per trial ranked: SNRIs (288±118) > SRIs (230±146) > atypical agents (224±144) > MAO-inhibitors (181±98) > TCAs (139±101); there were far more sites per trial since the median reporting-year of 1998 (range 1983–2010): 22.7±16.8 vs 7.22±5.98, as well as more subjects per trial: 270±114 vs 181±122, indicating a major secular trend toward increasing trial-size.

Table 1 Characteristics of Placebo-Controlled Trials of Antidepressants in Major Depression

Initial depression scores (as percentage of scale maxima) were similar in drug- (45.6±6.5%) and placebo-arms (48.6±8.4%). There were 120±90 subjects (range: 11–521) per antidepressant arm and 96±57 per placebo arm (range: 18–273) or 216±136 participants per trial, and 16.1±15.3 collaborating sites per trial. Treatment lasted approximately 7.2±1.8 weeks, uncorrected for early dropouts at unspecified times, at rates of approximately 29.8±12.3% or 4.54±2.58% per week with drugs, and 33.3±15.7% or 5.06±3.02% per week with placebos (paired-t=2.71, p=0.007). Supplemental use of moderate doses of sedative-anxiolytics was permitted in 59.1% of all trials. Most trials (81.7%) included at least brief periods to allow previously administered drugs to ‘wash-out,’ and most (78.9%) employed intention-to-treat methods; 97.4% of trials were sponsored by pharmaceutical manufacturers. The overall estimated IMI-eq standardized dose was 158±68 mg/day, and did not differ by drug-type or between older (TCAs, MAO-inhibitors: 155±49 mg/day) and modern antidepressants (SRIs, SNRIs, and atypical agents: 159±77 mg/day).

Meta-Analysis

Meta-analyses with the 122 trials reporting on responder rates yielded pooled drug–placebo RRs (RR with CIs) for each agent, and an overall pooled RR value of 1.42 (95% CI: 1.38–1.48; z=16.3, p<0.0001). Among agents with more than one trial, amitriptyline ranked highest in apparent efficacy, and bupropion lowest; however, CIs for most agents overlapped, indicating the need for caution in attempting to rank drugs by efficacy (Figure 1). Single-trial data available for phenelzine, clomipramine, nortriptyline, trazodone, and tranylcypromine are likely to be unstable and unreliable (Figure 1). Construction of a ‘funnel plot’ (1/standard-error-of-RR vs 1/RR) for all reports with data on responder rates yielded a V-shaped distribution of values that was symmetrically distributed around the pooled value of 1/RR (not shown); this finding may provide evidence against selective reporting of positive trial results.

Figure 1

Summary of meta-analytically computed relative rates (RR) of response after randomization to drug vs placebo, with 95% confidence intervals (CI, horizontal bars when n≥2 trials per drug) for controlled trials of each of 19 antidepressants (with numbers of trials on the left axis, and numerical values on the right). Drugs are listed by descending apparent efficacy, with symbol-size approximately proportional to weighting by trials per drug. The vertical solid line=null (1.0); vertical dotted line and solid diamond (width=CI)=pooled RR for all agents tested (*p<0.05; **p≤0.01; ***p≤0.001). Overall pooled RR=1.42 (CI: 1.38–1.48), indicating an average of 42% superiority of antidepressants over placebos. Note that phenelzine, clomipramine, tranylcypromine and trazodone (n=1 trial each) appear to be outliers.

We also compared antidepressants by types with pooled data, and compared apparent efficacy by three outcome measures. These included meta-analytically computed response RRs and responder rate-differences (RD), as well as relative differences (RD) in changes in depression ratings with drug–placebo pairs. Although these outcome measures yielded slightly different rankings, TCAs consistently ranked as the most effective antidepressants considered, and atypical agents as seemingly the least effective (Table 2). Trials carried out before the median reporting year (1998) yielded higher values of all efficacy measures (Table 2). Median years of trial-reporting ranked: TCAs (1991) < MAO-inhibitors (1997) < atypical agents (1998) = SRIs (1998) < SNRIs (2003). Efficacy based on responder-rate RR values was much greater for TCAs than other types of antidepressants (1.83±0.62 vs 1.48±0.41; F=11.8, p=0.0008). Moreover, when the numbers of placebo-responders and nonresponders in the TCA trials were substituted for corresponding placebo data for trials of modern antidepressants, the meta-analytically pooled RR value was identical to that found in the TCA trials, supporting the impression that apparent differences in response rates between the two classes of antidepressants were accounted for by secular changes in placebo responses.

Table 2 Comparisons Among Antidepressant Types and Reporting Years

Factors Associated with Trial Results

Given the preceding findings suggesting that older agents, specifically TCAs, might appear to be somewhat more effective than modern antidepressants in general, and that older trials yielded consistently greater drug–placebo differences, we carried out several correlational analyses to further examine effects of reporting-year on numbers of sites and subjects per trial, on responses to drugs and placebos and their ratio (Figure 2). Both sites and subjects per trial increased between 1983 and 2010 (Figures 2a and b). Responses in placebo-arms of trials increased across the same era, but responses to antidepressant drugs decreased slightly (Figures 2c and d) to yield highly significant decreases in the drug–placebo responder rate-ratio (RR) across the same years (Figure 2e); drug–placebo differences in rates of response or of percentage improvement also declined (not shown). Responder RR values also declined significantly as the number of subjects per trial (Figure 2f) as well as sites per trial (not shown) increased. Trial-duration also increased significantly across the years sampled (rs=0.603, p<0.0001), and longer-trials led selectively to larger responses with placebos (slope, 1.90 (CI: 0.85–2.95), p=0.0005) than with drugs (0.92 (–0.15 to 1.99)).
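The Spearman rank correlation (rs) used for these comparisons can be sketched in plain Python as a Pearson correlation computed on ranks, with tied values given their average rank. The function names and the year and RR values below are hypothetical illustrations, not data from the trials reviewed.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical reporting years and drug-placebo rate ratios.
years = [1983, 1990, 1998, 2005, 2010]
rate_ratios = [2.1, 1.9, 1.6, 1.5, 1.3]
print(round(spearman_rs(years, rate_ratios), 3))  # -1.0
```

Because only ranks enter the calculation, any strictly monotonic decline gives rs = -1, whether or not the relationship is linear.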

Figure 2

Correlations with Spearman nonparametric correlation coefficients (rs): (a) Subjects per trial vs year of trial reporting (p<0.0001); (b) Collaborating sites per trial vs year (p<0.0001); (c) Meta-analytic responder rate (% of subjects) for antidepressants vs year (p=0.58); (d) Responder rate (%) for placebo vs year (p<0.0001); (e) Responder rate ratio (RR: drug–placebo) vs year (p<0.0001); (f) RR vs subjects per trial (p=0.002). In addition RR decreased significantly with more sites per trial (rs=−0.302, p=0.004). Note that response after randomization to placebo but not antidepressant drugs selectively increased over years, as subject and site counts per trial increased, with corresponding decreases in drug–placebo relative response rate-ratio (RR).


Multivariate linear regression modeling indicated that the following factors were associated significantly and independently with more recent reporting-years: (a) more drugs other than TCAs, (b) larger numbers of subjects per trial (or sites per trial), (c) lower response to drugs, (d) greater responses to placebo, (e) higher proportions of depressed women, and (f) longer trials. However, there was no evidence of secular changes in ratings of depression-severity at intake, IMI-eq drug-dose, or dropout rates (Table 3).

Table 3 Multivariate Linear Regression Model: Factors Associated with Year of Publication of Trial Reports

Finally, we carried out a preliminary, hypothesis-generating post hoc analysis of deciles of meta-analytically determined drug–placebo responder RR values as well as depression-improvement RD values vs trial-sizes (not shown). By both outcome measures, the apparently optimal number was 2–10 sites per trial, and 30–75 subjects per trial, with lower efficacy found at both lower and higher counts.

DISCUSSION

The present findings are congruent with reviews discussed above indicating that antidepressant drug-vs-placebo differences in published reports of controlled trials are generally moderate (Baldessarini, 2005; Gartlehner et al, 2008; Kirsch et al, 2008; Tsapakis et al, 2008; Bridge et al, 2009; Wooley et al, 2009; Masi et al, 2010; Pigott et al, 2010; Khin et al, 2011). This conclusion was reached in the previous literature despite typical reliance on initial improvement on scale ratings rather than less readily achieved clinical remission, and despite growing evidence of publication bias toward underreporting of studies without significant drug–placebo differences (Ioannidis, 2008; Turner et al, 2008). Following nearly identical mid-range, initial depression ratings across drug and placebo arms and reporting-years, the crude response rates in the reports reviewed here averaged 54% with FDA-approved antidepressants that are employed clinically to treat major depression in the United States, compared with 37% with placebo. These differences consistently favor active drugs, but by only 17 percentage points.
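The distinction between the absolute and relative contrasts implied by these crude rates can be made explicit with a short calculation. The two rates are those quoted above; the function names are illustrative only, and the crude ratio is not the same quantity as the meta-analytically pooled RR, which weights each trial.

```python
def rate_difference(drug_rate, placebo_rate):
    """Absolute difference in responder rates (as a proportion)."""
    return drug_rate - placebo_rate


def rate_ratio(drug_rate, placebo_rate):
    """Relative responder rate, drug over placebo."""
    return drug_rate / placebo_rate


drug, placebo = 0.54, 0.37  # crude average responder rates quoted in the text
print(f"{100 * rate_difference(drug, placebo):.0f} percentage points")  # 17
print(f"crude rate ratio = {rate_ratio(drug, placebo):.2f}")            # 1.46
```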

The present findings also support the broad consensus that drug–placebo differences have been declining for a variety of psychotropic drugs in recent decades, making it increasingly difficult to demonstrate efficacy (Khin et al, 2011; Yildiz et al, 2011a, 2011b). This trend probably has encouraged increased reliance on larger trials (more subjects and collaborating sites) in order to maintain statistical power. Moreover, increasing reliance on complex trials carried out in varied geographic locations and cultures may tend to limit the reliability of research findings (Vázquez et al, 2011).

It is evidently widely held that differences in efficacy among specific drugs or types of antidepressants in the treatment of acute episodes of major depressive disorder are generally minor (Healy, 1997; Baldessarini, 2005, 2012; Cipriani et al, 2007; Gartlehner et al, 2008; Ghaemi, 2008; Pigott et al, 2010; Khin et al, 2011). The present findings support the conclusion that pooling of data from placebo-controlled trials does not yield clear rankings of specific drugs or drug-types by apparent efficacy (Figure 1). Unexpectedly, however, there were significant differences in reported apparent efficacy between TCAs and newer antidepressants (Table 2). We propose that this outcome may reflect important changes in characteristics of clinical trials for depression over the past three decades. These include increasing size and complexity, with selective increases in response rates with placebos and somewhat decreasing responses with antidepressants (Figure 2). It is particularly noteworthy that when placebo-response data from the generally older TCA trials were substituted for those in more recent trials of modern drugs, both types of agents yielded identical meta-analytically pooled RR values. In contrast, we did not find evidence of significant changes over the years in initial ratings of depression-severity (adjusted for variance among rating scales), in approximate IMI-eq antidepressant doses, or in several other measured characteristics of trials (Table 3).

It is increasingly clear that drug–placebo differences in trials of antidepressants and other psychotropic agents have been declining (Gartlehner et al, 2008; Ioannidis, 2008; Kirsch et al, 2008; Tsapakis et al, 2008; Turner et al, 2008; Bridge et al, 2009; Masi et al, 2010; Khin et al, 2011; Vázquez et al, 2011; Yildiz et al, 2011a, 2011b). In accord with recent findings in controlled treatment trials for mania (Yildiz et al, 2011a, 2011b), a secular increase in sites and participants per trial was associated, selectively, with rising placebo-associated response rates, resulting in declining drug–placebo contrasts or effect-size (Figure 2; Table 3). We propose that this tendency may, at least in part, reflect declining quality-control and greater heterogeneity of diagnostic and clinical assessments in large, complex, multi-site trials, particularly when dissimilar cultures are involved and local standardization of assessment methods is limited (Yildiz et al, 2011a, 2011b; Vázquez et al, 2011). We propose that selective increases in response rates associated with randomized placebo-treatment might reflect ‘regression-to-mean’ effects (Anderson, 1990; Bland and Altman, 1994) or random outcomes. Placebo-associated responses have increased from former levels of 20 to 30% to current levels of 30 to 50%, and to as high as 59.2% in a 1997 trial involving paroxetine (Lecrubier et al, 1997).

Alternative factors that may contribute to the observed secular trends include changes in the types of patients recruited into antidepressant trials, including less severely ill patients willing to accept potential randomization to a placebo, and even partially treated subjects. Levels of training and expertise of personnel providing diagnostic and symptom-rating assessments may also have declined. In addition, trials have become longer over the years sampled (Table 2), requiring more clinical assessments with greater risk of measurement-variance, and providing more clinical contact and more time for spontaneous improvement—all of which may favor responses associated with placebo treatment. Additional technical factors may include less reliance on expert raters, with greater risk of less stable assessments in a very heterogeneous disorder (Healy, 1997).

If the preceding interpretation of the present findings is correct, it suggests several practical considerations for the design and conduct of therapeutic trials for major depression and perhaps other disorders. These include seeking an optimal range of trial-sizes, with redoubled efforts to maximize quality-control, limit placebo-associated responses, and maximize drug–placebo differences. Preliminary analyses of the present data suggest that an optimal range of collaborating sites per trial may be 2–10, and of subjects per trial, about 30–75. Such conservative considerations for the design of future trials may improve outcomes. Additional potential benefits may include reduced time, complexity, and costs, as well as exposure of fewer acutely depressed patient-subjects to placebo-treatment.

Limitations of this study include a lack of relevant details in many reports of controlled trials, sometimes including inconsistent reporting of definitions and outcomes for responder rate and percentage improvement, of the number of rating-scale items and of their maximum attainable scores in a few trials. Also, in most trials, exposure times are estimated from nominal protocol requirements since precise, subject-based actual weeks of treatment usually are not stated. Also, numbers of patients with defined outcomes are usually, but not always, based on prevalent intention-to-treat methods, which can limit responses owing to early dropout. Routine reporting of such details would greatly benefit future meta-analyses. Additional limitations to generalization arise from our requirements of peer-review and publication of findings in placebo-controlled trials concerning antidepressants approved and marketed in the United States for acute adult, major depression.

In conclusion, the present meta-analytic review of outcomes of placebo-controlled trials of antidepressants for acute episodes of major depressive disorder found evidence that older antidepressants, particularly TCAs, yielded somewhat superior apparent efficacy to some modern, second-generation agents. However, such nominal differences appear to have been influenced by secular changes in the nature of such trials over the past three decades. These include rising subject- and site-numbers and increasing placebo-associated responses, leading to falling drug–placebo differences or effect-size. We hypothesize that more conservative numbers of subjects and sites, with improved quality-control of trial methods, may paradoxically yield superior results in controlled trials of some psychotropic drugs, and do so more economically. Finally, the lack of major and compelling differences in apparent efficacy among specific antidepressants, and moderate differences among drug-types, suggest that meta-analyses of controlled trials may have limited value in efforts to develop an evidence-basis (Sackett et al, 1996) for identifying superior treatments.