Introduction

Depression is a common and disabling mental disorder among women during pregnancy and in the postpartum period1,2. A 2005 systematic review estimated that the point prevalence of major depression during pregnancy and postpartum ranged from 1 to 6% at different time points (first trimester of pregnancy to one year postpartum), based on 2–6 studies at any given time point (N = 111–2104 participants)3. A 2008 national survey from the USA with more than 14,000 participants reported that 12-month period prevalence was similar among pregnant women (8%), postpartum women (9%), and similar-aged non-pregnant women (8%)4. Nonetheless, perinatal depression may have substantial adverse effects on mothers, fathers, partner relationships, and infants, including impairment of maternal function, paternal depression, premature delivery, infants with low birth weight and developmental delays, and impaired parent-infant interactions5,6,7,8.

Depression screening involves using self-report questionnaires to identify individuals who exceed a pre-defined cut-off score for further diagnostic evaluation to determine whether they have depression9,10. Guidelines from the United States Preventive Services Task Force and the Australian government recommend depression screening in pregnant and postpartum women11,12. The United Kingdom National Screening Committee and Canadian Task Force on Preventive Health Care, on the other hand, recommend against screening due to concerns about false positives, possible associated harms, and a lack of evidence from randomized controlled trials that screening leads to improved health outcomes13,14.

The 10‐item Edinburgh Postnatal Depression Scale (EPDS) is the most commonly used self‐report questionnaire for depression screening in pregnancy and postpartum15,16. It is also used to monitor symptoms among people undergoing treatment for depression and as a continuous outcome measure in research. Respondents rate how they have felt in the previous seven days17. Each item is scored 0–3, and possible total scores range from 0 to 30; higher scores indicate more severe depressive symptoms. Cut‐off values of ≥ 10 and ≥ 13 are often recommended for screening15,18,19. A 2020 individual participant data meta-analysis (IPDMA) reported that a cut-off of 11 or higher maximized the sum of sensitivity (81%) and specificity (88%), when using a semi-structured diagnostic interview as the reference standard (N = 36 studies, 9066 participants, 1330 major depression cases)20.

Although brief tools have been designed specifically to assess suicide ideation and risk in health care settings21, item 10 of the EPDS is sometimes used as a proxy for suicidal ideation. Item 10 is intended to assess thoughts of self-harm: “the thought of harming myself has occurred to me”15,22. A review from 2005 reported that 5–15% of pregnant and postpartum women had thoughts of self-harm (item score ≥ 1) based on this item23. However, responses to this item may not accurately reflect whether suicidal ideation is present. One study compared positive responses on item 10 to item 3 of the Hamilton Depression Rating Scale (HDRS), which directly asks about suicidal ideation, among a sample of women with mood disorders during the first year postpartum; 17% (22/131) of participants who were administered the EPDS had positive responses on item 10 versus 6% (9/146) who had positive responses on item 3 of the HDRS24. A study of 574 pregnant and postpartum women with positive responses to item 10 found that 324 (57%) women had fleeting thoughts of avoiding problems but no intent to self-harm, and 75 (13%) misunderstood the item25. One potential reason for this is that the item is not specific, and some women may misinterpret it to include unintentional injury, for instance due to falls from impaired balance26,27,28. When the full EPDS is used in research studies, misinterpretation of item 10 could require follow-up with many women who endorse the item, even though most responses are false positives. This consumes substantial resources without evidence of benefit to study participants from administering the item.

A 9-item version of the EPDS (EPDS-9), which omits item 10, is sometimes used. A study of 371 women from the United States referred to a program serving women with or at risk of postpartum depression found that 48% of participants scored ≥ 13 on the EPDS-9 compared to 49% on the full EPDS29. If differences in performance between the EPDS-9 and full EPDS are minimal, the EPDS-9 could be used for a range of purposes in research studies, where administering item 10 could have significant resource implications, including the need to follow up on potentially large numbers of false positive responses to item 10. It might also be considered in trials of screening programs or in jurisdictions where screening is done in practice. However, no study has examined the correlation between continuous scores or the level of agreement in screening accuracy between the EPDS-9 and full EPDS. The objectives of the present study were to (1) evaluate the association of continuous EPDS-9 and full EPDS scores for assessing depressive symptom severity; and (2) assess the equivalence of the accuracy of the EPDS-9 and full EPDS across relevant cut-offs for screening to detect major depression.

Methods

The present study used a subset of participants from a database originally synthesized for an IPDMA on the accuracy of the full EPDS for depression screening20. The original IPDMA was registered in PROSPERO (CRD42015024785), and a protocol was published30. Results from the main IPDMA of the EPDS have been published20. To assess the equivalence of the EPDS-9 and full EPDS, we followed similar methods to those used in our previously published study that assessed the equivalence of the Patient Health Questionnaire-8 (PHQ-8) and PHQ-931. Prior to initiating the present study, we published a study-specific protocol on the Open Science Framework (https://osf.io/n9mfq/).

Study eligibility

For the main IPDMA, studies and datasets were eligible if (1) they administered the EPDS; (2) diagnostic classification for current major depressive disorder or major depressive episode was done based on a validated semi-structured or fully structured interview using Diagnostic and Statistical Manual of Mental Disorders (DSM)32,33,34 or International Classification of Diseases (ICD) criteria35; (3) participants were women aged 18 or older who completed assessments during pregnancy or within 12 months of giving birth; (4) the EPDS and diagnostic interview were conducted within two weeks of each other; and (5) participants were not limited to people receiving psychiatric assessment or seeking psychiatric care because screening is done to identify previously unrecognized cases. Datasets where not all participants were eligible were included if primary data allowed selection of eligible participants. There were no restrictions based on language or study design. For the present study, we only included datasets from primary studies that provided individual EPDS item scores for all 10 items, because only those datasets allowed us to generate EPDS-9 scores and compare the EPDS-9 and full EPDS.
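As an illustration of why item-level data were required, the following minimal sketch derives both totals from ten item scores. The function name and example values are hypothetical, and the sketch assumes that item scores have already been coded 0–3 according to the scale's scoring key.

```python
def epds_scores(items):
    """Return (full EPDS total, EPDS-9 total) from ten item scores.

    Assumes each item is already coded 0-3 per the scale's scoring key.
    """
    if len(items) != 10:
        raise ValueError("expected exactly 10 item scores")
    if any(score not in (0, 1, 2, 3) for score in items):
        raise ValueError("each item must be scored 0, 1, 2, or 3")
    full_total = sum(items)              # possible range 0-30
    epds9_total = full_total - items[9]  # omit item 10 (index 9); range 0-27
    return full_total, epds9_total


# Hypothetical respondent, for illustration only
example_items = [1, 2, 0, 3, 1, 2, 1, 0, 2, 1]
full_total, epds9_total = epds_scores(example_items)
print(full_total, epds9_total)
```

A dataset that reports only the total score cannot be processed this way, which is why such datasets could not be included.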

Search strategy and selection of eligible studies

A medical librarian searched Medline, Medline In-Process and Other Non-Indexed Citations, PsycINFO, and Web of Science from database inception to October 3, 2018, using a peer-reviewed search strategy (Supplementary Methods 1). Additionally, investigators reviewed reference lists of relevant reviews and queried contributing authors about non-published studies. Search results were uploaded into RefWorks (RefWorks-COS, Bethesda, MD, USA). After de-duplication, unique citations were uploaded into DistillerSR (Evidence Partners, Ottawa, Canada), which was used to store and track search results, conduct screening for eligibility, document correspondence with primary study authors, and extract study characteristics.

Two investigators independently reviewed titles and abstracts for eligibility. For publications deemed potentially eligible by either reviewer, a full-text review was done by two investigators, also independently. Disagreements between reviewers after full-text reviews were resolved by consensus and a third investigator was consulted if necessary.

Data contribution, extraction, and synthesis

Authors of eligible datasets were invited to contribute de-identified primary data. We attempted to contact corresponding authors of eligible primary studies by email up to three times, as necessary. When authors did not respond to our emails, we tried to contact them by phone and emailed co-authors. There was no time limit for how long authors had to provide data.

Two investigators independently extracted information on the diagnostic interview administered and the country of study from the published reports. Any discrepancies were resolved by consensus. Participant-level data in the synthesized dataset included country human development index (which reflects life expectancy, education, and income of a country)36, age, pregnant or postpartum status, diagnostic interview administered, major depression classification status, and EPDS item scores. We used major depressive disorder or major depressive episode based on DSM or ICD criteria; if both were reported, we prioritized major depressive episode because screening attempts to detect episodes of depression. Clinically, additional assessment would be needed to determine if episodes were related to major depressive disorder or another psychiatric disorder (bipolar disorder, persistent depressive disorder). We also prioritized DSM over ICD because DSM is more commonly used in existing studies. We used statistical weights to reflect sampling procedures if provided in the datasets, for instance, when primary studies administered a diagnostic interview to all participants with positive screening results but only a random sample of those with negative results. Some studies used sampling procedures that merited weights but did not use weights. For those studies, we used inverse selection probabilities to generate appropriate weights.
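The inverse selection probability weighting described above can be sketched as follows. This is a simplified illustration with hypothetical counts and a hypothetical function name, assuming a design in which the only selection factor is the screening result.

```python
def inverse_selection_weights(n_screen_pos, n_interviewed_pos,
                              n_screen_neg, n_interviewed_neg):
    """Return (weight for screen-positives, weight for screen-negatives).

    Each weight is the inverse of the probability that a participant in
    that stratum was selected for the diagnostic interview.
    """
    weight_pos = n_screen_pos / n_interviewed_pos
    weight_neg = n_screen_neg / n_interviewed_neg
    return weight_pos, weight_neg


# Hypothetical study: all 100 screen-positives interviewed,
# but only a random 90 of 900 screen-negatives.
w_pos, w_neg = inverse_selection_weights(100, 100, 900, 90)
print(w_pos, w_neg)
```

In this hypothetical case each interviewed screen-negative participant stands in for ten participants, so that the weighted sample reflects the composition of the full screened sample.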

For all datasets, we verified that participant characteristics and screening accuracy results for the full EPDS matched those that had been published. When primary data and original publications were discrepant, we identified and corrected errors when possible and resolved any outstanding discrepancies in consultation with the original investigators. We transformed all study-level and individual-level participant data into a standardized format and combined them into a single synthesized dataset. For nine studies that collected data at multiple time points (four with two time points, four with three time points, and one with four time points), we selected the time point with the most participants. If the number of participants was maximized at multiple time points, we selected the one with the most women who had major depression.
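The time point selection rule can be expressed compactly. The sketch below uses hypothetical field names and counts; it selects the time point with the most participants and breaks ties by the number of major depression cases.

```python
def select_time_point(time_points):
    """Pick the time point with the most participants.

    Ties on participant count are broken by the number of major
    depression cases. Field names are hypothetical.
    """
    return max(
        time_points,
        key=lambda tp: (tp["n_participants"], tp["n_cases"]),
    )


# Hypothetical study with three assessment time points
time_points = [
    {"label": "T1", "n_participants": 200, "n_cases": 18},
    {"label": "T2", "n_participants": 200, "n_cases": 25},
    {"label": "T3", "n_participants": 180, "n_cases": 30},
]
print(select_time_point(time_points)["label"])
```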

We used the Quality Assessment of Diagnostic Accuracy Studies-2 tool (QUADAS-2)37 to assess risk of bias of included studies. No QUADAS domain items, however, were associated with outcomes in our main EPDS IPDMA20. Furthermore, QUADAS is designed to assess risk of bias in estimates of screening accuracy but not study features that might bias differences between using a full scale and a minimally shortened version of that scale. Thus, QUADAS ratings are provided in Supplementary Methods 2 but were not included in analyses.

Statistical analyses

To evaluate the association of EPDS-9 and full EPDS scores for assessing depressive symptom severity, we first calculated a Pearson correlation and 95% confidence interval (CI) between the EPDS-9 and full EPDS scores for each study. We then generated a pooled correlation estimate, with 95% CI and prediction interval (PI), using a random-effects model that accounted for clustering within primary studies.
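The per-study step can be sketched as below. The Pearson correlation follows its standard definition; the confidence interval uses the Fisher z-transformation, which is one common choice and is an assumption here, since the method used for the per-study intervals is not specified. The random-effects pooling step is not shown, and the example scores are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via Fisher's z (assumed method).

    Requires n > 3 and |r| < 1.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical scores for five participants in one study
epds9 = [3, 8, 12, 15, 20]
full = [3, 9, 12, 17, 21]
r = pearson_r(epds9, full)
low, high = fisher_ci(r, len(epds9))
print(round(r, 3), round(low, 3), round(high, 3))
```

Because the EPDS-9 total is contained in the full EPDS total, correlations near 1 are expected whenever item 10 is rarely endorsed.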

To compare correlations and the screening accuracy of the EPDS-9 and full EPDS, we included all primary studies combined across type of diagnostic interview reference standards (primary analysis). There are differences in the way different types of diagnostic interviews are designed and their likelihood of classifying major depression38,39,40,41,42, but, since in each primary study, EPDS-9 and full EPDS scores are compared to the same reference standard, we did not have reason to believe that differences between the two measures would depend on the specific reference standard used. Nonetheless, we separately analyzed primary studies by the type of diagnostic interview used as the reference standard (secondary analyses), as we did in the previously published main EPDS IPDMA20.

For all studies pooled and by reference standard, for the EPDS-9 and full EPDS cut-offs ≥ 7 to ≥ 15, separately, we fitted bivariate random-effects models using adaptive Gauss-Hermite quadrature with 1 quadrature point20,43. This 2-stage meta-analytic approach models sensitivity and specificity at the same time, taking the inherent correlation between them and the precision of estimates within studies into account. A random-effects model was used as we assumed true values of sensitivity and specificity would vary across primary studies. We estimated accuracy for cut-offs from 7 to 15 to provide a range around the most commonly used cut-offs of ≥ 10 and ≥ 13, consistent with our main IPDMA of the EPDS20.
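The within-study quantities that feed such a model can be illustrated with a short sketch that computes weighted sensitivity and specificity at each cut-off for a single study. It does not fit the bivariate random-effects model itself, and the function name and data are hypothetical.

```python
def sensitivity_specificity(scores, is_case, cutoff, weights=None):
    """Weighted sensitivity and specificity at one cut-off (score >= cutoff).

    Assumes the study has at least one case and one non-case.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    tp = fn = tn = fp = 0.0
    for score, case, weight in zip(scores, is_case, weights):
        screen_positive = score >= cutoff
        if case and screen_positive:
            tp += weight
        elif case:
            fn += weight
        elif screen_positive:
            fp += weight
        else:
            tn += weight
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical study with eight participants
scores = [5, 9, 11, 14, 8, 12, 16, 7]
is_case = [False, False, False, True, False, True, True, True]
for cutoff in range(7, 16):  # cut-offs >= 7 to >= 15
    sens, spec = sensitivity_specificity(scores, is_case, cutoff)
    print(cutoff, round(sens, 2), round(spec, 2))
```

Running the same function on EPDS-9 and full EPDS totals for the same participants yields the paired estimates that are then compared.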

To examine the equivalence in accuracy between the EPDS-9 and full EPDS across cut-offs, overall and by reference standard, we used the results of the random-effects meta-analyses at each cut-off to construct separate empirical receiver operating characteristic (ROC) curves and areas under the curve (AUC) based on the pooled estimates. Equivalence between the EPDS-9 and full EPDS sensitivity and specificity was evaluated at each cut-off separately. This allowed us to test whether the sensitivity and specificity of the EPDS-9 were similar to the full EPDS, up to a pre-specified maximum difference, that is, an equivalence margin44. In the present study, an equivalence margin of δ = 0.05 was used, which is the same margin that was used previously to compare the PHQ-8 and PHQ-931. CIs for the differences between the EPDS-9 and full EPDS sensitivity and specificity at each cut-off were constructed via a cluster bootstrap approach45,46, with resampling at the study and subject level. For each comparison, we ran 1000 iterations of the bootstrap. For each bootstrap iteration, the bivariate random-effects model was fitted to the EPDS-9 and full EPDS data, separately, and pooled sensitivities, specificities, and difference estimates between the EPDS-9 and full EPDS were computed. We compared the CIs around the pooled sensitivity and specificity differences to the equivalence margin of δ = 0.05. If the entire CI was between − 0.05 and + 0.05 then we rejected the hypothesis that there were differences large enough to be important and concluded that equivalence was present. If the entire CI was outside of the interval, then we failed to reject the hypothesis that the EPDS-9 and full EPDS were not equivalent. When the CIs crossed the ± 0.05 threshold, findings on equivalence were deemed indeterminate.
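The two-level resampling and the decision rule described above can be sketched as follows. This is a simplified illustration rather than the analysis code: refitting the bivariate model within each bootstrap iteration is omitted, the function names are hypothetical, and treating the margin bounds as strict inequalities is an assumption.

```python
import random


def cluster_resample(studies, rng):
    """One bootstrap draw: resample studies, then participants within each.

    `studies` is a list of studies, each a list of participant records.
    """
    resampled = []
    for _ in range(len(studies)):
        study = rng.choice(studies)  # study-level resampling
        # participant-level resampling within the drawn study
        resampled.append([rng.choice(study) for _ in range(len(study))])
    return resampled


def classify_equivalence(ci_low, ci_high, margin=0.05):
    """Apply the equivalence decision rule to a CI for a difference."""
    if -margin < ci_low and ci_high < margin:
        return "equivalent"        # entire CI inside the margin
    if ci_high < -margin or ci_low > margin:
        return "not equivalent"    # entire CI outside the margin
    return "indeterminate"         # CI crosses a margin bound


rng = random.Random(2018)  # arbitrary seed for reproducibility
toy_studies = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
print(cluster_resample(toy_studies, rng))
print(classify_equivalence(-0.04, -0.00))
print(classify_equivalence(-0.08, -0.02))
```

Under this rule, an interval of (−0.04, −0.00) is classified as equivalent, whereas (−0.08, −0.02) crosses the −0.05 bound and is classified as indeterminate.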

To investigate heterogeneity across studies, overall and by reference standard, we generated forest plots for the differences in sensitivity and specificity estimates between the EPDS-9 and full EPDS for cut-offs ≥ 10, ≥ 11 and ≥ 13 for each study. We also quantified heterogeneity at cut-offs ≥ 10, ≥ 11 and ≥ 13 by reporting the estimated variances of the random effects for the differences in the EPDS-9 and full EPDS sensitivity and specificity (τ²)47,48. Additionally, we calculated 95% prediction intervals, which reflect the range of true effects that can be expected in future settings or studies49.
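To make the role of τ² concrete, the sketch below computes a prediction interval using one widely used formulation, in which the interval half-width combines the between-study variance with the squared standard error of the pooled estimate. Whether this exact formulation was used here is an assumption. The critical value from the t distribution (with degrees of freedom equal to the number of studies minus 2) is supplied by the caller, and the example inputs are hypothetical.

```python
import math


def prediction_interval(pooled, se_pooled, tau2, t_crit):
    """Approximate prediction interval for the true effect in a new study.

    pooled    : pooled estimate (e.g., difference in sensitivity)
    se_pooled : standard error of the pooled estimate
    tau2      : estimated between-study variance
    t_crit    : t quantile for the desired coverage (caller-supplied)
    """
    half_width = t_crit * math.sqrt(tau2 + se_pooled ** 2)
    return pooled - half_width, pooled + half_width


# Hypothetical inputs, for illustration only
low, high = prediction_interval(pooled=-0.02, se_pooled=0.005,
                                tau2=0.0001, t_crit=2.02)
print(round(low, 3), round(high, 3))
```

When τ² is close to zero, the prediction interval is only slightly wider than the confidence interval, which is the pattern expected when differences are consistent across studies.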

All analyses were run in R (version 3.5.0)50 and RStudio (version 1.1.423)51 using the lme4 package52.

Ethical approval

As this study involved secondary analysis of anonymized previously collected data, the Research Ethics Committee of the Jewish General Hospital determined that this project did not require research ethics approval. However, for each included dataset, we confirmed that the original study received ethics approval and that all patients provided informed consent.

Results

Search results and characteristics of the primary data

For the main IPDMA, 4434 unique titles and abstracts were identified from the electronic database searches. Of these, 4056 were excluded after title and abstract screening and 257 after full-text review (Supplementary Table S1), resulting in 121 eligible articles from 81 unique participant samples. Of these samples, 56 (69%) contributed datasets. Furthermore, authors of included studies contributed data from 2 unpublished studies. In total, 58 full EPDS studies were provided to the main IPDMA. For the present study, 17 studies (4626 participants, 659 major depression cases) with datasets that included full EPDS scores but not individual item scores were excluded. Thus, 41 studies (10,906 participants, 1407 major depression cases) were analyzed (Supplementary Figure S1). Characteristics of the included studies are shown in Supplementary Table S2. Characteristics of the 25 eligible studies that did not provide data and the 17 excluded studies that provided only EPDS total scores are shown in Supplementary Table S3.

There were 24 included primary studies (5412 participants, 803 major depression cases) that used semi-structured diagnostic interviews to assess major depression, 4 (3189 participants, 228 major depression cases) that used fully structured diagnostic interviews other than the Mini International Neuropsychiatric Interview (MINI), and 13 (2305 participants, 376 major depression cases) that used the MINI. The Structured Clinical Interview for DSM Disorders (SCID) was the most commonly used semi-structured interview (22 studies, 5157 participants, 765 major depression cases), and the Composite International Diagnostic Interview was the most commonly used fully structured interview (3 studies, 2963 participants, 196 major depression cases). Characteristics of participants are shown in Table 1.

Table 1 Participant characteristics by subgroups.

EPDS-9 and item 10 scores

As shown in Table 2, among participants in all studies, 1% of participants screened negative at an EPDS-9 cut-off of ≥ 10 but had a non-zero EPDS item 10 score. This percentage was also 1% at a cut-off of ≥ 11 and increased to 2% at a cut-off of ≥ 13. The correlation between the EPDS-9 and full EPDS scores was 0.998 (95% PI: 0.991, 0.999). The forest plot is shown in Supplementary Figure S2.
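The proportion described here, participants who screen negative on the EPDS-9 but have a non-zero item 10 score, can be computed from item-level data as in the following sketch. The function name and the four example participants are hypothetical.

```python
def share_negative_with_item10(participants, cutoff):
    """Fraction of all participants with EPDS-9 total below the cut-off
    and a non-zero score on item 10.

    `participants` is a list of ten-item score lists.
    """
    flagged = 0
    for items in participants:
        epds9_total = sum(items[:9])  # items 1-9 only
        if epds9_total < cutoff and items[9] > 0:
            flagged += 1
    return flagged / len(participants)


# Four hypothetical participants
participants = [
    [1] * 9 + [0],  # EPDS-9 = 9, item 10 = 0
    [1] * 9 + [1],  # EPDS-9 = 9, item 10 = 1 (counted at cut-off 10)
    [2] * 9 + [1],  # EPDS-9 = 18, screens positive
    [0] * 10,       # EPDS-9 = 0, item 10 = 0
]
print(share_negative_with_item10(participants, cutoff=10))
```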

Table 2 Characteristics of participants who rated item 10 by EPDS-9 score at cut-offs ≥ 10, ≥ 11, and ≥ 13.

Screening accuracy of the EPDS-9 and full EPDS

ROC curves that compare sensitivity and specificity estimates of the EPDS-9 and full EPDS for cut-offs ≥ 7 to ≥ 15 are shown in Fig. 1, overall and separately by semi-structured, fully structured, and MINI reference standards. The ROC curves for the EPDS-9 and full EPDS were highly overlapping overall and for each reference standard. The AUC of the EPDS-9 and full EPDS for all interviews combined was 0.906 versus 0.910. By interview type, it was 0.905 versus 0.910 for semi-structured interviews, 0.924 versus 0.926 for fully structured interviews (excluding the MINI), and 0.902 versus 0.907 for the MINI.

Figure 1

(a)–(d) ROC curves for the EPDS-9 and full EPDS (a) compared to all reference standards, (b) compared to a semi-structured reference standard, (c) compared to a fully structured reference standard (MINI excluded), and (d) compared to the MINI reference standard.

Comparisons of sensitivity and specificity estimates between the EPDS-9 and full EPDS at cut-offs ≥ 7 to ≥ 15 for all reference standards combined are shown in Table 3. At cut-off ≥ 11, which maximized the sum of sensitivity and specificity of the full EPDS in the main IPDMA20, sensitivity was 0.78 (95% CI 0.71, 0.84) and specificity was 0.87 (95% CI 0.83, 0.90) for the EPDS-9 versus 0.80 (95% CI 0.74, 0.86) and 0.87 (95% CI 0.83, 0.90) for the full EPDS. Comparisons of sensitivity and specificity estimates between the EPDS-9 and full EPDS for cut-offs ≥ 7 to ≥ 15 across the three different reference standard categories are shown in Supplementary Table S4.

Table 3 Comparison of sensitivity and specificity estimates between EPDS-9 and full EPDS across cut-offs 7–15 for studies that used all reference standards (N Studies = 41; N Participants = 10,906; N major depression = 1407).

Overall, among all 41 primary studies, across cut-offs ≥ 7 to ≥ 15, sensitivity was between 1 percentage point higher and 4 percentage points lower for the EPDS-9 compared to the full EPDS (Table 3). At cut-off ≥ 10, the difference was − 0.02 (95% CI − 0.04, − 0.00), at cut-off ≥ 11, the difference was − 0.02 (95% CI − 0.04, − 0.01), and at cut-off ≥ 13, the difference was − 0.04 (95% CI − 0.08, − 0.02). Sensitivity of the EPDS-9 and full EPDS was equivalent for cut-offs ≥ 7 to ≥ 12, and equivalence was indeterminate for cut-offs ≥ 13 to ≥ 15. For specificity, the differences between the EPDS-9 and full EPDS were within 0.01 for all cut-offs, and the EPDS-9 and full EPDS were equivalent for all cut-offs. As shown in Supplementary Table S4, in comparisons stratified by different reference standards, sensitivity estimates were similarly equivalent or indeterminate, and specificity estimates were equivalent at all cut-offs.

Forest plots illustrating the difference in sensitivity and specificity estimates between the EPDS-9 and full EPDS for the most used cut-offs ≥ 10, ≥ 11, and ≥ 13 are shown in Fig. 2. At cut-offs of ≥ 10, ≥ 11, and ≥ 13, low heterogeneity existed in the differences across all 41 studies; τ2 was < 0.01 for both differences in sensitivity and specificity, and the widest 95% prediction intervals were − 0.01 to 0.01 for differences in sensitivity and − 0.00 to 0.00 for differences in specificity. Forest plots of the differences of sensitivity and specificity estimates for cut-offs ≥ 10, ≥ 11 and ≥ 13 between the EPDS-9 and full EPDS among studies by reference standard category are shown in Supplementary Figure S3.

Figure 2

(a)–(c) Forest plots of the difference in sensitivity and specificity estimates between EPDS-9 and full EPDS among all studies at cut-offs (a) ≥ 10, (b) ≥ 11, and (c) ≥ 13.

Discussion

The present study had two major findings. First, continuous EPDS-9 and full EPDS scores were highly correlated (0.998, 95% PI 0.991, 0.999). Second, across cut-offs, including the commonly used cut-offs of ≥ 10, ≥ 11, and ≥ 13, the EPDS-9 had similar sensitivity and specificity to the full EPDS in screening for major depression among pregnant and postpartum women, across all studies and for all three types of reference standard categories. In analyses pooled across reference standards, sensitivity was equivalent for cut-offs ≥ 7 to ≥ 12 and indeterminate for cut-offs ≥ 13 to ≥ 15. Specificity was equivalent for all cut-offs. Low heterogeneity in differences shows that results were consistent across included studies.

Our findings about the EPDS-9 and full EPDS among pregnant and postpartum women are consistent with results from a comparable IPDMA on the equivalency of the screening accuracy of the PHQ-8 and PHQ-9, where the item removed in the shorter version also assessed self-harm. In that IPDMA, the screening accuracy of the PHQ-8 and PHQ-9 was similar across all cut-offs for detecting major depression31. Differences in sensitivity between the PHQ-8 and PHQ-9 were between 0.00 and 0.05, suggesting the sensitivity may be minimally reduced with the PHQ-8, although differences were deemed indeterminate. Specificity was equivalent for all cut-offs. We did not report positive predictive values in the present study, but these have been previously documented in our main IPDMA for the full EPDS20. For major depression prevalence values of 5–25%, positive predictive values for a cut-off of ≥ 11 compared to semi-structured interviews, for instance, ranged from 26 to 69%, and negative predictive values ranged from 93 to 99%. These would be similar for the EPDS-9.
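The predictive values quoted above follow from sensitivity, specificity, and prevalence through Bayes' theorem. The sketch below reproduces the calculation using the sensitivity (81%) and specificity (88%) reported for a cut-off of ≥ 11 against semi-structured interviews in the main IPDMA; the function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (positive predictive value, negative predictive value)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Full EPDS, cut-off >= 11, semi-structured reference standard
for prevalence in (0.05, 0.25):
    ppv, npv = predictive_values(0.81, 0.88, prevalence)
    print(prevalence, round(ppv, 2), round(npv, 2))
```

At 5% prevalence this gives a positive predictive value of about 26% and a negative predictive value of about 99%; at 25% prevalence the corresponding values are about 69% and 93%, matching the ranges reported above.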

Previous studies indicate that item 10 of the full EPDS overestimates the risk of suicidal ideation and identifies substantially more people as at risk than scales designed to assess suicidal ideation risk24,28. Ideally, if researchers wish to assess suicide risk, a method designed specifically for that purpose, such as the P4, would be used given the limitations of item 10 of the EPDS21. Potentially negative ramifications of using item 10 in research studies involve both resources and messaging to study participants. Ethically, all participants who score ≥ 1 on the item would need to be followed up with risk assessments, even though very few would be at risk, which could require substantial resources. Additionally, there is a risk of impairing relationships with some women who must undergo these interviews even though they are not at risk. There are similar ramifications in clinical settings, where follow-up interviews would be needed for all women with positive EPDS screens and women with a non-zero item 10 score but negative screens overall.

The present study is the first meta-analysis using a large individual participant dataset to compare the measurement performance of the EPDS-9 and full EPDS, which is a major strength. The large sample size enabled us to generate precise estimates of correlations and equivalence for screening accuracy between the two versions of the EPDS. Furthermore, we compared results for the EPDS-9 and full EPDS from all studies and three different reference standard categories at all cut-offs rather than just published cut-offs; relying on published cut-offs alone may result in bias due to selective cut-off reporting when individual participant data are not available53.

Limitations also need to be considered. First, we restricted the meta-analysis to studies with complete data for full EPDS individual item scores (71%, 41 of 58 eligible studies) and were not able to include all studies. We do not know of any reason why studies that did and did not record item-level data might differ in the association between full EPDS and EPDS-9 scores. Second, although we categorized studies based on the interview administered, interviews might not have always been used as originally designed; for instance, it is possible that some interviewers may not have had the experience or training required to administer semi-structured interviews. The low heterogeneity in the main analysis, though, suggests that results are applicable across different diagnostic interviews. Third, we conducted a secondary analysis of data collected up to 2018 for a previously published IPDMA. Based on prediction intervals, which were all between − 0.01 and 0.01 for differences in sensitivity and between − 0.00 and 0.00 for differences in specificity, however, additional data would not likely influence results meaningfully. It would not be a good use of resources to conduct additional studies on this research question. Fourth, we did not track inter-rater agreement for assessing eligibility at the title and abstract and full-text review levels.

In summary, this IPDMA showed that the EPDS-9 performs similarly to the full EPDS for assessing depressive symptom severity. The two EPDS versions also had similar accuracy in screening for major depression. Given the negative ramifications of false positive responses on item 10 and the similar measurement performance of the two versions, using the EPDS-9 instead of the full EPDS should be considered.