Introduction

The postpartum period poses the highest risk to women for developing a mental disorder1, with postpartum depression (PPD) being the most frequent one2. PPD is defined as a major depressive disorder occurring in direct relation (within 4 weeks postpartum) to childbirth in the DSM-53. Early diagnosis and treatment of PPD can substantially improve the outcome, prevent relapse, and minimize the associated emotional and financial burden4. Maternal mental health is a reliable predictor of child’s cognitive development and subsequent achievements5. The risk of a mother-to-child transmission of the vulnerability to depression6,7, through genetic as well as other factors such as depression-related effects on parenting8, is particularly high. Successful treatment of maternal depression alleviates the risk of childhood behavioral problems9.

PPD is often overlooked during postnatal visits, so that the critical window for early intervention is missed10,11. One reason is that low mood in the early postpartum period is largely deemed “normal” with 50–80% of new mothers experiencing initial sadness (i.e., postpartum blues), primarily due to dramatically plunging hormone levels at parturition12. Adjustment disorder (AD) in reaction to postpartum stress is another postpartum condition with similar symptoms. The crucial difference from PPD is that the severity of AD does not meet the criteria for depression at any time point. In the clinical context, AD needs to be considered as an important differential diagnosis for PPD13.

History of mental illness, vulnerability to hormonal changes, psychological and social distress, baby blues, premenstrual syndrome (PMS), unwanted pregnancy, traumatic birth experience and stressful life events are all associated with an increased risk of PPD11,12,14. It is of crucial importance to evaluate the relative and combined predictive value of these factors for development of PPD. Previous studies aiming at prediction of PPD focused either on time points in the late postpartum period (e.g., after 8–32 weeks)15 or only on single time points, thereby ignoring symptom dynamics or conflating PPD with major depression or AD16. Detailed in-clinic assessments are costly and burdensome, which is the likely reason for the cross-sectional nature of most previous studies. Online remote self-assessments may provide an easy means of obtaining the relevant information on symptom dynamics in individual patients.

Here, we recruited two cohorts of mothers giving birth and followed them longitudinally over 12 weeks to explore whether an accurate prediction of PPD is feasible based on socio-demographic and clinical-anamnestic information as well as early symptom dynamics using remote mood and stress assessments. Data from the first cohort were used to identify combinations of demographic and clinical data achieving highest accuracy for early identification and differentiation of PPD and AD using a machine learning approach. In this cohort, we identified and trained the optimal model for individual diagnostic prediction. The model and approach were pre-registered and evaluated against an independent validation cohort to obtain unbiased performance estimates of the proposed algorithm.

Methods

First cohort and study design

To identify the best predictors of PPD, a first cohort of 308 mothers (mean age = 31.7 ± 4.76) was recruited following childbirth at the University Hospital Aachen between November 2015 and June 2018. The current project was part of the Risk of Postpartum Depression (RiPoD) study conducted at the University Hospital Aachen. The main exclusion criteria were a depressive episode (according to a clinical interview) at the time of recruitment and specific child health conditions (for details see supplementary material). The recruitment was conducted at the Department of Gynecology and Obstetrics within the first two to five days postpartum. Out of a total recruitment pool of ~1000 births per year, 50–60% of women were contacted (30% were directly excluded based on some exclusion criteria due to close collaboration with the obstetrics department) of which 50% were willing to participate and met the inclusion criteria. Written informed consent was obtained from all participants. The authors assert that all procedures contributing to this work comply with the ethical standards of the relevant national and institutional committees on human experimentation and with the Helsinki Declaration of 1975, as revised in 2008. All procedures involving human subjects/patients were approved by the Institutional Review Board of the Medical Faculty of RWTH Aachen University (EK 208/15). The study design comprised follow-up for 12 weeks with evaluation at five time points each three weeks apart (T0-T4) (Fig. S1). Evaluations were conducted at the clinic for T0 and T4 and via remote online questionnaires for T1 to T3. All women were asked to complete mood and stress assessments (scale from one to ten, ten being high) online on a bi-daily basis. Remote assessments were sent via e-mail. If three consecutive assessments were missed, a reminder was sent via e-mail, which allowed for close monitoring of the participation.

A clinical interview was conducted at T0 to ascertain current conditions. At T4, an experienced psychiatrist conducted a second clinical interview for a final diagnosis. Based on this interview, participants were assigned into one of three groups: healthy controls (HC, N = 247, 80.2%) without any sign of depression during the whole observation period, and women meeting DSM-5 criteria for PPD (N = 28, 9.1%) or AD (N = 33, 10.7%)3. In case of a depression, the Hamilton Depression Rating Scale17 was administered. Clinical interviews were based on the DSM-53.

A sociodemographic-anamnestic questionnaire was used to obtain additional information about personal and socioeconomic status, psychiatric history, current pregnancy, child, breastfeeding at T0, postpartum blues (T4), PMS18 (T4), subjective quality of support at home (T4), and breastfeeding at T4 (Table 1, Table S1). The Stressful Life Events Screening Questionnaire19 was administered to assess exposure to stressful life events (T0) (Table 1). The Edinburgh Postnatal Depression Scale (EPDS)20 was collected at all time points (T0-T4). Maternal attachment was evaluated from T1 through T4 using the Maternal Postnatal Attachment Scale (MPAS)21.

Table 1 Sociodemographic and anamnestic data for the first and second cohort.

Second cohort

For the second cohort, further referred to as validation cohort, 193 mothers (mean age = 32.7 ± 4.78) were recruited between November 2018 and January 2020 following the same protocol and study design as for the first cohort (Fig. S1). The prevalence rates in the validation cohort were 76.2% for HC (N = 147), 8.29% for PPD (N = 16), and 15.5% for AD (N = 30).

Univariate analyses of the first cohort

All data were analyzed using MATLAB R2018a, Python Jupyter Notebook 5.6.0, IBM SPSS Statistics 22 and jamovi 1.0.5.022. Chi-square tests were performed to compare categorical sociodemographic-anamnestic variables across the groups in the first cohort. For continuous variables, logistic regressions were computed. Weekly mood and stress levels were calculated by averaging the corresponding bi-daily assessments. Mood-stress difference scores were calculated as the difference between both z-transformed variables to estimate individual discrepancies between perceived stress and mood (i.e., z-score mood minus z-score stress). Changes from baseline and the preceding week were computed for these variables. Dynamic changes in mood, stress, mood-stress difference, MPAS, and EPDS were analyzed using mixed effects repeated-measures analyses of variance (ANOVA) with week as within-subject and group as between-subject variable including an interaction term. Only post-hoc pairwise group comparisons (i.e., chi-square tests for categorical and binomial logistic regression for continuous sociodemographic-anamnestic variables, and independent samples t tests for mixed effects repeated-measures ANOVAs) were corrected for multiple testing using Bonferroni correction. The sample size was calculated as adequate for all univariate tests with a power of 0.8 and small to moderate effect sizes. Receiver operating characteristic (ROC) curves and their associated area under the curve (AUC) (within-sample) for differentiation between the three groups were computed for each measure per week.
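The derived mood and stress variables described above can be sketched as follows. This is a minimal illustration in Python, not the analysis code used in the study. The function names, the `per_week` parameter, the handling of missed assessments as `None`, and the use of the population standard deviation for the z-transformation across participants are all assumptions made for the example.

```python
from statistics import mean, pstdev

def weekly_means(ratings, per_week):
    """Average consecutive ratings into weekly scores.

    Missed assessments are passed as None and ignored (assumption);
    a week without any rating yields None.
    """
    weeks = []
    for start in range(0, len(ratings), per_week):
        chunk = [r for r in ratings[start:start + per_week] if r is not None]
        weeks.append(mean(chunk) if chunk else None)
    return weeks

def z_scores(values):
    """Standardize one weekly variable across participants.

    Using the population SD is an assumption; the source does not specify it.
    """
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def mood_stress_difference(mood, stress):
    """z-score mood minus z-score stress, one value per participant."""
    return [zm - zs for zm, zs in zip(z_scores(mood), z_scores(stress))]

def change_scores(weekly):
    """Change from baseline (first entry) and from the preceding week."""
    from_baseline = [w - weekly[0] for w in weekly]
    from_previous = [None] + [b - a for a, b in zip(weekly, weekly[1:])]
    return from_baseline, from_previous
```

In this sketch, a participant with high mood and low stress relative to the sample receives a positive mood-stress difference, and the first week has no "preceding week" change by construction.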

Identification of most predictive combinations in the first cohort

Next, we aimed to evaluate if and which combinations of sociodemographic and clinical-anamnestic factors, mood, stress, MPAS and EPDS allow for an accurate differentiation between HC, PPD and AD in the first cohort. To that end, we used a logistic regression classifier (MATLAB built-in mnrfit and mnrval functions, no parameter optimization needed) performing 1000 repetitions of strict threefold cross-validation. The classification was performed for each pair-wise group comparison separately and oversampling was applied to the PPD and AD groups. Low-variance variables (family status, breastfeeding T0, education, completed professional education, income, and psychiatric diagnosis in previous pregnancy), i.e., variables with low group cell counts (less than 80% of expected cell counts >5), were excluded from the analysis in the whole sample (see Table 1 and Table S1). Independent samples t tests were performed in the training data to select the baseline variables to be included in the classifier (p < 0.05).
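The cross-validation scheme can be illustrated with a small, self-contained sketch. It is not the MATLAB implementation: the gradient-descent logistic regression below stands in for mnrfit and mnrval, oversampling is assumed to be applied to the training folds only, the 0.5 probability cut-off is an assumption, and the t test based feature selection step is omitted. All function names and default parameters are hypothetical.

```python
import math
import random

def predict_proba(beta, x):
    """p = 1 / (1 + exp(-x.beta)), with the intercept stored in beta[0]."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Binary logistic regression by batch gradient descent (stand-in for mnrfit)."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = predict_proba(beta, xi) - yi
            grad[0] += err
            for j, xij in enumerate(xi):
                grad[j + 1] += err * xij
        beta = [b - lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def oversample(X, y, rng):
    """Randomly duplicate minority-class rows until both classes are equal in size."""
    idx = {c: [i for i, yi in enumerate(y) if yi == c] for c in (0, 1)}
    minority = min(idx, key=lambda c: len(idx[c]))
    deficit = len(idx[1 - minority]) - len(idx[minority])
    keep = list(range(len(y))) + [rng.choice(idx[minority]) for _ in range(deficit)]
    return [X[i] for i in keep], [y[i] for i in keep]

def stratified_folds(y, k, rng):
    """Assign samples to k folds while keeping class proportions similar."""
    folds = [[] for _ in range(k)]
    for c in (0, 1):
        members = [i for i, yi in enumerate(y) if yi == c]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def repeated_cv_balanced_accuracy(X, y, k=3, repetitions=10, seed=0):
    """Mean balanced accuracy over repeated stratified k-fold cross-validation.

    Assumes every fold contains at least one sample of each class.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repetitions):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            X_tr, y_tr = oversample([X[i] for i in train_idx],
                                    [y[i] for i in train_idx], rng)
            beta = fit_logistic(X_tr, y_tr)
            pred = [int(predict_proba(beta, X[i]) >= 0.5) for i in test_idx]
            truth = [y[i] for i in test_idx]
            sens = sum(p == 1 for p, t in zip(pred, truth) if t == 1) / truth.count(1)
            spec = sum(p == 0 for p, t in zip(pred, truth) if t == 0) / truth.count(0)
            scores.append((sens + spec) / 2)
    return sum(scores) / len(scores)
```

Restricting the oversampling to the training folds keeps duplicated minority-class rows out of the held-out fold, so the held-out estimate is not inflated by samples the classifier has already seen.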

To identify the most sensitive combinations for early identification of PPD, the following nine feature combinations were evaluated: [1] baseline sociodemographic-anamnestic data alone, [2] mood scores, [3] stress scores, [4] mood-stress difference scores, [5] mood scores incl. changes (change to baseline and to preceding week), [6] stress scores incl. change scores, [7] mood-stress difference scores incl. changes, [8] combination of mood and stress scores incl. changes, [9] and combination of mood, stress, and mood-stress difference scores incl. changes. Combinations [1] to [9] were evaluated either alone or in combination with EPDS scores, MPAS scores or both. In addition, all combinations with features [2] to [9] were evaluated with and without inclusion of baseline sociodemographic-anamnestic information. The baseline sociodemographic-anamnestic information alone (i.e., feature combination [1]) served as null model for comparison with best performing models.

Balanced accuracies, sensitivities, specificities, positive and negative predictive values as well as ROC curves including the AUC were computed. The best performing combination (high balanced accuracy at earliest possible time-point) for each pair-wise comparison was selected for replication analysis. A logistic regression was computed for the selected combination using all participants. These results of the first cohort along with the validation plan were pre-registered on https://osf.io/ecmrp?view_only=6feb8e89818445a0b675621c8f22ba82. The obtained coefficients were applied to the prospectively collected validation cohort.
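For reference, the performance measures listed above can be computed from predicted and actual binary labels as in the following sketch. The function names are hypothetical, label 1 is assumed to denote the clinical group of a pair-wise comparison, and the AUC is computed with the rank-based (Mann-Whitney) formulation, which may differ in detail from the routines used in the study.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def binary_metrics(truth, pred):
    """Sensitivity, specificity, balanced accuracy, PPV and NPV for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
    sensitivity = _ratio(tp, tp + fn)
    specificity = _ratio(tn, tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }

def auc(truth, scores):
    """Probability that a random positive scores higher than a random negative.

    Ties count as one half (Mann-Whitney formulation of the ROC AUC).
    """
    pos = [s for t, s in zip(truth, scores) if t == 1]
    neg = [s for t, s in zip(truth, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))
```

Balanced accuracy is the mean of sensitivity and specificity, which makes it less sensitive than plain accuracy to the unequal group sizes in this sample.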

Application to the validation cohort

The selected and preregistered model as trained on the first dataset was then used to predict diagnoses in the independent validation cohort (Table S2). The class probability p for the validation cohort was obtained using the following standard logistic regression formula, where β denotes the coefficients and X the included features:

$$p = \frac{1}{{1 + e^{ - X\beta }}}$$
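As a worked illustration of this formula, the sketch below applies a fixed coefficient vector to one participant's features. The coefficients and scores are invented for the example and are not the pre-registered values from Table S2; storing the intercept as the first coefficient and using a 0.5 cut-off to assign a label are likewise assumptions.

```python
import math

def class_probability(features, beta):
    """p = 1 / (1 + exp(-X beta)); beta[0] is the intercept (assumption)."""
    linear = beta[0] + sum(b * x for b, x in zip(beta[1:], features))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients, NOT the pre-registered values from Table S2:
# intercept, EPDS at T0, EPDS at week 3, mood at week 3.
beta = [-4.0, 0.2, 0.3, -0.25]

# Illustrative scores for a single participant.
participant = [9.0, 12.0, 4.0]

p = class_probability(participant, beta)
label = "PPD" if p >= 0.5 else "HC"  # the 0.5 cut-off is an assumption
```

Because the coefficients are frozen after training on the first cohort, no information from the validation cohort enters the model at this step.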

For the validation cohort, we computed balanced accuracy, sensitivity, specificity, AUC, ROC, and positive and negative predictive value by comparing predicted versus actual group labels. To obtain a chance-level spread estimate for the classifier, we randomly permuted the “predicted” labels 1000 times across the validation cohort, recomputing all performance measures and their 95% confidence interval.
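The permutation procedure can be sketched as follows for balanced accuracy. This is a simplified illustration rather than the original code: the function names are hypothetical, only one performance measure is recomputed, and the 95% interval is taken as the simple 2.5th to 97.5th percentile range of the permuted scores, which is an assumption about how the interval was derived.

```python
import random

def balanced_accuracy(truth, pred):
    """Mean of sensitivity and specificity for 0/1 labels (both classes required)."""
    pos = [p for t, p in zip(truth, pred) if t == 1]
    neg = [p for t, p in zip(truth, pred) if t == 0]
    return (pos.count(1) / len(pos) + neg.count(0) / len(neg)) / 2

def chance_level(truth, pred, n_permutations=1000, seed=0):
    """Chance-level balanced accuracy from shuffling the predicted labels.

    Returns the mean permuted score and a percentile-based 95% interval.
    """
    rng = random.Random(seed)
    shuffled = list(pred)
    scores = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any link between prediction and diagnosis
        scores.append(balanced_accuracy(truth, shuffled))
    scores.sort()
    lower = scores[int(0.025 * n_permutations)]
    upper = scores[int(0.975 * n_permutations) - 1]
    return sum(scores) / len(scores), (lower, upper)
```

Shuffling the predicted labels preserves the number of predicted cases while removing their association with the true diagnoses, so the resulting distribution reflects what a classifier with the same prediction rate would achieve by chance.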

Results

Sociodemographic-anamnestic and baseline group comparisons

In the first cohort, PPD and AD were associated with personal (p < 0.001 for HC vs. PPD and HC vs. AD) and familial psychiatric history (p = 0.036 for HC vs. PPD, p = 0.009 for HC vs. AD), subjective birth-related psychological traumas (p = 0.024 for HC vs. PPD, p < 0.001 for HC vs. AD), and postpartum blues (p = 0.003 for HC vs. PPD, p < 0.001 for HC vs. AD) (Table 1, S1 and S2). A higher PMS prevalence (p = 0.012) and reduced breastfeeding at T4 (p = 0.021) were observed in PPD compared to HC. No differences were seen between PPD and AD. Similar effects were observed in the validation cohort for all sociodemographic-anamnestic factors (Table 1, Table S1; for odds ratios see Table S3).

Univariate analyses of the first cohort

The average participation over a total of 84 days of observation was 40 responses with a maximum of 45 responses, with no significant differences between the subsamples (HC: M = 40, max = 45; AD: M = 40, max = 44; PPD: M = 40, max = 45; F(2, 305) = 0.33, p = 0.717). Both PPD and AD showed a distinct pattern in weekly mood, stress, and mood-stress difference scores over the course of 12 weeks (significant time by diagnosis interactions – mood: F(13.8,1303) = 16.3, p < 0.001; stress: F(11.3,1026) = 9.85, p < 0.001; mood-stress difference: F(13.1,1162) = 17.3, p < 0.001) (Fig. 1A-C). The groups differed significantly in mood and mood-stress difference at all weeks (p = 0.004 for mood-stress baseline, all other p < 0.001) (see Tables S4 and S5). For stress, the difference was significant at all weeks except for baseline (all p < 0.001, see Table S6).

Fig. 1: Mood, stress, mood-stress difference, EPDS, and MPAS scores.

Weekly mood (A), stress (B), and mood-stress difference scores (C) incl. 95% confidence intervals, results of the simple effects analyses, and within-sample AUCs incl. 95% confidence interval for each group comparison. EPDS (D) and MPAS (E) mean scores and associated within-sample AUCs for each time point and group separately incl. their standard error and 95% confidence interval. Statistically significant t tests for group comparisons are marked with *.

PPD had significantly lower mood levels compared to HC at all weeks except for baseline (Fig. 1A). AD had significantly lower mood relative to HC from baseline until week 6 reaching the highest difference at week 2. PPD had lower mood compared to AD from week 4 through week 12. Stress levels were significantly higher in PPD compared to HC from week 2 through week 12 and compared to AD between week 5 and week 12. AD had higher stress levels relative to HC from week 1 until week 4 (Fig. 1B). Mood-stress difference differed significantly between HC and PPD from week 1 through week 12, between HC and AD from week 1 through week 6, and between PPD and AD from week 4 through week 12 (Fig. 1C).

Both EPDS and MPAS showed significant time by diagnosis interactions (EPDS: F(6.87,1034) = 34.4, p < 0.001; MPAS: F(5.35,805) = 8.24, p < 0.001) with a significant between-group difference at all weeks (all p < 0.001) (Fig. 1D, E). EPDS scores were significantly lower in HC compared to PPD and AD at all time-points (T0-T4) (p < 0.001). The difference between PPD and AD was significant from T2 until T4 with higher EPDS scores in PPD women (p < 0.001). MPAS scores were significantly lower at all time points (T1–T4) in PPD (p < 0.001) and AD (p < 0.001 for T1-T3, p = 0.008 for T4) compared to HC. Lower MPAS scores were observed in PPD compared to AD at T4 (p = 0.001).

Prediction in the first cohort

Next, we evaluated which combinations of sociodemographic-anamnestic, mood, stress, EPDS, and MPAS data allow for reliable differentiation between PPD, AD, and HC. The outcomes of all evaluated combinations are summarized in Tables S7–14. For differentiation of PPD from HC, a high balanced accuracy of 87% was achieved at week 3 using a combination of baseline EPDS and follow-up EPDS and mood levels at week 3 (Table 2, Fig. 2A, and Table S7). The best early differentiation between AD and HC with a 91% balanced accuracy was also achieved at week 3 using a combination of baseline EPDS and follow-up EPDS, MPAS and mood scores at week 3 (Table 2, Fig. 2B, and Table S8). A reasonable differentiation of AD and PPD with a balanced accuracy of 76% was only achieved at week 6 using only the mood levels (Table 2, Fig. 2C, and Table S9). Logistic regression coefficients were trained with these combinations using the first cohort and applied to predict the diagnostic labels in the validation cohort (Table S2). The null model (i.e. sociodemographic-anamnestic information alone) performed worse than the best performing models for all group comparisons (HC-PPD: BA = 0.72, HC-AD: BA = 0.75, AD-PPD: BA = 0.48; Table S9, Feature Combination 1).

Table 2 Results of prediction for the first and validation cohort.
Fig. 2: Results of machine learning analysis.

Balanced accuracy, sensitivity, specificity and out-of-sample AUC for each group comparison are displayed for the first cohort (A–C). For HC vs. PPD (A), the values are displayed for EPDS at baseline and follow-up incl. mood scores. For HC vs. AD (B), the values are displayed for EPDS at baseline, EPDS and MPAS at follow-up incl. mood scores. For PPD vs. AD (C), the values are displayed for mood scores. (D–F) AUCs obtained for the validation cohort are displayed for the classifier selected based on results from the first cohort alongside chance-level performance.

Prediction in the validation cohort

The validation cohort had an average participation of 37 responses with a maximum of 45 responses for the remote assessments with no differences between the subgroups (HC: M = 38, max = 45; AD: M = 38, max = 43; PPD: M = 34, max = 43; F(2, 190) = 1.51, p = 0.223). The classifier trained on the first cohort for differentiation of HC and PPD reached a high balanced accuracy of 93% in the validation cohort with a sensitivity of 88% and specificity of 99% (Table 2, Fig. 2D). The classifier differentiating HC and AD reached a balanced accuracy of 79% with a high specificity (98%) but only moderate sensitivity (60%) (Table 2, Fig. 2E). For PPD and AD differentiation, the selected classifier reached a balanced accuracy of 73%, again with high specificity (90%) but only low sensitivity (56%) (Table 2, Fig. 2F).

Discussion

Here, we adopted a within- and out-of-sample validation study design to identify combinations of sociodemographic-anamnestic and clinical factors allowing for early and accurate identification and differentiation of PPD and AD in two large cohorts of postpartum women. In both cohorts high accuracy was achieved at week 3 for identification of PPD and AD compared to HC using a simple combination of EPDS, mood, and MPAS (for AD) assessments. In contrast, differentiation of PPD and AD was possible only from week 6 based solely on mood levels.

In both cohorts, the prevalence of PPD was slightly lower than the 10–20 % reported in the literature23,24. As the focus of our study was on prediction of PPD, we purposely excluded women with manifest depression at the time of inclusion in the study, which may explain the lower prevalence. Furthermore, studies estimating early prevalence of PPD may have included women with AD. Although there is an increased risk for PPD within the first postpartum year25, meaning that some women may develop PPD after four to six weeks (i.e. late onset), this was not the case for our sample. In line with previous research, we found postpartum blues, psychiatric history, subjective birth-related psychological traumas, and PMS to be significant risk factors for PPD14,26,27.

Interestingly, no differences between the PPD and AD groups were found with respect to risk factors, suggesting that similar mechanisms may be involved in the generation of initial depressive symptoms in both groups. Over the observation period, stress levels continuously increased in women with PPD whilst they normalized after about five weeks in AD. Descriptively, mood levels in AD followed the stress levels normalizing only after about seven weeks. The temporal delay is in line with the interpretation that reductions in stress may contribute to the recovery observed in mood. The increase in stress levels and the simultaneous decline in mood levels in PPD may indicate the contribution of stress-mediated components in line with previous studies reporting parenting stress among the most important postpartum factors28,29. Whilst not a causal factor on its own, parenting stress is likely to increase vulnerability to depression in high-risk individuals.

Similarly, PPD and AD displayed distinct temporal courses of EPDS and attachment scores as measured by MPAS. The EPDS temporal dynamics were highly similar to the observed stress and mood levels. The initially lowest attachment scores were found to increase in AD while PPD maintained the low attachment levels throughout the study. These observations underscore the necessity of longitudinal monitoring of both measures to better characterize the dynamic relationship between depressed mood and maternal attachment30,31. Differences in MPAS and EPDS remained significant between AD and HC at all time points. According to recent findings, child neurodevelopment is affected by maternal depressive symptoms even when they do not exceed clinical thresholds32,33. Our observations emphasize the need for further detailed evaluation of potential consequences also for the AD group.

A combination of baseline EPDS and week 3 remote follow-up EPDS and mood scores achieved about 90% balanced accuracy for early identification of PPD as compared to HC. The same combination with addition of MPAS achieved a similar accuracy for early identification of AD. Both findings were largely confirmed in the validation cohort with an accuracy reduction from 90 to 80% seen only for differentiation of AD and HC. None of the evaluated combinations allowed for an accurate early differentiation between PPD and AD with all classifiers performing close to chance level until week 5. A reasonable differentiation of both groups was only achieved through mood scores at week 6 with a moderately high accuracy but a high specificity for PPD as confirmed in the validation cohort. Our classification results suggest that a simple stepwise procedure including remote mood, EPDS, and MPAS assessments may be a promising approach towards early identification of PPD. Whilst week 3 remote testing provided a high accuracy and a particularly high specificity for detection of both populations at risk, week 6 data additionally allowed for further differentiation between PPD and AD. In particular, the addition of mood scores led to a substantial increase in balanced accuracies for all group differentiations compared to all other feature combinations (e.g., addition of stress scores). Interestingly, the classifiers performed better in the out-of-sample prediction in several cases. As we applied a strict cross-validation procedure, the differences in prediction may simply reflect random variation in the accuracy of our model.

Three potential limitations need to be mentioned. First, as we did not register the reason for refusal during recruitment, we cannot exclude a bias based on the differences between women willing and women unwilling to participate. However, according to a recent study, there are no differences in motivation and willingness to participate between healthy controls and patients with psychiatric mood disorders34. Therefore, we do not expect any significant bias regarding the exclusion of women with PPD or AD based on their refusal to participate in the study. A potential bias introduced by the recruitment after childbirth vs. before childbirth may be a second limitation. However, the main goal of the current study was the identification of a risk group through a method that could be easily applied in routine care. Prediction before childbirth may be more difficult to incorporate into routine care as it may require the transfer of information between multiple institutions (e.g. gynecologist and hospital). Third, oversampling was applied only to the cross-validation in the first cohort, but not to the training of the classifier for prediction in the validation cohort, resulting in a potential bias of the logistic regression classifier due to asymmetric group sizes. However, the highly similar results for the cross-validation and the out-of-sample validation (with the out-of-sample results being even superior at times) indicate only a minor influence of the asymmetric group sizes on the outcomes of our study.

In summary, by means of a longitudinal approach we identify and validate combinations of remote assessments allowing for early and accurate identification and differentiation of PPD and AD using a step-wise procedure. By administering the EPDS and mood assessments in-clinic immediately after childbirth and a second assessment remotely after three weeks, these findings can be easily translated into routine care. The behavioral and clinical time courses over 12 weeks provided important insight into the development and interaction of mood, stress, and maternal sensitivity in the first weeks postpartum.