Introduction

Postpartum depression (PPD), defined as having an episode of minor or major depression during pregnancy or up to one year after giving birth, is a relatively common condition that affects 8–15% of new mothers in Sweden every year1,2. The etiology of PPD is not well understood, but the condition likely arises from a combination of psychological, psychosocial and biological factors3,4. The most well documented biological risk factors for PPD are hypothalamic–pituitary–adrenal axis dysregulation, inflammatory processes, genetic vulnerability, and allopregnanolone withdrawal4. The strongest psychosocial factors are previous depression, severe life events, some forms of chronic stress and relationship struggles4,5. The role of resilience and personality have been lately also gaining attention6,7.

PPD is a condition that can have devastating effects on the mothers, as well as their children8,9. Mothers may experience persistent doubts about their ability to care for the child, have difficulties bonding with their child, and also have thoughts about hurting the child2. Moreover, PPD can affect a child’s development by interfering with the mother-infant relationship10,11. For instance, children of mothers with PPD have greater cognitive, behavioral and interpersonal problems compared to children of mothers without PPD12,13. Despite PPD being a detrimental health condition for many women, numerous affected women fail to receive adequate care14. There exist several effective treatments and interventions for PPD14,15,16, but they are only cost-effective among high-risk women. The idea of prenatal prediction of PPD has existed for several years and early studies using more traditional methods attempted to predict women at risk by prenatal assessment of critical variables17. However, to date, there has been no effective way to predict women at risk for the development of depressive symptoms postpartum.

Traditional statistical methods allow researchers to estimate risks by sequentially analysing the associations mainly between two variables, often controlling for the effect of others. Further, machine learning (ML) methods enable researchers to iteratively and simultaneously analyse multiple interacting associations between variables18 as well as to devise data-driven predictive models that then can be evaluated by quantifying the performance metrics across all models in order to find the best predictive model. The power of ML allows for the analysis of complex non-linear relationships and even the integration and pooling of multiple different data-types from several sources19,20,21. Over the last decade, there has been a steady increase in the use of ML in medicine and its effects can be observed in many fields including oncology22,23,24,25, cardiology and hematology26,27, critical care28,29, and psychiatry30,31,32,33,34,35. Importantly, PPD represents a unique case in which a moderately high chance to develop a serious psychiatric condition is coupled with a very precise temporal prediction of when such symptoms are to be expected. As such, and considering PPDs substantial societal burden, ML-based risk classification can be applied in an ideal situation with high expected societal benefit. With approximately 120,000 annual births in Sweden and the typical prevalence of PPD at 12% among women who nearly in their entirety present with a multitude of adaptations after childbirth, close monitoring of the whole population for early depressive sentinels after childbirth seems hardly feasible in reality. In contrast, close follow-up among high risk groups during midwife or nurse-led postpartum assessments may strongly contribute to more tailored and cost-efficient maternal perinatal mental care services.

However, despite promising results in other fields, relatively few studies have been performed using ML in the field of perinatal mental health. An early study in the field could predict PPD with an accuracy of 84% by use of multilayer perceptrons and assessment of 16 variables36. A recent pilot study used ML algorithms applied to data extracted from electronic health records to show that ML models can be utilized to predict PPD and identify critical variables that conform with known risk variables such as race, demographics, threatened abortion, prenatal mental disorder, anxiety, and an earlier episode of major depression34. Another study also developed models to predict PPD, which were then integrated into a mobile application platform to be used by pregnant women37, while a recently published study compared four PPD prediction models that comprised demographic, social and mental health data38. In the latter study, psychological resilience was pointed out as an important predictive factor. However, these studies have been limited by either sample size or richness of data. Finally, in a recently published study, Zhang et al. proposed a machine learning based framework for PPD risk prediction in pregnancy, using electronic health record data39.

To date, our study is the first using a population-based, large and rich dataset, including a wide range of clinical and psychometric self-report and medical journal-derived variables and evaluating a range of different ML algorithms against each other, and also after stratification for earlier or pregnancy depression, to provide a robust screening tool, at discharge from the delivery ward, for predicting women at risk for developing depressive symptoms later in the postpartum period.

Hence, we aim to predict women at risk for depressive symptoms at 6 weeks postpartum, from clinical, demographic, and psychometric questionnaire data available after childbirth, by use of machine learning methods.

Results

Descriptive statistics

Table 1 shows summary statistics of the study population by depressive symptom status at 6 weeks postpartum. Results are presented as frequencies and relative frequencies within EPDS status [N (%)] or median (interquartile range) for sociodemographic, clinical and questionnaire variables. Of the 4313 participants in the study, 577 had depressive symptoms at 6 weeks postpartum. The mean age for both groups was 31 years. Differences were seen among women with depressive symptoms and women without depressive symptoms across sociodemographic variables like education, employment, and country of origin, as well as many other variables known as risk factors for postpartum depression. A greater proportion of women with depressive symptoms postpartum did not receive adequate support from their partner and were not breastfeeding.

Table 1 Characteristics of the study participants by depression status at 6 weeks postpartum (EPDSa score 0–11 vs. 12–30) (n = 4313).

Classification graphs

To evaluate whether ML can predict women with depressive symptoms, two datasets were used, namely the BP variables and the combined dataset, that includes the BP variables and three psychometric questionnaires (RS, SOC, and VPSQ). Performance of different ML models was first evaluated for the BP data (Fig. 1). The performance metrics for Ridge Regression, LASSO Regression, Gradient Boosting Machines, Distributed Radom Forests (DRF), Extreme Randomized Forests (XRT), Naïve Bayes and Stacked Ensembles models are shown. Balanced accuracy, NPV and AUC were quite similar across the models, with accuracy reaching 72% and AUC 79% for XRT. NPV was over 92% for all models. Sensitivity was quite low and together with specificity and PPV, they varied between the models. Sensitivity was highest for DRF at 84%, while only 65% for XRT; DRF had though the lowest specificity and PPV. The highest PPV was observed for Ridge Regression and Stacked Ensemble, at 41%.

Figure 1
figure 1

Evaluation of model performance in the dataset containing only background, medical and pregnancy-related variables (n = 4277 women). The models tested were Ridge Regression, LASSO Regression, Distributed Random Forest, Extremely Randomized Trees, Gradient Boosted Machines, Stacked Ensemble, and Naïve Bayes. Models were assessed for accuracy (ACC), sensitivity (SENS), specificity (SPEC), positive predictive value (PPV), negative predictive value (NPV), and area under the curve (AUC), the outcome being depressive symptoms at 6 weeks postpartum. The bars represent the level of performance measures (in percent) and the table below the bar plot presents the exact numerical values. Error bars represent one standard deviation from the mean.

Performance of different ML models was then evaluated for the combined dataset, even including psychometric measures (Fig. 2). Performance metrics for the same models showed that NPV was still over 90% for all models, but otherwise, similar levels of accuracy and AUC were observed. More variability among the models was observed for sensitivity, specificity and PPV. XTR had the highest accuracy (at 73%) and AUC (at 81%) among all models, with a balance in sensitivity at 72% and specificity at 75%; PPV was at 33% and NPV at 94%. As this balancing act is an essential attribute of predictive models based on imbalanced datasets the subsequent experimental analysis was provided using only XRT.

Figure 2
figure 2

Evaluation of model performance in the total combined dataset (n = 2385 women). The combined dataset contained the background, medical and pregnancy-related variables, as well as answers to the questionnaires Resilience-14, Sense of Coherence-29 and Vulnerable Personality Scale Questionnaire. The models tested were Ridge Regression, LASSO Regression, Distributed Random Forest, Extremely Randomized Trees, Gradient Boosted Machines, Stacked Ensemble, and Naïve Bayes. Models were assessed for accuracy (ACC), sensitivity (SENS), specificity (SPEC), positive predictive value (PPV), negative predictive value (NPV), and area under the curve (AUC), the outcome being depressive symptoms at 6 weeks postpartum. The bars represent the level of performance measures (in percent) and the table below the bar plot presents the exact numerical values. Error bars represent one standard deviation from the mean.

Comparative performance of the XRT model using all variables, the top 50%, and the top 25% variables, for both the BP and the combined dataset is shown in Fig. 3. There was an apparent trade-off between model sensitivity and specificity, which were both affected by dataset used and percent of variables included (Fig. 3). Sensitivity was highest with use of only 25% of the combined dataset, while specificity was highest with the use of the top 50% of the BP dataset. None among the other measures were greatly affected by either dataset used or percent of variables included (a trend to lower PPV when 25% of variables used was noted). The AUC curves corresponding to Figs. 2 and 3 are available in the supplementary material (Supplementary Figure 1).

Figure 3
figure 3

Comparative performance of the dataset containing only background, medical history and pregnancy-related variables (BP) and the combined dataset (BP + RS + SOC + VPSQ). The Extremely Randomized Trees (XRT) algorithm was used to compare the performance of the two datasets for predicting depression at 6 weeks postpartum. Models were assessed for accuracy (ACC), sensitivity (SENS), specificity (SPEC), positive predictive value (PPV), negative predictive value (NPV), and area under the curve (AUC). The variable selection procedure shows results when All (100%), Top 50%, and Top 25% of variables were retained, ranked according to Mean Decrease in Impurity (MDI) relevance.

The results for the performance of the XRT models after stratification for previous depression are shown in Fig. 4. For all women, XRT achieved a balanced accuracy of 73%, a sensitivity of 72%, a specificity of 75%, a positive predictive value of 33%, a negative predictive value of 94% and an AUC of 81%. For women with depression in pregnancy or earlier in life, XRT achieved a balanced accuracy of 69%, a sensitivity of 76%, a specificity of 61%, a positive predictive value of 44%, a negative predictive value of 87% and an AUC of 77%. For women without any previous depressive episode, balanced accuracy was 64%, sensitivity 52%, specificity 76%, positive predictive value of 13%, negative predictive value 97% and AUC of 73% (Fig. 4). Among the results from analyses of the individual questionnaires, no single one achieved an accuracy of more than 70% (Supplementary Figure 2).

Figure 4
figure 4

Stratified classification graphs for Extreme Randomized Forest (XRT) model, by pregnancy/previous depression status. Results presented for all women (All, n = 2385, of which 14% had postpartum depression, PPD), women with depression during current pregnancy or earlier in life (With Previous Depression, n = 971, of which 27% had PPD), and women without any previous depression episode (Without Previous Depression, n = 1414, of which 6% had PPD). For each category, models were assessed for accuracy (ACC), sensitivity (SENS), specificity (SPEC), positive predictive value (PPV), negative predictive value (NPV), and area under the curve (AUC).

Variable importance

The 25 most important variables by MDI based on Distributed Random Forests (DRF) models, considering the women with different previous depression status are shown in Fig. 5. For all women, Anxiety During Pregnancy and Depressive During Pregnancy stand out as the two most important variables (importance level above 0.7) (Fig. 5A). The variables following in importance were questions included in the psychometric instruments, except for history of depression. Similarly, for women with previous depression, Anxiety During Pregnancy and Depressive During Pregnancy stand out as important variables for the presence of depression postpartum (importance level above 0.9) (Fig. 5B). Finally, for women without depression, Anxiety During Pregnancy was the absolutely most important variable (importance level of 1) (Fig. 5C). Even here, variables relating to resilience, sense of coherence and personality followed, but interestingly, variables such as breastfeeding, BMI, traumatic events in childhood, mode of delivery, hypoxia in the newborn and age place among the top 25 variables.

Figure 5
figure 5

Ranked importance of the assessed variables using the Extremely Randomized Trees (XRT) models in the combined dataset, considering the women with different previous depression status. Results presented for all women (A), All women (n = 2385), (B) women with depression during current pregnancy or earlier in life (Previous/pregnancy depression, n = 971), and (C) women without any previous depression episode (No previous depression, n = 1414). The graphs depict the variable importance as a relative measure that is scaled to a maximum of 1.0. The x-axis represents the relative contribution to the classification algorithm of the corresponding feature on the y-axis.

The 25 most important variables based only on BP variables for all women (n = 4313) can be found in Fig. 6. The two variables that have an importance level above 0.9 are again Depression During Pregnancy and Anxiety During Pregnancy. The next variable with an importance level above 0.3 is Depression History, while the remaining rate below 0.2.

Figure 6
figure 6

Ranked importance of the assessed background, medical history and pregnancy variables for all women (n = 4277) using Extremely Randomized Trees (XRT) models. The top 25% of the variables are reported. The x-axis represents the relative contribution of the corresponding variable to the classification algorithm.

Including only the top 20 variables, the AUC is only reduced by 1% to 0.79, including just 10 variables reduced the AUC by 2% to ~ 0.78, while after including just 5 variables reduced the AUC by 3% to ~ 0.77. For the previously non-depressed group, including 10 variables gives an AUC of 0.72, and 5 variables an AUC of 0.71.

Discussion

In this study, we evaluated a range of different machine learning (ML) methods to predict pregnant women at risk for postpartum depressive (PPD) symptoms. The classification performance of the chosen ML algorithms was not significantly different in regard to accuracy, NPV, AUC measures. However, variations were more pronounced in regard to sensitivity, specificity and PPV. In general, as expected, an inverse relationship is observed in performance with respect to sensitivity and specificity. Furthermore, PPV is considerably lower than NPV due to low prevalence of PPD, as expected.

Overall, XRT provides robust performance with highest accuracy and well-balanced sensitivity and specificity. Addition of resilience and personality self-reported variables to the background, medical history and pregnancy-related variables provides marginal improvement in both accuracy and AUC. It is nevertheless of note that these extra variables boost the sensitivity of the XRT model substantially for only a slight drop in specificity. As this does not depend on the lower sample size used for the second step of analyses involving personality and resilience measures, it could be hypothesized that there is either a certain redundancy between variables, e.g. that low resilience is a core feature among depressed patients during pregnancy, or that anxiety and depression measures, available for all patients, have such a strong predictive value that the further addition of variables does not greatly improve accuracy.

These results suggest a possible benefit of using ML to screen new mothers at discharge from the delivery ward in order to identify those at high risk for postpartum depressive symptoms. However, because of the low PPV across all models, due to the relatively low prevalence of PPD at 12%, one would expect that many women identified at high risk would in the end not get depressed. On the other hand, these methods may nevertheless permit the identification of a high-risk group, to which preventive interventions would be offered in a cost-effective way, mainly by avoiding large costs related to full-blown depressive episodes postpartum. These could include the provision of extra support as well as more focused and longitudinal assessments in these mothers. Furthermore, the variables included in the BASIC study refer to easily acquired web-based self-reports, which support their use for screening purposes. Because of the high NPV, we would not expect many women not identified as high risk to develop depression postpartum. As such, the application of our classification algorithms would boost cost-effectiveness, allowing for a tailored resource allocation towards the mothers initially identified at risk versus a more widespread follow up of all mothers; in the low-risk group, assessments could be limited to single timepoints, as is praxis today. As PPD affects more than 16,000 families every year in Sweden alone, with high associated costs, estimated at $30,000 per mother-infant pair for untreated peripartum mood disorders, preventive efforts would have substantial societal benefits40.

It is interesting that performance metrics, especially accuracy and AUC, remain stable even when the number of variables used in the models is reduced from 100 to 50% and even to 25% of all variables available, and AUC is relatively stable even at 5–10 variables. As discussed above, this is in line with the thought that there is some redundancy when it comes to the variables included, with depression and anxiety during pregnancy being highly correlated with some background and medical history variables, and possibly mediating their association with PPD. It is thus intriguing to observe that only among non-previously depressed, variables such as breastfeeding, BMI, traumatic interpersonal events in childhood, mode of delivery, infant hypoxia and age are emerging as important for prediction, along with resilience and personality variables, which are otherwise more prominent among those earlier depressed. This is important to have in mind when developing screening strategies; the variables used might need to be adjusted for the group of women with previous depression. Anxiety during pregnancy continues to be very predictive in both groups. The stability of the performance measures however, indicates that an abbreviated survey can be used to screen without significantly affecting predictive power.

Among possible explanations for the somewhat lower accuracy in both the depressed group (earlier or during pregnancy) (n = 971, accuracy = 69%) and never-depressed subgroups (n = 1414, accuracy = 64%) are the lower sample sizes as well as a relatively decreased variability in the data (the algorithms did not have a big number of examples of alternatives to learn from). Sensitivity is the same in the earlier depressed group, but drops to 52% in the never depressed group, underlining the difficulty in identifying women at high risk for having their first ever depressive episode after childbirth. In general, the high NPV figure in the never earlier depressed group means that women with a negative screening in that group do not need tighter follow-up; NPV nonetheless drops to 86% in the earlier depressed group, suggesting that further screening in the postpartum period might still benefit this high-risk group of women.

Our study showed a slightly higher AUC than most earlier studies’ best prediction models (79% by Wang et al. and 78% by Zhang et al.), though our accuracy of 73% is lower than the 84% reported by Tortajada et al.34,36,38. However, in the latter study, the main outcome was depression at 32 weeks and not at 6 weeks postpartum, genetic data was included and the study sample was more homogeneous since it consisted of SSRI-free Caucasian women. Moreover, a lower EPDS cut-off was used followed by clinical interviews, possibly reducing the risk of misclassification of study cases and controls. Nevertheless, in our study, a clinical evaluation was not possible for practical reasons, due to the much larger study population. Finally, in addition to clinical and environmental variables, information on related gene polymorphisms was also utilized in that study.

Furthermore, Wang et al. identified race, obesity, anxiety, depression, different types of pain, and antidepressant and anti-inflammatory drug use during pregnancy as the most important variables for their prediction models34. These variables differed somewhat from the ones we identified as being most important with the caveat that our model also indicated that anxiety during pregnancy and depression history or depressive symptoms during pregnancy were overwhelmingly the most significant predictors for PPD. It has to be noted that we included many psychometric measures, which followed in importance, e.g. the question 19 on the SOC scale “Do you have very mixed up feelings and ideas?” and question 4 on RS, which measures self-regard (“I am friends with myself”). The population in the BASIC study is quite homogeneous, most participants having a high education, are quite healthy and born in the Nordic countries. Further, the BASIC dataset has no information on race. BMI was also identified in our study as an important variable, both in the BP dataset analysis and the sub-analysis among women without previous depression. Rates of antidepressant use are low. Differences in the analytical approach might also account for some differences in the results.

These findings further illuminate the difficulties in predicting which women will go on to develop postpartum depressive symptoms after childbirth. From the variable importance plots, the most predictive variables for postpartum depressive symptoms, available at the time of discharge from the delivery ward, is to either have anxiety or depressive symptoms during pregnancy. In fact, these two variables are by far the most predictive, along nevertheless with distinct variables related to resilience, sense of coherence and personality. The predictive algorithms reach an accuracy for the whole group of 73% and AUC of 81%, which is at the limit for possible use in clinical settings. The algorithms might need to be different according to whether women had experience depression before in life. Further studies, possibly using more advanced methods and bigger samples, are warranted.

Very recently, Zhang et al. also proposed a machine learning based framework for PPD risk prediction using electronic health record (EHR) data39. While the techniques employed are comparable to our study with similar processing pipeline, they report higher AUC. This increment can be attributed majorly to the substantially larger cohort used in their study. Several ML studies have demonstrated that large datasets lead to lower estimation variance and hence provide better predictive performance. Furthermore, the top predictors also differ between our study due to differences in data sources. Additionally, a PPV higher than that reported in our study would significantly increase the clinical utility of our proposed framework. However, PPV is directly related to the prevalence of PPD in the population studied, which is only about 12%. While the classification threshold of the model can be adjusted to improve PPV, it does not ensure the expected benefit as other evaluation metrics, like sensitivity, specificity and NPV, would be adversely affected. Even Zhang et al. that reported higher AUC values, only report a PPV of ~ 27% for the validation site with prevalence of 6.5%, highlighting the issue39.

The lack of effective ways that would allow for early prediction of women at risk for depressive symptoms in the postpartum period has been addressed in the Introduction. In fact, the Edinburgh Postnatal Depression Scale is nowadays used as a screening tool for current depression41. National guidelines in several countries recommend screening for PPD at 6 to 8 weeks postpartum; however, the suggested target groups of women to be screened vary between countries42,43,44. Also, the use of the EPDS at this time is used to screen for concurrent depression. In contrast, the role of EPDS in pregnancy, in combination with other variables, for early identification of women at risk for development of depressive symptoms later in the postpartum period has not been studied. In our study we do show that high EPDS scores in pregnancy are highly predictive of postpartum depression.

This study had numerous strengths. First, it addresses a novel field, as there are very few studies in the area, none from the Nordic countries, and none of earlier algorithms is being widely used in clinical practice. The large sample size allowed us to train a robust range of different ML algorithms. The richness of the BASIC dataset provided us with the opportunity to investigate the predictive power of a large number of background, medical history, pregnancy and delivery related variables, as well as psychometric questionnaires; the last ones both as total scores but also at individual item level. A key novelty feature of the study in the inclusion of many resilience and personality-related variables, that have been identified in the literature but not included in previous models. We also explore the importance of variables in terms of their predictive power of PPD, an effort directed towards to designing a compact survey to screen for PPD. Finally, the analysis of clinically relevant sub-groups such as women with previous depression or depression during pregnancy gave clinically useful insights.

Some limitations of the study include the non-representative sample in that women born in Scandinavia, with a high education and cohabitating with the child’s father were over-represented in the cohort, which makes the findings difficult to generalize to the background population. Sources of selection bias are the exclusion of non-Swedish speaking women as the questionnaires were only offered in the Swedish language, and the fact that more healthy women are more prone to participate in studies of this kind. Not all women self-reported on all variables, but we addressed this problem of missing values with exclusions and imputations where appropriate. Class imbalance in the outcome made the training stages of the algorithms challenging but were also addressed appropriately. Lastly, theoretically, some items from the scales on personality (SSP), and attachment (ASQ) might have had a more prominent role in prediction if they would have been available for a larger proportion of the women in this study. The study by Zhang et al., published after our study was conducted, reported higher AUC and included some predictors lacking in our study39. Future studies should make sure to include these important predictive variables for further evaluation.

Depressive symptoms and anxiety during pregnancy are highly predictive factors for women who go on and develop postpartum depressive symptoms, while variables relating to resilience, sense of coherence and personality also play a modest role. The predictive algorithms have relatively good accuracy and AUC, with XRT performing best.

Methods

Data sources

Data for the development of the prediction models were obtained from the “Biology, Affect, Stress, Imaging and Cognition during Pregnancy and the Puerperium” (BASIC) study. BASIC is a population-based prospective cohort study at the Department of Obstetrics and Gynaecology at Uppsala University Hospital, Uppsala, Sweden7. Between September 2009 and November 2018 all pregnant women who were 18 years of age or older, did not have their identities concealed, had sufficient ability to read and understand Swedish and did not have known bloodborne infections and/or non-viable pregnancy as diagnosed by routine ultrasound were invited to participate in the study45. Data acquisition in the BASIC study was mainly based on online surveys and questionnaires that the women were asked to fill out during pregnancy at the 17th and 32nd gestational week and at 6 weeks, 6 months and 12 months postpartum. The surveys included questions about background characteristics, such as sociodemographic variables, psychological measures, medical information, information on reproductive history, lifestyle and sleep. All questionnaires were self-reported and web-based. Data are also retrieved from the medical journals. The participation rate for the study was 20% but the cohort had a relatively low attrition rate, with 71% of the participants remaining in the study at 12 months follow-up45.

This study focuses on two subsets of variables from the BASIC study: and (i) background, medical history and pregnancy/delivery variables (BP) and (ii) further psychometric questionnaires (information on exact assessment methods and coding is provided in Table 1 for the background variables and Supplementary Table 1 for the exact questions in the different questionnaires). The BP variables consisted of sociodemographic and lifestyle information, self-reported health, medical history and variables relating to pregnancy and childbirth. This dataset included even information on depression and anxiety symptoms during pregnancy. Depression symptoms were assessed by a score of 12 or more on the Edinburg Postnatal Depression Scale (EPDS) in pregnancy weeks 17, 32 or 38, while anxiety during pregnancy was defined as ratings in the highest quartile on either the State Trait Anxiety Inventory (STAI)46, the Beck Anxiety Inventory or the anxiety subscale of the EPDS (EPDS-3A). These variables were available for the majority of the BASIC participants. The total number of interpersonal and non-interpersonal events in the Lifetime Instances of Traumatic Events Scale (LITE)47 was also included among BP variables. The BP variables consisted of continuous, discrete, nominal and ordinal categorical variables, measured at various time points during the study.

The extra psychometric scales used were the Attachment Style Questionnaire (ASQ)48, the Resilience-14 scale (RS)49,50, the Sense of Coherence Scale-29 (SOC)51, the Vulnerable Personality Style Questionnaire (VPSQ)52,53, and the Swedish Scale of Personalities (SSP)54. ASQ, RS, SOC, VPSQ, and SSP were filled out at gestational week 17 or 32, VPSQ and LITE assessments were conducted at 12 months postpartum. All variables were assessed on a Likert scale and coded as ordinal variables. These scales were used for only specific period of time during the course of the BASIC project, different for each scale, and are thus available for different number of women (Table 1)45.

Additionally, the participants of BASIC study were also asked to fill out the EPDS at different time-points during and after pregnancy. The outcome in this study was EPDS score at 6 weeks postpartum, assessing the degree of self-reported depressive symptoms in the early postpartum period. The discrete scores for this timepoint were then aggregated and a cut-off of a score of 12 or higher was used to indicate women with depressive symptoms, in accordance to validation studies for the Swedish population55. The number of women in the BASIC study who had completed the EPDS at 6 weeks postpartum and were thus included was 4313.

Ethics declarations

The study has been approved by the Research Ethics Board in Uppsala (Dnr 2009/171, with amendments). All participating women gave written informed consent before being included in the study. All methods were carried out in accordance with relevant guidelines and regulations.

Data pre-processing

The pre-processing consisted of splitting the original BASIC dataset into different subsets. Two subsets were retained for our study, i.e. background & pregnancy (BP) data and psychometric questionnaire data. Data for twins and women with multiple pregnancies were removed from the dataset, as these are relatively rare, are followed very closely during and after childbirth, and are associated with higher risk for PPD56,57. Explorative data analyses were conducted on individual variables to check their distributions and to identify and remove outliers that were assessed to be non-informative. Psychometric questionnaires and BP variables that contained information about the women after the time point of the outcome, namely 6 weeks postpartum, were also excluded to avoid inadvertent biases of the results.

SSP was omitted from the analysis due to large number of missing observations, as this survey was used only for few years during recruitment for the BASIC study45. Its inclusion would have resulted in a much smaller sample size for the final analysis.

The dataset consists of continuous, nominal and ordinal variables. As continuous variables in the dataset have varying scales, normalization is performed to transform all the variables to a common range from 0 to 1. Furthermore, nominal and ordinal variables that represent non-numerical values are encoded using binary numerical representations for improving the performance of the ML algorithms.

Data imputation

As missing values can drastically impact the performance of ML models, a conservative approach was adopted to handle them. Firstly, samples (rows, corresponding to one pregnancy) with more than 50% missing values in the included variables were eliminated, and the final number of pregnancies in the ML analyses was 4277. Next, variables (columns, corresponding to a distinct variable) with more than 25% missing data were also eliminated. Finally, the remaining missing values were imputed from the available data. While continuous variables were imputed using multivariate imputation by chained equations (MICE)58, categorical and ordinal variables were imputed with K nearest neighbors’ imputation59.

Modeling

Classification techniques

With ML algorithms, there is no one-size-fits-all solution, making it imperative to try multiple alternatives. Consequently, this study explored different ML algorithms for supervised classification that modeled data in different ways. In order to present a comprehensive comparison, the following algorithms were implemented: Ridge Regression, LASSO Regression, Gradient Boosting Machines, Distributed Radom Forests, Extreme Randomized Forest, Naïve Bayes, and Stacked Ensembles. Ridge Regression specializes in analysing multiple regression data with multicollinearity, while LASSO Regression is a type of linear regression that shrinks data values towards a central point, and results in simple, sparse models (i.e. models with fewer parameters). Gradient Boosting Machines (GBM) and Random Forests are ensemble learners. In Distributed Radom Forests (DRF), a subset of features is used to determine the most discriminative thresholds to split the trees on. However, unlike DRF, where one builds an ensemble of deep independent trees, in GBM, we specify an ensemble of weak, shallow successive trees, where each tree is learning and improving on the previous tree. In Extremely Randomized Trees (XRT), instead of using the most discriminative thresholds for the splits, thresholds are drawn at random for each feature and the best of these random thresholds are used as the splitting rule, resulting in lower variance but more bias. XRT are similar to DRF with the caveat of more randomness. Naïve Bayes (NB) is a probabilistic classifier based on Bayes’ Theorem. The NB works under the assumption that the presence of any particular feature for a certain outcome is unrelated to the presence of any other feature for that outcome. Thus, despite if the features depend on each other or upon the existence of other features, the NB assumes that all of the features independently contribute to the outcome probability. Stacked Ensemble learns a new model by combining predictions of existing models. Stacked Ensembles are a class of supervised learning algorithms that work by training a meta-learner to find the optimal combination of base learners. Unlike bagging and boosting were the goal is to stack a number of weak learners together, the goal is to stack a number of diverse and strong learners together to optimize learning60.

For all the classification algorithms, the outcome measure was the participants’ EPDS score at 6 weeks postpartum represented as a binary variable with 12 as cut-off, while predictor variables included the BP variables and psychometric data described above.

Class imbalance

The BASIC dataset, as a population-based sample and in accordance to clinical situations, is predominantly composed of data from women who did not experience PPD at 6 weeks postpartum (less than 10% of the women representing PPD cases), consequently leading to extreme data class imbalance. ML classifiers trained on such imbalanced datasets usually generate biased results. To mitigate this imbalance, the minority class consisting of women with PPD was oversampled during ML training. Unlike under sampling of majority class consisting of women without PPD, this approach avoids loss of information and leverages all the samples from both classes.

Evaluation metrics

The performance of model prediction of the ML classification algorithms was evaluated using a variety of performance metrics. The performance of each classification model was captured by the Confusion Matrix that formed the basis for other metrics. In addition to the most commonly used classification accuracy, sensitivity (true positive rate) and specificity (false positive rate) are also reported. The positive predictive value (PPV) and negative predictive value (NPV) are also reported. Additionally, a Receiver Operating Characteristic (ROC) curve was specified for each classification to show the relation between the true positive rate and false positive rate. The performance of the classifiers was then summarized by the total area under the ROC curve (AUC), with the higher the AUC (between 0 and 1) indicating a better performance of the classification.

Variable (feature) importance/selection

The success of a ML algorithm does not only depend on good predictive performance but also on generalizability and easy interpretability. Identifying variables that have significant impact on the outcome is valuable, especially in the medical domain. Variable importance using Random Forests models can be calculated using Gini Importance or Mean Decrease in Impurity (MDI)61. The MDI relevance of a variable is obtained by calculating how effective the variable is at reducing the uncertainty when creating decision trees. The variable that is most effective and used the most will be ranked as most important.

Analytic strategy

The analytical strategy consisted of breaking the analysis down into steps and iteratively building towards a final classification model, all the while being cognizant of any potential biases introduced by the approach. The workflow is presented in Fig. 7. First, the raw data was split into the BP and the different psychometric questionnaires datasets in order to build predictive models independently on each psychometric questionnaire and to identify the ones with the highest accuracy for classification of PPD. Second, the psychometric questionnaires that yielded the highest accuracies were combined with the BP dataset. Predictions were then performed with the aggregate data (combined dataset). Additional models were trained with reduced datasets resulting from variable selection. Top 50% and top 25% variables with MDI were used to train separate classification models to determine the relative contribution of those variables to the prediction. Additionally, stratified analyses were performed, where participants were stratified by a previous history of depression (defined as earlier depression, earlier contact with psychiatrist/psychologist, or depression during pregnancy).

Figure 7
figure 7

Study workflow and analytical strategy. Data were obtained from the “Biology, Affect, Stress, Imaging and Cognition during Pregnancy and the Puerperium” (BASIC) study, a population-based prospective cohort study in Uppsala, Sweden. Data included in our study comprised (i) background, medical history and pregnancy-related variables (BP) from women, and (ii) further psychometric questionnaires, available at discharge from the delivery ward. The data were processed and either were used to test models or train the machine learning algorithms, to predict depressive symptoms at 6 weeks postpartum.

Based on preliminary analyses, SSP and ASQ did not provide any information gain relative to BP data. Hence, only RS, SOC and VPSQ variables that provided predictive performances comparable to BP variables were included in the aggregate analysis.