Introduction

Attrition, the loss of participants from the initial recruitment sample who do not return for subsequent follow-ups, is one of the most challenging problems faced by researchers managing cohorts1. Importantly, a cohort affected by attrition may have the validity of its results questioned, as attrition introduces selection bias when it is related to the outcome of interest2,3.

Efforts to tackle attrition in cohorts have concentrated on two main actions: preventing its occurrence and developing statistical methods to alleviate its consequences in data analysis1. For the latter, regression imputation, inverse probability weighting, and multiple imputation are some of the available techniques4,5,6. To prevent or diminish the loss of participants during the study, retention strategies have been widely implemented, such as voucher incentives, reminders, birthday cards, and reimbursement of transport costs7. However, conflicting results on the effectiveness of these strategies7,8 suggest that there may not be a single solution for all types of cohorts, settings, and participants, but rather that specifically tailored strategies are required.

Birth cohorts of high-risk children, like those born very preterm (< 32 weeks of gestation), play an important role in providing a comprehensive assessment of the needs and development of these children across their lifespan9. Very preterm infants experience an increased burden of long-term adverse outcomes, such as cognitive and behavioural problems, compared with children born at term10. Hence, this type of cohort may provide valuable scientific evidence that, ultimately, will contribute to improving clinical care, supporting public health decisions, and planning health and education provision for these children11.

An early and precise identification of which participants are at increased risk of dropping out may be of great benefit. Conventional statistical methods, such as Logistic Regression, have been the usual choice to predict attrition in cohorts12,13,14. However, these classical theory-based models are constrained by independence, additivity and linearity assumptions, which may oversimplify complex relationships between predictors and outcome variables15.

The growing access to clinical data and the rapid advances in machine learning have raised great enthusiasm about its use to improve clinical care over the past decade16, and its applications in epidemiological research and practice are increasingly reported17. In addition, machine learning methods may bring advantages over conventional approaches: they offer highly flexible algorithms that often do not require underlying distributional assumptions or model specification, and they can adapt to complex non-linear and non-additive interrelations between outcome and covariates18. However, reports of machine learning techniques employed to address methodological challenges in epidemiological studies remain scarce.

In this study, we developed predictive models of attrition in a birth cohort of very preterm infants, applying a conventional regression model and different machine learning methods, and identified the most relevant predictors of attrition.

Methods

Study population

The study population consisted of Portuguese children participating in the prospective population-based Effective Perinatal Intensive Care in Europe (EPICE) cohort, which included all very preterm births (between 22 + 0 and 31 + 6 weeks of gestation) in 2011/12 in 19 regions of 11 European countries19. In Portugal, there were 724 very preterm live births occurring in this period in the two geographic regions (Northern and Lisbon and Tagus Valley) included in the cohort20. This study included all infants discharged alive from Neonatal Intensive Care Units (NICUs) whose parents provided written informed consent to participate in the EPICE cohort in Portugal (EPICE-PT) and to long-term follow-up, resulting in 544 children (89.6% of 607 eligible participants)19. We excluded two infants who died after discharge, leaving 542 participants for the analysis. Participants' baseline data were extracted from medical charts by health care professionals using a pretested standardized questionnaire19. In this study, we focused on the first four years of follow-up (follow-up 1 to follow-up 4), in which questionnaires on the child's health and development were administered to parents by telephone (follow-ups 1, 3 and 4) or by postal questionnaire (follow-up 2).

Outcome

The outcome of interest was attrition, i.e., non-participation in offered follow-ups. Attrition was identified when the participant (a) could not be reached by any available contact (including a relative’s contact), (b) repeatedly postponed the call to answer the questionnaire, (c) verbally refused to participate in that specific follow-up, (d) verbally requested to withdraw from the cohort, or (e) did not mail the questionnaires back, even after several reminders (follow-up 2). Attrition at each follow-up was calculated over the eligible participants, i.e., excluding deaths and/or previous formal refusals. Participation was recorded when parents accepted the invitation for that specific follow-up and answered the questionnaires (either totally or partially) through any available method.

Predictors

Predictors were taken from information collected at baseline and from questionnaires completed at the three subsequent follow-ups. Based on the literature and the experience of the researchers involved in the cohort, we selected a list of demographic, socioeconomic and clinical characteristics that are likely to be important predictors of attrition (Supplementary Table 1). The decision not to include all predictors available in the cohort dataset was taken to mitigate the curse of dimensionality21, diminish computational costs, prevent overfitting22, and increase the usability of the model in similar cohorts.

Model development

Two predictive modelling frameworks were developed: (1) “Baseline”, where attrition at each of the first four follow-ups was predicted independently using baseline data only, and (2) “Incremental”, where baseline variables were used to predict attrition at follow-up 1 and, from then on, new predictors extracted from each completed follow-up were successively added (e.g., baseline plus follow-up 1 to predict attrition at follow-up 2; baseline plus follow-ups 1 and 2 to predict attrition at follow-up 3, and so on). For the first follow-up, both frameworks are equivalent.
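To make the two frameworks concrete, the sketch below illustrates how the predictor set grows in the incremental framework while staying fixed in the baseline one. It is illustrative only; the variable names are hypothetical placeholders, not the cohort's actual fields.

```python
# Hypothetical sketch of how predictor sets are assembled in the two
# frameworks; variable names are placeholders, not the cohort's actual fields.
baseline_vars = ["gestational_age", "birthweight", "maternal_age", "sex", "region"]
followup_vars = {
    1: ["health_status_fu1"],      # collected at follow-up 1
    2: ["development_score_fu2"],  # collected at follow-up 2
    3: ["health_status_fu3"],      # collected at follow-up 3
}

def predictor_set(target_followup, incremental):
    """Predictors used to model attrition at a given follow-up (1-4)."""
    if not incremental or target_followup == 1:
        # Baseline framework: every follow-up is predicted from baseline only.
        return list(baseline_vars)
    # Incremental framework: baseline plus all follow-ups before the target.
    extra = [v for fu in range(1, target_followup) for v in followup_vars[fu]]
    return baseline_vars + extra

print(predictor_set(3, incremental=True))
```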

To test the models' performance on new data, we used, for each year, five repetitions of a hold-out method, resampled with replacement23. In each of the five runs, the whole dataset was randomly split into a training set (80%) and a testing set (20%). Most machine learning algorithms have a set of parameters that may be adjusted to obtain a good model (parameter tuning). We adopted a wrapper approach24 to estimate the best combination of parameter values: the training set was further split into a tuning-training set (95% of the original training set) and a tuning-test set (5% of the original training set). The result of the wrapper is the combination of parameter values that produced the best AUC-PR (area under the precision-recall curve) value on the tuning-test set. This combination was then used to train the model on the full training set, and the model was finally evaluated on the test set.
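The study ran this protocol in WEKA; as an illustration only, the scikit-learn sketch below reproduces its logic under the stated assumptions: five random 80/20 hold-out splits, each with an inner 95/5 tuning split in which a simple wrapper search selects the hyperparameters with the best AUC-PR (computed here as average precision).

```python
# Sketch of the evaluation protocol described above (illustrative; the study
# itself used WEKA, not scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

def repeated_holdout(X, y, param_grid, n_repeats=5, seed=0):
    """Repeated 80/20 hold-out with an inner 95/5 wrapper-style tuning split."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_repeats):
        # Outer split: 80% training, 20% testing.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.20, stratify=y, random_state=rng)
        # Inner split: 95% tuning-training, 5% tuning-test.
        X_tt, X_tv, y_tt, y_tv = train_test_split(
            X_tr, y_tr, test_size=0.05, stratify=y_tr, random_state=rng)
        best_params, best_score = None, -np.inf
        for params in param_grid:  # e.g. [{"n_estimators": 100}, {"n_estimators": 500}]
            model = RandomForestClassifier(**params, random_state=0).fit(X_tt, y_tt)
            score = average_precision_score(y_tv, model.predict_proba(X_tv)[:, 1])
            if score > best_score:
                best_params, best_score = params, score
        # Retrain with the winning parameters on the full training set; test once.
        final = RandomForestClassifier(**best_params, random_state=0).fit(X_tr, y_tr)
        scores.append(average_precision_score(y_te, final.predict_proba(X_te)[:, 1]))
    return float(np.mean(scores)), float(np.std(scores))
```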

The prevalence of the outcome (attrition) in the various follow-ups of the EPICE-PT cohort ranged from 13 to 25%. Hence, we had a set of imbalanced datasets, which makes models prone to bias towards the majority class. To cope with this problem, the Synthetic Minority Over-Sampling Technique (SMOTE)25 was applied to mitigate the imbalance of the datasets.
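The study applied WEKA's SMOTE filter; the sketch below illustrates the same idea with the imbalanced-learn library on toy data (an assumption for illustration, not the study's actual pipeline). SMOTE synthesizes new minority-class examples by interpolating between a minority instance and one of its k nearest minority neighbours, and should be fitted on the training split only, so that no synthetic samples leak into the test set.

```python
# Illustrative SMOTE on toy data with ~15% minority class, roughly matching
# the attrition prevalence reported in this study. Not the study's pipeline.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X_train, y_train = make_classification(
    n_samples=500, weights=[0.85, 0.15], random_state=0)

# Each synthetic point: x_new = x_i + lam * (x_neighbour - x_i), lam ~ U(0, 1),
# where x_neighbour is one of the k nearest minority neighbours of x_i.
smote = SMOTE(random_state=0)  # default k_neighbors=5
X_bal, y_bal = smote.fit_resample(X_train, y_train)
print(Counter(y_train), Counter(y_bal))  # classes balanced after resampling
```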

Classification methods

Different classification methods were leveraged to build the predictive models. Selected machine learning methods included AdaBoost, Artificial Neural Networks, K-Nearest Neighbours, Decision Tree Classifiers (Functional Trees, J48 and J48Consolidated), and Random Forest. We also applied Logistic Regression, using identical predictors and no interaction terms. A short explanation of each method is given below:

AdaBoost is one of the most popular boosting algorithms, a group of methods that produce a classifier as a linear combination of weak classifiers, and does so in a way that minimizes exponential loss over such linear combinations26. A weak classifier can be described as one whose error rate is only slightly better than random guessing15.
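Formally (a standard textbook formulation, given here for completeness rather than reproduced from the cited reference), AdaBoost fits an additive model of M weak classifiers h_m with weights α_m by minimizing the exponential loss:

```latex
F(x) = \sum_{m=1}^{M} \alpha_m h_m(x), \qquad
L(F) = \sum_{i=1}^{N} \exp\!\left(-y_i F(x_i)\right), \quad y_i \in \{-1, +1\}
```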

Artificial Neural Networks are nonlinear statistical models that extract linear combinations of the predictors as derived features and then model the outcome as a nonlinear function of these features. This learning method, inspired by neuroscience, is quite robust to noise in the training data15,27.

K-Nearest Neighbours models are based on an observation's neighbourhood in the feature space. They use the nearest observations, according to a distance measure, to predict the classification outcome of a new observation28.

Decision Tree Classifiers (Functional Trees, J48 and J48Consolidated) are a group of algorithms that use a binary recursive partitioning of the instance space29. This is an iterative process of splitting the data into partitions and then splitting each branch further, aiming to divide the data into smaller, more homogeneous groups. Because a single tree fully reveals its partition of the feature space, this approach allows for great flexibility in data analysis and interpretability15,29.

Random Forest algorithms are an extension of bagging30, an ensemble learning method that builds successive independent trees using bootstrap samples of the dataset. Random Forest adds a further layer of randomness by selecting a random subset of predictors, or combinations of predictors, at each node split, whereas bagging considers all of the original predictors when splitting a node31.

Logistic Regression is typically the foremost statistical method used to model binary responses. It belongs to a family of techniques called Generalized Linear Models and models the log odds of a binary dependent variable as a linear function of the predictors28.
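In the standard formulation (given here for completeness), the model for a binary outcome Y and predictors X_1, ..., X_k is:

```latex
\log \frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k
```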

All models and algorithms were run using WEKA32.
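For readers who prefer Python, the sketch below lists rough scikit-learn analogues of the classifier line-up. Note that Functional Trees, J48 and J48Consolidated are WEKA-specific implementations with no exact scikit-learn counterpart, so a CART decision tree stands in for them here; hyperparameters are left at defaults purely for illustration.

```python
# Rough scikit-learn analogues of the WEKA classifiers used in the study
# (illustrative only; the study itself ran WEKA implementations).
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Artificial Neural Network": MLPClassifier(max_iter=1000, random_state=0),
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "Decision Tree (stand-in for Functional Trees/J48)": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
```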

Performance metrics

We used four metrics to estimate the performance of the different classification methods33: (1) sensitivity: the ability of the model to identify all the relevant cases (dropouts) within the dataset; (2) accuracy: the fraction of all correct predictions; (3) F-measure: the balance between precision and sensitivity; and (4) AUC-PR: the area under the precision-recall curve. AUC-PR was the primary metric adopted to assess the performance of the algorithms, given that the purpose of our study is to identify the cohort participants most prone to attrition and to select a predictive model that is as generalizable as possible to other cohorts of very preterm infants.
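As an illustration (assuming scikit-learn rather than WEKA's evaluation output), the four metrics can be computed from a model's hard predictions and predicted probabilities as follows; average precision is used here as a standard summary of the precision-recall curve.

```python
# Illustrative computation of the four reported metrics, with sensitivity
# defined as recall on the dropout (positive) class.
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, recall_score)

def attrition_metrics(y_true, y_pred, y_score):
    """y_pred: hard class predictions; y_score: predicted dropout probability."""
    return {
        "sensitivity": recall_score(y_true, y_pred),        # dropouts found
        "accuracy": accuracy_score(y_true, y_pred),         # all correct predictions
        "f_measure": f1_score(y_true, y_pred),              # precision/recall balance
        "auc_pr": average_precision_score(y_true, y_score), # PR-curve summary
    }
```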

Predictor variables importance

We collected the variable rank given by the best algorithm in each run and then calculated the overall mean rank of the five best variables over all runs. To investigate the effects of the most relevant continuous predictor variables across different values, partial dependence plots were generated for the most accurate algorithm34. To improve interpretability, partial dependence plots were stratified by categories, when appropriate. The plots were presented with smoothed curves to allow important patterns to stand out more clearly. Graphs were constructed using the R programming language.
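The sketch below (illustrative only: toy data, hypothetical column names, and scikit-learn/matplotlib in place of the WEKA/R tooling actually used) shows how an impurity-based importance ranking and a partial dependence plot can be obtained from a fitted Random Forest.

```python
# Impurity-based variable importance and a partial dependence plot for a
# Random Forest, on toy stand-in data with hypothetical column names.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay

X_arr, y = make_classification(n_samples=500, n_features=5, random_state=0)
X = pd.DataFrame(X_arr, columns=[
    "birthweight", "gestational_age", "maternal_age", "length_of_stay", "sex"])
model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Mean-decrease-in-impurity importances, ranked from most to least important.
ranking = sorted(zip(X.columns, model.feature_importances_),
                 key=lambda p: p[1], reverse=True)
print(ranking)

# Partial dependence of the predicted outcome on one continuous predictor.
PartialDependenceDisplay.from_estimator(model, X, features=["maternal_age"])
plt.show()
```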

Ethics

The EPICE-PT cohort was approved by the Ethics Committees of the participating hospitals and by the Portuguese Data Protection Authority (authorization 7426/2011)20. All research was performed in accordance with relevant guidelines, and informed consent was obtained from all parents or legal representatives, as required by national legislation. The study complies with the Declaration of Helsinki (2008).

Ethics committees that approved the study

  1. Ethics Committee of Hospital Center Alto Ave—Guimarães
  2. Ethics Committee of Hospital Center Entre Douro e Vouga—Hospital São Sebastião
  3. Ethics Committee of Hospital Center Médio Ave—Hospital de Famalicão
  4. Ethics Committee of Hospital Center Porto—Maternidade Júlio Dinis
  5. Ethics Committee of Hospital Center Póvoa de Varzim/Vila do Conde—Hospital Póvoa Varzim
  6. Ethics Committee of Hospital Center São João—Hospital São João
  7. Ethics Committee of Hospital Center Tâmega e Sousa—Hospital Padre Américo
  8. Ethics Committee of Hospital Center Trás dos Montes e Alto Douro—Hospital São Pedro
  9. Ethics Committee of Hospital Center Vila Nova de Gaia/Espinho—Unidade II
  10. Ethics Committee of Hospital São Marcos—Hospital São Marcos
  11. Ethics Committee of Local Health Unit Matosinhos—Hospital Pedro Hispano
  12. Ethics Committee of Local Health Unit Alto Minho—Hospital de Santa Luzia
  13. Ethics Committee of Hospital Center Nordeste—Hospital Bragança
  14. Ethics Committee of Hospital Center de Setúbal—Hospital São Bernardo
  15. Ethics Committee of Hospital Center Barreiro/Montijo—Hospital São Bernardo
  16. Ethics Committee of Hospital Center Oeste—Hospital das Caldas da Rainha
  17. Ethics Committee of Hospital Center Oeste—Hospital de Torres Vedras
  18. Ethics Committee of Hospital Center Lisboa Central—Hospital Dona Estefânia
  19. Ethics Committee of Hospital Center Lisboa Central—Maternidade Alfredo da Costa
  20. Ethics Committee of Hospital Center Lisboa Norte—Hospital de Santa Maria
  21. Ethics Committee of Hospital Center Lisboa Ocidental—Hospital de São Francisco de Xavier
  22. Ethics Committee of Hospital Center Médio Tejo—Hospital de Abrantes
  23. Ethics Committee of Hospital CUF Descobertas
  24. Ethics Committee of Hospital Fernando Fonseca
  25. Ethics Committee of Hospital da Luz
  26. Ethics Committee of Hospital de Santarém
  27. Ethics Committee of Hospital de Vila Franca de Xira
  28. Ethics Committee of Hospital dos Lusíadas
  29. Ethics Committee of Hospital Garcia de Horta
  30. Ethics Committee of Hospital José de Almeida

Results

Of the 542 very preterm children included in the study, 57.2% were male. The median gestational age was 29 weeks (p25–p75: 27–31) and the median birthweight was 1172 g (p25–p75: 940–1436.2). Mothers were mostly primiparous (63.2%) and native-born (84.9%), with a median age of 31 years (p25–p75: 27–35), and 83.2% belonged to the least deprived quartiles of neighbourhood socioeconomic deprivation (Table 1). Attrition in the four follow-ups was 16%, 25%, 13% and 17%, respectively.

Table 1 General characteristics of the study population (n = 542).

The SMOTE technique improved the performance of all algorithms in both models; therefore, all the presented results are derived using this technique. To verify the reliability of the results with the oversampling technique, we compared the descriptive statistics of the original dataset and its oversampled counterpart and found no significant differences.

Comparison of methods performance

Figure 1 depicts the discriminatory abilities of all methods for the prediction of attrition. There was a consistent and large superiority of Random Forest over the other methods in the baseline model. In the incremental model, Random Forest also had the best performance, although only slightly higher than AdaBoost (follow-ups 2, 3 and 4) and Artificial Neural Networks (follow-ups 3 and 4). The discrimination performance of Random Forest was excellent across all follow-ups in both models, baseline [AUC-PR1: 94.1 (2.0); AUC-PR2: 89.1 (2.3); AUC-PR3: 92.9 (2.2); AUC-PR4: 93.4 (2.6)] and incremental [AUC-PR1: 94.1 (2.0); AUC-PR2: 91.2 (1.2); AUC-PR3: 97.1 (1.0); AUC-PR4: 96.5 (1.7)]. In all follow-ups, the conventional Logistic Regression approach performed worse than Random Forest, both in the baseline [AUC-PR1: 78.8 (3.4); AUC-PR2: 72.2 (3.2); AUC-PR3: 81.1 (2.0); AUC-PR4: 80.6 (3.8)] and the incremental model [AUC-PR1: 78.8 (3.4); AUC-PR2: 79.1 (2.9); AUC-PR3: 92.1 (2.3); AUC-PR4: 91.4 (2.2)]. Supplementary Table 2 presents the odds ratios of the Logistic Regression for the most relevant predictors. Adding new predictors in the incremental model improved the performance of all algorithms in all follow-ups.

Figure 1

Area Under the Precision-Recall Curve (AUC-PR) for follow-ups 1, 2, 3 and 4.

Table 2 presents the mean and standard deviation of the assessed metrics (sensitivity, accuracy and F-measure). At follow-up 1, Random Forest (82.3; 6.3) and AdaBoost (82.3; 6.0) presented the highest values for sensitivity, which measures the proportion of positive cases (dropouts) correctly identified. At follow-up 2, K-Nearest Neighbours (87.6; 4.5) in the baseline model outperformed the other methods. Random Forest was the best algorithm for sensitivity at follow-up 3 (89.8; 4.1) and Functional Trees at follow-up 4 (91.5; 3.7), both in the incremental model. In an overall analysis of the three metrics, Random Forest presented the best performance in both models, at all follow-ups.

Table 2 Performance results of the classification methods applied to the prediction of attrition in four follow-ups of the EPICE-PT cohort.

Predictor importance analysis

Predictor importance was computed by evaluating the decrease in impurity at each split across all decision trees in the forest35. In both the baseline and incremental models, four of the five most relevant predictors were common to all follow-ups and confined to clinical and demographic characteristics: birthweight, gestational age, maternal age, and length of hospital stay after birth. Region of birth (Lisbon and Tagus Valley) and sex of the child (male) were the other two most relevant predictors (Table 3). Figure 2 shows the five predictors with the highest importance based on the Random Forest in the baseline model.

Table 3 The top-ranked variables by variable importance for each year in the baseline and incremental models.
Figure 2

Importance of the predictor variables (based on the mean decrease in impurity) in the Random Forest for each year (baseline model).

Partial dependence plots illustrating the effects of the continuous predictors across a range of values in the Random Forest algorithm are shown in Supplementary Figs. 1, 2, 3 and 4. As the plots are similar for the baseline and incremental models, we display only the baseline model. The risk of attrition increased with higher gestational age and lower maternal age, although the risk also increased for older mothers (> 35 years) at follow-ups 3 and 4. The stratification of birthweight by sex revealed different tendencies: for male participants, the risk of attrition had an inverted U-shape, with lower risk at extreme values, whereas for females it showed two peaks of increased risk (around 1000 and 2000 g). Length of hospital stay after birth was stratified by gestational age (≤ 27 and > 27 weeks). In both categories, the risk increased with length of hospital stay, with a more rapid increase generally occurring after 50 days.

Discussion

Using seven machine learning algorithms and conventional Logistic Regression, this study developed two models for characterizing the risk of attrition in the EPICE-PT cohort. Both models presented excellent predictive performance, with the best performance reached by the incremental one, in which new predictors were progressively added. Random Forest showed the best discrimination performance in all follow-ups, surpassing Logistic Regression. In addition, we achieved a good level of interpretability of the predictors, emphasizing the added value of this algorithm. Random Forest not only improved the discriminative ability but also provided clear information to support the development of tailored retention strategies along the cohort life cycle. Based on the results of the Random Forest algorithm, children of younger mothers, children born at higher gestational age, and children with a longer hospital stay after birth presented a higher risk of dropping out. Birthweight, sex, and region of birth were also among the most important risk factors for attrition.

The two predictive models of attrition have distinct advantages. The baseline model achieved excellent predictive performance while offering the opportunity to predict attrition, and to plan tailored interventions to prevent it, at an early stage of the cohort. The incremental model achieved an even higher predictive performance than the baseline model and improves the performance of the other algorithms, broadening the range of satisfactory methods. However, it increases computational costs, is more time-consuming, and is less efficient at identifying potential dropouts at an early stage, which is a substantial disadvantage from the perspective of cohort maintenance. In both models, all the top-ranked predictors belonged to the baseline dataset. For these reasons, we consider the baseline model the most advantageous for predicting attrition in our study population and similar cohorts.

A superior performance of Random Forest over Logistic Regression for predictive models has been shown in diverse biomedical applications, such as suicidal behaviour36, cancer metastasis37, readmissions in patients with heart failure38 and unplanned rehospitalisation of preterm babies39. Likewise, a large experimental evaluation of 179 algorithms on 121 datasets showed that Random Forest came very close to the best attainable accuracy for most of the datasets40. However, a systematic review of 71 studies did not favour machine learning methods over Logistic Regression for clinical prediction41. These discrepant results may be explained by the No-Free-Lunch theorem42, which states that no classifier is always the best for all datasets. Nevertheless, the comparison of our models' performance with previous research is limited by the lack of studies investigating the ability of machine learning methods to predict attrition in cohorts.

Identifying the key predictors of attrition is of great significance for mitigating its risk in cohorts. Although the top-ranked predictors of attrition in our research are non-modifiable variables, they shed light on which participants should receive further attention and incentives to continue their participation. The identified predictors are consistent with previous findings in very preterm cohorts, such as lower maternal age43,44 and male sex45,46. The effects of the most relevant clinical predictors showed contradictory patterns, revealing that participants with either better health (higher gestational age, greater birthweight in females, average birthweight in males) or worse health (longer hospital stay) are more prone to attrition. A systematic review of 57 publications on very preterm cohorts likewise identified both healthier participants (e.g., higher gestational age, better lung function) and unhealthier participants (e.g., severe disabilities, poorer cognitive performance) as more likely to drop out of the cohort47. This paradox is therefore not a new finding and remains to be elucidated. It is also important to note the absence of socioeconomic factors in our model, which are often among the strongest predictors of attrition43,44,48. This might be due to the small variability of our sample regarding the only socioeconomic indicator among our baseline predictors, the neighbourhood socioeconomic deprivation index49 (82.5% of the participants belong to the least deprived quartiles).

Our study’s strengths include: (1) data from a population-based prospective cohort representing almost 70% of all very preterm births that occurred in Portugal in 2011/2012; (2) the testing of several machine learning methods, given that the most appropriate algorithm may differ depending on data structure; (3) the selection of predictors usually collected in very preterm cohorts, rather than all predictors available in our dataset, to broaden the usability of the model for similar cohorts; and (4) a satisfactory level of model interpretation, allowing practical implementation of the obtained results. Moreover, to the best of our knowledge, this is the first study to develop prediction models of attrition in longitudinal cohort studies using machine learning techniques.

The primary limitation of the current study is that we assessed the performance of the machine learning models by the hold-out method, a form of internal validation. External validation in other very preterm cohorts is needed to confirm the performance of the developed models. Another limitation was the lack of information at baseline on sociodemographic indicators that are important known predictors of attrition, such as the mother's employment50 and educational level51. Although the availability of such information at baseline would likely improve predictive ability, our models nevertheless performed well. Moreover, the neighbourhood socioeconomic deprivation index is a robust measure that has been used as a valid proxy of individual socioeconomic position in previous research52. Lastly, the variable importance of Random Forest was estimated by the mean decrease in impurity (or Gini importance), which may produce biased variable selection when predictor variables vary in their scale of measurement or number of categories, as in our dataset. Notwithstanding, the identified top-ranked predictors are in line with previous research on attrition in very preterm cohorts, supporting our results. In addition, previous research has demonstrated that when Random Forest uses a large number of trees in each run, as in our case, stable variable importance rankings are achieved53.

In conclusion, we developed and validated robust machine learning models for predicting attrition in a cohort of very preterm infants and demonstrated their superiority and feasibility compared with conventional Logistic Regression. Beyond a high-performing model, this study also provided interpretation of the most relevant predictors contributing to attrition. Researchers involved in cohorts lack effective tools to identify participants at risk of attrition early, and can benefit from our results to prepare for and prevent loss to follow-up, e.g., by directing efforts and developing tailored interventions geared toward those individuals to promote their continued participation54,55,56.