Machine learning methods to predict attrition in a population-based cohort of very preterm infants

The timely identification of cohort participants at higher risk for attrition is important for early intervention and efficient use of research resources. Machine learning may have advantages over conventional approaches, improving discrimination by analysing complex interactions among predictors. We developed predictive models of attrition applying a conventional regression model and different machine learning methods. A total of 542 very preterm (< 32 gestational weeks) infants born in Portugal as part of the European Effective Perinatal Intensive Care in Europe (EPICE) cohort were included. We tested a model with a fixed number of predictors (Baseline) and a second with a dynamic number of variables added from each follow-up (Incremental). Eight classification methods were applied: AdaBoost, Artificial Neural Networks, Functional Trees, J48, J48Consolidated, K-Nearest Neighbours, Random Forest and Logistic Regression. Performance was compared using AUC-PR (Area Under the Precision-Recall Curve), Accuracy, Sensitivity and F-measure. Attrition at the four follow-ups was, respectively, 16%, 25%, 13% and 17%. Both models demonstrated good predictive performance, with AUC-PR ranging from 69 to 94.1 in the Baseline model and from 72.5 to 97.1 in the Incremental model. Of the whole set of methods, Random Forest presented the best performance at all follow-ups [AUC-PR1: 94.1 (2.0); AUC-PR2: 91.2 (1.2); AUC-PR3: 97.1 (1.0); AUC-PR4: 96.5 (1.7)]. Logistic Regression performed well below Random Forest. The top-ranked predictors were common to both models at all follow-ups: birthweight, gestational age, maternal age, and length of hospital stay. Random Forest presented the highest capacity for prediction and provided interpretable predictors. Researchers involved in cohorts can benefit from our robust models to prepare for and prevent loss to follow-up by directing efforts toward individuals at higher risk.

www.nature.com/scientificreports/ Birth cohorts of high-risk children, like those born very preterm (< 32 weeks of gestation), have an important role in providing a comprehensive assessment of the needs and development of these children across their lifespan 9 . Very preterm infants experience increased and long-term adverse outcomes, such as cognitive and behavioural problems, when compared with children born at term 10 . Hence, this type of cohort may provide valuable scientific evidence that, ultimately, will contribute to improving clinical care, supporting public health decisions, and planning health and education provisions to these children 11 .
An early and precise identification of which participants present an increased risk of dropping out may be of great benefit. Conventional statistical methods, such as Logistic Regression, have been the usual choice to predict attrition in cohorts [12][13][14] . However, these classical theory-based models are constrained by independence, additivity and linearity assumptions which may oversimplify complex relationships between predictors and outcome variables 15 .
Growing access to clinical data and rapid advances in machine learning have generated great enthusiasm over the past decade about its use to improve clinical care 16 , and it is increasingly applied in epidemiological research and practice 17 . In addition, machine learning methods may bring advantages over conventional approaches. They offer highly flexible algorithms that often do not require underlying distributional assumptions or model specification, and are able to adapt to complex non-linear and non-additive interrelations between outcome and covariates 18 . However, studies employing machine learning techniques to address methodological challenges in epidemiological research remain scarce.
In this study, we developed predictive models of attrition in a birth cohort of very preterm infants applying a conventional regression model and different machine learning methods, and looked for the most relevant predictors of attrition.

Methods
Study population. The study population consisted of Portuguese children participating in the prospective population-based Effective Perinatal Intensive Care in Europe (EPICE) cohort. It included all very preterm births (between 22 + 0 and 31 + 6 weeks of gestation) in 2011/12 in 19 regions of 11 European countries 19 . In Portugal, there were 724 very preterm live births occurring in this period in the two geographic regions (Northern and Lisbon and Tagus Valley) included in the cohort 20 . This study included all infants discharged alive from Neonatal Intensive Care Units (NICUs) whose parents provided written informed consent to participate in the EPICE cohort in Portugal (EPICE-PT) and in its long-term follow-up, resulting in 544 children (89.6% of 607 eligible participants) 19 . We excluded two infants who died after discharge, leaving 542 participants for the analysis. Participants' data at baseline were extracted from medical charts by health care professionals using a pretested standardized questionnaire 19 . In this study, we focused on the first four years of follow-up (follow-up 1 to follow-up 4), during which questionnaires on the child's health and development were administered to parents by telephone (follow-ups 1, 3 and 4) or by post (follow-up 2).
Outcome. The outcome of interest was attrition, i.e., non-participation in offered follow-ups. Attrition was identified when the participant (a) could not be reached by any available contact (including a relative's contact), (b) repeatedly postponed the call to answer the questionnaire, (c) verbally refused to participate in that specific follow-up, (d) verbally requested to withdraw from the cohort, or (e) did not mail the questionnaires back, even after several reminders (follow-up 2). Attrition at each follow-up was calculated considering the eligible participants, i.e., excluding possible deaths and/or previous formal refusals. Participation was considered when parents accepted the invitation for that specific follow-up and answered the questionnaires (either totally or partially) through any available method.
Predictors. Predictors were taken from information collected at baseline and from questionnaires completed at the three subsequent follow-ups. Based on the literature and experience of the researchers involved in the cohort, we selected a list of demographic, socioeconomic and clinical characteristics that are likely to be important predictors of attrition (Supplementary Table 1). The decision not to include all predictors available in the cohort dataset was taken to mitigate the curse of dimensionality 21 , to diminish the computational costs, to prevent overfitting 22 and to increase the usability of the model in similar cohorts.
Model development. Two predictive modelling frameworks were developed: (1) "Baseline", where attrition at each of the first four follow-ups was predicted independently using baseline data only, and (2) "Incremental", where baseline variables were used to predict attrition at follow-up 1 and, from then on, new predictors extracted from each completed follow-up were successively added (e.g. baseline plus follow-up 1 to predict attrition at follow-up 2; baseline plus follow-up 1 and 2 to predict attrition at follow-up 3, etc.). For the first follow-up, both models are equivalent.
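The difference between the two frameworks can be illustrated with a minimal sketch that lists which data-collection waves feed the model for a given target follow-up. The function name `predictor_blocks` and the wave labels are hypothetical and are not part of the original analysis, which was run in WEKA.

```python
def predictor_blocks(framework, target_follow_up):
    """Return the data-collection waves whose variables are used as predictors.

    `framework` is "baseline" or "incremental" (hypothetical labels);
    `target_follow_up` is the follow-up (1 to 4) whose attrition is predicted.
    """
    if framework == "baseline":
        # Baseline framework: only baseline variables, whatever the target.
        return ["baseline"]
    if framework == "incremental":
        # Incremental framework: baseline plus every follow-up already completed.
        earlier = [f"follow_up_{k}" for k in range(1, target_follow_up)]
        return ["baseline"] + earlier
    raise ValueError(f"unknown framework: {framework!r}")


blocks_fu3 = predictor_blocks("incremental", 3)
```

For follow-up 1 both calls return only the baseline block, which matches the statement that the two models are equivalent at that point.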
To test the models' performance in predicting new data, we used, for each year, five repetitions with replacement of a hold-out method 23 . In each of the five repetitions, the whole dataset was randomly split into a training set (80%) and a testing set (20%). Most machine learning algorithms have a set of parameters that may be adjusted to get a good model (parameter tuning). We adopted a wrapper approach 24 to estimate the best combination of parameter values. We split the training set into a tuning-training set (95% of the original training set) and a tuning-test set (5% of the original training set). The result of the wrapper is the combination of parameter values that produced the best AUC-PR value on the tuning-test set. The best combination of parameter values is then used on the training set and the model is finally evaluated on the test set.
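The splitting scheme can be sketched as follows. This is an illustration under stated assumptions, not the WEKA configuration used in the study: "with replacement" is read here as redrawing each 80/20 split from the full dataset, and the helper names (`holdout_split`, `repeated_holdout`, `wrapper_tune`) are hypothetical.

```python
import random


def holdout_split(indices, held_out_fraction, rng):
    """Shuffle a copy of `indices` and return (kept, held_out)."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_held_out = round(len(shuffled) * held_out_fraction)
    return shuffled[n_held_out:], shuffled[:n_held_out]


def repeated_holdout(n_samples, repetitions=5, seed=0):
    """Build the index sets for each repetition of the hold-out procedure."""
    rng = random.Random(seed)
    all_indices = list(range(n_samples))
    splits = []
    for _ in range(repetitions):
        # Outer split: 80% training, 20% testing, redrawn from the whole dataset.
        train, test = holdout_split(all_indices, 0.20, rng)
        # Inner split of the training set: 95% tuning-training, 5% tuning-test.
        tune_train, tune_test = holdout_split(train, 0.05, rng)
        splits.append({"train": train, "test": test,
                       "tune_train": tune_train, "tune_test": tune_test})
    return splits


def wrapper_tune(candidates, score_on_tuning_test):
    """Return the candidate parameter setting with the best tuning-test score."""
    return max(candidates, key=score_on_tuning_test)


splits = repeated_holdout(542)
```

With 542 participants this gives 434 training and 108 testing records per repetition, and the training records are further divided into 412 for tuning-training and 22 for tuning-test. In `wrapper_tune`, the scoring function would fit a model on the tuning-training set and return its AUC-PR on the tuning-test set.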
The prevalence of the outcome (attrition) in the various follow-ups of the EPICE-PT cohort ranged from 13 to 25%. Hence, we have a set of imbalanced datasets, which makes the models prone to bias towards the majority class. To cope with this problem, the Synthetic Minority Over-Sampling Technique (SMOTE) 25 was applied to mitigate the imbalance of the datasets.
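The core idea of SMOTE, creating synthetic minority-class records by interpolating between a minority record and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified illustration for numeric predictors only, not the WEKA filter used in the study; the function name `smote` and its parameters are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate `n_synthetic` points by interpolating between minority points.

    `minority` is a list of numeric feature vectors (at least two points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


dropouts = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(dropouts, n_synthetic=10, k=2)
```

Because every synthetic point lies on a segment between two observed minority records, the oversampled data stay within the range of the original minority class.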
Classification methods. Different classification methods were leveraged to build the predictive models.
Selected machine learning methods included AdaBoost, Artificial Neural Networks, K-Nearest Neighbours, Decision Tree Classifiers (Functional Trees, J48 and J48Consolidated), and Random Forest. We also applied Logistic Regression, performed with identical predictors, without interaction terms. A short explanation of each method is given below.
AdaBoost is one of the most popular boosting algorithms, a group of methods that produce a classifier as a linear combination of weak classifiers, and does so in a way that minimizes exponential loss over such linear combinations 26 . A weak classifier can be described as one whose error rate is only slightly better than random guessing 15 .
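The linear combination of weak classifiers described for AdaBoost can be written out as follows. The notation is introduced here for illustration and is an assumption: $h_m$ are the weak classifiers, $\alpha_m$ their weights, and the outcome is coded as $y_i \in \{-1, +1\}$.

```latex
F_M(x) = \sum_{m=1}^{M} \alpha_m \, h_m(x),
\qquad
\hat{y}(x) = \operatorname{sign}\bigl(F_M(x)\bigr),
\qquad
L(F_M) = \sum_{i=1}^{n} \exp\bigl(-y_i \, F_M(x_i)\bigr)
```

Each boosting round adds one term $\alpha_m h_m$ chosen to reduce the exponential loss $L$, so observations that earlier weak classifiers misclassified receive more weight in later rounds.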
Artificial Neural Networks are nonlinear statistical models, which extract linear combinations of the predictors as derived features, and then generate an outcome as a nonlinear function of these features. This learning method, inspired by neuroscience, is quite robust to noise in the training data 15,27 .
K-Nearest Neighbours models are based on a sample's neighbourhood in the predictor space. They use the nearest observations, based on a distance measure, to predict the final classification outcome of a new observation 28 .
Decision Tree Classifiers (Functional Trees, J48 and J48Consolidated) are a group of algorithms that use a binary recursive partitioning of the instance space 29 . This is an iterative process of splitting the data into partitions, and then splitting them further on each of the branches, aiming to partition the data into smaller, more homogeneous groups. By fully revealing the feature space partition of a single tree, it allows for great flexibility in data analysis and interpretability 15,29 .
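A minimal sketch of the nearest-neighbour rule, assuming Euclidean distance and a simple majority vote (the distance measure and the value of k used in the study are not stated here, and the function name `knn_predict` is hypothetical):

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: label 1 marks a dropout, label 0 a participant.
toy_X = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5], [5, 6]]
toy_y = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict(toy_X, toy_y, [0.2, 0.2])
near_far_cluster = knn_predict(toy_X, toy_y, [5.5, 5.5])
```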
Random Forest algorithms are an extension of bagging 30 , an ensemble learning method that builds successive independent trees using a bootstrap sample of the data set. It adds a new layer of randomness when selecting predictors or combinations of predictors at each node to split it, while bagging considers all of the original predictors for splitting a node 31 .
Logistic Regression is typically the foremost statistical analysis used to model binary responses. It belongs to a family of techniques called Generalized Linear Models, which models the log odds of a binary dependent variable as a linear function 28 .
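Written out, and using symbols introduced here for illustration ($p$ for the probability of attrition, $x_j$ for the predictors, $\beta_j$ for the coefficients), the model without interaction terms is:

```latex
\log\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \sum_{j=1}^{k} \beta_j x_j)\bigr)}
```

The linearity and additivity on the log-odds scale are the assumptions that the more flexible methods above do not impose.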
All models and algorithms were run using WEKA 32 .
Performance metrics. We used four metrics to estimate the performance of the different classification methods 33 : (1) Sensitivity: the ability of the model to identify all the relevant cases (dropouts) within the dataset, (2) Accuracy: it measures the fraction of all correct predictions, (3) F-measure: conveys the balance between precision and sensitivity and (4) AUC-PR: Area Under the Curve of Precision-Recall. AUC-PR was the primary metric adopted to assess the performance of the algorithms, given the purpose of our study is to identify the cohort's participants more prone to attrition and to select a predictive model that is as generalizable as possible to other cohorts of very preterm infants.
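The four metrics can be computed from predicted labels and scores as sketched below. AUC-PR is approximated here by average precision (the mean of the precision values at the rank of each true dropout), which is one common estimator; the exact computation performed by WEKA may differ. All function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with 1 = dropout and 0 = participant."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def sensitivity(tp, fp, fn, tn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(tp, fp, fn, tn):
    # Harmonic mean of precision and sensitivity, written in count form.
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall level
    return total / hits if hits else 0.0


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
counts = confusion_counts(y_true, y_pred)
```

Unlike accuracy, sensitivity and AUC-PR focus on the minority class (dropouts), which is why they are informative when attrition affects only 13 to 25% of participants.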
Predictor variables importance. We collected the variable rank given by the best algorithm in each run and then we calculated the overall mean rank of the five best variables over all runs. To investigate the effects of the most relevant continuous predictor variables across different values, partial dependence plots were generated for the most accurate algorithm 34 . Aiming to improve interpretability, partial dependence plots were stratified by categories, when appropriate. The plots were presented with smooth curves to allow possible important patterns to stand out more clearly. Graphs were constructed using the R programming language.
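The quantity shown in a partial dependence plot can be sketched as follows: for each value on a grid, the predictor of interest is set to that value in every record and the model's predictions are averaged. The study produced its plots in R; this Python sketch, with the hypothetical names `partial_dependence` and `toy_model`, only illustrates the calculation.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over `rows` with one feature fixed at each grid value.

    `predict` maps a feature vector to a predicted risk; `rows` is a list of
    feature vectors; `grid` lists the values at which to evaluate the feature.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)             # copy so the data are not altered
            modified[feature_index] = value  # fix the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Stand-in for a fitted classifier: risk rises with both features."""
    return 0.1 * row[0] + row[1]


records = [[1, 0], [2, 1], [3, 2]]
curve = partial_dependence(toy_model, records, feature_index=0, grid=[0, 10])
```

Stratified plots, as described above, correspond to calling the same function separately on the records of each category.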
Ethics. The EPICE-PT cohort was approved by the Ethics Committee of the participating hospitals and by the Portuguese Data Protection Authority (authorization 7426/2011) 20 . All research was performed in accordance with relevant guidelines and informed consent was obtained from all parents or legal representatives, as required by national legislation. The study complies with the Helsinki Declaration 2008.

Results
Of the 542 very preterm children included in the study, 57.2% were male. The median gestational age was 29 weeks (p25-p75: 27-31) and the median birthweight was 1172 g (p25-p75: 940-1436.2). Mothers were mostly primiparous (63.2%) and native (84.9%), with a median age of 31 years (p25-p75: 27-35), and 83.2% belonged to the least deprived quartiles of neighbourhood socioeconomic deprivation (Table 1). Attrition at the four follow-ups was, respectively, 16%, 25%, 13% and 17%. The SMOTE technique improved the performance of all algorithms in both models; therefore, all the presented results are derived using this technique. To verify the reliability of the results with the oversampling technique, we compared the descriptive statistics of the original dataset and its oversampled counterpart and found no significant differences.
Comparison of methods performance. Figure 1 depicts the discriminatory abilities of all methods for the prediction of attrition. There was a consistent and large superiority of Random Forest over the other methods in the baseline model. For the incremental one, Random Forest also had the best performance, but only slightly higher than AdaBoost (follow-up 2, 3 and 4) and Artificial Neural Networks
Table 2 presents the mean and standard deviation of the assessed metrics (sensitivity, accuracy and F-measure). At follow-up 1, Random Forest (82.3; 6.3) and AdaBoost (82.3; 6.0) presented the highest values for sensitivity, which measures the proportion of positive cases (dropouts) that were correctly identified. At follow-up 2, K-Nearest Neighbours (87.6; 4.5) in the baseline model outperformed the other methods. Random Forest was the best algorithm for sensitivity at follow-up 3 (89.8; 4.1) and Functional Trees at follow-up 4 (91.5; 3.7), both in the incremental model. In an overall analysis of the three metrics, Random Forest presented the best performance in both models, at all follow-ups.
Predictor importance analysis. Predictor importance was computed by evaluating the decrease of impurity at each split across all decision trees in the forest 35 . In both the baseline and the incremental model, of the five most relevant predictors, four were common to all follow-ups and circumscribed to clinical and demographic characteristics: birthweight, gestational age, maternal age, and length of hospital stay after birth. Region of birth (Lisbon and Tagus Valley) and sex of the child (male) were the other two more relevant predictors (Table 3).
Partial dependence plots for the most relevant continuous predictors are presented in Supplementary Figs. 1, 2, 3 and 4. As the plots are similar for baseline and incremental models, we opted to display only the baseline model. The risk for attrition increased with higher gestational age and lower maternal age, although the risk also increased for older mothers (> 35 years) at follow-ups 3 and 4. The stratification of birthweight by sex revealed different tendencies. For male participants, the risk for attrition had an inverted U-shape, with a lower risk for extreme values; for females, it showed two peaks of increased risk (1000 and 2000 g). Length of hospital stay after birth was stratified by gestational age (≤ 27 and > 27 weeks). In both categories, the risk increased with length of hospital stay, with a more rapid increase generally occurring after 50 days.
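The impurity-based importance used in this analysis rests on the decrease in Gini impurity produced by a split. A minimal sketch for a binary outcome and a single split is given below; in a forest, these decreases are weighted by the share of records reaching each node, summed per predictor and averaged over trees. The function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels (1 = dropout)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


node = [1, 1, 0, 0]
perfect_split = impurity_decrease(node, [1, 1], [0, 0])
useless_split = impurity_decrease(node, [1, 0], [1, 0])
```

A split that separates dropouts from participants completely removes all impurity, whereas a split that leaves both children as mixed as the parent contributes nothing to the predictor's importance.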

Discussion
Using seven machine learning algorithms and conventional Logistic Regression, this study developed two models for characterizing the risk of attrition in the EPICE-PT cohort. Both models presented a very good predictive performance, with the best performance reached by the incremental one, in which new predictors were progressively Regression. In addition, we achieved a good level of interpretability of the predictors, emphasizing the added value of this algorithm. Random Forest not only improved the discriminative ability but also provided clear information for supporting the development of tailored retention strategies along the cohort life cycle. Based on the results of the Random Forest algorithm, younger mothers, children born with higher gestational age and with longer length of hospital stay presented more risk of dropping out. Birthweight, sex, and region of birth were also among the most important risk factors for attrition.
The two predictive models of attrition have distinct advantages. The baseline model resulted in an excellent predictive performance, also offering the opportunity to predict attrition and plan tailored interventions to prevent it at an early stage of the cohort. The incremental model achieved an even higher predictive performance compared to the baseline model and improved the performance of the other algorithms, broadening the range of satisfactory methods. However, it increases the computational costs, is more time-consuming and is less efficient at identifying potential dropouts at an early stage, which is a substantial disadvantage from the perspective of cohort maintenance. In both models, all the top-ranked predictors belonged to the baseline dataset. For these reasons, we consider the baseline model the most advantageous one to predict attrition in our study population and similar cohorts.
Table 1. General characteristics of the study population (n = 542). a Calculation of percentages does not include missing values. b SGA, small for gestational age, based on intrauterine curves developed for the cohort 54 . c The sum of the categories surpasses 100% as the numbers were rounded up.
A superior performance of Random Forest over Logistic Regression for predictive models was shown in diverse biomedical applications, such as suicidal behaviour 36 , cancer metastasis 37 , readmissions in patients with heart failure 38 and unplanned rehospitalisation of preterm babies 39 . Likewise, a massive experimental evaluation of 179 algorithms using 121 datasets showed that Random Forest was very close to the best attainable accuracy for most of the datasets 40 . However, a systematic review consisting of 71 studies did not favour machine learning methods over Logistic Regression for clinical prediction 41 . These discrepant results may be explained by the No-Free-Lunch theorem 42 , which states that no classifier can always be the best for all datasets. Nevertheless, the comparison of our models' performance with previous research is limited by the lack of studies investigating the ability of machine learning methods to predict attrition in cohorts.
Identifying the key predictors of attrition is of great significance for mitigating its risk in cohorts. Although the top-ranked predictors of attrition in our research are non-modifiable variables, they shed light on which participants should receive further attention and incentives to continue their participation. The identified predictors are consistent with previous findings in very preterm cohorts, such as lower maternal age 43,44 and male sex 45,46 . The effects of the most relevant clinical predictors appeared contradictory, indicating that participants with both better (higher gestational age, greater birthweight in females, average birthweight in males) and worse health (longer length of hospitalisation) are more prone to attrition. A systematic review of 57 publications of very preterm cohorts also identified both the healthier (e.g., higher gestational age, better lung function) and the unhealthier participants (e.g., severe disabilities, poorer cognitive performance) as more likely to drop out of the cohort 47 . Therefore, this paradox is not a new finding and remains to be elucidated. It is also important to note the absence of socioeconomic factors in our model, which are often among the strongest predictors of attrition 43,44,48 . This might be due to the small variability of our sample regarding the only socioeconomic indicator among our baseline predictors, the neighbourhood socioeconomic deprivation index 49 (82.5% of the participants belong to the least deprived quartiles).
Our study's strengths include: (1) data from a population-based prospective cohort, which represented almost 70% of all very preterm births that occurred in Portugal in 2011/2012, (2) several machine learning methods tested, given that the most appropriate algorithm may differ depending on data structure, (3) the selection of predictors usually collected in very preterm cohorts instead of all available predictors in our dataset, to broaden the usability of the model for similar cohorts, and (4) the satisfactory level of model interpretation, allowing further practical implementation of the obtained results. Moreover, to the best of our knowledge, this is the first study developing prediction models of attrition in longitudinal cohort studies through machine learning techniques.
The primary limitation of the current study is that we assessed the performance of machine learning models by the hold-out method, a form of internal validation. External validation in other very preterm cohorts is needed to confirm the performance of the developed models. Another limitation was the lack of information on sociodemographic indicators at baseline that are important known predictors of attrition, such as mother's employment 50 and educational level 51 . Though the availability of such information at baseline would likely improve the prediction ability, our models performed well enough. Moreover, the neighbourhood socioeconomic deprivation index is a robust measure that has been used as a valid proxy of individual socioeconomic position in previous research 52 . Lastly, variable importance of Random Forest was estimated by the mean decrease in impurity (or Gini importance) mechanism, which may produce biased variable selection when predictor variables vary in their scale of measurement or number of categories, as in our dataset. Notwithstanding, the identified top-ranked predictors are in line with previous research on attrition in very preterm cohorts, which supports our results. In addition, previous research has demonstrated that when Random Forest uses a significant number of trees in each run, which is our case, stable variable importance rankings are achieved 53 .
In conclusion, we have developed and validated robust machine learning predictive models of attrition in a cohort of very preterm infants and demonstrated their superiority and feasibility compared with conventional Logistic Regression. Beyond the high-performance model, this study also provided interpretability of the most relevant predictors that contribute to attrition. Researchers involved in cohorts lack effective tools for the early identification of participants at risk of attrition and can benefit from our results to prepare for and prevent loss to follow-up, e.g., by directing efforts and developing tailored interventions geared toward those individuals to promote their continued participation [54][55][56] .

Data availability
Participants' data used for modelling are available to researchers upon reasonable request.