Introduction

Despite significant technological advances, mechanical ventilation (MV) remains associated with lung injury and important short- and long-term morbidities in extremely preterm infants.1,2,3,4 Therefore, timely weaning and disconnection from MV are critical. Unfortunately, due to the inability to accurately judge extubation readiness in this fragile population, newborn infants face the highest rates of extubation failure of all intensive care settings.5,6,7,8

Currently, all tools investigated to predict extubation success lack sufficient balanced accuracy to justify their use in clinical practice.9 Given that these infants can develop significant morbidities and rarely experience severe adverse events after extubation,10 an enhanced decision-making process is highly desirable. Recent collaborations between medicine, biomedical engineering and computer science have proven to be paramount in health care.11,12 Indeed, the use of automated analyses of biological signals and artificial intelligence has allowed complex investigations of physiologic variables characterized by a highly elaborate, apparently random output, arising from nonlinear biological systems.13,14

These advanced signal analyses have also been used as prediction tools of clinical outcomes, including ventilation weaning and extubation readiness.15,16 Moreover, in preterm infants analyses of some biological signals have also improved the understanding of MV weaning and post-extubation care.17,18,19,20 Thus, the objective of this multicenter study was to develop and evaluate a prediction model with balanced accuracy for extubation success in extremely preterm infants using machine learning algorithms that combine clinical data with automated analyses of cardiorespiratory signals.

Methods

This prospective multicenter study was performed from September 2013 to August 2018 and is reported according to the TRIPOD statement. The study protocol was registered at ClinicalTrials.gov (NCT01909947) and published.21 Ethics approval was obtained from each Institutional Review Board and written informed consent was obtained from parents prior to enrollment.

Participating centers

The participating centers were the Neonatal Intensive Care Units of the Royal Victoria Hospital, Montreal Children’s Hospital, and Jewish General Hospital (Montreal, QC, Canada), Detroit Medical Center (Detroit, MI, USA), and Women and Infant’s Hospital (Providence, RI, USA).

Eligibility criteria and clinical management

Infants with birth weight (BW) ≤ 1250 g receiving MV and undergoing their first extubation attempt were eligible. Exclusion criteria were unplanned extubation, major congenital anomalies, major heart defects, cardiac arrhythmia, use of vasopressor or sedative drugs at the time of extubation, and patients extubated from high frequency ventilation (due to interference by the oscillations on the cardiorespiratory signals) or directly to oxyhood or low-flow nasal cannula. Guidelines to consider a patient ‘ready’ for extubation were proposed21 but all decisions concerning extubation and reintubation were made by the medical team. Caffeine administration was part of standard of care at all participating centers.

Outcome

The original, per-protocol outcome for the prediction model was extubation success, defined by the absence of specific criteria including oxygen needs, blood gases, and apneas in the 72 h post-extubation.21 However, during the conduct of the study it became evident that these data were inconsistently recorded within and across centers. Therefore, the definition of extubation success was changed to the absence of reintubation within 72 h post-extubation. In addition, data on the occurrence, timing and reasons for reintubation were recorded. Blinding of health care providers to this outcome was not possible.

Predictors

Candidate predictors for the development of the classifier included 109 clinical parameters pertaining to patient demographics and pre-extubation characteristics.21 Cardiorespiratory signals included: (a) Ribcage (RCG) movements, using uncalibrated Respiratory Inductance Plethysmography (RIP; Viasys® Healthcare); (b) Abdominal (ABD) movements, using RIP; (c) Electrocardiogram (ECG), using three electrodes (Vermed©); (d) Photoplethysmogram (PPG) and oxygen saturation (SAT), using a pulse oximeter monitor (Masimo Radical®). Details on data acquisition set up have been previously published.10,21 Briefly, all cardiorespiratory signals were continuously acquired before extubation, for a period of 60 min while receiving MV at pre-extubation settings, followed by 5 min during endotracheal tube continuous positive airway pressure (ET-CPAP) to capture the spontaneous breaths without interference from mechanical inflations. Clinical instability during the recording period was not considered, as this was not a spontaneous breathing trial. Thus, all involved patients were extubated thereafter and remained eligible for inclusion in the development of the prediction model. All signals were anti-alias filtered at 500 Hz and sampled at 1000 Hz using the PowerLab 16/30 analog-digital data acquisition system (ADInstruments, Australia) with 16-bit analog-to-digital resolution. All health care providers were blinded to the cardiorespiratory signals during the study. Clinical and cardiorespiratory data were stored in a cloud-based repository and an automated anonymization protocol was developed.22 Moreover, an algorithm for quality control and validation was systematically applied.23

Statistical analysis methods

Stages 1 and 2 of the analyses were conducted using MATLAB (R2018a, The MathWorks) and stage 3 with the Python Scikit-Learn v0.19© (scikit-learn developers - BSD License). A simplified flow diagram for the development of the classifier is outlined in Fig. 1.

Fig. 1: Flow diagram for the development of the classifier of extubation readiness.

The classifier was developed in 3 stages. Stage 1 - Clinical parameters were collected, and cardiorespiratory signals acquired. Stage 2 - Cardiorespiratory signals were processed into metrics describing power, frequency, and thoraco-abdominal synchrony. Using the metrics, a revised automated unsupervised respiratory event analysis algorithm (r-AUREA) categorized signals into respiratory patterns (pause [PAU], synchronous [SYB] or asynchronous [ASB] breathing, and movement artifact [MVT]) and patterns of desaturation [DST] and bradycardia [BDY]. Statistical representation of metrics, patterns and RR intervals led to a feature set characterizing the cohort’s cardiorespiratory behavior. Next, a Principal Component Analysis (PCA) was applied to transform the original features into PC features. Stage 3 - Machine-learning methods used the PC features to classify infants into extubation success or failure. First, the high-risk population was selected by using an automatically derived threshold for gestational age at birth and weight at extubation that minimized the number of successful extubations included while capturing all failures. Then, the most discriminatory combination of clinical and cardiorespiratory features (selected with five-fold cross-validation) was used to train and test a balanced random forest classifier using a leave-one-out procedure for validation. BW = birth weight, GA = gestational age, RC = ribcage, ABD = abdominal movements, SAT = oxygen saturation, ECG = electrocardiogram, PPG = pulse plethysmography.

Stage 1. Characterization of clinical data and cardiorespiratory signals

For clinical data, continuous variables were presented as median [interquartile range] and categorical variables as n (percentage). Comparisons between infants with successful and failed extubation were performed using Wilcoxon rank sum test, Chi square test or Fisher exact test, as appropriate. For cardiorespiratory data, the signals were first processed to compute sample-by-sample metrics describing power, respiratory and cardiac frequency, and thoraco-abdominal synchrony between RCG and ABD.24 Next, the RCG and ABD metrics were analyzed using AUREA, an Automated Unsupervised Respiratory Event Analysis algorithm originally developed in older infants25 and revised for the extremely preterm infant population (r-AUREA, Fig. E1). r-AUREA assigned each sample to one of the following unique respiratory patterns: Pause (PAU), Movement Artifact (MVT), Synchronous Breathing (SYB), Asynchronous Breathing (ASB), and Unknown (UNK). Furthermore, the sequence of pattern classification was revised and the probability of switching from one pattern to another was quantified using a Semi-Markov model.26 Cardiac frequency was computed from either the ECG or PPG and algorithms were designed to detect bradycardia (heart rate < 100 bpm). PPG was computed as a backup strategy in case of loss of the ECG signal. Desaturation events (oxygen saturation < 80%) were computed from the PPG. Lastly, the ECG was processed to derive inter-beat intervals and compute measures of heart rate variability (HRV). A detailed description of this methodology is provided (Methods E1).
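The threshold-based detection of bradycardia and desaturation described above can be illustrated with a minimal sketch. It assumes one value per sample and flags every contiguous run below the threshold as one event; the function name `find_events`, the sample values, and the absence of any minimum-duration or artifact rule are assumptions made for illustration, not the study's algorithm (see Methods E1 for the actual methodology).

```python
from typing import List, Tuple

BRADYCARDIA_BPM = 100.0   # heart rate threshold stated in the text
DESATURATION_PCT = 80.0   # oxygen saturation threshold stated in the text


def find_events(samples: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs below threshold."""
    events = []
    start = None
    for i, value in enumerate(samples):
        if value < threshold and start is None:
            start = i                      # a run below threshold begins here
        elif value >= threshold and start is not None:
            events.append((start, i))      # the run ended at the previous sample
            start = None
    if start is not None:                  # run still open at end of recording
        events.append((start, len(samples)))
    return events


# Made-up sample values, for illustration only.
heart_rate = [150, 148, 99, 95, 97, 140, 145, 98, 150]
saturation = [95, 92, 79, 78, 85, 90, 96, 97, 79]

brady_events = find_events(heart_rate, BRADYCARDIA_BPM)
desat_events = find_events(saturation, DESATURATION_PCT)
```

Event counts and durations derived from such runs are the kind of quantity that could then feed the pattern statistics used as features.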

Stage 2. Development of the clinical and cardiorespiratory feature set

Summary statistics describing the properties of the metrics, patterns, and inter-beat intervals during MV and ET-CPAP were computed to create a set of features. The median [inter-quartile range] of each metric and its power were calculated. For patterns, measures of their frequency of occurrence, duration, and the transition probability from one pattern to another were included. HRV features included time-domain, frequency-domain, and non-linear analyses. This resulted in a total of 224 cardiorespiratory and 109 clinical features. To reduce the size of the set, a principal component analysis (PCA) was used to transform the original features into a set of linearly uncorrelated Principal Component (PC) features that concisely explained the variance in the original set. In other words, PCA reduces the size of the feature set by replacing redundant or highly correlated features with a smaller number of components while preserving as much information as possible. The contribution of each feature to the clinical (Fig. E2) and cardiorespiratory (Fig. E3) principal components was evaluated using heat maps, and the percentage of total variance explained by every PC was assessed (Fig. E4).27 PC features that explained <1% of the total variance were excluded, resulting in a set of 95 PC features that explained 95% of the variance. A detailed description of this methodology is provided (Methods E2).
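The rule of excluding PCs that explain less than 1% of the total variance can be sketched as follows. The sketch assumes the variance of each PC (its eigenvalue) is already available from a PCA; the function names and the eigenvalues are hypothetical and do not come from the study data.

```python
from typing import List


def select_components(eigenvalues: List[float], min_ratio: float = 0.01) -> List[int]:
    """Indices of components whose share of total variance is at least min_ratio."""
    total = sum(eigenvalues)
    return [i for i, ev in enumerate(eigenvalues) if ev / total >= min_ratio]


def cumulative_explained(eigenvalues: List[float], kept: List[int]) -> float:
    """Fraction of total variance explained by the kept components."""
    total = sum(eigenvalues)
    return sum(eigenvalues[i] for i in kept) / total


# Hypothetical eigenvalues, sorted from largest to smallest.
eigenvalues = [50.0, 30.0, 15.0, 4.2, 0.5, 0.3]

kept = select_components(eigenvalues)              # drops the two PCs below 1%
coverage = cumulative_explained(eigenvalues, kept)  # about 0.992 for these values
```

With real data the same two steps would yield the number of retained PCs and the share of variance they explain together.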

Stage 3. Classification

The overall population was imbalanced since less than 20% of infants failed extubation. Therefore, a two-stage Clinical Decision - Balanced Random Forest (CD-BRF) classifier was used.28,29,30,31 In the CD stage, the best-performing clinical parameters were the gestational age (GA) at birth and weight at extubation, and the optimal boundary was selected to include all infants who failed extubation and the fewest infants with success (Fig. E5). Infants outside of this boundary were automatically classified as success, while those in the high-risk group were passed to the BRF stage, which used clinical and cardiorespiratory PC features to complete the classification. In simple terms, a random forest classifier averages the predictions made by many decision trees (which individually may not be as accurate) into a single more robust estimate. Each decision tree uses a random subset of features to make its prediction, and the overall random forest has a fixed set of parameters, called hyperparameters, which determine the properties of the model. These BRF hyperparameters included the number of trees in the forest, the maximum depth of the trees, the minimum number of patients required to split a node and the maximum number of features considered at each node (Methods E3 and Fig. E6). Three BRF classifiers were designed: a Clinical (C) classifier using clinical PC features alone, a Cardiorespiratory (CR) classifier using cardiorespiratory PC features alone, and a Clinical and Cardiorespiratory (CCR) classifier using both. Due to the relatively limited number of failure patients, a leave-one-out cross-validation was used to evaluate the performance of the classifiers. A detailed description of this methodology is provided (Methods E3).
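The two-stage idea can be sketched in a few lines. The cut-offs are the ones reported in the Results; treating the boundary as a simple conjunction of the two thresholds, and building each balanced training sample from all failures plus an equal number of randomly drawn successes, are simplifying assumptions for illustration. The function names are hypothetical, and the actual procedure is described in Methods E3.

```python
import random
from typing import List, Sequence

GA_WEEKS_CUTOFF = 28.6     # GA threshold reported in the Results
WEIGHT_G_CUTOFF = 1160.0   # weight threshold reported in the Results


def clinical_decision(ga_weeks: float, weight_g: float) -> str:
    """CD stage: label low-risk infants directly, send the rest to the BRF stage."""
    if ga_weeks > GA_WEEKS_CUTOFF and weight_g > WEIGHT_G_CUTOFF:
        return "success"
    return "high_risk"


def balanced_sample(labels: Sequence[int], rng: random.Random) -> List[int]:
    """Indices of all failures (0) plus an equal number of random successes (1)."""
    failures = [i for i, y in enumerate(labels) if y == 0]
    successes = [i for i, y in enumerate(labels) if y == 1]
    # Under-sample the majority class without replacement (assumed scheme).
    return failures + rng.sample(successes, len(failures))


# Class sizes of the BRF stage as reported in the Results: 145 successes, 44 failures.
labels = [1] * 145 + [0] * 44
rng = random.Random(0)
train_idx = balanced_sample(labels, rng)   # 88 indices, 44 from each class
```

Each tree of a balanced forest would be fitted on a fresh sample of this kind, so that no single tree sees the 3:1 imbalance of the high-risk group.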

For the performance of the classifiers (C, CR and CCR), receiver operating characteristic (ROC) curves were generated, and the following measures computed: sensitivity (detection rate of successful extubation), specificity (detection rate of failed extubation), positive predictive value (PPV), negative predictive value (NPV), and area under the curve (AUC). The best performance was determined by the balanced accuracy, defined as (sensitivity + specificity)/2. Lastly, the relative importance of each PC feature selected by the best classifier was estimated to identify the most discriminative features.

Missing features

Some clinical data were not collected, or cardiorespiratory signals were not acquired (disconnection of leads or too short signal acquisition periods). If a feature was missing in ≤10 infants, the missing values were imputed from the median value of the outcome group to which the infant belonged. If a feature was missing in >10 infants, the feature was excluded, which occurred with 29 clinical and 25 cardiorespiratory features.
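The missing-data rule above can be written as a short sketch. It assumes one feature column at a time, with `None` marking a missing value and outcomes coded as 1 (success) or 0 (failure); the function name is hypothetical, and the sketch further assumes every outcome group has at least one observed value.

```python
from statistics import median
from typing import List, Optional

MAX_MISSING = 10  # a feature missing in more than 10 infants is excluded


def impute_or_drop(values: List[Optional[float]],
                   outcomes: List[int]) -> Optional[List[float]]:
    """Return the completed feature column, or None if the feature is excluded."""
    n_missing = sum(v is None for v in values)
    if n_missing > MAX_MISSING:
        return None
    # Median of the observed values within each outcome group.
    medians = {}
    for group in set(outcomes):
        observed = [v for v, o in zip(values, outcomes) if o == group and v is not None]
        medians[group] = median(observed)
    # Fill each gap with the median of the infant's own outcome group.
    return [medians[o] if v is None else v for v, o in zip(values, outcomes)]


# Made-up values: three successes (1) followed by three failures (0).
column = [1.0, None, 3.0, 10.0, None, 14.0]
outcome = [1, 1, 1, 0, 0, 0]
completed = impute_or_drop(column, outcome)
```

Because the fill value depends on the outcome group, a rule of this form can only be applied to development data in which the outcome is already known.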

Sample size

Sample size was estimated using the determination method published by Obuchowski and McClish.32 This calculation requires estimation of the prevalence of reintubation in the target population and variance of the ROC curve based on a pilot study,17 and a determined precision for the Area Under the Curve (AUC).21 Using a conservative extubation failure rate of 20%, we anticipated that a minimum of 170 infants would be necessary to estimate the AUC of the ROC curve with a precision of 0.1. Accounting for low quality signals and practice changes over time, the sample size was increased to 250 patients. A prospective validation on a sample of 50 infants was initially planned, but an a posteriori decision was made to proceed with leave-one-out cross-validation because a sample of 50 infants was likely to be insufficient given the relatively low rate of extubation failure.

Risk groups

Considering this was a pragmatic study, included infants were extubated at variable postnatal ages. However, it is plausible that predictors of extubation success may vary between infants extubated early (<7d of age) or late (≥ 7d of age), as prolonged MV has been associated with lung injury.2 Also, these infants are more prone to complications such as IVH or PH during the first week of life, when an extubation success or failure might have a different impact. Consequently, a post-hoc analysis was carried out to evaluate the accuracy of the predictor in these groups. Furthermore, the classifier accuracy was also tested for re-intubations occurring ≤7 days (168 h) following extubation, as longer observation windows can capture more cases of respiratory-related extubation failures.33,34

Results

Participants

A total of 266 infants were enrolled and 241 included for the development of the classifier (Fig. 2). The median time between the ET-CPAP recording and extubation was 31 min [IQR 20–55 min] and all involved patients were extubated thereafter and remained eligible for inclusion in the development of the prediction model. Extubation success occurred in 197 infants (82%). Table 1 summarizes pre- and post-extubation characteristics of the population, including timing and reasons for reintubation in the 44 infants who failed extubation.

Fig. 2: Patient Flowchart.

Table 1 Population demographics.

Model development and specification

In the CD stage, a threshold of GA at birth >28.6 weeks and weight at extubation >1160 g (Fig. E5) automatically classified 52 infants (22%) as successful extubations (no failures). The remaining 189 infants (78%) passed to the BRF stage. In this group, 44 (23%) failed extubation. For this stage, the optimal sets of hyperparameters selected for the C, CR, and CCR classifiers are presented in Table E1.

Model performance

A total of 189 leave-one-out cross-validation tests (one per infant in the BRF stage) were performed using 42 PC features selected across all tests: 11 clinical and 31 cardiorespiratory (Fig. E7). Nine of the ten PC features with greatest contribution to the classifier were cardiorespiratory. ROC curves and diagnostic performances for the three classifiers are shown in Fig. 3 and the data are presented in Table 2. The best performance was obtained using the CCR classifier, with a total of 18 PC features (Fig. E8). The three highest-contributing features for each cardiorespiratory PC used in the final CCR classifier are presented in Table E3. These top features spanned a wide range of categories, including measures of metrics (pertaining to respiratory and cardiac frequencies, power in the RCG and ABD signals, phase angle, and oxygen saturation), respiratory patterns (particularly the UNK and bradycardia patterns), and heart rate variability.

Fig. 3: Performance of the APEX classifier.

The APEX classifiers included all infants extubated at any age during the study. The Receiver Operating Characteristic (ROC) curves show the trade-off between sensitivity (Y axis) and 1-specificity (X axis) for the Clinical (orange), Cardiorespiratory (blue) and Clinical and Cardiorespiratory (green) classifiers at all thresholds. For each curve, the filled dot marker indicates the performance with the highest balanced accuracy. The black 45-degree diagonal dotted line represents the baseline model/random classifier. The closer an ROC curve comes to this diagonal line, the less powerful the model is. Note that the best accuracy was obtained with the combination of clinical and cardiorespiratory signals (green circled dot).

Table 2 Performance of the Clinical, Cardiorespiratory and combined Clinical and Cardiorespiratory classifiers.

The diagnostic and clinical values of the best classifier are presented in Fig. 4. From the diagnostic standpoint, 137/197 infants with successful and 33/44 infants with failed extubation were correctly identified by the classifier (70% sensitivity and 75% specificity, respectively) with a balanced accuracy of 73%. Clinically, when used as an adjunct tool, the classifier agreed with the decision to extubate in 148 infants (61%). Of those infants, 137 were correctly classified as extubation success (93%). However, the classifier predicted that 93 infants (39%) would fail. Of those, 60 were successfully extubated.
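The percentages in this paragraph can be reproduced from the reported counts with a short worked example. The function name is hypothetical; "positive" is taken to mean extubation success, matching the definition of sensitivity used in the Methods.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard diagnostic measures from the four cells of a confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts reported in the text, with "positive" = extubation success:
# 137 successes predicted as success, 60 successes predicted as failure,
# 33 failures predicted as failure, 11 failures predicted as success.
m = diagnostic_metrics(tp=137, fn=60, tn=33, fp=11)

# Probability of success among infants the classifier predicted would fail.
success_if_predicted_failure = 60 / (60 + 33)
```

From these counts the sensitivity is 137/197 (about 70%), the specificity 33/44 (75%), the PPV 137/148 (about 93%), and the probability of success among infants predicted to fail 60/93 (about 65%). The balanced accuracy computed from the unrounded values is about 72.3%; the reported 73% corresponds to averaging the rounded 70% and 75%.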

Fig. 4: Diagnostic and clinical values of the clinical and cardiorespiratory classifier in extremely preterm infants.

Post-hoc analysis

The performance of the APEX CCR classifier was computed for the subgroups of infants extubated at <7d (n = 123) or ≥ 7d of age (n = 118), as shown in Table E2. In the group <7d, the classifier correctly identified 16/18 failures (Specificity 89%, 95% CI: 74%, 100%) compared to 17/26 in the late extubation group (Specificity 65%, 95% CI: 47%, 84%). The diagnostic and clinical values of this analysis are presented in Fig. E9. The performance of the classifier decreased when reintubation ≤7d following extubation was used as the definition of failure (Fig. E10). Indeed, only 11 (50%) infants reintubated between 72 h and 168 h were correctly classified as failures.

Discussion

The use of machine learning algorithms to combine automated analyses of cardiorespiratory signals with clinical data improved prediction of extubation success in a high-risk population. More importantly, by using a two-stage CD-BRF approach the final classifier excluded a group of infants successfully extubated and was developed using extremely preterm infants with higher risk of extubation failure (Fig. E5). This precision medicine approach can help in the selection of a targeted population in future studies of extubation failure. Indeed, only 11 of the 148 infants classified as extubation success failed. Unfortunately, the final classifier improved identification of extubation failures at the expense of misclassifying nearly one-third of infants who were successfully extubated, making it unsuitable for clinical decisions at this point. Notably, the classifier performed best amongst infants extubated before 7 days of age, identifying 70% of successes and 89% of failures. Although this was a post-hoc analysis, it is an important finding, as early and successful extubation may mitigate complications associated with prolonged MV while limiting the adverse effects caused by reintubation during a critical age.5,10

The decision to extubate extremely preterm infants is complex and subject to substantial variability.35,36 Furthermore, extubation failure inevitably prolongs MV duration and has been associated with increased risks of respiratory morbidities and mortality, even after adjusting for the cumulative MV duration and other known confounders.7,8,10 Thus, accurate predictors of extubation success are needed but only a few have been evaluated in preterm infants.5,9,37,38 Most studies included a limited number of patients from a single center and incorporated a particular clinical or physiological parameter. As a result, the accuracy in detecting extubation failures was low when compared to clinical judgment.13 In the current study, the combination of clinical parameters with automated analysis of cardiorespiratory signals generated the highest performance classifier, thereby demonstrating the added value in acquiring those signals.17

The APEX CCR classifier performed with a balanced accuracy of 73% (sensitivity of 70% and specificity of 75%). Thus, if APEX were used alone in clinical practice, it would correctly identify 3 out of 4 infants who would go on to fail extubation, while misclassifying as failure about 30% of infants (60/197) who could otherwise have been successfully extubated. Whether this trade-off is acceptable remains unknown, as it would require direct comparisons of the costs of postponing extubation vs. preventing reintubation. Surprisingly, only one study has investigated the impact of delaying extubation by 36 h in preterm infants.39 No differences in rates of extubation failure or bronchopulmonary dysplasia were noted, but amongst infants <1000 g delayed extubation was associated with significantly shorter cumulative MV duration. Therefore, it is unknown if a short delay can be harmful. Furthermore, the experience of adverse and sometimes severe events after extubation must be considered. Indeed, trade-offs in which significant morbidity reductions are offset by a marginal increase in mortality have often shaped recommendations in Neonatology.40,41,42 In the APEX cohort, 3 infants died within hours following an electively planned extubation. The causes of death were massive pulmonary hemorrhage minutes after extubation, withdrawal of life-sustaining therapy after a diagnosis of grade 4 intraventricular hemorrhage made 7 h post-extubation, and fulminant necrotizing enterocolitis diagnosed ≤8 h post-extubation. All these patients were correctly identified by the classifier as extubation failures.

More important is the use of the classifier as an adjunct tool to clinical judgment. In this study, when a patient ‘ready’ for extubation was classified as success, the probability of successful extubation increased from 82% to 93%. On the other hand, this probability decreased from 82% to 65% if the patient was classified as failure. In infants extubated <7 days of age, the classifier identified 89% of failures (95% CI: 74%, 100%) and would have increased the probability of successful extubation from 85% to 97% (95% CI: 94%, 100%; Fig. E10).

Interestingly, the performance of the APEX CCR classifier varied considerably depending on the age at extubation and the observation window used to define extubation failure. With regard to age at extubation, the classifier accurately identified 16 out of 18 (89%) failed extubations in the first 7 days of life compared to only 17 out of 26 (65%) failed extubations beyond the first week. On one hand, these findings suggest that the classifier may not be as beneficial to the more immature and sicker infants, who tend to be extubated beyond the first week of life. On the other hand, an enhanced prediction of extubation readiness in the first week of life may prevent the clinical instability that may occur with a failed extubation during this critical period. As for the observation window used to define extubation failure, the classifier correctly identified only 50% of failures that occurred between 72 h and 7 days post-extubation. Considering that reintubations between 72 h and 7 days post-extubation are rarely caused by new non-respiratory pathologies (e.g., infection or necrotizing enterocolitis), the classifier failed to capture a large number of true, respiratory-related reintubations. However, given that reintubations occurring within 48-72 h after extubation may be associated with the highest risk of mortality/morbidity, the classifier may be justifiable as a targeted tool for those infants at highest risk of complications.

The study had some limitations. The predictor was applied after the medical team had made the decision to extubate, which tends to increase sensitivity and decrease specificity.43 Also, some infants may have been ‘ready’ earlier, which is difficult to determine precisely in clinical practice. Furthermore, it is unknown how the predictor would perform in infants on higher ventilatory settings or before being recognized as ‘ready’ by the clinical team. Also, although most respiratory features in the final classifier were recorded during ET-CPAP (Table E3), it is unknown if reliable and useful recordings of respiratory data would be possible in infants under high frequency ventilation. Due to the small number of patients, a separate analysis of the performance of the predictive tool across centers was not possible. The signal acquisition set-up involved instrumentation and the possibility of recording low-quality signals, but the use of small multimodal wireless sensors may make data acquisition more practical and reliable.44,45 The classifier was developed using a pre-specified observation window of 72 h, which may not capture all respiratory-related reintubations.33 However, the performance of the classifier decreased in the post-hoc analysis using a longer period of observation. Lastly, the pragmatic design of the study introduced significant heterogeneity in patient characteristics and pretest probabilities of extubation success, which may have decreased the accuracy of the final classifier. Indeed, restriction to infants extubated <7 days of age improved its performance.

Although the classifier demonstrated better prediction of extubation failure, this was achieved at the expense of misclassifying a percentage of successful extubations. Therefore, at this point this tool should not be used to avoid extubation, as it would prolong mechanical ventilation in some infants unnecessarily. However, by pointing out a population with a higher probability of failure among infants considered ‘ready’ for elective extubation, the classifier can identify the patients who require more intense and continuous monitoring after disconnection from MV and who might benefit from different modes of non-invasive support. Moreover, the classifier can be used to more precisely select a high-risk group to be enrolled in studies testing interventions to decrease extubation failure. Therefore, at the moment this tool should not be used for clinical decisions, as this novel approach requires further fine-tuning and more user-friendly instrumentation.

The study has several important strengths. Novel cardiorespiratory features and detailed clinical information during the peri-extubation period were acquired from a multi-institutional population. Results are generalizable given the pragmatic design, large sample size, and heterogeneity of clinical practices to determine extubation readiness. A principal component analysis was used to provide only the features that explained the most variability. Finally, the effects of class imbalance were mitigated by pre-classifying the population during the CD stage and randomly under-sampling success patients during training, thereby giving an equal number of success and failure patients to the classifier. Therefore, the advanced analytical methods and interdisciplinary collaboration used in the APEX study are a unique and important step forward towards development of tools able to objectively and effectively expedite extubation.46,47 Indeed, sustainable collaborations between disciplines are changing the focus towards multimodal data recording and analysis to enable the right choice of treatment, for the right patient, and at the right time.

Conclusion

The combined use of clinical data with automated analyses of cardiorespiratory signals by using machine learning algorithms may provide an adjunct tool to improve prediction of extubation outcomes, but still requires further refinement before adoption into clinical practice. This is critical for planning targeted trials in this understudied high-risk population. As an automated and objective method that requires no human intervention, APEX requires further investigation in larger populations from varied settings to understand its effect on patient outcomes, safety, and generalizability.