Machine learning-based prediction of acute severity in infants hospitalized for bronchiolitis: a multicenter prospective study

We aimed to develop machine learning models to accurately predict bronchiolitis severity, and to compare their predictive performance with a conventional scoring (reference) model. In a 17-center prospective study of infants (aged < 1 year) hospitalized for bronchiolitis, by using routinely-available pre-hospitalization data as predictors, we developed four machine learning models: Lasso regression, elastic net regression, random forest, and gradient boosted decision tree. Using a cross-validation method, we compared their predictive performance, including area-under-the-curve (AUC), sensitivity, specificity, and net benefit (decision curves), with that of the reference model. The outcomes were positive pressure ventilation use and intensive treatment (admission to intensive care unit and/or positive pressure ventilation use). Of 1,016 infants, 5.4% underwent positive pressure ventilation and 16.0% had intensive treatment. For the positive pressure ventilation outcome, machine learning models outperformed the reference model (e.g., AUC 0.88 [95% CI 0.84–0.93] in the gradient boosted decision tree vs. 0.62 [95% CI 0.53–0.70] in the reference model), with higher sensitivity (0.89 [95% CI 0.80–0.96] vs. 0.62 [95% CI 0.49–0.75]) and specificity (0.77 [95% CI 0.75–0.80] vs. 0.57 [95% CI 0.54–0.60]). The machine learning models also achieved a greater net benefit over ranges of clinical thresholds. Machine learning models consistently demonstrated a superior ability to predict acute severity and achieved greater net benefit.

www.nature.com/scientificreports/
Bronchiolitis is the leading cause of infant hospitalization in the US, accounting for 107,000 infant hospitalizations each year with direct costs of 734 million US dollars 1 . Even among hospitalized infants, the severity of bronchiolitis can range from moderate severity (which requires observation and supportive therapies, such as supplemental oxygen, fluid, and nutrition) to near-fatal and fatal infections. Previous studies have identified individual risk factors for higher severity of bronchiolitis (e.g., young age, prematurity, viral etiology) [2][3][4][5] and developed prediction scoring models (e.g., logistic regression models) [6][7][8][9] . However, identifying the subgroup of infants with bronchiolitis who require higher acuity care (e.g., positive pressure ventilation, intensive care unit [ICU] admission) remains an important challenge. The difficulty and uncertainty of predicting acute severity-and, consequently, the appropriate level of care for infants with bronchiolitis-are reflected by the well-documented variability in inpatient management across the nation 1,10-12 .
Machine learning models have gained increasing attention because of their advantages, such as the ability to incorporate high-order, nonlinear interactions between predictors and to yield more accurate and stable predictions. Indeed, recent studies have reported that the use of machine learning models provides a high predictive ability in various conditions and settings-e.g., sepsis 13,14 , asthma exacerbation 15 , emergency department (ED) triage 16,17 , and unplanned transfers to ICU 18 . Despite the clinical and research promise, no study has yet examined the utility of modern machine learning models in predicting outcomes in infants hospitalized for bronchiolitis-a large population with high morbidity and health resource use.
In this context, we aimed to develop machine learning models that accurately predict acute severity in infants hospitalized with bronchiolitis, and compare their predictive performance with that of conventional scoring approaches 6 .

Results
During 2011-2014, 1,016 infants with bronchiolitis were enrolled into a 17-center prospective cohort study. The median age at enrolment was 3.2 months (IQR 1.6-6.0), 40% were female, and 42% were non-Hispanic white. The length of hospital stay varied widely from 0 to 60 days (median, 2 days) (Table 1). Clinical data had a small proportion of missingness; most predictors had < 1% missingness (e.g., oxygen saturation with the use of supplemental oxygen, 0.1%), while the maximum proportion of missingness was 4.8% (eTable 3 in Additional file 1). Overall, 55 infants (5.4%) underwent positive pressure ventilation and 163 infants (16.0%) had the intensive treatment outcome.
Predicting positive pressure ventilation outcome. In the prediction of positive pressure ventilation outcome, the discriminatory abilities of all models are summarized in Fig. 1A Table 2) and specificity (e.g., 0.57 [95% CI 0.54-0.60] in the reference model vs. 0.79 [95% CI 0.77-0.82] in the Lasso regression model). More specifically, all machine learning models correctly predicted a larger number of infants who underwent positive pressure ventilation (true-positives) with fewer predicted outcomes (Table 3). For example, the reference scoring system categorized most infants (n = 629, 62%) into the prediction score groups of 2-3. The reference model correctly identified 16 out of 25 infants who underwent positive pressure ventilation, while predicting that 265 infants would undergo positive pressure ventilation. In contrast, the gradient boosted decision tree model correctly identified 23 (of 25) patients, while predicting that 135 infants would undergo positive pressure ventilation in the same patient groups. Given the low prevalence of the positive pressure ventilation outcome, all models had a high negative predictive value (e.g., 0.96 [95% CI 0.95-0.97] in the reference model vs. 0.99 [95% CI 0.99-0.99] in the Lasso regression model; Table 2).
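The counts quoted above for the score groups of 2-3 can be turned into subgroup sensitivity and positive predictive value with simple arithmetic. The following is a minimal sketch in Python (the study itself used R); it assumes that all four counts refer to the same subgroup of infants, and the function and variable names are hypothetical.

```python
# Counts quoted in the text for infants in reference score groups 2-3.
# Assumption: all counts refer to the same subgroup of infants.
OUTCOME_POSITIVE = 25  # infants who underwent positive pressure ventilation

models = {
    "reference": {"true_pos": 16, "predicted_pos": 265},
    "gradient_boosted_tree": {"true_pos": 23, "predicted_pos": 135},
}


def subgroup_metrics(true_pos, predicted_pos, outcome_pos):
    """Return sensitivity, positive predictive value, and false-positive count."""
    return {
        "sensitivity": true_pos / outcome_pos,
        "ppv": true_pos / predicted_pos,
        "false_pos": predicted_pos - true_pos,
    }


results = {
    name: subgroup_metrics(c["true_pos"], c["predicted_pos"], OUTCOME_POSITIVE)
    for name, c in models.items()
}

for name, m in results.items():
    print(f"{name}: sensitivity={m['sensitivity']:.2f}, "
          f"PPV={m['ppv']:.3f}, false positives={m['false_pos']}")
```

Under that assumption, the subgroup sensitivity is 16/25 = 0.64 for the reference model and 23/25 = 0.92 for the gradient boosted decision tree, with 249 and 112 false-positives, respectively.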
Likewise, in the decision curve analysis (Fig. 1B), all four machine learning models outperformed the reference model, demonstrating a greater net benefit throughout the range of clinical thresholds, indicating that the machine learning prediction would more accurately identify high-risk infants (true-positives) while taking the trade-off with false-positives into consideration.
Predicting intensive treatment outcome. In the prediction of intensive treatment outcome, the discriminatory abilities of all models are shown in Fig. 2A (Table 3). In contrast, the gradient boosted decision tree correctly identified 52 (out of 80) infants with the outcome, while predicting that 162 infants would have had intensive treatment.

Discussion
In this analysis of multicenter prospective cohort data from 1,016 infants, we applied four modern machine learning approaches (i.e., Lasso regression, elastic net regression, random forest, and gradient boosted decision tree) to the prediction of acute severity outcomes of bronchiolitis. Compared to the reference model that was derived in an ED sample 6 , these machine learning models consistently demonstrated superior performance in predicting positive pressure ventilation and intensive treatment outcomes, including AUC and net reclassification. Additionally, the machine learning models achieved a higher sensitivity and specificity for the two outcomes, both in the overall cohort and in the majority of cohort infants who were categorized into the reference score groups of 2-3. Furthermore, the decision curve analysis demonstrated that the net benefit of machine learning models was also greater-i.e., a larger number of true-positives considering a trade-off with false-positives-across a range of clinical thresholds. To the best of our knowledge, this is the first study that has investigated the performance of modern machine learning models in predicting severity in infants with bronchiolitis. One of the main objectives in the risk stratification of infants with bronchiolitis is to promptly identify infants at risk for higher severity and efficiently utilize finite healthcare resources. The American Academy of Pediatrics bronchiolitis guideline 2 highlights the importance of assessing the risk in infants with bronchiolitis. However, optimal risk stratification and prediction remain a challenge as the clinical course in this population (even in infants hospitalized for bronchiolitis) is highly variable [10][11][12] .
Previous studies, by using conventional modeling (e.g., logistic regression models), have reported a moderate ability to predict severity outcomes (e.g., ED-to-hospital admission, hospital length-of-stay, ICU admission, positive pressure ventilation use) of infants with bronchiolitis [6][7][8][9]19 . Although the use of an expanded set of predictors-e.g., repeated examinations and invasive monitoring during hospital course-may yield better predictive performance, it is often impractical in real-world acute care settings, where the aim is to promptly risk-stratify these infants. Alternatively, the use of advanced machine learning models may improve the clinician's decision-making ability. Indeed, machine learning models have recently been applied to the prediction of various disease conditions and clinical settings, such as early identification of mortality risk in patients with sepsis 13 , rehospitalization in patients with heart failure 20 , intensive treatment outcomes in patients with asthma exacerbation 15 , unplanned transfer to ICU 18 , and escalated care at pediatric ED triage 16 . Our multicenter study builds on these earlier reports, and extends them by demonstrating that the modern machine learning models outperform conventional approaches in predicting higher severity of infants with bronchiolitis. While external validation is warranted, these machine learning models using routinely-available predictors can be implemented in clinical practice (e.g., online risk calculators or built-in risk assessment systems)-similar to existing clinical scoring rules.
Table 2. Prediction performance of the reference and machine learning models in infants hospitalized for bronchiolitis. AUC area under the receiver-operating-characteristic curve, NRI net reclassification improvement, PPV positive predictive value, NPV negative predictive value. a P-value was calculated to compare area-under-the-curve of the reference model with that of each machine learning model. b We used continuous NRI and its P-value.
Clinical prediction systems strive for an appropriate balance between sensitivity and specificity because of the trade-off relationship between these two factors in the context of prevalence of clinical outcomes. In the present study, we observed that the reference score model did not effectively categorize most infants (i.e., 62% of the cohort were categorized into the two score groups) or appropriately predict infants who developed the outcomes. By contrast, the machine learning models correctly identified a larger number of true-positives (i.e., higher sensitivity). This finding supports the utility of these models in the target population, for which one of the major priorities is to reduce "missed" high-risk cases. Additionally, the machine learning models also had fewer false-positives (i.e., higher specificity) in predicting both outcomes, although they were imperfect in the setting of a relatively low prevalence of the outcome (5.4% for positive pressure ventilation use). This may mitigate excessive resource use in this large population. These findings are further supported by the decision curve analysis, which demonstrated a greater net benefit of the machine learning models incorporating the trade-offs between true-positives and false-positives across wide ranges of clinical thresholds. There are several potential explanations for the observed gains in the predictive abilities of machine learning models.
For example, machine learning models incorporate high-order interactions between predictors and nonlinear relationships with outcomes. Additionally, machine learning models are able to mitigate potential overfitting by adopting several methods, such as regularization, out-of-bag estimation, and cross-validation. Furthermore, the use of large multicenter data with rigorous quality assurance might have contributed to low bias and variance in the machine learning models. Although the machine learning models achieved superior predictive ability, their performance remained imperfect. This may be explained, at least partially, by the limited set of predictors, subjectivity of some data elements (e.g., parent-reported symptoms at home), variable clinical factors after prehospitalization assessment (e.g., ED management and patient responses), differences in clinicians' practice patterns, and availability of intensive care resources. Notwithstanding the complexity and challenges of clinical prediction in infants with bronchiolitis, machine learning models have scalable advantages in the era of health information technology, such as automated sophistication of models through the sequential extraction of electronic health records, continuous non-invasive physiological monitoring, natural language processing, and reinforcement learning [21][22][23][24] . In the past, this scalability had not been attainable with the use of conventional approaches. Taken together, our findings and recent developments support cautious optimism that modern machine learning may enhance the clinician's ability as an assistive technology. Our study has several potential limitations. Firstly, the data may be subject to measurement bias and missingness. However, the study was conducted by trained investigators using a standardized protocol, which led to the low proportion of missingness in the predictors (eTable 3 in Additional file 1).
Secondly, the clinical thresholds for these outcomes may depend on local resources and vary between clinicians and hospitals (e.g., different criteria for admission to the ICU). Yet, the decision curve analysis demonstrated the greater benefit of the machine learning models across a wide range of clinical thresholds. Lastly, the study cohort consisted of a racially/ethnically and geographically diverse US sample of infants hospitalized with bronchiolitis. While the severity of this population was highly variable and the models used pre-hospitalization data, our models might not be generalizable to infants in ambulatory settings. External validation of the models in different populations and settings is necessary. Nonetheless, our data remain highly relevant for the 107,000 infants hospitalized yearly in the US 1 .

Conclusion
Based on data from a multicenter prospective cohort of 1,016 infants with bronchiolitis, we developed four machine learning models to predict severity of illness. By using prehospitalization data as predictors, these models consistently yielded superior performance-a higher AUC, net reclassification, sensitivity, and specificity-in predicting positive pressure ventilation and intensive treatment outcomes over the reference model 6 . Specifically, these advanced machine learning models correctly predicted a larger number of infants with higher severity, with fewer false-positives, who would not be appropriately predicted by the conventional models. Moreover, the machine learning models also achieved a greater net benefit across wide ranges of clinical thresholds. Although external validation is warranted, the current study lends support to the application of machine learning models to the prediction of acute severity in infants with bronchiolitis. Machine learning models have the potential to enhance clinicians' decision-making ability and hence to improve clinical care and optimize resource utilization in this high-morbidity population.

Methods
Study design, setting and participants. The current study aimed to develop machine learning models that accurately predict acute severity in infants with bronchiolitis, by using the data from a multicenter prospective cohort study of 1,016 infants hospitalized for bronchiolitis-the 35th Multicenter Airway Research Collaboration (MARC-35) study 25,26 . MARC-35 is coordinated by the Emergency Medicine Network (EMNet, https://www.emnet-usa.org 27 ), an international research collaboration with 246 participating hospitals. Briefly, at 17 sites across 14 U.S. states (eTable 1 in Additional file 1), MARC-35 enrolled infants (aged < 1 year) who were hospitalized with an attending physician diagnosis of bronchiolitis during three consecutive bronchiolitis seasons (November 1 to April 30) during 2011-2014. The diagnosis of bronchiolitis was made according to the American Academy of Pediatrics bronchiolitis guidelines 2 , defined as acute respiratory illness with a combination of rhinitis, cough, tachypnea, wheezing, crackles, and retractions. We excluded infants who were transferred to a participating hospital > 24 h after initial hospitalization or who had preexisting heart or lung disease, immunodeficiency, immunosuppression, or gestational age of < 32 weeks.
We followed the Standards for Reporting Diagnostic Accuracy statement guideline for the reporting of prediction models 28 .
Predictors. For predictors in the machine learning models, we selected variables based on clinical plausibility and a priori knowledge 3,6-9,29-31 . These predictors-which are available in most prehospitalization settings-included demographics (age, sex, and race/ethnicity), medical history (prenatal maternal smoking, gestational age, birth weight, postnatal ICU admission, history of hospital and ICU admission, history of breathing problems, and history of eczema), parent-reported symptoms (poor feeding, cyanosis, apnea, and duration of symptoms), ED presentation (vital signs [temperature, pulse rate, respiratory rate, oxygen saturation], interaction between oxygen saturation and supplemental oxygen use, wheezing, retractions, apnea, and dehydration), and detection of respiratory syncytial virus (RSV) by PCR 25 . These clinical data were obtained through a structured interview and medical record review by trained physicians and investigators using a standardized protocol 26 . All data were reviewed at the EMNet Coordinating Center at Massachusetts General Hospital (Boston, MA), and site investigators were queried about missing data and discrepancies identified by manual data checks.
Outcomes. The primary outcome was the use of positive pressure ventilation-continuous positive airway pressure ventilation and/or intubation during inpatient stay 32 . The secondary outcome was intensive treatment defined as a composite of ICU admission and/or the use of positive pressure ventilation during the inpatient stay 3,31 . In this observational study, patients were managed at the discretion of treating physicians. These two outcomes have been employed as outcomes in the MARC-35 study.
Statistical analysis. In the training sets (80% randomly-selected samples) in fivefold cross-validation, we developed five models: the reference model 6 and four machine learning models for each outcome. As the reference model, we fit logistic regression models using the predictors of a previously-established clinical prediction score that was derived using an ED sample 6 . We selected this prediction score as the reference model since it was recently developed in a large sample and focused on similar clinical outcomes reflecting acute severity of bronchiolitis 6,33 . The predictors included age, poor feeding, oxygen saturation, retractions, apnea, and dehydration, excluding nasal flaring/grunting, based on the availability of data in the current study (eTable 2). Next, using the prehospitalization predictors, we developed four machine learning models: (1) logistic regression with Lasso regularization (Lasso regression), (2) logistic regression with elastic net regularization (elastic net regression), (3) random forest, and (4) gradient boosted decision tree models. First, Lasso regression is an extension of regression-based models that has an ability to shrink (or regularize) the predictor coefficients toward zero, thereby effectively selecting important predictors and improving interpretability of the model 34 .
Lasso regression computes the optimal regularization parameter (lambda) that minimizes the sum of least squares plus an L1-shrinkage penalty using a cross-validation method 35 . Second, elastic net regression is another regression-based model incorporating both Lasso regularization and Ridge regularization 34,36 . Elastic net regression calculates the optimal regularization parameter that minimizes the sum of least squares plus a weighted L1-shrinkage penalty and a weighted L2-shrinkage penalty. We used the R glmnet and caret packages for the Lasso regression and elastic net regression models 37,38 . Third, random forest is an ensemble of decision trees generated by bootstrapped training samples with random predictor selection in tree induction 34,39 . We created a hyperparameter tuning grid to identify the best set of parameters using cross-validation methods. We used the randomForest and caret packages to construct the random forest models 38,40 . Lastly, gradient boosted decision tree is another ensemble method, which constructs new simple tree models predicting the errors and residuals of the previous model. When adding a new tree, this model uses a gradient descent algorithm that minimizes a loss function 41 . We performed hyperparameter tuning sequentially using a fivefold cross-validation method. We used the R xgboost and caret packages to construct the gradient boosted decision tree models 38,42 . To minimize potential overfitting, we utilized several methods-e.g., regularizations (or penalizations) in the Lasso and elastic net regression models, out-of-bag estimation in the random forest models, and cross-validation in all models.
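For orientation, the penalized objective behind the Lasso and elastic net models can be written in one line. The form below follows the parameterization documented for the glmnet package; the symbols (N observations, predictor vector x_i, coefficients beta, mixing weight alpha, penalty strength lambda, per-observation loss ell) are notation introduced here and do not appear in the original text.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{N}\sum_{i=1}^{N} \ell\!\left(y_i,\; \beta_0 + x_i^{\top}\beta\right)
\;+\;
\lambda\left[\alpha\,\lVert\beta\rVert_1 + \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\right],
\qquad
\ell(y,\eta) = -\,y\,\eta + \log\!\left(1 + e^{\eta}\right)
```

Setting alpha = 1 gives the Lasso (L1 penalty only), alpha = 0 gives Ridge regression, and intermediate values give the elastic net. The loss shown is the binomial negative log-likelihood that applies to a logistic model; with a continuous outcome it would be replaced by squared error. In both cases lambda is typically chosen by cross-validation.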
As for the predictor engineering methods of the machine learning models, we preprocessed predictors sequentially. First, we investigated non-linear relationships between the continuous predictors and outcomes and created quadratic terms of age, respiratory rate, and temperature. These quadratic terms were used only for regression-based machine learning models (i.e., logistic regression models with Lasso regularization and those with elastic net regularization). Second, we retained only one predictor from each pair of highly correlated predictors (e.g., age and weight at hospitalization). Third, we imputed predictors with missing values (eTable 3) using bagged tree imputation. Fourth, we converted continuous predictors into normalized scales using Yeo-Johnson transformation. Categorical predictors were coded as dummy variables while birth weight, gestational age, previous breathing problem, and degree of retraction were coded as ordinal variables. Fifth, to incorporate the clinically evident interaction between oxygen saturation level and use of supplemental oxygen, we created an interaction term between oxygen saturation and use of supplemental oxygen. Lastly, we removed predictors that are highly sparse in the dataset. We applied these preprocessing methods independently to the training sets and the test sets to avoid carrying the information from the training sets to the test sets. We used the R recipes package for this predictor preprocessing 43 .
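The Yeo-Johnson transformation mentioned above has a simple closed form. The following is a minimal Python sketch (the study used R), not the authors' code: the function name is hypothetical, the example values are invented, and the power parameter is fixed by hand here, whereas in practice it is usually estimated from the training data.

```python
import math


def yeo_johnson(y, lmbda):
    """Yeo-Johnson power transform of a single value y for a given lambda."""
    if y >= 0:
        if abs(lmbda) < 1e-12:
            return math.log1p(y)                      # lambda == 0 branch
        return ((y + 1.0) ** lmbda - 1.0) / lmbda
    if abs(lmbda - 2.0) < 1e-12:
        return -math.log1p(-y)                        # lambda == 2 branch
    return -(((1.0 - y) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)


# Hypothetical respiratory rates (breaths/min); lambda = 0 is an
# illustrative choice, not a value reported in the study.
respiratory_rates = [32.0, 40.0, 48.0, 60.0, 72.0]
transformed = [yeo_johnson(v, 0.0) for v in respiratory_rates]
print([round(t, 3) for t in transformed])
```

Unlike the Box-Cox transformation, this form is defined for zero and negative values, which makes it convenient for predictors that have already been centered.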
To examine the variable importance in the random forest, we used permutation-based variable importance, i.e., the normalized average difference between the out-of-bag prediction accuracy and the same measure after permuting each predictor. In the gradient boosted model, we also computed the variable importance that is summed over iterations 39 . We graphically presented the rank of variable importance using unscaled values.
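The idea behind permutation-based importance can be sketched in a few lines. The Python example below is an illustration under simplifying assumptions, not the randomForest implementation: it permutes each predictor on a single evaluation set rather than on per-tree out-of-bag samples, omits the normalization step, and uses a hypothetical toy classifier and simulated data.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, n_features, rng):
    """Drop in accuracy after shuffling each predictor column in turn."""
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between predictor j and the outcome
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        importances.append(baseline - accuracy(predict, permuted, labels))
    return importances


# Simulated data: the outcome depends on predictor 0 only; predictor 1 is noise.
rng = random.Random(0)
rows = [[rng.random(), rng.random()] for _ in range(500)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]


def toy_model(row):
    """Hypothetical classifier that uses predictor 0 and ignores predictor 1."""
    return 1 if row[0] > 0.5 else 0


importances = permutation_importance(toy_model, rows, labels, 2, rng)
print(importances)
```

Shuffling the informative predictor sharply reduces accuracy, whereas shuffling the ignored predictor leaves it unchanged, which is the contrast that a variable importance ranking summarizes.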
To measure the test performance of each model, we computed the overall cross-validation performance from the test sets (the remaining randomly-selected 20% samples). As measures of predictive performance, we used (1) the area under the receiver-operating-characteristic curve (AUC), (2) net reclassification improvement, (3) confusion matrix results (i.e., sensitivity, specificity, positive predictive value, and negative predictive value), and (4) net benefit from decision curve analysis. To compare the AUC between the models, we used DeLong's test 44 . To compute the AUC and its confidence interval, we used the pROC package 45 . We also used the net reclassification improvement to quantify whether a new model provides clinically relevant improvements in prediction when compared to the reference model 46 . To compute the net reclassification improvement, we used the PredictABEL package 47 . To address the class imbalance in both outcomes, we employed the value with the shortest distance to the top-left corner of the ROC plot as the threshold for the confusion matrix 39 . The decision curve analysis incorporates the information on both the benefit of correctly predicting the outcome (true-positives) and the relative harm of incorrectly labelling patients as if they would have the outcome (false-positives)-i.e., the net benefit [48][49][50][51][52] . We made a graphical presentation of the net benefit for each model over a range of threshold probabilities (or clinical preferences) of the outcome as decision curves. We used decision curve analysis R source code from Memorial Sloan Kettering Cancer Center 53 and plotted the graphs using the ggplot2 package 54 . We performed all analyses with R version 3.5.1 (R Foundation for Statistical Computing, Vienna, Austria) 55 .
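Two of the quantities described above, the threshold closest to the top-left corner of the ROC plot and the net benefit at a given threshold probability, can be sketched compactly. The Python code below is an illustration with invented predicted probabilities, not the R code used in the study; the function names are hypothetical, and it assumes that both outcome classes are present in the data.

```python
def confusion_counts(probs, labels, threshold):
    """Count true/false positives and negatives when predicting p >= threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted = p >= threshold
        if predicted and y == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def closest_top_left_threshold(probs, labels):
    """Threshold minimizing (1 - sensitivity)^2 + (1 - specificity)^2."""
    best_threshold, best_distance = None, float("inf")
    for t in sorted(set(probs)):
        tp, fp, tn, fn = confusion_counts(probs, labels, t)
        sens = tp / (tp + fn)  # assumes at least one positive
        spec = tn / (tn + fp)  # assumes at least one negative
        distance = (1 - sens) ** 2 + (1 - spec) ** 2
        if distance < best_distance:
            best_threshold, best_distance = t, distance
    return best_threshold


def net_benefit(probs, labels, pt):
    """Net benefit = TP/n - FP/n * pt / (1 - pt) at threshold probability pt."""
    tp, fp, _, _ = confusion_counts(probs, labels, pt)
    n = len(labels)
    return tp / n - fp / n * pt / (1 - pt)


# Invented predicted probabilities and outcomes for eight patients.
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 0, 1, 1]

best = closest_top_left_threshold(probs, labels)
nb = net_benefit(probs, labels, 0.5)
print(best, nb)
```

A decision curve is obtained by evaluating the net benefit over a grid of threshold probabilities for each model and plotting the results alongside the "treat all" and "treat none" strategies.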

Data availability
The datasets generated and analysed during the current study are not publicly available because of restrictions in the informed consent documents. Per the informed consent documents of the MARC research participants, data sharing and use are limited to severe bronchiolitis, recurrent wheezing, asthma, and related concepts. Accordingly, the data are available from the corresponding author on reasonable request.