Introduction

Critical illness in children leads to millions of hospital admissions to a Pediatric Intensive Care Unit (PICU).1,2 Since the inception of the field nearly six decades ago, outcomes for these patients have steadily improved, with PICU mortality rates as low as 1–2%.3,4,5 While clinical research has played a role in improving outcomes, surprisingly few therapies in pediatric critical care are supported by high levels of evidence. For example, in recent guidelines for the care of children with sepsis6 and traumatic brain injury,7 the vast majority of recommendations were supported by “low” quality of evidence. Reasons for this paucity of evidence-based therapies include heterogeneity across the age spectrum seen in the specialty, the limitations of extrapolating from adult studies, low mortality rates that force reliance on alternative outcome measures of possibly lower interest, patient volumes lower than in adult critical care, and heterogeneity within clinical diagnoses (e.g., sepsis). A reliance on traditional ways of collecting and analyzing data has also limited the field of pediatric critical care research.

New paradigm-shifting approaches in machine learning, predictive modeling, functional immunophenotyping, and artificial intelligence (AI) have been developed to improve understanding and specificity in refining definitions of disease. Rapid growth in both computing power and data storage has enabled a wide range of applications for machine learning and AI within medicine. The term AI refers to the domain of tasks that historically required human intelligence, while machine learning is the subset of AI in which algorithms learn from data without explicit programming.8 Both have impacted drug discovery, personalized diagnostics, therapeutics, and medical imaging.9,10 Within pediatric critical care, these techniques have the potential to significantly improve our understanding of disease and of therapeutic efficacy. In this review, we outline different machine learning techniques, provide an overview of current AI applications and specific machine learning/AI limitations, and discuss how these technologies will further the field of pediatric critical care.

Machine learning

In machine learning, algorithms learn to correctly classify data or make predictions by examining the data provided. There are three broad categories: supervised machine learning, unsupervised machine learning, and neural networks (Fig. 1).

Fig. 1: Types of machine learning.

Examples of a few of the many types of machine learning, including subcategories of supervised and unsupervised machine learning.

Supervised machine learning

Supervised machine learning is the most prevalent in medicine.11,12,13,14,15,16,17,18,19 In supervised learning, labeled datasets are used to train an algorithm to correctly classify data.11,12,13,14,15,16,17,18,19 To train an algorithm, the labeled data are used to infer associations between independent and dependent variables. The respective weights of the independent variables are adjusted within the algorithm until it arrives at the best fit, that with the least error in predicting the dependent variable. The trained model derived from this process is then validated on additional datasets to assess its generalizability. Algorithm performance can be assessed in several ways, including sensitivity, accuracy, and area under the receiver operating characteristic curve20 (Table 1).
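
As a minimal illustration of this workflow, the sketch below trains a classifier on labeled synthetic data, holds out a validation set, and reports the AUROC; the library (scikit-learn) and all variable names are our own assumptions for illustration, not drawn from any cited study.

```python
# Minimal supervised-learning sketch: train on labeled data, validate on a
# held-out set, and assess performance (scikit-learn; synthetic data only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # independent variables (e.g., vitals, labs)
y = (X[:, 1] + rng.normal(size=1000) > 1).astype(int)   # labeled outcome

# Hold out 30% of patients to assess generalizability of the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = LogisticRegression().fit(X_train, y_train)  # weights fit to training data
probs = model.predict_proba(X_test)[:, 1]           # predicted probabilities
print(f"AUROC on held-out data: {roc_auc_score(y_test, probs):.2f}")
```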

Table 1 Common Measures of Model Performance.

Supervised machine learning methods are often task-driven and can complete classification and regression tasks (Fig. 1). Classification is used when the outcome of interest is a categorical variable (alive/dead, high risk/low risk, etc.). The model uses the independent variables in the labeled dataset to determine the category of the dependent variable. A commonly cited example of supervised machine learning is training a model to relate a patient’s demographics and smoking history to an outcome such as lung cancer.21 Commonly used algorithms for classification include logistic regression, k-nearest neighbors, decision trees, gradient boosting, support vector machines, and naive Bayes algorithms.9 Regression is used to predict a numerical value for the dependent variable. These models produce continuous outcomes and have been used outside medicine to predict house prices, stock prices, and sales. Commonly used algorithms for regression include linear regression, lasso regression, polynomial regression, support vector regression, random forest algorithms, and boosted as well as ensemble methods.22
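
A brief sketch contrasting the two task types on synthetic data, assuming scikit-learn's random forest implementations (the features and outcomes are hypothetical):

```python
# Classification (categorical outcome) vs. regression (continuous outcome)
# using random forests on synthetic data (scikit-learn; illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y_class = (X[:, 0] > 0).astype(int)            # e.g., high risk vs. low risk
y_reg = 2.0 * X[:, 0] + rng.normal(size=500)   # e.g., a continuous value

clf = RandomForestClassifier(random_state=0).fit(X, y_class)
reg = RandomForestRegressor(random_state=0).fit(X, y_reg)

print(clf.predict(X[:2]))   # predicted categories
print(reg.predict(X[:2]))   # predicted continuous values
```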

In pediatrics, supervised machine learning is commonly used for prognostic predictions. In prognostic models, the algorithm is used for risk stratification for outcomes of interest.23,24,25,26 Examples of this include using machine learning to determine the risk of serious bacterial infection in a cohort of children in the emergency department,24 to determine if a subgroup of critically ill patients would be more likely to benefit from corticosteroids,23 and to determine the risk of developing childhood asthma.26

Predictive modeling can be used to predict whether a patient will respond to treatment or is at risk of clinical deterioration.27,28,29,30 In the hospital setting, real-time or recent clinical data can be used in a decision-support tool to alert clinicians to subtle signs of clinical deterioration that can be acted on prior to decompensation. Examples include using algorithms to detect the need for transfer to an intensive care unit (ICU)29,30 and to detect deterioration of children in the cardiac ICU.13

Unsupervised machine learning

Unsupervised machine learning is used to find previously undetected patterns and clusters in unlabeled data (Fig. 1).11,12,13,14,15,31,32,33,34,35 Because it works with unlabeled data, unsupervised machine learning requires less manual intervention and may serve as a data exploration tool. While these techniques can yield previously undiscovered patterns, the resulting groupings are not necessarily clinically meaningful without clinician insight. In addition to data exploration, unsupervised machine learning can also be used for classification tasks.

Examples of unsupervised machine learning include cluster analysis, in which data are grouped based on similarities, differences, and associations. Dimensionality reduction can be applied to large datasets in the preprocessing phase; it reduces the number of variables while preserving the integrity of the dataset, making it more manageable for analysis. Common clustering algorithms include latent class or profile analysis, k-means clustering, and hierarchical clustering.
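
A minimal sketch of this workflow, assuming synthetic unlabeled data and scikit-learn's implementations of principal component analysis (for dimensionality reduction) and k-means clustering:

```python
# Unsupervised-learning sketch: reduce dimensionality with PCA, then group
# the unlabeled data with k-means (scikit-learn; synthetic data only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))      # 300 patients, 20 unlabeled variables

X_reduced = PCA(n_components=2).fit_transform(X)   # 20 variables -> 2 components
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(np.bincount(labels))          # size of each discovered cluster
```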

Latent class or profile analysis is the most commonly used unsupervised machine learning technique in pediatric research.32,33,34,35 Latent class or profile analysis allows the detection of possible unmeasured groups within a population by inferring patterns from indicators among the observed variables.36 This differs from cluster analysis, which assigns groupings based on a distance measure, whereas latent class or profile analysis estimates the probability of each unit belonging to a class.36 Recent reanalyses using latent class analysis of the RESTORE (Randomized Evaluation of Sedation Titration for Respiratory Failure) and BALI (Biomarkers in Children with Acute Lung Injury) studies have shown that while patients may fit under a unifying definition of pediatric acute respiratory distress syndrome (ARDS), within these groups there may be hypoinflammatory and hyperinflammatory phenotypes.33 Adult literature has shown that similar phenotypes in ARDS have varying responses to targeted therapies (e.g., positive end-expiratory pressure (PEEP), fluid management, statins).37,38,39 Researchers have also used these techniques to identify phenotypes in critically ill children with sepsis34 and near-fatal asthma.35
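
Latent class and profile models are usually fitted with dedicated statistical software; as a rough Python analog, a Gaussian mixture model likewise estimates the probability of each patient belonging to each latent group rather than assigning hard clusters (scikit-learn; the two "phenotypes" below are simulated):

```python
# Latent-profile-style sketch: a Gaussian mixture model yields per-patient
# class-membership probabilities (scikit-learn; simulated biomarker data).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(200, 4)),   # simulated phenotype A
               rng.normal(2, 1, size=(100, 4))])  # simulated phenotype B

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)   # probability of each patient in each class
print(probs[:3].round(2))
```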

Neural networks

Artificial neural networks are another type of machine learning model; they take inspiration from biological neural networks and may be supervised or unsupervised. They are most comparable to gradient boosting methods and are a popular classifier algorithm.9 They consist of layers of neurons: an input layer, one or more hidden layers, and an output layer. Input layers typically consist of input variables such as physiologic or laboratory markers, and hidden layers apply functions (a series of calculations weighing or combining input variables) to predict the output layer.40 Two common types of neural networks are recurrent neural networks, which can process large amounts of sequential data and “learn” from missed predictions, and convolutional neural networks, which specialize in transforming imaging data.40 While several applications have been published,41,42,43 historic limitations include their “black box” nature and the difficulty of determining clinical importance. Recent advances such as randomized input sampling for explanation and generative adversarial networks have substantially reduced the “black box” nature of neural networks; these techniques have allowed researchers to determine which portions of an x-ray were important to an algorithm in predicting whether an image belonged to a COVID-19 positive or negative patient.44,45
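
A minimal sketch of a feed-forward network with one hidden layer, assuming scikit-learn and synthetic inputs standing in for physiologic or laboratory variables:

```python
# Small feed-forward neural network: 6-variable input layer, one hidden
# layer of 16 neurons, and an output layer of class probabilities.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(800, 6))    # input layer: e.g., physiologic/lab markers
y = (X[:, 0] + X[:, 3] + rng.normal(size=800) > 0).astype(int)

net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                    random_state=0).fit(X, y)   # weights learned from labels
print(net.predict_proba(X[:2]).round(2))        # output layer probabilities
```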

Predictive modeling techniques

There are several widely used illness severity scores in pediatric critical care that were developed over the past four decades using traditional approaches. The first widespread physiology-based scoring system to assess the risk of mortality in critically ill children was the Physiologic Stability Index (PSI), published in 1984.46 The same group of investigators simplified the PSI into the Pediatric Risk of Mortality (PRISM) score several years later, improving usability by reducing the number of variables from 34 to 14.47 Another group developed the Pediatric Index of Mortality (PIM) in 1996.48 Also based on the PSI, the PIM score required only eight variables present within the first hour of PICU care. These scores have undergone serial refinement into the PRISM IV and PIM3 scores by adjusting which variables are included, their cut-offs, and their weights.49,50 Because mortality in the PICU is uncommon, other illness severity scores such as the Pediatric Logistic Organ Dysfunction (PELOD) score, PELOD-2 score, and Pediatric Sequential Organ Failure Assessment (pSOFA) score are intended to quantify organ dysfunction. The Pediatric Organ Dysfunction Information Update Mandate (PODIUM) developed contemporary criteria to define pediatric single- and multi-organ dysfunction.51 Its panel of 88 content experts from 47 institutions appraised the body of present-day peer-reviewed evidence defining pediatric organ failure for 11 organ systems. The goals of this endeavor are to promote early recognition and appropriate treatment of pediatric organ dysfunction and to create a globally accepted platform for universal nomenclature, promoting enhanced multi-institutional collaborative research.

There are inherent limitations to these scores and the methods used to develop them. While PRISM and PIM relied on variables already established as predictive in the PSI score, the PSI variables themselves were selected subjectively via the consensus of “a group of pediatric intensivists”. Similarly, variables for PELOD were chosen by the Delphi method, and in PODIUM the final variables were voted on by the panel of content experts after being selected through a rigorous examination of the literature. While expert consensus does identify variables associated with outcomes of interest, it is inherently limited in scope and prone to bias. Many of the variables themselves are single values of continuous variables (e.g., heart rate), with the “worst” value in a specified time range being used for scoring. Improved methods to identify and weight variables could enable predictive scores to be improved for use on cohorts and refined sufficiently for use on individual patients. Many of these scores were also developed to describe outcomes and stratify severity of illness at the population or individual intensive care unit level, and they may be limited in their ability to predict outcomes for individual patients.

A major reason for the paucity of widely recognized and validated pediatric predictive tools is the low mortality rate and the substantially smaller number of patients compared with critically ill adults. Many in the field are moving away from developing additional tools to predict mortality or define the severity of specific organ system dysfunction. There is now momentum toward more nuanced outcomes such as disease trajectory, clinical deterioration during hospitalization, and the development of new cognitive or physical disability. By applying machine learning and AI to modern monitoring systems, the field is rapidly advancing toward higher-level predictive modeling that will likely soon become standard in the care of critically ill pediatric patients. Several single-center prospective and retrospective studies have recently been published, underscoring the importance of advancing this field and highlighting the significant momentum building worldwide. Several groups have begun to develop machine learning algorithms to predict the development of sepsis or septic shock in pediatric inpatients.52,53,54 This is in addition to the use of machine learning in pediatrics for predicting 30-day readmissions,55 the need for massive transfusion following blunt trauma,56 the risk of cerebral hemorrhage in preterm infants,57 and the early prediction of acute kidney injury (AKI).58,59 Recent publications also include using machine learning to predict the absence of serious bacterial infection at the time of pediatric intensive care unit admission, with a goal of reducing antibiotic days per patient.60 Finally, machine learning is being used to predict long-term neurologic outcomes in pediatric traumatic brain injury patients.61,62,63 Ultimately, the use of dynamic trends in physiologic data and changes in laboratory values over time, together with high-fidelity machine learning algorithms, will provide a more robust and fertile landscape for outcome prediction in the critically ill pediatric patient.

Clinical decision support

The widespread adoption of electronic health records (EHR) has been followed by the increased development of clinical decision support systems (CDSS). These systems range from medication interaction alerts to patient safety reminders.64,65,66,67 CDSS have been demonstrated to improve process measures and clinical outcomes.68,69 The utilization of machine learning algorithms for CDSS is more recent and rapidly expanding.

Prediction models have been the most widely implemented modality of machine learning in clinical medicine.70,71,72 These models use machine learning techniques to synthesize large amounts of patient data into simplified scores that providers can use to assess each patient’s risk.73 Clinical applications of these models include the prediction of kidney injury,74 significant clinical deterioration,75 and mortality.76,77 These models are frequently derived from single-center data, with validation performed on a separate cohort of patients admitted to the same center.74,75,76,77 The rise in popularity of such models has led to calls for greater rigor in their derivation to ensure true clinical utility.78 Another group of models comprises those created by EHR vendors, available for use in hospitals that pay for a particular EHR. While these models tend to be derived from larger datasets, their proprietary nature means that limited information is published regarding their validation. Attempts at external validation have raised concerns about the validity of these models.79,80

A different approach to CDSS is the augmentation of data visualization to help physicians better understand trends in real time. One system that uses this approach is the Etiometry (Etiometry Inc., Boston, MA) risk analytics algorithm,81 which includes a data aggregation and visualization system in addition to a risk analytics engine. The T3 (tracking, trajectory, and triggering) data visualization system continuously aggregates real-time patient data, including vital signs and select laboratory data. Similarly, Sickbay (Medical Informatics Corp., Houston, TX) is a vendor-neutral platform that aggregates data to improve data visualization.82 This contrasts with conventional data monitoring in the ICU, which is limited to nurse-validated recordings at fixed time intervals. Current EHRs usually store and present data at hourly intervals. These platforms aggregate all data points continuously and therefore attempt to provide a more holistic picture.83

Some platforms use algorithms to provide additional functionality beyond the aggregation of patient data. Sickbay allows the use of continuous physiologic data to develop real-time risk calculators for outcomes. Published examples of this include predicting deterioration in children with congenital heart disease28 and predicting the need for extracorporeal membrane oxygenation (ECMO) in neonates with congenital diaphragmatic hernias.84 In contrast, Etiometry leverages continuous physiologic data through a proprietary machine learning algorithm to provide real-time risk-based analysis of patient deterioration. The algorithm uses patient data to continuously calculate the risk of inadequate delivery of oxygen (IDO2) and inadequate ventilation of carbon dioxide (IVCO2) that can be used as proxies to predict clinical deterioration. To their credit, the development of these metrics has been described in detail, providing users with an in-depth understanding of their workings.85 Publications testing the accuracy and utility of Etiometry models have had mixed results. One study demonstrated that the IDO2 index was significantly elevated in patients who failed to be weaned off vasoactive infusions compared to those who were weaned successfully.86 Another study found that the IDO2 index was outperformed by a conventional scoring system in predicting adverse events in children after cardiac bypass.87

Other examples include an algorithm developed by Better Care (Better Care Inc., Barcelona, Spain) that is currently being used in Spain88 and the Continuous Monitoring of Event Trajectories (CoMET) developed at the University of Virginia.89 In addition to using patient vital signs, the Better Care algorithm also incorporates ventilator metrics and analytics such as asynchrony in its features. This is built on previous efforts by the same company to utilize machine learning to categorize patient-ventilator interactions.90 CoMET combines continuous physiologic data and lab values in its algorithm to predict the risk of urgent intubation and assist in the early detection of sepsis.91 However, there are currently no published reports describing the performance of this system in an external pediatric cohort.

While the development of CDSS has been promising, challenges have limited widespread adoption.92 Limitations include the knowledge and time required to deploy and maintain an up-to-date and clinically relevant CDSS. A good CDSS must also be well integrated into the clinician’s existing workflow. Another area of concern is transparency and understanding. The workings of a CDSS must be transparent to garner trust. Providers are answerable to their patients and are unlikely to use a metric in their decision-making that they do not fully understand and therefore cannot explain to their patients. Finally, a CDSS must continue to adapt to changes in both clinical guidelines and practice patterns. A stagnant CDSS created from a historical dataset and never updated will lose accuracy and eventually become obsolete. This concept is known as data drift, and updates to a CDSS are required if the data distribution passes a prespecified threshold93 (see the sketch following this paragraph). These systems must therefore be reviewed and updated regularly, either at fixed intervals or with significant changes in practice or the target population, to remain relevant.94 Additional limitations specific to a pediatric population include the smaller patient population and the fewer large datasets available to train algorithms.95 Within that smaller population, there is also greater heterogeneity due to differences in normal ranges and, at times, treatment strategies across age groups. Furthermore, adverse outcomes in pediatric patients tend to be more infrequent than in adults, limiting the accuracy of models derived to predict those rare outcomes. The financial cost of implementing CDSS can be significant, limiting the widespread adoption of tools whose clinical efficacy is as yet unproven. These limitations were highlighted in a recent study in which most pediatric critical care providers were neutral or disagreed that current predictive algorithms provided useful information.96 Providers agreed that important goals included evidence-based CDSS with a proven impact on patient safety, delivered in the right place at the right time.96 Providers expressed concern about the accuracy of CDSS, the effect on practitioners’ critical reasoning, and the burden of increased time spent on the computer.96
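
As an illustration of drift monitoring only, the sketch below compares the distribution of a single model input between the derivation era and recent patients using a two-sample Kolmogorov-Smirnov test; the variable, values, and significance threshold are all assumptions that would need to be set locally:

```python
# Data-drift sketch: compare one input's distribution between the training
# era and current practice (scipy; synthetic values, illustrative threshold).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(5)
training_era = rng.normal(100, 15, size=5000)  # e.g., heart rate at derivation
recent_data = rng.normal(108, 15, size=1000)   # same variable, current patients

stat, p_value = ks_2samp(training_era, recent_data)
if p_value < 0.05:   # prespecified threshold (an assumption)
    print(f"Possible drift (KS statistic {stat:.3f}); review/retrain the CDSS.")
```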

Limitations

Several machine learning and AI-derived tools have failed to live up to their promise when deployed clinically. It is crucial to understand the pitfalls in their development and implementation, and why so many have struggled to make an impact at the bedside. Specific examples range from sepsis prediction to imaging classification.79,97,98

During development, particular attention must be paid to the definitions and scope of the model and to the selection of predictor variables. Several important characteristics must hold: the predictors must not be collinear with the predicted outcome, and they must be known prior to the outcome.78 This is vital to ensure that the information provided by the model is clinically actionable. Predictors that become known only immediately before or after the outcome event occurs do not provide an opportunity for clinical intervention. Observable data such as blood pressure or heart rate may be perturbed in sepsis only after the condition has developed, limiting the time available to make actionable predictions.99 The model must also retain accuracy when applied to new data, and thus be generalizable, to be useful to the bedside clinician.
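
As a simple illustration of enforcing this temporal constraint, the sketch below (pandas; hypothetical table and column names) keeps only measurements charted before the prediction time, so the model cannot use information that would not yet be known:

```python
# Temporal-validity sketch: only predictor values recorded before the
# prediction time are usable features (pandas; hypothetical data).
import pandas as pd

labs = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "charttime": pd.to_datetime(["2023-01-01 06:00", "2023-01-01 14:00",
                                 "2023-01-02 08:00"]),
    "lactate": [1.2, 4.8, 2.0],
})
prediction_time = pd.Timestamp("2023-01-01 12:00")

usable = labs[labs["charttime"] < prediction_time]   # excludes later values
print(usable)
```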

Care should also be taken during data preprocessing, first to ensure data accuracy, and second to avoid discarding potentially useful information by binning continuous variables. Binning often introduces assumptions into the model that are not biologically plausible (e.g., a model may treat a hemoglobin of 3 the same as one of 6.7 if a dichotomous predictor of hemoglobin <7 exists). While grouping variables may be useful for easy bedside prediction, we recommend close consideration of these tradeoffs when developing a model. Particular attention must also be paid to how specific models handle missing data. Extensive data preprocessing may yield better results but may dramatically limit real-world application if the preprocessing cannot be performed in real time.
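
The hemoglobin example can be made concrete in a few lines (values illustrative):

```python
# Dichotomizing a continuous variable discards information: hemoglobin
# values of 3 and 6.7 receive the identical binary feature.
hemoglobin = [3.0, 6.7, 7.2, 10.5]
binned = [int(h < 7) for h in hemoglobin]   # 1 if hemoglobin < 7, else 0
print(list(zip(hemoglobin, binned)))        # (3.0, 1) and (6.7, 1) look the same
```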

When developing models to evaluate binary outcomes, a key concept is the number of events per variable (EPV), the number of events/outcomes divided by the number of predictor variables.100 EPV can guide sample size requirements. Further attention is required in regression models to avoid overfitting. Overfitting occurs when the model begins to describe the random error in the data rather than the true relationships between variables; it often occurs as the model becomes more complex, and it reduces generalizability outside the original dataset.
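
A worked example of the EPV calculation (the often-cited minimum of roughly 10 EPV is a rule of thumb, not a guarantee against overfitting):

```python
# Events-per-variable (EPV) sketch with illustrative numbers.
n_events = 120        # patients experiencing the outcome
n_predictors = 15     # candidate predictor variables

epv = n_events / n_predictors   # 120 / 15 = 8.0
print(f"EPV = {epv:.1f}")
if epv < 10:          # commonly cited rule of thumb
    print("EPV below ~10: elevated risk of overfitting.")
```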

When evaluating models, it is important to determine how a model has been evaluated and validated. Standard evaluation frequently includes internal validation, which determines whether model performance is reproducible in the population from which the model was derived. This frequently means the same dataset, and it can be performed either by holding out a particular set of patients for model validation or by k-fold cross-validation. K-fold cross-validation involves partitioning the data into subsamples; the model is trained on all subsamples except one and validated on the remaining subsample (Fig. 2).78 The data are then shuffled, and this process is repeated until a stable model is derived. A rare outcome limits model performance, since most mathematical models are designed to distinguish between two outcomes (i.e., event vs. no event) of equal likelihood. This is particularly important when considering applying machine learning to predict rare outcomes in pediatric patients. External validation determines whether the model is reproducible in a population distinct from the one on which it was trained. Models trained on single-center data often demonstrate high accuracy but fail to perform similarly in new populations, again limiting their generalizability.

Fig. 2: K-fold cross-validation.

Example of k-fold cross-validation. A different subset of the data is used for training and validation in each fold, and overall performance is based on the combined performance across the validation folds.
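
A minimal sketch of the procedure shown in Fig. 2, assuming scikit-learn, synthetic data, five stratified folds, and AUROC as the per-fold metric:

```python
# K-fold cross-validation sketch: each fold is held out once for validation
# while the model trains on the remaining folds (scikit-learn; synthetic data).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(size=400) > 1).astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring="roc_auc")
print("Per-fold AUROC:", scores.round(2), "mean:", round(scores.mean(), 2))
```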

Performance metrics for machine learning models also differ from common statistical models (Table 1). It is crucial to separate a model’s discrimination, its ability to separate events/outcomes from non-events/non-outcomes (e.g., Fig. 3), from its calibration, its ability to specify the probability of the outcome. Common measures such as the area under the receiver operating characteristic curve (AUROC) mainly measure discrimination and may be falsely reassuring when predicting rare events. For rare events, it is also important to consider additional metrics, including the area under the precision-recall curve (AUPRC), which reflects positive predictive value and sensitivity (the average probability that a positive prediction is true across all sensitivities). Other metrics may provide further clarity on a model’s performance, especially for rare events, including specificity, sensitivity, and the F1 score, which combines precision and recall into a single metric.101 Understanding the calibration of a model is also crucial to knowing whether the model will be useful clinically (e.g., there is a large difference between predicting an outcome will occur 51% of the time and 90% of the time). Ultimately, while AUROC and AUPRC are important, the factors that matter to the bedside clinician are the positive and negative predictive values of the algorithm, which depend on the prevalence of the outcome in the local population.

Fig. 3: Confusion Matrix.

Example of a 2x2 confusion matrix, comparing prediction to ground truth.
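
The sketch below computes several of the metrics discussed above for a synthetic rare outcome (about a 5% event rate) using scikit-learn; the Brier score is included as one simple calibration-related measure, an addition beyond the metrics named in the text:

```python
# Sketch computing discrimination (AUROC), rare-event (AUPRC), F1, and one
# calibration-related metric (Brier score) with scikit-learn.
# Synthetic predictions with a ~5% event rate; values are illustrative only.
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             f1_score, brier_score_loss)

rng = np.random.default_rng(7)
y_true = (rng.random(2000) < 0.05).astype(int)                 # rare outcome
y_prob = np.clip(0.05 + 0.5 * y_true + rng.normal(0, 0.1, 2000), 0, 1)

print("AUROC:", round(roc_auc_score(y_true, y_prob), 2))       # discrimination
print("AUPRC:", round(average_precision_score(y_true, y_prob), 2))
print("F1:", round(f1_score(y_true, (y_prob > 0.5).astype(int)), 2))
print("Brier:", round(brier_score_loss(y_true, y_prob), 3))    # calibration
```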

A recent concern has been that the incorporation of systematic bias in the underlying dataset may perpetuate that bias in the algorithm, especially if the algorithm is used for triage or treatment decisions. These risks are easier to appreciate in models with clear clinical correlation or plausibility, but a recent paper showed that a deep learning model was able to identify self-reported race from radiological images even when the data were corrupted or cropped, and this capability was extremely difficult to isolate.102 Further work is necessary before the broad deployment of these types of models.

Along the same lines, it is important to understand how provider beliefs and actions affect machine learning models.103 Actions performed by clinicians, such as ordering certain laboratory tests, are based on prior knowledge of disease patterns or clinical intuition. Patients about whom a clinician is more concerned will, expectedly, have more laboratory or diagnostic tests performed. Machine learning models that incorporate the results of these actions may in fact be predicting provider behavior rather than disease patterns. Conversely, models trained exclusively on patient data independent of provider actions (e.g., vital signs on admission) may provide a more distilled approximation of the disease process but not reflect the realities of patient care within the hospital system. Overall, machine learning models trained on both clinician-initiated and clinician-independent variables are likely to encompass both physician intuition and patient factors in their predictions. Pediatric critical care generates vast amounts of clinician-independent data that can be harnessed to train models that are agnostic of provider behavior patterns and home in on the disease process.

In evaluating model performance, then, it is important to consider the variables used in model training, how the model was validated (retrospectively vs. prospectively), how performance was reported, and whether performance included measures of discrimination, calibration, and clinically relevant measures such as positive and negative predictive values in a representative population. Educating clinicians to understand these metrics and critically appraise machine learning models will be essential to ensuring successful adoption at the bedside.

For these reasons, the addition of machine learning and artificial intelligence algorithms to pediatric critical care is not meant to replace decision-making by bedside staff. Rather, it should augment the knowledge employed to develop individualized care plans for increasingly complex patients and will ultimately lead to improved nuance and discrimination of diverse phenotypes within organ system failures.

Conclusions

Common machine learning and artificial intelligence techniques hold promise for predictive modeling and clinical decision support; however, for the field to mature, the common pitfalls that may explain why current tools have failed to live up to their promise must be addressed. As these tools and techniques become ubiquitous, understanding how they were developed and how to evaluate them will be vital for pediatric intensivists.