Introduction

Advances in artificial intelligence (AI) are changing almost every field, including healthcare, and the pace is such that the changes are difficult to track: hardly a day passes without AI being applied to something new. Although AI already affects daily life enormously, many clinicians may not be aware of how much of the work done with AI technologies could be put into practice in today’s healthcare system. In this review, we aim to fill this gap, particularly for physicians in a relatively underexplored area of AI: neonatology. The origins of AI, and specifically machine learning (ML), can be traced back to the 1950s, when Alan Turing proposed the so-called “learning machine” and basic AI found its first military applications1. At that time, computers were huge, and the cost of additional storage space was astronomical. As a result, their capabilities, although substantial for their day, were restricted. Over the decades, incremental advances in theory and technology steadily increased the power and versatility of ML2.

How do machine learning (ML) and deep learning (DL) work? ML falls under the category of AI2. Its capacity to deal with data is what brought ML to the attention of computer scientists. ML algorithms and models can learn from data, analyze and evaluate it, and make predictions or decisions based on what they have learned and on the characteristics of the data. DL is a subset of ML. Unlike the broader class of ML methods, DL is inspired by the functioning of the human brain, particularly the neural networks responsible for processing and interpreting information. DL mimics this operation by using artificial neurons in a computer neural network. In simple terms, DL learns a weight for each connection between the artificial neurons of one layer and those of the next. When the number of layers is high (i.e., the network is deep), more complex relationships between input and output can be modeled3,4,5. This enables the network to acquire more intricate representations of the data as it learns. This hierarchical approach enables DL models to extract features from the data autonomously, rather than depending on human-engineered features as is customary in conventional ML3. DL is a highly specialized form of ML that is well suited to tasks involving unstructured data, where the features in the data can be learned and non-linear associations in the data can be explored6,7,8.

The main difference between ML and DL lies in the complexity of the models and the size of the datasets they can handle. ML algorithms can be effective for a wide range of tasks and can be relatively simple to train and deploy6,7,9,10,11. DL algorithms, on the other hand, require much larger datasets and more complex models but can achieve exceptional performance on tasks that involve high-dimensional, complex data7. DL can automatically identify which aspects are significant, unlike classical ML, which requires pre-defined elements of interest to analyze the data and infer a decision10. Each neuron in DL architectures (i.e., artificial neural networks (ANN)) has non-linear activation function(s) that help it learn complex features representative of the provided data samples9.
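
As a minimal sketch of the layered, non-linear computation described above (not taken from any of the reviewed studies), the following Python snippet passes an input through two layers of artificial neurons, each computing a weighted sum followed by an activation function. All weights, biases, and function names are hypothetical values chosen by hand for illustration; in practice they would be found by training.

```python
import math

def relu(x):
    """Rectified linear unit: a common non-linear activation."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense_layer(inputs, weights, biases, activation):
    """One layer: each neuron computes activation(w . x + b)."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        weighted_sum = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs

# Hypothetical, hand-picked parameters (training would normally find these).
hidden_weights = [[0.5, -1.0], [1.5, 2.0]]
hidden_biases = [0.0, -1.0]
output_weights = [[1.0, -1.0]]
output_biases = [0.0]

def forward(features):
    """Two-layer forward pass: input -> hidden (ReLU) -> output (sigmoid)."""
    hidden = dense_layer(features, hidden_weights, hidden_biases, relu)
    return dense_layer(hidden, output_weights, output_biases, sigmoid)[0]

prediction = forward([2.0, 1.0])  # a value between 0 and 1
```

Without the non-linear activations, stacking layers would collapse into a single linear mapping, which is why depth alone is not enough to model complex input-output relationships.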

ML algorithms, and hence DL algorithms, can be categorized as supervised, unsupervised, or reinforcement learning based on the input-output relationship. For example, if output labels (outcomes) are fully available, the algorithm is called “supervised,” while unsupervised algorithms explore the data without reference standards, outcomes, or labels in the output3,12. In terms of applications, both DL and ML are typically used for tasks such as classification, regression, and clustering6,9,10,13,14,15. The success of DL methods depends on the availability of large-scale data, new optimization algorithms, and GPUs6,10. These algorithms are designed to learn and improve autonomously as they gain experience, as humans do3. Because of its powerful representation of the data, DL is considered today’s most advanced ML method; it is bringing drastic changes to all fields of medicine and technology and is the driving force behind virtually all progress in AI today5 (Fig. 1).

Fig. 1: Exploring AI Hierarchy and Challenges in Healthcare.

a Hierarchical diagram of AI. How do machine learning (ML) and deep learning (DL) work? ML falls under the category of AI. DL is a subset of ML. b Ongoing hurdles of AI when applied to healthcare. Each of the following key concerns affects the outcome of AI in neonatology: (1) challenges with clinical interpretability; (2) knowledge gaps in decision-making mechanisms, which call for human-in-the-loop systems; (3) ethical considerations; (4) the lack of data and annotations; and (5) the absence of cloud systems allowing for secure data sharing and data privacy.

There are three major problem types in DL in medical imaging: image segmentation, object detection (i.e., an object can be an organ or any other anatomical or pathological entity), and image classification (e.g., diagnosis, prognosis, therapy response assessment)3. Several DL algorithms are frequently employed in medical research; briefly, those approaches belong to the following family of algorithms:

Convolutional Neural Networks (CNNs) are predominantly employed for tasks related to computer vision and signal processing. CNNs can handle tasks requiring spatial relationships where the columns and rows are fixed, such as imaging data. A CNN architecture comprises a sequence of phases (layers) that facilitate the acquisition of hierarchical features. Initial layers extract more local features such as corners, edges, and lines, while later layers extract more global features. Features are propagated from one layer to the next, and the feature representation becomes richer along the way. During this propagation, nonlinearities and regularization are applied to the features to make the learned input-output mapping more generalizable. When the features become very large, pooling operations within the network architecture reduce the feature size without losing much information. The automatically generated and propagated features are then used at the end of the network architecture for prediction (segmentation, detection, or classification)3,16.
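
The three building blocks mentioned above (local feature extraction, nonlinearity, and pooling) can be sketched in a few lines of plain Python. This is an illustrative toy, not an implementation from the reviewed literature: the 5x5 "image" and the edge filter are made up, and the filter is written by hand, whereas a trained CNN would learn its kernels from data.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, as used in CNN layers."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

def relu(feature_map):
    """Element-wise nonlinearity: negative responses are set to zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; shrinks each dimension by `size`."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# Toy 5x5 "image": dark on the left (0), bright on the right (1).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written vertical-edge filter (hypothetical; normally learned).
edge_kernel = [[-1, 1], [-1, 1]]

features = relu(conv2d(image, edge_kernel))  # 4x4 feature map
pooled = max_pool(features)                  # 2x2 map after pooling
```

The feature map responds only where the dark-to-bright edge lies, and pooling keeps that response while quartering the number of values passed to the next layer.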

Recurrent Neural Networks (RNNs) are designed to process sequential data, namely text, speech, and time-series data such as clinical data or electronic health records (EHRs). They can capture temporal relationships between data components, which can be helpful for predicting disease progression or treatment outcomes11,17,18. RNNs use architectural components similar to those of CNNs. Long Short-Term Memory (LSTM) models are a type of RNN commonly used to overcome the shortcomings of conventional RNN architectures, because they learn long-term dependencies in data better. They are used in some classification tasks, including audio17,19. An LSTM uses a gated memory cell in the network architecture to store information from the past; the memory cell can therefore retain information for a long period of time, even if the information is not immediately relevant to the current task. This allows LSTMs to learn patterns in data that would be difficult for other types of neural networks to learn.
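
To make the gated memory cell concrete, the sketch below implements one time step of a single-unit LSTM using the standard gate equations (forget, input, and output gates plus a candidate value). The parameter values are hypothetical and chosen by hand so that the cell stores one early event and retains it; a trained network would learn them.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    `params` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget_gate = sigmoid(pre_activation("forget"))  # how much old memory to keep
    input_gate = sigmoid(pre_activation("input"))    # how much new information to write
    output_gate = sigmoid(pre_activation("output"))  # how much memory to expose
    candidate = math.tanh(pre_activation("candidate"))

    c_new = forget_gate * c_prev + input_gate * candidate  # gated memory cell
    h_new = output_gate * math.tanh(c_new)                 # hidden state (output)
    return h_new, c_new

# Hypothetical parameters: the forget gate stays close to 1 (memory is kept)
# and the input gate opens only for large inputs.
params = {
    "forget": (0.0, 0.0, 6.0),
    "input": (10.0, 0.0, -5.0),
    "output": (0.0, 0.0, 6.0),
    "candidate": (5.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
memory_trace = []
for x in [1.0, 0.0, 0.0, 0.0, 0.0]:  # one event followed by "silence"
    h, c = lstm_step(x, h, c, params)
    memory_trace.append(c)
```

With these settings the cell state written at the first step decays only slightly over the following steps, which is the behavior that lets LSTMs carry information across long sequences.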

Generative adversarial networks (GANs) are a class of DL models that can be used to generate new data that resembles existing data. In healthcare, GANs have been used to generate synthetic medical images. A GAN consists of two CNNs. The first, called the generator, aims to produce synthetic images that mimic real images. The second, called the discriminator, aims to distinguish artificially generated images from real ones20. The generator and discriminator are trained jointly in a process called adversarial training, in which the generator tries to create data so realistic that the discriminator cannot distinguish it from real data. GANs can generate a variety of data types, including images, videos, and text. They are also used for image quality enhancement, signal reconstruction, and other tasks such as classification and segmentation20,21,22.
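
The adversarial training described above is commonly written as a two-player minimax game. The formulation below is the standard GAN objective, given here only as an illustration; the symbols ($G$ for the generator, $D$ for the discriminator, $p_{\text{data}}$ for the distribution of real data, and $p_z$ for the noise distribution fed to the generator) are notation introduced for this sketch and are not used elsewhere in this review.

```latex
\min_{G}\,\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator is rewarded for assigning high probability to real samples and low probability to generated ones, while the generator is rewarded when its samples are scored as real.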

Transfer learning (TL) is a concept derived from cognitive science, which holds that knowledge is transferred across related activities to improve performance on a new task. It is generally known that people can accomplish similar tasks by building on prior knowledge23. TL has been implemented to minimize the need for annotation by transferring DL models with knowledge from a previous task and then fine-tuning them on the current task24. The majority of medical image classification techniques employ TL from models pretrained on datasets such as ImageNet, which has been shown to be inefficient because ImageNet consists of natural images25. Approaches that used ImageNet-pretrained CNNs revealed that fine-tuning more layers increased accuracy26. The initial layers of ImageNet-pretrained networks, which detect low-level image characteristics such as corners and borders, may not be efficient for medical images25,26.

New and more advanced DL algorithms are developed almost daily. Such methods could be employed for the analysis of imaging and non-imaging data in order to enhance performance and reliability. These methods include Capsule Networks, Attention Mechanisms, and Graph Neural Networks (GNNs)27,28,29,30. Briefly, these are:

Capsule Networks are a relatively new form of DL architecture that aims to address some of the shortcomings of CNNs: pooling operations (which reduce the data size) and a lack of hierarchical relations between objects and their parts in the data. Capsules can capture spatial relationships between features and are more capable of handling rotations and deformations of image objects thanks to their vectorial representations in neuronal space. Capsule Networks have shown potential in image classification tasks and could have applications in medical imaging analysis27. However, their implementation complexity and computational time are two hurdles that restrict their widespread use.

Attention Mechanisms, represented by Transformers, have contributed to the development of computer vision and language processing. Unlike CNNs or RNNs, Transformers allow direct interaction between every pair of components within a sequence, making them particularly effective at capturing long-range relationships29,30. More specifically, the self-attention mechanism is a key component of Transformers: it can dynamically focus on different parts of the input sequence when producing an output, providing better context understanding than CNN-based systems.
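
The pairwise interaction described above can be illustrated with scaled dot-product self-attention, the operation at the core of Transformers. The snippet is a simplified sketch under stated assumptions: the token embeddings are made-up numbers, and the learned query, key, and value projections are omitted (treated as the identity) to keep the example short.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: every position attends to every other."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)  # how strongly this position attends to each other one
        mixed = [sum(w * v[dim] for w, v in zip(weights, values))
                 for dim in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy sequence of three 2-dimensional token embeddings (hypothetical values).
# Assumption: Q = K = V = tokens, i.e., no learned projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attention_weights = self_attention(tokens, tokens, tokens)
```

Each row of `attention_weights` sums to one and shows how much a given position draws on every other position, regardless of how far apart they are in the sequence.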

Graph Neural Networks (GNNs) are DL models that operate on graphs, a form of data structure that describes a collection of objects (nodes) and their relationships (edges). There are three types of tasks: node-level, edge-level, and graph-level31. Graphs may be used to represent a wide range of systems, such as molecular interaction networks in bioinformatics31,32,33. GNNs have demonstrated potential in both imaging and non-imaging data analysis28,34.
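
As a rough illustration of how a GNN uses nodes and edges, the sketch below performs one round of message passing with mean aggregation: each node updates its features from its own value and those of its neighbours. The graph, the feature values, and the function name are hypothetical, and the learned weight matrices and nonlinearities of a real GNN layer are deliberately left out.

```python
def message_passing_step(node_features, edges):
    """One round of mean-aggregation message passing on an undirected graph.

    Each node's new feature vector is the average of its own features and
    those of its neighbours. Learned weights and nonlinearities, which a
    real GNN layer would apply, are omitted for brevity.
    """
    neighbours = {node: [node] for node in node_features}  # include the node itself
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)

    updated = {}
    for node, group in neighbours.items():
        dim = len(node_features[node])
        updated[node] = [
            sum(node_features[n][d] for n in group) / len(group)
            for d in range(dim)
        ]
    return updated

# Hypothetical 3-node chain A - B - C with one scalar feature per node.
features = {"A": [1.0], "B": [0.0], "C": [0.0]}
edges = [("A", "B"), ("B", "C")]

after_one = message_passing_step(features, edges)
after_two = message_passing_step(after_one, edges)
```

After one round, node C has not yet received any signal from node A; after two rounds it has, which shows how stacking layers widens the neighbourhood each node can draw on.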

Physics-driven systems are needed in the imaging field. Several studies have demonstrated the effectiveness of DL methods in medical imaging35,36,37,38,39. As the field of DL continues to evolve, it is likely that new methods and architectures will emerge to address the unique challenges and constraints of various types of data. One of the most common problems is DL-based MRI reconstruction35. Algorithms for this problem fall into two groups: data-driven and physics-driven. In purely data-driven approaches, a mapping is learned between the aliased image and the image without artifacts39. Acquiring fully sampled (artifact-free) datasets is impractical in many clinical imaging studies when organs are in motion, such as the heart and lungs. Recently developed models can take these undersampled MRI acquisitions as input and generate output images consistent with fully sampled (artifact-free) acquisitions37,38,39.
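
To show where the aliased images mentioned above come from, the following toy example undersamples the frequency-domain data (k-space) of a one-dimensional "image" and reconstructs it by zero-filling. It is a didactic sketch, not an MRI reconstruction method from the cited studies: the signal is made up, the transform is a naive discrete Fourier transform, and real acquisitions are two- or three-dimensional.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform (O(N^2); fine for a toy example)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]

# Toy 1D "image": a single bright structure on a dark background.
image = [0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0]

k_space = dft(image)  # fully sampled acquisition
# Keep every other k-space sample and zero-fill the rest (2x undersampling).
undersampled = [v if k % 2 == 0 else 0 for k, v in enumerate(k_space)]

full_recon = [round(abs(v), 6) for v in idft(k_space)]
aliased_recon = [round(abs(v), 6) for v in idft(undersampled)]
```

The fully sampled reconstruction recovers the original signal, whereas the undersampled one shows the structure at half intensity together with a ghost copy shifted by half the field of view. A data-driven model would be trained to map inputs like `aliased_recon` back to targets like `image`.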

What is hybrid intelligence? A highly desirable way of incorporating advances in AI is to let AI and human intellect work together to solve problems; this is referred to as “hybrid intelligence”40 (also called “mixed intelligence” or “human-in-the-loop AI systems”). This approach involves developing AI systems that supplement and amplify human decision-making processes rather than completely replacing them3. The concept integrates the respective competencies of AI and human beings in order to attain superior outcomes that would otherwise be unachievable41. AI algorithms can process extensive amounts of data, recognize patterns, and generate predictions rapidly and precisely. Meanwhile, humans can contribute their expertise, understanding, and intuition to offer context, analyze outcomes, and render decisions42. By combining these qualities, the hybrid intelligence strategy can help decision-makers in a variety of fields make more precise, effective, and efficient decisions3,4,43,44. Human-in-the-loop and hybrid intelligence systems are promising for time-consuming tasks in healthcare and neonatology.

Where do we stand currently? AI has been employed in medicine for over a decade, yet it is often considered that its clinical implementation has not been fully adapted to daily practice in most clinical fields5,45,46. In recent years, increasingly complex computer algorithms and updated hardware technologies for processing and storing enormous datasets have contributed to its progress6,7,46,47. Only within the last decade have these systems begun to display their full potential6,9. AI research appears to have been taken up with differing degrees of enthusiasm across disciplines. An analysis of thirty years of research into AI, DL, and ML conducted by several medical subfields between 1988 and 2018 found that one-third of DL publications belonged to radiology, and most were within the imaging sciences (radiology, pathology, and cell imaging)48. Software systems work by utilizing biomedical images with predictive, diagnostic, or prognostic features and integrating clinical or pre-clinical data. These systems are designed with ML algorithms46. Such breakthrough DL methods are nowadays extensively applied in pathology, dermatology, ophthalmology, neurology, and psychiatry6,47,49. As its use in healthcare increases, AI faces difficulties of its own (Fig. 1b).

What are the needs in clinics? Clinicians are concerned about the integration of AI into the healthcare system: there is a rapidly growing need for diagnostic testing, early detection, and alarm tools that provide diagnosis and novel treatments without invasive tests and procedures50. Clinicians have higher expectations of AI in their daily practice than before. AI is expected to decrease the need for multiple invasive diagnostic tests and to increase diagnostic accuracy with less invasive (or non-invasive) tests. Such AI systems can readily recognize imaging patterns on test images (i.e., unseen or not utilized efficiently in daily routines), allowing them to detect and diagnose various diseases. These methods could improve detection and diagnosis in different fields of medicine.

The overall goal of this systematic review is to explain AI’s potential use and benefits in the field of neonatology. We intend to shed light on the potential future role of AI in neonatal care. We postulate that AI would be best used as hybrid intelligence (i.e., human-in-the-loop or mixed intelligence) to make neonatal care more feasible, increase the accuracy of diagnosis, and predict outcomes and diseases in advance. The rest of the paper is organized as follows. In the Results, we present the published AI applications in neonatology along with AI evaluation metrics, so that their efficacy in neonatology can be fully understood, and provide a comprehensive overview of DL applications in neonatology. In the Discussion, we examine the difficulties of using AI in neonatology and future research directions. In the Methods, we outline the systematic review procedures, including the examination of existing literature and the development of our search strategy.

We review the past, present, and future of AI-based diagnostic and monitoring tools that might aid neonatologists in patient management and follow-up. We discuss several AI designs for electronic health records, image processing, and signal processing, analyze the merits and limits of newly created decision support systems, and highlight future prospects that clinicians and neonatologists might use in their routine diagnostic activities. AI has made significant breakthroughs in solving issues with conventional imaging approaches by identifying clinical variables and imaging aspects not easily visible to the human eye. Improved diagnostic skills could prevent missed diagnoses and aid in diagnostic decision-making. The structure of our study is illustrated in Fig. 2. Briefly, our objectives in this systematic review are:

  • to explain the various AI models and evaluation metrics thoroughly and describe the principal features of the AI models,

  • to categorize neonatology-related AI applications into macro-domains, to explain their sub-domains and the important elements of the applicable AI models,

  • to examine state-of-the-art studies, particularly from the past several years, with an emphasis on the use of ML across all of neonatology,

  • to present a comprehensive overview and classification of DL applications utilized in neonatology,

  • to analyze and debate the current and open difficulties associated with AI in neonatology, as well as future research directions, to offer the clinician a comprehensive perspective of the actual situation.

Fig. 2: An overview of the structure of this paper.

An overview of our paper’s structure and objectives: (1) explaining AI models and evaluation metrics; (2) evaluating ML-based studies in neonatology; (3) evaluating DL-based studies in neonatology; (4) analyzing challenges and future directions.

AI is a broad term for the application of computing algorithms that can categorize, predict, or generate valuable conclusions from enormous datasets46. Algorithms such as Naive Bayes, Genetic Algorithms, Fuzzy Logic, Clustering, Neural Networks (NN), Support Vector Machines (SVM), Decision Trees, and Random Forests (RF) have been used as ML methods for more than three decades for detection, diagnosis, classification, and risk assessment in medicine9,10. Conventional ML approaches for image classification use hand-engineered features, that is, visual descriptions and annotations learned from radiologists that are encoded into algorithms.

Images, signals, genetic expressions, EHR, and vital signs are examples of the various unstructured data sources that comprise medical data (Fig. 3). Due to the complexity of their structures, DL frameworks may take advantage of this heterogeneity by attaining high abstraction levels in data analysis.

Fig. 3: An overview of AI applications in neonatology.

Unstructured data such as medical images, vital signals, genetic expressions, EHRs, and signal data contribute to the wide variety of medical information. Analyzing and interpreting different data streams in neonatology requires a comprehensive strategy because each has unique characteristics and complications.

While ML requires manual (hand-crafted) selection of information from incoming data and related transformation procedures, DL performs these tasks more efficiently and with higher efficacy9,10,46. DL is able to discover these components by analyzing a large number of samples with a high degree of automation7. An extensive literature on these ML approaches predates the development of DL5,7,45.

It is essential for clinicians to understand how a proposed ML model should enhance patient care. Since no single metric can capture all the desirable attributes of a model, it is customary to describe the performance of a model using several different metrics. Unfortunately, many end-users do not find these measurements easy to comprehend. In addition, it can be difficult to compare models from different studies objectively, and there is currently no method or tool available that can compare models based on the same performance measures51. In this part, the common ML and DL evaluation metrics are explained so that neonatologists can adopt them in their research and better understand upcoming articles and research designs51,52.

AI is used everywhere, from daily life to high-risk applications in medicine. Although more slowly than in other fields, numerous studies investigating the use of AI in neonatology have begun to appear in the literature. These studies have used various imaging modalities, electronic health records, and ML algorithms, some of which have barely entered the clinical workflow. However, there is no systematic review or discussion of future directions specific to this field53,54,55. Many studies were dedicated to introducing these systems into neonatology, but their success has been limited. Lately, research in this field has been moving in a more favorable direction owing to exciting new advances in DL. The evaluation metrics used in those studies were standard ones such as sensitivity (true-positive rate), specificity (true-negative rate), false-positive rate, false-negative rate, receiver operating characteristic (ROC) curves, area under the ROC curve (AUC), and accuracy (Table 1).

Table 1 Evaluation metrics in artificial intelligence.
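
For readers who wish to see how the threshold-based metrics and the AUC named above are obtained from a model’s outputs, a minimal sketch follows. The labels and scores are made-up numbers, the function names are illustrative, and the 0.5 decision threshold is an assumption; the code also assumes that both classes are present in the data.

```python
def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return tp, tn, fp, fn

def evaluate(labels, scores, threshold=0.5):
    """Threshold-dependent metrics for a binary classifier."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp, tn, fp, fn = confusion_counts(labels, predictions)
    return {
        "sensitivity": tp / (tp + fn),          # true-positive rate
        "specificity": tn / (tn + fp),          # true-negative rate
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "accuracy": (tp + tn) / len(labels),
    }

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up example: 1 = condition present, 0 = absent; scores are model outputs.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

metrics = evaluate(labels, scores)
roc_auc = auc(labels, scores)
```

Sensitivity, specificity, and accuracy change with the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is one reason studies usually report several metrics together.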

Results

This systematic review was guided by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol56. The search was completed on 11th of July 2022. The initial search yielded many articles (approximately 9000), and we used a systematic approach to identify and select relevant articles based on their alignment with the research focus, study design, and relevance to the topic. We screened 987 article abstracts and included 106 research articles published between 1996 and 2022 (Fig. 4). Risk of bias was summarized with the QUADAS-2 tool (Figs. 5 and 6)57,58,59.

Fig. 4: Identification of studies through database searches.

The initial search, conducted on 11th of July 2022, yielded approximately 9000 articles, of which 987 article abstracts were screened. Of those, 106 research articles published between 1996 and 2022 were eligible for inclusion in this systematic review. The PRISMA flow diagram illustrates the study selection process in more detail.

Fig. 5: Bias summary of all research according to the QUADAS-2.

Risk of bias summary analysis was done by the QUADAS-2 tool.

Fig. 6: Bias summary of all studies according to the QUADAS-2.

Risk of bias summary analysis was done by the QUADAS-2 tool.

Our findings are summarized in two groups of tables: Tables 2–5 summarize the AI methods from the pre-deep learning era (“Pre-DL Era”) in neonatal intensive care units according to the type of data and applications. Tables 6 and 7, on the other hand, include studies from the DL Era. Applications include classification (i.e., prediction and diagnosis), detection (i.e., localization), and segmentation (i.e., pixel-level classification in medical images).

Table 2 ML based (non-DL) studies in neonatology using imaging data for diagnosis.
Table 3 ML based (non-DL) studies in neonatology using non-imaging data for diagnosis.
Table 4 ML based (non-DL) studies in neonatology using imaging data for prediction.
Table 5 ML based (non-DL) studies in neonatology using non-imaging data for prediction.
Table 6 DL-based studies in neonatology using imaging and non-imaging data for diagnosis.
Table 7 DL-based studies in neonatology using imaging and non-imaging data for prediction.

ML applications in neonatal mortality

Neonatal mortality is a major factor in child mortality. Neonatal fatalities account for 47 percent of all mortality in children under the age of five, according to the World Health Organization60. It is, therefore, a priority to minimize worldwide infant mortality by 203061.

ML has been used to investigate infant mortality, its causes, and its prediction62,63,64,65,66,67,68. A recent review covered 1.26 million infants born at 22 to 40 weeks of gestational age67. Predictions were made as early as 5 min of life and as late as 7 days. Studies used an average of four models each, most often neural networks, random forests, and logistic regression (58.3%)67. Two studies (18.2%) completed external validation, and five (45.5%) published calibration plots67. Eight studies reported AUC, and five supplied sensitivity and specificity67. The AUC ranged from 58.3% to 97.0%67. Sensitivities averaged 63 to 80%, and specificities 78 to 98%67. Linear regression analysis was the best overall model despite having 17 features67. This analysis highlighted the most prevalent AI neonatal mortality measures and predictions. Despite advances in neonatal care, preterm infants remain highly susceptible to mortality owing to the immaturity of their organ systems and increased susceptibility to early and late sepsis69. Addressing these persistent risks calls for the use of ML to predict mortality63,64,65,66,68,70. Early studies employed ANN and fuzzy linguistic models and achieved an AUC of 85–95% and an accuracy of 90%62,68. Newer studies in large preterm populations and extremely low birthweight infants found an AUC of 68.9–93.3%65,71. These studies have some shortcomings; for example, none of them used vital parameters to represent dynamic changes, and hence they brought no improvement to clinical practice in neonatology. Unsurprisingly, gestational age, birthweight, and APGAR scores were the most important variables in the models64,72. Future research is suggested to focus on external evaluation, calibration, and implementation of healthcare applications67.

Neonatal sepsis, which includes both early-onset and late-onset sepsis, is a significant contributor to neonatal mortality and morbidity73. Diagnosing neonatal sepsis and deciding when to start antibiotics are considerable challenges in neonatal care, underscoring the importance of comprehensive interventions to alleviate their profound negative consequences. Studies have predicted early sepsis from heart rate variability with an accuracy of 64–94%74. A secondary analysis of multicenter data, in which ML integrated all clinical and laboratory variables, revealed that clinical biomarkers weighed on the ML decision and achieved an AUC of 73–83%75.

ML applications in neurodevelopmental outcome

Recent advancements in neonatal healthcare have resulted in a decrease in the incidence of severe prenatal brain injury and an increase in the survival rates of preterm babies76. However, even though routine radiological imaging does not reveal any signs of brain damage, this population is nonetheless at significant risk of having a negative outcome in terms of neurodevelopment77,78,79,80. It is essential to discover early indicators of abnormalities in brain development that might serve as a guide for the treatment of preterm children at a greater risk of having negative neurodevelopmental consequences81,82.

The most common reason for neurodevelopmental impairment in preterm infants is intraventricular hemorrhage (IVH)83. Two studies predicted IVH in preterm infants. Neither study used ultrasound images in its analysis; both predicted IVH from clinical variables only84,85.

Morphological studies have demonstrated that preterm birth is linked to smaller brain volume, cortical folding, axonal integrity, and microstructural connectivity86,87. Studies concentrating on functional markers of brain maturation, such as those derived from resting-state functional connectivity (rsFC) analyses of blood-oxygen-level dependent (BOLD) fluctuations, have revealed further impacts of prematurity on the developing connectome, including decreased network-specific connectivity82,88,89. Many studies have used ML methods to investigate brain connectivity in preterm infants88,90,91,92, brain structural analysis in neonates93, and neonatal brain segmentation94. Similarly, neurocognitive evaluation at 2 years of age is one of the most important neurodevelopmental outcomes. Studies have evaluated morphological changes in the brain in relation to neurocognitive outcome95,96,97 and brain age prediction98,99. Near-term regional white matter (WM) microstructure on diffusion tensor imaging (DTI) was found to predict neurodevelopment in preterm infants using exhaustive feature selection with cross-validation96, and multivariate models of near-term structural MRI and WM microstructure on DTI might help identify preterm infants at risk for language impairment and guide early intervention95,97 (Table 4). One study that used ML methods to evaluate the effects of PPAR gene activity on brain development100 revealed a strong association with abnormal brain connectivity, implicating PPAR gene signaling in abnormal white matter development. Inhibited brain growth in individuals exposed to early extrauterine stress is controlled by genetic variables, and PPARG signaling has a previously unrecognized role in cerebral development100 (Table 2).

As an alternative to morphological studies, neuromonitoring has been shown to be an important tool, and ML methods have frequently been employed with it, for example, in automatic seizure detection from video EEG101,102,103 and from EEG biosignals in infants and neonates with HIE104,105,106,107,108. The detection of artifacts109,110, sleep states102, rhythmic patterns111, and burst suppression in extremely preterm infants112,113 from EEG records has also been studied with ML methods. EEG records are often used for HIE grading as well114. For seizure detection with different ML methods, those studies reported an AUC of 89% to 96%104,105,115 and an accuracy of 78–87%114,116 on EEG recordings from different neonatal datasets (Table 3).

ML applications in predictions of prematurity complications (BPD, PDA, and ROP)

Another important cause of mortality and morbidity in the NICU is patent ductus arteriosus (PDA). The ductus arteriosus is normally open during the fetal stage, when the circulation in the lungs and body is supplied by the mother; in newborns, it closes functionally by 72 h of age117. Between 20 and 50% of infants with a gestational age (GA) 32 weeks still have an open ductus arteriosus on day 3 of life118, as do up to 60% of neonates with a GA 29 weeks. The presence of PDA in preterm neonates is associated with higher mortality and morbidity, and physicians should weigh whether PDA closure might improve the likelihood of survival against the burden of adverse effects119,120,121,122.

ML methods have been used for PDA detection from EHR123 and from auscultation records124. In one study, 47 perinatal factors from 10390 very low birth weight infants were analyzed with 5 different ML methods, which predicted PDA with an accuracy of 76%123; in another, 250 auscultation records were analyzed with XGBoost, which reached an accuracy of 74%124 (Table 3).

Bronchopulmonary dysplasia (BPD) is a leading cause of infant death and morbidity in preterm births. While various biomarkers have been linked to the development of respiratory distress syndrome (RDS), no clinically relevant prognostic tests are available for BPD at birth125. ML studies have aimed to predict BPD from birth data70,126, gastric aspirate content125, and genetic data127. BPD could be predicted with an accuracy of up to 86% in the best-case scenario70 (Table 5); ML analysis of the responsible genes predicted BPD development with an AUC of 90%127 (Table 3); and an SVM analysis combining gastric aspirate obtained after birth with clinical information predicted BPD development with a sensitivity of 88%125 (Table 5).

In relation to published studies in BPD with ML-based predictions, long-term invasive ventilation is considered one of the most important risk factors for BPD, nosocomial infections, and increased hospital stay. There are ML-based studies aiming to predict extubation failure128,129,130 and optimum weaning time131 using long-term invasive ventilation information. Those studies predicted extubation failure with an accuracy of 83.2% to 87%128,129,130 (Tables 2 and 3).

Retinopathy of prematurity (ROP) is another area of interest in the application of machine learning in neonatology132. ROP is a serious complication of prematurity that affects the blood vessels in the retina and is a leading cause of childhood blindness in high and middle-income countries, including the United States, among very low-birthweight (<1500 g), very preterm (28–32 weeks), and extremely preterm infants (less than 28 weeks)132. Due to a shortage of ophthalmologists available to treat ROP patients, there has been increased interest in the use of telemedicine and artificial intelligence as solutions for diagnosing ROP132. Some ML methods, such as Gaussian mixture models, have been employed to diagnose and classify ROP from retinal fundus images132,133,134, and the i-ROP134 system was reported to classify pre-plus and plus disease with 95% accuracy. This was close to the performance of the three individual experts (96%, 94%, and 92%, respectively), and much higher than the mean performance of 31 nonexperts (81%)134 (Table 2).

Other ML applications in neonatal diseases

EHR and medical records were featured in ML algorithms for the diagnosis of congenital heart defects135, HIE (Hypoxic Ischemic Encephalopathy)136, IVH (Intraventricular Hemorrhage)84,85, neonatal jaundice137,138, prediction of NEC (Necrotizing Enterocolitis)139, prediction of neurodevelopmental outcome in ELBW (extremely low birth weight) infants65,140,141, prediction of neonatal surgical site infections142, and prediction of rehospitalization143 (Table 5).

Electronically captured physiologic data are evaluated as signal data and have been analyzed with ML to detect artifact patterns144 and late onset sepsis145 and to predict infant morbidity146. Electronically captured vital parameters (respiratory rate, heart rate) of 138 infants (≤34 weeks’ gestation, birth weight ≤2000 gram) in the first 3 h of life predicted overall morbidity with an AUC of 91%146 (Table 5).

In addition to physiologic data, clinical data up to 12 h after cardiac surgery in HLHS (hypoplastic left heart syndrome) and TGA (transposition of great arteries) infants were analyzed to predict PVL (periventricular leukomalacia) occurrence after surgery147. The F-score results for infants with HLHS and those without HLHS were 88% and 100%, respectively147 (Table 5). Voice records have been used to identify respiratory phases in infant cry148, to classify neonatal diseases from infant cry149, and to evaluate asphyxia from infant cry150. Voice records of 35 infants were analyzed with ANN, with a reported accuracy of 85%149. Cry records of 14 infants in their 1st year of life were analyzed with SVM and GMM, and phases of respiration and crying rate were quantified with an accuracy of 86%148 (Table 3).

SVM was the most commonly used method in the diagnosis of metabolic disorders of newborns, including MMA (methylmalonic acidemia)151, PKU (phenylketonuria)152,153, and MCADD (medium-chain acyl CoA dehydrogenase deficiency)152. In the Bavarian newborn screening program, ML analysis of dried blood samples increased the positive predictive value for PKU (71.9% versus 16.2%) and for MCADD (88.4% versus 54.6%)152 (Table 3).

Neonatology with deep learning

The main uses of DL in clinical image analysis fall into three categories: classification, detection, and segmentation. Classification involves identifying a specific feature in an image; detection involves locating multiple features within an image; and segmentation involves dividing an image into multiple parts7,9,154,155,156,157,158,159,160.

Neuroradiological evaluation with AI in neonatology

Neonatal neuroimaging can establish early indicators of neurodevelopmental abnormality to provide early intervention during a time of maximal neuroplasticity and fast cognitive and motor development79,96. DL methods can assist in an earlier diagnosis than clinical signs would indicate.

The imaging of an infant’s brain using MRI can be challenging due to lower tissue contrast, substantial tissue inhomogeneities, regionally heterogeneous image appearance, immense age-related intensity variations, and severe partial volume effects due to the smaller brain size. Since most of the existing tools were created for adult brain MRI data, infant-specific computational neuroanatomy tools have only recently begun to be developed. A typical pipeline for early prediction of neurodevelopmental disorders from infant structural MRI (sMRI) is made up of three basic phases: (1) image preprocessing, tissue segmentation, regional labeling, and extraction of image-based characteristics; (2) surface reconstruction, surface correspondence, surface parcellation, and extraction of surface-based features; and (3) feature preprocessing, feature extraction, AI model training, and prediction of unseen subjects161. The segmentation of a newborn brain is difficult due to the decreased SNR (signal to noise ratio) resulting from the shorter scanning duration enforced by anticipated motion restrictions and the diminutive size of the neonatal brain. In addition, the cerebrospinal fluid (CSF)-gray matter border has an intensity profile comparable to that of the mostly unmyelinated white matter (WM), resulting in significant partial volume effects. Furthermore, the high variability resulting from the fast growth of the brain and the continuing myelination of WM imposes additional constraints on the creation of effective segmentation techniques. Several non-DL-based approaches for properly segmenting newborn brains have been presented over the years. These methods may be broadly classified as parametric162,163,164, classification165, multi-atlas fusion166,167, and deformable models168,169. The Dice Similarity Coefficient metric is used for image segmentation evaluation; the higher the Dice score, the higher the segmentation accuracy10 (Table 1).
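The Dice Similarity Coefficient mentioned above can be sketched in a few lines. The example below is a minimal illustration, assuming two binary masks that have been flattened to equal-length lists of 0/1 voxels; the function name `dice_coefficient` and the toy masks are hypothetical and do not come from any cited study.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks
    given as equal-length sequences of 0/1 voxel labels."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as perfect agreement by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: an automated mask versus a manual reference mask.
predicted_mask = [0, 1, 1, 1, 0, 0]
reference_mask = [0, 1, 1, 0, 1, 0]
dice = dice_coefficient(predicted_mask, reference_mask)
print(f"Dice = {dice:.3f}")
```

A value of 1 indicates that the automated and reference segmentations overlap completely, and a value of 0 indicates no overlap at all.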

In the NeoBrainS12 2012 MICCAI Grand-Challenge (https://neobrains12.isi.uu.nl), T1W and T2W images were presented with manually segmented structures to assess strategies for segmenting neonatal tissue162. Most methods were found to be accurate, but classification-based approaches were particularly precise and sensitive. However, segmentation of myelinated vs. unmyelinated WM remains a difficulty since the majority of approaches162 failed to consistently obtain reliable results.

Future research in neonatal brain segmentation will involve more thorough neural segmentation networks. Current studies are intended to highlight efficient networks capable of producing accurate and dependable segmentations while comparing them to existing conventional computer vision techniques. When comparing previous efforts on newborn brain segmentation, the small sample size of high-quality labeled data must also be recognized as a significant restriction169. The field of artificial intelligence in neonatology has progressed slowly due to a shortage of open-source algorithms and the limited availability of datasets.

Future research should also focus on improving the accuracy of DL for diagnosing germinal matrix hemorrhage and figuring out how DL can help a radiologist’s workflow by comparing how well sonographers identify studies that look suspicious. More studies could also look at how well DL works for accurately grading germinal matrix hemorrhages and maybe even small hemorrhages that a radiologist can see on an MRI but not on a head ultrasound. This could be useful in improving the diagnostic capabilities of head ultrasound in various clinical scenarios157.

Evaluation of prematurity complications with DL in neonatology

In the above discussion, we have addressed the primary applications of DL in relation to disease prediction. These include DL for the diagnosis of conditions such as PDA (patent ductus arteriosus)158, IVH (intraventricular hemorrhage)155,157, BPD (bronchopulmonary dysplasia)170, ROP (retinopathy of prematurity)171,172,173, and retinal hemorrhage174. They also include DL applications for analyzing MR images159,175 and MR images combined with EHR data176,177 for predicting neurocognitive outcome and mortality. Additionally, DL has potential applications in treatment planning and discharge from the NICU178, including customized medicine and follow-up6,67,125 (Tables 6 and 7).

Digital imaging and analysis with AI are promising and cost-effective tools for detecting infants with severe ROP who may need therapy132,171,172,179. Despite limitations such as image quality, interpretation variability, equipment costs, and compatibility issues with EHR systems, AI has been shown to be effective in detecting ROP180. Studies comparing BIO (Binocular Indirect Ophthalmoscope) to telemedicine have shown that both methods have equivalent sensitivity for identifying zone disease, plus disease, and ROP. However, BIO was found to be slightly better at identifying zone III and stage 3 ROP181,182. DL algorithms were applied to 5511 retinal images, achieving an AUC of 94% (diagnosis of normal) and 98% (diagnosis of plus disease), outperforming 6 out of 8 ROP experts171. In another study, DL was used to quantify the clinical progression of ROP by assigning ROP vascular severity scores172. A subsequent study with a large dataset of 4175 retinal images from 32 NICUs reported an AUC of 98% for detecting therapy-requiring ROP with DL173. The use of AI in ROP screening programs may increase access to care for secondary prevention of ROP and enable the evaluation of disease epidemiology173 (Table 6).

Signal detection for sleep protection in the NICU is another ongoing discussion. DL has been used to analyze infant EEGs and identify sleep states. Interruptions of sleep states have been linked to problems in neuronal development183. Automated sleep state detection from EEG records184,185 and from ECG monitoring parameters186 has been demonstrated with DL. The underperformance of the all-state classification (kappa score 0.33 to 0.44) was likely owing to the difficulty of differentiating small changes between states and to insufficient training data for minority classes186 (Table 6).
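For readers unfamiliar with the kappa score quoted above, the following is a minimal sketch of Cohen's kappa, which measures agreement beyond what chance alone would produce. It assumes one label per epoch from a human annotator and one from the model; the state names ("AS", "QS", "W") and the toy labels are hypothetical and are not taken from the cited studies.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance from the
    marginal label frequencies of the two raters."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equal in length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for every epoch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical sleep-state labels for six epochs.
annotated = ["AS", "AS", "QS", "QS", "W", "W"]
predicted = ["AS", "QS", "QS", "QS", "W", "AS"]
kappa = cohen_kappa(annotated, predicted)
print(f"kappa = {kappa:.2f}")
```

In this toy case four of six epochs agree, but one third of the agreement is expected by chance, so kappa is lower than the raw agreement rate. This is why kappa is often preferred over accuracy when some sleep states are much rarer than others.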

DL has been found to be effective in real-time evaluation of cardiac MRI for congenital heart disease187. Studies have shown that DL can accurately calculate ventricular volumes from images rebuilt using residual UNet, which are not statistically different from the gold standard, cardiac MRI. This technology has the potential to be particularly beneficial for infants and critically ill individuals who are unable to hold their breath during the imaging process187 (Table 6).

DL-based 3D CNN algorithms have been used to demonstrate the automated classification of brain dysmaturation from neonatal brain MRI188. In a study, brain MRIs of 90 term neonates with congenital heart diseases and 40 term healthy controls were analyzed using this method, which achieved an accuracy of 98%. This technique could be useful in detecting brain dysmaturation in neonates with congenital heart diseases188 (Table 6).

DL algorithms have been used to classify neonatal diseases from thermal images189,190,191,192. These studies analyzed neonatal thermograms to determine the health status of infants and achieved good AUC scores189,190,191,192. However, these studies did not include any clinical information (Table 6).

Two large-scale studies showed breakthrough results regarding the effect of nutrition practices in the NICU170 and wireless sensors in the NICU193. The nutrition study revealed that nutrition practices were associated with discharge weight and BPD170. This exemplifies how unbiased ML techniques may be used to effectively bring about clinical practice changes170. Novel wireless sensors can improve monitoring, prevent iatrogenic injuries, and encourage family-centered care193. Early validation results show performance equal to standard-of-care monitoring systems in high-income nations. Furthermore, the use of reusable sensors and compatibility with low-cost mobile phones may reduce monitoring costs.

Discussion

The studies in neonatology with AI were categorized according to the following criteria.

(i) The studies were performed with ML or DL,

(ii) imaging data or non-imaging data were used,

(iii) according to the aim of the study: diagnosis or other predictions.

Most of the studies in neonatology were performed with ML methods in the pre-DL era. We have listed 12 studies with ML and imaging data for diagnosis. There are 33 studies that used non-imaging data for diagnosis purposes. Imaging data studies cover biliary atresia (BA) diagnosis from stool color194, postoperative enteral nutrition of neonatal high intestinal obstruction195, functional brain connectivity in preterm infants82,90,91,94,100, ROP diagnosis133,134, neonatal seizure detection from video records101, and newborn jaundice screening137. Non-imaging studies for diagnosis include the diagnosis of congenital heart defects135, baby cry analysis148,149,150, inborn metabolic disorder diagnosis and screening151,152,153, HIE grading104,106,114,136,196, EEG analysis102,104,106,107,110,111,112,113,115,184,197,198, PDA diagnosis123,124, vital sign analysis and artifact detection144, extubation and weaning analysis129,130,131,144, and BPD diagnosis127. ML studies with imaging data for prediction are focused on neurodevelopmental outcome prognosis from brain MRIs95,96,97,127,164,199. ML-based studies with non-imaging data for prediction encompassed mortality risk63,64,65,68, NEC prognosis139, morbidity66,146, and BPD125,126.

When it comes to DL applications, less research has been conducted compared to ML applications. DL studies with imaging and non-imaging data have focused on brain segmentation159,169,175,177,188, IVH diagnosis157, EEG analysis184,185, neurocognitive outcome176, and PDA and ROP diagnosis171,172,173. Upcoming articles and research, however, are likely to come increasingly from the DL field.

It is worth noting that there have also been several articles and studies published on the topic of the application of AI in neonatology. However, the majority of these studies do not contain enough details, are difficult to evaluate side-by-side, and do not give the clinician a thorough picture of the applications of AI in the general healthcare system66,67,93,95,96,97,99,125,126,127,140,142,147,169,174,177,185,188,200,201,202,203,204,205.

There are several limitations in the application of AI in neonatology, including a lack of prospective design, a lack of clinical integration, a small sample size, and single center evaluations. DL has shown promise in bioscience and biosignals, extracting information from clinical images, and combining unstructured and structured data in EHR. However, several issues limit the success of DL in medicine. In the following paragraphs, we examine the key concerns related to DL, which are grouped into six categories:

(1) Difficulties in clinical integration, including the selection and validation of models;

(2) the need for expertise in decision mechanisms, including the requirement for human involvement in the process;

(3) lack of data and annotations, including the quality and nature of medical data; distribution of data in the input database; and lack of open-source algorithms and reproducibility;

(4) lack of explanations and reasoning, including the lack of explainable AI to address the “black-box” problem;

(5) lack of collaboration efforts across multi-institutions; and

(6) ethical concerns4,5,6,9,10,206.

Difficulties in clinical integration

Despite the accuracy that AI has reached in healthcare in recent years, there are several restrictions that make it difficult to translate into treatment pathways. First, physicians’ suspicion of AI-based systems stems from the lack of high-quality randomized clinical trials, particularly in the field of pediatrics, showing the reliability and/or improved effectiveness of AI systems compared to traditional systems in diagnosing neonatal diseases and suggesting appropriate therapies. The studies’ pros and cons are discussed in tables and relevant sections. Studies mainly focus on imaging-based or signal-based analyses of a single variable or disease. Neonatologists and pediatricians need evidence-based, validated algorithms. There are only six prospective clinical trials in neonatology with AI197,207,208,209,210,211. The first is a study on detecting neonatal seizures with conventional EEG in the NICU, which was supported by the European Union Cost Program and conducted in 8 European NICUs197. Neonates with a corrected gestational age between 36 and 44 weeks who had seizures or were at high risk of having seizures and needed EEG monitoring were given conventional EEG with ANSeR (Algorithm for Neonatal Seizure Recognition) coupled with an EEG monitor that displayed a seizure probability trend in real time (algorithm group) or continuous EEG monitoring alone (non-algorithm group)197. The algorithm is not available, and the code is not shared. Another is a study showing the physiologic effects of music in premature infants208; however, no AI analysis could be found in this study. The third study, “Rebooting Infant Pain Assessment: Using Machine Learning to Exponentially Improve Neonatal Intensive Care Unit Practice (BabyAI),” is newly posted and recruiting209. The fourth study, “Using sensor-fusion and machine learning algorithms to assess acute pain in non-verbal infants: a study protocol,” aims to collect data from 15 subjects: preterm infants and term infants within the first month of age at NICU admission, together with their follow-up data at the 3rd and 6th months of age. Pain signals are recorded in real time using facial electromyography (EMG), ECG, electrodermal activity, oxygen saturation, and EEG, and the data will be analyzed with ML methods to evaluate pain in neonates. The study is registered as iPAS (NCT03330496), and its status has been updated to recruitment completed210. However, no results have been submitted. The fifth study, “Prediction of Extubation Readiness in Extreme Preterm Infants by the Automated Analysis of Cardiorespiratory Behavior: APEX study”211, completed recruitment with 266 infants according to its records; still, no results have been released yet (NCT01909947). To sum up, there is only one prospective multicenter randomized AI study that has been published with its results.

There is an unmet need to plan clinically integrated prospective and real-time data collection studies in neonatology. The clinical situation of infants changes rapidly, and studies designed around real-time data would be valuable for analyzing multimodal data that include imaging and non-imaging components.

The need for expertise in the decision mechanisms

For neonatologists to decide whether to implement a system’s recommendation, the system may be required to present supporting evidence95,96,125,202. Many suggested AI solutions in the medical field are not expected to be an alternative to the doctor’s decision or expertise but rather to serve as helpful assistance. When it comes to the struggle for neonatal survival without sequelae, AI may be a game changer in neonatology. The broad range of neonatal diseases and the different clinical presentations of neonates according to gestational age and postnatal age make accurate diagnosis even harder for neonatologists. AI could be effective for early disease detection and could assist clinicians in responding promptly and improving therapy outcomes.

Neonatology has multidisciplinary collaborations in the management of patients, and AI has the potential to achieve levels of efficacy that were previously unimaginable in neonatology if more resources and support from physicians were allocated to it. Neonatology collaborates and closely works with other specialties of pediatrics, including perinatology, pediatric surgery, radiology, pediatric cardiology, pediatric neurology, pediatric infectious disease, neurosurgery, cardiovascular surgery, and other subspecialties of pediatrics. Those multidisciplinary workflows require patient follow-up and family involvement. AI-based predictive analysis tools might address potential risks and neurologic problems in the future. AI supported monitoring systems could analyze real time data from monitors and detect changes simultaneously. These tools could be helpful not only for routine NICU care but also for “family centered care”212,213 implications. Although neonatologists could be at the center of decision making and giving information to parents, AI could be actively used in NICUs. Hybrid intelligence would provide a follow-up platform for abrupt and subtle clinical changes in infants’ clinical situations.

Given that many medical professionals have a limited understanding of DL, it may be difficult to establish contact and communication between data scientists and medical specialists. Many medical professionals, including pediatricians and neonatologists in our instance, are unfamiliar with AI and its applications due to a lack of exposure to the field as an end user. However, the authors also acknowledge the increasing efforts to build bridges among scientists and institutions through conferences, workshops, and courses, and that clinicians have successfully started to lead AI efforts, even through software coding schools run by clinicians214,215,216,217,218.

Neonatal critical conditions will be monitored by human-in-the-loop systems in the near future, and AI-empowered risk classification systems may help clinicians prioritize critical care and allocate supplies precisely. Hence, AI will not replace neonatologists; rather, it will serve as a clinical decision support system in the critical NICU environment, which calls for prompt responses.

Lack of imaging data and annotations and reproducibility problems

There is a rising interest in building deep learning approaches to predict neurological abnormalities using connectome data; however, their usage in preterm populations has been limited81,88,89,90,91. Similar to most DL applications, the training of such models often requires the use of big datasets11; however, large neuroimaging datasets are either not accessible or difficult and expensive to acquire, especially in the pediatric world. Since the success of DL methods currently relies on well-labeled data and high-capacity models that require many iterative updates across many labeled examples, and since obtaining millions of labeled examples is an extreme challenge, neonatal AI applications have not yet advanced substantially.

As a side note, accurate labeling always requires physician effort and time, which overcomplicates the current challenges. Unfortunately, there is no established collaboration between physicians and data scientists at a large scale that can ease some of the challenges (data gathering/sharing and labeling). Nonetheless, once these problems are addressed, DL can be used in prevention and diagnosis programs for optimal results, radically transforming clinical practice. In the following, we envision the potential of DL to transform other imaging modalities in the context of neonatology and child health.

The requirement for a massive volume of data is a significant barrier, as mentioned earlier. The quantity of data needed by an AI or ML system can grow in proportion to the sophistication of its underlying architecture; deep neural networks (DNN), for example, have particularly high data requirements. It is not enough that the data be sufficient in quantity; they also need to be of good quality in terms of data cleaning and data variability (both ANN and DNN are less prone to overfitting if the variability is high). It may be difficult to collect a substantial amount of clean, verified, and varied data for several uses in neonatology. For this reason, there is a data repository shared with neonatal researchers, including EHR202 and clinical variables. Some approaches for addressing the lack of labeled, annotated, verified, and clean datasets include: (1) building and training a model with a very shallow network (only a few thousand parameters) and (2) data augmentation. However, data augmentation techniques are reported not to be helpful in the medical imaging field or medical setting219.

In the field of neonatal imaging, high-quality labeling and medical imaging data are exceedingly uncommon. One of the other comparable available neonatal datasets the authors are aware of has just ten individuals166,220,221. This pattern holds even in more recent research, with the majority of studies involving little more than 20 individuals167. Regardless of sample size and technology, it is crucial to be able to generalize to new data in the field of image segmentation, especially considering the wide range of MRI contrasts and variations between scanners and sequences between institutions. Moreover, it is generally known that models based on DL generalize poorly to unseen data. This is especially crucial for the future translation of research into reality since (1) there is a shift between images obtained in various situations, and (2) the model must be retrained as these images become accessible. Adopting a strategy of continuous learning is the most practical way to handle this challenge. This method involves progressively retraining deep models while preventing the loss of knowledge learned from previously seen datasets that may not be available during retraining. This field of endeavor is expected to advance169.

Most of the studies did not release their algorithms as open source. Even when algorithms are available, it should be made clear whether separate training and testing datasets were used, and studies should state which validation method was chosen. Reproducibility is crucial for comparing the success of algorithms. Methodological bias is another issue. Research is frequently based on databases and guidelines from other nations that may or may not have patient populations similar to ours96. Ideally, a database would contain only data applicable to the specific problem to be solved; however, obtaining the relevant information may be difficult due to the number of databases.
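One way to make the separation of training and testing data explicit is a patient-level k-fold split, in which all records from the same infant stay in the same fold. The sketch below is a minimal illustration under that assumption; the function name `patient_k_fold` and the patient identifiers are hypothetical, and the folds are assigned deterministically (no shuffling) so that the split can be reproduced exactly.

```python
def patient_k_fold(patient_ids, k):
    """Yield (train_indices, test_indices) pairs for k-fold validation.
    Records are grouped by patient so that no patient contributes data
    to both the training and the testing side of the same split."""
    unique = sorted(set(patient_ids))
    if not 2 <= k <= len(unique):
        raise ValueError("k must be between 2 and the number of patients")
    # Deterministic round-robin assignment of patients to folds.
    folds = [unique[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i, p in enumerate(patient_ids) if p not in held]
        test = [i for i, p in enumerate(patient_ids) if p in held]
        yield train, test


# Six hypothetical recordings from four infants.
record_patients = ["p1", "p1", "p2", "p3", "p3", "p4"]
splits = list(patient_k_fold(record_patients, k=2))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Reporting the split procedure at this level of detail (grouping unit, number of folds, and whether shuffling was used) is one way a study can make its validation method reproducible.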

Lack of explanations and reasoning

The trustworthiness of algorithms is another obstacle222. The most widely used deep learning models use a black-box methodology, in which the model simply receives input and outputs a prediction without explaining its thought process. In high-stakes medical settings, this can be dangerous. Some models, on the other hand, incorporate human judgment (human-in-the-loop) or provide interpretability maps or explainability layers to illuminate the decision-making process. Especially in the field of neonatology, where AI is expected to have a significant impact, this trustworthiness is essential for its widespread adoption.

Lack of collaboration efforts (multi-institutions) and privacy concerns

New collaborations have been forged because of this information; early detection and treatment of diseases that affect children, who make up a large portion of the world’s population, will change treatment and follow-up status. Monitoring systems and knowledge of mortality and treatment activity from multi-site data will help. The necessity for consent to the processing of personal health data by AI systems is one example of a subject related to the protection of privacy and security96. Efforts involving multiple institutions can facilitate training, but there are privacy concerns associated with the cross-site sharing of imaging data. Federated learning (FL) was introduced recently to address privacy concerns by facilitating distributed training without the transfer of imaging data223. Existing FL techniques utilize conditional reconstruction models to map from undersampled to fully-sampled acquisitions using explicit knowledge of the accelerated imaging operator223. Nevertheless, the data from various institutions is typically heterogeneous, which may diminish the efficacy of models trained using federated learning. SplitAVG is proposed as a novel heterogeneity-aware FL method to surmount the performance declines in federated learning caused by data heterogeneity224.
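The core idea of federated learning, that only model parameters and never patient data leave each site, can be illustrated with a minimal federated averaging sketch. This is a generic illustration and is neither the reconstruction method of ref. 223 nor SplitAVG; the site names, the toy linear model, the learning rate, and the helper functions `local_update` and `federated_average` are all assumptions made for the example.

```python
def local_update(weights, data, lr=0.1):
    """One gradient step of a least-squares fit y ~ w0 + w1 * x,
    computed only on the private data held by a single site."""
    w0, w1 = weights
    n = len(data)
    g0 = sum((w0 + w1 * x - y) for x, y in data) / n
    g1 = sum((w0 + w1 * x - y) * x for x, y in data) / n
    return [w0 - lr * g0, w1 - lr * g1]


def federated_average(site_weights, site_sizes):
    """Average the per-site parameters, weighted by local sample count."""
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical private datasets at two NICUs; both follow y = 1 + 2x.
sites = {
    "nicu_a": [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)],
    "nicu_b": [(0.5, 2.0), (1.5, 4.0)],
}

global_w = [0.0, 0.0]
for _ in range(500):
    # Each site trains locally; only the updated weights are shared.
    local_ws = [local_update(global_w, data) for data in sites.values()]
    global_w = federated_average(local_ws, [len(d) for d in sites.values()])

print([round(w, 3) for w in global_w])
```

In this toy setting the two sites hold data from the same distribution, so the averaged model recovers the shared relationship. When sites differ systematically (different scanners, protocols, or populations), plain averaging can degrade, which is the heterogeneity problem described above.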

AI ethics

While AI has great promise for enhancing healthcare, it also presents significant ethical concerns. Ethical concerns in health AI include informed consent, bias, safety, transparency, patient privacy, and allocation, and their solutions are complicated to negotiate225. In neonatology, crucial decision-making is frequently accompanied by a complicated and challenging ethical component. Interdisciplinary approaches are required for progress226. The border of viability, life sustaining treatments227, and the different regulations worldwide make AI utilization in neonatology more complicated. How an ethics framework can be implemented for AI in neonatology has not been reported yet, and there is a need for transparency for trustworthy AI.

The applications of AI in real-world contexts have the potential to result in a few potential benefits, including increased speed of execution; potential reduction in costs, both direct and indirect; improved diagnostic accuracy; increased healthcare delivery efficiency (“algorithms work without a break”); and the potential of supplying access to clinical information even to persons who would not normally be able to utilize healthcare due to geographic or economic constraints4.

To achieve an accurate diagnosis, it is planned to limit the number of extra invasive procedures. New DL technologies and easy-to-implement platforms will enable regular and complete follow-up of health data for patients unable to access their records owing to a physician shortage, hence reducing health costs.

The future of neonatal intensive care units and healthcare will likely be profoundly impacted by AI. This article’s objective is to provide neonatologists in the AI era with a reference guide to the information they might require. We defined AI, its levels, its techniques, and the distinctions between the approaches used in the medical field, and we examined the possible advantages, pitfalls, and challenges of AI, while also attempting to present a picture of its potential future implementation in standard neonatal practice. AI in pediatrics requires clinicians’ support, and AI researchers and clinicians need to work together cooperatively. As a result, AI in neonatal care is in high demand, and there is a fundamental need for a human (pediatrician) to be involved in AI-backed applications, in contrast to systems that are more technically advanced and involve fewer healthcare professionals.

Methods

Literature review and search strategy

We used PubMed™, IEEEXplore™, Google Scholar™, and ScienceDirect™ to search for publications relating to AI, ML, and DL applications towards neonatology. We searched with varying combinations of the keywords (i.e., one from technical keywords and one from clinical keywords). Technical keywords were AI, DL, ML, and CNN. Clinical keywords were “infant,” “neonate,” “prematurity,” “preterm infant,” “hypoxic ischemic encephalopathy,” “neonatology,” “intraventricular hemorrhage,” “infant brain segmentation,” “NICU mortality,” “infant morbidity,” “bronchopulmonary dysplasia,” and “retinopathy of prematurity.” The inclusion criteria were (i) publication date between 1996 and 2022, (ii) being a study of artificial intelligence in neonatology, (iii) written in English, (iv) published in a scholarly peer-reviewed journal, and (v) conducting an objective assessment of AI applications in neonatology. Review papers, commentaries, letters to the editor, papers with only technical improvement without any clinical background, animal studies, papers that used statistical models like linear regression, studies written in any language other than English, dissertations and theses, posters, biomarker prediction studies, simulation-based studies, studies with infants older than 28 days of life, perinatal death studies, and obstetric care studies were excluded. The preliminary search yielded approximately 9000 articles in total. Through examination of the abstracts of the papers, a subset of 987 studies was identified (Fig. 4). Ultimately, 106 studies were selected for inclusion in our systematic review (Supplementary file). The evaluation encompassed diverse aspects, including sample size, methodology, data type, evaluation metrics, advantages, and limitations of the studies (Tables 2–7).
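The pairing of one technical keyword with one clinical keyword can be sketched as follows. The keyword lists are taken from the text above; the `"X" AND "Y"` query syntax and the function name `build_queries` are assumptions made for illustration, since each database has its own search syntax and the exact query strings used are not stated.

```python
from itertools import product

TECHNICAL = ["AI", "DL", "ML", "CNN"]
CLINICAL = [
    "infant",
    "neonate",
    "prematurity",
    "preterm infant",
    "hypoxic ischemic encephalopathy",
    "neonatology",
    "intraventricular hemorrhage",
    "infant brain segmentation",
    "NICU mortality",
    "infant morbidity",
    "bronchopulmonary dysplasia",
    "retinopathy of prematurity",
]


def build_queries(technical, clinical):
    """Pair every technical keyword with every clinical keyword.
    The AND-joined, quoted form is an assumed generic syntax."""
    return [f'"{t}" AND "{c}"' for t, c in product(technical, clinical)]


queries = build_queries(TECHNICAL, CLINICAL)
print(len(queries), "queries, for example:", queries[0])
```

With 4 technical and 12 clinical keywords, this yields 48 keyword pairs per database.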