Introduction

Neonatal jaundice is one of the most common conditions encountered by neonatologists and pediatricians, occurring in ~60–80% of healthy term newborns during the first days of life.1,2 Although most jaundice is benign, 8–9% of newborns might develop severe hyperbilirubinemia (HB), defined as a total serum bilirubin (TSB) level above the 95th percentile for age in hours (high-risk zone) during the first week.3 Neonates with HB, when unmonitored or untreated, can develop acute bilirubin encephalopathy (ABE), which can lead to varying degrees of brain damage and neurobehavioral disorders.4,5 If the TSB concentration is not reduced in time to prevent further neurotoxicity in these neonates, chronic irreversible encephalopathy, known as kernicterus, or even death can occur.6 ABE remains a significant cause of morbidity and mortality worldwide and can account for up to 15% of neonatal deaths in low- and middle-income countries.4 The incidence of ABE may have decreased in developed countries in recent years, but it still occurs at a rate of 0.4–2.7 cases per 100,000 infants,7,8 with a higher incidence in Asia, the Middle East, and Africa.9 In Nigeria, ABE was diagnosed in 159 of 1040 patients admitted for treatment of jaundice (15.3%),10 and a rate of ~4.8% has been reported in China.11 Early identification of newborns at high risk of ABE is crucial for timely treatment, both to minimize the incidence of kernicterus and to avoid overtreatment.

The TSB concentration is most widely used for evaluating neonatal jaundice, but it is not a direct index of the actual bilirubin level in the brain and not an accurate predictor of ABE.12 Moreover, because the collection of blood is required, TSB measurement has a risk of infection and anemia.13,14 Therefore, a noninvasive method for direct detection of bilirubin-induced changes in the brain is needed for ABE diagnosis. Magnetic resonance imaging (MRI) has been widely used for diagnosing neurological diseases, including bilirubin encephalopathy.15,16,17 Many radiological reports found that in the early stages of ABE, T1 hyperintensity of the globus pallidus (GP) bilaterally is a common characteristic in most cases.18,19,20 This might be caused by the relatively high resting neuronal activity in the GP, which makes it particularly vulnerable to the intense, subacute oxidative stresses from mitochondrial toxins such as bilirubin.6 However, this radiological signature does not hold for all cases. To complicate things further, non-ABE neonates with HB conditions often exhibit T1 hyperintensity in the GP as well, making it difficult to differentiate ABE and non-ABE HB patients only by T1-weighted images (T1WIs).

Previous studies revealed that T1WI, T2-weighted imaging (T2WI), and diffusion-weighted imaging (DWI) all contributed to the diagnosis of ABE and may provide complementary information to improve diagnostic accuracy. Wisnowski et al.15 and Wang et al.19 reported that MRI of neonates in the first days to weeks following ABE showed an increased T1 signal in the GP, while T2WI of this region was often unremarkable or showed subtle T2 hyperintensity. The increased T2 signal in the GP was often observed in the chronic stage.21 In a study of 30 ABE patients and 24 control subjects, Cece et al.22 found that there was a significant correlation between bilirubin values and DWI-based apparent diffusion coefficients (ADCs) (r = 0.41, p < 0.05). We speculate that the accuracy of diagnosing ABE could be further improved by combining information from multimodal MRI.

While a traditional radiological decision is based on visual inspection, it can be subjective, highly empirical, and especially difficult when identifying diseases that do not have a clear radiological standard, such as ABE. To this end, machine-learning-based methods have gained acceptance among radiologists and clinicians.23 Particularly, deep-learning algorithms, such as convolutional neural networks (CNNs), have been widely used in medical image analyses and achieved great success.24,25,26,27 In this study, we evaluated whether the use of multimodal MRI and a deep-learning approach can differentiate ABE patients from non-ABE neonates with HB. Two advanced CNN models, namely, ResNet18 and DenseNet201, were tested for classifying ABE and non-ABE patients from a cohort of HB neonates based on T1WI, T2WI, and ADC, individually and in combination. We also compared CNN-based results with a traditional statistical approach based on normalized T1WI intensity, T2WI intensity, and ADC values in the GP, with logistic regression as the classifier. This study demonstrates a potential noninvasive diagnostic method for ABE, which might improve clinicians’ performance and support clinical management, especially in regions with high ABE incidence.

Materials and methods

Study subjects

The data were collected retrospectively from routine clinical examinations at the Children’s Hospital of Zhejiang University School of Medicine between 2016 and 2020. All research protocols were approved by the local Institutional Review Board with a waiver of consent. MRI data were collected from a total of 150 HB neonates who were clinically confirmed with TSB >5 mg/dL,28 including 75 with ABE and 75 without ABE, who underwent MRI during their hospitalization at a postmenstrual age (PMA) of 37–41 weeks at the time of the scan. All ABE-positive cases had a bilirubin-induced neurologic dysfunction (BIND) score of ≥1. The BIND score is based on muscle tone, cry pattern, and behavioral and mental status, with a total of nine points; scores of 1–3, 4–6, and 7–9 represent mild, moderate, and severe ABE, respectively.29 Non-ABE infants did not exhibit any ABE-related clinical symptoms. The diagnosis was confirmed based on the clinical records by two experienced pediatricians with >8 years of clinical practice (X.S. and C.L.).

MRI acquisition

All images were acquired using a 3.0-T MRI scanner (Achieva, Philips Healthcare, Best, The Netherlands) based on a routine clinical brain MRI protocol with T1WI, T2WI, and DWI. The T1-weighted fast gradient-echo sequence was performed using the following parameters: echo time (TE) of 2.14 ms, repetition time (TR) of 200 ms, flip angle of 80°, a field of view (FOV) of 330 × 330 mm², in-plane resolution of 0.45 × 0.45 mm², and 18 slices with a thickness of 4.5 mm in the axial direction. The T2-weighted turbo spin-echo sequence was performed using the following parameters: TE/TR = 80/3000 ms, FOV of 230 × 230 mm², in-plane resolution of 0.34 × 0.34 mm², and 18 slices with a thickness of 4.5 mm in the axial direction. Diffusion-weighted echo-planar imaging was acquired using the following parameters: TE/TR = 80/2109 ms, FOV of 230 × 230 mm², in-plane resolution of 0.90 × 0.90 mm², 18 slices with a thickness of 4.5 mm in the axial direction, one non-DWI (b0), and a single DWI at a b value of 800 s/mm². All images were visually examined by pediatric radiologists to ensure adequate image quality for further analysis.

Image preprocessing

The ADC map was calculated using the following equation: \(\mathrm{ADC} = -\ln(S_{\mathrm{DWI}}/S_{b0})/b\), where \(S_{\mathrm{DWI}}\) and \(S_{b0}\) are the signal intensities of the diffusion-weighted and b0 images, respectively, and \(b\) is the b value. In order to combine the three types of images for the CNN models, we first performed image registration between the different image modalities by aligning the T2WI and ADC images to T1WI using the FMRIB’s Linear Image Registration Tool (FSL v6.0, FMRIB, Oxford, UK)30 with a 2D rigid-body transformation since the images were acquired with the same slice center and the same slice thickness. Then, three contiguous slices centered around the GP region from the T1WI, T2WI, and ADC images were selected as the inputs to the networks. Thus, 225 slices from 75 ABE patients and 225 slices from 75 non-ABE patients were selected for each MRI modality. We then cropped the images around the brain, resized them uniformly to the size of 224 × 224 pixels, and normalized the intensities between 0 and 1.
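As an illustration of the ADC calculation and the final intensity normalization described above, the following minimal Python sketch may be helpful. It is not the authors' implementation (the study used MATLAB and FSL); the function names `adc_map` and `min_max_normalize`, the use of nested lists for images, and the handling of zero-signal voxels are assumptions made for this example. Registration, cropping, and resizing are omitted.

```python
import math


def adc_map(s_dwi, s_b0, b=800.0):
    """Voxel-wise ADC = -ln(S_DWI / S_b0) / b for 2D lists of signal values.

    b is the b value in s/mm^2, so the result is in mm^2/s.
    """
    adc = []
    for row_dwi, row_b0 in zip(s_dwi, s_b0):
        adc_row = []
        for dwi, b0 in zip(row_dwi, row_b0):
            if dwi <= 0 or b0 <= 0:
                # Assumption: voxels without signal (background) are set to 0.
                adc_row.append(0.0)
            else:
                adc_row.append(-math.log(dwi / b0) / b)
        adc.append(adc_row)
    return adc


def min_max_normalize(image):
    """Rescale a 2D list of intensities linearly to the range [0, 1]."""
    values = [v for row in image for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant image carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in image]
    return [[(v - lo) / (hi - lo) for v in row] for row in image]
```

Min-max scaling is only one plausible reading of "normalized the intensities between 0 and 1"; the text does not specify whether the scaling was applied per slice or per subject.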

Logistic regression with normalized T1WI, T2WI, and ADC

As the GP is known to be the most vulnerable brain region affected by bilirubin neurotoxicity,19 we utilized the MR features of the GP for classification using logistic regression.31 For quantification purposes, we normalized the T1WI, T2WI, and ADC signal intensities of GP to that of the subcortical white matter (WM) as there is no known effect of ABE on the MR properties of the WM. The normalized intensity of the GP was calculated as GPnorm = \(\frac{{\overline {\mathrm{GP}} }}{{\overline {\mathrm{WM}} }}\), where \(\overline {\mathrm{GP}}\) and \(\overline {\mathrm{WM}}\) were averaged intensities in the manually delineated GP and WM regions of interest (ROIs) on a center slice that covered the GP (Fig. 1a).
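The normalized GP intensity defined above is a ratio of two ROI means, which can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `roi_mean` and `gp_norm` and the representation of ROIs as binary masks of the same shape as the image are assumptions, not details given in the text.

```python
def roi_mean(image, mask):
    """Mean intensity of the pixels in `image` where `mask` is truthy."""
    values = [
        v
        for img_row, mask_row in zip(image, mask)
        for v, m in zip(img_row, mask_row)
        if m
    ]
    if not values:
        raise ValueError("ROI mask is empty")
    return sum(values) / len(values)


def gp_norm(image, gp_mask, wm_mask):
    """Normalized GP intensity: mean(GP ROI) / mean(WM ROI) on one slice."""
    return roi_mean(image, gp_mask) / roi_mean(image, wm_mask)
```

The same function would apply unchanged to the T1WI, T2WI, and ADC slices, giving one normalized value per modality.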

Fig. 1: ROI definition and multimodal MRI measurements in the ABE and non-ABE neonates with HB.

a–c T1WI, T2WI, and ADC images of representative ABE neonates. d–f T1WI, T2WI, and ADC images of a representative non-ABE neonate. The blue outlines indicate the WM ROI and the red outlines indicate the GP ROI. g–i Comparison of the MR features between ABE and non-ABE neonates, in terms of GPnorm,T1 (g), GPnorm,T2 (h), and GPnorm,ADC (i). GPnorm,T1 of ABE and non-ABE neonates are 1.345 ± 0.062 and 1.405 ± 0.126. GPnorm,T2 of ABE and non-ABE neonates are 1.342 ± 0.059 and 1.426 ± 0.146. GPnorm,ADC of ABE and non-ABE neonates are 0.767 ± 0.050 and 0.774 ± 0.056.

Logistic regression was performed using a MATLAB toolbox (Mathworks, Natick, MA), with the following input schemes: (1) individual single-modal features of GPnorm,T1, GPnorm,T2, or GPnorm,ADC, (2) combination of any two of these features, and (3) combination of all three features. The maximum Youden index32,33 was used to determine the optimal cut-off threshold of these features for separating ABE and non-ABE patients.
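The Youden index is J = sensitivity + specificity − 1, and the optimal cut-off is the threshold that maximizes J. A minimal Python sketch of this selection is given below. The function name `youden_cutoff`, the exhaustive search over observed values, and the `positive_if_below` switch (for features where the positive class tends to have lower values) are assumptions for illustration and do not reproduce the MATLAB toolbox used in the study.

```python
def youden_cutoff(scores, labels, positive_if_below=False):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    scores: feature values or predicted probabilities.
    labels: 1 for the positive class (here ABE), 0 for the negative class.
    Both classes must be present in `labels`.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")

    best_thr, best_j = None, -1.0
    for thr in sorted(set(scores)):
        if positive_if_below:
            preds = [s <= thr for s in scores]
        else:
            preds = [s >= thr for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p and y == 1)
        tn = sum(1 for p, y in zip(preds, labels) if not p and y == 0)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:  # ties keep the smallest threshold
            best_thr, best_j = thr, j
    return best_thr, best_j
```

In practice the threshold could be searched either on a raw feature or on the probability output by the fitted logistic model; the sketch accepts both.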

Deep-learning framework

We applied two CNN models, ResNet1834 and DenseNet201,35 which were pre-trained on a public database named ImageNet36 with three-channel (i.e., RGB images) inputs, with a transfer learning strategy for differentiating ABE and non-ABE patients based on multimodal MRI images. Since each single-modal image is a 2D grayscale image, the following strategies were taken to meet the three-channel input scheme: (1) for the single-modal data, we simply duplicated the normalized image to make three identical channels; (2) for the two-modal data, i.e., T1WI + T2WI, T1WI + ADC, or T2WI + ADC, we added an empty image with all zero values as the additional channel; (3) for the three-modal data, T1WI, T2WI, and ADC naturally constituted the three channels. The resulting 225 images were divided into 80% and 20% for the training and testing sets. Data augmentation was applied to the training dataset, which included image rotation by a random angle in the range of −30° to 30°, image zooming by a random scale within the range of 0.9–1.1, and horizontal and vertical image translation by a random distance in the range of −30 to 30 pixels. A 5-fold cross-validation was applied to assess the models’ generalization performance with metrics of classification accuracy, the area under the ROC curve (AUC), sensitivity, specificity, precision, and F1 score. Equations (1)–(5) define these performance metrics, where TP, FP, TN, and FN represent the numbers of true-positive, false-positive, true-negative, and false-negative cases, respectively. The performance metrics are presented as mean ± standard deviation from the 5-fold cross-validation:

$${\mathrm{Accuracy}} = \frac{{{\mathrm{TP}} + {\mathrm{TN}}}}{{{\mathrm{TP}} + {\mathrm{FP}} + {\mathrm{TN}} + {\mathrm{FN}}}}$$
(1)
$${\mathrm{Sensitivity}} = \frac{{\mathrm{TP}}}{{{\mathrm{TP}} + {\mathrm{FN}}}}$$
(2)
$${\mathrm{Specificity}} = \frac{{\mathrm{TN}}}{{{\mathrm{TN}} + {\mathrm{FP}}}}$$
(3)
$${\mathrm{Precision}} = \frac{{\mathrm{TP}}}{{{\mathrm{TP}} + {\mathrm{FP}}}}$$
(4)
$${F}1\;{\mathrm{score}} = \frac{{2 \times {\mathrm{precision}} \times {\mathrm{sensitivity}}}}{{{\mathrm{precision}} + {\mathrm{sensitivity}}}}$$
(5)
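Equations (1)–(5) can be evaluated directly from the four confusion-matrix counts, as in the minimal Python sketch below. The function name `classification_metrics` and the convention of returning 0 when a denominator is zero are assumptions for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics of Eqs. (1)-(5) from confusion-matrix counts."""

    def ratio(num, den):
        # Assumption: report 0.0 when the denominator is zero.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical counts, chosen only to illustrate a classifier that misses
# most positive cases while producing no false positives.
example = classification_metrics(tp=3, fp=0, tn=15, fn=12)
```

With these hypothetical counts, precision and specificity are both 1 while sensitivity is only 0.2, which shows how a high precision can coexist with a poor overall accuracy.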

The models were obtained from the Deep Learning Toolbox in MATLAB 2019a, in which the experiments were also implemented. The hyperparameters of the CNN were heuristically set as follows: the learning rate was initialized to 0.0003, the maximum number of epochs was limited to 6, and a stochastic gradient descent with momentum solver was used with a minibatch size of ten images for training.
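The three-channel input scheme described in this section (duplicating a single modality, zero-padding two modalities, and stacking three modalities) can be sketched as follows. This is a hedged illustration rather than the authors' MATLAB code; the function name `to_three_channels` and the channel-first nested-list layout are assumptions.

```python
def to_three_channels(images):
    """Build a 3-channel input from one, two, or three registered 2D images.

    images: list of 2D lists (same shape), already normalized to [0, 1].
    Returns a list of three 2D lists (channel-first layout, an assumption).
    """
    if len(images) == 1:
        # Single modality: duplicate the image into three identical channels.
        return [[row[:] for row in images[0]] for _ in range(3)]
    if len(images) == 2:
        # Two modalities: append an all-zero image as the third channel.
        zeros = [[0.0 for _ in row] for row in images[0]]
        return [[row[:] for row in img] for img in images] + [zeros]
    if len(images) == 3:
        # Three modalities: each modality occupies one channel.
        return [[row[:] for row in img] for img in images]
    raise ValueError("expected one, two, or three images")
```

The order of modalities within the channels is not specified in the text; any fixed order would work as long as it is used consistently for training and testing.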

Statistical analyses

Differences in the sex distribution among groups were evaluated using the χ2 test, while other clinical features, which all passed the Kolmogorov–Smirnov normality test, were evaluated by a two-tailed t test with unequal variance. The differences in the GPnorm,T1, GPnorm,T2, and GPnorm,ADC measurements between the ABE and non-ABE groups were also tested using t tests. A p value < 0.05 was considered statistically significant. All statistical analyses were performed using IBM SPSS Statistics 21 (https://www.ibm.com/products/spss-statistics).
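For readers who wish to see how a t test with unequal variance (Welch's t test) is formed, the following Python sketch computes the test statistic and the Welch–Satterthwaite degrees of freedom from group summary statistics. It is a sketch under stated assumptions, not the SPSS procedure used in the study: the function name `welch_t_from_summary` is hypothetical, and the inputs are assumed to be group means, sample standard deviations, and group sizes. The p value would then be obtained from Student's t distribution with the returned degrees of freedom, which is left to a statistics package.

```python
import math


def welch_t_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and degrees of freedom from summary statistics.

    sd1 and sd2 are sample standard deviations; n1 and n2 are group sizes.
    """
    v1 = sd1 ** 2 / n1  # squared standard error of group 1
    v2 = sd2 ** 2 / n2  # squared standard error of group 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df
```

Unlike the pooled-variance t test, this form does not assume that the two groups share a common variance, which matters when the spreads of the groups differ noticeably.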

Results

The demographic and clinical characteristics of the patients in our study are listed in Table 1, including sex, age, weight, gestational age (GA), PMA at scan, TSB, and albumin. Significant differences in age (p = 0.004, a difference of ~2 days) and TSB (p < 0.001, a difference of ~5.03 mg/dL) were found between the ABE and non-ABE groups, while other features were comparable between the two groups (p > 0.05).

Table 1 The demographic and clinical characteristics of the patients.

Figure 1 shows T1WI, T2WI, and ADC images of representative ABE (Fig. 1a–c) and non-ABE (Fig. 1d–f) patients with HB. Blue and red outlines indicate the manually traced WM and GP ROIs for calculations of normalized intensities. The differences in GPnorm,T1, GPnorm,T2, and GPnorm,ADC are presented in Fig. 1g–i. t tests indicated a significant difference in GPnorm,T1 (1.345 ± 0.062 versus 1.405 ± 0.126, p < 0.001) and GPnorm,T2 (1.342 ± 0.059 versus 1.426 ± 0.146, p < 0.001) values, but no significant difference was found in GPnorm,ADC (0.767 ± 0.050 versus 0.774 ± 0.056, p > 0.05). Considerable overlaps were observed between the two groups for all three measurements, indicating the difficulty of classification based on any of the single modalities.

The performance of logistic regression in identifying ABE and non-ABE infants is shown in Table 2 and Supplementary Fig. S1, with ROC curves shown in Fig. 2a. The combined feature of GPnorm,T2 and GPnorm,ADC achieved the highest AUC of 0.681, while the highest accuracy of 0.833 was obtained using the combination of GPnorm,T1 and GPnorm,ADC. The combination of all three modalities provided the second-highest AUC of 0.677 and the second-best accuracy of 0.800. Accuracies of 0.720, 0.773, and 0.600 were found for GPnorm,T1, GPnorm,T2, and GPnorm,ADC, with optimal cut-off values of 1.439, 1.435, and 0.714, respectively. These results indicated that although ADC alone did not have a good predictive value, the prediction accuracy improved when it was combined with T1WI or T2WI. Supplementary Fig. S1(a) shows that the sensitivities of logistic regression were almost 100% for all modalities except ADC, indicating that the method has a good capability for identifying true-positive samples (ABE), with almost no false-negative samples detected in our experiments. However, since there was no statistically significant difference between the GPnorm,ADC of ABE and non-ABE neonates (Fig. 1i), the optimal cut-off value of GPnorm,ADC (0.714) could hardly separate the two groups: it produced many false-negative samples and no false-positive samples, which directly led to the poor sensitivity of 20% together with a precision of 100% and a specificity of 100%.

Table 2 The performance metrics of logistic regression, ResNet18 and DenseNet201, on classifying ABE and non-ABE based on single- and multimodal data.
Fig. 2: ROC curves of the single- and multimodal MRI features for differentiating ABE and non-ABE.

Single-modal data of T1WI, T2WI, ADC, and multimodal data of T1WI + T2WI, T1WI + ADC, T2WI + ADC, and T1WI + T2WI + ADC were tested, respectively. a ROC curves based on logistic regression classifiers using the semi-quantitative MRI measurements in the GP. b ROC curves based on ResNet18 using the single- and multimodal images. c ROC curves based on DenseNet201 using the single- and multimodal images.

We then evaluated the performances of the ResNet18 and DenseNet201 CNN models based on the single- or multimodal images through a 5-fold cross-validation. Comparing the results using different classifiers (Table 2), DenseNet201 achieved the best overall performance, followed by ResNet18, and both outperformed logistic regression. t tests indicated that the classification accuracy of DenseNet201 was significantly higher than that of ResNet18 when using combined images of T1WI + ADC (p = 0.048) and T2WI + ADC (p = 0.003), but their performance was similar for T1WI, T2WI, ADC, T1WI + T2WI, and T1WI + T2WI + ADC.

Figure 2 and Supplementary Fig. S1 show that with the increased number of MR modalities fused in the input image, the AUC gradually improved for both CNN models, and the combination of all three modalities gave the best performance in almost all of the evaluation metrics with high accuracy of 0.929 and an AUC of 0.991 for DenseNet201. Among the single-modal MRI data, T1WI had the best classification performance, followed by T2WI and then ADC for DenseNet201, which is similar to the findings from logistic regression. Interestingly, the sensitivities of DenseNet201 for single-modal MRI from high to low were T2WI, ADC, and T1WI, while their specificities showed approximately an opposite order, which again suggested their complementary roles in the classification task. Among the two-modal data, T1WI + T2WI achieved an accuracy of 0.918 with an AUC of 0.991, which was considerably higher than that for T1WI + ADC and T2WI + ADC.

Discussion

This study evaluated whether multimodal MRI could improve the diagnostic performance compared with using a single modality. We also demonstrated the advantage of deep-learning networks compared with the traditional statistical methods based on the multimodal MRI markers. This was done in the framework of separating ABE and non-ABE neonates who both had HB, which is known to be particularly challenging with current clinical and radiological examinations. Our results indicated that multimodal MRI plays an important role in the clinical management of ABE, and should be incorporated into the clinical routine whenever MRI is available.

At present, ABE remains one of the most significant causes of neonatal mortality and lifelong disability. The commonly used physiological parameters, such as TSB, albumin, unconjugated or free bilirubin levels, and bilirubin bound to albumin, do not have sufficient diagnostic power as they do not directly reflect bilirubin toxicity in the brain.37,38 The clinical manifestations and neurological symptoms could also be absent, subtle, or nonspecific in the early phases of ABE.39 When an overt clinical sign appears, the bilirubin-induced neurological injuries may have already been present and become irreversible. Although MRI has been increasingly used to investigate the neuropathology induced by ABE in the clinical setting, its diagnostic accuracy is limited and research in this field is relatively scarce. A study by Mao et al.20 reported that 20 of 36 neonates with HB had symmetric hyperintense GP on T1WI, and among these 20 HB neonates, 15 had ABE. Coskun et al.18 reported that 8 of 13 (61.54%) ABE patients demonstrated bilateral, symmetric increased signal intensity in the GP on T1WI and these lesions were not apparent on T2WI. Clearly, visual inspection is not sufficient for diagnosing ABE given the subtle and nonspecific differences from single-modality MRI.

Our results demonstrate that for all three methods the performance metrics gradually improved when the input data combined more modalities. The combination of three-modal images (T1WI + T2WI + ADC) achieved the best performance with a mean accuracy of 0.929, 0.871, and 0.800 for DenseNet201, ResNet18, and logistic regression, respectively. This was considerably higher than when using a single modality (accuracies < 0.750). This may be because images from the three modalities provide different presentations of the ABE pathology that support each other,15,19,22 as T1WI and T2WI reflect the chemical components of the tissue, while ADC is associated with tissue microstructures that dictate the water diffusion. The different modality images were complementary in classification sensitivity and specificity, e.g., T2WI had the highest sensitivity and T1WI gave the highest specificity.

Our results also show that the CNN models outperformed the statistical approach of logistic regression, as expected. One reason for this outcome is that the features used for classification differ substantially between the CNN and logistic regression. The CNN uses the whole image as the input so that all the information is available to the model, whereas logistic regression uses only the manually defined MRI features of GPnorm and ignores other image features that are potentially useful for the classification. Of the two CNN models, DenseNet201 achieved higher classification accuracy than ResNet18, possibly owing to its larger number of learnable layers, which likely benefited the feature extraction and classification efficacy. However, using a model with a complex architecture has a risk of overfitting, especially for a limited training sample set with little heterogeneity. As shown in Fig. 3, the training loss decreased, whereas the validation loss increased after 160 iterations for DenseNet201, indicating overfitting.

Fig. 3: Training and validation loss.

Training and validation loss on training the DenseNet201 with images fused by T1WI, T2WI, and ADC.

Another limitation of the study is that our data were all collected from one hospital; therefore, the generalizability of the models is unknown. Future multicenter studies are necessary to validate the models’ generalizability. Also, in addition to the use of conventional MRI contrasts of T1WI, T2WI, and DWI, the integration of more advanced MRI techniques, such as susceptibility-weighted MRI, perfusion MRI, and spectroscopy, as well as clinical information, is likely to further enhance the diagnostic power. Moreover, it would be ideal to test the ability of the models to predict kernicterus, the chronic phase of ABE, which is critical to the clinical management of newborns.

Conclusion

Here, we demonstrate the potential of multimodal MRI with machine-learning approaches in identifying ABE in HB patients. The results indicate that multimodal MRI outperforms the single modalities for all types of classifiers, and the CNN models outperform logistic regression with predefined features in the GP. The best performance was achieved by DenseNet201 with fused images combining T1WI, T2WI, and ADC, which achieved an accuracy of 0.929 with an AUC of 0.991. The strategy of multimodal MRI-based diagnosis of ABE is potentially applicable to clinical practice to facilitate clinical management.