Introduction

Antenatal hydronephrosis (HN) is a common prenatal ultrasound finding, detected in up to 2–5% of fetuses1. After birth, the condition is closely monitored, with up to 80% of cases resolving without intervention. In the remaining patients, HN may be secondary to a pathologic process, such as ureteropelvic junction obstruction (UPJO), ureterovesical junction obstruction (UVJO), or vesicoureteral reflux (VUR), which may benefit from surgical intervention. The challenge is to risk-stratify patients early in life; however, this is currently not possible. Babies with HN are therefore monitored with serial ultrasounds, and many will undergo invasive testing requiring urethral catheterization, intravenous access, and exposure to radioisotopes and radiation. In addition to the anxiety, discomfort, and morbidity related to these additional tests, there is growing concern about the potential link between radiation exposure and future malignancies2. Risk stratification using ultrasound images alone has the potential to streamline care for low-risk patients, reduce the number of patients investigated with invasive tests, and help providers comply with the as low as reasonably achievable (ALARA) radiation principle, while expediting interventions for those who may benefit.

Machine learning (ML) models have shown tremendous promise in healthcare, including for patients with HN. Clinical variables have been used to predict which patients are most likely to progress to surgery or to develop a urinary tract infection (UTI)3,4. Standardized assessment of anatomical regions of the kidney in ultrasound images has been explored in multiple works, including the parenchyma-to-hydronephrosis area ratio5, the hydronephrosis index comparing the total kidney area with the renal pelvis area6, automatic segmentation of kidney regions in ultrasound to predict obstruction7, and morphometric feature extraction from kidney ultrasound8. Others have developed a convolutional neural network model to broadly classify HN as Society for Fetal Urology (SFU) low vs. high grade based on the full ultrasound image of the kidney9; however, the clinical utility of such a distinction is unclear. Providers use this grading system to communicate the severity of HN, but the HN grade alone does not inform clinical decision making. In addition, assigning HN grades relies on the subjective assessment of repeated patient imaging, which has been shown to be highly variable with poor reliability among raters, introducing a critical bias into these models10,11,12.

Objective

In an attempt to remove the subjectivity of HN kidney ultrasound interpretation and broaden access to a reliable assessment tool, we built on our published proof of concept13 to develop a model that estimates the risk of requiring surgery for patients with HN directly from ultrasound images. We refer to the output this model produces as the HN Severity Index (HSI). Herein, we test this score at 4 large paediatric quaternary care institutions in North America for its ability to discriminate between surgical and non-surgical HN patients, with the goal of shifting follow-up and assessment for this condition from the current standard of care (Fig. 1A) to a more streamlined approach with fewer follow-ups and scans for low-risk patients (Fig. 1B).

Fig. 1

Clinical management of HN with proposed integration of the HN Severity Index. (A) Current clinical management of all HN patients. (B) The target application of our model to reduce testing and follow-up for low-risk HN patients, with the potential to expedite care for high-risk patients. (C) Deep learning model proposed to stratify patients using 2 views from renal ultrasound. (D) AUROC, AUPRC, sensitivity, and specificity across all test datasets, with 95% confidence intervals indicated by line ranges. The threshold for sensitivity and specificity was set in the SickKids validation set.

Materials and methods

The aim of this study was to evaluate the HSI score for paediatric patients with HN. The model was evaluated by treating surgical cases (i.e. obstructive HN) as the ground-truth label and computing the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), sensitivity (true positives/all positive cases), and specificity (true negatives/all negative cases) on a prospectively collected set of 202 consecutive patients from the development institution. HSI was then tested on data from 3 additional paediatric hospitals. A power analysis based on the SickKids training data (Supplementary Methods) was used to assess the statistical power of each HSI test. The prospective sample size from the development institution (SickKids) targeted 80% power at a 0.2 null hypothesis margin. The present study is reported in compliance with the Standardized Reporting of Machine Learning Applications in Urology (STREAM-URO) framework for reporting methods and results of machine learning tools in urology (Supplementary Table 8)14.

Ethical review

Each site received approval from its respective Institutional Review Board (IRB) or Research Ethics Board (REB) for this work. Deidentified data were collected via retrospective chart review; therefore, a waiver of consent was granted. Specifically, approval for data collection and analysis was granted by the Hospital for Sick Children REB, the Children’s Hospital of Philadelphia IRB, the Lucile Packard Children’s Hospital IRB, and the University of Iowa Stead Children’s Hospital IRB. All research was performed in accordance with relevant guidelines and regulations.

Retrospective data collection

Data collection included all samples from our original study13 and was extended to include more SickKids samples for training, with additional, new, prospective samples for testing. HN patients who were seen in the paediatric urology clinic between 2015 and 2019, were less than 24 months of age at baseline, and had ultrasound findings of isolated hydronephrosis or hydroureteronephrosis were included in this study. Patients with vesicoureteral reflux (VUR) were also included if the reflux was diagnosed during the workup of HN and was associated with HN on ultrasound. Patients with VUR detected after a urinary tract infection (UTI) without evidence of HN, as well as those with known congenital anomalies of the urinary tract—such as duplication anomalies, posterior urethral valves and neurogenic bladder—were excluded. The inclusion of children with VUR diagnosed during the work-up of HN allowed for a fair comparison with children who have HN with an unknown VUR status, while the exclusion of patients with no HN and those with more complex anomalies ensured the a priori consistency of the condition being assessed. De-identified kidney ultrasound images along with a linked set of clinical characteristics were retrospectively collected at each study site. Captured variables included patient age, sex, kidney laterality, and any surgical intervention. Ultrasound images and the surgical intervention variable were used to develop and test the model, whereas the remaining variables were used to stratify model performance and assess bias. HN was graded according to the SFU grading system15, and grades were assigned by a paediatric radiologist and by experienced paediatric urology clinicians. One representative sagittal view and one transverse view were collected by capturing a screenshot in PNG format centered on the kidney. This was done by reviewing the full ultrasound sequence and selecting the sagittal and transverse kidney images that appeared clearest to the person selecting them. Images were selected by urologists, trainees, and research assistants who had received training in how to select images, most often using images that had been used to measure the sagittal kidney size and anteroposterior diameter (APD). All kidney images were then resized to 256 × 256 pixels, converted to greyscale, contrast adjusted via uniform histogram equalization, and saved as PNG files. In some cases the images were first saved in a different image format and later converted to PNG (image preparation 1), while in other cases they were saved directly as PNG files (image preparation 2). This procedure was used for all images from each machine and institution.
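For concreteness, the image standardization steps above can be expressed as a short script. This is a minimal sketch assuming OpenCV is available; the function name and file paths are illustrative and not part of the published pipeline.

```python
# Minimal sketch of the described preprocessing, assuming OpenCV (cv2).
# The function name and paths are illustrative.
import cv2

def preprocess_kidney_image(path_in: str, path_out: str) -> None:
    img = cv2.imread(path_in, cv2.IMREAD_GRAYSCALE)  # load screenshot as greyscale
    img = cv2.resize(img, (256, 256))                # standardize to 256 x 256 pixels
    img = cv2.equalizeHist(img)                      # uniform histogram equalization
    cv2.imwrite(path_out, img)                       # save the standardized PNG
```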

Hydronephrosis Severity Index

We defined the HN Severity Index (HSI) as the likelihood that HN is secondary to an obstruction. The HSI ranges from 0 to 1, with 1 indicating certain obstruction and 0 indicating no probability of obstruction. The HSI threshold was set to target 90% sensitivity in HN patients from SickKids. We propose that different HSI thresholds can be used for different clinical management decisions at individual institutions or in different clinical settings; a minimal sketch of such a rule is given below. In this work, the threshold and HSI values derived from SickKids were used to assess the transferability of this single-institution management strategy and model to other independent institutions and clinical-management teams.
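As an illustration only, applying an institution-specific threshold to the HSI reduces to a simple rule; the function and variable names below are hypothetical, not a published interface.

```python
# Illustrative decision rule; 'threshold' would be derived from a validation
# set (here, SickKids) to target a chosen sensitivity such as 90%.
def flag_likely_obstruction(hsi: float, threshold: float) -> bool:
    """Return True when the HSI meets or exceeds the institutional threshold."""
    return hsi >= threshold
```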

Surgical indications and confirmation of diagnosis

Obstruction was defined as decreased differential function (< 40%) at baseline, a decrease of ≥ 5% function between serial nuclear scans, prolonged drainage (T1/2) time, progression of HN, and/or development of UTI, pain, or calculi in the setting of HN (Supplementary Fig. 4). UTIs were diagnosed by catheter specimen in febrile children with positive urinalysis (leukocyte esterase ± nitrites) and a positive urine culture of > 50,000 CFU/mL of a single organism. For patients deemed to be obstructed, surgical management included pyeloplasty for ureteropelvic junction obstruction16, and ureterostomy17, ureteral reimplantation18, or ureterovesicostomy for ureterovesical junction obstruction19. The diagnosis of obstruction (UPJO or UVJO) was confirmed intraoperatively and supported by pathology when applicable. All surgical patients in this study had an intraoperatively confirmed obstruction. The decision to send specimen(s) for pathological assessment was at the discretion of the surgeon. We did not include surgeries that were performed solely to address VUR, such as endoscopic injection or ureteral reimplantation. Resolution of HN was defined as SFU grade ≤ 1 or APD ≤ 10 mm on at least 2 consecutive ultrasounds.

Machine learning model

A Siamese convolutional neural network (CNN) was trained from randomly initialized weights using only 2 kidney ultrasound images per patient (sagittal and transverse) and surgery labels (Fig. 1C).

Model architecture

The original model used in this study is a 7-layer convolutional Siamese neural network, described in detail in Erdman et al.13, trained to discriminate between obstructed and non-obstructed HN cases. The model takes two 256 × 256 pixel images and passes each through the same (i.e. Siamese) convolutional layers; the shared weights regularize the model while allowing it to use images from two different ultrasound views. The channel depth of each image is tripled from one channel (greyscale) to mimic 3 channels (RGB). The first convolution applies a kernel of 11 pixels with a stride of 2 pixels and no padding. The second layer has a kernel size of 5, stride of 1, and padding of 2. The following 3 convolutions have a kernel size of 3 with padding of 1; the sixth convolution has a kernel size of 2, stride of 1, and padding of 1; and the final convolution has a kernel size of 3, stride of 2, and padding of 0. We then flatten the output, pass it through a fully connected layer, concatenate the outputs from the sagittal and transverse views, and pass the concatenated feature vector through 3 additional fully connected layers. We did not test alternative convolutional architectures, as our previous work13 showed no significant difference in performance between the custom architecture described here and DenseNet-12120, ResNet-1821, or VGG-1622.
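A PyTorch sketch of this architecture follows. The kernel sizes, strides, and paddings match the description above, but channel widths, middle-layer strides, and fully connected layer sizes are not reported here and are assumptions; see Erdman et al.13 for the exact specification.

```python
# Hedged sketch of the 7-layer Siamese CNN. Kernel/stride/padding values
# follow the text; channel widths and hidden sizes are assumptions.
import torch
import torch.nn as nn

class SiameseHSINet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared (Siamese) convolutional trunk applied to both views.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=11, stride=2, padding=0), nn.ReLU(),  # 256 -> 123
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2), nn.ReLU(),  # 123
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # 123
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # 123
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),  # 123
            nn.Conv2d(64, 64, kernel_size=2, stride=1, padding=1), nn.ReLU(),  # 124
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=0), nn.ReLU(),  # 61
        )
        # Per-view fully connected layer on the flattened feature map.
        self.view_fc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 61 * 61, 256), nn.ReLU())
        # Three fully connected layers on the concatenated two-view features.
        self.head = nn.Sequential(
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, sagittal: torch.Tensor, transverse: torch.Tensor) -> torch.Tensor:
        # Greyscale inputs (N, 1, 256, 256) are repeated across 3 channels to mimic RGB.
        sagittal = sagittal.repeat(1, 3, 1, 1)
        transverse = transverse.repeat(1, 3, 1, 1)
        features = torch.cat(
            [self.view_fc(self.trunk(sagittal)), self.view_fc(self.trunk(transverse))], dim=1)
        return torch.sigmoid(self.head(features)).squeeze(1)  # HSI in [0, 1]
```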

Model training

Our model was trained and tested using Python 3.8 and PyTorch v1.7.0. We trained our model over 50 epochs using a stochastic gradient descent optimizer with a learning rate of 0.001, momentum of 0.9, and weight decay of 5e−4. These parameters were set in our previous work, which retrospectively validated the model using fivefold cross-validation and a grid search over validation-set performance to identify the best parameter set13. Model training and evaluation were performed on a laptop with an NVIDIA GeForce RTX 2070 Max-Q GPU.
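The stated optimizer settings translate directly into PyTorch. The sketch below assumes a binary cross-entropy objective and a train_loader yielding (sagittal, transverse, label) batches; neither is specified in the text, and only the optimizer settings and epoch count come from it.

```python
# Training-loop sketch using the stated hyperparameters; the loss and data
# loader are assumptions. SiameseHSINet refers to the architecture sketch above.
import torch

model = SiameseHSINet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)
criterion = torch.nn.BCELoss()

for epoch in range(50):  # 50 epochs, as stated
    for sagittal, transverse, label in train_loader:  # assumed DataLoader
        optimizer.zero_grad()
        loss = criterion(model(sagittal, transverse), label.float())
        loss.backward()
        optimizer.step()
```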

Model selection

We held out a random 20% of our data as unseen test data and trained our model for 50 epochs using fivefold cross-validation. For each epoch, in each fold, we set the threshold at which the validation set achieved 90% sensitivity (Supplementary Fig. 3). We then assessed the average sensitivity of our fivefold models on the test set at each epoch, selecting the epoch at which test sensitivity was > 90% and specificity was maximized. This epoch was then used as the stopping point for a model trained with only a training/validation split.
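This stopping rule can be written compactly; the sketch below assumes the fold-averaged sensitivity and specificity per epoch are already available as arrays.

```python
# Sketch of the stopping rule: among epochs whose fold-averaged test
# sensitivity exceeds 90%, select the epoch that maximizes specificity.
import numpy as np

def select_stopping_epoch(sensitivity: np.ndarray, specificity: np.ndarray) -> int:
    """sensitivity, specificity: per-epoch values averaged over the 5 folds."""
    eligible = np.flatnonzero(sensitivity > 0.90)
    return int(eligible[np.argmax(specificity[eligible])])
```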

Model performance calculation

Model performance was assessed using 4 statistics: AUROC, AUPRC, sensitivity, and specificity. AUROC was computed using the pROC v1.17.0.1 package23 and AUPRC was computed using the EGAD v1.18.0 package24 within R v4.0.2. Sensitivity and specificity were computed by first finding the highest threshold at which the validation set achieves 90% sensitivity. This threshold was then used to split observations in the unseen data into predicted obstructed (above the threshold) and predicted non-obstructed (below the threshold). Sensitivity was then the share of obstructed cases predicted to be obstructed, and specificity the share of non-obstructed cases predicted to be non-obstructed. Confidence intervals at the α = 5% level were computed using bootstrapping. Specifically, observations were drawn from our dataset with replacement to create 500 simulated datasets of the same size as the dataset for which the confidence interval was being computed. AUROC, AUPRC, sensitivity, and specificity were computed for each simulated dataset; each statistic was then ordered over the full set of simulations, and the values at the 2.5th and 97.5th percentiles were used as the lower and upper confidence bounds, respectively.
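Although the published analysis used the pROC and EGAD packages in R, the threshold rule and percentile bootstrap can be sketched in Python as follows; the function names are illustrative, and NumPy stands in for the R tooling.

```python
# Sketch of the 90%-sensitivity threshold and the percentile bootstrap CI.
import numpy as np

def threshold_for_sensitivity(scores, labels, target=0.90):
    """Highest threshold at which validation-set sensitivity >= target."""
    positives = np.sort(scores[labels == 1])[::-1]  # positive-case scores, descending
    k = int(np.ceil(target * positives.size))       # positives that must score at/above
    return positives[k - 1]

def bootstrap_ci(scores, labels, statistic, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI (2.5th/97.5th) for statistic(scores, labels)."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, scores.size, scores.size)  # resample with replacement
        draws.append(statistic(scores[idx], labels[idx]))
    return np.quantile(draws, [alpha / 2, 1 - alpha / 2])
```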

Model evaluation data

Following development and retrospective testing of our model, data for the prospective SickKids sample were collected from patients after they had been evaluated and their treatment decisions made. Patients were selected using the same inclusion criteria described in the “Retrospective data collection” section. Model performance was assessed overall and stratified across patient features (age, sex, and postal code) and patient visit features (machine used, date) to assess bias. Patient postal codes, which are equivalent to American ZIP codes but were only available in our Canadian data, were used to assess model performance bias across patients from systematically different geographic regions of the province for the Canadian cohort. We next evaluated the correlation between HSI and surgical indication at three independent institutions, using the same patient selection criteria: Stanford Children’s Health (Stanford), University of Iowa Children’s Hospital (UIowa), and Children’s Hospital of Philadelphia (CHOP). These data were passed through the SickKids-trained model to evaluate its ability to generalize to different settings.
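Stratified evaluations of this kind reduce to grouping predictions by a patient or visit feature and recomputing each metric within the group. A sketch assuming a pandas DataFrame with hypothetical column names:

```python
# Sketch of the stratified bias assessment: recompute a metric within each
# level of a feature. Column names ('hsi', 'surgery') are assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

def stratified_auroc(df: pd.DataFrame, group_col: str) -> pd.Series:
    """Per-group AUROC; each group must contain both surgical and non-surgical cases."""
    return df.groupby(group_col).apply(
        lambda g: roc_auc_score(g["surgery"], g["hsi"]))
```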

Results

The HSI model was trained using a retrospectively-collected dataset of 1938 ultrasound images from 403 patients, with linked health records, from SickKids, of which a random 80% was used for training and 20% for testing (Supplementary Table 1). Of the 403 patients, 96 (24%) underwent surgical intervention: pyeloplasty was the most common procedure (74/96; 77%), followed by ureterovesicostomy (19/96; 20%) and ureterostomy/reimplantation (3/96; 3%).

Model generalization was tested prospectively in 202 consecutive patients evaluated at the SickKids Urology clinic. Of these, 28 (14%) underwent surgical intervention. As in the training set, pyeloplasty was the most common procedure (20/28; 71%), followed by ureterovesicostomy (7/28; 25%) and ureterostomy/reimplantation (1/28; 4%).

Overall, we found that our model scores produced an AUROC of 93%, indicating a strong sensitivity/specificity trade-off across model thresholds, and an AUPRC of 58%, reflecting the model’s precision/sensitivity (recall) trade-off (Table 2). The HSI ultrasound model therefore performs well, providing a score from only 2 images with none of the reliability issues inherent in grading HN.

Model performance was next evaluated with a threshold targeting 90% sensitivity, resulting in 93% (95% CI 91%, 100%) sensitivity and 58% (47%, 70%) specificity in the prospective test set (Table 2), in alignment with our prior findings from retrospective testing13. The HSI model showed a negative predictive value (NPV) of 99% (95% CI 99%, 100%) and a positive predictive value (PPV) of 28% (23%, 34%). NPV is favored over PPV here because the HSI score is intended first to triage patients whose HN is unrelated to obstruction and to concentrate resources on patients more likely to be obstructed.

The HSI model was further evaluated for robust prediction across patient sex, age, number of previous ultrasounds, ultrasound machine, APD, affected side, postal code, and preprocessing batch to identify groups with low model performance. We found consistently high sensitivity in all patient groups with sample sizes > 10 (Table 2, Fig. 2, Supplementary Table 5). Specificity fell for patients with high APD: 54% for APD 9–14 mm and 16% for APD > 14 mm, corresponding to a PPV of 10% and NPV of 100% for patients with APD 9–14 mm and a PPV of 51% and NPV of 92% for patients with APD > 14 mm (Fig. 2E). Group-specific findings remained consistent when only the most recent patient visits were considered (Supplementary Table 6). For these patients, the HSI model was 100% sensitive and > 80% specific using only images from their first ultrasound, suggesting that this model indeed provides the opportunity to streamline patient monitoring earlier than SFU grading allows.

Fig. 2

Model performance in prospective samples from SickKids. Model performance in terms of AUROC, AUPRC, sensitivity, and specificity by (A) Sex, (B) Age group, (C) Ultrasound visit number, (D) Ultrasound machine, (E) APD, (F) Kidney side, (G) HN side, (H) Postal code, and (I) Image preparation. All patient groups with > 10 individuals are plotted here. All data are shown in Supplementary Table 5.

Subsequently, we tested the model in three convenience samples from external institutions with different characteristics, including different distributions of surgical cases (non-obstructed vs. obstructed: Stanford 98 vs. 12, UIowa 16 vs. 53, and CHOP 29 vs. 57) and different age distributions (Table 1). Despite these differences, the model generalized effectively, with AUROCs of 90% at Stanford, 93% at UIowa, and 92% at CHOP (Table 2). Overall, the model was 89% sensitive and 73% specific (99% NPV, 15% PPV) on the Stanford dataset; 96% sensitive and 54% specific (92% NPV, 74% PPV) on the UIowa dataset; and 82% sensitive and 79% specific (68% NPV, 89% PPV) on the CHOP dataset (Fig. 1D, Table 2).

Table 1 Test data demographics for all observations. Counts of observations within each demographic group.
Table 2 SickKids model performance in the full prospective dataset. AUROC, AUPRC, sensitivity, and specificity (95% confidence interval) for each center, split by each factor. For group-level sample sizes, see Table 1 and Supplementary Table 1. None of the data shown below had been previously seen by the model.

Last, we broke down the model’s performance by patient sex, age, and HN side (Table 2). At every institution, the model performed as well or better in female patients than in male patients; however, none of these differences were significant, and neither sex showed significantly lower than 90% sensitivity. We also found no trend in performance across age groups. No institution showed significantly lower than 90% sensitivity for any HN side, except for sparsely populated subcategories: right-sided HN at CHOP (61% sensitivity, 18 patients) and at Stanford (50% sensitivity, 6 patients).

Discussion

Artificial intelligence-driven evaluation of HN patients based on a single set of ultrasound images represents a novel and important opportunity to streamline evaluation, improve access to care, and ensure safety for children through standardized clinical management. For years, the value of this technology has been an area of great interest in paediatric urology and a clear next step in advancing care beyond current classification systems. However, large datasets to build and evaluate models for HN are challenging to collect. In addition, concerns have been raised regarding the consistency of ultrasound images as well as the variability in clinical interpretation and management between different providers and institutions.

The present work tackles this challenge, providing a model that requires users to upload a small number of images and reliably outputs a score that can easily be introduced into the decision-making process. We tested the generalization of our model and found that the HSI can distinguish surgical (i.e. obstructed) HN cases from non-obstructed HN cases. Our power analysis shows that our datasets are sufficiently powered to establish clinically significant sensitivity and specificity. This, coupled with results that are reliable regardless of the prior experience or bias introduced during subjective image assessment, is an opportunity to standardize results and effectively compare care between different providers or centers. Moreover, HSI thresholds can be adjusted in different settings, granting the ability to customize the output and refine it as new patients are evaluated.

When evaluating patients with HN, ultrasound imaging is often supplemented with nuclear scans to determine differential function and drainage time. SickKids samples were used to explore the proportion of nuclear scans that could be avoided if this model were used in clinical practice. Under this strategy, if a patient’s HSI fell below the 90% sensitivity threshold, no nuclear scan would be ordered and the follow-up interval lengthened; if the HSI was at or above the threshold, the clinical standard of care would proceed (Fig. 1B). Using this strategy, we noted that 18 of 36 (50%) non-obstructed patients would avoid a nuclear scan at their first ultrasound and 5 of 9 (56%) non-obstructed patients would avoid a nuclear scan following their second ultrasound, with no obstructed patients seeing a deviation in their care (Supplementary Table 3.1). This is consistent with our original finding, in a retrospective test set, that our model could reduce nuclear scan usage in 58% of non-obstructed patients13.

We therefore find that an important utility of our tool is the opportunity to decrease the burden of HN monitoring and management on institutions and families. Our model can help to identify patients who are unlikely to have HN related to a surgical condition, which may mean these children could undergo evaluation by their primary care provider, with fewer additional ultrasounds, and avoid nuclear scans. Using our model as an opportunity to decrease the frequency and type of monitoring for this population also offers an advantage to families. Many tertiary care facilities are located in large city centers and have wide catchment areas, resulting in some families having to travel long distances for their appointments as access to tests and expertise may be limited outside of these institutions25. The hidden costs associated with visits to hospitals include lost income from missing work, travel, lodging and food which may be a challenge for some families26. In addition, by providing this information to parents and caretakers after the first ultrasound, there is great potential in decreasing anxiety and providing reassurance.

While there is value in the present work, there are important limitations to address. We acknowledge that an agreed-upon definition (or “gold standard”) for the diagnosis of obstruction does not exist. Thus, the decision to intervene surgically depends on the individual provider, patient, and family, and their assessment of the available information. Providers use proxies for determining the presence of clinically significant obstruction, such as a decrease in differential function, worsening HN, delayed drainage time, or the development of symptoms, and most surgeons rely on the same variables for recommending surgery across paediatric urology clinics27,28,29. In an attempt to objectively confirm the diagnosis of obstruction in our study, we reviewed the documented intraoperative findings as well as pathology reports when available. For non-surgical cases, non-obstructive HN was confirmed by resolution without intervention. We believe that these additional measures add validity to our study and partially explain the consistency of our model’s performance across different centers. In addition, we did not have access to consistent patient race or socioeconomic status data across the datasets used in this study. While we include data from markedly different geographic regions, and therefore different racial make-ups, a systematic assessment of the model’s performance based on these factors would be of value. Future work will therefore disentangle the potentially complex interplay this algorithm may have with patient race and socioeconomic status.

Future work for assessing the HSI will include a cost–benefit analysis for the potential savings this tool could deliver by reducing low-risk patient follow-up and imaging in different clinical settings, including different clinical divisions (e.g. Urology, Nephrology, Emergency Department) and levels of care (primary to quaternary). In addition, this work uses manual selection and cropping of specific views of the kidney. While this selection and cropping can be performed by non-specialist clinicians and technicians, automation of these steps would be beneficial and will be addressed in future adaptations of the model. We will also perform a pilot to assess the optimal deployment strategies for this tool in different settings, for example comparing its use when integrated in a Picture Archiving Communication System (PACS) vs. within the ultrasound machine vs. as a desktop application on clinic computers. Finally, as noted above, we will examine how this model may alter care in settings with different racial make-up of the care providers and patients. Past studies have shown that patient race impacts the timeline to surgery for UPJO30, therefore patient race may not show bias with respect to our model performance but may show a difference in terms of the cost–benefit of using this model for risk stratification. In addition, patient-physician racial concordance has been linked to patient outcomes31,32,33, therefore, this association will also be included as a potential factor for model bias and difference in cost–benefit in future assessments of this model.

Significance

We have demonstrated, using independent data from 4 different populations, that our novel algorithm can accurately and reliably distinguish obstructive from non-obstructive HN directly from ultrasound images alone. This carries significant potential impact for the care and management of children with HN. The model presented here can be used to develop multiple institution-specific models with far fewer training samples than were required for the original model34. This has important implications for smaller medical centers and settings where data collection and storage are challenging.

Conclusion

The HSI score, an artificial intelligence-generated prediction of HN severity based on ultrasound images alone, is accurate and generalizable, and avoids the issues inherent in subjective interpretation. The use of this technology may help reduce invasive testing for children whose HN may resolve without intervention and expedite care for those who may benefit from it. In addition, this model offers the opportunity to reduce the financial burden on families and institutions, and offers the potential for standardization across health care settings and for risk stratification by those practicing in remote areas.