Ensembled deep learning model outperforms human experts in diagnosing biliary atresia from sonographic gallbladder images

It remains challenging to diagnose biliary atresia (BA) accurately from sonographic gallbladder images, particularly in rural areas lacking relevant expertise. To help diagnose BA from sonographic gallbladder images, an ensembled deep learning model was developed. The model yields a patient-level sensitivity of 93.1% and specificity of 93.9% [with an area under the receiver operating characteristic curve of 0.956 (95% confidence interval: 0.928-0.977)] on the multi-center external validation dataset, superior to that of human experts. With the help of the model, the performance of human experts at various experience levels improves. Moreover, the model still yields expert-level performance when diagnosing from smartphone photos of sonographic gallbladder images via a smartphone app, or from video sequences. The ensembled deep learning model in this study thus offers radiologists a means to improve the diagnosis of BA in various clinical application scenarios, particularly in rural and underdeveloped regions with limited expertise.

on natural images (e.g., ImageNet) compared to most other model architectures. Considering that SENet was much more computationally expensive to train than SE-ResNet, we adopted SE-ResNet as the CNN architecture for intelligent analysis of BA in this study. The adopted SE-ResNet mainly consists of 50 residual convolutional units, each followed by a squeeze-and-excitation (SE) block (Supplementary Fig. 4). Each residual unit is composed of three convolutional layers and a skip connection from the input of the first layer to the output of the last (third) layer. Each convolutional layer performs multiple convolution operations between the layer's input and the convolutional kernels in the layer, followed by batch normalization and a rectified linear unit (ReLU) by default. The SE block is a three-layer fully connected subnetwork inserted at the end of each residual unit to adaptively adjust the importance (or "excitation") of each output channel of the residual unit by considering (or "squeezing") the global visual information of the input image. With this SE block, relevant visual features become more excited while irrelevant visual features are suppressed to a large extent, making the model focus on more relevant information during prediction. In addition, the last fully connected layer was adapted for our binary classification task.
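The squeeze-and-excitation mechanism described above can be sketched as a small PyTorch module. This follows the standard SE design (global-average "squeeze" followed by a bottlenecked excitation subnetwork); the exact layer sizes in the paper's SE-ResNet may differ.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block: globally pools ("squeezes") each channel,
    then predicts per-channel weights ("excitation") used to rescale the
    residual unit's output channels."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # global spatial average
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck FC layer
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # restore channel dim
            nn.Sigmoid(),                                # weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c))      # (b, c) channel weights
        return x * w.view(b, c, 1, 1)                    # re-weight each channel
```

A channel whose excitation weight is near 1 passes through almost unchanged, while a channel weighted near 0 is suppressed, which is exactly the relevance re-weighting described in the text.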

Supplementary Note 3: Training the SE-ResNet
Before training each SE-ResNet, all images from the training cohort were pre-processed as follows. First, from each original image, six new images were obtained by randomly cropping around the provided bounding box, each slightly larger than, and fully containing, the gallbladder region. Then, each new image was rescaled to a square image of size 224×224 pixels.
Each single-channel grayscale image was converted to a three-channel image by duplicating the single channel three times. The mean and standard deviation of pixel intensities were computed over all cropped images and then used to normalize each pixel in all rescaled images; this normalization was consistent with that of the pretrained model used. The pretrained model came from one of the most commonly used repositories on GitHub (https://github.com/Cadene/pretrainedmodels.pytorch). Each normalized image was then used as an input to the SE-ResNet for model training.
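The pre-processing pipeline above can be sketched as follows. The crop margin and the normalization constants here are illustrative assumptions; the paper computed the actual mean and standard deviation over all cropped training images.

```python
import torch
import torch.nn.functional as F

# Assumed normalization statistics -- in the paper these were computed
# over all cropped training images and matched the pretrained model.
MEAN, STD = 0.449, 0.226

def preprocess(image: torch.Tensor, bbox, margin: int = 10) -> torch.Tensor:
    """image: (H, W) grayscale tensor in [0, 1]; bbox: (x0, y0, x1, y1).
    Randomly crop slightly beyond the gallbladder bounding box, rescale to
    224x224, normalize, and duplicate to three channels."""
    h, w = image.shape
    x0, y0, x1, y1 = bbox
    # enlarge the bounding box by a random margin on each side
    x0 = max(0, x0 - int(torch.randint(0, margin + 1, (1,))))
    y0 = max(0, y0 - int(torch.randint(0, margin + 1, (1,))))
    x1 = min(w, x1 + int(torch.randint(0, margin + 1, (1,))))
    y1 = min(h, y1 + int(torch.randint(0, margin + 1, (1,))))
    crop = image[y0:y1, x0:x1][None, None]               # (1, 1, h', w')
    crop = F.interpolate(crop, size=(224, 224),
                         mode="bilinear", align_corners=False)
    crop = (crop - MEAN) / STD                           # normalize intensities
    return crop.repeat(1, 3, 1, 1)[0]                    # (3, 224, 224)
```

Calling `preprocess` six times per original image with different random margins yields the six cropped training variants described above.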
Training an SE-ResNet was an iterative process. At each iteration, a batch of (normalized) images was input to the SE-ResNet, and the scalar output for each image was compared to the expected output ("1" if the class of the input image was BA, and "0" for non-BA). Training a deep learning model amounts to updating the parameters of the model such that the actual outputs are as close to the expected outputs as possible for each batch of input images, where the model parameters mainly consist of the elements of each convolution kernel in each convolutional layer and the edge weights in each fully connected layer. The differences between the actual and expected outputs are measured by a loss function called the cross-entropy loss, which is a mathematical function of the model parameters. In other words, model training searches for the set of model parameter values that minimizes the cross-entropy loss over the training images. Mini-batch stochastic gradient descent (SGD) is one of the most widely used methods for finding such parameters over iterations, and was used for all model training in this study: the batch size (i.e., the number of images per iteration) was set to 16, and the learning rate of the SGD method was initially set to 0.01 and divided by 10 after every 35 epochs. Each epoch consisted of a sequence of training iterations through which all training images were fed into the model once. The maximum number of epochs was set to 210, by which point each model was well trained, with little further change in the model parameters.
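A minimal training loop matching the schedule above might look as follows. It assumes the two-logit cross-entropy formulation (the scalar-output description in the text could equivalently use a sigmoid with binary cross-entropy); `loader` is assumed to yield batches of 16 image/label pairs.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs: int = 210, device: str = "cpu"):
    """Mini-batch SGD as described above: learning rate starts at 0.01
    and is divided by 10 after every 35 epochs."""
    criterion = nn.CrossEntropyLoss()                    # cross-entropy loss
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                                step_size=35, gamma=0.1)
    model.to(device).train()
    for epoch in range(epochs):
        for images, labels in loader:                    # one iteration per batch
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)      # compare to 0/1 labels
            loss.backward()                              # gradients w.r.t. parameters
            optimizer.step()                             # SGD parameter update
        scheduler.step()                                 # decay lr every 35 epochs
    return model
```

One pass over `loader` is one epoch; `StepLR` implements the divide-by-10-every-35-epochs schedule directly.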
To improve the generalizability of each SE-ResNet, besides the ensemble learning described in the main text, a few effective training techniques were also applied. First, considering that there were fewer images for the BA class than for the non-BA class, the cross-entropy loss was slightly modified using well-known cost-sensitive learning [4] to increase the influence of each BA image.
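One common realization of this cost-sensitive modification is a class-weighted cross-entropy loss, where each class is weighted inversely to its frequency. The counts below are hypothetical, and the paper's exact weighting scheme may differ.

```python
import torch
import torch.nn as nn

# Hypothetical class counts: fewer BA images than non-BA images.
n_non_ba, n_ba = 3000, 1000
counts = torch.tensor([n_non_ba, n_ba], dtype=torch.float32)

# Weight each class inversely to its frequency so the rarer BA class
# contributes more to the loss for each of its images.
class_weights = counts.sum() / (2 * counts)          # -> [0.667, 2.0]
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(16, 2)                          # a batch of 16 predictions
labels = torch.randint(0, 2, (16,))                  # 0 = non-BA, 1 = BA
loss = criterion(logits, labels)
```

With this weighting, a misclassified BA image incurs roughly three times the penalty of a misclassified non-BA image, counteracting the class imbalance.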
Second, an SE-ResNet pre-trained on the large-scale natural image dataset ImageNet was used to initialize the model parameters, because such initialization has been experimentally shown to improve classification performance, particularly for new classification tasks with relatively small training datasets [5]. The last fully connected layer of the SE-ResNet was randomly initialized for our binary classification task, as is common practice when using a pre-trained model. Third, the dropout operation was applied only to the last fully connected layer of the SE-ResNet; dropout has been shown to reduce potential inter-dependence between neurons in the network and thereby improve the generalizability of the model [6]. The dropout rate was set to 0.2 throughout training in this study.
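Replacing the pre-trained ImageNet head with a freshly initialized binary head, preceded by dropout, can be sketched as below. In the Cadene `pretrainedmodels` package cited above, the classifier of `se_resnet50(pretrained='imagenet')` is exposed as `model.last_linear`; the helper here assumes that attribute name.

```python
import torch.nn as nn

def adapt_for_ba(model: nn.Module, num_classes: int = 2,
                 p_drop: float = 0.2) -> nn.Module:
    """Swap the 1000-way ImageNet head for a randomly initialized binary
    head, with dropout (rate 0.2) applied only before the last FC layer.
    Assumes the model exposes its classifier as `last_linear`, as the
    Cadene pretrainedmodels package does."""
    in_features = model.last_linear.in_features      # e.g. 2048 for SE-ResNet-50
    model.last_linear = nn.Sequential(
        nn.Dropout(p=p_drop),                        # dropout only here
        nn.Linear(in_features, num_classes),         # randomly re-initialized
    )
    return model
```

All other layers keep their ImageNet-pretrained weights, so only the new head starts from random initialization, matching the description above.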
In addition, besides the random crop mentioned above, horizontal flipping of each training image with probability 0.5 was also used as part of data augmentation during model training (Table 6, first row).
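The horizontal-flip augmentation is a one-liner; a minimal sketch (equivalent to `torchvision.transforms.RandomHorizontalFlip(p=0.5)`):

```python
import torch

def random_hflip(image: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Horizontally flip a (C, H, W) image tensor with probability p."""
    if torch.rand(1).item() < p:
        return torch.flip(image, dims=[-1])   # reverse the width axis
    return image
```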

Supplementary Note 5: Libraries
The image processing libraries used for model training included PyTorch (1.5.1) and torchvision.

Note: # Numbers of training and test images are included in brackets in the second and third columns. 95% confidence intervals are included in brackets in other relevant columns.
*The P values are from the comparison between the AUC of the ensemble deep learning model and the AUCs of the two human experts. Differences between AUCs were compared using the DeLong test.
'AI', artificial intelligence; 'AUC', area under the receiver operating characteristic curve.

Note: # Numbers of training and test patients are included in brackets in the second and third columns. 95% confidence intervals are included in brackets in other relevant columns. *The P values are from the comparison between the AUC of the ensemble deep learning model and the AUCs of the two human experts. Differences between AUCs were compared using the DeLong test.

Supplementary
'AI', artificial intelligence; 'AUC', area under the receiver operating characteristic curve. Note: 95% confidence intervals are included in brackets. *The P values are from the comparison between the AUC of the proposed 5-fold ensemble deep learning model ('Part') and the AUCs of the other models. Differences between AUCs were compared using the DeLong test.

Supplementary Table 3. Consistency assessment of regions of interest between one individual model within the ensemble deep learning model and human radiologists on the external validation dataset.
'AUC', area under the receiver operating characteristic curve; 'PPV', positive predictive value; 'NPV', negative predictive value.