A calibrated deep learning ensemble for abnormality detection in musculoskeletal radiographs

Musculoskeletal disorders affect the locomotor system and are the leading contributor to disability worldwide. Patients suffer chronic pain and limitations in mobility, dexterity, and functional ability. Musculoskeletal (bone) X-ray is an essential tool in diagnosing the abnormalities. In recent years, deep learning algorithms have increasingly been applied in musculoskeletal radiology and have produced remarkable results. In our study, we introduce a new calibrated ensemble of deep learners for the task of identifying abnormal musculoskeletal radiographs. Our model leverages the strengths of three baseline deep neural networks (ConvNet, ResNet, and DenseNet), which are typically employed either directly or as the backbone architecture in the existing deep learning-based approaches in this domain. Experimental results based on the public MURA dataset demonstrate that our proposed model outperforms three individual models and a traditional ensemble learner, achieving an overall performance of (AUC: 0.93, Accuracy: 0.87, Precision: 0.93, Recall: 0.81, Cohen’s kappa: 0.74). The model also outperforms expert radiologists in three out of the seven upper extremity anatomical regions with a leading performance of (AUC: 0.97, Accuracy: 0.93, Precision: 0.90, Recall:0.97, Cohen’s kappa: 0.85) in the humerus region. We further apply the class activation map technique to highlight the areas essential to our model’s decision-making process. Given that the best radiologist performance is between 0.73 and 0.78 in Cohen’s kappa statistic, our study provides convincing results supporting the utility of a calibrated ensemble approach for assessing abnormalities in musculoskeletal X-rays.

In this study, we present a new calibrated ensemble approach based on the aforementioned three deep neural networks for the task of detecting musculoskeletal abnormalities. Our model was motivated by the individualized proficiencies the three models displayed during the model training phase. Specifically, the three baseline models tended to have similar predicative power on overall studies (i.e., scans of all body parts), however, their efficacies differed significantly depending on the target anatomical region. In practice, such information could be overlooked when the overarching goal is high overall performance. To capitalize on the advantages of each individual model, we propose a calibrated ensemble approach trained using different deep architectures for different body parts. We compare the performance of our model to that of the three baseline models and a traditional meta-learner consisting of ResNet+DenseNet. Experimental results based on the MURA dataset 2 demonstrate that our proposed model consistently outperforms all four models across five performance evaluation metrics. The model also outperforms expert radiologists in three out of the seven upper extremity anatomical regions.
Model interpretability has always been a limiting factor for DL-based approaches. In the medical domain, the transparency of a model's decision making process is critical in validating the model, and is often a prerequisite before its clinical deployment. To facilitate our model's interpretability, we resorted to the class activation map (CAM) 10 , a technique which can be used to visualize the regions used by the model in making its predictions. In the "Class activation map (CAM)" section, we illustrate effective localizations of abnormal musculoskeletal regions using CAMs.

Related work
Applying machine learning techniques to identify musculoskeletal disorders is an active area of research [11][12][13][14] . In earlier studies, Cao et al. presented a generalized bone fracture detection method using features extracted from bone X-ray images and a novel discriminative learning framework called the Stacked Random Forests Feature Fusion 15 . Their method is able to capture 81.2% of the fracture findings reported by radiologists. Boissoneault et al. 16 investigated the correlations between brain abnormalities and chronic musculoskeletal pain conditions that can be used for illness classification. The authors applied machine learning techniques to separate brain MRI-based biomarkers of chronic pain patients from healthy controls with high accuracy (70-92%).
Recent advances in computer vision and the increased availability of large radiological datasets have enabled deep learning models to achieve performance comparable to medical professionals in a wide variety of musculoskeletal diagnosis tasks. Existing DL-based approaches can be further classified into three categories depending on the type of backbone neural network adopted in the model architecture. The first category consists of approaches based on vanilla convolutional neural networks (ConvNet) [17][18][19][20][21][22] . ConvNet is arguably the most popular underlying structure adopted by researchers, attributing to the highly successful VGG 7 model best known for its proficiency in extracting image features. The other two categories are ResNet-based [23][24][25] and DenseNet-based approaches 2,23,26 .
It is worth noting that most of the existing studies are extensions of the three aforementioned base deep learning structures. For example, Saif et al. proposed a capsule network addressing the limitations of ConvNet when aggregating large amounts of data is not possible 21 . Although some of these studies compared the performance of models with different backbone networks, the studies are mostly limited to one or a few particular types of anatomical regions. For example, Mondol et al. compared the performance of ConvNet, ResNet, and some ensemble methods on four types (wrist, humerus, finger and elbow) of studies 25 . Tiulpin et al. applied a deep Siamese ConvNet to automatically score knee osteoarthritis severity according to the Kellgren-Lawrence grading scale 22 . In our study, we evaluate the performance of these three backbone deep neural networks on overall and seven standard upper extremity regions.
In addition to designing individual deep learning models, researchers have also explored the ensemble technique 27 in search of more robust and superior performance 25,[28][29][30] . In a recent study, Jones et al. introduced an effective deep-learning system to detect fractures across the musculoskeletal system using an ensemble of ten ConvNets 28 . Compared to Jones et al. 's study, our research bears a similar mission but is different in three aspects. First, our work is more comprehensive and challenging because the types of disorder in our study are not limited to fractures. The additional abnormalities present in our dataset include hardware, joint diseases, lesions, and subluxations 2 . Second, our model is a heterogeneous ensemble of different deep networks, whereas their system is a homogeneous ensemble of the same type of neural network, i.e., ConvNets. Last, we propose a new ensemble methodology that employs different learners for different body parts, whereas their algorithm is a traditional ensemble in which each learner is trained using the entire dataset. Performance-wise, depending on anatomical regions, Jones et al. report AUC scores ranging from 0.888 to 0.98. Our approach achieves comparable efficacy with AUC scores ranging from 0.87 to 0.97 in a more comprehensive study. We introduce our new calibrated ensemble learner in the "Methods" section.

Contributions
The main contribution of this work is a new ensemble learning approach that capitalizes the strength of individual learners in building predictive models. In the clinical setting, the resulting models can be interpreted as experts, each of them specializes in a particular type of patient or abnormality. Our method can be applied to any dataset with distinct data subgroups.
Another contribution of our work is presenting a use case of the CAM technique to alleviates the "black box" limitation of neural network models. Our findings demonstrate the effectiveness of the CAM, which helps to build trust towards adopting DL-based automatic methods in clinical practices.  Table 1 presents the statistics of the number of positive (abnormal) and negative (normal) images in each type of study. Each image was manually labeled as normal or abnormal by board-certified radiologists from the Stanford Hospital at the time of the clinical acquisition of the radiography 2 . The first row in Fig. 1 presents sample normal and abnormal images in the MURA dataset. In addition, the types of abnormalities present in the dataset were studied by reviewing the radiologist reports and manually labeling 100 abnormal images, with the following finding: 53 studies were labeled with fractures, 48 with hardware, 35 with degenerative joint diseases, and 29 with other miscellaneous abnormalities, including lesions and subluxations 2 .
Data preprocessing. The images in the MURA dataset are of different sizes. The width ranges from 89 to 512, and the height ranges from 132 to 512. We resized them to standard 256 by 256 grayscale images to facilitate the training of our models. The input size was selected experimentally to preserve as much information as possi-  Elbow  1734  2584  272  341  230  235  2236  3160   Finger  1710  2750  258  388  247  214  2215  3352   Forearm  583  1042  78  122  151  150  812  1314   Hand  1287  3549  197  510  189  271  1673  4330   Humerus  514  593  85  80  140  148  739  812   Shoulder  3627  3673  541  538  278  285  4446  4496   Wrist  3489  4993  498  772  295  364  4282  6129   Total  We applied three image augmentation techniques to boost the performance of our deep networks: horizontal flip, vertical flip, and rotation between −30 • and 30 • . A random sequence of one to three of these transformations was applied to each image in the training data. We experimented with other augmentation methods, including random brightness change, which turned out to be less effective. The second row in Fig. 1 presents samples of randomly augmented input images corresponding to the original images in the first row.

Results
We trained our deep learning models using the Adam optimizer with the hyper-parameters set experimentally at 0.9 for the exponential decay rate for the first moment estimates (i.e., β 1 ), 0.999 for the exponential decay rate for the second moment estimates ( β 2 ), and 0.001 for the learning rate (varying learning rates resulted in similar performance).
The large size of our training data (Table 1) makes it challenging to load the entire dataset in memory even with the most state-of-the-art system configuration. To this end, we employed a data generator 32 to dynamically generate the training batches and feed them to the model training procedures. The batch size was set to 32, except for DenseNet for which the batch size was set to 16 due to its large memory consumption during training.
We found it was sufficient to set the training epochs at 100 with an early stopping condition monitoring the convergence of the training loss. With the help of the ModelCheckpoint callback function from the Keras package, the best model was selected as the one with the highest validation set performance among all training epochs. Table 2 presents the total number of parameters for each model and the training time.
Evaluation metrics. We evaluated the performance of our models using five metrics defined as follows: • AUC score (of the ROC curve): A ROC curve 33 displays the trade-off between the True Positive Rate (TPR, or Sensitivity) and the True Negative Rate (TNR, or Specificity) of a classification model at different threshold settings. AUC reveals the capability of a model to separate the positive and negative classes, i.e., the higher the AUC score, the more effective a model is at performing the classification. • Accuracy: the fraction of correctly classified images in the test data.
• Precision: the fraction of correctly classified images among all positive (abnormal) predictions.
• Recall: the fraction of correctly classified images among all positive (abnormal) images in the test data.
• Cohen's kappa coefficient ( κ ): measures inter-rater reliability for categorical items 34 . It is considered to be a more robust measure than a simple percent agreement calculation.
Performance comparison and analysis. Because we applied image augmentation techniques ("Data preprocessing" section) to boost our models' performance, each X-ray study contains a collection of multi-view images from different angles. We define the abnormality probability of an X-ray study to be the average predicted probabilities for the positive class of all images in the study. Final classification decisions are based on a threshold of 0.5 on the average predicted probabilities. Figure 2 presents the AUC scores and the corresponding ROC curves of the five models outlined in the "Methods" section for the overall study. We observe that, except for ConvNet, the other four models have similar proficiencies in detecting abnormal scans. Furthermore, the two ensemble learners hold a marginal advantage over the individual models. Table 3 presents the performance comparison between our five models on overall and individual studies. We group the comparisons by the five performance metrics (Column 1) described in the "Evaluation metrics" section. Each group displays its models' overall and study-type specific performance on the test dataset. Using ConvNet as our baseline model, the numbers in parentheses indicate the percentage change ( ) in a particular evaluation metric for model M over the baseline. Specifically, � = (P M − P ConvNet )/P ConvNet , where P M and P ConvNet are performance scores of model M and ConvNet, respectively. We first examine the overall performance (Column 3). The results are summarized as follows: • The ResNet model outperforms ConvNet in all five measures with 5%, 4%, 5%, 8%, and 12% improvement in AUC, Accuracy, Precision, Recall, and κ respectively. • The DenseNet model outperforms ConvNet in all five measures with 4%, 4%, 4%, 6%, and 10% improvement in AUC, Accuracy, Precision, Recall, and κ respectively. From Column 4-10 of Table 3, we can examine the models' performance on images of various body parts. The results are summarized as follows: • The last four models consistently outperform the baseline ConvNet across all evaluation metrics and specific body part studies. across all categories, suggesting that the ensemble model has more discriminative power and would be a more desirable model compared to individual learners. • Our "Calibrated" model displays the highest performance across all categories and measures. For some studies (e.g., Hand, Forearm, and Finger), we observe significantly improved κ values over the other models. • Our "Calibrated" model achieved κ values above 0.7 across all eight studies. Given that the best radiologist performance is between 73 and 78% depending on the parts of the body imaged 2 , our "Calibrated" model exhibits performance superior to human experts in Elbow, Humerus and Wrist studies.

Class activation map (CAM). The Class Activation Map (CAM) is a technique that explicitly enables neu-
ral networks to have localization ability despite being trained on image level labels 10 . A CAM for a particular class indicates the discriminative region(s) used by the model to make the prediction. Thus, CAMs serve as another method to evaluate the efficacy of convolutional-based models. By extracting the final Conv2D layer in our best performing calibrated model and mapping the corresponding output on each original image, we can highlight the region(s) of an image and visualize our model's attention while making its prediction. Figure 3 presents the CAM regions identified by our model on sample images with high confidence positive scores. We observe that our model correctly pinpointed the problematic regions in all cases with high confidence in its positive prediction.

Discussion
In this study, we evaluated five deep learning approaches for the task of detecting abnormalities using musculoskeletal radiographs. Our experimental results suggest that ResNet and DenseNet have similar advantages over ConvNet in the overall study, however, their performance on different anatomical regions varied significantly. Furthermore, both ensemble approaches are more effective compared to the individual models. The calibrated ensemble approach consistently outperforms the other four models across all five evaluation metrics. For three out of seven specific anatomical regions (i.e., Elbow, Humerus, and Wrist), the calibrated model achieved performance superior to expert radiologists. Two factors could have contributed to the success of our CE algorithm. The first one is the implicit i.i.d. assumption made by all machine learning algorithms, i.e., training and test samples are drawn independently from an identical distribution. This assumption is better enforced when we restrict our data to images from specific anatomical regions, thereby allowing greater potential for success. The second factor is that different deep neural networks can be most effective for different regions due to their customized architectures. The CE algorithm exploited the individual strength of each learner by observing their behavior in the validation data. Our proposed approach can be extended to other applications where unique subgroups are present in the data. Unlike traditional ensemble approaches in which each learner is trained using the entire dataset, the calibrated ensemble would designate a desired learner for each idiosyncratic data component.
One limitation of our approach is the requirement of a pre-defined partition of data subgroups. In our study, there is an unambiguous effective division of data using anatomical regions. In practice, not all datasets exhibit straightforward distinctive data clusters. For future work, we foresee a potential integration with unsupervised clustering algorithms to help discover the underlying subgroups of the data with similar characteristics.
Lastly, we applied the CAM technique to highlight the region(s) essential to the network's decision making process. Such information alleviates the "black box" limitation of DL-based approaches and makes a model's decision process transparent to practitioners. Our findings demonstrate the effectiveness of the CAM, which helps to build trust towards adopting DL-based automatic methods in clinical practices. Table 3. Performance comparison of five models-overall and across different parts of body. Bold numbers indicate the highest performing model(s) for each evaluation metric across eight studies. The numbers in parenthesis indicate the percentage change (P M − P ConvNet )/P ConvNet in a particular evaluation metric for a model M over the baseline ConvNet model. "Res+Dense" is a meta-learner which makes its predictions based on the average of the output probabilities from the ResNet and DenseNet models. "Calibrated" is our proposed ensemble model which trains a designated deep learner for each anatomical region.

Methods
In this section we introduce the five deep learning methods we employed to conduct our experiments. The first three (ConvNet, ResNet, and DenseNet) are models that have been widely adopted by researchers. The other two are meta-learner Res+Dense and our proposed calibrated ensemble learner. We elaborate our method of addressing the data imbalance issue in the "Customized loss function" section.

Convolutional neural network (ConvNet).
ConvNet is a class of deep neural networks designed to analyze visual images. A ConvNet model consists of multiple convolutional layers which serve the roles of extracting image features, ranging from simple patterns (e.g., edges, curves, etc.) to complex figures (e.g., hands, elbows, shoulders, etc.). The subsampling (e.g., maxpooling) layers control the intermediate feature map sizes between the consecutive convolutions. While the convolutional layers act as feature detectors, the fully connected (FC) layers perform classification based on the features extracted by the convolutional layers. Table 4 illustrates the architecture of our ConvNet model. Our model is a modified version of the popular VGGNet [2] architecture. We adjusted the sizes of some convolutional and FC layers in the original VGG model. These customizations were motivated by the large discrepancy between the training and test performance based on the VGG-16 model, which indicated potential overfitting. We further applied batch normalization (BN) after each convolutional layer.

Residual neural network (ResNet). ResNet 35 attracted a great amount of attention after it won all five
main tracks of ILSVRC 36 and CoCo 37 image detection, localization, and segmentation competitions in 2015. In the ILSVRC classification competition, ResNet achieved 3.6% top 5 error (i.e., correct answer not present among top 5 predictions), a performance comparable, if not superior, to human ability.
The most noticeable characteristic of ResNet is its "ultra deep" architecture. Due to the vanishing/exploding gradients problem 38 , a deep learning model is often limited in the total number of convolution layers. ResNet addresses this issue by introducing the "ResNet block" with skip-connections to feed the activation function directly into layers much deeper into the model. As illustrated in Fig. 4a, instead of learning y = M(x) , ResNet is designed to learn the residual function F(x), where M(x) = F(x) + x . Figure 4b presents another type of "ResNet block" 8 capable of accommodating for the desired number of feature maps in each layer. Furthermore, blocks in Fig. 4 adopt the "full pre-activation" design illustrated by He et al. 8 , in which batch normalization (BN) and activation (ReLU) proceed each convolution (Conv2D) layer. Our ResNet consists of a mixture of 25 convolutional blocks of both types.
Dense convolutional network (DenseNet). DenseNet was first introduced in 2017 9 . It extended the idea of skip-connections in ResNet to dense connections. In particular, for each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are fed into all subsequent layers. Since each

Ensemble of ResNet and DenseNet (Res+Dense).
Ensemble learning has been proven to produce improved and more robust performance than a single model 27 . To further evaluate the performance of our proposed model, we compare its performance to a traditional meta-learner Res+Dense which is constructed by combining the outcomes from these two baseline models. Our experimental findings suggest that including ConvNet will not further improve the ensemble model's performance. The Res+Dense learner makes its predictions based on the average of the output probabilistic scores from the ResNet and DenseNet models. Formally, where ŷ is the predicted label for instance x i . p r and p d are the predicted abnormality probabilities for instance x i from the ResNet and DenseNet, respectively. Our meta-learner outperforms standalone classifiers in overall performance across all five evaluation metrics outlined in the "Evaluation metrics" section.
Calibrated ensemble model. In training our deep learning models, we found that different classifiers excel at analyzing studies of different parts of the body. Table 5 presents the performance of three individual models on the validation data evaluated using AUC, accuracy, and Cohen's κ . We observe that the overall performance of ResNet and DenseNet were highly similar, and both models outperformed the ConvNet. However, if we examine the specific anatomical regions, DenseNet showed a noticeable advantage over the Leveraging these findings, we trained a customized ensemble model that would designate a deep learner for each specific type of study. Specifically, for each anatomical region, we selected the desired learner with the best validation performance in at least two out of the three metrics as presented in Table 5. As a result, DenseNet was selected for Elbow and Forearm studies, and ResNet for the remaining categories.
We calibrated our ensemble model by fine-tuning each learner with its corresponding subset of training images. Each model's weights were initialized using the pre-trained weights of its overall model. Experimental results on the test data (Table 3) demonstrate our calibrated model not only achieves the highest performance on individual studies but also leads the performance in the "Overall" category. Furthermore, the model's advantage is consistent across all five evaluation metrics.
Customized loss function. We select cross-entropy (CE) as the loss function for training our models.
However, from Table 1, we observe that the number of positive and negative instances is imbalanced in the majority of the seven body parts studies. Applying standard machine learning algorithms to an imbalanced dataset often leads to unsatisfactory performance on the minority class, which is the more interesting and important class in our task. To address this issue, we modified the standard CE loss function to be a weighted cross-entropy loss such that the weight for class i ∈ {positive, negative} is inversely proportional to the number of instances in class i. For example, when training for a binary classification model for the "Elbow" studies, we have 1,734 and 2,584 instances in the positive and negative classes (Table 1), respectively. As a result, the respective weights for the positive and negative classes are set to be 2584/(2584+1734) and 1734/(2584+1734) in the weighted loss function. Formally,ŷ . , x N } are the training instances • Y = {y 1 , y 2 , . . . y N } are the ground truth labels for the corresponding instances in X • p(y i = v|x i ) is the predictive probability for y i ∈ {0, 1} given instance x i • T = {Elbow, Finger, Forearm, Hand, Humerus, Shoulder, Wrist} contains the study types • |N t | is the number of normal images of study type t ∈ T • |A t | is the number of abnormal images of study type t ∈ T • w t 0 = |A t |/(|N t | + |A t |) is the weight for all normal instances of study type t ∈ T • w t 1 = |N t |/(|N t | + |A t |) is the weight for all abnormal instances of study type t ∈ T.

Data availability
This study is based on the publicly available MURA dataset provided by the Stanford ML Group (reference 2 ).