Data valuation for medical imaging using Shapley value and application to a large-scale chest X-ray dataset

The reliability of machine learning models can be compromised when trained on low quality data. Many large-scale medical imaging datasets contain low quality labels extracted from sources such as medical reports. Moreover, images within a dataset may have heterogeneous quality due to artifacts and biases arising from equipment or measurement errors. Therefore, algorithms that can automatically identify low quality data are highly desired. In this study, we used data Shapley, a data valuation metric, to quantify the value of training data to the performance of a pneumonia detection algorithm in a large chest X-ray dataset. We characterized the effectiveness of data Shapley in identifying low quality versus valuable data for pneumonia detection. We found that removing training data with high Shapley values decreased the pneumonia detection performance, whereas removing data with low Shapley values improved the model performance. Furthermore, there were more mislabeled examples in low Shapley value data and more true pneumonia cases in high Shapley value data. Our results suggest that low Shapley value indicates mislabeled or poor quality images, whereas high Shapley value indicates data that are valuable for pneumonia detection. Our method can serve as a framework for using data Shapley to denoise large-scale medical imaging datasets.

www.nature.com/scientificreports/
[...] outliers and corruptions, while high Shapley value informs the type of new data that should be acquired to most efficiently improve the predictor performance 17. Furthermore, data Shapley has been shown to substantially outperform the leave-one-out (LOO) score 35, a commonly used statistical measure of data importance. Moreover, data Shapley has several advantages as a data valuation framework 17: (a) it is directly interpretable because it assigns a single value score to each data point and (b) it satisfies natural properties of equitable data valuation.
To the best of our knowledge, our work is the first to apply data Shapley to biomedical imaging data. We aim to assess the effectiveness of data Shapley in capturing low quality data as well as informing valuable data in the context of pneumonia detection from chest X-ray images. Our main contributions are as follows. First, we propose a framework (see Fig. 1) to quantify the value of training data in a large chest X-ray dataset in the context of pneumonia detection using data Shapley values. Second, we show that low Shapley value indicates mislabeled examples, while high Shapley value indicates data points that are valuable for pneumonia detection. Finally, our approach can serve as a framework for using data Shapley to clean up large-scale medical imaging datasets.

Results
Overview. In this study, we aim to characterize the effectiveness of data Shapley in identifying low quality and valuable data in ChestX-ray14, a large public chest X-ray dataset whose pathology labels were extracted from X-ray reports using text mining techniques. We sampled 2,000 chest X-rays as the training set to train the data Shapley algorithm and compute the Shapley values, 500 chest X-rays as the validation set to compute the predictor performance score during training, and 610 chest X-rays as the held-out test set to report the final results (see Table 1). We extracted features from a pre-trained convolutional neural network (CNN) called CheXNet 36, and computed the data Shapley value of each training datum with respect to the accuracy of a logistic regression algorithm for pneumonia detection. Furthermore, in collaboration with three radiologists, we evaluated the most valuable and least valuable chest X-rays, and provided qualitative interpretations for their Shapley values.

Table 1. Training, validation and held-out test sets used in our study. All three sets were sampled from the train or validation set of the ChestX-ray14 dataset 9. The training set was used to train the data Shapley algorithm and compute Shapley values, the validation set was used to compute the predictor performance score during training, and the held-out test set was used to report the final results. Because the distribution of pneumonia labels in the ChestX-ray14 dataset is highly imbalanced, we sampled a larger proportion of pneumonia cases in the training set and sampled balanced validation and held-out test sets in this study.

[...] effect on the model performance. This suggests that data Shapley value is a highly accurate measure of a data point's importance since data points with high Shapley values were crucial to pneumonia detection. Interestingly, the 100 most valuable images were all labeled as pneumonia in ChestX-ray14, which might be because the training data were highly imbalanced. As shown in Supplementary Fig. S2a, the percentage of positive training examples dropped rapidly when 10% of the training data were removed, suggesting that most of the positive training data were identified as valuable. We note that after removing more than 20% of the training data, the precision and recall scores for TMC-Shapley values increased slightly (Fig. 2b-c), which might be because the percentage of positive cases increased after 20% of the training data were removed (Supplementary Figure S2a).

Figure 2. Effects of removing high value data points on pneumonia detection performance. We removed the most valuable data points from the training set, as ranked by TMC-Shapley, leave-one-out (LOO) and uniform sampling (random) methods. We trained a new logistic regression model each time 1% of the training data points were removed. The x-axis shows the percentage of training data removed, and the y-axis shows the model performance on the held-out test set in terms of (a) accuracy, (b) precision and (c) recall. Removing the most valuable data points identified by the TMC-Shapley method decreased the model performance more than using LOO or randomly removing data. We note that after removing more than 20% of the training data, the precision and recall scores for TMC-Shapley values increased slightly, which might be because the percentage of positive cases increased after 20% of the training data were removed (see Supplementary Figure S2a).

Conversely, we removed data points from the least valuable to the most valuable as shown in Fig. 2d-f. Importantly, removing data points that have low Shapley values improved the model performance, suggesting that low Shapley value data points actually harm the model performance. In contrast, removing data points that have low LOO values or randomly removing data points did not affect the prediction accuracy or recall. We note that removing low Shapley value data points had a smaller effect on the precision score (Fig. 2e), which might be because of the imbalanced training set. Besides the TMC-Shapley approximation method, we also explored a faster approximation, G-Shapley 17, and obtained comparable results (see Supplementary Figure S3). Moreover, we experimented with features extracted from a different CNN (ResNet-50 37), a different learning algorithm (multilayer perceptron) or different validation and held-out test sets (see Methods), and obtained highly correlated Shapley values for the training data (see Supplementary Table S1) and comparable results when training data were removed starting from the most (or least) valuable data points (see Supplementary Figures S4-S9).

Low Shapley value indicates mislabels in the dataset.
We asked three radiologists to re-label the 100 most valuable, 100 least valuable and 100 randomly sampled chest X-ray images in the training set. Supplementary Figure S10 shows the histograms of the Shapley values in these three sets of images. Next, we took the majority vote of the three radiologists' labels as the final label for each of these 300 images. Table 2 summarizes the radiologists' labels for the 300 images, and Supplementary Table S2 shows the inter-reader agreement. If the radiologists did not reach an agreed label for an image, we excluded the image from further analyses (see Supplementary Table S2).
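The majority-vote step can be sketched as follows. This is a minimal illustration, not the study's code: the label strings and the function names are hypothetical, and treating an "unsure" majority as "no agreed label" is an assumption, since the text does not say how such cases were handled.

```python
from collections import Counter

# Hypothetical label strings; the text only names the three choices.
PNEUMONIA, NO_PNEUMONIA, UNSURE = "pneumonia", "no pneumonia", "unsure"


def majority_label(reads):
    """Return the label that at least two of three readers agree on, else None.

    Assumption: a majority of "unsure" reads counts as no agreed label.
    """
    label, count = Counter(reads).most_common(1)[0]
    if count >= 2 and label != UNSURE:
        return label
    return None


def is_mislabeled(original_label, reads):
    """True/False when a final label exists; None when the image is excluded."""
    final = majority_label(reads)
    if final is None:
        return None
    return final != original_label


# Two of three readers disagree with the dataset label, so it counts as a mislabel.
print(is_mislabeled(PNEUMONIA, [NO_PNEUMONIA, NO_PNEUMONIA, PNEUMONIA]))
```

Returning `None` for excluded images keeps them out of both the numerator and the denominator of any mislabel rate computed downstream.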
Table 2 supports two important observations. First, there were many more mislabels in low value images (65) than in high value images (22, pairwise p = 8.61e-10) or randomly sampled images (20, pairwise p = 1.22e-10). In particular, 13 low value images were mislabeled as pneumonia, and 52 were mislabeled as no pneumonia. This suggests that low Shapley value effectively captures mislabels in the dataset. Second, despite having high Shapley values, 22 images were mislabeled as pneumonia, which suggests that there might be other factors contributing to their high values (see further analyses in the next sections). Note that no high value image was mislabeled as no pneumonia because all of the 100 most valuable images were associated with pneumonia in the dataset.
In addition, Supplementary Figure S11 visualizes the cumulative number of mislabels as data points were inspected in descending, ascending and random order of their Shapley values for the 100 most valuable, 100 least valuable and 100 randomly sampled images, respectively. Low value images had a much steeper slope of accumulated mislabels than high value or randomly sampled images. In contrast, high value and randomly sampled images had a similar number of mislabels, and their slopes of accumulated mislabels were similar.
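A cumulative mislabel curve of this kind can be computed as in the sketch below. The function name and the toy values are made up for illustration and are not taken from the study's data.

```python
from itertools import accumulate


def cumulative_mislabels(shapley_values, mislabeled, ascending=True):
    """Running mislabel count when inspecting images in order of Shapley value."""
    order = sorted(range(len(shapley_values)),
                   key=lambda i: shapley_values[i],
                   reverse=not ascending)
    return list(accumulate(int(mislabeled[i]) for i in order))


# Toy, made-up values: the two most negative scores are the mislabeled images.
values = [-0.03, 0.02, -0.01, 0.05, -0.02]
flags = [True, False, False, False, True]

print(cumulative_mislabels(values, flags, ascending=True))   # least valuable first
print(cumulative_mislabels(values, flags, ascending=False))  # most valuable first
```

A steep early rise in the ascending curve is what one would expect if low values concentrate the mislabels.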

Heatmaps suggest contradictions between feature vectors and labels in low value images.
To better understand the factors contributing to low Shapley values, we visualized the most salient local regions in the 65 mislabeled low value chest X-ray images (see Methods). Figure 3a, b show example heatmaps of low value chest X-rays that were mislabeled as pneumonia (Fig. 3a) or no pneumonia (Fig. 3b).
In Fig. 3a, the heatmaps had low activations in relevant areas in the lung but high activations in irrelevant areas outside the lung, which suggests that the corresponding feature vectors favored no pneumonia over pneumonia for these images. Unsurprisingly, this pattern existed in the heatmaps of 12 (out of 13) low value images that were mislabeled as pneumonia.
In contrast, the heatmaps in Fig. 3b had high activations in areas that indicate pneumonia, suggesting that their corresponding feature vectors favored pneumonia over no pneumonia. This pattern existed in the heatmaps of 45 (out of 52) low value images that were mislabeled as no pneumonia.
Therefore, the contradictions between feature vectors and labels likely resulted in these 65 mislabeled images being assigned low values.

Table 2. Three radiologists' relabeling results for the 100 least valuable, 100 most valuable and 100 randomly sampled chest X-ray images in the training set. We used the majority vote to obtain the final label of each image. Images on which the radiologists disagreed were excluded from further analyses. There were many more mislabeled examples in low value images (65) than in high value images (22, pairwise p = 8.61e-10) or random images (20, pairwise p = 1.22e-10), suggesting that low Shapley value effectively captures mislabels in the dataset. Footnote a: p values computed using the χ² test. Footnote b: since our training set has a higher percentage of pneumonia labels, the mislabel rates may not be representative of the entire ChestX-ray14 dataset.

[...] (Table 2), we looked for other types of abnormalities in these images, and compared them to the 65 mislabeled low value images. The results are summarized in Table 3. Among the 22 mislabeled high value images, 10 images showed abnormal opacity (i.e. an important indicator for pneumonia) in the lung. Furthermore, the remaining 12 images showed other kinds of abnormalities to varying degrees, including interstitial patterns and mass. Figure 3c shows example heatmaps of mislabeled high value images, where high activations correspond to abnormal areas. In contrast, only 2 of the 13 low value images that were mislabeled as pneumonia had abnormalities, while the remaining 11 images were completely normal. Moreover, among the 52 low value images that were mislabeled as no pneumonia, 50 images showed abnormal opacity in the lung. See Supplementary Figure S12 for examples of abnormalities in the mislabeled images.

Table 3. Number of mislabeled images that showed abnormalities among the 100 most valuable and 100 least valuable chest X-ray images. All high value images that were mislabeled as pneumonia showed abnormalities to varying degrees. In contrast, only 2 out of 13 low value images that were mislabeled as pneumonia showed abnormalities. Moreover, 50 out of 52 low value images that were mislabeled as no pneumonia showed abnormal opacity. Footnote a: p value computed using the χ² test.

Since there were relatively few true pneumonia cases in the training set, images mislabeled as pneumonia that showed abnormalities were still useful for the pneumonia detection algorithm. This was reflected in their Shapley values and could explain why the 22 mislabeled images had high Shapley values.

Relation between image quality and Shapley values.
In addition to mislabels, we also examined whether there was a relation between image quality and Shapley values. The radiologists reported that all of the 300 chest X-ray images were of diagnostic quality. However, there were eight images where a portion of the lung field was out of the image frame (see Supplementary Figure S13). Among these eight images, six had negative Shapley values regardless of whether they were correctly labeled, whereas the other two had positive Shapley values and were correctly labeled. This suggests that low Shapley values indicate not only mislabels but also poor image quality.

Discussion
Unlike prior studies that implicitly handle noisy labels or images in model development or training stages, our method directly identifies low value data points regardless of the reason for their poor quality. In addition to low value data, our method also informs us of data points that are valuable to pneumonia detection. Importantly, all of the 100 most valuable chest X-rays were labeled as pneumonia in ChestX-ray14. Since our training set is highly imbalanced (pneumonia versus no pneumonia ratio = 1:9), it is unsurprising that these rare positive data points are more important for pneumonia detection than the abundant negative data points.
Among the 100 chest X-ray images randomly sampled from our training set, 20 images (i.e. 20%) were mislabeled in ChestX-ray14 (see Table 2). Although this percentage might not represent the mislabeling rate in the entire ChestX-ray14 dataset, it suggests that a large portion of images in the dataset might not be reliably labeled. By visually inspecting images in ChestX-ray14, a recent study has shown that the pneumonia labels in the dataset only have 50% positive predictive value (PPV), lower than the 66% PPV reported in the original documentation of the dataset 12. Our study provides further evidence of label quality problems in ChestX-ray14. Since there are no gold-standard labels in ChestX-ray14, we suggest that users should be cautious about interpreting results from machine learning models evaluated on ChestX-ray14 data.
In our experiments, 22 images that were mislabeled as pneumonia were assigned high Shapley values (see Table 2). There are two possible explanations for this unexpected result. First, our analyses have shown that all of these images contain abnormalities to varying degrees. Hence, the feature vectors extracted from the pre-trained CheXNet might not sufficiently distinguish non-pneumonia abnormalities from pneumonia abnormalities, and using representations from a better pre-trained model might mitigate this problem. Second, the logistic regression algorithm used for training data Shapley might have confused image features seen in non-pneumonia cases with those in pneumonia cases, and thus a more complex learning algorithm might be needed.
There are several limitations in our study, and we hope to address them in future studies. First, we did not use the full ChestX-ray14 dataset due to limited computational resources. In future work, we can predict the Shapley values of the rest of the data points in ChestX-ray14 using the recently developed distributional Shapley framework 18. Distributional Shapley uses the data Shapley values that we have already computed and learns to predict the value of new data points based on the fact that all data points come from the same underlying distribution. Hence, distributional Shapley will allow us to predict Shapley values of new data points prospectively. Second, since the ChestX-ray14 dataset has extremely imbalanced class labels, we sampled a larger percentage of pneumonia cases in our training, validation and held-out test sets (see Table 1). In order to accurately compute data Shapley values in imbalanced datasets, methods to explicitly account for class imbalance are needed. Third, because many of the abnormal chest X-rays show more than one thoracic disease, it was hard for the model and the radiologists to tell for certain whether there was underlying pneumonia. Future work on predicting multi-class labels might better capture low quality data. Lastly, to show that our method can be used as a general approach for cleaning up massive medical datasets, validation on other medical data types is required.
Our approach has several important advantages. First, it allows us to have cleaner datasets and improves the performance of pneumonia detection. Second, it provides us insights into what types of chest X-rays are useful or harmful for pneumonia detection, which can inform us of the type of data to acquire to most efficiently improve the model performance. Third, our analyses suggest that not all mislabeled images are harmful: some images that were originally mislabeled in ChestX-ray14 are still useful for pneumonia detection. Fourth, our method is flexible and can be generalized to multi-label classification problems. In future work, we can extend our method to multi-label classification by (a) training data Shapley using a multi-label performance score (e.g. micro F1-score), or (b) treating each class as a binary classification problem and summing up the Shapley values for the individual classes. Finally, our method can be easily extended to other medical data types such as time-series signals and electronic health records. For instance, one promising future direction is to use data Shapley values to prioritize which data clinicians should analyze and thus help accelerate clinical workflows.
In conclusion, we used data Shapley to quantify the value of training data in a large-scale chest X-ray dataset in the context of pneumonia detection. We provided quantitative and qualitative analyses of the data Shapley values. We showed that low Shapley value indicates mislabeled and low quality images in the dataset, while high Shapley value indicates data points that are valuable for pneumonia detection. Our results are likely generalizable to other use cases, and our approach can serve as a framework for using data Shapley to denoise massive medical imaging datasets to improve the reliability of machine learning algorithms trained on such datasets.

Methods
Data. In this study, we used ChestX-ray14 9, a large public chest X-ray dataset whose pathology labels were text-mined from X-ray reports. We chose ChestX-ray14 because it is known to have inaccurate labels 12. As a proof of concept, we used 2,000 chest X-rays in ChestX-ray14 as the training set to train the data Shapley algorithm and compute the Shapley values, 500 chest X-rays as the validation set to compute the predictor performance score during training, and 610 chest X-rays as the held-out test set to report the final results. All of the 3,110 chest X-rays were sampled from the train or validation set of ChestX-ray14. Since only 1% of the chest X-rays in the dataset are associated with pneumonia, we sampled the training set with a larger proportion of pneumonia cases. In addition, for ease of interpretation of the results (e.g. prediction accuracy, precision and recall), we sampled balanced validation and held-out test sets (see Table 1).

Quantifying the value of chest X-rays with data Shapley. An overview of our method to compute data Shapley values for the chest X-rays is shown in Fig. 1. First, we extracted features from the last average-pooling layer (i.e. before the last fully connected layer) of a pre-trained CNN called CheXNet 36, which resulted in a 1,024-dimensional feature vector for each chest X-ray. Next, we applied the TMC-Shapley 17 algorithm to calculate the value of each chest X-ray for pneumonia detection. Specifically, we denote the training data as $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $n$ was the size of the training set, $x_i \in \mathbb{R}^{1024}$ was the feature vector, and $y_i \in \{0, 1\}$ was the pneumonia label (0 for no pneumonia and 1 for pneumonia). We used logistic regression as the supervised learning algorithm and prediction accuracy as the performance metric $V$. As a comparison to data Shapley, we also computed the leave-one-out (LOO) value 35 for each training datum.
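The truncated Monte Carlo (TMC) estimation described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the study's implementation: the nearest-centroid learner stands in for logistic regression so that the example needs only the standard library, the toy 2-D points stand in for the 1,024-dimensional CheXNet features, and the function names, the parameters `iterations` and `tolerance`, and the score of 0.5 for an empty training set are all hypothetical choices.

```python
import random


def accuracy(train, val):
    """Validation accuracy of a nearest-centroid classifier fit on `train`.

    Stand-in learner (assumption); the study used logistic regression.
    Each item is (feature_tuple, label) with label in {0, 1}.
    """
    if not train:
        return 0.5  # assumed score of an untrained model on a balanced set
    centroids = {}
    for label in {y for _, y in train}:
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(x, centroids[c])))

    return sum(predict(x) == y for x, y in val) / len(val)


def tmc_shapley(train, val, iterations=200, tolerance=0.01, seed=0):
    """Truncated Monte Carlo estimate of each training point's Shapley value."""
    rng = random.Random(seed)
    n = len(train)
    full_score = accuracy(train, val)
    values = [0.0] * n
    for t in range(1, iterations + 1):
        perm = rng.sample(range(n), n)  # random permutation of the indices
        prev_score = accuracy([], val)
        subset = []
        for idx in perm:
            subset.append(train[idx])
            if abs(full_score - prev_score) < tolerance:
                new_score = prev_score  # truncate: remaining gains treated as zero
            else:
                new_score = accuracy(subset, val)
            # Running mean of this point's marginal contributions.
            values[idx] += ((new_score - prev_score) - values[idx]) / t
            prev_score = new_score
    return values


# Toy data: two well-separated clusters; the last training point is
# deliberately given the wrong label.
train = [((0.0, 0.1), 0), ((0.2, -0.1), 0), ((-0.2, 0.0), 0),
         ((4.0, 4.1), 1), ((4.2, 3.9), 1), ((3.8, 4.0), 1),
         ((4.1, 4.0), 0)]
val = [((0.1, 0.0), 0), ((-0.1, 0.1), 0), ((3.9, 4.0), 1), ((4.1, 4.1), 1)]

for point, value in zip(train, tmc_shapley(train, val)):
    print(point, round(value, 4))
```

Without truncation (`tolerance=0.0`), the marginal contributions in each permutation telescope, so the estimated values sum to the full-data score minus the empty-set score. That gives a quick sanity check on any implementation.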
After calculating data Shapley values, we removed the most (least) valuable data points from the training set, as ranked by Shapley values, LOO values and uniform sampling (i.e. random). We trained a new logistic regression model every time when data points were removed. Moreover, we investigated the 100 least valuable and the 100 most valuable chest X-rays in the training set based on their Shapley values. In addition, we randomly sampled another 100 chest X-rays from the training set as a comparison to the 100 most valuable and 100 least valuable images.
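The removal experiment can be sketched as a loop that drops points in order of value and re-scores after each step. The function name, the `score_fn` callable (which would retrain the model on the kept points and return a held-out metric) and the `max_fraction` cutoff are hypothetical; only the 1% step is taken from the text.

```python
def removal_curve(train, values, score_fn, step_fraction=0.01,
                  max_fraction=0.5, most_valuable_first=True):
    """Score the model after removing points in order of their value.

    `score_fn(kept)` is assumed to retrain on `kept` and return a metric.
    Returns a list of (fraction_removed, score) pairs.
    """
    order = sorted(range(len(train)), key=lambda i: values[i],
                   reverse=most_valuable_first)
    step = max(1, round(step_fraction * len(train)))
    curve = []
    for removed in range(0, int(max_fraction * len(train)) + 1, step):
        dropped = set(order[:removed])
        kept = [row for i, row in enumerate(train) if i not in dropped]
        curve.append((removed / len(train), score_fn(kept)))
    return curve


# Toy check with a placeholder "metric" that just shows which points remain.
points = ["a", "b", "c", "d"]
toy_values = [0.3, -0.1, 0.2, 0.0]
print(removal_curve(points, toy_values, "".join, step_fraction=0.25))
```

Passing the same `score_fn` with LOO values or a random ranking in place of `values` would give the comparison curves.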

Effects of pre-trained CNNs, learning algorithms and validation/held-out test sets. We performed three experiments to investigate the effects of a different pre-trained CNN, learning algorithm or validation and held-out test set. First, we kept the learning algorithm (i.e. logistic regression) and performance metric (i.e. prediction accuracy) the same, but used features extracted from a different pre-trained CNN, ResNet-50 37. In this case the input vector $x_i \in \mathbb{R}^{2048}$, because there are 2,048 output neurons in the layer before the last fully-connected layer of ResNet-50. Second, we kept the feature vectors and the performance metric the same, but applied a multi-layer perceptron with one hidden layer and 20 hidden units as the supervised learning algorithm. Lastly, we kept the input features, the learning algorithm and the performance metric the same, but sampled a different validation set (n = 500; 49.0% pneumonia) and a different held-out test set (n = 610; 50.8% pneumonia). In these experiments, we kept the same training set in order to compare the Shapley values. We did not vary the performance metric because our main focus in this study is the accuracy of pneumonia detection.

Investigating relations between data Shapley values and input labels.
We investigated the relations between the Shapley values and the labels. We asked three radiologists (R.Y., S.R. and D.L.R.) to re-label 300 chest X-rays (i.e. 100 least valuable, 100 most valuable and 100 randomly sampled from the training set). The radiologists were blinded to the original labels and the Shapley values while labeling the chest X-rays. For each chest X-ray, the radiologists had the choice of labeling it as pneumonia, no pneumonia or unsure.
In order to further evaluate mislabeled images and their Shapley values, we asked the three radiologists to re-evaluate the 100 most valuable and the 100 least valuable chest X-ray images, and to determine whether these images had other types of abnormalities (e.g. opacity) or were completely normal.
Visualizing heatmaps for chest X-rays. To better understand the mislabeled chest X-ray images, we visualized heatmaps that show the local regions in the images leading to the prediction of pneumonia 39. Specifically, we fed a chest X-ray image into the same pre-trained CheXNet 36 and took the output of the final convolutional layer as the feature maps. Let $f_k \in \mathbb{R}^{7 \times 7}$ be the k-th feature map and $w_k$ be the final classification layer weight for the k-th feature map. Taking the weighted sum of the feature maps and the corresponding weights, we obtained a heatmap $M \in \mathbb{R}^{7 \times 7}$ that shows the most salient features for the prediction of pneumonia. Mathematically, the heatmap $M$ was computed as follows:

$$M = \sum_{k} w_k f_k$$

Lastly, we overlaid the heatmap with the original image, which allowed us to visualize the most salient local regions in the image leading to the prediction of pneumonia.
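The weighted sum of feature maps can be written directly, as in the sketch below. It uses plain nested lists and tiny 2×2 maps for readability; in the setting described above each map would be 7×7. The function name is hypothetical, and the upsampling and overlay steps are omitted.

```python
def class_activation_map(feature_maps, weights):
    """Weighted sum of K feature maps (each H x W) using K classifier weights."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for f_k, w_k in zip(feature_maps, weights):
        for r in range(height):
            for c in range(width):
                heatmap[r][c] += w_k * f_k[r][c]
    return heatmap


# Toy example with K = 2 made-up feature maps.
maps = [[[1, 0], [0, 1]],
        [[0, 2], [2, 0]]]
weights = [0.5, -1.0]
print(class_activation_map(maps, weights))
```

Cells with large positive values are the regions that push the classifier toward pneumonia; negative cells push against it.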
Statistical tests. χ² tests were used to compare the proportions of mislabels in the three sets of chest X-rays (i.e. 100 most valuable, 100 least valuable and 100 randomly sampled), as well as the proportions of abnormalities in mislabeled images among the 100 most valuable and 100 least valuable chest X-rays.
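A pairwise comparison of mislabel proportions can be sketched with a 2×2 Pearson χ² test. The sketch assumes 100 images per group and no continuity correction; both are assumptions, because the text notes that images without an agreed label were excluded and does not state the exact test settings. The function name is hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Mislabeled vs. correctly labeled counts, assuming 100 images per group.
low_vs_high = chi2_2x2(65, 35, 22, 78)
low_vs_random = chi2_2x2(65, 35, 20, 80)
print(low_vs_high)
print(low_vs_random)
```

Under these assumptions both p values come out on the order of 1e-10, in line with the pairwise values reported in Table 2.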

Data availability
The ChestX-ray14 dataset 9 used for this study is publicly available at https://nihcc.app.box.com/v/ChestXray-NIHCC. In addition, we provide the radiologists' labels of the 300 chest X-rays and their corresponding TMC-Shapley values as supplementary information.