A Curriculum Learning Strategy to Enhance the Accuracy of Classification of Various Lesions in Chest-PA X-ray Screening for Pulmonary Abnormalities

We evaluated the efficacy of a curriculum learning strategy using two steps to detect pulmonary abnormalities including nodule[s], consolidation, interstitial opacity, pleural effusion, and pneumothorax with chest-PA X-ray (CXR) images from two centres. CXR images of 6069 healthy subjects and 3417 patients at AMC and 1035 healthy subjects and 4404 patients at SNUBH were obtained. Our approach involved two steps. First, the regional patterns of thoracic abnormalities were identified by initial learning of patch images around abnormal lesions. Second, Resnet-50 was fine-tuned using the entire images. The network was weakly trained and modified to detect various disease patterns. Finally, class activation maps (CAM) were extracted to localise and visualise the abnormal patterns. For average disease, the sensitivity, specificity, and area under the curve (AUC) were 85.4%, 99.8%, and 0.947, respectively, in the AMC dataset and 97.9%, 100.0%, and 0.983, respectively, in the SNUBH dataset. This curriculum learning and weak labelling with high-scale CXR images requires less preparation to train the system and could be easily extended to include various diseases in actual clinical environments. This algorithm performed well for the detection and classification of five disease patterns in CXR images and could be helpful in image interpretation.


Results
Test data comprised 20% of the total dataset and included 1423 normal individuals and 1549 patients, including 394 with nodules, 282 with consolidation, 286 with interstitial opacity, 465 with pleural effusion, and 253 with pneumothorax. Considering the healthy subjects, serious data imbalance problems could occur and confuse accurate evaluation of the presence or absence of each disease. Therefore, six performance metrics were calculated separately for each disease class: area under the curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). In addition, the performance of the model in terms of abnormal screening was evaluated by combining the disease classes. Comparison with baseline model. To assess the effectiveness of the curriculum strategy, a curriculum learning-based model was compared with a baseline model that was trained directly using entire images. Both models were found to be sufficiently trained and to converge on the validation set ( Fig. 1). Because the curriculum learning-based model preceded training on the patch images, the convergence was rapid and stable and attained better local minima. Furthermore, the loss and accuracy of the Asan Medical Center (AMC) dataset were 0.116 and 96.1% in baseline and 0.099 and 97.2% in curriculum learning, respectively. The loss and accuracy of the Seoul National University Bundang Hospital (SNUBH) dataset were 0.210 and 92.7% in baseline and 0.181 and 94.2% in curriculum learning, respectively, as shown in Table 1.
As shown in Table 2, the performances of the curriculum learning-based model showed better overall results. The AUCs of abnormalities in the two centres on the curriculum learning-based model were 99.0% and 100.0%, respectively, and those on the baseline model were 96.7% and 99.9%, respectively. The disease patterns with the most difference between the two algorithms in terms of the AUC in the datasets of the two centres were nodule[s] and consolidation. Furthermore, we conducted the paired t-test between the curriculum learning-based model and baseline model. These results show that the performance of the curriculum learning-based model was significantly better in Table 2.
Performance evaluation. The performance of the curriculum learning-based model on the test set was evaluated using weight parameters at the minimum loss point on the validation set. The six metrics were determined separately for the datasets from each of the hospitals ( Table 2). The sensitivity and accuracy of pneumothorax was the highest in the datasets of the two centres with sensitivities of 99.9% and 99.2%, respectively, and with accuracies of 99.3% and 98.3%, respectively. The specificity and AUC of pleural effusion were the highest in the AMC dataset (94.8% and 99.3%), with interstitial opacity having the highest specificity and AUC in the SNUBH dataset (96.6% and 99.8%). The sensitivity and AUC of consolidation were 98.1% and 89.3%, the lowest in the AMC dataset, and those of the nodule[s] were 89.9% and 82.7% in the SNUBH dataset, respectively. Consolidation had the lowest specificity of 54.6% and 62.6% in the datasets of the two centres, respectively. Additionally, pneumothorax had the highest PPV (99.3%) in AMC and NPV (96.2%) in the SNUBH dataset.  In addition, we performed statistical analyses on the size of the nodule[s]. The size of the nodule was divided into three categories-3 cm or less, 3-5 cm, and 5 cm or more. As shown in Table 3, the proposed curriculum learning-based method showed better performance than the baseline model for each nodule size. Nodule[s] that were 3 cm or less in size in the AMC dataset showed the best AUC, and the AUC of nodule[s] whose size was over 5 cm in the SNUBH dataset was the best. The accuracies of the three size categories were 90.2%, 91.9%, and 90.5% in the AMC dataset, respectively, and 74.7%, 77.9%, and 79.65% in the SNUBH dataset, respectively.
For more detailed evaluation, the variables embedded in the highest layer of the network were visually evaluated after two-dimensional reduction using the t-distributed stochastic neighbour embedding (t-SNE) technique 6 . Figure 2 shows the formation of manifolds according to the disease, the hospital, and the manufacturer of the X-ray machine. The formation of manifolds for each disease and hospital was visualized to assess whether the model correctly trained each disease pattern (Fig. 2). The manifold for each disease was found to be well formed (Fig. 2a). In contrast, the manifolds according to the hospital were found to have a slightly different marginal distribution, as shown in (Fig. 2b). To the best of our knowledge, this phenomenon could be caused by various factors such as the manufacturer of the X-ray machine, radiation exposure time, and image reconstruction algorithm. The most prominent of these factors was the manufacturer, and the formation of the manifold according to the manufacturer was confirmed, as shown in (Fig. 2c). The AMC and SNUBH datasets consisted primarily of data from GE and Philips, respectively, which allowed us to identify different marginal distributions according to the manufacturers. Figure 3 shows the classification results and localization using class activation maps (CAM) for five lung diseases subjected to curriculum learning. Although the CAM results generally coincided with the AUC results drawn by thoracic experts, our CAM algorithms yielded misclassifications in several patients with nodule[s] or consolidation (Fig. 4). Although localization of the disease pattern was visually appropriate, some nodules were classified as consolidations and some consolidations as nodule[s]. www.nature.com/scientificreports www.nature.com/scientificreports/

Discussion
In this study, we evaluated the efficacy of a two-step curriculum strategy for detecting pulmonary abnormalities on CXR images from two hospitals and found that curriculum learning could guide the model toward a better local minimum. The curriculum learning-based model was investigated using the minimum loss point on the validation set to assess weight parameters on the test set. In general, large disease patterns including interstitial opacity, pleural effusion and pneumothorax showed better performances in this study. Especially, pneumothorax had the highest sensitivity and accuracy in the datasets of the two centres. Furthermore, PPV and NPV of pneumothorax were the best. The specificity of pleural effusion was the highest in the AMC dataset, whereas the specificity of interstitial opacity was the highest in the SNUBH dataset. The AUC of pleural effusion was the highest among other disease patterns. Otherwise, the sensitivity of consolidations was the lowest in the AMC dataset, whereas the specificity of nodule[s] was the lowest in the SNUBH dataset. In addition, the specificity of consolidation was the lowest in the datasets of the two centres and the lowest accuracy in the AMC dataset. Nodule[s] had the lowest accuracy in the SNUBH dataset. Additionally, PPV and NPV results showed better results in the other three lesions than both nodule[s] and consolidation. As shown in Table 2, the curriculum learning-based model showed significantly better overall results. Specifically, the AUCs of abnormalities in the datasets of the two centres on the curriculum learning-based model showed better overall results. Nodule[s] and consolidation showed the most differences between the two algorithms in terms of AUC.
In addition, we conducted various statistical analyses on the size of the nodules. As shown in Table 3, the curriculum learning-based model surpassed the baseline. In our strategy, nodule[s] over 5 cm had the worst AUC in AMC. In addition, as the specificity showed the lowest results, the AMC dataset seems to be confused with consolidation because of the mass type contained in nodule[s]. In contrast, AUC of nodule[s] over 5 cm in the SNUBH dataset was the best. As the nodule[s] size increased, the AUC and accuracy improved. However, overall, the accuracy of nodule[s] was lower than the other disease patterns. Evaluations of the location and classification of nodule[s] and consolidation found that, although CAMs of the disease patterns were appropriately visualised, the two lesions may be misclassified as each other. This may result from similar image patterns with different sizes of nodule[s] and consolidation on CXR images. Differentiating between nodule[s] and consolidation may be difficult in CXRs, suggesting the need for merged labelling criteria.
Although previous studies 7 have evaluated the classification of each type of lesion, they were unable to accurately determine the locations of various lesions on CXR images during initial diagnosis. The present study included weakly supervised deep learning with curriculum methods that can simultaneously detect multi-labelled lesions of various sizes and disease patterns on large CXR images. Finally, CAM 8 was used to localise and visualise abnormal patterns. Knowing whether an evaluation of multiple lesions on CXR images by this method yields results like those diagnosed by experts is important. This curriculum-based, weakly supervised strategy may be promising in patients with different types of lesions because it may be used to determine the location and classification of multiple classes of lesions in multi-centre datasets. This result is shown in Fig. 5(a-c). This study had several limitations. Although we have developed a robust model by collecting and training data from datasets of two centres, data need to be collected and evaluated from more hospitals to cover a large number of X-ray variations. Figure 4 shows that untrained X-ray variations would have different marginal distributions and additional generalization methods could be required. Additionally, only five types of lesions were evaluated, suggesting a need to assess the practical utility of this strategy in patients with additional diseases, including cardiomegaly, tuberculosis, rib fracture, and mediastinal widening. In conclusion, this deep learning-based CAD system with high-scale CXR images from two centres, which was evaluated by curriculum learning and weak labelling only, www.nature.com/scientificreports www.nature.com/scientificreports/ required less preparation to train the system. This system could therefore be easily extended to include various kinds of diseases in actual clinical environments. In addition, our algorithm performed well for the simultaneous detection and classification of five disease patterns-nodule[s], consolidation, interstitial opacity, pleural effusion, and pneumothorax-on CXR images.

Methods
The overall procedure for curriculum learning involved two steps. First, the regional patterns of abnormalities were identified by initial learning with patch images, specifically training the network on regional patterns of abnormalities. Second, the network was fine-tuned using entire images. Resnet-50 architecture 9 was selected to train weak supervisions and modified for multi-label, non-multiclass problems to detect various disease patterns. Finally, class activation maps (CAMs) 8 were extracted to localise and visualise the abnormal patterns. Figure 6 shows a schematic of the overall procedure.
Dataset. CXR images of adults were collected from two hospitals, AMC and SNUBH. CXR images of 6069 healthy subjects and 3417 patients at AMC were obtained, with the latter including 944, 550, 280, 1364, and 331 patients with nodule[s], consolidation, interstitial opacity, pleural effusion, and pneumothorax, respectively. CXR images of 1035 healthy subjects and 4404 patients were obtained at SNUBH, with the latter including 1189, 853, 1009, 998, and 944 patients with nodule[s], consolidation, interstitial opacity, pleural effusion, and pneumothorax, respectively. Normal and abnormal datasets with nodule[s] (including mass)/consolidation or interstitial opacities were confirmed by chest CT and pleural effusion, and pneumothorax on CXRs were determined by consensus of two thoracic radiologists with corresponding chest CT images. For training and validation of the model, each abnormal lesion was confirmed and manually drawn by expert thoracic radiologists. The data were randomised, 70% for training, 10% for validation, and 20% for testing. All patient identifiers were removed. The study protocol was approved by the institutional review board for human investigations at AMC and SNUBH, and the requirement for informed consent was waived owing to the retrospective nature of this study. Implementation details and training strategy. Original   www.nature.com/scientificreports www.nature.com/scientificreports/ designed for general natural images, including memory problems. Therefore, all images were converted using bi-linear interpolation to a fixed size of 1024 × 1024 pixels. Because the pixel values of X-ray images have no physical meaning and the noise varies, careful pre-processing was required. The images were adjusted by the average of pixel values for the entire image at that windowing level, and a standard deviation of 3.5 pixel values was used for windowing width, normalizing the range of pixel values for each image. The standard deviation of 3.5 pixel values was selected empirically as a parameter for data augmentation during training by adjusting to a random floating point between 3 and 4. Sample-wise standardization was performed by subtracting the average pixel value of each image. Because patterns on X-ray images may differ among manufacturers, a sharpening and blurring technique was randomly applied to images during training, allowing the model to be robust to these image variations. We used the data in training and included additional data augmentation techniques, such as rotation (±10°), zoom (±10%), and shifting (±10%). The model was implemented in Keras with a Tensorflow backbone and a stochastic gradient descent optimizer with learning and decay rates of 5e-5. In the weakly supervised classification problem, Resnet has been the most widely used CNN architecture because it showed good performance in the ImageNet Large Scale Visual Recognition Competition (ILSVRC). As the layer becomes deeper, high performance could be expected. However, because of the trade-off with training time, this study used a 50-layer architecture (Resnet-50), which showed appropriate training time and performance. Resnet-50 designed in ILSVRC employs a softmax function as a classifier for multiclass problems. By contrast, the detection of various disease patterns on CXR should be regarded as a multi-label problem because various diseases can exist independently 3 . Therefore, the classifier was modified by multiple sigmoid functions: where y i j ,  is the independent probability corresponding to class j of sample i, and  a i j , is the activation value. Accordingly, loss was calculated as the sum of binary cross-entropy: www.nature.com/scientificreports www.nature.com/scientificreports/ where K and N represent the total numbers of classes and samples, respectively, and w is a weight term dealing with imbalance. By default, under-sampling was performed to adjust the class distribution of a dataset by randomly sampling the data of the remaining classes based on the class with the smallest number of images during training. Even then, there was a difference of (the number of classes + 1) times considering healthy subjects between the numbers of positive and negative samples in a class. To solve this problem of imbalance, the loss corresponding to each class was multiplied by its weight. Weakly supervised learning for image classification generally requires a considerable amount of data, with more required as complexity increases. Direct training with entire CXR images could lead to the wrong local minima because overlapping patterns of organs, tissues, and bones make the problem more difficult. A simple curriculum learning strategy was employed, consisting of two steps to train the complex disease patterns. In the first step, the Resnet-50 network, which was pre-trained on the ILSVRC dataset 10 , was trained using lesion-specific patch images. These patch images were extracted around the points selected by expert thoracic radiologists to better train the regional patterns of lesions (Fig. 7). The size of each patch image was defined as half its original size to contain a sufficient percentage of each disease pattern, as well as patterns surrounding each lesion. Subsequently, the network was fine-tuned using entire images because of the difference in distribution between patches and entire images.