An aggregate method for thorax diseases classification

A common problem found in real-world medical image classification is the inherent imbalance of the positive and negative patterns in the dataset, where positive patterns are usually rare. Moreover, in the classification of multiple classes with a neural network, a training pattern is treated as a positive pattern in one output node and as negative in all the remaining output nodes. In this paper, the weights of a training pattern in the loss function are designed based not only on the number of training patterns in the class but also on the different nodes, where one of them treats this training pattern as positive and the others treat it as negative. We propose a combined approach of a weights-calculation algorithm for deep network training and the training optimization from a state-of-the-art deep network architecture for the thorax diseases classification problem. Experimental results on the Chest X-Ray image dataset demonstrate that this new weighting scheme improves classification performance, and the training optimization from the EfficientNet improves the performance further. We compare the aggregate method with several performances from previous studies of thorax disease classification to provide fair comparisons against the proposed method.

www.nature.com/scientificreports/

Baltruschat et al. 6 noticed that different split-sets lead to different performances for the dataset 3 . To provide fair benchmarks, we report results from several split-set options for the performance evaluation. We perform three split-set experiment configurations, which aim to provide better evaluation and a more comprehensive analysis: the first uses the "official" split from 9 , the second uses a fivefold cross-validation configuration, which has also been used in the work of Baltruschat et al. 6 , and the last uses the identical splits from the public GitHub pages 8,10 . We achieve state-of-the-art results for the classification problem of the Chest X-Ray dataset 3 , measured under these three split-set configurations. This research mainly contributes to the improvement of the medical image classification problem. We tackle the imbalance problem within the Chest X-Ray dataset and propose the use of a state-of-the-art neural network architecture for the final classification performance: the EfficientNet with two-stage training. In summary, our contribution is an approach that combines a weights-calculation algorithm for deep networks with the optimized training strategy of a state-of-the-art architecture. The "Introduction" section of this paper provides a brief introduction and overview of the research. The "Method" section mainly discusses the existing classification approach and the proposed method. The "Experiments and results" section presents the results from the experiments. The "Discussion" section gives a more in-depth discussion of the outcomes, and the "Conclusion" section presents the conclusions of the research.

Method
The existing weights function and network architecture. Wang et al. 3 and Gündel et al. 4 defined the weights ω_k+ and ω_k− of the positive and negative samples for the k-th pattern (Eq. 1),
where P_k and N_k are the numbers of positive and negative samples for the k-th pattern. However, Cui et al. 2 used both ω_k+ and ω_k− equally to develop the loss function. Wang et al. 3 and Gündel et al. 4 did not use identical datasets: Wang et al. 3 used Chest X-Ray 8, which consists of eight classes with 108,948 images, whilst Gündel et al. 4 used Chest X-Ray 14, which consists of fourteen classes with 121,120 images. However, the results from the method of Wang et al. 3 under the Chest X-Ray 14 "official split" configuration can be found in the manuscript of Gündel et al. 4 . Both works implemented Eq. (1) in the loss function according to the literature 3,4 . Therefore, we can conclude that Eq. (1) applies to the samples of the training set. Lin et al. 11 proposed the focal-loss function FL(p) = −α(1 − p)^γ log(p) (Eq. 2), where p is the prediction. In Eq. (2), the parameter α attempts to balance the positive and negative samples, while γ is adjusted to down-weight the easy samples so that the hard samples dominate, where the easy and hard samples are those classified correctly and incorrectly, respectively. Generally, γ ≥ 0; when γ = 0, the focal loss is the same as an ordinary cross-entropy loss 11 . The experimental results showed that the easy samples are down-weighted when γ ≈ 1; the samples are further down-weighted when γ > 1. The determination of α is discussed to demonstrate its impact on the focal-loss function (Eq. 2). The parameters are chosen as follows 2 : α_k(β) = (1 − β)/(1 − β^{n_k}) (Eq. 3), where n_k is the number of samples of the k-th pattern and N is the total number of samples. Conceptually, β is used to adjust the significance of the number of samples. N(β) is the sum of all α_k values corresponding to the β value for each pattern k, and N(β) is used for normalization with the number of patterns. However, the work of Cui et al. 2 ignores the negative patterns in the weight calculation; this dismisses very important variables, because negative patterns from negative classes are common in medical image classification problems.
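Cui et al.'s effective-number weighting (Eq. 3) can be sketched as follows; the per-class counts are hypothetical and only illustrate how β controls the re-weighting:

```python
import numpy as np

def class_balanced_alpha(n_k, beta):
    """Cui et al.'s class-balanced weight: alpha_k = (1 - beta) / (1 - beta^n_k),
    the inverse of the 'effective number' of samples E_k = (1 - beta^n_k) / (1 - beta)."""
    n_k = np.asarray(n_k, dtype=float)
    return (1.0 - beta) / (1.0 - np.power(beta, n_k))

# Hypothetical per-class sample counts: rarer classes receive larger weights.
counts = np.array([10000, 500, 50])
alphas = class_balanced_alpha(counts, beta=0.999)
```

As β approaches 0 every class receives a weight near 1 (no re-weighting), and as β approaches 1 the weight approaches inverse class frequency; this is the knob the grid search tunes.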
The proposed weights function and network architecture. The normalization of α_k formulated in Eq. (4) is used to weight the k-th pattern, where C is the number of classes. Although Cui et al. 2 proposed a grid search to determine β based on their formulation, the separable weights of a positive and a negative pattern have not been addressed. In this paper, we integrate the separability of positive and negative patterns into the loss function in order to improve the classification capability of Cui et al.'s approach 2 . The hypotheses address the importance of both positive and negative pattern weights to improve end-to-end training.
where ω_k+ are the weights for the positive samples of the k-th pattern. Equation (5) is the point of elaboration between 2 and our proposed method. We deliberately assign α_k to each sample in the k-th pattern based on the specified ω_k+ weights. The work 2 emphasized the importance of effective samples to define the weights, and two types of weights, ω_k+ and ω_k−, enter our proposal. In our proposed approach, α_k(β) from 2 attempts to determine the weights of only the positively labeled samples, which is given in Eq. (5). We also determine the weight of the negative patterns in Eq. (6). Experimental results evaluate the ability of the proposed weights in Eqs. (5) and (6) to balance the imbalanced samples.
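Since Eqs. (5) and (6) are not reproduced in this excerpt, the following is only a hedged sketch of the idea of separable weights: we assume the effective-number form is applied independently to the positive count P_k and negative count N_k of each pattern. The function names and the separate β values are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def effective_number_weight(n, beta):
    # Effective-number weight from Cui et al.: (1 - beta) / (1 - beta^n)
    return (1.0 - beta) / (1.0 - np.power(beta, float(n)))

def pattern_weights(P_k, N_k, beta_pos, beta_neg):
    """Hypothetical sketch of separable positive/negative weights:
    omega_k+ from the P_k positive samples (Eq. 5, assumed form) and
    omega_k- from the N_k negative samples (Eq. 6, assumed form)."""
    w_pos = effective_number_weight(P_k, beta_pos)
    w_neg = effective_number_weight(N_k, beta_neg)
    return w_pos, w_neg
```

With the same β, the much rarer positive samples receive a larger weight than the abundant negatives, which is the intended balancing effect.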
The weighted cross entropy loss. The formulation for the cross entropy loss 16 with the proposed weight is given in Eq. (7), where y_k^true are the ground-truth labels for the samples in pattern k. To perform the experiments in the "Experiments and results" section, we set ω_k− = ω_k+ for one particular case: the case where we want to see the outcome of applying Cui et al.'s 2 formulation to the dataset 3 classification problem. The cross entropy loss uses a softmax output by default, whereas the binary cross entropy loss uses a sigmoid output.
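A minimal sketch of a weighted binary cross entropy of this kind; the exact Eq. (7) is not reproduced here, so the form below is an assumption consistent with the description (per-class ω_k+ / ω_k− weights applied to the positive and negative log-likelihood terms):

```python
import numpy as np

def weighted_bce(y_true, y_pred, w_pos, w_neg, eps=1e-7):
    """Weighted binary cross entropy over the sigmoid outputs.
    w_pos / w_neg are per-class weights; setting w_neg = w_pos recovers
    the single-weight special case used to test Cui et al.'s formulation."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    loss = -(w_pos * y_true * np.log(y_pred)
             + w_neg * (1.0 - y_true) * np.log(1.0 - y_pred))
    return loss.mean()
```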
The weighted focal loss. The formulation for the focal loss with the proposed weight is given in Eq. (8). The proposed focal loss attempts to weight both the easy-hard samples and the positive-negative patterns, which are not addressed by Cui et al.'s approach 2 . The proposed focal loss also suits the multiclass classification problem.
There is no existing focal-loss method which addresses both the effective number of samples and positive-negative pattern weighting.
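A hedged sketch of such a doubly weighted focal loss, assuming the standard focal modulation (1 − p)^γ from Lin et al. combined with separate positive and negative weights; the exact Eq. (8) is not reproduced here:

```python
import numpy as np

def weighted_focal_loss(y_true, y_pred, w_pos, w_neg, gamma=1.0, eps=1e-7):
    """Focal loss with separate positive/negative pattern weights.
    gamma = 0 reduces it to the weighted cross entropy; gamma > 0
    down-weights easy (confidently classified) samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    pos = w_pos * y_true * (1.0 - y_pred) ** gamma * np.log(y_pred)
    neg = w_neg * (1.0 - y_true) * y_pred ** gamma * np.log(1.0 - y_pred)
    return -(pos + neg).mean()
```

For confident predictions the γ = 1 loss is strictly smaller than the γ = 0 loss, which is exactly the down-weighting of easy samples described above.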
The progressive image resizing. Progressive image resizing is the procedure of training a single deep network architecture with incremental input sizes in multiple stages of training. The first stage trains the network with the default image size for the network; the next stage then utilises bigger images and the best-performing pre-trained model from the previous stage. There is no formal definition of the exact number of steps, but the classification performance improves to some extent and then saturates as the gains diminish; this is very specific to the classification problem. We report that a third stage of training with progressive image resizing did not improve the performance on the existing Chest X-Ray classification problem. Another functionality of progressive image resizing is to provide another form of augmentation: it (re)trains the model with augmentations of differently sized inputs. Several works in the literature 17-20 mention that augmentation is a proven method to reduce overfitting. We need our final model to be free from overfitting, and the two-stage training is our approach to achieve that. In summary, we perform the two-stage training to achieve two aims: (1) to improve the classification accuracy, and (2) to prevent overfitting.
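The two-stage procedure can be sketched as follows; `train_fn` is a hypothetical stand-in for one complete training run and is not code from this paper:

```python
def two_stage_training(train_fn, base_size=224):
    """Sketch of two-stage progressive image resizing.
    train_fn(input_size, init_weights) -> (best_weights, score) stands in
    for a full training run. Stage 1 trains at the network's default input
    size from ImageNet weights; stage 2 doubles the input size and resumes
    from the best stage-1 model."""
    weights, score = train_fn(base_size, init_weights="imagenet")
    weights, score = train_fn(base_size * 2, init_weights=weights)
    return weights, score
```

The same skeleton covers both backbones used later: base_size=224 for EfficientNet-B0 (224 → 448) and base_size=300 for EfficientNet-B3 (300 → 600).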
EfficientNet. The recent work by Tan et al. 5 introduced EfficientNet. It proposed a formulation to perform a grid search among three prominent aspects of a deep network's architecture: depth, width and input resolution. The depth defines the number of layers, the width defines the number of nodes in each layer, and the input resolution defines the size of the input images. The compound scaling of these three components is then composed into different architectures, from EfficientNet-B0 to EfficientNet-B7. The networks use mobile inverted bottleneck layers similar to 21,22 ; the layers are then followed by a squeeze-and-excitation layer 23 . The ReLU6 function is capped at a magnitude of 6; it was used in MobileNetV2 22 . However, EfficientNet replaces the use of ReLU6 with Swish. Equation (9) shows the difference among the ordinary ReLU function, ReLU6 24 and the Swish activation function. The layers of EfficientNet-B0 are depicted in Table 2. The further scaling of EfficientNets B0 to B7 is then defined by the grid-search formula as reported in 5 . After the input layer, EfficientNet uses a 3 × 3 spatial convolutional layer with stride 2, followed by MBConv1, the linear bottleneck and inverted residual layer 22 . After the MBConv1 layer, the network has six consecutive MBConv6 layers with various 3 × 3 and 5 × 5 kernels as listed in Table 2. Each MBConv6 has three consecutive layers consisting of a 1 × 1 convolutional layer, a 3 × 3 or 5 × 5 depth-wise convolutional layer and another 1 × 1 convolutional layer. Each MBConv1 has two consecutive layers consisting of a 3 × 3 depth-wise convolutional layer and another 1 × 1 convolutional layer. The final layer consists of a 1 × 1 convolution, global average pooling and a fully connected layer of 1280 nodes. Following the previous modification of DenseNet-121 for the specific implementation of the Chest X-Ray 3 classification problem, we also modify the final output layer from 1280 nodes to 14 nodes.
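The three activation functions compared in Eq. (9) can be written directly; these are the standard definitions from the cited works, not code from this paper:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu6(x):
    # ReLU capped at a magnitude of 6, as used in MobileNetV2
    return np.minimum(np.maximum(0.0, x), 6.0)

def swish(x):
    # Swish = x * sigmoid(x); EfficientNet uses this in place of ReLU6
    return x / (1.0 + np.exp(-x))
```

Unlike ReLU6, Swish is smooth and unbounded above, approaching the identity for large positive inputs.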
The performance evaluation. To have a better perspective on the algorithm's performance, we need to apply different metrics to evaluate the results. We apply the AU-PRC (Area Under the Precision-Recall Curve) metric for further evaluation; this metric has different characteristics from AU-ROC (Area Under the Receiver Operating Characteristic curve). In terms of baseline, AU-ROC has a fixed baseline of .50 for a random classifier and 1 for a perfect classifier 25,26 . In contrast, the AU-PRC baseline is dynamic, since it heavily depends on the ratio between positive and negative samples 27 ; AU-PRC is more sensitive to the data distribution. AU-PRC has a baseline of .50 for a random classifier with an equal number of positive and negative samples; when the number of negative samples is ten times the number of positive samples, this baseline decreases to .09 27 . The formulation to calculate the baseline of AU-PRC, shown in Eq. (10), is taken from the literature 27 .
Suppose we have two classes with an identical AU-PRC value of .50; the interpretation of this particular result will vary between the two classes. An AU-PRC of .50 is a good result for a class with few positive samples, but it may not be satisfactory for a class with a large number of positive samples.
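The AU-PRC baseline of Eq. (10) is simply the positive fraction of the samples, which reproduces the .50 and .09 figures quoted above:

```python
def auprc_baseline(n_pos, n_neg):
    """Baseline AU-PRC of a random classifier (Eq. 10): the positive
    fraction P / (P + N). Unlike AU-ROC's fixed .50 baseline, it moves
    with the class imbalance."""
    return n_pos / (n_pos + n_neg)
```

For balanced classes this gives .50; with ten negatives per positive it gives 1/11 ≈ .09, matching the literature value cited above.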

Experiments and results
The Chest X-Ray dataset 3 is used to evaluate the performance of the proposed method. It contains 112,120 Chest X-Ray images from 30,805 unique patients, with multilabels for 14 disease classes. The image resolution is 1024 × 1024 with an 8-bit channel. We downsampled the resolution to 224 × 224 and converted the channel into RGB so that the images can be fed to our backbone network. Chest X-Ray 14 consists only of frontal-view images; it does not have any lateral-view cases. The number of positive samples for each class is much smaller than the number of negative samples, as depicted in Fig. 1. In our proposed method, the five hyperparameters β are given in Eq. (11),
where β_2 is determined by Eq. (3). A grid search determines the other β values. With the exception of β_4, the grid search was performed by changing the β value with a standard deviation of 10 from β_2. The current value of β_4 was chosen because that magnitude is the median between β_3 and β_5. The results obtained by the proposed method are also compared to those obtained by six other methods: Wang et al. 3 , Yao et al. 12 , the baseline ChexNet 15 , the weighted binary cross entropy loss, Baltruschat et al. 6 and Gündel et al. 4 . The comparison is depicted in Table 4.
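The preprocessing described above (downsampling from 1024 × 1024 to 224 × 224 and replicating the grayscale channel to RGB) can be sketched as follows; the nearest-neighbour downsampling and the function name are illustrative assumptions, as a real pipeline would typically use bilinear resizing:

```python
import numpy as np

def to_network_input(gray_img, size=224):
    """Sketch: nearest-neighbour downsampling of a square 8-bit grayscale
    image (e.g. 1024 x 1024) to size x size, then replicating the single
    channel into three RGB channels for the ImageNet-pretrained backbone."""
    src = gray_img.shape[0]
    idx = np.arange(size) * src // size       # nearest source pixel per target pixel
    small = gray_img[np.ix_(idx, idx)]        # downsample rows and columns
    return np.repeat(small[..., None], 3, axis=-1)
```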
Backbone network training. Since we use DenseNet-121 14 as the primary backbone network, pre-trained ImageNet weights are available for the classification; we used these pre-trained weights to develop the network. We use the dataset 3 for the base metrics, with the same training, validation and test splits from 9 . We refer to the split-set 9 as the "official split". Summers 9 provides two ground-truth label files: train_val_list.txt, which consists of 86,524 samples, and test_list.txt, which consists of 25,596 samples. Baltruschat et al. 6 emphasized that different splits of the dataset 3 have a significant impact on the classification performance.
Since the splitting of the training and test data is exactly the same, the benchmark is fair. Figure 1 and Table 3 show that the class distribution is imbalanced, since the numbers of positive and negative samples are very different. We use a single Titan V with 12 GB of GPU memory to develop the network, where 24 hours with 25 epochs of training are required. We only train the DenseNet-121 in a single-stage training cycle and do not perform progressive image resizing.
Because we aim to improve the overall classification performance, we also change the backbone network architecture from DenseNet-121 to EfficientNet 5 . The approach mainly extends the performance of the proposed cost-sensitive loss function with a better architecture. We limit the experiments to the EfficientNet-B0 and EfficientNet-B3 networks, mainly because we have reached the limits of our computing resources, and performing the experiments over all the EfficientNet architectures is not feasible at this stage. Training consecutive EfficientNets requires extensive computation, due to the scaling of the image sizes, the depth and also the width of the network. On the other hand, the progressive image resizing approach only takes the image size into account for computational resources; it ignores the depth and the width of the network. To train the EfficientNets, we use a Tesla V100 with 32 GB of GPU memory. For each network, we performed the two-stage training procedure with progressive image resizing as previously discussed. In the first stage we train the network from the pre-trained ImageNet model; in the second stage we train the network from the best-performing model of the first stage. The important fine-tuning choice is the size of the image input: in the first stage we use the default input size of the network, and in the second stage we double the input size. This has been implemented with a size of 224 × 224 in the first stage of EfficientNet-B0 and 448 × 448 in the second stage, and 300 × 300 in the first stage of EfficientNet-B3 and 600 × 600 in the second stage. We reduce the batch size by half, from 32 in the first stage to 16 in the second stage, mainly to ensure that the batched images for each step of each epoch fit within the GPU's memory.
The two-stage training with progressive image resizing successfully improved the classification outputs by approximately 1% between the first and second stages for each model.

The baseline network.
We reproduce ChexNet 15 based on 28 . The experiments performed by our proposed method and the other methods 3,4,6 , based on the training and test split in 9 , are reported in Table 4. However, Rajpurkar et al. 15 never shared their split-set publicly. The use of the official split 9 results in lower performance than reported in Rajpurkar et al. 15 . We use the ADAM optimizer as in 15 to develop the neural network, whose optimization converges at epoch 11. Other studies also used ADAM 4,6 and stochastic gradient descent 3 .

Weighted binary cross entropy with effective number of samples. In this experiment we use Eq. (3) to compute the weights, in order to provide evidence of the performance that comes from 2 versus the one that comes from our approach. We use binary cross entropy as the loss function and combine it with the weighting in the loss function. We set ω_k− = ω_k+ in the implementation of Eq. (7) for this particular case, since 2 ignores the positive-negative balance. The best classification performance for this model is also achieved at epoch 11, similar to the baseline. The comparison with our other experiments is shown in Table 4. This method performs only slightly better than the baseline, with a 79.24% area under the ROC curve.
Weighted focal loss with positive and negative pattern. In this experiment, we use the loss function 11 integrated with the focal loss and the proposed weighting. We choose the value of α based on 11 , which is between [.25, .75]; we found that α = 0.5 and γ = 1 are the best focal-loss hyperparameters for our proposed method. We use the RANGER (Rectified Adam and LookAhead) optimizer, which requires a smaller number of training epochs to converge; the optimizer converges at epoch 5. We deliberately apply the two-stage training to prevent overfitting and also to improve the performance. This method achieves 82.

The intuitive theoretical background and the evidence from experiment. Part of our approach inherits the strength of the focal loss 11 and of the class-balanced approach 2 , so we can analyse the proposed approach intuitively based on 11 and 2 . The main distinction between the focal loss and the binary cross entropy loss is the existence of the α and γ parameters. Cui et al. mention in their paper: "the class-balanced term can be viewed as an explicit way to set α in focal loss based on the effective number of samples" 2 . Lin et al. also mention that "a common method for addressing class imbalance is to introduce a weighting factor α ∈ [0, 1] for class 1 and 1 − α for class −1" 11 . We combine those two statements in our elaboration in Eq. (6). The experiments provide further evidence for the theory: the improvement from the proposed formula is approximately 1% on the test set according to Table 5. Both experiments were performed with α = 0.5 and γ = 1.0 as the focal loss parameters; Tables 5 and 8 report the corresponding results.

The imbalance metric evaluation. Table 6 and Fig. 2 show the advancement of the proposed method in comparison with previous work 15 and with the baseline retrieved from the dataset.
We calculate the baseline of the AU-PRC metric directly from the dataset's distribution of positive and negative samples using Eq. (10). The bold fonts show the top scores achieved within the same split-set configuration. Hernia has the lowest number of positive samples in the distribution. Despite it being the smallest minority class, the proposed algorithm achieves an AU-PRC for Hernia a couple of hundred times higher than the baseline, as shown in Table 6.

Discussion
In order to provide more insight into the effect of different splits on the classification performance, several split-sets have been taken into the performance evaluation. The standard procedure is to follow the "official" splits 9 , and we report the results in Table 4. To the best of our knowledge, only 6 reported a performance evaluation with random fivefold cross-validation on the Chest X-Ray dataset 3 , and we report the results of the proposed method in Table 7. There are other split-sets which are considered "non-standard" settings; these splits come from GitHub pages. Ren 8 is a third-party re-implementation of 7 , and 10 is a third-party re-implementation of 15 . After further investigation, 8 and 10 turn out to use identical training, validation and testing sets. We report the results with these custom sets 8,10 in Table 8. The results in Table 9 are the improvements made in comparison with the most recent research 7 . The one-by-one comparison for each disease with the latest research 7 is listed in Table 8. We achieve better performances than the work of Guan et al. 7 , and we propose a technically simpler approach to achieve the results. Since the diversity of split-sets is a well-known problem for the evaluation of the dataset 3 , the use of cross-validation is a fair method to follow. Baltruschat et al. 6 is the only work that reported performing cross-validation on the dataset 3 ; we achieve better performance in the fivefold cross-validation experiment than the work of Baltruschat et al. 6 . The class activation mapping (CAM) method 29,30 visualizes the discriminative features from the deep network's last layer in the form of a heatmap localisation; the more the heatmap visualization matches the ground-truth bounding boxes from the dataset, the better the network's understanding of the images. We visualize our classification performances with heatmaps from the CAM method in Table 10.
We obtain the bounding boxes as the annotation ground truth for only 8 (eight) classes, which are available from the file BBox_List_2017.csv 9 . The annotations consist of 984 images, and the number of samples for each class is not distributed evenly. Table 10 shows that the networks equipped with the proposed method read the area of the disease better than the baseline. We found that the third-party re-implementation 8 reported lower performances than reported in the paper 7 ; Guan et al. 7 did not provide official code and split-sets. A critical issue for classification on the dataset 3 is that different splits will lead to different performances 6 .

Conclusion
We proposed an aggregate of a novel weighting function to formulate the focal-loss function, complemented with the two-stage training of EfficientNet, a state-of-the-art neural network architecture, with the aim of improving the classification capability. Existing weighting functions did not address both the positive-negative and easy-hard sample characteristics. The proposed weighting function attempts to improve the classification capability by addressing both of these sample characteristics, which are ignored by the existing methods. The proposed approach provides a better decision boundary for the multiclass classification problem, since it addresses the imbalances of both positive-negative and hard-easy samples, and the use of a recent network architecture scales up the performance further. The proposed approach improves the classification rate by 2.10% over the latest research's outputs, measured by the area under the receiver operating characteristic curve (AUROC). The proposed method also achieved state-of-the-art results under three distinct experiment setups; currently these results are the best improvements for the Chest X-Ray dataset being used. Since the proposed approach only addresses the multiclass classification problem and multilabel classification is not tackled, future research will target multilabel problems, and the proposed approach will be further validated.