Diagnostic accuracy of deep-learning with anomaly detection for a small amount of imbalanced data: discriminating malignant parotid tumors in MRI

We hypothesized that, in discrimination between benign and malignant parotid gland tumors, high diagnostic accuracy could be obtained with a small amount of imbalanced data when anomaly detection (AD) was combined with deep leaning (DL) model and the L2-constrained softmax loss. The purpose of this study was to evaluate whether the proposed method was more accurate than other commonly used DL or AD methods. Magnetic resonance (MR) images of 245 parotid tumors (22.5% malignant) were retrospectively collected. We evaluated the diagnostic accuracy of the proposed method (VGG16-based DL and AD) and that of classification models using conventional DL and AD methods. A radiologist also evaluated the MR images. ROC and precision-recall (PR) analyses were performed, and the area under the curve (AUC) was calculated. In terms of diagnostic performance, the VGG16-based model with the L2-constrained softmax loss and AD (local outlier factor) outperformed conventional DL and AD methods and a radiologist (ROC-AUC = 0.86 and PR-ROC = 0.77). The proposed method could discriminate between benign and malignant parotid tumors in MR images even when only a small amount of data with imbalanced distribution is available.

www.nature.com/scientificreports/ techniques prevented the model from overfitting in the small dataset. In this method, graphics processing unit (GPU) acceleration was not used, and the training of the model was completed in a reasonable amount of time.

Theoretical framework
Diagnostic accuracy equivalent to that of dermatologists was achieved by using a convolutional neural network for the classification of skin cancer 5 . The application of DL to histopathological tissue samples has been advanced, and DL was used for the discrimination of malignant lymphoma 6 and breast cancer 7 . Although the detection of malignant thyroid and salivary gland tumors using DL has already been reported for histopathological samples 8 , there have been no reports on DL for the discrimination of parotid gland tumors using MRI images, except for reports on texture analysis 9,10 . Training a conventional machine learning algorithm using a small amount of data has achieved promising results 4 . However, there are only a few reports on DL 11 . It is generally known that a large amount of training data can lead to better performance, whereas training a DL model on a small amount of data may be difficult. Johnson et al. investigated the use of DL in a large amount of highly imbalanced data 12 . However, the application of DL to a small amount of highly imbalanced data is still unknown.

Materials and methods
This study was conformity with the Declaration of Helsinki and Ethical Guidelines for Medical and Health Research Involving Human Subjects in Japan (https ://www.mhlw.go.jp/file/06-Seisa kujou hou-10600 000-Daiji nkanb oukou seika gakuk a/00000 80278 .pdf). The requirement to obtain informed consent was waived because of the retrospective design. This study was approved by the Kobe University Ethical Committee (Permission number: B190167) and carried out according to the guidelines of the committee.

Sample. Magnetic resonance images (T1-and T2-weighted images) of 245 parotid tumors obtained between
April 2010 and March 2019 were retrospectively collected in a single center. Patient age ranged from 11 to 86 years (with a median of 56 years). A total of 122 cases were male, and 123 were female. Tumor histopathology was confirmed by surgery, whereby 190 (77.6%) tumors were classified as benign, and 55 (22.4%) as malignant ( Table 1).
The main parameters of MR imaging were as follows: The magnetic field strength was primarily 1.5 and 3 T (data distribution is shown in Table 2), slice thickness was 1.5-10 mm, and the matrix size was ranged from 256 × 256 to 960 × 960. For each case, T1-and T2-weighted grayscale images were cropped by a board-certified radiologist (17 years of experience), resulting in images with the largest axial cross-section of the tumor. To obtain input to the DL model from the MR images, the cropped T1-and T2-weighted grayscale images were fed to the blue and green channels of a pseudo-color image (RGB image), respectively. The red channel was empty. Thereby, 245 pseudo-color images were obtained from the MR images, and they were scaled to fit the input size of the DL model ( Figs. 1 and 2).
The dataset consisting of these pseudo-color images was randomly divided into the training (60%), validation (20%), and test set (20%). In the three sets, benign and malignant images were distributed in equal proportions. www.nature.com/scientificreports/ Data analysis procedures. In this study, a two-stage model was used: The first stage involved a classification model using a deep convolutional neural network (DCNN), whereas in the second stage, an outlier detection method was employed for AD. In the former, a DCNN based on VGG16 13 was used for classification, and transfer learning was performed 14 . Then, the output of the DCNN before the final output layer (feature descriptors) was fed to the local outlier factor (LOF), which was used for outlier detection 15 . The LOF classified the DCNN feature descriptor as normal (corresponding to benign) or abnormal (corresponding to malignant) (Fig. 3). We used VGG16 attached to Keras (https ://githu b.com/fchol let/keras ) as a DCNN classification model, as suggested by the Visual Geometry Group at the University of Oxford in ILSVRC2014. We changed the input image size of VGG16 to 100 × 100 pixels, and the loss function was the L 2 -constrained softmax, as described in the below. The L 2 -constrained softmax loss forces the length of feature descriptors x to be a pre-specified constant ( α): All processing was performed on a PC without a discrete GPU (Core i5 5257U CPU at 2.7 GHz, RAM 8 GB). Python (version 3.6.8) (https ://www.pytho n.org) was used as the programing language, and Keras (version 2.1.6) and TensorFlow (version 1.13.2) (https ://tenso rflow .org/) were used as the deep learning framework. Adam, www.nature.com/scientificreports/ with a learning rate of 1.0 × 10 -5 , was used as the optimizer 16 . Transfer learning was performed by freezing the trainable parameters of 10 layers in VGG16, including the convolutional layer. The network was trained using a batch size of 1 and up to 500 epochs. The training stopped when the loss of the validation set was not improved.
In each epoch, all the pseudo-color images of training set as well as the non-medical images from the CUReT dataset were processed. The details of the CUReT dataset are described in the following paragraph. The execution of all training processes on the PC required approximately half a day. It is well known that the softmax loss is effective on high-quality datasets with small data variation. However, when the data are imbalanced and of poor quality, performance degradation occurs. On the other hand, the L 2 -constrained softmax loss is known to be effective even for low-quality data with strong imbalance 17 . Specifically, on a hypersphere, minimizing the L 2 -constrained softmax loss is equivalent to maximizing the cosine similarity for the same category pairs, and minimizing it for different category pairs, thus strengthening the feature verification signal. Moreover, the L 2 -constrained softmax loss can better classify extreme and difficult images because all the feature descriptors have the same L 2 -norm 17 . The L 2 -constrained softmax loss is given by where x i is the input image in a batch of size M , y i is the corresponding class label, and f (x i ) is the feature descriptor obtained from the penultimate layer of the DCNN classification model. C is the number of classes, and W and b are the weights and bias, respectively, for the last layer of the network, which acts as a classifier. In the proposed model, M = 1 , C = 3 , and α was set to 80.
In this study, the output of the classification model of a DCNN with the L 2 -constrained softmax loss before the final output layer (feature descriptor) was fed to the LOF (an abnormality detection method) to classify MR images into two groups: normal (benign) and abnormal (malignant). The LOF is an unsupervised AD method that computes the local density deviation of a given data point with respect to its neighbors 15 . Implementation of Scikit-learn was used for the LOF in this study 18 . The brute-force search method was used to compute the nearest neighbors, and the squared Euclidean distance was used for the distance computation. The local density is required to calculate the LOF. The details of the local density are as follows 15 : where lrd p is the local density of object p ( p ) is represented by the feature descriptor obtained from the DCNN), N k p is the k-distance neighborhood of p , and d p, q is the distance between objects p and q in the specified space (in this study, the squared Euclidean distance). Using lrd p , the LOF of p is defined as follows: where k was set to 5 in this study. The output LOF value was used to determine whether p (tumor) was normal (benign) or abnormal (malignant) (Fig. 4).
To improve the robustness of feature extraction in the DCNN classification model, images from the Columbia-Utrecht reflectance and texture database (CUReT) 19 , which is often used as an artificial texture library, were used as a third class in addition to benign and malignant tumors (Figs. 3 and 5). This technique can be regarded as data augmentation by adding a non-medical image dataset to a medical image dataset.  In addition to the proposed method, we evaluated the performance of several DL models and other AD methods, such as the DCNN classification model alone, a combination of the DCNN classification model and the other AD method (one-class support vector machine (OCSVM)), and convolutional variational autoencoder (CVAE) 22 . CVAE is a type of autoencoder used for AD. In addition to VGG16 13 , we used MobileNet 23 and ResNet50 24 as DCNN classification models. They were included in Keras, and their input image sizes were modified in the current study. In addition to the L 2 -constrained softmax loss, the conventional softmax loss was also used in the DL models. After the output of the DCNN classification model was obtained as a feature descriptor, OCSVM 25 , which is commonly used for AD, was also evaluated instead of the LOF. As shown, in addition to benign and malignant images, images from the CUReT dataset were used as a third class in the proposed method. For comparison, the results of training DL models without the CUReT dataset were also evaluated.

Measures.
Using the test dataset, we evaluated the diagnostic accuracy of the proposed DL with AD, conventional two-categorical DCNN classification models, and CVAE. In addition to the radiologist who cropped the images, another board-certified radiologist with 18 years of experience evaluated his diagnostic accuracy by using a 5-point scale: 1 (definitely benign) to 5 (definitely malignant). Receiver operating characteristic curve (ROC) analysis and precision-recall (PR) analysis were performed, and the area under the curve (AUC) was calculated for the results by the models and by the radiologist. The sensitivity and specificity values were calculated using the threshold obtained by the Youden index. The 95% confidence interval (95%CI) of sensitivity, specificity, and PR-AUC was calculated by the bootstrap method. The 95%CI of ROC-AUC was calculated by the Delong method.

Results
The diagnostic accuracy of the models evaluated in this study is shown in Table 3. The VGG16-based model with the L 2 -constrained softmax loss and LOF exhibited the highest diagnostic performance (ROC-AUC = 0.86 and PR-ROC = 0.77) (Fig. 6). The combined use of the CUReT dataset, L 2 -constrained softmax loss, and LOF resulted in the highest diagnostic performance among all DCNN classification models. In VGG16, the addition of the CUReT dataset resulted in an evident improvement of the diagnostic performance, whereas changing the loss function to the L 2 -constrained softmax loss resulted in a reduction. However, the combined use of the CUReT dataset with AD further improved the performance (Table 3). For ResNet50 and MobileNet, a slight improvement by the addition of the CUReT dataset was observed. Table 3 also shows that in each of the three models, the combination of the LOF and AD yielded better results than the combination of OCSVM. The combination of the LOF and CUReT yielded better results than the addition of the CUReT dataset alone. The VGG16 network tended to yield better results than MobileNet and ResNet50 (Fig. 6). The radiologist performed better than CVAE, but worse than the VGG16-based model with the L 2 -constrained softmax loss and LOF (Fig. 6).  www.nature.com/scientificreports/

Discussion
Two-category classification has often been used for classifying objects into two groups (e.g., benign and malignant). However, training DCNN classification models on two categories using a small dataset is known to lead to overfitting. In the present study, even though only a small amount of training data could be used, a significant performance improvement was obtained by including images from the CUReT dataset as a third category, in addition to the data augmentation. This is because the addition of the CUReT dataset facilitated the extraction of general visual patterns and reduced the overfitting caused by the small number of images.
In this study, we used the L 2 -constrained softmax loss, which was demonstrated to be superior to the categorical softmax loss in extracting features from small and imbalanced datasets. By training the DCNN classification model using this loss function, the feature descriptors of less diverse and more numerous benign tumors were dense in the feature space, and those of more diverse and less numerous malignant tumors were sparse and far from benign tumors. These characteristics of the feature descriptors suggest that the use of the LOF and AD based on local density and distance is quite effective. Accordingly, it may be possible to improve the accuracy of DL by combining AD techniques (LOF) with DCNNs.
In this study, we did not use GPUs, which are commonly employed in training DL models; rather, we used CPUs, which are commonly employed in general applications, for DL training and evaluation. Nevertheless, the results demonstrated that the diagnostic accuracy of the proposed model was at least comparable to that of the radiologist. The computational performance was improved owing to the smaller resolution of the input images resulting from cropping only tumor images, the use of a relatively small DCNN, and the smaller number of input images. Among the DCNN classification models, VGG16 generally performed better than MobileNet and ResNet50. This was inconsistent with the number of modifiable parameters (VGG16, 138,357,544; MobileNet, 4,253,864; and ResNet50, 25,636,712) 27 , and with the ImageNet-based evaluation (Top1 accuracy VGG16, 0.713; MobileNet, 0.665; and ResNet50, 0.759) 27 . On the other hand, the performance excellence of VGG-16 is attributed to the fact that the architecture depth of VGG-16 is optimal to learn from the dataset of the current study. Optimal DCNNs for medical images characterized by low-resolution and imbalanced datasets should be investigated in future work.
The accuracy of benign/malignant discrimination using ultrasound is reportedly low, even with pulsed Doppler and color Doppler sonography (sensitivity 72%; specificity 88%) 28 . It has been reported that no clear correlation between malignancy and the characteristics of MR images has been found in non-contrast MR imaging. Moreover, the use of dynamic MR images and the apparent diffusion coefficient improved the accuracy of benign/ malignant discrimination (sensitivity, 86%; specificity, 92%) 29 . Our results indicate lower diagnostic values, with a sensitivity of 75% and a specificity of 82%. Diagnostic accuracy is affected by the pretest probability of the disease in the studies; thus, it is not easy to compare only the values themselves among different studies. To resolve this www.nature.com/scientificreports/ issue, in the present study, the diagnostic accuracy of the radiologist was evaluated using image-interpretation experiments, and it was demonstrated to be lower than that of the proposed DL model. Generally, the construction of DL models with high diagnostic performance requires a large amount of image data. The use of a small or imbalanced dataset may be considered a challenging task for constructing robust and reliable DL models. The results of this study demonstrated that the proposed technique enabled the construction of a DL model in the case of a small number of images. In the medical field, as it is often difficult to collect big and balanced data for rare diseases, the proposed technique may facilitate the construction of a robust DL model with high diagnostic performance. In addition, even though we only used a CPU without a GPU, we obtained satisfactory results. The proposed method has a relatively low computational cost and may therefore be easier to implement than complex and large DCNN models. www.nature.com/scientificreports/

Limitations and conclusion
Our study has several limitations. Specifically, the pseudo-color images were created by selecting only the slice containing the maximum diameter of the tumor, and this slice may not best represent the characteristics of the tumor. The most important aspect in the actual MRI diagnosis is the invasiveness of the surrounding area observed in malignant tumors 1 . In this study, however, such invasiveness was not necessarily included in the evaluation slice, and thus the diagnostic accuracy might differ from that of a clinical diagnosis. Furthermore, although 3D data may improve diagnostic accuracy, we only constructed a 2D artificial intelligence model, as a 3D model would require extensive image processing. Finally, the CUReT dataset, which is often used in texture analysis, was used as the non-medical image dataset in this study. It is necessary to investigate whether other non-medical images can be used to prevent overfitting in small datasets.
In conclusion, the proposed method (i.e., a combination of DL with AD) could discriminate between benign and malignant parotid tumors in MR images even though the DL training data consisted of a small number of images with strongly imbalanced distribution. Among the various DL models and AI techniques, the VGG16based model with the L 2 -constrained softmax loss, LOF, and CUReT datasets exhibited the highest diagnostic accuracy. As a potential application of the proposed method, it may be possible to obtain an accurate and robust DL model in diseases for which it has been difficult to construct a DL model due to a small amount of data with imbalanced distribution.