Ovarian tumor diagnosis using deep convolutional neural networks and a denoising convolutional autoencoder

Discrimination of ovarian tumors is necessary for proper treatment. In this study, we developed a convolutional neural network model with a convolutional autoencoder (CNN-CAE) to classify ovarian tumors. A total of 1613 ultrasound images of ovaries with known pathological diagnoses were pre-processed and augmented for deep learning analysis. We designed a CNN-CAE model that removes unnecessary information (e.g., calipers and annotations) from ultrasound images and classifies ovaries into five classes. We used fivefold cross-validation to evaluate the performance of the CNN-CAE model in terms of accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (AUC). Gradient-weighted class activation mapping (Grad-CAM) was applied to visualize and qualitatively verify the CNN-CAE model results. In classifying normal versus ovarian tumors, the CNN-CAE model showed 97.2% accuracy, 97.2% sensitivity, and an AUC of 0.9936 with the DenseNet121 architecture. In distinguishing malignant ovarian tumors, the CNN-CAE model showed 90.12% accuracy, 86.67% sensitivity, and an AUC of 0.9406 with the DenseNet161 architecture. Grad-CAM showed that the CNN-CAE model recognizes valid texture and morphology features in the ultrasound images and classifies ovarian tumors from these features. CNN-CAE is a feasible diagnostic tool that robustly classifies ovarian tumors by eliminating marks on ultrasound images, and it demonstrates significant potential for clinical application.

www.nature.com/scientificreports/

Not only is it important to differentiate benign from malignant ovarian tumors, but it is also important to distinguish among the various benign ovarian tumor types, because it is estimated that up to 10% of women will have surgery for an ovarian cyst in their lifetime6. Therefore, in this study, we aimed to accurately diagnose malignant and various benign ovarian tumors by using a texture analysis approach on ultrasound images7.
The most important factor in texture analysis is image quality. Any disturbance on an image can interfere with texture feature learning, and among the many images collected for large cohort studies8, images with disturbances are inevitable. These disturbances are not easily removed manually, because their appearance varies from image to image. This study proposes a deep learning-based approach, a convolutional neural network with a convolutional autoencoder (CNN-CAE), to remove disturbances automatically and sort ovarian tumors into five classes: normal (no lesion), cystadenoma, mature cystic teratoma, endometrioma, and malignant tumor.
The effectiveness of the proposed CNN-CAE is validated on 1613 ovarian ultrasound images collected from 1154 patients. A deep learning visualization method and the degradation of classification performance demonstrate, qualitatively and quantitatively, the effect of image disturbances on texture analysis. We believe that the proposed CNN-CAE is a viable deep learning-based diagnostic tool for distinguishing ovarian tumors.
The remainder of this paper is organized as follows: "Results" presents the results of this study, "Discussion" discusses them, and "Material and methods" describes the materials and methods.

Results
Removing marks via CAE. Figure 1 shows the images before and after removal of marks via the CAE model. There are calipers and annotations around the ovary in the upper images, which inevitably affect the features that the CNN model learns. As seen in the lower row, the images obtained through the CAE model are relatively clean, without calipers or annotations around the ovary. The pixels replacing the marks blend well with the surrounding pixels, with no sense of heterogeneity. The classification results for the two highest-performing models are shown in the confusion matrices in Fig. 2, with the model predictions on the x-axis and the pathology diagnoses on the y-axis. As can be seen from these diagrams, the DenseNet161 model showed a better result in the classification of malignancies: malignancies were correctly identified in 467 of 539 images (86.6%), whereas the DenseNet121 model correctly classified 451 (83.7%) lesions as malignant. For benign tumors, approximately 70.0–80.0% of the images were correctly sorted into each class; the DenseNet161 model correctly classified 682 of 860 benign tumor images, while the DenseNet121 model correctly classified 679 of 860.

Figure 1.
Ultrasound images before and after removing the marks via the convolutional autoencoder. The first row shows the images with marks, and the second row shows the same images with the marks removed. Example images are, from left to right: normal, cystadenoma, mature cystic teratoma, endometrioma, and malignancy.

A comparison of the receiver operating characteristic (ROC) curves for the two top models is shown in Fig. 3. The AUC values for each class of both models are in the range of 0.89–0.98. In determining tumor presence, the DenseNet161 model had an AUC of 0.9837, indicating that the presence of a tumor is well distinguished. The ROC curve for tumor malignancy had an AUC > 0.9, which is promising for distinguishing malignant tumors.

Validation of CNN-CAE through the visualization method.
We applied the gradient-weighted class activation mapping (Grad-CAM)9 visualization method to determine the effects of marks on CNN learning. For comparison, we trained the same CNN model on images from which the marks had not been removed and visualized both CNN models. The visualization results are shown in Fig. 4. The images on the left are input images, and those on the right are the visualization results from Grad-CAM. In a Grad-CAM image, the activated (red) area is weighted strongly in predicting the final result, whereas the blue area is generally not considered. The Grad-CAM results show that marks on the images, such as calipers and annotations, can affect the classification process in the CNN model. Furthermore, the activation area of the model trained without marks coincided with the correct area more often. This means that the classification is based on morphology and texture information, and thus the classification results can be regarded as valid. In contrast, the activation area of the model trained with marks is distributed over incorrect areas, as can be seen in Fig. 4 (red circles), which means that the classification is not based on shape and texture information; these results are therefore considered invalid.
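The Grad-CAM computation itself is compact: the gradients of the class score with respect to a convolutional layer's activation maps are global-average-pooled into per-channel weights, the activation maps are combined using those weights, and a ReLU keeps only regions with a positive influence on the class. A minimal NumPy sketch of this step (our own illustration, independent of any particular deep learning framework and of the authors' code) might look like this:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat map from a conv layer's activations (C, H, W) and the
    gradients of the class score w.r.t. those activations (C, H, W)."""
    # global-average-pool the gradients: one importance weight per channel
    weights = gradients.mean(axis=(1, 2))
    # weighted sum of the activation maps over the channel axis -> (H, W)
    cam = np.tensordot(weights, activations, axes=1)
    # ReLU: keep only features with a positive influence on the class score
    cam = np.maximum(cam, 0)
    # normalize to [0, 1] for display as a heat map over the input image
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

In practice the resulting low-resolution map is upsampled to the input size and overlaid on the ultrasound image, which yields the red/blue visualizations described above.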

Discussion
The correct diagnosis of ovarian tumors is necessary for determining the appropriate treatment, and several machine learning methods have been studied for ovarian tumor diagnosis. It is of utmost importance to distinguish malignancies among the various ovarian tumors. Previous machine learning trials have distinguished only between benign and malignant tumors in small populations. In two papers from the Timmerman group evaluating the differentiation of ovarian tumors using machine learning, the first study showed 76.0% accuracy using an SVM on 187 ultrasound images10, and the second reported an accuracy of 77.0% using a local binary pattern coding operator4. However, the ultrasound images were based on the segmentation of lesions and displayed calipers placed by the ultrasound specialists, which introduces the limitation of an intervention bias. Another recent study reported that the manual removal of peripheral organs (e.g., the uterus) from the image resulted in a sensitivity of 96.0% and a specificity of 89.3% for distinguishing between benign and malignant ovarian tumors11. However, it is important not only to distinguish between malignant and benign tumors, but also to identify benign tumors that require surgery and those that may progress to malignancies. We performed deep learning analysis on normal ovaries and four common ovarian tumor types, and our CNN-CAE study differentiated the four types of ovarian tumor requiring surgical treatment with an accuracy of 97.0%. In particular, the AUC for malignancy was 0.94, clearly distinguishing malignant from benign tumors. By comparison, novice, intermediate, and advanced readers achieved accuracies of 66.8%, 86.9%, and 91.0%, respectively, suggesting that, with the aid of the CNN-CAE results, inexperienced examiners can diagnose ovarian tumors with high accuracy. Deep learning is vulnerable to imperceptible perturbations of input data.
Some studies have shown that small disturbances in the input data can significantly degrade deep learning performance12. In this study, CNN-CAE was used to remove marks from ultrasound images to improve diagnostic accuracy. We confirmed via Grad-CAM that disturbances such as calipers and annotations can affect the CNN model results, and we developed the CNN-CAE model to eliminate them. The CNN-CAE model successfully removed the calipers and annotations and classified ultrasound images with a high level of accuracy. These results show that even if marks are present on ultrasound images, they can be removed automatically so that only the ovary is assessed for the diagnosis. The Grad-CAM visualization results verify the reliability of the CNN-CAE model when utilizing data in which disturbances exist. Consequently, there is no longer a need to save a separate caliper-free image when acquiring an ultrasound image, and existing stored images containing calipers can be utilized in addition to images taken on-site. Our study is the first to address the removal of disturbances from ovarian ultrasound images via deep learning, and it also demonstrates the usefulness of a deep learning-based method for solving the problem of disturbances in medical data.
There are some limitations to this study. First, it was a retrospective cohort study performed on images of tumors with a known histological diagnosis, and it involved a relatively small number of patients and images. Second, only still ultrasound images were used; therefore, the multi-focal images that can be extracted from ultrasound video were not fully utilized. Nevertheless, the CNN-CAE model has the potential to be widely used not only to identify malignancy but also to classify benign tumors that require surgery. Furthermore, the model has the advantage that it can be applied not only to newly captured ultrasound images but also to previously captured ones. In future research, we will develop a classification model based on the most recent methods and examine various aspects of ovarian tumor imaging, such as clinical radiology and ultrasound imaging technique. In addition, we will conduct a study to improve classification accuracy using multi-focal images.

Material and methods
Overall process. The research algorithm is shown in Fig. 5. The collected images went through a pre-processing stage to eliminate the effects of different devices and acquisition conditions. As the marks on the collected images can affect model training, we removed them with the CAE. After the CAE model successfully removed the marks, the processed images were used to train and validate the CNN model. In the final stage, the model's results were compared against expert readings for validation, and CNN visualization methods were used to verify their reliability.
Dataset. This study was conducted using 1613 single-ovary ultrasound images from 1154 patients at Seoul St. Mary's Hospital who underwent surgical removal of tumors of one or both ovaries between January 2010 and March 2020 and had known pathology diagnoses. Preoperative ultrasound images were obtained using a GE Voluson E8 Expert ultrasound system (GE Healthcare, Milwaukee, WI, USA), an Accuvix XQ (Medison, Seoul, Korea), or a WS80A Elite (Samsung Medison Co., Ltd., Seoul, Korea). Representative grayscale images of the tumors were made by an expert ultrasonographer, and all of the images were stored in digitized JPG format. Color Doppler information was not taken into account in this study. The images were categorized according to the pathology diagnosis as (1) normal (no lesion), (2) cystadenoma (mucinous/serous), (3) mature cystic teratoma, (4) endometrioma, and (5) malignancy, including cancerous and borderline tumors. Representative images are shown in Fig. 1.

Ethical approval. This study was approved by the Institutional Review Board (IRB) of Seoul St. Mary's Hospital of the Catholic University of Korea College of Medicine (IRB approval no. KC18RESI0792). All procedures involving human participants were carried out in accordance with the ethical principles for medical research of the Declaration of Helsinki. The requirement for informed consent was waived by the IRB after in-depth review, because of the retrospective study design.
We separated the 1613 images into five data subsets for fivefold cross-validation. The five subsets were used for training and validation iteratively, yielding robust results from each independent run. Because only one ultrasound image was taken per ovary, no image of the same ovary appeared in both the training and validation data. The data overview is shown in Table 3. All images were processed in the following steps: (1) the frame information on the ultrasound images (patient identification number, name, and timestamp) was removed; (2) the ultrasound images were trimmed to a 1:1 aspect ratio so as not to distort the scale and shape of the original image; and (3) the pixel values were normalized to [0, 1], and the images were resized to 360 × 360 pixels for compatibility between the CAE and CNN models. For model training, we performed offline data augmentation on the training images only, expanding them eightfold by rotating each image by 90° three times and flipping each rotated image horizontally.

Methods. The images processed by the data acquisition methods described above still contained marks, such as calipers and annotations, which cannot easily be removed manually. These marks can affect the features used to classify each class. Therefore, we designed a CAE model that removes the marks on the images and regenerates the pixels where the marks were placed. The structure of the CAE model is shown in Fig. 6. The U-Net structure was used as a reference, and minor variations, such as a squeeze-and-excitation block (SE-block) and multi-kernel dilated convolution, were adopted to improve mark removal performance13. The operation of the CAE model proceeds in a series of steps.
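As an aside, the square trim, [0, 1] normalization, and eightfold augmentation described above can be sketched in NumPy as follows (a minimal illustration under our own assumptions: the frame-information removal and the resize to 360 × 360 are omitted, and the function names are ours, not the authors'):

```python
import numpy as np

def trim_and_normalize(img):
    """Center-trim a grayscale image to a 1:1 ratio and scale pixels to [0, 1]."""
    h, w = img.shape
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    square = img[top:top + s, left:left + s].astype(np.float32)
    return square / 255.0

def augment_eightfold(img):
    """Offline 8x augmentation: the image at 0/90/180/270 degrees,
    plus a horizontal flip of each rotation."""
    rotations = [np.rot90(img, k) for k in range(4)]
    return rotations + [np.fliplr(r) for r in rotations]
```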
First, the original image with the marks is input and passed through the squeeze-and-excitation block, which compresses and weights the information and features of the convolution layer14. These feature maps are passed into deeper layers by average pooling. In the deepest layer, the feature maps are merged with context information via dilated convolution operations with multiple dilation values. After passing through convolution and transposed convolution layers, the feature maps are decoded into a clean ultrasound image. During this process, the high-resolution information from the shallow layers is concatenated to the convolution layers to enhance resolution. The CAE model was trained on 171 pairs of marked and clean ultrasound images with the following training parameters: 200 epochs, a batch size of 2, mean squared error loss, the Adam optimizer, and a learning rate of 0.00005 decayed by a factor of 0.95 every epoch. These 171 image pairs were collected by saving images separately before calipers and annotations were added. The marked ultrasound image was input to the CAE model, and the model's output was compared with the clean image; the weight parameters of the CAE model were optimized to minimize the mean squared error between the two images.
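The paper does not name its implementation framework; assuming PyTorch, the training configuration above (MSE loss, Adam, a 0.00005 learning rate decayed by 0.95 each epoch, batch size 2) could be sketched as follows, with a deliberately tiny encoder-decoder standing in for the full U-Net/SE-block model:

```python
import torch
import torch.nn as nn

class TinyCAE(nn.Module):
    """Minimal stand-in autoencoder, NOT the paper's U-Net/SE-block architecture."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AvgPool2d(2),                       # downsample 360 -> 180
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),  # upsample back to 360
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),        # pixels in [0, 1]
        )
    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyCAE()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # x0.95 per epoch
loss_fn = nn.MSELoss()

# one illustrative step on a random marked/clean pair (batch size 2, as in the paper)
marked = torch.rand(2, 1, 360, 360)
clean = torch.rand(2, 1, 360, 360)
optimizer.zero_grad()
loss = loss_fn(model(marked), clean)   # MSE between reconstruction and clean image
loss.backward()
optimizer.step()
scheduler.step()                        # call once per epoch in a real loop
```

In the actual training loop, `marked`/`clean` would come from the 171 saved image pairs, and `scheduler.step()` would be called once per epoch rather than per batch.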
The purpose of this study was to produce a deep learning model for classifying ovarian tumors from ultrasound images. To date, deep learning models that have demonstrated success in image recognition have mostly been validated on general RGB images. Unlike RGB images, which directly capture reflected light, ultrasound images are obtained indirectly from sound waves reflected at the boundaries between different tissues. As a result, ultrasound images have low resolution and hazy object boundaries. Given these differences between ultrasound and RGB images, it is as yet unknown which deep learning model is optimal for classifying ultrasound images. We trained three CNN architectures, ResNet15, Inception-v316, and DenseNet17, which have been widely used and validated in many other classification tasks, to discover the best model for classifying ovarian tumors on ultrasound images. Each model's final layer was modified to have five output nodes for the five classes of ovarian ultrasound images. The model weights, except for the final layer, were initialized with parameters pre-trained on another computer vision task (ImageNet). We then trained these models, ResNet101, Inception48, and DenseNet121, to distinguish the five classes with the following training parameters: 50 epochs, a batch size of 8, cross-entropy loss, the Adam optimizer, and a learning rate of 0.00005 decayed by a factor of 0.95 every epoch. Among these models, the DenseNet121 structure showed the best result on the validation dataset, so we additionally trained other DenseNet models with different numbers of convolutional layers. The results for each structure are shown in Table 4 and Fig. 7, respectively.

Evaluation measures.
For the evaluation of ovarian tumor classification, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and the area under the receiver operating characteristic (ROC) curve (AUC) were used as performance measures. The mean and 95% confidence interval of the fivefold cross-validation results were used to calculate the performance measures.
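These measures can be computed directly from validation predictions; a minimal NumPy sketch for the binary malignant-versus-benign case (our own helper functions, not the authors' code) follows. The AUC is computed via the Mann-Whitney U formulation, i.e., the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV, and accuracy from binary labels/predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / y_true.size,
    }

def auc(y_true, scores):
    """AUC via the Mann-Whitney U statistic: fraction of (positive, negative)
    score pairs ranked correctly, with ties counting half."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

For the five-class setting, each class is evaluated one-versus-rest, and the fivefold cross-validation mean and 95% confidence interval are then computed over the five folds' values.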

Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Received: 19 April 2022; Accepted: 16 September 2022

Table 4. Fivefold cross-validation accuracy of the convolutional neural network models, with 95% confidence intervals (in parentheses) calculated on the fivefold validation dataset using Student's t-test.

Figure 7. Structure of the DenseNet model. DenseNet uses a DenseBlock that employs fewer parameters while enhancing the information flow and gradient flow.