Comparison of machine and deep learning for the classification of cervical cancer based on cervicography images

Cervical cancer is the second most common cancer in women worldwide, with a mortality rate of 60%. Cervical cancer begins with no overt signs and has a long latent period, making early detection through regular checkups vitally important. In this study, we compare the performance of two different approaches, machine learning and deep learning, for the purpose of identifying signs of cervical cancer using cervicography images. Using the deep learning model ResNet-50 and the machine learning models XGB, SVM, and RF, we classified 4119 cervicography images as positive or negative for cervical cancer using square images in which the vaginal wall regions were removed. The machine learning models extracted 10 major features from a total of 300 features. All tests were validated by fivefold cross-validation, and receiver operating characteristic (ROC) analysis yielded the following AUCs: ResNet-50 0.97 (95% CI 0.949–0.976), XGB 0.82 (95% CI 0.797–0.851), SVM 0.84 (95% CI 0.801–0.854), RF 0.79 (95% CI 0.804–0.856). The ResNet-50 model showed a 0.15 point improvement (p < 0.05) over the average (0.82) of the three machine learning methods. Our data suggest that the ResNet-50 deep learning algorithm could offer greater performance than current machine learning models for the purpose of identifying cervical cancer using cervicography images.

Cervical cancer is the second most common cancer in women worldwide and has a mortality rate of 60%. Currently, about 85% of the women who lose their lives to cervical cancer each year live in developing countries, where medical care is far more limited both in terms of available professionals and access to technology 1-3 . Many of these deaths could be prevented with access to regular screening tests, which would enable the effective treatment of precancer-stage lesions 4,5 . As cervical cancer has no overt signs during the early stages of progression and a long latent period, early detection through regular checkups is vitally important 6,7 .
The typical method of identifying cervical cancer is cervicography 8 , a process in which morphological abnormalities of the cervix are determined by human experts based on cervical images taken at maximum magnification after applying 5% acetic acid to the cervix 9 . However, this method is limited in that it requires sufficient human and material resources; accurate reading of cervicography images requires a licensed professional reader, and professional photography equipment capable of magnifying more than 50 times is needed to perform the procedure 10 . In addition, reader objectivity is limited and can only be increased through systematic and regular reader quality controls. As it stands, inter- and intra-observer errors likely exist, though without systematic controls in place, data on this topic is scarce. Furthermore, results can vary depending on the subjective views of the readers and the reader's reading environment 11,12 .
To compensate for these shortcomings, computer-aided diagnostic tools such as classic machine learning (ML) and deep learning (DL) have been used to recognize patterns useful for medical diagnosis 13,14 . ML is a broad class of methods that encompasses DL, and it refers to a series of processes that analyze and learn data before making decisions based on the learned information 15 . ML requires a feature engineering process that eliminates unnecessary variables and pre-selects only those that will be used for learning. This process is disadvantaged by the requirement that experienced professionals pre-select critical variables. Conversely, DL overcomes this shortfall through a process in which the system learns important features without pre-selected variables and without the human assumptions that pre-selected variables inherently include 16 . In this paper, the terms ML and DL are used separately.
In the 2000s, ML-based cervical lesion screening techniques began to be actively studied 17 . In 2009, an artificial intelligence (AI) research team in Mexico conducted a study to classify negative images, which are those clearly without indications of cancer, and positive images, which are those judged to require close examination, using the k-nearest neighbor algorithm (k-NN). k-NN has had moderate success in past studies; using images from 50 patients, k-NN was able to classify negative and positive images with a sensitivity of 71% and a specificity of 59% 18 .
In another study of ML classification, performed by an Indonesian group in 2020, image processing was applied to cervicography images and a classification of normal (negative) and abnormal (positive) images was conducted using a support vector machine (SVM) with an accuracy of 90% 19 .
In the field of cervical research, many research teams around the world have begun to focus on DL methods for the detection and classification of cancer 20 . In 2019, a group at Utah State University in the United States used a faster region convolution neural network (F-RCNN) to automatically detect the cervical region in cervicography images and classify dysplasia and cancer with an AUC of 0.91 21 . In 2017, a research group in Japan conducted an experiment in which 500 images of cervical cancer were classified into three grades [severe dysplasia, carcinoma in situ (CIS), and invasive carcinoma (IC)] using the research team's self-developed neural network, which showed an accuracy of about 50% during the early stages of development 22 . In 2016, Kangkana et al. conducted a study classifying Pap smear images using various models, including deep convolution neural networks, convolution neural networks, least square support vector machines, and softmax regression, with up to 94% accuracy 23 . ML and DL models are still being actively studied for the classification of medical images, especially for the purpose of cervical lesion screening.
In this study, we classified cervicography images as negative or positive for cervical cancer using ML and DL techniques in the same environment in order to compare their performance. ML, which requires variables pre-selected by humans to perform classification tasks, is a method that uses previously known diagnostic criteria, such as morphology or texture, as variables. Conversely, DL extracts what the system identifies as critical variables through the training algorithm itself, without the assumptions inherent in previous human analyses. Herein, we compare the performance of ML using previously determined diagnostic variables to that of DL, which extracts new statistical information potentially unknown to human experts. This comparison will allow future researchers to better choose which models are most suitable for their purposes, ultimately providing clinicians with systems capable of accurately assisting in the diagnosis of cervical cancer.

Data pre-processing. Cervicography images were generally wide compared to their heights. The cervical area was located in the center of the image, and the vaginal wall was often photographed on the left and right sides. In the ML feature-analysis stage, the entire input area is screened for feature extraction, so it is typical to remove areas outside the target region to prevent the accumulation and extraction of unnecessary data. In our dataset, provided that the cervical region was appropriately centered, the left and right ends were cropped to make the image size uniform and to set the width equal to the height. Likewise, for the DL model, the same pre-processed images were used as input in order to create comparable conditions.

Study design for ML analysis. The overall process of ML is shown in Fig. 1. Training sets were pre-processed images as described above.
After extracting more than 300 features from the pre-processed images in the feature extraction stage, only the major variables affecting classification were selected via the Lasso model. For the ML models, we used Extreme Gradient Boost (XGB), Support Vector Machine (SVM), and Random Forest (RF) classifiers trained on the selected variables. After training the models, fivefold cross-validation was performed using the test set to evaluate model performance.
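The center-crop pre-processing described above can be sketched as follows. This is a minimal NumPy illustration under the paper's stated assumption that the cervix is centered and the image is wider than it is tall; the function name and example shapes are ours, not from the original pipeline:

```python
import numpy as np

def center_square_crop(img):
    # Crop equal amounts from the left and right edges so that the
    # width equals the height, keeping the (assumed centered) cervix.
    h, w = img.shape[:2]
    off = (w - h) // 2
    return img[:, off:off + h]

wide = np.arange(32).reshape(4, 8)   # toy "image": 4 rows, 8 columns
square = center_square_crop(wide)    # keeps the central 4 x 4 region
```

Applying the same crop to both the ML and DL inputs is what makes the two pipelines comparable.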

Methods
Eighteen first-order features were identified, and 24 Grey Level Co-occurrence Matrix (GLCM), 16 Grey Level Run Length Matrix (GLRLM), 16 Grey Level Size Zone Matrix (GLSZM), and 226 Laplacian of Gaussian (LoG)-filtered first-order features were included as second-order features. A total of 300 features from five categories were extracted from the training set images 24 .
A first-order feature is a value that relies only on individual pixel values, capturing one-dimensional characteristics such as the mean, maximum, and minimum. Building on first-order features, the GLCM is a second-order matrix that takes into account the spatial relationship between a reference pixel and its adjacent pixels, i.e., the pixels located to the east, northeast, north, or northwest of the reference pixel. The GLRLM, another second-order feature matrix, counts how many consecutive pixels share the same value along a given direction and run length. The GLSZM considers the zones formed by connected neighbouring pixels, creating a matrix that quantifies how large the regions of same-valued pixels are. Finally, the LoG-filtered first-order features are obtained by applying the Laplacian of Gaussian (LoG) filter and then computing first-order features. The LoG filter applies a Laplacian filter after smoothing the image with a Gaussian filter, a technique commonly used to find contours, which can be thought of as points around which the image changes rapidly.
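As a concrete illustration of the difference between the two orders of features, a GLCM for a single offset can be built directly from pixel pairs. This is a deliberately simplified NumPy sketch (not the feature library used in the study), restricted to one direction and a small number of grey levels:

```python
import numpy as np

def first_order_features(img):
    # First-order features depend only on the pixel-value distribution.
    return {"mean": img.mean(), "max": img.max(), "min": img.min()}

def glcm(img, dx=1, dy=0, levels=4):
    # Second-order: count co-occurrences of grey levels between each
    # pixel and its neighbour at offset (dy, dx); (0, 1) is "east".
    m = np.zeros((levels, levels), dtype=int)
    h, w = img.shape
    for y in range(max(0, -dy), min(h, h - dy)):
        for x in range(max(0, -dx), min(w, w - dx)):
            m[img[y, x], img[y + dy, x + dx]] += 1
    return m

img = np.array([[0, 0, 1],
                [1, 1, 0]])
east = glcm(img, dx=1, dy=0, levels=2)  # 2 x 2 co-occurrence counts
```

Summary statistics of such a matrix (contrast, homogeneity, and so on) are what enter the feature set as GLCM features.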
ML generally adopts only the key features among those extracted, creating models that are easier to interpret, perform better, and train faster. The lasso feature-selection method, which uses L1 regularization, is commonly used to create training data for these models: only a few important variables are selected, and the coefficients of the remaining variables are shrunk to zero (Table 1).
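A minimal sketch of how L1 regularization performs this selection is shown below, using coordinate descent with soft-thresholding. This is our own illustrative implementation on synthetic data, not the exact solver or features used in the study:

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=200):
    # Minimise 0.5 * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent.
    # The soft-thresholding step drives the coefficients of weak features
    # exactly to zero, which is what performs the feature selection.
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            resid = y - X @ w + X[:, j] * w[j]   # residual excluding feature j
            rho = X[:, j] @ resid
            z = X[:, j] @ X[:, j]
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z
    return w

# Synthetic example: only the first of five features actually matters.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3.0 * X[:, 0]
w = lasso_cd(X, y, alpha=10.0)
```

With a sufficiently large penalty, the four irrelevant coefficients land exactly at zero while the informative one survives, mirroring how 10 of the 300 extracted features were retained.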

ML classification architectures.
For ML classification, we used the XGB, RF, and SVM architectures.
XGB is a boosting method that combines weak predictive models to create strong predictive models 26 . As shown in Fig. 2a, a pre-pruning method is used so that each new tree compensates for the errors of the previous tree. The RF model, shown in Fig. 2b, is a bagging method: after a random selection of variables, multiple decision trees are created, and their results are integrated by an ensemble technique to classify the data 27 . SVM is a linear classification method; when classifying two classes, as shown in Fig. 2c, it finds the data points closest to the decision boundary, called support vectors, and selects the boundary that maximizes the margin between itself and the support vectors 28 .

Study design for DL analysis. The entire DL process is shown in Fig. 3. After pre-processing images as was done for the ML models, the model was created based on the ResNet-50 architecture. The generated model was then applied to the test set, and the performance of the ML and DL models was evaluated through fivefold cross-validation.

DL classification architecture.
For DL, we used the ResNet-50 algorithm, a type of deep convolution neural network (DCNN) (Fig. 4a). As shown in Fig. 4b, a traditional CNN learns the optimal mapping of the input x through its learning layers alone, while ResNet-50 learns the optimal F(x) + x by adding the input x back after the learning layers. This approach reduces both network complexity and the vanishing-gradient problem, resulting in faster training 29 . We then applied a transfer-learning technique using a network pre-trained on ImageNet 30 . At the beginning of training, the weights of the pre-trained layers were frozen; at the point where the loss no longer fell, the newly added layers were judged to be well trained, the weights of all layers were made trainable, and training was resumed. Training parameters were set to a batch size of 40 and 300 epochs, which was suitable for the computing power of the hardware. The learning rate was set to 0.0001 to prevent large changes to the transferred weights. To improve training speed, the images were resized to 256 × 256.
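The residual connection F(x) + x can be illustrated with a toy fully-connected block. This NumPy sketch is purely illustrative; the layer sizes, weight shapes, and ReLU placement are our simplifying assumptions, not the actual ResNet-50 convolutions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    # A residual unit computes F(x) + x: the learned transform F is added
    # to the identity shortcut, so the block only needs to model the
    # residual rather than the full mapping.
    return relu(W2 @ relu(W1 @ x) + x)

x = np.array([1.0, -2.0, 3.0])
W1 = np.zeros((3, 3))  # untrained weights: F(x) == 0
W2 = np.zeros((3, 3))
# With F == 0 the block reduces to the identity (up to the final ReLU),
# which is why very deep stacks of such blocks remain easy to optimise.
out = residual_block(x, W1, W2)
```

This is the property the text refers to: gradients can flow through the shortcut even when the learned layers contribute little.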
Evaluation process. Cross-validation is an evaluation method that prevents overfitting and improves accuracy when evaluating model performance. We validated the classification performance of the two algorithms with fivefold cross-validation, a method in which the dataset is split into five parts so that every sample is tested exactly once over five verification rounds. To ensure comparable results, the same five training sets and test sets were used for each method.
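The fold construction described above can be sketched as follows; a minimal NumPy version with placeholder model and scoring callables (the function names are ours, for illustration only):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    # Shuffle once, then split into k disjoint folds; each sample appears
    # in exactly one test fold, so every example is tested exactly once.
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(model_fn, score_fn, X, y, k=5):
    scores = []
    for fold in kfold_indices(len(y), k):
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False              # fold = test set, rest = training set
        model = model_fn(X[mask], y[mask])
        scores.append(score_fn(model, X[~mask], y[~mask]))
    return scores
```

Using the same seed (and therefore the same folds) for every model is what makes the ML and DL scores directly comparable.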
Using the true-positive, false-positive, true-negative, and false-negative counts, we evaluated each model with the following metrics. Precision, also known as positive predictive value (PPV), is the proportion of cases classified as positive that are truly positive. Recall, also known as sensitivity, is the proportion of truly positive cases that the model predicts to be positive. The F1-score is the harmonic mean of precision and recall. Accuracy is the proportion of all predictions, positive and negative, that are correct.
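The four metrics follow directly from the confusion-matrix counts; the example counts below are illustrative, not from the study:

```python
def classification_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)           # PPV: predicted positives that are real
    recall = tp / (tp + fn)              # sensitivity: real positives found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + tn + fn)          # all correct predictions
    return precision, recall, f1, accuracy

p, r, f1, acc = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```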

Results
Visualization. The bar graph in Fig. 5 shows the 10 selected features and the importance of each feature to the ML models. Features with coefficients greater than zero have a positive linear relationship with a positive classification, while features with coefficients less than zero have a negative linear relationship.
To determine which area the DL recognized as negative or positive, results from the test set were visualized using a Class Activation Map (CAM) to show which areas were given more weight (Fig. 6).
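The CAM computation amounts to a weighted sum of the final convolutional feature maps, where the weights come from the classifier layer for the target class. A minimal NumPy sketch (shapes and values are illustrative, not from the trained model):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    # feature_maps: (channels, H, W) output of the last convolution;
    # class_weights: (channels,) classifier weights for the target class.
    # Weighting each channel map and summing over channels yields a
    # spatial heat map of where the evidence for that class is located.
    return np.tensordot(class_weights, feature_maps, axes=1)

fm = np.zeros((2, 2, 2))
fm[0, 0, 0] = 1.0      # channel 0 activates at the top-left
fm[1, 1, 1] = 2.0      # channel 1 activates at the bottom-right
w = np.array([0.5, 1.0])
cam = class_activation_map(fm, w)   # 2 x 2 heat map
```

Upsampling such a map to the input resolution and overlaying it on the image produces visualizations like those in Fig. 6.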

Evaluation.
To evaluate the performance of the ML and DL models, results were validated by fivefold cross-validation using the precision, recall, F1-score, and accuracy indicators, as shown in Fig. 7.

Discussion
Principal findings. In this study, we compared the performance of ML and DL models by automatically classifying cervical images as negative or positive for signs of cervical cancer. The ResNet-50 architecture's AUC was 0.15 points higher than the average of the XGB, RF, and SVM models.

Results. Herein, we investigated the performance of ML and DL models to determine which algorithm would be more suitable to assist clinicians with the accurate diagnosis of cervical cancer. Using 1984 negative images and 2135 positive images, a total of 4119 cervicography images, 10 of 300 features were selected from the pre-processed images by linear regression. Three algorithms (XGB, SVM, and RF) were used to create the ML classification models. The DL classification model, with the ResNet-50 architecture, was generated using the same pre-processed images. For both the ML and DL techniques, results were made more reliable by testing all datasets with fivefold cross-validation. The AUC values for XGB, SVM, and RF were 0.82, 0.84, and 0.79, respectively, while ResNet-50 showed an AUC of 0.97. The ML algorithms did not exceed 0.80 in accuracy, while ResNet-50 reached an accuracy of 0.9065, a relatively better performance.
Clinical implications. Generally, when diagnosing cervical cancer in clinical practice, lesions are diagnosed by compiling several data points, including the thickness of the aceto-white area, the presence of a transformation zone, and tumor identification. Given the complexity of this diagnostic process, the end-to-end method of DL, which learns a single mapping from image to diagnosis and weighs all contributing factors jointly, likely contributed to the DL model's improved performance in the cervical cancer classification task. Compared to DL, ML splits the problem into multiple parts, obtains an answer for each, and simply combines the results. We speculate that this step-by-step approach may have had difficulty capturing and learning such a complex diagnostic process.

Research implications.
In terms of algorithms, DL identifies and learns meaningful features from the totality of features by itself, while ML requires that unnecessary features be removed by human experts before training. This difference could be responsible for the decreased performance of the ML models. Since DL learns low-level features in the initial layer and high-level features as the layers deepen, the weight of high-level features, which are not learned by ML, is likely responsible for the difference in performance between the two types of systems. Thus, DL likely performed better due to its integration of high-level features.
In this study, cervical images were cropped to create a uniform dataset as was needed for the ML models and to provide the basis for an accurate comparison between the DL and ML architectures. In future research, the addition of a DL-based cervical detection model to this classification task could further improve the accuracy of model comparisons by facilitating the selection of only the appropriate areas to be analyzed.

Strengths and limitations.
This study is the first to compare the performance of DL and ML in the field of automatic cervical cancer classification. Compared to other studies that have produced results using only one method, either DL or ML, this work enables cervical clinicians to objectively evaluate which automation algorithms are likely to perform better as computer-aided diagnostic tools. In pre-processing, the same width was cropped from both ends of each image to remove the vaginal wall areas, under the assumption that the cervix was exactly in the middle. However, not all images had the cervix in the center, and in some the cervix may have been distorted in shape or located outside the retained area. The cropped images we used may therefore still have contained unnecessary vaginal wall regions, or parts of the cervix intended for analysis may have been cropped out. This may have disproportionately decreased the accuracy of one model or the other, weakening the comparison.
In addition, data augmentation is known to increase the generalization of networks and to reduce overfitting during training. Applying data augmentation in future studies may make it possible to compare models at a higher level of performance.
Moreover, when selecting ML features, the lasso technique was used and 10 features were selected. However, adopting a different feature-selection method, or selecting more or fewer than 10 features, could produce a completely different outcome. The fact that human intervention is involved in the ML process itself is a major disadvantage and could mean that ML and DL models cannot be compared with complete accuracy.

Conclusion
Herein, the performance of ML and DL techniques was objectively evaluated and compared via the classification of cervicography images pre-processed by the same methods.
The results of this research can serve as a criterion for the objective evaluation of which techniques will likely provide the most robust computer-assisted diagnostic tools in the future.
Furthermore, when diagnosing cervical cancer, it may be clinically relevant to consider the diagnostic factors identified by multiple model architectures.
In future studies, a more accurate comparison of cervical cancer classification performance could be conducted by adding a detection model that accurately detects and analyzes only the cervix. Finding and adopting better techniques for feature selection could also minimize human intervention in ML, strengthening the comparison between different model architectures. We expect these future studies to allow for a more objective comparison of different model architectures that will ultimately assist clinicians in choosing appropriate computer-assisted diagnostic tools.