A convolutional deep learning model for improving mammographic breast-microcalcification diagnosis

This study aimed to assess the diagnostic performance of deep convolutional neural networks (DCNNs) in classifying breast microcalcification in screening mammograms. To this end, 1579 mammographic images were collected retrospectively from patients exhibiting suspicious microcalcification in screening mammograms between July 2007 and December 2019. Five pre-trained DCNN models and an ensemble model were used to classify the microcalcifications as either malignant or benign. Approximately one million images from the ImageNet database had been used to train the five DCNN models. Herein, 1121 mammographic images were used for individual model fine-tuning, 198 for validation, and 260 for testing. Gradient-weighted class activation mapping (Grad-CAM) was used to confirm the validity of the DCNN models in highlighting the microcalcification regions most critical for determining the final class. The ensemble model yielded the best AUC (0.856). The DenseNet-201 model achieved the best sensitivity (82.47%) and negative predictive value (NPV; 86.92%). The ResNet-101 model yielded the best accuracy (81.54%), specificity (91.41%), and positive predictive value (PPV; 81.82%). The high PPV and specificity achieved by the ResNet-101 model, in particular, demonstrated the model's effectiveness in microcalcification diagnosis, which, in turn, may considerably help reduce unnecessary biopsies.


Scientific Reports | (2021) 11:23925 | https://doi.org/10.1038/s41598-021-03516-0

The regions of an image that most influence a DCNN's decision can be visualized using the gradient-weighted class activation mapping (Grad-CAM) method, which calculates the weighted sum of the feature maps in each convolutional layer 14. Therefore, in this study, we fine-tuned the weights of five pre-trained deep learning models instead of training DCNN models from scratch and used these models to assess the diagnostic performance of DCNNs for classifying suspicious breast microcalcifications in screening mammograms. Grad-CAM was used to confirm the validity of the DCNN models.

Results
Of all 1579 microcalcifications observed, 589 (37.3%) were histologically proven to be malignant lesions. Table 1 lists the diagnostic performance measures of the five DCNN models at a learning rate of 1e−4. The best area under the receiver operating characteristic (ROC) curve (AUC) among all the models (0.856) was achieved by the ensemble model. The best sensitivity (82.47%) and NPV (86.92%) were obtained by the DenseNet-201 model, and the best accuracy (81.54%), specificity (91.41%), and PPV (81.82%) were achieved by the ResNet-101 model. Based on the DeLong test, the AUCs of the DCNN models differed significantly from that of the ensemble model. Based on the generalized estimating equation (GEE) method, the sensitivity, specificity, and PPV differed significantly among the DCNN models and the ensemble model. The diagnostic performances (accuracy, AUC, sensitivity, specificity, PPV, NPV) of the five DCNN models are depicted in Fig. 1. The results obtained at a 1e−5 learning rate are described in the Supplementary materials. The p values listed in Table 1 are overall p values that indicate the overall difference between each DCNN model and the ensemble model. As the overall p values indicated a significant difference, pairwise comparisons were performed to calculate the p value for each pair of models. The pairwise diagnostic performance comparisons of the DCNN models at the 1e−4 and 1e−5 learning rates are presented in the Supplementary materials.
The Grad-CAM method generated heatmaps showing the regions the model relied on as evidence for the malignancy or benignity of microcalcifications. Figures 2 and 3 depict the Grad-CAM heatmaps generated by the Inception-ResNet-v2 model. The heatmaps highlight the areas that contributed to predicting malignant or benign microcalcification, with areas of strong influence marked in red and areas of weak influence in blue. The upper-right heatmaps in Figs. 2 and 3 show that the red areas are important for determining the malignancy or benignity of microcalcification. In cases where the DCNN model misclassified images, the heatmaps revealed that the model had focused on unsuitable areas, such as those shown in the lower-right images of Figs. 2 and 3.

Discussion
In this study, we investigated the diagnostic performance of DCNNs for patients with suspicious microcalcification in screening mammograms. The ensemble model AUC was 0.856. The specificity and PPV of ResNet-101 were 91.41% and 81.82%, respectively. These results suggest that DCNNs can help radiologists determine the malignancy or benignity of microcalcifications. Earlier studies reported substantial inter-reader variability, and the PPV for mammographic microcalcifications in those studies usually did not exceed 30% 3-6; however, the ResNet-101 PPV obtained herein (81.82%) was considerably higher. This, along with the high specificity (91.41%) of ResNet-101, can be essential for reducing the number of unnecessary biopsies that burden patients.
Transfer learning is commonly used in deep learning applications. It has been highly effective in the medical domain, wherein the amount of data is generally limited 13,15. In this study, we utilized five state-of-the-art DCNN models: ResNet-101, Xception, Inception-v3, Inception-ResNet-v2, and DenseNet-201 16-20. Each model was pretrained on ImageNet 21, and the resultant model weights were fine-tuned instead of training new models from scratch. Through transfer learning, the microcalcification classification performance of the fine-tuned DCNNs could be assessed to improve the differential diagnosis of breast cancer and reduce false-positive diagnoses.
The interpretability of DCNN models is essential in the medical imaging field. The DCNN model results were visualized for interpretation using the Grad-CAM technique. The superimposed heatmaps generated via Grad-CAM yielded additional information useful in clinical practice. These Grad-CAM results reflected the microcalcification regions that most affected the DCNN model predictions.
Although the applicability of DCNN models for classifying suspicious microcalcification was demonstrated herein, the limitations of this study should be considered. First, a relatively small population was studied. Although fivefold data augmentation of the training data was performed and pre-trained networks were adapted to address this limitation, a risk of overfitting during training may remain. Second, the data analyzed were localized to a single hospital center. Additional investigations involving larger, multi-center populations are necessary to further analyze and improve the DCNN model performances, particularly in terms of their PPV. Third, the sensitivities of the models were low compared to the other performance measures; this problem might be mitigated by training with more data. Finally, our study population sampled only the patients recommended for biopsy owing to suspicious microcalcifications; patients for whom a biopsy was warranted but not recommended may have been excluded despite their eligibility. This selection bias can limit the generalizability of the results obtained.

Materials and methods
Study subjects. The institutional review board (IRB) of Gangnam Severance Hospital (Approval Number: 3-2018-0176) approved this retrospective, single-center study, and the requirement for written informed consent was waived. All research was performed in accordance with relevant guidelines and regulations.
The cohort comprised patients exhibiting suspicious microcalcifications in screening mammograms who underwent stereotactic vacuum-assisted biopsy (SVAB) or surgical excisional biopsy for histopathologic confirmation between July 2007 and December 2019. Our study focused only on microcalcifications; microcalcification lesions associated with masses were excluded. Furthermore, patients with breast-related symptoms, a history of breast cancer, previous breast operations, or implants, as well as those without histopathologic results, were excluded. For each patient, the mammography examination followed by histopathologic confirmation during our study period was considered the index examination. For the final analysis, 1579 mammograms from 821 patients were utilized. The dataset was split into three distinct sets. For training and validating the DCNN models, 1319 mammograms obtained from 676 patients between 2007 and 2017 were used. The test dataset consisted of 260 mammograms obtained from 145 patients between 2018 and 2019.
All mammographic images were obtained via a full-field digital mammography unit (Lorad Selenia, Hologic, Danbury, CT, USA). Standard craniocaudal and mediolateral oblique views were obtained for both breasts along with spot-magnification views over the microcalcification regions.

DCNN architectures.
Over the past few years, DCNN performance has improved drastically with increasing network depth. In this study, we used five state-of-the-art DCNN architectures, ResNet-101 16, Xception 17, Inception-v3 18, Inception-ResNet-v2 19, and DenseNet-201 20, which were pre-trained on the ImageNet dataset, to implement transfer learning from natural images to microcalcification images. The features of the five DCNN models are described in the Supplementary materials.
The classifier layers in the DCNN models were replaced with a new classifier for microcalcification characterization. To this end, the weights of the existing convolution layers were initialized with the weights of the pre-trained DCNN models, whereas the new classifier layers were initialized with random weights. Subsequently, the parameters of the convolution and classifier layers were fine-tuned through backpropagation using the microcalcification-image dataset.

DCNN model fine-tuning. Model fine-tuning was performed on a Windows 10 personal computer (PC) with an NVIDIA GTX 2080Ti graphics processing unit (GPU). The five DCNN models were implemented in MATLAB R2020a using the Deep Learning Toolbox.
Data preparation. Of the 1579 mammographic images, 1121 (positive, 418; negative, 703) were randomly chosen as training data and 198 (positive, 74; negative, 124) as validation data. The remaining 260 images (positive, 97; negative, 163) were independently selected as test data. To generate the target classification outcomes in each dataset, a radiologist with more than 10 years of experience in mammography cropped each lesion by applying a square region of interest (ROI) of 512 × 512 pixels using routines written in MATLAB. Each cropped image was classified as either benign or malignant and saved in the corresponding folder. Data augmentation of the training dataset was used to avoid overfitting and increase the number of training examples, because deep learning yields better results with more data. The 1121 training images were randomly flipped horizontally and vertically, rotated between 1° and 359°, and translated along the x- and y-axes by −30 to +30 pixels. The validation and test data were not augmented.
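As an illustration of this augmentation scheme, a minimal pure-Python sketch (not the authors' MATLAB routines; nearest-neighbour resampling and zero padding are assumptions here) might look as follows:

```python
import math
import random

def augment(img, angle_deg, dx, dy, flip_h, flip_v):
    """Flip, rotate about the image centre, and translate a 2-D image
    (list of lists); pixels mapped from outside the image become 0."""
    h, w = len(img), len(img[0])
    if flip_h:
        img = [row[::-1] for row in img]
    if flip_v:
        img = img[::-1]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    rad = math.radians(angle_deg)
    cos_a, sin_a = math.cos(rad), math.sin(rad)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # inverse-map each output pixel back into the source image
            sx = cos_a * (x - cx - dx) + sin_a * (y - cy - dy) + cx
            sy = -sin_a * (x - cx - dx) + cos_a * (y - cy - dy) + cy
            si, sj = int(round(sy)), int(round(sx))
            if 0 <= si < h and 0 <= sj < w:
                out[y][x] = img[si][sj]
    return out

def random_augment(img, rng=random):
    """Draw the random parameters used in the study: horizontal/vertical
    flips, rotation of 1-359 degrees, and +/-30-pixel translations."""
    return augment(img,
                   angle_deg=rng.uniform(1, 359),
                   dx=rng.randint(-30, 30),
                   dy=rng.randint(-30, 30),
                   flip_h=rng.random() < 0.5,
                   flip_v=rng.random() < 0.5)
```

In practice, augmentation on 512 × 512 crops would use a vectorized library, but the geometry is the same.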
Training. When fine-tuning the pre-trained networks, we used the adaptive moment estimation (Adam) optimizer, which combines the advantages of the adaptive gradient algorithm (AdaGrad) and root mean square propagation (RMSProp) techniques. The Adam optimizer (β1 = 0.9, β2 = 0.999) was used with an initial learning rate of 1e−4 for fine-tuning the five DCNNs. To help the DCNN models avoid overfitting and generalize well, we used L2 regularization (weight decay) to penalize large weights. The hyper-parameter controlling the amount of regularization was set to 0.0005 for all DCNN models. Furthermore, we adopted an early stopping strategy to monitor the validation loss. The "patience," i.e., the number of epochs the model waits to observe a reduction in validation loss before stopping the training, was set to 20. The mini-batch sizes were set to 32 for Xception and Inception-v3, 16 for ResNet-101 and Inception-ResNet-v2, and 8 for DenseNet-201; these sizes were determined based on the maximum capacity of the GPU. The best DCNN models were selected based on the best AUCs obtained on the validation set.
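The early-stopping rule is framework-agnostic; a minimal sketch of the patience logic (the study itself used MATLAB's Deep Learning Toolbox, so the function names below are illustrative) is:

```python
def fit_with_early_stopping(train_epoch, val_loss, patience=20, max_epochs=500):
    """Run training epochs, stopping once the validation loss has not
    improved for `patience` consecutive epochs; return the best epoch
    and its validation loss."""
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch in range(max_epochs):
        train_epoch(epoch)          # one pass over the training data
        loss = val_loss(epoch)      # loss on the held-out validation set
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:    # patience exhausted: stop training
                break
    return best_epoch, best_loss
```

The model weights saved at `best_epoch` would then be the ones carried forward to testing.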

Interpretation of DCNN models via Grad-CAM.
To understand the basis of the classification decisions made by the five DCNN models, we employed the Grad-CAM technique 14. Grad-CAM utilizes the gradient of the classification score with respect to the convolutional features determined by the network to identify the areas of the image most critical for classification. Areas with large gradients are those on which the final score depends most strongly. Grad-CAM visualizes these areas as heatmaps.
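Concretely, Grad-CAM global-average-pools the gradients of the class score over each feature channel to obtain per-channel weights, then applies a ReLU to the weighted sum of the feature maps. A toy sketch on plain Python lists (illustrative only, not the implementation used in the study):

```python
def grad_cam(feature_maps, gradients):
    """Toy Grad-CAM for one convolutional layer.
    feature_maps: K channel maps, each H x W (lists of lists).
    gradients:    d(class score)/d(activation), same shape.
    Returns ReLU(sum_k alpha_k * A_k) as an H x W map."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    # alpha_k: global average pool of the gradients of channel k
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(alphas, feature_maps):
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha * fmap[i][j]
    # ReLU keeps only the regions that push the class score up
    return [[max(v, 0.0) for v in row] for row in cam]
```

In practice the resulting map is upsampled to the input resolution and overlaid on the mammogram as the red-to-blue heatmaps shown in Figs. 2 and 3.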
Ensemble DCNN model. Several strategies exist for combining outputs from different models 22. For example, the feature vectors from different models can be concatenated and a single classifier trained on the resulting higher-dimensional input. Another approach involves averaging the model outputs; this strategy was used herein: the five DCNN models were trained separately, and their predictions were averaged at test time.
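The output-averaging strategy reduces to averaging each image's predicted malignancy probability across the models; a minimal sketch (variable names are illustrative):

```python
def ensemble_average(model_probs):
    """Average per-image class probabilities across models.
    model_probs: one list of probabilities per model, aligned by image."""
    n_models = len(model_probs)
    return [sum(per_image) / n_models for per_image in zip(*model_probs)]
```

The averaged probabilities are then thresholded exactly like a single model's output.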
Data and statistical analysis. Histopathologic results obtained via SVAB or excisional biopsy served as reference standards for the imaging findings. Ductal carcinoma in situ (DCIS) and invasive carcinoma were classified as malignant. All other final histopathologic results, including high-risk lesions, were classified as benign.
To evaluate the discriminative power of the DCNN models, the following quantitative measures were calculated: overall classification accuracy (ACC), sensitivity (SENS), specificity (SPEC), PPV, NPV, and AUC. The best cutoff point for each DCNN model to differentiate between malignant and benign microcalcification was set based on the maximal Youden index (sensitivity + specificity − 1) 23. The GEE method was used to compare the diagnostic performances of the individual DCNN models and the ensemble model in terms of accuracy, sensitivity, specificity, PPV, and NPV 24. We applied the non-parametric DeLong test of differences to the six AUCs 25. The statistical analysis was performed using SAS software (version 9.4; SAS Institute, Cary, NC). A p value < 0.05 was considered to indicate a statistically significant difference.
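The Youden-index cutoff selection can be sketched as follows (an illustrative Python reimplementation; the statistical analysis in the study was performed in SAS):

```python
def youden_cutoff(scores, labels):
    """Pick the threshold maximising J = sensitivity + specificity - 1,
    scanning every observed score as a candidate cutoff.
    scores: predicted malignancy probabilities; labels: 1 = malignant."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / n_pos + tn / n_neg - 1  # Youden index at threshold t
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j
```

Once the cutoff is fixed on the validation scores, ACC, SENS, SPEC, PPV, and NPV follow directly from the resulting confusion matrix on the test set.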