Applying Data-driven Imaging Biomarker in Mammography for Breast Cancer Screening: Preliminary Study

We assessed the feasibility of a data-driven imaging biomarker based on weakly supervised learning (DIB; an imaging biomarker derived from large-scale medical image data with deep learning technology) in mammography (DIB-MG). A total of 29,107 digital mammograms from five institutions (4,339 cancer cases and 24,768 normal cases) were included. After matching patients’ age, breast density, and equipment, 1,238 cases each were chosen as validation and test sets, and the remainder were used for training. The core algorithm of DIB-MG is a deep convolutional neural network, a deep learning algorithm specialized for images. Each sample (case) is an exam composed of four-view images (RCC, RMLO, LCC, and LMLO). For each case in the training set, the cancer probability inferred by DIB-MG is compared with the per-case ground-truth label, and the model parameters in DIB-MG are then updated based on the error between the prediction and the ground truth. At the operating point (threshold) of 0.5, sensitivity was 75.6% and 76.1% when specificity was 90.2% and 88.5%, and AUC was 0.903 and 0.906 for the validation and test sets, respectively. This research showed the potential of DIB-MG as a screening tool for breast cancer.

Scientific Reports | (2018) 8:2762 | DOI: 10.1038/s41598-018-21215-1

The purpose of our study was to assess the feasibility of DIB in mammography (DIB-MG) and to evaluate its potential for the detection of breast cancer.

Materials and Methods
Data collection. Five institutions (all tertiary referral centers) formed a consortium for the imaging database. All study protocols were approved by the institutional review board of Yonsei University Health System (approval number: 1-2016-0001) and the requirement for informed consent was waived. All experiments were conducted in accordance with the Good Clinical Practice guidelines. For algorithm development, digital mammography images were retrospectively obtained from PACS. We included women with four views of digital mammograms. Exclusion criteria were as follows: (1) women with previous surgery for breast cancer, (2) women with previous surgery for benign breast disease within 2 years, (3) women with a mammoplasty implant, and (4) women with a mammographic clip or marker. All cancer cases were confirmed by pathology, and all normal cases were defined as BI-RADS category 1 (negative) without development of malignancy during at least 2 years of follow-up. Both screening and diagnostic mammograms were included. This study focused solely on whether our algorithm could discriminate cancer from normal cases, so presumed benign cases (BI-RADS categories 2, 3, 4, and 5 without cancer) were not included. Accordingly, 29,107 digital mammogram sets were obtained, comprising 4,339 cancer cases and 24,768 normal cases. For all images in the data sets, radiologists recorded breast density, cancer type (invasive vs noninvasive), features (mass, mass with microcalcifications, asymmetry or focal asymmetry, distortion, microcalcification only, etc.) and size of the invasive cancer. For cancers showing mass with microcalcifications, both mass and microcalcifications were recorded as features. Breast density was recorded using the BI-RADS standard terminology of almost entirely fatty (A), scattered fibroglandular densities (B), heterogeneously dense (C), and extremely dense (D) [6].

Data sets.
Among the 4,339 cancer cases, training, validation and test sets were randomly selected at a ratio of 5:1:1 (3,101/619/619). The sets were balanced in terms of patients' age, breast density, and manufacturer, as well as cancer type, feature, and size, in order to remove selection bias between the training, validation and test sets (Table 1). The predominant features of cancer were mass (n = 2,366) and microcalcifications (n = 1,962), so other features (asymmetry or focal asymmetry (n = 463), distortion (n = 100)) were not controlled in the data sets.
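The 5:1:1 partition of the cancer cases can be illustrated with a minimal sketch. It assumes a plain seeded random split over hypothetical integer case IDs; the balancing by age, density, manufacturer, and cancer characteristics described above is not reproduced, and `split_cases` is an illustrative name rather than anything from the study.

```python
import random

def split_cases(case_ids, n_val, n_test, seed=0):
    """Shuffle case IDs and carve off validation and test subsets.

    Plain random split only: the balancing by age, density,
    manufacturer, and cancer characteristics is NOT reproduced here.
    """
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test

# Hypothetical integer IDs standing in for the 4,339 cancer cases.
cancer_ids = range(4339)
train, val, test = split_cases(cancer_ids, n_val=619, n_test=619)
print(len(train), len(val), len(test))  # 3101 619 619
```

Fixing the validation and test sizes at 619 and assigning the remainder to training reproduces the reported 3,101/619/619 counts.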
Among the 24,768 normal cases, the same numbers of validation (n = 619) and test (n = 619) cases were randomly selected, and the rest were used for training. For normal cases, each partition of the dataset was balanced in terms of patients' age, breast density, and manufacturer in order to remove selection bias (Table 2).

Development of the Algorithm.
A deep convolutional neural network (DCNN) is a deep learning algorithm specialized for images [22]. Each convolutional layer extracts features hierarchically (layer by layer) to abstract semantics from the raw input images. DIB-MG is implemented based on a residual network (ResNet) [23], a state-of-the-art DCNN model for image recognition. Figure 1 shows the overall architecture of DIB-MG. It consists of two initial blocks (init_block), four residual blocks (residual_block), and an aggregation block (aggregate_block). Each residual block includes four consecutive convolution layers with a skip connection, as shown at the bottom right of Fig. 1 (⊕ is an element-wise addition operator), while the other blocks include a single convolution layer. Each block also includes auxiliary components such as a batch normalization layer (BN: normalization of activations within a batch) [24], a rectified linear unit (ReLU: a simple mathematical function for non-linear activation) [22], a max-pooling layer (P_max: static dimension reduction function for translation-invariant feature abstraction) [22], and a global-average-pooling layer (GP_avg: average of the entire 2-dimensional input feature map) [23]. Details of the components are described in the original literature [22][23][24]. DIB-MG consists of nineteen convolution layers with a two-stage global-average-pooling layer.
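The count of nineteen convolution layers follows from the block structure given above. The short tally below is only a check of that arithmetic; the tuple layout is an illustrative choice, and it assumes each block type contributes exactly the number of convolution layers stated in the text.

```python
# (block name, number of blocks, convolution layers per block), as stated in the text.
blocks = [
    ("init_block", 2, 1),
    ("residual_block", 4, 4),
    ("aggregate_block", 1, 1),
]

# Sum the convolution layers contributed by every block.
total_conv_layers = sum(count * convs for _, count, convs in blocks)
print(total_conv_layers)  # 19
```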
The first eighteen convolution layers extract hierarchical features for cancer classification, while the last convolution layer (1 × 1 convolution kernel with filter width 2) generates two per-view maps (one for cancer and the other for normal) for final DIB construction (Fig. 2). Figure 3 shows an example of DIB as well as ground-truth lesions. Since we did not use pixel-level lesion annotation in this experiment, each per-view map generated from the last convolution layer (i.e. the map generation stage) was converted into a single value to be compared with the ground-truth label (biopsy-proven cancer: 1 or normal: 0). To do so, the final maps were converted into a vector (each vector element represents its own class) using the global-average-pooling operation. In the training stage, the error between the output vector (y_pred in Fig. 2) and the ground-truth label was propagated backward via the back-propagation algorithm [25], and the model parameters of the entire network were updated based on the propagated errors.
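The reduction from per-view class maps to a per-case vector can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: it assumes the two pooling stages are a spatial average within each view followed by an average across the four views, that index 0 is the normal class and index 1 the cancer class, and that a softmax turns the pooled scores into y_pred. The text specifies none of these details, and the function names are hypothetical.

```python
import math

def spatial_mean(feature_map):
    """Global average pooling over one 2-D map (a list of rows)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

def pool_case(per_view_maps):
    """Reduce per-view class maps to one score per class.

    per_view_maps[view][cls] is a 2-D map. Stage 1 averages each map
    spatially; stage 2 averages the results over views (an ASSUMED
    reading of "two-stage global-average-pooling").
    """
    n_classes = len(per_view_maps[0])
    per_view = [[spatial_mean(m) for m in view] for view in per_view_maps]
    return [sum(v[c] for v in per_view) / len(per_view) for c in range(n_classes)]

def softmax(scores):
    """Numerically stable softmax (ASSUMED; the text does not name the function)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 2x2 maps; the same pair is reused for all four views (RCC, RMLO, LCC, LMLO).
normal_map = [[0.0, 0.0], [0.0, 0.0]]   # spatial mean 0.0
cancer_map = [[4.0, 0.0], [0.0, 0.0]]   # spatial mean 1.0
views = [[normal_map, cancer_map] for _ in range(4)]

scores = pool_case(views)
y_pred = softmax(scores)[1]             # index 1 = cancer (assumed ordering)
print(scores, round(y_pred, 4))         # [0.0, 1.0] 0.7311
```

Because only the pooled vector is compared with the per-case label, no pixel-level annotation is needed for training, which is the weakly supervised aspect described above.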
Training Set-up. All DICOM files are first converted to PNG files using the window_center and window_width values defined in the header of each DICOM file, and the pixel values are then normalized to the range −1.0 to 1.0. Random perturbation of the pixel intensity in terms of contrast (±10%) and brightness (±10%) is applied at every training iteration to compensate for differences in vendor-specific contrast and brightness characteristics.

[Table column headings: All | Train (n = 23,530) | Validation (n = 619) | Test (n = 619) | P value]

Evaluation of the Algorithm.
Training proceeds to minimize the prediction error over the entire training set, and the DIB-MG model that performs best on the validation set is chosen for evaluation on the test set. In the inference stage, the final output value of the trained model (y_pred, ranging from 0.0 to 1.0) is used to decide whether the input case is cancer or not. More specifically, y_pred represents the confidence level of malignancy. This value is not exactly equal to the probability of cancer, as cancer cannot be specified as a probability, but it is correlated with the cancer probability in real exams. The constructed DIB (one per-view map per class, generated from the last convolution layer) includes information on spatial discriminativity. As mentioned before, each map represents the corresponding class and shows the most discriminative part in terms of the final classification result; e.g., if y_pred is 0.9 (a high confidence of cancer), then the region with the highest value on the cancer map is the most discriminative part in terms of its cancer decision.
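The preprocessing and the map read-out described in this section can be sketched together. Several details here are assumptions rather than the study's code: the window is applied as a simple linear clip between center − width/2 and center + width/2 (the DICOM standard's exact formula differs slightly), the intermediate PNG quantization is skipped, "±10%" is interpreted as a multiplicative contrast factor in [0.9, 1.1] and an additive brightness shift in [−0.1, 0.1], and all function names are hypothetical.

```python
import random

def window_and_normalize(pixels, center, width):
    """Clip raw pixel values to the display window and rescale to [-1.0, 1.0]."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    return [[2.0 * (min(max(p, lo), hi) - lo) / (hi - lo) - 1.0 for p in row]
            for row in pixels]

def perturb(image, rng, max_delta=0.10):
    """Random contrast/brightness jitter (ASSUMED interpretation of +/-10%)."""
    contrast = 1.0 + rng.uniform(-max_delta, max_delta)
    brightness = rng.uniform(-max_delta, max_delta)
    return [[min(max(p * contrast + brightness, -1.0), 1.0) for p in row]
            for row in image]

def most_discriminative_location(cancer_map):
    """Return (row, col) of the highest value on a per-view cancer map."""
    best_value, best_pos = None, None
    for r, row in enumerate(cancer_map):
        for c, value in enumerate(row):
            if best_value is None or value > best_value:
                best_value, best_pos = value, (r, c)
    return best_pos

# Toy 2x2 "image" with hypothetical 12-bit raw values and window settings.
raw = [[0, 1024], [2048, 4095]]
img = window_and_normalize(raw, center=2048, width=4096)
print(img[0], img[1][0])                 # [-1.0, -0.5] 0.0

augmented = perturb(img, random.Random(0))

# Toy cancer map: the peak sits at row 1, column 0.
location = most_discriminative_location([[0.1, 0.2], [0.9, 0.3]])
print(location)                          # (1, 0)
```

Drawing a fresh jitter at every training iteration exposes the network to a range of contrast and brightness settings, which is the stated motivation for the perturbation.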

Statistical Analysis and Performance Comparison.
Chi-square tests were used to test whether categorical variables differed between the training, validation and test sets. Diagnostic performance was measured on the validation and test sets. Sensitivity, specificity, and accuracy were compared across demographic subgroups using the chi-square test. For features, logistic regression with the generalized estimating equation (GEE) method was applied to take into account that some patients had mass with microcalcifications. AUCs were compared between the validation and test sets using chi-square statistics. All analyses were conducted by a statistician using SAS statistical software (version 9.4; SAS Institute Inc., Cary, NC, USA) and R version 3.3.1 (R Foundation for Statistical Computing, Vienna, Austria).
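The performance measures named above can be computed from per-case scores and labels as in the following sketch. It is written in plain Python for illustration only; the study's analyses were run in SAS and R, the toy scores and labels are invented, and treating a score equal to the threshold as positive is an assumption. The chi-square and GEE comparisons are not reproduced.

```python
def confusion_counts(scores, labels, threshold=0.5):
    """Count TP, FP, TN, FN, calling a case cancer when score >= threshold (assumed)."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def summary_metrics(scores, labels, threshold=0.5):
    """Return (sensitivity, specificity, accuracy) at the given operating point."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / (tp + fp + tn + fn)

def auc(scores, labels):
    """AUC as the chance that a cancer case outscores a normal case (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented toy data: three cancer cases (label 1) and three normal cases (label 0).
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]

sensitivity, specificity, accuracy = summary_metrics(toy_scores, toy_labels)
toy_auc = auc(toy_scores, toy_labels)
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 3), round(toy_auc, 3))
# 0.667 0.667 0.667 0.889
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality over all thresholds, which is why the results report both.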

Results
At the operating point (threshold) of 0.5, sensitivity was 75.6% and 76.1% when specificity was 90.2% and 88.5%, and AUC was 0.903 and 0.906 for the validation and test sets, respectively, with no statistically significant difference between the two sets (Table 3). Sensitivity and specificity did not differ significantly between patients aged ≥50 and <50 years, but they differed significantly according to the manufacturer (Table 3). With regard to breast density, sensitivity was not affected; however, specificity and accuracy decreased as breast density increased (Table 4).
In the malignant group (Table 5), sensitivity was better for masses than for calcifications (84.1-86.1% vs 77.5-77.9%), better for invasive cancer than for non-invasive cancer (77.0-77.9% vs 54.2-59.7%), and better for masses ≥20 mm than for those <20 mm (88.5-88.6% vs 68.4-71.0%).

Discussion
... [27]. They used a previously reported computerized segmentation algorithm in order to extract the clustered microcalcifications from mammograms [28]. In their approach, pre-defined microcalcification features obtained from lesion-annotated mammograms were used as an input for the unsupervised deep learning model (stacked autoencoder) [29]. Kooi et al. compared state-of-the-art mammography CAD systems, relying on manually designed features as well as data-driven features using DCNN [18]. In their deep learning approach in particular, image patches extracted from lesion-annotated mammograms were used for training. Becker et al. evaluated the diagnostic performance of their deep neural network model for breast cancer detection [30]. A total of 143 histology-proven cancers and 1,003 normal cases were used for this study, where all the cancer cases of the training dataset were manually annotated pixel-wise by radiologists according to descriptions in the radiology report. Compared to the aforementioned approaches, we used purely data-driven features from raw mammograms without any lesion annotations, which is scalable and practical for future CAD systems.

In previous reports with CAD, sensitivity was higher for microcalcifications than for masses [31][32][33][34]; in this study, however, sensitivity was better for masses than for calcifications. This is due to the difference in data sets. Our data set included both screening and diagnostic mammograms, and 45.7% (1,721/3,762) of invasive carcinomas were 2 cm or larger, whereas the other CAD studies included only screening mammograms [31][32][33][34]. Further studies using the DIB-MG algorithm on screening data sets should follow.
Our data showed that sensitivity for breast cancer detection was similar for non-dense and dense breasts. However, specificity decreased as breast density increased. Because low specificity translates directly into more false positives, algorithms with higher specificity need to be developed in the future.
In our study, diagnostic performance differed according to the manufacturer; with Siemens equipment, sensitivity was the highest (88.8%) and specificity was the lowest (61.7%). The three manufacturers were distributed in similar proportions across the training, validation and test sets (roughly 4:3:3 in cancer cases and 5:4:1 in normal cases). However, Siemens machines accounted for 27.2-29.2% of cancer cases compared with 7.6-8.6% of normal cases. This indicates that the number of cases trained with a certain type of machine can influence diagnostic performance. This kind of selection bias should be considered in future deep learning studies.
We acknowledge several limitations of our study. First, we included only normal and cancer cases, so benign cases need to be included in future work, and the dataset should be further expanded. Second, our model does not use any pixel-level annotations for training, so there might be errors in the predicted lesion location in examples predicted as cancer. It is necessary to confirm whether the lesion location is accurately predicted, and to retrain the model based on those examples to improve localization performance.
In conclusion, this research showed the potential of DIB-MG as a screening tool for breast cancer. Further studies using a large amount of high-quality data, including benign cases, are needed to further investigate its feasibility as a screening tool.