Deep learning-based computer-aided diagnosis in screening breast ultrasound to reduce false-positive diagnoses

A major limitation of screening breast ultrasound (US) is a substantial number of false-positive biopsy. This study aimed to develop a deep learning-based computer-aided diagnosis (DL-CAD)-based diagnostic model to improve the differential diagnosis of screening US-detected breast masses and reduce false-positive diagnoses. In this multicenter retrospective study, a diagnostic model was developed based on US images combined with information obtained from the DL-CAD software for patients with breast masses detected using screening US; the data were obtained from two hospitals (development set: 299 imaging studies in 2015). Quantitative morphologic features were obtained from the DL-CAD software, and the clinical findings were collected. Multivariable logistic regression analysis was performed to establish a DL-CAD-based nomogram, and the model was externally validated using data collected from 164 imaging studies conducted between 2018 and 2019 at another hospital. Among the quantitative morphologic features extracted from DL-CAD, a higher irregular shape score (P = .018) and lower parallel orientation score (P = .007) were associated with malignancy. The nomogram incorporating the DL-CAD-based quantitative features, radiologists’ Breast Imaging Reporting and Data Systems (BI-RADS) final assessment (P = .014), and patient age (P < .001) exhibited good discrimination in both the development and validation cohorts (area under the receiver operating characteristic curve, 0.89 and 0.87). Compared with the radiologists’ BI-RADS final assessment, the DL-CAD-based nomogram lowered the false-positive rate (68% vs. 31%, P < .001 in the development cohort; 97% vs. 45% P < .001 in the validation cohort) without affecting the sensitivity (98% vs. 93%, P = .317 in the development cohort; each 100% in the validation cohort). In conclusion, the proposed model showed good performance for differentiating screening US-detected breast masses, thus demonstrating a potential to reduce unnecessary biopsies.


Scientific Reports
| (2021) 11:395 | https://doi.org/10.1038/s41598-020-79880-0 www.nature.com/scientificreports/ malignant breast lesions [9][10][11][12] . However, efforts to detect more cancers using screening US have led to excessive recalls and false positive biopsies. In addition, cancers detected using screening US tend to be small in size, without calcifications, and lack typical malignant features compared with palpable breast cancers or breast cancers detected using screening mammography 13,14 . Hence, the task of classifying breast masses detected using screening US as benign or malignant is challenging. Moreover, the assessment of masses detected using screening US entails concerns regarding operator dependency. Artificial intelligence, particularly involving deep learning algorithms, is gaining extensive attention owing to its excellent performance in image recognition tasks 15 . A deep learning based computer-aided diagnosis (DL-CAD) software has been recently developed and employed for breast mass differentiation in clinical practice [16][17][18][19][20][21] . Reports on the use of the dichotomized final assessments ("possibly benign" or "possibly malignant") rendered by the current commercial DL-CAD software in patients with variable conditions showed the potential of DL-CAD with regard to improving the diagnostic accuracy and specificity [16][17][18][19][20][21] . However, to the best of our knowledge, no studies have focused explicitly on DL-CAD performance considering selected patients with breast masses detected using screening US. Considering the small size and less obvious morphologic characteristics of breast masses detected using screening US, the quantitative information derived from the DL-CAD software could provide additional details missed by the expert radiologists; furthermore, such information could prove to be substantially useful for the differentiation of breast masses while facilitating patient management in the clinical setting.
Therefore, the purpose of this study was to develop a diagnostic model that incorporates the quantitative morphologic features extracted from DL-CAD to improve the differential diagnosis of breast masses detected using screening US and reduce false-positive diagnoses.

Results
Characteristics of the development and validation cohorts. The quantitative morphologic scores were obtained using the DL-CAD software for the development and validation sets (463 patients, mean age, 45 years; range, . Surgery was performed on all lesions with malignant biopsy results (n = 52) and some of the lesions with benign biopsy results (n = 81). Clinical and imaging follow-up was performed using US (mean follow-up time, 12 months; range, 6-23 months) for the non-excised lesions proven to be benign through biopsy (n = 330), and lesion stability was confirmed in all the cases. The characteristics of the development and validation cohorts are summarized in Table 1. The age of the patients and size of the breast masses in the US were comparable between the two cohorts (P = 0.937 and 0.444, respectively). The proportion of BI-RADS category 3 lesions was significantly higher in the development cohort than in the validation cohort (28.1% [84 of 299] vs. 2.4% [4 of 164], P < 0.001). In the development cohort, there were 256 (85.6%) benign and 43 (14.4%) malignant (33 invasive ductal carcinomas, 7 ductal carcinomas in situ, 1 invasive lobular carcinoma, 1 mucinous carcinoma, and 1 adenoid cystic carcinoma) masses. In the validation cohort, there were 155 (94.5%) benign and 9 (5.5%) malignant (7 invasive ductal carcinomas and 2 ductal carcinomas in situ) masses. The proportion of malignancy was significantly higher in the development cohort than in the validation cohort (14.4% vs. 5.5%, P = 0.004). Among the quantitative morphologic scores extracted from the DL-CAD software (Supplementary Table S1), the round shape score and all descriptor scores of the echo pattern, margin, and posterior features were significantly different between the two cohorts (all P < 0.001). However, the oval shape score, irregular shape score, and parallel and non-parallel orientation scores were all comparable between the two cohorts (P = 0.545-0.859).

Development and validation of a nomogram.
A diagnostic model incorporating the four aforementioned variables was developed and presented as a nomogram (Fig. 1a). In the development cohort, the AUC of the radiologists' BI-RADS final assessment was 0.65 (95% CI 0.61, 0.69), whereas the AUC in the case of the diagnostic model was 0.89 (95% CI 0.84, 0.93) (Fig. 1b). The bootstrap-corrected AUC of the diagnostic model  (Fig. 1c). In the development cohort, the calibration plot showed good agreement between the predicted and observed probabilities (Fig. 1d); the calibration slope and intercept were close to 1 (95% CI 0.7, 1.3) and 0 (95% CI − 0.5, 0.5), respectively. The bootstrap-corrected estimates for the calibration were 0.84 and − 0.75 for the slope and intercept, respectively. However, in the validation cohort, the observed probability was lower than the predicted probability ( Fig. 1e) with a calibration slope of 1 (95% CI 0.4, 1.6) and calibration intercept of − 1.4 (95% CI − 2.2, − 0.5).
Application of the nomogram to reduce unnecessary biopsies. After obtaining the risk scores from the diagnostic model, the optimal cut-off of the nomogram was determined to be 114 points to maximally improve the specificity with maintaining the sensitivity to 95% or higher. A comparison of the false positive rates, biopsy rates, and sensitivities between the BI-RADS final assessment and the diagnostic model using the nomogram with the determined cut-off is presented in

Discussion
In this study, the DL-CAD based nomogram for the differential diagnosis of breast masses detected by screening US was developed and validated using the multicenter cohorts. Among the quantitative morphologic features obtained from the DL-CAD software, 'shape' and 'orientation' information were the most beneficial features in predicting malignancy. In comparison with the radiologists' BI-RADS final assessment alone, our nomogram incorporating the DL-CAD based quantitative information, patient age, and the BI-RADS assessments, improved AUCs and lowered the false-positive rates without decreasing the sensitivities in the differential diagnosis of breast masses detected by screening US.
To date, various studies have attempted to overcome the limitations of screening breast US, including a low positive predictive value and increase in false-positive biopsies 22 . The use of elastography or color Doppler US, in addition to grayscale US images, has been suggested and used in clinical practice 23 . However, in these techniques, the additional time required for image acquisition and operator dependency with regard to both image acquisition and interpretation remain as limitations 24 . In this respect, the use of DL-CAD with grayscale US images could provide objective, quantitative information, enabling the proposed method to serve as a second reader assisting radiologists to effectively reduce unnecessary biopsies and increase cost-effectiveness. The detection of subtle differences in the mass features among grayscale US images is promising, as this can also be applied to an automated breast US system.
With regard to creating the diagnostic model, the development and validation cohorts did not share common mass features and ratios of malignancy; however, good discrimination performance was noted, implying high compatibility with a slightly different cohort. The use of the developed nomogram lead to improved AUC values and lower false positive rates without reducing the sensitivity, in contrast with using only BI-RADS; thus, this is an effective method for reducing the false-positive biopsies associated with supplemental US exams. These results are concordant with those of previous studies that consistently reported the utility of DL-CAD software for improving the diagnostic specificity of both experienced and less-experienced radiologists [16][17][18][19][20][21] . However, in previous studies, their performances were evaluated on a single study population with a given dichotomized output from the DL-CAD software. In this study, the individual quantitative data of the DL-CAD software were explored for the first time.
Among the morphologic features of the grayscale US images assessed with the DL-CAD software, the shape and orientation were the most reliable US features for characterizing the screening US-detected breast masses. In contrast, the echogenicity and posterior features were the least reliable. The results of this study are consistent with those of prior studies by Hong et al. 25 and Elverici et al. 26 , which investigated the US features useful in diagnosing non-palpable breast lesions. An irregular shape, spiculated margin, and non-parallel orientation are  www.nature.com/scientificreports/ well-known features highly predictive of malignancy 25,26 . An irregular shape might reflect the inconsistent growth and advancement of the lesion edge, whereas the non-parallel orientation feature might suggest the spread of the lesion through the tissue-plane boundaries, both of which are substantially indicative of a malignant process 25 . Margin assessment is somewhat challenging for small-sized lesions despite the use of high-spatial-resolution US equipment 9 . In concordance with the study conducted by Elverici et al. 26 , the margin was not independently associated with malignancy in the screening-US-detected masses evaluated in this study. In the study performed by Chen et al. 27 , the reliable US features differed according to the size of the breast masses; furthermore, for breast masses ≤ 1 cm in diameter, an irregular shape and not smooth margin were associated with malignancy. The reliable US features predictive of malignancy might differ depending on the size and palpability of the breast masses. This study has several limitations. First, this was a retrospective study, and the results were dependent on the composition of the data, whose size was limited. Owing to the small number of malignant lesions and the low proportion of BI-RADS category 3 masses in the validation cohort, a tendency to overestimate was noted 28 . Further improvements must be achieved via larger and prospective studies before actual clinical use. Second, although BI-RADS 3 masses that were biopsied on the request of a clinician or patient were included in this study, the inclusion of BI-RADS 3 masses may not reflect the usual population requiring biopsies. Third, we only applied Table 4. Comparison of the false positive rate, biopsy rate, and sensitivity between the radiologist's BI-RADS assessment and the proposed nomogram. Data in parentheses are the 95% confidence intervals, and the data in brackets are the numerators/denominators. BI-RADS, Breast Imaging Reporting and Data System.  www.nature.com/scientificreports/ our DL-CAD system for women who were scheduled for biopsy. In our study, we do not have data on entire screening cohort, and the patients who were not performing biopsy for their mass were not included. Fourth, only one representative US image was used for each case; thus, variability could still exist in lesion assessments performed by the DL-CAD software and radiologists. In addition, we used the quantitative scores extracted from the DL-CAD software, which is not displayed automatically in current clinical version. In the future, providing the quantitative information as an output should be considered based on our results. The modified software would improve the diagnostic accuracy even in challenging situations, such as in the case of breast cancers detected using screening US. Lastly, the risk status of the patients, such as a family history of breast cancer and BRCA mutation information, were not considered.
In conclusion, the proposed DL-CAD-based diagnostic model showed good performance with regard to the lesion differentiation of breast masses detected using screening US. The quantitative DL-CAD information can provide a more accurate diagnosis and potentially lead to a decrease in false positive biopsies without overlooking cancers.

Methods
Samsung Medison Co. (Seongnam, Korea) provided technical support with regard to analyzing the US images using the DL-CAD software (S-Detect; Samsung Medison Co., Seongnam, Korea) and obtaining outputs from the software. However, Samsung Medison Co. was not involved in the study design, data collection, statistical analysis, data interpretation, or manuscript preparation. The institutional review boards of Seoul National University Hospital (No. 1901-110-1005), Severance Hospital (4-2019-0112), and Samsung Medical Center (2019-02-051-001) approved this multicenter retrospective study, and the requirement for written informed consent was waived for re-analyzing data from prospective cohort study. All research was performed in accordance with relevant guidelines and regulations.
Patients. With regard to the development cohort, the US images were collected from the prospective cohort studies performed at Severance Hospital and Samsung Medical Center, Seoul, Korea from January to December 2015. From this database, we identified patients according to the following inclusion and exclusion criteria. The inclusion criteria were asymptomatic women with breast masses detected using screening US, were scheduled for US-guided core needle biopsy, and the DL-CAD software was applied to the breast masses before performing US-guided core needle biopsy in all of the included cases. The exclusion criteria were women with breast symptoms, including lump or nipple discharge, with positive findings (BI-RADS final assessment category 0, 3, www.nature.com/scientificreports/ 4, or 5) obtained via mammography; the exclusion criteria also included a personal history of breast cancer, with recently diagnosed, known cancer. In total, 299 women who fulfilled the aforementioned inclusion criteria comprised the development set (mean age, 45 years; range, 19-81 years). To form an independent external validation cohort, we recruited patients using the same inclusion and exclusion criteria as those for the development cohort in Seoul National University Hospital, Seoul, Korea between February 2018 and June 2019. A total of 164 women (mean age, 45 years; range, 23-74 years) comprised the validation cohort (Fig. 4).

DL-CAD application.
All masses included in this study were scheduled to undergo US-guided biopsy according to the predetermined BI-RADS category. The BI-RADS category was determined solely based on grayscale US images without DL-CAD information. As per the institutional rules for biopsy, all BI-RADS category 4 or 5 masses undergo biopsy, and upon the request of a clinician or patient, BI-RADS category 3 masses can be scheduled for biopsy. On the day of the scheduled biopsy, the DL-CAD software was applied using 5-12 MHz linear probes and a real-  www.nature.com/scientificreports/ Extraction of quantitative morphologic scores from the DL-CAD software. In this study, the quantitative morphologic scores obtained from the DL-CAD software, consisting of values between 0 and 1 for each BI-RADS lexicon descriptor, were used to develop a diagnostic model (Fig. 5) for the differential diagnosis of breast masses detected using screening US; this was because our unpublished preliminary test performed using the commercial DL-CAD software for the masses detected using screening US showed unsatisfactory diagnostic performance. The current commercial version of the DL-CAD software does not present these quantitative morphologic scores but displays one descriptor with a maximum value for the shape, orientation, margin, posterior echogenic features, and echo pattern. We believe that the current version involving the selection of one descriptor with a maximum value and dichotomized final assessment category might provide the omitted information with regard to our selected population with breast masses detected using screening US. During the generation of the quantitative morphologic scores, deep learning technology was applied to build the classifier for the BI-RADS lexicon. A deep learning network consists of a series of multiple layers of simple components, many of which compute their own non-linear mappings between the input and output. Unlike conventional machine learning techniques, which use different algorithms and trained to learn about relationship between hand-crafted image features (i.e., edges, corners, or textures) and clinical outcomes, deep learning obtains the complex data structures in large high-dimensional datasets to determine the manner in which the internal parameters, called weights, of a deep neural network should be adjusted to learn the representations of data with multiple levels of abstraction. The weights are used to compute the representation of data in each layer based on the output of the previous layer, and complex functions with enough components of multiple levels of representation can be learned. For classification tasks, higher layers amplify the aspects of the input that are important for discrimination and suppress irrelevant variations 15 . All the weights of each layer are not generated by human experts but learned from a large amount of data via learning procedures. In the final layer, the inputs from the previous layer are transformed to probabilities that sum to one, representing the probability distributions of the final outputs. The final decision was made by combining the output of the US lexicon features with that of the other network for classifying the ROI images 29,30 . Data and statistical analysis. Along with the quantitative morphologic scores obtained from the DL-CAD software, the age of the patients, mass size based on the US, final BI-RADS assessment of the radiology report, and final pathological results were collected.
A comparison of the characteristics of the development and validation cohorts and those of the benign and malignant breast lesions in the development cohort was performed using the Chi-squared test for the categorical variables, the Wilcoxon rank sum test for the continuous variables except for age and size, and the independent t-test for age and size. Uni-and multivariable logistic regression analyses were used to identify the factors associated with malignancy in the development cohort. Variables with a P value < 0.1 under the univariable analysis  Table in the left side shows quantitative scores obtained from DL-CAD software of the mass. The current commercial version of the DL-CAD software displays one descriptor with a maximum value for shape, orientation, margin, posterior echogenic features, and echo pattern: in this case, irregular shape, not parallel orientation, microlobulated margin, posterior shadowing, and hypoechoic echo pattern. Thus, these features and the final assessment of 'possibly malignant' were displayed on the monitor as an output (a). On the other hand, this study used the most significant quantitative morphologic scores extracted from the DL-CAD software and clinical information to build the diagnostic nomogram (b).

Scientific Reports
| (2021) 11:395 | https://doi.org/10.1038/s41598-020-79880-0 www.nature.com/scientificreports/ were entered into the multivariable analysis with backward elimination. A nomogram was developed based on the multivariable logistic regression analysis. The diagnostic performance of the nomogram was evaluated in terms of discrimination and calibration. The discrimination was quantified using the area under the receiver operating characteristic (ROC) curve (AUC). The calibration was quantified using the calibration slope and intercept and visualized in the calibration curve. The internal validity of the nomogram was estimated via bootstrapping using 1000 bootstrap samples, whereas its external validity was estimated using the validation cohort. The hypothetical application of the nomogram was compared with the radiologists' BI-RADS final assessment using the McNemar test in terms of the false positive rate, biopsy rate, and sensitivity. The cut-off of the nomogram was determined by maximizing the specificity without reducing the sensitivity to < 95% on the ROC curve. The cut-off of the radiologists' BI-RADS final assessment was determined at category 4A. The false positive rate was defined as the number of benign breast lesions above the cut-off indicating suspicious for malignancy divided by the total number of benign lesions. The biopsy rate was defined as the number of breast lesions above the cut-off divided by the total number of breast lesions. The sensitivity was defined as the number of malignant lesions above the cut-off divided by the total number of malignant lesions. The SAS software (version 9.2; SAS Institute, Cary, NC) was used for the statistical analysis, and the R package (version 3.6.1) was used to build the nomogram.