## Introduction

Breast cancer is the most frequently diagnosed cancer and the leading cause of cancer-related deaths among women worldwide1. It is estimated that 281,550 new cases of invasive breast cancer will be diagnosed among women in the United States in 2021, eventually leading to approximately 43,600 deaths2. Identifying breast cancer at an early stage before metastasis enables more effective treatments and therefore significantly improves survival rates3,4. Mammography has long been the most widely utilized imaging technique for screening and early detection of breast cancer, but it is not without limitations. In particular, for women with dense breast tissue, the sensitivity of mammography drops from 85% to 48–64%5. This is a significant drawback, as women with extremely dense breasts have a 4-fold increased risk of developing breast cancer6. Moreover, mammography is not always accessible, especially in limited-resource settings, where the high cost of equipment is prohibitive and skilled technologists and radiologists are not available7.

Given the limitations of mammography, ultrasound (US) plays an important role in breast cancer diagnosis. It often serves as a supplementary modality to mammography in screening settings8 and as the primary imaging modality in many diagnostic settings, including the evaluation of palpable breast abnormalities9. Moreover, US can help further evaluate and characterize breast masses and is therefore frequently used for performing image-guided breast biopsies10. Breast US has several advantages compared to other imaging modalities, including relatively lower cost, lack of ionizing radiation, and the ability to evaluate images in real time4. In particular, US is especially effective at distinguishing solid breast masses from fluid-filled cystic lesions. In addition, breast US is able to detect cancers obscured on mammography, making it particularly useful in diagnosing cancers in women with mammographically dense breast tissue11.

Despite these advantages, interpreting breast US is a challenging task. Radiologists evaluate US images using different features including lesion size, shape, margin, echogenicity, posterior acoustic features, and orientation, which vary significantly across patients12. Ultimately, they determine if the imaged findings are benign, need short-term follow-up imaging, or require a biopsy based on their suspicion of malignancy. There is considerable intra-reader variability in these recommendations and breast US has been criticized for increasing the number of false-positive findings13,14. Compared to mammography alone, the addition of US in breast cancer screening leads to an additional 5–15% of patients being recalled for further imaging and an additional 4–8% of patients undergoing biopsy15,16,17. However, only 7–8% of biopsies prompted by screening US are found to identify cancers15,17.

Computer-aided diagnosis (CAD) systems were first proposed to assist radiologists in the interpretation of breast US exams over a decade ago18. Early CAD systems often relied on handcrafted visual features that are difficult to generalize across US images acquired using different protocols and US units19,20,21,22,23,24. Recent advances in deep learning have facilitated the development of AI systems for the automated diagnosis of breast cancer from US images25,26,27. However, the majority of these efforts rely on image-level or pixel-level labels, which require experts to manually mark images containing visible lesions within each exam or annotate lesions in each image, respectively28,29,30,31,32,33. As a result, existing studies have been based on small datasets consisting of several hundreds or thousands of US images. Deep learning models trained on those datasets might not sufficiently learn the diverse characteristics of US images observed in clinical practice. This is especially important for US imaging, as lesion appearance can vary substantially depending on the imaging technique and the manufacturer of the US unit. Moreover, prior research has primarily focused on differentiating between benign and malignant breast lesions, hence evaluating AI systems only on images that contain either benign or malignant lesions34,35,36. In contrast, the majority of breast cancer screening exams are negative (no lesions are present)7,11. In addition, most AI systems in previous studies do not interpret the model’s predictions, resulting in “black-box” models28,29,30,31,32,33,34,35,36. So far, there has been little work on interpretable AI systems for breast US.

In this work, we present an AI system (Fig. 1) to identify malignant lesions in breast US images with the primary goal of reducing the frequency of false positive findings. In addition to classifying the images, the AI system also localizes the lesions in a weakly supervised manner37,38,39. That is, our AI system is able to explain its predictions by indicating locations of malignant lesions even though it is trained with binary breast-level cancer labels only (see Methods section ‘Breast-level cancer labels’), which were automatically extracted from pathology reports. The explainability of our system enables clinicians to develop trust and better understand its strengths and limitations.

The proposed system provides several advances relative to previous work. First, to the best of our knowledge, the dataset used to train and evaluate this AI system is larger than any prior dataset used for this application29,40. Second, to understand the potential value of this AI system in clinical practice, we conducted a retrospective reader study to compare its diagnostic accuracy with that of ten board-certified breast radiologists. The AI system achieved a higher area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) than the ten radiologists on average. Moreover, we showed that the hybrid model, which aggregates the predictions of the AI system and radiologists, improved radiologists’ specificity and decreased biopsy rate while maintaining the same level of sensitivity. Of note, the term “prediction” refers to the diagnosis produced by the AI system or the radiologists in this retrospective study, as it is often used in the machine learning literature. It does not imply that the study was prospective. In addition, we showed that the performance of the AI system remained robust across patients from different age groups and mammographic breast densities. The accuracy of our system also remained high when tested on an external dataset40.

## Results

### Datasets

The AI system was developed and evaluated using the NYU Breast Ultrasound Dataset41 consisting of 5,442,907 images within 288,767 breast US exams (including both screening and diagnostic exams) collected from 143,203 patients examined between 2012 and 2019 at NYU Langone Health in New York, USA. The NYU Langone hospital system spans multiple sites across New York City and Long Island, allowing the inclusion of a diverse patient population. The dataset included 28,914 exams associated with a pathology report, and among those, the biopsy or surgery yielded benign and malignant results for 26,843 and 5,593 breasts, respectively. Patients in the dataset were randomly divided into a training set (60%) that was used for model training, a validation set (10%) that was used for hyperparameter tuning, and an internal test set (30%) that was used for model evaluation. Each patient was included in only one of the three sets. We used a subset of the internal test set for the reader study. The statistics of the overall dataset, the internal test set, and the reader study set are summarized in Table 1.

Each breast within an exam was assigned a label indicating the presence of cancer using pathology results. The pathology examinations were conducted on tissues obtained during a biopsy or breast surgery. As shown in Fig. 1b, all cancer-positive exams were accompanied by at least one pathology report indicating malignancy collected within 30 days before or 120 days after the US examination. This time frame was chosen to maximize the inclusion of both lesions found at primary screening US and lesions found during targeted US after an initial imaging workup with a different modality. We filtered the internal test set to ensure that cancers were visible on positive exams and that negative exams had either a cancer-negative biopsy or at least one negative follow-up US exam (see Methods section ‘Additional filtering of the test set’). Studies with neither a pathology report nor any negative follow-up were included in the training and validation sets but excluded from the internal test set.

To assess the ability of the AI system to generalize across patient populations and image acquisition protocols, we further evaluated it on the public Breast Ultrasound Images (BUSI) dataset collected at the Hospital for Early Detection and Treatment of Women’s Cancer in Cairo, Egypt40. This external test set consisted of 780 images, of which 437 were benign, 210 were malignant, and 133 were negative (no lesion present). These images were collected from 600 patients. Of note, the BUSI dataset was acquired using different US machines and was collected from patients with contrasting demographic backgrounds compared to the NYU dataset. Each image in the BUSI dataset was associated with a label indicating the presence of any malignant lesions.

### AI system performance

On the internal test set of 44,755 US exams (25,003 patients, 79,156 breasts), the AI system achieved an AUROC of 0.976 (95% CI: 0.972, 0.980) in identifying breasts with malignant lesions. Additionally, we stratified patients by age, mammographic breast density, and US machine manufacturer, and evaluated the AI system's performance across these sub-populations (Table 2). The AI system maintained high diagnostic accuracy among all age groups (AUROC: 0.969–0.981), mammographic breast densities (AUROC: 0.964–0.979), and US device manufacturers (AUROC: 0.974–0.990). We also explored the impact of training dataset size on the performance of the AI system. We observed that more training data led to a better AUROC (Supplementary Table 1). In addition, we evaluated the AI system on the external test set (BUSI dataset)40. Even though the AI system was not trained on any images of the external test set, it maintained a high level of diagnostic accuracy (0.927 AUROC, 95% CI: 0.907, 0.959).

To compare the performance of the AI system with that of breast radiologists, we conducted two reader studies: one on the internal test set and the other on the external BUSI dataset. Conclusions drawn from the results for both datasets were consistent. Here we present the results for the internal test set. The results for the external test set are in the Supplementary Information.

From the internal test set, we constructed a reader study subset by selecting 663 exams (644 patients, 1024 breasts). Among the exams selected for this study, 73 breasts had pathology-proven cancer, 535 breasts had a biopsy yielding exclusively benign findings, and 416 breasts were not biopsied but were evaluated by radiologists as likely benign and had a follow-up benign evaluation at 1–2 years. Readers were informed that the study dataset was enriched with cancers but were not informed of the enrichment level.

Ten board-certified breast radiologists rated each breast according to the Breast Imaging Reporting and Data System (BI-RADS)12. Radiologists’ experience is described in Supplementary Table 2. Readers were provided with contextual information typically available in the clinical setting, including the patient’s age, burnt-in annotations showing measurements of suspicious findings, and notes from the technologist, such as specifying any region of palpable concern or pain. In contrast, the AI system was not provided any contextual information.

### Subgroup analysis on the biopsied population

Next, we evaluated the accuracy of readers and the AI system exclusively amongst breasts with pathology-confirmed cancers (97 malignant lesions across 73 breasts). As shown in Supplementary Table 5, we stratified malignant lesions by cancer subtype, histologic grade, and biomarker profile. This was done to further investigate the AI system’s ability to discriminate between benign and malignant lesions. Certain types of breast cancers (such as high grade, triple biomarker negative cancers) may closely resemble benign masses (more likely to have oval/round shape and circumscribed margins, less likely to have posterior attenuation compared to other cancers) and are considered particularly difficult to characterize42. Although the sample sizes in some subgroups are limited, this analysis demonstrated that the sensitivity of the AI system was similar to that of the readers across all stratification categories. There were no significant differences in sub-populations of patients where the AI system had inferior performance.

### Potential clinical applications

In addition, the AI system could be used to help radiologists triage US exams (Supplementary Table 8). To evaluate the potential of the AI system in identifying cancer-negative cases with high confidence, we selected a very low decision threshold to triage women into a no-radiologist work stream. On the reader study subset, using this triage paradigm, the AI system achieved an NPV of 99.86% while retaining a specificity of 77.7%. This result suggests that it may be feasible to dismiss 77.7% of normal/benign cases and skip radiologist review if we accept missing one cancer in every 740 negative predictions, which is less than 1/6 of the false negative rate observed among radiologists in the reader study (one missed cancer for every 109 negative evaluations). To evaluate the potential of the AI system in triaging patients into an enhanced assessment work stream, we used a very high decision threshold. In this enhanced assessment work stream, the AI system achieved a PPV of 84.4% while retaining a sensitivity of 52.1%. These results suggest that it may be feasible to rapidly prioritize more than half of cancer cases, with approximately five out of six biopsies leading to a diagnosis of cancer. For comparison, only 27.1% of the biopsies that the radiologists recommended led to a diagnosis of cancer. While we demonstrated the potential of AI in automatically triaging breast US exams, confirmation of these performance estimates would require extensive validation in a clinical setting.
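The arithmetic behind these triage figures can be sketched as follows. The helper names are ours, and the inputs are the rounded percentages quoted above, so the outputs only approximate the reported values; in particular, the rounded NPV yields roughly one miss per 714 negative predictions rather than the reported 740, which was presumably derived from unrounded counts.

```python
def negatives_per_missed_cancer(npv: float) -> float:
    """Average number of negative predictions per missed cancer: 1 / (1 - NPV)."""
    return 1.0 / (1.0 - npv)


def cancers_per_biopsy(ppv: float) -> float:
    """Fraction of recommended biopsies that find cancer (this is just the PPV)."""
    return ppv


# Rounded values quoted in the text (assumption: used here as-is).
ai_negatives_per_miss = negatives_per_missed_cancer(0.9986)

# Reported rates: one miss per 740 (AI) versus one per 109 (radiologists).
miss_rate_ratio = (1 / 740) / (1 / 109)

# "Approximately five out of six" biopsies versus the reported PPV of 84.4%.
gap_to_five_sixths = abs(cancers_per_biopsy(0.844) - 5 / 6)

print(f"AI: about one miss per {ai_negatives_per_miss:.0f} negative predictions")
print(f"AI miss rate relative to radiologists: {miss_rate_ratio:.3f}")
print(f"|PPV - 5/6| = {gap_to_five_sixths:.3f}")
```

The ratio of the two reported miss rates is about 0.15, which is consistent with the statement that the AI system's rate is below one sixth of the radiologists' rate.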

## Discussion

In this work, we present a radiologist-level AI system that is capable of automatically identifying malignant lesions in breast US images. Trained and evaluated on a large dataset collected from 20 imaging sites affiliated with a large medical center, the AI system maintained a high level of diagnostic accuracy across a diverse range of patients whose images were acquired using a variety of US units. By validating its performance on an external dataset, we produced preliminary results substantiating its ability to generalize across a patient cohort with different demographic composition and image acquisition protocols.

Another strength of this study is that we explored the benefits of collaboration between radiologists and AI. We proposed and evaluated a hybrid diagnostic model that combined the predictions from radiologists and the AI system. The results from our reader study suggest that such collaboration improves the diagnostic accuracy and reduces false positive biopsies for all ten radiologists (Supplementary Table 6). In fact, breast US has come under criticism for having a high false positive rate13,14. As reported by multiple clinical studies, only 7–8% of breast biopsies performed under US guidance are found to yield cancers15,17. Indeed, for the ten radiologists in our cancer-enriched reader study subset, on average 19.3% (SD: 4.7%, 95% CI: 17.7%, 20.6%) of cancer-negative exams were falsely diagnosed as positive and only 27.1% (SD: 4.1%, 95% CI: 22.9%, 33.1%) of the exams that they recommended to undergo biopsy actually had cancer. In this study, we showed that the hybrid models reduced the average radiologist’s false positive rate to 12.0% (SD: 3.9%, 95% CI: 7.6%, 21.0%), representing a 37.3% (SD: 12.9%, 95% CI: 35.5%, 39.3%) relative reduction. The hybrid models also increased the average radiologist’s PPV to 38.0% (SD: 6.0%, 95% CI: 24.1%, 50.0%). These results indicate that our AI system has the potential to aid radiologists in their interpretation of breast US exams to reduce the number of false positive interpretations and benign biopsies performed.

Beyond improving radiologists’ performance, we also explored how AI systems could be utilized to assist radiologists to triage US exams. We showed that high-confidence operating points provided by the AI system can be used to automatically dismiss the majority of low-risk benign exams and escalate high-risk cases to an enhanced assessment stream. Prospective clinical studies will be required to understand the full extent to which this technology can benefit US reading.

Finally, we have made technical contributions to the methodology of deep learning for medical image analysis. Prior work on AI systems for interpreting breast US exams, and other similar applications, relies on manually collected image-level or pixel-level labels28,29,30,31,32,33. In contrast, our AI system was trained using breast-level labels which were automatically extracted from pathology reports. This is an important difference, as developing a reliable AI system for clinical use requires training and validation on large-scale datasets to ensure the network will function well across the broad spectrum of cases encountered in clinical practice. At such a scale, it is impractical to collect labels manually. We address this issue by adopting the weakly supervised learning paradigm to train models at scale without the need for image-level or pixel-level labels. This paradigm enables the model to generate interpretable saliency maps that highlight informative regions in each image. Admittedly, the literature has not yet reached a consensus on what exactly interpretability means for neural networks. Nevertheless, with the saliency maps, researchers can perform qualitative error analysis and understand the strengths and limitations of the AI system. Furthermore, a system trained with such a large dataset could help discover novel data-driven imaging biomarkers, leading to a better understanding of breast cancer.

Despite the contributions of our study in advancing breast cancer diagnosis, it has some limitations. We focused on the evaluation of an AI system that detects breast cancer only using US imaging. In clinical practice, US imaging is often used as a complementary modality to mammography. One promising research direction is to utilize multimodal learning45,46 to combine information from other imaging modalities. Moreover, the diagnosis produced by our AI system is based only on a single US exam, while breast radiologists often refer to patients’ prior imaging to evaluate the morphological changes of suspicious findings over time. Future research could focus on augmenting AI systems to extract relevant information from past US exams. In addition, we did not provide an evaluation on patient cohorts stratified by risk factors such as family history of breast cancer and BRCA gene test results.

Another limitation of this work is the design of the reader study. To provide a fair comparison with the AI system, readers in our study were only provided with US images, patients’ ages, and notes from the operating technician. In clinical practice, breast radiologists also have access to other information such as patients’ prior breast imaging and their electronic medical records. Moreover, in the breast cancer screening setting, a screening US examination is typically accompanied by a screening mammogram. Even if prior US exams are not available, radiologists can typically refer to the mammogram for additional information, which can also influence the way that a US exam is interpreted. In addition, the qualitative analysis presented in this study was conducted over a limited set of exams. A systematic study on the differences between the AI system and the perception of radiologists in sonography interpretation is required to understand the limitations of such systems.

Finally, compared to the NYU Breast Ultrasound Dataset, the external test set is limited in size. All images in the external test set were acquired using a single US system40. Moreover, each lesion/finding in the external test set is only associated with a single image. In contrast, in clinical practice, technicians often acquire multiple images from different views for findings that are suspicious for malignancy. This difference in image acquisition protocol likely contributed to the gap in the AI system’s performance between the internal and external test sets.

In conclusion, we examined the potential of AI in US exam evaluation. We demonstrated in a reader study that deep learning models trained with a sufficiently large amount of data are able to produce diagnoses as accurate as those of experienced radiologists. We further showed that the collaboration between AI and radiologists can significantly improve radiologists’ specificity and obviate 27.8% of requested biopsies. We believe this research could supplement future approaches to breast cancer diagnosis. In addition, the general approach employed in our work, namely the framework for weakly supervised classification and localization, may enable utilization of deep learning in similar medical image analysis tasks.

## Methods

### Ethical approval

This retrospective study was approved by the NYU Langone Health Institutional Review Board (ID#i18-00712_CR3) and is compliant with the Health Insurance Portability and Accountability Act. Informed consent was waived since the study presents no more than minimal risk. This study is reported following the TRIPOD guidelines53.

### NYU breast ultrasound dataset

The dataset used in this study was collected from the NYU Langone Health system (New York, USA) across 20 imaging sites. The final dataset contained 288,767 exams (5,442,907 images) acquired from 143,203 patients imaged between January 2012 and September 2019. Each US exam included between 4 and 70 images with 18.8 images per exam on average (Supplementary Fig. 3a). The images had an average resolution of 665 × 603 pixels in width and height, respectively (Supplementary Fig. 3b). Both B-mode and color Doppler images were included. For each color Doppler image, the color Doppler map was overlaid on the B-mode US image. The AI system processed both B-mode images and color Doppler images in the same way. A summary of the acquisition devices is shown in Supplementary Table 9. Each exam was associated with additional patient metadata as well as a radiology report summarizing the findings. We extracted breast tissue density from the patients’ past mammography reports and assigned “unknown” to patients who did not have any mammography exams. Both screening and diagnostic US exams were included. Screening exams are performed for women who have no symptoms or signs of breast cancer, while diagnostic US exams can be used to evaluate women who present with symptoms such as a new lump or pain in the breast, or to further evaluate abnormalities detected on a screening examination. While screening exams are typically comprehensive and image both breasts, diagnostic US exams vary in terms of how targeted they are, and might image both breasts, one breast, or sometimes just a single lesion. The dataset was filtered as described in the next section. Further details can be found in the technical report41.

### Filtering of the dataset

We initially extracted a dataset of 425,506 breast US exams consisting of 8,448,978 images collected from 212,716 unique patients. We then applied a few levels of filtering to obtain the final dataset for training and evaluating the neural network. This entailed the exclusion of exams with invalid patient identifiers, exams collected before 2012, exams collected from patients younger than 16 years of age, duplicate images, exams from non-female patients, and invalid images based on the ImageType attribute, which consisted of non-US images such as reports or demographic data screenshots. We further excluded images that were collected during biopsy procedures based on the PerformedProcedureStepDescription, StudyDescription & RequestedProcedureDescription attributes of the image metadata, in that order, images with missing metadata information relating to the type of procedure, images with more than 80% zero pixels, exams with multiple patient identifiers or study dates, exams with an extreme number of images, and exams with missing image laterality.
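As an illustration of one of these filters, the exclusion of images with more than 80% zero pixels could be sketched as below. This is a minimal sketch rather than the authors' pipeline: the function names are hypothetical, and we assume that a pixel of a multi-channel image counts as zero only when all of its channels are zero.

```python
import numpy as np

ZERO_PIXEL_LIMIT = 0.80  # threshold stated in the text


def zero_pixel_fraction(image: np.ndarray) -> float:
    """Fraction of pixels that are zero (in every channel, for colour images)."""
    if image.ndim == 3:
        is_zero = np.all(image == 0, axis=-1)  # assumption: all channels must be zero
    else:
        is_zero = image == 0
    return float(is_zero.mean())


def keep_image(image: np.ndarray) -> bool:
    """Keep the image unless more than 80% of its pixels are zero."""
    return zero_pixel_fraction(image) <= ZERO_PIXEL_LIMIT


# Toy 10 x 10 RGB images: the first is 90% black, the second 50% black.
mostly_black = np.zeros((10, 10, 3), dtype=np.uint8)
mostly_black[:1] = 255
half_black = np.zeros((10, 10, 3), dtype=np.uint8)
half_black[:5] = 255

print(keep_image(mostly_black), keep_image(half_black))
```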

Patients were then randomly split among training (60%), validation (10%) and test (30%) sets. After splitting, each patient appeared in only one of the training, validation, and test sets. The training set consisted of 3,930,347 images within 209,162 exams collected from 101,493 patients. The validation set consisted of 653,924 images within 34,850 exams collected from 16,707 patients. The test set consisted of 858,636 images within 44,755 exams collected from 25,003 patients. The training set was used to optimize learnable parameters in the models. The validation set was used to tune the hyperparameters and select the best models. The test set was used to evaluate the performance of the models selected using the validation set. We applied additional filtering on the test set as described in the next section.
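A patient-level split of this kind can be sketched as follows. The function name, the fixed seed, and the rounding of set sizes are our assumptions; the reported set sizes reflect the authors' actual procedure and are not reproduced by this toy example.

```python
import random


def split_patients(patient_ids, fractions=(0.6, 0.1, 0.3), seed=0):
    """Randomly assign each patient to exactly one of train / val / test."""
    ids = sorted(set(patient_ids))        # de-duplicate so a patient is assigned once
    random.Random(seed).shuffle(ids)      # seeded shuffle for reproducibility
    n_train = round(fractions[0] * len(ids))
    n_val = round(fractions[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }


splits = split_patients(range(1000))
print({name: len(ids) for name, ids in splits.items()})
```

Because the split is made over patient identifiers rather than exams or images, all exams from a given patient fall into the same set, which prevents leakage between training and evaluation.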

### Additional filtering of the test set

Next, we refined exams with biopsy-proven benign findings to determine if the pathology results were deemed by the radiologist to be concordant or discordant with the imaging features of the breast lesion. Patients with biopsy reports that confirmed a discordant benign finding were only included in the test set if they received a subsequent biopsy (that was not discordant) or breast surgery within the 6 months following the initial discordant biopsy. Patients with benign discordant biopsies that did not receive subsequent pathological evaluation were excluded.

Lastly, we ensured that exams with pathology-proven cancers contained images of these cancers. Since breast US produces small images which do not comprehensively capture the entire breast, a proportion of patients diagnosed with breast cancer did not have images of the cancer in any of their US images. US exams with a label indicating malignancy and a BI-RADS score of 1–2 were excluded as these exams typically did not contain images of the cancer. Additionally, patients diagnosed with breast cancer who did not have any breast pathology obtained using US-guided biopsy were also excluded, since the majority of patients diagnosed using MRI and stereotactic-guided biopsies had malignancies that were sonographically occult. US exams that received a BI-RADS score of 0, 3, or 6, as well as exams from patients who had breast pathology obtained using multi-modal image guidance (US plus stereotactic and/or MRI-guided biopsies), were manually reviewed to confirm that breast cancer was visible on the US exam. Patients who were given a BI-RADS score of 4–5 and had all their breast pathology obtained using US-guided biopsy were presumed to have visible cancers and were not manually reviewed.
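The rules above for cancer-positive exams can be summarized as a small decision function. This is only a sketch: the function name, the representation of biopsy guidance as a set of modality strings, and the order in which the rules are checked are our assumptions.

```python
def visibility_decision(birads: int, guidance: set) -> str:
    """Decide how a cancer-positive exam is handled in the test set.

    birads:   BI-RADS score assigned to the US exam (0-6).
    guidance: modalities used to obtain the pathology, e.g. {"US"} or {"US", "MRI"}.
    Returns "exclude", "manual_review", or "include".
    """
    if birads in (1, 2):
        return "exclude"            # exam typically does not show the cancer
    if "US" not in guidance:
        return "exclude"            # no US-guided biopsy: likely sonographically occult
    if birads in (0, 3, 6) or guidance != {"US"}:
        return "manual_review"      # visibility confirmed by manual review
    return "include"                # BI-RADS 4-5 with only US-guided pathology


examples = [
    (2, {"US"}),
    (4, {"MRI"}),
    (3, {"US"}),
    (5, {"US", "stereotactic"}),
    (4, {"US"}),
]
for birads, guidance in examples:
    print(birads, sorted(guidance), visibility_decision(birads, guidance))
```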

### Breast-level cancer labels

Among all the exams in the dataset, 28,914 exams (approximately 10%) were associated with at least one pathology report dated within 30 days prior or 120 days after the US examination. Pathology reports were used to automatically detect cancer labels. In cases where there were multiple pathology reports recorded within the considered time window, all of these reports were evaluated. Malignant findings included primary breast cancers: invasive ductal carcinoma, invasive lobular carcinoma, special-type invasive carcinoma (including tubular, mucinous and cribriform carcinomas), inflammatory carcinoma, intraductal papillary carcinoma, microinvasive carcinoma, ductal carcinoma in situ, as well as non-primary breast cancers: lymphoma and phyllodes. Benign findings included cyst, fibroadenoma, scar, sclerosing adenosis, lobular carcinoma in situ, columnar cell changes, atypical lobular hyperplasia, atypical ductal hyperplasia, papilloma, periductal mastitis and usual ductal hyperplasia. The labels were automatically extracted from the corresponding pathology reports using a natural language processing pipeline developed earlier41. Of note, patients with multiple pathology reports could be assigned both malignant and benign labels if their exam contained both types of lesions.
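The time-window matching described above can be sketched as follows. The sketch assumes that findings have already been extracted from each pathology report (the natural language processing step is not reproduced here), that the window bounds are inclusive, and that the function names and data layout are hypothetical.

```python
from datetime import date, timedelta

DAYS_BEFORE, DAYS_AFTER = 30, 120  # window stated in the text


def in_label_window(exam_date: date, report_date: date) -> bool:
    """True if the report falls within 30 days before to 120 days after the exam."""
    earliest = exam_date - timedelta(days=DAYS_BEFORE)
    latest = exam_date + timedelta(days=DAYS_AFTER)
    return earliest <= report_date <= latest  # assumption: inclusive bounds


def breast_labels(exam_date: date, reports) -> dict:
    """Combine all in-window reports for one breast into benign/malignant flags.

    reports: iterable of (report_date, finding), finding in {"benign", "malignant"}.
    """
    labels = {"benign": False, "malignant": False}
    for report_date, finding in reports:
        if in_label_window(exam_date, report_date):
            labels[finding] = True
    return labels


exam = date(2018, 6, 1)
reports = [
    (date(2018, 5, 20), "benign"),     # 12 days before the exam: inside the window
    (date(2018, 8, 15), "malignant"),  # 75 days after the exam: inside the window
    (date(2019, 1, 10), "malignant"),  # more than 120 days after: ignored
]
labels = breast_labels(exam, reports)
print(labels)
```

Keeping two independent flags mirrors the note above that a breast can carry both a benign and a malignant label when several reports fall within the window.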

### Breast ultrasound images dataset

This external dataset was collected in 2018 from Baheya Hospital for Early Detection and Treatment of Women’s Cancer (Cairo, Egypt) with the LOGIQ E9 ultrasound system. It included 780 breast US images, with an average resolution of 500 × 500 pixels, acquired from 600 female patients whose ages ranged between 25 and 75 years old. Among these 780 images, 133 were normal images without masses, 437 were images containing benign masses, and 210 were images containing malignant masses. We refer the reader to the original paper for more information about this public dataset40.

### Deep neural network architecture

We present a deep learning model (DLM) whose architecture is shown in Supplementary Fig. 5. To explain the mechanics of this model, we first introduce some notation. Let $$\mathbf{x}\in \mathbb{R}^{H,W,3}$$ denote an RGB US image with a resolution of H × W pixels and let $$X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_K\}$$ denote an image set that contains all images acquired from one breast of the patient during a US exam. This DLM is trained to process the image set X, which may vary in the number of images it contains (Supplementary Fig. 3), and to generate two probability estimates $$\hat{y}^{b}, \hat{y}^{m}\in [0,1]$$ that indicate the predicted probability of the presence of benign and malignant lesions in the patient’s breast, respectively. The DLM is designed to resemble the diagnostic procedure performed by radiologists. First, it generates saliency maps and probability estimates for each image $$\mathbf{x}_k$$ in the image set. This step is similar to a radiologist roughly scanning through each US image and looking for abnormal findings. Then it computes a set of attention scores which indicate the importance of each image to the cancer diagnosis task. This procedure can be seen as an analog to a radiologist concentrating on images that contain suspicious lesions. Finally, it forms a breast-level cancer diagnosis by combining information collected from all images. This is analogous to a radiologist comprehensively considering signals in all images to render a full diagnosis. Below we describe each step in detail.

1. Saliency maps. The DLM first utilizes a convolutional neural network54, $$f_g$$ (parameterized as a ResNet-18 network55), to extract a representation of each image $$\mathbf{x}_k$$ in an image set X, denoted by $$\mathbf{h}_{k}\in \mathbb{R}^{h,w,C}$$. The height, the width, and the number of channels are denoted by h, w, and C, respectively. Inspired by Zhou et al.38, we then apply a convolutional layer with 1 × 1 convolutional filters followed by a sigmoid non-linearity to transform $$\mathbf{h}_k$$ into two saliency maps $$\mathbf{A}_{k}^{b}\in \mathbb{R}^{h,w}$$ and $$\mathbf{A}_{k}^{m}\in \mathbb{R}^{h,w}$$. These saliency maps highlight approximate locations of benign and malignant lesions in each image. Each element $$\mathbf{A}_{k}^{b}[i,j], \mathbf{A}_{k}^{m}[i,j]\in [0,1]$$ denotes the contribution of spatial location (i, j) towards predicting the presence of benign/malignant lesions. The resolution of the saliency maps (h, w) depends on the implementation of $$f_g$$ and is usually smaller than the resolution of the input image (H, W). In this work, we set h = w = 8, C = 512, and H = W = 256.

2. Attention scores. The images in the image set X might significantly differ in how relevant each of them is to the classification task. To address this issue, we utilize the Gated Attention Mechanism56, allowing the model to select which information to incorporate from all images. Specifically, we first apply global max pooling to transform the representation hk computed for the image xk into a vector $$\mathbf{v}_{k}\in \mathbb{R}^{C}$$. Two attention scores $$\alpha_{k}^{b}$$ and $$\alpha_{k}^{m}\in [0,1]$$ that indicate the importance of each image xk to the estimation of the probability of the presence of benign and malignant findings in the breast are computed as

$${{{{\boldsymbol{\alpha}}}}}_k = \frac{{\exp}\{{{{{\mathbf{W}}}}}^\intercal ({\tanh}({{{{\mathbf{V}}}}}{{{{\mathbf{v}}}}}_k^{\intercal}) \odot {{\mbox{sigm}}}({{{{\mathbf{U}}}}}{{{{\mathbf{v}}}}}_k^{\intercal}) )\}}{\sum^K_{j=1}{\exp}\{{{{{\mathbf{W}}}}}^\intercal ({\tanh}({{{{\mathbf{V}}}}}{{{{\mathbf{v}}}}}_j^{\intercal}) \odot {{\mbox{sigm}}}({{{{\mathbf{U}}}}}{{{{\mathbf{v}}}}}_j^{\intercal}) )\}},$$
(1)

where $$\boldsymbol{\alpha}_{k}=\left[\begin{array}{l}\alpha_{k}^{b}\\ \alpha_{k}^{m}\end{array}\right]$$ denotes the concatenation of attention scores for both benign and malignant findings, $$\odot$$ denotes element-wise multiplication, and $$\mathbf{W}\in \mathbb{R}^{L\times 2}$$, $$\mathbf{V}\in \mathbb{R}^{L\times M}$$ and $$\mathbf{U}\in \mathbb{R}^{L\times M}$$ are matrices of learnable parameters. In all experiments, we set L = 128 and M = 512.

3. Cancer diagnosis. Lastly, the DLM aggregates the information from all US images in the image set X and generates the final diagnosis using the attention scores and saliency maps. We first use an aggregation function $$f_{\mathrm{agg}}(\mathbf{A}):\mathbb{R}^{h,w}\mapsto [0,1]$$ to transform the saliency maps into image-level predictions:

$${\hat{y}}_{k}^{b}={f}_{{{{{{{{\rm{agg}}}}}}}}}({{{{{{{{\bf{A}}}}}}}}}_{k}^{b})\quad {\hat{y}}_{k}^{m}={f}_{{{{{{{{\rm{agg}}}}}}}}}({{{{{{{{\bf{A}}}}}}}}}_{k}^{m}).$$
(2)

In our work, we parameterize fagg as the top t% pooling proposed by Shen et al.57,58,59. Namely, we define the aggregation function as

$${f}_{{{{{{{{\rm{agg}}}}}}}}}({{{{{{{\bf{A}}}}}}}})=\frac{1}{| {H}^{+}| }\mathop{\sum}\limits_{(i,j)\in {H}^{+}}{{{{{{{{\bf{A}}}}}}}}}_{i,j},$$
(3)

where $$H^{+}$$ denotes the set containing the locations of the top t% values in A, and t is a hyperparameter. The breast-level cancer prediction $$\hat{\mathbf{y}}=\left[\begin{array}{l}\hat{y}^{b}\\ \hat{y}^{m}\end{array}\right]$$ is then defined as the average of all image-level cancer predictions weighted by the attention scores:

$${\hat{y}}^{b}=\mathop{\sum }\limits_{k=1}^{K}{\alpha }_{k}^{b}{\hat{y}}_{k}^{b},\quad {\hat{y}}^{m}=\mathop{\sum }\limits_{k=1}^{K}{\alpha }_{k}^{m}{\hat{y}}_{k}^{m}.$$
(4)
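The three steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch of Eqs. (1)–(4), not the authors' PyTorch implementation: the function names are hypothetical, matrices are stored as nested lists, the saliency maps and pooled vectors are assumed to be given, t is treated as a fraction (as in the hyperparameter range t ∈ [0.1, 0.5]), and the rounding used to pick the number of top values is an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in mat]

def gated_attention(vs, V, U, W):
    """Eq. (1): attention scores for K pooled image vectors.

    vs: K vectors of length M; V, U: L x M matrices; W: L x 2 matrix.
    Returns K pairs [alpha_b, alpha_m]; each class sums to 1 over the images.
    """
    logits = []
    for v in vs:
        gate = [math.tanh(a) * sigmoid(b)
                for a, b in zip(matvec(V, v), matvec(U, v))]
        # W^T gate gives one logit per class (benign, malignant).
        logits.append([sum(W[l][c] * gate[l] for l in range(len(gate)))
                       for c in range(2)])
    alphas = [[0.0, 0.0] for _ in vs]
    for c in range(2):
        mx = max(row[c] for row in logits)  # subtract max for stability
        exps = [math.exp(row[c] - mx) for row in logits]
        total = sum(exps)
        for k, e in enumerate(exps):
            alphas[k][c] = e / total
    return alphas

def top_t_pooling(A, t):
    """Eq. (3): mean of the top t fraction of values in saliency map A."""
    flat = sorted((x for row in A for x in row), reverse=True)
    n = max(1, int(round(t * len(flat))))  # rounding rule is an assumption
    return sum(flat[:n]) / n

def breast_level_prediction(maps_b, maps_m, alphas, t):
    """Eqs. (2) and (4): attention-weighted sum of image-level predictions."""
    y_b = sum(a[0] * top_t_pooling(A, t) for a, A in zip(alphas, maps_b))
    y_m = sum(a[1] * top_t_pooling(A, t) for a, A in zip(alphas, maps_m))
    return y_b, y_m
```

Because the attention scores for each class sum to one over the K images, the breast-level prediction stays in [0, 1] whenever the image-level predictions do.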

### Training details

In order to constrain the saliency maps to only highlight important regions, we impose the L1 regularization on A which penalizes the DLM for highlighting irrelevant pixels:

$${L}_{{{{{{{{\rm{reg}}}}}}}}}({{{{{{{\bf{A}}}}}}}})=\mathop{\sum}\limits_{(i,j)}| {{{{{{{\bf{A}}}}}}}}[i,j]| .$$
(5)

Despite the relative complexity of our proposed framework, this DLM can be trained end-to-end using stochastic gradient descent with the following loss function, defined for a single training example (i.e. one breast) as

$$L(\mathbf{y},\hat{\mathbf{y}})=\sum\limits_{c\in \{b,m\}}\left[\mathrm{BCE}(y^{c},\hat{y}^{c})+\beta \sum\limits_{k=1}^{K}L_{\mathrm{reg}}(\mathbf{A}_{k}^{c})\right],$$
(6)

where BCE is the binary cross-entropy and β is a hyperparameter. For all experiments, the training loss is optimized using Adam60. Of note, labels indicating the presence of benign lesions ($$y^{b}$$) were also used during training to regularize the network through multi-task learning61. On the test set, we focus on evaluating predictions of malignancy since it is a more clinically relevant task: identification of malignant lesions has an immediate and significant impact on patient management (biopsy, potential surgery), whereas identification of benign breast lesions typically does not alter management compared to patients without breast lesions12.
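As a worked illustration of Eqs. (5) and (6), the following sketch computes the loss for one breast in plain Python. It assumes that the sum over c covers both the cross-entropy and the regularization terms, so that the benign and malignant saliency maps are both penalized; the function names, the dictionary layout, and the clipping constant `eps` are hypothetical choices made for this example.

```python
import math

def bce(y, y_hat, eps=1e-7):
    """Binary cross-entropy for one label; eps clipping is an assumption."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def l1_reg(A):
    """Eq. (5): sum of absolute values of a saliency map (list of rows)."""
    return sum(abs(x) for row in A for x in row)

def breast_loss(y, y_hat, maps, beta):
    """Eq. (6) for one breast.

    y, y_hat: dicts keyed by 'b' (benign) and 'm' (malignant).
    maps[c]: list of the K saliency maps for class c.
    """
    total = 0.0
    for c in ("b", "m"):
        total += bce(y[c], y_hat[c])
        total += beta * sum(l1_reg(A) for A in maps[c])
    return total
```

With β = 0 the loss reduces to the two-task cross-entropy; larger β pushes the saliency maps towards sparsity.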

We optimized the hyperparameters with random search62. Specifically, we searched for the learning rate $$\eta \in 10^{[-5.5,-4]}$$ on a logarithmic scale, the regularization hyperparameter $$\beta \in 10^{[-3,0.5]}$$ on a logarithmic scale, the weight decay hyperparameter $$\lambda \in 10^{[-6,-3.5]}$$ on a logarithmic scale, and the pooling threshold $$t\in [0.1,0.5]$$ on a linear scale. We trained 30 separate models using hyperparameters uniformly sampled from the ranges above. Each model was trained for 50 epochs. We saved the model weights from the training epoch that achieved the highest AUROC on the validation set. To further improve our results, we used model ensembling63. Specifically, we averaged the breast-level predictions of the 3 models that achieved the highest AUROC on the validation set to produce the overall prediction of the ensemble.
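The sampling and ensembling scheme described above can be sketched as follows. The exponents are drawn uniformly, which corresponds to sampling on a logarithmic scale; the function names, dictionary keys, and the fixed seed are hypothetical, and training itself is omitted.

```python
import random

def sample_hyperparameters(rng):
    """Draw one configuration from the search ranges given in the text."""
    return {
        "lr": 10 ** rng.uniform(-5.5, -4.0),            # log scale
        "beta": 10 ** rng.uniform(-3.0, 0.5),           # log scale
        "weight_decay": 10 ** rng.uniform(-6.0, -3.5),  # log scale
        "t": rng.uniform(0.1, 0.5),                     # linear scale
    }

def ensemble_top_k(val_aurocs, predictions, k=3):
    """Average the predictions of the k models with the highest validation AUROC.

    val_aurocs[i]: validation AUROC of model i.
    predictions[i][j]: prediction of model i for breast j.
    """
    ranked = sorted(range(len(val_aurocs)),
                    key=lambda i: val_aurocs[i], reverse=True)[:k]
    n_cases = len(predictions[0])
    return [sum(predictions[i][j] for i in ranked) / k for j in range(n_cases)]

rng = random.Random(0)  # seed chosen only to make the sketch reproducible
configs = [sample_hyperparameters(rng) for _ in range(30)]
```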

During training, we applied image augmentation, including random horizontal flipping (p = 0.5), random rotation (−45° to 45°), random translation in both horizontal and vertical directions (up to 10% of the image size), scaling by a random factor between 0.7 and 1.5, and random shearing (−25° to 25°). The resulting image was then resized to 256 × 256 pixels using bilinear interpolation and normalized. During the validation and test stages, the original image was resized and normalized without any augmentation.

### Test-time augmentation

We adopted test-time augmentation64 on the external test set to improve the model’s performance. We applied the following augmentations and computed a prediction on each augmented image: random horizontal flipping (p = 0.5), random vertical flipping (p = 0.5), and altering the brightness and contrast by a factor randomly chosen from [0.9, 1.1]. This augmentation pipeline was selected using the AI system’s performance on the validation subset of the NYU Breast Ultrasound Dataset. We repeated this procedure 20 times on each image. The final prediction for each image was computed by averaging the predictions on all augmented images.
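A minimal sketch of this test-time augmentation loop is given below. It makes several simplifying assumptions that are not stated in the text: images are single-channel nested lists with intensities in [0, 1], brightness and contrast factors are drawn independently, contrast is adjusted around the image mean, and `predict` is a hypothetical callable standing in for the trained model.

```python
import random

def hflip(img):
    return [row[::-1] for row in img]

def vflip(img):
    return img[::-1]

def clip01(p):
    return min(1.0, max(0.0, p))

def adjust_brightness(img, factor):
    return [[clip01(p * factor) for p in row] for row in img]

def adjust_contrast(img, factor):
    # Scale deviations from the mean intensity (one common convention).
    n = sum(len(row) for row in img)
    mean = sum(p for row in img for p in row) / n
    return [[clip01(mean + factor * (p - mean)) for p in row] for row in img]

def tta_predict(predict, img, n_repeats=20, rng=None):
    """Average predict() over n_repeats randomly augmented copies of img."""
    rng = rng or random.Random()
    preds = []
    for _ in range(n_repeats):
        aug = img
        if rng.random() < 0.5:
            aug = hflip(aug)
        if rng.random() < 0.5:
            aug = vflip(aug)
        aug = adjust_brightness(aug, rng.uniform(0.9, 1.1))
        aug = adjust_contrast(aug, rng.uniform(0.9, 1.1))
        preds.append(predict(aug))
    return sum(preds) / len(preds)
```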

### Implementation details

Image preprocessing was performed using Python (3.7) with the following packages: OpenCV (3.4), pandas (0.24.1), NumPy (1.15.4), PIL (5.3.0), and Pydicom (2.2.0). The deep learning model was implemented using PyTorch (1.1.0) and Torchvision (0.2.2). Evaluation metrics were computed using Scikit-learn (0.19.1).

### Hybrid model

To explore the potential benefit that the AI system might be able to provide, we created a hybrid model for each radiologist, whose predictions were created by averaging the predictions of the respective radiologist and the AI model: $$\hat{\mathbf{y}}_{\mathrm{hybrid}}=\lambda \hat{\mathbf{y}}_{\mathrm{expert}}+(1-\lambda )\hat{\mathbf{y}}_{\mathrm{AI}}$$. The BI-RADS scores of radiologists were used as their predictions. Both $$\hat{\mathbf{y}}_{\mathrm{AI}}$$ and $$\hat{\mathbf{y}}_{\mathrm{expert}}$$ were standardized to have zero mean and unit variance. In this study, we set λ = 0.5. We note that λ = 0.5 is not the optimal value. However, the performance obtained by retroactively fine-tuning λ on the reader study would not be transferable to realistic clinical settings. Therefore, we chose λ = 0.5 as the most natural way of aggregating two predictions without prior knowledge of their quality.
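The hybrid prediction can be illustrated with a short sketch. It assumes that the radiologist's BI-RADS scores have already been mapped to numbers and that standardization uses the population standard deviation over the evaluated cases; both choices, along with the function names, are assumptions made for this example.

```python
import statistics

def standardize(scores):
    """Rescale scores to zero mean and unit variance."""
    mu = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population SD; an assumption
    return [(s - mu) / sd for s in scores]

def hybrid_scores(expert, ai, lam=0.5):
    """lam * expert + (1 - lam) * AI, computed on standardized scores."""
    e, a = standardize(expert), standardize(ai)
    return [lam * x + (1 - lam) * y for x, y in zip(e, a)]
```

Standardizing first puts the ordinal BI-RADS scale and the model's probabilities on a comparable scale, so that λ = 0.5 weights the two sources equally.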

### Statistical analysis

In this study, we evaluated the performance of the AI system, radiologists, and the hybrid models using the following evaluation metrics: area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), sensitivity, specificity, biopsy rate, negative predictive value (NPV), and positive predictive value (PPV). AUROC and AUPRC were used to assess the diagnostic accuracy of the probabilistic predictions generated by the AI system/hybrid models and the BI-RADS scores of the readers. The BI-RADS scores were treated as a 6-point index of suspicion for malignancy: scores of 1 and 2 were collapsed into the lowest category of suspicion; scores 3, 4A, 4B, 4C and 5 were treated independently as increasing levels of suspicion. AUROC avoids the subjectivity in selecting the thresholds to dichotomize continuous predictions, since it compares performance across all possible recall rates. However, AUROC weights omission and commission errors equally and therefore could provide excessively optimistic estimates in extremely imbalanced classification tasks such as cancer diagnosis, where the negative cases often overwhelm the positive cases65. Therefore, to complement AUROC, we also reported AUPRC, which solely evaluates the ability to correctly identify the positive cases. We calculated both AUROC and AUPRC using the Python Scikit-learn API66.

In addition, we evaluated the binary predictions of the AI system, the hybrid models, and the readers using sensitivity, specificity, biopsy rate, NPV, and PPV. These metrics are commonly used to assess the diagnostic accuracy in clinical studies7,11,15. The PPV reported in this study corresponds to PPV2, which is defined as the number of breasts with cancer that were recommended to undergo biopsy divided by the total number of breast biopsies recommended12. For each breast, the AI system and the hybrid models produced a probabilistic score that represents the likelihood of cancer being present. We dichotomized these scores to produce binary predictions by selecting a score threshold that separates positive and negative decisions. To compute sensitivity, we dichotomized the AI system’s probabilistic predictions to match the average reader’s specificity. To calculate the specificity, biopsy rate, PPV and NPV, we dichotomized the AI system’s probabilistic predictions to match the average reader’s sensitivity. We similarly dichotomized the predictions of each hybrid model using the sensitivity/specificity of its respective reader. For all evaluation metrics, we estimated the 95% confidence intervals using 1000 iterations of the bootstrap method67.
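The threshold matching and bootstrap procedure can be sketched as follows. Several details are assumptions rather than statements from the text: a case is called positive when its score is at or above the threshold, the threshold is the lowest one whose specificity is at least the target (an exact match is not always possible with finitely many cases), resampling is done at the case level, and the interval is a percentile interval. All function names are hypothetical.

```python
import random

def threshold_for_specificity(labels, scores, target_spec):
    """Lowest threshold whose specificity is at least target_spec."""
    neg = [s for y, s in zip(labels, scores) if y == 0]
    for thr in sorted(set(scores)) + [float("inf")]:
        spec = sum(s < thr for s in neg) / len(neg)
        if spec >= target_spec:
            return thr

def sensitivity(labels, scores, thr):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    return sum(s >= thr for s in pos) / len(pos)

def bootstrap_ci(labels, scores, metric, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for metric(labels, scores)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # redraw if the resample lacks one of the classes
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_iter)]
    hi = stats[int((1 - alpha / 2) * n_iter) - 1]
    return lo, hi
```

For example, to report sensitivity at a reader's specificity, one would first call `threshold_for_specificity` with that specificity and then pass `lambda y, s: sensitivity(y, s, thr)` as the metric to `bootstrap_ci`.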

In the reader study, we compared the AUROC, AUPRC, sensitivity, specificity, PPV, and biopsy rate of the AI system and hybrid models with those of the average radiologist. The confidence intervals for these differences were obtained through 1000 iterations of the bootstrap method67. The p-values were computed using a one-tailed permutation test68. In each of 10,000 trials, we randomly swapped the AI/hybrid model’s score with one of the comparator reader’s score for each case, yielding a reader-AI difference sampled from the null distribution. A one-sided p-value was computed by comparing the observed statistic to the empirical quantiles of the null distribution. We used a statistical significance threshold of 0.05.
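A simplified version of this permutation test, for a single comparator reader, might look like the sketch below. It assumes that the model and reader scores are on comparable scales, that each case is swapped with probability 0.5, and that the p-value uses the common (count + 1)/(trials + 1) correction; none of these details is specified in the text, and the pairwise `auroc` function is included only as an example metric.

```python
import random

def auroc(labels, scores):
    """Pairwise AUROC with ties counted as one half (example metric)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_p_value(labels, model_scores, reader_scores, metric,
                        n_trials=10000, seed=0):
    """One-sided p-value for metric(model) - metric(reader) > 0."""
    rng = random.Random(seed)
    observed = metric(labels, model_scores) - metric(labels, reader_scores)
    count = 0
    for _ in range(n_trials):
        a, b = [], []
        for m, r in zip(model_scores, reader_scores):
            if rng.random() < 0.5:  # swap the two scores for this case
                m, r = r, m
            a.append(m)
            b.append(r)
        if metric(labels, a) - metric(labels, b) >= observed:
            count += 1
    return (count + 1) / (n_trials + 1)
```

The result would then be compared against the significance threshold of 0.05.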

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.