Breast cancer is the most common cancer in women, and the second most common cancer overall in the world1. Early detection is a crucial factor that aids in reducing mortality rates due to breast cancer2. Mammography is therefore offered as a screening modality for women over a certain age in many countries for early detection of cancer3. In fact, the 10 year survival rate due to breast cancer falls from over 95% in patients where the cancer is less than 1 cm to about 60% when the size of the cancer is over 3 cm4. However, detecting cancer on mammograms is a tedious job5, apart from requiring highly specialized breast radiologists. Only about 5 out of 1000 scans would harbor a cancer6. The cancer may also only occupy less than 1% of the image. The ability to identify these tiny cancers which are likely to be missed by a fatigued radiologist would therefore be an important contribution of computer vision.

State-of-art deep learning based image classifiers7,8 and object detectors9,10,11 have performed exceedingly well in mammography. The reader is directed to the systematic review by Freeman et al.12 for a more detailed description of currently available deep learning tools. However the sensitivity for detection of small masses is lower than for large masses13. Recently, some authors have attempted to design approaches specific to small cancer detection. Savelli et al.14 in their work designed a multi-context network for small lesion detection, however they show their results only on microcalcification. Agarwal et al.13 trained a patch-based classifier while showing the benefit of domain adaptation in mammography. They showed their results on masses of different sizes in their analysis, showing that the performance of CNNs drops for small masses. Lotter et al.15 perform multi-scale curriculum learning on mammograms to deal with the problem of very small lesions in comparison to image size. However, they perform only image level classification, without precise cancer localisation. In this work, we deliver a concerted effort towards the detection of small sized cancers, which represent a clinically significant problem. We analyze this problem from the perspective of mammography and propose a solution by combining resolution, scale as well as image-context. (Fig. 1).

Figure 1
figure 1

Importance of resolution, scale and Image-context in detection of small cancers on mammography.

Importance of spatial resolution

Fine Microcalcifications and details such as spiculations are central to identifying breast cancers, the visibility of which on mammography is critically dependent on the spatial resolution of a mammogram. In fact, out of all imaging modalities, mammography has the highest spatial resolution. Digital Mammograms have a pixel size of around 50–100 microns16. DMs tend to be very large images, ranging from 2300 × 1800 pixels (of dimension 100 microns) to 4096 × 5625 pixels (of dimension 54 micron)17. Most CNNs however take much smaller input sizes, thus losing this information so critical to the diagnosis of small masses.

Importance of scale

When mammograms are used in full resolution as input to a network, the receptive field typically includes only a fraction of the image. While full resolution is vital for identifying fine details (such as spiculations), masses may not be seen in its entirety within the receptive field. Thereby the shape and margins of the mass would not be visualized, which are important descriptors of any mammographic mass (Fig. 1). Thus, apart from seeing mammograms at original resolution, we propose that a multi-scale approach where the image is also seen at reduced scale would be particularly beneficial in the context of mammograms.

Importance of image-context

The features essential for detection of small cancers, may be significantly different from those for large (or average sized) cancers, much like in case of small objects in natural images18. While shape, margins and density are important descriptors for average sized masses, for a very small mass, features such as margins and shape may not be visible directly. Other cues in surrounding parenchyma such as architectural distortion and the skin bulge or retraction play an important role in the detection. Thus we propose that image-context is particularly important in the case of small sized cancers.

Approach to detection of small cancers

Though resolution, scale and context are all well described factors playing an important role in object detection, designing models which can do all these together is challenging. Our ideas and network for incorporating scale and context for detection are inspired by the work of Hu et al.18 on small object detection in natural images. We hereby present a concerted effort towards small cancer detection with our network, which takes into account all the above factors, without compromising on each other.

Materials and methods


This was a single institution, multi-centre study. We show our results on one diagnostic mammography dataset, one screening mammography dataset and a curated dataset with small cancers.

Training dataset

For training our network, we collected a dataset from our hospital consisting of Full Field Digital Mammograms (FFDM) acquired on Selenia Dimensions, a Hologic Mammography unit from Jan 2015 to Dec 2015. In order to make a balanced dataset suitable for training, we collected consecutive patients who had been assigned a BIRADS 4 category and had a histopathological diagnosis. Thus this dataset consisted of 839 images, with 393 cancers. The images of the contralateral breast provided normal examples to the network.

Test datasets

Diagnostic mammography dataset

Our country has no formal mammography screening program. Thus our test dataset (acquired from the same centre as the training dataset) had a distribution of diagnostic mammography practice. For this dataset, consecutive patients who underwent mammography from January 2018 to June 2018 were chosen. Patients who had been given a BIRADS 4 or above but did not have histopathological report were excluded from the study. There were 2569 images with 243 cancers in this dataset. This is referred to as the DM dataset in further discussion.

Screening mammography dataset

This was an external dataset obtained from our cancer centre, where opportunistic screening is offered to all eligible women. All patients who underwent mammography from January to April 2021 were selected, except those without histological proof for BIRADS 4 or above lesions. This dataset provided an external test dataset, as well as helped ascertain our efficacy in a screening setting. These images were acquired on a Hologic system. There were 2146 images with 59 cancers, and this is referred to as the SM dataset in further discussion.

Small cancer dataset

In order to establish the value of our network in small cancers, we curated a dataset of patients with cancers less than 1 cm in size (diameter of mass was used for masses and longest dimension of the cluster was used for a cluster of microcalcification). There were 79 images in this dataset with an average cancer size of 5.8 mm diameter. This is referred to as the SC dataset. Images in this dataset came from both the above centres.

All datasets were collected after obtaining ethical clearance from the Institutional Review Board (IRB) of the All India Institute of Medical Sciences with reference number IEC-247/04.05.2018. This data was de-identified, informed consent was obtained for use of data from all patients participating in the study. All experimental protocols were approved by the IRB of the All India Institute of Medical Sciences, New Delhi and all methods were carried out in accordance with prevailing guidelines and regulations. Bounding box annotations were performed by 3 breast radiologists with 2, 8 and 15 years of experience in breast imaging. All images were of size 3328 × 4096 pixels.

Model architecture


All input images were of size 3328 × 4096. These were initially cropped to remove the portion of the mammogram that had no breast in it. These images were thus of variable size (depending on the size of the breast) on one dimension and 4096 pixels on the other dimension. These were then passed forward in the network.

Network architecture

Our proposed architecture is shown in Fig. 2. The network involved the following steps (1) Generating multiple scales (2) Systematic crop of images (3) Passing through baseline architecture (4) Combination at test time.

Figure 2
figure 2

Proposed Architecture for detection of small cancers. The image is rescaled to 3 different scales, and fixed sized crops are systematically obtained to provide inputs with variable resolution, scale and context to the network. Predictions from all 3 scales are combined at test time.

Generation of multiple scales

The full resolution image is rescaled to give images at 3 scales- X, 0.5X and 0.25 X, where X is the original image.

Systematic crops

Crops of size 0.25 times the original image are taken from all 3 scales. These crops were 1024 pixels on one dimension, and variable size on the other dimension. These crops constitute the input to the network. The crops are systematically taken from the larger images from right to left and top to bottom ensuring that no part of the image is left out.

Baseline architecture

We chose YOLO v5 (You Only Look Once version 5)19 as our baseline architecture as it is fully convolutional and thus allows us to pass images of all input sizes. Our ablation study to choose the baseline network is presented in the results section. We used YOLO v5 with a CSPDarknet backbone, a PANet neck with upsampling and Concatenations from different layers and a final YOLO head Convolution. The CSPDarknet is used for feature extraction, PANet for feature fusion and the YOLO layer for computing class and objectness scores. Concatenation of feature vectors at varying layers within the back-bone architecture helps to leverage the varying contextual information available in different layers. Each layer therefore corresponds to increasingly larger receptive fields, which is instrumental in detecting masses of very small sizes.

Combination of output at test time

At test time, full images were given as input in 3 scales- X, 0.5 X and 0.25X. Predictions were generated for each scale separately. These predictions were finally combined using Weighted Box Fusion20(WBF) as described by Wang et al. WBF uses the confidence scores from each of the models and then combines them to construct an "average” bounding box that captures the underlying ground truth boxes better than any of the individual predictions from the models. We experimented with simple Non-Maximal Suppression (NMS) and NMS with thresholds and found WBF to perform the best.

Implementation details

As in yolov5 default implementation, we used a Binary Cross-Entropy with Logits Loss for computation of object scores, and SGD as the Optimizer. We kept the batch size to 16, and the initial learning rate to be 0.01. All computations were carried out on High Performance Computing Cluster having 32 GB V100 GPUs.

Results and experiments

We evaluated the model using Free-Response Operator Characteristic (FROC) curves by plotting the sensitivity of the network against the false positive marks per image. A detection was considered a true detection if the center of the predicted box fell anywhere within the ground-truth box, as is the standard practice in mammography21,22.

Selection of baseline architecture

For selection of baseline network, 4 object detection networks which were fully convolutional were trained on our training data and tested on the DM dataset. The results are given in Table 1.

Table 1 Selection of a fully convolutional network as baseline for our proposed network.

Results on the DM, SM and small cancer dataset

Our FROC curves on the DM, SM and SC datasets are given in Fig. 3a–c respectively . The results of the baseline architecture is also plotted for comparison. Table 2 summarizes the performance of the network on the 3 datasets.

Figure 3
figure 3

Performance of the proposed network on the DM dataset (a), SM dataset (b) and the SC dataset (c). Results of baseline architecture are plotted for comparison.

Table 2 Summary of performance of our proposed network on the 3 datasets.

Ablation studies

The architecture we have described has been built on 3 basic principles: resolution, scale and context. In order to study the effect of each of these components, we performed a few ablation experiments on our Small Cancer dataset (Fig. 5).

First, to study the effect of resolution, we tested the small cancer dataset by upsampling the low resolution images instead of using higher resolutions for crops. Here the images were first downsampled to 0.25X, and then 0.5X and X were generated by upsampling the 0.25X image. Crops were then taken from the upsampled images. Figure 4a shows the FROC thus obtained and compares it with our proposed model. As seen here, the model performs much better when crops are taken from high resolution images, rather than simply from different scales, demonstrating the importance of resolution.

Figure 4
figure 4

Effect of resolution (a), scale (b) and context (c) on performance of network.

Second, in order to study the effect of scale, we train our baseline architecture with only full resolution images without re-scaling them (Fig. 4b). As seen in Fig. 4b, our proposed model performs better than the model trained and tested on only the high resolution image indicating the importance of a multi-scale approach.

Finally, in order to study the effect of context, we first negate the effect of resolution by using upsampled images (upsampled from 0.25X images) rather than original resolution images as input. Crops were now systematically taken from each scale. Thus here crops from 0.25X have maximum context and crops from X have least context, though all have the same resolution. The performance on 0.25X and X are thus studied in Fig. 4c. As seen here, 0.25 X performs much better than X, demonstrating the importance of image context in cancer detection.

In order to analyze the importance of each scale/ resolution in relation to size of image, we analysed the results of each individual scale prior to WBF on the SM dataset. We analysed masses exclusively caught only on that particular scale. These are summarized in Table 3.

Table 3 Analysis of detection performance on individual scales prior to WBF.

Thus we see that each factor, resolution, scale as well as context had significant independent contribution towards achieving the best accuracy using our model.


Context, scale and resolution have been known to play a role in small object detection in natural images. In this work we explore the role of each of these factors towards small cancer detection specific to mammography, which presents a unique problem due to the presence of very small objects (cancers) in very large images. We showed through experiments how each of these factors have an important role to play in detection of small cancers.

We performed a single institution, multi-centre study to validate the results of our proposed network. We validated our results on a dataset with a distribution of diagnostic mammography practice as well as a screening mammography dataset (external validation). We show that our network performs comparably in both settings. In both datasets we showed an improvement in comparison to our baseline architecture. In addition, we tested our results on a curated dataset with cancers < 1 cm in size. We show that the improvement is most marked in this dataset with our approach, suggesting that this approach is particularly suited for detection of small cancers. Some examples from this dataset which were captured by our network are given in Fig. 5. We also note that despite being an external dataset, we perform better in our SM dataset than our DM dataset. This is due to a large number of benign masses in our DM dataset which were detected as cancers, contributing to false positives in this dataset.

Figure 5
figure 5

Visualization of our results of our network on small masses < 1 cm. Yellow represents the ground truth bounding box, green represents predictions of our network.

We analyze the importance of resolution, context and scale by comparing our performance of each factor in isolation with the performance of our proposed network. Through our ablation studies (Fig. 4) we demonstrate that each of these factors have an important role to play. Our analysis of performance of each individual scale (Table 3) also showed that though the sensitivity was the lowest at full resolution, this scale was particularly important for detection of small masses.

Our study has some limitations. We acknowledge that all mammograms came from a single vendor, thus the effectiveness of the network in other situations need further study. Analysis of masses that were missed by this network revealed that we missed isodense, obscure masses, and masses placed in the peripheral breast tissue. There were only 3 images with microcalcification clusters in the DM dataset and 9 images in the SM dataset. Therefore the performance of the network on microcalcifications has not been adequately studied. Analysis of the false positives revealed that some benign lesions, such as cysts and fibroadenomas were also detected as cancers by the network. This indicates the direction of further research in this area.

To conclude, in this work we present a concerted effort towards the detection of small cancers. In a multicentre study we show the effectiveness of our approach in the setting of diagnostic mammography, screening mammography, as well as in a subset of small cancers < 1 cms in size.