Background & Summary

Breast carcinoma is one of the most commonly diagnosed cancers and the second leading cause of cancer death in women1. The widespread uptake of mammography in breast carcinoma management has dramatically improved the 5-year survival rate of breast carcinoma since the 1980s2. Because of the limited sensitivity of mammography and the heterogeneity of breast cancer lesions, invasive methods such as biopsy and surgery remain critical for distinguishing benign from malignant tumors and for determining the molecular subtype, which in turn guides the choice of treatment3.

Advances in both imaging and computing have synergistically led to a rapid rise of artificial intelligence (AI) for breast imaging in the following three tasks: (1) computer-aided detection (CADe)4,5,6,7,8,9, which aims at locating suspect lesions such as masses and microcalcifications, leaving the classification to the radiologist; (2) computer-aided diagnosis (CADx)10,11,12,13,14, which aims to characterize the suspicious region of a lesion and/or estimate its probability of onset; and (3) discovery of predictive image-based biomarkers15,16,17,18, which applies computational methods to mine the potential relationships between image representation and molecular subtype, including luminal A, luminal B, HER2-positive, and triple-negative. Although research on mammography imaging is growing rapidly in these three areas, the promising results of radiomics approaches have not yet been widely adopted in daily clinical practice. Limited data sharing is a major reason for the slow development of radiomics strategies.

For investigating CADe and CADx, several datasets19,20,21,22,23 are publicly and freely available to authorized investigators. These include the Digital Database for Screening Mammography (DDSM), the Mammographic Imaging Analysis Society (MIAS) database, the Image Retrieval in Medical Application (IRMA) project, and the Curated Breast Imaging Subset of DDSM (CBIS-DDSM). Although these public datasets are useful, there is still a lack of open-access datasets that extend beyond the white population; such datasets would enable researchers to verify previous findings and would make the available data more diverse. Furthermore, biopsy-confirmed results, such as immunohistochemical or molecular subtype, are missing from most of the current datasets. Therefore, an open-access database consisting of large samples with immunohistochemical type is valuable for researchers who are interested in this domain or who require an independent database for cross-validation.

In this study, we built the Chinese Mammography Database (CMMD), which contains two branches (CMMD1 and CMMD2), to allow researchers to investigate the relationships among image features, pathological assessment, and tumor molecular subtypes. Specifically, CMMD1 comprises 1026 cases diagnosed with benign or malignant tumors and was collated to promote the development of CADx and CADe. CMMD2 comprises 749 cases and is intended for investigating the relationship between image features of invasive carcinoma and molecular subtypes. Note that the cases in CMMD2 have more complete immunohistochemical markers than those in CMMD1. Both datasets include mammography images and clinical data such as age and benign or malignant tumor status. The database is currently available for research through the International Data-sharing Initiative. Our free data sharing can help hasten the clinical application of radiomics approaches. Table 1 lists the popular and publicly available databases in the field of mammography.

Table 1 Statistics of popular and publicly available databases in the field of mammography.

Methods

Patient recruitment

Ethical approval was obtained for this retrospective analysis, and the requirement to obtain informed consent was waived. Our study was conducted on 1775 patients (mean age: 47.56 years; range: 18–87 years) with benign or malignant breast tumors who underwent mammography examination between July 2012 and January 2016. CMMD1 involves 1026 patients (mean age: 45.92 years; range: 17–84 years) who have mammography data and complete clinical data. CMMD2 includes 749 patients (mean age: 49.82 years; range: 21–87 years) with complete immunohistochemical markers. Figure 1 illustrates the patient recruitment pathway, along with the inclusion and exclusion criteria. CMMD1 and CMMD2 are both subsets of CMMD: CMMD1 only distinguishes between benign and malignant patients (see Exclusion criteria 1 in Fig. 1), while CMMD2 contains only malignant cases with detailed molecular subtypes (see Exclusion criteria 2 in Fig. 1).

Fig. 1 Recruitment pathway for patients in our study.
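The cohort figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the numbers reported in the text (subset sizes and subset mean ages); the variable names are illustrative and are not part of the released data.

```python
# Cohort sizes and mean ages as quoted in the text: (patients, mean age in years).
cohorts = {
    "CMMD1": (1026, 45.92),
    "CMMD2": (749, 49.82),
}

# Total number of patients across the two subsets.
total_patients = sum(n for n, _ in cohorts.values())

# Size-weighted mean age of the pooled cohort.
pooled_mean_age = sum(n * age for n, age in cohorts.values()) / total_patients

print(f"total patients: {total_patients}")
print(f"pooled mean age: {pooled_mean_age:.2f} years")
```

The total matches the 1775 patients reported, and the pooled mean age of roughly 47.57 years agrees with the reported 47.56 years to within the rounding of the two subset means.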

Image collection and interpretation

Image data were acquired on a GE Senographe DS mammography system and a Siemens Mammomat Inspiration mammography system at the Sun Yat-sen University Cancer Center in Guangzhou and the Nanhai Affiliated Hospital of Southern Medical University in Foshan, China. The scans were processed by the operator according to a fixed operating procedure. For each subject, craniocaudal (CC) and mediolateral oblique (MLO) projection images were obtained. In the released database, the raw images are stored as 8-bit grayscale in the Digital Imaging and Communications in Medicine (DICOM) format. All images were digitized at a resolution of 2294 × 1914 pixels.

Two radiologists, each with at least five years of experience, interpreted the mammograms and provided guidance before surgery to determine which patients should be treated surgically. They were asked to follow the standard readings of the Breast Imaging Reporting and Data System (BI-RADS), established by the American College of Radiology24. Following commonly used X-ray classification methods, the abnormalities in the images were divided into three types: masses, calcifications, and both. The two radiologists reviewed the mammograms independently. When their results were inconsistent, they consulted the pathology report to determine the type of abnormality.

Pathological evaluation

In this study, biopsy samples were collected from all patients by core needle biopsy. The sample tissues were routinely stored as formalin-fixed, paraffin-embedded tissue blocks. The pathologist stained sections of the biopsy tissue with hematoxylin and eosin (HE) and analyzed the tissue morphology under the microscope. If necessary, surgery was performed to extract the suspicious lesion specimen. Immunohistochemistry testing was conducted to determine the pathological result.

Immunohistochemistry

Based on the immunohistochemical expression of estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), and Ki-67, invasive breast carcinoma is divided into four molecular subtypes: Luminal A (ER+ and/or PR+, HER2-, and Ki-67 < 20%), Luminal B (ER+ and/or PR+, and HER2+ or Ki-67 > 20%), HER2-enriched (ER-, PR-, and HER2+), and triple-negative (ER-, PR-, and HER2-)25. The surgical specimens were fixed with 4% neutral buffered formaldehyde solution. Monoclonal antibodies were used for nuclear staining to evaluate ER and PR status. A negative test was defined as staining of less than 1% (<1%) of tumor cells, and a positive test as staining of 1% or more (≥1%) of tumor cells. To assess HER2 expression, the specimen was first graded by immunohistochemistry (IHC) and scored from 0 to 3+, according to the recommendations of the American Society of Clinical Oncology/College of American Pathologists26. If no staining, or faint/barely perceptible membrane staining in less than 10% (<10%) of tumor cells, was observed, the score was set to 0. If faint/barely perceptible membrane staining was observed in 10% or more (≥10%) of tumor cells, the score was marked as 1+. If weak to moderate complete membrane staining was observed in more than 10% (>10%) of tumor cells, the score was marked as 2+; in this case, the tissue was further evaluated by fluorescence in situ hybridization (FISH) analysis for HER2 gene amplification. To assess Ki-67 expression, immunostaining was performed with a monoclonal Ki-67 antibody. Ki-67 expression was recorded as a percentage between 0% and 100%, and a cutoff value of 20% was used to classify each sample as low or high expression27.
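The subtype rules above can be written as a small decision function. The following is a minimal sketch, not the authors' implementation: the class and function names are illustrative, the HER2 status is assumed to be the final call after IHC (and FISH where needed), and a Ki-67 value of exactly 20% is treated as high expression, a case the text leaves open.

```python
from dataclasses import dataclass

ER_PR_POSITIVE_CUTOFF = 1.0  # percent of stained tumor cells (>= 1% is positive)
KI67_CUTOFF = 20.0           # percent; separates low from high expression


@dataclass(frozen=True)
class IhcResult:
    """Hypothetical container for one patient's immunohistochemistry readings."""
    er_percent: float     # % of tumor cells with ER nuclear staining
    pr_percent: float     # % of tumor cells with PR nuclear staining
    her2_positive: bool   # assumed final HER2 status (IHC, plus FISH if needed)
    ki67_percent: float   # Ki-67 expression, 0 to 100


def molecular_subtype(result: IhcResult) -> str:
    """Map IHC readings to one of the four molecular subtypes."""
    er_positive = result.er_percent >= ER_PR_POSITIVE_CUTOFF
    pr_positive = result.pr_percent >= ER_PR_POSITIVE_CUTOFF
    hormone_positive = er_positive or pr_positive
    # Assumption: exactly 20% counts as high expression.
    ki67_high = result.ki67_percent >= KI67_CUTOFF

    if hormone_positive:
        # Luminal B if HER2+ or highly proliferative, otherwise Luminal A.
        if result.her2_positive or ki67_high:
            return "Luminal B"
        return "Luminal A"
    # Hormone-receptor-negative tumors split on HER2 status alone.
    return "HER2-enriched" if result.her2_positive else "Triple-negative"
```

For example, `molecular_subtype(IhcResult(90, 80, False, 10))` returns `"Luminal A"`, while the same receptor profile with a Ki-67 of 35% is classified as `"Luminal B"`.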

To sum up, we describe each step in the generation of the dataset clearly and transparently, ultimately presenting a fully reproducible dataset, as shown in Fig. 2.

Fig. 2 Study design for the construction of the breast mammography data. (a) Patients with breast lesions were selected for the study. (b) MLO and CC molybdenum-target mammography views, acquired as part of clinical diagnosis, are shown in the scheme. (c) After the biopsy, resected tumors were routinely stored as formalin-fixed and paraffin-embedded (FFPE) tissue blocks and stained with hematoxylin and eosin (HE) for anatomic pathology. (d) Surgical specimens were evaluated by routine immunohistochemistry (IHC) to confirm the tumor of origin and the molecular subtype of each case.

Data Records

Subject Identifiers

Each subject has a unique identifier that is used consistently across the two public datasets in this database. Subject IDs take the form D1-xxxx or D2-xxxx, where xxxx is a 4-digit number.
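As an illustration of the identifier format, the sketch below validates a subject ID and splits it into its dataset branch and numeric part. The function name and the mapping of the `D1`/`D2` prefixes to CMMD1/CMMD2 are assumptions made for this example, inferred from the naming described above.

```python
import re

# "D1-" or "D2-" followed by exactly four digits, e.g. "D1-0001".
SUBJECT_ID_PATTERN = re.compile(r"D([12])-(\d{4})")


def parse_subject_id(subject_id: str) -> tuple[str, int]:
    """Return (branch, number) for an ID such as 'D2-0123'.

    Raises ValueError if the ID does not follow the D1-xxxx / D2-xxxx form.
    """
    match = SUBJECT_ID_PATTERN.fullmatch(subject_id.strip())
    if match is None:
        raise ValueError(f"unexpected subject ID format: {subject_id!r}")
    # Assumption: the D1/D2 prefix denotes the CMMD1/CMMD2 branch.
    branch = f"CMMD{match.group(1)}"
    return branch, int(match.group(2))
```

A call such as `parse_subject_id("D2-0123")` yields `("CMMD2", 123)`, and malformed IDs raise `ValueError` rather than passing silently into downstream analysis.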

Imaging and clinical data

The CMMD collection28 contains breast mammography images and corresponding clinical data. Imaging and clinical data for all subjects are stored in The Cancer Imaging Archive (TCIA) https://www.cancerimagingarchive.net/ under https://doi.org/10.7937/tcia.eqde-4b16. Imaging data for all subjects are stored in the folder CMMD. All image data were processed using standard TCIA curation workflows. TCIA uses a standards-based approach for de-identification of images stored in the Digital Imaging and Communications in Medicine format. One spreadsheet file (CMMD_clinicaldata_revision.xlsx) contains clinical data for all subjects with unique subject identifiers. Table 2 lists the clinical and demographic statistics of the enrolled patients. Figure 3 shows an illustrative example of the clinical data for CMMD1 and CMMD2. As the figure shows, the clinical data for CMMD1 contain age, image categories, and abnormality. Compared with CMMD1, CMMD2 additionally contains molecular subtypes, which can assist clinicians with clinical guidance or support related studies on immunohistochemistry.

Table 2 Clinical and demographic statistics of enrolled patients.

Fig. 3 An example of clinical data for CMMD1 and CMMD2.

Limitations of CMMD

Our data have some notable limitations. First, the sample size is not very large. Second, regions of interest (ROIs) are not annotated. We will add more information and increase the amount of data in the future.

Technical Validation

All data were collected by the hospital and used as part of the diagnosis; therefore, all quality assurance was performed by the institution that collected the data.

Usage Notes

Our previous publications14,29 analyzed the data in CMMD1, whereas CMMD2, with molecular subtypes, is newly added. All data are raw data without any preprocessing. We welcome collaborations to fully explore our dataset.