New breast cancer prognostic factors identified by computer-aided image analysis of HE stained histopathology images

Computer-aided image analysis (CAI) can help objectively quantify morphologic features of hematoxylin-eosin (HE) histopathology images and provide potentially useful prognostic information on breast cancer. We performed a CAI workflow on 1,150 HE images from 230 patients with invasive ductal carcinoma (IDC) of the breast. We used a pixel-wise support vector machine classifier for tumor nests (TNs)-stroma segmentation, and a marker-controlled watershed algorithm for nuclei segmentation. 730 morphologic parameters were extracted after segmentation, and 12 parameters identified by Kaplan-Meier analysis were significantly associated with 8-year disease free survival (P < 0.05 for all). Moreover, four image features including TNs feature (HR 1.327, 95%CI [1.001 - 1.759], P = 0.049), TNs cell nuclei feature (HR 0.729, 95%CI [0.537 - 0.989], P = 0.042), TNs cell density (HR 1.625, 95%CI [1.177 - 2.244], P = 0.003), and stromal cell structure feature (HR 1.596, 95%CI [1.142 - 2.229], P = 0.006) were identified by multivariate Cox proportional hazards model to be new independent prognostic factors. The results indicated that CAI can assist the pathologist in extracting prognostic information from HE histopathology images for IDC. The TNs feature, TNs cell nuclei feature, TNs cell density, and stromal cell structure feature could be new prognostic factors.


Results
Major clinical pathological characteristics of the patients. The main demographic, clinical and pathological characteristics of the 230 IDC patients are summarized in Table 1. The median age was 53 years (range, 27-85 years). The positive rates of estrogen receptor (ER) and human epidermal growth factor receptor 2 (HER2) were 45.7% and 36.1%, respectively. In terms of histological grade, 20 (8.7%) patients were classified as histological grade 1, 174 (75.7%) histological grade 2, and 36 (15.6%) histological grade 3. At the median follow-up of 105 months, among the 230 cases, 152 (66.1%) patients had tumor recurrence, and the median 8-DFS was 50.1 months (95% confidence interval [CI]:38.3 -62.0 months) as analyzed by Kaplan-Meier survival curve.
Image segmentation. So far, the limiting factor for quantitative HE image analysis is the absence of a robust and accurate segmentation algorithm to distinguish objects (tumor nests, gland, nuclei etc.) of interest from the background. Apart from histological grade, there are many other morphologic features of BC that have been proposed as prognostic factors, including angiogenesis, lymphocytes infiltration, and tumor-associated inflammation. Segmentation of the tissue into different components is the first step toward automatic morphometry. We used a pixel-wise SVM classifier for tumor nests (TNs)-stroma segmentation and a marker-controlled watershed for nuclei segmentation. The segmentation results are presented in Fig. 2 in the form of pseudo-color images; all pixels were sub-classified into TNs (yellow), stroma (black), epithelial nuclei (red), stromal round nuclei (infiltrating immune cells, IICs in purple), and stromal non-round nuclei (cancer-associate fibroblastic cells [CAFs] and angiogenic vascular cells [AVCs] in green) 13 .
Parameter extraction and dimensionality reduction. The morphologic characteristics of the tissue, cells, and nuclei are relatively complex. Some parameters could be easily described, for example, a tubule formation and high nuclear-to-cytoplasmic ratio, but most are difficult to describe, hard to learn, and often require long time before a pathologist can grasp. Image analyses emulate the expert learning to recognize objects in images; can transform microscopically-observed parameters from semi-quantitative values to quantitative data. The reproducible, objective morphologic results generated from image analysis provide means to encode image information into a set of discriminatory measurable values. We extracted 730 image parameters (Table S1) from multiple classes of different components. These parameters could be divided into pixel-level parameters (n = 400) including intensity, color and texture variables; object-level parameters (n = 314) including morphometry (object size and shape etc.) and topological variables; and semantic-level parameters (n = 16) including nest area/stroma area ratio, stroma round cell density, and nuclei/cytoplasm ratio etc (Table S2). The pixel-level parameters were not considered in this work because they are the least interpretable in terms of current biological knowledge 18 . For the other two sets of parameters, we conducted univariate survival analysis for parameters dimensionality reduction to screen for clinically significant image parameters. This yielded 14 parameters for further analysis; 12 among these were significantly associated with 8-DFS (P < 0.05 for all) ( Table 2 and Table 3).
Clinical value of morphologic parameters for 8-DFS prediction. The object-level parameters refer to the properties (such as area, perimeter, and fractal dimensions) of objects with measurable values. Univariate survival analysis showed that the TNs fractal dimension (P = 0.004), TNs number (P < 0.001), TNs perimeter sum (P = 0.001), TNs cell Delaunay area sum (P = 0.007), and stromal cell structure parameters (P = 0.011) were negatively correlated with 8-DFS. TNs area average (P = 0.011), TNs area variance (P = 0.014), and TNs cell nuclei area average (P = 0.004) were positively associated with 8-DFS. The TNs cell nuclei area variance (P = 0.056) and TNs cell nuclei eccentricity maximum (P = 0.077) had no statistically significant correlation with 8-DFS (Table 2).
Semantic-level parameters refer to the information on specific relationships among histological structures, such as area ratio and cell density. Univariate analysis showed that TNs cell nuclei/TNs area ratio (P = 0.006), and TNs cell density (P < 0.001) were negatively correlated with 8-DFS. TNs area/perimeter ratio (P = 0.032) and stromal non-round cell density (P = 0.008) were positively associated with 8-DFS (Table 3). Screening for image features by 8-DFS prediction. The above 12 significant object-level and semantic-level parameters were subject to principal component analysis for further reduction, producing two major sets of image features: TNs feature to quantify tumor nests and TNs cell nuclei features to quantify cancer cell nuclei (Table S3 and S4). Univariate survival analysis showed that TNs feature was negatively correlated with 8-DFS (P < 0.001), and TNs cell nuclei feature was positively associated with 8-DFS (P = 0.021) ( Table 3). Examples of correlation between 8-DFS with TNs feature and TNs cell nuclei feature are shown in Fig. 3.

Validation of image features by multivariate analysis.
To validate the clinical significance of the newly selected image features, we carried out a multivariate Cox analysis integrating both traditional and newly identified variables. To avoid the possible multicollinearity among the variables in regression model, a spearman rank correlation test was conducted, and the result showed that no significant correlation among the image features and histologic grade. Then, we integrated traditional prognostic factors including T, N, histologic grade, ER and HER2, and 7 image features that screened by 8-DFS prediction into a Cox proportional hazards model. This resulted in 6 independent prognostic factors. In addition to 2 traditional factors including histological grade and HER2, 4 image features including TNs feature, TNs cell nuclei feature, TNs cell density, and stromal cell structure feature were new independent prognostic factors. Among the 4 image features, TNs cell nuclei feature was a positive prognostic factor, and the other 3 image features were negative prognostic factors for 8-DFS (Table 4).
To assess the additional value of independent prognostic image features, we related them with histological grade. Result showed that TNs feature, TNs cell density, and stroma cell structure feature gave a better discrimination in histologic grade 2, and could identify low risk patients in this subgroup (Fig. 4). Furthermore, the predictive performance of independent prognostic factors was quantified by the area under the curve (AUC) from a ROC analysis separately. TNs feature could better predict the clinical outcomes of IDC patients compared to other factors (AUC: 0.644 [95%CI: 0.570 -0.718], P < 0.001) (Fig. 5).

Discussion
CAI could be an important tool to help pathologists in BC prognosis assessment and histological grade 20,26,27 . Using CAI in this study, we extracted 730 morphological parameters; and 12 parameters, including 6 tumor nests (TNs) parameters, 2 TNs cell parameters, 2 stroma cell parameters, and 2 nuclei parameters were significantly associated with 8-DFS. Four image features, including TNs feature, TNs cell nuclei feature, TNs cell density, and stromal cell structure feature were identified by multivariate Cox analysis to be new independent prognostic predictors. Moreover, they can quantify the microstructure of tissue in HE histopathology images, and yield high reproducible and adequate information. NGS is demonstrated to have prognostic significance in small tumor groups 28 , but low reproducibility limits its application in breast cancer management 13 . Researchers try to address the issues arising from manual grading by adopting CAI. Several works have proposed solutions for nuclei pleomorphism scoring 21 , tubular formation scoring 29,30 or mitosis detection 31 . The segmentation of nuclei in breast cancer histopathology images has many different applications include extraction of prognostic features, automatic nuclear pleomorphism grading as part of a computer-aided grading system 21 . Along with nucleus pleomorphism, the degree of structural differentiation of the tissue is one of the earliest prognostic factors for BC 29,30 . Cancer disrupts the ability of the cells to communicate with each other and organize themselves into structures such as tubules. Approaches attempt at combining the three criteria of NGS to provide a complete automatic grading system 32,33 .
The abovementioned works are based on the NGS frame, whereas comprehensive analysis of quantitative morphological features could help identify novel prognostic characteristics from histopathology images. Such work by Tambasco et al 34 used fractal dimension to quantify the complexity of whole epithelial architecture in IHC images from invasive breast cancer. In our work, we extracted a large number of morphometric and topological features in relatively more complex HE histopathology images from IDC. The TNs cell nuclei features generated in this work could quantify nuclei pleomorphism, and the TNs features could objectively measure the morphologic complexity of malignant epithelial architecture. Results showed that both of them were independent prognostic predictors. The TNs feature even had better performance in predicting clinical outcomes than histologic grade and HER2 status in ROC analysis. This suggests that TNs feature and TNs cell nuclei feature could be candidate prognostic factors for IDC.
Furthermore, CAI could help to explore prognostic value of the tumor microenvironment morphologic characteristics for IDC. The NGS examines only three morphological features of epithelial cells, which have failed to accurately classify patients. Our results show that stroma cell structure feature, TNs feature, and TNs cell density which gave a better discrimination in histological grade 2, could further identify low risk patients in this subgroup. Stroma cells such as CAFs, AVCs and IICs are important components of tumor microenvironment, and play important roles in cancer progression 14    to quantify cellular heterogeneity in BC 15 or to discover biomarkers for triple negative breast cancer 16 . In addition to cancer cell features, our work also classified stroma cells into two categories, stromal round cells (IICs) and stromal non-round cells (CAFs and AVCs) according to cell size and the nuclei shape. The stromal non-round cell structure feature was found to be an independent prognostic factor, which positively correlated with 8-DFS. This suggests that analyzing stroma morphologic features may offer significant help to improve prognosis prediction, in consistent with a similar report by Beck et al 35 .
However, contrary to the current knowledge, stroma non-round cell density had a positive correlation with survival. Moreover, features such as stroma round cell density and stroma cell nuclei features showed non-significant correlation with 8-DFS. Those findings should not be taken to mean that they have no effect on prognosis. We point out that accurate segmentation of objects of interest is the first step toward automatic image analysis, and this can greatly affect the significance of extracted features. Though we have performed preprocessing, many images remain intractable to the algorithm, due to the heterogeneity of the disease and the strong noise in HE images. In addition, developing computer-aided prognosis for BC based on HE histopathology images is still in the exploratory stage. A major hindrance is that researchers have performed analyses on different size images (WSI or ROIs) and magnification levels (100× , 200× , or 400× ) with various segmentation algorithms 36 and focused on diverse features of objects of interest. Thus, a direct comparison of different methods is not feasible.
This study has several limitations. First, the generalizability of the SVM classifier and marker-controlled watershed algorithm should be validated on another independent dataset. Second, the selection of an optimal image size, magnification, and processing time must be investigated to understand their effects on clinical significance. Third, the prognostic values of the independent predictors need to be validated in prospective study. More work will involve the exploration of intelligent methods like multi-field-of-view 32 for combination of various features to make full use of the underlying, invaluable, image information. Furthermore, image features alone rarely gives adequate information for prognosis 37 ; clinical pathological information and molecular assay data must also be taken into consideration 38 .
In conclusion, it is an urgent and important clinical task to predict future biological behaviors of BC based on the new information extracted from the local tumor itself in addition to conventional pathological features. CAI could be a powerful tool to help extract a huge amount of new information beyond manual analysis, and TNs feature, TNs cell nuclei feature, TNs cell density, and stromal cell structure feature could be new prognostic factors for IDC.

Materials and Methods
Patients and tissue slides. This study included 230 patients diagnosed with IDC and treated with intent-to-cure surgery at our hospital. Major treatment information included radical mastectomy (n = 43), modified radical mastectomy (n = 156), and simple mastectomy or breast conserving surgery (n = 31) in terms of surgical treatment; then followed by less intensive chemotherapy (cyclophosphamide + methotrexate + fluorouracil, n = 96), or anthracycline/taxane-based (n = 134) chemotherapy. And radiotherapy was added to patients with over 3 axillary lymph nodes involvement. For patients with HER2 positive status, molecular targeting therapy with Transtuzumab (i.e. Herceptin a monoclonal humanized anti-HER2 antibody) was recommended but not mandatory. Endocrine therapy for with either tamoxifen or third-generation aromatase inhibitors was delivered based on the ER status and clinical guidelines. The tissue slides and formalin-fixed paraffin-embedded (FFPE) tissue blocks, clinical pathological information, and follow-up information of these patients were all available. The study protocol was approved by the Institutional Ethics Committee of Zhongnan Hospital of Wuhan University, and informed consent was obtained from the patients before operation to use tissue samples for scientific researches.   Two expert pathologists (GF Yang, JP Yuan) examined the archived HE slides, selected FFPE tissue blocks for all cases and made new tissue slides for HE and IHC staining (ER, PR, and HER2). Histologic grade for each case was obtained by routine manual analysis of HE images with NGS. The ER and HER2 status of IHC images were evaluated by the above-mentioned expert pathologists. The ER and PR status were defined as the percentage of immunoreactive cells with an intra-nuclear staining of any intensity. The intra-nuclear staining of at least 1% of the cells was interpreted as a receptor-positive result 39 . The HER2 status analysis was performed according to the American Society of Clinical Oncology/CAP guidelines 40 . Image acquisition. The digital images were acquired under an Olympus BX52 microscope equipped with an Olympus DP72 camera (Olympus Optical Co., Ltd., Tokyo, Japan) by CRi Nuance multispectral imaging systems (Cambridge Research & Instrumentation, Inc., Woburn, MA, USA) with the help of an expert pathologist (JP Yuan). First, regions of interests (ROIs), the distinct invasive cancer area in images were selected at 100× . ROIs did not contain regions of necrosis, ductal carcinoma in situ or improper staining artifacts. Second, in each ROI, only fields containing both tumor nests and stroma were captured at 200× . Finally, to minimize image selection bias, five images per slide were randomly selected from the ROIs images. As a result, 1,150 images were captured under the unified image acquisition parameters and saved in tagged image file format with resolution of 1360 × 1024 pixels.
Image processing. In order to extract image features related to prognosis, we need to automatic identify and segment histological structures by image analysis methods at first. We applied an image processing pipeline by the following steps: preprocessing, segmentation, postprocessing and feature extraction 24 . First, three preprocessing methods were applied to enhance image quality. Median-filter with a 3*3 kernel was used to de-noise and smooth image; contrast stretching was used to automatically optimize the image contrast; and color normalization was applied to remove color variance and scale batch effects. Second, we used two-step segmentation algorithms to segment objects in images. (1) For nuclei segmentation, color deconvolution was used to extract the hematoxylin color component. Series of mathematical morphology operators were used to remove irrelevant "noisy" structures that may hamper the segmentation for obtaining the nuclei mask. Then the regional minima of the nuclei mask were used to mark candidate nuclei locations. Watershed regions were grown from the markers, after which spurious regions were removed based on shape, texture and boundary saliency. (2) For TNs-stroma segmentation, the pixel-level color features via the local homogeneity model and texture features of the pixel via the fast algorithm 41 were used. Then an SVM classifier was trained with randomly selected labeled pixels by the abovementioned expert pathologists, and the images were segmented with the trained SVM classifier. Third, as the segment methods lack robustness to noise and cannot classify all the pixels to the objects accurately. Expert-pathologist aided judgments were conducted to eliminate the incorrect segmentations in the postprocessing step. In the final feature extraction step, multiple levels (i.e. pixel-level, object-level, and semantic-level) of morphological features 18 were extracted from different objects.
Statistical analyses. The image features we generated from image analysis are continuous variables. In order to classify patients into different risk subgroup, we converted continuous variables into categorical variables before performing statistical analyses by using the X-tile software 42 . Then univariate survival analysis and principal component analysis were used for features dimensionality reduction. Kaplan-Meier methods were used to identify 8-year disease free survival (8-DFS) associated features, and significance among subgroups was calculated by log-rank test. The 8-DFS was defined from the date of surgery to the date of BC-specific recurrence/distant metastasis or date of last follow-up. Multivariate Cox proportional hazards regression model was performed to identify new independent prognostic factors from 8-DFS associated features. Two sided P < 0.05 was considered as statistically significant. Receiver operating characteristic (ROC) curve analysis was used to determine predictive value of the independent prognostic factors. All statistical analyses were performed with SPSS version 19.0 (SPSS Institute, Chicago, IL, USA).