## Introduction

Accurate assessment of localized prostate cancer aggressiveness is of utmost importance for determining patient treatment and follow-up strategies. Currently, this is determined based on Gleason and/or International Society of Urological Pathology (ISUP) grading1,2 of histological specimens, traditionally obtained by systematic transrectal ultrasound-guided biopsy sampling. Although the introduction of targeted approaches has improved biopsy sampling and cancer detection accuracies3, it is still limited to small portions of the prostate. With prostate cancer being a heterogeneous and multifocal disease, this can sometimes lead to inaccurate estimation of the disease extent, and thus undertreatment or overtreatment4,5. Moreover, biopsy sampling is invasive, and the risk of post-biopsy complications has become an increasing concern due to multidrug-resistance6. This makes repeated biopsy sampling unattractive in clinical practice, especially for active surveillance patients.

Multiparametric MRI (mpMRI) enables non-invasive acquisition of both anatomical [i.e. T2-weighted (T2W)] and functional [mainly diffusion-weighted (DW), and dynamic contrast-enhanced (DCE)] images of the entire prostate for cancer detection, staging, treatment planning, and response evaluation7. The introduction of mpMRI and the Prostate Imaging—Reporting and Data System (PI-RADS) guidelines8 have improved prostate cancer detection and accuracy9. DW and DCE MRI provide quantitative pathophysiological parameters such as apparent diffusion coefficient (ADC), and volume transfer constant (ktrans) and extravascular-extracellular volume fraction (Ve), which to some extent are capable of assessing prostate cancer aggressiveness10,11. Compared to DW and DCE, T2W MRI provides high spatial resolution and tissue-specific contrast, but currently, it is mainly used for qualitative evaluation of the prostate anatomy and anomalies.

Though important, qualitative assessment has several challenges and limitations, including dependency on the subjective judgment of radiologists, which is prone to high inter-reader variability12, and the occurrence of equivocal findings in a substantial number of cases13. Furthermore, with a multiparametric approach and the increasing availability of hybrid imaging modalities such as positron emission tomography/MRI14, the amount of data to be analyzed increases, making it increasingly labor-intensive to manually collate all these images into meaningful information for clinical decision making.

Recently, radiomics, i.e. automatic high-throughput extraction of quantitative image features from radiological images and their subsequent analysis15,16, has gained attention for its potential to overcome the above limitations and thus improve clinical decision making. Texture analysis constitutes a key methodology for extracting quantitative image features, particularly second- and high-order statistical image texture descriptors based on the grey level co-occurrence matrix (GLCM)17 and grey level run length matrix (GLRLM)18, which examine spatial variations in pixel intensity distribution. Several studies have reported the use of texture analysis in radiomics-based analysis of prostate cancer19,20,21,22,23, but most have been limited to single-center data. Previously, we24 showed that GLCM-based textural features derived from T2W images could potentially serve as non-invasive markers for assessing prostate cancer aggressiveness. In particular, we found homogeneity and entropy features to correlate significantly with prostate cancer aggressiveness (i.e. grade groups 2 and 3) as defined on pathology, as well as with ADC and ktrans. Also, the augmentation of quantitative MRI parameters with T2W image textural features enabled better classification. However, these preliminary findings were based on a relatively small number of patients recruited from a single center. The aim of the current work was to validate and extend these findings using a multicenter cohort, and to investigate their performance in the classification of biopsy-proven prostate cancers.

## Materials and methods

### Patient population and data collection

The patient cohort for this retrospective multicenter study is part of the data prospectively collected (between June 2010 and August 2015) for the Prostate Cancer localization with a Multiparametric MR Approach trial (PCa-MAP; ClinicalTrials.gov Identifier NCT01138527)25. Eligible patients (N = 128) from six institutions were included in this study: Johns Hopkins University, Baltimore (n = 20); Norwegian University of Science and Technology, Trondheim (n = 22); Radboud University Medical Centre, Nijmegen (n = 30); University of California, Los Angeles (n = 20); University Health Network, Toronto (n = 10); and Medical University of Vienna (n = 26). All patients were diagnosed with primary prostate cancer and were scheduled to undergo preoperative mpMRI with subsequent radical prostatectomy. The Regional Committee for Medical and Health Ethics (Mid Norway), the PCa-MAP trial consortium review board, as well as the review board of each participating institution (HIPAA-compliant for USA institutions) approved this study and waived the requirement for written informed consent. All methods were performed in accordance with local institutional, national and international guidelines and regulations.

### MRI examination

The image acquisition protocol and/or settings were standardized across all centers, hence the term ‘single-arm’ in the title. All imaging was performed on 3T MRI systems (Siemens Healthineers) using standard vendor-supplied body and spine phased array coils for signal detection, without an endorectal coil. A minimum of four weeks was allowed between the last biopsy and MRI to avoid hemorrhage artifacts. The acquisition consisted of localizer scans, T2W, DW, DCE, and spectroscopic imaging. In this study, we utilized only the transverse T2W and DW images, which were acquired with a turbo spin-echo sequence (repetition/echo time (TR/TE): 4000/101 ms; field of view (FOV): 200 × 200 mm; matrix: 320 × 320; slice thickness: 3 mm; interslice gap: 0.6 mm), and a single-shot echo-planar sequence with four b-values: 0, 100, 400, and 800 s/mm2 (TR/TE: 3300/60 ms; FOV: 260 × 211 mm; matrix: 160 × 130; slice thickness: 3.6 mm), respectively. The images were oriented along the longest axis of the prostate, perpendicular to the urethra to best match routine histologic sectioning of the prostate. Pre-imaging preparations were performed in accordance with local institutional guidelines.

### Histopathologic examination and tumor delineation

Patients underwent radical prostatectomy within 12 weeks after the MRI examination. The prostatectomy specimens were prepared locally according to histopathology protocols at each institution, which included fixation, serial sectioning (perpendicular to the urethra to facilitate spatial matching to MRI) into ~ 3–4 mm axial slices, and hematoxylin and eosin staining of microsections. An experienced local uro-pathologist examined the stained slides, outlined cancer foci, described cancer location, and graded them in accordance with the Gleason scoring system1,2.

The annotated whole-mount histology sections were visually matched to the T2W images based on anatomical landmarks such as urethra, ejaculatory ducts, size/shape of the peripheral zone and apex/base proximity. Moreover, descriptions from the pathology report were used as guidance. Tumor volumes of interest (VOIs) were then manually delineated (by Gabriel A. Nketiah, 5 years’ experience in prostate MRI; and guided by a radiologist Jurgen J. Fütterer with > 11 years’ experience) based on their location in histology and shape/appearance on the T2W images. The VOIs were subsequently transformed to the corresponding DW images via intensity-based rigid registration (Elastix toolbox26) using Mattes mutual information similarity metric (Fig. 1a). This was done by first co-registering the T2W images to the b = 0 s/mm2 images, and then applying the resulting transformation to the VOI masks. The co-registrations were visually verified, and manually corrected in case of mis-registration, for instance due to geometric distortion on the DW images. Each tumor was assigned a grade group (GG) according to the ISUP prostate cancer grading system2, and then dichotomized into low-(GG 1) and intermediate/high-(GG ≥ 2) aggressive cancers.

### Post-processing and feature extraction

Two types of features were computed from the image VOIs: traditional intensity histogram features (number of features, nf = 11) from the T2W images and ADC maps, and second- and high-order statistical image textural features (nf = 29) based on GLCM17 and GLRLM18 from the T2W images. The “2D average” approach27 was employed to compute textural features from the tumor VOIs (Fig. 1b). First, the intensities within each VOI were discretized into 32 grey levels via fixed bin number quantization. The GLCMs and GLRLMs were computed per slice at a distance of one pixel in four symmetric directions, θ = 0°, 45°, 90° and 135°. Textural features were computed from each directional matrix, and the mean of each feature across the slices was obtained. Finally, the average of each feature across the four directions was calculated to eliminate potential differences in directionality. This 2D approach was preferred over 3D texture analysis due to the presence of interslice gaps in our data acquisition. ADC maps were calculated from the nonzero b-value DW image datasets (100, 400, and 800 s/mm2) by fitting a monoexponential decay model to the image intensities as a function of b-value in each voxel. ADC histogram features were then computed for each tumor VOI. The b = 0 s/mm2 image was excluded from ADC map computation to eliminate possible perfusion effects. The ADC was included as a benchmark metric for aggressiveness classification, since it has been previously shown to correlate with prostate cancer aggressiveness10.
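The per-slice GLCM computation and the voxel-wise ADC fit described above can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study: the function names are hypothetical, only three GLCM features are shown, and the log-linear least-squares fit is one assumed way of fitting the monoexponential model S(b) = S0·exp(−b·ADC).

```python
import numpy as np

# (row, column) steps for the four directions at a distance of one pixel.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def quantize(values, n_levels=32):
    """Fixed bin number discretization of VOI intensities into 0..n_levels-1."""
    lo, hi = values.min(), values.max()
    if hi == lo:
        return np.zeros(values.shape, dtype=int)
    q = np.floor(n_levels * (values - lo) / (hi - lo)).astype(int)
    return np.minimum(q, n_levels - 1)


def glcm(q, mask, offset, n_levels=32):
    """Symmetric, normalized GLCM for one direction, restricted to the mask."""
    dr, dc = offset
    rows, cols = np.nonzero(mask)
    r2, c2 = rows + dr, cols + dc
    inside = (r2 >= 0) & (r2 < q.shape[0]) & (c2 >= 0) & (c2 < q.shape[1])
    rows, cols, r2, c2 = rows[inside], cols[inside], r2[inside], c2[inside]
    both = mask[r2, c2]  # keep only pairs whose neighbour is also in the VOI
    m = np.zeros((n_levels, n_levels))
    np.add.at(m, (q[rows[both], cols[both]], q[r2[both], c2[both]]), 1)
    m = m + m.T  # count each pair in both orders (symmetric matrix)
    total = m.sum()
    return m / total if total > 0 else m


def glcm_features(p):
    """A few GLCM descriptors computed from a normalized matrix p."""
    i, j = np.indices(p.shape)
    nz = p > 0
    return {
        "entropy": -np.sum(p[nz] * np.log2(p[nz])),
        "angular_second_moment": np.sum(p ** 2),
        "inverse_difference_moment": np.sum(p / (1.0 + (i - j) ** 2)),
    }


def slice_texture_features(image, mask, n_levels=32):
    """Direction-averaged GLCM features for one 2D slice of a VOI."""
    q = np.zeros(image.shape, dtype=int)
    q[mask] = quantize(image[mask], n_levels)
    per_dir = [glcm_features(glcm(q, mask, off, n_levels)) for off in OFFSETS.values()]
    return {k: float(np.mean([f[k] for f in per_dir])) for k in per_dir[0]}


def fit_adc(signals, b_values=(100.0, 400.0, 800.0)):
    """Log-linear fit of S(b) = S0 * exp(-b * ADC); signals has shape (n_b, ...)."""
    b = np.asarray(b_values, dtype=float)
    log_s = np.log(np.asarray(signals, dtype=float))
    coef = np.polyfit(b, log_s, 1)  # coef[0] is the slope, i.e. -ADC
    return -coef[0]
```

The sketch averages over directions within a slice and leaves the averaging over slices to the caller; with plain arithmetic means and all four directions available on every slice, this gives the same result as averaging over slices first.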

Prior to feature extraction, the T2W images were corrected for intensity non-uniformity using the N4 bias field correction algorithm28, and subsequently normalized using the automated dual-reference tissue normalization approach29. Briefly, two aggregate feature channel object detectors were separately trained to detect fat and muscle tissue regions, from which reference intensity values (90th and 10th percentiles, respectively) were calculated, and then utilized to normalize the 3D image intensities to pseudo T2 values by linearly scaling the reference values to their corresponding T2 values at 3T from literature. Unlike T2W image intensities, the ADCs were not normalized because they are quantitative measurements in nature. Also, outlier voxels within each VOI, defined as intensities outside the range [µ − 3σ, µ + 3σ]30 were excluded; where μ and σ denote the mean and standard deviation of the intensities within each VOI. All features were computed in accordance with the image biomarker standardization initiative27.
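As a rough illustration of the last two preprocessing steps, the sketch below linearly rescales image intensities so that the two reference values map onto reference T2 values, and drops VOI intensities outside [µ − 3σ, µ + 3σ]. The function names are hypothetical, the detection of fat and muscle regions is not shown, and the reference T2 values must be supplied by the caller since they are not given in the text.

```python
import numpy as np


def normalize_dual_reference(image, fat_ref, muscle_ref, fat_t2, muscle_t2):
    """Linearly map intensities so fat_ref -> fat_t2 and muscle_ref -> muscle_t2.

    fat_ref and muscle_ref are the reference intensities measured in the image
    (90th and 10th percentiles of the detected regions in the text); fat_t2 and
    muscle_t2 are literature T2 values chosen by the caller.
    """
    scale = (fat_t2 - muscle_t2) / (fat_ref - muscle_ref)
    return muscle_t2 + (np.asarray(image, dtype=float) - muscle_ref) * scale


def exclude_outliers(voi_values, k=3.0):
    """Keep only VOI intensities within [mu - k*sigma, mu + k*sigma]."""
    v = np.asarray(voi_values, dtype=float)
    mu, sigma = v.mean(), v.std()
    return v[(v >= mu - k * sigma) & (v <= mu + k * sigma)]
```

Because the mapping is linear, it changes the scale of the intensities but not their rank order within an image.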

### Statistical analysis and classification modeling

Spearman correlation coefficients were calculated to investigate associations between the T2W image features and prostate cancer grade groups (i.e. 1–5). Differences in feature values between the two aggressiveness classes (i.e. low versus intermediate/high) were evaluated using two-tailed Mann–Whitney U-tests. p values were corrected for multiple testing using Benjamini and Hochberg’s approach31 at a false discovery rate of 0.05, with values < 0.05 considered statistically significant.
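For readers unfamiliar with the Benjamini and Hochberg correction, a minimal NumPy sketch of the adjusted p value computation is given below. It is illustrative only (the statistical tests in this study were run in MATLAB), and the function name is hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p value by n / k.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working from the largest p value downwards.
    adjusted_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted
```

A feature would then be called significant when its adjusted p value is below 0.05.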

To evaluate the utility of the features in classifying the two cancer aggressiveness classes, a linear support vector machine (SVM) classifier was trained and tested separately for each feature set (i.e. ADC histogram, T2W histogram and T2W textural features), and the following combinations: T2W histogram + textural features, ADC histogram + T2W histogram, and ADC histogram + T2W histogram + textural features. In this analysis, we were particularly interested in how well cancer aggressiveness at one institution could be predicted by a model trained on data from the other institutions, using histogram features with and without textural feature augmentation. For this, the classifier was iteratively (i = 1:number of institutions, ni) trained and tested, each time using data from ni − 1 institutions as the training set, with the ith institution held out as an independent external test set (Fig. 2). The training employed stratified 10-fold cross-validation for hyperparameter tuning and feature selection. Hyperparameter (misclassification cost, C) tuning and feature selection during training were performed concurrently, via grid search over seven logarithmically spaced values between −1 and 1 inclusive and recursive feature elimination32, respectively. The hyperparameter and feature sets with the lowest mean misclassification error over all 10 folds were selected to build the model. The cross-validation partitioning of the data during training was done at the patient level rather than the tumor level to ensure that multiple tumors from the same patient were all either in the training or in the validation subset. Predictions on the test set (i.e. data from the held-out institution) were, however, made at the tumor level.

Receiver operating characteristic (ROC) curves were computed for each test set, from which the area under the curve (AUC) and 95% confidence intervals (95CI), accuracy, sensitivity and specificity were calculated to evaluate the performances of the classifiers (i.e. feature sets) across the centers. The optimal threshold for calculating the accuracy, sensitivity and specificity was determined from the training set based on the Youden index33, and then applied to the test set.
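The threshold selection can be illustrated with a short NumPy sketch that picks the training-set threshold maximizing the Youden index J = sensitivity + specificity − 1 and then applies it unchanged to the test set. The function names are hypothetical, and ties are resolved in favor of the lowest threshold, which is an assumption.

```python
import numpy as np


def youden_threshold(y_true, scores):
    """Threshold on the training scores that maximizes sensitivity + specificity - 1."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):
        pred = scores >= t
        sensitivity = np.mean(pred[y_true == 1])
        specificity = np.mean(~pred[y_true == 0])
        j = sensitivity + specificity - 1.0
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t


def evaluate_at_threshold(y_true, scores, threshold):
    """Accuracy, sensitivity and specificity on a test set at a fixed threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores, dtype=float) >= threshold
    return {
        "accuracy": float(np.mean(pred == (y_true == 1))),
        "sensitivity": float(np.mean(pred[y_true == 1])),
        "specificity": float(np.mean(~pred[y_true == 0])),
    }
```

Fixing the threshold on the training data avoids tuning the operating point on the held-out institution.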

The added value of T2W textural features was evaluated using two approaches. First, at each institution, the AUCs before (i.e. ADC histogram, T2W histogram, and ADC + T2W histogram) and after augmentation with T2W textural features (i.e. T2W histogram + textural features, and ADC histogram + T2W histogram + textural features) were compared using DeLong’s nonparametric approach for comparing the areas under two or more correlated receiver operating characteristic curves34. Secondly, the differences in performance across the institutions before and after augmentation with T2W textural features were compared using a paired Student’s t-test.

Prior to the classification modelling, two-way ANOVA was performed to evaluate potential effects of data origin (i.e. institution) and cancer aggressiveness on the features. Features for which the interaction between institution and cancer aggressiveness or the main effect of institution was significant were excluded from the model. Each feature was log-transformed to meet the normality assumption of ANOVA. The SVM classification modelling was performed with the scikit-learn library35 in Python (version 3.7, www.python.org), and other statistical analyses were performed in MATLAB R2019a (Mathworks).

## Results

Out of the 128 eligible patients, 32 were excluded due to unavailable MRI (n = 5), MRI artifacts and/or distortion (n = 4), no pathology report/grading (n = 15), and unsatisfactory matching between histopathology and MRI (n = 8). Data from the 96 patients (mean age = 61.3 ± 6.1 years) for whom good quality MRI and post-surgical histopathology data were available were included in this study. In these patients, 127 tumor volumes (mean [range] = 469 [101–1397] voxels) were identified, of which 104 were in the peripheral zone and 23 were in the transition zone. Figure 3 shows the flowchart of patient inclusion and exclusion. Due to the limited number of transition zone tumors, only the peripheral zone tumors (in 87 patients) were analyzed, of which 30 were stratified as low-aggressive, and 74 as intermediate/high-aggressive cancers. An overview of the characteristics of patients and tumors is given in Table 1.

### Feature association with prostate cancer grade group

The Spearman correlation between the image features and cancer grade groups was significant (p < 0.05) for eight T2W intensity histogram and nine textural features (Table 2). Differences in features between the two cancer aggressiveness classes were significant for seven intensity histogram and 16 textural features. Generally, the textural features reflected higher disorder/complexity (i.e. high entropy or lower homogeneity) in intermediate/high-aggressive tumors, and vice versa in low-aggressive tumors (Fig. 4). As expected, the majority of these features (7 intensity and 9 texture features) were found to be common between the two statistical tests.

The two-way ANOVA showed no statistically significant interaction between the effects of data origin (institution) and cancer aggressiveness on any feature. Similarly, the main effect of institution was not significant.

### Classification of low versus intermediate/high-aggressive cancers

Table 3 and Fig. 5 show comparisons of the performance of the feature sets in SVM classification of the two aggressiveness classes at the various institutions. The added value of T2W textural features varied generally between the sites. At the individual centers, the augmentation of ADC and T2W histogram features with T2W textural features resulted in improvement in AUCs at four centers, though not statistically significant (p > 0.05). When considering the overall performance across the centers, the differences in AUCs before and after augmentation with textural features were not significant. However, the difference in accuracy was significant (p = 0.0218) for ADC + T2W histogram versus ADC histogram + T2W histogram + T2W texture features (Fig. 5). In terms of feature importance within the classifier, a similar trend as in the feature association with cancer grade group was observed. Textural features relating to similarity (grey level non-uniformity), maximum probability, and textural complexity (information measure of correlation) were the most frequently selected, in addition to minimum and 10th percentile from intensity histogram (Table 4).

## Discussion

T2W MRI provides high spatial resolution and tissue-specific contrast compared to DW and DCE imaging, but it is predominantly limited to qualitative radiological evaluation of the prostate. In a preliminary study using single-center data24, we showed that quantitatively derived T2W image textural features have the potential to serve as non-invasive markers for assessing aggressiveness. In this work, we extended and confirmed these findings in a multicenter cohort. T2W image textural features, particularly those reflecting homogeneity/similarity (angular second moment, run length non-uniformity, grey level non-uniformity), disorder (entropy) and textural complexity (information measure of correlation) correlated significantly with prostate cancer aggressiveness; and differed significantly between low- and intermediate/high-aggressive prostate cancers as defined by histopathology. Compared to the classifier based on the commonly used histogram metrics from ADC and T2W images, the classifier utilizing histogram features augmented with T2W textural features performed better, an indication that quantitative texture analysis of anatomical images has the potential to reveal additional morphological and pathophysiological information for radiomics-based assessment of prostate cancer aggressiveness.

The usefulness of entropy/complexity- and homogeneity-associated textural features in prostate cancer aggressiveness assessment and classification was shown in our previous study24, and has also been reported by others23,36,37. Histologically, aggressive prostate cancers are characterized by poor differentiation, glandular structure deformation, and loss of cellular integrity of the prostate gland. This disrupts the tissue cyto-architectural patterns, potentially leading to decreased homogeneity and high disorder. Correlations between textural features and prognostic factors and clinical outcome have also been reported36,38. If validated, these quantitatively derived T2W image features could be combined with other MRI parameters as evidence-based markers for prostate cancer. In the context of this study setup, the findings could be particularly useful in active surveillance situations to follow up on low-risk cancer patients, thereby limiting the need for repeated biopsies.

Although a number of promising studies have reported the utility of MRI texture analysis in prostate cancer19,20,21,22,23,36,37,38, very few are based on multicenter cohorts22 or focused on aggressiveness prediction/classification19,21,23,36. Multicenter data sharing is important to fulfill the high data demand for training radiomics-based decision support systems. Furthermore, multicenter studies are necessary to ascertain the applicability and robustness of texture analysis and radiomics, and to facilitate their clinical transition across centers. Texture analysis, which considers spatial relationships between pixels rather than individual pixel intensities as in a histogram, could possibly contribute to overcome the inter-institution and scanner variability challenges associated with multicenter data. Compared to DW and DCE imaging, T2W imaging is generally regarded as the most stable sequence in terms of imperviousness to scanner variations and gradient artifacts, and tolerance in patients (i.e. contrast agent-free). Although these factors add to its importance, T2W imaging is not currently used for quantitative assessment of prostate cancer aggressiveness mainly due to the non-quantitative nature of its signal intensities.