A data-driven ultrasound approach discriminates pathological high-grade prostate cancer

Accurate prostate cancer screening is imperative for reducing the risk of cancer death. Ultrasound imaging, although easy to perform, tends to have low resolution and high inter-observer variability. Here, we show that our integrated machine learning approach enabled the detection of pathological high-grade cancer from the ultrasound procedure. Our study included 772 consecutive patients and 2899 prostate ultrasound images obtained at the Nippon Medical School Hospital. We applied machine learning analyses to ultrasound imaging data and clinical data to detect high-grade prostate cancer. The area under the curve (AUC) using clinical data alone was 0.691, whereas the AUC using both clinical data and ultrasound imaging data was 0.835 (p = 0.007). Our data-driven ultrasound approach offers an efficient tool to triage patients with high-grade prostate cancers and expands the possibility of ultrasound imaging for the prostate cancer detection pathway.


Results
Image and patient characteristics. A flowchart of the study is presented in Fig. 1. We evaluated prediction accuracies for prostate cancer using the following data sets: still ultrasound image data (left upper box), clinical data (age and prostate-specific antigen [PSA]) (right upper box), and integrated data (ultrasound image data, total prostate volume [TPV] derived from ultrasound images, PSA density [PSAD], age, and PSA) (lower box). Table 1 summarizes the 691 cases (2676 images) that underwent ultrasound-guided prostate biopsy (systematic biopsy and magnetic resonance imaging [MRI]-targeted biopsy) at the Nippon Medical School Hospital. The median age (interquartile range) of all cases was 71 (65-76) years, and that of the cancer cases was 72 (66-77) years. Patients with cancer were significantly older than those without cancer (p < 0.001). The median PSA value of all cases was 8.3 (5.8-14.0) ng/mL, and that of the cancer cases was 9.5 (6.7-20.5) ng/mL. The PSA level was significantly higher in cancer cases than in non-cancer cases (p = 0.002). The median TPV of all cases was 35.0 (25.8-50.6) cm³, and that of the cancer cases was 30.9 (23.0-42.6) cm³. TPV was significantly lower in cancer cases than in non-cancer cases (p < 0.001). The median PSAD of all cases was 0.245 (0.138-0.482) ng/mL/cm³, and that of the cancer cases was 0.352 (0.207-0.681) ng/mL/cm³. PSAD was significantly higher in cancer cases than in non-cancer cases (p < 0.001). The distribution of biopsy Gleason scores was 6 (47 cases), 7 (215 cases), 8 (79 cases), 9 (94 cases), and 10 (1 case). High-grade cancer (Gleason score ≥ 8) accounted for 39.9% of the cancer cases.
Classification of prostate cancers (n = 691). Image-level classification using a deep neural network. We performed two sets of deep learning analyses using different labels. First, we applied a deep neural network to the ultrasound image data with labels of the cancer group (positive, Gleason score ≥ 6) and the others (negative) (Table 2). Next, we applied the deep neural network to the same ultrasound image data with labels of the high-grade cancer group (positive, Gleason score ≥ 8) and the others (negative). The AUC of the high-grade cancer classification was 0.723 (95% CI 0.659-0.788) (Table 2). Note that we used only ultrasound images for these classifications. Supplementary Fig. S1 shows the receiver operating characteristic (ROC) curves for the classification accuracy.
Case-level classification based on clinical data. We also applied logistic regression and support vector machine (SVM) 24,25 analyses to the clinical data with the aforementioned labels: (1) cancer classification label: cancer group (Gleason score ≥ 6) or non-cancer group; (2) high-grade cancer classification label: high-grade cancer group (Gleason score ≥ 8) or the others. Age and PSA were used as clinical data in this study because they are known to be important factors in the screening of prostate cancers 26 . Table 2 presents the AUCs for each classification. The AUC of the cancer classification was 0.702 (95% CI 0.598-0.806), and that of the high-grade cancer classification was 0.691 (95% CI 0.582-0.801) (SVM). Note that we used only clinical data for these classifications. Supplementary Fig. S2 shows the ROC curves for these classifications.
Case-level classification using an image integration approach. Next, we applied logistic regression and SVM to the predicted probabilities of the three ultrasound images per case pre-analyzed by deep learning. In every case, we selected the three ultrasound images with the largest |P_dl − 0.5|, where P_dl is the predicted probability of the deep learning classification (i.e., the most confidently classified images). Table 2 shows the AUCs. Supplementary Fig. S3 shows the ROC curves for these classifications. The highest AUC in the cancer classification was 0.776 (95% CI 0.683-0.870) (SVM), and that in the high-grade cancer classification was 0.816 (95% CI 0.725-0.908).
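The selection step can be illustrated with a short, self-contained Python sketch; the paper does not specify an implementation, so the column names (case_id, p_dl) and the toy values below are illustrative assumptions:

```python
# A minimal sketch of the per-case image selection described above.
# Column names and toy values are illustrative, not from the paper.
import pandas as pd

def select_top3_images(df: pd.DataFrame) -> pd.DataFrame:
    """Keep, for every case, the three images whose deep learning
    probability is farthest from 0.5 (the most confident predictions)."""
    df = df.assign(confidence=(df["p_dl"] - 0.5).abs())
    return (df.sort_values("confidence", ascending=False)
              .groupby("case_id")
              .head(3))

# Example: per-image predicted probabilities from the deep learning step.
images = pd.DataFrame({
    "case_id": [1, 1, 1, 1, 2, 2, 2],
    "p_dl":    [0.91, 0.12, 0.55, 0.48, 0.30, 0.70, 0.52],
})
print(select_top3_images(images))
```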
Case-level classification using a data integration approach. Finally, we applied logistic regression and SVM analyses to the integrated data (the three per-case ultrasound image probabilities pre-analyzed by deep learning, TPV, PSAD, age, and PSA). We again selected the three most confidently classified ultrasound images in every case. Table 2 shows the AUCs. Supplementary Fig. S4 shows the ROC curves for these classifications. With the integrated data, the highest AUC in the cancer classification was 0.807 (95% CI 0.719-0.894) (SVM), and that in the high-grade cancer classification was 0.835 (95% CI 0.753-0.916) (SVM). Figure 2 shows the ROC curves for the high-grade cancer classification using clinical data alone and using the integrated data (clinical data plus data derived from ultrasound images). The AUC of the integrated data was significantly higher than that of the clinical data alone (0.691 [95% CI 0.582-0.801]) in the high-grade cancer classification (p = 0.007).
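As a rough illustration of the integration step, the following Python sketch fits an SVM with probability outputs to a per-case feature vector combining the three image probabilities with TPV, PSAD, age, and PSA; the random toy data, feature scaling, and RBF-kernel default stand in for details the text does not specify:

```python
# Illustrative data-integration sketch; toy data only.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Columns: p_dl_1, p_dl_2, p_dl_3, TPV, PSAD, age, PSA (assumed layout)
X_train = rng.random((200, 7)); y_train = rng.integers(0, 2, 200)
X_test  = rng.random((50, 7));  y_test  = rng.integers(0, 2, 50)

# SVM with probability outputs so an ROC curve can be constructed.
clf = make_pipeline(StandardScaler(), SVC(probability=True))
clf.fit(X_train, y_train)
p = clf.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, p))
```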

Classification of prostate cancers in cases of systematic biopsy (n = 532). To eliminate selection bias based on MRI, we selected the 532 cases of systematic biopsy without MRI-targeted biopsy. In the image-level classification, the AUC of the cancer classification was 0.670 (95% CI 0.607-0.733), and that of the high-grade cancer classification was 0.732 (95% CI 0.658-0.807). We also applied logistic regression and SVM 24,25 analyses based on the clinical data. The AUC of the cancer classification was 0.639 (95% CI 0.511-0.766) (SVM), and that of the high-grade cancer classification was 0.665 (95% CI 0.535-0.796) (SVM). Next, we applied logistic regression and SVM analyses to the three per-case ultrasound image probabilities pre-analyzed by deep learning. The AUC of the cancer classification was 0.722 (95% CI 0.602-0.841) (SVM), and that of the high-grade cancer classification was 0.814 (95% CI 0.708-0.920) (SVM). Finally, logistic regression and SVM analyses were applied to the integrated data. Figure 3 shows the ROC curves for the high-grade cancer classification using clinical data alone and using the integrated data derived from ultrasound images (p = 0.013). The AUCs showed a tendency similar to that observed in "Classification of prostate cancers (n = 691)".
Prostate ultrasound images of top five cases corresponding to histological cancer grading. Figure 4 shows prostate ultrasound images of the top five cases corresponding to each histological grading group (high-grade cancer, low-grade cancer, and non-cancer; see "Prostate ultrasound images corresponding to histological cancer grading" in Materials and methods).

Discussion
We aimed to identify pathological high-grade prostate cancer using ultrasound images and limited clinical data. We evaluated the prediction accuracies of the following datasets: ultrasound image data, clinical data, and integrated data. Several imaging examinations are performed in a complementary manner, each with its own strengths. For example, MRI contributes to the detection of significant prostate cancers. In recent years, several studies have reported the benefits of MRI prior to systematic biopsy 29 . Pellicer-Valero et al. proposed a fully automatic system based on deep learning with prostate MRI that can perform cancer segmentation and cancer grading 30 . In their results, the AUC for intermediate- and high-grade cancer detection (Gleason grade group ≥ 3) was 0.767 (ProstateX test data 31 ) and 0.840 (Valencia Oncology Institute Foundation data 30 ). On the other hand, the AUC of our method for high-grade cancer detection (Gleason grade group ≥ 4) was 0.835. The ROC curve showed a sensitivity of 0.909 and a specificity of 0.609 at a cutoff value optimized for triage. Ultrasound imaging is widely used in prostate cancer screening globally because it can be easily performed at the bedside. In this study, we showed that our data-driven ultrasound approach offers an efficient tool to triage patients with high-grade prostate cancers.
The main limitation of this study was that it was conducted at a single facility. High inter-observer variability is known to degrade the accuracy of ultrasound diagnosis. However, we analyzed over 2500 ultrasound images. In addition, we applied an augmentation technique and transfer learning based on ImageNet 32 . In the future, we will obtain validation sets from other facilities. Furthermore, a full three-dimensional ultrasound image analysis might further improve prediction accuracy so that small cancer lesions are not missed. Although further investigation is needed to reinforce our results, we hope that our method will contribute to the accurate diagnosis of prostate cancer.
Deep learning algorithms have achieved great success in medical image analyses owing to the high affinity between neural networks and images. Integrated analysis of medical multimodal data is a key factor driving practical technology in the next stage. Even when the predictive power of each data source is insufficient on its own, appropriate machine learning techniques can improve predictive power through data integration. Identifying the appropriate combination of data is important for better use of the data stored in the hospital. In this study, the image selection based on the predicted probability of the deep learning classification contributed to the case-level prediction (Supplementary Table S3).

Materials and methods
Study population. Our study included 772 consecutive patients and 2899 ultrasound images acquired between November 2017 and June 2020. The patients underwent ultrasound-guided prostate biopsy at Nippon Medical School Hospital in Tokyo, Japan. The study profile is shown in Fig. 5. We excluded cases with a transperineal biopsy of the prostate (eight cases), a history of intravesical bacillus Calmette-Guérin therapy (two cases), and others (insufficient saved images and data: 71 cases). We evaluated 2676 ultrasound images (691 cases) acquired via a transrectal approach using deep learning analysis. A systematic biopsy was performed in 532 patients, and a combination of both MRI-targeted and systematic biopsies was performed in 159 cases. We divided these images into training data (November 2017 to December 2019: 590 cases, 2299 images) and test data (January 2020 to June 2020: 101 cases, 377 images). We determined the hyperparameters using only the training data (Supplementary Table S2). This study was confirmed and approved by the Institutional Review Boards of the Nippon Medical School Hospital (reference 28-11-663) and RIKEN (reference ). Informed consent was obtained from all patients. All methods were carried out in accordance with relevant guidelines and regulations.

Ultrasound and biopsy procedure. Ultrasound images of prostate glands were saved at four locations (base, mid, apex-mid, and apex). The prostate volume was calculated for each case. Subsequently, a prostate biopsy was performed. Figure 6 shows the systematic prostate biopsy sites at each location (number of biopsy sites): lateral apex (2), parasagittal apex (2), lateral apex-mid (2), parasagittal apex-mid (4), lateral mid (2), and lateral base (2). In 532 cases, a systematic biopsy was performed; in the remaining 159 cases, a combination of both MRI-targeted and systematic biopsies was performed. The highest biopsy Gleason score on each ultrasound image was used as the label of the corresponding image; thus, different images from the same case could be assigned different Gleason scores. We used an ultrasound system (Aplio i800; Canon Medical Systems, Tokyo, Japan) with a 6 MHz transrectal probe (PVT-770 RT; Canon Medical Systems, Tokyo, Japan). All still ultrasound images were saved in Digital Imaging and Communications in Medicine (DICOM) format.
Pathological cancer grading. Two pathologists reviewed each biopsy core and reported cancer with an assigned Gleason score. Prostate cancer was diagnosed pathologically based on the International Society of Urological Pathology grading 33 . Pathologists diagnosed all cases independently and then reached a consensus.
Ultrasound images. All DICOM ultrasound images were converted into JPEG images. We then extracted a rectangular region of the prostate from the images. This rectangular region included proximate tissues, such as the prostatic capsular vessels, pelvic fascia, and rectum. We then adjusted these images to a size of 256 × 256 pixels. We applied an augmentation technique, including a zoom-range parameter, for the deep learning analysis. We assigned positive or negative labels to these datasets based on the pathological grading. We defined (1) labels for the cancer group (Gleason score ≥ 6) or the others, and (2) labels for the high-grade cancer group (Gleason score ≥ 8) or the others.
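The augmentation step can be sketched in Python with Keras, whose ImageDataGenerator exposes a zoom-range parameter matching the description above; the zoom value and the dummy batch below are illustrative assumptions, not the paper's settings:

```python
# Illustrative augmentation sketch (assumed Keras; parameter values
# are not reported in the paper).
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random zoom plus pixel rescaling.
datagen = ImageDataGenerator(zoom_range=0.2, rescale=1.0 / 255)

# Dummy batch standing in for the 256 x 256 cropped prostate regions.
images = np.random.randint(0, 256, size=(8, 256, 256, 3)).astype("float32")
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

augmented_images, augmented_labels = next(
    datagen.flow(images, labels, batch_size=8))
```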
Classification using a deep neural network. First, we tested three deep convolutional neural network models, Xception 23 , InceptionV3 34 , and VGG16 35 , which were pre-trained on ImageNet 32 with classification layers adapted to our labels. We selected Xception in this study because it exhibited the best performance for ultrasound image classification. We divided all images into test and training data based on the date of ultrasound evaluation for each case (Fig. 5). To evaluate the classification by the deep neural network, we constructed an ROC curve with the corresponding AUC. This study employed the RIKEN AIP Deep Learning Environment (RAIDEN) supercomputer system for all computations.
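A minimal transfer-learning sketch in Python/Keras, assuming the setup described above (ImageNet-pretrained Xception with a new binary classification head); the optimizer, loss, and head architecture are illustrative, and the paper's actual hyperparameters are listed in Supplementary Table S2:

```python
# Illustrative sketch only; hyperparameters are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import Xception

# ImageNet-pretrained Xception without its 1000-class top layer.
base = Xception(weights="imagenet", include_top=False,
                input_shape=(256, 256, 3), pooling="avg")

# New classification layer adapted to a binary label
# (e.g., high-grade cancer vs. the others).
output = layers.Dense(1, activation="sigmoid")(base.output)
model = models.Model(inputs=base.input, outputs=output)

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
# model.fit(train_images, train_labels, validation_data=(...))
```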
Machine learning analysis for the case-level prediction. First, we applied logistic regression (ridge and lasso) and SVM analyses only to the clinical data (age and PSA) to classify the case data. Subsequently, we applied these algorithms to the three per-case ultrasound image probabilities pre-analyzed by deep learning (predicted probability of the deep learning classification). Comparison among different numbers of image sets showed that the AUC was highest with the top three probabilities (largest |P_dl − 0.5|, where P_dl is the predicted probability of the deep learning classification) (Supplementary Table S3). We therefore selected those three ultrasound images in every case. Only cases with at least three images were used in this case-level prediction; 678 cases remained for case-level classification. We also divided these case data into training and test sets, as described in "Study population". Finally, we applied these algorithms to the integrated data: the three ultrasound image probabilities pre-analyzed by deep learning, TPV, PSAD, age, and PSA. We constructed an ROC curve with the corresponding AUC. We used R software for the analysis, using the glmnet package (version 2.0.16) for the ridge and lasso regression and the e1071 package (version 1.7.0) for the SVM. Calculations were performed automatically using the software packages.
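The authors ran these models in R (glmnet for ridge and lasso, e1071 for the SVM); the following Python/scikit-learn sketch is only an approximate equivalent, with toy data standing in for the clinical features and illustrative regularization strengths:

```python
# Approximate Python equivalent of the R glmnet ridge/lasso analysis;
# toy data and C values are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X = rng.random((200, 2))     # two clinical features, e.g., age and PSA
y = rng.integers(0, 2, 200)  # case-level label (high-grade or not)

models = {
    "ridge": LogisticRegression(penalty="l2", C=1.0),
    "lasso": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
}
for name, model in models.items():
    model.fit(X, y)
    auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
    print(f"{name}: in-sample AUC = {auc:.3f}")
```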
Prostate ultrasound images corresponding to histological cancer grading. We evaluated prostate ultrasound images corresponding to pathological cancer grading based on two types of labels: (1) the cancer classification label and (2) the high-grade cancer classification label. We selected the top five highest predicted probability cases from the high-grade cancer logistic regression (ridge) as representative cases corresponding to the high-grade cancer group. We selected the top five lowest predicted probability cases from the cancer logistic regression as representative cases corresponding to the non-cancer group. We defined the top five highest predicted probability cases of cancer without high-grade cancer as the low-grade cancer group.
Saliency map. We applied Grad-CAM to construct a saliency map for deep learning analysis 27,28 . Grad-CAM is a technique used to produce visual explanations of decisions made by convolutional neural networks.
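As a sketch of how such a saliency map can be produced for a Keras model like the one above (the layer name below is the final convolutional activation of Keras' Xception; everything else is an illustrative assumption):

```python
# Illustrative Grad-CAM sketch for a Keras convolutional model.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name="block14_sepconv2_act"):
    """Return a heatmap of the regions driving the positive prediction."""
    grad_model = tf.keras.models.Model(
        model.input,
        [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        score = preds[:, 0]                       # positive-class score
    grads = tape.gradient(score, conv_out)        # d(score)/d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))  # global-average pooling
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                      # keep positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```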
Statistical analysis. We compared the characteristics of patients with and without cancer using the Wilcoxon rank-sum test for continuous data. The ROC curves were constructed and compared using the 'pROC' package (version 1.13.0) in R 36 . All reported p values were two-sided, with the level of statistical significance set at p < 0.05.
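These tests were run in R; as a rough Python equivalent for the group comparison, the Mann-Whitney U test in SciPy matches the Wilcoxon rank-sum test (the toy PSA values below are illustrative only):

```python
# Illustrative Python equivalent of the Wilcoxon rank-sum comparison;
# the Mann-Whitney U test is the same test under a different name.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)
psa_cancer = rng.lognormal(mean=2.3, sigma=0.6, size=300)      # toy PSA, cancer
psa_non_cancer = rng.lognormal(mean=2.1, sigma=0.6, size=300)  # toy PSA, non-cancer

stat, p_value = mannwhitneyu(psa_cancer, psa_non_cancer,
                             alternative="two-sided")
print(f"Wilcoxon rank-sum p = {p_value:.4f}")
```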

Data availability
The clinical datasets used were collected at the Nippon Medical School Hospital. They are not publicly available, and restrictions apply to their use.