## Introduction

Prostate cancer is the second most frequently diagnosed cancer in men worldwide1. Radical prostatectomy (RP) is a standard-of-care option for all men with localized disease2. Unfortunately, about 20–40% of patients treated with RP will develop a biochemical recurrence (BCR), in which the rise in prostate-specific antigen (PSA) may originate from the prostate bed, pelvic lymph nodes (LNs), or distant metastases. Early detection of the disease could improve the efficacy of intervention and reduce treatment-related toxicity. Conventional imaging studies are thought to have low sensitivity at low PSA levels, which poses a challenge since earlier salvage radiotherapy is known to be more effective than late salvage radiotherapy3,4. Advanced nuclear medicine tests, such as fluciclovine5 and prostate-specific membrane antigen (PSMA) positron emission tomography (PET)6, have a much higher sensitivity and can detect the location of recurrences at much lower PSA values. Studies have reported patient-based sensitivity and specificity of 98.7–100% and 88.2–100%, respectively7,8. Recently, the landmark EMPIRE-1 trial showed improved event-free survival with the incorporation of fluciclovine PET into radiation planning after RP9. A head-to-head trial has shown that the detection rate and sensitivity of PSMA PET are superior to those of fluciclovine (Axumin) PET for pelvic and extrapelvic disease10.

Unfortunately, PSMA PET/computed tomography (CT) carries significantly higher average overall costs than CT scans10. The cost can be prohibitive in low-resource settings or when repeated scans are needed. Therefore, significant barriers currently exist to the widespread use of 68Ga-PSMA-11 PET/CT for detecting prostate cancer recurrence after radical prostatectomy.

Unprecedented progress has been made in artificial intelligence over the past decade, and it has shown great promise in many fields, including computer-aided diagnosis (CAD) of metastatic tumor spread. Researchers have recently proposed numerous solutions for classifying various types of metastases11. For example, Zhou et al. demonstrated the feasibility of classifying breast cancer metastases using convolutional neural networks (CNNs)12, while Ariji et al. designed a CNN for nodal metastasis classification13. For metastatic prostate cancer (PCa), Hartenstein et al. presented work on PCa LN metastasis classification14.

Nevertheless, most current CAD methods for metastasis detection are limited to binary patch classification with an evenly balanced mix of positive and negative cases (50%/50%), which is difficult to apply in the clinical setting15. The reasons are two-fold. First, extracting patches or voxels from each incoming patient's scan and feeding them into a classification algorithm is too labor-intensive to fit into a clinical workflow. Second, an artificially balanced ratio of positive to negative cases bears little resemblance to the ratio seen in the real-world setting.

Compared to most classification methodologies, modern object detection networks are more powerful tools that can identify and localize abnormalities from the entire input feature maps16. Lately, Zhao et al. proposed a triple-combining 2.5D U-Net pipeline for metastatic pelvis bone and LN lesion segmentation on 68Ga-PSMA-11 PET/CT. This framework consisted of three 2.5D U-Nets, which extracted features from axial, coronal, and sagittal planes and predicted tumor masks based on majority voting17. They assessed the regime with the input of CT/PET alone or a fusion of the two.

Recent object detection and localization methods can be divided into one-stage and two-stage approaches. One-stage models, including the YOLO series18 and U-Net derivatives19, are more efficient, whereas two-stage models, including the R-CNN family, are more accurate20. Since most tasks in clinical practice place stricter demands on accuracy, two-stage detectors are better suited to medical imaging21.

In the present study, building on Mask R-CNN22, we investigated the feasibility of detecting PCa LN metastases based solely on diagnostic CT images with contoured pelvic lymph node clinical target volumes (CTVs).

## Materials and methods

### Dataset

#### Patients and data management

In total, 88 PCa patients who showed positive lymph nodes in PSMA PET/CT at 4 institutions (the Technical University of Munich, the University of California at Los Angeles, Ludwig-Maximilians-University of Munich, and the University of Essen) were included. All patients underwent radical prostatectomy, had BCR without prior radiotherapy and underwent 68Ga-PSMA-11 PET/CT at a serum PSA level of less than 1 ng/mL between August 2013 and May 2017 to detect the sites of recurrence. All patients gave written consent to undergo the procedures. The clinical data and Digital Imaging and Communications in Medicine (DICOM) files of all patients were anonymized and imported onto a dedicated radiotherapy contouring workstation at UCLA (MIM, version 6.7.5; MIM Software Inc., location of the company). This post hoc retrospective analysis was approved by the UCLA Institutional Review Board (#12-001882), and the requirement to obtain informed consent was waived. All experiments were performed in accordance with relevant guidelines and regulations.

#### 68Ga-PSMA-11 PET/CT image acquisition

68Ga-PSMA-11 PET/CT imaging was performed according to recent guidelines23. Images were acquired on the Siemens Biograph 128 mCT (68%), Siemens Biograph 64 (19%), Siemens Biograph 64 mCT (9%), and GE Healthcare Discovery 690 (5%). The 68Ga-PSMA tracer was used at all sites. The median injected dose was 154 MBq (range 65–267 MBq). To reduce bladder activity, patients received 20 mg of furosemide at the time of tracer injection if there was no contraindication24. The median uptake period was 59 min (range 37–132 min). A diagnostic CT scan (200–240 mAs, 120 kV) was performed after intravenous injection of contrast agent, followed by whole-body PET image acquisition (2–4 min/bed position)25.

#### Pelvic lymph node clinical target volumes and PET lesion contouring

Pelvic lymph node CTVs were contoured on the CT dataset of the PET/CT scan for all 88 patients by an experienced radiation oncologist who was masked to the PET findings, in accordance with the Radiation Therapy Oncology Group (RTOG) consensus contouring guidelines26,27. CTV is a term commonly used in radiotherapy; in this context, it includes all at-risk LNs plus a margin for microscopic disease. We noticed that in certain cases the pelvic LNs were located at the boundary of the pelvic nodal CTVs drawn following the RTOG guidelines, falling 1–2 pixels outside the RTOG contours. To ensure that the pelvic LN masks cover all pixels of the LN metastases and, more importantly, to compensate for the weak learning capability of CNN filters at the edges of a feature map, we isotropically expanded the pelvic LN CTVs by 10 pixels (l = 6.48 mm). These wider contours introduced false positives (FPs) within the expansion zone, which were then eliminated at the post-processing stage (see details in “Modeling pipeline”). 68Ga-PSMA-11 positive lesions were contoured on the CT images by radiation oncologists. These contours were subsequently used to define 68Ga-PSMA-11-based target volumes25.
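
The isotropic expansion described above can be illustrated with a minimal sketch. It assumes the CTV is available as a 2D boolean mask per slice and uses a disk-shaped dilation written in plain NumPy; the function name `expand_mask` and the toy mask are hypothetical and not taken from the study's code. The pixel spacing of 0.648 mm is simply what the stated equivalence of 10 pixels to 6.48 mm implies.

```python
import numpy as np


def expand_mask(mask: np.ndarray, radius_px: int) -> np.ndarray:
    """Dilate a 2D boolean mask with a disk of the given radius (in pixels)."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius_px, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(-radius_px, radius_px + 1):
        for dx in range(-radius_px, radius_px + 1):
            # Keep only offsets that lie inside the disk, so growth is isotropic.
            if dy * dy + dx * dx <= radius_px * radius_px:
                out |= padded[
                    radius_px + dy : radius_px + dy + h,
                    radius_px + dx : radius_px + dx + w,
                ]
    return out


# Pixel spacing implied by the text: 10 pixels correspond to 6.48 mm.
PIXEL_SPACING_MM = 6.48 / 10

# Toy example: a single-pixel "CTV" grows into a disk of radius 10 pixels.
mask = np.zeros((41, 41), dtype=bool)
mask[20, 20] = True
expanded = expand_mask(mask, radius_px=10)
```

A disk-shaped structuring element is used here so that the margin is the same in every in-plane direction; a square element would add a larger margin along the diagonals.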

#### Data split

The patients were divided into training (nPatient = 52, nNode = 400, split ratio = 3/5), validation (nPatient = 18, nNode = 143, split ratio = 1/5), and test (nPatient = 18, nNode = 196, split ratio = 1/5) sets, balanced on their National Comprehensive Cancer Network (NCCN) risk groups at initial diagnosis. Details of the split by NCCN risk group are given in Table 1.
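
A risk-balanced 3/5, 1/5, 1/5 split of this kind could be sketched as below. This is an illustration only: the patient identifiers, the three risk labels, and the function `stratified_split` are hypothetical, and the exact procedure used in the study is not described beyond balancing on NCCN risk group.

```python
import random
from collections import defaultdict


def stratified_split(patient_risk, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split patient IDs into train/val/test while preserving risk-group proportions."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for patient_id, risk in sorted(patient_risk.items()):
        by_group[risk].append(patient_id)

    train, val, test = [], [], []
    for risk in sorted(by_group):
        ids = by_group[risk]
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        train += ids[:n_train]
        val += ids[n_train : n_train + n_val]
        test += ids[n_train + n_val :]  # the remainder goes to the test set
    return train, val, test


# Synthetic cohort: 30 hypothetical patients spread evenly over three placeholder groups.
patient_risk = {f"P{i:02d}": ("low", "intermediate", "high")[i % 3] for i in range(30)}
train, val, test = stratified_split(patient_risk)
```

Splitting at the patient level, rather than the node or slice level, keeps all slices of one patient in the same set and avoids leakage between training and evaluation.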

#### Windowing analysis

To narrow down the area of metastatic LN detection and accentuate the morphological features of metastases, we focused on the area inside the pelvic CTVs and carefully selected windowing strategies for Hounsfield units (HUs) during training. Table 2 lists representative statistics of the window widths. Notably, the window width ranges were selected by first analyzing the distribution of HUs over all positive-node pixels in the training set and then gradually and symmetrically excluding extreme pixel values at the left and right tails of the distribution based on quantile analysis. In the modeling pipeline below, we explore these PCa LN metastasis HU windows along with the standard soft tissue HU window (−125, 225). This windowing logic is referred to as quantile windowing strategies in the following sections.
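
The quantile windowing idea can be sketched as follows, assuming the HU values of positive-node pixels have been collected into a flat array. The tail fractions, the synthetic HU distribution, and the helper names `quantile_window` and `apply_window` are illustrative assumptions; the actual window values used in the study are those reported in Table 2.

```python
import numpy as np


def quantile_window(positive_hu, tail):
    """Return (low, high) HU bounds after trimming `tail` from each side of the distribution."""
    low, high = np.quantile(positive_hu, [tail, 1.0 - tail])
    return float(low), float(high)


def apply_window(image_hu, low, high):
    """Clip an HU image to [low, high] and rescale it linearly to [0, 1]."""
    clipped = np.clip(image_hu, low, high)
    return (clipped - low) / (high - low)


# Synthetic stand-in for the HUs of positive-node pixels in a training set.
rng = np.random.default_rng(0)
positive_hu = rng.normal(loc=40.0, scale=30.0, size=10_000)

# Symmetric trimming: a larger tail fraction gives a narrower window.
windows = {tail: quantile_window(positive_hu, tail) for tail in (0.005, 0.025, 0.05)}

# The standard soft tissue window quoted in the text, applied to a few sample HUs.
soft_tissue = apply_window(np.array([-500.0, -125.0, 50.0, 225.0, 1000.0]), -125.0, 225.0)
```

Each entry of `windows` would define the input for one separately trained model, which is what later allows several differently windowed models to be combined.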

### 2.5-Dimensional (2.5D) object detection pipeline

#### Data preprocessing

As shown in Fig. 1, our data preprocessing pipeline consists of two paths, for images fed into the pretrained network and the fine-tuned model, respectively. For the pretraining path, we sequentially performed 2.5D concatenation, HU transformation, black border crop-out, and soft tissue windowing. For the fine-tuning path, we performed 2.5D concatenation, HU transformation, LN CTV contour mask- and crop-out, and quantile windowing strategies. Here, 2.5D means that we concatenate the central CT slice channel-wise with its adjacent superior and inferior slices. HU transformation converts the DICOM pixel values stored in the bundled “three-channel” images into HUs. The LN CTV contour mask- and crop-out operation sets the pixels outside the expanded central pelvic nodal contours on CT to zero and crops the image to keep only the CTV region, so as to ease the fine-tuning process.
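
The fine-tuning path can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the volume is a `(Z, H, W)` array of raw stored values, the HU conversion uses the standard DICOM rescale slope and intercept, the channel order (inferior, central, superior) and the edge handling (repeating the boundary slice) are choices made for this sketch, and all function names are hypothetical.

```python
import numpy as np


def to_hu(raw, slope, intercept):
    """Convert stored DICOM pixel values to HU using the rescale slope and intercept."""
    return raw.astype(np.float32) * slope + intercept


def stack_2p5d(volume, z):
    """Stack slice z with its neighbours as channels; boundary slices are repeated."""
    lo = max(z - 1, 0)
    hi = min(z + 1, volume.shape[0] - 1)
    return np.stack([volume[lo], volume[z], volume[hi]], axis=-1)


def mask_and_crop(image, mask):
    """Zero out pixels outside the mask, then crop to the mask's bounding box."""
    masked = np.where(mask[..., None], image, 0.0).astype(image.dtype)
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return masked[:0, :0]  # empty mask: nothing to keep
    return masked[rows[0] : rows[-1] + 1, cols[0] : cols[-1] + 1]


# Toy volume of 4 slices, each 6 x 6 pixels, with typical CT rescale values.
raw = np.arange(4 * 6 * 6, dtype=np.int16).reshape(4, 6, 6)
volume_hu = to_hu(raw, slope=1.0, intercept=-1024.0)

# Rectangular stand-in for the expanded CTV mask on the central slice.
mask = np.zeros((6, 6), dtype=bool)
mask[2:4, 1:5] = True

sample = mask_and_crop(stack_2p5d(volume_hu, z=0), mask)
```

The resulting three-channel array has the same layout as an RGB image, which is what allows a detector designed for natural images to consume it without architectural changes.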

After the above procedures, both paths end with uniform normalization and data augmentation of the images. The data are geometrically augmented using random resizing (largest image width set to 640–800), horizontal flipping (p = 0.5), and random rotation (angle 0–180°), and morphologically augmented using random Gaussian noise (kernel = 5, sigma = 1) and random brightness.

#### Modeling pipeline

As shown in Fig. 2, the complete workflow includes three steps: the initial pretrained whole slice imaging (WSI) Mask R-CNN, the further fine-tuned Regional Mask R-CNN, and “window bagging”. Our rationale is elaborated below.

##### Prediction post-processing

During experiments, we found that our Regional Mask R-CNN still suffered from two types of false positives that could benefit from post-processing: predictions near the outer boundary of the expansion zone, and predictions on vascular or bowel structures. Three hyper-parameters (see $${\tau }_{4-6}$$ in Table 3) were cross-validated to automate the post-processing. For FPs at the expansion zone boundary, we set $${\tau }_{4}$$ to restrict the valid nodal expansion zone for predicted LNs to a width in the range of 1–10 pixels. For vascular/bowel structure FPs, we set $${\tau }_{5-6}$$: $${\tau }_{6}$$ is the quantile of all HUs within the predicted detection box, and $${\tau }_{5}$$ is the HU threshold above which a prediction was excluded from the final prediction set, since vessel and bowel patterns both have higher HUs than pelvic nodes on contrast-enhanced CT.
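
One reading of the HU-based filter is sketched below: for each predicted box, take a quantile of the HUs inside it and discard the box if that value exceeds an HU threshold. The function `keep_box`, the box format `(x0, y0, x1, y1)`, and the values 0.9 and 150 HU are placeholders for illustration; they are not the cross-validated values of $${\tau }_{6}$$ and $${\tau }_{5}$$ from Table 3.

```python
import numpy as np


def keep_box(image_hu, box, hu_quantile, hu_threshold):
    """Keep a box only if the chosen HU quantile inside it does not exceed the threshold."""
    x0, y0, x1, y1 = box
    patch = image_hu[y0:y1, x0:x1]
    if patch.size == 0:
        return False  # degenerate box: nothing to evaluate
    return float(np.quantile(patch, hu_quantile)) <= hu_threshold


# Synthetic slice: soft tissue background with one bright, vessel-like region.
image_hu = np.full((64, 64), 40.0)
image_hu[10:20, 10:20] = 300.0

boxes = [(10, 10, 20, 20), (30, 30, 40, 40)]
kept = [b for b in boxes if keep_box(image_hu, b, hu_quantile=0.9, hu_threshold=150.0)]
```

Using a quantile rather than the maximum makes the rule less sensitive to a few bright pixels at the edge of an otherwise node-like box.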

##### “Window bagging”

To further enhance model performance, we bagged multiple post-processed Regional Mask R-CNNs trained with different quantile windowing inputs, the so-called “window bagging”, to count the votes from the crowd. Notably, the dataset was not bootstrapped for each voter, since we believe that inputs with different quantile windowing diversify the training information and therefore avoid collinearity. Details of our “window bagging” workflow can be seen in Fig. 2.

$${\tau }_{1-3}$$ are cross-validated hyper-parameters for tuning “window bagging”. $${\tau }_{1}$$ is the intersection over union (IoU) threshold used to decide which detection boxes generated by different voters form the final “window bagging” prediction. $${\tau }_{2}$$ is the number of voters in the final “window bagging” model. $${\tau }_{3}$$ is the IoU threshold for deciding whether a bagged prediction hits the ground truths (GTs).
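
A minimal sketch of IoU-based voting across voters is given below. It is one plausible realization, not the study's implementation: a box is kept when boxes from enough voters overlap it above an IoU threshold (playing the role of $${\tau }_{1}$$). The `min_votes` parameter and the rule for suppressing duplicates are assumptions introduced here, and boxes are `(x0, y0, x1, y1)` tuples.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def window_bagging(voter_boxes, iou_threshold, min_votes):
    """Keep boxes supported by at least `min_votes` voters, one representative per cluster."""
    kept = []
    for i, boxes in enumerate(voter_boxes):
        for box in boxes:
            # The proposing voter counts as one vote; every other voter adds one
            # if any of its boxes overlaps this box sufficiently.
            votes = 1 + sum(
                any(iou(box, other) >= iou_threshold for other in others)
                for j, others in enumerate(voter_boxes)
                if j != i
            )
            duplicate = any(iou(box, k) >= iou_threshold for k in kept)
            if votes >= min_votes and not duplicate:
                kept.append(box)
    return kept


# Three hypothetical voters, e.g. models trained with different HU windows.
voters = [
    [(10, 10, 20, 20), (50, 50, 60, 60)],
    [(11, 10, 21, 20)],
    [(10, 11, 20, 21), (80, 80, 90, 90)],
]
bagged = window_bagging(voters, iou_threshold=0.5, min_votes=2)
```

In this toy case, only the box near (10, 10) is supported by more than one voter, so the two isolated detections are discarded as likely false positives.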

#### Loss function

Although hybrid loss functions have recently been used in various deep networks30,31,32, our loss function was kept the same as in the original Mask R-CNN due to its efficiency with the dataset.

$$L={L}_{cls}+{L}_{box}+{L}_{mask} \tag{1}$$

where $${L}_{cls}$$ and $${L}_{box}$$ follow the definitions in Faster R-CNN33 and $${L}_{mask}$$ is the average binary cross-entropy loss proposed in Mask R-CNN22.
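
For reference, the average binary cross-entropy can be written out explicitly. The notation below is borrowed from the Mask R-CNN formulation rather than defined in this study: $$m\times m$$ is the mask resolution of a region of interest, $$k$$ is its ground-truth class, $${y}_{ij}$$ is the binary ground-truth mask value at pixel $$(i,j)$$, and $${\hat{y}}_{ij}^{k}$$ is the per-pixel sigmoid output of the mask branch for class $$k$$.

$${L}_{mask}=-\frac{1}{{m}^{2}}\sum_{1\le i,j\le m}\left[{y}_{ij}\log {\hat{y}}_{ij}^{k}+\left(1-{y}_{ij}\right)\log \left(1-{\hat{y}}_{ij}^{k}\right)\right]$$

Only the mask predicted for the ground-truth class contributes to this term, which decouples mask prediction from classification.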

### Model training

Our 2.5D object detection pipeline was implemented in the detectron2 project (https://github.com/facebookresearch/detectron2) using PyTorch and run on a GPU cluster with 4 × RTX A6000 GPUs. Figure 3 shows the two training processes in detail.

For WSI Mask R-CNN, we trained on the three-channel whole slice images using the stochastic gradient descent (SGD) optimizer for 3 k iterations, with a batch size of 64 (4 $$\times$$ 16), a learning rate (LR) of 0.01 decreased tenfold at 2 k iterations, a momentum of 0.9, and a weight decay of 0.0001.

For Regional Mask R-CNN, we fine-tuned on the three-channel images masked by the pelvic nodal contours using SGD for 6 k iterations, with a batch size of 64 (4 $$\times$$ 16), an LR of 0.005 decreased tenfold at both 4 k and 5 k iterations, a momentum of 0.9, and a weight decay of 0.0001. The final training loss decreased to around 0.4.
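
The two step schedules described above can be reproduced with a small helper. This sketch only models the tenfold drops at the stated milestones; any warm-up phase that the training framework may apply at the start of training is deliberately left out, and the function name `step_lr` is hypothetical.

```python
def step_lr(iteration, base_lr, milestones, gamma=0.1):
    """Learning rate after multiplying by `gamma` at each milestone already reached."""
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma**drops


# WSI Mask R-CNN: 3 k iterations, LR 0.01, one tenfold drop at 2 k.
wsi_lr = [step_lr(i, 0.01, (2000,)) for i in (0, 1999, 2000, 2999)]

# Regional Mask R-CNN: 6 k iterations, LR 0.005, tenfold drops at 4 k and 5 k.
regional_lr = [step_lr(i, 0.005, (4000, 5000)) for i in (0, 3999, 4000, 5000, 5999)]
```

Under this schedule the fine-tuning run ends at a learning rate of 0.00005, a hundredth of its starting value.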

### Model evaluation

We report the best performance, tuned on individual criteria including sensitivity, precision, and F-1 score, for the prediction post-processing and “window bagging” steps. Sensitivity is defined at the metastasis level: if the model locates at least one slice of a metastatic LN, we count the entire metastasis as a hit. Precision is defined at the slice level, counting each slice of a metastasis captured by the detection box predictions. All metrics are evaluated at the node level rather than the patient level.
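
The difference between the two counting levels can be made concrete with a small sketch. It assumes that each predicted box has already been matched to a ground-truth node (or to none) by an IoU rule, so a prediction is represented as a `(slice_index, node_id_or_None)` pair; this data layout, the node labels, and the function names are hypothetical.

```python
def lesion_sensitivity(gt_nodes, predictions):
    """Fraction of ground-truth nodes hit on at least one slice."""
    hit = {node for _, node in predictions if node is not None}
    return len(hit & set(gt_nodes)) / len(gt_nodes)


def slice_precision(predictions):
    """Fraction of predicted boxes (one per slice) that match a ground-truth node."""
    if not predictions:
        return 0.0
    true_positives = sum(1 for _, node in predictions if node is not None)
    return true_positives / len(predictions)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


# Three ground-truth nodes, given by the slices they span.
gt_nodes = {"A": {10, 11, 12}, "B": {30}, "C": {41, 42}}

# Five predicted boxes: three match nodes A and B, two match nothing.
predictions = [(10, "A"), (11, "A"), (30, "B"), (55, None), (56, None)]

sensitivity = lesion_sensitivity(gt_nodes, predictions)
precision = slice_precision(predictions)
f1 = f1_score(precision, sensitivity)
```

Node A is counted once for sensitivity even though it is detected on two slices, whereas both of those slices count as true positives for precision.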

## Results

Positive pelvic LN GTs with the CTV contours are visualized in Fig. 4. Qualitative and quantitative results are presented in Fig. 5 and Table 4, respectively. Figure 5 enlarges representative 2D images to highlight the sub-regions near the predicted or ground-truth positive LNs and the detection boxes. Note that a positive LN can be found in multiple adjacent 2D slices, and several positive LNs can appear in one slice. Visually, Fig. 5 shows no clear difference between true positives (TPs), FPs, and false negatives (FNs), illustrating the challenge of using CT directly for manual lymph node detection and classification.

Table 4 shows a quantitative comparison of detection methods. The single ImageNet-pretrained Regional Mask R-CNN resulted in robust sensitivity achieving ~ 80% AUC and detecting > 60% of the positive LNs but low precision under 30%. Fine-tuning individual Regional Mask R-CNNs from weights of WSI Mask R-CNN improves the precision by ~ 5% without compromising sensitivity. Prediction post-processing improved each learner by another 15%. Lastly, via “window bagging” of Regional Mask R-CNN pretrained on WSI as well as prediction post-processing, we obtained another 5% gain in precision score with a high sensitivity of 83.351% and AUC of 90.034%.

## Discussion and conclusion

In this study, we developed a 2.5D deep learning pipeline for detecting metastatic prostate cancer LNs. As shown in Fig. 5, the differences between negative and positive nodes are subtle on CT, making it impractical for human observers to perform the detection task. However, after supervised learning based on PSMA PET, our AI pipeline located the majority of positive pelvic LNs based solely on the pelvic LN region extracted from CT scans, achieving an AUC of 90.034%, a sensitivity of 83.351%, and a specificity of 58.621% on 196 positive pelvic LNs (18 patients) in the test set. Our results compare favorably with the triple-combining 2.5D U-Net proposed by Zhao et al., who reported a specificity of 54.8% and a positive predictive value of 59.7% when CT alone was input to their network17.

Object detection of metastatic PCa lymph nodes using WSI CT scans is a challenging task mainly due to the enormous class imbalance between positive and negative voxels, the almost identical morphological patterns between abnormal and normal LNs, the large variance of appearances of the normal and abnormal tissues, the interference from complex pelvic structures (vascular, bowel, and pelvic bone structures), the infeasibility to balance positive and negative LNs on a WSI, and, in this specific task, a relatively small dataset to train the deep learning network. Nevertheless, our object detection pipeline still achieved superior sensitivity and relatively lower specificity than the easier binary classification problem.

We addressed these challenges with four strategies: transfer learning from WSI, fine-tuning on regional pelvic LN CTVs, prediction post-processing, and “window bagging”. Our results show an additive and progressive improvement, indicating that these strategies act through independent mechanisms: (1) pretraining on entire CT slices provides more background information; (2) precise regional searching within CTVs greatly reduces the complexity of feature learning; (3) prediction post-processing with tuned hyper-parameters helps refine the spatial and pixel-wise search regions; (4) “window bagging” of voters synthesizes individual training cohorts to reduce FPs while improving the robustness of sensitivity.

The present study has important clinical implications. Pelvic LN recurrence after definitive local therapy can be treated with external beam radiation therapy with or without androgen deprivation therapy. Many studies have demonstrated the good efficacy and safety profile of whole pelvic radiation with a simultaneous integrated boost to lymph nodes with gross disease3,34. Another more targeted yet experimental approach is to deliver stereotactic body radiation therapy specifically to the individual involved lymph nodes without irradiating the pelvic lymph node region comprehensively35,36,37. In either approach, detailed information regarding the location of pelvic LNs harboring PCa is essential for treatment planning. Traditional CT-based detection relies largely on morphological characteristics of the LNs, such as size (≥ 9–10 mm), presence of a fatty hilum, shape (oval vs. round), and the short/long axis ratio38. PSMA PET/CT was able to detect LN metastasis in nodes under 10 mm in size, with one study reporting a 60% detection rate for nodes between 2 and 5 mm39. Patients with a lower Gleason score (GS) tended to have smaller PSMA-positive LNs (mean 7.7 mm) than patients in the intermediate (mean 9.4 mm) and high GS cohorts. Based on the CT morphology criteria, only 34% of low GS patients, 56% of intermediate GS patients, and 53% of high GS patients were considered CT positive40. The examples shown in Fig. 5 confirm the challenge of visually detecting positive lymph nodes.

As PSMA PET/CT has yet to become widely available due to financial and availability barriers, a low-cost and easily accessible alternative that can help predict the presence and location of potential pelvic LN involvement based solely on conventional diagnostic CT is extremely appealing. The method developed here is not intended to replace PSMA PET/CT. Rather, it may help clinicians select patients who may benefit the most from PSMA PET/CT. The high accuracy of classifying patients with or without positive LNs is conducive to such a task.

The current dataset with 52 training patients is still far from sufficient, leaving room to further reduce the FPs and FNs with more training data. Additionally, the current pipeline benefits from manual pelvic LN CTV segmentation, which helps focus on a smaller and more relevant search volume. However, manual labeling of the structure can be inconsistent. Moreover, LN CTVs for radiotherapy purposes do not precisely delineate the individual pelvic lymph nodes. Additional non-LN tissues are included in the CTV, complicating the detection task. In the future, an automated pelvic LN segmentation network can be trained to improve both aspects based on curated CT with detailed labeling of the structure, such as the data released by the CAMELYON17 challenges. We also plan to apply more complex z-dimensional slice fusion strategies to provide more context information for the network, and to add more background information via pretraining on other datasets, including DeepLesion41 and Luna1642. In addition, adding attention gating to the network is another direction to explore. Lastly, as an extension of this work, the performance of our proposed approach can be compared with that of a capsule network, since capsule networks can preserve the spatial relationships of learned features and have recently been proposed for image classification tasks43,44,45.

Another limitation of the study is that the PSMA PET is not a perfect ground truth for training and validation. PSMA PET detection sensitivity has been reported between 40 and 60% in a study46 for patients with low PSA levels. However, the same method used in the study should be applicable as enhanced diagnostic information from histopathology and complementary imaging modalities, e.g., hyperpolarized C-13 MRI, becomes available.