Deep learning-based decision support system for the diagnosis of neoplastic gallbladder polyps on ultrasonography: Preliminary results

Ultrasonography (US) has been considered image of choice for gallbladder (GB) polyp, however, it had limitations in differentiating between nonneoplastic polyps and neoplastic polyps. We developed and investigated the usefulness of a deep learning-based decision support system (DL-DSS) for the differential diagnosis of GB polyps on US. We retrospectively collected 535 patients, and they were divided into the development dataset (n = 437) and test dataset (n = 98). The binary classification convolutional neural network model was developed by transfer learning. Using the test dataset, three radiologists with different experience levels retrospectively graded the possibility of a neoplastic polyp using a 5-point confidence scale. The reviewers were requested to re-evaluate their grades using the DL-DSS assistant. The areas under the curve (AUCs) of three reviewers were 0.94, 0.78, and 0.87. The DL-DSS alone showed an AUC of 0.92. With the DL-DSS assistant, the AUCs of the reviewer’s improved to 0.95, 0.91, and 0.91. Also, the specificity of the reviewers was improved (65.1–85.7 to 71.4–93.7). The intraclass correlation coefficient (ICC) improved from 0.87 to 0.93. In conclusion, DL-DSS could be used as an assistant tool to decrease the gap between reviewers and to reduce the false positive rate.

Baseline characteristics of data sets. There was no significant difference in the population characteristics (age, man to woman ratio, proportion of nonneoplastic and neoplastic polyps, and size of polyps) between the development set and test set (p ≥ 0.05). The baseline characteristics of the data sets are described in Table 1.
In the both development set and test set, patient age and polyp size showed statistically significant differences between nonneoplastic and neoplastic polyp (p < .001). The average size of nonneoplastic and neoplastic polyps was 9.4 ± 3.5 mm and 18.4 ± 8.4 mm, respectively. The optimal size cutoff for differentiating a neoplastic polyp was over 13.1 mm (with sensitivity 70.1%, and specificity 87.1%) form the receiver operating characteristic (ROC) curve analysis. However, there was a substantial amount of size overlap between the nonneoplastic and neoplastic polyps. The size distribution histogram is shown on Fig. 1.

Outcomes of clinical validation of DL-DSS.
Step 1: Diagnostic performance of image analysis by three human reviewers. Table 2 summarizes the diagnostic performance of the three human reviewers. Three reviewers showed different area under the receiver operating characteristic curve (AUC) from 0.78 to 0.94. Reviewer A showed relatively good sensitivity (31 of 35 [88.6%]) and specificity (54 of 63 [85.7%]) on classifying the neoplastic polyp. However, reviewers B and C showed relatively low specificities of 43/63 (68.3%) and 41/63 (65.1%) respectively. The overall accuracy was between 68/98 (69.4%) and 85/98 (86.7%). Table 3 summarizes the comparison with the US image findings between neoplastic polyps and nonneoplastic polyps. Among the three reviewers, the characteristic image findings of neoplastic polyps were single, larger, and sessile compared to the nonneoplastic polyps. A majority of the reviewers (two of three) reported lobulated surface contour, the presence of a vascular core, and heterogeneous internal echogenicity was also a common finding in the neoplastic polyps. Among the US findings, multiplicity, size, vascular core, and presence of gallstone  Table 2). One radiologist with 11 years of abdominal ultrasound experience showed better performance than the DL-DSS. However, the system showed a better performance than two reviewers, who had 3 and 5 years of abdominal ultrasound experience. A statistically significant difference was observed between the DL-DSS and reviewer B (p < 0.01).
Step 3: Diagnostic performance of image analysis by three human reviewers with the aid of the DL-DSS. Table 2 shows that all three reviewers' diagnostic performances increased when the DL-DSS was added. All three of the reviewers' AUCs became higher than 0.9. A statistically significant improvement was shown for reviewer B, whose performance increased from 0.78 to 0.91 (p < 0.01). Among the sensitivity, specificity, and accuracy, the specificity showed marked growth in all three reviewers (Fig. 2). In addition, the interobserver agreement improved with the aid of the DL-DSS (ICC from 0.87 to 0.93).
Diagnostic performance for GB polyps larger than 10 mm. Table 4 summarizes the diagnostic performance of the three human reviewers and the DL-DSS for GB polyps larger than 10 mm. When the GB polyps were There was a substantial overlap zone between nonneoplastic polyps and neoplastic polyps.

AUC Comparison
Sensitivity Specificity Accuracy F-1 Score Step 1: Initial Image Analysis Step 1  www.nature.com/scientificreports www.nature.com/scientificreports/ larger than 10 mm, the AUC of all experiments became lower. However, the added gain of the DL-DSS became more evident. For the human reviewers, the gap between sensitivity and specificity became more prominent. The sensitivity and specificity were from 69.7 to 100 and from 41.9 to 77.4, respectively. Additionally, the sensitivity and specificity of the DL-DSS were 78.8 and 87.1, respectively. The maximum gap of specificity between human and DL-DSS was exaggerated from 27 to 45.2. The p-values of the comparison between the 1 st and 2 nd review became more significant (<0.001-0.16).

Discussion
The DL-DSS could be a useful assistant tool for the differential diagnosis of neoplastic GB polyp by decreasing the gap between reviewers with different experience levels and reducing the false positive rate. The initial AUCs for neoplastic GB polyp were 0.94, 0.78, and 0.87. With the DL-DSS assistant, reviewers' performances improved to 0.95, 0.91, and 0.91. The diagnostic performance of the less-experienced radiologists significantly improved with the aid of the DL-DSS (0.78 to 0.91 p < 0.01 for Reviewer B and 0.87 to 0.91, p = 0.17 for Reviewer C). The ICC improved from 0.87 to 0.93 with the aid of DL-DSS.
Among the important US finings for neoplastic polyps such as a single, larger, and sessile polyp, a larger size was the only independent factor on the multivariable regression. The optimal cutoff dividing the nonneoplastic polyp and neoplastic polyp was 13.1 mm. The importance of the size also had been reported on previous studies. According to Yeh C-N's reports 19 , a size criteria that was equal to 10 mm or larger was the independent risk factor for neoplastic polyp. Cha BH, et al. collected 210 patients who had a GB polyp larger than 10 mm, and found that a size equal to 15 mm or larger was an independent risk factor suggesting neoplastic polyp 20 . Choi TW, et al. collected 136 patient's high-resolution US images and reported that a single polyp and large diameter were the meaningful factors predicting a neoplastic polyp 21 .
The DL-DSS showed a higher specificity than all of the human reviewers, and this characteristic helped in improving the human performance. Although the human reviewers mainly relied on the size of polyp to differentiate neoplastic polyp, there was a substantial overlap zone between nonneoplastic polyps and neoplastic polyps in the size distribution plot. On the subgroup analysis for the large polyp (≥10 mm), the specificity of the human reviewers dropped markedly than the DL-DSS. The human reviewers overcalled the neoplastic polyp when the polyp measured 10 mm or larger, and this trend led to unnecessary cholecystectomies. The DL-DSS could reduce the false positive rate, thereby avoiding unnecessary cholecystectomy.
The DL-DSS could help to improve a less-experienced radiologist's performance and narrow the gap between reviewers. Nam JG, et al. demonstrated that the malignant pulmonary nodule detection deep learning algorithm improved the nodule detection performance of 18 physicians with different experience levels 16 . The performance gap between the thoracic radiologist group and other physicians became lower. Bien N, et al. developed and validated the deep learning model to detect a general abnormality, anterior cruciate ligament tear, and meniscal tear on knee MRI data 17 . The model improved the interrater reliability among the 7 general radiologists and 2 orthopedic surgeons. Our study results also showed that the diagnostic performance of a nonexperienced radiologist significantly improved with the aid of the DL-DSS (0.78 to 0.91 p < .01 for Reviewer B and 0.87 to 0.91, p = 0.17 for Reviewer C).
Our study has some limitations. First, this study was only performed in a single center, and only temporally external validation was done. Second, the DL-DSS differentiated between nonneoplastic polyps and neoplastic  www.nature.com/scientificreports www.nature.com/scientificreports/ polyps rather than diagnosing the malignant polyps. Since malignant polys are rare, not enough of a development dataset could be collected. With the multicenter dataset, the system could also learn this function. Third, in this study, we used Inception V3 and did not compare with more recent networks such as DenseNet, NASNet, etc. This is one of limitation of this research and further study is needed to make optimal DL-DSS. Finally, as a retrospective study, this study could not reflect the real clinical setting. A multicenter prospective study is needed.
In conclusion, we developed the DL-DSS, which has higher specificity at differentiating neoplastic polyps on transabdominal US. When it was used as an assistant tool, the gap between the reviewers with different experience levels narrowed, and the false positive rate was reduced, especially for the polyps with a size of 10 mm or larger. The DL-DSS could be used as an assistant tool for the differential diagnosis of neoplastic GB polyps using US, decrease the gap between the reviewers with different experience levels, and reduce the false positive rate, thus avoiding unnecessary cholecystectomies.

Materials and Methods
This study was approved by the Institutional Review Board of Seoul National University Hospital (IRB No. 1712-007-903). The Institutional Review Board granted a waiver of informed patient consent due to the retrospective nature of our study. All methods were performed in accordance with the relevant guidelines and regulations.
Patients. We reviewed our institution's medical patient records from 2006 to 2017 and identified 8,452 patients who received cholecystectomy and 8,047 patients who received US for GB. From this data, we collected 923 patients on whom a GB polyp was found through US and received consecutive cholecystectomy. Among them, the following patients were excluded: polyp was not identified on pathologic report (n = 290); polyp smaller than 4 mm in size (n = 82); suboptimal image quality (n = 7); size and location of the polyp was evidently different between US and pathologic report (n = 6); rare pathology such as lymphoma or metastasis (n = 3). On the re-evaluation, some of the reviewers downgraded the score. (b) Three patients with a neoplastic polyp, measured from 13 to 18 mm size. Some of the reviewers classified these polyps as nonneoplastic polyp with confidence scale 3 or less. On the other hand, patient-level probability value was from 0.7 to 0.9, favoring neoplastic polyp. On the re-evaluation, some of the reviewers upgraded the score. (2020) 10:7700 | https://doi.org/10.1038/s41598-020-64205-y www.nature.com/scientificreports www.nature.com/scientificreports/ Finally, a total of 535 patients were included in our study group, including 357 patients with nonneoplastic polyps and 179 patients with neoplastic polyps. We divided selected patients into two groups to make two temporally independent datasets: the development dataset and test dataset. According to study date of their latest US exam, the patients who received US before January 2015 were considered as the development dataset (n = 437), and the patients who received US from January 2015 were grouped as the test dataset (n = 98). Figure 3 shows the flowchart of this study population. US examination. An US examination was performed by the radiologists, who have approximately 10 to 20 years of abdominal ultrasound experience, using an US scanner (LOGIQ 9 or LOGIQ E9, GE Healthcare, Milwaukee, WI, USA). First, the GB was observed grossly by a subcostal or intercostal approach with the convex low-MHz transducer (C1-6 or 4C, GE Healthcare, Milwaukee, WI, USA). Then the high resolution images were achieved by a linear transducer (9L or 7L, GE Healthcare, Milwaukee, WI, USA), which produced a high frequency ultrasound wave with a bandwidth of 2-8 MHz 21,22 .

Development of DL-DSS.
Labeling of dataset. The images were selected from each patient's US studies according to the following criteria: only the B-mode images were captured with the linear transducer, the images did not contain any annotation such as size measurement, and the captured polyp should match the pathology report in terms of its size and location. The US image selection and labeling process were performed by two radiologists (Kim JH and Jeong YB, with 23 years and two years of abdominal ultrasound experience, respectively) with consensus. If one patient received serial US follow up, all of the US studies were included when the patient was in the development dataset. By contrast, only the latest US study was included when the patient was in the test dataset. All of the selected images were converted into Portable Network Graphics (PNG) format. Then, all of the images were manually cropped using an in-house program. In every image, we drew a free size square box that contained one polyp and its attachment site. The square box was drawn as small as possible to contain the whole following structures; polyp, stalk, and focal portion of GB wall the polyp attached. The center of the box was located as close as to the polyp's center. If there were multiple polyps in one image, each meaningful polyp that was 4 mm or larger and matched the pathologic description was cropped separately. We evaluated the size of each polyp simultaneously. We used the size that was measured by the onsite radiologist when it was available. If not, a retrospective size estimation was conducted using the scale bar in the image.
Training for the DL-DSS. We used the transfer learning method base on the GoogleNet Inception v3 Convolutional Neural Network (CNN) architecture 23,24 . DL-DSS used both cropped image and nonimage information. All of the cropped images were resized into 299 × 299 pixels. Intensity normalization was used as a preprocessing method. Geometric image augmentation techniques such as vertical/horizontal flipping, rotation, and cropping were used to reduce overfitting. Then, the image input was processed with a pretrained Inception v3 model to extract the high-level features. Nonimage inputs, including patient age, size of polyp and polyp multiplicity, were concatenated to the last fully connected layer of the network using a late fusion strategy 25 . In the end, DL-DSS reports a probability value that represents the image-level probability of a neoplastic polyp. A schematic diagram of the DL-DSS is described in Fig. 4.
The development dataset was randomly divided into an 8:2 ratio and was then used for development (4138 images) and test (1033 images). The training process was done with a learning rate of 0.001 and an RMSprop optimizer. We used an early stopping strategy to prevent overfitting, and training was performed for up to 400 epochs where validation loss has reached a plateau (Fig. 5).

AUC
Comparison Sensitivity Specificity Accuracy F-1 Score Step 1: Initial Image Analysis Step 1 vs

Clinical validation of the DL-DSS.
Step 1: Image analysis by three human reviewers. Two abdominal radiologists (Reviewers A and B, with 11 and 5 years of abdominal ultrasound experience, respectively) and one trainee radiology resident (Reviewer C with 3 years of abdominal ultrasound experience) independently reviewed the 98 patients' US exams in the test dataset. They were blinded to the histologic diagnoses. First, they reviewed the following characteristics of polyp: multiplicity (solitary or multiple), size of the largest polyp (mm), overall shape (pedunculated or sessile), surface contour (smooth or lobulated), presence or absence of a vascular core seen on color Doppler sonography, presence of GB stone, internal echogenicity level (hypoechoic or iso-to hyperechoic), internal echogenicity pattern (homogenous or heterogenous) and presence of hyper-or hypoechoic foci. When the patient had multiple polyps, the largest polyp was evaluated. The detail method to evaluate the polyp characteristics was based on the previous researches 21,22 . Second, the radiologists graded the possibility of a neoplastic polyp using a 5-point confidence scale (1: definitely a nonneoplastic polyp; 2: probably a nonneoplastic polyp; 3: borderline; 4: probably a neoplastic polyp; and 5: definitely a neoplastic polyp) for each patient. The grading was based on not only the reviewers own experience, but also the polyp characteristics. All reviewers knew the relationship between the risk of neoplastic polyp and polyp characteristics, based on two previous researches 21,22 .
Step 2: Image analysis by DL-DSS. A temporally external validation of the DL-DSS was done with the 98 patients in the test set. Unlike human reviewers' 5-point confident scale, the DL-DSS output the decimal probability value range from 0 to 1 (0: definitely a definite nonneoplastic polyp, and 1: definitely a neoplastic polyp). As multiple images were included for one patient, the average of the multiple image-level probability values was used as a patient-level probability value (Fig. 4). Using the patient-level probability value from each patient, we performed a ROC curve analysis and calculated the sensitivity, specificity, and accuracy of the DL-DSS.
Step 3: Image analysis by three human reviewers with the aid of the DL-DSS. To evaluate the added value of the DL-DSS aided diagnosis, a second review was conducted by three reviewers after 4 months since the first human www.nature.com/scientificreports www.nature.com/scientificreports/ review. They were still blinded to the histologic diagnoses. The following DL-DSS analysis data from each patients of test dataset was provide to the reviewers: the number of cropped images analyzed by DL-DSS, patient-level probability value, and standard deviation of multiple image-level probability values. US-images and previous 5-point confidence scale that own rated were also provided to the reviewers. Finally, considering all this information from the DL-DSS, including DL-DSS performance data and DL-DSS analysis data, the reviewers rerated the 5-point confidence scale. We also provided the performance of the DL-DSS on development dataset was introduced, by providing the ROC curve, sensitivity, specificity, and accuracy (with cut-off probability value 0.5).
Statistical analysis. The population characteristics, such as man to woman ratio, a proportion of nonneoplastic and neoplastic polyps, were compared using a Pearson's Chi-square test between the development set and test set. Also, patient age and size of polyp were compared using an independent t test. These method were also All cropped Image were resized into 299 × 299 pixels, and then processed with a pretrained Inception v3 model. Nonimage inputs were concatenated to the last fully connected layer of the network using a late fusion strategy. (b) As multiple images were included for one patient, the average of the multiple image-level probability values was used as a patient-level probability value. It is a continuous value between 0 and 1. Zero represents a definite nonneoplastic polyp, and vice versa.
In the step 1, polyp characteristics were compared between the nonneoplastic polyps and neoplastic polyps using a Pearson's Chi-square test and Fisher's exact test. The sizes of the polyps were compared using an independent t test. A logistic regression analysis was performed to extract meaningful predictors for neoplastic polyp. After the univariate analysis with the US findings, variables with p values less than 0.05 were chosen for the multivariate analysis. Among three reviewers, the interobserver agreement was calculated with an ICC for each US finding and confidence scale. An ICC was classified into three levels of agreement: 0.59 and less as poor agreement, 0.60-0.79 as moderate agreement, and 0.80-1.00 as good correlation.
In step 1 and step 3, the diagnostic performance evaluation of reviewers was done by a ROC curve analysis, using a 5-point confidence scale. Sensitivity, specificity, accuracy, and F-1 score were calculated with a cut-off confident scale (>3). In step 2, the diagnostic performance evaluation of the DL-DSS was done by a ROC curve analysis, using a patient-level probability value. Sensitivity, specificity, accuracy, and F-1 score were calculated with a cut-off probability value (>0.5). An AUC comparison was performed by using a pairwise comparison ROC curve analysis.
All of the analyses were performed with software (SPSS version 25.0, IBM, Armonk, NY; and MedCalc version 18.2, MedCalc Software, Ostend, Belgium). For all of the tests, a P value less than 0.05 was considered to indicate statistical significance.

Data availability
The datasets generated during the current study are not publicly available due to our institutional review board prohibits publication of patient's personal medical records. on the development and test datasets, respectively. We used an early stopping strategy to prevent overfitting, and training was performed up to 400 epochs where validation loss has reached a plateau. Final accuracy was 96.69% and 88.58% in the development set and test set.