A robust convolutional neural network for lung nodule detection in the presence of foreign bodies

Lung cancer is a major cause of death worldwide. As early detection can improve outcome, regular screening is of great interest, especially for certain risk groups. Besides low-dose computed tomography, chest X-ray is a potential option for screening. Convolutional network (CNN) based computer aided diagnosis systems have proven their ability of identifying nodules in radiographies and thus may assist radiologists in clinical practice. Based on segmented pulmonary nodules, we trained a CNN based one-stage detector (RetinaNet) with 257 annotated radiographs and 154 additional radiographs from a public dataset. We compared the performance of the convolutional network with the performance of two radiologists by conducting a reader study with 75 cases. Furthermore, the potential use for screening on patient level and the impact of foreign bodies with respect to false-positive detections was investigated. For nodule location detection, the architecture achieved a performance of 43 true-positives, 26 false-positives and 22 false-negatives. In comparison, performance of the two readers was 42 ± 2 true-positives, 28 ± 0 false-positives and 23 ± 2 false-negatives. For the screening task, we retrieved a ROC AUC value of 0.87 for the reader study test set. We found the trained RetinaNet architecture to be only slightly prone to foreign bodies in terms of misclassifications: out of 59 additional radiographs containing foreign bodies, false-positives in two radiographs were falsely detected due to foreign bodies.

With about 1.7 million deaths in 2018, lung cancer is one of the most common causes of cancer death 1 . As an early diagnosis improves outcomes 2 , regular screening with imaging methods is beneficial. For screening, especially low-dose computed tomography (CT) shows promising results in order to reduce mortality 3 . While a regular clinical chest CT requires a dose of 4-18 mSv 4 , low-dose CT applies an effective dose of around 1.5 mSv 3 . Compared to a posteroanterior study of the chest, which requires around 0.02 mSv 4 , the applied dose is still significantly higher. Hence, due to wider availability and possible avoidance of radiation induced long term effects, chest X-ray (CXR) is a potential alternative to chest CT for lung cancer screening. However, interpretation of CXR images is often challenging, as small lung nodules can easily be missed. For successful lung cancer screening it is mandatory to keep the rate of false-negatives low.
False-positive cancer diagnoses on the other hand may lead to substantial psychological consequences in patients, such as changes in self-perception or anxiety, as investigated for colorectal cancer 5 . Thus, for successful lung cancer screening, keeping the rate of false-negatives and false-positives as low as possible is mandatory.
With the rise in computing power, deep-learning based computer-aided diagnosis (CAD) systems have gained interest in the research community. Only recently, performance of human readers in disciplines such as breast cancer screening 6 and dermoscopic melanoma image classification 7,8 was met or even exceeded. For mammography and chest X-ray classification, networks which are trained with case-level labels showed promising results [9][10][11][12] . However, such systems can only provide disease locations by the use of techniques such as saliency maps 13 . As these usually provide only inaccurate location boundaries, it is of interest to train such system with detailed annotations such as box coordinates or segmentations. Besides, it is also possible to train such Scientific RepoRtS | (2020) 10:12987 | https://doi.org/10.1038/s41598-020-69789-z www.nature.com/scientificreports/ networks in a semi-supervised manner, e.g. where a part of the data is labeled on pixel-level and the remaining radiographs are annotated on case-level 14 . For deep learning applied on CXR images with pixel level annotations, U-Net-like architectures can be employed for segmentation tasks 15,16 . Current state of the art methods for pneumothorax detection 17 or mammography screenin 6 make use of box-annotations, which can be derived from pixel wise annotations. Both aforementioned studies use a RetinaNet architecture, a one stage detector 18,19 , which is characterized by a faster inference time than two stage detectors 20,21 . The aim of this study was to train a RetinaNet detector for the task of pulmonary nodule detection, which is robust to foreign bodies. We evaluated its accuracy for screening and nodule detection tasks. Furthermore, we compared its performance to the participants of a reader study.

Results
Nodule location detection. For nodule localisation, the assessed RetinaNet architecture achieved 43 truepositives, 26 false-positives and 22 false-negatives. In comparison, performance of the two readers was was 42 ± 2 true-positives, 28 ± 0 false-positives and 23 ± 2 false-negatives. Detailed results are shown in Table 1. If not otherwise stated, all results in this paper are given in the form mean ± standard deviation. The nodule detection performance of RetinaNet can be inspected visually (Fig. 1). Lung segmentation was used to exclude extrathoracic detections. For lung segmentation a Dice score of 0.97 was achieved.
In order to investigate if larger nodules can be detected more easily and if nodules in radiographs with many additional nodules are detectable more easily, we plotted these parameters against the detection score and performed a linear regression model fit for the number of nodules ( Fig. 2A) and the nodule size (Fig. 2B). Furthermore, a free response receiver operating characteristic (FROC) curve is shown in Fig. 3B.
Screening. The classification performance was also assessed on case-level using a ROC (receiver operating characteristic) curve. Here the true-positive rate is plotted over the false-positive rate. For RetinaNet, an AUC (area under the ROC curve) value of 0.87 with a confidence interval (CI) range from 0.80 to 0.94 was found for the model. The performance of the two radiologists for the case-level screening task was 26 TP/4 FP and 31 TP/11 FP respectively. The ROC curve for the model with radiologist scores is shown in Fig. 3A.

Investigation of foreign body detections.
We investigated false-positive detections for five different types of foreign bodies ( Table 2). These included ports, electrocardiography devices (ECG), surgical clips (Clips), sternum cerclages and pacemakers. For RetinaNet, one false-positive detection due to foreign bodies occurred for each, ECG and port. Examples of foreign body detections, as predicted by the RetinaNet architecture are shown in Fig. 4A and B. Overlapping boxes (Fig. 4B) occur rarely and probably due to a suspected bigger nodule behind two smaller ones.

Discussion
In this study, we trained and investigated a CNN for pulmonary nodule detection and compared the predictions made by the CNN to the results of two professional radiologists. Besides nodule localisation, the usability of the algorithm for case-level screening was investigated. An important aspect of this study was to evaluate the possibility of foreign bodies contributing to wrong decisions in CNN-based nodule detection systems. This was not investigated in previous work 9, 10, 22 and therefore we evaluated the question, if foreign bodies are wrongly detected as nodules by a CNN. For the nodule detection task, the CNN was able to outperform one radiologist. The CNN was found to identify larger nodules more easily, but also performed well on smaller nodules (Fig. 2B).
For the screening task, the underlying question was if a radiograph contains nodules or not. Case-level predictions were derived from the predicted box-annotations, similar as it was done for mammography classification by McKinney et al. 6 . Compared to CNNs already trained with case-level annotations 9 , CNNs trained with box-annotations, such as the investigated RetinaNet architecture, have substantial differences: Unfortunately, box-level annotations have to be generated additionally by an expert radiologist and cannot be retrieved easily from existing PACS data. While the availability of annotated data is a common limitation for the training of deep learning systems in a clinical setting, the investigated technology relies on the availability of such annotations. In case such annotations are not available, weakly-supervised training 23-25 may be a possible alternative. However, in order to understand the CNNs decisions, CNNs trained with box-annotations allow a more detailed evaluation of retrieved results: CNNs trained with box annotations are capable of providing an independent score and an accurate location for each lesion. In contrast, CNNs trained with case level annotations can only provide a score for the whole image. Table 1. Results for the nodule detection task for radiologists and the RetinaNet model. Evaluation was performed with respect to true-positives (TP), false-positives (FP) and false-negatives (FN).

TP FP FN
Radiologist A 40 28 25 Radiologist B 44 28 21 RetinaNet 43 26 22 Scientific RepoRtS | (2020) 10:12987 | https://doi.org/10.1038/s41598-020-69789-z www.nature.com/scientificreports/ In literature, Wang et al. 9 and Rajpurkar et al. 10 reported a ROC AUC of 0.72 and 0.78 for weakly supervised CNN based lung nodule screening, respectively. For screening, we achieved a ROC AUC of 0.87 in our experiments. www.nature.com/scientificreports/ Several studies have reported CAD performance for lung nodule detection. Here, sensitivities of CAD systems vary widely between literature, whereat a higher sensitivity usually yields a higher false-positive rate. Li et al. 26 reported 47 of 66 nodules were correctly marked by a CAD system for CXR lung nodule detection. This gives a sensitivity of 0.71 at a mean false-positive rate of 1.3. For nodule detection Kim et al. 22 reported a sensitivity of 0.83 at a false-positive rate per radiograph of 0.2. For CT based nodule detection CAD systems, an average sensitivity of 0.82 was reported at a cutoff of 3 false-positives per radiograph by Jacobs et al. 27 . At a false positive cutoff of about 0.2, their FROC curve yielded a sensitivity of around 0.53. To compare these results to our experiments, we can select a specific point on the FROC Curve (Fig. 3B), e.g. a sensitivity of 0.59 at a false-positive rate of 0.2 per radiograph.
Comparison of performance within literature is difficult, however, as results highly depend on the dataset (e.g. nodule count and nodule size) and objective of the study. Quekel et al. 28 reported lesion miss rates for lung cancer on chest radiographs between 25 and 90% for different studies involving human observers. Therefore, a comparison of the given ROC and FROC results between different papers should be interpreted with caution.
Besides, the potential of deep learning systems, this study also shows that before clinical use of a deep learning system, it has to be carefully assessed how uncommon image characteristics can contribute to false decisions of CNNs. While we could have designed an ideal dataset without foreign bodies, a remarkable feature of our dataset is that we intentionally included foreign bodies in our training and test sets. Therefore, our dataset is closer to the clinical routine and tests the robustness of the detection algorithm. Hence, an important finding of  The present study has some limitations: above all, the size of the dataset was rather small. While we used 411 radiographs for training, other studies use larger datasets (e.g. 11,734 images for training by Mckinney et al. 6 for mammography), which could further improve performance. Next, we only utilized PA radiographies for the detection task. In addition, the use of additional lateral chest radiographs could increase the performance further, but requires additional segmentations from human experts.
Moreover, we only used a single center data-set from our institution, which may inhibit the ability to translate the model to different populations and devices 29 . Last, we limited the CNN input resolution to 512 × 512 pixels, in order to reduce the computational workload.

conclusion
In this study, we trained and evaluated a RetinaNet based CNN and conducted a reader study. In summary, the presented CNN has the potential to help radiologists during clinical routine and is robust to foreign bodies. The CNN's decisions can be followed by inspection of individual lesion scores and box-predictions, which is an advantage over other CNN architectures.
As there are still a few foreign body detections, in future work it has to be investigated, if it is sufficient to train with a larger dataset, or an auxiliary CNN 30 is needed to identify abnormal cases and react correspondingly. With advances in healthcare digitisation, information about foreign bodies may also be available in machine readable form soon. Such information, stored in patient records, may be used to alter the CNNs decision (e.g. to invalidate lesion scores in the region of a known pacemaker). Furthermore, to enhance classification performance, we plan to collect and annotate more data. Additionally, the CNN could be trained to detect multiple pathologies, as done for case-level annotations in prior studies 10 .

Materials and methods
Dataset. Data access was approved by the institutional ethics committee at Klinikum Rechts der Isar (Ethikvotum 87/18 S) and the data was anonymized. The ethics committee has waived the need for informed consent. All research was performed in accordance with relevant guidelines and regulations. A dataset of 391 Table 2. Number of radiographs with false-positive (FP) detections due to foreign bodies (FB) made by the RetinaNet architecture. False-positives due to FB only occurred for radiographs classified as port or ECG.

Foreign body
Total radiographs Radiographs with FP due to FB  Table 3. Additional clinical information from the medical report, such as follow-up CT scans were available to verify the diagnosis. Thereby, case-level groundtruth labels (unsuspicious or nodulous) were assigned based on the diagnosis of two radiologists: the first radiologist made the diagnosis in clinical routine and a second radiologist (JB, 3 years of experience in chest imaging) verified and segmented the nodules retrospectively using our in-house built web-based platform. For the reader study test-set, one more radiologist verified the segmented nodules (DP, 12 years of experience). From the segmentations, bounding boxes were extracted based on the segmentation boundaries. From the radiographs with nodules, 257 were used for training. Training data was supplemented by the Japanese Society of Radiological Technology (JSRT) dataset 31 , from which 154 additional radiographs with annotated nodules were obtained. Therefore, the total number of radiographs used for training was 411.
Additionally, lung segmentations for all 247 JSRT files were obtained from the segmention in chest radiography (SCR) database 32 in order to train a lung segmentation network. Please note that data for lung segmentation also includes 93 additional non-nodulous images from the JSRT database. For lung segmentation train, validation and test set size was set to 157, 40 and 50. Fig. 5. For nodule detection, we employed a RetinaNet architecture 18 . This architecture was successfully utilitzed in prior literature 6,22 for nodule detection in radiographs. It inputs a preprocessed radiograph and outputs multiple box-coordinates of nodule locations with additional scores (assessing confidence).

Network training. General workflow for network training is illustrated in
For preprocessing, images were resampled to 512 × 512 pixels. Afterwards, histogram equalization was performed. Resulting intensities were normalized to values between 0 and 1.
Training was performed using a batch size of 1 with 1000 steps per epoch. For the utilized loss function (focal loss), hyperparameters were set to α = 0.25, γ = 2.0 . The initial learning rate was set to 10 −5 and reduced by factor 0.1 after 3 epochs of stagnating loss ( δ = 0.0001 ). The network was trained for 50 epochs in total. Data augmentation transformations included contrast, brightness, shear, scale, flip, and translation. From the training set, 80 percent of the radiographs were used for training and 20 percent for validation. None of the training or validation data was part of the reader study or foreign body test sets. Models were implemented based on kerasretinanet 33 using Tensorflow 34 and Keras 35 . As a RetinaNet backbone, ResNet-101 36 was used.
To invalidate extrathoracic nodule detections made by RetinaNet, an additional lung segmentation network was developed. For lung segmentation, an U-Net 15 like architecture is applied in illustrated in Fig. 6. U-Net like architectures were successfully applied for lung segmentation in previous literature 16 . Training masks were generated by combining the left lung lob, right lung lobe and heart mask from the SCR dataset. An Adam optimizer with a learning rate of 10 −4 was used. Total number of epochs was set to 30. Augmentation operations included zoom, height, shift and rotation. As a loss function, the Dice loss according to 37 was used. Table 3. Patient demographics for training and test subsets. Mass size is given as a fraction of the radiograph size (1.0 would indicate every pixel of the radiograph is a nodule). As for screening and foreign body (FB) test sets segmentations were unavailable, nodule mass and location were not provided. Within a radiograph multiple foreign bodies may occur. Secondary pathologies (SP) were excluded from the reader study. AM indicates acromastinum induced artifacts, which often show nodule-like morphological characteristics.

Metrics.
In each radiograph, we evaluated the number of true-positives (TP), false-positives (FP) and falsenegatives (FN). True-negatives were not counted, as these include all possible remaining boxes within the radiograph. To determine the aforementioned numbers, a distance measurement is required, whereat we utilized a method similar to Shapira et al. 25 : within a single radiograph, first the center of masses of all ground-truth and prediction lesions are determined. Next, between each pair (G i , P i ) of a ground-truth lesion G i and a prediction lesion P i , the euclidean distance is calculated. If the euclidean distance is below a certain threshold D, the nodule accounts as a true-positive. If there is no neighbour within the distance D for a G i or P i , the nodule counts as false-negative and false-positive respectively. For a 512 × 512 pixel image we set the value of D to 23,  www.nature.com/scientificreports/ which means a distance below 23 pixels for lesion centers between ground-truth and prediction yields a TP. To determine this value, we let the radiologist who annotated the groundtruth mark the lesion center in an additional experiment. Afterwards we calculated the distances between groundtruth segmentation center-of-mass and marked lesion center. The maximum distance between the groundtruth center-of-mass and the radiologist mark yielded 23 pixels. Furthermore, the sensitivity of the lesion detection can be controlled by ignoring predictions below a certain lesion score. Setting a lower threshold usually increases sensitivity, but also false-positives. This trend was visualized for different thresholds using a FROC curve, similar to Kim et al. 22 . For the absolute true-positive and false-positive numbers for RetinaNet results, we set a threshold of 0.6, which yields a lower false-positive rate than radiologists and therefore makes the absolute number of true-positives comparable. For evaluation of the nodule location detection task, the returned boxes were analyzed with respect to ground-truth annotations using the described metric.
Additionally it is required to retrieve a case-level score, for the screening task. This case-level score indicates whether there is one or more nodule in the radiograph. As we retrieved individual nodule scores from the RetinaNet predictions, we chose the maximum of all nodule scores within the radiograph as a case-level score.
Reader study setup. For the reader study, two radiologists (CML and JA) interpreted 75 chest PA (posterior anterior) radiographs. The radiologists had 4 and 6 years experience. In order to simulate a clinical setting, each radiologist was given a time constraint of 10 seconds per radiograph. The assignment was to mark all nodules with a mouse click using our in-house built web-based platform. At least one nodule occured in 36 radiographs. Total nodule count was 65 and the average nodule count in radiographs with nodules was 1.8 ± 1.6. Statistical analysis. The bootstrap approach 38 was used to calculate CIs of the ROC retrieved in the screening task for the RetinaNet architecture. We conducted the following experiment with 1,000 replications: In each experiment, we selected 75 random samples from the test set and calculated the ROC AUC values from these samples. In order to retrieve the 95% confidence interval, we sorted the resulting AUCs from all experiments incrementally and took the AUC value at 2.5% and 97.5% as minimum and maximum of the CI, respectively.

Data availability
Models for inference can be retrieved from the authors on reasonable request. The data for training is not available due to patient privacy. However, all methods are described in sufficient detail in order to be replicated with own data.