Introduction

Lung cancer is one of the most frequent cancers and the leading cause of cancer death worldwide1. An estimated 2.2 million new lung cancer cases were diagnosed globally in 2020, and lung cancer caused 1.8 million deaths, almost one fifth of all cancer deaths. Lung cancer is often diagnosed at an advanced stage, because symptoms usually only appear once the disease has progressed2.

The survival rate substantially improves if lung cancer is diagnosed at an early stage2. For this reason, lung cancer screening programs aim to detect lung cancer with low-dose computed tomography (CT) as early as possible, when the cancer presents as a small pulmonary nodule. Although trials have provided evidence that CT screening can substantially reduce lung cancer-related mortality in a high-risk population3,4, the implementation of screening programs into clinical practice has been slow5,6. Hence, early-stage lung cancer is generally diagnosed after the detection of incidental nodules in non-screening chest CT scans that were ordered for other medical reasons7. Similarly, pulmonary metastases from extra-thoracic malignancies can also be detected as incidental nodules and should be diagnosed and treated as early as possible, given their large potential for further tumor spread8.

However, the detection of pulmonary nodules in CT scans is a challenging task in a routine clinical setting. First, nodules can be as small as three millimeters and are therefore difficult for radiologists to detect9. This is especially true when the diagnosis of lung and airway diseases is not the focus of the examination and the chosen imaging parameters are suboptimal for this task. Second, radiologists may focus only on the main clinical question and discontinue their search for additional findings such as nodules10,11. Finally, the workload of radiologists has increased dramatically in the past 15 years, mainly driven by the growing number of CT studies12. This underlines the importance of efficient nodule detection and management.

Artificial intelligence (AI) is a potential solution to support radiologists in this task. Many AI studies have reported high performance of deep learning-based computer-aided detection (DL-CAD) systems for nodule detection: sensitivities range from 86% to 98% with an average of 1–2 false positives per scan (FP/s) on public CT datasets13, such as the LIDC-IDRI dataset14. When using a DL-CAD system as a concurrent reader, radiologists can obtain a higher detection sensitivity, improve the uniformity of their management recommendations, and reduce their reading time15,16,17,18.

Nonetheless, validation studies of DL-CAD systems on modern clinical datasets remain scarce6,19. Even when such a dataset is available, AI studies are often limited to a reference standard set by one or two radiologists, even though substantial interobserver variability exists for the task of nodule identification20,21. More importantly, most reference standards lack histopathological proof or follow-up imaging for at least 2 years to determine which individual nodules were malignant. Although multiple AI studies have already demonstrated the potential clinical value of DL-CAD systems for scan-level lung cancer detection22,23,24, these studies do not assess performance for detecting all clinically relevant nodules that require follow-up, regardless of their malignancy status.

Therefore, the aim of this study is to bridge the gap between lung cancer detection and nodule detection AI studies. In a retrospective multi-center setting, we developed and validated a deep learning-based algorithm for the detection of pulmonary nodules in routine clinical CT scans, using a reliable reference standard based on nodule identifications by five thoracic radiologists and nodule-level malignancy status. We demonstrate that a DL-CAD system can accurately detect benign pulmonary nodules, small lung cancers, and metastases in heterogeneous CT scans acquired in routine clinical care.

Methods

Study design

At both institutions, the local institutional review board approved the study and waived the need for informed consent because of the retrospective design and the use of anonymized data (Radboud University Medical Center: case 2016-3045, project 19010; Jeroen Bosch Hospital: case 2020.04.22.01). First, we developed and validated a pulmonary nodule detection system with the publicly available LUNA16 dataset25, a subset of the LIDC-IDRI archive14. The details of this procedure and the results are provided in Supplementary Note, Figure, and Table 1. Then, a large dataset of CT scans was collected from the picture archiving and communication systems (PACS) of a university medical center (hospital A; Radboud University Medical Center) and a large non-academic teaching hospital (hospital B; Jeroen Bosch Hospital) in the Netherlands. The CT scans from hospital A were annotated for the presence of pulmonary nodules by trained medical students, and subsequently the system was re-trained using both the CT scans from LUNA16 and hospital A (see Supplementary Note 2 for more details). Finally, the detection system was evaluated on two datasets: a hold-out set with CT scans from hospital A (internal test set) and a completely independent set from hospital B (external test set). Five thoracic radiologists independently located the pulmonary nodules in the scans, and two additional radiologists determined the malignancy status of each nodule and located any missed cancers using data from the national Netherlands Cancer Registry (NCR).

Datasets

For training the lung detection component of the AI system (see section Nodule detection pipeline), a dataset was prepared with 500 thorax and thorax-abdomen CT scans (500 patients) from hospital A from 2017. For training the nodule detection system, another dataset was prepared including all 888 thorax CT scans (887 patients) from the LUNA16 challenge25 and 602 thorax CT scans (602 patients) from hospital A from 2017. For testing the complete nodule detection system, two datasets were prepared: one dataset with 100 thorax and thorax-abdomen CT scans (100 patients) from hospital A from 2018-2020 (internal test set) and another dataset with 100 thorax and thorax-abdomen CT scans from hospital B from the same period (external test set). At both hospitals, the scans were evenly sampled from four categories to obtain balanced datasets: (1) patients with stage I lung cancer; (2) patients with pulmonary metastases; (3) patients with benign pulmonary nodules larger than 5 mm, for which imaging follow-up would be recommended26; and (4) patients with benign pulmonary nodules smaller than 5 mm or no nodules (considered normal). There was no patient overlap between the training and test datasets.

Flowcharts of the study selection for the test and training datasets are shown in Fig. 1 and Supplementary Fig. 2, respectively. During the study selection procedure, an experienced radiologist (E.T.S.) assessed the validity of the scans using the eligibility criteria defined in the next section. In case of doubt, another radiologist was consulted (M.R.). The sample sizes of the training data obtained at hospital A were based on the size of the LUNA16 dataset. For the test sets, we aimed to collect 100 scans per hospital, which was mainly determined by the maximum number of scans that could be annotated by the panel. One CT scan per patient was sampled to maximize the diversity of the datasets.

Fig. 1: Flowchart for creating the test dataset for the evaluation of the pulmonary nodule detection system.
figure 1

aCT scans with thick slices (>3 mm), missing slices, or very low volume (<50 slices) were excluded. bThe most recent CT scan prior to the cancer diagnosis was selected to ensure retrospective localization. cOnly primary lung cancers that were histologically examined were selected. dBased on a natural language processing analysis of the radiology reports51. eOne CT scan per patient was selected. Non-nodular lung cancers were cancers that did not appear as nodules (i.e., masses, thick-walled cysts).

An overview of the main characteristics of each dataset is included in Table 1 and additional imaging parameters are included in Supplementary Table 2. The annotation protocols for the training data are described in detail in Supplementary Note 3. Characteristics of all pulmonary nodules in the training and test datasets are described in Table 2 and additional information about the training labels is included in Supplementary Table 3. The characteristics of the subset of malignant nodules in the test datasets are provided in Supplementary Table 4.

Eligibility criteria

In accordance with the British Thoracic Society (BTS) nodule management guidelines, only adult patients (≥18 years old) were included26. For the selection of primary lung cancer cases, we included patients with stage I cancer, as these cancers present as nodules rather than masses (>30 mm)27. For the selection of pulmonary metastases cases, both patients with metastasized lung cancer and patients with metastasized extra-thoracic cancer were included.

Considering the routine clinical setting of our study, it is important to note that not all patients can be reliably screened for malignant pulmonary nodules. Patients with extensive fibrosis or consolidations (e.g., due to severe interstitial disease, hemorrhage, or pneumonia) were excluded, as their lungs contain high-attenuation areas that prevent correct localization and delineation of relevant nodules. Furthermore, patients were excluded if their CT scans were made with a slice thickness >3 mm, or were limited by severe breathing artifacts or incomplete coverage of the lungs. Finally, patients with more than 15 pulmonary metastases (according to the initial visual assessment) were excluded to reduce annotation effort and prevent data imbalance.

Reference standard

A panel of five thoracic radiologists (J.T.-d.J., S.S., M.M., M.v.L., M.K. with 2, 4, 6, 16, 21 years of experience, respectively) independently annotated and measured all intrapulmonary nodules in the test datasets with in-house software (version 19.9.2 of CIRRUS Lung Screening, DIAG, Radboudumc, Nijmegen, The Netherlands). Nodules were manually identified and then volumetrically measured using a semi-automatic nodule segmentation algorithm28. Radiologists were able to manually correct nodule segmentations during this process. Furthermore, they indicated the lobe location and type of the nodules (solid, part-solid, non-solid, perifissural, and calcified). The radiologists were instructed to annotate all intrapulmonary nodules, defined as any round or irregular density inside the lung parenchyma with an equivalent diameter ≥3 and ≤30 mm29. We matched the nodule annotations of the different radiologists and used a majority vote reference standard that only included the nodules that were detected by at least three radiologists. The remaining annotations (i.e., lesions found to be <3 mm or >30 mm or nodules annotated by the minority of radiologists) were considered as indeterminate findings and were moved to an exclusion list, which is consistent with the reference standard from the LUNA16 challenge (more details in section Analysis).
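
For illustration, a minimal sketch of such a majority-vote merge is given below, assuming annotations stored as dictionaries with reader identity, center coordinates, and equivalent diameter; it does not reproduce the exact matching software used by the panel.

    import numpy as np

    def merge_annotations(annotations, min_readers=3):
        """Greedy majority-vote merge of nodule annotations from multiple readers.

        'annotations' is a list of dicts with keys 'reader' (int), 'center'
        (xyz coordinates in mm as an np.ndarray), and 'diameter' (mm). Returns
        (reference_nodules, indeterminate_findings). Illustrative sketch only;
        the matching rules used by the study software may differ in detail.
        """
        clusters = []
        for ann in annotations:
            placed = False
            for cluster in clusters:
                # Link an annotation to a cluster if its center lies within the
                # radius of any annotation already in that cluster (or vice versa).
                if any(np.linalg.norm(ann["center"] - other["center"])
                       <= max(ann["diameter"], other["diameter"]) / 2
                       for other in cluster):
                    cluster.append(ann)
                    placed = True
                    break
            if not placed:
                clusters.append([ann])

        reference, indeterminate = [], []
        for cluster in clusters:
            n_readers = len({a["reader"] for a in cluster})
            mean_diameter = float(np.mean([a["diameter"] for a in cluster]))
            nodule = {
                "center": np.mean([a["center"] for a in cluster], axis=0),
                "diameter": mean_diameter,
                "n_readers": n_readers,
            }
            # Majority vote: keep nodules identified by at least three of the five
            # radiologists and within the 3-30 mm size range; everything else goes
            # to the exclusion list of indeterminate findings.
            if n_readers >= min_readers and 3 <= mean_diameter <= 30:
                reference.append(nodule)
            else:
                indeterminate.append(nodule)
        return reference, indeterminate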

Two radiologists (E.T.S. and M.R., not part of the panel) linked the annotated nodules to the cancer diagnoses and checked for any missed cancers. They were provided with all available CT scans (period 2000–2020) of a patient; the corresponding radiology reports; and the lobe location, nodule diameter, and histological type of the primary and metastasized cancers as recorded in the NCR. All cancer diagnoses were confirmed by histological examination, cytology testing, or clinical diagnostic testing (e.g., medical imaging, exploratory surgery). The basis of all cancer diagnoses and the cancer morphology are provided in Supplementary Tables 5 and 6. A lesion was considered benign if it was stable and not followed by a cancer diagnosis within two years, although this does not completely rule out the possibility of an indolent malignancy in a stable subsolid nodule30.

Nodule detection pipeline

An overview of the nodule detection pipeline is displayed in Fig. 2. The pulmonary nodule detection system consists of three components that each use deep learning architectures for the following tasks: (1) lung detection, (2) nodule candidate detection, and (3) false positive reduction. A detailed description of the design of these components and the training procedure is provided in Supplementary Note 2.

Fig. 2: Components of the pulmonary nodule detection system.
figure 2

First, the system takes a CT scan and detects the lungs slice-by-slice to obtain a region of interest. Second, nodule candidates are generated by analyzing overlapping CT volumes of five slices each. Finally, nine different 2D views are sampled from each nodule candidate and analyzed by a multi-view ResNet-50 network.

The lung and nodule candidate detection models are one-stage 2D object detectors with the YOLOv5 architecture (version 5.0, 2021)31. The slice-by-slice 2D lung bounding boxes are combined into a 3D volume of interest. This preprocessing step enables fast and accurate localization of the lungs and thereby reduces the computational load for the subsequent components, especially for the analysis of larger CT scans that cover both the thorax and abdomen. The nodule candidate detection component is designed to detect potential nodule locations with the highest possible sensitivity. As in previous work32,33,34, this component uses consecutive axial CT slices as input channels, thereby adding spatial context to the 2D input images. This helps to discriminate nodules from pulmonary vessels and other linear structures.
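
As an illustration of this preprocessing step, the sketch below merges per-slice 2D lung boxes into a single padded 3D volume of interest; the box format, voxel spacing convention, and safety margin are assumptions for illustration rather than the implementation used in the system.

    import numpy as np

    def boxes_to_voi(slice_boxes, margin_mm=10.0, spacing_mm=(1.0, 1.0, 1.0)):
        """Combine per-slice 2D lung boxes into one 3D volume of interest.

        'slice_boxes' maps a slice index to a list of (x_min, y_min, x_max, y_max)
        boxes in voxel coordinates, as produced by a 2D detector such as YOLOv5.
        'spacing_mm' is the voxel spacing in (z, y, x) order. Returns the padded
        volume of interest as (z_min, y_min, x_min, z_max, y_max, x_max) in voxels.
        Coordinate conventions and the margin are illustrative assumptions.
        """
        slices_with_lungs = [z for z, boxes in slice_boxes.items() if boxes]
        if not slices_with_lungs:
            raise ValueError("No lungs detected in any slice.")

        all_boxes = np.array([b for boxes in slice_boxes.values() for b in boxes])
        z_min, z_max = min(slices_with_lungs), max(slices_with_lungs)
        x_min, y_min = all_boxes[:, 0].min(), all_boxes[:, 1].min()
        x_max, y_max = all_boxes[:, 2].max(), all_boxes[:, 3].max()

        # Pad the volume of interest by a small safety margin (converted from mm
        # to voxels) so that juxtapleural structures are not clipped.
        dz, dy, dx = (margin_mm / s for s in spacing_mm)
        return (int(z_min - dz), int(y_min - dy), int(x_min - dx),
                int(z_max + dz), int(y_max + dy), int(x_max + dx))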

The false positive reduction component reduces the number of false positives while retaining a high sensitivity. This component is adapted from the work of Venkadesh et al.35: a multi-view ResNet-50 classification model that analyzes nine different 2D views sampled from a 3D patch around each nodule candidate. The nodule detection pipeline returns the center coordinates of the detected nodules and their nodule likelihood scores.
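
One common way to sample nine views from a cubic patch is to take the three axis-aligned and six diagonal symmetry planes of the cube through the candidate center. The sketch below follows that convention as an assumption; the exact views and preprocessing of the published model are not reproduced here.

    import numpy as np
    from scipy.ndimage import map_coordinates

    # Plane normals of the nine symmetry planes of a cube: three axis-aligned
    # planes and six diagonal planes through the candidate center.
    NINE_PLANE_NORMALS = np.array([
        (1, 0, 0), (0, 1, 0), (0, 0, 1),
        (1, 1, 0), (1, -1, 0), (1, 0, 1),
        (1, 0, -1), (0, 1, 1), (0, 1, -1),
    ], dtype=float)

    def extract_nine_views(patch, view_size=64):
        """Sample nine 2D cross-sections through the center of a cubic 3D patch.

        'patch' is a (D, D, D) array centered on a nodule candidate. Returns an
        array of shape (9, view_size, view_size) that could be fed to a multi-view
        classifier. Illustrative sketch; the published model's views may differ.
        """
        center = (np.array(patch.shape) - 1) / 2.0
        offsets = np.arange(view_size) - (view_size - 1) / 2.0
        uu, vv = np.meshgrid(offsets, offsets, indexing="ij")

        views = []
        for normal in NINE_PLANE_NORMALS:
            n = normal / np.linalg.norm(normal)
            # Build an orthonormal in-plane basis (u, v) perpendicular to n.
            helper = np.array([0.0, 0.0, 1.0])
            if abs(np.dot(helper, n)) > 0.9:
                helper = np.array([0.0, 1.0, 0.0])
            u = np.cross(n, helper)
            u /= np.linalg.norm(u)
            v = np.cross(n, u)
            coords = (center[:, None, None]
                      + u[:, None, None] * uu[None]
                      + v[:, None, None] * vv[None])
            views.append(map_coordinates(patch, coords, order=1, mode="nearest"))
        return np.stack(views)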

Analysis

The nodule detection system was evaluated on the internal and external test sets by measuring the sensitivity and corresponding false positive rate per scan at different operating points on the free-response receiver operating characteristic (FROC) curve. We assessed the sensitivity at seven predefined false positive rates, namely 0.125, 0.25, 0.5, 1, 2, 4, and 8 false positives per scan. We also assessed the average sensitivity over these false positive rates, referred to as the Competition Performance Metric (CPM) in the LUNA16 challenge25. For each threshold, the 95% confidence interval was calculated using bootstrapping (1000 bootstraps with scan-level sampling with replacement). These evaluation metrics were calculated with the Python scripts provided for the LUNA16 challenge25. We evaluated the nodule detection system for multiple subsets of nodules with minimum diameter thresholds of 3, 4, and 5 mm. Additional analyses were conducted for primary lung cancers, pulmonary metastases, and actionable benign nodules (≥5 mm, neither calcified nor perifissural), which would require follow-up according to the BTS guidelines26.
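
For reference, the sketch below reimplements the FROC curve and the CPM from lists of detection scores. It is a simplified stand-in for the official LUNA16 evaluation scripts, and the confidence intervals described above would additionally require repeating the computation on scan-level bootstrap samples.

    import numpy as np

    FP_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)

    def froc_curve(hit_scores, fp_scores, n_nodules, n_scans):
        """Return (false positives per scan, sensitivity) as the threshold sweeps.

        'hit_scores' holds the highest detection score matched to each reference
        nodule that was found; 'fp_scores' holds the scores of all false positives.
        """
        scores = np.concatenate([hit_scores, fp_scores])
        is_hit = np.concatenate([np.ones(len(hit_scores)), np.zeros(len(fp_scores))])
        order = np.argsort(-scores)                      # descending by score
        tp_cumulative = np.cumsum(is_hit[order])
        fp_cumulative = np.cumsum(1.0 - is_hit[order])
        return fp_cumulative / n_scans, tp_cumulative / n_nodules

    def competition_performance_metric(hit_scores, fp_scores, n_nodules, n_scans):
        """CPM: mean sensitivity at the seven predefined FP/s operating points."""
        fps_per_scan, sensitivity = froc_curve(hit_scores, fp_scores,
                                               n_nodules, n_scans)
        # Interpolate the FROC curve at the predefined operating points; confidence
        # intervals are obtained by repeating this on scan-level bootstrap samples.
        return float(np.mean(np.interp(FP_RATES, fps_per_scan, sensitivity)))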

Our hit criterion was that the center of a predicted nodule should lie within the radius of the ground truth nodule; otherwise, a detection was regarded as a false positive. If a nodule prediction matched a finding regarded as indeterminate (see section “Reference standard”), it was ignored and counted as neither a true nor a false positive. To characterize failure modes of the AI system, we asked an experienced thoracic radiologist (E.T.S.) to perform a visual inspection of all false negative detections and 25 randomly sampled false positive detections of the nodule detection system for each test set. For this analysis, we selected a clinically acceptable operating point for the model that corresponded to an average of 1 FP/s.
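
A simplified sketch of this matching step is shown below: detections whose centers fall within a reference nodule's radius become hits, detections that only match indeterminate findings are ignored, and the remainder are counted as false positives. The resulting score lists can be fed into the FROC/CPM sketch above; the data structures are illustrative assumptions.

    import numpy as np

    def classify_detections(predictions, reference, indeterminate):
        """Apply the hit criterion to a list of detections for one scan.

        A detection is a true positive if its center lies within the radius of a
        reference nodule, is ignored if it only matches an indeterminate finding,
        and is a false positive otherwise. Entries are dicts with 'center' (xyz in
        mm) and 'diameter' (mm); predictions also carry a 'score'. Returns the
        best hit score per detected reference nodule and all false positive scores.
        """
        hit_scores, fp_scores = {}, []
        for pred in predictions:
            matched = False
            for index, nodule in enumerate(reference):
                if (np.linalg.norm(pred["center"] - nodule["center"])
                        <= nodule["diameter"] / 2):
                    # Keep only the highest score per reference nodule.
                    hit_scores[index] = max(hit_scores.get(index, -np.inf),
                                            pred["score"])
                    matched = True
                    break
            if matched:
                continue
            ignored = any(np.linalg.norm(pred["center"] - nodule["center"])
                          <= nodule["diameter"] / 2 for nodule in indeterminate)
            if not ignored:
                fp_scores.append(pred["score"])
        return list(hit_scores.values()), fp_scores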

To assess the potential clinical value of the proposed AI system, its sensitivity for detecting primary cancers, pulmonary metastases, and actionable benign nodules was compared with that of the individual radiologists from the panel. The sensitivity and false positive rate of each radiologist were estimated by comparing his or her annotations with a new reference set defined by the other four radiologists (with benign and indeterminate nodules defined accordingly). Since the reference standard changed slightly for each radiologist, the sensitivity of the AI system at 1 FP/s was recalculated for each comparison. The 95% confidence intervals for the AI system and the radiologists were calculated using bootstrapping (1000 bootstraps with scan-level sampling with replacement). Significance testing was performed with a two-sided paired permutation test (1000 iterations) at the nodule level with the MLxtend library (version 0.22.0, 2023)36 for Python. A p value smaller than 0.05 was considered significant.
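
The study used the permutation test from the MLxtend library; as an illustration of the underlying idea, a manual two-sided paired permutation test on nodule-level detection outcomes could look as follows.

    import numpy as np

    def paired_permutation_test(detected_ai, detected_reader,
                                n_rounds=1000, seed=0):
        """Two-sided paired permutation test on nodule-level detection outcomes.

        'detected_ai' and 'detected_reader' are boolean arrays with one entry per
        reference nodule (True if detected). The test statistic is the absolute
        difference in sensitivity between the two readers.
        """
        rng = np.random.default_rng(seed)
        a = np.asarray(detected_ai, dtype=float)
        b = np.asarray(detected_reader, dtype=float)
        observed = abs(a.mean() - b.mean())

        count = 0
        for _ in range(n_rounds):
            # Randomly swap the AI and reader outcomes within each nodule pair.
            swap = rng.random(len(a)) < 0.5
            a_perm = np.where(swap, b, a)
            b_perm = np.where(swap, a, b)
            if abs(a_perm.mean() - b_perm.mean()) >= observed:
                count += 1
        return (count + 1) / (n_rounds + 1)   # approximate two-sided p value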

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

Test datasets characteristics

In the period 2018-2020, 14,943 adult patients underwent 34,689 thorax and thorax–abdomen CT scans at hospital A (see Fig. 1). In hospital B, 12,739 patients underwent 22,621 thorax and thorax–abdomen CT scans in the same period. Of these scans, 356 (nA = 323, nB = 33) were excluded due to inadequate technical specifications (e.g., thick slices, missing slices, or very low volume). Initial samples of 2852 (hospital A) and 2935 (hospital B) CT scans were drawn from the four nodule categories for the respective internal and external test sets. For each category, scans were randomly sampled until 25 valid scans were obtained per hospital. During this process, 267 scans (nA = 128, nB = 139) were assessed. In total, 67 scans (nA = 28, nB = 39) were excluded because they had too many pulmonary metastases (nA = 13, nB = 13), extensive consolidations or fibrosis (nA = 7, nB = 13), only non-nodular lung cancers (e.g., masses) (nA = 8, nB = 6), registered cancers that could not be retrospectively localized (nB = 3), insufficient lung coverage (nB = 2), or severe artifacts (nB = 2). This resulted in a final selection of 100 studies from 100 patients (63 ± 15 years, 52 women) for the internal test set (hospital A) and 100 studies from 100 patients (67 ± 12 years, 53 women) for the external test set (hospital B) (see Table 1).

Table 1 Characteristics of the training and test datasets.

For these test sets, 622 (nA = 319, nB = 303) of the 1,617 annotations in total (nA = 852, nB = 765) were included in the analysis (see Table 2). The remaining findings were considered indeterminate, as they did not meet the size criteria (nA = 218, nB = 157) or were non-cancerous and not labeled by the majority of radiologists (nA = 315, nB = 305).

Table 2 Characteristics of the pulmonary nodules in the training and test datasets.

Pulmonary nodule detection analysis

Table 3 presents the sensitivity of the AI system at 0.125, 0.25, 0.5, 1, 2, 4, and 8 false positives per scan (FP/s) on the internal and external test sets. The FROC curves for the detection of actionable benign nodules, primary lung cancers, and pulmonary metastases per test set are shown in Fig. 3. Supplementary Note 4 describes additional analyses of the detection performance of the individual components of the AI system. More specifically, data characteristics and results of the lung detection component are presented in Supplementary Tables 7 and 8. The performance of the nodule candidate detection and false positive reduction components is provided in Supplementary Table 9. Nodule detection results on subgroups of contrast-enhanced and non-contrast CT scans are provided in Supplementary Table 10. The average processing time per scan was 30 ± 18 s.

Table 3 Pulmonary nodule detection results on the internal and external test set.
Fig. 3: Free response receiver operating characteristic (FROC) curves of the AI system per test set.
figure 3

The internal test set (hospital A) contained 27 primary lung cancers, 165 pulmonary metastases, and 63 actionable benign nodules. The external test set (hospital B) contained 32 primary lung cancers, 113 pulmonary metastases, and 87 actionable benign nodules. The shaded bands represent the 95% confidence intervals per nodule category.

For the internal test set, the sensitivity for detecting all nodules at an average of 1 FP/s (detection threshold = 0.647) was 90.9% (290/319, 95% CI: 88.0–93.6%) and the CPM was 85.4%. For detecting actionable benign nodules, the sensitivity at 1 FP/s was 92.1% (58/63, 95% CI: 84.3–98.4%) and the CPM was 90.9%. For detecting primary lung cancers, the sensitivity at 1 FP/s was 92.6% (25/27, 95% CI: 82.1–100%) and the CPM was 91.5%. For detecting pulmonary metastases, the sensitivity at 1 FP/s was 90.3% (149/165, 95% CI: 85.6–94.5%) and the CPM was 83.9%.

For the external test set, the sensitivity for detecting all nodules at an average of 1 FP/s (detection threshold = 0.573) was 92.4% (280/303, 95% CI: 89.8–95.1%) and the CPM was 87.6%. For detecting actionable benign nodules, the sensitivity at 1 FP/s was 94.3% (82/87, 95% CI: 88.1–98.8%) and the CPM was 90.5%. For detecting primary lung cancers, the sensitivity at 1 FP/s was 96.9% (31/32, 95% CI: 91.7–100%) and the CPM was 94.2%. For detecting pulmonary metastases, the sensitivity at 1 FP/s was 92.0% (104/113, 95% CI: 88.5–95.5%) and the CPM was 89.0%. The sensitivities of the nodule detection system for nodules with minimum diameter thresholds of 4 and 5 mm are included in Table 3.

Visual assessment of false positives and false negatives

All false negatives (nA = 29, nB = 23) and a random selection of false positives (nA = 25, nB = 25) from the AI system at an operating point of 1 FP/s were visually assessed. The false negatives could be divided into seven categories and the false positives into fourteen categories, as outlined in Fig. 4. For the false negative categories, it is also shown how many nodules missed by the AI model were still detected by the radiologists from the panel, for reference purposes. The three most frequent false negative categories were juxtapleural nodules (nA = 17, nB = 11; defined as solid nodules located on or within 10 mm of the visceral pleura37), juxtavascular nodules (nA = 5, nB = 2; defined as solid nodules that are attached to a vessel), and non-solid nodules (nA = 3, nB = 3). Regarding the false positives, the three most frequent categories were fibrosis (nA = 2, nB = 4), duplicate nodule detections (nA = 2, nB = 3), and consolidations (nA = 2, nB = 1). Potentially missed solid nodules (nA = 7, nB = 7), perifissural nodules (nA = 2, nB = 3), and micronodules (nA = 1, nB = 4; defined as nodules smaller than 3 mm29) were counted as false positives in accordance with our reference standard.

Fig. 4: Frequency distribution of the categories of the false negative and false positive detections of the AI system.
figure 4

In the internal test set (hospital A), 29 false negative and 25 false positive detections were inspected. In the external test set (hospital B), 23 false negative and 25 false positive detections were inspected. The lowest, median, and highest sensitivity of the radiologists have been indicated for the false negative detections of the AI system.

Comparison of AI performance with the panel of radiologists

Figure 5 shows examples of malignant nodules that were missed by either the AI system or the majority of radiologists in the panel. The majority of radiologists detected all primary lung cancers in the internal and external test sets, but the AI system missed 2 out of 27 cancers in the internal test set and 1 out of 32 cancers in the external test set. In the internal test set, 55 out of 165 metastases (33%) were missed by the majority of radiologists (15/25 patients). The AI system detected 41 (75%) of these missed metastases (14/15 patients). For the external test set, 27 out of 113 metastases (24%) were missed by the majority of radiologists (11/25 patients). The AI system detected 21 (78%) of these missed metastases (10/11 patients).

Fig. 5: Examples of primary cancers and pulmonary metastases missed by the AI system or majority of radiologists from the panel.
figure 5

The bounding boxes are 60 mm. The most frequent false negatives of the AI model were non-solid nodules (a), juxtapleural nodules (b, c), and part-solid nodules (d). In contrast to the AI model, the most frequent false negatives of the radiologists were nodules in the base of the lungs (e), nodules in the right azygoesophageal recess (f), nodules with the same diameter as neighboring vessels (g), and juxtapleural nodules (h).

In Table 4, the sensitivity of the AI system for detecting actionable benign nodules, primary lung cancers, and pulmonary metastases at 1 FP/s is compared with the sensitivity of each individual radiologist from the panel. The FROC curve of the AI system (average of all comparisons) and the operating points of the radiologists are visualized in Fig. 6. The AI system had a significantly higher sensitivity than 2 out of 5 radiologists for detecting actionable benign nodules in the internal test set (radiologist 2, 95% vs. 79%, p = 0.02; radiologist 4, 95% vs. 79%, p = 0.03) and external test set (radiologist 2, 95% vs. 70%, p < 0.001; radiologist 4, 95% vs. 66%, p < 0.001), although with a higher false positive rate than the radiologists (average difference of 0.6 FP/s). For detecting pulmonary metastases, the AI system had a significantly higher sensitivity than 4 out of 5 radiologists in the internal test set (radiologist 1, 86% vs. 77%, p = 0.01; radiologist 2, 90% vs. 32%, p < 0.001; radiologist 3, 88% vs. 70%, p < 0.001; radiologist 4, 88% vs. 45%, p < 0.001) and external test set (radiologist 1, 90% vs. 81%, p = 0.048; radiologist 2, 92% vs. 66%, p < 0.001; radiologist 3, 92% vs. 75%, p < 0.001; radiologist 4, 92% vs. 61%, p < 0.001). For detecting primary lung cancers, and in all remaining comparisons, there was no significant difference in sensitivity between the AI system and the radiologists.

Table 4 Comparison between the nodule detection performance of the AI model and individual readers on the internal and external test set.
Fig. 6: Operating points of the individual radiologists of the panel compared to the free response receiver operating characteristic (FROC) curves of the AI system per test set.
figure 6

The FROC curves are averaged over all comparisons per nodule category.

Discussion

In recent years, DL-CAD systems have shown a high performance for detecting pulmonary nodules in publicly available CT datasets. However, many nodule detection systems have been neither externally validated in a clinical setting nor specifically validated for the detection of cancer. For these reasons, we developed and evaluated a DL-CAD system for the detection of pulmonary nodules in routine clinical CT scans with a known malignancy status. At a clinically acceptable threshold of 1 FP/s, the system obtained a sensitivity of 92% for detecting nodules with a minimum diameter of 3 mm on the external test set. For actionable benign nodules, primary lung cancers and pulmonary metastases, the sensitivity was 94%, 97%, and 92%, respectively. The detection performance for hospitals A and B was comparable, indicating a good generalization performance. A comparison between the nodule detection performance of the DL-CAD system and radiologists showed that the system could provide a higher sensitivity (average difference of 14, 2, and 17 percentage points for actionable benign nodules, primary lung cancers and pulmonary metastases, respectively) at the cost of a slightly higher false positive rate (average difference of 0.6 FP/s), and that it could locate most pulmonary metastases (78%) that were missed by the majority of the five radiologists.

The sensitivity of our system for detecting pulmonary nodules is comparable to the sensitivities reported in previous evaluation studies of DL-CAD systems in a non-screening setting. Studies have reported sensitivities in the range of 65–88% at 1 FP/s for pulmonary nodules of any diameter38,39,40,41. For nodules with a minimum diameter of 3, 4, and 5 mm, detection sensitivities of 74%33, 88%42, and 82–91%15,43 at 1 FP/s have been reported, respectively. In future research, DL-CAD systems should be benchmarked on a modern dataset with routine clinical CT scans to determine the optimal detection method for analyzing these highly heterogeneous scans.

To the best of our knowledge, there have been no studies in recent years that evaluated a DL-CAD system for the detection of both small lung cancers and pulmonary metastases in routine clinical CT scans. For case-level lung cancer detection, however, Zhang et al.24 have shown that their DL-CAD system can obtain a sensitivity of 96% and a specificity of 88% on a dataset of 50 preoperative CT scans, of which half contained pathologically confirmed malignant nodules. This sensitivity is in agreement with our findings, although the specificity cannot be directly compared, as our system does not distinguish between benign and malignant nodules.

The assessment of the false negatives and false positives of the DL-CAD system shows that juxtapleural nodules are the most challenging to detect, but also that the system can identify nodules that are missed even by a panel of experienced radiologists. Juxtapleural nodules might be hard for the DL-CAD system to detect due to their highly variable shape and their density being similar to that of the pleural wall. Besides juxtapleural nodules, the DL-CAD system may miss non-solid nodules, most likely due to their low contrast and their small proportion in the training data compared with solid nodules (Table 2). The missed primary lung cancers were either non-solid (n = 2) or juxtapleural (n = 1). Regarding the false positives, most could be considered nodules that were possibly overlooked by the readers or misinterpreted as non-nodular findings (13/25 for the internal test set and 14/25 for the external test set). They were often small (<4 mm), had typically benign features (i.e., a triangular shape), or were attached to the pleura or vasculature. Previous studies have shown that CAD systems can indeed detect nodules that are missed by multiple readers39,44. Other false positive detections were duplicate detections (e.g., nodular components of a larger lesion), non-nodular lesions (e.g., bandlike opacities), fibrosis, and consolidations. Similar causes of false positive findings have been reported by Martins Jarnalo et al.42.

The comparison between the DL-CAD system and the radiologists showed that the system had a significantly higher sensitivity for pulmonary metastases than most radiologists (4 out of 5) and a significantly higher sensitivity for actionable benign nodules than some radiologists (2 out of 5). No significant differences were found for the detection of primary lung cancer, although the operating points of the radiologists were located above the FROC curve of the DL-CAD system (see Fig. 6). None of the primary lung cancers were missed by the majority of radiologists, probably due to their relatively large size (median diameter of 18 mm in the external test set). The detection of pulmonary metastases was a more challenging task for the radiologists, as these lesions were much smaller (median diameter of 7 mm in the external test set) and appeared in greater numbers (median of four lesions per scan versus one). The difficulty of this task has been demonstrated before in a study by Chen et al., which showed that one or more pulmonary metastases were missed in 37% of all cases45. In our datasets, missed pulmonary metastases were usually located in the base of the lower lobes, in the right azygoesophageal recess, or near blood vessels with a similar diameter. These locations are known to be blind spots for radiologists45,46, and our results suggest that they could be overcome with the help of a DL-CAD system.

Although this study did not demonstrate that a DL-CAD system improves radiologists' performance for lung cancer detection, it is likely that the system could aid in the detection of small lung cancers, given its high sensitivity for pulmonary metastases and actionable benign nodules. Furthermore, it is important to emphasize that the panel of radiologists was instructed to detect any pulmonary nodule, while this task is not necessarily the focus of a CT examination in daily clinical practice. As a result, certain cognitive biases, such as satisfaction of search11, were less likely to occur in our setting, and the performance of the radiologists might be overestimated.

The strengths of our study are the use of routine clinical CT scans from multiple hospitals and a reference standard set by a panel of thoracic radiologists with nodule-level malignancy labels. However, this study also has a few limitations. First, we did not conduct a second reading round in which the radiologists could review each other’s marks, as was done in the annotation process of the LIDC-IDRI database14. Such an annotation process could have added more hard-to-detect nodules to the test datasets. Second, we selected an operating point for the DL-CAD system that matched 1 FP/s to compare its sensitivity with those of the radiologists. This operating point is not necessarily the optimal trade-off between sensitivity and false positive rate in terms of costs, benefits, and the radiologists’ preferences. The selection of optimal operating points should be investigated further. Finally, the CT scans from hospital B were reconstructed with a relatively high slice thickness of 3 mm. This study could have benefited from another external validation set with thinner-slice CT scans.

In conclusion, this study demonstrates that a DL-CAD system obtained a high sensitivity with an acceptable false positive rate for the detection of actionable benign nodules, primary lung cancers, and pulmonary metastases in CT scans from a retrospective cohort of a routine clinical population. The system thereby shows potential for aiding radiologists in detecting small lung cancers and pulmonary metastases, supporting a timely diagnosis and the monitoring of disease progression. Future research should focus on the evaluation and implementation of this system in a prospective clinical setting.