Introduction

The World Health Organization (WHO) has identified tuberculosis (TB) as the most common life-threatening infectious disease and a leading cause of death worldwide1,2. Early detection of pulmonary tuberculosis (PTB) is essential for mitigating the spread, morbidity, and mortality of the disease, as well as the burden of care for patients, families, and the overall public health system1,3,4,5. In suspected cases of active PTB, isolation and adherence to airborne precaution guidelines are recommended prior to confirmation, because sputum culture, the gold standard for PTB diagnosis, requires several weeks to yield results4,6,7. Clinicians may therefore choose to begin treatment and isolation as soon as the disease is suspected, without waiting for confirmation. Effective strategies are thus needed to facilitate the prompt diagnosis of active TB in medical institutions in areas with a high burden of the disease7,8, as failure to rapidly and accurately identify PTB can result in nosocomial infections or wastage of isolation resources.

In clinical settings, if PTB is suspected based on the patient's clinical manifestations and chest radiography, a sputum test such as smear microscopy or polymerase chain reaction (PCR) is performed. This is followed by a sputum culture to confirm the diagnosis7,8. In addition, chest computed tomography (CT) supports differential diagnosis and guides clinical decisions during treatment for PTB9,10. More recently, deep learning-based automated detection algorithms (DLAD) have been introduced for PTB prediction6,7,11. However, each of these diagnostic tools has clear limitations when used as a single test to make a diagnosis before culture results are confirmed6,7,9. Given the differences in available diagnostic tools for PTB between regions and institutions and the uncertainty about the time required to obtain results3,4, clinicians have to rely on only the diagnostic tests performed in their clinical setting when making decisions such as administering TB drugs and using isolation resources. However, there is no consensus on strategies for effectively combining the results of different tests for the diagnosis of PTB to support clinical decision making. To the best of our knowledge, no previous studies have addressed this gap in knowledge. Therefore, we aimed to develop a multimodal model that predicts culture test results for PTB from whichever diagnostic tests are available in a given clinical setting. Additionally, we sought to determine whether combining our diagnostic model with DLAD, a recently developed TB detection tool, would improve diagnostic performance.

Methods

Study design and setting

This retrospective observational study was conducted using prospectively collected data from the emergency department (ED) registry. We followed the STROBE guidelines and adhered to the tenets of the Declaration of Helsinki. This study was approved by the institutional review boards of Severance Hospital (approval number 4-2022-0481). Owing to the retrospective nature of the study, the need for informed consent was waived by the institutional review boards of Severance Hospital.

In South Korea, approximately 20,000 new cases of TB are diagnosed each year (equivalent to 35.7 cases per 100,000 population in 2021), of which approximately 2.5% are hospitalized. South Korea has a low TB prevalence, resulting in a low pretest probability. The present study was performed at a tertiary hospital with a Level 1 ED located in northwestern Seoul (the capital city of South Korea). Approximately 100,000 patients visit this ED per year.

This ED follows a standardized diagnostic protocol for patients with suspected PTB. During the initial assessment, sputum is obtained from patients with suspected PTB, based on chest radiographs and clinical presentation, for three pairs of smear microscopy, PCR (GeneXpert MTB/RIF), and Mycobacterium tuberculosis (MTB) cultures on solid and liquid media. The results of smear microscopy can be obtained within 4 h; however, owing to a predetermined test reception time, the time required to obtain results in practice is 24 h. PCR requires approximately 6 h to confirm the results, whereas sputum culture takes more than 6 weeks.

Additionally, a chest CT scan is performed if the physician is unsure of the presence of active disease based on the chest radiograph and clinical presentation or if a cause other than PTB needs to be differentiated. All chest radiographs and CT scans obtained in the ED are interpreted within 12 h by board-certified radiologists with at least three years of experience.

Study population and data collection

Our study was conducted on patients over 18 years of age who consecutively visited the ED between January 2018 and December 2021. We included all patients with suspected PTB based on chest radiographs and clinical presentation at the time of visit and who underwent sputum testing (smear microscopy, PCR, sputum culture) in accordance with a standardized diagnostic protocol for PTB.

The present study data were extracted through the Clinical Research Analysis Portal (SCRAP), which is operated by the data portal system at the study site. From this data platform, we obtained patient information on sex, age, vital signs, medical history, symptoms, and results of blood tests performed at the time of visit. We also collected chest radiograph and CT readings, as well as the results of sputum testing performed to diagnose PTB.

Deep learning algorithm for detecting tuberculosis screening score

All chest radiographs used in this study were analyzed using deep learning-based automated detection algorithms (DLAD) for chest radiographs, capable of detecting active cases of PTB; these algorithms are not yet commercially available. The tuberculosis screening score produced by this technology (Lunit INSIGHT CXR v3.1.5.0) was collected for the study. This new DLAD is an improvement over previously released DLADs, which predicted the presence or absence of TB by taking the maximum value of the prediction scores for nodules and consolidations. The new model is more sophisticated and less dependent on other lesions, such as nodules or consolidation. To develop this new DLAD, chest radiographs with a microbiological reference standard (culture and/or GeneXpert test) were used for training. In the training stage, the model was trained to predict active TB using an additional 140,285 (16,846 positive and 123,439 negative) data points with TB annotations. The new DLAD met the target product profile criteria for a triage test set forth by the WHO, with a threshold of 0.15 achieving 70% specificity and the corresponding sensitivity. In the screening setting, in which TB cases were compared with normal cases without any abnormal findings, the performance test of the new DLAD showed an area under the receiver operating characteristic curve (AUROC) of 0.984, a sensitivity of 93.78%, and a specificity of 95.56%. Furthermore, in the triage setting, where all cases with normal and abnormal findings were included, the results showed an AUROC of 0.928, a sensitivity of 93.78%, and a specificity of 70.85%. The probability score for the high-sensitivity cut-off used in this test was 0.15 (ref. 12).

Outcome measures

The primary endpoint of this study was the confirmation of PTB. A positive result was defined as the growth of MTB in sputum culture, which serves as the reference standard for active PTB4,6,7. Radiologic examination results were defined as positive if interpreted as suspicious for active TB by a radiologist and as negative if interpreted as non-tuberculous mycobacteria (NTM) or old TB lesions. The TB screening score quantified by the DLAD was measured as a continuous variable ranging from 0 to 100.

Model development

The entire dataset was randomly split into training and test sets in a 7:3 ratio. We developed models to diagnose PTB using the training dataset. First, we used univariable logistic regression to identify, among the variables of past history, clinical symptoms, and blood test results, the factors that were significantly associated with a positive culture result for PTB. Subsequently, a total of 10 diagnostic models were developed by combining the 8 factors identified through univariable analysis with 4 diagnostic tests for TB. The combinations of diagnostic tests were organized sequentially, with input variables added in order of the time required to confirm their results; this yielded five models, and five additional models were developed by replacing the interpretation of chest radiography with the DLAD in each of them. In addition to the 10 nested models accounting for clinical relevance, we further developed a diagnostic model with multivariable logistic regression using the Akaike information criterion (AIC) stepwise selection method. All developed models were validated using the test dataset.
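The split and the pairing of radiograph-based and DLAD-based models described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function names, the random seed, the variable names, and in particular the five test combinations in `EXAMPLE_COMBOS` are assumptions made for illustration, because the text states only that tests were added in order of the time needed to confirm their results.

```python
import random

# The eight clinical factors retained after univariable analysis (named in the Results).
CLINICAL_FACTORS = [
    "respiratory_rate", "oxygen_saturation", "dyspnea", "anorexia",
    "general_weakness", "weight_loss", "albumin", "sodium",
]

# HYPOTHETICAL test combinations, for illustration only.
# The exact five combinations used in the study are not listed in the text.
EXAMPLE_COMBOS = [
    ["radiograph"],
    ["radiograph", "smear"],
    ["radiograph", "pcr"],
    ["radiograph", "smear", "pcr"],
    ["radiograph", "smear", "pcr", "ct"],
]


def split_7_3(records, seed=0):
    """Randomly split records into a 70% training list and a 30% test list."""
    rng = random.Random(seed)          # the seed is an arbitrary assumption
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def swap_radiograph_for_dlad(tests):
    """Replace the radiologist's reading with the DLAD screening score."""
    return ["dlad_score" if t == "radiograph" else t for t in tests]


def build_model_inputs(combos):
    """Return input-variable lists: radiograph-based models first, then DLAD-based."""
    radiograph_models = [CLINICAL_FACTORS + c for c in combos]
    dlad_models = [CLINICAL_FACTORS + swap_radiograph_for_dlad(c) for c in combos]
    return radiograph_models + dlad_models


train, test = split_7_3(range(8374))   # 8,374 patients were analyzed
models = build_model_inputs(EXAMPLE_COMBOS)
```

Each list in `models` would then serve as the predictor set of one logistic regression fitted on `train` and evaluated on `test`; the fitting step itself is omitted here.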

Statistical analyses

Categorical variables were reported as counts and percentages, and continuous variables were expressed as the mean and standard deviation. For baseline comparisons, we used Student's t-test for continuous variables and Fisher's exact test or the chi-square test for categorical variables.

We evaluated the predictive performance, including sensitivity, specificity, accuracy, positive predictive value, negative predictive value, and AUROC, of each diagnostic test for PTB. Univariable analyses were performed using logistic regression with variables associated with PTB based on previous studies. We obtained odds ratios with 95% confidence intervals (CIs) and p-values. Each variable with a p-value below 0.1 in the univariable analysis was entered into the multivariable logistic regression models. Thereafter, we calculated the AIC and concordance index of the developed multivariable models. To facilitate the clinical application of these models, we developed a nomogram for the prediction of a positive PTB test, and specificity was calculated with the sensitivity of each model fixed at 90% or higher. In the nested models, the AUROCs obtained with chest radiograph readings and with DLAD in their place were compared using the nonparametric bootstrap method. The mean and confidence interval of the AUROC difference from 1000 bootstrap samples were presented, and the difference was considered significant if the confidence interval did not include zero. P values less than 0.05 were considered statistically significant. All analyses were performed using R (version 4.0.3).
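The bootstrap comparison of AUROCs can be illustrated with a short sketch. It is written in Python with the standard library only and is not the R code used for the analyses; the function names and the seed are hypothetical, and the use of a percentile interval is an assumption, since the text does not state how the confidence interval was derived from the bootstrap samples.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_diff(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Mean and 95% percentile interval of AUROC(b) - AUROC(a).

    Patients are resampled with replacement, and both models are scored on
    the same resample so that the comparison is paired.
    """
    rng = random.Random(seed)          # the seed is an arbitrary assumption
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:            # resample lacks one class: draw again
            continue
        a = auroc(y, [scores_a[i] for i in idx])
        b = auroc(y, [scores_b[i] for i in idx])
        diffs.append(b - a)
    diffs.sort()
    mean = sum(diffs) / n_boot
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return mean, lower, upper
```

Here `scores_a` would hold the predicted probabilities of a model using the radiologist's reading and `scores_b` those of the same model with the DLAD score in its place; an interval that excludes zero corresponds to the significance criterion stated above.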

Results

During the study period, a total of 378,775 patients visited the ED, of whom 253,827 were aged 18 years or older. Of these, 8,374 patients who underwent sputum testing in accordance with the standardized diagnostic protocol for PTB were included in the statistical analyses. In the training set and test set, the number of patients with sputum culture-confirmed PTB was 119 and 51, respectively, accounting for 2% of all patients (Fig. 1). The baseline characteristics and missing rates of the two datasets are listed in Table 1. In the study population, chest CT and PCR results were unavailable for 980 and 6222 patients, respectively, with missing rates of 23.6% and 74.3%. The body mass index was unknown for 4485 (53.6%) patients.

Figure 1

Flowchart of patient enrollment. ED, Emergency Department.

Table 1 Baseline characteristics between training and test set.

We evaluated the performance of each PTB diagnostic test individually, and the results are presented in Table 2. Smear microscopy and PCR alone were only 41.2% and 22.6% sensitive, respectively, for detecting culture-positive TB. The sensitivity of TB detection based solely on chest radiograph interpretation was 3.4%. Moreover, the cut-off score that maximized the diagnostic performance of DLAD-based TB detection was 20.59, and the sensitivity obtained using this cut-off was 70.6%. The AUROC for detecting TB with chest CT interpretations was 0.759 (95% CI 0.747–0.772), the highest of any single diagnostic modality.
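For reference, the per-test measures reported here all derive from a 2×2 table of test result against culture result. The sketch below shows the arithmetic in Python; the function names are hypothetical, and the counts in the usage line are invented for illustration and are not taken from the study.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(tp, fp, fn, tn):
    """Standard accuracy measures from a 2x2 table against culture.

    tp: test positive, culture positive    fp: test positive, culture negative
    fn: test negative, culture positive    tn: test negative, culture negative
    """
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
    }


# Illustrative counts only (not study data).
example = diagnostic_metrics(tp=8, fp=2, fn=2, tn=88)
```

With a low prevalence such as the 2% observed in this cohort, the negative predictive value stays high even for an insensitive test, which is why sensitivity and specificity are the more informative measures here.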

Table 2 Diagnostic performance of individual tests for pulmonary tuberculosis detection.

The eight variables that were significantly associated with PTB in univariable analyses and included in the multicomponent diagnostic model were the respiratory rate, oxygen saturation, dyspnea, anorexia, general weakness, weight loss, albumin, and sodium (Supplement Table S1). The performance of the 10 nested multicomponent diagnostic models, created by combining these 8 factors with the diagnostic tests for TB detection, is shown in Table 3. As additional diagnostic tests were included in the multicomponent diagnostic model, the AUROC and area under the precision-recall curve (AUPRC) increased, as expected, and the clinical factors identified in the univariable analysis lost statistical significance. Chest radiography was not significant as an independent factor in the multicomponent models once other diagnostic tests were added; however, the p-value for the odds ratio of DLAD was less than 0.05 in all multicomponent diagnostic models (Supplement Table S2). When the interpretation of chest radiography was replaced by DLAD, all models except those that included all tests showed a statistically significant increase in AUROC. In other words, only when all tests are available does chest radiograph interpretation give results equivalent to DLAD (Fig. 2). Figure 3 plots the performance and nomogram of the optimal diagnostic model created using the stepwise selection method for PTB detection. The optimal diagnostic model had an AUROC of 0.924 (95% CI 0.871–0.976) and an AUPRC of 0.403 (95% CI 0.195–0.580).

Table 3 Performance of the 10 nested multicomponent diagnostic models created by combining 8 clinical factors with the diagnostic tests for pulmonary tuberculosis detection.
Figure 2

Change in AUROC of nested multicomponent diagnostic models when chest radiograph interpretations by radiologists are replaced with DLAD. AUROC, Area Under the Receiver Operating Characteristic curve; DLAD, Deep Learning-based Automated Detection algorithm; CI, Confidence Interval.

Figure 3

Performance and nomogram of the optimal diagnostic model created using the stepwise selection method. AUROC, Area Under the Receiver Operating Characteristic curve; AUPRC, Area Under Precision Recall Curve; DLAD, Deep Learning-based Automated Detection algorithm.

Of the five multicomponent models with conventional interpretations of chest radiography, none had a specificity above 70% when sensitivity was fixed at 90%, whereas two models with DLAD exhibited a specificity above 70%. The optimal diagnostic model created using the stepwise selection method rather than the nested model maintained a specificity of 81.4% when sensitivity was fixed at 90% (Table 4). The calibration plots for multicomponent diagnostic models are shown in Supplement Fig. S3. P values for the Hosmer–Lemeshow test in all multicomponent diagnostic models were greater than 0.05, suggesting that diagnostic models were well calibrated.
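Fixing sensitivity and reading off specificity amounts to choosing the highest score threshold that still captures the required share of culture-positive patients. A minimal Python sketch follows; the function name, the convention that a score at or above the threshold counts as positive, and the example scores are assumptions for illustration and do not reproduce the study's implementation or data.

```python
import math


def specificity_at_sensitivity(labels, scores, target=0.90):
    """Best specificity achievable while keeping sensitivity >= target.

    A case is called positive when its score is >= the threshold.
    Returns (threshold, sensitivity, specificity).
    """
    pos = sorted(s for y, s in zip(labels, scores) if y == 1)
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Number of positives that must be detected (small guard against float error).
    need = max(1, math.ceil(target * len(pos) - 1e-9))
    # Highest threshold that still flags `need` positives.
    threshold = pos[len(pos) - need]
    sensitivity = sum(p >= threshold for p in pos) / len(pos)
    specificity = sum(n < threshold for n in neg) / len(neg)
    return threshold, sensitivity, specificity


# Illustrative scores only (not study data): 10 positives, then 10 negatives.
labels = [1] * 10 + [0] * 10
scores = [0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 0.95, 0.97, 0.99,
          0.05, 0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8]
threshold, sens, spec = specificity_at_sensitivity(labels, scores)
```

Because raising the threshold can only lower sensitivity and raise specificity, the threshold chosen this way gives the best specificity compatible with the sensitivity target, which makes models comparable at a clinically fixed operating point.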

Table 4 Specificity of each multicomponent diagnostic model with 90% sensitivity fixed.

Discussion

The present study devised multicomponent diagnostic models that are applicable to individualized clinical settings; this strategy will help guide clinical decisions regarding the presence or absence of PTB. More accurate predictions can be obtained when all diagnostic test results for PTB are available; nevertheless, clinical decisions should be as sound as possible even when some results are missing. Depending on the community and healthcare setting, the distribution of physical and systemic resources for PTB testing varies widely1,3,4,13. Consequently, different clinical areas implement different types of diagnostic tests, and the time taken for a physician to receive the results of the same test varies3,4,13. In particular, EDs represent clinical settings where patients with acute, uncertain diagnoses may stay for long periods of time, often in close contact owing to crowding. Thus, patients in EDs are at a higher risk of tuberculosis than patients in outpatient settings14,15,16. Moreover, sputum culture results can take several weeks to confirm, and the results of other diagnostic tests become available at staggered times. In the absence of sufficient reference materials, the decision to isolate and initiate treatment for a patient with suspected PTB has so far been based on clinical experience. The clinical tools developed in our study, which are customized for different clinical settings, can assist physicians in making quantitative and evidence-based decisions.

In the present study, individual diagnostic tests for PTB had poor sensitivity in comparison with specificity. In particular, chest radiographs and smear microscopy, which are conventional tools used for PTB screening, had a sensitivity of less than 50%, which is consistent with the results of previous studies17,18. Prediction based on PCR results alone, which are available in a shorter time frame than smear microscopy3,19, also had a low sensitivity for TB detection (22.6%). Low sensitivity of TB detection in healthcare facilities can contribute to the spread of nosocomial infections, because TB cannot be ruled out based on a negative test result. Our results suggest that single-test screening approaches carry a risk of nosocomial transmission, especially in high-density settings such as EDs and multi-bed wards. In this regard, Cattamanchi et al. demonstrated in a prospective cluster trial that a multicomponent strategy for the diagnosis of PTB significantly increased diagnosis rates8,18. Together, these findings suggest that a multicomponent diagnostic model for PTB is accurate and beneficial for controlling hospital infections. Increasing the number of diagnostic tests improves accuracy and specificity while maintaining 90% sensitivity, in line with WHO guidelines for TB screening1,2. Therefore, ensuring rapid turnaround times for multiple diagnostic tests in hospitals is crucial for preventing the spread of nosocomial PTB infection.

Notably, the present study demonstrated that the contribution of the DLAD to the detection of PTB was significantly greater than that of chest radiograph interpretation by a radiologist. Chest radiography is valuable for clinically diagnosing PTB and has been a pivotal tool in TB control for over a century, particularly in high-burden clinical settings17,20,21. However, the use of chest radiography to detect PTB is limited as this imaging technique lacks accuracy and requires radiological expertise11,17,21,22,23. Chest CT also requires specific expertise, and its limited availability, radiation hazards, and use of contrast media hinder its widespread adoption17. Recently, there has been renewed interest in using chest radiography for TB screening, leveraging advances in machine learning approaches to automate chest radiograph interpretation21. The WHO updated its TB screening guidelines to recommend computer-assisted detection software in place of human readers for digital chest radiograph analysis for tuberculosis screening and triage of individuals aged 15 years and above11. Because the diagnostic performance of DLAD varied by population across previous studies, the high performance of DLAD as a single test is not generalizable20,23,24. Our study simply confirms the superior sensitivity of DLAD compared with conventional chest radiograph interpretation when each is used alone. In particular, conventional chest radiograph interpretation was not statistically significant in the multicomponent approach, whereas the DLAD remained a significant factor in all models. We also found that replacing conventional interpretation with the DLAD significantly improved performance in all multicomponent models that could be used when PCR testing was not available. Therefore, the use of the DLAD in combination with other diagnostic tests may be an alternative in clinical settings where advanced diagnostic facilities for the detection of PTB are not available or where the turnaround time for the results is protracted. This finding suggests that our strategy may be particularly helpful in low-income countries where screening for PTB is not readily available5,13,25.

Globally, the occurrence of PTB is concentrated in underdeveloped countries with limited health care resources, which hinders diagnosis and follow-up of the disease13. Under these circumstances, using culture tests as a reference standard in research is difficult because of the time required to confirm results6,21,23,24,26. Our study was performed at a Level 1 ED located in a tertiary hospital with a standardized care protocol for patients with suspected PTB, which allowed us to establish a structured cohort from the outset and follow up without data loss until culture results were available. In addition, the study population in tuberculosis-related research is generally imbalanced because the disease is not highly prevalent. Therefore, previous studies have recommended measuring performance with the AUPRC or with a framework that specifies a target sensitivity and evaluates specificity, rather than with the AUROC24,27, and the performance of our diagnostic models was presented using these recommended metrics.

Our study has several limitations. First, it used a retrospective design at a single institution, which may limit the generalizability of the findings to other healthcare settings. It carries the potential biases of retrospective studies, and our results therefore need to be prospectively validated at study sites with different clinical settings. Second, although study participants were tested in accordance with a standardized protocol, PCR and chest CT results were missing for part of the tested population, which introduces potential bias in the estimated diagnostic performance of the model.

Conclusions

In conclusion, a multicomponent diagnostic model using various clinical manifestations and ancillary test results is more accurate in detecting patients with active PTB than diagnostic tools that use a single test. Among these diagnostic techniques, the TB screening score obtained from DLAD as an adjunctive tool for chest radiography can replace traditional interpretations reported by radiologists. Thus, diagnostic models using DLAD can assist in preventing the spread of PTB in resource-limited clinical settings and in optimizing healthcare resource utilization.

Ethical approval

This study was approved by the institutional review boards of Severance Hospital (approval number 4-2022-0481) and the requirement for informed consent from patients was waived owing to the study’s retrospective design.