Evaluation of the diagnostic accuracy of Computer-Aided Detection of tuberculosis on Chest radiography among private sector patients in Pakistan

The introduction of digital chest X-ray (CXR) with automated computer-aided interpretation has given impetus to the role of CXR in tuberculosis (TB) screening, particularly in low-resource, high-burden settings. The aim of this study was to evaluate the diagnostic accuracy of CAD4TB as a screening tool implemented in the private sector in Karachi, Pakistan. This study analyzed retrospective data from CAD4TB and Xpert MTB/RIF testing carried out at two private TB treatment and diagnostic centers in Karachi. Sensitivity, specificity and the number of Xpert tests potentially saved were computed, and receiver operator characteristic (ROC) curves were constructed for four different models of CAD4TB. A total of 6,845 individuals with presumptive TB were enrolled in the study, 15.2% of whom had an MTB+ve result on Xpert. A high sensitivity (range 65.8–97.3%) and negative predictive value (NPV; range 93.1–98.4%) were recorded for CAD4TB. The area under the ROC curve (AUC) for CAD4TB was 0.79. CAD4TB with patient demographics (age and gender) gave an AUC of 0.83. CAD4TB offered high diagnostic accuracy. In low-resource settings, CAD4TB as a triage tool could minimize use of Xpert. Using CAD4TB in combination with age and gender data enhanced the performance of the software. Variations in demographic information generate different individual risk probabilities for the same CAD4TB scores.

Tuberculosis (TB) remains a major cause of morbidity and mortality globally. In 2015, there were an estimated 10.4 million incident cases of TB and 1.8 million TB deaths 1 . Active case finding programs are being increasingly utilized to reduce the case-detection gap 2, 3 .
In recent years, there has been growing interest in the use of chest x-rays (CXR) as a screening tool for TB within active and enhanced case finding programs 4 . Recent TB prevalence surveys have shown that CXR has higher sensitivity than verbal screening for identifying pulmonary TB [5][6][7] . Previously, costs, limited access to x-ray facilities, maintenance of equipment, availability of trained personnel, poor specificity and inter-observer variation meant that the role of CXR within diagnostic algorithms was limited 8 .
The advent of digital chest radiography, along with software capable of automated interpretation such as the "Computer Assisted Diagnosis for TB" (CAD4TB) software developed by the Diagnostic Image Analysis Group of the Radboud University Medical Centre, has prompted reconsideration of the role of CXR in TB screening, particularly in low-resource, high-burden settings 9 . Long-term use of digital radiography is cost-efficient compared to conventional radiography as it eliminates recurring costs related to reagent use and radiologists 10 . Currently, CAD4TB is the only scoring software that has been evaluated and is being implemented in programmatic settings. Encouraging findings on the diagnostic accuracy of CAD4TB have been reported from sub-Saharan Africa, and most recently from Bangladesh [11][12][13][14][15] .
The need for improved approaches for screening has acquired greater pertinence following the introduction of sensitive rapid molecular diagnostics for TB such as Xpert MTB/RIF (Xpert) testing [16][17][18] . However, the scale-up of Xpert testing is limited in resource-constrained countries by high costs of test cartridges [19][20][21][22] .
An increasing body of evidence from high burden countries suggests that the use of digital CXR equipment and the automated reading of CXR with Computer Aided Detection (CAD), as a pre-screening tool, in conjunction with an expensive molecular test such as Xpert can improve case finding efforts 23 .
The use of CAD4TB is still in the development phase, and the World Health Organization (WHO) has not developed any formal guidelines or recommendations for its use due to limited evidence. The aim of this study was to evaluate the diagnostic accuracy of CAD4TB as a screening tool in Karachi, Pakistan, a megacity with a high TB prevalence and a substantial burden of undiagnosed TB. Similar studies, which assessed diagnostic accuracy using Xpert MTB/RIF as the reference standard, have been reported from Zambia in 2013 and Bangladesh in 2017 13,14 .
Other studies from Zambia, Tanzania, South Africa and England have evaluated CAD4TB against the reference standard of culture 15,[24][25][26] . Our current study is another data point in the series of studies, carried out in Pakistan. In addition, we also investigated whether different models of CAD4TB implementation that included routinely collected programmatic data such as age and gender can potentially enhance the diagnostic accuracy of the software and yield of TB case-detection.

Methods
Study design and setting. Pakistan has the fifth highest burden of tuberculosis in the world and the third largest number of undiagnosed TB cases 1 . Of the estimated 510,000 new TB cases, only 331,809 (65%) were notified to the National Tuberculosis Program (NTP) in 2015, making increased case-detection and notification a key priority 27 . Currently, smear microscopy is the predominant diagnostic test in the majority of facilities in Pakistan 28 .
The study was conducted at two purpose built TB treatment and diagnostic centers, called "Sehatmand Zindagi" (Healthy Life) centers, in Karachi, Pakistan, from October 2013 to September 2015. These centers are located in low-middle income neighborhoods of Karachi, Nazimabad and Korangi. In addition to digital CXR equipment with CAD4TB, Xpert testing was carried out at both centers, with initiation of treatment among those diagnosed with TB.
The study was embedded within a broader programme implementing enhanced case finding, whereby community-health workers screened all individuals attending private health providers' clinics in the vicinity of the centers using the WHO TB symptom screen 29 , which checks for the presence of any of the following: cough of any duration, fever, hemoptysis, night sweats, weight loss. Following a clinical evaluation by the health providers, those identified with presumptive TB were referred to the centers for further investigation. The target population for this study included individuals with presumptive TB referred by the private providers from the catchment area of the centers, as well as individuals with symptoms who self-referred for investigation for TB. All participants underwent a paid digital CXR (USD 3-5) and were requested to provide a sputum sample for free Xpert testing.
Chest X-Ray scoring procedures. The CXRs were scored for abnormalities suggestive of pulmonary TB by the CAD4TB software system (version 3.07, Diagnostic Image Analysis Group, The Netherlands). CAD4TB was developed utilizing machine learning methods and was trained using labeled samples to differentiate between normal and abnormal x-ray images. The software has two abnormality detection systems, a textural abnormality system and a shape abnormality system, which analyze the abnormalities in the unobscured lung fields that have been segmented automatically. The software then uses outputs from its detection systems as image descriptive features to train a k-NN classifier to compute a cumulative abnormality score (range 0-95) for each CXR 13,30 . A higher score is indicative of a more serious abnormality suggestive of TB. A CAD4TB threshold score of 50 was used for this population, determined using previously collected CXR data from a similar population. All individuals with high CAD4TB scores (50 or greater) were referred back to their consulting physicians for further clinical evaluation.
Data management and analysis. All individuals attending the TB centers were registered online using an open-source platform (OpenMRS), by allocation of a unique patient ID, against which baseline information and history of presenting symptoms were recorded. Distribution of CAD4TB scores was compared for various patient characteristics such as age, gender, symptoms and Xpert result. Sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) were calculated for each of the TB symptoms using the Xpert result as the standard. Univariate and multivariate associations of CAD4TB score, age, gender and symptoms (as explanatory variables) with TB infection (defined as a positive Xpert result) were computed. Logistic regressions were performed with MTB detection as the outcome variable and CAD4TB score, age and gender as the explanatory variables (Models 1 and 3). Adjusted analyses were subsequently performed through backward step-wise multivariate logistic regression using Akaike's Information Criterion (AIC) to select the final, parsimonious model where symptoms were included as predictors of TB (Models 2 and 4). The AIC is an estimator of the relative quality of competing statistical models and allowed for the selection of the most suitable set of predictor variables for the final model. Inclusion of the full set of symptoms screened was selected through the AIC for Models 2 and 4. Receiver Operator Characteristic (ROC) curves were constructed for four prediction models for TB, namely: Model 1 (CAD4TB score only), Model 2 (CAD4TB score, symptoms), Model 3 (CAD4TB score, age, gender) and Model 4 (CAD4TB score, age, gender, symptoms). Area-Under the Curve (AUC) statistics were obtained for each ROC curve and confidence intervals were calculated to investigate statistical differences in discriminatory accuracy of the prediction models.
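The two model-comparison quantities named above, AUC and AIC, can be illustrated with a minimal sketch. This is not the study's Stata code: the function names and the toy scores are hypothetical. The AUC is computed here as the probability that a randomly chosen Xpert-positive individual has a higher score than a randomly chosen Xpert-negative one, which equals the area under the empirical ROC curve.

```python
def auc(pos_scores, neg_scores):
    """Area under the empirical ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive has the higher score; ties count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


# Toy CAD4TB scores (hypothetical, not study data).
toy_pos = [92, 88, 95, 70]      # Xpert-positive individuals
toy_neg = [40, 85, 60, 30, 90]  # Xpert-negative individuals
toy_auc = auc(toy_pos, toy_neg)
print(f"toy AUC = {toy_auc:.2f}")
```

In a backward step-wise procedure, a predictor would be dropped when doing so lowers the AIC of the refitted model; the sketch shows only the criterion itself, not the fitting.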
Scientific Reports | (2018) 8:12339 | DOI:10.1038/s41598-018-30810-1
Sensitivity, specificity, PPV and NPV for CAD4TB cutoff thresholds at scores of 50, 80 and 90 were obtained for the four prediction models by determining their predicted probabilities for TB detection. These cutoffs were selected based on the CAD4TB score distribution of the study population, with scores of 50, 80 and 90 being at approximately the 25th, 50th and 75th percentiles, respectively.
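As a minimal sketch of how these four measures follow from a cutoff, the function below classifies each individual as screen-positive when the CAD4TB score is at or above the cutoff and compares that with the Xpert result. It covers only the score-only case (Model 1); for the other models the cutoff would be applied to a predicted probability instead. The function name and the toy data are hypothetical.

```python
def diagnostic_metrics(scores, xpert_positive, cutoff):
    """Sensitivity, specificity, PPV and NPV at a given score cutoff.

    A score >= cutoff counts as screen-positive; Xpert is the reference.
    A measure with a zero denominator is returned as None.
    """
    tp = fp = tn = fn = 0
    for score, positive in zip(scores, xpert_positive):
        flagged = score >= cutoff
        if flagged and positive:
            tp += 1
        elif flagged and not positive:
            fp += 1
        elif not flagged and positive:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Toy data (hypothetical, not study data).
scores = [30, 45, 55, 60, 82, 85, 91, 93, 95, 40]
xpert = [False, False, False, True, False, True, True, False, True, False]

m50 = diagnostic_metrics(scores, xpert, 50)
m90 = diagnostic_metrics(scores, xpert, 90)
for cutoff, m in ((50, m50), (90, m90)):
    print(cutoff, m)
```

On the toy data, raising the cutoff from 50 to 90 lowers sensitivity and raises specificity, the same trade-off the study examines across its three cutoffs.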
A range of predicted probabilities for each CAD4TB score was obtained from the two models that included CAD4TB with demographic information (age and gender) and symptoms. Locally weighted regressions were carried out for the range of predicted probabilities for both models against CAD4TB scores and were used to determine the corresponding predicted probability for MTB detection at the four CAD4TB cut-offs. Predicted probabilities of TB were computed at each CAD4TB cutoff threshold; these estimated the risk of TB detection at each CAD4TB score. They were used to estimate, for the four models, the number of TB cases missed, the number of Xpert cartridges saved (because fewer individuals had a CAD4TB score above the threshold) and the yield on the Xpert test (number of MTB positive results out of all those tested). All data analysis was carried out using Stata Statistical Software (version 11; Stata Corporation, College Station, TX, USA).
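The three triage outcomes defined above (cases missed, cartridges saved and yield) reduce to simple counts once a cutoff is fixed. The sketch below shows that arithmetic for a score-only rule; it does not reproduce the locally weighted regression step, and the function name and toy data are hypothetical.

```python
def triage_outcomes(scores, xpert_positive, cutoff):
    """Outcomes of testing only individuals with score >= cutoff.

    xpert_saved  : individuals below the cutoff, who are not tested
    cases_missed : Xpert-positive individuals below the cutoff
    yield        : Xpert-positive results among those tested
    """
    tested = [pos for s, pos in zip(scores, xpert_positive) if s >= cutoff]
    n_tested = len(tested)
    detected = sum(tested)
    return {
        "xpert_saved": len(scores) - n_tested,
        "cases_missed": sum(xpert_positive) - detected,
        "yield": detected / n_tested if n_tested else None,
    }


# Toy data (hypothetical, not study data).
scores = [30, 45, 55, 60, 82, 85, 91, 93, 95, 40]
xpert = [False, False, False, True, False, True, True, False, True, False]

at_50 = triage_outcomes(scores, xpert, 50)
at_90 = triage_outcomes(scores, xpert, 90)
print("cutoff 50:", at_50)
print("cutoff 90:", at_90)
```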

Review Board (IRB) of Interactive Research & Development, which is registered with the Department of Health and Human Services, USA. The methods were carried out in accordance with the relevant guidelines and regulations. Verbal informed consent was obtained from the participants before carrying out screening activities under the project. De-identified data were provided for analysis to the study researchers, whereas all patient screening and diagnostic information was secured on a password-protected server.

Results
A total of 6,845 individuals with presumptive TB were enrolled in the study between October 2013 and September 2015. Of these, 755 individuals with invalid, error or no results on Xpert were excluded, leaving 6,090 individuals in the analysis. The median age of participants was 38.9 (IQR 17.2) years and 3,018 (49.6%) were male. The majority of individuals included in the study reported symptoms of cough (87.5%) and fever (76.1%) (Table 1). Hemoptysis and night sweats were reported by 13.2% and 30.5% of the study participants, respectively. A total of 925 of the individuals analysed (15.2%) had MTB+ve results on Xpert (Fig. 1). The majority (90.2%) of people with an MTB+ve result on Xpert had a CAD4TB score >80. However, a high proportion of individuals (74.2%) who tested MTB-ve also had scores >80 (Table 1). For each model, increasing the CAD4TB score threshold improved the yield of TB case detection, with a corresponding increase in specificity and decrease in sensitivity (Table 3). Using the symptom screen alone, cough of <2 weeks and fever had higher sensitivities (93.8% and 85.7%, respectively) and lower specificities (14.5% and 25.6%, respectively) than the other symptoms (Fig. 2). All symptoms had high negative predictive values and low positive predictive values (Fig. 2).
For each of the models, higher CAD4TB score thresholds reduced the number of Xpert tests carried out but led to more patients being classified as false negatives (TB cases missed). At a CAD4TB score of 90, a total of 3,539 Xpert tests would be saved using Model 1 (CAD4TB scores only), 4,163 using Model 2 (CAD4TB scores and symptoms), 4,577 using Model 3 (CAD4TB scores, age and gender), and 4,465 using Model 4 (CAD4TB scores with age, gender and symptoms). The proportion of TB cases missed was lowest at a CAD4TB score of 50: 2.7%, 3.2%, 3.7% and 4.2% for the four models, respectively. The MTB yield at a score of 90 using the four models was 30.8%, 35%, 40.2% and 39.3%, respectively. The area under the ROC curve (AUC) for the model with only CAD4TB scores as the predictor of MTB detection (Model 1) was 0.79 (95% CI: 0.78-0.81) (Fig. 3), and for Model 2, using CAD4TB scores and symptoms, it was 0.81. Adding patient demographics (age and gender) to CAD4TB scores (Model 3) increased the AUC to 0.83 (95% CI: 0.82-0.85). A combined model of CAD4TB scores, symptoms, age and gender (Model 4) further increased the AUC to 0.84 (95% CI: 0.82-0.85); however, this was not significantly different from Model 3. Table 4 describes a sample of the predicted probabilities for various combinations of age and gender for the same selected CAD4TB scores.
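As a rough consistency check on the Model 1 figures at a score of 90, the reported counts can be combined as follows. This is a back-of-envelope sketch, not part of the original analysis: it assumes that the Xpert tests saved are counted among the individuals remaining after exclusions (6,845 minus 755), which the text does not state explicitly, and it uses the rounded yield of 30.8%.

```python
# Figures reported in the Results; the denominator is an assumption.
enrolled = 6845
excluded = 755
analysed = enrolled - excluded      # assumed analysis population
xpert_positive = 925                # MTB+ve on Xpert

saved_model1 = 3539                 # Xpert tests saved at score 90, Model 1
yield_model1 = 0.308                # MTB yield at score 90, Model 1

tested = analysed - saved_model1    # individuals still sent for Xpert
detected = yield_model1 * tested    # approximate MTB+ve among those tested
implied_sensitivity = detected / xpert_positive

print(f"tested: {tested}")
print(f"approx. detected: {detected:.0f}")
print(f"implied sensitivity: {implied_sensitivity:.1%}")
```

Under these assumptions the implied sensitivity is roughly 85%, which matches the lower end of the sensitivity range quoted in the Discussion. Likewise, the 2.7% of TB cases missed at a score of 50 under Model 1 corresponds to a sensitivity of 97.3%, the upper end of that range.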

Discussion
Our study evaluated the performance of CAD4TB software as a screening tool for the detection of tuberculosis in a low-resource, high-burden, non-HIV setting, using Xpert as the reference test. This study is one of the largest such evaluations of CAD4TB from a programmatic setting. In our study, CAD4TB was able to correctly identify a high proportion of people who were diagnosed with TB on Xpert and hence could potentially reduce the number of expensive molecular tests needed to detect TB in our sample of patients. While the use of Xpert in programmatic settings has expanded in recent years, the WHO has also recommended use of more cost-effective diagnostic algorithms through screening tools such as CXR 25,29,31 . Development of software that offers automated interpretation of CXRs represents an important milestone that can link technological innovations to mass-screening programs for tuberculosis. The utilization of CAD4TB as a triage tool to pre-screen individuals for Xpert can not only improve case-detection in screening programs but also possibly reduce program costs 32 .
The findings from this study indicate that CAD4TB offers high diagnostic accuracy. CAD4TB scoring can be utilized to triage individuals for Xpert testing, as individuals with a low CAD4TB score had a low probability of testing positive for TB. In resource-constrained settings such as Pakistan, with limited funds to support Xpert testing for all people with presumptive TB, using a triage tool like CAD4TB could promote more rational use of Xpert by minimizing the number of cartridges used. This is also relevant for facilities where an onsite radiologist may not always be available to evaluate the CXR. It is important to note that the savings offered through reduced Xpert tests need to be offset against the cost of acquiring and maintaining digital X-ray systems. However, a detailed discussion on the costing and policy implications for mass-screening using CXR is beyond the scope of this study. High sensitivity (range 85-97.3%) and NPV (range 96.1-98.4%) were recorded for CAD4TB at the score cut-offs utilized in the analysis, which is similar to what has been reported for CXR in other study settings 18,33,34 . The relatively lower specificity (range 30.3-65.7%) and PPV (range 20-30.8%) were also consistent with findings from another study evaluating CAD4TB 13 .
A high AUC (0.79) was recorded from the model using CAD4TB alone as a screening tool (Model 1). Other studies from Zambia and Bangladesh that also used Xpert as the reference test reported AUCs of 0.71 and 0.74, respectively 13,14 . Studies from Africa using culture as the reference test reported AUCs in the range 0.71-0.84 35 . Our results therefore support investigations elsewhere suggesting that CAD4TB performs well in detecting radiological abnormalities [11][12][13][14] . To date, the highest AUC has been reported with version 3.07 of CAD4TB (compared to older versions) 35 . With newer versions, which have improved machine learning capacity, now available and increasingly utilized by programmes, future evaluations are expected to find superior performance of the CAD4TB software. While the combined use of CAD4TB and symptoms has been evaluated in a previous study 12 , this is one of the first studies to have evaluated CAD4TB in combination with symptoms as well as demographic information (age and gender). Using CAD4TB in combination with demographic data enhanced the performance of the software, generating a higher AUC (0.83), and information such as age and gender is routinely captured in screening programmes. However, adding clinical symptoms to the model with demographics and CAD4TB did not significantly increase accuracy, as had been hypothesized by a previous study 13 . Another study, from South Africa, reported a superior performance of a combination framework using both CAD4TB scores and symptoms (AUC 0.84) 12 . Symptoms may not have contributed to improved performance in our setting as the study population included individuals who were referred for investigations (including self-referrals). This may have led to pre-screening of individuals, thereby limiting the added discrimination offered by symptoms.
Addition of symptoms improved specificity but decreased sensitivity as a lower number of individuals would have been screened positive under Model 4, and a larger number of TB cases were missed. In order to obtain a precise estimate of the AUC and to detect differences in the AUC between the models, a large sample size was included in the study. Since the data was obtained from a programmatic setting rather than a controlled investigation, a higher proportion of MTB-ve individuals were enrolled reflecting the prevalence of the disease in this population.
The increased diagnostic accuracy offered through demographic data can be utilized to enhance the yield of Xpert testing beyond what is achieved with CAD4TB alone. In this study, we used the dataset to generate a range of predicted probabilities for TB detection using a combination of CAD4TB scores, age and gender, like those shown in Table 4, that can be used to devise risk categories for patients identified through screening, further refining the triage process. Our study demonstrates that for the same CAD4TB scores, variations in demographic information such as age and gender can generate different individual risk probabilities. For example, at a CAD4TB score of 80, a male aged 56 years may have a low probability (5.1%) of being identified as MTB+ve on Xpert compared to a female aged 22 years, who may have a higher probability (19.8%) (Table 4). Individualized risk scores could, therefore, assist frontline healthcare workers in making informed decisions about whom to test. Sputum samples for Xpert testing may be collected for those with high risk for TB, and repeat tests or clinical evaluations may be carried out for those with medium to high risk, which could potentially save Xpert cartridges, improve testing yields and make programs more cost-effective. In addition to demographic data, routinely collected programmatic information such as history of TB contact, diabetes status and smoking history can be further utilized by future programs to create personalized risk scores. It must be noted that symptoms, while not offering improved accuracy in this study, may be useful in community settings in active case finding programs, where a large number of asymptomatic individuals are also among those screened, and may further help improve yield on Xpert.
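How the same CAD4TB score can map to different individual risks follows directly from the form of a logistic model. The sketch below is illustrative only: the coefficients are hypothetical placeholders, not the fitted values from this study, and are chosen merely so that the direction of the effect resembles the example from Table 4. It will not reproduce the reported probabilities.

```python
import math


def predicted_probability(cad_score, age, is_male, coef):
    """Predicted probability of an MTB+ve Xpert result from a logistic model.

    p = 1 / (1 + exp(-(b0 + b1*score + b2*age + b3*male)))
    """
    z = (coef["intercept"]
         + coef["cad_score"] * cad_score
         + coef["age"] * age
         + coef["male"] * (1 if is_male else 0))
    return 1.0 / (1.0 + math.exp(-z))


# HYPOTHETICAL coefficients for illustration; not estimated from study data.
HYPOTHETICAL_COEF = {
    "intercept": -5.0,
    "cad_score": 0.05,
    "age": -0.03,
    "male": -0.2,
}

# Same CAD4TB score, different demographics.
p_young_female = predicted_probability(80, 22, False, HYPOTHETICAL_COEF)
p_older_male = predicted_probability(80, 56, True, HYPOTHETICAL_COEF)
print(f"female, 22 years, score 80: {p_young_female:.3f}")
print(f"male, 56 years, score 80:   {p_older_male:.3f}")
```

With coefficients fitted to real programme data, such a function could back a lookup of risk categories per patient instead of a single score threshold.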
Our study findings also demonstrate that for increasing CAD4TB score thresholds, the sensitivity decreased, with a corresponding increase in specificity, resulting in more TB cases being missed but providing a higher yield (Table 3). Similar findings have been reported from a study in South Africa where 11% of TB cases would have been missed using a threshold score that would have triaged 40% of suspects for Xpert testing 25 . However, individualized risk assessment may diminish the need to set broad CAD4TB thresholds for programs, with greater reliance on testing based on personalized assessment. An additional benefit of utilizing digital X-rays is increased capacity for clinical diagnosis of TB. Images can be archived online using cloud-based software, giving radiologists or clinical officers at TB facilities access to high quality images for diagnostic evaluation. In addition, mass-screening programs with X-rays are more likely to generate community interest and support mobilization than conventional screening camps with health workers. However, additional operational considerations continue to be relevant regardless of the modality of screening used. Improvements in processes such as health communication activities to promote screening among asymptomatic individuals, adequate resources for sputum induction, increased diagnostic capacity for testing, additional clinical staff for examining bacteriologically negative cases and engineers for providing equipment and software maintenance will all be required to make screening and community referrals more effective. Since CAD4TB does not differentiate CXR abnormalities that may be observed in other conditions, such as pneumonia or lung cancer, a significant number of people without TB are likely to be referred for diagnostic testing 14 . Algorithms and pathways to care will need to be developed for managing the diagnostic workup and treatment for these individuals.
This is especially pertinent for developing countries with donor supported TB programs as diagnostics and treatment for other pulmonary pathologies are not funded.
Our study has certain limitations. The major limitation was that Xpert, and not mycobacterial culture, was used as the reference standard, whereby Xpert-negative, culture-positive TB cases may have been missed. Individuals who were unable to expectorate sputum and cases with invalid or error results on Xpert (for which additional sputum samples could not be obtained to re-run the test) were excluded from the study. These factors may have decreased the number of patients classified as MTB+ve and affected the accuracy of the results. An evaluation of the performance of CAD4TB compared with human readers was beyond the scope of this analysis as this has been conducted extensively in a number of studies. These evaluations utilized a combination of readings by clinical officers and radiologists; the performance of CAD4TB was found to be comparable to that of human readers, and CAD4TB also has the potential to reduce inter-reader and intra-reader variability and detection errors 11,[34][35][36] . While these early studies have demonstrated the effectiveness of CAD4TB in place of medical staff, further studies such as ours that utilize a biological reference can further support the use of CAD4TB in screening programs. Finally, the external validity of our study may be limited for active case finding programs as participant enrollment was carried out in a facility-based setting, and the results may not be generalizable to the community setting, where a large number of asymptomatic people with TB may also be present. We therefore recommend further studies to evaluate CAD4TB in the community setting, such as through mobile X-ray units.

Conclusion
This study described the first use of CXRs supported with computer-aided detection as part of an enhanced case-finding intervention in the private sector in Pakistan. It demonstrated that CAD4TB has the potential to be used as a triage tool to identify symptomatic individuals who could be excluded from further testing, making screening programs more cost-effective by reducing the number of Xpert tests. With the large-scale roll-outs of Xpert and CAD4TB in local programmatic settings, the use of CAD4TB within different case finding approaches should be evaluated and compared. A follow-up study comparing different versions of CAD4TB is also recommended. Screening algorithms need to be tailored to local contexts, taking into account priorities for increased case-detection and resources required for testing additional individuals with presumptive TB.
Data availability. The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.