CheXaid: deep learning assistance for physician diagnosis of tuberculosis using chest x-rays in patients with HIV

Tuberculosis (TB) is the leading cause of preventable death in HIV-positive patients, yet it often remains undiagnosed and untreated. Chest x-ray is often used to assist in diagnosis, but this presents additional challenges due to atypical radiographic presentation and radiologist shortages in regions where co-infection is most common. We developed a deep learning algorithm to diagnose TB using clinical information and chest x-ray images from 677 HIV-positive patients with suspected TB from two hospitals in South Africa. We then sought to determine whether the algorithm could assist clinicians in the diagnosis of TB in HIV-positive patients as a web-based diagnostic assistant. Use of the algorithm resulted in a modest but statistically significant improvement in clinician accuracy (p = 0.002), increasing the mean clinician accuracy from 0.60 (95% CI 0.57, 0.63) without assistance to 0.65 (95% CI 0.60, 0.70) with assistance. However, the accuracy of assisted clinicians was significantly lower (p < 0.001) than that of the stand-alone algorithm, which had an accuracy of 0.79 (95% CI 0.77, 0.82) on the same unseen test cases. These results suggest that deep learning assistance may improve clinician accuracy in TB diagnosis using chest x-rays, which would be valuable in settings with a high burden of HIV/TB co-infection. Moreover, the high accuracy of the stand-alone algorithm suggests that it may be particularly valuable in settings with a scarcity of radiological expertise.


Dataset Two Recruitment and Inclusion
The second dataset was collected as part of a cross-sectional diagnostic study of HIV-infected patients with at least one TB symptom (current cough, fever, night sweats or weight loss) admitted to the emergency center of Khayelitsha Hospital from 2016-2017. Inclusion criteria were: ≥ 18 years of age, HIV-positive, and currently experiencing at least one TB symptom. Exclusion criteria were: anti-TB treatment (currently or within the past 3 months), admission to the emergency center for longer than 24 hours, informed consent not obtained, a main clinical presenting feature of meningitis syndrome or new focal neurology, trauma, a gynecological or psychiatric-related presentation, or pregnancy. All patients underwent systematic testing for TB including CD4 count, chest x-ray, point-of-care ultrasound, sputum and urine Xpert MTB/RIF assays (Cepheid, Sunnyvale, CA, USA) and culture, and urine LAM assays performed in the emergency center and independent laboratory (Alere Determine™ TB LAM Ag, Waltham, MA, USA).

Patient Characteristics In Both Datasets
In dataset one, patients with TB had slightly lower hemoglobin (mean 8.8 vs. 10.7), white blood cell (WBC) counts (mean 8.7 vs. 11.7), and CD4 counts (mean 127 vs. 203). In dataset two, patients with TB also had lower WBC counts (mean 9.7 vs. 11.8), although the difference in hemoglobin was slightly less pronounced (mean 9.0 vs. 10.3). Again, patients with TB had lower CD4 counts on average (mean 116 vs. 203). Both datasets were predominantly female (66% and 69% of patients with and without TB in dataset one, and 60% and 58% in dataset two). All of the patients in dataset one reported cough (as this was one of the inclusion criteria), while only 85% of the patients in dataset two reported cough.

Supplementary Note 2 Algorithm architecture
Neural networks are complex functions with many parameters, structured as a hierarchy of layers that model different levels of abstraction. A convolutional neural network, a particular type of neural network, is specially designed to handle image data: inspired by the organization of neurons in the human visual cortex, it takes advantage of a parameter-sharing receptive field to learn local features of an image and abstractions of those local features.
The neural network consists of two components. First, a 121-layer DenseNet (pretrained on CheXpert, as discussed above) is used to extract image features as a 1024-dimensional vector. In the DenseNet architecture, each layer is directly connected to every other layer within a block: for each layer, the feature maps of all preceding layers are used as inputs, and its own feature maps are passed on as inputs to all following layers. The network then forks into two modules: one for TB diagnosis using the image features and the clinical covariates, and the other for predicting the occurrence of six clinical findings that were diagnosed by radiologists and should be inferable from the x-ray image alone. The TB module first uses a linear layer to learn 20 image features from the original 1024 dimensions, then combines them with the 8 covariates, feeding the resulting 28-dimensional patient representation into a two-layer neural network to predict TB. The findings module is a linear multi-label classifier with 6 output units on top of the 1024-dimensional image representation.
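The tensor shapes of this two-module design can be traced with a minimal NumPy sketch. The stand-in weights are random and the hidden-layer width of the two-layer TB module is an assumption (the text does not specify it); the real model is a trained DenseNet-based network, so this sketch only illustrates the data flow and dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, in_dim, out_dim):
    """Stand-in for a learned linear layer: random weights, correct shapes."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.01
    b = np.zeros(out_dim)
    return x @ W + b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# DenseNet-121 backbone output: a 1024-dimensional image feature vector.
img_features = rng.standard_normal(1024)

# TB module: 1024 -> 20 learned image features, concatenated with the
# 8 clinical covariates into a 28-dimensional patient representation,
# then a two-layer network producing a single TB probability.
covariates = rng.standard_normal(8)
img20 = linear(img_features, 1024, 20)
patient = np.concatenate([img20, covariates])        # shape (28,)
hidden = np.maximum(linear(patient, 28, 28), 0.0)    # ReLU hidden layer (width assumed)
p_tb = sigmoid(linear(hidden, 28, 1))                # TB probability

# Findings module: linear multi-label classifier over the 1024-d features.
p_findings = sigmoid(linear(img_features, 1024, 6))  # 6 radiographic findings
```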

Focal Loss
We optimized the algorithm's parameters with a focal loss applied to TB classification, as well as to classification of the additional x-ray findings. The loss for TB is upweighted by a factor of 6 so that it contributes as much as all the additional findings together. The focal loss is a modification of the standard cross-entropy loss that includes an additional factor, (1 - p_y)^c, where p_y is the predicted probability of the target label y and c > 0 is a hyperparameter. The overall loss for a single example is focal_loss(p_y) = -(1 - p_y)^c * log(p_y). The additional factor downweights the loss for examples that are easy to classify (i.e., when the probability of the correct class is close to 1), and thus gives greater importance to examples that are challenging to classify.
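The loss above can be written directly in NumPy. This is a sketch of the per-example loss and of the 6x TB upweighting; the probabilities and the choice c = 2 are illustrative (the text does not state the value of c used):

```python
import numpy as np

def focal_loss(p_y, c=2.0):
    """Focal loss for the probability p_y assigned to the correct class.

    Reduces to the standard cross entropy -log(p_y) when c = 0; for c > 0
    the factor (1 - p_y)**c downweights easy, confidently-correct examples.
    """
    p_y = np.asarray(p_y, dtype=float)
    return -((1.0 - p_y) ** c) * np.log(p_y)

easy = focal_loss(0.95)   # confidently correct -> heavily downweighted
hard = focal_loss(0.30)   # poorly classified -> close to the full loss

# Per the training objective, the TB term is upweighted by 6 so it
# contributes as much as the six findings terms combined.
total = 6.0 * focal_loss(0.8) + np.sum(focal_loss([0.7, 0.9, 0.6, 0.8, 0.5, 0.95]))
```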

Preprocessing
Images were cropped to exclude regions outside of the lungs using human annotations of the lung regions. After cropping, the shorter dimension was padded so that the resulting image was a square. Images were then downsampled to an input resolution of 320x320. Finally, the pixel values were normalized based on the per-channel mean and standard deviation that had been precomputed on the training set. All numerical clinical covariates (i.e. CD4 count, WBC count, etc.) were also normalized using the mean and standard deviation of the training set.
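The pipeline above (pad to square, downsample to 320x320, normalize with training-set statistics) can be sketched in NumPy for a single grayscale image. The placement of the padding, the nearest-neighbor interpolation, and the mean/std values are assumptions for illustration; cropping from the lung annotations is taken as already done:

```python
import numpy as np

def pad_to_square(img):
    """Pad the shorter dimension with zeros so the image is square."""
    h, w = img.shape
    size = max(h, w)
    out = np.zeros((size, size), dtype=img.dtype)
    out[:h, :w] = img   # padding placement is an assumption; the text only says "padded"
    return out

def resize_nearest(img, size=320):
    """Nearest-neighbor downsample (the interpolation method is not stated)."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[np.ix_(rows, cols)]

def preprocess(img, mean, std):
    """Crop to the lung region is assumed done upstream."""
    x = resize_nearest(pad_to_square(img.astype(float)))
    return (x - mean) / std   # statistics precomputed on the training set

img = np.random.rand(400, 512)               # a cropped x-ray region
x = preprocess(img, mean=0.5, std=0.25)      # illustrative statistics
```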

Algorithm Hyperparameters
The Adam optimizer with default parameters for beta1 (0.9) and beta2 (0.999) was used. The batch size was set to 16 and the initial learning rate was set to 1e-4. The learning rate was halved every 500 iterations. We also added L2 regularization to the algorithm parameters with a weight decay of 1e-4. Additionally, we applied random affine transformations (translation, rotation, and scaling) for data augmentation during training.
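The step-decay schedule described above (initial rate 1e-4, halved every 500 iterations) can be written as a simple function; only the schedule itself is shown, since the Adam settings and weight decay are passed to the optimizer unchanged:

```python
def learning_rate(iteration, base_lr=1e-4, halve_every=500):
    """Learning rate halved every `halve_every` iterations, as in training."""
    return base_lr * 0.5 ** (iteration // halve_every)
```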
Visual Interpretation of the Algorithm
Class activation maps (CAMs) were used to highlight regions that had the greatest influence on the algorithm's decision. CAMs were generated by taking the weighted average across the algorithm's final convolutional feature maps, with weights based on global average pooling for each map. This averaged map was then scaled by the output probability so that more confident predictions appear brighter. Finally, the map was upsampled to the input image resolution and overlaid onto the input image.
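The CAM computation can be sketched in NumPy as follows. The feature-map sizes are illustrative, the normalization constant of the weighted average is omitted, and nearest-neighbor upsampling is an assumption (the upsampling method is not stated):

```python
import numpy as np

def class_activation_map(feature_maps, prob, out_size=320):
    """CAM as described: weight each final conv feature map by its global
    average pool, combine, scale by the predicted probability, upsample."""
    c, h, w = feature_maps.shape
    weights = feature_maps.mean(axis=(1, 2))           # GAP weight per map
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted combination, (h, w)
    cam = cam * prob                                   # brighter when more confident
    # Nearest-neighbor upsample to the input resolution
    # (assumes out_size is divisible by the feature-map size).
    return np.kron(cam, np.ones((out_size // h, out_size // w)))

maps = np.random.rand(1024, 10, 10)   # final DenseNet feature maps (sizes illustrative)
cam = class_activation_map(maps, prob=0.8)
```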

Ensembling
We performed 5-fold cross validation in training our algorithm. On each fold, we evaluated performance on the validation split every 512 iterations and chose the best algorithm based on the highest validation AUROC. We used the best algorithm trained on each fold in an ensemble algorithm. During inference, each of the five algorithms in the ensemble produces a probability of TB and these probabilities are then averaged to get a final, ensemble probability.
For CAMs, we generated a CAM from each algorithm in the ensemble and then averaged them to get a single, ensemble CAM. The scaling is also based on the final, ensemble probability.
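The ensemble inference described above reduces to two averages, sketched here with illustrative per-fold outputs:

```python
import numpy as np

def ensemble_predict(fold_probs):
    """Average the five per-fold TB probabilities into one ensemble probability."""
    return float(np.mean(fold_probs))

def ensemble_cam(fold_cams, ensemble_prob):
    """Average the per-fold CAMs, then scale by the ensemble probability."""
    return np.mean(fold_cams, axis=0) * ensemble_prob

probs = [0.72, 0.65, 0.80, 0.70, 0.68]   # illustrative per-fold outputs
p = ensemble_predict(probs)
```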

Other details
Clinicians diagnosed the test cases on their own time. In addition, clinicians were given information on the duration of a patient's cough, although this information was not used to train the algorithm given its presence in only one of the datasets.