Circulating cell-free RNA in blood as a host response biomarker for detection of tuberculosis

Chang, Adrienne; Loy, Conor J.; Eweis-LaBolle, Daniel; Lenz, Joan S.; Steadman, Amy; Andgrama, Alfred; Nhung, Nguyen Viet; Yu, Charles; Worodria, William; Denkinger, Claudia M.; Nahid, Payam; Cattamanchi, Adithya; De Vlaminck, Iwijn

doi:10.1038/s41467-024-49245-6

Download PDF

Article
Open access
Published: 10 June 2024

Circulating cell-free RNA in blood as a host response biomarker for detection of tuberculosis

Nature Communications volume 15, Article number: 4949 (2024) Cite this article

4228 Accesses
2 Citations
6 Altmetric
Metrics details

Subjects

Abstract

Tuberculosis (TB) remains a leading cause of death from an infectious disease worldwide, partly due to a lack of effective strategies to screen and triage individuals with potential TB. Whole blood RNA signatures have been tested as biomarkers for TB, but have failed to meet the World Health Organization’s (WHO) optimal target product profiles (TPP). Here, we use RNA sequencing and machine-learning to investigate the utility of plasma cell-free RNA (cfRNA) as a host-response biomarker for TB in cohorts from Uganda, Vietnam and Philippines. We report a 6-gene cfRNA signature, which differentiates TB-positive and TB-negative individuals with AUC = 0.95, 0.92, and 0.95 in test, training and validation, respectively. This signature meets WHO TPPs (sensitivity: 97.1% [95% CI: 80.9-100%], specificity: 85.2% [95% CI: 72.4-100%]) regardless of geographic location, sample collection method and HIV status. Overall, our results identify plasma cfRNA as a promising host response biomarker to diagnose TB.

Multi-country evaluation of RISK6, a 6-gene blood transcriptomic signature, for tuberculosis diagnosis and treatment monitoring

Article Open access 01 July 2021

Combining host-derived biomarkers with patient characteristics improves signature performance in predicting tuberculosis treatment outcomes

Article Open access 09 July 2020

Blood RNA signatures predict recent tuberculosis exposure in mice, macaques and humans

Article Open access 09 October 2020

Introduction

The outcome of infection after exposure to Mycobacterium tuberculosis, the causative agent of tuberculosis (TB), can vary greatly and is determined by a complex interplay between host immunity and bacterial persistence¹. Current diagnostic tests for TB are not sensitive to the full spectrum of disease states, are unable to determine if an infection has been cleared, cannot distinguish between latent, incipient, and subclinical disease, and cannot predict progression to active TB^2,3. This lack of sensitivity and specificity in diagnostic tests presents a major challenge in the management and control of TB.

Transcriptomic signatures, or changes in host-cell gene expression, can provide valuable insight into the host response to TB. Yet, while several whole blood RNA (wbRNA) signatures have been identified, none have met the optimal target product profiles (TPPs) recommended by the World Health Organization (WHO) for a non-sputum-based triage or diagnostic test^4,5,6. In this study, we investigated plasma cell-free RNA (cfRNA) as a potential new class of host biomarkers for TB. cfRNA is an analyte that can provide information beyond what can be achieved with conventional wbRNA signatures for two key reasons. First, cfRNA is released by cells in both the blood compartment and from vascularized solid tissues and thus contains information about systemic immune dynamics and immune-tissue interactions. Second, cfRNA is released predominantly from dead or damaged cells as opposed to wbRNA, which originates from a mix of live and dead cells. Thus, cfRNA may provide far clearer insights into pathways of cell death and tissue injury that would not be available with wbRNA profiling.

To evaluate the performance of plasma cfRNA as a host signature of TB, we conducted unbiased profiling of plasma cfRNA in patient cohorts spanning three countries. Our analysis identified a 6-gene signature with accuracy of 91.8%, which exceeds the optimal WHO TPPs for a triage test. Comparison to matched whole blood samples highlighted the differences in the origin of plasma cfRNA and whole blood RNA. These results lay the groundwork for the development and clinical validation of a cfRNA host response test for the detection of TB.

Results

Plasma cell-free RNA signatures of tuberculosis

We performed a case-control study to investigate the cfRNA signatures of tuberculosis. We analyzed plasma samples from a total of 251 individuals with a cough lasting at least two weeks who were enrolled in three different cohorts (Cohort 1, Cohort 2, and Cohort 3) at outpatient clinics in Uganda, Vietnam, and the Philippines (Table 1 and Fig. 1A). Individuals included in the “TB positive” group were required to have 1) a positive Xpert MTB/RIF Ultra on sputum, urine, or contaminated Mycobacterial Growth Indicator Tube (MGIT) specimen; 2) a positive sputum MGIT or solid culture; or, 3) two trace Xpert Ultra results on sputum or contaminated MGIT. All other individuals (“TB negative” group) had at least one negative Xpert Ultra result, two negative cultures in MGIT or solid media, and repeated negative sputum tests and/or clinical improvement without TB treatment at two to three months follow-up. Of the 251 plasma samples analyzed in our study, 142 (56.6%) were from individuals diagnosed with microbiologically confirmed TB, and 37 (14.7%) were from individuals living with HIV. We used next-generation sequencing to profile the RNA in these blood samples. This yielded a mean of 31,737,786 reads per sample in plasma (range: 7,200,555 to 64,206,141 reads); Supplementary Fig. 1).

Table 1 Patient Information

Full size table

**Fig. 1: Plasma cell-free RNA profiling.**

To determine the cell types-of-origin of the cfRNA in our samples, we used a reference-based deconvolution algorithm (BayesPrism). This algorithm uses a reference dataset, in this case the Tabula Sapiens single-cell RNA-seq atlas, and bayesian inference to identify the cell types from which the cfRNA in our samples originated^7,8,9,10. We found that the plasma cfRNA in individuals with respiratory symptoms was primarily derived from platelets, myeloid progenitor cells, B cells, endothelial cells, natural killer cells, monocytes, and neutrophils (Fig. 1B). We observed batch effects in the platelet contribution. There was a significant difference in the platelet contribution between all three cohorts (mean fraction: 0.11, 0.61, 0.36 for Cohort 1, Cohort 2, and Cohort 3, respectively). These differences were primarily due to the different sample collection tubes and plasma collection procedures used by the Cohort 1, Cohort 2, and Cohort 3 studies (Methods).

To gain insight into the host response to TB and evaluate if cfRNA is informative of TB status, we performed differential abundance analysis on all 251 samples (DESeq2¹¹, Methods). Using a Benjamini-Hochberg adjusted p-value cutoff of <0.05, we identified 1956 differentially abundant genes between TB positive and negative groups (Fig. 1C and Supplementary Data 1). TB was associated with elevated levels of macrophage markers (MARCO, SOCS3, FCGR1A, MPO, C1QB), neutrophil markers (ELANE, FCGR3A, S100A8, S100A9, ERG), interferon genes (IFI27L1, IFIT2, IFIT3, IFITM3, IRF1), and antimicrobial genes (AZU1, CTSG, DEFA4, STAT1, GBP1, GBP2, GBP4, GBP5, GBP6) compared to the TB negative group. We also observed elevated levels of lung-specific markers (SFTPC, SFTPB, SLC34A2, SFTPA1, SFTPA2) in individuals with TB, providing insight into ongoing pathogenic processes^12,13. To confirm the relevance of these findings, we performed Ingenuity Pathway Analysis (Fig. 1D; Supplementary Data 2). The top canonical pathways enriched in TB (filtered by |-log(p-value)|>1.0 and |z-score | >2.55) included neutrophil degranulation, interferon gamma signaling, and antimicrobial peptide pathways. We observed similar enriched pathways using gene set enrichment of KEGG, GO and Hallmark pathways (Supplementary Data 3).

Evaluating machine learning algorithms for gene-expression-based TB diagnostics

The results from the differential expression analysis, including enrichment of clinically relevant pathways and satisfactory separation of sample groups using correlation-based clustering, suggested that differentially abundant genes in plasma cfRNA could be used to develop triage tests and diagnostic assays for TB. To test this, we split the samples into a “discovery” cohort (Cohort 1 and Cohort 2) and a “validation” cohort (Cohort 3). We then randomly divided the discovery cohort samples into a training set (70%) and a testing set (30%), ensuring that disease status, HIV status, and cohort were equally represented (Fig. 1E). Next, we performed differential abundance analysis on the training set and filtered features using their adjusted p-value (p < 0.05) and base mean (mean > 100). We then trained 15 machine-learning classification models using the genes that were selected after filtering (Methods; Supplementary Table 1).

Most of the models performed well, with 8/15 models having a train and test area under the receiver operating characteristic curve (ROC-AUC) > 0.9 (Fig. 1F). However, some models, such as the generalized linear model (GLM) and linear discriminant analysis (LDA), suffered from poor test performance. The poor performance of the GLM may be due to the lack of regularization, since the performance of the generalized linear models with Ridge and LASSO feature selection (GLMNETRIDGE and GLMNETLASSO) were on par with the other models, suggesting that there may be several highly correlated features. LDA appeared to suffer from overfitting, possibly due to high collinearity in the features.

6-gene biomarker panel surpasses guidelines for a non-sputum-based triage test

After evaluating the performance of the different machine learning models, we chose the greedy forward search as it performed with the highest test set ROC-AUC. The greedy forward search algorithm is initiated by choosing the gene with the most discriminatory power in terms of a training score. This score is calculated as: (ROC-AUC + Sensitivity + Specificity) at the optimal Youden threshold. In each subsequent iteration, the algorithm adds the gene that produces the greatest increase in this score until there is no longer any gene which can increase the score by more than 0.01. This yielded a 6-gene signature consisting of guanylate binding protein 5 (GBP5), BCL2/adenovirus E1B interaction protein 3-like (BNIP3L), kruppel-like factor 6 (KLF6), Dysferlin (DYSF), LIM and SH3 protein 1 (LASP1), and poly(rC)-binding protein 1 (PCBP1) (Fig. 2A and Supplementary Table 2).

**Fig. 2: Performance of the 6-gene signature identified in plasma cfRNA.**

We evaluated the utility of the 6-gene signature as a diagnostic or triage tool using ROC analysis on the training, test, and validation sets. We found that the discriminatory power of the signature was high in all three sets using a singular cutoff (0.498) determined by the optimal Youden threshold on the training set (training set ROC-AUC = 0.921, test set ROC-AUC = 0.947, and validation set ROC-AUC = 0.946; Fig. 2B). In the training set, the TB score discriminated between TB positive and negative groups with 91.5% accuracy, 94.5% sensitivity, and 87.7% specificity. In the test set, the TB score discriminated between TB positive and negative groups with 91.8% accuracy, 97.1% sensitivity, and 85.2% specificity. Finally, in the validation set, the TB score discriminated between TB positive and negative groups with 88.3% accuracy, 97.1% sensitivity, and 76.0% specificity (Fig. 2C). Although the biomarker panel falls short of the specificity requirements for a diagnostic test, it exceeds the optimal criteria for a triage test on the test set. On the validation set, the panel exceeds the optimal sensitivity target for a triage test and falls 4 percentage points below the optimal specificity target, while still satisfying the minimal specificity target (optimal: 95% sensitivity, 80% specificity; minimal: 90% sensitivity, 70% specificity; Methods).

GBP5 was identified as the gene with the strongest contribution to the performance of the 6-gene signature (Fig. 2A). We analyzed the effect of cohort and the association of GBP5 with the Xpert MTB/RIF Ultra semi-quantitative result (Fig. 2D). We found that the expression level of GBP5 in TB positive individuals is largely unaffected by cohort and sample collection (Supplementary Table 3). We also investigated the effect of country of origin and HIV status on the performance of the full 6-gene signature (Fig. 2E). We found that the removal of HIV-positive samples improved the performance of our model in both the test and validation set. This increase was largely due to the misclassification of HIV + /TB- patients, which accounted for 7 out of the 8 HIV+ samples that were misclassified. Notably, the signature retained a high accuracy for HIV + /TB+ individuals. Across a total of 18 HIV + /TB+ patients, the 6-gene signature misclassified only a single patient. To investigate the effect of country of origin, we calculated the ROC-AUC by country across all 251 samples and found that the model performance was robust against the country of origin, with all three countries retaining an AUC greater than 0.9 (Uganda = 0.93, Vietnam = 0.90, Philippines = 0.96, Fig. 2F). Collectively these results demonstrate that the discriminatory power of the 6-gene signature is independent of geographic area and cohort and is moderately affected by HIV status.

Next, we investigated the correspondence of the 6-gene score with established markers of TB. The Xpert Ultra semi-quantitative result is a PCR-based test that bins samples based on the cycle threshold (CT) of the first positive probe that detects M. tuberculosis as follows: “low”=22 < CT ≤ 38, “medium”=16 < CT ≤ 22, and “high”=CT ≤ 16¹⁴. We found that the TB score was highly correlated with the Xpert Ultra semi-quantitative result in all three cohorts (polychoric correlation = 0.757; Fig. 2G and Supplementary Tables 4-5). Next, we compared the 6-gene signature to chest X-ray (CXR) scores which have been used as complementary information in determining the host response to TB. Recent developments in computer-aided detection (CAD) software have improved the performance of CXRs, but still fall short of the WHO TPPs for a triage test¹⁵. Three commercially available CAD applications were used to analyze CXRs from individuals enrolled in cohorts 2 and 3: CAD4TB (Delft Imaging, Netherlands), qXR (Quire.ai, India), and Lunit Insight chest X-ray TB algorithm (Lunit Inc., South Korea). The TB abnormality score is reported on a scale of 0-100 (for CAD4TB and Lunit) or 0-1 (for qXR), where a higher value indicates a more abnormal radiograph. Despite performance differences between each of the CAD algorithms, we found that the 6-gene TB score is moderately correlated with the reported TB abnormality scores (Pearson r = 0.50-0.62; Fig. 2H), indicating that the gene expression signature may provide information associated with known metrics of host response. Collectively, these results indicate that the 6-gene signature has the potential to serve as a triage tool for TB.

Comparison of plasma cfRNA biomarkers to whole blood RNA and protein studies

While many potential wbRNA signatures have been identified for the diagnosis of TB disease in clinically-relevant populations, only a handful meet the minimal WHO criteria for a triage test and none meet the minimal WHO criteria for a diagnostic test^4,16,17. Unsurprisingly, the performance of these wbRNA signatures varies widely when applied to multi-country cohorts, as diagnostic performance heterogeneity between populations is a well-known barrier to the development of triage and diagnostic assays¹⁸.

To compare the utility of cfRNA and wbRNA in relation to TB, we compared the 6-gene cfRNA signature to previously reported wbRNA signatures. Among the wbRNA signatures that meet the minimal WHO criteria for a triage test, 7 have been evaluated for their performance in distinguishing TB from other diseases in a multi-cohort population (Berry86¹⁹, daCosta2²⁰, daCosta3²⁰, Kaforou44²¹, Walter47²², Zak16⁵, and Sweeney3²³; Fig. 3A). We found that our 6-gene signature compares favorably to these wbRNA panels (Fig. 3B). We also noted that only one of the 6 genes in the cfRNA signature (GBP5) overlaps with those reported in previous wbRNA studies (Fig. 3A, Supplementary Data 4)⁴.

**Fig. 3: Comparison of the 6-gene plasma signature with whole blood signatures of TB from literature.**

To enable a direct comparison of the origin of cfRNA and wbRNA, we sequenced the wbRNA of 60 matched individuals from cohort 3 (mean reads: 13,521,651, range: 1,798,585 to 36,030,153). We then quantified the cell types that contribute RNA to the mixtures in plasma and whole blood samples collected from Cohort 3. We found that the origin of cfRNA and wbRNA is distinct (Fig. 3C). Plasma cfRNA is derived from both cells in the blood compartment and from vascularized solid tissues (solid organ fraction 0.23, 0.39, and 0.33 for Cohort 1, Cohort 2, and Cohort 3 respectively), while wbRNA is predominately released by circulating immune cells (solid organ fraction of 0.025). To further evaluate the differences between cfRNA and whole blood RNA, we performed differential abundance analysis of cfRNA and wbRNA between TB positive and TB negative samples (DESeq2). We identified 2018 and 1956 differentially abundant genes in wbRNA and plasma cfRNA, respectively, 517 of which were shared (Benjamini-Hochberg adjusted p-value < 0.05, Fig. 3D; Supplementary Table 6; Supplementary Data 5). From our 6-gene signature, three genes were also identified as differentially abundant in our matched whole blood samples: GBP5, DYSF, and BNIP3L. This analysis provides evidence that plasma cfRNA can provide insights into mechanisms of cellular injury that are not accessible with wbRNA profiling.

GBP5 is the gene most frequently reported across studies, occurring in our 6-gene cfRNA signature and in 5/7 whole blood signatures. GBP5 has been shown to be critical for NLRP3 inflammasome activation, which mediates caspase-1 activation and secretion of proinflammatory cytokines in response to pathogenic bacteria and cellular damage^24,25. Three wbRNA studies have found that GBP5 alone has high discriminatory performance: da Costa et al.²⁰ (sensitivity = 93%, specificity = 86%, ROC-AUC = 0.924), Francisco et al.²⁶ (sensitivity = 88.2%, specificity = 78.5%, ROC-AUC = 0.86), and Sutherland et al.²⁷ which uses a Xpert-MTB-HR prototype with the Sweeney3²³ signature (sensitivity = 86%, specificity = 76%, ROC-AUC = 0.87). Similarly, we find that the abundance of GBP5 in plasma cfRNA can separate TB and TB negative samples with a sensitivity of 83%, specificity of 83%, and an ROC-AUC of 0.88 (Fig. 3E and F). Whole blood GBP5 protein levels have also been evaluated as a potential biomarker for TB, and have been demonstrated by Yao et al. to be significantly higher in people with microbiologically confirmed TB²⁸. However, the performance of the blood protein biomarker at the optimal threshold (sensitivity = 0.78, specificity = 0.67, ROC-AUC = 0.76) is worse than the plasma RNA biomarker. This suggests that plasma cfRNA may be a more effective biomarker for TB than whole blood GBP5 protein levels.

Discussion

In this study, we investigated the use of plasma cfRNA as a host-specific biomarker for the detection of TB. Using RNA sequencing and machine learning, we developed a 6-gene signature that can accurately distinguish individuals with microbiologically diagnosed TB. Our signature has a diagnostic accuracy, sensitivity, and specificity of 91.5%, 94.5%, and 87.7% in the training dataset, 91.8%, 97.1%, and 85.2% in the test dataset, and 88.3%, 97.1%, and 76.0% in the validation dataset. These metrics compare favorably to the performance of previous whole blood signatures. This is particularly important because non-sputum-based biomarker tests are a priority for TB management, as many high-risk populations are unable to produce high-quality sputum samples for diagnosis.

Our study is the first to investigate the potential of plasma cfRNA as a host-specific blood-borne biomarker for TB. Given that plasma cfRNA is a novel bioanalyte to monitor TB, these results are potentially important in at least three ways. First, our results suggest that signatures derived from plasma cfRNA may be more robust than those derived from whole blood RNA (wbRNA), which have shown poor performance in independent validation studies^4,17. This may be due to the sensitivity of whole blood signatures to differences in patient cohorts and sample processing. We found that the plasma cfRNA signature is robust against these factors, making it a potentially more reliable source for TB biomarker discovery. Second, plasma cfRNA is stable, can be assayed using a small amount of plasma, and can provide insight into host cellular injury. Finally, future studies could explore the use of combined wbRNA and cfRNA bioanalytes for the development of more sensitive and specific diagnostic assays for TB.

Our study has several limitations that should be considered when interpreting the results. First, individuals were enrolled in this study under the presumption of having TB, and those diagnosed as not having TB reflect the local distribution of other conditions that also present with a cough of at least 2 weeks in duration. Cross-sectional studies including people with clinically diagnosed TB are needed to further validate accuracy. Second, the plasma sample collection method was not uniform across cohorts. While we retained high accuracy regardless of the collection tube and sample preparation method, further studies evaluating the impact of RNA preservatives and platelet contribution on the diagnostic performance may find improved performance. Third, we did not investigate patients with extrapulmonary TB, latent TB infection, or other forms of TB. Despite these limitations, our observations suggest that plasma cfRNA has the capability to serve as a proxy for tissue-specific changes in gene expression and may more adequately capture the host response to TB than wbRNA. Further research is needed to confirm these findings and develop a point-of-care, gene expression-based assay based on the plasma cfRNA signature. Such an assay could be a valuable tool in the early detection of TB and help improve the management and control of this disease.

Methods

Ethics statement

All patients provided written informed consent, and all experiments were performed in accordance with relevant guidelines and regulations. The protocols for this study were approved locally at each site by Institutional Review Boards at Cornell University (protocols IRB0145569, 1902008555); UCSF IRB 20-32670 (protocol 20-32670); University of Heidelberg Ethics Committee of the Medical Faculty (S-539/2020); the Makerere University, College of Health Sciences, School of Medicine, Research Ethics Committee 2020-182 (protocol 2017-020); Vietnam National Lung Hospital Ethical Committee for Biological Medical Research: 566/2020/NCKH (protocol 566/2020/NCKH), and De La Salle Health Sciences Institute Independent Ethics Committee 2020-33-02-A (protocol 2020-33-02-A).

Sample collection

Plasma samples were collected as part of three different cohorts with separate sample collection methodologies. All three cohorts enrolled adults with cough ≥2 weeks identified at outpatient clinics in Uganda (Cohort 1 and Cohort 3) and in Uganda, Vietnam, and the Philippines (Cohort 2).

Peripheral blood samples from individuals enrolled in Cohort 1 were collected in Streck Cell-Free DNA blood collection tubes (Streck, 230257). Plasma was separated by centrifugation at 1600 × g for 10 minutes at ambient temperature. The remaining cellular debris was removed by an additional centrifugation step at 16,000 × g for 10 minutes at ambient temperature. Plasma was stored in 1 mL aliquots at −80 °C.

Plasma samples from individuals enrolled in Cohort 2 and Cohort 3 in sites in Uganda, Vietnam, and the Philippines were collected in K2EDTA blood collection tubes (BD Diagnostics, 366643). Plasma was separated by centrifugation at 1600 × g for 10 minutes at ambient temperature in a horizontal rotor (swing out head). Plasma was similarly stored in 1 mL aliquots at −80 °C. Plasma sample volumes ranged from 100 µL to 1000 µL (Mean: 727 µL ± 227 µL, Supplementary Fig. 2). The plasma volume did not affect the measured cfRNA signature score (Supplementary Fig. 2).

Whole blood samples from individuals enrolled in the Cohort 3 in Uganda were collected in PAXgene Blood RNA tubes (Qiagen, 762165). The 2.5-mL PAXgene tubes were stored at −80 °C. Whole blood sample volumes were uniform at 700 µL.

Plasma cfRNA isolation and library preparation

Plasma samples were received on dry ice and stored at −80 °C until processed. Prior to cfRNA extraction, plasma samples were thawed at room temperature and centrifuged at 1300 x g for 10 minutes at 4 °C. cfRNA was extracted from plasma using the Norgen Plasma/Serum Circulating and Exosomal RNA Purification Mini Kit (Norgen, 51000). Extracted RNA was DNase treated with a combination of 10 µl DNase Turbo Buffer (Invitrogen, AM2238), 3 µl DNase Turbo (Invitrogen, AM2238), and 1 µl Baseline Zero DNase (Lucigen-Epicenter, DB0715K) for 30 minutes at 37 °C, then concentrated into 12 µl using the Zymo RNA Clean and Concentrated Kit (Zymo, R1015).

Sequencing libraries were prepared from 8 µl of concentrated RNA using the Takara SMARTer Stranded Total RNA-Seq Kit v3 - Pico Input Mammalian (Takara, 634485) and barcoded using the SMARTer RNA Unique Dual Index Kit (Takara, 634451). Library concentration was quantified using the Qubit 3.0 Fluorometer (Invitrogen, Q33216) with the dsDNA HS Assay Kit (Invitrogen, Q32854). Libraries were quality-controlled using the Agilent Fragment Analyzer 5200 (Agilent, M5310AA) with the HS NGS Fragment Kit (Agilent, DNF-474-0500) and pooled to equal concentrations. Each pool was sequenced using both the Illumina NextSeq 500/550 platform (paired-end, 150 bp) and the Illumina NextSeq 2000 platform (paired-end, 100 bp).

Whole blood RNA isolation and library preparation

Whole blood samples were received on dry ice and stored at −80 °C until processed. Prior to RNA extraction, whole blood samples were thawed at 37 °C. RNA was extracted from whole blood samples using the Quick-RNA Whole Blood Kit (Zymo, R1201) following the manufacturer’s instructions. RNA was processed using the NEBNext Globin & rRNA Depletion Kit (Human/Mouse/Rat; NEB, E7750X), and libraries were created using the NEBNext Ultra II Directional RNA Library Prep Kit (NEB, E7760 and E7765) following manufacturer’s specifications.

Sequencing libraries were quantified using the Qubit 3.0 Fluorometer (Invitrogen, Q33216) with the dsDNA HS Assay Kit (Invitrogen, Q32854). Libraries were quality-controlled using the Agilent Fragment Analyzer 5200 (Agilent, M5310AA) with the HS NGS Fragment Kit (Agilent, DNF-474-0500) and pooled to equal concentrations. Each pool was sequenced using the Illumina NextSeq 2000 platform (paired-end, 100 bp).

Bioinformatic processing and sample quality filtering

Sequencing data was processed using a custom bioinformatics pipeline. Since pools sequenced on the Illumina NextSeq 2000 were optimized to produce paired-end, 61 bp reads, matched sequencing data from the Illumina NextSeq 500 were trimmed using Seqtk (v1.2). Samples were then quality filtered and trimmed using BBDuk (v38.90), and aligned to the Gencode GRCh38 human reference genome (v38, primary assembly) using STAR²⁹ (v2.7.0 f, default parameters). Prior to feature quantification using featureCounts³⁰ (v2.0.0), samples were deduplicated using Picard MarkDuplicates (v2.19.2). Mitochondrial, ribosomal, X and Y chromosome genes were bioinformatically removed prior to analysis.

Samples were filtered on the basis of DNA contamination, rRNA contamination, total counts, and RNA degradation. DNA contamination was estimated by calculating the ratio of reads mapping to introns and exons. rRNA contamination was measured using SAMtools (v1.14). Total counts were calculated using featureCounts³⁰ (v2.0.0). Degradation was estimated by calculating the 5’-3’ bias using Qualimap³¹ (v2.2.1).

Samples were removed from analysis if either the intron to exon ratio was greater than 4, if a sample had less than 90,000 total feature counts, or if the rRNA contamination, total counts, or 5’-3’ bias was greater than 4 standard deviations from the mean.

Cell type deconvolution

Cell-type deconvolution was performed using BayesPrism⁹ (v1.1) with the Tabula Sapiens single-cell RNA-seq atlas¹⁰ (Release 1) as a reference. Cells from the Tabula Sapiens atlas were grouped as previously described in Voreperian et al.⁸. Cell types with more than 100,000 unique molecular identifiers (UMIs) were included in the reference and subsampled to 300 cells using ScanPy³² (v1.8.1).

Differential abundance analysis

Comparative analysis of differentially expressed genes was performed using a negative binomial model as implemented in DESeq2¹¹ (v1.34.0). Heatmaps were constructed using the pheatmap package in R (v1.0.12). Samples and genes were clustered using correlation-based hierarchical clustering. Canonical pathways, diseases and functions were analyzed using QIAGEN Ingenuity Pathway Analysis software (v73620684).

Machine learning and model training

Machine learning and model training were performed using R (v4.1.3) with the DESeq2¹¹ (v1.34.0), Caret³³ (v6.0.90), and pROC³⁴ (v1.18.0) packages. Sample metadata and count matrices were split 70/30 into a training set and a test set. To ensure a representative split, we assigned a label to each sample indicating its status for the variables of cohort, HIV status, and TB status. We then distributed samples into train and test using this label to ensure equal representation of cohort, HIV, and TB in both sets. The validation set (n = 60) was set aside for the evaluation of our model performance.

Features for model training were selected using differential abundance analysis. We excluded genes with a base mean of less than 100 and a Benjamini-Hochberg adjusted p-value greater than 0.05. The remaining 113 genes were selected for model training. Machine learning algorithms were trained using 5-fold cross-validation and grid search hyperparameter tuning. Accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (ROC-AUC) were used to measure test performance. The classification models used were generalized linear models with Ridge and LASSO feature selection (GLMNETRIDGE and GLMNETLASSO), support vector machines with linear and radial basis function kernel (SVMLin and SVMRAD), random forest (RF), random forest ExtraTrees (EXTRATREES), neural networks (NNET), linear discriminant analysis (LDA), nearest shrunken centroids (PAM), C5.0 (C5), k-nearest neighbors (KNN), naive bayes (NB), CART (RPART), and generalized linear models (GLM). An additional model was trained and tested using a greedy forward search algorithm (GFS). Briefly, this algorithm iterates over the list of genes and evaluates each individual gene’s discriminatory power by training a generalized linear model (GLM). The discriminatory power is evaluated based on a combined score, which is calculated as: (AUC + Sensitivity + Specificity), in which sensitivity and specificity are calculated at the optimal yield threshold. The gene that results in the highest score is added to the model. At each subsequent iteration over the list of candidate genes, the algorithm will attempt to add a gene to the model in this way. The algorithm stops when there is no gene in the list that can increase the score by more than 0.01.

WHO TPP thresholds

The broader diagnostic potential of our 6-gene signature was informed by the WHO target product profiles (TPP’s). The WHO TPP’s provide the minimal and optimal characteristics of a target product, where the minimal thresholds indicate the lowest acceptable standard, and the optimal criteria outline the ideal target profile. For a non-sputum-based diagnostic test, the overall minimal and optimal sensitivities are 65% and 80% with a specificity of 98%. For a non-sputum-based triage test, the optimal requirements are 95% sensitivity and 80% specificity, while the minimal criteria are 90% sensitivity and 70% specificity.

Quantification and statistical analyzes

All statistical analyses were performed using R (v4.1.0). Statistical significance was tested using Wilcoxon signed-rank tests, and Mann-Whitney U tests in a two-sided manner unless otherwise stated. Boxes in the boxplots indicate the 25th and 75th percentiles, the band in the box represents the median, and whiskers extend to 1.5 × interquartile range of the hinge. All sequencing data was aligned to the GRCh38 Gencode v38 Primary Assembly, and features were counted using the GRCH38 Gencode v38 Primary Assembly Annotation.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The raw sequencing data and de-identified RNA-seq count matrices generated in this study have been deposited in the Gene Expression Omnibus under the accession codes GSE255071, GSE255073, GSE255074. All data are included in the Supplementary Information or are available from the authors, as are unique reagents used in this Article. The raw numbers for charts and graphs are available in the Source Data file whenever possible. Source data are provided in this paper.

Code availability

All code is available on GitHub at (https://github.com/DanielEweisLaBolle/cfRNA_TB).

References

Chandra, P., Grigsby, S. J. & Philips, J. A. Immune evasion and provocation by Mycobacterium tuberculosis. Nat. Rev. Microbiol. 20, 750–766 (2022).
Article CAS PubMed PubMed Central Google Scholar
Singhania, A., Wilkinson, R. J., Rodrigue, M., Haldar, P. & O’Garra, A. The value of transcriptomics in advancing knowledge of the immune response and diagnosis in tuberculosis. Nat. Immunol. 19, 1159–1168 (2018).
Article CAS PubMed PubMed Central Google Scholar
Drain, P. K. Incipient and Subclinical Tuberculosis: a Clinical Review of Early Stages and Progression of Infection. Clin. Microbiol. Rev. 31, 00021–18 (2018).
Article Google Scholar
Mulenga, H. Performance of diagnostic and predictive host blood transcriptomic signatures for Tuberculosis disease: A systematic review and meta-analysis. PLOS ONE 15, 0237574 (2020).
Article Google Scholar
Zak, D. E. A blood RNA signature for tuberculosis disease risk: a prospective cohort study. Lancet Lond. Engl. 387, 2312–2322 (2016).
Article CAS Google Scholar
World Health Organization. High Priority Target Product Profiles for New Tuberculosis Diagnostics: Report of a Consensus Meeting, (Geneva, Switzerland. 2014).
Loy, C. J. et al. Nucleic acid biomarkers of immune response and cell and tissue damage in children with COVID-19 and MIS-C. Cell Rep. Med. 4, 101034 (2023).
Article CAS PubMed PubMed Central Google Scholar
Vorperian, S. K., Moufarrej, M. N. & Quake, S. R. Cell types of origin of the cell-free transcriptome. Nat. Biotechnol. 40, 855–861 (2022).
Article CAS PubMed PubMed Central Google Scholar
Chu, T., Wang, Z., Pe’er, D. & Danko, C. G. Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat. Cancer 3, 505–517 (2022).
Article CAS PubMed PubMed Central Google Scholar
Consortium, T. T. S. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science 376, eabl4896 (2022).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
Article PubMed PubMed Central Google Scholar
Xiong, M., Heruth, D. P., Zhang, L. Q. & Ye, S. Q. Identification of lung‐specific genes by meta‐analysis of multiple tissue RNA‐seq data. FEBS Open. Bio 6, 774–781 (2016).
Article CAS PubMed PubMed Central Google Scholar
Lindskog, C. The lung-specific proteome defined by integration of transcriptomics and antibody-based profiling. FASEB J. 28, 5184–5196 (2014).
Article CAS PubMed Google Scholar
Fradejas, I., Ontanon, B., Munoz-Gallego, I., Ramirez-Vela, M. J. & Lopez-Roa, P. The value of xpert MTB/RIF-generated CT for predicting the smear status of patients with pulmonary tuberculosis. J. Clin. Tuberc. Mycobact. Dis. 13, 9–12 (2018).
Google Scholar
Kik, S. V. Diagnostic accuracy of chest X-ray interpretation for tuberculosis by three artificial intelligence-based software in a screening use-case: an individual patient meta-analysis of global data. medRxiv https://doi.org/10.1101/2022.01.24.22269730 (2022).
Gupta, R. K. Concise whole blood transcriptional signatures for incipient tuberculosis: a systematic review and patient-level pooled meta-analysis. Lancet Respir. Med. 8, 395–406 (2020).
Article PubMed PubMed Central Google Scholar
Mendelsohn, S. C. Prospective multicentre head-to-head validation of host blood transcriptomic biomarkers for pulmonary tuberculosis by real-time PCR. Commun. Med. 2, 1–13 (2022).
Article Google Scholar
Macaskill, P., Gatsonis, C., Deeks, J., Harbord, R. & Takwoingi, Y. Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy (eds. Deeks, J. Bossuyt, P. & Gatsonis, C.) Ch. 10 (The Cochrane Collaboration, 2010).
Berry, M. P. R. An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis. Nature 466, 973–977 (2010).
Article ADS CAS PubMed PubMed Central Google Scholar
Costa, L. A real-time PCR signature to discriminate between tuberculosis and other pulmonary diseases. Tuberculosis 95, 421–425 (2015).
Article PubMed Google Scholar
Kaforou, M. Detection of Tuberculosis in HIV-Infected and -Uninfected African Adults Using Whole Blood RNA Expression Signatures: A Case-Control Study. PLOS Med. 10, 1001538 (2013).
Article Google Scholar
Walter, N. D. Blood Transcriptional Biomarkers for Active Tuberculosis among Patients in the United States: a Case-Control Study with Systematic Cross-Classifier Evaluation. J. Clin. Microbiol. 54, 274–282 (2016).
Article CAS PubMed PubMed Central Google Scholar
Sweeney, T. E., Braviak, L., Tato, C. M. & Khatri, P. Genome-wide expression for diagnosis of pulmonary tuberculosis: a multicohort analysis. Lancet Respir. Med. 4, 213–224 (2016).
Article CAS PubMed PubMed Central Google Scholar
Shenoy, A. R. GBP5 Promotes NLRP3 Inflammasome Assembly and Immunity in Mammals. Science 336, 481–485 (2012).
Article ADS CAS PubMed Google Scholar
Kelley, N., Jeltema, D., Duan, Y. & He, Y. The NLRP3 Inflammasome: An Overview of Mechanisms of Activation and Regulation. Int. J. Mol. Sci. 20, 3328 (2019).
Article CAS PubMed PubMed Central Google Scholar
Francisco, N. M. Diagnostic accuracy of a selected signature gene set that discriminates active pulmonary tuberculosis and other pulmonary diseases. J. Infect. 75, 499–510 (2017).
Article PubMed Google Scholar
Sutherland, J. S. Diagnostic Accuracy of the Cepheid 3-gene Host Response Fingerstick Blood Test in a Prospective, Multi-site Study: Interim Results. Clin. Infect. Dis. 74, 2136–2141 (2022).
Article CAS PubMed Google Scholar
Yao, X. Whole blood GBP5 protein levels in patients with and without active tuberculosis. BMC Infect. Dis. 22, 328 (2022).
Article CAS PubMed PubMed Central Google Scholar
Dobin, A. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Article CAS PubMed Google Scholar
Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).
Article CAS PubMed Google Scholar
García-Alcalde, F. Qualimap: evaluating next-generation sequencing alignment data. Bioinformatics 28, 2678–2679 (2012).
Article PubMed Google Scholar
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Article PubMed PubMed Central Google Scholar
Kuhn, M. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 28, 1–26 (2008).
Article Google Scholar
Robin, X. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinforma. 12, 77 (2011).
Article Google Scholar

Download references

Acknowledgements

We thank the Cornell Genomics Center and the UCSF Center for Advanced Technology for help with sequencing libraries. We thank the patients and their families for contributing their blood to further our understanding of TB. This work was supported by the National Institutes of Health (NIH) grants R01AI146165, R21AI133331, R21AI124237, R01AI151059. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

These authors contributed equally: Adrienne Chang, Conor J. Loy, Daniel Eweis-LaBolle.

Authors and Affiliations

Meinig School of Biomedical Engineering, Cornell University, Ithaca, NY, USA
Adrienne Chang, Conor J. Loy, Daniel Eweis-LaBolle, Joan S. Lenz & Iwijn De Vlaminck
Global Health Labs, Inc., Bellevue, WA, USA
Amy Steadman
World Alliance for Lung and Intensive Care Medicine in Uganda, Kampala, Uganda
Alfred Andgrama & William Worodria
National Lung Hospital, Hanoi, Vietnam
Nguyen Viet Nhung
De La Salle Medical and Health Sciences Institute, Dasmarinas, Philippines
Charles Yu
University Hospital Heidelberg & German Center of Infection Research, Heidelberg, Germany
Claudia M. Denkinger
UCSF Center for Tuberculosis, University of California San Francisco, San Francisco, CA, USA
Payam Nahid & Adithya Cattamanchi
Division of Pulmonary and Critical Care Medicine, University of California Irvine, Orange, CA, USA
Adithya Cattamanchi

Authors

Adrienne Chang
View author publications
You can also search for this author in PubMed Google Scholar
Conor J. Loy
View author publications
You can also search for this author in PubMed Google Scholar
Daniel Eweis-LaBolle
View author publications
You can also search for this author in PubMed Google Scholar
Joan S. Lenz
View author publications
You can also search for this author in PubMed Google Scholar
Amy Steadman
View author publications
You can also search for this author in PubMed Google Scholar
Alfred Andgrama
View author publications
You can also search for this author in PubMed Google Scholar
Nguyen Viet Nhung
View author publications
You can also search for this author in PubMed Google Scholar
Charles Yu
View author publications
You can also search for this author in PubMed Google Scholar
William Worodria
View author publications
You can also search for this author in PubMed Google Scholar
Claudia M. Denkinger
View author publications
You can also search for this author in PubMed Google Scholar
Payam Nahid
View author publications
You can also search for this author in PubMed Google Scholar
Adithya Cattamanchi
View author publications
You can also search for this author in PubMed Google Scholar
Iwijn De Vlaminck
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

A. Chang, A.S., and I.D.V. contributed to the study design. J.S.L. performed the experiments. A. Chang, C.J.L, D.E.L, and I.D.V. analyzed the data. A.S., A. Cattamanchi, A.A., N.V.N., C.Y., W.W., C.M.D., and P.N. facilitated data collection. A. Chang, C.J.L, D.E.L, and I.D.V. wrote the manuscript. All authors provided comments and edits.

Corresponding author

Correspondence to Iwijn De Vlaminck.

Ethics declarations

Competing interests

A. Chang, C.J.L, D.E.L, and I.D.V are inventors on submitted patents pertaining to cell-free nucleic acids (US patent applications 63/237,367 and 63/429,733). I.D.V. is a member of the Scientific Advisory Board of Karius Inc., Kanvas Biosciences, and GenDX. I.D.V. is listed as an inventor on submitted patents pertaining to cell-free nucleic acids (US patent applications 63/237,367, 63/056,249, 63/015,095, 16/500,929, 41614P-10551-01-US) and receives consulting fees from Eurofins Viracor. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Javed Agrewala, Andrew DiNardo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Supplementary Data 4

Supplementary Data 5

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Chang, A., Loy, C.J., Eweis-LaBolle, D. et al. Circulating cell-free RNA in blood as a host response biomarker for detection of tuberculosis. Nat Commun 15, 4949 (2024). https://doi.org/10.1038/s41467-024-49245-6

Download citation

Received: 08 March 2023
Accepted: 29 May 2024
Published: 10 June 2024
DOI: https://doi.org/10.1038/s41467-024-49245-6

This article is cited by

Accuracy of cell-free Mycobacterium tuberculosis DNA testing in pleural effusion for diagnosing tuberculous pleurisy: a multicenter cross-sectional study
- Wei-Li Du
- Jian-Qin Liang
- Nan-Ying Che
Military Medical Research (2024)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.