Metagenomic DNA sequencing to quantify Mycobacterium tuberculosis DNA and diagnose tuberculosis

Tuberculosis (TB) remains a significant cause of mortality worldwide. Metagenomic next-generation sequencing has the potential to reveal biomarkers of active disease, identify coinfection, and improve detection for sputum-scarce or culture-negative cases. We conducted a large-scale comparative study of 428 plasma, urine, and oral swab samples from 334 individuals from TB endemic and non-endemic regions to evaluate the utility of a shotgun metagenomic DNA sequencing assay for tuberculosis diagnosis. We found that the composition of the control population had a strong impact on the measured performance of the diagnostic test: the use of a control population composed of individuals from a TB non-endemic region led to a test with nearly 100% specificity and sensitivity, whereas a control group composed of individuals from TB endemic regions exhibited a high background of nontuberculous mycobacterial DNA, limiting the diagnostic performance of the test. Using mathematical modeling and quantitative comparisons to matched qPCR data, we found that the burden of Mycobacterium tuberculosis DNA constitutes a very small fraction (0.04 or less) of the total abundance of DNA originating from mycobacteria in samples from TB endemic regions. Our findings suggest that the utility of a minimally invasive metagenomic sequencing assay for pulmonary tuberculosis diagnostics is limited by the low burden of M. tuberculosis and an overwhelming biological background of nontuberculous mycobacterial DNA.


Scientific Reports
| (2022) 12:16972 | https://doi.org/10.1038/s41598-022-21244-x www.nature.com/scientificreports/ biofluids (e.g., urine and plasma) can be used to monitor infection even if the biofluid is not downstream of or sampled from the site of injury. For example, circulating plasma DNA has been used to monitor infection and rejection after lung transplantation 5 . Numerous studies have demonstrated that DNA from Mycobacterium tuberculosis, the causative agent of tuberculosis, can be detected in plasma, urine, and oral fluids [6][7][8] . However, its diagnostic performance in distinguishing positive sputum culture from negative sputum culture samples has fallen short of the performance standards set by the WHO (98% specificity, 80% sensitivity) 9 . We explored the utility of a metagenomic sequencing assay for tuberculosis across three biofluids and four geographic regions and found that diagnostic performance is highly influenced by the geographic origin of the control cohort ( Fig. 1). Using simulation and modeling, we found that the diagnostic performance is correlated with the abundance of M. tuberculosis DNA relative to the background of DNA from nontuberculous mycobacteria. The background of DNA originating from nontuberculous mycobacteria is low in samples from TB non-endemic regions but overwhelms and obscures the M. tuberculosis signal in samples from TB endemic regions. Our study provides insight into the burden and properties of M. tuberculosis in different biofluids and can inform the development of molecular tests that achieve the requisite standards for sensitivity and specificity.

Methods
Study cohorts. A total of 428 datasets from plasma, urine, and oral swab samples were analyzed, 191 of which were generated for this study (Table 1). Endemic plasma and urine samples were collected from patients seeking treatment for environmental enteropathy in Peru. The study was approved by the Johns Hopkins Bloomberg School of Public Health Institutional Review Board (protocol 00002185) and the Cornell University Institutional Review Board (protocol 1612006853). Tuberculosis plasma samples were collected from patients presenting with respiratory symptoms from tuberculosis clinics in Peru. The study was approved by the Foundation for Innovative New Diagnostics' Clinical Trials Review Committee and the Cornell Institutional Review Board (protocol 1612006851). Oral swab samples were collected from individuals who presented with symptoms of respiratory illness at outpatient clinics in Uganda. The study was approved by the Makerere University School of Medicine Research and Ethics Committee (protocol 2017-020). Additional oral swab samples were collected from healthy volunteers at Cornell University. The study was approved by the Cornell University Institutional Review Board (protocol 1910009101).
Additional datasets collected in the scope of previous studies were included as follows. Non-endemic urine samples were collected from kidney transplant patients who received care at New York Presbyterian   12 . All patients provided written informed consent and all experiments were performed in accordance with relevant guidelines and regulations. The geographic distribution of the datasets included in the study were mapped using the geom_map() function from the R package ggplot2.
Sample collection. Urine samples were collected via the conventional clean-catch midstream method. For the endemic cohort, approximately 50 mL of urine was centrifuged at 3000×g on the same day for 30 min and the supernatant was stored in 1 mL aliquots at −80 °C. For the tuberculosis cohort, approximately 10 mL of urine was mixed with 2 mL Streck Cell-Free DNA Urine Preserve (Streck, Cat #230604) and centrifuged at 3000×g for 30 min at ambient temperature within 30 min of specimen collection. The supernatant was similarly stored in 1 mL aliquots at −80 °C.
Peripheral blood samples were collected in EDTA. Plasma was separated by centrifugation at 1600×g for 10 min followed by centrifugation at 16,000×g for 10 min. Plasma was stored in 1 mL aliquots at −80 °C.
Oral swab samples were collected by swabbing the tongue for 15 s with rotation using a Copan regular flocked swab with molded breakpoint at 30 mm (Copan, Cat #520CS01). The swab head was then broken off into a collection tube containing 1× TE buffer, vortexed for 30 s, and stored at −80 °C. DNA isolation and library preparation. DNA was extracted from plasma and urine using the QIAamp Circulating Nucleic Acid Kit according to the manufacturer's instructions (Qiagen, Cat #55114). Sequencing libraries were prepared using a single-stranded library preparation as described in Burnham et al. 13 . DNA was extracted from oral swab samples using the QIAamp UCP Pathogen Kit according to the manufacturer's instructions (Qiagen, Cat #50214) and libraries were prepared using the Nextera XT DNA Library Prep Kit (Illumina, Cat #FC-131-1024). All libraries were characterized using the AATI Fragment Analyzer before pooling and sequencing on the Illumina NextSeq 500 platform (paired-end, 2 × 75 bp).

Fragment length distribution and quantification of M. tuberculosis. Low-quality bases and
Illumina-specific sequences were trimmed (Trimmomatic-0.32, LEADING:25 TRAILING:25 SLIDINGWIN-DOW:4:30 MINLEN:15) 14 . Reads were aligned (bwa mem 15 ) to the human reference (UCSC hg19). Reads that did not align to the human genome reference were extracted and aligned to the bacteriophage phiX174.
To obtain the fragment length distribution for M. tuberculosis DNA, unaligned reads were extracted again and aligned to the circularized M. tuberculosis H37Rv genome (edited from NC_000962.3). Reads aligning with a minimum mapping quality of 30 were extracted, and the lengths computed as the absolute difference between the start and end coordinates. The fragment length density profile was computed using the hist function from the R Graphics Package.
To quantify the abundance of M. tuberculosis, reads that did not align to the human and phiX references were extracted and aligned to a curated list of bacterial and viral reference genomes using kallisto (--pseudobam) 16,17 . Reads aligning to the 16S/23S ribosomal RNA region were removed. Microbial abundance was quantified as Reads per Million reads (RPM):

Quantification of insertion sequence and genome coverage.
Reads that did not align to the human and phiX references were extracted and aligned (bwa mem 15 ) to the circularized M. tuberculosis H37Rv genome (edited from NC_000962.3), to the IS6110 and IS1081 sequences retrieved from the GenBank repository for M. tuberculosis H37Rv (NC_000962.3), and to additional nontuberculous mycobacteria-specific insertion sequences (Table S9). Reads aligning with a minimum mapping quality of 30 were extracted, and the lengths computed as the absolute difference between the start and end coordinates. The coverage was calculated using the following equation: where N is the number of reads of length L, G is the sequence length, and c is the sequence copy number.

ROC analysis.
Receiver operating characteristic analyses were performed using the roc function from the R package pROC 18 . For each biofluid, the abundance of M. tuberculosis DNA (in RPM) was compared between sputum positive tuberculosis and sputum negative tuberculosis samples, between tuberculosis and endemic cohorts, and between tuberculosis and non-endemic cohorts.
To evaluate the effects of using non-endemic and endemic samples on the diagnostic performance for tuberculosis, reads that aligned to M. tuberculosis were extracted from each sample. Sputum positive tuberculosis Statistical analysis and data availability. All statistical analyses were performed in R 3.5.0. Boxes in the boxplots indicate the 25th and 75th percentiles, the band in the box represents the median, and whiskers extend to 1.5 × interquartile range of the hinge. The sequence data for the non-endemic urine cohort was deposited in the database of Genotypes and Phenotypes (dbGaP, accession number phs001564v3.p1). The sequence data for the non-endemic plasma cohort was deposited in the Sequence Read Archive (accession number PRJNA263522).
The sequence data generated in the scope of this study will be deposited in the Sequence Read Archive.
Ethics approval and consent to participate. The

Results
Biophysical properties of M. tuberculosis DNA in urine, plasma, and oral swabs. Microbial cellfree DNA (cfDNA) in urine and plasma has been shown to be ultrashort, with an average fragment length of less than 100 bp ( Fig. 2A) 12,13 . Compared to these biofluids, relatively few studies have examined the properties and diagnostic potential of microbial DNA from oral swabs. Oral swabs samples are an attractive new avenue for metagenomic DNA assays because they can be obtained noninvasively, are more cost effective than obtaining, storing, and shipping blood samples, and provide a high yield of DNA 19 . We found that in contrast to the microbial fragment length profiles of urinary and plasma cfDNA, DNA obtained from oral swabs is longer and likely genomic in origin: the fragmentation pattern closely mirrors that of human chromosomal DNA which arises from the tagmentation step in the library preparation protocol (Fig. 2B).
We set out to quantify the abundance of M. tuberculosis DNA detected in 161 samples obtained from patients presenting with symptoms of respiratory illness at tuberculosis clinics in the Philippines and Uganda (tuberculosis cohort), approximately half of which were sputum positive for tuberculosis (Table 1). Across all samples, we obtained an average of 31,831,284 reads (range: 1,908,898 to 155,693,161 reads; see Supplemental Information). We found that the abundance of M. tuberculosis DNA increases from plasma, to urine, to oral swabs (Fig. 2C). Within the tuberculosis cohort, there was no difference in M. tuberculosis DNA abundance between sputum positive and sputum negative tuberculosis samples. Setting a limit of detection of 0.1 reads per million reads (RPM), we find that within the tuberculosis cohort we detect M. tuberculosis DNA in 100% of the oral swab samples (42/42), 95% of the urine samples (55/58), and 93% of the plasma samples (57/61; Table S1). The median tuberculosis DNA in the endemic cohort relative to the non-endemic cohort was expected given that geography, ethnicity, and other population-based factors have previously been shown to influence the microbiome composition among healthy individuals 20 , and that the presence of an infectious organism does not necessarily equate to a disease state 21 . Given that the abundance of M. tuberculosis DNA across all biofluids in the endemic cohort was over 9.5-fold more than the non-endemic cohort (adjusted p-value = 2.1 × 10 −23 , Wilcoxon test), but still significantly less than the tuberculosis cohort (14.15-fold less across all biofluids; adjusted p-value = 1.1 × 10 −3 , Wilcox test).

Diagnostic potential of M. tuberculosis DNA is influenced by choice of controls.
We evaluated the diagnostic performance of M. tuberculosis DNA in oral swabs, plasma, and urine. We found poor separation between sputum positive and sputum negative individuals across all biofluid types (area under the curve [AUC] = 0.6, 0.6, and 0.5 for plasma, urine, and oral swabs, respectively; Fig. 3A and Table S2). We found a modest increase in performance when comparing the tuberculosis and endemic cohorts (AUC = 0.62 and 0.65 for urine and plasma, respectively; Fig. 3B). In contrast, we found almost perfect separation for the tuberculosis and non-endemic cohorts (AUC = 0.93, 0.97, and 0.99 for urine, plasma, and oral swab, respectively; Fig. 3C). Moreover, we found that the diagnostic performance was not influenced by the choice of metagenomic classifier, reference database, or removal of confounding reads (see Table S3-S4).
To explore the effects of a biological background on the diagnostic performance, we randomly selected combinations of 15 urine samples composed of sputum positive and a known mixture of sputum negative and nonendemic controls. We performed 50 sampling rounds and found a positive correlation between the proportion of non-endemic samples and the diagnostic performance, with a mean AUC of 0.57 when the negative control consisted of only sputum negative samples and a mean AUC of 0.95 when the negative control was composed entirely of non-endemic samples (Fig. 3D). However, we saw no correlation when we performed the same analysis using a mixture of sputum negative tuberculosis and endemic controls: the area under the curve remained relatively constant, fluctuating between 0.56 and 0.60 across all negative control mixtures of sputum negative and endemic samples. This observation highlights a crucial but often overlooked criterium for metagenomic diagnostic test development: the choice of control population. Differences in diagnostic performance are highly influenced by the geographic origin of the samples. This is supported by the performance of published studies assaying nucleic acids in blood or urine for tuberculosis diagnosis: sputum culture positive and sputum culture negative samples are nearly indistinguishable unless the control cohort include samples from individuals living in TB non-endemic regions (Table S5).
Poor diagnostic performance can be attributed to a background of biological nontuberculous mycobacteria. We hypothesized that the poor diagnostic performance could be attributed to geographic factors: individuals living in TB endemic regions are exposed to nontuberculous mycobacteria, such as M. avium, M. abscessus, and M. kansasii 22 . This is further supported by observations that the abundance of M. tuberculosis DNA is positively correlated with age (Fig. S1). This exposure results in a nontuberculous mycobacteria www.nature.com/scientificreports/ background that is indistinguishable from a true disease signal due to the low abundance of M. tuberculosis and the high sequence similarity between Mycobacterium genomes, which can exceed 99% nucleotide similarity 23 .
To determine whether the poor diagnostic performance could be attributed to a biological nontuberculous mycobacteria background, we simulated datasets by digitally spiking in M. tuberculosis reads into datasets generated for non-endemic, endemic, and tuberculosis urine samples. Across a range of 0 to 1000 spiked-in M. tuberculosis reads, we obtained nearly perfect classification of the synthetic reads in non-endemic samples (0.987 ± 0.0210; Fig. 4A). However, the endemic and tuberculosis samples exhibited logarithmic relationships between the number of spiked-in reads and precision. When we evaluated precision as a function of the relative coverage of spiked-in M. tuberculosis reads to all Mycobacterium reads in a sample, we found a strong correlation between the signal-to-noise ratio and the classification precision (Fig. 4B). Non-endemic samples had little to no background DNA from Mycobacterium species and exhibited high precision, while endemic and tuberculosis samples had a higher background of nontuberculous mycobacteria that reduced the classification precision. To further test our hypothesis that background nontuberculous mycobacterial DNA drives poor classification precision, we removed all Mycobacterium reads prior to spiking in M. tuberculosis, M. bovis, or M. avium reads and found perfect classification across all samples, regardless of the number of reads simulated (Tables S6-S8).
The availability of sputum PCR results for the oral swab samples provided further opportunity to evaluate the M. tuberculosis signal relative to the nontuberculous mycobacterial background. The Xpert® MTB/RIF Ultra assay (Cepheid, Sunnyvale, USA) targets three segments of the M. tuberculosis complex, two of which are the IS6110 and IS1081 insertion sequences. M. tuberculosis isolates contain between 0-25 copies of IS6110 24 , whereas the IS1081 is present in all M. tuberculosis complex species at a stable number of 5-7 repeats per genome 25 . We evaluated the presence of IS1081 detected by metagenomic sequencing because the range in copy numbers across different M. tuberculosis complex species is narrower. IS1081 was detected by metagenomic sequencing in 5 of 27 sputum positive oral swab samples and was not detected in any of the 15 sputum negative oral swab samples, as expected. Because IS1081 is unique to M. tuberculosis, we were able to obtain a lower bound of the relative abundance of M. tuberculosis versus Mycobacterium by comparing the per-base sequence coverage of the IS1081 gene segment relative to the M. tuberculosis genome. Using the minimum copy number for the IS1081 gene (five copies per genome), we found that the coverage of nontuberculous mycobacteria relative to IS1081 was 210.397 (range of relative coverage: 23-394). Given that IS1081 was not detected in 22 Table S9) 22,26 . Further validation is required to determine if these species are co-infectious and influence disease outcome.

Discussion
We show that the utility of a minimally invasive metagenomic sequencing assay for pulmonary tuberculosis diagnostics is dependent on the geographic origin of control samples and limited by the low abundance of M. tuberculosis in extrapulmonary sampling sites. Such an assay is sensitive to the detection of nontuberculous mycobacteria that arises from a lifelong exposure to species from the Mycobacterium genus and contributes to the microbiome composition of samples originating from TB endemic regions. Our findings demonstrate that the influence of geography on the microbiome directly impacts the diagnostic performance: the inclusion of non-endemic samples in the control cohort invariably results in a near perfect test while poor diagnostic separation is obtained for geographically-controlled TB positive and TB negative individuals. Mathematical modeling demonstrates that the diagnostic potential is correlated with the abundance of M. tuberculosis DNA relative to the background of nontuberculous mycobacterial DNA. Quantitative comparisons to matched qPCR reveals www.nature.com/scientificreports/ that nontuberculous mycobacterial DNA is 23-fold or more abundant than the abundance M. tuberculosis DNA in the samples investigated here. The overwhelming biological background of Mycobacterium in samples of interest, in combination with the low abundance of M. tuberculosis in extrapulmonary sampling sites, presents a major barrier for the implementation of an unbiased metagenomic DNA sequencing assay for pulmonary tuberculosis diagnostics. Detection of tuberculosis using a metagenomic sequencing assay for tuberculosis diagnostics is thus akin to looking for a needle in a haystack: M. tuberculosis DNA constituted less than 4.4% of the total abundance of Mycobacterium in samples from TB endemic regions included in this study. Our work suggests that the median abundance of M. tuberculosis is lower than 0.06 and 0.42 copies/mL in blood and urine, respectively, and lower than 284 genome copies/µg of DNA collected by oral swab, an estimate that is in agreement with previous reports quantifying the abundance of M. tuberculosis DNA through sequence-specific amplification 8,27 . Improvements to a metagenomic sequencing assay for tuberculosis diagnostics could be made by increasing the volume of input biofluid 28 , choosing a sample preparation workflow with improved DNA extraction and short read amplification 12,27 , or enriching for M. tuberculosis-specific sequences using ultrashort PCR amplicons 8,29 . These approaches would minimize noise from nontuberculous mycobacteria and increase the sequencing budget allocated to M. tuberculosis. Additionally, further exploration of the nontuberculous mycobacteria fraction may reveal patterns in disease outcome and provide new insights in the development of a robust geographic control. Together, our results reveal challenges and opportunities for the development of a DNA-based diagnostic tests for pulmonary tuberculosis and provides a comprehensive characterization of M. tuberculosis in extrapulmonary sites that can inform the development of molecular tests.

Data availability
The sequence data for the non-endemic urine cohort was deposited in the database of Genotypes and Phenotypes (dbGaP, accession number phs001564v3.p1). The sequence data for the non-endemic plasma cohort was deposited in the Sequence Read Archive (accession number PRJNA263522). The sequence data generated in the scope of this study will be deposited in the Sequence Read Archive.