Bacterial load slopes represent biomarkers of tuberculosis therapy success, failure, and relapse

There is an urgent need to discover biomarkers that are predictive of long-term TB treatment outcomes, since treatment is expense and prolonged to document relapse. We used mathematical modeling and machine learning to characterize a predictive biomarker for TB treatment outcomes. We computed bacterial kill rates, γf for fast- and γs for slow/non-replicating bacteria, using patient sputum data to determine treatment duration by computing time-to-extinction of all bacterial subpopulations. We then derived a γs-slope-based rule using first 8 weeks sputum data, that demonstrated a sensitivity of 92% and a specificity of 89% at predicting relapse-free cure for 2, 3, 4, and 6 months TB regimens. In comparison, current methods (two-month sputum culture conversion and the Extended-EBA) methods performed poorly, with sensitivities less than 34%. These biomarkers will accelerate evaluation of novel TB regimens, aid better clinical trial designs and will allow personalization of therapy duration in routine treatment programs. Magombedze et al. propose a new method combining mathematical modeling and machine learning to derive early (within 2 months) effective predictive biomarkers from bacterial load slopes for tuberculosis long term treatment outcomes, thereby accurately predicting treatment failure and success.

T uberculosis (TB) is the most important infectious cause of death worldwide, accounting for 3% of all deaths; it killed one billion people over the last two centuries 1 . In both drug-susceptible TB and multidrug-resistant TB (MDR-TB) 2 , therapy duration is 6 months, after which patients are followed for up to 18 months to document relapse. The large numbers of patients with TB (10 million/year), the long therapy duration, and the follow up period of up to 2 years, makes TB one of the most expensive diseases to treat. Thus, it is of crucial importance to identify TB treatment regimens that are equally as effective in drug-resistant TB as in drug-susceptible TB, to identify regimens that can shorten therapy duration, and to identify early biomarkers that obviate the need for 2-year follow up [1][2][3][4][5][6][7][8][9][10][11] . A closely related problem is the time it takes to evaluate and compare such new regimens in phase I-III clinical trials; they take decades to complete given the long follow-up time required to document relapse. Thus, biomarkers that obviate the need for the long follow up to document relapse, and that can be deployed immediately on a global scale at little cost, need to be urgently developed for both routine patient care and to accelerate the time-table of clinical trials.
The tools currently used to monitor TB treatment in the clinic and in clinical trials arose in the historical context of the microbiology technology of 50 years ago. In the late 1970s Jindani and Mitchison performed a 14-day treatment clinical study in East Africa (n = 124 patients) that utilized solid agar-based Mycobacterium tuberculosis (Mtb) colony-forming unit (CFU)derived kill rates defined by linear regression slopes to define early bactericidal activity (EBA), and the 14-day or extended-EBA to capture sterilizing activity, which are the basis of current phase II clinical trials 7,8 . In 1993 Mitchison summarized results of seven clinical studies to propose the use of two-months sputum culture and smear as a surrogate of relapse; the two-month (eight-week) endpoint is now the basis of clinical decision-making in routine clinical care 3,10-13 . Eight-week studies are also widely used as phase IIb studies to select TB regimens that go into the larger phase III studies in which long-term outcomes such as relapse, death, and cure are evaluated. However, the accuracy of these phase I/II studies in predicting hard clinical outcomes such as cure, therapy failure, and relapse, have been challenged [10][11][12]14,15 . In addition, more recent technological advances with semiautomated liquid cultures have demonstrated that the eight-week agar-based cultures may have been over-optimistic and are associated with substantial false-negative rates [16][17][18][19] . On the other hand, time-to-positivity (TTP) in the liquid cultures can be used in place of CFUs 20,21 . The liquid culture technology has been widely deployed across the world for routine clinical care as a diagnostic and for susceptibility testing. Here, we sought to identify mechanistic biomarkers (based on quantitative biology of the disease) that fulfill the definition of the US Food and Drug Administration BEST (Biomarkers, EndpointS, and other Tools) Resource, for use early during therapy to predict long-term hard clinical endpoints such as cure, therapy failure, and relapse 22,23 .
We have developed a mechanistic model to quantitatively explain the drug-regimen bacterial kill kinetics and dynamics of both fast-replicating and semi-dormant/non-replicating persistent (NRP) Mtb subpopulations in TB patients as reflected in sputum 24 . Here, we used serial sputum TTP-data from patients in the Rapid Evaluation of Moxifloxacin in Tuberculosis (REMoxTB) phase III clinical trial to identify the trajectory of these two bacterial sub-populations and to estimate time in which both Mtb bacteria subpopulations reach extinction (time-toextinction) 24 . According to Burman, "The ability to prevent relapse is termed sterilizing activity because it is presumed to require killing nearly of all bacilli remaining after the initial phase of therapy" 9 . Restated, failure to reach extinction by the Mtb population in lung lesions is a required condition for therapy failure and relapse. Therefore, the time-to-extinction of all bacillary populations marks the required minimum duration of therapy in order to avoid relapse. However, some patients who do not reach bacterial-population extinction can still achieve relapsefree cure because of immune-response that can potential eliminant remaining bacteria.

Results
Clinical and laboratory characteristics in derivation and validation datasets. First, REMoxTB clinical trial patients who had (i) majority of sputum samples that were contaminated (TTP<4days), or missing, or (ii) ≤4 data-points within the first 8 weeks (i.e., data-points fewer than ODE model parameters minus one) excluding the baseline value were removed, leaving 637 (33%) patients randomized to the standard therapy arm, 654 (34%) randomized to the isoniazid arm, and 633 (33%) randomized to the ethambutol arm (Fig. 1). This was followed by converting the 1,924 patients TTP-series to CFU/mL using Eq. 1, before modeling the data with a set of ODEs 2 and 3, to describe trajectories of Mtb CFU/mL with time (i.e., slopes). We identified ODE-model parameter estimates using 8-week (2-months)-, 4months-, and 6-months accrued TTP-derived data for all 1,924 patients. The model parameter estimates are shown in Table S1. As an internal check for consistency with clinical observations, the range of proportions (fraction f ) for semidormant and nonreplicating bacteria to the log-phase population was 1% to 25%. We termed the Mtb kill rates γ-slopes, where γ f is the slope for fast-replicating Mtb and γ s is the slope for semi-dormant/nonreplicating (NRP) Mtb. The model was also used to calculate the time-to-extinction of the total Mtb population for each patient, with results shown in Fig. 2.
Data partitioning into derivation and validation datasets. We separated the 1924 patients' data into derivation and validation datasets, shown in Table 1. The derivation dataset was comprised of 318 (50%) patients on standard therapy, as shown in Fig. 1. All patients in the derivation dataset were randomized to six-months therapy duration. The validation datasets comprised of (i) 319 patients on standard therapy for six-months duration, and (ii) 1287 patients randomized to the experimental arms (isoniazid or ethambutol) that had a four-months therapy duration. Table 1 shows that the demographic and clinical characteristics were similar between the derivation data set and all validation data sets, which means that the data-partitioning step was executed successfully.
Time-to-extinction versus clinical trial-based outcome definitions. We then used the derivation dataset to determine if the time-to-extinction of the total Mtb sputum population for each patient had clinical relevance, especially given that TTP versus CFU/mL relationship could change with time during treatment. The number of patients deemed cured at different time intervals in the course of treatment obtained by counting the number of negative cultures/TTP as defined in the REMoxTB protocol versus those identified using our time-to-extinction model definitions (derived from CFUs calculated from TTPs) had a Spearman rank correlation of 1.0 (p = 0.017). Moreover, when we used Cohen's kappa (κ) to assess agreement between individual pairs of either time-to-extinction versus standard clinical definitions, they were highly concordant (κ = 0.65, p < 0.001). Furthermore, the Spearman rank correlation between γ f (fast slope) and 14-day extended-EBA (derived using linear regression) was 0.68 (p < 0.001), which suggests that the extended-EBA mainly reflects Step 1 Step 2 Step 3 Step 4 Step 5 Step 6  Step 1: Patients without sufficient data points to derive bacterial kill slopes were removed.
Step 2: The weekly sputum time-to-positivity data was then converted to colony forming units and then modeled using ordinary differential equations.
Step 3: Data partitioning of 50% of patients in standard of care six-months therapy as derivation data-set and the other 50% into valdiation dataset. All patients in experimental arm, administered over 4 months were assigned to validation datasets.
Step 4: Four mathematical modeling and machine learning types of analyses in derivation dataset to 1 identify predictors of time-to-extinction (TTE) and 2 threshold values deliniating different TTE, and 3 design a diagnostic rule for different therapy durations.
Step 5: Accuracy of diganostic rule/biomarker for six-months therapy duration in standard of care validation dataset using clinical definitions of outcome (relapse, cure).
Step 6: Accuracy of diganostic rule/biomarker for four-months therapy duration in two experimental arms in validation dataset using clinical definitions of relapse and cure. Predictors of outcome in derivation dataset. Classification and regression trees (CART) were used, to identify predictors of target outcome, defined as sputum microbial outcomes (cure at end of therapy, therapy failure or relapse), using potential predictors that included ALL the clinical and laboratory features, including ODE-model derived γ-slopes, for the tasks of classification and regression as input/independent variable. CART identified the γ s (semi-dormant/NRP kill) slope as the primary predictor (which had a variable importance score of 100%), followed by initial bacterial burden just prior to therapy commencement (which we termed B (0)), which had a variable importance score of 91.7%. This means that the initial TTP (B(0)) improved the primary predictor by an extra 91.7%. Notably, γ f was not ranked as a predictor using this agnostic machine learning method. Similarly, features such as HIV status, cavitation, biological sex (male or female) ranked low and had variable importance scores below 10%. CART performs its own cross-validation within the derivation dataset, in this case by randomly splitting the derivation dataset five times. With the cross-validation, the post-test validation area under the curve (AUC) in the same derivation set was >85%, demonstrating that γ s plus initial TTP (B(0)) will likely perform as good predictors in future and separate datasets.
Clustering-based approaches to identify biomarkers in derivation dataset. Cluster analysis is an agnostic, quantitative and unsupervised machine-learning we used to group similar longitudinal patterns. In this case, TTP trajectories in the derivation dataset (with 238 patients) were grouped into four distinct homogenous groups based on the K-means algorithm as shown in Fig. 3. These were 1 , a cure cluster of 80 (33.61%) patients (Fig. 3a, b) 2 , a slow-cure cluster of 100 (42.02%) (Fig. 3c, d) 3 , a relapse cluster of 34 (14.28%) patients (Fig. 3e, f), and 5 a treatment failure cluster of 24 (10.08%) patients (Fig. 3g, h). The slow cure cluster identified by this unsupervised machine learning method denoted those patients who had delayed attainment of microbiologic cure at the end of six months therapy (failed therapy at the end of six months) but achieved relapse-free cure when standard therapy was continued beyond six months duration. These four clusters represented 238/ 318 (74.84%) of patients with less than 2 missing observations or more missing observations during follow up. The model explained these data well, as is shown in online Figures S2 and S3, and Table S1, while the corresponding summary statistics for each cluster are shown in Table S2.
We used this clustering step to identify the minimum duration of data gathering that would give a γ-slope that could accurately predict cure or therapy failure or relapse. Figure 4 shows the distribution of model derived γ s and γ f values, when these slopes were derived based on 8-weeks-derived TTP data (2-months) (Fig. 4a, b), 4-months-derived TTP data (Fig. 4c, d), and 6-monthsderived TTP data (Fig. 4e, f). The 8-week-, 4-months-, and 6months-derived γ s and γ f values (shown in Table S1) versus outcomes were examined in pairwise comparisons using the Mann-Whitney-Wilcoxon test. Figure 4 shows that the γ f values did not         Fig. 4 Distributions of model estimated kill rates/biomarkers. Distributions of γ s -and γ f -slopes based on 2-months, 4-months and 6-months accrued TTP data. a γ s -slopes based on 2 months TTP data, b γ f -slopes based on 4 months TTP data, and c γ s -slopes based on 6 months TTP data. d The magnitude of γ f -slopes at 2 months, e at 4 months, and f at the end of 6 months. g Pairwise analysis using the Mann-Whitney test for distributions of slopes based on 2, 4 and 6 months data versus each cluster (cure, slow-cure, relapse and failure) for the γ s -slopes shows p-values in each cell that were significant even with 2-months of data, except with cure versus slow cure group. h Pairwise statistical difference analysis for γ f , found that few p-values were significant, and even those had an inconsistent pattern. discriminate failures from cures, consistent with CART findings. However, γ s = 0.15 or <0.1 log 10 CFU/mL/day (modeling semidormant/NRP Mtb) were better at discriminating these outcomes. The slopes derived with 8-week-vs-4 months data differed in the misclassification of patients' outcomes, the former misclassifying more relapses as cures and the latter misclassifying more cures as relapses. Nevertheless, as demonstrated by the statistical comparisons in Fig. 4h, the 8-week derived TTP data γ s (Fig. 4g) adequately diagnosed relapse versus other outcomes. In other words, γ s calculated using eight-week-derived TTP data is a good predictor of sterilizing effect up to 18-months after therapy cessation, and this eight-week data-derived slope thus measures sterilizing activity rate. Subsequently, all γ s discussed herein were those identified using the first eight-weeks-derived data.
Monte Carlo Simulations to identify biomarker thresholds in indeterminate outcome zones. Given the misclassification of relapses as cures by the eight-week TTP-derived γ s , and γ s thresholds in the indeterminate outcomes region (i.e, overlap of relapse versus slow cures), we utilized Monte Carlo Simulations (MCS) of time-toextinction in tandem with CART to further discriminate γ s cut-off values in indeterminate outcome zones, with results shown in Table S3, Figures S4 and S5. Cure was clearly delineated by γ s > 0.15, therapy failure by γ s < 0.1 plus initial bacterial burden B(0) > 5.6 log 10 CFU/mL (TTP = 5.49 days), and relapse delineated from cure by γ s < 0.13. Figure S4c, d shows the CART-derived biomarker thresholds based on the simulation for predicting treatment outcomes after either 4-months or 6-months therapy duration. Patients with initial bacteria burden B(0) > 4.5 log 10 CFU/mL (TTP = 8.11 days), and γ s -slopes between 0.1 and 0.15 had >55% chance of failing treatment at 6 months (Fig. S4c). However, for a four-month therapy duration regimen, patients with B(0) > 5.4 log 10 CFU/mL (TTP = 5.93) and γ s between 0.09 and 0.14 had a > 65% chance of failing treatment. Figure S4 also shows that in order to achieve cure/bacillary population extinction within 2 months of treatment, then γ s ≥ 0.15 (−3.90 TTP per day) would be required, while patients with γ s ≤ 0.1 (−2.60 TTP per day) would fail. Patients on standard therapy with B(0) > 5.6 log 10 CFU/mL (TTP = 5.49) with γ s < 0.13 would relapse.
Creation of γ s -based rule to predict relapse for different therapy durations from derivation dataset. In the final derivation step, we established a diagnostic rule for the relationship between γ s -slopes and the outcomes, using Latin hypercube sampling for sensitivity analyses, with results shown in Fig. 5. Figure 5a-d shows that increasing or reducing the γ s (i.e., speed of kill of slowreplicating bacteria) γ s changes the time-to-extinction and therefore the required minimum duration of therapy. As an example, the six-months therapy duration would need to be extended to eight-months duration (i.e., slow-cure) in patients with high bacterial burden when γ s is reduced from 0.148 to 0.131 and extended to 9-months when its reduced to 0.125 (Fig. 5b). However, for patients in the medium and low CFU load categories, lower slopes can still achieve cure within 6 months (Fig. 5c, d). On the other hand, to reduce treatment duration to four-months γ s should increase to 0.183, and in order to reduce therapy duration to two-months γ s should increase to 0.286 (Fig. 5b). The relationship between γ s and initial TTP versus minimum duration of therapy is shown in Fig. 5e, f, and is nonlinear function. From this, we calculated the target γ s to achieve cure (extinction of bacterial population) with one-month therapy duration, shown in Fig. 5e, f. This establishes a diagnostic rule between γ s versus minimum treatment duration for relapse-free cure for different initial Mtb burdens. After this step, the derivation work was completed, and the derivation dataset patients excluded from subsequent validation studies.
Performance of γ s -based rules in forecasting outcomes for 6 months therapy duration. Next, we calculated the accuracy of how well our diagnostic rule performed in the six months therapy duration validation datasets, using the clinical and microbial treatment outcomes defined by the REMoxTB trial protocol. Treatment outcome calls could be made in 218 of the 319 patients who also had more than 4 data points within eight-weeks to give statistically robust estimates of the bacteria kill slopes: 169/218 (74.31%) achieved relapse-free cure, 137/218 (16.97%) had therapy failure at the end of treatment, and 19/218 (8.72%) relapsed after initially looking like cure at the end of therapy. The accuracy of the γ s -based rules are compared to the extended-EBA and two months sputum conversion in Table 2, together with the relative risk (RR) of failure when each biomarker was positive versus not-positive (numbers in each cell shown in Table S4). Table 2 shows that the extended-EBA had a sensitivity of 14% and specificity of 92% in identifying failure from cure without relapse and the RR 95% confidence interval crossed 1 (p = 0.205); the number needed to diagnose (NND) failure/relapse was 15.27. Similarly, two-months sputum conversion had a sensitivity of 33% and specificity of 71%, RR was statistically 1, and NND was 21.41. On the other hand, the eight-weeks-data derived γ s combined with the initial TTP at treatment commencement had a sensitivity of 92% and specificity of 86% in identifying failure from relapse-free cure, the RR of failure when this biomarker was positive versus not positive was 28 ( Table 2 and Table S4), while NND was 1.29. Failures either arise as therapy failure or relapse; Table 2 shows the sensitivities for these different biomarkers in predicting relapses from treatment failures. The slope decision rule based on γ s > 0.15 has a sensitivity of 92% and a specificity of 89% in predicting relapses from failures. Thus, the biomarkers we derived were highly specific at identifying relapse-free cure, therapy failure, and relapse.
Performance of γ s -based rules in forecasting 4-months therapy duration outcomes. Also, we tested the accuracy of the diagnostic rule for four-months therapy duration in the validation datasets comprised of the REMoxTB trial experimental arm patients. In the arm in which isoniazid was replaced by moxifloxacin and therapy administered for four-months (n=655), 530 patients had enough TTP data in the first 8 weeks to calculate slopes. In this dataset, 369/ 530 (69.62%) patients achieved cure, 40 (7.55%) patients had therapy failure at the end of 4 months of therapy, and 121 patients (22.83%) relapsed. Table 2 shows that the γ s >0.15 had a sensitivity of 81% and specificity of 89% for relapse-free cure versus failure, and among the failures had a sensitivity of 75% and specificity of 60% for separating relapse from therapy failure. The relative risk of failure in patients with positive slope-based biomarker versus negative biomarker was approximately 15 ( Table 2 and Table S4); the NND was 1.47. The 2-month sputum conversion was not designed for 4 months therapy duration regimens, and is not shown, while the extended-EBA which is used to triage shorter duration regimens is shown; the NND was 16.69.
In the arm in which ethambutol was replaced by moxifloxacin (n=633), 533 patients had enough data to calculate 8-week slopes. In this dataset 385 (72.23%) of patients achieved cure, 46 (8.63%) had therapy failure, while 102 relapse (19.4%). The sensitivity of the extended EBA was only 10%, and the NND was 18.73. The sensitivity of γ s -based slopes was 70% and the specificity 71% for cure versus therapy failure, while the sensitivity was 70% and specificity 65% for picking relapse versus therapy failure. The NND was 1.89.
In order to summate, we calculated an overall value of the relative risk of failure when our B(0) and γ s -based slope predicted poor outcome for a specified duration of therapy (using 6-months and 4month duration data combined). Among patients with positive biomarker for specified therapy duration, 159/205 (78%) failed therapy compared to 218/1072 (20%) in whom the biomarker was negative. The RR of failure with the rule was 8.25 (95% CI: 6.09-11.20); p<0.0001. In terms of cure only 4% of entire validation dataset cohort of patients achieved relapse-free cure when our rule was positive while 67% achieved cure when it was negative.

Discussion
First, our model estimates of initial proportion of non-replicating persistent/semidormant of 1-25% of fast replicating (Table S1) Figure S2. Thus, our model parameter estimates were highly consistent with the biology of bacillary populations, and their changes with therapy. This is the major strength of our approach, which is that they are mechanistic and based on the biology observed in patients. Second, we found that the γ s (slow replicating) slope is a good surrogate of sterilizing activity, based on ability to predict relapse. Conversely, the extended EBA had a sensitivity of 14% for predicting outcomes at 6 months and beyond, and a poor accuracy. The extended EBA is effectively two-weeks accrued data; the poor sensitivity means that the total time for which the bacterial kill data is collected is too short to accurately capture sterilizing activity slopes. Indeed, the poor sensitivity of γ f -slope-based metric means that most regimens with good sterilizing effect could be thrown away (too many false negatives for sterilizing activity) in regimen selection for sterilizing activity. Similarly, the 2-month sputum conversion had a sensitivity of 33% and specificity of 71%. These commonly used clinical indices gave us an opportunity to externally validate our modeling approach. In this case, the last major meta-analyses on 2-month cultures as a predictor of long-term outcome in TB performed by Horne et al in 2010 identified a sensitivity of 40% (95% CI, 25-56%) and specificity of 85% (95% CI, 77%-91%), which was confirmed in subsequent studies 14,30,31 . Thus, our modeling findings are consistent with results of these major meta-analyses. This means that our 8-weeks-derived γ s slope plus initial bacterial burden, which had a sensitivity of 92% and specificity of 86% for 6 months therapy duration regimens, would perform better than the 2month sputum conversion. In addition, our γ s slope can predict outcomes at shorter therapy durations than 6 months such as 4months duration; the relative risk of therapy failure among patients with positive biomarker for specified therapy duration was > 8.0. Thus the γ s -slope based on the first 8-weeks TTP data is a good response biomarker for sterilizing activity, even for therapy duration less than standard short course chemotherapy.
The γ s -slope, which we will henceforth term the "sterilizing activity rate", fulfills the BEST criteria and definition of a monitoring biomarker in the category of a pharmacodynamic/response biomarker, in a similar fashion to HIV and hepatitis C viral load biomarker, and could play the same role in TB therapeutics and clinical trials 23,29,32,33 . According to BEST criteria, a pharmacodynamic/response biomarker provides early evidence (in this case 8weeks) that a treatment might have an effect on a later pharmacologic clinical endpoint (in this case relapse at 2 years). In the case of HIV treatment trials, identification of viral load as a surrogate of efficacy in 1995 dramatically cut the duration and costs of clinical Table 2 Biomarker threshold values, sensitivity and specificity scores, and risk of failure, with 95% confidence intervals.
Biomarker using log 10   Sensitivity analyses and rule-making of γ s slopes versus time-to-cure. a Shown are γ s slopes required to achieve cure within 6 months for patients with high bacterial burden compared to those with medium and lower bacterial burdens. b-d The γ s slopes required to achieve cure at 2, 4 and 6 months duration or for delayed cure of an additional 1 to 3 months beyond month 6 (i.e, 6 months +1, or +2, or +3 months), are shown for patient starting with high (b), medium (c), and low (d) Mtb burdens. e Magnitudes. of slopes for therapy duration of only 1 and 3 months (for high, medium and low Mtb burden) could be extrapolated and interpolated in log 10 CFU/mL/day as (0.42, 0.36, and 0.30) and (0.22, 0.20, and 0.17), respectively, based on the relationships between slope and duration of therapy (r 2 > 0.999). f Magnitudes of slopes for therapy duration of 1 and 3 months are extrapolated and interpolated TTP-slope as (12.08, 10.08 and 8.49) and (5.67, 5.06 and 4.48), respectively, based on the relationships between TTP-derived slope and duration of therapy (r 2 > 0.999).
trials, while avoiding use of potentially catastrophic clinical endpoints such as therapy failure and death 32 . For TB, we propose identification and ranking of regimens using preclinical models that can accurately translate the sterilizing activity rate to patients 24,34 . The regimens so derived, including optimal doses, and the translated sterilizing activity rate will provide good priors for the design of 8-week clinical trials for novel regimens versus standard therapy, with weekly TTP as the main output and drug pharmacokinetics as a secondary outcome. The sterilizing effect rate (γ s -slope), initial TTP, and trajectories can then be used to estimate therapy duration for the novel regimens and determine if indeed the new regimens can shorten TB treatment prior to performance of phase III studies. The 8-weeks TTP-data derived slopes can be used to compute a lower and more accurate patient sample sizes required to power the phase III trials, given the good accuracy in forecasting relapse. As an example, the number needed to diagnose (NND) failure and relapse of <2, when compared to~20 for extended EBA and 5-6 for 2months therapy, gives a more straightforward insight into the relative number of patients tested in each arm by different biomarkers. Moreover, these data and slopes can be used for optimal design by identifying optimal sampling times of sputa (TTP), and optimal number of samples that minimize uncertainty in the slopes in future clinical trials. Furthermore, since the predictive value of the sterilizing activity rates on relapse or cure or therapy failure is independent of the regimen the slopes can be used in clinical trials of MDR-TB and for "pan-susceptible" TB regimens, indeed for any TB regimen. As regards to clinical practice, our findings add to the recent discovery that initial Mtb burden can be used to determine patients who can benefit from 4-month duration therapy 35 . Imperial et al used standard inferential statistics found that nonadherence was the most significant factor leading to poor outcomes (hazard ratio 5.9), and that low initial bacterial and disease burden were the most important determinants of optimum duration of therapy 35 . Here, we found that the sterilizing activity rate was ranked higher than initial bacterial burden or any other clinical factors. To put this is context, the risk of development of AIDS and death in patients whose HIV viral load did not reach undetectable within first 12 months was 2.40-fold compared to those who had, and a <75% reduction in viral load had a RR of 2.27-fold for poor outcomes [36][37][38] . Patients in whom the γ s -slopebased rule was positive for different durations of therapy had a an 8.25-fold higher risk of failure, which is better performance than this commonly used HIV test used to individualize therapy. Thus, our findings could also be used to individualize therapy, in place of two-month smears/cultures currently recommended in routine care in TB programs worldwide. First, if these patients with potentially higher rates of therapy failure and relapse were identified during the first eight weeks of therapy, then interventions such as dose increases or switching therapy regimens could be made 39 . Second, the sterilizing effect rate (γ s slopes) could also be used by TB programs to identify patients who could be cured with specific shorter therapy durations of either 2, 3 or 4 months, on any regimen. Alternatively, they could be used to identify how long therapy duration should be extended beyond 6 months, thereby individualizing therapy duration, in patients with sputum γ s slopes that predict the slow cure clusters. Since many TB programs across the world already employ liquid culture systems that generate TTP, it means that the biomarker we propose would come at no extra cost to those TB programs. Computation of the slope could easily be implemented on a computer (or on a phone with specifically designed app).
Our study has some limitations. First, there were no accompanying CFUs to the TTPs in the REMoxTB study and we relied on the TTP to CFU conversion formula from a prior study; 24 REMoxTB TTP-CFU data-pairs would have been the best at characterizing the uncertainty in TTP conversion to CFU conversion. However, we tested the time-to-extinction based definitions of cure derived using this TTP-CFU conversion from the prior study to those observed using the REMoxTB clinical trial protocol definitions and patient microbiology and identified a Spearman rank correlation of 1.0 (p = 0.017). Thus, our TTP-CFU conversion and slopes derived from it were robust. Second, it could be argued that our findings are specific to the dataset we analyzed. However, the machine-learning cross-validation procedures we used are scored on how well predictors will perform on an entirely independent dataset in the future. Nevertheless, the accuracy of the biomarkers will still need to be further confirmed in other large datasets in a range of clinical contexts and with different regimens. Third, calculation of slopes is relatively complex. However, software can easily be written to automate this, as we have attempted elsewhere. Fourth, we relied on data supplied by the clinical trial team, and thus could not assess the quality and volume of sputum samples as treatment progressed, and several factors associated sputum collection effect of bacterial burden. This limitation however is somewhat mitigated by the finding that model fits did not change from start of therapy to end of therapy. Finally, not all patients who do not reach bacterial population extinction will fail therapy or relapse. This means that our approach may lead to over treating of these patients who would otherwise be cured. Examination of our proposed biomarkers with other tests such as radiological findings and therapeutic drug monitoring could reduce the number of over treated patients and are subject to ongoing analyses. However, even with these limitations, the early TTP-based biomarkers that we identified as predicting long-term clinical outcomes such as relapse for different therapy durations, have sensitivities and specificities that are higher than currently employed methods.

Methods
Study design, data extraction and definitions. Our study design is reported in detail in Fig. 1. Briefly, we took data for bacteriologically confirmed TB patients that were enrolled in the REMoxTB clinical study 3 . In which patient sputum was cultured in the Mycobacteria Growth Indicator Tube (MGIT) to confirm bacteria viability. Since our aim was to develop a method agnostic of regimens used and drug-resistance status, patient data from the study 3 was used in our analyses regardless of drug-resistance status. Patients with majority of sputum samples that were contaminated or missing were excluded.
Patient and microbial details, including therapy regimens and serial TTPs, were extracted from the CPTR website (http://www.cptrinitiative.org). Time-toextinction was defined as achieving a bacterial burden ≤10 −2 colonies/mL, as mathematically justified in our prior work 24 . Microbiologic cure was defined as two negative sputum cultures without an intervening positive. Relapse was defined by the re-appearance of positive culture in patients deemed cured at the end of therapy. Relapses were confirmed by 24-locus mycobacterial-interspersedrepetitive-unit analysis 3 . Failure to attain microbiologic cure at the end of therapy defined therapy failure, as per REMoxTB study protocol 3 .
Data partitioning. Patients on the standard TB therapy regimen were randomly partitioned into two subsets of equal size. The first set was designated as the model derivation set, while the remainder was assigned for use in model validation (validation data set). To capture sufficient relapse events, only patients with at least two consecutive sputum samples during follow-up after treatment were used in model training and cross validation. Patients who received the experimental REMoxTB arms were used only in the validation dataset for sensitivity and specificity of predictors with 4 months therapy duration.
Mathematical modeling for converting TTPs to CFUs. In order to convert TTPs to CFU/mL, we applied the formula: where F is CFU/ml, x is TTP (number of days the MGIT indicate presence of viable bacteria), α-represent log 10 CFU/mL quantity when TTP take less than 1 day to turn positive, while β is a rate (per day) that apportions changes in TTP to the corresponding CFU value, and γ is a conversion adjustment parameter. Estimated values for these parameters are given in Table S5 together with an alternative model with γ = 0. We previously derived these parameters using more than 600 data point pairs from logarithmic phase growth and semi-dormant (or non-replicating phase) hollow fiber system model experiments 24 . Bacterial burden from these experiments were quantified using (i) solid agar culture for CFUs, and (ii) liquid medium in the MIGIT for TTP. The hollow fiber model is repetitively sampled for CFUs and TTPs for up to 56 days on therapy. Bowness and colleagues have found that as treatment progresses, the recovered Mtb grew more slowly in culture, so that a linear equation model (including only constants a, b, c) that remain unchanged during treatment would be incorrect by day 14, and instead a Gompertz model with a time parameter would be better 40 . While our formula is not a linear regression equation, we still wanted to find out if it was accurate at the start of therapy as at 56 days, in patients. Therefore, we applied formula/equation #1 to an independent clinical data set of patients on TB therapy, the vitamin A study in which we had weekly TTPs and CFUs in 56 patients on standard therapy 18,24 . Results are shown in Figure S1, which shows that our formula remained accurate at 56 days as on day 0. Therefore, we employed equation #1 for toggling between CFU/mL and TTP.
Mathematical model. Our mathematical model, described in detail in the past 24 43,44 . The parameters r f and r s also measure of the reproductive or growth fitness, a measure of their virulence. The fast replication (log phase growth) Mtb grow at rate r f while the slow at rate r s . It has been shown that in TB patients, these bacteria subpopulations co-exist, however, in active TB disease, the population of bacteria in log-phase is dominant 25,26,28,41,42,45 .
The model has flexibility to track the time evolution of both Mtb subpopulations simultaneously, under effect of treatment with different combination regimens. In relation to assessing new surrogate markers or biomarkers for predicting TB treatment outcomes, the model has two sets of quantifiable parameters (i) r f , r s and K max (Mtb growth parameters) and (ii) γ s and γ f (drug-regimen based microbial kill slopes), that are linked to disease pathogenesis, and therefore has the ability to predict disease outcomes independent of a specific TB therapy regimen. Further mathematical details and assumptions of the model are shown in our previous study 24 .
ODE-based model to data fitting. First, all patient TTP longitudinal observations were converted to CFU values using Eq. 1. Then the data was fit to the system of ODEs (Eqs. 2 and 3). We implemented the Markov chain Monte Carlo (MCMC) method in R 22,24,46 to estimate the drug kill parameters using 50,000 runs of the chain. A Gaussian log-likelihood was used to generate posterior distributions for parameters assuming uniform distribution for the priors. Model to data fitting was done in two steps as was done in the study in Magombedze et al. 24 . In the first step, Hollow fiber control experiments data was used to estimate growth rates (r f and r s ) of the bacteria subpopulations and the bacteria population carrying capacity (K max ). The control experiments were carried out to determine the growth of the clinical isolates to imitate bacteria log-phase growth and the non-replicating growth phase. The model was fit to the data with the assumption that there were no drugs, that is, γ f = 0, and γ s = 0. The estimated values are given in Table S6. In the second steps, bacteria growth rates and carrying capacity estimated is step I, were used here and are kept fixed during model fitting to estimate the γ f and γ s slopes. These estimates and parameters have been tested in a clinical dataset of 78 patients in the past, with excellent fit, demonstrating that the assumptions hold in clinical sputum samples 24 . The estimates were derived as medians of the MCMC posterior distribution, the uncertainty was given by 95% credible intervals (CrIs) calculated from the 2.5% and 97.5% quantiles of the MCMC parameter posterior distribution. MCMC convergence was assessed visually and by using the chain convergence diagnostic tools in the R coda package.
Identification of biomarkers predicting outcomes in derivation dataset. Identification of biomarkers that best predicted therapy outcomes was carried out using classification and regression criteria (CART) of Breiman et al. 47 . Using the derivation dataset, we examined all demographic, clinical, and radiological factors, as well as model-derived γ s and γ f slopes and the initial bacterial burdens (B(0)), as potential predictors of outcome. Outcome was defined as either therapy success at end of therapy, or therapy failure (failure at the end of treatment or relapse), or relapse. The steps we followed were implemented by two independent investigators in R (Rpart) and Salford software, and have been described in detail in the past 39 . First, CART analysis was used to identify and rank the top predictors of therapy failure and relapse. Second, we used clustering to characterize the relationship of the top predictors for each specific treatment outcome, and also identified the statistical association 48 . TTP trajectories were clustered using the K-means algorithm implemented in the KML-package in R 48 . The 6-month TTP data for each cluster was reduced to derive (i) the 4-month slopes [using the first 4 months accrued data] and (ii) then 2-month slopes (based on the first eight-week accrued data). The model was fitted to data for each separate cluster and their respective reduced subsets.
Finally, we utilized Markov chain Monte Carlo simulations of time-to-extinction in tandem with CART to identify slope thresholds and initial bacterial burden that best classified relapses and therapy failure 46 .
Mathematical simulations for indeterminate data zones. We computed 10,000 bacteria trajectories to simulate different treatment outcomes. The initial bacterial burdens based on the range in derivation data set of between 3-7 log 10 CFU/mL and γ-slopes between 0.05 to 0.5 log 10 CFUs/day, were varied simultaneously, with the rest of all model parameters held constant. TTE for each separate trajectory was computed. The TTE values define the transcritical bifurcation points that explains when the Mtb NRP stable state switches to extinction. Regions of time within which bacteria subpopulations would go extinct were constructed and partitioned to reflect the expected clinical treatment duration intervals.
Sensitivity analysis for treatment duration. Monte-Carlo experiments were carried out to identify changes in γ s values that resulted in treatment duration shortening (2 and 4 months) and those that led to prolonged treatment duration (7, 8 and 9 months). Magnitudes that correspond to these treatment end-points were determined relative to different categories of patient initial bacterial load, (i) high (>5·0 log 10 CFU/mL), (ii) medium (3·5-5·0 log 10 CFU/mL) and (iii) low (<3·5 log 10 CFU/mL). These bounds were selected to toggle between CART discrete bounds and sweep across continuous patient CFU burdens to examine effect of different slope magnitudes on outcome for the defined therapy durations.
Validation of identified biomarkers. Individual patient TTP trajectories were fitted to the model to identify the corresponding γ s and γ f in the validation datasets.