Mixed infections in genotypic drug-resistant Mycobacterium tuberculosis

Tuberculosis disease (TB), caused by Mycobacterium tuberculosis, is a major global public health problem, resulting in more than 1 million deaths each year. Drug resistance (DR), including multi-drug (MDR-TB), is making TB control difficult and accounts for 16% of new and 48% of previously treated cases. To further complicate treatment decision-making, many clinical studies have reported patients harbouring multiple distinct strains of M. tuberculosis across the main lineages (L1 to L4). The extent to which drug-resistant strains can be deconvoluted within mixed strain infection samples is understudied. Here, we analysed M. tuberculosis isolates with whole genome sequencing data (n = 50,723), which covered the main lineages (L1 9.1%, L2 27.6%, L3 11.8%, L4 48.3%), with genotypic resistance to isoniazid (HR-TB; n = 9546 (29.2%)), rifampicin (RR-TB; n = 7974 (24.4%)), and at least MDR-TB (n = 5385 (16.5%)). TB-Profiler software revealed 531 (1.0%) isolates with potential mixed sub-lineage infections, including some with DR mutations (RR-TB 21/531; HR-TB 59/531; at least MDR-TB 173/531). To assist with the deconvolution of such mixtures, we adopted and evaluated a statistical Gaussian Mixture model (GMM) approach. By simulating 240 artificial mixtures of different ratios from empirical data across L1 to L4, a GMM approach was able to accurately estimate the DR profile of each lineage, with a low error rate for the estimated mixing proportions (mean squared error 0.012) and high accuracy for the DR predictions (93.5%). Application of the GMM model to the clinical mixtures (n = 531), found that 33.3% (188/531) of samples consisted of DR and sensitive lineages, 20.2% (114/531) consisted of lineages with only DR mutations, and 40.6% (229/531) consisted of lineages with genotypic pan-susceptibility. Overall, our work demonstrates the utility of combined whole genome sequencing data and GMM statistical analysis approaches for providing insights into mono and mixed M. tuberculosis infections, thereby potentially assisting diagnosis, treatment decision-making, drug resistance and transmission mapping for infection control.

Tuberculosis (TB), caused by Mycobacterium tuberculosis, is a major global health problem, responsible for 10.6 million cases and 1.6 million associated deaths in 2021 alone 1 .Whilst, TB is a treatable disease, resistance to anti-TB drugs, especially first-line rifampicin (RR-TB) and isoniazid (HR-TB), together called multi-drug resistance (MDR-TB), is making infection control more difficult.To acquire resistance to anti-TB drugs, M. tuberculosis drug targets or activating proteins are often mutated 2,3 , including by single nucleotide polymorphisms (SNPs) and insertions and deletions (indels); a process involving vertical, but not horizontal, gene transfer.It is being increasingly recognised that within-host mixed strain infections (MSIs) are contributing to TB drug resistance, with heteroresistance involving the co-existence of susceptible and resistant strains.MSIs can arise due to the reinfection of an infected host with a new strain of M. tuberculosis, which is often observed in relapse patients, as well as emerge where there is distinct clonal evolution within the infected host 4 .MSIs may be driven by inadequate treatment schemes where a diagnosed TB patient will receive combination therapies of sometimes toxic drugs for a minimum of 6 months, and non-compliance or treatment failure can arise.Heteroresistance has been responsible for higher rates of treatment failure, thereby limiting treatment options in TB patients 5 .Often without proper strain and drug resistance profiling, the treatment of MSI patients may involve second/third-line drugs with less efficacy, more serious adverse drug reactions, and a prolonged treatment period.Therefore, identifying the complete pathogen diversity within the host is useful for achieving favourable clinical treatment outcomes.
The phylogeny of M. tuberculosis consists of 4 major lineages (L1-L4), which consist of different strain types that may vary in their propensity to transmit and cause severe disease 6 .MSIs of M. tuberculosis can be identified in high-depth whole genome sequencing (WGS) data through the presence of heterozygous genotypes.Strains and SNPs with high numbers of heterozygous sites are typically removed from the analysis, often thought to be the effects of contamination or sequencing errors.The deconvolution of different lineages within MSIs can be determined from such data by estimating the ratios of allele coverage at different lineage-specific SNPs 6 .However, for heteroresistance the challenges lie in determining the lineage each resistance-linked SNP belongs to; thereby obtaining information for lineage-specific drug resistance profiling in an MSI.To infer this, we often rely on any overlap between the lineage-specific and drug resistance SNPs, which is not straightforward using short-read sequencing data, and often leads to many orphan drug resistance SNPs that are unassigned to strains.However, this problem can be resolved using data from long-sequencing platforms.
It is possible to profile drug resistance and lineages from WGS data to inform clinical and infection control, for example, using the TB-Profiler tool 2 .However, whilst it is possible to call mixed genotypes, such software typically lacks the means of disentangling the different SNPs on specific different strains within an MSI, which could enhance profiling.Previous work 4,7,8 on mixed infections in TB has provided a means of identifying specific lineages involved in MSI samples and the sample drug resistance.Nonetheless, the connection between the identified lineage and sample drug resistance is still undetermined.Here we built a statistical tool based on Gaussian mixture models (GMMs) to distinguish different strain lineages' fractions in an MSI, and assign drug resistance to each lineage, without the need for detecting drug resistance in lineage-specific SNPs on the same sequencing read.In general, a GMM is a probabilistic model representing multiple Gaussian distributions within a population, and the algorithm determines their number and mixing proportions.Amongst many applications, GMMs have been used successfully to identify protein families 9 , cell types from omics data 10 , and to classify cancers 11 .Here, we apply a GMM model to deconvolute 531 MSI samples detected in a large M. tuberculosis WGS "50k" dataset (n = 50,723) 6,12 by TB-Profiler software 2 (Fig. 1A).We test the accuracy of the GMM algorithm in a simulation study and estimate the number of MSIs and heteroresistance across different lineages.Ultimately, the disentanglement of strains and drug resistance involved in MSIs could assist in the optimisation of treatment decisions and potentially prevent the emergence of further resistance.

Performance of GMMs on artificial mixes
The predictive power of a GMM approach was first assessed using WGS data from 48 samples with known mixes of DNA, varying with major proportions of 1, 0.95, 0.9, and 0.7, obtained from clinical Malawi M. tuberculosis 4 strains (S3 Table ).GMM models and TB-Profiler achieved low mean squared errors (MSEs) in samples with a predominant strain, but were consistently low overall (GMM: 0.006; TB-Profiler: 0.006).Quant-TB appears to perform better on samples with a major mixing proportion of 0.7, but obtained a higher overall MSE value (0.020) than the other methods.
A limitation of the mixed DNA samples from Malawi is that none are drug resistant and there were no ratios close to parity (50:50).To explore the ability of GMM to predict heteroresistance in a wider range of lineage ratios and larger sample sizes, in silico artificial mixtures with known drug resistance and mixing proportions were generated (S4 Table ).Our simulations suggest that the GMM approach achieves high accuracy for the prediction of drug resistance across the minor proportions between 0.05 and 0.50 (accuracy: median 0.93; range: 0.89-0.97).The levels of MSE for the GMM were low and consistent across mixing proportions (overall MSE of 0.012).In comparison, the overall MSE values for TB-Profiler were slightly lower (0.009) and for Quant-TB greater (0.013).

Discussion
Mixed-strain infections (MSIs) of M. tuberculosis present themselves WGS data as heterozygous genotypes and are typically removed from the analysis.However, they are informative for heteroresistance, which can derail treatment effectiveness.The degree of MSIs may have been historically underestimated with various colony sampling techniques leading to the culture of clonal M. tuberculosis, as well as through bioinformatic analyses that have excluded samples with too many heterozygous genotypes.Recently, it has been found that short-term culture methods and the direct WGS of sputum or lung tissue can lead to a more accurate representation of within-host M. tuberculosis diversity of TB patients 13 .Relatedly, studies of lung tissues reveal that TB infections may be more complex than previously thought, compared to sputum, which is typically the predominant biological sample used.Further, it has been shown that the magnitude of MSIs in high-burden TB settings is underestimated when only testing sputum samples 14 , and in such settings diversity and complexity can be further reduced and underestimated through M. tuberculosis culture and colony selection 15 .
In other contexts, such as malaria, the degree of the multiplicity of infection may be a surrogate of transmission intensity 16 .A previous TB study has identified MSIs in high transmission regions in Pakistan 12 , and we confirm an example of such a complex sample (ERR4829977).In our work, we observe the higher involvement of lineages 2 and 4 in MSIs, which may reflect the convenience nature and confounding effects of the sampling, but such strain types have been found previously to be more transmissible and virulent 6 .For samples in lineages 5 and 6, which are thought to be less transmissible and have slower growth on conventional TB diagnostic media that may influence phenotypic testing results, our genotypic-based modelling approach could detect MSIs.Overall, our results appear to reveal decreased involvement of the less transmissive lineages, such as lineage 7, which is consistent with other studies 17 , but may also be due to their lower sequencing rates.Irrespective, none of the MSI samples found seemed to be present in transmission cluster, which may be indicative of re-infection of TB by incomplete treatment and poor adherence.
Typically, the issue of MSIs can be disentangled into linked components, namely, identifying their presence, followed by estimation of the (minimum) number of clones, and then deconvoluting the genotypes in those clones.Here, we applied a GMM method to determine the clonal drug-resistant genotypes in MSIs in M. tuberculosis with WGS data, identified initially by TB-Profiler 4 , which has the flexibility of using different informative mutation lists for genotypic profiling.It was found that the signal from alternative haplotype frequency was sufficient to differentiate strains, supported by the analysis of mixtures in artificially generated mixed samples.Similarly, Quant-TB and TB-Profiler software were used to confirm the mixing proportions, and the GMM offered lower or comparable error rates in simulated data and samples.Overall, the GMM approach appears to provide a "rapid", non-culture-based method to assign the drug resistance profile to each lineage, thereby providing insights into lineage-specific drug resistance, which can inform clinical decision making.Specifically, the disentanglement of drug resistance to each lineage in an MSI can assist diagnosis and optimise the personalisation of treatments.It can also help prevent the development of drug resistance, including by avoiding the use of ineffective drugs.Our study, therefore, provides additional proof of concept evidence for the use of WGS-based diagnostics.
Our GMM approach can be used to monitor the within person evolution of strains, the detection of drug resistance mutations and their transmission or related outbreaks at a population level, where tracking the source and spread of each strain can be challenging.It can detect and dissect susceptible samples from the same patient with similar or identical sub-lineages, thereby potentially inferring reinfections.If these samples were from an MSI, and not with identical genomes, our approach would work directly.The detection of heteroresistant infections within the same sub-lineage is more difficult, but can also be analysed, as our input files www.nature.com/scientificreports/contain the proportion of alternative allele in SNPs, which can be used for detecting any mixed gene sequencing reads.Our approach was tested using a convenience sampled collection of isolate data, sourced mostly from clinical samples collected across many different studies using varying individual collection and laboratory culture methods (e.g., MGIT), which may have influenced the estimated prevalence of MSIs and drug resistance.The growing application of whole genome or amplicon-based sequencing platforms, including using portable Oxford Nanopore Technology, will lead to increasing amounts of genomic data for such surveillance and clinical applications, including an accurate estimation of the extent of MSIs Further improvements to the GMM approach can be made by using the M. tuberculosis phylogenetic tree structure to extend the approach to other members of the MTBC, as well as exploit intrinsic linkage disequilibrium patterns to increase the strain lineage identification for each drug resistance SNP.In addition, it has been shown that some strains of M. tuberculosis are preferentially associated with resistance to certain drugs 18 , and mixed resistance accuracy could be improved by including this association.Alternatives or extensions to GMMs, such as variational GMM, can also be implemented to improve the levels of performance.In lieu of such efforts, we have presented a GMM-based tool that uses WGS data to disentangle lineage and drug resistance genotypes, which can be used to inform clinical and surveillance decision-making for TB control.

(
A) Sample size by region (B) Drug resistance by World Health OrganizaƟon region Map coloured according to WHO regions.HR-TB: isoniazid mono-resistance.MDR-TB: MulƟ-drug resistance; RR-TB: Rifampicin resistance; XDR: Extensively drug-resistant; Other: mono-and polyresistance not covered by other resistance classes.Sample size (C) ProporƟon of TB heteroresistance Countries with no data are coloured on the map in light grey.

Figure 1 .
Figure 1.World maps of the 50k M. tuberculosis dataset.(A) Sample size by region.(B) Drug resistance by World Health Organization region.Map coloured according to WHO regions.HR-TB isoniazid monoresistance, MDR-TB multi-drug resistance, RR-TB rifampicin resistance, XDR extensively drug-resistant, Other mono-and poly-resistance not covered by other resistance classes.(C) Proportion of TB heteroresistance.Countries with no data are coloured on the map in light grey.

Figure 3 .
Figure 3.The M. tuberculosis isolates putative evidence of mixed strain infections as measured by drug resistance type (n = 531).

Table 1 .
The M. tuberculosis isolates by World Health Organization (WHO) region.

Table 3 .
Examples of clinical mixed strain infections with disentangled the drug resistance mutations established for each strain using our GMM.*Mixing proportions: (TB-Profiler prediction) [GMM prediction]; PAS P-aminosalicyclic acid.