A sister lineage of the Mycobacterium tuberculosis complex discovered in the African Great Lakes region

The human- and animal-adapted lineages of the Mycobacterium tuberculosis complex (MTBC) are thought to have expanded from a common progenitor in Africa. However, the molecular events that accompanied this emergence remain largely unknown. Here, we describe two MTBC strains isolated from patients with multidrug-resistant tuberculosis, representing an as-yet-unknown lineage, named Lineage 8 (L8), seemingly restricted to the African Great Lakes region. Using genome-based phylogenetic reconstruction, we show that L8 is a sister clade to the known MTBC lineages. Comparison with other complete mycobacterial genomes indicates that the divergence of L8 preceded the loss of the cobF genome region (involved in cobalamin/vitamin B12 synthesis) and gene interruptions in a subsequent common ancestor shared by all other known MTBC lineages. This discovery further supports an East African origin for the MTBC and provides additional molecular clues on the ancestral genome reduction associated with adaptation to a pathogenic lifestyle.

Last paragraph of the Intro -"we used PacBio and Illumina WGS to reconstruct the full circular genome" -please specify which sample (country) was used for this as well as why this was not done for both samples here. I understand only sequencing data was available for the second patient's sample, but this was only clear much later in the paper; it would be more clear if this were stated upfront.
"Molecular scars"? Could authors please include a definition of this in the current manuscript, as even working in this field, this is not terminology I am familiar with; this definitely needs a bit more explanation to be accessible to a broader audience. Along these lines, will the general audience of this journal know what standard short-course MDR TB treatment is? Please review the manuscript for very TB-specific concepts, and whether these may warrant a bit of additional explanation for those who aren't TB specialists.
There were a couple places where authors should be less firm about their conclusions, given the paucity of data (2 samples so far from this lineage); for example, in the Abstract, authors state Lineage 8 is restricted to the African Great Lakes region. They also state this on the second page after "WGS analysis" ("indicate that L8 is generally rare and geographically restricted to…"). While their two isolates are from this region, and the work is suggestive that this lineage is from this area, this is not conclusive from this work, so I would change the wording a bit here to convey this uncertainty. Another example is "our results indicate L8 is as clonal as the rest of the MTBC"; again, this is an analysis of 2 samples so this is very preliminary and suggestive at best.
Related to this, please show the ClonalFrameML results in the Supplement, rather than saying 'data not shown'. Thank you.
Results -"L8 related TB patient in Rwanda" -is it 'L8 related' or is it just a 'patient with L8 TB'? The wording could be changed here. There are three instances of "L9" being used instead of L8; please correct.
In the Discussion, authors report no previous TB treatment for the Rwandan patient. Have authors tried to contact the publishers of the Ugandan genome to see if this information would be available for this patient? Also in the Discussion, "If true, more recently emerged or introduced cobF-deleted strains might have conceivably largely outcompeted L8 strains" -how would authors propose to test this hypothesis?
A few comments for clarity: While I appreciate the format of the journal has the Methods last, many readers prefer to read the Methods before the Results to be able to properly interpret them. As the paper stands, some of the Methods are not clear on their own; for example, authors mention the DeeplexMycTb assay in the Methods, but not what this is and what samples it is being done on, and don't explain that the Rwandan sample is actually avai. Could this please be added to the Methods themselves?
In the Discussion "suggests prolonged exposure to antibiotic treatment and human-to-human transmission of a drug resistant strain" -This sentence could be rephrased for clarity. I think authors mean that these patients were likely infected with already-resistant strains, which had likely been circulating for some time in the community, not that these patients themselves had prolonged exposure to antibiotics? The following two sentences (starting with "The observation that…") could possibly be moved before this.
Please add line numbers and page to submissions if possible. It's more difficult to keep track as a reviewer and properly refer the authors to the right sections without these. An extra space or indenting for new paragraphs is also helpful. Thank you.
Reviewer #3 (Remarks to the Author):
The paper "A sister lineage of the Mycobacterium tuberculosis complex discovered in the African Great Lakes region" constitutes a description of a new Mycobacterium tuberculosis complex (MTBC) lineage which forms a taxonomic outgroup to the typically pathogenic lineages observed to date, but is more closely related to these and the possibly environmental M. canettii. The analyses performed strongly suggest that the new lineage (L8) is a very rare pathogen of humans. The observation is interesting, but in my opinion, of limited interest to people outside the core TB community, and would perhaps be more suitable for a more specialized journal.
The presence of mutations conferring INH and RIF resistance, which are both rare yet common to both L8 isolates, is quite interesting, especially in light of the significant SNP distance between the two isolates (100 SNPs). Yet, it is also clear that the authors don't have much evidence at the moment to explain this observation, which they state would indicate a high substitution rate (for the resistance SNPs to have emerged following the introduction of the relevant drugs in the region). In the discussion, the authors don't cite any source for their statement about the history of INH and RIF use in the region -could RIF possibly have been used earlier? Could the isolates have emerged elsewhere, where RIF has actually been used earlier? Or could the "upper mutation rate" of 2.2 SNPs/genome/year (Menardo et al 2019) be somewhat off?
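The tension raised in this comment can be made concrete with a back-of-the-envelope calculation. The sketch below is only an illustration, assuming a strict molecular clock, equal rates on both branches, and isolates sampled at roughly the same time (none of which is established for L8); the function name `years_to_mrca` is hypothetical. It uses the two figures quoted in the comment: a pairwise distance of 100 SNPs and an upper rate of 2.2 SNPs/genome/year.

```python
def years_to_mrca(pairwise_snps: float, rate_per_genome_per_year: float) -> float:
    """Years since two isolates shared a common ancestor under a strict clock.

    Both branches accumulate substitutions independently, so the pairwise
    distance d grows at twice the per-lineage rate: d = 2 * rate * t.
    """
    if rate_per_genome_per_year <= 0:
        raise ValueError("rate must be positive")
    return pairwise_snps / (2.0 * rate_per_genome_per_year)


# Figures quoted in the reviewer's comment (assumed to apply to both branches)
snp_distance = 100
upper_rate = 2.2  # SNPs/genome/year

t = years_to_mrca(snp_distance, upper_rate)
print(f"Estimated time to common ancestor: {t:.1f} years")  # 22.7 years
```

Under these assumptions, even the upper rate implies a little over two decades of divergence, and any slower rate lengthens that estimate proportionally. This is why the date at which rifampicin came into use in the region matters for the argument.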
Overall, I'm not convinced that the observation of a weird and very rare lineage which has retained a cobF gene region lost in other MTBC members, has any major consequences for interpretations of MTBC evolution and pathogenicity. Perhaps, with more data it will. However, I do believe it
1. The statement in the abstract and the paper that L8 is restricted to the Great Lakes region is too strong when there are only two observations available. The authors may well be correct that it is, but the evidence is still not available for such a strong statement.
2. The introduction starts with a statement that MTBC is among the most ancient human diseases, but I'm not convinced the sum of current evidence supports this notion. The TMRCA of MTBC has been estimated to 4-6000 years ago, using the best available (aDNA) methods (Bos et al Nature). Is this significantly older than "typical" pathogens?
6. Figure 2 doesn't seem to add much value and could optionally be skipped entirely.

Reviewers' comments
Reviewer #1 (Remarks to the Author):
Jean Claude Semuto Ngabonziza and co-workers report the discovery of a novel, rare, geographically unique, human-to-human transmitted lineage (L-8) that is a phylogenetic outlier of canonical MTBC complex consisting of two major lineages. Complete genome-based investigation of a strain from this lineage explains the emergence, diversification, and success of MTB as a virulent and successful pathogen. The genomic resource and findings from this lineage fill a major and probably most critical gap in systematic understanding on the East African origin of MTBC from environmental or non-professional pathogenic form as STB to an obligate, successful and virulent pathogen.
We thank the reviewer for the very positive comments on the significance of our results.
Below are major comments that need to be addressed before taking any decision on the
Between the two phylogenetic approaches and the gene gain/loss analysis, we believe our finding of the placement of this new lineage to be robust.
2) How their study explains the origin of human and animal associated lineages...about host jump or about how human transmitted the disease to animals.
Authors' reply: Thank you for this question. In the Discussion, we explain that our results are consistent with the presumed scenario of a human rather than a zoonotic origin of the MTBC, already established 20 years ago in refs. 31 and 43, as follows: "Moreover, the observation that both L8 strains share two uncommon rifampicin- and isoniazid-resistance-conferring mutations in rpoB and inhA suggests that multidrug resistance was already acquired in the common ancestor of these two strains. Isoniazid and rifampicin were introduced in TB treatments in the African Great Lakes region in the late fifties and 1983, respectively (Dr. Armand Van Deun, personal communication). These shared MDR-defining mutations, and the detection of these isolates in human patients in both cases (with reported absence of previous TB history for the Rwandan patient), suggest that these patients were infected with an already-resistant strain, which was exposed to drug selective pressure already decades ago and had been likely circulating in the community for some time. Overall, this pattern thus suggests human-to-human transmission rather than infection from a non-human source. While based on only two initial strains, these results are consistent with the presumed scenario of a human rather than a zoonotic origin for the MTBC." We respectfully think that speculating on host jump and how humans transmitted the disease to animals might be beyond the scope of this study.
3) The authors report that "Among 14 other isolates out of 27 from Uganda and Rwanda
4) What is the prevalence of cobF in sub-sample 1500 isolates originating from Uganda, Rwanda and DRC that were screened using Deeplex MTB or in the NCBI WGS data set of strains from MTBC complex?
Authors' reply: Likewise, no L8 SNP and spoligotype pattern was identified in these 1500 samples by Deeplex-MycTB, also implying membership to other known MTBC lineages. As mentioned above, none of the 36 complete genomes or the 6,456 quality draft genome assemblies from MTBC lineages 1 to 7 or animal lineages available from the NCBI was found to contain cobF. The only genome assembly in the NCBI dataset that we found to contain cobF is the L8 strain from a patient from Uganda that was originally misclassified as M. bovis, as indicated in the text.
Authors' reply: cobF is lacking in all the complete genomes as well as in the draft genome assemblies other than the single L8 strain from Uganda available from the NCBI. Moreover, via BLAST analysis, we further confirmed the systematic absence of cobF in any of 6,456 quality draft genome assemblies available from the NCBI, from strains belonging to lineages 1 to 7 or the animal lineages of the MTBC. We thereby determined that the junction between the sequences flanking the cobF deletion was at the same nucleotide position in all but 6 of these genomes, resulting in the truncation of rv0943c and rv0944 genes as seen in the complete MTBC genomes (Fig. 4). Consistent with the clonal evolution of the MTBC with negligible, if any, horizontal gene transfer between strains (ref. 1, 14, 32), this perfect conservation of this sequence junction suggests that cobF was lost in the MRCA of the other MTBC lineages, after its divergence from L8. The 6 exceptions were 3 strains from lineage 4.3 and 3 that showed slightly larger deletions, including the 5' region of rv0943c or the 5' region of rv0943c, rv0944 and the 5' region of rv0945, respectively, suggesting probable subsequent deletion events in particular sub-branches of these lineages. This additional information is now included in Results as such (lines 327 onwards).
Authors' reply: Thank you for this suggestion. A genome completeness check has been included using the CheckM lineage-specific workflow. This found a 98% completeness and no contamination or strain heterogeneity. This has been outlined in the methods lines 739-742.
Moreover, we proceeded even more systematically by verifying the presence/absence of all genes or collocated genes (single- or multi-copy) in L8 versus 36 MTBC reference genomes descending from the same ancestor, as well as in a phylogenetically proximal, external set of M.  Table 2). This is now indicated in Results as follows: "Apart from these three deletions and two dozen repetitive/multicopy genes (IS6110-related, PE/PPE-, or Mce-encoding), we only found 5 non-repetitive genes, included in two small segments (3.4 and 4.4 kb), which were undetected in the complete L8 genome while being present in reference MTBC genomes (Supplementary Table 2)".
Conversely, the cobF locus is the only non-repetitive region that is present in the genomes of both L8 strains, and absent in all complete and draft MTBC genomes analysed (Supplementary Table 3). The two L8 genomes were sequenced completely independently, clearly ruling out contamination as an explanation, consistent with the CheckM results.
8) In the results section describing complete genome assembly of PacBio reads, coverage of the genome is not clearly mentioned. What do 186x, 39x and 38x refer to? Do these correspond to the chromosome and plasmids in the complete genome?
Authors' reply: These numbers refer to the estimated genome coverage depths using raw, corrected and trimmed PacBio reads, respectively. This is now clarified as follows in the Methods: "After discarding 60,272 reads below minimal quality parameters, 106,681 reads were used for the assembly. Based on the expected genome size, the average coverage depth was estimated at 186x using raw reads, and 39x and 38x using corrected and trimmed reads, respectively".
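For readers unfamiliar with the calculation, average coverage depth is conventionally estimated as the total number of sequenced bases divided by the expected genome size. The sketch below illustrates this under stated assumptions: the function name `mean_coverage_depth`, the read lengths, and the genome size are hypothetical placeholders, not values taken from the manuscript.

```python
def mean_coverage_depth(read_lengths, expected_genome_size):
    """Average depth = total sequenced bases / expected genome size."""
    if expected_genome_size <= 0:
        raise ValueError("expected_genome_size must be positive")
    return sum(read_lengths) / expected_genome_size


# Hypothetical toy example: four reads over a 10 kb genome
toy_reads = [5_000, 8_000, 7_000, 10_000]
toy_genome_size = 10_000

depth = mean_coverage_depth(toy_reads, toy_genome_size)
print(f"{depth:.1f}x")  # 3.0x
```

Because read correction and trimming typically discard or shorten reads, applying the same calculation to corrected and trimmed reads would be expected to give lower depths than for raw reads, in line with the drop from 186x to 39x and 38x described in the reply.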

Other comments
1) In methods the last section it is mentioned as L9 genomes, it should be L8
2) Figure 1 legend L9 is mentioned.....it should be L8.

Reviewer #2 (Remarks to the Author):
This is a well-written paper reporting a very interesting and relevant finding for the TB community. In the context of two separate analyses (1 done in an MDR-TB trial, the other a screening of publicly available genomes), authors identified what appear to be a novel lineage of MTBC. The first strain, from Rwanda, was available to the authors for sequencing with PacBio/Illumina to complete the genome. The latter strain was from Uganda; the raw sample did not appear to be available for further analysis. The Rwandan and Ugandan strains appear very similar based on a variety of analyses (e.g., spoligo, comparison of AMR genes/mutations), and form a new clade on phylogenetic analysis compared to a representative set of global TB human and animal-adapted lineages.
Authors presented a comprehensive, detailed, and convincing analysis of these two strains, therefore, most of my comments are fairly minor.
Thank you for the appreciation of the interest and comprehensiveness of our analysis.
1) Last paragraph of the Intro -"we used PacBio and Illumina WGS to reconstruct the full circular genome" -please specify which sample (country) was used for this as well as why this was not done for both samples here. I understand only sequencing data was available for the second patient's sample, but this was only clear much later in the paper; it would be more clear if this were stated upfront.
Authors' reply: We accordingly changed this paragraph as follows: "These two strains were discovered in two independent analyses, and were both multidrug-resistant (MDR; i.e. resistant to at least rifampicin and isoniazid). One was isolated from a tuberculosis patient in Rwanda through an ongoing MDR-TB diagnostic trial in Africa. The second isolate was from a patient in Uganda, and its identification was inferred upon screening publicly available draft genome datasets, where it was misclassified as an M. bovis strain. We used PacBio and Illumina WGS to reconstruct the full circular genome of the Rwandan strain. We utilized these data and the available Illumina sequencing data of the Ugandan strain to reconstitute the phylogeny of this novel lineage, which we named Lineage 8 (L8), and further investigate molecular and evolutionary events associated with the emergence of the MTBC."
2) "Molecular scars"? Could authors please include a definition of this in the current manuscript, as even working in this field, this is not terminology I am familiar with; this definitely needs a bit more explanation to be accessible to a broader audience. Along these lines, will the general audience of this journal know what standard short-course MDR TB treatment is? Please review the manuscript for very TB-specific concepts, and whether these may warrant a bit of additional explanation for those who aren't TB specialists.
Authors' reply: Thank you for these suggestions. We have now explained molecular scars as follows: "Further evidence for the early branching of L8 relative to the rest of the MTBC comes from examination of interrupted coding sequences (ICDSs). These ICDSs correspond to frameshifts or in-frame stop codons detected in genes originally intact in a common progenitor, thus putatively representing so-called molecular scars inherited during progressive pseudogenization of the MTBC genomes". Likewise, we describe the standard short-course MDR TB treatment as follows: "standard short-course MDR-TB treatment (i.e. 9-month WHO-endorsed MDR-TB regimen, including moxifloxacin, kanamycin, protionamide, ethambutol, clofazimine, high dose isoniazid and pyrazinamide)". Also in response to reviewer 1's comments n°1, 3 and 5 (please see above), we further clarified other concepts or notions more specific to the TB field, such as the principle of the Xpert diagnostic test, clonality of the pathogen's population structure with negligible horizontal gene transfer between strains, and the use of M. canettii as an outgroup.
3) There were a couple places where authors should be less firm about their conclusions, given the paucity of data (2 samples so far from this lineage); for example, in the Abstract, authors state Lineage 8 is restricted to the African Great Lakes region. They also state this on the second page after "WGS analysis" ("indicate that L8 is generally rare and geographically restricted to…"). While their two isolates are from this region, and the work is suggestive that this lineage is from this area, this is not conclusive from this work, so I would change the wording a bit here to convey this uncertainty. Another example is "our results indicate L8 is as clonal as the rest of the MTBC"; again, this is an analysis of 2 samples so this is very preliminary and suggestive at best.
Authors' reply: In accordance with your suggestions, we changed the abstract as follows: "…representing an as-yet-unknown lineage, named Lineage 8 (L8), seemingly restricted to the African Great Lakes region". Likewise, the part after WGS analysis was changed to: "The absence of any matching pattern in the global spoligotype database, as well as the lack of detection of this clade in previous large WGS datasets of MTBC strains from global sources, suggests that L8 is rare and seemingly geographically restricted to the African Great Lakes region". We also changed the last sentence as follows: "Although our analysis is limited to two genomes identified to date, our results suggest L8 is as clonal as the rest of the MTBC".