Tuberculosis (TB) has been difficult to eradicate. According to the World Health Organization (WHO) global TB report, there were an estimated 10.4 million new TB infections and 1.3 million deaths worldwide in 2016. TB remains a substantial burden in various Asian countries, with annual incidence rates of 150–300 infections per 100,000 persons. In 2016, the rates were markedly high in Korea and the Philippines, with more than 500 cases per 100,000 persons, whereas the incidence rate in Japan was comparatively low at 13.9 cases per 100,000 persons. However, the emergence of TB among foreigners in Japan has become problematic because of the increasing numbers of travellers and workers from countries with high incidence rates of TB infection1. Furthermore, the number of patients with dual infections of TB and human immunodeficiency virus is also high, and multidrug-resistant and extensively drug-resistant forms of TB are emerging in Asian countries2,3,4,5,6.

Technological developments, such as whole-genome sequencing (WGS) and next-generation sequencing, have enabled the elucidation of the lineages and mutations of drug resistance (DR) genes based on the detection of insertions and deletions in the genome of Mycobacterium tuberculosis (Mtb)7. Furthermore, WGS is a useful method for tracing the origins of isolates and for phylogenetic analysis of single-nucleotide variants (SNVs), as well as for epidemiological studies aimed at detecting recent transmissions and tracing outbreaks of TB8,9. The Relational Sequencing TB Data Platform (ReSeqTB), Pathgenseq, and Sprint-TB databases have been developed to better understand the biology and behaviour of Mtb. In particular, the ReSeqTB database10 contains information on more than 6,000 isolates, including DR forms, from European, African, and Asian countries, mainly from the United Kingdom, Germany, South Africa, and the Russian Federation. Likewise, in 2016, our group created the Genome Research for Asian Tuberculosis (GReAT) database, which contains information on more than 3,000 clinical Mtb isolates from East and South-East Asian countries, including China, Taiwan, the Philippines, Vietnam, Korea, Mongolia, and Japan; this database is financially supported by the Japan Agency for Medical Research and Development. The purpose of the GReAT project is to collect genomic data for the control of TB worldwide and for the development of new technologies for the diagnosis and treatment of TB. The isolates included in the GReAT database were collected from surveys of TB and drug-resistant TB conducted in each of the participating Asian countries. In the present study, the characteristics of Mtb isolates included in the RYOKEN 2007 nationwide anti-TB drug susceptibility study in Japan and the GReAT database were investigated.

Materials and Methods

Strain collection and drug susceptibility testing

All Japanese strains were collected in the 2007 survey by the Tuberculosis Research Committee (RYOKEN), Japan, a nationwide coalition of TB hospitals in Japan that reports DR rates once every 5 years11. The isolates were collected from August 2007 to July 2008. In total, 329 isolates collected from patients aged >40 years were randomly selected from the registered patients in this study to investigate the proportion of foreign-born TB patients. In addition, 338 isolates collected from patients aged 0–39 years were also included for analysis. The WGS data of the isolates were analysed using an all-in-one web-based tool for genotyping of Mtb, namely, Total Genotyping Solution for TB (TGS-TB). All isolates were subjected to phylogenetic analysis, in silico spoligotyping, and DR prediction12. The concordance, specificity, and sensitivity of the drug susceptibility predictions by TGS-TB were calculated by comparison with the results of phenotypic drug susceptibility testing (DST), which was used as the gold standard. The phenotypic susceptibility of the isolates to anti-tuberculous drugs [i.e., isoniazid (INH), rifampicin (RFP), streptomycin (SM), ethambutol (EB), and levofloxacin (LVFX)] was determined using the proportion method13. The genotyping results of the isolates were compared with the results of SNV-based genotype analysis on TGS-TB12.

DNA preparation and WGS of the Mtb isolates

Genomic DNA was extracted using the DNA ISOPLANT Kit (Wako Pure Chemical Industries, Ltd., Osaka, Japan), purified with a QIAquick column (QIAGEN GmbH, Hilden, Germany), and quantified using a Qubit 2.0 fluorometer (Thermo Fisher Scientific, Waltham, MA, USA). Paired-end libraries were prepared from 50 ng of purified DNA with the QIAseq FX DNA Library Kit (QIAGEN GmbH, Hilden, Germany) in accordance with the manufacturer’s protocol. The average fragment size (500–600 bp) of the DNA libraries was estimated by 2% agarose gel electrophoresis. The fragments were then eluted using the Wizard SV Gel and PCR Clean-Up System (Promega Corporation, Madison, WI, USA). The 24 purified DNA libraries were pooled, and the DNA concentration was quantified with a Qubit 2.0 fluorometer. The pooled libraries (11 pM) were sequenced on an Illumina MiSeq system (Illumina, Inc., San Diego, CA, USA) with the MiSeq Reagent Kit ver. 3 (600 cycles), yielding first paired-end reads of 350 nt and second paired-end reads of 250 nt.

Genomic analysis of the isolates with informatic tools

The reads obtained from sequencing were analysed using TGS-TB12, which is a pipeline for conventional epidemiological analysis. Prediction of genetic markers for antimicrobial resistance (e.g., ahpC, embA, embB, embC, embR, ethA, ethR, gid, gyrA, gyrB, inhA, kasA, katG, pncA, rpoB, rpoC, rpsA, rpsL, and rrs) listed in the TB profiler database14, lineage analysis (i.e., lineage 1, lineage 2, lineage 2/Beijing, lineage 3, and lineage 4) based on single-nucleotide polymorphisms following KvarQ15, and in silico spoligotyping were automatically performed based on the sequence data. Core-genome phylogenetic and linkage networks were also analysed using TGS-TB12. Prior to in silico genotyping, the adapter sequences were trimmed from the short reads, and low-quality bases with a Phred score of <15 were eliminated12 using the Skewer program to obtain sequences that were at least 50-mers16. The remaining reads were mapped using the BWA-mem program17 to the reference genome sequence of Mtb strain H37Rv (NC_000962.3)18. Reliable SNV sites with at least a 5× coverage depth and a Phred score of ≥20 were selected12. In this study, the average coverage depth was 85×. Maximum likelihood phylogenetic analysis of all concatenated SNV alleles was performed using RAxML v8.2.019 with 1,000 bootstrap iterations. To identify epidemiological linkages among the isolates, data from queries for isolate-specific genes or the abovementioned reference genomes were downloaded as a NEXUS format file, and linkage networks were visualized, for example by the median-joining method, using PopART. In silico spoligotyping was performed by a search using the Basic Local Alignment Search Tool with 43 spacer sequences20.
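The site-selection rule described above (coverage depth of at least 5× and a Phred score of at least 20) can be illustrated with a minimal Python sketch. The `SnvCall` record, the `filter_reliable_sites` helper, and the example calls are hypothetical and introduced only for illustration; they are not part of TGS-TB, whose internal implementation is not described in this text.

```python
from dataclasses import dataclass

MIN_DEPTH = 5   # minimum coverage depth stated in the text
MIN_PHRED = 20  # minimum Phred score stated in the text


@dataclass(frozen=True)
class SnvCall:
    """Hypothetical record for one candidate SNV site (illustration only)."""
    position: int  # 1-based coordinate on the H37Rv reference (NC_000962.3)
    ref: str       # reference allele
    alt: str       # alternative allele
    depth: int     # number of reads covering the site
    phred: float   # Phred-scaled quality of the call


def filter_reliable_sites(calls):
    """Keep only the sites that meet both the depth and the quality threshold."""
    return [c for c in calls if c.depth >= MIN_DEPTH and c.phred >= MIN_PHRED]


# Toy calls (not real data) covering the pass, fail, and boundary cases.
calls = [
    SnvCall(1000, "A", "G", depth=85, phred=38.0),  # passes both thresholds
    SnvCall(2000, "C", "T", depth=4, phred=40.0),   # fails: depth below 5
    SnvCall(3000, "G", "A", depth=30, phred=19.5),  # fails: Phred below 20
    SnvCall(4000, "T", "C", depth=5, phred=20.0),   # passes: exactly at both limits
]
reliable = filter_reliable_sites(calls)
print([c.position for c in reliable])
```

Both thresholds are inclusive in this sketch, matching the wording "at least 5×" and "≥20".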

Statistical analysis

Data are summarized as the mean, median, and/or range, as appropriate, and compared using Fisher’s exact test or the chi-square test. All tests were two-sided, and a probability (p) value of < 0.05 was considered statistically significant.
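As an illustration of the comparisons described above, the following Python sketch computes an odds ratio with a 95% confidence interval and a two-sided Fisher's exact p value for a 2×2 table. The text does not state how the confidence intervals were derived, so the Woolf (log-normal) interval used here is an assumption, and the function names and example counts are hypothetical rather than taken from the study tables.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf 95% CI for the 2x2 table [[a, b], [c, d]].

    Assumes all four cells are non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value.

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(total, col1)

    p_observed = prob(a)
    x_min, x_max = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x) for x in range(x_min, x_max + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for floating-point ties
    )


# Hypothetical counts (not from the study): rows = groups, columns = outcome yes/no.
or_value, ci_low, ci_high = odds_ratio_ci(3, 1, 1, 3)
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"OR = {or_value:.2f}, 95% CI = {ci_low:.2f}-{ci_high:.2f}, p = {p_value:.4f}")
```

With small cell counts, as in this toy table, the Woolf interval is wide and the exact test is the more appropriate choice; the chi-square test is usually reserved for larger expected counts.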

Ethics approval

The study protocol was reviewed and approved by the institutional review boards of the ethics committee of the Japan Anti-Tuberculosis Association (no. 28–10). As researchers had no access to information to identify an individual patient or treatment, this committee waived the need for informed consent for this study.


Results

Lineages of isolates from male and female patients

As shown in Table 1, of the 667 Mtb isolates (371 collected from males and 216 from females), 31 (4.6%), 5 (0.8%), 481 (72.1%), 3 (0.5%), and 147 (22.0%) were identified as lineages 1, 2, 2/Beijing, 3, and 4, respectively. Notably, among patients aged 20–39 years, lineage 1 isolates were significantly less common in males than in females (n = 5 vs. 17, respectively; odds ratio = 0.17; 95% confidence interval = 0.06–0.47, p < 0.01; Table 1).

Table 1 Lineage distribution of Mtb isolates in the RYOKEN 2007 dataset of the GReAT database by age group and gender.

Isolates from Japanese and foreign-born patients

As shown in Table 2, of the 667 isolates, 40 (6.0%) were from foreign-born patients, with 20 (50%), 12 (30%), and 8 (20%) belonging to lineages 1, 2/Beijing, and 4, respectively. Of these 40 isolates, 37 were from foreign-born patients aged 20–39 years, with 18 (45%) identified as lineage 1. Of the 31 isolates identified as lineage 1, 20 (64.5%) were from foreign-born patients, including nine patients from the Philippines, two each from Thailand, Nepal, and Vietnam, and five from other countries.

Table 2 Lineage distribution of Mtb isolates in the RYOKEN 2007 dataset of the GReAT database by age group and country of origin.

As shown in Table 2, of the 321 isolates from patients aged 20–39 years, 219 (68.2%) were identified as lineage 2/Beijing. The number of lineage 2/Beijing isolates from Japanese patients was significantly higher than the number of isolates of other lineages from Japanese patients (odds ratio = 5.53, 95% confidence interval = 2.65–11.54, p < 0.01). Of the 481 isolates identified as lineage 2/Beijing, only 12 (2.5%) were from foreign-born patients (six from China and one each from the Philippines, Korea, Mongolia, Nepal, Indonesia, and other countries). Of the 147 isolates identified as lineage 4, only 8 (5.4%) were from foreign-born patients (including 3 from Brazil, 2 from Indonesia, 2 from Peru, and 1 from China). Isolates identified as lineages 2/Beijing and 4 were mainly from Japanese patients (84% and 85%, respectively).

Phylogenetic analysis of lineage 1 isolates from the RYOKEN 2007 dataset of the GReAT database

A phylogenetic tree of the 31 lineage 1 isolates from the RYOKEN 2007 dataset of the GReAT database, generated by SNV-based analysis of the core genome with TGS-TB, is shown in Fig. 1. No identical or closely related isolates were observed (Fig. 1 and Supplemental Table 1). The lineage 1 isolates were classified into three sub-lineage types. Of the 31 isolates, 27 (87.1%) were identified as lineage 1.2.1 [East-African-Indian 2 (EAI2)], two as lineage 1.2.2 (EAI1), and two as lineage (EAI4). The isolates identified as lineage 1.2.2 (EAI1) were from foreign-born patients (Indonesia and Nepal). The isolates belonging to lineage (EAI4) were also from foreign-born patients, one of whom was born in Vietnam. The isolates belonging to lineage 1.2.1 (EAI2) formed two clusters, namely, EAI2 sub-lineage 1 and EAI2 sub-lineage 2. Of the 13 EAI2 sub-lineage 1 isolates, 9 (69.2%) were from foreign-born patients, with most (8/9, 88.9%) from female patients from the Philippines, including three that were resistant to INH (Fig. 1). Furthermore, a 2,316-bp deletion (NC_000962.3, 4056664–4058980) was detected in the genomes of all isolates of EAI2 sub-lineage 1 (Supplemental Fig. 2). Of the 14 EAI2 sub-lineage 2 isolates, 7 (50%) were from foreign-born patients (Thailand or the Philippines) (Fig. 1). Between these EAI2 sub-lineages, there was no significant difference in the results of spoligotyping, which analyses the spacer regions of direct repeats in the genome of Mtb (Supplemental Table 2). A large deletion was observed in EAI2 sub-lineage 1, which was the largest sub-lineage from the Philippines (data not shown). Taken together, these results suggest that this large deletion could be useful for identifying sub-lineage 1 of lineage 1.2.1 (EAI2).
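Whether isolates are identical or closely related is typically judged from pairwise SNV differences. The sketch below shows one simple way to count such differences from concatenated SNV alleles of equal length. How TGS-TB computes these distances is not described here, so the functions, the handling of missing calls, and the toy allele strings are all assumptions made for illustration.

```python
from itertools import combinations


def snv_distance(seq_a, seq_b, missing="N-"):
    """Count positions where two allele strings differ, ignoring missing calls."""
    if len(seq_a) != len(seq_b):
        raise ValueError("allele strings must have the same length")
    return sum(
        1
        for x, y in zip(seq_a, seq_b)
        if x != y and x not in missing and y not in missing
    )


def pairwise_distances(alleles):
    """Return {(isolate_i, isolate_j): SNV distance} for every pair of isolates."""
    return {
        (i, j): snv_distance(alleles[i], alleles[j])
        for i, j in combinations(sorted(alleles), 2)
    }


# Toy concatenated SNV alleles (not real data); "N" marks a missing call.
alleles = {
    "iso_A": "ACGTAC",
    "iso_B": "ACGTTC",
    "iso_C": "TCGNAG",
}
distances = pairwise_distances(alleles)
for pair, dist in distances.items():
    print(pair, dist)
```

A matrix of this kind is what a table of pairwise SNV differences summarizes; the cut-off below which two isolates count as closely related is a separate choice that this sketch does not make.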

Figure 1

Phylogenetic analysis of the isolates belonging to lineage 1 in the RYOKEN 2007 dataset of the GReAT database. In total, 31 isolates from the RYOKEN 2007 dataset were classified into lineage 1 by TGS-TB. (a) Phylogenetic analysis of the isolates belonging to lineage 1. A phylogenetic tree was created based on SNV analysis of the core genome among the lineage 1 isolates with TGS-TB. Other strains were representative of each sub-lineage of lineage 1. M, male; F, female; IND, India; JPN, Japan; NPL, Nepal; PHL, the Philippines; THA, Thailand; VNM, Vietnam; OTHER, other foreign country; and INH-R, isoniazid resistant. Hatched rectangles are the sub-lineages of lineage 1.2.1 (EAI2). (b) Schematic depiction of a large-scale deletion of 2,316 bp (4,056,664–4,058,980) observed in all isolates belonging to lineage 1.2.1 (EAI2) sub-lineage 1. The hatched line, grey arrow, and rectangle indicate the deleted regions in the genomes. ephA encodes epoxide hydrolase A.

Prediction of drug susceptibility by genomic analysis

The drug susceptibility of the isolates was predicted by TGS-TB, and the results were consistent with the mutations identified by TGS-TB12 and TB Profiler14. According to the genotyping results, 72 (10.8%) of the 667 isolates were resistant to at least one of the anti-TB drugs (Supplemental Table 4): 34 were resistant to INH, 9 to RFP, 40 to SM, 6 to EB, and 5 to LVFX (Table 3). By the phenotypic method, 72 (10.8%) of the 667 isolates were resistant to at least one of the anti-TB drugs (Supplemental Table 4): 24 were resistant to INH, 2 to RFP, 48 to SM, 12 to EB, and 5 to LVFX (Table 3). The concordance between the genotypic and phenotypic methods for INH, RFP, SM, EB, and LVFX was high, with values of 96.4%, 99.0%, 98.5%, 99.1%, and 100.0%, respectively, as were the specificities, with values of 97.4%, 98.9%, 99.9%, 100.0%, and 100.0%, respectively. However, the sensitivities for INH, RFP, SM, EB, and LVFX were 70.8%, 100.0%, 81.3%, 50.0%, and 100.0%, respectively. In particular, the sensitivities for INH, SM, and EB were low; therefore, previously unknown mutations in the genes associated with resistance to INH, SM, and EB were sought by genomic analysis using TGS-TB ver. 2. The analysis identified mutations in candidate genes: katG (V230L, A256G, D259A, W397G, R463L, L472R, P533S), ndh (L239V), the inhA promoter (−40, −76), gid (G34A, D67H, L74S, G76D, V88G, E92D, L142F, S149R), and embB (G991S) (Supplemental Table 3). Thus, the sensitivity of genomic prediction was low for these drugs (Table 3).
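The concordance, sensitivity, and specificity figures follow from a 2×2 comparison of the genotypic prediction against phenotypic DST as the gold standard. The Python sketch below shows the calculation for INH. The four cell counts are not given in the text; they are back-calculated here from the reported totals (667 isolates, 24 phenotypically resistant, 34 genotypically resistant) and the reported 70.8% sensitivity, so they should be read as an inference rather than as published data. The function name is hypothetical.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and concordance with phenotypic DST as reference.

    tp: resistant by both methods       fp: resistant by genotype only
    fn: resistant by phenotype only     tn: susceptible by both methods
    """
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "concordance": (tp + tn) / total,
    }


# INH counts inferred from the reported figures (an assumption, see lead-in):
# 24 phenotypically resistant = 17 detected + 7 missed;
# 34 genotypically resistant = 17 true + 17 false positives;
# 667 - 17 - 17 - 7 = 626 susceptible by both methods.
inh = diagnostic_metrics(tp=17, fp=17, fn=7, tn=626)
for name, value in inh.items():
    print(f"{name}: {100 * value:.1f}%")
# sensitivity: 70.8%
# specificity: 97.4%
# concordance: 96.4%
```

These inferred counts reproduce the three percentages reported for INH, which is a useful consistency check on the table but does not replace the underlying per-isolate data.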

Table 3 Phenotypic DST, RYOKEN 2007, Japan.


Discussion

This study was the first to identify the genetic characteristics of Mtb throughout Japan by WGS. The genomic analysis showed that the largest group of isolates belonged to lineage 2/Beijing, accounting for 481 (72.1%) of the 667 isolates analysed, whereas lineage 4 accounted for 147 (22.0%) and lineage 1 accounted for 31 (4.6%). Epidemiological analysis by WGS of foreign-born TB patients in countries with low TB prevalence has been conducted in the U.S.21, Canada22, Spain23, Italy24, and Germany25. Although most of these studies found limited evidence of TB transmission between foreign-born and native-born patients, no such transmission between foreigners and Japanese patients was observed in the present study. Consistent with the increasing numbers of immigrants to Japan from Asian countries with high TB burdens, such as the Philippines, China, Vietnam, Nepal, and Indonesia, the number of TB cases among foreign-born patients has also been increasing, going from 842 (3.3%) of 25,311 cases in 2007 to 1,338 (7.6%) of 17,625 cases in 2016. Of the 1,338 foreign-born TB patients in Japan identified in 2016, 318 (23.8%), 272 (20.3%), 212 (15.8%), 135 (10.1%), and 90 (6.7%) were from the Philippines, China, Vietnam, Nepal, and Indonesia, respectively (Tuberculosis in Japan: Annual Report 2017). Therefore, to control TB among foreign-born patients, genetic markers of isolates from patients from each of these countries should be identified to predict the introduction and transmission of TB.

As shown in Table 2, of the 31 isolates identified as lineage 1, 20 (64.5%) were from foreign-born TB patients. This proportion, which was driven largely by patients from the Philippines, was considerably higher than that of the other lineages (0% for lineage 2, 2.5% for lineage 2/Beijing, 0% for lineage 3, and 5.4% for lineage 4). Kobayashi et al. reported that in Tokyo, the proportion of lineage 1 isolates was significantly higher in foreign-born patients than in Japanese patients26. Therefore, a greater prevalence of lineage 1 isolates would be expected from foreign-born patients. There were significant differences in the prevalence of lineage 1 isolates among groups, with foreign-born females aged 20–39 years being the most frequently infected. In particular, 14 isolates identified as lineage 1 (EAI2) from females from the Philippines aged 20–39 years formed a sub-cluster. These isolates were not closely related, as there were 86 to 289 SNVs among them (Supplemental Table 1). However, all 14 of these isolates had a large deletion of approximately 2.3 kb at the same position in the genome (Supplemental Fig. 2). The distribution of TB lineages in the present study was similar to that described in other reports using different methods, such as spoligotyping20,27. However, in the present study, spoligotyping was not able to distinguish the EAI2 sub-lineages (Supplemental Table 2). The large deletion in sub-lineage 1 was also observed in a majority of the isolates from the Philippines in the GReAT database (unpublished data). Therefore, this deletion might be a heritable characteristic of sub-lineage 1 and could thus serve as a candidate genetic marker of isolates originating from the Philippines.

The proportion of isolates resistant to at least one anti-TB drug was 9.6% in 2007 in Japan11, comparable to the value of 10.8% observed in the present study (Supplemental Table 4). The proposed method seems promising as a standard method for DST in the future. Indeed, the concordance and specificity of the genotyping method for predicting drug susceptibility were good compared with the method currently used (Table 3). However, the sensitivity differed among drugs (70.8%, 81.3%, 50.0%, and 100% for INH, SM, EB, and RFP, respectively) (Table 3).

The GReAT database and bacterial culture-based methods can also improve the prediction of DR in Mtb by detecting genomic mutations responsible for DR and reducing the time necessary for DR detection, from weeks and months to within days or hours28. To detect the mutations of the rpoB gene responsible for RFP resistance, gene targeting methods, such as line probe assays or the Cepheid Xpert MTB/RIF assay, can provide DST results within days or hours. However, these methods detect only the most frequent mutations of the rpoB gene and are limited to RFP. With future improvements in sequencing methods, mutations associated with resistance to other anti-TB drugs can be obtained simultaneously by WGS. Bioinformatic tools, such as KvarQ15, TB Profiler14, PhyResSE29, CASTB30, Mykrobe Predictor31, and TGS-TB12, are also useful for predicting DR to not only first-line but also second-line anti-TB drugs. However, predictions made with these tools are limited to the reported major mutations in the DR genes. Therefore, other relevant mutations in these genes may be unreported. In the present study, unreported mutations in the katG, gid, and embB genes, which are associated with resistance to INH, SM, and EB, were identified (Supplemental Table 3). By updating the list of mutations in DR genes, these findings should contribute to the improvement of the prediction of DR in TB.


Conclusion

WGS of 667 isolates from the RYOKEN 2007 dataset of the GReAT database revealed genomic markers of Mtb isolates from the Philippines (Fig. 1) and previously unknown mutations in genes associated with resistance to INH, SM, and EB (Supplemental Table 3) to improve genomics-based DST. This is the first report of WGS for anti-TB DST in Japan, showing the usefulness of the GReAT database.