Molecular diversity of Mycobacterium tuberculosis complex in Sikkim, India and prediction of dominant spoligotypes using artificial intelligence

In India, tuberculosis is an enormous public health problem. This study provides the first description of molecular diversity of the Mycobacterium tuberculosis complex (MTBC) from Sikkim, India. A total of 399 Acid Fast Bacilli sputum positive samples were cultured on Lőwenstein–Jensen media and genetic characterisation was done by spoligotyping and 24-loci MIRU-VNTR typing. Spoligotyping revealed the occurrence of 58 different spoligotypes. Beijing spoligotype was the most dominant type constituting 62.41% of the total isolates and was associated with Multiple Drug Resistance. Minimum Spanning tree analysis of 249 Beijing strains based on 24-loci MIRU-VNTR analysis identified 12 clonal complexes (Single Locus Variants). The principal component analysis was used to visualise possible grouping of MTBC isolates from Sikkim belonging to major spoligotypes using 24-MIRU VNTR profiles. Artificial intelligence-based machine learning (ML) methods such as Random Forests (RF), Support Vector Machines (SVM) and Artificial Neural Networks (ANN) were used to predict dominant spoligotypes of MTBC using MIRU-VNTR data. K-fold cross-validation and validation using unseen testing data set revealed high accuracy of ANN, RF, and SVM for predicting Beijing, CAS1_Delhi, and T1 Spoligotypes (93–99%). However, prediction using the external new validation data set revealed that the RF model was more accurate than SVM and ANN.

In India, the burden of tuberculosis (TB) is enormous. According to the latest estimate of the World Health Organisation (WHO), the largest number of incident cases in 2018 were from India (2.69 million, 95% CI 1.84 to 3.70 million) accounting for 27% of global cases 1 . Advances in molecular technology have helped us understand the genetic structure of Mycobacterium tuberculosis complex (MTBC) providing insights regarding the population dynamics and spread of MTBC locally and globally. The information obtained by molecular typing of MTBC isolates is essential for understanding TB epidemics and preventing TB [2][3][4] . Current studies also indicate that the outcome of TB infection may be related to strain diversity of MTBC 5,6 . For example, "Beijing strain" of MTBC has been reported to be more virulent in animal models and is often reported to be responsible for causing outbreaks 7,8 . Moreover, knowledge of the genetic diversity of MTBC is very useful for assessing the impact of the TB control program 9 . Although numerous studies on the genetic diversity of MTBC have been conducted in India 9-41 yet no such types of studies are available from the hill state of Sikkim where the prevalence of MDR strains of MTBC is high 42 .
The 399 MTBC isolates from Sikkim were found to represent 394 24-loci MIRU-VNTR profiles out of which 389 profiles were unique, i.e. each type is represented by only one MTBC isolate and 5 MIRU-VNTR types formed clusters and the clustering rate was 2.51% ( Table 3). The maximum number of isolates in a cluster was 10. The Hunter-Gaston Discriminatory Index (HGDI) of combined 24-loci MIRU-VNTR typing analysis was 0.9999.
To capture population snapshot of genetic diversity of Beijing and Non-Beijing MTBC isolates from Sikkim we used Minimum Spanning Tree analysis using 24-loci MIRU-VNTR data. We also determined the presence of Clonal Complexes based on Single Locus Variants (SLVs) i.e. MTBC isolates having similar MIRU-VNTR profiles but differ only at a single locus. Neighbor-Joining phylogenetic tree of 249 Beijing and 150 Non-Beijing MTBC isolates from Sikkim based on spoligotyping data and 24-MIRU-VNTR profile is also given (Supplementary Fig. 2 and 3).

Random forests (RF), support vector machines (SVM) and artificial neural networks (ANN).
In this study, we aimed to predict three dominant spoligotype using RF, SVM and ANN. We used supervised learning i.e. the machine learning (ML) algorithm was first trained on a training data set (70% randomly selected data) to learn predictive patterns and subsequently applied to testing data set (30% data which was kept aside and not used for model training) for evaluation of classification accuracy, sensitivity and specificity. We used tenfold cross-validation for SVM and ANN models. The testing data sets are non-overlapping. Finally, k-ML models are generated. K-fold cross-validation helps in avoiding model overfitting and the metric calculations of model performance are calculated as mean over the k-folds. The model specifications of SVM analysis were: Number of independents is equal to 24 (i.e. 24-loci MIRU-VNTR), SVM type was C-classification, Kernel type was Radical Basis Function. Hyperparameter optimisation revealed that best parameters for Epsilon was = 0 and Cost = 4.The results of SVM analysis are given as a confusion matrix showing predicted and observed Spoligotypes of Beijing, CAS1-Delhi & T1 Spoligotypes for training data set and testing data set ( Table 6). The sensitivity, specificity, and accuracy of SVM classification/prediction for the training data set and the Testing data set are given in Table 7. For testing dataset, the sensitivity of detecting Beijing Spoligotype of MTBC was 97.06%and the specificity was  A multilayer perceptron network was used for ANN. K-fold cross-validation was also performed for ANN analysis. The input layer consisted of 24 factors (24-MIRU-VNTR types). The number of units includes 24 (excluding the bias unit). The number of hidden layers was 1 and the number of units in the hidden layer were7 (excluding the bias unit). The activation function used was Hyperbolic Tangent. The output layers included one dependent variable for Spoligotypes (Beijing or CASI Delhi or TI). The activation function used was SoftMax and the error function was cross-entropy.
The result of tenfold cross-validation for predicting Beijing or CAS1-Delhi or T1 Spoligotype using Artificial Neural Network analysis for testing, training datasets are given in Table 6. The accuracy for the prediction of three dominant Spoligotypes (Beijing, CAS1-Delhi, and T1) in the testing data set was 97-99% (  www.nature.com/scientificreports/ data set from Sikkim which were not used for model training (Fig. 6) shows sensitivity versus specificity graph i.e. classification performance for all possible cut-offs. The curves of Beijing, CAS1-Delhi, and T1 are quiet away from the 45° baseline indicating a more accurate and robust classification achieved by ANN. This interpretation is also supported by significantly high Area Under Curve (AUC) result. The 24-MIRU_VNTR independent variables ranked on the basis of their importance for prediction of MTBC spoligotype is given in importance chart (Fig. 7). The importance values of each independent (predictor) variable is computed based on training and testing samples as implemented in SPSS v.26 (https:// www. ibm. com/ in-en/ analy tics/ spss-stati stics-softw are). The normalized importance values are computed by dividing importance values by the largest importance value and expressed as percentage. Sikkim with reference MTBC isolates available at the MIRU-VNTRplus database. The NJ tree was constructed using spoligotyping and 24-loci MIRU-VNTR data. MIRU-VNTR alleles and spoligo-patterns from 17 isolates are also represented along with the NJ tree. This phylogenetic tree was used to predict lineage of the orphan/ new MTBC isolates from Sikkim.Web tools MIRU-VNTRplus (https:// www. miru-vntrp lus. org) and MEGA v7.0 were used to make the phylogenetic trees (https:// www. megas oftwa re. net). www.nature.com/scientificreports/ Random forest analysis.. The optimal number of decision trees was found to be 3000 and optimal number of variables used at each split was found to be four. The out of bag (OOB) estimate of error of final tuned model based on training data set was 3.78%. The confusion matrix for training and testing data set is given in tables 7and 8, respectively.
External validation and performance evaluation of SVM and ANN. We used MTBC dataset from different region (Assam) for validating of ML models. All ML models used in the present study viz., RF, SVM and ANN models were trained to predict Beijing or CAS1-Delhi or T1 Spoligotypes using data from MTBC isolates obtained from Sikkim. However, to validate the performance of RF, SVM & ANN models new data set used was based on MTBC data generated from Assam. The accuracy, sensitivity and specificity of RF, SVM and ANN models against external new data set are given in Table 8. The results show that RF is better classifier to predict Beijing or CAS1_Delhi or T1 strains of MTBC using MIRU-VNTR data.

Discussion
India has still the highest burden of TB despite intense national efforts to control and eliminate it (RNTCP, 2014). TB is difficult to control and eliminate in India probably due to its vast geographical and socio-economic diversity. Recent global studies have shown that high genotypic diversity of MTBC strains is an important factor in the pathogenesis of TB by affecting virulence, transmissibility, host response and the emergence of drug resistance 53 . Recent advances in MIRU-VNTR profiling and spoligotyping methods have provided powerful tools to determine various MTBC strains circulating in TB patients and to understand transmission dynamics of tuberculosis in a region 31 . Till date, only limited studies have been conducted on 24-loci MIRU-VNTR and spoligotyping based method to characterize MTBC strains in India 31 . Our study based on 24-loci MIRU-VNTR and spoligotyping of 399 MTBC isolates provides the first insight into the population structure of MTBC isolates from the hill state of Sikkim. According to this study the Beijing spoligotype was found to be the most dominant Spoligotype responsible for tuberculosis transmission in Sikkim, followed by CAS1_Delhi. The Delhi/CAS Spoligotype is effectively confined to India, Western Asia and Eastern Africa 14   The predominance of Beijing isolates in Sikkim indicates that more attention is needed to be given to the TB control program in this region to prevent the spreading of this dominating genotype in the community. Recent studies have shown that the modern Beijing strains of MTBC are spreading throughout the world because of their high degree of transmission potential 65 and BCG vaccination has been found to favour the positive selection of Beijing strains 66 .
In addition to Beijing family strains, we also identified strains belonging to other families such as CAS1_Delhi  In this study we tried to predict the main spoligotypes of MTBC in Sikkim, India using 24-loci MIRU-VNTR profiles. Two-dimensional scatterplot of MTBC isolates indicates that 24-loci MIRU-VNTR data can group MTBC isolates according to their spoligotype (Fig. 4). These preliminary results encouraged us to explore the effectiveness of RF, SVM & ANN to predict dominant spoligotype of MTBC using 24-loci MIRU-VNTR profile. The results of testing data (unseen sample) clearly indicate that classification; accuracy rate for ANN was significantly high, followed by RF and SVM models. However, RF model turned out to be better predictor of MTBC spoligotype when new external data was used for testing. The major limitation of this study is small sample size for some Spoligotypes. Further studies are needed using more diverse samples from different geographical areas to validate these finding at global level. Nevertheless, this study has clearly shown the possible use of Artificial Intelligence in predicting Spoligotypes from 24-loci MIRU-VNTR profiles. The high-resolution molecular characterization of MTBC done in the present study gives us the first insight into the genotypic diversity of MTBC isolates from Sikkim, where MDR TB is emerging as an important public health concern. The results of the present study are interesting due to the high predominance of Beijing genotype. However, more elaborate longitudinal studies are needed to be undertaken in this region to understand the transmission dynamics of MTBC, and also to get an insight into the efficiency of the TB control program in Sikkim.

Methods
Bacterial culture, identification and DNA extraction. A total of 399 AFB positive sputum samples were collected from 2016 to 2018 from Sikkim and brought to the ICMR-Regional Medical Research Centre, North-East Region laboratory Dibrugarh, for culture, Drug Sensitivity Testing (DST) and molecular characterization of MTBC isolates. Biosafety level 3 was used for culture and DST, and BSL level 2 facility was used for molecular experiments. Modified Petroff 's method was used to decontaminate sputum samples, and all the   ). Briefly, the direct repeat (DR) region was amplified with primer pair Dra, 5′-GGT TTT GGG TCT GAC GAC -3′ (biotinylated 5′ end) and DRb, 5′-CCG AGA GGG GAC GGA AAC -3′. The DNA amplification was carried out in GENEAMPPCR system 9700 of Applied Biosystems. The amplified PCR products were hybridized with nitrocellulose membrane having covalently linked 43 spacer oligonucleotides following the standard procedure 69 . The hybridized fragments were detected using an enhanced chemiluminescence system (GE Healthcare, UK Ltd., Buckinghamshire, UK) and subsequent exposure in X-ray film in darkroom 70 . The spoligotypes were initially reported as 43 digits binary representation of 43 spacers; one was scored for positive hybridization and zero for no hybridization.  The copy number of the tandem repeats was calculated as a function size of the PCR product and interpretation based on the reference table 46 . In doubtful cases, the experiment was repeated for confirmation. For quality control, Mycobacterium tuberculosis H37Rv and one Beijing strain were used in every batch of the experiment.

Genotype analysis and comparison with databases.
Web tools MIRU-VNTRplus (https:// www. miru-vntrp lus. org/) and SITVIT2 (http:// www. paste ur-guade loupe. fr: 8081/ SITVI T2/) were used for assignment of MTBC species, Spoligotypes, and genotypes by comparing with international reference database strains 45,70,71 . Spoligotypes were identified by a similarity search in MIRU-VNTRplus and SITVIT2. As on 3 rd April 2020, the SITVIT2 database contains 1,11,635 entries from 177 countries. In this database, the spoligotypes are designated as Spoligotype International Type (SIT) if isolates share them from two or more patients, and if a spoligotype is from a single patient, it is designated as orphan 70 . Phylogenetic genetic analysis of 399 isolates and the international reference strains was done using Neighbour joining (NJ) tree method based on combined analysis of spoligotypes & 24-MIRU-VNTRs implemented by MIRU-VNTRplus web tool. A Minimum Spanning Tree (MST) using 24-loci MIRU-VNTR dataset was also constructed for Beijing strains (n = 249) and Non-Beijing strains (n = 150) to determine their Clonal Complexes (CC). We allowed single-locus variants (SLVs) to be included in clonal complexes and identical patterns of MIRU-VNTR. Clonal Complexes identified genetically closely similar strains sharing common transmission link 41,72 . Unknown spoligotypes/orphan strains were subjected to phylogenetic tree analysis using the Neighbour Joining (NJ) tree and categorical coefficient to predict these isolates' Spoligotype. The discriminatory power of spoligotyping and MIRU-VNTR typing system was calculated using the Hunter Gaston Discriminatory Index (HGDI) 73 .
Where N = the total number of strains in the sample population, S is the total number of types described, and the NJ tree is the number of strains belonging to the j th type. Table 6. Confusion matrix showing three major spoligotypes of MTBC from Sikkim based on actual spoligotyping conducted using reverse hybridization (row data). Predicted lineages (columns) are based on the support vector machine (SVM) and artificial neural network (ANN) analysis using 24-loci MIRU-VNTR profiles. Based on k-fold cross-validation. . Built in R function 'prcom' and libraries 'devtools' , 'ggplot2′,'plyr' , 'scales' and 'grid' were also used. First two PCs were used for plotting MTBC isolates (n = 335) in 2-D scatter-plot to get an idea if these MTBC strains tended to group according to their spoligotype (Beijing, CAS1-Delhi or T1). The MTBC isolates were colour coded by Spoligotype type to visualize possible clustering in three groups.
Dataset preparation. Target input variable was categorical representing three classes of MTBC Spoligotypes viz., Beijing, CAS1_Delhi and T1 and was encoded as 1, 2 or 3 representing three classes. Independent input variables were 24 in number, and all were numeric. Numeric variables were normalized to have values ranging between 0 and 1. No data was missing. Table 7. Showing performance measure, i.e. (accuracy, sensitivity and specificity) of random forest (RF)/ support vector machine (SVM)/artificial neural network (ANN) analysis. Based on k-fold cross-validation. The training data set is based on 70% of MTBC isolates from Sikkim selected randomly and testing data set is remaining 30% of MTBC isolates from Sikkim which were not used for model training. *The values are in % and 95% confidence intervals are given in parenthesis for SVM/ANN.  This sensitivity versus specificity plot shows the high performance of ANN in predicting spoligotype lineages. Software package R progamme was used for analysis.   www.nature.com/scientificreports/ problems. Learning in the SVMs is achieved by finding an optimal linear hyperplane using appropriate kernel functions, maximizing the margin between the classes. The classification can be binary or multiclass. Prediction of spoligotype using 24-loci MIRU-VNTR profile of MTBC is an example of the multiclass classification task. In the present study, the data set of 335 MTBC isolates from Sikkim having information on three dominant spoligotypes and 24-loci MIRU-VNTR profiles. We used k-fold cross-validation (10-fold for Beijing or CAS1-Delhi and T1 MTBC Spoligotype) to test the performance of the model. The single dependent variable was categorical ('1' for Beijing '2' for CAS1/Delhi and '3' for T1) and 24 independent variables (24-loci MIRU-VNTR profile) were numeric. We used 'caret' and 'e1071' libraries in R for SVM analysis. We tested 'radial' , 'linear' , polynomial' and 'sigmoid' kernel functions to determine best function suitable for classifying three dominant MTBC Spoligotypes. The optimal value for 'Epsilon' and 'Cost' were determined using 'tune' library. Artificial Neural Networks (ANN) are currently popular and powerful machine learning tools that are biologically inspired computational models that imitate brain neurons and solve complex problems. The ANN typically consists of the three-layered network (the input layer, the hidden layer and the output layer) consisting of artificial neurons or nodes and interconnected by connections (synaptic weights). ANN require training data (supervised learning algorithm) for model building. The dependant and independent variables are given as input (training phase) that information will be used for the system to learn using the back-propagation learning algorithm to predict outputs. The Multilayer Perceptron (MLP) was used to build ANN. K-fold cross-validation was also used for the evaluation of ANN. We used SPSS v26 for ANN analysis. SPSS software has the provision to manually choose parameters such as number of hidden layers, number of units in hidden layers, activation functions (hyperbolic tangent or sigmoid), and output layer activation functions like identity, SoftMax, hyperbolic tangent & sigmoid. Instead, we opted for automatic architecture selection to select optimal parameters with a number of hidden layers one to fifty. The accuracy of the ANN model was best as revealed by ROC analysis. Sensitivity, specificity, and accuracy were calculated to measure the performance of SVM and ANN predictions. Sensitivity was measured by the formula TP/(TP + FN), specificity was measured as TN/(FP + TN), and accuracy by (TP + TN)/(TP+TN+FP+FN) where TP, TN, FN and FP represent true positive, true negative, false negative and false positive, respectively. The Receiver Operating Characteristic Curve (ROC) analysis was also used to determine ANN classifiers' performance, where x-axis represents 1-specificity and the y-axis represents sensitivity and the value ranges between 0.0 and 1.0.
For external validation, 132 isolates of MTBC collected from different geographical areas (the state of Assam) were also processed for spoligotyping and 24-loci MIRU-VNTR typing. To check for overfitting, we used blind external data (obtained from MTBC isolates from Assam) to evaluate model performance of all ML methods i.e. RF, SVM and ANN.
Excel 2016 of Microsoft office was used for calculations related to performance measure of RF, SVM and ANN classifiers. This was done by generating confusion matrices generated for training, testing and external new data sets.
Logistic regression analysis. Binary logistic regression analysis was used to find the association between multiple drug resistance (MDR) status and MTBC Spoligotype. The dependant variable used was MDR status of MTBC isolate and the independent variable used was whether the MTBC isolate belonged to Beijing or non-Beijing Spoligotype. The Wald test was used to find statistical significance of Independent variable. The strength of the Association between MDR and Spoligotype was determined using the odds ratio and 95% confidence interval of the odds ratio.
Ethics approval and consent to participate. This study was approved by the Ethical Committee of ICMR-Regional Medical Research Centre, North-East Region, Dibrugarh. All processes were performed in accordance with the related regulations and guidelines. Written informed consent was obtained from all the participants or their guardians in the case of minors who provided sputum samples. Patients found positive for AFB were referred to the nearest DOTS centre for treatment.