Introduction

Genomic variations, such as single nucleotide variations (SNVs), insertions/deletions, copy number changes and changes in synteny, are some of the major causes of genetic divergence and phenotypic differences among different strains and species1. Though the identification of these variants has become much easier and the underlying molecular mechanisms are being revealed, it is still not clear whether there is a pattern by which genomic changes occur2. Hot spots and cold spots are regions that display, respectively, higher or lower numbers of SNVs than the predicted normal frequencies3. Traditionally, hot spots have been studied with respect to recombination frequencies, and specific octamer DNA sequences (e.g. Chi sites, 5′-GCTGGTGG-3′) were thought to be associated with these spots4. Most mutational analyses have been done with individual genes such as p535,6,7 and some kinases8, small genomes such as those of viruses9,10,11, or extrachromosomal DNA elements such as those of mitochondria12 and chloroplasts13. In general, hot spots have been defined in terms of the frequencies of variants arising at a single nucleotide position or a single amino acid. These frequencies are known to vary across genomes. Selection pressure, such as drug or immune pressure14, plays an important role in determining which genomic regions are likely to harbor hot spots. Moreover, sites/genes that are hypervariable may be governed by evolution and result from the intricate relationships among genes, networks and environment15.

Since identification of hot and cold spots can be highly useful in defining drug and vaccine targets, it is important to develop tools that can identify these regions systematically. Mutation spectrum analysis is one of the few computational approaches available for identifying hot and cold spots16. A mutation spectrum is a distribution of the frequencies of every type of mutation along the nucleotide sequence of a target gene. This is then transformed into a distribution of observed mutational frequencies and compared with the expected frequencies. However, these methods are designed for finding hot spot sites in a gene, not for scanning entire genomes in a short period of time. Moreover, a number of methods can be used for mutational frequency analysis, and it has been suggested that a combination of methods is needed to identify hot spot sites accurately16. Here we describe an approach based on the “Shewhart Control Chart” for the analysis of whole genome sequences of different strains of Mycobacterium tuberculosis, the causative agent of tuberculosis. The Shewhart control chart is widely used in statistical quality control17. It has also been used for quality estimation in healthcare industries18,19. The predictions made using this method were validated by sequencing the putative regions amplified from clinical isolates.

Tuberculosis continues to be a major public health problem worldwide20. It is an airborne infection and manifests predominantly as a pulmonary disease. Besides pulmonary tuberculosis, it can also occur, though less frequently, at extra-pulmonary sites. Strains isolated from both these clinical conditions have been investigated in the present study. In addition to variation in gene expression patterns during infection, intra-genomic variation among pathogenic strains has been recognized as a critical feature in the pathogenesis of microorganisms.

Results

Hot & cold spots prediction using Shewhart Control Chart

Shewhart Control Chart (SCC) is one of the most popular techniques for maintaining process control in the field of statistical process control17. The chart is routinely used to monitor one or more variables that are directly or indirectly associated with a production process, and it can quickly detect a large shift in the process level. Regardless of how carefully a process is maintained, a certain amount of natural variability always exists. A process is said to be statistically “in control” when the amount of natural variability is within a certain limit. If the variability exceeds that limit, the process is statistically “out of control”. The chart graphically displays the quality of a product or process based on the characteristics of a sample in relation to sample number or time. The basic features of these charts are the Center Line (CL), the Upper Control Limit (UCL) and the Lower Control Limit (LCL). In effect, the use of the Shewhart Control Chart in statistical process control ensures that the statistical attributes of the process lie within the UCL and LCL. In our case, genomic regions whose SNV frequencies fall outside the control limits are taken as hot spots (see “Methods” for details).

We have defined hot and cold spots as regions of genomes (windows of 2000 nucleotides) that display, respectively, a higher or lower than expected number of SNVs in a population of isolates/strains. We used ABWGAT (Anchor Based Whole Genome Analysis Tool) to carry out pairwise comparisons of fully sequenced M. tuberculosis genomes in order to identify SNVs21. The distribution across the genome of SNVs identified by comparing M. tuberculosis CDC1551 and M. tuberculosis H37Rv is shown in Fig. 1. M. tuberculosis H37Rv was used as the reference strain. SNV counts were plotted using non-overlapping segments of 2000 nucleotides. We also generated random SNVs, and their positions are also depicted in Fig. 1. It is clear from the figure that the distribution of natural SNVs was non-uniform in comparison to that of the randomly generated ones. The number of SNVs in a segment of 2000 nucleotides is estimated to follow a Poisson distribution with mean 0.4954. This was verified statistically by a Kolmogorov-Smirnov22 test, which yielded a D-statistic of 0.0337 (see Fig. 2).
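The binning and goodness-of-fit steps described above can be sketched in a few lines. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, positions are assumed to be 1-based coordinates on the reference genome, and the Poisson mean is assumed to be estimated as the average count per bin.

```python
import math
from collections import Counter


def bin_counts(positions, genome_length, bin_size=2000):
    """Count SNVs in consecutive non-overlapping windows (1-based positions)."""
    n_bins = math.ceil(genome_length / bin_size)
    counts = [0] * n_bins
    for pos in positions:
        counts[(pos - 1) // bin_size] += 1
    return counts


def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson random variable with mean lam."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))


def ks_statistic_poisson(counts):
    """Return (estimated mean, D), where D is the largest gap between the
    empirical CDF of the bin counts and the fitted Poisson CDF.
    Both CDFs are step functions on the integers, so it is enough to
    compare them at 0, 1, ..., max(counts)."""
    n = len(counts)
    lam = sum(counts) / n  # assumption: mean estimated from the same data
    freq = Counter(counts)
    cumulative = 0
    d = 0.0
    for k in range(max(counts) + 1):
        cumulative += freq.get(k, 0)
        d = max(d, abs(cumulative / n - poisson_cdf(k, lam)))
    return lam, d
```

A small D indicates that the observed per-bin counts are close to the fitted Poisson distribution; the sketch only computes the statistic and does not derive a p-value.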

Figure 1

Distribution of SNVs across whole genome.

Pink dots indicate the frequency of SNVs identified by comparison between M. tuberculosis CDC1551 and M. tuberculosis H37Rv, mapped on the H37Rv genome using a bin size of 2000 nucleotides. Blue dots indicate the distribution of randomly generated SNVs on the H37Rv genome. The X-axis represents genome position and the Y-axis represents SNV frequency.

Figure 2

Kolmogorov-Smirnov test to check if SNV distribution follows Poisson distribution.

The function F1 is the empirically observed cumulative distribution of the SNVs and the function F2 is the cumulative distribution of a Poisson random variable with parameter 0.4954.

SNVs generated by comparing two M. tuberculosis strains (see Fig. 1) were used to derive the SCC (Fig. 3). The chart shows the UCL, CL and LCL as dotted lines. Red indicates out-of-control points, that is, genomic regions with high SNV frequencies. We have identified cold spots as regions that show very few or negligible SNVs. In order to extend the studies to clinical isolates of M. tuberculosis, we used two different strategies. In the first, we identified putative hot and cold spots from pairwise comparisons of different strains and isolates using SCC and then mapped these with respect to each other in order to identify the common regions based on sequence. Only those regions that showed hot and cold spots in all the strains were considered for further analysis. In the second strategy, we considered all SNVs in all strains and isolates and mapped these to the H37Rv sequence (reference sequence). This allowed us to compute an average number of SNVs in each bin in the context of H37Rv. SCC of the binned average SNVs permitted the identification of hot and cold spots (Supplementary Tables 1 and 2). Our results showed a total of 44 hot spots and 32 cold spots in the M. tuberculosis genome. Some of the genes in the hot spot regions, such as Rv0064 and Rv0095c, have been functionally characterized; however, a large number of genes of unknown function (hypothetical proteins) are also located in these regions. Rv0064 has been annotated as a probable transmembrane protein based on sequence similarity with integral membrane proteins. A homolog of Rv0064 (ML0644) has been described as a conserved hypothetical transmembrane protein in M. leprae (http://www.ncbi.nlm.nih.gov/gene/909429).
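The two strategies can be illustrated with a short sketch. It assumes that every strain has already been binned against the same reference coordinates, so that bin indices are directly comparable; the function names and input layout are hypothetical and are not taken from the study.

```python
def common_flagged_bins(flagged_per_strain):
    """Strategy 1: keep only bins flagged in every pairwise comparison.
    `flagged_per_strain` is a list of iterables of bin indices."""
    sets = [set(flags) for flags in flagged_per_strain]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


def mean_counts_per_bin(counts_per_strain):
    """Strategy 2: average the SNV count of each bin across all strains.
    `counts_per_strain` is a list of equal-length lists of per-bin counts."""
    n_strains = len(counts_per_strain)
    return [sum(column) / n_strains for column in zip(*counts_per_strain)]
```

The averaged counts returned by `mean_counts_per_bin` would then be passed to the control chart in the same way as the counts from a single pairwise comparison.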

Figure 3

Shewhart Control Chart: (a) Chart derived using SNV frequencies from Fig. 1. (b) Average SNV frequencies in all the strains and isolates. Red and black dots indicate out-of-control (“hot spots”) and in-control points, respectively. Yellow dots indicate violating runs.

In our analysis we did not consider those nucleotide variations of the reference strain H37Rv that are absent in all other strains and isolates. We also did not consider SNVs that mapped to multigene families, such as PPE/PGRS and repetitive regions as these can skew SNV count.

Sequencing of hot and cold spots of clinical isolates

Our identification of hot and cold spots is based on completely sequenced genomes. Though we have also taken into consideration sequences from M. tuberculosis isolates that have not been assembled, it is still possible that the changes we observed are present only in these selected isolates and are not relevant in a global context. We tested the reliability and robustness of the methodology in predicting hyper- and hypovariable regions by sequencing two representative predicted hot spot and cold spot regions from a large number of clinical isolates. While the hot spot regions displayed 38 and 4 SNVs per 500 nucleotides in Rv0095c and Rv0064, respectively, in 40 isolates, no SNV was detected in the cold spots of any isolate, validating the strategy used in the present study for the prediction of hot and cold spots. Multiple sequence alignments of part of the sequenced regions of one of the hot spots and one of the cold spots from clinical isolates are shown in Fig. 4. We also analyzed published data on SNP distribution in 89 individual genes from 99 human-adapted M. tuberculosis strains23. GyrB, which falls in a hot spot, was among the genes sequenced and displayed 15 SNPs. On the other hand, there were 3 SNPs in PstS1, one of the cold spot genes. These results validate the predictions made using SCC.

Figure 4

Multiple sequence alignment of representative amplified re-sequenced cold spot and hot spot regions from clinical isolates.

Isolate names are shown on the left of the alignments, with F/R indicating forward/reverse strands. (A) Cold spot; (B) Hot spot.

Discussion

Genome sequence divergence helps organisms adapt to varying environmental conditions24. Periodic alternation and fluctuations in the host milieu are acknowledged features that pathogenic microorganisms encounter following infection. For example, the transition in the microenvironment of the infectious tubercle bacilli, from the droplet state (free living) to the intracellular environment of the host macrophage, is demanding and requires efficient adaptation. Elucidating patterns in genome variations can help in establishing comprehensive strategies for formulating appropriate vaccines and therapeutics against pathogens25. The development of new sequencing technologies has made available genome sequences from a large number of organisms, particularly different species/strains and isolates. These provide major resources for deriving patterns of genome variations. Genome sequencing also provides a simple way to map mutations, and this has tremendous potential in mapping drug resistance26. Identification of regions that either have SNV clusters (hot spots) or lack any SNVs (cold spots) can lead to knowledge about rapidly evolving or conserved regions of genomes. In this article we have described a simple computational method that can be used to identify hot and cold spots in sequenced genomes. For this, we analyzed fully assembled genomic sequences of laboratory strains and non-assembled next generation sequence data of twenty isolates from a recent study to derive a composite prediction27. We provide experimental evidence to support our predictions.

SCCs are used routinely for quality control in manufacturing processes, and to our knowledge this is the first example of their use in computational and comparative genomics. The method is highly useful for finding outliers in large sequential data, and we have exploited this to find hot and cold spots, which are essentially outliers in genomes. In this analysis we identified two genes, Rv0005 and Rv0006, that map to hot spot regions and encode DNA gyrase. Gyrase genes are known to be associated with drug resistance, and mutations are often found in these genes in drug-resistant strains28. Our results suggest that these genes lie in a hypervariable region and are likely to undergo variations leading to a drug-resistant phenotype. We also found Rv3919c, a gene involved in resistance to streptomycin, in a hot spot region29. We did not find any gene associated with drug resistance in the cold spots. On the other hand, Rv2986c, a housekeeping gene encoding a histone-like DNA-binding protein, was found in a cold spot region. This is a conserved protein that is required for survival of the organism (http://www.tbdb.org)30. Our prediction and validation strategy involved sequencing only the protein-coding genes that fall within selected predicted hyper- and hypovariable regions from a number of clinical isolates. This was done to see whether the selected regions, computed on the basis of sequenced genomes, were also outliers in terms of the presence of SNVs in Indian clinical isolates. The experimental results supported our predictions.

There are many reasons why hot and cold spots exist in genomes15. While mutations occur more or less randomly, SNVs appear as clusters because of positive selection in some regions owing to an adaptive advantage. It is also possible that these regions have unusual structural features that promote errors of different types31. SNVs may also occur more commonly around pre-existing mutations as a result of the action of DNA repair systems31. Whatever the reason, the occurrence of hyper- and hypovariable regions suggests that different regions of M. tuberculosis genomes are changing at different rates. Identification of these regions may be helpful in identifying future therapeutic targets. In conclusion, we have shown that the Shewhart Control Chart can be useful for identifying hot and cold spots in microbial genomes.

Methods

Datasets

The sequences used were complete genome sequences of M. tuberculosis [H37Rv (NC_000962.2), CDC1551 (NC_002755.2)] and whole genome re-sequencing short reads of 20 clinical isolates of M. tuberculosis downloaded from SRA (http://www.ncbi.nlm.nih.gov/sra). The data were generated on a high throughput sequencing platform (Illumina) with an average depth of 50x and read length of 52 nucleotides27.

Identification of single nucleotide variations (SNVs)

Identification of single nucleotide variations (SNVs) was done separately for the complete genome and next generation sequencing data. We used the published genome sequence of M. tuberculosis H37Rv as the reference genome for both data sets. SNVs were identified from the genome data using the Anchor Based Whole Genome Analysis Tool (ABWGAT)21, an online server for identification of genetic variations from whole genome sequences. The output from the server is a list of SNVs in tabular format including reference genome position, nucleotide change, COG, function etc. For the re-sequencing short read data, we mapped the reads of each isolate individually to the reference genome using MAQ32, allowing at most two mismatches. The SNV calling parameters used were a minimum read depth of 3, a maximum read depth of 256 and a consensus quality score of 20. We filtered out low-scoring SNVs using SNPFilter, a module of MAQ, to obtain high-confidence SNVs.
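The depth and quality thresholds quoted above amount to a simple per-call filter. The sketch below illustrates that logic only; it does not reproduce MAQ or its SNPFilter module. The `SnvCall` record and its field names are hypothetical, and treating the thresholds as inclusive bounds is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnvCall:
    """Hypothetical record for one candidate SNV (not MAQ's output format)."""
    position: int           # 1-based position on the reference genome
    ref: str                # reference base
    alt: str                # called base
    depth: int              # number of reads covering the position
    consensus_quality: int  # Phred-like consensus quality score


def passes_filters(call, min_depth=3, max_depth=256, min_quality=20):
    """True if the call meets the depth and quality thresholds (inclusive)."""
    return (min_depth <= call.depth <= max_depth
            and call.consensus_quality >= min_quality)


def filter_calls(calls, **thresholds):
    """Keep only the calls that pass all thresholds."""
    return [call for call in calls if passes_filters(call, **thresholds)]
```

The upper depth bound is a common guard against calls in collapsed repeats, where reads from several copies pile up on one reference location.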

Prediction of Hot and Cold spots using Shewhart Control Chart

We divided the reference genome into bins of 500 to 5000 nucleotides and calculated the SNV frequency in each bin. The frequency in each bin was then plotted to obtain the distribution of SNVs over the genome. We chose a bin size of 2000 nucleotides as optimal because it gives, on average, about a single SNV per bin, given that the total number of SNVs in the different datasets was around 1500–2500.
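The reasoning behind the bin size can be checked with simple arithmetic: the expected number of SNVs per bin is the total number of SNVs divided by the number of bins. The sketch below assumes a reference genome length of 4,411,532 nucleotides for H37Rv, a figure that is not stated in the text; the function name is hypothetical.

```python
import math

# Assumption: approximate length of the H37Rv reference genome in nucleotides.
ASSUMED_GENOME_LENGTH = 4_411_532


def expected_snvs_per_bin(total_snvs, bin_size, genome_length=ASSUMED_GENOME_LENGTH):
    """Average number of SNVs per bin if the genome is cut into
    non-overlapping bins of `bin_size` nucleotides."""
    n_bins = math.ceil(genome_length / bin_size)
    return total_snvs / n_bins


if __name__ == "__main__":
    # Compare candidate bin sizes for the range of SNV totals quoted in the text.
    for bin_size in (500, 1000, 2000, 5000):
        low = expected_snvs_per_bin(1500, bin_size)
        high = expected_snvs_per_bin(2500, bin_size)
        print(f"{bin_size:>5} nt: {low:.2f} to {high:.2f} SNVs per bin")
```

Under this assumption, 2000-nucleotide bins give roughly 0.7 to 1.1 SNVs per bin for 1500 to 2500 SNVs, which is consistent with the stated aim of about one SNV per bin.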

To identify hot and cold spots we used a statistical quality control method called the Shewhart Control Chart17. Quality control is a technique for monitoring a process with the goal of making it more efficient. The Shewhart control chart can easily identify outliers in a production process. In our study, the presence of outliers indicates hot spots. The chart contains three lines: UCL (upper control limit), CL (center line) and LCL (lower control limit) (see Fig. 3).

Where μ = mean,

σ = standard deviation
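As an illustration of how control limits can be computed from binned SNV counts, the sketch below uses the conventional form of a Shewhart chart, UCL = μ + kσ, CL = μ and LCL = μ − kσ, with k = 3. The multiplier, the use of the population standard deviation and the clamping of LCL at zero are all assumptions made for this example and are not taken from the study; the function names are hypothetical.

```python
import math


def control_limits(counts, k=3.0):
    """Return (LCL, CL, UCL) for a list of per-bin SNV counts.
    Assumptions: limits are mu +/- k*sigma with k = 3 by default,
    sigma is the population standard deviation, and LCL is clamped
    at zero because counts cannot be negative."""
    n = len(counts)
    mu = sum(counts) / n
    sigma = math.sqrt(sum((c - mu) ** 2 for c in counts) / n)
    ucl = mu + k * sigma
    lcl = max(0.0, mu - k * sigma)
    return lcl, mu, ucl


def out_of_control_bins(counts, k=3.0):
    """Indices of bins whose counts fall outside the control limits."""
    lcl, _, ucl = control_limits(counts, k)
    return [i for i, c in enumerate(counts) if c > ucl or c < lcl]
```

When the mean count per bin is small, μ − kσ is negative and LCL is clamped to zero, so no bin can fall below it. Under these assumptions the limits alone flag only high-count bins, and low-variation regions would need an additional rule, which this sketch does not implement.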

Analysis of M. tuberculosis isolates from patients

DNA extracted from mycobacterial cultures maintained/stored at −20°C on Lowenstein-Jensen (LJ) media in the TB immunology laboratory, Biotechnology department, AIIMS, New Delhi and in the Microbiology Department, Lala Ram Sarup Institute of Tuberculosis and Respiratory Diseases, Meharuli, New Delhi, was used in the study. The 40 mycobacterial isolates included in the study were derived from a variety of clinical samples, including sputum and extra-pulmonary samples such as cerebrospinal fluid, pleural fluid, fine needle lymph node aspirates and endometrial biopsies.

DNA extraction, PCR amplification and sequencing

DNA extraction from the isolates was carried out as described before33. Briefly, a single colony of M. tuberculosis was picked and suspended in 100 µl of 0.1% Triton X-100. The suspension was heated in a dry bath at 90°C for 45 min and centrifuged at 10,000 rpm for 10 min. The supernatant was used as template DNA in PCRs.

Amplifications were carried out using reagents obtained from Fermentas AB, Vilnius, Lithuania, in a thermocycler (Applied Biosystems, USA). The amplicons were analyzed on a 1.5% agarose gel. Specific DNA bands corresponding to the estimated amplicon size were excised and the DNA was extracted as per the manufacturer's recommendations (Real Biotech Corporation, Taiwan). Sequencing of the extracted amplicons was done commercially (GCC Biotech (India) Pvt. Ltd., Kolkata) for both forward and reverse strands.

SNV calling

Sequences were aligned with M. tuberculosis H37Rv genome sequence as reference using CLUSTALW multiple alignment tool34. Any nucleotide change was marked as an SNV if the change was observed in both the forward and reverse strands. Otherwise it was considered as a sequencing error.
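The strand-concordance rule described above can be sketched as follows. The example assumes that the reference, forward read and reverse read have already been aligned to equal length, and that the reverse read has been reverse-complemented into the reference orientation. The function name and the gap symbol are hypothetical, and gapped columns are simply skipped.

```python
def call_concordant_snvs(ref, fwd, rev, gap="-"):
    """Return (reference_position, ref_base, alt_base) for every aligned
    column where both strands carry the same base and it differs from the
    reference. Columns containing a gap in any sequence are ignored.
    Positions are 1-based and count only non-gap reference bases."""
    if not (len(ref) == len(fwd) == len(rev)):
        raise ValueError("aligned sequences must have equal length")
    snvs = []
    ref_pos = 0
    for r, f, v in zip(ref.upper(), fwd.upper(), rev.upper()):
        if r != gap:
            ref_pos += 1
        if gap in (r, f, v):
            continue
        # A change seen on only one strand is treated as a sequencing error.
        if f == v and f != r:
            snvs.append((ref_pos, r, f))
    return snvs
```

For example, with `ref="ACGTACGT"`, `fwd="ACTTACGA"` and `rev="ACTTACGT"`, only the G to T change at position 3 is reported, because the change at position 8 appears in the forward read alone.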