Introduction

Tuberculosis (TB) is one of the three major infectious diseases worldwide, alongside AIDS and malaria. According to the Global Tuberculosis Report 2014 produced by the World Health Organization, there were an estimated 9.0 million incident cases of TB (126 per 100,000 people) globally in 2013. Approximately 1.5 million people were estimated to have died from TB that year.1 TB is a challenging world-health issue, especially because of co-infection with HIV and the existence of multidrug-resistant TB and extensively drug-resistant TB.1 It is estimated that approximately one-third of the world’s population is infected with the TB pathogen Mycobacterium tuberculosis. However, only 5–10% of infected individuals will progress to clinical disease.2

Several studies have been conducted to identify the genetic factors involved in TB susceptibility. An early twin study showed that monozygotic twins had a 2.5-fold higher concordance rate for TB when compared with dizygotic twins.3 A recent re-analysis of the twin study data concluded that environmental factors are more important than genetic factors to TB susceptibility. However, genetic factors have a determining role in the patient immune response to infection by M. tuberculosis.4 Multiple putative TB-associated genes and loci have been identified via a variety of methods, including linkage studies, candidate gene studies and genome-wide association studies (GWAS).5 A GWAS conducted with an Indonesian cohort identified several potential TB susceptibility loci. However, none of the loci reached genome-wide significance.6 We previously conducted a single-nucleotide polymorphism (SNP)-based genome-wide linkage study using a group of Thai patients and affected sibling-pair samples.7 The study identified a region on chromosome (Chr.) 20p that had a significant linkage with earlier TB onset based on an ordered subset analysis conducted using minimum age at TB onset. The maximum logarithm of odds score was 3.33.7 A GWAS was also conducted with a cohort of Thai and Japanese individuals. The study identified a risk locus on Chr. 20q12 associated with young-onset (<45 years old) TB.8 The difficulty in identifying obvious and reproducible susceptibility loci in genetic studies suggests the presence of additional unknown genetic factors.

The current study attempted to identify new TB susceptibility gene(s) and/or variants within a candidate region on Chr. 20p in a cohort of young Thai individuals. We used next-generation sequencing (NGS) technology to resequence the candidate region. The candidate variants were then selected from the sequencing data for subsequent case–control association analyses.

Materials and methods

Study samples

There were 13 cases used for NGS. These cases were selected from Thai multiplex TB families that were used in our previous genome-wide linkage study.7 The target region for this resequencing study was Chr. 20p13-12.3 and this region shows significant linkage with early-onset TB. We define early-onset TB as disease occurrence in patients ranging from 12 to 23 years of age.7 The youngest affected individual <25 years old in each multiplex family was selected for sequencing.

There were two sample sets used for the association analyses. The first set of samples consisted of 665 TB patients and 777 healthy controls that were studied in the previous Thai GWAS.8 These individuals were recruited from the Chiang Rai, Lampang and Bangkok provinces. Microscopic identification and mycobacterial culture were used to confirm TB diagnosis in 98% of the TB cases. The second set of samples consisted of 545 TB patients and 407 controls recruited from Chiang Rai, Bangkok and Northern Thailand. These patients were also used in the association analysis. Individual and familial histories of TB and TB-associated diseases such as diabetes mellitus (DM) were evaluated. DM status was monitored via fasting blood sugar, HbA1c, rapid testing of capillary blood and history of DM treatment. The numbers of samples in each set are summarized in Supplementary Table S1. The study was approved by the Ethics Review Committees of the Ministry of Public Health in Thailand and the Faculty of Medicine, The University of Tokyo.

Candidate region selection

The region on Chr. 20p13-12.3 was previously demonstrated to be significantly linked with early-onset TB based on an ordered subset analysis.7 The SNP marker rs750702 located in this region shows a peak logarithm of odds score of 3.33. Thus, a 1-Mbp region centered on rs750702 was selected for the current study because it covers the linkage peak (Supplementary Figure S1).

Sequencing of TB cases

The samples were subjected to candidate region capture using the Ion TargetSeq Custom Enrichment Kit 500 kb-2 Mb (Thermo Fisher Scientific, Waltham, MA, USA). The kit was designed and customized by providing chromosomal physical position information on the target region to the manufacturer (Supplementary Information; Supplementary Figure S1).

The samples were prepared and sequenced on the Ion Torrent Personal Genome Machine (PGM; Thermo Fisher Scientific) according to the manufacturer’s protocols. Genomic DNA (gDNA) from each sample was sheared into 200-bp fragments with the Ion Xpress Plus Fragment Library Kit (Thermo Fisher Scientific). The fragmented gDNA was ligated to sequencing adapters and individually barcoded using the Ion Xpress Barcode Adapters 1–16 Kit (Thermo Fisher Scientific). The prepared gDNA was then amplified. The Invitrogen SOLID Library Size Selection 2% E-gel (Thermo Fisher Scientific) and the E-Gel Safe Image Real-Time Transilluminator (Thermo Fisher Scientific) were used to select optimal-sized fragments with attached adapters and barcodes of ~330 bases in length. The fragments were then extracted from the gel before being purified and pooled to make sets of libraries consisting of ~500 ng of gDNA.

The Ion TargetSeq Custom Enrichment Kit was used to hybridize and capture the target region for each library set. The molarity of each hybridized library was measured with the Agilent 2100 Bioanalyzer High Sensitivity Kit (Agilent Technologies, Santa Clara, CA, USA). The libraries were then combined into three sequencing sets. The sequencing sets consisted of seven, six and five pooled-sample libraries. Sequencing templates were prepared from the pooled libraries by emulsion PCR with the Ion OneTouch (Thermo Fisher Scientific) and Ion OneTouch ES system (Thermo Fisher Scientific) using reagents from the Ion OneTouch 200 Template Kit v2 DL (Thermo Fisher Scientific).

The sequencing was conducted on the Ion Torrent PGM using the Ion PGM 200 Sequencing Kit and Ion 318 Chips (Thermo Fisher Scientific) according to the manufacturer’s protocol.

Variant detection and selection of candidate variants

All sequencing data were analyzed with two software programs: the Ion Variant Caller plugin (Thermo Fisher Scientific) and CLC Genomics Workbench v5.5 (Qiagen, Venlo, Limburg, Netherlands). The sequence data from the Ion Torrent PGM were subjected to the Ion Variant Caller plugin variant detection workflow. A separate analysis and variant detection workflow was created for use in CLC Genomics Workbench v5.5 (Supplementary Information). The sequencing data were aligned to human genome build hg19 in both programs. The detected variants were required to have at least 20× read depth coverage in both software programs to be included for further analysis. This requirement helped to improve the accuracy of the detected variants.

Information on each variant was obtained from the UCSC Genome Browser (GRCh37/hg19 assembly) (http://genome.ucsc.edu/)9 and the HapMap10 database for the population of Han Chinese in Beijing, China (CHB). The detected variants observed in only one sequenced sample were excluded to increase stringency. Variants that were already studied in the Thai GWAS8 and their proxy SNPs (r2≥0.8 in CHB and Japanese in Tokyo, Japan (JPT) populations) were also excluded. The proxy SNPs were identified using the SNAP website Version 2.2,11 and the 1000 Genomes Project12 Pilot 1 data were used as a reference. The variants with minor allele frequency <0.05 in the HapMap database and/or 1000 Genomes Project CHB population were also excluded. All remaining non-synonymous variants reported in exon regions were selected as candidates. A second filtering and selection step was conducted on the remaining variants with minor allele frequency ≥0.05. Candidate variants located in the 3ʹ-untranslated region (UTR) of genes were screened with two databases: microRNA.org - Targets and Expression (http://www.microrna.org/microrna/home.do)13,14 and miRDB (http://mirdb.org/miRDB/).15,16 This screening was used to determine whether the variants were located in predicted microRNA-binding sites. A prediction score >80 was set as the threshold for this selection. The variants located within predicted microRNA-binding sites were also listed as candidates. In addition, variants were examined to determine whether they were located within DNaseI hypersensitivity sites or gene regulatory sites (e.g., promoter regions) using HaploReg v2 (http://www.broadinstitute.org/mammals/haploreg/haploreg.php)17 to prioritize functionally interesting variants. Variants located in a DNaseI hypersensitivity site with a motif change or in regulatory regions were also included as candidates. The site of each candidate was examined with the UCSC Genome Browser9 and variants within regions with high histone H3 lysine 27 acetylation (H3K27Ac) markers were selected as candidates.
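The first filtering step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used in the study: the `Variant` record, its field names and the toy entries are hypothetical, and only the thresholds (20× depth in both callers, presence in more than one sample, minor allele frequency of at least 0.05, exclusion of GWAS variants and their proxies) come from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the text; every name below is a hypothetical illustration.
MIN_DEPTH = 20     # 20x read depth required in both variant callers
MIN_SAMPLES = 2    # variants observed in only one sequenced sample are excluded
MIN_MAF = 0.05     # minor allele frequency cut-off in the CHB reference data


@dataclass(frozen=True)
class Variant:
    rsid: str
    depth_ion: int           # depth reported by the Ion Variant Caller plugin
    depth_clc: int           # depth reported by CLC Genomics Workbench
    n_samples: int           # number of sequenced samples carrying the variant
    maf_chb: float           # reference minor allele frequency (CHB)
    in_gwas_or_proxy: bool   # already covered by the earlier GWAS or a proxy SNP
    nonsynonymous: bool      # exonic, amino-acid-changing variant


def passes_first_filter(v):
    """Apply the depth, recurrence, novelty and frequency criteria."""
    return (
        v.depth_ion >= MIN_DEPTH
        and v.depth_clc >= MIN_DEPTH
        and v.n_samples >= MIN_SAMPLES
        and not v.in_gwas_or_proxy
        and v.maf_chb >= MIN_MAF
    )


def select_candidates(variants):
    """Return IDs of non-synonymous variants that survive the first filter."""
    return [v.rsid for v in variants if passes_first_filter(v) and v.nonsynonymous]


# Toy records with made-up identifiers and values.
demo = [
    Variant("hypothetical_1", 35, 28, 3, 0.12, False, True),
    Variant("hypothetical_2", 35, 12, 3, 0.12, False, True),   # low depth in one caller
    Variant("hypothetical_3", 40, 40, 1, 0.20, False, True),   # seen in one sample only
    Variant("hypothetical_4", 40, 40, 4, 0.02, False, True),   # rare in the reference
    Variant("hypothetical_5", 40, 40, 4, 0.30, True, True),    # already in the GWAS
    Variant("hypothetical_6", 40, 40, 4, 0.30, False, False),  # left for the second step
]
print(select_candidates(demo))
```

Variants such as `hypothetical_6`, which pass the first filter but are not non-synonymous, would be the input to the second step (microRNA-binding site, DNaseI hypersensitivity and H3K27Ac screening), which is not modeled here.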

The candidate variants were validated by direct Sanger sequencing. Sequencing primers (Supplementary Table S2) were designed using Primer3 v.0.4.0 or Primer3web version 4.0.0.18,19 The amplification of sequencing amplicons was conducted on a GeneAmp PCR System 9700 (Thermo Fisher Scientific) or a TGradient Thermocycler (Biometra, Göttingen, Germany) using FastStart Taq DNA Polymerase, dNTPs, and 10× PCR Buffer with MgCl2 (Roche, Basel, Switzerland). The sequencing was conducted using an ABI PRISM 3130xl Genetic Analyzer (Thermo Fisher Scientific). The sequencing results were viewed with Sequence Scanner v1.0 (Thermo Fisher Scientific). Detailed protocols are described in the Supplementary Information.

TagSNPs were selected using Haploview 4.2 to capture the ITPA gene in detail for association analysis (Supplementary Information). The following SNPs were selected to cover the gene region: rs11087570, rs8362, and rs6139034.

Genotyping and association analysis

The candidate variants were genotyped by TaqMan assays in the first sample set. The variants rs1127354 and rs13830 were further genotyped in the additional sample set. The results from all sample sets were also analyzed together to improve statistical power. All genotyping probes were ordered from and manufactured by Applied Biosystems (Thermo Fisher Scientific) (Supplementary Table S3). The genotyping was conducted with KAPA PROBE FAST qPCR Master Mix (KAPA Biosystems, Wilmington, MA, USA). The Invader assay (Hologic, Bedford, MA, USA) was used to genotype rs13830 in the second sample set.

The Hardy–Weinberg equilibrium test was performed, and the χ2-test was applied to evaluate differences in allele and genotype frequencies between all cases and controls. These tests were also performed for the subgroup analyses. In this study, the allelic, genotypic, dominant and recessive models were tested. A subgroup analysis was also conducted. The cases were divided into two groups: a ‘young’ subgroup (<45 years old) and an ‘old’ subgroup (≥45 years old). The subgroups were then compared with the controls. The 45-year-old threshold was also applied empirically in the previous GWAS and was based on the age of onset distribution for TB patients in the studied countries.8 Fisher’s exact test was applied to improve statistical accuracy when subgroups had small sample numbers.
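For readers unfamiliar with these tests, the following Python sketch illustrates a Hardy–Weinberg χ2-test on genotype counts and an allelic χ2-test with an odds ratio and 95% confidence interval. It is a hedged illustration rather than the software used in the study: the function names are hypothetical, the Pearson χ2 is computed without continuity correction, the confidence interval uses Woolf's logit method, and the counts in the example are invented. Whether the original analysis used these exact variants of the tests is an assumption.

```python
import math


def chi2_p_1df(chi2):
    """Upper-tail P value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def hwe_test(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg proportions.

    Assumes both alleles are observed (otherwise an expected count is zero).
    Returns (chi2, p_value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele a
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, chi2_p_1df(chi2)


def allelic_test(case_a, case_b, ctrl_a, ctrl_b):
    """Allelic model on a 2x2 table of allele counts (all cells must be > 0).

    Returns (chi2, p_value, odds_ratio, (ci_low, ci_high)).
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    numerator = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    denominator = (
        (case_a + case_b) * (ctrl_a + ctrl_b)
        * (case_a + ctrl_a) * (case_b + ctrl_b)
    )
    chi2 = numerator / denominator
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    # Woolf (logit) standard error of log(OR)
    se = math.sqrt(1 / case_a + 1 / case_b + 1 / ctrl_a + 1 / ctrl_b)
    ci = (
        math.exp(math.log(odds_ratio) - 1.96 * se),
        math.exp(math.log(odds_ratio) + 1.96 * se),
    )
    return chi2, chi2_p_1df(chi2), odds_ratio, ci


# Invented counts, for illustration only.
print(hwe_test(25, 50, 25))
print(allelic_test(60, 40, 40, 60))
```

The dominant and recessive models follow the same 2×2 pattern once genotype counts are collapsed into two categories (carriers versus non-carriers, or homozygotes for one allele versus all others).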

Linkage disequilibrium and haplotype analysis

Linkage disequilibrium and haplotype analyses were conducted using Haploview 4.2,20 and a 10,000-time permutation P value was calculated for each haplotype.
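As a reminder of what the r2 statistic measures, a minimal Python sketch is given below. It assumes that the four two-locus haplotype counts are already known; in practice, a tool such as Haploview must first estimate haplotype frequencies from unphased genotypes, and that step is not shown. The function name and the counts are hypothetical.

```python
def ld_r2(n_AB, n_Ab, n_aB, n_ab):
    """Pairwise r^2 from counts of the four two-locus haplotypes.

    r^2 = D^2 / (pA * (1 - pA) * pB * (1 - pB)), with D = pAB - pA * pB.
    Assumes both loci are polymorphic, so the denominator is non-zero.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    d = p_AB - p_A * p_B
    return d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))


# Invented haplotype counts: complete LD, then no LD.
print(ld_r2(50, 0, 0, 50))
print(ld_r2(25, 25, 25, 25))
```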

In silico SNP–gene expression quantitative trait loci analysis

An in silico expression quantitative trait loci (eQTL) analysis was conducted with Genevar software to assess the association between the SNP and gene expression.21 SNP–gene association and eQTL–gene analyses were conducted with expression profile data obtained from lymphoblastoid cell lines from 80 CHB subjects in the HapMap3 project. The NCBI36/Ensembl 50 database was used as the reference. A Spearman’s rank correlation coefficient (rho) with 10,000-time permutation was used as an analysis parameter.
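The idea behind a Spearman correlation with a permutation P value can be sketched as follows. This is not Genevar's implementation: the genotype coding (count of one allele, 0/1/2), the two-sided comparison, the add-one correction of the empirical P value and all function names are assumptions made for illustration, and the data are invented.

```python
import math
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p(genotypes, expression, n_perm=10_000, seed=0):
    """Two-sided empirical P value obtained by shuffling expression values."""
    rng = random.Random(seed)
    observed = abs(spearman_rho(genotypes, expression))
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman_rho(genotypes, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Invented data: allele counts (0/1/2) and expression values.
genotypes = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
expression = [9.1, 8.8, 9.4, 8.2, 8.5, 7.9, 8.4, 7.1, 7.6, 7.3]
print(spearman_rho(genotypes, expression))
print(permutation_p(genotypes, expression, n_perm=1000))
```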

Results

NGS coverage, candidate variant detection, and selection

The Ion TargetSeq Probe Kit coverage for the candidate region was reportedly 87.6% (Supplementary Figure S1). The mean raw accuracy and average sequencing target statistics are summarized in Supplementary Table S4. The numbers of detected variants in each sequenced sample are summarized in Supplementary Table S5. After between-sample duplicates were removed, there were 1,878 variants. After filtering all detected variants, seven variants were selected as candidates for the association analysis. The SNP rs13830 was not observed in any microRNA-binding site or DNaseI hypersensitivity site. However, it was detected in the 3′-UTR of ITPA in more than one sample by NGS. This SNP (rs13830) was reported22 to be in high linkage disequilibrium with a detected non-synonymous variant, rs1127354, in a Japanese population. The SNP rs13830 was therefore included as a candidate and was examined for functional effects. Furthermore, there were more variants in ITPA than in other genes. Thus, we examined the gene using tagSNPs. A total of 11 variants were tested for association (Table 1).

Table 1 Candidate variants selected for genotyping

Association, linkage disequilibrium, and haplotype analyses

Allelic, genotypic, dominant, and recessive models were used to analyze the genotyped variants. The subgroup analyses were performed by age and then compared with controls. No variants showed deviation from the expected genotype counts in the Hardy–Weinberg equilibrium test. Among the non-synonymous variants detected by NGS, rs1127354 showed marginal associations in the allelic and recessive models (P=0.015; odds ratio (OR)=1.41; 95% confidence interval (CI)=1.07–1.87 and P=0.013; OR=1.50; 95% CI=1.09–2.06, respectively) in the young subgroup. There were no significant associations for any other non-synonymous variants (Supplementary Table S6). The variant rs13830 showed marginal association in the allelic and recessive models (P=4.4E–03, OR=1.50, 95% CI=1.13–1.99 and P=3.7E–03, OR=1.61, 95% CI=1.16–2.22, respectively) in the young subgroup analysis (Supplementary Table S6). In addition, marginal associations were observed for rs6139034 in the recessive model in the old subgroup analysis (P=0.017, OR=1.37, 95% CI=1.06–1.79) and for rs8362 in the recessive model in the young subgroup analysis (P=0.024, OR=1.56, 95% CI=1.06–2.30; Supplementary Table S6).

No tested variants showed a stronger association than rs1127354 or rs13830. These two variants were selected for genotyping in the second sample set. The data from both sample sets were analyzed to increase statistical power. Both rs1127354 and rs13830 passed the Hardy–Weinberg equilibrium test and showed lower P values in the young subgroup for the allelic (P=1.3E–03, OR=1.39, 95% CI=1.14–1.70 and P=5.1E–05, OR=1.52, 95% CI=1.24–1.86, respectively) and recessive (P=1.1E–03, OR=1.47, 95% CI=1.17–1.85 and P=4.5E–05, OR=1.62, 95% CI=1.28–2.04, respectively) models (Table 2).

Table 2 Genotyping results for rs1127354 and rs13830 with first and second sample sets

The non-synonymous variant rs1127354 showed strong linkage disequilibrium with rs13830 (r2=0.88; Figure 1). No tested haplotypes reached significant permuted P values, and none of the haplotypes had P values lower than those of the SNPs rs13830 and rs1127354 (Supplementary Tables S7 and S8).

Figure 1

LD structure of variants on ITPA. LD structure of variants on the ITPA gene in the genotyped ‘young’ (<45 years old) Thai population. The numbers in each box show the r2 value between variants. The variants rs1127354 and rs13830 showed strong LD (r2=0.88). LD, linkage disequilibrium.

In silico eQTL analysis

The SNP–gene association analysis for ITPA and rs13830 showed a significant correlation between the expression level of ITPA and rs13830 genotype (Pperm=0.0048). Lower expression was associated with the ‘G’ allele (Figure 2).

Figure 2

Correlation between rs13830 genotype and ITPA expression. Differential expression levels of ITPA gene for the different genotypes. The risk ‘G’ allele is observed to have significantly lower expression compared with the ‘A’ allele.

Discussion

There are no previously published studies using NGS to search for human genetic factors that affect TB susceptibility. NGS has largely been used for genetic studies of the pathogen M. tuberculosis.23,24 The current study is the first attempt to use NGS to gain insight into host genetic factors associated with TB. However, analyses of data obtained from NGS have always posed a challenge.25,26 In the current study, the number of variants detected by NGS in each sequenced sample for the 1-Mbp region was large. As a result, the variants were filtered and selected using the described methods (Supplementary Information).

The variants rs1127354 (missense) and rs13830 (3′-UTR) located in the ITPA gene showed moderate association with TB susceptibility. The variant rs13830 showed the strongest association. These findings were observed primarily in the ‘young’ subgroup analysis. The association was not as prominent when all age groups were included in the analysis. This finding supports the results from previous studies that show stratification by TB age of onset effectively identifies genetic factors associated with TB susceptibility.7,8,27

ITPA encodes the enzyme inosine triphosphate pyrophosphatase, which catalyzes the hydrolysis of inosine triphosphate to inosine monophosphate and pyrophosphate.28,29 The role of inosine triphosphate pyrophosphatase in humans is not well defined. However, it is important for the maintenance of genomic stability by preventing DNA damage and mutagenesis in human cells.30,31 A previous study reported a relationship between low ITPA activity and adverse effects of the immunosuppressive drug azathioprine.32 In addition, it has been speculated that ITPA has a role in immunity.33 The variant rs13830 is located in the 3′-UTR of ITPA, and polymorphisms in the 3′-UTR of a gene may affect regulation of messenger RNA transcripts and gene expression.34–36 The eQTL analysis showed that the expression level of ITPA in lymphoblastoid cell lines differed according to rs13830 genotype, with higher expression associated with the minor ‘A’ allele. The understanding of TB disease progression has largely focused on the host immune response and the roles of T cells37 and B cells in response to M. tuberculosis infection.38,39 The results observed in our association and in silico eQTL analyses indicate that expression of ITPA may be affected in immune cells and is associated with rs13830 genotype. The association with young-onset TB observed with the ‘G’ allele suggests that ITPA could have a role in the host immune response and subsequent progression of TB. Furthermore, investigations in the UCSC Genome Browser (http://genome.ucsc.edu/) Gene Sorter database40 revealed that expression of ITPA is high in peripheral blood CD4+ T cells and in the lungs. Thus, it may be reasonable to speculate that the risk ‘G’ allele of rs13830 affects translation and/or regulation of ITPA and reduces its expression, and that the decreased expression may in turn impair the host immune response to M. tuberculosis infection in young-onset cases.
It has been suggested that genetic effects that affect susceptibility to infectious diseases may be stronger in younger patients than in older patients.41 The results of the association analysis for rs13830 may reflect the genetic effect of rs13830 for TB susceptibility. These results suggest that it has a more profound role for young-onset TB cases than in older patients. Older patients may be susceptible due to other factors such as a compromised immune reaction to M. tuberculosis infection due to age or secondary infection.

Although the Chr. 20q12 locus was identified as a risk locus in the previous Thai GWAS,8 our analysis of the association results in the region showed no significant associations. The variants studied here and their proxies were not included in the Thai GWAS. This finding indicates that the filtering and selection method employed here can successfully identify novel candidate variants.

There are several limitations in the current study. The P values of rs13830 and rs1127354 did not reach genome-wide significance, which might be due to the limited sample size of the ‘young’ subgroup. The associations should be confirmed by increasing the sample numbers and/or conducting replication studies in other Asian populations that are genetically close to the Thai population. The possibility that rs13830 is not the causative variant cannot be excluded. Thus, further investigations of rs13830 proxies are warranted.

In conclusion, this study has demonstrated that the current NGS method can be coupled with rigorous filtering and selection processes. Our approach can successfully identify novel genetic susceptibility loci and contribute to the elucidation of genetic factors with roles in TB disease progression. In addition, targeted resequencing methods may also reveal unidentified susceptibility loci for other common diseases. This is the first report of a potential association of ITPA with young age-at-onset TB in the Thai population. The findings may improve our understanding of TB pathogenesis and may be useful for studies that attempt to uncover effective drug targets.