Introduction

Tuberculosis (TB) remains the leading cause of death in developing countries, despite the advanced insights into the mechanisms of disease development from basic science studies with animal models of host immunity and extensive studies of the virulence factors in Mycobacterium tuberculosis. Among the host immune mechanisms postulated from studies of disease development in the animal models, confirmation of their roles in human host response to M. tuberculosis infection can be obtained by identifying functional polymorphisms that affect the variability of disease development after infection in humans. Studies of polymorphisms distributed differently in the human population that affect disease development in TB would provide an unbiased estimate of the role of each immune mechanism in humans.

The conventional candidate gene approach has been applied in the search for genetic risks for leprosy, a related mycobacterial disease caused by Mycobacterium leprae. In addition, linkage analyses1, 2, 3 and a genome-wide association study (GWAS) in leprosy4 identified multiple genetic factors with moderate to large effect sizes in biologically relevant candidate genes, such as LTA, PARK2, HLA-DR-DQ, C13orf31, CCDC122 and RIPK2. For several common infectious diseases, susceptibility loci with moderate to large effect sizes (at-risk odds ratios >1.5) were successfully identified by GWAS. For example, associations have been identified between HbS and malaria5 and between the CFH locus and infection by Neisseria meningitidis.6 Recently, a GWAS of two African TB populations identified a gene desert in 18q11.2 as a novel candidate locus for TB, but unlike GWASs in other common infectious diseases, GWASs of TB have not succeeded in identifying any genetic factors with moderate to large effect sizes.7

Clinical and epidemiological classifications of TB have been based on the timing of the expression of signs and symptoms after infection. Classically, three groups of TB based on this epidemiological classification were proposed: 1) primary disease after infection; 2) exogenous reinfection; and 3) endogenous reactivation.8 The primary infection and exogenous reinfection are clinical diseases that develop within 5 years after infection, whereas endogenous reactivation is a disease that develops >5 years after infection. Although there are overlapping features of clinical diseases by this classification, a proportion of primary infection TB expresses distinct clinical features, and it is hypothesized that genetic risk factors might have a major role in these TB cases.9 Unfortunately, surrogate clinical biomarkers or definite clinical definitions are not routinely available to classify patients into these categories. We cannot directly use this classification in clinical practice. However, epidemiological studies and mathematical modeling indicate that the majority of old TB cases in developed countries with a comparatively low TB incidence are mainly endogenous reactivation, whereas young TB patients in developing countries are mainly primary TB. These transmission models are typically used for prioritization of limited resources in developing countries to minimize TB incidence. The biological significance of these disease development models for TB has not been clearly defined, but the age at onset of TB may be the best available classifier to identify the subset of TB patients based on the disease development model.

Our earlier subset-ordered linkage analysis in TB sibpairs identified chromosome 17p and chromosome 20p as candidate loci that were enriched in young TB sibpairs,10 results that differ markedly from those of traditional linkage analyses of TB.11, 12 Based on these findings and on the epidemiological classification of TB, we hypothesized that subdividing (or classifying) patients according to their disease-development model would enrich genetic heritability in each subset and reduce genetic heterogeneity by reducing misclassification bias. In the present study, genome-wide association analyses of TB were conducted in two East Asian populations (Japanese and Thai). The effect of age at onset of TB was examined by stratifying cases on this covariate in a meta-analysis of the TB GWASs, and the meta-analysis results were tested in two independent replication data sets from Thai and Japanese populations.

Materials and methods

Japanese and Thai genome-wide association samples

The Thai samples for genome-wide genotyping included 433 TB patients and 295 healthy control samples. The patients were recruited from Chiang Rai, Lampang and Bangkok provinces. The control samples were recruited from the blood donors in Chiang Rai province. Populations in these regions were highly similar to each other and also close to the HapMap Chinese population.13 All cases were human immunodeficiency virus-seronegative when TB was diagnosed, and the diagnosis of TB was confirmed by microscopic identification or mycobacterial culture.

Japanese TB patients were recruited from the Japan BioBank projects and included 188 TB cases and 934 healthy controls (934 healthy volunteer blood donors recruited for the JSNP project14). TB was diagnosed by identification of bacteriologically confirmed M. tuberculosis. Screening for human immunodeficiency virus infection was not routinely done in Japan, but the incidence of human immunodeficiency virus in Japan was negligible at the time of the study.

Japanese and Thai replication samples

The Japanese replication samples included 112 cases of microbiologically confirmed TB and 1089 controls; age- and sex-matched controls were selected from the existing control populations for the replication analyses. The replication samples in Thailand included 369 TB cases from Chiang Rai province, the Chest Disease Hospital in Bangkok and the Payao Hospital in Northern Thailand. The 439 controls were recruited from the blood donors in Chiang Rai province and the healthy hypertension patients followed up at Chiang Rai regional hospital. The control populations were screened for a history of TB and a family history of TB; all were free of diabetes mellitus. This study was approved by the Ethics Review Committee of the Ministry of Public Health in Thailand and the Institutional Review Board of the Center for Genomic Medicine, RIKEN.

Genome-wide genotyping and quality control for genotyping

Genome-wide genotyping was performed with Illumina Hapmap 610 (Illumina, San Diego, CA, USA) in the Thai GWAS samples and Illumina Hapmap 550 in the Japanese GWAS samples. Meta-analysis was carried out on the 533 252 single-nucleotide polymorphisms (SNPs) that overlapped between the two platforms. Quality control of the genome-wide genotypes used the following criteria: the Hardy-Weinberg equilibrium (HWE) cut-off was 10−5 and the minimum minor allele frequency for inclusion in further analysis was 0.05. The genome-wide significance level was determined with Bonferroni’s correction, taking into account the 533 252 SNPs that passed quality control (P<9.37 × 10−8). Multidimensional scaling of pairwise identity-by-state statistics was carried out in the GenABEL package (www.genabel.org) and indicated three outlier samples in the Thai GWAS samples; these outlier samples were later excluded. The genomic inflation factors (λ) were calculated from trend test P-values in the Thai TB GWAS (λ=1.02) and the Japanese TB GWAS (λ=0.98). These levels of stratification were acceptable. SNPs with significance levels <10−5 were checked for allelic discrimination by visual inspection of their intensity cluster plots.
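The quality-control thresholds described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the study (which relied on GenABEL); the function names, the chi-square form of the HWE test, the inclusive handling of the cut-offs and the example genotype counts are all assumptions made for illustration.

```python
import math
from statistics import NormalDist, median


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    freq_a = (2 * n_aa + n_ab) / (2 * n)
    return min(freq_a, 1.0 - freq_a)


def hwe_p_value(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium.

    Assumption: a plain chi-square test; the source does not say which
    HWE test was applied.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(n_aa, n_ab, n_bb, hwe_cutoff=1e-5, maf_cutoff=0.05):
    """Apply the two SNP filters quoted in the text (inclusive bounds assumed)."""
    return (hwe_p_value(n_aa, n_ab, n_bb) >= hwe_cutoff
            and minor_allele_frequency(n_aa, n_ab, n_bb) >= maf_cutoff)


def genomic_inflation(p_values):
    """Genomic inflation factor: median observed 1-df chi-square statistic
    divided by the median expected under the null. Requires 0 < p <= 1."""
    norm = NormalDist()
    chi2 = [norm.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / norm.inv_cdf(0.75) ** 2


if __name__ == "__main__":
    # 0.05 / 533 252 SNPs, the threshold quoted in the text.
    print(bonferroni_threshold(0.05, 533252))
    # Hypothetical genotype counts, for illustration only.
    print(passes_qc(25, 50, 25), passes_qc(50, 0, 50))
```

A λ close to 1, as reported for both GWASs, means that the median test statistic is close to its expected value under the null hypothesis.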

Meta-analysis of genome-wide association studies in Thai and Japanese populations

Meta-analysis was carried out with the Mantel–Haenszel test under the fixed effect model, using the genotype counts of cases and controls in the Thai and Japanese data sets. First, the top 25 SNPs ordered by their significance levels in the meta-analysis (P2M-H<10−5) were genotyped in the Thai and Japanese replication samples by invader genotyping assay.15 The replication samples were analyzed and a meta-analysis of all four Thai and Japanese data sets was carried out, but the top 25 at-risk SNPs were not replicated. The meta-analysis was then modified by stratifying the TB cases into two groups: a group with age at onset <45 years (young TB; GWAS-Tyoung=137/295 and GWAS-Jyoung=60/249, given as cases/controls) and a group with age at onset ⩾45 years (old TB; GWAS-Told=300/295 and GWAS-Jold=123/685). The GWASs were re-analyzed on the age-stratified data, and the significant findings were tested in two independent replication samples (Rep-Tyoung=155/249, Rep-Jyoung=41/462, Rep-Told=212/187 and Rep-Jold=71/619). The top 50 SNPs from each of the young TB and old TB meta-analyses were then genotyped in the replication samples, and the results are presented. In the GWAS-Tyoung and GWAS-Told data sets, only the age of the TB cases was used for stratification in the association analysis; for GWAS-Jyoung, GWAS-Jold and all Thai and Japanese replication data sets, controls were selected from age-matched control populations.
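As an illustration of the fixed-effect Mantel–Haenszel approach named above, the following Python sketch pools 2 × 2 allele-count tables across data sets. It is a simplified stand-in and not the authors' code: the study used genotype counts, whereas this sketch assumes allelic 2 × 2 tables, omits the continuity correction, and uses hypothetical counts and a hypothetical function name.

```python
import math


def mantel_haenszel(tables):
    """Pooled odds ratio and Cochran-Mantel-Haenszel test across strata.

    Each table is (a, b, c, d):
      a = risk alleles in cases,    b = other alleles in cases,
      c = risk alleles in controls, d = other alleles in controls.
    Returns (pooled odds ratio, chi-square statistic, two-sided P-value).
    """
    or_num = or_den = 0.0
    sum_a = sum_expected = sum_variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        or_num += a * d / n
        or_den += b * c / n
        sum_a += a
        # Mean and variance of a under the null, given the table margins.
        sum_expected += (a + b) * (a + c) / n
        sum_variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    odds_ratio = or_num / or_den
    chi2 = (sum_a - sum_expected) ** 2 / sum_variance  # no continuity correction
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square, 1 degree of freedom
    return odds_ratio, chi2, p_value


if __name__ == "__main__":
    # Hypothetical allele counts for two populations, for illustration only.
    strata = [(20, 10, 10, 10), (20, 10, 10, 10)]
    odds_ratio, chi2, p_value = mantel_haenszel(strata)
    print(odds_ratio, chi2, p_value)
```

Each population contributes its own stratum, so differences in allele frequency between the Thai and Japanese samples do not by themselves generate an association signal.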

Results

Meta-analysis of GWASs in Thais and Japanese and replication results

Genome-wide association analyses and a meta-analysis of the two GWASs in Thais and Japanese were performed. A total of 25 SNPs with P2M-H<10−5 were selected for replication analysis in the Thai and Japanese replication sets, and this initial attempt failed to identify evidence of replication. None of the top 25 SNPs passed the genome-wide significance criterion, and the meta-analysis P-values from all four data sets (P4M-H) were higher than those from the original meta-analysis using only the GWAS data (P2M-H), suggesting non-replicable results from the genome-wide association analysis of TB in Thai and Japanese populations.

QQ plots of various age cut-offs in the Thai GWAS

Genome-wide association analyses using age cut-offs other than 45 years were carried out over the age range 30–70 years at 5-year intervals, and their Q-Q plots were drawn. The Q-Q plots of the GWAS in younger TB tended to show stronger departure from the null expectation, with observed P-values deviating from the expected P-values. This deviation increased with younger age cut-offs despite the lower number of cases in each subset (see Supplementary Figure 1).
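The coordinates of a Q-Q plot of this kind can be computed as in the following Python sketch, which pairs each observed −log10 P-value with its expected value under a uniform null distribution. The function name and the plotting position (i + 0.5)/n are illustrative assumptions; the source does not describe how its plots were generated.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 P-values, matched by rank.

    Under the null hypothesis the P-values are uniform on (0, 1), so the
    i-th smallest of n P-values is expected near (i + 0.5) / n for
    i = 0 .. n-1 (an assumed plotting position).
    """
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i + 0.5) / n) for i in range(n))
    return list(zip(expected, observed))


if __name__ == "__main__":
    # Hypothetical P-values, for illustration only.
    for expected, observed in qq_points([0.9, 0.4, 0.03, 0.0002]):
        print(round(expected, 3), round(observed, 3))
```

Points lying above the diagonal in the upper tail correspond to observed P-values that are smaller than expected, which is the deviation described for the younger age cut-offs.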

Meta-analysis of age-stratified GWASs in Thais and Japanese and replication results

Meta-analyses of young and old TB GWASs in Thais and Japanese and their replication results are shown in Supplementary Tables 1 and 2, respectively. The locus that passed genome-wide significance after the meta-analysis is shown in Table 1 and Figure 1. After meta-analysis of all the data sets, only one locus, HSPEP1-MAFB on chromosome 20q12, was significantly associated with young TB by the genome-wide significance criterion. Six additional loci provided replication evidence with lower significance levels in meta-analyses of the four data sets. For the old TB group, there were eight additional SNPs with lower P-values in the meta-analyses of the four data sets.

Table 1 Association statistics of rs6071980 in GWASs and replication studies of Thai and Japanese young TB
Figure 1

Plot of −log10 (P-value) against the physical location of the HSPEP1-MAFB locus. The blue diamond represents the PM-H of rs6071980 against its location. White, yellow, orange and red diamonds represent −log10 (P trend) for each SNP genotyped in this region in the genome-wide association analysis of young Thai TB; the red diamond represents rs6071980, and the white, yellow and orange colors represent the linkage disequilibrium between other SNPs and rs6071980 in Thais. The blue line indicates the recombination rates from the Chinese and Japanese HapMap populations.

Discussion

Interestingly, GWASs of TB per se in Africans and our original GWAS in two Asian populations could not identify any obvious genetic risks for TB. The sample sizes of the present TB GWAS were similar to those of GWASs of other infectious diseases, and it appears reasonable that one should be able to identify common moderate- to high-risk polymorphisms of the kind found in other common infectious diseases, such as the association of human leukocyte antigen Class II with leprosy,4 of HbS with malaria5 and of complement factor H variants with meningococcal disease.6 The failure to replicate the top signals from the meta-analysis of GWASs of TB per se in Asian populations in the present study reinforces the results of GWASs in African TB;7 no common polymorphisms with moderate to high risks for TB per se could be identified despite the ubiquitous nature of TB in global populations. GWASs in Asian populations do not have the problem of sparse linkage disequilibrium seen in GWASs in African populations, and they have been successfully used to identify common variants with large effects in Thais.16 Thus, the failure of the initial GWAS in Asian TB is not due to a problem of SNP coverage in the current microarray genotyping platform.

Taking into consideration the caveats that might affect the power of a GWAS in identifying genetic risks for common diseases17 and our experience of the increased power of linkage analysis of TB by subset-ordered analysis based on the age at onset of TB sibpairs,10 we hypothesized that a gene–environment (G × E) interaction, such as Bacille Calmette-Guérin (BCG) vaccination or exposure to different strains of M. tuberculosis,18 or misclassification, might cause heterogeneity in TB. All of these factors contribute to different age effects on TB development. Thus, GWAS analysis of TB stratified by the age at onset might lead to identification of moderate- to high-risk polymorphisms by reducing the genetic heterogeneity.

Having access to two GWAS data sets, we empirically split the TB cases into two groups based on the age at onset of TB: the young TB group (age at onset <45 years) and the old TB group (age at onset ⩾45 years). This age cut-off is based on the bimodal distribution of age at onset of TB of patients within this study and the age distribution of TB patients in Japan19 and Thailand.20 Meta-analyses of young and old TB provided non-overlapping lists of SNPs and suggested that the top signals that contributed to young TB and old TB are distinct (see Supplementary Tables 1 and 2). A major limitation of this approach is the smaller number of samples in the age-stratified analysis, which results in lower statistical power to reject the null hypothesis.

A replication study was performed for the top 50 SNPs identified from each of the meta-analyses of young TB and old TB. A total of 100 SNPs were genotyped in both Thai and Japanese replication samples; replication in two case-control data sets identified the HSPEP1-MAFB locus on chromosome 20q12 as a novel susceptibility locus for young TB. This locus had a moderate effect size, with an odds ratio (confidence interval) of 1.73 (1.42–2.11). rs6071980 is located 450 kb proximal to MAFB, a transcription factor determining the fate of monocyte/macrophage differentiation, and 300 kb distal to HSPEP1 (Chaperonin 10), a heat shock 10-kDa protein suggested to be an auto-antigen in autoimmune hepatitis and type I diabetes. Interestingly, rs6028945 and rs6071980, two closely located SNPs with high correlation in Caucasians, have also been suggested as genetic markers for anti-TNF responsiveness by a GWAS.21 MAFB was reported to be highly expressed in active TB compared with lower expression in latent TB in an extensive study of whole-blood gene expression signatures that differentiated active TB cases from healthy individuals.22 These variants might influence the development of TB through an effect on the reactive response to M. tuberculosis infection by expression of MAFB from the non-lymphocytic population of white cells. Other loci with lower association evidence (P4M-H<10−5) need additional replication evidence in other populations.
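As a rough consistency check, the odds ratio and confidence interval quoted above can be converted into an approximate z-score and P-value. The Python sketch below assumes that the interval is a 95% Wald interval on the log odds ratio scale, which the text does not state; the function name is hypothetical, and the result is an approximation and not a value reported by the study.

```python
import math


def z_and_p_from_odds_ratio(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Approximate standard error, z-score and two-sided P-value from an
    odds ratio and its confidence interval.

    Assumption: the interval is symmetric on the log scale, i.e.
    log(OR) +/- z_crit * SE, with z_crit = 1.96 for a 95% interval.
    """
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return se, z, p_value


if __name__ == "__main__":
    se, z, p_value = z_and_p_from_odds_ratio(1.73, 1.42, 2.11)
    threshold = 0.05 / 533252  # Bonferroni threshold quoted in the Methods
    print(se, z, p_value, p_value < threshold)
```

Under these assumptions the z-score is a little above 5.4 and the approximate P-value falls below the Bonferroni threshold of 9.37 × 10−8, in line with the locus being described as genome-wide significant.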

This study demonstrated that consistent replication in TB gene identification is achievable by stratified analysis based on the age at onset of TB despite the smaller number of cases available for analysis. The heterogeneity by age might reflect the complexity of the epidemiology of TB in Asia: introduction of Bacille Calmette-Guérin (BCG) Japan vaccination in the late 1940s in Thailand23 and Japan;24 viability of individuals who carried genetic risks of TB; host adaptation to various strains of M. tuberculosis;25, 26 misclassification in the control population due to non-exposure to M. tuberculosis; temporal immunosuppression by immunosuppressive drugs;27 immune senescence; and other conventional risk factors of TB. In different age cohorts of TB, the intertwining of these factors contributed to disease development and caused genetic heterogeneity. Stratification by the age at onset of TB, used as a classifier to homogenize other uncontrollable factors, might be a simple and efficient method for identifying TB susceptibility genes. We suggest that consortium-level meta-analysis of age-stratified TB might provide further insight into the genetic landscape of TB.

For genetic association studies of TB, the genetic risks for early disease after latent infection and late disease after latent infection might be totally different, and an analysis that combines these similar but distinct clinical phenotypes will need large samples to overcome this heterogeneity. We chose to analyze these two clinical phenotypes separately, using age at onset as a proxy to distinguish the clinical diseases. In this study, despite the relatively small numbers of GWAS and replication samples, we identified genetic risks that reliably replicated in two additional sample groups. Confirmatory evidence is needed from additional studies in other populations. Reducing heterogeneity by stratification is not uncommon in the genetic mapping of complex traits; where it has been overlooked, this may have contributed to the large samples required in GWASs of some diseases. For diseases with hidden heterogeneity, a case-classification system that clusters cases into more homogeneous groups might be an alternative approach to overcome genetic heterogeneity.