Introduction

The human major histocompatibility complex (MHC), the human leukocyte antigen (HLA) locus on chromosome 6p21.3, spans about 4 Mb and contains many polymorphic genes relevant to the adaptive immune system.1 Among them, the genes for classical HLA molecules play pivotal roles in the immunological recognition of self versus non-self through presentation of antigenic peptides of either intracellular or extracellular origin.2 Most of the extensive polymorphisms in the HLA genes are found at the peptide-binding groove of HLA molecules, thereby defining the bound peptides.3 The HLA alleles at a given locus differ from each other by 1–30 amino acids at the protein level4 and are designated by four or more digits according to the patterns of single-nucleotide polymorphisms (SNPs) and insertion–deletion polymorphisms within the coding sequence. The difference in allele distribution among ethnic groups may have been shaped by selective and demographic history.5 It is well known that there is strong linkage disequilibrium (LD) among alleles of genes in the HLA locus, and that combinations of these alleles in LD form specific haplotypes.6 Owing to the functional significance of classical HLA genes, MHC haplotypes have been defined by using classical HLA alleles as highly polymorphic markers, and the HLA haplotypes have served as a model system for high-resolution mapping of disease susceptibility genes,7 evolution8 and population structure.9

Detailed information on allelic diversity, recombination hotspots and profiles of LD within the MHC region is available,6 but data on the lineage of MHC haplotypes and their evolution are not complete. The MHC Haplotype Sequencing Project was designed to elucidate the complete MHC genetic maps of several common Caucasian MHC haplotypes,10 but little information is available for MHC haplotypes from ethnic groups other than Caucasians. Individual MHC haplotypes can be characterized by using selected genomic variation, including SNPs. This strategy has been used extensively to resolve the structure of the HLA allelic composition of SNPs and to determine new HLA alleles.11 However, conventional SNP-based tagging does not provide adequate resolution to capture the characteristics of variation in the MHC region at the worldwide population level. In other words, genetic markers other than SNPs, including copy number variations (CNVs) and microsatellites, might provide additional information for tracing the differentiation of MHC haplotypes.

Microsatellites, in general, undergo rapid change because of the insertion or deletion of one or multiple repeat units, primarily through replication slippage.12 Moreover, the mutation rate of microsatellites (10⁻⁵ to 10⁻³ per generation) exceeds that of SNPs and CNVs by several orders of magnitude. This difference in mutational dynamics suggests that microsatellites may be useful in tracing recent divergence in the structure of MHC haplotypes. More than 1000 polymorphic microsatellite markers have been described within the HLA region.13, 14, 15, 16 These microsatellite markers show considerable polymorphism and strong LD with particular alleles of the classical HLA loci that compose well-defined extended HLA haplotypes.17 The HLA haplotypes can be separated into several blocks, including a haplotype block containing the HLA-Cw and -B genes, just centromeric to the MHC class I region, which is known to be one of the most polymorphic loci in the human genome.18 In this study, we analyzed the microsatellite diversity surrounding the HLA-Cw/-B loci to investigate the haplotype lineages in a Japanese population.

Materials and methods

Study population and genotyping methods

The study population consisted of 261 Japanese individuals selected at random. All subjects gave informed consent, and the study was approved by the Research Ethics Committee of Medical Research Institute, Tokyo Medical and Dental University and Tokai University School of Medicine. Complete genotyping of the classical HLA genes and nine microsatellite markers was achieved for all individuals enrolled in this study. Deviation from the Hardy–Weinberg equilibrium was tested for each HLA locus and each microsatellite marker. None of the selected markers showed significant (P<0.05) deviation from the Hardy–Weinberg equilibrium. High-resolution HLA genotyping (at four-digit allele resolution) was carried out with a sequence-based typing method at the class I genes (HLA-A, -B and -Cw) as recommended by the 13th International Histocompatibility Workshop protocols (http://www.ihwg.org/) and/or the manufacturer's instructions (Forensic Analytical, Hayward, CA, USA). When an ambiguity in the genotype assignment was observed in the sequence trace data, the genotype was predicted from the allele frequency and LD information in the Japanese.19 DNA regions spanning the microsatellite polymorphisms were amplified by PCR using the primer pairs and conditions listed in Table 1, and the sequenced reference B-cell line samples, COX and PGX, were used as standards for size assignment of microsatellite alleles.20 In addition, to show the motif variation at the C1_2_5 locus, we sequenced the PCR products obtained from each subject on both strands. The number of repeat units was determined by direct sequencing along with fragment length analysis. A subset (about 38%) of the subjects was also investigated for the C1_2_5 allele by cloning the PCR products using the TA cloning kit (Invitrogen, Carlsbad, CA, USA). Data from the cloning of C1_2_5 were completely consistent with the genotyping data obtained by the direct sequencing method.

Table 1 Primer sets for microsatellite genotyping around the HLA-B/-Cw loci

Phylogenetic analysis

Sequence data on the HLA-Cw alleles (exon 4) were obtained from the IMGT/HLA sequence database (http://www.ebi.ac.uk/imgt/hla/index.html). Sequence alignments of the alleles were created by using GENETYX version 8.1.2 (GENETYX CORPORATION, Tokyo, Japan). Phylogenetic analyses were performed using the unweighted pair group method using arithmetic average (UPGMA) by the MEGA software Version 4.0 (http://www.megasoftware.net/).
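For readers unfamiliar with UPGMA, the clustering step can be sketched in a few lines of Python. This is only an illustration and not the MEGA implementation used in the study: the distance measure (the proportion of differing sites, or p-distance) is an assumption, since the substitution model is not stated here, the sequences are toy strings rather than HLA-Cw exon 4 alignments, and bootstrap resampling is omitted.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (assumed measure)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Cluster aligned sequences by UPGMA; returns the topology as nested tuples."""
    names = list(seqs)
    # Each node id maps to (subtree, number of leaves in the subtree).
    nodes = {i: (name, 1) for i, name in enumerate(names)}
    # Pairwise distances keyed by (smaller id, larger id).
    dist = {(i, j): p_distance(seqs[names[i]], seqs[names[j]])
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(nodes) > 1:
        i, j = min(dist, key=dist.get)          # closest pair of clusters
        (tree_i, size_i), (tree_j, size_j) = nodes.pop(i), nodes.pop(j)
        dist.pop((i, j))
        for k in nodes:
            # Size-weighted mean equals the arithmetic average over all leaf pairs.
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            dist[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        nodes[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    (tree, _), = nodes.values()
    return tree


# Toy aligned sequences (hypothetical, not real HLA-Cw alleles).
toy_alignment = {"a": "AAAA", "b": "AAAT", "c": "TTTT", "d": "TTTA"}
toy_tree = upgma(toy_alignment)
```

In this toy alignment, a and b differ at one site, as do c and d, so the two pairs are joined first and the two resulting clusters are merged last.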

Statistical analysis

Deviation from the Hardy–Weinberg equilibrium was tested for all marker loci by using the PyPop v.0.6.0 software package (http://www.pypop.org/).21 The expectation–maximization algorithm implemented in the ‘haplo_stat’ package for R statistics software (http://www.r-project.org/)22 was used to construct haplotypes and estimate their frequencies. The strength of pairwise LD between the alleles of classical HLA genes and/or microsatellite markers was quantified by two LD coefficients, D′ and r2, through the add-on R software package ‘genetics’.23 We also evaluated the associations between the HLA-B and HLA-Cw alleles by sensitivity and specificity; sensitivity was defined as the probability of observing the HLA-B allele when a particular HLA-Cw allele was observed, whereas specificity was the probability of not observing the HLA-B allele in the absence of the particular HLA-Cw allele. The long-range association was investigated by the extended haplotype homozygosity (EHH) statistic that was calculated according to the formula developed by Sabeti et al.24 Overall LDs between two loci were estimated by using two statistics, Hedrick's multi-allelic D′25 and Cramer's V.26 When there are only two alleles per locus, Cramer's V is equivalent to the correlation coefficient between the two loci. Statistical significance of the LD between pairs of loci was tested using a permutation test with 1000 permutations for each locus pair.
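The allele-level measures defined above (D′, r2, sensitivity and specificity) can be illustrated with a minimal Python sketch. It assumes that phased HLA-Cw/-B haplotypes are already available as (Cw, B) pairs, whereas the study estimated haplotype frequencies with an expectation–maximization algorithm; the function name pair_stats and the toy counts are hypothetical and are not taken from Table 2.

```python
def pair_stats(haplotypes, cw_allele, b_allele):
    """LD and association measures for one Cw/B allele pair.

    haplotypes: list of (cw, b) tuples, one per phased chromosome (assumed input).
    """
    n = len(haplotypes)
    n_c = sum(1 for c, _ in haplotypes if c == cw_allele)
    n_b = sum(1 for _, b in haplotypes if b == b_allele)
    n_cb = sum(1 for c, b in haplotypes if c == cw_allele and b == b_allele)
    p_c, p_b, p_cb = n_c / n, n_b / n, n_cb / n

    d = p_cb - p_c * p_b                      # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_c * (1 - p_b), (1 - p_c) * p_b)
    else:
        d_max = min(p_c * p_b, (1 - p_c) * (1 - p_b))
    denom = p_c * (1 - p_c) * p_b * (1 - p_b)

    return {
        "D_prime": abs(d) / d_max if d_max > 0 else 0.0,
        "r2": d * d / denom if denom > 0 else 0.0,
        # P(B allele | Cw allele), as defined in the text
        "sensitivity": n_cb / n_c if n_c else float("nan"),
        # P(no B allele | no Cw allele), as defined in the text
        "specificity": (n - n_c - n_b + n_cb) / (n - n_c) if n > n_c else float("nan"),
    }


# Toy haplotype counts (hypothetical, for illustration only).
toy = ([("Cw*1202", "B*5201")] * 4
       + [("Cw*0102", "B*4601")] * 4
       + [("Cw*0102", "B*5401")] * 2)
one_to_one = pair_stats(toy, "Cw*1202", "B*5201")
split = pair_stats(toy, "Cw*0102", "B*4601")
```

In the toy data, the one-to-one pair scores 1 on all four measures, whereas the Cw allele that is shared between two B alleles keeps D′ at 1 but has lower r2 and sensitivity, which is the kind of contrast that the four measures are meant to expose.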

Results

Association between HLA-B and -Cw gene loci

Significant associations between the alleles of the HLA-Cw and HLA-B genes were found among the 261 Japanese individuals, as expected from the physical proximity of HLA-Cw and -B (85 kb). Of the 75 different HLA-B and -Cw allele combinations observed, 10 were relatively common, with haplotype frequencies above 3%, and together accounted for 63% of the Japanese panel (Table 2). Two combinations, Cw*1202-B*5201 and Cw*1403-B*4403, showed high values (over 0.90) for sensitivity, specificity, D′ and r2. In contrast, HLA-Cw/-B haplotypes containing Cw*0102, Cw*0303 and Cw*0304 showed weaker association, although these haplotypes accounted for a considerable part of the Japanese population, because each of these HLA-Cw alleles formed several haplotypes with different HLA-B alleles.

Table 2 Association performance of HLA-Cw/-B haplotypes in a Japanese population

Long-range haplotype around the HLA-B and -Cw genes

To analyze the long-range structure of the region, we performed EHH analysis, which allowed us to estimate the extent of LD from the alleles of a landmark locus. As illustrated in Figure 1, each EHH profile within 300 kb of the landmark tended to decline with increasing distance from the landmark, as expected. However, the pattern of EHH varied substantially, depending on the allele at the landmark locus and on the two-locus haplotype. The haplotypes landmarked by the alleles of the HLA-Cw and -B genes extended further to the telomeric side (MHC class I region) and the centromeric side (MHC class II region), respectively (Figures 1a–d). Nevertheless, HLA-Cw/-B haplotypic combinations formed by almost one-to-one correspondence (for example, Cw*1202 and B*5201, Cw*1403 and B*4403) showed long-range LD. In clear contrast, for others with highly diverged combinations (for example, Cw*0102, Cw*0303 and Cw*0304), the EHH score diminished rapidly even within approximately 100 kb of the landmark locus (Figure 1b). As a rapid EHH decay was found at the C1_2_5 locus, around 22.1 kb centromeric to the HLA-Cw locus, we examined the EHH pattern using as the landmark a two-locus haplotype extending from C1_2_5 to either the HLA-B or the -Cw locus. The degree of EHH decay from the haplotypes of HLA-B or -Cw coupled with C1_2_5 showed a tendency similar to that obtained from the landmark of HLA-B or -Cw alone. Interestingly, the EHH pattern differed between Cw*0702 and Cw*0303 even though these two HLA-Cw alleles were linked to the identical allele, C1_2_5*200 (Figure 1d). As expected, the HLA-Cw/-B haplotypes tended to maintain long-range LD extending both centromeric and telomeric to the landmark locus (Figure 1e). These observations suggested that the diversity of the C1_2_5 locus at the nucleotide level was well correlated with the lineage of the HLA-Cw/-B haplotype.
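As a rough illustration of how an EHH value is obtained, the sketch below follows the usual definition of the statistic: among the chromosomes that carry the landmark (core) allele, EHH at a given marker is the probability that two randomly chosen chromosomes are identical over the whole interval from the core to that marker. The representation of a haplotype as a tuple of alleles at ordered markers, the function name ehh and the toy data are assumptions for illustration, not the implementation used in this study.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core_index, core_allele, upto_index):
    """Extended haplotype homozygosity from a core allele out to one marker.

    haplotypes: list of tuples of alleles at physically ordered markers (assumed input).
    """
    lo, hi = sorted((core_index, upto_index))
    # Extended haplotypes of the chromosomes carrying the core allele.
    carriers = [tuple(h[lo:hi + 1]) for h in haplotypes if h[core_index] == core_allele]
    n = len(carriers)
    if n < 2:
        return float("nan")  # homozygosity is undefined for fewer than two chromosomes
    counts = Counter(carriers)
    # Fraction of chromosome pairs that share an identical extended haplotype.
    return sum(comb(c, 2) for c in counts.values()) / comb(n, 2)


# Toy phased chromosomes over three ordered markers (hypothetical alleles).
toy_chromosomes = [
    ("A", "x", "1"),
    ("A", "x", "1"),
    ("A", "x", "2"),
    ("A", "y", "2"),
    ("B", "x", "1"),
]
profile = [ehh(toy_chromosomes, 0, "A", i) for i in range(3)]
```

For the toy data, the profile starts at 1 at the core itself and falls to 1/2 and then 1/6 as more distant markers split the carriers of the core allele into distinct extended haplotypes.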

Figure 1

Long-range haplotype test using classical HLA genes and microsatellite markers. Each plot represents the extended haplotype homozygosity (EHH) values spanning about 200–300 kb from alleles at two landmark loci, (a) HLA-B and (b) HLA-Cw, and three two-locus haplotypes, (c) HLA-B-C1_2_5, (d) C1_2_5-HLA-Cw and (e) HLA-B-HLA-Cw, in both directions. Vertical lines and arrowheads over the map indicate the locations of microsatellite markers and HLA loci, respectively. The gene map was obtained from the Wellcome Trust Sanger Institute (http://www.sanger.ac.uk/HGP/Chr6/MHC.shtml). The physical distances are given in kb, with negative and positive numbers used for locations proximal to and distal from the landmark, respectively.

Structural analysis of C1_2_5 marker

To further delineate the haplotypic structure of the HLA-Cw/-B region, we focused on the motif structure of a microsatellite marker, C1_2_5, which is located between HLA-B and HLA-Cw. Sequencing analysis of C1_2_5 revealed four motifs that differed by nucleotide substitutions in addition to the gain or loss of CA repeat units (Figure 2a). These substitutions occurred within the repeat tract and hence did not change the size of the PCR fragments, whereas the differences in motif structure provided additional information on diversity, as exemplified by C1_2_5*200 and C1_2_5*218. Using these data, genetic associations between C1_2_5 alleles and individual HLA-Cw/-B alleles were investigated to characterize the diversity of HLA haplotypes. The (CA)nCTCA and (CA)4AA(CA)5AA(CA)nCTCA motifs were in tight LD with Cw*0801 and Cw*0102, respectively, and the majority of C1_2_5 alleles showed strong LD with particular HLA-Cw alleles, but there were several exceptions. For example, Cw*0304 was in LD with three different C1_2_5 variations, (CA)4AA(CA)19CTCA, (CA)4AA(CA)21CTCA and (CA)4AA(CA)23TACACTCA. The former two variations, which formed the identical HLA-Cw/-B haplotype, Cw*0304-B*4002, are likely to be derived from the same repeat motif. In contrast, the third variation, which has a different motif, was linked to a different HLA-B allele, B*4001, forming the Cw*0304-B*4001 haplotype.
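The distinction between fragment size and motif structure can be made concrete with a short Python sketch that assigns a repeat-tract sequence to one of the four motifs named in this study and reports the length of the variable (CA)n run. It assumes that the input is the repeat tract alone, with flanking sequence already trimmed; the function name classify_motif and the example tracts are hypothetical.

```python
import re

# The four C1_2_5 motifs described in the text; the captured group is the (CA)n run.
MOTIF_PATTERNS = [
    ("(CA)nCTCA", re.compile(r"^((?:CA)+)CTCA$")),
    ("(CA)4AA(CA)nCTCA", re.compile(r"^(?:CA){4}AA((?:CA)+)CTCA$")),
    ("(CA)4AA(CA)nTACACTCA", re.compile(r"^(?:CA){4}AA((?:CA)+)TACACTCA$")),
    ("(CA)4AA(CA)5AA(CA)nCTCA", re.compile(r"^(?:CA){4}AA(?:CA){5}AA((?:CA)+)CTCA$")),
]


def classify_motif(tract):
    """Return (motif name, n) for a repeat tract, or None if no motif matches."""
    for name, pattern in MOTIF_PATTERNS:
        match = pattern.match(tract)
        if match:
            return name, len(match.group(1)) // 2
    return None


# Two tracts of identical length that differ only by a CA-to-AA substitution.
simple_tract = "CA" * 10 + "CTCA"
interrupted_tract = "CA" * 4 + "AA" + "CA" * 5 + "CTCA"
same_size = len(simple_tract) == len(interrupted_tract)

# One of the Cw*0304-linked variations reported in the text.
cw0304_variant = classify_motif("CA" * 4 + "AA" + "CA" * 23 + "TACACTCA")
```

The two example tracts have the same length and would therefore be indistinguishable by fragment sizing, yet they fall into different motif classes, which is why sequence-level typing adds information.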

Figure 2

Phylogenetic relationship between C1_2_5 motif and HLA-Cw/-B haplotype. (a) Phylogenetic tree predicted from the repeat motif structures at the C1_2_5 locus. HLA-Cw/-B haplotypes associated with C1_2_5 alleles were indicated with two LD coefficients, D′ and r2. Representative HLA-B alleles were shown. (b) Phylogenetic analysis of exon 4 sequences of HLA-Cw. Phylogenetic tree was constructed by the UPGMA method. The numbers for interior branches refer to the bootstrap values in percentage with 1000 replications.

Phylogenetic relationship between alleles of C1_2_5 and HLA-Cw

The mutation rate of SNPs has been estimated to be 10⁻⁸ per generation, whereas that of microsatellites is between 10⁻⁵ and 10⁻³ per generation.27 Relationships among C1_2_5 alleles with the four motifs were analyzed phylogenetically (Figure 2a). Of the four major motifs, the simplest structure was (CA)nCTCA, observed in the short alleles C1_2_5*188 and *192. All other C1_2_5 alleles had a CA-to-AA change at the fifth CA unit, resulting in the motif sequence (CA)4AA(CA)n, which interrupts the CA repeat array. In addition, C1_2_5 alleles containing the interrupting sequence, (CA)4AA(CA)n, could be subdivided into two further motifs: (CA)4AA(CA)nTACACTCA, which resulted from a CA-to-TA change on the 3′ side of the CA repeat, and (CA)4AA(CA)5AA(CA)nCTCA, which resulted from a CA-to-AA change at the 11th unit. As all microsatellite motifs shared the simplest motif, it was assumed that (CA)nCTCA was the core structure of the C1_2_5 microsatellite. To investigate the relationship between the lineage of the C1_2_5 motif structures and that of the neighboring SNPs, we constructed a phylogenetic tree using the exon 4 sequences of HLA-Cw alleles, which encode the α3 domain, to exclude the effects of selective pressure acting on the peptide-binding domain (Figure 2b). The relationship among C1_2_5 alleles (microsatellite lineage) was not always concordant with the relationship among HLA-Cw alleles (SNP lineage).

Multiallelic analysis of LD between C1_2_5 and its flanking HLA genes

As the EHH analysis focused on the LD among specific pairs of alleles and haplotypes with relatively high frequency (>3%), we also evaluated the overall LD between pairs of loci among HLA-B, -Cw and C1_2_5 to characterize the overall LD structure of this region (Table 3). The C1_2_5 locus, at both the allele level and the motif level, showed stronger LD with the HLA-Cw/-B haplotype than with either the HLA-B or the -Cw locus alone. These observations suggested that the divergence of the C1_2_5 locus reflected its tight association with the HLA-Cw/-B haplotype rather than its association with HLA-Cw alleles or HLA-B alleles.
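One of the two overall LD statistics named in the methods, Cramér's V, can be sketched in plain Python from a table of two-locus haplotype counts. The sketch assumes that phased haplotypes are available as (allele at locus 1, allele at locus 2) pairs and uses the standard definition of V based on the chi-square statistic; the function name cramers_v and the toy data are hypothetical, and the permutation test for significance is not shown.

```python
from collections import Counter
from math import sqrt


def cramers_v(pairs):
    """Cramér's V between two multi-allelic loci from phased two-locus haplotypes."""
    n = len(pairs)
    row = Counter(a for a, _ in pairs)     # allele counts at the first locus
    col = Counter(b for _, b in pairs)     # allele counts at the second locus
    observed = Counter(pairs)              # two-locus haplotype counts
    chi2 = 0.0
    for a, n_a in row.items():
        for b, n_b in col.items():
            expected = n_a * n_b / n       # count expected under linkage equilibrium
            chi2 += (observed.get((a, b), 0) - expected) ** 2 / expected
    k = min(len(row), len(col))
    if k < 2:
        return 0.0                         # a monomorphic locus carries no association
    return sqrt(chi2 / (n * (k - 1)))


# Toy examples (hypothetical alleles): complete association and complete independence.
v_complete = cramers_v([("m1", "h1")] * 5 + [("m2", "h2")] * 5)
v_independent = cramers_v([("m1", "h1"), ("m1", "h2"), ("m2", "h1"), ("m2", "h2")])
```

V ranges from 0, when the alleles at the two loci are independent, to 1, when every allele at the locus with fewer alleles is found with a non-overlapping set of alleles at the other locus.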

Table 3 Overall LD among HLA-B, HLA-Cw and C1_2_5 loci

Discussion

In this study, we investigated whether a microsatellite marker adjacent to the highly polymorphic HLA-Cw/-B loci could provide information to delineate the haplotype lineage. We found that the C1_2_5 microsatellite was highly variable, with three substitutions within the CA repeat array in addition to variation in the number of CA repeats. The unique polymorphic patterns at the C1_2_5 locus were well correlated with the HLA-Cw/-B haplotypes. We also showed that a simple analysis of fragment-size variation would overlook part of the microsatellite variation.

The structure of the repeat motif is of interest from an evolutionary viewpoint because the microsatellite and HLA-Cw alleles appeared to co-evolve. For example, the variation of C1_2_5 found in the Cw*0304-B*4002 haplotypes was attributable to differences in the number of repeat units, which can be explained by a strand-slippage mechanism. On the other hand, the different repeat motifs in C1_2_5*198 and C1_2_5*206, both associated with the identical allele, Cw*0304, were characterized by distinct HLA-B alleles, B*4002 and B*4001, respectively. It is unlikely that these two Cw*0304-linked haplotypes were shaped by a simple recombination event between the HLA-Cw and -B loci, as the motif structures of C1_2_5 differed between them. Instead, Cw*0304 might originally have existed in two different haplotype lineages.

Comparison of the EHH profiles showed that the extent of LD varied depending on the HLA haplotype. One possible explanation for this variation is the diversity of pairing between the alleles of HLA-B and -Cw. Indeed, the HLA alleles with a short-range LD profile showed greater diversity, probably due to repeated recombination events over time, which would produce LD decay between the landmark allele and the linked markers. On the other hand, haplotypes with a long-range LD profile might be of recent origin. In general, human genetic geography shows high continuity, and it is well known that the MHC haplotypes of neighboring populations were introduced to Japan through multiple routes.28 Therefore, the MHC haplotype structures in the Japanese population might have been shaped by multiple immigrations.

Each repeat motif observed at the C1_2_5 locus was in tight LD with a particular HLA-Cw allele and, in part, with an HLA-B allele, which together constituted the HLA-Cw/-B haplotypes. The mutation rate at a microsatellite is known to depend on its intrinsic features, including repeat number, length and motif size.29 For example, microsatellites with a greater number of repeats show higher mutation rates due to the increased probability of slippage.30 In contrast, interruption of a perfect repeat array has a great impact on the stability of microsatellite alleles.31 Indeed, interrupted motifs within repeat tracts that were correlated with HLA-DR/-DQ haplotypes have been described for DQCAR.32

In conclusion, we revealed that the unique mutational dynamics at the C1_2_5 locus could serve as a useful resource for tracing haplotype lineage in the Japanese population. Analysis of C1_2_5 structures along with HLA-Cw/-B haplotypes in other ethnic groups should help to clarify the lineages of haplotypes. Statistical methodology for predicting the HLA allele and the haplotype carried on a chromosome has been established using informative SNPs inside and/or outside the HLA genes.33, 34 However, the use of bi-allelic SNPs as markers requires more effort to obtain the information than the use of multi-allelic microsatellite markers, because many HLA alleles show a mosaic structure shaped by multiple polymorphic backgrounds. Microsatellite markers will shed light on haplotype lineage from a perspective different from that of the SNP-based tagging approach.