Introduction

The completion of a high-quality sequence of the human genome is a landmark event of this century, symbolizing the beginning of the postgenomic era. In this era, much interest has turned to genome variation (Collins et al. 2003), that is, to an understanding of how genomes change and take on new functional roles. Comparison of genome sequences from evolutionarily diverse species provides insight into the evolution of genes (Fu and Li 1999; Verrelli et al. 2002; Wooding et al. 2002) and a more comprehensive understanding of the function of important genomic elements. The study of sequence variation within species will also be important in defining the relationships between genotype and biological function, such as individual differences in health, susceptibility to disease, and drug response.

Besides the protein-coding sequences, a large part of the noncoding portion of the human genome is also under active selection, suggesting that it is functionally important. It probably contains the bulk of the regulatory information controlling the expression of protein-coding genes as well as nonprotein-coding genes (Bamshad et al. 2002). It may also contain sequence determinants of chromosome dynamics such as methylation and chromatin remodeling (Collins et al. 2003). The noncoding portion of the human genome has therefore also become a focal point in the study of genetic variation.

Human major histocompatibility complex (MHC) class-II antigens (HLA) are cell-surface molecules that regulate the specific immune response to a pathogen by presenting antigens to T-cell receptors, thereby mediating the activation of T lymphocytes. There are three isotypes of class-II molecules (DR, DQ, and DP), each consisting of two subunits, one α and one β chain, encoded by the separate A and B genes of that isotype (DRA and DRB, DQA and DQB, DPA and DPB), respectively. Abnormal expression of HLA-II genes causes certain diseases. For example, the expression of class-II molecules on inappropriate cells may change the ability of antigen-presenting cells to present antigen (both foreign and self) to T lymphocytes, triggering an autoimmune disease (Laurie et al. 1992). Deficient expression of MHC-II genes results in a hereditary immunodeficiency disease called bare lymphocyte syndrome (BLS) (Mach et al. 1996). Thus, the regulation of HLA-II gene expression is crucial in the control of the immune response.

Four sequence motifs within the promoter-proximal regions of all class-II genes have been identified as cis-acting regulatory elements, termed the W, X1, X2, and Y boxes (van den Elsen et al. 1998). These four boxes are highly conserved with respect to their sequences, relative positions, orientation, and spacing. Variation within these boxes can affect the gene expression level and the nuclear protein-binding affinities, as has been confirmed for the DRB1, DRB3, DQA1, DQB1, and DPB1 genes (Emery et al. 1993; Morzycka-Wroblewska et al. 1997; Andersen et al. 1991; Varney et al. 1999). It is therefore important to study the polymorphism of the regulatory regions of HLA-II genes in order to better understand how their expression is regulated in association with the immune response in humans. Our study is the first attempt to examine the variation of the regulatory region in the MHC-II genes of Chinese populations. The SNPs in the promoter region of HLA-II genes found in this study can be used as marker resources for association studies of complex diseases, assessment of individuals’ predisposition to diseases, and therapy tailoring, as well as markers for population genetics and evolution research.

The HLA-DPA1 and HLA-DPB1 genes are classical HLA-II genes. They are organized in a head-to-head fashion, with their 5’ ends pointing toward each other, so that the sequence between them functions as the promoter of both genes. We therefore selected an approximately 5-kb region at these two loci containing the promoter region, exon 1, and part of intron 1 of both the DPA1 and DPB1 genes. To cover as much polymorphism as possible, sequence data were obtained from seven different ethnic populations in China, including populations of both southern and northern origin (Yao et al. 2002). They are the Jing, Lahu, Yao, Pumi, Naxi, Li, and Guangdong Han, mainly from southwest China.

Materials and methods

Peripheral blood samples from 14 healthy, unrelated individuals carrying different HLA-DPB1 alleles (02012, 0202, 03011, 0401, 0402, 0501, 1401, 2201, 6201, 2801, 6301, 5101, 5601, and 8001), as determined in our previous work, were collected from populations in southwest China for the study of the 5-kb region. Genomic DNA was extracted from whole blood containing ACD anticoagulant by the modified salting-out method, as indicated by the International Histocompatibility Work Group (IHWG) (http://www.ihwg.org/protocols).

Based on the contig NT_033951 (gi: 27498326) in GenBank, which contains the complete sequence of the HLA region, two 28-nt primers were designed to amplify the 5-kb target fragment: 5’-AGGGCTTGAGGGGCTGTATTCAGGAGAT-3’ and 5’-AGCTGGGTCTGGACTTCAAACTTGGCTC-3’. PCR amplification was performed in a 20-μl reaction volume containing 0.75 mmol/l each dNTP, 0.25 μmol/l each primer, 1 U Ex Taq polymerase (Takara), and 50 ng genomic DNA. A two-step PCR program of 35 cycles in total was carried out: 95°C for 3 min; 10 cycles of 94°C for 40 s and 68°C for 4 min; and 25 cycles of 94°C for 40 s and 68°C for 4 min (increasing by 5 s each cycle), followed by a final 72°C for 10 min. The products were cloned into the pGEM-T Easy Vector (Promega, USA). Six positive plasmids for each consensus sequence were sequenced in both directions on an ABI 3700 sequencer using BigDye reagent (Applied Biosystems, USA).
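The primer lengths and the cycle count stated above can be checked with a few lines of Python. This is only an illustrative sketch: the labels `primer_1` and `primer_2`, the helper names, and in particular the reading of "increasing 5 s each cycle" (here assumed to start with the second cycle of the 25-cycle stage) are assumptions, not details given in the protocol.

```python
# Primer sequences copied from the text (5'->3'); the labels are arbitrary.
PRIMERS = {
    "primer_1": "AGGGCTTGAGGGGCTGTATTCAGGAGAT",
    "primer_2": "AGCTGGGTCTGGACTTCAAACTTGGCTC",
}


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a primer."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def extension_times(stage1_cycles=10, stage2_cycles=25, base_s=240, step_s=5):
    """Duration in seconds of the 68 °C step for every cycle.

    Assumption (not stated in the text): the 5-s increment begins with the
    second cycle of stage 2, so the first stage-2 cycle still lasts base_s.
    """
    stage1 = [base_s] * stage1_cycles
    stage2 = [base_s + step_s * i for i in range(stage2_cycles)]
    return stage1 + stage2


for name, seq in PRIMERS.items():
    print(f"{name}: {len(seq)} nt, GC fraction {gc_fraction(seq):.2f}")

times = extension_times()
print(f"{len(times)} cycles in total; longest 68 °C step: {max(times)} s")
```

Under the other plausible reading, in which the increment already applies to the first stage-2 cycle, every stage-2 duration would simply be 5 s longer.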

All sequence segments were assembled automatically using SeqMan in the DNASTAR software package and were then carefully checked manually using the same program. All sequences were aligned with the ClustalX program (Thompson et al. 1997). Singletons and doubletons were verified by reamplification and resequencing in both directions.
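The step of scanning an alignment for polymorphic sites and flagging singletons and doubletons for verification can be sketched as follows. This is not the software used in the study (assembly and alignment were done with SeqMan and ClustalX); the function names and the toy alignment are hypothetical and serve only to show the logic.

```python
from collections import Counter


def variable_sites(alignment):
    """Return (column, kind, rarest_count) for each polymorphic column.

    column       -- 1-based position in the alignment
    kind         -- "INDEL" if any sequence has a gap there, otherwise "SNP"
    rarest_count -- number of sequences carrying the least common state
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")
    sites = []
    for col in range(length):
        counts = Counter(seq[col].upper() for seq in alignment)
        if len(counts) < 2:
            continue  # monomorphic column
        kind = "INDEL" if "-" in counts else "SNP"
        sites.append((col + 1, kind, min(counts.values())))
    return sites


def needs_verification(sites, max_count=2):
    """Sites whose rarest state is a singleton (1) or doubleton (2)."""
    return [site for site in sites if site[2] <= max_count]


# Hypothetical toy alignment, not data from this study.
TOY_ALIGNMENT = [
    "ACGTACGA",
    "ACGTACGA",
    "ACGTACGA",
    "ACATACGT",
    "ACGT-CGT",
    "ACGT-CGT",
]

sites = variable_sites(TOY_ALIGNMENT)
print(sites)
print(needs_verification(sites))
```

In the toy data, column 3 is a singleton SNP and column 5 a doubleton INDEL, so both would be sent for reamplification and resequencing, while the evenly split SNP in column 8 would not.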

Results and discussion

From the 14 samples, 25 cloned sequences of 4751 to 4759 bp in length were obtained. A total of 170 polymorphisms were found, eight of which were insertions and deletions (INDELs), all shorter than 12 bp: three in intron 1 of the HLA-DPB1 gene and five in the region between the HLA-DPB1 and DPA1 genes. Detailed sequence information on the polymorphisms is listed in Table 1.

Table 1. Characterization of variations in the 4.7-kb region in southern Chinese ethnic populations. Variation is shown in capital letters. Nucleotide numbering in the coding sequences, 5’-untranslated regions, and 5’-flanking regions follows the sequence information of NT_033951 (gi: 27498326) from NCBI. DPA1 exon 1 and DPB1 exon 1, 100 bp each; 5’-untranslated region of DPA1, 31 bp; 5’-untranslated region of DPB1, 59 bp; 5’-flanking region of DPA1, 41 bp; 5’-flanking region of DPB1, 34 bp. The position of variations in the region between the two genes is given as the distance to the transcription initiation site of the DPA1 gene. Frequency: the number of the major single nucleotide polymorphism (SNP) allele/the number of the minor SNP allele; e.g., the variation with ID170 has 20 Gs and 5 Ts in our samples. INDEL: insertion and deletion polymorphism.

After exclusion of all INDELs, 4735 bp remained and were used to position the polymorphic sites (Table 1). Within the 4735-bp region, we identified 162 SNPs: 49 in intron 1 of the HLA-DPB1 gene, three in exon 1 of the HLA-DPB1 gene, two in the 5’-untranslated region of the HLA-DPB1 gene, five in the 5’-flanking region of the HLA-DPB1 gene, 83 in the region between the HLA-DPB1 and DPA1 genes, two in the 5’-flanking region of the HLA-DPA1 gene, two in the 5’-untranslated region of the HLA-DPA1 gene, one in exon 1 of the HLA-DPA1 gene, and 15 in intron 1 of the HLA-DPA1 gene. This corresponds to one SNP per 29 bp on average (about 3.4%), much denser than the average level of the human genome (0.11%). The frequencies of substitutions by type were 38.9% for A/G (63), 32.7% for C/T (53), 9.3% for A/T (15), 7.4% for C/G (12), 5.6% for A/C (9), 4.9% for G/T (8), and 0.6% each for the triallelic C/T/G (1) and C/T/A (1). The ratio of transitions to transversions was 2.6, close to the value of 2.3.
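The summary figures in this paragraph follow directly from the counts given, as the short Python check below illustrates. The dictionary keys and variable names are ours, and the exclusion of the two triallelic sites from the transition/transversion ratio is an assumption; it is the reading under which the reported value of 2.6 is reproduced.

```python
REGION_BP = 4735  # aligned length after excluding INDELs

# SNP counts per region, as listed in the text.
SNPS_BY_REGION = {
    "DPB1 intron 1": 49,
    "DPB1 exon 1": 3,
    "DPB1 5'-UTR": 2,
    "DPB1 5'-flanking": 5,
    "DPB1-DPA1 intergenic": 83,
    "DPA1 5'-flanking": 2,
    "DPA1 5'-UTR": 2,
    "DPA1 exon 1": 1,
    "DPA1 intron 1": 15,
}

# SNP counts per substitution type, as listed in the text.
SNPS_BY_TYPE = {
    "A/G": 63, "C/T": 53, "A/T": 15, "C/G": 12,
    "A/C": 9, "G/T": 8, "C/T/G": 1, "C/T/A": 1,
}
TRANSITIONS = {"A/G", "C/T"}

total = sum(SNPS_BY_TYPE.values())
bp_per_snp = REGION_BP / total
percent_by_type = {t: round(100 * n / total, 1) for t, n in SNPS_BY_TYPE.items()}

# Assumption: only biallelic sites enter the ratio (triallelic sites skipped).
ts = sum(n for t, n in SNPS_BY_TYPE.items() if t in TRANSITIONS)
tv = sum(n for t, n in SNPS_BY_TYPE.items()
         if t not in TRANSITIONS and t.count("/") == 1)
ts_tv = ts / tv

print(f"{total} SNPs, one per {bp_per_snp:.1f} bp ({100 * total / REGION_BP:.2f}%)")
print(percent_by_type)
print(f"transitions {ts}, transversions {tv}, ratio {ts_tv:.2f}")
```

The two tallies (by region and by substitution type) both sum to 162, which serves as a simple consistency check on the counts.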

Of the four cSNPs (SNPs in coding regions), two were synonymous and two were nonsynonymous. One synonymous substitution was C/T at +117 in exon 1 of the HLA-DPB1 gene, with both alleles encoding alanine; the other was A/T at +40 in exon 1 of the HLA-DPA1 gene, with both alleles encoding proline. Both nonsynonymous substitutions were in exon 1 of the DPB1 gene: one was a C/T transition at position +140, leading to a Thr/Met change at residue 16; the other was a T/C transition at +152, leading to a Met/Thr change at residue 20. None of them had been reported in either the dbSNP database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=snp) or the IMGT/HLA sequence database (http://www.ebi.ac.uk/imgt/hla/) as of September 2003.
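The distinction between synonymous and nonsynonymous substitutions can be illustrated with the standard genetic code. In the sketch below, the codons are not taken from the paper, which reports only the nucleotide change and the amino acids; they are assumptions chosen to be consistent with those reports, with alleles read on the coding strand.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify(codon_a: str, codon_b: str):
    """Return (kind, amino_acid_a, amino_acid_b) for a pair of codons."""
    aa_a, aa_b = CODON_TABLE[codon_a], CODON_TABLE[codon_b]
    kind = "synonymous" if aa_a == aa_b else "nonsynonymous"
    return kind, aa_a, aa_b


# Illustrative codon pairs (assumed, see the paragraph above).
EXAMPLES = {
    "DPB1 +117 C/T (Ala)": ("GCC", "GCT"),
    "DPA1 +40 A/T (Pro)": ("CCA", "CCT"),
    "DPB1 +140 C/T (Thr/Met)": ("ACG", "ATG"),
    "DPB1 +152 T/C (Met/Thr)": ("ATG", "ACG"),
}

for label, (codon_a, codon_b) in EXAMPLES.items():
    print(label, codon_a, "->", codon_b, classify(codon_a, codon_b))
```

Because methionine is encoded only by ATG, a single C/T change that exchanges Thr and Met must involve the codon pair ACG/ATG when read on the coding strand; the Ala and Pro codons shown are one of several possibilities.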

Each of the highly conserved X1, Y, and W’ boxes within the promoter of the DPB1 gene (van den Elsen et al. 1998) contained one substitution. These three substitutions were all G/A transitions and had been reported before by Varney et al. (1999), who named the allele containing them DP-PRO4. More interestingly, Varney et al. found this allele in seven individuals of eastern Asian origin. Taken together, these data suggest that the DP-PRO4 allele, which carries the three G/A substitutions in the X1, Y, and W’ boxes, may have originated in China. Their competitive binding assay (Varney et al. 1999) showed that the substitutions in the W’ and X1 boxes had no effect on binding affinity, whereas the single substitution at the site immediately adjacent to the inverted CCAAT motif in the Y box reduced binding affinity. However, whether this substitution influences the transcription of the DPB1 gene needs to be examined further by in vivo experiments, since the Y box is not as important as the X1 box in regulating gene expression.

By comparing our data with the SNPs deposited in the NCBI dbSNP database, we found that 145 (89.5%) of the 162 SNPs were novel as of August 2003. Conversely, three SNPs recorded in dbSNP within this region (rs2071349, rs2856830, and rs4279481) were not found in our 25 sequences. In short, these high-resolution genome variation maps, with their unusually high density of SNPs, can be used as marker resources for association studies of complex diseases, assessment of individuals’ predisposition to diseases, and therapy tailoring, as well as research markers for population genetics and evolution.