Introduction

Human genetic studies have become much easier since the availability of annotated DNA sequences from the Human Genome Project and cataloged single-nucleotide polymorphisms (SNPs) of major ethnic groups from the HapMap project.1, 2 High-throughput genotyping and sequencing technologies have also facilitated related research. To investigate genetic variation, association studies are frequently used to identify DNA variants conferring phenotypes such as physical appearance, disease susceptibility and drug metabolism.3, 4, 5 However, obstacles resulting from population substructure,6 variation in linkage disequilibrium (LD) patterns7, 8 and local natural selection9 have hindered the progress of such studies. Poor replication of genotype–phenotype associations compounds the problem of evaluating the true effect of a genetic variant.10, 11 All of these issues can be mitigated with information on the underlying population differentiation, obtained by deep sequencing of more individuals from more populations.

The International HapMap Project released a collection of over 3.1 million SNPs taken from individuals on three continents (Africa, Europe and Asia).12 However, the polymorphisms of populations other than the HapMap populations are inadequately represented.13, 14 Two concerns arise when HapMap data are used in population-related studies of the Chinese population. First, the SNP discovery preceding the HapMap genotyping project was mostly conducted in a small sample panel, predominantly of European descent.15, 16 The genetic information derived from the HapMap database, however, is often applied to a larger sample set. As a result, ascertainment bias may be introduced when conducting association studies with Asian populations. Second, the genetic complexity of Chinese populations cannot be represented solely by the 45 HapMap Han individuals. China accounts for a significant proportion of the world population, with over 1 billion people, and comprises 56 official nationalities.17 Population genetic studies revealed that Chinese minorities are distinct and have multiple origins: Southeast Asia, Northeast Asia (Altaic) and Mid-Asia.18, 19, 20, 21 Furthermore, Han, the major ethnic group of the Chinese, can be divided into northern and southern groups.18, 19, 22 Questions regarding the underlying population substructure of the Chinese are still awaiting investigation.

Despite the differences in sampled populations, genotyped markers and examined genomic regions, previous studies found that at least four major Chinese groups exist.18, 22, 23 To be properly representative of Chinese populations, six populations with 283 individuals were selected in this study for their geographic location and affiliation with major linguistic subfamilies. Each of the aforementioned four Chinese groups was exemplified by at least one of our study populations. The six populations are Han Shandong (from Northern China), Han Guangdong (from Southern China), Li, Yi, Tibetan and Mongolian. We sequenced a 31-kb region, comprising the tumor necrosis factor (TNF) gene cluster, in the class III region of the human major histocompatibility complex on chromosome 6p21.3 (Figure 1). Five immune-related genes (LTA, TNF, LTB, LST1 and NCR3) with known connections to various human diseases are located in the region (http://www.ncbi.nlm.nih.gov/omim). In particular, TNF, a proinflammatory cytokine, has an important role in innate immunity as the first line of host defense against infection and is implicated in infectious disease, metabolic disorder and cancer.

Figure 1

Map of the 31-kb region comprising the TNF gene cluster in the human major histocompatibility complex.

By resequencing this gene-rich and immune-related region, we attempt to uncover genetic differentiation at the DNA sequence level for Chinese populations. The objective is to gain knowledge for future association studies of disease mapping with Chinese populations.

Materials and methods

Sampled populations

Six Chinese populations (283 unrelated individuals in total), namely Han Shandong, Han Guangdong, Li, Yi, Tibetan and Mongolian, were used in this study with sample sizes of 45, 47, 51, 46, 46 and 47, respectively. Mongolian and Han Shandong populations reside in North China; Han Guangdong and Li populations reside in South China; Yi and Tibetan populations reside in Southwest China. All DNA samples were extracted from cultured cells using a phenol–chloroform method, measured by NanoDrop ND-1000 (ThermoFisher Scientific, Waltham, MA, USA) and adjusted to a concentration of 30 ng μl−1. The cell lines were deposited at the Immortalize Cell Bank of Chinese Nationalities, which is supported by the Chinese Human Genome Diversity Project. Each donor provided written informed consent for cell line establishment and subsequent studies. Our project was reviewed and approved by the Ethics Committee at the Chinese Academy of Medical Sciences and Peking Union Medical College.

PCR amplification, DNA sequencing and identification of SNPs

Primer sets for PCR and sequencing were designed using Primer-Premier 5.0 (PREMIER Biosoft International, Palo Alto, CA, USA). All PCR reactions were performed with a touchdown method and PCR products were purified using an AcroPrep 384 Multi-well Filter 30k plate (Pall, Port Washington, NY, USA). Sequencing reactions were conducted using Applied Biosystems Big Dye Terminator chemistry, and the products were resolved on ABI Prism 3730XL DNA Analyzers (Applied Biosystems, Carlsbad, CA, USA). Sequence trace files were analyzed using Phred/Phrap/Polyphred/Consed (University of Washington, Seattle, WA, USA) software. The base-quality value threshold was set to 20 in Phred (that is, a 99% probability that the base is accurate), and all polymorphic sites were manually inspected by at least two individuals. To validate every polymorphic site, the 31-kb region was sequenced on both strands at least once. Singletons and doubletons were further sequenced with additional primer sets for verification. To maintain data integrity, a polymorphic site was retained only if the confirmed genotype rate at the site was over 95%.
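The relationship between the Phred threshold of 20 and the stated 99% base accuracy follows from the standard Phred definition, Q = −10·log10(P_error). The short Python sketch below illustrates that conversion; the function names are ours and are not part of the Phred software.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Estimated base-call error probability for a Phred quality score Q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_error: float) -> float:
    """Phred quality score for a given base-call error probability."""
    return -10 * math.log10(p_error)


if __name__ == "__main__":
    threshold = 20  # base-quality threshold stated in the text
    p_err = phred_to_error_prob(threshold)
    print(f"Q{threshold}: error probability {p_err:.4f}, accuracy {1 - p_err:.2%}")
```

Under this definition, raising the threshold by 10 lowers the tolerated error probability tenfold.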

Population genetic analysis

To ensure that only unrelated individuals were included in each population, the Hardy–Weinberg equilibrium test was used on all identified polymorphic sites. Only 8 out of 1638 tests showed P<0.05 (χ2-test) in a sporadic manner, which may be caused by the high number of repeated tests or by a small sample size for each population. The SNPs corresponding to the aforementioned eight tests were not removed from subsequent analysis because of their negligible impact on results. Population pairwise GST (Supplementary Table 1), rather than FST, was calculated with DnaSP v5,24 because Nei's GST provided more reliable estimates (based on our simulation results) in this study. To obtain an empirical P-value of the metric, a permutation-based procedure was used. For each population pair, identities of individuals were swapped at random and GST was calculated for each new configuration. One thousand permutations were performed and the proportion of permuted GST values greater than the observed value was then taken as the P-value. An unrooted neighbor-joining tree was constructed using observed GST as the genetic distance in MEGA 4.25 θW and π, two parameters that describe nucleotide diversity, were estimated with ARLEQUIN 3.1.1.26 Selection tests, including Tajima's test and Fu and Li's test, were performed using DnaSP v5.24 The empirical distribution of these test statistics was generated by a coalescent simulation method, and the P-value corresponded to the percentile of the observed statistic within this empirical distribution. A sliding-window analysis of Tajima's test was also performed to inspect its variation along the sequence. Fu's Fs statistic, which is sensitive to population growth,27 was also calculated.
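The permutation procedure for the empirical P-value of GST can be sketched as follows. This is a minimal illustration and not the DnaSP implementation: it assumes biallelic genotypes coded 0/1/2 (rows are individuals, columns are sites) and uses a simplified two-population, site-based form of Nei's GST, (H_T − H_S)/H_T. All function names are hypothetical.

```python
import random


def allele_freqs(genotypes):
    """Alternate-allele frequency per site from rows of 0/1/2 genotype codes."""
    n = len(genotypes)
    n_sites = len(genotypes[0])
    return [sum(row[j] for row in genotypes) / (2 * n) for j in range(n_sites)]


def nei_gst(pop_a, pop_b):
    """Simplified two-population G_ST = (H_T - H_S) / H_T, summed over sites."""
    h_s = 0.0  # mean within-population expected heterozygosity
    h_t = 0.0  # expected heterozygosity at the averaged allele frequency
    for p_a, p_b in zip(allele_freqs(pop_a), allele_freqs(pop_b)):
        h_s += (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        p_bar = (p_a + p_b) / 2
        h_t += 2 * p_bar * (1 - p_bar)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def permutation_p_value(pop_a, pop_b, n_perm=1000, seed=0):
    """Shuffle population labels and count permuted G_ST values above the observed one."""
    rng = random.Random(seed)
    observed = nei_gst(pop_a, pop_b)
    pooled = list(pop_a) + list(pop_b)
    n_a = len(pop_a)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if nei_gst(pooled[:n_a], pooled[n_a:]) > observed:
            exceed += 1
    return observed, exceed / n_perm


if __name__ == "__main__":
    rng = random.Random(42)

    def simulate(n, freqs):
        # Toy data only: two independent allele draws per individual per site.
        return [[(rng.random() < p) + (rng.random() < p) for p in freqs] for _ in range(n)]

    pop_a = simulate(45, [0.2, 0.5, 0.1, 0.4])
    pop_b = simulate(47, [0.4, 0.3, 0.1, 0.6])
    gst, p_value = permutation_p_value(pop_a, pop_b)
    print(f"G_ST = {gst:.4f}, empirical P = {p_value:.3f}")
```

Fixing the random seed keeps the empirical P-value reproducible between runs; the count uses a strict "greater than" comparison, matching the wording of the procedure.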

For haplotype analysis, 26 polymorphic sites with a minor allele frequency (MAF) greater than 0.2 in the combined population (283 individuals) were used (Supplementary Table 2). Population-specific haplotypes were constructed using the Bayesian statistical method implemented in PHASE 2.1.28 The best-fitting haplotype set of each population was obtained from five inference runs with different seeds. Finally, the Unweighted Pair Group Method with the Arithmetic Mean (UPGMA) clustering method was used to evaluate the relationships of all haplotypes. The sex- and time-averaged population recombination rate was estimated with the subprogram Interval of LDhat 2.1,7 assuming an effective population size of 10 000 for the two Han populations and 5700 for the other four populations.29 Haplotype blocks and tag SNPs were obtained using Haploview 4.0.30 Tag SNPs were selected using the pairwise mode with an r2 threshold of 0.8. Only biallelic SNPs were used for the above analyses. Insertions/deletions (indels) and short tandem repeats were not considered. The homologous sequences of one chimpanzee and one rhesus macaque were downloaded from the NCBI database to serve as outgroups.
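The assumed effective population sizes matter because LDhat estimates the population-scaled recombination rate ρ = 4·N_e·r, so converting ρ to a per-generation rate in cM/Mb requires N_e. The Python sketch below shows that conversion under two assumptions of ours: that ρ is expressed per kb, and that recombination fractions are small enough for 1 cM to correspond to a 1% recombination probability. The function name and the example value of ρ are illustrative and do not come from the study.

```python
def rho_per_kb_to_cm_per_mb(rho_per_kb: float, n_e: float) -> float:
    """Convert rho = 4 * N_e * r (per kb) to a recombination rate in cM/Mb."""
    r_per_bp = rho_per_kb / 1000.0 / (4.0 * n_e)  # probability per bp per generation
    return r_per_bp * 1e6 * 100.0  # scale to one Mb, then Morgans to centiMorgans


if __name__ == "__main__":
    example_rho = 1.0  # hypothetical rho per kb, for illustration only
    for label, n_e in (("Han (N_e = 10 000)", 10_000), ("other populations (N_e = 5700)", 5_700)):
        rate = rho_per_kb_to_cm_per_mb(example_rho, n_e)
        print(f"{label}: {rate:.2f} cM/Mb")
```

The same ρ therefore maps to a higher per-generation rate in a population with a smaller assumed N_e, which is worth keeping in mind when comparing rates across populations.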

Results

Sequence variation in the 31-kb region

With sequences of 283 unrelated individuals from the six Chinese populations, a total of 273 polymorphic sites were identified in the 31-kb region (8.5 sites per kb). SNPs represented the majority of the polymorphisms (89%), whereas indels and short tandem repeats made up the rest (10 and 1%, respectively). A summary of sequence variants of the region in the six Chinese populations is shown in Table 1. The Mongolian population had the highest number of variants (177) among the six populations; Li and Yi populations were less polymorphic with low numbers of variants (121 and 122, respectively). Percentages of singletons and doubletons differed significantly among populations. In particular, singletons varied from 7% in Li to 29% in Mongolian populations. These differences in the distribution of variation among the six Chinese populations further substantiate the known genetic differentiation among Chinese populations.

Table 1 Summary of sequence variants across the 31-kb region in six Chinese populations

A total of 20 SNPs were identified in the coding region with six synonymous SNPs and 14 nonsynonymous SNPs. Furthermore, two novel SNPs found in the genic region only appeared in the Li population: one (MAF=0.13) was located in the 3′UTR region of TNF and the other (MAF=0.06) was in the exon 2 region of NCR3 (as a nonsynonymous SNP). Among all identified SNPs, eight SNPs are associated with diseases according to OMIM (Table 2). We observed a one- to five-fold variation in allele frequencies of the eight SNPs in the six populations. Such frequency variation of disease-associated SNPs may reflect differential disease susceptibilities among populations.

Table 2 Disease-associated SNPs in six Chinese populations

Population subdivision

To measure genetic differentiation of the six populations, we used Nei's pairwise GST along with an empirical P-value to substantiate the measure. Most of the pairwise GST values between populations were statistically significant (P<0.05), except the value between Han Shandong and Mongolian populations (GST=0.00108; P=0.176). A neighbor-joining tree constructed with GST shows that the grouping of these six populations agrees fairly well with their geographic distributions (Figure 2). We noticed that the geographic proximity of the Han Guangdong and Li populations, as well as that of the Yi and Tibetan populations, is reflected fairly well in their genetic distances in terms of GST. Furthermore, Han Shandong and Han Guangdong populations are clearly genetically differentiated. Our previous study, including the same six populations with 10 short tandem repeat markers from chromosome 3, exhibited a similar pattern of genetic relationships (H Lin et al., unpublished data). It is reasonable to believe that geographic isolation and gene flow between adjacent populations drove the genetic characteristics of Chinese populations within the same region to become more similar.

Figure 2

Unrooted phylogenetic neighbor-joining tree of the 31-kb region in six Chinese populations. The tree was constructed with Nei's GST as genetic distance. The scale of the GST value is 1/1000.

Selection tests

To reveal possible selection processes in the six populations, allele frequency-based selection tests were performed with DnaSP. As shown in Table 3, the Mongolian population is the most polymorphic, with the highest measures of nucleotide diversity: π and θW. Li and Yi populations have the lowest θW, owing to their low numbers of polymorphic sites. Both populations exhibit similar values of π and θW, whereas the other populations have distinct values. Significant negative values of selection tests observed in Han Shandong (Fu and Li's D/F/F* tests, P=0.031, 0.042 and 0.036, respectively) and Mongolian (Tajima's D and Fu and Li's F* tests, P=0.046 and 0.045, respectively) populations indicate a deviation from neutrality and the possibility of selection. Nonetheless, population demographic history, such as population expansion, could confound this deviation, because negative values of Fs were observed in these two populations. In contrast, the significant positive values of Fu and Li's D/D* test (both P=0.004) and Fu's Fs test (P=0.015) in the Li population indicate that the Li population may have either experienced balancing selection or maintained a constant population size for a long period.

Table 3 Statistics of sequence diversity and natural selection tests

Instead of generalizing the selection test results with pooled genetic data of the 31-kb region as above, Tajima's D test across the region was also conducted with a sliding-window method (Figure 3). The six populations have similar patterns of Tajima's D, except for differences in D values and their associated significance levels. At the LTA gene, all populations displayed positive D values, but these were significant only in the Li population. As for the TNF and LTB genes, Han Shandong, Han Guangdong, Tibetan and Mongolian populations had significant negative D statistics. The NCR3 gene was also influenced by selection, because the two Han populations have significantly negative values. Although demographic events such as population expansion can confound the results of selection tests, they act on all loci of the human genome, whereas natural selection acts on specific regions, such as genic regions.31 Therefore, observed negative D values can mostly be ascribed to selection at the genic regions of the 31-kb sequence.

Figure 3

Sliding-window analysis with Tajima's D test in the 31-kb region. Tajima's D statistics are calculated in 30-SNP sliding windows with 5-SNP steps. The x axis represents the ordered 242 SNPs along the 31-kb region. The position of each dot on the x axis marks the middle of a sliding window. The y axis represents the value of Tajima's D. A display of corresponding genes is on the top of each panel. The symbol ‘+’ represents 0.05<P<0.1, and ‘*’ represents P<0.05.
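The sliding-window scheme described in the Figure 3 legend (30-SNP windows, 5-SNP steps) can be illustrated with the standard Tajima's D formula. The Python sketch below is not the DnaSP code used in the study; it assumes phased haplotypes coded as lists of 0/1 alleles over the ordered SNPs, counts only sites that segregate in the given sample, and does not compute the coalescent-based P-values. Function names are ours.

```python
import math


def tajimas_d(haplotypes):
    """Tajima's D for a list of 0/1 haplotypes; None if no site segregates."""
    n = len(haplotypes)
    seg_sites = 0
    pi = 0.0  # mean number of pairwise differences
    for j in range(len(haplotypes[0])):
        k = sum(h[j] for h in haplotypes)
        if 0 < k < n:
            seg_sites += 1
            pi += 2.0 * k * (n - k) / (n * (n - 1))
    if seg_sites == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    theta_w = seg_sites / a1  # Watterson's estimator for the window
    return (pi - theta_w) / math.sqrt(e1 * seg_sites + e2 * seg_sites * (seg_sites - 1))


def sliding_tajimas_d(haplotypes, window=30, step=5):
    """Return (window midpoint index, D) for each SNP-count window."""
    n_sites = len(haplotypes[0])
    results = []
    for start in range(0, n_sites - window + 1, step):
        sub = [h[start:start + window] for h in haplotypes]
        results.append((start + window // 2, tajimas_d(sub)))
    return results


if __name__ == "__main__":
    # Toy sample: 10 haplotypes, 40 sites, each site a singleton (excess of rare alleles).
    toy = [[1 if j % 10 == i else 0 for j in range(40)] for i in range(10)]
    for midpoint, d in sliding_tajimas_d(toy):
        print(midpoint, None if d is None else round(d, 3))
```

An excess of rare alleles, as in the toy sample, drives D negative, whereas an excess of intermediate-frequency alleles drives it positive; this is the contrast the text draws between the TNF/LTB windows and the LTA window in the Li population.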

Haplotype reconstruction and recombination rate estimation

To simplify the haplotype analysis, only SNPs with an MAF>0.2 in the combined population were used. As shown in Figure 4, 36 haplotypes were derived from the 26 analyzed SNPs, and they clustered into two major clades. The ancient clade (40% of total chromosomes) was defined by cogrouping with the haplotypes from chimpanzee and rhesus macaque. Except for the haplotype distribution of the Li population slightly departing from those of the other populations, there were no significant differences in haplotype distributions among the six populations or between the three closely related population pairs (Han Shandong and Mongolian, Han Guangdong and Li, and Yi and Tibetan) defined by GST.

Figure 4

Haplotype clades and distributions of the six Chinese populations. The blue square represents the major allele of an SNP. The yellow square represents its minor allele. A total of 26 SNPs (MAF>0.2) were used to infer haplotypes by PHASE. The derived haplotypes were clustered with the UPGMA method, giving rise to eight clades separated by darker lines. The number of chromosomes in each haplotype from the six populations is shown on the right. Han SD represents Han Shandong; and Han GD represents Han Guangdong. Chimpanzee and rhesus macaque sequences were used as the outgroup.

Recombination is the driving event that shapes an LD block. To investigate the variation in the LD patterns of these populations, we used all SNPs obtained in the 31-kb region to estimate local recombination rates and to map recombination hot spots. In total, eight haplotypes were found in all six populations. These eight shared haplotypes exceed the average number of four common haplotypes per LD block with HapMap CHB+JPT data.32 Frequent recombination events in the region may contribute to this excessive number of haplotypes. The overall population recombination rate ranged widely from 3.1 in the Li population to 18.5 in the Han Shandong population. However, the patterns of recombination rate were similar among the six populations (Figure 5), marked with three recombination hot spots at around the 10-, 22- and 30-kb sites. Subtle differences in recombination patterns, such as the position and magnitude of hot spots, were observed among populations. For example, hot spot shifting was noticed for the Li population from the 10-kb site to a 5-kb site. In the case of the Tibetan population, the 10-kb hot spot was replaced with a wide-spread hot spot at the 20-kb site. Because of active recombination, only a small 7-kb block was observed at the beginning of the 31-kb region for Han Shandong and Mongolian populations. In a disease association study, LD blocks of this type, flanked by active recombination hot spots, will have coverage problems when tag SNPs are taken from populations of a different ethnicity.

Figure 5

LD blocks and recombination patterns of the 31-kb region among the six Chinese populations. The LD block was constructed from SNPs with MAF⩾0.1 by Haploview. The colors of the D′ plots are as follows. Red: D′=1 and LOD⩾2; pink: D′<1 and LOD⩾2; blue: D′=1 and LOD<2; and white: D′<1 and LOD<2. The SNPs used to construct LD blocks are displayed under each D′ plot. The panel beneath each D′ plot shows the estimated population recombination rates (in cM/Mb, by LDhat) versus the physical position (in base pairs) of all 242 SNPs along the TNF gene cluster region.

Tag SNP transferability

To evaluate tag SNP transferability of the 31-kb region in Chinese populations, we used our resequencing data and Phase II HapMap data to compare how efficiently a population-specific tag SNP set captures the genotypic information of other populations. The HapMap CHB-derived tag SNPs displayed an average coverage rate of 64% for common SNPs from the six populations (Table 4). The coverage rate was not improved by using HapMap CHB+JPT-derived tag SNPs (54% on average). Although it is known that the HapMap CHB sample is a Han collection obtained from Northern China,2 in this study, the CHB-derived tag SNPs performed better in the Han Guangdong than in the Han Shandong population (66 vs 57%; P=0.003 obtained by a re-sampling method). This unexpected result may be caused by limited sample size.13 The low coverage of CHB-derived tags for Li (55%) and Mongolian (46%) populations may be ascribed to the high number of population-specific SNPs in the Li population and the high nucleotide diversity of the Mongolian population, respectively. Tag SNPs selected from the Mongolian population could account for 87% of the common SNPs of the other five populations, whereas tag SNP sets of both Li and Yi populations exhibited low transferability with a coverage of 69%. This finding suggests that the Mongolian population is a good proxy for diverse Chinese populations and a preferred population for tag SNP selection for a Chinese-based association study. Moreover, when considering only Han populations, Northern Han is the preferred population for tag SNP selection, because we observed that tag SNP transferability of the Han Shandong population is higher than that of the Han Guangdong population (Table 4). Limited tag SNP transferability in the 31-kb region among these populations suggests that an association study on Chinese populations should be designed cautiously, especially when selecting tag SNPs from HapMap CHB.

Table 4 Coverage of common SNPs by tag SNPs from the HapMap CHB and six Chinese populations
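The coverage rates summarized in Table 4 rest on a simple idea: a common SNP in the target population counts as covered if it is itself a tag or is in strong LD (r2 at or above the threshold) with at least one tag. The Python sketch below illustrates that calculation under assumptions of ours: phased 0/1 haplotypes from the target population, r2 computed directly from haplotype frequencies (Haploview estimates it from genotypes), and an MAF cut-off of 0.05 for "common", which is a placeholder rather than a value taken from the study. Function names are hypothetical.

```python
def r_squared(haplotypes, i, j):
    """Pairwise r^2 between sites i and j from phased 0/1 haplotypes."""
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n
    p_j = sum(h[j] for h in haplotypes) / n
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return 0.0  # at least one site is monomorphic in this sample
    return (p_ij - p_i * p_j) ** 2 / denom


def tag_coverage(haplotypes, tag_sites, maf_min=0.05, r2_min=0.8):
    """Fraction of common sites captured by the given tag sites at r^2 >= r2_min."""
    n = len(haplotypes)
    common = []
    for j in range(len(haplotypes[0])):
        p = sum(h[j] for h in haplotypes) / n
        if min(p, 1 - p) >= maf_min:
            common.append(j)
    if not common:
        return 0.0
    covered = sum(
        1
        for j in common
        if any(t == j or r_squared(haplotypes, t, j) >= r2_min for t in tag_sites)
    )
    return covered / len(common)


if __name__ == "__main__":
    # Toy target population: sites 0 and 1 in perfect LD, site 2 independent, site 3 monomorphic.
    target = [
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    print(f"coverage with tag at site 0: {tag_coverage(target, [0]):.2f}")
```

In a transferability comparison, the tag set would be chosen in one population (for example HapMap CHB) and the coverage evaluated on haplotypes from another, so tags that are rare or absent in the target contribute little.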

Discussion

Previous genetic profiles of the Chinese population revealed the following: (1) existence of genetic differences between the Northern and Southern populations, even to the extent that Han shared this geography-related distinction;17, 18, 23, 33 (2) formation of a divided continuum of the Chinese population through migration, geographic isolation and indigenous convention.19, 20, 22, 34 However, the detailed structure of Chinese populations remains elusive.17, 20, 22 The information available to date was collected mostly by genotyping markers on mtDNA, Y chromosome and autosomal chromosomes. Our approach was to resequence autosomal DNA, a method seldom used in past studies with Chinese samples. From the sequence polymorphism of the 31-kb region, we obtained findings similar to those of previous reports, namely, that our six Chinese populations are genetically different from each other, including the two Han populations, in terms of heterozygosity and genetic distance. Even in this small number of populations, three major groups can be defined by a GST-based phylogenetic tree. In Figure 2, the geographically adjacent populations, such as the Han Shandong and Mongolian populations, and the Yi and Tibetan populations, are grouped together. This supports the second notion above that populations residing in proximate areas are generally closer in genetic relationship.19, 22 Therefore, our results demonstrated that Chinese populations not only have genetic distinctions between the north and south but can also be further divided into more demographic groups, and local gene flow may contribute to the genetic similarity that we observed.

The TNF gene cluster is conserved during evolution,35 and encodes inflammatory cytokines crucial to host defense against infection.36 The roles of these genes in the immune system make them susceptible to balancing or negative selection.37 Depending on the geographic environment, traditional customs and infectious agents, populations may encounter different selection pressures, leading to changes in relevant genes. The six populations used here are representative of the diversity of Chinese populations, and the sequence changes they show in the TNF gene cluster may reflect, in part, the distinct environments in which they live.

Our sliding-window analysis of Tajima's D test revealed that most values were negative for all six populations, with statistical significance in the genic regions of TNF, LTB and NCR3. The only region that had positive values for all populations was LTA, but it was statistically significant only in the Li population. In contrast to the other five populations, the Li population stood out in our study by being the least polymorphic and by virtue of its positive Fs statistic. It is known that the Li population migrated to Hainan Island during the Neolithic period38 and was isolated until recent years (in agreement with our finding of a positive Fs statistic). Furthermore, Hainan Island has been one of the major areas of malaria epidemics in China.39 The associations of malaria infection with the human sickle-cell trait and certain genotypes of G6PD were demonstrated to be a function of balancing selection.40, 41, 42 As shown in Zhou's study, the percentage of α(+) thalassemia carriers in the Li population is very high (56.7%), apparently as an adaptation to the malaria epidemic.43 Furthermore, an SNP located in the TNF promoter region is associated with cerebral malaria, with high levels of circulating TNF.44, 45 It is reasonable to speculate that the unusual polymorphic pattern of LTA in the Li population may result from balancing selection. Because of their proximity to the immediate upstream region of TNF, polymorphic sites in LTA can affect the level of TNF expression and collaborate with TNF to have an important role in resistance to malaria in the Li population. This may explain the unusual pattern of selection in the Li population.

The analysis of fine-scale recombination of the 31-kb region revealed three recombination hot spots. They are mostly located in intergenic regions and share a similar physical distribution, with variation in the intensity and exact location among the six populations. A recent report from Coop et al.46 suggested that LD-based hot spots were used for 60% of actual crossovers, and also stressed that variation in hot spot usage is extensive and heritable among individuals. The CCTCCCT-derived recombination motifs were found to be scattered mainly in three regions (8–12, 17–24 and 27–30 kb; Figure 5), which coincide with the observed hot spots.47 Population-specific polymorphic sites found in certain motifs can reflect the intensity of a hot spot.48 This observation supports the fluidity of recombination hot spots. The recombination pattern is clearly different among the populations, but it is also confounded by selection and demographic history.7 It is of interest to use sperm typing or pedigrees for direct testing of population-specific hot spot usage, as well as for understanding the evolution of a recombination hot spot.46, 49
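Locating occurrences of the CCTCCCT motif along a sequence is a simple string scan on both strands. The Python sketch below shows one way to do this; it matches the exact 7-mer only, whereas the "CCTCCCT-derived" motifs referred to above may include degenerate variants that this sketch would miss. The function names and the toy sequence are ours.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_motif(seq, motif="CCTCCCT"):
    """Return sorted (0-based start, strand) hits of the exact motif on both strands."""
    seq = seq.upper()
    hits = []
    for pattern, strand in ((motif, "+"), (reverse_complement(motif), "-")):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(pattern, start + 1)  # step by one to allow overlapping hits
    return sorted(hits)


if __name__ == "__main__":
    toy_sequence = "AACCTCCCTGGAGGGAGGTT"  # illustrative sequence, not from the TNF region
    for position, strand in find_motif(toy_sequence):
        print(position, strand)
```

Binning the hit positions (for example per kb) would give a motif density profile that can be laid alongside the estimated recombination rates.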

Tag SNP transferability of the HapMap populations has been evaluated for association studies.13, 50, 51, 52, 53 Most studies agreed that the population-specific HapMap tag SNPs can be applied to other populations,54 but that they perform worse when evaluated against resequencing data.52, 53 Related studies indicated the existence of SNP-deposition bias in the dbSNP and HapMap databases, because the majority of SNPs are common in frequency and shared among populations.16 To avoid this SNP ascertainment bias, resequencing data are more desirable for selecting tag SNPs and uncovering the population substructure. Tag SNP evaluation with our resequencing data revealed that 16–54% of common SNPs for the six Chinese populations are not accounted for by the HapMap CHB-derived tag SNPs. In this 31-kb region, the genetic information of the Li and Mongolian populations is least covered by HapMap CHB tags, owing not only to active recombination in the region but also to the unique polymorphism distribution of the Li population and the highly polymorphic nature of the Mongolian population. Therefore, to address this shortcoming when studying Chinese populations without knowledge of the underlying population substructure, we propose to raise the r2 threshold for more tag SNP inclusion, to apply multipoint imputation methods for further information extraction,55 and/or to use ethnic-mixed data for tag selection.13, 53 Furthermore, if candidate genes (or regions) of an association study contain known recombination hot spots, direct sequencing of hot spots is a preferred way to ensure proper SNP coverage of the area.12

The 31-kb DNA region in the human major histocompatibility complex is highly polymorphic, gene rich, subject to selection and fragmented in its LD block structure. As it represents only a part of the genome, some of our conclusions, such as those on tag SNP transferability, may not apply to every genomic region. Our findings would be most useful for association studies that take a candidate-gene approach to genomic regions sharing characteristics with the TNF gene cluster. Our results also support the conclusion that there are distinct genetic compositions between Northern and Southern Chinese populations, with further divisions according to their geographic locations. By gaining knowledge of this underlying substructure of Chinese populations, we may improve the efficiency of disease mapping by Chinese-based association studies.