
Haplotype tagging efficiency in worldwide populations in CTLA4 gene


The cytotoxic T lymphocyte antigen 4 (CTLA4) acts as a potent negative regulator of T-cell response, and has been suggested as a pivotal candidate gene for autoimmune disorders such as Graves' disease, type 1 diabetes and autoimmune hypothyroidism, among others. Several single-nucleotide polymorphisms (SNPs) have been proposed as the susceptibility variants, or to be in strong linkage disequilibrium (LD) with the variant. Nevertheless, contradictory results have been found, which may be due to lack of knowledge of the genetic structure of CTLA4 and its geographic variation. We have typed 17 SNPs throughout the CTLA4 gene region in order to analyze the haplotype diversity and LD structure in a worldwide population set (1262 individuals from 44 populations) to understand the variation pattern of the region. Allele and haplotype frequency differentiation between populations is consistent with genomewide averages and points to a lack of strong population-specific selection pressures. LD is high and its pattern is not significantly different within or between continents. However, haplotype composition is significantly different between geographical groups. A continent-specific set of haplotype tagging SNPs has been designed to be used for future association studies. These are portable among populations, although their efficiency might vary depending on the population haplotype spectrum.


CTLA4 (cytotoxic T lymphocyte antigen 4) is a member of the immunoglobulin supergene family that functions as a potent negative regulator of T-cell response,1, 2 its lack causing massive lymphoproliferative disorders, fatal multiorgan destruction and early death in knockout and CTLA4-deficient mice.3 The CTLA4 gene is located at position 2q33 in a region also containing the ICOS and CD28 genes, and shares with the latter a high nucleotide identity, strongly suggesting that they are the result of a gene duplication.4 CTLA4 is composed of four exons encoding different functional domains: a leader sequence, and extracellular, transmembrane and cytoplasmic domains. Two amino-acid replacement SNPs have been described for this gene: 49 A/G (T17A) in exon 1 and 2814 G/A in exon 2 (M90I), the latter described in African Americans.5 Several other single-nucleotide polymorphisms (SNPs) have also been described in the promoter region and at the 3′-end of the gene,6, 7 as well as an (AT)n repeat in the 3′-untranslated region.

Owing to its crucial role in T-cell regulation, CTLA4 has been proposed as a primary determinant of susceptibility to autoimmune disorders; indeed, CTLA4 has been shown to be robustly associated with Graves' disease,8 while some of the SNPs mentioned above have been proposed as the susceptibility variant of disease or to be in strong linkage disequilibrium (LD) with it in different human populations. +49 G has been related to multiple diseases, including type 1 diabetes (T1D),9 rheumatoid arthritis,10 multiple sclerosis,11 Graves' disease,12 asthma13 and systemic sclerosis.14 SNP −1722 has been related to lupus erythematosus in two contradictory studies,15, 16 while the −1661 G variant has been suggested to be associated with T1D.17 Also, SNPs 6230 G, 10 717 G, 10 242 G and 12 310 T have been related to GD, autoimmune hypothyroidism and T1D.7 Nonetheless, the associations have not always been replicated in different populations.

Haplotype and LD analyses have proved to be powerful tools to map disease genes18 as well as to disclose the history of human populations (Bertranpetit et al19 and references therein). In that sense, studies in different loci have shown higher haplotype diversity and lower LD values in African populations than in non-African ones (Tishkoff et al20 and Sawyer et al21 and references therein), which is explained by the ‘out-of-Africa’ hypothesis of human dispersal, the subsequent founder effect in non-African populations and the larger effective population size of African populations. However, LD patterns are more complex than this simple difference between Africans and non-Africans.21, 22

With the availability of large SNP data sets, a structure of haplotype blocks with a very low recombination rate separated by hot-spots of recombination has been suggested,23, 24 the genome being organized in blocks shorter in Africans than in Europeans or Asians.25 The presence of these high LD regions allows definition of haplotype tagging SNPs (tag-SNPs), that is, the minimum set of SNPs that would capture most of the haplotype diversity within a block.6 CTLA4 is one of the nine genes where the term tag-SNP was first defined. However, both the presence of blocks and the use of tag-SNPs as markers for disease genes are controversial.26, 27 Moreover, tag-SNPs identified in one population may not be applicable to other populations28 even if they seem to be quite general for Europeans.29 Furthermore, Crawford et al,30 after analyzing the sequence diversity in 100 human genes, concluded that the amount of LD and the haplotype structure should be empirically analyzed in order to assess which and how many SNPs must be typed in a specific gene to detect an association with a disease-causing SNP in a case–control study.

The aim of the present study is to analyze the haplotype diversity within the CTLA4 gene region in a worldwide population set in order to establish its structure, define LD patterns and haplotypes, provide a standard set of markers with known variation for further studies in populations of different ethnic or geographic origin and shed light on the discrepancies found in association studies.


Polymorphism and haplotype description

Figure 1a shows, for the 17 SNPs analyzed, their position, the allelic variants and the ancestral states deduced from the primate samples. All SNPs and populations were in Hardy–Weinberg equilibrium after Bonferroni correction for multiple testing. Table 1 summarizes several parameters for each population and geographical region, including the number of fixed (monomorphic) alleles. As shown in Figure 2, all the SNPs analyzed are highly polymorphic worldwide except for SNP –658, which shows little polymorphism in Europeans, North Africans and Middle Easterners, and reaches fixation in most other populations; and SNP –319, which is also fixed in a large number of samples.

Figure 1

Map of the CTLA4 gene region and the SNPs typed. Genes (CD28, CTLA4 and ICOS) are represented by gray boxes, CTLA4 exons by black boxes and SNPs are represented by vertical tick marks. (a) Allelic variants found in the worldwide human population set and ancestral states deduced from primate genotyping. (b) Suggested tag-SNPs for different geographical populations.

Table 1 Population descriptive parameters grouped by continental region
Figure 2

SNP frequencies for each population. SSAFR stands for sub-Saharan Africa (Bantu, Biaka Pygmi, Mandenka, Mbuti Pygmi, San, Tanzan, Yoruba), NA for North-Africa (Mozabite, Moroccan, Saharawi), ME for Middle-East (Bedouin, Druze, Palestinian, Adygei), EUROPE for Europeans (Basque, Catalan, French Basque, French, Continental Italian, Orcadian, Russian, Sardinian, Spanish), CSASIA for Central/South Asia (Balochi, Brahui, Burusho, Hazara, Kalash, Makrani, Pathan, Sindhi), EASIA for East Asia (Cambodian, South China, North China, Han, Japanese, Yakut), O for Oceania (Nan, Papuan) and AME for America (Colombian, Karitiana, Maya, Pima and Surui). SNPs are displayed according to their position in the gene.

Ancestral alleles (Table 2) were generally more frequent than the derived alleles, which argues against a recent selective sweep at CTLA4 in the human lineage. Divergence between the human and the chimpanzee CTLA4 sequences is 1.085%, close to the average human–chimpanzee divergence31 and consistent with a lack of accelerated nucleotide divergence for this gene.

Table 2 Fst values for each SNP in the eight geographical regions

In order to test for SNP frequency differences, both between SNPs and between populations, FST values were calculated by SNP and by geographic regions (Table 2). The average FST for all 44 populations is 0.10, in agreement with previous studies of neutral markers.32, 33, 34 When instead of 44 independent populations the eight continental groups are considered, there is only a slight decrease in the FST values. As expected, the highest continental FST value is found in sub-Saharan Africa, with very low values and extreme homogeneity within Europe, North-Africa and Central/South Asia. For these continents, AMOVA (analysis of molecular variance) shows no statistical significance, in contrast to the highly significant values for sub-Saharan Africa, Middle East and East Asia. When a two-tier AMOVA (continental groups and populations) is applied, most differences are within populations (89.2%), a high proportion among regional groups (8.5%) and a mere 2.3% among populations within regional groups.

When comparing FST among SNPs, a maximum differentiation is found for SNP 6230 (0.137 between continental groups), which has been associated with disease.7 However, FST values are not significantly different between disease-related and nondisease-related SNPs (Mann–Whitney test, P=0.494). In order to evaluate the significance of the FST values, the FST for each SNP was compared to the FST distributions provided by genomewide studies: Akey et al35 for a large set of SNPs in three populations and Kidd et al36 for gene-centred SNPs in a set of populations similar to the one used here. No extreme FST values were found within the CTLA4 region compared to the genome distribution. This is a conservative approach, since Akey et al35 considered only three main population groups which, by design, would yield lower FST values than our worldwide sample. In relation to Kidd et al,36 the highest FST value of our results (0.137) and the average FST (0.096, Table 2) fall close to the mean for 369 markers.36 This suggests that the geographic stratification shown by CTLA4 is consistent with the genomewide average and points to the absence of strong geographically specific selective pressures.

A total of 181 haplotypes were estimated, of which 68 account for 95% of the global variation. The cumulative frequency of the nine most frequent haplotypes is 79%, the 10th adding only 1.27%. As shown in Figure 3, which represents haplotype composition by world region, neither the number of haplotypes nor their distribution is homogeneous, even for the most frequent ones. Of these haplotypes, 72 (40%) were found in more than one population and 63 (35%) were found in more than one continental region. The ancestral haplotype estimated from primate data is not present in our sample; the haplotypes most similar to the ancestral are h152 and h165, with only one change each. Both haplotypes are rare (maximum two chromosomes), h152 being present in two chromosomes from two different African populations and a single h165 chromosome being private to Bantus.

Figure 3

Frequencies of the nine most frequent haplotypes represented by geographical groups. Pies are proportional to the number of individuals typed and contain the data of all populations within the geographical region. Haplotype diversity for each geographical group is shown beside the continental pies.

Table 1 shows haplotype descriptive parameters for each population and averaged by geographical groups. Haplotype diversity (Dh, that is, expected haplotype heterozygosity) values, as well as the number of haplotypes (Kh) and the number of private haplotypes (Kh private), are significantly different between geographical groups (Kruskal–Wallis test, P=0.0001, 0.0021 and 0.0046, respectively). In a recent extensive gene-centered analysis,30 the mean number of common haplotypes (frequency >5%) varied greatly from gene to gene, with a mean of 5.0 and 4.5 in populations of African and European descent, respectively. In the CTLA4 region similar values are found, African samples having the maximum mean number of common haplotypes (5.8), quite high in the Middle East (5.3) and similar in the other groups (around 4), except for the low value in the Americas (2.6).
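The two haplotype summaries used above, expected haplotype heterozygosity (Dh) and the number of common haplotypes (frequency >5%), can be sketched as follows. This is a minimal illustration, not the software used in the study: the function names and the toy haplotypes are hypothetical, and the n/(n−1) small-sample correction is an assumption, since the text does not state which estimator was applied.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Expected haplotype heterozygosity, Dh.

    Assumption: uses the n/(n-1) small-sample correction; the article
    does not say which estimator was used.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two chromosomes")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


def common_haplotypes(haplotypes, threshold=0.05):
    """Haplotypes whose sample frequency exceeds the threshold (5% by default)."""
    n = len(haplotypes)
    return sorted(h for h, c in Counter(haplotypes).items() if c / n > threshold)


# Hypothetical phased haplotypes over three SNPs (10 chromosomes).
toy = ["AGC"] * 6 + ["ATC"] * 3 + ["GGT"] * 1
print(haplotype_diversity(toy))
print(common_haplotypes(toy))
```

With real data, the input would be the phased haplotypes estimated for each population, one string per chromosome.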

LD analysis and haplotype tag-SNPs

An overall measure of haplotype diversity in the CTLA4 region can be obtained with the FNF statistic (see Materials and methods), shown in Table 1. In general, FNF values are high, corresponding to a dearth of haplotypes and, thus, of high LD in the region. Furthermore, these values are significantly different between geographical groups (Kruskal–Wallis test, P=0.0020), being lowest in sub-Saharan Africans, according to their higher haplotype diversity, and FNF values are very high in the Americas, with a low diversity and number of haplotypes.

The LD structure of CTLA4 (measured with D′ or r2, with similar results) shows that the whole region of around 14 kb presents substantial LD, as described previously,7 and now confirmed for all the worldwide populations studied, and depending on block definition, it may be contained within a single LD block. A first approach to test whether the amount of LD was similar among populations consisted of taking r2 values for adjacent SNPs and comparing them among populations, by means of the Friedman test, and comparing the average r2 values of a continental group with the average of all other groups, using Wilcoxon's test. No statistical differences in r2 values were found among populations. Therefore, populations within continental groups are homogeneous regarding their amounts of LD, as are all worldwide regional groups. Next, we tested whether the structure of LD was similar by computing the correlation between the same r2 values between pairs of populations. Comparisons within continental regions showed high correlations between population pairs, all of them being statistically significant (P<0.05). When pairs of populations from different continental regions were compared, the correlations were also significant, with the exception of the two Pygmy populations, which presented low correlation and nonsignificant values with most of the analyzed populations (data not shown). If, instead of just taking the diagonal, the whole LD matrix is considered, correlations with Mantel's tests give significant correlation values both between population pairs within regions (with the exception of Mbuti Pygmies) and for pairs of continental regions (data not shown). Thus, both the amount and pattern of LD in the CTLA4 gene region were similar across human populations.
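For readers unfamiliar with the two LD measures, the sketch below computes D′ and r2 for one pair of biallelic SNPs from phased haplotypes, using the standard textbook definitions. It is illustrative only: the function name and toy data are hypothetical, and the study itself obtained these values with Haploview.

```python
def pairwise_ld(haplotypes, i, j):
    """Return (D', r^2) between biallelic sites i and j of phased haplotypes."""
    n = len(haplotypes)
    a = haplotypes[0][i]  # arbitrary reference allele at site i
    b = haplotypes[0][j]  # arbitrary reference allele at site j
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("both sites must be polymorphic")
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2


# Hypothetical sample of 10 chromosomes typed at two SNPs.
toy = ["AG"] * 4 + ["AT"] * 1 + ["CT"] * 5
d_prime, r2 = pairwise_ld(toy, 0, 1)
print(d_prime, r2)
```

In the toy sample only three of the four possible haplotypes occur, so D′ reaches its maximum while r2 stays below 1, which is the usual reason the two measures are reported side by side.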

Considering the high LD in the region, as well as its similarity among populations, we would expect tag-SNPs to be extremely useful for the analysis of the CTLA4 gene and portable across populations. For this purpose, we have defined sets of tag-SNPs for the whole CTLA4 region in our worldwide sample for geographical regions, grouping the population haplotypes in each continent. As shown in Figure 1b, continental regions can be described using only two-to-four tag-SNPs, with geographical regions with higher LD needing fewer tag-SNPs. It is interesting to note that none of the seven SNPs at the 3′-end has been selected as tag-SNP in any continental region, and, in most cases, only those at the 5′-end are defined as tag-SNPs. The number of common haplotypes (those with frequencies over 5%) and the fraction of the total haplotypes detectable with the continental tag-SNP sets in each population are shown in Table 3. The tag efficiency of the tag-SNP set for each continental region is defined as the fraction of the total haplotypes in the population that the set detects, divided by the number of tag-SNPs in the set. For instance, the European tag-SNP set, consisting of three SNPs (−1765, −1661 and –658; Figure 1b), scores 88% of the total common haplotypes (which represent 80% of the total European haplotypes). Thus, the detected haplotypes represent 71% of the total European haplotypes and, therefore, the tag efficiency of each of the three SNPs in the European tag-SNP set is 24%. As expected, tag efficiency is higher in those regions with stronger LD (ie Oceania and America) and lower in sub-Saharan Africans.
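The European worked example can be reproduced with a few lines of arithmetic. The helper below is a hypothetical illustration of the definition given in the text, fed with the rounded percentages quoted there; it gives about 70% detected and 23 to 24% per tag-SNP, and the small gap from the published 71% and 24% is presumably due to rounding of the inputs.

```python
def tag_efficiency(frac_common_scored, frac_common_of_total, n_tag_snps):
    """Return (fraction of all haplotypes detected, efficiency per tag-SNP).

    frac_common_scored   -- share of common haplotypes scored by the tag set
    frac_common_of_total -- share of all haplotypes that are common (>5%)
    n_tag_snps           -- number of tag-SNPs in the set
    """
    if n_tag_snps < 1:
        raise ValueError("a tag set needs at least one SNP")
    detected = frac_common_scored * frac_common_of_total
    return detected, detected / n_tag_snps


# Rounded European figures quoted in the text: 88%, 80%, three tag-SNPs.
detected, efficiency = tag_efficiency(0.88, 0.80, 3)
print(f"detected: {detected:.1%}, efficiency per tag-SNP: {efficiency:.1%}")
```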

Table 3 Tag-SNP haplotype scoring and tag efficiency in the CTLA4 gene in worldwide populations

In a previous study of the CTLA4 region,6 five tag-SNPs were described for the region contained between SNPs −1765 and +6249 in UK control individuals. We have tested the original five tag-SNPs defined by Johnson et al6 (−1722, −1661, −658, −319 and +49) to assess their tag efficiency between populations. With this objective, new haplotypes have been estimated considering only the gene region reported in that previous study (ie from positions –1765 to 6249, encompassing nine SNPs). Table 3 shows the fraction of detected haplotypes using the five previously defined tag-SNPs. These tag-SNPs, defined in Europeans, detect 87% of haplotypes among Europeans (population values between 70 and 96%). These values, as expected, are low for Africans and high for Amerindians but, surprisingly, are very high (94%) for East Asian populations, with population values reaching 99%. These tag-SNPs can be considered portable among populations with the described limitations, that is, from populations needing more tag-SNPs to populations needing fewer tag-SNPs, but not vice versa.

When the present continental tag-SNPs are applied to the estimated haplotypes from positions –1765 to 6249, the amount of detected haplotypes is similar to the amount provided by the previously defined tag-SNPs, but the efficiency of each tag-SNP (defined as the proportion of the whole variation each one explains) increases dramatically, and these values are similar to those found when the whole region is considered. Since, in Oceania and the Middle East, one of the present tag-SNPs lies outside the haplotypes estimated from −1765 to 6249, new tag-SNPs have been recalculated only for this region. In North-Africa, East Asia, Oceania and America, the tag-SNPs have not changed, while in sub-Saharan Africa, Europe and Central/South Asia, one extra SNP is needed and in the Middle East three SNPs have been added. This increase in the number of SNPs seen in some continental groups can be explained by the effect of adding some low-frequency haplotypes that shared the first nine SNPs.


Polymorphism patterns

The genetic diversity pattern found in the CTLA4 region in SNP and haplotype frequencies, FST and genetic structure, and LD patterns, agrees with the population history of the samples analyzed and the general pattern of diversity that has been shown in other global studies in different gene regions,19, 21 and therefore can be explained by the ‘out-of-Africa’ origin of modern humans, with more genetic diversity, less LD and higher heterogeneity among populations in Africa than elsewhere. However, these patterns could have been affected by other processes such as ascertainment bias of the markers or selective pressures.

On the one hand, as the analyzed SNPs were described in European populations,6, 7 a certain bias in their description could exist. Nevertheless, the present worldwide results do not show this bias as, with minor exceptions especially in Native Americans, all SNPs are universally polymorphic. In fact, SNP ascertainment bias would not be relevant for ascertaining common haplotypes, especially in genetic regions, like the CTLA4, with high LD.

On the other hand, in contrast to population processes that affect the whole genome, gene factors such as differential selection can shape the haplotype structure and LD in specific gene regions and might result in population differences. Since CTLA4 has a key role in the immune system and has been related to autoimmune diseases, the exposure to geographic differential selective pressures, such as the presence of pathogens that could have affected the CTLA4 gene structure, could be envisaged. This has been the case of the selection for resistance to malaria detected in several genes such as G6PD,37 Duffy38 and TNFSF5.39 Although differences among populations in allele frequencies have been found in the CTLA4 gene, none of the SNPs analyzed presented significantly higher FST values compared to an SNP genomewide distribution of FST.35, 36 This fact stresses that the differences among populations are not unexpectedly large and points to a lack of local selective pressures across human populations on the CTLA4 gene, at least in recent times.

The analysis of the CTLA4 region has shown that there is a clear population structure in haplotype frequencies, but there are minor differences in the LD among populations. Thus, genomic processes might have affected all global populations in the same manner, giving similar LD patterns in the CTLA4 region, whereas the differences found, basically in haplotype and SNP frequencies, might be explained by demographic processes, such as expansions, founder effects and migrations.

LD patterns

The present study has been performed on a region spanning 14 kb, which has been described as having a low recombination rate (0.3 cM/Mb).40 This points to high LD over the region, which has already been reported by Ueda et al.7 Our results not only show this high LD but also that it persists across populations worldwide.

Previous studies in other genes, such as PAH41 or the PKLR-GBA gene region,42 have shown a general pattern of moderate geographical structure of LD with higher values out of Africa and low values in sub-Saharan Africa. However, a number of deviations from this overall pattern have been described, ranging from extreme differences between African and non-African populations, such as those found in the CD4 locus,20 DRD243 or DM,44 among others, to similar LD values in African and non-African populations such as the CFTR gene.45 The different geographical patterns observed in different loci can be interpreted as the distribution of a stochastic variable, in which CD4 and CFTR would be the opposite extremes. When comparing our results with these previous studies, LD patterns in CTLA4 show a similar pattern to that found in the CFTR region, that is, lack of strong differences among human groups.

Until the present CTLA4 analysis, the only gene related to immunity studied in worldwide samples was CD4. However, their LD patterns do not coincide, as the CD4 locus showed a much higher LD geographical structure than CTLA4. This difference could be related to their function, with possible diversifying selection in CD4 but not in CTLA4, and also it could be due to stochastic variation. More immunity genes should be globally studied to assess whether an LD trend in immunity genes exists.


It would be expected that tag-SNPs would be particularly efficient in the CTLA4 region, given its high LD. A previous study (Gonzalez-Neira et al46) has shown that tag-SNPs are portable among populations even across regional groups, which means that most of the haplotype diversity in different populations can be scored by a unique set of tag-SNPs. However, those results were obtained in a gene-free, low-LD region and may not be extended to other regions with different properties. The present analysis provides some tag-SNP sets for different continental regions that score a large and similar amount of the total haplotype variation. Moreover, these tag-SNP sets appear to be portable between populations within continental regions since no significant differences are found in the amount of haplotype variation detected. In contrast, the portability of these tag-SNP sets between continental regions depends on the number of tag-SNPs that form a continental set. For instance, the sub-Saharan tag-SNP set, formed by four SNPs, would score most of the variation in the rest of the continents, but the scores of the American set applied to other continents would yield poorer results. Therefore, the larger the number of tag-SNPs in a set, the more portable to other geographical regions. Nonetheless, the tag efficiency (defined as the amount of variation detected by each tag-SNP) would be dramatically reduced in some continental areas.

Owing to the high-throughput technologies available, a reduction from 17 to two-to-four SNPs to be typed may seem unnecessary. However, at a whole-genome scale, or with a candidate-gene approach, if such levels of reduction were achieved in each of the hundreds or thousands of genes to be typed, the saving in costs and time would be more than justified, and would certainly allow an increase in the number of genes to be typed.

Impact on association studies

Up to now, association studies in CTLA4 have led to contradictory results in different populations. Our results show a global FST of about 10%, that is, populations worldwide are not more different from each other for the CTLA4 gene than the average calculated in previous studies of neutral markers.32, 33, 34 However, significant differences in SNP frequencies and haplotype composition exist not only between geographical groups but also, in some cases, within groups. This could explain the contradictions found in the literature, especially considering that association studies work with very large sample sizes, in which small differences not apparent here could become statistically significant. These results, then, are a clear indication that the design of new case–control studies should take into account the heterogeneity at both the inter- and intragroup level, and point to the need for a very well matched control population to compare with patients: different ethnic or geographic extraction could easily compromise the differences found between both groups.

The knowledge of the haplotype structure and LD patterns in specific regions, such as the CTLA4 gene, in worldwide populations will shed light not only on the population history of the populations analyzed but also on the genomic processes that could be pivotal for biomedical interests.18

Materials and methods


A total of 1262 individuals from 44 human populations were analyzed for the CTLA4 gene region. In all, 38 populations are those from the HGDP-CEPH Human Genome Diversity Cell Line Panel,47 which contains lymphoblastoid cell lines from 1051 individuals in 51 populations located in all major geographic regions of the world. The remaining populations were chosen to improve the coverage in some geographic areas (Saharawi, Tanzanian, Moroccan, Catalan, Basque and Spanish). All samples were obtained with appropriate informed consent. Sample sizes vary from seven (San) to 71 (Spanish) individuals, most populations having around 25 individuals (Table 1). Within the HGDP-CEPH panel, some populations represented by a reduced number of individuals were grouped and analyzed together: Continental Italy (North Italian, Tuscan), North China (Daur, Hezhen, Mongola, Oroqen, Tu, Uygur, Xibo) and South China (Dai, Lahu, Miaozu, Naxi, She, Tujia, Yizu). To elucidate the ancestral state of the SNPs analyzed, DNAs from one chimpanzee (Pan troglodytes), one gorilla (Gorilla gorilla) and one orangutan (Pongo pygmaeus) were used.

SNP typing

A total of 17 SNPs were typed, seven of them at the 5′-end and first exon of the gene6 and 10 at the 3′-end.7 Setting the first nucleotide at the Met initiator codon as 1, the nucleotide positions of the SNPs typed were: −1765, −1722, −1661, −1577, −658, −319, 49, 6230, 6249, 7092, 7482, 7982, 8173, 10 242, 10 717, 12 131 and 12 310. This numbering refers to the CDS as in GenBank entry AF411058. All of these SNPs had been submitted to dbSNP and their rs IDs are given in Table 2. SNPs mostly encompass the 5′ and 3′ regions because these variants are those relevant for disease-association studies and the whole coding region lies within a clear and very strong LD block,6, 7 a fact confirmed in the present study.

PCR amplification

The 5′-end region containing SNPs from −1765 to +49 was amplified using conditions described previously.17 The 3′ region containing the rest of the SNPs was amplified in two fragments in a multiplex reaction with the following cycling conditions: 94°C for 5 min; 35 cycles of 94°C for 30 s, 63°C for 30 s and 72°C for 45 s; and a final elongation step of 72°C for 5 s. PCR products were purified using EXO-SAP (5 units of EXO and 25 units of SAP per reaction), with an incubation of 60 min at 37°C followed by 15 min at 72°C to inactivate the enzyme.

SNaPshot reaction

All SNPs were typed using the SNaPshot™ Multiplex technique (Applied Biosystems), a single-base primer extension method that uses labelled ddNTPs to interrogate SNPs. The single-base primer extension was performed following supplier's recommendations using different primer lengths. Two different SNaPshot reactions were performed, one for the 5′-end and first exon SNPs and the other for the 3′-end SNPs. Unincorporated-labelled ddNTPs were removed by adding one unit of CIP to the primer extension products for 60 min at 37°C followed by 15 min at 72°C to inactivate the enzyme. Products were analyzed in an ABI PRISM3100 and ABI PRISM GeneScan Analysis Software v3.7 (Applied Biosystems). LIZ-120 (Applied Biosystems) was used as size marker.

In order to check the accuracy of the SNaPshot technique, SNP 6230 was also genotyped for the worldwide diversity panel with the TaqMan Assays-by-Design℠ service (Applied Biosystems), which consisted of a mix of unlabelled PCR primers and TaqMan® MGB probes (FAM™ and VIC® dye-labelled). We performed the assay using TaqMan Universal PCR Master Mix following supplier's recommendations. Results were analyzed using SDS software package version 2.1 (Applied Biosystems). There were minor discrepancies between TaqMan and SNaPshot genotypes (three samples of the panel were scored as heterozygotes by the SNaPshot technique and as homozygotes by TaqMan), which implies that the SNaPshot genotype assignment is 99.7% concordant with the TaqMan assays.

Ancestral-type inference

All SNPs were typed in three primate individuals, one chimpanzee (P. troglodytes), one gorilla (G. gorilla) and one orangutan (P. pygmaeus) as described above. Since some of the SNPs were difficult to type in the primates using the SNaPshot technology, we sequenced the fragment containing the seven SNPs located at the 5′-end of the gene and the SNP 10 717 in the three primates. Sequencing was carried out using the following cycling conditions: 3 min at 94°C and 25 cycles of 96°C 10 s, 50°C 5 s and 60°C 4 min; products were precipitated using the BigDyes 3.0 protocol (Applied Biosystems). Chimpanzee results have also been checked using the chimpanzee genome in Ensembl, and the ancestral positions have been determined using the approach in Iyengar et al.48

Data analysis

Haplotype frequencies were estimated using a Bayesian algorithm as implemented in the PHASE package version 2.0.49 Populations with sample sizes of fewer than 20 chromosomes (namely, the San) were dropped from the haplotype analyses given the high uncertainty associated with the haplotype estimation.

FST gives a measure of the proportion of the genetic variance explained by differences among populations. Incorporating the molecular distance between alleles to FST, ΦST is obtained. Hardy–Weinberg equilibrium, FST and ΦST values, and AMOVA were calculated using Arlequin software.50 LD measures D′ and r2 were calculated using Haploview.
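As a rough illustration of what FST measures, the sketch below computes a simple heterozygosity-based FST, (HT − HS)/HT, for a single biallelic SNP from per-population allele frequencies. This is a simplification under stated assumptions: it is not the AMOVA-based estimator implemented in Arlequin, and the function name, weighting by sample size and toy frequencies are all hypothetical.

```python
def fst_biallelic(allele_freqs, sample_sizes):
    """Heterozygosity-based FST = (HT - HS) / HT for one biallelic SNP.

    allele_freqs -- frequency of one allele in each population
    sample_sizes -- number of chromosomes sampled in each population
    Assumption: populations are weighted by sample size.
    """
    total = sum(sample_sizes)
    weights = [n / total for n in sample_sizes]
    p_bar = sum(w * p for w, p in zip(weights, allele_freqs))
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled sample
    if h_t == 0:
        raise ValueError("SNP is monomorphic in the pooled sample")
    # Mean expected heterozygosity within populations.
    h_s = sum(w * 2 * p * (1 - p) for w, p in zip(weights, allele_freqs))
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in three equally sized populations.
fst = fst_biallelic([0.2, 0.5, 0.8], [50, 50, 50])
print(fst)
```

Identical allele frequencies in every population give FST = 0, and the value grows as frequencies diverge, which is the sense in which an average of about 0.10 indicates modest differentiation.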

Friedman and Wilcoxon nonparametric tests were performed using Statistica. Mantel tests to compare r2 matrices were computed using Passage version 1.0. Comparisons were made between populations within each continental group. Subsequently, r2 values were recalculated for populations pooled into geographical groups and Mantel tests were applied.

We computed the FNF statistic,45 which can be interpreted as the fraction of haplotypes not found in a population, where the number of haplotypes expected under linkage equilibrium, given the sample size and the allele frequencies, is compared with the number of observed haplotypes in each population. It can be calculated as FNF=1−Kh/Kmax, where Kh is the number of haplotypes found in the sample and Kmax is the maximum possible number of different haplotypes expected under total linkage equilibrium given the size and allele frequencies of the population. FNF values are independent of the number of loci and are expected to increase with LD.
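The formula above can be sketched in a few lines. The text does not spell out how Kmax is obtained, so this example makes an assumption: Kmax is taken as the expected number of distinct haplotypes in a sample of the same size when alleles at each locus are drawn independently at their observed frequencies. Haplotypes are encoded as strings of '0' and '1', and the function names are hypothetical.

```python
from itertools import product
from math import prod


def expected_haplotypes_le(haplotypes):
    """Expected number of distinct haplotypes under linkage equilibrium.

    Assumption of this sketch: Kmax = sum over all possible haplotypes h
    of 1 - (1 - p_h)^n, where p_h is the product of allele frequencies.
    Enumerates 2^L haplotypes, so it is only practical for small L.
    """
    n = len(haplotypes)
    n_loci = len(haplotypes[0])
    # Frequency of allele '1' at each locus
    freqs = [sum(h[i] == "1" for h in haplotypes) / n for i in range(n_loci)]

    k_max = 0.0
    for combo in product((0, 1), repeat=n_loci):
        p_h = prod(f if allele else 1 - f for allele, f in zip(combo, freqs))
        # Probability that haplotype h appears at least once in n draws
        k_max += 1 - (1 - p_h) ** n
    return k_max


def fnf(haplotypes):
    """FNF = 1 - Kh / Kmax for a list of phased haplotypes."""
    k_h = len(set(haplotypes))                 # observed distinct haplotypes
    k_max = expected_haplotypes_le(haplotypes)
    return 1 - k_h / k_max


if __name__ == "__main__":
    strong_ld = ["00"] * 5 + ["11"] * 5        # two loci in complete LD
    no_ld = ["00", "01", "10", "11"] * 5       # all combinations present
    print(fnf(strong_ld), fnf(no_ld))
```

Under this assumption, the sample in complete LD gives a clearly positive FNF, while the sample containing every allelic combination gives a value close to zero, in line with the statement that FNF is expected to increase with LD.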

New tag-SNPs were defined using both htSNP2 (ref. 6) and BEST.51 These two programs identify the minimum set of SNPs accounting for a given minimum fraction of the genomic variation. The first method, implemented in htSNP2, seeks the minimum set of SNPs such that the proportion of the total variance explained (as measured by the multiple regression R2 statistic) is above a certain threshold (0.8 in our case). The second algorithm, implemented in BEST, takes as input a set S of haplotypes and returns a minimal set of SNPs from which all of the other SNPs in the haplotype set can be derived. The tag-SNPs obtained using htSNP2 are concordant with those obtained with BEST (or, in the cases where BEST provides more than one set, coincident with one of them). Haplotype tag-SNP portability was tested using htSNP2.
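The idea of a minimal set of SNPs from which all other SNPs can be derived can be illustrated with a brute-force search. The sketch below is not the BEST algorithm and does not reproduce the R2 criterion of htSNP2; it simply looks for the smallest subsets of SNP positions whose alleles still distinguish every distinct haplotype, which is equivalent to saying that the remaining SNPs are determined by the tags. The function name and the string encoding of haplotypes are assumptions for this example.

```python
from itertools import combinations


def minimal_tag_sets(haplotypes):
    """Return every smallest set of SNP indices that tags the haplotypes.

    A set of indices is a valid tag set when no two distinct haplotypes
    look identical at those positions. Exhaustive search: exponential in
    the number of SNPs, so suitable only for small illustrative inputs.
    """
    distinct = sorted(set(haplotypes))
    n_snps = len(distinct[0])

    for size in range(n_snps + 1):
        found = []
        for idx in combinations(range(n_snps), size):
            # Project each haplotype onto the candidate tag positions
            projected = {tuple(h[i] for i in idx) for h in distinct}
            if len(projected) == len(distinct):
                found.append(idx)
        if found:
            return found      # all tag sets of the smallest workable size
    return []


if __name__ == "__main__":
    # SNP 0 and SNP 1 are perfectly correlated; SNP 2 varies independently
    haps = ["000", "001", "110", "111"]
    print(minimal_tag_sets(haps))
```

In the example, either of the two correlated SNPs can serve as a tag alongside the third SNP, so two equally small sets are returned. This mirrors the situation described above in which more than one minimal set exists.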

References

  1. Walunas TL, Lenschow DJ, Bakker CY et al. CTLA-4 can function as a negative regulator of T cell activation. Immunity 1994; 1: 405–413.
  2. Walunas TL, Bakker CY, Bluestone JA. CTLA-4 ligation blocks CD28-dependent T cell activation. J Exp Med 1996; 183: 2541–2550.
  3. Khattri R, Auger JA, Griffin MD, Sharpe AH, Bluestone JA. Lymphoproliferative disorder in CTLA-4 knockout mice is characterized by CD28-regulated activation of Th2 responses. J Immunol 1999; 162: 5784–5791.
  4. Harper K, Balzano C, Rouvier E, Mattei MG, Luciani MF, Golstein P. CTLA-4 and CD28 activated lymphocyte molecules are closely related in both mouse and human as to sequence, message expression, gene structure, and chromosomal location. J Immunol 1991; 147: 1037–1044.
  5. Martin AM, Athanasiadis G, Greshock JD et al. Population frequencies of single nucleotide polymorphisms (SNPs) in immuno-modulatory genes. Hum Hered 2003; 55: 171–178.
  6. Johnson GC, Esposito L, Barratt BJ et al. Haplotype tagging for the identification of common disease genes. Nat Genet 2001; 29: 233–237.
  7. Ueda H, Howson JM, Esposito L et al. Association of the T-cell regulatory gene CTLA4 with susceptibility to autoimmune disease. Nature 2003; 423: 506–511.
  8. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review of genetic association studies. Genet Med 2002; 4: 45–61.
  9. Ide A, Kawasaki E, Abiru N et al. Association between IL-18 gene promoter polymorphisms and CTLA-4 gene 49A/G polymorphism in Japanese patients with type 1 diabetes. J Autoimmun 2004; 22 (1): 73–78.
  10. Lee CS, Lee YJ, Liu HF et al. Association of CTLA4 gene A–G polymorphism with rheumatoid arthritis in Chinese. Clin Rheumatol 2003; 22: 221–224.
  11. Teutsch SM, Booth DR, Bennetts BH, Heard RN, Stewart GJ. Association of common T cell activation gene polymorphisms with multiple sclerosis in Australian patients. J Neuroimmunol 2004; 148: 218–230.
  12. Vaidya B, Oakes EJ, Imrie H et al. CTLA4 gene and Graves' disease: association of Graves' disease with the CTLA4 exon 1 and intron 1 polymorphisms, but not with the promoter polymorphism. Clin Endocrinol (Oxford) 2003; 58: 732–735.
  13. van Oosterhout AJ, Deurloo DT, Groot PC. Cytotoxic T lymphocyte antigen 4 polymorphisms and allergic asthma. Clin Exp Allergy 2004; 34: 4–8.
  14. Hudson LL, Silver RM, Pandey JP. Ethnic differences in cytotoxic T lymphocyte associated antigen 4 genotype associations with systemic sclerosis. J Rheumatol 2004; 31: 85–87.
  15. Hudson LL, Rocca K, Song YW, Pandey JP. CTLA-4 gene polymorphisms in systemic lupus erythematosus: a highly significant association with a determinant in the promoter region. Hum Genet 2002; 111: 452–455.
  16. Fernandez-Blanco L, Perez-Pampin E, Gomez-Reino JJ, Gonzalez A. A CTLA-4 polymorphism associated with susceptibility to systemic lupus erythematosus. Arthritis Rheum 2004; 50: 328–329.
  17. Bouqbis L, Izaabel H, Akhayat O et al. Association of the CTLA4 promoter region (−1661G allele) with type 1 diabetes in the South Moroccan population. Genes Immun 2003; 4: 132–137.
  18. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001; 2: 91–99.
  19. Bertranpetit J, Calafell F, Comas D, Gonzalez-Neira A, Navarro A. Structure of linkage disequilibrium in humans: genome factors and population stratification. Cold Spring Harb Symp Quant Biol 2003; 68: 79–88.
  20. Tishkoff SA, Dietzsch E, Speed W et al. Global patterns of linkage disequilibrium at the CD4 locus and modern human origins. Science 1996; 271: 1380–1387.
  21. Sawyer SL, Mukherjee N, Pakstis AJ et al. Linkage disequilibrium patterns vary substantially among populations. Eur J Hum Genet 2005; 13: 677–686.
  22. Gonzalez-Neira A, Calafell F, Navarro A et al. Geographic stratification of linkage disequilibrium: a worldwide population study in a region of chromosome 22. Hum Genomics 2004; 1: 399–409.
  23. Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES. High-resolution haplotype structure in the human genome. Nat Genet 2001; 29: 229–232.
  24. Wall JD, Pritchard JK. Haplotype blocks and linkage disequilibrium in the human genome. Nat Rev Genet 2003; 4: 587–597.
  25. Gabriel SB, Schaffner SF, Nguyen H et al. The structure of haplotype blocks in the human genome. Science 2002; 296: 2225–2229.
  26. Clark AG, Weiss KM, Nickerson DA et al. Haplotype structure and population genetic inferences from nucleotide-sequence variation in human lipoprotein lipase. Am J Hum Genet 1998; 63: 595–612.
  27. Wang N, Akey JM, Zhang K, Chakraborty R, Jin L. Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am J Hum Genet 2002; 71: 1227–1234.
  28. Weale ME, Depondt C, Macdonald SJ et al. Selection and evaluation of tagging SNPs in the neuronal-sodium-channel gene SCN1A: implications for linkage-disequilibrium gene mapping. Am J Hum Genet 2003; 73: 551–565.
  29. Mueller JC, Lohmussaar E, Magi R et al. Linkage disequilibrium patterns and tagSNP transferability among European populations. Am J Hum Genet 2005; 76: 387–398.
  30. Crawford DC, Carlson CS, Rieder MJ et al. Haplotype diversity across 100 candidate genes for inflammation, lipid metabolism, and blood pressure regulation in two populations. Am J Hum Genet 2004; 74: 610–622.
  31. Chen FC, Li WH. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am J Hum Genet 2001; 68: 444–456.
  32. Barbujani G, Magagni A, Minch E, Cavalli-Sforza LL. An apportionment of human DNA diversity. Proc Natl Acad Sci USA 1997; 94: 4516–4519.
  33. Rosenberg NA, Pritchard JK, Weber JL et al. Genetic structure of human populations. Science 2002; 298: 2381–2385.
  34. Excoffier L, Hamilton G. Comment on ‘Genetic structure of human populations’. Science 2003; 300: 1877; author reply 1877.
  35. Akey JM, Zhang G, Zhang K, Jin L, Shriver MD. Interrogating a high-density SNP map for signatures of natural selection. Genome Res 2002; 12: 1805–1814.
  36. Kidd KK, Pakstis AJ, Speed WC, Kidd JR. Understanding human DNA sequence variation. J Hered 2004; 95: 406–420.
  37. Tishkoff SA, Varkonyi R, Cahinhinan N et al. Haplotype diversity and linkage disequilibrium at human G6PD: recent origin of alleles that confer malarial resistance. Science 2001; 293: 455–462.
  38. Hamblin MT, Thompson EE, Di Rienzo A. Complex signatures of natural selection at the Duffy blood group locus. Am J Hum Genet 2002; 70: 369–383.
  39. Sabeti PC, Reich DE, Higgins JM et al. Detecting recent positive selection in the human genome from haplotype structure. Nature 2002; 419: 832–837.
  40. Kong A, Gudbjartsson DF, Sainz J et al. A high-resolution recombination map of the human genome. Nat Genet 2002; 31: 241–247.
  41. Kidd JR, Pakstis AJ, Zhao H et al. Haplotypes and linkage disequilibrium at the phenylalanine hydroxylase locus, PAH, in a global representation of populations. Am J Hum Genet 2000; 66: 1882–1899.
  42. Mateu E, Perez-Lezaun A, Martinez-Arias R et al. PKLR-GBA region shows almost complete linkage disequilibrium over 70 kb in a set of worldwide populations. Hum Genet 2002; 110: 532–544.
  43. Kidd KK, Morar B, Castiglione CM et al. A global survey of haplotype frequencies and linkage disequilibrium at the DRD2 locus. Hum Genet 1998; 103: 211–227.
  44. Tishkoff SA, Goldman A, Calafell F et al. A global haplotype analysis of the myotonic dystrophy locus: implications for the evolution of modern humans and for the origin of myotonic dystrophy mutations. Am J Hum Genet 1998; 62: 1389–1402.
  45. Mateu E, Calafell F, Lao O et al. Worldwide genetic analysis of the CFTR region. Am J Hum Genet 2001; 68: 103–117.
  46. Gonzalez-Neira A, Ke X, Lao O et al. The portability of tagSNPs across populations. A worldwide survey (submitted).
  47. Cann HM, de Toma C, Cazes L et al. A human genome diversity cell line panel. Science 2002; 296: 261–262.
  48. Iyengar S, Seaman M, Deinard AS et al. Analyses of cross species polymerase chain reaction products to infer the ancestral state of human polymorphisms. DNA Seq 1998; 8: 317–327.
  49. Stephens M, Donnelly P. A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 2003; 73: 1162–1169.
  50. Schneider S, Roessli D, Excoffier L. Arlequin Ver. 2.0: A software for Population Genetic Data Analysis, 2nd edn., University of Geneva, Geneva, 2000.
  51. Sebastiani P, Lazarus R, Weiss ST, Kunkel LM, Kohane IS, Ramoni MF. Minimal haplotype tagging. Proc Natl Acad Sci USA 2003; 100: 9900–9905.


Acknowledgements

We thank Elena Bosch, Arcadi Navarro, Michelle Gardner, Lourdes Sampietro and Mònica Vallés (Universitat Pompeu Fabra) for helpful advice and technical support. Mark Shriver (Pennsylvania State University) and Kenneth K Kidd (Yale University) kindly provided the raw data for the FST comparison. We thank Howard Cann (CEPH, Paris) for providing the HGDP-CEPH panel. We also thank Anna Pérez-Lezaun, Roger Anglada and Stéphanie Plaza (Servei de Genòmica, Universitat Pompeu Fabra) for technical support. This research was supported by the Ministerio de Educación y Ciencia of the Spanish Government (BFU2004-04208/BMC) and the Departament d'Universitats, Recerca i Societat de la Informació (DURSI) of the Generalitat de Catalunya.

Author information



Corresponding author

Correspondence to D Comas.

About this article

Cite this article

Ramírez-Soriano, A., Lao, O., Soldevila, M. et al. Haplotype tagging efficiency in worldwide populations in CTLA4 gene. Genes Immun 6, 646–657 (2005).


  • CTLA4 gene
  • population diversity
  • haplotype
  • linkage disequilibrium
  • tag-SNP
