Introduction

The androgen receptor gene (AR; OMIM 313700) has been proposed as a strong candidate for risk and progression of prostate cancer (Porkka and Visakorpi 2004). This gene, located on the X chromosome (Xq11-12), belongs to the superfamily of nuclear hormone receptor (NR) genes that mediate the action of lipophilic ligands. The AR gene product regulates expression of the genes necessary for growth and development of many target tissues, including male and female reproductive organs. Protein functional domains include a variable N-terminal domain (NTD), a highly conserved DNA-binding domain, a hinge region, and a C-terminal ligand-binding domain (LBD). Transcriptional activation involves a ligand-induced intramolecular interaction between NTD and LBD domains. Upon binding the hormone ligand (dihydrotestosterone), the receptor dissociates from accessory proteins, translocates into the nucleus, dimerises, and then stimulates transcription of androgen-responsive genes.

The N-terminal domain of the protein is encoded by exon 1 (Brinkmann et al. 1989) and contains two polymorphic short tandem repeats (STRs), CAG and GGC, separated by approximately 1.1 kb. The CAG microsatellite encodes a polyglutamine tract that ranges in size from 5 to 33 glutamine (Gln) residues, with an average length of 20 trinucleotide repeats. The GGN tract, (GGT)3(GGG)1(GGT)2(GGC)n, encodes a polyglycine tract with a length of 22–24 glycines (average length of 16 GGC repeats). In vitro studies (Chamberlain et al. 1994; Ding et al. 2004) have demonstrated an inverse relationship between the length of both repeats and AR activity. Short CAG (<19 repeats) and GGC (<15 repeats) alleles result in higher receptor activity, and have been associated with prostate cancer, earlier age of onset, and a higher grade and more advanced stage of disease at the time of diagnosis (Visakorpi 2003). Short GGC alleles have also been associated with oesophageal cancer (Dietzsch et al. 2003). In contrast, long CAG and GGC alleles are associated with decreased transactivation function of the AR and have been associated with breast (Suter et al. 2003) and endometrial (Sasaki et al. 2003a) cancers.

CAG and GGC allele length variation has been described in populations (Kittles et al. 2001; Sasaki et al. 2003b) of the main ethnic groups (African, American of African descent, Asian and European) but, until now, information about CAG and GGC allele distribution in Mediterranean countries has been practically non-existent (Hadjkacem et al. 2004). For both STRs, remarkable allele size differences have been observed between African and non-African populations. These differences may help to explain ethnic differences in the incidence of some cancers (Ferlay et al. 2004). Prostate cancer incidence in American men of African descent (age-standardised rate (ASR) = 271/100,000) is higher than that observed in American men of European descent (ASR = 167/100,000), whereas the incidences of prostate cancer and other steroid-related cancers are remarkably low in Asia.

This work intends to report data describing CAG and GGC allele variation in a sub-Saharan African sample of the Ivory Coast as well as in 12 Mediterranean groups. The peopling of the Mediterranean is a profusely studied anthropological topic (Plaza et al. 2003; Quintana-Murci et al. 2003; Esteban et al. 2004). Rather than going deeper into this subject, although the populations and markers will allow us to do so, the main goal here is to obtain new data about the distribution of the above polymorphisms in South Europeans (seven samples from Spain, Italy, Greece and Turkey) and North Africans (five samples from Morocco, Algeria and Egypt). The proposed associations among the AR CAG and GGC STRs, and risk for several cancers make them particularly interesting. Knowledge of the current population variation of these two markers may be extremely informative as a baseline for the design of future epidemiological studies. It is well known that population stratification due to population differences in allele frequencies may lead to confounding results in association studies (Ardlie et al. 2002). This fact must be seriously considered in a region such as the Mediterranean, with important South–North migratory movements, both in the past and recently. Moreover, different genetic analyses based on uniparental and autosomal markers (Plaza et al. 2003; Quintana-Murci et al. 2003; Esteban et al. 2004) indicate a certain degree of Sub-Saharan African admixture in some North African populations. To test the magnitude of this contribution in terms of the CAG and GGC STR variation may contribute relevant information to population-based research in genetic epidemiology.

Materials and methods

Populations sampled

Blood samples were collected from healthy and unrelated males and females from 13 different groups within eight countries. Samples were obtained with the informed consent of the participants. All participants had their four grandparents born in the same region. With the exception of the Greek sample, participants were representatives of rural areas of geographically well-defined regions. Some of these geographical regions are the homeland of several anthropologically well-defined Berber groups in Morocco, Algeria, and Egypt. Three Spanish samples were analysed from northwest Spain (number of chromosomes n=106), southeast Spain (n=112), and the Basque Country (n=62). The island of Sardinia was sampled in two different localities: inner Sardinia (n=68) and coastal Sardinia (n=86). Greece (Attica region, n=66) and Turkey (Anatolia Peninsula, n=150) completed the North Mediterranean sampling. The Moroccan samples consisted of three Berber groups from High Atlas (n=88), Middle Atlas (n=118), and Northeast Atlas (n=90). Berbers from Mzab in Algeria (n=74) and from the Siwa Oasis in Egypt (n=82) were also analysed. Finally, a sample from the Ahizi ethnic group (n=88) from the Ivory Coast was genotyped to include a representation of the Sub-Saharan African variation.

Genetic analyses

CAG and GGC STRs were amplified using the methodology defined in previous studies (Kittles et al. 2001; Sasaki et al. 2003b). PCR products were pooled and electrophoresed on an ABI PRISM 3700 DNA sequencer (Applied Biosystems, Foster City, CA). Genescan and Genemapper 3.0 programs (ABI PRISM, Applied Biosystems) were used to generate fragment sizes and genotypes. Different male individuals selected by their CAG and GGC alleles were sequenced to confirm size lengths.

Statistical analyses

Trinucleotide allele frequencies were determined by direct gene counting. Allele frequencies in both sexes were compared by means of a G-test. Male and female data were grouped since no sex differences were observed.
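As an illustration of direct gene counting at an X-linked locus, the following minimal Python sketch counts one allele per male and two per female, then computes a G statistic (log-likelihood ratio) comparing male and female allele counts. The function names and the genotypes are hypothetical, chosen only for illustration; they are not the study data, and the sketch is not the software used by the authors.

```python
from collections import Counter
from math import log

def count_alleles(genotypes):
    """Direct gene counting for an X-linked locus.

    Each genotype is a tuple of repeat numbers: one allele for a male,
    two for a female, so every observed chromosome is counted once.
    """
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype)
    return counts

def frequencies(counts):
    """Relative allele frequencies from absolute counts."""
    total = sum(counts.values())
    return {allele: c / total for allele, c in counts.items()}

def g_statistic(counts_a, counts_b):
    """G statistic and degrees of freedom for a 2 x k table of allele counts."""
    alleles = sorted(set(counts_a) | set(counts_b))
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    total = n_a + n_b
    g = 0.0
    for allele in alleles:
        column = counts_a[allele] + counts_b[allele]
        for observed, row in ((counts_a[allele], n_a), (counts_b[allele], n_b)):
            if observed > 0:  # empty cells contribute nothing to G
                expected = row * column / total
                g += 2.0 * observed * log(observed / expected)
    return g, len(alleles) - 1

# Hypothetical genotypes (repeat numbers), not taken from the study.
males = [(20,), (21,), (19,), (22,)]
females = [(20, 21), (19, 22)]
male_counts = count_alleles(males)
female_counts = count_alleles(females)
pooled_freqs = frequencies(male_counts + female_counts)
g_value, df = g_statistic(male_counts, female_counts)
```

In this toy example both sexes carry identical allele counts, so G is zero; in practice G would be compared with a chi-square distribution with the returned degrees of freedom.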

The mean number of repeats and its variability (heterozygosity and variance of the allele distribution) were used as indicators of repeat dynamics. Two statistical parameters, β and ρ, proposed by Deka et al. (1999) to detect demographic and microsatellite mutational traits, were estimated. The β parameter is known as the imbalance of heterozygosity and allele size variance; the β value may differentiate diverse population demographic situations (constant population size, population growth after bottleneck or after equilibrium) or detect allele size constraints. The ρ parameter is the slope of the regression line between the mean and variance of allele size. This parameter is independent of demographic effects and provides information about the existence of a contraction or expansion bias in the STR. Theoretical predictions of these two parameters interpreted together concerning demographic and mutational aspects are detailed in previous works (Deka et al. 1999; Andrés et al. 2002).
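The descriptive quantities named above can be sketched in a few lines of Python. The sketch assumes the usual unbiased estimators of variance and heterozygosity, and assumes that ρ is obtained by regressing the per-population variance on the per-population mean; the text does not specify the direction of the regression or the estimators, so both are assumptions. All allele counts below are hypothetical.

```python
def allele_stats(counts):
    """Mean, variance and unbiased heterozygosity from {repeat_number: count}."""
    n = sum(counts.values())
    mean = sum(a * c for a, c in counts.items()) / n
    variance = sum(c * (a - mean) ** 2 for a, c in counts.items()) / (n - 1)
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    heterozygosity = n / (n - 1) * (1.0 - homozygosity)
    return mean, variance, heterozygosity

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical allele counts for three populations (not the study data).
populations = [
    {19: 2, 20: 4, 21: 2},
    {20: 3, 21: 3, 22: 2},
    {18: 1, 19: 3, 20: 3, 21: 1},
]
stats = [allele_stats(p) for p in populations]
means = [s[0] for s in stats]
variances = [s[1] for s in stats]
rho = ols_slope(means, variances)  # assumed definition: variance on mean
```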

Mean, variance, median, and ρ parameters were calculated using SPSS v.10.0 and STATISTICA software packages. The β parameter and its confidence intervals (CI; determined by re-sampling) were estimated using Microsoft Excel 2000 software as described in Andrés et al. (2002). For each STR, overall β and ρ values were computed by pooling all populations. Standard diversity indices, population comparisons (exact test of population differentiation), and hierarchical analyses of molecular variance (AMOVA) were implemented by the Arlequin v 2.0 package (Schneider et al. 2002).
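The re-sampling idea behind such confidence intervals can be illustrated with a percentile bootstrap. The sketch below is only an assumption about the general approach, not the procedure of Andrés et al. (2002): it resamples chromosomes with replacement and uses heterozygosity as a stand-in statistic, where a function computing β could be substituted. The sample, the number of replicates, and the seed are hypothetical.

```python
import random

def heterozygosity(alleles):
    """Unbiased heterozygosity from a list of observed alleles."""
    n = len(alleles)
    counts = {}
    for allele in alleles:
        counts[allele] = counts.get(allele, 0) + 1
    return n / (n - 1) * (1.0 - sum((c / n) ** 2 for c in counts.values()))

def bootstrap_ci(alleles, statistic, level=0.99, replicates=2000, seed=1):
    """Percentile bootstrap interval for statistic(alleles)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(alleles)
    values = sorted(
        statistic([rng.choice(alleles) for _ in range(n)])
        for _ in range(replicates)
    )
    tail = (1.0 - level) / 2.0
    lower = values[int(tail * (replicates - 1))]
    upper = values[int((1.0 - tail) * (replicates - 1))]
    return lower, upper

# Hypothetical sample of 30 chromosomes (repeat numbers), not the study data.
sample = [19] * 5 + [20] * 10 + [21] * 8 + [22] * 7
ci_low, ci_high = bootstrap_ci(sample, heterozygosity)
```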

Results

AR CAG and GGC allele size distributions in the 13 studied populations are available as Electronic Supplementary Material. In general, CAG and GGC distributions are in Hardy–Weinberg equilibrium; however, departures from equilibrium (tested in the female sub-samples) were found in 6 cases out of 26 tests. The small number of females in some samples, together with the high number of alleles at both STRs, is probably the cause of these departures from Hardy–Weinberg equilibrium. Allele diversity values and other statistical parameters for the CAG and GGC markers are shown in Tables 1 and 2, respectively.

Table 1 Statistics and genetic variance parameters of the CAG trinucleotide in the androgen receptor (AR) gene in 13 populations. Absolute and relative (in parentheses) repeat frequencies are grouped into alleles of less than 19 repeats, alleles of 19 to 21 repeats, and alleles of more than 21 repeats. n Number of gene copies (chromosomes), H heterozygosities; no. alleles number of different alleles
Table 2 Statistics and genetic variance parameters of the GGC trinucleotide in the androgen receptor (AR) gene in 13 populations. Absolute and relative (in parentheses) repeat frequencies are grouped into alleles of less than 15 repeats, alleles of 15 to 17 repeats, and alleles of more than 17 repeats

CAG variation

The CAG repeat shows notable levels of within-population variation (heterozygosity values from 84 to 89%) in all samples. Variance in the allele sizes is more heterogeneous; some populations, such as Siwa, Middle and High Atlas, and South Spain, show high values (10.37–9.28), whereas other groups, including the Ivory Coast, Turkey, Basque Country, and Greece, exhibit the lowest values (5.91–5.03). The high number of different alleles found in all groups (10–17) makes it difficult to ascribe a global population variation pattern. In an attempt to summarise allele variation, CAG allele frequencies were grouped (see Table 1) into three different categories defined by the observed median repeat values: short alleles of less than 19 repeats, medium alleles of 19 to 21 repeats, and long alleles of more than 21 repeats. This summarised information reveals that our European samples are characterised by low (mean frequency of 7%) and high (51%) frequencies of the short and long alleles, respectively. The Ivory Coast shows the opposite trend (36% of short and 13% of long alleles). North African populations exhibit intermediate values of short alleles (17%), and the lowest frequencies of the medium category (35%). A recent work (Buchanan et al. 2004) describes an allele range (from 16 to 29 repeats) considered as the critical size for maintaining a protein NTD/LBD interaction that ensures adequate AR activity. Under this criterion, the Ivory Coast (8.98% of alleles <16 repeats) and, to a lesser degree, High Atlas (5.75%) and Middle Atlas (4.27%) Berbers, show at least twice as many alleles under this critical size as the remaining groups.
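The grouping into size classes described above can be sketched as follows. The thresholds are passed as parameters so that the same hypothetical helper covers the CAG classes (<19, 19–21, >21) and the GGC classes (<15, 15–17, >17) used in Tables 1 and 2. The allele counts are invented for illustration and are not the study data.

```python
def size_classes(counts, low, high):
    """Split {repeat_number: count} into short (< low), medium (low..high)
    and long (> high) classes, returning (absolute, relative) frequencies."""
    groups = {"short": 0, "medium": 0, "long": 0}
    for repeats, count in counts.items():
        if repeats < low:
            groups["short"] += count
        elif repeats <= high:
            groups["medium"] += count
        else:
            groups["long"] += count
    total = sum(groups.values())
    return {name: (c, c / total) for name, c in groups.items()}

def fraction_below(counts, threshold):
    """Share of chromosomes carrying fewer than `threshold` repeats."""
    total = sum(counts.values())
    return sum(c for r, c in counts.items() if r < threshold) / total

# Hypothetical CAG allele counts for one population (not the study data).
cag_counts = {15: 1, 17: 2, 19: 5, 20: 6, 21: 4, 24: 2}
cag_classes = size_classes(cag_counts, low=19, high=21)
below_critical = fraction_below(cag_counts, 16)  # critical size of 16 repeats
```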

The imbalance of heterozygosity and allele size variance, expressed as the β parameter, is shown in Table 1 for each population. The global β yields a pooled population value of 0.183 (99% CI: 0.175–0.198) whereas the ρ parameter (ρ=−0.047) is not significantly different from zero.

GGC variation

In contrast to CAG, the GGC locus (Table 2) is less variable. The majority of samples analysed have 6–10 different alleles, with the exception of inner Sardinia and Greece, which show lower variation (2 and 3 different alleles). The repeat size pattern shows a highly frequent allele of 16 repeats in all populations except those of the Ivory Coast and Siwa Berbers. In these two latter groups, the 15-repeat allele is found at high frequencies. As can be seen in Table 2, more than 80% of the total allele frequencies are explained by the presence of the 15–17 repeat alleles in all samples except those of the Ivory Coast and High Atlas Berbers. Alleles of less than 15 repeats are found at notable frequencies (12–25%) only in the African groups (except Middle Atlas) and in the Iberian samples of South Spain (16.07%) and the Basque Country (12.90%). High heterozygosities are observed in African samples (from 79% in the Ivory Coast to 60% in Middle Atlas Berbers) and South Spain (65%). Population heterozygosity values were compared by means of a non-parametric ANOVA. Significant differences (χ² = 6.33, 1 df, P=0.012) were found in the comparison between north (seven samples) and south (five samples) Mediterraneans. These differences remain significant (χ² = 6.00, 1 df, P=0.014) when the Northeast Mediterranean (inner Sardinia, coastal Sardinia, Greece and Turkey) samples are compared with North Africans, but not for the North African versus Iberian Peninsula comparison (χ² = 2.68, 1 df, P=0.101).
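The text does not name the non-parametric ANOVA that was used. Assuming it was a Kruskal–Wallis test, which for two groups is referred to a chi-square distribution with 1 degree of freedom, the computation can be sketched as below. The heterozygosity values are hypothetical, no tie correction is applied, and the function names are illustrative only.

```python
from math import erfc, sqrt

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def kruskal_wallis(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = [value for group in groups for value in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0))

# Hypothetical heterozygosities for two groups of samples (not the study data).
north = [0.30, 0.35, 0.40]
south = [0.60, 0.65, 0.70]
h_stat = kruskal_wallis([north, south])
p_value = chi2_sf_1df(h_stat)
```

Applying `chi2_sf_1df` to the reported statistics (6.33, 6.00 and 2.68) gives tail probabilities close to the P values quoted in the text, which is consistent with a 1 df chi-square reference distribution.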

As for the imbalance index, β values for each population are indicated in Table 2. The overall β value (0.792, 99% CI: 0.636–1.002) is not significantly different from 1. The ρ value is −3.28 (P=0.00035).

Population relationships

Pairwise population comparisons for the CAG locus report only 14 significant results out of 78 comparisons. In contrast, the GGC locus is more heterogeneous since 40 out of 78 comparisons are significant. Leaving aside some isolated population differences, coastal Sardinia, the Ivory Coast, and the Siwa Berbers show a pattern of GGC allele frequencies that differs significantly from the remaining groups.

The proportion of the genetic variance attributable to differences among groups shows remarkable differences between the CAG and GGC markers. For the CAG locus, only the global FST of 0.52% (13 samples, P=0.039) yields a weak but significant result. However, when CAG frequencies are grouped into the short, medium, and long categories, the hierarchical analyses indicated in Table 3 show a significant geographical structure related to these allele size categories. When the North Mediterranean group is compared with the African groups, both including and excluding the Ivory Coast, significant among-groups values (2.9 and 1.9%, respectively) are observed.

Table 3 Variability analysis in the CAG and GGC loci

For the GGC locus, the amount of total genetic variance for the 13 populations examined shows a high value (FST of 9.01%; P<0.001). Decreasing but significant intra-group FST diversity values are also observed (Table 3) in the different geographic Mediterranean groups. Hierarchical analyses indicate an apparently noticeable geographic structure related to this repeat. However, in view of the particular pattern of GGC frequencies shown by Siwa Berbers and Sardinians, genetic variances have been recalculated excluding these two samples. In this case, genetic variances change to non-significant values, although a global FST of 1.9% (P=0.0237) remains significant.

To summarise population relationships, different principal component (PC) analyses based on the allele frequencies of CAG and GGC were explored. A plot (not shown) drawn with complete allele frequencies of the two repeats in the 13 populations analysed here, and also including two samples from Japan and Germany (Sasaki et al. 2003b), shows a population distribution around two main axes (43.6% of total variation) that clearly separates the Ivory Coast and Siwa Berbers from the remaining populations, with an intermediate position of some North African samples. The low percentage of variation expressed by the two main axes, in spite of the different populations included, may be due to both the low number of loci and the extreme allele dispersion of these STRs. A PC re-analysis based on grouped allele frequencies (Tables 1, 2) for CAG and GGC among the 12 Mediterranean samples is represented in Fig. 1, explaining 78.5% of the genetic variance. The first axis (48.7%) distinguishes two clusters, one comprising Sardinian, Greek, and Turkish samples, and another composed of the remaining groups with the exception of High Atlas Berbers, which occupy an extreme position. The second axis (29.8%) separates the Siwa Berbers. The first axis is influenced by GGC diversity: allele frequencies of GGC<15 repeats correlate best (87%) with the High Atlas differentiation, whereas the frequencies of GGC alleles of 15–17 repeats are highly correlated (92.4%) with the East Mediterranean cluster. Also, the Siwa differentiation may be related to short CAG allele frequencies (75.2% correlation).

Fig. 1 Plot of the frequencies of the two principal components (PC) of CAG and GGC alleles of the androgen receptor (AR) gene in 12 Mediterranean populations

Discussion

This paper attempts to give a picture of the general population distribution of two STRs of the AR gene in a well-defined geographic area: the Mediterranean region. In this region, CAG data show similarly high within-population variation but the population distribution of short, medium, and long alleles is remarkably different between South Europeans and Sub-Saharan Africans; North Africans exhibit intermediate frequencies. As a result of this, the CAG allele distribution in the 12 Mediterranean samples shows a significant geographical (North-South) structure.

On the other hand, the variation at the GGC locus allows us to distinguish two groups, one comprising African and Iberian Peninsula samples characterised by both high heterozygosities (H average =0.6577±0.0845) and allele diversities (average number of different alleles =8.78±1.48), and the other formed by Sardinia and Greece, with lower diversities (H =0.3280±0.2226; different alleles average =3.33±1.53). The relatively general Mediterranean homogeneity for the GGC marker seems to be disrupted by some specific populations rather than by any North–South geographical structure. Thus, the Sardinian samples are characterised by low genetic diversities that may be interpreted as the result of genetic drift and geographic isolation, as evidenced by other autosomal (Cavalli-Sforza et al. 1994) and uniparental (Morelli et al. 2000) markers. Another well-differentiated group is the Siwa; this Berber-speaking group is characterised by high endogamy levels (Fakhry 1973) and geographic isolation from the remaining Egyptian populations. In our case, the Siwa are differentiated mainly by the particular pattern of GGC allele sizes rather than by low genetic diversities, in accordance with the only previous genetic study conducted in this population, which demonstrated important genetic diversity levels contrasting with the idea of an isolated population (Amory et al. 2004).

Finally, High Atlas Berbers are, among Moroccan Berbers, the most closely related to the Sub-Saharan variation. Also true to this trend, Southern Spaniards, among the Iberian samples, are the most related to the North African variation. These findings suggest genetic influences from South to North, both across the Sahara and the Mediterranean, as previously reported in other studies based on mt-DNA, Y-chromosome, and autosomes (Plaza et al. 2003; Quintana-Murci et al. 2003; Esteban et al. 2004).

The results regarding repeat dynamics and population demography reveal different patterns for CAG and GGC. The global CAG pattern (β<1, ρ≈0) is compatible with a theoretical model of an unbiased constrained repeat; the β<1 values observed in each sample (Table 1) may be interpreted as constraints in size (a gene particularity) rather than a population effect. The GGC pattern (β≈1, ρ<0) agrees with a model of a constrained repeat with an expansion bias in expanding populations after a bottleneck. This demographic information is concordant with the known genetic evidence regarding populations of non-African origin. For this marker, the Ivory Coast sample, which shows (Table 2) an individual β of 0.37 (99% CI: 0.19–0.55), has a value compatible with a constant population model, as previous studies (Deka et al. 1999; Andrés et al. 2002) have suggested for other Sub-Saharan African groups.

From the point of view of the possible relationship between the dynamics of these two STRs and disease, the CAG locus analysed here has a pattern of high within-population variation and a non-significant correlation (ρ≈0) between mean and variance of allele sizes. A similar pattern has been described in disease-causing trinucleotides (Deka et al. 1999), even though the subjects analysed are disease-free and lie within the normal size ranges. On the contrary, the considerably low heterozygosities and within-population variances of the GGC STR, together with a significantly negative ρ value, are similar to the dynamics described for GC-rich gene-associated and anonymous loci.

With regard to the relationship between STR allele sizes, androgen receptor activity, and cancer risk, short CAG repeats (≤18 repeats relative to ≥26) have been associated with an increased risk (relative risk=2.14) of advanced prostate cancer (Giovannucci et al. 1997). A more recent study (Buchanan et al. 2004) reveals that under a critical size of 16 CAG repeats, the conformational structure resulting from the short polyglutamine tract encoded by these repeats could enhance the binding of specific transcriptional coactivators, resulting in higher AR activity even at lower androgen concentrations. The Ivory Coast and all North African samples show high frequencies of CAG alleles ≤18 repeats. Furthermore, in the particular case of the Ivory Coast, High Atlas, and Middle Atlas Moroccan Berbers, the proportion of alleles under the size of 16 repeats (9, 6, and 4%, respectively) is at least two or three times higher than that observed in other samples.

Short GGC alleles have also been correlated with an increased risk of prostate cancer (Chang et al. 2002). A recent study (Ding et al. 2004) revealed that GGC repeat size is correlated with cellular AR protein levels, with shorter alleles yielding more protein: a GGC STR of 16 repeats (GGC16) yielded, on average, 1.7 times more AR protein than did GGC17, and GGC13 yielded 2.7 times more AR protein than did GGC17. The Ivory Coast, several North African samples, and South Spain exhibit high frequencies of alleles under the size of 15 repeats. The proportion of these alleles is 3.5–7 times higher than in Europeans.

Considering the chromosomal location (Xq12) of the AR gene, the effects of androgen in males are mediated by a single AR allele whereas in females, with two different AR alleles, random X-chromosome inactivation leads to effects of different alleles in different cells. The Ivory Coast, and High and Middle Atlas Moroccans show remarkably high frequencies of short CAG and GGC alleles. This fact may be translated into a considerable proportion of males, and, to a lesser degree, of females, in these populations carrying short AR alleles. To go beyond this affirmation, and to speculate about the possible relationship between the distributions of these repeats in the populations analysed here and cancer risk would be less than prudent. However, the data reported here must be discussed in relation to prostate cancer incidences.

Prostate cancer incidences (Ferlay et al. 2004), as indicated by means of ASR are: 19.7/100,000 in the Ivory Coast, 6.4/100,000 in Morocco, 5.6/100,000 in Algeria, and 4.4/100,000 in Egypt. Among South Europeans, the ASR is 35.9/100,000 in Spain, 40.5/100,000 in Italy, 26.2/100,000 in Greece, and 8.0/100,000 in Turkey. These data indicate higher incidences in the Ivory Coast than in North African countries. However, these rates also show incidence values at least five times lower in North Africans than in South Europeans and, hence, do not confirm the correlation between low size alleles and high prostate cancer incidence previously reported in other populations. In any case, these prostate cancer rates must be interpreted with caution due to the significant differences among African and European countries in some important aspects concerning cancer epidemiology: access to health services and diagnostic procedures [prostate-specific antigen (PSA); Liu et al. 2001], differences in several dietary factors [lycopene, total fat, and long-chain (n-3) fatty acids; Terry et al. 2004], and the remarkable divergence in population age structure among these countries.
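The comparison of incidence rates above can be made explicit with a short calculation on the ASR values quoted in the text. The sketch assumes simple unweighted means across countries, which ignores differences in population size; this is a simplification for illustration, not the method used by the authors.

```python
# Age-standardised rates per 100,000 as quoted in the text (Ferlay et al. 2004).
asr = {
    "Ivory Coast": 19.7,
    "Morocco": 6.4,
    "Algeria": 5.6,
    "Egypt": 4.4,
    "Spain": 35.9,
    "Italy": 40.5,
    "Greece": 26.2,
    "Turkey": 8.0,
}
north_africa = ["Morocco", "Algeria", "Egypt"]
south_europe = ["Spain", "Italy", "Greece", "Turkey"]

def mean_rate(countries):
    """Unweighted mean ASR across a list of countries (an assumed simplification)."""
    return sum(asr[c] for c in countries) / len(countries)

# Ratio of the South European mean to the North African mean.
europe_vs_north_africa = mean_rate(south_europe) / mean_rate(north_africa)
# Ratio of the Ivory Coast rate to the North African mean.
ivory_coast_vs_north_africa = asr["Ivory Coast"] / mean_rate(north_africa)
```

With these figures, the unweighted South European mean is roughly five times the North African mean, although individual country pairs differ considerably (Turkey, for instance, lies much closer to the North African rates).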

We are aware of the extreme difficulty of genetic epidemiology studies, in particular those related to cancer, in which different genetic and environmental factors may contribute with hundreds of pieces to a very complicated puzzle. Given this fact, and in the light of the remarkable differences in CAG and GGC frequencies among the Ivory Coast, several Moroccan groups, and South Europeans, new studies with a multi-ethnic perspective may contribute to clarifying the risk factors associated with prostate cancer. Since CAG and GGC STRs are located in the same exon, further analyses of the degree of linkage disequilibrium should be strongly considered.