Introduction

In most societies surnames are transmitted from father to child, just like Y-chromosome genes. In this way, an analysis of the geographic diffusion of surnames can provide accurate estimates of migration rates1 and give information to reconstruct the paths followed by men (and their genes) from the beginning of surname history. In addition, if the ‘cultural’ nature of surnames is considered, it is possible to detect much more ancient migrations which, in some cases, point to a common genetic origin.2,3,4 In fact, surnames are linguistic attributions, reflecting local language, which indicated the identity of people in the area where they lived. For this reason, individuals bearing surnames that are derived from words of the same local dialect probably share the same place of origin and the same gene pool.

Sardinia shows a particular genetic and cultural pattern shaped by two contrasting factors: (i) the conquests it has undergone and (ii) its long history of isolation. It therefore represents a good test case to verify whether surname history can also provide clues to gene history.

In this study, we examined a sample of 256 Sardinian male subjects for both surnames and Y-chromosome markers (13 biallelic polymorphisms, the complex 49a,f/TaqI system and three microsatellites), in order to evaluate the power of surname analysis in revealing the Y-chromosome ancestry.

Material and methods

The sample

The sample consists of 256 unrelated apparently healthy Sardinian male subjects who gave their informed consent. In total, 116 were conscripts from the Sassari province and were collected in Porto Torres (N=105) and Olbia (N=11); 140 were subjects sampled at the clinical analysis laboratories of the hospitals ‘Ospedale Regionale Microcitemie’ of Cagliari (N=123) and ‘Ospedale A. Segni’ of Ozieri (N=17).

Y-chromosome DNA analysis

The individuals were examined for 13 biallelic markers, namely: the 12f2 [DYS11],5 YAP [DYS287],6 and, in a hierarchical way, M9, M17, M26,7 RPS4Y,8 M35, M74,9 M89, M170, M173, M201,10 M269,11 for the 49a,f/TaqI [DYS1] RFLPs12 and for the variable microsatellites YCAIIa, YCAIIb13 and DYS19.14 The 12f2 and 49a,f/TaqI polymorphisms were analyzed according to Passarino et al15 and the Alu insertion (YAP) according to Hammer and Horai.16 The mutations M17, M26, M35, M170, M173, M201 and RPS4Y were detected by DHPLC as reported by Underhill et al.17 The M9, M74, M89 and M269 markers were typed by PCR/RFLP assay: M9 and M269 according to Cruciani et al,11 M89 according to Akey et al18 and M74 by using the primers 5′-ATG CTA TAA TAA CTA GGT GGT GAA G-3′ and 5′-AAT TCA GCT TTT ACC ACT TCT GAA-3′, followed by digestion with the restriction enzyme HpyCH4V. Analyses of the YCAII and DYS19 microsatellites were performed as described by Mathias et al13 and Roewer et al,14 who first reported the polymorphisms.
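Hierarchical typing can be pictured as a top-down walk through the marker tree: a downstream marker is tested only when the marker above it shows the derived state. The following is a minimal sketch, not the authors' laboratory protocol. The `PARENT` table is an assumed, simplified arrangement of the thirteen markers based on the YCC tree; placing 12f2 inside the tree is a simplification, and the function name and the `is_derived` callback (a stand-in for the assay result) are hypothetical.

```python
# Assumed, simplified marker hierarchy (None = tested on every sample).
PARENT = {
    "RPS4Y": None,
    "YAP": None,
    "M89": None,
    "M35": "YAP",
    "M201": "M89",
    "M170": "M89",
    "12f2": "M89",
    "M9": "M89",
    "M26": "M170",
    "M74": "M9",
    "M173": "M74",
    "M17": "M173",
    "M269": "M173",
}


def hierarchical_typing(is_derived):
    """Test markers top-down, descending only below a derived marker.

    is_derived: callable taking a marker name and returning True/False
                (hypothetical stand-in for the laboratory result).
    Returns (markers tested in order, most downstream derived marker or None).
    """
    children = {}
    for marker, parent in PARENT.items():
        children.setdefault(parent, []).append(marker)

    tested, terminal = [], None
    frontier = list(children[None])
    while frontier:
        marker = frontier.pop(0)
        tested.append(marker)
        if is_derived(marker):
            terminal = marker
            # Sibling branches are mutually exclusive, so only this
            # marker's own descendants remain to be tested.
            frontier = list(children.get(marker, []))
    return tested, terminal


# Illustrative chromosome carrying the derived state at M89, M170 and M26.
tested, terminal = hierarchical_typing(lambda m: m in {"M89", "M170", "M26"})
```

In this illustration the chromosome is resolved to M26 after six of the thirteen assays, which is the practical benefit of typing hierarchically.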

Surname analysis

Geo-linguistic research on the origin of Sardinian surnames was carried out, first, by analyzing the distributions of surname data derived from three different sources encompassing the whole territory of Sardinia at different times. The ancient data come from the collection of 48 470 consanguineous marriages celebrated between 1750 and 1950 in the 442 parishes of Sardinia. The more recent data consist of (1) surnames of electric power users in 1983 for all communes of Sardinia (N=484 484) and (2) surnames of telephone users in 1993 (N=483 072). Only surnames present in each of the three data sets were considered, and their distributions were analyzed in parallel. This allowed us to trace the dispersion of 4386 surnames throughout the 370 Sardinian communes and to estimate the parameters describing it: the place of maximum frequency and the center of the dispersion area.

About 75% of the Sardinian surnames are dispersed on a very small area around the point of highest frequency corresponding to a specific commune. This type of surname is called ‘monophyletic’, assigning to this term the meaning of uniqueness of place of origin.
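The two dispersion parameters and the idea of a surname concentrated around its peak can be illustrated with a short sketch. It rests on assumptions not stated in the study: communes are given planar coordinates in kilometers, the center of dispersion is taken as the bearer-weighted centroid, and the 20 km radius is an arbitrary illustrative value. The record layout and function name are hypothetical.

```python
from math import hypot


def dispersion_summary(records, radius_km=20.0):
    """Summarize how one surname is spread over communes.

    records: list of (commune, x_km, y_km, bearers, commune_total) tuples.
    radius_km: illustrative radius around the peak commune (assumption).
    """
    # Place of maximum frequency: commune where the relative frequency peaks.
    peak = max(records, key=lambda r: r[3] / r[4])

    total = sum(r[3] for r in records)
    # Center of dispersion, taken here as the bearer-weighted centroid.
    cx = sum(r[1] * r[3] for r in records) / total
    cy = sum(r[2] * r[3] for r in records) / total

    # Share of bearers living within radius_km of the peak commune.
    near = sum(
        r[3]
        for r in records
        if hypot(r[1] - peak[1], r[2] - peak[2]) <= radius_km
    )
    return {"peak": peak[0], "center": (cx, cy), "share_near_peak": near / total}


# Invented example data for one surname, not taken from the study.
example = [
    ("A", 0.0, 0.0, 40, 1000),
    ("B", 10.0, 0.0, 8, 2000),
    ("C", 100.0, 50.0, 2, 50000),
]
summary = dispersion_summary(example)
```

A surname whose `share_near_peak` is close to 1 behaves like the monophyletic surnames described above; the cut-off used to make that call would have to be chosen by the analyst.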

Moreover, data on the linguistic origin of surnames were obtained by searching for identity (or similarity) between a surname and a toponym of ancient or recent origin in the same neighborhood,19,20,21 for derivation of a surname from a lexical form,22,23 or for the presence of a surname in old Sardinian documents such as the ‘Condaghe’, dating from the XII to XIII centuries. For many surnames, this linguistic information may support or rectify the localization of the ‘probable territorial origin’ obtained from the study of their frequency distribution.

For each surname classified as monophyletic, the place of origin was recorded as a numerical code in order to protect the individuals' privacy.

To emphasize the cultural and genetic heterogeneity of Sardinia, the three large areas that reflect its ancient history and geography were considered (Figure 1): the northern zone delimited by the mountain chain crossing Sardinia from the central-west to the north-east and linguistically different from the rest of the island; the south-western zone, delineated by the presence of many Phoenician and Carthaginian archeological sites24 and the central-eastern zone, asylum land of the ancient Sardinian population during invasions and domain of pastoral culture. This zone includes the more conservative, or ‘archaic’ area, defined by archaeological and linguistic25 studies and, more recently, also by studies in geo-linguistics and genetics26 (for a more detailed subdivision of Sardinia on the basis of genes, languages and surnames, see Cavalli-Sforza et al27).

Figure 1

Subdivisions of Sardinia based on geographic and historical criteria. N=northern zone; CE=central-eastern zone including the archaic region; SW=south-western zone. The black line indicates the mountain chain separating the northern zone from the rest of the island. The four centers from which samples were collected are also indicated.

Results

A total of 202 individuals of the initial sample carried 175 monophyletic surnames, of which 151 were borne by a single individual and 24 by more than one individual. Subjects carrying the same surname, but not the same haplotype, were also included in the analysis, since their haplotypes may nonetheless originate from the same linguistic area from which their surnames derived. The remaining 54 individuals carried 41 surnames (19%) for which it was not possible to assign a specific place of origin (polyphyletic surnames).
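The counts in this paragraph can be cross-checked with a few lines of arithmetic. The sketch uses only figures reported in the text; reading the 19% as the share of polyphyletic surnames among all distinct surnames in the sample is an interpretation, since the denominator is not stated explicitly.

```python
total_individuals = 256
mono_individuals, poly_individuals = 202, 54
mono_surnames, poly_surnames = 175, 41
single_bearer_surnames, multi_bearer_surnames = 151, 24

# The two groups of individuals make up the whole sample.
assert mono_individuals + poly_individuals == total_individuals
# The two kinds of monophyletic surname make up all monophyletic surnames.
assert single_bearer_surnames + multi_bearer_surnames == mono_surnames

# Individuals sharing a monophyletic surname with at least one other subject.
shared = mono_individuals - single_bearer_surnames

# Assumed denominator: all distinct surnames in the sample.
poly_share = poly_surnames / (mono_surnames + poly_surnames)

print(shared, round(100 * poly_share))  # 51 19
```

Under this reading, the 24 shared surnames account for 51 individuals, and 41 of 216 surnames rounds to the reported 19%.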

Figure 2 illustrates the world-wide Y-chromosome phylogeny (The Y-chromosome Consortium, 2002),28 in which Sardinian Y chromosomes associated with both monophyletic and polyphyletic surnames are placed and compared with data from Italy and the Middle East. More than 95% of Sardinian samples fall into haplogroups E-M35, I-M170, J-12f2, G-M201 and R-M173. Thus, the haplogroup composition of the Sardinian Y-chromosome pool is very similar to that of Italians and of other Europeans.9,17,29,30,31 The observed differences are therefore mainly quantitative, probably due to isolation and genetic drift. For example, haplogroup I-M170, which has its highest frequency in central-eastern Europe and lower frequencies in western Europe,9,17,32,33 is very frequent in Sardinia. Moreover, the majority of these chromosomes harbor the additional mutation M26. As for the R-M173 chromosomes, the greater part belongs to the western European R-M269 subcluster, whereas the eastern European subcluster R-M17 is barely represented.

Figure 2

Phylogenetic tree of the Y-chromosome haplotypes and their percent frequencies in the Sardinian samples carrying ‘monophyletic’ and ‘polyphyletic’ (50 out of 54 individuals) surnames. Data on Italian and Middle Eastern samples30 are also given for comparison. Numbering of mutations is according to the YCC:28 those examined in the present study are shown in bold face type; those inferred are shown in italics. Capital letters indicate haplogroups according to the YCC.28 *Ten and one of these chromosomes were not tested for M26 because the DNA was exhausted. One major 49a,f-YCAIIa-YCAIIb-DYS19 compound haplotype characterizes each haplogroup: c-Ht 49a,f-Ht5/YCAIIa-22/YCAIIb-19/DYS19-13 for haplogroup E-M35; c-Ht 49a,f-Ht7/YCAIIa-22/YCAIIb-19/DYS19-14 for J-12f2; c-Ht 49a,f-Ht12/YCAIIa-21/YCAIIb-11/DYS19-17 for I-M26; c-Ht 49a,f-Ht15/YCAIIa-23/YCAIIb-19/DYS19-14 for R-M269 and c-Ht 49a,f-Ht8/YCAIIa-20/YCAIIb-20/DYS19-15 for G-M201.

The analyses of the 49a,f system and three microsatellites (data not shown) reveal the presence of one major compound haplotype per lineage (Figure 2 legend).

To search for an ancient territorial heterogeneity of these haplogroups, we distributed the 202 individuals carrying a monophyletic surname in the three zones described above, according to the surname place of origin. The resulting haplogroup frequencies are shown in Table 1. A general, significant heterogeneity in the distribution of all the haplogroups in the three areas is observed (χ2[10]=34.93, P<0.001). In particular, the cell χ2 analysis shows that the frequency of haplogroup G-M201 is significantly higher than expected in the northern zone (P<0.05), that of haplogroup I-M26 is significantly higher in the central-eastern zone (P<0.01) and lower in the northern zone (P<0.005), and that of haplogroup R-M269 is significantly lower in the central-eastern zone (P<0.05).
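The heterogeneity test and the cell-by-cell analysis follow the usual contingency-table recipe: expected counts are derived from the row and column totals, and each cell contributes (O − E)²/E to the overall statistic. The sketch below shows this under stated assumptions. The counts are invented for illustration and are not the values of Table 1, and the function name is hypothetical; a table of six haplogroup classes by three zones yields the 10 degrees of freedom reported above.

```python
def chi_square_table(observed):
    """Pearson chi-square for a table of counts (rows x columns).

    Returns (statistic, degrees of freedom, expected counts, cell contributions).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected count under homogeneity: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Contribution of each cell: (observed - expected)^2 / expected.
    cells = [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]
    stat = sum(sum(row) for row in cells)
    dof = (len(observed) - 1) * (len(col_totals) - 1)
    return stat, dof, expected, cells


# Invented counts, NOT the study's Table 1.
# Rows: six haplogroup classes; columns: northern, central-eastern, south-western.
toy = [
    [4, 3, 5],
    [6, 2, 3],
    [5, 20, 15],
    [3, 4, 6],
    [10, 4, 6],
    [1, 1, 2],
]
stat, dof, expected, cells = chi_square_table(toy)
```

Cells with large contributions, read together with the sign of O − E, indicate which haplogroup is over- or under-represented in which zone.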

Table 1 Frequencies, observed and expected (in italics), of the Y-chromosome haplogroups in the northern, central-eastern and south-western zones of Sardinia

In contrast, when places of birth of sampled individuals were used to distribute the different Y-chromosome haplogroups, no heterogeneity among the three areas was detected (χ2[10]=13.36, P=0.204), underlining the effect of migration.
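The P values attached to the two reported statistics can be checked without statistical tables, because for an even number of degrees of freedom the chi-square upper tail has a closed form: P = exp(−x/2) · Σ (x/2)^i / i!, summed over i from 0 to df/2 − 1. The sketch below applies it to the two values given in the text; the function name is hypothetical.

```python
from math import exp, factorial


def chi2_upper_tail_even_df(x, dof):
    """Upper-tail probability of a chi-square variable with even dof."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive even dof")
    half = x / 2.0
    return exp(-half) * sum(half ** i / factorial(i) for i in range(dof // 2))


# Statistics reported in the text, both with 10 degrees of freedom.
p_surname = chi2_upper_tail_even_df(34.93, 10)  # zones assigned by surname origin
p_birth = chi2_upper_tail_even_df(13.36, 10)    # zones assigned by birthplace
```

With these inputs, `p_birth` comes out close to the reported 0.204 and `p_surname` falls below 0.001, consistent with the significance levels stated for the two distributions.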

Discussion

Sardinia appears to be a particularly appropriate test case to evaluate the extent to which surnames are informative in identifying the history of Y-chromosome haplogroups.

As in other European populations, almost all the Sardinian Y chromosomes belong to haplogroups E-M35, I-M170, G-M201, J-12f2 and R-M269. Haplogroups E, G and J, which are believed to have an African (E) or Middle Eastern (G and J) origin and to have entered Europe through different migrations,30,34,35 show frequencies in the same range as other Mediterranean populations. By contrast, haplogroups I-M170 and R-M269 show unusual frequencies. Haplogroup R-M269 represents 20.8% of the Sardinian Y chromosomes, which is the lowest frequency in Western Europe (50–80%).30 Conversely, haplogroup I-M170 shows the highest incidence (41.6%) among western European populations (3–22%),30 and most of it (91.9%) is represented by the subclade I-M26, which in addition is characterized by the compound haplotype 49a,f-Ht12, YCAIIa-21, YCAIIb-11 and DYS19-17, previously proposed as a ‘Sardinian’ marker.36,37 Outside Sardinia, this subclade has been observed only at a very low frequency in the Basques,9 the Iberian Peninsula32 and, as inferred from the presence of YCAIIb-11 (only observed in haplogroup I, and in particular in its subclade M26), in Béarnais and in a few Corsican and central-southern Italian subjects.38,39,40,41,42

In order to search for genetic heterogeneity inside the island, the effect of migration in the last centuries had to be considered. Indeed, demographic studies on the population evolution of Sardinian communes43 demonstrated that, from 1861 to 1991, the mountain area lost 20% of its population in favor of the plain. The distribution of individuals by birth place compared with that of their ancestors' place of origin seems to reflect this process of homogenization (Figure 3).

Figure 3

Distribution of individuals in the three areas by birth place (gray columns) and by their origin detected through surname analysis (white columns). χ2 test for comparison of the two distributions is significant at P-level 0.001.

The results shown in Table 1 may therefore shed light on the genetic history of the different parts of Sardinia. The ‘Sardinian’ subhaplogroup I-M26, which is currently distributed almost uniformly in all parts of the island, shows a high heterogeneity between the areas when samples are redistributed according to the ancestral location of surnames. Interestingly, most surnames of individuals carrying this haplogroup seem to have originated in the central-eastern zone, which includes the archaic area. This supports the antiquity of this haplogroup. Indeed, history tells that indigenous populations retreated to the archaic area when Phoenicians and, later, Carthaginians colonized the southern part of the island, and this was followed by centuries of isolation which allowed genetic drift to increase the haplogroup frequency. Moreover, subhaplogroup I-M26 shows a frequency significantly lower than that expected in the north and a nonsignificant increase in the southwest. Thus, ancient migrations could have brought this haplogroup from the central area towards the more open southern regions, separated only by a failing cultural barrier, more frequently than towards the northern regions, separated by the less accessible geographic barrier. Afterward, recent migrations dispersed I-M26 all over the island.

The isolation of the central-eastern area could also explain the heterogeneous distribution of the R-M269 and G-M201 haplogroups. The low frequency of haplogroup R-M269 in the central-eastern area of Sardinia and its prevalence in the north suggest that R-M269 reached the Sardinian coasts from the continent, possibly after the occurrence and diffusion of the autochthonous I-M26 subhaplogroup. The high frequency in the northern area of haplogroup G-M201, which is scarcely represented in Europe and in the Middle East,30 could be due to genetic drift.

In conclusion, new methods of sampling are called for, and surnames, by allowing the detection of the common genetic origin of families, can bypass the effect of recent migrations and reveal real genetic differences. Even if this analysis is obviously limited to the male component of the population, the results obtained on genetic heterogeneity could be extended to the entire Sardinian population because of the small size of the areas within which matrimonial exchanges occurred.44 Finally, this geo-linguistic approach could be used more generally to select samples of individuals as controls for epidemiological studies.