Introduction

People constituting an ethnic community share characteristic cultural traits of language, religion, patterns of social interaction and a feeling of unity and solidarity. While community members also believe that they share a common origin and history, very often this is only a myth. Many scholars maintain that language is an important if not the strongest feature of ethnic identity. Language represents the natural way of thinking, communicating, and associating people with each other, thus providing the main social tie1 that could influence the genetic structure of an ethnic group through marriage and reproduction. The spoken language can, in fact, act as a genetic and cultural barrier, increasing the genetic variation and divergence between populations. However, the construction of ethnic identity does not necessarily involve biological contributions.

Among the various groups speaking non-Italian languages who have slowly penetrated Italy since prehistoric times, the Croatian-speaking community of Molise represents the smallest ethnolinguistic minority.2

Historic background

Population contacts between the two coasts of the Adriatic have existed since prehistoric times. According to historic data, the first migrations of Croatians into central-southern Italy began as early as the 10th century and continued up to the 15th century. Small groups and families crossed the Adriatic and formed their own small colonies.3 In contrast with these sporadic migrations, more lasting visible traces were left by Croatian immigrants to southern Italy during the 15th and 16th centuries. At that time, the Ottomans were advancing through the Balkan Peninsula to the Adriatic, forcing many population groups to flee from the continental interior toward the Adriatic coast, where they resettled on the Middle and Southern Adriatic (Dalmatian) islands, on the Istrian peninsula to the northwest and in the southern regions of Italy across the Adriatic. According to historic sources, 15 Croatian communities resettled in the region of Molise, with a total of 7000 to 8000 inhabitants.4 Their number grew to over 15 000 and then gradually declined, partly due to assimilation with the Italian population. Today, their descendants (total: 2081 individuals5) live in the mountainous inland villages of Acquaviva Collecroce, San Felice del Molise and Montemitro. Although Croatian is still spoken in all three villages, the number of individuals and families speaking it has greatly decreased through economically motivated migrations to other regions of Italy and overseas countries.2

Linguistic evidence of the origin of Croatian populations

Throughout the last century, several different hypotheses as to the origin of these Croatian immigrants were put forth, along with a range of possible locations of origin including Istria and northern Dalmatia or Middle Dalmatia. However, a thorough linguistic analysis comparing the Molise speech with the local vernaculars of 52 villages in Istria and Dalmatia has shown that the speech of the inhabitants of Molise is undoubtedly a štokavian-ikavian dialect of the Croatian spoken in Middle Dalmatia before the Turkish invasion.6 The linguistic analysis clearly supports the hypothesis that the ancestral homeland of the immigrants into the region of Molise is to be found in this area, specifically between the Cetina and Neretva rivers.

The present study examines the genetic effects of geographic and linguistic subdivision in Molise, a small region of Italy. In particular, direct sequencing of the two most variable segments of the mitochondrial DNA (mtDNA) control region and high-resolution RFLP analysis of the whole molecule were performed to describe mtDNA variation in the Croatian minority of Molise, the Croatian parental population and the neighbouring Italian groups in order to assess whether the spoken language has contributed to the preservation of the genetic identity of the Croatians living in Italy.

Materials and methods

The study was carried out on three groups of unrelated and apparently healthy people with attested maternal genealogy. The first group consisted of 41 schoolchildren (age range, 11–15 years) from the three villages belonging to the Croatian ethnolinguistic minority of Molise (Acquaviva Collecroce, Montemitro and San Felice del Molise); the second comprised 199 individuals from central-southern Italy (52 from Lazio, 48 from Campania, 73 from Abruzzo/Molise, and 26 from Puglia); the third group consisted of 96 individuals from four villages located on the Makarska coast near the islands of Brač and Hvar, from north-west to south-east: Krilo Jesenice (35), Mimice (24), Zaostrog (13) and Živogošće (24). In addition, 311 individuals chosen randomly from the parish registries of three Croatian islands in the eastern Adriatic (Middle Dalmatian islands) (105 from Brač, 108 from Hvar, and 98 from Korčula7, 8), for which HVS-I complete sequences and RFLP analyses were available, were also included. The Krk sample was not used for comparison since the ethnohistory of this island differs considerably from that of the other Croatian islands.7 Figure 1 shows all sampling localities in Italy and Croatia and the reconstruction of historic migration directions.

Figure 1

Geographical map showing the location of the study samples and the migration routes from Croatia to Italy.

Sample preparation and DNA extraction

Croatian ethnolinguistic minority of Italy

Total genomic DNA was extracted from small blood samples according to the method described in Budowle et al.9 The blood sample collection strategy is reported elsewhere.10

Populations of Central Italy

After obtaining informed consent, blood or mouth swab samples were collected at the Blood Transfusion Centre of the Hospital of Rome ‘Umberto I’. DNA was extracted following the procedures of Miller et al11 and Budowle et al.9

Populations of Croatia

DNA extraction from blood samples and mtDNA analyses were performed following the procedures reported in Tolk et al.7

mtDNA analyses

D-loop PCR amplification and direct sequence analysis

The mtDNA HVS-I and HVS-II of the Croatian–Italian and the Italian samples were amplified simultaneously by applying the multiplex amplification strategy. For a detailed description of PCR and automated sequencing conditions, see Rickards et al.12, 13

For the southern Croatian samples, HVS-I was amplified and sequenced between nps 16024 and 16400.14, 15

RFLP screening

RFLP screening was performed in a hierarchical manner. Purified DNA of selected Croatian-Italian, Croatian and Italian samples was used to amplify the polymorphic sites defining Eurasian and African mtDNA lineages.16 Primers and conditions are reported in the Supplementary Information.

Haplogroup classification strategy

For all the samples, 384 (376 for the coastal Croatians) bps of the HVS-I (from positions 16016/16024 to 16400) sequence and RFLPs were determined to allocate a sequence to its haplogroup;16 for the Italian and Croatian-Italian samples, information on HVS-II (from position 030–408) was also considered. Length variations were not considered in phylogenetic and statistical analyses. To avoid controversies with phylogenetic relations revealed by complete mtDNA sequences,17 finer classification of control region sequences of haplogroups U, T, and J at subhaplogroup level (sensu18) was avoided.

Population-genetic and phylogeographic analyses

Gene and nucleotide diversities were calculated according to Nei.19 The average number of nucleotide pairwise differences, the number of segregating sites, and Tajima's D statistics20 were estimated. All these analyses were based on HVS-I (bps 16024–16400) and performed using Arlequin package software.21 Distance matrices between populations were generated using two methods: Reynolds' distance22 and Slatkin's linearized Fst's.23 Multidimensional scaling analysis (MDS24) based on the above reported distance matrices and principal component (PC) analysis based on haplogroup frequencies were performed with STATISTICA package software (StatSoft). The admixture proportions of the Italian and Croatian populations in the Croatian minority of Molise were quantified using the ADMIX 2.0 program.25 Estimates of the admixture coefficients and their SEs were calculated using 1000 bootstrap replicates. A median network based on the algorithm by Bandelt et al.26 was manually created to show the phylogenetic relationship of the combined RFLP and HVS-I data.
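The diversity indices named above can be illustrated with a minimal sketch. It is not the Arlequin implementation: it assumes each HVS-I haplotype is encoded as the set of positions that differ from the reference sequence, that every variable site is biallelic, and it uses the standard textbook formulas for Nei's gene diversity and Tajima's D. The toy sample and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Assumption: a haplotype is the set of HVS-I positions differing from the reference.
SEQ_LENGTH = 16400 - 16024 + 1  # number of positions in the range 16024-16400

def gene_diversity(haplotypes):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))

def mean_pairwise_differences(haplotypes):
    """Average number of differing positions over all pairs of sequences."""
    pairs = list(combinations(haplotypes, 2))
    return sum(len(a ^ b) for a, b in pairs) / len(pairs)

def segregating_sites(haplotypes):
    """Positions that are variant in some but not all sequences."""
    union = frozenset().union(*haplotypes)
    shared = frozenset.intersection(*haplotypes)
    return len(union - shared)

def tajima_d(haplotypes):
    """Tajima's D from the mean pairwise difference and the segregating sites."""
    n = len(haplotypes)
    s = segregating_sites(haplotypes)
    if s == 0:
        return float("nan")  # D is undefined without variation
    k = mean_pairwise_differences(haplotypes)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))

# Hypothetical toy sample of five sequences (not data from this study).
sample = [
    frozenset(),                       # identical to the reference
    frozenset(),
    frozenset({16126, 16294}),
    frozenset({16126, 16294, 16304}),
    frozenset({16311}),
]
H = gene_diversity(sample)
MPD = mean_pairwise_differences(sample)
S = segregating_sites(sample)
PI = MPD / SEQ_LENGTH                  # nucleotide diversity per site
D = tajima_d(sample)
print(H, MPD, S, PI, D)
```

Dividing the mean pairwise difference by the sequence length gives the per-site nucleotide diversity, which is the quantity compared with the European range in the Results.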

Results and discussion

Genetic diversity

The number of haplotypes, their frequency, the diversity indexes, and the mean number of pairwise differences according to HVS-I (Table 1) were calculated to assess the mtDNA genetic diversity of the ethnolinguistic minority of Molise and their neighbouring Italian and Croatian parental populations. Molecular information obtained from the three Croatian-Italian samples was pooled together since frequency analysis of the identified haplogroups did not show any significant differences between the three villages. In fact, when 95% credible regions were calculated, according to Richards et al,16 the estimated frequencies overlapped. Moreover, many haplotypes were identical in the three Croatian villages. Other haplotypes that were found were closely related to these. Also Fst values between the three villages did not show any significant differences (data not shown).
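As an illustration of the overlap check on credible regions, the sketch below computes an equal-tailed 95% credible region for a haplogroup frequency and tests whether two regions overlap. It assumes a uniform prior on the frequency, which gives a Beta posterior; the text does not specify the prior or the exact procedure, so this is only one plausible reading. Village-level counts are not given here, so the example uses two clade proportions listed in the caption of Figure 3 (17/41 and 51/96). Function names are hypothetical.

```python
from math import exp, log

def credible_region(k, n, level=0.95, grid=20000):
    """Equal-tailed credible region for a proportion with k hits out of n.

    Assumes a uniform prior, so the posterior density is proportional to
    p**k * (1-p)**(n-k); it is evaluated numerically at grid midpoints.
    """
    ps = [(i + 0.5) / grid for i in range(grid)]
    log_dens = [k * log(p) + (n - k) * log(1 - p) for p in ps]
    peak = max(log_dens)
    dens = [exp(x - peak) for x in log_dens]   # rescaled to avoid underflow
    total = sum(dens)
    tail = (1 - level) / 2
    cumulative = 0.0
    lower = upper = None
    for p, d in zip(ps, dens):
        cumulative += d / total
        if lower is None and cumulative >= tail:
            lower = p
        if upper is None and cumulative >= 1 - tail:
            upper = p
            break
    return lower, upper

def regions_overlap(a, b):
    """True when two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

minority_region = credible_region(17, 41)   # Croatian-Italians, clades of Figure 3a
coast_region = credible_region(51, 96)      # Croatian coast, clades of Figure 3a
print(minority_region, coast_region, regions_overlap(minority_region, coast_region))
```

Overlapping regions mean only that the data do not separate the two frequencies; they are not by themselves evidence that the frequencies are equal.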

Table 1 Diversity indexes for HVS-I in the Croatian-Italian, Italian and Croatian populations

These findings are consistent with the homogeneity revealed by previous genetic studies on nuclear loci.10

The highest gene diversity in the mtDNA HVS-I region was observed among the southern Italians (Campania and Puglia); this is in line with the nonhomogeneous genetic origin of these populations.27 The Croatian-Italians, and the Croatians in particular, displayed somewhat lower values. The reduced gene diversity and the high frequency of repeated haplotypes displayed by the Middle-Dalmatian sample are very likely a result of inbreeding, genetic drift and founder effect due to a certain degree of isolation of these populations, which is consistent with ethnohistoric records.6, 28 The nucleotide diversity index of the three groups displayed values that were comparable with those of other European populations (range of variation: 0.010–0.01929). The average number of pairwise nucleotide differences (MPSD) was similar for the Italians and the Croatian-Italians, while the Croatians showed lower values. The Croatians and the Italians showed a bell-shaped distribution of HVS-I pairwise sequence differences, and negative and mostly significant Tajima's D values (see Table 1), likely signalling a late Pleistocene demographic expansion affecting the majority of European populations.30 In contrast, the pairwise distributions in the Croatian ethnolinguistic minority of Molise showed two major peaks, one at six and another at three substitutional differences (data not shown), which is compatible with a scenario of admixture events with Italian females, since heterogeneity among the three villages was excluded. In fact, Brakez et al,31 analysing the mtDNA sequence variation in the Moroccan population, showed that admixture can create a bimodal pairwise difference distribution.
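The pairwise difference (mismatch) distribution discussed above is simply a histogram of the number of differences over all pairs of sequences. A minimal sketch follows, assuming haplotypes are encoded as sets of variant positions; the four toy haplotypes are hypothetical and were chosen so that two divergent lineages produce more than one peak.

```python
from collections import Counter
from itertools import combinations

def mismatch_distribution(haplotypes):
    """Count how many sequence pairs differ at 0, 1, 2, ... positions."""
    return Counter(len(a ^ b) for a, b in combinations(haplotypes, 2))

# Hypothetical toy sample: two closely related pairs drawn from divergent lineages.
toy = [
    frozenset(),
    frozenset({16311}),
    frozenset({16069, 16126, 16145}),
    frozenset({16069, 16126, 16145, 16311}),
]
dist = mismatch_distribution(toy)
for differences in range(max(dist) + 1):
    print(differences, "#" * dist[differences])
```

In this toy case the within-lineage pairs pile up at one difference and the between-lineage pairs at three and four, which is the kind of multimodal shape that mixing divergent lineages can produce.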

Haplogroup frequency distribution

The samples were classified according to the existing mtDNA haplogroup nomenclature32, 33 on the basis of control region sequence data and diagnostic RFLP sites from coding region data (sequence and RFLP data are reported in Tables 2 and 3 of the Supplementary Information). The majority of the sequence types could be classified as belonging to typical European and Western Asian haplogroups (Table 2). Only one Italian sequence, from Lazio, could not be attributed to any specific haplogroup. It displayed only two transitions in HVS-I with respect to the Cambridge Reference Sequence (CRS),14 namely at bps 16224 and 16273, and the RFLP analysis excluded its inclusion in either haplogroup H or K. The quantity of DNA for further analysis was too small to allow its proper classification in the phylogeny.

Table 2 Haplogroup frequency distribution in the Croatian-Italian, Italian and Croatian populations

Sequences belonging to haplogroups that are uncommon in European populations, such as L1b, L1c, M1, R*, U6a, U7, D and F, were found sporadically. The presence of three sequences belonging to the northwestern African subhaplogroup U6a in the Italians (Lazio) might be of interest, since this haplogroup has previously been found in Europe at notable frequencies only in the Iberian Peninsula, where U6 could have been introduced as early as a prehistoric African occupation.34, 35 The occurrence of these haplotypes in Lazio could also be attributed to gene flow from North West Africa, suggesting a possibly wider range of contacts between the circum-Mediterranean populations in the past.

Although Italians, Croatian-Italians and Middle-Dalmatian Croatians share the same basic haplogroups at similar frequencies, some of the identified haplogroups were differentially distributed among the three studied groups, a possible indication of genetic affinities between the Croatian-Italians and the other two populations. It should be emphasized, however, that haplogroup frequencies in individual populations are mainly subject to drift, as exemplified by the wide range of proportions shown by individual haplogroups in the Croatian villages (ie 0–15% for T), quite apart from possible sampling errors. The frequency pattern could therefore be misleading and may not be the best way to estimate the parental contributions. For this reason, molecular differences between mitochondrial sequences rather than frequency differences between haplogroups were used in the following analyses.

Interpopulation differences

Intermatch analysis

To investigate the affinities of the linguistic minority of Italy, the Croatian source populations and the Italians, the mean number of pairwise differences between the Croatian-Italian mtDNA sequences and those detected in the various Croatian and Italian villages (intermatch) was calculated; the lower the pairwise difference, the more closely related genetically the populations under comparison are. This analysis shows that the MPSD values are similar and low for both Croatia (coastal region: 4.633, P=0.004; islands: 4.953, P=0.058) and total Italy (5.057, P=0.238); no differences in the MPSD values were found when the comparisons were made with each Italian region separately (data not shown). This indicates that the mtDNA sequences found in the three Croatian-speaking villages of Molise are closely related to both the present-day Croatians, especially those living in the coastal area, and the surrounding Italian population. Such a pattern is expected from populations that may have interchanged a significant number of migrants over long periods of time.36 In fact, when the intermatch analysis was extended to populations from different continents, the obtained values were higher (from 7.902, P=0.691 for Africa to 5.985, P=0.668 for Asia).
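The intermatch statistic described above can be sketched as the mean number of differences over all pairs formed by one sequence from each population. The example assumes haplotypes encoded as sets of variant positions and uses hypothetical toy populations; it does not reproduce the P values, whose computation is not described here.

```python
from itertools import product

def intermatch_mean_differences(pop_a, pop_b):
    """Mean number of differing positions over all cross-population pairs."""
    pairs = list(product(pop_a, pop_b))
    return sum(len(a ^ b) for a, b in pairs) / len(pairs)

# Hypothetical toy populations (not data from this study).
pop_a = [frozenset(), frozenset({16311})]
pop_b = [frozenset({16311}), frozenset({16126, 16294})]
value = intermatch_mean_differences(pop_a, pop_b)
print(value)
```

Unlike the within-population mean pairwise difference, every pair here joins sequences from two different samples, so a low value reflects shared or closely related lineages between the populations.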

Genetic distances and migration rates between populations

To better define the maternal contribution of the Italian and the Croatian populations to the Croatian ethnolinguistic minority living in Molise, genetic distance analysis was performed. The obtained distance matrices between the populations were visualized by MDS analysis. Figure 2 shows only the MDS plot based on Slatkin's linearized Fst, since the Reynolds' distances gave a similar plot (data not shown). The values reported by Kruskal24 for the stress statistic, which measures the goodness of fit between distances in the configuration space and the monotonic function of the original distances, indicate that the level of stress of the present analysis is good, that is, the degree of distortion of the pairwise distances in the three-dimensional space is acceptable. Moreover, the not appreciably lower value of the stress provided by the four-dimensional plot is a further indication that the three-dimensional representation reproduces the distance matrix of the populations reasonably well.
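The two distance transforms and the stress statistic mentioned above can be written compactly. This sketch assumes the forms commonly given for these quantities: Slatkin's linearized distance as Fst/(1 - Fst), Reynolds' distance as -ln(1 - Fst), and Kruskal's stress-1 as the normalized residual between configuration distances and fitted disparities. The MDS fit itself, and the monotone regression that yields the disparities, are not shown; the Fst value and the distance lists are hypothetical.

```python
from math import log, sqrt

def slatkin_linearized(fst):
    """Slatkin's linearized distance, assumed here as Fst / (1 - Fst)."""
    return fst / (1 - fst)

def reynolds_distance(fst):
    """Reynolds' distance, assumed here as -ln(1 - Fst)."""
    return -log(1 - fst)

def stress1(config_distances, disparities):
    """Kruskal's stress-1 between plot distances and fitted disparities."""
    numerator = sum((d - dh) ** 2 for d, dh in zip(config_distances, disparities))
    denominator = sum(d ** 2 for d in config_distances)
    return sqrt(numerator / denominator)

fst = 0.2                               # hypothetical pairwise Fst
print(slatkin_linearized(fst), reynolds_distance(fst))

plot_distances = [1.0, 2.0, 2.0]        # hypothetical distances in the MDS plot
fitted = [1.0, 2.0, 2.5]                # hypothetical disparities
print(stress1(plot_distances, fitted))
```

Both transforms increase monotonically with Fst, which is consistent with the two distance matrices yielding similar MDS plots.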

Figure 2

Multidimensional scaling (MDS) plot of the genetic distances between the Croatian-Italian, Italian (full squares) and Croatian (empty circles: islands; full circles: coastal villages) populations on the basis of mtDNA HVS-I sequences in three-dimensional space.

A striking feature of the MDS representation is the intermediate position of the Croatian-Italians between the Italian and the Croatians. This pattern again seems to be compatible with a certain degree of admixture between the Croatian population of Molise and the surrounding Italian populations. The genetic relationships between populations obtained when PC analysis based on haplogroup frequencies is applied are similar to those based on Slatkin's linearized Fst distance matrix (data not shown).

To quantitatively estimate the degree of admixture in the ethnic minority of Molise, the method proposed by Dupanloup and Bertorelle25 was applied. The admixture coefficients are 41% (SD 20%) for the coastal Croatian parental population and 59% for the Italian population from Molise. The estimated Croatian and Italian contributions to the Croatian-Italian gene pool did not change substantially when either the coastal or the island villages were taken as the Croatian parental population, or when either the Molise sample or the total Italian sample was taken as the Italian parental population (data not shown). However, this result must be taken with due caution: owing to the large standard deviation associated with the admixture coefficients, the present data do not allow reliable estimates of the parental contributions. To evaluate these estimates more reliably, a phylogeographic approach was adopted, which also makes it possible to distinguish between a possible Croatian origin (from either the coast or the islands) and an Italian origin for each of the sequences found in the Croatian-Italians.
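To show why bootstrap standard deviations become large with small samples, the sketch below resamples individuals and recomputes an admixture coefficient 1000 times. It deliberately swaps in a much simpler estimator than the one used in the study: a least-squares fit of the hybrid haplogroup frequencies as a mixture of two parental frequency vectors. It is not the Dupanloup and Bertorelle method and ignores molecular distances; all samples and names are hypothetical.

```python
import random
from collections import Counter
from statistics import stdev

def frequencies(sample, haplogroups):
    counts = Counter(sample)
    return [counts[h] / len(sample) for h in haplogroups]

def admixture_coefficient(hybrid, parent1, parent2, haplogroups):
    """Least-squares share of parent1 in the hybrid (simplified stand-in estimator)."""
    h = frequencies(hybrid, haplogroups)
    p1 = frequencies(parent1, haplogroups)
    p2 = frequencies(parent2, haplogroups)
    denominator = sum((a - b) ** 2 for a, b in zip(p1, p2))
    if denominator == 0:
        return None                    # parental samples are indistinguishable
    numerator = sum((x - b) * (a - b) for x, a, b in zip(h, p1, p2))
    return numerator / denominator

def bootstrap_sd(hybrid, parent1, parent2, haplogroups, replicates=1000, seed=0):
    """Standard deviation of the estimate over bootstrap resamples of individuals."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        resampled = [rng.choices(s, k=len(s)) for s in (hybrid, parent1, parent2)]
        m = admixture_coefficient(*resampled, haplogroups)
        if m is not None:
            estimates.append(m)
    return stdev(estimates)

# Hypothetical toy samples of ten individuals each, labelled by haplogroup.
groups = ["H", "J", "T"]
parent1 = ["H"] * 6 + ["J"] * 2 + ["T"] * 2
parent2 = ["H"] * 2 + ["J"] * 6 + ["T"] * 2
hybrid = ["H"] * 4 + ["J"] * 4 + ["T"] * 2
m = admixture_coefficient(hybrid, parent1, parent2, groups)
sd = bootstrap_sd(hybrid, parent1, parent2, groups)
print(m, 1 - m, sd)
```

The two parental contributions sum to one by construction, so a single standard deviation applies to both, as with the 41% and 59% reported above.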

A median network was constructed to detect the exact matches and derivatives within the study populations (Figure 3a–c). The network shows three specific matches between the Croatian-Italians and the Italians that were not found in the Croatian sample, equally large in size: K (16188–16218–16224–16311), U2 (16051–16189–16266–16294), T (16126–16294–16304), and four matches with the Croatians of Croatia that were not observed in the Italian sample: T1 (16126–16163–16186–16189–16294), U5a (16256–16270–16399), J (16069–16126–16366). Only two of these matches (ie in K and U2) are specific, whereas other lineages occur with different frequencies among many European populations; the haplogroup J lineage (16069–16126–16366) seems to be frequent in the northwestern Balkan area but almost absent elsewhere (author's unpublished data).
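Finding exact matches that are restricted to one parental population reduces to set operations on haplotype motifs. The sketch below uses the six motifs quoted above as a hypothetical, heavily reduced stand-in for the real samples; the full haplotype lists are in the Supplementary Information and are not reproduced here.

```python
# Motifs quoted in the text, written as strings of HVS-I positions.
K = "16188-16218-16224-16311"
U2 = "16051-16189-16266-16294"
T = "16126-16294-16304"
T1 = "16126-16163-16186-16189-16294"
U5A = "16256-16270-16399"
J = "16069-16126-16366"

# Hypothetical reduced samples, consistent with the matches listed in the text.
croatian_italians = {K, U2, T, T1, U5A, J}
italians = {K, U2, T}
croatians = {T1, U5A, J}

# Exact matches shared with one parental population but absent from the other.
italian_only = (croatian_italians & italians) - croatians
croatian_only = (croatian_italians & croatians) - italians
print(sorted(italian_only))
print(sorted(croatian_only))
```

An exact match alone does not establish origin, since a motif that is widespread in Europe can be shared by chance; this is why only some of the matches are treated as specific.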

Figure 3

Median-network showing the phylogenetic relationship of mtDNA haplotypes detected among Croatian-Italians of Molise, Italians, populations of the Croatian coast and the southern Croatian islands. Diagnostic RFLP sites and nucleotide positions of the mutations in the HVS-I (without the prefix 16 000) compared with the Cambridge Reference Sequence14, 15 are indicated along the branches connecting the haplotypes. The node sizes are proportional to haplotype frequencies specified by the number of individual sequences within each node. Origin of the samples: black nodes — Croatian-Italians of Molise; grey nodes — Italians; white nodes – populations of the Croatian coast and the southern Croatian islands. (a) The structure of mtDNA haplogroups H, HV, V, pre-V, and pre-HV. Proportion of individuals in these clades vs total sample size: Croatian-Italians of Molise (17/41), Italians from Campania (20/48), Puglia (15/26), Molise/Abruzzo (39/73), Lazio (26/52), Croatian Coast (51/96), Middle Dalmatian Islands (170/311). (b) The substructure of mtDNA superhaplogroups L3* and N*. Proportion of individuals in these clades vs total sample size: Croatian-Italians of Molise (4/41), Italians from Campania (8/48), Puglia (4/26), Molise/Abruzzo (3/73), Lazio (4/52), Croatian coast (9/96), Middle Dalmatian Islands (20/311). (c) The structure of mtDNA haplogroups J, T, U, F and R. Proportion of individuals in these clades versus total sample size: Croatian-Italians of Molise (20/41), Italians from Campania (20/48), Puglia (7/26), Molise/Abruzzo (31/73), Lazio (21/52), Croatian coast (36/96), Middle Dalmatian Islands (121/311).

Interestingly, the unusual haplogroup HV type characterized by transitions at 16093 and 16311, while quite frequent in the Croatian-Italians (7%), is missing in both Italy and Croatia. It has been observed so far only in the Romanians,16 pointing to a possible eastern European origin of this haplotype. Otherwise, the unusual Croatian-Italian type could also have derived from an independent mutation at the hot-spot position 16093 in the frequent HV haplotype defined by the single mutation at 16311. The HV-16311 type is completely absent in continental Italy, whereas it is highly frequent in Croatia, where it has been observed in all local populations, with the highest frequency in Brač (9%). The Croatian-Italian unique N1a haplotype, shared by two individuals, derives from one transition at np 16209 from their common founder type sampled in Estonia, Lithuania, and Russia37 (our unpublished data). So, a potential Slavic source could be argued or at least speculated for this lineage as well. The haplogroup H haplotype with 16134 and 16362 mutations has its closest relative among the Kabardinians.16 This motif derives from the likely paraphyletic subfounder lineage defined by 16362 mutation. H-16362 frequency in continental Italy is about 0.5%, whereas the Croatian average is five times higher (2.5%), reaching 6% (8/133) in the northwest island of Krk (our unpublished data). The U5a founder haplotype 16114A–16192–16256–16270–16294, from which the Croatian-Italian sample CRO189 uniquely derives at np 16369, is absent in the total pool of continental Italy, but it is common in Slavic and north-eastern European populations38, 16 and it has been sampled twice also in our unpublished Croatian and Bosnian samples, indicating its likely Slavic origin. The same potential Slavic origin applies to the R* lineage characterized by the recurrent 16343 mutation, according to the available distribution of R* CRS haplotypes and their absence in continental Italy. 
The ancestral haplotype of X (CRO179) has been sampled in central Italy39 and Turkey but not in Croatia. The H-CRS haplotype frequency in the Croatian-Italian sample (15%) is also more similar to the Italian average (16%) than to the averages of either the Croatian mainland (4%) or the island (9%) populations. The bulk of evidence gained from the phylogeographic approach therefore points again to the influence of neighbouring populations on the Croatian community of Italy.

Conclusions

The founder analysis by individual haplogroups at the HVS-I resolution level indicated that the maternal gene pool of the three communities of Molise derives in roughly equal parts from the Croatian parental population and the Italian receptor population of Molise, highlighting the high penetrance of Italian maternal lineages into their gene pool. This result corroborates what was inferred from the admixture analysis, which, although associated with large standard errors, converges on the picture obtained by the phylogeographic approach. Moreover, the genetic similarity between the Croatians of Molise and their neighbouring Italian populations, as highlighted by the MDS representation and intermatch analysis, indicates that there was no reproductive isolation between the two geographically proximate, yet culturally distinct populations. Therefore, the presence of appreciable levels of gene flow between the Croatian ethnic minority and their neighbouring Italian populations indicates that differences in language or other cultural traits did not create reproductive barriers in the maternal gene pool. Other examples of genetic admixture of culturally distinct groups have been reported elsewhere in Europe.40, 41, 42 For example, very strong cultural differences exist between Romani and non-Romani populations but these groups are genetically indistinguishable.43

Overall, our results support the idea that the Croatian-speaking population living in Molise was not reproductively isolated. Of course, the fact that Croatian-Italians from Molise are not genetically distinct does not undermine the notion of ethnic and cultural distinctiveness.

Studies dealing with other DNA markers would augment the current mtDNA data and better delineate the scenario outlined in this study. For instance, it would be interesting to test whether analysis of Y-chromosome polymorphisms reveals a higher degree of genetic isolation of the Croatian minority of Molise from the neighbouring Italian populations due to asymmetric migration of males. While there are examples of a differential rate of migration between sexes among human populations,44, 45, 46 recent analysis of the global distribution of human mtDNA and Y chromosome variation47 shows a relatively symmetrical gene flow for females and males. It would be interesting to see if the Croatian-Italian genetic history conforms to this global model or is another interesting exception.

If this reconstruction proves correct, further studies on Y-chromosome polymorphisms would very likely not modify the Croatian-Italian genetic history inferred from mtDNA data.