Introduction

The study of individual variation is one of the principal areas of interest for geneticists. Variation is the key to evolution. In the eighties, genetic research shifted much of its attention from gene products to the genes themselves. With the introduction of techniques that allow the direct study of nucleic acids, and with continuous improvements to these techniques, it has become possible to study variation at the DNA level, but so far the new techniques have been applied to this avenue of research only to a very limited extent. In fact, at its inception, the Human Genome Project showed very little interest in variation [1].

In 1991 a project for testing human genomic diversity was proposed [2], and a HUGO committee was appointed by its President, Sir Walter Bodmer, with the aims of (a) planning a systematic study of genetic differences in suitably chosen samples of our species, and (b) saving, for future analysis, the DNA of a significant proportion of individuals representing current human diversity. The program will in part focus on populations that are currently vanishing, owing either to a combination of low fertility and high mortality or to a loss of identity through acculturation, urbanization and migration. The human species is moving toward increasingly intensive amalgamation. The genetic differences between extant populations, which are rapidly disappearing, are an irreplaceable source of information for understanding our evolution. Genetic information is being obtained from fossil records [3], but if we have no bona fide modern data with which to compare it, this knowledge will be of little use.

In this paper we summarize the genetic knowledge about Europe derived from classical (pre-DNA) polymorphisms, and discuss some of its specific problems in the framework of the human genome diversity program. Europeans form more than one-tenth of the world’s population [4] and have a different social structure compared to much of the rest of the world.

A Genetic Picture of Europe

We review here the main conclusions of research into the genetic history of Europe which are part of a forthcoming book [5], where detailed references will be found.

Europeans Are Genetically More Homogeneous

Compared with the aborigines of other continents, Europeans are more homogeneous. Genetic data collected from the available literature are summarized in table 1: genetic loci, alleles, number of collected samples, mean gene frequencies, FST values [6] in Europe and, for comparison, in Africa, Asia, America and Australia are listed. The mean gene frequencies in Europe are intermediate with respect to those of other continents, but this may be in part an artifact, because almost all polymorphisms were first detected in Europeans. FST values are the variances of gene frequencies among populations divided by the variances due to sampling [6]: they measure standardized between-population genetic differences. They are lower in Europe than in other parts of the world, and this is initial evidence for the genetic homogeneity of Europeans when compared with the populations of other continents.
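As a minimal illustration of the definition above, the following Python sketch computes a standardized variance for one allele. It assumes the unweighted Wahlund form, the variance of the allele frequency among populations divided by p̄(1 − p̄), which may differ in detail from the estimator of ref. 6. The function name `fst` and the frequencies are hypothetical and are not taken from table 1.

```python
def fst(freqs):
    """Standardized variance of one allele's frequency among populations.

    Assumption: unweighted Wahlund form, Var(p) / (p_bar * (1 - p_bar)),
    with every population given equal weight.
    """
    k = len(freqs)
    p_bar = sum(freqs) / k
    if p_bar in (0.0, 1.0):
        return 0.0  # allele absent or fixed everywhere: no variation to standardize
    var_p = sum((p - p_bar) ** 2 for p in freqs) / k
    return var_p / (p_bar * (1.0 - p_bar))


# Hypothetical frequencies of one allele in four populations.
homogeneous = [0.30, 0.32, 0.29, 0.31]  # similar populations
divergent = [0.10, 0.50, 0.25, 0.40]    # strongly differentiated populations

print(fst(homogeneous), fst(divergent))
```

Under this form the value lies between 0 (identical frequencies) and 1 (alternative alleles fixed in different populations), so a lower value for a set of populations indicates greater homogeneity.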

Table 1 Number of samples (A), mean gene frequencies × 1,000 (B), FST values × 1,000,000 (C), for the genetic loci and alleles in five continents

On the basis of the genetic loci and alleles listed in table 1 and by eliminating several less tested populations, genetic distances between 26 selected populations were calculated according to the coancestry coefficient of Reynolds et al. [7]. The populations were Austrian, Basque, Belgian, Czechoslovakian, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Icelandic, Irish, Italian, Lapp, Norwegian, Polish, Portuguese, Russian, Sardinian, Scottish, Spanish, Swedish, Swiss, Yugoslavian. The matrix of genetic distances is given in table 2, and a phylogenetic tree reconstructed by average linkage [8] from the distances in table 2 is shown in figure 1.
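The two steps named above, a coancestry distance and average linkage clustering, can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used for table 2 and figure 1: the coancestry coefficient is taken in its simple large-sample form without sample-size corrections, and the three populations and their allele frequencies are hypothetical.

```python
def reynolds_theta(pop_a, pop_b):
    """Coancestry coefficient between two populations.

    Each population is a list of loci; each locus is a list of allele
    frequencies summing to 1. Assumed large-sample form (no correction
    for sample size), summed over loci.
    """
    num = sum((a - b) ** 2
              for la, lb in zip(pop_a, pop_b) for a, b in zip(la, lb))
    den = 2.0 * sum(1.0 - sum(a * b for a, b in zip(la, lb))
                    for la, lb in zip(pop_a, pop_b))
    return num / den


def average_linkage(names, dist):
    """Average linkage (UPGMA). dist[(i, j)], i < j, holds pairwise distances."""
    trees = dict(enumerate(names))   # cluster id -> subtree
    sizes = {i: 1 for i in trees}    # cluster id -> number of populations
    dist = dict(dist)
    new = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        for k in trees:
            if k not in (i, j):
                d_ik = dist.pop((min(i, k), max(i, k)))
                d_jk = dist.pop((min(j, k), max(j, k)))
                # Size-weighted mean equals the average over all member pairs.
                dist[(k, new)] = ((sizes[i] * d_ik + sizes[j] * d_jk)
                                  / (sizes[i] + sizes[j]))
        del dist[(i, j)]
        trees[new] = (trees.pop(i), trees.pop(j))
        sizes[new] = sizes.pop(i) + sizes.pop(j)
        new += 1
    return next(iter(trees.values()))


# Hypothetical populations: two biallelic loci each.
pops = {
    "A": [[0.50, 0.50], [0.20, 0.80]],
    "B": [[0.55, 0.45], [0.25, 0.75]],
    "C": [[0.90, 0.10], [0.60, 0.40]],
}
names = list(pops)
dist = {(i, j): reynolds_theta(pops[names[i]], pops[names[j]])
        for i in range(len(names)) for j in range(i + 1, len(names))}
tree = average_linkage(names, dist)
print(tree)  # ('C', ('A', 'B'))
```

The nested tuple records the order of joining: A and B, the closest pair, are clustered first, and C joins last as the outlier.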

Fig. 1 Genetic tree of 26 European populations. Genetic distances are based on the allele frequencies shown in table 1. Details on the data and methods can be found in Cavalli-Sforza et al. [5].

Table 2 Genetic distances (× 10,000) of 26 European populations (lower triangle), and their SEs (upper triangle)

By applying to this tree the bootstrap technique [9], one finds that Lapps are outliers in 76% of the resampled trees; they are replaced as the most extreme outlier by Sardinians 18% of the time; Sardinia is the next outlier in 63% of the bootstrapped trees in which Lapps are first.

There is a group of five other extreme outliers: Greeks, Yugoslavs, Basques, Icelanders and Finns. All of them are separated from the other populations of the tree in > 50% of the bootstrapped trees. The remaining populations form a series of small groups, all of which are geographically close neighbors which appear clustered in the same configuration in < 50% of the bootstrapped trees. If one draws a consensus tree among all the bootstrapped trees by applying what is commonly called the ‘majority rule’ (i.e. an agreement of > 50%), the consensus applies only to the outlier populations and not to the core structure of the tree. This lack of robustness, confirmed by the test of treeness [10], has a simple interpretation: the European populations have not evolved according to a tree of descent. A basic assumption for giving the tree a phylogenetic meaning is that each of its branches evolves independently from the others. This could be true for very distant or isolated populations, but it is a very unrealistic model in the case of Europe, where migrations in prehistoric and historical times are known to have played a major role.
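The logic behind the bootstrap percentages reported above can be sketched in Python. The sketch makes two simplifying assumptions that should not be read as the method of ref. 9: the distance is a plain mean squared difference in allele frequency, and the ‘outlier’ of each replicate is the population with the largest mean distance to all others, used here as a stand-in for the first branch of a reconstructed tree. All names and frequencies are hypothetical.

```python
import random


def mean_sq_diff(pa, pb, loci):
    """Assumed stand-in distance: mean squared allele-frequency difference."""
    return sum((pa[l] - pb[l]) ** 2 for l in loci) / len(loci)


def most_outlying(pops, loci):
    """Population with the largest mean distance to all the others."""
    def mean_dist(name):
        others = [m for m in pops if m != name]
        return sum(mean_sq_diff(pops[name], pops[m], loci)
                   for m in others) / len(others)
    return max(pops, key=mean_dist)


def outlier_support(pops, n_loci, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which each population is the outlier."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(replicates):
        # Resample loci with replacement, keeping the number of loci fixed.
        loci = [rng.randrange(n_loci) for _ in range(n_loci)]
        winner = most_outlying(pops, loci)
        counts[winner] = counts.get(winner, 0) + 1
    return {name: c / replicates for name, c in counts.items()}


# Hypothetical data: one allele frequency per locus, six loci.
pops = {
    "P1": [0.10, 0.15, 0.12, 0.50, 0.48, 0.52],
    "P2": [0.45, 0.50, 0.47, 0.10, 0.15, 0.50],
    "P3": [0.42, 0.48, 0.50, 0.47, 0.50, 0.49],
    "P4": [0.44, 0.51, 0.46, 0.49, 0.46, 0.51],
}
support = outlier_support(pops, n_loci=6)
print(support)
```

A population that is the outlier in nearly every replicate owes its position to many loci, whereas a position that changes from replicate to replicate depends on a few loci and is not robust.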

The Genetically Most Deviant Populations

For the seven outlier populations mentioned above, whose special genetic structure is confirmed by the statistical technique of bootstrapping, it is possible to formulate hypotheses explaining their genetic isolation.

The most deviant population is Lapp (Saame), limited numbers of whom live in the extreme north of Scandinavia, in four countries (Norway, Sweden, Finland and Russia). There are seven or more Lapp groups distinguished by territory, dialect, and preferential occupation (fishing, reindeer breeding, and others). Although they are heavily mixed with Scandinavians and are by now, on average, less pigmented than southern Europeans [11], they have variable external phenotypes, and a fraction of them retain a phenotype characteristic of northern Siberian people, in particular Samoyed, who speak a language of the same family (Uralic). Classical polymorphisms show Lapps to be an admixture, in which European genes predominate, but genes in common with people from the Uralic region may reach between 20 and 50% [12]. This admixture is a reasonable hypothesis to explain their outlier position in the genetic tree.

Sardinians are the next outliers. The island was settled at least 10,000 years ago [13] and the local population had reached substantial numbers (200,000 and more) 3,000 years ago [14], before other foreign colonizers, the Carthaginians, arrived, especially in the south of the island. There were no Greek settlers at the time of Greek colonization of the western Mediterranean. The Roman occupation had little genetic consequence; however, it did affect the language, which is of Latin origin, with substrata of earlier, probably non-Indo-European languages [15]. Later settlers from Italy and Spain had a local and limited influence on certain coastal regions. The earliest founders may have been pre-Neolithic; later contacts and migrations were few enough that the population may have undergone considerable genetic drift and is therefore rather different from other populations. Some genetic similarity with more direct descendants of Paleolithic people from Europe, such as Basques or Caucasians [16], indicates that the first immigrants may have been Paleolithic, and the first archaeological date available for the island is in agreement with this notion. The arrival of Neolithic farmers (who originally came from the Middle East) at some unknown but probably early date, and genetic contributions from both Phoenicians and Carthaginians, may help to explain why Sardinians show a genetic resemblance to the Lebanese that is second, but only by a small amount, to their resemblance to Italians, who are their closest geographic neighbors and who have contributed to the island’s colonization since late Roman times.

Among the group of the less extreme outliers, Basques are probably the most direct descendants of the earliest post-Neanderthal settlers of Europe. They are easily distinguishable from neighbors because of their unique language, which has no known relative in Europe except for languages of the North Caucasian family [17]. Caucasian is also believed to have some ancient relationship with the American Na-Dene and the East Asian Sino-Tibetan families [18]. If so, this group of languages may be a very ancient superfamily that spread widely east and west in northern Asia and Europe during the Paleolithic, perhaps at the time of the replacement of Neanderthals in Europe around 40,000 years ago. It is very difficult to find traces of a common origin in languages after such a long time; the matter is therefore inevitably highly speculative. The fact remains that Basque is still spoken in southwestern France and northeastern Spain, and toponymy shows it was spoken over a wider area at an earlier time [19]. Our analysis shows that there is a marked genetic similarity among the people living today in the regions corresponding to the French and Spanish areas with Basque toponymy, and that these areas show the greatest difference from the rest of western Europe, which was probably settled later. It is presumably not a coincidence that the same area shows the greatest concentration of cave art in the upper Paleolithic. Basques probably maintained an endogamy that was not very rigid but still strict enough to conserve a distinctive genotype, which was probably diluted to some extent by later admixture with new neighbors, beginning especially in Neolithic times. Conservation of a distinct language must have been an important factor in maintaining social and genetic identity. It is very likely that the genetic uniqueness of Basques is a product of the remarkable isolation of western from eastern Europe at the time of the last glaciation, which peaked around 18,000 years ago, and there may have been very limited genetic and cultural exchanges between the two halves of Europe during the early Paleolithic [20].

Another interesting outlier population shown by the tree analysis is Iceland. According to tradition, this island was settled in the 9th century by an estimated 20,000 Norwegians from the middle of Norway [21]. There were very few later immigrants. The Norwegian origin has been confirmed [22] in spite of earlier indications of greater similarity of Icelanders to the Scots and Irish. This misunderstanding was partly caused by earlier analyses which relied on genetic systems like ABO which are more sensitive than most others to environmental changes, and by the considerable similarity of people from northern Ireland and Scotland to Scandinavian Vikings, who settled the coasts of these countries before the colonization of Iceland. Moreover, Icelandic cattle have their genetic origin in mid-Norway. Iceland had a small population for a long time and there are still only 200,000 people today. It is the smallest European country and it probably owes its outlier position to having been subjected to much more drift and less immigration than most other European countries.

Finns, Greeks and Yugoslavs are also outliers in the tree for reasons which may correlate with their linguistic isolation. Finns speak a Uralic language [see also ref. 12] of the Ugro-Finnic family, which also includes Hungarian; Hungarians, however, do not seem to segregate clearly from other East Europeans, even though their genetic relationships with Uralic populations from West Siberia have been shown [12]. Greeks speak a language which is an almost entirely separate branch of Indo-European languages (as in the classical tree by Schleicher [23] represented in fig. 2; see also [24]). Yugoslavs (Southern Slavs) are also linguistically separated and genetically rather heterogeneous. Each of these populations has a complex history which may help in understanding their genetic patterns. Historical explanations are untestable by experiment, but future research may add more evidence in favor of or against using them in genetic interpretations, even if totally unambiguous conclusions can never be reached.

Fig. 2 The family tree model for the Indo-European languages proposed by Schleicher [23].

Genetic and Linguistic Trees of Descent

The remaining populations of figure 1 tend to associate in clusters in which one might discover frequent affinities of a linguistic nature. The first cluster from below includes the two Celtic-language-speaking samples (Scots and Irish). The second includes two of the four Slavic-speaking populations (Russians and Poles); Hungarians have been mentioned above; the third Slavic-speaking population (Czechoslovakians) is isolated in the tree, probably because of its intermediate geographic position (it is located between Slavic- and German-speaking people and was part of the Austrian empire for some time). Spaniards, Portuguese and Italians are grouped together as southwestern Europeans speaking a Romance language. Swedes and Norwegians are associated both geographically and linguistically. The Romance-speaking French, genetically rather heterogeneous, seem intermediate between this last group, the southwestern Europeans and the Saxons. The Saxon samples are grouped in two subclusters: the ‘northern’ subcluster is made up of Dutch, Danish, and English people, the ‘central’ subcluster of Austrians, Swiss, Germans, and Belgians. All of them speak Germanic languages, but Belgium and Switzerland are linguistically divided.

It is quite interesting to compare our genetic tree with the historical classical tree of Indo-European languages proposed by Schleicher [23] and shown in figure 2. In addition to some similarity in the two clustering structures, note the analogy between the isolation of Greek- and Germanic-speaking people in both trees.

On the world scale, most Europeans show relatively few differences. Whether one averages over all genes or takes a more robust statistic, the median, Europe has the lowest FST of all continents (see the last two rows of table 1). Descriptions of European ‘races’ between 1850 and 1950 were based on a few anthroposcopic features, such as skin, hair and eye color, stature, and cephalic index, and reflected the strong correlation between pigmentation and climate, most probably because skin pigmentation is an adaptive response to the intensity of solar radiation. Other genes also tend to show some differences between northern and southern Europeans, but they offer no confirmation of the clustering of Europeans according to any of the old anthropological classifications of European races [summarized e.g. in ref. 25, 26].

Trees of Descent and ‘Synthetic Maps’

Trees have relatively little use in areas which, like Europe, have had a very active genetic exchange generating a network more than a tree-like structure, as shown by the low average FST value and the results of bootstrap tests. The use of other methods for reconstructing trees also seems rather unrewarding in the case of Europe for the same reason. The average linkage tree we showed, however, has the advantage of pointing to populations which have some degree of uniqueness, i.e. are sufficiently different from neighbors, most probably because of isolation and drift, so that they stand out as outliers. Not surprisingly, they are almost all geographically peripheral.

Other methods can be more informative than trees for describing the relationships among European populations, e.g. principal component analysis. This is a linear transformation of the observed gene frequencies (in geometrical terms a rotation of the coordinate axes) whose coefficients are chosen so as to maximize the variation of the transformed data (having as coordinates new principal component values or principal coordinates) measured along each new coordinate axis (principal component). Principal components are ranked according to the fraction of total variation each of them can independently explain: for instance the ‘first’ principal component (explaining say 30% of the total gene frequency variation) is by definition more informative than the ‘second’ principal component (explaining < 30% of the total gene frequency variation), and all principal components are independent of one another [for the method see ref. 27, and for applications to genetic data see ref. 5]. A widely employed graphical display is that of the first two highest ranking component values of the populations: in a Cartesian diagram abscissa and ordinate represent the first and second principal component axes, respectively, and each point places a population in this transformed space. An interesting property of this representation is that the geometric distance between any pair of populations represents the ‘true’ multidimensional genetic distance with the least possible error. Figure 3 shows this kind of presentation calculated from the genetic distance matrix of table 2.
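The transformation described above can be sketched in a few lines of Python. To keep the example self-contained, the leading components are extracted by power iteration with deflation, a choice made here for illustration rather than the procedure of ref. 27. The population labels and gene frequencies are hypothetical.

```python
def principal_components(rows, n_components=2, iters=500):
    """Return [(fraction_of_variance, axis), ...] and per-row component scores."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    x = [[r[j] - means[j] for j in range(m)] for r in rows]  # centered data
    cov = [[sum(x[k][a] * x[k][b] for k in range(n)) / (n - 1)
            for b in range(m)] for a in range(m)]
    total = sum(cov[a][a] for a in range(m))  # total variance
    comps = []
    for _ in range(n_components):
        v = [1.0 / (j + 1) for j in range(m)]  # arbitrary starting vector
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
            norm = sum(c * c for c in w) ** 0.5
            v = [c / norm for c in w]
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(m))
                  for a in range(m))  # variance along this axis
        comps.append((lam / total, v))
        # Deflation: remove the variance already explained by this axis.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(m)]
               for a in range(m)]
    scores = [[sum(xr[j] * v[j] for j in range(m)) for _, v in comps]
              for xr in x]
    return comps, scores


# Hypothetical frequencies of three alleles (different loci) in five populations.
freqs = {
    "P1": [0.10, 0.40, 0.55],
    "P2": [0.20, 0.42, 0.50],
    "P3": [0.30, 0.38, 0.52],
    "P4": [0.40, 0.45, 0.48],
    "P5": [0.50, 0.35, 0.51],
}
comps, scores = principal_components(list(freqs.values()))
for name, (pc1, pc2) in zip(freqs, scores):
    print(name, round(pc1, 3), round(pc2, 3))
print([round(frac, 3) for frac, _ in comps])
```

Plotting the two scores of each population as abscissa and ordinate gives the kind of display described in the text; the printed fractions are the shares of total variation attributed to each component.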

Fig. 3 Principal component (PC) map of European populations. N = Northern, C = central, S = southern.

In 1978 we showed how a quite different application of principal component analysis, namely geographic contour mapping of the highest component values by smoothing the original gene frequency data over the whole geographical surface, could resolve the genetic picture of Europe into a number of ‘genetic landscapes’, each of which most probably summarizes a particular set of events and thus highlights a particular historical scenario [28]. Of the first three principal component representations shown in that research, each of which corresponded to a geographical contour map of principal component values, or, as we called them, ‘synthetic maps’, the map representing the first principal component agreed very closely with that expected from the demographic expansion of Neolithic farmers. This had been previously hypothesized [29, 30] on the basis of archaeological information. A simulation of the process of expansion of Neolithic farmers in Europe with compatible growth rates confirmed this interpretation of the genetic data [31, 32].

The Major Genetic Landscapes of Europe

The leading component of the European genetic landscape is a gradient originating in the Middle East and directed to the northwest. It has been confirmed [33, 34] that this gradient was generated by a migration of Neolithic farmers from Anatolia, directed west along the coast of the Mediterranean and northwest via the Balkans and central Europe to France, England, and Scandinavia. The genetic gradient is the result of continuous, partial admixture of the expanding farmers with local hunter-gatherers. There is a very high correlation (r = 0.792 ± 0.051) between the value of the first principal component shown in the synthetic map, and the archaeological map of the first arrival of farming in over 100 archaeological sites radiocarbon-dated in Europe. The first PC explains 28.1% of the genetic variation of Europe.

The second principal component explains 20% of the total genetic variation and shows a clear north-south gradient, with a peak where Lapps are living today. This component is highly correlated with latitude, and hence probably with temperature. It is worth noting, however, that it is also correlated with a partition of Europe into two linguistic families, Indo-European and Uralic. We are currently testing the statistical significance of the two findings, but it is quite possible that both correlations are correct. Lapps have up to 48% of genetic admixture with Eastern Uralic-speaking people [12]. People speaking Uralic languages may have spread westwards along the Arctic coast from an unknown area of origin: today Samoyeds who are perhaps the most representative speakers of Uralic languages live not far from the Arctic Ocean east of the Urals.

The third principal component explains 10.6% of the total genetic variation and shows a correlation with a possible expansion of pastoral nomads from a region north of the Caucasus and Black Sea, which, according to Gimbutas [35], is the area of origin of Indo-European speakers [Piazza A, et al., unpubl. results]. Barbujani and Sokal [36] found a correlation between linguistic and genetic boundaries in Europe. In the majority of cases (22 out of 33) there were also physical barriers that may be the cause of both genetic and linguistic boundaries. In 9 cases there were only linguistic and genetic boundaries but not physical ones. It remains to be established if in these cases or in some of them linguistic boundaries have generated or enhanced the genetic ones, or if both are the consequence of political, cultural and social boundaries (as in the case of Lapps and non-Lapps) that have played a role similar to that of physical barriers.

Much more of the demographic history and prehistory of Europe can be understood by an accurate study of its human population genetics. This knowledge may contribute greatly to archaeological, historical, and linguistic information, and the joint study from all these perspectives will be especially illuminating. Modern genetic techniques have brought analysis to an unprecedented degree of sophistication, and the knowledge from nongenetic disciplines that can enhance the understanding of the history of human evolution is more developed in Europe than on any other continent. This is the time to join forces and take full advantage of the current trend towards cooperation among Europeans.

Testing Human Genome Diversity in Europe

Europe is in a special situation with respect to other continents. Most aboriginal groups outside Europe are difficult to reach, and many are likely to become extinct soon. This does not hold for most European populations, and it is questionable whether one should establish transformed immortalized cell lines from individuals of most European countries. Taking a sample of 50 individuals from each of, say, 25 countries, immortalization would require a basic expense in the order of a million dollars, but this would not provide a sufficiently detailed data base. As most Europeans show few genetic differences, samples of 50 individuals may rarely give statistically significant results in intra-European comparisons unless a very large number of genes is tested: the high genetic homogeneity of Europe puts a high cost on the study of local genetic variation.
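The arithmetic behind this argument can be made explicit with a short Python sketch. The cost figure follows directly from the numbers quoted above. The second function rests on assumptions not stated in the text: a diploid locus sampled at random, so that an allele frequency estimated from n individuals has binomial sampling variance p(1 − p)/(2n), and an among-population variance of FST × p(1 − p). The FST values tried are hypothetical and are not those of table 1.

```python
def cost_per_cell_line(total_cost, individuals_per_country, countries):
    """Implied cost of one immortalized cell line."""
    return total_cost / (individuals_per_country * countries)


def signal_to_noise(fst, n_individuals):
    """Among-population variance divided by sampling variance.

    Assumed model: FST * p(1 - p) divided by p(1 - p) / (2n);
    the allele frequency p cancels, leaving 2 * n * FST.
    """
    return 2 * n_individuals * fst


print(cost_per_cell_line(1_000_000, 50, 25))  # 800.0

# Hypothetical FST values, from very homogeneous to well differentiated.
for fst_value in (0.002, 0.01, 0.05):
    print(fst_value, signal_to_noise(fst_value, 50))
```

Under these assumptions, with 50 individuals the real differences in a single gene exceed sampling noise only when FST is above about 0.01; for more homogeneous populations either larger samples or many more genes are needed, which is the cost the text refers to.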

In Europe there are, however, much cheaper ways of obtaining DNA from large numbers of individuals other than by immortalization, e.g. via a good network of well-equipped blood banks. In the countries where this holds, DNA can be obtained from blood donors. This would be a nonrenewable resource, but samples from the same blood donors might be obtained again, and in most cases there would be no scientific loss using DNA from different individuals of the same population, after a first set of DNA samples from a given population is exhausted.

One may nevertheless want to immortalize some populations, especially the outliers mentioned above, which are of special interest and some of which may not be so easily accessible through blood banks, for example the most important, least acculturated groups of Lapps.

Blood donors have already been used to extract DNA from a population of Celtic origin in Trino Vercellese, Italy [37], as well as from many European samples collected during the 11th Histocompatibility Workshop [38]. The transformed cultures collected in Bergamo, Italy, by Ferrara et al. [unpubl. data] are also produced by blood donations. They have the advantage that white cells in toto can also be used for analysis of immunoglobulin regions [39, 40] whereas transformed B lymphocytes may be unsatisfactory.

It would be important to explore the feasibility of testing human genome diversity in each European country. Local committees should be formed, including geneticists who have experience and interest in testing DNA samples by molecular techniques, blood bank experts, and a group of specialists in history, archaeology, cultural anthropology, ethnography and linguistics. This would also be an interesting experiment in collaboration between the ‘two cultures’, in which the ‘other’ culture group should suggest geographical areas and populations worthy of special attention. Major criteria in this choice are historical, archaeological and linguistic information on the origin of the people, and the need to avoid cities as sources of samples, concentrating instead on rural and mountain areas, i.e. on regions where recent immigration is exceptional. Ideally, one would like to select individuals whose four grandparents are from the area of interest. Where this information is impossible to obtain, an appropriate analysis of surnames could supply useful selection criteria.

Sample Sizes and Selection of DNA Markers

It is not necessary to sample all individuals in a single village; it is enough to sample a few from each village of the region worth examining for that particular population. The sample size should be around 70–100 individuals, who should be unrelated or only distantly related (less closely related than first cousins).

A large, standard sample of markers must be tested in all populations, both in Europe and the rest of the world, otherwise comparisons will not be possible. One of the major difficulties in using published data for classical markers is the diversity of markers tested by different research workers. Fortunately, a few very informative markers have been analyzed fairly systematically and by a homogeneous set of reagents in a number of populations: the HLA system is the best known example [41], but some earlier identified genetic systems like GM have also been tested extensively [42]. The need to examine as many markers as possible, and to use standard testing conditions, can only be met by widespread agreement among researchers. It is probably superfluous to add that the burgeoning ‘molecular paleontology’ cannot give meaningful results unless it is accompanied by the accumulation of control data from modern Europeans and other populations.

There is at the moment much innovative activity in the development of testing and sequencing methods, as well as increasing opportunities for automated research. Only a few sets of DNA markers are sufficiently well established, but information is beginning to accumulate on many populations. It is possible that for some populations, like many European ones that are extremely similar and therefore have only a short differentiation history limited to the last few millennia, markers with high mutation rates (like CA repeats) may be especially valuable [43], but it is necessary to test whether these can be used for comparisons between genetically more distant populations.