Introduction

Advances in genotyping technologies have provided considerable insight into human population structure. Knowledge of patterns of genetic variation within and among human populations has contributed to a better understanding of the relationship between genetics and ethnicity, and has improved the design and analysis of case–control association studies. Although several studies have investigated the genetic structure of non-Caucasian populations, including individuals of African, African American, Asian, and Native American ancestry, most studies have focused on individuals of European ancestry.1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 Coverage of the global human population therefore remains incomplete, with populations from South America underrepresented in the databases of human genetic variation. These understudied populations include individuals from Brazil, a country of almost 200 million people, which represents approximately 52% of the South American population and 3% of the world's population.

Historically, the Brazilian population has experienced a high degree of intermarriage between ethnic groups, and Brazilians are known to be heavily admixed, with Amerindian, European, and African ancestries. In general, Brazilians trace their origins to the original Amerindians and to two main sources of immigration: Africans and Europeans.13, 14 Of the five geographical regions of Brazil (North, Northeast, Center–West, Southeast, and South), the North is inhabited mostly by people of Amerindian ancestry, with some African ancestry. Current inhabitants of the Northeast and Center–West are mostly of African origin, although some individuals whose ancestors migrated from Southern Brazil can trace their roots to Europe. Southern and Southeastern Brazilians are mostly of European origin, although individuals of African and Asian descent are also found in several localities of the Southeast. For decades, new immigrants, as well as migrants from other parts of Brazil, have flocked to Southeast Brazil, where intermarriage between individuals of different ancestry is very common. The goals of the present work are to: (i) identify patterns of population structure in the Southeast Brazilian population, enabling individuals from this region to be included in future studies of genetic variation; (ii) identify marker panels that can effectively capture the variation revealed by dense genotyping of samples from the Southeast Brazilian population and from the 11 populations of the HapMap Project, Phase III, which include individuals of Asian, African, European, and Mexican ancestry; and (iii) assess global and local ancestry inferences for the Southeast Brazilian population.

Materials and methods

Datasets and preprocessing steps

We analyzed samples of the Southeast Brazilian population (BRZ), as well as samples from the following 11 populations of the HapMap database, Phase III: African ancestry in Southwest USA (ASW), Utah residents with Northern and Western European ancestry from the CEPH collection (CEU), Han Chinese in Beijing, China (CHB), Chinese in Metropolitan Denver, Colorado (CHD), Gujarati Indians in Houston, Texas, with origins in the western state of Gujarat in India (South Asia) (GIH), Japanese in Tokyo, Japan (JPT), Luhya in Webuye, Kenya (LWK), Mexican ancestry in Los Angeles, California (MEX), Masai in Kinyawa, Kenya (MKK), Tuscans in Italy (TSI), and Yoruba in Ibadan, Nigeria (YRI). The International HapMap Project, Phase III data are available at http://www.sanger.ac.uk/humgen/hapmap3.

The Southeast Brazilian population samples are from a study of trios (mother, father, and son or daughter) in which the child has a congenital heart disease and the parents do not. All individuals are from the general urban population of Sao Paulo, the largest metropolitan area of the country. In the present analysis, we used data only from the unrelated individuals (mothers and fathers). These individuals were enrolled in the current study at the Heart Institute of the University of Sao Paulo. Genotyping for these samples was performed using the Affymetrix SNP array 6.0 platform (Affymetrix, Santa Clara, CA, USA). All subjects gave verbal and written consent. The present protocol was approved by the University of Sao Paulo Medical School IRB (CAPPesq). Samples from the HapMap were genotyped using two platforms, Affymetrix SNP 6.0 and Illumina Human 1M arrays (Illumina, San Diego, CA, USA). More details on the HapMap populations are available from the HapMap Project webpage. Only unrelated individuals were considered in the present analysis, and only SNPs located on the autosomal chromosomes and successfully genotyped in all populations were used.

SNPs that were not accurately assessed on the Affymetrix 6.0 array were excluded from the final analysis. That is, we removed, separately for each of the 12 populations, SNPs with more than 5% missing genotypes, SNPs that were not in Hardy–Weinberg equilibrium (P ≤ 10^−4), and SNPs with a minor allele frequency less than or equal to 0.01. After these steps, 365 116 autosomal SNPs shared by all 12 population data sets remained, along with 1129 unrelated individuals representing the 11 HapMap populations (n=991) and the Brazilian population (n=138).
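The three per-population filters can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name `qc_mask`, the 0/1/2 coding with `np.nan` for missing calls, and the use of a 1-df chi-square goodness-of-fit test for Hardy–Weinberg equilibrium are all assumptions, since the text does not state which test was applied.

```python
import math
import numpy as np

def qc_mask(G, max_missing=0.05, min_maf=0.01, hwe_p=1e-4):
    """Return a boolean mask of SNPs (rows) that pass all three filters.

    G is a hypothetical (m SNPs x n individuals) float array of allele
    counts 0/1/2 with np.nan for missing genotypes, for ONE population.
    """
    m, n = G.shape
    n_called = (~np.isnan(G)).sum(axis=1)
    missing_rate = 1.0 - n_called / n

    # Genotype counts per SNP (comparisons with NaN are simply False).
    n_aa = (G == 0).sum(axis=1)
    n_ab = (G == 1).sum(axis=1)
    n_bb = (G == 2).sum(axis=1)

    # Frequency of the counted allele, folded to the minor allele.
    with np.errstate(divide="ignore", invalid="ignore"):
        p = (n_ab + 2 * n_bb) / (2 * n_called)
    maf = np.minimum(p, 1 - p)

    keep = np.zeros(m, dtype=bool)
    for i in range(m):
        # Remove SNPs with >5% missing genotypes or MAF <= 0.01.
        if n_called[i] == 0 or missing_rate[i] > max_missing or not maf[i] > min_maf:
            continue
        # Chi-square test of observed vs Hardy-Weinberg expected counts.
        exp = n_called[i] * np.array(
            [(1 - p[i]) ** 2, 2 * p[i] * (1 - p[i]), p[i] ** 2]
        )
        obs = np.array([n_aa[i], n_ab[i], n_bb[i]])
        chi2 = float(((obs - exp) ** 2 / exp).sum())
        p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 df
        keep[i] = p_value > hwe_p                   # remove if P <= 1e-4
    return keep
```

Under this sketch, a SNP would be retained for the pooled analysis only if it passes the mask in every one of the 12 populations.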

Statistical analysis

We used Principal Components Analysis (PCA), a dimensionality reduction technique,1, 2 to analyze the data. For each population k, the data set consists of $n_k$ unrelated subjects, each genotyped at the same m biallelic SNPs common to all populations. Data for all 12 populations were arranged in a matrix G of dimension m by n, with $n = \sum_{k=1}^{12} n_k$. Each entry takes the value 0, 1, or 2, or is left empty, according to the genotype at that SNP.2 After mean-centering and normalizing each row i of the matrix G, n eigenvalues and n corresponding eigenvectors (axes of variation) were calculated from the n-by-n covariance matrix of individuals, $\psi = G^{\top}G$. Plots of the eigenvectors associated with the largest eigenvalues were then used to investigate the structure of the populations under analysis. PCA was run without the removal of outliers and without eliminating SNPs in linkage disequilibrium.
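The computation described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the EIGENSTRAT implementation: the function name `genotype_pca` is hypothetical, each row is scaled by sqrt(p(1 − p)) with p taken as half the row mean, missing genotypes are set to zero after centering, and the covariance is divided by m.

```python
import numpy as np

def genotype_pca(G, n_components=5):
    """PCA of individuals from an (m SNPs x n individuals) genotype matrix.

    G holds allele counts 0/1/2, with np.nan for empty (missing) entries.
    Returns the leading eigenvalues and eigenvectors (axes of variation).
    """
    G = np.asarray(G, dtype=float)
    m, n = G.shape

    # Mean-center each SNP (row) using the non-missing genotypes.
    row_mean = np.nanmean(G, axis=1, keepdims=True)

    # Assumed normalization: divide by sqrt(p(1-p)), p = row mean / 2.
    p = row_mean / 2.0
    scale = np.sqrt(p * (1.0 - p))
    with np.errstate(divide="ignore", invalid="ignore"):
        X = (G - row_mean) / scale

    # Missing entries (and monomorphic rows) contribute nothing.
    X = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0)

    # n x n covariance matrix of individuals.
    psi = X.T @ X / m

    # eigh returns ascending eigenvalues for a symmetric matrix; reverse.
    evals, evecs = np.linalg.eigh(psi)
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], evecs[:, order]
```

Working with the n-by-n matrix rather than the m-by-m SNP covariance keeps the eigendecomposition small when m (hundreds of thousands of SNPs) greatly exceeds n (about a thousand individuals).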

To investigate whether a smaller number of SNPs could effectively capture the variation revealed by the 365 116 common SNPs, we built three panels of markers. The first panel has 250 SNPs, consisting of the top 50 SNPs retained from each of the top five axes of variation. SNPs were ranked on the basis of their loading scores (in absolute value) obtained from the axes of variation. The second and third panels were obtained by retaining the top-ranked 100 and 150 SNPs from each of the same top five axes, respectively. As there were no common SNPs among those retained, the total number of SNPs left in each panel was 250, 500, and 750, respectively. The relationship between the different populations was also investigated by calculating the Fst statistic, a metric representation of the effect of population subdivision15, 16 for each pair of populations, using the SNPs in the three panels, and also the 365 116 common SNPs. Fst statistic is often expressed as the proportion of genetic diversity due to allele frequency differences among populations. A zero value implies that the two populations are interbreeding freely and a value of one that the two populations are completely separate.
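The panel construction can be illustrated with a short sketch. It assumes a hypothetical array `loadings` of shape (m SNPs × 5 axes) holding the SNP loading scores; how those loadings are extracted from the PCA software is not specified in the text and is left outside the example.

```python
import numpy as np

def select_panel(loadings, top_per_axis=50):
    """Pick the SNPs with the largest absolute loading on each axis.

    loadings: (m SNPs x k axes) array of SNP loading scores.
    Returns the sorted, de-duplicated row indices of the selected SNPs.
    """
    loadings = np.asarray(loadings, dtype=float)
    chosen = set()
    for axis in range(loadings.shape[1]):
        # Rank SNPs by |loading| on this axis, largest first.
        ranked = np.argsort(-np.abs(loadings[:, axis]))
        chosen.update(ranked[:top_per_axis].tolist())
    return sorted(chosen)
```

Calling `select_panel(loadings, 50)`, `select_panel(loadings, 100)`, and `select_panel(loadings, 150)` would give the three panels. Because the result is a union, a SNP ranking highly on two axes would be counted once, so a panel can in principle be smaller than five times `top_per_axis`.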

For global ancestry analysis, we applied the model-based STRUCTURE program17 to estimate the admixture proportions for the BRZ samples. This was done by applying the STRUCTURE program to two different pooled data sets consisting of four populations each (CEU, YRI, MEX, and BRZ for model 1, and TSI, ASW, MEX, and BRZ for model 2), without informing the program which samples were the reference samples. The populations in model 2 were selected because they gave the smallest Fst values between the BRZ samples and the HapMap, Phase III samples of Caucasian and African origin. As seen in Figure 3, the performance of the first two PCs in each of the two different pooled data sets is similar. In this unsupervised mode, we allowed the program to infer the underlying ancestral populations, as well as the ancestral proportions for each subject. The number of ancestral populations K was fixed at 3, 4, 5, and 7. For a given K, we ran STRUCTURE 10 times with different random seeds (10 000 iterations for the burn-in phase and 10 000 iterations for Markov chain optimization) and recorded L(K), the log likelihood of the data given K, from each run. We used the metric ΔK to find the optimal K, which is the value with the largest ΔK.18 The inferred number of ancestral populations for the pooled data was 3.
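As a rough illustration of the ΔK criterion, the sketch below computes the absolute second difference of the mean L(K) across replicate runs, divided by the standard deviation of L(K). The function name `delta_k` and the input layout (a dict from K to a list of replicate log likelihoods) are assumptions, as is the choice to take the second difference of the run means.

```python
import statistics

def delta_k(log_liks):
    """Compute Delta K from replicate STRUCTURE log likelihoods.

    log_liks: dict mapping K -> list of L(K) values, one per replicate run
              (at least two runs per K are needed for a standard deviation).
    Returns a dict K -> Delta K for every K whose neighbours K-1 and K+1
    were also run; Delta K is undefined at the ends of the tested range.
    """
    mean_l = {k: statistics.mean(v) for k, v in log_liks.items()}
    result = {}
    for k in sorted(log_liks):
        if k - 1 in mean_l and k + 1 in mean_l:
            sd = statistics.stdev(log_liks[k])
            if sd > 0:
                # |L(K+1) - 2 L(K) + L(K-1)| scaled by the spread at K.
                second_diff = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
                result[k] = second_diff / sd
    return result
```

The selected K would then be `max(result, key=result.get)`.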

Analyses described above were carried out using the publicly available STRUCTURE,17 and EIGENSTRAT2, 7 software packages.

Results

Principal components analysis

PCA using the 12 populations showed pronounced patterns of genetic variation within and amongst the populations. To visualize these patterns graphically, we shall consider the top three axes of variation chosen on the basis of their eigenvalues (Figure 1).

Figure 1

Eigenvalues associated with the 20 first PCs (axes of variation) obtained from the PCA, in which all common SNPs were used.

The two most informative axes of variation, PC1 and PC2 (Figure 2a), and the three most informative axes, PC1, PC2, and PC3 (Figure 2b), can resolve the 11 populations available in the HapMap study. That is, despite some overlap, the individuals from the 11 HapMap populations were clearly separated by ancestry of origin (African, Asian, European, and Mexican). Asian populations were tightly clustered and distinct from the African and European populations. The Southeast Brazilian population formed a continuum between Europeans and Africans, with some overlap with the Mexican population. The continuum of genotypes observed in the Brazilian population is consistent with the high degree of intermarriage between individuals of European and African descent.

Figure 2

Projection of 1129 individuals from 11 populations of the HapMap Project, Phase III, and the Brazilian population on their (a) first and second, and (b) first, second, and third axes of variation obtained from PCA, which used 365 116 SNPs. ASW, African ancestry in Southwest USA; CEU, Utah residents with Northern and Western European ancestry from the CEPH collection; CHB, Han Chinese in Beijing, China; CHD, Chinese in Metropolitan Denver, Colorado; GIH, Gujarati Indians in Houston, Texas; JPT, Japanese in Tokyo, Japan; LWK, Luhya in Webuye, Kenya; MEX, Mexican ancestry in Los Angeles, California; MKK, Masai in Kinyawa, Kenya; TSI, Tuscans in Italy; YRI, Yoruba in Ibadan, Nigeria; and BRZ, Brazilians in São Paulo, Brazil.

Fst statistic results

The Fst statistic was calculated for all population pairs using the 365 116 common SNPs (Table 1). Small Fst values (0.001 to 0.008) were found for each pair of Asian populations (CHB, CHD, and JPT), indicating less pronounced genetic differences between these populations. Similarly, each pair of African populations (ASW, LWK, MKK, and YRI) is separated by low Fst values. Larger Fst distances (0.128 to 0.168) were observed between Asian and African populations. The populations with European ancestry (CEU and TSI) are also separated by a small Fst value (0.003). Three distinct clusters of ancestral populations (Asian, African, and European) are thus distinguished by Fst. The MEX and GIH populations are closer to the European cluster than to the African cluster as measured by Fst distance. The Fst values confirm that the Southeast Brazilian population is close to the European, African, and Mexican populations.
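To make the pairwise values concrete, the sketch below computes one simple form of Fst for two populations from per-SNP allele frequencies: the proportion of total expected heterozygosity that is not found within populations, pooled across SNPs. The estimator actually used for Table 1 is not specified here, so the function `pairwise_fst` is an assumption for illustration; it ignores sample-size corrections and weights both populations equally.

```python
import numpy as np

def pairwise_fst(p1, p2):
    """Simple two-population Fst from allele frequencies.

    p1, p2: per-SNP frequencies of the same allele in populations 1 and 2,
            given in the same SNP order.
    Returns (H_T - H_S) / H_T, with both terms summed over SNPs.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)

    # Expected heterozygosity in the pooled population.
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)

    # Mean expected heterozygosity within each population.
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0

    denom = h_total.sum()
    if denom == 0:
        return 0.0  # no variation at any SNP
    return float((h_total - h_within).sum() / denom)
```

Identical allele frequencies give 0 and fixed differences at every SNP give 1, matching the interpretation of the two extremes of the statistic.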

Table 1 FST statistics calculated between each pair of populations using all 365 116 common SNPs

Ancestry informative markers

Small sets of ancestry informative markers (AIMs) that can provide substantial substructure information have been the focus of several studies.19, 20, 21 AIM sets consisting of 200 markers or fewer can map ancestral origin to Africa, Europe, or Asia. We considered three panels of markers. SNPs in each panel were selected on the basis of their loading scores obtained from a PCA performed on the covariance matrix of the SNPs. The first panel has 250 SNPs, consisting of the 50 SNPs with the highest loading scores (in absolute value) on each of the top five axes of variation. The second and third panels retained 100 and 150 SNPs, respectively, from each of the top five axes of variation, and have 500 and 750 markers, respectively. Plots of the first two axes of variation (PC1 and PC2) were obtained by performing PCA for each of the three panels of SNPs (data not shown). The 250 SNP set reproduced the stratification observed with the entire 365 116 SNP set (Figure 2). The 500 and 750 SNP sets produced results that were indistinguishable from those of the 250 SNP set. The chromosomal distribution of the 500 SNP set was uniform. Although the magnitude of the Fst values varied, the same pattern was observed for all three panels of markers (Table 2). All three SNP marker panels captured the variation revealed by the entire 365 116 SNP set. Indeed, the pairwise Spearman correlation coefficients between the four Fst matrices were always higher than 0.964.
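The comparison of Fst matrices can be sketched as a rank correlation over the population pairs. The example below is an illustration only: it assumes each Fst matrix is a symmetric array with populations in the same order, uses only the entries above the diagonal, and implements Spearman's coefficient as the Pearson correlation of average ranks. The function names are hypothetical.

```python
import numpy as np

def average_ranks(x):
    """Rank values from 1..len(x), giving tied values their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    i = 0
    while i < len(x):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman_between_matrices(A, B):
    """Spearman correlation between two symmetric pairwise matrices."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Each population pair is counted once: entries above the diagonal.
    upper = np.triu_indices_from(A, k=1)
    ra = average_ranks(A[upper])
    rb = average_ranks(B[upper])
    return float(np.corrcoef(ra, rb)[0, 1])
```

A coefficient near 1 indicates that two marker sets order the population pairs by genetic distance in nearly the same way, even when the magnitudes of the Fst values differ.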

Table 2 FST statistics calculated between each pair of populations using Panel 1 (A), Panel 2 (B), and Panel 3 (C)

Global ancestry inference of the Brazilian population

Global ancestry inference yielded mean Amerindian, African, and European ancestry proportions for the studied samples. To this end, we first recalculated the Eigenstrat principal components using two different subsets of HapMap samples as ‘ancestral’ populations. In the first model, we used the CEU, YRI, and MEX samples to represent, respectively, Caucasian, African, and Amerindian ancestral populations. In the second model, we used the TSI, ASW, and MEX samples to represent these populations. The first model was chosen because these populations are commonly used as ancestral populations in earlier reports. For the second model, we chose the populations with the smallest pairwise Fst differences from the BRZ sample. No significant differences between these two models were observed (Figure 3). STRUCTURE analysis of these two models, using the 100 most important SNPs from PC1 and PC2, is presented in Figure 4. In our sampled individuals from the Brazilian Southeast region, mean values were 0.15, 0.24, and 0.61, respectively, for Amerindian, African, and European ancestries for Model I markers, and 0.17, 0.27, and 0.56, respectively, for Amerindian, African, and European ancestries for Model II markers (Figure 4).

Figure 3

Projection of individuals from three potentially ancestral populations of the HapMap Project, Phase III, and the Brazilian population on their first and second axes of variation (PCs) using Model 1=YRI, CEU, MEX, and BRZ, and Model 2=ASW, TSI, MEX, and BRZ.

Figure 4

Proportion of membership of each pre-defined population in each of the three clusters. (a) Triangular plot of the genomic proportions of African, European, and American ancestry of the sampled populations from Model I (CEU, YRI, and MEX). (b) Bar plot of the STRUCTURE analysis with the admixture model for sampled populations from Model I. (c) Triangular plot of the genomic proportions of African, European, and American ancestry of the sampled populations from Model II (TSI, ASW, and MEX). (d) Bar plot of the STRUCTURE analysis with the admixture model for sampled populations from Model II. Red, blue, and green represent the proportions of inferred ancestry from European, African, and American ancestral populations. (MEX, Mexican ancestry in Los Angeles, California; CEU, Utah residents with Northern and Western European ancestry from the CEPH collection; YRI, Yoruba in Ibadan, Nigeria; TSI, Tuscans in Italy; ASW, African ancestry in Southwest USA; and BRZ, Brazilians in São Paulo, Brazil).

Discussion

We have compared genotypic variation at 365 116 SNPs among 1129 unrelated individuals with ancestry from five continents (Asia, Europe, Africa, and North and South America), including individuals from Southeast Brazil. We demonstrate that the Southeast Brazilian population is highly admixed and quite distinct from the HapMap populations. Principal component analyses indicate extensive intermarriage between individuals of African and European descent. This intermarriage has occurred from 1500 to the present day, spanning about 20 generations. Thus, the genomes of Brazilian individuals consist of chromosomal segments of distinct ancestry, with substantial European and African-related admixture. These findings have important implications for the design and analytical planning of studies exploring complex traits in this population. We expect that the large degree of admixture observed in the Southeast Brazilian population can be exploited for the mapping of important disease loci.

The study cohort was collected in Southeast Brazil, in Sao Paulo state. Individuals of African, Amerindian, and perhaps Asian ancestry may be underrepresented in this study, as individuals with European ancestry comprise a majority in this region. Thus, additional analyses using larger, random samples covering all five Brazilian regions might show an even more pronounced degree of genetic variation than the one suggested by our analysis. Whether the same degree of intermarriage will be observed in other parts of Brazil or other parts of Latin America will be addressed in future studies.

New dense genotyping data from other forthcoming Brazilian studies will determine whether the same pattern of extensive genetic admixture exists in other parts of Brazil.