Introduction

The non-recombining region of the Y chromosome (NRY) is the subject of intense research in the field of human population genetics and evolution.1, 2, 3 Several studies of the peopling of the Americas have made use of Native American NRY polymorphic variants.4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22 The major founder lineage (or haplogroup) of Native American populations, Q-L54,4, 5, 6, 11, 16, 17 is defined by the derived allele of the L54 SNP and belongs to haplogroup Q, which in turn is defined by the derived allele at locus M242.12, 23, 24 Lineage Q-L54, which is currently divided in Native Americans into Q-L54*(xM3) and Q-M3, accounts for at least 85% of the autochthonous Y chromosomes of native Americans.12, 14, 15, 16, 17, 19, 20 A rare native American lineage is haplotype C-M217 (or C3),17 which is defined by marker M21721 and is derived from haplogroup C, which is in turn defined by markers M216 and M130.22 This latter haplogroup is also found among various ethnic groups from northeast Asia, Australasia and Oceania.25

The use of Y chromosome single-nucleotide polymorphisms (Y-SNPs) in human history reconstruction depends on their ability to discriminate between paternal lineages shared by individuals and populations.1, 10 Next-generation sequencing technologies have identified thousands of new SNPs, and these discoveries have significantly enhanced the resolution and topology of the Y-chromosomal phylogeny.3, 20, 26, 27, 28 However, many markers provide redundant information or define terminal phylogenetic branches that are not useful at the population level.24, 29

Genetic studies have traditionally combined Y-chromosomal SNPs and short tandem repeats (STRs). Hundreds of STRs on the human Y chromosome have been described,30 and their high variability allows phylogeographic studies of SNP-defined Y-chromosomal lineages shared by individuals from different ethnic groups.31, 32, 33, 34 Y-STR haplotype analyses have been used for the historical reconstruction of the peopling of the Americas in population migration analyses, demographic estimates, dating of historical events and evaluations of paternal gene flow.12, 24, 33, 34, 35 However, because of the high mutation rate of Y-STRs,31 ancient SNP-defined lineages (for example, >10 000 years old) can present highly homoplasic phylogeographic networks, resulting in unresolved phylogenies with low genealogical information, as depicted by network analyses of the Q-M3 lineage.17, 18 Thus, to increase the phylogeographic information of Y chromosome data, one can use SNPs defining more recent sublineages to generate a more detailed Y-STR genealogy, as previously shown.24, 31

Following this approach, we identified new informative Q-L54 sublineages among native South American populations by screening genealogically divergent haplogroup Q Y chromosomes for 34 SNPs identified through re-sequencing work. These SNPs were genotyped in a large sample of South American natives together with SNPs available in the literature. Five new SNPs defining two Q-M3-derived and three Q-L54*(xM3)-derived sublineages were validated for use in phylogeographic studies of native South American populations. All informative SNPs were used in combination with previously described SNPs to increase the resolution of the phylogeny of haplogroup Q lineages and to build a more detailed and informative genealogy of native South American Y chromosomes.

Materials and methods

A general scheme of the procedures described below to identify and validate informative Y-SNPs for native South American population studies is depicted in Figure 1.

Figure 1

Flowchart of methodological procedures for Y-SNP characterization. Scheme of the methodology applied to select samples for Y chromosome sequencing, SNP identification, and population validation through multiplex genotyping. SNP, single-nucleotide polymorphism; Y-SNPs, Y chromosome SNPs.

Sampling

The subjects of the study consisted of 1841 native American individuals (Supplementary Table 1) belonging to Y chromosome Q or C lineages (carrying the derived alleles at SNPs M242 or M130, respectively) from indigenous communities in Peru, Bolivia, Ecuador and Brazil, sampled during the Genographic Project,18, 24 and also 13 Coyaima individuals (Cariban linguistic family) from Colombia. DNA samples were obtained from cells collected via buccal swabs.18, 24 In the case of the Coyaima individuals, aliquots of DNA from a previous study24 were used. The present study was authorized in Brazil by the ethics commissions of the Universidade Federal de Minas Gerais and the National Commission for Ethical Research (CONEP, Resolution 763/2009), and by local ethics commissions of the non-Brazilian countries in which samples were collected.24

Initial genotyping of NRY SNPs and STRs

Native American samples were initially genotyped for NRY SNPs M242, M3 and M130 using custom TaqMan genotyping assays (Applied Biosystems) to identify haplogroup Q and C Y chromosomes (Supplementary Text 1). At a later stage, all samples were also genotyped in the BeadXpress system (see below) for the presence of the derived alleles at markers L54 and M217, which identify native American lineages Q-L54 and C-M217, respectively. The samples were also genotyped using 17 Y-STRs (DYS389a, DYS389b, DYS390, DYS456, DYS19, DYS385a, DYS385b, DYS458, DYS437, DYS438, DYS448, GATA_H4, DYS391, DYS392, DYS393, DYS439, DYS635) with a Y-filer Kit (Applied Biosystems, Foster City, CA, USA) following the manufacturer’s recommended protocol36 (Supplementary Text 1).

Sample selection for large-scale sequencing

Using the median-joining method implemented in NETWORK 4.6,37 haplotype networks were constructed with Y-STR data from South American individuals belonging to the Q-M3 and Q-L54* lineages, as described by Jota et al.24 Samples submitted for Y chromosome sequencing were selected by choosing Y-STR haplotypes located in separated network clusters, giving priority to representatives of different ethnicities and geographic regions from South America. This selection based on Y-STRs was used to increase the chance to identify new SNPs defining sublineages on different branches of Q-M3 and Q-L54* lineages.

Identification of new Y-SNPs in haplogroup Q

To identify new Y-SNPs, NRY regions from 21 samples of indigenous South Americans belonging to the haplogroup Q-L54 were sequenced using different strategies. Sixteen individuals were analyzed via Sanger sequencing, two individuals via long-PCR and next-generation sequencing, two individuals via the Walk Through the Y (WTY) sequencing service (Family Tree DNA), and one individual via complete Y chromosome sequencing (Complete Genomics). The individuals were selected using the previously defined phylogenetic, geographic and ethnic criteria.

Among the 16 samples selected for Sanger sequencing (two Bolivian, six Brazilian, two Colombian and six Peruvian Indians), six belonged to the Q-L54*(xM3) paragroup and 10 to the Q-M3 lineage. The samples were subjected to de novo sequencing after PCR amplification of known Y chromosome regions at loci M323,38 M378,39 MEH2,1 N14,40 P292,10 M242,23 M3,41 M138, M217,42 Page53,12, 13 Alu486, DFFRY04, DBYd1 and P8943 (Supplementary Text 1).

High-coverage next-generation sequencing from long-PCR-amplified fragments was performed using two samples from the Q-M3 lineage (1 each from Peru and Brazil). This sequencing involved the generation of 672 DNA segments via long PCR using Taq HiFi (Invitrogen), covering about 3 million base pairs of the NRY region3, 44 (Supplementary Text 1).

Two samples from native Peruvians belonging to the Q-M3 lineage underwent Sanger sequencing using the WTY (Walk Through the Y) service at the Genomic Research Center of Family Tree DNA.45 This protocol generates sequences over 180 000 bp of the NRY region and returns a list of probable SNPs identified from single copy, euchromatic regions of the human Y chromosome.

One sample from the Q-M3 lineage originating from Peru was also sent for complete genome sequencing at Complete Genomics.46

Y-SNP multiplex genotyping

Two 96-SNP platforms were constructed by designing a personalized multiplex genotyping system using the VeraCode OPA system (Illumina, San Diego, CA, USA, Supplementary Text 1) and performing genotype reading using the BeadXpress system (Supplementary Tables 2 and 3). Allowing for duplications of SNPs included in both platforms, a total of 119 SNPs were tested (Table 2). We evaluated 34 new SNPs (Table 1) that were identified in our study, and 85 previously known Y chromosome SNPs: 28 SNPs identifying haplogroup Q sublineages, including L54;10, 45, 47 22 SNPs that were previously identified in various studies through Y chromosome sequencing of native Americans, but were not evaluated at the population level in South America;3, 47, 48 and 35 SNPs identifying other Y haplogroups.1, 10 The 119 Y-SNPs used on the two multiplex genotyping platforms, and their chromosomal positions, are listed in Table 2.

Table 1 Details of 34 new Y-SNPs selected for genotyping of South American populations
Table 2 Y-SNPs tested in BeadXpress Multiplex Platforms 1 and 2

The two multiplex sets of 96 SNPs were referred to as platform 1 (VC0014123-OPA) and platform 2 (VC0014259-OPA). Platform 2 was constructed in a second stage, with the inclusion of SNPs that were validated on platform 1 and shown to be shared among native South Americans. Details of the BeadXpress genotyping assay were published elsewhere.49, 50

Native South Americans were submitted to multiplex genotyping using BeadXpress platforms 1 and 2, in a hierarchical procedure defined by the Y chromosome tree1, 10 and a screening method based on the Y-STR profile (see below). The BeadXpress genotyping assays for SNPs SA02 and SA03 were not successful; they were therefore genotyped by standard restriction fragment length polymorphism (RFLP) analysis (Supplementary Text 1). In total, 960 samples (including controls, data not shown) were fully genotyped using both platforms (Supplementary Tables 2 and 3).

All new SNPs were incorporated into the Y-chromosomal tree according to the haplogroup hierarchy and nomenclature defined by the Y Chromosome Consortium.1, 10

Screening of samples for NRY SNP genotyping

Using the Y-STR haplotype networks from South American individuals belonging to the Q-M3 and Q-L54* lineages, samples were selected for genotyping new NRY SNPs on BeadXpress or PCR–RFLP genotyping (SA02 and SA03 SNPs). Samples were selected by choosing Y-STR haplotypes phylogenetically related to haplotypes carrying new SNPs (Supplementary Figure 1). This process was repeated when new individuals bearing new Y-SNPs were found. In addition, all individuals from a location or ethnic group where a new SNP was found were also selected for genotyping each particular SNP.

Dating of new South American Q-L54 sublineages

Estimates of the time to the most recent common ancestor (TMRCA) for chromosomes carrying the derived alleles of the new lineage-defining SNP stemming from the present study were determined using rho statistics, implemented in the NETWORK program,37 employing a mean effective mutation rate of 6.9 × 10⁻⁴ per locus per 25 years51 for each Y-STR locus. The ancestral haplotype was inferred using the modal allele at each STR locus.32
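The dating step can be illustrated with a minimal sketch, assuming a simplified form of the rho statistic: the mean number of single-repeat differences between each sampled haplotype and the inferred ancestral (modal) haplotype, converted to years with the rate quoted above. The function names and the toy haplotypes are hypothetical, and NETWORK computes rho along the network branches, so its values can differ from this direct count.

```python
from collections import Counter

MU = 6.9e-4          # mutations per locus per 25 years (rate quoted in the text)
YEARS_PER_UNIT = 25  # time unit attached to MU

def modal_haplotype(haplotypes):
    """Return the most common allele at each locus (ties go to the smaller allele)."""
    n_loci = len(haplotypes[0])
    modal = []
    for i in range(n_loci):
        counts = Counter(h[i] for h in haplotypes)
        allele, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
        modal.append(allele)
    return tuple(modal)

def rho(haplotypes, root):
    """Mean number of repeat-unit steps separating the haplotypes from the root."""
    total_steps = sum(abs(a - r) for h in haplotypes for a, r in zip(h, root))
    return total_steps / len(haplotypes)

def tmrca_years(haplotypes, mu=MU):
    """Convert rho to years, assuming the same rate at every locus."""
    root = modal_haplotype(haplotypes)
    n_loci = len(root)
    return rho(haplotypes, root) / (mu * n_loci) * YEARS_PER_UNIT

# Toy data (invented): four chromosomes typed at three Y-STR loci.
toy = [(13, 24, 10), (13, 24, 10), (14, 24, 10), (13, 23, 11)]
print(modal_haplotype(toy), rho(toy, modal_haplotype(toy)), round(tmrca_years(toy)))
```

In this sketch every repeat-unit difference is counted as one mutational step, which is a stepwise-mutation assumption not stated in the text.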

Results

Our initial survey of Y chromosome variation in South American natives with TaqMan RT-PCR identified 1,836 Q-M242 and five C-M130 haplogroup individuals. Because there is an overwhelming predominance of haplogroup Q chromosomes in South America, we focused on SNP identification and validation of new Q-derived lineages. Using four different sequencing strategies for the NRY region, we identified a total of 2503 putatively new Y-chromosomal SNPs. All Y-SNPs were aligned according to their relative position against a reference genome (hg19/GRCh37) and were selected for validation via multiplex genotyping in a BeadXpress system (Illumina). The SNP selection criteria for inclusion in the genotyping platforms were (i) previously unknown variable positions in high-quality sequenced regions, particularly transversions, (ii) SNPs shared between two or more individuals, and (iii) SNPs from individuals representing different ethnic and geographic backgrounds.
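Criterion (i) singles out transversions. As a minimal sketch of that distinction (the function name is hypothetical and is not part of the authors' pipeline), a substitution can be classified by checking whether both alleles fall in the same base class:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_class(ancestral, derived):
    """Classify a single-base substitution as a transition or a transversion."""
    a, d = ancestral.upper(), derived.upper()
    if a == d or not {a, d} <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a valid substitution: {ancestral}->{derived}")
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions;
    # any change that crosses the two classes is a transversion.
    same_class = {a, d} <= PURINES or {a, d} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

# Substitutions as reported for four of the SNPs described in this study.
for name, anc, der in [("SA04", "A", "T"), ("SA05", "A", "G"),
                       ("SA29", "C", "T"), ("SA03.1", "C", "G")]:
    print(name, substitution_class(anc, der))
```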

In the present study, 34 Y-SNPs were initially selected from the different sequencing strategies to generate BeadXpress genotyping platforms 1 and 2 (Table 1). The positions of the SNPs and the sequencing methodology used to identify them are summarized in Table 1. Together with the new Y-SNPs found here, we included most of the SNPs described in the literature for haplogroup Q,1, 10, 12, 16, 17, 20, 24, 29 particularly the ones likely to be informative for discriminating among South American chromosomes, such as L53, L54, SA01, Z5915, Z19483, Z19319, CTS11357, M19 and CTS1780 (Table 2). The M557 and PV2 SNPs16 were not tested in our surveys.

Using multiplex genotyping platforms 1 and 2, we identified three new informative SNPs in the Q-L54 lineage—SA04, SA05 and SA29—which were validated for studies at the population level. Two other population-informative SNPs, SA02 and SA03 (SA03.2), were genotyped by Sanger sequencing and RFLP analysis because the BeadXpress assay failed for these two markers. Furthermore, we validated through the BeadXpress genotyping system an additional five SNPs—Z5915, Z19483, Z19319;47 CTS11357;3 and CTS1780 (ref. 48; Supplementary Tables 2 and 3). Many synapomorphic SNPs, shared by different individuals, are shown in the new Q-haplogroup phylogeny based on our results (Figure 2). For each of these new Q-L54 sublineages (except Q-SA29), we estimated the TMRCA for the chromosomes carrying the derived allele (Table 3).

Figure 2

Haplogroup Q Phylogeny. Phylogeny of Q haplogroup based on Y-SNPs genotyped (Table 2) in this study. SNP, single-nucleotide polymorphism; Y-SNPs, Y chromosome SNPs. A full color version of this figure is available at the Journal of Human Genetics journal online.

Table 3 Estimates of the TMRCA (years) for new Y-SNPs validated for population studies

The SA04 SNP consists of an A->T transversion at Y position 15974563 (GRCh37/hg19) and is found in haplogroup Q-M3. We identified 45 individuals exhibiting the SA04-derived allele, 37 of whom were from five Brazilian indigenous communities (from the municipality of São Gabriel da Cachoeira, Amazonas State, northwest Amazon of Brazil), seven from Ecuador, and one from Peru (Supplementary Tables 4 and 5). We tested a total of 667 haplogroup Q-M3 samples for SA04 using multiplex platforms 1 and 2 (Table 2). Among the 45 individuals displaying the SA04-derived allele, seven Andeans were from Ecuador, one individual belonged to the Muniche community (language isolate) from Loreto (Peru), and 37 individuals were from the cultural confluence area of the northwest Brazilian Amazon on the Upper Negro River (Supplementary Tables 4 and 5). A phylogeographic network was constructed with the Y-STR haplotypes of SA04-derived allele-carrying individuals with their linguistic families (Supplementary Figure 2). The geographic and cultural connectedness of the Q-SA04 lineage individuals corroborates the hypothesis of its recent genealogical origin.

The SA05 SNP is an A->G transition at Y position 8148836 (GRCh37/hg19) found in haplogroup Q-M3. We identified 60 Amazonian individuals exhibiting the SA05-derived allele, 40 of whom originated from 13 different Peruvian indigenous communities, and 20 from 6 indigenous communities from Bolivia (Supplementary Tables 4 and 5). We tested a total of 667 haplogroup Q-M3 samples for SA05 using SNP multiplex platforms 1 and 2 (Table 2).

The SA29 SNP consists of a C->T transition at Y position 6931891 (GRCh37/hg19) found in lineage Q-L54*(xM3). We identified five individuals carrying the SA29 derived allele (Supplementary Tables 4 and 5) from the Maxacali indigenous community, which belongs to the Jean linguistic family, from Minas Gerais (Brazil). We tested a total of 117 Q-L54* individuals for SA29 using SNP multiplex platform 2 (Table 2). Even though these Q-SA29 individuals were apparently unrelated (at first degree) and from different locations, they have the same Y-STR haplotype (Supplementary Table 4).

The SA02 and SA03 (SA03.2) SNPs were tested with multiplex genotyping platforms 1 and 2, with unsatisfactory results. Thus, new genotyping assays based on RFLP analyses were developed for both SNPs (Supplementary Text 1). SA03 actually includes two linked SNPs (SA03.1 and SA03.2) at the DBYd1 locus, which detect the same lineage (Supplementary Text 1). SA03.2 consists of an A->G transition at position 15019822 (GRCh37/hg19), and SNP SA03.1 consists of a C->G transversion at position 15019808 (GRCh37/hg19), which is separated from SA03.2 by 13 base pairs. Both the SA03.1 and the SA03.2 SNPs were detected in eight Coyaima individuals (Cariban linguistic family) by sequencing and RFLP analysis of PCR products amplified using the DBYd1 Alu region primers (Supplementary Text 1). The SA02 SNP consists of an A->G transition at Y chromosome position 14820439 (GRCh37/hg19) in the DFFRY04 locus (Supplementary Text 1). We identified seven individuals exhibiting the SA02 derived allele in the Coyaima population (Cariban linguistic family) from Colombia (Supplementary Tables 4 and 5). All seven individuals with the SA02 derived allele also exhibit the SA03.1 and SA03.2 derived alleles, but a single Coyaima individual carried the derived SA03.1 and SA03.2 alleles and the ancestral SA02 allele. The eight individuals tested for the three aforementioned SNPs also carried derived alleles for markers M242 and L54 and the ancestral allele for marker M3. In the sample set tested in the present study, SA02 and SA03 (SA03.1 and SA03.2) occur exclusively in the Coyaima population of Colombia.

The BeadXpress multiplex platform 2 (Table 2) was also used to characterize the previously described Q sublineages defined by SNPs CTS11357, CTS1780, M19, Z19483, Z19319 and Z5915 (Figure 2). We tested a total of 272 Q-M3 samples for CTS11357 using multiplex platform 2, and identified eight individuals from Western Amazon exhibiting the derived alleles, seven of whom originated from three different Peruvian indigenous communities, while one originated from Bolivia (Supplementary Tables 4 and 5).

We tested a total of 660 haplogroup Q-M3 samples for the M19 SNP using multiplex platforms 1 and 2 (Table 2). We tested a total of 272 Q-M3 haplogroup samples for the Z5915 SNP using multiplex platform 2 (Table 2), and identified two individuals with the derived allele for Z5915 in haplogroup Q-M3 (Supplementary Tables 4 and 5). We tested a total of 117 Q-L54* samples for the CTS1780 SNP using multiplex platform 2 (Table 2), and identified 112 individuals exhibiting the derived state. Although five individuals failed to genotype in BeadXpress, all genotyped South American Q-L54* chromosomes were included in a new derived lineage, Q-CTS1780. This lineage is widely distributed in Peru, Bolivia and Ecuador (Supplementary Tables 4 and 5), and is found among eight Coyaima individuals from Colombia also bearing the SA03 SNP, and five individuals from the Maxacali community (Jean linguistic family) from Minas Gerais (Brazil), who also had the SA29 derived allele. Both Q-SA29 and Q-SA03 are sublineages within Q-CTS1780 (Figure 2). The TMRCA (Table 3) of Q-CTS1780 reveals it to be an ancient lineage, likely predating the entry of the first settlers into the Americas.4, 16, 17

We initially tested a total of 231 Q-M3 individuals for the Z19319 SNP using multiplex platform 2 (Table 2), and identified 21 haplogroup Q-M3 individuals exhibiting the derived allele. Twenty Q-Z19319 individuals were from Andean indigenous communities of Peru and Bolivia (Supplementary Tables 4 and 5). Thirteen individuals exhibiting the Z19319 derived allele also carried the SA01 (ref. 24) derived allele (Supplementary Tables 4 and 5). Q-SA01 now constitutes a sublineage within Q-Z19319 (Figure 2). The eight individuals with the Z19319 derived allele and the SA01 ancestral allele are now classified in the Q-Z19319*(xSA01) paragroup (Figure 2). One individual of paragroup Q-Z19319* is found in each location, except for Chogo (northern Central Andes), which presented two of them (Supplementary Tables 4 and 5). The TMRCA for the Q-Z19319 lineage (Table 3) indicated an origin in the Holocene (~9,000 years ago), likely in the northern region of the Central Andes (Cajamarca, Peru), as is also true for its derivative lineage, Q-SA01, which was dated at about 5,300 years ago.24
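The hierarchical logic behind these lineage and paragroup calls can be sketched as follows. This is a minimal illustration, assuming a simplified tree reconstructed only from the relationships stated in the text (intermediate nodes are omitted and Figure 2 is not reproduced); the `PARENT` map and function names are hypothetical.

```python
# Child -> parent relationships as described in the text (simplified).
PARENT = {
    "M242": None, "L54": "M242",
    "M3": "L54", "CTS1780": "L54",
    "SA04": "M3", "SA05": "M3", "Z19483": "M3", "CTS11357": "M3",
    "Z19319": "M3", "SA01": "Z19319",
    "SA29": "CTS1780", "SA03": "CTS1780", "SA02": "SA03",
}

def ancestors(snp):
    """List every SNP on the path from `snp` back to the root."""
    path = []
    while PARENT[snp] is not None:
        snp = PARENT[snp]
        path.append(snp)
    return path

def assign_lineage(derived_snps):
    """Name the deepest SNP whose whole ancestral path is also derived.

    A trailing '*' marks a paragroup: the node has downstream SNPs in this
    tree, but none of them is derived in the sample.
    """
    derived = set(derived_snps) & set(PARENT)
    consistent = [s for s in sorted(derived)
                  if all(a in derived for a in ancestors(s))]
    if not consistent:
        return None
    terminal = max(consistent, key=lambda s: len(ancestors(s)))
    has_children = any(parent == terminal for parent in PARENT.values())
    return f"Q-{terminal}" + ("*" if has_children else "")

print(assign_lineage({"M242", "L54", "M3", "Z19319", "SA01"}))  # Q-SA01
print(assign_lineage({"M242", "L54", "M3", "Z19319"}))          # Q-Z19319*
```

The sketch does not handle missing genotypes or samples that are derived on two different branches; both cases would need explicit rules in real data.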

We tested a total of 323 Q-M3 haplogroup samples for the Z19483 SNP using multiplex platform 2 (Table 2), and identified 60 haplogroup Q-M3 individuals carrying the Z19483-derived allele. Most of them were Andeans, with 41 individuals coming from 21 different Bolivian indigenous communities, and 19 from 12 different Peruvian indigenous communities (Supplementary Tables 4 and 5).

We also identified a single Q-L53* individual from Quinuabamba, Peru, among our total sample (Supplementary Tables 4 and 5). Even though we identified a large set of new Q-M3 sublineages, Q-M3* paragroup individuals are still the most frequent and widely distributed chromosomes, being observed in 462 individuals who were genotyped using multiplex platforms 1 and 2 (Supplementary Tables 2 and 3).

The five haplogroup C (C-M130) individuals originated from two Ecuadorian communities, and one from Peru. All five individuals were Quechua speakers (Supplementary Table 4), and also displayed the derived allele for SNP M217; thus, these South American natives belong to the C-M217 lineage, as previously reported.17

Supplementary Tables 2 and 3 summarize the Y-SNP genotyping results. Of the 960 samples, 172 served as controls for haplogroups or duplicates, and 10 failed to be genotyped using BeadXpress.

Discussion

The use of the BeadXpress multiplex genotyping platform contributed to the proposal of a new haplogroup Q phylogeny (Figure 2), which differs from those proposed by Van Oven et al.29 and Geppert et al.20 in its enhanced resolution to discriminate between South American indigenous lineages. Within the Q-L54 lineage, South American individuals were divided into the Q-CTS1780 and Q-M3 sublineages, which represent two ancient lineages (>20 000 years old) that likely arrived concomitantly or split early during the first peopling of the Americas. These results are in partial agreement with previous reports.11, 14, 15, 16, 17, 18, 20

A significant difference in the phylogeny proposed in the present study is related to the Q-SA01 lineage, which is now derived from the Q-Z19319 lineage. Considering the distribution of SA01 chromosomes shown in the map of Figure 3, and the occurrence of Q-Z19319*(xSA01) chromosomes in a more northerly part of the Central Andes in Peru, our findings corroborate an Andean migratory route from north to south as previously suggested by the analysis of Y-STRs of the Q-SA01 lineage.24

Figure 3

Distribution of derived alleles for new Y-SNPs found in this study among native South Americans. A map of South America showing locations of indigenous communities displaying haplogroup Q Y chromosomes with derived alleles at SA01, SA03, SA04, SA05, and SA29 SNPs. Individuals displaying the derived SA02 allele are part of the Q-SA03 lineage. SNP, single-nucleotide polymorphism; Y-SNPs, Y chromosome SNPs.

All new lineages found in this study have a restricted spatial distribution (Figure 3), showing that the inclusion of new SNPs also increases the degree of geographical association, which was weakly observed when only Y-STRs within the entire Q-M3 lineage were considered.17 For example, the SA05 SNP is found scattered in lower altitude regions of the pre-Andes Amazon from Peru and Bolivia, while the SA04 SNP is found in regions of the northwestern Amazon and northern Central Andes (Ecuador), displaying a strong association with indigenous populations of the Tukanoan linguistic family. The finding of many unrelated and younger branches in the haplogroup Q phylogeny (Figure 2) is expected considering the population expansions occurring during the settlement of the Americas, including more recent local and regional expansions as already shown for native American Y chromosome data by Battaglia et al.16

The sublineage Q-SA05 occurs exclusively in western Amazon communities of Bolivia and Peru, some living in transition rainforest areas (Yunga) between lowland Amazon and the Andes. Interestingly, it was found in 60 individuals speaking 15 Amazonian languages belonging to five linguistic families, and also two language isolates. Because SA05 appears to have a relatively old origin (TMRCA ~10 000 years ago, Table 3), it was expected to have a broad distribution throughout indigenous communities speaking different languages. However, owing to its restricted geographic distribution and occurrence in many indigenous populations, it could alternatively be an ancient lineage belonging to local hunter-gatherers who were assimilated into modern groups. This restricted distribution could also be attributed to genetic drift, which is much more pronounced among Amazonian indigenous groups than Andean ones.33

Individuals belonging to the Q-SA04 lineage are restricted to the northwestern Amazon, across Brazil, Peru and Ecuador. We identified the probable root of the Q-SA04 lineage as the ancestral Y-STR haplotype (IAU06) based on each STR locus’ modal allele (Supplementary Table 4). The Y-STR data collected in the present study suggest a likely migratory route for lineage Q-SA04 from the Brazilian Amazon to the Andes, as the haplotypes carrying the modal alleles were located in Iauaretê, on the border between Brazil and Colombia. The presence of more recent and derived Q-SA04 STR haplotypes in the Andes (Ecuador) may indicate that Amazonian groups were assimilated into the Central Andes indigenous communities (Quechua and Aymara) as previously suggested,18 likely during the formation of the Andean states, which culminated with the Inca Empire. However, genotyping a larger number of samples from this region will allow further clarification of this matter.

Interestingly, the derived allele of the SA04 SNP appears in 37 of the 63 individuals from São Gabriel da Cachoeira, in the Upper Rio Negro (northwestern Brazilian Amazon, close to the border with Colombia). Twenty-nine of them belong to the Tukanoan linguistic family, six to the Arawakan linguistic family and two to the Puinavean linguistic family. The Tukano and Arawak groups are horticulturalist communities who interact culturally in the region through the exchange of wives. By contrast, the Puinavean people are typical endogamous hunter–gatherer groups, but sometimes allow marriages between Tukano males and Puinavean females.52

The SA03 and SA02 SNPs were found in a single ethnic group sampled from Colombia, although those chromosomes display different Y-STR haplotypes and a TMRCA of about 1000 years (Table 3). Therefore, a larger population survey in Colombia and neighboring regions is needed to reveal their true distribution among native South Americans. The SA29 SNP is the first Y-SNP described for populations from the Jean linguistic group. However, even though a careful sampling was done to avoid relatives up to third degree, all the individuals bearing SA29 also have the same Y-STR haplotype. Thus, it is likely that this SNP has a very recent origin, although a larger population survey is still needed to confirm this hypothesis.

The distribution of Y-SNPs also provides clues about transition areas for the occurrence of paternal lineages. The area between Puerto Ocopa and Mazamari, in Peru, which is a boundary region between the Andes mountain range and the Amazon plain, exhibits the greatest haplotypic diversity among Q chromosomes, with five different lineages being observed there, including Q-CTS11357 (n=2); Q-Z19319* (n=2); Q-SA01 (n=1); Q-SA05 (n=9); Q-M3* (n=6; Supplementary Table 5). This is also the area where the greatest number of Andean (Q-Z19319* and SA01) and lowland (Q-SA05 and Q-CTS11357) lineages were observed, indicating a point of contact between Andean and Amazon populations (Figure 3). Indeed, it is located in an Andes-Amazon transition region of the Junin Province (Peru) where many natives from Andean (Quechua) and Amazonian (Ashaninka, Nomatsiguenga, Kakinte and Yanesha) origins (INEI, Peru) live.

The TMRCA estimates (Table 3) for each Q sublineage based on Y-STR variation also give clues about their association with particular cultural shifts and demographic events, such as the suggested association of maize cultivation spread in the Andes with Q-SA01.24 According to Scliar et al.,53 lineages Q-SA05 (Amazonian) and Q-Z19319 (Andean) found in Peru are related to Pre-Ceramic periods I and II, when the cultivation of cassava, pumpkin and sweet potato was starting. In addition, lineage Q-CTS11357 in Peru may be related to Pre-Ceramic periods III and IV in the Amazon region, without showing a link to the cultivation of any particular vegetable. Lineage Q-SA01 is related to Pre-Ceramic periods V and VI and cultivation of many plants, including maize. Lineage Q-Z19483 is related to the Late Intermediate Period, marked by the founding of the Inca civilization, and could be an important marker for the study of the formation of the Andean states in the last millennium. Interestingly, Q-Z19483 is the most recent lineage (Table 3), but occurs in 60 native individuals distributed throughout the Central Andes, which could be a likely outcome of a recent population expansion associated with a complex civilization like the Incas. However, these date estimates based on Y-STR and distance-based statistics should be interpreted with care because a number of issues have been raised.54

The discovery of new SNPs that are useful from the population point of view is important for acquiring further knowledge about the history of the peopling of South America. We were able to allocate approximately 30% of the Q-M3 lineage samples into derived sublineages. The new proposed phylogeny enhances the resolution of haplogroup Q in native Americans by including new Y-SNPs and connecting branches, particularly for populations in the Western Amazon and Andes.
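As a rough consistency check on the figure of approximately 30%, the carrier counts reported in the Results can be tallied against the Q-M3* count. This is only a sketch under stated assumptions: the text does not say how the authors derived the percentage, the SNPs were tested on different numbers of samples, no carrier count is given for M19, and the variable names are hypothetical.

```python
# Carriers of the derived allele, as reported in the Results section.
derived_counts = {
    "SA04": 45,
    "SA05": 60,
    "CTS11357": 8,
    "Z5915": 2,
    "Z19319": 21,   # includes the 13 chromosomes that also carry SA01
    "Z19483": 60,
}
q_m3_star = 462      # Q-M3* paragroup individuals reported in the Results

derived_total = sum(derived_counts.values())
fraction = derived_total / (derived_total + q_m3_star)
print(f"{derived_total} of {derived_total + q_m3_star} Q-M3 chromosomes "
      f"({fraction:.1%}) fall into derived sublineages")
```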