Improved phylogenetic resolution for Y-chromosome Haplogroup O2a1c-002611

Y-chromosome Haplogroup O2a1c-002611 is one of the dominant lineages of East Asians and Southeast Asians. However, its internal phylogeny remains insufficiently investigated. In this study, we genotyped 89 new highly informative single nucleotide polymorphisms (SNPs) in 305 individuals with Haplogroup O2a1c-002611 identified from 2139 Han Chinese males. Two major branches were identified, O2a1c1-F18 and O2a1c2-L133.2, and the first was further divided into two main subclades, O2a1c1a-F11 and O2a1c1b-F449, accounting for 11.13% and 2.20% of Han Chinese, respectively. In Haplogroup O2a1c1a-F11, we also determined seven sublineages with quite different frequencies in Han Chinese, ranging from 0.187% to 3.553%, implying that they might have different demographic histories. The reconstructed haplogroup tree for all the major clades within Haplogroup O2a1c-002611 permits better resolution of male lineages in population studies of East Asia and Southeast Asia. The dataset generated in the present study is also valuable for forensic identification and paternity tests in China.

Haplogroup O2a1c-002611 probably did not participate in the formation of Tibeto-Burman groups but was heavily involved in the origin and expansion of Han Chinese 12,15,16.
Despite its abundance, wide distribution and importance to Sino-Tibetan populations, the phylogeny of Haplogroup O2a1c-002611 has not been adequately resolved with respect to O-M95 17 and O-M134 18. The population history of Han Chinese remains unclear because the phylogeny of Haplogroup O2a1c-002611 still lacks resolution: no downstream markers have been genotyped and described in large-scale sample collections, and the phylogenetic positions of those markers have yet to be determined. To date, the only two markers internal to O2a1c-002611 investigated in the literature have been F11 and F238 12, which were not sufficient to resolve the phylogeny of the lineages belonging to this haplogroup. The recent next-generation sequencing of East Asian samples has yielded a variety of novel SNPs purportedly belonging to the O2a1c-002611 lineage 14,19-21. Here, we describe a large-scale, nationwide study of Haplogroup O2a1c-002611 in Han Chinese, using high-density genotype data to examine the phylogenetic positions of newly reported markers and to provide useful tools for future population history analysis.

Methods
All participants were drawn from the customer base of WeGene, Inc., a consumer personal genetics company. The study was conducted in accordance with the human and ethical research principles of The Ministry of Science and Technology of the People's Republic of China (Interim Measures for the Administration of Human Genetic Resources, June 10, 1998). Participants provided informed consent and participated in the research online, under a protocol approved by the Ethical Committee of WeGene, Inc.
DNA extraction and genotyping were performed on saliva samples. Samples were genotyped on the WeGene V1 genotyping platform using Affymetrix arrays with a total of about 596,000 SNPs. Quality control (QC) was performed in PLINK V1.07 22. Individuals and SNPs with a genotype call rate of <98.5% were excluded. Relatedness was checked pairwise for all samples; where identity by descent (IBD) scores of >0.125 (3rd-degree relatives) were identified, one individual from each such pair was removed. Individuals whose analyses failed repeatedly were recontacted by WeGene customer service to provide additional samples, as is done for all WeGene customers. The WeGene V1 arrays were designed to identify all known Y-chromosome lineages with 18963 phylogenetically relevant Y-chromosome SNPs. In this study, we investigated 89 SNPs that overlap with the markers listed in the ISOGG O2a1c-002611 phylogenetic tree accessed on 21 April 2016, with the 14 August 2016 correction (http://www.isogg.org/). Here, we follow the rules proposed by the Y Chromosome Consortium 23 for updating haplogroup names and phylogenetic trees of the Y chromosome.
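The two sample-level QC thresholds described above can be illustrated with a short Python sketch. This is not the PLINK procedure used in the study: the function names, the toy genotypes, and the rule for which member of a related pair is dropped are all assumptions made for illustration, and the per-SNP call-rate filter is omitted for brevity. Only the two thresholds (98.5% and 0.125) come from the Methods.

```python
from typing import Dict, List, Optional, Set, Tuple

CALL_RATE_MIN = 0.985  # call rates below 98.5% are excluded (from the Methods)
IBD_MAX = 0.125        # pairs with IBD above this count as related (from the Methods)


def call_rate(genotypes: List[Optional[str]]) -> float:
    """Fraction of non-missing genotype calls; None marks a missing call."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g is not None)
    return called / len(genotypes)


def filter_by_call_rate(samples: Dict[str, List[Optional[str]]]) -> Set[str]:
    """Keep the samples whose call rate is at least CALL_RATE_MIN."""
    return {sid for sid, calls in samples.items()
            if call_rate(calls) >= CALL_RATE_MIN}


def prune_related(kept: Set[str],
                  ibd: Dict[Tuple[str, str], float]) -> Set[str]:
    """Drop one member of every related pair that is still fully retained.

    Assumption: the second member of the pair is dropped. The source does
    not say which individual of a pair was removed.
    """
    retained = set(kept)
    for (a, b), score in sorted(ibd.items()):
        if score > IBD_MAX and a in retained and b in retained:
            retained.discard(b)
    return retained


# Invented toy data: s2 has 2% missing calls, s1 and s3 are close relatives.
toy_samples = {
    "s1": ["AA"] * 1000,
    "s2": ["AA"] * 980 + [None] * 20,
    "s3": ["AG"] * 1000,
    "s4": ["GG"] * 1000,
}
toy_ibd = {("s1", "s3"): 0.25, ("s1", "s4"): 0.05}

passed = filter_by_call_rate(toy_samples)
unrelated = prune_related(passed, toy_ibd)
print(sorted(unrelated))
```

With these toy inputs, s2 fails the call-rate filter and s3 is removed as a relative of s1, leaving s1 and s4.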

Results
Among the 2139 male individuals, 305 (14.26%) belong to the O2a1c-002611 lineage (Table 1), in agreement with previous studies of East Asian populations 4,12-14. For these individuals with a derived allele at IMS-JST002611, we investigated the other 88 SNPs purportedly belonging to the O2a1c-002611 haplogroup (genotyping results with hg19 physical positions and sample locations are given in Table S1), and the results allowed us to update the phylogenetic tree of O2a1c-002611. We applied the parsimony rule in tree construction. For example, F61, CTS1872, F240, F247, CTS2483, F302, F309, CTS5879, F460, and F562 showed derived status in all IMS-JST002611 derived samples, indicating that they are equivalent to IMS-JST002611 in the phylogeny. For F18, the majority of samples have derived alleles, but some show ancestral status, indicating that F18 is a downstream SNP of IMS-JST002611 (Fig. 1).
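The parsimony reasoning in this paragraph can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the function name, the "D"/"A" encoding of derived and ancestral calls, and the three toy samples are assumptions. Only the marker names and the rule itself (derived in all carriers means equivalent; derived in some and ancestral in others means downstream) come from the text.

```python
from typing import Dict, Optional

Calls = Dict[str, Dict[str, Optional[str]]]  # sample -> marker -> "D", "A" or None


def classify_markers(calls: Calls, defining_marker: str) -> Dict[str, str]:
    """Classify each marker relative to the haplogroup-defining marker.

    Only samples that are derived at the defining marker are considered.
    Missing calls (None) are ignored when deciding the class.
    """
    carriers = [s for s, c in calls.items() if c.get(defining_marker) == "D"]
    markers = {m for s in carriers for m in calls[s]} - {defining_marker}
    result = {}
    for marker in sorted(markers):
        observed = [calls[s].get(marker) for s in carriers]
        observed = [state for state in observed if state is not None]
        if not observed:
            result[marker] = "uninformative"
        elif all(state == "D" for state in observed):
            result[marker] = "equivalent"       # derived in every carrier
        elif any(state == "D" for state in observed):
            result[marker] = "downstream"       # derived in only some carriers
        else:
            result[marker] = "ancestral_in_all"
    return result


# Invented toy calls for three carriers and one non-carrier.
example_calls: Calls = {
    "ind1": {"IMS-JST002611": "D", "F61": "D", "F18": "D"},
    "ind2": {"IMS-JST002611": "D", "F61": "D", "F18": "D"},
    "ind3": {"IMS-JST002611": "D", "F61": "D", "F18": "A"},
    "ind4": {"IMS-JST002611": "A", "F61": "A", "F18": "A"},
}

print(classify_markers(example_calls, "IMS-JST002611"))
```

In the toy data F61 is derived in all three carriers and is reported as equivalent, while F18 is ancestral in one carrier and is reported as downstream, mirroring the two cases described above.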
Our identification of the seven branches within O2a1c1a-F11 is consistent with the previous finding 14 that this lineage probably experienced a huge population expansion in the Neolithic. However, these seven sub-branches show quite different frequencies in Han Chinese, ranging from 0.187% in O2a1c1a7 to 3.553% in O2a1c1a1. The frequency of O2a1c1a5 in Han Chinese also reaches 2.665%, while the frequencies of the other four sub-branches are all below 1% (Table 1).
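The frequencies quoted in the Results are simple proportions of the 2139 sampled males. The Python sketch below reproduces the 14.26% figure from the counts given in the text (305 of 2139) and back-calculates the sample counts implied by the three sub-branch percentages. The implied counts are an arithmetic inference from the rounded percentages, not values read from Table 1, and the function names are hypothetical.

```python
TOTAL_MALES = 2139  # number of Han Chinese males genotyped (from the text)


def frequency_percent(count: int, total: int) -> float:
    """Frequency of a lineage as a percentage of all sampled males."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


def implied_count(percent: float, total: int) -> int:
    """Nearest whole number of individuals matching a reported percentage."""
    return round(percent * total / 100.0)


# 305 carriers of O2a1c-002611 among 2139 males, as stated in the Results.
print(round(frequency_percent(305, TOTAL_MALES), 2))

# Back-calculated counts (an inference, not taken from Table 1).
for label, pct in [("O2a1c1a7", 0.187), ("O2a1c1a1", 3.553), ("O2a1c1a5", 2.665)]:
    n = implied_count(pct, TOTAL_MALES)
    print(label, n, round(frequency_percent(n, TOTAL_MALES), 3))
```

Recomputing the percentage from each implied count is a quick consistency check: if the rounded result differs from the reported value, either the count or the denominator has been misread.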
The geographic distribution pattern of Haplogroup O2a1c-002611 in our current study is consistent with previous estimations that this haplogroup is enriched in the eastern part of China. In the populations of Jiangsu, Anhui, Zhejiang, and Shanghai, nearly one-third of the males belong to this lineage, as shown in Table 1. There are interesting substructures in the distributions of different sublineages. One of the two main subclades of O2a1c-002611, O2a1c1a-F11 (and its sublineages), occurs at similar frequencies in eastern, northern and southern China. However, the other subclade O2a1c1b-F449 and its sublineages O2a1c1b1-F238 and O2a1c1b2-F1266 are particularly enriched in northern China, with a frequency of 1.12%, compared with only 0.47% and 0.61% in eastern and southern China, respectively. The observation is consistent with our hypothesis in Wang et al. 12   the split with Tibeto-Burman and other southern native populations. The lineage O2a1c1a*-F11 (samples that have derived alleles only at sites F11 and F425 and at no other downstream SNPs) is two to three times lower in frequency in northern China than in eastern and southern China, and we have not found O2a1c1a1*-F632 in northern China. However, Haplogroups O2a1c1a1a1b, O2a1c1a5, O2a1c1b1a1, and O2a1c1b2 are more frequently seen in northern China than in southern and eastern China.

Discussion
Haplogroup O2a1c-002611 is widely distributed in East Asia and surrounding areas. The genotyping of 89 phylogenetically relevant SNPs under Haplogroup O2a1c-002611 enables us to refine and update the phylogeny of this lineage. The reconstructed haplogroup tree for all the major clades within Haplogroup O2a1c-002611 permits better resolution of male lineages in population studies of East Asia and surrounding areas. This study shows that the 89 SNPs are highly informative for separating a substantial part of O2a1c-002611 samples in China. We observe a greatly expanded lineage, O2a1c1a-F11, within Haplogroup O2a1c-002611, comprising 11.13% of the Han Chinese. There are seven subclades nested within O2a1c1a-F11, suggesting that the expansion of this lineage is star-like 7. Those subclades might have experienced different demographic histories since they separated from a common ancestor, because their frequencies in present-day Han Chinese differ so widely, ranging from 0.187% to 3.553%. A similar pattern has been observed in another  11 . One possible explanation for this uneven expansion is social selection: a few paternal lineages within the early expanded farming population achieved a sustained advantage that enabled them to leave more descendants.
Since Haplogroup O2a1c-002611 has distinct distributions in Han Chinese and Tibeto-Burman populations and probably experienced agriculture-induced expansion, exploring the detailed phylogenetic relationships of the subclades in this lineage is informative not only for tracing prehistoric migrations, but also for understanding the origin and diversification of the Sino-Tibetan language family in the future. For instance, although Haplogroup O2a1c-002611 is rare in Tibeto-Burman groups, we have found it at 1% to 3% in Qiangic-speaking populations, such as Muya, Jiarong, Queyu and Qiang in the Tibeto-Burman Corridor 12. The Qiangic-speaking groups are suggested to have played an important role in the formation of Sino-Tibetan populations based on historical documents, linguistics, and genetic studies 15,24,25. Genotyping the Qiangic-speaking populations with this improved phylogeny of Haplogroup O2a1c-002611 will certainly provide detailed information for understanding the origin of Sino-Tibetans.
We note that a limitation of our study is that we have genotyped Haplogroup O2a1c-002611 only in Han Chinese samples, whereas this haplogroup has also been found at moderate or even high frequency in various ethnic groups in southern China, Laos, Vietnam, and the Philippines 12,13,26. Detailed characterization of this haplogroup in those populations could provide a broader framework for the peopling of East Asia and Southeast Asia.
The recent next-generation sequencing of worldwide samples has yielded tens of thousands of novel Y-chromosome SNPs that are purportedly phylogenetically relevant 14,19-21. However, it is extremely time-consuming and costly (or even impossible) to validate all those markers by the PCR and SNaPshot techniques that we usually used in previous studies 4,8,9,12,15. Here, we give a successful example of how consumer genetic testing based on microarray SNP genotyping technology can be used in Y-chromosome phylogeny analysis. The reconstructed phylogeny of these new markers in this study is only the first step, and the real benefit will come from typing a large number of O2a1c-002611 derived individuals of various phylogeographic and ethnic backgrounds, which will certainly broaden our understanding of the population history.