Using Y-chromosome capture enrichment to resolve haplogroup H2 shows new evidence for a two-path Neolithic expansion to Western Europe

Uniparentally-inherited markers on mitochondrial DNA (mtDNA) and the non-recombining regions of the Y chromosome (NRY), have been used for the past 30 years to investigate the history of humans from a maternal and paternal perspective. Researchers have preferred mtDNA due to its abundance in the cells, and comparatively high substitution rate. Conversely, the NRY is less susceptible to back mutations and saturation, and is potentially more informative than mtDNA owing to its longer sequence length. However, due to comparatively poor NRY coverage via shotgun sequencing, and the relatively low and biased representation of Y-chromosome variants on capture assays such as the 1240 k, ancient DNA studies often fail to utilize the unique perspective that the NRY can yield. Here we introduce a new DNA enrichment assay, coined YMCA (Y-mappable capture assay), that targets the "mappable" regions of the NRY. We show that compared to low-coverage shotgun sequencing and 1240 k capture, YMCA significantly improves the mean coverage and number of sites covered on the NRY, increasing the number of Y-haplogroup informative SNPs, and allowing for the identification of previously undiscovered variants. To illustrate the power of YMCA, we show that the analysis of ancient Y-chromosome lineages can help to resolve Y-chromosomal haplogroups. As a case study, we focus on H2, a haplogroup associated with a critical event in European human history: the Neolithic transition. By disentangling the evolutionary history of this haplogroup, we further elucidate the two separate paths by which early farmers expanded from Anatolia and the Near East to western Europe.

cutting them in half before drilling out bone powder from the dense portion 2 (or by keeping them intact and drilling into the dense portion from the outside. Teeth were sampled by removing the crown and drilling into the pulp chamber to produce bone powder. Resulting bone powder (50-100mg) was placed in 2mL Biopure tubes and stored until DNA extraction.
To the Biopure tubes containing 50-100mg bone powder one mL of extraction buffer made up of 0.9mL 0.5M EDTA, 0.025mL 0.25mg/mL Proteinase K and 0.075mL UV HPLC-water was added. The resulting mixture was incubated at 37 o C for 20 hours under constant rotation. Following incubation, Biopure tubes were centrifuged at 18500 relative centrifugal force (rcf) for two minutes, separating the soluble from the insoluble parts of the mixture. The lysate was added to a 50mL falcon tube in which 10mL of binding buffer was mixed with 400µL of sodium acetate (pH 5.2, 3M).
This mixture was then placed in a High Pure Extender Assembly (HPEA) tube and centrifuged at 1500 rcf for 8 minutes in a 50 mL Thermo Scientific TX-400 Swinging Bucket Rotor. Each HPEA's column was removed and inserted into a clean collection tube before centrifuging again at 18500 rcf for 2 minutes. To each column, 450µLof wash buffer from the high pure viral nucleic acid kit (HPVNAK) was added and the mixture was centrifuged for one minute at 8000 rcf. The columns were then placed into fresh collection tubes before another round of washing during which 450μL of wash buffer from the HPVNAK was added to each column followed by one minute centrifugation at 8000 rcf. Resulting columns with washed DNA were placed in 1.5 silicon tubes to which 50µL of TET was placed in the middle of the columns. The columns were incubated at room temperature for three minutes and centrifuged for one minute at 18500 rcf. Another round of adding 50µLof TET followed by incubation and centrifugation was performed and the final 100µL of DNA extract was stored at -20°C until further downstream use.
The resulting solutions were incubated for 30 minutes at 37°C and 10 minutes at 12°C. The UDG reactions were inhibited by adding 3.6μL 2U UGI to each PCR strip tube and incubating at 37°C for 30 minutes and again at 12°C for one minute. Blunt end repair was done by addition of 1.65μL 3U T4 DNA Polymerase and 3μL 10U T4 Polynucleotide Kinase, followed by incubation at 25°C for 20 minutes, and then at 12°C for 10 minutes. The resulting mixtures were purified with MinElute kits followed by elution in 20μL Elution buffer (EB) mixed with 0.05% tween.
Ligation of Illumina adapters was performed through the mixture of 18μL eluate from the previous step with 1μL 5U Quick Ligase, 1μL 10µM Adapter Mix and 20μL of 2x Quick Ligase Buffer. The resulting solution was incubated for 20 minutes at 22°C and purified using a MinElute kit, followed by elution in 22μL EB containing 0.05% tween.
Adapter fill in reactions were done by adding 20μL of eluate from previous step to 2μL 8U Bst 2.0 Polymerase, 0.2μL 25mM dNTPs, 4μL 10x Isothermal buffer, and 13.8μL UV HPLC-water and incubating at 37°C for 30 minutes followed by 80°C for 10 minutes. The resulting DNA libraries were stored at -20°C until indexing.
Library-specific and unique index combinations were ligated to both 5' and 3' ends of DNA fragments in each library via an indexing PCR. The total volume of each library was split into four different indexing PCR reactions which were done by mixing 2μL 10µM P5 index, 2μL 10µM P7 index, 1μL 2.5U Pfu Turbo Polymerase, 1μL 25mM each dNTPs, 1.5μL 20mg/ml BSA, 10μL 10x Pfu Turbo Buffer, 73.5μL UV HPLC-water, and 9μL of DNA library. The resulting mixture was amplified in a thermocycler with initial denaturation at 95°C for 2 minutes, followed by 10 cycles of 95°C for 30 seconds, 58°C for 30 seconds, 72°C for 1 minute, and finally 72°C for 10 minutes. The indexed libraries of the same sample were pooled and purified with a MinElute kit. Libraries were then quantified using qPCR and PCR amplified to contain 10 13 copies of DNA.
Resulting libraries were shotgun sequenced (~5,000,000 reads, either paired end with 50 cycles or single end with 75 cycles) to assess the library complexity and degree of human DNA preservation (% endogenous DNA, aDNA damage). Libraries with more than 0.1% endogenous DNA were deemed adequately preserved and selected for an insolution hybridization enrichment (Fu et al. 2014) that targets ~10,445 kB on the NRY ("YMCA capture"). YMCA captured libraries were single end sequenced with 75 cycles to an average depth of 40 million reads per YMCA library.
Libraries were not pooled prior to this capture method. After the enrichment the libraries were amplified to a concentration of 10 nM followed by a single end sequencing with 75 cycles to an average depth of 40 million reads per YMCA library.
All libraries were processed using EAGER 3 , a modular tool that streamlines the processing of libraries from FastQC and quality filtering to mapping and duplicate removal. Sequencing adapters were clipped with AdapterRemover

Derivation of our method for estimating the pairwise time to most recent common ancestor (TMRCA)
We are interested in estimating the total amount of evolutionary time that has passed between two samples, denoted τij, which can be separated into three non-overlapping intervals: the total evolutionary time until the last substitution occurs, and the total amount of evolutionary time that passed after the final substitution for samples i and j respectively, Respectively, where !" * = 2 !" $ + !" (see Figure S1).
We begin by estimating the total evolutionary time between the i th and the j th samples, denoted !" * ,up until the final substitution occurred. Let !" > 0, be the total number of overlapping SNPs., let !" > 2 be the number of observed segregating sites, and let % > 0 be the rate of substitutions per site per calendar year.
We assume that each substitution occurs according to a Poisson process with rate !" = % !" , relative to the number of overlapping SNPs for individuals i and j. Hence, for some evolutionary time > 0, the total number of observed substitutions has an Erlang distribution with probability density function (pdf) Hence, if we assume that !" and !" are known, then we may look for the optimal value of that maximizes the pdf.
We do this by considering the log-likelihood function Hence, we have that If we set the first derivative of the log-likelihood to zero, we obtain a candidate maximum likelihood estimate (MLE) ' * , and since 4 !" − 15, 1 > 0, it must be that /0 * > is a maximum likelihood estimate. We also use the property that the variance of a single, unknown parameter is approximately the negative of the reciprocal of the Fisher information, e.g.
It must also be considered that the final substitutions probably did not occur at time ! and " , and that some time will have passed with no substitution events since the last substitution. Hence, to our MLE for the time until the final substitution /0 * > , we must add some additional amount of time.
To achieve this we make the standard assumption that the time until the next substitution would have occurred for sample ∈ { , } is exponentially distributed, that is, !" 2~4 !" 5. We also assume that we observed a random proportion of the interval until the next substitution, denoted !" Now that we have derived estimates for !" * and the !" 2 , we transform these to find an estimate of the TMRCA of samples i and j, relative to the present day. Note that the true total amount of shared evolutionary time between the final substitution between samples i and j can be rewritten in terms of the shared and unshared branch times (as in Figure ??) where !" = 4 " + !" " 5 − 4 ! + !" ! 5.
Since L !" 2 M = 1/2, it can be can be shown that Note that since a natural choice of estimator for the length of shared evolutionary time for individuals i and j would be yielding an estimator for the TMRCA for individuals i and j, relative to the present day would be for which a best estimator can be simplified to give ;.
Finally, we have that To test the performance of our method, we simulated 10,000 realisations of the following process. We used a grid search for the number of overlapping SNPs ranging from 20,000 to 10,000,000, (the observed values from our filtered data set) and two randomly sampled branch lengths from a log-Normal distribution from between 20,000 and 175,000 years to sample a random tree and number of overlapping sites. We used the simSeq function from the phangorn package 4 to produce a pair of sequences (using a Jukes-Cantor model of substitution and a substitution rate of 4.5 × 10 +,% substitutions per site per year per individual), from which we could count the number of pairwise segregating sites.
We found that 94.69% of the known simulated TMRCAs were within the 95% confidence intervals as calculated by our method. We found that the accuracy of our method was uncorrelated with the number of overlapping sites (p=0.951), the true TMRCA (p=0.961), or a combination of both (p=0.193) indicating that our method is unbiased for both the depth of time, and the number of overlapping sites observed in our data.

TMRCA Estimation
To count the number of pairwise-segregating sites between two individuals, we began by making a fasta file of aligned consensus sequences for all each bam file, and performing the following quality filter; for each sequence we considered only sites for which we had at least three reads, with a minimum allele frequency less than 10%, and called increase in the substitution rate due to using ancient samples. When we estimated the substitution rate using the combined data, we found a substitution rate of 4.5 × 10 +,% , which falls within the confidence intervals of existing estimates 5,6 .   Figure S4: Presence and absence of SNPs for our H2 sub-haplogroups (note that we include our potential further sub-haplogroups H2m1, H2m2, H2d1 and H2d2) with samples on the y-axis, and SNP positions on the x-axis. Columns facets represent ISOGG SNP branch assignments, and potentially newly identified SNPs (denoted with "ABR"), row facets indicate our H2 sub-haplogroups assignments. Green, red and black indicate derived, ancestral or missing forms of SNPs.