A finely resolved phylogeny of Y chromosome Hg J illuminates the processes of Phoenician and Greek colonizations in the Mediterranean

In order to improve the phylogeography of the male-specific genetic traces of Greek and Phoenician colonizations on the Northern coasts of the Mediterranean, we performed a geographically structured sampling of seven subclades of haplogroup J in Turkey, Greece and Italy. We resequenced 4.4 Mb of Y-chromosome in 58 subjects, obtaining 1079 high quality variants. We did not find a preferential coalescence of Turkish samples to ancestral nodes, contradicting the simplistic idea of a dispersal and radiation of Hg J as a whole from the Middle East. Upon calibration with an ancient Hg J chromosome, we confirmed that signs of Holocenic Hg J radiations are subtle and date mainly to the Bronze Age. We pinpointed seven variants which could potentially unveil star clusters of sequences, indicative of local expansions. By directly genotyping these variants in Hg J carriers and complementing with published resequenced chromosomes (893 subjects), we provide strong temporal and distributional evidence for markers of the Greek settlement of Magna Graecia (J2a-L397) and Phoenician migrations (rs760148062). Our work generated a minimal but robust list of evolutionarily stable markers to elucidate the demographic dynamics and spatial domains of male-mediated movements across and around the Mediterranean, in the last 6,000 years.

Low quality reads, contamination with adapters and repeated reads were discarded and the sequences of each subject were aligned to the human Y chromosome reference sequence (GRCh37/hg19) by means of the BWA (Burrows-Wheeler Aligner) software 1 , generating an alignment file (.bam format) 2 .

Data filtering and variant calling
The filtered regions were further refined with the removal of ~ 0.70 Mb within the repetitive elements (using the Repeat Masker and Simple Repeats tracks from the Table browser tool of the UCSC Genome Browser). Two blocks (chrY:21152521-21155078; chrY:28793241-28819317) with high average depth were also discarded, because this value could be indicative of chromosome rearrangements.
For each subject high quality positions were then retained 2 if base quality threshold >40, mapping quality threshold >30, a minimum depth of 3 and not higher than the 97.5 percentile of the depth distribution in the same subject. The positions passing these steps were intersected across subjects, arriving at 2,723,854 bp Variant calls were obtained with both GATK 3 and FreeBayes 4 , with haploid specific parameters and stringent criteria. Indels were removed. The remaining variants called differently by the two software were visually inspected with the Integrative Genome Viewer and rejected if any of these conditions were verified: 1) occurrence in short simple repeats, 2) involvement of multiple adjacent positions or 3) mean alternate fraction <75%. Exclusion of a variable position included removal of 100 bp on both sides, in order to minimize the impact on the overall density of variants per sequenced position. A final step consisted in the check, in each subject, of the equality of depth in the shared vs. private variants. One subject did not pass this step (t-test p=0.00013) and was  Table 2) out of 2,711,986 bp (Supplemental Table 3).

Quality controls
The number of private variants did not correlate with the overall sequencing depth across subjects (r 2 = 3E-6, n.s.).
For six subjects, targeted capture and sequencing was independently performed (Supplemental Table 1) in a previous study 5 . When the genotype calls of these subjects in the two studies were compared, no variant detected in ref. 5 was missed in the present study. Conversely, we scored an excess of 15 variants (Supplemental Table 4), yielding an overall rate of discrepant genotype calls of 0.98%. Three lines of evidence indicate that the majority of these can be considered bona fide variants. First, some of them were already recorded in dbSNP. Second, when previously detected (Isogg.org), the same variants were assigned to Hg J or some sub-haplogroup of it. Third, 11 of the 15 variants were reconfirmed in additional subjects among the 58 here studied (see branch assignment in Supplemental Table 2), and only 4 remained private and could not be further confirmed.
Overall, we observed a vast excess of transitions vs. transversions (Supplemental Table 5), in line with well-established genome-wide patterns 6 . We further explored the dependency of variant type on the nucleotide context 7 , by downloading from the reference sequence the trinucleotides centred on each of our 1079 variable positions (Supplemental Table 2

SNP and STR genotyping
Genotyping of additional subjects at selected variable positions was by Sanger sequencing of PCR products obtained with primers designed with PrimerBLAST (https://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=BlastHome ) on the 2000 bp centred on the target position (Supplemental Table 6). For each of these SNP, only carriers of the lineage candidate to harbour the derived allele at the target positions were genotyped.

4/14
Genotyping at seven microsatellite loci (YCAIIa, YCAIIb, DYS19, DYS390, DYS391, DYS392 and DYS393) was by PCR, in the presence of fluorescently labelled primers 8 and separation in an automated sequencer. PCR fragment sizes were transformed in repeat sizes by comparison with sequenced standards.

Tree construction and statistics
The maximum parsimony (MP) tree was obtained with MEGA 9 from the 1079 x 58 matrix of allele states (Figure 1 and Supplemental Figure 3). The MP tree inferred a single recurrent mutational event at a CpG dinucleotide (Y:16610308) (Supplemental Table 2). Topology and branch lengths were further verified by constructing a median-joining network with the program NETWORK 10 on the same matrix. This was also used to calculate node ages with the rho method 11 .
The Branch Length Ratio test 7 was obtained by calculating the length of both descending branches from the MRCA of all possible pairs of tree tips (subjects), with the R package "ape" 12 .
The departure from the null, equal length hypothesis, was evaluated by a 1 d.f. Chi-square test, and the results summarized in a Q-Q plot (Supplemental Fig. 2B).
NETWORK was also used to represent the relationships between 7-STR haplotypes within SNP-defined clades, weighting each STR locus according to the inverse of its repeat variance and modelling SNPs as additional loci with weight = 99.

Dating
Node ages based on SNP diversity (Supplemental Table 7) were obtained with BEAST 13 for the tree including also the Hg J individual "Kotias" 14 . This was given an age of sampling of 9,720 years ago, i.e. the midpoint of the range in calibrated years (Supplemental Figure 4). A lognormal relaxed clock model was used, with all parameters as in ref. 15 . Two runs of 10E6 steps were performed (20% burnin), and combined.
With this method we obtained an estimated mutation rate of 8.88E-10 events/bp/year (95% C.I. 5.50E-10 -12.2E-10). When projected onto the 2.7 Mb resequenced in the 58 modern subjects, this translated into a rate of one mutation every 415 years, which was used to obtain node ages with the rho method (Supplemental Table 7).
Ages of nodes to which SNP and STR genotyping was applied were also obtained with BATWING 16 . Priors for STR mutation rates were obtained from father-son trasmissions

5/14
(https://yhrd.org/); gamma(2,1000) was used for YCAIIa and YCAIIb. SNP data were included in the input files, when appropriate, with no inference on the SNP mutation rate. Ages of the SNP event were obtained from the descendant and ancestral STR node times of the branch on which the mutation occurred (Supplemental Table 7).

Demographic reconstructions
Bayesian Skyline Plots 17 (Supplemental Figure 5) were obtained with BEAST. For this analysis we used the whole tree as well as the 5 clades devoid of nested subclades also used in the selection process. This strategy aimed at avoiding over/down representation of downstream lineages (e.g. J2a-M92 within J2a-M67). A lognormal relaxed clock and a mutation rate of 8.8E-10 (see above) were used, with 10 and 4 groups for the whole tree and internal lineages, respectively.

Merging with other datasets
Merging with the previous landmark studies was obtained by selecting Hg J carriers from the respective vcf files. The lists of resequenced positions (bed files) were intersected with those covered here with bedtools, to obtain distinct reduced vcf files from each of the external sources.
We refrained from assembling a single tree for all subjects, as different filters applied in each study resulted in uncertain calls for some positions, some of which highly relevant for the Hg J topology.
The comparison of genotypes at phylogeographically interesting markers was performed by scrutinizing each vcf file separately.

Additional analyses
A Venn diagram for the sharing of variable positions across studies (Supplemental Figure 6) was obtained with the tool available at http://bioinformatics.psb.ugent.be.
Maps were drawn with scripts in R 18 and sampling locations superimposed based on their long-lat coordinates (Figure 2 and Supplemental Figure 7). Centroids were obtained by calculating Mercator coordinates, taking averages for long and lat, and re-transforming in degree coordinates. A test for the spatial shift of the centroid of the derived alleles was performed by simulation: the 95% C.I. for the centroids of ancestral alleles were obtained by drawing 1000 random samples of the same size as the n. of derived alleles, calculating their centroids and the ellipse embracing 95% of them. This procedure was omitted for rs760148062, as the number of ancestral and derived alleles was similar, approaching the extraction of 1000 replicas of the same centroid. Pairwise sampling distances were obtained with the R function rdist.earth separately for carriers of ancestral and derived alleles. Their equality was tested using the t-test.
Supplemental Figure 3. Same tree as in Figure 1, with individuals identified by their labels. Branches are numbered sequentially as in Supplemental Table 2. In grey: branches for which 100% of SNPs were not recorded in dbSNP 147. Markers discussed in the text are shown next to the branches where they occur, in grey boxes.