Evolutionary dynamics driving continental radiations of Fagaceae forests across the Northern Hemisphere


 Northern Hemisphere forests changed drastically in the early Eocene with the diversification of the oak family (Fagaceae). Cooling climates over the next 20 million years fostered the spread of temperate biomes that became increasingly dominated by oaks and their chestnut relatives. Here we investigate the timing and pattern of major macroevolutionary events and ancient genome-wide signatures of hybridization across Fagaceae. An unparalleled transformation of forest dynamics began with the rapid diversification of major lineages within 15 million years following the K-Pg extinction. Innovations related to seed and pollen dispersal are implicated in triggering waves of continental radiations, while fungal symbioses fortified a competitive edge underground. We detected introgression at multiple time scales, including ancient events predating the origination of genus-level diversity. As oak lineages moved into newly available temperate habitats in the early Miocene, secondary contact between previously isolated species occurred. This resulted in adaptive introgression, further amplifying global proliferation.

suggest that a cooling global climate afforded ecological opportunities to plant groups 55! that were physiologically predisposed to disperse into and radiate within broadening 56! and often repeated seasonal biomes across what would become the Americas and 57! Eurasia 4-7 . Central to this pattern of floristic replacement with significant ecological 58! consequence are the roughly 900 species currently recognized within Fagaceae (oak, 59! beech, chestnut, stone oak). Important components of the timing and pattern of 60! macroevolutionary events and the role of ancient hybridization, however, have yet to 61! be sufficiently described across Fagaceae.

69!
Hemisphere, indicating long-term presence and differential patterns of 70! diversification [29][30][31][32][33][34][35][36][37] . Recent studies integrating these fossils within phylogenies of 71! modern taxa have provided essential context to estimate divergence times [38][39][40] . With a 72! minimum age of ca. 80 million years ago (Ma) and precise aging of new fossilized 73! pollen and macrofossils assigned to some modern groups by 50 Ma, the evolution of 74! major lineages appears to be unusually rapid for forest tree species 41,42 . However, the 75! diversification history of Fagaceae remains incompletely understood, with the 76! exception of modern lineages of Quercus 39,43-45 . Therefore, a complete historical 77! account of this continental radiation is needed to bring to light the dynamics of 78! speciation through the genomes of these ecologically important tree species.

79!
Oaks have a long history of divergence in spite of gene flow. Recent estimates 80! of phylogeny using next generation sequencing of nuclear DNA resolve the main oak 81! groups while demonstrating that oak species are generally not of hybrid origin 39,46 . 82! However, more targeted phylogenomic studies have shown that hybridization has left 83! its signature: one of unstable lineages and taxa, the likely result of ancient 84! hybridization, another of intermediate position between parental lineages as expected 85! by recent-generation hybrids [47][48][49][50] . New insights into nuclear genomic architecture of 86! hybridization complement various datasets derived from the maternally-inherited 87! plastome and suspected cases of plastome capture and resulting cytoplasmic-nuclear 88! discordance have been shown at various phylogenetic depths in Quercus 51-53 . Now the 89! timing and impact of these events within Quercus, as well as within and between 90! other lineages, is within reach: chronograms for both genomes along with thorough 91! interrogation of the nuclear genome using modern analytical approaches provides the 92! framework needed to estimate the timing of hybridization events, identify the 93! signatures of gene flow, and detect evidence for adaptive evolution.    Supplementary Fig. 6). Major nodes along the backbone of the plastid tree 203! ! 7! were highly supported (BS > 80% and BI > 95%) and consistent with the nuclear 204! trees in the resolution of Fagus and Trigonobalanus as early-branching lineages ( Fig.  205! 2). The plastome topology, however, differs markedly from the trees obtained with 206! nuclear loci in regards to the composition and placement of major lineages within the 207! HS clade (Fig. 2). While most plastome subclades comprise related species, several 208! combine disparate taxonomic groups. We failed to recover monophyly of two genera, 209! Quercus and Notholithocarpus, and five sections of Quercus (Quercus, Virentes, 210! Ponticae, Protobalanus, Ilex and Cerris). The structure of the plastome reconstruction 211! within the HS clade is largely geographic, consistent with previous studies 52, 53,75,76 , 212! with the taxonomic diversity divided into two major clades we treat here as New 213! World (NW) and Old World (OW) (Fig. 2).

214!
The NW-OW pattern recovered in our plastome analyses suggests an early 215! geographic homogenization of cytoplasm across lineages generating the observed 216! cytoplasmic-nuclear discordance at the deepest level of the HS clade (Fig. 2). While 217! the most likely source of cytoplasmic-nuclear discordance is hybridization, 218! incomplete lineage sorting (ILS) could produce a similar pattern. To discriminate 219! between these two hypotheses, we performed coalescent-based simulations. We found 220! the plastid tree discordance to be significantly higher than the expected distribution 221! under a strict coalescent process ( Supplementary Fig. 7) and conflicting plastid 222! bipartition frequencies at or near zero in the 10,000 simulated organellar gene trees 223! ( Supplementary Fig. 8). ILS alone is therefore insufficient to explain the observed 224! cytoplasmic-nuclear incongruence recovered in these datasets and a scenario of 225! historical gene flow must be invoked.

226!
Hybridization is a widespread phenomenon within modern lineages of 227! Fagaceae, especially between species within sections of Quercus 77 , and plastome 228! capture events are well documented between sympatric species 52,78-80 . Hybridization 229! is also prevalent between closely related species across many genera within other 230! fagalean families 81-83 . However, gene flow between modern genera is without 231! precedent. When we applied molecular dating methods to the full plastome data, we 232! found estimated divergence times for the deepest splits to generally fall within the 233! rapid diversification phase for the HS clade based on nuclear data (

266!
We also assessed the distribution of alternative topologies within our 2124 267! nuclear gene dataset and found introgressed signals to be widely scattered across the 268! genome ( Supplementary Fig. 11). This is expected given that long-term 269! recombination tends to fragment introgressed stretches of DNA following initial 270! hybridization events 85 . However, positive selection has been shown to maintain long 271! ! 9! introgressed haplotypes in populations of humans and maize 86 IBDs was not associated with recombination rate (Spearman's ρ = -0.19 -0.14, P = 286! 0.10 -0.79; Supplementary Fig. 12). Therefore, these haplotypes shared between 287! Quercus sections are most likely due to historical inter-sectional hybridization instead 288! of the maintenance of ancestral polymorphisms in regions with reduced 289! recombination rates.

290!
To test this prediction, we calculated the probability of maintaining selectively 291! neutral haplotypes of a given length in both oak sections after introgression using 292! methods developed to study introgression in humans 86 and using generation times and 293! mutation and recombination rates derived for oak species 89-91 . We determined that 294! 166 IBD blocks (11724 -113757 bp) were significantly longer than expected if the 295! introgressed fragments were selectively neutral (P < 0.05; Fig. 4d; see details in 296! Methods), suggesting that the IBDs identified here provide convincing evidence of 297! adaptive introgression. Multiple GO categories with important metabolic processes 298! and molecular functions (e.g., terpene metabolic processes, sesquiterpenoid metabolic 299! processes) were overrepresented for genes located in these IBD regions 300! (Supplementary Table 7 topologies also were generated based on the two datasets when using the same 460! phylogenetic method (data not shown), suggesting weak reference bias in our data.

461!
To further monitor the accuracy of genotyping in the query dataset with 462! different divergence levels from the reference genome, we generated mutated 463! sequences (henceforth referred to as "mutated-sequence") by randomly adding 0.25% 464! − 20% mutations to the longest chromosome of Q. robur (chromosome 2, henceforth 465! referred to as "reference-sequence"). Next, we simulated 150bp pair-end reads from 466! each mutated-sequence with 30× coverage (close to our sequencing depth 25-40×).

467!
Simulated reads were mapped to the reference-sequence, and SNPs were called and 468! filtered using the same protocol as described above. For each simulated dataset, we 469! compared genotype calls to the mutated-sequence from which the datasets were 470! generated. To mimic the real data, SNPs called from the repetitive regions was 471! excluded from data analyses. The true positive (TP) rate was defined as TP/(TP+FP), 472!
where TP is position identical to mutated-sequence, and FP (false positives) are called 473! genotypes different from mutated-sequence. The missing rate (MR) was defined as 474! MISS/SIZE, where MISS is non-genotyped sites and SIZE is total sites (~51.2 Mb) in 475! ! 15! the reference-sequence after excluding masked repetitive regions. High TP rate (> 476! 97.7%) and low MR (< 1.5%) were found in datasets with divergence levels from 477! reference-sequence no more than 10% (Supplementary Fig. 12 39,113 , and then added un-sampled taxa 585! to a random position in each corresponding lineage (Supplementary Table 11). The 586! BAMM analyses were run for 10 million generations, saving every 1000 generations.

587!
The first 30% samples were discarded as burn-in, and the remaining samples were 588! summarized and plotted using BAMMtools 138 . 589!

Topological concordance analyses 591!
To evaluate the conflicts between nuclear gene trees and the species tree, we first 592! calculated the internode certainty all (ICA) to quantify the degree of conflict on each 593! node between a target tree and gene trees 139 . ICA values close to 1 indicate strong 594! concordance for the bipartition defined by a given internode, while ICA values close 595! to 0 indicate strong conflict. Negative ICA values indicate that the defined bipartition 596! conflict with other high frequent bipartitions. The ICA values were estimated in 597! RAxML and the species tree found by ASTRAL-III was used as the target tree. We 598! further summarized the number of conflicting and concordant bipartitions with 599! PHYPARTS 140 , using the species tree estimated by ASTRAL-III and the individual 600! gene trees. 601!

604!
To investigate whether base substitution saturation biased the accuracy of 605! phylogenetic inference in plastome phylogenetic analyses, we estimated the amount 606! of substitution saturation using methods detailed in Xia et al. 141 . This involved 607! employing critical index of substitution saturation (ISSc) that defines a threshold for 608! significant saturation in the data. From the data of 76 chloroplast genes, we assessed 609! the level of substitution saturation for codon12 and codon3 using the program 610! ! 19! DAMBE7 142 , and found that there was sufficient phylogenetic information at all 611! codon positions (Supplementary Table 12).

612!
To investigate how synonymous codon usage varies among Fagaceae species, 613! and whether synonymous codon biases have resulted in artificial and random 614! phylogenetic inference, we measured Relative Synonymous Codon Usage (RSCU) 615! values using GCUA 143 . RSCU is defined as the ratio of the observed codon 616! appearance to the number expected given that all synonymous codons appear with 617! uniform frequency. We found similar level of GC content and variation in codon bias 618! across Fagaceae species (Supplementary Fig. 13). These results suggested that 619! Fagaceae plastid genomes are highly conserved, and the plastid-based analyses would 620! be not biased due to substitution saturation or compositional heterogeneity among 621! species. 622!

Coalescent simulation 624!
To test whether incomplete lineage sorting (ILS) alone could explain the 625! incongruence between plastome tree and nuclear species tree, we followed Folk et 626! al. 144 to simulate 10,000 plastome trees under the coalescent model using 627! DENDROPY v.4.1.0 145 . The ASTRAL-III tree was used as a guide tree for the 628! simulation. To simulate plastome trees, branch lengths were scaled by a factor of four 629! to account for the haploidy and maternal inheritance of the plastome. Clade 630! frequencies of simulated trees were summarized using PHYPARTS 140 . In the scenario 631! of ILS alone, the topology from our empirical plastome tree should be present in 632! simulated trees with high frequency; if gene flow is present, the topology recovered in 633! our empirical tree should be absent or at very low frequency in the simulated trees. H2, we restricted our analyses by sampling H1 and H2 from same genera, or same 651! sections within genus Quercus. In addition, because H1 and H2 are sister species, the 652! sites with the pattern of BBAA are expected to be larger than ABBA and BABA 653! patterns. We further filtered trios that violated this assumption, and applied ABBA-654! BABA test to 25882 trios extracted from the species tree. To account for multiple 655!  Following Shen et al. 155 , we calculated site-wise log-likelihood scores for the primary 684! and alternative topologies in our concatenated matrix using the "-f G" command in 685! RAxML. After that, the difference in site-wise log-likelihood scores (ΔSLS) between 686! topologies were summed across sites in each gene, generating gene-wise log-687! likelihood scores (ΔGLS). For each node of interest, the primary topology was 688! defined as the species tree recovered by ASTRAL-III, and the alternative topologies 689! were ML trees constrained to recover the most common conflicting bipartitions. 690!

Identity-by-descent (IBD) analyses 692!
We performed IBD analyses based on genome-wide SNP data in the genus Quercus. 693! By using a same SNP calling and filtering procedure described above (see section 694! "Orthologous gene identification and nuclear alignment matrix assembly"). Raw 695! reads of Quercus species were trimmed using Trimmomatic v0.38 120 , aligned to Q. 696! robur reference genome assembly 91 using BWA 121 , and called genotypes using 697! GATK v4.1 156 . We applied a strict filtering process to remove low quality SNPs. We 698! removed all sites located in repetitive regions of the Q. robur reference genome 91 , and 699! discarded all indels and multiallelic SNPs. We further set genotypes supported by less 700! than four reads as missing data, and deleted SNPs with mean depth <5 or >100, or 701! genotyped in less than half of individuals, or proportion of called heterozygous 702! genotypes >50%. Finally, we obtained 34,250,467 high quality SNPs for IBD 703! analyses.

704!
We used Beagle v4.1 157 to phase and impute the SNP data, and uncover shared 705! IBD blocks between species. The following parameters were used for IBD analyses in 706! Beagle: window = 100,000; overlap = 10,000; ibdtrim = 100; ibdlod = 5. To compare 707! the recombination rate between IBD blocks and genomic background, we used a 708! genetic map of Q. robur developed by Plomion et al. 91 Table S3). S1-S4 indicate four nodes where shifts in diversification rate were identified. Taxonomic labels of genera, subgenera and sections follow Manos et al. 2001, Manos et al. 2008and Denk et al. 2017. Illustrations: lax catkins indicate the placement of the change from insect-pollination to wind-pollination that diagnoses the genus Quercus; hypogeous seed and seedling marks the origin of the HS clade. Images: representative cupule types are shown on the right. A consistent color scheme was used for taxonomic labels and image borders.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. SupplementaryInformation.pdf