Introduction

The walnut family, Juglandaceae, comprises 63 species in eight genera (Fig. 1) and includes some of the world’s most commercially valuable nut-producing crops, such as Persian walnut, Chinese Iron walnut (both in the genus Juglans), pecan, and hickory (genus Carya). Many species are also valuable timber trees. Because of their economic importance, several genomes have been assembled to study fungal resistance genes and genes related to fruit quality, and their analysis has revealed an ancient whole-genome duplication (WGD) in the family1,2,3,4,5, termed the juglandoid WGD (ref. 1). In addition, comparisons of the genomes of Juglans regia and Carya illinoinensis show biased fractionation and asymmetric loss of duplicated genes between two parental subgenomes that persist in these species3,5,6, pointing to an allotetraploid origin of the family. This was also suggested in cytogenetic studies that explained the predominant Juglandaceae chromosome numbers of 2n = 32 or 64 by assuming an ancestor shared with, or embedded in, Myricaceae, which have a basic number of x = 8 (refs. 7,8).

Despite numerous molecular studies over the past 20 years, the relationships among the family’s generally recognized genera have remained unclear, with at least six incongruent topologies having been reported9,10,11,12,13,14,15,16,17,18,19, although the main signal in many of these studies came from the same chloroplast matK gene sequences. Most problematic has been the position of the enigmatic east Asian endemic Platycarya strobilacea, which has ‘jumped’ between major clades. Today, this species occurs in Vietnam, Taiwan Island, Mainland China, Korea, and Japan, but fossil Platycarya leaves and fruits are known from the Upper Paleocene and Early Eocene of North America and Europe20,21. Platycarya is unusual among the Juglandaceae in its cone-like fruit (Manning22; our Fig. 1), but the fossil leaves of P. americana and P. castaneopsis share unique leaf architectural traits with modern Engelhardieae20, and extant Platycarya and Engelhardia further share thin nut walls without lacunae22, terminal panicles in which the male and female flowers usually are combined (Stone23: Fig. 7.1), and similar pollen (Stone23: Fig. 7.3). At least four modern morphological data matrices9,14,19,24 have been unable to unambiguously place Platycarya, even when fossil taxa were included (Discussion). The first such analysis20, however, inferred a Platycarya/Engelhardia clade (also see Manchester25; our Fig. 1a).

Other genomic and population-genetic studies have highlighted the role of ancient and ongoing hybridization in the Juglandaceae’s most diverse genus, Juglans26,27. If cross-species gene flow (introgression) has influenced large portions of related genomes, phylogenetic inference based on DNA-sequence alignments will produce species trees that reflect the reticulation history rather than the bifurcation history28,29. In such situations of extensive introgression, microsynteny appears to retain phylogenetic signal30.

The molecular-clock hypothesis plays a central role in the study of molecular evolution, timing of speciation events, and historical demography. However, substitution rates can vary greatly among lineages, and gene duplicates can show distinct patterns of temporal and functional evolution31, related to generation times32, other life history traits (e.g., Lanfear et al. 33), different population sizes34,35,36,37,38, or possibly DNA repair systems34,39,40. This last possibility has received the least attention so far due to the technical difficulty of comparing the numbers of genes involved in DNA repair in different taxa.

Here, we use syntenic information and gene presence/absence to reconstruct the phylogenetic history of Juglandaceae, taking advantage of its two coexisting homoeologous subgenomes and chromosome-level genome assemblies. Our gene-content- and microsynteny-based phylogenetic approach builds on work by Zhao et al. 30 and Pett et al. 41, but differs from their studies by focusing on an allotetraploid clade. Resolving the phylogeny of the walnut family is important to better leverage the available genomes by knowing the direction of trait evolution in the family and because a correctly rooted topology is essential for biogeographic and molecular-clock studies. We also infer DNA substitution rates, past population sizes, and relative numbers of DNA repair genes to test the possibility that substitution rates may vary with the expansion of DNA repair gene families. In the context of other recent studies, our results highlight a pattern of dramatic post-polyploid evolutionary slowdowns in woody clades of core eudicots, calling for greater caution in molecular-clock dating of such clades.

Results

Genome assemblies and annotations

The total assembled genome of Rhoiptelea chiliantha comprised 408.19 Mbp, and the contig N50 and scaffold N50 values were 6.97 Mbp and 24.34 Mbp, respectively. Of the whole genome, 91.71% were ordered and oriented into 16 pseudo-chromosomes (Supplementary Fig. 1). The R. chiliantha genome consisted of ~39.35% repetitive sequences, and 32,505 predicted protein-coding genes (Supplementary Table 1). The Engelhardia roxburghiana assembly size was 884.78 Mbp, and 97.73% of the whole genome were ordered and oriented into 16 pseudo-chromosomes (Supplementary Fig. 1). The E. roxburghiana genome consisted of ~57.11% repetitive sequences and 30,590 annotated protein-coding genes (Supplementary Table 1). Completeness assessment of the assembly revealed that 96.1% and 93.7% of the universal single-copy orthologs in the BUSCO embryophyta_odb9 database were present in R. chiliantha and E. roxburghiana, respectively, indicating that the two genome assemblies were rather complete (Supplementary Table 2).

Whole-genome duplication

The circular plot and dot plot of the pseudo-chromosomes of R. chiliantha and E. roxburghiana yielded abundant synteny information between homoeologous chromosomes (Fig. 2a and Supplementary Fig. 2, 3). The numbers of collinear blocks (syntenic blocks) were 399 (Carya illinoinensis), 403 (E. roxburghiana), 324 (Juglans regia), 444 (J. mandshurica), 338 (J. microcarpa), 384 (Platycarya strobilacea), and 463 (R. chiliantha), respectively. To identify the most recent and more ancient Ks peaks (WGDs) in each species, we plotted the distribution for the median Ks values of the collinear gene pairs contained in each syntenic block within a species. In the Ks distribution, we detected two polyploidization events in each of the genomes of C. illinoinensis, E. roxburghiana, J. mandshurica, J. microcarpa, J. regia, P. strobilacea and R. chiliantha (Fig. 2b). The first peak of Ks was 0.17 for R. chiliantha, whereas it was 0.36–0.48 for the other six species. The second peak of Ks for all species was narrow, pointing to the shared ancient γ-WGT. These results were further supported by the distribution of the fourfold synonymous third-codon transversion rate (4DTv) (Supplementary Fig. 4).

To determine whether R. chiliantha shares a recent WGD event with the other Juglandaceae species, we used both genome-synteny analysis42,43 and the ‘multiple gene family’ phylogenetic tree method by Pfeil et al. 44, using J. regia as a representative of Juglandaceae sensu stricto. In the genome-synteny analysis (Fig. 3a), the chromosomes of R. chiliantha and J. regia formed eight quartets, each consisting of a pair of homoeologous chromosomes from both R. chiliantha and J. regia. The dot colors clearly distinguish orthologous (yellow: smaller Ks) and paralogous (green: larger Ks) relationships between interspecific homologous chromosomes, pointing to a shared WGD between them. Rhoiptelea chiliantha and the remaining species of Juglandaceae show a similar pattern (Supplementary Figs. 5–9). With Pfeil et al.’s phylogenetic tree method, more than 92% of the gene trees (bootstrap value ≥ 80) supported one of the topologies (((R. chiliantha, J. regia), (R. chiliantha, J. regia)), Quercus lobata), (((R. chiliantha, J. regia), R. chiliantha), Q. lobata) and (((R. chiliantha, J. regia), J. regia), Q. lobata) (Fig. 3b and Supplementary Data 1); the remaining species of Juglandaceae showed a similar pattern (Supplementary Data 1). These two results provide unambiguous evidence that R. chiliantha and the remaining Juglandaceae share the same WGD event, which must have been present in their most-recent common ancestor.

Molecular evolutionary rates following the juglandoid WGD

Although R. chiliantha and the other Juglandaceae share the juglandoid WGD, they have markedly different Ks distributions of homoeologous gene pairs. To quantify the discrepancy in the substitution rates of R. chiliantha and other Juglandaceae species, we calculated Ks values between a focal species and the common ancestor of its sister clade (for more details see Fig. 3c and its legend). Calculated in this way, Ks reflects the evolutionary rate of a species since its divergence from the common ancestor (Fig. 3c). A Wilcoxon rank-sum test (Mann-Whitney test) was performed on the Ks distributions of R. chiliantha and the other species. The results indicated significant differences in Ks distribution between R. chiliantha and each of the other six species (P < 0.01; Fig. 3d and Supplementary Fig. 10). We next quantified the rates using Eq. 1, where t is assumed to be 85 MYA based on Budvaricarpus serialis, a fossil fruiting structure that is similar to R. chiliantha45 and that is the oldest fossil of Juglandaceae. It comes from the Late Turonian to Santonian, a period spanning 89.8 to 83.6 MYA. The substitution rates were estimated as 1.60 × 10−9, 1.95 × 10−9, 1.47 × 10−9, 1.54 × 10−9, 1.51 × 10−9, 1.89 × 10−9, and 0.55 × 10−9 per site per year for C. illinoinensis, E. roxburghiana, J. mandshurica, J. microcarpa, J. regia, P. strobilacea and R. chiliantha, respectively (Supplementary Table 3).
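Eq. 1 is not reproduced in this section. As a minimal sketch, assuming it takes the simple form r = Ks / t (with Ks measured along a single lineage back to the common ancestor, so that no factor of two is needed), the conversion between lineage-specific Ks and a per-year rate can be written as follows. The function names are illustrative, and the form of the equation is an assumption rather than the authors' stated formula.

```python
# Sketch only: assumes Eq. 1 has the form r = Ks / t for a single lineage.
T_JUGLANDOID_YEARS = 85e6  # calibration used in the text (85 MYA)


def substitution_rate(ks, t_years=T_JUGLANDOID_YEARS):
    """Synonymous substitutions per site per year along one lineage."""
    if t_years <= 0:
        raise ValueError("t_years must be positive")
    return ks / t_years


def relative_slowdown(rate_fast, rate_slow):
    """How many times slower the slow lineage evolves than the fast one."""
    if rate_slow <= 0:
        raise ValueError("rate_slow must be positive")
    return rate_fast / rate_slow


# Under this assumed form, the reported rate of 0.55e-9 for R. chiliantha
# corresponds to a lineage-specific Ks of 0.55e-9 * 85e6 = 0.04675.
rate_rhoiptelea = substitution_rate(0.04675)
```

Under the same assumption, `relative_slowdown(1.95e-9, 0.55e-9)` gives the ratio between the fastest reported rate (E. roxburghiana) and that of R. chiliantha.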

Small effective population size of Rhoiptelea chiliantha compared to other Juglandaceae

According to the nearly neutral theory, effective population size (Ne) can profoundly influence the rate of molecular evolution38,46. Compared with other Juglandaceae species, the effective population size of R. chiliantha sharply declined during the time from 5.0 to 0.5 MYA in the PSMC plots (Fig. 4a). Moreover, the Ne of R. chiliantha was smaller than that of other species during most of the time from 2.0 to 0.1 MYA (Fig. 4a). We also used genome-wide heterozygosity to calculate the long-term average Ne for each species using Eq. 2, where μ is the mutation rate per year and g is the generation time, assumed to be 30 years (Methods). Rhoiptelea chiliantha had the lowest Ne (8,900) among all analyzed species, with the other species’ Ne ranging from 10,700 to 25,700 (Supplementary Table 4). We compared Ka/Ks among lineages and found a significant difference between R. chiliantha and the other Juglandaceae species (Fig. 4b and Supplementary Fig. 11). Taken together, these results show that R. chiliantha has a small effective population size, which, if anything, would lead to a higher substitution rate, not the slower rate found here.
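Eq. 2 is likewise not shown here. A minimal sketch, assuming it follows the standard diploid relationship θ = 4·Ne·μ·g with genome-wide heterozygosity used as the estimate of θ, is given below. The heterozygosity value in the example is hypothetical and chosen only for illustration; it is not taken from the Supplementary Tables.

```python
# Sketch only: assumes Eq. 2 rearranges theta = 4 * Ne * mu * g.
def effective_population_size(heterozygosity, mu_per_year, generation_time):
    """Long-term Ne from heterozygosity, a per-year mutation rate, and
    a generation time in years (mu * g is the per-generation rate)."""
    per_generation_rate = mu_per_year * generation_time
    if per_generation_rate <= 0:
        raise ValueError("mutation rate and generation time must be positive")
    return heterozygosity / (4.0 * per_generation_rate)


# Hypothetical heterozygosity of 0.06%, with the rate reported for
# R. chiliantha (0.55e-9 per site per year) and g = 30 years.
ne_example = effective_population_size(0.0006, 0.55e-9, 30)
```

Because Ne scales inversely with μ under this relationship, a lineage with a lower substitution rate needs correspondingly lower heterozygosity to yield the same Ne.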

More genes related to DNA repair and recombination in R. chiliantha compared to other Juglandaceae

Rhoiptelea chiliantha has 323 genes identified as being involved in DNA repair and recombination, while the other Juglandaceae have 271 to 302 such genes (Supplementary Data 2 and Supplementary Data 3). These numbers were inferred by assigning the DNA repair and recombination genes from all investigated species to 260 orthogroups (gene families) (Supplementary Data 3). The numbers of genes per orthogroup differed significantly between R. chiliantha and the other species of Juglandaceae (Fig. 4c). We also checked the asymmetric pattern of DNA repair and recombination genes across orthogroups between R. chiliantha and each of the other species. There were 43 orthogroups in which R. chiliantha has two gene copies and J. regia has at most one gene copy; conversely, there were only seven orthogroups in which J. regia has two gene copies while R. chiliantha has at most one copy. That is, there is a strong asymmetry of 43:7 in the comparison of R. chiliantha and J. regia. Similar ratios of 43:17, 37:14, 34:10, 49:9, and 43:11 were obtained in comparisons of R. chiliantha with E. roxburghiana, C. illinoinensis, J. mandshurica, J. microcarpa and P. strobilacea, respectively. There were also 27 orthogroups in which R. chiliantha has two copies, while the other Juglandaceae species have only one. In 26 of these 27 orthogroups, the two copies in R. chiliantha were determined as being derived from the juglandoid WGD, whereas in one orthogroup, the two copies were transposition-derived. All 54 R. chiliantha gene copies in the 27 orthogroups were expressed in floral buds, with 45 (83%) of them also expressed in leaves, suggesting that most, if not all, are truly functional (Supplementary Data 4). By contrast, we found only five orthogroups in which R. chiliantha has at most one gene copy but other Juglandaceae species have at least one gene copy (Supplementary Data 3), and six gene copies were expressed in catkin, pistillate flower and leaf of J. regia (Supplementary Table 5).
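The pairwise asymmetry counts reported above (for example 43:7) can be sketched as a simple tally over an orthogroup copy-number table. The sketch below assumes that "two gene copies" means exactly two and that an orthogroup absent from a species counts as zero copies; the orthogroup identifiers and counts in the example are hypothetical.

```python
# Sketch only: tally orthogroups where one species keeps two copies
# while the other keeps at most one (assumed reading of the criterion).
def copy_asymmetry(copies_a, copies_b):
    """Return (n_a, n_b) for two dicts mapping orthogroup -> copy number.

    n_a counts orthogroups with exactly 2 copies in A and <= 1 in B;
    n_b counts the reverse. Missing orthogroups are treated as 0 copies.
    """
    n_a = n_b = 0
    for og in set(copies_a) | set(copies_b):
        a = copies_a.get(og, 0)
        b = copies_b.get(og, 0)
        if a == 2 and b <= 1:
            n_a += 1
        elif b == 2 and a <= 1:
            n_b += 1
    return n_a, n_b


# Hypothetical toy table, not taken from Supplementary Data 3.
rhoiptelea = {"OG1": 2, "OG2": 2, "OG3": 1, "OG4": 2}
juglans = {"OG1": 1, "OG2": 0, "OG3": 2, "OG4": 2}
asymmetry = copy_asymmetry(rhoiptelea, juglans)
```

Orthogroups in which both species retain two copies (OG4 in the toy table) contribute to neither count.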

We carried out a KEGG enrichment analysis for the genes that retain two copies in R. chiliantha from the juglandoid WGD, but at most only one copy in the other Juglandaceae species. These genes were significantly enriched in transcription-coupled repair (RPB1) and base excision repair (Aprataxin), among various functions (Supplementary Fig. 12). Two recent studies have demonstrated that transcription-coupled repair can play a much larger role in DNA repair than previously thought47,48.

Mirror image subgenome-level phylogenies of Juglandaceae

For the seven Juglandaceae species, we performed subgenome assignments for each pair of homoeologous chromosomes based on the numbers of retained ancestral genes on whole chromosomes or in intraspecific collinear blocks49,50,51 (Supplementary Note 1; Supplementary Data 5 and Supplementary Data 6). After subgenome assignment, we used three approaches to infer parental lineages and subgenome relationships in the walnut family, namely microsynteny, local gene content, and DNA-sequence alignment (numbers of syntenic blocks, matrix lengths, and gene family numbers are given in Supplementary Table 6). Both the microsynteny- and the gene-content-based approach (with the subgenomes assigned by partitioning homoeologous chromosomes) yielded topologies in which the seven recessive subgenomes and Myrica rubra formed a clade that was sister to a clade formed by the seven dominant subgenomes (chromosomes with more ancestral genes are here referred to as ‘dominant’, those with fewer ancestral genes as ‘recessive’). Different parameter settings in MCScanX52, namely A5G25, A10G25, or A15G25, with ‘A’ being the minimum number of collinear gene pairs and ‘G’ the maximum number of intervening genes between adjacent blocks, did not change the topology, consistent with our presupposition that Juglandaceae are of allopolyploid origin (Fig. 5a, b and Supplementary Fig. 13). When the dominant and recessive subgenomes were instead assigned by partitioning intraspecific collinear blocks (allowing us to include one additional species, Pterocarya stenoptera, which was assembled at the scaffold level), the A5G25 parameter setting yielded a topology in which the subgenomes formed a clade that was sister to M. rubra (Supplementary Fig. 14a), while the A10G25 and A15G25 settings yielded topologies in which the recessive subgenomes and Myrica rubra formed a clade (Supplementary Fig. 14b, c).
In all analyses, regardless of whether the subgenomes were assigned by partitioning homoeologous chromosomes or intraspecific collinear blocks, Platycarya always grouped with Engelhardia, with high bootstrap support (Fig. 5a, b, Supplementary Fig. 13, and Supplementary Fig. 14a–d).
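The dominant/recessive labelling used above can be sketched as a comparison of retained ancestral gene counts within each homoeologous pair. The sketch assumes the counts have already been obtained (for example from collinearity with an unduplicated outgroup); the chromosome names and counts are hypothetical, and the handling of ties is an illustrative choice, since the text does not say how ties were treated.

```python
# Sketch only: label each member of a homoeologous pair by the number
# of retained ancestral genes (more = 'dominant', fewer = 'recessive').
def assign_subgenomes(homoeologous_pairs, retained_counts):
    """Return a dict mapping chromosome (or block) name to its label.

    homoeologous_pairs: iterable of (name_1, name_2) tuples.
    retained_counts: dict mapping name -> retained ancestral gene count.
    Ties are labelled 'ambiguous' (an assumption of this sketch).
    """
    labels = {}
    for first, second in homoeologous_pairs:
        n_first = retained_counts[first]
        n_second = retained_counts[second]
        if n_first == n_second:
            labels[first] = labels[second] = "ambiguous"
        elif n_first > n_second:
            labels[first], labels[second] = "dominant", "recessive"
        else:
            labels[first], labels[second] = "recessive", "dominant"
    return labels


# Hypothetical pairs and counts for illustration.
pairs = [("Chr01", "Chr09"), ("Chr02", "Chr10")]
counts = {"Chr01": 912, "Chr09": 655, "Chr02": 480, "Chr10": 701}
labels = assign_subgenomes(pairs, counts)
```

The same function applies when the units are intraspecific collinear blocks rather than whole chromosomes, which is how a scaffold-level assembly could be included.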

In the trees obtained from the sequence-alignment approach with the subgenomes assigned by partitioning either homoeologous chromosomes or intraspecific collinear blocks, the dominant and recessive subgenomes of Juglandaceae formed a clade that was sister to M. rubra (Fig. 5c and Supplementary Fig. 14e) and Engelhardia and Platycarya were separated from each other (Fig. 5c and Supplementary Fig. 14e).

We also used GRAMPA53 to identify the most likely placement of the polyploid clade and its parental lineages without a priori subgenome assignment. GRAMPA needs a species tree to start with, which we inferred with ASTRAL-Pro54. The optimal multi-labeled tree (MUL-tree) inferred by GRAMPA was consistent with the tree obtained from our subgenome-aware sequence-alignment approach (Supplementary Fig. 15).

Discussion

The particular placement of the monotypic Asian genus Platycarya obtained here from genome-structural data, which remained identical regardless of collinear block sizes or subgenomes used (Fig. 5a, b, Supplementary Fig. 13, and Supplementary Fig. 14a–d), has not been recovered using DNA alignments, either in our study (Fig. 5c and Supplementary Fig. 14e) or over the past 20 years9,10,14,15,16,18,19. This implies that the placement of Platycarya in either the genome-structure-based phylogeny or the DNA-alignment-based phylogenies must be wrong. Since the true phylogeny is currently unknown, are there morphological arguments in favor of one or the other placement of Platycarya?

Wing and Hickey’s20 analysis of 30 characters scored in fossils and extant species of Juglandaceae yielded a Platycarya/Engelhardia group plus its Central American relatives, the genera Alfaroa with five Central American species, and Oreomunnea with two species in Mexico, neither of them included here. In southeast Asia, Engelhardia comprises seven species of which two, E. roxburghiana in Vietnam, Laos, Myanmar and southwest China and E. fenzelii in southeast China55, are sometimes recognized as the separate genus Alfaropsis (Supplementary Fig. 16 shows a nuclear ITS phylogeny of the entire clade). While Wing and Hickey20 did not publish their data matrix, their text stresses four to seven leaf architectural traits and pollen and panicle morphology shared by fossil and living Platycarya and Engelhardia. In addition, both genera have thin nut walls without lacunae, while the nut walls of Juglans, Carya, and Pterocarya have lacunae22. Manchester’s25 intuitive phylogeny of the Juglandaceae showed the same groupings (our Fig. 1a), influenced by Wing and Hickey’s study. Four more recent morphological data matrices are available, and we re-analyzed them for this study. Manos and Stone9 coded 64 characters for 40 living taxa, and their morphological tree left Platycarya in a trichotomy between the Engelhardia clade and the Juglans clade (Manos et al. 10: Fig. 3A). Hermsen and Gandolfo24 emended and enlarged the matrix of Manos and Stone9 by coding 64 characters for 28 living and six extinct taxa. Their data yield the (Platycarya, Engelhardia, Alfaroa, Oreomunnea) clade, but leave the placement of Rhoiptelea unresolved (our Supplementary Fig. 17 shows a Neighbour-Net generated from their data), possibly because they did not include the oldest fossil of Rhoiptelea, an 85 MY old fruiting structure that is the oldest fossil of any Juglandaceae45. Larson-Johnson14 coded 89 characters from 37 living and 27 extinct taxa, and Zhang et al. 19 coded 73 characters for 47 living and 113 extinct taxa. Both matrices fail to unambiguously place Platycarya, although fossils of Platycarya group with Engelhardia in the phylogeny (Fig. 4) of Larson-Johnson14. Thus, morphological traits do not contradict the placement of Platycarya found here and, based on Wing and Hickey20, may support it (Supplementary Fig. 17).

The DNA-alignment-based trees all placed Engelhardia and Platycarya far apart (Fig. 5c and Supplementary Fig. 14e). An explanation for this could be post-polyploid gene flow among Juglandaceae (as demonstrated within the genus Juglans26,27), which would result in discordant phylogenetic signal. A test for gene tree discord that does not require a pre-specified species tree uses gene tree quartets under the multispecies coalescent (MSC) model to calculate quartet count concordance factors (qcCF)56,57. We carried out such an analysis for each set of four taxa to evaluate how well expected qcCFs under the MSC model fit the data. With subgenomes assigned via homoeologous chromosomes, we performed MSCquartets analysis on 5,882 gene trees for the seven taxa. After applying the Holm-Bonferroni method to adjust for multiple testing, 12 of 35 tests (34.29%; Supplementary Fig. 18 and Supplementary Data 7) rejected the null hypothesis and instead supported significant discord. Of particular note, every quartet consisting of Engelhardia, Platycarya, Carya and one of the Juglans species (Supplementary Data 7) yielded a significant test, pointing to the possible presence of ancient hybridization among them. Similar results were obtained when the subgenomes were assigned by intraspecific collinear blocks (see Supplementary Fig. 18 and Supplementary Data 8).
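The Holm-Bonferroni step-down procedure used for the 35 quartet tests can be sketched in a few lines. This is a generic implementation of the published procedure, not the MSCquartets code itself; the significance level of 0.05 is an assumption, as the level used is not stated in this section, and the p-values in the example are hypothetical.

```python
# Sketch only: generic Holm-Bonferroni step-down adjustment.
def holm_reject(pvalues, alpha=0.05):
    """Return a list of booleans, True where the null is rejected.

    P-values are visited from smallest to largest; the k-th smallest
    (k = 0, 1, ...) is compared with alpha / (m - k), and testing
    stops at the first failure.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if pvalues[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also retained
    return reject


# Hypothetical p-values for four quartets.
decisions = holm_reject([0.01, 0.04, 0.03, 0.005])

# Fraction of significant tests reported in the text: 12 of 35.
fraction_significant = round(100 * 12 / 35, 2)
```

With seven taxa there are C(7, 4) = 35 quartets, which is why 35 tests enter the correction.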

As mentioned above, we lack representative genomes of the Central American genera Alfaroa (5 species) or Oreomunnea (2 species) and a second Asian Engelhardia, preferably E. spicata, the type species of this genus (Supplementary Fig. 16). Their inclusion could influence the precise attachment of Platycarya to the Engelhardia clade.

Our results from the Ks-analysis, genome synteny (collinearity), and phylogenetics converge to confirm the occurrence of a WGD event in the common ancestor of the walnut family, including Rhoiptelea (Fig. 3 and Supplementary Data 1). Our presupposition of an allopolyploidy event at the base of the Juglandaceae (originally inferred from cytogenetics7,8) is supported by the phylogenies that show the recessive subgenome clade as sister to Myrica (Fig. 5a–b, Supplementary Fig. 13, and Supplementary Fig. 14b–c), consistent with the result of Xiao et al. 6 who, using sequence similarity as a criterion, found that one subgenome of pecan showed higher average identity with Myrica rubra. Reassuringly, a disproportionate loss of duplicated genes between the two subgenomes (i.e., biased fractionation) has been detected in prior studies of Juglans and Carya genomes3,5,6 and in our study (Fig. 6 and Supplementary Figs. 19–24). That the recessive subgenome clade is sister to Myrica could imply that one parent of Juglandaceae was a species closely related to Myricaceae, while the other parent was extinct or unsampled. However, the alignment-based tree from both the dominant and recessive subgenomes (Fig. 5c) shows Juglandaceae forming a clade that is sister to Myrica rubra, implying that both parents of Juglandaceae were either nested within Myricaceae or had a most recent common ancestor with Myricaceae. Future work should sample an additional genome of Myricaceae, ideally Canacomyrica monticola, Comptonia peregrina, or Myrica gale.

Two Ks peaks corresponding to the juglandoid WGD and the γ-WGT shared by core eudicots were detected in Rhoiptelea as well as the other Juglandaceae species (Fig. 2b). The first Ks peak of R. chiliantha corresponding to the juglandoid WGD is 2.1 to 2.8 times smaller than that of six other intrafamilial members (Fig. 2b). As Doyle and Egan58 have pointed out, the Ks peaks of homoeologs in an allopolyploid would overestimate its molecular evolutionary rate because the subgenomes would have diverged at the point of speciation between the two progenitors, rather than at the hybridization (and genome doubling) event itself. So, the Ks for homoeologs derived from a WGD is at best a crude estimate of the descendant lineages’ substitution rates (see Thomas et al. 53: Fig. 1). However, the more direct estimation (Fig. 3c) yields similar results: the rate of R. chiliantha was 2.6 to 3.5x slower than that of the other species following the juglandoid WGD.

An extremely slow substitution rate in R. chiliantha is supported by additional lines of evidence. The results of the BUSCO assessment (Supplementary Fig. 25) show that 24% of the universal single-copy orthologs in the embryophyta_odb10 database were complete and duplicated in R. chiliantha, 3–4x more than in the other Juglandaceae species (5.1–6.8%), and most of these duplicate BUSCO groups were derived from the juglandoid WGD (Supplementary Data 9). The percentages of retained orthologous genes of Rhoiptelea based on a 100-gene sliding window along each Q. lobata chromosome were higher than those in the other six Juglandaceae species for both the dominant and the recessive subgenome (Fig. 6 and Supplementary Figs. 19–24). We also found R. chiliantha to retain more WGD gene duplicates (Supplementary Fig. 26 and Supplementary Table 7), indicating that fewer deleterious mutations have accumulated in the Rhoiptelea gene duplicates. However, we cannot rule out other hypotheses, such as differences in gene deletion after WGD59, calling for additional study.
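The sliding-window retention profile described above can be sketched as follows, assuming each reference (Q. lobata) gene along a chromosome has already been scored as retained or lost in a given subgenome. The window of 100 genes follows the text; the step size is an assumption, since it is not given in this section, and the flags in the example are hypothetical.

```python
# Sketch only: percentage of retained orthologs in sliding windows
# along a reference chromosome (window in genes, not base pairs).
def retention_by_window(retained_flags, window=100, step=1):
    """Return the percentage of retained genes in each window.

    retained_flags: sequence of booleans (or 0/1) ordered by gene
    position on the reference chromosome. Windows that would run past
    the chromosome end are not reported.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    percentages = []
    for start in range(0, len(retained_flags) - window + 1, step):
        chunk = retained_flags[start:start + window]
        percentages.append(100.0 * sum(chunk) / window)
    return percentages


# Hypothetical six-gene chromosome with a four-gene window.
profile = retention_by_window([1, 1, 0, 0, 1, 0], window=4)
```

Running the function once per subgenome and species yields curves that can be compared position by position, which is how consistently higher retention in one species would become visible.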

A species’ population size affects the intensity of selection and drift, which leads to different fixation probabilities of nearly neutral mutations in a population. In small populations, weakly deleterious mutations may behave as effectively neutral alleles, so fewer of them are exposed to and removed by negative selection, which should lead to high rates of molecular evolution34,35,36. Our results from PSMC, Ka/Ks ratios, and heterozygosity (Fig. 4 and Supplementary Fig. 11), however, indicate that R. chiliantha has a smaller Ne regardless of how it is inferred, making it unlikely that its Ne explains its slower substitution rate. Instead, two observations point to more efficient DNA repair as the principal reason for the reduced substitution rate in R. chiliantha following the juglandoid WGD: the higher number of genes involved in DNA repair and recombination in R. chiliantha compared to other Juglandaceae (Fig. 4c, a violin plot of the number of genes in gene families associated with DNA repair and recombination in the seven species, and Supplementary Data 3), and the enrichment of the juglandoid WGD genes in Rhoiptelea in transcription-coupled repair (RPB1) and base excision repair (Aprataxin).

In Fig. 2b, the Ks peaks corresponding to the juglandoid WGD range from 0.17 to 0.48, whereas the Ks values corresponding to the γ-WGT range from 1.70 to 2.01. Assuming that the juglandoid WGD occurred at about 85 MYA (ref. 45), simple calculations tell us that during the time period after the γ-WGT but before the juglandoid WGD (i.e., from ~120 to ~85 MYA) the evolutionary rate of Juglandaceae was 7.7 to 22 times higher than that after the juglandoid WGD (i.e., during the time period from 85 MYA up to now). In other words, molecular evolution in Juglandaceae slowed markedly following the juglandoid WGD. Dramatic slowdowns in woody clades of core eudicots were first suggested by Smith and Donoghue32 and have been quantified in several genomic studies. For instance, the Ks peak values for the salicoid WGD (willow family) occurring at 60–65 MYA (refs. 60,61) range from 0.34 to 0.56, while the Ks peaks for the γ-WGT of core eudicots to which willows belong were at around 2.5 (ref. 31), indicating a 5.0-fold slowdown. In the Rosaceae subfamily Pomoideae (containing apples and other fruit trees), the Ks peaks for the Pomoideae WGD at 48–50 MYA (refs. 62,63) range from 0.27 to 0.39, compared to the Ks peak for the γ-WGT at around 2.5 (ref. 31), indicating a 4.5-fold slowdown. Disregarding such dramatic rate slowdowns results in unreliable estimation of evolutionary events64, an example being the calibration of the γ-WGT shared by core eudicots with the time of the juglandoid WGD (ref. 6).
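The "simple calculations" are not spelled out in the text. One way to reconstruct them, offered here as an assumption rather than as the authors' stated formula, is to treat the Ks accumulated between the two events as the difference of the peak values and to compare the resulting rate with the rate since the more recent event (the factor of two for paired lineages cancels in the ratio). The γ-WGT date of ~120 MYA is the value given in the text for Juglandaceae; applying it to willows and Pomoideae, and using interval midpoints for those clades, are further assumptions of this sketch.

```python
# Sketch only: slowdown = rate between the two events / rate since the
# more recent event, using Ks peak values and event ages in MYA.
def slowdown_factor(ks_recent, ks_ancient, t_recent, t_ancient):
    """Ratio of the pre-WGD rate to the post-WGD rate.

    ks_recent / t_recent: Ks peak and age of the lineage-specific WGD.
    ks_ancient / t_ancient: Ks peak and age of the older event.
    """
    if not 0 < t_recent < t_ancient:
        raise ValueError("require 0 < t_recent < t_ancient")
    rate_before = (ks_ancient - ks_recent) / (t_ancient - t_recent)
    rate_after = ks_recent / t_recent
    return rate_before / rate_after


# Juglandaceae: pairing the extreme peak values from Fig. 2b.
jug_low = slowdown_factor(0.48, 2.01, 85, 120)
jug_high = slowdown_factor(0.17, 1.70, 85, 120)

# Willows and Pomoideae: midpoints of the Ks and age ranges in the text.
willow = slowdown_factor(0.45, 2.5, 62.5, 120)
pomoideae = slowdown_factor(0.33, 2.5, 49, 120)
```

Under these assumptions the four values round to 7.7, 22, 5.0 and 4.5, matching the figures quoted in the paragraph, which suggests that the reconstruction is at least consistent with the reported numbers.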

In conclusion, the placement of Platycarya with Engelhardia has implications for the evolution of juglandoid nut walls as well as for the biogeography of the family. Microsynteny and gene-content-based phylogenetic approaches contain so-far undervalued phylogenetic signal. This signal may be especially valuable for studying the phylogenetics of allopolyploid organisms because allopolyploidy involves hybridization and speciation simultaneously. For such lineages, genome structure makes it possible to distinguish coexisting homoeologous chromosomes and to use them separately to infer organismal ancestry. More importantly, large genome-structural variants capable of altering gene order and/or content probably rarely travel horizontally between species, thus making tree inference based on gene order or content more robust to introgression.

Methods

Fresh leaves of Rhoiptelea chiliantha (Xichou County, Yunnan Province, China, 23°22′42.73″N, 104°47′17.21″E) and Engelhardia roxburghiana (Wangmo County, Guizhou Province, China, 25°15′30.62″N, 105°57′40.7″E) were collected for extracting and sequencing genomic DNA. A permanent voucher of each species has been deposited in the BNU herbarium (Zhang BNU20180707-4 and Cao BNU20200818-1). Rhoiptelea chiliantha and E. roxburghiana were sequenced with 44.28 and 103 Gb of 150 bp paired-end reads, respectively, on the Illumina HiSeq X Ten sequencing platform, and GenomeScope 2.0 (ref. 65) was then used to evaluate their genome sizes with k-mer count histograms constructed by Jellyfish v2.3 (ref. 66) with a k-mer length of 17. Estimated genome sizes for R. chiliantha and E. roxburghiana were ~410 Mbp and ~880 Mbp, respectively (Supplementary Fig. 1).

For R. chiliantha, a total of 43 Gb (~105×) PacBio single-molecule long reads and 103 Gb (~108×) Illumina short reads of insert size 350 bp were used for the initial assembly and subsequent correction. In addition, a total of 40.86 Gb (~100×) raw data from Hi-C libraries were used for chromosome-scale genome assembly. For E. roxburghiana, a total of 71 Gb (~75×) PacBio single-molecule long reads and 103 Gb (~108×) Illumina short reads of insert size 350 bp were used in the assembly and subsequent correction. A total of 305 Gb (~347×) raw data from Hi-C libraries were used for chromosome-scale genome assembly. The details of each genome assembly are given in Supplementary Note 2.

To represent the major lineages of Juglandaceae (shown in Fig. 1), we downloaded five chromosome-level genomes of Carya67, Juglans3 and Platycarya27 from public databases. We also downloaded the genome of Myrica rubra (Myricaceae) and genomes of other Fagales that experienced no further polyploidy events beyond the γ-WGT as outgroups68,69,70,71,72 for subsequent analysis. Myricaceae have three genera and 50 species, and are the sister family of Juglandaceae.

Prediction of repeats and genome annotation

To annotate the genomes of R. chiliantha and E. roxburghiana, a combination of homology-based inference, ab initio prediction, and transcripts from RNA sequencing (RNA-seq) was used (for details see Supplementary Note 3). The final gene sets were functionally annotated using the Kyoto Encyclopedia of Genes and Genomes (KEGG) Automatic Annotation Server (https://www.genome.jp/tools/kaas/)73 to perform KEGG Orthology (KO) annotation.

Detecting whole-genome duplication

We used protein sequence and annotation information from seven species (C. illinoinensis, E. roxburghiana, J. mandshurica, J. microcarpa, J. regia, P. strobilacea and R. chiliantha) to detect and confirm WGD. BLASTP74 (e < 10−10 and top5 matches) was used to search all potential homologous gene pairs of protein sequences within each species and the collinear gene pairs were identified by MCScanX52 with default settings. We used TBtools75 to visualize the collinear blocks of 16 chromosomes of R. chiliantha and E. roxburghiana, respectively. Using information of intra-species collinear gene pairs of the seven species and the inter-species collinear gene pairs between each of the seven species and Quercus lobata69 in DupGen_finder31, we identified five modes of gene duplication, including whole-genome duplication (WGD), tandem duplication (TD), proximal duplication (PD), transposed duplication (TRD), and dispersed duplication (DSD).

For all collinear gene pairs (anchor genes) of WGD, we estimated synonymous substitutions per synonymous site (Ks), nonsynonymous substitutions per nonsynonymous site (Ka), and Ka/Ks using the Gamma-MYN algorithm76 in KaKs_Calculator 2.077. The fourfold synonymous third-codon transversion rate (4DTv) between collinear gene pairs was calculated using a Perl script (https://github.com/JinfengChen/Scripts/blob/master/FFgenome/03.evolution/distance_kaks_4dtv/bin/calculate_4DTV_correction.pl). To ensure the independence of Ks and 4DTv in each collinear block within species, we followed Schnable et al. 49 and estimated the median Ks and 4DTv values for each collinear block. Ks values equal to 0 or greater than 5 (following ref. 31) and 4DTv distances equal to 0 or greater than 1 were excluded. The R package ggplot278 was used to plot the distributions of Ks and 4DTv of collinear gene pairs.
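The filtering and per-block summary described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `block_medians`, the input layout (pairs of block identifier and distance), and the example values are assumptions, and the upper cutoff is passed as a parameter (5 for Ks, 1 for 4DTv, as read from the text).

```python
from collections import defaultdict
from statistics import median


def block_medians(pairs, max_value):
    """Return {block_id: median distance} for collinear gene pairs.

    pairs     -- iterable of (block_id, distance), one entry per gene pair
    max_value -- upper cutoff; distances of 0 or above the cutoff are dropped
    """
    by_block = defaultdict(list)
    for block_id, value in pairs:
        # Exclude uninformative (0) and saturated (> cutoff) estimates.
        if value <= 0 or value > max_value:
            continue
        by_block[block_id].append(value)
    # One median per block, so each block contributes a single data point.
    return {block: median(values) for block, values in by_block.items()}


# Hypothetical Ks values for two collinear blocks.
example = [
    ("block1", 0.30), ("block1", 0.40), ("block1", 0.0), ("block1", 0.50),
    ("block2", 6.2), ("block2", 0.17), ("block2", 0.19),
]
ks_medians = block_medians(example, max_value=5.0)
```

Summarizing each block by its median means that a block with many gene pairs does not weigh more heavily in the distribution than a short block.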

Inference of the WGD event and estimation of the substitution rate for each sampled Juglandaceae species

Based on the Ks distribution of collinear gene pairs, we found that the first peak was at 0.17 in R. chiliantha, much smaller than that in the other Juglandaceae, where it ranged from 0.36 to 0.48 (Fig. 2b). To account for this discrepancy, we tested two hypotheses: (1) R. chiliantha independently experienced a different WGD or (2) R. chiliantha shared the same juglandoid WGD but evolved much more slowly post-WGD. Under the latter hypothesis, the evolutionary rate of each species can be estimated using the procedure developed for the relative rate test79. To test whether the WGD event is shared, we used both intergenomic synteny analysis42,43 (Fig. 3a) and the ‘multiple gene family’ approach of Pfeil et al. 44 (Fig. 3b).

For the intergenomic synteny analysis42,43, we used BLASTP (E-value < 1e-10, top five matches) and MCScanX to determine intergenomic collinear blocks between R. chiliantha and the remaining Juglandaceae species. After using the Gamma-MYN algorithm in KaKs_Calculator 2.0 to calculate the Ks of collinear genes, we used WGDI80 to visualize the median Ks values of collinear blocks between R. chiliantha and the remaining Juglandaceae species. If R. chiliantha and the other Juglandaceae shared a WGD event, it should precede their divergence from each other; homologous chromosomes between species should show different divergences, and the median Ks values of WGD collinear blocks should be larger than the Ks values of orthologous collinear blocks between R. chiliantha and the remaining Juglandaceae species (Fig. 3a). If, however, R. chiliantha and the other species experienced independent WGDs, these WGD events would have occurred after their divergence from each other and consequently each chromosome in one species should be equally distant from any homologous chromosome in the other species, producing similar median Ks values for WGD and orthologous collinear blocks.

For the ‘multiple gene family’ approach44, we used gene families with at least two gene copies in R. chiliantha or the other Juglandaceae species (C. illinoinensis, E. roxburghiana, J. mandshurica, J. microcarpa, J. regia, P. strobilacea) and at least one copy in Quercus lobata (see details in Supplementary Note 4 and Supplementary Fig. 27). Within this set of gene families we distinguished three types: (i) families in which both R. chiliantha and the other Juglandaceae have two gene copies; (ii) families in which R. chiliantha has two copies and the other species have only one; (iii) families in which R. chiliantha has only one copy and the other species have two copies. If R. chiliantha and the other species of Juglandaceae (SP) shared the WGD event, the topology of the gene family trees should be either (Q. lobata, ((SP, R. chiliantha), (SP, R. chiliantha))) or (Q. lobata, ((SP, R. chiliantha), SP)) or (Q. lobata, ((SP, R. chiliantha), R. chiliantha)) (the left part in Fig. 3b). By contrast, if the WGD of R. chiliantha and the other species of Juglandaceae (SP) occurred independently, the topology should be either (Q. lobata, ((SP, SP), (R. chiliantha, R. chiliantha))) or (Q. lobata, ((SP, SP), R. chiliantha)) or (Q. lobata, ((R. chiliantha, R. chiliantha), SP)) (the right part in Fig. 3b).
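The topological test above reduces to asking which labels are paired in the two-leaf clades of each gene tree. The sketch below is a simplified illustration under stated assumptions: trees are hand-written nested tuples rather than parsed Newick files, the labels `"Q"` (Q. lobata), `"SP"` (another Juglandaceae species) and `"R"` (R. chiliantha) are placeholders, and the function names are hypothetical.

```python
def cherries(tree):
    """Yield the sorted label pair of every two-leaf clade in a nested-tuple tree."""
    if isinstance(tree, str):
        return  # a single leaf contains no two-leaf clade
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        yield tuple(sorted((left, right)))
    else:
        yield from cherries(left)
        yield from cherries(right)


def classify(tree):
    """Classify a gene tree as supporting a shared or an independent WGD."""
    found = set(cherries(tree))
    if ("R", "SP") in found:
        # An (SP, R. chiliantha) pair means the duplication predates speciation.
        return "shared"
    if ("SP", "SP") in found or ("R", "R") in found:
        # Same-species copies pair up, so duplication followed speciation.
        return "independent"
    return "uninformative"


# The six topologies listed in the text, written as nested tuples.
shared = [
    ("Q", (("SP", "R"), ("SP", "R"))),
    ("Q", (("SP", "R"), "SP")),
    ("Q", (("SP", "R"), "R")),
]
independent = [
    ("Q", (("SP", "SP"), ("R", "R"))),
    ("Q", (("SP", "SP"), "R")),
    ("Q", (("R", "R"), "SP")),
]
```

Counting how many gene families fall into each class then gives the relative support for the two hypotheses.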

After establishing that R. chiliantha and the other species share the WGD event, we used a relative rate test79 to estimate the evolutionary rate (Ks) for each species (Fig. 3c, Supplementary Fig. 27). The average evolutionary rate was calculated as

$$r={K}_{s}/t$$
(1)

where r is the rate of nucleotide substitution per synonymous site per year, Ks is the expected number of substitutions per synonymous site between the common ancestor and the focal species, and t is the divergence time64.
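As a worked illustration of Eq. (1), the rate is simply the branch-specific Ks divided by the divergence time. The function name and the numerical values below are hypothetical and chosen only to show the arithmetic; they are not estimates from this study.

```python
def substitution_rate(ks_branch, divergence_time_years):
    """Eq. (1): r = Ks / t, in substitutions per synonymous site per year.

    ks_branch             -- expected synonymous substitutions per site between
                             the common ancestor and the focal species
    divergence_time_years -- divergence time t, in years
    """
    if divergence_time_years <= 0:
        raise ValueError("divergence time must be positive")
    return ks_branch / divergence_time_years


# Hypothetical inputs: Ks = 0.2 accumulated over 80 million years.
r_example = substitution_rate(0.2, 80e6)
```

With these illustrative inputs the rate is 2.5e-9 substitutions per synonymous site per year; halving Ks for the same divergence time halves the rate, which is how a smaller Ks peak can reflect slower evolution rather than a younger WGD.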

Inferring effective population sizes

To determine whether the seven species of Juglandaceae experienced population contraction or expansion, we used the pairwise sequentially Markovian coalescent (PSMC; v.0.6.5-r67) approach81 to infer changes in Ne from the whole-genome information of one diploid individual per species. Except for E. roxburghiana and R. chiliantha, the whole-genome sequencing data in this study came from published Juglandaceae genomes (Supplementary Table 1). We used the BWA-mem algorithm82 of BWA v. 0.7.15 to align trimmed reads from the seven individuals to their respective reference genomes and used the SENTIEON DNAseq software package v. 201808.0883 to sort the alignments, remove PCR duplicates and realign indels for each individual. We then used the resulting Binary Alignment Map (BAM) files and SAMTOOLS v.1.284 to perform variant calling. SNPs were filtered with the mapping-quality adjustment (-C) set to 50 and the minimum mapping quality set to 20. When preparing the input files for PSMC, we set the minimum depth to one third of the mean genome depth and the maximum depth to twice the mean genome depth in the vcfutils.pl vcf2fq pipeline (https://github.com/lh3/samtools/blob/master/bcftools/vcfutils.pl). The substitution rates were set to the values estimated in this study, and the generation time was assumed to be 30 years85.
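The depth bounds used when preparing the PSMC input follow directly from the mean genome depth. The helper below is a small sketch: the function name and the rounding to whole numbers are assumptions, made because depth cutoffs are normally supplied as integers.

```python
def psmc_depth_bounds(mean_depth):
    """Return (min_depth, max_depth) as integers for a given mean depth.

    Minimum depth is one third of the mean and maximum depth is twice the
    mean, as described in the text; rounding is an assumption of this sketch.
    """
    if mean_depth <= 0:
        raise ValueError("mean depth must be positive")
    min_depth = round(mean_depth / 3)
    max_depth = round(2 * mean_depth)
    return min_depth, max_depth


# Hypothetical mean depths for two individuals.
bounds_30x = psmc_depth_bounds(30)
bounds_100x = psmc_depth_bounds(100)
```

Sites below the lower bound are poorly supported, and sites above the upper bound are often collapsed repeats, so both are masked before inferring Ne trajectories.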

We used ANGSD86 to calculate the heterozygosity of one individual for each species. The BAM file of each individual was used as input for ANGSD, with the parameters ‘-C 50 -minQ 20 -minMapQ 30’. The genome-wide heterozygosity was used to estimate the long-term average Ne for each species:

$${N}_{e}=\theta /(4\times \mu \times g)$$
(2)

where θ is the genome-wide heterozygosity, μ is the mutation rate per year and g is the generation time.
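Eq. (2) can be evaluated as in the sketch below. The function name and the example numbers are hypothetical; only the 30-year generation time is taken from the text. Multiplying the per-year rate by the generation time converts it to a per-generation rate before it is used in the denominator.

```python
def effective_population_size(theta, mu_per_year, generation_time_years):
    """Eq. (2): Ne = theta / (4 * mu * g).

    theta                 -- genome-wide heterozygosity
    mu_per_year           -- mutation rate per site per year
    generation_time_years -- generation time g, in years
    """
    mu_per_generation = mu_per_year * generation_time_years
    if mu_per_generation <= 0:
        raise ValueError("mutation rate and generation time must be positive")
    return theta / (4 * mu_per_generation)


# Hypothetical inputs: theta = 0.006, mu = 2.5e-9 per year, g = 30 years.
ne_example = effective_population_size(0.006, 2.5e-9, 30)
```

With these illustrative values Ne is about 20,000. Because μ appears in the denominator, a species with a slower substitution rate yields a larger Ne for the same heterozygosity, which is why species-specific rates matter here.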

According to the nearly neutral theory, selection efficiency would be lower in populations with lower effective population size (Ne)46. Thus, the species with smaller Ne will likely accumulate more nonsynonymous mutations, leading to larger Ka/Ks. We compared the distribution of Ka/Ks between R. chiliantha and the other Juglandaceae species for those genes that have similar values of Ks in both species.

Transcriptome mining for genes related to DNA repair and recombination

Four of the major DNA repair pathways, namely base excision repair (BER), nucleotide excision repair (NER), repair by homologous recombination (HR), and non-homologous end-joining (NHEJ), are highly conserved in eukaryotes87. We used KAAS (KEGG Automatic Annotation Server: http://www.genome.jp/kegg/kaas/) to annotate the proteins of the seven Juglandaceae species and then counted the number of genes related to DNA repair and recombination (KEGG Ontology: ko03400). Eleven R. chiliantha individuals were sampled for RNA-seq from two tissues (leaf and flower bud). We also downloaded paired-end RNA-seq data for the leaf, catkin, and pistillate flowers of J. regia2. After using Trimmomatic v0.3288 to trim low-quality bases or reads, we used HISAT2 v.2.1.089 to align the trimmed reads to the reference genome and htseq-count v.0.11.290 to count the number of reads. The number of fragments per kilobase of transcript per million mapped reads (FPKM) for each gene was calculated with an R script. In this way, we used the RNA-seq data to verify how many of the genes related to DNA repair and recombination are expressed in J. regia and R. chiliantha. To infer the function of genes that retain two copies in R. chiliantha from the juglandoid WGD, while the other Juglandaceae species have at most one copy, we used the R package clusterProfiler91 to perform KEGG enrichment analysis.
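The FPKM normalization mentioned above can be illustrated with the standard definition. This is a sketch in Python, not the R script used in the study; the function name and inputs are hypothetical, and taking the library size as the sum of counted fragments is an assumption.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments.

    counts     -- {gene_id: fragment count}, e.g. from a read-counting step
    lengths_bp -- {gene_id: transcript length in base pairs}
    """
    total = sum(counts.values())  # assumed library size: all counted fragments
    if total == 0:
        raise ValueError("no mapped fragments")
    # count / (length in kb) / (total in millions) = count * 1e9 / (length * total)
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total)
        for gene, count in counts.items()
    }


# Hypothetical counts: gene2 is three times longer and has three times the reads.
example_counts = {"gene1": 500, "gene2": 1500}
example_lengths = {"gene1": 1000, "gene2": 3000}
example_fpkm = fpkm(example_counts, example_lengths)
```

The two hypothetical genes end up with the same FPKM, which shows that the length correction removes the advantage longer transcripts have in raw counts.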

Subgenome assignment and phylogenetic tree inference

To infer the parental species of the juglandoid WGD, we used the subgenomes of seven Juglandaceae species and five representatives of Fagales that have experienced no further WGD beyond the γ-WGT as outgroups, namely Quercus lobata69, Betula pendula70, Carpinus fangiana68, Ostryopsis davidiana71, and Myrica rubra72. Subgenome assignment was performed for each pair of homoeologous chromosomes (or intraspecific collinear blocks, details in Supplementary Note 1) based on the numbers of retained ancestral genes49,50,51. We used reciprocal best hits (RBHs) between Q. lobata and each of the seven Juglandaceae to obtain a list of retained ancestral genes. We denoted chromosomes with more ancestral genes as ‘dominant’ and those with fewer ancestral genes as ‘recessive’. This approach to assigning subgenomes produces the same result as that of Zhu et al. 3, who used a slightly different approach based on the number of all genes; Xiao et al. 6 used evolutionary distance to assign subgenomes.
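The dominant/recessive labelling rule can be sketched as a comparison of retained ancestral gene counts within each homoeologous pair. The function name, chromosome names and counts below are hypothetical, and the handling of ties is an assumption because the text does not describe that case.

```python
def assign_subgenomes(pair_counts):
    """Label each chromosome of a homoeologous pair as dominant or recessive.

    pair_counts -- {(chr_a, chr_b): (n_a, n_b)}, where n_a and n_b are the
                   numbers of retained ancestral genes on each chromosome
    """
    labels = {}
    for (chr_a, chr_b), (n_a, n_b) in pair_counts.items():
        if n_a == n_b:
            # Ties are not covered in the text; flag them instead of guessing.
            labels[chr_a] = labels[chr_b] = "undetermined"
        elif n_a > n_b:
            labels[chr_a], labels[chr_b] = "dominant", "recessive"
        else:
            labels[chr_a], labels[chr_b] = "recessive", "dominant"
    return labels


# Hypothetical counts for two homoeologous chromosome pairs.
example_pairs = {
    ("chr1", "chr10"): (1200, 800),
    ("chr2", "chr9"): (650, 900),
}
subgenome_labels = assign_subgenomes(example_pairs)
```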

After subgenome assignment, we applied three methods of phylogenetic inference (alignment-based with ASTRAL-Pro54, whole-genome microsynteny-based, and local gene content-based) to three datasets to infer the phylogeny of Juglandaceae by including (i) both dominant and recessive subgenomes of all Juglandaceae and the above five outgroups; (ii) only the dominant subgenomes of each Juglandaceae and the outgroups; (iii) only the recessive subgenomes of each Juglandaceae and the outgroups. For each gene, the protein sequence of the longest transcript was selected for further analysis. For the DNA-sequence-alignment method, we used OrthoFinder v.2.4.092 to obtain the gene families for each dataset. To avoid multiple sequence alignment errors introduced by overly large gene families93, we kept only gene families with fewer than 100 genes for downstream analysis. The protein sequences of each gene family were aligned using MAFFT v7.47594 and then converted into CDS alignments with PAL2NAL v.1495. We used IQ-TREE v2.1.2 to construct gene family trees that were then used as input data for ASTRAL-Pro54 to infer the species tree. Details on the gene families in the three datasets are provided in Supplementary Table 6.

For whole-genome microsynteny-based phylogenetic inference, we used the synteny matrix representation with likelihood (Syn-MRL) pipeline30 to reconstruct phylogenetic trees from the above three datasets, namely both dominant and recessive subgenomes (dataset 1), only dominant subgenomes (dataset 2), and only recessive subgenomes (dataset 3). Following Zhao et al. 30, BLASTP74 was used to search for all potential inter- and intra-species homologous gene pairs (E-value < 1e-10, top five matches). MCScanX52 was used to detect inter- and intra-species synteny blocks with the recommended parameters of A5G25 (A: the minimum number of anchor pairs required to call a collinear block; G: the maximum number of intervening genes between two adjacent anchor pairs in a collinear block). We also explored the A10G25 and A15G25 parameter settings, which remove the smaller collinear blocks with fewer than 10 or 15 collinear gene pairs from the analysis. In the resulting synteny network, nodes are collinear genes in collinear blocks, and edges connect collinear gene pairs. Using the Infomap algorithm v. 1.6.0 (https://github.com/mapequation/infomap), we detected synteny clusters within the map equation framework, setting the two-level partitioning mode with ten trials (--clu -N 10 -2).

For dataset 1 with the A5G25 setting, the entire synteny network summarizes information from 67,954 pairwise syntenic blocks, and contains 270,276 nodes and 1,976,821 edges. The entire network was assigned to 19,849 clusters using the Infomap algorithm v. 1.6.0. These clusters were transformed into a binary presence-absence data matrix in which rows and columns represent species and clusters, respectively. We removed the clusters that contained only one (sub)genome following Pett et al. 41. The resulting matrix of 19 rows × 19,202 columns was used to infer a phylogeny with IQ-TREE v2.1.2, under the Mk+R+FO model. Details on the matrices in datasets 2 and 3 with the A10G25 and A15G25 settings are provided in Supplementary Table 6.
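The conversion of synteny clusters into a binary matrix can be sketched as follows. This is an illustration rather than the Syn-MRL pipeline itself: the function name, the input layout (cluster identifier mapped to the set of (sub)genomes it contains) and the toy data are assumptions.

```python
def presence_absence(clusters, genomes):
    """Build a binary presence-absence matrix from synteny clusters.

    clusters -- {cluster_id: set of (sub)genome names present in the cluster}
    genomes  -- ordered list of (sub)genome names (the matrix rows)
    Returns (kept_cluster_ids, {genome: [0/1, ...]}).
    """
    genome_set = set(genomes)
    # Drop clusters restricted to a single (sub)genome: they carry no
    # information about relationships among genomes.
    kept = [cid for cid in sorted(clusters)
            if len(clusters[cid] & genome_set) > 1]
    matrix = {
        genome: [1 if genome in clusters[cid] else 0 for cid in kept]
        for genome in genomes
    }
    return kept, matrix


# Toy example with three (sub)genomes and four clusters.
toy_clusters = {
    "c1": {"A", "B", "C"},
    "c2": {"A", "B"},
    "c3": {"C"},          # single-genome cluster, removed
    "c4": {"B", "C"},
}
kept_ids, pa_matrix = presence_absence(toy_clusters, ["A", "B", "C"])
```

Each row of the resulting matrix is one binary character string per (sub)genome, which is the form a binary-character likelihood model expects as input.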

For gene-content-based phylogenetic inference41, we followed Zhao et al. 30 to assign gene families of the three datasets with OrthoFinder v.2.4.0 and then used the gene presence/absence matrix to construct a phylogenetic tree with IQ-TREE v2.1.2. For each dataset, the gene presence/absence matrices had 19 rows × 17,149 columns, 12 rows × 17,926 columns, and 12 rows × 16,894 columns, respectively.

Since GRAMPA53 can infer gene duplications and losses in the presence of polyploidy and identify the most likely placement of the polyploid clades and their parental lineages, we also used this method to infer the parental lineages of the juglandoid WGD without performing a priori subgenome assignment. First, we used OrthoFinder v.2.4.092 to obtain the gene families for the seven Juglandaceae species and five outgroups and IQ-TREE v2.1.2 to construct gene trees. Then, we used ASTRAL-Pro to infer the species tree from the gene trees. We specified the seven species of Juglandaceae as possible polyploid lineages (H1 nodes) and GRAMPA searched the possible placements of the second parental lineage (H2 nodes). Finally, GRAMPA returned the reconciliation score for each multi-labeled tree (MUL-tree) considered (as well as for the original singly-labeled species tree). The MUL-tree with the minimum reconciliation score was the optimal placement for the H2 node.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.