Genomic basis of homoploid hybrid speciation within chestnut trees

Hybridization can drive speciation. We examine the hypothesis that Castanea henryi var. omeiensis is an evolutionary lineage that originated from hybridization between two near-sympatric diploid taxa, C. henryi var. henryi and C. mollissima. We produce a high-quality genome assembly for C. mollissima and characterize evolutionary relationships among related chestnut taxa. Our results show that C. henryi var. omeiensis has a mosaic genome but has accumulated divergence on all 12 chromosomes. We observe a positive correlation between admixture proportions and recombination rates across the genome. Candidate barrier genomic regions, which isolate C. henryi var. henryi and C. mollissima, are re-assorted in the hybrid lineage. We further find that the putative barrier segments are concentrated in genomic regions with lower recombination rates, suggesting that the interaction between natural selection and recombination shapes the evolution of hybrid genomes during hybrid speciation. This study highlights that reassortment of parental barriers is an important mechanism in generating biodiversity.

Software and code
Policy information about availability of computer code Data collection

Data analysis

Yongshuai Sun
Jun 5, 2020
Reference genome: sequence data were collected using Nanopore, Illumina and Hi-C platforms, with read calls produced by the provider platform software. Re-sequencing data were collected on Illumina instruments, with read calls produced by the provider platform software.
The paper reports a case of homoploid hybrid speciation and describes fundamental discoveries about the role of natural selection and recombination in the hybrid speciation of one Castanea taxon endemic to Mount Emei. In this hybrid system, the two parental lineages have similar geographic distributions but distinct phenotypes. A third species, sister to one of the two parental species, provides an ideal phylogenetic control for identifying potential genes contributing to reproductive isolation. Furthermore, there are several unique aspects to our work. First, based on Nanopore sequencing data and Hi-C data, we provide a chromosome-scale genome of the Chinese chestnut, which gives a solid basis for genomic analysis. Second, we use a large-scale population resequencing study to characterize polymorphisms across the range of four Castanea taxa. Third, we use multiple methods to test the hypothesis of homoploid hybrid speciation. Fourth, for the first time, we identify the barrier genomic regions isolating the parental lineages, taking advantage of the phylogenetic control to exclude the effects of random drift. Finally, we analyze the relationship between barrier loci and recombination rates.
We generate a genome assembly for C. mollissima and genomic sequences of 115 chestnut trees. In traditional population genetics, most genetic variation can be captured when the sample size is >=20 (see Warren J. Ewens, Mathematical Population Genetics, 1979). In our design, each lineage is treated as a population. We therefore collected young leaves from 20 trees of C. henryi var. omeiensis, 24 trees of C. henryi var. henryi, 43 trees of C. mollissima and 28 trees of C. seguinii, spanning the geographic ranges of the four taxa (Figure 1 and Supplementary Tables 6-9), to represent the four focal populations.
Chinese chestnut trees are economically important in China and are widely planted. We excluded populations located less than 50 km from villages, cities and planted chestnut forests, aiming to reduce the influence of domesticated Chinese chestnut trees. For each lineage, we sampled 20 or more trees to capture its genetic variation. We did not perform a formal sample size assessment, because this sample size is considered suitable (see Warren J. Ewens, Mathematical Population Genetics, 1979).
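The 50 km exclusion rule can be illustrated with a small distance filter. This is only a sketch under stated assumptions: the text does not say how distances were measured, so great-circle (haversine) distance is assumed here, and the function names and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_far_enough(site, human_sites, min_km=50.0):
    """True if the candidate site is at least min_km from every listed human-influenced site."""
    return all(haversine_km(*site, *h) >= min_km for h in human_sites)


# Hypothetical coordinates (lat, lon), for illustration only.
candidate = (0.0, 0.0)
print(is_far_enough(candidate, [(0.0, 1.0)]))  # about 111 km away, so kept
print(is_far_enough(candidate, [(0.0, 0.3)]))  # about 33 km away, so excluded
```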
We used GATK4 and ANGSD to generate the dataset used for population genomic analyses. This work was carried out by S. Y. and M. H.
In 2017-18, we collected leaves of 115 chestnut trees spanning their geographic ranges in China.
No data were excluded; the genomes of all 115 trees were used in this study. Bases with a Phred quality score < 20 were defined as low quality, because the accuracy of these bases is lower than 99%. Low-quality bases were masked, and were trimmed if they were at the end of a read.
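The link between a Phred score of 20 and 99% accuracy follows from the standard definition Q = -10 log10(p), where p is the base-calling error probability. The sketch below shows that conversion together with a simple mask-then-trim step. The function names are hypothetical, the use of "N" as the masking character is an assumption, and trimming is applied to both read ends for simplicity; the tool actually used for this step is not named in the text.

```python
def phred_to_error(q):
    """Error probability implied by a Phred quality score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mask_and_trim(seq, quals, min_q=20):
    """Mask bases with quality below min_q as 'N', then trim masked bases at the read ends.

    Assumption of this sketch: 'N' is the mask symbol, so any 'N' already
    present at a read end is trimmed as well.
    """
    masked = "".join(base if q >= min_q else "N" for base, q in zip(seq, quals))
    return masked.strip("N")


print(phred_to_error(20))  # error probability at Q20, i.e. 99% accuracy
print(mask_and_trim("ACGTAC", [10, 30, 30, 15, 30, 12]))  # internal low-quality base stays masked
```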
Four trees of C. mollissima and three trees of C. henryi were used to assess the quality of the SNPs called by the GATK4 pipeline described above, based on duplicated re-sequencing. For each of the 7 trees, two separate DNA samples were sequenced. These datasets were processed with the identical pipeline, from quality control through the GATK4 pipeline, blind to the fact that they were duplicates. We then compared the genotypes in the two duplicated samples for each of the 7 trees. Further, we verified the SNPs using population re-sequencing data of C. mollissima and C. henryi. The rationale is that a heterozygous genotype from one tree genome is counted as unverified if it can neither be detected in the duplicated sequencing data nor be found in the population samples. The maximum proportion of unverified heterozygous genotypes for any tree is 0.000167%, corresponding to a Phred-scaled score of 57.8, suggesting that the variants detected by the present GATK4 pipeline are of high quality.
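The conversion from the reported proportion to a Phred-scaled score, and the verification rule itself, can be sketched as follows. This is an illustration only: the function names, the genotype encoding ("0/1" for heterozygous) and the choice of denominator (all heterozygous calls in the first replicate) are assumptions of the sketch, not details given in the text.

```python
import math


def phred_scaled(fraction):
    """Phred-scaled score for an error fraction: Q = -10 * log10(fraction)."""
    return -10 * math.log10(fraction)


def unverified_het_fraction(rep_a, rep_b, population_het_sites):
    """Fraction of heterozygous calls in rep_a that are found neither in rep_b nor in population samples.

    rep_a, rep_b: dicts mapping a site label to a genotype string such as '0/1'.
    population_het_sites: set of site labels where the variant is seen in population samples.
    """
    het_sites = [site for site, gt in rep_a.items() if gt == "0/1"]
    if not het_sites:
        return 0.0
    unverified = [
        site for site in het_sites
        if rep_b.get(site) != "0/1" and site not in population_het_sites
    ]
    return len(unverified) / len(het_sites)


# 0.000167% expressed as a fraction, then converted to a Phred-scaled score.
print(round(phred_scaled(0.000167 / 100), 1))
```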
All 115 samples were assigned to one of three species (C. mollissima, C. seguinii, C. henryi) at the time of leaf collection, according to their distinct morphological traits. To confirm these assignments, we performed genetic delimitation using clustering analyses, including ADMIXTURE and FASTSTRUCTURE.
When collecting samples and analyzing data, the analysts had no prior knowledge of the demography of each species, their relationships, or whether hybridization had occurred among them. When calling SNPs, we used a double-blind experiment to validate the quality of our SNP data.