Mei (Prunus mume) is an ornamental woody plant that has been domesticated in East Asia for thousands of years. High diversity in floral traits, along with its recent genome sequence, makes mei an ideal model system for studying the evolution of woody plants. Here, we investigate the genetic architecture of floral traits in mei and its domestication history by sampling and resequencing a total of 351 samples, including 348 mei accessions and three other Prunus species, at an average sequencing depth of 19.3×. A highly admixed population structure and introgression from Prunus species are identified in mei accessions. Through a genome-wide association study (GWAS), we identify significant quantitative trait loci (QTLs) and genomic regions where several genes, such as MYB108, are positively associated with petal color, stigma color, calyx color, and bud color. Results from this study shed light on the genetic basis of domestication in flowering plants, particularly woody plants.
Prunus mume has been domesticated for thousands of years in China because of its favorable ornamental features1, and its cultivation has since expanded throughout East Asia. The flowers of mei are prized for their colorful corollas, varying flower types, pleasant fragrance, and tolerance of temperatures as low as −19 °C2. Because mei is a long-lived woody plant, many trees that are hundreds or even thousands of years old are found in several locations in China, providing a unique set of material for studying the genetic processes underlying domestication. Distant hybridization has been extensively conducted between mei and other Prunus species to improve its agronomic traits and environmental adaptation, and to understand the genetic diversity of important ornamental traits3,4,5. Various types of molecular markers have been developed for mei that provide powerful tools for studying the pattern of genetic diversity within and between populations and for constructing genetic linkage maps aimed at identifying QTLs controlling quantitatively inherited traits6,7,8,9. More recently, genome sequencing of mei has made it an ideal model system for the genetic research of woody plants10.
Flowers have long been a focus of interest in studying this species due to their ornamental value. Mei exhibits an astonishing variety of petal colors, shapes, sizes, petal numbers, and floral bud apertures (whether there is an opening at the top of the otherwise closed floral bud). Other ornamental traits, such as wood color and branching habit, have also received considerable attention. Several studies have investigated the evolution of regulatory networks for transitions in floral organ identity11,12, floral symmetry13, and flowering time14. Flower pigmentation has also been studied from an evolutionary genetic perspective14,15,16. The genetic and molecular bases for the evolution of petal color have been characterized14,15. However, a systematic study to chart the genetic architecture of these traits in a large population using a genome-wide association (GWA) method has not yet been reported.
Here, we report the identification of significant QTLs that control floral traits in mei in a GWAS including 348 mei accessions. We further sequenced transcriptomes of flowers with diverse traits to validate the QTLs by biased expression of candidate genes between transcriptomes. The present study has for the first time elucidated the genetic architecture of floral size, color, and structure, in terms of the number of loci, their genomic distribution and the magnitude and pattern of their effects in a woody plant. In addition, by comparing mei with other Prunus species, we can begin to study the evolutionary diversification of the Prunus genome.
Sequencing and variant discovery
We collected a highly phenotypically diverse population including most of the existing P. mume cultivars, wild mei, and its close relatives for whole-genome sequencing. A total of 348 mei accessions, including 333 landraces and 15 wild mei accessions, were sampled and sequenced for the present study (Supplementary Data 1). Landraces could be classified into eleven cultivar groups (Pendulous, Single Flowered, Versicolor, Pink Double, Flavescens, Tortuosa, Green Calyx, Albo-plena, Cinnabar Purple, Apricot Mei and Meiren) according to the classification system of Chinese mei1. Three close relatives of mei, Prunus sibirica, Prunus davidiana, and Prunus salicina, were also sampled and sequenced in this analysis. The geographic origins of the most representative mei accessions analyzed here spanned China, Japan, and France. Each sample was sequenced to ~19.27-fold coverage using the Illumina HiSeq 2000 sequencing platform, generating a total of 16.28 billion raw paired-end reads and 13.71 billion clean reads after filtering (Supplementary Data 1). Deep sequencing data (~70.1-fold coverage) from eight mei trees from different populations and three other Prunus species were used to establish the pan-genome of P. mume and the genus Prunus. By mapping all clean reads against the P. mume reference genome10, we identified a total of ~12.76 million raw SNPs and ~5.34 million high-quality SNPs after calibration and filtration (Supplementary Table 1). Applying the same variant-calling method to sequencing data from the individual from which the reference genome had been assembled10, we found that 0.28% of sites were called as homozygous for a different genotype; these might be false positives, indicating high accuracy of the variant-calling method. The allele frequency spectrum (Supplementary Fig. 1) also showed proper segregation of the population.
A total of 1,298,196 (10.17%) of raw SNPs were located within coding regions of genes, and 733,292 (5.74%) of them were non-synonymous (Supplementary Table 1). The ratio of non-synonymous to synonymous substitutions in this mei collection was 1.30, very similar to that of a collection of peach accessions (1.31)17. We also detected an average of 7313 deletions, 1117 insertions, and 623 structural variants (SVs) in this population (Supplementary Data 2).
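The reported ratio of non-synonymous to synonymous substitutions can be checked from the counts given in this paragraph. The short sketch below assumes, for illustration only, that every coding-region SNP is either synonymous or non-synonymous (other classes such as stop-gain variants are ignored); the function name is hypothetical.

```python
# Counts taken from the text above.
coding_snps = 1_298_196   # raw SNPs located within coding regions
nonsyn_snps = 733_292     # non-synonymous SNPs


def nonsyn_to_syn_ratio(coding: int, nonsyn: int) -> float:
    """Ratio of non-synonymous to synonymous SNP counts.

    Assumption (not stated in the source): coding = nonsyn + syn.
    """
    syn = coding - nonsyn
    if syn <= 0:
        raise ValueError("synonymous count must be positive")
    return nonsyn / syn


ratio = nonsyn_to_syn_ratio(coding_snps, nonsyn_snps)
print(round(ratio, 2))
```

Under this assumption the synonymous count is 564,904, and the resulting ratio agrees with the value of 1.30 quoted in the text.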
Population evolution of P. mume
We explored the phylogenetic relationships of these 351 accessions using all high-quality SNPs identified, with the three other Prunus species as the outgroup. The 348 mei accessions could be roughly divided into 16 subgroups within the phylogeny (Fig. 1a). We calculated bootstrap values for each node: 91.1% of nodes (318/349) had a bootstrap value over 90, and all 16 subgroups were of high confidence (Supplementary Fig. 2). We mapped ten representative ornamental traits of P. mume onto the phylogenetic tree and found that these traits were also quite diverse within the subgroups (Fig. 1). We carried out a population structure analysis to estimate individual ancestry and admixture proportions using FastStructure (v1.0)18. The population structure (Supplementary Fig. 3a) was highly admixed and revealed eight sub-populations (the cross-validation error was at its minimum when K = 8) (Supplementary Data 3). This was highly consistent with the phylogenetic tree (Fig. 1a) and the principal component analysis (PCA) (Supplementary Fig. 3b). The structure result was also applied in the later GWAS analysis as a fixed covariate in the regression model to eliminate the effects of population structure.
Linkage disequilibrium (LD) values (correlation coefficient, r2) were also calculated among wild and different cultivar classes (Supplementary Fig. 3c). Higher LD was found in most cultivar sub-populations (5 out of 7 cultivar classes) compared with the wild population. However, we found two classes, Pink Double and Single Flowered, to have lower LD than the other cultivar sub-populations and the wild population, which was probably caused by massive introgression from other species into these two sub-populations (as indicated in the next section). In addition, the genetic diversity (π) of wild and cultivated mei was estimated to be 2.82 × 10−3 and 2.01 × 10−3, respectively, which is relatively low compared with crops (Supplementary Table 2). We also estimated LD decay for four representative traits (Supplementary Fig. 3d). LD levels for most subgroups were also lower than that of the wild group. For subgroups with opposite phenotypes for bud color and pistil character, LD patterns were similar. However, LD levels were higher for subgroups with red wood and green stigmas than for their corresponding opposite subgroups.
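As an illustration of the r2 statistic used above, the following minimal sketch computes r2 between two biallelic SNPs from phased haplotypes coded as 0/1. This is a textbook formulation under the assumption that phase is known; it is not the procedure used by the software applied in this study, and the function name is hypothetical.

```python
def ld_r2(hap_a, hap_b):
    """Squared correlation (r^2) between two biallelic SNPs.

    hap_a, hap_b: equal-length lists of 0/1 alleles, one entry per
    phased haplotype (assumed known here for simplicity).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype lists must be non-empty and equal in length")
    p_a = sum(hap_a) / n                     # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n                     # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                     # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                  # undefined for a monomorphic SNP
    return d * d / denom


print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))  # fully linked SNPs
print(ld_r2([0, 1, 0, 1], [0, 0, 1, 1]))  # independent SNPs
```

Plotting such r2 values against the physical distance between SNP pairs gives the kind of LD decay curve referred to in the text.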
Introgression from apricot and plum
There were three major branches of mei cultivars (the True Mume Branch, originating from wild mei; the Apricot Mei Branch, hybrids between P. mume and P. armeniaca; and the Meiren Branch, hybrids between P. cerasifera cv. Pissardii and P. mume) according to a previous study1, and these three classes can be distinguished on the phylogenetic tree (Fig. 1a). Fourteen of the 16 Apricot Mei cultivars clustered into the outgroup or subgroup P1, which was consistent with the fact that these Apricot Mei cultivars originated from natural hybridization between mei and apricot19. Meanwhile, wild mei accessions appear in several lineages, while many of the current cultivars are genetically closer to apricot or Apricot Mei hybrids, which was also consistent with the artificial hybridization events between cultivars and apricot/Apricot Mei3,4,5. Taken together, these observations suggest extensive introgression from Prunus species into mei cultivars.
To further analyze the introgression events, which were considered essential for mei cultivation19, we carried out the three-population F3 test20 to assess the extent of introgression (Supplementary Fig. 4, Supplementary Data 4). We first analyzed the introgression in the two hybrid cultivar groups (Apricot Mei and Meiren) and observed significant introgression from apricot and plum, respectively (Supplementary Fig. 4), supporting the reliability of the introgression analysis. We then analyzed the introgression in the nine True Mume cultivar groups (Pendulous, Single Flowered, Versicolor, Pink Double, Flavescens, Tortuosa, Green Calyx, Albo-plena, and Cinnabar Purple). Pink Double and Single Flowered (two varieties of cultivated mei) cultivars showed significant introgression (Z-score < −1.96) from the three Prunus species (Fig. 2), while Pendulous and Cinnabar Purple cultivars showed weak inter-species introgression signatures. This reflected the extensive inter-species introgression between mei and Prunus species, which complicates both the population structure and the dissection of domestication history.
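The logic of the three-population test can be sketched as follows. In its basic form, the statistic for a target population C and two candidate sources A and B is the average over SNPs of (c − a)(c − b), where a, b, and c are allele frequencies; a significantly negative value suggests that C is admixed between lineages related to A and B. The sketch below is a simplified illustration: it omits the finite-sample bias correction used in full implementations, uses a plain delete-one-block jackknife for the Z-score, and all function names and the block count are hypothetical.

```python
import math


def f3(target, src1, src2):
    """Mean of (c - a)(c - b) over SNPs; inputs are allele-frequency lists."""
    terms = [(c - a) * (c - b) for c, a, b in zip(target, src1, src2)]
    return sum(terms) / len(terms)


def f3_zscore(target, src1, src2, n_blocks=5):
    """Z-score of f3 using a delete-one-block jackknife over contiguous SNPs."""
    n = len(target)
    size = n // n_blocks
    full = f3(target, src1, src2)
    estimates = []
    for k in range(n_blocks):
        lo = k * size
        hi = n if k == n_blocks - 1 else (k + 1) * size
        keep = [i for i in range(n) if i < lo or i >= hi]
        estimates.append(f3([target[i] for i in keep],
                            [src1[i] for i in keep],
                            [src2[i] for i in keep]))
    mean_est = sum(estimates) / n_blocks
    var = (n_blocks - 1) / n_blocks * sum((e - mean_est) ** 2 for e in estimates)
    if var == 0:
        return float("nan")
    return full / math.sqrt(var)


# Toy frequencies: the target lies midway between the two sources at every SNP.
src_a = [0.1, 0.2, 0.3, 0.1, 0.2, 0.4, 0.1, 0.3, 0.2, 0.1]
src_b = [0.9, 0.8, 0.9, 0.7, 0.6, 0.9, 0.8, 0.7, 0.9, 0.6]
mixed = [(a + b) / 2 for a, b in zip(src_a, src_b)]
print(f3(mixed, src_a, src_b), f3_zscore(mixed, src_a, src_b))
```

With real data, a Z-score below the −1.96 cutoff quoted in the text would be read as a significant introgression signal.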
The pan-genome of P. mume and Prunus
The core- and pan-genome shared by all sequenced accessions is a powerful tool for investigating genetic diversity within populations and genomic variants arising during domestication21. We sequenced and assembled genomes of nine representative mei trees including those of individuals representing eight cultivars from subpopulations P2, P5, P6, P8, P9, P11, and P15 and one wild accession from subpopulation P15 (Table 1), and also those of four close relatives including P. sibirica, P. davidiana, P. salicina and a previously reported Prunus persica genome7 to establish a pan-genome for P. mume and Prunus. To assess assembly method and evaluate assembly quality of each genome, we re-assembled the reference genome using the same approach and mapped the sequencing reads against the assembled genomes of each species. The consistent ratio (~98.13%) and sequencing depth (Supplementary Table 3) indicated good quality of the method. Assembly of the genome from each sequenced accession resulted in an average contig N50 of ~15.5 kb and scaffold N50 of ~22.6 kb, respectively (Table 1, Supplementary Table 4). An average of 25,839 genes were annotated per genome, which accounted for 82.32% of the mei reference genome (31,390 genes in total, Table 1 and Supplementary Table 5). Assembly-based methods identified 1.30–1.47 million SNPs among the P. mume accessions, and 2.85–3.38 million SNPs among the other sequenced Prunus genomes (Supplementary Table 5). A low but consistent SNP frequency might suggest a low rate of divergence during the domestication of mei.
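For readers unfamiliar with the N50 metric quoted above, a minimal sketch of its standard definition follows: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total length 54, so the cumulative sum must reach 27.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```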
We found that 71.68% and 60.96% of the P. mume reference genes were shared by all nine P. mume and 13 individual Prunus accessions (Supplementary Tables 6 and 7). We found that 3364 genes did not appear in the Prunus core gene set and were thus mei-specific genes (Table 1). Among the mei-specific genes, those related to flavonoid, phenylpropanoid, stilbenoid, diarylheptanoid and gingerol biosynthesis, and phenylalanine metabolism were relatively enriched (Supplementary Data 5). These functional pathways likely influence the development of important aspects of the ornamental traits of mei. For instance, flavonoids are important plant pigments that influence flower coloration, and phenylpropanoids serve as essential components of floral pigments and scent compounds in addition to their roles in wood formation. Our findings were further confirmed by the subsequent GWAS. It is noteworthy that three out of six DAM (dormancy-associated MADS-box transcription factors family) genes reported in P. mume10 were not present in either the mei or Prunus core gene sets, possibly due to the divergence of blooming time between mei and Prunus accessions.
Novel sequences in each genome were identified by whole-genome alignment with the reference genome and the core pan-genome assembly. Detected presence-absence variations (PAVs) varied in length from 0.19 to 0.55 Mb in P. mume genomes and from 8.94 to 25.85 Mb in other Prunus genomes (Supplementary Tables 8 and 9). To eliminate low-confidence PAVs and illuminate their population patterns, we mapped reads from 351 resequencing samples to all eight P. mume genomes’ PAVs. We found that about 6.25% of PAVs resulted from unassembled sequences and thus excluded them from subsequent analyses. We further performed hierarchical clustering of samples based on the distribution of coverage and identity of high-confidence and population-specific PAV sequences (Supplementary Table 10). Samples could be clustered into ~16 groups, in high concordance with the number of previously identified subpopulations (Fig. 3a). We then discovered several population-specific PAVs that showed particular patterns within geographical subpopulations, and decreased coverage that might reflect domestication. Subpopulation P11 includes the wild mei accession S329 from Tibet, which is regarded as the progenitor of current mei accessions, and S179, which were used to construct the Prunus core pan-genome. Two PAVs identified in the S179 genome were highly specific to subpopulation P11 and showed different patterns of change in coverage during domestication (Supplementary Table 11). For example, coverage of PAVs was high in accessions from Tibet, Wuhan, and other populations from south China, but coverage was lower in populations from north China, including Beijing and Qingdao, and those from Japan (Fig. 3b). PAVs could thus be used for identification of mei accessions.
We used core genes identified in 13 individual Prunus accessions and three sequenced relatives from the Rosaceae to construct a phylogeny and reconstruct the evolutionary history of Prunus genus. Our results suggested that P. sibirica might be more closely related to mei than is any other Prunus species in the present study, which was consistent with a previous report22. We estimated the divergence times between P. mume and other Prunus species as ~3.8 MYA and between wild and cultivated mei as ~2.2 MYA, which far predated the estimated domestication of cultivated mei (Fig. 3c). Therefore, divergent selection may have contributed to the differentiation of the two subspecies long before the domestication of mei.
GWAS analysis of ten ornamental traits
Our study population was composed of different landraces of a woody plant, whose selection history differs from that of crops, as mentioned above. We therefore developed a method based on logistic regression to perform GWAS of 24 mei floral traits (Supplementary Data 3), including the previously calculated population structure as a fixed covariate in the model. The method performed well, as shown by Q–Q plots of p-values for each trait (Supplementary Fig. 5), in which p-values after correction were close to the expected curves. We identified five significant (Bonferroni-corrected P < 0.01) candidate regions on four chromosomes associated with ten traits including petal color, stigma color, calyx color, bud color, staminal filament color, wood color, petal number, pistil character, bud aperture, and branching phenotype (Fig. 1b). To validate the differential expression of candidate genes significantly associated with flower-related phenotypes in mei, we performed RNA-Seq analysis by sequencing six transcriptomes of two representative cultivars, ‘Wu Yu Yu’ and ‘Mi Dan Lv’ (WYY and MDL for short, three biological replicates per sample) that have distinct phenotypes for petal color, calyx color, and petal number (Supplementary Table 12). We identified a total of 3277 significant differentially expressed genes (DEGs), 159 of which were specifically expressed in one sample type and subsequently designated as specifically expressed genes (SEGs) (Supplementary Fig. 6).
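To give a sense of scale for the Bonferroni-corrected cutoff mentioned above, the sketch below computes a per-SNP significance threshold. The number of SNPs actually tested is not stated in this section; the value used here (the ~5.34 million high-quality SNPs reported earlier) is an assumption made purely for illustration, and the function name is hypothetical.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that controls the family-wise error at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


# Assumption: one test per high-quality SNP (~5.34 million), alpha = 0.01.
threshold = bonferroni_threshold(0.01, 5_340_000)
print(threshold, -math.log10(threshold))
```

Under this assumption, a SNP would need a raw p-value below roughly 1.9 × 10−9 to be called significant for a single trait.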
We identified a total of 76 SNPs within DEGs that were associated with petal, stigma, calyx and bud color, respectively (Supplementary Data 6-9). Interestingly, we found that these SNPs were associated with the region spanning from 229 kb to 5.57 Mb on chromosome Pa4 (Fig. 4a, Supplementary Figs. 7, 8 and 9). MYB108 (Pm012912, Pa4:411731:413009) encodes an R2R3 MYB transcription factor and is located in the same region on chromosome Pa4 as the petal color-associated SNPs (Fig. 4a). Several members of this gene family reportedly affect flower color in plants23,24,25. Moreover, MYB108 was expressed in all three WYY samples, which have red flowers, but not in any of the MDL samples, which have white flowers (Fig. 4b). To investigate the regulatory network affecting expression of MYB108, we used the String database (http://string-db.org/)26 to construct an interactive network of the SEGs identified above (Fig. 4c). In the network model, E2F transcription factor 1 (E2F1), which binds a sequence present in the promoter of the S-phase-regulated gene CDC6 and is a member of a multigene family with several different activities in Arabidopsis27, has the same expression pattern as MYB108. This indicated that E2F1 might be involved in the regulation of MYB108 gene expression. We also predicted three possible promoters in the upstream region of MYB108 using TSSP28 and found that one of them was significantly associated with a SNP (Pa4:411479, P = 11.39) located at an ambiguous nucleotide in this promoter (RSP00161, WAAAG, Supplementary Data 10). This mutation in the putative MYB108 promoter might affect the regulation of MYB108 expression. We also found that MYB108 only exists as a single-copy gene in all reported plant genomes. MYB108 is highly conserved within Prunus, but differs notably among clades (Supplementary Figs. 10 and 11). These results suggest that MYB108 might play a critical role in the control of petal color in Prunus.
Our GWAS results showed that wood and staminal filament color were each significantly associated with two regions located on chromosome Pa3 (wood color, R1: 20601577-20832908 and staminal filament color, R2: 444623-3375607, Fig. 5a, Supplementary Fig. 12, Supplementary Data 11–12). R1, which spans ~231 kb and contains 48 genes, includes SNPs significantly associated with wood color compared to those in the regions flanking R1 (Fig. 5b). We also identified a polymorphism in R1 between subpopulations with green or red wood by calculating the Fst29 and π values for these subpopulations, which indicated that R1 was involved in a selective sweep driven by artificial selection30 (Fig. 5c). For staminal filament color, we focused on the significant DEGs between MDL and WYY in these two regions (Fig. 5d). In R2, we found several candidate genes that could affect staminal filament color, including HAESA (Pm010102) (Fig. 5e), a dual-specificity receptor kinase that acts on both serine-/threonine- and tyrosine-containing substrates to control floral organ abscission31. Another is SPL5 (squamosa promoter binding protein-like 5, Pm010075), a transacting factor that binds specifically to a consensus nucleotide sequence in the AP1 promoter32. A third is PRXR1 (peroxidase 42, Pm009792), which is involved in the biosynthesis and degradation of lignin32.
We identified a region spanning ~3.64 Mb located on chromosome Pa1 (4058003-7693997) that was significantly associated with petal number, pistil character, and bud aperture of mei (Supplementary Figs. 13, 14 and 15, Supplementary Data 13–15). This region contained 41 DEGs, including the SEGs Pm000751 (LAC17, laccase 17), Pm001026 (hydroxyproline-rich glycoprotein family protein), and Pm000753 (PRS, PRESSED FLOWER) (Supplementary Data 16). LAC17 functions in lignin degradation and detoxification of lignin-derived products32,33,34,35. PRS is a probable transcription factor that is involved in lateral sepal axis-dependent flower development perhaps by regulating the proliferation of L1 cells in lateral regions of flower primordia33,36,37. These two genes could be involved in the control of petal number and pistil character of mei.
A candidate region associated with branching phenotype spanning ~1.15 Mb located on chromosome Pa7 was identified in a hybrid population in a previous study38. In our study, several candidate regions were also found on chromosome Pa7 (Supplementary Fig. 16). Between the two studies, a total of 13 candidate genes that might be associated with branching phenotype, including the transcription factor bHLH157 (Pm024214) and cytochrome P450 78A9 (CYP78A9, Pm024229), have been identified (Supplementary Data 17). Several P450-dependent reactions have been characterized in the phenylpropanoid pathway, which controls the synthesis of lignin39.
By using a combined strategy of genome resequencing, de novo assembly and genome-wide association analysis, we address a fundamental challenge for inferring the population diversification of mei (Prunus mume), a woody plant, and the genetic architecture of its floral traits. We sampled and sequenced 333 mei landraces and 15 individual mei trees from a wild stand, along with three other Prunus species, to reveal the genetic divergence of mei under its domestication. We studied the genetic control of flower color, structure, and shape because these traits directly determine the reproductive behavior, mating system, and domestication process of higher plants40.
The 348 mei accessions sampled were clustered into 16 distinct subgroups, suggesting that the mei landraces in our population might have been selected from the wild group and could be regarded as founders (Fig. 1). Our findings provided evidence of extensive introgression between mei and other Prunus species, even in some True Mume groups, according to the most up-to-date classification system of mei (Fig. 2). Moreover, we found that two cultivar groups had lower LD levels than the wild population, and these two groups showed a strong signature of introgression from Prunus species. This suggested that introgression was probably the reason why higher LD was observed in the wild population than in these cultivar groups. Together, these lines of evidence point to a complicated ancestry of the current mei population.
By further deep sequencing of nine representative landraces of P. mume and its relatives, P. persica, P. salicina, P. sibirica and P. davidiana, we established a pan-genome for Prunus. Lineage- and species-specific sequences were found to be highly correlated with sub-population genetic architecture (Fig. 3) or with particular adaptive or ornamental traits. These data will facilitate future work on the evolution of the genus Prunus and the genetic basis of floral traits.
The genetic control of flower-related morphologies of mei was found to be polygenic. For each of the ten floral traits studied, including petal color, wood color, petal number, flower density, branch type, and branching structure, numerous QTLs were identified (Figs. 4 and 5, Supplementary Data 18), but with a single locus explaining only 1–3% of the phenotypic variance. Further gene expression studies characterized the transcripts of genes that underlie these QTLs. Many of the QTLs detected were found to reside in genomic regions near candidate genes. A QTL affecting petal color was located in the same region as MYB108 (Pm012912, Pa4:411731:413009) on chromosome Pa4. MYB108, encoding an R2R3 MYB transcription factor, belongs to a gene family observed to mediate flower color in plants23,24,25. By combining transcriptome data with data from existing databases, we further reconstructed a regulatory network related to MYB108, from which the E2F transcription factor 1 (E2F1) that binds a sequence present in the promoter of the S-phase-regulated gene CDC6 was found to activate the expression of MYB108. As a highly conserved gene within Prunus, MYB108 may have played a critical role in modulating the genetic control and evolution of petal color in Prunus.
We identified two QTLs located on chromosome Pa3, associated with wood color and staminal filament color, respectively (Fig. 5a, Supplementary Fig. 12, Supplementary Data 11-12). A polymorphism within the first QTL was found to determine wood color. One genotype at this polymorphism produces green wood, whereas the alternative produces red wood. By calculating the Fst29 and π values for the subpopulations that are dominated by the two alternative genotypes at this polymorphism, it was inferred that the QTL for wood color may have been involved in a selective sweep driven by artificial selection30 (Fig. 5c). The QTL for staminal filament color is associated with several candidate genes, including HAESA (Pm010102) (Fig. 5e), a dual-specificity receptor kinase that acts on both serine-/threonine- and tyrosine-containing substrates to control floral organ abscission31; squamosa promoter binding protein-like 5 (SPL5, Pm010075), a transacting factor that binds specifically to a consensus nucleotide sequence in the AP1 promoter32; and PRXR1 (peroxidase 42, Pm009792), which is involved in the biosynthesis and degradation of lignin32.
Petal number is a key trait that determines the flower structure of mei. One QTL for petal number, as well as for pistil character and bud aperture, was found in a region of chromosome Pa1 (Supplementary Figs. 13, 14 and 15, Supplementary Data 13–15), containing 41 DEGs, such as the SEGs Pm000751 (LAC17, laccase 17), Pm001026 (hydroxyproline-rich glycoprotein family protein) and Pm000753 (PRS, PRESSED FLOWER) (Supplementary Data 16). These genes have important functions in lignin degradation and detoxification of lignin-derived products32,33,34,35 and in lateral sepal axis-dependent flower development.
The candidate regions associated with petal color (Pa4), wood color (Pa3), and petal number (Pa1) are located in different regions of the mei genome, and their independence suggests that these traits have experienced different routes of evolution. Although many of the QTLs detected can be annotated, many others lie in previously uncharacterized regions or are not associated with any annotated genes. The biological functions of these unknown QTLs deserve further investigation. Taken together, the identification of genetic loci associated with floral and other traits provides more insight into the genetic mechanisms that underlie the domestication of mei and provides opportunities to design strategies for genomic selection to improve the performance of ornamental species.
Plant materials and genome sequencing
We collected leaves from 333 representative mei landraces, 15 wild P. mume, and three close relatives of Prunus, including P. sibirica, P. davidiana, and P. salicina (Supplementary Data 1). Meanwhile, information for up to 24 phenotypic categories was recorded for each sample (Supplementary Data 3). Genomic DNA was extracted from fresh or gel-dried leaves using a standard cetyl trimethyl ammonium bromide (CTAB) protocol41. After DNA extraction, genomic libraries were prepared following the manufacturer’s standard instructions (Illumina). To construct paired-end libraries, DNA samples were fragmented by nebulization with compressed nitrogen gas and treated to create blunt ends before adding an A to each 3′-end. DNA adaptors with a single T 3′-end overhang were ligated to the above products. Libraries with short insert sizes of 500 bp were constructed for each resequencing sample, and extra libraries with long insert sizes of 2 kb were constructed for the core- and pan-genome assembly. All libraries were sequenced on the Illumina HiSeq 2000 sequencing platform, and paired-end reads from each library were obtained. Reads were firstly filtered to obtain high-quality sequences for mapping and assembly by eliminating: (i) reads containing ≥10% ambiguous bases, (ii) reads with low-quality data (Phred quality scores Q ≤ 7) for 65% of bases for short inserts libraries (<2 kb) or 80% of bases for long inserts (2 kb), (iii) reads containing 10-bp adaptor sequences, (iv) reads with >10 bp overlaps between two ends of short-insert reads, and (v) reads with identical sequences at both ends.
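Two of the read-filtering criteria listed above, (i) and (ii), can be sketched as a simple predicate. The text does not say whether the quality cutoff applies at exactly 65% or above it, so the sketch assumes that a read is discarded when the low-quality fraction reaches the stated percentage. The function and parameter names are hypothetical and do not correspond to the pipeline actually used.

```python
def keep_read(seq, quals, max_n_frac=0.10, low_q=7, max_low_frac=0.65):
    """Return True if a read passes the ambiguity and quality criteria.

    seq: base string; quals: list of Phred scores, one per base.
    max_low_frac: 0.65 for short-insert libraries, 0.80 for long-insert
    libraries (values taken from the text; the >= comparison is an assumption).
    """
    if not seq or len(seq) != len(quals):
        raise ValueError("sequence and qualities must be non-empty and aligned")
    n_frac = seq.upper().count("N") / len(seq)           # criterion (i)
    low_frac = sum(1 for q in quals if q <= low_q) / len(quals)  # criterion (ii)
    return n_frac < max_n_frac and low_frac < max_low_frac


print(keep_read("ACGTACGTAC", [30] * 10))            # clean read
print(keep_read("NNACGTACGT", [30] * 10))            # 20% ambiguous bases
print(keep_read("ACGTACGTAC", [5] * 7 + [30] * 3))   # 70% low-quality bases
```

Criteria (iii) to (v), which concern adaptor contamination, overlapping read ends, and duplicated pairs, would need sequence comparison and are left out of this sketch.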
RNA-Seq of accessions with contrasting phenotypes
We performed RNA-seq on fresh flowers from two mei landraces (MDL and WYY), with three replicates each, collected in April 2016. Samples were immediately frozen in liquid nitrogen after collection and stored at −80 °C. Total RNA was isolated using a modified CTAB protocol42 and used for cDNA library preparation. RNA was first assessed by capillary electrophoresis on an Agilent BioAnalyzer 2100 (Agilent Technologies, Palo Alto, California, USA). Polyadenylated RNA isolated using oligo (dT)-attached beads was fragmented and reverse transcribed to cDNA. Paired-end libraries of 500 bp in length generated from each sample were then sequenced separately. Sequencing was performed on the Illumina HiSeq 2000 platform. A total of 483.58 Mb raw data were produced by transcriptome sequencing of the six samples; details are given in Supplementary Table 12.
All reads were mapped against the mei reference genome using BWA43. SNP calling was then performed following best practices for the GATK44 (v3.1) pipeline, using mainly the module UnifiedGenotyper, followed by quality filtering (VariantFiltration with parameters: QD < 2.0 || FS > 60.0 || MQ < 40.0 || HaplotypeScore > 13.0). For some downstream analyses, we further removed SNPs with missing genotypes in more than 10% of samples or that deviated from Hardy–Weinberg equilibrium. Structural variants (SVs) were identified using BreakDancer45, which cataloged deletions, insertions, inversions, and intra-chromosomal translocations.
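The filtering logic described above can be mirrored in a few lines for illustration. The thresholds are those quoted in the text; the dictionary-based record layout, the use of None for a missing genotype, and the function names are assumptions of this sketch and are not part of GATK.

```python
def fails_hard_filter(ann):
    """True if a SNP matches the filter expression quoted in the text.

    ann: dict of site annotations with keys QD, FS, MQ, HaplotypeScore
    (all assumed to be present for simplicity).
    """
    return (ann["QD"] < 2.0
            or ann["FS"] > 60.0
            or ann["MQ"] < 40.0
            or ann["HaplotypeScore"] > 13.0)


def passes_missingness(genotypes, max_missing=0.10):
    """True if at most 10% of samples have a missing genotype (None)."""
    missing = sum(1 for g in genotypes if g is None)
    return missing / len(genotypes) <= max_missing


good = {"QD": 25.0, "FS": 1.2, "MQ": 58.0, "HaplotypeScore": 0.8}
bad = {"QD": 1.5, "FS": 1.2, "MQ": 58.0, "HaplotypeScore": 0.8}
print(fails_hard_filter(good), fails_hard_filter(bad))
print(passes_missingness([0, 1, 2, None, 0, 0, 1, 2, 0, 1]))
```

A site is retained only when it does not match the filter expression and, for the analyses that require it, also passes the missingness check.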
Genetic distance was estimated using genome-wide SNPs46, where the distance between two individuals i and j was defined as:
$$D_{ij} = \frac{1}{L}\sum_{l=1}^{L} d_{ij}^{(l)}$$
where L represents the length of the SNP region and, at position l, $d_{ij}^{(l)}$ is equal to 0 or 1 if the genotypes of the two individuals are identical or different, respectively, and equal to 0.5 for other scenarios. A matrix of genetic distances was then used to generate a Neighbor-Joining tree using PHYLIP (v3.69)47. Linkage disequilibrium values for wild and cultivated mei, and for populations with particular traits, were calculated using Haploview48 with filtration parameters “-minMAF 0.05 -hwcutoff 0.01.” LD decay was then estimated as the relationship between the distance between each pair of SNPs and their corresponding correlation coefficient (r2) value. The distance at which LD decays to half of its highest value was then calculated. Principal component analysis was performed with the EIGENSOFT49 software using genome-wide SNPs.
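The pairwise genetic distance described in the paragraph above, built from per-site scores of 0, 0.5, and 1 averaged over the SNP positions, can be sketched as follows. The text does not spell out which genotype pairs receive 0.5; this sketch assumes the common convention that genotypes are coded as alternate-allele counts (0, 1, 2), so that identical genotypes score 0, opposite homozygotes score 1, and pairs involving one heterozygote score 0.5. Skipping sites with a missing genotype is a further assumption, and the function name is hypothetical.

```python
def pairwise_distance(geno_i, geno_j):
    """Average per-site distance between two individuals.

    geno_i, geno_j: lists of alternate-allele counts (0, 1, 2) or None
    for a missing call. Sites missing in either individual are skipped.
    """
    total = 0.0
    used = 0
    for gi, gj in zip(geno_i, geno_j):
        if gi is None or gj is None:
            continue
        total += abs(gi - gj) / 2   # yields 0, 0.5 or 1 under this coding
        used += 1
    if used == 0:
        raise ValueError("no shared genotyped sites")
    return total / used


# Per-site scores: 0, 0.5, 1, 0, giving a mean of 0.375.
print(pairwise_distance([0, 1, 2, 2], [0, 0, 0, 2]))
```

Applying this function to every pair of accessions would give the kind of symmetric distance matrix that a neighbor-joining program takes as input.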
Genome assembly and establishment of a core- and pan-genome
Nine mei and three Prunus varieties were subjected to deep sequencing (~70.1×) on the Illumina HiSeq 2000, using a paired-end library and a mate-pair library with insert sizes of 500 bp and 2 kb, respectively. All reads from each sample were processed and assembled using SOAPdenovo50 (v2.04), followed by gap closing using GapCloser (v1.12) from the SOAPdenovo package. Both homology-based and ab initio predictions of protein-coding genes were then performed using the P. persica, P. mume, Fragaria vesca51 and Pyrus × bretschneideri52 gene sets as references, and final gene prediction results for each genome were generated using GLEAN53. The core-genome was established by: (1) aligning all 12 assembled genomes plus the P. persica reference genome to the P. mume reference genome, using the pairwise alignment software NUCmer from the MUMmer package54; (2) filtering alignment results with identity <90%, map_len_min = 100 bp, query_seqs <500 bp and coverage <0.8; and finally (3) extracting and retaining core-genome sequences for nine mei and 13 Prunus species at different ratios.
Specific sequences identification
Presence/absence variations (PAVs) in each genome assembly were identified in a stepwise manner. First, unaligned sequences were collected from the alignments of all mei and Prunus genomes performed during the establishment of the core- and pan-genome. These unaligned sequences were then realigned to the reference genome using BLAT55, and sequences with identity <95% and length >100 bp were extracted. Finally, the sequences obtained from each assembly were aligned to all other assemblies to identify sequences specific to individuals (identity <90%, length >100 bp). To validate the PAVs and identify population-specific PAVs, we mapped all paired-end reads from the 351 resequenced samples and calculated coverage for each PAV. From this set, PAVs covered in more than 90% of the samples with coverage and identity over 90% were excluded. To investigate the population pattern of PAVs, a total of 93 PAVs were selected and their distribution among all samples was displayed as a heatmap using R/pheatmap56. PAVs specific to the P11 subpopulation were then identified, and average coverage was calculated in the genomic vicinity of each PAV. The ggmap57 package in R was used to visualize the distribution of PAVs across these locations.
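The exclusion rule for widely shared PAVs can be sketched as follows. This is an illustrative reading of the sentence above, not the authors' code: the data layout (one `(coverage, identity)` pair per sample, both as fractions) and the function names are assumptions.

```python
def is_widely_shared(per_sample, min_cov=0.90, min_ident=0.90, max_frac=0.90):
    """True if more than 90% of samples cover the PAV with coverage and
    identity both above 90%; such PAVs are excluded."""
    hits = sum(1 for cov, ident in per_sample if cov > min_cov and ident > min_ident)
    return hits / len(per_sample) > max_frac


def filter_pavs(pavs):
    """Drop widely shared PAVs from a dict {pav_id: [(coverage, identity), ...]}."""
    return {pid: s for pid, s in pavs.items() if not is_widely_shared(s)}


# Hypothetical PAVs measured in ten samples
pavs = {
    "pav_shared": [(0.95, 0.99)] * 10,                    # present everywhere
    "pav_variable": [(0.95, 0.99)] * 5 + [(0.10, 0.0)] * 5,  # present in half
}
retained = filter_pavs(pavs)
```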
The phylogenetic tree for mei and other chosen plant genomes was constructed using single-copy orthologous genes identified with the clustering program OrthoMCL12. The generation-time hypothesis, which suggests that shorter generation times could accelerate a species’ molecular clock58, might explain differences in divergence rates among molecular clocks. First, we used MUSCLE59 to align the predicted proteins of single-copy genes. The protein alignments were then back-translated into coding DNA sequences (CDS). Fourfold degenerate sites in each alignment were identified and concatenated into a single supergene for each species. A phylogenetic tree was then constructed using PhyML60 or MrBayes61. The fourfold degenerate sites were used to estimate the neutral substitution rate per year and divergence times among species.
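The extraction of fourfold degenerate sites can be sketched in Python as below. The sketch assumes the standard genetic code, codon-aligned CDS of equal length, and a conservative rule that keeps a codon column only when every species shares the same fourfold degenerate codon prefix. The exact criteria used in the study are not stated, and the function name is hypothetical.

```python
# First two bases of the eight fourfold degenerate codon families in the
# standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_supergene(cds_alignment):
    """Concatenate third-codon bases at fourfold degenerate columns.

    cds_alignment: dict {species: codon-aligned CDS string}.
    A column is kept only if all species share one fourfold prefix and
    each has an unambiguous third base.
    """
    names = list(cds_alignment)
    length = len(cds_alignment[names[0]])
    if length % 3 or any(len(cds_alignment[n]) != length for n in names):
        raise ValueError("sequences must be codon-aligned and of equal length")
    out = {n: [] for n in names}
    for start in range(0, length, 3):
        codons = [cds_alignment[n][start:start + 3].upper() for n in names]
        prefixes = {c[:2] for c in codons}
        if len(prefixes) != 1 or not prefixes <= FOURFOLD_PREFIXES:
            continue  # not a shared fourfold family (or contains gaps)
        if any(c[2] not in "ACGT" for c in codons):
            continue  # gap or ambiguous base at the third position
        for n, c in zip(names, codons):
            out[n].append(c[2])
    return {n: "".join(bases) for n, bases in out.items()}


# Hypothetical alignment: Ala codons, Leu codons, then Met (not fourfold)
aln = {"sp1": "GCTCTAATG", "sp2": "GCCCTGATG", "sp3": "GCACTTATG"}
supergene = fourfold_supergene(aln)
```

Requiring a shared prefix is equivalent to requiring the same amino acid, since each of the eight fourfold families encodes a different one.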
The models “Correlated molecular clock” and “JC69” were chosen to calculate species divergence times using the MCMCTREE program from the PAML62 package. The Markov Chain Monte Carlo (MCMC) process in MCMCTREE was run for 800,000 iterations, with a burn-in of 80,000 iterations. Four independent runs were performed to check convergence.
Genome-wide association study
where μ is the overall mean, X i refers to the subpopulation of individual i, β is the effect of subpopulation, a and d are the additive and dominance effects of SNPs, respectively, ξ i and ζ i are the indicator vectors of the additive and dominance effects of SNPs for individual i, and ɛ i is the residual error. The j-th elements of ξ i and ζ i are defined as
The regression coefficients can be fitted using the maximum likelihood method. The hypothesis that a marker affects the trait can be formulated as
where H0 corresponds to the reduced model and H1 corresponds to the full model. The test statistic is the log-likelihood ratio (LR) of the full model over the reduced model. LR asymptotically follows a χ2 distribution, with degrees of freedom (df) that depend on the models compared. The significance level was adjusted using the Bonferroni method.
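For orientation, the model and test described by the symbol definitions above can be sketched as follows. This is a reconstruction under stated assumptions rather than the authors' exact notation: the genotype coding (AA, Aa, aa) for the indicator variables follows a common additive-dominance parameterization, and the form of the hypotheses is inferred from the description of the reduced and full models.

$$y_i = \mu + X_i\beta + \xi_i a + \zeta_i d + \varepsilon_i$$

$$\xi_{ij} = \begin{cases} 1 & \text{if the genotype at SNP } j \text{ is } AA \\ 0 & \text{if } Aa \\ -1 & \text{if } aa \end{cases} \qquad \zeta_{ij} = \begin{cases} 1 & \text{if the genotype at SNP } j \text{ is } Aa \\ 0 & \text{if } AA \text{ or } aa \end{cases}$$

$$H_0: a = d = 0 \quad \text{vs.} \quad H_1: \text{at least one of } a, d \text{ is nonzero}$$

$$LR = -2\left[\ln L_0(\hat{\mu}, \hat{\beta}) - \ln L_1(\hat{\mu}, \hat{\beta}, \hat{a}, \hat{d})\right]$$

Under this reading, testing a single SNP with both effects gives df = 2, and a Bonferroni-adjusted threshold for K tested SNPs at level α would be α/K.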
Traits analyzed in GWAS may be either discrete or continuous. Discrete traits are categorized as binary, multinomial, or ordinal. Traits such as white versus red flowers, or calyx color determined by pigmentation, are classic unordered categorical variables. This study used three methods based on logistic regression models to detect relationships between genotypes and discrete phenotypes. (1) Generalized linear model (GLM): GLM with a binary logistic link function was used to test associations for binary variables. (2) Multinomial logistic model (MLM): MLM was used when the dependent variable was nominal, with more than two levels and no intrinsic ordering. (3) Ordinal logistic regression (OLR): OLR was used to test associations between markers and traits with multiple ordered levels. For continuous traits, a linear regression model was used to test association.
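A single-SNP association test for a binary trait can be sketched with a small logistic regression fitted by Newton-Raphson and compared against an intercept-only model with a likelihood ratio test. The sketch uses only the Python standard library. It assumes an additive 0/1/2 genotype coding and no covariates for population structure, both simplifications that are not taken from the text. The function names are hypothetical.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _loglik(x, y, b0, b1):
    """Bernoulli log-likelihood of logit P(y=1) = b0 + b1*x."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = _sigmoid(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def fit_logistic(x, y, n_iter=50, tol=1e-10):
    """Newton-Raphson fit of a two-parameter logistic model; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = _sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p              # gradient w.r.t. intercept
            g1 += (yi - p) * xi       # gradient w.r.t. slope
            h00 += w                  # entries of X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break                     # singular information matrix
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


def snp_logistic_lrt(genotypes, trait):
    """Return (LR, p-value) for full (SNP) vs. reduced (intercept-only) model, df = 1."""
    n, n1 = len(trait), sum(trait)
    if n1 == 0 or n1 == n:
        raise ValueError("trait must contain both classes")
    p_null = n1 / n
    ll_null = n1 * math.log(p_null) + (n - n1) * math.log(1.0 - p_null)
    b0, b1 = fit_logistic(genotypes, trait)
    lr = max(0.0, 2.0 * (_loglik(genotypes, trait, b0, b1) - ll_null))
    p_value = math.erfc(math.sqrt(lr / 2.0))  # chi-square survival function, df = 1
    return lr, p_value


# Hypothetical data: genotype (0/1/2) and a binary trait (e.g. 0 = white, 1 = red)
geno = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
color = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0]
lr, p = snp_logistic_lrt(geno, color)
```

Multinomial and ordinal traits need a multi-category likelihood and are not covered by this sketch.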
In addition, two models were used to account for population structure and kinship. The first is the Q model, which adjusts for the population structure of mei varieties; the second is the Q + K model, which corrects for both population structure and kinship, i.e., the most probable identity by state of alleles between varieties. The degree of kinship was calculated through correlation analysis of the marker data. The better-fitting model was chosen according to the Q−Q plot, in conjunction with the inflation factor estimated from the median of the test statistics across all markers. The optimal model for each trait and the Q−Q plots from both models are presented in Supplementary Table 13 and Supplementary Fig. 17.
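The inflation factor mentioned above is commonly computed as the median of the observed test statistics divided by the median of the expected χ2 distribution. The sketch below follows that convention, which is an assumption since the text does not give the formula. The function name is hypothetical, and only df = 1 and df = 2 are supported.

```python
import math
import statistics

# Medians of the chi-square distribution: about 0.4549 for df = 1,
# and exactly 2*ln(2) for df = 2.
CHI2_MEDIAN = {1: 0.4549364231195724, 2: 2.0 * math.log(2.0)}


def inflation_factor(test_statistics, df=1):
    """Genomic inflation factor: median observed statistic / expected median."""
    if df not in CHI2_MEDIAN:
        raise ValueError("only df = 1 or df = 2 are supported in this sketch")
    return statistics.median(test_statistics) / CHI2_MEDIAN[df]


# Hypothetical likelihood-ratio statistics from two candidate models
stats_q_model = [0.2, 0.9, 1.8, 0.7, 3.5]
stats_qk_model = [0.1, 0.45, 0.9, 0.3, 2.0]
lambda_q = inflation_factor(stats_q_model)
lambda_qk = inflation_factor(stats_qk_model)
```

A value close to 1 suggests that the model has adequately absorbed confounding, so the model whose factor is nearer 1 would usually be preferred.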
Functional enrichment analysis
We performed functional enrichment analysis of genes associated with floral traits using the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Using the hypergeometric distribution with all genes in the mei genome as background, the enrichment analysis identifies all terms (GO terms and KEGG pathway IDs) that are significantly enriched in the target genes. The P-value was defined as

$$P = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$

where N is the number of all genes with functional information in the mei genome; n is the number of genes in N associated with specific floral traits; M is the number of all genes annotated to a given functional term or pathway; and m is the number of genes in M associated with specific floral traits. The calculated P-values were Bonferroni-corrected, and functional terms or pathways with a corrected P-value ≤ 0.05 were defined as significantly enriched.
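The hypergeometric enrichment test and the Bonferroni correction can be sketched in Python using exact integer arithmetic. The symbols N, n, M and m follow the definitions above. The function names and the example counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(N, n, M, m):
    """Upper-tail hypergeometric probability P(X >= m).

    N: annotated genes in the genome; n: trait-associated genes among them;
    M: genes annotated to the term; m: trait-associated genes in the term.
    """
    upper = min(n, M)
    # comb() returns 0 when n - i exceeds N - M, so impossible draws drop out.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / comb(N, n)


def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


# Hypothetical toy example: 10 genes, 3 associated, term of 4 genes, 2 overlap
p_toy = enrichment_pvalue(N=10, n=3, M=4, m=2)
corrected = bonferroni([p_toy, 0.001, 0.6])
significant = [p <= 0.05 for p in corrected]
```

Summing the upper tail directly is equivalent to one minus the cumulative sum from 0 to m − 1 and avoids subtracting two nearly equal numbers.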
Computer code used to perform population genomics and GWAS is available from the corresponding authors upon request.
The sequence data of mei genome resequencing involved in this study have been deposited in NCBI with the accession number SRP093801 (BioProject: PRJNA352648). All other relevant data supporting the findings of the study are available in this article and its Supplementary files, or from the corresponding authors upon request.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The research was funded by the National Natural Science Foundation of China (Grant No. 31471906), the Fundamental Research Funds for the Central Universities (No. 2016ZCQ02), the Special Fund for Beijing Common Construction Project, and the National High Technology Research and Development Program of China (2013AA102607).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.