Main

Accurate sequence information, genome assemblies and annotations are the foundation for genetic and genome-wide studies. The major factors that limit de novo genome assembly are heterozygosity and repetitive sequences, such as TEs, which are often collapsed to single copies in draft genomes1. In recent years, however, evidence supporting the importance of TEs in genome evolution, genome structure, regulation of gene expression and epigenetics has been mounting2,3,4,5. The characterization of sequences and the distribution of TEs within a genome is, therefore, of great importance.

Until now, the study of epigenetically controlled characteristics in perennial plants has been hampered by the draft status of their genome sequences. In the case of apple, a draft was produced6 but remained incomplete with inaccurate contig positions7; this hindered its utility for genetic and epigenetic studies. De novo sequencing and assembly of a new genome for apple, using third-generation sequencing technologies, had thus become a necessity.

In the last few years, single-molecule sequencing and optical-mapping technologies have emerged8, which are well suited for assembling genomic regions that contain long repetitive elements. Recently, several high-quality genome assemblies have been published using one or both technologies9,10,11,12,13,14. The use of long-read sequencing technologies may also tackle potential assembly issues that are related to the presence of highly similar sequences resulting from whole-genome duplication events that frequently occurred in angiosperm genomes15.

In addition to DNA sequence modifications, it has been shown that epigenetic variations contribute to genome accessibility, functionality and structure16,17. Several studies have demonstrated that local DNA methylation variants, which are represented by differential cytosine methylation at particular loci, can have major effects on the transcription of nearby genes and can be inherited over generations18,19,20.

Apple, like most other fruit tree crops, is propagated by grafting onto rootstocks, which over time can allow the acquisition and propagation of epimutations, via variation in DNA methylation states that can influence various phenotypes, such as fruit color21,22. Thus, knowledge of the epigenetic landscape of apple cultivars could provide new tools to study somatic variants, leading to the development of epigenetic markers for marker-assisted selection.

To produce a high-quality apple reference genome and methylome, we generated a de novo assembly of a 'Golden Delicious' doubled-haploid tree (GDDH13) composed of 280 assembled scaffolds and arranged into 17 pseudomolecules, which represent the 17 chromosomes of apple. This assembly resulted from a combination of short (Illumina) and long sequencing reads (PacBio), along with scaffolding based on optical maps (BioNano) and a high-density integrated genetic linkage map23. This chromosome-scale assembly was complemented by a detailed de novo annotation of genes based on RNA sequencing (RNA-seq) data, TE annotation and small RNA alignments.

To understand the potential role of epigenetic marks on fruit development, we constructed genome-wide DNA methylation maps that compared different tissues and two isogenic apple lines that produce large or small fruits. This led to the identification of differential DNA methylation patterns that are associated with genes involved in fruit development.

This work provides a solid foundation for future genetic and epigenomic studies in apple. Furthermore, our TE annotation provides novel insights into the evolutionary history of apple and may contribute to explaining its divergence from pear.

Results

Genome sequencing, assembly and scaffolding

The doubled-haploid Golden Delicious line (GDDH13, also coded X9273) used in this study is the result of breeding efforts that were initiated at INRA in 1963 (ref. 24) (Supplementary Fig. 1 and Online Methods). Homozygosity of this line was confirmed with microsatellite markers that are distributed along the apple genome (data not shown) and by observation of the k-mer spectrum of Illumina reads derived from GDDH13 (Fig. 1a and Supplementary Note).

Figure 1: Assembly and validation of the GDDH13 doubled-haploid apple genome.

(a) k-mer (23 bp) spectra of the doubled-haploid GDDH13 and the heterozygous Golden Delicious33 genomes. The x axis represents k-mer multiplicity, and the y axis represents the number of k-mers with a given multiplicity in the sequencing data. The green dashed line represents the ideal Poisson distribution fitted on the data of GDDH13. (b) Overview of the processing pipeline used for the assembly of the GDDH13 genome (see Supplementary Note for details). (c) Graphical representation of the location of SNP markers on the physical map (x axis), as compared to their position on the integrated genetic map (y axis), for Chr11 of the GDDH13 genome. Each marker is depicted as a circle on the plot (1,069 data points). The colors depict the chromosomes as follows: red for Chr01, light green for Chr04, pink for Chr08, blue for Chr10 and violet for Chr11. (d) Graphical representation of the mean local recombination rates between successive SNP markers along Chr11 (3-Mb sliding window, 1-Mb shift, threshold 4). The x axis represents the physical positions of the means on Chr11, and the y axis indicates the recombination ratio (centiMorgan (cM)/Mb) in each 3-Mb sliding window. (e) Heat map of genotypic linkage disequilibrium (LD; r²) in Chr11 in the 'Old Dessert' INRA apple core collection. Shown is the graphical representation of the location of SNPs on the physical map (top) with correspondence to their order in a regular distribution (bottom) of Chr11 (1,461,195 data points). The color bar indicates the level of LD, from high LD (red) to low LD (blue).

To perform de novo assembly of the GDDH13 genome, we combined three different technologies: short-read sequencing, long-read sequencing and optical mapping (Fig. 1b). Using DNA from the leaves of GDDH13, we generated 120-fold coverage of Illumina paired-end reads (72 Gb), 80-fold coverage of Illumina Nextera mate-pair reads (58 Gb) at three different insert sizes (2, 5 and 10 kb) and 35-fold coverage of PacBio sequencing data (24 Gb; 2,837,045 subreads with a mean length of 8,474 bp). The Illumina paired-end reads were first assembled using SOAPdenovo25, and the resulting contigs were combined with the PacBio reads using the DBG2OLC assembler26. This resulted in an assembly that consisted of 2,150 contigs with an N50 of 620 kb (i.e., 50% of the assembly was contained in contigs ≥620 kb in size) (Supplementary Table 1) and a total length of 625.2 Mb, which were subsequently corrected by using the Illumina paired-end reads (94,896 single-base assembly errors corrected; 1,054,709 insertions (1,466,015 bp) and 123,510 deletions (178,733 bp)) and scaffolded by using Illumina mate-pair reads with BESST (assembly N50 increased from 620 kb to 699 kb).
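The N50 statistic quoted above can be illustrated with a short calculation. The following is a minimal sketch using made-up toy contig lengths, not the GDDH13 data; the function name `n50` is introduced here for illustration only.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    The N50 is the largest length L such that sequences of length >= L
    together contain at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("cannot compute N50 of an empty assembly")


# Toy contig lengths (arbitrary units), chosen only to show the calculation.
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))
```

Because the statistic depends only on the sorted lengths, the same function applies to contigs, scaffolds or optical maps.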

Next, using a 600-fold-coverage BioNano optical map, we generated a consensus map that resulted in an assembly of 649.7 Mb. This consensus map was then used for the hybrid assembly with the corrected scaffolds, which, together with single-nucleotide polymorphism (SNP) markers derived from a high-density genetic linkage map23, allowed the construction of the 17 pseudochromosomes (Supplementary Table 2 and Supplementary Note). To estimate the genome size, we calculated different k-mer frequency distributions of the Illumina reads. The estimated GDDH13 genome size of 651 Mb was very close to the 649.7-Mb size in the consensus map.
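A common way to estimate genome size from a k-mer frequency distribution is to divide the total number of k-mers by the multiplicity of the main peak. The sketch below illustrates that generic idea on a toy histogram; it is not the procedure detailed in the Supplementary Note, and the `min_multiplicity` cutoff used to discard likely sequencing-error k-mers is an assumption made here.

```python
def estimate_genome_size(histogram, min_multiplicity=5):
    """Estimate genome size from a k-mer multiplicity histogram.

    histogram maps multiplicity -> number of distinct k-mers seen that
    many times. K-mers below min_multiplicity are treated as sequencing
    errors and ignored (an assumption of this sketch).
    """
    solid = {m: c for m, c in histogram.items() if m >= min_multiplicity}
    if not solid:
        raise ValueError("no k-mers above the error cutoff")
    # Multiplicity of the main peak, taken as the sequencing depth of
    # single-copy k-mers in a homozygous genome.
    peak = max(solid, key=solid.get)
    total_kmers = sum(m * c for m, c in solid.items())
    return total_kmers / peak


# Toy histogram: an error tail at multiplicity 1-2, a single-copy peak
# around 20, and a few two-copy k-mers at 40.
toy_histogram = {1: 1000, 2: 200, 19: 50, 20: 100, 21: 50, 40: 10}
print(estimate_genome_size(toy_histogram))
```

In a doubled haploid, the spectrum is expected to show a single main peak, which is what makes this simple estimate reasonable; a heterozygous genome would show an additional peak at half the depth.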

Assessment of genome quality

We assessed the quality of the assembly by using two independent sources of data. First, we used the SNP markers that were mapped on the previously mentioned integrated genetic linkage map to validate scaffold assembly. Of the 15,417 SNP probe sequences, we identified sequence homology in the GDDH13 genome for 14,732 of them. We then assessed their position on the scaffold assemblies by comparing their location on the integrated genetic linkage map. In total, 14,117 of the mapped markers (95.8%) were found to be located at their expected positions (Supplementary Note). To visualize these data, we plotted the genetic distance against the physical distance of the genetic markers for each chromosome (Supplementary Fig. 2); the data for chromosome (Chr) 11 are shown as an example in Figure 1c. This analysis showed that there was very little discrepancy between the physical and genetic maps. For comparison, we plotted these markers against the heterozygous apple genome (v 1.0; Supplementary Fig. 3). We also plotted the recombination rates in sliding windows of 3 Mb on this chromosome (Fig. 1d) and identified a decrease in recombination frequency toward the middle of Chr11.
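The sliding-window recombination rate can be sketched as follows. This is a minimal illustration with toy marker coordinates, assuming that each local rate (cM/Mb between successive markers) is assigned to the midpoint of its marker pair. The "threshold 4" mentioned in the caption of Figure 1d is not modeled because its meaning is not stated in this section, and the function names are hypothetical.

```python
def local_rates(markers):
    """Rates between successive markers.

    markers: list of (position_bp, position_cM), sorted by position_bp.
    Returns a list of (midpoint_bp, rate_cM_per_Mb).
    """
    rates = []
    for (p1, g1), (p2, g2) in zip(markers, markers[1:]):
        if p2 > p1:  # skip markers that share a physical position
            rates.append(((p1 + p2) / 2, (g2 - g1) / ((p2 - p1) / 1e6)))
    return rates


def windowed_mean_rates(markers, window_bp=3_000_000, step_bp=1_000_000):
    """Mean local rate in each sliding window (3-Mb window, 1-Mb shift)."""
    rates = local_rates(markers)
    if not rates:
        return []
    last_mid = max(mid for mid, _ in rates)
    means = []
    start = 0
    while start <= last_mid:
        in_window = [r for mid, r in rates if start <= mid < start + window_bp]
        if in_window:
            center = start + window_bp / 2
            means.append((center, sum(in_window) / len(in_window)))
        start += step_bp
    return means


# Toy markers: (physical position in bp, genetic position in cM).
toy_markers = [(0, 0.0), (1_000_000, 2.0), (2_000_000, 3.0), (4_000_000, 3.5)]
for center, rate in windowed_mean_rates(toy_markers):
    print(f"{center / 1e6:.1f} Mb: {rate:.3f} cM/Mb")
```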

Second, we estimated the level of linkage disequilibrium (LD) using the r² parameter for all pairwise SNP comparisons by using marker data that were derived from an apple core collection27,28. In the present version of the GDDH13 genome, we did not identify any abrupt jumps in LD, indicating the overall robustness of the assembly (Fig. 1e and Supplementary Fig. 4). Using previously published genetic data29, we generated a haplotype map for GDDH13, which allowed the identification of recombination breakpoints (Supplementary Fig. 5).
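One common definition of genotypic LD is the squared Pearson correlation between genotype dosages (0, 1 or 2 copies of an allele) at two SNPs. The sketch below implements that definition on toy genotypes. It is an illustrative assumption and not necessarily the exact estimator used by the snpStats package cited in the Methods.

```python
def genotypic_r2(dosages_a, dosages_b):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(dosages_a)
    if n == 0 or n != len(dosages_b):
        raise ValueError("dosage vectors must be non-empty and equal in length")
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # monomorphic marker: r2 is undefined
    return cov * cov / (var_a * var_b)


# Toy genotypes for four accessions at three SNPs.
snp1 = [0, 1, 2, 1]
snp2 = [0, 1, 2, 1]   # identical to snp1: complete LD
snp3 = [1, 2, 1, 0]   # uncorrelated with snp1 in this toy example
print(genotypic_r2(snp1, snp2), genotypic_r2(snp1, snp3))
```

An abrupt drop in r² between physically adjacent markers, followed by high r² with distant markers, is the kind of signal that would point to a misplaced scaffold.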

Finally, the completeness of the assembly was tested by searching for 248 core eukaryotic genes30 (CEGs). In total, 237 of 248 CEGs were completely present, and 7 CEGs were partially present, indicating that fewer than 2% of the CEGs could not be detected, which compared very favorably with other assemblies31.

Genome annotation

To obtain a global view of the apple transcriptome, we performed a high-throughput RNA-seq analysis on poly(A)-enriched RNAs from nine libraries that originated from different genotypes and tissues. RNA-seq reads were assembled, and the resulting contigs were mapped to the scaffolds and integrated in the EuGene combiner pipeline32. In total, we identified 42,140 protein-coding genes (which represent 23.3% of the genome assembly) and 1,965 non-protein-coding genes (Supplementary Table 2 and Supplementary Note). Evidence of transcription was found for 93% of the annotated genes.

To further evaluate the quality of the annotation, a comparison with annotations of previous apple genome assemblies6,33 was performed using the BUSCO v2 method, which is based on a benchmark of 1,440 conserved plant genes34. The results indicate that our apple genome annotation is the most complete, despite having the lowest number of predicted genes (Table 1).

Table 1 Comparison of the GDDH13 genome with previously published assemblies of the apple genome

The de novo annotated genes were named using the following convention: MD (for Malus domestica) followed by the chromosome number and gene number on the chromosome (in steps of 100) going from top to bottom according to the linkage map, for example, MD13G0052100.
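The naming convention can be made concrete with a small parser and formatter. This is a sketch only: the field widths (two digits for the chromosome, seven for the gene number) and the letter G separating them are inferred from the single example MD13G0052100, and the helper names are hypothetical.

```python
import re

# Pattern inferred from the example identifier MD13G0052100 (an assumption).
GENE_ID = re.compile(r"^MD(\d{2})G(\d{7})$")


def parse_gene_id(gene_id):
    """Split an identifier into (chromosome number, gene number)."""
    match = GENE_ID.match(gene_id)
    if match is None:
        raise ValueError(f"not a recognised gene identifier: {gene_id!r}")
    return int(match.group(1)), int(match.group(2))


def format_gene_id(chromosome, rank):
    """Build an identifier for the rank-th gene (1-based) on a chromosome.

    Gene numbers advance in steps of 100, which leaves room to insert
    newly annotated genes between existing ones.
    """
    return f"MD{chromosome:02d}G{rank * 100:07d}"


print(parse_gene_id("MD13G0052100"))
print(format_gene_id(13, 521))
```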

Previously published small RNA (sRNA) data35 were also mapped to the genome. We found that most 21- and 22-nt-long sRNAs mapped to protein-coding genes, whereas most 24-nt-long sRNAs mapped to TEs. The 23-nt-long sRNAs were evenly distributed between both types of genomic features (Supplementary Fig. 6).

Ancestral genome duplication

Intragenomic synteny of GDDH13 was assessed using SynMap (CoGe; http://www.genomevolution.org) and visualized with Circos36. Results of this analysis (Fig. 2) showed an even clearer genome duplication pattern than has previously been reported6. Only very few regions showed no synteny to other parts of the genome (for example, the middle part of Chr04).

Figure 2: Synteny and distribution of genomic and epigenomic features of the apple genome.

The rings indicate (from outside to inside, as indicated in the inset) chromosomes (Chr), heat maps representing gene density (green), TE density (blue) and DNA methylation levels (orange). A map connecting homologous regions of the apple genome is shown inside the figure. The colored lines link collinearity blocks that represent syntenic regions that were identified by SynMap.

Transposable elements and annotation of repeat sequences

To produce a genome-wide annotation of repetitive sequences, TE consensus sequences (provided by the TEdenovo detection pipeline37) were used to annotate their copies in the whole genome. To refine this annotation, we performed two iterations of the TEannot pipeline. In the GDDH13 genome, TEs represented 372.2 Mb (57.3% of the 649.7 Mb BioNano assembly; Supplementary Table 2). Excluding undefined bases (Ns), the TE content of the total nucleotide space in the final annotation was 59.5% of the assembly. The most abundant repeats in this genome are retrotransposons or class I elements (74.8% of TE content, 42.9% of genome assembly), and in particular long terminal repeat retrotransposons (LTR-RTs), which represent 66% of this type of repeat, whereas non-LTR retrotransposons (LINE and SINE) accounted for 7% (Fig. 3a and Supplementary Table 2). DNA transposons or class II elements (DNA transposons and Helitrons) make up 23% of the TE content (13.4% of the genome assembly; Fig. 3a and Supplementary Table 2). A complete list of identified TEs, their integrity and copy number can be found in Supplementary Table 3.

Figure 3: Distribution and evolution of transposable elements in the apple genome.

(a) Percentage of base pairs of the assembled GDDH13 genome that represent genes, pseudo-genes, TEs and non-annotated regions. Retrotransposons (class I) are shown in shades of red, and DNA transposons (class II) are shown in shades of blue. (b) Chromosomal density plots of all TE families on Chr01 to Chr04 (top), and the recombination rate for each corresponding chromosome (3-Mb sliding window) (bottom). (c) Distribution of sequence identity values between genomic copies and consensus repeats in the GDDH13 assembly (based on 2,198,722 data points). The relative frequencies per percentage of identity of the Helitron, TIR, LTR, LINE, SINE and unclassified TEs (NoCat) are represented in different colors.

We ran the REPET38 pipeline on the PacBio contigs, which allowed us to identify an additional hyper-repetitive consensus sequence (GenBank entry KX869746). This consensus sequence was automatically classified as a 9,716-bp LTR-RT with over 500 full-length copies, and it accounted for 3.6% of the genome assembly (22.3 Mb). We termed this TE consensus sequence HODOR (high-copy Golden Delicious repeat). At the chromosomal level, a higher density of HODOR copies coincided with particular regions of each chromosome that show reduced recombination levels, whereas the density level of other TEs remained constant or was decreased at these same regions (Fig. 3b and Supplementary Fig. 7). Even though the retrotransposon consensus sequence has clear 5′ and 3′ LTRs that are 1.8 kb in size, there are no homologies with typical TE-related sequences encoding a gag protein, a reverse transcriptase or an integrase. However, we found partial sequence similarity to the Malus domestica Copia-100 element present in RepBase Update39, corresponding to different domains such as gag pre-integrase, RNase H and integrase. These results suggest that HODOR is a non-autonomous LTR retrotransposon derivative or LARD (large retrotransposon derivative). We scanned the genome and were able to identify TEs that could contribute to the mobilization of HODOR (Supplementary Table 3 and Supplementary Note). Notably, we also found significant (BLASTX e-values ≤ 1 × 10^−29) similarities with sequences encoding three short bacterial proteins of unknown function (Supplementary Fig. 8a), and mining of transcriptome data35 showed HODOR to be primarily transcribed in the sense and antisense orientations in apple seeds (Supplementary Fig. 8b).

To investigate the evolutionary history of TEs in the apple genome, we plotted the distribution of identity values between genomic copies and their consensus sequences (Fig. 3c). Distributions for all classes of repeats showed a peak at 77% identity. By considering the mutation rate that has been reported for LTR-RTs in plants (1.3 × 10^−8 base substitutions per site per year40,41), we estimated the age of those insertions as described by the International Human Genome Sequencing Consortium42. We concluded that the peak at 77% identity corresponded to an insertion age of around 21 million years ago (Mya) (Fig. 3c). We also noted a second peak, particularly for LINE elements, at 98% identity that corresponded to a TE burst at 1.6 Mya (Fig. 3c).
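The conversion from sequence identity to insertion age can be sketched as below. The sketch assumes that the observed divergence between a copy and its consensus is corrected for multiple substitutions with the Jukes-Cantor model and then divided by the substitution rate. This correction is an assumption made here, not a detail stated in the text, although under it the two peaks come out at roughly 21 and 1.6 million years.

```python
import math

# Substitution rate quoted in the text for LTR-RTs in plants
# (base substitutions per site per year).
RATE = 1.3e-8


def insertion_age_years(identity, rate=RATE):
    """Estimate insertion age from copy-to-consensus identity (0-1).

    Applies the Jukes-Cantor correction K = -3/4 * ln(1 - 4p/3) to the
    observed divergence p = 1 - identity, then returns K / rate.
    """
    p = 1.0 - identity
    if not 0.0 <= p < 0.75:
        raise ValueError("identity must be in the interval (0.25, 1]")
    k = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
    return k / rate


for identity in (0.77, 0.98):
    age_my = insertion_age_years(identity) / 1e6
    print(f"{identity:.0%} identity: about {age_my:.1f} million years")
```

Without the correction, 77% identity would give 0.23 / (1.3 × 10^−8), or about 17.7 million years, so the choice of substitution model matters for the older peak much more than for the recent one.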

The apple methylome

To investigate the apple methylome, we produced genome-wide maps of DNA methylation content at single-base resolution for GDDH13 leaves and young fruits43,44.

Globally, in leaves we found DNA methylation levels of 49%, 39% and 12% in the CG, CHG and CHH sequence contexts (where H is adenine, thymine or cytosine), respectively (Fig. 4a). DNA methylation was not evenly spread throughout the chromosomes (Fig. 4b shows the profile for Chr11; see Supplementary Fig. 9 for the profiles for all of the chromosomes), and peaks of methylation coincided with recombination cold spots.
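The three sequence contexts and the per-context methylation percentage can be illustrated with a short sketch. It classifies a cytosine on the forward strand from the two bases that follow it, and counts a cytosine as methylated when its methylation ratio is at least 0.75, the threshold given in the caption of Figure 4a. The function names and toy values are hypothetical.

```python
def cytosine_context(sequence, index):
    """Return 'CG', 'CHG' or 'CHH' for the cytosine at sequence[index].

    H stands for A, T or C. Returns None when the context cannot be
    determined (end of sequence or an ambiguous base such as N).
    """
    sequence = sequence.upper()
    if sequence[index] != "C":
        raise ValueError("position is not a cytosine")
    following = sequence[index + 1:index + 3]
    if len(following) >= 1 and following[0] == "G":
        return "CG"
    if len(following) == 2 and following[0] in "ATC":
        if following[1] == "G":
            return "CHG"
        if following[1] in "ATC":
            return "CHH"
    return None


def percent_methylated(ratios, threshold=0.75):
    """Percentage of cytosines whose methylation ratio is >= threshold."""
    if not ratios:
        raise ValueError("no cytosines to summarise")
    return 100.0 * sum(r >= threshold for r in ratios) / len(ratios)


print(cytosine_context("ACGT", 1), cytosine_context("ACAGT", 1),
      cytosine_context("ACATT", 1))
print(percent_methylated([0.9, 0.8, 0.1, 0.0]))
```

Cytosines on the reverse strand would be classified the same way after reverse-complementing the sequence.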

Figure 4: DNA methylation landscape of the GDDH13 genome.

(a) Percentage of DNA methylation distributions of the three methylation contexts (CG, CHG or CHH) in Arabidopsis44, soybean60 and apple. For apple, the percentages were estimated based on the number of cytosines that had a methylation ratio ≥0.75. (b) Top, chromosomal distribution of the methylation ratios along Chr11. Bottom, the recombination rate plot from Figure 1d, for comparison purposes. (c) Global distribution of DNA methylation levels at protein-coding genes, TEs and HODOR, including a 1-kb window upstream of the TSS and downstream of the transcription end site (TES). In all of the panels, the DNA methylation sequence contexts are color-coded as follows: brown for CG, yellow for CHG and blue for CHH.

As expected45,46, there are reduced overall DNA methylation levels in gene sequences, whereas TEs are extensively methylated (Fig. 4c). For genes, we identified three major types of DNA methylation patterns. Genes in cluster 1 were characterized by high levels of DNA methylation in the gene body in the CG and CHG contexts, which was concomitant with high DNA methylation in the surrounding regions. Genes in cluster 2 had low CG, and very low CHG and CHH, methylation in the gene itself, yet there were increased levels in the surrounding region. Finally, genes in cluster 3 featured low DNA methylation levels in both the gene body and in the surrounding regions (Supplementary Fig. 10). This last cluster contained the largest number of genes (27,179; 64.5% of all genes), showing that in apple, genes are generally depleted for DNA methylation. By mining previously produced large transcriptome data sets for apple35, we found that genes covered with very high levels of DNA methylation (cluster 1) showed the lowest expression levels (1.58 median log2 value), whereas cluster 2 and cluster 3 genes had higher log2 values (3.3 and 2.8, respectively). This result confirmed that the amount of DNA methylation surrounding genes influences their expression level. As one example of TEs, we assessed the DNA methylation levels for HODOR and found that HODOR was almost completely methylated in the CG (90% methylated) and CHG (65% methylated) contexts but that it had much less methylation in the CHH context (3%) (Fig. 4c).

DNA methylation and fruit development

To assess how DNA methylation contributes to fruit development, we first compared DNA methylation levels between leaves and fruits. We called differentially methylated regions (DMRs) using a hidden Markov model (HMM)-based approach47. In total, we identified 1,017 high-confidence DMRs in all contexts between leaves and fruits, and we observed a very strong bias for DMRs containing methylation changes in the CHH context (875 DMRs; 86.0%) (Fig. 5a). We identified 294 genes that contained DMRs in their promoter region—14 DMRs were in the CHG context and showed increased amounts of DNA methylation in leaves, whereas the remaining 280 DMRs were found in the CHH context and showed increased amounts of DNA methylation in fruits. Thus, most methylation differences between leaves and fruits occurred at CHH sites, with a robust increase observed in the developing fruit. Among genes with DMRs that were 2 kb upstream of their transcription start site (TSS), we identified several apple orthologs of Arabidopsis genes with important roles in flower and fruit development and in epigenetic regulation (Fig. 5b).
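Assigning a DMR to a promoter requires a strand-aware definition of the region 2 kb upstream of the TSS. The sketch below uses 1-based inclusive coordinates and counts any overlap between the DMR and the upstream window. Whether the original analysis required overlap or full containment is not stated here, so the overlap rule is an assumption, as are the function names.

```python
def promoter_interval(tss, strand, upstream=2000):
    """Interval covering `upstream` bp upstream of the TSS (1-based, inclusive)."""
    if strand == "+":
        return max(1, tss - upstream), tss - 1
    if strand == "-":
        # On the minus strand, upstream lies at higher coordinates.
        return tss + 1, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def dmr_in_promoter(dmr_start, dmr_end, tss, strand, upstream=2000):
    """True if the DMR overlaps the upstream window of the gene."""
    start, end = promoter_interval(tss, strand, upstream)
    return dmr_start <= end and dmr_end >= start


# Toy example: a gene with its TSS at position 10,000.
print(dmr_in_promoter(9000, 9100, 10000, "+"))    # upstream on the plus strand
print(dmr_in_promoter(10500, 10600, 10000, "+"))  # inside the gene body
print(dmr_in_promoter(10500, 10600, 10000, "-"))  # upstream on the minus strand
```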

Figure 5: Differentially methylated regions between apple tree leaves and young fruits.

(a) DMR content in samples of GDDH13 leaves and young fruits (CHH, n = 875 DMRs; CHG, n = 88 DMRs; CG, n = 21 DMRs; CG and CHG, n = 17 DMRs; CHG and CHH, n = 14 DMRs; CG and CHH, n = 2 DMRs). Most of the DMRs (86%) were identified in the CHH context. (b) Selection of GDDH13 genes that present a DMR within a region 2 kb upstream of the TSS. The apple gene ID, the methylation context of the DMR, the orthologous Arabidopsis gene annotation and the function of the encoded protein are listed. (c,d) Representative image comparing the fruit sizes of heterozygous Golden Delicious, GDDH13 and GDDH18 at harvest (c) and quantification of the number of cell layers in the parenchyma of GDDH13 (orange) and GDDH18 (green) fruits, as assessed by microscopy (n = 12 data points per box plot) (d). The horizontal line in the box represents the median, the lower and upper hinges correspond to the first and third quartiles, the lower and upper whiskers extend from the hinge to the smallest and largest value (no further than 1.5-fold the inter-quartile range from the hinge), and outlying points are plotted individually. Scale bar, 1 cm.

Next we wanted to test whether DNA methylation could have a role in the regulation of fruit size. We took advantage of GDDH18, an isogenic line that was obtained from the same haploid that produced GDDH13 (Supplementary Note). Whole-genome sequencing showed the presence of 27 homozygous SNPs within genes between the two trees, with nine of these SNPs resulting in amino acid changes (Supplementary Table 4). Although the GDDH13 and GDDH18 trees were indistinguishable, the GDDH18 fruits were much smaller (Fig. 5c) because of a reduced number of cell layers in the parenchyma (Fig. 5d).

To elucidate whether the difference in fruit size could have an epigenetic basis, whole-genome bisulfite sequencing was performed on samples that were collected at 3 d before pollination (or –3 d after pollination (DAP); when fruits have a similar size and number of cell layers) and at 9 DAP (a few days before observing significant phenotypic differences between the fruits). As expected from their common origin, only a limited number of high-confidence DMRs (n = 197) could be found between young fruits of GDDH13 and GDDH18 at –3 DAP. Of these, 47 DMRs were located within 2 kb upstream of the TSS of genes. Similarly, we identified a total of 148 high-confidence DMRs between fruits of GDDH13 and GDDH18 at 9 DAP. From this analysis, we found that 53 genes contained DMRs in their promoter region (i.e., within 2 kb upstream of the TSS). At both time points a majority of genes with DMRs showed a decrease in methylation in their promoter region for GDDH18 (Supplementary Table 5). Notably, in both comparisons, DMRs in the CG–CHG and CHG contexts were over-represented.

The overlap of DMRs between the two time points analyzed included 22 genes with DMRs in their promoter regions, with most of them (n = 17) showing lower methylation in GDDH18 (Supplementary Table 5). Several of the 22 genes have orthologs in other species with a role that could explain the observed size difference between the GDDH13 and GDDH18 fruits—including SQUAMOSA PROMOTER-BINDING PROTEIN LIKE 13 (SPL13, MD16G0108400), 1-AMINO-CYCLOPROPANE-1-CARBOXYLATE SYNTHASE 8 (ACS8, MD15G0127800) and CYTOCHROME P450 FAMILY 71 SUBFAMILY A POLYPEPTIDE 25 (CYP71A25, MD14G0147300), which belong to the minority of genes with increased methylation in GDDH18.

Discussion

As a prerequisite to epigenomic studies in apple, we decided to produce a high-quality reference genome for apple. An advantage for us was the availability of the homozygous GDDH13 doubled-haploid line. Assembling a genome that is both highly heterozygous and recently duplicated into a haploid consensus sequence presents a substantial challenge. This is exemplified by the comparison of our first assembly steps to a recently published report on a heterozygous Golden Delicious apple genome sequence33. Following hybrid assembly of PacBio and Illumina reads, Li and colleagues33 reported an N50 of 112 kb, whereas we obtained an N50 of 620 kb at that same step. These results highlight the power of haploids or doubled haploids in genome sequencing projects48, particularly in those for apple, which is not only highly heterozygous but has also undergone a recent whole-genome duplication (ref. 6 and this study). The optical mapping then allowed us to produce scaffolds with an N50 of 5.5 Mb, which, in association with a high-density integrated linkage map, yielded highly contiguous pseudomolecules. In this new apple genome, we followed a newer convention23 in which the orientation of Chr10 and Chr05 became aligned by the inversion of Chr05. We chose to invert Chr05 because it is the less frequently reported of the two in previous genetic studies on quantitative trait loci (QTL), gene discovery and characterization.

We estimated the genome size of GDDH13 to be 651 Mb (Supplementary Table 2), which suggested that the GDDH13 genome may be smaller than that of the heterozygous Golden Delicious line, which was recently estimated to be 701 Mb (ref. 33). Although the GDDH13 tree looks similar to the heterozygous Golden Delicious counterpart (including tree architecture, flowering time and fruit appearance; Supplementary Fig. 1), it is possible that through the consecutive steps of selfing, haploid development and chromosome doubling, some minor parts of the genome might have been lost or re-arranged. Thus, it is possible that some of the genome sequence might be missing in the GDDH13 assembly.

Our gene prediction analysis reduced the estimated number of annotated genes in apple from 63,541 (Genome Database for Rosaceae, see URLs and ref. 6) to 42,140, which is much closer to the 42,812 genes that have been reported for pear49 and the 45,293 genes that were identified after filtering out overlapping genes from the original apple genome annotation49 (Supplementary Note).

TEs also have an important role in structuring genomes. The in-depth TE annotation we performed showed a major TE burst in apple that we estimated to have happened around 21 Mya. This affected all types of TEs, suggesting that the precursor of the modern apple underwent environmental changes with resulting stresses that led to the activation of these TEs50. The observed TE burst corresponds to the Miocene epoch (23 Mya to 5 Mya) and may coincide with two events: the divergence between pear and apple48 and an uplift event occurring at the Tian Shan mountains51, which cover the region where the ancestor of the apple originates from52. We hypothesize that these TE bursts, which presumably must have been very different in the predecessor of pear and apple, have contributed to the diversification, and possibly even speciation, of these plants.

Although our analyses using previously reported approaches53 did not identify any characteristic short centromeric repeat sequence in the apple genome, we can hypothesize the putative localization of centromeres on the GDDH13 chromosomes. We found that the regions in which we observed a decrease in the recombination rate between successive markers of the integrated linkage map coincided with the regions that showed an increase in the estimated level of LD in the core apple collection, as well as an increase in DNA methylation levels. In addition, we identified HODOR, the most repetitive consensus sequence in the apple genome, as being over-represented in these same genomic regions. These findings suggest that centromeric regions in the GDDH13 genome may be located within the regions that show an over-representation of HODOR. Future studies will show whether HODOR has a role in the centromere structure in the apple genome. BLAST searches have revealed that the HODOR sequence also exists in pear, and because it may have originated from horizontal gene transfer events, it will be of great interest to investigate when HODOR first appeared during the Rosaceae evolution.

The genome-wide distribution of DNA methylation peaked in putative centromeric regions of high LD and high HODOR content. As has been observed in Arabidopsis43, TEs were enriched and genes strongly depleted for DNA methylation. The 10% of genes that possess high levels of DNA methylation (gene body and surrounding region; Supplementary Fig. 10) globally showed a very low level of transcription, and these genes may be expressed during very specific developmental stages or in specific tissues. The comparison of the apple leaf and fruit methylomes revealed a noteworthy pattern: the fruit globally had higher CHH DNA methylation levels, which suggested increased activity of the RNA-directed DNA methylation machinery in this organ54. Consistent with this observation, it has been shown for Arabidopsis that cell-type-specific DNA methylation differences mainly occur at CHH sites55. Notably, DNA methylation differences in the CHH context between leaf and fruit tissues occurred next to 294 genes. Several of these were found to be orthologous to genes that are known to be important regulators of flower and fruit development in other species. This suggests that apple fruit development is regulated by epigenetic processes, which is consistent with data obtained in tomato, demonstrating that DNA methylation is important for fruit ripening56,57,58.

In addition, among the major agronomical traits that contribute to both yield and quality, fruit size is one of the most important for many domesticated crops. Two of the key determinants that are known to alter plant organ size are cell number and cell size59. Here we investigated the fruit size difference between two isogenic doubled-haploid apple lines. We found that the number of cell layers in the parenchyma of GDDH13 fruits increased more rapidly than that in the parenchyma of the smaller GDDH18 fruits, with significant differences being observed as early as 21 DAP. In searching for regulators that contributed to the difference in fruit size between the two doubled-haploid apple lines, we found three genes that contained DMRs in their promoter regions and that potentially contributed to the cell number difference (Supplementary Note).

The identification of potential molecular mechanisms that control cell-division-related processes by DNA methylation provides new insights into the understanding of this important process. However, by comparing the GDDH13 and GDDH18 genomes, we identified nine SNPs that affect protein sequences, and thus we cannot currently exclude a genetic effect.

The high-quality reference apple genome sequence reported here offers unprecedented insights into the genome dynamics of a tree and provides an important basis for future studies, not only in apple but also in other Rosaceae species.

Methods

Genome assembly of GDDH13.

Hybrid assembly. The genome assembly was performed using a combination of sequencing technologies: PacBio RS II reads, Illumina paired-end reads (PE) and Illumina mate-pair reads (MP). First, Illumina PE reads were separately assembled using SOAPdenovo 2.223 (ref. 25). Next, the PacBio reads and Illumina contigs were combined to perform a hybrid assembly using the DBG2OLC pipeline26.

Assembly polishing. The assembly was polished using the Illumina paired-end reads. The 120× Illumina reads were mapped to the contigs using BWA-MEM61. This alignment was then used with Pilon 1.17 (ref. 62) to correct the assembly.

Mate pair scaffolding. A total of 8.5 Gb of Illumina MP data (approximate sequencing depth = 15×), with an insert size varying between 2 kb and 10 kb, was used to scaffold the assembly. The MP reads were mapped on the corrected contigs using BWA-MEM. The alignments were processed with the BESST63 software to scaffold the assembly.

BioNano scaffolding. A BioNano optical mapping analysis was performed, and data were collected and analyzed with IrisViewer (v2.5). The 397 BioNano maps, with an N50 of 2.649 Mb and a total length of 649.7 Mb, were used in the hybrid assembly step with the scaffolds obtained from the MP scaffolding to assemble the final scaffolds in IrisViewer.
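For reference, the N50 quoted for the optical maps is the length of the shortest element in the smallest set of longest elements that together cover at least half of the total length. The following is a minimal Python sketch of that definition; the function name `n50` and the toy lengths are illustrative and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence (or optical map) lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length


# Toy example with made-up lengths (arbitrary units).
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))
```

The same function applies unchanged to contig, scaffold or transcript assembly lengths.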

Scaffold validation and anchoring to the genetic map. An integrated multiparental genetic linkage map of apple23 that was composed of 15,417 SNP markers was used to organize and orient the scaffolds into chromosome-sized sequences. The probe sequences of the 15,417 markers64 were mapped onto the genome using BWA-MEM. The physical and genetic positions of the mapped markers were used to place and orient the scaffolds and contigs relative to each other. Additional methodological details describing the assembly processes can be found in the Supplementary Note.
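The idea of placing and orienting scaffolds from marker positions can be sketched as follows. This is an illustrative simplification and not the procedure actually used in the study: it assumes each mapped marker is available as a (scaffold, physical position, linkage group, cM) tuple, assigns a scaffold to the linkage group supported by most of its markers, orients it by the sign of the covariance between physical and genetic positions, and orders scaffolds by their median genetic position. The function name `anchor_scaffolds` and the input layout are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import median


def anchor_scaffolds(markers):
    """Place and orient scaffolds from mapped markers.

    markers: iterable of (scaffold, physical_pos, linkage_group, cM) tuples.
    Returns {linkage_group: [(scaffold, orientation), ...]} ordered along the map.
    """
    by_scaffold = defaultdict(list)
    for scaffold, pos, group, cm in markers:
        by_scaffold[scaffold].append((pos, group, cm))

    placed = defaultdict(list)
    for scaffold, hits in by_scaffold.items():
        # Majority vote decides the linkage group; discordant markers are ignored.
        group = Counter(g for _, g, _ in hits).most_common(1)[0][0]
        kept = [(p, c) for p, g, c in hits if g == group]
        mean_p = sum(p for p, _ in kept) / len(kept)
        mean_c = sum(c for _, c in kept) / len(kept)
        # Sign of the covariance between physical and genetic positions.
        cov = sum((p - mean_p) * (c - mean_c) for p, c in kept)
        orientation = "+" if cov > 0 else "-" if cov < 0 else "?"
        placed[group].append((median(c for _, c in kept), scaffold, orientation))

    return {
        group: [(s, o) for _, s, o in sorted(items)]
        for group, items in placed.items()
    }


# Toy markers (made-up values).
toy_markers = [
    ("scafA", 1000, "LG1", 10.0),
    ("scafA", 5000, "LG1", 12.0),
    ("scafB", 1000, "LG1", 5.0),
    ("scafB", 9000, "LG1", 2.0),
    ("scafB", 4000, "LG2", 30.0),
]
print(anchor_scaffolds(toy_markers))
```

A scaffold carrying a single informative marker, or markers at identical genetic positions, cannot be oriented this way and is reported with orientation "?".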

Linkage disequilibrium (LD). The 'Old Dessert' INRA core collection, comprising 278 accessions27, was genotyped with the Axiom Apple-480K SNP genotyping array28. LD was estimated with the r² statistic using the R package snpStats (version 1.16.0). Heat maps of pairwise LD between markers were plotted using the R package LDheatmap65.
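As an illustration of what a pairwise r² value measures, a common approximation for unphased diploid genotypes is the squared Pearson correlation between allele dosages (0, 1 or 2 copies of one allele). The Python sketch below implements that approximation only; it is not necessarily the estimator implemented in snpStats, and the function name `r_squared` is hypothetical.

```python
def r_squared(dosages_a, dosages_b):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 allele counts.

    Individuals with a missing call (None) at either SNP are skipped.
    """
    pairs = [(a, b) for a, b in zip(dosages_a, dosages_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two individuals genotyped at both SNPs")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        raise ValueError("r2 is undefined for a monomorphic SNP")
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    return cov ** 2 / (var_a * var_b)


# Toy genotypes (made-up): two SNPs in complete association, then two independent SNPs.
print(r_squared([0, 1, 2, 1], [0, 1, 2, 1]))
print(r_squared([0, 2, 0, 2], [0, 0, 2, 2]))
```

Values close to 1 indicate that the two markers are almost always inherited together, as expected in regions of high LD such as the putative centromeric regions discussed above.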

RNA sequencing (RNA-seq) analysis. To maximize the number and diversity of genes that were identified by RNA-seq, mRNA was purified from various organs at multiple developmental stages derived from seven cultivars and hybrids. A total of nine libraries were generated (see Supplementary Note for more details).

The cDNA sequencing libraries were constructed following the manufacturer's instructions (Illumina, San Diego, CA, USA), and the Illumina GA processing pipeline CASAVA 1.7.0 was used for image analysis and base-calling.

DNA extraction from leaf and developing fruits, and bisulfite sequencing. Young leaves from GDDH13 and developing fruits from GDDH13 and GDDH18 (two biological replicates from independently grafted trees) were collected 3 d before pollination (–3 DAP, with petals, sepals, anthers and styles removed) and at 9 DAP. DNA was purified using the Macherey-Nagel NucleoSpin Plant II DNA extraction kit (Germany), following the manufacturer's instructions. Bisulfite treatment was applied to 100 ng of genomic DNA to determine the cytosine methylation status, using the EpiTect bisulfite kit (Qiagen).

Whole-genome bisulfite sequencing was performed, and DMRs between leaves and young GDDH13 fruits, and between GDDH13 and GDDH18 fruits, at –3 DAP and 9 DAP were computed according to Hagmann et al.47. DNA methylation distribution plots and gene clustering analyses by methylation patterns were performed with deepTools66.
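The three sequence contexts distinguished in plant methylome analyses are CG, CHG and CHH, where H stands for A, C or T. The Python sketch below shows how a forward-strand cytosine can be assigned to a context and how a per-site methylation level can be derived from bisulfite read counts. It is a simplified illustration, not the DMR-calling procedure of Hagmann et al.; the function names are hypothetical, and cytosines on the reverse strand would require the reverse-complemented sequence.

```python
def cytosine_context(seq, i):
    """Return 'CG', 'CHG' or 'CHH' for the forward-strand cytosine at index i.

    Returns None if seq[i] is not C, if the context runs past the end of the
    sequence, or if it contains an ambiguous base such as N.
    """
    seq = seq.upper()
    if seq[i] != "C" or i + 1 >= len(seq):
        return None
    second = seq[i + 1]
    if second == "G":
        return "CG"
    if second not in "ACT" or i + 2 >= len(seq):
        return None
    third = seq[i + 2]
    if third == "G":
        return "CHG"
    return "CHH" if third in "ACT" else None


def methylation_level(methylated_reads, unmethylated_reads):
    """Fraction of reads supporting methylation at one cytosine (None if uncovered)."""
    total = methylated_reads + unmethylated_reads
    if total == 0:
        return None
    return methylated_reads / total


# Toy example (made-up sequence and read counts).
toy_seq = "ACGTCAGCTA"
for idx, base in enumerate(toy_seq):
    if base == "C":
        print(idx, cytosine_context(toy_seq, idx))
print(methylation_level(3, 1))
```

After bisulfite conversion, reads that retain a C at a cytosine position count as methylated and reads showing a T count as unmethylated, which is what the two arguments of `methylation_level` represent.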

Small RNA alignment. Apple sRNA sequences derived from mature fruit parenchyma35 were aligned to the Golden Delicious doubled-haploid pseudomolecules using BWA-MEM. Only perfectly mapped sequences were considered further, and reads with identical sequences were allowed to be mapped to two or more loci.

Genome annotation. RNA-seq data derived from nine different libraries were de novo assembled using Trinity67 and SOAPdenovo-trans68. For each library, the assembly with the highest N50 value was chosen to annotate the genes. In addition, 2,033 mRNAs and 326,941 expressed sequence tags (ESTs), extracted from the NCBI nucleotide and EST databases, respectively, were used for gene prediction.

The structural annotation of coding genes was performed using EuGene32 by combining Gmap transcript mapping69, similarities detected with plant proteomes and Swiss-Prot, and ab initio predictions (interpolated Markov model and weight-array matrix for donor and acceptor splicing sites). Moreover, the EuGene prediction was completed by tRNAscan-SE70, RNAmmer71 and RfamScan72 to annotate non-protein-coding genes, including those encoding tRNA, rRNA, miRNA and snoRNA, and other regions with proof of transcription but without significant similarities and coding potential (named ncRNA).

Functional annotation of proteins was performed using InterProScan73. The functional annotation was then completed by the prediction of targeted signals using the TargetP software74.

Genome synteny. SynMap (CoGe, see URLs) was used to identify collinearity blocks using homologous coding sequence pairs. Additional methodological details on the annotation processes can be found in the Supplementary Note.

Comparison of annotation between the heterozygous Golden Delicious and GDDH13 genomes. Malus domestica predicted gene (MDP) sequences obtained from the heterozygous genome annotation6 were mapped to the GDDH13 genome assembly using the best BLAT75 hit. Comparison of the two genome annotations was done using Bio++76.

Repeat annotation.

The TEdenovo pipeline37,77 from the REPET package v2.5 (see URLs) was used to detect TEs in genomic sequences and to provide a consensus sequence for each TE family. Consensus TE sequences were used to annotate the TE copies in the whole genome using the TEannot pipeline38 from the REPET package v2.5. Consensus sequences that were classified as potential host genes because they contain host gene Pfam domains were excluded from this study. The same process was used to identify the HODOR consensus sequence on the PacBio assembly with the REPET pipeline. TE insertion ages were calculated using the adapted T = K/r formula for nonduplicated LTR sequences, where K is the sequence divergence, and r is the substitution rate78. The observed sequence divergence was corrected with the Jukes and Cantor model79. Additional methodological details on the repeat annotation can be found in the Supplementary Note.
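The age calculation can be illustrated with a short Python sketch. Under the Jukes and Cantor model, an observed proportion p of differing sites is corrected to K = -(3/4) ln(1 - 4p/3), and the insertion age follows as T = K/r. The function names are hypothetical, and the substitution rate is left as a caller-supplied parameter because its value is not stated in this section; the rate used in the toy call is a placeholder, not the value from ref. 78.

```python
import math


def jukes_cantor_distance(p):
    """Correct an observed proportion of differing sites p for multiple substitutions."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def insertion_age(p, rate):
    """Insertion age T = K / r, with K the Jukes-Cantor corrected divergence.

    rate is the substitution rate per site per year, supplied by the caller.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return jukes_cantor_distance(p) / rate


# Toy call: 10% observed divergence and a placeholder rate (not the study's value).
placeholder_rate = 1e-8
print(jukes_cantor_distance(0.10))
print(insertion_age(0.10, placeholder_rate))
```

The correction matters most for old insertions: the corrected K is always at least as large as the observed p, and the gap widens as divergence increases.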

Data availability.

This whole-genome shotgun project has been deposited at GenBank under the accession code MJAX00000000.1. The raw Illumina mRNA sequences were submitted to the NCBI under BioProject ID PRJNA191060, and the GDDH18 genome reads were deposited under BioProject ID PRJNA379390. DNA methylation data can be accessed on the Gene Expression Omnibus website under accession codes GSE87014 and GSE93950. Structural and functional annotations are available through our genome browser (https://iris.angers.inra.fr/gddh13/).

URLs.

Structural and functional annotations are available through our genome browser: https://iris.angers.inra.fr/gddh13/.

The Whole-Genome Shotgun project can be found in GenBank under: https://www.ncbi.nlm.nih.gov/nuccore/MJAX00000000.1.

The REPET package v2.5 used to detect TEs in this study can be found here: https://urgi.versailles.inra.fr/Tools/REPET

SynMap (CoGe): http://www.genomevolution.org

Genome Database for Rosaceae: http://www.rosaceae.org.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.