High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development

Using the latest sequencing and optical mapping technologies, we have produced a high-quality de novo assembly of the apple (Malus domestica Borkh.) genome. Repeat sequences, which represented over half of the assembly, provided an unprecedented opportunity to investigate the uncharacterized regions of a tree genome; we identified a new hyper-repetitive retrotransposon sequence that was over-represented in heterochromatic regions and estimated that a major burst of different transposable elements (TEs) occurred 21 million years ago. Notably, the timing of this TE burst coincided with the uplift of the Tian Shan mountains, which is thought to be the center of the location where the apple originated, suggesting that TEs and associated processes may have contributed to the diversification of the apple ancestor and possibly to its divergence from pear. Finally, genome-wide DNA methylation data suggest that epigenetic marks may contribute to agronomically relevant aspects, such as apple fruit development.

Accurate sequence information, genome assemblies and annotations are the foundation for genetic and genome-wide studies. The major factors that limit de novo genome assembly are heterozygosity and repetitive sequences, such as TEs, which are often collapsed to single copies in draft genomes 1 . In recent years, however, evidence supporting the importance of TEs in genome evolution, genome structure, regulation of gene expression and epigenetics has been mounting [2][3][4][5] . The characterization of sequences and the distribution of TEs within a genome is, therefore, of great importance.
Until now, the study of epigenetically controlled characteristics in perennial plants has been hampered by the draft status of their genome sequences. In the case of apple, a draft was produced 6 but remained incomplete with inaccurate contig positions 7 ; this hindered its utility for genetic and epigenetic studies. De novo sequencing and assembly of a new genome for apple, using technologies of the third generation, had thus become a necessity.
In the last few years, single-molecule sequencing and optical-mapping technologies have emerged 8 , which are well suited for assembling genomic regions that contain long repetitive elements. Recently, several high-quality genome assemblies have been published using one or both technologies [9][10][11][12][13][14] . The use of long-read sequencing technologies may also tackle potential assembly issues that are related to the presence of highly similar sequences resulting from whole-genome duplication events that frequently occurred in angiosperm genomes 15 .
In addition to DNA sequence modifications, it has been shown that epigenetic variations contribute to genome accessibility, functionality and structure 16,17 . Several studies have demonstrated that local DNA methylation variants, which are represented by differential cytosine methylation at particular loci, can have major effects on the transcription of nearby genes and can be inherited over generations [18][19][20] .
Apple, like most other fruit tree crops, is propagated by grafting onto rootstocks, which over time can allow the acquisition and propagation of epimutations, via variation in DNA methylation states that can influence various phenotypes, such as fruit color 21,22 . Thus, knowledge of the epigenetic landscape of apple cultivars could provide new tools to study somatic variants, leading to the development of epigenetic markers for marker-assisted selection.
To produce a high-quality apple reference genome and methylome, we generated a de novo assembly of a 'Golden Delicious' doubledhaploid tree (GDDH13) composed of 280 assembled scaffolds and arranged into 17 pseudomolecules, which represent the 17 chromosomes of apple. This assembly resulted from a combination of short High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development 1 1 0 0 VOLUME 49 | NUMBER 7 | JULY 2017 Nature GeNetics A r t i c l e s (Illumina) and long sequencing reads (PacBio), along with scaffolding based on optical maps (BioNano) and a high-density integrated genetic linkage map 23 . This chromosome-scale assembly was complemented by a detailed de novo annotation of genes based on RNA sequencing (RNA-seq) data, TE annotation and small RNA alignments.
To understand the potential role of epigenetic marks on fruit development, we constructed genome-wide DNA methylation maps that compared different tissues and two isogenic apple lines that produce large or small fruits. This led to the identification of differential DNA methylation patterns that are associated with genes involved in fruit development.
This work provides a solid foundation for future genetic and epigenomic studies in apple. Furthermore, our TE annotation provides novel insights into the evolutionary history of apple and may contribute to explaining its divergence from pear.

Genome sequencing, assembly and scaffolding
The doubled-haploid Golden Delicious line (GDDH13, also coded X9273) used in this study is the result of breeding efforts that were initiated at INRA in 1963 (ref. 24) (Supplementary Fig. 1 and Online Methods). Homozygosity of this line was confirmed with microsatellite markers that are distributed along the apple genome (data not shown) and by observation of the k-mer spectrum of Illumina reads derived from GDDH13 ( Fig. 1a and Supplementary Note).
To perform de novo assembly of the GDDH13 genome, we combined three different technologies: short-read sequencing, long-read sequencing and optical mapping (Fig. 1b). Using DNA from the leaves of GDDH13, we generated ~120-fold coverage of Illumina paired-end reads (72 Gb), 80-fold coverage of Illumina Nextera mate-pair reads (58 Gb) at three different insert sizes (2, 5 and 10 kb) and ~35-fold coverage of PacBio sequencing data (24 Gb; 2,837,045 subreads with a mean length of 8,474 bp). The Illumina paired-end reads were first assembled using SOAPdenovo 25 , and the resulting contigs were combined with the PacBio reads using the DBG2OLC assembler 26 . This resulted in an assembly that consisted of 2,150 contigs with an N50 of 620 kb (i.e., 50% of the assembly was contained in contigs ≥620 kb in size) (Supplementary Table 1) and a total length of 625.2 Mb, which were subsequently corrected by using the Illumina paired-end reads (94,896 single-base assembly errors corrected; 1,054,709 insertions (1,466,015 bp) and 123,510 deletions (178,733 bp)) and scaffolded by using Illumina mate-pair reads with BESST (assembly N50 increased from 620 kb to 699 kb).
Next, using a ~600-fold-coverage BioNano optical map, we generated a consensus map that resulted in an assembly of 649.7 Mb. This consensus map was then used for the hybrid assembly with the corrected scaffolds, which, together with single-nucleotide polymorphism (SNP) markers derived from a high-density genetic linkage map 23 , allowed the construction of the 17 pseudochromosomes (Supplementary Table 2 and Supplementary Note). To estimate the genome size, we calculated different k-mer frequency distributions of the Illumina reads. The estimated GDDH13 genome size of 651 Mb was very close to the 649.7-Mb size in the consensus map.

Assessment of genome quality
We assessed the quality of the assembly by using two independent sources of data. First, we used the SNP markers that were mapped on the previously mentioned integrated genetic linkage map to validate   scaffold assembly. Of the 15,417 SNP probe sequences, we identified sequence homology in the GDDH13 genome for 14,732 of them. We then assessed their position on the scaffold assemblies by comparing their location on the integrated genetic linkage map. In total 14,117 of the mapped markers (95.8%) were found to be located at their expected positions (Supplementary Note). To visualize these data, we plotted the genetic distance against the physical distance of the genetic markers for each chromosome (Supplementary Fig. 2); the data for chromosome (Chr) 11 is shown as an example in Figure 1c. This analysis showed that there was very little discrepancy between the physical and genetic maps. For comparison, we plotted these markers to the heterozygous apple genome (v 1.0; Supplementary Fig. 3). We also plotted the recombination rates in sliding windows of 3 Mb on this chromosome (Fig. 1d) and identified a decrease in recombination frequency toward the middle of Chr11. Second, we estimated the level of linkage disequilibrium (LD) using the r 2 parameter between all pairwise SNP comparisons by using marker data that were derived from an apple core collection 27,28 . In the present version of the GDDH13 genome, we did not identify any abrupt jumps in LD, indicating the overall robustness of the assembly ( Fig. 1e and Supplementary Fig. 4). Using previously published genetic data 29 , we generated a haplotype map for GDDH13, which allowed the identification of recombination breakpoints ( Supplementary Fig. 5).
Finally, the completeness of the assembly was tested by searching for 248 core eukaryotic genes 30 (CEGs). In total, 237 of 248 CEGs were completely present, and 7 CEGs were partially present, indicating that fewer than 2% of the CEGs could not be detected, which compared very favorably with other assemblies 31 .

Genome annotation
To obtain a global view of the apple transcriptome, we performed a high-throughput RNA-seq analysis on poly(A)-enriched RNAs from nine libraries that originated from different genotypes and tissues. RNA-seq reads were assembled, and the resulting contigs were mapped to the scaffolds and integrated in the EuGene combiner pipeline 32 . In total, we identified 42,140 protein-coding genes (which represent 23.3% of the genome assembly) and 1,965 non-protein-coding genes (Supplementary Table 2 and Supplementary Note). Evidence of transcription was found for 93% of the annotated genes.
To further evaluate the quality of the annotation, a comparison with annotations of previous apple genome assemblies 6,33 was performed using the BUSCO v2 method, which is based on a benchmark of 1,440 conserved plant genes 34 . The results indicate that our apple genome annotation is the most complete, despite having the lowest number of predicted genes ( Table 1).
The de novo annotated genes were named using the following convention: MD (for Malus domestica) followed by the chromosome number and gene number on the chromosome (in steps of 100) going from top to bottom according to the linkage map, for example, MD13G0052100.
Previously published small RNA (sRNA) data 35 were also mapped to the genome. We found that most 21-and 22-nt-long sRNAs mapped to protein-coding genes, whereas most 24-nt-long sRNAs mapped to TEs. The distribution of 23-nt-long sRNAs was evenly included in both types of genomic features (Supplementary Fig. 6).

Ancestral genome duplication
Intragenomic synteny of GDDH13 was assessed using SynMap (CoGe; http://www.genomevolution.org) and visualized with Circos 36 . Results of this analysis ( Fig. 2) showed an even clearer genome duplication pattern than has previously been reported 6 . Only very few regions showed no synteny to other parts of the genome (for example, the middle part of Chr04).

Transposable elements and annotation of repeat sequences
To produce a genome-wide annotation of repetitive sequences, TE consensus sequences (provided by the TEdenovo detection pipeline 37 ) were used to annotate their copies in the whole genome. To refine this annotation, we performed two iterations of the TEannot pipeline. In the GDDH13 genome, TEs represented 372.2 Mb (57.3% of the 649.7 Mb BioNano assembly; Supplementary Table 2). Excluding undefined bases (Ns), the TE content of the total nucleotide space in the final annotation was 59.5% of the assembly. The most abundant repeats in this genome are retrotransposons or class I elements (74.8% of TE content, 42.9% of genome assembly), and in particular long terminal repeat retrotransposons (LTR-RTs), which represent 66% of this type of repeat, whereas non-LTR retrotransposons (LINE and SINE) accounted for 7% ( Fig. 3a and Supplementary Table 2). DNA transposons or class II elements (DNA transposons and Helitrons) make up 23% of the TE content (13.4% of the genome assembly; Fig. 3a and Supplementary Table 2). A complete list of identified TEs, their integrity and copy number can be found in Supplementary Table 3.  13 Li et al. 33 Velasco et al. 6 Sequenced genome size (Mb) 643.  Figure 2 Synteny and distribution of genomic and epigenomic features of the apple genome. The rings indicate (from outside to inside, as indicated in the inset) chromosomes (Chr), heat maps representing gene density (green), TE density (blue) and DNA methylation levels (orange). A map connecting homologous regions of the apple genome is shown inside the figure. The colored lines link collinearity blocks that represent syntenic regions that were identified by SynMap.
We ran the REPET 38 pipeline on the PacBio contigs, which allowed us to identify an additional hyper-repetitive consensus sequence (Genbank entry KX869746). This consensus sequence was automatically classified as a 9,716-bp LTR-RT with over 500 full-length copies, and it accounted for 3.6% of the genome assembly (22.3 Mb). We termed this TE consensus sequence HODOR (high-copy Golden Delicious repeat). At the chromosomal level, a higher density of HODOR copies coincided with particular regions of each chromosome that show reduced recombination levels, whereas the density level of other TEs remained constant or was decreased at these same regions (Fig. 3b and Supplementary Fig. 7). Even though the retrotransposon consensus sequence has clear 5′ and 3′ LTRs that are 1.8 kb in size, there are no homologies with typical TE-related sequences encoding a gag protein, a reverse transcriptase or an integrase. However, we found partial sequence similarity to the Malus domestica Copia-100 element present in RepBase Update 39 , corresponding to different domains such as gag pre-integrase, RNase H and integrase. These results suggest that HODOR is a non-autonomous LTR retrotransposon derivative or LARD (large retrotransposon derivative). We scanned the genome and were able to identify TEs that could contribute to the mobilization of HODOR (Supplementary Table 3 and Supplementary Note). Notably, we also found significant (BLASTX e-values ≤ 1 × 10 −29 ) similarities with sequences encoding three short bacterial proteins of unknown function ( Supplementary  Fig. 8a), and mining of transcriptome data 35 showed HODOR to be primarily transcribed in the sense and antisense orientations in apple seeds (Supplementary Fig. 8b).
To investigate the evolutionary history of TEs in the apple genome, we plotted the distribution of identity values between genomic copies and their consensus sequences (Fig. 3c). Distributions for all classes of repeats showed a peak at 77% identity. By considering the mutation rate that has been reported for LTR-RTs in plants (1.3 × 10 −8 base substitutions per site per year 40,41 ), we estimated the age of those insertions as described by the International Human Genome Sequencing Consortium 42 . We concluded that the peak at 77% identity corresponded to an insertion age of around 21 million years ago (Mya) (Fig. 3c). We also noted a second peak, particularly for LINE elements, at 98% identity that corresponded to a TE burst at ~1.6 Mya (Fig. 3c).

The apple methylome
To investigate the apple methylome, we produced genome-wide maps of DNA methylation content at single-base resolution for GDDH13 leaves and young fruits 43,44 .
Globally, in leaves we found DNA methylation levels of 49%, 39% and 12% in the CG, CHG and CHH sequence contexts (where H is adenine, thymine or cytosine), respectively (Fig. 4a). DNA methylation was not evenly spread throughout the chromosomes (Fig. 4b shows the profile for Chr11; see Supplementary Fig. 9 for the profiles for all of the chromosomes), and peaks of methylation coincided with recombination cold spots. As expected 45,46 , there are reduced overall DNA methylation levels in gene sequences, whereas TEs are extensively methylated (Fig. 4c). For genes, we identified three major types of DNA methylation patterns. Genes in cluster 1 were characterized by high levels of DNA methylation in the gene body in the CG and CHG contexts, which was concomitant with high DNA methylation in the surrounding regions. Genes in cluster 2 had low CG, and very low CHG and CHH, methylation in the gene itself, yet there were increased levels in the surrounding region. Finally, genes in cluster 3 featured low DNA methylation levels in both the gene body and in the surrounding regions (Supplementary Fig. 10). This last cluster contained the largest number of genes (27,179; 64.5% of all genes), showing that in apple, genes are generally depleted for DNA methylation. By mining previously produced large transcriptome data sets for apple 35 , we found that genes covered with very high levels of DNA methylation (cluster 1) showed the lowest expression levels (1.58 median log 2 value), whereas cluster 2 and cluster 3 genes had higher log 2 values (3.3 and 2.8, respectively). This result confirmed that the amount of DNA methylation surrounding genes influences their expression level. As one example of TEs, we assessed the DNA methylation levels for HODOR and found that HODOR was almost completely methylated in the CG (90% methylated) and CHG (65% methylated) contexts but that it had much less methylation in the CHH context (3%) (Fig. 4c).

DNA methylation and fruit development
To assess how DNA methylation contributes to fruit development, we first compared DNA methylation levels between leaves and fruits. We called differentially methylated regions (DMRs) using a hidden Markov model (HMM)-based approach 47 . In total, we identified 1,017 high-confidence DMRs in all contexts between leaves and fruits, and we observed a very strong bias for DMRs containing methylation changes in the CHH context (875 DMRs; 86.0%) (Fig. 5a). We identified 294 genes that contained DMRs in their promoter region-14 DMRs were in the CHG context and showed increased amounts of DNA methylation in leaves, whereas the remaining 280 DMRs were found in the CHH context and showed increased amounts of DNA methylation in fruits. Thus, most methylation differences between leaves and fruits occurred at CHH sites, with a robust increase observed in the developing fruit. Among genes with DMRs that were 2 kb upstream of their transcription start site (TSS), we identified several apple orthologs of Arabidopsis genes with important roles in flower and fruit development and in epigenetic regulation (Fig. 5b).
Next we wanted to test whether DNA methylation could have a role in the regulation of fruit size. We took advantage of GDDH18, an isogenic line that was obtained from the same haploid that produced GDDH13 (Supplementary Note). Whole-genome sequencing showed the presence of 27 homozygous SNPs within genes between the two trees, with nine of these SNPs resulting in amino acid changes (Supplementary Table 4). Although the GDDH13 and GDDH18 trees were indistinguishable, the GDDH18 fruits were much smaller (Fig. 5c) because of a reduced number of cell layers in the parenchyma (Fig. 5d).
To elucidate whether the difference in fruit size could have an epigenetic basis, whole-genome bisulfite sequencing was performed on A r t i c l e s samples that were collected at 3 d before pollination (or -3 d after pollination (DAP); when fruits have a similar size and number of cell layers) and at 9 DAP (a few days before observing significant phenotypic differences between the fruits). As expected from their common origin, only a limited number of high-confidence DMRs (n = 197) could be found between young fruits of GDDH13 and GDDH18 at -3 DAP. Of these, 47 DMRs were located within 2 kb upstream of the TSS of genes. Similarly, we identified a total of 148 high-confidence DMRs between fruits of GDDH13 and GDDH18 at 9 DAP. From this analysis, we found that 53 genes contained DMRs in their promoter region (i.e., within 2 kb upstream of the TSS). At both time points a majority of genes with DMRs showed a decrease in methylation in their promoter region for GDDH18 (Supplementary Table 5). Notably, in both comparisons, DMRs in the CG-CHG and CHG contexts were over-represented. The overlap of DMRs between the two time points analyzed included 22 genes with DMRs in their promoter regions, with most of them (n = 17) showing lower methylation in GDDH18 (Supplementary Table 5). Several of the 22 genes have orthologs in other species with a role that could explain the observed size difference between the GDDH13 and GDDH18 fruits-including SQUAMOSA PROMOTER-BINDING PROTEIN LIKE 13 (SPL13, MD16G0108400), 1-AMINO-CYCLOPROPANE-1-CARBOXYLATE SYNTHASE 8 (ACS8, MD15G0127800) and CYTOCHROME P450 FAMILY 71 SUBFAMILY A POLYPEPTIDE 25 (CYP71A25, MD14G0147300), which belong to the minority of genes with increased methylation in GDDH18.

DISCUSSION
As a prerequisite to epigenomic studies in apple, we decided to produce a high-quality reference genome for apple. An advantage for us was the availability of the homozygous GDDH13 doubled-haploid line. Assembling a genome that is both highly heterozygous and recently duplicated into a haploid consensus sequence presents a substantial challenge. This is exemplified by the comparison of our first assembly steps to a recently published report on a heterozygous Golden Delicious apple genome sequence 33 . Following hybrid assembly of PacBio and Illumina reads, Li and colleagues 33 reported a N50 of 112 kb, whereas we obtained a N50 of 620 kb at that same step. These results highlight the power of haploids or doubled haploids in genome sequencing projects 48 , particularly in those for apple, which is not only highly heterozygous but has also undergone a recent whole-genome duplication (ref. 6 and this study). The optical mapping then allowed us to produce scaffolds with a N50 of 5.5 Mb, which, in association with a high-density integrated linkage map, yielded highly contiguous pseudomolecules. In this new apple genome, we followed a newer convention 23 in which the orientation of Chr10 and Chr05 became aligned by the inversion of Chr05. We chose to invert Chr05 because it is the least frequently reported of the two in previous genetic studies on quantitative trait loci (QTL), gene discovery and characterization.
We estimated the genome size of GDDH13 to be 651 Mb (Supplementary Table 2), which suggested that the GDDH13 genome may be smaller than that of the heterozygous Golden Delicious line, which was recently estimated to be 701 Mb (ref. 33  counterpart (including tree architecture, flowering time and fruit appearance; Supplementary Fig. 1), it is possible that through the consecutive steps of selfing, haploid development and chromosome doubling, some minor parts of the genome might have been lost or re-arranged. Thus, it is possible that some of the genome sequence might be missing in the GDDH13 assembly. Our gene prediction analysis reduced the estimated number of annotated genes in apple from 63,541 (Genome Database for Rosaceae, see URLs and ref. 6) to 42,140, which is much closer to the 42,812 genes that have been reported for pear 49 and the 45,293 genes that were identified after filtering out overlapping genes from the original apple genome annotation 49 (Supplementary Note).
TEs also have an important role in structuring genomes. The indepth TE annotation we performed showed a major TE burst in apple that we estimated to have happened around 21 Mya. This affected all types of TEs, suggesting that the precursor of the modern apple underwent environmental changes with resulting stresses that led to the activation of these TEs 50 . The observed TE burst corresponds to the Miocene epoch (23 Mya to 5 Mya) and may coincide with two events: the divergence between pear and apple 48 and an uplift event occurring at the Tian Shan mountains 51 , which cover the region where the ancestor of the apple originates from 52 . We hypothesize that these TE bursts, which presumably must have been very different in the predecessor of pear and apple, have contributed to the diversification, and possibly even speciation, of these plants.
Although our analyses using previously reported approaches 53 did not identify any characteristic short centromeric repeat sequence in the apple genome, we can hypothesize the putative localization of centromeres on the GDDH13 chromosomes. We found that the regions in which we observed a decrease in the recombination rate between successive markers of the integrated linkage map coincided with the regions that showed an increase in the estimated level of LD in the core apple collection, as well as an increase in DNA methylation levels. In addition, we identified HODOR, the most repetitive consensus sequence in the apple genome, as being over-represented in these same genomic regions. These findings suggest that centromeric regions in the GDDH13 genome may be located within the regions that show an over-representation of HODOR. Future studies will show whether HODOR has a role in the centromere structure in the apple genome. Blast searches have revealed that the HODOR sequence also exists in pear, and because of its origin from potential horizontal gene transfer events, it will be of great interest to investigate when HODOR first appeared during the Rosaceae evolution.
The genome-wide distribution of DNA methylation peaked in putative centromeric regions of high LD and high HODOR content. As has been observed in Arabidopsis 43 , TEs were enriched and genes strongly depleted for DNA methylation. The 10% of genes that possess high levels of DNA methylation (gene body and surrounding region; Supplementary Fig. 10), globally showed a very low level of transcription, and these genes may be expressed during very specific developmental stages or tissues. The comparison of the apple leaf and fruit methylomes revealed a noteworthy pattern-the fruit globally had higher CHH DNA methylation levels, which suggested increased activity of the RNA-directed DNA methylation machinery in this organ 54 . Consistent with this observation, it has been shown for Arabidopsis that cell-type-specific DNA methylation differences mainly occur at CHH sites 55 . Notably, DNA methylation differences in the CHH context between leaf and fruit tissues occurred next to 294 genes. Several of these were found to be orthologous to genes that are known to be important regulators of flower and fruit development in other species. This suggests that apple fruit development is regulated by epigenetic processes, which is consistent with data obtained in tomato, demonstrating that DNA methylation is important for fruit ripening [56][57][58] .
In addition, among the major agronomical traits that contribute to both yield and quality, fruit size is one of the most important for many domesticated crops. Two of the key determinants that are known to alter plant organ size are cell number and cell size 59 . Here we investigated fruit size difference between two isogenic doubled-haploid apple lines. We found that the number of cell layers in the parenchyma of GDDH13 fruits increased more rapidly than those in the parenchyma of the smaller GDDH18 fruits, with significant differences being observed as early as 21 DAP. To identify regulators that contributed to the difference in fruit size between the two doubled-haploid apple lines, we found three genes that potentially contributed to the cell number difference, and these contained DMRs in their promoter regions (Supplementary Note).
The identification of potential molecular mechanisms that control cell-division-related processes by DNA methylation provides new insights into the understanding of this important process. However, by comparing the GDDH13 and GDDH18 genomes, we identified nine SNPs that affect protein sequences, and thus we cannot currently exclude a genetic effect.
The high-quality reference apple genome sequence reported here offers unprecedented insights into the genome dynamics of a tree and provides an important basis for future studies, not only in apple but also in other Rosaceae species.

METhODS
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper.