The Rosa genome provides new insights into the domestication of modern roses

Roses hold high cultural and economic importance as ornamentals and for the perfume industry. We report the rose whole genome sequencing and assembly and resequencing of major genotypes that contributed to rose domestication. We generated a homozygous genotype from a heterozygous diploid modern roses progenitor, Rosa chinensis ‘Old Blush’. Using Single Molecule Real-Time sequencing and a meta-assembly approach we obtained one of the most complete plant genomes to date. Diversity analyses highlighted the mosaic origin of ‘La France’, one of the first hybrids combining the growth vigor of European species and recurrent blooming of Chinese species. Genomic segments of Chinese ancestry revealed new candidate genes for recurrent blooming. Reconstructing regulatory and secondary metabolism pathways allowed us to propose a model of interconnected regulation of scent and flower color. This genome provides a foundation for understanding the mechanisms governing rose traits and will accelerate improvement in roses, Rosaceae and ornamentals.

Rose history 63 The genus Rosa represents a group of plants that appears to have undergone extensive reticulate evolution 64 with interspecific hybridization, introgression and polyploidization. These evolutionary processes have led to 65 the emergence of traits that respond to humankind's hedonistic expectations and have represented an 66 incredible source of diversity. Rose domestication is a particularly complex model produced by hundreds of 67 years of breeding and is based on altering whole pathways and networks. Rose domestication happened at 68 least twice independently in ancient China and the peri-Mediterranean area, encompassing part of Europe 69 and the Middle East 1,2 . In these two regions, generations of rose breeders had fastidiously selected the most 70 desirable traits of Rosa species by meticulous observation. Ornamental features, therapeutic and cosmetic 71 values have certainly motivated the domestication in these two world areas. Crosses between Rosa species 72 and cultivars have created complex polyploid cultivars that exhibited the most advantageous parent's traits 73 such as recurrent flowering, good looking flowers, pleasant scent, cold hardiness and pathogens 74 resistance 1,3,4 . Two biological groups are particularly important: the Damask roses cultivated for the 75 production of oils and fragrance and the Chinese roses that were unique in their continuous flowering. 76 The Chinese rose R. chinensis is among the few species that participated in breeding programs. In China, 77 roses have been cultivated for a very long time, dating back to the reign of Chin-Nun (2737-2697 BC) 2 . The 78 earliest cultivated Chinese roses were bred from local indigenous forms that grew wild in the mountains of 79 China, probably in the Yunnan and Sichuan areas 5,6 . The second steps in the evolutionary history of the rose 80 is the encounter of the two genic pools from the 18 th century that led to the introgression of the continuous 81 flowering, a trait from the Chinese roses in the occidental rose genome. Since the 19 th century, massive 82 controlled hybridization allowed the creation of numerous varieties. R. chinensis is considered among the 83 most important species that participated in the subsequent extensive hybridization using the gene pools from 84 the European / Mediterranean / Middle-East (mostly tetraploid) and Chinese (mostly diploid) roses. These 85 processes engendered the parental cultivars of the modern-day roses (modern rose cultivars, Rosa x 86 hybrida) 1,7 . These hybridizations likely happened independently several times and produced triploid hybrids. 87 Supposedly, the production of unreduced gametes allowed breeders to retrieve fully fertile tetraploid hybrids 88 and overcome this triploid block. One of the major Chinese roses used in the creation of modern roses was 89 'Old Blush' (Parson's Pink China), which also transmitted the recurrent flowering character. Yet, R. 90 chinensis 'Old Blush' displays specific phenotypical traits that pinpoint a possible hybrid origin. We 91 generated a high-quality genome sequence of R. chinensis 'Old Blush' and we resequenced rose species 92 and/or cultivars that could help in understanding the hybrid architecture of 'Old Blush'. Moreover, our 93 resequencing effort aimed to capture an image of the diversity that is at the origin of the modern-day R. x 94 hybrida complex genotypes, as well as the allopolyploid origins of R. gallica and R. damascena. Since the 95 species involved in domestication and later hybridization / introgression events mostly belong to Synstylae, 96 Chinenses and Cinnamomeae sections, our resequencing effort was focused on them to reflect their diversity. 97 Finally, in order to describe the genomic reorganization resulting from the combination of tetraploid 98 European and diploid Asian genomes after hybridization or introgression, we resequenced the emblematic R. 99 x hybrida ' La France'. Bred Fig. 1a) were sampled when the majority of microspores were at the mid-late 109 uninucleate/early bicellular developmental stages ( Supplementary Fig. 1b-e) and then surface-sterilized with 110 Pursept ® A Xpress for 1 minute, followed by a treatment with a bleach solution (1.5 % active chlorine) 111 containing 0.5% Tween 20 for 15 minutes. Buds were then rinsed 4 times with sterile de-ionized water. 112 Anthers were aseptically dissected from buds, and microspores were isolated as described 9 with the first 113 centrifugation being performed at 100 g for 3 minutes, followed by two centrifugations at 65 g for 3 minutes. 114 Microspores were then suspended in B medium 10 , pH 6.5. Microspore viability was checked by FCR test 11 115 and the developmental stage was assessed by DAPI staining 12 . In all experiments, 116 the microspores viability was around 50%. Density was then adjusted to 100,000 microspores/mL and the 117 suspension was pretreated at 4°C in darkness for 21 days in Falcon 353001 Petri dishes sealed with 118 Parafilm ® (1.5 mL microspore suspension per dish). Microspores were then rinsed twice with cold B medium 119 and centrifuged at 50 g for 3 min at 4°C. A portion of 160,000 microspores from this fraction was then 120 suspended in 600 µL of AT12 medium corresponding to AT3 medium 9 supplemented with 4.5 µM 2,4-D and 121 0.44 µM BAP, pH 5.8, and then incubated in a 12-well plate sealed with Parafilm ® at 25°C in the dark. After 122 3 weeks, the medium was carefully replaced with 600 µL of fresh AT12 medium, and the culture was further 123 incubated with the same conditions. Developing micro-calli (ca. 0.5 mm diameter) were observed about 8 124 weeks after subculture ( Supplementary Fig. 1f). Developing micro-calli were isolated and subcultured 125 individually in 300 µL of the same medium in a 24-well plate sealed with Parafilm ® in the same conditions. 126 After 2 weeks, calli were plated onto a CM3 solid medium containing MS salts 13 , B5 vitamins 14 , 30 g/L 127 sucrose, 2.5 mM MES, pH 5.8, supplemented with 4.5 µM 2,4D, 0.44 µM BAP and 6.5 g/L VitroAgar 128 (Kalys Biotechnologie, Saint Ismier, France). After 7 weeks of culture, developing calli were subcultured 129 once on CM3 medium for 12 weeks. At this stage, several calli issued from the same experiment of 130 microspore culture displayed developing somatic embryos (Supplementary Fig. 1g

144
Roses exhibit high heterozygosity levels that hamper high quality genome assembly. To overcome this 145 difficulty, we developed a protocol that allows 'Old Blush' microspores to switch from gametophyte to 146 sporophyte development. We used a combination of fine-tuning a starvation medium, cold stress and 147 hormonal treatments to induce microspores that initiate divisions and to form cell clusters (Supplementary 148 Fig. 1f) after about 11 weeks of culture. Clusters were developed and yielded both embryogenic and 149 proliferating calli that were then maintained on various media ( Supplementary Fig. 1g,h). 150 151 DNA genotyping (HRM) of isolated calli showed that all tested loci were homozygous (Supplementary 152 Fig. 1k). Developing calli displayed the same homozygous profile indicating that they likely derived from a 153 unique microspore development event. This callus was designated R. chinensis HzRDP12 (hereafter 154 RcHzRDP12; Supplementary Fig. 1g,h). The k-mer spectrum of Illumina reads derived from RcHzRDP12 155 provided the final proof that the genome of RcHzRDP12 genome was homozygous, demonstrating a loss of 156 heterozygosity in 'Old Blush' (Supplementary Fig. 1l). Experiments exploring the potential of RcHzRDP12 157 material have revealed that it is possible to maintain the embryogenic capacity of produced calli through 158 several subcultures. Furthermore, we readily regenerated plantlets with normal morphological phenotype 159 from RcHzRDP12 somatic embryos ( Supplementary Fig. 1i).

161
To determine the ploidy level of the homozygous RcHzRDP12 material, we performed fluorescence-162 activated cell sorting (FACS) analysis. We used R. chinensis 'Old blush' leaves, cultivated in vitro, as 163 control. Nuclei were isolated from RcHzRDP12 calli or from young leaves of regenerated plantlets, as 164 previously described 17 , and stained by adding 1 µg/mL DAPI (Sigma) for 1 hour at room temperature. FACS 165 analyses were performed using MACSQuant VYB (Miltenyi Biotec) cytometer and analyzed by FlowJo 166 software (FlowJo LLC). One major peak corresponding to diploid (2N) cells was observed after DAPI 167 staining for RcHzRDP12 ( Supplementary Fig. 1j). The ploidy profile of this homozygous material was 168 identical to that of the heterozygous R. chinensis 'Old Blush' plants, used as a control. In all samples, the 169 majority of cells were diploid and low proportion of polyploid cells (4N and 8N), frequently observed in 170 young tissues, was detected. These data demonstrate that haploid cells originating from the homozygous 171 callus did undergo spontaneous genome duplication during regeneration resulting in diploid homozygous R. 172 chinensis 'Old blush' callus and plant material. 173 To the best of our knowledge, this is the first demonstration of the production of a homozygous rose 174 plantlet. The use of such approach opens possibilities to implement haplomethods in rose genetics and 175 breeding. This possibility to generate Recombinant Inbred Like materials paves the way for novel breeding 176 strategies in roses, e.g. F1 breeding or reverse breeding. With respect to more fundamental research, 177 availability of homozygous rose genotypes may foster the study of a number of processes in simpler genetic 178 models (e.g. developmental mechanisms or metabolic pathways). In particular, homozygous genotypes 179 represent promising models for functional genetics. The first generation of long-read genome assembly software such as PBcR 18 and FALCON 19 enabled the 186 assembly of chromosomes or chromosome arms of small or medium sized genomes 18,20 . The genome 187 assemblies of genomes with higher repeat complexity (e.g. plant genomes) were still composed of several 188 hundreds or thousands of contigs 20,21 and required code modifications to adapt overlap filtering to 189 peculiarities of complex genomes 20  resolved. Then, we used CANU to perform a meta-assembly of our six primary assemblies in which the 198 number of contigs ranged between 298 and 413 and an N50 between 3.37 and 7.95 Mb. As the CANU 199 version 1.4 was unable to handle such large sequences, primary assemblies were transformed into very long 200 overlapping sequences with a maximum of 100 kb (50 kb overlap) prior the meta-assembly. The meta-201 assembly was executed with a minimal overlap of 10 kb and the overlap based trimming step was activated 202 in order to trim spurious contigs ends (found in one assembly only). The meta-assembly is composed by only 203 82 contigs for an N50 of 24 Mb ( Supplementary Fig. 2a) showing the complementarity of primary 204 assemblies. The obtained assembly with a few contigs was then easily integrated with a high-density map as 205 already described in the main text and in the Online Method section. 206 207 3.1.2 The til-r software 208 til-r is a C software implementing heuristics that aim to filter the graph of overlaps generated by the 209 FALCON pipeline. It replaces the call to the program "fc_ovlp_filter" in the script "run_falcon_asm.sub.sh" 210 in FALCON version 0.7. 211 The different pipeline functions, inputs and outputs, and defaults parameters are described in 212 Supplementary Fig. 2c. The four heuristics, the assumptions or combinations of assumptions on which they 213 are based and how they are applied at the read-end level are presented here: 214 Assumption #1: an overlap that spans a non-repeated region is not ambiguous. The length of PacBio reads 215 is long enough to span a large majority of repeated regions. 216 Heuristic #1: a list of non-repeated regions can be provided to til-r as a tabular text file or automatically 217 computed. Only overlaps spanning a non-repeated region are considered. To quickly identify likely non-218 repeated regions in reads, we first randomly sub-sample the read dataset to obtain less than 1x coverage per 219 slice. All reads are classified in one slice. The number of slices is computed depending on genome size and 220 targeted coverage. In each slice, the corresponding overlap positions are used to define repeated regions. 221 After consolidating of repeated regions over all slices, the list of non-repeated regions is defined. 222 Assumption #2: The identity percentage for overlaps depends on the read end quality and some tolerance 223 must be allowed for trying to avoid dead ends (read ends without any overlaps above the cut-off). At a given 224 identity cut-off, the overlaps list contains true positive overlaps but also false positives in the case of repeated 225 regions in the genomic regions. The identity percentage for the false positive overlaps is expected to be lower 226 than the one for the true positives. The best identity percentage found is an indirect measure of read end 227 quality. 228 Heuristic #2: A deltapci parameter that permits tuning the maximum difference allowed between the 229 overlap with the best identity percentage overlap and the other overlaps that are taken into account. When the 230 difference is too high, the overlaps are removed even if their identity percentages are above the general cut-231 off. 232 Assumption #3: The best overlap graph algorithm selects the largest overlaps to build the path of reads.

254
Methods 255 About 0.5 g of formaldehyde-fixed leaf tissues were used to prepare 2 independent in situ Hi-C libraries. The 256 sample fixation was performed as for ChIP-seq in this study. Nuclei extraction, nuclei permeabilization, 257 chromatin digestion, and proximity ligation treatments were performed essentially as previously described 23 . 258 The extracted nuclei were resuspended in 150 µL 0.5% SDS, split equally into three tubes and incubated at 259 62°C for 5 min. After which 145 µL water and 25 µL 10% Triton X-100 were added, and incubated at 37°C 260 for 15 min. Next, the nuclei in each tube were digested by adding 25 µL 10x NEB buffer 3 (100 mM NaCl, 261 50 mM Tris-HCl, 10 mM MgCl 2 , 1 mM DTT, pH 7.9) and 50 U of DpnII restriction enzyme, and incubated 262 at 37°C overnight. The next day, the nuclei were incubated at 62°C for 20 min to inactivate the restriction 263 enzyme. Next, the digested chromatin was blunt-ended by adding 1 µL of 10 mM dTTP, 1 µL of 10 mM 264 dATP, 1 µL of 10 mM dGTP, 25 µL of 0.4 mM biotin-14-dCTP, 14 µL water and 4 µL (40 U) Klenow 265 fragment, and incubated at 37ºC for 2 hr. Subsequently, 663 µL water, 120 µL 10x blunt-end ligation buffer 266 (300 mM Tris-HCl, 100 mM MgCl 2 , 100 mM DTT, 1 mM ATP, pH 7.8), 100 µL 10% Triton X-100, and 20 267 Weiss U T4 DNA ligase were added to start proximity ligation. The ligation reaction was placed at room 268 temperature for 4 hr. After ligation, the nuclei were collected by centrifugation at 1,000 rcf for 3 min, and 269 then resuspended in 750 µL SDS buffer (50 mM Tris-HCl, 1% SDS, 10 mM EDTA, pH 8.0), and incubated 270 with 200 µg proteinase K at 55°C for 30 min. The formaldehyde crosslink was reversed by adding 30 µl 5M 271 NaCl to the solution followed by overnight incubation at 65°C. The recovery of Hi-C DNA and subsequent 272 DNA manipulations were performed as described previously 24 . The final libraries were sequenced on an 273 Illumina NextSeq instrument with 2 x 75 bp reads. 274

275
Over the past few years, three-dimensional proximity information obtained by Hi-C was reported as an 276 efficient method to construct spatial proximity maps of many eukaryotes to help assemble their genomes 25 . 277 We constructed spatial proximity maps of the rose genome using chromosome conformation capture 278 sequencing (Hi-C) at a resolution of 400 kb and then used it to evaluate and confirm the genome assembly 279 and the rose 7 pseudo-chromosomes constructions. The two Hi-C-libraries (denoted A and B, with 280 respectively 198,638,690 and 219,337,784 reads) were independently analyzed with Hi-C-Pro pipeline 281 (default parameters and LIGATION_SITE=GATCGATC) 26 . Reads were first cut for adaptors with 282 trim_galore software 27 and then independently aligned against the genome (bowtie2, end-to-end algorithm 28 ) 283 in a 2-steps protocol to avoid chimeric reads. Only valid ligation products were kept independently for the 284 two libraries (26,067,262 and 23,907,222 respectively, for lib A and lib B) then merged together for the 285 9 interaction map construction. The genome was divided into equally sized bins and number of contacts 286 observed between each pair of bins, was reported. Finally contact maps were plotted with HICPlotter 287 software 29 . The high collinearity between the genetic map based pseudomolecules anchoring ( Figure 1) and 288 Hi-C based contact map information corroborated the overall assembly quality. 289 290

3.4
Localization of centromeres 291 Centromeric repeats are expected to have a very conserved length, with sequence variations. To localize 292 biological centromeres, first we detected tandem repeats (TRs) genome-wide using the TRF software 30 , with 293 parameters "2 7 7 80 10 80 2000 -d -m -l 16", and obtained 11,069 TR motifs. We used Blastn with 294 parameters "M=2 N=-5 Q=7 R=7 E=1e-10 wordmask=none filter=none V=10000000 B=10000000" to count 295 the number of occurrences of each TR pattern on the genome ( Supplementary Fig. 10a). We selected patterns 296 of an over-represented length in the genome (lengths: 61-65, 75-80, 92-97, 115-118, 159-162, 175-176, 522-297 526, 1044-1053), that were then assembled into contigs by length, with Cap3 31 . We obtained 931 contigs 298 that we mapped on the genome using Blastn, with parameters "M=1 N=-1 Q=2 R=2 E=1e-10 299 V=2147483647 B=2147483647 gapS2=500 gapX=500 kap". 108 contigs that had more than 1,000 300 occurrences in the genome, were kept. We then visually inspected the distribution of their sequence coverage 301 along the pseudomolecules by looking for TR highly repeated localized in a narrow region of each 302 chromosome, with a strong anti-correlation with gene density, and a correlation with TE density. We 303 selected 13 TR motifs of 61-65, 92-97 and 159-162 in length. Their combined density along the genome 304 (shown in Supplementary Fig. 10b), allowed to localize the centromere for each chromosome.   Table 6). Around 0.55% of the total contig length is 339 represented as ambiguities, and more than 93.8% of these ambiguities have exactly two forms. We believe 340 these ambiguities represent the residual polymorphism between haplotypes, for the fraction of the genome 341 that hasn't been resolved in two distinct haplotypes. After scaffolding, we obtained an assembly of 882.7 Mb 342 (Supplementary Table 6). 343 344

Validation of assembly completeness and separation of haploptypes 345
The assembly sequence was assessed with BUSCO v3.0.2b 32 which found 1,346 complete gene models out 346 of 1440 (93.5%) and 28 fragmented (1.9%); 73.8% of complete genes are in more than one copy, while this 347 is the case for only 4.5% of the homozygous genome. We mapped the 80,714 rose transcripts from 33 with 348 Blastn (parameters: "E=1e-8 W=9 wordmask=dust links hspsepSmax=12000") and est2genome 34 . 349 Supplementary Fig. 11 displays the distribution of the number of matches depending of the applied identity 350 percent cutoff. We found that at 90% sequence identity cut-off, 76.9% of transcripts have at least one match, 351 and around 71.5% among them have exactly two matches. Along with the overall heterozygous assembly 352 length (882.7 Mb, for an estimated haploid size of 560 Mb), these results show that our assembly process 353 managed to discriminate the two alleles for around 70% of the genes. The homozygous R. chinensis RcHzRDP12 genotype was obtained from microspores culture (Extended 358 Notes 2) and therefore underwent a meiosis. To identify putative loci of crossing-overs that occurred during 359 meiosis, we mapped Illumina reads from 5 distinct libraries from the heterozygous genome (paired-ends 370 360 bp, 480 bp and 630 bp, mate-pairs 3.3 kb and 5.4 kb; Supplementary Table 6) on the constructed pseudo-361 chromosomes and we counted pairs in which only one read had a match, in 10 kb-long windows.

362
Normalization was made using the number of consistent pairs for each library. We observed 50 windows 363 with over-represented one-end mapped pairs in at least two libraries. They were then kept as candidate 364 crossing-over loci (indicated as horizontal dashed lines on Supplementary Fig. 12 the 8 genotypes and the homozygous R. chinensis. The outcome is shown on Supplementary Fig. 12, with 375 red lines of three different intensities depicting the three similarity cutoffs. Conservation can be higher than 1 376 at a low stringency due to repeated sequences. 377 The observed segmental conservation pattern was in accordance with the inferred close relationship of the 378 genotypes. Moreover, the opposite patterns of conservation with WIC and GIG/SPO (high conservation with 379 one genotype and low conservation with the other genotype) confirmed that the haplotype extracted in 380 RcHzRDP12 is a mosaic of genomes closely related to the sequenced WIC and GIG/SPO, thus confirming 381 the hybrid origin of 'Old Blush'. Six candidate crossing-overs perfectly co-localized with breakpoints in the 382 conservation between the homozygous and heterozygous R. chinensis genomes or with inferred parents. It is 383 to note that crossing-overs that happened in regions where the two haplotypes of the heterozygous genome 384 have the same relative conservation with WIC and GIG/SPO could not be confirmed by this method. 385 Conservation between homozygous and heterozygous R. chinensis genomes also showed a segmental 386 pattern ( Supplementary Fig. 12 alignment. Chained alignments spanning less than 50% of the length of the database protein were removed. 433 The Illumina-based RNAseq datasets described in 5.1 were assembled with an iterative k-mer strategy based 434 on velvet 44 , parameters: -cov_cutoff 4 -read_trkg yes -exp_cov 100 -min_contig_lgth 150 -max_divergence 435 0.05 -long_mult_cutoff 0) allowing a homogenous integration of RNAseq data with two additional public 436 datasets of Sanger, 454 and unigene sequences (Genbank January 2015 45 . The four sets of "expressed 437 sequence tags" were aligned on the genome using gmap 46 and only the best scoring hit was kept. Spliced 438 alignments spanning at least 30% of the EST sequence length at a minimum of 97% identity were retained. 439 In case of splicing ambiguity, the introns with the highest number of occurrences in the four datasets were 440 selected. For each Rosa chinensis predicted protein, we only kept its matches with Rosa proteins bidirectional and 464 better than any match against another species. We then looked for cliques in the graph of alignments, 465 defining them as putative "allele sets". Most of the allele sets contains one gene model from the homozygous 466 genome, and two from the heterozygous genome (10,148 out of 27,287; Supplementary Table 7), 467 corresponding to the canonical case where the two alleles have been resolved in the heterozygous genome. 468 7,813 alleles sets contain one homozygous and one heterozygous gene models, corresponding to cases where 469 alleles were assembled as a consensus in the heterozygous genome. Other cases could be due to gene 470 duplications and/or gene losses having occurred independently in the two haplotypes, or to residual 471 transposable elements in our gene annotation.

473 474
By aligning the complete nucleotide sequence of genes predicted in one assembly on the genome sequence 475 of the other assembly with Blastn (parameters: "M=1 N=-3 Q=3 R=3 E=1e-30 wordmask=dust 476 hspsepSmax=30 hspsepQmax=30 links sump") and looking for overlaps between matches and genome 477 annotation, we built a correspondence table between genes from the two genomes, provided as 478 Supplementary Data 1. 479 480

tRNA and rRNA annotation 481
Transfer RNA genes were predicted using tRNAScan-SE v1.3 48 with parameters "-t R -C". Only 482 predictions with scoring higher than 20 were kept. We obtained 757 predicted tRNA genes, and 114 483 predicted pseudogenes. 1,153 tRNA genes and 155 pseudogenes were predicted in the heterozygous 484 genomes. Ribosomal RNA genes were predicted using RNAmmer v1.2 (RDNAs 49 ), with eukaryotic 485 parameters set for nuclear chromosomes and bacterial parameter for organellar chromosomes. We obtained 486 313 predicted rRNA genes. Most of them were on chromosomes 1 (149 genes) or 3 (123 genes). 49 rRNA 487 genes were predicted in the heterozygous genomes. 488 489

De novo transposable element annotation 491
We used the REPET package (https://urgi.versailles.inra.fr/Tools/REPET) to produce a genome-wide 492 annotation of repetitive sequences on the homozygote PacBio genome (7 pseudo-chromosomes and 46 493 unassigned contigs) and the heterozygote Illumina genome (15,938 scaffolds) (see Online Methods). In this 494 genome, the most abundant TE fraction is retrotransposons also called class I elements (31.6%) and in 495 particular, Long Terminal Repeat retrotransposons (LTR-RTs) represent 22.9% with Ty3/Gypsy superfamily 496 being more abundant than Ty1/ Copia superfamily. Non-LTR retrotransposons (LINE and potential SINE) 497 contribute approximately to 7% and class II elements (DNA transposons and Helitrons) to 11.6%. The 22% 498 remaining correspond respectively to unclassified repeats (7.87%), chimeric consensus with two 499 classifications (7.51%) and to potential host genes repeated in this genome (around 6%). These genes were 500 identified and kept in this study. We also identified 2,765 caulimoviridae insertions, representing 1.25 of the 501 genome ( Supplementary Fig. 4a,b; Supplementary Table 1). 502 Finally, we used this library of 3,933 consensuses to annotate the TEs copies in the heterozygote Illumina 503 genome assembly (15,938 scaffolds). Each consensus has at least one copy on the heterozygote genome and 504 the global and non-redundant TE content in the final annotation was 54.7% based on 746 Mb of sequence 505 assembly excluding undefined bases (Ns). The TE families distribution in this genome is the same as in the 506 homozygote with some difference for the Ty3/Gypsy superfamily (9.8%), class I-LARD elements (0,7%) 507 and chimeric (4.4%) ( Supplementary Fig. 4a reverse-crosslinked by adding 20 µL of 5 M NaCl and incubated over-night at 65°C. Reverse-cross-linked 523 DNA was submitted to RNase and proteinase K digestion, and extracted with phenol-chloroform. DNA was 524 ethanol precipitated in the presence of 20 µg of glycogen and resuspended in 20 µL of nuclease-free water 525 (Ambion) in a low-bind DNA tube. Ten nanograms of IP or input DNA was used for ChIP-Seq library 526 construction using NEB-Next Ultra II DNA Library Prep Kit for Illumina (New England Biolabs) according 527 to manufacturer's recommendations. Ten PCR cycles were used for all libraries. The library quality was 528 assessed with Agilent 2100 Bioanalyzer (Agilent), and the libraries were subjected to high-throughput 529 sequencing by NextSeq 500 (Illumina). 530 531

ChIP-Seq bioinformatics analysis 532
Preprocessing of sequenced reads for quality was performed using FASTQC 70 . A single end library 533 H3K27me3 and a paired end library H3K9ac and theirs corresponding inputs were cleaned and trimmed with 534 trim_galore 27 with following parameters: mean Phred quality score greater than 20 ; read length greater than 535 10 after trimming ; retain unpaired reads. Remaining reads were aligned onto the R. chinensis genome with 536 bowtie2 28 with a maximum mismatch of 1 bp and unique mapping. Result files were formatted with 537 samtools 71 and coverage calculated with Picard tools 72 . To determine the target regions of H3K9ac ChIP-538 Seq, the Model-based Analysis of ChIP-Seq (MACS2) 73 was used (number of duplicate reads at a location:1; 539 nandwidth:300; mfold of 5:30; q-value cutoff:0.05). SICER was used to detect H3K27me3 modification 540 regions SICER was used (window size:200, gap size:600) 74 . HOMER 75 was used to associate H3K9ac peaks 541 were located into a -2kb;+1kb windows around the gene TSS. To associate H3K27me3 genes, bedtools 542 intersect 76 was used to keep genes that are overlapped with a H3K27me3 region. Genes and mark densities 543 were calculated using Rstudio [RStudio Team] and plotted with Rstudio and Circos 77 for circular 544 visualization. The average coverage profile along the genic region and 1 kb gene flanking region was plotted 545 using NGSplot 78 To cluster the H3K9ac and H3K27me3 peaks, linear normalization and clustering of tag 546 density with Density Array method (window size 50 bp; 2 kb gene flanking region) was performed using 547 SeqMINER 79 . 548 17 549

550
Genome-wide studies in plants have provided evidence for the role of H3K9ac and H3K27me3 in gene 551 activation and repression, respectively 80-84 . The roles of these histone modifications in rose remain unknown 552 and represent a represent a limitation to the full understanding of how thousands of bioprocesses are 553 regulated. To determine the genomic landscape of these marks, we performed a ChIP-seq analysis using 554 H3K9ac and H3K27me3 antibodies on petals from a heterozygous plant. A minimum of 17 millions of 555 mapped reads was obtained ( Supplementary Fig. 13a). The MACS2 and SICER algorithms, which are 556 designed to detect sharp and broader histone peaks, respectively 73,74 , were used to determine loci that are 557 significantly enriched with H3K9ac or with H3K27me3 ( Supplementary Fig. 13a,b). We identified 23,770 558 H3K9ac marked genes and 11,223 H3K27me3 marked genes for homozygous genome; 28726 H3K9ac 559 marked genes and 15850 H3K27me3 marked genes for heterozygous genome ( Supplementary Fig. 13b). 560 Next, we analyzed the distributions of the two histone marks at the chromosome and gene levels. To 561 analyze the genome wide distribution, we used the homozygous assembly. However, in order to capture both 562 haplotypes diversities, all gene level analysis were performed on heterozygous assembly. At the 563 chromosomal scale, we observed an enrichment of both marks in gene-rich regions, which is consistent with 564 the role of these histone marks in the control of gene expression (Supplementary Fig. 13c,d). In order to 565 detail the H3K9ac and H3K27me3 distributions at the gene level, the peaks obtained for both modifications 566 were analyzed. We found that the peak length of H3K9ac ranged from 400 bp to 800 bp (Supplementary Fig.  567 13e), located preferentially at the TSS regions, (Supplementary Fig. 13f). In contrast, H3K27me3 peaks 568 presented an averaged length that ranged from 4,000 bp to 8,000 pb, covering the entire gene body 569 ( Supplementary Fig. 13e,g). Those patterns were consistent with previous studies on different plant species, 570 highlighting conserved aspects of the epigenetic system in the plant kingdom 85 . As expected, integration of 571 H3K9ac and H3K27me3 data sets showed an anti-correlation between those two marks (Supplementary Fig.  572 13h). Altogether, these results show that in rose, as in other plant species, H3K9ac and H3K27me3 are 573 distributed along the gene body, supporting the role of these two marks in gene regulation. 574 To connect H3K9ac and H3K27me3 histone marks with gene expression, we generated and integrated 575 RNA-seq data. We confirm that H3K9ac and H3K27me3 in rose are associated with gene expression and 576 gene repression, respectively (Supplementary Fig. 13k). Genes that are associated with both marks show an 577 intermediate expression profiles. To determine if the level of acetylation or methylation could be correlated 578 with gene expression, we equally divided all the genes into four groups, based on their expression levels.

579
Then we plotted them on their H3K9ac or H3K27me3 profile ( Supplementary Fig. 13i,j). We observed that 580 H3K9ac level increases with expression level while H3K27me3 showed the opposite pattern, where it 581 displayed a high enrichment in the lowest-expressed genes. These results suggest that in rose the more a gene 582 is marked by H3K9ac and H3K27me3, the more it will be expressed and repressed, respectively. 583

Rosaceae genome evolution for translational research 584
In order to assess the paleohistory of R. chinensis within the Rosaceae family, we performed a comparative 585 genomic investigation of Rosa with apricot (Prunus mume 57 ), peach (Prunus persica 58 ), apple (Malus 586 domestica 53 ), pear (Pyrus bretschneideri 56 ) and strawberry (Fragaria vesca 51 ), using the genome alignment 587 parameters and ancestral genome reconstruction methods described in Salse 2016 86 . Conserved gene 588 adjacencies deliver an ancestral Rosaceae karyotype (ARK) consisting of 9 protochromosomes (or 589 Conserved Ancestral Regions, CARs) with 8861 protogenes ( Supplementary Fig. 5a, top). The complete dot-590 plot based deconvolution into nine reconstructed CARs of the observed synteny and paralogy between ARK 591 and the investigated species validate the nine proposed protochromosomes as the origin of Rosaceae 592 ( Supplementary Fig. 5a, bottom). Our evolutionary scenario, reconciling the modern genome structures to 593 the founder ARK, clearly established that apricot and peach emerged from an ancestral Prunoideae 594 karyotype (APK) structured in 8 protochromosomes (with 16333 protogenes) deriving from ARK through 2 595 ancestral chromosome fissions and 4 fusions. The duplication of ARK followed by at least 11 ancestral 596 chromosome fissions and 12 fusions, shaped the ancestral Maloideae karyotype (AMK) in 17 597 protochromosomes (with 12,634 protogenes), as the founder ancestor of the modern apple and pear 598 genomes 53 , while no similar duplication was found in Rosa or Fragaria genomes. Finally, the ancestral 599 Rosoideae karyotype (ARoK) of the modern strawberry and rose genomes, structured into 8 600 protochromosomes with 13,070 protogenes, derived from ARK through one ancestral chromosome fission 601 and 2 fusions. While the modern strawberry genome experienced an extra ancestral chromosome fusion from 602 ARoK to reach its modern genome structure, rose genome went through one fission and 2 fusions, 603 independent from strawberry, to reach its modern genome structure. Our comparative genomics-based 604 evolutionary scenario unravels the Rosaceae paleohistory from the reconstructed ancestral Rosaceae 605 karyotype (ARK) with 9 protochromosomes and 8,861 protogenes delivering the complete catalog of 606 paralogous and orthologous gene relationships between the modern Rosaceae genomes as well as the 607 reconstructed ancestor (ARK, APK, AMK, ARoK). The gained knowledge can now be used as a guide to 608 perform translational research between the six-investigated species to accelerate the dissection of conserved 609 agronomical traits ( Supplementary Fig. 5a, bottom). 610 Rosoideae radiative evolution: The relative phylogenic relationships between rose, raspberry and 611 strawberry, all from the Rosoideae subfamily, are currently weakly supported, due to a lack of molecular 612 data 87,88 . The hypothesis is that Rosa and Fragaria diverged more recently from one another than from 613 Rubus. We used our rose genome sequence, and that of Rubus occidentalis 52 and Fragaria vesca 51 to address 614 this question, using Malus x domestica 54 as an outgroup. 615 We selected the 748 genes that were identified as complete and in unique copy in the four genomes with 616 BUSCO plant/embryophyta_odb9 dataset 32 (v3.0.2b). Based on their coding sequences, we computed 748 617 individual trees, using MUSCLE V3.8.31 89 and PhyML v3.1 89 with parameters "0 I 1 1000 HKY e e 4 e 618 BIONJ y y". We observed that 61.5% of the trees had a bootstrap value of 996/1000 or more. Among them, 619 68.7% support the hypothesis of a shorter distance between Rosa and Fragaria, compared with Rubus 620 ( Supplementary Fig. 5b, barplot). The consensus tree obtained from the concatenation of 600 gene CDSs 621 with the same method, with an additional step using Gblocks v0.91b 90 showed the same tendency 622 ( Supplementary Fig. 5b, bottom right). However, by plotting the Rosa-Fragaria and Rosa-Rubus 623 phylogenetic distances gene by gene ( Supplementary Fig. 5b, dot plot in lower panel), we observed that the 624 dots followed the diagonal (in blue) and that the slope was only marginally different from 1 (5% confidence 625 interval in red). These results favor the hypothesis that the three genera diverged approximately at the same 626 time, suggesting a process of evolutionary radiation inside the Rosoideae subfamily. 627 Despite being evolutionary close to each other, Rosa and Fragaria have differing genome size, respectively 628 560 and 240 Mb. We retrieved 3 datasets of genomic reads from distinct Fragaria vesca subspecies from 629 NCBI (SRR1513870, SRR1513871 and 1513872) to compare the fraction of repeated k-mers to the one of 630 our Rosa sequencing data. Individual reads were cleaned, and regions with a Phred quality lower than 26 in 631 average along a 4 bp sliding window were trimmed. Reads shorter than 55 bp were discarded. We filtered 632 out reads matching R. Illumina paired-end reads of the four Rosa species with read lengths greater than 100 nt were mapped to the 682 reference genome with the GLINT software (http://lipm-bioinfo.toulouse.inra.fr/download/glint/), with the 683 following parameters: --no-lc-filtering --best-score --mate-maxdist 10000 --lmin 80 --mmis 16 --step 2. The 684 mismatch cut-off was increased to 24 for the ten Rosa species with read lengths equaling 150 nt. 685 Variants were called for each genotype with SAMtools mpileup 71 and Varscan 92 , with the following 686 parameters for low coverage genotypes: min-coverage=5, min-reads2=5, --min-avg-qual 15, min-var-687 freq=0.1 --p-value 0.01 and with more stringent parameters for the high coverage 'Old Blush' heterozygous 688 genotype: --min-coverage 50 --min-reads2 25 --min-avg-qual 15 --min-var-freq 0.1 --p-value 0.01. Variants 689 with a mapping coverage higher than 60 and 300 in the fourteen resequenced Rosa species and in the R. 690 chinensis 'Old Blush' genotype respectively, were filtered out. 691 692

Origin of the Rosa chinensis 'Old Blush' genotype 693
The section Chinenses comprises old cultivated Chinese roses that are supposed to result from crosses 694 between two wild species, R. gigantea, and R. chinensis 'Spontanea', a rare wild species 2 . One of the first 695 Chinese roses used in the creation of modern roses, transmitting the recurrent flowering character was 'Old 696 Blush' (= Parson's Pink China  Fig. 14c). Such low values were not observed in the resequenced 710 genotypes of the Synstylae section genotypes (Supplementary Fig. 14e-g), nor in R. chinensis 'Spontanea',711 which thus appears as a lesser contributing ancestor of 'Old Blush' ancestor ( Supplementary Fig. 14d). This 712 is corroborated by the data in Supplementary Note 4.4 indicating that although a genotype closer to the 713 sequenced R. chinensis 'Spontanea has transmitted its cytoplasm to 'Old Blush', the latter's genome is closer 714 to R. gigantea than to R. chinensis 'Spontanea'. Furthermore, a region extending from 30 to 46.5 Mb on 715 chromosome 3 and originating from the Chinenses section displayed a very low variant density (< 2 variants 716 per kb), but displayed normal mapping coverage ( Supplementary Fig. 12), and is therefore homozygous in 717 the R. chinensis 'Old Blush' heterozygote genotype ( Supplementary Fig. 14b capillary column (J&W Agilent), at a film thickness 0.25 µm, with helium as the carrier gas at a flow rate of 748 1 mL min -1 . The GC oven temperature was programmed to increase from 40°C to 180ºC at rate of 1.50ºC 749 min -1 , and from 180ºC to 290ºC at rate of 10ºC min -1 and was finally maintained at 290°C for 1 min. All 750 experiments were performed at least two times. 751 The chromatographic data were analyzed using the Data Analysis software (Agilent)  Perl scripts were used to obtain files corresponding to scaffolds with interesting gene sequences (fasta 771 format), RNA-seq contigs 33 and RNAseq data, and automatic annotation. These files were visualized in 772 Artemis, a genome browser and annotation tool 100 . The transcriptome datasets were used for curation to 773 verify the automatic annotation of each gene sequence, using Artemis. When necessary, a manual annotation 774 was performed to correct the automatic annotation and a new file (gff format) was created. To check the 775 automatic predicted function of each gene sequence studied, a BLASTN search in NCBI using predicted 776 mRNA as the query, was performed and the results were reported in a new file (Genbank format). 777 The results of this manual annotation with predicted functions are presented in Supplementary Data 4. Genes 778 are organized according to the biosynthetic pathways. For each gene, the FPKM using the EST data 33 and the 779 predicted function by manual annotation and by automatic annotation are given. Generally, more than one 780 sequence corresponded to one Blast query. These sequences were considered as homologous copies of the 781 studied gene, and could be allelic variants or different gene copies. Supplementary Data 7 provides the 782 correspondence between heterozygous IDs and the reference genome annotation (homozygous) and helps 783 identify putative alleles for scent genes. 784 785

786
The emblematic rose perfume is a bouquet of more than one hundred VOCs, composed of terpenoids, 787 benzenoids/phenylpropanoids, fatty acid derivatives and others chemical families such as fatty acid 788 derivatives or phenolic methyl ethers (PME only found in species in the Chinenses section (R. chinensis and R. gigantea). 797 The enzymatic pathways of the VOCs are only partially known in roses 101-105 and many biochemical steps 798 remain to be discovered. Data mining of the rose genome reveals candidate genes for this perspective. We 799 took advantage of the rose genome to identify and reconstruct the biosynthesis pathways associated with the 800 relevant scent compounds. 801 802

809
TMB is synthetized by three successive methylations of phloroglucinol, the first step being catalyzed by a 810 phloroglucinol-O-methyl transferase (POMT) 107 ( Supplementary Fig. 15). The next steps are probably 811 25 catalyzed by OOMT1 and OOMT2. The origins of orcinol and phloroglucinol are not well documented. A 812 phloroglucinol synthase has been characterized in brown algae 108 and an orcinol synthase homologous to a 813 bacterial gene has recently been discovered in Rhododendron dauricum 109 . These two genes belong to the 814 polyketide synthase (PKS) family. 815 Homologous genes known to act in the PMEs pathway could be found in the genome of R. chinensis 'Old 816 Blush' genome (Supplementary Data 4; Supplementary Fig. 15 We found several prenyl transferase candidate genes in the rose genome ( Supplementary Fig. 7 Functional studies and enzymatic assays of these five terpene synthases will help unraveling their putative 907 roles in terpene biosynthesis pathway in rose ( Supplementary Fig. 7). RcHt_S605.34, which corresponds to 908 the previously characterized GDS 102 , is highly expressed in open flowers. In the haploid genome, several 909 putative LINS (linalool synthase) or NES (nerolidol synthase) sequences are clustered on chromosome 5. 910 These genes are not expressed in rose petals. 911 Genes corresponding to carotenoid cleavage dioxygenases involved in ionones production (CCD1, 912 RcHt_S2152.4, RcHt_637.14 and CCD4, RcHt_S10901.1) have also been found in the genome. CCD4, 913 which shows a very high petal expression in petals at blooming and senescent stages, could be involved in 914 dihydro-ß-ionol biosynthesis in R. chinensis 'Old Blush' petals. 915 916

977
The rose homologue of EGS1 was previously characterized 119 . In R. chinensis 'Old blush', RcHt_S564.16 or 978 RcHt_S3128.4 encodes the putative homologues of EGS1. Our expression data indicate that both genes are 979 expressed in 'Old Blush', thus consistent with the fact that eugenol is not produced in this rose cultivar. We 980 identified one gene copy of the putative eugenol O-methyltransferase (EOMT) homologue, (RcHt_S23.70), a 981 gene that was previously characterized in R. chinensis 'Spontanea' 120 . In 'Old Blush', this gene shows weak 982 expression specific to stamens. 983 Benzaldehyde and benzyl alcohol biosynthesis is partially known in several plants and can be derived from t-984 cinnamic acid or from cinnamoyl-CoA 121 . PAL and C4L are the only known genes involved in this pathway. 985 Homologues of these two genes were found in R. chinensis 'Old Blush' genome. C4L copies are identical to 986 the ones identified for eugenol biosynthesis. No genes could be proposed for the last biosynthesis steps in 987 this pathway. 988 To summarize, the manual annotation of genes involved in scent production allowed us to identify candidate 989 genes in all biosynthetic pathways operating in rose flowers. Characterizing these candidate genes in other 990 rose species with different scent characteristics will help elucidate the origin of the huge diversity of scent 991 production in the Rosa genus. The rose has already been shown to synthesize some of its terpenes differently 992 from other species, via a cytosolic nudix hydrolase. The origin and localization of the precursor of these 993 monoterpenes, GPP, are unknown. Our study here shows that the plastidic MEP pathway genes usually 994 involved in the GPP synthesis, have a very low expression in the flower. A more in-depth study of the 995 contribution of the two pathways in terpenes biosynthesis in rose will show if, conversely to other plants, 996 roses use cytosolic MVA pathway to synthesize precursors of monoterpenes. Characterized genes sequences in the flavonol / anthocyanin pathway, coming from various Rosa accessions 1002 (species and cultivars) were retrieved from an GenBank public database. tblastn was then used to find their 1003 closest homologs in R. chinensis 'Old Blush'. The genes were then mapped on the assembled haploid 1004 chromosomes. When several candidates could not be distinguished (ie. for Chalcone Synthase (CHS) or 1005 Glucosyl-Transferase 1 (GT1)) we used FPKM data (described in Supplementary Notes  in vegetative and floral tissues for each candidate were obtained in order to build in-silico expression profiles 1012 and to group SPL genes by functional sub-families. Particular attention was given to those SPLs that could 1013 be involved in vegetative to floral meristem transition. 1014 Using an adapted version of WMD3 miR pipelines (Ossowski Stephan, Fitz Joffrey, Schwab Rebecca, 1015 Riester Markus and Weigel Detlef, personal communication), we build a user-friendly application facilitating 1016 the prediction of miR156 targets in the rose genome. It is based on known properties of miR/target gene 1017 interaction such as number of mismatches, no mismatch at the positions 10 and 11 (cleavage region) quality 1018 of pairing in the seed region and hybridization energy 122 . We used the canonical sequence of Arabidopsis 1019 miR156 (UGACAGAAGAGAGUGAGCUC) to identify its counterpart in the rose, and then we interrogated 1020 the rose genome to predict the rose miR156 targets. 1021 Plant MYB proteins share a conserved R2R3 MYB domain. These transcription factors are involved in the 1022 control of cell identity and fate, cell growth and division as well as in secondary metabolism, especially the 1023 phenylpropanoid pathway. BLASTp was used to search for MYB transcription factors that have conserved 1024 R2R3 motif in the heterozygous genome. MYBs with two R2R3 motifs were kept. We retrieved 215 1025 annotated MYB sequences for the rose. Whenever possible, the correspondence of these sequences with the 1026 homozygous annotation was established, to identify allelic copies of each MYB. Finally, 120 MYB genes 1027 corresponding to one or two allelic sequences were mapped on the homozygous pseudomolecules. Nagel NucleoSpin® miRNA. PVP40 was added to the samples prior to grinding. One µg RNA was treated 1034 with DNase I (Ambion® DNA-free). In order to avoid over-dilution, small RNAs were eluted on a separate 1035 column and therefore their expression had to be normalized using 5.8S rRNA. Concentration was measured 1036 using NanoDrop ND-1000 Micro-Volume (NanoDrop Technologies) before and after DNase treatment.

1037
Three biological replicates were performed for each experiment. 1038

1059
The first rose cultivars arose independently in China and the peri-Mediterranean area more than 2000 years 1060 ago. Flowers of wild roses used in domestication were mostly pink or red. Breeding and selection for 1061 brightly colored flowers led to increased anthocyanin synthesis in domesticated plants when compared with 1062 their wild progenitors. Anthocyanins, in association with other polyphenolic co-pigments such as flavonols 1063 could, therefore, be considered as the main determinants of flower color diversity in cultivated roses. 1064 Therefore, we addressed the genetic determinism and gene regulatory pathways associated with floral 1065 anthocyanins and flavonols biosynthesis that were under selection for flower color during the early history of 1066 rose cultivation and domestication. 1067 The anthocyanin / flavonol pathway in rose flowers has been described in early 90's and most of the involved 1068 enzymes are now fully characterized. In rose flowers, the last two glycosylation steps for anthocyanin 1069 aglycone were shown to be controlled by a single glycosyl-transferase (RhGT1), different from other plants 1070 where these steps are achieved by the sequential action of two distinct glycosyl-transferases 123 . 1071 Although this pathway can now be considered as well described in roses, information is still lacking on how 1072 the onset of anthocyanin biosynthesis is coordinated with floral opening, which will lead to flower color 1073 variations. In Arabidopsis thaliana, genes controlling key steps of the anthocyanin biosynthesis, such as 1074 DFR, F3'H and ANS, are transcriptionally activated in stems by a MYB-bHLH-WD40 complex 124 . 1075 Over-expression of Arabidopsis R2R3 MYB transcription factor AtPAP1, leads to increased anthocyanin 1076 contents in rose petal, associated with higher emission of germacrene D 125 . This published evidence raises 1077 the possibility of a co-regulation between anthocyanin and some terpenes biosynthesis in rose flowers. R. 1078 chinensis 'Old Blush' scent is composed of Germacrene D, but PAP1 expression could not be detected during 1079 petal development. We identified that a second R2R3 MYB transcription factor, RhMYB10, is expressed in 1080 'Old Blush' petals. MYB10 was previously identified and characterized as an inducer of anthocyanin 1081 biosynthesis genes in Rosaceae, including in the rose 126 . Our analyses, taken together with published data, 1082 suggest that RhMYB10, but not PAP1, acts as a common activator of anthocyanin and germacrene D 1083 synthesis (Supplementary Fig. 8; Supplementary Data 8). Duplication events in first and last genes of anthocyanin biosynthesis genes. 1088 Chalcone synthase catalyzes the condensation of malonyl-coA and coumaroyl-CoA into 1089 tetrahydroxychalcone (or naringenin chalcone), which is the initial substrate necessary for synthesizing 1090 downstream polyphenolic compounds such as flavonols and anthocyanins. We identified three genes that 1091 could potentially encode a functional CHS. Among these three genes, only one CHSa (Chr1g0316441) is 1092 expressed in 'Old Blush' flowers according to FPKM data. This gene located on chromosome 1 with two 1093 alleles, RcHt_S637.2 and RcHt_S2110.9. 1094 Other genes in the pathway were identified as single-copy, except for cyanidin 3,5-diglucosyltransferase, 1095 previously named as RhGT1 123 . According to our data, two functional versions of this gene stand 700 kb 1096 apart from each other on chromosome 1 (Chr1g0378941 and Chr1g0380121). Only one copy (GT1a or 1097 Chr1g0378941 / RcHt_S2665.15) is expressed in buds and opened flowers of 'Old Blush', whereas GT1b is 1098 expressed in vegetative organs and senescent flowers, suggesting that an initial duplication event of an 1099 ancestral glucosyl-transferase was followed in Rosa by a specialization of one of the two copies in order to 1100 achieve 3,5-diglucosylation of cyanidin in flowers. Orthologous genes coding for enzymes normally 1101 catalyzing the sequential two-steps glucosylation process in cyanidin mapped closely to the telomeric ends of 1102 chromosomes 1 and 2. These two genes show very low expression levels in flowers, compared to GT1a. 1103 Other genes in the pathway were single-copy and were mapped on R. chinensis 'Old Blush' pseudo-1104 chromosomes ( Figure 3). 1105 Expression of most genes in the anthocyanin biosynthesis pathway, except F3H and GT1, increased during 1106 petal growth and pigmentation, between stage 1 and stage 3. RT-qPCR expression analyses of anthocyanin 1107 biosynthesis genes in petal ( Supplementary Fig. 8b) correlated with the in silico expression data 1108 ( Supplementary Fig. 8a). The small observed differences in expression levels could be explained by the fact 1109 that in silico transcriptomes were performed on bulk floral organs (sepals, stamens, carpels and hypanthium) 1110 compared to petals for the RT-qPCR experiments. 1111 1112

Regulators of anthocyanins pigments and flavonols co-pigments 1113 1114
Squamosa Promoter-binding Like (SPL) genes and miR156-miR157 expression patterns are consistent with 1115 a possible role in anthocyanins and flavonols synthesis. In Arabidopsis thaliana, anthocyanin synthesis is 1116 regulated by the miR156 -SPL9 module in an age-dependent manner. SPL9 destabilizes the MYB-bHLH-1117 WD40 complex, hampering anthocyanidin synthesis. High expression of miR156 promotes SPL9 1118 degradation, which in turn enables anthocyanidin synthesis. In rose petals, previous report shows that 1119 miR156 expression increases in response to ethylene and negatively correlates with SPL expression 127 . Here, 1120 we focused on the miR156 -SPL regulatory module, in order to identify the transcription factors that are 1121 most likely involved in controlling anthocyanidin production in rose flowers and that could influence flower 1122 color, by its action on the formation of MYB-bHLH-WD40 complex. 1123 Sixteen loci corresponding to putative SPL genes were predicted ( Supplementary Fig. 9). Among them, 1124 though harboring the characteristic zinc-finger domain, RcHt_S7297.1 is truncated. Among the 15 remaining 1125 SPL genes, 10 were predicted to be targets of miR156. Eight out of these 10 predicted targets show a 1126 decreased expression between floral development stage IMO (early floral organs) and OFT (open flower) 1127 ( Supplementary Fig. 9). Such a decrease, although occurring in flowers instead of stems, as in Arabidopsis, 1128 might respond to the increase of miR156 expression during the course of floral opening. The rose gene 1129 RcHm4g0437871 (rose SPL9 like) shares high identity with AtSPL9. We quantified rose SPL like expression, 1130 by RT-qPCR (Figure 3), in 'Old Blush' petals at three stages (from non-colored to maximum pigmentation at 1131 the beginning of anthesis) and then we correlated its expression with genes in the anthocyanin biosynthesis 1132 pathway. Previously, it was reported in Arabidopsis that accumulation of miR156 correlates with low 1133 expression of SPL9 124 . Our RT-qPCR quantifications of miR156 expression in rose petals show that high 1134 expression levels of miR156 correlate with a decrease of SPL expression during petal growth and 1135 pigmentation processes. These results, taken together with previously reported data in Arabidopsis and the 1136 rose, are consistent with the miR156-SPL9 module playing a role in anthocyanin synthesis, through SPL 1137 destabilizing the MYB-bHLH-WD40 complex, which activates the final enzymes of the pathway in the rose 1138 (Figure 3; Supplementary Fig. 8). 1139 1140 1141 1142 Comparative RNA-seq analysis of transcriptome dynamics in R. chinensis showed that seven MYBs were 1143 upregulated and one MYB was down-regulation during petal growth 128 . We identified candidate MYB that 1144 show a specific pattern of expression to flower tissues at different developmental stages (Supplementary 1145 Data 8). Strikingly, only one MYB (RcHm2g0172331; RcHt_S1331.19 / RcHt_S2066.7) was found to be 1146 highly expressed and specific to three developmental stages of the rose flower. Moreover, its expression 1147 increased from closed flower buds to open flowers. This MYB is related to At MYB21 and AtMYB24, 1148 which was previously shown to play a role in petal and stamen elongation in Arabidopsis 129 . MYB21 is also 1149 required for the activation of PHENYLALANINE AMMONIA-LYASE (PAL), the first enzyme in the 1150 phenylpropanoid pathway, that leads to secondary metabolites such as flavonoids (flavonols and 1151 anthocyanidins) and lignins (Supplementary Data 8). 1152 We performed phylogenetic analyses including MYB proteins from Fragaria and Malus, whose functions 1157 have been reported as activators of anthocyanin biosynthesis 130,131 (Supplementary Fig. 19). RcHt_S286.29 1158 from R. chinensis is the predicted most similar gene to rose RhMYB10 126 , previously shown to be associated 1159 with anthocyanin biosynthesis in Rosaceae 126 . 1160 1161

Coordination of pigments and volatiles synthesis 1162 1163
SPL genes and miR156-miR157 expression patterns are consistent with a possible role in germacrene-D 1164 synthesis 1165 1166 It was previously reported that over-expression of the Arabidopsis PAP1, a MYB activator of anthocyanin 1167 synthesis and possible sub-unit of the MYB-bHLH-WD40 transcriptional activator, in the rose triggers ANS 1168 overexpression but was also associated with Germacrene-D synthase (GDS) over-expression 125 . Two genetic 1169 copies corresponding to putative GDS were mapped on R. chinensis chromosomes. The first GDS gene copy, 1170 corresponding to that functionally characterized by Guterman et al 102 , is highly expressed in the petals of 1171 opened rose flowers. The second GDS gene copy, is also highly expressed in petals of open flowers, but also 1172 showed high expression levels in senescing flowers ( Supplementary Fig. 9). Functional characterization is 1173 needed to know if this second gene has a GDS function. Both expression patterns are evocative of the 1174 expression pattern of ANS. Given that PAP1 has been suggested as a possible activator of GDS expression 125 , 1175 we hypothesize that its action on GDS is mediated by the SPL9-miR156 regulatory module, which gives a 1176 functional basis to the necessary coordination of pigments and volatile molecule synthesis for pollinator 1177 attraction (Figure 3; Supplementary Fig. 8). 1178 To further address this hypothesis, we compared the expression of candidates for SPL (RcHm4g0437871), 1179 ANS (RcHm7g0199941), GDS (RcHm5g0038101), and RhMYB10 (RcHm3g0448721) in petals of two rose 1180 plants exhibiting contrasted flower colors: R. chinensis 'Sanguinea' which has petals that accumulate high 1181 levels of anthocyanins at flower opening, and R. hybrida 'Alister Stella Gray' which has petals that do not 1182 accumulate anthocyanins. In 'Sanguinea', SPL was expressed in non-colored petals (flower buds), and its 1183 expression was downregulated in colored petals ( Supplementary Fig. 20), thus similar to 'Old Blush'. SPL 1184 expression correlated with low ANS and GDS expression in flower buds before color production 1185 ( Supplementary Fig. 20). In the colored petals of 'Sanguinea', SPL downregulation correlated with the 1186 upregulation of both ANS and GDS expression, thus corroborating the data observed in 'Old Blush' ( Figure  1187 3b; Supplementary Fig. 20). In 'Alister Stella Gray', we observed that the expression of SPL, GDS, and ANS 1188 was very low at both analysed stages (flower bud and flower opening). The data show that the anti-1189 correlation of expression between SPL on one side, and ANS and GDS on the other side, is observed only in 1190 the colored flower cultivars 'Sanguinea' and 'Old Blush'. 1191 RhMYB10 exhibited similar expression patterns in both 'Sanguinea' and 'Alister Stella Gray' roses. 1192 RhMYB10 was expressed at low levels in flower buds and its expression increased in developing petals 1193 ( Supplementary Fig. 20).

1194
The positive co-regulation of ANS and GDS expression in anthocyanins-accumulating flowers and their anti-1195 correlated expression with SPL are other arguments favoring the hypothesis that anthocyanins and 1196 germacrene D biosynthesis could be coupled and achieved through the miR156-SPL regulatory module. 1197 These data also suggest that RhMYB10 expression is likely not the determinant factor, but rather it is the 1198 putative action of SPL on MYB-bHLH-WD40 complex, which activates the final enzymes of anthocyanins 1199 and germacrene D synthesis in rose (Figure 3; Supplementary Fig. 8).

Auxin Response Factor gene family 1202
Parts of this work were performed on the heterozygous assembly. The table in Supplementary Data 1  1203 shows heterozygous IDs matched with their reference genome annotations (homozygous). 1204 To identify Auxin Response Factor (ARF) gene family members in R. chinensis, the predicted proteins 1205 associated with the domain PF06507 (Auxin Response Factor) were extracted. From the 37 predicted protein 1206 sequences (Supplementary Data 5a), six were excluded from the phylogenetic analysis because they were 1207 highly truncated or contained very divergent regions (RcHt_S12618.1, RcHt_S1403.1, RcHt_S2738.6,  1208 RcHt_S2297.1, RcHt_S2297.6, and RcHt_S1950.5, indicated "No" in the column "Used for phylogenetic 1209 analyses"). These 31 protein sequences were aligned together with the sequence of 22 Arabidopsis ARF 1210 proteins (ARF23 was not included as it has a truncated DNA Binding Domain due to an early stop codon, 1211 and appears to be under negative selection, Supplementary Data 5b) using MAFFT 1212 (http://mafft.cbrc.jp/alignment/software/ 132 with the following parameters: (1) Iterative refinement methods: Parts of this work were performed on the heterozygous assembly. The table in Supplementary Data 1  1229 shows heterozygous IDs matched with their reference genome annotations (homozygous). 1230 To identify type II MADS-box family members, the R. chinensis predicted protein dataset was searched by 1231 local BLAST analysis with BioEdit software 134 , using Arabidopsis representatives of the major MADS-box 1232 subfamilies 135 as a template. Identified R. chinensis protein sequences (Supplementary Table 3) were 1233 assigned to any of the major MADS-box subfamilies based on homology scores and the presence of small 1234 conserved (C-terminal) peptide motifs that are diagnostic for the different subfamilies 136 . To generate the 1235 Neighbor-Joining (NJ) trees shown in Supplementary Fig. 22, protein sequences were first aligned using 1236 ClustalW 137 and aligned regions (Supplementary Data 6) were selected for phylogenetic analysis. NJ trees 1237 were computed with Treecon software 138 using the following parameters: (1) Distance estimation options: yes. All trees were rooted using the Arabidopsis FUL protein, except for the AP1/FUL subfamily, for which 1242 Arabidopsis SEP3 was used as an outgroup. For the phylogenetic analysis, rose and Arabidopsis proteins 1243 were each time compared, except for the B-function/Bsister MADS-box subfamilies, for which in addition 1244 Petunia hybrida representatives were included in the analysis. Some of the predicted rose MADS-box 1245 proteins mentioned in Supplementary Table 3 were excluded from the phylogenetic analysis because they 1246 were highly truncated or contained too divergent regions. These gene models may correspond to pseudo-1247 genes or alternatively, may be due to erroneous protein predictions. 1248 BLAST searching the R. chinensis predicted protein data set resulted in the identification of rose 1249 representatives (Supplementary Fig. 22 underscoring the hybrid/heterozygous origin of the R. chinensis genome. In other cases, one of the predicted 1253 protein sequences within such a pair appeared incomplete (Supplementary Table 3; Supplementary Data 6), 1254 suggesting that these represent degenerated gene copies (pseudo-genes) or alternatively inaccurately 1255 predicted protein models. A more detailed analysis of the subfamilies encoding the floral homeotic ABC 1256 functions, show that the rose genome contains MADS-box proteins in copy numbers comparable to other 1257 eudicot species, with 1 AGL6-like gene, 3 SEP-like genes, 2 FUL-like genes, 1 AP1-like gene, 1 AP3-like 1258 gene, 1 TM6-like gene, at least 2 PI-like genes, 1 Bsister-like gene, 1 AG-like, 1 PLE-like C-function gene 1259 and 1 AGL11-like gene (D-lineage). Because rose appears to have retained a TM6-like B-function gene in 1260 parallel with its AP3-like gene, and contains more than one PI-like gene, the rose B-function more closely 1261 resembles the complex B-function of the asterid species Petunia 139,140 than the Arabidopsis B-function. 1262 Intriguingly, we failed to detect members of the flowering repressor FLC clade in rose, although Arabidopsis 1263 contains 6 members of this subfamily. This may suggest that FLC genes have been lost in rose, or 1264 alternatively, that rose FLC genes have diverged too strongly to be easily identified as FLC members. 1265 37 13. Genetic pathways involved in diploid gamete formation 1266 Like many crops, most rose cultivars are polyploids 141,142 . Ploidy diversity is a limiting factor in rose 1267 breeding. Most interploidy crosses lead to infertile progeny. In rose domestication, breeders have often and 1268 inefficiently attempted to tinker with ploidy levels to overcome this reproductive barrier. In order to cross 1269 wild species and tetraploid cultivars, chromosome numbers must first must be balanced: (i) Haploidization, 1270 halving the chromosome number, has been unsuccessfully attempted by in vitro culture of haploid cells from 1271 unfertilized ovules or ovaries and microspores or anthers. A few haploidized rose plants have been produced 1272 from in situ parthenogenesis induced by fertilization with pollen inactivated by irradiation. The  1273 parthenogenetic development of a haploid cell from embryo sacs into a new plant was induced and embryos 1274 were subsequently rescued by in vitro culture. (ii) Chromosome doubling was successfully performed by 1275 mitotic polyploidization requiring microtubule drugs to transiently block chromosome segregation in mitosis 1276 and duplicate the number of chromosomes per cell. However, in vitro chromosome doubling is typically 1277 associated with somaclonal variation and cytochimerism phenomenas. 1278 The most promising alternative for rose breeders is sexual polyploidization using 2n gametes. 2n 1279 gametes were widely used in crop breeding to directly introgress new traits from diploid species into 1280 tetraploids such as potato, manihot or alfalfa 143 . They also have proved useful in recovering fertility in 1281 interspecific amphihaploid hybrids by generating new polyploids. They highly enhanced genetic diversity, 1282 heterozygosity, and heterosis. Finally, 2n gametes are very desirable as a key step in the apomictic pathway 1283 as well. In Rosa, 2n gamete production was demonstrated to be preponderant in hybrids, genetically 1284 controlled and dependent on environmental factors like heat 144,145 . However, to date, both environmental cues 1285 and genetic pathways giving rise to 2n gametes are too insufficiently known to be routinely used in rose 1286 breeding. 1287 Over the last decade, genetic pathways leading to 2n gametes were identified in Arabidopsis and Maize. 1288 They provide a basis for developing breeding strategies that introgress new wild traits into cultivated roses 1289 and enlarge modern rose diversity and genetic background. Most orthologues of the major genes responsible 1290 for forming 2n gametes are present in the rose genome ( Supplementary Fig. 23 Blush' genome. All tested loci (blue arrow) showed that the RcHzRDP12 genome was homozygous. l, Compared k-mer frequency distribution in heterozygous and homozygous Rosa chinensis genomes. k-mers of length 47 were counted using Jellyfish 18 in the whole raw Illumina datasets and the number of distinct k-mers was plotted against their number of occurrences in the reads. The top plot displays two peaks, at 211 and 444, denoting the existence of two types of regions in the genome: some present in one copy (occ.=211), and some present in two copies (occ.=444 ≈2*211), consistent with the hypothesis that most of the genome is highly heterozygous (one copy), while a smaller part is homozygous (two copies). In the homozygous genome (bottom plot), only one peak remains, confirming that we extracted one single haplotype from R. chinensis 'Old Blush' heterozygous genome; a very small bump can be seen on the right (occ.=157), which could correspond to tandem duplications in the extracted haplotype.   . Bottom: Complete dot-plot based deconvolution into nine reconstructed CARs (dot-plot y-axis in nine colors) of the observed synteny and paralogy (dot-plot diagonals) between ARK (dot-plot y-axis) and the investigated species (peach, apple and rose as dot-plot x-axis). The complete overview of paralogous and orthologous gene relationships between the modern Rosaceae genomes as well as the reconstructed ARK are illustrated in green circles, as case example for ARK protochromosome 1 (pink), for applied translational research. b, Rosoideae radiation: Phylogenetic trees were computed based on the coding sequence of 748 genes from Rosa chinensis 'Old Blush', Rubus occidentalis, Fragaria vesca and, as an outgroup, Malus x domestica. The base hypothesis was that Rosa and Fragaria diverged more recently from one another than from Rubus. The barplot (top left) shows that most of the trees with high bootstrap values supports this hypothesis, and so does the consensus tree obtained from the concatenation of 600 genes (bottom right), but when considering the Rosa-Fragaria and Rosa-Rubus distances gene by gene (dot plot in the lower part), we observe that the dots follow the diagonal (in blue) and that the slope is only marginally different from 1 (5% confidence interval in red). This result favor the hypothesis that the three genera diverged approximately at the same time. c, Comparative k-mer analysis between Rosa species and Fragaria vesca genomes. The fraction of genome represented by repeated k-mers of length 55, 47 and 43bp is depicted by vertical bars. Rosa datasets were randomly subsampled to 2.4Gb to be comparable to Fragaria ones, and the horizontal bars depict the standard deviation over 10 randomizations. The total size of the dataset and ploidy level is given between square brackets for each genotype b Supplementary Figure 6. Origin of the cultivar R. x hybrida 'La France'. Principal component analyses (PCA) were carried out on genic variants in a dataset of 15 resequenced Rosa genotypes. The genome was partitioned into 35 chromosomal segments based on changes in structuration of variants density in the rose cultivars (cf. Figure 2 and Supplementary Data 2). PCA for each segment are represented in the same order as in Figure 2. The Chinenses, Synstylae and Cinnamomeae sections are highlighted with red, green and blue ellipses respectively. The cultivar 'La France' is drawn in orange with other cultivars drawn in black. The X and Y axes represent the first and second component of the PCA and explained 29.29 to 40.53% and 12.07 to 19.89% of the variance, respectively (cf. Supplementary Data 2)  Terpenes produced in rose genotypes used in this study, are included. The name of the enzymes acting at different steps and the putative corresponding genes are indicated. Black arrows indicate that the biosynthetic step has already been identified in rose.
Predicted SPL, that are putative targets of miR156, are highlighted in red. These SPL genes are expressed at early floral organ initiation development stages, and their expression decreases during flower opening (OFT). DBO=active axillary buds (vegetative meristem), IFL=floral bud at floral meristem transition, IMO=floral meristem and early floral organs, BFL=closed flower, OFT=open flower, SEN=senescent flower) . Whenever necessary FPKM values are given for each allelic copy of the genes and appear in blue or red histograms. Alleles identifiers for are indicated and correspondence between heterozygous and homozygous annotations is shown in Supplementary Data 1.

Chr5
Chr6 Chr7 For each identity percent cutoff (horizontal axis), the plot shows the percentage of transcripts having 1 to 6+ matches on R. chinensis 'Old Blush' genome sequence (vertical axis). We infer that transcripts having two matches (65.5% of the transcripts at cutoff=90%) correspond to genes for which the two alleles are present in the genome assembly, and transcripts having one match (26.1% at cutoff=90%) correspond to genes for which the two alleles have been assembled as a consensus. Figure 12. Crossing-over localization in RcHzRDP12 genome.

Extended Data
Yellow frame: Crossing-over localization using one-end mapped pairs (OEM). Color dots depict the ratio of OEM pairs over consistent pairs in each 10kb window along the genome. Higher values are on the right. Five Illumina libraries from the heterozygous genome have been used: PE 370bp (green), PE 480bp (brown), PE 630bp (purple), MP 3.3kb (grey), MP 5.4kb (blue). Loci where two or more libraries show a significant enrichment in OEM pairs are considered as candidate crossing-overs and have been depicted with a horizontal dashed line.
Red plots: Segmental structure of sequence conservation between Rosa species. Red curves along the chromosomes depict the level of sequence conservation between the homozygous genome and 8 Rosa genotypes, including 'Old Blush' (Supplementary Notes 8). A conservation value of 1 means that the sequences are completely identical to the homozygous one, in both haplotypes of the resequenced genotypes. Conservation can be higher than 1 at a low stringency due to repeated sequences. Centromeres are displayed as red lines on the chromosomes.  Figure 13. a, ChIP-seq mapping metrics b, Number of detected peaks for H3K9Ac and H3K27me3 marks (left). Number of annotated genes for H3K9ac and H3K27me3 marks (right). c, Distribution of mapped reads for H3K27me3 (red shades) and H3K9ac (green shades) along the 7 rose chromosomes. Local peak densities of each epigenetic mark were plotted against the genetic distance (gray) and annotation of transcripts (blue). d, H3K27me3 and H3K9ac distribution at the chromosome level. Distribution of annotated genes (blue, upper panel), H3K9ac marks (green, medium panel) and H3K27me3 marks (red, bottom panel) in flowers are plotted along the chromosome 5. e, Box plot of H3K9ac peaks length (green) and H3K27me3 peaks length (red). f, g, Average tag density profile of H3K27me3 and H3K9ac along the gene body. ChIP-Seq densities of equal bins were plotted along the gene body and 2kb region flanking the TSS or the TES. h, Heat map representing the tag density distribution of H3K27me3 and H3K9ac across all genes and a 2kb flank. i, j, Correlation of H3K27me3, H3K9ac and gene expression level. All the rose proteincoding genes were divided in 4 quantiles according to their gene expression levels (lowest and highest expression level corresponding to red and green, respectively). For each quantile the number of H3K27me3 and H3K9ac mapped reads was averaged and plotted along the gene body and 1-kb region flanking the TSS or the TES.