The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution

The domesticated sunflower, Helianthus annuus L., is a global oil crop that has promise for climate change adaptation, because it can maintain stable yields across a wide variety of environmental conditions, including drought. Even greater resilience is achievable through the mining of resistance alleles from compatible wild sunflower relatives, including numerous extremophile species. Here we report a high-quality reference for the sunflower genome (3.6 gigabases), together with extensive transcriptomic data from vegetative and floral organs. The genome mostly consists of highly similar, related sequences and required single-molecule real-time sequencing technologies for successful assembly. Genome analyses enabled the reconstruction of the evolutionary history of the Asterids, further establishing the existence of a whole-genome triplication at the base of the Asterids II clade and a sunflower-specific whole-genome duplication around 29 million years ago. An integrative approach combining quantitative genetics, expression and diversity data permitted development of comprehensive gene networks for two major breeding traits, flowering time and oil metabolism, and revealed new candidate genes in these networks. We found that the genomic architecture of flowering time has been shaped by the most recent whole-genome duplication, which suggests that ancient paralogues can remain in the same regulatory networks for dozens of millions of years. This genome represents a cornerstone for future research programs aiming to exploit genetic diversity to improve biotic and abiotic stress resistance and oil production, while also considering agricultural constraints and human nutritional needs.

The domesticated sunflower, Helianthus annuus L., is a global oil crop that has promise for climate change adaptation, because it can maintain stable yields across a wide variety of environmental conditions, including drought 1 . Even greater resilience is achievable through the mining of resistance alleles from compatible wild sunflower relatives 2,3 , including numerous extremophile species 4 . Here we report a high-quality reference for the sunflower genome (3.6 gigabases), together with extensive transcriptomic data from vegetative and floral organs. The genome mostly consists of highly similar, related sequences 5 and required single-molecule realtime sequencing technologies for successful assembly. Genome analyses enabled the reconstruction of the evolutionary history of the Asterids, further establishing the existence of a whole-genome triplication at the base of the Asterids II clade 6 and a sunflowerspecific whole-genome duplication around 29 million years ago 7 . An integrative approach combining quantitative genetics, expression and diversity data permitted development of comprehensive gene networks for two major breeding traits, flowering time and oil metabolism, and revealed new candidate genes in these networks. We found that the genomic architecture of flowering time has been shaped by the most recent whole-genome duplication, which suggests that ancient paralogues can remain in the same regulatory networks for dozens of millions of years. This genome represents a cornerstone for future research programs aiming to exploit genetic diversity to improve biotic and abiotic stress resistance and oil production, while also considering agricultural constraints and human nutritional needs 8,9 .
As the only major crop domesticated in North America, with its sunlike inflorescence that inspired artists, the sunflower is both a social icon and a major research focus for scientists. In evolutionary biology, the Helianthus genus is a long-time model for hybrid speciation and adaptive introgression 10 . In plant science, the sunflower is a model for understanding solar tracking 11 and inflorescence development 12 . Despite this large interest, assembling its genome has been extremely difficult as it mainly consists of long and highly similar repeats. This complexity has challenged leading-edge assembly protocols for close to a decade 13 .
To finally overcome this challenge, we generated a 102× sequencing coverage of the genome of the inbred line XRQ using 407 singlemolecule real-time (SMRT) cells on the PacBio RS II platform. Production of 32 million very long reads allowed us to generate a genome assembly that captures 3 gigabases (Gb) (80% of the estimated genome size) in 13,957 sequence contigs. Four high-density genetic maps were combined with a sequence-based physical map to build the sequences of the 17 pseudo-chromosomes that anchor 97% of the gene content ( Fig. 1 and Supplementary Note 1.1-1.6). This compares favourably to an assembly of another sunflower genotype (HA412-HO; Supplementary Note 1.7), based on second-generation sequencing data, in which 2 Gb of sequence are placed in 816,854 contigs and 31,392 scaffolds. The sunflower genome encodes 52,232 inferred protein-coding genes and 5,803 spliced long non-coding RNAs (lncRNAs, Supplementary Note 2.1). To build the first small-RNAmediated regulatory network for the sunflower, we identified 123 microRNA (miRNA) genes that we classified into 43 families (Supplementary Data 1), including 16 novel families. Sixty-three lncRNAs and 1,020 mRNAs are predicted to be miRNA targets, including 71 loci that probably produce secondary phased short-interfering RNAs (siRNAs, Supplementary Note 2.2).
More than three quarters of the sunflower genome consisted of long terminal repeat retrotransposons (LTR-RTs), of which 59% belong to the Gypsy evolutionary lineage. Sunflower LTR-RT lineages are predominantly young and exhibit minimal sequence divergence owing to significant expansion in the past one million years 5 . This pattern contrasts with that of DNA transposons, where the greatest density of insertions is 2-4 million years old (Extended Data Fig. 1). The LTR-RTs in the sunflower exhibit non-random patterns of chromosomal distribution and are predominantly intact (Extended Data Fig. 2 3). We found that LTR sequences display an elevated transition-to-transversion ratio, similar to that of maize 14 , probably reflecting the outcomes of epigenetic silencing. We discovered that more than 6,000 transposons have acquired gene fragments, and Helitron transposons contained significantly more gene fragments than other transposon types (P = 2 × 10 −16 ). In addition, 8% of Helitrons contained more than one gene fragment, with the most commonly acquired sequences being related to metabolism and defence (Supplementary Table 2.3.4). These findings highlight the creative potential of transposons and provide tools for understanding gene function in this model system.
To assess the palaeohistory of the Asterid family, we performed a comparative genomic investigation of the sunflower with lettuce 15 and artichoke 16 as representatives of Asterids II, coffee as a representative of Asterids I (ref. 17) and grape 18 as an outgroup. The grape genome is considered to be the closest modern representative of the ancestral eudicot karyotype (AEK) consisting of 7 (pre-γ ancestor) or 21 (post-γ ancestor) protochromosomes, with γ indicating the ancestral whole-genome triplication of the Eudicots (WGT-γ ) 19 . We identified orthologous genes between the sunflower and grape-coffeelettuce-artichoke as well as paralogous genes within the sunflower (Supplementary Data 2 and Supplementary Note 3.1), coffee and artichoke genomes. In addition to WGT-γ (common with grape, artichoke, lettuce, coffee and sunflower) we established that sunflower, lettuce and artichoke experienced a whole-genome triplication (WGT-1) 15,16 , which has recently been proposed as independent genome duplications that are close in time 6 . A minimum of 3 chromosomal fissions and 57 chromosomal fusions were necessary for the lettuce to reach its current structure of 9 chromosomes, and 14 fissions and 60 fusions for the artichoke to reach 17 modern chromosomes. The sunflower experienced a much more complex evolutionary history with a lineage-specific whole-genome duplication (WGD-2, around 29 million years ago), in addition to the shared ancestral WGT-γ (dating back to around 122-164 million years ago) and WGT-1 (around 38-50 million years ago), plus 17 chromosomal fissions and 126 chromosomal fusions that finally shaped the present-day karyotype of 17 chromosomes (Fig. 2a). The K s distribution (Fig. 2b) of paralogues clearly illustrates the different rounds (WGD-2, WGT-1 and WGT-γ 7 ) of polyploidization events experienced by the sunflower so that for any ancestral

LETTER RESEARCH
region from the n = 7 AEK, a maximum number of 18 inherited regions are currently expected to be found in the modern sunflower genome. The dot plots (Fig. 2c) illustrate the paralogues inherited from WGD-2 in the sunflower genome (2-2 diagonal relationships), the paralogues deriving from WGT-1 in the artichoke genome (3-3 diagonal relationships) and finally the WGT-γ paralogues in the coffee genome (3-3 diagonal relationships). Thus, for any ancestral regions from the AEK (post-γ n = 21) the complete repertoire of 6-3-1-3 orthologous regions in the sunflower-artichoke-coffee-lettuce, respectively, is provided (Extended Data Fig. 3

and Supplementary Data 3).
The evolution of the cultivated sunflower progressed in two steps, domestication by native North Americans, followed by breeding involving selection on traits related to modern agricultural production. We applied an integrative approach to identify candidate genes for two major breeding traits: flowering time and seed oil content and quality. Sunflower gene networks were reconstructed with a supervised orthology-based transfer of knowledge from model species for both traits. Network genes that co-localized with genomic regions associated with variation in the traits of interest were further investigated by exploiting new information on paralogy relationships, expression and diversity data. We generated and integrated 58 transcriptomes for the roots, stem, leaves and eight floral organs (Fig. 1h, Extended Data Fig. 4 and Supplementary Data 5,6), and for the leaves and/or roots following application of nine hormones and three abiotic stress treatments (Supplementary Note 4.1-4.3). In addition, we re-sequenced 80 domesticated lines (10-20× coverage) (Supplementary Note 5.1, 5.2). The integrative web interface Heliagene provides visualization, querying tools for data mining and network exploration for the community (https://www.heliagene.org). Reconstructing the flowering-time genetic network in sunflower is of particular interest, because it is a key trait in crop production and the best-adapted flowering time has been selected in each cropping area during the breeding phase. Taking advantage of a recently developed database of flowering-time gene networks in Arabidopsis thaliana 20 , we identified 485 orthologues and in-paralogues (that is, paralogues post-dating speciation) for 270 flowering-time genes in the sunflower genome (Extended Data Fig. 5, Supplementary Data 7 and Supplementary Note 6.2). There were several sunflower in-paralogues for 180 Arabidopsis genes, illustrating the complexity of regulatory networks in sunflower.
Previous investigations of flowering-time architecture in the sunflower 21 , using more limited genomic data, focused on the transition from the wild sunflower to early domesticates. Whether flowering-time variation among modern lines involves the same genomic regions and gene families has broad implications for understanding pre-and post-domestication selection. Furthermore, the identification of ohnologous regions (that is, regions originating from whole-genome duplication) in the sunflower genome offers an excellent opportunity to determine the extent of functional diploidization for a quantitative trait in a complex genome. We used genome-wide association studies (GWAS) to dissect the genetic basis of flowering-time variation in a set of 480 F 1 hybrids obtained from 72 inbred lines, identifying 35 genomic regions associated with flowering time (Extended Data Fig. 5a and Supplementary Note 6.1). Comparison with flowering-time quantitative trait loci (QTLs) associated with domestication 21 suggests that similar genomic regions are responsible for variation among modern cultivars (Supplementary Note 6.2), possibly because selection during domestication has not been intense enough to eliminate variation at those loci, or because introgressions during sunflower breeding have reintroduced wild alleles 22 . The genomic architecture of flowering time has been shaped by the most recent whole-genome duplication (WGD-2), with more pairs of duplicated blocks associated with flowering time than is expected by chance (Extended Data Fig. 5b, Extended Data Table 1 and Supplementary Note 7). Therefore, even ancient ohnologues remain involved in the same regulatory networks and complete functional diploidization after whole-genome duplication may take long to achieve. Our integrative approach also highlights new candidate genes such as a newly discovered AGL24 in-paralogue, which directly colocalizes with single-nucleotide polymorphisms (SNPs) associated with flowering time and new FT paralogues (Extended Data Fig. 5c and Supplementary Note 6.2). This analysis therefore provides insights into the architecture of flowering time in domesticated sunflowers and provides a major resource for breeding programs.
Seed oil content and quality have been under selection during sunflower improvement 23 and continue to be a primary target of breeding programs. To determine the genetic bases of these traits, we reconstructed a genome-scale metabolic network for the sunflower (Extended Data Fig. 6a and Supplementary Note 8.1) and extracted metabolic pathways involved in oil synthesis, yielding a total of 429 genes mapped onto 125 reactions, corresponding to 12 pathways (Extended Data Fig. 6b). A review of the literature on sunflower-oil synthesis showed that our network captured all 40 genes that have already been described (Supplementary Data 8), demonstrating the sensitivity of the approach.
To find evidence of selection during sunflower breeding, we mapped resequencing data of 80 genotypes and measured differentiation (F st ) between oil and non-oil (for example, confectionary) types of domesticated lines (Supplementary Note 8.2). Genes of the oil metabolic network were enriched in the top differentiated genes, suggesting that we had successfully identified relevant candidates for oil improvement. We found 46 oil genes in 32 genomic regions corresponding to previously identified QTLs for seven oil-related traits (Supplementary Note 8.2). Nine of these genes were highly differentiated between highand low-oil lines (Extended Data Fig. 6c), including FAD2-1, which has been shown to be under selection during post-domestication 24 .
Another, HPPD, had already been found to co-localize with a QTL for the vitamin E precursor tocopherol 25 . Our data suggest that this gene may have been targeted by selection. The remaining seven genes mainly mapped onto the diacylglycerol and linoleate biosynthesis pathways (Extended Data Fig. 6d, e). In particular, a member of the PAP2 superfamily, which is involved in biosynthesis of fatty acid precursors 26 and controls total lipid content in micro-algae 27 , was predominantly expressed in seeds and co-localized with a QTL for total oil content. It therefore constitutes a strong candidate to improve this character (Extended Data Fig. 6f).
The availability of this reference genome and companion resources will not only strengthen interest in the sunflower as a model for ecological and evolutionary studies, but will also accelerate breeding programs. In addition to the genome-wide association study of flowering time presented here, precisely mapping loci that contribute to other ecologically and agriculturally important traits in wild and domesticated individuals will enable precision breeding through marker-assisted and genomic selection 28,29 . Functional validation of GWAS candidates will provide insights into the molecular mechanisms underlying variation in these traits 30 . The sunflower now has the potential to become a model crop for climate change adaptation, which can be achieved by exploiting genome-enabled systems biology and multi-disciplinary analyses of interactions between abiotic stressors, pathogen attacks and agronomic practices.
Online Content Methods, along with any additional Extended Data display items and Source Data, are available in the online version of the paper; references unique to these sections appear only in the online paper.

METHODS
A full description of the Methods can be found in the Supplementary Information. No statistical methods were used to predetermine sample size. The genome-wide association experiments were fully randomized and the investigators were not blinded to allocation during experiments and outcome assessment. Genome sequencing and assembly of the XRQ genotype. Sequencing. The DNA of the INRA inbred genotype XRQ (Supplementary Note 1.1) was extracted following a previously published protocol 31 , and sequenced using 407 SMRT cells with P6/C4 chemistry. Subreads were obtained using the SMRT Analysis RS. Subreads.1 pipeline (Supplementary Note 1.2). In total 32.8 million subreads were generated with an N 50 of 13.7 kb and a mean length of 10.3 kb. The targeted genome coverage of 102× was obtained with 367 Gb of raw sequence (340 Gb of subread data). Assembly. The PBcR wgs8.3rc1 assembly pipeline 32 was used to perform the correction of reads, WGS 8.3 to assemble the corrected reads and quiver 33 to polish the consensus sequence after the construction of the pseudomolecules (see below). However, to overcome challenges associated with the sunflower genome assembly, substantial parameter tuning, code modification and software development were required and these are described in Supplementary Note 1.3-1.7. Physical map construction, genetic map construction and assembly of pseudomolecules. To develop a robust physical map for the sunflower that could be used to help to place sequence contigs on chromosomes and determine the physical length of gaps between them, bacterial artificial chromosome (BAC) libraries were constructed for genotype HA412-HO by the French Plant Genome Resource Center (http://cnrgv.toulouse.inra.fr/en/library/sunflower). We used 382,464 clones from the three BAC libraries to develop a 12.5× physical map, which was integrated with high-density genetic maps (see below). The resulting physical map covers approximately 3.3 Gb (around 92.5% of the 3.6 Gb genome) and is publicly available at https://www.sunflowergenome.org/.
We developed several high-density genetic maps that we used for correctly placing and ordering BAC and sequence contigs on chromosomes, as well as for the association and QTL analyses. While individual maps had gaps with no mappable markers owing to identity by descent, this problem was minimized by the use of multiple mapping populations (Supplementary Note 1.5). The pseudomolecules were assembled as described in Supplementary Note 1.6, leading to a final assembly of 17 pseudomolecules and 1,509 unanchored contigs. A web browser of this genome assembly is available at https://www.heliagene.org/HanXRQ-SUNRISE/.
Sequencing, assembly and annotation of the genome of another genotype, HA412-HO, is presented in Supplementary Note 1.7. Annotation of protein-coding genes and lncRNAs. Gene models were predicted using EuGene 4.2 (ref. 34) embedded in a new and fully automated pipeline that integrates probabilistic sequence model training, genome masking, transcript-and protein-alignment computation and alternative splice site detection. The plant early release of BUSCO (release July 2015) 35 was run on the set of predicted transcripts, and it detected 92% of complete gene models (590 complete single copy and 291 duplicated, respectively) plus 10 additional fragmented gene models.
Protein-coding genes were annotated using a three-step process, taking into account reciprocal best hits in the SwissProt and TAIR10 (ref. 36) databases (12,360 sunflower proteins), protein-domain content using Interpro (26,646 sunflower proteins), and similarity with plant proteomes (Ensembl release 30) or coverage of the transcript with RNA-sequencing data (1,200 predicted proteins with similarities in other plant proteomes without expression support, 1,832 with similarities in other plant proteomes with expression support and 8,542 gene models supported by expression data, but without significant hits with other plant proteomes). The remaining 1,663 predicted proteins remained completely uncharacterized. Details of the gene prediction and annotation process are provided in Supplementary Note 2.1. Annotation of small RNA. To identify H. annuus miRNA genes, we constructed a small-RNA library using mixed RNAs from the various organs in control conditions (as for RNA sequencing) and sequenced them using Illumina GAIIx (oriented single-end 50 nucleotides (nt)). A total of 139 million reads were obtained that classically displayed a size distribution with two peaks of 21 and 24 nt small RNAs (Supplementary Note 2.2). Genome-wide prediction of miRNAs was performed combining Shortstack version 3.4 (ref. 37) and an adapted version of the pipeline described in ref. 38, post-processed with the stringent criteria proposed by MiRBase 39 . Targets of miRNA were predicted using miRanda version 3.0 (http://www.microrna.org). Annotation of repeats. LTR-RTs were annotated with an in-house pipeline that uses LTRharvest 40 and LTRdigest 41 . DNA transposons were annotated with a custom pipeline that includes the 'gt tirvish' command, which is part of the GenomeTools suite 42 16 , coffee 17 and lettuce 15 and with grape 18 as the outgroup. Identification of orthology and paralogy relationships, measurements of sequence divergence and estimation of divergence time through the level of synonymous substitutions were performed as detailed in Supplementary Note 3.1 on the basis of the methods described in ref. 45  Transcriptome sequencing and analysis. We generated 58 paired-end RNAsequencing libraries to measure expression in 11 sunflower organs, the responses to hormonal and osmotic and salt treatments in roots and leaves, as well as response to variable water status (Supplementary Note 4.1). Library sequencing was done with Illumina HiSeq, reads were mapped with the glint software (https://forge-dga. jouy.inra.fr/projects/glint) and only the best scoring pair(s) of reads was(were) kept. Expression measurements and normalization were performed as described in Supplementary Note 4.2. Organ-specificity was measured by computing a specificity index, Tau 47 , on the normalized expression score. We identified sets of organ-specific genes and regulators (transcription factors and lncRNAs) (Extended Data Fig. 4 and Supplementary Note 4.2). Analysis of differential expression in response to hormones and stress treatments were performed with the glm model of EdgeR 48 as detailed in Supplementary Note 4.2. Gene Ontology enrichment tests were carried out with Blast2GO Pro (one-sided Fisher's exact tests, false discovery rate of < 0.05). Resequencing of domesticated lines. We resequenced 80 lines of the sunflower mapping population (SAM) that represent the diversity of the cultivated sunflower. Statistics on resequenced lines are provided in Supplementary Table 5.1.1. Seventytwo parent lines of the 480 hybrids used in a genome-wide association analysis of flowering time were also resequenced. The paired-end libraries were resequenced with Illumina HiSeq, read mapping was performed with the glint software (https://forge-dga.jouy.inra.fr/projects/glint) and SNP calling with VarScan 49 (Supplementary Notes 5.1, 5.2). Identification of flowering time orthologues and in-paralogues. Flowering time genes in A. thaliana were retrieved from a recently developed database, FLOR-ID 20 , which includes 295 protein-coding genes and 11 miRNA genes and describes their interactions. We built gene clusters for a set of seven species, namely H. annuus, A. thaliana, Cynara cardunculus, Oryza sativa, Hordeum vulgare, Brassica rapa and Populus trichocarpa, chosen to be consistent with a previous study that identified orthologues for more than 30 flowering-time genes in the sunflower 21 , adding the proteome of the recently sequenced member of Asterids II C. cardunculus 16 . To identify orthologues and in-paralogues (that is, paralogues post-dating speciation) of A. thaliana genes, we built and visually examined trees for the clusters defined above (Supplementary Note 6.2) and manually screened BLAST reports on the sunflower genome browser. We identified 485 orthologues and in-paralogues (Supplementary Data 7). A genome-wide association study of flowering time was performed on a set of 480 hybrids obtained from 72 inbred genotypes (Supplementary note 6.1), and colocalization of flowering-time orthologues with flowering time QTLs was assessed with bedtools 50 . Analysis of paralogues dating from the most recent whole-genome duplication (WGD-2). Correlation of expression between WGD-2 paralogues was assessed quantitatively by measuring the Pearson correlation coefficient and qualitatively by counting the number of pairs of paralogues that belong to the same co-expression modules based on a weighted gene co-expression network constructed with WGCNA (Supplementary Note 7). Significance was tested with 1,000 permutations