A high-quality carrot genome assembly provides new insights into carotenoid accumulation and asterid genome evolution

We report a high-quality chromosome-scale assembly and analysis of the carrot (Daucus carota) genome, the first sequenced genome to include a comparative evolutionary analysis among members of the euasterid II clade. We characterized two new polyploidization events, both occurring after the divergence of carrot from members of the Asterales order, clarifying the evolutionary scenario before and after radiation of the two main asterid clades. Large- and small-scale lineage-specific duplications have contributed to the expansion of gene families, including those with roles in flowering time, defense response, flavor, and pigment accumulation. We identified a candidate gene, DCAR_032551, that conditions carotenoid accumulation (Y) in carrot taproot and is coexpressed with several isoprenoid biosynthetic genes. The primary mechanism regulating carotenoid accumulation in carrot taproot is not at the biosynthetic level. We hypothesize that DCAR_032551 regulates upstream photosystem development and functional processes, including photomorphogenesis and root de-etiolation.


Genome characterization
Carrot coding regions, tandem repeats, and mobile elements were characterized to evaluate the structural and functional features contributing to carrot evolution (Supplementary Note). Repetitive sequences accounted for 46% of the genome assembly (Table 1), of which 98% (193.7 Mb) were annotated as transposable elements (TEs) (Supplementary Table 9). Class II TEs accounted for 57.4 Mb-a greater amount of the genome than in similarly sized plant genomes, including rice (48 Mb) 17 . Given the abundance of class II TEs, we studied the evolution and distribution of insertion sites for two miniature inverted-repeat transposable element (MITE) class II families, Tourist-like Krak 18 and Stowaway-like DcSto 19 . The expansion of DcSto elements was characterized by multiple amplification bursts (Supplementary Fig. 7). Over 50% of DcSto and Krak insertion sites were located near (<2 kb away from) or inside predicted genes. However, no evidence was found to support their preferential insertion in genic regions (Supplementary Fig. 8), supporting the hypothesis that the impact of DNA transposons on gene function and genome evolution may reflect the interplay of stochastic events and selective pressure 20 .
Tandem repeat-rich regions create a technical challenge to genome assembly 21 . By using RepeatExplorer 22 and cytology, we identified four major tandem repeat families accounting for ~7% of the DH1 genome and traced their evolutionary history in the Daucus genus (Supplementary Table 10). These tandem repeats included the carrot centromeric satellite Cent-Dc (CL1) 23 and three new tandem repeats (CL8, CL80, and CL81). In DH1 and related species, 39-to 40-bp Cent-Dc monomers were organized in a higher-order repeat structure (Supplementary Fig. 9). Daucus species distantly related to carrot were enriched for the CL80 repeat, which occupied most subtelomeric and pericentromeric regions ( Fig. 1 and Supplementary Fig. 10). Conversely, the carrot CL80 sequence was associated with a knob on chromosome 1. Because Cent-Dc and CL80 were detected in members of the divergent Daucus clades (Daucus I and II), we hypothesize that their origin predates the estimated divergence of the two clades ~20 million years ago 24 . After Daucus radiated, these repeat families presumably underwent differential expansion and shrinkage of their repeat arrays and structural reorganization of monomers.
In assembly v1.0 gene annotation, 32,113 genes were predicted ( Table 1 and Supplementary Note), of which 79% had substantial homology with known genes (Supplementary Tables 11 and 12). The majority (98.7%) of gene predictions had supporting cDNA and/or EST evidence (Supplementary Table 13), demonstrating the high accuracy of gene prediction. Relative to five other closely related genomes, carrot was enriched for genes involved in a wide range of molecular functions (Supplementary Table 14). We also identified 564 tRNAs, 31 rRNA fragments, 532 small nuclear RNA (snRNA) genes, and 248 microRNAs (miRNAs) distributed among 46 families ( Fig. 1 and Supplementary Table 15).

Carrot diversity analysis
To evaluate carrot domestication patterns, we resequenced 35   A r t i c l e s Phylogenetic and cluster analysis separated samples by geographical distribution relative to carrot's Central Asian center of origin (eastern or western) and cultivation status (wild, cultivated, open pollinated, or inbred) (Fig. 2a). Eastern wild accessions were most closely related to cultivated carrots, further demonstrating a primary center of carrot domestication in the Middle East and Central Asia 3 . Cluster analysis showed extensive allelic admixture (Fig. 2a), reflective of the outcrossing nature within carrot combined with extensive geographical overlap between wild and cultivated carrot lines 4 . This pattern was particularly evident in eastern wild and cultivated samples, likely caused by less intensive carrot breeding in eastern regions. Indeed, some eastern cultivated carrots still maintain primary taproot lateral branching and reduced pigmentation (Supplementary Fig. 11). In contrast, western cultivars clearly separated from wild and eastern cultivated carrots, and some inbred lines (I3 and I4) have a purified genetic pattern shared with western cultivated accessions, reflecting the intensive breeding practiced in western regions. Nucleotide diversity (π) 25 estimates showed that wild carrots have a slightly higher level of genetic diversity than cultivated carrots (Supplementary Table 18), indicating the occurrence of a limited domestication bottleneck, consistent with previous findings 3, 26 . When D. carota subspecies, which have morphological characteristics contributing to their sexual isolation relative to carrot 27 , were excluded from diversity estimates, this observation was more evident from comparative analysis (wild, π = 9.5 × 10 −4 versus cultivated, π = 8.6 × 10 −4 ). In contrast, a clear reduction in genetic diversity and heterozygosity was found in inbred lines ( Fig. 2b and Supplementary Table 17), likely resulting from their use in hybrid carrot breeding programs 28 .
To identify genomic regions associated with domestication events, we computed pairwise population differentiation (F ST ) levels for wild and cultivated eastern accessions 29 , as these samples resemble the genetic pool for primary carrot domestication. We identified local differentiation signals on chromosomes 2, 5, 6, 7, and 8. Peaks on chromosomes 5 and 7 overlap with previously mapped quantitative trait loci (QTLs) controlling carotenoid accumulation in tap root (Fig. 2b), a major domestication trait in carrot.

Genome evolution
Comparative phylogenomic analysis among 13 plant genomes (Supplementary Table 19 and Supplementary Note) indicated that carrot diverged from grape ~113 million years ago, from kiwifruit ~101 million years ago, and from potato and tomato ~90.5 million years ago, confirming the previously estimated dating of the asterid crown group to the Early Cretaceous and its radiation in the Late-Early Cretaceous 8 ( Fig. 3a and Supplementary Fig. 12). Further divergence between carrot and lettuce, both members of the euasterid II clade, likely occurred ~72 million years ago.
We identified two new whole-genome duplications (WGDs) specific to the carrot lineage, Dc-α and Dc-β, superimposed on the earlier γ paleohexaploidy event shared by all eudicots (Fig. 3a,b). These WGDs likely occurred ~43 and ~70 million years ago, respectively (Fig. 3a). Estimating the timing of the Dc-β WGD to around the Cretaceous-Paleogene (K-Pg) boundary further supports the hypothesis that a WGD burst occurred around that time, perhaps reflecting a selective polyploid advantage in comparison to diploid progenitors 30 . These results may also suggest a co-occurrence of the Dc-β WGD with Apiales-Asterales divergence. To address this possibility, we compared the carrot genome with the genome of horseweed (Conyza canadensis) (Supplementary Note), an Asteraceae with a low-pass whole-genome assembly 9 . Pairwise paralog and ortholog gene divergence indicated  npg A r t i c l e s that a possible WGD occurred in the horseweed genome that does not overlap with the carrot Dc-β event, as it occurred after divergence with carrot ( Supplementary Fig. 13). This WGD is likely shared with lettuce and may represent a whole-genome triplication (WGT) recently described in lettuce that is basal to Asteraceae 31 . Using methods previously described 32,33 , we reconstructed the carrot paleopolyploidy history. Carrot chromosomal blocks descending from the seven ancestral core eudicot chromosomes were highly fragmented and dispersed along the nine carrot chromosomes (Fig. 3c). The two lineage-specific WGDs were clearly evident from the distribution of the fourfolddegenerate transversion rates of carrot paleohexaploid paralogous genes, whereas genes from the shared eudicot γ WGT were largely lost, likely owing to extensive genome fractionation (Supplementary Fig. 14). Comparative analysis with the grape, tomato, coffee, and kiwifruit genomes identified a clear pattern of multiplicons (1:5 or 1:6 ratio) (Fig. 3d). Depth analysis of duplicated blocks harboring paralogous genes under the Dc-α fourfold-degenerate transversion peak indicated over-retention of duplicated blocks. In contrast, duplicated blocks harboring paralogous genes under the Dc-β peak retained a larger number of triplicated blocks (Fig. 3e). We suggest that at least 60 chromosome fusions or translocations and a lineage-specific WGT (Dc-β) followed by a WGD (Dc-α) contributed to diversification of the 9 carrot chromosomes from the 21-chromosome intermediate ancestor.
Characterization of Dc-α and Dc-β duplicated blocks demonstrated that extensive gene fractionation has occurred during the evolutionary history of the carrot genome ( Supplementary Tables 20 and 21). Dc-α ohnologs are significantly enriched (P ≤ 0.01) in protein domains involved in selective molecule interactions (binding) and protein dimerization functions (Supplementary Table 22), supporting the gene dosage hypothesis 34 ; this observation predicts that categories of genes encoding interacting products will likely be over-retained.

Regulatory genes
Characterization of orthologous gene clusters across multiple genomes identified 26,320 carrot genes in 13,881 families, with 10,530 genes unique to carrot (Supplementary Fig. 15). Protein domains involved in regulatory functions (binding) and signaling pathways (protein kinases) were abundant among the genes unique to carrot (Supplementary Tables 23 and 24).
We identified 3,267 (10% of the total) regulatory genes in carrot, a number similar to that in tomato (3,209 regulatory genes) and rice (3,203   npg regulatory gene expansion, with ~33% of these genes retained after the two carrot WGDs, demonstrating the evolutionary impact of large-scale duplications on plant regulatory network diversity 34 (Supplementary  Table 27). Six regulatory gene families involved in lineage-specific duplications were expanded in carrot (Supplementary Table 28). The expanded families include a zinc-finger (ZF-GFR) regulatory gene family, the JmjC, TCP, and GeBP families, the B3 superfamily, and response regulators. The over-represented regulatory gene subgroups shared orthologous relationships with functionally characterized genes involved in cytokinin signaling, which can influence the circadian clock as well as plant morphology and architecture (Supplementary Figs. [16][17][18][19][20].

Pest and disease resistance genes
Using the MATRIX-R pipeline 38 with additional manual data curation, we predicted 634 putative pest and disease resistance (R) genes in carrot (Supplementary Tables 29-34 and Supplementary Note).
Most R gene classes were under-represented in carrot. The expanded orthologous subgroups included classes containing the NBS and LRR protein domains (NL) and coiled-coil NBS and LRR domains (CNL). Lineage-specific duplications contributed to the expansion and diversification of these R gene families in carrot and other genomes ( Supplementary Fig. 21 and Supplementary Table 35).
Many R genes (206) were located in clusters, and these clusters tended to harbor genes from multiple R gene classes (Supplementary Tables 36  and 37). The expansion of the NL and CNL families might reflect evolutionary events generating tandem duplications, resulting in  preferential clustering on chromosomes 2 and 3-7, respectively ( Supplementary Fig. 22). One cluster containing three RLK genes and one LRR gene, spanning only 50 kb, colocalized with the carrot Mj-1 region, which controls resistance to Meloidogyne javanica, a major carrot pest 39 (Supplementary Fig. 22). This analysis demonstrates the important role of tandem duplications in the expansion of R genes in carrot. Additionally, R gene clusters may provide a reservoir of genetic diversity for evolving new plant-pathogen interactions.

A candidate gene controlling high carotenoid accumulation
Carotenoids were first discovered in carrot and named accordingly. The Y and Y 2 gene model explains the phenotypic differences between white and orange carrots 40,41 , with elevated carotenoid accumulation in homozygous-recessive genotypes (yyy 2 y 2 ). In spite of the striking color variation attributed to these two genes, little is known about the molecular basis of carotenoid accumulation in carrot. Although homologs of all known carotenoid biosynthesis genes have been identified in carrot, none appear to be responsible for carotenoid accumulation [42][43][44][45][46] . Using two mapping populations, we demonstrated that Y regulates high carotenoid accumulation in both yellow and dark orange roots (Fig. 4a . 4b-e and Supplementary Fig. 25). Of the eight genes predicted in this region, none had homology with known isoprenoid biosynthesis genes (Supplementary   Table 39).
Using resequencing data, a haplotype block extending for 65 kb, with 64 kb overlapping the fine-mapped region, was associated with all but two highly pigmented root samples (C1 and I2) (Supplementary Fig. 27). In contrast, within the 65-kb region, seven haplotype blocks were detected in wild accessions. Polymorphism detection within the haplotype block identified eight nonsynonymous SNPs in four genes and two indels, including the 212-nt insertion in DCAR_032551, in yellow and dark orange samples ( Fig. 4f and Supplementary Table 40). No wild or cultivated white samples had the 212-nt insertion. The two highly pigmented (yy) accessions, C1 and I2, that did not share the 65-kb haplotype block were heterozygous for the insertion. However, further analysis of DCAR_032551 identified a 1-nt insertion in the second exon, 60 nt upstream of the 212-nt insertion site ( Fig. 4f and Supplementary Fig. 26). The 1-nt insertion was in trans phase relative to the 212-nt insertion, indicating that these accessions harbor two frameshift mutations that likely disrupt functioning of the Y gene product. Thus, resequencing supports the central role of DCAR_032551 in conditioning high pigment accumulation in carrot roots and identifies a second, independent mutation in this same gene, which we speculate should also be recessive to the wild-type allele.
To determine whether this region was ever under selection, we scanned for differences in nucleotide diversity, differentiation, and linkage disequilibrium (LD) between wild and cultivated accessions. An F ST peak on chromosome 5, located between 24.4 and 25.0 Mb, overlapped the 75-kb fine-mapped region underlying DCAR_032551 (Figs. 2c and 4g,h). In this region, LD was increased in highly pigmented cultivated materials and nucleotide diversity was drastically reduced in cultivated carrots (wild, π = 3.1 × 10 −4 versus cultivated, π = 2.0 × 10 −4 ) (Fig. 4g,h). The 50-kb window encompassing the Y candidate gene had the highest level of differentiation (F ST = 1.0) and the lowest level of nucleotide diversity (π = 1.5 × 10 −4 ) among cultivated carrots. The selective sweep in the Y region is relatively short in comparison with those for other genes controlling carotenoid   accumulation, including the selective sweep for y1 in maize, which extends 200 kb upstream and 600 kb downstream of the gene 47 . Rather, this scenario resembles the short sweep (60-90 kb) identified in maize around teosinte branched1 (tb1), a major domestication-associated gene 48 . A short sweep may reflect the highly effective rates of recombination expected in an outcrossing species like carrot. Gene flow between wild and cultivated carrot followed by recurrent phenotypic selection that likely occurred throughout the history of carrot 4 may have had a role in increasing the recombination rate around the Y locus.
Selection signatures, including reduction in nucleotide diversity and a decrease in the number of haplotypes, associated with the Y gene region further support the inclusion of carotenoid accumulation as a major domestication trait-a trait that contributes substantial nutritional and economic value to modern carrots. Furthermore, the identification of a second haplotype block for pigmentation surrounding the Y candidate gene suggests that this gene has been selected multiple times. These results may elucidate the timing and origin of the pigmented taproot phenotype during carrot domestication.

A model for carotenoid accumulation in carrot roots
To investigate gene expression in the region of the Y candidate, comparative transcriptome analysis was performed for white versus yellow and pale orange versus dark orange roots (Supplementary Note). DCAR_032551 was the only significantly differentially expressed (upregulated; P ≤ 0.001) gene in the yy (yellow and dark orange) relative to the Y-(white and pale orange) genotype (Supplementary Table 39), further supporting our mapping and resequencing results.
Weighted gene coexpression network analysis (WGCNA) indicated that DCAR_032551 is coordinated with a set of 925 genes (Supplementary Table 41). Gene Ontology (GO) term enrichment analysis indicated that isoprenoid pathway genes were particularly enriched (Supplementary Table 42). Among cellular components, membrane terms and molecular function terms related to oxidative reactions and biological processes in response to acids and chemicals were highly enriched (Supplementary Table 43). Assuming a conserved function of Y in yellow and dark orange roots, we annotated genes that were differentially expressed (upregulated or downregulated) in white versus yellow and pale orange versus dark orange comparisons. This analysis identified a positive relationship between high carotenoid accumulation and overexpression of several light-induced genes, including those involved in photosynthetic system activation and function, plastid biogenesis, and chlorophyll metabolism ( Supplementary  Tables 44 and 45), an unexpected finding in non-photosynthetic root tissue. These findings tie into the WGCNA analysis as components of photomorphogenesis are located in the thylakoid membranes and involve many oxidative processes and chemical responses, including hormonal regulation. Analysis of the 98 genes annotated in the plastidal methylerythritol phosphate (MEP) and carotenoid pathways (Supplementary Table 46 and Supplementary Note) confirmed coordinated overexpression of several genes in these pathways and carotenoid accumulation in yy plants. Furthermore, an inverse relationship was observed between the majority of differentially expressed terpene synthase genes (Supplementary Table 47) and high carotenoid accumulation, consistent with substrate flux into the carotenoid pathway. DXS1 and LCYE were the only genes in the MEP and carotenoid pathways that were differentially expressed in yy genotype samples with high carotenoid accumulation in both populations, suggesting that they possibly encode enzymes that regulate carotenoid accumulation. Although LCYE has not been reported to be a carotenoid regulatory gene target, its elevated expression may account for the relative abundance of lutein in yellow carrots and alpha-carotene in orange carrots. DXS1 is a limiting factor in upregulation of the carotenoid pathway in A. thaliana 49 . DXS1 expression is induced by light 50,51 , and it is the main DXS isoform catalyzing the biosynthesis of isoprenoid and carotenoid precursors in photosynthetic metabolism 52,53 . DXS1 also regulates carotenoid accumulation in A. thaliana and tomato 54,55 . Overall, these results indicate that DCAR_032551 is coexpressed with isoprenoid pathway genes and that overexpression of the light-induced/photosynthetic transcriptome cascades in orange and yellow carrot roots may explain elevated carotenoid accumulation.
The DCAR_032551 gene product represents a plant-specific protein of unknown function, and mutants of the A. thaliana homolog PSEUDO-ETIOLATION IN LIGHT (PEL) have an etiolated phenotype, a phenotype associated with defective responses to light 56 (Supplementary Table 44). In many ways, the physiology and genetics of carotenoid accumulation in dark orange and yellow (yy) carrots are similar to the phenotypes of the A. thaliana det, cop, and fus de-etiolated mutants. These mutants lack the ability to inhibit the light-induced photosynthetic transcriptome cascade associated with de-etiolation and photomorphogenesis in non-photosynthetic tissues such as roots 57 . De-etiolated mutants grown in the dark have characteristics of light-grown seedlings, including carotenoid accumulation and overexpression of light-induced photosystem and plastid biogenesis genes 58,59 . In contrast, when exposed to light, these mutants demonstrate ectopic expression of genes involved in chloroplast formation 58 . Physiological studies have demonstrated that, unlike other species, carrots with carotenoid-rich roots have ectopic chloroplast accumulation when exposed to light 44,60 and that highly pigmented carrot roots have upregulation of photosystem-related genes in comparison with white roots 27 Figure 5 Working model of the regulation of carotenoid accumulation in carrot root. Upward and downwardpointing arrows indicate upregulated and downregulated genes, respectively, in the yellow versus white (yellow arrows) and dark orange versus pale orange (orange arrows) comparisons. The orange box delimits the isoprenoid biosynthetic branch that leads to the carotenoid pathway. As shown in the green box, the majority of the upregulated genes in yellow and dark orange roots are involved in the photosynthetic pathway (supplementary table 45); genes that are included are involved in the assembly and function of photosystems I and II and plastid development. We hypothesize that loss of the constitutive repression mechanisms conditioned by genes involved in deetiolation and photomorphogenensis in nonphotosynthetic tissue, such as carrot roots, induces overexpression of DXS1 and, consequently, activation of the metabolic cascade that leads to high levels of carotenoid accumulation in carrot roots. npg A r t i c l e s transcriptome data presented here indicate that, similar to de-etiolated A. thaliana mutants, carrot roots with high levels of carotenoid accumulation may have lost the ability to inhibit the transcriptome cascade associated with de-etiolation and photomorphogenesis. The recessive nature of the Y gene in such roots is compatible with loss of the constitutive negative feedback function associated with the recessive det, cop, and fus mutants in A. thaliana. In addition, the A. thaliana homolog of the Y candidate produces a protein that interacts with genes such as FAR1 and COP9, involved in the light signaling pathway (Supplementary Table 48). Our hypothesis is further supported by the WGCNA analysis indicating that DCAR_032551 is coexpressed with COP1 and HY5 (Supplementary Table 41), genes both directly involved in the regulation of photomorphogenesis. Together, these findings make DCAR_032551 a plausible regulatory candidate. Considering our results coupled with previous physiological studies 44 , we hypothesize that carotenoid accumulation in carrot taproot results from root de-etiolation, whereby the repression of photomorphogenic development typically found in etiolated roots is lifted. The resulting overexpression of DXS1 provides precursors to the carotenoid biosynthetic pathway, which leads to an accumulation of carotenoids in orange and yellow (yy) carrot roots (Fig. 5).

DISCUSSION
Vitamin A deficiency is a global health challenge 62 , making the development of sustainable vitamin A sources a priority for crop improvement. Its plentiful carotenoids make carrot an important source of provitamin A in the human diet 6 . Although carrot was a model organism to study plant development and totipotency in the 1950s 63,64 , the molecular basis of neither carrot growth nor phytochemical accumulation has been well described. The high-quality carrot genome sequence described here, in combination with mapping and comparative transcriptome analysis, demonstrates that carotenoid accumulation in carrot is controlled at the regulatory level and that root de-etiolation leading to overexpression of the photosynthetic transcriptome cascade may have an important role in this regulatory mechanism. These results provide the foundation for new genetic mechanisms regulating carotene accumulation in plants, with potential application to several crops.
This study included the first comparative genomic and phylogenomic analyses comprising members of the euasterid II clade and clarified the evolutionary events surrounding the radiation of the main asterid clades. The two new WGD events (Dc-α and Dc-β) identified provide a new tool to study genome polyploidization. The two WGDs specific to the carrot lineage and the new WGD identified in the horseweed genome, which is possibly shared with lettuce, prompt important evolutionary questions about the possible involvement of the latter WGD in the early radiation of the Asterales order. The carrot genome is the first chromosome-scale Apiaceae genome to be sequenced and will provide a foundation for future comparative genomic and evolutionary studies.
Resequencing diverse Daucus species emphasized a high level of variability in repetitive sequence structure and chromosomal location, demonstrated a high level of genetic diversity retained in cultivated carrots, and identified a genetic sweep associated with domestication. This information lays the groundwork for future studies on carrot domestication and chromosome evolution across the Daucus genus.
The high-quality carrot reference genome and large set of SNP markers will accelerate marker-facilitated trait mapping through genome-wide association studies and genomic selection. The carrot genome sequence will support crop improvement efforts and help identify additional candidate genes underlying isoprenoid and flavonoid accumulation, biotic and abiotic stress resistance, and regulatory pathways controlling growth, flowering, seed production, and regeneration in tissue culture-all important traits for sustained agricultural production and improved human health.

METhODS
Methods and any associated references are available in the online version of the paper.
Accession codes. The genome assembly has been deposited at GenBank under accession LNRQ00000000 and at Phytozome. The version described in this paper is version LNRQ01000000. All raw reads have been deposited in the Sequence Read Archive (SRA) under umbrella project PRJNA285926, accessions SRP062070, SRP062113, and SRP062159. Further information is available through our website (see URLs).

Note: Any Supplementary Information and Source Data files are available in the online version of the paper.
Reprints and permissions information is available online at http://www.nature.com/ reprints/index.html. This work is licensed under a Creative Commons Attribution 4.0 International licence. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons licence, users will need to obtain permission from the licence holder to reproduce the material. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
To guide the construction of superscaffolds and anchor the genome, an integrated linkage map was developed using JoinMap 4.0 (ref. 68). CheckMatrix (see URLs) was used to remove markers with inconsistent placement. The collinearity of common markers was inspected using MapChart 2.2 (ref. 69), and inconsistent markers were removed before merging maps. Markers in common were used as anchor points (Supplementary Tables 52 and 53). Marker order correlations between composite and component map linkage groups were calculated in SAS 9.2 using the PROC CORR Spearman function (Supplementary Table 54). Linkage groups were assigned to chromosomes, oriented, and numbered using published classification 23 .
To build superscaffolds and to identify chimeric scaffolds and correct them, 29,875 paired-end BACs, 20-kb and 40-kb Illumina mate-paired sequences, and 2,075 marker sequences mapped in the carrot integrated linkage map were aligned to the v1.0 assembly. For each scaffold or contig, unambiguously aligned sequences were visualized in GBrowse. Superscaffolding was initiated with scaffolds containing sequences of mapped markers. Scaffold connections supported by at least two paired-end BACs were annotated, and sequences were further connected using a custom Perl script (cp1; see URLs). The quality of each scaffold assembly and contiguity were verified by visually inspecting the coverage of large-insert libraries (20 and 40 kb) and the consistency of marker order along the linkage map.
Possible chimeric scaffolds (Supplementary Fig. 2) were identified as those containing sequences of markers mapped to different linkage groups or to distal locations of the same linkage group or those containing regions not covered by mate-paired sequences. Within each chimeric scaffold, the chimeric regions were identified as those regions not covered by mate-paired or paired-end BAC sequences and were then manually inspected. The midpoint between the closest unambiguously aligned paired-end sequences flanking the chimeric region was defined as the misassembly point. Corrected scaffolds were then used to progressively construct superscaffolds as described above. This process generated assembly v2.0 and nine carrot pseudomolecules ( Supplementary  Figs. 29 and 30, and Supplementary Table 55).
See the Supplementary Note for additional details.
De novo assembly of the plastid and mitochondrial genomes is described in the Supplementary Note. Genome quality evaluation. The presence of possible sequence contamination was evaluated using DeconSeq 70 with scaftigs from the v2.0 assembly.
To evaluate the correctness of the assembled sequences, we used (i) an 8-kb 454 library of DH1 (SRA accession SRX1135252) and (ii) 4,717 paired-end BACs that were not used to join scaffolds into superscaffolds during assembly. Paired-end reads that aligned with both ends to a unique location in the carrot plastid genome or the v2.0 assembly were used to calculate the mean insert size.
A new linkage map including GBS SNP markers was developed to verify the order of the scaffolds and superscaffolds. GBS libraries were prepared as described by Elshire et al. 71 , with minimal modification. TASSEL version 4.3.11 (ref. 72) was used for analysis, with paired-end data preprocessed for TASSEL compatibility using a custom Perl script, bb.tassel (see URLs). SNPs were called using documented GBS pipeline procedures 73 . Sequences containing SNPs unambiguously aligned to the carrot genome assembly were kept (18,007 SNPs). SNPs scored as heterozygous but with an allele ratio a:b far from 1:1 were eliminated if the ratio was <0.3 or >3.0, where a and b were the two alleles for a given SNP. Mapping was carried out as described 74 (Supplementary Fig. 31 and Supplementary Note).
Gene space coverage was evaluated using carrot ESTs 14 , RNA-seq data from 20 different DH1 tissues (NCBI BioSamples SAMN03965304-SAMN03965323), and 258 ultraconserved genes from the Core Eukaryotic Genes data set. Previously published carrot ESTs 14 were aligned to the genome using BLASTN 77 ; RNA-seq data from 20 different DH1 tissues (NCBI BioProject PRJNA291977) were assembled with Trinity r2013_08_14 and mapped to the assembly using TopHat v2.0. 11 (ref. 78). Scaftigs from the carrot assembly were aligned to the Core Eukaryotic Genes data set 15 (Supplementary Fig. 33 and Supplementary  Tables 57-59). Consensus sequences were used to investigate intra-and interspecific relationships among families with Circoletto 82,83 ( Supplementary  Fig. 34). Stowaway-like elements carrying insertions >10 nt in length were removed from subsequent steps. Within-family similarity was calculated from a Kimura two-parameter pairwise distance matrix. The evolutionary history of related DcSto elements was investigated using MEGA6 (ref. 84).
Tandem repetitive sequences were analyzed with RepeatExplorer 22 and SeqGrapheR 85 using a subset of 1 × 10 7 Illumina reads from DH1 and five resequenced genotypes representative of Daucus clades I and II. To select tandem repetitive sequences, the node/edge ratio (number of nodes/number of edges) among aligned sequences in each cluster was calculated. Clusters with a ratio >0.09, representing more than 0.05% of the genome, were selected for further analysis. Tandem repeats were identified using Tandem Repeats Finder v4.07b 86 (Supplementary Tables 60 and 61).
The abundance and localization of selected repetitive sequences in DH1 and other Daucus species were also investigated by FISH (Supplementary Note).
For gene model prediction, mobile element-related repeats were masked using RepeatMasker (see URLs). De novo prediction using AUGUSTUS v2. 5 Tables 19 and 62).
Peptide sequence from 312 single-copy orthologous gene clusters was used to construct phylogenetic relationships and estimate divergence time. Alignments from MUSCLE 106 were converted to coding sequences. Fourfold-degenerate sites were concatenated and used to estimate the neutral substitution rate per year and divergence time. PhyML 107 was used to construct the phylogenetic tree.
The Bayesian Relaxed Molecular Clock (BRMC) approach was used to estimate the species divergence time using the program MCMCTREE v4.0, which is part of the PAML package 108 . The 'correlated molecular clock' and 'JC69' models were used. Published times for sorghum-rice (<55 million years ago, >35 million years ago) [109][110][111] , tomato-potato (<4 million years ago, >2 million years ago) 112 , and grape-rice (<130 million years ago, >240 million years ago) 113 divergence were used to calibrate divergence time.
Chromosome collinearity within carrot and between carrot and tomato, grape, and kiwifruit was evaluated with MCscan 114 (Supplementary Table 63). The synonymous mutation rate (k s ) and fourfold-degenerate transversion rate were calculated using the HKY model 115 .
The paleopolyploid history was determined as described by Salse 33 (Fig. 3c and Supplementary Figs. 35 and 36). Grape-carrot syntenic blocks descending from the seven ancestral chromosomes were detected in carrot as compared with grape, kiwifruit, tomato, and coffee (Supplementary Fig. 37).
Divergence and WGD time points in the carrot and tomato genomes were estimated using a method described by Vanneste et al. 30 Table 64).

(Supplementary
The comparative analysis with the horseweed genome 9 used the same gene prediction pipeline described earlier. In total 38,199 genes were predicted and clustered using OrthoMCL to find single-copy gene families across 14 species. A maximum-likelihood tree was reconstructed on the basis of the fourfolddegenerate sites from the 963 single-copy gene families. Reciprocal best BLASTN 69 hits within horseweed or between horseweed and other species were used to calculate the paralog/ortholog gene divergence (Supplementary Fig. 13).
We collected all syntenic blocks containing genes associated with the Dc-α, Dc-β, and Dc-γ WGD events (Supplementary Predicted regulatory gene classes were grouped with OrthoMCL as described. We then carried out a detailed analysis of expanded carrot regulatory gene families (Supplementary Fig. 38 and Supplementary Tables 66-79). See the Supplementary Note for the classification of duplication modes of each regulatory gene. For phylogenetic analysis, multiple-sequence alignments with complete protein sequence were conducted using Clustal W 118 with default parameters. Phylogenetic trees were constructed using the neighbor-joining method, with pairwise deletion, using MEGA6 (ref. 84 A candidate gene controlling carotenoid accumulation. Mapping populations, 97837 (n = 253) and 70796 (n = 285), were used to study the Y locus that regulates carotenoid accumulation in carrot root, where 97837 was derived from an intercross between yellow-and white-rooted cultivars and 70796 was derived from a cross between a dark orange inbred carrot and a wild whiterooted carrot (Supplementary Figs. 39-41). Carotenoids were quantified as described by Simon and Wolff 120 and Simon et al. 121 . Analysis of marker-trait associations was carried out with molecular markers considered as fixed effects in a linear model implemented in the GLM function of TASSEL 72 . The primers used for fine-mapping are reported in Supplementary Table 80. Genome assembly v2.0 was used as a reference to identify marker locations (Supplementary Tables 81 and 82). The genomewide significance threshold was determined by the Bonferroni method 122 . QTL analysis for population 70796 used R package qtl 123 Table 83).

(Supplementary
Resequencing of polymorphisms and phenotypes were used to identify the haplotype block associated with pigmented versus non-pigmented roots. SNPs covering the region associated with high carotenoid accumulation were loaded into TASSEL 72 and manually inspected to identify the start and end of the haplotype block. Sequence from the haplotype block and its flanking sequences were used for haplotype network analysis with PopArt v1.7 (ref. 124).
Haploview v4.2 (ref. 125) was used to calculate and visualize LD in the candidate region. F ST analysis of 1,393,431 original filtered SNPs was conducted pairwise between each of the 35 resequenced genotypes using VCFTools 126 with default parameters. The top 1% of F ST values were determined and visualized by a custom Perl script (cp2; see URLs). Nucleotide diversity (π) was estimated in TASSEL 72 as described by Nei and Lin 25 .
See the Supplementary Note for additional details.
Gene expression analysis. Root tissue was collected from population 97837 plants with yellow (yyY 2 Y 2 ) and white (YYY 2 Y 2 ) genotypes, with two biological replicates per genotype, 80 d after planting. Root tissue was collected from population 70796 plants with dark orange (yyy 2 y 2 ) and pale orange (YYy 2 y 2 ) genotypes, with three biological replicates, 100 d after planting. Total RNA was extracted from whole-root tissue using the TRIzol Plus RNA Purification kit. RNA quantity and integrity were confirmed with an Experion RNA StdSens Analysis kit. All samples had RQI values above 8.0. Paired-end libraries (insert size of 133 nt) were sequenced on Illumina HiSeq 2000 lanes (2 × 100-nt reads). Filtered reads were aligned to the v2.0 genome assembly using TopHat v2.0. 12 (ref. 78). The aligned read files were processed by Cufflinks v2.2.1 (ref. 93). Testing for differential expression was done at the level of genes, isoforms, and promoters. PCR was carried out to verify the 212-nt indel in the Y candidate gene (DCAR_032551) (Supplementary Fig. 42).
Expression values were log 2 transformed, and the WGCNA package 127 in R with signed correlations was used to determine gene coexpression modules with a soft threshold value β of 10 and a treecut value of 0.6. Functional annotation of genes within this module was determined by BLASTP search of protein sequences within this module against the A. thaliana TAIR10 (ref. 128) predictions, and GO enrichment analysis based on BLASTP best hits to TAIR10 was performed using AgriGO 129 and PANTHER 130 .
Genes that were simultaneously upregulated or downregulated in both yellow and dark orange samples, relative to the white and pale orange samples, were manually annotated. GO annotations and subcellular localization are also reported.
See the Supplementary Note for additional details.
Identification of flavonoid and isoprenoid pathway genes. The peptide sequences for carrot predicted genes were aligned against annotated flavonoid and isoprenoid pathway genes in the KEGG database ( Supplementary  Tables 46, 84, and 85). BLASTP 69 was carried out using default parameters. Sequences with <50% identity, <50 residues were excluded. Peptide sequences from genomes having orthologous relationships with retained carrot genes were extracted from the genome evolution analysis. Genes annotated from the A. thaliana and tomato genomes were manually verified. Multiple-sequence alignments were generated with Clustal W 118 Table 47) along with known terpene synthases (TPSs) from seven other species were used for analysis with MEGA. The amino acid substitution models tested were WAG, mtREV, Dayhoff, JTT, VT, Blosum62, and CpREV. The tree with the highest AICc value was obtained with the JTT+F model with estimation of the gamma distribution. The phylogenetic tree was then rooted at the split between the type I (TPS-c, TPS-e, TPS-f, and TPS-h) and type III (TPS-a, TPS-b, and TPS-g) subfamilies (Supplementary Fig. 45). See the Supplementary Note for additional details.