Population genomics identifies genetic signatures of carrot domestication and improvement and uncovers the origin of high-carotenoid orange carrots

Here an improved carrot reference genome and resequencing of 630 carrot accessions were used to investigate carrot domestication and improvement. The study demonstrated that carrot was domesticated during the Early Middle Ages in the region spanning western Asia to central Asia, and orange carrot was selected during the Renaissance period, probably in western Europe. A progressive reduction of genetic diversity accompanied this process. Genes controlling circadian clock/flowering and carotenoid accumulation were under selection during domestication and improvement. Three recessive genes, at the REC, Or and Y2 quantitative trait loci, were essential to select for the high α- and β-carotene orange phenotype. All three genes control high α- and β-carotene accumulation through molecular mechanisms that regulate the interactions between the carotenoid biosynthetic pathway, the photosynthetic system and chloroplast biogenesis. Overall, this study elucidated carrot domestication and breeding history and carotenoid genetics at a molecular level.


Table of Contents
PacBio SMRT cells genomic DNA library preparation: High molecular weight DNA was extracted using the CTAB protocol as described by 1 .DNA quality and concentration were evaluated using gel electrophoresis and the Qubit dsDNA BR Assay Kit (Thermo Fisher Q32850).About 10μg of DNA was assessed using the Agilent TapeStation to assure the size of the DNA fragments were ≥60 kb.One SMRTbell library was generated using the SMRTbell Template Prep Kit (Cat 100-259-100) following the PacBio's ">20kb Template Preparation instruction and using BluePippin Size-Selection System (15-20kb) for Sequel Systems" procedure and checklist.
PacBio whole genome sequencing: Sequencing was performed using a Pacific Biosciences Sequel machine at the Genomics Laboratory at the DHMRI, Kannapolis, NC.In total eight SMRT cells were run yielding over 34 Gb of long read DNA sequences (~73× genome coverage) with an average subread length of >8.6kb (table S1).All the sequences from the eight SMRT cells were combined and used for the downstream analysis.
Oxford Nanopore genomic DNA library preparation: DNA was extracted using the CTAB method as described in 1 with some modifications to clean the DNA samples.Modifications included an RNAse treatment at 37 °C for 60 minutes to remove RNA, removing the NaOAc treatment to avoid salt contamination, adding an overnight step at -20 ℃ prior to final precipitation and washing the DNA with 70% ethanol three times.DNA quality and concentration were evaluated as described above.The libraries for Oxford Nanopore sequencing were prepared according to the library preparation protocol 1D Lambda Control Experiment (SQK-LSK1108).
Oxford Nanopore whole genome sequencing: Using the Oxford Nanopore real-time long read sequencing platform (FLO-MIN107 and SQK-LSK108) over 717 Mb sequences (~1.5× genome coverage) were generated, with an average subreads length of 5.3 Kb (Supplementary Table 1).Sequencing was performed at Plants for Human Health Institute (PHHI), Kannapolis, NC.The base calling was performed using the base caller script implemented in "ont-albacore" software package v2.3.1 2 .

Plant material and library preparation:
The Hi-C library for DH1 was prepared using the PhaseGenomics (Seattle, WA) Proximo Hi-C Plant Kit.Approximately 0.2g of young leaf tissues were collected from the DH1 plants, rinsed thoroughly in dH2O, patted dry, and chopped finely in a petri dish using a scalpel.Petiole tissue was discarded.The finely chopped leaf tissue was used to create a Hi-C library according to the Proximo Hi-C Plant Kit protocol.The library concentration was determined using the Qubit 3.0 fluorometer (Life Technologies, Carlsbad, CA).QC was performed by Phase Genomics, using qPCR analysis and running the Agilent TapeStation (Agilent, Santa Clara, CA).An SPRI bead size-selection was performed using the right-hand selection protocol outlined in the SPRI select User Guide (Beckman Coulter, Brea, CA).After size-selection, the DH1 carrot library had an average fragment size of ~400 bp.The library had a molar concentration of 8.64 nM after size-selection and correcting for fragment size according to the qPCR run.
Hi-C sequencing: Hi-C libraries were sequenced by Novogene (Sacramento, CA) on one lane of the Illumina HiSeq 2500 using PE150 chemistry yielding to over 465M pair-end reads (~295× genome coverage).Raw reads were processed by trimming adaptor and low-quality sequences using Trimmomatic 3 considering the HiSeq adapters the following parameters: 2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:80.The trimmed and high quality sequences retained after this step were used for the downstream analysis (Supplementary Table 1).
In addition to the newly generated long reads (PacBio and Nanopore) and Hi-C reads, previously published 1 DH1 DNA sequences including 10 Kb, 20 Kb 40 Kb meta-pair reads, and a collection of BAC-end sequences (BES), were used for scaffolding and quality verification (Supplementary Table 2).

RNA sequencing
Plant material: For RNA extraction, fresh tissue from germinating seedlings, flower, phloem, xylem, petiole from 2 month old plants and leaf at two developmental stages (fully open and young unexpanded leaf), were collected in triplicates from DH1 carrot plants, frozen immediately in liquid nitrogen and stored at -80 ℃.
PacBio IsoSeq total-RNA library preparation: Total RNA was extracted using the RNeasy Mini Kit (Qiagen, Valencia, CA, United States) in accordance with the manufacturer's protocol and treated with RNase-free DNase I (New England BioLabs, Ipswich, MA, United States).
RNA integrity was analyzed using Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA, United States) and samples were pooled in equimolar amounts to generate four tissue specific SMRTbell® libraries corresponding to germinating seedlings (SD), phloem (PH), xylem (XY) and upper tissues (UP).The UP library included pooled (in equimolar amount) RNA from the leaf, petiole and flower samples.Each library was constructed according to the protocol for Iso-Seq™ Template Preparation for Sequel® Systems from PacBio (Pacific Biosciences, Menlo Park, CA, United States).In brief, for each library, 1 µg of total RNA was reverse transcribed using the SMARTer® PCR cDNA Synthesis Kit (Takara Bio USA, Inc.) and amplified with Prime STAR GXL DNA Polymerase (Takara Bio USA, Inc.).Amplicons were purified and size selected using 1× and 0.40× AMPure PB Beads and used to generate SMRTbell® libraries using the SMRTbellTM Template Prep Kit 1.0 (Pacific Biosciences, Menlo Park, CA, United States).
IsoSeq sequencing: IsoSeq sequencing was performed using a Pacific Biosciences Sequel machine at the Genomics Laboratory at the DHMRI, Kannapolis, NC.Each library was sequenced in two SMRT cells.In total 117 M reads were generated totaling 142 Gb of sequences (Supplementary Table 3).All the sequences from the four libraries, and eight SMRT cells were processed and used for downstream analysis.

Genome assembly
A schematic view of the carrot genome assembly strategy is shown in Extended Data Fig. 1 and was summarized in three phases (I, II, II).The following terms are used to describe the carrot genome assembly DH1 v3.0.A contig is a contiguous genomic sequence that does not contain unknown bases (Ns) and that was not merged into scaffolds or superscaffolds.Scaftigs are defined as all portions of a final assembly (superscaffolds, scaffolds, and contigs) consisting of contiguous sequence, with gapped sequences split into separate scaftigs at every occurrence of unknown bases (Ns).A scaffold is a portion of the genome sequence reconstructed from endsequenced whole genome shotgun fragments and composed of contigs with associated gaps.The final assembly is non-redundant, i.e. sequences or portions of sequences used in higher level constructs are removed from the lower category, e.g. if a contig is present in a scaffold, it is no longer present in the final assembly as a contig.The nomenclature for sequences is DCARv3_Chr# for the nine chromosome pseudomolecules, in which DCAR indicates carrot genome, v3 represents the third version of the genome assembly, and "#' is the unique numeric identifier for the sequence.Scaffold, contig and scaftig terminology was used only to describe the statistics of the assembly and not to label sequences in the final assembly.

Phase I De novo assembly, scaffolding and polishing
De novo genome assembly: The de novo genome assembly of DH1 carrot was conducted using all raw PacBio long reads sequences obtained from the eight SMRT cells (See section 1.1 and Supplementary Table 1 ).FALCON (v.0.3.0) 4 and CANU (V.1.5) 5assemblers were independently used for de novo genome assembly.The FALCON genome assembly was performed using the Bioinformatics Research Computing Cluster (BRC) at North Carolina State University using 17 nodes (each 32 cores and 64GB RAM) using the following parameters: length_cutoff=5000, length_cutoff_pr=1000, --max_diff=50, --max_cov=50, --min_cov=4.The CANU assembly was performed using a VRTX machine located at Plants for Human health Institute (PHHI), Kannapolis NC, with 72 cores and 200GB RAM using the default parameters.
The statistics of each assembly are reported in Supplementary Table 4.The results of the two assemblies indicated that CANU assembled a larger portion of the carrot genome, with a higher contig N50.Indeed the CANU assembly represented about 34 Mb (436 Mb vs 402 Mb) of extra sequence and had a 3.7 fold longer (3.36 Mb vs 0.89 Mb) contig N50 compared to the FALCON assembly.Also the CANU assembly covered the sequences assembled by the FALCON assembly.

Genome sequence polishing:
Although the CANU assembler includes a step to correct the PacBio reads as a part of its pipeline, two extra rounds of polishing were performed to further improve the quality of the assembled sequences.The assembly contigs were polished by aligning the raw PacBio reads to the assembly, and correcting the sequencing errors using the Arrow algorithm implemented in GenomicConsensus (https://github.com/PacificBiosciences/GenomicConsensus).To further improve the assembly, another round of polishing was performed by aligning the Hi-C short reads to the assembly and correcting the sequencing errors using Pilon (v1.23) 6 (Supplementary Table 4), using default parameters.

Scaffolding:
The high quality Hi-C reads were aligned to the polished contigs using BWA mem (0.7.17) 7 and the mapping information was used to cluster and order the contigs into scaffolds using SALSA2 (v2.3) scaffolder 8 .SALSA2 was executed with 200 iterations and GATC as the restriction site.SALSA2 also broke the contigs into sub-contigs when a significant number of discordant mapping reads indicating chimeric sequences were detected (Supplementary Table 4).
A number of unknown nucleotides ("N"s) separated the sequence of the contigs assembled into a scaffold.The number of Ns were estimated by SALSA2 considering the paired-end read insert size in the genome assembly.Contigs that were not assembled into scaffolds in this step remained as single contigs (Singletons).
Overall, 131 gaps were closed and 414 contigs were extended, adding ~2.6 Mb of known sequences (~0.118Mb replaced with Ns and ~2.5 Mb contigs extension) (Supplementary Table 45).This resulted in a significant improvement of the genome sequence contiguity (+ 1.9 Mb contig N50) (Table 1 and Supplementary Table 4).

Anchoring and chimeric sequence correction:
In this step, multiple data were integrated to identify and correct chimeric sequences, ordering and orienting all the sequences into their corresponding chromosome pseudo-molecules.
A high-quality genetic map (2,075 markers) and its bin map (918 bins each representing a true recombination event) 1 were used to anchor and orient the contigs and scaffolds into pseudomolecules.BWA mem 7 was used to map the marker sequences to the indexed genome produced in section 2.1 The sequence alignment/map (SAM) file was filtered for MAPQ>=10 and marker sequences with mapping frequency of >1 were discarded.The resulting high-quality and uniquely mapped markers were then integrated with their associated genetic linkage groups.
This analysis was coupled with the information from 10 kb, 20 kb, and 40 kb meta-pair reads (MPE), a collection of BAC-end sequences (BES) and the Hi-C heatmap information (Supplementary Table 2).STAR aligner (v2.7.10a) 10 was used to map the meta-pair reads to the indexed genome produced in section 2.1.BWA mem 7 was used to map the BES to the indexed genome produced in section 2.1.All the SAM files were then filtered for MAPQ >=10 and the reads that had both pairs mapping were kept for the downstream analysis.The Hi-C reads were processed as follows: • mapping the high-quality Hi-C reads pre-processed in section 1.2 on the genome assembly obtained from section 2.1 using BWA mem 7 considering the following parameters (-5 -t 44); • extracting the mapped reads with MAPQ >= 10 and order of R1, chr1, pos1,0, R2, chr2, pos2, 1, mapq1, mapq2 in tab delimited format; • Sorting the output using a custom command (sort -k3,3d -k7,7d); • Running juicer_tools.1.8.9_jcuda.0.8.jar included in the Juicer package 11 considering pre -q 10 (.hic file); Possible chimeric sequences were identified as: • Scaffolds or contigs containing sequences of markers mapped to different LGs or to distal locations of the same LG; • Scaffolds or contigs with regions not covered by MPE and BES sequences; • Sequences with no interaction signals along the Hi-C heatmap diagonal; Within each chimeric sequence, the chimeric region was identified as those sequences not covered by MPE or PE-BAC sequences and with connection to other distal locations.Those regions were than manually inspected.The mid-point between the closest unambiguously aligned MPE sequences flanking the chimeric region was defined as the misassembly point.The corrected sequences were then used to progressively construct pseudomolecules.
Using this approach, 27 misassembled/chimeric regions were corrected and 389 Mb (164 contigs) were anchored, ordered and oriented into the 9 carrot chromosomes.This resulted in a genome sequence assembly with a scaffold N50 of 44.02 Mb (Supplementary Table 4).

Phase III
Guided assembly: RaGOO (v1.1) 14 was used to further improve the contiguity of the genome assembly and to anchor any additional non-anchored sequences.In this process, the contig-level version (scaftigs and contigs) of the genome obtained from section 2.1 were used with the scaffold-level version of the same genome (section 2.2) considering the following parameters (-C -t 43 -i 0.3).In this step, the order and orientation of all the anchored and oriented scaftigs were not altered, and the 409 non-anchored sequences (largely telomeric and centromeric sequences based on the RepeatMasker analysis) were anchored to the 9 pseudomolecules (Supplementary Table 5).Hence, a fully anchored and oriented genome assembly sequence with a scaffold N50 of 51.16 Mb was obtained.
Gap Filling II: LR_Gapcloser (v3, 29 June 2007) 15 , a tiling path-based gap closer that uses long reads to close gaps in a genome assembly, was used to fill the gaps across the pseudomolecules (scaffolds) considering the following parameters (-s p -t 40 -r 5).The assembly obtained from the previous step and the PacBio and Nanopore reads were used as input files for this analysis.

Identification of the organelle genome in the assembled sequences
The organelle genomes of carrot DH1 were identified and extracted from the polished CANU sequences and were kept and analyzed independently and not reported in Supplementary Table 4.
The result of the analysis for the carrot organelle genomes are presented below.
To identify the chloroplast and mitochondrial sequences in DH1 v3 genome, DCARv2 plastid (NCBI accession number LNRQ01000011.1)and mitochondrial (NCBI accession number LNRQ01000010.1)sequences were tested against all the CANU assembled contigs.MUMmer 4 16 was used to identify the matching sequences and generate the collinearity plots.Interestingly, only one sequence (contig 12) had a clear match versus both sequences with no overlapping region.In fact, CANU assembled the DH1 v3 plastid and mitochondrial sequences both side by side in one sequence (contig 12).This allowed us to clearly establish the DH1 v3.0 organellar genomes for the downstream analysis.
The two organellar genomes were separately extracted from this contig.The circular plastid sequence was rearranged to follow the usual convention of starting with the large single copy region.The mitochondrial sequence, also a circular molecule, was rearranged to match the convention of the DCARv2 assembly, with the mitochondrial plasmid region at the 3' end.
Sequences and overlaps of the circular molecule ends were verified by examining mapped reads.
Comparing the organelle genomes of v2 versus v3, the v2 plastid and mitochondrial had one and two gaps (unknown sequence regions), respectively, while these gaps were filled with known sequences in the DH1 v3 organelle genomes.One mitochondrial region, 91 nt in v2 (DCARv2_MT:142210..142300) and 162 nt in v2.0 (DH1 v3.0_MT:146735..146896), was examined and determined to be a tandem repeat that was missed in the v2 assembly, since this repeat is slightly larger than the read length used for the v2 assembly.In total, v2 plastid and mitochondrial sequences had assembled lengths of 155,848 and 244,980 nt, respectively while the DH1 v3 plastid and mitochondrial sequences have assembled lengths of 155,839 and 250,368 nt.In addition to other small corrections, this new assembly completed the mitochondrial circular genome by adding 9,913 nt of new sequence linking the ends of the incomplete v2 sequence.The sequence of the final DH1 plastid and mitochondrial genomes were deposited in NCBI (BioProject PRJNA798760; accession number, Plastid = CP093353; Mitochondrion = CP093352).

Annotation of the organelle genome
Gene annotations from the DCARv2 genome, sequences LNRQ01000010 (MT) and LNRQ01000011 (PT) were transferred to the v3 genome using UCSC Liftover 17 .Sequences of the gene predictions were compared, and were found identical between the assemblies, with the exception of one intron in DCAR_032448 (v2) vs. DCAR_M036238 (v3) rps3 containing a 17 nt insertion.New sequence in the DH1 v3 mitochondrial genome not present in the v2 assembly was evaluated for additional genes, but none were found.

Assembly quality verification
The reliability of reference sequence data is crucial for the interpretation of downstream structural and functional genomic analysis.Thus, a comprehensive analysis was carried out to evaluate the quality of the final carrot DH1 v3 genome assembly.

Analysis of GC content and contamination
To evaluate and verify the quality of the DH1 v3.0 genome assembly, GC content distribution and sequence contamination analyses were carried out.
Sequence contamination can bias the average GC content across the whole genome sequence.
To evaluate the quality of the assembled sequence from this perspective, a binning approach to examine the average GC content of every 10 kb non-overlapping window covering the whole genome sequence was carried out using a custom script.The analysis revealed the average GC content of 34.9% (STD ± 3.1) across the genome with no abnormal GC content value for any specific regions (Extended Data Fig. 5).

Evaluation of sequence correctness using PE data
Assuming that the distance between two ends of a paired-end sequence (PE) represents their true physical distance, we estimated the observed distance between the two end sequences of the BES-PE data based on their alignment against the carrot genome assembly.For this analysis we used 4,717 PE BACs that unambiguously aligned (filtering parameters described in 2.2) with both ends to the carrot genome assembly DH1 v3 and that were not used during the assembly process (in section 2.2).The fraction of PE data that aligned within the expected library insert size should reflect the fraction of assembled sequences that are consistently contiguous and correctly assembled.The results of this analysis were expressed as percent of PE reads that aligned within the average estimated insert size of the PE library, plus/minus twice the standard deviation.Overall, 9,213 (99.5%)PE BACs, unambiguously aligned within the estimated library insert size (Supplementary Table 6).In addition, the Hi-C heatmap showed a uniform distribution of genomic interactions along the diagonal, demonstrating the proximity of the assembled sequences and the quality of the assembly (Supplementary Fig. 2).

Evaluation of pseudomolecules and genetic linkage map collinearity
To independently verify the order of the scaftigs along the nine pseudomolecules (chromosomes), an independent high-quality genetic map (different from the one used for the anchoring and orienting process, section 2.2) was used.The linkage map was developed from population 3242 and included 989 non-redundant markers 19 .Marker sequences were mapped against the DH1 v3 assembly using BWA mem 7 .Reads with MAPQ<10 and/or multiple mapping were discarded and only high-quality uniquely mapped reads were retained for this analysis.The results revealed that the DH1 v3 assembly and the linkage map were highly collinear confirming the quality of the assembly (Extended Data Fig. 3).

Gene space coverage
The following transcriptome data and mapping strategy were used to assess gene space coverage of the carrot DH1 v3 genome assembly: • a collection of 18,137 consensus carrot expressed sequence tags (ESTs) from 20 were aligned to the genome using BWA mem.Only ESTs with MAPQ>=10 were included; • a collection of 512.9MRNA-Seq reads from 20 libraries representing expressed sequences from 20 different DH1 carrot tissues, developmental conditions or developmental stages (NCBI BioSamples SAMN03965304 -SAMN03965323) 1 were mapped and assembled with StringTie (v1.3.5) 21; • a collection of 248,122 high-quality IsoSeq full-length transcripts were mapped to the carrot genome assembly using GMAP (v2021-08-25) 22 with the option -f 2 activating the splice aware prediction and mapping.These analyses indicated that ~99.04% of the Illumina reads, 96.7% of the ESTs and 99.95% of the high-quality full-length IsoSeq transcripts (Supplementary Tables 7 and 8) aligned to the carrot genome assembly, providing evidence that the assembly covers the majority of gene space.
Together, the statistics and verifications provided strong evidence that the DH1 v3 genome assembly is of high quality.Compared to the DH1 v2 genome, DH1 v3 has higher quality in terms of genome coverage, sequence contiguity length and sequence completeness (Table 1).

Repetitive sequences
To compare the repetitive content of the v2 and v3 genomes, a new annotation of the repetitive sequences in the two genomes was performed using the same method described here.
Repetitive sequences accounted for 49.5% (234 Mb) of the estimated genome size, including 53.1 Mb (11.3%) more repetitive sequences than predicted in the DH1 v2 genome assembly (Table 1 and Supplementary Table 9).The majority of the newly annotated repetitive sequences were long terminal repeat (LTR) elements, mainly located in centromeric and pericentromeric regions (Fig. 1a).A total of 8,063 full-length LTRs were predicted in DH1 v3, 25% more than in DH1 v2.The increase in the number and cumulative length of full-length LTRs was particularly evident for the most abundant lineages, DcSIRE (Ty1/Copia superfamily) and DcRetand, DcAthila, DcTekay (Ty3/Gypsy superfamily) (Extended Data Fig. 4).The higher GC content (36-47%) of these LTR families, contributed to a higher GC content observed in newly assembled sequences (Extended Data Fig. 5, Supplementary Table 10).As expected, the most abundant newly assembled and annotated LTR lineages DcSIRE, DcRetand, DcAthila were also the younger elements, which likely represent genomic regions with high similarity, that were not fully assembled in the DH1 v2 assembly (Extended Data Fig. 6).The raw LTR Assembly Index (raw LAI), a standardized metric for comparison between assemblies 22 was much higher for DH1 v3 (22.88) as compared to v2 (5.09).These differences were consistent across all chromosomes (Extended Data Fig. 7), reflecting markedly improved contiguity of the current assembly.

Gene model prediction with MAKER
Gene prediction in the DH1 v3 genome was based on the integration of de novo gene prediction and evidence-based predictions.This process included several iterations of homologybased and in silico gene prediction steps, using short and long read sequences (Illumina and PacBio) as well as gene models obtained from multiple closely related species and model organisms.

2019).
Carrot Illumina assembled transcripts were also used as a training sets, however preliminary results indicated that this data was introducing a large number of false gene models.Hence, these sequences were removed from the training set.MAKER was run with three iterations, and after each iteration the program was re-trained.After each iteration, predicted gene models were loaded into the IGV for quality verification and for tuning the parameters.MAKER based prediction produced 28,721 gene models.These gene models were then used to conduct the final homology based gene model prediction.

Gene model prediction with GeMoMa
GeMoMa (v1.6) 40 , a homology-based gene prediction program that predicts gene models in target species based on gene models in evolutionary related reference species, was used to improve the quality of the splice junction sites predicted by MAKER and to predict the gene models that were not predicted by MAKER.GeMoMa was run using default parameters.The datasets included as input in GeMoMa were: predicted genes from related species described in section 5.3.1;final gene models produced from MAKER pipeline; Illumina splice sites to enhance the splice site prediction of the genes on the genome.This analysis produced an intermediate set of 32,625 gene models that integrated MAKER based prediction + GeMoMa base prediction.The set of gene models were used for final gene model curation.

Gene model curation
In addition to the in silico and homology base gene model predictions described above, a multi-step approach was used to generate the most comprehensive gene model catalog for the carrot genome DH1 v3.DCARv2 (32,112) and RefSeq (44,484) gene models predicted on DCARv2 were transferred to the v3 genome sequence.This step was performed using GMAP 22 and GenomeThreader (V2021-08-25) 41 that are two independent splice aware gene model predictors.The predicted loci were compared with the predicted v3 gene models explained before.DCARv2 or RefSeqgenes that were successfully transferred/predicted on the v3 but not predicted by Marker and GeMoMa (see above) and that had at least one experimental evidence not masked by the repetitive sequences, were added in the v3 gene model catalog.In those case where the structure of the GMAP and GenomeThreader based predictions and/or/ the RefSeq and the DCARv2 gene models were not in agreement, the model was manually checked and the structure confirmed by the experimental evidence was considered as the final prediction.In addition, the full-length high-quality IsoSeq transcripts (248,122) were transferred to the v3 genome sequence (using GMAP and GenomeThreader as two independent splice aware gene model predictors) and the predicted loci was compared with the predicted v3 gene models explained before.Those IsoSeq full-length transcripts with appropriate gene structure that were not masked with repetitive sequences, and not yet predicted by all method described above, were also added to the catalog.In total, 3,591 gene models were added in this manual curation and polishing which resulted to the total of 36,211 gene models in the v3 gene model catalog (Supplementary Tables 11 and 12 ).

Gene annotation and quality verification
Blast2Go (OmicsBox 3.0.29) 42was used to annotate the predicted gene models obtained from the last step using the NCBI, KEGG, InterPro, and GO databases.In total, 99.5%, 3.4%, 0.8%, 94.2%, and 91.5% genes were annotated using the NCBI nr 43 , GO 44 , KEGG Pathway 45 , InterPro 46 and InterPro GO databases 47 , respectively.This allowed the assignment of 34,297 genes with potential functions (Supplementary Table 14).The remaining 271 genes were labeled as putative proteins in the gene model prediction and annotation catalog.
PlantTFcat (Downloaded on Dec 2020) 48 a reference plant transcription factor and transcriptional regulator categorization tool, was used to predict the transcription factors and regulatory genes in v3 gene models as well as the DCARv2 genes for comparison purposes.The summary tables of the transcription factor genes annotated in v3 and v2 as well as their comparison presented in Supplementary Tables 20 and 21.PRGdb (v3.0) 49 , a comprehensive platform for prediction and analysis of plant disease resistance genes, was used to predict the resistant genes in v3 as well as the v2 for the comparison purposes.The summary table of the disease resistant annotated gene models from v3 and DCARv2 as well as the comparison are presented in Supplementary Tables 22 and 23.
To assess the completeness of annotation, the predicted gene models were searched against the BUSCO v.3 50 plant dataset (embryophyta_odb9) (Supplementary Table 13).
1,000 * 2-ΔCT) to enhance readability 54,55 .Statistical analysis was performed using a one-way analysis of variance (ANOVA) using the software Statistical Product and Service Solutions (SPSS) v 23 (IBM, NY) followed by Tukey's HSD test.
The structures of all 8 loci's isoforms were confirmed (Extended Data Fig. 10).The relative expression level of each isoform at a specific locus correlates with the percentage of unique Circular Consensus Sequence (CCS) detected in the IsoSeq experiment.For instance, DcHpxo (DCAR_310121) is subject to an alternative splicing of its 4 th exon which was only detected in about 10% of the transcripts by both IsoSeq and RTqPCR experiments (Extended Data Fig. 10).For two loci (GST-DCAR_124495 and TPS-DCAR_113298), the conserved protein domain is lost in one of the isoform which could have impact its function, highlighting a potential functional role of these isoforms.For instance, in mosquitos, a GST was found to be regulated through alternative splicing with five coding exons that are alternatively spliced to produce four different mature GST transcripts 56 .TPS are the key enzymes responsible for the biosynthesis of terpenes 57 .Our RT-qPCR, and IsoSeq data, revealed that TPS-DCAR_113298 is specifically expressed in the aerial tissues with a third of its transcripts producing potentially no functional proteins, an interesting candidate to further investigate the function of alternative splicing with a potential role in the aroma of carrot and closely related aromatic plants from the Apiaceae family.

Testing Landrace_A-W as possible feral lineage
The high level of admixture between Landrace-AW accessions and Landrace-A observed in the population structure analysis suggests that gene flow between wild and cultivated accessions occurred during the carrot improvement process.To test this hypothesis, SNPs from all populations (low admixture set) were analyzed using TreeMix (v1.12) 58 .The topology of the population graph matched the relationship established by the phylogeny (Supplementary Fig. 2).
Gene flow from the Wild population to Landrace-AW was detected.In addition, the TreeMix results suggest the presence of gene flow from the Early cultivar group to the Wild population.
Gene flow between these populations is also reinforced by FST estimates, which indicated that among all cultivated accessions, Early cultivars have the least amount of differentiation (FST = 0.12) from the Wild population (Fig. 3a and Supplementary Table 27).Multiple lines of evidence point to the Landrace-AW accessions representing a feral lineage of carrot that has historically nm.Carotenoid concentrations was reported in μg/g dry weight of tissue.Since α-carotene and βcarotene are the primary carotenoids in orange carrots, and they account for carrot's orange color 59, 62 , the sum of the α-carotene + β-carotene concentration, relative to total carotenoid concentration, on a per sample basis, was used as an objective measure of carrot color in this study.This ratio is denoted as the α-carotene + β-carotene to total carotenoid ratio, or fraction of α-carotene + β-carotene to total carotenoid (Supplementary Table 30).A similar measure of lutein concentration was reported.The HPLC data was filtered to remove samples with inconsistencies between technical replicates.This resulted in a set of 435 accessions with HPLC scores and used for GWA analyses.A Spearman's rank correlation of the ratio of the carotenoids relative to the total carotenoids was completed in base R (v3.6.3)(R Core Team) and plotted with the R package ggplot 65 ( Supplementary Fig. 8).

RNA-sequencing analysis
Experimental Design and RNA Sequencing.In order to profile gene expression in the context of Or segregation, seed from an F3 population derived from a cross between an orange inbred parent (B493) from the USDA carrot breeding program and a white wild carrot (Z007) collected in Uzbekistan (accession Ames 27395) was obtained from 66 .This population was chosen for transcriptome profiling as it was homozygous recessive (based on pedigree and previous mapping information) at known major carotenoid accumulation loci, Y and Y2 1,67 , yet also segregating at the Or locus.Seed was planted in a greenhouse at the University of Wisconsin-Madison and harvested at 21-day intervals, starting at 60 days after planting.The first time point was chosen as it corresponded to the visual onset of carotenoid accumulation in the roots, as in 67 .
Since this population was segregating for the cultivated Or allele (Or_C) and the wild Or allele (Or_W), root and leaf tissue from each plant were flash-frozen in liquid nitrogen.Primers were designed to PCR-genotype a SNP located in exon 8 of the Or-like (DCAR_310369) coding DNA sequence (Supplementary Table 46).Genomic DNA was then extracted from leaf samples and plants were genotyped by Sanger sequencing of the PCR amplicon generated using the primers listed in Supplementary Table 46.Four roots homozygous for Or_C and four roots homozygous for Or_W were identified at each of the 3 time points, for a total of 24 samples (Supplementary Table 42).Total RNA was extracted from roots using the TRIzol Plus RNA Purification Kit (Life Technologies, Carlsbad, CA).Genomic DNA contamination was removed using an on-column DNAse digestion (Invitrogen, Carlsbad, CA).RNA integrity was evaluated using an RNA NanoChip on an Agilent 2100 Bioanalyzer.Sequencing libraries were prepared using a TruSeq Stranded mRNA kit (Illumina, San Diego, CA) and libraries were sequenced on a NovaSeq 6000 sequencer at the University of Wisconsin Biotechnology Center in Madison, Wisconsin.Transcriptome sequencing generated 1,091,729,253 reads across 24 samples (Supplementary Table 42).
The interrologous transcriptional interactome network was predicted by using string database and analyzed Cytoscape (v3.9) 68,69 .The genes having conserved orthologs with the plant species network databases having significant correlation edge (FDR ≤ 0.05) were considered as nodes in the predicted network (Supplementary Table 40).The MCODE tool into Cytoscape was used to predict the organ specific enrichment of Gene Ontologies, KEGG pathways and STRING clusters related functional modules in the constructed network 70 (Supplementary Table 41).The expression correlation of predicted network was computed among the nodes using Pearson's correlation to study the co-expression among significantly enriched pathways (FDR ≤ 1e-2) using R Bio-conductor (v4.2.1) package following similar methodologies as carried out in earlier studies 71,72 .

Genotyping the Y2 insertion
Reads mapping to 3,822 bp long sequence, containing Y2 gene and 1kb long flanking region, were extracted and used for genotyping.To call TE insertion we used custom scripts.In brief, reads were mapped to the analyzed region, in which TE sequence was masked, using bowtie2 (-time --very-fast-local ) and part of soft-clipped read that did not match the sequence was extracted, blasted against the Y2 region (-dust no -soft_masking false), and number of reads was reported for the 200 nt long terminal regions of TE.The empty site was supported by identification of part of soft-clipped reads, extracted from result of bowtie2 mapping of all reads to the Y2 region with insertion of TE, in the 400 nt long region flanking TE insertion.If both types of reads were identified, the genotype was called as heterozygous.The minimum.
Minimum number of soft-clipped reads was set to 1.

1 .
Long reads whole genome sequencing Plant material: A doubled haploid orange Nantes type carrot (DH1, NCBI Biosample SAMN03216637), was used for whole genome sequencing with Pacific Biosciences (PacBio), Oxford Nanopore and Hi-C sequencing technologies.Carrot DH1 plants were grown in the greenhouse at the North Carolina Research Campus, Kannapolis, NC.For DNA extraction fresh unexpanded leaves were harvested, frozen in liquid nitrogen and stored at -80 ℃.
IGV 12 was used to visualize the mapped MPE, BES and markers, and JuiceBox 13 was used to visualize the Hi-C heatmap.A custom program was used to visualize and identify connections of PE sequences (PE BACs, 10, 20 and 40 kb MPE) to other scaffolds or contigs.Construction of the pseudo-molecules was initiated using scaffolds containing sequences of mapped markers.Scaffold connections supported by MPE and BES were used to connect the sequences.During this process the quality of each scaffold assembly and contiguity was verified by visually inspecting the coverage of large insert libraries (20 and 40 kb) and the consistency of the Hi-C signals along the diagonal and of marker order along the linkage map.