A chromosome-level reference genome of non-heading Chinese cabbage [Brassica campestris (syn. Brassica rapa) ssp. chinensis]

Non-heading Chinese cabbage (NHCC) is an important leafy vegetable cultivated worldwide. Here, we report the first high-quality, chromosome-level genome of NHCC001 based on PacBio, Hi-C, and Illumina sequencing data. The assembled NHCC001 genome is 405.33 Mb in size with a contig N50 of 2.83 Mb and a scaffold N50 of 38.13 Mb. Approximately 53% of the assembled genome is composed of repetitive sequences, among which long terminal repeats (LTRs, 20.42% of the genome) are the most abundant. Using Hi-C data, 97.9% (396.83 Mb) of the sequences were assigned to 10 pseudochromosomes. Genome assessment showed that this B. rapa NHCC001 genome assembly is of better quality than other currently available B. rapa assemblies and that it contains 48,158 protein-coding genes, 99.56% of which are annotated in at least one functional database. Comparative genomic analysis confirmed that B. rapa NHCC001 underwent a whole-genome triplication (WGT) event shared with other Brassica species that occurred after the WGD events shared with Arabidopsis. Genes related to ascorbic acid metabolism showed little variation among the three B. rapa subspecies. The numbers of genes involved in glucosinolate biosynthesis and catabolism were higher in NHCC001 than in Chiifu and Z1, due primarily to tandem duplication. The newly assembled genome will provide an important resource for research on B. rapa, especially B. rapa ssp. chinensis.


Introduction
The Brassica genus comprises various economically important crops, many of which are extensively cultivated worldwide as oil crops and leafy vegetables. Six Brassica crop species comprise the "U's triangle", which includes the three basic diploid species B. rapa (A genome), B. nigra (B genome), and B. oleracea (C genome), as well as the three amphidiploid species B. juncea (A and B genomes), B. napus (A and C genomes), and B. carinata (B and C genomes) 1 . Brassica campestris, which is often used as a synonym for B. rapa, is an agronomically important species that includes various widely cultivated subspecies such as the turnip (ssp. rapa), Chinese cabbage (ssp. pekinensis), non-heading Chinese cabbage (ssp. chinensis), rapini (ssp. sylvestris), yellow sarson types (ssp. trilocularis), and oil types (ssp. oleifera) 2,3 . Brassica campestris has been cultivated for specific phenotypic characteristics, such as enlarged edible roots, mid-ribs, leaves, and oil seeds. Non-heading Chinese cabbage (NHCC) is an important B. campestris (syn. B. rapa) subspecies that includes pak-choi (var. communis Tesn et Lee), Tacai (var. rosularis Tsen et Lee), Caitai (var. tsai-tai Hort.), Fenniecai (var. multiceps Hort.), and Taicai (var. tai-tsai Hort.). Pak-choi can be further divided into white petiole and green petiole types. It is one of the most popular vegetables in China, Vietnam, the Philippines, and other East-Asian regions and is becoming increasingly popular around the world for its sweet, succulent, and nutritious leaves and stalks.
The genomes of two model plants, the dicot Arabidopsis and the monocot rice, were completed in 2000 and 2002 using early generation sequencing systems 4,5 . To date, approximately 200 plant genomes have now been published 6 . In recent years, sequencing technologies have undergone tremendous development. Single-molecule sequencing, also referred to as third-generation sequencing, aims to meet the demand for high-quality plant genome assembly 7 , and PacBio and Oxford Nanopore Technology (ONT) sequencing have been used to assemble new, high-quality reference genomes for maize and tomato 8,9 . The Brassica genus provides a good opportunity to study genome evolution in polyploids. The first B. rapa genome draft published in 2011 was assembled using a whole-genome shotgun strategy with Illumina short reads 10 . The recently released B. rapa Chiifu genome v3.0 based on PacBio sequencing lacked nearly 20% of the expected genome content (353.14 of 442.9 Mb), and the assembly was highly fragmented (contig N50 1.45 Mb) 11 . The genome of a new morphotype, B. rapa Z1, was also assembled using Nanopore sequencing with a contig N50 of 5.51 Mb 12 . Previously released B. rapa genomes provide great convenience for both genetic and comparative genomic studies, but they cannot fully satisfy the requirements of subsequent functional genomics research and the molecular breeding of non-heading Chinese cabbage. Therefore, it is necessary to assemble a high-quality reference genome for NHCC. In addition, current sequencing technologies are evolving rapidly, and the development of improved technologies enables the production of higher quality genomes.
Here, we present a chromosome-level assembly of B. rapa NHCC001 produced using a combination of PacBio sequencing and chromosome conformation capture (Hi-C) technologies. Our newly assembled B. rapa genome achieves a high level of continuity and completeness. It provides insights into the evolution of Brassica and constitutes an important resource for research, especially on the molecular mechanisms that underlie agricultural traits and on the breeding of B. rapa ssp. chinensis.

De novo genome assembly
The size of the B. rapa NHCC001 genome estimated by k-mer analysis was 477.76 Mb, larger than that of the B. rapa Chiifu genome, which was estimated to be 442.90 Mb (Supplementary Figure S1). The B. rapa NHCC001 genome heterozygosity rate was predicted to be about 0.17%, its repeat sequence content about 58.57%, and its GC content about 38.16%. We performed the genome assembly of B. rapa NHCC001 using 5,387,116 high-quality PacBio sequencing reads (61.44 Gb) with an N50 of 16,938 bases. Details of the PacBio sequencing reads are provided in Supplementary Table S1. All reads were assembled by SMARTdenovo into an initial genome of 384.71 Mb, which represented 80.52% of the estimated B. rapa NHCC001 genome size. The assembled genome contained 891 contigs with a contig N50 of 2.13 Mb and a maximum contig length of 16.11 Mb.
Hi-C data were used to assign the resulting contigs to their chromosomal positions. We generated 68.69 Gb of clean data with~143× coverage and anchored the assembled contigs to ten pseudochromosomes using the Hi-C data. The final chromosome-scale genome was composed of ten clusters, as indicated in the Hi-C interaction heat map ( Supplementary Fig. S2). The anchored 97.90% (396.83 Mb) of assembled genome content included 553 contigs clustered by Hi-C data ( Fig. 1, Supplementary Table S2). Among these clustered contigs, 300 (363.76 Mb) were anchored with defined order and orientation (Supplementary Table S2). The gap-closing step for pseudochromosomes was performed using errorcorrected Nanopore clean reads (Supplementary Table  S1). The final chromosome-scale genome was 405.33 Mb in length with 602 contigs (contig N50 = 2.83 Mb) and contained 312 scaffolds with a scaffold N50 of 38.13 Mb and 290 gaps ( Table 1). The final genome assembly represented 84.83% of the estimated B. rapa NHCC001 genome, compared with 79.73% and 74.20% for the recently released Chiifu v3.0 and Z1 genomes, respectively. The GC content of the assembled genome was 37.13%, similar to the 38.16% estimate from k-mer analysis (Table 1).
Completeness and accuracy of the assembly BUSCO v4.0.6 analysis revealed that 99.07% (1559 of 1614) of the core eukaryotic genes-including 1361 single-copy orthologs and 238 duplicated orthologswere present in our assembly (Supplementary Table S3). The base error percentage of the genome assembly was estimated to be 0.0011% (Supplementary Table S4). More than 99.96% of the full-length transcripts assembled de novo from transcriptome data had best hits on single contigs, confirming the completeness of the genome (Supplementary Table S5). In addition, 89.05-91.19% of the clean reads from four RNA-seq libraries could be uniquely mapped to the genome (Supplementary Table  S6). Taken together, these independent assessments confirmed that the B. rapa NHCC001 genome had high contiguity, completeness, and base accuracy.  Table S7). The vast majority of genes (97.88%; 47,135 of 48,158) were anchored on chromosomes, and only 2.12% (1023 of 48,158) were located on scaffolds. We used the newly assembled NHCC001 genome as a reference to measure the expression levels of 48,159 annotated genes (Supplementary Table S18) and found that 25,056 and 21,662 genes were expressed in leaf and root tissues, respectively, with FPKM ≥ 1.

Genome annotation
Homology-based and de novo approaches were also used to search for and predict repetitive sequences. A total of 213.04 Mb (52.56%) of the assembled NHCC001 genome comprised repetitive sequences, a percentage higher than that of the previous B. rapa Chiifu v3.0 assembly (37.93%, 133.95 Mb) (Supplementary Table S8). Among these repetitive elements, LTR retrotransposons were the most abundant, accounting for 20.42% of the genome, followed by DNA transposons (5.37%), LINEs (3.20%), and SINEs (0.29%) (Supplementary Table S8). In addition, 1.19% (4.86 Mb) of the assembled genome was annotated as non-coding RNA, including 151 miRNAs, 1,361 tRNAs, 3,907 rRNAs, and 1,272 snoRNAs.

Evolution of the NHCC001 genome
The genome sequences of 11 plant species and two other B. rapa subspecies were collected and used for comparative genomic analysis with NHCC001 to investigate its genome evolution and divergence. We clustered the annotated NHCC001 genes with those from the other plant genome using OrthoMCL. Of the 48,158 proteincoding genes in the NHCC001 genome, 46,316 were grouped into 27,536 gene families with an average of 1.68 genes per family. There were 7835 common gene families and 165 NHCC001-specific families (Fig. 2c, Supplementary Table S9). Furthermore, we found that 1771 gene families had expanded and 2051 had contracted in the newly assembled NHCC001 genome (Fig. 2b). The 165 NHCC001-specific gene families contained 423 genes, c Pseudomolecules with centromeres shown as red blocks. d Repeat density. e Gene density. f SNPs identified between B. rapa NHCC001 and B. rapa Z1. g SNPs identified between B. rapa NHCC001 and B. rapa Chiifu. h Indels identified between B. rapa NHCC001 and B. rapa Z1. i Indels identified between B. rapa NHCC001 and B. rapa Chiifu. Each inner curved line represents a syntenic block and 1501 of the 7835 common gene families contained one copy in each plant. We used these 1501 single-copy orthologs for phylogenetic analysis based on the maximum likelihood method. B. rapa NHCC001 and B. rapa Chiifu, which derive from the common ancestral genome of Brassica species, were clustered together on a branch (Fig. 2a). Using B. oleracea as an outgroup, evolutionary rates and protein mutation sites were compared among the three B. rapa genomes to add confidence to the genomic analysis ( Supplementary Fig. S4). The three subgenomes of B. rapa diverged approximately 1.0-2.3 Mya, and the data further confirmed that the Brassica ancestor diverged from Thellungiella parvula approximately 20.9-23.4 Mya (Fig. 2a).
4DTv and Ks values of gene pairs were estimated based on the syntenic blocks detected among NHCC001 and other plant genomes (Fig. 3a, Supplementary Fig. S3). The results indicated that a recent WGT event at 4Dtv ≈ 0.15 (Ks peak value ≈ 0.3), which was previously reported as a Brassicaceae-specific triplication (Br-α-WGD) [13][14][15] , had also occurred in the evolutionary history of NHCC001 (Fig. 3a). Furthermore, we confirmed the almost complete triplication of the B. rapa NHCC001 genome relative to those of Thellungiella halophila and A. thaliana  nigra occurred at a peak of ∼0.10, followed by that of B. rapa NHCC001 and B. oleracea (4dTv = 0.05), consistent with the phylogenetic analysis (Fig. 3). The genomes of B. rapa NHCC001, B. rapa Z1, and B. rapa Chiifu arose recently and did not exhibit significant divergence compared with the A and B genomes of Brassica species (Supplementary Fig. S3). These findings were consistent with the phylogenetic analysis of Brassica and other representative species.
Chromosome structure of the B. rapa NHCC001 genome Previous studies have proposed a model for comparative genomics and chromosomal analyses based on the concept of the Ancestral Crucifer Karyotype (ACK; n = 8), which comprises eight chromosomes and 24 genomic blocks (GBs, named A to X) 16 . Syntenic orthologs between NHCC001 and A. thaliana were identified first 17 , and the three subgenomes were identified based on the syntenic relationship between NHCC001 and A. thaliana (Supplementary Table S19). Subgenomes were ordered based on gene densities from high to low and were named LF, MF1, and MF2 17 . As expected, all 72 genome blocks (3 × 24) in the NHCC001 genome were identified, compared with 71 identified in Chiifu v3.0 and previous versions 18 (Fig. 4). Compared with the distribution of genome blocks in Chiifu V3.0, most were arranged similarly in NHCC001. The lost genome block G (MF2) in  Table S20). In addition, we identified two new fragmented genome blocks: T(MF1) and O(LF) were identified on chromosomes A05 and A09, respectively, but were not observed in Chiifu V3.0. Two new fragmented genome blocks, F (LF) and F (MF1) on chromosomes A01 and A05 in Chiifu V3.0, were not identified in NHCC001 or in ref. 10,18 . We counted the gene numbers in the three subgenomes and found 13,283, 9011, and 7419 genes in the LF, MF1, and MF2 subgenomes, respectively, that were syntenic to A. thaliana.
Global genome comparisons of three B. rapa genomes SNPs and indels were identified between B. rapa NHCC001, B. rapa Z1, and B. rapa Chiifu (Fig. 1,   Fig. 4 Distribution of genomic blocks along the ten chromosomes of the B. rapa NHCC001 genome. Genome blocks were assigned to the subgenomes LF, MF1, and MF2. Centromeres are shown as black ovals Supplementary Table S10). A total of 1,718,037 SNPs and 738,275 indels were identified between B. rapa NHCC001 and B. rapa Z1, and 1,305,874 SNPs and 469,629 indels were identified between B. rapa NHCC001 and B. rapa Chiifu. We then compared the NHCC001 genome sequence to those of Chiifu and Z1 and identified a large number of syntenic regions (Fig. 3c). A total of 534 syntenic blocks were detected between B. rapa NHCC001 and B. rapa Chiifu, comprising 49,166 gene pairs. Likewise, 557 syntenic blocks were detected between B. rapa NHCC001 and B. rapa Z1, comprising 48,976 gene pairs (Fig. 3c). Tandem gene arrays were also identified in the three genomes using SynOrths 19 . A total of 2211 tandem arrays (corresponding to 5296 tandemly duplicated genes) were identified in B. rapa NHCC001. By contrast, more tandem arrays (2317 arrays, 5584 genes) were identified in the B. rapa Chiifu genome and fewer (2013 arrays, 4781 genes) in the Z1 genome.
We identified 10,851 NHCC001-specific genomic segments (~13 Mb) and 8496 Chiifu-specific genomic segments (~10 Mb) longer than 500 bp (Supplementary  Table S11). Most (98.7%) of these PAV (presence and absence variation) sequences were shorter than 5 kb, although 222 and 177 PAV sequences were longer than 5 kb in NHCC001 and Chiifu, respectively (Supplementary Table S11). Three NHCC001-specific sequence clusters (compared with Chiifu) on chromosomes 7 and 2 contained 132 predicted genes, and eight NHCC001specific sequence clusters (compared with Z1) on eight separate chromosomes contained 596 predicted genes (Supplementary Table S12). Details of PAV sequences and clusters identified between B. rapa NHCC001 and B. rapa Z1 are shown in Supplementary Table S13. Among the PAV genes identified between NHCC001 and Chiifu, 125 were specific to NHCC001, and 369 were specific to Chiifu (Supplementary Table S21). These specific segments and genes may contribute to the diversity of the three B. rapa subspecies.

Leaf adaxial-abaxial patterning genes in B. rapa
The most significant difference between NHCC001, a non-heading Chinese cabbage cultivar, and Chiifu, the heading Chinese cabbage, is the heading trait. Leaf incurvature, controlled by multiple genes, is an essential prerequisite for the formation of a leafy head 20,21 . Previous studies have identified adaxial-abaxial (ad-ab) patterning genes and investigated their genetic variation to uncover the mechanisms that underlie leaf incurvature during head formation in heading B. rapa and A. thaliana 22 . Using 26 homologs from A. thaliana, we identified 51, 47, and 49 leaf ad-ab patterning genes in NHCC001, Chiifu v3.0, and Z1, respectively (Supplementary Table  S14). Copy number variation was found among the three B. rapa genomes. In non-heading NHCC001 and Z1, three homologs of AtARF4 were identified (Supplementary Table S14), compared with only two AtARF4 homologs in heading Chinese cabbage Chiifu. Previous studies have shown that arf3 arf4 double mutants develop leaves that are curled up and resemble the phenotype of kan1 kan2 leaves, indicating an overlap in the function of leaf abaxial polarity [23][24][25] . Furthermore, there were two homologs of AtAGO7 in NHCC001 and Z1 but only one AtAGO7 homolog in Chiifu. Two AtDCL1 homologs in Chiifu were identified as tandem duplicates. Copy number variation in ab-ad genes may therefore contribute to leaf head formation in Chinese cabbage.

Identification of genes involved in ascorbic acid and glucosinolate metabolic pathways
Pak-choi is well known for its high nutritional value, particularly its abundant contents of ascorbic acid (AsA, vitamin C) and glucosinolates (GSLs). A previous study reported that the leaf AsA concentration in NHCC001 was~110 mg/100 g FW 26 . Using the newly assembled NHCC001 genome and the sequences of AsA-related genes from A. thaliana, we identified and compared genes involved in AsA biosynthesis and recycling from B. rapa NHCC001, B. rapa Chiifu, and B. rapa Z1 (Fig. 5, Supplementary Table S15). A total of 87, 93, and 93 AsArelated genes were identified in the B. rapa NHCC001, Chiifu, and Z1 genomes, respectively (Supplementary Table S15). Four regions of homeologs had undergone tandem duplication. The numbers of AsA-related homologs were highly consistent among the three B. rapa genomes; most were located on conserved collinear blocks and showed little variation among the three subspecies. The expression patterns of these genes were measured in the root and leaf tissues of NHCC001 (Fig.  5). Two GGalPP homologs and one IPS homolog were highly expressed in the leaf (Fig. 5), whereas some APX, MDAR, and DHAR homologs from the recycling pathway were highly expressed in both roots and leaves (Fig. 5).
Glucosinolates and their hydrolysis products have an important role in human health and plant defence 27 . More than 130 structurally distinct GSLs are present in 16 families of dicotyledonous angiosperms 27,28 , and more than 14 different GSLs have been identified and quantified in the young leaves of different B. rapa varieties 29 . Using A. thaliana GSL genes as queries, GSL genes were identified in B. rapa NHCC001, Chiifu, and Z1 (Supplementary Table S16). Interestingly, GSL gene numbers were highly expanded in the B. rapa genome, with a high proportion of tandem genes. GSL biosynthesis and catabolism are probably similar among the three sequenced B. rapa species, but there was nonetheless substantial variation in the proportion of tandem genes such as MAM1, ST5b, FMOGS-OX, AOP, TGGs, and NSPs (Supplementary Table S16). Flavin-monooxygenase FMOGS-OX catalyzes the S-oxygenation of methylthioalkyl to methylsulfinylalkyl GSLs during the biosynthesis of aliphatic GSLs in Arabidopsis thaliana 30 . We identified six, three, and three FMOGS-OX homologs in NHCC001, Chiifu, and Z1, respectively (Supplementary Table S16). Phylogenetic analysis showed that the FMOGS-OX genes were clustered into four clades, one of which included only three FMOGS-OX genes from B. rapa NHCC001 (Fig. 6a). Gene expression analysis showed that BraC09g068950 in clade I and BraC09g014640 in clade II were highly expressed in both roots and leaves, whereas BraC09g014670 had the highest leaf expression of all the FMOGS-OX genes. TGG catalyzes the hydrolysis of GSLs into compounds that are toxic to various microbes and herbivores 31 . TGG homologs were identified in the three B. rapa genomes and used to construct a phylogenetic tree (Fig. 6b, Supplementary Table S16). To our surprise, only one TGG homologue was initially identified in the B. rapa Chiifu v3.0 genome, despite the fact that nine TGG homologs were identified in the previous B. rapa Chiifu genome version 5 . We speculated that this may have been caused by a gene prediction error in Chiifu v3.0. We therefore searched the Chiifu v3.0 genome using four A. thaliana TGGs as queries and found 17 new TGG genes. Detailed information on these genes is provided in Supplementary Table S17. There were 21 TGGs in B. rapa NHCC001. Two tandem duplicates, BraC02g043340 and BraC02g043360, both showed high expression in NHCC001 leaves (Fig. 6b).

Discussion
Brassica rapa species can be grouped into six subspecies: turnips, sarsons, turnip rapes, and the Japanese group, which includes pak-choi and heading Chinese cabbages 32 . Although they originated from the same ancestor, these varieties show very high morphotype diversity. To date, only one cultivar of heading Chinese cabbage (Chiifu-401-42) 10 and one sarson type (Z1) 12 have been sequenced. The scarcity of genomic resources has hindered research on the evolution of non-model plant species and the genetic basis of phenotypic diversity. In the present study, we assembled a high-quality genome of the pak-choi cultivar 'Suzhouqing' (NHCC001) using a combination of PacBio and Hi-C data. The assembly covers approximately 84.84% of the estimated NHCC001 genome. Use of PacBio and Hi-C technologies provided a high-quality assembly in terms of contiguity and completeness of genic and repetitive regions. More repeat sequences (213.04 Mb) were identified in the present study than in previously published B. rapa genomes (Table 1). A total of 48,158 genes were identified, more than those identified in the B. rapa Chiifu (45,985) 11 and B. rapa Z1 (46,721) 12 assemblies. Our assembly represents a real improvement of the B. rapa genome, particularly for regions enriched in repetitive elements, and provides a valuable resource for comparative genomics and evolutionary studies.
Many efforts have been made to resolve the relationships among subspecies and the domestication history of B. rapa. Chen et al. and Qi et al. also suggested that the Chinese cabbage group was positioned at the most distant point from the B. rapa root and was most divergent from the last B. rapa common ancestor 2,32 . Our phylogenetic analysis further confirmed that the sarson type diverged from the last B. rapa common ancestor earlier than Chinese cabbage and pak-choi (Fig. 2). Comparative genomic analysis revealed variations in the three B. rapa genomes; numerous intraspecific variations such as SNPs, indels, and PAVs were revealed. Copy number variation was also identified in adaxial-abaxial patterning genes, which may lead to a better understanding of the molecular mechanisms that underlie the leaf heading trait in B. rapa. Based on homolog searches, we identified candidate AsA-related and GSL-related genes in the three B. rapa genomes. This more complete genome assembly will provide a solid basis for future metabolism-related studies.

Conclusion
Our newly assembled B. rapa genome achieves a high level of continuity and genic completeness. The new B. rapa genome will be of great help in understanding the evolution of the Brassicas and will provide an important resource for research, especially the molecular investigation of agricultural traits and the breeding of B. rapa ssp. chinensis.

Illumina sequencing of short paired-end (PE) reads
The Brassica campestris (syn. Brassica rapa) ssp. chinensis ('Suzhouqing', green petiole type pak-choi) selfinbred line NHCC001 was used for genome sequencing. Genomic DNA was extracted from NHCC001 leaf tissue and fragmented. One library was constructed and sequenced on the Illumina platform (Illumina, San Diego, CA, USA), and 76.35 Gb of clean reads were generated after trimming adaptors and low-quality bases and removing mitochondria and chloroplast DNA contamination. These data were used for genome estimation, gap closing, assembly polishing, and completeness assessment of the final assembly. Genome size was estimated using the formula: Genome Size=total k-mer number/average peak depth using k-mer analysis.

RNA sequencing
Root, stem, leaf, and flower tissues were harvested and immediately frozen in liquid nitrogen. Total RNA was extracted using the TRIzol reagent (Invitrogen, USA) following the manufacturer's instructions and pooled for sequencing. SMRTbell libraries were constructed using the PacBio DNA Template Prep Kit 2.0 and then sequenced on the PacBio RS II platform. In addition, cDNA libraries were constructed from roots and leaves and used for second-generation sequencing on the Illumina HiSeq 2500 platform to generate 125-bp pairedend reads following the manufacturer's protocol.

Hi-C library sequencing
Hi-C libraries were prepared from NHCC001 leaves as described previously 36 and sequenced on the Illumina HiSeq X Ten platform (2×150 bp) to generate 229,427,069 paired-end reads. The Hi-C data were mapped to PacBiobased contigs using BWA (v0.7.10-r789; aln mapping method) with the parameters -M 3, -O 11, -E 4, and -t 8. HiC-Pro (v2.8.1) 37 was used for duplicate removal and quality control. Using contigs assembled from PacBio data, Hi-C data were used to correct mis-joins in contigs and to order and orient contigs. Pre-assembly was performed for contig correction by splitting contigs into segments and then pre-assembling the segments with Hi-C data. Misassembled points were defined and broken when split segments could not be placed into the original position. Finally, the corrected contigs were assembled using LACHESIS with parameters CLUSTER_MIN_RE_-SITES = 30, CLUSTER_MAX_LINK_DENSITY = 2, ORDER_MIN_N_RES_IN_TRUN = 25, and ORDER_-MIN_N_RES_IN_SHREDS = 26 with Hi-C valid pairs. Gaps between ordered contigs were filled with 100 Ns. To improve the contiguity of the assembly results, a gapclosing step for pseudochromosomes was performed using PBJelly (v15.2.20) 38 with the parameter --min-Gap=1 using error-corrected Nanopore clean reads. The contact map was visualized with heatmaps at a 100-kb resolution.

Gene prediction and functional annotation
GeMoMa (v1.3.1) 50 was used for homology prediction with the parameter evalue = 0.00001. Five de novo gene prediction software packages were used, including Genscan (hollywood.mit.edu/GENSCAN.html), Augustus (v2.4) 51 , GeneID (1.4) 52 , SNAP (v2006-07-28) 53 and GlimmerHMM (v3.0.4) 54 . Parameters in Augustus were trained with unigenes assembled from pooled RNA-seq data. For RNA-seq based prediction, NGS transcripts and full-length transcripts were used. NGS transcripts were assembled using HISAT (v2.0.4) 55 and StringTie v1.2.3) 56 and then aligned to the genome assembly using BLAT 57 with the parameters identity ≥ 0.95 and coverage ≥ 0.90. Unigenes were filtered using PASA (v2.0.4) 58 . Clean RNAseq reads were mapped to the assembled genome using TopHat 39 , and transcripts were assembled using Cufflinks (v2.1.1) 39 . TransDecoder (v2.0) 59 and GeneMarkS-T (v5.1) 60 were used to identify the gene structure. EVi-denceModeler (EVM, v1.1.1) 61 was used to obtain an integrated gene set from the three prediction strategies above with different weight settings. The final gene set was obtained after filtering out coding sequences (CDS) shorter than 300 bp with frameshift mutations or premature stop codons. Functional annotation of the final gene set was performed using BLASTP (E-value 1e −5 ) embedded in the blast+ package (v2.2.31) 62 against multiple databases, including KEGG 63 , Swiss-Prot 64 , TrEMBL 64 , and NCBI nr 65 . GO annotations were assigned using the BLAST2GO pipeline (v2.5) 66 . The newly annotated genes were named based on the following conventions: Bra for Brassica rapa, followed by C for chinensis, then the chromosome number and the letter "g" for gene. The six digits after "g" were assigned based on the gene's position relative to the top of the chromosome.
Non-coding RNA (ncRNA) predictions tRNAscan-SE (v2.0) 67 was used to predict tRNAs with two embedded searching methods (tRNA-scan and EufindtRNA). tRNAs located in repetitive regions were excluded, and tRNAs with prediction scores over 20 were retained. miRNAs were identified by a homology search against miRBase (release 22) 68 with one mismatch allowed. miRDeep2 69 was used to predict secondary structures, and miRNAs with hairpin structure were retained. Other ncRNAs were predicted based on an Infernal (v1.1.2) 70 search against the Rfam (v12.1) 71 database with default parameters.

Pseudogene prediction
GenBlastA (v1.0.4) 72 with the parameter -e 1e-5 was used to identify homologous sequences in the genome, and Genewise (v2.4.1) 73 with the parameters -both and -pseudo was used to identified pseudogenes when premature stop codons or frameshift mutations were present in homologous sequences with 60% identity and 60% coverage.
Expansion and contraction of gene families CAFE (v2.0) 78 was used to infer gene family sizes in the most recent common ancestor (MRCA) and to determine the significance of gene family expansion/contraction based on the phylogenetic tree topology. The birth and death parameter (λ) was 0.002, and the P-value was 0.01.

Synteny and 4DTv analysis
The BLASTP program was used to identify orthologous and paralogous genes. MCscanX 79 was used to recognize syntenic blocks with parameters E_VALUE = 1e−05, MAX GAPS = 25, and MATCH_SIZE = 5. Syntenic blocks were visualized with MCscan, and chromosome lengths were not scaled. The 4DTv value of each gene pair was calculated and then corrected using the HKY model 80 . The Ks value of each syntenic gene pair was calculated using the yn00 program in the PAML package 77 . Chromosome-scale syntenic block plots and dotplot were constructed using the python version of MC scan (https:// github.com/tanghaibao/jcvi/wiki/MCscan).

Identification of PAV sequences and PAV clusters
PAV sequences in the genomes of B. rapa NHCC001, B. rapa Chiifu, and B. rapa Z1 were identified using a sliding-window method as described previously 81 .

Identification of ascorbic acid-related and glucosinolaterelated genes
A. thaliana AsA-related and GSL-related genes have been reported and were used as the set of reference genes in this study 5,[82][83][84][85][86] . Their protein sequences were aligned with corresponding protein sets from the B. rapa genome using BLASTP (E-value ≤ 1 × 10 −10 , identity ≥55).