Introduction

The plant family Brassicaceae (Cruciferae) comprises over 330 genera and ~3700 species with a worldwide distribution1,2,3,4,5. Numerous crops are derived from this family, including vegetables (Brassica and Raphanus), ornamentals (Matthiola, Hesperis, and Lobularia), spices (Eutrema and Armoracia), and medicines (Isatis). Based on sequenced genomes, several model species have been developed for diverse studies, including Arabidopsis thaliana for molecular function studies, Brassica for polyploidization and whole-genome duplication (WGD) studies, and Eutrema salsugineum for abiotic tolerance-related studies. However, genetic biosynthesis of the major active compounds in medicinal plants of this family remains poorly investigated.

Isatis indigotica (2n = 14) belongs to tribe Isatideae in lineage II of the family3,6,7,8,9,10. This species is widely cultivated in China as an important medicinal plant because its dried leaves and roots are used as a traditional Chinese medicine for curing diseases and viruses11,12,13. The major active compounds isolated from this species comprise terpenoids, lignans, and indole alkaloids14,15,16,17. These compounds were confirmed to have antiviral18,19, antibacterial20, anti-inflammatory21,22, and antileukemia23,24 functions. Previous studies based on transcriptomes revealed a few candidate genes involved in the biosynthesis of active compounds in this species25,26,27,28. However, the limitations of transcriptome quality and integrity hinder the identification of all candidate biosynthesis-related genes.

In the present study, we used single-molecule sequencing combined with high-throughput chromosome conformation capture (Hi-C) technology to assemble the genome and construct the pseudochromosomes of I. indigotica. Based on homolog searching and functional annotations, we aimed to identify candidate gene sets involved in the biosynthesis of putative active components. The candidate genes and genomic resources recovered here will be critically important for further experimental verification and artificial syntheses of the active compounds of this medicinal plant in the future.

Results

Genome assembly and construction of pseudochromosomes

The genome size, genome repeat size, and heterozygosity rate of I. indigotica were estimated using K-mer analysis. The 19-mer frequency of Illumina short reads with the highest peak occurred at a depth of 94. The genome was estimated to be 279.90 Mb in size with 48.99% repeats, and the heterozygosity rate was estimated to be 0.44% (Supplementary Table S3 and Supplementary Fig. S3). In addition, the genome size of I. indigotica was estimated to be ~305 Mb based on flow cytometric analyses using Vigna radiata as the internal standard (Supplementary Fig. S2).

We sequenced and assembled the genome of I. indigotica using single-molecule real-time (SMRT) sequencing technology from Pacific Biosciences (PacBio) and anchored the assembled contigs to seven pseudochromosomes using Hi-C techniques. The final chromosome-scale genome was 293.88 Mb in length with 1199 contigs (contig N50 = 1.18 Mb), a scaffold N50 = 36.17 Mb, and a maximum pseudochromosome length of 38.25 Mb (Table 1, Supplementary Table S4, and Supplementary Fig. S4).

Table 1 Statistics for the final genome assembly of I. indigotica

The completeness of the genome assembly was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCO)29. Of the 1440 plant-specific orthologs, 1416 (98.33%) were identified in the assembly, of which 1400 (97.22%) were considered to be complete (Supplementary Table S5). The assembly base accuracy was also assessed based on Illumina short read mapping. In total, 99.97% of the clean reads were mapped to the genome assembly, and 94.55% of them were properly mapped (Supplementary Table S6). The base error percentage of the genome assembly was estimated to be 0.000081% (Supplementary Table S7). All these evaluations indicate the high completeness, high continuity, and high base accuracy of the present genome assembly.

Repeat and gene annotations

Repetitive sequences were identified using a combination of ab initio and homology-based approaches. In total, we identified 53.27% of the assembled sequences as repetitive sequences, including 34.67% retrotransposons and 7.37% DNA transposons. Long terminal repeat (LTR) retrotransposons were found to account for 30.09% of the genome (Supplementary Table S8). We annotated protein-coding genes by combining transcriptome-based, homology-based, and ab initio predictions. Finally, we predicted a total of 30,323 genes, of which 5973 had alternatively spliced transcripts. The average transcript length and coding sequence size were 2693 and 1387 bp, respectively, with a mean of 5.50 exons and 1.39 transcripts per gene (Table 2). Overall, 29,522 genes (97.36%) were assigned functions, and 76.16% and 91.69% of these genes had homologies and annotated proteins in the Swiss-Prot and TrEMBL databases. Further functional annotations using InterProScan estimated that 95.86% of the genes contained conserved protein domains, and 87.32% of the genes were classified by Gene Ontology (GO) terms, with 29.41% mapped to known plant biological pathways based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database (Supplementary Table S9).

Table 2 Statistics of predicted protein-coding genes in the I. indigotica genome

Chromosome structure of I. indigotica

Evolution of chromosome structures in Brassicaceae has been traced and established through comparative chromosome painting techniques using BAC probes of the A. thaliana genome4,30. Using these techniques, Lysak and his colleagues31 proposed a model comprising eight chromosomes and 24 genomic blocks (GBs, named from A to X) for comparative genomics and chromosomal analyses based on the Ancestral Crucifer Karyotype (ACK; n = 8) concept31. This model was further updated to comprise 22 GBs by merging the four GBs to form two new blocks (K-L and M-N), and the GB boundaries defined by A. thaliana gene loci were also updated32 (Fig. 1a). Six tribes (Calepineae, Coluteocarpeae, Conringieae, Eutremeae, Isatideae, and Sisymbrieae) of expanded lineage II were found to derive from a common ancestor with the Proto-Calepineae Karyotype (PCK; n = 7). Among these tribes, three (Eutremeae, Isatideae, and Sisymbrieae) displayed an additional whole-arm translocation in the second and seventh chromosomes (translocation PCK, tPCK; n = 7)32,33,34 (Fig. 1a). ACK and PCK shared five similar chromosomes. Thus, they might descend from a common ancestor; alternatively, PCK may have evolved from ACK.

Fig. 1: Ancestral Brassicaceae genomes and the distribution of ancestral GBs along the seven pseudochromosomes of I. indigotica.
figure 1

a The ancestral genomes ACK, PCK, and tPCK, each comprising 22 ancestral GBs. Blocks with opposite orientations relative to that of ACK are represented by downward-pointing arrows. M-N and D GBs are translocated in tPCK chromosomes tPC2 and tPC7 compared with PCK chromosomes PC2 and PC7. The structures were drawn based on previous studies32,33,34. b Twenty-two GBs and their positions within the I. indigotica genome. Genes of each GB boundary are shown beside the chromosomes with the corresponding A. thaliana locus IDs within parentheses based on MCScanX results. The A. thaliana GB boundaries were derived from a previous study34

To determine whether the I. indigotica genome sequence also supported tPCK structure in Isatideae, we compared the seven pseudochromosomes of I. indigotica with the A. thaliana genome by LAST and MCScanX. We determined syntenic relationships and constructed the order and orientation of the 22 GBs along the seven pseudochromosomes of the I. indigotica genome (Supplementary Figs. S5, S6). Based on the gene intervals of each GB of A. thaliana, we determined the corresponding intervals and boundaries of each block in I. indigotica and renamed the pseudochromosomes based on Fig. 1a (Fig. 1b and Supplementary Table S10). The I. indigotica genome has good collinearity in each GB compared with the A. thaliana genome and is consistent with tPCK structure in both order and orientation (Fig. 1 and Supplementary Figs. S5, S6). Furthermore, we carried out sequence alignments between the genomes of I. indigotica and the other three species that might also display tPCK structure (Sisymbrium irio for Sisymbrieae, E. salsugineum for Eutremeae, and Schrenkiella parvula for unassigned genera) using LAST. Our analyses suggested that these four species have similar chromosome structures (Supplementary Figs. S7S9). However, we found obvious inversions in the S. parvula genome and low continuity of sequences in the E. salsugineum and S. irio genomes. These comparisons suggest that the present I. indigotica genome was better assembled in terms of both accuracy and continuity than others with tPCK structure.

Phylogenetic relationships and WGD analyses

We clustered the annotated genes into gene families among I. indigotica and eight other Brassicaceae species with Cleome hassleriana as the outgroup. A total of 24,382 I. indigotica genes (80.41%) clustered into 18,900 gene families, of which 10,826 (57.28%) gene families were shared with nine other species and 896 (4.74%) were I. indigotica specific (Fig. 2c and Supplementary Table S11). We selected 822 single-copy gene families among 10 species to construct a phylogenetic tree, which showed that I. indigotica was sister to S. irio. We further estimated the divergence time between them as 15.86 (12.71–19.20) million years ago (Mya) (Fig. 2a). The relationships of all 10 species are consistent with those from previous phylogenetic analyses3,6,7,8,10.

Fig. 2: Evolutionary and comparative genomic analyses.
figure 2

a The phylogenetic tree of I. indigotica and eight other Brassicaceae species with C. hassleriana as the outgroup. All branch bootstrap values are 100. Gene family expansions are indicated in purple, while gene family contractions are indicated in light brown. The estimated divergence times (million years ago, Mya) are indicated at each node; bars are the 95% highest probability densities (HPDs). Circles in blue represent recent whole-genome duplication events. b Ks value distributions between B. rapa, I. indigotica, and A. thaliana. c Orthogroups shared by selected species

Then, we used synonymous substitution rates (Ks) between collinear paralogous genes to identify potential WGD events, based on the assumption that the number of silent substitutions per site between two homologous sequences increases in a relatively linear manner with time. A density plot of Ks values for the collinear gene pairs suggested that I. indigotica experienced a recent WGD event with a peak value of ~0.76, consistent with At-α-WGD for all Brassicaceae species8,35,36. An independent WGD event was identified for B. rapa after its divergence from I. indigotica at Ks = 0.30–0.34, previously reported as a Brassiceae-specific triplication (Br-α-WGD)8,37,38,39 (Fig. 2b). Whole-genome alignment among the I. indigotica, A. thaliana, and B. rapa genomes carried out by LAST also confirmed the collinear relationship and these WGD events. For each genomic region of I. indigotica, we typically found one matching region in A. thaliana and three matching regions in B. rapa. These comparisons suggest that I. indigotica did not experience an independent WGD event after At-α-WGD (Supplementary Figs. S5, S10).

The expansion and contraction of gene families play critical roles in driving phenotypic diversification and enhancing special traits in plants. We discovered 1357 expanded and 3074 contracted gene families in I. indigotica relative to S. irio (Fig. 2a). Tandem duplication was the main contributor to the gene family expansions. GO enrichment analysis of tandem repeat genes suggested that they were enriched in defense response to virus, indole biosynthetic process, lignin biosynthetic process, flavone synthase activity, and glucosyltransferase activity, some of which might be involved in the biosynthesis of active compounds in I. indigotica (Supplementary Table S12). We also performed GO enrichment analysis of the contracted gene families, and the results showed that they were enriched in proton export across plasma membrane, proton-exporting ATPase activity, regulation of stomatal movement, and defense response to other organism (Supplementary Table S13), which are probably related to the environmental adaptation of the species.

Identification of genes involved in the biosynthetic pathways of active compounds

Based on the KEGG database, GO classification, and the suggested biosynthesis pathways, we used a combined method of homolog searching and functional annotation to identify candidate genes for the biosynthesis of three types of active compounds, namely, terpenoids, phenylpropanoids, and indoles, in I. indigotica14,15,16,25,40,41. Sterols are the major terpenoids in I. indigotica, mainly comprising β-sitosterol and daucosterol42. β-Sitosterol was reported to play a critical role in curing lung inflammation43, while daucosterol can inhibit cancer cell proliferation44. A total of 59 genes in the present genome, which encoded 31 enzymes, were identified to be involved in terpenoid and sterol biosynthesis (Supplementary Table S14). Based on the functional annotations of these genes, the biosynthesis pathway of β-sitosterol is nearly complete and daucosterol can be further synthesized from β-sitosterol by glucosyltransferases (Fig. 3a). In addition, the intermediate product geranyl diphosphate can be used not only to synthesize sterols but also to produce secologanin for monoterpene indole alkaloids in numerous medicinal plants such as Catharanthus roseus45. However, we annotated genes only with geraniol 10-hydroxylase activity (GO: 0102811). The lack of other related genes may account for the absence of secologanin and other related monoterpene indole alkaloids in I. indigotica.

Fig. 3: Putative biosynthetic pathways of three main class active compounds in I. indigotica.
figure 3

The putative biosynthetic pathways of terpenoids (a), phenylpropanoids (b), and indole alkaloids (c) of active compounds in I. indigotica. Values within brackets indicate the numbers of gene copies corresponding to the catalytic genes in the pathways

Phenylpropanoids comprise lignans and flavonoids, with critical roles in anti-inflammatory response, anti-oxidant activity46, and attenuation of mammary tumor growth47. We annotated 66 genes involved in the biosynthesis of lignans and flavonoids (Supplementary Table S15). The identified putative pathway mainly comprises the biosynthesis of isovitexin and lariciresinol, while their glycosides were further synthesized by glucosyltransferases (Fig. 3b). Indole alkaloids comprise another active component of I. indigotica48,49,50 with important anti-influenza, anti-inflammatory, and leukocyte inhibition effects11,23,51,52,53. Based on the KEGG maps and previously suggested pathways25,54,55, we identified 32 genes that encoded 11 enzymes involved in the biosynthesis of indole alkaloids (Fig. 3c and Supplementary Table S16). Because of the lack of downstream pathways, other genes for indole alkaloid biosynthesis in I. indigotica need further identification. It should be noted that numerous genes involved in the biosynthesis of the three major types of active compounds increased in copy number because of tandem duplication, for example, geranylgeranyl diphosphate synthase, cinnamate 4-hydroxylase, 4-coumarate-CoA ligase, and indole-3-pyruvate monooxygenase (Fig. 3 and Supplementary Tables S14S16).

Discussion

Continuity and completeness are important indicators of genome assembly. PacBio-based genome assembly plus error corrections based on Illumina data could greatly improve continuity and completeness56,57,58,59. Our genome assembly of I. indigotica by this strategy showed a highly resolved result with an N50 = 1.22 Mb and longest contig length = 8.99 Mb. In addition, we used Hi-C data to cluster the contigs into seven pseudochromosomes with a final scaffold N50 = 36.17 Mb and longest chromosome length = 38.25 Mb. The completeness and high quality of the present I. indigotica genome were further confirmed by BUSCO and comparative chromosome analyses32. A total of 97.22% of the genes examined by BUSCO were complete, and the chromosome structure of I. indigotica was consistent with the tPCK type.

We constructed the phylogenetic relationships of I. indigotica based on genomic data and found that I. indigotica of Isatideae is sister to S. irio of Sisymbrieae among the sampled species, consistent with the results of previously published phylogenetic analyses3,6,7,8,10. Based on the phylogenetic results, we identified expanded and contracted gene families in I. indigotica. The expanded genes in this species were mainly derived from tandem duplications and were obviously enriched in some secondary metabolite pathways. Based on homolog searching and functional annotation in our high-quality genome, we further identified candidate genes for the biosynthesis of three main classes of active compounds in I. indigotica: terpenoids, phenylpropanoids, and indole alkaloids. These candidate genes complete or replenish gene sets for biosynthetic pathways of these compounds concentrated in I. indigotica25,26,27,28 (Fig. 3). In addition, we found that in some synthesis steps, the copy number of enzyme-coding genes increased to two or more because of tandem duplications. The increase in copy number may drive the production of major active compounds in I. indigotica and account for its excellent antibacterial and antiviral activities because gene expansions are responsible for enhancing a special trait or the origin of a new trait60,61,62.

Overall, in this study, we present a high-quality genome for I. indigotica. We further identify or replenish candidate genes for biosynthesis pathways of the active compounds in this medicinal plant. These genes and genomic resources will provide a solid basis for future biosynthesis-related studies.

Materials and methods

DNA extraction and genome sequencing

We initially extracted high-quality total DNA from fresh young leaves of a 2-month-old plant artificially cultivated in the greenhouse using the cetyltrimethylammonium bromide method. We used a SMRTbell Template Prep Kit 1.0 (PacBio, Menlo Park, CA, USA) to construct the DNA libraries for PacBio long-read sequencing and sequenced them on a PacBio Sequel system. We obtained a total of four SMRT cells with 39.94 Gb of sequencing data (coverage of 142.71×) from the PacBio Sequel platform and generated a total of 4.30 million subreads with an N50 read length of 14.9 kb (Supplementary Table S1 and Supplementary Fig. S1). We also prepared paired-end Illumina libraries using an Illumina Genomic DNA Sample Preparation Kit and sequenced them on an Illumina HiSeq X Ten system for error correction and K-mer analysis and generated a total of 37.50 Gb of data and 31.79 Gb of clean data (Supplementary Table S1).

Genome assembly and pseudochromosome construction

We initially estimated the genome size of I. indigotica by flow cytometry with Vigna radiata as the reference63. We then used clean Illumina short reads to calculate K-mers (Illumina DNA short read size of 19 bp) by Jellyfish v.2.2.964 to confirm the genome size. The sequencing depth was estimated by determining the highest peak value of the frequency curve of the K-mer occurrence distribution. We used SMRT Link pipeline v.5.1.0.26412 to process the polymerase reads into subreads with readScore = 0.75 and minSubReadLength = 500 and used Canu v.1.665 to correct errors of the PacBio subreads and assemble the corrected reads into contigs after trimming low-quality bases using WTDBG (https://github.com/ruanjue/wtdbg). We corrected the assembled contigs by using 270 bp PE Illumina data by Pilon v.1.1366 and finally obtained a 293.83 Mb contig-scale assembly with a contig N50 of 1.22 Mb. The genome contained 1162 contigs, and the longest contig was 8.99 Mb with a 38.18% GC content. These contigs were further anchored to chromosomes by the Hi-C technique.

We grounded ~3 g of fresh young leaf tissue into powder in liquid nitrogen for Hi-C experiments and constructed a Hi-C library following Louwers et al.67 with chromatin extraction and digestion and DNA ligation, purification, and fragmentation. Finally, we obtained a total of 79.43 Gb of clean reads for Hi-C analyses by the Illumina HiSeq X Ten platform. We first carried out a preliminary assembly by splitting contigs into segments of 100 kb on average and mapping the Hi-C data to the contigs using BWA v.0.7.10-r78968 in order to correct contig errors. We then used LACHESIS software69 with the parameters CLUSTER MIN RE SITES = 22, CLUSTER MAX LINK DENSITY = 2, CLUSTER NONINFORMATIVE RATIO = 2, ORDER MIN N RES IN TRUN = 10, and ORDER MIN N RES IN SHREDS = 10 to cluster and reorder all corrected contigs into pseudochromosomes. We finally adjusted the order and direction of the contigs on the pseudochromosomes by examining their interactions in the Hi-C heatmap. We evaluated the completeness and quality of the final assembled genome through BUSCO v.3.029 tests using gene content from the Embryophyta_odb9 database29.

Repeat annotation

We identified repetitive elements through both RepeatModeler v.1.0.10 and RepeatMasker v.4.0.770,71. RepeatModeler employed RECON and RepeatScout to predict interspersed repeats and then obtained the consensus repeat library. RepeatMasker recovered the repeats in the I. indigotica genome through a homology-based repeat search using the ab initio repeat database and Repbase. The overlapping repeats belonging to the same repeat class were combined according to their coordination in the genome. The overlapping repeats belonging to different repeat classes were then split into different types.

Gene prediction and functional annotation

To improve gene prediction, we further obtained transcriptomes by sequencing high-quality RNA from mixed fresh leaf, flower, and stem tissues and sequenced them by the Illumina HiSeq X Ten platform. We removed adapters and discarded reads with >10% N bases or reads having more than 20% bases of low quality (below 5) using NGS QC Toolkit v.2.3.372 and finally generated 19.87 Gb of clean data. We assembled the de novo and genome-guided transcriptomes with clean reads by Trinity v.2.4.073. We also mapped the RNA-sequencing (RNA-seq) reads to the assembled genome to obtain the mapping rate through HISAT2 v.2.1.074 to evaluate the completeness of the genome.

We run PASA pipeline v.2.1.075 to align the transcripts to the assembled genome to carry out ORF prediction and gene prediction. To train the HMM model for Augustus, we extracted complete, multiexon genes, removed redundant high-identity genes (cut-off all-to-all identity of 70%), and finally generated the best candidate and low-identity gene models for training. We aligned the RNA-seq data to the hard-masked genome assembly by HISAT274 and used bam2hints in Augustus to generate the intron hint file. We used this hint file to carry out ab initio gene prediction by Augustus v.3.2.276. For homologous prediction, the reference protein sequences of Brassica rapa, Brassica napus, Raphanus sativus, Brassica juncea, and Brassica nigra were downloaded and aligned against the I. indigotica genome using TBLASTN v.2.2.3177 and searched with an e value of 1e−5. After filtering low-quality results, gene structure was predicted using GeneWise v.2.4.178. We combined the results from PASA, Augustus and GeneWise to generate the final protein-coding gene set using EVidenceModeler v.1.1.175. To obtain the untranslated regions and alternatively spliced isoforms, we used PASA to update the gff3 file for two rounds and obtain the final gene models.

We annotated the functions of the predicated genes against public databases by NCBI BLAST+ v.2.2.3177 with a cut-off e value of 1e−5 and maximum number of target sequences of 20, including the Swiss-Prot and TrEMBL databases79. Best-hit BLAST results were then used to define gene functions. We used InterProScan v.5.25-64.080 to identify motifs and domains by matching against public databases. We identified GO annotations by using Blast2GO v.4.181 according to the blast results and combined them with InterPro GO entries. We mapped the existing GO terms to enzyme codes by Blast2GO and submitted the predicted proteins to the KEGG (Kyoto Encyclopedia of Genes and Genomes) Automatic Annotation Server (KAAS)82 to obtain KO numbers for KEGG pathway annotation.

Gene family and phylogenetic analyses

We used protein sequences of I. indigotica and eight other Brassicaceae species (Arabidopsis thaliana, Capsella rubella, Brassica rapa, Brassica napus, Raphanus sativus, Schrenkiella parvula, Sisymbrium irio, and Eutrema salsugineum) with the outgroup species Cleome hassleriana for same-family gene clustering. For genes with alternative splicing variants, the longest transcript was selected to represent the gene. Similarities between sequence pairs were calculated using BLASTP v.2.2.3177 with a cut-off e value of 1e−5. Additionally, OrthoMCL v.2.0.9 was used with default parameters to assess gene family membership based on overall gene similarity combined with Markov Chain Clustering (MCL) v.14-13783.

We extracted single-copy orthologous genes from the ten species by OrthoMCL and aligned the resulting protein sequences by MAFFT v.7.31384. Then, we used Gblocks v.0.91b85 to extract the conserved sites of multiple sequence alignments and constructed a phylogenetic tree by RAxML v.8.2.1186. We used C. hassleriana as an outgroup and performed 1000 bootstrap analyses to test the robustness of each branch. We used the Bayesian relaxed molecular clock approach in MCMCTREE of PAML v.4.9e87 to estimate divergence time. We calibrated this tree based on the estimated divergence times in the TimeTree database88 for C. hasslerianaA. thaliana (35–59 Mya), A. thalianaC. rubella (7.4–12.8 Mya), B. rapa–S. parvula (19.3–28.6 Mya), and B. rapa–A. thaliana (23.4–33.5 Mya).

Gene families that had undergone expansion or contraction were identified in the eight sequenced species using CAFE89. The CAFE parameters included a p value threshold = 0.05 and automatic searching for the λ value. The algorithm in CAFE takes a matrix of gene family sizes in extant species as input and uses a probabilistic graphical model to ascertain the rate and direction of changes in gene family size across a given phylogenetic tree.

WGD analysis and identification of tandemly repeated genes

To examine WGD in I. indigotica and B. rapa, we extracted all homologous proteins between these two species and A. thaliana using an all-to-all search in BLASTP v.2.2.3177 with an e value cut-off of 1e−9. We used MCScanX90 with default parameters to identify collinear blocks, each containing at least five collinear gene pairs. To infer WGD events, we used the downstream MCScanX script add_ka_and_ks_to_collinearity.pl to calculate the Ks values between collinear genes among these three genomes. We further performed whole-genome alignment of the three species by LAST v.94691 and constructed a dot plot by the downstream program last-dotplot.

Identification of tandem repeat genes in the I. indigotica genome was based on three criteria: (1) two or more genes had more than 70% identity and 70% coverage according to BLASTP; (2) the pairwise gene distance was <100 kb; and (3) there were no more than 10 genes lying between the repeat genes on a single scaffold92. The genes identified in this way were subjected to functional analysis using GO enrichment.