The Taxus genome provides insights into paclitaxel biosynthesis

The ancient gymnosperm genus Taxus is the exclusive source of the anticancer drug paclitaxel, yet no reference genome sequences are available for comprehensively elucidating the paclitaxel biosynthesis pathway. We have completed a chromosome-level genome of Taxus chinensis var. mairei with a total length of 10.23 gigabases. Taxus shared an ancestral whole-genome duplication with the coniferophyte lineage and underwent distinct transposon evolution. We discovered a unique physical and functional grouping of CYP725As (cytochrome P450) in the Taxus genome for paclitaxel biosynthesis. We also identified a gene cluster for taxadiene biosynthesis, which was formed mainly by gene duplications. This study will facilitate the elucidation of paclitaxel biosynthesis and unleash the biotechnological potential of Taxus.

T axaceae, a widespread family of non-flowering conifers with substantial economic value, contains six extant genera and over 28 species 1 . Taxus is the largest genus in Taxaceae, including common species such as T. chinensis, T. brevifolia and T. baccata, and it is mainly distributed in Asia, North America and Europe 2 . For decades, Taxus has served as a natural source of paclitaxel (trade name Taxol), a well-known chemotherapy agent against various cancers 3 . But plant-derived paclitaxel suffers from a short supply due to its low abundance in Taxus, limiting its clinical application. Multiple strategies have been employed to address supply issues 4 , and promising progress has been made in chemical 5 and semichemical synthesis 6 , direct extraction from Taxus cell lines 7 , fermentation of endophytic paclitaxel-producing fungi 8 and metabolic engineering of paclitaxel production using heterologous systems 9 .
As a tetracyclic diterpene, paclitaxel is biosynthesized by a complex metabolic pathway 10 . The paclitaxel pathway starts with geranylgeranyl diphosphate (GGPP) synthesis through the condensation of isoprenyl diphosphate and dimethylallyl diphosphate 11 . GGPP is then cyclized by taxadiene synthetase (TS), generating a unique diterpene skeleton, taxadiene 12 . Taxadiene is subsequently decorated by a series of reactions including hydroxylation, oxidation, epoxidation, acylation and benzoylation to generate the final product via catalysis by various enzymes (for example, hydroxylase, oxidase, epoxidase, oxomutase and transferase) [13][14][15] . To date, over 20 enzymes have been identified in the paclitaxel biosynthetic pathway. However, several essential steps in the pathway, such as C1 hydroxylation, C9 oxygenation and oxetane formation, remain to be clarified. Moreover, studies have shown that jasmonates, gibberellin, auxin and ethylene are involved in the regulation of paclitaxel biosynthesis to maintain a delicate balance between growth and defence in Taxus [16][17][18] . Several transcription factors (TFs), including AP2/ERF, WRKY, MYC and MYB, have been found to regulate the expression of paclitaxel biosynthetic genes [19][20][21] . However, the comprehensive regulatory mechanisms underlying the growth-defence trade-off are still poorly understood.
A complete Taxus genome sequence can provide valuable bioinformatic and genetic resources to understand paclitaxel biosynthesis and regulatory mechanisms in depth, but the size and complexity of the Taxus genome (2C-value, 22.3-24.3 picograms) have hindered its de novo draft genome assembly to date 22 . Here, we have successfully assembled the Taxus genome, and we present a reference-grade genome sequence of T. chinensis var. mairei containing 10.23 gigabases (Gb) of data with contig N50 of 2.44 megabases (Mb), 9.86 Gb of which was assigned to 12 pseudochromosomes. We demonstrate that the CYP725A (cytochrome P450) genes, closely related to paclitaxel biosynthesis, have evolved independently in a unique physical and functional grouping in the Taxus genome. Moreover, we have uncovered a gene cluster for taxadiene biosynthesis that contains a new type of TS. These results contribute to our understanding of the biological and evolutionary questions regarding paclitaxel biosynthesis and provide insights into the genome structure and organization of gymnosperms.

Results
Taxus genome sequencing, assembly and annotation. To build a chromosome-level genome assembly of Taxus, genomic DNA was extracted from endosperm calli. The endosperm of T. chinensis var. mairei seeds with haploid chromosomes was used to culture the callus, as it could prevent the influence of heterozygous elements in the genome assembly. K-mer analysis showed that the genome size of T. chinensis var. mairei was approximately 10 Gb (Extended Data Fig. 1a), which is consistent with the results from the flow cytometry tests 23 . A de novo assembly of the Taxus genome was achieved by PacBio continuous long reads (318.05 Gb) and augmented with Illumina whole-genome sequencing reads (693.73 Gb) (Supplementary Table 1). After the application of high-throughput/resolution chromosome conformation capture (Hi-C) (Supplementary Table 2), 9.86 Gb of sequence data could be assigned to 12 pseudochromosomes (Extended Data Fig. 1b  and Supplementary Table 3), which covered 96.28% of the genome (Supplementary Table 4). We finally obtained the genome sequence with a total length of 10.23 Gb and a contig N50 of 2.44 Mb (Fig. 1a and Supplementary Table 4).
On the basis of the genomic information, 42,746 protein-coding genes were further annotated by integrating transcriptome data, homologous alignments and ab initio gene models. In total, 73.02% of the genes (31,214 out of 42,746) could be supported by RNA-seq data (Extended Data Fig. 1c and Supplementary Table 5). The BUSCO analysis further demonstrated that 1,052 of the 1,614 core genes were complete, showing relatively high completeness of the assembled genome in gymnosperms (Supplementary Table 6). Furthermore, 36,518 coding genes, accounting for 85.43% of the total predicted genes, were assigned to functional categories with an E-value less than 10 −5 (Supplementary Table 7).
Taxus experienced a whole-genome duplication event in the cupressophyte clade. Given that whole-genome duplication (WGD) is a important evolutionary force contributing to the expansion of plant genome size 24 , we investigated whether Taxus had experienced any WGD events. We built a paralogous gene pair set by performing an all-against-all blastp search. The number of synonymous substitutions per synonymous site (K s ) of paralogues was calculated using the gene pair set. As shown in Fig. 1b, the frequency of K s values exhibited an apparent decay without a natural distribution with increasing K s values, which indicated that no recent WGD event occurred in the Taxus genome. Moreover, we noticed that most of the K s values were less than 0.8, indicating that gene duplication in combination with saturation and stochasticity effects may obscure WGD 24 . We further used MCScanX to produce 8,148 syntenic gene pairs from the all-against-all blastp data and entered them into the K s and distance-transversion rate at fourfold degenerate sites (4DTv) calculations. The results showed two signature peaks located at 2.1 for K s (Fig. 1b) and 0.7 for 4DTv (Extended Data Fig. 1d), suggesting the presence of an ancient WGD in Taxus. Together with previous studies that revealed an ancient WGD event (WGD-ζ) in the common ancestor of angiosperms and gymnosperms 24,25 , all the above results suggest that Taxus shared the common ancient WGD with other coniferophyte lineages.
Taxus genome expansion is linked with retrotransposons. Except for the role of WGD in enlarging the Taxus genome size, we noticed that repetitive sequences constituted a important component of the Taxus genome (Supplementary Table 8). There was a total of 7.79 Gb of repetitive sequences, occupying 76.09% of the entire genome (Supplementary Table 8). Among these repetitive sequences, long terminal repeat (LTR) retrotransposons accounted for the highest proportion at 52.38% (Supplementary Table 8). The insertion time analysis revealed that LTR insertion was a continuous process, and approximately 40% of the insertions occurred 8 to 24 million years ago (Ma) (Fig. 1c). This feature of continuous insertion in the Taxus genome was distinctly different from that in the rice genome, where almost 95% of LTR insertions occurred within the last 5 million years 26 . Considering that LTR insertion in Norway spruce and ginkgo mainly occurred 12-24 and 16-24 Ma 23,25 , the continuous insertion of LTRs might be a common phenomenon in gymnosperms.
To further explore the evolution of LTR in Taxus, we analysed the phylogenies of LTR retrotransposons in a few representative gymnosperm and angiosperm plants. Amino acid sequence similarities within the reverse transcriptase domain of the Ty3/ Gypsy retrotransposons (Gypsy) and Ty1/Copia retrotransposons (Copia) were used to construct phylogenetic trees. As shown in Fig. 1d, the Gypsy superfamily members of the gymnosperms ginkgo and Picea were distributed in families II-VII, while those of angiosperms mainly belonged to family VIII. In contrast, Taxus Gypsy elements not only were distributed in families II-VIII but also evolved a highly species-specific family (family I), suggesting the expansion of specific Gypsy elements after Taxus speciation. Similarly, the unique expansion phenomenon in Taxus was also observed in the phylogenies of the Copia superfamily (Fig. 1d). Moreover, family V consisted of only Taxus LTRs in the Copia phylogenetic tree displaying a Taxus-specific amplification burst. In addition, Taxus was distributed in family IV, where the gymnosperms ginkgo and Picea were located, and families I-III also contained angiosperms, suggesting that Taxus LTRs were placed in a unique position compared with other selected species. These results suggest that the Gypsy and Copia superfamilies of Taxus have undergone a relatively unique evolutionary pattern, especially the specific Gypsy family I and Copia family V.

Evolution of gene families and elevated secondary metabolism in Taxus.
To understand the context of metabolic networks during Taxus evolution, we compared orthologous genes between Taxus and selected gymnosperms, angiosperms and cryptogams (Fig. 1e). In the 35,298 identified orthologous gene families (Supplementary Table 9), we found that 6,533 gene families were shared by the selected species, illustrating their evolutionary conservation (Fig. 1e). Compared with the selected species, 2,339 gene families were exclusive to Taxus (Fig. 1e). In addition, 1,378 gene families experienced loss, while 142 and 41 families underwent expansion and contraction in Taxus, respectively (Fig. 1f).
Taxus contains 9,747 unique genes ( Fig. 1e and Supplementary Table 10), many of which are enriched in the biosynthesis of specialized metabolites, including terpenes, phenylpropanoids and flavones (Supplementary Table 11). For instance, 57 gene families were annotated to be cytochrome P450 (CYP450) gene families (Supplementary Table 10). Gene expansion analysis demonstrated that 979 genes were enriched in ADP binding, oxidoreductase activity, flavin adenine dinucleotide binding, transferase activity and signal transduction, among other functions (Extended Data Fig. 1e and Supplementary Table 12), with eight gene families being associated with CYP450 (Supplementary Table 13). Pfam functional analysis further showed that the Taxus genes were enriched in CYP450 gene families (PF00067.22, P < 0.01) and TFs (PF13837.6, P < 0.01; and PF00847.20, P < 0.01) (Supplementary Table 14). KEGG analysis indicated that the gained and expanded gene families were enriched in a total of 36 and 41 KEGG pathways, respectively, including one phenylpropanoid (ko00940) and three terpenoid metabolic pathways (ko00900, ko00130 and ko00902) (Supplementary Tables 11  and 15).
Evolution and genomic organization of Taxus CYP450s. Given that CYP450s participate in almost half of the enzymatic reactions in paclitaxel biosynthesis 27 , we analysed Taxus CYP450 families and identified 649 CYP450 genes from the present genome using the reported HMM model (PF00067). These CYP450s can be divided into two catalogues: A-type and non-A-type. The A-type CYP450s included only the CYP71 clan, which consisted of 17 families and 325 genes (Extended Data Fig. 2a Table 17). The CYP750 family was reported to participate in the biosynthesis of thujone monoterpene, which is involved in defence responses (for example, resistance against herbivore feeding) 28 , while CYP725 genes were known to contribute to paclitaxel biosynthesis 29 . Phylogenetic analysis of these CYP725 genes further showed that they could be categorized into the CYP725A and CYP725B subfamilies (Fig. 2b). The CYP725A subfamily (a total of 79 genes) exhibited specificity   T5αH3  T5αH2  T5αH1   T13αH1  T2αH  T10βH2  T7βH2  T7βH1   T14βH   T10βH3  T10βH1  26456357  22946779  24761927  24767741  26460669  26689812  23050162  24744480  26702057   55326109  55305455  56059081 Known CYP CYP725A Non-725A A, angiosperms; G, gymnosperms; P, pteridophytes; B, bryophytes. The colour of each block is based on the number of genes in each family, and 0, 1, 2 and 3 indicate that this number ranges from 0, 1-10, 10-50 and 50-100 genes, respectively. b, Phylogenetic analysis of the CYP725 subfamily in T. chinensis var. mairei (Taxus), Ginkgo biloba (Ginkgo), Picea abies (Picea) and C. revolute (Cycas). The dotted outline shows the gene spheres of the CYP725A and CYP725B subfamilies. The light blue dots on the ends of the phylogenetic branches represent the known paclitaxel pathway CYP725A genes and their homologues. The neighbour-joining tree was constructed by Interactive Tree Of Life (iTOL) software. The evolutionary distances were analysed by the p-distance method, and the branch lengths were scaled by the bar. c, Distribution of CYP450 genes on the 12 pseudochromosomes in Taxus. Each short line on the pseudochromosomes represents a CYP450 gene. CYP725As, CYP725Bs and the other CYP450s are marked by red, orange and grey lines, respectively. The known CYP450s in the paclitaxel biosynthesis pathway (known CYP) are shown in blue. The CYP450 groups (≥7 CYP450 genes and ≤5.26 Mb of gene spacing between two adjacent CYP450s) are labelled outside of the corresponding positions on the pseudochromosomes. d, Histogram of the number of CYP450 genes on each pseudochromosome. The CYP725 genes (shown in red and orange) were mainly distributed on pseudochromosome 9, while the other CYPs (shown in grey) were distributed randomly on 12 pseudochromosomes. The y axis represents the number of CYP450 genes. e, Group-based gene expression profiles in response to methyl jasmonate (MeJA) treatment. RNA sequencing analysis was performed with the low-paclitaxel-yielding cell line (LC) treated with 100 μM MeJA for 4 h. The expression of the gene group was calculated by the sum of the expression levels of each CYP450, and each upregulated and downregulated CYP450 was calculated as 1 and −1, respectively, on the basis of their reads per kilobase per million reads values. f, Map of CYP725As located in groups 9.1 and 9.2. The ranges of the gene groups on pseudochromosome 9 are marked in pink. CYP725As and the other genes are marked by red and grey vertical lines, respectively. The known CYP450s in the paclitaxel biosynthesis pathway (known CYP) are shown in blue. The arrows show gene orientations.
to Taxus, whereas the CYP725B subfamily was universal in gymnosperm plants (including Picea, Cycas, Ginkgo and Taxus) (Fig. 2b), which suggested that CYP725A underwent independent evolution in Taxus. Considering that all the previously defined CYP450 genes in the paclitaxel pathway belong to the CYP725A subfamily, these results suggest that the expansion of the CYP725A subfamily played vital roles in the evolution of paclitaxel biosynthesis in Taxus.
We noticed that most CYP725A genes (74.68%) were located on pseudochromosome 9 (Fig. 2c,d), exhibiting a distinct non-uniform distribution. Gene location analysis further revealed that the Taxus CYP450 genes were not distributed randomly but tended to organize into different gene groups, 25 of which were detected in the genome (Fig. 2c). We found that nearly all these groups, except groups 1.2 and 5.1, contained gene members from no more than three CYP450 families, and 11 groups had only one CYP450 family (Supplementary Table 18), suggesting that the grouping of CYP450 genes on the genome had an obvious family aggregation pattern. Furthermore, as an essential phytohormone in the biosynthesis of secondary metabolites 30 , jasmonate is closely related to the expression regulation of CYP450 genes in Taxus (Fig. 2e). Under jasmonate treatment, eight groups showed an obviously increased expression level, and ten groups showed clear inhibition of gene expression (  Table 18). These results suggest that the CYP450s in the majority of groups were coexpressed under jasmonate treatment, implying that the grouping of CYP450 genes had some coordination of physiological functions.
The gene expression levels of four gene groups (group 9.1-9.4) on pseudochromosome 9 were upregulated most prominently in the presence of jasmonate (Fig. 2e). More interestingly, groups 9.1 and 9.2 contained all known CYP725A subfamily genes related to paclitaxel biosynthesis and 12 undefined CYP725As (Fig. 2f). The expression profiles of these two groups of CYP725A genes showed that 88% of CYP725As were highly expressed in roots, 79% of CYP725As were highly expressed in the high-paclitaxel-yielding cell line (HC) and 88% of CYP725As were upregulated after jasmonate treatment (Supplementary Table 19), which is consistent with the results on the increased level of baccatin III and paclitaxel in the Taxus cell line under jasmonate treatment ( Supplementary Fig. 1). These results suggest that the two groups are likely to contain most of the paclitaxel pathway genes that arose during Taxus evolution.
Taxadiene biosynthetic genes are arranged in gene clusters. PlantiSMASH 31 analysis further showed that a potential gene cluster related to terpene biosynthesis was presented in group 9.2 ( Fig. 3a Supplementary Figs. 2 and 3). Moreover, the genes in the cluster showed a highly coordinated tissue expression pattern and expression consistency in response to jasmonate treatment (Fig. 3a), suggesting that the genes could be functionally related. TS2 and TS3 were located adjacent to T5αH1 and T5αH2 (Fig. 3a), suggesting that the genes involved in the first two paclitaxel biosynthetic steps were organized by a tandem gene duplication event during Taxus genomic evolution. The K s value of these duplicated genes suggested that this TS-T5αH duplication occurred approximately 1.15 Ma. In addition to the TS and T5αH genes assembled in the cluster, additional TS (TS1) and T5αH (T5αH3) genes are located downstream and upstream of the cluster, respectively (Fig. 3a). Biochemical assays further confirmed that TS1/2 and T5αH1/2/3 have TS activity (Fig. 3b) and taxa-4(5),11(12)-diene-5α-hydroxylase activity ( Fig. 3c and Supplementary Fig. 4), respectively, demonstrating that the copied genes possessed the corresponding enzyme activities in T. chinensis var. mairei.
We further studied the kinetic properties of TS1 and TS2. The K m value of TS2 was approximately 1.5 times higher than that of TS1, but the turnover number (k cat ) of TS2 was nearly 2 times greater than that of TS1, indicating that TS2 might have a higher catalytic efficiency than TS1 (Fig. 3d). Moreover, exogenous jasmonate treatment resulted in an obviously higher level of TS2 than TS1 transcripts (Fig. 3e), implying that TS2 could play a role in paclitaxel biosynthesis in response to different environmental and developmental cues via jasmonate signalling. Sequence identity analysis showed that TS2 shared only 77-78% protein sequence identity with TS1 and T. brevifolia TS (TbTS), which is much lower than the sequence similarity (over 90%) within the previously reported TS genes (Supplementary Table 21), suggesting that TS2 is a unique TS gene that diverged from TS1 and TbTS. Phylogenetic tree analysis further confirmed that a Taxus-specific gene duplication event approximately 33.2 Ma resulted in two distinct types of TS genes (Extended Data Fig. 5), demonstrating that TSs were encoded by two types of TS genes resulting from gene duplication events in Taxus. Together, these results suggest that the genes involved in the two initial steps of the paclitaxel biosynthesis pathway are arranged in a gene cluster named the 'taxadiene gene cluster' . The taxadiene gene cluster might be formed by gene duplications and neofunctionalization in Taxus and may be somewhat similar to previous studies on operon-like gene clusters in plants 32,33 .
Furthermore, we established a gene-to-gene coregulation network using three rounds of subtraction screening with RNA-seq datasets. The network could cover all known paclitaxel biosynthetic genes (Extended Data Fig. 6 and Supplementary Table 22), indicating its comprehensiveness and high credibility. We identified 17 CYP725A genes, 3 transferases and 10 TFs with this network, which was strongly associated with known paclitaxel biosynthetic genes (Supplementary Tables 23 and 24). Real-time quantitative PCR assays confirmed that the expression of certain genes could be induced by jasmonates (Extended Data Fig. 6), implying that their encoded proteins could be investigated as potential enzymes in paclitaxel biosynthesis. Together, these results outline the biosynthesis pathway of paclitaxel in T. chinensis var. mairei (Fig. 3f) and provide valuable genetic resources for improving paclitaxel production through genetic breeding and synthetic biology.

Discussion
The absence of a chromosome-level genome sequence from Taxus has prevented in-depth phylogenomic studies of Taxus. Our study provides an example assembly of a complex genome in trees using various sequencing technologies on DNA from endosperm calli containing haploid chromosomes. Flow cytometry analysis indicated that the nuclear genome (2n) size of the diploid cells of Taxus was approximately 20.80-24.85 Gb, nearly twice the haploid genome size evaluated by k-mer analysis (Extended Data Fig. 1a). The vast majority of HiFi sequences from Taxus leaves (diploid) could be mapped to the haploid genome (up to 95%). Moreover, 75.81% of the 228,762,501 single nucleotide polymorphisms, 44.97% of the 847,935 insertions and deletions, and 85.64% of the 64,927 structural variants were heterozygous. Taken together with the low heterozygosity (0.02%) of T. chinensis var. mairei, these results demonstrate that the haploid genome assembly could basically represent its diploid genome, showing the advantages of using endosperm calli for genome assembly.
We found that the complete BUSCOs increased only from 64.7% to 65.2% when the N50 value was increased from 637 kilobases to 2.44 Mb during the genome assembly. The low BUSCO value might be due to the limitations of the BUSCO reference dataset. The latest dataset version is embryophyta_odb10 (10 September 2020), containing 1,614 genes from single-copy genes of 50 species, including two bryophytes (Physcomitrella patens and Marchantia polymorpha), one fern (Selaginella moellendorffii) and 47 seed plants (all are angiosperms) but not including any genes of gymnosperms. Consistently, across all of the reported gymnosperm genomes, except for Gnetum montanum, the BUSCO values were not higher than 73%, and four of these genomes had values lower than 51% (Supplementary Table 6). The BUSCO value of the Taxus genome was 65.2%, similar to that of Ginkgo biloba (69.4%) and Pseudotsuga menziesii (67.8%). To assess the Taxus genome quality more comprehensively, we mapped the Illumina DNA sequencing data (~693 Gb) for the genome survey onto the assembled genome and found that up to 99.60% of the sequencing data could be mapped, indicating the integrity of the genome assembly. Moreover, we collected transcriptome data from Taxus organs, comprehensively covering eight tissues and cell lines (root, stem, leaf, bark, male strobili, female strobilus, HC and LC), and mapped the sequencing data to the Taxus genome. The results showed that the average overall mapping rate of transcriptome data to the genome reached 90.45% (Supplementary Table 25), suggesting the integrity of functional genes in the genome.
The Taxus genome contains 4.08 Gb of LTR retrotransposons, including 87.28% Gypsy and 12.35% Copia retrotransposons and a small proportion of unknown LTRs (0.37%) (Supplementary Table 8). The LTR distribution analysis showed that LTRs tended to be distributed throughout the entire chromosome (Extended Data Fig. 7a). In particular, Copia tends to be enriched at the two ends of the chromosomes, while Gypsy is more enriched at the chromosome ends and central areas. Compared with previous studies in groundnuts 34 , the Taxus genome exhibited obvious differences in LTR distribution. The LTR retrotransposons of the groundnut genome are mainly distributed in the central regions of the chromosomes, close to the centromeres. This difference may come from the large disparity in genome size and the difference between angiosperms and gymnosperms.
The LTR insertions in the Taxus genome mainly occurred 8 to 24 Ma during the long insertion period (4-60 Ma) ( Fig.1c and Extended Data Fig. 7b-d), while the primary insertion times of LTRs in spruce and ginkgo were 12-24 and 16-24 Ma within their insertion span from 4 to 64 Ma 25,35 . These results suggest that the Taxus genome has a similar LTR insertion time trend to that in the spruce and ginkgo genomes. The very long insertion time phenomenon might be related to the evolutionary characteristics of gymnosperms. It is generally accepted that gymnosperms are slow-evolving plants. Their morphology is highly conserved, which is supported by the high similarity between extant species and fossil records. Previous studies have shown that angiosperms and gymnosperms differ considerably in their mutation rates of molecular evolution per unit time, with gymnosperm rates being, on average, seven times lower than those of angiosperm species 36 . For this reason, an insertion time longer than 60 million years is common in gymnosperm genomes because of the much lower mutation rate. For example, up to 8.27% of LTRs were inserted into the ginkgo genome over 60 million years, and 13.31% of LTRs were inserted into the spruce genome over 60 million years 25,35 .
In addition to CYP450 enzymes, acetyltransferases play an essential role in paclitaxel biosynthesis, especially BAHD acyltransferases. We found 127 BAHD acyltransferases by identifying their conserved motifs (HXXXD and DFGWG). The BAHD acyltransferases in Taxus were mainly distributed in Clades I, II, VI and V. Clade V can be divided into three groups (Groups I-III), among which Group I contains all known BAHD acyltransferases in the paclitaxel biosynthesis pathway ( Supplementary Fig. 5). It would be worthwhile to investigate whether Group I contains other acyltransferases that function in paclitaxel biosynthesis in the future (Supplementary Table 26). PlantiSMASH analysis indicated that the acyltransferase genes are not organized into any gene clusters. Genomic location analysis showed that the BAHD acyltransferase genes in paclitaxel biosynthesis were mainly distributed on chromosomes 1 and 9 (Extended Data Fig. 6b). Furthermore, TAT2 was colocalized with CYP450s in gene group 9.2 ( Fig. 2c and Extended Data Fig. 6b). The relationship between CYP725As and acetyltransferases in paclitaxel evolution is an interesting aspect to study in the future.
To date, all known TS enzymes are homologous to TS1 (amino acid homology > 90%) (Supplementary Table 21). Our study showed that TSs could be encoded by two distinct types of TS genes resulting from gene duplication events in Taxus. As a representative of the new type of TS enzyme, TS2 only has approximately 77-78% amino acid homology with the reported TS enzymes (Supplementary Table 21 and exhibits more robust induced expression characteristics in treatment with jasmonates (Fig. 3e). The different properties of these two types of TS enzymes imply a new Taxus defence regulation mechanism. In Taxus, the excessive synthesis of taxanes is not conducive to its growth or development, although these chemicals play an essential role in defence responses. It is therefore necessary to accurately and efficiently control the taxane level in cells in response to environmental changes. Our results provide a new hypothesis to explain the regulation of taxane levels in plant cells.
When there are no biotic or abiotic stresses, jasmonate signalling is blocked, and TS1 is responsible for taxane biosynthesis to maintain taxanes at a basic level. However, once insect attack or other stresses occur, jasmonate signalling is activated, and TS2 is rapidly expressed to quickly increase the taxane content in cells.
In addition, we tried to explore the application potential of TS2 in bioengineering. Bian et al. reported an engineered Escherichia coli strain with TbTS (belonging to Type I) for the taxadiene product 37 . We replaced the TbTS gene with the TS2 gene (Extended Data Fig. 8a). After 60 hours of fermentation, we found that the taxadiene titre from the strain containing TS2 was over ten times higher than that from the strain containing TbTS, while the OD 600 of the two strains was not much different (Extended Data Fig. 8b,c). This result shows the great potential of TS2 in bioengineering to produce taxadiene in the future.
We also explored the function of two unknown CYP725As (55305455 and 55326109) in the taxadiene cluster using the well-established T5αH reaction assay ( Fig. 3c and Supplementary  Fig. 4h). We further incubated yeast microsomes that included 55305455, T5αH1 and cytochrome P450 reductase (CPR) with taxadiene as a substrate at the same time and analysed the reaction products by gas chromatography mass spectrometry (GC-MS). As shown in Supplementary Fig. 6, we detected only 5(12)-oxa-3(11)-cyclotaxane, 5(11)-oxa-3(11)-cyclotaxane and taxa-4(20),11(12)-dien-5α-ol, which can be obtained by catalysing taxadiene by the T5αH1 enzyme. The same result was obtained with 55326109 protein in the reaction system ( Supplementary Fig. 6). These results suggest that the unknown CYP725As are not involved in the subsequent reaction catalysed by T5αH. However, the tissue expression specificity of 55305455 and 55326109 was similar to that of TS and T5αH in the cluster, and both of them exhibited higher expression levels in roots than in leaf and bark tissues (Fig. 3a). The real-time PCR assay validated that their expression was induced by jasmonate in Taxus cells (Extended Data Fig. 6e), which is consistent with paclitaxel accumulation ( Supplementary  Fig. 1). Moreover, the gene-to-gene coregulation network showed that 55305455 and 55326109 were correlated with DBAT and T5αH1, respectively (Extended Data Fig. 6a). These results indicate that the two CYPs may play a role in paclitaxel biosynthesis and metabolism and are worthy of in-depth study in the future.
For sequencing of the haploid tissue, DNA was extracted from the endosperm callus of T. chinensis var. mairei 23 . The DNA quality was checked by agarose gel electrophoresis and a Qubit fluorimeter (Thermo Fisher). The paired-end libraries with a 500-bp insert length were prepared by following the Illumina protocols. Sequencing of the library was performed on the Illumina HiSeq 2500 system. For the PacBio Sequel analysis, SMRTbell TM libraries were prepared according to the manufacturer's protocol for the sequencing platform. Four independent Hi-C libraries were constructed and sequenced on an Illumina HiSeq 2500 (PE125 bp) at Annoroad Gene Technology Co.
For circular consensus sequencing, genomic DNA was extracted from frozen leaves using the DNeasy Plant Mini Kit (Qiagen). A 15-kilobase DNA SMRTbell library was constructed and sequenced on a PacBio Sequel II platform; these sequencing reads are known as highly accurate long reads, or HiFi reads.
For Hi-C assembly, the clean Hi-C sequencing data were mapped to the genome draft by HiC-Pro (v.2.7.8) 41 , and the library quality was assessed by counting the number of unique valid paired-end reads. Only unique valid paired-end reads were maintained for downstream analysis. We used the Hi-C data to align and correct the contigs for misassembly through the Juicer 42 pipeline and the 3D-DNA pipeline 43 . The assembly package Lachesis 44 was applied to perform clustering, ordering and orienting on the basis of the normalized Hi-C interactions. For each pseudochromosome group, the exact contig order and directions were obtained through a weighted directed acyclic graph. We filled the gaps among contigs in the pseudochromosomes using TGS-Gapcloser (v.1.01) 45 by two rounds with continuous long-read and HiFi data (26 Gb), respectively. After the filling progress, we further removed the redundant contigs that were not anchored to the chromosomes using Purge Haplotigs (v.1.03) 46 .
For assembly assessment, the RNA-seq reads of eight tissues (including female strobilus, female leaf, female bark of stem, female root, male strobili, male leaf, male bark of stem and male root) and HC and LC were mapped to assess the assembly quality. The average mapping rate of all RNA-seq datasets was subsequently calculated by software HISAT2 (ref. 47 ) with the following parameter: score-min, L, 0, −0.1.
For repeat annotation and analyses, repetitive elements in the Taxus genome were identified through a combination of de novo and homology-based approaches. De novo prediction of repeat elements was carried out using RepeatModeler (v.1.0.1, http://www.repeatmasker.org/RepeatModeler/). For homology-based annotation, the repeat element libraries from Repbase 48 , the Institute for Genomic Research 49 and the annotated Ginkgo biloba genome were merged with the de-novo-derived library to create the whole dataset. The dataset was then used to mask identified TEs in the Taxus genome with RepeatMasker (v.4.0.5, http://www.repeatmasker.org). We identified LTRs with the LTR_retriever method 50 . Specifically, LTR_finder 50 and LTRharvest 51 were first used to identify all the existing LTR sequences in the Taxus genome according to the basic sequence rules of LTRs. The candidate LTR RTs were filtered to remove non-LTR RT repeat elements or those with large amounts of tandem repeats or gaps. Especially in fragmented genome assemblies, these requirements hugely reduce the number of LTR RT candidates but ensure that only full-length LTR RTs are analysed. We integrated the results and discarded false positives using the LTR_retriever pipeline; we then estimated insertion times (T) on the basis of T = D/2μ, where D is the divergence rate and μ is the neutral mutation rate (7.34573 × 10 −10 ) 36 .
For the annotation of protein-coding genes, gene structure prediction was performed using ab initio, homology-based and RNA-seq-based pipelines. For the ab initio annotation, SNAP 52 , Augustus 53 and GlimmerHMM were applied. Eight species (Arabidopsis thaliana 54 , Oryza sativa 55 , Gnetum montanum 56 , Picea abies 25 , Ginkgo biloba 35 , Selaginella moellendorffii 57 , Pinus taeda 58 and Amborella trichopoda 59 ) were chosen for homology annotation to predict protein-coding genes using GeneWise 60 . To generate annotation results based on transcripts, RNA-seq alignment files were generated using TopHat2 (ref. 61 ) and assembled via Cufflinks 62 , and the program PASA 63 was used to align spliced transcripts and annotate candidate genes. Finally, gene models predicted from three approaches were merged by EVM 64 . The functions of protein-coding genes were identified by mapping sequences against the Gene Ontology 65 , InterProScan 66 , Swiss-Prot (http://www.uniprot.org/) 67 , TrEMBL 68 and TAIR databases 69 .

Identification of WGD.
Genome-wide duplications were searched in the Taxus genome. Self-alignment of the assembled genome sequence was performed using metablast as described previously 70 . All-versus-all paralogue analysis in the Taxus genome was performed using reciprocal best hits from primary protein sequences by self-Blastp in Taxus. Reciprocal best hits are defined as reciprocal best Blastp matches with an E-value threshold of 10 −5 , a c-score (Blast score/best Blast score) threshold of 0.3 (ref. 71 ) and an alignment length threshold of 100 amid acids. The value of K s of reciprocal best hit gene pairs was calculated on the basis of the YN model in KaKs_Calculator v.2.0 (ref. 72 ). Synteny analysis was performed on Taxus protein-coding genes using MCScanX 73 to identify WGD events with the default parameters from the top ten self-Blastp hits. K s and 4DTv were calculated for Taxus syntenic block gene pairs. Genome mining for CYP450s and gene clusters. For the identification and classification of CYP450 genes, hmmsearch was used to identify CYP450 genes in the Taxus genome with PF00067 from the Pfam database 74 . The classification of the 649 CYP450 genes was executed by alignment with the CYP450 database 75 using standard sequence similarity cut-offs, with definite standards of 97%, 55% and 40% for allelic, subfamily and family variants, respectively. According to the standardized CYP450 nomenclature 76 , CYP450s were divided into A-type and non-A-type CYP450s, and phylogenetic analysis of CYP450 genes was performed for A-type and non-A-type CYP450s. Neighbour-joining phylogenetic trees were constructed using the MEGA7 package with homologous amino acid sequences 77 .
For genome mining for gene clusters involved in plant specialized metabolism, PlantiSMASH 31 was used to search for potential gene clusters using the default parameters and the GFF (General Feature Format) annotation files of the software. Gene groups were identified by in silico analysis on the basis of the following criteria: (1) the distance between two adjacent CYP450 genes in one group should be less than 5.26 Mb, and (2) one group should contain at least seven CYP450 genes.
RNA-seq data analysis for candidate genes in the paclitaxel biosynthesis pathway. All tissues, including female strobilus, leaf, bark of stem, and root and male strobili, leaf, bark of stem, and root, were mapped to the Taxus genome, and the fragments per kilobase of transcript per million mapped reads value was calculated using HISAT2 and StringTie 76 . Expression data from female bark, female roots and female leaves were used to identify the genes associated with paclitaxel biosynthesis. First, we selected genes that were more highly expressed in roots or bark than in leaves. Second, the genes were further confirmed in two Taxus half-sib cell lines (HC and LC) with distinct accumulation patterns of paclitaxel, and the genes should be highly expressed in HC. The differentially expressed genes were filtered using edgeR 77 with logFC > 1 and FDR < 0.05. We obtained 1,638 genes that met the above thresholds. Gene-to-gene networks were constructed using the expression matrix from MeJA-induced cell line (0, 2, 4, 8 and 24 h) RNA-seq data. Pearce correlation analysis was performed with the known functional genes as the target genes. Hypothesis development for the Pearson correlation was performed, and pairs with P < 0.05 remained.

Functional characterization of TS genes.
The open reading frames of TS1 (ctg6088_gene.1) and TS2 (ctg5306_gene.4) were cloned by PCR with reverse transcription from the Taxus cell line. Plant-Ploc (http://www.csbio.sjtu.edu.cn/ bioinf/plant-multi/), ChloroP (http://www.cbs.dtu.dk/services/ChloroP/) and TargetP (http://www.cbs.dtu.dk/services/TargetP/) were used for the prediction of the plastidial target sequence. The 60-residue N-terminally truncated TS1 and TS2 genes were inserted into the E. coli expression vector pET28b to form the constructs pET28b::TS1 and pET28b::TS2, respectively. All expression plasmids were constructed using the Hieff Clone One Step Cloning Kit (YEASEN), and the primers used in this work are given in Supplementary Table 28. For the in vitro enzyme assay, the enzyme assays were performed in a final volume of 500 µl of buffer (25 mM HEPES, pH 8.5, 10% glycerol, 5 mM DTT, 5 mM sodium ascorbate, 5 mM sodium metabisulfite and 1 mM MgCl 2 ) containing 100 µg of purified protein and 100 µM GGPP (Sigma-Aldrich). The reaction mixture was overlaid with 500 µl of pentane (Macklin, GC-MS grade) and incubated overnight at 32 °C. In addition, the mixture was vortexed, and the pentane overlay was subsequently removed by centrifugation at 5,000 r.p.m. for 10 min and concentrated by N 2 gas before GC-MS analysis. Inactivated TSs-His6 was used as the control. Taxa-4(5),11(12)-diene (1) and taxa-4(20),11(12)-diene (2) preparations were performed according to a previous study with the taxadiene-producing E. coli strain T2 (harbouring pMH1, pFZ81 and pXC02) 37 . The organic solutions containing crude compounds 1 and 2 were concentrated on ice under N 2 gas and redissolved in dimethyl sulfoxide for the purification of compounds 1 and 2 by thin layer chromatography. The purity and concentration were determined by GC-MS.
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The T. chinensis var. mairei genome project has been deposited in the Genome Sequence Archive at the National Genomics Data Center, and is accessible at http:// bigd.big.ac.cn/ under BioProject no. PRJCA003841. Whole-genome and RNA-seq data were deposited in the Genome Sequence Archive database under accession nos CRA004292, CRA003496 and CRA004255. The T. chinensis var. mairei genome data have also been deposited at NCBI under BioProject no. PRJNA730337 and are publicly accessible at https://www.ncbi.nlm.nih.gov/Bioproject/?term=PR JNA730337. Source data are provided with this paper.

Code availability
In-house Python and R scripts for gene location, P450 analysis and heat-map analyses can be freely downloaded at GitHub (https://github.com/liaoqinggang/ Taxus_genome_pipelines). Fig. 1 | The Taxus genomic features to complement Fig. 1. a, Genome size estimation of T. chinensis var. mairei based on k-mer distribution. The X-axis represents the occurrence of k-mers, and the Y-axis represents the frequency. The k-mer values for different genome sizes are shown in the inner table. b, Genome-wide all-by-all Hi-C interaction. The heat map shows Hi-C interactions under a resolution of 2 Mb. Darker red pixels indicate higher contact probabilities. The number on the scale bar indicates the number of links after logarithmic analysis. c, Genomic landscape of the twelve pseudochromosomes. Track a represents the length of the pseudochromosomes (Mb); b, c, d, and e show the expression of tissue-specific genes in the bark of stem, root, strobili and leaf from the male Taxus plant, respectively; f, g, h, and i show the expression of tissue-specific genes in the bark of stem, root, strobilus and leaf from the female Taxus plant, respectively; j and k display high-and low-producing paclitaxel cell lines, respectively. d, Whole genome duplication (WGD) analysis based on the substitution rate distribution of paralogs. The 4DTv values of paralogs were calculated using KaKs_ calculator with the YN model. The X-axis is the value of fourfold synonymous third-codon transversions (4DTv) for paralogous pairs in the Taxus genome, and the Y-axis represents the frequency. e, Gene Ontology (GO) enrichment for gene families with significant expansion. GO enrichment analysis of a subset of 142 gene families with significant expansion (p < 0.05); FDRs were adjusted for multiple testing. The size and color of dots indicate the number of genes and false discovery rate (FDR), respectively. The X-axis represents the gene ratio, and the GO terms are listed on the Y axis.