Taxus yunnanensis genome offers insights into gymnosperm phylogeny and taxol production

Taxol, a natural product derived from Taxus, is one of the most effective natural anticancer drugs and the biosynthetic pathway of Taxol is the basis of heterologous bio-production. Here, we report a high-quality genome assembly and annotation of Taxus yunnanensis based on 10.7 Gb sequences assembled into 12 chromosomes with contig N50 and scaffold N50 of 2.89 Mb and 966.80 Mb, respectively. Phylogenomic analyses show that T. yunnanensis is most closely related to Sequoiadendron giganteum among the sampled taxa, with an estimated divergence time of 133.4−213.0 MYA. As with most gymnosperms, and unlike most angiosperms, there is no evidence of a recent whole-genome duplication in T. yunnanensis. Repetitive sequences, especially long terminal repeat retrotransposons, are prevalent in the T. yunnanensis genome, contributing to its large genome size. We further integrated genomic and transcriptomic data to unveil clusters of genes involved in Taxol synthesis, located on the chromosome 12, while gene families encoding hydroxylase in the Taxol pathway exhibited significant expansion. Our study contributes to the further elucidation of gymnosperm relationships and the Taxol biosynthetic pathway.

T axus species, belonging to the Taxaceae (yews, gymnosperms), are slow-growing, long-lived coniferous trees or shrubs, that have been regarded as endangered Tertiary relict species. Taxus are well-known for their cancer-inhibitory alkaloid paclitaxel (Taxol), which is a trace natural product 1 . Taxol is a polyoxygenated cyclic diterpenoid, mainly used to treat numerous cancers, including ovarian, breast, lung, cervical, and pancreatic cancer [2][3][4][5] . However, limited content (0.01-0.05 %) and localization of paclitaxel in specific organs (the bark of yew) renders production from natural sources low 3,6 . Taxol biosynthesis begins with the universal diterpenoid precursor geranylgeranyl diphosphate (GGPP), which is then decorated with a series of cytochrome-P450 hydroxylases (CYP450s), acyltransferases and other enzymes, leading to the end product paclitaxel 7,8 . The complexity of Taxol biosynthesis has greatly hindered the mass production of Taxol 3,9 . To further elucidate the Taxol biosynthesis pathway, we here report a high-quality genome assembly of T. yunnanensis. In addition, the genome of Taxus, as a first representative of Taxaceae, might help in unraveling the phylogenetic relationships within gymnosperms. There is much controversy about the evolutionary relationship between different gymnosperms (such as Cycads, Ginkgo, Gnetophytes and Conifers) [10][11][12] . The 1 KP transcriptome dataset provides strong support for Cycads and Ginkgo sister to the rest of gymnosperms and Gnetophytes sister to, or within, the conifers [13][14][15] . Whole-genome sequences, such as the one of Taxus presented here, provide an additional dataset to shed light on the elusive evolutionary relationships within gymnosperms.

Results and discussion
Based on the k-mer distribution analysis, we estimated the genome size of T. yunnanensis to be 10.49 Gb, with a high level of repetition (77.74%) and heterozygosity (0.54%) (Supplementary Fig. 1 and Supplementary Table 1). The genome sequence of T. yunnanensis was obtained using Oxford Nanopore highthroughput sequencing systems (85×), Illumina (50×) and highthroughput chromosome conformation capture (Hi-C, 60×) (Supplementary Table 2). The total length of the final assembly was 10.73 Gb with a contig N50 of 2.89 Mb and a scaffold N50 of 966.80 Mb (Table 1 and Supplementary Table 3). A total of 10.63 Gb of the assembly and 98.95% of the genes were distributed across 12 chromosome-level pseudomolecules (Supplementary Table 3 and Supplementary Fig. 2). The completeness of the genome assembly and gene set of T. yunnanensis were estimated at 72.6% and 73.7% using BUSCO, which is similar to the available gymnosperm genomes (Supplementary Table 4) [16][17][18] . We annotated 34,931 high-quality protein-coding genes, which is slightly lower than for the S. giganteum genome (38,000) 17 ( Fig. 1 and Table 1). On average, the predicted gene sequence length was 21,831.74 bp, containing 4.58 exons with an average sequence length of 305.46 bp (Table 1). Numerous long introns are a notable characteristic of T. yunnanensis genome. The length distribution for the 10% longest introns in T. yunnanensis is from 14,790 bp to 462,177 bp, and average at 35,282 bp. A comparison of gene models for the 14 land plants revealed that the average length of the longest 10% of introns in most of the gymnosperms was longer than that in angiosperms (Supplementary Table 5 and Supplementary Fig. 3).
A total of 7.96 Gb of repetitive elements occupying 74.11% of the T. yunnanensis genome were annotated (Supplementary Table 6). Repetitive sequences, especially the long terminal repeat retrotransposons (LTR-RTs), have been deemed to be the major component of all gymnosperm genomes and the main cause of gymnosperm genome expansion 12,16,17,19 . Consistent with other gymnosperm genomes, the majority of the repeats in the T. yunnanensis genome are LTR (40.95% of all assembled sequences), of which two super-families, 2,138,065 Ty3/Gypsy and 453,398 Ty1/Copia (the number of repeats sequences) were identified, accounting for 35.95% and 4.77% of all assembled sequences, respectively (Supplementary Table 6). Based on a mutation rate of 7.34573 × 10 −10 substitutions per base per year, we found that the insertion for Gypsy and Copia occurred largely between 8−24 and 8−44 million years ago (MYA), respectively ( Supplementary Fig. 4a). Since the Gypsy accounted for 87.78% of the total LTR sequences, the insertion of large amounts of Gypsy in 8−24 MYA resulted in genome expansion of T. yunnanensis. We identified and characterized full-length LTR in four gymnosperms and three angiosperms (T. yunnanensis, Gnetum montanum, Ginkgo biloba, S. giganteum, Amborella trichopoda, Oryza sativa and Arabidopsis thaliana), the number of LTRs contained in gymnosperms was higher than those in angiosperms (Supplementary Table 7). Phylogenetic reconstructions revealed that conifers displayed substantially higher diversity and abundance than G. montanum and G. biloba, possibly indicating gradual and/or rapid diversification in conifers ( Supplementary  Fig. 4b).
The T. yunnanensis genome, as a second member belonging to the so-called conifers II clade for which the genome sequence has been determined, provides an opportunity to revisit the relationships of gymnosperms. Using six gymnosperms, five angiosperms and two pteridophytes, and Anthoceros punctatus as an outgroup, we identified 588 single-copy gene families (5951 genes in all of the 14 species) to construct a phylogenetic tree, using ASTRAL and 'supertree' based on amino acid alignments, DNA alignments, codon alignments and codon alignments with third-positions removed (Fig. 2a Fig. 5). However, this relationship is at odds with a general phylogeny proposed by the 1KP consortium 13 , which finds Cycad and Ginkgo as sisters to the rest of gymnosperms based on transcriptome data. This difference may require further study, such as the use of genome data for additional gymnosperms.
A total of 575 gene families were expanded, 55 of which exhibited significant expansion (P < 0.05), relative to the ancestor  Most angiosperms have undergone whole-genome duplication (WGD) somewhere during their evolutionary past. Although it has been reported that all seed plants shared an ancient WGD, WGDs in gymnosperms seem to be much rarer [21][22][23] . WGDs are usually identified from Ks (a measure of the number of substitutions per synonymous site) age distributions of paralogs, or from gene collinearity data. Since Ks age distributions showed no clear peaks and no widespread intragenomic colinear or syntenic segments could be detected, we assume no recent WGD event has occurred in the evolutionary past of T. yunnanensis, although older WGDs cannot be excluded (Fig. 2c, Supplementary Fig. 9 and Supplementary Fig. 10). Evidence for small-scale gene duplication events is more evident and general analysis of gene duplication in T. yunnanensis shows that dispersed duplicates (60.07%) from the dominant type compared to three other types: WGD/segmental duplication (0.75%), proximal (11.66%) and tandem (13.09%) (see 'Methods').
All of the hydroxylases involved in Taxol biosynthesis belong to CYP450s 7 . The CYP450s responsible for hydroxylation at the C-2, C-5, C-7, C-10, C-13 and C-2ʹ positions have been characterized in Taxus 8,24-28 , while the enzymes responsible for C-1 and C-9 oxidation are currently unknown ( Fig. 3a and Supplementary Data 7). Most of the enzymes, identified to be involved in Taxol biosynthesis, are encoded by multiple gene copies in Taxus, especially the CYP450s genes such as taxoid 10β-hydroxylase (T10βOH) and taxoid 5α-hydroxylase (T5αOH) (Fig. 3a and Supplementary Data 7).
So far, Taxol is found only in Taxus species, indicating that some of the CYPs involved in the biosynthesis of Taxol may be specific to Taxus. In order to identify such species-specific CYPs, we compared a total of 3368 CYP genes in T. yunnanensis, S. giganteum, G. biloba, G. montanum, Picea glauca, Pseudotsuga menziesii and A. thaliana (Supplementary Table 9). In total, 624 CYP450s genes were identified in T. yunnanensis genome and the number of CYP725 sub-family genes was substantially higher than that in other species (Supplementary Table 9). The CYP450s genes involved in the biosynthesis of Taxol belong to CYP725A subfamily 8 . We constructed a phylogenetic tree from CYP725 genes obtained from a genome sequence alignment of five species (T. yunnanensis, S. giganteum, G. biloba, P. glauca and P. menziesii) ( Fig. 3b and Supplementary Fig. 11). Sixty-eight specific CYP725 genes were found in the T. yunnanensis genome, of which 62 genes belong to the CYP725A sub-family. Although few CYP725 genes have also been identified in the genomes of S. giganteum and G. biloba, only 12 CYP725 genes of S. giganteum belonged to two of the clades of the CYP725A sub-family, while all of 12 CYP725 genes of G. biloba belong to the CYP725B sub-family (Fig. 3b,  Supplementary Fig. 10 and Supplementary Data 8).
The genome assembly allowed us to locate all of the functionally characterized genes of Taxol metabolism in Taxus, as well as their closely related homologs, on either the chromosome or the unplaced scaffold positions. Forty CYP725A genes were found distributed on chromosome 12, including hydroxylation-like genes responsible for C-2, C-5, C-7, C-10, C-13 and C-14 hydroxylation. Moreover, taxadiene synthase (TXS) and 10deacetylbaccatin III-10-O-acetyltransferase (DBAT) like genes involved in the Taxol biosynthetic pathway were also located on chromosome 12 ( Fig. 3c and Supplementary Data 9). These genes were grouped on a 76.2 Mb region to form a taxol synthesis gene cluster which was artificially divided into three sub-clusters (subcluster I, II, III) (Fig. 3c). We detected 12 functionally uncharacterized CYP725 genes in the sub-cluster I, which exhibited a similar expression pattern (Pearson correlation coefficient > 0.8, P < 0.05) with T5αOH, T10βOH, T2αOH, T7βOH and T13αOH, and were highly expressed in bark (Fig. 3c, d). We suspect that these genes may participate in the production of Taxol. TXS and T5αOH are encoded by co-localized gene copies in sub-cluster II; T10βOH, T2αOH, T7βOH and T13αOH are encoded by colocalized gene copies in sub-cluster III, which were all highly expressed in the bark of T. yunnanensis (Fig. 3c, d; Supplementary Data 9, 10). Moreover, 15 functionally uncharacterized CYP725 genes localized in sub-cluster II and III, which have low homology with known hydroxylation-like genes in Taxol pathway, while exhibiting high and similar expression patterns (Pearson correlation coefficient > 0.8, P < 0.05) as the known genes functioning in Taxol metabolism. These genes might be interesting as potential candidates genes in Taxol biosynthesis pathway (Supplementary Data 9, 10).

Conclusions
This study reports a high-quality chromosome-level genome assembly for T. yunnanensis. This provides crucial information for the study of the evolution of gymnosperms. We estimated that there is no evidence of a recent WGD in T. yunnanensis and LTR expansion is the main cause of its large genome size. Interestingly, the CYP725A gene families, encoding hydroxylase involved in Taxol synthesis, exhibited significant expansion, and most of them clustered on chromosome 12 and exhibited co-expression, which contributes to the further elucidation of the Taxol biosynthetic pathway.

Methods
Plant materials, DNA library construction and sequencing. Fresh leaves were collected from T. yunnanensis in Yunnan province. High-quality genomic DNA was isolated from the fresh leaves using the CTAB method 29 , and the DNA quality and concentration were tested by 0.75% agarose gel electrophoresis, NanoDrop One spectrophotometer (Thermo Fisher Scientific) and Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA).
After the DNA quality and integrity were tested, it was randomly sheared by Covaris ultrasonic disruptor. Illumina sequencing pair-end libraries with an insert size of 300 bp were prepared using Nextera DNA Flex Library Prep Kit (Illumina, San Diego, CA, USA). Sequencing was performed using the Illumina NovaSeq platform (Illumina, San Diego, CA, USA). Raw reads were cleaned to discard lowquality reads (reads with adaptors or unknown nucleotides (Ns) or reads with more than 20% low-quality bases) using the SOAPnuke (v2.1.4) tool (https://github.com/ BGI-flexlab/SOAPnuke) and, after data filtering, clean data were used for subsequent analyses.
For Oxford Nanopore sequencing, the libraries were prepared using the SQK-LSK109 ligation kit and using the standard protocol. The purified library was loaded onto primed R9.4 Spot-On Flow Cells and sequenced using a PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK) with 48-h runs at Wuhan Benagen Tech Solutions Company Limited, Wuhan, China. Base-calling analysis of raw data was performed using the Oxford Nanopore GUPPY software (v0.3.0).
RNA library construction, sequencing and data processing. For gene prediction analysis, total RNA was extracted from young leaves of T. yunnanensis using the RNA prep Pure Plant Plus Kit according to the manufacturer's instructions (Tiangen Biotech (Beijing) Co., Ltd., China). RNA samples were pooled and used a strand-switching method and the cDNA-PCR Sequencing Kit (SQK-PCS109) to carry out sequencing of cDNA by PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK).
To analyze the gene expression pattern, total RNA was extracted from 15 samples that comprise three biological replicates of independent samples of five tissues (root, stem, twig bark, bark and leaf) and sequencing was performed using the Illumina NovaSeq platform (Illumina, San Diego, CA, USA). The stem refers to the young shoots ca. 1 mm in diameter; twig bark is the bark of lateral branch about 0.5 cm in diameter; bark refers to the bark of the tree trunk about 5 cm in diameter.
RNA-seq reads were mapped to the reference genome assembly using STAR (v2.5.1b; parameters: -two pass Mode) 30 and the FPKM was calculated to evaluate the expression level of each gene using the HTSeq (v.0.11.2) tool 31 after averaging some replicated samples. DESeq was used for normalizing gene expression (Base Mean) in each sample, and for identifying differentially expressed genes (DEGs) for each compared group by using P-adj (adjusted P value) < 0.05 as the threshold. Coexpression network analysis (Pearson correlation coefficient > 0.8, P < 0.05) was performed and the co-expressed genes that may share taxol metabolism were characterized.
Genome assembly. Based on the sequencing data of T. yunnanensis, the K-mer analysis method was used to estimate the genome size and heterozygosity using the kmer_freq program in the GCEpackage (v.1.0.0).
Genomic assembly was performed using SMARTdenovo software (https:// github.com/ruanjue/smartdenovo). Two rounds of error correction were performed on the assembly result based on the nanopore sequencing data using Racon (v1.4.11) (https://github.com/isovic/racon). Two rounds of error correction were performed on the assembly result based on the Illumina Novaseq sequencing data using Pilon (v1.23) 32 . Finally, the genome was removed from the heterozygous sequences using the Purge_haplotigs pipeline (v.1.0.4) 33 to obtain the final assembly result. To evaluate the genome, BUSCO (v.4.1.2) 18 assessments and the Illumina short reads were aligned to the assembled genome using BWA, resulting in a mapping rate of 99.64%.
Hi-C sequencing and data processing. High-quality DNA extracted from young leaves of T. yunnanensis was used for Hi-C sequencing. Formaldehyde was used for fixing chromatin. In situ Hi-C chromosome conformation capture was performed according to the DNase-based protocol described by Ramani 34 . The libraries were sequenced using 150 bp paired-end mode on an Illumina NovaSeq (Illumina, San Diego, CA, USA). For pseudochromosome level scaffolding, we used the assembly software ALLHIC (v0.9.12) for stitching, and then we imported the final files (.hic and .assembly) generated by the software into Juicebox (v1.11.08) 35 for manual optimization.
A single-copy gene family shared by at least eight selected species (575 gene families) was screened to construct phylogenetic trees. The 575 gene family files were each aligned using MUSCLE (v.3.8.31) 38 , both as amino acid and nucleotide, resulting in two distinct alignments per gene family. We also forced nucleotide sequences on the amino acid alignments using a custom Perl script to obtain codon-preserving alignments of nucleotide sequences. Gene trees were then reconstructed for each gene family using RAxML (v.8.2.10) software 45 with 100 replicates of bootstrapping. For each gene family, we estimated four different gene trees based on: amino acid alignments, DNA alignment, codon alignment (nucleotides forced to the amino acid alignment), and codon 1 and 2 alignment (codon alignments where the third codon position was removed). Nucleotide-based analyses were conducted using the GTR + GAMMA model; for amino acid analyses, we used WAG model. For four different datasets, we inferred four maximum likelihood tree of species from gene trees using RAxML (v.8.2.10) software 45 . A phylogenetic tree of each single-copy gene was further constructed to infer a consensus species tree using ASTRAL (v5.7.1) 46 .
For supermatrix analyses, we concatenated all gene alignments and ML supermatrix analyses were performed using RAxML (v.8.2.10) 45 software. In this study, four supermatrix datasets were created for amino acid, codon alignment (nucleotides forced to the amino acid alignment), nucleotide alignments and codon1, 2 alignment (codon alignments where the third codon position was removed). These data matrices were used for maximum likelihood phylogenetic analyses by RAxML (v.8.2.10) 45 with the GTR + GAMMA models for nucleotide and WAG models for amino acid data. For each analysis, support was inferred for branches on the final tree from 100 bootstrap replicates.
Whole-gene duplication analysis. All T. yunnanensis amino acid sequences were self-aligned using Blastp (e value cut-off 1e−05) and the best Blastp result was retained. To obtain paralogous gene families, we performed gene cluster analyses based on the CDS alignment using OrthoMCL (v 2.0.9) 44 . Ks values were calculated from all paralogous families using yn00 in the PAML package 47 . The Ks of a given family was represented by the median value, and the distribution of corrected Ks values was plotted by median values 16 .
To distinguish whether this peak represents a whole-genome duplication event or background small-scale duplications, we identified paralogous gene pairs using Blastp methods and determined syntenic blocks using MCScanX 50 (https://github.com/ wyp1125/MCScanx). Although the synonymous substitution rate (Ks) was calculated for T. yunnanensis syntenic block gene pairs and Ks distribution clearly showed a major peak at around 0.1, there were no widespread and well-maintained one-versusone syntenic blocks indicates that a recent whole-genome duplication (WGD) event has not occurred in the T. yunnanensis genome. Indeed, analysis of duplication types of the T. yunnanensis paralogs by Duplicate_gene_classifier tool of MCScanX 50 indicates that there are four types: WGD/segmental duplication (match genes in syntenic blocks), dispersed (other modes than segmental, tandem and proximal), proximal (in the nearby chromosomal region but not adjacent) and tandem (continuous repeat).
Identification of genes related to the Taxol pathway in T. yunnanensis. The genes related to the Taxol pathway in T. yunnanensis were obtained by BLAST based on the reported Taxol pathway genes (GGPPS, TXS, T5αOH, T10βOH, T2αOH, T7βOH, T13αOH, TAT, DBAT, TBT, BAPT and DBTNBT) (the reference gene accession number in Supplementary Data 8). Thirty-eight genes were identified based on Sequence identity with reference genes.
The sequences of the A. thaliana and rice CYP genes were downloaded (http:// drnelson.uthsc.edu/P450seqs.dbs.html) and used as queries to search for homologs and conserved domains (PF00067) against the T. yunnanensis genome. The classification of TyunCYP450 proteins was based on reference sequences from a P450 database established by Nelson 51 .
Genome mining for gene clusters of the Taxol pathway in T. yunnanensis. To search for potential gene clusters that are associated with the Taxol pathway in T. yunnanensis, the genes CYP725A genes and genes related to the Taxol pathway with a distance <10 apart, are considered to be a gene cluster. The gene distance represents the number of genes between two focal genes. The results were parsed and summarized with additional Pfam (version 31.0) entries and gene expression patterns across five tissue types (Supplementary Data 10).
Statistics and reproducibility. Fifteen samples comprise three biological replicates of independent samples of five tissues for RNAseq analysis. All statistical tests were performed by publicly available programs and packages as described in the 'Methods' section. Reproducibility can be accomplished by the sample collection and laboratory methods described in the 'Methods' section and all the data of our analysis availed in public databases.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The data supporting the findings of this work are available within the paper and its Supplementary Information files. The datasets generated and analyzed during this study are available from the corresponding authors upon request. The genome sequence data and the transcriptome sequence data for T. yunnanensis have been deposited under NCBI BioProject number PRJNA661543.