Identification of candidate genes involved in isoquinoline alkaloids biosynthesis in Dactylicapnos scandens by transcriptome analysis

Dactylicapnos scandens (D. Don) Hutch (Papaveraceae) is a well-known traditional Chinese herb used for treatment of hypertension, inflammation, bleeding and pain for centuries. Although the major bioactive components in this herb are considered as isoquinoline alkaloids (IQAs), little is known about molecular basis of their biosynthesis. Here, we carried out transcriptomic analysis of roots, leaves and stems of D. scandens, and obtained a total of 96,741 unigenes. Based on gene expression and phylogenetic relationship, we proposed the biosynthetic pathways of isocorydine, corydine, glaucine and sinomenine, and identified 67 unigenes encoding enzymes potentially involved in biosynthesis of IQAs in D. scandens. High performance liquid chromatography analysis demonstrated that while isocorydine is the most abundant IQA in D. scandens, the last O-methylation biosynthesis step remains unclear. Further enzyme activity assay, for the first time, characterized a gene encoding O- methyltransferase (DsOMT), which catalyzes O-methylation at C7 of (S)-corytuberine to form isocorydine. We also identified candidate transcription factor genes belonging to WRKY and bHLH families that may be involved in the regulation of IQAs biosynthesis. Taken together, we first provided valuable genetic information for D. scandens, shedding light on candidate genes involved in IQA biosynthesis, which will be critical for further gene functional characterization.


Results and Discussion
Illumina sequencing and de novo assembly. To obtain a comprehensive understanding of D. scandens transcriptome, the cDNA libraries were constructed from total RNA of D. scandens roots, leaves and stems, respectively (Fig. 1). Three biological replications for each tissue were sequenced using the Illumina HiSeq. 2000 sequencing platform. After filtering out adaptor sequences, ambiguous reads and low-quality reads (Q20 < 20), a total of 3.7, 3.9 and 3.6 Gb of high-quality reads from roots, leaves and stems, respectively, were generated (Supplementary Table S1). The high-quality reads obtained in this study were deposited in the NCBI SRA database (accession number: SRA480383).
Because genome information is unavailable for D. scandens, all clean reads were de novo assembled using Trinity software 39 , with optimized k-mer length of 25. We obtained a total of 96,741 unigenes with sequence length ranged from 201 to 17,943 bp, and a total length of 114,753,746 bp. The average length of all unigenes is 905 bp, and there are 34,151 unigenes (26.94%) longer than 1,000 bp. The coding DNA sequences (CDS) from all D. scandens unigene sequences were also detected and a total of 30,190 CDSs were obtained. Among them, 26.93% unigenes have CDS longer than 1,000 bp. The summary of sequencing and assembly results was shown in Table 1, and the length distribution of the transcripts, unigenes and CDS were shown in Supplementary Fig. S1.

Functional annotation.
To annotate these assembled unigenes as many as possible, sequences were searched against seven public protein databases: NCBI nucleotide (NT), NCBI non-redundant protein (NR), SwissProt protein, Gene Ontology (GO), euKaryotic Ortholog Groups (KOG), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Protein family (PFAM). A total of 37,783 unigenes (39.05%) were annotated in the public databases and, of these, 5317 unigenes were annotated in all databases. There were 28,557 unigenes (29.51%) matched in the NR database, and 22,315 unigenes (23.06%) matched with known proteins in the SwissProt database. A total of 23,412 unigenes (24.20%) matched to the GO database and 11,545 unigenes (11.93%) matched to the KOG. The number of unigenes matched to the NT, KEGG and PFAM databases was 20,673 (21.36%), 10,516 (10.87%) and 23,149 (23.92%), respectively (Table 2).
For GO analyses, 415,147 unigenes were classified into three classes, including biological processes (274,720 unigenes), cellular components (86,318 unigenes), and molecular functions (54,109 unigenes) ( Supplementary  Fig. S2). There were 12,915 unigenes assigned to KOG classifications, which were divided into 25 specific categories. Predominantly unigenes were found in the category of general functional prediction that is associated with only basic physiological and metabolic functions (2,038, 15.78), whereas unigenes belonging to category of cell motility was the smallest group, with only four unigenes (0.03%) ( Supplementary Fig. S3). The KEGG pathway-based analysis is helpful for understanding the biological functions and interactions of genes. A total of 10,782 unigenes had significant matches in the KEGG database and were assigned to 129 biological pathways. The category with the largest number of unigenes was metabolism that includes the biosynthesis of other secondary  Table 2. Summary of the annotation percentage of D. scandens compared to public databases. metabolites (367, 6.32%). Among them, predominantly category was phenylpropanoid biosynthesis (166 unigenes, 45.23%), followed by IQAs biosynthesis (44,11.99%) and other biosynthesis ( Supplementary Fig. S4).
Quantitative analysis of three major IQAs in D. scandens. According to previous reports, isocorydine, sinomenine and protopine are the main IQA components of D. scandens [32][33][34] . Herein, the content of these major IQAs in roots, leaves and stems of D. scandens was quantified by high performance liquid chromatography (HPLC). Compared to authentic standards, the contents of isocorydine, sinomenine and protopine in roots were 6.074, 0.822 and 1.824%, respectively ( Fig. 2A,B), indicating that D. scandens roots are abundant in isocorydine. By contrast, the contents of isocorydine, sinomenine and protopine in leaves and stems were very low and almost undetectable (Fig. 2C, D). These results were in accordance with roots being the main medicinal part of the plant. Isocorydine, an aporphine alkaloid with one free hydroxyl group, can inhibit cell proliferation by inducing G2/M cell cycle arrest and apoptosis and target the drug-resistant cellular side population through PDCD4-related apoptosis in hepatocellular carcinoma (HCC) [40][41][42] , thus can be served as a potential antitumor agent in HCC. This indicates that isocorydine is a kind of pharmaceutically valuable IQAs. However, the molecular mechanisms involved in isocorydine biosynthesis remain unclear.
Candidate genes encoding enzymes involved in IQAs biosynthesis. We focused on the discovery of genes involved in IQAs biosynthesis, which begins with the condensation of two L-tyrosine derivatives: 4-hydroxyphenylacetaldehyde and dopamine 43,44 . Through a series of enzymatic reactions, (S)-reticuline is synthesized, which acts as the central intermediate and is diverted to different branches for biosynthesis of different types of IQAs 45 . In D. scandens, the major active constituents are isocorydine, corydine, glaucine, sinomenine, protopine and magnoflorine [32][33][34] . To date, although the protopine and magnoflorine pathways have been characterized in other plant sepcies 17,46 , pathways of isocorydine, corydine, glaucine and sinomenine have not been determined. Therefore, we proposed their biosynthesis pathways based on previous reports and present transcriptome data (Fig. 3). Based on the KEGG pathway assignment, we discovered the transcripts encoding all the known enzymes for (S)-reticuline biosynthesis from our Illumina dataset, which include L-tyrosine aminotransferase (TyrAT), tyrosine decarboxylase (TYDC), tyrosine/tyramine 3-hydroxylase (3OHase), NCS, 6OMT, CNMT, NMCH and 4′OMT. Protopine and magnoflorine biosynthesis pathways have previously been depicted as follows: (S)-scoulerine is formed from (S)-reticuline by berberine bridge enzyme (BBE) 47,48 . The conversion of (S)-scoulerine to protopine begins with the formation of two methylenedioxy bridges by cheilanthifoline synthase (CFS) and stylopine synthase (SPS), both are members of the CYP719 family, forming (S)-stylopine. CFS and SPS have been isolated and characterized from E. californica 16,49 and Mexican prickly poppy (Argemone mexicana) 50 . Subsequent N-methylation of (S)-stylopine by tetrahydroprotoberberine -N-methyltransferase (TNMT) 51 yields (S)-N-methylstylopine, which is converted by (S)-cis-N-methylstylopine 14-hydroxylase (MSH) to protopine 21 . In magnoflorine biosynthesis, (S)-reticuline is first oxidized to (S)-corytuberine by (S)-corytuberine synthase (CTS) and, subsequently, (S)-corytuberine-N-methyltransferase (SCNMT) converts (S)-corytuberine to magnoflorine 46 (Table 3; Supplementary Table 2; Fig. 3). Furthermore, 67 unigenes encoding enzymes involved in IQAs biosynthesis were used to detect their expression levels in different tissues based on reads per kilobase of transcript per million reads mapped (RPKM) values. The results indicated that most genes showed higher expression in roots than leaves or stems, especially for those genes located at downstream of protopine and isocorydine biosynthesis such as CFS, SPS, TNMT, MSH and CTS (Fig. 4). This is basically consistent with the higher content of IQAs in the roots of D. scandens (Fig. 2).
In the biosynthetic pathway of isocorydine, corydine and glaucine, we proposed that the last O-methylation reactions were catalyzed by OMTs by affecting different hydroxyl groups. In our dataset, we obtained 10 and 1 unigenes encoding 6OMT and 4′OMT, respectively. Unigenes c36937_g1, c77767_g1, DsOMT, c55366_g1, c32115_g1 and c53376_g1 were highly similar to P. somniferum PsOMT1, PsOMT2 and PsOMT3, respectively 27 . To characterize the evolutionary relationships between OMTs from D. scandens and known OMTs from other plant species, Neighbor-Joining tree was constructed. As shown in Fig. 5, unigenes c55366_g1 and c32115_g1 had high homology with 6OMT, while c53376_g1 had high homology with 4′OMT. Notably, we also found that c36937_g1, c77767_g1 and c47357_g1 were not clustered with the known OMTs, suggesting that they may have distinct roles in IAQs biosynthesis, which requires further study. Previous studies have confirmed the upstream biosynthesis pathway of isocorydine, and the corresponding enzymes have also been characterized 36,37 . However, the last O-methylation step of isocorydine biosynthesis had not been elucidated yet. Here, we supplemented and perfected the isocorydine biosynthesis pathways because it is the most abundant in D. scandens (Fig. 2). By comparing molecular structure, we found isocorydine was formed by O-methylation at C7 on (S)-corytuberine. Thus, three functionally unknown OMT unigenes, c36937_ g1, c77767_g1 and c47357_g1 (designated as DsOMT) were selected to study their function by in vitro enzyme activity. The results showed that DsOMT was able to O-methylate at C7 on (S)-corytuberine yielding isocorydine  Table 3. Unigenes involved in biosynthesis of isoquinoline alkaloid. ( Fig. 6). Therefore, we first established a complete biosynthesis pathway for isocorydine, which will be of potential significance for further understanding the molecular mechanisms of IQAs biosynthesis in plants.    Table S3). The phylogenetic relationship between TFs from D. scandens and TFs characterized from other plants was showed in Fig. 7. We identified three unigenes (c51183_g3, c30931_g1, and c13730_g1) belonged to bHLH, and c51183_g3 was closely homologous to EcbHLH1-1. Six unigenes (c25015_g1, c46912_g1, c21995_g2, c50940_g4, c11393_g1, and c41587_g1) were clustered with WRKY1, with c41587_g1 highly homologous to CjWRKY1, suggesting that they might be involved in regulation of IQAs biosynthesis in D. scandens. Characterizing the functions of these unigenes will help us to better understand the regulatory mechanism of IQAs biosynthesis.

Conclusions
We firstly carried out transcriptomic analysis of D. scandens and obtained a total of 96,741 unigenes, which provide valuable genetic resource for this invaluable Chinese herb medicine. We further proposed the integrated biosynthetic pathways of isocorydine, corydine, glaucine and sinomenine in D. scandens. The identification of 67 unigenes and in vitro enzymatic characterization of one of them provide opportunities for the de novo production were performed for each tissue. Raw reads were firstly transformed into clean reads by removing reads with sequencing adaptors, reads with frequency of unknown nucleotides above 5% and low-quality reads (containing more than 50% bases with Q-value ≤ 20) using a custom Perl script. Then, the clean reads were de novo assembled using the Trinity program (k-mer = 25, group pairs distance = 300) with default parameters 43 .

Functional annotation and candidate gene prediction. For functional annotations, all unigenes
were assessed by public databases, including NT, NR (http://www.ncbi.nlm.nih.gov/), SwissProt (http://www. expasy.ch/sprot) and KOG database (http://www.ncbi.nlm.nih.gov/COG/), using BLASTX (E-value < 10 −5 ) and BLASTN (E-value < 10 −5 ), respectively. The unigenes were also aligned to KOG and KEGG databases (http://www.genome.jp/kegg) 60 using BLASTX with an E-value < 10 −10 . A Perl script was used to retrieve KEGG Orthology information from blast result and then established pathway associations between unigenes and databases. Based on the results of the NR database annotation, the Blast2GO program 61 was used to obtain GO unigene annotations. Then, WEGO 62 software was used to perform GO classification and draw a GO tree. Moreover, the conserved domains/families of the assembled unigenes encoding proteins were searched against the Pfam database (version 26.0) 63 using Pfam_Scan script.
The CDSs of all unigenes were predicted using BLSATX and ESTscan. The unigene sequences were searched against the NR, KOG, KEGG and SwissProt protein databases using BLASTX (E-value < 10 −5 ). The best alignment results were used to determine the sequence direction of unigenes. Unigenes with sequences with matches in one database were not searched further. When a unigene was not aligned to any database, ESTScan 64 was used to predict coding regions and determine sequence direction. To identify the TFs, all unigenes were searched against the PlnTFDB database 65 using iTAK analysis tool (http://bioinfo.bti.cornell.edu/cgi-bin/itak/index.cgi) 66 .
HPLC analysis. 0.2 g dried powder of D. scandens roots, leaves and stems was respectively extracted with 50 mL of 1% hydrogen chloride −70% methanol mixed liquor for 60 min, and sonicated for 30 min. For determining main bioactive components of D. scandens, an Agilent 1260 HPLC system (Agilent Technologies, Santa Clara, CA, USA) was used. Chromatographic separation was performed on the chromatographic column Agilent Zorbar SB-C 18 (250 mm × 4.6 mm, 5 μm, Agilent Technologies) at a column temperature of 30 °C. The flow rate was fixed at 1 mL/min, and the mobile phase consisted of sodium dihydrogen phosphate-methanol (35:65, v/v) containing 0.1% sodium dodecyl sulphate (A) and acetonitrile (B). Separation was achieved using the following gradient system: 85% B at 0 min, 100% B at 10 min, and 100% B at 30 min. Detection was performed at 289 nm 67 . Authentic (S)-corytuberine, isocorydine, sinomenine and protopine were purchased from JK chemical (Beijing, China).
Digital gene expression profiling. The high-quality reads were aligned to the assembled unigenes with the BWA program 68 . An RPKM value was calculated for each unigene in each tissue of D. scandens. The RPKMs of all annotated isoforms for the same gene were summed as the RPKM of that gene. Differential expression of unigenes was calculated with a threshold of P value < 0.001 and two-fold change.
Phylogenetic analysis. Phylogenetic analysis was performed based on the deduced amino acid sequences of OMTs and TFs from D. scandens and other plants. All deduced amino acid sequences were aligned with Clustal X using the default parameters as described previously 69 : gap opening penalty, 10; gap extension penalty, 0.1; and delay divergent cutoff, 25%. The evolutionary distances were computed using MEGA5.10 with the Poisson correction method. For the phylogenetic analysis, a neighbor-joining tree was constructed using MEGA5.0. Bootstrap values obtained after 1000 replications are indicated on the branches. The scale represents 0.1 amino acid substitutions per site.
Recombinant protein purification and enzyme activity assay. Full-length cDNA of DsOMT was obtained by PCR amplification using primers 5′-CATATGATGAATCACAAAGTGCATCATCAT-3′ (forward, with the added NdeI restriction site underlined) and 5′-TCTAGATTATTTGCAGAACTCCATGACCCA-3′ (reverse, with the added XbaI restriction site underlined), and cloned into the pCzn1 vector (Zoonbio Biotechnology, China). The vector was introduced into the Escherichia coli line Arctic-Express (Zoonbio Biotechnology, China) for protein expression. The expression of the recombinant protein was induced by 0.5 mM of IPTG at 11 °C for 8 h. The cells were harvested by centrifugation and resuspended in binding buffer, and the suspension was subsequently homogenized by 1 h of 200Wsonication (Vibra Cell VC 505 Sonicator; Sonics & Materials, Newtown, CT). Cell debris was subsequently removed with 10-min centrifugation at 12,000 rpm. After renaturation by 2 M urea, the protein was purified by Ni-IDA -Sepharose CL-6B (Spectrum Chemical Manufacturing, USA) under the manufacturer's instructions. The purity of the His-tagged protein was determined by SDS-PAGE followed by Coomassie Brilliant Blue staining.
The standard enzyme assay for DsOMT activity was performed using a reaction mixture in 50 μl of 100 mM Gly-NaOH (pH 9.0), 25 mM sodium ascorbate, 100 μM SAM, 10% (v/v) glycerol, 1 mM β-mercaptoethanol, 100 μM potential alkaloid substrate, and 50 μg of purified recombinant enzyme. Assays were carried out at 37 °C for 2 h and terminated by adding 200 μL of 1 M NaHCO 3 . Products were identified by HPLC as described above. Control was performed with denatured purified His-tagged proteins prepared by boiling in water for 20 min.