Transcriptome profiling for floral development in reblooming cultivar ‘High Noon’ of Paeonia suffruticosa

Tree peony (Paeonia suffruticosa Andrew) is a popular ornamental plant due to its large, fragrant and colorful flowers. The floral development is the most important event in its lifecycle. To explore the mechanism that regulate flower development, we sequenced the flower bud transcriptomes of ‘High Noon’, a reblooming cultivar of P. suffruticosa × P. lutea, using both full-length isoform-sequencing (ISO-seq) and RNA-seq were sequenced. A total of 15.94 Gb raw data were generated in full-length transcriptome sequencing of the 3 floral developmental stages, resulting 0.11 M protein-coding transcripts. Over 457.0 million reads were obtained by RNA-seq in the 3 floral buds. Here, we openly released the full-length transcriptome database of ‘High Noon’ and RNA-seq database of floral development. These databases can provide a fundamental genetic information of tree peony to investigate its transcript structure, variants and evolution. Data will facilitate to deep analyses of the transcriptome for flower development.


Background & Summary
Tree peony (Paeonia suffruticosa Andrew), is one of the most important horticultural plants in the world and a culturally important ornamental plants in China, due to its striking ornamental and medicinal values. It is a perennial deciduous shrub with large, fragrant, and colorful flowers. With a long history of cultivation, there are more than 3,000 cultivars all over the world. 'High Noon' (P. suffruticosa × P. lutea) is one of the most famous and popular cultivars and is always used for hybrid breeding due to its characteristic of cup shape, semi-double and clear lemon color. On the other hand, High Noon showed an unregular reblooming phenomenon 1 which means a twice floral development occurred around a year. These traits made 'High Noon' a suitable material for researching the floral development in tree peony.
Floral development is the most important developmental event in the life cycle of higher plants. The flowering timing is determined by processes of flowering transition, floral bud differentiation and floral organ identification 2 . A complex gene regulatory network was involved in the floral bud differentiation including plant hormone signal pathway and meristem activity regulation. Although there are great progresses in the study of molecular mechanism in floral development of model plants, it remains unclear in perennial plants, especially tree peony, whose genome information was not yet published. A few genes have been identified to be involved in the transition of shoot apical meristem (SAM) to floral bud in tree peony, including SOC1 3 , FT 4 , and AP1 5 . However, it was still hard to understand the underlying mechanism on floral development of tree peony at transcriptome level.
Next-generation sequencing (NGS) provides precise and comprehensive analysis of RNA transcripts for gene expression. It is applied to explore biological research frequently. Single molecular real time (SMRT) sequencing is a third-generation sequencing technology which offers great improvement than NGS on reads length and avoids the requirement of assembly in NGS 6,7 . The combination of SMRT and NGS has proceeded the genome assembly and transcriptomic research in several species. The genome assembly of (sunflower) Helianthus annuus and an indica rice Shuhui498 (R498) was completed with PacBio SMRT technology 8,9 . The combination of these 2 sequencing technologies has also been applied in many ways, such as seeking for the characteristic of (2019) 6 Fig. 1 Overview of the experimental design and analysis pipeline. The RNA of 'High Noon' bud in its three developmental stages of was collected for PacBio Iso-sequencing and Illumina RNA-seq. The raw data of PacBio Iso-seq were classified and corrected to generate the high-quality transcript sequence. Then these sequences were used to predict the gene sequence and annotate the gene function. The raw data of Illumina RNA-seq were filtered using ng_QC package. Then the clean data were mapped to the full-length transcript sequences by RSEM and used to calculate the reads coverage and gene abundance with package DESeq R package and DESeq2, respectively. The sample stage was confirmed through the parallel samples observed through microscope. Stage I represents the vegetative stage. The stage II represents the bud developmental stage during which the floral organ differentiated. Stage III represents the stage during which the floral organ were formed and the floral bud was identified. Bar = 0.5 mm. www.nature.com/scientificdata www.nature.com/scientificdata/ transcriptomes and identifying new genes in Sorghum bicolor 10 , Zea mays 11 , Phyllostachys edulis 12 . For species without genome information published, the combination of NGS and SMRT was applied to establish a reliable training set for gene prediction and settle biological questions in Beta vulgaris 13 , Alternanthera philoxeroides 14 , and Cassia obtusifolia 15 . In consideration of the absence of tree peony genome, the information of completed mRNA of transcripts is still unclear, which further limits the exploration of tree peony. Therefore, it is necessary to conduct a combined transcript sequencing for the gene prediction and the floral development research in tree peony.

Samples
In this study, we performed both SMRT and NGS to generate large-scale full-length transcripts and collect the gene expression profile for bud development of tree peony. Additionally, the data quality was assessed to verify their reliability. The full-length transcripts will provide gene sequence information for the further study of tree peony, and the gene expression profile will provide comprehensive understanding of the bud development of tree peony.   www.nature.com/scientificdata www.nature.com/scientificdata/ Methods Design and sample collection. 'High Noon' is a cultivar of tree peony, which contributed an important genetic resource for extending flowering period. The buds of different developmental stage were obtained from 3-5 years-old plants of a farm in Heze (E, 115°32′30.7818″; N, 35°20′4.794″). After discarding the adjacent scales and leaves, the buds were transferred to liquid nitrogen immediately and stored to −80 °C. The buds were also fixed simultaneously in FAA solution as parallel samples for microscopic observation. And subjected to section in slices and observed under microscope (Zeiss Primo Star, Germany). Through paraffin sections, vegetative meristem (Stage I, S1), floral meristem (Stage II, S2) and floral organ (Stage III, S3) were identified each with at least 3 buds (Fig. 1). RNA extraction, pacbio cDNA library preparation and sequencing. Total RNA was extracted using RNeasy Plant Mini kit (Qiangen, 74904) and treated with RNase-free DNase I (TAKARA, D2215) according to   www.nature.com/scientificdata www.nature.com/scientificdata/ the manufacture's protocol. The RNA was used for cDNA synthesis through SMARTer PCR cDNA Synthesis Kit (Clontech). The first strand and second strand were synthesized with SMARTScribe RT, using oligo(dT) primer and PCR Primer, respectively. Then the cDNA was selected with the BluePippin Size Selection System (Sage Science, Beverly, MA) according to the Isoform Sequencing protocol as described by Pacific Biosciences (PN 101-070-200-02). To increase the sequencing yield of >4 kb transcripts, a mixture of unfiltered fractions and fractions with size of >4 kb with a mole ratio of 1:1 was processed with the DNA Template Prep Kit (Pacific Biosciences of California, Inc.). Then the library was ready for sequencing after a binding of primer and DNA polymerase to the mixed transcripts. The final library was sequenced on Pacific Bioscience RS II platform (Pacific Biosciences of California, Inc.) by Novogene technology (Tianjin, China; http://www.novogene.com/).

Illumina cDNA library construction and sequencing.
After total RNA was extracted as above, mRNA was enriched by Oligo dT beads and broke into short fragment in fragmentation buffer. Then the first-strand cDNA and second-strand cDNA was synthesized using random hexamers and dNTPs, respectively. The cDNA was subjected to purification and size fractioned by AMPure XP beads, with end pairing, "A" base and Illumina adapter ligation. Then the cDNA libraries were generated by a PCR amplification. After quality control with an Agilent2100 Bioanalyzer, the cDNA libraries were sequenced with a PE mode of 150 bp on an Illumina HiSeq 2000 platform by Novogene technology (Tianjin, China; http://www.novogene.com/).

Data filtering and error correction.
Sequence data were processed using the SMRTlink 5.1 software. Circular consensus sequence (CCS) was generated from the raw subreads with a parameter of minimum length > 200 and minimum predicted accuracy > 0.8. The generated CCS sequences were then classified into Full-length non-chimeric reads (FLNC) and non-full length non-chimeric reads (NFL) according to the containment of 5′ primer, 3′ primer and poly A. FLNC were then fed into the cluster step, which underwent an isoform-level clustering (ICE), followed by a final Arrow polishing with NFL, with a minimum accuracy of 0.99. The resulting consensus reads were subjected to a correction using the Illumina RNA-seq data with the software LoRDEC. Then, after a redundancy deletion by CD-HIT software (−c 0.95, −aS 0.99), the final high quality, full-length, polished consensus sequences were generated after a redundancy deletion by CD-HIT software.
Gene quantification. The raw reads of Illumina RNA-seq were filtered by software ng_QC (−t 4, −L 20, −N 0.001). The clean data was mapped to the Polished consensus sequence by bowtie2 using end-to-end and sensitive mode. The readcounts of each transcript were calculated using RSEM and transformed into FPKM value. The expressional differential analysis was conducted by DESeq R package with a criterion of fold change > 2 and qvalue < 0.001. www.nature.com/scientificdata www.nature.com/scientificdata/

Data Records
The sequencing raw data and files of gene abundance analysis in this study were deposited in NCBI Gene Expression Omnibus (GEO) and NCBI Sequence Read Archive (SRA) with accessions GSE133476 and SRP212254 16,17 . The annotation information of full-length transcripts in this study was deposited in figshare 18 . The Supplementary material including quality assessment data of raw reads was deposited in figshare 18 . The differentially expressed gene list relative to plant hormone biosynthesis and signaling pathways was deposited in figshare 18 . The flow cytometry analysis of 'High Noon' was deposited in figshare 18 . Technical Validation RNA qualities. The purity and integrity of the total RNA was assessed with Nanodrop 2000 and Agilent 2100.
The RNA samples with RIN > 8.0 were used for sequencing library. Qubit 2.0 was used to measure the quantity of RNA sample and cDNA library. The RNA quality values in this study are listed in Table 1. pacbio ISO-seq quality validation. A total of 15.94 Gb raw data was generated by 15,654,254 subreads in the Pacbio ISO-seq. After a single molecular self-correction, circular consensus sequences (CCSs) of 714,643 reads was obtained, which was subsequently classified to full-length non-chimeric (FLNC) with 5′ primer, 3′ primer and poly A and non-full length (NFL) with a proportion of 61.78% and 38.22%, respectively. Consequently, a total of 441,507 high-quality FLNC reads was obtained through the cluster of FLNC and correction by NFL.
As SMRT sequencing generates a high error rate, it is necessary to perform error correction, which includes self-correction by iterative clustering of circular-consensus reads and correction with high-quality NGS short reads. To this end, the NGS sequence data in this study was used to correct the SMRT sequences using LoRDEC software. After that, redundant transcripts were removed by CD-HIT, and a total of 115,439 non-redundant transcripts (Polished consensus sequences) with an average length of 2,060 bp were obtained (see Table 2).

predictions of coding sequence (CDS) and function annotation.
To obtain comprehensive information of gene function in tree peony, the 115,439 transcripts were mapped to 7 databases, including NR, NT, Pfam, KOG, Swiss-Prot, KDGG, GO for the gene annotation. As a result, at least 32,416 transcripts could be mapped to all these seven databases (Fig. 2a). The length distribution of successfully annotated genes was showed in Fig. 2b Table 4. DEGs relative to floral development.