Construction of integrative transcriptome to boost systematic exploration of Bougainvillea

Members of the genus Bougainvillea are rich sources of natural dyes, pigments, and traditional medicines. They are also commonly used as ornamentals in roadside landscape construction. However, the horticultural development of Bougainvillea flowers with extended growth periods and coloration is not always feasible. One reason is limited molecular knowledge and no genomic information for Bougainvillea. Here, we compiled an integrative transcriptome of all expressed transcripts for Bougainvillea × buttiana Miss Manila by integrating 20 Illumina-sequencing RNA transcriptomes. The integrative transcriptome consisted of 97,623 distinct transcripts. Of these, 47,006 were protein-coding, 31,109 were non-coding, and 19,508 were unannotated. In addition, we affirmed that the integrative transcriptome could serve as a surrogate reference to the genome in aiding accurate transcriptome assembly. For convenience, we curated the integrative transcriptome database for Bougainvillea, namely InTransBo, which can be freely accessed at http://www.bio-add.org/InTransBo/index.jsp. To the best of our knowledge, the present study is the most comprehensive genomic resource for Bougainvillea up-to-date. The integrative transcriptome helps fill the genomic gap and elucidate the transcriptional nature of Bougainvillea. It may also advance progress in the precise regulation of flowering in horticulture. The same strategy can be readily applied toward the systematic exploration of other plant species lacking complete genomic information.

RNA library construction and deep sequencing. The sample mixtures were lysed with 1 mL TRIzol reagent (Invitrogen, Carlsbad, CA, USA). Total RNA was prepared according to the manufacturer's instructions. RNA purity was evaluated with a NanoPhotometer spectrophotometer (Implen USA, Westlake Village, CA, USA). The RNA concentrations were measured with a Qubit RNA assay kit and a Qubit 2.0 fluorometer (Life Technologies, Waltham, MA, USA). RNA integrity was evaluated with the RNA Nano 6000 assay kit in an Agilent Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, USA). RNA samples passing the quality control test were stored at − 20 °C until later use.
The RNA library was constructed by the Novogene Co. Ltd. (Beijing, China). Three micrograms RNA per sample was used as the input material for library preparation. Polyadenylated (poly(A)) RNA was purified from total RNA with poly T oligo-attached magnetic beads. Fragmentation was carried out using divalent cations under elevated temperature in NEBNext First Strand Synthesis Reaction Buffer (5X). After reverse transcription, the fragments were purified with AMPure XP system (Beckman Coulter, Beverly, USA), and the cDNA fragments with length in 150-200 bp were preferentially selected for PCR. At last, PCR products were purified (AMPure XP system) and library quality was assessed on the Agilent Bioanalyzer 2100 system. RNA sequencing was performed in an Illumina HiSeq 4000 system (Illumina, San Diego, CA, USA) using the 125-bp, strandspecific, paired-end mode.
RNA-seq data preprocessing. Before proceeding to the transcriptome assembly, the RNA-seq raw data were filtered with Trimmomatic 10 to remove adaptor-contained reads, N-content > 10% reads, and low-quality reads which the percent of base with sequence quality SQ ≤ 5 is more than 50% of the whole reads. www.nature.com/scientificreports/ Construction of the integrative transcriptome. The integrative transcriptome for Bougainvillea was constructed with TransIntegrator 11 by integrating multiple RNA-seq transcriptomes. In this study, we defined a new term "integrative transcriptome" as the collection of all transcripts expressed in different tissues. It is kind of combination of transcriptomes but much more than that. The algorithm of TransIntegrator has been well described in 11 , and we only specified the parameters of current application briefly here. In principle, it took four continuous steps for integrative transcriptome construction as follows: (1) 20 Illumina RNA sequencing datasets were de novo assembled separately with Trinity software (version 1.0; Parameters: contig length > 200 bp, K-mer > 5, and max reads of per graph > 200 bp) 12 .
(2) Subsequently, all expressed transcripts of the transcriptomes were mixed together and clustered with CD-HIT-EST (version 4.5.4) 13 by setting a sequence identity threshold -c > 90% and an alignment coverage -aS > 80%. The clustered sequences were identified, the longest sequences were elected as the representatives, and the shorter sequences were discarded. (3) The representative transcripts were bridged with CAP3 (version 12/21/07, default parameters) to form longer sequences 14 if two sequences had > 40 bp overlap (> 90% sequence identity). (4) The integrative transcriptome for Bougainvillea was refined by removing sequences < 300 bp. The conventional procedure was adopted to annotate the integrative transcriptome: The coding transcripts were identified with Annocript (version 1.1.3; default parameters) 15 by referring to the SwissProt and Pfam databases. The rRNA was annotated with RNAmmer (version 1.0.0; default parameters) 16 . The tRNA was annotated with tRNAscan-SE (version 2.1.3; default parameters) 17 . The ncRNAs and miRNA precursors were identified by using the BLAST tool (version 1.3.1; e-value: 1e−5, identify: 90%) against the NONCODE database 18 and miRBase 19 . All annotations were integrated by discarding the low confident annotations subject to the default threshold of corresponding annotation tools.

Validation of the integrative transcriptome.
We validated the integrative transcriptome using both computational and experimental methods. Computationally, we mapped the transcriptomic reads against all assembled transcripts with BLASTn to confirm every bases of the transcripts were supported by multiple (≥ 5) reads. In addition, we assessed the integrative transcriptome (as the gene sets) with BUSCO (version 4.1.4; name = "embryophyta_OthoDB10") 20 . Moreover, we selected a list of transcripts for experimental validation by meeting criteria of: (1) some transcripts are comparatively dissimilar (identify < 50%) to their matched A. thaliana homologs. (2) Some transcripts are significantly longer or shorter than their A. thaliana homologs. (3) And the transcript sequence can be determined by single read-through of Sanger sequencing. Accordingly, eight transcripts were selected for experimental validation, including three dissimilar transcripts (NBP35, RGL2, and ATZNMP) to the matched A. thaliana homologs, four shorter transcripts (PAF2, PGY2, NBP35, and ATZNMP), three longer transcripts (EXO70C1, IMPA, and RGL2), and one nearly identical transcript (PRT6). The selected eight transcripts were first reversely transcribed into cDNAs (TransGen, China) from the extracted total RNA samples and further amplified by the Polymerase Chain Reaction (PCR) with specifically designed primers (Supplementary Table 1). The PCR products were checked for band size on agarose gel. Furthermore, the sequences of right size PCR products were determined by the Sanger sequencing.
The integrative transcriptome-based transcriptome assembly and performance evaluation. Usually, the reference-based transcriptome assembly exhibits more accurate and more stable performance than that of de novo approaches. In this study, to make the integrative transcriptome an alternative reference, it was annotated and preformatted as two files: BougainvilleaXbuttiana_Manila.fa and Bougainvil-leaXbuttiana_Manila.gtf. These two files can be downloaded from the project website. Subsequently, we carried out the reference-based transcriptome assembly using the processed integrative transcriptome as the alternative reference: reads aligned with HISAT2 (version 2.1.0; default parameters) 21 , assembled and estimated transcript abundances with StringTie (version 1.3.4d; default parameters) 22 , and quantified with R package Ballgown (default parameters) 23 .
The quality of transcriptomes assembled by different methods were evaluated by examining the completeness and fragmentation ratio of the transcripts with BUSCO (version 4.1.4; database = "embryophyta_OthoDB10", number of BUSCOs = 1614) 20 . Before the BUSCO evaluation, we excluded transcripts < 300 bp from the transcriptomes under the consideration of significant difference in transcript length distribution (Fig. 4a). In addition, the potential bridge and assembly score were determined with TransRate (version 1.3, default parameters) 24 . A good quality transcriptome is expected to have larger transcript completeness, smaller fragmentation ratio, less potential bridges, and higher assembly score. As the controls, we also demonstrated de novo transcriptome assembly, the only applicable solution up-to-date, with Trinity software (version 2.8.5; default parameters) and Velvet (version 1.2.10; default parameters) 25 . The quality assessment was conducted on an external Illumina sequencing datasets, which was determined in young leaves of Bougainvillea × buttiana (SRA: SRR10076832), the only transcriptomic experiment of same species as that of this study. For fair evaluation, this external dataset was also incorporated into the integrative transcriptome to capture the leaves-specifically expressed transcripts.
Database construction. For user convenience, the transcript library was presented as an online interactive database called InTransBo, which was constructed on the Linux-Apache-JSP platform. MySQL software was used to manage data storage, access, and maintenance. Efficient and friendly user interfaces were designed with JavaScript for interactive transcript search and retrieval. Detailed information of these datasets was summarized in Table 1. Integration of these datasets with TransIntegrator yielded a collection of 97,623 non-redundant transcripts expressed in multiple tissues, which composed of the integrative transcriptome of Bougainvillea (Fig. 1). The average GC content of the integrative transcriptome was 40.12. The average sequence length was 724.11 bp. About 77.03% of the transcripts were 300-885 bp in length, which agreed with the previous estimation of plant gene length of 183-2000 bp 26 . Of the 97,623 sequences, ~ 48.15% (47,006) were annotated as protein-coding transcripts, ~ 31.87% (31,109) were putative long non-coding RNAs (lncRNAs) and other non-coding RNAs (ncRNAs), and the remaining 19.98% (19,508) were not annotated (Fig. 1a). Noteworthy, the integrative transcriptome can expand in size when involving more Illumina sequencing transcriptomes; the more diverse conditions, the more complete integrative transcriptome we can have. However, the protein-coding transcripts stopped growing rapidly after integration of six transcriptomes in this study (Fig. 1b).

Statement of consent. This study on
To validate the integrative transcriptome, we conducted the reads mapping analysis. The analysis confirmed every single bases of transcripts were supported by multiple (≥ 5) mapping reads. We also performed a gene conservation analysis by mapping the transcripts of Bougainvillea integrative transcriptome against the gene sets (48,267 genes) of Arabidopsis thaliana genome (Araport11, release 2021-06-28) with BLASTx by the criteria of sequence identity > 30%. As the results, 53,488 Bougainvillea transcripts can be mapped to 39,555 (about 81.95%) gene sets of A. thaliana, including 44,855 protein-coding transcripts, 1827 putative lncRNAs, and 6808 unannotated transcripts. For the remaining 44,135 unmapped transcripts, 2390 were potential protein-coding, 29,282 were putative lncRNAs, and 12,702 were unannotated. The unmapped transcripts were either Bougainvilleaspecific genes or wrongly assembled transcripts. In addition, of 47,006 protein-coding transcripts of Bougainvillea, 30,719 were annotated subject to A. thaliana genes, and the remaining 16,287 transcripts were more similar to homologous genes of other species (Fig. 1a).
Moreover, we selected eight transcripts for RT-PCR amplification and subsequent Sanger sequencing. These eight transcripts included three dissimilar transcripts (NBP35, RGL2, and ATZNMP) to the matched A. thaliana homologs, four shorter transcripts (PAF2, PGY2, NBP35, and ATZNMP), three longer transcripts (EXO70C1, IMPA, and RGL2), and one nearly identical transcript (PRT6). To this end, all transcripts were validated with the same size and same sequence as they were compiled (Fig. 1c). This result manifested the selected transcripts were well assembled. Putting all the results together, the integrative transcriptome was mostly reliable. www.nature.com/scientificreports/ Construction of the integrative transcriptome database for Bougainvillea. In this study, we constructed the integrative transcriptome database for Bougainvillea (Bougainvillea × buttiana Miss Manila), namely InTransBo. The InTransBo database is accessible freely at http:// www. bio-add. org/ InTra nsBo/ or at its mirror site http:// bioinf. xmu. edu. cn/ InTra nsBo/. InTransBo uses keyword search and sequence BLAST to retrieve data interactively. The keyword search function allows both accurate and fuzzy transcript search via the input of complete or partial gene symbols, gene names, protein names, and abbreviated protein names (Fig. 2a). The search engine returns the hit terms in alphabet order along with the gene symbol, protein name, transcript length, and transcript ID. Clicking on the transcript ID redirects to the transcript information page containing various data listed in order including transcript annotation, sequence information, and transcript expression profiles over different tissues (Fig. 3). For newly identified sequences and unnamed sequences, InTransBo enables data access via an alternative BLAST method. The database supports BLASTn and tBLASTn to identify nucleic acid and protein sequences, respectively. The input may either be the typing sequence in text form or an uploaded file in FASTA format (Fig. 2b). The embedded BLAST engine responds to all hit sequences meeting the default expectation threshold of E-value = 1e−10 and is sorted by hit score (Fig. 2b). For each hit, alignment details may be obtained using the "Alignment" hyperlink. The detailed information of hits may be acquired via the transcript ID hyperlinks. The integrative transcriptome is free for downloading; however, user registration is required. The integrative transcriptome was also annotated and preformatted as a gff file with Annoscript to make it an alternative reference for omics research.
The potential application of integrative transcriptome in transcriptome assembly. As a potential application, we applied the integrative transcriptome to reference-based transcriptome assembly with the HISAT2 + StringTie pipeline. The assembled transcriptomes were further compared with those assembled by the de novo assemblers (Trinity and Velvet). The comparison was conducted on an external Illumina sequencing datasets determined in same species of this study (SRA: SRR10076832) ( Table 1). The transcriptome quality was mainly evaluated by BUSCO and TransRate. As shown in Fig. 4, both de novo methods produced much more transcripts than that of the StringTie method; however, majority of the transcripts assembled by Trinity and Velvet were < 300 bp and consisted of a large number of duplicates. The BUSCO evaluation manifested that the StringTie method outperformed Trinity and Velvet by owning more complete transcripts and less duplicates (Fig. 4b). In this case, Velvet failed to produce reliable transcriptome from almost all aspects. Similar to the evaluation of BUSCO, additional TransRate evaluation estimated higher assembly score and less potential bridges in StringTie-assembled transcriptome. Therefore, the integrative transcriptome can serve as a good reference in transcriptome assembly under the circumstance of no genome.
Taking the integrative transcriptome as the reference, we portrayed the gene expression landscapes for ten Bougainvillea flower tissues, which can be acquired in the InTransBo database via specifying gene symbols. These gene expression profiles are valuable for exploring the biological basis of Bougainvillea. For instance, flower is one of the major sources of metabolites, natural dyes, pigments, medicinal uses in Bougainvillea. Via comparing to 27 genes related to the meristem response and development in Arabidopsis thaliana 27 , we identified 13 distinct homologous genes (52 transcripts) in Bougainvillea. They were AGL24, AP1, CAL, FUL, LFY, SOC1, SPL3, SPL4, SPL5, SPL9, SPL15, FD, and SVP. Monitoring the behaviors of these genes in different flower tissues will provide insights into the molecular mechanisms during floral transition of Bougainvillea.

Conclusion
Bougainvillea sp. require systematic transcriptome research but lack genome support. Regretfully, whole-genome sequencing is not a viable option to-date as it is costly and technically impracticable. Here, we used the computational method TransIntegrator to construct an integrative transcriptome, the collection of all expressed transcripts, for Bougainvillea by integrating 20 heterogeneous Illumina sequencing datasets. Both computational and experimental methods validated that the integrative transcriptome was well compiled. Based on the integrative transcriptome, we curated the InTransBo database for interactive data retrieval. To the best of our knowledge, the InTransBo database could be the first comprehensive genomic and transcriptional resource for Bougainvillea. Furthermore, we demonstrated a typical application of the integrative transcriptome as an alternative reference in transcriptome assembly, which significantly outperformed the de novo methods, the only applicable solution up-to-date under the circumstance of lacking genome.
It is also acknowledged that the integrative transcriptome has its limitations and is far away from completeness. There may exist some falsely assembled long transcripts due to wrongly bridging fragmented transcripts from different transcriptomes, even though the bridges may be supported by multiple reads. Moreover, the de novo assembly may account for a portion of false transcripts, which can't be properly solved by the TransIntegrator. It is thus desired that long-read sequencing datasets could be introduced and more sequencing datasets of different spatiotemporal samples could be involved to largely improve the quality and completeness of integrative transcriptome.
Nevertheless, the integrative transcriptome could mitigate the genome constraint on systematic investigation of Bougainvillea by serving as a reliable surrogate to the genome. It will help elucidate the molecular basis of Bougainvillea and facilitate precise flower regulation in horticultural practices. The same strategy could be readily applied toward the systematic exploration of other plant species lacking adequate genomic data. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.