Transcriptome profiling of abiotic responses to heat, cold, salt, and osmotic stress of Capsicum annuum L.

Peppers (Capsicum annuum L.), belonging to the Solanaceae family, are one of the most economically important crops globally. Like other crops, peppers are threatened by diverse environmental conditions due to different pathogens and abiotic stresses. High-quality reference genomes with massive datasets of transcriptomes from various conditions can provide clues to preferred agronomic traits for breeding. However, few global gene expression profiling datasets have been published to examine the environmental stress-resistant mechanisms in peppers. In this study, we report the RNA-seq analyses of peppers treated with heat, cold, salinity, and osmotic stress at six different time points. RNA-seq libraries from 78 RNA samples containing three biological replicates per time point for each of the abiotic stresses and a mock control were constructed. A total of 204.68 Gb of transcriptome data were verified by differentially expressed genes and gene ontology enrichment analysis. Analyses of the transcriptome data in this study will provide useful information for basic studies of various stimuli to facilitate the development of stress-resistant pepper cultivars.

In this study, we present transcriptome analyses of peppers subjected to four major environment stressesheat, cold, drought, and salinity-at the same time points and at the same plant stages. We describe in detail the construction of 78 RNA-seq libraries for heat-, cold-, mannitol-, and NaCl-treated and untreated control samples at 0, 3, 6, 12, 24, and 72 h. A total of 204.68 Gb of transcriptome data were generated using transcriptome analysis pipelines consisting of quality control, quantification, and differential gene expression analyses. A principal components analysis (PCA) test, hierarchical clustering of gene expression data and gene ontology (GO) enrichment analysis were used to infer the quality of the RNA-seq data and the characteristics of samples in each treatment. The extensive transcriptome data obtained will provide valuable information for future studies of crops exposed to abiotic stresses.

Methods
overview of experimental design. The third or fourth leaves were collected from four pepper plants per each biological replicate. Leaves were harvested at 3, 6, 12, 24, and 72 h after treatment. Mock controls were simultaneously collected with each abiotic treatment sample at 0, 3, 6, 12, 24, and 72 h. Marker gene expression for each condition was confirmed for 78 RNA samples by RT-PCR (Fig. 1). Subsequently, RNA-seq libraries were constructed and sequenced. Transcriptome data were used to conduct a quality assessment and aligned to Capsicum annuum cv. CM334 reference genome (v.1.6). The workflows for the abiotic stress treatment and transcriptome data analysis pipeline are presented in Fig. 1. plant materials and treatment. Two weeks after germination, the pepper seedlings were transplanted into a 32-plug tray (6 cm in diameter by 6.5 cm in height) and maintained in a growth room at 24 ± 1 °C with a 16-h light and 8-h dark photoperiod. At the six-true-leaf stage, plants were subjected to a temperature of 10 °C or 40 °C to mimic cold or heat stress, respectively. For salinity stress, plants were treated with 50 mL of a 400 mM NaCl solution; for osmotic stress, the peppers were treated with 50 mL of 400 mM mannitol. For transcriptome profiling, the third or fourth leaves from four plants were harvested per replicate at 0, 3 Fig. 1 Overview of experimental design and analysis pipeline. RNA from pepper leaves subjected to each abiotic stress (heat, cold, salinity, and osmotic stress) and the 0-h sample from the mock control was harvested. Marker gene expression was confirmed for each stress condition, and the values were normalized to C. annuum actin expression and were calculated relative to control group as mean values with standard deviation. The validated RNAs were sequenced by the Illumina HiSeq 2500 system. All RNA-seq reads were preprocessed for a quality assessment. The filtered transcriptome reads were aligned to the CM334 genome, and the expression profile was analyzed.
www.nature.com/scientificdata www.nature.com/scientificdata/ treatment (Fig. 1). Three biological replicates at each time point per condition were collected. The leaf samples were flash-frozen in liquid nitrogen and stored at −80 °C until RNA isolation. RNA extraction, library construction, and sequencing. Total RNA was extracted from pepper leaf samples (100 mg) using the Trizol reagent (Ambion, USA), according to the manufacturer's instructions. To perform RNA quality control, RNA was quantified spectrophotometrically using a NanoDrop 2000 spectrophotometer (Thermo Scientific, USA), and RNA integrity was verified by agarose gel electrophoresis. Marker gene expression for each treatment was confirmed by RT-PCR analysis using primers specific for each marker gene: heat stress (CaWRKY) 12 , cold stress (CaDhn) 13 , salinity stress (CaPR10) 14 , and osmotic stress (CaDhn) 13 (Fig. 1). RT-PCR was performed using a GeneAtlas thermo-cycler G-02 device (Astec, Japan) using rTaq DNA polymerase (Elpis, Korea) as described by the manufacturer. The gene expression level was normalized to the expression of the CaActin gene and was calculated relative to mock control. Values were calculated following three replications with standard deviations (Fig. 1). Five micrograms of each RNA sample were used to generate a strand-specific library containing inserts of approximately 150-200 bp in size, as previously described 15 . In total, 78 cDNA libraries from five treatments (i.e., the four abiotic stresses and a mock control) were constructed for transcriptome analysis (Table 1). For RNA sequencing, 150-nt, paired-end sequencing was conducted using a HiSeq 2500 platform (Illumina, USA) at Macrogen (Korea).  www.nature.com/scientificdata www.nature.com/scientificdata/ Data preprocessing, gene quantification, and Go enrichment analysis. The raw RNA sequences were filtered and trimmed using cutadapt 16 and the NGS QC Toolkit 17 to remove low-quality bases and adapter sequences. After filtering, the trimmed reads were assessed using FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), and quality results were then merged with multiQC 18 using default parameters. The preprocessed reads were aligned to C. annuum v.1.6 reference genome (GenBank Accessions: AYRZ00000000) 19 and the annotation gene model v.2.0 (http://peppergenome.snu.ac.kr) using Hisat v2-2.1.0 software 20 and default parameters. Transcriptome quantification was performed using featureCounts 21    www.nature.com/scientificdata www.nature.com/scientificdata/ counts. Raw read counts were normalized using TMM methods 22 . PCA was performed using previously published code with modification 23 .
We used the edgeR with three package (estimateDisp, glmQLFit, and glmQLFTest) to conduct differential expression analysis 24 . To design a model formula, all the experimental factors such as time points and treatments were combined into one factor. Then, we found genes that respond differently between the treatment and the mock at any time points. Genes with an adjusted p-value < 0.05 and fold change (FC) > |2| were considered differentially expressed genes (DEGs). The DEGs for each point in treatments were collected to union sets of the DE gene across the time points. The top 30 genes of DEGs for each stress were displayed into heatmap using pheatmap package 25 . GO enrichment analysis for DEGs were performed based on the functional annotation of Arabidopsis genome (TAIR10, http://www.arabidopsis.org). The best hit proteins were mapped by the best match of BLASTP with filtering category (query coverage ≥60%, subject coverage ≥60% and identity ≥60%). The GO enrichment analysis for best hit proteins was performed using clusterProfiler 26 in the R package with org.At.tair. db for Arabidopsis annotation package 27 (See the file "Programs and code information.docx" at figshare). The enriched GO terms were obtained by FDR < 0.05 and minGSize = 10. Then, the stress-related GO terms were visualized using ggplot2 28     www.nature.com/scientificdata www.nature.com/scientificdata/

Data Records
The RNA-seq raw data of 78 samples are deposited at the NCBI Sequence Read Archive (SRA) with identifier SRP187794 29 . The gene expression quantification data of all the samples was deposited at Gene Expression Omnibus (GEO) database with identifier 30 . The combined additional files and information generating this study have been uploaded to figshare 27 .

Technical Validation
Quality control. The quality of the RNA-seq data was assessed by investigating the mean quality score per position and per sequence, as well as the GC content and read length distribution using FastQC and multiQC 18 . The assessment plots are shown in Fig. 2. The quality scores of the bases per position were higher than the Phred quality score of 25, and all reads were greater than the quality score of 20. The GC content of all samples was shown as a normal distribution; these data indicate a lack of sequence contamination during the sequencing process. These statistics revealed that the raw reads were of high quality. Alignments of the preprocessed reads had a high mapping rate, which on average 70.14% and 90.38% in the gene model (v.2.0) and reference genome of C. annuum (v.1.6), respectively 27 (see the file "Statistical summary of RNA-seq for each sample.xlsx" at figshare).
Analysis of transcriptome data. To quantify global gene expression patterns for multiple abiotic stresses, the mapped reads were calculated into read counts for the individual pepper genes. The distributions of all samples for normalized read counts were compared and are shown as a boxplot in Fig. 3a. These distributions were similar between the samples. A PCA analysis revealed that the first two PCs explained most of the variance, and samples from each treatment belonged to the same cluster with similar patterns (Fig. 3b). We further investigated the global gene expression profiles by performing analyses of the DEGs related to each abiotic stress and compared the results to those of the mock control 27 (See the file "DEG result for each stress.zip" at figshare). As shown in Fig. 3c, the y-axis depicts the fold change in the log 2 -transformed data, and the x-axis represents the log 2 -transformed average counts per million reads (CPM). Upregulated DEGs are highlighted in red, whereas downregulated DEGs are shown in blue, with an adjusted p-value < 0.05 and FC > |2|. The gene expression patterns for the 30 top-ranked DEGs are shown in a heatmap (Fig. 4a). We identified a total of 12,494 DEGs shared and unique between stress treatments (Fig. 4b). To validate plant responses to each abiotic stress, GO enrichments were analyzed 27 (See the file "GO enrichment analysis result.zip" at figshare) and we represented stress-responsive GO enrichments showing conserved and unique GO terms by comparison of each treatment (Fig. 4c). The distinctive patterns of gene expression and GO enrichments suggest that these data would be useful for comparing changes in gene expression for other abiotic stresses.

code availability
Codes that were used for the RNA-seq data processing are available at figshare 27 . Software and their versions were described in Methods.