Selection and Validation of Reference Genes for Gene Expression Studies in Codonopsis pilosula Based on Transcriptome Sequence Data

Relative gene expression analyses by RT-qPCR (reverse transcription-quantitative PCR) are highly dependent on the reference genes in normalizing the expression data of target genes. Therefore, inappropriate endogenous control genes will lead to inaccurate target gene expression profiles, and the selection and validation of suitable internal reference genes becomes essential. In this study, we retrieved the commonly used reference genes in transcriptome datasets of Codonopsis pilosula by RNA-Seq (unpublished data), and selected 15 candidate reference genes according to the coefficient of variation (CV) and fold change (FC) value of gene expression. The expression levels of candidate reference genes, which is at different growth stages, undergoing cold stress and drought stress, was determined by RT-qPCR. The expression stability of these genes was evaluated using software packages and algorithms including ΔCt, geNorm, NormFinder and Bestkeeper. Then appropriate reference genes were screened and validated by target gene-UDGPase (UDP glucose pyrophosphorylase). The optimal RGs combinations of C. pilosula, including PP2A59γ, CPY20-1, UBCE32, RPL5B and UBC18 for developmental stage, RPL5B, RPL13 and PP2A59γ for cold treatment, RPL13 and PP2A59γ for drought treatment, were found and proposed as reference genes for future work. This paper laid foundations for both the selection of reference genes and exploration in metabolic mechanism of C. pilosula.


Results
Selection of candidate reference genes based on transcriptome data in C. pilosula. Root of C. pilosula at the initial and late stages of its swelling were collected, and three tissue layers including periderm, phloem and xylem of the roots were separated as samples for transcriptome sequencing. Six replicates for each tissue. The samples were sent to Lianchuan biological co. LTD to perform the transcriptome sequencing by using paired-end sequencing technology on an Illumina Hi-Seq ™ 4000 platform. After assembling and annotation, expression profile data was mapped to the transcriptome. TPM (Transcripts per kilobase of exon model per million mapped reads) was calculated by means of EdgeR package (The RSEM package), which was then used to analyze gene expression level.
According to literatures on the conventional RGs, the related genes including ACT, GAPDH, G6PDH, EF, 18 S rRNA, UBQ, TUB, EIF, UBC, H3, PGK, TEF, CYP and NADHD were retrieved from the transcriptomic data of C. pilosula. As is reported, low-expression genes would make poor RT-qPCR references due to the difficulties in detecting and quantifying their expression, thus, genes with expression levels lower than five TPM were excluded before the stability analyses. In addition, genes with log fold variance more than two in different tissues were also rejected. After the removals, a total of 207 genes were selected for further study. Calculations for mean value (MV), SD, CV and log 2 FC were executed in Microsoft Excel according to TPM 27 . Genes with low CV and log 2 FC were considered stable. We set 0.5 as the CV cut-off value for stable genes and ranked the selected 207 genes. The top 50 genes were displayed in Table S1. After reducing the cut-off values of CV to 0.3 and |log 2 FC| to 0.2 according to the literatures 11,[42][43][44] , fifteen genes were selected as candidate RGs. Expression profile of candidate reference genes of C. pilosula. Primers of 15 candidate RGs were used to amplify cDNA template by PCR technology, and single stripe of target amplicon could be obtained, which was consistent with the expected target stripe size. The melting curves of 15 candidate genes' primers were plotted, all of which showed single peak, indicating that the primer had strong specificity and non-specific amplification occurred. When the length of RT-qPCR amplification product increased from 103 bp to 191 bp, the amplification efficiency of candidate RGs improved from 91.3% to 107.2% (Table 1).
The expression level of candidate RGs was normally determined by the threshold cycle (Ct) values of RT-qPCR. Small Ct value indicated higher gene expression abundance, and vice versa. Figure 1 displayed the Ct values of RGs and their variability. The genes with lower variability were UPL-RHF2A, UBCE32, UBC18, ABCC8, and RPL5B according to the distribution of raw Ct values, thus could be selected as the RGs. The commonly used ACTβ, EF1α, and TUBα3 were the worst candidate RGs because the Ct values spanned multiple units according to the scope of the interquartile. However, analyses on the distribution of the raw Ct values and the range of the interquartile are not sufficient to assess the expression stability of candidate RGs, and other methods are required for more comprehensive and accurate evaluations. expression stability analysis of candidate reference genes of C. pilosula. The Fig. 2. The optimal RGs were found to vary with the experimental conditions. CPY20-1 and UBC18 were the most credible RGs for samples collected at different developmental stages. RPL5B and RPL13 were the best RGs for samples experiencing cold stress treatment. In drought stress treatment, CPY20-1 and RPL13 were the most suitable RGs. CPY20-1 and RPL5B were the most stable RGs for all the samples.
To determine the optimal number of RGs for normalization of gene expression level, the pairwise variation (V n/n+1 ) of two sequential normalization factors (NFs) was computed using geNorm program. If the pairwise variation between genes is larger than or equal to 0.15, one gene should be added for reliable normalization of the pairwise variation. Once the pairwise variation is below 0.15, no additional genes are required for normalization 42 . www.nature.com/scientificreports www.nature.com/scientificreports/ Analyzation on the pairwise variation of candidate RGs was displayed in Fig. 3. For both the total samples and samples developed at different stages, V5/6 was below 0.15, accordingly, five RGs was the optimal number. For samples subjected to cold stress treatment, V3/4 was lower than 0.15, thus 3 RGs are required, while for samples subjected to drought stress treatment, 2 RGs are needed. Results of pairwise variation was consistent with that of the stability measurements. RGs of samples subjected to cold stress or drought stress treatment demonstrated lower average pairwise variation and higher stability. In contrast, RGs of both the total samples and samples at various growth stage behaved unstable and showed higher pairwise variation. NormFinder analysis. NormFinder software was also applied to analyze the pairwise variation of candidate RGs. Expression stability values of genes were shown in Table 3. Genes with lower SV was considered to be stable, www.nature.com/scientificreports www.nature.com/scientificreports/ and suitable for RGs. Stable genes in samples at different developmental stages were PP2A59γ, UBCE32 and RPL5B. RPL13 and RPL5B were the most suitable RGs in cold stress treatment. PP2A59γ was the most stable RG in drought stress treatment. PP2A59γ, RPL13, RPL5B, and UBCE32 were the most suitable RGs for total samples. However, ACTβ was the most unstable gene in all of the samples, which agreed with the results from geNorm analysis.

BestKeeper analysis. BestKeeper analysis determined stable RG according to the Ct values of each gene.
Genes with high coefficient of determination (r), low SD and CV are considered stable. According to SD, CV, and r of Ct value, the expression stability of candidate RGs was ranked corresponding to the different developmental stages, cold and drought stress treatment, as well as the total samples. Then the geomean value of each RG was calculated and used to rank the fifteen RGs (Table 4). The results indicated that the optimal RG was ABCC8 for developmental stages, UBC18 for cold treatment, and RPL13 for both drought treatment and the total samples. The most unstable RG was ACTβ except for developmental stage, in which the most unreliable RG was UPL-RHF2A.
Comprehensive stability analysis of reference genes. Based on the ranking results of expression stability by means of the four methods (ΔCt, geNorm, NormFinder, and BestKeeper), the geomean value of each RG was calculated and applied to determine the comprehensive rank of the fifteen RGs in developmental stage, cold treatment, drought treatment, and the total samples. Results in Table 5 showed that ACTβ was the most unstable RG (single low-ranked RG). The top-ranked single RG is PP2A59γ for developmental stage, RPL5B for cold treatment, and RPL13 for both drought treatment and the total samples. According to the number of RGs suggested by geNorm and the ranking listed in Table 5, the best combination of RGs was PP2A59γ, CPY20-1, UBCE32, RPL5B, and UBC18 for developmental stage, RPL5B, RPL13, and PP2A59γ for cold treatment, RPL13 and PP2A59γ for drought treatment, and RPL13, UBCE32, RPL5B, CPY20-1 and PP2A59γ for the total samples.
Validation on the stability of reference genes. The validity of RG or RGs, denoted as relative quality in Figs. 4 to 6, was verified by normalizing the target genes UDGPase (UDP glucose pyrophosphorylase) with the selected single RG, the combination of RGs and ACTβ. For the developmental stage of C. pilosula, when PP2A59γ alone was used for target gene expression, relative expression of UDGPase in leaves increased firstly and then decreased, and the gene UDGPase was found to be overexpressed nearly 29 times on the 15 th day (Fig. 4A). Relative expression of UDGPase in both the stems and roots decreased and then increased (Fig. 4B,C), reaching the highest on the 30 th day for the stems. In comparison, when normalization was performed using the ACTβ gene, serious deviation produced in all tissues (leaf, stem and root), and the gene UDGPase was found to be overexpressed nearly 58 times on the 15 th day (Fig. 4G-I). Using the combination of RGs including PP2A59γ, CPY20-1, UBCE32, RPL5B and UBC18, the expression characteristics of UDGPase in leaves, stems and roots was basically similar to that using PP2A59γ alone. However, the relative expression levels of UDGPase at different time points in different tissues were slightly lower than those using the most stable single RG (Fig. 4D-F).
For the cold treatment, as shown in Fig. 5, the variation characteristics of relative expression of UDGPase are similar in three organs whether using the most stable RPL5B or combination of RPL5B, RPL13 and PP2A59γ. The relative expression levels of UDGPase using single stable RG were slightly higher than those using combination of RGs except for the control. Application of unstable ACTβ gene for normalizing target gene in root leaded to larger deviations. In the drought treatment, Both the most stable RPL13 and combination of RPL13 and PP2A59γ were the effective RGs for the relative expression of UDGPase of the three tissues. In contrast, using the most unstable   www.nature.com/scientificreports www.nature.com/scientificreports/ ACTβ gene for normalization, although the expression of UDGPase in the roots was similar to that of the most stable RG or RGs, the expression bias of UDGPase appeared in the leaves and stems (Fig. 6).

Discussion
Although C. pilosula is a traditional Chinese medicine, insufficient researches have been conducted on the internal reference genes for gene expression normalization in biosynthesis of its main components. As a result, the synthesis mechanism is not clear by now. Fortunately, Gao, J. P. et al. initiated the study of the internal reference gene of C. pilosula 8 . In their work, GAPDH was selected as RG to study the expression characteristics of UDGPase gene in polysaccharide synthesis. In our study, more stable RG and RGs combinations than GAPDH were found. Based on RT-qPCR study of C. pilosula, systematic selection and validation analysis of reliable RGs for gene  www.nature.com/scientificreports www.nature.com/scientificreports/ expression normalization were carried out. More importantly, candidate RGs, used in the selection of stable RGs, were retrieved from the transcriptome data of C. pilosula, and then screened according to the CV, log2FC values of the expression level of each gene in all samples.
RT-qPCR was used to determine the expression of the candidate RGs under the conditions of cold stress, drought stress and developmental stages, and finally the stable RGs were selected. This study confirmed that RNA-seq technology was a feasible and reliable way to select RGs for C. pilosula. To obtain suitable RG, four calculation methods including ΔCt 38 , geNorm 39 , NormFinder 40 and BestKeeper 41 were applied in analyzing the stability of candidate RGs. The rankings derived from the four methods were slightly different due to the changed algorithms 45 . Taking full consideration of the ranks obtained from the four methods, the geometric means of the four rankings were calculated and used to get an integrated ranking. Accordingly, the suitable RG or the combination of RGs were selected. Surprisingly, some RGs commonly used in other plants, such as ACTβ, EF1α and TUBα3, are considered as unstable RGs in our study 9,11,12,19,31 , which were consistent with the analysis results of Ct values and their range of interquartile. This indicated that genes with too high or too low expression abundance were not suitable for RGs, as well as the genes with a wide interquartile range of Ct values.
Some candidate RGs that ranked at the top of the RNA-Seq analysis were not always stable in RT-qPCR test. This study also showed that the single optimal RG is PP2A59γ for developmental stage, RPL5B for cold treatment, and RPL13 for drought treatment. Single RG can be used for gene expression analysis, as is used in most studies. However, disturbing factors such as extremely cold and drought climates during the period of growth probably   www.nature.com/scientificreports www.nature.com/scientificreports/ leaded to deviation in gene expression by using single RG. Therefore, two or more RGs are recommended for reliable gene expression 43 . Suitable combinations of RGs were obtained as follows. PP2A59γ, CPY20-1, UBCE32, RPL5B, and UBC18 were the optimal combination RGs for developmental stage, RPL5B, RPL13, and PP2A59γ for cold treatment, and the combination of RPL13 and PP2A59γ for drought treatment. The combination of RGs could reduce the influencing of experimental factors on the gene expression, accordingly, it was more accurate   www.nature.com/scientificreports www.nature.com/scientificreports/ and reliable than single RG in the normalization of gene expression 9,25,45 . Comprehensive analysis on the selected RGs recommended five RGs combination for gene expression of C. pilosula, which are RPL13, UBCE32, RPL5B, CPY20-1, and PP2A59.

Materials and Methods
Plant material. One-year-old seedlings of C. pilosula were collected as test plant from Tanchang, Gansu province, China on April 5 th 2018. The seedlings were cultivated in sandy soil at a temperature of 25 ± 2 °C, relative humidity of 55-60%, irradiance of 300 μmol m −2 s −1 and 14 h photoperiod (6:00-20:00 h). After 30 days cultivation, test plants were selected from the uniformly grown seedlings with the length of vines between 30 and 50 cm. These plants were subjected to three treatments, which are the normal watering and fertilization for developmental stages study, cold-stress treatment and drought-stress treatment. For developmental stages study, samples of leaf, stem and root were collected after 30, 45 and 60 days. Cold-stress treatment were conducted in incubator under 4 ± 2 °C for 0, 4, 8 and 12 h, respectively. The leaf, stem and root were collected correspondingly from the treated and untreated plants. Drought-stress treatment were carried out by controlling the relative water content of soil at 50 ± 5%, with the normally irrigated soil as controls. The leaf, stem and root were collected after 5 and 10 days from the drought-stress and normally irrigated plants. Three plants were sampled for each replicate and three replicates were required for each treatment. All samples were immediately frozen in liquid nitrogen, and kept frozen at −80 °C for analyses. RNA isolation and cDNA synthesis. Total RNA was extracted from each sample using the RNAprepPure Plant Kit DP441 (Tiangen Biotech CO., LTD, Beijing, China) according to the manufacturer's instructions, followed by DNase I (Tiangen Biotech CO., LTD, Beijing, China) treatment to eliminate DNA contamination. The integrity of RNA was determined by 1.5% agarose gel electrophoresis. The purity and concentration of total RNA was determined using NanoDrop 2000 Spectrophotometer (Thermo Scientific, Waltham, MA, US). RNA samples with a concentration higher than 60 ng/μL and a ratio of A260/A280 between 1.8 and 2.0 were required for cDNA preparation. Synthesis of cDNA was conducted in the PrimerScript ™ RT cDNA Synthesis Kit (TaKaRa Bio Inc., Dalian, China) using 1.0 μg RNA solution. The obtained cDNA was then diluted with 10 times nuclease-free water to prepare RT-qPCR. Primer design and RT-qPCR. Primers were designed using Primer5 software based on the sequences retrieved from the RNA-Seq dataset of C. pilosula. The criteria for primer design were as follows: primer lengths in 20-24 bp, GC content of 45-55%, melting temperature at 58 °C, and amplicon lengths in 100-250 bp. Primer information for the candidate RGs was listed in Table 2. Reactions were conducted using TB Green Premix Ex Taq II kit (Takara) and amplified on QuantStudio TM 6 Flex Real-Time PCR system (ABI Technologies) in a final volume of 10 μL according to the manufacturer's instructions. Melting curve analyses were done following the amplification cycles in order to examine the specificity of the reactions. Amplification efficiencies were calculated using the QuantStudio TM Real-Time PCR software (ver 1.1). The PCR efficiency of each primer pair was determined by the slope of the standard curve according to the equation E = [10(−1/slope) − 1] × 100% 47 . Efficiency values should be between 90% and 110%. Data analysis. The four most commonly used methods including ΔCt 38 , geNorm 39 , NormFinder 40 and BestKeeper 41 , were adopt to calculate and identify the expression stability of candidate reference genes. The ΔCt method displays the pairwise comparisons by calculating the standard deviation (SD) of each pair candidate RGs and the average SD of each gene. The lower the average SD is, the more stable the RG performs. The geNorm algorithm calculates the expression stability value (M value) and pairwise variation (Vn/n + 1) for all candidate genes. Lower M values suggests higher stability of gene expression. The Vn/n + 1 value determines the optimal