Reference gene selection for transcriptional profiling in Cryptocercus punctulatus, an evolutionary link between Isoptera and Blattodea

The subsocial life style and wood-feeding capability of Cryptocercus gives us an evolutionary key to unlock some outstanding questions in biology. With the advent of the Genomics Era, there is an unprecedented opportunity to address the evolution of eusociality and the acquisition of lignocellulases at the genetic level. However, to quantify gene expression, an appropriate normalization strategy is warranted to control for the non-specific variations among samples across different experimental conditions. To search for the internal references, 10 housekeeping genes from a gut transcriptome of a wood-feeding cockroach, Cryptocercus punctulatus, were selected as the candidates for the RT-qPCR analysis. The expression profiles of these candidates, including ACT, EF1α, GAPDH, HSP60, HSP70, αTUB, UBC, RPS18, ATPase and GST, were analyzed using a panel of analytical tools, including geNorm, NormFinder, BestKeeper, and comparative ΔCT method. RefFinder, a comprehensive ranking system integrating all four above-mentioned algorithms, rated ACT as the most stable reference gene for different developmental stages and tissues. Expression analysis of the target genes, Hex-1 and Cell-1, using the most or the least appropriate reference genes and a single or multiple normalizers signified this research. Our finding is the first step toward establishing a standardized RT-qPCR analysis in Cryptocercus.


Goals and objectives.
The overall goal of this study is to screen for internal references for the temporal and spatial gene quantification in a wood-feeding cockroach, C. punctulatus. Our overarching hypothesis is that housekeeping genes represent a rich reservoir for searching the internal references for RT-qPCR analysis. To test this hypothesis, we investigated the expression profiles of ten housekeeping genes and two target genes under the temporal and spatial conditions. The candidates included actin (ACT ), elongation factor-1α (EF1α), glyceraldehyde 3 phosphate dehydrogenase (GAPDH), heat shock protein 60 (HSP60), heat shock protein 70 (HSP70), α-tubulin (αTUB), ubiquitin conjugating enzyme (UBC), ribosomal protein S18 (RPS18), adenosinetriphosphatase (ATPase) and glutathione-S-transferase (GST) from C. punctulatus. Target genes, hexamerin-1 (Hex-1) and β-1,4endoglucanase (Cell-1), play a critical role in caste differentiation and cellulose degradation 24,25 , respectively, and serve as the positive controls. The temporal (developmental stage) and spatial (tissue type) expression profiles of these candidates were evaluated comprehensively by a panel of analytic programs, including geNorm, Normfinder, BestKeeper, and comparative ΔC T method. Ultimately, a specific set of reference genes is recommended by RefFinder, a comprehensive ranking system integrating all four algorisms.
The advent of the next generation sequencing technologies has propelled entomological research into the Genomic Era. As the most primitive extant member of the Blattaria and the sister group of modern termites, Cryptocercus is the only evolutionary intermediate between cockroaches and termites. This evolutionary "missing" link represents the key species to address some major outstanding questions in biology (e.g., the evolution of eusociality). Results from this study will facilitate our efforts to (1) standardize the gene quantifications in C. punctulatus, (2) functionally decipher the newly sequenced and assembled C. punctulatus genome (unpublished data), and (3) decode the genetic basis governing the transition from solitary cockroaches to eusocial termites and the acquisition of symbiotic lignocellulolytic enzymes within woodroach-termite lineage.

Results
Validation of primer sets. The specificity of individual primer sets was evaluated using both gel electrophoresis and melting curve analyses. The banding pattern on 1% agarose gel showed a single band for candidate and target genes individually. Fluorescence data were collected for melting curve analysis, and a single peak was produced by each candidate as well as target gene. Linear regression coefficient for the reproducibility of RT-qPCR (R 2 ) exceeded 0.99 for all the candidate reference genes and target genes, while amplification efficiency (E%) ranged between 94.1 and 109.3% , suggesting a highly specific and efficient primer design (Table S1 and  Table S2).
Optimal cDNA concentration for GAPDH. The correlations between the C t value of GAPDH and a gradient of cDNA concentrations generated from three different tissues were shown in Fig. 1. For reproductive organs, ovary (FR) and testis (MR), there was a positive linear relationship between C t values and cDNA concentrations ranging from 0.1 ng to 1 µg. Similarly, a positive correlation was observed in neuron ganglion (NG) between C t values and cDNA concentrations ranging from 0.01 ng to 1 µg (Fig. 1) Relative gene expressions among different developmental stages and tissues. Throughout different developmental stages, all candidate genes exhibited the highest expression level in adult females, and the lowest expression level in the 1st nymphs ( Fig. 2A; Table S3). The results from different tissues illustrated that all candidate genes showed notably different expression patterns, especially the target genes ( Fig. 2B; Table S4). Hex-1, a negative regulator of worker-soldier caste differentiation, exhibited significantly higher expressions Figure 1. Optimal cDNA concentrations for RT-qPCR analysis. cDNAs synthesized from three different tissues FR (female reproductive organ, ovary), NG (neuron ganglion) and MR (male reproductive organ, testis) were subjected to a tenfold serial dilution before engaging in the subsequent RT-qPCR analysis.  www.nature.com/scientificreports/ in the ovary (FR) and fat body (FB). Cell-1, a highly conserved endogenous endoglucanases, resided predominantly in the salivary gland (SG). These results demonstrated that the expression profile of housekeeping genes, although relatively stable in comparison to target genes, could vary among different developmental stages and tissues, signifying the importance and necessity for the selection of suitable reference genes.

Stability analysis.
Based on the C t values and BoxPlot analysis (SigmaPlot 10.0), the dispersal of expressions in candidate reference genes displayed range, extreme values and outliers (Fig. 3A,B). Among them, the expression profiles of ATPase, RPS18, UBC, and αTUB were relatively stable throughout different developmental stages (Fig. 3A), whereas RPS18, GAPDH, UBC, HSP70, ACT and αTUB were relatively stable across different tissues (Fig. 3B). geNorm calculates M-value (stability value) for each candidate reference gene and genes with a lower M-value (below the threshold value of 1.5) were considered stable. For different developmental stages, αTUB was the most stable candidate with the lowest M value, while ACT was the most stable reference gene among tissues (Table 1). BestKeeper calculates the SD and r value of each reference gene. Genes with a SD value < 1.0 and r value > 0.9 are considered stable. Candidate with the lowest SD and the highest r values was identified as the most stable reference gene. GAPDH was the most stable candidate throughout developmental stages, while RPS18 was the one among different tissues (Table 1). NormFinder calculates gene stability through an ANOVA -based algorithm and genes showing the lowest stability values (below the threshold value of 1) are consider stable. GAPDH and EF1α were the most stable candidates for different developmental stages and tissues, respectively ( Table 1). The comparative ΔC t method also ranks the stability of reference gene through a stability value, in which genes with a lower stability values were considered with a higher level of stability. As a result, ACT and HSP70 were the most stable candidates throughout developmental stages, while ACT was also the most stable reference gene among tissues (Table 1).
Finally, RefFinder provides the most comprehensive ranking by integrating the geomean of stability values derived from all four analytic tools. For developmental stages, the rank of candidates from the most to the least stable was ACT > HSP70 > GAPDH > αTUB > UBC > EF1α > HSP60 > GST > ATPase > RPS18, while, for different tissues, it was ACT > UBC > EF1α > HSP70 > αTUB > RPS18 > GAPDH > GST > ATPase > HSP60 (Fig. 4). www.nature.com/scientificreports/ The optimal number of reference genes. To search for the optimal number of reference genes, geNorm calculates all pairwise variations under each experimental condition (Fig. 5). Based on Vandesompele and colleagues 26 , a Vn/Vn + 1 threshold value of 0.15 suggests that the addition of "N + 1" reference gene is not necessary, i.e., "N" number of references genes is sufficient to normalize qRT-PCR results. For developmental stages, V 2/3 was lower than 0.15, indicating that ACT and HSP70 were sufficient for the accurate normalization (Fig. 5). www.nature.com/scientificreports/ For tissues, however, the first V value less than the threshold was at V 4/5 , suggesting that ACT , UBC, EF1α and HSP70 were the best combination for the precise normalization (Fig. 5).
Validation of selected reference genes with target genes Hex-1 and Cell-1. The expression profiles of Hex-1 and Cell-1, the target genes, were evaluated to validate the recommended reference genes under different biotic conditions. Across different developmental stages, the expression profile of Hex-1 was similar when normalized to the most stable reference gene ACT and the recommended multi-gene normalizer (ACT and HSP70). The expression of Hex-1 was significantly different when it was normalized to the least stable reference gene RPS18 (Fig. 6). Specifically, the expression of Hex-1 was significantly underestimated in the 1st nymphs. Among different tissues, similar expression profiles of Cell-1 were observed when Cell-1 was normalized to the most stable reference gene ACT , the recommended multi-gene normalizer (ACT , UBC, EF1α and HSP70), and the least stable gene HSP60. Although the expression profiles were similar, Cell-1 expressions in both salivary gland and foregut were overestimated, especially when HSP60 was used as the normalizer (Fig. 6).

Discussion
Selection of candidate reference genes. It is unrealistic to find a "universal" normalizer showing constant expression level across all experimental conditions. In this study, expressions of candidate reference genes varied, more or less, among different developmental stages and tissues. Changes in C t values ≥ 1.0 represent ≥ twofold changes in gene expression level, i.e., small variability in C t values could have drastic impact on target gene expression 27 . Consequently, selection and validation of genes exhibiting a relative low variability under specific experimental conditions is a critical step toward accurate gene quantification study.  Additionally, the expression level (C t value) of target and reference genes should be comparable to ensure that all transcripts are subject to the same kinetic interactions during qRT-PCR 26 . Otherwise, the expression of a highly abundant internal reference (e.g., ribosomal proteins with significant lower C t values) can mask the subtle, but potentially biologically relevant, changes in the expression of target genes 29 . Although the number of reference gene selection publications has been steadily increased for the past decade, the average number of reference genes been tested was 9.53 15 . In this study, we selected ten housekeeping genes, which have a track record of being used as the internal controls, as the reference gene candidates. Target genes, hexamerin-1 (Hex-1) and β-1,4endoglucanase (Cell-1) are of primary importance for caste differentiation and cellulose degradation research. The expression levels of target and candidate reference genes were comparable, with C t values ranging between 16 and 25 using cDNAs generated from the whole body of C. punctulatus adults.
Previous studies have demonstrated the significant impacts of tissue/cell types and developmental stages on the stability of reference gene expression, in some case, even greater than treatments [30][31][32][33] . Here, we empirically examined the temporal and spatial stability of these candidate genes, and recommended different sets of reference genes for tissue/cell types and developmental stages, respectively. Stability assessment. Although the underlying algorithms employed by each analytical tool are different, they all focus on the variance in C t values of each reference gene across treatments 34 . In this study, reference genes recommended by the four analytical tools exhibit some discrepancies, albeit share some commonalities. For different developmental stages, GAPDH was rated as the most stable reference gene by both BestKeeper and Normfinder, whereas αTUB and ACT were the top choice by geNorm and comparative ΔC t method. Similarly, GAPDH was the reference gene of choice in a few lepidopterans, including the silkworm Bombyx mori, Chilo suppressalis, the pink stem borer Sesamia inferens, and the oriental leafworm moth Spodoptera litura [35][36][37][38] , and optimal reference gene for profiling of seasonal and labor-specific gene in Western honey bee, Apis mellifera 16 .
For tissues, both geNorm and comparative ΔC t method ranked ACT as the most stable reference gene, while RPS18 and EF1α were, respectively, recommended by BestKeeper and Normfinder. Robledo and colleagues 34 used a set of empirical data evaluated the accuracy of BestKeeper, Normfinder, geNorm, and comparative ΔC t method. Authors suggested that NormFinder, complemented with the descriptive statistics calculated by BestKeeper, offers www.nature.com/scientificreports/ the most reliable recommendation. In this study, NormFinder selected GAPDH and EF1α as the most stable reference genes, respectively, for developmental stages and tissues (Table 1). Indeed, EF1α has been picked as the most stable reference genes across different tissues in many insects, such as bed bug, Cimex lectularius, bumble bee, Bombus lucorum, diamondback moth, Plutella xylostella and oriental armyworm, Mythimna separata 19,[42][43][44] . The commonality and discrepancies displayed here confirm the notion that no universal reference genes exist for all contexts and reference gene selection and validation is crucial for accurate quantification of gene expression under specific experimental conditions. Without these studies, single un-validated endogenous controls can have profound impacts on data analysis and lead to questionable interpretation 16,18,19,45,46 . In this study, the expression of Hex-1 was significantly underestimated in the 1st nymphs when the least stable instead of the most stable and recommended reference genes was used to normalize target gene expression. Similarly, Cell-1 expressions in both salivary gland and foregut were overestimated when we elected the least stable instead of the most stable and recommended reference genes (Fig. 6). This is consistent with other validation studies that compared the use of stable vs unstable reference genes in the estimation of the target gene expression, in which normalization to unstable reference genes led to over-or under-estimated expressions in the target genes 47-49 . Optimal number of reference genes: single vs multiple normalizers. Besides stability, the number of reference genes used for normalization in a specific experiment can impact RT-qPCR analysis as well. Suzuki and colleagues reported that over 90% of the RNA transcription analysis published in peer-reviewed journals used a single housekeeping gene as reference 50 . Housekeeping genes, such as GAPDH, ACT , and RPS18, have been used extensively as the single reference gene without empirical validation, however, many of these reference genes showed substantial variations at expression level under different experimental conditions 17,[51][52][53] . In fact, as the pool expanded, the chance of these "generic" candidates to be the reference gene of choice decreases 34 . Since the introduction of MIQE guidelines in 2009, researchers have grown more receptive to adopt multiple rather than a single reference gene in RT-qPCR analysis. Despite changes in perception, the implementation of these guidelines has been challenging. The average number of reference genes used in peer-reviewed publications between 2010 and 2015 remained 1.23, in which 13% of the studies used more than a single reference gene 34 .
The optimal number of reference genes in a specific study is suggested by geNorm based on the calculation of normalization factors (NFs) in parallel samples. Pairwise variation (V n/n+1 ) is obtained from NF ratios between N and N + 1 reference genes. The minimum V n/n+1 on a U-shape curve composed by all the V n/n+1 represents the most stable NF that can be obtained among all the reference genes in a specific sample set. The number "N" corresponds to the optimal number of reference genes that are needed for the most accurate data normalization 26 . In this study, geNorm showed that all the V values were below the threshold among different developmental stages, with V 3/4 had the lowest pairwise variation value of 0.032. However, we elected to recommend two reference genes instead of three as the optimal number because V 2/3 value of 0.039 was equally low and far more practical and economical. Similarly, although V 6/7 (0.115) predicted the best number of reference genes for different tissues, four was the number of choice for the same set of reasons (V 4/5 = 0.131; Fig. 5).
Interestingly, it seems that more samples involved in the experiment (4 developmental stages vs 11 tissues) demand a higher number of reference genes (2 vs 4) for accurate normalization. A plausible explanation for this phenomenon is that when more samples were added into the analysis, V n/n+1 would be slower to reach the minimum value due to the introduction of more unstable factors. Consequently, there is no fixed number of internal controls for gene expression studies. The optimal number of reference genes for accurate normalization can be influenced by V n/n+1, sample size, and practicality/feasibility. cDNA concentration. The other factor which can impact the accuracy of RT-qPCR analysis is the initial concentration of cDNA template. In RT-qPCR, fluorescence is positively correlated with the amount of amplified product, suggesting the C t value is cDNA concentration-dependent. In this study, the optimal range of cDNA concentration to precisely quantify GAPDH expression was between 0.1 ng and 1 µg for reproductive and neuron tissues. When cDNA was less than 0.1 ng, the expression of tested genes (C t value) did not correlate with the quantity of cDNA template, which meant no changes could be detected. Although 0.1 ng-1 µg is specifically for GAPDH, accurate quantification of gene expression depends on the optimal range of cDNA concentration, i.e., the quality and quantity of cDNA template can directly impact the accuracy of RT-qPCR analysis.

Materials and methods
Ethics statement. Woodroaches were collected from rotting logs on the grounds of Mountain Lake Biological Station, Giles Co., Virginia (latitude 37.364, longitude 80.519). No specific permits were required for the described field studies.
Colony maintenance. The collected woodroaches were maintained at the University of Kentucky in a tengallon glass aquarium under complete darkness and provisioned with brown rotted pine at 20 ± 1 °C with limited humidity. The identity of Cryptocercus species was determined by a combination of morphological traits and a molecular marker, 12S rRNA. Based on the diagnostic nucleic acid sites embedded in the amplified 12S rRNA fragments, collected Cryptocercus were identified as C. punctulatus 54 . Sample preparation. Cryptocercus punctulatus colonies were acclimated in the laboratory for two weeks before they were subjected to the sample preparation. Cryptocercus punctulatus colony typically contains a pair of reproductives (adult male and female) and different-sized nymphs. www.nature.com/scientificreports/ For developmental stages, we collected four 1st nymphs (1st Nym), three 2nd nymphs (2nd Nym) and one adult male (MA) and one adult female (FA) to represent respective developmental stages within a colony. A total of three colonies were used in this experiment, and each colony represented a biological replication.
For different tissues, leg (Leg), antenna (Ant), muscle (Mus), neuron ganglion (NG), salivary gland (SG), foregut (FG), midgut (MG), hindgut (HG), fatbody (FB), ovary (FR), and testis (MR) were individually dissected from C. punctulatus adults. Before dissection, C. punctulatus were surface sterilized in 70% ethanol for 1 min and followed by rinsing in sterile water for 30 s. Cryptocercus punctulatus adults were dissected under a binocular microscope in 10 mM phosphate buffered saline (PBS, pH 7.8), and respective tissues were snap frozen in liquid nitrogen and stored at -80 °C. Dissected individual tissue samples from three same-sex adults were pooled to represent one tissue type in one biological replication. A total of three biological replications were carried out for this experiment.
Total RNA extraction and cDNA synthesis. Cryptocercus punctulatus whole body or dissected tissues was snap frozen in liquid nitrogen, and then ground to powder using a mortar and pestle. To preserve the integrity of RNA, the grinding process was carried out in liquid nitrogen. The resultant ground up powder (≤ 30 mg) was transferred to a 1.5 ml microcentrifuge tube for RNA extraction using a SV Total RNA Isolation Kit (Promega, Madison, WI, USA) according to the manufacturer's instruction. DNA contamination was eliminated by the DNAase treatment for 15 min. Quality and quantity of total RNA was measured using a NanoDrop 2000 spectrophotometer (Thermo Fisher, USA). cDNA was synthesized using the resultant total RNA as the template and M-MLV transcriptase (Grand Island, NY, USA). Samples without reverse transcriptase were used as the negative controls to make sure there was no contamination of DNA.
Primers were designed by Primer3 (SimGene.com) (Supplementary Table S2), synthesized and diluted to a working concentration of 10 µM. RT-qPCR reactions were run in triplicate on a Bio-Rad MyiQ™ Single-Color Real-Time PCR Detection System (BioRad, Hercules, CA). The thermal cycling profile included an initial denaturation step at 95 °C for 5 min, followed by 40 cycles of 95 °C for 15 s, annealing at 53 °C for 45 s, and concluded by an extension step at 72 °C for 30 s. Samples were run on 1% agarose gel, and then run with the dissociation protocol for melting curve analysis to check the specificity of each individual primer sets. In addition, amplification efficiency (E%) and correlation coefficient (R 2 ) were determined based on the standard curves generated from a tenfold serial dilution of cDNAs.
Optimal cDNA concentration for RT-qPCR analysis. cDNAs from ovary (FR), neuron ganglion (NG) and testis (MR), respectively, were quantified using a Smart Spec Plus spectrophotometer (Bio-Rad, Hercules, CA). A tenfold serial dilution was carried out to generate a cDNA concentration gradient ranging from 10 -6 to 10 -17 g. After RT-qPCR, C t (Threshold Cycle, which is the number of cycles required for the fluorescent signal to exceed the threshold line of background level) values of GAPDH transcripts corresponding to a gradient of cDNA concentrations were analyzed, and the optimal range of cDNA concentrations was determined.

Stability analysis.
Relative expression level of the ten candidate reference genes and the two target genes were calculated by 2 −ΔCt method 55 . The relative expression levels of candidate reference genes across different developmental stages and tissues were analyzed using one-way ANOVA with SPSS Statistics 17.0 (SPSS Inc., Chicago, IL, USA). The means were compared by Tukey test, if the data fit homoscendasticity, and Games-Howell test were performed if not. Specifically, throughout different developmental stages, Tukey test was used for EF1α, GAPDH, HSP70, αTUB, UBC, GST and Hex-1, while Games-Howell test was carried out for ACT , HSP60, RPS18, ATPase and Cell-1. Relative expression of all the candidate reference genes across different tissues was analyzed using Games-Howell test. The dispersion of C t values was assessed using a Box Plot.
The expression profiles of the candidate reference genes and target genes under different biotic conditions (developmental stages and tissues) were evaluated individually using a panel of analytic tools, including geNorm 26 , BestKeeper 56 , Normfinder 57 and the comparative ΔC t method 58 . For geNorm, each reference gene is evaluated by calculating the pairwise variation with all other genes to determine the gene-stability value, M 26 . BestKeeper ranks the reference genes based on the standard deviation (SD) of C t value and the repeated pairwise correlation analyses of all the candidate genes 56 . Instead of measuring the overall stability, Normfinder selects reference genes based on the possible intra-and inter-group variation across different samples 57 . The comparative ΔC t method ranks the reference genes by comparing relative expression of "pairs of genes" within each sample, and the stability of the candidates was obtained according to the repeatability of the gene expression differences among different samples 58  www.nature.com/scientificreports/ of the four above mentioned analytical tools to an individual gene and calculates the geometric mean of their weights for the overall ranking. Relative expression of the target genes, Hex-1 and Cell-1, was calculated using ΔΔCt method 60 . Differences in their expression using an array of normalization factors were compared according to one-way ANOVA with Tukey test. www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.