The pseudogene problem and RT-qPCR data normalization; SYMPK: a suitable reference gene for papillary thyroid carcinoma

In RT-qPCR, accuracy requires multiple levels of standardization, but results could be obfuscated by human errors and technical limitations. Data normalization against suitable reference genes is critical, yet their observed expression can be confounded by pseudogenes. Eight reference genes were selected based on literature review and analysis of papillary thyroid carcinoma (PTC) microarray data. RNA extraction and cDNA synthesis were followed by RT-qPCR amplification in triplicate with exon-junction or intron-spanning primers. Several statistical analyses were applied using Microsoft Excel, NormFinder, and BestKeeper. In normal tissues, the least correlation of variation (CqCV%) and the lowest maximum fold change (MFC) were respectively recorded for PYCR1 and SYMPK. In PTC tissues, SYMPK had the lowest CqCV% (5.16%) and MFC (1.17). According to NormFinder, the best reference combination was SYMPK and ACTB (stability value = 0.209). BestKeeper suggested SYMPK as the best reference in both normal (r = 0.969) and PTC tissues (r = 0.958). SYMPK is suggested as the best reference gene for overcoming the pseudogene problem in RT-qPCR data normalization, with a stability value of 0.319.

. The information of the primers was listed. A total of eight reference genes were selected and the detailed information of corresponding primers was listed. RefSeq accession numbers of the selected reference genes, forward (F) and reverse (R) primer sequences, exact primer attachment sites at their target genes, total length of the involved exons and spanned introns, PCR product length and PCR product Tm, are listed. Intron spanning primers: primer pairs attaching within two consequential exons. Exon junction primer: primer with spanning attachment to consequential exons. GAPDH glyceraldehyde-3-phosphate dehydrogenase, TBP TATA-box binding protein, SYMPK symplekin, PYCR1 pyrroline-5-carboxylate reductase 1, ACTB actin beta, HPRT1 hypoxanthine phosphoribosyltransferase 1, B2M beta-2-microglobulin, GUSB glucuronidase beta.   www.nature.com/scientificreports/ Identification of the most stable reference genes using NormFinder. NormFinder intragroup analysis was initially used to rank the candidate reference genes from most to the least stable (Table 4). In normal tissues, HPRT1 was the most stable reference gene, followed by SYMPK, GUSB, ACTB, B2M, TBP, PYCR1, and finally GAPDH (black bar, Fig. 2). In PTC tissues, GUSB was the most stable reference gene, followed by SYMPK, ACTB, PYCR1, HPRT1, B2M, TBP, and finally GAPDH (gray bars, Fig. 2). Thus, NormFinder considered HPRT1 and GUSB as the best reference genes for normal and PTC tissues respectively, while SYMPK was the second most stable reference gene in both tissue types. Next, intergroup analysis was performed, checking reference gene performance in PTC tissues against normal tissues. In this analysis, NormFinder declared as the single best reference genes GUSB, SYMPK, and HPRT1 (dotted gray, Fig. 2), and the best combination of reference genes as SYMPK and ACTB, with a stability value of 0.209 (Table 4).
Identification of the most stable reference genes using BestKeeper. According to BestKeeper intragroup analysis (Table 5), the most stable gene in normal tissues was ACTB, with r = 0.974. On the other hand, the most stable genes in PTC tissues were ACTB and GUSB, which tied with r = 0.966. In normal tissues, the second and third most stable genes were SYMPK and HPRT1, with r = 0.969 and r = 0.966, respectively. The same ranking was observed in PTC tissues, with SYMPK (r = 0.958) and HPRT1 (r = 0.931) having the second and third highest similarity to the BestKeeper index. Finally, PTC and normal intergroup analysis using Best-Keeper ranked the candidate genes from most to least stable as follows: ACTB, SYMPK, HPRT1, GUSB, GAPDH, B2M, PYCR1, and TBP (Fig. 3).

Discussion
When performing qPCR normalization, a reference gene could be considered reliable if it is expressed consistently throughout different pathological statuses 5 . While truly constant expression might only be seen in fairy tales, a lack of confounding influence such as from pseudogenes can be achieved 5 . Omitting pseudogene signal could bring more stability to the observed expression of reference genes, hence our current approach of finding and validating genes with minimal expression variation and that lack pseudogenes 19,23 .
Different statistical parameters could affect the interpretation of reference gene stability. The first factor that came to our attention was CqCV%, as it is more comprehensive than mean and SD. It must be mentioned that CqCV% or distribution frequency is a statistical factor that measures the dispersion of the data and the repeatability and precision of the experiment 24 . In this study, SYMPK and PYCR1 showed the lowest CqCV% in both tissues, meaning that these two genes had the lowest dispersion around the mean value. MFC is another basic statistical factor that indicates the dispersion, the variation, and the difference between minimum and maximum Cq. In the same manner, SYMPK and PYCR1 showed the lowest MFC in both tissues.
According to the NormFinder algorithm, the best single reference gene was GUSB (stability value = 0.301); however, ACTB and SYMPK were suggested as the best combination, with a stability value of 0.209. Interestingly, despite fundamental differences in the analytical software, the BestKeeper algorithm suggested exactly the same two genes as the most stable. This aligns well with the basic statistical analysis, in which SYMPK also  www.nature.com/scientificreports/ Table 4. NormFinder analysis was presented. NormFinder analysis was used to rank the candidate reference genes. NormFinder considered HPRT1 and GUSB as the best reference genes for normal and PTC tissues respectively, while SYMPK was the second most stable reference gene in both tissue types. Intergroup analysis indicated SYMPK and ACTB, with stability value of 0.209, as the best combination of reference genes. a GUSB, SYMPK, and HPRT1 were the best according to the Normfinder algorithm. www.nature.com/scientificreports/  www.nature.com/scientificreports/ had the best MFC value and CqCV%, and thus the aforementioned predictions from microarray data were in accordance with our labwork. To date, very few studies have investigated reference genes in thyroid cell lines and human tissues. A previous study of six candidate reference genes (ACTB, B2M, HPRT1, GAPDH, SDHA, and YWHAZ) in seven goiter and seven normal tissues met with failure when intragroup analysis was done 25 . However, they declared SDHA and ACTB as the most stable reference genes based on NormFinder intergroup analysis 25 . In a same manner, this study found ACTB alone was not a well-qualified reference gene for intragroup analysis, but SYMPK/ACTB in combination had advantages in terms of basic statistics. Unfortunately, ACTB suffers from considerable pseudogene bias, and thus its reliability as a reference gene is dubious. Another study of 45 tissue samples from Iranian patients (15 PTC, 15 paired normal and 15 goiter tissues) checked twelve genes (GAPDH, ACTB, HPRT1, TBP, B2M, PPIA, 18SrRNA, HMBS, GUSB, PGK1, RPLP0, and PGM1) 26 . They declared consistency between the different algorithms of NormFinder, BestKeeper, and GeNorm because all three suggested GUSB and HPRT1 as the most stably expressed genes in all thyroid tumours. In line with this study, we found GUSB to be a stable reference gene, but GUSB also has numerous pseudogenes that can obfuscate RT-qPCR data normalization. HPRT1 was not considered a suitable reference gene in our study as it was only the third most stable reference gene according to both NormFinder and BestKeeper.
Considering the limitations described above, the ideal combination of SYMPK/ACTB, despite its stability value of 0.209, could not in fact be appropriate for RT-qPCR normalization. Pseudogene contamination of ACTB and also GUSB put these two genes out of the running 19,[27][28][29] . As pseudogene can cause falsely lower Cq values due to undesirable mRNA amplification, therefore ACTB cannot be helpful beside SYMPK. To avoid the problem of pseudogenes, the next most stable gene-SYMPK, with a stability value of 0.319-is the most suitable reference.
According to our knowledge, SYMPK was suggested as the second suitable reference gene only in one previous study; albeit after CCSER2 gene 30 . CCSER2\SYMPK inauguration was launched after transcriptome and microarray data analysis in breast cancer cell lines and tissues. Unfortunately we found CCSER2 with a pseudogene (LOC100127962) sufficiently enough to be ruled out of this study. As a final point, SYMPK with the lack of pseudogenes and being presented only as a single isoform within cells showed mastery over the other competitor and it was selected in this study.

Conclusion
Synchronous results of BestKeeper and NormFinder suggested the combination of SYMPK and ACTB for RT-qPCR data normalization, but confounding from ACTB pseudogenes takes it out of the running and leaves SYMPK as the champion.

Subjects and methods
Reference gene selection. A total of eight reference genes were selected, with six based on literature review: SYMPK 30 , Beta-2-microglobulin (B2M) 25 , GAPDH 26,31,32 , TATA-box binding protein (TBP) 26,33 , ACTB 26,34 , and Hypoxanthine phosphoribosyltransferase 1 (HPRT1) 26,34 . The other two candidate reference genes were Pyrroline-5-carboxylate reductase 1 (PYCR1), which promotes cell proliferation in different human neoplasms but surprisingly does not fluctuate in PTC tumors (unpublished data), and β-glucuronidase (GUSB), selected based on a recently published study 26 . Reference gene validation using microarray data. Our selected reference genes were validated using the GSE3678 microarray dataset, which contains seven PTC samples and seven adjacent normal tissues. The dataset was checked for log scale and the quality of the data was assessed. Quantile normalization of the dataset was checked and principal component analysis (PCA) was performed ( Supplementary Fig. 1). Finally, a list of genes with minimal variation was extracted and the presence of our selected reference genes in that list was confirmed. The microarray dataset was analyzed with an R program (script attached).
Exon-junction primer design. Exon-junction primer design was undertaken in order to eliminate pseudoamplification of genomic DNA and\or heterogeneous nuclear RNA 35 . Seven pairs of exon-junction primers were designed using Beacon Designer 8.1 (Premier Biosoft International, Palo Alto, CA, USA). To do this, the exact positions of introns were extracted from the Ensemble Genome Browser, and the exact nucleotide sequences of exons were imported into Beacon Designer 8.1 separately for each of our seven mRNAs. At the final step, the software was asked to design primers that span selected exon(s) in order to specifically amplify mature mRNAs of the desired genes. Primers that did not pass the exon-spanning requirement were rejected and not included in the study. For the eighth candidate gene, GUSB, primer sequences were taken directly from a recently published study 26 . Secondary structures of the primers were re-checked using Oligo7, and to insure the selective amplification of mature mRNAs, specificity was confirmed using NCBI-primer BLAST. Melting temperatures for all pairs of primers were validated using gradient PCR (Sinaclon Bioscience, Tehran, Iran). The complete details of the designed primers, including exon-junction information, are listed in Table 1 www.nature.com/scientificreports/ cal and histological examinations, of the 20 thyroid tissue samples, ten were PTC and ten were adjacent normal thyroid tissues, each with the approximate size of 0.5 cm. Histopathological examinations were done by either the operating hospital or a third-party laboratory. The staging of the tumors was determined by specialized pathologists according to the 7 th edition of the American Joint Committee on Cancer Tumor-Node-Metastasis (TNM) staging system. However, because of the restricted number of other thyroid neoplasm tissues in Iran, the data presented here are restricted to PTC tissues. We had only one anaplastic thyroid cancer sample, which was not included in the study. Patients' information and tumour pathological characteristics were presented in Table 2.
RNA extraction and quantification. Total RNA was extracted from RNAlater-treated tissues using a one-step RNA extraction reagent (BIOBASIC, Canada), according to the manufacturer's instructions. The concentration of the isolated RNA was measured using a NanoDrop oneC spectrophotometer (ThermoScientific, Waltham, MA, USA), and quality was determined using the A260/A280 and A260/A230 ratios. The integrity of extracted RNA was confirmed by 1.0% agarose gel electrophoresis (Invitrogen).
Complementary DNA (cDNA) synthesis. DNase I (ThermoScientific, Germany) was applied to eliminate residual genomic DNA according to the manufacturer's instruction. The ThermoScientific RevertAid Reverse Transcriptase kit was used to reverse transcribe 1 μg of total RNA according to the manufacturer's instruction. The reaction was done in a total volume of 20 μL.
Reverse transcription quantitative polymerase chain reaction (RT-qPCR). RT Melt curve analysis. The specificity of the RT-qPCR was evaluated using melt curve analysis through the gradual increase of temperature with a transition rate of 1 °C (from 55 to 95 °C). After each temperature increase, a plate reading was performed on the green channel. The derivative of fluorescence change over temperature (y axis) was plotted against the temperature (°C, x axis).

Statistical analysis.
Microsoft Excel 2013 was used to calculate the mean C q , standard deviation (SD), correlation of variation (C q CV%, C q CV% = SD/mean × 100%), minimum C q , maximum C q , and maximum fold change (MFC) ( Table 1). MFC was calculated by dividing the maximum Cq by minimum Cq; this value is an estimate that represents the distribution of the Cq, with a lower MFC value indicating a narrower distribution of a reference gene Cq 25 . NormFinder 36 and BestKeeper 24 were used to validate the appropriateness of reference genes. NormFinder ranks reference genes based on their stability value, with lower stability value indicating a more stable reference gene 36 . BestKeeper creates its reference gene index using pair-wise correlations based on the average C q values, SD, and CV 24 . This algorithm considers the gene with the highest Pearson's correlation coefficient (r) as the most stable reference gene.

Ethical approval. The study was approved by the Iran National Committee for Ethics in Biomedical
Research Review Board and the Ethics Committee of Research, Institute for Endocrine Sciences, Isfahan University of Medical Sciences, Isfahan, Iran (IR.UI.REC.1398.058). All experiments involving participants were performed in accordance with the seventh edition of the Helsinki declaration and the research protocol was first submitted to the Iran National Committee for Ethics in Biomedical Research Review Board and the Ethics Committee of Research, Institute for Endocrine Sciences, Isfahan University of Medical Sciences. Informed consent, including the aims, methods, and sources of funding, institutional affiliations of the researchers, the anticipated benefits and potential risks of the study, and post-study provisions, was obtained for each participant. We also informed each participant of their right to refuse.

Data availability
The dataset analyzed during the current study is available in the NCBI-Gene www.nature.com/scientificreports/