Introduction

Quantitative real-time polymerase chain reaction (qPCR) is a routinely used technique for the detection of specific nucleic acids, RNA expression profiling, quantification of DNA and DNA methylation, and validation of microarray hybridization data1. Reverse transcription qPCR (RT-qPCR) is an accurate, sensitive, and fast method to quantify gene expression from qPCR experiments2, and is widely accepted as the gold standard for the analysis of gene expression1,3. Briefly, RT-qPCR measures the expression of a set of target RNAs through repeated cycles of sequence-specific amplification followed by expression measurements4. The cycle at which the observed expression first exceeds a user-specified threshold is commonly called the threshold cycle (Ct) or quantification cycle. The Ct values of the target RNAs represent a quantitative assessment of gene expression and are often treated as the raw data for subsequent analyses4. Two methods can be used to quantify gene expression from the Ct value: absolute and relative quantification3. In absolute quantification, a standard curve is used as the reference calibrator. In relative quantification, the signal is related to the expression of a user-specified group3. The two approaches therefore differ in the data used as the reference calibrator to which the signal is related.

In many RT-qPCR experiments, not all Ct values can be numerically defined. For example, when the starting RNA abundance is too low, an off-target product is amplified, or no reliable Ct can be determined, the corresponding Ct value cannot be quantified numerically and is flagged as a missing value5. Handling missing data is a crucial step in the analysis of RT-qPCR experiments because the procedures used in the subsequent analyses of these data are based on statistics that are unable to handle both numeric and missing values4. Imputation is an established technique to solve this problem6. Imputation substitutes a missing value with a rationally selected numeric value4. K-nearest neighbors (KNN)6, maximum Ct plus one cycle (Mestdagh)7, and cubic spline interpolation (Cubic)1 are known methods to impute missing values in RT-qPCR data5,6.

Another key step in the analysis of RT-qPCR data is the assessment of true biological changes associated with the phenomenon or disease of interest. In fact, biological changes are often masked by nonspecific technical variability introduced in the data during the experimental procedure6. Data normalization is expected to reduce/eliminate any technical variability without affecting the true biological results6. Global mean8, DeltaCt based on universal normalizers9, Modified global mean10, Quantile9, and Rank Invariant9 are among the most accepted methods used for RT-qPCR data normalization5.

RT-qPCR experiments allow measuring the expression of several transcripts in parallel using high-density plates9. Plates have been used in several explorative studies to find novel biomarkers from the analysis of different diseases, tissues, experimental conditions, and cell types3,5,6. The large number of studies published in the literature stimulated companies to develop commercial technologies to perform RT-qPCR experiments3. For each experiment, these technologies generate textual reports summarizing a number of experimental parameters and data such as feature name, quality control flags, and Ct values. Different technologies generate reports that can be of different format. According to our experience, SDS, EDS, and OpenArray are among the most used file formats for reporting results of RT-qPCR experiments.

Although the computational procedures and technologies for analyzing RT-qPCR data are well established, the heterogeneity of the assays employed in RT-qPCR experiments and the lack of a consensus on the best normalization system and on the missing value imputation approach to adopt make it hard to set up a standardized analysis procedure6. Furthermore, producing high-quality publications and reproducible data remains among the most critical challenges of qPCR experiments11.

Several open-access software packages, tools, and web applications, such as R packages, have been proposed in recent years for the analysis of RT-qPCR data1. HTqPCR is a well-known open source R/Bioconductor package for the high-throughput analysis of RT-qPCR data9. It provides several functions and parameter options for assessing the quality of the experiment, filtering unreliable data, normalizing raw data, finding potential candidate biomarkers, and visualizing RT-qPCR data9. However, R-based analysis suffers from some known limitations. First, analysis procedures are implemented in several packages that lack a unified framework. Second, users with a biological background who want to use the functionalities of R packages need non-trivial coding skills. Furthermore, the lack of a simple framework for reusing, sharing, and communicating experimental procedures and results limits the reproducibility, transparency, and accessibility of R-based analysis12.

Galaxy is an open, collaborative, web-based genomic workbench for reproducible, transparent, and accessible science12. Galaxy has a very active developer community. More than 6746 public tools and workflows are freely available in the Galaxy Tool Shed repositories12. New tools and workflows are easily deployable in the Galaxy repositories. To this end, Galaxy offers fresh installations of R and Python environments, a fast dependency resolver, step-by-step documentation, a simple graphical interface, and GitHub integration13. However, to the best of our knowledge, no Galaxy tool or workflow has been reported to date for analyzing RT-qPCR data.

In the present work, we developed pipette (PIPE-T), a new tool for analyzing RT-qPCR expression data integrating the functionalities implemented in various R packages into one unified, reusable, transparent, accessible, and easy to use Galaxy wrapper.

Methods

Overview of the main procedures implemented by PIPE-T

PIPE-T implements the relative quantification method using the R language and computing environment14.

To start a PIPE-T analysis, users must upload two input files:

  • A List collection of tab-separated text files for all samples generated as report of the RT-qPCR experiment (ListOfFile).

  • A tab-separated text file associating each filename in ListOfFile with a treatment group (FileTreatment).

Five distinct computational procedures are implemented in PIPE-T. Procedures are summarized in Fig. 1 and a detailed description of each procedure is provided in the following sections.

Figure 1

Schematic representation of the analysis procedures implemented in PIPE-T. Input files are uploaded and parsed to initiate the analysis. Transcripts are categorized according to a user-defined range of values and/or a quality control flag to label unreliable Ct values. Ct values are normalized to reduce/remove technical variability in the data. Transcripts are filtered out according to a user-specified maximum number of missing values to keep the bias as low as possible. Imputation is applied to handle missing values. Transcripts discriminating between two treatments are identified for subsequent analyses.

PIPE-T produces the following output files:

  • A tab-separated text file containing the raw Ct values for every sample and transcript

  • A PNG file showing the distribution of the Ct values of every sample obtained after the Ct filtering and categorization step, visualized as a sequence of boxplots.

  • A tab-separated text file containing the normalized Ct values

  • A PNG file showing the cumulative distribution plot before and after data normalization of the coefficient of variation of every transcript.

  • A PNG file showing the distribution of the normalized Ct values, visualized as a sequence of boxplots.

  • A tab-separated text file containing data after imputation

  • A tab-separated text file containing the results of the differential expression analysis.

File uploading and parsing

Heterogeneity of assays quantifying RT-qPCR gene expression is often associated with heterogeneity of the file formats reporting data summarizing the results of the RT-qPCR experiment. Hence, it is crucial that the user uploads files whose content is compliant with the file format parsable by PIPE-T before running any PIPE-T analysis.

“Upload File from your computer” is a Galaxy tool that allows uploading files into Galaxy. This tool is available on any fresh Galaxy instance or on the main Tool Shed repository15.

PIPE-T processes tab-separated text files that use a dot as decimal separator and that have been uploaded with the “Upload File from your computer” tool. The formats supported by PIPE-T are:

  • Applied Biosystems Sequence Detection Systems (SDS)

  • ThermoFisher Experiment Detection Systems (EDS)

  • Applied Biosystems OpenArray (OpenArray)

  • Roche LightCycler (LightCycler)

  • Bio-Rad CFX (CFX)

  • Fluidigm Biomark Table format (BioMark)

  • User-formatted plain text (Plain)

SDS, OpenArray, LightCycler, CFX, BioMark, and Plain are file formats that the HTqPCR R package9 can parse. We updated the parsing procedure to make it work with R 3.5.0 and tab-separated text files. We also extended the list of parsable file formats to include the EDS format, which is one of the most used by Thermo Fisher Scientific real-time qPCR instruments.

FileTreatment should have only two columns, named SampleName and Treatment. The SampleName column lists the name and the extension of each file uploaded into the ListOfFile collection. The Treatment column associates each SampleName with an experimental condition or group of interest. Group specification is necessary because PIPE-T implements the relative quantification method to analyze data from RT-qPCR experiments. PIPE-T admits the specification of two treatment groups. In the GitHub documentation we provide a checklist of recommendations to help users format their input files and check that these files contain sufficient data to run PIPE-T without errors.
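The FileTreatment layout described above can be checked before upload with a few lines of code. The following is a minimal Python sketch, not part of PIPE-T (which is written in R); the function name `read_file_treatment` and the sample file names are hypothetical, and the only rules enforced are the ones stated in the text (two columns named SampleName and Treatment, and exactly two treatment groups).

```python
import csv
import io


def read_file_treatment(text):
    """Parse a tab-separated FileTreatment table and check its layout.

    Hypothetical helper for illustration; it is not a PIPE-T function.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != ["SampleName", "Treatment"]:
        raise ValueError("expected exactly the columns SampleName and Treatment")
    # Map every uploaded file name to its treatment group.
    mapping = {row["SampleName"]: row["Treatment"] for row in reader}
    groups = sorted(set(mapping.values()))
    if len(groups) != 2:
        raise ValueError(f"expected two treatment groups, found {len(groups)}")
    return mapping, groups


# Illustrative content; the file names are made up.
example = (
    "SampleName\tTreatment\n"
    "sample1.txt\tOLIGO\n"
    "sample2.txt\tPOLY\n"
    "sample3.txt\tOLIGO\n"
)
mapping, groups = read_file_treatment(example)
print(groups)
```

A check of this kind only covers the table layout; whether every listed file name is actually present in the ListOfFile collection still has to be verified separately.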

If file format is correct, PIPE-T populates a qPCRset object containing the following data for each transcript and sample:

  • Raw Ct/Cq value,

  • Value of the internal quality control flag,

  • Transcript and sample names,

  • FeatureCategory

Data parsing and qPCRset object generation are carried out using the readCtData function of the HTqPCR R package9.

Ct filtering and categorization

Feature categorization is a procedure for describing the level of reliability of a transcript and can be used to filter out features whose expression is not sufficiently reliable9. The HTqPCR package defines three possible categories: “Undetermined”, “Unreliable”, and “OK”9. “Undetermined” is used to flag Ct values above a user-defined threshold, and “Unreliable” indicates Ct values that are so low as to be estimated by the user to be problematic9.

By default, only Ct values labeled as “undetermined” in the input data files are placed into the “Undetermined” category, and the rest are classified as “OK”9.

The FeatureCategory for a transcript can be altered on the basis of two criteria9:

  • Range of Ct values. Some Ct values might be too high or too low to be considered a reliable measure of gene expression in the sample and, therefore, should not be marked as “OK”.

  • Flags. Depending on the qPCR input, the values might have associated flags, such as “Passed” or “Failed”, which are used for assigning categories.

PIPE-T implements both criteria through a user-defined range of Ct values and a List button. Any Ct value outside the user-defined range is categorized as “Unreliable”. Users can also force PIPE-T to check the status of the internal quality control flag. In this case, the FeatureCategory of a transcript is replaced by “Undetermined” if the transcript did not pass the internal quality control.

PIPE-T uses the FeatureCategory labels to replace any Ct value labeled “Undetermined” or “Unreliable” with a not available value (NA).

These operations are carried out using setCategory and filterCategory functions of HTqPCR package9.
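The categorization and masking rules described in this section can be illustrated with a short Python sketch. It is not the HTqPCR implementation: the function names, the default range of 15 to 35 cycles, the flag value “Failed”, and the precedence of the flag check over the range check are all assumptions made for the example, and `None` stands in for NA.

```python
def categorize(ct, flag, ct_min, ct_max, check_flag=True):
    """Return a FeatureCategory label for one Ct value (illustrative rules)."""
    if ct is None:
        # No numeric Ct was reported in the input file.
        return "Undetermined"
    if check_flag and flag == "Failed":
        # The transcript did not pass the internal quality control.
        return "Undetermined"
    if ct < ct_min or ct > ct_max:
        # The value lies outside the user-defined range.
        return "Unreliable"
    return "OK"


def mask_ct_values(ct_values, flags, ct_min=15.0, ct_max=35.0, check_flag=True):
    """Replace every Ct value that is not labeled "OK" with None (NA)."""
    masked = []
    for ct, flag in zip(ct_values, flags):
        category = categorize(ct, flag, ct_min, ct_max, check_flag)
        masked.append(ct if category == "OK" else None)
    return masked


cts = [20.0, 38.0, None, 25.0]
flags = ["Passed", "Passed", "Passed", "Failed"]
print(mask_ct_values(cts, flags))
```

Only the first value survives in this example: the second is outside the range, the third has no numeric Ct, and the fourth failed the quality control flag.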

Normalization

Data normalization minimizes unwanted systematic technical and experimental variation in the data so that true biological changes can be better appreciated16.

PIPE-T offers six different normalization options that are listed below:

  • Global mean8

  • DeltaCt9

  • Modified global mean10

  • Quantile9

  • Norm Rank Invariant9

  • Scale rank invariant9.

Global mean, quantile, norm rank invariant, and scale rank invariant were already implemented in the HTqPCR R package9. However, as norm rank invariant and scale rank invariant worked only if missing values were absent, we extended the procedure by substituting any missing value with a numeric value using the na.spline function implemented in the zoo R package17. D’haene and colleagues showed the benefits of using the geometric mean for the normalization of microRNA expression data by introducing the so-called modified global mean method10. For this reason, we integrated the modified global mean method in PIPE-T.

PIPE-T supports the deltaCt method. Housekeeping genes can be specified by the user or can be estimated by the geNorm or NormFinder methods implemented in the NormqPCR R/Bioconductor package18. When geNorm is selected, PIPE-T identifies candidate normalizers taking those transcripts whose stability was greater than 1.5 as reported by Vandesompele and colleagues19.

Newly implemented normalization methods have been integrated in PIPE-T as an updated version of the function normalizeCtData of the HTqPCR R package9.
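As an illustration of the deltaCt idea mentioned above, the following Python sketch subtracts the arithmetic mean Ct of a set of normalizers from every Ct value of one sample. This is one common formulation and is assumed here for the example; it is not taken from the PIPE-T or HTqPCR source code. The function name `delta_ct` and the transcript names are hypothetical, and `None` stands in for NA.

```python
from statistics import mean


def delta_ct(sample_ct, normalizers):
    """Subtract the mean Ct of the normalizers from every Ct of one sample.

    sample_ct maps transcript names to Ct values (None for missing values).
    """
    reference_values = [
        sample_ct[name] for name in normalizers if sample_ct.get(name) is not None
    ]
    if not reference_values:
        raise ValueError("no normalizer has a numeric Ct value in this sample")
    reference = mean(reference_values)
    # Missing values stay missing; everything else is shifted by the reference.
    return {
        name: (None if ct is None else ct - reference)
        for name, ct in sample_ct.items()
    }


# Illustrative values; the transcript names are only examples.
sample = {"miR-a": 25.0, "miR-b": None, "U6": 20.0, "RNU48": 22.0}
normalized = delta_ct(sample, ["U6", "RNU48"])
print(normalized["miR-a"])
```

In this example the reference is the mean of 20 and 22 cycles, so the target with a Ct of 25 is reported as 4 cycles above the normalizers.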

Transcript filtering and imputation

High-throughput data often contain missing values. For this reason, handling missing values is a crucial step of any RT-qPCR analysis5,6. The simplest solution would be to exclude from the analysis any transcript with at least one missing value. In such a case, missing values no longer represent a problem because they are removed from the analysis. However, this approach could filter out a considerable number of potentially useful transcripts. Another solution would be to keep every transcript regardless of the number of missing values. In such a case, all potentially useful transcripts are taken into account for subsequent analysis, but the probability of making an error increases with the number of missing values6. A widely accepted approach in the literature consists in keeping transcripts with a reasonable number of missing values and filtering out those exceeding this threshold6. Transcripts that do not exceed the threshold are imputed using a suitable method. Several imputation methods have been proposed in the literature20.

PIPE-T offers a slider that the user can move to specify the maximum percentage of missing values admissible for a specific transcript. PIPE-T allows filtering transcripts using a user-defined percentage of missing values and/or a user-defined list of transcripts to be removed by using the filterCtData function of the HTqPCR package9.
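The percentage-based filter can be sketched in a few lines of Python. This is an illustration rather than the filterCtData implementation: the function name `filter_by_missing` is hypothetical, `None` stands in for NA, and treating the threshold as inclusive (a transcript at exactly the maximum percentage is kept) is an assumption.

```python
def filter_by_missing(ct_table, max_missing_pct):
    """Keep transcripts whose share of missing values is within the limit.

    ct_table maps transcript names to lists of Ct values, one per sample.
    """
    kept = {}
    for transcript, values in ct_table.items():
        missing = sum(value is None for value in values)
        missing_pct = 100.0 * missing / len(values)
        if missing_pct <= max_missing_pct:
            kept[transcript] = values
    return kept


# Illustrative table with four samples per transcript.
table = {
    "transcript_a": [20.0, 21.0, None, 22.0],   # 25% missing
    "transcript_b": [None, None, None, 30.0],   # 75% missing
    "transcript_c": [25.0, 26.0, 27.0, 28.0],   # 0% missing
}
filtered = filter_by_missing(table, max_missing_pct=25.0)
print(sorted(filtered))
```

Setting the limit to 0 reproduces the strictest option discussed above (drop any transcript with a missing value), while 100 keeps every transcript.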

In addition, PIPE-T gives the possibility of selecting one of three well-known imputation methods. These methods are:

  • KNN

  • Mestdagh

  • Cubic

KNN and Cubic imputation methods were already implemented in the impute and zoo R packages.

Mestdagh is an imputation method that substitutes a missing Ct value with a numeric value obtained by adding one cycle to the highest Ct value across samples7. This method has already been described in other reports5. It assumes that missing values depend on the low or null abundance of the transcript in the sample.
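The maximum Ct plus one cycle rule is simple enough to show directly. The Python sketch below applies it to the Ct values of a single transcript across samples; taking the maximum per transcript is an assumption about how "the highest Ct value across samples" is computed, the function name is hypothetical, and `None` stands in for NA.

```python
def impute_max_plus_one(values):
    """Replace missing Ct values with the highest observed Ct plus one cycle."""
    observed = [value for value in values if value is not None]
    if not observed:
        # Nothing to anchor the imputation on; leave the values untouched.
        return list(values)
    fill = max(observed) + 1.0
    return [fill if value is None else value for value in values]


# Ct values of one transcript across four samples.
imputed = impute_max_plus_one([28.0, None, 31.5, None])
print(imputed)
```

Because a higher Ct corresponds to lower abundance, the imputed value places the missing measurements just below the weakest observed signal, which matches the assumption stated above.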

Differential expression analysis

Differential expression is a very popular analysis for identifying candidate transcripts whose expression can discriminate between two predefined conditions. Among the methods eligible for a differential expression analysis21, PIPE-T offers the possibility of choosing between three approaches:

  • T-test21.

  • Two sample Wilcoxon test21.

  • Rank Product22.

The t-test and the two sample Wilcoxon test are among the most used statistical tests to perform a differential expression analysis21. These tests are implemented by the ttestCtData and mannwhitneyCtData functions of the HTqPCR R package9. For the t-test and the two sample Wilcoxon test, PIPE-T offers the possibility of setting up six distinct parameters, which include: the type of alternative hypothesis to assess significance, the choice of a paired or an unpaired analysis, the presence in the data of replicated transcripts, the choice of a more or less stringent analysis, and the choice of the method for adjusting p-values in case of multiple hypothesis testing.

Rank Product is a popular method originating from a biological reasoning22. Rank Product is carried out using the RP function of the RankProd R package23.

Users who do not want to run any differential expression analysis can select an option named NONE. In this case, no differential expression analysis is performed on the data.

Data visualization and outputting

Quality assessment of RT-qPCR data is crucial for enhancing the accuracy of the results and the reliability of the conclusions2. HTqPCR provides several visualization options for assessing the quality of qPCR data, which include histograms, boxplots, density distributions, and scatter plots9. PIPE-T uses two boxplot visualizations showing the distribution of the expression values across all samples before and after data normalization, respectively. The visual inspection of the two boxplots is used as a qualitative assessment of the normalization procedure because the comparison shows the noise reduction achieved by normalization8. The Empirical Cumulative Distribution Function (ECDF) is also used in the literature for measuring noise reduction as an effect of data normalization8,10. PIPE-T computes and plots the ECDF before and after data normalization by using the ecdf function of the stats R package14. The significance of the difference between the two ECDF curves is estimated by the Kolmogorov-Smirnov test, and the p-value is reported on top of the figure and in the standard output.
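The quantities behind the ECDF comparison can be sketched in plain Python. The example below computes the coefficient of variation (CV) of a transcript, builds an ECDF, and evaluates the two-sample Kolmogorov-Smirnov statistic as the largest vertical distance between two ECDFs. It is an illustration, not the R code used by PIPE-T: the function names and the CV values are made up, the CV is taken as the sample standard deviation divided by the absolute mean, and the p-value is not computed here (in Python it could be obtained with `scipy.stats.ks_2samp`).

```python
from bisect import bisect_right
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    return stdev(values) / abs(mean(values))


def ecdf(sample):
    """Return a function giving the fraction of the sample that is <= x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two ECDFs.

    Both ECDFs are step functions that only change at observed values,
    so it is enough to evaluate them at the pooled data points.
    """
    f_a, f_b = ecdf(sample_a), ecdf(sample_b)
    return max(abs(f_a(x) - f_b(x)) for x in set(sample_a) | set(sample_b))


# Made-up per-transcript CVs before and after normalization.
cv_before = [0.10, 0.12, 0.15, 0.20]
cv_after = [0.04, 0.05, 0.06, 0.07]
d_statistic = ks_statistic(cv_before, cv_after)
print(d_statistic)
```

When normalization lowers the CV of most transcripts, the ECDF after normalization rises earlier than the one before, and the statistic grows toward 1.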

Tabular output files include raw data, filtered data, imputed data, and statistics to assess differential expression. A detailed description of the row and column names can be found in the documentation of the HTqPCR and RankProd R packages. A detailed description of visualization, sharing, and workflow integration using the Galaxy graphical interface can be found in the Galaxy documentation.

Results

We tested the ability of PIPE-T to analyze RT-qPCR data using two example datasets whose tab-separated text files were available in the Gene Expression Omnibus (GEO) with accession identifiers GSE25552 and GSE43000. The datasets came from two published studies on various metastatic tumors24 and non-small cell lung (NSCL) cancers25. The first study reported the results of the analysis of sixteen different tumors including Lung, Renal, Colon, Sarcoma, Ovarian, and Head and neck squamous cell carcinoma24. The second study reported the results of the analysis of forty-four NSCL tumor samples25. We carried out the PIPE-T analysis of both datasets on a test Galaxy instance version 19.01, installed on a local Linux machine. Parameter settings for the two analyses were taken from the original publications when available. When the parameters were not specified, we selected them arbitrarily.

Various metastatic cancers

We downloaded the input tab-delimited files from GEO and added an SDS version 2.4 format header to each of them because the header was missing. Input files contained experimental data for 384 microRNAs. We coupled RT-qPCR data with information about tumor status, which was oligometastatic (OLIGO) for ten out of sixteen patients and polymetastatic (POLY) for the remaining six patients. File names and tumor status were organized into a tab-delimited text file. The newly created file and the sixteen tab-separated text files were uploaded in Galaxy as FileTreatment and ListOfFile through the “Upload File from your computer” tool. The analysis was carried out with the parameter settings reported in Fig. 2.

Figure 2

Parameter settings used for the analysis of metastatic cancer data. PIPE-T wrapper interface with the settings used for the analysis of the metastatic cancer dataset. Input files have already been uploaded using the “Upload File from your computer” tool.

Our tool successfully completed the execution, returning seven output files (see Tables S1-S4 and Figs S1-S3). Boxplots and ECDF before and after data normalization, as well as the significant genes and statistics reported by the differential expression analysis procedure, are depicted in Figs 3, 4, and Table 1, respectively.

Figure 3

Qualitative assessment of the noise reduction for metastatic cancer data. Box plots show the distribution of Ct values in metastatic cancer samples after Ct filtering and categorization (Panel A) and after normalization (Panel B) procedures. Each box plot is relative to a sample.

Figure 4

Quantitative assessment of the noise reduction for metastatic cancer data. The ECDF (y axis) of the coefficient of variation (CV) is displayed for the metastatic cancer samples after the Ct filtering and categorization (blue line) and after the normalization (green line) procedures. The p value of the Kolmogorov-Smirnov test assessing the significance of the separation between the curves is reported on top of the plot.

Table 1 Significant genes estimated by the differential expression analysis procedure in metastatic cancer dataset.

We found 12 significantly upregulated and 11 downregulated microRNAs in polymetastatic tumors (p value < 0.05 and FC > 2 or FC < 0.5; Table 1).
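The selection rule quoted above (p value < 0.05 and FC > 2 or FC < 0.5) can be written as a small filter. The Python sketch below is illustrative only: the function name, the microRNA names, and the numbers are made up, and the fold change is assumed to be expressed as a ratio between the two groups.

```python
def select_significant(results, p_cutoff=0.05, fc_up=2.0, fc_down=0.5):
    """Split transcripts into up- and downregulated lists.

    results maps a transcript name to a (p_value, fold_change) pair.
    """
    upregulated, downregulated = [], []
    for name, (p_value, fold_change) in results.items():
        if p_value >= p_cutoff:
            continue  # not significant, whatever the fold change
        if fold_change > fc_up:
            upregulated.append(name)
        elif fold_change < fc_down:
            downregulated.append(name)
    return sorted(upregulated), sorted(downregulated)


# Made-up statistics for four hypothetical microRNAs.
stats = {
    "miR-x1": (0.01, 3.2),   # significant, upregulated
    "miR-x2": (0.03, 0.4),   # significant, downregulated
    "miR-x3": (0.20, 5.0),   # large change, but not significant
    "miR-x4": (0.01, 1.5),   # significant, but change too small
}
up, down = select_significant(stats)
print(up, down)
```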

Interestingly, among the significantly modulated microRNAs reported in the manuscript by Lussier and coworkers24, 11 out of 12 microRNAs were consistently upregulated in polymetastatic tumors and 8 out of 11 microRNAs were consistently upregulated in oligometastatic tumors. Any difference between our findings and those reported by Lussier and colleagues24 is probably due to the different approaches used to filter and handle missing values. Lussier and colleagues did not report any information about filtering based on the percentage of missing values or the application of any method for handling missing or unreliable Ct values. These results provide the first evidence that PIPE-T is able to correctly analyze RT-qPCR expression data.

Non-small cell lung cancer

NSCL input files were compliant with SDS format version 2.3 and reported experimental data for 381 microRNAs. Since the downloaded files used a comma as decimal separator, each comma was replaced with a dot before running PIPE-T. RT-qPCR data were coupled with histological data provided in the original publication25, which refer to twenty lung adenocarcinoma (LA) and twenty-four squamous cell lung cancer (SCLC) samples. File names and tumor subtypes were organized into a text file. We uploaded the newly created file as FileTreatment, and the forty-four tab-separated text files as ListOfFile. The analysis was carried out with the parameter settings reported in Fig. 5.
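The decimal separator conversion mentioned above can be scripted. The Python sketch below is a slightly more conservative variant than replacing every comma: it only touches tab-separated fields that look like a number written with a decimal comma, so that commas inside text fields are left alone. The function name, the regular expression, and the example line are assumptions made for illustration.

```python
import re

# A field such as "27,84" or "-3,5": digits, one comma, digits.
DECIMAL_COMMA = re.compile(r"^-?\d+,\d+$")


def fix_decimal_separator(line):
    """Replace the decimal comma with a dot in numeric fields of one line."""
    fields = line.rstrip("\n").split("\t")
    fixed = [
        field.replace(",", ".") if DECIMAL_COMMA.match(field) else field
        for field in fields
    ]
    return "\t".join(fixed)


# Made-up line in the spirit of a tab-separated qPCR report.
converted = fix_decimal_separator("hsa-miR-205\t27,84\tPassed\n")
print(converted)
```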

Figure 5

Parameter settings used for the analysis of NSCL data. PIPE-T wrapper interface with the settings used for the analysis of the NSCL dataset. Input files have already been uploaded using the “Upload File from your computer” tool.

Our tool successfully completed the execution, returning seven output files (see Tables S5-S8 and Figs S4-S6). Boxplots and ECDF before and after normalization, as well as the significant microRNAs identified by the differential expression analysis procedure, are depicted in Figs 6, 7, and Table 2, respectively.

Figure 6

Qualitative assessment of the noise reduction for NSCL data. Box plots show the distribution of Ct values in NSCL samples after Ct filtering and categorization (Panel A) and after normalization (Panel B) procedures. Each box plot is relative to a sample.

Figure 7

Quantitative assessment of the noise reduction for NSCL data. The ECDF (y axis) of the coefficient of variation (CV) is displayed for the NSCL samples after the Ct filtering and categorization (blue line) and after the normalization (green line) procedures. The p value of the Kolmogorov-Smirnov test assessing the significance of the separation between the curves is reported on top of the plot.

Table 2 Significant genes estimated by the differential expression analysis procedure in the NSCL dataset.

We found 16 significantly modulated microRNAs (p value < 0.05 and FC > 2 or FC < 0.5; Table 2). Interestingly, miR-205, miR-149, miR-422a, and miR-708 were significantly upregulated in SCLC and miR-375 was significantly upregulated in LA, in accordance with the results of the original manuscript25. Any difference in fold change or p-value between our study and that by Molina-Pinelo and colleagues25 can be explained by the different handling of missing values. The authors did not report their approach to missing or unreliable Ct values. In spite of these small differences, our results provide evidence that PIPE-T is able to correctly analyze RT-qPCR expression data.

Conclusions

We developed PIPE-T, a new Galaxy tool that offers several state-of-the-art options for parsing, filtering, normalizing, imputing, and analyzing RT-qPCR expression data. Integration of PIPE-T into Galaxy allows researchers with a strong bioinformatics background, as well as those without any programming expertise, to perform complex analyses in a simple to use, transparent, accessible, and reproducible environment.

Availability of Supporting Source Code and Requirements

Project name: Pipe-t

Project home page: https://github.com/igg-molecular-biology-lab/pipe-t (2019)26

Operating system(s): Linux (Galaxy), and platform independent

Programming language: R

Other requirements: Galaxy

License: GNU GPL

PIPE-T is available on the Main Tool Shed15 at the link27, on the Docker28 at the link29 and on the web30 at the link31. PIPE-T code is freely available on GitHub at the link https://github.com/igg-molecular-biology-lab/pipe-t (2019)26.

PIPE-T has the following dependencies:

<requirements>
  <requirement type="package" version="3.5.0">r-base</requirement>
  <requirement type="package" version="7.2.0">libgcc</requirement>
  <requirement type="package" version="1.36.0">bioconductor-htqpcr</requirement>
  <requirement type="package" version="3.8.0">bioconductor-rankprod</requirement>
  <requirement type="package" version="1.56.0">bioconductor-impute</requirement>
  <requirement type="package" version="1.11.0">r-bbmisc</requirement>
  <requirement type="package" version="1.8.4">r-psych</requirement>
  <requirement type="package" version="1.8_3">r-zoo</requirement>
</requirements>

If Conda32 is installed and enabled, Galaxy locates and resolves any tool dependencies automatically during tool installation.

Data availability

The tab-separated text files included in the ListOfFile collections of the two example applications are available in the GEO repository with accession numbers GSE25552 and GSE43000. Detailed documentation, step-by-step tool installation instructions, configuration details, and example applications are available on GitHub at the link https://github.com/igg-molecular-biology-lab/pipe-t (2019)26.