PIPE-T: a new Galaxy tool for the analysis of RT-qPCR expression data

Reverse transcription quantitative real-time polymerase chain reaction (RT-qPCR) is an accurate and fast method to measure gene expression. Reproducibility of the analyses is the main limitation of RT-qPCR experiments. Galaxy is an open, web-based, genomic workbench for reproducible, transparent, and accessible science. Our aim was to develop a new Galaxy tool for the analysis of RT-qPCR expression data. Our tool was developed using Galaxy workbench version 19.01 and functions implemented in several R packages. We developed PIPE-T, a new Galaxy tool implementing a workflow that offers several options for parsing, filtering, normalizing, imputing, and analyzing RT-qPCR data. PIPE-T requires two input files and returns seven output files. We tested the ability of PIPE-T to analyze RT-qPCR data on two example datasets available in the Gene Expression Omnibus repository. In both cases, our tool successfully completed execution and returned the expected results. PIPE-T can be easily installed from the Galaxy main Tool Shed or from Docker. Source code, step-by-step instructions, and example files are available on GitHub to help new users install, execute, and test PIPE-T. PIPE-T is a new tool suitable for the reproducible, transparent, and accessible analysis of RT-qPCR expression data.

variability introduced in the data during the experimental procedure 6. Data normalization is expected to reduce or eliminate any technical variability without affecting the true biological results 6. Global mean 8, DeltaCt based on universal normalizers 9, Modified global mean 10, Quantile 9, and Rank Invariant 9 are among the most accepted methods used for RT-qPCR data normalization 5.
RT-qPCR experiments allow measuring the expression of several transcripts in parallel using high-density plates 9. Plates have been used in several explorative studies to find novel biomarkers from the analysis of different diseases, tissues, experimental conditions, and cell types 3,5,6. The large number of studies published in the literature stimulated companies to develop commercial technologies to perform RT-qPCR experiments 3. For each experiment, these technologies generate textual reports summarizing a number of experimental parameters and data such as feature name, quality control flags, and Ct values. Different technologies generate reports in different formats. In our experience, SDS, EDS, and OpenArray are among the most used file formats for reporting results of RT-qPCR experiments.
Although the computational procedures and technologies for analyzing RT-qPCR data are well established, the heterogeneity of the assays employed in RT-qPCR experiments and the lack of a consensus on the best normalization system and on the missing value imputation approach to adopt make it hard to set up a standardized analysis procedure 6. Furthermore, producing high-quality publications and reproducible data is among the most critical pitfalls of qPCR experiments 11.
Several open-access software packages, tools, and web applications, such as R packages, have been proposed in recent years for the analysis of RT-qPCR data 1. HTqPCR is a well-known open source R/Bioconductor package for the high-throughput analysis of RT-qPCR data 9. It provides several functions and parameter options for assessing the quality of the experiment, filtering unreliable data, normalizing raw data, finding potential candidate biomarkers, and visualizing RT-qPCR data 9. However, R-based analysis suffers from some known limitations. First, analysis procedures are implemented in several packages that lack a unified framework. Second, users with a biological background who want to use the functionalities of R packages need non-trivial coding skills. Furthermore, the lack of a simple framework for reusing, sharing, and communicating experimental procedures and results limits the reproducibility, transparency, and accessibility of R-based analysis 12.
Galaxy is an open, collaborative, web-based, genomic workbench for reproducible, transparent, and accessible science 12. Galaxy has a very active developer community. More than 6746 public tools and workflows are freely available in the Galaxy Tool Shed repositories 12. New tools and workflows are easily deployable in the Galaxy repositories. For this purpose, Galaxy offers fresh installations of R and Python environments, a fast dependency resolver, step-by-step documentation, a simple graphical interface, and GitHub integration 13. However, to the best of our knowledge, no Galaxy tool or workflow has been reported to date for analyzing RT-qPCR data.
In the present work, we developed pipette (PIPE-T), a new tool for analyzing RT-qPCR expression data integrating the functionalities implemented in various R packages into one unified, reusable, transparent, accessible, and easy to use Galaxy wrapper.

Methods
Overview of the main procedures implemented by PIPE-T. PIPE-T implements the relative quantification method using the R language and computing environment 14.
To start a PIPE-T analysis, users must upload two input files:
• A List collection of tab-separated text files for all samples generated as report of the RT-qPCR experiment (ListOfFile).
• A tab-separated text file associating each filename in ListOfFile with a treatment group (FileTreatment).
Five distinct computational procedures are implemented in PIPE-T. Procedures are summarized in Fig. 1 and a detailed description of each procedure is provided in the following sections.
The execution of PIPE-T outputs the following output files:
File uploading and parsing. Heterogeneity of assays quantifying RT-qPCR gene expression is often associated with heterogeneity of the file formats reporting data summarizing the results of the RT-qPCR experiment. Hence, it is crucial that the user uploads files whose content is compliant with the file format parsable by PIPE-T before running any PIPE-T analysis.
"Upload File from your computer" is a Galaxy tool that allows uploading files into Galaxy. This tool is available on any fresh Galaxy instance or on the main Tool Shed repository 15.
www.nature.com/scientificreports
PIPE-T processes tab-separated text files that use a dot as decimal separator and are uploaded with the "Upload File from your computer" tool. PIPE-T supports the file formats parsable by the HTqPCR R package 9: SDS, OpenArray, LightCycler, CFX, BioMark, and Plain. We updated the parsing procedure to make it work with R 3.5.0 and tab-separated text files. We also extended the list of parsable file formats to include the EDS format, which is one of the most used by Thermo Fisher Scientific real-time qPCR instruments.
FileTreatment should have only two columns, named SampleName and Treatment. The SampleName column lists the name and the extension of each file uploaded into the ListOfFile collection. The Treatment column associates each SampleName with an experimental condition or group of interest. Group specification is necessary because PIPE-T implements the relative quantification method to analyze data from RT-qPCR experiments. PIPE-T admits the specification of two treatment groups. In the GitHub documentation we provide a checklist of recommendations to help users format their input files and check that these files contain sufficient data to run PIPE-T without errors.
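The constraints described above can be checked before uploading. The following is a minimal sketch in Python, not part of PIPE-T itself (which is written in R); the function name `check_file_treatment` and the wording of the messages are hypothetical, and the only rules it encodes are the ones stated in the text: two columns named SampleName and Treatment, file names that match the uploaded collection, and two treatment groups.

```python
import csv
import io


def check_file_treatment(text, uploaded_names):
    """Return a list of problems found in a FileTreatment table (empty if none)."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    # The text requires exactly these two column names.
    if not rows or rows[0] != ["SampleName", "Treatment"]:
        return ["header must be exactly: SampleName<TAB>Treatment"]
    problems = []
    names, groups = [], set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 columns, found {len(row)}")
            continue
        names.append(row[0])
        groups.add(row[1])
    # Every listed sample must match an uploaded file name, extension included.
    for name in names:
        if name not in uploaded_names:
            problems.append(f"{name} is not in the uploaded collection")
    # Every uploaded file needs a treatment group.
    for name in sorted(set(uploaded_names) - set(names)):
        problems.append(f"{name} has no treatment assigned")
    # Relative quantification compares two groups.
    if len(groups) != 2:
        problems.append(f"expected 2 treatment groups, found {len(groups)}")
    return problems
```

An empty list means the table satisfies the rules above; it does not guarantee that the Ct files themselves are in a parsable format.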
If the file format is correct, PIPE-T populates a qPCRset object containing the following data for each transcript and sample:
• Raw Ct/Cq value,
• Value of the internal quality control flag,
• Transcript and sample names,
• FeatureCategory
Data parsing and qPCRset object generation are carried out using the readCtData function of the HTqPCR R package 9.
Ct filtering and categorization. Feature categorization is a procedure for describing the level of reliability of a transcript and can be used to filter out features whose expression is not sufficiently reliable 9. The HTqPCR package defines three possible categories: "Undetermined", "Unreliable", and "OK" 9. "Undetermined" is used to flag Ct values above a user-defined threshold, and "Unreliable" indicates Ct values that are so low as to be estimated by the user to be problematic 9.
By default, only Ct values labeled as "undetermined" in the input data files are placed into the "Undetermined" category, and the rest are classified as "OK" 9 .
The FeatureCategory for a transcript can be altered on the basis of two criteria 9:
• Range of Ct values. Some Ct values might be too high or too low to be considered a reliable measure of gene expression in the sample and, therefore, should not be marked as "OK".
• Flags. Depending on the qPCR input, the values might have associated flags, such as "Passed" or "Failed", which are used for assigning categories.
PIPE-T implements both criteria by allowing users to set up a range of Ct values and a List button. Any Ct value outside the user-defined range is categorized as "Unreliable". Users can force PIPE-T to check the internal control flag status. In this case, the FeatureCategory for a transcript is set to "Undetermined" if the transcript did not pass internal quality control.
PIPE-T uses FeatureCategory labels to replace any Ct value categorized as "Undetermined" or "Unreliable" with a not available (NA) value.
These operations are carried out using setCategory and filterCategory functions of HTqPCR package 9 .
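As an illustration of the categorization and filtering logic just described, here is a minimal Python sketch. It does not reproduce the setCategory and filterCategory functions of HTqPCR; the function names, the argument names, and the Ct range used in the usage note are hypothetical, and a missing value is represented by NaN in place of R's NA.

```python
import math


def categorize(ct, ct_min, ct_max, flag_passed=True, check_flag=False):
    """Assign a feature category to a single Ct value."""
    # Values reported as undetermined by the instrument arrive here as NaN.
    if ct is None or math.isnan(ct):
        return "Undetermined"
    # Optional check of the internal quality control flag.
    if check_flag and not flag_passed:
        return "Undetermined"
    # Values outside the user-defined range are not trusted.
    if ct < ct_min or ct > ct_max:
        return "Unreliable"
    return "OK"


def filter_ct(ct, category):
    """Keep a Ct value only if its category is "OK"; otherwise return NaN."""
    return ct if category == "OK" else math.nan
```

For example, with a hypothetical range of 14 to 32 cycles, `categorize(35.0, 14, 32)` yields "Unreliable" and `filter_ct(35.0, "Unreliable")` yields NaN, so the value is treated as missing in the later steps.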
Normalization. Data normalization minimizes unwanted systematic technical and experimental variation in the data so that true biological changes can be better appreciated 16. PIPE-T offers six different normalization options, listed below:
• Global mean 8
• DeltaCt 9
• Modified global mean 10
• Quantile 9
• Norm rank invariant 9
• Scale rank invariant 9
Global mean, quantile, norm rank invariant, and scale rank invariant were already implemented in the HTqPCR R package 9. However, as norm rank invariant and scale rank invariant worked only if missing values were absent, we extended the procedure by substituting any missing value with a numeric value using the na.spline function implemented in the zoo R package 17. D'haene and colleagues showed the benefits of using the geometric mean for the normalization of microRNA expression data by introducing the so-called modified global mean method 10. For these reasons, we integrated the modified global mean method in PIPE-T.
PIPE-T supports the deltaCt method. Housekeeping genes can be specified by the user or can be estimated by the geNorm or NormFinder methods implemented in the NormqPCR R/Bioconductor package 18. When geNorm is selected, PIPE-T identifies candidate normalizers taking those transcripts whose stability was greater than 1.5 as reported by Vandesompele and colleagues 19.
Newly implemented normalization methods have been integrated in PIPE-T as an updated version of the function normalizeCtData of the HTqPCR R package 9 .
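To make the arithmetic behind two of the options listed in this section concrete, the sketch below shows deltaCt and global mean normalization for a single sample in Python. It is an illustration under stated assumptions, not the normalizeCtData code: the function names are hypothetical, the reference is taken as the arithmetic mean of the chosen housekeeping genes, and the global mean is taken over every non-missing Ct value in the sample, whereas the packaged implementations may select the contributing values differently.

```python
import math
from statistics import mean


def delta_ct(sample, reference_genes):
    """Subtract the mean Ct of the reference genes from every Ct in the sample."""
    reference = mean(sample[gene] for gene in reference_genes)
    return {gene: ct - reference for gene, ct in sample.items()}


def global_mean(sample):
    """Subtract the mean of all non-missing Ct values in the sample."""
    observed = [ct for ct in sample.values() if not math.isnan(ct)]
    centre = mean(observed)
    # Missing values stay missing because NaN minus a number is NaN.
    return {gene: ct - centre for gene, ct in sample.items()}
```

With a sample such as `{"miR-a": 20.0, "miR-b": 24.0, "U6": 22.0}` and U6 as the only reference gene, both functions return -2.0, 2.0, and 0.0 for the three transcripts, because the reference Ct and the sample mean happen to coincide.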
Transcript filtering and imputation. High-throughput data may often contain missing values. For this reason, handling missing values is a crucial step of any RT-qPCR analysis 5,6. The simplest solution for handling missing values would be to exclude from the analysis any transcript with at least one missing value. In such a case, missing values are no longer a problem because they are removed from the analysis. However, this approach could filter out a considerable number of potentially useful transcripts. Another solution would be to keep every transcript regardless of the number of missing values. In such a case, all potentially useful transcripts are taken into account for subsequent analysis, but the probability of making an error increases with the number of missing values 6. A widely accepted approach in the literature consists in keeping transcripts with a reasonable number of missing values and filtering out those exceeding this threshold 6. Missing values of the transcripts that do not exceed the threshold are imputed using a suitable method. In the literature, several imputation methods have been proposed 20.
PIPE-T offers a slider that the user can move to specify the maximum percentage of missing values admissible for a specific transcript. PIPE-T allows filtering transcripts using a user-defined percentage of missing values and/or a user-defined list of transcripts to be removed by using the filterCtData function of the HTqPCR package 9.
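The percentage-based filter can be sketched in a few lines of Python, followed by a simple imputation of the values that survive the filter. This is an illustration, not the filterCtData implementation: the function names are hypothetical, missing values are represented by NaN, and the imputation rule (highest observed Ct plus one cycle) is the one this section attributes to the Mestdagh method, applied here per transcript across samples, which is an assumption about how "across samples" should be read.

```python
import math


def filter_transcripts(ct_table, max_missing_pct):
    """Keep transcripts whose percentage of missing Ct values is within the limit."""
    kept = {}
    for transcript, values in ct_table.items():
        missing = sum(1 for value in values if math.isnan(value))
        if 100.0 * missing / len(values) <= max_missing_pct:
            kept[transcript] = values
    return kept


def impute_max_plus_one(values):
    """Replace missing Ct values with the highest observed Ct plus one cycle."""
    observed = [value for value in values if not math.isnan(value)]
    if not observed:
        # Nothing to base the imputation on; leave the row untouched.
        return list(values)
    fill = max(observed) + 1.0
    return [fill if math.isnan(value) else value for value in values]
```

With a 50% limit, a transcript measured in three of four samples is kept and its single gap is filled, while a transcript measured in only one of four samples is removed before imputation.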
In addition, PIPE-T gives the possibility of selecting one of three well-known imputation methods: Mestdagh, KNN, and Cubic. The KNN and Cubic imputation methods were already implemented in the impute and zoo R packages. Mestdagh is an imputation method that substitutes a missing Ct value with a numeric value obtained by adding one cycle to the highest Ct value across samples 7. This method has already been described in other reports 5. It assumes that missing values depend on the low or null abundance of the transcript in the sample.
Differential expression analysis. Differential expression is a very popular analysis for identifying candidate transcripts whose expression can discriminate between two predefined conditions. Among the methods eligible for a differential expression analysis 21, PIPE-T offers the possibility of choosing between three approaches:
• T-test 21
• Two-sample Wilcoxon test 21
• Rank Product 22
T-test and two sample Wilcoxon test are among the most used statistical tests to perform a differential expression analysis 21 . Tests are implemented by ttestCtData and mannwhitneyCtData functions of the HTqPCR R package 9 . For the t-test and the two sample Wilcoxon test, PIPE-T offers the possibility of setting up six distinct parameters, which include: the types of alternative hypothesis to assess significance, the choice of a paired or an unpaired analysis, the presence in the data of replicated transcripts, the choice of a more or less stringent analysis, and the choice of the method for adjusting p-values in case of multiple hypothesis testing.
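For readers unfamiliar with relative quantification, the following Python sketch shows how a delta delta Ct value, the corresponding fold change, and an unpaired t statistic can be computed for one transcript from normalized Ct values of two groups. It is not the ttestCtData code: the function names are hypothetical, the fold change uses the 2^(-ddCt) convention with ddCt taken as target mean minus calibrator mean, and the t statistic is the Welch (unequal variance) form, which is an assumption about the test variant. No p-value is computed, since that requires the t distribution.

```python
from math import sqrt
from statistics import mean, variance


def fold_change(target, calibrator):
    """Return (ddCt, FC) for one transcript, with FC = 2 ** (-ddCt)."""
    dd_ct = mean(target) - mean(calibrator)
    return dd_ct, 2.0 ** (-dd_ct)


def welch_t(target, calibrator):
    """Unpaired t statistic that does not assume equal group variances."""
    standard_error = sqrt(
        variance(target) / len(target) + variance(calibrator) / len(calibrator)
    )
    return (mean(target) - mean(calibrator)) / standard_error
```

A lower Ct means higher abundance, so a target group whose normalized Ct values are two cycles lower than the calibrator gives ddCt = -2 and a fold change of 4.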
Rank Product is a popular method originating from a biological reasoning 22 . Rank Product is carried out using RP function of RankProd R package 23 .
Users who do not want to run any differential expression analysis can select an option named NONE. In this case, no differential expression analysis is performed on the data.
Data visualization and outputting. Quality assessment of RT-qPCR data is crucial for enhancing the accuracy of the results and the reliability of the conclusions 2. HTqPCR provides several visualization options for assessing the quality of qPCR data, which include histograms, boxplots, density distributions, and scatter plots 9. PIPE-T uses two boxplot visualizations showing the distribution of the expression values across all samples before and after data normalization, respectively. Visual inspection of the two boxplots serves as a qualitative assessment of the normalization procedure because the comparison shows the noise reduction achieved by normalization 8. The Empirical Cumulative Distribution Function (ECDF) is also used in the literature for measuring noise reduction as an effect of data normalization 8,10. PIPE-T computes and plots the ECDF before and after data normalization by using the ecdf function of the stats R package 14. The significance of the difference between the two ECDF curves is estimated by the Kolmogorov-Smirnov test, and the p-value is reported on top of the figure and in the standard output.
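The comparison of two ECDF curves can be illustrated with a short Python sketch that builds an empirical cumulative distribution function and computes the two-sample Kolmogorov-Smirnov statistic, that is, the largest vertical distance between the two curves. It is a sketch of the general technique rather than of the PIPE-T code, which relies on R's stats package; the function names are hypothetical, the quantity that PIPE-T feeds to the ECDF is not specified here, and the p-value is deliberately left out.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the fraction of the sample <= x."""
    ordered = sorted(sample)
    size = len(ordered)

    def cumulative(x):
        return bisect_right(ordered, x) / size

    return cumulative


def ks_statistic(first, second):
    """Largest absolute difference between the ECDFs of two samples."""
    f_first, f_second = ecdf(first), ecdf(second)
    # Both ECDFs are step functions, so the maximum is reached at a data point.
    return max(abs(f_first(x) - f_second(x)) for x in set(first) | set(second))
```

A statistic close to 0 means the two distributions are nearly indistinguishable, while a value close to 1 means they barely overlap.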
Tabular output files include raw data, filtered data, imputed data and statistics to assess differential expression. A detailed description of the row and column names can be found in HTqPCR and RankProd R packages documentation. A detailed description of visualization, sharing, and workflow integration using Galaxy graphical interface can be found in the Galaxy documentation.

Results
We tested the ability of PIPE-T to analyze RT-qPCR data using two example datasets whose tab-separated text files were available in the Gene Expression Omnibus (GEO) with accession identifiers GSE25552 and GSE43000. The datasets came from two published studies on various metastatic tumors 24 and non-small cell lung (NSCL) cancers 25. The first study reported the results of the analysis of sixteen different tumors including Lung, Renal, Colon, Sarcoma, Ovarian, and Head and neck squamous cell carcinoma 24. The second study reported the results of the analysis of forty-four NSCL tumor samples 25. We carried out the PIPE-T analysis of both datasets on a test Galaxy instance version 19.01, installed on a local Linux machine. Parameter settings for the two analyses were taken from the original publications when available. When the parameters were not specified, we selected them arbitrarily.

Various metastatic cancers.
We downloaded the input tab-delimited files from GEO and added an SDS version 2.4 format header to each of them because the header was missing. Input files contained experimental data for 384 microRNAs. We coupled RT-qPCR data with information about tumor status, which was oligometastatic (OLIGO) for ten out of sixteen patients and polymetastatic (POLY) for the remaining six patients. File names and tumor status were organized into a tab-delimited text file. The newly created file and the sixteen tab-separated text files were uploaded in Galaxy as fileTreatment and ListOfFile through the "Upload File from your computer" tool. Analysis was carried out with the parameter settings reported in Fig. 2.
Our tool successfully completed the execution, returning seven output files (see Tables S1-S4 and Figs S1-S3). Boxplots and ECDF before and after data normalization as well as the significant genes and statistics reported by the differential expression analysis procedure are depicted in Figs 3, 4, and Table 1, respectively.
Interestingly, among the significantly modulated microRNAs reported in the manuscript by Lussier and coworkers 24, 11 out of 12 microRNAs were consistently upregulated in polymetastatic tumors and 8 out of 11 microRNAs were consistently upregulated in oligometastatic tumors. Any differences between our findings and those reported by Lussier and colleagues 24 are probably due to the different approaches used to filter and handle missing values. Lussier and colleagues did not report any information about filtering based on the percentage of missing values or the application of any method for handling missing or unreliable Ct values. These results provide the first evidence that PIPE-T is able to correctly analyze RT-qPCR expression data.
Non-small cell lung cancer. NSCL input files were compliant with SDS format version 2.3 and reported experimental data for 381 microRNAs. Since the downloaded files used a comma as decimal separator, each comma was replaced with a dot before running PIPE-T. RT-qPCR data were coupled with histological data provided in the original publication 25, which refer to twenty lung adenocarcinoma (LA) and twenty-four squamous cell lung cancer (SCLC). File names and tumor subtypes were organized into a text file. We uploaded the newly created file as fileTreatment, and the forty-four tab-separated text files as ListOfFile. Analysis was carried out with the parameter settings reported in Fig. 5.
Our tool successfully completed the execution, returning seven output files (see Tables S5-S8 and Figs S4-S6). Boxplots and ECDF before and after normalization, as well as the significant microRNAs identified by the differential expression analysis procedure, are depicted in Figs 6, 7, and Table 2, respectively.
We found 16 significantly modulated microRNAs (p value < 0.05 and FC > 2 or FC < 0.5; Table 2). Interestingly, miR-205, miR-149, miR-422a, and miR-708 were significantly upregulated in SCLC and miR-375 was significantly upregulated in LA, in accordance with the results of the original manuscript 25. Any difference in fold change or p-value between our study and that by Molina-Pinelo and colleagues 25 can be explained by the different handling of missing values. The authors did not report their approach to missing or unreliable Ct values. In spite of three small differences, our results provide evidence that PIPE-T is able to correctly analyze RT-qPCR expression data.

Conclusions
We developed PIPE-T, a new Galaxy tool that offers several state-of-the-art options for parsing, filtering, normalizing, imputing, and analyzing RT-qPCR expression data. Integration of PIPE-T into Galaxy allows researchers with strong bioinformatic background, as well as those without any programming expertise, to perform complex analysis in a simple to use, transparent, accessible, reproducible, and user-friendly environment.

Availability of Supporting Source code and Requirements
Project name: Pipe-t
Project home page: https://github.com/igg-molecular-biology-lab/pipe-t (2019) 26
Operating system(s): Linux (Galaxy), and platform independent
Programming language: R
Other requirements: Galaxy
License: GNU GPL
PIPE-T is available on the Main Tool Shed 15 at the link 27, on the Docker 28 at the link 29 and on the web 30 at the link 31. PIPE-T code is freely available on GitHub at the link https://github.com/igg-molecular-biology-lab/pipe-t (2019) 26.
a Calibrator is the treatment group of the first sampleName in fileTreatment. Target is the alternative treatment group. In our example, Calibrator is LA and Target is SCLC.
b Value of t statistics.
c Significance of the difference between the mean of expression of the treatment groups. MicroRNAs are ordered by p value.
d P value adjusted for multiple hypothesis testing.
e Delta delta Ct value.
f Fold change value calculated as 2^(−ddCt). FC values greater than 2 or lower than 0.5 are reported.
g Average expression of the microRNA in the Calibrator group.
h Average expression of the microRNA in the Target group.
i Category of the Ct values ("OK", "Undetermined") across the samples of the Calibrator group.
j Category of the Ct values ("OK", "Undetermined") across the samples of the Target group.