reString: an open-source Python software to perform automatic functional enrichment retrieval, results aggregation and data visualization

Functional enrichment analysis is an analytical method to extract biological insights from gene expression data, popularized by the ever-growing application of high-throughput techniques. Typically, expression profiles are generated for hundreds to thousands of genes/proteins from samples belonging to two experimental groups, and after ad-hoc statistical tests, researchers are left with lists of statistically significant entities, possibly lacking any unifying biological theme. Functional enrichment tackles the problem of putting overall gene expression changes into a broader biological context, based on pre-existing knowledge bases of reference: database collections of known expression regulation, relationships and molecular interactions. STRING is among the most popular tools, providing both protein–protein interaction networks and functional enrichment analysis for any given set of identifiers. For complex experimental designs, manually retrieving, interpreting, analyzing and abridging functional enrichment results is a daunting task, usually performed by hand by the average wet-biology researcher. We have developed reString, a cross-platform software that seamlessly retrieves from STRING functional enrichments from multiple user-supplied gene sets, with just a few clicks, without any need for specific bioinformatics skills. Further, it aggregates all findings into human-readable table summaries, with built-in features to easily produce user-customizable publication-grade clustermaps and bubble plots. Herein, we outline a complete reString protocol, showcasing its features on a real use-case.

Input files structure - Input files need to contain tabular data, in .csv (comma-separated values), .tsv (tab-separated values) or Excel (.xlsx) format. Please note that the .tdt (tab delimited text) extension for tabular data is discouraged, as Libre Office's Calc tries to load such files as a Writer document rather than a spreadsheet. The sample files downloadable from within reString are .csv files that can be opened directly by either Excel or Libre Office's Calc.

Panel A shows the text import dialog window of Calc; panel B shows how a file appears once opened in Calc. The input file has two columns: gene (containing the gene IDs) and another labelled as the input filename (containing the fold change of the expression of that gene with respect to the comparison made). Handling RNAseq results to obtain this kind of table should be straightforward for all researchers. Even if not all analyses make use of the fold change information (column 2), input files must contain it nonetheless.

Supplementary Figure S5
Choose output folder - Choose an existing folder to output the results files to. Navigate the file browser (A) to the desired output location and click Choose. The program will notify that the working directory has been set (B, C).

Supplementary Figure S6
STRING website analysis tools - After performing PPI (protein-protein interaction) analysis with STRING, the "∑ Analysis" button at the bottom of the page opens a dialog through which further functional enrichment data can be downloaded (A).
As an example, the structure of a KEGG enrichment file (saved by STRING as enrichment.KEGG.tsv) is shown in B, as it appears when opened with Libre Office's Calc. "ALL_" was prepended to the filename by reString because, for this particular example, the gene list used to retrieve KEGG information contained all genes, irrespective of their up- or downregulation with respect to experimental groups.

[Figure: panel A, genes_from_comparison_n sent to the remote STRING server; panel B, reString textual output]
Retrieval of functional enrichment information from STRING. For each supplied gene list table, reString contacts STRING remote servers via the STRING API to fetch functional enrichment tables (A). reString reports what it is doing in the textual output frame of the main program window (B). When running the sample protocol, reString starts with the first file and creates a subfolder with the same name in the specified results folder. Then, depending on the analysis parameters, it queries STRING for functional information with either all genes (ALL), or upregulated (UP) and/or downregulated (DOWN) ones, naming output tables accordingly (provided the gene list is long enough to generate statistically significant output data). One table is retrieved for each of the KEGG, Function, Component, Process and RCTM result types. The process is repeated for all files.
In the example depicted (B), the genes contained in the input file were divided into up- and downregulated, and each list was used to probe the STRING server. reString reports the time the server took to respond. Following the STRING API instructions, each request is followed by a pause of 1 second.
To identify itself to the STRING server, reString generates a unique ID for each session, shown at each query.
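The retrieval loop described above can be sketched as follows. This is a minimal illustration, not reString's own code: the function names, the session ID format and the dry-run stand-ins are hypothetical, and the endpoint path and parameter names (`identifiers`, `species`, `caller_identity`) are assumed to follow the public STRING API documentation.

```python
import uuid
from urllib.parse import urlencode

# Assumption: base URL of the public STRING API.
STRING_API = "https://string-db.org/api"


def build_enrichment_request(genes, species, caller_identity):
    """Return the URL and form-encoded body of one enrichment query."""
    params = {
        "identifiers": "\r".join(genes),  # one identifier per line
        "species": species,               # NCBI taxonomy identifier
        "caller_identity": caller_identity,
    }
    return f"{STRING_API}/tsv/enrichment", urlencode(params)


def run_queries(gene_lists, species, caller_identity, send, pause):
    """Send one request per gene list, pausing after each one."""
    responses = {}
    for label, genes in gene_lists.items():
        url, body = build_enrichment_request(genes, species, caller_identity)
        responses[label] = send(url, body)
        pause(1)  # one-second pause after every request
    return responses


# Hypothetical session ID format; reString's actual format is not shown here.
session_id = f"example-session-{uuid.uuid4().hex[:8]}"

# Dry run: a stand-in `send` that records the request instead of contacting
# the server, and a stand-in `pause` that records the requested delay.
delays = []
dry_run = run_queries(
    {"UP": ["Cd3e", "Cd4"], "DOWN": ["Alb"]},
    species=10090,  # Mus musculus
    caller_identity=session_id,
    send=lambda url, body: (url, body),
    pause=delays.append,
)
```

For a live run, `pause` would be `time.sleep` and `send` a function that posts the body to the URL. Both are injected here so that the sketch can be exercised without network access.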

Supplementary Figure S8
Aggregation of functional enrichment results. After fetching all tables from STRING, reString begins aggregating data from them, detailing each step in the main window (shown in A). When possible, the equivalent Python code of what it is doing is printed as well (prefixed by "*Python*"), and it should serve as a reference for whoever would like to reproduce or integrate the same analysis when using reString as a Python module. reString processes all the folders it encounters in the output folder indicated before the beginning of the analysis, that is, those produced during the retrieval of data from STRING servers. For each kind of term retrieved (KEGG, Function, Component, Process and RCTM), the aggregation is started following the chosen analysis type. In the depicted example, the default analysis is run (indicated with "directions"). After completing the aggregation, reString lists the final, aggregated summaries it produced for each term (B). For each one, two types of tables are produced: results and summary.

Supplementary Figure S9
Species selection dialog. STRING requires a species parameter to be indicated when performing PPI network and functional enrichment analysis. reString defaults to mouse (taxonomy identifier 10090). It is possible to choose among the most common species by clicking the corresponding buttons, or to set a species directly by entering the corresponding numerical taxonomy identifier and clicking "Set".

Supplementary Figure S10
Analysis options. reString knows from the input files whether the genes of any given comparison are more expressed in one condition with respect to the other. This directionality information can be exploited to set four possible analysis types (A). In reString, "up" and "down" follow the convention specified in B.
Researchers should adjust their input data so that "upregulated" or "downregulated" matches the implied convention.
In the example shown, condition 1 and condition 2 are two experimental conditions where the abundance of the transcript has been estimated. The log2 of the ratio of the two conditions is calculated as shown, and the following applies: Upregulated means higher in condition 1 vs condition 2. That is the case of gene 2. The Log2FC is > 0. Downregulated means lower in condition 1 vs condition 2. That is the case of gene 1. The Log2FC is < 0.
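The convention above can be written out as a short sketch. The abundance values are invented for illustration, and the function names are hypothetical rather than part of reString.

```python
import math


def log2_fold_change(condition_1, condition_2):
    """Log2 of the ratio condition 1 / condition 2."""
    return math.log2(condition_1 / condition_2)


def direction(log2fc):
    """Classify a gene according to the sign of its Log2FC."""
    if log2fc > 0:
        return "UP"      # higher in condition 1 vs condition 2
    if log2fc < 0:
        return "DOWN"    # lower in condition 1 vs condition 2
    return "UNCHANGED"


# Invented abundances: (condition 1, condition 2).
abundances = {"gene 1": (50.0, 200.0), "gene 2": (400.0, 100.0)}
calls = {
    gene: direction(log2_fold_change(c1, c2))
    for gene, (c1, c2) in abundances.items()
}
```

If the fold changes in a data set were computed as condition 2 over condition 1, flipping the sign of the Log2FC column brings them in line with this convention.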

Supplementary Figure S11
Draw clustermap window. reString has a built-in tool that produces clustermaps by charting results-type tables. By default, it shows a heatmap (no clustering of rows and columns) that fits into a reasonably sized picture, and saves it to a location of choice. Options are illustrated in the picture above; please refer to the manual for a complete explanation.

Supplementary Figure S12
Draw bubble plot window. reString has a built-in tool that produces bubble plots by charting summary-type tables. Options are illustrated in the picture above; please refer to the manual for a complete explanation.

Supplementary Figure S13
Retrotranscription and DNA contamination test. Specific primers were used to amplify sequences of Srp14 on cDNA (193 bp) and gDNA (677 bp) in RT plus and RT minus samples.

Supplementary Figure S14
Results inspection - Inspecting each block of the clustermap is easy. For each comparison, reString organizes all functional enrichment files it retrieves from STRING into a folder with the same name as the input file. In this example, the term "Immune System" of Reactome Pathways is investigated in the comparison "treatment_t1_red_VS_treatment_t1_blue_FC" (A). The files for RCTM data are found in the corresponding folder (B), and once they are opened, it is easy to find out whether the term was found with the up- or downregulated gene lists (C) and to retrieve the genes of interest (arrows).

Supplementary Table S1
For endpoint PCR, the conditions were as follows: 95°C for 3 min, followed by 35 cycles of 30 s at 95°C, 30 s at 58.5°C and 45 s at 72°C, followed by a final amplification step of 5 min at 72°C. For qPCR, the conditions were as follows: 95°C for 1 min, followed by 40 cycles of 10 s at 95°C and 30 s at 60°C, followed by a final melting curve performed from 65.0°C to 95.0°C with 0.5°C increments.

Installation in depth
Step-by-step instructions on how to install reString on specific platforms are available as YouTube videos. Each video description summarizes the commands that should be entered in the terminal to complete the installation process. This covers checking/installing Python, installing any missing dependencies and installing restring itself.

Installation troubleshooting
If you experience hiccups during the installation, we may have you covered:
• If you get SyntaxError after trying to run restring: make sure you are using Python 3.x and not Python 2.x. Python 2.x is obsolete and discontinued. Many systems still support both; in this case, python and pip refer to Python 2.x, while python3 and pip3 refer to Python 3.x. Use pip3 to install restring.
• If you get SyntaxError and you are sure you're running Python 3.x: then you're running a version prior to 3.6. Update it.
• If you get errors launching reString by typing restring-gui: with the exception of macOS, we noticed that the folder containing the launch script is not added to the Path/PATH environment variable (that is: even if the script is on your computer, your system doesn't know where to find it when you type its name).
If this happens, you have two alternatives:
a) start restring by typing
python -c "import restring; restring.restring_gui()"
or
python3 -c "import restring; restring.restring_gui()"
Use the first command if python is Python 3.x on your system; use python3 if python calls version 2.x instead. These commands are guaranteed to work from within any folder the terminal is in;
b) permanently teach your system where the launch script lies. You will know the location from the installation log (refer to the YouTube videos). In GNU/Linux systems, it's far easier to google for something like "how to permanently add a folder to PATH in YOUR_DISTRO_HERE". In Windows, follow the instructions of the YouTube installation guide. When done, you will be able to launch restring by just typing:

restring-gui
• If you get weird errors: get in touch with us and report a bug.

Procedure
restring can be used via its graphical user interface (recommended). A full protocol, with sample data and examples, is detailed below. Alternatively, it can be imported as a Python module. This hands-on procedure is detailed at the end of this document.

1 | Prepping the files
All restring requires is a gene list of choice per experimental condition. This gene list needs to be in tabular form, arranged like this sample data. This is very easily managed with any spreadsheet editor, such as Microsoft's Excel or Libre Office's Calc.
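As an illustration, a table with this layout can also be produced programmatically. The sketch below uses only the Python standard library; the gene names and fold-change values are invented, and using the comparison name (the file name without its extension) as the header of the second column is an assumption based on the sample files described in this document.

```python
import csv
import io


def write_input_table(handle, comparison_name, fold_changes):
    """Write a two-column table: gene IDs and their fold changes."""
    writer = csv.writer(handle)
    # Column 1 is labelled "gene"; column 2 carries the comparison name.
    writer.writerow(["gene", comparison_name])
    for gene, fold_change in fold_changes.items():
        writer.writerow([gene, fold_change])


# Invented example data.
comparison = "treatment_t1_red_VS_treatment_t1_blue_FC"
buffer = io.StringIO()
write_input_table(buffer, comparison, {"Cd4": 1.8, "Alb": -2.3})
table_text = buffer.getvalue()
```

To write an actual file, the same function can be given a handle opened with `open(comparison + ".csv", "w", newline="")`.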

2 | Set input files and output path
In the menu, choose File > Open..., or hit the Open files.. button. Tip: put all input files you want to process together in one or more analyses in the same folder. Input files can be individually selected from any one folder, but each time input files are added, the input files list is reset.
Then, choose an existing directory where all output files will be placed: choose File > Set output folder or hit the Set folder button. Choose a different output folder each time the analysis parameters are varied (see section 5).

3 | Running the analysis with default settings
In the menu, choose Analysis > New analysis, or hit the New analysis button. restring will look for genes in the files you have specified, query STRING to get functional enrichment data back (these tables, identical to the ones you would retrieve manually, will be saved into subfolders of the output folder), then write aggregated results and summaries. These are found in the specified output directory, and take the form of results- or summary-type tables in .tsv (tab-separated values) format, which can be opened out-of-the-box by Excel or Calc. Let's take a look at the anatomy of these tables.

Results tables
The table contains all terms cumulatively retrieved from all comparisons (each one of the input files containing the genes of interest between any two experimental conditions). For every term, common genes (if any) are listed. These common genes are computed only across the comparisons where the term actually shows up. If the term appears in exactly one comparison, this is explicitly stated: n/a (just one condition). P-values are the ones retrieved from the STRING tables (the lower, the better). Missing p-values are represented with 1 (that is, in that specific comparison the term is treated as not enriched).
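The aggregation rules just described can be sketched in plain Python. This is an illustration of the logic only, not reString's implementation: the data structure, the function name and the example values are hypothetical.

```python
def aggregate(term_hits):
    """Build a results-style table.

    term_hits maps each comparison to {term: (p_value, set_of_genes)}.
    """
    comparisons = sorted(term_hits)
    terms = sorted({term for hits in term_hits.values() for term in hits})
    table = {}
    for term in terms:
        # Comparisons where the term actually shows up.
        present = [c for c in comparisons if term in term_hits[c]]
        if len(present) == 1:
            common = "n/a (just one condition)"
        else:
            gene_sets = (term_hits[c][term][1] for c in present)
            common = sorted(set.intersection(*gene_sets))
        # Missing p-values are represented with 1.
        p_values = {
            c: term_hits[c][term][0] if term in term_hits[c] else 1.0
            for c in comparisons
        }
        table[term] = {"common_genes": common, "p_values": p_values}
    return table


# Invented example with two comparisons.
hits = {
    "comparison_1": {"Immune System": (0.001, {"Cd4", "Cd3e", "Ptprc"})},
    "comparison_2": {
        "Immune System": (0.02, {"Cd4", "Ptprc"}),
        "Metabolism": (0.03, {"Alb"}),
    },
}
results = aggregate(hits)
```

Here "Immune System" appears in both comparisons, so its common genes are the intersection of the two gene sets, while "Metabolism" appears in only one and gets a p-value of 1 for the other.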

Summary tables
These can be useful for finding the most interesting terms across all comparisons (better p-value, presence in most/selected comparisons), as well as the most recurring DE genes for each term.

4 | Visualizing the results
Clustermap
restring makes it easy to inspect the results by visualizing results-type tables as clustermaps. In the menu, choose Analysis > Draw clustermap to open the Draw clustermap window.

Clustermap options
readable: if flagged, the output clustermap will be drawn as tall as required to fully display all the terms contained in it. Be warned that the image might get very tall, depending on the number of terms.
log transform: if flagged, the p-values are minus log-transformed with the specified base: -log(number, base chosen). Hit Apply to apply.
cluster rows: if flagged, the rows are clustered (by distance) as per Scipy defaults. The column order is overridden.
cluster columns: if flagged, the columns are clustered (by distance) as per Scipy defaults.

P-value cutoff:
For each term (row), if all values are higher than the specified threshold, the term is not included in the clustermap. For log-transformed heatmaps, for each term (row), if all values are lower than the specified threshold, the term is not included in the clustermap.
Insert a new value and hit Apply to see how many terms are retained/discarded by the new threshold. Note that the default value of 1 will include all terms of a non-transformed table, as all p-values are necessarily 1 or lower (moreover, each row should automatically contain at least one value that was significant, at P=0.05, in the files retrieved from STRING, otherwise the term would not appear in the table in the first place).
To set a new threshold, for instance at P=0.001, one should input 0.001, or 3 when log-transforming in base 10. Always hit Apply.
Log base: choose the base for the logarithm.
DPI: choose the output image resolution in DPI (dots per inch). The higher the value, the larger the image.
Apply: Applies the current settings to the table, and shows how the settings impact on the table.
Choose terms..: This button opens a dialog to choose the terms. In the example shown, the results table contains terms that are irrelevant to the analysis being made. When loading a new table, all terms are automatically included, but the user can choose to untick the terms that are unwanted. If a new P-value cutoff is applied, restring remembers the user's choice, even if some of the terms are removed from the term list and added back to the table at a later time. Hit Apply & OK to apply the choice and close the window.
Choose col order: The user can reorder the column order by dragging the column names. Multiple adjacent columns can be selected and dragged together (this is ineffective if Cluster rows is flagged).
Hit OK to apply and close the window.

Reset: Reloads the input table and clears term selection.
Help: Opens a dialog that briefly outlines the procedure.
Online Manual: Opens the default browser at the clustermap help section.
Close: Closes the window.
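The log transform and P-value cutoff options described above amount to a simple row filter, sketched below. The function names and the example p-values are hypothetical; the logic follows the rule stated for the cutoff (a term is dropped when all of its values fail the threshold).

```python
import math


def minus_log(p_value, base=10):
    """Minus log-transform a p-value: -log(number, base chosen)."""
    return -math.log(p_value, base)


def keep_term(p_values, cutoff, log_transformed=False, base=10):
    """Return True if the term (row) is retained by the cutoff."""
    if log_transformed:
        # Dropped only if ALL transformed values are lower than the cutoff.
        return any(minus_log(p, base) >= cutoff for p in p_values)
    # Dropped only if ALL p-values are higher than the cutoff.
    return any(p <= cutoff for p in p_values)


# Invented rows: one p-value per comparison, missing values set to 1.
rows = {
    "Immune System": [0.2, 0.0004, 1.0],
    "Metabolism": [0.2, 0.04, 1.0],
}
kept_plain = [t for t, p in rows.items() if keep_term(p, 0.001)]
kept_log = [t for t, p in rows.items() if keep_term(p, 3, log_transformed=True)]
```

A cutoff of 0.001 on raw p-values and a cutoff of 3 on base-10 transformed values select the same terms. Note that floating-point rounding can matter for values lying exactly on the threshold, and that a p-value of exactly 0 cannot be log-transformed.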

Bubble plot
restring makes it easy to inspect the results by visualizing summary-type tables as bubble plots. In the menu, choose Analysis > Draw bubble plot to open the Draw bubble plot window. The bubble plot emphasizes the information gathered in summary-type tables, drawing, for each selected term, a bubble whose color reflects the FDR and whose size reflects the number of genes shared between all experimental conditions for that term.

Bubble plot options
log transform: if flagged, the p-values are minus log-transformed with the specified base: -log(number, base chosen). Hit Apply to apply.

P-value cutoff:
For each term (row), if all values are higher than the specified threshold, the term is not included in the bubble plot. For log-transformed values, for each term (row), if all values are lower than the specified threshold, the term is not included in the bubble plot.
Insert a new value and hit Apply to see how many terms are retained/discarded by the new threshold. Note that the default value of 1 will include all terms of a non-transformed table, as all p-values are necessarily 1 or lower (moreover, each row should automatically contain at least one value that was significant, at P=0.05, in the files retrieved from STRING, otherwise the term would not appear in the table in the first place). To set a new threshold, for instance at P=0.001, one should input 0.001, or 3 when log-transforming in base 10. Always hit Apply.

Reset: Reloads the input table and clears term selection.
Help: Opens a dialog that briefly outlines the procedure.
Online Manual: Opens the default browser at the bubble plot help section.
Close: Closes the window.

5 | Configuring the analysis
Species
restring defaults to Mus musculus. To choose another species, choose Analysis > Set species to open the dialog. STRING accepts species in the form of taxonomy IDs. Hit the button of your species of choice, or supply a custom TaxID and hit Set. Head over to STRING's documentation to find out whether your species is supported.

DE genes settings
restring accepts gene lists structured like the sample data. The input contains the gene name (we developed restring with the official gene name in mind as the preferred gene identifier, as in our experience that is what researchers most commonly use) and information about the direction of the change of expression with respect to experimental groups. This is the implied convention:

Set background
In the words of Szklarczyk et al. 2021's paper: "An increasing number of STRING users enter the database not with a single protein as their query, but with a set of proteins. [..] STRING will perform automated pathway-enrichment analysis on the user's input and list any pathways or functional subsystems that are observed more frequently than expected (using hypergeometric testing, against a statistical background of either the entire genome or a user-supplied background gene list)." By default, reString requests functional enrichment data against the statistical background of the entire genome. This is specified in the textual output during the analysis: Running the analysis against a statistical background of the entire genome (default). Otherwise, it is possible to specify a background that will be applied to all input files, via Analysis > Custom background. You will be prompted to open a .csv, .tsv, .xls or .xlsx file, which needs to be a headerless, one-column file containing your custom background entries. Alternatively, you can place one entry per line in a .txt file. During the analysis, this will be specified as follows in the textual output: Running the analysis against a statistical background of user-supplied terms.
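To see why the choice of background matters, the hypergeometric test mentioned in the quote can be sketched as follows. This is a textbook formulation with invented numbers, not STRING's own code, and it leaves out the multiple-testing correction that STRING applies; the function and variable names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, query_size, term_size, background_size):
    """P(X >= overlap) when drawing query_size genes from the background.

    term_size is the number of background genes annotated to the term.
    """
    upper = min(query_size, term_size)
    favourable = sum(
        comb(term_size, i) * comb(background_size - term_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(background_size, query_size)


# Invented numbers: 5 of 20 query genes belong to a 50-gene term.
p_genome = hypergeom_pvalue(5, 20, 50, background_size=20000)
p_custom = hypergeom_pvalue(5, 20, 50, background_size=500)
```

With the same overlap, the term looks far more surprising against a whole-genome background than against a small custom background in which the term's genes are comparatively common, so the resulting p-values differ.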

Clear Background
To clear (empty) the custom background and revert to the default background (the entire genome), choose Analysis > Clear custom background. A message will confirm that the background has been cleared.

Choosing a specific STRING version
In the menu, choose Analysis > Choose STRING version to open the dialog. reString is compatible with the output produced by STRING API version 11.0 and above. To get info about past and current STRING releases, see here or hit Info about STRING versions in the window. reString always defaults to the latest release, but for compatibility purposes other versions (11.0b or 11.0) can be selected.

Investigating an issue
Please let us know if you have any issue. From installation, to usage, to unforeseen application hiccups, there is a form to request assistance. Just hit New issue to start a new request. Files can be drag-and-dropped into the form as well, and your request can be previewed before being finalized.
To help us pin down the issue, please always include details about your machine setup (CPU, RAM, GPU, vendor) as well as your Python, OS and restring versions. To further help us investigate the matter, you should also include a procedure that allows us to replicate the issue in order to fix it (if possible, also include your input files). If you've encountered an issue, chances are that other people have already stumbled over the same problem, and the answer might already be available in the Issues section.

Reporting a bug
Most of the time, restring communicates what it's doing by printing messages in the application window(s). In rare circumstances, errors are printed only to the terminal window (the one you launched restring-gui from, which is still open). In addition to all the information needed to investigate an issue (see the section above), please include all terminal output (copy and paste the text, or attach a screenshot) in your bug report. Help us improve restring and report any bug here.

Requesting a new feature
If you feel that restring should include some awesome new feature, please let us know! We aim to make restring richer and more user-friendly.