Improving gene function predictions using independent transcriptional components

The interpretation of high throughput sequencing data is limited by our incomplete functional understanding of coding and non-coding transcripts. Reliably predicting the function of such transcripts can overcome this limitation. Here we report the use of a consensus independent component analysis and guilt-by-association approach to predict over 23,000 functional groups comprised of over 55,000 coding and non-coding transcripts using publicly available transcriptomic profiles. We show that, compared to using Principal Component Analysis, Independent Component Analysis-derived transcriptional components enable more confident functionality predictions, improve predictions when new members are added to the gene sets, and are less affected by gene multi-functionality. Predictions generated using human or mouse transcriptomic data are made available for exploration in a publicly available web portal.

We would like to thank the reviewers for their helpful comments. These constructive remarks have enabled us to improve our manuscript. Below, please find a point-by-point rebuttal.

Referee 1
Remark #1: "... from biological perspective, it is not well-written, and it is not so informative. Actually, in this paper, the authors just used of one biological example "DNA Repair" gene-set" Reply: We agree with this reviewer that the manuscript lacked the biological perspective of our findings. To expand the biological perspective of the manuscript, we added several additional analyses. Corfs predicted to be involved in estrogen receptor signaling and response ( Figure 6B). In this cluster, C6orf141 has recently been suggested to be estrogen receptor alpha regulated in breast cancer cells [1] (Supplementary Data 4). These results show that GENETICA can predict the functionality of uncharacterized genes. " We added these analyses in a new Figure 6: We added the following legend for Figure 6: "Predicted functions for uncharacterized genes using the Hallmark collection.
A) Boxplot of ICA-TC based (blue) and PCA-TC based prediction scores (red) of Corf and LOC genes, for the Hallmark gene set collection (x-axis). The vertical lines of boxes correspond to the median prediction scores; hinges correspond to first and third quartiles of the data; whiskers extend up to 1.5 times the interquartile range; outliers are plotted as dots.
B) Heatmap depicting hierarchical clustering of Corfs and LOC genes based on the ICA-TC based prediction scores for the Hallmark gene set collection. Clusters were obtained with a height cutoff of 0.8 resulting in 116 clusters with a median size of 5 members. Insets depict two clusters containing Corfs and LOC genes predicted to be involved in DNA repair, and estrogen receptor response. Genes marked with an asterisk have been recently implicated in the predicted biological processes based on experimental work. " Secondly, we predicted the function of genes prioritized with CRISPR based genetic screens in three additional biological contexts (Figure 8), namely the inflammatory response to bacterial lipopolysaccharide, viral entry, and lysosome accumulation.
We added the following sentences to our Results section on Page 11, Line 11 : "Additionally, we investigated three genome-wide CRISPR screens that aimed to capture genes involved in the inflammatory response to bacterial lipopolysaccharide [2], lysosomal accumulation [3], and viral entry [4].
A co-functionality network was built using all 111 hit genes of the response to bacterial lipopolysaccharide (LPS) CRISPR screen performed on dendritic cells ( Figure 8A) [2]. Two clusters were observed. The first cluster showed high prediction scores for Immunological Signatures gene sets defining the response to TREM-1, a receptor involved in amplifying the cellular response to LPS, and gene sets defining the response to F. tularensis, a pathogen that produces LPS ( Figure 8A). In contrast, the second cluster showed a high prediction score for an Immunological Signatures gene set defining reactive oxygen species-induced genes within dendritic cells ( Figure 8A). These results show the possibility to use a co-functionality network to identify distinct biological processes that are related to the phenotype under investigation in a CRISPR screen.
In the genome-wide CRISPR-knockout screen identifying 16 genes essential for lysosomal integrity, a network was built with GO Biological Process and KEGG gene set collections [3].
The ICA-TC based prediction scores of the 16 genes for KEGG lysosome gene set ranged from 0.8 to 2.6, indicating that the ICA-TC method did not confidently predict these genes to play a role in lysosome-related processes ( Figure 8B). Reassuringly, the ICA-TC based method still generated higher prediction scores than the PCA-TC method for the KEGG lysosome gene set. This suggests that this particular CRISPR screen identified genes that either have unique expression patterns that are incompatible within a guilt-by-associationstrategy.
The third genome-wide screen, studying genes involved in Ebola virus infection prioritized 13 genes [4]. Within the original publication, GNPTAB, a gene encoding a protein normally involved in the production of mannose-6-phosphate, was subsequently validated.
Concordantly, GNPTAB showed the highest ICA-TC-based prediction score (Prediction score = 6.4) for negative regulation of viral transcription by the host in the GO biological Process gene set collection ( Figure 8C). The other 12 genes showed prediction scores ranging from 0.7 to 2.6 for the respective gene set, demonstrating how ICA-TC-based prediction scores can be used to prioritize genes for further validation. " These data are now added in a new Figure 8: We added the following legend for "Many genes do not yet have a defined function. Therefore, gene function prediction methods remain an important tool for researchers. The critical assessment of functional annotation (CAFA) is a famous benchmark of methods that use protein sequence data to predict gene functions via gene sets [5]. Other data such as gene cross-species homology, protein-protein interaction, mRNA transcription, essentiality, and semantics can also be used to predict gene functions [6][7][8][9][10][11]. One limitation of using mRNA expression data is that some information about the gene function can only be found at the protein level using expression, interaction, or sequence data. One advantage of mRNA expression data is the greater amount of publicly available profiles than protein experiment data.
New methods such as embeddings and neural networks can identify complex relationships from protein and mRNA expression data which serve to improve gene function predictions.
For example, autoencoders are a neural network architecture currently used to infer a latent representation of the mRNA transcription patterns obtained from both bulk and single cell samples [12,13]. This representation can be used as a regulatory barcode to generate gene function prediction scores within a guilt-by-association strategy. Convolutional neural networks and deep neural networks have also been used to directly improve gene function prediction from the protein sequences [14,15]. The recently developed embedding technique Uniform Manifold Approximation Projection (UMAP) has also been utilized to predict novel protein interactions when processing mRNA expression of different gene knockout experiments [16]. Predicted protein interactions can also be used to predict function predictions by propagating the interacting partner functions. "

Remark #3:
"provide a list of gene-sets that the gene-set membership has been improved by the developed approach and show that, whether all of the gene-sets have been improved? if not, explain why? " Reply: To address this remark, we compared PCA-TC based median prediction scores with ICA-TC based prediction scores for all gene sets and calculated a difference (delta) Supplementary Data 1. This analysis revealed that only a minority (3.4%) of gene sets are not improved and comprise a smaller set of genes than the improved gene sets (median gene members is 23 and, 57 respectively.). Gene sets derived from one study (NIKOLSKY prefix) were overrepresented in the top 20 PCA-improved subset. These gene sets represented amplified genomic loci in breast cancer samples that were not associated with any biological process in the original publication [17].
We added the following sentences to our Results section on Page 6, Line 7 : "To explore if a subset of gene sets is better predicted with the PCA-TC based method the differences between gene set prediction scores of both PCA and ICA methods were calculated (delta). Gene sets with deltas higher than zero were defined as ICA-improved and below zero as PCA-improved gene sets. We observed that 811/23,413 (3.4%) gene sets belonged to the PCA-improved subset ( Supplementary Fig. 1). Gene sets belonging to the PCA-improved subset had fewer members (median = 23) than the ICA-improved subset (median = 57). " We added Supplementary  To showcase that our analysis captured the transcriptional activity at different timepoints in time-series experiments we selected two studies from our input mRNA dataset.
In the first study (GSE67684) mRNA profiles of 210 blood or bone marrow biopsies were generated before and 8 days after remission induction therapy in children with de-novo acute lymphoblastic leukemia [18]. In the second study (GSE12548) triplicate mRNA profiles of ARPE-19 cells were generated at different timepoints before and after combined treatment with TGF-beta and TNF-alpha [19]. Using UMAP to visualize the weights in the independent components corresponding to the patient samples reveals that their transcriptional profiles tend towards a common center after remission induction therapy (Peer Review Fig. 1A) [20]. Similarly, ARPE-19 cell replicates consistently showed changes in expression that shift their position in the embedded space (Peer Review Fig. 1B).
Peer Review Figure 1: UMAP embedding of two time series experiments. A) Linked scatterplot of 2D UMAP embedding coordinates of cancer patient biopsies before and after 8 days of treatment (GSE67684). UMAP was performed on the independent component weights corresponding to these samples. Color corresponds to timepoint. Transparency of point and line corresponds to Euclidean distance traveled by the biopsy transcriptional profile between timepoints. B) Scatterplot of 2D UMAP embedding coordinates for ARPE-19 replicates at different timepoints of TGF-TNF treatment (GSE12548). Color corresponds to timepoint.

Referee 2
We did not receive any comments from the second invited reviewer.

Referee 3
We would like to thank the reviewer for the acknowledgement that the manuscript is mostly well-written and that the reviewer finds our ICA-TC framework highly valuable for unbiased analyses of genes and their functions. We are also happy with his congratulations on this important contribution and his belief that our work will be highly relevant and useful to the community.

Remark #1:
"The focus on Affymetrix gene expression technology seems a bit outdated. More than 80% of all GWAS associations typically fall outside coding regions and non-coding genes are understudied compared to protein-coding genes. Therefore, it would be very useful if the ICA-TCs were computed based on large-scale RNA-seq data instead of microarray data.
RNA-seq data could provide predictions for the much larger set of GENCODE genes, results that supposedly would be extremely valuable to better understand contributions of non-coding genes to complex traits and disease. Also, from a technical perspective, how much do their predictions increase when using state-of-the-art RNA-seq data?." Reply: We thank the reviewer for this recommendation and completely agree that it would greatly improve the impact of our method. To address this remark we downloaded and generated gene-level counts (n = 58, 433) for quality-controlled samples (n = 29, 138) provided in the GeneNetwork Assisted Diagnostic Optimization manuscript [8]. We kept all parameters and gene set collections consistent with the microarray version of our framework except the explained variance set to 85% for computational restrictions. Using RNA-seq data did not improve the average prediction scores in most gene set collections when applying the PCA-TC based method ( Figure 9A). By contrast, average prediction scores improved when applying the ICA-TC based method ( Figure 9B). This new version of our framework is now also available at http://www.genetica-network.com.
We added the following sentences to our Results section on Page , Line :  Figure 9C). Predictions obtained using the ICA-TC based method had a lower association to multifunctionality for the microRNA Targets gene set collection when using RNA-seq data as input in comparison with microarray data (AUC 0.12, p-value 5.1 × 10 −40 )( Figure 9D). Predictions obtained using the ICA-TC based method had a higher association to multifunctionality for the Immunological signatures gene set collection when using RNA-seq data as input (AUC 0.59, p-value 6.6 × 10 −58 ). The subset of gene set median prediction scores improved by microarray data in comparison to RNA-seq data when using ICA-TC based method was 14.83%. These results show that the ICA-TC based method can leverage RNA-seq profiles to improve the predictions in some gene set collections. The prediction scores based on the RNA-seq dataset have also been made available at http://genetica-network.com. " These data are now added in a new Figure 9: We added the following legend for Figure 9:

Remark #2:
"Besides the narrow focus on microarray data, another concern is that lack of single-cell RNA-seq data. Can their approach be applied on single-cell RNA-seq data which is typically sparse and where less plentiful compared to bulk expression data?. Would be great if the authors, at least, could include examples on how their approach works when e.g. applied on all single-cell data from some of the large multi-organ and comprehensive brain atlases. For instance, will much few samples be needed because the data is much more specific than e.g. bulk microarray data? " Reply: We thank the reviewer for this observation. We agree that single cell data captures a higher resolution of the mRNA expression of single cells and could potentially improve gene predictions. However, single-cell profiles are currently very sparse with sometimes only 10% of genes with non-zero measurements. This complicates the application of our method since more samples are required to have enough non-zero measurements to calculate a complete gene-by-gene covariance matrix. Additionally, we have not yet determined which cross-study normalization procedure is the most appropriate preprocessing for our method before the calculating the covariance matrix.
We believe that the biopsy variety of publicly available experiments is still higher in bulk mRNA data. For example, we explored panglaoDB and extracted all human experiments performed using 10x chromium technology (n = 239, cells = 397,071) [23]. The majority of experiments were performed on lymphoid tissues, blood, prostate, testis, brain, kidney and liver. In contrast, publicly available bulk mRNA experiments are obtained from healthy, diseased and cancerous complex biopsies of many more tissues in addition to cell line perturbation experiments. Since our method relies on the variety of mRNA states to extract reliable transcriptional components with ICA, this hampers the attractiveness of currently available single cell mRNA profiles for gene function prediction. Still, we cannot rule out that single cell could enable better function predictions when more profiles and future technologies become available.

Remark #3:
"There is no mentioning of species. Are these human data or a mix between human, mouse and rat samples? What are the differences in their benchmarks between species? " Reply: In our original analysis we only used samples generated with the Affymetrix HU-113 Plus 2.0 platform (n = 106, 462), which is a human microarray. We did calculate predictions for a mouse gene set collection (Mammalian Phenotypes) using the human or-  (Supplementary Fig. 2). The mouse prediction scores are available at genetICA-network.com.
We added the following sentences to our Results section on Page 8, Line 6 : Remark #7: " Table not properly described. For instance, it is not clearly stated that GENETICA is the method proposed by the authors" Reply: We changed the following Table: Reply: Again, we apologize for an oversight that resulted in the download links not being available in the website. We provide for the human microarray, mouse microarray and RNA-seq profiles the following: metadata, independent components, mixing matrix, gene set collections used to generate predictions, transcriptional barcodes for every gene set and prediction score tables for every gene set collection. This is findable in the Support Data tab in the website corresponding to the active dataset version selected in the website header. Additionally, we provide in the About tab of the website a link to the executable program used to run consensus independent component analysis and to generate prediction scores. We provide a README that explains how to set up and reproduce our results.
We would like to warn that modern browsers may sometimes temporarily save a cache of the nonupdated website which can prevent the display of the updated website with the Remark #3: "Finally, from a technical perspective, would it be possible to extract ICA-TCs simultaneously based on the microarray and RNA-seq data? " Reply: It is technically possible to append samples from different platforms and perform independent component analysis on such hybrid dataset as long as the set of genes is reduced to the matching subset between datasets. However, because the human microarray (n = 106, 462) contains more samples than the RNA-seq dataset (n = 29, 138) this could bias the gene-gene covariance matrix towards the microarray profiles and therefore bias the determination of transcriptional components. Although this bias could potentially be balanced by downsamplig the microarray dataset, the reduced sample size would miss the goal of performing such joint analysis. Since the RNA-seq dataset provides comparable prediction scores for more genes using only a third of the samples we believe that running a