A Protein Interaction Information-based Generative Model for Enhancing Gene Clustering

In computational bioinformatics, identifying the set of genes responsible for a particular cellular mechanism is essential for tasks such as medical diagnosis and disease gene identification. Accurately grouping (clustering) genes is an important step in understanding the functions of disease genes. In this regard, ensemble clustering is a promising approach that combines different clustering solutions to generate a more accurate gene partitioning. Recently, researchers have used generative models as an ensemble method to produce a consensus solution. In this paper, we develop a protein-protein interaction-based generative model that efficiently performs gene clustering. Utilizing protein interaction information as the generative model's latent variable enhances the model's efficiency in inferring the final probabilistic labels. The proposed generative model utilizes different weak supervision sources rather than any ground truth information. As weak supervision sources, we use a multi-objective optimization based clustering technique together with the world's largest gene ontology based knowledge base, maintained by the Gene Ontology Consortium (GOC). These weakly supervised labels are supplied to the generative model, which assigns probabilistic labels to all genes. A comparative study with respect to the Silhouette score, Biological Homogeneity Index (BHI) and Biological Stability Index (BSI) shows that the proposed generative model outperforms other state-of-the-art techniques.

unlabeled data. The main difficulty in acquiring labelled data is that the process is expensive and requires a great deal of human effort and knowledge: subject matter experts (SMEs) in the field are needed to develop the labelled information. While some large enterprises (https://www.wired.com/2016/11/googles-search-engine-can-now-answer-questions-human-help/, https://time.com/4631730/andrew-ng-artificial-intelligence-2017/) can bear this cost [23], most developers cannot.
There is a notable trend of using generative models [24] to exploit data from weak supervision sources and so overcome this bottleneck. These weak supervision sources, which synthesize labels by exploiting external knowledge bases [25], heuristic rules [26], noisy crowd labels [27], or even other classifiers [28], often have limited accuracy and coverage. As the resulting labels are conflicting and noisy, they are not regarded as gold standards. To resolve the conflicts, the dependencies and correlations between the sources must be inferred. In this respect, the generative model plays an important role in inferring probabilistic labels without access to the ground truth. The user-specified structure of the generative model directly impacts the precision of the inferred labels [29][30][31]. Recently, researchers at Stanford University proposed a new generative modelling paradigm named Snorkel [28,32]. Snorkel is widely used in various real-life problems such as surveillance with electronic health records [33], clinical text classification [34], and web content and event classification [35]. Snorkel has also been used for improving gene clustering [36] and medical image training [37].
Motivated by the facts stated above, we utilized the generative model of Snorkel to develop a novel gene clustering technique. In this work, the final probabilistic labels of the genes are inferred by using protein interaction information, weak supervision sources and Snorkel. Recently, researchers have used the generative model of Snorkel, without modifying its internal architecture, to improve gene clustering [36]. The novel contribution of this study is to integrate protein interaction information as a new parameter of the generative model of Snorkel. To the best of our knowledge, this integration of biological knowledge (protein interaction information) with a generative model is a new approach. To generate the weak supervision sources, we utilized a multi-objective optimization (MOO) based clustering technique [20] and Gene Ontology [38]. Recently, clustering methods based on multi-objective optimization [8,39] have been found to be efficient in solving various real-life clustering problems. The set of MOO-based clustering solutions, $\Pi = \{\pi_1, \pi_2, \ldots\}$ [36], consists of the non-dominated solutions, which serve as the weak supervision sources of the generative model. In the proposed approach, we integrate protein interaction information with the generative model so that it can label the gene expression data efficiently. The protein interaction information acts as a parameter of the generative model and helps to improve its accuracy. The final clustering solution is then evaluated by three cluster validity indices, namely the biological homogeneity index (BHI) [40], the biological stability index (BSI) [40] and the Silhouette index [41]. Experimental results indicate that the proposed technique achieves better outcomes than the state-of-the-art techniques. In short, the suggested strategy is a novel way of improving gene clustering from weak supervision sources by utilizing protein interaction information and a generative model.
For the ease of understanding of the readers, the list of mathematical logic symbols that are used throughout the article is shown in Table 1.
The current paper is unique in the following ways:
• A protein interaction based generative model is used for improving gene clustering. The model utilizes different weak supervision sources and infers a probabilistic clustering solution.
• As weak supervision sources, we use MOO-based solutions along with three Gene Ontology-based solutions.
The remaining part of the article is structured as follows. In the subsequent section, first, we provide the comprehensive description of the experimental evaluation along with a brief analysis of the performance for the proposed generative model. The next section provides a brief overview of the weak supervision sources and the proposed generative model. Finally we conclude the article by stating the uniqueness and future scope of the work.

Results
In this section, we analyze the performance of the proposed generative model when applied to gene expression profiles. First, we briefly describe the datasets. Then we provide a comparative performance analysis of different algorithms against our proposed generative model. Finally, a comprehensive discussion is presented, in which we analyze the performance of the developed model incrementally, i.e., new components are added one by one and the enhancement in performance is reported at each step.
Experiment results. We comprehensively evaluated the performance of the proposed protein interaction based generative model on three real-life NCBI datasets and compared it with different state-of-the-art techniques. As performance measures, we calculated two biologically oriented cluster validity indices, namely the biological homogeneity index (BHI) and the biological stability index (BSI), along with a traditional cluster validity index, the Silhouette index [41]. For comparison with existing works, we considered traditional clustering techniques, one multi-objective optimization based clustering technique, a multi-objective differential evolution (MODE) [42] approach, and a cluster ensemble technique. As traditional clustering techniques, we used two popular methods, namely K-means [43] and the density-based technique DBSCAN [44]. As the multi-objective optimization based clustering technique, we used an existing MOO-based clustering algorithm [20] in which three objective functions are optimized simultaneously: the Fuzzy Partition Coefficient (FPC), the PBM index and the DB index. For this MOO-based clustering, we report the best non-dominated solution for comparison. We also used a pairwise similarity based ensemble technique [45] as a state-of-the-art comparison method. In MODE [42], a multi-objective differential evolution algorithm, two objective functions are optimized simultaneously.
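To make the BHI measure concrete, the following is a minimal Python sketch of the commonly used definition: the fraction of annotated gene pairs within a cluster that share at least one functional class, averaged over clusters. The gene names, the annotation dictionary and the choice to skip clusters with fewer than two annotated genes are illustrative assumptions, not the evaluation code used in this study.

```python
from itertools import combinations


def biological_homogeneity_index(clusters, annotations):
    """Average, over clusters, of the fraction of annotated gene pairs
    that share at least one functional class.

    clusters    -- list of lists of gene identifiers
    annotations -- dict mapping gene -> set of functional classes
    Assumption: clusters with fewer than two annotated genes are skipped.
    """
    per_cluster = []
    for genes in clusters:
        annotated = [g for g in genes if annotations.get(g)]
        if len(annotated) < 2:
            continue
        pairs = list(combinations(annotated, 2))
        agreeing = sum(1 for a, b in pairs if annotations[a] & annotations[b])
        per_cluster.append(agreeing / len(pairs))
    return sum(per_cluster) / len(per_cluster) if per_cluster else 0.0


# Hypothetical toy data: two clusters, five genes.
toy_clusters = [["g1", "g2", "g3"], ["g4", "g5"]]
toy_annotations = {
    "g1": {"A"}, "g2": {"A", "B"}, "g3": {"B"},
    "g4": {"C"}, "g5": {"D"},
}
bhi = biological_homogeneity_index(toy_clusters, toy_annotations)
print(round(bhi, 4))
```

In the toy data, two of the three pairs in the first cluster share a class and the single pair in the second cluster does not, so the index is the mean of 2/3 and 0.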
Along with these state-of-the-art methods, we integrated different components with the generative model step by step, and we report the performance of the proposed architecture at each integration step. First, we integrated the MOO-based solutions using the generative model of Snorkel (Model 1: MO + GM). Here, only the partitioning solutions produced by the MOO-based technique are considered as weak supervised solutions. In the next step, we integrated the protein-protein interaction information with the generative model. In this integrated model (Model 2: MO + PPI + GM), the protein-protein interaction information is a parameter $\theta^{PPI}$ that specifies the strength of the accuracy factor, $\phi^{Acc}$, in the generative model, $p_\theta$. Lastly, in addition to the MOO-based solutions, three GO-based solutions are used as weak supervised solutions in the final integrated architecture (Model 3: MO + PPI + GM + GO). Because not all genes are labelled in the GO-based solutions, we did not use the GO-based solutions alone as weak supervision sources.
The comparative analyses of the performance of the proposed generative model against different state-of-the-art methods are shown in Tables 2, 3 and 4, in terms of BHI (Table 2), BSI (Table 3) and Silhouette score (Table 4). These tables show that performance improves incrementally as each module is added to the architecture. In general, the final integrated generative model (Model 3) obtained higher BHI and BSI values than the other existing models. For example, on the BCLL dataset, the BHI value of Model 3 is 0.361, which is a 50.42%, 9.06% and 4.64% improvement over the MOO-based ensemble technique, Model 1 and Model 2, respectively. On the ILD dataset, Model 3 attains a BHI score of 0.475, which outperforms the MOO-based ensemble technique, Model 1 and Model 2 by 11.24%, 4.86% and 3.26%, respectively. On the prostate dataset, Model 3 attains a BHI score of 0.451, which is a 10%, 1.3% and 0.6% improvement over the MOO-based ensemble, Model 1 and Model 2, respectively. Model 3 also achieves BSI scores of 0.994, 0.941 and 0.945 on the BCLL, ILD and prostate datasets, respectively.
In conclusion, the analysis above shows that the proposed integrated generative model obtains better performance in grouping the genes in terms of biological relevance. Also, to validate the effectiveness of the
Table 1. Glossary of variables and symbols used in the paper.

Discussion
In recent years, generative models have been used extensively in many fields, and their applications in the bioinformatics domain show a promising direction. However, this powerful method has been little explored for gene clustering. In computational biology, grouping genes with similar biological expression improves the diagnosis, prognosis and treatment of a particular disease. It has also been found that integrating information extracted from different related biological datasets improves performance on a specific biological task. In this regard, we have utilized protein interaction and Gene Ontology-based information to improve gene clustering. We integrated the different kinds of biological information at different steps of the generative model so that a noticeable increment in performance can be observed at each level of integration.
Generally, a generative model generates a solution by considering the correlations and dependencies of its inputs. The correlations are inferred by stochastic gradient descent (SGD) and Gibbs sampling. In this study, to capture the dependencies between the inputs, we utilized protein interaction information along with SGD and Gibbs sampling. A characteristic property of the genes is that their protein products have strong physical interactions with each other. Hence, the protein interaction information is used to infer the dependencies between the inputs.
In this study, the generative model is used as an ensembling model that takes different weak supervision solutions as inputs and infers a probabilistic solution by considering their interrelated dependencies. We integrated these weak supervision solutions by using three variants of the generative model, namely Model 1, Model 2 and Model 3. Model 1 refers to a vanilla model in which only the MOO-based solutions are used to infer the final probabilistic labels. The proposed MOO-based clustering technique generates a considerable number of optimized solutions, which guide the inference of the final probabilistic labels in Model 1. However, Model 1 assumes that all the MOO-based solutions have equal weights with regard to their accuracies, which can lead to misjudging the quality of the final inferred solutions. Hence, to assign appropriate weights to the different MOO-based solutions, we make use of protein-protein interaction information. In this regard, we developed Model 2, in which protein interaction information is processed to infer the weight of each solution. In these two models (Model 1 and Model 2), we did not take into account any ground truth information about the genes when inferring the final probabilistic labels. To enhance the biological relevance of the final solution, we added, along with the MOO-based solutions, three GO-based solutions obtained from a human-curated database, namely that of the Gene Ontology Consortium (GOC), which is the world's largest knowledge base of gene functions. The three solutions are generated by performing an enrichment analysis on the GOC using the PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification system. Finally, we developed an integrated generative model (Model 3) which exploits the GO-based solutions along with the MOO-based solutions.
As the GO-based solutions are generated from human-curated databases, the integration of these solutions enhances the performance of Model 3.
Given the above arguments, an important question is why we did not use the GO-based solutions exclusively, as they are considered to be near ground truth. The reasons are as follows:
• The number of GO-based solutions that we can obtain is quite low. Hence, inferring the final solution from these solutions alone is prone to over-fitting.
• The PANTHER classification system does not classify all the genes, as the Gene Ontology Consortium may not contain information for all the genes. Thus, the GO-based solutions do not provide labels for the unmapped genes.
For these two reasons, we did not use only GO-based solutions to infer the final solution. Integrating the two types of solutions improves the overall performance of the generative model in terms of three quality measures: BHI, BSI and Silhouette score. As the MOO-based solutions are reasonable in number and programmatically validated, they help to capture the interrelations between the solutions. The GO-based solutions, on the other hand, incorporate gene enrichment analysis information into the proposed generative model. In a nutshell, the two types of solutions are of equal importance in enhancing the model's performance. To validate the performance of the proposed generative model in terms of biological relevance, we carried out a biological analysis of the obtained gene clusters. Here we provide a thorough assessment of the biological enrichment of the acquired gene clusters by GOTermMapper (https://go.princeton.edu/cgi-bin/GOTermMapper). This finding confirms that the genes of a cluster detected by the proposed gene clustering method are more engaged in the same biological mechanism/function than the remaining genes of the genome.
www.nature.com/scientificreports

Methods and Materials
For the proposed weakly supervised ensembling technique, the key steps are summarized as follows:
• In the first step, we filtered out the redundant genes from the gene expression profiles. The remaining genes are used in the subsequent steps.
• The remaining genes are used to generate the base partitions (BP) by two different approaches:
1. In the first approach, we acquired the solutions by a weak supervision technique, namely a clustering technique based on multi-objective optimization (MOO).
2. In the second approach, we took Gene Ontology (GO) [38] into consideration for generating partitioning solutions.
• Finally, to obtain the consensus partitioning (solution), we utilized a generative model that takes the protein-protein interaction information into account.
In this study, a statistical test is used to select the differentially expressed (DE) genes. First, we filtered the genes based on their variances across the samples [48]. Then, the bootstrapped p-value [47] is used as a threshold to retain the statistically significant genes: genes with bootstrapped p-values less than 0.05 are considered statistically significant and are used for further data analysis. We applied this statistical preprocessing step to three real-life NCBI GEO datasets, namely B-CLL chronic lymphocytic leukemia [49], interstitial lung disease (ILD) [50] and a prostate dataset.

Generation of weak supervised solutions.
In any ensembling technique, generating diverse base partitionings is a crucial step towards an improved consensus partitioning. In this study, we used weak supervision to create the base partitionings. In weak supervision, rather than consulting trained subject matter experts (SMEs), the solutions (labels) are generated programmatically by analyzing heuristic patterns [52,53], crowd-sourced data [54,55] and external knowledge bases [25,56]. Data generated by weak supervision sources are therefore cheaper, but also noisier, with lower accuracy and coverage. To increase accuracy and coverage, weak supervised solutions are combined to generate the final probabilistic solution.
In this study, to maintain the diversity of weak supervision labels, we use two approaches: the first is a multi-objective optimization (MOO) based clustering technique, and the second exploits Gene Ontology. In the MOO-based technique, weak supervised solutions are generated programmatically by analyzing data patterns, whereas an external knowledge base is exploited to generate the Gene Ontology-based solutions. In this article, weak supervised solutions are analogous to weak supervised labels. The creation of these two types of weak supervised solutions is described in detail in the subsequent subsections.
c-means clustering [58] is used to assign labels for different genes … . In order to explore the search space extensively, variable length chromosomes are used along with three genetic operators: crossover, mutation and selection. After applying these three genetic operators, a new population is generated. In each generation, we simultaneously optimize three objective functions, and the best solutions are selected by applying the non-dominated sorting and crowding distance operators. The three objective functions are $f_1$ = Fuzzy Partition Coefficient (FPC) [58], $f_2$ = Pakhira-Bandyopadhyay-Maulik index (PBM index) [59] and $f_3$ = DB index [60]. Finally, a set of non-dominated solutions $\{S_1, S_2, \ldots, S_D\}$ is generated. These non-dominated solutions lie on the Pareto optimal front, which is shown in Fig. 2.
For each non-dominated solution $S_i$, a label $L_i$ is generated by a corresponding labeling function $\lambda_i$. These labels, $\{L_1, L_2, \ldots, L_D\}$, are generated programmatically and are considered as the weak supervised labels. These weak supervised labels are then encoded into the proposed generative model. The set of non-dominated solutions that forms the Pareto optimal front is shown in Fig. 2.
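The selection of non-dominated solutions can be illustrated with a short Python sketch. It assumes the usual conventions that FPC and the PBM index are maximized while the DB index is minimized, and it uses a simple pairwise dominance check rather than the full non-dominated sorting and crowding distance machinery of the MOO algorithm. The candidate names and objective values are hypothetical.

```python
def to_maximization(fpc, pbm, db):
    """Express all three objectives as 'larger is better'.
    Assumption: FPC and PBM are maximized, the DB index is minimized."""
    return (fpc, pbm, -db)


def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly
    better on at least one (all objectives in maximization form)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(candidates):
    """Return the names of candidates not dominated by any other candidate."""
    scored = {name: to_maximization(*objs) for name, objs in candidates.items()}
    return [
        name for name, objs in scored.items()
        if not any(dominates(other, objs)
                   for other_name, other in scored.items() if other_name != name)
    ]


# Hypothetical (FPC, PBM, DB) values for four candidate clusterings.
candidates = {
    "A": (0.80, 5.0, 0.60),
    "B": (0.70, 6.0, 0.55),
    "C": (0.75, 4.5, 0.70),  # worse than A on all three objectives
    "D": (0.85, 4.0, 0.65),
}
front = non_dominated(candidates)
print(front)
```

Each name in `front` would then play the role of one solution $S_i$, and its cluster assignment the role of one weak supervised label $L_i$.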
GO-based solutions. To maintain diversity among the weak supervision sources, we exploited Gene Ontology (GO) in addition to the MOO-based solutions. Gene Ontology (GO) [38] is the world's largest knowledge base of gene functionality. This knowledge base is both human-readable and machine-readable and is a foundation for the computational analysis of large-scale molecular biology and genetic experiments in biomedical research. In this study, this functional knowledge base of genes is considered as a weak supervision source. To generate the weak supervised solutions, a Gene Ontology enrichment analysis is performed on the preprocessed gene expression profile (G). The enrichment results reveal the associations between gene sets and GO terms. The enrichment analysis is carried out with the PANTHER (Protein ANalysis THrough Evolutionary Relationships) [61] classification system. The PANTHER classification is the result of annotation and curation by subject matter experts (SMEs).
In this task, PANTHER generates gene labels with respect to three biological aspects, namely molecular function (MF), biological process (BP) and cellular component (CC). These three aspects are considered as three weak supervised labeling functions, i.e., $\lambda_{MF}$, $\lambda_{BP}$, $\lambda_{CC}$. For each weak supervised labeling function, a list of shared GO terms $(GO_1, GO_2, \ldots, GO_L)$ is generated, where each shared GO term consists of a set of genes, i.e., $GO_l = \{g_1, g_2, \ldots, g_P\}$, where $P$ represents the number of genes of that particular shared GO term. Each labeling function generates a multi-label solution $(L_{MF}, L_{BP}, L_{CC})$ in which the genes associated with a particular GO term are assigned a unique label.
These solutions $(L_{MF}, L_{BP}, L_{CC})$ are also considered as weak supervised solutions along with the MOO-based solutions (described in the previous subsection) and are used for constructing the consensus partitioning with the proposed generative model. In these solutions, the genes are labelled according to their shared GO term $(GO_i \mid i \in [1, L])$ based classification. Since not all genes are mapped in the Gene Ontology Consortium, $\lambda_{MF}$, $\lambda_{BP}$, $\lambda_{CC}$ are considered to abstain from labelling those genes. Hence, in each of these GO-based solutions $(L_{MF}, L_{BP}, L_{CC})$, some genes are left unlabelled. Although some of the genes are unlabelled in these solutions, the labels of the remaining genes can be considered close to the ground truth. As these GO-based solutions are generated by exploiting a biomedical knowledge base, they help to increase the performance of the generative model. In the Results section, we have also shown that the addition of these GO-based solutions improves the performance of the generative model compared to the traditional generative model.

Inception of generative model. The core concept of the proposed architecture is the generative model. The developed generative model takes different weak supervision sources and infers a list that contains probabilistic labels for all the samples. The key challenge of the approach is determining how to integrate weak supervision labels which have unknown correlations, unknown accuracies and different levels of granularity. Hence, this integration phase is a critical step in shaping the performance of the model. In this regard, the generative model plays an essential role in overcoming this roadblock. The performance of such a generative model is highly dependent on its structure, as a proper structure helps in inferring accurate correlations between the weak supervision labels.
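As an illustration of how the weak supervision labels could be assembled before integration, the sketch below builds a label matrix whose rows are genes and whose columns are labeling functions, with unmapped genes marked as abstentions. The value -1 for an abstention, the rule that the first matching GO term wins, and all gene names and term memberships are assumptions made for this example only.

```python
ABSTAIN = -1  # assumed marker for "labeling function gives no label"


def go_labeling_function(shared_terms):
    """Build a labeling function from a list of shared GO terms, each given
    as a set of genes. Genes in no term are abstained on.
    Assumption: a gene in several terms keeps the first term's label."""
    gene_to_label = {}
    for label, genes in enumerate(shared_terms):
        for gene in sorted(genes):
            gene_to_label.setdefault(gene, label)
    return lambda gene: gene_to_label.get(gene, ABSTAIN)


def build_label_matrix(genes, labeling_functions):
    """Rows are genes, columns are labeling functions."""
    return [[lf(gene) for lf in labeling_functions] for gene in genes]


def coverage(label_matrix):
    """Fraction of genes labelled (not abstained on) by each labeling function."""
    n = len(label_matrix)
    width = len(label_matrix[0])
    return [sum(row[j] != ABSTAIN for row in label_matrix) / n for j in range(width)]


# Hypothetical data: one MOO-based solution and one GO-based (MF) solution.
genes = ["g1", "g2", "g3", "g4"]
moo_solution = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
lambda_moo = lambda gene: moo_solution[gene]
lambda_mf = go_labeling_function([{"g1", "g2"}, {"g3"}])  # g4 is unmapped

label_matrix = build_label_matrix(genes, [lambda_moo, lambda_mf])
print(label_matrix)
print(coverage(label_matrix))
```

The MOO-based column covers every gene, whereas the GO-based column abstains on the unmapped gene, which is one reason the GO-based solutions are not used on their own.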
In this study, we developed a generative model which acts as a framework for integrating weak supervision sources to infer the labels of the genes. To accomplish this, we modified a popular generative model named Snorkel [28] by utilizing protein-protein interaction information. The workflow of Snorkel differs from traditional approaches and is built upon a new machine learning paradigm called data programming [62]. Snorkel offers a trade-off between training time and model performance, and its structure helps in predicting accurate class labels automatically. The application of Snorkel in top industries, research labs and government agencies shows its wide-ranging capabilities in building improved models.
Motivated by the success of Snorkel in a wide range of domains, we utilized a modified version of it for improving gene clustering. In our case, we modified the generative model part of Snorkel. Let the generative model $p_\theta$ integrate the weak supervision labeling functions obtained from MOO-based clustering and Gene Ontology, i.e., $\lambda_1, \lambda_2, \ldots, \lambda_D, \lambda_{MF}, \lambda_{BP}, \lambda_{CC}$. In general, the labeling functions of a generative model are assumed to be independent of, or uncorrelated with, each other. In the proposed generative model, however, we considered the statistical dependencies between the labeling functions. This dependence enhances the generative model's predictive accuracy. Finally, the label of each data point (gene) is modelled as a latent variable by the generative model. The proposed generative model ($p_\theta$) is designed as a factor graph [63], which is a kind of probabilistic graphical model that includes two kinds of nodes: evidence variables and factors. The factors describe the relationships between the evidence variables in the factor graph.
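In this work, the $D$ labels $\{L_1, L_2, \ldots, L_D\}$ acquired from MOO-based clustering and the three Gene Ontology-based labels $\{L_{MF}, L_{BP}, L_{CC}\}$ are interpreted as the evidence variables of the factor graph. Following the Snorkel formulation [28], these $D + 3$ labels help to define three types of factors:
• Labeling propensity: $\phi^{Lab}_{i,j}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} \neq \emptyset\}$
• Accuracy: $\phi^{Acc}_{i,j}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} = y_i\}$
• Pairwise correlations: $\phi^{Corr}_{i,j,k}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} = \Lambda_{i,k}\}$ (1)
where $\Lambda_{i,j}$ represents an element of the label matrix $\Lambda$ and is defined as $\Lambda_{i,j} = \lambda_j(g_i)$, $y_i$ is the latent label of gene $g_i$, and $\emptyset$ denotes an abstention. We calculated these three factors for a particular gene $g_i$ and concatenated them into a vector $\phi_i(\Lambda, Y)$ for all $D + 3$ labeling functions. The proposed probabilistic generative model is described as
$$p_\theta(\Lambda, Y) = Z_\theta^{-1} \exp\Big(\sum_i \theta^{T} \phi_i(\Lambda, Y)\Big) \quad (2)$$
where $Z_\theta$ is the normalizing constant. Since the true labels $Y$ are not observed, the parameters are estimated by minimizing the negative log marginal likelihood of the observed label matrix:
$$\hat{\theta} = \arg\min_\theta \; -\log \sum_Y p_\theta(\Lambda, Y) \quad (3)$$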
In a general generative model, the values of the parameters ($\theta$) are estimated by Eq. 3. These parameters estimate the strength of the three factors of the generative model. In our model, the parameters for two of the three factors ($\phi^{Lab}$, $\phi^{Corr}$) are estimated by Eq. 3, and the parameters for the remaining factor ($\phi^{Acc}$) are calculated by utilizing protein-protein interaction information. More precisely, the accuracy parameter values for the MOO-based solutions are generated from protein-protein interaction information, whereas the accuracy parameter values for the GO-based solutions are estimated by Eq. 3. The accuracy parameter for a particular non-dominated solution ($S_i$) is represented as $\theta_i^{PPI}$. Hence, in Eq. 2, the accuracy parameters of the MOO-based labeling functions are replaced by $\theta_i^{PPI}$. The integration of protein interaction information with the generative model, along with the underlying architecture, is shown in Fig. 3.
The parameter $\theta_i^{PPI}$ is generated by exploiting an updated protein-protein interaction resource named HitPredict [64]. HitPredict is a resource of experimentally determined protein-protein interactions with reliability scores ($\alpha_{ij}$). The confidence score $\alpha_{ij}$ of proteins $g_i$ and $g_j$ denotes the reliability of their interaction and is the geometric mean of an annotation-based score and a method-based score. The annotation score is calculated from the GO annotations of the interacting proteins. The method score is calculated by considering the experimental evidence for the interaction between the proteins. As $\alpha_{ij}$ takes into account both the experimental support for the interaction and the genomic features of the interacting proteins, it is considered a reliable source of protein-protein interaction information.
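To illustrate why the accuracy parameters matter, the following Python sketch computes the probabilistic label of a single gene from its row of weak labels under an accuracy-only model, in which the probability of cluster $y$ is proportional to the exponential of the summed accuracy weights of the sources that voted for $y$. This is a simplified illustration, not the implementation used in this study: it assumes cluster labels are already aligned across sources, treats -1 as an abstention, and uses hypothetical weight values in place of PPI-derived ones.

```python
import math

ABSTAIN = -1  # assumed marker for an abstaining labeling function


def posterior(label_row, accuracy_weights, num_clusters):
    """P(y | labels of one gene) under an accuracy-only model:
    score(y) = sum of the weights of the sources voting for y,
    followed by a numerically stable softmax over clusters.
    Abstentions (ABSTAIN) never match a cluster and so contribute nothing."""
    scores = [
        sum(w for label, w in zip(label_row, accuracy_weights) if label == y)
        for y in range(num_clusters)
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One gene, three weak sources: two vote for cluster 0, one for cluster 1.
row = [0, 0, 1]

# Equal weights (every source trusted equally): the majority wins.
p_equal = posterior(row, [1.0, 1.0, 1.0], num_clusters=2)

# Hypothetical unequal weights: the third source is trusted far more,
# so the inferred label flips to cluster 1.
p_weighted = posterior(row, [0.5, 0.5, 2.0], num_clusters=2)

print([round(p, 3) for p in p_equal])
print([round(p, 3) for p in p_weighted])
```

With equal weights the two agreeing sources dominate, whereas giving the third source a larger weight moves most of the probability mass to its cluster. This is the kind of reweighting that an informative accuracy parameter makes possible.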
For a particular non-dominated solution ($S_i$) which consists of a set of clusters $\{C_1, C_2, \ldots, C_K\}$, $\theta_i^{PPI}$ is calculated from the cluster scores $CS(C_r)$ of its clusters. For each cluster $C_r$, $CS(C_r)$ is calculated from $n_r$, the number of genes present in the cluster $C_r$, and $Q$, the number of protein-protein interactions extracted from HitPredict [64] for all the genes of $C_r$. As $\theta_i^{PPI}$ of a non-dominated solution ($S_i$) is generated by utilizing the protein interaction information, $\theta_i^{PPI}$ helps to understand the biological significance

Scalability of the Proposed Approach
The proposed approach consists of two subtasks (generating the weak supervised solutions and inferring labels from those generated solutions) that together infer the probabilistic labels of the genes. In this section, we discuss the time complexities of the different subtasks along with the overall time complexity of the proposed approach.
• For generating the weak supervised solutions, we use our proposed multi-objective optimization based clustering technique. NSGA-II is used as the underlying multi-objective optimization technique, which has a time complexity of $O(mn^2)$, where $n$ is the size of the population and $m$ is the number of objective functions.
Hence, the proposed approach runs in polynomial time. This time complexity analysis suggests that the proposed approach scales to datasets of different sizes. In the current paper, the proposed technique is applied to datasets with varied numbers of genes (ranging from 2000 to 18000) and samples (ranging from 21 to 104). The results also indicate that the proposed system is robust irrespective of the dataset size.

Conclusion
In this paper, we utilize different weak supervision sources with a newly developed generative model for improving gene clustering. Rather than using any labelled data, we rely on different weak supervision sources to perform the desired task. Hence, our model overcomes the bottlenecks related to subject matter experts and manual annotation time. The proposed generative model utilizes weak supervision sources along with protein interaction information to infer the correlations and dependencies of the different sources. As weak supervision sources, we utilized a multi-objective optimization-based clustering technique along with three Gene Ontology-based solutions. These GO-based solutions help to improve the performance of the generative model, as they are generated from a biomedical knowledge base. Also, the use of protein interaction information as the latent variable of the proposed generative model helps to improve its performance. The obtained results show the superiority of the proposed method over other existing methods in terms of the biological homogeneity index (BHI), biological stability index (BSI) and Silhouette index. Finally, biological analyses were conducted to validate the obtained results.
In the future, we will use the proposed ensemble method to perform various biomedical tasks where the real class labels are not available. We will also attempt to develop an enhanced version of the ensemble method by modifying the generative model's variables so that it performs the task more accurately.

Data availability
The source code and all datasets used in this study are available at https://github.com/sduttap16/PPI_Generative.