Predicting deleterious missense genetic variants via integrative supervised nonnegative matrix tri-factorization

Among the many types of genetic variation, missense variants are a major class, and a small subset of them can disrupt protein function and ultimately lead to human disease. Various machine learning methods have been proposed to distinguish deleterious from benign missense variants using a large number of features, including structure, sequence, interaction networks, gene-disease associations, and phenotypes. However, a reliable and accurate algorithm for merging heterogeneous information is still needed, since it could capture the complex interactions in the networks in which genes participate. In this study, we propose a new method based on the non-negative matrix tri-factorization clustering method. We outline two versions of the proposed method: a two-source and a three-source algorithm. The two-source algorithm aggregates individual deleteriousness prediction methods and the PPI network, and the three-source algorithm additionally incorporates gene-disease associations. Four benchmark datasets were employed for internal and external validation of both algorithms. The results on all datasets confirmed that our method outperforms most state-of-the-art variant prediction tools. Two key features of our variant effect prediction method are worth mentioning. First, although incorporating gene-disease information in the three-source algorithm can improve prediction performance compared with the two-source algorithm, our method is not hindered by type 2 circularity error, unlike some recent ensemble-based prediction methods. Type 2 circularity error occurs when a predictor annotates variants on the basis of the genes on which they are located. Second, our predictor performs better than other ensemble-based methods for variants located on genes for which little pathogenicity information is available.


Supplementary Tables & Figures
Construction of the variant-variant network via the PPI network. nsSNVs drawn with the same shape are located on the same gene. For example, in this nsSNV dataset, three variants (shown as squares) are located on one gene. Each variant was mapped to its corresponding protein in the PPI network (violet lines). If a pair of proteins was connected, all of their variants were connected to each other (dashed green lines between squares and circles). In addition, all variants located on the same gene were connected to each other (solid lines in network number 3).
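The two connection rules in the caption above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_variant_network`, the input layout (a variant-to-gene dictionary and a list of protein pairs), and all identifiers in the toy data are hypothetical, and each gene is assumed to map to exactly one protein.

```python
from itertools import combinations


def build_variant_network(variant_gene, ppi_edges):
    """Return a set of undirected variant-variant edges.

    variant_gene: dict mapping a variant id to its gene/protein id (assumed 1:1).
    ppi_edges: iterable of (protein_a, protein_b) interaction pairs.
    """
    # Group variants by the gene they are located on.
    by_gene = {}
    for variant, gene in variant_gene.items():
        by_gene.setdefault(gene, []).append(variant)

    edges = set()
    # Rule 1: all variants on the same gene are connected to each other.
    for variants in by_gene.values():
        for a, b in combinations(sorted(variants), 2):
            edges.add((a, b))
    # Rule 2: if two proteins interact, connect all of their variants pairwise.
    for p, q in ppi_edges:
        for a in by_gene.get(p, []):
            for b in by_gene.get(q, []):
                if a != b:
                    edges.add(tuple(sorted((a, b))))
    return edges


# Toy data with made-up identifiers: three variants on G1, one each on G2 and G3.
variant_gene = {"v1": "G1", "v2": "G1", "v3": "G1", "v4": "G2", "v5": "G3"}
ppi_edges = [("G1", "G2")]
network = build_variant_network(variant_gene, ppi_edges)
```

In this toy input, `v5` stays isolated because its gene has no interaction partner and carries no other variant.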

Minimizing objective functions
To solve the optimization function of equation 5, we broke it into four subproblems: deriving U with V, S, and GY fixed; deriving V with S, U, and GY fixed; deriving S with V, U, and GY fixed; and deriving GY with V, U, and S fixed. These four subproblems update all values iteratively until a predefined termination criterion is met. Because these problems are non-convex, we solved them with multiplicative update rules 1. The process starts by initializing the matrix factors V, S, U, and GY using the random Acol strategy 2. As a result, the update rules for the four matrices are given below: Here, X and F are nonnegative matrices, defined as: In a similar manner, we solved objective function 6, and the matrices are given as: The derived matrices for the three-source algorithm, solving equation 8, are as follows: The derived matrices V, U, U at the testing phase (solving equation 9) are:
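To illustrate the alternating scheme, the sketch below applies standard multiplicative updates to a generic, unsupervised tri-factorization X ≈ U S Vᵀ with a plain Frobenius loss. It is not the objective of equation 5: the supervised factor GY and the regularization terms are omitted, the initialization is uniform random rather than random Acol, and the roles assigned to `U`, `S`, and `V` here are an assumption made only for the example. It uses plain lists to stay dependency-free; a numerical library would be the natural choice in practice.

```python
import random


def matmul(A, B):
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def mult_update(M, numer, denom, eps=1e-9):
    # Element-wise M * numer / denom; eps guards against division by zero.
    return [[m * n / (d + eps) for m, n, d in zip(rm, rn, rd)]
            for rm, rn, rd in zip(M, numer, denom)]


def frob_error(X, U, S, V):
    R = matmul(matmul(U, S), transpose(V))
    return sum((x - r) ** 2 for rx, rr in zip(X, R) for x, r in zip(rx, rr))


def nmtf(X, k1, k2, n_iter=200, seed=0):
    """Generic NMTF: X (n x m) ~ U (n x k1) S (k1 x k2) V^T (k2 x m)."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

    n, m = len(X), len(X[0])
    U, S, V = rand(n, k1), rand(k1, k2), rand(m, k2)
    Xt = transpose(X)
    errors = [frob_error(X, U, S, V)]
    for _ in range(n_iter):
        # Subproblem 1: update U with S and V fixed.
        H = matmul(S, transpose(V))
        U = mult_update(U, matmul(X, transpose(H)),
                        matmul(U, matmul(H, transpose(H))))
        # Subproblem 2: update V with U and S fixed.
        W = matmul(U, S)
        V = mult_update(V, matmul(Xt, W), matmul(V, matmul(transpose(W), W)))
        # Subproblem 3: update S with U and V fixed.
        UtU = matmul(transpose(U), U)
        VtV = matmul(transpose(V), V)
        S = mult_update(S, matmul(matmul(transpose(U), X), V),
                        matmul(matmul(UtU, S), VtV))
        errors.append(frob_error(X, U, S, V))
    return U, S, V, errors


# Toy nonnegative matrix with two visible row blocks.
X = [[5.0, 4.0, 0.0], [4.0, 5.0, 0.0], [0.0, 0.0, 3.0], [0.0, 1.0, 4.0]]
U, S, V, errors = nmtf(X, k1=2, k2=2)
```

Each update multiplies the current factor by a ratio of nonnegative terms, so the factors stay nonnegative without an explicit projection step, which is the main reason multiplicative rules are popular for this family of problems.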

Parameter selection
We needed three factorization ranks for both algorithms, kV, kS, and kD, which indicate the number of clusters of variants, scores, and diseases, respectively. To infer the rank parameters, we chose the ranks that stabilized the classification results across runs. Specifically, we ran both algorithms many times and estimated the stability by the dispersion coefficient of the consensus matrix (the mean of the connectivity matrices) 3.
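As a sketch of the stability measure, the code below builds a consensus matrix from several runs and computes the dispersion coefficient in its commonly used form, ρ = (1/n²) Σᵢⱼ 4 (Cᵢⱼ − 1/2)². It assumes that each run has already been reduced to one hard cluster label per item (for example by taking the largest entry in each row of a factor matrix); that reduction step, the function names, and the toy label vectors are assumptions for illustration.

```python
def consensus(label_runs):
    """Average the connectivity matrices of several runs.

    label_runs: list of label lists, one per run, all of the same length.
    Entry (i, j) of a connectivity matrix is 1 if items i and j share a cluster.
    """
    n = len(label_runs[0])
    C = [[0.0] * n for _ in range(n)]
    for labels in label_runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0
    runs = len(label_runs)
    return [[c / runs for c in row] for row in C]


def dispersion(C):
    """Dispersion coefficient: 1.0 when every consensus entry is 0 or 1."""
    n = len(C)
    return sum(4.0 * (c - 0.5) ** 2 for row in C for c in row) / (n * n)


# Toy runs: the first pair agrees up to label names, the second pair does not.
stable_runs = [[0, 0, 1, 1], [1, 1, 0, 0]]
unstable_runs = [[0, 0, 1, 1], [0, 1, 0, 1]]
rho_stable = dispersion(consensus(stable_runs))
rho_unstable = dispersion(consensus(unstable_runs))
```

Because the connectivity matrix only records whether two items share a cluster, the measure is insensitive to label permutations between runs.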
Both algorithms use two different groups of deleteriousness prediction scores. The first group consists of sequence conservation scores, which determine the deleteriousness of a variant by identifying nucleotide positions that are conserved across diverse species. The second group measures deleterious functional change of the protein by aligning multiple amino acid sequences. We therefore chose the score rank kS=2 to separate the conservation and functional scores. For the other two ranks, variant and disease, we searched a predefined interval (1 ≤ kV ≤ 250, 1 ≤ kD ≤ 250) and identified the ranks at which the dispersion coefficient begins to fall. Thus we set the ranks to kV=200 and kD=100.
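One possible reading of "the rank at which the dispersion coefficient begins to fall" is to scan the candidate ranks in increasing order and keep the last rank before the first noticeable drop. The sketch below implements that reading only; the tolerance `tol`, the function name, and the dispersion values are made up for illustration and are not results from this study.

```python
def rank_before_drop(ranks, dispersions, tol=0.02):
    """Return the last rank before the dispersion first drops by more than tol.

    ranks: candidate ranks in increasing order.
    dispersions: dispersion coefficient measured at each candidate rank.
    """
    for i in range(1, len(ranks)):
        if dispersions[i - 1] - dispersions[i] > tol:
            return ranks[i - 1]
    # No noticeable drop found: fall back to the largest candidate.
    return ranks[-1]


# Made-up curve: roughly flat up to rank 40, then a clear drop at rank 50.
candidate_ranks = [10, 20, 30, 40, 50]
toy_dispersions = [0.98, 0.98, 0.975, 0.97, 0.90]
chosen_rank = rank_before_drop(candidate_ranks, toy_dispersions)
```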

Penalization parameters
To ensure that all objective functions converged within some number of iterations, we bounded each penalization parameter between zero and the ratio of the main factorization term to the regularization term 4. For the two-source algorithm, the ranges of the parameters are defined as: For the three-source algorithm, these ranges were changed to: Consequently, we performed a grid search over γ = {0.0001γ, 0.001γ, 0.01γ, 0.1γ, γ} for all three regularization parameters. As a result, we set Gamma 1 and Gamma 2 to 10^-4 and 0.1, respectively, for the two-source algorithm, and Gamma 1, Gamma 2, and Gamma 3 to 10^-4, 0.1, and 5, respectively, for the three-source algorithm.
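The grid search can be sketched as follows, assuming that the γ on the right-hand side of the grid denotes the upper bound of each parameter's range. The scoring function, the upper bounds, and all names below are hypothetical stand-ins; in practice the score would come from validating the factorization on held-out variants.

```python
from itertools import product

# Scale factors taken from the grid described above.
SCALES = (0.0001, 0.001, 0.01, 0.1, 1.0)


def grid_search(upper_bounds, score_fn):
    """Try every combination of scaled upper bounds and keep the best score.

    upper_bounds: one upper bound per regularization parameter (assumed given).
    score_fn: callable taking one value per parameter; higher is better.
    """
    grids = [[scale * bound for scale in SCALES] for bound in upper_bounds]
    best_params, best_score = None, float("-inf")
    for params in product(*grids):
        score = score_fn(*params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(g1, g2, g3):
    # Hypothetical stand-in for a validation metric, peaked at (0.02, 0.2, 2.0).
    return -((g1 - 0.02) ** 2 + (g2 - 0.2) ** 2 + (g3 - 2.0) ** 2)


# Made-up upper bounds for three regularization parameters.
best_params, best_score = grid_search((2.0, 2.0, 2.0), toy_score)
```

With five scale factors and three parameters the search evaluates 125 combinations, which stays cheap as long as the number of regularization terms is small.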