Scoring the correlation of genes by their shared properties using OScal, an improved overlap quantification model

Scoring the correlation between two genes by their shared properties is a common and basic work in biological study. A prospective way to score this correlation is to quantify the overlap between the two sets of homogeneous properties of the two genes. However the proper model has not been decided, here we focused on studying the quantification of overlap and proposed a more effective model after theoretically compared 7 existing models. We defined three characteristic parameters (d, R, r) of an overlap, which highlight essential differences among the 7 models and grouped them into two classes. Then the pros and cons of the two groups of model were fully examined by their solution space in the (d, R, r) coordinate system. Finally we proposed a new model called OScal (Overlap Score calculator), which was modified on Poisson distribution (one of 7 models) to avoid its disadvantages. Tested in assessing gene relation using different data, OScal performs better than existing models. In addition, OScal is a basic mathematic model, with very low computation cost and few restrictive conditions, so it can be used in a wide-range of research areas to measure the overlap or similarity of two entities.


Section 1. Comparing and analyzing models by their solution space
Overlap of two sets is denoted by a triple data (d, R, r), and evey triple data is a point in the three-dimension space taking (d, R, r) as coordinate system. Generally each point has a score using one function, and the scores should be shown in the fourth dimension. The points having the same score form a surface called isosurface. To show the property using 3-d is much complicated.
Fortunately not all functions compared in this study have three variables. If the function has two variables, then the points are mapped into a plane, and their scores are shown in the third dimension, which could be shown by isolines in a plane like a topographic map. Every isoline indicates a score, and the points in the line have same scores. In this study only isolines are studied carefully and the isosurface was just shown by a diagram (Fig. 8 in the main text).

Section 1.1 Isolines point out the weight and impact of parameters
Take the simplest model Ochiai as an example (Fig. S1). Since Ochiai has only one variable, all triples are condensed into a line and their scores are shown in the second dimension (Fig. S1A), showing the score reduces as R increase. Its isolines are parallel to the r axis in the R-r plane (Fig.   S1B) and d axis in the d-R plane (Fig. S1C), then we would know the weights on d and r are zero.
In the both planes, the isolines are perpendicular to R axis and the score indicated in the isoline reduce along this axis. We would know the weight on R is large and it has negative impact to the value of f (increase of R leads to reduce of f).

Jaccard index
Overlapping coefficient Ochiai coefficient The high-score isolines of Jaccard index are nearly parallel to the r axis, and the low-score isolines bend a little to the r axis, so we know it assign large weight on R and small weight on r. In addition we would know the impact of r is negative, which means increasing of r lead to reducing 5 of the function value. The isolines of Overlapping coefficient are nearly parallel to the diagonal, and then we would know the weights on both of the two parameters are large. Because the isolines bend away from the r axis, we know it has positive impact to the function value.
The three modes in class I do not take d into account, so their isolines in d-R plane are parallel to the d-axis. We figure out their isolines in Fig. S3.

Section 1.3 Isolines of models in class II
Comparing different models by their isolines supplies us an opportunity to compare different models which can not be compared before. Mutual information (I) and Poisson distribution (P) are very different models, their scores are heterogeneous. After their isolines are drawn (Fig. S4), it is very clear that they are two similar models, which is also confirmed by their similar expressions.
Pa is a simplified form of Poisson distribution using Stirling's approximation. The three isolines of the model I (mutual information) are corresponding to those of model P (Poisson distribution) by three reference points. The two highest-score isolines of them cross at point (150, 1), the two lowest-score isolines cross at point (30,5), and the left pair of isolines cross at the point (50, 1). It is know that Poisson is an approximation of Binominal and Hypergeometric distributions. As shown in Fig. S5 the isolines of Binominal and Hypergeometric distributions are very similar to that of Poisson (Fig. S4). Only two of the three reference points mentioned above appear in the Fig. S5. The score of the reference point (150, 1) could not be calculated using these two models.
The two highest-score isolines of them cross at point (100, 1).

Section 2 Reasons for different performance of models Section 2.1 Detailed information of the TF-TF dataset
Transcriptional factors (TFs) are key regulators in biological process, which regulates the temporal and special expression of their targets. Common targets of two TFs were thought to be useful information in assessing the relation of TFs, since if two TFs regulate similar targets they would participate in common biological progresses via regulating similar functional modules. Our TF-TF dataset contain 314 TFs and 22507 targets of them. Any two TFs will be a candidate founctional related pair as long as they share at least one common target, and 41651 candidate pairs are generated. Such criteria are very loose since the randomness degree (the ratio of the number of candidate pairs to all the possible pairs) of this dataset is about 75%. A stricter one is needed to find out true functional related TF pairs. Every TF could be seen as a set of its targets and then every candidate pair could be modeled by two overlapping sets, and the correlation score of the two TFs could be calculated using a model quantifies the overlap. The cases with scores higher than a cutoff will be seen as positive calls. We took PPI pair as indicator to the true positive and the proportion of PPI pairs as an approximation of the positive predictive value (PPV). There are 1334 PPI pairs among all the candidate pairs, and then the average PPV of this dataset is 3.2%.

Section 2.2 Positive area of different models cover different district in the plane
We choose the isoline with the score equal to the cutoff as cutoff-line. Then this isoline of a model could divide the d-R plane into two parts, and the part including points with higher scores is called as positive area of this model. Putting cutoff-lines of different models together, the difference of them will be clearly shown (see Fig. S7A). To describe clearly, we separated the positive areas of

Section 2.3 Cases located in different districts have different quality
We mapped all the candidate TF pairs into the d-R plane and counted their distribution (Fig. S8B).
It is clear that distribution of cases in the high-score area is few, and there is high density of distribution of case in the large R and small d district. Taken the distribution of PPI pairs (taken as true positives) together, we found PPV of different districts is significantly different. Three special districts are highlighted. As shown in Fig. S7C, the PPV of the district beyond the boundary line is very low (1.3%) comparing to the average PPV for all candidate pairs (3.2%). We argued that cases beyond the boundary should be dropped out and all the cases seem to be restricted within it coincidentally. Very few candidate cases locate in the district beyond the boundary line and even fewer PPI pairs locate in this district. This is the first one and the second is the LR-Sd district (the peaks in Fig. S8B). There is high density distribution of cases, and the scores of the cases in this district are low. We fount the PPV of this area is very low (2.1%). These three characters (high-density, low score, low PPV) imply that this district is a low-quality district (LQD, red shadow in Fig. S8D). Of course there must be a high-quality district (HQD). We have known that near the d axis is a high-score district, which will be a little different using different model (Fig.   S8A), and there is low density of cases (Fig. S8B). The PPVs of districtsⅠ-Ⅳ are high (Fig.   S8C). Although the two models cover different districts, both of them achieve high PPV (12%) using the cutoff shown in Fig. S8A. We combined their high-score districts and took the union as HQD (light blue shadow in Fig. S8D)  Figure S8 Properties of different area in the d-R plane using TF-TF data. This is also Fig. 6 in the main text.

Section 2.4 Reasonable balance between d and R is important
The two classes of models assign different weights on d and R, and then their positive area cover 11 different districts in the d-R plane, which leads to different performance in application. Which one is better? We know that an effective model should have two features. First there should be a good correlation between the score and accuracy rate (Sorting the positive calls). As mentioned in the main text Poisson is better than Ochiai in this case. Second it should achieve a reasonable balance between sensitivity and specificity using a proper cutoff (Selecting more true positives but less false positives). An ROC curve is needed to find out such proper cutoff for a model, and it could also effectively tell the best one among models. However, lack of golden positive and negative standards hindered us to plot the ROC curve. We transformed the second feature into a graphical notion, i.e. the positive area of a good model should cover the HQD but avoid the LQD as much as possible in the d-R plane. In the main text, we have found that Ochiai is better than Poisson on the second feature.
As shown in Fig. S8D, the HQD is nearly the positive area of Ochiai, which only take R into account and does not care about d. This implies that parameter R is a primary factor and small-R is a necessary condition for large overlap (selecting positive cases). But small-R is not a sufficient condition for large overlap, since we have known that there is bad correlation between PPV and score calculated by Ochiai, which lose some useful information contained in d. Poisson takes both R and d into account and shows better performance in sorting positive calls. But Poisson emphasized d too much. A model that achieves reasonable balance between R and d will lead to better performance.

Section 3 Construction of OScal
OScal is a modification of Poisson and imitates the performance of Hypergeometric distribution.
First, we constructed a basal model OScal_B that use only primary factors (R and d) for OScal.
OScal_B is designed to achieve reasonable balance between R and d, which determine two direct features of an overlap and are two primary factors for measuring overlap. Like Poisson is an approximation of Hypergeometric distribution, OScal_B is also an approximation of OScal. Then we added items that concern the impact of the minor factor (r). The expression of OScal contains three items: OS = Ps+MJ-Mr.

Section 3.1 Simplification of Poisson distribution
We used Stirling's approximation (equation (S1)) to simplify the expression of Poisson (equation (S2)) and obtained the approximation expression of Poisson (equation (S2)). The simplified form calculates the scores more quickly and widely than the original express of Poisson distribution.
There is very little error between scores calculated by them for a case (Supplementary Table S2

Section 3.2 Construction of OScal_B by modification of Poisson distribution
We found the score in cases with small d was overestimated by Poisson. Because when d is small, λ will be very small, then the difference will be overestimated (appear very large). This overestimation will lead to the "LR-Sd trap" of Poisson. We argued that Poisson tolerates too large R, which could be shown by isoline in the d-R plane. As shown in Fig. S9A   Since OScal_B does not take r into account, we constructed two other items to contain the impact of r. MJ is used to detect the impact of r, and Mr is used to detect the hyper-large set. In OScal, the impact of r will be dependent. When d and R are small, the score will increase as r increases as model H does (Fig. S10A). When d and R are very large, then the score will decrease as r increases. Because at this time d*R*sqrt(r) (number of elements in the larger set) will be close to N, in other word the large set is close to the universal set. Table S3 listed some cases generated by such hyper-large sets. The score of case SLO1 using Hypergeometric distribution (H) is large, although the number of the larger set (m) is close to the background number (N=22507). Such cases usually appear in enrichment analysis when the high level GO terms are tested. Researchers usually dropped out such high level GO terms when did enrichment analysis using Hypergeomettic distribution. The score of such case using OScal will be negative (Table S3).