Mutual Information between Discrete Variables with Many Categories using Recursive Adaptive Partitioning

Mutual information, a general measure of the relatedness between two random variables, has been actively used in the analysis of biomedical data. The mutual information between two discrete variables is conventionally calculated from their joint probabilities, estimated from the frequency of observed samples in each combination of variable categories. However, this conventional approach becomes inefficient for discrete variables with many categories, which are common in large-scale biomedical data such as diagnosis codes, drug compounds, and genotypes. Here, we propose a method that provides stable estimates of the mutual information between discrete variables with many categories. In simulation studies, the proposed method reduced estimation errors 45-fold and improved the correlation coefficients with the true values 99-fold, compared with the conventional calculation of mutual information. The proposed method is also demonstrated through a case study of diagnostic data in electronic health records. This method is expected to be useful in the analysis of various biomedical data with discrete variables.


Simulation Settings
The proposed calculation of mutual information was evaluated through simulation studies. In each simulation, we considered a certain relation between two discrete variables for which the true mutual information can be calculated. For two categorical variables X and Y whose relation was predefined, the mutual information was estimated from random samples. Six different relations were used for the simulation studies, with various sample sizes (50, 100, 200, 500, 1,000, and 2,000) and numbers of categories per variable (2, 5, 10, 20, 50, and 100). Each simulation was repeated 100 times. The proposed method with different p-value thresholds was compared with the true mutual information as well as with the results of the conventional calculation. All simulations used variables whose categorical values cannot be ordered.
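The conventional calculation referenced above is the plug-in estimator: joint probabilities are taken as the observed frequencies of each category combination. A minimal sketch in Python (function name is illustrative; natural-log units):

```python
import math
from collections import Counter

def plugin_mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats from paired categorical samples."""
    n = len(x)
    joint = Counter(zip(x, y))          # frequency of each (x, y) combination
    px, py = Counter(x), Counter(y)     # marginal frequencies
    mi = 0.0
    for (xi, yi), c in joint.items():
        p_xy = c / n
        p_x, p_y = px[xi] / n, py[yi] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

# Identical variables recover the marginal entropy H(X);
# a constant pair gives zero.
x = ["a", "a", "b", "b"] * 50
print(plugin_mutual_information(x, x[:]))  # ≈ ln 2 ≈ 0.693 nats
```

With many categories and few samples per combination, this estimator becomes unstable, which is the motivation for the proposed method.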
Six simulation settings are listed here. (1) Step structure with low mutual information: Consider X and Y with two super categories each, i.e., X ∈ {w1, w2} and Y ∈ {v1, v2}, whose relation is given by 2 × 2 joint probabilities.
We set p(w1, v1) = 0.4, p(w1, v2) = 0.1, p(w2, v1) = 0.2, and p(w2, v2) = 0.3. We assume that n/2 fine categories per super category are actually observed instead of the super categories themselves. Let x1, …, xn/2 be observed for w1, and xn/2+1, …, xn for w2. Similarly, y1, …, yn are assumed to be observed for the super categories of Y. The combinations of fine categories are assumed to be uniformly distributed within the corresponding combination of super categories. For example, the (n/2)² combinations of {x1, x2, …, xn/2} × {y1, y2, …, yn/2} are uniformly distributed within w1 × v1, whose probability is 0.4. For the simulation, the mutual information of X and Y is estimated from randomly generated data with n fine categories per variable and compared with the true mutual information, 0.09 nats. (2) Step structure with high mutual information: This setting is similar to (1), but the joint probabilities of the super categories are p(w1, v1) = 0.7, p(w1, v2) = 0, p(w2, v1) = 0, and p(w2, v2) = 0.3. Here, the true mutual information is 0.61 nats.
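Setting (1) can be sketched as follows: super-category pairs are sampled from the 2 × 2 joint probabilities, and each super category is then replaced by a uniformly chosen fine category (function and category names are illustrative):

```python
import random

def sample_step_structure(num_samples, n, rng=None):
    """Draw (x, y) pairs with n fine categories per variable (setting 1)."""
    rng = rng or random.Random(0)
    # 2 x 2 joint probabilities of the super-category combinations
    probs = {("w1", "v1"): 0.4, ("w1", "v2"): 0.1,
             ("w2", "v1"): 0.2, ("w2", "v2"): 0.3}
    cells, weights = zip(*probs.items())
    half = n // 2
    data = []
    for w, v in rng.choices(cells, weights=weights, k=num_samples):
        # fine categories are uniform within the chosen super category:
        # w1 maps to {0, ..., n/2 - 1}, w2 to {n/2, ..., n - 1}, and likewise for v
        x = rng.randrange(half) + (half if w == "w2" else 0)
        y = rng.randrange(half) + (half if v == "v2" else 0)
        data.append((x, y))
    return data

data = sample_step_structure(1000, n=10)
```

Setting (2) follows the same sketch with the joint probabilities replaced by (0.7, 0, 0, 0.3).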
(3) Gaussian structure with low mutual information: The joint population of X and Y is defined by a bivariate Gaussian distribution whose marginal distributions are standard normal and whose covariance (σ) is 0.49. First, continuous random samples are generated from the joint distribution.
The observed range of the continuous samples is uniformly discretized into n categories for each variable. Each sample falls into one of the n² combinations of discretized X and Y, and receives the corresponding categorical values for X and Y. From the data with n categories per variable, the mutual information is estimated. Unlike the continuous case, the Gaussian structure is hardly observable here because the discretized categories carry no order. When the marginal variance is 1, the theoretical mutual information of a joint Gaussian distribution is (1/2)·log(1/(1 − σ²)), which is 0.14 nats in this case.
(4) Gaussian structure with high mutual information: This setting is similar to (3), but the covariance of X and Y is 0.81. The true mutual information is 0.53 nats.
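Settings (3) and (4) can be sketched as follows, assuming equal-width binning over the observed range; for standard normal marginals with covariance σ, the theoretical mutual information is (1/2)·log(1/(1 − σ²)) in natural-log units (names are illustrative):

```python
import math
import random

def sample_discretized_gaussian(num_samples, n, sigma, rng=None):
    """Discretize bivariate Gaussian samples into n unordered bins each."""
    rng = rng or random.Random(0)
    xs, ys = [], []
    for _ in range(num_samples):
        u = rng.gauss(0.0, 1.0)
        # correlate y with x through the covariance sigma
        v = sigma * u + math.sqrt(1.0 - sigma * sigma) * rng.gauss(0.0, 1.0)
        xs.append(u)
        ys.append(v)

    def discretize(vals):
        # uniform bins over the observed range; the maximum falls into bin n-1
        lo, hi = min(vals), max(vals)
        width = (hi - lo) / n
        return [min(int((v - lo) / width), n - 1) for v in vals]

    return list(zip(discretize(xs), discretize(ys)))

# theoretical value for sigma = 0.49, in natural-log units
true_mi = 0.5 * math.log(1.0 / (1.0 - 0.49 ** 2))   # ≈ 0.14
```

Setting (4) uses the same sketch with sigma = 0.81.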
(5) Random structure with low mutual information: A random relation between X and Y can be constructed from randomly generated joint probability masses. The probabilities of the n categories of X are generated from an exponential distribution with λ = 1. Let pX denote the vector of these n probability masses. Similarly, the marginal probability mass vector of Y, pY, is randomly generated from the same exponential distribution. We obtain an n × n joint probability matrix P1 = pX pYᵀ. P1 gives the joint probability masses of a randomly structured but independent relation. Independently, we obtain P2, another n × n probability matrix, by randomly generating n² joint probability masses from an exponential distribution (λ = 1). P2 represents a randomly structured and dependent relation. To ensure both random structure and dependency, (P1 + P2)/2 is used as the final joint probability distribution. Samples are randomly generated from the final joint probabilities, from which the mutual information is estimated.
Although the theoretical mutual information is hard to obtain in this case, it can be estimated empirically with a large number of samples (one million in this work). The true mutual information is expected to vary with the number of categories.
(6) Random structure with high mutual information: This setting is similar to (5), but only P2 is used for the final probabilities. In this case, X and Y are more strongly dependent on each other than in (5).
Consequently, this setting simulates a random structure with higher mutual information than (5).
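Settings (5) and (6) can be sketched as follows; exponential draws are normalized into probability vectors and matrices, and a flag switches between the mixed distribution of (5) and the P2-only distribution of (6) (names are illustrative):

```python
import random

def random_joint_probabilities(n, mix=True, rng=None):
    """Build an n x n joint probability matrix (settings 5 and 6)."""
    rng = rng or random.Random(0)

    def normalize(v):
        s = sum(v)
        return [p / s for p in v]

    # P1 = pX pY^T: randomly structured but independent
    px = normalize([rng.expovariate(1.0) for _ in range(n)])
    py = normalize([rng.expovariate(1.0) for _ in range(n)])
    p1 = [[px[i] * py[j] for j in range(n)] for i in range(n)]
    # P2: n^2 random masses, randomly structured and dependent
    flat = normalize([rng.expovariate(1.0) for _ in range(n * n)])
    p2 = [flat[i * n:(i + 1) * n] for i in range(n)]
    if not mix:
        return p2                                   # setting (6): P2 only
    return [[(p1[i][j] + p2[i][j]) / 2 for j in range(n)]
            for i in range(n)]                      # setting (5): (P1 + P2)/2

P = random_joint_probabilities(5)
```

Samples are then drawn from the returned matrix, and the mutual information is estimated from them.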

Time Complexity
The number of samples in the input data (n) is often used as the input data size for the calculation