Learning Bayesian Networks from Correlated Data

Bayesian networks are probabilistic models that represent complex distributions in a modular way and have become very popular in many fields. There are many methods to build Bayesian networks from a random sample of independent and identically distributed observations. However, many observational studies are designed using some form of clustered sampling that introduces correlations between observations within the same cluster and ignoring this correlation typically inflates the rate of false positive associations. We describe a novel parameterization of Bayesian networks that uses random effects to model the correlation within sample units and can be used for structure and parameter learning from correlated data without inflating the Type I error rate. We compare different learning metrics using simulations and illustrate the method in two real examples: an analysis of genetic and non-genetic factors associated with human longevity from a family-based study, and an example of risk factors for complications of sickle cell anemia from a longitudinal study with repeated measures.

Scientific RepoRts | 6:25156 | DOI: 10.1038/srep25156 Markov property states that a variable is independent of all the remaining variables in the graph conditionally on its Markov blanket that is defined by the parent nodes, children nodes and additional parents of the children nodes.
There are well established approaches to structure learning of BNs 6,7,13 that use either exact Bayesian criteria based on the marginal likelihood  In Equation (1), pa(y i ) denotes the observable parents of the variable Y i in the model M, while y ik and pa(y i ) k denote the observed value of Y i and its parent nodes in the k-th sample unit. Each sub-model M i specifies the set of parents of the variable Y i (see Fig. 2), so that M = (M 1 , M 2 , … , M v ). We denote by y k = (y 1k , y 2k , … , y vk ) the  2,000 simulated data sets were generated using the network structure shown on the left and assuming normal distributions for the 5 variables. In 1,000 sets, the observations were IID, and in the remaining 1,000 sets data were generated from 581 independent clusters, with observations correlated within clusters. The table summarizes the number of times the true network was selected in 1,000 simulations with IID observations and 1,000 simulations with correlated data, the false positive rates, and family-wise error rates using three common model selection metrics and a forward search. False positive rates were defined as the number of additional or missing edges over the total number of tests, and family-wise error rates were defined as the probability of one or more errors in the overall search. BIC: Bayesian Information Criterion; AIC: Akaike Information Criterion; LRT: Likelihood Ratio Test at α = 0.05. that can be used for a modular Bayesian model search. All of these proposed approximations assume that the observations are independent or exchangeable.

Mixed-Effects Regression Models.
Mixed effects regression modelling has emerged as one of the most popular method to analyze correlated data 11 . Let Y denote the observations of n subjects from m clusters, and suppose that Y follows a multivariate normal distribution. A linear mixed effects model for Y is described by the equation: s n in which X is an n × p matrix of regression coefficients for the fixed effects β, Z is an n × s matrix of known coefficients, Γ and Δ are s × s and n × n matrices of parameters that describe the correlations between observations, u is a vector of s × 1 random effects, and ε is a vector of n × 1 error terms 11,20 . The model specifies that T T T T so that the correlation between the observations is described by the matrices Ψ and Σ . If both Ψ and Σ are block diagonal matrices, the parameterization is the independent cluster model, in which subjects from m different clusters are independent, but they are correlated within the same cluster. Note that the parameterization in Equation (4) assumes that the vectors u and ε are standardized to have variances equal to 1 and they are independent, while the variance components of Y are the elements of the matrices Γ Γ T and Δ Δ T . For analysis of time-to-events data with proportional hazard models, the random effects are usually modeled in the log-hazard function or using a frailty term with gamma distribution [21][22][23][24] . For categorical data modeled within the framework of generalized linear models, the random effects are modeled on the scale of the linear predictors 25 . These models make the additional assumption that the observations are independent, conditionally on the random effects. Non-Bayesian inference on the fixed effects parameters typically uses the marginal approach based on the integrated likelihood: where φ represents the vector of fixed effects β and variance parameters in Ψ and Σ . The integrated likelihood can be computed in closed form when errors and random effects are normally distributed, but numerical approximations are needed for non-linear/non-normal models 26 . The integrated likelihood is used to find maximum-likelihood estimates of the fixed effects and variance components. Common parameterizations of the random effects include exchangeable correlation, in which within-cluster pairwise correlation is assumed constant, and auto-regressive correlation. When clusters are families, the correlation between family members depends on the degree of relatedness and kinship coefficients that represent the probability of alleles transmitted identically by descent between pairs of family relatives 27 (See Fig. 3). Several model selection criteria have been proposed for selection of fixed effects in mixed effects 11 , including AIC and BIC that can be computed using the integrated likelihood but there is no consensus on the appropriate correction parameters. Specifically, it has been argued that the overall sample size may not be the correct quantity to use when the data are correlated and the effective number of parameters may be unclear in models that includes several random effects 28,29 . Modified versions of BIC that have been proposed, where n e is an estimate of the effective sample size, use a reduced sample size to account for the correlation between observations 11 . We will use these three proposed corrections of the sample size in the simulation study: Jones' correction. n e = 1 T C −1 1, where 1 is the unit vector, and C is the correlation matrix that can be estimated from the covariance matrix V = V(Y|X, β) = ZΨ Z T + Σ 30 . Liberal correction. n e = n c where n c is the number of clusters. This is the most liberal correction, in which a cluster represents a single sample unit.

Mixed-Effects Bayesian Networks
We propose a mixed-effects type parameterization of a BN that can be used for structure and parameter learning from correlated observations. The rationale of our approach rests on the observation that in the mixed effect regression model in Equation (4) we introduce and model the correlation between the sample units through the vector of random effects ZΓ u, with variance-covariance matrix ZΨ Z T , where Ψ = Γ Γ T .
To extend this idea to a BN with variables Y 1 , … , Y v , consider first this simple situation. Suppose v = 2 and let Y 1 be parent of Y 2 , with E(Y 2k |Y 1k ) = β 1 Y 1k and (Y 1 , Y 2 ) follows a bivariate normal distribution. To estimate the coefficient β 1 from a sample of observations that are not independent and have correlation structure known up to some parameters, one can use the mixed effect regression model We proceed by introducing a set of random effects α = (α 1 , … , α v ), in which each α i is a n × 1 vector of correlated random effects associated with Y i . The random effects can be interpreted as additional parameters that augment the parent set of each variable Y i as (pa(Y i ), θ i , α i , γ i ) where the vector θ i represents the parameters of the conditional distribution of each node Y i |pa(Y i ) that we would ordinarily use if observations were IID, and the vector γ i represents the variance parameters of α i that model the correlation between observations. We assume that the distribution of the random effects α i depends only on the variance parameters γ i , so that θ i and α i are independent given γ i . For example, if the variables Y i follow normal distributions and the parent-children relation are described by regression models, the parameter vector θ i includes the regression coefficients β and the uncorrelated error variance terms, while γ i represents the correlation parameters. See Fig. 4 for an example.
Let φ denote the overall set of parameters (θ, α, γ) of the joint probability distribution of the variables Y 1 , … , Y v and we assume global independence of the parameters so that we can write the product of the global likelihood function and parameter prior distribution for data D as: This factorization of the augmented likelihood can be used for local computations using information based criteria or other Bayesian criteria. For example, conditionally on θ = (θ 1 , … , θ v ) and γ = (γ 1 , … , γ v ), the integrated likelihood can be computed as: The kinship matrices contain pairwise kinship coefficients between pairs of family members and these coefficients represent the probability that two individuals share the same gene allele by identity by descent. The covariance between two family members with kinship coefficient k ij is 2k ij γ 2 where γ 2 represents the genetic variance.
Scientific RepoRts | 6:25156 | DOI: 10.1038/srep25156 can be computed exactly for normally distributed variables (see 11 ) or using numerical approximations in other cases as shown in 26 . Maintaining the product-form of the likelihood has the benefit that the search of the best dependency structure among the variables Y 1 , … , Y v can be conducted in a modular way, by finding the optimal set of parents of each variable Y i that optimizes either the marginal likelihood, or the marginal BIC or AIC based on integrated likelihood. The variables Y i can follow a normal distribution, or non-normal distributions such as a Poisson distribution or multinomial distribution for categorical data, or survival distribution for time-to-event data. In the non-normal data case, random effects can be included in the log-transformed parameterization of the mean (Poisson/multinomial data), or the log-hazard function (time-to-event data). Once the best dependency structure is selected, the conditional distributions of each local model can be estimated using MCMC methods, or large sample approximations.
Example of Family-Based Data. Suppose that study subjects are clustered into m families with different familial relations, then the within-family correlation of a variable Y can be described by the random effect α|γ 2 ~ N(0, 2γ 2 K), where γ 2 is the genetic variance to be estimated from data, and the matrix K can be derived from the kinship coefficients as in Fig. 3. By using the singular value decomposition of 2K = USU T , with UU T = U T U = I n , we can define Γ = γS 1/2 , Z = U and u ~ N(0, I n ), so that α = ZΓ u = γUS 1/2 u and the variance covariance matrix of α is γ 2 US 1/2 S 1/2 U T = 2γ 2 K. To extend this parameterization to a Bayesian network with With this parameterization, we allow the genetic variance γ i 2 to vary for each variable Y i , but the matrices U and S will be the same for each Y i , as the kinship matrix K is study specific. We can assume a Gamma prior on each parameter γ i 2 , and independence of θ i and α γ i i 2 for each i = 1, … ., v, and independence of α i , γ i 2 and α j , γ j 2 for all i ≠ j to derive the above factorization of the likelihood.
Example of Repeated Measures. Suppose that data are from a longitudinal study with repeated measures per subject, and we stack the repeated measures per subject, so that the overall size of the data set D is ∑ n k k where n k denotes the number of repeated measures of the kth subject. Clearly, the repeated measures of each individual are correlated and the within subject correlation in each variable Y i can be described by a vector of random effects α i |γ i ~ N(0, γ i Ψ i ) where the matrix Ψ i is a block diagonal matrix with blocks that can be parameterized using exchangeable correlation, or autoregressive structure.

Simulation Studies
We conducted simulation studies to examine the effect of ignoring the correlation between observations in structure learning of a BN. We also compared false positive rate and power of the modifications of BIC and AIC for learning a BN from correlated data. For simplicity, we focused on the forward search procedure of the K2 algorithm 16 . We considered two scenarios: continuous data that follow normal distributions, and time-to-event , conditional of the parameter vector θ. Nodes in orange are the parameters that define the conditional parent-children distribution of the observable variables (fixed effects), while the nodes in yellow are nuisance parameters. Right panel: our proposed parameterization when both the dependency structure and conditional probability distributions need to be estimated from correlated data. The random effects α (blue nodes) have probability distributions that depend on parameters γ (lavender nodes). Both parameters γ and random effects α are used to model the correlation between observations as in Equation (4).
Scientific RepoRts | 6:25156 | DOI: 10.1038/srep25156 data modelled using Cox proportional hazard regression. In the first case, we used the closed form solution to the integrated likelihood that allows for efficient computations of likelihood based model selection criteria 11 . In the second case, we used the numerical approximation of the marginal likelihood that can be derived assuming normally distributed random effects in the log-hazard scale 22,32 . In both simulations we generated data assuming that data are from a family based study design.
Continuous Data. We generated correlated observations borrowing the family structure from the Long Life Family Study (LLFS): a study of healthy aging that enrolled individuals from families with longevity and healthy aging in the United States and Denmark between 2006 and 2009 33,34 . A typical family structure in the LLFS has a proband and consenting siblings, their offspring and spouses. For this simulation study, the total sample size was 4656 and the number of families was 582. With a kinship matrix from each family K f , the variance-covariance matrix of the observations is the 4656 × 4656 matrix: where σ e 2 is the error variance, and γ is the "genetic variance". To simulate normally distributed data with this variance-covariance matrix, we fixed the error variance σ = 1 e 2 and varied the genetic variance to be γ 2 = 1/3, 1, 3 to simulate genetic traits with heritability γ γ σ + = .
.   25%, 50%, and 75% of the trait variability due to genetics and the rest to other non-genetic factors. To generate correlated data, in each simulation a vector Z of independent and normally distributed observations was generated and transformed into Y = UD 1/2 Z where U and D are the matrix of eigenvectors and eigenvalues from the spectral decomposition of the variance-covariance matrix V. This transformation guarantees that V(Y) = UD 1/2 V (Z)D 1/2 U T = V. In each run, we also included 10 null covariates. In this simulation study, the null covariates were common single nucleotide polymorphisms (SNP) with minor allele frequency >5%, which were randomly selected from the real genome-wide genotype data from LLFS. Each simulated data was analyzed using a forward search with BIC and AIC and the LRT at α = 0.05 ignoring the correlation in the data. The data were also analyzed using BIC, AIC and the LRT based on the intergrated likelihood, θ γ p D ( , ) to account for the correlation in the data. Four variants of BIC were used based on different effective sample sizes: n e = 4656 (full sample size); n e = 2796 (Jones' correction); n e = 1768 (Yang's correction); and n e = 582 (most conservative sample size). The simulation was repeated 1,000 times. Table 1 shows the number of false positive covariates that were selected with the forward search using the 9 criteria, the overall number of tests conducted during the forward search, false positive rate (probability of Type 1 error in one test: BIC, AIC or LRT) and family wise error rate (probability of one or more errors in the overall search) when the heritability is 0.5. The false positive rate was calculated by dividing the sum of all false positive covariates by the total number of tests. The full set of results for different heritability estimates can be found in the Supplementary Materials. The results show an inflation of both error rates when the correlation in the data is ignored, with a 55% increase of the family wise error rate for the LRT, and a 267% increase for the BIC. The inflated Type I error will repeat for each search of parent-child dependency in the network and result in highly connected networks. The false positive rate of the LRT based on the integrated likelihood that accounts for the correlation in the data is slightly below the nominal level (0.0432). The various corrections of the BIC result in small false positive and family wise error rates. Using the full sample size as the effective sample size in the BIC is an over-correction that results in a very conservative scoring metrics. Decreasing the effective sample size makes the BIC score more liberal with a modest increase of both false positive and family wise error rates. Although these small error rates of BIC seem desirable, the question is their effect on the true positive rates of the different scoring metrics.
We compared the power of different variants of BIC to the power of the LRT M using the significance threshold determined from the false positive rates in Table 1. To do so, we ran 3 additional simulations in which the variable Y was generated from a multivariate normal distribution with variance-covariance structure as described above. In these scenarios, we modelled the expected value of the variable Y as a linear function of 3 true covariates that  were also generated from a multivariate normal distribution with different amount of correlations. Three sets of regression parameters were chosen to represent the situations of weak, moderate and strong covariate effects such that the first scenario included 3 weak effect covariates, the second scenario included 3 moderate effect covariates, and the third scenario included 3 strong effect covariates. Power was defined as the probability of detecting all three true covariates in each run. The results are summarized in Table 2 and show that the LRT M has consistently higher power than the BIC for all different corrections, when the false positive rates are kept equal. For instance, in the presence of covariates with moderate effects, the BIC M detects the 3 covariates 29.5% of the time, whereas the LRT M detects the covariates 31.4% of the time, which is an increase in power by 1.9%. The most liberal correction of the BIC, with effective sample size equal to the number of clusters, appears to provide a reasonable compromise, and is essentially equivalent to using the LRT.

Time-to-event Data.
To simulate time-to-event data, we again borrowed the family structure from the LLFS and modified the simulation scheme from 35 by inducing correlation with log-normal frailty (random effects). The baseline survival time was simulated from a Weibull (2,2). We simulated the correlated trait such that: where (0, 2 ( , , , )); ( 0, 2) (14) 1 2 582 so that the event time is defined as t = min(T, C) and the censoring indicator is δ = I(T ≤ C). The correlation among observations are induced by the inclusion of random effects term R on the log-hazard scale. The rest of the simulation scheme was very similar to the case of continuous data, except for BIC, where the effective sample size was the number of events as suggested by 36 . Table 3 shows the number of false positive covariates that were selected with the forward search using the 6 criteria, the overall number of tests conducted during the forward search, and both the false positive rate and the family wise error rate when the heritability is 50% on the log-hazard scale. The full set of results for different heritability estimates can be found in the Supplementary Materials. The results show an inflation of both error rates when the correlation in the data is ignored, with a 25% increase of the family wise error rate for the LRT, and a 56% increase for the BIC. The false positive rate of the LRT based on the integrated likelihood is slightly below the nominal level (0.0470), while the traditional LRT exhibits inflated Type 1 error rate of 0.0629. Consistent with the results from the continuous data, AIC is the most liberal metric.
We compared the power of BIC M based on integrated likelihood to the power obtained from the LRT M using the significance threshold determined from the false positive rates in Table 3. Again, we ran 3 additional   simulations in which the correlated survival trait was generated as described above with sets of 3 true covariates of different strengths. The results summarized in Table 4 show that the LRT M has consistently higher power in all cases regardless of the heritability estimates, which is consistent with results from the continuous data. For instance, in the presence of covariates with moderate effects when h 2 = 0.50, BIC M detects the 3 covariates 25.5% of the time, whereas the corresponding LRT M detects the covariates 28.5% of the time, which is an increase by 3.0%. These results emphasize the need to account for correlation in the data to avoid an unnecessary inflation of the false positive error rates. Moreover, after controlling for the Type 1 error, the LRT M appears to have comparable power to the BIC based on integrated likelihood in both cases of continuous and time-to-event data. In practical application, using the LRT is an appealing approximate solution that avoids the problem of choosing the appropriate number of parameters, and also control well the Type 1 error.

Application
We applied the proposed approach in the two real data and compared to the BNs constructed when ignoring correlations in the data.
In the first example, we built a BN to examine the associations between genetic data, blood biomarkers, socio-demographic factors, and life span using data from the LLFS. The genetic variants were unlinked SNPs in the 23 genes of the insulin and insulin-like growth factor 1 signaling (IIS) pathway that were found associated with age at death using single SNP analysis (i.e. testing the association one SNP at a time using Cox proportional hazard regression adjusted for family structure with a significance threshold of 0.005). This pathway is considered as one of the most important pathways in aging 37 . Table 5 summarizes the gene, chromosome, and the number of tested SNPs per gene. There was a total of 13 common SNPs that were individually associated with age at death, adjusting for sex. In a joint model that included these 13 SNPs as covariates, 6 of them were still associated with age at death at the p-value threshold of 0.005. Given the large number of tested SNPs, the p-value threshold of 0.005 may appear too liberal but the goal of this preliminary analysis was primarily to obtain a candidate list of genetic variants to be considered in building the BN.
The question we were trying to answer with the BN was whether some of these direct associations between SNPs and lifespan could be explained through associations with blood biomarkers such as serum levels of DHEA (a steroid hormone linked to muscle loss in aging), insulin growth factor 1 (IGF-1), transferrin receptors (Tr), and hemoglobin (Hgb). All these biomarkers are related to aging and would provide targets to develop intervention for healthy aging 38 . Additional variables in the network were age at enrollment (Age.E) and follow-up survival time (FUS) censored at last contact for living subjects. We also included an indicator variable (Birth Year Cohort: BYC) that accounted for possible secular trend, and sex. To build the BN, we used the search procedure of the K2 algorithm, and we considered all possible orderings of the other variables with the exception of the SNPs that, for biological reasons, were considered as root nodes in the BN. The follow-up survival time was considered as possible child of all the other nodes. For each possible ordering of the variables, a BN was built by fitting appropriate mixed effects regression models of follow-up survival time (using mixed effect Cox proportional hazard regression), age at enrollment, the four biomarkers (using linear mixed model), and by identifying statistically significant predictors through a forward search. Based on the simulation study, the likelihood ratio test from mixed effects model was used for model selection criteria by applying a Bonferroni correction at each node.
The three BNs with largest global likelihood are depicted in Fig. 5 and have very similar structures, with directions of few edges switched (edges colored in red in the figure), and the Markov Blankets (MB) of the variables in these 3 BNs in Table 6 areidentical. Only two SNPs remained in the model, one (rs1009375, in the proximity of AKT3, linked to glicemic control) is directly associated with follow-up survival and another (rs6974881, in PIK3CG, linked to inflammation) is directly associated with age at enrollment. The results suggest that genetic variants in the IIS pathway do not affect age at death through these 4 biomarkers.
We also built the BN ignoring the familiar correlations in the LLFS data using the LRT, and the three BNs with largest likelihood are depicted in Fig. 6. The overall structures are very similar to the top three BNs built under the proposed parameterization. However, in each of these BNs, two additional SNPs (rs17224116 and rs10048024) show significant dependency with transferin receptor level (node Tr) and IGF-1 levels. Based on the results of the   simulations that showed an increase Type I error when the correlation between observations is ignored, these two additional edges are likely to be false positive findings introduced by ignoring the correlation in the data.
In the second example, we used data from 2916 unrelated African-American subjects with sickle cell anemia enrolled in the Cooperative Study of Sickle Cell Disease to model the correlation between several circulating biomarkers of the disease (CSSCD (https://biolincc.nhlbi.nih.gov/studies/csscd/). The data included these 9 biomarkers: fetal hemoglobin, serum glutamic oxaloacetic transaminase, diastolic blood pressure, reticulocyte counts, platelet counts, red blood cell counts, white blood cell counts, hemoglobin, and mean corpuscular volume. Subjects enrolled in the study were followed longitudinally and approximately 3 repeated measures per subject are available, with a total of 8018 measurements available for the current analysis. In order to account for correlations due to repeated measurements on the same subjects, a BN was built by stacking repeated measures and using random effect to describe the correlation between repeated measures of the same study subject. Clinics at which lab measures were taken, age at measurement, and hemoglobin genotypes were considered root nodes of all variables, and all other procedures remained the same as in the previous example. The top BNs and associated MB using the proposed approach and ignoring correlations are illustrated in Fig. 7. Overall, there were 19 edges in the top BN constructed using the proposed approach. When correlations between repeated measures on the same subjects were ignored, there were 28 edges in the BN. These excess edges were reflected as additional variables in the MB of each node, which indicate that virtually all variables are connected to each other. Biologically, the simpler network is more consistent with previous findings that showed strong dependency between hematological parameters, but less dependency of hematological parameters with blood pressure and markers of liver functions (SGOT) 39 . We conjecture that some of these additional edges are likely to be false positives as a result of ignoring apparent correlations between measurements, and this result further bolsters the utility of the proposed approach that can control the false positive error rates for different types of correlation structures.

Discussion and Conclusions
We presented an approach to learn BNs from correlated data arising from clustered sampling. Our approach uses random effects to model the correlation between observations within the same clusters, and assumes marginal and conditional independence on the random effects to maintain the decomposibility of the likelihood and modularity of the computations. The random effects introduced in the parameterization do not affect the network structure per se, and conceptually they are simply additional random parameters that are useful to model the excess correlation in the data. We evaluated different approximate metrics for model selection in data simulated from a hypothetical family-based study, in which the observations of members within the same family are related with varying degrees of correlation. The simulation study showed the importance of accounting for correlated data to avoid inflation of the false positive error rate, and suggested that in large samples a simple likelihood ratio test based on the integrated maximum likehood may provide a good trade off between false positive and false negative rates. Applications on two real data with different correlation structures showed the potential use of this approach to simultaneously model the associations of genetic and non-genetic factors with a complex trait from a family-based observational study and repeated measures of biomarkers.
Our proposed parameterization can be used for a full Bayesian approach to structural and parameter learnings of BNs with correlated data. However, the selection of BNs from data with many variables is computationally a very challenging problem and therefore we focused the simulation analysis on the evaluation of approximate criteria for model selection. A proper Bayesian approach to model selection of networks learned from correlated data appears to be a very challenging question that needs more work. We limited our analysis to selection of networks with Gaussian data and time-to-event data. In both cases, there is a closed form solution, or good numerical approximation, to the calculation of the integrated likelihood that is used to compute likelihood based criteria such as the likelihood ratio test, BIC and AIC. However, approximate methods are needed for categorical variables 25 . We also assumed that the vector of random effects followed a normal distribution. Different distributional assumptions on the random effects and mis-specifications of these need to be explored further.
Another popular approach to account for correlations in observations is generalized estimating equation (GEE) 40 . Studies have shown that empirical results on parameter estimation and significance testing are very similar between GEE and random effects models 41 . The advantage of random effect models is that one can carry subject-specific as well as population-average inference, and therefore they provide a more flexible modeling approach for inference.
Our simulations suggest that the likelihood ratio test based on using the integrated likelihood provides a good metric for model selection. The criterion can be interpreted as a crude approximation of the Bayes factor and, compared to BIC or AIC, it allows users to choose different thersholds for model selection that can trade off sensitivity and specificity. This is an important feature of the criterion, particularly in the analysis of large datasets with several variables.