A new method for constructing networks from binary data

van Borkulo, Claudia D.; Borsboom, Denny; Epskamp, Sacha; Blanken, Tessa F.; Boschloo, Lynn; Schoevers, Robert A.; Waldorp, Lourens J.

doi:10.1038/srep05918

Download PDF

Article
Open access
Published: 01 August 2014

A new method for constructing networks from binary data

Claudia D. van Borkulo^1,2,
Denny Borsboom²,
Sacha Epskamp²,
Tessa F. Blanken²,
Lynn Boschloo¹,
Robert A. Schoevers¹ &
…
Lourens J. Waldorp²

Scientific Reports volume 4, Article number: 5918 (2014) Cite this article

23k Accesses
419 Citations
28 Altmetric
Metrics details

Subjects

Abstract

Network analysis is entering fields where network structures are unknown, such as psychology and the educational sciences. A crucial step in the application of network models lies in the assessment of network structure. Current methods either have serious drawbacks or are only suitable for Gaussian data. In the present paper, we present a method for assessing network structures from binary data. Although models for binary data are infamous for their computational intractability, we present a computationally efficient model for estimating network structures. The approach, which is based on Ising models as used in physics, combines logistic regression with model selection based on a Goodness-of-Fit measure to identify relevant relationships between variables that define connections in a network. A validation study shows that this method succeeds in revealing the most relevant features of a network for realistic sample sizes. We apply our proposed method to estimate the network of depression and anxiety symptoms from symptom scores of 1108 subjects. Possible extensions of the model are discussed.

Network analysis of multivariate data in psychological science

Article 19 August 2021

Statistical inference links data and theory in network science

Article Open access 10 November 2022

Discovery and classification of complex multimorbidity patterns: unravelling chronicity networks and their social profiles

Article Open access 21 November 2022

Introduction

Research on complex networks is growing and statistical possibilities to analyse network structures have been developed to great success in the past decade^1,2,3,4,5. Networks are studied in many different scientific disciplines: from physics and mathematics to the social sciences and biology. Examples of topics that have recently been subjected to network approaches include intelligence, psychopathology and attitudes^6,7,8,9,10. Taking psychopathology as an example, nodes (elements) in a depression network may involve symptoms, whereas edges (connections) indicate to what extent symptoms influence each other. The structure of such a network, however, is unknown due to the absence of a sufficiently formalised theory of depression. Consequently, the network structure has to be extracted from information in data. The challenging question is how to extract it.

Methods that are currently used to discover network structures in the field of psychology are based on correlations, partial correlations and patterns of conditional independencies^7,11,12,13. Although such techniques are useful to get a first impression of the data, they suffer from a number of drawbacks. Correlations and partial correlations, for example, require assumptions of linearity and normality, which are rarely satisfied in psychology and necessarily false for binary data. Algorithms like the PC-algorithm^14,15, which can be used to search for causal structure, often assume that networks are directed and acyclic, which is unlikely in many psychological cases. Finally, in any of these methods, researchers rely on arbitrary cutoffs to determine whether a network connection is present or not. A common way to determine such cutoff-values is through null-hypothesis testing, which often depends on the arbitrary level of significance of α = .05. In the case of network analysis, however, one often has to execute a considerable number of significance tests. One can either ignore this, which will lead to a multiple testing problem, or deal with it through Bonferonni corrections, (local) false discovery rate, or other methods^16,17,18, which will lead to a loss of power.

For continuous data with multivariate Gaussian distributed observations, the inverse covariance matrix is a representation of an undirected network (also called a Markov Random Field^19,20). A zero entry in the inverse covariance matrix then corresponds to the presence of conditional independence between the relevant variables, given the other variables²¹. To find the simplest model that explains the data as adequately as possible according to the principle of parsimony, different strategies are investigated to find a sparse approximation of the inverse covariance matrix. Such a sparse approximation can be obtained by imposing an ℓ₁-penalty (lasso) on the estimation of the inverse covariance matrix^13,22,23. The lasso ensures shrinkage of partial correlations and puts others exactly to zero²⁴. A different take involves estimating the neighborhood of each variable individually, as in standard regression with an ℓ₁-penalty²⁵, instead of using the inverse covariance matrix. This is an approximation to the ℓ₁-penalised inverse covariance matrix. This Gaussian approximation method is an interesting alternative: it is computationally efficient and asymptotically consistent²⁵.

In psychology and educational sciences, variables are often not Gaussian but discrete. Although discrete Markov Random Fields are infamous for their computational intractability, we propose a binary equivalent of the Gaussian approximation method that involves regressions and is computationally efficient²⁶. This method for binary data, which we describe in more detail in the Methods section, is based on the Ising model^19,27. In this model, variables can be in either of two states and interactions are at most pairwise. The model contains two node-specific parameters: the interaction parameter β_jk, which represents the strength of the interaction between variable j and k and the node parameter τ_j, which represents the autonomous disposition of the variable to take the value one, regardless of neighbouring variables. Put simply, the proposed procedure in our model estimates these parameters with logistic regressions: iteratively, one variable is regressed on all others. However, to obtain sparsity, an ℓ₁-penalty is imposed on the regression coefficients. The level of shrinkage depends on the penalty parameter of the lasso. The penalty parameter has to be selected carefully, otherwise the lasso will not lead to the true underlying network – the data generating network²⁵. The extended Bayesian Information Criterion²⁸ (EBIC) has been shown to lead to the true network when sample size grows and results in a moderately good positive selection rate, but performs distinctly better than other measures in having a low false positive rate²⁹.

Using this approach, we have developed a coherent methodology that we call eLasso. The methodology is implemented in the freely available R package IsingFit (http://cran.r-project.org/web/packages/IsingFit/IsingFit.pdf). Using simulated weighted networks, the present paper studies the performance of this procedure by investigating to what extent the methodology succeeds in estimating networks from binary data. We simulate data from different network architectures (i.e., true networks; see Figures 1a and 1b) and then use the resulting data as input for eLasso. The network architectures used in this study involve random, scale-free and small word networks^30,31,32. In addition, we varied the size of the networks by including conditions with 10, 20, 30 and 100 nodes and involve three levels of connectivity (low, medium and high). Finally, we varied the sample size between 100, 500, 1000 and 2000 observations. After applying eLasso, we compare the estimated networks (Figure 1c) to the true networks. We show that eLasso reliably estimates network structures and demonstrate the utility of our method by applying it to psychopathology data.

Results

Validation study

The estimated networks show high concordance with the true networks used to generate the data (Figure 2). Average correlations between true and estimated coefficients are high in all conditions with 500 observations or more (M = .883, sd = .158, see Table 1). In the smallest sample size condition involving only 100 observations, the estimated networks seems to deviate somewhat more from the true networks, but even in this case the most important connections are recovered and the average correlation between generating and estimated networks remains substantial (M = .556, sd = .155). Thus, the overall performance of eLasso is adequate.

Table 1 Correlations as a measure of performance of eLasso. Correlations are computed between upper triangle of weighted adjacency matrix of data generating network and estimated network. Data is simulated under various conditions (s_size, n_nodes, connectedness (p (probability of a connection), pa (preferential attachment), pr (probability of rewiring)) when the AND-rule and γ = .25 is applied. For networks with 100 nodes, deviating levels of connectedness are displayed between brackets. Results of applying eLasso with the OR-rule and γ = 0 are displayed between brackets

Full size table

More detailed information about eLasso's performance is given by sensitivity and specificity. Sensitivity expresses the proportion of true connections which are correctly estimated as present and is also known as the true positive rate. Specificity corresponds to the proportion of absent connections which are correctly estimated as zero and is also known as the true negative rate. It has been shown that sensitivity and specificity tend to 1 when sample sizes are large enough^29,33; the question is for which sample sizes we come close. Overall, specificity is very close to one across all conditions (M = .990, sd = .014) with somewhat lower specificity scores for the largest and most dense random networks (see Table 2). Overall, sensitivity is lower (M = .463, sd = .238) but becomes moderate for conditions involving more than 100 observations (M = .568, sd = .171). The reason that sensitivity is lower than specificity lies in the use of the penalty function (lasso); to manage the size of the computational problem, eLasso tends to suppress small but nonzero connections towards zero. Thus, lower sensitivity values mainly reflect the fact that very weak connections are set to zero; however, the important connections are almost aways correctly identified. In addition, the specificity results indicate that there are very few false positives in the estimated networks; thus, eLasso handles the multiple testing problem very well. Figure 1 nicely illustrates these results: almost all estimated connections in Figure 1c are also present in the generating network depicted in Figure 1b (high specificity), but weaker connections in the original network are underestimated (low sensitivity).

Table 2 Sensitivity and specificity, as a measure of performance of eLasso. Data is simulated under various conditions (s_size, n_nodes, connectedness (p (probability of a connection), pa (preferential attachment), pr (probability of rewiring)) when the AND-rule and γ = .25 is applied. For networks with 100 nodes, deviating levels of connectedness are displayed between brackets. Results of applying eLasso with the OR-rule and γ = 0 are displayed between brackets

Full size table

The above pattern of results, involving adequate network recovery with high specificity and moderately high sensitivity, is representative for almost all simulated conditions. The only exception to this rule results when the largest random and scale-free networks (100 nodes) are coupled with the highest level of connectivity. In these cases, the estimated coefficients show poor correlations with the coefficients of the generating networks, even for conditions involving 2000 observations (.222 and .681, respectively). For random networks, the reason for this is that the number of connections increases as the level of connectivity increases. For scale-free networks, the number of connections does not increase with increasing level of connectivity, but it does result in a peculiar arrangement of network connections, in which one node comes to have disproportionately many connections. Because eLasso penalises variables for having more connections, larger sample sizes are needed to overcome this penalty for these types of networks.

Although the lower level of sensitivity is partly inherent in the chosen method to handle the computational size of the problem and the solution to multiple testing through penalisation, it might be desirable in some cases to have a higher sensitivity at the expense of specificity. In eLasso, sensitivity can generally be increased in two ways. First, eLasso identifies the set of neighbours for each node by computing the EBIC²⁸ (extended BIC). EBIC penalises solutions that involve more variables and more neighbours. This means that if the number of variables is high, EBIC tends to favour solutions that assign fewer neighbours to any given node. In this procedure, a hyperparameter called γ determines the strength of the extra penalty on the number of neighbours^29,33. In our main simulation study, we used γ = .25. When γ = 0, no extra penalty is given for the number of neighbours, which results in a greater number of estimated connections. Second, we applied the so-called AND-rule to determine the final edge set. The AND-rule requires both regression coefficients β_jk and β_kj (from the ℓ₁-regularised logistic regression of X_j on X_k and of X_k on X_j) to be nonzero. Alternatively, the OR-rule can be applied. The OR-rule requires only one of β_jk and β_kj to be nonzero, which also results in more estimated connections.

By applying the OR-rule and γ = 0, correlations between true and estimated coefficients are even higher in all conditions with 500 observations and more (M = .895, sd = .156; Table 1). Sensitivity also improved across all conditions (M = .584, sd = .221; Table 2). With more than 100 observations, average sensitivity is higher (M = .682, sd = .153). Applying the OR-rule and setting γ = 0 thus indeed increases the sensitivity of eLasso. As expected, this gain in sensitivity results in a loss of specificity; however, this loss is slight, as specificity remains high across all conditions (M = .956, sd = .039; Table 2).

Finally, it should be noted that with sparse networks, specificity partly takes on high values due to the low base rate of connections, since it is based on the number of true negatives. Therefore, we also investigated another measure, the so-called F1 score, that is not based on true negatives but on true positives, false positives and false negatives³⁴; as such, it is independent of the base rate. For most conditions, the trends in the results are comparable. However, for larger and/or more dense random networks, the proportion of estimated connections that are not present in the true network is larger. More details about these results are provided in the online Supplementary Information.

To conclude, eLasso proves to be an adequate method to estimate networks from binary data. The validation study indicates that, with sample sizes of 500, 1000 and 2000, the estimated network strongly resembles the true network (high correlations). Specificity is uniformly high across conditions, which means there is a near absence of false positives among estimated network connections. Sensitivity is moderately high and increases with sample size. For the most part, sensitivity is lowered because of weak connections that are incorrectly set to zero; in these cases, however, eLasso still adequately picks up the most important connectivity structures. For larger networks with either higher connectivity or a higher level of preferential attachment, sensitivity becomes lower; in these cases, more observations are needed.

Application to real data

To demonstrate the utility of eLasso, we apply it to a large data set (N = 1108) containing measurements of depression of healthy controls and patients with a current or history of depressive disorder. We used 27 items of the Inventory of Depressive Symptomatology³⁵, which was administered in the Netherlands Study of Depression and Anxiety³⁶ (NESDA). Using eLasso, we investigate how individual depression symptoms are related, as this may reveal which symptoms are important in the depression network; in turn, this information may be used to identify targets for intervention in clinical practice.

The eLasso network for these data is given in Figure 3. To analyse the depression network, we focus on the most prominent properties of nodes in a network: node strength, betweenness and clustering coefficient (Figure 4). Node strength is a measure of the number of connections a node has, weighted by the eLasso coefficients³⁷. Betweenness measures how often a node lies on the shortest path between every combination of two other nodes, indicating how important the node is in the flow of information through the network^38,39. The local clustering coefficient is a measure of the degree to which nodes tend to cluster together. It is defined as how often a node forms a triangle with its direct neighbours, proportional to the number of potential triangles the relevant node can form with its direct neighbours³⁸. These measures are indicative of the potential spreading of activity through the network. As activated symptoms can activate other symptoms, a more densely connected network facilitates symptom activation. Moreover, we inspect the community structure of the networks derived from the empirical data, to identify clusters of symptoms that are especially highly connected.

Figure 3 reveals that most cognitive depressive symptoms (e.g., “feeling sad” (sad), “feeling irritable” (irr), “quality of mood” (qmo), “response of your mood to good or desired events” (rmo), “concentration problems” (con) and “self criticism and blame” (sel)) seem to be clustered together. These symptoms also seem to score moderate to high on at least two out of three centrality measures (Figure 4). For example, “rmo” has a moderate strength and a very high clustering coefficient, whereas it has a low betweenness. This indicates that activation in the network does not easily affect response of mood to positive events (low betweenness), but that, if the symptom is activated, the cluster will tend to stay infected because of the high interconnectivity (high clustering coefficient). Another interesting example is “energy level” (ene), which has a high node strength and betweenness, but a moderate clustering coefficient. Apparently, energy level has many and/or strong connections (high strength) and lies on many paths between symptoms (high betweenness), whereas it is not part of a strongly clustered group of symptoms (moderate clustering coefficient). This symptom is probably more important in passing information through the network, or between other clusters and might, therefore, be an interesting target for intervention.

As opposed to cognitive depressive symptoms, most anxiety and somatic symptoms (e.g., “panic/phobic symptoms” (pan), “aches and pains” (ach), “psychomotor agitation” (agi)) feature low scores on at least two centrality measures. Apparently, most anxiety and somatic symptoms either are less easily affected by other activated symptoms, do not tend to stay infected because of low interconnectivity (low clustering coefficient), or are less important for transferring information through the network (low betweenness). This is to be expected, since participants with a current or history of anxiety disorder are excluded from our sample. The item “feeling anxious” (anx), however, seems to be an important exception; feeling anxious does have a high node strength, a relatively high betweenness and a moderate clustering coefficient. Apparently, feeling anxious does play an important role in our sample of depressive and healthy persons: it can be activated very easily, since a lot of information flows through it (high betweenness), and, in turn, it can activate many other symptoms because it has many neighbours (high node strength, moderate clustering). The role of feeling anxious in our network is in line with high comorbidity levels of anxiety and depressive disorders found in the literature^40,41,42. Still, feeling anxious is not a symptom of depression according to current classifications, even though recent adaptations in DSM-5 propose an anxiety specifier for patients with mood disorders⁴³. In line with this, our data suggest that people with a depressive disorder experience depressive symptoms often also feel anxious, although they may not have an anxiety disorder. This supports criticisms of the boundaries between MDD and generalised anxiety, which have been argued to be artificial⁸.

Another interesting feature of networks lies in their organization in community structures: clusters of nodes that are relatively highly connected. In the present data, the Walktrap algorithm^44,45 reveals a structure involving six communities (see Figure 5). The purple cluster contains mostly negative mood symptoms, such as “feeling sad” (sad) and “feeling irritable” (irr); the pink cluster contains predominantly positive mood symptoms, such as “capacity of pleasure” (ple) and “general interest” (int); the green cluster is related to anxiety and somatic symptoms, such as “anxiety” (anx) and “aches and pains” (ach); the blue and yellow clusters represent sleeping problems.

Discussion

eLasso is a computationally efficient method to estimate weighted, undirected networks from binary data. The present research indicates that the methodology performs well in situations that are representative for psychology and psychiatry, with respect to the number of available observations and variables. Network architectures were adequately recovered across simulation conditions and, insofar as errors were made, they concerned the suppression of very weak edges to zero. Thus, eLasso is a viable methodology to estimate network structure in typical research settings in psychology and psychiatry and fills the gap in estimating network structures from non-Gaussian data.

Simulations indicated that the edges in the estimated network are nearly always trustworthy: the probability of including an edge, that is not present in the generating network, is very small even for small sample sizes. Due to the use of the lasso, more regression coefficients are set to zero in small sample sizes, which results in a more conservative estimation of network structure. For larger networks that are densely connected or that feature one node with a disproportionate number of connections, more observations are needed to yield a good estimate of the network. As the sample size grows, more and more true edges are estimated, in line with the asymptotic consistency of the method.

The model we presented may be extended from its current dichotomous nature to accommodate ordinal data, which are also prevalent in psychiatric research. For multinomial data, for example, the Potts model could be used⁴⁶. This model is a generalisation of the Ising model with two states to a model with more than two states. Another straightforward extension of the model involves generalisation to binary time series data (by conditioning on the previous time point to render observations independent).

Methods

In this section we briefly explain the newly implemented method eLasso, provide the algorithm, describe the validation study and the real data we used to show the utility of eLasso.

eLasso

Let x = (x₁, x₂, …, x_n) be a configuration where x_i = 0 or 1. The conditional probability of X_j given all other nodes X_\j according to the Ising model^26,47 is given by

where τ_j and β_jk are the node parameter (or threshold) and the pairwise interaction parameter respectively.

In practice, the graph structure of psychological constructs is unknown. Therefore, the estimation of the unknown graph structure and the corresponding parameters is of central importance. By viewing X_j as the response variable and all other variables X_\j as the predictors, we may fit a logistic regression function to investigate which nodes are part of the neighbourhood of the response variable. The intercept τ_j of the regression equation is the threshold of the variable, while the slope β_jk of the regression equation is the connection strength between the relevant nodes. In order to perform the logistic regression, we need multiple independent observations of the variables.

To establish which of the variables in the data are neighbours of a given variable and which are not, we used ℓ₁–regularised logistic regression^25,26. This technique is commonly called the lasso (least absolute shrinkage and selection operator) and optimises neighbourhood selection in a computationally efficient way, by optimising the convex function

in which i represents the independent observations {1, 2, .., n}, contains all β_jk and τ_j parameters and ρ is the penalty parameter. The final term with ρ ensures shrinkage of the regression coefficients^24,26. Parameter τ_j can be interpreted as the tendency of the variable to take the value 1, regardless of its neighbours. Parameter β_jk represents the interaction strength between j and k.

The optimisation procedure is applied to each variable in turn with all other variables as predictors. To this end, the R package glmnet can be used⁴⁸. The glmnet package uses a range of maximal 100 penalty parameter values. The result is a list of 100 possible neighbourhood sets, some of which may be the same. To choose the best set of neighbours, we used the EBIC²⁸ (extended Bayesian Information Criterion). The EBIC is represented as

in which is the log likelihood (see below), |J| is the number of neighbours selected by logistic regression at a certain penalty parameter ρ, n is the number of observations, p − 1 is the number of covariates (predictors) and γ is a hyperparameter, determining the strength of prior information on the size of the model space³³. The EBIC has been shown to be consistent for model selection and to performs best with hyperparameter γ = 0.25 for the Ising model²⁹. The model with the set of neighbours J that has the lowest EBIC is selected. From equation (1), it follows that the log likelihood of the conditional probability of X_j given its neighbours X_ne_(j) over all observations is

At this stage, we have the regression coefficients of the best set of neighbours for every variable; i.e., we have both β_jk and β_kj and have to decide whether there is an edge between nodes j and k or not. Two rules can be applied to make the decision: the AND rule, where an edge is present if both estimates are nonzero and the OR rule, where an edge is present if at least one of the estimates is nonzero^25,26.

Although we do have the final edge set by applying one of the rules, note that for any two variables j and k, we get two results: the result of the regression of j on k (β_jk) and the result of the regression of k on j (β_kj). To obtain an undirected graph, the weight of the edge between nodes j and k, ω_jk, is defined as the mean of both regression coefficients β_jk and β_kj. All steps of the described method are summarised in the algorithm below and is incorporated in R package IsingFit (http://cran.r-project.org/web/packages/IsingFit/IsingFit.pdf).

Input data set X for p variables and n subjects

Output (weighted) edge set for all pairs X_j and X_k

1
Initialise: Select (randomly) one variable from the set of variables. This is the dependent variable.
1. a
  Perform ℓ₁-regularised logistic regression on all other variables (glmnet uses 100 values of penalty parameter ρ).
2. b
  Compute the EBIC for ρ (i.e., each set of neighbours).
3. c
  Identify the set of neighbours that yield the lowest EBIC.
4. d
  Collect the resulting regression parameters in matrix Θ with τ on the diagonal and β on the off-diagonal.
5. e
  Repeat steps a through d for all p variables.
2
Determine the final edge set by applying the AND rule: if both regression coefficients β_jk and β_kj in Θ are nonzero, then there is an edge between nodes j and k.
3
Average the weights of the regression coefficients β_jk and β_kj. Define Θ* as the averaged weighted adjacency matrix with thresholds τ on the diagonal. This is now a symmetric matrix.
4
Create a graph corresponding to the off-diagonal elements of the averaged weighted adjacency matrix Θ*. This can be done with qgraph in R⁴⁹.

Validation study

We generated data from the three most popular types of network architectures: random networks, scale-free networks and small world (clustered) networks^30,31,32. Figure 1a shows illustrative examples of each type of networks. Network sizes were chosen to be comparable to the most common number of items in symptom checklists (10, 20 and 30 nodes), but also large networks were included (100 nodes). The level of connectivity of the networks was chosen to generate sparse networks. For this reason, in case of random networks, the probability of a connection (P_conn) between two nodes was set to 0.1, 0.2 and 0.3. For small world networks, the neighbourhood was set to 2 and for scale-free networks only one edge is added per node at each iteration in the graph generating process. To obtain a wide variety of well known graph structures, the rewiring probability (P_rewire) in small world networks was set to 0.1, 0.5 and 1 and the power of preferential attachment (P_attach) in scale-free networks was set to 1, 2 and 3. For the condition with 100 nodes, we used different levels of connectivity for random and scale-free networks (random networks: P_conn = .05, .1 and .15; scale-free networks: P_attach = 1, 1.25 and 1.5). Otherwise, nodes will have too many connections.

The generated networks are binary: all connections have weight 1 or 0. To create weighted networks, positive weights were assigned from squaring values from a normal distribution with a mean of 0 and a standard deviation of 1 to obtain weights in a realistic range. Examples of resulting weighted networks are displayed in Figure 1b. Besides weights, thresholds of the nodes are added. To prevent nodes with many connections to be continuously activated and consequently having no variance, thresholds were generated from the normal distribution between zero and minus the degree of a node.

From the weighted networks with thresholds, data was generated according to the Ising model by drawing samples using the Metropolis-Hastings algorithm, implemented in R using the IsingSampler package^50,51,52. Four sample size conditions were chosen that are realistic in psychology and psychiatry: 100, 500, 1000 and 2000. The generated data were used to estimate networks with eLasso. Examples of the resulting estimated networks are displayed in Figure 1c.

This setup led to a 3 × 4 × 3 × 4 quasi-factorial design, with the factors network type (random, small world, scale-free), level of connectedness, network size (10, 20, 30, 100) and sample size (100, 500, 1000, 2000). Thus, the total simulation study involved 144 conditions. Each of these conditions was replicated 100 times. For each condition, the mean correlation between data generating and estimated parameters, the mean sensitivity and the mean specificity is computed. These served as outcome measures, indicating the quality of network recovery. Sensitivity, or the true positive rate, is defined as SEN = TP/(TP + FN), in which TP is the number of true positives and FN is the number of false negatives. Specificity, or the true negative rate, is defined as SPE = TN/(TN + FP), in which TN is the number of true negatives and FP is the number of false positives. Note that, in order to compute sensitivity and specificity, the off-diagonal elements of the weighted adjacency matrix Θ* (β_jk), have to be dichotomised.

Since specificity naturally takes on high values for sparse networks, also the F1 score is computed. For more details about the F1 score and the results, see Supplementary Information online.

Data description

We used data from the Netherlands Study of Depression and Anxiety³⁶ (NESDA). This is an ongoing cohort study, designed to examine the long-term course and consequences of major depression and generalised anxiety disorder in the adult population (aged 18–65 years). At the baseline assessment in 2004, 2981 persons were included. Participants consist of a healthy control group, people with a history of depressive or anxiety disorder and people with current depressive and/or anxiety disorder.

To demonstrate eLasso, we selected individuals from NESDA with a current or history of depressive disorder and healthy controls. To this end, we excluded everyone with a current or history of anxiety disorder. The resulting data set contains 1108 participants. To construct a network we used 27 items of the self-report Inventory of Depressive Symptomatology³⁵ that relates to symptoms in the week prior to assessment (IDS).

Data were dichotomised in order to allow the application of the Ising model. Therefore, the four response categories of the IDS items were recoded into 0 and 1. The first response category of each item indicates the absence of the symptom. In the case of “feeling sad”, the first answering category is “I do not feel sad”. This option is recoded to 0, since it indicates the absence of the symptom. The other three options (“I feel sad less than half the time”, “I feel sad more than half the time” and “I feel sad nearly all of the time”) are recoded to 1, indicating the presence of the symptom to some extent. Other items are recoded similarly.

Analysing the dichotomised data with our method and visualising the results with the qgraph package for R⁴⁹, results in the network in Figure 3. The layout of the graph is based on the Fruchterman-Reingold algorithm, which iteratively computes the optimal layout so that nodes with stronger and/or more connections are placed closer to each other⁵³. This network conceptualisation of depressive symptomatology might give new insights in issues that are still unexplained in psychology.

References

Barabási, A. L. The network takeover. Nat. Phys. 8, 14–16 (2012).
Article CAS Google Scholar
Barzel, B. & Barabási, A. L. Universality in network dynamics. Nat. Phys. 9, 673–681 (2013).
Article CAS PubMed Central Google Scholar
Kitsak, M. et al. Identification of influential spreaders in complex networks. Nat. Phys. 6, 888–893 (2010).
Article CAS Google Scholar
Liu, Y. Y., Slotine, J. J. & Barabási, A.-L. Controllability of complex networks. Nature 473, 167–173 (2011).
Article CAS ADS PubMed Google Scholar
Vespignani, A. Modelling dynamical processes in complex socio-technical systems. Nat. Phys. 8, 32–39 (2012).
Article CAS MathSciNet Google Scholar
Borsboom, D. Psychometric perspectives on diagnostic systems. J. Clin. Psychol. 64, 1089–1108 (2008).
Article PubMed Google Scholar
Borsboom, D. & Cramer, A. O. J. Network analysis: An integrative approach to the structure of psychopathology. Annu. Rev. Clin. Psychol. 9, 91–121 (2013).
Article PubMed Google Scholar
Cramer, A. O. J., Waldorp, L. J., Van Der Maas, H. L. J. & Borsboom, D. Comorbidity: A network perspective. Behav. Brain Sci. 33, 137–150 (2010).
Article PubMed Google Scholar
Schmittmann, V. D. et al. Deconstructing the construct: A network perspective on psychological phenomena. New Ideas Psychol. 31, 43–53 (2011).
Article Google Scholar
Van Der Maas, H. L. J. et al. A dynamical model of general intelligence: The positive manifold of intelligence by mutualism. Psychol. Rev. 113, 842 (2006).
Article PubMed Google Scholar
Schäfer, J. & Strimmer, K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat. Appl. Genet. Molec. Biol. 4, 32 (2005).
Article MathSciNet Google Scholar
Bickel, P. J. & Levina, E. Covariance regularization by thresholding. Ann. Stat. 36, 2577–2604 (2008).
Article MathSciNet MATH Google Scholar
Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2008).
Article PubMed MATH Google Scholar
Kalisch, M., Mächler, M., Colombo, D., Maathuis, M. H. & Bühlmann, P. Causal inference using graphical models with the r package pcalg. J. Stat. Softw. 47, 1–26 (2012).
Article Google Scholar
Spirtes, P., Glymour, C. & Scheines, R. Causation, Prediction and Search (MIT press, Cambrigde, Massachusetts, 2001).
Drton, M. & Perlman, M. Multiple testing and error control in gaussian graphical model selection. Statist. Sci. 22, 430–449 (2007).
Article MathSciNet MATH Google Scholar
Efron, B. Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Am. Statist. Assoc. 99, 96–104 (2004).
Article MathSciNet MATH Google Scholar
Strimmer, K. fdrtool: a versatile r package for estimating local and tail area-based false discovery rates. Bioinformatics 24, 1461–1462 (2008).
Article CAS PubMed Google Scholar
Kindermann, R. & Snell, J. L. Markov Random Fields and their Applications, vol. 1 (American Mathematical Society Providence, RI, 1980).
Lauritzen, S. Graphical Models (Oxford University Press, USA, 1996).
Speed, T. & Kiiveri, H. Gaussian markov distributions over finite graphs. Ann. Stat. 14, 138–150 (1986).
Article MathSciNet MATH Google Scholar
Foygel, R. & Drton, M. Extended bayesian information criteria for gaussian graphical models. Adv. Neural Inf. Process. Syst. 23, 2020–2028 (2010).
Google Scholar
Ravikumar, P., Wainwright, M. J., Raskutti, G. & Yu, B. et al. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electron. J. Statist. 5, 935–980 (2011).
Article MATH Google Scholar
Tibshirani, R. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58, 267–288 (1996).
MathSciNet MATH Google Scholar
Meinshausen, N. & Bühlmann, P. High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34, 1436–1462 (2006).
Article MathSciNet MATH Google Scholar
Ravikumar, P., Wainwright, M. J. & Lafferty, J. D. High-dimensional ising model selection using l1-regularized logistic regression. Ann. Stat. 38, 1287–1319 (2010).
Article MATH Google Scholar
Ising, E. Beitrag zur theorie des ferromagnetismus. Z. Phys. A-Hadrons. Nucl. 31, 253–258 (1925).
CAS Google Scholar
Chen, J. & Chen, Z. Extended bayesian information criteria for model selection with large model spaces. Biometrika 95, 759–771 (2008).
Article MathSciNet MATH Google Scholar
Foygel, R. & Drton, M. High-dimensional ising model selection with bayesian information criteria. arXiv preprint arXiv:1403.3374 (2014).
Barabási, A. L. & Albert, R. Emergence of scaling in random networks. Science 286, 509–512 (1999).
Article ADS MathSciNet PubMed MATH Google Scholar
Erdös, P. & Rényi, A. On random graphs. Publ. Math. Debrecen 6, 290–297 (1959).
MathSciNet MATH Google Scholar
Watts, D. J. & Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature 393, 440–442 (1998).
Article CAS ADS PubMed Google Scholar
Foygel, R. & Drton, M. Bayesian model choice and information criteria in sparse generalized linear models. arXiv preprint arXiv:1112.5635 (2011).
Jardine, N. & van Rijsbergen, C. J. The use of hierarchic clustering in information retrieval. Inform. Storage Ret. 7, 217–240 (1971).
Article Google Scholar
Rush, A. et al. The inventory of depressive symptomatology (IDS): Psychometric properties. Psychol. Med. 26, 477–486 (1996).
Article CAS PubMed Google Scholar
Penninx, B. W. et al. The netherlands study of depression and anxiety (NESDA): Rationale, objectives and methods. Int. J. Method. Psych. 17, 121–140 (2008).
Article Google Scholar
Barrat, A., Barthelemy, M., Pastor-Satorras, R. & Vespignani, A. The architecture of complex weighted networks. Proc. Natl. Acad. Sci. U.S.A. 101, 3747–3752 (2004).
Article CAS ADS PubMed PubMed Central Google Scholar
Boccaletti, S., Latora, V., Moreno, Y., Chavez, M. & Hwang, D. Complex networks: Structure and dynamics. Phys. Rep. 424, 175–308 (2006).
Article ADS MathSciNet MATH Google Scholar
Opsahl, T., Agneessens, F. & Skvoretz, J. Node centrality in weighted networks: Generalizing degree and shortest paths. Soc. Networks 32, 245–251 (2010).
Article Google Scholar
Goldberg, D. & Fawcett, J. The importance of anxiety in both major depression and bipolar disorder. Depress. Anxiety 29, 471–478 (2012).
Article PubMed Google Scholar
Kessler, R. C., Nelson, C. B., McGonagle, K. A. & Liu, J. et al. Comorbidity of DSM-III—R major depressive disorder in the general population: Results from the US National Comorbidity Survey. Br. J. Psychiatry 30, 17–30 (1996).
Article Google Scholar
Schoevers, R. A., Beekman, A. T. F., Deeg, D. J. H., Jonker, C. & Van Tilburg, W. Comorbidity and risk-patterns of depression, generalised anxiety disorder and mixed anxiety-depression in later life: results from the amstel study. Int. J. Geriatr. 18, 994–1001 (2003).
Article CAS Google Scholar
American Psychiatric Association. . The Diagnostic and Statistical Manual of Mental Disorders (5th ed.) (Arlington, VA: American Psychiatric Publishing, 2013).
Orman, G. K. & Labatut, V. A Comparison of Community Detection Algorithms on Artificial Networks (Springer, Berlin Heidelberg, 2009).
Pons, P. & Latapy, M. Computing communities in large networks using random walks. J. Graph Algorithms Appl. 10, 191–218 (2006).
Article MathSciNet MATH Google Scholar
Wu, F.-Y. The potts model. Rev. Mod. Phys. 54, 235 (1982).
Article ADS MathSciNet Google Scholar
Loh, P. L. & Wainwright, M. J. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses. Ann. Stat. 41, 3022–3049 (2013).
Article MathSciNet MATH Google Scholar
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1 (2010).
Article PubMed PubMed Central Google Scholar
Epskamp, S., Cramer, A. O. J., Waldorp, L. J., Schmittmann, V. D. & Borsboom, D. qgraph: Network visualizations of relationships in psychometric data. J. Stat. Softw. 48, 1–18 (2012).
Article Google Scholar
Hastings, W. K. Monte carlo sampling methods using markov chains and their applications. Biometrika 57, 97–109 (1970).
Article MathSciNet MATH Google Scholar
Epskamp, S. IsingSampler: Sampling methods and distribution functions for the Ising model (2013). URL github.com/SachaEpskamp/IsingSampler. R package version 0.1.
Murray, I. Advances in Markov chain Monte Carlo methods. PhD thesis, Gatsby Computational Neuroscience Unit, University College London (2007).
Fruchterman, T. M. & Reingold, E. M. Graph drawing by force-directed placement. Software Pract. Exper. 21, 1129–1164 (1991).
Article Google Scholar

Download references

Author information

Authors and Affiliations

Interdisciplinary Center Psychopathology and Emotion regulation, University Medical Center Groningen, University of Groningen,
Claudia D. van Borkulo, Lynn Boschloo & Robert A. Schoevers
Department of Psychology, Psychological Methods, University of Amsterdam,
Claudia D. van Borkulo, Denny Borsboom, Sacha Epskamp, Tessa F. Blanken & Lourens J. Waldorp

Authors

Claudia D. van Borkulo
View author publications
You can also search for this author in PubMed Google Scholar
Denny Borsboom
View author publications
You can also search for this author in PubMed Google Scholar
Sacha Epskamp
View author publications
You can also search for this author in PubMed Google Scholar
Tessa F. Blanken
View author publications
You can also search for this author in PubMed Google Scholar
Lynn Boschloo
View author publications
You can also search for this author in PubMed Google Scholar
Robert A. Schoevers
View author publications
You can also search for this author in PubMed Google Scholar
Lourens J. Waldorp
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

C.v.B., D.B. and L.J.W. wrote the main manuscript. C.v.B., S.E. and T.F.B. carried out the validation study, C.v.B. and S.E. prepared Figures 1 and 3, C.v.B. and L.J.W. prepared Figure 2, C.v.B. prepared Tables 1–2. C.v.B., L.B. and R.A.S. wrote the Application to real data section. All authors contributed to manuscript revisions.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Electronic supplementary material

Supplementary Information

Rights and permissions

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder in order to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/

Reprints and permissions

About this article

Cite this article

van Borkulo, C., Borsboom, D., Epskamp, S. et al. A new method for constructing networks from binary data. Sci Rep 4, 5918 (2014). https://doi.org/10.1038/srep05918

Download citation

Received: 08 April 2014
Accepted: 11 July 2014
Published: 01 August 2014
DOI: https://doi.org/10.1038/srep05918

This article is cited by

Dynamic predictors of COVID-19 vaccination uptake and their interconnections over two years in Hong Kong
- Jiehu Yuan
- Yucan Xu
- Qiuyan Liao
Nature Communications (2024)
Exploring symptom-level associations between anxiety and depression across developmental stages of adolescence: a network analysis approach
- Yunhan He
- Chengrui Wu
- Xinli Chi
BMC Psychiatry (2023)
Network structure of mobile phone addiction and anxiety symptoms among rural Chinese adolescents
- Jingjing Wang
- Yunjiao Luo
- Wei Wang
BMC Psychiatry (2023)
A network analysis of subjective well-being in Chinese high school students
- Shiwei Wang
- Siqi Zhao
- Mei Yang
BMC Public Health (2023)
Network analysis of neuropsychiatric symptoms in Alzheimer’s disease
- Grace J. Goodwin
- Stacey Moeller
- Samantha E. John
Alzheimer's Research & Therapy (2023)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.