Network analysis of multivariate data in psychological science

Borsboom, Denny; Deserno, Marie K.; Rhemtulla, Mijke; Epskamp, Sacha; Fried, Eiko I.; McNally, Richard J.; Robinaugh, Donald J.; Perugini, Marco; Dalege, Jonas; Costantini, Giulio; Isvoranu, Adela-Maria; Wysocki, Anna C.; van Borkulo, Claudia D.; van Bork, Riet; Waldorp, Lourens J.

doi:10.1038/s43586-021-00055-w

Primer
Published: 19 August 2021

Denny Borsboom ORCID: orcid.org/0000-0001-9720-4162¹,
Marie K. Deserno²,
Mijke Rhemtulla³,
Sacha Epskamp¹,⁴,
Eiko I. Fried⁵,
Richard J. McNally⁶,
Donald J. Robinaugh⁷,
Marco Perugini ORCID: orcid.org/0000-0002-4864-6623⁸,
Jonas Dalege⁹,
Giulio Costantini⁸,
Adela-Maria Isvoranu ORCID: orcid.org/0000-0001-7981-9198¹,
Anna C. Wysocki³,
Claudia D. van Borkulo¹,⁴,
Riet van Bork ORCID: orcid.org/0000-0002-4772-8862¹⁰ &
…
Lourens J. Waldorp¹

Nature Reviews Methods Primers volume 1, Article number: 58 (2021)

An Author Correction to this article was published on 21 February 2022

This article has been updated

Abstract

In recent years, network analysis has been applied to identify and analyse patterns of statistical association in multivariate psychological data. In these approaches, network nodes represent variables in a data set, and edges represent pairwise conditional associations between variables in the data, while conditioning on the remaining variables. This Primer provides an anatomy of these techniques, describes the current state of the art and discusses open problems. We identify relevant data structures in which network analysis may be applied: cross-sectional data, repeated measures and intensive longitudinal data. We then discuss the estimation of network structures in each of these cases, as well as assessment techniques to evaluate network robustness and replicability. Successful applications of the technique in different research areas are highlighted. Finally, we discuss limitations and challenges for future research.

Introduction

In many scientific fields, researchers study phenomena best characterized at the systems level¹. To understand such phenomena, it is often insufficient to focus on the way individual components of a system operate. Instead, one must also study the organization of the system’s components, which can be represented in a network². The value of analysing the structure of a system in this way has been underscored by the advent of network science, which has delivered important insights into diverse sets of phenomena studied across the sciences^3,4. This Primer discusses methodology to apply this line of reasoning to the statistical analysis of multivariate data.

Network approaches involve the identification of system components (network nodes) and the relations among them (links between nodes). Well-known examples include semantic networks (in which concepts are connected through shared meanings⁵), social networks (in which people are connected through acquaintance⁶) and neural networks (in which neurons are connected through axons⁷). After nodes and links are identified, and a network has been constructed, one can study its topology using descriptive tools of network science⁸. For instance, one can describe the global topology of a network (such as a small-world network or random graph⁹) or the position of individual nodes within the network (for example, by assessing node centrality¹⁰). These analyses are often carried out with the goal of relating structural features of the network to system dynamics^4,11.

Network representations have a long history as research tools in statistics, where they encode important information concerning the joint probability distribution of a set of variables¹². For instance, in graphical models, unconnected nodes are conditionally independent given all or a subset of other nodes in the network¹²; in causal models, graphical criteria are used to determine whether parameters in an estimated causal model are identified¹³; and in structural equation models, path-tracing rules on network representations are used to determine the value of empirical correlations implied by the model¹⁴.

In this Primer, we present network analysis of multivariate data as a method that combines multivariate statistics and network science to investigate the structure of relationships in multivariate data. This approach identifies network nodes with variables, and links between nodes with statistical parameters that connect these variables (for example, partial correlations). Statistical models are used to assess the parameters that define the links in the network, in a process known as network structure estimation. Then, using a process of network description, the resulting network is characterized using the tools of network science^15,16,17. Here, we refer to this combined procedure of network structure estimation and network description as psychometric network analysis (Fig. 1).

**Fig. 1: Structure of psychometric network analysis.**

Network approaches to multivariate data can be used to advance several different goals. First, they can be used to explore the structure of high-dimensional data in the absence of strong prior theory on how variables are related. In these analyses, psychometric network analysis complements existing techniques for the exploratory analysis of psychological data, such as exploratory factor analysis (which aims to represent shared variance due to a small number of latent variables) and multidimensional scaling (which aims to represent similarity relations between objects in a low-dimensional metric space). The unique focus of psychometric network analysis is on the patterns of pairwise conditional dependencies that are present in the data. Second, network representations can be used to communicate multivariate patterns of dependency effectively, because they offer powerful visualizations of patterns of statistical association. Third, network models can be used to generate causal hypotheses, as they represent statistical structures that may offer clues to causal dynamics; for instance, networks that represent conditional independence relations form a gateway that connects correlations to causal relations^13,18,19.

Here, we review these functions of network analysis in the context of three types of application in psychological science, illustrating them with examples taken from personality, attitude research and mental health.

Experimentation

The schematic workflow of psychometric network analysis as discussed in this paper is represented in Fig. 2. Typically, one starts with a research question that dictates a data collection scheme, which includes cross-sectional designs, time-series designs and panel designs. Psychometric network analysis begins with node selection, a choice primarily driven by substantive rather than methodological considerations. The core of the psychometric network analysis methodology then lies in the steps of network structure estimation, network description and network stability analysis. Importantly, inferences drawn from the output of network analytic methods require both substantive domain knowledge and general methodological considerations regarding the stability and robustness of the estimated network in order to optimally inform scientific inference.

**Fig. 2: Schematic representation of the workflow used in network approaches to multivariate data.**

Network approaches to multivariate data are based on generic statistical procedures and thus invite applications to many types of data. The approaches discussed in this paper, however, have been developed and typically used in the context of psychometric variables such as responses to questionnaire items, symptom ratings and cognitive test scores²⁰, possibly extended with background variables such as age and gender²¹, genetic information²², physiological markers²³, medical conditions²⁴, experimental interventions²⁵ and anticipated downstream effects²⁶. Accordingly, the nodes we discuss will ordinarily represent items and tests.

The majority of network modelling approaches use conditional associations to define the network structure prevalent in a set of variables^20,27. A conditional association between two variables holds when these variables are probabilistically dependent, conditional on all other variables in the data. Which measure of conditional association to use depends on the structure of the data; for instance, for multivariate normal data, partial correlations would be indicated, whereas for binary data, logistic regression coefficients may be used. The strength of this conditional association is typically represented in the network as an edge weight that describes the connection between two nodes. If the association between two variables can be explained by other variables in the network, so that their conditional association vanishes when these other variables are controlled for, then the corresponding nodes are disconnected in the network representation.
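
As an illustration of the multivariate normal case, the following minimal sketch computes partial correlations from the inverse of the sample covariance matrix, using the standard identity that the partial correlation between variables i and j equals the negated off-diagonal precision element divided by the square root of the product of the two corresponding diagonal elements. The simulated chain X → Y → Z, the coefficient 0.7, the sample size and the function name `partial_correlations` are illustrative assumptions and are not taken from the data sets discussed in this Primer.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlations for an (n_samples, n_vars) array.

    Each entry (i, j) is the correlation between variables i and j
    after conditioning on all remaining variables.
    """
    cov = np.cov(data, rowvar=False)
    precision = np.linalg.inv(cov)          # inverse covariance matrix
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)      # standardize and flip the sign
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Hypothetical chain: X influences Y, and Y influences Z.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(size=n)
z = 0.7 * y + rng.normal(size=n)
data = np.column_stack([x, y, z])

marginal = np.corrcoef(data, rowvar=False)
partial = partial_correlations(data)

print("marginal corr(X, Z):", round(marginal[0, 2], 3))
print("partial corr(X, Z | Y):", round(partial[0, 2], 3))
```

In this simulated example, X and Z are clearly correlated marginally, but their partial correlation is close to zero because Y accounts for their association; in the network representation, the nodes for X and Z would therefore be disconnected.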

The description of the joint probability distribution of a set of variables in terms of pairwise statistical interactions is a graphical model¹² known as the pairwise Markov random field (PMRF)²⁷. Versions of the PMRF are known under several other names as well in the statistical literature; see refs^28,29 for an overview of the relations between relevant statistical models. Many network modelling approaches attempt to estimate the PMRF, typically using existing statistical methodologies such as significance testing³⁰, cross-validation³¹, information filtering³² and regularized estimation^16,33,34,35,36. Because of its prominence in the literature, this Primer is limited to network approaches that use the PMRF, although it should be noted that other approaches to the analysis of multivariate data exist, including models based on zero-order associations³⁷, self-reported causal relations between variables^38,39 and relative importance of variables⁴⁰.

Because, in typical multivariate data, a substantive subset of associations between variables vanishes upon conditioning, applications of network modelling generally return non-trivial topological structures and the description of such structures is an important goal of psychometric network analysis. For instance, the extent to which network nodes are connected and the network’s general topology are of interest, as well as the position of individual nodes in that structure. Thus, psychometric network analysis typically involves interpreting the output of statistical estimation procedures, for example an estimated PMRF, as the input for network description techniques taken from network science (Fig. 1).
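
To make the step from estimation to description concrete, the sketch below takes a symmetric matrix of edge weights, standing in for an estimated PMRF, and computes two simple descriptive quantities: network density (the proportion of possible edges that are present) and node strength (the sum of absolute edge weights attached to a node). The weight matrix is made up for illustration, and strength is only one of several centrality indices that could be used here; the function name `describe_network` is hypothetical.

```python
import numpy as np

def describe_network(weights):
    """Return (density, strength) for a symmetric edge-weight matrix."""
    w = np.asarray(weights, dtype=float)
    n = w.shape[0]
    off_diag = ~np.eye(n, dtype=bool)
    adjacency = (w != 0) & off_diag                 # an edge exists where the weight is non-zero
    density = adjacency.sum() / (n * (n - 1))       # present edges / possible edges
    strength = np.where(off_diag, np.abs(w), 0.0).sum(axis=1)
    return density, strength

# Made-up 4-node network: node 1 is connected to all other nodes.
weights = np.array([
    [0.0,  0.4, 0.0,  0.0],
    [0.4,  0.0, 0.3, -0.2],
    [0.0,  0.3, 0.0,  0.0],
    [0.0, -0.2, 0.0,  0.0],
])

density, strength = describe_network(weights)
print("density:", density)
print("strength per node:", strength)
```

Here three of the six possible edges are present, and node 1 has the largest strength because it is the only node connected to every other node.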

Types of data

Network models always operate on associations among sets of variables, but such associations can be extracted from many different experimental and quasi-experimental designs. We focus on three designs that represent typical data environments in social science where psychometric network analysis can be relevant: cross-sectional networks, longitudinal networks of panel data and time-series networks (Fig. 3).

**Fig. 3: Data structure, methods and resulting networks per typical data environment.**

Cross-sectional data

In applications to cross-sectional data, networks are representations of the conditional associations between variables measured at a single time point in a large sample (T = 1, N = large). In this case, the associations between variables are driven by individual differences, which renders such networks useful for studying the psychometric structure of psychological tests²⁹. In the cross-sectional data example used here, we are interested in the empirical relations among personality and personal goals. We analyse a data set in which three levels of personality structure are assessed via questionnaires, using network models to investigate empirical relations among these elements and personal goals. Our illustrative personality data set features 432 observations and 39 variables of interest⁴¹.

We represent network structures as they arise at different levels of aggregation⁴² at which personality can be described. These can be higher-order traits, such as conscientiousness; facets, such as orderliness, industriousness and impulse control⁴³; or even specific single items, such as prudent, reflective and disciplined (items of impulse control⁴⁴) that allow for a finer distinction of personality characteristics below facets (see ref.⁴⁵ for an example). The objective of psychometric network analysis, in this case, would be to offer insight into the multivariate pattern of conditional dependencies that characterize the joint distribution of these variables at these different levels of aggregation (Box 1).

When cross-sectional data are analysed through network estimation and interpreted via network description, it is important to keep in mind that the resulting topologies represent structures that describe differences between individuals, and that these are not necessarily isomorphic to processes or mechanisms that characterize the individuals who make up the data. That is, inter-individual differences do not necessarily translate to intra-individual processes^46,47. If one is interested solely in the structure of individual differences, cross-sectional data are adequate, but research into intra-individual dynamics ideally complements such data sources with panel data or time series.

Box 1 Psychometric structure of personality test scores

A substantial part of the literature on human personality is concerned with the psychometric structure of personality tests. Research has shown that people’s self-ratings on adjectives (such as outgoing, punctual and nervous) or responses to items that characterize them (I make friends easily, I get stressed out easily; see the International Personality Item Pool for an overview of psychometric items) show systematic patterns of correlations. These patterns of correlations are often described by a low-dimensional factor model; most often, solutions with five factors known as the Five Factor Model¹⁴² or with six factors known as HEXACO¹⁴³ are proposed. The factors in the Five Factor Model are often interpreted as latent variables that cause the correlations between the item scores. However, attempts to ground these latent variables in psychological or biological theories of human functioning have met with limited success, and correlations between personality items may have other causes that include content overlap and the presence of direct relations between properties measured by these items⁶⁹. Such hypotheses are consistent with the finding that items in personality scales typically either load on several factors simultaneously or feature correlated residuals, suggesting that the latent variable model does not fully account for the correlations between item scores. Recently, network models have been proposed as an alternative representation of the psychometric structure of personality tests that does not require a priori commitment to a particular generating model (such as a latent variable model) and may serve to identify alternative mechanisms that lead to correlations between items^44,144. An exploratory factor model and a network model are visualized in the figure using IPIP-Big Five Factor Markers open data¹⁴⁵.

Panel data

In network applications to longitudinal data (also referred to as panel data), a limited set of repeated measurements characterizes both the association structure of variables at a given time point and the way these conditional dependencies change over time (N > T). Such measures can illuminate the structure of individual differences and intra-individual change in parallel.

In our example for network approaches to panel data, we use repeated assessments of emotions and beliefs towards Bill Clinton as represented in longitudinal panel data of the American National Election Studies (ANES) between 1992 and 1996. We aim to model consistency, stability and extremity of attitudes towards Bill Clinton during the time that he transitioned from governor of Arkansas to president of the United States. The network theory of attitudes (Box 2) formalizes changes in attitude importance as network temperature, for example, increasing or decreasing interdependence between attitude elements. In the panel data example, network analyses can assist in modelling temperature changes in the interdependence of attitude elements towards Bill Clinton.

Box 2 Causal attitude network model and attitudinal entropy

The network theory of attitudes holds that attitudes are higher-level properties emerging from lower-level beliefs, feelings and behaviours¹¹¹. A negative attitude towards a politician might emerge from negative beliefs (that the politician is incompetent and bad for the future of the country), feelings (anger and frustration towards the politician) and behaviours (voting behaviour and making jokes about the politician). These different attitude elements can be modelled as nodes in a network, in which edges between attitude elements represent potentially bidirectional interactions between the elements. The network theory of attitudes relies on the central principle that interdependence between attitude elements increases when the attitude is important to the person and when an individual directs attention to the attitude object¹¹¹. This theory uses analogical modelling of statistical mechanics, and the effect of attitude importance and attention is formalized as a decrease in temperature. The effect of decreasing network temperature is that the entropy of a multivariate system decreases by making (attitude) elements in the system more interdependent. In the case of attitudes, this effect translates to heightened consistency and stability of the attitude when it is important, because the different attitude elements rein each other in under low temperature compared with high temperature (see the figure, parts a and b). Low temperature leads to low variance of the overall attitude within an individual, and hence higher stability. By contrast, a group of individuals with low-temperature attitude networks have higher variance than a high-temperature group, because the pressure of attitude elements to align leads to higher extremity of the overall attitude, creating a bimodal distribution.
As this bimodal distribution only occurs in a low-temperature/high-importance scenario, the network model offers a potential explanation for polarization: higher importance leads to more strongly connected networks, which in turn produces polarized attitudes.
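
The link between temperature and bimodality can be illustrated with a small numerical sketch. It assumes an Ising-type network of six attitude elements coded −1/+1, fully connected with a single shared positive edge weight and no thresholds, in which an inverse temperature parameter `beta` multiplies the edge weights (so a low temperature corresponds to a large `beta`). All of these choices, including the numerical values, are illustrative assumptions rather than estimates from the ANES data. Because the network is small, the distribution of the overall attitude (the sum score) can be computed exactly by enumerating all states.

```python
import itertools
import numpy as np

def sum_score_distribution(n_nodes, weight, beta):
    """Exact distribution of the sum score in a fully connected Ising-type network."""
    states = np.array(list(itertools.product([-1, 1], repeat=n_nodes)))
    totals = states.sum(axis=1)
    # With -1/+1 coding: sum over pairs of x_i * x_j equals (S^2 - n) / 2, where S is the sum score.
    pair_sum = (totals ** 2 - n_nodes) / 2
    log_potential = beta * weight * pair_sum
    probs = np.exp(log_potential - log_potential.max())   # subtract max for numerical stability
    probs /= probs.sum()
    scores = np.arange(-n_nodes, n_nodes + 1, 2)
    dist = np.array([probs[totals == s].sum() for s in scores])
    return scores, dist

def variance(scores, dist):
    mean = (scores * dist).sum()
    return ((scores - mean) ** 2 * dist).sum()

scores, high_temp = sum_score_distribution(6, weight=0.5, beta=0.1)  # weak interdependence
_, low_temp = sum_score_distribution(6, weight=0.5, beta=1.0)        # strong interdependence

var_high = variance(scores, high_temp)
var_low = variance(scores, low_temp)

for s, p_hi, p_lo in zip(scores, high_temp, low_temp):
    print(f"sum score {s:+d}: high temperature {p_hi:.3f}, low temperature {p_lo:.3f}")
```

Under these assumptions, the high-temperature distribution is unimodal around a neutral sum score, whereas the low-temperature distribution concentrates almost all probability on the two extreme sum scores, which yields the larger between-person variance described above.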

Time-series data

Networks as applied to time-series data of one or multiple persons characterize multivariate dependencies between time series of variables that are assessed intra-individually (T = large, N ≥ 1). Such networks are most often applied in situations where one seeks insight into the dynamic structure of systems. For instance, in the social and clinical sciences, recent years have witnessed a surge of daily diary studies and ecological momentary assessment, conducted via smartphones and designed to study such dynamic structures. Studies typically measure experiences — such as mood states, symptoms, cognitions and behaviours — at the moment they occur^48,49. In such cases, network analyses can assist in interpreting intensive longitudinal data by offering insightful characterizations of the multivariate pattern of dynamics.

In the time-series data example used here, we leverage data gathered during the onset of the COVID-19 pandemic to investigate the impact of reduced social contact due to lockdown measures on the mental health of students enrolled at Leiden University in the Netherlands. In this ecological momentary assessment study, students were followed daily for 2 weeks, assessing momentary social contact as well as current stress, anxiety and depression 4 times per day via a smartphone application⁵⁰. In this situation, a network model can be fitted to these data to investigate to what degree social contact variables influence mental health variables over the course of hours and days. Because, in this case, multiple individuals were assessed multiple times, the design is mixed; in such situations, it is often profitable to use a statistical multilevel approach^27,51, in which the repeated observations are treated as nested in the individuals. This explicitly separates individual differences from time dynamics⁵².
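
The logic of separating individual differences from time dynamics can be sketched as follows. This is a deliberately simplified stand-in for a multilevel model, not the procedure used in the study: each person's series is centred on that person's own mean, which removes stable between-person differences, and a single lag-1 coefficient matrix is then estimated by pooled least squares. The data are simulated; the two variables, the true coefficient matrix `true_B`, the number of persons and the function name `pooled_lag1_coefficients` are hypothetical. The 56 time points mirror 4 assessments per day over 2 weeks, and the sketch ignores the overnight gaps between days.

```python
import numpy as np

def pooled_lag1_coefficients(series_by_person):
    """Shared lag-1 coefficients after person-mean centring.

    series_by_person: list of (T, p) arrays, one per person.
    Returns B, where B[i, j] is the effect of variable j at t-1 on variable i at t.
    """
    predictors, outcomes = [], []
    for series in series_by_person:
        centred = series - series.mean(axis=0)   # remove stable person-level differences
        predictors.append(centred[:-1])          # values at t-1
        outcomes.append(centred[1:])             # values at t
    X = np.vstack(predictors)
    Y = np.vstack(outcomes)
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)  # least-squares solution of X @ coef = Y
    return coef.T

# Simulated example: variable 0 at t-1 influences variable 1 at t.
rng = np.random.default_rng(1)
true_B = np.array([[0.3, 0.0],
                   [0.4, 0.2]])
n_persons, n_times = 50, 56

data = []
for _ in range(n_persons):
    mu = rng.normal(scale=2.0, size=2)           # person-specific means
    x = np.zeros((n_times, 2))
    x[0] = mu + rng.normal(size=2)
    for t in range(1, n_times):
        x[t] = mu + true_B @ (x[t - 1] - mu) + rng.normal(size=2)
    data.append(x)

B_hat = pooled_lag1_coefficients(data)
print(np.round(B_hat, 2))
```

The estimated matrix can be read as a directed temporal network in which a non-zero entry `B_hat[i, j]` corresponds to an edge from node j to node i. A full multilevel approach would additionally allow these coefficients to vary across persons.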

Results

In a PMRF, the joint likelihood of multivariate data is modelled through the use of pairwise conditional associations, leading to a network representation that is undirected. There are several benefits to the PMRF that make this particular network representation important. First, the PMRF encodes conditional independence relations (in terms of absent links between nodes), which form an important gateway to identify candidate data-generating mechanisms^29,53,54. However, the PMRF does not require an a priori commitment to any particular data-generating mechanism (unlike directed acyclic graph estimation or latent variable modelling, for example). Because PMRFs do not place strong assumptions on the structure of the generating model but do hold clues to causal structure through conditional independencies, they are well suited to exploratory analyses (see also Limitations and optimizations). In addition, estimated PMRFs often describe the data successfully with only a subset of the possible parameters (for example, using sparse network structures), which leads to more insightful network visualizations. Finally, a priori commitments invariably lead to problems of underdetermination, because many structurally different models will produce indistinguishable data, which is known as statistical equivalence. By contrast, the PMRF is uniquely identified, so there are no two equivalent PMRFs with different parameters that fit the data equally well.

If data are continuous, a popular type of PMRF is the Gaussian graphical model (also known as a partial correlation network), in which edges are parameterized as partial correlation coefficients^55,56. If data are binary, a popular PMRF is the Ising model, in which edges are parameterized as log-linear relationships^16,29,36. The Ising model and the Gaussian graphical model can be combined in mixed graphical models, in which edges are parameterized as regression coefficients from generalized linear regression models⁵⁷. Mixed graphical models represent the most general approach to PMRF estimation and also allow for the inclusion of categorical and count variables.
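
For reference, the two basic parameterizations can be written out as follows. The symbols are notational assumptions introduced here and do not appear in the text above: τ_i denotes a node threshold, ω_ij an edge weight, Z the normalizing constant, and K = Σ⁻¹ the inverse covariance (precision) matrix with elements κ_ij. The conditional form of the Ising model is given for variables coded 0/1.

```latex
% Ising model for binary variables x = (x_1, ..., x_p)
P(\mathbf{x}) = \frac{1}{Z}
  \exp\!\Big( \sum_{i} \tau_i x_i + \sum_{i<j} \omega_{ij} x_i x_j \Big)

% Implied conditional distribution of one node given all others (0/1 coding):
% a logistic regression in which the edge weights act as coefficients
\log \frac{P(x_i = 1 \mid \mathbf{x}_{-i})}{P(x_i = 0 \mid \mathbf{x}_{-i})}
  = \tau_i + \sum_{j \neq i} \omega_{ij} x_j

% Gaussian graphical model: x ~ N(mu, K^{-1}); the edge between i and j is the
% partial correlation obtained from the precision matrix K
\rho_{ij \cdot \text{rest}} = - \frac{\kappa_{ij}}{\sqrt{\kappa_{ii}\,\kappa_{jj}}}
```

In both models, an edge is absent exactly when the corresponding pairwise parameter (ω_ij or κ_ij) is zero, which corresponds to conditional independence of the two variables given all others.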