A graph neural network framework for causal inference in brain networks

A central question in neuroscience is how self-organizing dynamic interactions in the brain emerge on their relatively static structural backbone. Due to the complexity of spatial and temporal dependencies between different brain areas, fully comprehending the interplay between structure and function is still challenging and an area of intense research. In this paper we present a graph neural network (GNN) framework, to describe functional interactions based on the structural anatomical layout. A GNN allows us to process graph-structured spatio-temporal signals, providing a possibility to combine structural information derived from diffusion tensor imaging (DTI) with temporal neural activity profiles, like that observed in functional magnetic resonance imaging (fMRI). Moreover, dynamic interactions between different brain regions discovered by this data-driven approach can provide a multi-modal measure of causal connectivity strength. We assess the proposed model’s accuracy by evaluating its capabilities to replicate empirically observed neural activation profiles, and compare the performance to those of a vector auto regression (VAR), like that typically used in Granger causality. We show that GNNs are able to capture long-term dependencies in data and also computationally scale up to the analysis of large-scale networks. Finally we confirm that features learned by a GNN can generalize across MRI scanner types and acquisition protocols, by demonstrating that the performance on small datasets can be improved by pre-training the GNN on data from an earlier study. We conclude that the proposed multi-modal GNN framework can provide a novel perspective on the structure-function relationship in the brain. Accordingly this approach appears to be promising for the characterization of the information flow in brain networks.


Introduction
Brain connectivity comes in different flavors, either resting on the structural anatomical layout, as derived from diffusion tensor imaging (DTI) or based on temporally resolved activity patterns, like observed in functional MRI (fMRI) [58].White matter tracks reconstructed from DTI provide a foundation for structural connectivity (SC) and can be used to quantify the (static) anatomical connection strength between brain regions.On the other hand fMRI enables us to map out dynamic neural activity distributions across the brain, whereas the coherence of fluctuations is usually referred to as functional connectivity (FC).Intuitively one might follow the paradigm "structure determines function", but it has been shown that the relationship between brain structure and function is quite complex and still a focus of intense research [26,46,68,2,14].For instance, brain regions with robust SC usually show also high FC, but the inverse is not necessarily true [48].While FC is a statistical measure with no information concerning the directionality of the relation, effective connectivity and directed functional connectivity measures try to infer causal dependencies in functional imaging data [36].Thus connectivity measures derived from different modalities can provide distinct, but complementary aspects of brain connectivity [5,100,22].Still, studying their relations is challenging mainly due to the complex spatio-temporal dependencies and inherent difficulty in long term forecasting.
In this paper we propose a data driven model, which combines information from fMRI and DTI to infer causal dependencies between brain regions.Temporal activity patterns of neuron pools, interconnected by the spatial anatomical layout, can be interpreted as time-varying graph structured signals.For such applications, graph neural networks (GNN) have shown to be useful, providing a possibility to process data with graph-like properties in the framework of artificial neural networks (ANN) [98].Motivated by their success in computer vision [38,59], convolution operations were recently extended to the graph domain [18,27].Learning such convolution filters in ANN enables us to capture inherent spatial dependencies in the non-Euclidean geometry of graphs, which are used in our context to integrate spatial relations of brain networks, based on their structural anatomical connections.Further, temporal dependencies in a dynamic system can be acquired by recurrent neural networks (RNNs) that have proven to be well suited for processing data with sequential structure.In our study, RNNs learn temporal characteristics of brain dynamics, like those observed in resting-state fMRI.A certain type of GNN architecture denoted as diffusion convolution recurrent neural network (DCRNN), [61] provides the possibility to integrate spatio-temporal information of graph-structured signals.By combining fMRI with DTI data, the idea is to replicate brain dynamics more accurately, to get an improved understanding of functional interactions between brain regions, which are physically constrained by their structural backbone [25].
Causal relationships between brain regions can be revealed by directed functional connectivity and effective connectivity.Two prominent and distinct approaches have been established in recent years therefore [36].The first one is based on a simple idea taken up by the British econometrician Clive Granger [43].If one event A causes another event B, then A would precede B, and information on the occurrence of A should contribute to the prediction of the occurrence of event B. Such temporal dependencies between multivariate processes are typically described in the framework of a multivariate vector auto regressive (VAR) model, building a foundation for Granger causality (GC).By trying to make accurate predictions of temporal neural profiles, GC tests if adding information about neural activity in brain region B helps to improve the prediction of the activity in region A (and vice versa).This provides an exploratory measure for directed causal dependencies between segregated brain areas.
The second popular approach is methodologically different: Dynamic causal modeling (DCM) relies on a mechanistic input-state-output model of neuron pools, describing the effective connectivity strength between brain areas [37].Experimental conditions and stimuli are encoded in input functions, and the model output can be related to empirically observed electromagnetic or hemodynamic responses.In a Bayesian framework, effective couplings of neural populations are estimated, providing a neurophysiological perspective on causal relationships between different regions in the brain.However due to its relatively high computational complexity, the analysis with DCM is usually limited to a few pre-defined regions in the brain only, what could neglect relevant components for the analysis [24].
Here we present a data-driven machine learning approach that combines structural and functional information of neuron pools in a predictive framework for brain dynamics.By studying spatio-temporal dependencies between brain areas which were learned by the DCRNN model from DTI and fMRI data, we deduce the information flow between segregated areas in the brain.This provides us with a multi-modal data-driven perspective on causal relationships within brain networks.We compare the predictive capabilities of a typical VAR model to those of a DCRNN model, and show that this machine learning approach can successfully account for long-term and non-linear relationships in data.We conclude that such data-driven methods inspired by graph signal processing can be promising candidates for modeling brain dynamics, foremost due to their improved accuracy in replicating empirically observed data.Moreover, a greater neurophysiological plausibility results, because neural interactions are constrained by their anatomical substrates in this model.In contrast to classical DCM, the DCRNN also naturally scales to large networks by learning localized filters on the graph structure [27], what opens the possibility for an exploratory analysis of whole brain networks.
Usually for a good performance more complex machine learning models require a larger amount of data, but it is not economical in MRI to perform studies with very large sample sizes.To account for these issues we demonstrate that also in our context transfer learning [70] can enhance the model accuracy of small datasets.We pre-train the DCRNN on a large-scale dataset of 100 restingstate fMRI (rs-fMRI) sessions provided by the Human Connectome Project [93] (HCP).We then show that the pre-trained model considerably improves the predictive performance on a smaller independent dataset of 10 sessions compared to standard training.This points to the ability of the DCRNN to generalize across scanner types and acquisition protocols to a certain extent, enabling the possibility for transfer learning.
Finally the integrative framework of anatomical and functional neuroimaging data can help to better understand the general relation between brain structure and function.By modeling localized interactions on the anatomical substrate, this approach allows us to reconstruct the amount of information on activity distributions that occurs in structurally connected brain regions.While many current approaches focus on predicting only the coherence patterns of the activity in brain regions (FC) from their SC [11,84,66,62,28], we present a framework that directly replicates observed neural activity profiles, thereby relying on information from SC.

Model description.
In this study we use the DCRNN model [61] architecture to explore the spatiotemporal relationships of brain dynamics in resting state fMRI.An overview of the model structure is provided in figure 1.To learn the temporal dependencies of  The signal on the graph x(t) at a certain time point t is the average BOLD signal in brain regions/nodes, obtained by the fMRI measurement at time t (B3).The encoder (A) receives an input sequence [x(1), . . ., x(T p )], and iteratively updates its hidden state H(t).The final encoder state H(T p ) is passed to the decoder part, which learns to recursively predict the output sequence of graph signals [x(T p + 1), . . ., x(T f )] in the future.During testing and validation, the decoder uses its own outputs as inputs, to generate the subsequent output.The first input of the decoder (< GO > symbol) is simply a vector of zeros.
the BOLD signal, recurrent neural networks (RNNs) with sequence-to-sequence learning are employed [85].In such an architecture the encoder network maps information from an input sequence into a hidden representation, which is used by the decoding part to sequentially generate outputs, based on this encoded information.In the context of brain dynamics, the input sequence corresponds to measurements of the BOLD signal x(t) ∈ R N in N brain regions at T p time points, while the objective is to predict the signal at T f subsequent time points.
In addition to temporal, also spatial dependencies between brain regions are incorporated via diffusion convolution operations [61].Consider the network of regions of interest (ROIs) as a graph G = (V, E, A w ), where V, |V| = N denotes a set of vertices (nodes), E represents a set of edges and A w ∈ R N ×N is a weighted adjacency matrix.The latter represents the spatial connectivity of the nodes, i.e. the ROIs on the neuronal network, which are adjacent to each other, i.e. connected by an edge.Also the weights result from DTI, reflecting the axonal connection strength between the connected regions.Goal of the DCRNN model is to learn a function h(...) which maps T p past activity states x(t), to T f future states: The encoder, as well as the decoder of the DCRNN consist of gated recurrent units [23], modified with graph convolutions [27], and for training the model scheduled sampling was applied [13].A detailed description of the model architecture is provided in section 4.

Data description.
For the first part of our evaluation, resting-state fMRI data from the S1200 release provided by the Human Connectome Project [93] (HCP) was employed [40].Further the multi-model parcellation proposed by Glasser et.al [39] was applied to divide each hemisphere into 180 segregated regions.The BOLD signal in each region was averaged, so for each resting state session, N = 360 time courses were obtained.During each session T = 1200 images were acquired, so the data can be arranged in a matrix X ∈ R N ×T .For the following analysis, we filter the data with a 0.04 − 0.07Hz narrow band bandpass filter, as it has shown to be reliable and functionally relevant for gray matter activity [41,19,25,15,3].We additionally present results in supplement I, when employing a more liberal bandpass filter with cutoff frequencies between 0.02 − 0.09Hz.The input and output (label) samples for the DCRNN model were generated from the data in X, by defining windows of length T p to obtain input sequences of neural activity states [x(t − T p + 1), . . ., x(t)], and respective target sequences [x(t + 1), . . ., x(t + T f )] of length T f .The index t was propagated through each resting-state fMRI session, so in total T − T p − T f + 1 input-output pairs were generated per session.The first 80% were used for training the DCRNN model, 10% for validation, and the last 10% for testing.In total 4 resting-state fMRI sessions from 25 different subjects were employed for the evaluations.The input and output length was chosen to be T p = T f = 30, what would correspond to a time span of roughly 22 s, based on the sampling with a repetition time T R = 0.72 s [92].But note that in general the sequence-to-sequence architecture employed would be able to deal with arbitrary input and output signal lengths [85].
In addition to temporal brain dynamics, also structural information was incorporated into the model, described by the anatomical connection strength between brain regions deduced from DTI. Therefore the DTI dataset provided in the S1200 release [40] was further processed by employing multi-shell, multi-tissue constrained spherical deconvolution [53], implemented in the MRtrix3 software package [90].White matter tracktography was performed to estimate whole brain structural connectivity between the N = 360 regions of the multi-modal parcellation atlas [39].The reconstructed anatomical connections define the edge strength in the graph adjacency matrix A w ∈ R N ×N .A more detailed description of the datasets and preprocessing involved can be found in section 4.3.1.

Model performance.
In a first step we assess the capabilities of the DCRNN model to learn temporal activity patterns in neuron pools, and their relationships across the spatial layout.As a first baseline we compare the DCRNN to the performance of a linear vector autoregressive (VAR) model [43], further described in section 4.2.A common way to estimate causal relations between different regions of interest (ROIs) in a brain network, is to fit a multivariate VAR model to neural temporal activity patterns, like those observed in different neuroimaging modalities [36,10,75].Evaluating the fitted VAR allows us to infer, if one spatial brain region, contains additional information about future activity profiles of other regions, indicating a causal dependency between them.The accuracy in replicating empirically observed neural activity profiles can indicate how well a model has learned the underlying process of neural dynamics, including the interactions and dependencies among brain regions.In this comparison we incorporated two different optimization methods for the estimation of the VAR coefficients.The first one employs an ordinary least squares (OLS) fit on the neural activity timecourses x(t) from individual rs-fMRI sessions [45].The second approach, in analogy to the DCRNN, follows a gradient-descent based optimization [54] on the windowed neural activity samples as described in section 2.2.For this evaluation we rely on the latter one, as it could improve the performance of the VAR, as outlined in section 4.2.
A representative example of the predictive accuracy of both approaches is shown in figure 2, as well as their average performance on the complete testing data set.The true BOLD signal in these 4 ROIs is marked green, while predictions of the VAR are highlighted in red, and for the DCRNN in blue.The first 30 T Rs of BOLD signal were used as the model inputs, and the goal was to predict the subsequent 30 T Rs.This illustrative example was chosen to represent the whole test set, the prediction error of the VAR model on this sample is 0.169, and as such slightly below average, while the error of the DCRNN is with 0.037 higher than its average.Below the average MAE over all samples in the test set is illustrated, in dependence of the forecasting horizon (C).
On the right side in (C) the average of all horizons is shown.
To further verify the difference in the prediction accuracy, the equivalent evaluation, using a more liberal frequency filtering within the 0.02 − 0.09Hz range, is provided in supplement I. Furthermore in supplement II we evaluate the different approaches employing the volumetric AAL parcellation [91] and performing an alternative method for reconstructing the anatomical connectivity [12].

Impact of spatial modeling.
For this application of the DCRNN model, the anatomical connectivity was used to characterize spatial relations between nodes in the brain network, shaping the transition of activity between brain regions.To illustrate that the DCRNN indeed has learned relevant spatial interactions between different ROIs, we evaluate this recurrent neural network model, without employing graph (diffusion) convolution layers.This restriction considers only self-couplings (filters of order K = 0) of nodes on the structural graph.Figure 3 (A) shows the test MAE in dependence of the incorporated walk order K.The increase in computational time per epoch in dependence of included transition orders K is depicted in figure (B).A more detailed comparison of the prediction MAE between the sequence-to-sequence model without graph convolutions (K = 0), and including spatial transitions up to order K = 3 is illustrated in figure 3 (C).
These results show, that the vast amount of the information about future activity in one region comes from the region itself.But by including first order transitions on the structural network (K = 1) the error can already be decreased by 25%.Filters of higher orders K = 2, 3 only slightly improve the predictions further, but the computational load increases linearly with order K, as shown in subfigures (A) and (B).The role of such transitions within the anatomical network can tell us something about the general structure-function relationship in the human brain, by pointing out how much information about functional dynamics comes from structurally connected regions.The comparison between K = 0 and K = 3 shows, that roughly up to 27% of the predictive performance can be attributed to information from regions that are anatomically connected with each other.

Causal connectivity.
In this section the objective is to study the principle of information passing between different ROIs the DCRNN has learned from the neuroimaging data.As shown in the previous section 2.4, propagating information on the anatomical network can improve the predictions of the temporal evolution of the BOLD signal, displaying a dependence among structurally connected brain regions.Now to derive a measure of causal connectivity strength, by following the idea of Granger [43], the goal is to reconstruct how information about the activity in ROI A contributes to the prediction of the activity in ROI B. To reveal relationships inside the data by directly looking at the learned parameters is often challenging when ANN models become more complex.One simple strategy used to account for this problem is to induce perturbations in the models input space and then observe how these perturbations are propagated to the models outputs [101,73].
In our context, the DCRNN first learns a function h(...), mapping the original input sequences of neural activity states [x(t − T p + 1), . . ., x(t)] to a predicted output sequences of future states [x(t + 1), . . ., x(t + T f )].Then the information about the activity in a ROI n is removed, by simply replacing the values x n (t) in the input sequence with the mean value of the data distribution xn (t) = 0. Next the input sequence with the artificial perturbation in n is projected by the model h(...) to an output sequence [x (t + 1), . . ., x (t + T f )].Finally the differences of the models predictions x (t) with the perturbation in the input space in ROI n , and the predictions x(t) with the original input can be compared.A measure of influence I(n ) ∈ R N of the information in ROI n on the predictions in other ROIs can then be defined as: with I n (n ) describing the impact of region n on region n.Here x(s) n (t) and x (s) n (t) denote the predictions in region n with and without the perturbation of n in the input space respectively, of a test sample s at time step t.
To visualize this measure of influence of n on each individual region n, values of I(n ) can be projected onto the cortical surface.In the following we studied the impact of the parieto-insular vestibular cortex (PIVC) on all other brain regions.Here PIVC in the right hemisphere is characterized as a conjunction of ROIs R OP2-3 and R Ig, as defined by Glasser et al. [39].Previous results show that this location coincides with the average location of PIVC across human subjects [64,32].The perturbation was induced in R OP2-3 and R Ig simultaneously, and figure 4 illustrates the strength of influence on all other regions (encoded in red) of the target region PIVC (marked in blue).The results of this analysis show that PIVC exhibited strong causal connectivity with the Sylvian fissure, the perisylvian cortex and the insula.Similar connectivity patterns have been observed using diffusion weighted imaging [97,49] and resting state functional connectivity [34] in human subjects as well as in nonhuman primates using tracer techniques [94].Several separate regions of the vestibular cortex are located within this Sylvian network, including the posterior insular cortex area (PIC), a region critical to the integration of visual and vestibular cues (for human subjects: [31,33]; for non-human primates the region is referred to as VPS: [94,21]).The information flow within this Sylvian network is not fully understood yet.Current theories assume that vestibular and visual cues about self motion are combined within PIVC and PIC and are then further processed to the temporoparietal junction (TPJ), a larger cortical region located at the junction of the temporal and parietal cortices, where visual-vestibular signals are integrated into a representation of the self in space [32].The results of the current analysis support this view by showing a causal relationship with the supramarginal gyrus, which is part of the TPJ.Further functional connectivity from PIVC was observed with the visual cortex.This result is interesting, since several studies have shown inhibitory interactions between the visual system and PIVC [95,16,35,34], such that PIVC is inhibited when visual cues are processed attentively and vice versa.These inhibitory interactions are assumed to be modulated in magnitude by attention networks located in the visual and parietal cortices (see [34]).

Model generalization.
Often one problem is the availability of a sufficient amount of data, in order to fully train and take advantage of machine learning models with large parameter spaces.Especially in MRI studies it is usually time-consuming and costly to acquire such large data sets.To account for this limitation, the concept of transfer learning was proposed in machine learning [70].The basic idea behind transfer learning is that if only few data are available to learn a certain task, one can pretrain the model on a large-scale dataset of a similar task.In a next step, the feature representations learned on the large database can be used as an initialization for learning the desired target task.The goal is to transfer knowledge of one source domain to a target domain, by re-using the pretrained models weights.If the feature representation of the source domain is diverse enough, this can improve the model performance in comparison to starting the training without any prior knowledge, e.g.relying on a random initialization of the model weights [70].
To investigate, if transfer learning can also be suitable for our application, we studied the capabilities of the DCRNN to generalize across different datasets.Therefore we pretrained the DCRNN using the data provided by the HCP [93], as described in section 4.3.1.The model was pretrained for 70 epochs on in total 100 resting-state fMRI sessions, including the anatomical connectivity as reconstructed from DTI.Next we used a dataset collected with a Siemens Magnetom Prisma 3T at the University of Regensburg (UR), including 10 resting-state fMRI sessions and corresponding structural imaging data.The acquisition parameters of this dataset are outlined in more detail in section 4.3.2.We fine tuned the DCRNN, pretrained on the HCP data, by training it for 70 more epochs on the UR dataset, and initialized the second training with a lower learning rate of 0.001.This pretrained model was compared to the DCRNN, only trained on the UR dataset, and with weight parameters initialized randomly with Xavier/Glorot initialization [42].
The comparison between relying on standard training, and utilizing transfer learning is illustrated in figure 5. Figure 5 (A) shows the training and validation error during learning when starting with a random initialization of the weights in red.This model was trained in total for 140 epochs on the UR dataset only.In blue the training and validation error is depicted of the model, pretrained on the HCP dataset for 70 epochs at first, and fine tuned on the UR dataset for the subsequent 70 epochs.Figure 5 (A) illustrates that at the beginning, the training error on the UR data is relatively high, but as the pretrained model adapts to the new dataset the MAE becomes considerably smaller than without pretraining.In figure 5 (B) the test MAE in dependence of the prediction horizon is depicted.The average test error could be reduced by 27% from 0.0388 to 0.0284 by encompassing transfer learning.Therewith the model performance on the small UR dataset, containing 10 sessions a 7.3 min, becomes comparable to the performance on the large HCP dataset with 100 sessions a 14.4 min with a M AE = 0.0279.

Discussion
We introduced a multi-modal framework for inferring causal relations in brain networks, based on a graph neural network architecture, uniting structural and functional information observed with DTI and fMRI.First this model provides a data-driven perspective on a fundamental question in neuroscience: How the function of the brain is related to its structure.Further by modeling dynamic interactions on the structural anatomical substrate, this framework accounts for nonlinear spatio-temporal dependencies between segregated brain regions, allowing us to reconstruct a multi-modal measure of causal influence strength.
First, we evaluated the performance of the DCRNN by studying its capabilities to reproduce empirically observed neural activity patterns, and compared it to a VAR model, like typically used for the analysis of brain connectivity with Granger causality [43,75].We showed that the DCRNN can also capture temporal longterm dependencies in fMRI data, enabling it to make accurate predictions up to 30 T Rs (≈ 20s) in the 0.04 − 0.07Hz frequency range, what could reduced the overall test MAE considerably in comparison to a linear VAR.Note that results in 2.3 demonstrate, that despite its simplicity, a VAR can make quite reliable predictions within the first 10 T Rs.Also its linearity allows for various possibilities for statistical inference of causal relations between different time courses, making it a practicable and fast tool for the estimation of Granger-causal connectivity [10].But for the future it could be of interest to also consider non-linear and long-term relationships in neuroimaging data, in order to get a more complete picture of functional interactions between areas in the brain.The improved accuracy of the DCRNN reveals that it has better learned inherent characteristics of brain dynamics, and might be therefore more appropriate to characterize causal relations than simple linear models.We additionally verified the results by employing a more liberal bandpass filter with cutoff frequencies 0.02 − 0.09Hz in supplement I.By including more frequency components, the BOLD signal becomes more complex and is accordingly harder to predict.The same analysis has been carried out relying on a volumetric brain atlas [91], and using an alternative tractography method to reconstruct the structural connectivity in supplement II.In all cases the difference between the VAR and DCRNN in the prediction performance is apparent, especially for larger horizons.Also the DCRNN does not require stationarity of time series data, therefore avoiding potentially distorting pre-pocessing steps in order to achieve the latter.Another aspect that improves the plausibility of the estimated causal relations between brain regions is the integration of structural information into the graph neural network model.As the propagation of neural signals is physically constrained by the layout of white matter connections, propagating information via graph convolutions along anatomically connected regions is in agreement with prior knowledge on the anatomy of the brain.
The impact of this structural modeling was further investigated in section 2.4.In the DCRNN the propagation of information is realized as a stationary diffusion process in the notion of a diffusion convolutions (DCs) [61].Results show that diffusion steps of order K = 1 already contribute most to the improvement of prediction accuracy, while higher order terms of K = 2, 3 only have a nominal further impact on the performance.The influence of structural modeling on the predictive performance provides additional insight into the general structurefunction relation in the brain, by pointing out, how much additional information about the functional activity in a certain region can be gained from the inclusion of structurally connected regions.By including filters up to order K = 3, the predictions could be improved by 27 % in comparison to when information from anatomically connected regions has been neglected.Note that for each time step t the DCRNN already applies multiple DCs to the multi-variate time series data, thereby inherently capturing the influence of higher order structural connections.Therefore low orders of diffusion walks K ≤ 3 seem to be already sufficient to account for indirect transitions.A good trade-off between computational load and model accuracy could be achieved with a maximum walk order of K = 2, as the computational complexity linearly increases with K. Learning localized filters characterized by polynomial coefficients θ k renders it possible to efficiently analyze large scale networks [27], what allowed us to conduct an analysis with N = 360 regions simultaneously on a single GPU.So unlike classical DCM, this model can also be applied to study interactions across the whole brain, making it suitable for an exploratory analysis.
The results demonstrated that propagating information across anatomical connections improves the model accuracy, pointing towards functional dependencies between different brain regions.In the spirit of explainable artificial intelligence (XAI), we proposed a method to reconstruct such dependencies, which the DCRNN has learned from the data in section 2.5.Inducing perturbations in the model's input space allowed us to study how the activity in a certain region influences other regions.This influence would quantify the importance of temporal information on the activity in a certain ROI for predicting the activity in other ROIs.Following the philosophy of Granger causality, this indicates a causal dependency between ROIs, thereby providing a measure of directed influence among each other.This kind of relation is referred to as directed functional connectivity or causal connectivity, as such information theoretic measures are dependent on causal mechanisms, but are not necessarily identical with them [76,17], which distinguishes them from explicit model-based approaches like DCM for effective brain connectivity [36].For our approach we used the more general notion of causal connectivity, as we do not only incorporate functional data, but also structural information to describe such causal dependencies between different regions.To demonstrate an application of our proposed approach, we evaluated the influ-ence of PIVC on other brain regions.This could reveal a causal relation of PIVC with brain regions in the Sylvian fissure, the perisylvian cortex and the insula, but also with the visual cortex.
In a final step, we proposed an approach to improve the model performance on smaller data sets.We demonstrated that the concept of transfer learning [70] finds also an application in our context of detecting intrinsic patterns in fMRI time-series and structural connectivity data.Features learned from the data of the HCP repository [93] could be well transferred to a smaller dataset, acquired with a Siemens Magnetom Prisma 3T.This made it possible to achieve an almost as good accuracy on a small dataset with 10 sessions (each 7.3 minutes in duration) as with a large dataset of 100 sessions (each 14.4 minutes in duration).The acquisition and preprocessing protocols of the two fMRI datasets were relatively comparable in our study, so in other cases with larger differences in the temporal resolutions of the data, downsampling one dataset could be necessary in order to better learn transferable feature representations.
Note that by integrating the structural information into the model, the functional interactions learned by this model also depend on the predefined anatomical layout.Therefore the quality of DTI data is additionally relevant for the results, but it is known that DTI has problems to accurately reconstruct long-range white matter tracks [87].Also fMRI comes with its limitations for studying neural interactions, as the sampling rate is considerably lower than the underlying neural responses, and the neural activity is only indirectly measured based on the observed hemodynamic response [75].So the interpretation of the results should consider the informative content of the neuroimaging data used in this model.
In conclusion we think that GNN architectures can provide an interesting novel approach to combine complex non-linear temporal and spatial patterns as observed in fMRI and DTI data.Currently GNNs already show very promising applications for classification tasks in MRI based on brain connectivity networks [57,9,60,55].In our study we showed that they can be also suitable to characterize the non-Euclidean spatial relationship of segregated brain regions when analyzing dynamic functional interactions on the structural network.Beyond the investigation of causal relations, this data-driven approach for brain dynamics could also be of interest for other applications.While many current approaches dealing with the structure-function relationship in the human brain focus on inferring the overall functional coherence patterns from their SC [11,84,66,68,62,28], this framework allows us to directly relate temporally resolved activity profiles to their anatomical substrate.Further this whole-brain model could be of use for clinical applications, by studying dynamics in the diseased brain or modeling the impact of structural lesions [71,4].For future studies it could also be interesting to apply this multi-modal GNN model to other functional neuroimaging modalities like those obtained with electroencephalography (EEG) or magnetoencephalography (MEG) to investigate brain dynamics with greater temporal resolution.

DCRNN.
In the context of neuroimaging, neural activity patterns can be interpreted as a graph structured spatio-temporal signal distribution.The nodes in this graph represent ROIs in a human brain, while the edges reflect the connection strengths between these ROIs in the anatomical neuronal network, which forms a structural scaffold for the flow of information.This connection strength is given by the axonal connection strength as determined from DTI measurements.The activity dynamics on such networks can be modeled by a random walk on a graph, where a diffusion convolution operation is invoked to capture the spatial dependencies [61,27].A diffusion-convolution recurrent neural network (DCRNN) is designed to integrate diffusion convolution, a sequence-to-sequence architecture and a scheduled sampling technique [61].The model, as it is applied in the current study, is described in detail below.
When considering voxel time series of brain activity maps, we collect all data into a data matrix X = (x(1) . . .x(T )), with x(t) ∈ R N .Given N ROIs, taken from a brain atlas and each represented by a meta-voxel, and considering T time points for each meta-voxel time series, which represents the activation time course of one of the ROIs, then we have Note that the columns x(t) ∈ R N of the data matrix describe the activation of all ROIs at any given time point 1 ≤ t ≤ T , while its rows xn (t), t = 1, . . ., T represent the meta-voxel time course of every ROI 1 ≤ n ≤ N .Now consider a network of ROIs (brain areas, neuron pools) as an undirected graph G = (V, E, A w ), where V, |V| = N denotes a set of vertices (nodes), E represents a set of edges and A w ∈ R N ×N is a weighted adjacency matrix.The latter represents the structural connectivity of the nodes, i.e. the ROIs on the neuronal network, which are adjacent to each other and connected by an edge.Such undirected graphs can be deduced from diffusion tensor imaging (DTI) data, which also provide the edge weights w nn .The latter reflect the anatomical connection strengths between the connected vertices.Note that DTI alone cannot determine the direction of information flow, what makes it necessary to incorporate functional imaging data.
The flow of activity observed on G is expressed as a time-dependent graph signal x(t) ∈ R N .It represents the feature of each ROI, which here is the BOLD signal amplitude.Forecasting the flow of activity on G amounts to learning a function h(...) that maps T p past graph signals to future T f graph signals:

Spatial dependencies.
Information flow on G is considered a stochastic random walk process modeled by • a state transition matrix T = D −1 A w = ( ŵ1 . . .ŵN ) Here we have with w ∈ R N and ŵn = ( ŵ1n . . .ŵNn ) where the ŵ = w nn / n w nn denote normalized edge strengths.Here state transitions are modeled as a diffusion process on a graph.Note that because the DTI cannot obtain directed graphs, its diffusion matrix is symmetric, i. e. T = T T .Thus an eigen-decomposition exists according to Further the state transition matrix T is proportional to a normalized graph Laplacian representing a random walk on the graph.Now consider the set of eigenvectors U of the diffusion Laplacian matrix as a set of basis vectors.Then the graph signal x t ∈ R N can be transformed to the conjugate domain and vice versa, hence we have [78] x ω = U T x t (8) Finally invoking the convolution theorem, the graph convolution operator * G can be defined as where f θ denotes a filter parametrized by θ and denotes the Hadamar product in the conjugate domain.The transformed vector U T f θ ≡ θ ω = (θ 1 (ω), . . ., θ N (ω)) T summarizes the filter parameters θ n , n = 1, . . ., N into a parameter vector in the conjugate frequency domain.If it is replaced by a diagonal feature matrix, i. e.
, it represents a convolution kernel.Thus we have for the output signal Now expanding the filter kernel Θ ω into a power series with respect to the eigenvalue matrix Λ of the transition matrix T, unfolding the bi-quadratic form into a sum of rank one outer product forms θ n uu T , n = 1, . . ., N , which can be considered elementary filter kernel, and finally keeping only terms up to order K, we obtain Note that this diffusion convolution operation includes the inverse diffusion process, represented by the transpose state transition matrix T T as well, since DTI can only yield undirected graphs.Thus, as has been shown by [61], diffusion convolution is intimately related to spectral graph convolution (SGC) [27].More precisely, GDC is equivalent to SGC up to a similarity transformation [61].
Considering a CNN architecture and using the diffusion convolution operation, the output of each of the q ∈ {1, . . ., Q} diffusion convolution layers (DCL) is then given as follows: Hereby x t ∈ R N denotes the input at time t, h q,t ∈ R N the corresponding output of the q-th convolution layer, Q the number of filters employed, σ(...) any suitable activation function, and θ q,k ∈ R K+1 parameterizes the q-th convolutional kernel of order k.The DCL learns to represent graph structured data and can be trained with gradient descent based optimization techniques.
Note that this random walk on a graph represents a Markov process.At the limit K → ∞ it converges to a stationary distribution P ∈ R N ×N , which for finite K < ∞ can be approximated by [86] The i-th row P i, * of this matrix represents the likelihood of diffusion starting from ROI v i ∈ V, hence the proximity of any other ROI v j ∈ V with respect to ROI v i .

Temporal dependencies.
Given the graph convolution operation, temporal dynamics on the graph can be modeled using gated recurrent units (GRU) [23].The trick is to replace the matrix multiplications in GRU by diffusion convolutions * G , as derived in equation 12.This leads to the diffusion convolutional gated recurrent unit (DCGRU) [61] where x(t), H(t) denote the input and output states of the GRU at time t and [x(t), H(t−1)] denotes their concatenation.Also r(t), u(t) represent reset and update gates at time t, and b r , b u , b c , respectively, denote bias terms.Furthermore, Θ r , Θ u , Θ c denote the parameter sets of the corresponding filters.An illustration of a single DCGRU cell is provided in figure 6.Similar to GRUs, also DCGRUs can be employed to build layers of recurrent neural networks, which can be trained by backpropagation through time (BPTT) [96,63].If multiple step ahead forecasting is intended, a sequence-to-sequence architecture can be used.In this architecture, both the encoder and the decoder are composed of DCGRU layers forming a diffusion convolution recurrent neural net (DCRNN) (see Fig. 1).During training, a time series of past events is fed into the encoder and its final states form the input to the decoder.The latter then generates predictions, which can be compared to available ground truth observations.For later testing, such ground truth observations are replaced by predictions generated by the model itself.Given BOLD signal voxel activation time series, segments of an observed voxel time series are used to train a DCRNN to predict future activations.The network is trained by maximizing the likelihood of generating the target future time series using BPTT learning.Hence, DCRNN can capture spatiotemporal dependencies between time series.The DCRNN [61] was implemented using the TensorFlow [1] library for machine learning, and computations were performed on an Nvidia Quadro K6000 GPU, running on a desktop PC with an Intel(R) Xeon(R) CPU E5-1620 v4 CPU under Linux Debian 9. Scheduled sampling [13] is invoked during training to account for the fact that the distribution of input stimuli during testing might differ from the distribution of training stimuli.During scheduled sampling reference observations are fed to the model with probability i , while predictions released by the model are fed in with probability 1 − i at the i-th iteration.During supervised training, instances to be predicted are, of course, known.An inverse sigmoidal function determines the sampling probability decay: It was found to be sufficient to train the model for 70 epochs, and the scheduled sampling parameter can be set to τ = 5000.As an objective function the mean absolute error (MAE) was used to describe the overall difference between true activity x(t) and predicted activity x(t): For this optimization problem, the ADAM algorithm [56] was employed, and the gradient was derived from mini-batches of 32 samples.To further improve convergence, an annealing learning rate was used, initialized as η = 0.1, and decreased by a factor of 0.1 at epochs 20, 40 and 60, or if the validation error did not improve for more than 10 epochs.Before lowering the learning rate, the weights with lowest validation error were restored, in order to avoid getting stuck in local optima.
The encoder and decoder of the sequence-to-sequence architecture consist to two diffusion convolution GRU layers each, and the hidden state size is set to Q = 64.
The training performance is illustrated in figure 7.

Autoregressive models.
As Granger causality [43] is typically based on linear vector autoregressive (VAR) models for stochastic time series data, we evaluated a VAR as one baseline method.
The idea of an autoregressive process (AR) is that a time series x(t) can be described by a linear function of the first T p of its lagged values [65] x(t with coefficients α 1 . . .α p , intercept β and an error term u(t).This expression can be extended to a multivariate VAR model with N time series x(t) = [x 1 (t), . . ., x N (t)] as [65] x where coefficients are stored in matrices A ∈ R N ×N , and intercepts and errors are described by vectors b ∈ R N and u(t) ∈ R N .In the context of this study, time series x(t) reflect the BOLD signal of N brain regions, measured at different times t.
For the estimation of coefficients A and intercepts b various methods exist [10], and in this study we rely on two different strategies.The first is based on a typical ordinary least squares (OLS) estimation [45,10] on individual subject sessions, implemented in the statsmodel python package [74].The first 80% of each fMRI session were used to fit the model to the data, and the subsequent 10% were used for validation and the last 10% were employed as a test set.To check for stationarity of the analyzed time series an augmented Dickey-Fuller test for unit roots was performed [45,67], with a p-value of p < 0.01.
Additionally, in order to render the comparison to the DCRNN more accurate, we implemented a gradient descend based optimization for a VAR model in TensorFlow [1], to verify that the differences in predictive performance can be related to the models, and not solely to the optimization strategies.In analogy to the DCRNN training, input-output samples of neural activities were generated from the data like described in section 2.2, which were used to minimize the MAE between the model's prediction x(t) and groundtruths x(t).The convergence could be optimized by employing stochastic gradient descent (SGD) optimization with a batch size of 1, using an annealing learning rate with a start value of η = 0.005.The VAR model was trained for 100 epochs, and the learning rate was reduced by a factor of 0.1 after epoch 70 and 90.A comparison of the error on the test set between the two different optimization strategies can be found in supplement III.
Best performance could be achieved employing a SGD based optimization in combination with a lag order of P = 30.Note that with such a high lag order, around 9.7% of the N = 360 time courses do not fulfill the stationarity criteria of the augmented Dickey-Fuller test anymore (p > 0.01).Yet the prediction accuracy could still be improved by including lags up to P = 30, like shown in supplement III.As the objective criterion of the evaluation was to assess the capabilities of replicating empirically observed neural activity patterns, we chose the VAR model with best accuracy for comparison with the DCRNN in section 2.3.The first data set used in this study is provided by the HCP data repository [47,93].The S1200 release includes data from subjects which participated in four resting state fMRI sessions, lasting 14.4 min each and collecting 1200 volumes per session.Customized Siemens Connectome Skyra magnetic resonance imaging scanners with a field strength of B 0 = 3T were employed for data acquisition, using multi-band (factor 8) acceleration [69,29,77,99].The data was collected by gradient-echo echo-planar imaging (EPI) sequences with a repetition time T R = 720 ms and an echo time T E = 31.1 ms.The field of view was F OV = 208 mm × 180 mm and N s = 72 slices with a thickness of d s = 2 mm were obtained, containing voxels with a size of 2 mm × 2 mm × 2 mm.The preprocessed version, including motion-correction, structural preprocessing and ICA-FIX denoising was chosen [40,51,52,30,81,72,44].Next a multi-model parcellation scheme was applied to divide the cortical gray matter hemisphere into 180 regions [39], and the BOLD signal inside each region was averaged, to obtain the temporal activity evolution for each area.For our study we found it appropriate to apply global signal regression, firstly because it showed to effectively reduce movement artifacts in HCP datasets [20].Also in this study of causal relations, the goal was to extract the additional information, which certain brain regions contain about the activity of other regions, whereby local interactions rather than global modulations were of interest for us.Those time coures were bandpass filtered, first performing the evaluations on a noise reduced narrowband in section 2.3, employing a filter with cutoff frequencies 0.04 − 0.07Hz [41,19,15,3], and additionally implementing a more liberal bandpass filter with cutoff frequencies 0.02 − 0.09Hz, as displayed in supplement I.

Datasets
Diffusion MRI data was collected in 6 runs, whereby approximately 90 directions were sampled during each run, employing three shells of b = 1000, 2000, and 3000 s/mm 2 , including 6 b = 0 images [82].A Spin-echo EPI sequence was employed with repetition time T R = 5520 ms, echo time T E = 89.5 ms, and multi band factor 3. In total N s = 111 slices were obtained, with field of view F OV = 210 mm × 180 mm and an isotropic voxel size of 1.25 mm × 1.25 mm × 1.25 mm.The preprocessing included intensity normalization across runs, EPI distortion correction, eddy-current corrections, removing motion artifacts, and gradient non-linearity corrections [40,83,6,8,7].To obtain the structural connectivity strengths between regions defined by Glasser et al. [39], the MRtrix3 software package was employed [90].Briefly multi-shell multitissue constrained spherical deconvolution [53] was used to obtain the response functions for fiber orientation distribution estimation [89,88].Furher 10 million streamlines were created with anatomical constrained tractography [79] and spherical-deconvolution informed filtering was applied [80], reducing the number of streamlines to 1 million1 .The group structural connectome was computed as an average across the first 10 subjects, as the variance in the structural connectivity strength is relatively low across subjects [102], while probabilistic tracktography methods are relatively computationally demanding.

UR data.
The second dataset was acquired with a Siemens Magnetom Prisma with field strength B 0 = 3 T at the University of Regensburg (UR).The data of 10 different subjects were used, whereby resting state fMRI data were collected during a scanning time of 7.3 min.All subjects provided written informed consent and the study was approved by the local ethics committee of the University of Regensburg.An EPI sequence was employed using multi-band (factor 8) acceleration, sampling 600 volumetric images per run with a repetition time of T R = 730 ms and an echo time of T E = 31 ms.The field of view was F OV = 208 mm × 208 mm and N s = 72 slices with thickness of d s = 2 mm were collected, containing voxels with a size of 2 mm × 2 mm × 2 mm.For preprocessing the HCP pipeline (version 4.0.0) was employed, as described by Glasser et al. [40].To achieve good correspondence between the two datasets, the further preprocessing was also performed like outlined in section 4.3.1.The fMRI time courses were averaged within each brain region of the multi-modal parcellation scheme [39], and again global signal regression was applied.Finally those time courses were bandpass filtered within the noise reduced range of 0.04 − 0.07 Hz.
To reconstruct the anatomical connectivity, diffusion MRI data was collected in 4 runs, sampling approximately 90 directions, employing two shells with b = 1500 and 3000 s/mm 2 , and also including 7 b = 0 images.The repetition time of the Spin-echo EPI sequence was T R = 3222 ms with an echo time T E = 89, 2 ms, employing a multi-band (factor 4) acceleration.Overall N s = 92 slices were collected, with a field of view F OV = 210 mm×210 mm, containing voxels with a size of 1.5 mm × 1.5 mm × 1.5 mm.Preprocessing of the diffusion MRI data was based on the HCP guidelines [40], and finally the anatomical connectivity matrices were obtained like in section 4.3.1 using constrained spherical deconvolution as provided in the MRtrix package [90].The group structural connectivity was computed as an average over the 10 subjects.This illustrative example was chosen to be reasonable representative for the whole test set, the prediction error of the VAR model on this sample is with 0.119 slightly below average, while the error of the DCRNN is with 0.033 higher than its average.Below the average MAE over all samples in the test set is illustrated, in dependence of the forecasting horizon (C).On the right side in (C) the average of all horizons is shown.

Figure 1 :
Figure 1: An overview of the DCRNN model.The model consists of an encoder and decoder (A), modified to process graph structured signals (B).In our context, vertices (nodes) V, |V| = N of the graph G are defined as N brain regions, derived from an atlas (B2).Structural connections between brain regions are derived from DTI, quantifying the strength of edge connections in the graph (B1).The signal on the graph x(t) at a certain time point t is the average BOLD signal in brain regions/nodes, obtained by the fMRI measurement at time t (B3).The encoder (A) receives an input sequence [x(1), . . ., x(T p )], and iteratively updates its hidden state H(t).The final encoder state H(T p ) is passed to the decoder part, which learns to recursively predict the output sequence of graph signals [x(T p + 1), . . ., x(T f )] in the future.During testing and validation, the decoder uses its own outputs as inputs, to generate the subsequent output.The first input of the decoder (< GO > symbol) is simply a vector of zeros.

Figure 2 (
A) illustrates that a linear VAR model can generate in a few cases also correct long term predictions, but most often after 10 T Rs (≈ 7s) the error starts to accumulate and the predictions become less accurate.The predictions of the DCRNN (figure2 (B)) remain stable over much longer forecasting horizons, and the average mean absolute error M AE = 0.0279 is considerably lower than the M AE = 0.1786 of the VAR.

Figure 2 :
Figure 2: The figure illustrates the prediction accuracy of a VAR model (A) in comparison to the DCRNN (B).The true BOLD signal in these 4 ROIs is marked green, while predictions of the VAR are highlighted in red, and for the DCRNN in blue.The first 30 T Rs of BOLD signal were used as the model inputs, and the goal was to predict the subsequent 30 T Rs.This illustrative example was chosen to represent the whole test set, the prediction error of the VAR model on this sample is 0.169, and as such slightly below average, while the error of the DCRNN is with 0.037 higher than its average.Below the average MAE over all samples in the test set is illustrated, in dependence of the forecasting horizon (C).On the right side in (C) the average of all horizons is shown.

Figure 3 :
Figure 3: This figure depicts the effect of structural modeling on the prediction accuracy.In (A) the test MAE in dependence on included walk order K is shown, while (B) demonstrates the impact of K on the computational load per epoch.A more detailed comparison of the MAE on the forecasting horizon when employing filters with order K = 0 and K = 3 is illustrated in (C).

Figure 4 :
Figure 4: The figure illustrates the influence of activity in PIVC on all other brain regions.The left side depicts the left hemisphere, while on the right side the right hemisphere is shown.The target region PIVC in the right hemisphere is marked in blue.The values of the influence measure I(n ) were normalized between 0 and 100 and are encoded in red in this illustration.PIC = posterior insular cortex.PIVC = parieto-insular vestibular cortex.SF = Sylvian fissure and surrounding perisylvian cortex.TPJ = temporo-parietal junction.Note that causal relationships from right PIVC were primarily found in the ipsilateral hemisphere.

Figure 5 :
Figure 5: This figure illustrates the performance difference between standard training and encompassing transfer learning.Figure (A) shows the validation and training MAE during learning from epoch 70 onwards, and the errors with and without pretraining are depicted in blue and red respectively.At the very beginning of fine tuning, the error of the pretrained model is relatively high, but decreases after the model adapts to the UR dataset.In figure (B) the final test MAE of both models is shown in dependence on the forecasting horizon.

Figure 7 :
Figure 7: Illustration of the model performance during training.The figure shows the MAE during the training (solid blue line) and validation data set (dashed light blue line) in dependence of the number of epochs.The gray line illustrates the scheduled sampling probability i over time.Vertical lines indicate when the learning rate was lowered by a factor of 0.1.In the first few epochs the training error, due to the high schedule sampling probability i , is already quite low.During testing and validation the inputs for the decoder are always the models own predictions, what reflects the large discrepancy between training and validation error within the first epochs.When the sampling probability is subsequently decreased, the model also learns to successfully make long term forecasting. .

Figure 9 :
Figure 9: The figure illustrates the prediction accuracy of a VAR model (A) in comparison to the DCRNN (B).This illustrative example was chosen to be reasonable representative for the whole test set, the prediction error of the VAR model on this sample is with 0.119 slightly below average, while the error of the DCRNN is with 0.033 higher than its average.Below the average MAE over all samples in the test set is illustrated, in dependence of the forecasting horizon (C).On the right side in (C) the average of all horizons is shown.

Figure 10 :
Figure 10: This figure shows the test MAE for two different optimization strategies to find the VAR coefficients, in dependence of lag orders P .The first one is performed with an ordinary least squares (OLS) fit on individual subject sessions and the average test error is depicted in red in this figure.The second one, in analogy to the training of the DCRNN, is based on gradient descent optimization, aggregating input-output pairs of samples across sessions like described in section 2.2.The test MAE of the stochastic gradient descent (SGD) approach is illustrated in blue.
Figure6: Overview of the processing steps of the DCGRU cell.The input x(t), as well as the previous hidden state H(t − 1) are concatenated and passed to the reset gate r(t), as well as to the update gate u(t).The reset gate r(t) controls the proportion of H(t − 1) which enters c(t), together with input x(t).Then the hidden state H(t − 1) is updated by c(t), whereby the amount of new information is controlled by u(t).