Unraveling hidden interactions in complex systems with deep learning

Rich phenomena from complex systems have long intrigued researchers, yet modeling system micro-dynamics and inferring the forms of interaction remain challenging for conventional data-driven approaches; such models are generally established by scientists with human ingenuity. In this study, we propose AgentNet, a model-free, data-driven framework consisting of deep neural networks that reveals and analyzes the hidden interactions in complex systems from observed data alone. AgentNet utilizes a graph attention network with novel variable-wise attention to model the interaction between individual agents, and employs various encoders and decoders that can be selectively applied to any desired system. Our model successfully captured a wide variety of simulated complex systems, namely cellular automata (discrete), the Vicsek model (continuous), and active Ornstein–Uhlenbeck particles (non-Markovian); notably, AgentNet's visualized attention values coincided with the true variable-wise interaction strengths, and the model exhibited collective behavior that was absent in the training data. A demonstration with empirical data from a flock of birds showed that AgentNet could identify the hidden interaction ranges exhibited by real birds, which cannot be detected by conventional velocity correlation analysis. We expect our framework to open a novel path to investigating complex systems and to provide insight into general process-driven modeling.

positions x_t and y_t, and cell state c_t. The output here is a list of expected probabilities that each cell becomes alive. We use the binary cross-entropy loss between the AgentNet output and the ground-truth cell states, with T = 0.2. We found that our model is robust against changes in the system constants, showing similar performance with different constant values.
We attached a constant vector R to the attention vector to allow the interaction function h_pair to depend on the global variable R. (For AOUPs this is indeed the case, since R affects the interaction potential F_int.) Unlike the VM case, AgentNet for AOUP yields a total of 8 parameters for each agent, namely the means and variances of the positions x_{t+1}, y_{t+1} and velocities v^x_{t+1}, v^y_{t+1}. This is necessary because the input and output state variables must be the same in order to iteratively sample from the predicted distribution and feed the sample into the prediction at the next step.
We employed teacher forcing 4 to train the LSTM-based AgentNet, a technique that feeds ground-truth labels into subsequent LSTM cells instead of sampled outputs during the early stages of training. This stabilizes the training of trajectory prediction, since each prediction depends on the last output, which typically explodes to meaningless values in the early, untrained stages. We set an initial teaching period of 50 epochs, after which the probability of using the ground-truth label gradually decreases.

In this study, we used the first portion (file A) of the data, since we are interested in general bird flocking rather than a specific landing sequence (file C), and the second part (file B) contained more than 3000 unique birds in a 5 s instance, generally exceeding the memory capacity of a GPU (12 GB) even with a single data point per batch. From 30 min (= 54,000 frames) of trajectory data, we employed around 10 min (= 18,000 frames) and constructed data points spanning 300 frames (30 frames × 10 steps) each, spaced 20 frames apart. Although a single data point overlaps in time with other data points, we strictly split the training and test datasets to remove any possibility of data contamination. Exact details about the dataset and its statistics can be found in 5 .
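A scheduled teacher-forcing ratio of this kind can be sketched as follows. Note that the linear decay after the 50-epoch teaching period is an illustrative assumption; the text only specifies the teaching period itself.

```python
def teacher_forcing_ratio(epoch, teaching_epochs=50, decay_epochs=50):
    """Probability of feeding the ground-truth label (instead of the
    model's own sampled output) into the next LSTM cell.

    Hypothetical schedule: always use the ground truth during the
    teaching period, then decay the probability linearly to zero.
    """
    if epoch < teaching_epochs:
        return 1.0
    return max(0.0, 1.0 - (epoch - teaching_epochs) / decay_epochs)

# Inside a training step (sketch):
# use_truth = random.random() < teacher_forcing_ratio(epoch)
# next_input = target if use_truth else sampled_output
```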
In the CS case, the labels are unbalanced, since not every sample has a full 10-step (10 s) trajectory, as mentioned in the main manuscript. We therefore counted the number of agents present at each step and calculated a weighted loss to strengthen the effect of cases with fewer birds (typically, later time steps have fewer constituents, since many trajectories end early).
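One simple way to realize such a weighted loss is sketched below, under the assumption that per-step weights are inversely proportional to the number of agents present at that step; this is our reading of the scheme, not a formula stated in the text.

```python
import numpy as np

def stepwise_weights(agent_counts):
    """Per-time-step weights inversely proportional to the number of
    agents still present, normalized to mean 1 (hypothetical scheme)."""
    counts = np.asarray(agent_counts, dtype=float)
    w = 1.0 / counts
    return w / w.mean()

def weighted_displacement_loss(errors, agent_counts):
    """Average per-step displacement errors, up-weighting later steps
    with fewer birds so they still influence training."""
    w = stepwise_weights(agent_counts)
    return float((w * np.asarray(errors, dtype=float)).mean())
```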
AgentNet for CS was trained with 530 sets of bird trajectories that spanned 300 frames each, split into 10 time steps. Since every trajectory starts and ends at a different time step, we average the bird displacement error with each trajectory's starting point regarded as time step 0. For instance, if the trajectory of bird 1 spans from time step 0 to 3 and that of bird 2 spans from 4 to 6, the error of bird 1 at step 2 and the error of bird 2 at step 6 are both treated as displacement errors at time step 2, since both are the 2nd step from their respective starting points. Also, since some trajectories lasted only 2 time steps (= 1.0 s), we could not apply linear regression to them; instead, we linearly extrapolated those trajectories using the velocity given at the first time step. Finally, we intentionally scaled the input data by multiplying every variable by 0.1 to stabilize the training while preserving the relative differences in variable magnitude.
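The relative-time-step averaging described above can be sketched as follows; the `(start_step, errors)` pair layout is a hypothetical data format chosen for illustration, not the authors' actual one.

```python
from collections import defaultdict

def align_errors_by_relative_step(trajectories):
    """Average displacement errors by step index relative to each
    trajectory's own starting frame.

    `trajectories` is a list of (start_step, errors) pairs, where
    errors[k] is the displacement error k steps after that bird's
    first appearance (hypothetical layout).
    """
    buckets = defaultdict(list)
    for start_step, errors in trajectories:
        for k, err in enumerate(errors):  # k = steps since first appearance
            buckets[k].append(err)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
```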

Supplementary Note 1 : Inspection scheme of AgentNet for CS
As the authors of the original paper noted 5 , the CS data contain many short trajectories due to frequent occlusion and the limited camera viewing angle. An exemplary snapshot of typical trajectory data is drawn in Fig. S1, where the trajectories are not aligned in time and have different lengths. This form of unstructured data can be expressed as a dynamic graph, which precludes the naive application of a graph neural network (GNN) 6 with LSTM 7 states, since the neighbors of each node change dynamically and each node has a different starting time. Hence, we need an inspection method to manually check the hidden state at each time step and update each node's status to the appropriate form. Figure S1 describes all three possible cases that our inspection scheme needs to handle.
Case A is an ordinary case with a continued state, so we can simply hand over the previously updated hidden state to the next iteration. Case B means that an agent that existed at the last time step (t = t_0 − 1) has disappeared and should no longer be considered a valid neighbor. Our inspection method excludes such data and creates a mask for the attention matrix to ensure that the attention between a valid agent and a non-existent agent is strictly zero at all times. Finally, case C indicates a newly entered agent whose hidden states should be initialized before it starts its chain of hidden states.
In practice, the input data is created from the raw data in the form [time step, number of total agents, number of state variables]. Note that we use the total number of agents for every time step, which is not always the same as the number of agents present at a specific time. For instance, in Fig. S1, there are 7 birds in the total scene, but only 6 birds are present at t = t_0. Instead of changing the data size at every time step, we set the agent dimension to the maximum agent number (in this case, 7) and fill the currently non-existent birds' state variables with dummy values (we used zeros). With such padding, we can create a batched dataset that enables parallel calculation to speed up training on the GPU. However, since these dummy entries are calculated and updated alongside the real agents' data, we have to mask them out properly at each forward pass to ensure that none of the dummy data affects the results.
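A minimal sketch of this padding construction, assuming a hypothetical per-frame `{agent_id: state_vector}` layout for the raw data:

```python
import numpy as np

def build_padded_input(frames, n_total, n_vars):
    """Build a [time, n_total, n_vars] array from per-frame dicts
    mapping agent id -> state vector. Absent agents are zero-padded,
    and a presence mask marks which entries are real agents."""
    data = np.zeros((len(frames), n_total, n_vars))
    mask = np.zeros((len(frames), n_total))
    for t, frame in enumerate(frames):
        for agent_id, state in frame.items():
            data[t, agent_id] = state   # real agent's state variables
            mask[t, agent_id] = 1.0     # mark this slot as a real agent
    return data, mask
```

The mask is what later forward passes use to exclude the zero-padded dummy agents from attention and loss calculations.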
Our inspection algorithm creates two masks with the length of the total number of agents, called the 'Now mask' m_now and the 'New mask' m_new. Each mask checks the existence of agents at the current (t = t_0) and next (t = t_0 + 1) time steps, assigning 1 if the corresponding condition is satisfied and 0 otherwise: the 'Now mask' assigns 1 if the agent is present at t_0, and the 'New mask' assigns 1 if the agent is not present at t_0 but appears at t_0 + 1. The inspection method employs these two masks to control hidden state updates, attention calculations, and state variable updates.
At the start of every iteration, the algorithm checks new agents that need to be initialized.
Let us denote the hidden states of the agents at time t as h_t and the state input data as x_t. Then the hidden state update can be expressed as

h_t = m_new ⊙ I(x_t) + (1 − m_new) ⊙ h_t,

where I(x) indicates the hidden state initialization module consisting of neural layers and ⊙ means element-wise multiplication. Then, after the attention calculation, the algorithm casts a two-dimensional attention mask m_att on the raw attention matrix α^q_raw to exclude all of the attention values between real and dummy agents, as follows:

(m_att)_ij = 1 if a_i and a_j are both real agents, 0 otherwise, (4)

and α^q = α^q_raw ⊙ m_att. We can construct m_att from the 'Now mask' by repeating the row vector m_now along the column dimension to make the expanded matrix mask M_now, with m_att = M_now ⊙ M_now^T (equivalently, the outer product m_now m_now^T).
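The mask construction of Eq. (4) reduces to an outer product of the 'Now mask' with itself; a minimal NumPy sketch:

```python
import numpy as np

def attention_mask(m_now):
    """m_att[i, j] = 1 iff agents i and j are both present at t_0:
    the outer product of the 'Now mask' with itself."""
    m_now = np.asarray(m_now, dtype=float)
    return np.outer(m_now, m_now)

def masked_attention(alpha_raw, m_now):
    """Zero out attention between any real agent and a dummy agent
    by element-wise multiplication with the mask."""
    return alpha_raw * attention_mask(m_now)
```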

Supplementary Note 2 : Advantages of Neural Attention
Neural attention requires a greater number of parameters to optimize, but showed greater performance in predicting complex agent dynamics compared to existing attention mechanisms. Here, we compare the performance of AgentNet for the Vicsek Model (VM) with different mechanisms for calculating the attention α_ij between the ith and jth agents. We employed the transformer 8 architecture, which separates the key k, query q, and value v, for all three mechanisms. The additive mechanism employs three d × d weight matrices w_1, w_2, w_3 to compute the attention weight as w_3 σ(w_1 k_i + w_2 q_j), where d is the dimension of the output vector, σ is a nonlinear activation function, k_i is the key of the ith agent, and q_j is the query of the jth agent. The multiplicative mechanism calculates the attention weight as the inner product k_i · q_j with some normalization constant, and is broadly used to handle sequential data such as natural language.
In contrast, neural attention uses multiple MLP layers A_θ, parametrized by θ, to compute the attention:

α_ij = A_θ(k_i || q_j || u),

where || indicates the concatenation of two vectors and u is the vector of global variables.
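A minimal NumPy sketch of neural attention, applying a hypothetical two-layer MLP A_θ to the concatenation k_i || q_j || u for every agent pair; the dimensions and the tanh/sigmoid choices are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def neural_attention(k, q, u, w1, b1, w2, b2):
    """Compute alpha[i, j] = A_theta(k_i || q_j || u) with a small MLP.

    Shapes (hypothetical): k, q are [n, d]; u is [g];
    w1 is [2d + g, h], b1 is [h]; w2 is [h], b2 is a scalar.
    """
    n = k.shape[0]
    alpha = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            z = np.concatenate([k[i], q[j], u])          # k_i || q_j || u
            hidden = np.tanh(z @ w1 + b1)                # hidden MLP layer
            alpha[i, j] = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))  # sigmoid
    return alpha
```

Because A_θ is an arbitrary MLP, the attention weight can be any nonlinear function of the pair, rather than being restricted to an additive or inner-product form.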
Note that the additive mechanism is technically the same as a single layer of perceptrons. Table S1 shows the results of AgentNet for VM with the different attention mechanisms. (In this case and the following demonstration, the system has 100 agents, compared to the 300 agents of the main manuscript version. This results in a slightly lower NLL value, −1.013, compared to the one reported in the manuscript, −0.856, since a system with lower density is easier to predict.) We observed that neither additive nor multiplicative attention can fully capture the nonlinear and complex interaction range of complex systems as neural attention does.
One possible reason for the outperformance of neural attention is that the interaction range of a complex system is far more nonlinear and complicated than conventional attention mechanisms can handle, as they are relatively under-parametrized for such interaction ranges. This assumption is further supported if the performance gap decreases as the interaction boundary becomes more linear. We experimented with a smoothed version of the VM (Fig. S2A), where the interaction strength is defined as a sigmoid function s(x) with amplitude a and steepness b. Figure S2B shows the performances of VAIN (exponential-based attention from 9 ), GAT (a conventional multiplicative transformer), and AgentNet (neural attention) with different b when a = 1. Note that the original interaction boundary coincides with the b → ∞ case. The results illustrate the difference between the scales of the two mechanisms for b, which underpins the aforementioned nonlinearity hypothesis.

Supplementary Discussion 1 : AOUP attention analysis
The variable-wise attention in AgentNet is wrapped by the sigmoid function, so its output is equal to or greater than 0, as seen in Fig. S3A. Hence, relative positional information is vital to convert this magnitude into an actual signed force that can be directly applied to change the velocity. For instance, the force should point in the +x direction if the target agent i is at (1, 0) and the neighbor agent j is at (0, 0), but it changes to the −x direction if the neighbor agent moves to (2, 0), although the magnitude (and thus the attention value) is the same. It is possible to regain this directional information if the value vector v_j, corresponding to a leftover function, contains the positional data of the jth agent; the position information of the ith agent can be delivered by h_self. Figure S3 verifies that the value vector v preserves the positional information of the input data.
We gathered data from 100 test samples of AOUP and performed principal component analysis (PCA) on the 16-dimensional vectors v from the encoder. Figure S3B shows that the top four principal components (PCs) stand out in terms of explained variance compared to the lower-ranked PCs.
We found that these four PCs have an interesting nonlinear relationship with the two coordinates x and y of the input data. Figure S3C shows that these four PCs have strong sinusoidal relationships with x and y, giving clear evidence of information preservation. We note that this specific nonlinear transformation is not special, and there could be various other ways of preserving positional information. (For instance, the simplest way to convey positional information would be to directly preserve the coordinate data from the input without any nonlinear transformation.) In this particular case, we can assume that the overall function f could successfully retrieve the positional information encoded in v.
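The PCA described above can be reproduced with a short SVD-based sketch; the synthetic test data here merely stands in for the encoder's 16-dimensional value vectors.

```python
import numpy as np

def top_principal_components(v, n_components=4):
    """PCA via SVD on centered value vectors v of shape [n_samples, 16].

    Returns the projections onto the top components and their
    explained-variance ratios (a minimal sketch, not the exact
    analysis pipeline used for Fig. S3).
    """
    vc = v - v.mean(axis=0)                      # center each dimension
    _, S, Vt = np.linalg.svd(vc, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()        # variance ratio per PC
    return vc @ Vt[:n_components].T, explained[:n_components]
```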