Learning the dynamics of realistic models of C. elegans nervous system with recurrent neural networks

Given the inherent complexity of the human nervous system, insight into the dynamics of brain activity can be gained from studying smaller and simpler organisms. While some of the potential target organisms are simple enough that their behavioural and structural biology might be well-known and understood, others might still lead to computationally intractable models that require extensive resources to simulate. Since such organisms are frequently only acting as proxies to further our understanding of underlying phenomena or functionality, often one is not interested in the detailed evolution of every single neuron in the system. Instead, it is sufficient to observe the subset of neurons that capture the effect that the profound nonlinearities of the neuronal system have in response to different stimuli. In this paper, we consider the well-known nematode Caenorhabditis elegans and seek to investigate the possibility of generating lower complexity models that capture the system’s dynamics with low error using only measured or simulated input-output information. Such models are often termed black-box models. We show how the nervous system of C. elegans can be modelled and simulated with data-driven models using different neural network architectures. Specifically, we target the use of state-of-the-art recurrent neural network architectures such as Long Short-Term Memory and Gated Recurrent Units and compare these architectures in terms of their properties and their accuracy (Root Mean Square Error), as well as the complexity of the resulting models. We show that Gated Recurrent Unit models with a hidden layer size of 4 are able to accurately reproduce the system response to very different stimuli. We furthermore explore the relative importance of their inputs as well as scalability to more scenarios.

Recent developments in experimental neuroscience have considerably increased the availability of novel recordings and reconstructions shedding further light into the structure and function of the brain as well as many other systems. But understanding the complexities behind the relations between structure and function as well as the behaviour of such systems across multiple scales in these neuronal collections is constrained by the methods available to study them. This challenge has raised interest in many related fields, such as electrophysiological analysis, imaging techniques, brain-related medicine, computational modelling, simulation, and model reduction. Many of these efforts, while not directly providing specific information regarding structural or functional dynamics, do supply large volumes of recordings, measurements or simulations of observable input-output behaviour. The availability of these large datasets raises the question of whether low complexity, data-driven, black-box models can be used to model such input-output relations with low error, avoiding the reliance on detailed inner structures that may not be known or available.
To determine whether such an approach can be used for large, complex systems, one research direction is the study of smaller and simpler nervous systems, for which the underlying principles of network organization and information processing are easier to postulate. These organisms can become useful models to gain insight into the fundaments of neuronal dynamics and whole brain organization, validate hypotheses and develop and test modelling methods, simulation instruments and model reduction techniques. The hope is that the knowledge gained from these analyses and the techniques developed for these simpler organisms can later be used to model more complex systems. www.nature.com/scientificreports/ model should be able to reproduce with reasonably low error (the error metric used here being RMSE -Root Mean Squared Error) the behaviour of the realistic model while having fewer degrees of freedom. In this work we focus on the issue of reduced RMSE, which we equate to fidelity in reproducing the system dynamics, showing that we can produce sufficiently accurate models for analyzing the behavioural response of the C. elegans connectome under the described scenarios, using neural networks. The ultimate goal, however, is to show that our methodology is able to produce reduced-order, compressed models, that can be efficiently used in simulation to test and validate hypotheses regarding the behaviour and functionality of the neuronal systems of complex organisms. An illustration of our methodology is presented in Fig. 2.

Related work and context
The connectome-based models mentioned above are often termed white-box models, as they are based on direct knowledge and access to the internal structure and parameters' values of the modelled system. These are distributed models, where each neuron has a 3D description and position in space and the synapses are associated with neuronal sections. Such models enable highly accurate simulation of the dynamics of the systems but easily become extremely complex as they incorporate detailed structural and functional information of the system. While the white-box approach ensures access to and evaluation of inner parameters during simulation, it has been shown that the activity of complex networks of neurons can often be described by relatively few distinct patterns, which evolve on low-dimensional manifolds 15 . This knowledge, together with the ever-present need to avoid potential numerical intractability in large-scale networks with many degrees of freedom, has generated renewed interest in applying model reduction, often also referred to as model compression, to these neuronal networks, including techniques such as Dynamic Mode Decomposition (DMD) 16 , Proper Orthogonal Decomposition (POD) 17 and Discrete Empirical Interpolation (DEIM) 18 . Depending on the level of morphological accuracy of the underlying models, reduction techniques can have any shade of grey from white-box to blackbox, the latter assuming no preliminary knowledge of the system structure and building the model solely out of knowledge of its input-output behaviour.
Black-box approaches are often built upon data-driven models, sometimes learning-based, which have the ability to grasp more naturally and more efficiently the complexity induced by the profound nonlinearities in the neuronal transmission of information. Machine-learning techniques are used to extract data-driven reduced order models for systems arising from differential equations describing the intrinsic dynamics 19 and even to extract the governing equations of the estimated model 20 . It is therefore quite natural to consider using state of the art learning methods for developing reduced models of neuronal behaviour using data obtained from available recordings or even simulations obtained with more complex models.
Especially designed to capture temporal dynamic behaviour, Recurrent Neural Networks (RNNs), in their various architectures such as Long Short-Term Memory (LSTMs) and Gated Recurrent Units (GRUs), have been extensively and successfully used for forecasting or detecting anomalies in multivariate time series data [21][22][23][24] . Bidirectional LSTMs were used to model genome data by 25 , whereas a combination of CNNs and LSTMs generates a model for epileptic seizure recognition using EEG signal analysis in 26 . An attempt to model the human brain activity based on fMRI using RNNs (LSTMs and GRUs) is reported in 27 . In recent years, deep network approaches were used to model realistic neural activity data [28][29][30] . Few studies examined the behavioural output of network models of C. elegans using machine-learning techniques. RNNs are generated in a grey-box manner Figure 2. Modelling methodology. The learning machine is trained with input-output data reproducing selected behaviours and then tested against inputs it has not seen before. www.nature.com/scientificreports/ to study the chemotaxis behaviour 31 or to predict the synaptic polarities 32 of C. elegans, yet these models only include a subset of the connectome.

Methods
Given that the starting point is in fact represented by time series data obtained from simulations of the realistic connectome-based model, the modelling task is akin to a sequence to sequence conversion for which the most suitable neural network models are the recurrent ones.
In this work we analyze the suitability of three recurrent neural networks architectures. We start with the least complex unit, the simple RNN, originally proposed in the 1980's to model sequence data [33][34][35] . The second model used for the recurrent layer is the LSTM unit 36,37 , and finally we analyze its sibling, the GRU 38 .
Recurrent neural networks.  are a family of neural networks used for processing sequential data, particularly adept to processing a sequence of values x (1) , ..., x (t) , and in most cases capable to process sequences of variable length. RNNs appear from the relaxation of the condition on Feedforward Neural Networks (FFNNs) that neurons in a given layer do not form connections among themselves.
Although simple RNNs ( Fig. 3-left), which are trained using Backpropagation Through Time (BPTT) 39 , seem to be a good model for sequential tasks, they are known to suffer from various issues, mainly vanishing and exploding gradients 40 . Exploding gradients refer to a large increase in the norm of the gradient during training, which appears due to the explosion of long term components that can grow exponentially faster than short term ones. This is the less common of the two problems and there are known solutions to handle it, such as the clipping gradient technique 41 . A harder to solve issue is the vanishing gradient 40 , which refers to when long term components go exponentially fast to zero, making it impossible for the model to learn the correlation between temporally distant events.
In our case, for a faithful reproduction of the dynamics, the simulations require the use of fine time steps, leading to long sequences in the datasets. This in turn implies that the response at a given time will depend on values which are far back in the sequence. This situation, however unavoidable, may lead the RNN to experience difficulties in learning our data resulting in a model with unacceptable RMSE.
Long short-term memory. The Long Short-Term Memory unit 36 appeared as a solution to the vanishing gradient problem, later improved with the inclusion of the forget gate to adaptively release internal resources when necessary 37 .
A LSTM unit consists of three main gates, the input gate i t that controls whether the cell state is updated or not, the forget gate f t defining how the previous memory cell affects the current one and the output gate o t , which controls how the hidden state is updated. Note that LSTM units exhibit a major difference from RNN simple units, since besides the hidden state they also output a cell state to the next LSTM unit, as is apparent in Fig. 3-center. The LSTM mechanism is described by the following equations: Gated recurrent units. The use of LSTM units in recurrent neural networks already produced models able to learn very distant dependencies 37 , but these units are complex structures composed of three gates. For that reason, in 2014 a new type of unit, the GRU 38 , was suggested, described by: where the weights W z , U z , W r , U r , W h , U h and the biases b z , b r , b h are the learned parameters. www.nature.com/scientificreports/ The GRU ( Fig. 3-right) is only composed of two gates, the update gate z t and the reset gate r t . The GRU only outputs the hidden state h t computed based on the candidate hidden state ĥ t . The update gate controls how much of the past information needs to be passed along to the future, while the reset gate is used to decide how much information the model should forget.

Experimental setting
Data. The starting model is based on the complete connectome of the adult hermaphrodite of C. elegans, with 302 multi-compartmental neurons and 6702 synapses. The neurons are described by 3D geometrical information extracted from NeuroML and LEMS files 5 for C. elegans. We added membrane biophysical properties and connectivity data (chemical synapses and gap junctions) from 10 for the complete connectome. We term this as a high-fidelity model, since due to the level of detail taken into account we assume it reproduces with fidelity the output of physical models of C. elegans neurons. The simulations reproduce the Forward Crawling Motion scenario, by applying varying input currents to two sensory neurons ("PLML", "PLMR") and two interneurons ("AVBL", "AVBR") and record the responses of four neurons known to have strong activity during forward locomotion ("DB1", "LUAL", "PVR" and "VB1"; in reality we record the responses of the entire set of neurons, but analyse only these four). The resulting system is described in Python and simulated in NEURON 12 . The Python code invokes NEURON to generate the neuronal network, simulate its behaviour with respect to certain input signals (currents) and save the responses in time of the four output neurons (voltages).
We simulate the full high-fidelity model for 500 ms with two time steps-0.5 ms and 0.1 ms-and 40 different shapes for the input currents. The input-output waveforms are extracted into two datasets of 40 snapshot files each, which are further fed to the learning framework. These datasets are available in the online repository.
To train and tune the hyperparameters, learning rate and batch size, the data was divided into three sets: training, validation and test. The separation of data is done as follows: the training set uses 50% of the data, the validation set 25% and the test set the remaining 25% . The separation is partially done by hand, so that validation and test sets are as diverse and demanding as possible. Alternatively, one can use an automatic separation procedure, but given the small number of sample files, visualizing the shapes was sufficient for a reliable decision for this case. Three examples from the diverse set of inputs and outputs are shown in Fig. 4.
Modelling. The machine learning models are developed in Python 42 , using the libraries Keras 43 and TensorFlow 44 . Details on the code and dependencies to run the experiments are listed in a Readme file available together with the code in the online repository.
All architectures consist of one recurrent layer described in "Methods" followed by a dense layer. The dense layer performs a simple linear transformation for each sequence point to convert the output of the recurrent layer, of size "hidden size", into the four outputs.
For a consistent comparison, we fixed the optimizer to Adam 45 and the loss function to the root mean squared error. The other two hyperparameters, the learning rate and the batch size, as well as the activation function were tuned separately for the three architectures, by keeping one fixed and varying the others, for a fixed hidden size of 16 units. Each model was trained for 1000 epochs, with the final model chosen as the best iteration on the validation set. We then fixed the hyperparameters at the optimal values yielded from this search: the batch size at 32 for all three, the activation functions kept to the default and the learning rate at 0.001 for the RNN and 0.05 www.nature.com/scientificreports/ for the LSTM and the GRU 14 . Note that these represent the upper limits of the learning rates, since the Adam optimizer computes individual adaptive learning rates for each parameter.

Experiments and results
In Experiment 1 ("Experiment 1: RNN vs. LSTM vs. GRU (performance)") we compare the performance of the three types of layers, RNN, LSTM and GRU. Experiment 2 ("Experiment 2: LSTM vs. GRU (reduction)") carries a comparison between different sizes of the GRU recurrent layer to determine the optimal size under some RMSE constraints. Experiment 3 ("Experiment 3: GRU (long sequences)") is an investigation upon the models' ability to reproduce data resulting from simulations with a finer time step, therefore involving longer sequences with more data points. In Experiment 4 we investigate the relative importance of certain inputs with respect to the outputs and finally, in Experiment 5, we examine both the scalability and generalization potential of the resulting models, by significantly increasing the number of inputs and outputs, as well as mixing data from multiple scenarios. For all the experiments, the loss is computed as the average RMSE of ten runs.  Figure 5 shows the evolution of the training and validation losses during the training process. The simple RNN unit tends to take more time to learn, being also slightly less stable towards the end of the training process. Although this is not a good indicator, it is not as alarming as the behaviour shown in Fig. 6, where it is clear that the simple RNN unit is not able to reproduce the outputs with the desired reduced RMSE, while the LSTM and the GRU perform well. A summary of this experiment's results is shown in Table 1. Since the simple RNN unit did not perform sufficiently well by not being able to reproduce the output with minimal RMSE, we are left with the LSTMs and GRUs units. Given that the GRU is the less complex unit of the two, we consider it the main option and keep the LSTM as an alternative architecture.

Experiment 2: LSTM vs. GRU (reduction).
The GRU, due to its low RMSE and relative simplicity, therefore emerged as the prime candidate unit for our modelling purposes. However, we now want to determine how small the models can be without compromising the overall error. The focus of this second experiment is therefore to test different sizes of the recurrent layer and determine the smallest size that is still able to generate a model with sufficiently low RMSE.
We test both the LSTM and GRU units using the dataset with the coarser time step of (0.5 ms). Since the LSTM does not produce noticeable improvement over the GRU with a similar number of parameters, we only report here the results obtained with the GRU for six sizes of the recurrent layer: 2 , 4 , 8 , 16 , 32 , 64 (Table 2). Figure 7 illustrates the evolution of the training and validation losses during the learning process, where one can see that for a size as small as 8 the model reaches a low and stable loss. In fact, from Fig. 8 and Table 2 we can state that a GRU with a size of 4 hidden units is optimal to reproduce the outputs with low RMSE.

Experiment 3: GRU (long sequences).
We now explore the models' behaviour on data sampled with different time steps as in the same simulation interval this leads to sequences of different lengths. From a methodology standpoint this is important since even though time-wise the dynamics do not change, the temporal dependencies that the model has to learn are farther back in the sequence, which increases the difficulty of the Even though the model takes more time to converge for the finer time step, it ends up stabilizing with a loss of the same order in both cases and the model fits the test data well, as shown in Fig. 9 and Table 3. The plots in Fig. 10 further strengthen this idea.

Experiment 4: LSTM & GRU (input importance).
Throughout this work we guaranteed the models are tested against unseen input signals, by deliberately placing in the test set data with inputs with unique and varied shapes. We now want to investigate the relative importance of certain inputs with respect to the outputs. Hence, in this experiment we train the models on the same simulation datasets as before, except we discard the data for two input neurons during the learning process, and see how the resulting models predict the outputs. We do this in two separate settings. First we train the models with recurrent layers of 8, 16 and 32, without the input information for the AVB neurons (No-AVB case). Next we repeat the process without the PLM data (No-PLM case). Figure 6. Experiment 1: realistic (red and blue) and predicted (green and black) sequences for DB1 and LUAL (1st and 3rd rows) and PVR and VB1 (2nd and 4th rows) for two sequences of the test set (one selected simulation out of ten).  www.nature.com/scientificreports/ The models perform similarly from an accuracy perspective and the size does not influence the overall RMSE. We notice that in each case, the models are able to accurately replicate the voltage of two output neurons but do a poor job at predicting the other two outputs. In the No-AVB case, the two neurons well reproduced are LUAL and PVR, as shown in Fig. 11, whereas in the No-PLM case the opposite occurs as the outputs of DB1 and VB1 are accurately reproduced while the other two exhibit a large RMSE. Table 4 shows the average RMSE of the two settings. In both cases the error reflects the models' inability to accurately predict one pair of neurons. The larger absolute RMSE in one of the settings is merely a result of the increased voltage magnitude of LUAL and PVR neurons compared to DB1 and VB1 as in both cases the prediction is inaccurate for those respective nodes.
This result indicates that the AVB neurons are important for accurate prediction of the behaviour of LUAL and PVR, while the PLM inputs have more influence on the DB1 and VB1 neurons. This result is interesting from an interpretability standpoint, as it shows that some inputs are more relevant than others for specific outputs and behavioural scenarios.   (scalability and generalizability). We create a dataset composed of data for two completely different behaviours, FCM and Nictation. For the FCM scenario the data is the same as described in "Data", with the difference that we now record the responses of 16 output neurons instead of 4. In the Nictation case, we apply stimuli to 6 sensory neurons and we record 12 output neurons associated with neck and head muscles 11,13,46 . This new dataset therefore has a total of 80 examples, 4 + 6 = 10 inputs and 16 + 12 = 28 outputs. This experiment examines both the scalability potential of the resulting models, by significantly increasing the number of inputs and outputs, as well as their ability to predict more general data, coming from two different behaviours. LSTMs and GRUs with recurrent layer sizes of 8, 16 and 32 are trained on this data and the results are shown in Table 5. Both types of layers, and GRU in particular, are able to predict for a certain behaviour the output of the neurons of interest in that scenario, using a number of neurons for the recurrent layer inferior to the total of output neurons for which the voltage is predicted. This is apparent in Fig. 12, where we show various output sequences extracted from the test set for GRU with 16 hidden units. The model is indeed quite accurate overall in terms of RMSE, with some relative error showing for nodes where there is little to no activity, which is discarded since their response magnitude is within the absolute error metric used. However, this should not be an issue for most applications, as the important neurons for a given scenario are the ones with strong responses.

Discussion
Accuracy of the low-order model. Our analysis shows that a GRU with as low as 4 hidden units is consistently able to reproduce the outputs of the original model with a reasonably low RMSE and it adapts well to longer sequences, resulting from data sampled with both coarser and finer time steps. There is no noticeable difference in the errors between the two datasets in Experiment 3, the model consistently showing a RMSE below 1.3e-02 for the test set. What is interesting to note from Experiment 2 (see the RMSEs for Test in Table 2) is that increasing the number of hidden units to more than 16 worsens the predictions, so much that it is better to use a 4-units GRU than a 64-units GRU. As the RMSEs for the training sets are still decreasing for more units, this is probably due to overfitting of the data, a known problem in learning settings.
Interpretability. We are interested in further understanding to what degree we are able to make some assumptions on the structure of the (reduced) model and extract a description, perhaps mathematical, for it. The fact that we are able to replace a complex biophysical model with a simpler recurrent neural network with few neurons also means the interpretability improves (helps to better understand the modelled neural circuits). Furthermore, in Experiment 4 we show that the models are able to accurately predict only certain outputs when deprived of certain inputs during training. This already suggests specific input-output dependencies, which a Figure 9. Average training and validation RMSE of ten runs, for the two datasets with a GRU-based recurrent layer with two different hidden sizes.    Methodology. Generalizability indicates a (low-order) model's ability to predict the original system response beyond the data used for modelling. From a machine learning perspective, this can be understood as prediction for a test set with input-output data unseen during training. While this is an expected merit of the  www.nature.com/scientificreports/ neural networks in general and also demonstrated here, this is not the only implication of our modelling efforts. We are modelling a system having biophysical descriptions with well-established accuracy. However we are treating it as a black-box, using only peripheral input-output information from the initial high-fidelity simulations. Even though for many other more complex neural circuits the internal details are less known or completely unknown, this peripheral information can still potentially be obtained, hence the effectiveness of our models extends to these systems as well. This could especially be useful for specific cases (e.g. wet-labs extracting neural data from experiments on animals, or settings related to specific pathologies), where the system dynamics can be learned and predicted so that the number of real experiments further needed would be reduced.

Conclusions
In this paper we create reduced order models for the C. elegans nervous system with three different recurrent neural networks architectures: simple RNNs, LSTMs and GRUs. The objective is to further generate a low-order description to replace the original, detailed model in the NEURON simulator. To achieve this goal we seek a model as simple as possible and therefore the ideal unit would appear to be the simple RNN. However, this unit does not perform sufficiently well compared to the other two architectures. The LSTM and GRU give comparable results in terms of overall fidelity, measured through RMSE, for different sizes of the recurrent layer. Due to its simplicity, GRU is preferable, and with a hidden size of 4 units, is able to reproduce with high fidelity, i.e. low RMSE, the original model's responses to different types of stimuli. Furthermore, from a computational standpoint, explicitly inferring the response of the GRU model to such stimuli will vastly outperform the cost associated with simulating the high-fidelity model within NEURON, which has to solve the set of nonlinear equations implicit in the connectome network. Quantifying the potential advantage would require solving an inverse problem and performing an identical simulation of the extracted low-order model in NEURON, which is not the subject of this paper. Further work will concentrate on improving the automation in choosing appropriate stimuli for the training, validation and test sets as well as optimal parameter selection. This will require a systematic analysis of compression possibilities of the learning-based models with error control. The novel concept of physics-informed machine learning will also help improving not only the predictions, but also the interpretability and generalizability of the neural nets. It implies adding structural or context information like physical constraints (domain, boundary conditions, initial conditions) to the training process. The resulting models will be more reliable, as they are guaranteed to satisfy physical laws, and they will potentially need less training data, making them even more suitable for experimental neuroscience, where there is a limited amount of labelled datasets available and in some cases it is not viable, for financial or ethical reasons, to procure additional data.
These results nonetheless show that it is feasible to develop recurrent neural network models able to infer input-output behaviours of realistic models of biological systems, enabling researchers to advance their understanding of these systems even in the absence of detailed level of connectivity.

Data and code availability
The datasets, models, the source code and the instructions to run them are available in the following GitHub repository: https:// github. com/ gmest re98/ Celeg ans-Forwa rdCra wling-RNNs.