Main

Continuous neural network architectures built by ordinary differential equations (ODEs)2 are expressive models useful in modelling data with complex dynamics. These models transform the depth dimension of static neural networks and the time dimension of recurrent neural networks (RNNs) into a continuous vector field, enabling parameter sharing, adaptive computations and function approximation for non-uniformly sampled data.

These continuous-depth (time) models have shown promise in density estimation applications3,4,5,6, as well as modelling sequential and irregularly sampled data1,7,8,9.

While ODE-based neural networks with careful memory and gradient propagation design9 perform competitively with advanced discretized recurrent models on relatively small benchmarks, their training and inference are slow owing to the use of advanced numerical differential equation (DE) solvers10. This becomes even more troublesome as the complexity of the data, task and state space increases (that is, requiring more precision)11, for instance, in open-world problems such as medical data processing, self-driving cars, financial time-series and physics simulations.

The research community has developed solutions for resolving this computational overhead and for facilitating the training of neural ODEs, for instance by relaxing the stiffness of a flow by state augmentation techniques4,12, reformulating the forward pass as a root-finding problem13, using regularization schemes14,15,16 or improving the inference time of the network17.

Here, we derive a closed-form continuous-depth model that has the modelling capabilities of ODE-based models but does not require any solver to model data (Fig. 1).

Fig. 1: Neural and synapse dynamics.
figure 1

A postsynaptic neuron receives the stimuli I(t) through a nonlinear conductance-based synapse model. Here, S(t) stands for the synaptic current. The dynamics of the membrane potential of this postsynaptic neuron are given by the DE presented in the middle. This equation is a fundamental building block of LTC networks1, for which there is no known closed-form expression. Here, we provide an approximate solution for this equation which shows the interaction of nonlinear synapses with postsynaptic neurons in closed form.

Intuitively, in this work, we replace the integration (that is, solution) of a nonlinear DE describing the interaction of a neuron with its input nonlinear synaptic connections, with their corresponding nonlinear operators. This could be achieved in principle using functional Taylor expansions (in the spirit of the Volterra series)18. However, in the particular case of liquid time-constant (LTC) networks, we can leverage a closed-form expression for the system’s response to input. This allows one to evaluate the system’s response to exogenous input (I) and recurrent inputs from hidden states (x) as a function of time. One way of looking at this is to regard the closed-form solution as the application of a nonlinear forward operator to the inputs of each hidden state or neuron in the network, where the outputs of one neuron constitute the inputs for others. Effectively, this rests on approximating a conductance-based model with a neural mass model, of the kind used in the dynamic causal modelling of real neuronal networks19.

The proposed continuous neural networks yield considerably faster training and inference speeds while being as expressive as their ODE-based counterparts. We provide a derivation for the approximate closed-form solution to a class of continuous neural networks that explicitly models time. We demonstrate how this transformation can be formulated into a novel neural model and scaled to create flexible, performant and fast neural architectures on challenging sequential datasets.

Deriving an approximate closed-form solution for neural interactions

Two neurons interact with each other through synapses as shown in Fig. 1. There are three principal mechanisms for information propagation in natural brains that are abstracted away in the current building blocks of deep learning systems: (1) neural dynamics are typically continuous processes described by DEs (see the dynamics of x(t) in Fig. 1), (2) synaptic release is much more than scalar weights, involving a nonlinear transmission of neurotransmitters, the probability of activation of receptors and the concentration of available neurotransmitters, among other nonlinearities (see S(t) in Fig. 1) and (3) the propagation of information between neurons is induced by feedback and memory apparatuses (see how I(t) stimulates x(t) through a nonlinear synapse S(t) which also has a multiplicative difference of potential to the postsynaptic neuron accounting for a negative feedback mechanism). One could read I(t) as a mixture of exogenous input to the (neural) network and presynaptic inputs from other neurons that result in a depolarization x(t). This depolarization is mediated by the current S(t) that depends upon depolarization and a reversal threshold A. LTC networks1, which are expressive continuous-depth models obtained by a bilinear approximation20 of a neural ODE formulation2, are designed on the basis of these mechanisms. Correspondingly, we take their ODE semantics and approximate a closed-form solution for the scalar case of a postsynaptic neuron receiving an input stimulus from a presynaptic source through a nonlinear synapse.

To this end, we apply the theory of linear ODEs21 to analytically solve the dynamics of an LTC DE as shown in Fig. 1. We then simplify the solution to the point where there is one integral left to solve. This integral compartment, \(\int\nolimits_{0}^{t}f(I(s))\,{\mathrm{d}}s\) in which f is a positive, continuous, monotonically increasing and bounded nonlinearity, is challenging to solve in closed form since it has dependencies on an input signal I(s) that is arbitrarily defined (such as real-world sensory readouts). To approach this problem, we discretize I(s) into piecewise constant segments and obtain the discrete approximation of the integral in terms of the sum of piecewise constant compartments over intervals. This piecewise constant approximation inspired us to introduce an approximate closed-form solution for the integral \(\int\nolimits_{0}^{t}f(I(s))\,{\mathrm{d}}s\) that is provably tight when the integral appears as the exponent of an exponential decay, which is the case for LTCs. We theoretically justify how this closed-form solution represents LTCs’ ODE semantics and is as expressive (Fig. 1).
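As a brief numerical illustration of this discretization (a sketch in Python with assumed, arbitrary choices of f, input signal and horizon; not the paper's code), holding I(s) constant on each segment turns the integral into a sum of f(γi) multiplied by the segment lengths, which converges to the true integral as the segments shrink:

import numpy as np
from scipy.integrate import quad

f = lambda v: 1.0 / (1.0 + np.exp(-v))   # positive, bounded, monotonically increasing nonlinearity
I = lambda s: np.sin(s)                  # an arbitrary continuous input signal
t, n = 5.0, 1000                         # horizon and number of piecewise constant segments

tau = np.linspace(0.0, t, n + 1)         # segment boundaries tau_0 < tau_1 < ... < tau_n
gamma = I(tau[:-1])                      # constant value of I on each segment
piecewise = np.sum(f(gamma) * np.diff(tau))      # sum_i f(gamma_i) (tau_{i+1} - tau_i)
reference, _ = quad(lambda s: f(I(s)), 0.0, t)   # high-accuracy quadrature for comparison
print(piecewise, reference)              # the two values agree closely for large n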

Explicit time dependence

We then dissect the properties of the obtained closed-form solution and design a new class of neural network models we call closed-form continuous-depth networks (CfC). CfCs have an explicit time dependence in their formulation that does not require a numerical ODE solver to obtain their temporal rollouts. Thus, they sidestep the trade-off between accuracy and efficiency that solvers impose. Formally, this property corresponds to lower time complexity without numerical instabilities and errors, as illustrated in Table 1 (left). For example, Table 1 (left) shows that the complexity of a pth-order numerical ODE solver is \({{{\mathcal{O}}}}(Kp)\), where K is the number of ODE steps, while a CfC system (which has explicit time dependence) requires \({{{\mathcal{O}}}}(\tilde{K})\), where \(\tilde{K}\) is the number of exogenous input time steps, which is typically one to three orders of magnitude smaller than K. Moreover, the approximation error of a pth-order numerical ODE solver scales with \({{{\mathcal{O}}}}({\epsilon }^{p+1})\), whereas CfCs are closed-form continuous-time systems, so the notion of solver approximation error does not apply to them.

Table 1 Computational complexity of models

This explicit time dependence allows CfCs to perform computations at least one order of magnitude faster in terms of training and inference time compared with their ODE-based counterparts, without loss of accuracy.

Sequence and time-step prediction efficiency

In sequence modelling tasks, one can perform predictions based on an entire sequence of observations, or perform auto-regressive modelling where the model predicts the next time-step output given the current time-step input. Table 1 (right) depicts the time complexity of different neural network instances at inference, for a given sequence of length n and a neural network with k hidden units. We observe that the complexity of ODE-based networks and Transformer modules is at least an order of magnitude higher than that of discrete RNNs and CfCs in both sequence prediction and auto-regressive modelling (time-step prediction) frameworks.

This is desirable because CfCs not only establish a continuous flow, similar to ODE models1, to achieve better expressivity in representation learning, but also do so with the efficiency of discrete RNN models.

CfCs: flexible deep models for sequential tasks

Additionally, CfCs are equipped with novel time-dependent gating mechanisms that explicitly control their memory. CfCs are as expressive as their ODE-based peers and can be supplied with mixed memory architectures9 to avoid gradient issues in sequential data processing applications with long-range dependences. Beyond accuracy and performance metrics, our results indicate that, when considering accuracy per compute time, CfCs exhibit over 150-fold improvements over their ODE-based counterparts. We perform a diverse set of advanced time-series modelling experiments and present the performance and speed gain achievable by using CfCs in tasks with long-term dependences, irregular data and modelling physical dynamics, among others.

Deriving a closed-form solution

In this section, we derive an approximate closed-form solution for LTC networks, an expressive subclass of time-continuous models. We discuss how the scalar closed-form expression derived from a small LTC system can inspire the design of CfC models. In this regard, we define the LTC semantics. We then state the main theorem that computes a closed-form approximation of a given LTC system for the scalar case. To prove the theorem, we first find the integral solution of the given LTC ODE system. We then compute a closed-form analytical solution for the integral solution for the case of piecewise constant inputs. Afterward, we generalize the closed-form solution of the piecewise constant inputs to the case of arbitrary inputs with our novel approximation and finally provide sharpness results (that is, measure the rate and accuracy of an approximation error) for the derived solution.

The hidden state of an LTC network is determined by the solution of the following initial value problem (IVP)1:

$$\frac{{\mathrm{d}}{{{\bf{x}}}}}{{\mathrm{d}}t}=-\left[{w}_{\tau }+f({{{\bf{x}}}},{{{\bf{I}}}},\theta )\right]\odot {{{\bf{x}}}}(t)+A\odot f({{{\bf{x}}}},{{{\bf{I}}}},\theta ),$$
(1)

where at a time step t, x(D×1)(t) defines the hidden state of an LTC layer with D cells, and I(m×1)(t) is an exogenous input to the system with m features. Here, \({w}_{\tau }^{(D\times 1)}\) is a time-constant parameter vector, A(D×1) is a bias vector, f is a neural network parametrized by θ and ⊙ is the Hadamard product. The dependence of f(.) on x(t) denotes the possibility of having recurrent connections.

The full proof of theorem 1 is given in Methods. The theorem formally demonstrates that the approximated closed-form solution for the given LTC system is given by equation (2) and that this approximation is tightly bounded with bounds given in the proof.

In the following, we show an illustrative example of this tightness result in practice. To do this, we first present an instantiation of LTC networks and their approximate closed-form expressions. Extended Data Fig. 1 shows a liquid network with two neurons and five synaptic connections. The network receives an input signal I(t). Extended Data Fig. 1 further derives the DE expression for the network along with its closed-form approximate solution. In general, it is possible to compile an LTC network into its closed-form expression as illustrated in Extended Data Fig. 1. This compilation can be performed using Algorithm 1 provided in Methods.

Theorem 1

Given an LTC system determined by the IVP in equation (1), constructed by one cell, receiving a single-dimensional time-series exogenous input I(t) with no self-connections, the following expression is an approximation of its closed-form solution:

$$x(t)\approx ({x}_{0}-A){\mathrm{e}}^{-[{w}_{\tau }+f(I(t),\theta )]t}f(-I(t),\theta )+A.$$
(2)
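The following sketch compares equation (2) against a numerical solution of the scalar LTC ODE used in the proof (the symmetric-wτ form given in Methods). The parameter values, the sigmoidal choice of f and the input signal are arbitrary, illustrative assumptions rather than the authors' reference code:

import numpy as np
from scipy.integrate import solve_ivp

w_tau, A, x0 = 0.5, -1.0, 0.0            # illustrative scalar parameters
f = lambda v: 1.0 / (1.0 + np.exp(-v))   # positive, bounded, monotone nonlinearity
I = lambda t: np.sin(t)                  # arbitrary continuous input signal

# ODE from the proof: dx/dt = -[w_tau + f(I(t))] x + A [w_tau + f(I(t))]
rhs = lambda t, x: -(w_tau + f(I(t))) * x + A * (w_tau + f(I(t)))
ts = np.linspace(0.0, 10.0, 200)
sol = solve_ivp(rhs, (0.0, 10.0), [x0], t_eval=ts, rtol=1e-8, atol=1e-8)

# Closed-form approximation of equation (2)
x_cf = (x0 - A) * np.exp(-(w_tau + f(I(ts))) * ts) * f(-I(ts)) + A
print("max abs deviation:", np.max(np.abs(sol.y[0] - x_cf)))

The printed deviation stays within the exponentially decaying bound |x0 − A| e−wτt derived in the proof.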

Tightness of the closed-form solution in practice

Figure 2 shows an LTC-based network trained for autonomous driving22. The figure further illustrates how closely the proposed solution fits the actual dynamics exhibited by a single-neuron ODE given the same parametrization. The details of this experiment are given in Methods.

Fig. 2: Tightness of the closed-form solution in practice.
figure 2

We approximate a closed-form solution for LTC networks1 while largely preserving the trajectories of their equivalent ODE systems. We develop our solution into CfC models that are at least 100-fold faster than neural ODEs at both training and inference on complex time-series prediction tasks.

We next show how to design a novel neural network instance inspired by this closed-form solution that has well-behaved gradient properties and approximation capabilities.

Designing CfC models from the solution

Leveraging the scalar closed-form solution expressed by equation (2), we can now distil this model into a neural network model that can be trained at scale. The solution provides a grounded theoretical basis for solving scalar continuous-time dynamics, and it is important to translate this theory into a practical neural network model which can be integrated into larger representation learning systems equipped with gradient descent optimizers. Doing so requires careful attention to potential gradient and expressivity issues that can arise during optimization, which we will outline in this section.

Formally, the hidden states, x(t)(D×1) with D hidden units at each time step t, can be obtained explicitly as

$${{{\bf{x}}}}(t)=B\odot {\mathrm{e}}^{-[{w}_{\tau }+f({{{\bf{x}}}},{{{\bf{I}}}};\theta )]t}\odot f(-{{{\bf{x}}}},-{{{\bf{I}}}};\theta )+A,$$
(3)

where B(D) collapses (x0 − A) of equation (2) into a parameter vector. A(D) and \({w}_{\tau }^{(D)}\) are the system’s parameter vectors, while I(t)(m×1) is an m-dimensional input at each time step t, f is a neural network parametrized by \(\theta =\{{W}_{Ix}^{(m\times D)},{W}_{xx}^{(D\times D)},{b}_{x}^{(D)}\}\) and ⊙ is the Hadamard (element-wise) product. While the neural network presented in equation (3) can be proven to be a universal approximator, as it is an approximation of an ODE system1,2, in its current form it has trainability issues, which we point out and resolve shortly.

Resolving the gradient issues

The exponential term in equation (3) drives the system’s first part (exponentially fast) to 0 and the entire hidden state to A. This issue becomes more apparent when there are recurrent connections and causes vanishing gradient factors when trained by gradient descent23. To reduce this effect, we replace the exponential decay term with a reversed sigmoidal nonlinearity σ(.). This nonlinearity is approximately 1 at t = 0 and approaches 0 in the limit t → ∞. However, unlike exponential decay, its transition happens much more smoothly, yielding a better-conditioned loss surface.

Replacing biases by learnable instances

Next, we consider the bias parameter B to be part of the trainable parameters of the neural network f( − x, − I; θ) and choose to use a new network instance instead of f (presented in the exponential decay factor). We also replace A with another neural network instance, h(. ) to enhance the flexibility of the model. To obtain a more general network architecture, we allow the nonlinearity f(−x, −I; θ) present in equation (3) to have both shared (backbone) and independent (g(. )) network compartments.

Gating balance

The time-decaying sigmoidal term can play a gating role if we additionally multiply h(. ) with (1 − σ(. )). This way, the time-decaying sigmoid function acts as a gating mechanism that interpolates between the two limits t → −∞ and t → +∞ of the ODE trajectory.

Backbone

Instead of learning all three neural network instances f, g and h separately, we have them share the first few layers in the form of a backbone that branches out into these three functions. As a result, the backbone allows our model to learn shared representations, thereby speeding up and stabilizing the learning process. More importantly, this architectural prior enables two simultaneous benefits: (1) through the shared backbone, a coupling between the time constant of the system and its state nonlinearity is established that exploits the causal representation learning evident in liquid neural networks1,24, and (2) through separate head network layers, the system can explore temporal and structural dependences independently of each other.

These modifications result in the CfC neural network model:

$${\textbf{x}}(t) = \underbrace{\sigma(-f({\textbf{x}}, {\textbf{I}};\theta_f) {{\textbf{t}}})}_{{{\rm{time}}\text{-}{\rm{continuous}}\,{\rm{gating}}}} \odot g({\textbf{x}},{\textbf{I}};\theta_g) + \underbrace{\left[ 1 -\sigma(-[f({\textbf{x}}, {\textbf{I}};\theta_f)] {{\textbf{t}}}) \right]}_{{{\rm{time}}\text{-}{\rm{continuous}}\,{\rm{gating}}}} \odot h({\textbf{x}},{\textbf{I}};\theta_h).$$
(4)

The CfC architecture is illustrated in Extended Data Fig. 2. The neural network instances could be selected arbitrarily. The time complexity of the algorithm is equivalent to that of discretized recurrent networks25, being at least one order of magnitude faster than ODE-based networks.
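A minimal PyTorch sketch of a CfC cell implementing equation (4) is given below. The layer sizes, the tanh activations on the heads and the class name CfCCell are illustrative assumptions rather than the authors' released implementation:

import torch
import torch.nn as nn

class CfCCell(nn.Module):
    def __init__(self, input_size, hidden_size, backbone_units=128, backbone_layers=1):
        super().__init__()
        layers, in_features = [], input_size + hidden_size
        for _ in range(backbone_layers):
            layers += [nn.Linear(in_features, backbone_units), nn.Tanh()]
            in_features = backbone_units
        self.backbone = nn.Sequential(*layers)           # shared representation
        self.f = nn.Linear(backbone_units, hidden_size)  # sets the (liquid) time constant
        self.g = nn.Linear(backbone_units, hidden_size)  # head governing the t -> -inf limit
        self.h = nn.Linear(backbone_units, hidden_size)  # head governing the t -> +inf limit

    def forward(self, x, I, t):
        # x: (batch, hidden), I: (batch, input), t: (batch, 1) elapsed time of the sample
        z = self.backbone(torch.cat([x, I], dim=-1))
        gate = torch.sigmoid(-self.f(z) * t)             # time-continuous gating term
        return gate * torch.tanh(self.g(z)) + (1.0 - gate) * torch.tanh(self.h(z))

At inference, the cell is simply unrolled over the sequence, with t supplied per step as described in the next subsection.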

The procedure to account for the explicit time dependence

CfCs are continuous-depth models that can set their temporal behaviour based on the task under test. For time-variant datasets (for example, irregularly sampled time series, event-based data and sparse data), the t for each incoming sample is set based on its time stamp or order. For sequential applications where the time of occurrence of a sample does not matter, t is sampled as many times as the batch length, with equidistant intervals between two hyperparameters a and b.
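A small sketch of these two modes of setting t (an assumed helper, not from the paper's code); for irregularly sampled data one common choice is the elapsed time since the previous sample, while for order-invariant sequential data t is an equidistant grid between the hyperparameters a and b:

import numpy as np

def time_inputs(timestamps=None, seq_len=None, a=0.0, b=1.0):
    if timestamps is not None:
        ts = np.asarray(timestamps, dtype=float)
        return np.diff(ts, prepend=ts[0])        # elapsed time per sample (first entry 0)
    return np.linspace(a, b, seq_len)            # equidistant t values in [a, b]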

Experiments with CfCs

We now assess the performance of CfCs in a series of sequential data processing tasks compared with advanced, recurrent models. We first approach solving conventional sequential data modelling tasks (for example, bit-stream prediction, sentiment analysis on text data, medical time-series prediction, human activity recognition, sequential image processing and robot kinematics modelling), and compare CfC variants with an extensive set of advanced RNN baselines. We then evaluate how CfCs compare with LTC-based neural circuit policies (NCPs)22 in real-world autonomous lane-keeping tasks.

CfC network variants

To evaluate the proposed modifications we applied to the closed-form solution network described by equation (3), we test four variants of the CfC architecture: (1) the closed-form solution network (Cf-S) obtained by equation (3), (2) the CfC without the second gating mechanism (CfC-noGate), a variant that does not have the 1 − σ instance shown in Extended Data Fig. 2, (3) the CfC model (CfC) expressed by equation (4) and (4) the CfC wrapped inside a mixed memory architecture (that is, where the CfC defines the memory state of an RNN, for instance, a long short-term memory (LSTM)), a variant we call CfC-mmRNN. Each of these four proposed variants leverages our proposed solution and thus is at least one order of magnitude faster than continuous-time ODE models.

To investigate their representation learning power, in the following we extensively evaluate CfCs on a series of sequence modelling tasks. The objective is to test the effectiveness of the CfCs in learning spatiotemporal dynamics, compared with a wide range of advanced models.

Baselines

We compare CfCs with a diverse set of advanced algorithms developed for sequence modelling by both discretized and continuous mechanisms. These baselines are given in full in Methods.

Human activity recognition

The human activity dataset7 contains 6,554 sequences of humans demonstrating activities such as walking, lying and sitting. The input at each time step is a 561-dimensional vector of inertial sensor measurements recorded from the user’s smartphone26, and the output is one of six activity classes per time step.

We set up our dataset split (training, validation and test) to carefully reflect the modifications made by Rubanova et al.7 on this task. The results of this experiment are reported in Table 2. We observe that not only do the CfC variants Cf-S, CfC-noGate and CfC-mmRNN outperform other models by a large margin, but they do so with a speed-up of more than 8,752% over the best-performing ODE-based instance (Latent-ODE-ODE). The reason for such a large speed difference is the complexity of the dataset dynamics, which causes the ODE solvers of ODE-based models such as Latent-ODE-ODE to compute many steps over stiff dynamics. This issue does not exist for closed-form models as they do not use any ODE solver to account for dynamics. The hyperparameter details of this experiment are provided in Extended Data Fig. 3.

Table 2 Human activity recognition, per time-step classification

Physical dynamics modelling

The Walker2D dataset consists of kinematic simulations of the MuJoCo physics engine27 (see Methods for more details). As shown in Table 3, CfCs outperform the other baselines by a large margin, supporting their strong capability to model irregularly sampled physical dynamics with missing phases. It is worth mentioning that, on this task, CfCs even outperform transformers by a considerable margin of 18%. The hyperparameter details of this experiment are provided in Extended Data Fig. 3.

Table 3 Per time-step regression

Event-based sequential image processing

We next assess the performance of CfCs on a challenging sequential image processing task. This task is generated from the sequential modified National Institute of Standards and Technology (MNIST) dataset following the steps described in Methods. Moreover, the hyperparameter details of this experiment are provided in Extended Data Fig. 4.

Table 4 summarizes the results on this event-based sequence classification task. We observe that models such as ODE-RNN, CT-RNN, GRU-ODE and LSTMs struggle to learn a good representation of the input data and therefore show poor performance. In contrast, RNNs endowed with explicit memory, such as bi-directional RNNs, GRU-D, Lipschitz RNN, coRNN, CT-LSTM and ODE-LSTM, perform well on this task. All CfC variants perform well and establish the state of the art on this task, with CfC-mmRNN achieving 98.09% and CfC-noGate 96.99% accuracy in classifying irregularly sampled sequences. It is worth mentioning that they do so around 200–400% faster than ODE-based models such as GRU-ODE and ODE-RNN.

Table 4 Event-based sequence classification on irregularly sequential MNIST

Regularly and irregularly sampled bit-stream XOR

The bit-stream XOR dataset9 considers the classification of bit streams by implementing an XOR function in time. That is, each item in the sequence contributes equally to the correct output. The details are given in Methods.

Extended Data Fig. 5 compares the performance of many RNN baselines. Many architectures such as Augmented LSTM, CT-GRU, GRU-D, ODE-LSTM, coRNN and Lipschitz RNN, and all variants of CfC, can successfully solve the task with 100% accuracy when the bit-stream samples are equidistant from each other. However, when the bit-stream samples arrive at non-uniform distances, only architectures that are immune to the vanishing gradient in irregularly sampled data can solve the task. These include GRU-D, ODE-LSTM, CfC and CfC-mmRNNs. ODE-based RNNs cannot solve the event-based encoding tasks regardless of their choice of solvers, as they have vanishing/exploding gradient issues9. The hyperparameter details of this experiment are provided in Extended Data Fig. 4.

PhysioNet Challenge

The PhysioNet Challenge 2012 dataset considers the prediction of the mortality of 8,000 patients admitted to the intensive care unit. The features represent time series of medical measurements taken during the first 48 h after admission. The data are irregularly sampled in time and over features, that is, only a subset of the 37 possible features is given at each time point. We perform the same test–train split and preprocessing as in ref. 7, and report the area under the curve (AUC) on the test set as a metric in Extended Data Fig. 6. We observe that CfCs perform competitively to other baselines while performing 160 times faster in terms of training time compared with ODE-RNN and 220 times compared with continuous latent models. CfCs are also, on average, three times faster than advanced discretized gated recurrent models. The hyperparameter details of this experiment are provided in Extended Data Fig. 7.

Sentiment analysis using IMDB

The Internet Movie Database (IMDB) sentiment analysis dataset28 consists of 25,000 training and 25,000 test sentences (see Methods for more details). Extended Data Fig. 8 shows how CfCs equipped with mixed memory instances outperform advanced RNN benchmarks. The hyperparameter details of this experiment are provided in Extended Data Fig. 7.

Performance of CfCs in autonomous driving

In this experiment, our objective is to evaluate how robustly CfCs learn to perform autonomous navigation in comparison with their ODE-based counterparts, LTC networks. The task is to map incoming high-dimensional pixel observations to steering curvature commands. The details of this experiment are given in Methods.

We observe that CfCs, similar to NCPs, demonstrate a consistent attention pattern in each subtask and maintain their attention profile under heavy noise, as depicted in Extended Data Fig. 10c. By contrast, the attention profile of other networks, such as CNNs and LSTMs, degrades under added input noise (Extended Data Fig. 10c).

This experiment empirically validates that CfCs possess similar robustness properties to their ODE counterparts, that is, LTC-based networks. Moreover, similar to NCPs, CfCs are parameter efficient. They performed the end-to-end autonomous lane-keeping task with around 4,000 trainable parameters in their RNN component (Extended Data Fig. 9).

Scope, discussion and conclusions

We introduce a closed-form continuous-time neural model built from an approximate closed-form solution of LTC networks that possesses the strong modelling capabilities of ODE-based networks while being notably faster, more accurate and more stable. These closed-form continuous-time models achieve this through explicit time-dependent gating mechanisms and a liquid time constant modulated by neural networks. A discussion of related research on continuous-time models is given in Methods.

For large-scale time-series prediction tasks, and where closed-loop performance matters24, CfCs can bring great value. This is because they capture the flexible, causal and continuous-time nature of ODE-based networks, such as LTC networks, while being more efficient. A discussion on how to use different variants of CfCs is provided in Methods. On the other hand, implicit ODE- and partial differential equation-based models17,29,30,31 can be beneficial in solving continuously defined physics problems and control tasks. Moreover, for generative modelling, continuous normalizing flows built by ODEs are the suitable choice of model as they ensure invertibility, unlike CfCs2. This is because DEs guarantee invertibility (that is, under uniqueness conditions6, one can run them backwards in time). CfCs only approximate ODEs and therefore no longer necessarily form a bijection32.

What are the limitations of CfCs?

CfCs might exhibit vanishing gradient problems. To avoid this, for tasks that require long-term dependences, it is better to use them together with mixed memory networks9 (as in the CfC variant CfC-mmRNN) or with proper parametrization of their transition matrices33,34. Moreover, we speculate that inferring causality from ODE-based networks might be more straightforward than from a closed-form solution24. It would also be beneficial to assess whether verifying a continuous neural flow35 is more tractable by using an ODE representation of the system or its closed form.

For problems such as language modelling where a large amount of sequential data and substantial computational resources are available, transformers36 and their variants are great choices of models. CfCs could bring value when: (1) data have limitations and irregularities (for example, medical data, financial time series, robotics37 and closed-loop control, and multi-agent autonomous systems in supervised and reinforcement learning schemes38), (2) the training and inference efficiency of a model is important (for example, embedded applications39,40,41) and (3) when interpretability matters42.

Ethics statement

All authors acknowledge the Global Research Code on the development, implementation and communication of this research. For the purpose of transparency, we have included this statement on inclusion and ethics. This work cites a comprehensive list of research from around the world on related topics.

Methods

Proof of theorem 1

Proof. In the single-dimensional case, the IVP in equation (1) becomes linear in x as follows:

$$\frac{{\mathrm{d}}}{{\mathrm{d}}t}x(t)=-\left[{w}_{\tau }+f(I(t))\right]\cdot x(t)+Af(I(t)).$$
(5)

Therefore, we can use the theory of linear ODEs to obtain an integral closed-form solution (section 1.10 in ref. 21) consisting of two nested integrals. The inner integral can be eliminated by means of integration by substitution43. The remaining integral expression can then be solved in the case of piecewise constant inputs and approximated in the case of general inputs. The three steps of the proof are outlined below.

Integral closed-form solution of LTC

We consider the ODE semantics of a single neuron that receives some arbitrary continuous input signal I and has a positive, bounded, continuous and monotonically increasing nonlinearity f:

$$\frac{{\mathrm{d}}}{{\mathrm{d}}t}x(t)=-\left[{w}_{\tau }+f(I(t))\right]\cdot x(t)+A\cdot \left[{w}_{\tau }+f(I(t))\right].$$

Assumption. We assumed a second constant value wτ (in the input-dependent term multiplied by A) in the above representation of a single LTC neuron. This is done to introduce symmetry in the structure of the ODE, yielding a simpler expression for the solution. The inclusion of this second constant may appear to profoundly alter the dynamics. However, as shown below, numerical experiments suggest that this simplifying assumption has a marginal effect on the ability to approximate LTC cell dynamics.

Using the variation of constants formula (section 1.10 in ref. 21), we obtain after some simplifications:

$$x(t)=(x(0)-A){\mathrm{e}}^{-{w}_{\tau }t-\int\nolimits_{0}^{t}f(I(s)){\mathrm{d}}s}+A.$$
(6)

Analytical LTC solution for piecewise constant inputs

The derivation of a useful closed-form expression of x requires us to solve the integral expression \(\int\nolimits_{0}^{t}f(I(s))\,{\mathrm{d}}s\) for any t ≥ 0. Fortunately, the integral \(\int\nolimits_{0}^{t}f(I(s))\,{\mathrm{d}}s\) enjoys a simple closed-form expression for piecewise constant inputs I. Specifically, assume that we are given a sequence of time points

$$0={\tau }_{0} < {\tau }_{1} < {\tau }_{2} < \ldots < {\tau }_{n-1} < {\tau }_{n}=\infty ,$$

such that \({\tau }_{1},\ldots ,{\tau }_{n-1}\in {\mathbb{R}}\) and I(t) = γi for all t ∈ [τi; τi+1) with 0 ≤ i ≤ n − 1. Then, it holds that

$$\int\nolimits_{0}^{t}f(I(s))\,{\mathrm{d}}s=f({\gamma }_{k})(t-{\tau }_{k})+\mathop{\sum }\limits_{i=0}^{k-1}f({\gamma }_{i})({\tau }_{i+1}-{\tau }_{i}),$$
(7)

when τk ≤ t < τk+1 for some 0 ≤ k ≤ n − 1 (as usual, one defines \(\mathop{\sum }\nolimits_{i = 0}^{-1}:= 0\)). With this, we have

$$x(t)=(x(0)-A){{\mathrm{e}}}^{-{w}_{\tau }t}{{\mathrm{e}}}^{-f({\gamma }_{k})(t-{\tau }_{k})-\mathop{\sum }\limits_{i = 0}^{k-1}f({\gamma }_{i})({\tau }_{i+1}-{\tau }_{i})}+A,$$
(8)

when τk ≤ t < τk+1 for some 0 ≤ k ≤ n − 1. While any continuous input can be approximated arbitrarily well by a piecewise constant input43, a tight approximation may require a large number of discretization points τ1, …, τn. We address this next.

Analytical LTC approximation for general inputs

Inspired by equations (7) and (8), the next result provides an analytical approximation of x(t).

Lemma 1

For any Lipschitz continuous, positive, monotonically increasing and bounded f and continuous input signal I(t), we approximate x(t) in equation (6) as

$$\tilde{x}(t)=(x(0)-A){{\mathrm{e}}}^{-\left[{w}_{\tau }t+f(I(t))t\right]}f(-I(t))+A.$$
(9)

Then, \(| x(t)-\tilde{x}(t)| \le | x(0)-A| {{\mathrm{e}}}^{-{w}_{\tau }t}\) for all t ≥ 0. Writing c = x(0) − A for convenience, we additionally obtain the following sharpness results:

  1. For any t ≥ 0, we have \(\sup \left\{ \frac{1}{c}(x(t)-\tilde{x}(t))| I:[0;t]\to {\mathbb{R}} \right\}={{\mathrm{e}}}^{-{w}_{\tau }t}\).

  2. For any t ≥ 0, we have \(\inf \left\{ \frac{1}{c}(x(t)-\tilde{x}(t))| I:[0;t]\to {\mathbb{R}} \right\}={{\mathrm{e}}}^{-{w}_{\tau }t}({{\mathrm{e}}}^{-t}-1)\).

Above, the supremum and infimum are meant to be taken across all continuous input signals. These statements settle the question about the worst-case errors of the approximation. The first statement implies, in particular, that our bound is sharp.

The full proof is given in the next section. Lemma 1 demonstrates that the integral solution we obtained and shown in equation (6) is tightly close to the approximate closed-form solution we proposed in equation (9). Note that, as wτ is positively defined, the derived bound between equations (6) and (9) ensures an exponentially decaying error as time goes by. Therefore, we have the statement of the theorem. □

Proof of lemma 1

We start by noting that

$$x(t)-\tilde{x}(t)=c\,{{\mathrm{e}}}^{-{w}_{\tau }t}\left[{{\mathrm{e}}}^{-\int\nolimits_{0}^{t}f(I(s)){\mathrm{d}}s}-{{\mathrm{e}}}^{-f(I(t))t}f(-I(t))\right].$$

Since 0 ≤ f ≤ 1, we conclude that \({{\mathrm{e}}}^{-\int\nolimits_{0}^{t}f(I(s)){\mathrm{d}}s}\in [0;1]\) and \({{\mathrm{e}}}^{-f(I(t))t}f(-I(t))\in [0;1]\). This shows that \(| x(t)-\tilde{x}(t)| \le | c| {{\mathrm{e}}}^{-{w}_{\tau }t}\). To see the sharpness results, pick some arbitrarily small ε > 0 and a sufficiently large C > 0 such that f(−C) ≤ ε and 1 − ε ≤ f(C). With this, for any 0 < δ < t, we consider the piecewise constant input signal I such that I(s) = −C for s ∈ [0; t − δ] and I(s) = C for s ∈ (t − δ; t]. Then, it can be noted that

$$\begin{array}{l}{{\mathrm{e}}}^{-\int\nolimits_{0}^{t}f(I(s)){\mathrm{d}}s}-{{\mathrm{e}}}^{-f(I(t))t}f(-I(t))\ge \\ {{\mathrm{e}}}^{-\varepsilon t-\delta \cdot 1}-{{\mathrm{e}}}^{-(1-\varepsilon )\cdot t}\varepsilon \to 1,\,\,{{{\rm{when}}}}\,\,\varepsilon ,\delta \to 0\end{array}.$$

Statement 1 follows by noting that there exists a family of continuous signals \({I}_{n}:[0;t]\to {\mathbb{R}}\) such that \(| {I}_{n}(\cdot )| \le C\) for all n ≥ 1 and In → I pointwise as n → ∞. This is because

$$\begin{array}{l}\mathop{\lim }\limits_{n\to \infty }\left|\int\nolimits_{0}^{t}f(I(s))\,{\mathrm{d}}s-\int\nolimits_{0}^{t}f({I}_{n}(s))\,{\mathrm{d}}s\right|\le \\ \mathop{\lim }\limits_{n\to \infty }\int\nolimits_{0}^{t}|\, f(I(s))-f({I}_{n}(s))| \,{\mathrm{d}}s\le \mathop{\lim }\limits_{n\to \infty }L\int\nolimits_{0}^{t}| I(s)-{I}_{n}(s)| \, {\mathrm{d}}s=0,\end{array}$$

where L is the Lipschitz constant of f, and the last identity is due to the dominated convergence theorem43. To see statement 2, we first note that the negation of the signal −I provides us with

$$\begin{array}{l}{{\mathrm{e}}}^{-\int\nolimits_{0}^{t}f(-I(s)){\mathrm{d}}s}-{{\mathrm{e}}}^{-f(-I(t))t}f(I(t))\le \\ {{\mathrm{e}}}^{-(1-\varepsilon )(t-\delta )-\delta \cdot 0}-{{\mathrm{e}}}^{-\varepsilon \cdot t}(1-\varepsilon )\to {{\mathrm{e}}}^{-t}-1,\end{array}$$

if ε, δ → 0. The fact that the left-hand side of the last inequality must be at least \({{\mathrm{e}}}^{-t}-1\) follows by observing that \({{\mathrm{e}}}^{-t}\le {{\mathrm{e}}}^{-\int\nolimits_{0}^{t}f(I^{\prime} (s)){\mathrm{d}}s}\) and \({{\mathrm{e}}}^{-f(I^{\prime\prime} (t))t}f(-I^{\prime\prime} (t))\le 1\) for any \(I^{\prime} ,I^{\prime\prime} :[0;t]\to {\mathbb{R}}\). □

Compiling LTC architectures into their closed-form equivalent

In general, it is possible to compile the architecture of an LTC network into its closed-form version. This compilation allows us to speed up the training and inference time of ODE-based networks as the closed-form variant does not require complex ODE solvers to compute outputs. Algorithm 1 provides the instructions on how to transfer the architecture of an LTC network into its closed-form variant. Here, WAdj corresponds to the adjacency matrix that maps exogenous inputs to hidden states and the coupling among hidden states. This adjacency matrix can have an arbitrary sparsity (that is, there is no need to use a directed acyclic graph for the coupling between neurons).

Algorithm 1

Translate the architecture of an LTC network into its closed-form variant

Inputs: LTC inputs I(N×T)(t), the activity x(H×T)(t) and initial states x(H×1)(0) of LTC neurons and the adjacency matrix for synapses \({W}_{{\mathrm{Adj}}}^{[(N+H)\times (N+H)]}\)

 LTC ODE solver with step of Δt

 time-instance vectors of inputs, \({{{{\bf{t}}}}}_{I(t)}^{(1\times T)}\)

 time-instance of LTC neurons tx(t)    time might be sampled irregularly

 LTC neuron parameter τ(H×1)

 LTC network synaptic parameters {σ(N×H), μ(N×H), A(N×H)}

Outputs: LTC closed-form approximation of hidden state neurons, \({\hat{{{{\bf{x}}}}}}^{(N\times T)}(t)\)

xpre(t) = WAdj × [I0, …, IN, x0, …, xH]    all presynaptic signals to nodes

for ith neuron in neurons 1 to H do

  for j in Synapses to ith neuron do

  \({\hat{x}}_{i}+=({x}_{0}-{A}_{ij})\,{\mathrm{e}}^{-{t}_{x(t)}\odot \left(1/{\tau }_{i}+\frac{1}{1+{\mathrm{e}}^{-{\sigma }_{ij}({x}_{{\mathrm{pre}}_{ij}}-{\mu }_{ij})}}\right)}\odot \frac{1}{1+{\mathrm{e}}^{{\sigma }_{ij}({x}_{{\mathrm{pre}}_{ij}}-{\mu }_{ij})}}+{A}_{ij}\)

  end for

end for

return \(\hat{{{{\bf{x}}}}}(t)\)
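A hedged Python sketch of the per-neuron update in Algorithm 1, vectorized over synapses and time (the array shapes, the sigmoidal form of f and the function name are assumptions; this is not the authors' released code):

import numpy as np

def ltc_closed_form(x_pre, t_x, tau, sigma, mu, A, x0):
    # x_pre: (H, N+H, T) presynaptic signals, t_x: (T,) time instances
    # tau: (H,), sigma/mu/A: (H, N+H) synaptic parameters, x0: scalar initial state
    s = 1.0 / (1.0 + np.exp(-sigma[..., None] * (x_pre - mu[..., None])))      # f(x_pre)
    s_neg = 1.0 / (1.0 + np.exp(sigma[..., None] * (x_pre - mu[..., None])))   # f(-x_pre)
    decay = np.exp(-t_x * (1.0 / tau[:, None, None] + s))                      # exponential term
    x_hat = (x0 - A[..., None]) * decay * s_neg + A[..., None]                 # per-synapse contribution
    return x_hat.sum(axis=1)   # accumulate over incoming synapses, shape (H, T)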

Experimental details of the tightness experiment

We took a trained NCP22, which consists of a perception module and an LTC-based network1 that possesses 19 neurons and 253 synapses. The network was trained to steer a self-driving vehicle autonomously. We used recorded real-world test runs of the vehicle for a lane-keeping task governed by this network. The records included the inputs, outputs and all the LTC neurons’ activities and parameters. To evaluate numerically whether our proposed closed-form solution for LTC neurons also holds up in practice, we inserted the parameters of individual neurons and synapses of the DEs into the closed-form solution (similar to the representations shown in Extended Data Fig. 1b,c) and emulated the structure of the ODE-based LTC network. We then visualized the output neuron’s dynamics of the ODE (in blue) and of the closed-form solution (in red). As illustrated in Fig. 2, we observed that the behaviour of the ODE is captured by the closed-form solution with a mean squared error of 0.006. This experiment provides numerical evidence for the tightness results presented in our theory. Hence, the closed-form solution captures the main properties of liquid networks in approximating dynamics.

Baseline models

The example baseline models considered include some variations of classical auto-regressive RNNs, such as an RNN with concatenated Δt (RNN-Δt), a recurrent model with moving average on missing values (RNN-impute), RNN-Decay7, LSTMs44 and gated recurrent units (GRUs)45. We also report results for a variety of encoder–decoder ODE-RNN-based models, such as RNN-VAE, latent variable models with RNNs, and with ODEs, all from ref. 7.

Furthermore, we include models such as interpolation prediction networks (IP-Net)46, set functions for time series (SeFT)47, CT-RNN48, CT-GRU49, CT-LSTM50, GRU-D51, PhasedLSTM52 and bi-directional RNNs53. Finally, we benchmarked CfCs against competitive recent RNN architectures with the premise of tackling long-term dependences, such as Legendre memory units54, high-order polynomial projection operators (Hippo)55, orthogonal recurrent models (expRNNs)56, mixed memory RNNs such as ODE-LSTMs9, coupled oscillatory RNNs (coRNN)57 and Lipschitz RNN58.

Experimental details for the Walker2D dataset

This task is designed based on the Walker2d-v2 OpenAI gym59 environment using data from four different stochastic policies. The objective is to predict the physics state in the next time step. The training and testing sequences are provided at irregularly sampled intervals. We report the squared error on the test set as a metric.

Description of the event-based MNIST experiment

We first sequentialize each image by transforming each 28 × 28 image into a long series of length 784. The objective is to predict the class corresponding to each image from the long input sequence. Advanced sequence modelling frameworks such as coRNN57, Lipschitz RNN58 and mixed memory ODE-LSTM9 can solve this task very well with accuracy of up to 99.0%. However, we aim to make the task even more challenging by sparsifying the input vectors with an event-like, irregularly sampled encoding. To this end, in each input vector (that is, flattened image), we transform each consecutive run of equal values into one event. For instance, within the long binary vector of an image, the sequence 1, 1, 1, 1 is transformed to (1, t = 4) (ref. 9). This way, sequences of length 784 are condensed into event-based, irregularly sampled sequences of length 256 that are far more challenging to handle than equidistant input signals. A recurrent model now has to learn to memorize input information of length 256 while keeping track of the time lags between the events.
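A small sketch of this event-based encoding (assumed helper code, not the paper's preprocessing pipeline): consecutive runs of equal values in the flattened binary image are collapsed into (value, run length) events:

import numpy as np

def to_events(bits):
    bits = np.asarray(bits)
    change = np.flatnonzero(np.diff(bits)) + 1                 # indices where the value flips
    starts = np.concatenate(([0], change))
    lengths = np.diff(np.concatenate((starts, [len(bits)])))
    return list(zip(bits[starts].tolist(), lengths.tolist()))

print(to_events([0, 1, 1, 1, 1, 0, 0]))   # [(0, 1), (1, 4), (0, 2)]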

Description of the event-based XOR encoding experiment

The bit streams are provided in densely sampled and event-based sampled formats. The densely sampled version simply represents an incoming bit as an input event. The event-based sampled version transmits only bit changes to the network, that is, multiple equal bits are packed into a single input event. Consequently, the densely sampled variant is a regular sequence classification problem, whereas the event-based encoding variant represents an irregularly sampled sequence classification problem.

Experimental details of the IMDB dataset experiment

Each sentence corresponds to either positive or negative sentiment. We tokenize the sentences in a word-by-word fashion with a vocabulary consisting of the 20,000 words occurring most frequently in the dataset. We map each token to a vector using trainable word embedding. The word embedding is initialized randomly. No pretraining of the network or word embedding is performed.

Setting of the driving experiment

It has been shown that models based on LTC networks are more robust when trained on offline demonstrations and tested online in closed loop with their environments, in many end-to-end robot control tasks such as mobile robots60, autonomous ground vehicles22 and autonomous aerial vehicles24,61. This robustness in decision-making (that is, their flexibility in learning and executing the task from demonstrations despite environmental or observational disturbances and distributional shifts) originates from their model semantics that formally reduces to dynamic causal models20,24. Intuitively, LTC-based networks learn to extract a good representation of the task they are given (for example, their attention maps indicate what representation they have learned to focus on the road with more attention to the road’s horizon) and maintain this understanding under heavy distribution shifts. An example is illustrated in Extended Data Fig. 10.

In this experiment, we aim to investigate whether CfC models and their variants, such as CfC-mmRNN, possess this robustness characteristic (maintaining their attention map under distribution shifts and added noise), similar to their ODE counterparts (LTC-based networks called NCPs22).

We start by training neural network architectures that possess a convolutional head stacked with the choice of RNN. The RNN compartment of the networks is replaced by LSTM networks, NCPs22, Cf-S, CfC-NoGate and CfC-mmRNN. We also trained a fully convolutional neural network for the sake of proper comparison. Our training pipeline followed an imitation learning approach with paired pixel-control data from a 30 Hz BlackFly PGE-23S3C red–green–blue camera, collected by a human expert driver across a variety of rural driving environments, including different times of day, weather conditions and seasons of the year. The original 3 h data set was further augmented to include off-orientation recovery data using a privileged controller62 and a data-driven view synthesizer63. The privileged controller enabled the training of all networks using guided policy learning64. After training, all networks were transferred on-board our full-scale autonomous vehicle (Lexus RX450H, retrofitted with drive-by-wire capability). The vehicle was consistently started at the centre of the lane, initialized with each trained model and run to completion at the end of the road. If the model exited the bounds of the lane, a human safety driver intervened and restarted the model from the centre of the road at the intervention location. All models were tested with and without noise added to the sensory inputs to evaluate robustness.

The testing environment consisted of 1 km of private test road with unlabelled lane markers, and we observed that all trained networks were able to successfully complete the lane-keeping task at a constant velocity of 30 km h−1. Extended Data Fig. 10 provides an insight into how these networks reach driving decisions. To this end, we computed the attention of each network while driving by using the VisualBackProp algorithm65.

Related works on continuous-time models

Continuous-time models

Machine learning, control theory and dynamical systems merge at models with continuous-time dynamics60,66,67,68,69. In a seminal work, Chen et al.2,7 revived the class of continuous-time neural networks48,70, with neural ODEs. These continuous-depth models give rise to vector field representations and a set of functions that were not possible to generate before with discrete neural networks. These capabilities enabled flexible density estimation3,4,5,71,72 as well as performant modelling of sequential and irregularly sampled data1,7,8,9,58. In this paper, we showed how to relax the need for an ODE solver to realize an expressive continuous-time neural network model for challenging time-series problems.

Improving neural ODEs

ODE-based neural networks are as good as their ODE solvers. As the complexity or the dimensionality of the modelling task increases, ODE-based networks demand a more advanced solver that largely impacts their efficiency17, stability13,15,73,74,75 and performance1. A large body of research has studied how to improve the computational overhead of these solvers, for example, by designing hypersolvers17, deploying augmentation methods4,12, pruning6 or regularizing the continuous flows14,15,16. To enhance the performance of an ODE-based model, especially in time-series modelling tasks76, solutions for stabilizing their gradient propagation have been provided9,58,77. In this work, we showed that CfCs improve the scalability, efficiency and performance of continuous-depth neural models.

Which CfC variants to choose in different applications

Our extensive experimental results demonstrate that different CfC variants, namely Cf-S, CfC-noGate, vanilla CfC and CfC-mmRNN, achieve comparable results to each other, while one comes out on top depending on the nature of the dataset. We suggest using CfC in most cases where the sequence length is up to a couple of hundred steps. To capture longer-range dependences, we recommend CfC-mmRNN. The Cf-S variant is effective when we aim to obtain the fastest inference time. CfC-noGate could be tested as a hyperparameter when using the vanilla CfC as the primary choice of model.

Description of hyperparameters

The hyperparameters used in our experimental results are as follows:

  • clipnorm: the gradient clipping norm (that is, the global norm clipping threshold)

  • optimizer: the weight update preconditioner (for example, Adam, Stochastic Gradient Descent with momentum, etc.)

  • batch_size: the number of samples used to compute the gradients

  • hidden size: the number of RNN units

  • epochs: the number of passes over the training dataset

  • base_lr: the initial learning rate

  • decay_lr: the factor by which the learning rate is multiplied after each epoch

  • backbone_activation: the activation function of the backbone layers

  • backbone_dr: the dropout rate of the backbone layers

  • forget_bias: the forget gate bias (for mmRNN and LSTM)

  • backbone_units: the number of hidden units per backbone layer

  • backbone_layers: the number of backbone layers

  • weight_decay: the L2 weight regularization factor

  • τdata: the constant factor by which the elapsed time input is multiplied (default value 1)

  • init: the gain of the Xavier uniform distribution for the weight initialization (default value 1)