Deep neural networks using a single neuron: folded-in-time architecture using feedback-modulated delay loops

Deep neural networks are among the most widely applied machine learning tools showing outstanding performance in a broad range of tasks. We present a method for folding a deep neural network of arbitrary size into a single neuron with multiple time-delayed feedback loops. This single-neuron deep neural network comprises only a single nonlinearity and appropriately adjusted modulations of the feedback signals. The network states emerge in time as a temporal unfolding of the neuron’s dynamics. By adjusting the feedback-modulation within the loops, we adapt the network’s connection weights. These connection weights are determined via a back-propagation algorithm, where both the delay-induced and local network connections must be taken into account. Our approach can fully represent standard Deep Neural Networks (DNN), encompasses sparse DNNs, and extends the DNN concept toward dynamical systems implementations. The new method, which we call Folded-in-time DNN (Fit-DNN), exhibits promising performance in a set of benchmark tasks.

1) What keeps me from calling the paper visionary is Ref [13], which introduced the same core idea, albeit for a much simpler variant of neural networks. To clarify the paper's contributions, I would suggest stating them clearly somewhere in the first part of the paper, mentioning explicitly what is new in this paper in comparison with Ref [13] and the large body of work done on the topic in the meantime.
2) To put the theoretical model in hardware is not a trivial task. I am not an electrical or a photonics engineer, but I believe that the paragraph starting with "The Fit-DNN approach…" (line 252) discusses the possibility of putting the proposed model into hardware with too much ease, giving the reader the wrong impression that it is a relatively easy task. I believe a more realistic discussion is needed here, seriously addressing the engineering and research challenges that must be solved in order to have a working hardware prototype. Perhaps I am wrong, but orchestrating all the delay loops purely with photonics may be very challenging, while converting from optical to electrical signals may be too slow. Wouldn't it be easier to build a first prototype in silicon? Also, from a hardware design perspective, how easily would such a device scale to millions of delay loops to reach the scale of current neural networks (an advantage here, I assume, is the linear relation between the number of delay loops and neurons)?
Modulation capabilities of up to 96 GSamples per second and signal acquisition with 40 GHz analog bandwidth and 256 GSa/s are available, providing the perspective for a photonic implementation. In contrast, an electronic implementation in silicon would come with some particular advantages but would, e.g., suffer from the fact that delay lines cannot be implemented as easily and efficiently, whereas in photonics a delay line requires just a piece of standard telecommunication fiber (or a passive optical waveguide in an integrated version). To reach DNNs with many hidden layers, amplification might be required to compensate for the losses from layer to layer. In the photonic scheme, this would be implemented via a semiconductor optical amplifier for fiber-optical implementations, or via gain sections in photonic integrated circuits. These types of functional elements are also standard technology.
Millions of delay loops are not needed, since the number of delay loops scales only linearly with the number of virtual nodes in a single layer. For sparse connectivity, the number of delay loops can even be much lower than the number of virtual nodes in a single layer. While scaling to the sizes of current digitally simulated neural networks would be quite challenging, our proposed approach could in principle require orders of magnitude fewer components than a direct hardware implementation, where each artificial neuron and each weight would need to be represented by its own element. An aspect that should not be forgotten is that, by using the same physical element to emulate all neurons, we essentially guarantee that the functional neurons in the equivalent network are identical. Parameter mismatches and non-identical elements can be a problem in implementations relying on large arrays of hardware neurons.
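As a rough, hypothetical illustration of this scaling argument (N and L below are arbitrary example values, not numbers from the manuscript):

```python
# Hedged back-of-the-envelope comparison: hardware elements needed for a
# fully connected Fit-DNN versus a direct hardware network of the same size.
# N and L are arbitrary example values, not numbers from the manuscript.
N, L = 400, 3                      # nodes per hidden layer, hidden layers

fit_dnn_loops = 2 * N - 1          # delay loops for full connectivity
fit_dnn_neurons = 1                # a single neuron is reused in time

direct_neurons = N * L             # one physical element per neuron
direct_weights = (L - 1) * N * N   # one element per inter-layer weight

print(fit_dnn_loops + fit_dnn_neurons)   # 800 elements (plus couplers)
print(direct_neurons + direct_weights)   # 321,200 dedicated elements
```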
In comparison to other analog or neuromorphic hardware platforms for conventional DNNs, it is worth noting that those approaches require a much larger number of precision-fabricated and tuned elements. They may not depend on time-unfolding, giving the impression that they would be much faster. In reality, however, because our approach uses far fewer but much faster components, it might turn out to be very competitive in terms of speed, cost, and energy consumption.
Altogether, although we are providing some perspectives for hardware implementations here, we emphasize once more that this manuscript introduces and discusses a novel concept for realizing DNNs. Focusing on a particular hardware implementation in this manuscript would not only narrow the perspective but would also go far beyond the scope of this work. A hardware implementation would require a demanding, very detailed, and platform-specific characterization of its properties, which would alienate a large part of the targeted audience and dilute the general aspects of our novel approach.
Nevertheless, we have extended the discussion of hardware implementation in subsection 3.4 of the "Discussion" section: In addition to providing a dynamical systems perspective on DNNs, Fit-DNNs can also serve as blueprints for specialized DNN hardware. The Fit-DNN approach is agnostic concerning the type of nonlinearity, enabling flexibility of implementations. A suitable candidate could be a photonic neuromorphic implementation [13,14,15,16,52,53,20], where a fast artificial neuron operating on gigahertz timescales can be realized. Photonic systems have already been used to construct delay-based reservoir computers. In retrospect, it is quite clear how instrumental the reduced hardware requirements of a delay-based approach were in stimulating the current ecosystem of reservoir computing implementations. For example, delay-based reservoir computing has been successfully implemented using electronic systems, magnetic spin systems, MEMS, acoustic systems, and other platforms. We hope that a similarly stimulating effect can be achieved for the much larger community around DNNs with the Fit-DNN approach presented here, since it also drastically reduces the cost and complexity of hardware-based DNNs.
Certainly, realizations on different hardware platforms face different challenges. In the following, we exemplify the requirements for a photonic (optoelectronic) scheme. Such an implementation requires only one light source, a few fiber couplers, and optical fibers of different lengths. The modulations of the delay loops can be implemented using Mach-Zehnder intensity modulators. Finally, only two fast photodetectors (one for all delay loops and one for the output) would be required, as well as an optical or electrical amplifier to compensate for roundtrip losses. All of these are standard telecommunication components. The conversion from optical to electrical signals can be done extremely fast, at rates exceeding the clock rates of today's fast electronic processors, and only two photodetectors are needed, regardless of the number of virtual nodes and the number of delay loops.

3) As the authors mentioned, emulating sparse neural networks may indeed be a commonsense approach for the proposed model. Still, in Ref [37] and in all of the follow-up sparse training works, it has been shown that dynamic sparsity alone (optimizing and adapting the sparse connectivity during training) can consistently outperform the dense network. Is it easy to incorporate the idea of dynamic sparsity into the proposed theoretical model?
In our approach, removing or adding a delay loop would change an entire diagonal in the hidden weight matrices. This is due to the Fit-DNN's special sparsity pattern: the delay-induced connections are arranged on diagonals of the connection matrices. The sparsity training algorithms in [37] and related references rely on the ability to choose the non-zero connections of sparse networks flexibly. Therefore, they are not directly applicable to the Fit-DNN. We tested pruning and further dynamic sparsity training methods for our model. We observed that removing all weights of a diagonal at once perturbed the previous training too much, so that the method failed. Adding diagonals incrementally while testing which positions were the most favorable did not improve the performance either. Nevertheless, we expect that it is possible to find a suitable method to optimize the choice of delays. One candidate for such a method is pruning by slowly fading out diagonals that contain weaker connections on average. However, further investigations of specific sparsity training methods for the Fit-DNN are beyond the scope of this manuscript.
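To make the diagonal constraint concrete, here is a minimal numpy sketch (our own illustration, not code from the manuscript; the sizes and the fading factor are arbitrary) of how a delay loop occupies one full diagonal of a hidden weight matrix, and why "fading out" a loop necessarily scales that whole diagonal:

```python
import numpy as np

N, D = 6, 3                        # toy sizes: nodes per layer, delay loops
rng = np.random.default_rng(0)

# Each delay loop contributes one diagonal (here: one offset) of the N x N
# hidden weight matrix; the offsets stand in for the delays.
offsets = rng.choice(np.arange(-(N - 1), N), size=D, replace=False)
W = np.zeros((N, N))
for k in offsets:
    W += np.diag(rng.normal(size=N - abs(k)), k=k)

# "Fading out" a loop scales its entire diagonal at once; individual weights
# on that diagonal cannot be removed independently, which is why standard
# dynamic-sparsity methods do not transfer directly.
fade = 0.5                         # hypothetical fading factor
k = offsets[0]
mask = np.ones_like(W) + (fade - 1.0) * np.diag(np.ones(N - abs(k)), k=k)
W_faded = W * mask
```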
To address the sparsity issue and to discuss our previous attempts to adapt existing sparsity training methods for the Fit-DNN, we added the following paragraph in the Discussion section: It has been shown that dynamic sparsity [38,39] can outperform dense networks and, fundamentally, Fit-DNNs are intrinsically compatible with certain kinds of sparsity. However, in our approach, removing or adding a delay loop would change an entire diagonal in the hidden weight matrices. Therefore, sparsity training algorithms such as [38,39] and related works are not directly applicable to the Fit-DNN. Our preliminary tests have shown that removing the weights of a diagonal at the same time disturbs the previous training too much, so the method fails. Nevertheless, we expect that it is possible to find a suitable method to optimize the choice of delays. Therefore, further investigation of specific sparsity training methods for the Fit-DNN would be very welcome. One candidate for such a method could be pruning by slowly fading diagonals that contain weaker connections on average.

4) Just a curiosity: why are you using the sine function (an unusual choice for neural networks) and not a sigmoid?
Since the sine nonlinearity is compatible with an optoelectronic implementation (see, e.g., [14,20,33]), we considered it the default choice for the numerics. However, we have also performed tests with other, more traditional nonlinearities such as "tanh" and "ReLU"; see Table 2 in the Supplementary Information. The Fit-DNN works equally well for all of these nonlinearities, and we conclude that the particular type of nonlinearity does not seem to be crucial.

5) In terms of evaluation, I assume that it can be quite time-consuming to train a model with an increased number of hidden neurons, even if you wrote the code in C++. Still, for a better overview of the model performance, a comparison with the equivalent MLP model (with exactly the same static sparse connectivity, number of neurons, and all other settings) would help in understanding the model performance better. Also, a fully connected MLP (same settings and neurons) would be interesting to see.
The training is indeed time-consuming because we need to solve the delay-dynamical equation for each training step. (This could be overcome if "online" training, i.e., training on a hardware-implemented system, can be realized.) The following table compares the training time for the Fit-DNN with the training time of the corresponding MLP (for MNIST, 10 epochs, with the standard parameters listed in Table 1). Note that we used a straightforward implementation of the MLP and the back-propagation algorithm for comparison and that there is certainly room for optimization (e.g., parallelization). Moreover, the time required for solving the DDE strongly depends on the chosen numerical integration method and step size. We used the Heun method with a step size of h = θ/32 for θ < 2 and h = 1/16 for θ ≥ 2; i.e., the CPU times in the above table were measured with h = 1/64.
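For readers who want to see what such a computation involves, the following minimal sketch (our own illustration, not the authors' code) shows Heun integration of a simplified single-delay version of the system with a history buffer on the step grid; the placeholder signals J(t) and M(t) and all parameter values are assumptions:

```python
import numpy as np

# Heun integration of a simplified single-delay version of the system,
#   x'(t) = -alpha * x(t) + f(J(t) + M(t) * x(t - tau)),
# with a history buffer on the step grid. All signals and parameter values
# below are placeholders, not the settings used in the paper.
alpha, tau, theta = 1.0, 4.0, 0.5
h = theta / 32                         # step size used for theta < 2
steps = 2000
delay_steps = int(round(tau / h))

f = np.sin                             # default activation in the paper
J = lambda t: 0.1 * np.sign(np.sin(0.3 * t))   # placeholder step-like input
M = lambda t: 0.8                              # placeholder modulation

x = np.zeros(steps + 1)

def delayed(i):
    j = i - delay_steps
    return x[j] if j >= 0 else 0.0     # zero history before t = 0

def rhs(t, xt, x_del):
    return -alpha * xt + f(J(t) + M(t) * x_del)

for i in range(steps):
    t = i * h
    k1 = rhs(t, x[i], delayed(i))                   # predictor (Euler) slope
    k2 = rhs(t + h, x[i] + h * k1, delayed(i + 1))  # corrector slope
    x[i + 1] = x[i] + 0.5 * h * (k1 + k2)           # Heun update
```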
The strength of our approach is, so far, the potentially fast inference: we see the potential for our method to achieve fast inference times when realized in hardware. The training process is indeed rather time-consuming because we need to solve the DDE for the forward propagation. One option to overcome this limitation would be to perform the forward propagation on specialized hardware even during the training process. However, that would certainly be more difficult to implement than "offline" training on a conventional computer. To emphasize this, we included the following passage in the manuscript (Sec. 3.5): We would also like to point out that the potential use of very fast hardware components is accompanied by the possibility of fast inference. However, a fast hardware implementation of the Fit-DNN will not accelerate the training process, because a traditional computer is still required, at least for the back-propagation of errors. If the forward propagation part of the training process is also performed on a traditional computer, the delay equation must be solved numerically for each training step, leading to a significant increase in training time. Therefore, the presented method is most suitable when fast inference and/or high hardware efficiency are prioritized. We would like to point out that the integration of the training process into the hardware part could be addressed in future extensions of our concept.
Moreover, the computational costs for the back-propagation, as reported in the first submission, scaled quadratically with the number of nodes N per hidden layer, i.e., it was slow for "wide" networks. So your assumption is correct if that version of back-propagation is applied. We have taken your comment as an opportunity to improve the back-propagation algorithm for the Fit-DNN. We found that it is actually possible to reformulate the algorithm such that the computation time scales only linearly with N. This can be achieved either by rearranging the formulas (for steps 2 and 3) or by directly altering the derivation of the algorithm.
Accordingly, we performed a comprehensive revision of the Methods Section and Section 3 of the Supplementary Information.
The updated version of the back-propagation algorithm for the Fit-DNN follows the traditional back-propagation scheme for forward-connected networks, with the minor difference that we have to distinguish between linear and nonlinear connections entering the same node. In terms of computational cost, the updated algorithm is equivalent to back-propagation for the traditional MLP. This is also clearly shown by the measured CPU times in the table above. For CIFAR-10, accuracies of 56.84% (ReLU and linear activation) and 65.76% ("zero-bias" and linear activation) have been achieved with MLPs with 20,328,000 training parameters. With extensive data augmentation, the accuracy could be increased to 78.62% [Zhouhan Lin, Roland Memisevic, Kishore Konda: "How far can we go without convolution: Improving fully-connected networks" (2015)]. For higher success rates, methods other than MLPs are required.

6) Minor: denoising is a different task from classification and uses a different evaluation metric. Why are you presenting them together in the same table? It can confuse the reader.
We have restructured Table 2 of the main text and Tables 1 and 2 of the Supplementary Information to avoid confusion.

7) Is it possible to perform experiments also on a more challenging dataset (e.g., CIFAR-100)?
To address this question, we extended our numerical studies by a classification task considering the 20 superclasses of the CIFAR-100 dataset.
- In Table 1, a new row (c) is added and the caption is modified to reflect the new CIFAR-100 tests.
-A few other minor changes.

8) [optional] How difficult would it be to adapt this idea to recurrent neural networks and convolutional neural networks?
It is indeed possible to adapt this idea to recurrent neural networks (RNNs). For this, we only need to use T-periodic modulation functions for the delayed feedback terms. In this way, we obtain hidden layers with shared weights and thus an RNN. In other words, it is always possible to implement an RNN whenever the implementation of very deep DNNs with identical layers is possible. Therefore, we expect our system to be well suited for RNN implementations. Moreover, a certain type of RNN has already been presented in Ref. [13], but without adaptable weights and with only N internal connections for the hidden layers.
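As a toy illustration of this remark (our own sketch; the sizes are arbitrary): T-periodic modulation signals make all hidden weight matrices identical, so the layer map collapses into one recurrent update applied L times:

```python
import numpy as np

N, L = 5, 4                            # toy sizes
rng = np.random.default_rng(1)

# T-periodic modulation signals induce one shared hidden weight matrix ...
W_shared = rng.normal(size=(N, N)) / np.sqrt(N)

# ... so the L identical hidden layers act like L steps of a recurrent map.
x = rng.normal(size=N)                 # state entering the first hidden layer
for _ in range(L):
    x = np.sin(W_shared @ x)           # same trainable update applied L times
```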
Implementing convolutional neural networks (CNNs) with the folded-in-time approach is much more challenging. The reason is that pooling layers are not compatible with the type of nodes that appear in the Fit-DNN setup. One option would be to try more complicated dynamics, such as systems with distributed delays. In our opinion, extending the folded-in-time approach to CNNs would be a significant and challenging step to take this approach further.
We have modified one sentence in the Discussion as follows: For instance, one can implement different layer sizes, multiple nonlinear elements, and combine different structures such as recurrent neural networks with trainable hidden layers.

9) Please perform a careful proof-read of the whole paper (including the mathematical details), as not everything is crystal clear and there are still some typos (e.g., "??" in the supplementary materials).
Done.

10) Very nice video.
Thank you very much for the positive comments.

While I do acknowledge and appreciate the innovative content of the presented material, I also think that there are obvious fundamental weaknesses of the proposed method, which the authors neglect to discuss. As it stands, this could give a misleading impression of the potential of this method, and I would request that the authors address the two most obvious issues.
Thank you for pointing out the insufficient discussion of the possible weaknesses of our method. We take this comment very seriously. Below are the responses and the list of corresponding changes in the manuscript. We also created a new subsection 3.5 (Trade-Offs) to address the limitations of the method.

The first issue relates to the obtainable performance. One should recall that the recent success of Deep Learning is strongly connected to the efficiency with which modern hardware can run many thousands of operations in parallel. In this regard, the proposed concept of using only a single nonlinearity, through which all operations need to be funnelled, looks like a fundamental bottleneck in the design. This bottleneck is not properly discussed in the manuscript and is only implicitly alluded to in section 4.2 when the case of small θ is discussed. While the authors rightly claim that the design is highly flexible, it should also be made clear that the design will not scale well to a large number of nodes due to this performance bottleneck.
The nodes of the Fit-DNN are indeed processed sequentially, i.e., the processing time of all hidden layers of the Fit-DNN is θ multiplied by the total number of nodes. The following modified paragraph (Sec. Discussion) discusses this problem and proposes possible ways to address it: Since only one nonlinear node and one fast read-out element are absolutely necessary in our approach, ultrafast components could be used that would be unrealistic or too expensive for full DNN implementations. At the same time, since the single nonlinear element performs all nonlinear operations sequentially with node separation θ, parallelization cannot be applied in this approach. The overall processing time scales linearly with the total number of nodes LN and with the node separation θ. Possible ways to address this property, which could represent a limitation in certain applications, include the use of a small node separation θ [13] or multiple parallel copies of Fit-DNNs. In this way, a tradeoff between the number of required hardware components and the amount of parallel processing is possible. At the same time, the use of a single nonlinear node comes with the advantage of almost perfect homogeneity of all folded nodes, since they are realised by the same element. We have also found a place in the Discussion section where a similar misunderstanding may arise. The sentence "As a result, just one neuron is sufficient to fold the whole complexity of the network architecture." is changed to "As a result, just one neuron with feedback is sufficient to fold the entire complexity of the network."
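To make the linear scaling concrete, here is a small worked example; the parameter values are hypothetical and only serve to illustrate the formula:

```latex
% Hypothetical worked example (assumed values, not from the manuscript):
% the hidden layers are processed sequentially in time,
%   T_proc = L * N * theta.
\[
T_{\mathrm{proc}} = L\,N\,\theta = 3 \times 400 \times 1\,\mathrm{ns}
                  = 1.2\,\mu\mathrm{s},
\]
% and K parallel copies of the Fit-DNN would reduce this to T_proc / K
% at the cost of K-fold hardware.
```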
Additionally, we mention that a possible hardware implementation would not help accelerate the training process. To this end, we added to the manuscript (Sec. 3.5) the passage on training quoted in our response to question 5 of reviewer 1 above.

The article is quite well written, informative, and complete. It is, in my opinion, a valuable contribution to the field, with a level of impact and novelty that warrants publication in Nature Communications. I would however invite the authors to address the three following points:
Thank you for the positive evaluation of the manuscript and for mentioning its suitability for Nature Communications.

-Equations (1) and (2) indicate that the neuron obeys an Ikeda-like equation. Is the approach proposed by the authors valid with other types of time-delayed systems?
There are several reasons for choosing an Ikeda-like system. It is a convenient system for delay-based reservoir computing, which was introduced in [13]. The simple structure of the instantaneous terms allows us to avoid many technical problems. Moreover, Ikeda-like systems of the form ẋ = −αx + f(a(t)) are the natural choice if one wants to keep a resemblance to the classical multilayer perceptron. Therefore, such systems are the first choice when dealing with delay-based machine learning applications.
However, we do not see any fundamental obstacles to using other types of time-delayed systems. We have made the following amendments in the manuscript in order to emphasize this point: The type or exact nature of the single neuron is not essential. To facilitate the presentation of the main ideas, we assume that the system state evolves in continuous time according to a differential equation of the general form: … Systems of the form (1) are typical for machine learning applications with delay models [13,14,33,20].

-The proposed method seems to be at the intersection of "conventional" machine learning (use of backpropagation, etc.) and delay-based reservoir computing (single neuron, etc.). The reader would intuitively assume that the new method intends to leverage the best of both worlds. A dedicated section where the performance of the Fit-DNN is discussed in contrast to the two other approaches would be of interest.
Indeed, the Fit-DNN concept incorporates both delay-based reservoir computing and DNNs. In particular, for large node separation θ, the Fit-DNN mimics conventional multilayer perceptrons; the performance in terms of accuracy is therefore equivalent in this case. Choosing a smaller θ benefits the overall computation time but decreases the achievable accuracy. This decrease strongly depends on the considered task (see Fig. 4). Compared to delay-based reservoir computing, our concept radically expands the range of possible applications. We consider this to be one of the main advantages of the Fit-DNN. See also our related responses to question 1 of reviewer 1 and question 1 of reviewer 2 above.
To address this issue more strongly in the manuscript, we have made the following changes: In the caption of Fig. 4, we changed the sentence "The accuracy obtained in the map limit case θ → ∞ is shown by the horizontal black line." to "The accuracy obtained in the map limit case θ → ∞ is shown by the horizontal black line (this corresponds to the classical sparse multilayer perceptron)."
We also added a paragraph to the Discussion section: Another major aspect of the Fit-DNN construction is the importance of the temporal node separation θ. For large node separation θ, the Fit-DNN mimics conventional multilayer perceptrons. Therefore, the performance in terms of accuracy is equivalent in this case. In contrast, choosing a smaller θ benefits the overall computation time but decreases the achievable accuracy. This decrease strongly depends on the considered tasks (see Fig. 4).

-Could the authors elaborate on the efficiency of their method from the viewpoint of computation time?
Your question has much in common with question 5 of reviewer 1. Please refer to our detailed response to this comment and the description of the corresponding amendments to the manuscript.

Introduction
Fueled by Deep Neural Networks (DNN), machine learning systems are achieving outstanding results in large-scale problems. The data-driven representations learned by DNNs empower state-of-the-art solutions to a range of tasks in computer vision, reinforcement learning, robotics, healthcare, and natural language processing [1,2,3,4,5,6,7,8,9]. Their success has also motivated the implementation of DNNs using alternative hardware platforms, such as photonic or electronic concepts, see, e.g., [10,11,12] and references therein. However, so far, these alternative hardware implementations require major technological efforts to realize partial functionalities, and, depending on the hardware platform, the corresponding size of the DNN remains rather limited [12].
Here, we introduce a folding-in-time approach to emulate a full DNN using only a single artificial neuron with feedback-modulated delay loops. Temporal modulation of the signals within the individual delay loops allows realizing adjustable connection weights among the hidden layers. This approach can reduce the required hardware drastically and offers a new perspective on how to construct trainable complex systems: the large network of many interacting elements is replaced by a single element that represents the different elements in time by interacting with its own delayed states. We show that our folding-in-time approach is fully equivalent to a feed-forward deep neural network under certain constraints, and that it additionally encompasses architectures specific to dynamical systems. We name our approach Folded-in-time Deep Neural Network, or Fit-DNN for short.
Our approach follows an interdisciplinary mindset that draws its inspiration from the intersection of AI systems, brain-inspired hardware, dynamical systems, and analogue computing. Choosing such a different perspective on DNNs leads to a better understanding of their properties, requirements, and capabilities. In particular, we discuss the nature of our Fit-DNN from a dynamical systems perspective. We derive a back-propagation approach applicable to gradient-descent training of Fit-DNNs based on continuous dynamical systems and demonstrate that it provides good performance in a number of tasks. Our approach will open up new strategies to implement DNNs in alternative hardware.
For the related machine learning method called 'reservoir computing', which is based on fixed recurrent neural networks, folding-in-time concepts have already been successfully developed [13]. Delay-based reservoir computing typically uses a single delay-loop configuration and time-multiplexing of the input data to emulate a ring topology. The introduction of this concept led to a better understanding of reservoir computing, its minimal requirements, and suitable parameter conditions. Moreover, it facilitated implementations on various hardware platforms [13,14,15,16,17,18,19]. In fact, the delay-based reservoir computing concept inspired successful implementations in terms of hardware efficiency [13], processing speed [16,20,21], task performance [22,23], and, last but not least, energy consumption [16,22].

A network folded into a single neuron
The traditional Deep Neural Networks consist of multiple layers of neurons coupled in a feed-forward architecture. Implementing their functionality with only a single neuron requires preserving the logical order of the layers while finding a way to sequentialize the operations within each layer. This can only be achieved by temporally spacing out processes that previously acted simultaneously. A single neuron receiving the correct inputs at the correct times sequentially emulates each neuron in every layer. The connections that previously linked neighboring layers now instead have to connect the single neuron at different times, and thus interlayer links turn into delay-connections. The weight of these connections has to be adjustable, and therefore a temporal modulation of these connections is required. The architecture derived this way is depicted in Fig. 1 and called Folded-in-time DNN.

The core of the Fit-DNN consists of a single neuron with multiple delayed and modulated feedbacks. The type or exact nature of the single neuron is not essential. To facilitate the presentation of the main ideas, we assume that the system state evolves in continuous time according to a differential equation of the general form

ẋ(t) = −αx(t) + f(a(t)),    (1)

where x(t) denotes the state of the neuron, and f is a nonlinear function whose argument

a(t) = J(t) + b(t) + Σ_{d=1}^{D} M_d(t) x(t − τ_d)    (2)

combines the data signal J(t), the time-varying bias b(t), and the time-delayed feedback signals M_d(t) x(t − τ_d), see Fig. 1. We explicitly consider multiple loops of different delay lengths τ_d. Due to the feedback loops, the system becomes a so-called delay dynamical system, which leads to profound implications for the complexity of its dynamics [24,25,26,27,28,29,30,31,32]. Systems of the form (1) are typical for machine learning applications with delay models [13,14,33,20]. Intuitively, the feedback loops in Fig. 1 lead to a reintroduction of information that has already passed through the nonlinearity f. This allows chaining the nonlinearity f many times. While a classical DNN composes its trainable representations by using neurons layer-by-layer, the Fit-DNN achieves the same by reintroducing a feedback signal to the same neuron repeatedly. In each pass, the time-varying bias b(t) and the modulations M_d(t) on the delay lines ensure that the time evolution of the system processes information in the desired way. To obtain the data signal J(t) and the output ŷ, we need appropriate pre- and postprocessing, respectively.

Equivalence to multi-layer neural networks
To further illustrate how the Fit-DNN is functionally equivalent to a multi-layer neural network, we present Fig. 2 showing the main conceptual steps for transforming the dynamics of a single neuron with multiple delay loops into a DNN. A sketch of the time-evolution of x(t) is presented in Fig. 2a. This evolution is divided into time-intervals of length T , each emulating a hidden layer.
In each of the intervals, we choose N points on a grid of equidistant timings with small temporal separation θ. For hidden layers with N nodes, it follows that θ = T/N. At each of these temporal grid points t_n = nθ, we treat the system state x(t_n) as an independent variable: each temporal grid point t_n represents a node, and x(t_n) its state. We furthermore assume that the data signal J(t), bias b(t), and modulation signals M_d(t) are step functions with step length θ; we refer to the Methods Sec. 4 for their precise definitions. By considering the dynamical evolution of the time-continuous system x(t) only at these discrete temporal grid points t_n (black dots in Fig. 2a), one can prove that the Fit-DNN emulates a classical DNN. To show this formally, we define the network nodes of the equivalent DNN as

x_n^ℓ := x((ℓ − 1)T + nθ),    (3)

with n = 1, ..., N determining the node's position within the layer, and ℓ = 1, ..., L determining the layer. Analogously, we define the activations a_n^ℓ of the corresponding nodes. Furthermore, we add an additional node x_{N+1}^ℓ := 1 to take the bias into account. Thus, the points from the original time-intervals of length T are now described by the vectors x^ℓ = (x_1^ℓ, ..., x_N^ℓ). Figure 2b shows the original time-trace cut into intervals of length T with nodes labeled according to their network position. The representation in Fig. 2c is a rotation of Fig. 2b with the addition of an input and an output layer. The connections are determined by the dynamical dependencies between the nodes x_n^ℓ. These dependencies can be explicitly calculated for either small or large separation θ. In the case of a large node separation θ, the relation between the network nodes x_n^ℓ is of the familiar DNN shape:

x^ℓ = f(a^ℓ),  with  a^ℓ = W^ℓ x^{ℓ−1}.    (4)

System (4) is derived in detail in the Supplementary Information. The matrix W^ℓ describes the connections from layer ℓ−1 to ℓ and corresponds to the modulated delay lines in the original single-neuron system. Each of the time-delayed feedback loops leads to a dependence of the state x(t) on the earlier state x(t − τ_d), cf. Fig. 2a. By way of construction, the length of each delay loop is fixed. Since the order of the nodes (3) is tied to the temporal position, a fixed delay line cannot connect arbitrary nodes. Rather, each delay line is equivalent to one diagonal of the coupling matrix W^ℓ. Depending on the number of delay loops D, the network possesses a different connectivity level between the layers. A fully connected Fit-DNN requires 2N − 1 modulated delay loops, i.e., our connectivity requirement scales linearly with the system size N and is entirely independent of L, promising a favorable scaling for hardware implementations. The time-dependent modulation signals M_d(t) allow us to set the feedback strengths to zero at certain times. For this work, we limit ourselves to delayed feedback connections which only link nodes of neighboring layers, but in principle this limitation could be lifted if more exotic networks were desired. For a visual representation of the connections implied by two sample delay loops, see Fig. 2b and c. The mismatch between the delay τ_d and T determines which nodes are connected by that particular delay loop: for τ_d < T (τ_d > T), the delayed feedback connects a node x_n^ℓ with another node x_i^{ℓ+1} in the subsequent layer with n > i (n < i), shown with red (yellow) arrows in Fig. 2.
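The following numpy sketch (our own illustration; the index convention follows the description above) makes the delay-to-diagonal correspondence explicit: each n_d ∈ {1, ..., 2N−1} fills the diagonal of W with offset N − n_d, and the map-limit forward pass is then an ordinary matrix-vector product (the bias column is omitted for brevity):

```python
import numpy as np

N, D = 4, 3                         # toy sizes: nodes per layer, delay loops
rng = np.random.default_rng(2)

# Delays tau_d = n_d * theta with n_d in {1, ..., 2N-1}. Node n of layer l
# receives input from node n + N - n_d of layer l-1, i.e., delay d fills
# the diagonal of W with offset N - n_d (n_d = N is the main diagonal,
# n_d < N corresponds to tau_d < T, n_d > N to tau_d > T).
n_d = rng.choice(np.arange(1, 2 * N), size=D, replace=False)

W = np.zeros((N, N))                # bias column omitted for brevity
for nd in n_d:
    offset = N - int(nd)
    W += np.diag(rng.normal(size=N - abs(offset)), k=offset)

# Map-limit forward pass through one hidden layer, cf. Eq. (4):
x_prev = rng.normal(size=N)
x_next = np.sin(W @ x_prev)         # f = sin; D = 2N - 1 would fill W fully
```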
To complete the DNN picture, the activations for the first layer are written as a^1 := g(a^in) := g(W^in u), where W^in is used in the preprocessing of J(t). A final output matrix W^out is used to derive the activations of the output layer, a^out := W^out x^L. We refer to the Methods Sec. 4.2 for a precise mathematical description.

Dynamical systems perspective: small node separation
For small node separation θ, the Fit-DNN approach goes beyond the standard DNN. Inspired by the method used in [13,34,35], we apply the variation-of-constants formula to solve the linear part of (1) and an Euler discretization to the nonlinear part, and obtain the following relations between the nodes up to first-order terms in θ:

x_n^ℓ = e^{−αθ} x_{n−1}^ℓ + θ f(a_n^ℓ),    (5)

for the layers ℓ = 1, ..., L and nodes n = 2, ..., N. Note how the first term e^{−αθ} x_{n−1}^ℓ couples each node to the preceding one within the same layer. Furthermore, the first node of each layer is connected to the last node of the preceding layer:

x_1^ℓ = e^{−αθ} x_N^{ℓ−1} + θ f(a_1^ℓ),    (6)

where x_N^0 := x_0 = x(0) is the initial state of system (1). Such a dependence reflects the fact that the network was created from a single neuron with time-continuous dynamics. With a small node separation θ, each node state retains a residual dependence on the preceding one and is not fully independent. These additional 'inertial' connections are represented by the black arrows in the network representation in Fig. 2c and are present in the case of small θ. This second case of small θ may seem like a spurious, superfluous regime that unnecessarily complicates the picture. However, in practice, a small θ directly implies fast operation, as the time the single neuron needs to emulate a layer is directly given by Nθ. We therefore expect this regime to be of interest for future hardware implementations. Additionally, while we recover a fully connected DNN using D = 2N − 1 delay loops, our simulations show that this is not a strict requirement. Adequate performance can already be obtained with a much smaller number of delay loops. In that case, the Fit-DNN implements a particular type of sparse DNN.
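A short sketch of relations (5) and (6) as reconstructed above (our own illustration with arbitrary parameters; the first-layer activations, which in the full model come from the preprocessed input, are replaced here by a generic previous state vector):

```python
import numpy as np

N, L, alpha, theta = 4, 3, 1.0, 0.5    # arbitrary toy parameters
rng = np.random.default_rng(3)
Ws = [rng.normal(size=(N, N)) / np.sqrt(N) for _ in range(L)]

decay = np.exp(-alpha * theta)         # strength of the 'inertial' coupling
x_prev = rng.normal(size=N)            # stand-in for the preceding layer
x0 = 0.0                               # initial state x(0) of the system

for ell, W in enumerate(Ws):
    a = W @ x_prev                     # delay-induced activations
    x = np.empty(N)
    # Eq. (6): the first node couples to the last node of the previous
    # layer (or to x(0) for the first hidden layer).
    last = x0 if ell == 0 else x_prev[-1]
    x[0] = decay * last + theta * np.sin(a[0])
    for n in range(1, N):              # Eq. (5): intra-layer coupling
        x[n] = decay * x[n - 1] + theta * np.sin(a[n])
    x_prev = x
```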
For a small temporal node separation θ, the Fit-DNN approach differs from the classical multilayer perceptron because it contains additional linear intra-layer connections as well as additional linear connections from the last node of one hidden layer to the first node of the next hidden layer, see Fig. 2c, black arrows. Nonetheless, the network can be trained by adjusting the input weights W^in, the output weights W^out, and the non-zero elements of the potentially sparse weight matrices W^ℓ using gradient descent. For this, we employ a modified back-propagation algorithm, described in Sec. 4.3, which takes these additional connections into consideration.

Benchmark tasks
Since under certain conditions the Fit-DNN fully recovers a standard DNN (without convolutional layers), the resulting performance is identical. This is obvious when considering system (4), since the dynamics are perfectly described by a standard multilayer perceptron. However, the Fit-DNN approach also encompasses the aforementioned cases of short temporal node distance θ and the possibility of using fewer delay loops, which translates into a sparse DNN. We report here that the system retains its computational power even in these regimes, i.e., a Fit-DNN can in principle be constructed with few and short delay loops.
The denoising task is to reconstruct the original images from their noisy versions. Figure 3 shows examples of the original Fashion-MNIST images, their noisy versions, and the reconstructed images. For the tests, we solved the delay system (1) numerically and trained the weights by gradient descent using the modified back-propagation algorithm described in Sec. 4.3. Unless noted otherwise, we operated in the small-θ regime and in general did not use a fully connected network. By the nature of the architecture, the choice of delays τ_d is not trivial. We always chose the delays as multiples of θ, i.e., τ_d = n_d θ, d = 1, ..., D. The integer n_d can range from 1 to 2N − 1 and indicates which diagonal of the weight matrix W^ℓ is accessed. After some initial tests, we settled on drawing the numbers n_d from a uniform distribution on the set {1, ..., 2N − 1} without replacement.
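In numpy, this delay choice is a one-liner (a sketch of the described sampling, not the authors' code):

```python
import numpy as np

N, D = 100, 50
rng = np.random.default_rng()
# n_d drawn uniformly from {1, ..., 2N-1} without replacement; tau_d = n_d * theta
n_d = np.sort(rng.choice(np.arange(1, 2 * N), size=D, replace=False))
```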
If not stated otherwise, we used the activation function f (a) = sin(a), but the Fit-DNN is in principle agnostic to the type of nonlinearity f that is used. The standard parameters for our numerical tests are listed in Table 1. For further details we refer to the Methods Sec. 4.4.
In Table 2, we show the Fit-DNN performance for different numbers of nodes N = 50, 100, 200, and 400 per hidden layer on the aforementioned tasks. We immediately achieve high success rates on the relatively simple MNIST and Fashion-MNIST tasks. The more challenging CIFAR-10, coarse CIFAR-100, and cropped SVHN tasks obtain lower yet still significant success rates. The confusion matrices (see Supplementary Information) also show that the system tends to confuse similar categories (e.g., 'automobile' and 'truck'). While these results clearly do not rival record state-of-the-art performances, they were achieved on a novel and radically different architecture. In particular, the Fit-DNN here used only about half of the available diagonals of the weight matrix and operated in the small-θ regime. For the tasks tested, increasing N clearly leads to increased performance. This also serves as a sanity check and proves the scalability of the concept. In particular, note that if implemented in some form of dedicated hardware, increasing the number of nodes per layer N does not increase the number of components needed, solely the time required to run the system. Also note that the denoising task was solved using only 5 delay loops. For a network of 400 nodes, this results in an extremely sparse weight matrix W^ℓ. Nonetheless, the system performs well.

Figure 4 shows the performance of the Fit-DNN for the classification tasks and the correctness of the computed gradients for different node separations θ. Since this is one of the key parameters that controls the Fit-DNN, understanding its influence is of vital interest. We also use this opportunity to illustrate the importance of considering the linear local connections when performing back-propagation to compute the weight gradients. We applied gradient checking, i.e., the comparison to a numerically computed, practically exact gradient, to determine the correctness of the obtained gradient estimates. We also trained the map-limit network (4) for comparison, corresponding to a (sparse) multilayer perceptron. In this way, we can also see how the additional intra-layer connections influence the performance for small θ.
[Fig. 4 caption, excerpt: Upper panels (a, c, e, and g) show the accuracies obtained with the back-propagation algorithm taking the local coupling into consideration (blue points) and neglecting it (red points). The accuracy obtained in the map limit case θ → ∞ is shown by the horizontal black line (this corresponds to the classical sparse multilayer perceptron). Lower panels show the cosine similarities between the numerically computed approximation of the exact gradient and the gradient obtained by back-propagation with (blue points) or without (red points) the local connections.]

Further numerical results regarding the number of hidden layers L, the number of delays D, and the role of the activation function f are presented in detail in the Supplementary Information. We find that the optimal choice of L depends on the node separation θ. Our findings suggest that for small θ, one should choose a smaller number of hidden layers than in the map limit case θ → ∞. The effect of the number of delays D depends on the task. We found that a small number of delays is sufficient for the denoising task: the mean squared error remains constant when varying D between 5 and 40. For the CIFAR-10 task, a larger number of delays is necessary to obtain optimal results. If we use the standard parameters from Table 1, we obtain the highest CIFAR-10 accuracy for D = 125 or larger. This can likely be explained by the different requirements of these tasks: while the main challenge for denoising is to filter out unwanted points, the CIFAR-10 task requires attention to detail. Thus, a higher number of delay loops potentially helps the system to learn a more precise representation of the target classes. By comparing the Fit-DNN performance for different activation functions, we also confirmed that the system performs similarly well for the sine f(a) = sin(a), the hyperbolic tangent f(a) = tanh(a), and the ReLU function f(a) = max{0, a}.
Finally, our approach encourages further cross-fertilization among different communities. While the spatio-temporal equivalence and the peculiar properties of delay systems may be known in the dynamical systems community, so far no application to DNNs had been considered. Conversely, the machine learning core idea is remarkably powerful, but usually not formulated to be compatible with continuous-time delay-dynamical systems. The Fit-DNN approach unifies these perspectives and, in doing so, provides a concept that is promising for those seeking a different angle to obtain a better understanding of DNNs or to implement their functionality in dedicated hardware.

The delay system and the signal a(t)
The delay system (1) is driven by a signal a(t) which is defined by Eq. (2) as a sum of a data signal J(t), modulated delayed feedbacks M d (t)x(t − τ d ), and a bias b(t). In the following, we describe the components in detail.
(i) The input signal. Given an input vector u = (u_1, ..., u_M)^T ∈ R^M, a matrix W^in ∈ R^{N×(M+1)} of input weights w^in_{nm}, and an input scaling function g, we define the data signal J(t) as the step function

J(t) := g(Σ_{m=1}^{M+1} w^in_{nm} u_m)  for  (n − 1)θ < t ≤ nθ,  n = 1, ..., N,

with u_{M+1} := 1 accounting for the input bias, and J(t) := 0 for t > T.

(ii) The modulation signals. The modulation functions M_d(t) are step functions with step length θ, taking the values

M_d(t) := v^ℓ_{d,n}  for  (ℓ − 1)T + (n − 1)θ < t ≤ (ℓ − 1)T + nθ.

The numbers v^ℓ_{d,n} play the role of the connection weights from layer ℓ−1 to layer ℓ. More precisely, v^ℓ_{d,n} is the weight of the connection from the (n+N−n_d)-th node of layer ℓ−1 to the n-th node of layer ℓ. Section 4.2 below explains how the modulation functions translate to the hidden weight matrices W^ℓ. In order to ensure that the delay terms connect only consecutive layers, we set v^ℓ_{d,n} = 0 whenever n_d < n or n_d > n + N − 1 holds.
(iii) The bias signal. Finally, the bias signal b(t) is defined as the step function

b(t) := b^ℓ_n  for  (ℓ − 1)T + (n − 1)θ < t ≤ (ℓ − 1)T + nθ,

where n = 1, ..., N and ℓ = 2, ..., L. For 0 ≤ t ≤ T, we set b(t) := 0 because the bias weights for the first hidden layer are already included in W^in, and thus in J(t).
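The signals (i)-(iii) are all step functions on the same θ-grid. A compact sketch of how such a signal can be evaluated (our own illustration; the value array stands in for trained weights, and the interval-boundary convention is simplified):

```python
import numpy as np

N, L, theta = 4, 3, 0.5
T = N * theta

def step_signal(values, t):
    """Evaluate a step function with step length theta at time t.
    values: array of shape (L, N) holding per-node values, e.g., the bias
    weights b_n^l or the modulation weights v_{d,n}^l of one delay loop.
    (The interval-boundary convention is simplified here.)"""
    ell = int(t // T)                  # layer index 0 .. L-1
    n = int((t - ell * T) // theta)    # node index 0 .. N-1
    return values[ell, n]

b = np.zeros((L, N))                   # b(t) := 0 on the first interval [0, T]
b[1:] = np.random.default_rng(4).normal(size=(L - 1, N))
print(step_signal(b, t=1.7))           # bias seen by the neuron at time t
```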

Network representation for small node separation θ
In this section, we provide details of the network representation of the Fit-DNN outlined in Sec. 2. The delay system (1) is considered on the time interval [0, LT]. As shown in Sec. 2, it can be considered as a multi-layer neural network with L hidden layers, represented by the solution on sub-intervals of length T. Each of the hidden layers consists of N nodes. Moreover, the network possesses an input layer with M nodes and an output layer with P nodes. The input and hidden layers are derived from system (1) by a discretization of the delay system with step length θ. The output layer is obtained by a suitable readout function on the last hidden layer.

We first construct matrices W^ℓ = (w^ℓ_{nj}) ∈ R^{N×(N+1)}, ℓ = 2, ..., L, containing the connection weights from layer ℓ−1 to layer ℓ. These matrices are set up as follows: let n'_d := n_d − N; then w^ℓ_{n,n−n'_d} := v^ℓ_{d,n} defines the elements of the matrices W^ℓ. All other matrix entries (except the last column) are defined to be zero. The last column is filled with the bias weights b^ℓ_1, ..., b^ℓ_N. More specifically,

w^ℓ_{nj} = Σ_{d=1}^{D} δ_{j, n−n'_d} v^ℓ_{d,n} + δ_{j, N+1} b^ℓ_n,

where δ_{n,j} = 1 for n = j, and zero otherwise. The structure of the matrix W^ℓ is illustrated in the Supplementary Information. Applying the variation-of-constants formula to system (1) yields, for 0 ≤ t_0 < t ≤ LT:

x(t) = e^{−α(t−t_0)} x(t_0) + ∫_{t_0}^{t} e^{−α(t−s)} f(a(s)) ds.

In particular, for t_0 = (ℓ−1)T + (n−1)θ and t = (ℓ−1)T + nθ we obtain

x^ℓ_n = e^{−αθ} x^ℓ_{n−1} + ∫_0^θ e^{α(s−θ)} f(a((ℓ−1)T + (n−1)θ + s)) ds,

where a(s) is given by (2). Note that the functions M_d(t), b(t), and J(t) are step functions which are constant on the integration interval. Approximating x(s − τ_d) by the value at the right θ-grid point directly yields the network equation (6).

Application to machine learning and a modified back-propagation algorithm
We apply the system to two different types of machine learning tasks: image classification and image denoising. For the classification tasks, the size P of the output layer equals the number of classes. We choose f^out to be the softmax function, i.e.,

f^out_p(a^out) = exp(a^out_p) / Σ_{q=1}^{P} exp(a^out_q),  p = 1, ..., P.
If the task is to denoise a greyscale image, the number of output nodes P is the number of pixels of the image. In this case, clipping at the bounds 0 and 1 is a proper choice for f^out, i.e.,

f^out_p(a^out) = min{1, max{0, a^out_p}},  p = 1, ..., P.
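Both output functions are short to implement; for completeness, a sketch using their standard definitions:

```python
import numpy as np

def softmax(a):                        # f_out for the classification tasks
    e = np.exp(a - np.max(a))          # shift for numerical stability
    return e / np.sum(e)

def clip01(a):                         # f_out for the denoising task
    return np.clip(a, 0.0, 1.0)        # clipping at the bounds 0 and 1
```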

Data augmentation, input processing and initialization
For all classification tasks, we performed an augmentation of the training input data by adding small Gaussian noise to the images and by pixel jittering, i.e., randomly shifting the images by at most one pixel horizontally, vertically, or diagonally. For the CIFAR-10/100 tasks, we also applied a random rotation of at most ±15° and a random horizontal flip with probability 0.5 to the training input images. Further, we used dropout [55] with a dropout rate of 1% for the CIFAR-10/100 tasks. For the denoising task, we performed no data augmentation. Moreover, for the five classification tasks, we used the input preprocessing function g(a) = tanh(a). For the denoising task, we applied no nonlinear input preprocessing, i.e., g(a) = a. The weights were always initialized by Xavier initialization [56]. In all cases, we used 100 training epochs.
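A sketch of the described augmentation pipeline (our own reimplementation of the listed steps; the noise level and the use of scipy's rotate are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(5)

def augment(img, cifar=False):
    """img: 2D greyscale array in [0, 1]; applies the steps listed above."""
    out = img + rng.normal(0.0, 0.05, size=img.shape)  # small Gaussian noise
    dy, dx = rng.integers(-1, 2, size=2)               # pixel jittering
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))
    if cifar:                                          # CIFAR-10/100 only
        out = rotate(out, angle=rng.uniform(-15, 15),
                     reshape=False, mode="nearest")    # rotation of <= 15 deg
        if rng.random() < 0.5:
            out = out[:, ::-1]                         # horizontal flip
    return out
```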

Code availability
The source code to reproduce the results of this study is freely available on GitHub: https://github.com/flori-stelzer/deep-learning-delay-system.

Table 1 shows how the number of hidden layers L affects the performance of the Fit-DNN. We investigated two cases: the map limit θ → ∞ and the case θ = 0.5. If the system operates in the map limit, we observe that the optimal number of hidden layers is 2 or 3, depending on the task. If θ = 0.5, the performance of the Fit-DNN drops significantly for the CIFAR-10 [1], the coarse CIFAR-100 [1], the cropped SVHN [2], and the denoising tasks. For this reason, deeper networks do not offer an advantage for solving these tasks if θ = 0.5. The MNIST [3] and Fashion-MNIST [4] accuracies do not suffer much from choosing a small node separation θ; here the system's performance remains almost unchanged in comparison to the map limit.

Figure 1 shows the effect of the choice of the number of delays D on the performance of the Fit-DNN. A larger number of delays D yields a slightly better accuracy for the CIFAR-10 task. We obtain an accuracy of less than 51% for D = 25, and an accuracy between 52% and 53% for D = 125 or larger. For the denoising task, we already obtain a good mean squared error (MSE) for a small number of delays D. The MSE remains mostly between 0.0253 and 0.0258 independently of D; the fluctuations of the MSE are small.
We compared two methods for choosing the delays τ_d = n_d θ. The first method is to draw the numbers n_d without replacement from a uniform distribution on the set {1, ..., 2N − 1}. The second method is to choose equidistant delays, with n_{d+1} − n_d = (2N − 1)/D. For the CIFAR-10 task, one may observe a slight advantage of the equidistant delays, whereas for the denoising task, randomly chosen delays yield slightly better results. In both cases, however, the influence of the chosen method on the quality of the results is small and seems to be insignificant.

Table 2 compares the performance of the Fit-DNN for the activation functions f(a) = sin(a), f(a) = tanh(a), and f(a) = max(0, a) (ReLU). The results show that the Fit-DNN works well with various activation functions.

Figure 2 shows the confusion matrices for the cropped SVHN and the CIFAR-10 tasks. These matrices show how often images of a given dataset class are either recognized correctly or mismatched with another class. Confusion matrices are a suitable tool to identify which classes are confused more or less often. The confusion matrix for the cropped SVHN task shows, e.g., that the number 3 is relatively often falsely recognized as 5 or 9, but almost never as 4 or 6. The confusion matrix for the CIFAR-10 task indicates that images from animal classes (bird, cat, deer, dog, frog, horse) are often mismatched with another animal class, but rarely with a transportation class (airplane, automobile, ship, truck). This is an expected result for the CIFAR-10 task.

Figure 3 shows results for a sine function fitting task. The objective of the task is to fit functions y_i(u), i = 1, ..., 5, u ∈ [−1, 1], plotted in Fig. 4, which are defined as concatenations y_i(u) = s_i ∘ ... ∘ s_1(u) of sine functions s_i(u) = sin(ω_i u + φ_i). The simulations were performed with N = 20 nodes per hidden layer, D = 3, and τ_1 = 15, τ_2 = 20, τ_3 = 25. Since the task is to fit a concatenation of i sine functions and the Fit-DNN consists in this case of L concatenated sine functions, one would expect optimal results for L ≥ i. In our tests, this was true for up to i = 3 concatenated functions. The function y_1 can be approximated by the Fit-DNN's output with a small MSE with any number of layers, see Fig. 3. The function y_2 can be fitted with a small error if and only if L ≥ 2 (with a few exceptions). For the function y_3, we obtain relatively exact approximations with 2 or more hidden layers, but the smallest MSE is obtained with L = 3 in most cases. The Fit-DNN fails to fit the functions y_4 and y_5 for all L.

Table 1: Accuracies [%] for the classification tasks and mean squared error for the denoising task for different numbers of hidden layers L. For a node separation of θ = 0.5, two hidden layers seem to be optimal for the classification tasks (except CIFAR-10/100), and one hidden layer is sufficient for the denoising task. When the system operates in the map limit θ → ∞, additional hidden layers can improve the performance.

Generating delay system
The Fit-DNN has M input nodes, P output nodes, and L hidden layers, each consisting of N nodes. The hidden layers are described by the delay system

ẋ(t) = −αx(t) + f(a(t)),    (3)

where α > 0 is a constant time scale, f is a nonlinear activation function, and the argument a(t) is a signal composed of a data signal J(t), a bias signal b(t), and delayed feedback terms modulated by functions M_d(t):

a(t) = J(t) + b(t) + Σ_{d=1}^{D} M_d(t) x(t − τ_d).    (4)

The components of a(t) are described in the Methods Section. The delays are given by τ_d = n_d θ, where θ := T/N and 1 ≤ n_1 < ... < n_D ≤ 2N − 1 are natural numbers. The state of the ℓ-th hidden layer is given by the solution x(t) of (3)-(4) on the interval (ℓ−1)T < t ≤ ℓT. We define the node states of the hidden layers as x^ℓ_n := x((ℓ−1)T + nθ) for node n = 1, ..., N of layer ℓ = 1, ..., L. The nodes of the hidden layers are connected by the delays τ_d, as illustrated in Fig. 5. To ensure that only nodes of consecutive hidden layers are connected, we set

v^ℓ_{d,n} := 0 whenever n_d < n or n_d > n + N − 1.    (6)

The delay connections that are set to zero by condition (6) are indicated by dashed arrows marked with a black × symbol in Fig. 5. Additionally, we set M_d(t) = 0 for t ∈ [0, T]. In combination with condition (6), this implies that the system has no incoming delay connections from a time t − τ_d before zero. For this reason, a history function [5,6,7,8] is not required to solve the delay system (3). The application of the variation-of-constants formula gives, for 0 ≤ t_0 < t ≤ LT, the equation

x(t) = e^{−α(t−t_0)} x(t_0) + ∫_{t_0}^{t} e^{−α(t−s)} f(a(s)) ds.

Using this equation on appropriate time intervals [(n−1)θ, nθ], we obtain the following relations for the nodes of the first hidden layer:

x^1_n = e^{−αθ} x^1_{n−1} + ∫_0^θ e^{α(s−θ)} f(a((n−1)θ + s)) ds,  n = 2, ..., N.
The output of the Fit-DNN is obtained as ŷ := f^out(a^out) with the output activations

a^out_p := Σ_{n=1}^{N+1} w^out_{pn} x^L_n,  p = 1, ..., P,

where f^out is an output activation function suited to the given task. Moreover, u_{M+1} := 1 and x^ℓ_{N+1} := 1, for ℓ = 1, ..., L, account for the bias weights. We call a^ℓ_n and a^out_p the activations of the corresponding nodes. For n = 1, ..., N, the variable w^out_{pn} denotes the output weight connecting the n-th node of layer L to the p-th output node, and w^out_{p,(N+1)} denotes the bias for the p-th output node (in other words, the weight connecting the on-neuron x^L_{N+1} of layer L to the p-th output node).

Time signals of the Fit-DNN
The relevant time signals of the Fit-DNN are:
• the state of the system x(t),
• the signal a(t),
• the signal a(t) decomposed into the positive and negative parts of its components (i.e., the data signal, the modulated feedback signals, and the bias signal), indicated by their corresponding colors,
• the data signal J(t),
• the delayed feedback signals x(t − τ_d) (grey),
• the trained modulation functions M_d(t) (colored),
• and the bias b(t).

[Figure caption, excerpt: Panel (b) shows the internal processes of the hidden layers: the state variable of the system x(t) and the signal a(t), which consists of the data signal J(t), the delayed feedback signals x(t − τ_d) multiplied by the modulation functions M_d(t), and the bias signal b(t). Panel (c) illustrates the output layer.]