On Improving The Computing Capacity of Dynamical Systems

Reservoir Computing has emerged as a practical approach for solving temporal pattern recognition problems. The procedure of preparing the system for pattern recognition is simple, provided that the dynamical system (reservoir) used for computation is complex enough. However, to achieve a sufficient reservoir complexity, one has to use many interacting elements. We propose a novel method to reduce the number of reservoir elements without reducing the computing capacity of the device. It is shown that if an auxiliary input channel can be engineered, the drive, advantageous correlations between the signal one wishes to analyse and the state of the reservoir can emerge, increasing the intelligence of the system. The method has been illustrated on the problem of electrocardiogram (ECG) signal classification. By using a reservoir with only one element, and an optimised drive, more than 93% of the signals have been correctly labelled.


on improving the computing capacity of Dynamical Systems Vasileios Athanasiou ✉ & Zoran Konkoli
Reservoir computing has emerged as a practical approach for solving temporal pattern recognition problems. the procedure of preparing the system for pattern recognition is simple, provided that the dynamical system (reservoir) used for computation is complex enough. However, to achieve a sufficient reservoir complexity, one has to use many interacting elements. We propose a novel method to reduce the number of reservoir elements without reducing the computing capacity of the device. it is shown that if an auxiliary input channel can be engineered, the drive, advantageous correlations between the signal one wishes to analyse and the state of the reservoir can emerge, increasing the intelligence of the system. the method has been illustrated on the problem of electrocardiogram (ecG) signal classification. By using a reservoir with only one element, and an optimised drive, more than 93% of the signals have been correctly labelled.
Reservoir Computing (RC) has been successfully used for solving plethora of temporal information processing problems, such as speech recognition or time series prediction and classification [1][2][3] . In Fig. 1a, we show the classical RC scheme in the context of signal classification. The internal state of the reservoir is defined as r(t) = (R 1 (t), R 2 (t), …, R N (t)) where R i (t) describes the state of an i-th element at a time instance t. For example, if the reservoir is built from neurons then R i (t) denotes the activation state of a neuron. Likewise, if the reservoir is a memristor network, then R i (t) represents a memristance value. Plethora of other elements have been suggested to build reservoirs (cf. 4 and references therein). The set of all possible values of r constitutes the configuration space of the reservoir Ω.
The process of computation is realised by "feeding" an input signal q(t) to a dynamical system that represents the reservoir 5 . The external input q(t) "pushes" the state of the reservoir into a configuration r(t) according to a dynamic law r t H r t q t ( ) ( ( ), ( ))  = , where the dot over a symbol denotes the time derivative, e.g.  ≡ r dr dt / . The configuration r(t) is the result of the computation performed by the reservoir computer. Note that the state of the system at a particular time instance t depends on the full history of the input signal up to that time point. Thus r(t) depends on all q(t′) with t′ < t. To emphasise this we write r t q t ( ) [ ]( ) =  . Essentially, the reservoir works as a filter: it converts an uncountable series of values q into another series r q [ ] =  . One can think of time as an index that can be used to access a particular value in such a series. In the following, when discussing specifically a one-element reservoir, to simplify the notation we write r(t) = R 1 (t) instead of r(t) = R 1 (t).
The state r(t) should also depend on the initial condition of the system i, and one ought to assume r t q r t ( ) [ , ]( )  = . However, if the system exhibits the fading memory property then this dependence on the initial condition should be very weak: .
A few examples of dynamical systems that illustrate the fading memory concept can be found in literature 6 .
The reservoir is equipped with a readout layer ψ that assigns a label y to the reservoir state at every time instance ψ = t y t r t : ( ) ( ( )). The readout layer should be a simple structure, and it should not contribute substantially to the computation 7 ; it is used to identify the region that the state has been driven to. For example, to realise a binary classification, one would use y t r t ( ) ( ( )) where ψ is a sigmoid-like function, and  w w w , , , N 0 1 are free parameters. The main claim is that under suitable reservoir conditions, one can realise any computation by using a fixed reservoir and by simply adjusting the readout layer achieve any desired functionality 8 . In the context of the binary classification example introduced above, this implies that any binary classification can be achieved by only adjusting the free parameters w w w , , , N of the reservoir 9 . In practice, to achieve this, one needs reservoirs with a large number of elements that feature recursive feedback mechanisms (intrinsic or externally engineered), both for hardware 10 or software 3 implementations. This intuitive understanding has been corroborated with rigorous mathematical arguments.
The most important formal criterion that makes the RC approach possible is that the dynamical system being used as the reservoir has to separate inputs 11,12 . The rigorous mathematical formulation is rather involved, and somewhat counterintuitive. In brief, assuming a fixed time reference t, a reservoir separates inputs if for any pair of distinct input signals q and q' there is a readout layer ψ ′ q q , such that respective outputs y t r t ( ) A pedagogical discussion on the mathematical background to the concept can be found in literature 13 . An intuitive formulation of the separability requirement is as follows. Assuming that the goal is to classify signals into a fixed number of k classes c 1 , c 2 , …, c k , the reservoir has to exhibit the following behaviour. If the input belongs to a class c 0 , then, the state of the reservoir r should be driven towards a region Ω 0 of the configuration space; for an input of class c 1 the state should be driven towards a region Ω 1 and so on. The key requirement is that the regions Ω 0 , Ω 1 , …, Ω k should not overlap and they ought to be compact so that the decision boundaries are simple [14][15][16] . Input separation may not be possible if the reservoir were not complex enough. In machine learning, prior to the development of the deep learning paradigm, a specific practice for separating inputs has been considered. The practice features the use of additional input channels to achieve the desired separation, and is normally referred to as feature engineering [17][18][19] . In deep learning, feature engineering occurs spontaneously while training deep neural networks. Note that high quality features are found by training deep neural networks' internal structure, which implies modification of the dynamical system used for computation. However, in RC, one is not allowed to adjust reservoir's internal structure for training purposes. By assumption, one cannot modify the reservoir to increase the performance. This is particularly true in the context of the physical reservoir computing 7,13 where, by assumption, one is interested in building computers from dynamical systems that cannot be easily modified. In this work, we suggest that it is possible to work around this problem. The classical RC scheme. A dynamical system (reservoir) responds to the input q(t). The internal state of the reservoir r(t) is processed by a simple readout layer. The only part of the system being trained is the readout layer. (b) Same as for panel (a), but an auxiliary drive signal u(t) is provided that interacts with the reservoir. (c) Same as for panel (b) with an additional feedback mechanism u R (t) = h(r(t)) being included. The drive signal, the feedback mechanism and the weights are trained.
Instead of modifying the dynamical system one can try to adjust the input that is being fed into the system. In particular, we demonstrate a generic method of building powerful reservoir computers by using an additional input feature, a drive signal. If providing an additional input feature contributes to input separation, then, one does not need to search for more complex dynamical systems with a larger number of interconnected reservoir elements. Instead, one can perform RC with the minimum amount of reservoir elements. The purpose of using an additional input feature is to achieve optimal input separation.
The key idea explored in this study is that a drive signal can be used to achieve advantageous correlations between input and reservoir's state, increasing the performance of RC. If so, then a "recognizable" correlation between r and q is built in the sense that r can be used to infer about q without a substantial engineering overhead. If this can be achieved, then the drive and the reservoir can operate in synergy, so that the "intelligence" that should reside in the reservoir, is transferred into the drive instead, which could allow for reservoirs with smaller size 20 . In this context, we use the term "intelligence" as a measure of the computational resources needed to execute an algorithm, being often referred to as "logical depth" 21 . An implicit assumption is being made that a higher logical depth implies larger engineering overhead necessary to implement the algorithm on a computing machine. Therefore, the transfer of intelligence from the reservoir to the drive means that the need for reservoir's computational resources can be reduced by using a drive signal.
The method works for dynamical systems that can be driven by an external signal. By assumption, the system responds to two signals, q(t) and u(t), where q is the signal one wishes to classify, and u features as an auxiliary additional input channel (feature). Using the notation introduced earlier, one can write =  r t q u t ( ) [ , ]( ). In Fig. 1, we show two implementations of the ideas discussed above in the context of a signal classification problem. Implementation 1 (Fig. 1b): In this case, the reservoir is stimulated by a drive signal u and an input signal q. The dynamics of the reservoir is defined by r t H r t q t u t ( ) ( ( ), ( ), ( ))  = . Implementation 2 (Fig. 1c): The complexity of the system can be increased by adding a feedback, Note that H is considered fixed, and only once H is given, one chooses the appropriate implementation. This implies that those two implementations lead to two different (non-equivalent) pattern recognition algorithms, where implementation 1 is a special case of 2. For example, it is impossible to find a drive for implementation 1 that would mimic the behavior of implementation 2 if h(r(t)) ≠ 0. To do this, one would have to construct a drive that behaves as a filter, since the state of the reservoir r behaves in such a way. In particular, one would have to find a drive that has an intimate knowledge of the input signal q, but this is impossible as the input signal is not known a priori.
To find an optimal drive signal u, we suggest a generic training procedure that can be applied at any supervised learning problem or any reservoir. The system is optimised on a set of training data until the result of the computation agrees with the desired functionality. The ability of the system to generalise is examined by using a separate set of test data.
The training procedure consists of two phases. In the first phase, the drive signal and the feedback (if used) are optimised to achieve maximum input separation, without considering the readout layer. If the reservoir separates inputs, then, a simple readout layer (e.g. linear classifier) could be trained to successfully classify inputs by reading the reservoir's state 11 . In the second phase, only the readout layer is optimised, by keeping the drive signal and the feedback function found from the first phase. The advantage of considering the two phases of training is that the readout layer can be optimised offline. The states of the reservoir are generated under the optimum drive and feedback (if used). Then, the readout layer is trained with supervised learning to infer the correct class by reading the reservoir states.
Phase 1: To measure input separation we use the separability index which is introduced and described in detail in the methods section. For a single-element network and a binary classification problem with classes of input signals c 0 and c 1 the separability index is defined as q c q c 0 1 , where q c q c , 0 0 1 1 〈... 〉 ∈ ∈ denotes the geometric mean over different input signals q 0 and q 1 that characterise each input class c 0 and c 1 respectively. The variable d[u, q 0 , q 1 ] denotes a typical distance between trajectories that are generated when the system is exposed to each pair of input signals q 0 and q 1 under a drive u.
Trajectories are analysed over an extended time interval [0, T] where T denotes the observation time. A typical distance is computed as T 0 note that the average over time is computed first, followed by the computation of the difference between the average memristance values. In methods, it is explained why it would be wrong to perform these operations in the reverse order. The degree of trajectory separability can be controlled by the drive signal u. Thus, in the first training phase, the goal is to find the optimal drive signal u * = maxarg u v [u] and the feedback function h * (if used) that achieves the maximal trajectory separability. We use a genetic algorithm optimisation, which is described with details in methods, with the separability index v[u] as the fitness function.
www.nature.com/scientificreports www.nature.com/scientificreports/ Phase 2: The readout layer implementation used differs somewhat from the central RC dogma. The purpose of the first phase of training is to maximise the separation of inputs over the whole time interval T, which features the computation of average memristance values R . Therefore, instead of the classical readout layer, the inference unit performs the assessment at the time instance T: y T r When compared to the classical RC readout layer, the computational cost of implementing such a unit is marginal. The idea to construct the readout layer by using such averages have been verified independently in separate studies 22 .
For technical reasons we assume ψ = r r ( ) w w¯. This allows the use of the least square method. In particular, for a binary classification problem, the output should be close to +1 (−1) when a signal from class c 0 (c 1 ) is applied, and we simply minimise the prediction error by varying w 0 , w 1 , …, w N . The decision by the inference unit is considered as the simplest possible: if y(T) > 0 then infer class c 0 otherwise infer class c 1 .
The dynamical system used consists of a single memristor and an optional feedback unit. A memristor is a non-linear, passive, two-terminal component with a time-varying resistance often being referred to as the memristance R(t). The memristor element is suitable for temporal information processing since it exhibits the filter property: The memristance value at a specific time instance depends on the whole history of the applied voltage signal up to that time. Few applications of memristor networks for temporal information processing purposes can be found 2,23 .
To illustrate the method, a simple Pershin Di Ventra model 24 is used. The memristance R(t) changes depending on the voltage signal ΔV that is applied across the element according to a simple law. If −V thr < ΔV < V thr , then the memristance changes as  ; α, β are device dependent parameters, const is a constant value and usually α ≪ β. The memristance is bounded between the lowest value R min > 0 and the maximum value R max . This can be written as: is zero unless the condition in the argument is satisfied, and likewise for θ(R > R min ). Figure 2 shows how  R depends on R and ΔV. The dependence of f on the applied voltage ΔV (R min < R < R max ) is shown in panel a). The flowlines of the ordinary differential equation system are illustrated in Fig. 3. When ΔV > 0, then the memristance increases until R = R max is reached. When ΔV < 0, the memristance decreases until R = R min .
In this study, the parameters α, V thr , R min and R max are always kept fixed. More realistic memristor models are available. For example, real devices can exhibit asymmetry in the memristance change ( and non-linear rate of memristance changes 25 . However, the model used here is known to reproduce most of the experimental results. The drive signal u(t) is expressed as a Fourier series with nine amplitudes and a base frequency (details provided in the methods section). The feedback function is defined as u R (t) = h(R(t)) = p 1 R(t) + p 0 . The ten parameters of the drive signal and the parameters p 1 , p 0 are optimised in the first phase of the training procedure. www.nature.com/scientificreports www.nature.com/scientificreports/ Two options of how the input signal q(t) interacts with the reservoir are considered. Option 1: It is natural to assume that the input signal influences the β parameter of the memristor model making it time dependent: β q (t) = m 1 q(t) + m 0 . Such a model has been explored before 23 . Option 2: Another option is to assume that the input signal acts as an external voltage source 2 u q (t) with u q (t) = k 1 q(t) + k 0 and β kept fixed at the value β c . These two options (Options 1 and 2) along with the two implementations (Implementations 1 and 2) result in the four models I1O1, I2O1, I1O2 and I2O2. The dynamical equations for those models are summarised in Table 1.
We compare the performance of these four models with a set of simpler models in which neither a drive signal nor a feedback function are optimised. In such a way one can test the performance of a reservoir when no additional input features are provided. This reveals a "raw" intelligence of a reservoir. Then, the increase in the additional computational power provided by an additional input feature can be analyzed. The dynamical equations of the models I0O1 and I0O2 are given in Table 1. The model I0O1 results from option 1 when neither a drive signal nor a feedback function is optimised. In this case, the reservoir is always driven by the same drive signal u c and there is no feedback u R (t) = 0. The model I0O2 results from option 2 without providing any drive signal nor a feedback: u(t) = 0 and u R (t) = 0.

Results
As a source of data, the labelled data set of ECG signals "ECG5000" from the UEA and UCR Time Series Classification Repository 26 is used. Usually, in the literature, ECG signals are classified into many classes 27 . However, since the reservoir consists of only one memristor element, a simple problem of binary signal classification is investigated. Including more classes would require larger memristor networks. Thus only two classes are considered, class c 0 (healthy heart) and class c 1 (heart problems). The strategy of using the single memristor for ECG signal classification is simple to describe: the memristance ought to be driven towards R min or R max depending on whether the input signal belongs to c 0 or c 1 respectively. If this can be achieved, classification can be performed by simply checking whether the memristance value exceeds a pre-defined threshold. The same type of thinking applies to more complex classification problems with many classes. For example, with a two-memristor network with r = (R 1 , R 2 ) a four-class problem could be solved. The state could be driven towards four different regions (R min , R min ), (R min , R max ), (R max , R min ), and (R max , R max ).
In the training procedure, 40 signals with the label c 0 , and 40 with the label c 1 have been used. The training procedure for each model (I1O1, I2O1, I1O2 and I2O2) resulted in an optimum drive signal: u 11 for I1O1, u 21 for I2O1, u 12 for I1O2 and u 22 for I2O2. The two optimal feedback functions were found too: h 21 for I2O1 and h 22 for I2O2.
The ability of the trained models to generalise has been validated on a separate set of test data: 740 signals with the label c 0 , and 740 signals with the label c 1 . The trained models were exposed to those signals, and the percentage of correct label assignments was counted. The quality of recognition is described in terms of the success rate 0 ≤ S ≤ 1 being defined by the percentage of correct signal classifications. Thus S ≈ 1 indicates a good classifier.
In this work, the set of test data has been considered to be much larger than the set of training data (740 ≫ 40). The reason is to test the performance of the proposed models in the worst case when few training examples are available. It has been shown that RC performs relatively well with a few training examples because RC can extract general features from the training data set 28,29 . Choosing data sets and their size can be crucial in machine learning for improving accuracy of models. One could try different sizes of data sets and the performance of models Dynamics www.nature.com/scientificreports www.nature.com/scientificreports/ would also depend on the choice of data sets 30 . Investigating the right choice of data sets and their size falls out of the scope of this article.
An ECG signal has a natural time reference, the top of the QRS peak (regardless of the class). Thus in the database all signals are aligned so that their QRS peaks coincide. To investigate whether the models can be used for classifying non-synchronised ECG signals, we intentionally modified the phase of each signal in the training and the test data sets by random shifts in time. Thus we distinguish two major groups of data sets: aligned versus asynchronous.
In the simulations, since ECG signals are quasi-periodic, the observation time was taken to be ten periods, T = 10 periods. The reservoir has the chance to "absorb" in its state extended information about the input over a longer time interval. In such a way, small features of the input signal that might be ignored by visual inspection, can accumulate over time and be detected. Observing the signal over several periods gives the reservoir necessary time to distinguish important trends from insignificant fluctuations.
Since the memristance value is bounded, it is useful to normalise the separability index v in a range of possible values between 0 and 1. When v = 1 this signals a maximum input separation. Any two input signals from the two different classes drive the state (on average) to its opposite extreme bounds R min or R max . The value v = 0 indicates that there is no correlation whatsoever between the states' values and the input signal q. For example, it is even possible that the state is driven on average to the same region of values for both classes of inputs. frozen input layer. The device parameters m 1 , m 0 , k 1 , and k 0 used in simulations have been kept fixed. The values of m 1 and m 0 were chosen so that β(t) has a mean value close to β c and is always positive. The values of k 1 and k 0 were chosen so that the voltage signal u q (t) is bounded in the [−10, 10] interval. The drive signal u c was chosen as a square wave pulse with +1 the first half and −1 the second half of each period. The values for all parameters used in the simulations are provided in the methods section.
In Fig. 4, the phase space separation is shown as it was achieved with the optimum models I1O1 and I2O1 under two input signals from the different classes. Those input signals are shown in panel a) covering two periods in time, to be referred to as q 0 (t) and q 1 (t) for class c 0 and c 1 respectively. In b), only one curve is shown, the optimised drive signal u 11 (t). This drive signal governs the applied voltage signal across the memristor and is independent of the input signal. In d), the memristance trajectories r 0 (t) = H[r 0 , q 0 , u 11 ](t) and r 1 (t) = H[r 1 , q 1 , u 11 ](t) are shown for the model I1O1.
The phase space separation is evident: the trajectories r 0 (t) and r 1 (t) are driven towards two different regions of the state space: r 1 (t) ≈ R max and r 0 (t) ≈ R min for t → ∞. It is important to notice that the memristor model is exposed to the same drive u 11 for both inputs q 0 and q 1 ; the machine can never "know" which input signal it will have to analyse. Yet, the drive posses an intrinsic "intelligence" that works in synergy with the memristor element, to achieve the phase space separation. The drive signal u 11 has been found so that when q = q 0 (q = q 1 ) the memristance is driven to a region close to R min (R max ) for t → ∞. Note here that u 11 (panel (b)) has both positive and negative values. If u 11 was only positive (negative) then R(t) ≈ R max (R(t) ≈ R min ) and R(t) would not be separable. Therefore, u 11 has been found with the right amount and synchronisation of positive and negative values to achieve input separation. www.nature.com/scientificreports www.nature.com/scientificreports/ In panel c), the voltage signal across the memristor model is shown when the model I2O1 was simulated with u 21 (t), h 21 and the input signals shown in a). Because of the feedback, the voltage across the memristor u 21 (t) + u R (t) is different for the two different applied inputs. Again, the phase space separation is evident, as shown in panel e). In fact, one can pick any pair of signals from classes c 0 and c 1 and, provided the same optimised u 11 , u 21 , and h 21 are used, the separation property will likely hold. Thus one can think of u 11 , u 21 , and h 21 as the device configuration "parameters".
The fact that input separation is evident in panels d) and e) shows why the two training phases suggested in this work operate in synergy. The task of genetic algorithm optimisation was to find a drive signal so that phase space separation occurs for most of the signals in the training data set. If so, then, one can simply build a linear classifier for inferring the class of the input signal. For example, by observing those two panels, one could build a linear classifier with a bare eye.
Additionally, it is of interest to notice that the response of the device was faster for the model I2O1 than the model I1O1. For example, in e) one can see that phase space separation occurred already at the first period of the input signal. However, in d) more periods were needed for the memristance to be driven towards its bounds. This is owed to the usage of a feedback function. An additional usage of a feedback function resulted in faster responses.
The values of the separability index v and the success rates of the optimised circuits are summarised in Table 2. The left panel summarises the results of the analysis for synchronised data set signals. The right panel is for asynchronous signals.
Aligned signals. For the aligned signals in Table 2, a larger separability index v was obtained for the models with a feedback mechanism (I2O1 and I2O2) than the ones without feedback (I1O1 and I1O2). The lowest separability indexes were obtained for the models I0O1 and I0O2 where neither a drive nor a feedback were optimised. To judge whether a larger separability index means a better training process, one should test the optimised models with a test data set. In the left panel, all the success rates with optimised input features (I1O1, I2O1, I1O2 and I2O2) were found S > 93.2% indicating that these or similar models can be used for classifying ECG signals.
It is important to notice here that the success rates were found smaller for the models with optimised feedback than the models with just optimised drive signal. This happened due to overfitting, a problem which usually occurs in machine learning: The extra parameter to be optimised (feedback) was so powerful that the models fit too much on details of the training data set that were not generally features of other input signals of the test data set.
The success rates for models with just a drive signal (I1O1 and I1O2) were found larger than the models without any drive signal (I0O1 and I002). In particular, by training a drive signal, the success rates improved from 87.2% to 97.9% and from 51.9% to 97.8%. Therefore, training a drive signal resulted to exceptional success rates which could not be achieved by just using a memristor element.
Additionally, for the model I1O2, even though the training process resulted in a very low v = 0.059, the success rate was large (97.8%). This happened because it was still possible to construct a readout layer which classifies most of the input signals correctly. By inspecting closer the state R(t) over time (not shown), we saw that although the instantaneous state R(t) was driven to overlapping regions, the average value of the state was separable. This is one of the advantages of using a readout layer which averages the state over time. For example, if the memristance values were driven on average to the numerical values 3.10 and 3.00 for inputs of class c 0 and c 1 respectively, then the separability index would be small but it would still be possible to construct a readout layer: if the average memristance is larger than 3.05 then infer class c 0 otherwise infer class c 1 .
These results indicate that a single memristor model equipped with an optimised voltage driving signal can be used for ECG signal classification. Note that the drive needs to be trained only once during the first phase of training, on a known set of signals, but the device can be used on unknown signals. However, we reiterate that the downloaded data set consists of signals with the same phase, (see Fig. 4a)). In practice, for online classification, one would need an online QRS detector 31 . QRS detectors are used to recognise the phase of ECG signals by identifying the part of the ECG signal which changes fastest in time. Moreover, the task of signal classification would be harder if there was no information regarding the phase of the signals, for example for embedded computation where every piece of extra equipment is a problem, one might want to remove a QRS detector unit. Can the system still classify with such a high success rate if there is no fixed time reference?
Asynchronous signals. The separability indexes v and the success rates in the right panel in Table 2 for non-aligned signals were found smaller than their aligned counterparts listed in the left panel. This is expected www.nature.com/scientificreports www.nature.com/scientificreports/ because now the models should be smarter and infer correctly independently of the input signal phase. The models without a feedback mechanism (I1O1 and I1O2) performed worse with success rates 55.4% and 45.0% respectively. Including and optimising a feedback mechanism (models I2O1 and I2O2) improved both the training procedure (index v) and the testing one (success rate S). The separability indexes from I1O1 to I2O1 and I1O2 to I2O2 improved from 0.204 to 0.729 and from 0.027 to 0.441 respectively. Additionally the success rates improved from 55.4% to 87.9% and from 45.0% to 79.0%. Therefore, here, contrary to the results of using the initial data set (in the left panel in Table 2) there was no over-fitting: The extra parameter to be optimised (feedback) was so powerful that the models fit better on some features of the training data set which are also features of the test data set. optimised input layer. The classification performance can be improved if there is an option to adjust the input layer parameters m 1 , m 0 , k 1 , or k 0 . While this is clearly possible for in silico implementations, it might be possible for hardware implementations too. The values of the separability index v and the success rate when the input layers were additionally optimised are shown in Table 3 for both aligned (left panel) and non-aligned signals (right panel).
By comparing the results in Table 2 to the ones in Table 3, the following key results were found regarding the additional optimisation of a simple input layer: • The largest success rates across all models is S = 98.6% for the aligned signal data set ( Table 3, the second row), and S = 93.1% for the non-aligned signal data set ( Table 3, the sixth row). • Larger success rates were achieved for the non-aligned signals when the input layer was optimised (S > 90% in Table 3 versus S < 90% in Table 2). • In contrast to the above, for the aligned data set, the input layer optimisation does not improve the success rate uniformly across all models (cf. the S values in the left panels of Tables 2 and 3). This could be an instance of overfitting: the input layer optimisation works well on the training data but does not generalise (towards the test data). • A single memristor model without optimising a drive signal (models I0O1 and I0O2) cannot be used to classify the ECG signals since S < 90% for both frozen and optimised input layers. The only way to achieve S > 90% was to provide an additional input feature such as a drive signal or a drive signal with a feedback function. For example, in Table 3 right panel, models I1O2 and I2O2 performed better than I0O2 and model I2O1 performed better than I0O1.
An interesting finding concerns the question whether the addition of a feedback function improves v and S. It depends on how hard is the problem. For easier classification problem, when the signals are aligned, the feedback does not always improve the accuracy of the prediction. However, for a harder problem, when the signals are not aligned, the addition of the feedback always improves both v and S. For example, I2O1 did not perform better than I1O1 when the signals were aligned (see the left panel in Table 3). Even though I2O1 is an extension of I1O1, neither a larger index v nor a larger S was found for I2O1. This has probably happened because of convergence problems during the optimisation. It was hard to optimise both the input layer and the drive signal. Probably, better optimisation algorithms should be tested in future, such as optimising one layer at a time, e.g. optimising the input layer and keeping fixed the feedback and the drive signal etc.

Discussion
A novel classification method has been suggested that can be used to increase the intelligence of pattern recognition devices. The method requires modest resources to implement. The approach has been illustrated in the reservoir computing context, but the method could be easily used in other pattern recognition setups.
A typical pattern recognition device is a dynamical system that accepts an input signal and informs about which class the signal belongs to by producing a label, being the output of the computation. The key idea is to equip a machine learning system (the reservoir) with an auxiliary input channel, the drive signal, that is external to the system and easy to control. This signal can be optimised so that the information about the signal one wishes to analyse (the input signal) can be efficiently embedded into the reservoir state. Then, the reservoir state can be analysed to infer the information about the input.
An intuitive way of understanding the concept is to envision a theatre performance. A prompter in a theatre helps an actor that forgot a line. The prompter is familiar with the intrinsics of the play, and has an overall understanding of the right timing of the narrative. In a similar way, the optimised drive provides clues that accumulate over time and increase the overall intelligence of the system.  www.nature.com/scientificreports www.nature.com/scientificreports/ The method can be exploited in two ways. First, in the bottom-up approaches, when the reservoir is engineered from scratch, the intelligence that normally resides in the reservoir can be moved to the drive. This would allow engineering reservoirs with smaller size, without reducing the computing capacity of the device. Second, the intelligence of an existing system could be increased, especially for reservoirs that built in a top-down manner, that are not engineered from scratch or once engineered are not meant to be modified. The respective examples might include an amorphous structure (e.g. to be used for in materio computation), or an echo state network (essentially a random neural network with feedback links).
Our approach offers a series of practical advantages. The reported successes rates have been obtained with rather modest resources: the machine learning system consisted only of a simple memristor element, and an optimised drive signal. Thus, it has been demonstrated that extra intelligence can indeed be provided through the external drive since a single memristor (models I0O1 and I0O2) cannot be used to solve such a classification problem. The results of our work point to the exciting possibility that there is a potential synergy between the reservoir state dynamics and the external input signal profile (the signal one wishes to analyze) in terms of the information they could potentially share. This synergy can be exploited, but it is deeply hidden, scattered over time, as small clues that a traditional pattern recognition device ignores. The drive signal can be used to unlock that hidden potential. Clearly, such a synergy might not exist for every information processing problem (a particular reservoir-input combination), but it seems that for the problem that has been investigated it is there: The presence of the drive had much stronger effect on the aligned signals with a fixed time reference. Further, it is true that the presence of the drive imposes some limitations on the system design. However, arranging for an extra input channel ought to be possible in general.
To demonstrate the method, we used it to classify ECG signals. The method classifies the ECG signal with success rate 98.6% (93.1%) for aligned (non-aligned) signals. The reported success rates fall roughly in the range of other state of the art methods. For example, it has been reported that for the data set "ECG5000", the best performing algorithm "COTE", achieved a success rate of 94.61% 32 . Other state-of-the-art approaches, e.g. deep learning networks 33 , have reported similar success rates. Interestingly, the state of the art success rate has been achieved despite the fact that rather modest resources were used to implement the classification algorithm. Naturally, such a comparison is not entirely fair since we use the lowest possible number of classes (two). In contrast, the COTE algorithm was applied on the set of aligned signals with five-classes. Likewise, Hannun et al. 33 have solved a ECG classification problem with twelve classes and different ECG data sets. Thus it is likely that the success rates reported in here would deteriorate if one would increase the number of classes. However, to mitigate this one can consider using more complex memristor networks. Still, the fact that the success rates fall in the same range illustrate the potential of the method.
Since we considered just one memristor element it did not make much sense to work classification problems with more than two classes. The method can be easily applied to any multi-class pattern recognition problem, however. To do this, as already mentioned, one could consider using more complex dynamical systems. But there are other practices used in machine learning that might be used with an advantage. One could also consider a combination of several classification systems similar to ones suggested in this work. For example by, again, using a single memristor equipped with one linear readout layer. In particular, one system could be trained for a different task, e.g. to classify if a signal belongs to class 1 or 2, another system if a signal belongs in class 1 or 3, etc. Another option would be to use current work as models for trying ensemble learning methods 34 . All those ideas are left for future implementations.
The training data set (80 signals) was much smaller in size than the testing data set (1480 signals), and still the system performed very well. The findings of our method agree with other findings 28,29 that RC works well with a few number of training examples. Generic features can be extracted from the training data set since a few parameters are trained because reservoir's internal structure is not trained. Similarly, in our work, the fact that the drive was parameterised with very few parameters (and a limited training data set) suggests that the hidden correlations we seek to explore are in some sense global, and do not require detailed drive tuning to access. It is possible that a sub-optimised drive could do the job as well as the fully optimised one. In fact, this has also been corroborated by inspecting the genetic algorithm optimization steps. There is no reason to believe that such a global synergy could not be found and exploited in other systems.
The biggest challenge with the method is the process of training the drive. For very large systems with many elements the drive optimization procedure might be a problem. One would have to compute distances between points in a space with a very large dimension over many time instances. However, this issue could be mitigated in several ways.
First, to train the drive, there is no need for a rigorous supervised learning approach during this phase. Let us illustrate this on a thought experiment. Assume that the goal is to apply the approach for the pattern recognition problem where the goal is to use an artificial neural network. It is true that during the drive optimization step the user needs to have an access to labelled data, but not in the same way as for supervised learning approaches. To train an artificial neural networks one uses labels actively to train "inside" of the network, i.e. adjust its weights. Our use of the labels is more passive, we operate from the "outside". The main advantage of the suggested approach is that an easily adjustable external signal needs to be manipulated instead of the internal network weights. This might not matter for in silico implementations but it might be a decisive difference for hardware implementations, especially if one cannot engineer the reservoir easily.
Further, the fact that a sub-optimal drive might work opens for several options for building scalable separability optimization techniques that can be used on large systems. An obvious option is to reduce the training data set size: The size of the training data set influences the number of distance computations, and as argued above, the training data set with limited size was sufficient to optimise the drive. Another option to explore is to estimate separability in some other way, without computing distances between all trajectories (and trajectory points), when forced working with large data sets. Then, one might try to use a sampling technique in which not every (2020) 10:9191 | https://doi.org/10.1038/s41598-020-65404-3 www.nature.com/scientificreports www.nature.com/scientificreports/ distance has to be computed. For example, instead of computing the distance among the full set of trajectory points, one picks few sampling points at random, e.g. when the average trajectory position is used to estimate distances. Such an average could be efficiently estimated efficiently by random sampling.
It has been shown that including more reservoir resources (in addition to the external drive signal) such as feedback mechanisms or input layers significantly improved the performance of the device. These options are recommended for software applications since it would only cost the training of just few additional free parameters. However, for hardware implementations, those options should be considered wisely: these additional resources imply additional engineering overhead, e.g. in terms of energy consumption.
While the method is demonstrated theoretically on a simple memristor model, it is likely that such recognizable correlations can be exploited in other dynamical systems. The extensions towards other systems ought to be straight forward, at least in principle. It is possible that a novel pattern recognition paradigm might emerge at the interface between the theory of complex dynamical systems, and state-of-the-art machine learning methods, by further exploring (developing and applying) the method suggested in here.

Methods
To perform the first phase of the training procedure, we introduce a separability index and a genetic algorithm optimisation. The separability index acts as the fitness function of the optimisation. The details for calculating the separability index are presented in the first subsection. The optimisation algorithm is explained in the second subsection. Details of implementing numerical simulations are presented in the third subsection.
Separability index. The separability index v is introduced as a tool for quantifying the ability of a reservoir to separate inputs. Several such measures have been considered in the literature [14][15][16] . The measure suggested in this study is specifically tailored towards measuring the degree of collaboration between the elements.
One assumes a drive signal u and generates all possible trajectories for a given training data set which is a set of input signals with a corresponding label each one. The separability index v is computed by estimating typical distances between the trajectories. If typical distances are large then v should be large too. This implies that v depends on u and on all input signals q j i . To emphasize this, the notation v[u; c 1 , c 2 , …, c k ] will be used. If the classes are implicit, then a shorter form v[u] is more useful. Every class of input signals c 1 , c 2 , …, c k is represented by typical input signals This distance will be denoted by To obtain a typical distance one simply averages the instantaneous distance over time as , , The above formulas seem perfectly reasonable. However, there is a problem with these formulas that is not obvious. For example, consider the trajectories depicted in Fig. 5. Panel (a)  www.nature.com/scientificreports www.nature.com/scientificreports/ trademark of good input separation. By observing the probability distributions, one can with a bare eye decide a decision boundary. However, the behavior illustrated in panel (b) is problematic. For these trajectories the corre- is larger than zero for most time-points t implying that the average in Eq. (9) would be large too. This would signal that the behavior depicted in panel (b) is a trademark of good input separation and yet it would be hard to find a decision boundary by considering the probability distributions. The problem is that even though the individual trajectories are separated in time the regions they visit change over time. This implies that one would need to design a readout layer that is aware of these changes, leading to additional computational overhead.
A better distance estimate is obtained if one averages over time first, before the Euclidean norm is computed. The algorithm used to compute the typical distances is given by  (13) is a more useful measure than the one given in Eq. (9). For example, consider the two trajectories in Fig. 5 panel (a). There the regions claimed by the trajectories are stable. The problem with Eq. (9) is that it cannot properly distinguish between the situations depicted in panels (a) and (b). This equation predicts large degree of separability for both panels, which is clearly incorrect. If Eq. (13) is used, then the distance measure is larger for the pair of trajectories in (a) than in (b).
Further, note that in Eq. (10) the additional factor, N 1/ R , has been added. This factor is extremely important. It penalizes careless increase of the dimension of the state space. In practical terms, without that factor, the distance between trajectories would increase when adding additional copies of the existing memristors but without any direct coupling with the existing network. Note that if such elements are added then they contribute to the sum in the distance formula and the distance measure increases. This wrongly signals an increase in the information processing ability of the device. In principle, such parallel copies of non-interacting elements should not add any additional information processing ability.
Combining typical distance between trajectories into an overall separability measure. Assuming that the typical distances have been estimated, there are several ways to define the index v. For example, the most natural definition would be to calculate the average over the typical distances: We argue that the procedure to estimate the separability index given in Eq. (14) is not the optimal one. Instead, in this study, the following estimate is used:  16) is used, the index v is maximized only if all distances are fairly large, since when one of the distances is very small, the final product will also be small.

Genetic algorithm optimisation.
Genetic algorithms (GA) are used in the first phase of the training procedure with the separability index v as a fitness function. GA is a strong optimisation technique that can be used to solve problems regardless of their complexity [35][36][37] . Additionally, there is one other reason why GA is preferred in this work and not other methods such as gradient-based optimisations: At the moment, calculating gradients of index v with respect to the parameters of the drive signal requires extensive mathematical analysis and validation since similar work has not been implemented before.
Genetic algorithm optimisation is implemented in which a combination of two nested loops is used. The small loop represents the typical GA optimisation steps (e.g. crossover or mutation). Iterations in this block are stopped when the best fitness does not improve in four consecutive steps. However, each GA block starts with random sampling of the candidate solutions. There is a possibility that the small loop does not converge to a solution with a sufficiently large fitness due to extremely unlucky choice of the candidates. To guard against that problem, we repeat the GA iteration block several times, remembering the best solution. If indeed the best fitness does not improve in two subsequent iterations then it is extremely likely that the best solution has been found. The ideas discussed above have been implemented as follows: • 1. Determine the parameters to be optimised. • 2. Big loop -Repeat: • 2.1. Initialization: Sample 300 candidate solutions and store in the list CANDIDATES. If a best candidate solution is available from the previous iteration (stored in the variable BEST) include it into the CANDI-DATES list. • 2.2. Small loop -Repeat: • 2.2.1. Keep the 30 best solutions out of the list CANDIDATES to use them for the genetic operations. • 2.2.2. Choose another 10 randomly generated parents as candidate solutions. • 2.2.3 Perform genetic operations with the 30 + 10 = 40 parents to produce 300 offsprings. Decide with probability 0.5 if you will perform mutation or crossover. Create an offspring in the following way.
• If you perform mutation, choose randomly 1 out of the 40 parents. Choose randomly one parameter of the parent. Convert it into a binary number and choose randomly one bit of the binary number to mutate. Then convert the mutated binary number into a decimal number. • If you perform crossover, choose randomly 2 out of the 40 parents. Choose randomly one parameter in the same position for both parents. Convert both parameters into binary numbers. Crossover the binary numbers to produce a new binary number. Convert the new binary number into a decimal one.  Classical GA suffers from premature convergence. This can happen if the population diversity is not maintained and the genetic operations of crossover and mutation cannot produce better off-springs than their parents 38 . In the GA scheme above, the following measures are taken to prevent premature convergence: • In the step 2.1, a big amount of initial candidate solutions (300) are sampled. • In the step 2.2.1, a big amount of best solutions (30) are kept to be used for genetic operation at the next iteration. • In the steps of the small loop, a bigger priority is given to the solutions with best fitness and a slow convergence is avoided because 30 best solutions are chosen against 10 randomly generated. The randomly generated solutions encourage the genetic diversity which is important for avoiding premature convergence.  Table 4. When the input layers were not optimised, the parameters m 1 , m 0 , k 1 and k 0 were fixed. Their fixed values are shown in the Table 5.   Table 5. Numerical values of the parameters m 1 , m 0 , k 1 and k 0 .