
# Long short-term memory networks in memristor crossbar arrays

## Abstract

Recent breakthroughs in recurrent deep neural networks with long short-term memory (LSTM) units have led to major advances in artificial intelligence. However, state-of-the-art LSTM models with significantly increased complexity and a large number of parameters have a bottleneck in computing power resulting from both limited memory capacity and limited data communication bandwidth. Here we demonstrate experimentally that the synaptic weights shared in different time steps in an LSTM can be implemented with a memristor crossbar array, which has a small circuit footprint, can store a large number of parameters and offers in-memory computing capability that contributes to circumventing the ‘von Neumann bottleneck’. We illustrate the capability of our crossbar system as a core component in solving real-world problems in regression and classification, which shows that memristor LSTM is a promising low-power and low-latency hardware platform for edge inference.

A preprint version of the article is available at ArXiv.

## Main

The recent success of artificial intelligence largely results from advances in deep neural networks, which have a variety of architectures1, with the long short-term memory (LSTM) network being an important one2,3. By enabling the learning process to remember or forget the history of observations, LSTM-based recurrent neural networks (RNNs) are responsible for recent achievements in analysing temporal sequential data for applications such as data prediction4,5, natural language understanding6,7, machine translation8, speech recognition9 and video surveillance10. However, when implemented in conventional digital hardware, the complicated structures of LSTM networks lead to drawbacks connected with inference latency and power consumption. These issues are becoming increasingly prominent as more applications involve the processing of temporal data near the source in the era of the Internet of Things (IoT). Although there has been an increased level of effort in designing novel architectures to accelerate LSTM-based neural networks11–16, low parallelism and limited bandwidth between computing and memory units remain outstanding issues. It is therefore imperative to seek an alternative computing paradigm for LSTM networks.

A memristor is a two-terminal ‘memory resistor’17,18 that performs computation via physical laws at the same location where information is stored19. This feature entirely removes the need for data transfer between the memory and computation. Built into a crossbar architecture, memristors have been successfully employed in feed-forward fully connected neural networks20,21,22,23,24,25,26,27, which have shown significant advantages in terms of power consumption and inference latency over their CMOS-based counterparts28,29. The short-term memory effects of some memristors have also been utilized for reservoir computing30. On the other hand, most state-of-the-art deep neural networks, in which LSTM units are responsible for the recent success of temporal data processing, are built with more sophisticated architectures than fully connected networks. The memristor crossbar implementation of an LSTM has yet to be demonstrated, primarily because of the relative scarcity of large memristor arrays.

In this Article we demonstrate our experimental implementation of a core part of LSTM networks in memristor crossbar arrays. The memristors were monolithically integrated onto transistors to form one-transistor one-memristor (1T1R) cells. By connecting a fully connected network to a recurrent LSTM network, we executed in situ training and inference of this multilayer LSTM-based RNN for both regression and classification problems, with all matrix multiplications and updates during training and inference physically implemented on a memristor crossbar interfacing with digital computing. The memristor LSTM network experiments succeeded in predicting airline passenger numbers and identifying an individual human based on gait. This work shows that LSTM networks built in memristor crossbar arrays represent a promising alternative computing paradigm with high speed–energy efficiency.

## Results

### Memristor crossbar array for LSTM

Neural networks containing LSTM units are recurrent; that is, they not only fully connect the nodes in different layers, but also recurrently connect the nodes in the same layer at different time steps, as shown in Fig. 1a. The recurrent connections in LSTM units also involve gated units to control the remembering and forgetting, which enable the learning of long-term dependencies2,3. The data flow in a standard LSTM unit is shown in Fig. 1b and is characterized by equation (1) (linear matrix operations) and equation (2) (gated nonlinear activations), or equivalently by equations (3) to (5) in the Methods.

$$\left[\begin{array}{c}{\widehat{{\bf{a}}}}^{t}\\ {\widehat{{\bf{i}}}}^{t}\\ {\widehat{{\bf{f}}}}^{t}\\ {\widehat{{\bf{o}}}}^{t}\end{array}\right]=\left[\begin{array}{ccc}{{\bf{W}}}_{{\rm{a}}} & {{\bf{U}}}_{{\rm{a}}} & {{\bf{b}}}_{{\rm{a}}}\\ {{\bf{W}}}_{{\rm{i}}} & {{\bf{U}}}_{{\rm{i}}} & {{\bf{b}}}_{{\rm{i}}}\\ {{\bf{W}}}_{{\rm{f}}} & {{\bf{U}}}_{{\rm{f}}} & {{\bf{b}}}_{{\rm{f}}}\\ {{\bf{W}}}_{{\rm{o}}} & {{\bf{U}}}_{{\rm{o}}} & {{\bf{b}}}_{{\rm{o}}}\end{array}\right]\left[\begin{array}{c}{{\bf{x}}}^{t}\\ {{\bf{h}}}^{t-1}\\ 1\end{array}\right]$$
(1)
$$\begin{array}{l}{\widehat{{\bf{c}}}}^{t}=\sigma \left({\widehat{{\bf{i}}}}^{t}\right)\odot {\rm{tanh}}\left({\widehat{{\bf{a}}}}^{t}\right)+\sigma \left({\widehat{{\bf{f}}}}^{t}\right)\odot {\widehat{{\bf{c}}}}^{t-1}\\ {{\bf{h}}}^{t}=\sigma \left({\widehat{{\bf{o}}}}^{t}\right)\odot {\rm{tanh}}\left({\widehat{{\bf{c}}}}^{t}\right)\end{array}$$
(2)

where $${{\bf{x}}}^{t}$$ is the input vector at the present step, $${{\bf{h}}}^{t}$$ and $${{\bf{h}}}^{t-1}$$ are the output vectors at the present and previous time steps, respectively, $${\widehat{{\bf{c}}}}^{t}$$ is the internal cell state, and $$\odot$$ represents the element-wise multiplication. σ is the logistic sigmoid function, which is applied to $${\widehat{{\bf{i}}}}^{t},\,{\widehat{{\bf{f}}}}^{t}\,{\rm{and}}\,{\widehat{{\bf{o}}}}^{t}$$ to yield the input, forget and output gates. The model parameters are stored in weights W, recurrent weights U and bias parameters b for cell activation (a) and each gate (i, f, o), respectively. Because of this complicated structure, state-of-the-art deep RNNs involving LSTM units include massive quantities of model parameters, typically exceeding the normal capacity of on-chip memory (usually static random access memory, SRAM), and sometimes even off-chip main memory (usually dynamic random access memory, DRAM). Consequently, inference and training with the network will require the parameters to be transferred to the processing unit from a separate chip for computation, and data communication between the chips greatly limits the performance of LSTM-based RNNs on conventional hardware.
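As a minimal software sketch of equations (1) and (2), the single time step below uses NumPy. The function names, the random weights and the layer sizes are illustrative assumptions rather than the authors' framework; only the arithmetic follows the two equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(W_all, x_t, h_prev, c_prev):
    """One LSTM time step following equations (1) and (2).

    W_all stacks [W U b] for a, i, f and o, so its shape is
    (4 * n_hidden, n_in + n_hidden + 1).
    """
    # Equation (1): one matrix-vector product on the augmented input [x; h; 1].
    z = W_all @ np.concatenate([x_t, h_prev, [1.0]])
    a_hat, i_hat, f_hat, o_hat = np.split(z, 4)
    # Equation (2): gated nonlinear activations.
    c_t = sigmoid(i_hat) * np.tanh(a_hat) + sigmoid(f_hat) * c_prev
    h_t = sigmoid(o_hat) * np.tanh(c_t)
    return h_t, c_t

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(0)
n_in, n_hidden = 1, 15
W_all = rng.normal(scale=0.1, size=(4 * n_hidden, n_in + n_hidden + 1))

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in [0.10, 0.20, 0.15]:          # a short, made-up input sequence
    h, c = lstm_step(W_all, np.array([x]), h, c)
```

Only the product in equation (1) touches the full parameter matrix; the gating in equation (2) is element-wise, which is why the matrix product is the part worth moving into the memory array.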

To address this issue, we have adopted a memristor crossbar array for an RNN and store the large number of parameters required by an LSTM–RNN as the conductances of the memristors. The topography of this neural network architecture, together with the data flow direction, is shown in Fig. 1c. The linear matrix multiplications are performed in situ in a memristor crossbar array, removing the need to transfer weight values back and forth. The model parameters are stored within the same memristor crossbar array that performs the analogue matrix multiplications. We connected an LSTM layer to a fully connected layer for the experiments described here, and the layers can be cascaded into more complicated structures in the future. For demonstration purposes, the gated unit in the LSTM layer and the nonlinear unit in the fully connected layer were implemented in software in the present work, but they can be implemented by analogue circuits31 without digital conversions to substantially reduce the energy consumption and inference latency.

The analogue matrix unit in our LSTM was implemented in a 128 × 64 1T1R crossbar array with memristors monolithically integrated on top of a commercial foundry-fabricated transistor array20 (Fig. 2a,b). The integrated Ta/HfO2 memristors exhibited stable multilevel conductance (Fig. 2c), enabling matrix multiplication in the analogue domain20,26,27,32. With transistors controlling the compliance current, the integrated memristor array was programmed by loading a predefined conductance matrix with a write-and-verify approach (previously employed for analogue signal and image processing20 and ex situ training of fully connected neural networks26) or by a simple two-pulse scheme (previously used for in situ training of fully connected neural networks27; also used for in situ training of the LSTM network in this work). Inference in the LSTM layer was executed by applying voltages on the row wires of the memristor crossbar array and reading out the electrical current through the virtual grounded column wires. The readout current vector is the dot product of the memristor conductance matrix and the input voltage-amplitude vector, which was obtained directly by physical laws (Ohm’s law for multiplication and Kirchhoff’s current law for summation). Each parameter in the LSTM model was encoded by the conductance difference between two memristors in the same column, and subtraction was calculated in the crossbar array by applying voltages with the same amplitude but different polarities on the corresponding row wires (Fig. 2a). The applied voltage amplitude on the rows that connect to the memristors for bias representation was fixed across all the samples and time steps. 
The experimental readout currents from the memristor crossbar array comprise four parts, representing vectors $${\widehat{{\bf{a}}}}^{t},\,{\widehat{{\bf{i}}}}^{t},\,{\widehat{{\bf{f}}}}^{t}$$ and $${\widehat{{\bf{o}}}}^{t}$$ as described in equation (1), which were nonlinearly activated and gated (equation (2)) and converted to voltages (performed in software in the present work). The voltage vector ($${{\bf{h}}}^{t}$$) was then fed into the next layer (a fully connected layer) and recurrently to the LSTM layer itself at the next time step (as $${{\bf{h}}}^{t-1}$$) (Fig. 1c).
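The differential-pair readout described above can be checked numerically with a short idealized model. The sketch below ignores wire resistance, noise and device nonlinearity; the conductance range, the stacked row ordering and the function name are assumptions made for illustration only.

```python
import numpy as np

def crossbar_currents(G_pos, G_neg, v):
    """Column currents of an ideal differential-pair crossbar.

    G_pos, G_neg: (n_inputs, n_columns) conductances in siemens.
    v:            (n_inputs,) input voltage amplitudes in volts.
    """
    G = np.vstack([G_pos, G_neg])       # physical array with 2 * n_inputs rows
    v_rows = np.concatenate([v, -v])    # +v on one row of each pair, -v on the other
    # Ohm's law gives I = G * V per device; Kirchhoff's current law sums each column.
    return v_rows @ G

rng = np.random.default_rng(1)
G_pos = rng.uniform(100e-6, 1000e-6, size=(17, 60))   # illustrative conductance range
G_neg = rng.uniform(100e-6, 1000e-6, size=(17, 60))
v = rng.uniform(0.0, 0.2, size=17)

I = crossbar_currents(G_pos, G_neg, v)
I_expected = v @ (G_pos - G_neg)        # signed weights act as a conductance difference
```

In this ideal model the two results agree, so each signed weight is proportional to the conductance difference of its pair.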

The neural network was trained in situ within the memristor crossbar array to compensate for possible hardware imperfections, such as limited device yield, variation and noise in the conductance states33, wire resistance, analogue peripheral asymmetry and so on. Before training, all memristor conductances were initialized by set voltage pulses across the memristor devices and simultaneous fixed-amplitude pulses to the transistor gates. During training, initial inferences were performed on a batch of sequential data (mini-batch) and yielded sequential outputs. The memristor conductances were then adjusted to make the inference outputs closer to the target outputs (evaluated by a loss function, see Methods). The intended conductance update values (∆G) were calculated using the back-propagation through time (BPTT) algorithm34,35,36 (see Methods for details) with the help of off-chip electronics, and then applied to the memristor crossbar array experimentally. For memristors that needed the conductance to be decreased, we first applied a reset voltage pulse to their bottom electrodes (the top electrodes were grounded) to initialize the memristors to their low conductance states. We then applied synchronized set voltage pulses to the top electrodes, analogue voltage pulses to the transistor gates ($$\Delta {V}_{{\rm{gate}}}\propto \Delta G$$) and zero voltages to the bottom electrodes (grounded) to update the conductances of all memristors in the array (see Supplementary Fig. 1 for additional details). The conductance update can be performed on a row-by-row or column-by-column basis, as proposed in previous work27. This two-pulse scheme has previously been demonstrated to be effective in achieving linear and symmetric memristor conductance updates27.

The present work focuses on exploring the feasibility of building neural networks with various architectures, in particular the LSTM network, from emerging analogue devices such as memristors. For this purpose we developed a neural network framework in MATLAB with a Keras-like37 interface, which enables the implementation of an arbitrarily configured neural network architecture, and specifically in this work, the LSTM–fully connected network (see Supplementary Fig. 2 for the detailed architecture). The experimental memristor crossbar array executes the matrix multiplications during the forward and backward passes and the weight update, and can be replaced with a simulated memristor crossbar array or a software backend using 32-bit floating point arithmetic. This architecture enables a direct comparison of the crossbar neural network with the digital approach using the same algorithm and dataset. In the experimental crossbar implementation, the framework communicates with our custom-built off-chip measurement system26, which supplies up to 128 different analogue voltages to the memristor crossbar array and senses up to 64 current channels from it simultaneously (see Methods for details), completing the matrix multiplication and weight update.

### Regression experiment

We first applied the memristor LSTM in predicting the number of airline passengers for the next month, a typical example of a regression problem. We built a two-layer RNN in a 128 × 64 1T1R memristor crossbar array with each layer in a partition of the array. The input of the RNN was the number of air passengers in the present month, and the output was the projected number for the subsequent month. The RNN network structure is illustrated in Fig. 3a. We used 15 LSTM units with a total of 2,040 memristors (34 × 60 array) representing 1,020 synaptic weights (Fig. 3b), which took one data input, one fixed input for bias and 15 recurrent inputs from themselves. The second layer of the network was a fully connected layer with 15 inputs from the LSTM layer and another input as the bias. The recurrent weights in the LSTM units represent the learned knowledge on when and what to remember and forget, and therefore the output of the network was dependent on both present and previous inputs.
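The array dimensions quoted for the LSTM layer follow from counting inputs and outputs, as the short calculation below illustrates. The helper name and its arguments are hypothetical; the factor of two reflects the two memristors used per weight and the factor of four the a, i, f and o outputs of each unit.

```python
def lstm_crossbar_size(n_data_inputs, n_units, n_bias_inputs=1):
    """Crossbar footprint of one LSTM layer with differential-pair weights."""
    n_inputs = n_data_inputs + n_bias_inputs + n_units   # data + bias + recurrent
    rows = 2 * n_inputs      # two memristors (one differential pair) per weight
    cols = 4 * n_units       # a, i, f and o outputs for every unit
    return {
        "rows": rows,
        "cols": cols,
        "memristors": rows * cols,
        "weights": n_inputs * cols,
    }

size = lstm_crossbar_size(n_data_inputs=1, n_units=15)
# size == {"rows": 34, "cols": 60, "memristors": 2040, "weights": 1020}
```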

The dataset we chose for this prediction task included the airline passenger numbers per month from January 1949 to December 1960, with 144 observations38, from which the first 96 data points were selected as the training set (only one sample, with a sequence length of 96) and the remaining 48 data points as the testing set (Fig. 3c). During the inference, the number of passengers was linearly converted to a voltage amplitude (normalized to between 0 V and 0.2 V so as not to disturb the memristor conductances). The final output electrical current was scaled back by the same coefficient used during input normalization and by a conductance-to-weight ratio, as specified in Table 1, to reflect the number of airline passengers. The training process was carried out to minimize the mean square error (equation (7)) between the data in the training set and the network output, by stochastic gradient descent through the BPTT algorithm (see Methods). The raw voltages applied on the memristor crossbar array and the raw output currents during the inference after 800 epochs are shown in Fig. 3d–g. The corresponding conductance and weight values were read out from the crossbar array and are shown in Supplementary Fig. 3, although they were not used for either the inference or the training processes. The experimental training result in Fig. 3c shows that the network learned to predict both the training data and the unseen testing data after 800 epochs of training.
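The linear conversion between passenger counts and input voltages can be sketched as below. The function names and the numeric range are hypothetical, and the conductance-to-weight ratio applied to the output current is deliberately left out because its value is given only in Table 1.

```python
def to_voltage(count, count_min, count_max, v_max=0.2):
    """Map a passenger count linearly onto the range [0, v_max] volts."""
    return v_max * (count - count_min) / (count_max - count_min)

def from_voltage(v, count_min, count_max, v_max=0.2):
    """Inverse mapping, using the same coefficient as the input normalization."""
    return count_min + (count_max - count_min) * v / v_max

# Hypothetical range, for illustration only; not taken from the dataset.
COUNT_MIN, COUNT_MAX = 100.0, 700.0

v = to_voltage(400.0, COUNT_MIN, COUNT_MAX)        # mid-range count, about 0.1 V
recovered = from_voltage(v, COUNT_MIN, COUNT_MAX)  # back to about 400
```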

### Classification experiment

We further applied our memristor LSTM–RNN to identify an individual human by the person’s gait. The gait as a biometric feature has a unique advantage when identifying a human from a distance, as other biometrics (for example, the face) occupy too few pixels to be recognizable. It becomes increasingly important in circumstances in which face recognition is not feasible, for example because of camouflage and/or a lack of illumination. To use gait in a surveillance application scenario, it is preferable to deploy many cameras and perform the inference locally rather than sending the raw video data back to a server in the cloud. Inference near the source should be performed with low power and small communication bandwidth, but still achieve low latency.

The memristor LSTM–RNN took a feature vector extracted from a video frame as the input, and output the classification result as electrical current at the end of the sequence (Fig. 4a). The two-layer RNN was implemented by partitioning a 128 × 64 memristor crossbar array (Fig. 4b) in which 14 LSTM units in the first layer were fully connected to a 50-dimensional input vector with 64 × 56 connections (implemented in a 128 × 56 memristor crossbar array). The 14 LSTM units were also fully connected to eight output nodes. The classification result is given by the index of the largest element in the output vector of the fully connected layer.

To demonstrate the core operation of the memristor LSTM network, feature vectors for the input of the LSTM–RNN were extracted from video frames by software. Human silhouettes with 128 × 88 pixels were first pre-extracted from the raw video frames in the USF-NIST gait dataset39 and then processed into 128-dimensional width-profile vectors40. The vectors were then down-sampled to 50 dimensions to fit the size of our crossbar array (Fig. 4c) and normalized to between −0.2 V and 0.2 V to match the input voltage range. We chose video sequences from eight different people out of 75 in the original dataset (Table 2). The videos cover various scenarios, with people wearing two different pairs of shoes on two different surface types (grass or concrete) taken from two different viewpoints (eight covariances). The video sequences were further segmented into 664 sequences, each with 25 frames, as described in detail in the Methods and Supplementary Fig. 4. The training was performed on 597 sequences randomly drawn from the dataset, while the remaining 67 unseen sequences were used for the classification test. In state-of-the-art deep neural networks, the feature vectors that feed into the LSTM layer are usually extracted by multiple convolutional layers and/or fully connected layers without much human knowledge. The feature extraction step could also be implemented in a memristor crossbar array when multiple arrays are available in the near future20.

The training and inference processes were experimentally performed, in part, in the memristor crossbar array, with a procedure similar to that in the regression experiment (see Methods). The goal of the training was to minimize the cross-entropy in the Bayesian probability (equation (8) in the Methods), which is the loss function calculated from the last-time-step electrical current and the ground truth (Fig. 4a). The desired weight update values were optimized with root-mean-square propagation (RMSprop)41 based on the weight gradient calculated by the BPTT algorithm (mostly in software in the present work) and then experimentally applied to the memristor crossbar array after the inference operation on one mini-batch of 50 training sequences, using the two-pulse conductance update scheme. The mean cross-entropy during the inference of each mini-batch was calculated in software and is shown in Fig. 4d, from which one sees the effectiveness of the training with the memristor crossbar array. The raw input voltages and output currents, as well as the memristor conductance maps after the training experiment, are provided in Supplementary Fig. 5 and Supplementary Fig. 6, respectively. The classification test was conducted on the separate testing set after training on each epoch. The classification accuracy increased steadily during the training, and the maximum accuracy within the 50 epochs of training was 79.1%, confirming that the in situ training of the LSTM network is effective.

In addition to the successful demonstration of in situ training on the memristor LSTM network in both the regression and classification experiments, we quantitatively compared our result to the digital counterpart by training with the same neural network architecture and hyper-parameters, and on the same dataset. The results of the classification experiment are provided in Supplementary Fig. 7b (more data on the regression experiment are provided in Supplementary Fig. 7a). Our experimental result matches the simulation with a 1.4% conductance update error, showing that using memristors in an LSTM network and obtaining an accuracy comparable with the digital approach will require a higher accuracy in conductance tuning as well as other improvements in the crossbar array. This may be achieved with improved devices42,43 and a better weight update scheme44, by constructing a larger network for increased redundancy27 or by employing multiple memristors to represent one synaptic weight. The LSTM network is more sensitive to weight update error than a fully connected network for pattern recognition27, suggesting different architectures may impose different performance requirements on emerging analogue devices.

## Discussion

In summary, we have built multilayer RNNs with a memristor LSTM layer and a memristor fully connected layer. A major reason for using memristor networks for LSTM and other machine intelligence applications is their promise in terms of speed and energy efficiency. The advantages of analogue in-memory linear algebra for inference are well established20,21,22,23,24,25,26,27,28,44,45. The computation could stay in the analogue domain, with the analogue input signals acquired directly from sensors, and the analogue output signals activated with nonlinear device response or circuits. The entire operation could be performed in a single time step, and so the latency would not scale with the size of the array. Our proof-of-concept work employs off-chip analogue-to-digital conversions (ADCs), for demonstration purposes. Even with the ADC overhead, a mixed-signal system at a scaled technology node projects significant advantages over an all-digital system46,47,48,49. The successful demonstrations for both regression and classification tasks show the versatility of connecting the memristor neural network layers in different configurations. The results open up a new direction for integrating multiple memristor crossbar arrays with different configurations on the same chip, which will minimize data transfer and significantly reduce the inference latency and power consumption in deep RNNs.

## Methods

### 1T1R array integration

The transistor array was fabricated in a commercial foundry using the 2 µm technology node. The memristors were then integrated in a university cleanroom. The transistors were used as selector devices to mitigate the sneak path problem in the crossbar array and to enable precise conductance tuning. Two layers of metal wires were also fabricated in the foundry back-end-of-the-line process as row and column wires to reduce the wire resistance (about 0.3 Ω between cells). The low wire resistance in the array is one of the key factors providing accurate matrix multiplication. The memristors were fabricated on top of the transistor array in the UMass Amherst cleanroom, with sputtered palladium as the bottom electrode, atomic-layer-deposited hafnium dioxide as the switching layer and sputtered tantalum as the top electrode.

### Measurement system

A custom-built off-chip multi-board measurement system was used during the experiment; the details of this system have been reported previously26. Eight 16-bit digital-to-analogue convertors (DACs) (Analog Devices, LTC2668, least significant bit = 0.3 mV) were used to supply different analogue voltages to the 128 rows simultaneously. The current was sensed and converted to voltage through transimpedance amplifiers attached to each column. Eight 14-bit ADCs were used to convert the analogue voltages from the 64 columns to digital values before being transmitted to the microcontroller (Microchip Technology, PIC32MX795F512). A fixed linear relation was used to convert the DAC/ADC digitized command/reading to voltage/current values in the microcontroller. The present measurement system was optimized for flexibility in the application demonstration rather than speed and energy efficiency.

### Inference in the two-layer LSTM–RNN

The network in this work had two layers, the LSTM as the first layer and a fully connected layer as the second. The algorithm can be extended to more layers because of the cascaded structure. The input to the network was $${{\bf{x}}}^{t}$$, and the LSTM cell activation $${{\bf{a}}}^{t}$$ was calculated as shown in equation (3):

$${{\bf{a}}}^{t}={\rm{tanh}}\left({{\bf{W}}}_{{\rm{a}}}{{\bf{x}}}^{t}+{{\bf{U}}}_{{\rm{a}}}{{\bf{h}}}^{t-1}+{{\bf{b}}}_{{\rm{a}}}\right)$$
(3)

where $${{\bf{h}}}^{t-1}$$ is the recurrent input, which is also the output of this layer at step t−1 (introduced below in equation (5)), $${{\bf{W}}}_{{\rm{a}}}$$ and $${{\bf{U}}}_{{\rm{a}}}$$ are the synaptic weight matrices for the input and recurrent input, respectively, and $${{\bf{b}}}_{{\rm{a}}}$$ is the bias vector. The same notation is used in the following equations.

The input gate ($${{\bf{i}}}^{t}$$), forget gate ($${{\bf{f}}}^{t}$$) and output gate ($${{\bf{o}}}^{t}$$) that control the output are defined in equation (4):

$$\begin{array}{l}{{\bf{i}}}^{t}=\sigma \left({\widehat{{\bf{i}}}}^{t}\right)=\sigma \left({{\bf{W}}}_{{\rm{i}}}{{\bf{x}}}^{t}+{{\bf{U}}}_{{\rm{i}}}{{\bf{h}}}^{t-1}+{{\bf{b}}}_{{\rm{i}}}\right)\\ {{\bf{f}}}^{t}=\sigma \left({\widehat{{\bf{f}}}}^{t}\right)=\sigma \left({{\bf{W}}}_{{\rm{f}}}{{\bf{x}}}^{t}+{{\bf{U}}}_{{\rm{f}}}{{\bf{h}}}^{t-1}+{{\bf{b}}}_{{\rm{f}}}\right)\\ {{\bf{o}}}^{t}=\sigma \left({\widehat{{\bf{o}}}}^{t}\right)=\sigma \left({{\bf{W}}}_{{\rm{o}}}{{\bf{x}}}^{t}+{{\bf{U}}}_{{\rm{o}}}{{\bf{h}}}^{t-1}+{{\bf{b}}}_{{\rm{o}}}\right)\end{array}$$
(4)

The output of the LSTM layer (as the hidden layer output in the two-layer RNN) was determined using equation (5):

$$\begin{array}{l}{{\bf{c}}}^{t}={\rm{tanh}}\left({{\bf{i}}}^{t}\odot {{\bf{a}}}^{t}+{{\bf{f}}}^{t}\odot {\widehat{{\bf{c}}}}^{t-1}\right)\\ {{\bf{h}}}^{t}={{\bf{o}}}^{t}\odot {{\bf{c}}}^{t}\end{array}$$
(5)

Equations (3), (4) and (5) are equivalent to equations (1) and (2) in the main text, in which the linear and nonlinear operations are separated for easier comprehension.

The final output of the RNN is read out by a fully connected layer, the function of which is characterized by equation (6):

$${{\bf{y}}}^{t}=f\,\left({\widehat{{\bf{y}}}}^{t}\right)=f\,\left({{\bf{W}}}_{{\rm{FC}}}{{\bf{h}}}^{t}+{{\bf{b}}}_{{\rm{FC}}}\right)$$
(6)

where f is the nonlinear activation function in the fully connected layer. Specifically, we used the logistic sigmoid function in the airline prediction experiment and the softmax function in the human gait identification experiment.

### Training with BPTT

The goal of the training process was to minimize a loss function $${\mathscr{L}}$$, which is a function of the network output $${{\bf{y}}}^{t}$$ and the targets $${{\bf{y}}}_{{\rm{target}}}^{t}$$ (ground truth or labels). Specifically, we chose the mean square error loss over all time steps for the airline prediction experiment (equation (7)) and cross-entropy loss on the last time step for the human gait identification experiment (equation (8)):

$${{\mathscr{L}}}_{{\rm{airline}}}=\sum _{n=1}^{N}\sum _{t=1}^{T}\frac{1}{2}{\left[{{\bf{y}}}^{t}\left(n\right)-{{\bf{y}}}_{{\rm{target}}}^{t}\left(n\right)\right]}^{\top }\left[{{\bf{y}}}^{t}\left(n\right)-{{\bf{y}}}_{{\rm{target}}}^{t}\left(n\right)\right]{\rm{/}}T$$
(7)
$${{\mathscr{L}}}_{{\rm{gait}}}=-\sum _{n=1}^{N}\sum _{c=1}^{C}{{\bf{y}}}_{c,{\rm{target}}}\left(n\right){\rm{log}}\left[{y}_{c}^{T}\left(n\right)\right]$$
(8)

where n indexes over the sample, N is the batch size, t is the temporal sequence number and T is the total time steps in the sequence.
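Equations (7) and (8) translate directly into a few lines of NumPy. In this sketch the array layouts (sample, time step, output dimension) and the one-hot encoding of the classification targets are assumptions; the function names are hypothetical.

```python
import numpy as np

def airline_loss(y, y_target):
    """Equation (7): squared error summed over samples and time steps, divided by T.

    y, y_target: arrays of shape (N, T, D).
    """
    T = y.shape[1]
    return 0.5 * np.sum((y - y_target) ** 2) / T

def gait_loss(y_last, y_target):
    """Equation (8): cross-entropy on the last time step only.

    y_last:   softmax outputs at t = T, shape (N, C), all entries > 0.
    y_target: one-hot labels, shape (N, C).
    """
    return -np.sum(y_target * np.log(y_last))

# Tiny made-up examples.
mse = airline_loss(np.ones((1, 2, 1)), np.zeros((1, 2, 1)))           # 0.5 * 2 / 2
xent = gait_loss(np.array([[0.5, 0.5]]), np.array([[1.0, 0.0]]))      # log(2)
```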

The training (that is, model optimization) was based on the weight gradients of the loss function. Since the weights stayed the same in the same mini-batch over all time steps, the gradients were accumulated before each weight update. The gradient of the loss function $${\mathscr{L}}$$ with respect to the last-layer pre-activation output on sample n at time step t is denoted $$\updelta {\widehat{{\bf{y}}}}^{t}=\frac{\partial {\mathscr{L}}}{\partial {\widehat{{\bf{y}}}}^{t}}$$, and other gradients are notated similarly. The gradients were calculated using the BPTT algorithm35,36. The last layer output delta was calculated with equation (9) for the airline prediction task and with equation (10) for the gait identification task. This particular step is calculated in software in the present work, but it can be implemented with a simple circuit as proposed in previous literature47.

$$\updelta {\widehat{{\bf{y}}}}^{t}=\frac{\partial {\mathscr{L}}}{\partial {\widehat{{\bf{y}}}}^{t}}=\sigma ^{\prime} \left({{\bf{y}}}^{t}-{{\bf{y}}}_{{\rm{target}}}^{t}\right)$$
(9)

where σ′ is the derivative of the logistic sigmoid function, and similarly in the following equations, tanh′ is the derivative of the hyperbolic tangent function.

$$\updelta {\widehat{{\bf{y}}}}^{t}=\frac{\partial {\mathscr{L}}}{\partial {\widehat{{\bf{y}}}}^{t}}=\left\{\begin{array}{cc}{{\bf{y}}}^{t}-{{\bf{y}}}_{{\rm{target}}}^{t} & t=T;\\ 0 & t < T\end{array}\right\}$$
(10)

where T is the length of the temporal sequence.
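The t = T case of equation (10) is the familiar result that a softmax output combined with a cross-entropy loss gives a delta equal to the output minus the target. The sketch below confirms this with a finite-difference check; the logits and the one-hot target are made-up values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))           # shift for numerical stability
    return e / e.sum()

def cross_entropy(z, target):
    return -np.sum(target * np.log(softmax(z)))

z = np.array([0.3, -1.2, 0.8, 0.1])     # made-up pre-activation outputs at t = T
target = np.array([0.0, 0.0, 1.0, 0.0]) # made-up one-hot label

analytic = softmax(z) - target          # equation (10) at t = T

# Central finite differences of the loss with respect to each logit.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    dz = np.zeros_like(z)
    dz[k] = eps
    numeric[k] = (cross_entropy(z + dz, target) - cross_entropy(z - dz, target)) / (2 * eps)
```

For t < T the loss in equation (8) does not depend on the output, so the delta is zero, as in the second case of equation (10).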

The previous layer deltas were calculated with the chain rule:

$$\updelta {{\bf{h}}}^{t}={{\bf{W}}}_{{\rm{FC}}}^{\top }\updelta {\widehat{{\bf{y}}}}^{t}+\updelta {{\bf{h}}}^{t+1}$$
(11)
$$\updelta {\widehat{{\bf{o}}}}^{t}=\updelta {{\bf{h}}}^{t}\odot {{\bf{c}}}^{t}\odot \sigma ^{\prime} \left({{\bf{o}}}^{t}\right)$$
(12)
$$\updelta {\widehat{{\bf{c}}}}^{t}=\updelta {{\bf{h}}}^{t}\odot {{\bf{o}}}^{t}\odot {\rm{\tanh }}^{\prime} \left({{\bf{c}}}^{t}\right)+\updelta {\widehat{{\bf{c}}}}^{t+1}$$
(13)
$$\updelta {\widehat{{\bf{a}}}}^{t}=\updelta {\widehat{{\bf{c}}}}^{t}\odot {{\bf{i}}}^{t}\odot {\rm{\tanh }}^{\prime} \left({{\bf{a}}}^{t}\right)$$
(14)
$$\updelta {\widehat{{\bf{i}}}}^{t}=\updelta {\widehat{{\bf{c}}}}^{t}\odot {{\bf{a}}}^{t}\odot \sigma ^{\prime} \left({{\bf{i}}}^{t}\right)$$
(15)
$$\updelta {\widehat{{\bf{f}}}}^{t}=\updelta {\widehat{{\bf{c}}}}^{t}\odot {\widehat{{\bf{c}}}}^{t-1}\odot \sigma ^{\prime} \left({{\bf{f}}}^{t}\right)$$
(16)
$$\updelta {\widehat{{\bf{c}}}}^{t-1}=\updelta {\widehat{{\bf{c}}}}^{t}\odot {{\bf{f}}}^{t}$$
(17)
$$\left[\begin{array}{c}\updelta {{\bf{x}}}^{t}\\ \updelta {{\bf{h}}}^{t-1}\end{array}\right]=\left[\begin{array}{cccc}{{\bf{W}}}_{{\rm{a}}}^{\top } & {{\bf{W}}}_{{\rm{i}}}^{\top } & {{\bf{W}}}_{{\rm{f}}}^{\top } & {{\bf{W}}}_{{\rm{o}}}^{\top }\\ {{\bf{U}}}_{{\rm{a}}}^{\top } & {{\bf{U}}}_{{\rm{i}}}^{\top } & {{\bf{U}}}_{{\rm{f}}}^{\top } & {{\bf{U}}}_{{\rm{o}}}^{\top }\end{array}\right]\left[\begin{array}{c}\updelta {\widehat{{\bf{a}}}}^{t}\\ \updelta {\widehat{{\bf{i}}}}^{t}\\ \updelta {\widehat{{\bf{f}}}}^{t}\\ \updelta {\widehat{{\bf{o}}}}^{t}\end{array}\right]$$
(18)

where $${{\bf{M}}}^{\top }$$ denotes the transpose of the matrix M. The computationally expensive steps described in equations (11) and (18) (with complexity $$O({N}^{2})$$, whereas the others are $$O(N)$$) can be calculated in the crossbar array. In the present work, the error vectors and subsequently the weight gradients, which are the outer product of the delta vector and the stored input vector (delta rule), were calculated in software. There have been attempts at the simple hardware implementation of this step47, but the applicability to our weight update scheme is still under evaluation.

$$\left[\begin{array}{cc}\updelta {{\bf{W}}}_{{\rm{FC}}}^{t} & \updelta {{\bf{b}}}_{{\rm{FC}}}^{t}\end{array}\right]=\left[\updelta {\widehat{{\bf{y}}}}^{t}\right]\left[\begin{array}{cc}{{\bf{h}}}^{{t}^{\top }} & 1\end{array}\right]$$
(19)
$$\left[\begin{array}{lll}\updelta {{\bf{W}}}_{{\rm{a}}}^{t} & \updelta {{\bf{U}}}_{{\rm{a}}}^{t} & \updelta {{\bf{b}}}_{{\rm{a}}}^{t}\\ \updelta {{\bf{W}}}_{{\rm{i}}}^{t} & \updelta {{\bf{U}}}_{{\rm{i}}}^{t} & \updelta {{\bf{b}}}_{{\rm{i}}}^{t}\\ \updelta {{\bf{W}}}_{{\rm{f}}}^{t} & \updelta {{\bf{U}}}_{{\rm{f}}}^{t} & \updelta {{\bf{b}}}_{{\rm{f}}}^{t}\\ \updelta {{\bf{W}}}_{{\rm{o}}}^{t} & \updelta {{\bf{U}}}_{{\rm{o}}}^{t} & \updelta {{\bf{b}}}_{{\rm{o}}}^{t}\end{array}\right]=\left[\begin{array}{l}\updelta {\widehat{{\bf{a}}}}^{t}\\ \updelta {\widehat{{\bf{i}}}}^{t}\\ \updelta {\widehat{{\bf{f}}}}^{t}\\ \updelta {\widehat{{\bf{o}}}}^{t}\end{array}\right]\left[\begin{array}{ccc}{{\bf{x}}}^{{t}^{\top }} & {{\bf{h}}}^{t-{1}^{\top }} & 1\end{array}\right]$$
(20)
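The outer product in equation (20) can be sketched in a few lines of NumPy. The function name and the ordering of the stacked deltas (a, i, f, o) are assumptions for this illustration. The appended constant 1 is what produces the bias gradients in the last column.

```python
import numpy as np

def lstm_weight_gradients(gate_deltas, x, h_prev):
    """Equation (20): outer product of the stacked gate deltas and [x, h_prev, 1].

    gate_deltas : shape (4 * n_hidden,), assumed to be ordered a, i, f, o.
    x           : shape (n_in,); h_prev : shape (n_hidden,).
    Returns an array of shape (4 * n_hidden, n_in + n_hidden + 1) whose
    column blocks hold the W, U and bias gradients, in that order.
    """
    augmented_input = np.concatenate([x, h_prev, [1.0]])
    return np.outer(gate_deltas, augmented_input)
```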

The parameter (weights or bias) gradients were accumulated as described in equation (21):

$${\bf{GRAD}}=\sum _{t=1}^{T}\sum _{n=1}^{N}\updelta {{\bf{W}}}^{t}\left(n\right)$$
(21)
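A minimal sketch of the accumulation in equation (21) follows. It assumes that N indexes the samples in a batch and that the per-step deltas and stored inputs are available as arrays. The function name and array layout are hypothetical. Summing the individual outer products is equivalent to one contraction over the time and sample axes.

```python
import numpy as np

def accumulate_gradients(deltas, inputs):
    """Equation (21): sum the per-step, per-sample outer products.

    deltas : shape (T, N, n_out), delta vectors for T time steps and N samples
             (N is assumed here to index the samples in a batch).
    inputs : shape (T, N, n_in), the stored (augmented) input vectors.
    Returns GRAD with shape (n_out, n_in).
    """
    # Sum over t and n of outer(deltas[t, n], inputs[t, n])
    return np.einsum("tni,tnj->ij", deltas, inputs)
```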

The stochastic gradient descent with momentum (SGDM) optimizer that we used in the airline prediction problem yielded the desired weight update value by means of equation (22):

$${\rm{\Delta }}{\bf{W}}=-\left(\eta {\rm{\Delta }}{{\bf{W}}}_{{\rm{pre}}}+\alpha {\bf{GRAD}}\right)$$
(22)

where η and α are the hyperparameters for momentum and learning rate, respectively.
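Equation (22) can be written as a one-line function. The sketch below follows the equation literally. How the ΔW_pre term is stored between iterations is not specified in this section, so it is simply passed in as an argument, and the function name and example values are illustrative assumptions.

```python
import numpy as np

def sgdm_update(grad, delta_w_pre, eta, alpha):
    """Equation (22): desired weight update with momentum.

    grad        : accumulated gradient GRAD from equation (21).
    delta_w_pre : the previous-step term written as Delta W_pre in equation (22).
    eta, alpha  : momentum and learning-rate hyperparameters.
    """
    return -(eta * delta_w_pre + alpha * grad)

# Illustrative values only; the first step starts from a zero previous update.
grad = np.array([[2.0, -1.0]])
delta_w = sgdm_update(grad, np.zeros_like(grad), eta=0.9, alpha=0.1)
```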

In the gait identification experiment we used the RMSprop optimizer, which gives the desired weight update values through equation (23):

$$\begin{array}{l}{\bf{MS}}=\beta {{\bf{MS}}}_{{\rm{pre}}}+\left(1-\beta \right){{\bf{GRAD}}}^{\circ 2}\\ {\rm{\Delta }}{\bf{W}}=-\left[\eta {\rm{\Delta }}{{\bf{W}}}_{{\rm{pre}}}+\alpha {\bf{GRAD}}\oslash \left(\sqrt{{\bf{MS}}}+\varepsilon \right)\right]\end{array}$$
(23)

where β, ε, α and η are hyperparameters, $${{\bf{GRAD}}}^{\circ 2}$$ denotes the element-wise square of the matrix GRAD and $$\oslash$$ indicates element-wise division.
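The two lines of equation (23) translate directly into element-wise NumPy operations. This is a sketch under the same assumptions as before: the function name and argument order are hypothetical, and the running mean square MS and the ΔW_pre term are passed in and returned, not stored internally.

```python
import numpy as np

def rmsprop_update(grad, ms_pre, delta_w_pre, beta, eta, alpha, eps):
    """Equation (23): RMSprop update with a momentum term.

    Returns the desired weight update and the new running mean square MS.
    """
    # Running average of the element-wise squared gradient
    ms = beta * ms_pre + (1.0 - beta) * grad ** 2
    # Element-wise division by sqrt(MS) + eps, plus the momentum term
    delta_w = -(eta * delta_w_pre + alpha * grad / (np.sqrt(ms) + eps))
    return delta_w, ms
```

Returning MS alongside the update makes explicit the extra per-weight state that RMSprop requires compared with SGDM.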

Finally, the desired weight values are updated in the memristor crossbar array by the two-pulse scheme with $$\Delta {V}_{{\rm{gate}}}\propto \Delta W$$, which can be performed row-by-row (set) and column-by-column (reset)27. The calculation of the desired weight update values, either with SGDM or with RMSprop, was performed in software in this work. There are plausible proposals for implementing the SGDM algorithm in hardware47,48,49, where an auxiliary memory array is placed near the memristor crossbar array to store and compute the synaptic weights. With the proposed weight update scheme, the same auxiliary memory can also be used to store the gate voltage matrix, and the weight gradient matrix is updated and accumulated directly in the gate voltage matrix. The RMSprop algorithm, on the other hand, is much more expensive to implement in hardware. In the present work, we chose RMSprop to speed up training in the more complicated classification experiment. It can also be used when the system is targeted for inference only, where the training is facilitated by external electronics to compensate for the defects of the emerging memristor devices and the asymmetry of the peripheral circuits.

### Hyperparameters

The hyperparameters used during the training experiment are presented in Table 1. They include both the hyperparameters for the neural network and the physical parameters to operate the memristor crossbar array.

### USF-NIST gait dataset

We have full permission to use the videos of walking humans (http://www.cse.usf.edu/~sarkar/SudeepSarkar/Gait_Data.html), granted by the creator of the dataset. We picked a subset of video sequences from eight walking people out of the total of 75 in the original dataset. The video sequences were further segmented into 664 samples, each having an ID number corresponding to the person (see Table 2 for details). We randomly picked 90% of the samples (total of 597) as the training set, and the remaining 10% (total of 67) as the testing set.
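The split sizes above follow from taking the floor of 90% of 664. A minimal sketch of such a random split is given below. The seed, the function name and the use of integer sample indices are assumptions for illustration, and the actual selection procedure may have differed.

```python
import random

def split_samples(sample_ids, train_fraction=0.9, seed=0):
    """Randomly split sample IDs into training and testing sets."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    # Floor of the training fraction (assumed rounding rule)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

train_ids, test_ids = split_samples(range(664))
print(len(train_ids), len(test_ids))  # 597 67
```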

## Data availability

The data that support the plots within this paper and other findings of this study are available from the corresponding author upon reasonable request. The code that supports the plots within this Article and other findings of this study is available at http://github.com/lican81/memNN. The code that supports the communication between the custom-built measurement system and the integrated chip is available from the corresponding author upon reasonable request.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

1. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
2. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
3. Gers, F. A., Schmidhuber, J. & Cummins, F. Learning to forget: continual prediction with LSTM. Neural Comput. 12, 2451–2471 (2000).
4. Schmidhuber, J., Wierstra, D. & Gomez, F. Evolino: hybrid neuroevolution/optimal linear. In Proc. 19th International Joint Conference on Artificial Intelligence 853–858 (Morgan Kaufmann, San Francisco, 2005).
5. Bao, W., Yue, J. & Rao, Y. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLoS ONE 12, e0180944 (2017).
6. Jia, R. & Liang, P. Data recombination for neural semantic parsing. In Proc. 54th Annual Meeting of the Association for Computational Linguistics (eds Erk, K. & Smith, N. A.) 12–22 (Association for Computational Linguistics, 2016).
7. Karpathy, A. The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy Blog http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (2015).
8. Wu, Y. et al. Google’s neural machine translation system: bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).
9. Xiong, W. et al. The Microsoft 2017 conversational speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5934–5938 (IEEE, 2018).
10. Sudhakaran, S. & Lanz, O. Learning to detect violent videos using convolutional long short-term memory. In Proc. 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) 1–6 (IEEE, 2017).
11. Chang, A. X. M. & Culurciello, E. Hardware accelerators for recurrent neural networks on FPGA. In Proc. 2017 IEEE International Symposium on Circuits and Systems 1–4 (IEEE, 2017).
12. Guan, Y., Yuan, Z., Sun, G. & Cong, J. FPGA-based accelerator for long short-term memory recurrent neural networks. In Proc. 2017 22nd Asia and South Pacific Design Automation Conference 629–634 (IEEE, 2017).
13. Zhang, Y. et al. A power-efficient accelerator based on FPGAs for LSTM network. In Proc. 2017 IEEE International Conference on Cluster Computing 629–630 (IEEE, 2017).
14. Conti, F., Cavigelli, L., Paulin, G., Susmelj, I. & Benini, L. Chipmunk: a systolically scalable 0.9 mm², 3.08 Gop/s/mW @ 1.2 mW accelerator for near-sensor recurrent neural network inference. In 2018 IEEE Custom Integrated Circuits Conference (CICC) 1–4 (IEEE, 2018).
15. Gao, C., Neil, D., Ceolini, E., Liu, S.-C. & Delbruck, T. DeltaRNN. In Proc. 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays 21–30 (ACM, 2018); http://dl.acm.org/citation.cfm?doid=3174243.3174261.
16. Rizakis, M., Venieris, S. I., Kouris, A. & Bouganis, C.-S. Approximate FPGA-based LSTMs under computation time constraints. In 14th International Symposium in Applied Reconfigurable Computing (ARC) (eds Voros, N. et al.) 3–15 (Springer, Cham, 2018).
17. Chua, L. Memristor—the missing circuit element. IEEE Trans. Circuit Theory 18, 507–519 (1971).
18. Strukov, D. B., Snider, G. S., Stewart, D. R. & Williams, R. S. The missing memristor found. Nature 453, 80–83 (2008).
19. Yang, J. J., Strukov, D. B. & Stewart, D. R. Memristive devices for computing. Nat. Nanotech. 8, 13–24 (2013).
20. Li, C. et al. Analogue signal and image processing with large memristor crossbars. Nat. Electron. 1, 52–59 (2018).
21. Le Gallo, M. et al. Mixed-precision in-memory computing. Nat. Electron. 1, 246–253 (2018).
22. Prezioso, M. et al. Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature 521, 61–64 (2015).
23. Burr, G. W. et al. Experimental demonstration and tolerancing of a large-scale neural network (165 000 synapses) using phase-change memory as the synaptic weight element. IEEE Trans. Electron. Devices 62, 3498–3507 (2015).
24. Yu, S. et al. Binary neural network with 16 Mb RRAM macro chip for classification and online training. In 2016 IEEE International Electron Devices Meeting (IEDM) 16.2.1–16.2.4 (IEEE, 2016).
25. Yao, P. et al. Face classification using electronic synapses. Nat. Commun. 8, 15199 (2017).
26. Hu, M. et al. Memristor-based analog computation and neural network classification with a dot product engine. Adv. Mater. 30, 1705914 (2018).
27. Li, C. et al. Efficient and self-adaptive in-situ learning in multilayer memristor neural networks. Nat. Commun. 9, 2385 (2018).
28. Xu, X. et al. Scaling for edge inference of deep neural networks. Nat. Electron. 1, 216–222 (2018).
29. Jeong, D. S. & Hwang, C. S. Nonvolatile memory materials for neuromorphic intelligent machines. Adv. Mater. 30, 1704729 (2018).
30. Du, C. et al. Reservoir computing using dynamic memristor for temporal information processing. Nat. Commun. 8, 2204 (2017).
31. Smagulova, K., Krestinskaya, O. & James, A. P. A memristor-based long short term memory circuit. Analog. Integr. Circ. Sig. Process 95, 467–472 (2018).
32. Jiang, H. et al. Sub-10 nm Ta channel responsible for superior performance of a HfO₂ memristor. Sci. Rep. 6, 28525 (2016).
33. Yi, W. et al. Quantized conductance coincides with state instability and excess noise in tantalum oxide memristors. Nat. Commun. 7, 11142 (2016).
34. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
35. Mozer, M. C. A focused backpropagation algorithm for temporal pattern recognition. Complex Syst. 3, 349–381 (1989).
36. Werbos, P. J. Generalization of backpropagation with application to a recurrent gas market model. Neural Netw. 1, 339–356 (1988).
37. Chollet, F. Keras: deep learning library for Theano and TensorFlow. Keras https://keras.io (2015).
38. International Airline Passengers: Monthly Totals in Thousands. Jan 49 – Dec 60. DataMarket https://datamarket.com/data/set/22u3/international-airline-passengers-monthly-totals-in-thousands-jan-49-dec-60 (2014).
39. Phillips, P. J., Sarkar, S., Robledo, I., Grother, P. & Bowyer, K. The gait identification challenge problem: data sets and baseline algorithm. In Proc. 16th International Conference on Pattern Recognition Vol. 1, 385–388 (IEEE, 2002).
40. Kale, A. et al. Identification of humans using gait. IEEE Trans. Image Process. 13, 1163–1173 (2004).
41. Tieleman, T. & Hinton, G. Lecture 6.5—RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4, 26–31 (2012).
42. Choi, S. et al. SiGe epitaxial memory for neuromorphic computing with reproducible high performance based on engineered dislocations. Nat. Mater. 17, 335–340 (2018).
43. Burgt, Y. et al. A non-volatile organic electrochemical device as a low-voltage artificial synapse for neuromorphic computing. Nat. Mater. 16, 414–418 (2017).
44. Ambrogio, S. et al. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature 558, 60–67 (2018).
45. Sheridan, P. M., Cai, F., Du, C., Zhang, Z. & Lu, W. D. Sparse coding with memristor networks. Nat. Nanotech. 12, 784–789 (2017).
46. Shafiee, A. et al. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proc. 43rd International Symposium on Computer Architecture 14–26 (IEEE, 2016).
47. Gokmen, T. & Vlasov, Y. Acceleration of deep neural network training with resistive cross-point devices: design considerations. Front. Neurosci. 10, 33 (2016).
48. Cheng, M. et al. TIME: a training-in-memory architecture for memristor-based deep neural networks. In Proc. 54th Annual Design Automation Conference 26 (ACM, 2017).
49. Song, L., Qian, X., Li, H. & Chen, Y. PipeLayer: a pipelined ReRAM-based accelerator for deep learning. In 2017 IEEE International Symposium on High Performance Computer Architecture 541–552 (IEEE, 2017).

## Acknowledgements

This work was supported in part by the US Air Force Research Laboratory (grant no. FA8750-15-2-0044) and the Intelligence Advanced Research Projects Activity (IARPA; contract no. 2014-14080800008). D.B., an undergraduate from Swarthmore College, was supported by the NSF Research Experience for Undergraduates (grant no. ECCS-1253073) at the University of Massachusetts. P.Y. was visiting from Huazhong University of Science and Technology with support from the Chinese Scholarship Council (grant no. 201606160074). Part of the device fabrication was conducted in the cleanroom of the Center for Hierarchical Manufacturing, an NSF Nanoscale Science and Engineering Center located at the University of Massachusetts Amherst.

## Author information

Q.X. and C.L. conceived the idea. Q.X., J.J.Y. and C.L. designed the experiments. C.L., Z.W. and D.B. carried out programming, measurements, data analysis and simulation. M.R., P.Y., C.L., H.J., N.G. and P.L. built the integrated chips. Y.L., C.L., W.S., M.H., Z.W. and J.P.S. built the measurement system and firmware. Q.X., C.L., J.J.Y. and R.S.W. wrote the manuscript. M.B., Q.W. and all other authors contributed to the results analysis and commented on the manuscript.

### Competing interests

The authors declare no competing interests.

Correspondence to J. Joshua Yang or Qiangfei Xia.

## Supplementary information

### Supplementary Information

Figures, Notes and References


#### DOI

https://doi.org/10.1038/s42256-018-0001-4
