Abstract
Recent breakthroughs in recurrent deep neural networks with long short-term memory (LSTM) units have led to major advances in artificial intelligence. However, state-of-the-art LSTM models with significantly increased complexity and a large number of parameters have a bottleneck in computing power resulting from both limited memory capacity and limited data communication bandwidth. Here we demonstrate experimentally that the synaptic weights shared in different time steps in an LSTM can be implemented with a memristor crossbar array, which has a small circuit footprint, can store a large number of parameters and offers in-memory computing capability that contributes to circumventing the ‘von Neumann bottleneck’. We illustrate the capability of our crossbar system as a core component in solving real-world problems in regression and classification, which shows that memristor LSTM is a promising low-power and low-latency hardware platform for edge inference.
Main
The recent success of artificial intelligence largely results from advances in deep neural networks, which have a variety of architectures^{1}, with the long short-term memory (LSTM) network being an important one^{2,3}. By enabling the learning process to remember or forget the history of observations, LSTM-based recurrent neural networks (RNNs) are responsible for recent achievements in analysing temporal sequential data for applications such as data prediction^{4,5}, natural language understanding^{6,7}, machine translation^{8}, speech recognition^{9} and video surveillance^{10}. However, when implemented in conventional digital hardware, the complicated structures of LSTM networks lead to drawbacks connected with inference latency and power consumption. These issues are becoming increasingly prominent as more applications involve the processing of temporal data near the source in the era of the Internet of Things (IoT). Although there has been an increased level of effort in designing novel architectures to accelerate LSTM-based neural networks^{11–16}, low parallelism and limited bandwidth between computing and memory units remain outstanding issues. It is therefore imperative to seek an alternative computing paradigm for LSTM networks.
A memristor is a two-terminal ‘memory resistor’^{17,18} that performs computation via physical laws at the same location where information is stored^{19}. This feature entirely removes the need for data transfer between the memory and computation. Built into a crossbar architecture, memristors have been successfully employed in feedforward fully connected neural networks^{20,21,22,23,24,25,26,27}, which have shown significant advantages in terms of power consumption and inference latency over their CMOS-based counterparts^{28,29}. The short-term memory effects of some memristors have also been utilized for reservoir computing^{30}. On the other hand, most state-of-the-art deep neural networks, in which LSTM units are responsible for the recent success of temporal data processing, are built with more sophisticated architectures than fully connected networks. The memristor crossbar implementation of an LSTM has yet to be demonstrated, primarily because of the relative scarcity of large memristor arrays.
In this Article we demonstrate our experimental implementation of a core part of LSTM networks in memristor crossbar arrays. The memristors were monolithically integrated onto transistors to form one-transistor one-memristor (1T1R) cells. By connecting a fully connected network to a recurrent LSTM network, we executed in situ training and inference of this multilayer LSTM-based RNN for both regression and classification problems, with all matrix multiplications and updates during training and inference physically implemented on a memristor crossbar interfacing with digital computing. The memristor LSTM network experiments succeeded in predicting airline passenger numbers and identifying an individual human based on gait. This work shows that LSTM networks built in memristor crossbar arrays represent a promising alternative computing paradigm with high speed–energy efficiency.
Results
Memristor crossbar array for LSTM
Neural networks containing LSTM units are recurrent; that is, they not only fully connect the nodes in different layers, but also recurrently connect the nodes in the same layer at different time steps, as shown in Fig. 1a. The recurrent connections in LSTM units also involve gated units to control the remembering and forgetting, which enable the learning of long-term dependencies^{2,3}. The data flow in a standard LSTM unit is shown in Fig. 1b and is characterized by equation (1) (linear matrix operations) and equation (2) (gated nonlinear activations), or equivalently by equations (3) to (5) in the Methods.
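Assuming the standard LSTM formulation and the hatted symbols defined in the next paragraph, equations (1) and (2) can be written as follows. The stacked-matrix layout of equation (1) is an assumption made for illustration, and the hatted quantities are taken to be the pre-activation outputs of the linear stage.

```latex
% Equation (1): linear matrix operations (the part mapped onto the crossbar array)
\begin{equation}
\begin{pmatrix}
\widehat{\mathbf{a}}^{t} \\ \widehat{\mathbf{i}}^{t} \\ \widehat{\mathbf{f}}^{t} \\ \widehat{\mathbf{o}}^{t}
\end{pmatrix}
=
\begin{pmatrix}
\mathbf{W}_{a} & \mathbf{U}_{a} & \mathbf{b}_{a} \\
\mathbf{W}_{i} & \mathbf{U}_{i} & \mathbf{b}_{i} \\
\mathbf{W}_{f} & \mathbf{U}_{f} & \mathbf{b}_{f} \\
\mathbf{W}_{o} & \mathbf{U}_{o} & \mathbf{b}_{o}
\end{pmatrix}
\begin{pmatrix}
\mathbf{x}^{t} \\ \mathbf{h}^{t-1} \\ 1
\end{pmatrix}
\tag{1}
\end{equation}

% Equation (2): gated nonlinear activations
\begin{equation}
\begin{aligned}
\widehat{\mathbf{c}}^{t} &= \sigma(\widehat{\mathbf{i}}^{t}) \odot \tanh(\widehat{\mathbf{a}}^{t})
  + \sigma(\widehat{\mathbf{f}}^{t}) \odot \widehat{\mathbf{c}}^{t-1}, \\
\mathbf{h}^{t} &= \sigma(\widehat{\mathbf{o}}^{t}) \odot \tanh(\widehat{\mathbf{c}}^{t})
\end{aligned}
\tag{2}
\end{equation}
```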
where x^{t} is the input vector at the present step, h^{t} and h^{t−1} are the output vectors at the present and previous time steps, respectively, \({\widehat{{\bf{c}}}}^{t}\) is the internal cell state, and \(\odot \) represents the element-wise multiplication. σ is the logistic sigmoid function, which yields \({\widehat{{\bf{i}}}}^{t},\,{\widehat{{\bf{f}}}}^{t}\,{\rm{and}}\,{\widehat{{\bf{o}}}}^{t}\) for the input, forget and output gates. The model parameters are stored in weights W, recurrent weights U and bias parameters b for cell activation (a) and each gate (i, f, o), respectively. Because of this complicated structure, state-of-the-art deep RNNs involving LSTM units include massive quantities of model parameters, typically exceeding the normal capacity of on-chip memory (usually static random access memory, SRAM), and sometimes even off-chip main memory (usually dynamic random access memory, DRAM). Consequently, inference and training with the network will require the parameters to be transferred to the processing unit from a separate chip for computation, and data communication between the chips greatly limits the performance of LSTM-based RNNs on conventional hardware.
To address this issue, we have adopted a memristor crossbar array for an RNN and store the large number of parameters required by an LSTM–RNN as the conductances of the memristors. The topography of this neural network architecture, together with the data flow direction, is shown in Fig. 1c. The linear matrix multiplications are performed in situ in a memristor crossbar array, removing the need to transfer weight values back and forth. The model parameters are stored within the same memristor crossbar array that performs the analogue matrix multiplications. We connected an LSTM layer to a fully connected layer for the experiments described here, and the layers can be cascaded into more complicated structures in the future. For demonstration purposes, the gated unit in the LSTM layer and the nonlinear unit in the fully connected layer were implemented in software in the present work, but they can be implemented by analogue circuits^{31} without digital conversions to substantially reduce the energy consumption and inference latency.
The analogue matrix unit in our LSTM was implemented in a 128 × 64 1T1R crossbar array with memristors monolithically integrated on top of a commercial foundry-fabricated transistor array^{20} (Fig. 2a,b). The integrated Ta/HfO_{2} memristors exhibited stable multilevel conductance (Fig. 2c), enabling matrix multiplication in the analogue domain^{20,26,27,32}. With transistors controlling the compliance current, the integrated memristor array was programmed by loading a predefined conductance matrix with a write-and-verify approach (previously employed for analogue signal and image processing^{20} and ex situ training of fully connected neural networks^{26}) or by a simple two-pulse scheme (previously used for in situ training of fully connected neural networks^{27}; also used for in situ training of the LSTM network in this work). Inference in the LSTM layer was executed by applying voltages on the row wires of the memristor crossbar array and reading out the electrical current through the virtually grounded column wires. The readout current vector is the dot product of the memristor conductance matrix and the input voltage-amplitude vector, which was obtained directly by physical laws (Ohm’s law for multiplication and Kirchhoff’s current law for summation). Each parameter in the LSTM model was encoded by the conductance difference between two memristors in the same column, and subtraction was calculated in the crossbar array by applying voltages with the same amplitude but different polarities on the corresponding row wires (Fig. 2a). The applied voltage amplitude on the rows that connect to the memristors for bias representation was fixed across all the samples and time steps.
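The differential read-out described above can be sketched numerically. This is a minimal NumPy illustration of an ideal crossbar, not the measurement code: the conductance range, the array sizes and the helper name `crossbar_vmm` are assumptions chosen for illustration, and wire resistance, noise and device variation are ignored.

```python
import numpy as np

def crossbar_vmm(v_in, g_pos, g_neg):
    """Ideal crossbar read-out with differential conductance pairs.

    v_in  : (n_inputs,) input voltage amplitudes in volts
    g_pos : (n_inputs, n_cols) conductances on rows driven with +v
    g_neg : (n_inputs, n_cols) conductances on rows driven with -v
    Returns the column currents in amperes.
    """
    # Ohm's law gives each device current; Kirchhoff's current law sums
    # the currents on every virtually grounded column wire.
    return v_in @ g_pos + (-v_in) @ g_neg

rng = np.random.default_rng(0)
n_inputs, n_cols = 17, 60                                  # illustrative sizes
g_pos = rng.uniform(100e-6, 600e-6, (n_inputs, n_cols))    # assumed range, siemens
g_neg = rng.uniform(100e-6, 600e-6, (n_inputs, n_cols))
v_in = rng.uniform(0.0, 0.2, n_inputs)

currents = crossbar_vmm(v_in, g_pos, g_neg)
# Same result as one multiplication by the signed effective weights.
effective = v_in @ (g_pos - g_neg)
```

Driving the paired rows with opposite polarities is what lets strictly positive conductances represent signed weights: the column current depends only on the difference `g_pos - g_neg`.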
The experimental readout currents from the memristor crossbar array comprise four parts, representing vectors \({\widehat{{\bf{a}}}}^{t},\,{\widehat{{\bf{i}}}}^{t},\,{\widehat{{\bf{f}}}}^{t}\) and \({\widehat{{\bf{o}}}}^{t}\) as described in equation (1), which were nonlinearly activated and gated (equation (2)) and converted to voltages (performed in software in the present work). The voltage vector (h^{t}) was then fed into the next layer (a fully connected layer) and recurrently to the LSTM layer itself at the next time step (h^{t−1}) (Fig. 1c).
The neural network was trained in situ within the memristor crossbar array to compensate for possible hardware imperfections, such as limited device yield, variation and noise in the conductance states^{33}, wire resistance, analogue peripheral asymmetry and so on. Before training, all memristor conductances were initialized by set voltage pulses across the memristor devices and simultaneous fixed-amplitude pulses to the transistor gates. During training, initial inferences were performed on a batch of sequential data (minibatch) and yielded sequential outputs. The memristor conductances were then adjusted to make the inference outputs closer to the target outputs (evaluated by a loss function, see Methods). The intended conductance update values (∆G) were calculated using the backpropagation through time (BPTT) algorithm^{34,35,36} (see Methods for details) with the help of off-chip electronics, and then applied to the memristor crossbar array experimentally. For memristors that needed the conductance to be decreased, we first applied a reset voltage pulse to their bottom electrodes (the top electrodes were grounded) to initialize the memristors to their low conductance states. We then applied synchronized set voltage pulses to the top electrodes, analogue voltage pulses to the transistor gates (ΔV_{gate} ∝ ΔG) and zero voltages to the bottom electrodes (grounded) to update the conductances of all memristors in the array (see Supplementary Fig. 1 for additional details). The conductance update can be performed on a row-by-row or column-by-column basis, as proposed in previous work^{27}. This two-pulse scheme has previously been demonstrated to be effective in achieving linear and symmetric memristor conductance updates^{27}.
The present work focuses on exploring the feasibility of neural networks with various architectures, in particular the LSTM network, built on emerging analogue devices such as memristors. For this purpose we developed a neural network framework in MATLAB with a Keras-like^{37} interface, which enables the implementation of an arbitrarily configured neural network architecture, and specifically in this work, the LSTM–fully connected network (see Supplementary Fig. 2 for the detailed architecture). The experimental memristor crossbar array executes the matrix multiplications during the forward and backward passes and the weight update, and it can be replaced with a simulated memristor crossbar array or a software backend using 32-bit floating-point arithmetic. This architecture enables a direct comparison of the crossbar neural network with the digital approach using the same algorithm and dataset. In the experimental crossbar implementation, the framework communicates with our custom-built off-chip measurement system^{26}, which supplies up to 128 different analogue voltages to the memristor crossbar array and senses up to 64 current channels from it simultaneously (see Methods for details), completing the matrix multiplication and weight update.
Regression experiment
We first applied the memristor LSTM in predicting the number of airline passengers for the next month, a typical example of a regression problem. We built a two-layer RNN in a 128 × 64 1T1R memristor crossbar array with each layer in a partition of the array. The input of the RNN was the number of air passengers in the present month, and the output was the projected number for the subsequent month. The RNN network structure is illustrated in Fig. 3a. We used 15 LSTM units with a total of 2,040 memristors (34 × 60 array) representing 1,020 synaptic weights (Fig. 3b), which took one data input, one fixed input for bias and 15 recurrent inputs from themselves. The second layer of the network was a fully connected layer with 15 inputs from the LSTM layer and another input as the bias. The recurrent weights in the LSTM units represent the learned knowledge on when and what to remember and forget, and therefore the output of the network was dependent on both present and previous inputs.
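The device and weight counts quoted for the LSTM layer follow from the input count, the four gate blocks per unit and the two-memristor encoding of each weight. The short calculation below reproduces that arithmetic; the variable names are illustrative only.

```python
n_units = 15                   # LSTM units
n_inputs = 1 + 1 + n_units     # data input + fixed bias input + recurrent inputs
n_blocks = 4                   # cell activation, input, forget and output gates

n_cols = n_blocks * n_units        # 60 column wires
n_weights = n_inputs * n_cols      # 17 x 60 signed weights
n_rows = 2 * n_inputs              # one differential pair per input: 34 row wires
n_memristors = n_rows * n_cols     # 34 x 60 devices

print(n_rows, n_cols, n_weights, n_memristors)  # prints: 34 60 1020 2040
```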
The dataset we chose for this prediction task included the airline passenger numbers per month from January 1949 to December 1960, with 144 observations^{38}, from which the first 96 data points were selected as the training set (only one sample, with a sequence length of 96) and the remaining 48 data points as the testing set (Fig. 3c). During the inference, the number of passengers was linearly converted to a voltage amplitude (normalized to between 0 V and 0.2 V so as not to disturb the memristor conductances). The final output electrical current was scaled back using the same coefficient as in the input normalization, together with a conductance-to-weight ratio specified in Table 1, to reflect the number of airline passengers. The training process was carried out to minimize the mean square error (equation (7)) between the data in the training set and the network output, by stochastic gradient descent through the BPTT algorithm (see Methods). The raw voltages applied on the memristor crossbar array and the raw output currents during the inference after 800 epochs are shown in Fig. 3d–g. The corresponding conductance and weight values were read out from the crossbar array and are shown in Supplementary Fig. 3, although they were not used for either the inference or the training processes. The experimental training result in Fig. 3c shows that the network learned to predict both the training data and the unseen testing data after 800 epochs of training.
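The data preparation for this experiment can be sketched as follows. This is an illustration under stated assumptions: the series is a synthetic placeholder rather than the real passenger counts, the normalization bounds are taken from the whole series (the text does not say how they were chosen), and the helper names `to_voltage` and `from_voltage` are hypothetical. Only the input-side scaling is shown; the conductance-to-weight ratio from Table 1 is not modelled.

```python
import numpy as np

V_MAX = 0.2  # volts, upper end of the input range quoted in the text

def to_voltage(x, x_min, x_max, v_max=V_MAX):
    """Linearly map raw values in [x_min, x_max] onto [0, v_max] volts."""
    return (x - x_min) / (x_max - x_min) * v_max

def from_voltage(v, x_min, x_max, v_max=V_MAX):
    """Inverse mapping, used to rescale a prediction back to raw units."""
    return v / v_max * (x_max - x_min) + x_min

# Placeholder series standing in for the 144 monthly observations.
series = np.linspace(100.0, 600.0, 144)
train, test = series[:96], series[96:]          # 96 training, 48 testing points

x_min, x_max = series.min(), series.max()       # assumed normalization bounds
v_train = to_voltage(train, x_min, x_max)

# One-step-ahead pairs: the input is month t, the target is month t + 1.
inputs, targets = v_train[:-1], v_train[1:]
```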
Classification experiment
We further applied our memristor LSTM–RNN to identify an individual human by the person’s gait. The gait as a biometric feature has a unique advantage when identifying a human from a distance, as other biometrics (for example, the face) occupy too few pixels to be recognizable. It becomes increasingly important in circumstances in which face recognition is not feasible, for example because of camouflage and/or a lack of illumination. To use gait in a surveillance application scenario, it is preferable to deploy many cameras and perform the inference locally rather than sending the raw video data back to a server in the cloud. Inference near the source should be performed with low power and small communication bandwidth, but still achieve low latency.
The memristor LSTM–RNN took a feature vector extracted from a video frame as the input, and output the classification result as electrical current at the end of the sequence (Fig. 4a). The two-layer RNN was implemented by partitioning a 128 × 64 memristor crossbar array (Fig. 4b) in which 14 LSTM units in the first layer were fully connected to a 50-dimensional input vector with 64 × 56 connections (implemented in a 128 × 56 memristor crossbar array). The 14 LSTM units were also fully connected to eight output nodes. The classification result is given by the output node with the largest value in the fully connected layer.
To demonstrate the core operation of the memristor LSTM network, feature vectors for the input of the LSTM–RNN were extracted from video frames by software. Human silhouettes with 128 × 88 pixels were first pre-extracted from the raw video frames in the USF-NIST gait dataset^{39} and then processed into 128-dimensional width-profile vectors^{40}. The vectors were then downsampled to 50 dimensions to fit the size of our crossbar array (Fig. 4c) and normalized to between −0.2 V and 0.2 V to match the input voltage range. We chose video sequences from eight different people out of 75 in the original dataset (Table 2). The videos cover various scenarios, with people wearing two different pairs of shoes on two different surface types (grass or concrete) taken from two different viewpoints (eight covariances). The video sequences were further segmented into 664 sequences, each with 25 frames, as described in detail in the Methods and Supplementary Fig. 4. The training was performed on 597 sequences randomly drawn from the dataset, while the remaining 67 unseen sequences were used for the classification test. In state-of-the-art deep neural networks, the feature vectors that feed into the LSTM layer are usually extracted by multiple convolutional layers and/or fully connected layers without much human knowledge. The feature extraction step could also be implemented in a memristor crossbar array when multiple arrays are available in the near future^{20}.
The training and inference processes were experimentally performed, in part, in the memristor crossbar array, with a procedure similar to that in the regression experiment (see Methods). The goal of the training was to minimize the cross-entropy in the Bayesian probability (equation (8) in the Methods), which is the loss function calculated from the last-time-step electrical current and the ground truth (Fig. 4a). The desired weight update values were optimized with root-mean-square propagation (RMSprop)^{41} based on the weight gradient calculated by the BPTT algorithm (mostly in software in the present work) and then experimentally applied to the memristor crossbar array after the inference operation on one minibatch of 50 training sequences, using the two-pulse conductance update scheme. The mean cross-entropy during the inference of each minibatch was calculated in software and is shown in Fig. 4d, from which one sees the effectiveness of the training with the memristor crossbar array. The raw input voltages and output currents, as well as the memristor conductance maps after the training experiment, are provided in Supplementary Fig. 5 and Supplementary Fig. 6, respectively. The classification test was conducted on the separate testing set after training on each epoch. The classification accuracy increased steadily during the training, and the maximum accuracy within the 50 epochs of training was 79.1%, confirming that the in situ training of the LSTM network is effective.
In addition to the successful demonstration of in situ training on the memristor LSTM network in both the regression and classification experiments, we quantitatively compared our result to the digital counterpart by training with the same neural network architecture and hyperparameters, and on the same dataset. The results of the classification experiment are provided in Supplementary Fig. 7b (more data on the regression experiment are provided in Supplementary Fig. 7a). Our experimental result matches the simulation with a 1.4% conductance update error, showing that using memristors in an LSTM network and obtaining an accuracy comparable with the digital approach will require a higher accuracy in conductance tuning as well as other improvements in the crossbar array. This may be achieved with improved devices^{42,43} and a better weight update scheme^{44}, by constructing a larger network for increased redundancy^{27} or by employing multiple memristors to represent one synaptic weight. The LSTM network is more sensitive to weight update error than a fully connected network for pattern recognition^{27}, suggesting different architectures may impose different performance requirements on emerging analogue devices.
Discussion
In summary, we have built multilayer RNNs with a memristor LSTM layer and a memristor fully connected layer. A major reason for using memristor networks for LSTM and other machine intelligence applications is their promise in terms of speed and energy efficiency. The advantages of analogue in-memory linear algebra for inference are well established^{20,21,22,23,24,25,26,27,28,44,45}. The computation could stay in the analogue domain, with the analogue input signals acquired directly from sensors, and the analogue output signals activated with nonlinear device response or circuits. The entire operation could be performed in a single time step, and so the latency would not scale with the size of the array. Our proof-of-concept work employs off-chip analogue-to-digital conversions (ADCs) for demonstration purposes. Even with the ADC overhead, a mixed-signal system at a scaled technology node projects significant advantages over an all-digital system^{46,47,48,49}. The successful demonstrations for both regression and classification tasks show the versatility of connecting the memristor neural network layers in different configurations. The results open up a new direction for integrating multiple memristor crossbar arrays with different configurations on the same chip, which will minimize data transfer and significantly reduce the inference latency and power consumption in deep RNNs.
Methods
1T1R array integration
The transistor array was fabricated in a commercial foundry using the 2 µm technology node. The memristors were then integrated in a university cleanroom. The transistors were used as selector devices to mitigate the sneak path problem in the crossbar array and to enable precise conductance tuning. Two layers of metal wires were also fabricated in the foundry back-end-of-the-line process as row and column wires to reduce the wire resistance (about 0.3 Ω between cells). The low wire resistance in the array is one of the key factors providing accurate matrix multiplication. The memristors were fabricated on top of the transistor array in the UMass Amherst cleanroom, with sputtered palladium as the bottom electrode, atomic-layer-deposited hafnium dioxide as the switching layer and sputtered tantalum as the top electrode.
Measurement system
A custom-built off-chip multi-board measurement system was used during the experiment; the details of this system have been reported previously^{26}. Eight 16-bit digital-to-analogue convertors (DACs) (Analog Devices, LTC2668, least significant bit = 0.3 mV) were used to supply different analogue voltages to the 128 rows simultaneously. The current was sensed and converted to voltage through transimpedance amplifiers attached to each column. Eight 14-bit ADCs were used to convert the analogue voltages from the 64 columns to digital values before being transmitted to the microcontroller (Microchip Technology, PIC32MX795F512). A fixed linear relation was used to convert the DAC/ADC digitized command/reading to voltage/current values in the microcontroller. The present measurement system was optimized for flexibility in the application demonstration rather than speed and energy efficiency.
Inference in the two-layer LSTM–RNN
The network in this work had two layers, the LSTM as the first layer and a fully connected layer as the second. The algorithm can be extended to more layers because of the cascaded structure. The input to the network was x^{t}, and the LSTM cell activation a^{t} was calculated as shown in equation (3):
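Assuming the standard LSTM formulation with a hyperbolic tangent cell activation (consistent with the tanh′ term used later in the Methods), equation (3) takes the form:

```latex
\begin{equation}
\mathbf{a}^{t} = \tanh\!\left(\mathbf{W}_{a}\,\mathbf{x}^{t} + \mathbf{U}_{a}\,\mathbf{h}^{t-1} + \mathbf{b}_{a}\right)
\tag{3}
\end{equation}
```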
where h^{t−1} is the recurrent input, which is also the output of this layer at step t−1 (which will be introduced later), W_{a} and U_{a} are the synaptic weight matrices for the input and recurrent input, respectively, and b_{a} is the bias vector. The same notation applies to the following equations.
The input gate (i^{t}), forget gate (f^{t}) and output gate (o^{t}) that control the output are defined in equation (4):
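Under the same assumption of a standard LSTM, with the logistic sigmoid σ as the gate nonlinearity, equation (4) can be written as:

```latex
\begin{equation}
\begin{aligned}
\mathbf{i}^{t} &= \sigma\!\left(\mathbf{W}_{i}\,\mathbf{x}^{t} + \mathbf{U}_{i}\,\mathbf{h}^{t-1} + \mathbf{b}_{i}\right), \\
\mathbf{f}^{t} &= \sigma\!\left(\mathbf{W}_{f}\,\mathbf{x}^{t} + \mathbf{U}_{f}\,\mathbf{h}^{t-1} + \mathbf{b}_{f}\right), \\
\mathbf{o}^{t} &= \sigma\!\left(\mathbf{W}_{o}\,\mathbf{x}^{t} + \mathbf{U}_{o}\,\mathbf{h}^{t-1} + \mathbf{b}_{o}\right)
\end{aligned}
\tag{4}
\end{equation}
```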
The output of the LSTM layer (as the hidden layer output in the two-layer RNN) was determined using equation (5):
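In the standard LSTM formulation assumed here, the internal cell state c^{t} combines the gated cell activation with the gated previous state, and the layer output follows from the output gate:

```latex
\begin{equation}
\begin{aligned}
\mathbf{c}^{t} &= \mathbf{i}^{t} \odot \mathbf{a}^{t} + \mathbf{f}^{t} \odot \mathbf{c}^{t-1}, \\
\mathbf{h}^{t} &= \mathbf{o}^{t} \odot \tanh\!\left(\mathbf{c}^{t}\right)
\end{aligned}
\tag{5}
\end{equation}
```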
Equations (3), (4) and (5) are equivalent to equations (1) and (2) in the main text, in which the linear and nonlinear operations are separated for easier comprehension.
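The split between a linear stage and a nonlinear stage can be illustrated with a short NumPy sketch of one LSTM time step. It assumes the standard LSTM formulation; the stacked weight layout, the column order (a, i, f, o) and the function name `lstm_step` are assumptions made for illustration and are not taken from the authors' MATLAB framework.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, weights):
    """One LSTM time step, split into a linear and a nonlinear stage.

    weights : (n_in + n_hidden + 1, 4 * n_hidden) matrix stacking the input
              weights, recurrent weights and biases for a, i, f and o.
    """
    # Linear stage: a single vector-matrix product, the part that maps
    # onto the crossbar array.
    z = np.concatenate([x, h_prev, [1.0]]) @ weights
    a_hat, i_hat, f_hat, o_hat = np.split(z, 4)
    # Nonlinear stage: activations and gating (done in software in this work).
    a = np.tanh(a_hat)
    i, f, o = sigmoid(i_hat), sigmoid(f_hat), sigmoid(o_hat)
    c = i * a + f * c_prev
    h = o * np.tanh(c)
    return h, c

# Usage on a short random sequence (sizes mirror the regression network).
rng = np.random.default_rng(1)
n_in, n_hidden, n_steps = 1, 15, 5
weights = rng.normal(0.0, 0.1, (n_in + n_hidden + 1, 4 * n_hidden))
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.uniform(0.0, 0.2, (n_steps, n_in)):
    h, c = lstm_step(x_t, h, c, weights)
```

Because the same `weights` matrix is reused at every time step, a single crossbar partition can serve the whole sequence.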
The final output of the RNN is read out by a fully connected layer, the function of which is characterized by equation (6):
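A plausible form of equation (6), in which the symbols \(\mathbf{W}_{\mathrm{FC}}\) and \(\mathbf{b}_{\mathrm{FC}}\) are hypothetical names for the weights and bias of the fully connected layer, is:

```latex
\begin{equation}
\mathbf{y}^{t} = f\!\left(\mathbf{W}_{\mathrm{FC}}\,\mathbf{h}^{t} + \mathbf{b}_{\mathrm{FC}}\right)
\tag{6}
\end{equation}
```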
where f is the nonlinear activation function in the fully connected layer. Specifically, we used the logistic sigmoid function in the airline prediction experiment and the softmax function in the human gait identification experiment.
Training with BPTT
The goal of the training process was to minimize a loss function \({\mathscr{L}}\), which is a function of the network output y^{t} and the targets \({{\bf{y}}}_{{\rm{target}}}^{t}\) (ground truth or labels). Specifically, we chose the mean square error loss over all time steps for the airline prediction experiment (equation (7)) and cross-entropy loss on the last time step for the human gait identification experiment (equation (8)):
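Up to a constant normalization factor, which is an assumption here, the two losses can be written in the following standard forms. The class index k in equation (8) is a hypothetical symbol introduced for illustration.

```latex
% Equation (7): mean square error over all samples and time steps
\begin{equation}
\mathscr{L} = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T}
\left\lVert \mathbf{y}^{t}(n) - \mathbf{y}_{\mathrm{target}}^{t}(n) \right\rVert^{2}
\tag{7}
\end{equation}

% Equation (8): cross-entropy evaluated on the last time step T only
\begin{equation}
\mathscr{L} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k}
y_{\mathrm{target},k}^{T}(n)\,\ln y_{k}^{T}(n)
\tag{8}
\end{equation}
```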
where n indexes over the sample, N is the batch size, t is the temporal sequence number and T is the total time steps in the sequence.
The training (that is, model optimization) was based on the weight gradients of the loss function. Since the weights stayed the same in the same minibatch over all time steps, the gradients were accumulated before each weight update. The gradient of the loss function \({\mathscr{L}}\) with respect to the last-layer output for sample n at time step t is denoted \(\updelta {\widehat{{\bf{y}}}}^{t}=\frac{\partial {\mathscr{L}}}{\partial {{\bf{y}}}^{t}}\), and other gradients are notated similarly. The gradients were calculated using the BPTT algorithm^{35,36}. The last layer output delta was calculated with equation (9) for the airline prediction task and with equation (10) for the gait identification task. This particular step is calculated in software in the present work, but it can be implemented with a simple circuit as proposed in previous literature^{47}.
where σ′ is the derivative of the logistic sigmoid function, and similarly in the following equations, tanh′ is the derivative of the hyperbolic tangent function.
where T is the length of the temporal sequence.
The previous layer deltas were calculated with the chain rule:
where M^{⊺} denotes the transpose of the matrix M. The computationally expensive steps described in equations (11) and (18), which have complexity O(N^{2}) whereas the others are O(N), can be calculated in the crossbar array. In the present work, the error vectors and subsequently the weight gradients, which are the outer product of the delta vector and the stored input vector (delta rule), were calculated in software. There have been attempts at the simple hardware implementation of this step^{47}, but the applicability to our weight update scheme is still under evaluation.
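The structure of the backward pass can be illustrated with a compact NumPy sketch of BPTT for a single sequence. It is a simplified stand-in for equations (9) to (20), not a transcription of them: it assumes a linear read-out with a summed squared-error loss, a stacked weight layout with column order (a, i, f, o), and hypothetical function names. The matrix products in the backward direction are the steps that could be mapped onto the crossbar array; the outer products implement the delta rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(xs, W, V):
    """Run the LSTM and a linear read-out over a sequence, caching intermediates."""
    n_h = V.shape[1]
    h, c = np.zeros(n_h), np.zeros(n_h)
    cache, ys = [], []
    for x in xs:
        u = np.concatenate([x, h, [1.0]])            # stacked input [x; h_prev; 1]
        a_hat, i_hat, f_hat, o_hat = np.split(u @ W, 4)
        a, i, f, o = np.tanh(a_hat), sigmoid(i_hat), sigmoid(f_hat), sigmoid(o_hat)
        c_prev, c = c, i * a + f * c
        h = o * np.tanh(c)
        cache.append((u, a, i, f, o, c_prev, c, h))
        ys.append(V @ h)
    return np.array(ys), cache

def loss_fn(xs, targets, W, V):
    ys, _ = forward(xs, W, V)
    return 0.5 * np.sum((ys - targets) ** 2)

def bptt(xs, targets, W, V):
    """Gradients of the summed squared error with respect to W and V."""
    ys, cache = forward(xs, W, V)
    n_in, n_h = xs.shape[1], V.shape[1]
    dW, dV = np.zeros_like(W), np.zeros_like(V)
    dh_next, dc_next = np.zeros(n_h), np.zeros(n_h)
    for t in reversed(range(len(xs))):
        u, a, i, f, o, c_prev, c, h = cache[t]
        dy = ys[t] - targets[t]                      # output delta for squared error
        dV += np.outer(dy, h)                        # delta rule for the read-out
        dh = V.T @ dy + dh_next                      # read-out path plus recurrent path
        tanh_c = np.tanh(c)
        dc = dh * o * (1.0 - tanh_c ** 2) + dc_next
        # Deltas of the four pre-activations (tanh' and sigmoid' applied element-wise).
        d_pre = np.concatenate([
            dc * i * (1.0 - a ** 2),                 # cell activation a
            dc * a * i * (1.0 - i),                  # input gate i
            dc * c_prev * f * (1.0 - f),             # forget gate f
            dh * tanh_c * o * (1.0 - o),             # output gate o
        ])
        dW += np.outer(u, d_pre)                     # gradients accumulated over time steps
        du = W @ d_pre                               # same weights, used in the backward direction
        dh_next = du[n_in:n_in + n_h]
        dc_next = dc * f
    return dW, dV

rng = np.random.default_rng(2)
n_in, n_h, n_out, n_steps = 1, 4, 1, 6
W = rng.normal(0.0, 0.5, (n_in + n_h + 1, 4 * n_h))
V = rng.normal(0.0, 0.5, (n_out, n_h))
xs = rng.uniform(0.0, 0.2, (n_steps, n_in))
targets = rng.uniform(0.0, 0.2, (n_steps, n_out))
dW, dV = bptt(xs, targets, W, V)

# Finite-difference check of one weight entry.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (loss_fn(xs, targets, W_plus, V) - loss_fn(xs, targets, W_minus, V)) / (2 * eps)
```

Comparing `numeric` with `dW[0, 0]` is a quick sanity check that the analytic gradient agrees with a finite-difference estimate.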
The parameter (weights or bias) gradients were accumulated as described in equation (21):
The stochastic gradient descent with momentum (SGDM) optimizer that we used in the airline prediction problem yielded the desired weight update value by means of equation (22):
where η and α are the hyperparameters for momentum and learning rate, respectively.
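As an illustration, the textbook form of the momentum update is sketched below. Whether equation (22) uses exactly this sign convention and scaling is an assumption; the function name `sgdm_update`, the hyperparameter values and the toy quadratic objective are chosen only for demonstration.

```python
import numpy as np

def sgdm_update(grad, prev_update, lr, momentum):
    """Desired weight update for SGD with momentum (textbook form)."""
    return momentum * prev_update - lr * grad

# Toy problem: minimise 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
update = np.zeros_like(w)
for _ in range(200):
    update = sgdm_update(grad=w, prev_update=update, lr=0.1, momentum=0.9)
    w = w + update
```

Only the previous update has to be stored between iterations, which is one reason this optimizer is comparatively simple to support in hardware.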
In the gait identification experiment we used the RMSprop optimizer, which gives the desired weight update values through equation (23):
where β, E, α and η are hyperparameters, GRAD^{◦2} is the elementwise square operation on matrix GRAD and \(\oslash \) indicates the elementwise division operation.
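For comparison, a common textbook form of RMSprop is sketched below. The mapping of its parameters (`lr`, `decay`, `eps`) onto the symbols of equation (23) is an assumption, any momentum term is omitted, and the function name and toy objective are illustrative only.

```python
import numpy as np

def rmsprop_update(grad, mean_sq, lr=0.01, decay=0.9, eps=1e-8):
    """Textbook RMSprop: scale the gradient by a running RMS of past gradients."""
    mean_sq = decay * mean_sq + (1.0 - decay) * grad ** 2    # element-wise square
    update = -lr * grad / (np.sqrt(mean_sq) + eps)           # element-wise division
    return update, mean_sq

# Toy problem: minimise 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
mean_sq = np.zeros_like(w)
for _ in range(1000):
    update, mean_sq = rmsprop_update(grad=w, mean_sq=mean_sq)
    w = w + update
```

Unlike the momentum update, this rule needs a running average, a square root and a division for every weight, which illustrates why it is more expensive to implement in hardware.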
Finally, the desired weight values are updated in the memristor crossbar array by the two-pulse scheme with ΔV_{gate} ∝ ΔW, which can be performed row-by-row (set) and column-by-column (reset)^{27}. The calculation of the desired weight update values, either with SGDM or with RMSprop, was performed with software in this work. There are plausible proposals for implementing the SGDM algorithm in hardware^{47,48,49}, where an auxiliary memory array is placed near the memristor crossbar array to store and compute the synaptic weights. With the proposed weight update scheme, the same auxiliary memory can also be used to store the gate voltage matrix, and the weight gradient matrix is updated and accumulated directly in the gate voltage matrix. The RMSprop algorithm, on the other hand, is much more expensive to implement in hardware. In the present work, we chose RMSprop to speed up the training process during the relatively more complicated classification experiment. It can also be used when the system is targeted for inference only, where the training is facilitated by external electronics to compensate for the defects of the emerging memristor devices and the asymmetry of the peripheral circuits.
Hyperparameters
The hyperparameters used during the training experiment are presented in Table 1. They include both the hyperparameters for the neural network and the physical parameters to operate the memristor crossbar array.
USF-NIST gait dataset
We have full permission to use the videos of walking humans (http://www.cse.usf.edu/~sarkar/SudeepSarkar/Gait_Data.html), granted by the creator of the dataset. We picked a subset of video sequences from eight walking people out of the total of 75 in the original dataset. The video sequences were further segmented into 664 samples, each having an ID number corresponding to the person (see Table 2 for details). We randomly picked 90% of the samples (total of 597) as the training set, and the remaining 10% (total of 67) as the testing set.
Data availability
The data that support the plots within this paper and other findings of this study are available from the corresponding author upon reasonable request. The code that supports the plots within this Article and other findings of this study is available at http://github.com/lican81/memNN. The code that supports the communication between the custom-built measurement system and the integrated chip is available from the corresponding author upon reasonable request.
References
 1.
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
 2.
Hochreiter, S. & Schmidhuber, J. Long shortterm memory. Neural Comput. 9, 1735–1780 (1997).
 3.
Gers, F. A., Schmidhuber, J. & Cummins, F. Learning to forget: continual prediction with LSTM. Neural Comput. 12, 2451–2471 (2000).
 4.
Schmidhuber, J., Wierstra, D. & Gomez, F. Evolino: hybrid neuroevolution/optimal linear. In Proc 19th International Joint Conference on Artificial Intelligence 853–858 (Morgan Kaufmann, San Francisco, 2005).
 5.
Bao, W., Yue, J. & Rao, Y. A deep learning framework for financial time series using stacked autoencoders and longshort term memory. PLoS ONE 12, e0180944 (2017).
 6.
Jia, R. & Liang, P. Data recombination for neural semantic parsing. In Proc. 54th Annual Meeting of the Association for Computational Linguistics (eds Erk, K. & Smith, N. A.) 12–22 (Association for Computational Linguistics, 2016).
 7.
Karpathy, A. The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy Blog http://karpathy.github.io/2015/05/21/rnneffectiveness/ (2015).
 8.
Wu, Y. et al. Google’s neural machine translation system: bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).
 9.
Xiong, W. et al. The Microsoft 2017 conversational speech recognition system. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5934–5938 (IEEE, 2018).
 10.
Sudhakaran, S. & Lanz, O. Learning to detect violent videos using convolutional long short term memory. In Proc. 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) 1–6 (IEEE, 2017).
 11.
Chang, A. X. M. & Culurciello, E. Hardware accelerators for recurrent neural networks on FPGA. In Proc. 2017 IEEE International Symposium on Circuits and Systems 1–4 (IEEE, 2017).
 12.
Guan, Y., Yuan, Z., Sun, G. & Cong, J. FPGA-based accelerator for long short-term memory recurrent neural networks. In Proc. 2017 22nd Asia and South Pacific Design Automation Conference 629–634 (IEEE, 2017).
 13.
Zhang, Y. et al. A power-efficient accelerator based on FPGAs for LSTM network. In Proc. 2017 IEEE International Conference on Cluster Computing 629–630 (IEEE, 2017).
 14.
Conti, F., Cavigelli, L., Paulin, G., Susmelj, I. & Benini, L. Chipmunk: a systolically scalable 0.9 mm^{2}, 3.08 Gop/s/mW @ 1.2 mW accelerator for near-sensor recurrent neural network inference. In 2018 IEEE Custom Integrated Circuits Conference (CICC) 1–4 (IEEE, 2018).
 15.
Gao, C., Neil, D., Ceolini, E., Liu, S.-C. & Delbruck, T. DeltaRNN. In Proc. 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays 21–30 (ACM, 2018); http://dl.acm.org/citation.cfm?doid=3174243.3174261.
 16.
Rizakis, M., Venieris, S. I., Kouris, A. & Bouganis, C.-S. Approximate FPGA-based LSTMs under computation time constraints. In 14th International Symposium in Applied Reconfigurable Computing (ARC) (eds Voros, N. et al.) 3–15 (Springer, Cham, 2018).
 17.
Chua, L. Memristor—the missing circuit element. IEEE Trans. Circuit Theory 18, 507–519 (1971).
 18.
Strukov, D. B., Snider, G. S., Stewart, D. R. & Williams, R. S. The missing memristor found. Nature 453, 80–83 (2008).
 19.
Yang, J. J., Strukov, D. B. & Stewart, D. R. Memristive devices for computing. Nat. Nanotech. 8, 13–24 (2013).
 20.
Li, C. et al. Analogue signal and image processing with large memristor crossbars. Nat. Electron. 1, 52–59 (2018).
 21.
Le Gallo, M. et al. Mixed-precision in-memory computing. Nat. Electron. 1, 246–253 (2018).
 22.
Prezioso, M. et al. Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature 521, 61–64 (2015).
 23.
Burr, G. W. et al. Experimental demonstration and tolerancing of a large-scale neural network (165 000 synapses) using phase-change memory as the synaptic weight element. IEEE Trans. Electron. Devices 62, 3498–3507 (2015).
 24.
Yu, S. et al. Binary neural network with 16 Mb RRAM macro chip for classification and online training. In 2016 IEEE International Electron Devices Meeting (IEDM) 16.2.1–16.2.4 (IEEE, 2016).
 25.
Yao, P. et al. Face classification using electronic synapses. Nat. Commun. 8, 15199 (2017).
 26.
Hu, M. et al. Memristor-based analog computation and neural network classification with a dot product engine. Adv. Mater. 30, 1705914 (2018).
 27.
Li, C. et al. Efficient and self-adaptive in-situ learning in multilayer memristor neural networks. Nat. Commun. 9, 2385 (2018).
 28.
Xu, X. et al. Scaling for edge inference of deep neural networks. Nat. Electron. 1, 216–222 (2018).
 29.
Jeong, D. S. & Hwang, C. S. Nonvolatile memory materials for neuromorphic intelligent machines. Adv. Mater. 30, 1704729 (2018).
 30.
Du, C. et al. Reservoir computing using dynamic memristor for temporal information processing. Nat. Commun. 8, 2204 (2017).
 31.
Smagulova, K., Krestinskaya, O. & James, A. P. A memristor-based long short term memory circuit. Analog. Integr. Circ. Sig. Process 95, 467–472 (2018).
 32.
Jiang, H. et al. Sub-10 nm Ta channel responsible for superior performance of a HfO_{2} memristor. Sci. Rep. 6, 28525 (2016).
 33.
Yi, W. et al. Quantized conductance coincides with state instability and excess noise in tantalum oxide memristors. Nat. Commun. 7, 11142 (2016).
 34.
Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
 35.
Mozer, M. C. A focused backpropagation algorithm for temporal pattern recognition. Complex Syst. 3, 349–381 (1989).
 36.
Werbos, P. J. Generalization of backpropagation with application to a recurrent gas market model. Neural Netw. 1, 339–356 (1988).
 37.
Chollet, F. Keras: deep learning library for Theano and TensorFlow. Keras https://keras.io (2015).
 38.
International Airline Passengers: Monthly Totals in Thousands. Jan 49 – Dec 60. DataMarket https://datamarket.com/data/set/22u3/internationalairlinepassengersmonthlytotalsinthousandsjan49dec60 (2014).
 39.
Phillips, P. J., Sarkar, S., Robledo, I., Grother, P. & Bowyer, K. The gait identification challenge problem: data sets and baseline algorithm. In Proc. 16th International Conference on Pattern Recognition Vol. 1, 385–388 (IEEE, 2002).
 40.
Kale, A. et al. Identification of humans using gait. IEEE Trans. Image Process. 13, 1163–1173 (2004).
 41.
Tieleman, T. & Hinton, G. Lecture 6.5—RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4, 26–31 (2012).
 42.
Choi, S. et al. SiGe epitaxial memory for neuromorphic computing with reproducible high performance based on engineered dislocations. Nat. Mater. 17, 335–340 (2018).
 43.
Burgt, Y. et al. A non-volatile organic electrochemical device as a low-voltage artificial synapse for neuromorphic computing. Nat. Mater. 16, 414–418 (2017).
 44.
Ambrogio, S. et al. Equivalent-accuracy accelerated neural-network training using analogue memory. Nature 558, 60–67 (2018).
 45.
Sheridan, P. M., Cai, F., Du, C., Zhang, Z. & Lu, W. D. Sparse coding with memristor networks. Nat. Nanotech. 12, 784–789 (2017).
 46.
Shafiee, A. et al. ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proc. 43rd International Symposium on Computer Architecture 14–26 (IEEE, 2016).
 47.
Gokmen, T. & Vlasov, Y. Acceleration of deep neural network training with resistive cross-point devices: design considerations. Front. Neurosci. 10, 33 (2016).
 48.
Cheng, M. et al. TIME: a training-in-memory architecture for memristor-based deep neural networks. In Proc. 54th Annual Design Automation Conference 26 (ACM, 2017).
 49.
Song, L., Qian, X., Li, H. & Chen, Y. PipeLayer: a pipelined ReRAM-based accelerator for deep learning. In 2017 IEEE International Symposium on High Performance Computer Architecture 541–552 (IEEE, 2017).
Acknowledgements
This work was supported in part by the US Air Force Research Laboratory (grant no. FA87501520044) and the Intelligence Advanced Research Projects Activity (IARPA; contract no. 201414080800008). D.B., an undergraduate from Swarthmore College, was supported by the NSF Research Experience for Undergraduates (grant no. ECCS1253073) at the University of Massachusetts. P.Y. was visiting from Huazhong University of Science and Technology with support from the Chinese Scholarship Council (grant no. 201606160074). Part of the device fabrication was conducted in the cleanroom of the Center for Hierarchical Manufacturing, an NSF Nanoscale Science and Engineering Center located at the University of Massachusetts Amherst.
Author information
Contributions
Q.X. and C.L. conceived the idea. Q.X., J.J.Y. and C.L. designed the experiments. C.L., Z.W. and D.B. carried out programming, measurements, data analysis and simulation. M.R., P.Y., C.L., H.J., N.G. and P.L. built the integrated chips. Y.L., C.L., W.S., M.H., Z.W. and J.P.S. built the measurement system and firmware. Q.X., C.L., J.J.Y. and R.S.W. wrote the manuscript. M.B., Q.W. and all other authors contributed to the results analysis and commented on the manuscript.
Competing interests
The authors declare no competing interests.
Corresponding authors
Correspondence to J. Joshua Yang or Qiangfei Xia.
Supplementary information
Supplementary Information
Figures, Notes and References