Abstract
Neural networks trained by backpropagation have achieved tremendous successes on numerous intelligent tasks. However, naïve gradient-based training and updating methods on memristors impede applications due to intrinsic material properties. Here, we built a 39 nm 1 Gb phase change memory (PCM) memristor array and quantified the unique resistance drift effect. On this basis, a spontaneous sparse learning (SSL) scheme that leverages the resistance drift to improve PCM-based memristor network training is developed. During training, SSL regards the drift effect as a spontaneous consistency-based distillation process that continuously reinforces the array weights in the high-resistance state unless the gradient-based method switches them to low resistance. Experiments show that SSL not only helps the network converge but also achieves better performance and sparsity controllability without additional computation in handwritten digit classification. This work promotes the fusion of learning algorithms with the intrinsic properties of memristor devices, opening a new direction for the development of neuromorphic computing chips.
Introduction
Recently, the field of artificial intelligence (AI) has witnessed tremendous advances in deep neural networks (DNNs)^{1,2,3,4,5,6,7}. In DNNs, as a connectionist approach to AI, knowledge is represented by a hierarchically distributed pattern of activation nodes and a large number of weights storing the connection strength between the nodes^{8,9,10}. Since weights are trainable on relevant data without explicit rules, ‘learning’ can be defined naturally^{11}. However, to gain powerful generalization capabilities, DNNs usually have massive numbers of tunable weights; thus network training and inference are so data-intensive that computation efficiency is usually limited by memory access^{12,13}. These networks are highly amenable to computation via large and dense matrix–vector multiplications that can be highly parallelized^{14,15}. This has led to tremendous opportunities for hardware acceleration.
First, DNN accelerators and neuromorphic chips were designed with application-specific integrated circuits to alleviate the memory access bottleneck and speed up the computation^{16}. However, in this design approach, the synaptic weights are still stored in random access memories rather than directly encoded by the states of emerging analog devices and computed at the memory locations. Systems assembled from such analog devices enable a fundamentally non-von Neumann scheme and are believed to achieve significant speedup and power reduction for on-chip implementation of DNNs^{17,18,19,20,21}. One of the most attractive analog devices is the two-terminal memristor, such as phase change memory (PCM) and resistive random access memory (RRAM), which offers the advantages of high density, fast operation speed, and low-power consumption^{22,23}. The weight elements of DNNs can be represented by the conductance of memristors, which can be programmed by update pulses or read by low-amplitude reading pulses. Therefore, the weights in memristor networks can be accessed and tuned locally, making them extremely suitable for DNN hardware representation^{24}. When implemented in the crossbar structure, memristor networks are ideal substrates that directly perform multiply–accumulate operations (MACs) at the weight locations^{25}. This parallel computing property speeds up the training and inference of DNNs with significantly lower energy consumption, providing a promising computing paradigm for neuromorphic computing systems^{26,27,28}.
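The crossbar MAC described above can be sketched in a few lines. This is a minimal illustration, assuming a differential (two-memristor-per-weight) encoding; the conductance values, array sizes, and voltages are illustrative, not measured device parameters.

```python
import numpy as np

# Sketch of a crossbar MAC with a differential (2-memristor-cell) weight
# encoding. Conductances, sizes, and voltages are illustrative assumptions.
rng = np.random.default_rng(0)

n_in, n_out = 4, 3
weights = rng.choice([-1.0, 1.0], size=(n_out, n_in))  # target signed weights
g_on, g_off = 1e-4, 1e-6                               # example conductances (S)

# +1 -> (g_on, g_off), -1 -> (g_off, g_on): signed weights emerge from
# nonnegative conductances via the column-pair difference.
G_plus = np.where(weights > 0, g_on, g_off)
G_minus = np.where(weights > 0, g_off, g_on)

v = np.array([0.1, 0.0, 0.1, 0.1])                     # read voltages (V)
i_out = (G_plus - G_minus) @ v                         # Kirchhoff current sums

# Identical (up to a constant scale factor) to the intended weighted sum:
assert np.allclose(i_out, (g_on - g_off) * (weights @ v))
```

The single matrix product stands in for what the physical array computes in one parallel read: each output current is the sum of per-cell currents along a column.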
However, there is still a huge discrepancy between the ideal unrestricted weight change in DNNs and the actual conductance change in memristors, resulting in imprecise encoding of network weights and of the corresponding computation. The conductance change of memristors is not only nonlinear, asymmetric, and precision-limited, but also stochastic because of device-to-device and pulse-to-pulse variations^{29,30,31}. These properties generally lead to non-convergence when training real memristor neural networks. A promising solution to this issue is a co-optimization design that combines device properties (hardware insights) with DNN training techniques (software insights), such as quantization, pruning, and sparsification. Several approaches have been reported to restore the properties of the ideal weight, which ensures more stable performance but results in inefficiency and impracticality^{17,32,33}. Our studies indicated that a learning scheme able to exploit the stochasticity in PCM and incorporate brain-inspired approaches to enhance DNNs is essential to overcome this issue^{34,35}. At present, the development of memristors is still insufficient to meet the ideal requirements for DNN training. Among the emerging memories for synaptic devices, PCM exhibits advantages over RRAM in on/off ratio, endurance, and retention, and over MRAM in on/off ratio^{36}. In addition, PCM is a more mature technology, as it has been manufactured on a foundry basis. The physical mechanism of PCM is also well understood, which facilitates the IC design of chip products. Thus, it is practical to overcome the disadvantages of PCM and exploit its high reliability and suitability for DNN applications.
However, a critical issue for the realization of a PCM-based memristor neural network is that even if the weights represented by the resistance are precisely tuned to the ideal weights, its most cumbersome feature, the resistance drift, still remains; this has not been considered in previous studies^{31,37}. Resistance can deviate continuously from the initial intended values because PCM undergoes a spontaneous resistance increase due to structural relaxation after amorphization^{38}. The resistance drift not only causes a major reliability issue for PCM but also exacerbates the imprecise encoding of network weights in memristors. The conventional approach is to minimize the resistance drift by implementing error correction codes or adding specific periphery circuits, such as customized circuitry for rapid iterative writes, modulation coding for drift resilience, and drift-resilient readout schemes^{39}. However, such approaches result in a significant increase in complexity and power consumption, and a decrease in throughput.
Here we built a 39 nm 1 Gb PCM-based memristor neural network and quantified its non-ideal characteristics, especially the unique resistance drift effect. PCM weights can play a unique role in in-memory computation (e.g., measuring the time spent in a weight state and modifying the weight nonlinearly according to that time), but the resistance drift effect is usually regarded as a problem for applications. In this work, we make use of the unique resistance drift effect to develop a new learning scheme for the PCM-based memristor neural network, and a multilayer perceptron (MLP) with a 784–256–10 structure and 203,264 weights was built. It has been reported that resistance drift improves image classification in SNNs, but this was explained as an effect of improving the weight representation and was not linked to a new learning method^{39}. In some neuromorphic computing systems, weights encoded by memristors are binary-valued (+1 or –1) for feasibility, and thus weights stay at either +1 or –1 until an update process. To embody such abstraction in PCM-based memristor DNNs, we utilize the spontaneity of the PCM memristor, that is, the electrical properties of the resistance drift, which can be described by a power law using a defect model^{40,41,42,43}. In this setup, we fill the gap between actual behavior and unrealistic expectation in a PCM-based memristor DNN accelerator. At the same time, we incorporate sparsification based on weight consistency in a very natural way, i.e., by discriminating a PCM weight staying consistently in high resistance through its spontaneous resistance increase. During training, a negative weight (–1) corresponding to the crystalline state decreases the weighted sum (i.e., the sum of multiplications between each input and weight), whereas a positive weight (+1 + δ, δ ≥ 0) corresponding to the amorphous state increases the weighted sum for nonnegative inputs.
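The power-law drift invoked above is commonly written as R(t) = R₀(t/t₀)^ν. A minimal sketch follows; the drift coefficient ν and the initial resistance are assumed illustrative values, not parameters extracted from our array.

```python
import numpy as np

def drift_resistance(r0, t, t0=1.0, nu=0.05):
    """Power-law drift of the amorphous resistance: R(t) = r0 * (t / t0)**nu.
    The drift coefficient nu = 0.05 is an illustrative value only."""
    return r0 * (t / t0) ** nu

r0 = 1e6                                    # ohms just after amorphization (assumed)
times = np.array([1.0, 10.0, 100.0, 1e3])   # seconds since amorphization
r = drift_resistance(r0, times)             # monotonically increasing resistance
```

Because ν is small, the increase is slow but unbounded on a log-time scale, which is exactly the property SSL later reads out as weight consistency.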
Because δ grows as long as a weight remains in the positive state, the prior knowledge that the consistency of positive weights potentially increases the final output values is incorporated in this PCM-based memristor neural network. Interestingly, the final output is selectively increased when the prediction is correct. That is, the consistency correlates with accuracy improvement. The weight consistency map reveals that the new learning scheme prevents weights from being stuck in one of the weight states during the whole training, which controls the discrimination of one weight state from the other, i.e., the sparsity. It is noticeable that the weight sparsity changes differently between input-side weights and output-side weights. The consistency boosts the tendency toward sparsity enhancement by increasing the number of negative input-side weights but suppresses the increase in the number of negative output-side weights. Thus, SSL manifests the negative contribution of some input data to the accuracy by increasing the sparsity, and utilizes the sparsity to minimize the uncontrollable negative contribution to the output. A possible outline of the PCM-based memristor neural network system, the device implementation, the operation schemes, and a solution to resistance drift even after training are carefully addressed.
Our approach not only improves the classification accuracy of the PCM-based memristor neural network on the MNIST handwritten digit dataset from 89.6% to 93.2%, but also controls the weight sparsity differently between layers (+1.4% for the input-side layer of about 200k weights and –4.5% for the output-side layer of about 2.5k weights) without any additional computation or operation. The demonstration based on the statistical parameters extracted from a 1 Gb PCM array also indicates high feasibility. Unlike the passive approach of suppressing the drift effect, our SSL promotes the fusion of computer-based neural algorithms with the intrinsic properties of memristor devices^{44,45,46}. To the best of our knowledge, this is the first learning method leveraging the PCM resistance drift to improve memristor-based neural network performance. This opens up a new path for the development of PCM-based memristor neuromorphic accelerators.
Results
Description of the working system
Our approach is based on distributed computing and data quantization. A neural network is separated into several feedforward units (FFUs) and a central backpropagation unit (BPU), as shown in Fig. 1a^{47}. The BPU calculates weight modifications to update the old weights in high precision (e.g., float32) according to the feedforward output received from the FFUs and the supervised labels. Then the new weights are binarized in the BPU and sent to the FFUs. The FFUs update their own PCM weight storage and send the loss for the next input back to the BPU. In this configuration, multiple FFUs share the computation workloads. Also, the data transfer from the BPU to the FFUs is reduced by the data quantization. After the training, the positive weights are pinned to a fixed value to be used for inference without weight variation. The BPU has its own full-precision weight storage to be updated, like conventional neural network models for weight quantization. The weight is quantized and then stored in the FFUs for the feedforward process. Although the storage is simply replaced with PCM memristors in this neural network, the consistency is automatically stored as an additional resistance increase in the PCM memristors, as shown in Fig. 1b.
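The BPU/FFU split above can be summarized as a toy training step: the BPU keeps full-precision weights and gradients, while the FFU forward pass uses only the binarized copy. The dimensions, learning rate, and softmax/cross-entropy choices below are illustrative assumptions, and the drift term δ is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

def binarize(w):
    # BPU-side quantization sent to the FFUs: keep only the weight sign.
    return np.where(w >= 0, 1.0, -1.0)

n_in, n_out, lr = 8, 4, 0.1                         # toy sizes (paper: 784-256-10)
w_full = rng.normal(scale=0.1, size=(n_out, n_in))  # BPU full-precision storage

x = rng.random(n_in)        # nonnegative input sample
t = np.eye(n_out)[2]        # one-hot supervised label

for _ in range(5):
    w_bin = binarize(w_full)                   # weights the FFU actually stores
    s = w_bin @ x                              # FFU feedforward (binary weights)
    y = np.exp(s - s.max()); y /= y.sum()      # softmax output
    grad = np.outer(y - t, x)                  # cross-entropy gradient
    w_full -= lr * grad                        # BPU updates full-precision copy
```

The key design choice mirrored here is that gradients always flow into the full-precision copy; the PCM array only ever holds the quantized snapshot.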
A possible weight change in the PCM memristor weight storage can be represented as a combination of three patterns: staying at –1, alternating between +1 and –1, and increasing as +1 + δ (left column in Fig. 1c). The probability distributions of the corresponding full-precision weights are simplified through the binarization (middle column). As a weight is binarized, much of the information about the full-precision weight is discarded. However, Bayesian inference can be considered as a method to reconstruct information about the probability distribution of full-precision weights, predicting a posterior probability based on a prior probability and a likelihood. Since it is not known how the full-precision weight will be formed, +1 and –1 can be assigned a prior probability of 50:50. The distribution of full-precision weights is then predicted based on the likelihood with which +1 and –1 appear upon binarization. We can infer that the distribution of full-precision weights is shifted toward the positive side by reflecting the likelihood of +1. This inferred probability distribution is called the posterior probability distribution; in the next inference, the posterior probability of the previous step becomes the prior probability, and inference continues according to the likelihood. There is no way to perfectly reflect this posterior probability in a 1-bit memristor, but PCM can encode the increased probability as an increase in weight value using the drift effect. Then, according to the prior (i.e., the significance of positive weights), the posterior is determined by increasing the weights, inspired by gain coding (right column)^{48,49}. Thus the weight increase based on the weight history can be rationalized by Bayesian inference and gain coding. The resistance drift of a PCM memristor is compared after two different operations, crystallization and amorphization.
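The epoch-by-epoch Bayesian update described above can be illustrated with a simple Bernoulli model of the weight's sign; the likelihood value used is a hypothetical choice, not a quantity from the paper.

```python
def update_posterior(prior_pos, observed, like=0.7):
    """One Bayes step for P(full-precision weight is positive).
    like = P(observe +1 | weight positive) is a hypothetical likelihood."""
    if observed == +1:
        num = like * prior_pos
        den = like * prior_pos + (1 - like) * (1 - prior_pos)
    else:
        num = (1 - like) * prior_pos
        den = (1 - like) * prior_pos + like * (1 - prior_pos)
    return num / den

p = 0.5                           # 50:50 prior over the weight sign
for obs in (+1, +1, +1):          # a weight binarized to +1 three epochs in a row
    p = update_posterior(p, obs)  # previous posterior becomes the next prior
# p has shifted well above 0.5 -- the growing belief that a drift-encoded
# delta (+1 + delta) can mirror, within the 1-bit cell's limits
```

Each consistent +1 observation multiplies the odds in favor of a positive underlying weight, which is the intuition behind letting δ accumulate only while a cell stays in the amorphous state.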
Continuous resistance reads show different behaviors between the amorphous (DRIFT) and crystalline (ON) states, as shown in Fig. 1d. The resistance of the amorphous state increases gradually according to the power law, while the resistance of the crystalline state is almost constant^{41,42,50}. Thus, the prior can be incorporated into the weights naturally. Scanning electron microscopy (SEM) and transmission electron microscopy (TEM) reveal the PCM array and the cell structure, as shown in Fig. 1e, f (more details of the PCM array and cell in the “Methods” section).
PCM memristor operation
In our neural network, although the conventional weight storage is replaced with PCM memristors, the operation of the weight storage is the same as the conventional one. Some memristors are switched (S) and others are not (N). When a PCM memristor is not switched and remains in the amorphous state, its resistance increases due to the spontaneous resistance drift. Figure 2a shows an example of how the resistance state of a PCM memristor changes as a sequence of specific S and N operations is delivered. The PCM memristor is switched occasionally in the middle of continual resistance reads. The resistance read in the amorphous state shows a slight increase in resistance until the cell is switched back to the crystalline state. From the 200th to the 1200th resistance read, the resistance drift continues without switching out of the amorphous state. In this way, the previously defined consistency in the pattern of weight evolution is recorded as a resistance drift.
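A toy trace of this behavior can be generated as follows; the drift coefficient and the two resistance levels are illustrative assumptions, not values measured on our array, and the switch positions loosely mimic the Fig. 2a sequence.

```python
import numpy as np

# Toy read trace: the cell starts amorphous, is switched (S) to crystalline
# and back, then drifts uninterrupted. nu, r_amorph0, r_cryst are assumed.
nu, r_amorph0, r_cryst = 0.05, 1e6, 1e4

def read_trace(n_reads, switch_at):
    trace, state, t_since = [], "amorphous", 1
    for k in range(n_reads):
        if k in switch_at:                         # an S operation toggles the state
            state = "crystalline" if state == "amorphous" else "amorphous"
            t_since = 1
        if state == "amorphous":
            trace.append(r_amorph0 * t_since**nu)  # power-law drift while unswitched
            t_since += 1
        else:
            trace.append(r_cryst)                  # stable low-resistance ON state
    return np.array(trace)

r = read_trace(1200, switch_at={150, 200})
# reads 200..1199 drift upward with no further switching, as in Fig. 2a
```

The longer the cell is left unswitched, the larger its accumulated resistance increase, which is exactly the "consistency" SSL records.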
Dual weight mapping strategy
To implement spontaneous weight modification from the resistance drift, we must consider how to map different physical quantities (e.g., high and low resistances) with positive values to typical weight values (e.g., +1 and –1, respectively). This can be solved by some practical mapping methods known as 2-memristor-cell^{31,45} or dummy-cell methods^{51,52}. Also, the mapping from conductance to weight values helps accelerate feedforward propagation, i.e., matrix–vector multiplication or MAC. However, training with resistance drift in the conductance domain is not effective because some modifications are required to make the resistance drift effect equivalent in the conductance scheme. Therefore, we propose a resistance scheme for training and a conversion to a conductance scheme for inference.
For the training, the PCM memristor array is written and read cell by cell, but each cell is read as an analog memory in our practice. The computations of the feedforward propagation are shared among the FFUs for several training data, though MAC is not introduced. Thus, there is more freedom in processing the resistance values, e.g., taking the logarithmic value of the resistance to equalize the magnitude of the variations in the high and low resistances (denoted by R_{h} and R_{l}, respectively). As shown in Fig. 2b, we obtained \(W_{{ji}}^{\mathrm{{train}}}\) from the measurement of \(I_{R_{ji}}\) and some arithmetic operations. The dummy-cell method was used to assign the weight increase property only to the positive weights. T denotes the translation (highlighted in red) that centers a pair of R_{h} and R_{l} at the zero point, and S denotes the scaling (highlighted in blue) that maps the translated values to +1 and –1, respectively. As a result, we obtain the intended weight values as shown in Fig. 2c. Then, the weighted sum is calculated separately in the feedforward propagation for the training.
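The translate-then-scale mapping can be written out directly; the resistance levels below are assumed example values, chosen only so that the (R_h, R_l) pair lands on ±1.

```python
import numpy as np

# Sketch of the training-time resistance-to-weight mapping: take log10(R),
# translate (T) so the (R_h, R_l) pair is centered at zero, then scale (S)
# so the pair maps to +1 and -1. Resistance values are assumed examples.
r_h, r_l = 1e6, 1e4                      # illustrative high/low resistances (ohms)

log_h, log_l = np.log10(r_h), np.log10(r_l)
T = (log_h + log_l) / 2                  # translation: midpoint of the pair
S = (log_h - log_l) / 2                  # scaling: half the separation

def to_weight(r):
    return (np.log10(r) - T) / S

assert np.isclose(to_weight(r_l), -1.0)
assert np.isclose(to_weight(r_h), +1.0)
# A drifted cell (R > R_h) maps to +1 + delta with delta > 0:
assert to_weight(2e6) > 1.0
```

Working in log-resistance is what makes the drift appear as a small additive δ on top of +1 rather than a multiplicative distortion.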
To accelerate the feedforward propagation during inference after the training, the weight values must be mapped from the conductance. Because we propose weight pinning, two distinct static conductance states are required. Considering the reuse of the PCM memristor weights, all high-resistance memristors (including drifted ones) are switched to low resistance (equivalently, high conductance denoted by G_{h}), while all low-resistance memristors are switched to high resistance (equivalently, low conductance denoted by G_{l}). Then, G_{h} and G_{l} are mapped to +1 + δ_{pin} and –1, respectively. The mapping for inference is similar to that for the training process except for T and S. In this scheme, the resistance drift is negligible due to the reciprocal relation between resistance and conductance. To implement MAC, we simply measure the total current I_{i} when adequate V_{j}’s are supplied according to the input data. The measured current can be formulated from the circuit, and the result is identical to the intended weighted sum as depicted in Fig. 2d.
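A sketch of this inference-time scheme follows; the conductance levels and δ_pin are assumed values (δ_pin = 0.4 echoes the w_pin = 1.4 found later, but is illustrative here), and the affine current-to-weighted-sum recovery is our simplification of the circuit formulation.

```python
import numpy as np

# Inference-time sketch: former high-resistance (positive) weights are
# rewritten to high conductance G_h (mapped to +1 + delta_pin) and negative
# weights to G_l (mapped to -1). G_h, G_l, and delta_pin are assumed values.
g_h, g_l, delta_pin = 1e-4, 1e-6, 0.4

w_signs = np.array([[+1, -1, +1],
                    [-1, -1, +1]])
G = np.where(w_signs > 0, g_h, g_l)             # programmed conductances
W = np.where(w_signs > 0, 1 + delta_pin, -1.0)  # intended pinned weights

v = np.array([0.1, 0.2, 0.05])                  # input-encoded read voltages
i_total = G @ v                                 # MAC: measured column currents

# G is an affine function of W (G = a*W + b), so the weighted sum is
# recovered from the current with one subtraction and one division:
a = (g_h - g_l) / (2 + delta_pin)
b = g_l + a
assert np.allclose((i_total - b * v.sum()) / a, W @ v)
```

Because both conductance states are static after rewriting, this read path no longer sees any drift, consistent with the text's point that drift is negligible in the conductance scheme.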
Classification results
The most prominent result observed in this research is that the classification accuracy increased to ~93.2% after the resistance drift effect was introduced. Figure 3a shows the training (inset) and testing accuracy (from 10 repeats) for the network with the conventional binary weight (denoted ‘conventional’) and the network with the spontaneously increasing weight (denoted ‘drift’). The remarkable fact is that the improvement in accuracy requires little additional computation cost and operation energy. Interestingly, the drift network showed a larger deviation early on but then a steady increase in accuracy with reduced deviation, unlike the conventional network (error bars). This is because the consistency of the weight changes cannot be sufficiently assessed at the early stage of training; the consistency is then gradually established and incorporated.
To identify the improvement in accuracy, we examined the confusion matrices shown in Fig. 3b and e. The confusion matrix indicates how many times the neural network classified each input digit image into each class. A diagonal number indicates correct classifications of a digit, while an off-diagonal number is the total occurrence of a confusion (i.e., a discrepancy between the prediction and the truth). The numbers at the top and right (single row and column, respectively) are the sums of each row and column excluding the diagonal value. For example, the highest value ‘205’ (sum of the values surrounded by black lines in the confusion matrix) at the top means that the neural network most often mispredicted input digit images as ‘3’ (denoted ‘misprediction as 3’), and the highest value ‘222’ (sum of the values surrounded by red lines in the confusion matrix) at the right means that the class ‘8’ was most often mispredicted by the neural network (denoted ‘misprediction of 8’). Because some specific digits like ‘3’ and ‘8’ showed distinctive results in the confusion matrix, we compared the digit-wise accuracy during the training. The digit-wise accuracy is defined as the number of correct predictions divided by the number of actual images for each digit. Digits ‘0’ and ‘1’ outperform in accuracy, while digit ‘8’ shows low accuracy overall in Fig. 3c. Although ‘3’ and ‘8’ show the top-2 highest attempt/success ratios (shown in Fig. 3d), the accuracy of ‘3’ is much higher than that of ‘8’. The improved result (denoted ‘Drift’) shows that the number of mispredictions as ‘3’ is reduced from 205 to 109 and the number of mispredictions of ‘8’ is significantly reduced from 222 to 76. The SSL shows accuracy improvement for most of the digits, especially for ‘8’, as found in Fig. 3f.
Drift effect during training
To reveal two contradictory results related to the output node ‘8’, that is, the weakest drift effect and the greatest improvement in accuracy, the impact of the drift effect on the final output in the feedforward process is analyzed in Fig. 4. Here, the actual 256 nodes in the hidden layer and 10 output nodes were simplified to five each (n_{j} = 5, n_{i} = 5) for convenience, as shown in Fig. 4a. Consequently, there are five classes (e.g., handwritten digits 0, 1, 2, 3, and 4) to be distinguished, and the weight matrix has a 5 by 5 dimension. We introduced the drift effect as an additional term, \(\delta _{ji}^{\left( {23} \right)}\left( t \right)\), to the weight connecting the ith node in the second (hidden) layer and the jth node in the third (output) layer. It should be noted that \(\delta _{ji}^{\left( {23} \right)}\)(t) is positive only for j and i satisfying \(w_{ji}^{\left( {23} \right){\mathrm{{bin}}}}\) = +1. In this schematic, the input vector \(x_j^{\left( 2 \right)}\) and weight matrix \(w_{ji}^{\left( {23} \right){\mathrm{{bin}}}}\) are identical across the cases, but \(\delta _{ji}^{\left( {23} \right)}\)(t) differs for each case. We compared three different cases as shown in Fig. 4b. Among the weights with a value of +1 in the weight matrix, a weight with a drift value of 0 indicates a PCM cell that has just switched to the +1 state without time to drift; these zero-drift weights were arbitrarily selected for illustration, since in practice positive weights switch at different times and thus carry various drift values. The top panel is a conventional drift-free feedforward process, so the weighted sum and the softmax output are calculated as \(s_i^{\left( 3 \right)} = \mathop {\sum}\nolimits_j {x_j^{\left( 2 \right)}} w_{ji}^{\left( {23} \right){\mathrm{{bin}}}}\) and \(x_i^{\left( 3 \right)}\) = softmax(\(s_i^{\left( 3 \right)}\)), respectively.
Because \(x_i^{\left( 3 \right)}\) has the highest value (0.44) at i = 3, the conventional neural network infers that the input is \({\mathrm{{class}}}_{i = 3}\). The middle and bottom panels indicate different effects of \(\delta _{ji}^{\left( {23} \right)}\)(t), and thus the weights experience the drift differently. For instance, some elements of the weighted sum are increased more or less according to which weights experience the drift effect more strongly. The effect of drift to be emphasized is that the change in each \(s_i^{\left( 3 \right)}\) has a direct influence on every \(x_i^{\left( 3 \right)}\) because of the mutually exclusive property of the softmax function. In the middle panel, the large \(d_{i = 3}^{\left( 3 \right)}\) (=1.70−1.10 = 0.60) made \(x_{i = 3}^{\left( 3 \right){\mathrm{{drift}}}}\) (=softmax(\(s_{i = 3}^{\left( 3 \right)}\) + \(d_{i = 3}^{\left( 3 \right)}\)) = 0.57) larger than \(x_{i = 3}^{\left( 3 \right)}\) (=0.44) and, simultaneously, \(x_{i = 2}^{\left( 3 \right){\mathrm{{drift}}}}\) (=softmax(\(s_{i = 2}^{\left( 3 \right)}\) + \(d_{i = 2}^{\left( 3 \right)}\)) = 0.28) smaller than \(x_{i = 2}^{\left( 3 \right)}=0.36\) even for the nonzero \(d_{i = 2}^{\left( 3 \right)}\) (=1.01−0.90 = 0.11). Therefore, the loss (\(E = -\mathop {\sum}\nolimits_{i = 1}^{n_i} {t_i\log \left( {x_i} \right)}\)) is reduced from 0.36 to 0.24 in the middle panel because the index i of the largest \(d_i^{\left( 3 \right)}\) is 3, identical to the index of the true label of the input. On the other hand, \(\delta _{ji}^{(23)}\left( t \right)\) has the opposite pattern in the bottom panel (0.4 and 0.1 are swapped from the middle panel), so the prediction is incorrect (i.e., the indices do not match) and the loss increases to 0.40.
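The softmax-exclusivity argument above can be checked numerically. The weighted sums at indices 2 and 3 below echo the quoted values (0.90 → 1.01 and 1.10 → 1.70); the remaining entries are made-up fillers, so the resulting probabilities are illustrative rather than the figure's exact numbers.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Five-node weighted sums; only indices 2 and 3 follow the quoted values.
s = np.array([0.20, 0.50, 0.90, 1.10, 0.30])   # conventional s_i
d = np.array([0.00, 0.00, 0.11, 0.60, 0.00])   # drift contributions d_i >= 0
true_idx = 3                                   # index of the true label

x_conv, x_drift = softmax(s), softmax(s + d)
loss_conv = -np.log(x_conv[true_idx])          # E = -sum_i t_i log(x_i)
loss_drift = -np.log(x_drift[true_idx])

assert x_drift[true_idx] > x_conv[true_idx]    # true-class output rises
assert x_drift[2] < x_conv[2]                  # falls despite d_2 = +0.11
assert loss_drift < loss_conv                  # loss is reduced
```

Even though every dᵢ is nonnegative, the shared softmax normalization lets the largest drift term suppress the other outputs, which is the mechanism behind the selective output increase.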
Because \(\delta _{ji}\left( t \right)\) is determined internally during the training process, we can only estimate it heuristically through the above two extreme cases. If the drift effect improves the prediction, the loss will be reduced and the weight update becomes smaller than the conventional case. On the contrary, if the drift effect makes the prediction worse, the loss will increase and the weight update becomes larger than the conventional case correspondingly. In this way, some weights that have a significant contribution to minimizing the loss function will most likely experience the drift effect.
Weight consistency
Although it is hard to interpret the optimization of the drift effect during the training explicitly, the weight consistency reveals the pattern of weight changes. Figure 5a shows the result of subtracting two heatmaps that represent the weight consistency in the conventional and drift networks, respectively. Thus, larger values indicate more significant drift effects on the weight consistency. The numbers on the horizontal axis indicate the digits (0–9) to be classified. The numbers on the vertical axis (−10 to +10) denote how many times a weight stays in the same state over the first 10 epochs. The 256 weights can be considered to be associated with the output node ‘0’. The consistency of these weights can be reflected by adding each weight value once per epoch. For example, if a weight has been in the −1 state for 10 epochs, the accumulated weight value is −10. Thus, each of the 256 weights assigned to each digit falls into one of the values between −10 and +10 for both the conventional and drift networks. The negative values at the bottom (deep bluish squares) indicate a slight reduction in the number of weights stuck in negative states. During the last 10 epochs, the negatively consistent weights are significantly reduced, and some weights instead remain in the positive states for a long time, as shown in Fig. 5b. For the input-side weights, the consistency of the 28 × 28 weights connected to each node in the hidden layer of the neural network is accumulated. There is no noticeable pattern during the first 10 epochs, but some weights connected to the central region of the input images show a stronger negative contribution to the accuracy during the last 10 epochs, as shown in Fig. 5c and d. At the early stage of training, there is no significant difference in weight consistency. As the training progresses, the number of weights stuck in either positive or negative states is remarkably reduced, and the number of weights changing between the two states is increased.
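The consistency accumulation described above is straightforward to compute; the weight history below is random toy data rather than a trained trajectory, so only the bookkeeping, not the pattern, is meaningful.

```python
import numpy as np

rng = np.random.default_rng(2)

# Consistency accumulation: add each binary weight's value (+1/-1) once per
# epoch over 10 epochs; a weight stuck at -1 throughout accumulates -10.
n_epochs, n_weights = 10, 256
weight_history = rng.choice([-1, 1], size=(n_epochs, n_weights))  # toy data

consistency = weight_history.sum(axis=0)       # each value in [-10, +10]
counts, _ = np.histogram(consistency, bins=np.arange(-10.5, 11.5, 1.0))
# `counts` corresponds to one column of a Fig. 5a-style heatmap
# (one output node), before subtracting conventional from drift
```

Repeating this per output digit (or per hidden node for the input side) and subtracting the conventional histogram from the drift histogram yields the difference heatmaps of Fig. 5.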
Interestingly, the input-side and output-side weights show opposite patterns. Some weights stuck in the positive state are reduced for the input-side weights, whereas some weights stuck in the negative state are reduced for the output-side weights. Therefore, the accuracy-oriented consistency leads to the perturbation of some weights.
Sparsity
Because the perturbation of some weights is not reflected in the accuracy calculated at a certain time, the weight sparsity (i.e., the number of negative weights over the total number of weights) is examined. In the conventional network, the number of negative weights increases for both the input-side (Fig. 6a) and output-side (Fig. 6b) weights during the training. On the contrary, in the drift network, the number of negative weights is further increased for the input-side weights (Fig. 6c) by about 1.4%, but decreased for the output-side weights (Fig. 6d) by about 4.5%. The blue and red lines indicate the proportions of the total negative and positive weights, respectively. The output-side weights are grouped by the output digits, and the input-side weights are grouped by each node in the hidden layer (i.e., 256 groups with 28 × 28 weights per group).
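The sparsity measure and its grouping can be written compactly; the binary weights here are random stand-ins, so the numbers are placeholders for the trained values reported in Fig. 6.

```python
import numpy as np

rng = np.random.default_rng(3)

def sparsity(w_bin):
    # Weight sparsity as defined in the text: fraction of negative weights.
    return float(np.mean(w_bin < 0))

# Toy input-side weights: 256 hidden-node groups of 28 x 28 weights each.
w_in = rng.choice([-1.0, 1.0], size=(256, 28 * 28))

per_group = (w_in < 0).mean(axis=1)   # sparsity per hidden-node group
overall = sparsity(w_in)              # proportion over all weights
```

Tracking `per_group` across epochs, for both the conventional and drift networks, gives the layer-wise sparsity curves compared in the figure.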
Considering the introduction of the accuracy-oriented consistency, the consistency of weights that compromise the accuracy will mostly be excluded. We can imagine an overlap between the consistency and the accuracy. The weights can be subdivided by their position in the neural network (i.e., input-side or output-side), the consistency, and the sparsity. Therefore, the overlap portion (denoted selectivity) determines the effectiveness of the SSL scheme. Because the drift effect fundamentally increases the output through its nonnegative contribution to the weighted sum, the similarity of the output vector to a one-hot vector indicates the effectiveness of SSL.
Selectivity
The maximum component in an output vector is <1 and the rest are larger than 0, as shown in Fig. 7a. We can define the digit-wise difficulty of classification based on the output vector, because the neural network tries to make the output as close to a one-hot vector as possible. The classification of a digit can be considered more difficult the more the components of the digit in the output vectors deviate from 1 or 0. From this perspective, the digit ‘8’ is much more difficult to distinguish from the others than ‘1’ in the conventional network. Surprisingly, the drift effect discriminates the digit ‘1’ output into two subgroups, though the average output of ‘1’ is close to 1 in Fig. 7a. In each group, the output of ‘1’ in the conventional network is totally different according to whether a given digit ‘1’ input image is classified correctly in the drift network or not (two top-right bluish lines), as shown in Fig. 7b. That is, the output of ‘1’ in the conventional network deviates strongly from 1 for some images classified incorrectly in the drift network. However, digit ‘8’ does not show such dependence on correctness in the drift network (two top-left reddish lines). The drift effect makes the output of ‘8’ close to a one-hot vector eventually (orange lines). In Fig. 7c, the average output is improved and approaches the promising form of a one-hot vector. The accuracy-oriented consistency subdivides both the digits ‘8’ and ‘0’ according to difficulty. Then, SSL exploits the decision boundary of classification to maximize the accuracy.
The numbers of increases and decreases of Max(\(x_i^{{\mathrm{{drift}}}}\)) compared with Max(x_{i}) are plotted with respect to the correctness, as shown in Fig. 7d. The total number of increases or decreases can be divided into four cases, I/I, I/C, C/I, and C/C, where I and C denote ‘incorrect’ and ‘correct’, respectively, and the former and the latter refer to ‘conventional/drift’. The total number of increases is substantially larger than that of decreases, with a ratio (i.e., the number of increases over the number of decreases) of 6.29. It is noticeable that the ratio is remarkably higher when the classification in the drift case is correct. Simultaneously, the ratio is reduced considerably when the result is incorrect. The ratios are 10.7 and 9.68 for I/C and C/C (above the overall ratio), whereas they are 1.06 and 2.85 for C/I and I/I (below the overall ratio), respectively. These results indicate that the selectivity of the output increase originates from the consistency and can be supported analytically by tracing the backpropagation process.
Weight pinning
One important issue of the spontaneous weight modification utilizing resistance drift is that the optimized weights will undergo inevitable further change after training. However, a suitable constant weight value can be found to represent all the individually changed weights due to the drift without degrading performance in the inference process. After training, all positive weights can be regarded as a fixed positive constant, as in a conventional digital memory device. Then we can find the constant that maximizes the accuracy on the test images. Changing the fixed positive constant from 1.05 to 1.70 in steps of 0.05, we compared the accuracy and found that it is comparable to the accuracy just after the training when w_{pin} = 1.4 (hence the name weight pinning), as shown in Fig. 8a.
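The pinning sweep can be sketched as a simple search over the constant; the weights, test data, and resulting accuracies below are random stand-ins, so only the sweep procedure (1.05 to 1.70 in steps of 0.05) mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(4)

def accuracy(w, x, labels):
    return float(np.mean((x @ w.T).argmax(axis=1) == labels))

# Toy stand-ins for the trained binary weights and a held-out test set.
w_bin = rng.choice([-1.0, 1.0], size=(10, 64))
x_test = rng.random((200, 64))
labels = rng.integers(0, 10, size=200)

# Sweep the pinned value for all positive weights, as in the text,
# and keep the best-performing constant.
best_pin, best_acc = None, -1.0
for w_pin in np.arange(1.05, 1.701, 0.05):
    w = np.where(w_bin > 0, w_pin, -1.0)   # every positive weight pinned
    acc = accuracy(w, x_test, labels)
    if acc > best_acc:
        best_pin, best_acc = float(w_pin), acc
```

In the hardware flow this sweep is run once after training, and the selected constant replaces each cell's individually drifted +1 + δ during inference.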
Discussion
From the digit-wise analysis, we draw the following observations. (1) The digit ‘8’ is classified least accurately. (2) The accuracy-oriented consistency (i.e., the drift effect) is introduced. (3) The consistency controls the weight sparsity, so the weights are regrouped in digit-wise (via output-side weights) and image-wise (via input-side weights) manners. (4) Under the drift effect, digit-‘1’ input images are easily subdivided into two groups, while digit-‘8’ images are not. (5) The network then increases the classification accuracy of ‘8’ mostly by exploiting the room for weight modification, without degrading the classification accuracy of ‘1’. The accuracy improvement due to the consistency-induced weight increase is summarized in Fig. 8b.
Our approach is rationalized by the fact that the difficulty of classification is related to the difficulty of weight determination (i.e., consistency)^{53,54,55}. The difficulty-based accuracy improvement resembles experimental observations of decision confidence in the brain^{56,57,58,59}. In animal experiments, the dependence of neural activity on decision difficulty is the most important signature of confidence^{49,60}. For example, for a mixture of two stimuli (e.g., scent A and scent B) at different ratios, neural activity is most active for incorrect decisions in easy decision tasks (A: 100% + B: 0% or A: 0% + B: 100%) and least active for correct decisions in easy tasks. By contrast, in difficult decision tasks (A: 51% + B: 49% or A: 49% + B: 51%), neural activity is moderately active regardless of correctness. In this sense, the process of confidence formation classifies easy tasks decisively, which is similar to the classification of ‘1’ under the drift effect.
It is generally known that the reset current decreases as the cell size decreases^{61,62}. In addition, a lower reset current is associated with a smaller amorphous volume and consequently a higher amorphous resistance^{63}. Interestingly, the amorphous-state resistance correlates with the drift coefficient^{64}: the higher the amorphous resistance, the larger the drift coefficient. To the best of our knowledge, however, no publication has yet addressed the impact of technology scaling on resistance drift.
Various experimental results have explained resistance drift in different ways, but they share the commonality that drift is related to structural relaxation and stress release. Therefore, the electronic state of the material has a more direct effect on resistance drift than the technology node or device structure. When a PCM cell is not fully reset, the drift coefficient, as well as the resistance, becomes smaller. The drift coefficient of fully reset GST is ~0.09–0.11^{39}. In our case, the PCM is almost fully reset (~0.10), and thus the impact of technology node reduction is negligible. Even if the drift coefficient becomes larger with size scaling or structural changes, we can slightly increase the reset current to decrease the reset resistance, thereby keeping the drift coefficient at a similar level.
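The widely used power-law drift model (the drift parameter d appears in Eq. (1) of the text) can be sketched as follows; taking `t0` as the time of the reference read is an assumption of this sketch:

```python
import numpy as np

def drift_resistance(r0, t, t0=1.0, d=0.10):
    """Power-law resistance drift of a reset (amorphous) PCM cell:
    R(t) = R0 * (t / t0)**d, with drift coefficient d ~ 0.09-0.11
    for fully reset GST. t0 is the time of the reference read."""
    return r0 * (t / t0) ** d
```

With d ≈ 0.10, the resistance of a fully reset cell grows by roughly 58% per two decades of elapsed time, which is the slow, monotonic increase that SSL exploits.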
We tested the classification accuracy over a wide range of drift coefficient means and deviations, as shown in Supplementary Fig. 7. The consistently improved accuracy indicates that our learning scheme is robust to changes in the drift effect caused by technology scaling or environmental temperature. Interestingly, the drift coefficient is known to be independent of temperature when the temperature is constant^{42,65}; if the temperature rises suddenly, the drift coefficient temporarily increases. As shown in Supplementary Fig. 7c, the accuracy remains improved until the drift coefficient increases to ~1. Since the drift coefficient in the fully reset state is bounded, it does not take values much higher than 0.12 for GST-based PCM^{43,66}. Therefore, the accuracy of the proposed scheme will not deteriorate at high temperature as long as the temperature remains below the glass transition temperature of the phase-change material.
In this work, we built a mature 1 Gb PCM array with 39 nm technology and, by leveraging the statistical parameters obtained from measurements of resistance drift, demonstrated a spontaneous sparse learning scheme in a PCM-based memristor neural network. By encoding the consistency of weight changes into spontaneous weight increases, the neural network exhibits discriminated weight modification depending on the classification difficulty. Without any additional computation or operation cost, the new training scheme, with controlled sparsity (+1.4% for output-side weights and −4.5% for input-side weights), improves the accuracy of handwritten digit classification (MNIST dataset) to 93.2%, compared with 89.6% for the baseline neural network.
Additionally, because the consistency-induced weight increase is compatible with any PCM-based memristor neural network, our results are potentially relevant to evolutionary neural networks^{4,5}, inspired by spontaneous behavior, and to modular neural networks^{6,7,67}. Our work provides a prototype neural network that implements a metacognitive function, such as confidence, with low computational workloads for lightweight devices. Furthermore, our approach can be applied to RRAM as well as PCM, and improved learning effects are expected in more complex networks using multilevel cells. Although resistance drift is weak in RRAM in general, it appears distinctly in conductive-bridge RAM, one of the RRAM branches, owing to the lateral diffusion of the metal constituting the bridge^{68}. Since the drift coefficient there is roughly within the range 0.001–0.1, similar to PCM, depending on the initial resistance, the method presented in this study is expected to be applicable to such RRAM devices^{69}. Our work may also inspire RRAM research, i.e., leveraging unique material characteristics (e.g., conductive-bridge modulation) to improve the performance and efficiency of memristor networks or to incorporate other learning schemes into them. The basic operation of our PCM array is well evaluated up to 2-bit (4-level) operation, and the proposed SSL algorithm is expected to further improve network performance in the multilevel case.
Our proposed learning scheme utilizes the intrinsic resistance drift to improve the learning accuracy without additional cost, opening a new path toward learning schemes that exploit materials’ properties.
Methods
Device fabrication
Cross-point PCM arrays of cells with n+-Si/n-Si/p-Si/TiN/Ge_{2}Sb_{2}Te_{5} (GST)/TiW/W structures were fabricated on a 300-mm Si wafer. A highly doped n+-Si layer was prepared with B at a doping concentration exceeding 10^{22} atoms/cm^{3}. The n-Si/p-Si diode was epitaxially grown on the wafer by chemical vapor deposition (CVD), with B at a doping concentration of 10^{21} atoms/cm^{3} for the n-type layer and P at a doping concentration of 10^{20} atoms/cm^{3} for the p-type layer. TiN films were deposited on the diodes by physical vapor deposition (PVD) as the bottom electrode (BE). The stack was dry-etched along the y-direction and then along the x-direction to form pillar structures. The n-Si/p-Si/TiN stack was first etched into y-axis-oriented parallel lines using a SiO_{2} hard mask. The gaps between the etched lines were filled with SiO_{2}. The gap-filled structure was flattened by chemical–mechanical polishing (CMP) and etched again into x-axis-oriented parallel lines. This formed the word lines (WL) of n+-Si at the bottom of the diodes. The gaps between the diodes were then filled with SiO_{2} and flattened by CMP. Next, interconnects and trenches were formed in the SiO_{2} by lithography along the y-direction to build the GST and a top electrode (TE) on the TiN through the dual damascene process. The interconnects and trenches were filled with GST, and TiW was then deposited as the TE. After CMP, a W metal line was formed on the TiW by lithography as the bit line, completing the cross-point PCM array. The contact area between the GST and the bottom electrode was ~40 nm × 80 nm.
PCM array information
The PCM array used in this research is a 39 nm technology array from the development line of PCM arrays with 90, 50, 45, 39, and 20 nm technologies^{70,71,72} (officially reported for some technologies only). Various operation and evaluation data, including temperature and resistance drift characteristics for each technology, were partially disclosed^{73,74}. For example, resistance distributions for the four resistance levels (00), (01), (10), and (11), after an elapse of 400 h (1.4 × 10^{6} s) and an additional thermal annealing (bake) at 130 °C for 12 h, were provided for the 90 nm PCM array. For the 45 nm PCM array, an in-depth analysis of the drift effect was presented to improve the multilevel operational characteristics and verify practicality. The total array sizes fabricated with the 90 and 45 nm technologies are 512 Mb (16 banks of 32 Mb, 8 blocks per bank) and 1 Gb (16 banks of 64 Mb, 8 blocks per bank), respectively. For the 39 nm PCM array, the temperature and drift characteristics were not officially disclosed at the wafer level, but they are similar to those of the 45 nm array; its 2-PCM synapse behavior was evaluated^{75}. Although the operating circuits were slightly modified for each technology to implement ancillary functions for performance improvement, such as high read or write throughput, the characteristics of each PCM array were well evaluated, as can be seen in the cited literature.
Measurement
Each PCM cell is in the low-resistance state at about 10^{4} Ω, measured by using a 1 V pulse with 5 ns rise time, 60 ns duration, and 5 ns fall time. The cell resistance is increased to about 10^{7} Ω using a 5 V pulse with 5 ns rise time, 60 ns duration, and 5 ns fall time. We then set the cell back to a few 10^{4} Ω by applying a 3 V pulse with 5 ns rise time, 400 ns duration, and 300 ns fall time. Each PCM cell is switched 20 times between the set and reset states, with 10 resistance reads after each switching. The resistance is then measured 1000 times immediately after the reset switching to capture the resistance drift, as depicted in Supplementary Fig. 1a. A pulse generator (Keysight 33600A) was used to apply electrical pulses to the cell, and a source meter (Keithley 2600) was used to measure the cell resistance. Pulses are applied to the TE while the bottom electrode is grounded through the contact pads. Supplementary Fig. 1b shows the wafer under measurement.
Parameter extraction
We extracted the resistance distribution and deviation from over 100 cells. For each cell, the set and reset resistances were measured over 100 data points, and the resistance drift was then measured over 1000 data points with one read operation per second. The sample mean and sample standard deviation of the set resistance, the reset resistance, and the drift parameter (d in Eq. (1)) were then extracted and denoted by \(\mu _{{\mathrm{{SET}}}_i}^{\mathrm{{s}}}\), \(\sigma _{{\mathrm{{SET}}}_i}^{\mathrm{{s}}}\), \(\mu _{{\mathrm{{RESET}}}_i}^{\mathrm{{s}}}\), \(\sigma _{{\mathrm{{RESET}}}_i}^{\mathrm{{s}}}\), \(\mu _{{\mathrm{{DRIFT}}}_i}^{\mathrm{{s}}}\), and \(\sigma _{{\mathrm{{DRIFT}}}_i}^{\mathrm{{s}}}\), respectively, for the \(i{{\mathrm{th}}}\) cell. The statistical parameters of the resistance distribution, i.e., the population means (\(\mu _{{\mathrm{{SET}}}_i}^{\mathrm{{p}}}\), \(\mu _{{\mathrm{{RESET}}}_i}^{\mathrm{{p}}}\), and \(\mu _{{\mathrm{{DRIFT}}}_i}^{\mathrm{{p}}}\)) and population standard deviations (\(\sigma _{{\mathrm{{SET}}}_i}^{\mathrm{{p}}}\), \(\sigma _{{\mathrm{{RESET}}}_i}^{\mathrm{{p}}}\), and \(\sigma _{{\mathrm{{DRIFT}}}_i}^{\mathrm{{p}}}\)), were calculated for the ith cell with error terms depending on the number of samples and the distribution model. Assuming that the resistance distribution is normal, \(\mu _i^{\mathrm{{p}}}\) can be estimated as \(\mu _i^{\mathrm{{s}}}\) with a standard error of \(\frac{{\sigma _i^{\mathrm{{s}}}}}{{\sqrt n }}\), while \(\sigma _i^{\mathrm{{p}}}\) can be estimated as \(\sigma _i^{\mathrm{{s}}}\) with a standard error of \(\frac{{\sigma _i^{\mathrm{{s}}}}}{{\sqrt {2\left( {n - 1} \right)} }}\). This process is depicted in Supplementary Fig. 2 (i and ii) for R_{ij}, the jth resistance measurement of the ith cell.
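The per-cell estimation step can be sketched as follows, assuming normally distributed reads; the standard-error expressions are the usual normal-theory formulas, and `estimate_cell_stats` is a hypothetical helper name:

```python
import numpy as np

def estimate_cell_stats(reads):
    """Sample mean/std of one cell's n resistance reads, with normal-theory
    standard errors: SE(mean) = s/sqrt(n), SE(std) ~ s/sqrt(2(n-1))."""
    n = len(reads)
    s = np.std(reads, ddof=1)  # sample standard deviation
    return {
        'mean': np.mean(reads), 'mean_se': s / np.sqrt(n),
        'std': s, 'std_se': s / np.sqrt(2 * (n - 1)),
    }
```

Applying this to each cell's 100 set/reset reads and 1000 drift reads yields the \(\mu_i\), \(\sigma_i\) estimates with their sampling errors.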
In addition, the cell-to-cell population mean (M^{p}) and population standard deviation (Σ^{p}) were determined for the distributions of both the \(\mu _i^{\mathrm{{p}}}\)’s and the \(\sigma _i^{\mathrm{{p}}}\)’s, as shown in Supplementary Fig. 3 (iii and iv); they are denoted by \(M_\mu ^{\mathrm{{p}}}\), \(\Sigma _\mu ^{\mathrm{{p}}}\) and \(M_\sigma ^{\mathrm{{p}}}\), \(\Sigma _\sigma ^{\mathrm{{p}}}\). By sampling a value from the normal distribution defined by \(M_\mu ^{\mathrm{{p}}}\) and \(\Sigma _\mu ^{\mathrm{{p}}}\), we could determine the mean resistance (μ^{*}) of a possible PCM cell. The corresponding resistance deviation (σ^{*}) is then determined as the value with the same cumulative probability in the normal distribution defined by \(M_\sigma ^{\mathrm{{p}}}\) and \(\Sigma _\sigma ^{\mathrm{{p}}}\). Thus, six unique parameters (\(\mu _{{\mathrm{{set}}}_i}^ \ast\), \(\sigma _{{\mathrm{{set}}}_i}^ \ast\), \(\mu _{{\mathrm{{reset}}}_i}^ \ast\), \(\sigma _{{\mathrm{{reset}}}_i}^ \ast\), \(\mu _{{\mathrm{{drift}}}_i}^ \ast\), and \(\sigma _{{\mathrm{{drift}}}_i}^ \ast\)) were assigned to each PCM cell independently and randomly for set, reset, and drift. Finally, \(R_{{\mathrm{{set}}}}\), \(R_{{\mathrm{{reset}}}}\), and d_{drift} were generated, and renewed at every switching event, based on the parameters uniquely assigned to each PCM cell. This process is depicted in Supplementary Fig. 4 (v, vi, and vii), and the key statistical data determined from the measurements are given at the bottom of the figure.
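The quantile-matched sampling of per-cell parameters can be sketched as follows; for normal distributions, taking the same cumulative probability reduces to matching the z-score, so no distribution-function evaluation is needed (`sample_cell_parameters` is a hypothetical name):

```python
import numpy as np

def sample_cell_parameters(m_mu, s_mu, m_sigma, s_sigma, rng):
    """Draw a (mean, deviation) pair for one synthetic PCM cell: mu* is
    sampled from N(M_mu, Sigma_mu), and sigma* is taken at the same
    cumulative probability (same z-score) in N(M_sigma, Sigma_sigma)."""
    mu_star = rng.normal(m_mu, s_mu)
    z = (mu_star - m_mu) / s_mu  # shared quantile
    sigma_star = m_sigma + s_sigma * z
    return mu_star, sigma_star
```

Calling this once each for set, reset, and drift yields the six unique parameters assigned to a synthetic cell.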
Neural network
We used a neural network with 784 nodes in the input layer (denoted by \(x_k^{\left( 1 \right)}\)), 256 nodes in one hidden layer (denoted by \(x_j^{\left( 2 \right)}\)), and 10 nodes in the output layer (denoted by \(x_i^{\left( 3 \right)}\)) for the MNIST handwritten digit classification, as shown in Supplementary Fig. 5. The layers and the weights between them are distinguished by the superscript. All nodes are fully connected, and regularization processes such as batch normalization and dropout are excluded. Each 28 × 28 greyscale input image is normalized to [0, 1] and denoted by \(x_k^{\left( 1 \right)}\). The weights are initialized by variance-scaling initialization. We applied ReLU activation in the hidden layer and softmax in the output layer, with the cross-entropy function as the loss. To exclude any adaptive factor, we used a stochastic gradient descent optimizer with a constant learning rate of 0.001 and a mini-batch size of 100. Because we propose PCM as the weight device, the full-precision weights are binarized and then written into PCM cells. PCM cells that are in the amorphous state and not updated experience the drift effect. The modified weights increase the weighting of some inputs and affect the whole training process. The algorithm is described in Supplementary Fig. 6.
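A minimal sketch of the 784-256-10 network's forward pass and of the spontaneous weight increase between gradient updates; `drift_gain` is a hypothetical per-step factor for illustration, not the paper's calibrated drift model:

```python
import numpy as np

def forward(x, w1, w2):
    """784-256-10 MLP forward pass: ReLU hidden layer, softmax output."""
    h = np.maximum(0.0, x @ w1)                       # hidden activations
    logits = h @ w2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)          # softmax probabilities

def apply_drift(weights, drift_gain=1.01):
    """Positive weights stored in amorphous (reset) cells that were not
    switched by the gradient update grow spontaneously between steps;
    drift_gain is a hypothetical per-step factor for illustration."""
    return np.where(weights > 0, weights * drift_gain, weights)
```

In the SSL training loop, a gradient step is followed by `apply_drift` on the binarized weights, so high-resistance-state weights are continuously reinforced unless the gradient switches them to the low-resistance state.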
Data availability
All the related raw data, trained parameters, and codes are available from the authors upon reasonable request.
References
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Schmidhuber, J. Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015).
Krizhevsky, A., Sutskever, I. & Hinton, G. ImageNet classification with deep convolutional neural networks. NIPS 25, 1106–1114 (2012).
Miikkulainen, R. et al. Evolving deep neural networks. In: Artificial Intelligence in the Age of Neural Networks and Brain Computing. (eds Kozma, R., Alippi, C., Choe, Y. & Morabito, F. C.) 293–312 (Elsevier, 2018).
Yao, X. A review of evolutionary artificial neural networks. Int. J. Intell. Syst. 8, 539–567 (1993).
Auda, G. & Kamel, M. Modular neural networks: a survey. Int. J. Neural Syst. 9, 129–151 (1999).
Azam, F. Biologically Inspired Modular Neural Networks (Virginia Tech, 2000).
Mira, J. M. Symbols versus connections: 50 years of artificial intelligence. Neurocomputing 71, 671–680 (2008).
Smolensky, P. Connectionist AI, symbolic AI, and the brain. Artif. Intell. Rev. 1, 95–109 (1987).
Maia, T. V. & Cleeremans, A. Consciousness: converging insights from connectionist modeling and neuroscience. Trends Cogn. Sci. 9, 397–404 (2005).
Haykin, S. Neural Networks and Learning Machines, 3rd edn (Pearson Education India, 2010).
Han, S., Pool, J., Tran, J. & Dally, W. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems 28 (eds Cortes C., Lawrence, N., Lee, D., Sugiyama, M. & Garnett, R.) (MIT Press, 2015).
Marcus, G. Deep learning: a critical appraisal. Preprint at arXiv:1801.00631 (2018).
Indiveri, G., Linares-Barranco, B., Legenstein, R., Deligeorgis, G. & Prodromakis, T. Integration of nanoscale memristor synapses in neuromorphic computing architectures. Nanotechnology 24, 384010 (2013).
Hu, M. et al. Memristor crossbarbased neuromorphic computing system: a case study. IEEE Trans. Neural Netw. Learn Syst. 25, 1864–1878 (2014).
Nurvitadhi, E. et al. Accelerating binarized neural networks: comparison of FPGA, CPU, GPU, and ASIC. In 2016 International Conference on Field-Programmable Technology (FPT), 77–84 (2016).
Boybat, I. et al. Neuromorphic computing with multimemristive synapses. Nat. Commun. 9, 2514 (2018).
Narayanan, P. et al. Toward on-chip acceleration of the backpropagation algorithm using nonvolatile memory. IBM J. Res. Dev. 61, 11:1–11:11 (2017).
Li, C. et al. Efficient and self-adaptive in-situ learning in multilayer memristor neural networks. Nat. Commun. 9, 1–8 (2018).
Li, C. et al. Analogue signal and image processing with large memristor crossbars. Nat. Electron. 1, 52–59 (2017).
Yao, P. et al. Face classification using electronic synapses. Nat. Commun. 8, 1–8 (2017).
Jeong, D. S., Kim, K. M., Kim, S., Choi, B. J. & Hwang, C. S. Memristors for energyefficient new computing paradigms. Adv. Electron. Mater. 2, 1600090 (2016).
Kim, K. H. et al. A functional hybrid memristor crossbar-array/CMOS system for data storage and neuromorphic applications. Nano Lett. 12, 389–395 (2012).
Ielmini, D. & Wong, H.-S. P. In-memory computing with resistive switching devices. Nat. Electron. 1, 333–343 (2018).
Xia, Q. & Yang, J. J. Memristive crossbar arrays for braininspired computing. Nat. Mater. 18, 309–323 (2019).
Chenchen, L. et al. A memristor crossbar based computing engine optimized for high speed and accuracy. In 2016 IEEE Computer Society Annual Symposium on VLSI (ed. Metra, C.) 110–115 (IEEE Computer Society, 2016).
Chabi, D., Wang, Z., Bennett, C., Klein, J.O. & Zhao, W. Ultrahigh density memristor neural crossbar for onchip supervised learning. IEEE Trans. Nanotechnol. 14, 954–962 (2015).
Zhang, Y., Wang, X., Li, Y. & Friedman, E. G. Memristive model for synaptic circuits. IEEE Trans. Circuits Syst. II 64, 767–771 (2017).
Raqibul, H., Tarek, M. T. & Chris, Y. On-chip training of memristor based deep neural networks. In 2017 International Joint Conference on Neural Networks (ed. Choe, Y.) 3527–3534 (IEEE, 2017).
Yu, S. et al. Scaling-up resistive synaptic arrays for neuro-inspired architecture: challenges and prospect. In 2015 IEEE International Electron Devices Meeting (ed. Suehle, J.) 17.13.11–17.13.14 (IEEE, 2015).
Burr, G. W. et al. Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses) using phase-change memory as the synaptic weight element. IEEE Trans. Electron Devices 62, 3498–3507 (2015).
Chang, C.-C. et al. Mitigating asymmetric nonlinear weight update effects in hardware neural network based on analog resistive synapse. IEEE J. Emerg. Sel. Top. Circuits Syst. 8, 116–124 (2018).
Pai-Yu, C. et al. Mitigating effects of non-ideal synaptic device characteristics for on-chip learning. In 2015 IEEE/ACM International Conference on Computer-Aided Design (ed. Marculescu, D.) 194–199 (IEEE, 2015).
Lee, J. H., Lim, D. H., Jeong, H., Ma, H. M. & Shi, L. P. Exploring cycletocycle and devicetodevice variation tolerance in MLC storagebased neural network training. IEEE Trans. Electron Devices 66, 2172–2178 (2019).
Pei, J. et al. Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature 572, 106–111 (2019).
Ielmini, D. & Ambrogio, S. Emerging neuromorphic devices. Nanotechnology 31, 092001 (2019).
Kim, S. et al. NVM neuromorphic core with 64k-cell (256-by-256) phase change memory synaptic array with on-chip neuron circuits for continuous in-situ learning. In 2015 IEEE International Electron Devices Meeting (ed. Suehle, J.) 17.1.1–17.1.4 (IEEE, 2015).
Raty, J. Y. et al. Aging mechanisms in amorphous phase-change materials. Nat. Commun. 6, 7467 (2015).
Oh, S., Shi, Y., Liu, X., Song, J. & Kuzum, D. Drift-enhanced unsupervised learning of handwritten digits in spiking neural network with PCM synapses. IEEE Electron Device Lett. 39, 1768–1771 (2018).
Boniardi, M. et al. Statistics of resistance drift due to structural relaxation in phase-change memory arrays. IEEE Trans. Electron Devices 57, 2690–2696 (2010).
Ielmini, D., Lavizzari, S., Sharma, D. & Lacaita, A. L. Physical interpretation, modeling and impact on phase change memory (PCM) reliability of resistance drift due to chalcogenide structural relaxation. In 2007 IEEE International Electron Devices Meeting (ed. Philip Wong, H.-S.) 939–942 (IEEE, 2007).
Boniardi, M. & Ielmini, D. Physical origin of the resistance drift exponent in amorphous phase change materials. Appl. Phys. Lett. 98, 243506 (2011).
Pirovano, A. et al. Low-field amorphous state resistance and threshold voltage drift in chalcogenide materials. IEEE Trans. Electron Devices 51, 714–719 (2004).
Ambrogio, S. et al. Reducing the impact of phase-change memory conductance drift on the inference of large-scale hardware neural networks. In 2019 IEEE International Electron Devices Meeting (ed. Takayanagi, M.) 6.1.1–6.1.4 (IEEE, 2019).
Bichler, O. et al. Visual pattern extraction using energy-efficient “2-PCM synapse” neuromorphic architecture. IEEE Trans. Electron Devices 59, 2206–2214 (2012).
Suri, M. et al. Impact of PCM resistance-drift in neuromorphic systems and drift-mitigation strategy. In 2013 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH) (IEEE, 2013).
Dean, J. et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25 (eds Pereira, F., Chris Burges, C. J., Bottou, L. & Weinberger, K. Q.) (MIT Press, 2012).
Knill, D. C. & Pouget, A. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci. 27, 712–719 (2004).
Pouget, A., Drugowitsch, J. & Kepecs, A. Confidence and certainty: distinct probabilistic quantities for different goals. Nat. Neurosci. 19, 366–374 (2016).
Papandreou, N. et al. Drift-tolerant multilevel phase-change memory. In 2011 3rd IEEE International Memory Workshop (ed. Choi, J.) 1–4 (IEEE, 2011).
Chang, C.-C. et al. Mitigating asymmetric nonlinear weight update effects in hardware neural network based on analog resistive synapse. IEEE J. Emerg. Sel. Top. Circuits Syst. 8, 116–124 (2017).
Chen, P.-Y. et al. Mitigating effects of non-ideal synaptic device characteristics for on-chip learning. In 2015 IEEE/ACM International Conference on Computer-Aided Design (ed. Marculescu, D.) 194–199 (IEEE, 2015).
Newton, M. A. & Raftery, A. E. Approximate Bayesian inference with the weighted likelihood bootstrap. J. R. Stat. Soc. Ser. B Stat. Methodol. 56, 3–48 (1994).
Burr, G. W. et al. Neuromorphic computing using non-volatile memory. Adv. Phys.: X 2, 89–124 (2017).
Blundell, C., Cornebise, J., Kavukcuoglu, K. & Wierstra, D. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning (eds Bach, F. & Blei, D.) 1613–1622 (JMLR, 2015).
Rolls, E. T., Grabenhorst, F. & Deco, G. Choice, difficulty, and confidence in the brain. NeuroImage 53, 694–706 (2010).
Perry, C. J. & Barron, A. B. Honey bees selectively avoid difficult choices. Proc. Natl Acad. Sci. USA 110, 19155–19159 (2013).
Gherman, S. & Philiastides, M. G. Neural representations of confidence emerge from the process of decision formation during perceptual choices. NeuroImage 106, 134–143 (2015).
Kiani, R. & Shadlen, M. N. Representation of confidence associated with a decision by neurons in the parietal cortex. Science 324, 759–764 (2009).
Kepecs, A., Uchida, N., Zariwala, H. A. & Mainen, Z. F. Neural correlates, computation and behavioural impact of decision confidence. Nature 455, 227–231 (2008).
Im, D. H. et al. A unified 7.5 nm dash-type confined cell for high performance PRAM device. In 2008 IEEE International Electron Devices Meeting (ed. Brederlow, R.) 1–4 (IEEE, 2008).
Spadini, G., Karpov, I. V. & Kencke, D. L. Future high density memories for computing applications: device behavior and modeling challenges. In 2010 International Conference on Simulation of Semiconductor Processes and Devices (eds Baccarani, G. & Rudan, M.) 223–226 (IEEE, 2010).
Kim, K. & Jeong, G. The prospects of nonvolatile phasechange RAM. Microsyst. Technol. 13, 145–147 (2006).
Redaelli, A., Pirovano, A., Locatelli, A. & Pellizzer, F. Numerical implementation of low field resistance drift for phase change memory simulations. In 2008 Joint Non-Volatile Semiconductor Memory Workshop and International Conference on Memory Technology and Design (ed. Eguchi, R.) 39–42 (IEEE, 2008).
Sebastian, A., Krebs, D., Gallo, M. L., Pozidis, H. & Eleftheriou, E. A collective relaxation model for resistance drift in phase change memory cells. In 2015 IEEE International Reliability Physics Symposium (ed. La Rosa, G.) 1–6 (IEEE, 2015).
Ielmini, D., Lacaita, A. L. & Mantegazza, D. Recovery and drift dynamics of resistance and threshold voltages in phase-change memories. IEEE Trans. Electron Devices 54, 308–315 (2007).
Bottou, L. From machine learning to machine reasoning: an essay. Mach. Learn. 94, 133–149 (2014).
Woo, J. Y. et al. Role of local chemical potential of Cu on data retention properties of Cu-based conductive-bridge RAM. IEEE Electron Device Lett. 37, 173–175 (2016).
Choi, S., Ambrogio, S., Balatti, S., Nardi, F. & Ielmini, D. Resistance drift model for conductive-bridge (CB) RAM by filament surface relaxation. In 2012 4th IEEE International Memory Workshop (ed. Choi, J.) 1–4 (IEEE, 2012).
Choi, Y. et al. A 20 nm 1.8 V 8 Gb PRAM with 40 MB/s program bandwidth. In 2012 IEEE International Solid-State Circuits Conference (ed. Hidaka, H.) 46–48 (IEEE, 2012).
Chung, H. et al. A 58 nm 1.8 V 1 Gb PRAM with 6.4 MB/s program BW. In 2011 IEEE International Solid-State Circuits Conference (ed. Gass, W.) 500–502 (IEEE, 2011).
Lee, K. et al. A 90 nm 1.8 V 512 Mb diode-switch PRAM with 266 MB/s read throughput. IEEE J. Solid-State Circuits 43, 150–162 (2008).
Hwang, Y. N. et al. MLC PRAM with SLC write-speed and robust read scheme. In 2010 Symposium on VLSI Technology (ed. Dennison, C.) 201–202 (IEEE, 2010).
Kang, D. et al. Two-bit cell operation in diode-switch phase change memory cells with 90 nm technology. In 2008 Symposium on VLSI Technology (ed. Woo, J.) 98–99 (IEEE, 2008).
Kang, D.-H., Jun, H.-G., Ryoo, K.-C., Jeong, H. & Sohn, H. Emulation of spike-timing dependent plasticity in nanoscale phase change memory. Neurocomputing 155, 153–158 (2015).
Acknowledgements
This work was partly supported by NSFC (No. 61836004), the Brain-Science Special Program of Beijing under Grant Z181100001518006, the Beijing Innovation Center for Future Chips, the Tsinghua University–China Electronics Technology HIK Group Co. Joint Research Center for Brain-inspired Computing (JCBIC), the National Natural Science Foundation of China–Yunnan Joint Fund (NSFC–Yunnan Joint Fund, 61327902), the Nano Material Technology Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2016M3A7B4910398), the research fund (1.200095.01) of UNIST (Ulsan National Institute of Science and Technology), and the Artificial Intelligence Graduate School Program (No. 2020001336, UNIST) through the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT).
Author information
Authors and Affiliations
Contributions
H. Jeong and L. Shi designed and supervised the entire project. D. Lim measured the device data, simulated the neural network, and performed the analysis. S. Wu contributed to the modelling and analysed the simulation results. J. Lee assisted with the simulation. R. Zhao, H. Jeong, and L. Shi provided ideas for the interpretation of the results. L. Shi, R. Zhao, D. Lim, and S. Wu wrote the manuscript. All authors contributed to manuscript preparation.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lim, D.-H., Wu, S., Zhao, R. et al. Spontaneous sparse learning for PCM-based memristor neural networks. Nat. Commun. 12, 319 (2021). https://doi.org/10.1038/s41467-020-20519-z