Memristive GAN in Analog

Generative Adversarial Network (GAN) requires extensive computing resources making its implementation in edge devices with conventional microprocessor hardware a slow and difficult, if not impossible task. In this paper, we propose to accelerate these intensive neural computations using memristive neural networks in analog domain. The implementation of Analog Memristive Deep Convolutional GAN (AM-DCGAN) using Generator as deconvolutional and Discriminator as convolutional memristive neural network is presented. The system is simulated at circuit level with 1.7 million memristor devices taking into account memristor non-idealities, device and circuit parameters. The design is modular with crossbar arrays having a minimum average power consumption per neural computation of 47nW. The design exclusively uses the principles of neural network dropouts resulting in regularization and lowering the power consumption. The SPICE level simulation of GAN is performed with 0.18 μm CMOS technology and WOx memristive devices with RON = 40 kΩ and ROFF = 250 kΩ, threshold voltage 0.8 V and write voltage at 1.0 V.

the proposed GAn architecture overall architecture. GAN utilizes two neural networks with competitive learning approach: (1) a Generator to produce a new data from the input noise and (2) a Discriminator to discriminate this new data from the training data 2,18 . Figure 1(a) illustrates a hardware implementation of the proposed Analog Memristive Deep Convolutional Generative Adversarial Network (AM-DCGAN). The proposed implementation is inspired by DCGAN algorithm 18 . While the original GAN algorithm is complex and may require batch normalization between layers and max-pooing operation to introduce non-linearity to the system, in the proposed hardware implementation the normalization is not considered and mean-pooling (average-pooling) is used to ensure the simplified implementation in analog domain. For the simple applications like MNIST, the pooling layer is not necessary and the architecture can perform well without it; however, for more complex problems and RGB images, the pooling layer is required. The architecture is applicable for small-scale problems, like handwritten digits generation illustrated in the results section, and shows the proof of concept that GAN can be implemented using analog CMOS-memristive circuits.
In the AM-DCGAN architecture shown in Fig. 1, the Generator Network has been implemented as a Memristive Deconvolutional Neural Network (DCNN). The Discriminator Network, on the other hand, is a Memristive Convolutional Neural Network (CNN). During training of these neural networks, the Generator produces the fake images from input noise, while the Discriminator is trained to identify whether the produced image is fake or real using supervised training method. The Discriminator acts as a binary classifier producing a single output. An example of generated images after training this network on MNIST handwritten digit database 21 is illustrated in Fig. 1(b). The AM-DCGAN architecture can be integrated to the image sensor to read the real images directly from the sensor without the application of analog-to-digital converter (ADC). To remove thermal noise and fixed pattern noise in the analog domain in such system, subtracting and memory circuits can be utilized 22 . In this work, training with Adam optimizer 23 and binary cross-entropy loss function is used for error calculation and update of memristive devices. The training unit performing error propagation and weight updated can also be implemented on-chip. However, such chip would need to be implemented in digital as the endurance of memristors are limited, with repeated writes causing faster aging, and would end up reducing the reliability of inference operation. Deconvolution and Convolution operations in these circuits are performed using memristive crossbar based dot product multiplication. In the Generator block containing k layers, as shown in Fig. 1(c), the input noise is provided to the first deconvolutional layer. This and subsequent layers consist of a number of deconvolution filters, F. The filter outputs are fed to a normalization layer with rectifier linear unit (ReLU) circuits, which outputs only positive inputs. To enhance the training of GAN, especially for complex datasets, the leaky ReLU (LeakyReLU) activation function should be used (circuit level implementation of both is shown in Methods section). This deconvolution process is performed several times to achieve the desired size of the image. In the last layer of the Generator, a hyperbolic tangent (tanh) activation function is used instead of ReLU. The input to this last layer can be either positive or negative. Therefore, the memristive filter with both positive and negative weight (represented by 2 memristors) are used; while for all the previous layers the filters with only positive weights (containing 1 memristor) is enough for simple applications to reduce the complexity of the architecture. However, the performance of the system with both positive and negative weights in both convolutional and deconvolutional filters is still better. Figure 1(d) shows a CNN based Discriminator. It consists of several memristive convolutional layers with convolutional filters (F), mean pooling layers with memristive mean filter (M) and a memristive dense network. The convolutional layer output is fed to the ReLU activation function followed by the mean pooling operation. While for MNIST application shown in this work, the pooling operation is not required and can be removed without compromising on the performance of the system, pooling is necessary for more complex applications. Therefore, in this system we show the memristive mean-pooling filters that can reduce the complexity of analog CNN, comparing to the implementation of having a max-pooling operation. The mean output is then fed to the next convolution layer. After the last convolutional layer, the output image is processed by the dense network, which is another memristive neural network; however with only one output depicting if the image is "fake" or "real". The circuit level implementation of the components of the architecture is illustrate in the Methods section. circuit components. Individual elements of two networks are designed using a typical 180nm CMOS process. The circuits are shown in Fig. 2, with the design parameters presented in Fig. 2(l). Other system components, including the activation functions, are designed using analog CMOS circuits as shown in Fig. 2.
Convolution, deconvolution and mean filters. Convolution and deconvolution filters are realized as a single column of small memristive crossbars, as shown in shown in Fig. 2(a). The weights represented by the conductance of memristive devices in these filters are trained during the training stage. In the deconvolution operation, the crossbars can be of different sizes depending on the input image and required upsampling operation.
The filter of Fig. 2(a) is used only for positive synapses as is required in majority of the layers of CNN and DCNN for simple applications (like MNIST). This reduces the complexity of the circuits by removing the requirement of placing an OpAmp based current to voltage converter after each convolutional filter. However, for more complex databases, both positive and negative synapses are required, otherwise, the performance will be deteriorated. The input to the filter V in is normalized to the range [−V th , V th ], where V th is a threshold voltage of the memristor. The output of each crossbar V outF is read across the load memristor or a set of parallel load memristors, when the minimum resistance of the memristor is higher than the required load, as illustrated in Fig. 2(a). The output voltage V outF is shown in Eq. (1), where R L is a total load resistance. When R L = R L1 = R L2 = . . . = R Lk , R L = R Lk /k, where k is a number of memristors in parallel.
As this method utilises the voltage divider principle, the normalization of voltages between V outF−H and V outF−L (which are the highest and lowest outputs of V outF ) is performed automatically. This also prevents the output voltage from exceeding the maximum filter input V in (or V th ). Our design contains four R L k memristors in parallel. This memristors replace the necessity to put load resistor to read the voltage across it and reduce on-chip area. In the proposed design, the total load resistance should be equal to 1 kΩ. As for WO x memristors, R ON = 40 kΩ, to create the total load of 10 kΩ, four devices in parallel are required. Depending on the memristor material, the number of these devices may vary. In addition, of the on-chip area of the architecture is not a critical aspect, load resistors can be used instead of memrsitors to reduce the fabrication complexity.
The last block of the Generator circuit, however, is a hyperbolic tangent activation function requiring both positive and negative inputs. Hence, the last deconvolutional layer utilises a different filter with positive and negative weights in memristive structures. Such filters can be used in all the convolutional and deconvolutional layers of the architecture, when it is necessary to enhance the performance of the system for more complex applications. In this filter (Fs), shown in Fig. 2(c), negative synapses are created using two memristors and an OpAmp-based subtractor 1 . However, rather than using conventional method with a current-to-voltage converter, we read voltages across the load memristors. To implement positive synapses, G − is set to R OFF , while G + varies according to the required weight value. To create negative synapses, G − vary and G + is set to R OFF . This method, while being a very efficient circuit, also allows us to obtain weight w = 0. The tangent circuit (tanh, shown in Fig. 2(d)) provides a "hard" tangent function. Hence, the low output voltage of the filter ensures the limits of the the input to the tanh circuit. The OpAmp used has two differential and one common gain stages with indirect compensation technique, is shown in Fig. 2(f) 24 . Depending on the R ON /R OFF ratio of the devices and mapping method, an additional amplification may be required for V outF and ′ V outF . The memristive mean-pooling filters used in the Discriminator are shown in Fig. 2(e). They consist of M memristors set to maximum resistance R OFF to limit the current flowing through the memristor and in turn overall power consumption. The mean filters are illustrated separately rather than is a strided convolution operation with convolution filters to emphasize that these filters are not trainable and always set to R OFF . The memristive mean-pooling filters are used to reduce the on-chip area of the architecture, comparing to max-pooling operation. These filters are optional for the MNIST application, as the problem is simple and pooling can be avoided. However, such filters will be useful when training the architecture to generate more complex RGB images.
Activation functions. The activation functions used in AM-DCGAN are either ReLU/LeakyReLU, as shown in Fig. 2(b) or hyperbolic tangent (tanh), as shown in Fig. 2(d). The ReLU circuit is based on a transmission gate. The input voltage is applied to the inverter M 1 /M 2 and transmission gate M 7 /M 8 . The inverters are designed with a threshold V thReLU = 0V. Depending on the sign of the input, if the transmission gate is on, the voltage V inR is passed to the output V outR = V inR . If the transmission gate is off, V outR = 0, due to action of the second transmission gate M 9 /M 10 . The way to change the ReLU to LeakyReLU is proposed in Fig. 2(b). Instead of connecting the transmission gate M 9 /M 10 to the ground for the case of negative input, it is connected to voltage divider R LR1 -R LR2 which reduces the input voltage fro negative case by the ratio R LR2 /R LR1 . In this work, this ratio is set up to 100/400 to satisfy LeakyReLU equation y = max(0.2x, x). The "hard" hyperbolic tangent (tanh) (Fig. 2(d)) is also implemented with two adjusted inverters with parameters shown in Fig. 2(l). The "hard" tanh function is selected due to small on-chip area and power consumption comparing to traditional tanh implementation 25 .
Dense network. The dense network in the discriminator is a conventional memristive neural network with ReLU activation function in the hidden layers and linear activation function in the output layer. The sign of memristive weights in the dense layer is controlled by the sign crossbar, which stores in the sign of each memristor. The output from the main crossbar is read sequentially one column at a time. At the same time, a row of the sign www.nature.com/scientificreports www.nature.com/scientificreports/ crossbar corresponding to the particular column of the main crossbar is activated. The output of the sign crossbar is controlled by the weight sign switch (WSS), which inverts the input voltage applied to the main crossbar, instead of introducing the negative weights to memristors. The sequential processing in the dense layer crossbar is controlled by the control switches (CS) using two transistors. The output current from the column is converted to voltage by an OpAmp based current to voltage converter (IVC) and processed with ReLU function. As the crossbar columns in the dense layer are processed sequentially, the voltage signal is retained using conventional Analog Sample and Hold (SH) circuit shown in Fig. 2(k). The SH circuit contains sample and hold part and two voltage buffers to reduce the effect of the other system components on the sampled signal. The parameters of SH circuit are adjusted for ns operation and small capacitor value (0.05pF) is selected. When V clk is high, the input signal is sampled. The clock signal V clk of SH circuit in each column is programmed to read the ReLU output at a particular time step and retain all the outputs (V clk is kept high) till the "read" of the last column is finished to supply them to the second layer in parallel. In the second dense layer, the processing is performed similarly; however, the crossbar has only one output corresponding to the particular class ("fake" or "real").
Weight sign switch circuit. Weight sign switch (WSS) circuit shown in Fig. 2(g) is used to control the sign of the weights by inverting input voltages applied to the crossbar. Each weight of the network is represented by two memristors: (1) from the main crossbar and (2) from the sign crossbar. The main crossbar stores only the absolute value of the weight, while the sign crossbar stores the sign of the weight in a form of R ON or R OFF state of the memristors. As the columns of the main crossbar in the dense network are processed sequentially, only one column is activated in the main crossbar at a time. While in the sign crossbar all the columns are activated with a single active 1T1R cell at a time, so the signs of all the weights V outS are supplied to WSS in parallel, and the outputs of all WSS circuits V outWSS are fed to the main crossbar. WSS acts as a comparator and either keeps the sign of the input voltage V inWSS (the output of the convolution filter) or changes it. If the voltage from the sign crossbar V outS exceeds the threshold of the inverter M 35 − M 36 in WSS, the thresholding circuit activates the sign switch M 41 − M 44 controlling the difference amplifier. If the sign switch is on, the difference amplifier subtracts 2V inWSS from V inWSS . Otherwise, V outWSS = −V inWSS , when the output of the sign switch is set to 0 V. Therefore, instead of introducing negative memristive weight, we change the sign of the input, if the weight is supposed to be negative.
Crossbar switch and current to voltage converter. The sequential processing of the crossbar columns in the dense layers is controlled by the control switches (CS) based on two transistors shown in Fig. 2(i). The bulk of the transistors in a crossbar switch is connected to the control voltages V DD (for PMOS) and −V DD (for NMOS), rather than to their sources. This improves the linearity of the switch and avoids the leakage current when switch is set to "off ". Nevertheless, with the increase of the number of crossbar rows connected to the transistor, the linearity of the switch decreases. In this work, the switch and following current to voltage converter (IVC) was adjusted for 784 input rows (considering the input of 28 × 28). As the switch is nonlinear, the current does not go beyond 10 mA. In an ideal case without this switch for largest possible weights and inputs, it should be about 160 mA. This hence, normalizes the current without any additional normalization circuit. For smaller number of rows, the ideal current of 25 mA is normalized to 8 mA. Nevertheless, the effect of this normalisation on the overall performance still should be investigated further. The current to voltage converter (IVC, shown in Fig. 2(h)) after the switch is adjusted to convert the current from the switch to voltage. The maximum current is converted to the threshold voltage of the memristor for further processing. This converter provides an inverted output V inv as well as a non-inverted output V outIVC = −V inv .
Memristor non-idealities. This work focuses on the investigation of the effect of various memristor non-idealities [26][27][28] , such as limited number of stable resistive states, resistance variations, device failures and the device aging, on the implementation of the proposed AM-DCGAN. When limited number of stable resistive states is added, the weights are restricted between R ON and R OFF resistances. The intermediate states are distributed uniformly between R ON and R OFF to show the impact of the limited number of states.
The variation of the resistive states of the memristor are represented as Gaussian distribution w = w + G d (μ = 0, std = σ), where w is a normalized weight, G d is a Gaussian distribution with μ = 0 mean and standard deviation std = σ. Figure 3(a) shows the distribution of normalized highest and lowest memristive weights for difference standard deviation (std) of Gaussian distribution, based on the variation of R ON and R OFF states, and corresponding difference between neighboring states for the devices with 128 stable resistive states. Based on Fig. 4(b), the architecture based on memristive devices with weight distribution with std = 0.04 can still www.nature.com/scientificreports www.nature.com/scientificreports/ produce good quality images, while Fig. 3(a) shows that the neighboring states are still close to each other. For std = 0.1 and std = 0.2, the neighboring states (0 and 1/128 in Fig. 3(a)) completely overlap, which leads to the bad quality of the generated images ( Fig. 4(b)).
The failure of the devices in the architecture is introduced by setting the devices to R ON (when the memristor is stack in R ON state), R OFF (when the device is stack in R OFF state) or R = 0 (when the connection is failed completely and the device is not conducting the current). The percentage failure in Fig. 4(c) combines all these case. For example, 1% failure of the device means that 0.25% of random memristive weights are set to R ON , 0.25% of random memristive weights are set to R OFF , and 0.5% of the devices are completely disconnected and set to 0. This method allows to illustrate a generalized scenario of memristor failures and their effect on the performance of the architecture. Figure 3(b) shows effect of the memristor aging of the normalized R ON and R OFF states (highest and lowest memristive weights in the architecture). In general, memristor aging assumes the change in the resistive levels, when high resistance level R OFF decreases and low resistance level R ON increases, and several highest and lowest resistive states can not be reached anymore 29,30 . In the evaluation of the proposed architecture, the general case was considered, where the decrease in R OFF and increase of R ON are assumed to be the same. Figure 3(b) shows how the percentage of aging used in the simulation effects the highest and lowest memristive weights in the architecture. For example, in the architecture with 128 levels, the aging of 4% the aging removes 6 highest and 6 lower resistive states; while the aging of 10% cuts 13 highest and 13 lowest states.

Results
System level simulations. The system has been tested on handwritten digits MNIST handwritten digits database. The Generator with 3 deconvolutional layers and 128 and 64 filter of the size of 5 × 5 (25 memrsitors), and the Discriminator with 2 convolutional layer and 64 and 128 filters of the size of 5 × 5 and single dense layer are implemented. To understand the effect of variability and limited number of stable resistive states of memristive devices, further simulations were carried out with the same database. Figure 4 illustrates the effect of different memristor non-idealities on the quality of the generated images and required number of the training iterations to achieve the high quality images produced by the generator. Figure 4(a) shows the effect of the limited number of stable resistive states in memristive devices on the quality of generated images. From Fig. 4(a), it can be observed that memristive devices with 64 or more resistive levels can be used without causing significant degradation of the image quality. Figure 4(b)illustrates the effect of the memristor variability on the performance of the Generator. This includes random cell-to-cell variability following Gaussian distribution (the stochasticity of R ON /R OFF values of the device in each memristor cell) and possibility of write/read error, when the final programmed value may have a deviation from the ideal case. The variation is resistive states in memristors follows as Gaussian distribution (explained in the Methods section). The results in Fig. 4(b) are shown for the memristive devices with 128 stable resistive states. The architecture can tolerate the distribution of memristive weights (represented as a conductance of the device) with the standard deviation of 0.04 from the ideal weight, and after std = 0.08 the image quality degrades significantly. Figure 4(c) illustrates the effect of random failures of the memristive devices on the generated images. The device failure is introduced randomly considering three cases, when: (1) device is stack in R ON state, (2) device is stack in R OFF state and (3) complete failure of the device, when the current flow is disturbed. The device failure has a significant impact on the performance on the network, and the architecture can tolerate only up to 0.3-0.5% of device failures. Figure 4(d) illustrates the effect of device aging 29,30 on the performance of the architecture. The uniform aging of R ON and R OFF states was assumed (explained in Methods), where certain percentage of high and low resistive states is removed. The architecture can tolerate 2-4% of aging without the significant reduction of the quality of generated images. Figure 4(e) shows how the variation in hardware and deviation from the ideal output affects the quality of the generated images. Random hardware noise was introduced to the outputs of all the activation functions as y n = y × R d (1 − x/100, 1 + x/100), where y is an ideal output, y n is a noisy output, R d (a, b) is a random uniform distribution with minimum value a and maximum value b, and x represents added noise in percentage. Figure 4(e) illustrates the effect on only hardware noise (left side) and the effect of hardware noise with memristor variation with std = 0.02 (right side). In the generated images without memristor variation, the generated numbers can still be recognized even with 40% of the noise in hardware, even though the generated images are noisy. However, in the experiment with both hardware variation and memristor variation, the quality of generated images significantly reduces after 10% of hardware noise. Figure 4(f) illustrates the effect of the noise in the input images to the discriminator during training following random normal distribution on the quality of the generated images. Such noise in the input images may be a result of the non-idealities of the interface between the analog pixel sensor and the proposed architecture in the near-sensor processing. The noise is added according to the equation z n = z + R d (−x/100, x/100), where z n is the noisy image, z is an ideal input image, x is the added input noise in percentage considering that the image pixel values are normalized in the range of [0,1]. According to Fig. 4(f), the system can tolerate up to 10% of added random noise.
Finally, Fig. 4(g) illustrates the effect of number of training epochs on the generated image showing that 50 epochs, corresponding to maximum 3M update cycles of memristive devices, can produce good image quality (used in the calculated of required number of memristor update cycles). Figure 5 shows the quantitative analysis of the generated images represented by Frechet Inception Distance (FID) score 31 calculated for 10,000 generated images compared to 10,000 randomly selected images from the training set of the MNIST database. Provided FID scores correspond to the generated images shown in Fig. 4. The meaning and calculation of the FID score is explained in Methods. The simulation results in Fig. 5(a) show that relatively low FID score for the architecture with limited number of resistive states can be achieved starting from 32 states. Figure 5(b-d) illustrate the calculated FID scores for variation in resistive states, device failure and aging, respectively. The dependence of FID score on memristor variation and percentage aging is close to linear. While, even insignificant percentage of the device failure in the system affects FID score and quality of generated images significantly. Figure 5(e) illustrates the effect of the hardware variation (noise) on the performance of the www.nature.com/scientificreports www.nature.com/scientificreports/ architecture without memristor variability and with memristor variability of std = 0.02. While for hardware variation of up to 35% the memristor variability (blue line) contributes significantly to the increase of FID score, the FID scores for the simulation with and without memristor variability become almost identical for large variation in hardware (after 35%). In addition, the effect of hardware variation cause smaller deterioration of the quality of generated images than the memristor variation for the given parameters. Figure 5(f) illustrates FID scores for noisy input images, where FID score increases significantly after 10% of the input noise, which correlates with the results illustrated in Fig. 4(f).  www.nature.com/scientificreports www.nature.com/scientificreports/   Figure 6(g) illustrates the simulation results for control switch for the crossbar columns in ON and OFF states. The temperature variation does not affect the performance of the switch. The variation of transistor length L does not cause the significant deviation from the ideal output. Figure 6(h) illustrates the performance analysis of the SH circuit, where first two graphs show the clock signal V clk and corresponding input voltage and sampled output voltage. The temperature variation does not affect the performance of SH circuit, as well as L variation. Table 1 shows maximum power consumption, on-chip area and delay in various circuit components of the system. We designed the circuit using a stable 180 nm CMOS technology to prove that GAN circuit can work with memristive devices. Knowm MSS (Multi-Stable Switch) model is used to simulate the WOx memristive devices 32 . This however, has led to high power consumption and can be reduced by using smaller geometry processes. The on-chip area of CMOS part of the complete Generator configuration with 3 layers containing [128, 64, 1] filters, input noise size of 7 × 7 and output image size of 28 × 28 is 0.581 mm 2 . The overall delay for forward propagation in the Generator circuit with 3 layers is about 0.68 ns. On the other hand, the delay in the Discriminator circuit with 3 convolutional and 2 dense layers is 0.122 μs. This is due to the sequential processing in the crossbar columns in the dense layers. Hence, after training, the Generator circuit can have high performance speed, as the processing is performed in analog domain and analog to digital conversion stage does not limit the processing speed by the sampling rate of ADCs and DACs. Nevertheless, if it is desired to reduce power consumption, this fast speed can be compromised introducing sequential processing in the layers.
CMOS scaling. Scaling of the proposed architecture for smaller CMOS technology allows to improve power consumption and on-chip area of the system. We show the comparison of system components, filters and crossbars designed in 45 nm CMOS technology to 180 nm CMOS circuits. The simulations were performed using 45 nm CMOS PTM models 33 . Table 2 shows the changes in the performance metrics, such as power consumption, on-chip area, circuit delay, temperature and transistor length variation effects, due to scaling down of the circuits to 45 nm with V DD = 1V for separate circuit components and for filters with activation functions and dense layer crossbar. In general, the power consumption and on-chip area are significantly improved, however, the temperature variation and transistor length L variation of the circuits designed 45 nm cause more significant output variation than for 180 nm CMOS technology. The delay of the circuits is affected by the used transistor length L in the design. In the circuits with minimum L = 45 nm, the delay is decreased, comparing to 180nm circuits; however the delay increases if the circuit length is set to L = 90 nm. www.nature.com/scientificreports www.nature.com/scientificreports/ When scaling down, the operation range of the ReLU and LeakyReLU circuits decreased to [−1.2 V, 1.2 V] due to the decrease of V DD . In addition, for the small input voltages [−8 mV, 8 mV], the offset of 2 mV occurs, which can be improved by optimizing the inverters M 1 − M 4 . In the Tanh circuit, there is a small offset (maximum 0.1 V) in the DC response and the slight change in the slope for the input voltage of 0.1 V − 0.6 V, which can be further improved by optimizing the circuit parameters. If the OpAmp circuit is scaled to the minimum length L = 45 nm, the circuit performance is affected by the temperature and L variation, also it does not reach V DD = 1 V (only 0.8 V) in a closed loop configuration, while the delay is reduced. Therefore, the performance of the OpAmp with L = 90 nm is shown. Such OpAmp is more stable and accurate, comparing to the one with L = 45 nm, however the increase in L affects the dynamic behavior of the circuit and causes the output delay. In the scaling of the WSS circuit, R 11 and R 12 are increase to 2 kΩ to improve the performance. If the OpAmp with L = 45 nm is used in WSS, the operation range is reduced and the variation in temperature and L can cause the distortion of the output. Therefore, the OpAmp with L = 90 nm is more effective for such circuit, however 4 ns is required for the circuit to fully stabilize.
The second part of Table 2 compares the performance of the convolutional filters and dense layer crossbar and processing circuits in terms of power consumption, on-chip area, total delay and errors at different parts of the circuit. The convolutional filters were simulated with 10 positive and negative weights followed by the difference amplifier and LeakyReLU or Tanh circuits. The dense layer crossbar was simulated with 9 × 9 memristors in both main and sign crossbar, 9 WSS circuits, CS, IVC and LeakyReLU circuits. The R 8 − R 10 resistance values for convolutional filters and R 13 − R 15 for IVC in the dense layer have been increase to 1 kΩ to avoid the impact of the processing circuits on the memristive crossbar. Overall, the power consumption and on-chip area decrease significantly for 45 nm CMOS technology, however, the delay increases. The output errors vary for different components of the system. For successful scaling for the design, the trade-off between processing speed affected by the circuit delay and on-chip area and power consumption should be found.
Memristor variability effect. Variability and non-idealities of memristive devices can deteriorate the performance of the architectures. Hence, Monte Carlo simulations were also performed to understand the operational limits of the circuit. As actual memristor variations are not very well known, we used large conservative estimates of 10% and 30% variations in memristor performances. A typical Generator circuit with 2 deconvolutional layers with ReLU activation functions and one final layer with hyperbolic tangent was used. Figure 7(a-f) shows these results. The first row ( Fig. 7(a,b)) compares the performance of the circuit with ideal response showing very little absolute error in both Relu as well as tanh output nodes. The second and third row of Fig. 7 (Fig. 7(c-f)) show results of introducing variability at 10% and 30% respectively. It can be observed that the tanh layer has higher sensitivity to memristor variability. This is expected as being the output layer, it also accumulates errors from all previous layers. The output spikes in the timing diagram are caused by the delay of inverters in ReLU circuit and can be compensated with additional circuit components.   www.nature.com/scientificreports www.nature.com/scientificreports/ variability (from 0% to 40%). The simulation was performed for 10 particular cases with 10 random iterations of the variation in memristor values, and the average value is reported for 1st convolutional filter output, mean filter output, dense layer input and dense layer output (Fig. 7(h-k)). The results for the 1st convolutional filter output variation show that the error for 30% is 2-3% less than the error for 40% memristor variation. This is due to the testing of only few particular random cases of memristor variation and averaging of the final error results. Overall, the output errors for 30% and 40% variation are similar, as illustrated by the example in Fig. 7(g), where the output variation of two different convolutional filters is shown. As the signal propagates through the memristive levels, the errors at the output of each stage increase and accumulate for the next stage.
Training time and power. Based on Fig. 4(e), we also calculated the time and power required for the update (training) cycle. We consider WO x memristive devices, with R ON = 40 kΩ and R OFF = 250 kΩ, threshold voltage 0.8 V, write voltage 1 V and different "write" times. These were used in a three layer Generator and Discriminator architectures with 1.7 million memristive weights for 50 epochs. 60,000 MNIST images of size 28 × 28 were used. Furthermore, a maximum of 128 convolutional filters of the size of 5 × 5 are used. These results for different ways of parallel and sequential update of memristors in the architecture are shown in Table 3. The training time per epoch and average power per computation is reported. Considering the number of required update cycles for memristive devices from Table 3 and Fig. 4(c), the memristor endurance for on-chip training of such networks may be a problem. In such cases, one may consider the use of AM-DCGAN for partial image-generation and network training. The software or microprocessor based pre-training up to certain accuracy followed by complete hardware-based training can be used to improve the learning speed of AM-DCGAN.

Discussion and conclusion
This paper proposes the first implementation of fully analog memristive hardware implementation of AM-DCGAN. The practical aspects such as endurance, limitations on conductive levels, and circuit parasitics from wires and devices were included in the design and simulations. The system remains stable for conductance variation up to std = 0.04 with 128 conductance levels, and inference performance substantially degrades at std = 0.08. The system can tolerate up to 4% of memristor aging and up to 0.3-0.5% of device failures in the proposed architecture. The hardware variation (noise) reduces the quality and adds noise to the generated images, while the hardware variation up to 10% with memristor variability can still be tolerated. Up to 10% of noise in the input images, that can be the result of the interface between image sensor and the proposed circuit in near-sensor processing, can also be tolerated. By treating the memristive neural networks as hardware subroutines, it is possible to develop an analog system that can speed up complex neural computing for near-sensor data processing of GAN. The main advantages of AM-DCGAN are small on-chip area and possibility to implement it on edge devices and integrate directly to analog sensors to process information in real time and speed up the training. The WO x memristive devices are compatible with 180 nm CMOS process 11 , and the proposed CMOS-memristive www.nature.com/scientificreports www.nature.com/scientificreports/ system can be fabricated using Back-End-of-Line (BEOL) process 34 placing the memristive crossbars on top of the CMOS circuits.
In this work, only basic non-idealities of memristive devices, which traditionally affect the performance of the neural networks significantly, have been considered. While the memristor variation effect shown in the simulation results illustrates the worst case scenario, where for std > 0.04 even the highest and lowest memristive states can correlate, and the aging simulation cause some of the states to disappear, the effect of non-linearity of memristor programming and non-linear distribution of resistive states in the particular devices can be studied further.
The design of the control circuit and overall chip implementation of the proposed GAN are still open research problems and may be considered in future works. In addition, the implementation of more complex analog GAN architecture with max-pooling, normalization and an on-chip training unit for generation of more complex images is an open problem that can be addressed in future. While, MNIST handwritten digits generation of the size of 28 × 28 illustrates that memristive devices with variabilities and limited number of stable resistive states can still be used for analog GAN implementation, more complex cases with large image, and in turn increased hardware complexity, should be considered as a research direction for the further implementations of analog memristor based GAN architectures.

Methods technology, models and tools. The system level simulation for image generation was performed in
Python, considering the outputs and non-idealities of the circuits verified in SPICE. The circuit level simulations were performed in SPICE using 180 nm CMOS technology and Knowm MSS (Multi-Stable Switch) memristor model 32,35 based on the parameters of WO x memristive devices 19,36 . In this model, the current of the device is described by two components, memory-dependent current I m and Schottky diode current I s 37 , shown in Eq. (2). The parameters α f , β f , α r , β r are positive device specific constants describing an exponential behavior of I s , which are different for each memristor material. The conductance of the device is represented as a total conductance of N metastable switches. The memory-dependent current I m can be represented as I m = V(G m + ΔG m ), where G m is a total memristor conductance over N metastable switches and ΔG m is a change in conductance relying on the distributions P A and P B

37
. The distributions P A and P B are the probability of transition between A and B states are shown in Eqs. (3) and (4), where β = − V T 1 is a thermal parameter, and α = Δt∕t c . The parameter Δt is time step ratio, and t c is a time scale of the device 32 . The FID scores for memristor variation and device failure in Fig. 5 are calculated as an average of 10 random cases of different variabilities and device failures.