An adaptive synaptic array using Fowler–Nordheim dynamic analog memory

In this paper we present an adaptive synaptic array that can be used to improve the energy efficiency of training machine learning (ML) systems. The synaptic array comprises an ensemble of analog memory elements, each of which is a micro-scale dynamical system in its own right, storing information in its temporal state trajectory. The state trajectories are then modulated by a system-level learning algorithm such that the ensemble trajectory is guided towards the optimal solution. We show that the extrinsic energy required for state-trajectory modulation can be matched to the dynamics of neural network learning, which leads to a significant reduction in the energy dissipated for memory updates during ML training. Thus, the proposed synaptic array could have significant implications for addressing the energy-efficiency imbalance between the training and inference phases observed in artificial intelligence (AI) systems.


Introduction
Implementation of reliable and scalable synaptic weights or memory remains an unresolved challenge in the design of energy-efficient machine learning (ML) and neuromorphic processors [1]. Ideally, the synaptic weights should be "analog" and should be implemented on a non-volatile, easily modifiable storage device [2]. Furthermore, if these memory elements are integrated in proximity to the computing circuits or processing elements, then the resulting compute-in-memory (CIM) architecture [3,4] has the potential to mitigate the "memory wall" [5,6,7], the energy-efficiency bottleneck in ML processors that arises due to repeated memory access. In most practical and scalable implementations, the processing elements are implemented using CMOS circuits; as a result, it is desirable that the analog synaptic weights be implemented using a CMOS-compatible technology. In the literature, several multi-level non-volatile memory devices have been proposed for implementing analog synapses. These include cross-bar memristor-based resistive random-access memories (RRAM) [8], magnetic random-access memories (MRAM) [9], phase-change memory (PCM) [10], spin-torque-transfer RAM (STTRAM) [11], and conductive-bridge RAM [12], or three-terminal devices like floating-gate transistors [13], ferroelectric field-effect transistor-based RAM (FeRAM) [14], charge-trap memory [15] and electrochemical RAMs (ECRAM) [16]. In all these devices the analog memory states are static in nature, where each state needs to be separated from the others by an energy barrier ΔE. In non-volatile storage, it is critical that this energy barrier be chosen large enough to prevent memory leakage due to thermal fluctuations or other environmental disturbances. For example, in memristive devices the state of the conductive filament between two electrodes determines the stored analog value, whereas in charge-based devices like floating gates or FeFETs, the state of polarization determines the analog
value. At a fundamental level, the energy dissipated to transition between different analog states is determined by the energy barrier ΔE. For example, switching the RRAM memory state requires 100 fJ per bit [17], whereas STT-MRAM requires about 4.5 pJ per bit [18]. A learning/training algorithm that adapts its weights in quantized steps (…, Wn−1, Wn, Wn+1, …) towards a target solution (or local extremum) must dissipate energy (…, ΔEn−1, ΔEn, ΔEn+1, …) for memory updates, as shown in Fig. 1(a). In this paper we present a synaptic memory device that uses dynamical states (instead of static states) to implement analog memory, in an effort to improve the energy efficiency of ML training. The core of the proposed device is itself a micro-dynamical system, and the system-level learning/training process modulates the dynamical state (or state trajectory) of the memory ensembles. The concept is illustrated in Fig. 1(b), which shows a reference ensemble trajectory that continuously decays towards a zero vector in the absence of any external modulation. During the process of learning, however, the trajectory of the memory ensemble is pushed towards an optimal solution W*. The main premise of this paper is that the extrinsic energy (…, ΔEn−1, ΔEn, ΔEn+1, …) required for modulation, if matched to the dynamics of learning, could reduce the energy budget for ML training. This is illustrated in Fig.
1(c), which shows a convergence plot corresponding to a typical ML system as it transitions from a training phase to an inference phase. During the training phase, the synaptic weights are adapted based on some learning criterion, whereas in the inference phase the synaptic weights remain fixed or are adapted intermittently to account for changes in the operating conditions. Generally, the amount of weight updates during the training phase is significantly higher than in the inference phase; as a result, memory update operations require a significant amount of energy. Take, for example, support-vector machine (SVM) training: the number of weight updates scales quadratically with the number of support vectors and the size of the training data, whereas adapting the SVM during inference scales only linearly with the number of support vectors [19]. Thus, for a constant energy dissipation per update, the total energy dissipated due to weight updates is significantly higher in training than during inference. However, if the energy budget per weight update could follow a temporal profile as shown in Fig. 1c, wherein the energy dissipation is no longer constant but inversely proportional to the expected weight-update rate, then the total energy dissipated during training could be significantly reduced. One way to reduce the weight-update (memory-write) energy budget is to trade off the weight's retention rate according to the profile shown in Fig. 1c. During the training phase, the synaptic element can tolerate lower retention rates or parameter leakage, because this physical process can be matched to the process of weight decay or regularization, techniques commonly used in ML algorithms to achieve better generalization performance [20]. As shown in Fig. 1c, the memory's retention rate should increase as the training progresses such that at convergence, or in the inference phase, the weights are stored in a non-volatile memory.
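The energy argument above can be made concrete with a small numerical sketch. The update-time density, the 5 fJ to 2.5 pJ energy range (reported later in this paper), and the exponential annealing schedule below are illustrative assumptions, not measured device data:

```python
import numpy as np

# Sketch: SGD-style training produces many weight updates early and few late,
# while an annealed write energy per update grows with time (5 fJ -> 2.5 pJ).
# Compare the total write energy of this annealed budget against paying the
# constant worst-case (non-volatile) cost for every update.
rng = np.random.default_rng(0)

T = 1000.0            # training duration (arbitrary units)
n_updates = 2000
# update times drawn with ~1/t density: many early updates, few late ones
t = np.exp(rng.random(n_updates) * np.log(T + 1.0)) - 1.0

def write_energy(t, e_min=5e-15, e_max=2.5e-12, T=1000.0):
    """Annealed per-update write energy: cheap (leaky) early, costly late."""
    return e_min * (e_max / e_min) ** (t / T)

annealed = write_energy(t).sum()
constant = n_updates * 2.5e-12   # every update paid at the non-volatile cost
print(f"annealed/constant write-energy ratio: {annealed / constant:.3f}")
```

Because most updates land early, when the annealed budget is small, the total write energy comes out well below the constant-budget total.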
In this paper we describe a dynamic analog memory (DAM) that can exhibit a temporal profile similar to that of Fig. 1c. Furthermore, the memory is implemented in a standard CMOS process without the need for any additional processing layers. Fig. 1e shows a micrograph of a DAM array, and in Supplementary Section I we describe the circuit implementation details. The proposed DAM requires a Fowler-Nordheim (FN) quantum-tunneling barrier, which can be created by injecting sufficient electrons onto a polysilicon island (floating gate) that is electrically isolated by thin silicon-dioxide barriers [21]. As electrons tunnel through the triangular barrier, as shown in Fig. 1f, the barrier profile changes, which further inhibits the tunneling of electrons. We have previously shown that the dynamics of this simple system are robust enough to implement time-keeping devices [22] and self-powered sensors [23]. In this paper, we use a pair of synchronized FN dynamical systems to implement a DAM suitable for implementing ML training/inference engines. Figure 1(f) shows the dynamics of two FN dynamical systems, labeled SET and RESET, whose analog states continuously and synchronously decay with respect to time. In our previous work [23], we have shown that the dynamics of different FN dynamical systems can be synchronized with respect to each other with an accuracy greater than 99.9%. However, when an external voltage pulse modulates the SET system, as shown in Fig. 1f, the dynamics of the SET system become desynchronized with respect to the RESET system. The degree of desynchronization is a function of the state of the memory at different time instances (Fig.
1g, insets g1-g3), which determines the memory's retention rate. For instance, at time instant t1, a small-magnitude pulse would produce the same degree of desynchronization as a large-magnitude pulse at time instant t3. However, at t1 the pair of desynchronized systems (SET and RESET) would resynchronize more rapidly than desynchronized systems at time instants t2 or t3. This resynchronization effect results in shorter data retention; however, this feature can be leveraged to implement weight decay in ML training. At time instant t3, the resynchronization effect is weak enough that the FN dynamical system acts as a persistent non-volatile memory with a high data-retention time. In the Methods section, we derive the mathematical model of the FN dynamical system and compare it to the ML training formulation. We show that the energy required for updating the memory and its data-retention capacity can be annealed according to the profile shown in Fig. 1c. The dynamics of the FN-tunneling-based DAM (or FN-DAM) were verified using prototypes fabricated in a standard CMOS process (micrograph shown in Fig. 1e). The FN-DAM devices were programmed and initialized through a combination of FN tunneling and hot-electron injection. A detailed description of the general programming process can be found in [23], with implementation-specific notes in the Methods section. The tunneling nodes (WS and WR in Fig. 1e) were initialized to around 8 V and decoupled from the readout node by a decoupling capacitor to the sense buffers (shown in Supplementary Fig. 1). The readout nodes were biased at a lower voltage (~3 V) to prevent hot-electron injection [24] onto the floating gate during readout operation.

Dynamic analog memory with an asymptotic non-volatile storage
The different regimes were obtained by initializing the tunneling nodes (WS and WR) to different voltages (see Methods section), whilst ensuring that the tunneling rates on the WS and WR nodes were equal. Initially (during the training phase), the tunneling-node voltages were biased high (readout node voltage of 3.1 V), leading to faster FN tunneling (Fig. 2, inset a). A square input pulse of 100 mV magnitude and 500 ms duration (5 fJ of energy) was found to be sufficient to desynchronize the SET node by 1 mV. However, as shown in Fig. 2(b), the rate of resynchronization in this regime is high, leading to a decay of the stored weight down to 30% of its value in 40 s. At t = 90 s, the voltage at node WS has reduced (readout node voltage of 2.9 V), and a larger voltage amplitude (500 mV) is required to achieve the same desynchronization magnitude of 1 mV, corresponding to an energy expenditure of 125 fJ. However, as shown in Fig. 2(c), the rate of resynchronization in this regime is lower, leading to a decay of the stored weight down to only 70% of its value in 40 s. Similarly, at a later time instant t = 540 s, a 1 V signal desynchronizes the recorder by 1 mV; however, as shown in Fig. 2(d), in this regime 95% of the stored weight value is retained after 40 s. This mode of operation is suitable during the inference phase of machine learning, when the weights have already been trained but the models need to be sporadically adapted to account for statistical drifts. Modeling studies described in Supplementary Section II show that the write energy per update starts from as low as 5 fJ and increases to 2.5 pJ over a period of 12 days. Supplementary Fig. 3 indicates that at lower WS/WR operating voltages (~6 V), or at later instants of time, the retention time of the FN-DAM converges to that of other FLASH-based memories.
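This retention trade-off follows directly from the 1/log trajectory of the tunneling node (Eqn. 1 in Methods). The sketch below uses illustrative parameters k1, k2 and V(0), not fitted device values, and models a SET pulse as a small jump along the same trajectory:

```python
import numpy as np

# FN floating-gate trajectory V(t) = k2 / log(k1*t + k0), k0 = exp(k2/V(0)).
# k1, k2, V0 are illustrative values, not fitted device parameters.
k1, k2, V0 = 1.0, 30.0, 7.5
k0 = np.exp(k2 / V0)

def V(t):
    return k2 / np.log(k1 * t + k0)

def t_of_V(v):
    # inverse of V(t): the time at which the trajectory passes through v
    return (np.exp(k2 / v) - k0) / k1

def retained_fraction(tp, dt=40.0, dw=1e-3):
    """Fraction of a small desynchronization dw (written at time tp) that
    survives after dt seconds, as SET and RESET resynchronize."""
    ts = t_of_V(V(tp) + dw)   # perturbed SET node: earlier effective time
    return (V(ts + dt) - V(tp + dt)) / dw

early, late = retained_fraction(10.0), retained_fraction(500.0)
print(f"retained after 40 s: {early:.2f} (early write) vs {late:.2f} (late write)")
```

Because the trajectory flattens with time, a desynchronization written later survives better, mirroring the 30%/70%/95% retention trend measured in Fig. 2.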

Figure 3: (a-b) FN-DAM response to SET pulses of varying frequency. (c) Change in WS and WR potentials due to SET and RESET pulses. (g) DAM response calculated as the difference between WS and WR voltages. Error bars indicate the standard deviation estimated across 12 devices.
Each DAM in the FN-DAM device was programmed by independently modulating the SET and RESET junctions shown in Fig. 1(e). The corresponding WS and WR nodes were initially synchronized with respect to each other. After a programming pulse was applied to the SET or RESET control gate, the difference between the voltages at the WS and WR nodes was measured using an array of sense buffers. In the results shown in Fig. 3a-d, a sequence of 100 ms SET and RESET pulses was applied. The measured difference between the voltages at the WS and WR nodes indicates the state of the memory. Each SET pulse increases the state while a RESET pulse decreases the state. In this way, the FN device can implement a DAM that
is bidirectionally programmable with unipolar pulses. Fig. 3d also shows the cumulative nature of the FN-DAM updates, which implies that the device can work as an incremental/decremental counter. Fig. 3e-f show measurement results which demonstrate the resolution at which an FN-DAM can be programmed as an analog memory. The analog state can be updated by applying digital pulses of varying frequency and a variable number of pulses. In Fig. 3e, four cases of applying a 3 V SET signal for a total of 100 ms are shown: a single 100 ms pulse; two 50 ms pulses; four 25 ms pulses; and eight 12.5 ms pulses. The results show that the net change in the stored weight was consistent across the four cases. A higher frequency leads to finer control of the analog memory updates. Note that any variations across devices can be calibrated out or mitigated by using an appropriate learning algorithm [25]. The variations could also be reduced by using careful layout techniques and precise timing of the control signals. The FN-DAM device can be programmed by changing the magnitude of the SET/RESET pulse or its duration (equivalently, the number of pulses of fixed duration). Fig. 4a shows the response when the magnitude of the SET and RESET input signals varies from 4.1 V to 4.5 V. The measured response in Fig. 4a shows an exponential relationship with the amplitude of the signal. When short-duration pulses are used for programming, the stored value varies linearly with the number of pulses, as shown in Fig. 4b. However, repeated application of pulses of constant magnitude produces successively smaller changes in the programmed value due to the dynamics of the DAM device (Fig. 4a). One way to achieve a constant response is to pre-compensate the SET/RESET control voltages such that a target voltage difference y = (WS − WR) can be realized. The differential architecture increases the robustness of the device state against disruptions from thermal fluctuations (Fig. 4d). The value stored on DAM devices will leak due to thermally induced processes or trap-assisted tunneling. However, in the DAM the weight is stored as the difference between the voltages of the WS and WR tunneling junctions, which are similarly affected by temperature fluctuations. To verify this, we exposed the FN-DAM device to temperatures ranging from 5 to 40 °C. Fig. 4d shows that the DAM response is robust to temperature variation and that the amount of desynchronization for a single recorder never exceeds 20 mV. When responses from multiple FN-DAM devices are pooled together, the variation due to temperature reduces further.

The training points were presented in a randomized order, with a two-second interval between successive points. Fig. 5b shows that after training for 5 epochs, the learned boundary can correctly classify the given data. Fig. 5c shows the evolution of the weights as a function of time. As can be noted in the figure, the magnitude of the weight updates (the negative of the cost-function gradient) was initially high for the first 50 seconds, after which the weights stabilized and required smaller updates. The energy consumption of the training algorithm can be estimated from the magnitude and number of SET/RESET pulses required to carry out the update for each misclassified point. As the SET/RESET nodes evolve in time, they require larger voltages for carrying out updates, as shown in Fig.
5d. The gradient magnitude was mapped onto an equivalent number of 1 kHz pulses, rounded to the nearest integer. Fig. 5e shows the energy (per unit capacitance) required to carry out the weight update whenever a point was misclassified. Though the total magnitude of the weight updates decreased with each epoch, the energy required to carry out the updates had lower variation (Fig. 5f). The relatively larger energy required for the smaller weight updates at later epochs led to longer retention times of the weights (Supplementary Fig. 3).
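The pulse-programming behaviour of Figs. 3e and 4a-b can be sketched with a simple forward-Euler model of the tunneling rate implied by Eqn. 1 (Methods). The rate expression follows from the model; the coupling ratio kc, the amplitudes, and the k1, k2 values are illustrative assumptions:

```python
import numpy as np

# Tunneling rate implied by V(t) = k2/log(k1*t+k0): dV/dt = -(k1/k2) V^2 e^(-k2/V).
# During a pulse of amplitude `amp`, the input capacitor couples kc*amp onto
# the node, boosting the tunneling rate. K1, K2, kc, amp are illustrative.
K1, K2 = 1.0, 30.0

def fn_rate(v):
    return (K1 / K2) * v**2 * np.exp(-K2 / v)

def net_update(v0, amp, n_pulses, total_on=0.1, kc=0.8):
    """Extra charge (expressed as voltage) moved by splitting `total_on`
    seconds of ON time into n_pulses equal pulses (forward-Euler sketch)."""
    v, moved, dt = v0, 0.0, total_on / n_pulses
    for _ in range(n_pulses):
        dv = fn_rate(v + kc * amp) * dt   # extra tunneling during the pulse
        v -= dv
        moved += dv
    return moved

one   = net_update(7.0, 3.0, 1)   # single 100 ms pulse
eight = net_update(7.0, 3.0, 8)   # eight 12.5 ms pulses
print(f"net update: {one:.4e} vs {eight:.4e} (consistent, cf. Fig. 3e)")
```

The net update is nearly identical however the ON time is split (the Fig. 3e observation), while the strong exponential dependence of the rate on amplitude reproduces the coarse control of Fig. 4a.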

Discussions
In this paper we reported a Fowler-Nordheim quantum-tunneling-based dynamic analog memory (FN-DAM) whose physical dynamics can be matched to the dynamics of weight updates used in machine learning (ML) or neural network training. During the training phase, the weights stored on the FN-DAM are plastic in nature and decay according to a learning-rate evolution that is necessary for the convergence of gradient-descent training [27]. As the training phase transitions to an inference phase, the FN-DAM acts as a non-volatile memory. As a result, the trained weights are persistently stored without requiring any additional refresh steps (as used in volatile embedded DRAM architectures [28]). The plasticity of the FN-DAM during the training phase can be traded off against the energy required to update the weights. This is important because the number of weight updates during training scales quadratically with the number of parameters; hence the energy budget during training is significantly higher than the energy budget for inference. The dynamics of the FN-DAM bear similarity to the process of annealing used in neural network training and other stochastic optimization engines to overcome local minima [29]. Thus, it is possible that FN-DAM implementations of ML processors can naturally implement annealing without dissipating any additional energy. If such dynamics were to be emulated on other analog memories, additional hardware and control circuitry would be required. In Supplementary Section IV, we show that an FN-DAM-based deep neural network (DNN) can achieve classification accuracy similar to that of a conventional DNN while dissipating significantly less energy during training. Note that for this demonstration, only the fully connected layers were trained while the feature layers were kept static. This mode of training is common for many practical DNN implementations on edge-computing platforms, where the goal is to improve the energy efficiency not only of inference but also of training [30].
Several challenges exist in scaling the FN-DAM to large neural networks. Training a large-scale neural network can take days to months [31], depending on the complexity of the problem, the complexity of the network, and the size of the training data. This implies that the FN-DAM dynamics need to match these long training durations as well. Fortunately, the 1/log characteristics of FN devices ensure that the dynamics can last for durations greater than a year [32]. The other challenge that might limit the scaling of the FN-DAM to large neural networks is measurement precision. The resolution of the measurement and read-out circuits limits the energy dissipated during memory access and how fast the gradients can be computed (Supplementary Fig. 5). For instance, a 1 pF floating-gate capacitance can be initialized to store 10^7 electrons. Even if one were able to measure the change in synaptic weights for every electron-tunneling event, the read-out circuits would need to discriminate ~100 nV changes. A more realistic scenario would be measuring the change in voltage after 1000 electron-tunneling events, which would imply measuring ~100 µV changes. However, this reduces the resolution of the stored weights/updates to about 14 bits. This resolution might be sufficient for training a medium-sized neural network; however, it is still an open question whether it would be sufficient for training large-scale networks [33,34]. A mechanism to improve the dynamic range and the measurement resolution is to use a current-mode readout integrated with a current-mode neural network architecture. If the read-out transistor is biased in weak inversion, 120 dB of dynamic range could potentially be achieved. However, note that even in this operating mode, the resolution of the weight would still be limited by the number of electrons and the quantization due to electron transport. Addressing this limitation will be a part of future research.
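These resolution figures follow from simple charge arithmetic; with exact constants, the per-electron step comes out at about 160 nV, i.e. on the order of the ~100 nV quoted above:

```python
import math

q = 1.602e-19        # electron charge (C)
C_fg = 1e-12         # floating-gate capacitance (F): 1 pF
n_stored = 1e7       # electrons stored at initialization

dv_per_electron = q / C_fg            # ~160 nV per tunneling event
dv_per_1000 = 1000 * dv_per_electron  # ~160 uV per 1000 events
levels = n_stored / 1000              # distinguishable weight levels
bits = math.log2(levels)              # ~13.3, i.e. about 14 bits

print(f"{dv_per_electron*1e9:.0f} nV/electron, "
      f"{dv_per_1000*1e6:.0f} uV/1000 electrons, {bits:.1f} bits")
```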
Another limitation, which arises due to the finite number of electrons stored on the floating gate and transported across the tunneling barrier during SET and RESET, is the speed of programming. Shorter-duration programming pulses reduce the change in the stored voltage (weight), which is beneficial when precise updates are desired. In contrast, by increasing the magnitude of the programming pulses, as shown in Fig. 4(a), the change in stored voltage can be coarsely adjusted. However, this limits the number of updates before the weights saturate. Note that due to device mismatch the programmed values will differ across FN-DAM devices.
In terms of endurance, after a single initialization the FN-DAM can support 10^3-10^4 update cycles before the weight saturates. However, at its core the FN-DAM is a FLASH technology and can potentially be reinitialized. Given that the endurance of FLASH memory is about 10^3 cycles [35], the FN-DAM is anticipated to have an endurance of 10^6-10^7 cycles. In terms of other memory performance metrics, the ION/IOFF ratio for the FN-DAM is determined by the operating regime and the read-out mechanism. Supplementary Fig. 6 shows the expected ratio estimated using the FN-DAM model. Also, when biased as a non-volatile memory, the FN-DAM requires on-chip charge pumps only to generate high-voltage programming pulses for infrequent global erase; thus, compared to FLASH memory, the FN-DAM should have fewer failure modes [36].
The main advantage of the FN-DAM compared to other emerging memory technologies is its scalability and compatibility with CMOS. At its core, the FN-DAM is based on floating-gate memories, which have been extensively studied in the context of machine learning architectures [13]. Furthermore, from an equivalent-circuit point of view, the FN-DAM can be viewed as a capacitor whose charge can be precisely programmed using CMOS processing elements. The FN-DAM also provides a balance between weight updates that are too small, so that learning never occurs, and weight updates that are too large, so that learning becomes unstable. The physics of the FN-DAM ensures that the weights decay (in the absence of any updates) towards a zero vector (due to resynchronization), which is important for neural network generalization [37].
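The correspondence between FN resynchronization and the 1/n learning-rate weight decay of SGD (Eqn. 4 in Methods) can be sketched numerically. The linearized decay rate below follows from the model of Eqn. 1; the values of k1, k2, the learning-rate scale and the regularization coefficient are illustrative, not fitted:

```python
import numpy as np

# FN-DAM: linearizing the SET/RESET difference w around the common node
# voltage V(t) = k2/log(k1*t+k0) gives dw/dt = -f'(V(t)) * w, where
# f(V) = (k1/k2) V^2 exp(-k2/V). Integrate this and compare against SGD
# weight decay w_{n+1} = (1 - eps_n*lam) w_n with eps_n = 1/n, lam = 0.05.
k1, k2, V0 = 1.0, 30.0, 7.5
k0 = np.exp(k2 / V0)

def decay_rate(t):
    v = k2 / np.log(k1 * t + k0)
    return (k1 / k2) * np.exp(-k2 / v) * (2.0 * v + k2)

ts = np.linspace(1.0, 500.0, 5000)
dt = ts[1] - ts[0]
w_fn = np.exp(-np.cumsum(decay_rate(ts)) * dt)   # FN resynchronization decay

n = np.arange(1, 5001)
w_sgd = np.cumprod(1.0 - (1.0 / n) * 0.05)       # SGD 1/n weight decay

print(f"weight remaining: FN {w_fn[-1]:.3f}, SGD {w_sgd[-1]:.3f}")
```

Both trajectories decay monotonically towards zero and decay fastest at the beginning, which is the qualitative correspondence the FN-DAM exploits.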
Like other analog non-volatile memories, the FN-DAM could be used in any previously proposed compute-in-memory (CIM) architecture. However, in conventional CIM implementations the weights are trained offline and then downloaded onto the chip without retraining the processor [38]. This makes the architecture prone to analog artifacts like offsets, mismatch and non-linearities. On-chip learning and training mitigates this problem, whereby the weights self-calibrate for the artifacts to produce the desired output [39]. However, to support on-chip training/learning, weights need to be updated at a precision greater than 12 bits [34]. In this regard the FN-DAM exhibits a significant advantage compared to other analog memories. Even though in this proof-of-concept work we have used a hybrid chip-in-the-loop training paradigm, it is anticipated that in the future the training circuits and FN-DAM modules could be integrated together on-chip.

Initialization of the FN-DAM array
For each node of each recorder, the readout voltage was programmed to around 3 V while the tunneling node was operating in the tunneling regime (Supplementary Fig. 1). This was achieved through a combination of tunneling and injection. Specifically, VDD was set to 7 V, the input to 5 V, and the program tunneling pin was gradually increased to 23 V. Around 12-13 V, the tunneling node's potential would start increasing. The coupled readout node's potential would also increase. When the readout potential exceeded 4.5 V, electrons would start injecting into the readout floating gate, thus ensuring its potential was clamped below 5 V. After this initial programming, VDD was set to 6 V for the rest of the experiments. See Supplementary Section I for further details. After one-time programming, the input was set to 0 V, the input tunneling voltage was set to 21.5 V for 1 minute, and then the floating gate was allowed to discharge naturally. Readout voltages for the SET and RESET nodes were measured every 500 ms. The rate of discharge for each node was calculated, and a state where the tunneling rates were equal was chosen as the initial synchronization point for the remainder of the experiments.

FN Tunneling dynamics
The floating-gate voltage V(t) is given by [22,21]:

V(t) = k2 / log(k1·t + k0)    (1)

where k1 and k2 are device-specific parameters and k0 depends on the initial condition as:

k0 = exp(k2 / V(0))    (2)

Using the dynamics given in Eqn. 1, the Fowler-Nordheim tunneling current can be calculated as:

I_FN = −C·dV/dt = C·(k1/k2)·V²·exp(−k2/V)    (3)

Weight decay model and FN-DAM dynamics

Many neural network training algorithms are based on solving an optimization problem of the form [26]:

min_w L(w) + (λ/2)·||w||²

where w denotes the network synaptic weights, L(·) is a loss function based on the training set, and λ is a hyper-parameter that controls the effect of the L2 regularization. Applying gradient-descent updates on each element w_i of the weight vector w:

w_i(n+1) = (1 − ε_n·λ)·w_i(n) − ε_n·∂L/∂w_i    (4)

where the learning rate ε_n is chosen to vary according to ε_n ~ 1/n to ensure convergence to a local minimum [27]. The weight-decay dynamics naturally implemented in FN-DAM devices can be modeled by applying Kirchhoff's Current Law at the SET and RESET floating-gate nodes (see Fig. 1e).

C_T·dW_S/dt = −I_FN(W_S),    C_T·dW_R/dt = −I_FN(W_R)

where C_in + C_fg = C_T is the total capacitance at the floating gate. Taking the difference between the above two equations, with the stored weight defined as w = W_S − W_R, we get:

C_T·dw/dt = −[I_FN(W_S) − I_FN(W_R)]

Assuming that the stored weight (measured in mV) is much smaller than the node potential (> 6 V), i.e., w ≪ W_S (and W_S ≈ W_R), and taking the limit (W_S/W_R → 1) using L'Hôpital's rule, w follows the temporal dynamics given in Eqn. 1:

dw/dt ≈ −(k1/k2)·exp(−k2/W_R)·(2·W_R + k2)·w

Comparing the above equation to Eqn. 4, the weight decay factor for the FN-DAM system is given as:

ε_n·λ ≈ Δt·(k1/k2)·exp(−k2/W_R(t))·(2·W_R(t) + k2)

where Δt is the interval between successive weight updates.

Chip-in-the-loop linear classifier training

A hybrid hardware-software system was implemented to carry out an online machine learning task. The physical weights (w = [w1, w2]) stored in two FN-DAM devices were measured and used to classify points from a labelled test data set in software. We sought to train a linear decision boundary of the form w1·x1 + w2·x2 = 0, where x = [x1, x2] are the features of the training set. For each point that was misclassified, the classification error was calculated, along with the gradient of the loss function with respect to the weights.
Based on the gradient information, the weights were updated in hardware by application of SET and RESET pulses via a function generator.
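A software sketch of this chip-in-the-loop procedure is given below. The FN-DAM pair is abstracted as a leaky stored weight; the dataset, pulse mapping, learning-rate scale, and decay constant are illustrative stand-ins for the hardware behaviour described above:

```python
import numpy as np

# Chip-in-the-loop perceptron sketch: weights live in two simulated FN-DAM
# cells; misclassified points trigger SET (+) / RESET (-) pulse counts, and
# the stored weights leak between updates (a resynchronization stand-in).
rng = np.random.default_rng(1)

X = rng.normal(size=(50, 2))                             # 2-D training set
y = np.where(X @ np.array([1.0, -0.7]) > 0, 1.0, -1.0)   # separable labels

w = np.zeros(2)     # weights stored as (W_S - W_R), in mV units
eta = 5.0           # gradient -> pulse-count mapping scale (illustrative)
decay = 0.995       # per-step leak between updates (illustrative)

for epoch in range(5):
    for i in rng.permutation(len(X)):
        if y[i] * (X[i] @ w) <= 0:                  # perceptron rule
            pulses = np.round(eta * y[i] * X[i])    # integer SET/RESET pulses
            w = w + pulses                          # each pulse moves ~1 unit
        w = w * decay                               # weight leak between points

errors = int(np.sum(y * (X @ w) <= 0))
print("misclassified after 5 epochs:", errors)
```

Despite the leak, the pulse-quantized updates align the stored weights with the separating direction, mirroring the convergence behaviour shown in Fig. 5b-c.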
The states of the SET and RESET nodes were measured every 2 seconds, and the weight of each memory cell, w, was calculated as:

w = 1000·(W_S − W_R)

The factor of 1000 indicates that the weight is stored as the potential difference between the SET and RESET nodes as measured in mV. We followed a stochastic gradient-descent method: the loss function for a misclassified point was defined from the classification error, its gradient with respect to the weights was calculated, and the weights were updated as

w_i ← w_i − ε_n·∂L/∂w_i

where ε_n is the learning rate set by the learning algorithm. The gradient information is used to update the FN-DAM by applying control pulses to the SET/RESET nodes via a suitable mapping function M. Positive weight updates were carried out by the application of SET pulses, and negative updates via RESET pulses. The magnitude of the update was implemented by modulating the number of input pulses. The magnitude of the input pulse V_in(t) required (Fig. 2a) so that the floating-gate node at its current potential V_fg(t) shifts to a target voltage V_T is given by the input-coupling relation derived in Section II (Write Energy Dissipation) below.

II. Write Energy Dissipation
where k_c is the input capacitive coupling ratio, k_c = C_in/(C_in + C_fg). The floating-gate voltage V_fg(t) is approximated by the 1/log dynamics of Eqn. 1 [1]. The energy required to charge the input capacitor is given as:

E_in = (1/2)·C_in·V_in²

Figure 2b shows the instantaneous energy required to charge unit capacitance when the device parameter was set to 7.6 and V_fg(0) = 7.5 V. The input capacitance of our device was 1 pF, and the instantaneous write energy per update increased from 5 fJ to 2.5 pJ over 12 days. The method for calculating the retention time of dynamical systems was described in [2]. In brief, the retention time is the point at which the analog memory, due to resynchronization, falls below the noise floor. The noise floor consists of a constant component introduced by the readout noise and an operational component that increases with time due to thermally induced random desynchronization.
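The quoted energy figures are consistent with the energy stored on the input coupling capacitor, E = ½·C_in·V_in²; for C_in = 1 pF this reproduces the 100 mV → 5 fJ and 500 mV → 125 fJ points measured in Fig. 2 of the main text, and places the 2.5 pJ endpoint at roughly a 2.2 V pulse:

```python
import math

def write_energy(v_in, c_in=1e-12):
    """Energy (J) needed to charge the 1 pF input coupling capacitor to v_in."""
    return 0.5 * c_in * v_in**2

for v in (0.1, 0.5, 1.0):
    print(f"{v:>4.1f} V -> {write_energy(v) * 1e15:7.1f} fJ")

# pulse amplitude implied by the 2.5 pJ end-of-range write energy
v_max = math.sqrt(2 * 2.5e-12 / 1e-12)
print(f"2.5 pJ corresponds to a {v_max:.2f} V pulse")
```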
At the retention time, the synaptic memory's state goes below the noise floor. When the FN-DAM is biased at around 6 V, its retention time is similar to that of FLASH/EEPROM memory; however, the energy consumption is around 150 pJ (for a 100 pF input capacitance). The performance of the FN-DAM model was compared to that of a standard network model. A 15-layer convolutional neural network was trained on the MNIST dataset using the MATLAB Deep Learning Toolbox. For each learnable parameter in the CNN, a software FN-DAM instance corresponding to that parameter was created. In each iteration, the network loss and gradients were calculated. The gradients were used to update the weights via the Stochastic Gradient Descent with Momentum (SGDM) algorithm. The updated weights were mapped onto the FN-DAM array, the weights in the FN-DAM array were decayed according to Eqn. 14, and these weights were then mapped back into the CNN. This learning process was carried on for 9 epochs. In the 10th epoch, no gradient updates were performed; however, the weights were allowed to decay for this last epoch (note that in the standard CNN case, the memory was static). A special case with a 0.1% randomly assigned mismatch in the floating-gate parameters (k1 and k2) was also implemented. The readout power depends on the readout accuracy required and the speed at which the readout operates.
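The FN-DAM-in-the-loop training procedure above can be sketched end-to-end on a toy model. Logistic regression stands in for the 15-layer CNN, and the decay schedule is an illustrative stand-in for the resynchronization decay of Eqn. 14:

```python
import numpy as np

# Toy FN-DAM-in-the-loop training: SGDM updates are computed in software,
# then the weights are passed through a multiplicative decay whose strength
# weakens over time (retention improves as training progresses).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)   # separable toy labels

w, v = np.zeros(10), np.zeros(10)                 # weights, SGDM velocity
lr, mom = 0.5, 0.9

def fn_decay(w, step):
    # decay factor approaches 1 as `step` grows: plastic early, retentive late
    return w * (1.0 - 0.01 / (1.0 + step / 50.0))

for step in range(300):
    z = np.clip(X @ w, -30.0, 30.0)
    p = 1.0 / (1.0 + np.exp(-z))                  # logistic forward pass
    grad = X.T @ (p - y) / len(y)                 # cross-entropy gradient
    v = mom * v - lr * grad                       # SGDM update
    w = fn_decay(w + v, step)                     # store through leaky FN-DAM

acc = float(np.mean((X @ w > 0) == (y > 0.5)))
print(f"training accuracy with FN-DAM decay: {acc:.2f}")
```

As in the Supplementary Section IV experiment, the mild early decay does not prevent the model from fitting the data.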

V. Read Energy Dissipation
For a PMOS in a source-follower configuration biased in weak inversion, the input-referred readout noise power scales as:

v_n² ∝ U_T²·Δf / (κ·P_read)

where U_T is the thermal voltage, κ the subthreshold gate-coupling coefficient, Δf the readout bandwidth, and P_read the readout power. The above relation is plotted in SI Figure 5 for different noise floors and readout frequencies, for a 5 V supply, U_T = 26 mV and κ = 0.7. The FN-DAM is programmed by applying a pulse of magnitude V_in(t) so that the node reaches a potential of V_T through the input coupling capacitor, as derived in the previous section. The resulting programming ratio is plotted in SI Figure 6 for three values of k1, which affects the dynamics of V_fg(t). The parameter k1 can be altered during the design phase by changing the area and capacitance of the floating-gate node.

Figure 1: Motivation and principle of operation for the proposed synaptic memory device: (a) conventional non-volatile analog memory, where transitions between analog static states dissipate energy; (b) dynamic analog memory, where external energy is used to modulate the trajectory of the memory states towards the optimal solution; (c) desired analog-synapse characteristic, where the memory retention rate is traded off against the write energy, reducing the energy dissipation per weight update in the training phase by matching the dynamics of the dynamic analog memory to the weight decay; (d) micrograph of a fabricated DAM array along with (e) its equivalent circuit, where the leakage current IFN is implemented by (f) electron transport across a Fowler-Nordheim (FN) tunneling barrier; (g) implementation of the FN-tunneling-based DAM, where the dynamic states g1-g3 determine the energy dissipated per memory update and the memory retention rate.
Fig. 2 shows the measured dynamics of the FN-DAM device in the different initialization regimes used in ML training, as described in Fig. 1c.

Figure 4: Device characterization: (a) change in DAM response with each pulse of the same magnitude and duration; (b) DAM response to a varying number of pulses; (c) DAM response to pulses of different magnitude; (d) device-state drift due to temperature variations after 1, 2 and 3 hours.

Figure 5: Synaptic memory for neuromorphic applications: (a) test data set with randomly initialized decision boundary; (b) decision boundary after training; (c) evolution of the weights over 5 epochs; (d) input voltage required for initiating a unit change in weight; (e) energy expended in updating the weights; (f) average magnitude of the weight updates and average energy required for each epoch.

In this section we experimentally demonstrate the benefits of exploiting the dynamics of FN-DAM weights when training a simple linear classifier. For these results, two FN-DAM devices were independently programmed according to the perceptron training rule [26]. We trained the weights of a perceptron model to classify a linearly separable dataset comprising 50 instances of two-dimensional vectors, shown in Fig. 5a. During each epoch, the network loss function and gradients were evaluated for every training point.

Figure 2: (a) Target voltage, floating-gate voltage and training voltage as a function of time. (b) Energy required to charge unit capacitance as a function of time.

Figure 3: (a) Retention time as a function of floating-gate voltage for a range of input pulse magnitudes. (b-c) Retention time of weight updates as a function of time elapsed after initialization to 7.5 V (b) and 6 V (c).

Figure 4: (a) Network loss for the 3 types of network models; inset shows the same data with the X axis in log scale. (b) Energy spent in updating the network weights; inset shows the same data with the X axis in log scale.

Figure 5: Minimum power required to read the floating-gate voltage as a function of the required readout speed. Noise floors are shown in the legend.

Figure 6: Programming ratio for different values of the k1 parameter, which can be controlled by changing the size of the tunneling junction.
In the differential architecture, w = W_S − W_R. Let V_T denote the training voltage calculated by the training algorithm; V_fg is substituted from Eqn. 2. Let C_in/C_T = k_c, the input coupling ratio: