Optogenetics inspired transition metal dichalcogenide neuristors for in-memory deep recurrent neural networks

Shallow feed-forward networks are incapable of addressing complex tasks such as natural language processing that require learning of temporal signals. To address these requirements, we need deep neuromorphic architectures with recurrent connections such as deep recurrent neural networks. However, the training of such networks demands very high precision of weights, excellent conductance linearity and low write noise, requirements not satisfied by current memristive implementations. Inspired by optogenetics, here we report a neuromorphic computing platform comprised of photo-excitable neuristors capable of in-memory computations across 980 addressable states with a high signal-to-noise ratio of 77. The large linear dynamic range, low write noise and selective excitability allow high-fidelity opto-electronic transfer of weights with a two-shot write scheme, while electrical in-memory inference provides energy efficiency. This method enables implementing a memristive deep recurrent neural network with twelve trainable layers and more than a million parameters to recognize spoken commands with >90% accuracy.

hyperpolarizing potassium channels-it eventually turns ON the switch S1, resetting Vmem to its resting potential. For proper operation of the reset mode, it is required that V1 be larger than the threshold voltage (Vthsw) of the switch S1, i.e. V1 > Vthsw. Once the voltage driving S1 reaches Vthsw, the switch discharges the capacitor C1 and resets the membrane voltage Vmem to its reset level. This operation is similar to the hyperpolarization of the membrane due to opening of potassium ion channels in a biological neuron. Since the reset voltage at Vmem is higher than its steady state

Supplementary Note-3. PENs as Photo-sensitive Synapses
Learning in the brain occurs via strengthening and weakening of the synaptic connections that link a myriad of neurons 4,5 . This modification of synaptic strength, referred to as weight plasticity, is responsible for spatial memory storage and forms the basis of learning mechanisms 6,7 . A spike-based formulation of Hebbian learning, spike-timing dependent plasticity (STDP), is considered to be the first law of synaptic plasticity and underlies associative learning across several species, from locusts to fish to rats to humans 4,8,9 .
Memory traces of past experiences are encoded as temporal correlations between the electrical activity of pre- and post-synaptic neurons, resulting in a temporally asymmetric form of Hebbian learning.
Temporal correlations between pre- and post-synaptic spikes define the sign and magnitude of long-term synaptic modifications, referred to as the STDP function or learning window, which varies tremendously across synapse types, dendritic compartments and brain regions 8,10 .
Removal of the optical stimulation resulted in retrieval of the original STDP function after 2 hours as shown above, indicating that the photo-modulated states are retained for ~2 hours in our best-performing devices. The average retention was observed to be ~900 seconds for the condition of no crossover between adjacent states, which is acceptable for training 14 .
Supplementary Figure 5. Spatiotemporally-selective activation of synapses. Application of sinusoidal global optical clock signals to the PENs resulted in a long-term weight modulation as depicted by the graph on the right.
A constant electrical bias of -40 V was applied during these measurements to set the initial conductance state to a low value of ~1 nS. We demarcate a firing threshold only to highlight that the difference in photoresponse, or wavelength selectivity, of our artificial synapses allows wide flexibility in setting the threshold for neuronal firing. In other words, we foresee the future development of artificial synapses that respond to narrow bands of optical pulses, and our approach would enable desired thresholds to be pre-set for selective neuronal firing when such artificial synapses are connected to artificial neurons.

Nature of defects:
The exact nature of traps or defects in a material system depends on the composition and the fabrication processes involved. In transition metal dichalcogenides, studies have indicated different possible origins for the electron trapping and detrapping mechanisms, including surface adsorbates [18][19][20] , electron trapping at the semiconductor-dielectric interface 21,22 and intrinsic semiconductor defects/lattice defects 23,24 . Since our measurements were performed in high-vacuum conditions, the probability of surface adsorbates is greatly reduced. Thus, we attribute the origin of the trapping-detrapping mechanism in our measurements to defects in the semiconductor itself and/or at the semiconductor-dielectric interface. The high value of the subthreshold slope (~2.5 V/dec) in our devices supports the presence of traps influencing the ReS2 channel, in accordance with similar measurements in the literature 25,26 . In the present work, we utilize a combination of optical and electrical pulses to fill and empty these traps and to induce non-volatile conductance changes in our field-effect transistors (FETs). The number of distinct conductance states is determined primarily by the programming pulse resolution and the recombination kinetics of the photo-generated carriers, and hence the programming pulses could be optimized to achieve very good linearity.

Programming scheme to maximize linearity:
To program states in a linear manner, we need to carefully select the gate voltage, the initial conductance state, the drain voltage and intensity of light illumination as explained below.
Firstly, to maximize linearity in write and erase, we apply and maintain a gate bias (-40 V) on our devices to bring them to their depleted state before starting our measurements (Supplementary Note-3 Figure-5). The initial low conductance is critical for achieving linearity, as also reported by other investigations 14,27 . By constantly biasing the devices at a gate voltage that depletes the majority carriers in the system (-40 V in the case of ReS2 devices, or negative Vgs for any n-type semiconductor), we ensure that the traps are empty and the background carrier concentration is minimized 28 ; this also increases the probability of photocarrier trapping 26 .
Upon illumination, the photogenerated carriers fill up these traps, resulting in a permanent increase in the channel conductance due to PPC (also termed photogating in literature) 25 .
Finally, the Vds should be high enough to overcome possible contact resistance effects and should result in an Ids which is significantly larger than Igs, necessary to ensure accurate weight update readouts.

Effect of constant gate bias Vgs:
We first show that the linearity is modulated as a function of the constant gate bias applied during optical potentiation for ReS2 FETs. Supplementary Figure-13 below shows the weight updates: optical potentiation with different constant gate biases, and electrical depression. As can be seen, maintaining a large negative Vgs that depletes the majority carriers in the system allows us to write weights more linearly than with partially depleting (~0 V) or accumulating (high +Vgs) biases for ReS2 FETs. As explained above, we believe this negative Vgs helps empty the traps, allowing us to control the photo-generated carriers more precisely and resulting in linear weight updates 28 .

Effect of drain bias Vds:
A Vds that ensures the channel currents are much larger than the leakage currents is necessary for accurate weight read-outs. In our TMDC FETs, we typically apply a Vds of 0.1 V for our measurements. Lower applied drain voltages gave inconsistent results, especially when the gate leakage currents (Igs) were high. The effect of Vds can be more clearly

Effect of optical illumination (Intensity):
We next show that the weight update steps could be further modulated by changing the intensity of illumination (Supplementary Figure-15). Higher optical intensities (and consequently increased photocarrier trapping) result in larger weight changes, albeit with an increased spread. Increasing the erase gate voltage raises the Fermi level further to accumulate more electrons in the channel, which can recombine with the trapped holes, allowing us to tune the symmetry of the weight updates.
This is similar to strategies used by other groups to achieve desired conductance levels 30,31 .
The input pulses can also be modified according to the device's response to maximize linearity if required, as adopted by several other investigations [32][33][34][35] . As indicated by the figures, a higher absolute conductance requires larger erase voltage amplitudes to return the channel conductance to the initial state and achieve symmetry. Intuitively, the large absolute and percentage weight changes during potentiation reflect a larger number of trapped holes; the larger erase voltage amplitude in turn indicates how far the Fermi level must be raised to accumulate enough additional electrons in the channel to recombine with this larger number of trapped holes.
In summary, the exquisite write linearity afforded by the optical gating is the major phenomenon we exploit in this work to get very accurate weights in a DRNN.

Linear Dynamic Range
To assess the benefit of the high write linearity and low write noise provided by the opto-

Two-shot Write Scheme
In this work, we propose to train neural networks offline and then transfer the weights by electro-optic means to the PEN crossbar for electrical inference. The advantages of this method are as follows:
• Previously reported work 36,37 using online learning for memristors could only use stochastic gradient descent (SGD) to train fully connected networks (FCN) to classify handwritten digits from the MNIST dataset. However, to train a DRNN to classify complex datasets for speech recognition, it is necessary to use sophisticated momentum-based learning rules such as ADAM 38 . Hence, we propose to train the network offline and then transfer the learnt weights to the PEN array with high write accuracy.
• Our method of writing weights involves optical stimulation, but after the write step, the inference operation is performed fully in the electrical mode, thus retaining the high energy efficiency afforded by in-memory computing on PEN memristive arrays.
We next describe the high-accuracy weight transfer scheme. From the measured data, we can estimate the mean ΔG and its standard deviation σ across different trials and various initial conductance states. Based on the linearity of the conductance change, we can estimate the write pulse width Tw to reach a desired conductance Gdes from an initial conductance Ginit as:
where Tw is the write pulse width needed to achieve the desired conductance.
However, it is impractical to take many measurements on every PEN to estimate the average ΔG from a best-fit line. Note that other non-volatile memories with non-linear characteristics [42][43][44] typically require 20-30 iterations of successive write and measurement to converge to a precise conductance (<1 % error). The exquisite linearity and low write noise of the PEN allow us to simplify this procedure and reduce the number of iterations by an order of magnitude, as described next.
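For reference, the linear estimate invoked above can be sketched in the following form. This is a reconstruction under stated assumptions, not the original expression: $\overline{\Delta G}$ is taken to be the mean signed conductance change produced by one write pulse of base width $T_s$ (a symbol that appears later in the text through $T_p = nT_s$), and the numbers in the worked line are purely hypothetical.

```latex
% Assumed linear model: conductance changes by \overline{\Delta G} per pulse of width T_s
T_w \;=\; T_s \,\frac{G_{des} - G_{init}}{\overline{\Delta G}}

% Hypothetical worked example (depression, so \overline{\Delta G} < 0):
% T_s = 1\,\mathrm{ms},\quad \overline{\Delta G} = -0.5\,\mathrm{nS},\quad
% G_{init} = 80\,\mathrm{nS},\quad G_{des} = 50\,\mathrm{nS}
T_w \;=\; 1\,\mathrm{ms} \times \frac{50 - 80}{-0.5} \;=\; 60\,\mathrm{ms}
```

Under this form the sign of $\overline{\Delta G}$ carries the write direction, so the same expression would cover both potentiation and depression as long as $G_{des}$ lies on the reachable side of $G_{init}$.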
The quickest method is to estimate the slope ΔG/Tp from the difference in conductance produced by one write pulse of known width Tp. We refer to this as the two-shot write scheme: the first write is used to estimate the rate of conductance change of the PEN, while the second write is used to program the PEN to the desired conductance value based on the estimated slope.
This is illustrated in Supplementary Figure-18, where the initial optical write (Op W) is used to increase the conductance of all devices on a chip to a very large value (greater than the largest desired conductance), denoted by Gop,n for the nth device. Note that a global optical write is easier to implement in hardware since it does not require optical selectivity. This is followed by a measurement (m1) to read this conductance, thus eliminating the mismatch in optical write efficiency across devices. Next, the first electrical write pulse (w1) of width Tp is applied for slope estimation, or calibration.
Supplementary Figure 18. Two-shot write scheme for transferring learned weights to the PEN crossbar. After an initial optical potentiation of the entire array, one measurement is done to estimate the conductance Gop,n for the nth device. Next, one electrical write (w1) operation is done for duration Tp, followed by a measurement m2 to estimate the change in conductance, or slope. Finally, the second write pulse (w2) is applied with the duration Twn calculated from the earlier estimated slope.
The duration Twn of the second write is calculated with the estimate ΔGn/Tp and using Ginit,n as the conductance of the nth device after w1.
However, this will be prone to estimation error due to write noise, measurement noise induced variability as well as any non-linearity in the device write characteristics over a large conductance range. To counter this partially, we use a long write pulse with width Tp = nTs to make this estimation (n>1). This ensures that the mean value of ΔGn obtained after the calibration write w1 is large compared to the write noise. Also note that measurement noise is assumed negligible compared to write noise in this analysis which is reasonable since measurement circuits shared along columns can be made precise due to relaxed area/power requirements compared to elements within the crossbar.
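The effect of the calibration pulse length on the final write error can be illustrated with a small Monte Carlo sketch. It is not the authors' simulation: the noise model (independent Gaussian write noise per base interval Ts, so that the standard deviation grows as the square root of the pulse length), the function names and all numerical values are assumptions chosen for illustration, and measurement noise is neglected as in the text.

```python
import math
import random

def two_shot_write(g_des, g_op, rate, sigma, n, rng, t_s=1.0):
    """Simulate one two-shot write on a single device (arbitrary units).

    g_op  : conductance read by m1 after the global optical write
    rate  : true conductance change per unit time under electrical write
            (negative here, i.e. depression); unknown to the controller
    sigma : assumed write-noise standard deviation for a pulse of width t_s
    n     : calibration pulse length in units of t_s, so T_p = n * t_s
    """
    t_p = n * t_s
    # Calibration write w1 followed by measurement m2.
    g_init = g_op + rate * t_p + rng.gauss(0.0, sigma * math.sqrt(n))
    slope_est = (g_init - g_op) / t_p  # estimated slope, Delta G_n / T_p
    # Second write w2: width from the estimated slope, clamped at zero.
    t_w = max((g_des - g_init) / slope_est, 0.0)
    return g_init + rate * t_w + rng.gauss(0.0, sigma * math.sqrt(t_w / t_s))

def rms_write_error(n, trials=2000, seed=0):
    """RMS deviation of the written conductance from the target, for hypothetical values."""
    rng = random.Random(seed)
    g_des, g_op, rate, sigma = 40.0, 100.0, -1.0, 0.1
    total = 0.0
    for _ in range(trials):
        total += (two_shot_write(g_des, g_op, rate, sigma, n, rng) - g_des) ** 2
    return math.sqrt(total / trials)

if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, rms_write_error(n))
```

In this toy model a longer calibration pulse averages the write noise over more base intervals, so the slope estimate, and with it the final conductance, becomes more accurate, at the cost of a calibration step that consumes more of the available conductance range.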
Hence, the actual device conductance, Gact obtained after the second write w2 differs from Gdes and is modelled as follows:

Neural Network Simulations: MNIST and Speech classification
We simulate the case where the network is trained offline and the trained weights are written to the neuromorphic device by the two-shot write scheme described earlier. It should be noted that we can also perform online learning with SGD using blind updates, as is typically shown for resistive memory crossbars trained to do handwritten digit recognition tasks based on the
It can be seen that for the MNIST task, even a low value of LDR (~50) is good enough to achieve high accuracy. Since the MNIST task is quite simple, it was not suitable for exploring the effect of each device non-ideality separately. We next proceed to do that with the speech classification task. Finally, we explored the effect of reduced estimation error during weight mapping by taking larger estimation times as well as by using the best-fit straight line to estimate the conductances.
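The two estimation refinements mentioned above, a longer estimation time and a best-fit straight line, can be compared with a single-pulse estimate in a short sketch. As before, this is an illustration under assumed conditions rather than the simulation used in this work: write noise is modelled as independent Gaussian noise per unit-width pulse, measurement noise is neglected, and all names and values are hypothetical.

```python
import random

def simulate_calibration(k, rate, sigma, rng):
    """Relative conductance measured after each of k unit-width write pulses."""
    g, trace = 0.0, [0.0]
    for _ in range(k):
        g += rate + rng.gauss(0.0, sigma)  # assumed: independent noise per pulse
        trace.append(g)
    return trace

def slope_endpoints(trace):
    """Slope from the first and last measurement only (one long estimation window)."""
    return (trace[-1] - trace[0]) / (len(trace) - 1)

def slope_least_squares(trace):
    """Slope of the ordinary least-squares line through all measurements."""
    n = len(trace)
    x_mean = (n - 1) / 2.0
    y_mean = sum(trace) / n
    sxy = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(trace))
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    return sxy / sxx

def rms_slope_error(estimator, k, trials=2000, rate=-1.0, sigma=0.1, seed=1):
    """RMS error of a slope estimator over repeated simulated calibrations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += (estimator(simulate_calibration(k, rate, sigma, rng)) - rate) ** 2
    return (total / trials) ** 0.5

if __name__ == "__main__":
    print("single pulse      :", rms_slope_error(slope_endpoints, 1))
    print("20 pulses, ends   :", rms_slope_error(slope_endpoints, 20))
    print("20 pulses, best fit:", rms_slope_error(slope_least_squares, 20))
```

Both multi-pulse estimators reduce the slope error relative to a single-pulse estimate in this model. Which of the two performs better depends on the noise assumptions, in particular on whether measurement noise is truly negligible, so the sketch should not be read as ranking them.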

Supplementary
As can be seen in Supplementary Figure-  A) is much larger than the drain current Ids (10^-9 A), preventing reliable weight readouts. At -5 V, Ids is larger than Igs, but with a very low margin (<3x). At -60 V and higher, Ids becomes much larger than Igs with good readout margins (>200x), allowing accurate weight readouts with a very high signal-to-noise ratio. Please note that the magnitude of the normalized weight changes at high drain voltages is lower than that at low drain voltages in the figure below. However, since a b