Hybrid architecture based on two-dimensional memristor crossbar array and CMOS integrated circuit for edge computing

The fabrication of integrated circuits (ICs) employing two-dimensional (2D) materials is a major goal of the semiconductor industry for the next decade, as it may allow the extension of Moore’s law, aid in-memory computing, and enable the fabrication of advanced devices beyond conventional complementary metal-oxide-semiconductor (CMOS) technology. However, most circuit demonstrations using 2D materials so far employ methods such as mechanical exfoliation that cannot be scaled up to wafer-level fabrication, and they achieve only simple functionalities such as logic gates. Here, we present the fabrication of a crossbar array of memristors using multilayer hexagonal boron nitride (h-BN) as the dielectric. The array exhibits analog bipolar resistive switching in >96% of devices, which is ideal for implementing multi-state memory elements in most neural network, edge computing, and machine learning applications. Instead of using this memristive crossbar array only to solve a simple logical problem, we go a step further and combine the h-BN crossbar array with CMOS circuitry to implement the extreme learning machine (ELM) algorithm. The CMOS circuit is used to design the encoder unit, and the crossbar array of h-BN based memristors is used to implement the decoder functionality. The proposed hybrid architecture is demonstrated for complex audio, image, and other non-linear classification tasks on real-time datasets.


INTRODUCTION
The need for improved performance, greater throughput and higher integration density has pushed the scaling boundaries of the complementary metal-oxide-semiconductor (CMOS) technology. With the rise in artificial intelligence (AI), even this has proved to be insufficient in implementing machine learning (ML) and other generic neural algorithms on existing hardware architectures 1,2 . Additionally, the processing of AI algorithms on the edge devices needs to simultaneously address issues such as privacy, security, cost, latency, and bandwidth 3 . Therefore, the implementation of such algorithms on the hardware requires highly energy-efficient and massively parallel architectures with low latency and memory requirements. Achieving these parameters with CMOS technology utilizing traditional von-Neumann architectures is challenging because of communication and memory bottleneck along with non-linear effects that traditional designs suffer from 4,5 .
However, the implementation of ML algorithms using analog CMOS design techniques can offer a significant improvement in system performance parameters 6,7 . The analog implementation allows designers to utilize the full MOS device physics; hence the architecture can be optimized down to the transistor level. The analog implementation, however, comes with its own challenges: the designed circuits are sensitive to device mismatch and non-linear effects 8 . This makes it difficult to design analog circuits in state-of-the-art technology nodes. Alternatively, one could design systems that exploit these shortcomings (i.e., mismatch, non-linearity) as an advantage rather than engineering them out of the design 9–15 .
Additional degradation in conventional system design comes from the fact that today's systems are increasingly dependent on memory technologies such as random-access memory and Flash. These memories are essentially based on a charge storage mechanism, resulting in degradation of performance, reliability, and noise margin at lower technology nodes. To overcome the von-Neumann bottleneck, several emerging memory devices are being explored in which the data can be processed in situ within the memory by exploiting the physical principles of resistive switching (RS). While transition metal-oxide (TMO) based resistive RAM (RRAM) technology is nearing commercialization, recent years have witnessed a surge of experimental demonstrations of 2D material-based RRAM. These 2D RRAMs can overcome the vertical scaling limit of TMO RRAMs since remarkable RS is demonstrated even in a monolayer 16 . The fundamental mechanism of RS in 2D RRAM appears to be different 17,18 , and thus it may outperform TMO RRAM in terms of operating speed and power. In addition, we show that the state currents span multiple orders of magnitude (from sub-nanoampere to milliampere) in 2D RRAM, which is, to the best of our knowledge, not yet reported for any TMO-based RRAM.
Here, we propose the use of a hybrid architecture based on the extreme learning machine (ELM) algorithm for edge computing by integrating existing CMOS technology with emerging memristive technologies made of two-dimensional (2D) materials. The result is a hybrid architectural framework (Fig. 1a) that can surpass current technological limitations while improving overall system performance. Our system utilizes the non-linearity offered by the CMOS circuit and the inherent device mismatch, along with a minimum transistor count, to offer an efficient design in terms of area and power. We show the system performance for generic classification applications such as audio, image, and other real-time datasets.

RESULTS
Hybrid system architecture
The proposed architecture consists of a CMOS Encoder chip followed by the Memristor Decoder chip, as shown in Fig. 1b. These two units work in tandem while remaining isolated from one another. The CMOS Encoder chip consists of an ELM encoder, a row-select encode unit, a bias generator unit, and a control unit. The core of the proposed architecture employs the local receptive fields-based extreme learning machine (LRF-ELM) algorithm (Fig. 1a). LRF-ELM 19 is a variant of the ELM algorithm in which the weights between the input and hidden layers are local and random 20 . The only trainable weights are the output weights, which can be learned using the least-squares method for a given classification or regression task. The output function of the ELM for a generalized single-layer feed-forward topology is given by 19,20 : where $w_i \in \mathbb{R}^{M}$ is the weight vector connecting the $i$th hidden node to the $M$ target nodes, $a_j \in \mathbb{R}^{n}$ and $b_i \in \mathbb{R}$ are the random and fixed parameters of the $i$th hidden node, $x_j \in \mathbb{R}^{n}$ is any of the $N$ arbitrary distinct sample inputs, and $G(\cdot)$ is a non-linear continuous function.
When the non-linear function is a Gaussian kernel, $G(\cdot)$ can be written as: where $a_j$ and $b_i$ are the random and fixed parameters of the $i$th Gaussian node arising from device mismatch. For an arbitrary distinct sample, where $x_j \in \mathbb{R}^{n}$ is the $j$th sample input vector and $t \in \mathbb{R}^{M}$ is the target vector, the output function corresponding to the $j$th sample input can be written as: where $\{h_j\}_{j=1}^{N} \in \mathbb{R}^{1 \times L}$ is the hidden layer output vector for any of the $N$ arbitrary sample inputs and $W \in \mathbb{R}^{L \times M}$ is the output weight matrix. The output layer weight matrix $W$ is then calculated using a modified quantization aware stochastic gradient descent (SGD) learning algorithm, explained in detail in "Quantization aware learning for LRF-ELM". Figure 1a shows the network-level architecture of the hybrid LRF-ELM network used in our framework for the generic case of one target node ($T \in \mathbb{R}^{N \times 1}$). The same architecture can also be generalized to multiple target nodes. The output weight vector for the $m$th output node and $L$ hidden nodes can then be written as $w_m = [w_{1m}, w_{2m}, \ldots, w_{Lm}]$. For the case of 4 target nodes ($T \in \mathbb{R}^{N \times 4}$), the input and hidden layers of the LRF-ELM network shown in Fig. 1a form the ELM encoder unit of the CMOS-Memristor chip in Fig. 1b. The output of this ELM encoder (a row of matrix $H$) is passed as an input to the row-select encode unit. Depending on the select signal (Sel<0:n>), the output of the hidden node is passed to the Memristor Decoder chip. The Memristor Decoder chip consists of a row-select decode unit, a memristor crossbar array, and a mixed-signal interface unit. The weight matrix $W$ is implemented using a memristor crossbar array made of 2D materials, as shown in the Memristor Decoder chip of Fig. 1b. Each output node of the LRF-ELM network in Fig. 1a is a differential pair integrator 21 (DPI), and the array of DPIs is implemented in the mixed-signal interface unit of the Memristor Decoder chip.
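As a minimal sketch of the forward pass described above, the following Python snippet computes one hidden-layer row $h_j$ from Gaussian nodes with random, fixed parameters and multiplies it by an output weight matrix $W$. The kernel form exp(-b_i * ||x - a_i||^2), the parameter ranges, and all function names are illustrative assumptions, not the measured behavior of the circuit; in the hardware, the random parameters arise from device mismatch rather than from a software random generator.

```python
import math
import random

def make_hidden_params(n_hidden, n_inputs, rng):
    """Draw fixed random centres a_i (vectors) and widths b_i (scalars).

    Assumption: the uniform ranges are chosen only for illustration.
    """
    centres = [[rng.uniform(0.0, 1.0) for _ in range(n_inputs)]
               for _ in range(n_hidden)]
    widths = [rng.uniform(0.5, 2.0) for _ in range(n_hidden)]
    return centres, widths

def hidden_row(x, centres, widths):
    """Return h_j (length L): one Gaussian response per hidden node."""
    row = []
    for a, b in zip(centres, widths):
        sq_dist = sum((xi - ai) ** 2 for xi, ai in zip(x, a))
        row.append(math.exp(-b * sq_dist))  # assumed RBF-style kernel
    return row

def elm_output(h, W):
    """Multiply the 1 x L hidden row by the L x M output weight matrix."""
    n_out = len(W[0])
    return [sum(h_l * w_row[m] for h_l, w_row in zip(h, W))
            for m in range(n_out)]

rng = random.Random(0)
centres, widths = make_hidden_params(n_hidden=5, n_inputs=3, rng=rng)
h = hidden_row([0.2, 0.4, 0.6], centres, widths)
# L x M output weights with M = 4 target nodes (random placeholders here;
# in the paper these are the only trained weights).
W = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(5)]
print(elm_output(h, W))
```

Only `W` would be trained in such a setup; `centres` and `widths` stay fixed after they are drawn, which mirrors the role the text assigns to mismatch-induced parameters.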
We designed the CMOS Encoder chip in a 180 nm technology node, and post-layout simulations with hidden node parasitics were carried out to verify the results. CMOS circuit non-linearity was captured during training and inferencing, and weights were stored in memristors as quantized conductance states. The main advantages of the proposed framework are: (1) the framework is tolerant to device variability and uses inherent MOS device mismatch as random weights to its advantage, hence doing away with any memory storage requirement for the first-layer weights; (2) the architecture is robust to process variations and can therefore be designed in a much lower CMOS technology node, leading to further improvement in system performance parameters; (3) the overall energy consumption of the proposed system is very low, making it a good candidate for edge computing applications; (4) in the proposed hybrid system, the CMOS Encoder chip and the Memristor Decoder chip are two significant parts of the ELM framework, and each part can be optimized separately. This arrangement makes it possible to explore other types of emerging memories just by replacing the decoder chip, without affecting the overall system; (5) the memristor array performs the multiply and accumulate (MAC) operation between the higher-dimensional input feature map and the stored quantized weights on a time-multiplexed basis; (6) the LRF-ELM framework for real-time dataset classification uses a 9D Gaussian CMOS circuit operating in the subthreshold regime. The circuit therefore offers the desired non-linearity with a minimum transistor count, and hence low area and low power; and (7) the framework offers a substitute for digital memory in the form of the multi-state memristive device, and thus could save energy and area footprint.
Quantization aware learning for LRF-ELM
The classification network was trained off-chip using modified quantization aware SGD learning 22 . To incorporate the variability in each state (26 states in total) arising from device-to-device and cycle-to-cycle variations, we included the statistical device variabilities $\{\sigma_k\}_{k=1}^{26}$ as a parameter while training the algorithm.
Here, we have used a quantization aware SGD algorithm rather than batch gradient descent and modified it to incorporate statistical variability during training. Algorithm 1 shows the modified SGD-based learning, together with the dimensions of its parameters. The algorithm takes into account the quantization error arising from the mapping of each learned weight to the closest available memristor state, along with the statistical variability present in the memristor device. In each learning iteration, the function $f(\cdot)$ assigns each element of the intermittent weight matrix ($w_{lm}$) to the closest available memristor state ($s_{mean_k}$). Here, $\{S_{mean}\}_{1 \times 26}$ and $\{\sigma\}_{1 \times 26}$ are the mean and variance vectors obtained from the memristor data matrix ($M_{Data}$) (explained in detail in Supplementary Fig. 1 and Supplementary Note 1). Thereafter, each element of the quantized weight vector $w^{Q}_{lm}$ is defined as a random number picked from the Gaussian distribution with mean $s_{mean_k}$ and the variance corresponding to that mean, $\sigma_k$. The quantities used in Algorithm 1 are: the target class matrix for $N$ training samples; the intermittent output weight matrix for $L$ hidden nodes; the quantized output weight matrix for $L$ hidden nodes; a row of the target class matrix corresponding to an input sample; and the variance vector and mean of each state.
Algorithm 1. Quantization aware stochastic gradient descent learning for hybrid LRF-ELM
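The quantization step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: the state levels are hypothetical normalized values, $\sigma_k$ is treated as a standard deviation when sampling, and the weight update (forward pass with the quantized, noisy weights, gradient applied to the full-precision intermittent weights) is one common way to make SGD quantization aware. All function names are hypothetical.

```python
import random

def nearest_state(w, state_means):
    """f(.): index k of the state mean closest to weight w."""
    return min(range(len(state_means)), key=lambda k: abs(state_means[k] - w))

def quantize_with_variability(weights, state_means, state_sigmas, rng):
    """Map each weight to its nearest state, then sample around that state."""
    quantized = []
    for row in weights:
        q_row = []
        for w in row:
            k = nearest_state(w, state_means)
            # Assumption: sigma_k is used as a standard deviation here.
            q_row.append(rng.gauss(state_means[k], state_sigmas[k]))
        quantized.append(q_row)
    return quantized

def sgd_step(W, h, t, state_means, state_sigmas, lr, rng):
    """One sample update: h has length L, t has length M, W is L x M.

    Assumed update rule: squared-error gradient computed with the
    quantized weights and applied in place to the intermittent weights W.
    """
    WQ = quantize_with_variability(W, state_means, state_sigmas, rng)
    n_hidden, n_out = len(W), len(W[0])
    y = [sum(h[l] * WQ[l][m] for l in range(n_hidden)) for m in range(n_out)]
    err = [y[m] - t[m] for m in range(n_out)]
    for l in range(n_hidden):
        for m in range(n_out):
            W[l][m] -= lr * h[l] * err[m]
    return err

rng = random.Random(1)
state_means = [k / 25 for k in range(26)]   # 26 hypothetical normalized levels
state_sigmas = [0.005] * 26                 # hypothetical per-state spread
W = [[0.10, 0.90], [0.50, 0.30]]
print(sgd_step(W, h=[1.0, 0.5], t=[1.0, 0.0],
               state_means=state_means, state_sigmas=state_sigmas,
               lr=0.1, rng=rng))
```

Sampling around the state mean during training exposes the learned weights to the same spread they will see once programmed into devices, which is the motivation the text gives for including the variability vector.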

CMOS encoder chip (ELM encoder unit)
The LRF-ELM 19 architecture overcomes the high-dimensionality problem by assigning sparse connections between the input and hidden nodes, as shown in Fig. 1a. For a particular feature input, the value of the weights for the LRF of all the hidden nodes is the same. Each input node corresponds to a local receptive field generated by taking windows of dimension 3 × 3 and shifting them with a stride of two. For an input matrix of dimension 31 × 51, a total of 375 input nodes (15 × 25 window positions) is obtained. The output layer has time-multiplexed all-to-all connectivity with the hidden layer through a row-select encode, a row-select decode, and a memristor array. In an LRF-ELM architecture, the output of each hidden node depends on a different combination of input nodes. This enables local correlations to be learned more appropriately and makes the network invariant to small translations and rotations. We use a Gaussian function as the hidden node in our architecture, implemented using subthreshold MOS circuits 23–25 . Each hidden node in Fig. 1a implements a 9D Gaussian circuit for the LRF-ELM network. These Gaussian circuits were implemented by cascading one-dimensional (1D) Gaussian cells sequentially in an N-P-N-P fashion, where N and P are NMOS and PMOS cells, respectively. Figure 2a shows the block-level architecture of a 9D Gaussian cell (one hidden node) formed by cascading N-type and P-type Gaussian cells, which allows the circuit to work at a lower swing and reduces settling time. Figure 2b shows the structure of a single NP Gaussian cell made by cascading an N-type and a P-type cell, and Fig. 2c, d illustrate the CMOS architecture of the N-type and P-type Gaussian cells, which closely approximate Gaussian characteristics.
The output current equation of the Gaussian cell for the $j$th input sample and $i$th feature input, as shown in Fig. 2c, is given by 24,25 : where $\tilde{I}_{bias}$ is the bias input current, $x_{ji}$ is a feature input, $a_{ji}$ is the random offset that emerges due to device mismatch, and $V_x$ is a constant voltage. The DC characteristic plots of $I_{out2}$ and $I_{out1}$ shown in Fig. 2c always satisfy the constraint $I_{out1} + I_{out2} = \tilde{I}_{bias}$. A similar equation can be obtained for $I_{out4}$. On approximating, (4) closely resembles the Gaussian function in (5) 24,25 and can be rewritten as: where $C_1$ is a constant and $b_i$ is a Gaussian kernel parameter obtained by scaling the input voltages. By cascading nine such 1D Gaussian structures, a nine-dimensional (9D) Gaussian cell can be obtained, whose equation is approximated by: where $a_j \in \mathbb{R}^{9}$ and $x_j \in \mathbb{R}^{9}$. Fig. 2a shows the block diagram for the implementation of (6) for the 9D Gaussian cell.
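The receptive-field arithmetic above (3 × 3 windows, stride of two, 31 × 51 input, 375 nodes) can be checked with a short Python sketch. The function name and the row-major flattening of each window are illustrative assumptions; each extracted window holds nine values, matching the nine inputs of one 9D Gaussian hidden node.

```python
def lrf_windows(matrix, k=3, stride=2):
    """Slide a k x k window over the matrix with the given stride.

    Returns a list of flattened windows (k*k values each, row-major).
    """
    rows, cols = len(matrix), len(matrix[0])
    windows = []
    for r in range(0, rows - k + 1, stride):
        for c in range(0, cols - k + 1, stride):
            windows.append([matrix[r + i][c + j]
                            for i in range(k) for j in range(k)])
    return windows

# A placeholder 31 x 51 feature matrix (values are arbitrary).
feature = [[float(r * 51 + c) for c in range(51)] for r in range(31)]
fields = lrf_windows(feature)
print(len(fields), len(fields[0]))  # 375 windows of 9 values each
```

The count follows from (31 - 3) / 2 + 1 = 15 window positions along one axis and (51 - 3) / 2 + 1 = 25 along the other.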
CMOS encoder chip (row-select encode and control unit)
In LRF-ELM, the number of hidden nodes depends on the size of the input feature map needed to capture local correlations, the input feature dimension, the strides for convolution, and the accuracy desired by the application. For instance, the hidden node count required for digit recognition can be significantly lower than that for digit classification. In all cases, the output pin count of the CMOS chip cannot be scaled in proportion to the number of hidden nodes. To overcome this challenge, we designed a row-select encode unit, whose block architecture and detailed explanation are given in Supplementary Fig. 2 and Supplementary Note 2, respectively. The circuit utilizes cascode analog multiplexers 26 designed using transmission gates. Note that the number of cascoded multiplexer circuits inside the row-select encode unit is equal to the number of outputs of the CMOS Encoder chip. The control unit consists of a voltage bias generator and a scan chain flip-flop; the scan chain flip-flop controls the select line of the row-select encoder in the CMOS chip and the row-select decode of the memristor chip. This helps in time-multiplexing the output of the hidden nodes to the memristor crossbar array. At each clock edge, $n'$ rows are connected to the row-select decode unit of the memristor chip. The control unit simultaneously provides the timing control pulse voltage Out_Pulse in Fig. 1b to the DPI 21 circuit in the memristor chip. Note also that the pulse width of Out_Pulse is chosen such that the capacitor in the DPI does not reach its saturation level until all the hidden nodes have been scanned through the select inputs.

Memristor decoder chip (memristor crossbar)
The learned weights between the hidden nodes and the output nodes were programmed in the memristor crossbar array, where each memristor device can have multiple quantized conductance states. The memristor crossbar performs the MAC operation between the hidden node output voltages and the memristor conductances. It also decodes the input feature from the hidden layer to a higher-dimensional space. The hidden node voltages $(h_1, \ldots, h_L)$ in the LRF-ELM network of Fig. 1a are passed through the row-select encode unit of the CMOS chip to the row-select decode unit of the memristor chip on a time-multiplexed basis. The row selection timing is precisely controlled using the timing control block of the CMOS chip. These voltages are then passed as inputs to the memristor crossbar, where each memristor device is programmed to one of the 26 possible states corresponding to 26 pre-trained conductive states. Each column's output in the memristor crossbar is stored as a charge in a capacitor of a DPI integrator circuit 21 (Fig. 1b). At each time step, the capacitor (C in Fig. 1a) in the DPI array keeps integrating the output from the memristor crossbar array until all the hidden nodes have been scanned for a particular data input. The final results are then obtained by comparing the voltages at the DPI output nodes.
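The time-multiplexed MAC operation described above can be illustrated with a short Python sketch. It assumes an idealized crossbar (column current equal to the sum of voltage times conductance) and an ideal, lossless accumulator in place of the DPI circuit, so it shows only that scanning the rows in groups and accumulating gives the same column result as applying all rows at once. Function names, units, and values are hypothetical.

```python
def crossbar_mac(voltages, conductances):
    """Ideal crossbar: column output = sum over rows of V_row * G[row][col]."""
    n_cols = len(conductances[0])
    return [sum(v * g_row[c] for v, g_row in zip(voltages, conductances))
            for c in range(n_cols)]

def time_multiplexed_mac(voltages, conductances, rows_per_step):
    """Apply rows_per_step rows per clock edge and accumulate column outputs.

    Assumption: the accumulator is ideal (no leakage or saturation),
    unlike a real integrator circuit.
    """
    acc = [0.0] * len(conductances[0])
    for start in range(0, len(voltages), rows_per_step):
        v_slice = voltages[start:start + rows_per_step]
        g_slice = conductances[start:start + rows_per_step]
        partial = crossbar_mac(v_slice, g_slice)
        acc = [a + p for a, p in zip(acc, partial)]
    return acc

# Five hidden-node voltages and a 5 x 2 conductance matrix (arbitrary units).
v = [1.0, 2.0, 3.0, 4.0, 5.0]
g = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(crossbar_mac(v, g))              # all rows at once
print(time_multiplexed_mac(v, g, 2))   # two rows per clock edge
```

Because the accumulation is a plain sum, the group size only changes how many clock edges are needed, not the final column totals.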
The crossbar array of memristors has been fabricated using chemical vapor deposited (CVD) multilayer hexagonal boron nitride (h-BN) as the RS medium 27 , which was sandwiched between Au electrodes (see "Methods" and Fig. 3a). These devices exhibit nonvolatile bipolar RS at low currents, down to ~2 pA in the high resistive state and ~10 nA in the low resistive state, as shown in Fig. 3b. The device-to-device variability of the electrical properties of these devices is low. As an example, Fig. 3c shows the distribution of the set and reset voltages ($V_{SET}$ and $V_{RESET}$) for 16 devices; the coefficient of variation ($C_V$) is calculated as the standard deviation (σ) divided by the mean value (µ), and the values reported are 6.2% for $V_{SET}$ and 12.4% for $V_{RESET}$. The value of $C_V$ of $V_{SET}$ and $V_{RESET}$ is calculated for every single device (see Fig. 3d, f) and for an accumulated number of devices (see Fig. 3e, g), when using current limitations of 1 µA (pink spheres) and 1 mA (blue/red spheres), and these values are compared with the values reported in the literature (which mainly focuses on cycle-to-cycle variability, although some studies also report device-to-device variability) 28–40 . The amount of data presented is much larger than that reported in any other study and gives a realistic picture of the cycle-to-cycle and device-to-device variability, which is amongst the lowest in both cases.
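The coefficient of variation is conventionally the standard deviation divided by the mean, and the cumulative version is the same statistic recomputed as devices are added to the population. A minimal Python sketch follows; the voltage values are made-up placeholders rather than measured data, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
import statistics

def coefficient_of_variation(values):
    """C_V = sigma / mu, using the sample standard deviation (assumption)."""
    return statistics.stdev(values) / statistics.mean(values)

def cumulative_cv(values):
    """C_V for the first n values, for n = 2 .. len(values)."""
    return [coefficient_of_variation(values[:n])
            for n in range(2, len(values) + 1)]

# Hypothetical set voltages (in volts) for a handful of devices.
v_set = [2.9, 3.1, 3.0, 3.3, 2.8, 3.2]
print(f"C_V = {coefficient_of_variation(v_set):.3%}")
print([round(cv, 4) for cv in cumulative_cv(v_set)])
```

Plotting the cumulative list against n gives the kind of curve used to judge whether the reported variability has stabilized with population size.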
Compared to memristors made of traditional metal oxides (which are the reference in this field), the use of h-BN is beneficial because it allows better control of the potentiation process. While the potentiation in most metal oxides is only controllable at high currents and shows an erratic trend (the currents go up in some pulses and down in others, with an overall upward trend) 41–44 , the potentiation in h-BN memristors is controllable even in the sub-microampere regime and shows a very smooth trend 45 (see Fig. 3j). The reason is that the conductance modulation in CVD-grown multilayer h-BN takes place in few-atoms-wide native defects that are surrounded by highly stable crystalline 2D layered h-BN; thus, ionic migration takes place in a very confined volume and cannot propagate laterally 46 , which allows it to be controlled more accurately. The Au/h-BN/Au devices presented here exhibited more than 26 stable conductance states 47 starting at 10 nS (i.e., the lowest conductance, see Fig. 3d), something very challenging to achieve with standard memristors made of TMO. The lateral size of the memristors used in the crossbar array is 5 µm × 5 µm, although good miniaturization has been demonstrated using 150 nm × 200 nm cross-point devices 48 and nanodot devices with a radius of ~25 nm 47 , showing excellent potential for high integration density. The fact that only scalable methods have been used during this process facilitates the integration of the proposed setup in the semiconductor fabrication line 46 .

Memristor decoder chip (row-select decode unit and mixed interface circuit)
The output of the CMOS chip is passed to the row-select decode unit of the memristor chip. Depending on the select signal (Sel<0:n>) in the row-select decode block, the outputs are routed to the memristor crossbar array. The row-select decode is a network of CMOS pass transistors and charge-based switching gates. The block architecture and detailed explanation are presented in Supplementary Fig. 3 and Supplementary Note 3, respectively. Note that the select lines of the row-select decode unit in the memristor chip and the row-select encode in the CMOS chip are synchronized with the clock edge of the control-and-timing unit, and the number of analog pass transistors is equal to the number of outputs coming from the CMOS Encoder chip. The mixed interface circuit (Fig. 1b) of the output nodes consists of an array of current-mode log-domain DPI synapses working in the subthreshold regime. The output response ($I_{Y1}$) of one DPI circuit receiving a cumulative input current ($I_{o1}$) from all the memristors in a column of the crossbar, arriving at $t_0$ and ending at $t_1$, is given by 21 : for the charging pulse period, and: for the discharge pulse; $I_{Y1}(t_0)$ and $I_{Y1}(t_1)$ represent the initial conditions at $t_0$ and $t_1$, respectively, and $I_{gain} = I_0 e^{-|V_g - V_{DD}| / (\eta U_T)}$ represents the virtual p-type subthreshold current, $\eta$ is the subthreshold slope factor and $U_T$ is the thermal voltage. At each clock edge, when the row-select encode and row-select decode switch their select inputs (Sel<0:n>), the resultant DPI current ($I_{Yj}$; $j \in (1:4)$) starts falling. This fall is accumulated until the output of each hidden node has been scanned by the row-select encode and row-select decode. The outputs of the mixed-signal interface unit are then compared with a pre-defined threshold value. Outputs above and below the threshold value are classified as one and zero, respectively, and are then used to identify which class the data sample belongs to.
Note that the capacitor ($C$) and the voltages $V_g$, $V_\tau$ and $V_W$ are set such that the output does not saturate while discharging, even when all the rows are scanned.
System performance analysis
Device mismatch poses a major challenge in analog circuit design 8 . Several methods have been proposed to overcome this issue, as it affects system performance in one way or another. The proposed framework utilizes inherent device mismatch to implement random weights between the input and hidden nodes. Figure 2a shows one of the hidden nodes of the LRF-ELM architecture of Fig. 1a, implementing a 9D Gaussian cell. Device mismatch between the transistors in a 9D Gaussian cell (hidden node) adds randomness to the offset voltages $a_j$ shown in (5) and (6). Figure 4a shows the output current characteristic plot at each N and P Gaussian cell output port in the 9D Gaussian architecture shown in Fig. 2a. The characteristic plots are obtained when two inputs, $x_{j1}$ and $x_{j2}$ in Fig. 2a, are varied simultaneously from 0 to 1.8 V, while the other inputs are assigned a constant fixed voltage. One can analyze the variation in the non-linear Gaussian curve when the inputs are varied. It can also be seen that the available output range for maintaining this non-linearity should preferably be kept between 0.8 and 1.6 V. Hence, all the inputs were normalized to this range before being applied to the classification network. Note also that this range can be tuned by changing the aspect ratio of the transistors. We chose an aspect ratio of 2 for the NMOS and PMOS transistors to satisfy minimum power and area constraints. Figure 4b shows the output current plot when $V_{bias}$ (corresponding to the current $I_{bias}$ in Fig. 2b) is varied from 0 to 1.8 V and all the inputs are kept constant and fixed. This $V_{bias}$ acts as a tunable hyperparameter for adjusting the current amplitude and non-linearity characteristics of the design. Figure 4c shows the change in the output current amplitude of 9D Gaussian cells for 200 hidden neurons due to process, device mismatch, and gain variation. It can be seen that the output current varies for input voltages according to a log-normal distribution, owing to the exponential relationship between the voltage and the current of a transistor in the subthreshold region. This is a significant source of non-linearity for the random offset voltage $a_{ji}$ in the Gaussian cell.
Fig. 3 (panels c–i) c Statistical analysis of $V_{SET}$ and $V_{RESET}$ for a population of 16 devices. d, f Comparison of the coefficient of variation ($C_V$) of $V_{SET}$ and $V_{RESET}$ (respectively) of all our h-BN memristors with the values reported in previous publications. The $C_V$ is calculated from different devices and is ordered from smallest to largest. e, g Change of $C_V$ of $V_{SET}$ and $V_{RESET}$ (respectively) with increasing population of tested devices. For example, the value of $C_V$ for n memristors indicates the cumulative $C_V$ for a population of n memristors. h Current signal measured when applying sequences of pulsed voltage stresses, demonstrating analog transitions between different conductance states (up: amplitude 5.8 V, duration 1 ms and interval 1 ms; middle: amplitude 5.8 V, duration 500 µs and interval 500 µs; bottom: amplitude 4 V, duration 20 ms and interval 25 ms). i Cumulative probability plot of the currents registered during a constant voltage stress at 0.1 V for 100 s, at 26 different current levels during the potentiation of the Au/h-BN/Au device. In all cases the current is stable. More stable states may be registered if the reading is carried out at other resistance levels.
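Two points from this analysis lend themselves to a short Python sketch: scaling input features into the 0.8 to 1.6 V window, and the way a normally distributed voltage offset becomes a log-normally distributed current through an exponential subthreshold relation. The min-max scaling rule, the expression I = I0 * exp(dV / (eta * U_T)), and every numeric parameter below are illustrative assumptions rather than extracted circuit values.

```python
import math
import random

def normalize_to_range(features, lo=0.8, hi=1.6):
    """Min-max scale features into [lo, hi] volts (assumed scaling rule)."""
    f_min, f_max = min(features), max(features)
    if f_max == f_min:
        # Degenerate case: map a constant feature to the middle of the range.
        return [(lo + hi) / 2.0 for _ in features]
    scale = (hi - lo) / (f_max - f_min)
    return [lo + (f - f_min) * scale for f in features]

def mismatch_currents(n_nodes, i0, sigma_v, eta, u_t, rng):
    """Currents for n_nodes with a Gaussian voltage offset per node.

    Because the current depends exponentially on the offset, log(I) is
    normally distributed, i.e. I itself is log-normal.
    """
    return [i0 * math.exp(rng.gauss(0.0, sigma_v) / (eta * u_t))
            for _ in range(n_nodes)]

volts = normalize_to_range([0.0, 5.0, 10.0])
print(volts)

# Hypothetical parameters: 1 nA nominal current, 5 mV offset spread,
# slope factor 1.5, thermal voltage of about 25.9 mV.
currents = mismatch_currents(200, 1e-9, 0.005, 1.5, 0.0259, random.Random(0))
log_i = [math.log(i) for i in currents]
print(sum(log_i) / len(log_i))  # mean of log-current, close to log(1e-9)
```

A histogram of `log_i` would look approximately Gaussian, whereas a histogram of `currents` would show the skewed shape typical of a log-normal distribution.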

Framework testing and dataset
A generic (fully connected) ELM network is deemed suitable for classification and recognition of low-dimensionality datasets but poses hardware challenges for datasets with a high feature dimension because of the significant increase in hardware complexity. We tested the proposed hybrid LRF-ELM framework for both classification and recognition on high-dimensional audio and image datasets. For classification, we utilized the environmental sound classification (ESC-50) dataset 49 and the free-spoken-digit (FSDD) dataset 50 . Pre-processing and feature extraction were done offline using the cascade of asymmetric resonators-inner hair cell (CAR-IHC) cochlear model 51 . The extracted feature (cochleogram) was then passed to the LRF-ELM network for classification. The size of the extracted feature (cochleogram matrix) was 31 × 51 for the ESC-50 dataset and 47 × 51 for the FSDD dataset. This feature matrix size was optimized to properly capture all the relevant features in the respective datasets. Table 1 shows the classification results for the ESC-50 49 and FSDD 50 datasets. For the ESC-50 dataset, classification was performed on pre-processed data samples (where each data point is a cochleogram matrix of a 1-s audio signal). The ESC-50 dataset was created by combining 200 data samples each from 5 different classes belonging to different groups, where each group belongs to a different class category such as animal, natural landscape, human non-speech, interior domestic, and exterior urban sounds. Speaker recognition was performed on the FSDD dataset; the average accuracy for various speakers is shown in Table 1 for 2000 pre-processed data points of 1 s each.
We also tested the framework on an image classification task. For this, we utilized the Semeion digit recognition dataset 52 from the UCI repository. The input image dimension was scaled to 12 × 13 and passed to the LRF-ELM network for classification. Table 2 shows the digit recognition accuracy on the Semeion dataset.
Additionally, we tested the system on lower-dimensional datasets with fewer than 9 features, so that the standard ELM framework can be used directly instead of the LRF-ELM technique. The system was trained offline using the quantization aware SGD algorithm 22 .
For low-dimensional sensory feature signals, we utilized the activity recognition system based on multi-sensor data fusion (AReM) dataset 52 from the University of California Irvine repository, which has multivariate, sequential, and time-series characteristics. Four activities, namely walking, standing, lying, and sitting, were used from the dataset for classification. We normalized the input features to the non-linearity voltage range of the Gaussian kernel obtained for the fixed bias voltage $V_{bias}$. These features were then passed to a fully connected pre-trained ELM network. The classification results were obtained from the DPI output array. We also performed classification on other real-time datasets such as Breast Cancer, Ionosphere and Haberman 52 . Table 3 summarizes the classification accuracy of the framework on these datasets. In all the above cases, the number of hidden nodes, $L$, was kept at 10 times the number of input features 53 ($K \times 10$). Floating accuracy in Tables 1-3 denotes the classification accuracy when full-precision weights between the hidden nodes and the output were used. Quantized accuracy shows the accuracy when the weights were mapped to the nearest of the 26 available stable memristor conductance states. The CMOS power consumption per computation for a single hidden node at a supply voltage of 0.95 V was found to be 7.8 µW (a detailed explanation of the energy efficiency is presented in Supplementary Note 4 and Supplementary Note 5).
Fig. 4 Gaussian non-linearity and mismatch characteristic curve. a Output current characteristic plot at the output of each N-type and P-type Gaussian cell shown in Fig. 2a when two inputs, $x_{j1}$ and $x_{j2}$, are varied simultaneously from 0 to 1.8 V and the other inputs ($x_{j3}$–$x_{j9}$) are fixed constants. b Plot of the output current at the output of a 1D Gaussian cell ($I_{out1}$ in Fig. 2c, d) when $V_{bias}$ is varied from 0 to 1.8 V and all other inputs in the 9D Gaussian cell are kept constant. c Variation in the output current amplitude of 200 hidden neurons, exhibiting a log-normal distribution for fixed input voltages due to process, mismatch and gain.

DISCUSSION
This paper proposes the design of a hybrid edge computing system utilizing emerging 2D materials along with existing CMOS technology. The proposed system integrates the best of both processes: the area and energy efficiency of 2D materials, and the scalability and power efficiency of the existing CMOS process. We utilized a beyond-conventional near-memory approach in which a 2D memristor crossbar array is used as the multi-state analog memory, thus overcoming the problem of the area and memory wall in CMOS. The designed system is generic enough to be used in several applications such as audio detection, speech and non-speech detection, image classification, and other edge computing tasks where latency, power, and area are severely constrained. We showed the fabrication of a memristor crossbar array using multilayer h-BN as the dielectric, which exhibits analog bipolar RS in >96% of devices. Analyses of the operating frequency, throughput, and power consumption are presented in Supplementary Fig. 4 and described in Supplementary Note 4 and Supplementary Note 5. Furthermore, the isolation between the CMOS Encoder chip and the Memristor Decoder chip allows us to explore other memory technologies, such as Phase Change Memory and RRAM. Thus, the low power consumption and good classification results of the proposed framework make our system suitable for resource-constrained edge devices.

METHODS
Simulation
Post-layout CMOS circuit analyses were executed on a Cisco Hyperflex HX420c with 72 Intel Xeon processing cores in parallel, 32 GB of swap memory and 240 GB of random-access memory. Cadence IC 6.18 with Spectre 18 was used along with Virtuoso Layout GXL for inferencing each dataset, with a turn-around time of ~2 weeks for each model.

Fabrication of memristors' crossbar arrays
The fabrication of the memristor crossbar arrays follows three steps: bottom electrode, middle dielectric layer, and top electrode. First, the matrix of Au bottom electrodes was patterned via photolithography (mask aligner from SUSS MicroTec, model MJB4) and electron beam evaporation (Kurt J. Lesker, model PVD75) on a 300 nm SiO$_2$/Si substrate. Each pattern consists of 5 µm wide metal wires connecting large metal pads (~10^4 μm in size) for better probe station tip engagement. Second, for the dielectric layer, ~6 nm thick CVD h-BN sheets were transferred (from their growth substrate, i.e., Cu foil) onto the matrix of bottom electrodes via the standard wet transfer method, in which a FeCl$_3$ water-based solution is used as the copper substrate etchant and PMMA is used as the polymer scaffold 54 . Third, a matrix of Au top electrodes was patterned and deposited with the same recipe as in the first step, except that the pattern was aligned to form cross-point junctions with the bottom electrodes. The cross-point regions between the wires define the active area of each memristor, which is 5 µm × 5 µm. The growth process of the h-BN sheet is described in depth in reference 27 .

Device characterization
The structure and surface morphology of the crossbar arrays were analyzed by scanning electron microscopy (SEM, from Carl Zeiss, model Supra 55). The electrical data were collected with a semiconductor device analyzer (from Keysight, model B1500) connected to a probe station (from Cascade Microtech, model M150). We used the waveform generator/fast measurement unit (WGFMU) for pulsed voltage stress application and simultaneous current recording. For all electrical measurements, the stress tip was always applied to the top electrode, while the ground tip was always applied to the bottom electrode.