## Introduction

Stochastic neuron devices are essential for neural network implementations of key emerging non-von-Neumann computing concepts such as the Boltzmann machine (BM), a recurrent artificial neural network with stochastic features analogous to the thermodynamics of real-world physical systems. BMs can be used to solve a broad range of combinatorial optimization problems1,2 with applications in classification3, pattern recognition4, feature learning, and other emerging computing systems. Deriving its name from the Boltzmann distribution of statistical mechanics, the BM possesses an artificial notion of “temperature”. The controlled evolution of this “temperature” parameter during the optimization process5,6, i.e., the “cooling” strategy, affects the convergence efficiency of the BM and its chance of reaching a better minimum (or maximum, depending on the problem definition) of the cost energy. A hardware BM that supports “temperature” control, and hence precise execution of a desired “cooling” strategy, requires electronic devices that can generate exponential-class stochastic samples with dynamically tunable distribution parameters.

The deterministic behavior of memristors has been widely used in applications such as multiply-and-accumulate matrix calculation7 and resistor-logic demultiplexers8,9,10. Their stochastic behavior is often intentionally suppressed11,12,13 in such applications to achieve accurate and reproducible computational results14,15. On the other hand, the rich stochastic behavior of memristors, which relies on ensembles of random movements of atoms and ions, offers opportunities in energy-efficient computing applications16,17,18,19,20. With this stochastic behavior, one can generate random numbers21 to encrypt information, implement physical unclonable functions22, and realize artificial neurons23 with integrate-and-fire activations. Furthermore, emerging computing schemes can use stochastic memristive devices as building blocks to emulate biological neural networks24,25, whose functions, such as decision-making, can leverage the stochastic dynamics of neurons and synapses. However, a common challenge with previous stochastic memristors is the lack of means to precisely control and modulate the probability distribution associated with their randomness. Realizing such control has been difficult because many device-generated random features in stochastic memristors or oscillators lack a stable probability distribution, which limits the chance of controlling it experimentally19,26,27. Additionally, a common memristor has only two terminals, so the probability distribution can be influenced only through the two-terminal bias and the distribution of the device output cannot be tuned flexibly and precisely.

In this work, we overcome this challenge with a three-terminal stochastic hetero-memristor based on a tin oxide/MoS2 heterostructure, which demonstrates tunable statistical distributions enabled by gate modulation. The inherent exponential-class stochastic characteristics of the device, arising from the intrinsic randomness and energy distribution of its ionic motions, are explored to realize sampling of exponential-class sigmoidal distributions that resemble the Fermi–Dirac distribution in physical systems. Gate modulation allows efficient control of the stochastic features in the device output characteristics. The device enables the realization of a reconfigurable stochastic neuron and the implementation of a Boltzmann machine in which the reconfigurable statistics of the device allow different “cooling” strategies to be implemented during the optimization process. The effect of different “cooling” strategies on the efficiency of the BM optimization process is demonstrated experimentally.

## Results

Figure 1a shows the schematic of this reconfigurable heteromemristor, where tin oxide serves as the filament-switching layer and is sandwiched between a MoS2 layer and Cr/Au top electrodes (TE). The Si substrate serves as the gate, and its bias (Vg) can influence the filament-formation dynamics in the tin oxide layer. The high-resolution scanning transmission electron microscopy (HR-STEM) image in Fig. 1b shows the cross section of the fabricated device and reveals that the tin oxide layer is amorphous. An energy-dispersive X-ray spectroscopy (EDX) scan in Fig. 1c indicates the elemental composition. Figure 1d plots the Raman spectra for the SnSe sample before and after oxidation, which leads to the formation of the SnOx layer. All signature modes of SnSe that are observed before oxidation, including the shear mode Ag1, the in-plane modes Ag2 and B3g, and the out-of-plane mode Ag3, are not detected after oxidation, indicating the full oxidation and amorphization of the SnSe sample28. The tin oxide film can also be synthesized using atomic-layer deposition (ALD)29,30,31, which produces films of similar quality to those from the direct oxidation method.

Unipolar electrical switching characteristics of the device at Vg = 0 V are shown in Fig. 1e. It sets and resets at around 3.2 V and 2.8 V respectively in the positive bias, and at −3.4 V and −3 V, respectively, in the negative bias32. Both the Joule heating and the electric-field driven effect can be playing roles in the device operation. The filament-formation operation can be due to a breakdown-like process with random creation of voltage-stress-induced vacancy or defect sites, which is electric-field driven. The Joule heating can be the main effect in filament rupturing. The insertion of the MoS2 layer in the device made it possible to adjust the electron energy level in MoS2 by externally modulating the gate bias Vg, which can modulate both the contact-energy barrier between the MoS2 and SnOx, and the conductivity of the MoS2 sheet itself (see supplementary information section 4). Hence, as shown in Fig. 1f, as the gate bias decreases from 30 V to −20 V, the electrostatic doping in MoS2 and the associated energy level decreases, leading to the reduction in the series conductivity and hence the gradual increase in the set voltage.

The filament-formation process is stochastic due to the inherent random motion of oxygen ions. To extract this stochastic property quantitatively, a statistical study is carried out on the set process. As shown in Fig. 2, the device is initially reset to the high-resistance state and a bias VTE is applied to the device for up to 2 s. During each set process, it takes a certain amount of time t (t ≤ 2 s) after the bias voltage is applied for the device to set. This wait time until set is stochastic in each trial. Furthermore, there is a certain chance that the device remains in the high-resistance state after 2 s. Figure 2a plots the device current as a function of time when this reset-and-set process was repeated 30 times at VTE = 6 V, 5 V, 4 V, and 3 V, respectively, with Vg fixed at 0 V. At VTE = 6 V, the device set successfully within the first 2 s in all 30 trials. At VTE = 5 V, 4 V, and 3 V, the device failed to set within the first 2 s in some trials. Figure 2b shows the histogram of the time required for the device to set, extracted from the 30 trials. If we consider t as a random variable, the probability that the set occurs within an infinitesimal interval $$\Delta t$$ at time t can be described by an exponential-class distribution33 function $$P=\frac{\Delta t}{\tau}\cdot e^{-\frac{t}{\tau}}$$, where $$\tau$$ is the mean wait time, as expected for the wait time of a Poisson process (see supplementary information section 6); it fits the experimental data well (red lines, Fig. 2b). This experimental observation of Poisson-like random wait times underlying the filament-formation process in the tin oxide memristive device is indicative of its exponential-class stochastic nature.
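The exponential wait-time picture can be illustrated numerically. The following is a minimal sketch rather than the authors' analysis code: it assumes an exponentially distributed wait time with mean `tau_s`, and the function names, the example `tau_s` value, and the random seed are hypothetical. It draws wait times and reports the fraction of trials that set within the 2 s window, which for this model equals $$1-e^{-2/\tau}$$ in expectation.

```python
import math
import random
from typing import List, Tuple


def set_probability_within(window_s: float, tau_s: float) -> float:
    """Analytic P(t < window) for an exponential wait time with mean tau_s."""
    return 1.0 - math.exp(-window_s / tau_s)


def simulate_set_trials(
    tau_s: float, window_s: float = 2.0, trials: int = 30, seed: int = 0
) -> Tuple[List[float], float]:
    """Draw exponential wait times; keep those that fall inside the window.

    Returns the wait times of the successful trials and the success fraction.
    tau_s is an assumed mean wait time, not a value taken from the paper.
    """
    rng = random.Random(seed)
    waits = [rng.expovariate(1.0 / tau_s) for _ in range(trials)]
    set_times = [t for t in waits if t <= window_s]
    return set_times, len(set_times) / trials


if __name__ == "__main__":
    times, fraction = simulate_set_trials(tau_s=1.0)
    print(f"simulated fraction set within 2 s: {fraction:.2f}")
    print(f"analytic value: {set_probability_within(2.0, 1.0):.2f}")
```

With only 30 trials, as in the experiment, the simulated fraction scatters noticeably around the analytic value; increasing `trials` tightens the agreement.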

Moreover, Fig. 2c plots Pss,t<2s as a function of VTE − VTE0 under different gate voltages, which shows an exponential-class sigmoidal distribution function. Here, Pss,t<2s is the probability that the device will successfully set within 2 s and VTE0 is the 50% probability bias-voltage point, i.e., Pss,t<2s (VTE = VTE0) = 0.5. With the gate voltage fixed, the chance of the device being set within t < 2 s becomes higher with increasing VTE, following a sigmoidal distribution. This shows that VTE can tune the stochastic property of the set event in the device when Vg is fixed. Microscopically, VTE tunes the filament-formation process by modulating the vacancy-hopping barrier height and thus the ion-hopping rate. The device is therefore easier to set at high VTE than at low VTE. Under different gate voltages, Pss,t<2s shows a sharper 0-to-1 transition when Vg is 30 V and a wider spread in its 0-to-1 transition as Vg decreases. Here Vg tunes the Fermi level and charge density in the MoS2 layer, which modulates the potential distribution between the MoS2 and tin oxide layers under VTE bias. VTE is more effective in modulating the device when Vg is higher, i.e., when the MoS2 layer has a higher electron carrier density and higher conductivity, which leads to a sharper 0-to-1 transition in the sigmoidal distribution curve.

The set process is achieved by filament formation through stochastic vacancy generation and hopping-transport processes. Applying a voltage can reduce the generation and hopping-barrier heights and exponentially enhance the generation and hopping rates. Analytically, the set probability can be derived as $$P_{\mathrm{ss},\,t<2\,\mathrm{s}}=1-e^{-\beta e^{\alpha (V_{\mathrm{TE}}-V_{\mathrm{TE}0})}}$$, where $$\alpha$$ and $$\beta$$ are parameters related to the material and device structure (see supplementary information section 7). After further approximation, Pss,t<2s can be simplified to a distribution function that resembles the Fermi–Dirac distribution (see supplementary information section 8):

$$P_{\mathrm{ss},\,t<2\,\mathrm{s}}\approx \frac{1}{1+\exp\left(-\frac{V_{\mathrm{TE}}-V_{\mathrm{TE}0}}{T_{\mathrm{eff}}}\right)}$$
(1)

where Teff is an effective “temperature” term that can be tuned by the gate bias. This expression fits the experimental data in Fig. 2c very well. The above analytical description also agrees with kinetic Monte Carlo simulations, which describe the microscopic stochastic processes of vacancy generation, hopping, and recombination in filament formation34,35. Teff corresponding to various gate voltages is extracted from the fitting, and Fig. 2d plots Teff versus gate voltage Vg. A behavioral model is developed to understand the dependence of Teff on the gate-bias voltage. The device is modeled as a memristor in series with a MoS2 layer whose resistance (both the sheet resistance and its contact property with the memristive filament) can be modulated by the gate electric field. As a result, Teff can be expressed as $$T_{\mathrm{eff}}\left(V_{\mathrm{g}}\right)=T_{\mathrm{V}0}\left[1+\frac{Z}{V_{\mathrm{g}}-V_{\mathrm{T}}}\right]$$, where $$T_{\mathrm{V}0}$$ and Z are constants and VT is the threshold voltage (see supplementary information section 9). As shown in Fig. 2d, this model fits the experimental data well and describes the modulation of Teff by Vg. Note that Teff has the unit of volts. However, to avoid confusion with the actual electrical bias voltages applied to the device, the unit of Teff is omitted in the subsequent discussions. The stochastic filament-formation process discussed above, together with the gate-voltage-dependent “temperature” effect, can be used to construct exponential-class distribution sampling that has broad applications in statistical modeling and computing, with the Boltzmann machine as a typical example.
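The two relations above, the sigmoidal set probability of Eq. (1) and the behavioral model for Teff(Vg), can be evaluated together in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, and the constants `t_v0`, `z`, and `v_t` in the example are illustrative placeholders, not the fitted values from Fig. 2d.

```python
import math


def t_eff(v_g: float, t_v0: float, z: float, v_t: float) -> float:
    """Behavioral model T_eff(V_g) = T_V0 * [1 + Z / (V_g - V_T)].

    Only the branch above the threshold voltage is handled here.
    """
    if v_g <= v_t:
        raise ValueError("model is evaluated only for V_g above V_T")
    return t_v0 * (1.0 + z / (v_g - v_t))


def p_set(v_te: float, v_te0: float, teff: float) -> float:
    """Eq. (1): probability that the device sets within the 2 s window."""
    return 1.0 / (1.0 + math.exp(-(v_te - v_te0) / teff))


if __name__ == "__main__":
    # Placeholder constants, chosen only to show the trend with gate bias.
    T_V0, Z, V_T = 3.5, 43.0, -23.0
    for v_g in (-20.0, 0.0, 20.0):
        teff = t_eff(v_g, T_V0, Z, V_T)
        # Offset of +5 from the 50% point, in the same units as Teff.
        print(v_g, round(teff, 2), round(p_set(5.0, 0.0, teff), 3))
```

In this model a higher gate voltage gives a smaller Teff and therefore a sharper 0-to-1 transition, which is the trend described for Fig. 2c.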

To demonstrate the unique advantages of these tunable exponential-class stochastic heteromemristors in computing applications, a version of the Boltzmann machine that contains a network of stochastic neurons is implemented. The stochastic neurons may fire in response to the input signals and thus drive the searching dynamics of the BM. The BM iterates through possible solutions to search for the best one by minimizing the system-energy function. Hardware implementations36,37 of such BMs are challenging with conventional transistors and would require a large number of devices and complex circuitry. Here we build a BM in which each stochastic neuron is based on a single tin oxide/MoS2 hetero-memristor, which acts as the stochastic switch, and simple peripheral circuitry (more details in Methods: BM construction). This BM is used to solve a maximum satisfiability (MAX-SAT) problem, which is an NP-hard combinatorial optimization problem underlying a wide range of key applications, including Max-Clique38, correlation clustering39, treewidth computation40, Bayesian network structure learning41, and argumentation dynamics42.

Given a set of Boolean clauses, where each clause is a disjunction of Boolean variables and their negations, the MAX-SAT problem43 aims to maximize the number of clauses that can be true when truth values are assigned to the Boolean variables. Without loss of generality, the set of Boolean clauses to be solved in this work is selected to be $$\left\{C_i \mid i=1,2,\ldots,5\right\}$$, where the clause C1 is $$\left(x\vee y\vee z\right)$$; C2 is $$\left(x'\vee y\vee z\right)$$; C3 is $$\left(x'\vee y'\vee z\right)$$; C4 is $$\left(x\vee y'\vee z'\right)$$; and C5 is $$\left(x'\vee y\vee z'\right)$$ (shown in Fig. 3a; the Boolean variable $$x'$$ is the negation of the Boolean variable $$x$$). The optimization task here is to find a state vector $$\mathbf{X}=\left(x_1,\cdots,x_6\right)=(x,y,z,x',y',z')$$ that maximizes the number of true clauses. A MAX-SAT problem can be converted equivalently to a problem that is solvable by the BM44,45. Six stochastic units are used in the BM to realize the activation for each Boolean variable in the state vector $$\mathbf{X}$$. We then build a weight matrix W, in which the weight $$w_{ij}$$ between every two Boolean variables is assigned based on the MAX-SAT problem. Solving the MAX-SAT problem is equivalent to minimizing the total energy $$E=\mathbf{X}^{\mathrm{T}}\mathbf{W}\mathbf{X}$$ of the BM, where $$\mathbf{X}^{\mathrm{T}}$$ is the transpose of $$\mathbf{X}$$.
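The clause set and the quadratic energy can be written down directly. The sketch below is illustrative only: the data layout and function names are hypothetical, and the weight matrix is taken as an argument because its entries are not listed in this section. It evaluates which clauses a given assignment satisfies, builds the six-component state vector, and computes $$E=\mathbf{X}^{\mathrm{T}}\mathbf{W}\mathbf{X}$$ for a supplied W.

```python
from typing import Dict, List, Sequence, Tuple

Literal = Tuple[str, bool]  # (variable name, True if the literal is negated)

CLAUSES: List[List[Literal]] = [
    [("x", False), ("y", False), ("z", False)],  # C1 = (x  or y  or z)
    [("x", True), ("y", False), ("z", False)],   # C2 = (x' or y  or z)
    [("x", True), ("y", True), ("z", False)],    # C3 = (x' or y' or z)
    [("x", False), ("y", True), ("z", True)],    # C4 = (x  or y' or z')
    [("x", True), ("y", False), ("z", True)],    # C5 = (x' or y  or z')
]


def clause_truths(assignment: Dict[str, int]) -> List[bool]:
    """Truth value of C1..C5 for an assignment such as {"x": 0, "y": 1, "z": 1}."""
    return [
        any(bool(assignment[name]) != negated for name, negated in clause)
        for clause in CLAUSES
    ]


def state_vector(assignment: Dict[str, int]) -> Tuple[int, ...]:
    """State vector (x, y, z, x', y', z') with each primed entry the negation."""
    base = [int(bool(assignment[name])) for name in ("x", "y", "z")]
    return tuple(base + [1 - b for b in base])


def energy(x: Sequence[float], w: Sequence[Sequence[float]]) -> float:
    """Quadratic form E = X^T W X; w must be supplied by the caller."""
    n = len(x)
    return sum(x[i] * w[i][j] * x[j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    example = {"x": 0, "y": 0, "z": 0}
    print(state_vector(example), clause_truths(example))
```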

The constructed BM utilizing the tin oxide/MoS2 heteromemristors is shown in Fig. 3b and the schematic of the circuit blocks with six stochastic neurons is shown in Fig. 3c. In each iteration step, if the hetero-memristor sets, the Boolean value of $${x}_{{{{{{\rm{i}}}}}}}$$ would be flipped. If the heteromemristor does not set, the stochastic neuron would not fire and $${x}_{{{{{{\rm{i}}}}}}}$$ remains the same. The stochastic neurons are sequentially updated until the BM reaches the optimal solution. In Fig. 3d, we experimentally demonstrated the evolution of the state vector and total energy when the BM started from three different initial states and found the same optimal solution, which is $${{{{{\bf{X}}}}}}=(x,{y},z,{x}^{{\prime} },{y}^{{\prime} },{z}^{{\prime} })=({{{{\mathrm{0,1,1,1,0,0}}}}})$$.
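The sequential update rule described above (flip $$x_i$$ if the hetero-memristor sets, otherwise leave it unchanged) can be sketched as follows. This is an illustration under stated assumptions, not the hardware control code: the set event is modeled with the sigmoid of Eq. (1), and `drive` is a hypothetical callable that returns the offset VTE − VTE0 applied to unit i for the current state. How the peripheral circuitry derives that offset from the weights is not specified in this section.

```python
import math
import random
from typing import Callable, List, Sequence


def set_probability(delta_v: float, teff: float) -> float:
    """Sigmoid of Eq. (1), written in a form that avoids overflow."""
    u = delta_v / teff
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def sequential_sweep(
    state: Sequence[int],
    drive: Callable[[int, List[int]], float],
    teff: float,
    rng: random.Random,
) -> List[int]:
    """Update every stochastic unit once, in order.

    A unit flips its Boolean value when its (modeled) memristor sets and
    keeps its value otherwise.
    """
    new_state = list(state)
    for i in range(len(new_state)):
        delta_v = drive(i, new_state)  # assumed V_TE - V_TE0 for unit i
        if rng.random() < set_probability(delta_v, teff):
            new_state[i] = 1 - new_state[i]
    return new_state


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy drive: every unit sits exactly at its 50% point, so flips are coin tosses.
    print(sequential_sweep([0, 1, 1, 1, 0, 0], lambda i, s: 0.0, 10.0, rng))
```

Because the units are updated in sequence, each unit sees the values already updated earlier in the same sweep.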

As previously shown in Fig. 2d, Vg can tune the tin oxide/MoS2 heteromemristor to have different Teff during the BM optimization process. Teff of the BM describes the average behaviors of all the stochastic units, in close analogy to the temperature parameter in the Boltzmann distribution that describes the average behavior of particles under different thermal equilibrium states in physical systems. Thus, by controlling Teff in the optimization process that can be achieved via tuning the Vg, it is possible to avoid premature convergence issues and facilitate the convergence efficiency associated with the BM. Figure 3e shows the effect of different Vg bias on the BM optimization process. During these three different runs of the BM, all the tin oxide/MoS2 stochastic hetero-memristors are biased at Vg = −20 V, 0 V, and 20 V, respectively. The energy evolved differently during these runs each time. The BM is at Teff = 7 when Vg = 20 V and converges easily for this particular problem. On the other hand, the BM is at Teff = 50 when Vg = −20 V and is less efficient in reaching convergence. For Vg = 0 V, the BM is at Teff = 10 and converges at an intermediate rate among the three cases. By counting how many times the BM can reach the global optimal solution out of 50 trial runs, the success rate as a function of Vg and Teff is statistically obtained as shown in Fig. 3f. It indicates that the Vg and hence the Teff can substantially affect the performance of the BM.

Simulated annealing46,47 can be implemented with our BM where the Teff can gradually change during the optimization process to emulate different “cooling” strategy. It is an important approach for efficiently reaching better optimization solutions and for avoiding the premature convergence. Using the gate-tunable tin oxide/MoS2 device, such “cooling” procedures can be quantitatively implemented during the simulated annealing by translating the designated sequential evolution of Teff into the corresponding series of gate voltage bias conditions following the relation in Fig. 2d. To study the effect of different “cooling” strategies on the efficiency of the BM, four different Teff variation strategies were experimentally applied on the BM. Strategy 1: high Teff in the first three iteration steps followed by low Teff for the remaining iterations in one optimization process (HT to LT), Strategy 2: low Teff in the first three iterations followed by high Teff for the remaining iterations (LT to HT), Strategy 3: maintaining a low Teff in the entire optimization process (LT), and Strategy 4: maintaining a high Teff in the entire optimization process (HT). Figure 4a shows the qualitative schematic about how system energy (color dots) would evolve in the process of searching optimal solutions among multiple possible energy minimums (gray line). To analyze the effect of these “cooling” strategies, typical evolutions of the energy (cost function) during the BM optimization process for the four different strategies were experimentally obtained. As shown in Fig. 4b, using the HT strategy (Teff = 50), the BM is highly active but loses the selectivity for reaching proper convergence. Using the LT strategy (Teff = 5), the BM is significantly less active but possesses higher selectivity that facilitates its convergence to a premature state. 
Finally, simulated annealing using a “cooling” strategy (HT to LT) enables active initial searches at HT (Teff = 50) and then steady convergence to the minimum energy state at LT (Teff = 5) as shown in the experimental results. Furthermore, Figs. 4c and 4d show the experimentally obtained statistics of success rate in finding the global optimal solution when the different “cooling” strategies are used. Different initial values for the state vectors are used in Figs. 4c and 4d to show the effect from the different initial conditions. Both figures indicate that the HT to LT strategy has the highest success rate for reaching the global optimal solution for this particular problem, while the HT strategy has the lowest success rate. The results are consistent with the simulated performance of the BM (see supplementary information section 10).
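The four strategies can be expressed as per-iteration Teff sequences and then translated into gate voltages by inverting the behavioral model $$T_{\mathrm{eff}}=T_{\mathrm{V}0}\left[1+Z/(V_{\mathrm{g}}-V_{\mathrm{T}})\right]$$. The sketch below is a hedged illustration: the high and low values (50 and 5) and the switch after three iterations come from the text, while the function names, the strategy labels, and the model constants used in the example are hypothetical placeholders rather than fitted values.

```python
from typing import List

T_HIGH = 50.0  # "HT" value quoted for Fig. 4b
T_LOW = 5.0    # "LT" value quoted for Fig. 4b


def teff_schedule(strategy: str, n_steps: int, switch_after: int = 3) -> List[float]:
    """Teff for each iteration under one of the four strategies."""
    first, rest = {
        "HT_to_LT": (T_HIGH, T_LOW),  # Strategy 1
        "LT_to_HT": (T_LOW, T_HIGH),  # Strategy 2
        "LT": (T_LOW, T_LOW),         # Strategy 3
        "HT": (T_HIGH, T_HIGH),       # Strategy 4
    }[strategy]
    return [first if k < switch_after else rest for k in range(n_steps)]


def gate_voltage_for(teff: float, t_v0: float, z: float, v_t: float) -> float:
    """Invert T_eff = T_V0 * [1 + Z / (V_g - V_T)] for V_g (assumes Z > 0)."""
    if teff <= t_v0:
        raise ValueError("this model only reaches T_eff values above T_V0")
    return v_t + z / (teff / t_v0 - 1.0)


if __name__ == "__main__":
    # Placeholder model constants, for illustration only.
    T_V0, Z, V_T = 3.5, 43.0, -23.0
    for teff in teff_schedule("HT_to_LT", 6):
        print(teff, round(gate_voltage_for(teff, T_V0, Z, V_T), 2))
```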

To quantitatively understand why Teff can make such a significant difference in the BM optimization process, we analyze the Russell–Rao (RR) similarity48 between all the clauses for this particular MAX-SAT problem. This analysis is informative because, as illustrated in Fig. 5a, all five clauses C1–C5 bear inherent similarity to each other due to two constraints: the variable constraint and the clause constraint. On the variable side, a Boolean variable and its negation (two variables connected by red lines) are always logically opposite. For example, $$x$$ and $$x'$$ will always have opposite values. On the clause side, the chance of two clauses both being true is lower if they contain more pairs of complementary Boolean variables. By assigning true values to the variables $$x$$, $$y'$$, and $$z'$$ (yellow circle), the number of complementary variables (blue circle) between clauses can be easily observed. Counting the number of complementary variables directly reflects the inner connection and constraint of the clauses. In Fig. 5a, for example, if the clause C4: $$\left(x\vee y'\vee z'\right)$$ is true, then the probability that the clause C2: $$\left(x'\vee y\vee z\right)$$ is also true is much smaller than for the other three clauses, since C4 and C2 contain three pairs of complementary variables.

With the BM set to different Teff, the RR similarity matrix among the five clauses is constructed from the experimental data in Figs. 5b, 5c and 5d. The color and number in each cell quantify the similarity between the pair of clauses indexed by the row and column, i.e., the fraction of all cases in which both clauses are true. For example, an RR similarity of 0.84 between C1 and C2 in Fig. 5b means that, when the BM was run 50 times at Teff = 50, both C1 and C2 were true at the end of 42 of the 50 runs.
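The matrix construction amounts to counting, for each pair of clauses, the runs in which both are true. A minimal sketch, assuming each run is recorded as a list of final clause truth values (the function name and the synthetic data in the example are hypothetical, not experimental results):

```python
from typing import List, Sequence


def russell_rao_matrix(runs: Sequence[Sequence[bool]]) -> List[List[float]]:
    """RR similarity: fraction of runs in which clauses i and j are both true.

    runs[k][i] is the final truth value of clause i in run k.
    """
    n_runs = len(runs)
    n_clauses = len(runs[0])
    return [
        [
            sum(1 for run in runs if run[i] and run[j]) / n_runs
            for j in range(n_clauses)
        ]
        for i in range(n_clauses)
    ]


if __name__ == "__main__":
    # Synthetic two-clause example: both true in 42 of 50 runs.
    runs = [[True, True]] * 42 + [[True, False]] * 8
    for row in russell_rao_matrix(runs):
        print([round(v, 2) for v in row])
```

The diagonal entry for clause i is simply the fraction of runs in which that clause is true, and the matrix is symmetric by construction.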

The effect of Teff can be explained as follows. We view the RR similarity as the distance measurement of the statistical relationship between each of the two clauses (distance = 1 − RR coefficient) in solution space49. In other words, clauses with RR similarity close to 1 are seen as closely clustered, while the clauses with RR similarity close to 0 are furthermost separated. When Teff is tuned to 50 (Fig. 5b), all the clauses have similar distances in the solution space, since they show close RR similarity between all pairs. As a consequence, BM tends to search widely in the solution space with a high robustness, high stochasticity, and low selectivity, since choosing any solution would look the same to the BM. When Teff is 20 (Fig. 5c), clauses with small distances are closely clustered, giving high RR similarity close to unity for pairs of clauses that can be easily satisfied simultaneously, such as C1 and C2, and a low RR similarity for pairs of clauses that can hardly be satisfied at the same time, such as C1 and C4. At this Teff = 20, the BM gains more selectivity in solution space. When the Teff is 5 (Fig. 5d), all the clauses are either strongly clustered or separated in distance, with distinct either 1 or 0 RR similarity. BM behaves more like a deterministic “machine”. This tends to cause premature convergence as the BM is significantly less active.

Next, a simulated annealing process in the BM with linear cooling is simulated in Fig. 5e. The evolution of the RR similarity matrix indicates that the BM would evolve through all the cases that are discussed above from being fully stochastic toward nearly deterministic as Teff decreases linearly. Thus, the simulated annealing process of a BM could be understood as such: at high Teff, the BM searches solution space globally with high robustness and low selectivity, for the sake of large gradient descent; as the BM cools down, it gains selectivity toward some solutions and can possibly jump out of local minima since Teff still provides enough perturbation; as the BM cools down to the limit, the BM exhibits a stronger selectivity than robustness, preventing itself from jumping out of the optimal zone. Hence, more efficient performance in the BM can be achieved with an appropriate “cooling” strategy.
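A linear cooling schedule of the kind referred to above can be generated as follows. This is a sketch with assumed endpoints: the start and end values of Teff and the number of steps used for Fig. 5e are not stated in this section, so the numbers in the example are illustrative only and the function name is hypothetical.

```python
from typing import List


def linear_cooling(t_start: float, t_end: float, n_steps: int) -> List[float]:
    """Teff values that change by a constant amount per iteration."""
    if n_steps < 2:
        raise ValueError("need at least two steps to define a ramp")
    step = (t_end - t_start) / (n_steps - 1)
    return [t_start + k * step for k in range(n_steps)]


if __name__ == "__main__":
    # Illustrative endpoints only (high and low Teff values used elsewhere in the text).
    print(linear_cooling(50.0, 5.0, 10))
```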

In summary, tunable stochastic behavior is demonstrated in the tin oxide/MoS2 heteromemristor, showing inherent exponential-class statistical characteristics. The device can sample exponential-class sigmoidal distributions resembling the Fermi–Dirac distribution in physical systems with tunable distribution parameters to emulate the “temperature” effects. Simulated annealing with control of the “cooling” strategies is demonstrated in the implemented Boltzmann machine for solving combinatorial optimization with respect to a MAX-SAT problem. These stochastic neurons based on tin oxide/MoS2 heteromemristors with reconfigurable statistical behavior pave the way for implementing selected “cooling” strategies in BM to reach optimal convergence efficiency and can find broad applications in energy-efficient computing for learning, clustering, and classification.

## Methods

### Device fabrication

A thin MoS2 layer is first deposited on a Si wafer with a 285-nm thermally grown SiO2 layer on top. The sample is then treated in an Ar/H2-mixed gas environment at 350 °C to clean the MoS2 surface. Subsequently, a thin tin oxide layer oxidized from SnSe is deposited on MoS2 and serves as filament-switching layer. Electron beam lithography is then used to transfer the patterns followed by the evaporation of a 10-nm/40-nm Cr/Au metal stack, which forms the top electrode.

### STEM and EDX

An FEI Titan Themis G2 system equipped with four detectors and spherical-aberration correction was used to acquire the HR-STEM images. To observe the cross-section image, the sample was pretreated by depositing chromium and carbon capping layers and then thinned with a focused-ion beam (FIB, FEI Helios 450 S) at an acceleration voltage of 30 kV. The HR-STEM image was acquired at an acceleration voltage of 200 kV. EDX signals, collected with the EDX unit integrated within the STEM system, were used to identify the elemental composition of the cross section.

### Raman spectroscopy

A Renishaw inVia Qontor system equipped with a ×100 objective lens, a grating (1800 grooves mm−1), and a charge-coupled device camera was used to measure the Raman spectra. The excitation source was a solid-state laser with a wavelength of 532 nm. The spectral resolution is 1.2 cm−1 per pixel.

### BM construction

The implemented BM prototype contains 24 5-bit digital-to-analog converters (DACs). The digital pattern generation interface (DPGI) and training data acquisition interface (TDAI) are controlled by a Xilinx ML605 FPGA board that carries out information storage and computations. Together they form a feedback loop that adjusts both input and output patterns at each BM iteration. Depending on the input signals, the BM system adjusts the corresponding output training data accordingly. The BM prototype has six stochastic units, each containing peripheral circuitry and a tin oxide/MoS2 heteromemristor whose switching probability is approximately sigmoidal in the applied voltage. The peripheral circuitry consists of four DACs that convert digital voltage values and apply them to the heteromemristor, a dynamic comparator for generating the discrete-state readout, and output-level shifters.
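Since each bias is delivered through a 5-bit DAC, a target voltage has to be rounded to one of 32 levels. The sketch below shows that quantization under assumptions: the output range `v_min` to `v_max`, the uniform level spacing, and the function names are hypothetical, as the DAC reference voltages are not given here.

```python
def dac_code(v_target: float, v_min: float, v_max: float, bits: int = 5) -> int:
    """Nearest DAC code for a target voltage, clamped to the output range."""
    if v_max <= v_min:
        raise ValueError("v_max must exceed v_min")
    top_code = (1 << bits) - 1  # 31 for a 5-bit converter
    clamped = min(max(v_target, v_min), v_max)
    return round((clamped - v_min) / (v_max - v_min) * top_code)


def dac_voltage(code: int, v_min: float, v_max: float, bits: int = 5) -> float:
    """Voltage produced by a code, assuming uniformly spaced levels."""
    top_code = (1 << bits) - 1
    return v_min + code / top_code * (v_max - v_min)


if __name__ == "__main__":
    # Assumed 0 V to 6.2 V range, i.e. 0.2 V per code step.
    code = dac_code(4.0, 0.0, 6.2)
    print(code, round(dac_voltage(code, 0.0, 6.2), 3))
```

With 32 levels, the step size sets how finely VTE can be positioned along the sigmoidal transition, which matters most at low Teff where the transition is sharp.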