Machine learning-powered compact modeling of stochastic electronic devices using mixture density networks

The relentless pursuit of miniaturization and performance enhancement in electronic devices has led to a fundamental challenge in the field of circuit design and simulation-how to accurately account for the inherent stochastic nature of certain devices. While conventional deterministic models have served as indispensable tools for circuit designers, they fall short when it comes to capturing the subtle yet critical variability exhibited by many electronic components. In this paper, we present an innovative approach that transcends the limitations of traditional modeling techniques by harnessing the power of machine learning, specifically Mixture Density Networks (MDNs), to faithfully represent and simulate the stochastic behavior of electronic devices. We demonstrate our approach to model heater cryotrons, where the model is able to capture the stochastic switching dynamics observed in the experiment. Our model shows 0.82% mean absolute error for switching probability. This paper marks a significant step forward in the quest for accurate and versatile compact models, poised to drive innovation in the realm of electronic circuits.


Introduction
In the ever-evolving landscape of electronic devices, the pursuit of smaller, faster, and more efficient technology has led to groundbreaking innovations, unlocking new possibilities for various applications across industries 1 .However, as devices continue to shrink and push the boundaries of Moore's law 2 , the inherent stochastic nature of certain electronic components has become increasingly significant 3 .These stochastic electronic devices exhibit inherent variability in their behavior, posing a formidable challenge for conventional deterministic modeling approaches 4 .
In the pursuit of greater accuracy and ease of implementation in electronic circuit simulations, a significant leap has been taken by incorporating machine learning methodologies into compact modeling [5][6][7][8][9][10] .While various machine learning-based compact models have emerged, these models, thus far, have failed to address the implications of stochastic behavior in electronic devices.And while traditional physics-based compact models have been created with stochastic properties 11,12 , these models lack the benefits of machine learning-based methods such as decreased turnaround time and little required device knowledge.The dynamics of electronic devices, traditionally considered deterministic, are increasingly proving to be intrinsically stochastic at the nanoscale.Consequently, conventional compact models that ignore these inherent stochastic processes fail to capture the true behavior of the devices, leading to inaccuracies in circuit simulation.
This paper presents a new approach that addresses this gap in the realm of compact modeling.We introduce the use of Mixture Density Networks (MDNs) 13 to capture the stochastic nature of electronic devices accurately.In addition to this neural network approach, we introduce a novel sampling methodology that uses inverse transform sampling 14 to make our model more accurate to the stochastic nature of devices.By doing these, we create compact models that account for the inherent stochasticity of certain devices, allowing the development of new circuits that are both reliable and innovative.Our approach represents a paradigm shift in electronic device modeling, as it for the first time encompasses the full spectrum of electronic device behavior, from deterministic to stochastic.
To demonstrate the effectiveness and versatility of our approach, we focus on the modeling of heater cryotron 15 , a threeterminal device that shows gate-current controlled switching between superconducting and resistive states.Heater cryotrons, with their stochastic switching behavior, are a particularly compelling case study.Our methodology accurately captures the nuances of this stochasticity, providing a foundation for the development of novel circuits and the optimization of existing designs.
The contributions of this paper are threefold: LG] 10 Nov 2023 • Mixture Density Networks: By employing MDNs, we equip our modeling framework with the capacity to capture complex probability distributions, thereby enabling a more faithful representation of device variability.This empowers designers to explore device behavior beyond traditional deterministic bounds.
• Inverse Transform Sampling: By leveraging inverse transform sampling, we can use our MDN to its full potential.This approach allows the model to output smooth outputs in transient simulations while maintaining stochasticity.This in turn makes our model more realistic to device-to-device or cycle-to-cycle variation.
• Demonstration with Heater Cryotrons: To showcase the effectiveness of our approach, we focus on modeling heater cryotrons-a class of devices known for their stochastic switching behavior.Through this demonstration, we illustrate the practical applicability of our methodology in real-world electronic systems.
As we delve into the details of our approach, we will elucidate the intricacies of MDN-based modeling for stochastic electronic devices, providing a comprehensive understanding of the benefits it offers to circuit simulation and electronic design.

Results
The following section will describe our neural network architecture, sampling methodology, and the results of our approach using experimentally derived heater cryotron data.

Mixture Density Network
Mixture density networks provide the perfect approach for modeling stochastic devices.Mixture density networks work similarly to standard multilayer perceptions (MLP) 16 , but with three output layers connected to the last hidden layer instead of just one.Each output layer has the same number of neurons N, and we label the output layers as µ, σ , and α.The output of these layers is used as the parameters of a Gaussian mixture distribution probability density function (PDF) as defined in equation (1).
Using this approach allows the model to learn the unique output distribution for any given input.The PDF is a combination of N different Gaussian distributions, which are multiplied by a scaling factor α. In order to maintain a valid probability distribution, we need to ensure that the sum of α is always equal to 1.To do that, we apply a softmax function to the output of the α layer, as defined in equation (2).
Additionally, we need to ensure that the output of the σ layer needs to be positive and non-zero since they represent the standard deviation of the Gaussian distributions in the mixture density PDF.In order to do this, we use the exponential linear unit (ELU) activation function plus 1 plus ε, where ε is the smallest value before loss of precision in floating point calculations.The activation function can be seen in equation (3).
Finally, we allow µ to be any value, so there is no need to use an activation function to restrict its output.For the hidden layers, we chose to use the rectified linear unit activation function, which is defined in equation (4).
In order to train the model, we need an appropriate loss function to measure the performance of the model.We choose to use Gaussian Negative Log-Likelihood (GNLL), since it effectively captures the performance with respect to probability.The GNLL for a single data point x is the negative logarithm of this PDF: The goal during training is to minimize the GNLL across all training data points.Therefore, the overall GNLL loss is the average GNLL over all data points: During training, we adjust the network's parameters to minimize this GNLL loss.We can use backpropagation and the AdamW 17 optimizer to update the network's weights and biases by finding the gradients of GNLL with respect to µ, σ , and α as shown in figure 1.

Model Sampling Methodology
In order to properly utilize our model, it is critical that we use an appropriate sampling methodology for the output distributions.The obvious approach for this would be to randomly sample from the distribution, however, this approach presents several issues.In order to sample in this way, we can use the approach outlined in Algorithm 1, where we randomly select a distribution k for the mixture density distribution and sample from a standard normal distribution N (µ k , σ k ).This approach allows us to utilize uniform and standard Gaussian sampling, which is built into most modeling languages including Verilog-A.

Algorithm 1 Standard Sampling from a Gaussian Mixture Density Distribution
Sample x from the mixture distribution Sample Function: x -Sample from the Gaussian mixture density The issue with this approach is that it results in sporadic currents in transient simulations since the current will jump around the distribution every time step.In addition, this approach will cause multi-state devices to switch sporadically between states near the critical value if the switching of the device is stochastic.While this approach may be valuable for some devices, for most devices we will need another sampling methodology that more accurately reflects the variation we see in devices.We propose using inverse transform sampling 14 to accomplish this.Inverse transform sampling works by first obtaining the cumulative distribution function (CDF) of the desired distribution and then finding its inverse.A cumulative distribution function (CDF) is a function that gives the probability that a random variable is less than or equal to a specific value.In the context of inverse transform sampling, the input (a uniform random number between 0 and 1) represents a quantile of the distribution because it corresponds to a specific probability level, and the inverse CDF maps it to the corresponding value in the distribution.To find the CDF, we can start with the probability density function in equation (7).
Our CDF will be equal to the integral from −∞ to x as shown in equation (8).
While this is the CDF, it would be beneficial for us to manipulate this into a more calculable form.By bringing the integral inside the summation and leaving the scaling factor α outside the integral, we are left with equation 9.
This leaves us with the weighted sum of the CDF of a standard normal distribution, which can be defined in terms of the error function as shown in equation (10).Defining this in terms of the error function makes calculation much easier since the error function is well-defined and easy to calculate numerically.The error function is defined in equation (11).
The issue is that there is no closed-form solution to the inverse of this CDF.As such, we use a numerical root finder to find what value of x leads to F(x) − q = 0. We choose to use Brent's method 18 , however, most numerical root finders should work here.For derivative-based root finders, the derivative of the CDF is equal to the PDF.
Now that we have established a way to perform inverse transform sampling, we can take full advantage of its properties to improve our model.To solve our issue with transient simulations, we can keep the same value of q for the entirety of a sweep.This will result in a continuous output that is more in line with the device's properties of cycle-to-cycle variation than traditional sampling.In this case, we can generate a new q every time the derivative of an input with respect to time crosses 0, which can easily be done in Verilog-A.Another benefit of this sampling approach is the ability to clip the probability distribution.For example, we could generate q only between 0.05 and 0.95 so that the model doesn't predict values more than 2 standard deviations away (or in any range).Another use could be modeling device-to-device variation by weighting the distribution when sampling for q, though this work doesn't explore this option.

Device Characteristics of Heater Cryotron
The heater cryotron is a superconducting device designed to exploit the unique properties of superconducting nanowires 15 .This device consists of two superconducting nanowires separated by a dielectric, forming the gate and the channel of the device.Figure 2 (a) shows a false-colored scanning electron microscope (SEM) image of the fabricated heater cryotron device, with a channel width of 1µm and a gate width 450nm.The gate and channel nanowires are separated by a SiO 2 dielectric spacer of 25nm thickness.Initially, for a given channel current (I B ), both nanowires (gate and channel), when maintained at cryogenic temperatures, remain superconducting exhibiting zero resistance, and remain so until the gate nanowire becomes resistive (Figure 2 (b)).However, when a sufficient amount of current (I G ) is passed through the gate nanowire to make it resistive, it starts generating thermal phonons which transport to the channel nanowire with the help of the spacer.These thermal phonons suppress the superconductivity and reduce the critical current of the channel nanowire (I C Ch ).Eventually, when the increase of gate current, reduces the channel critical current below the applied channel current, the channel transitions from its superconducting state to a resistive state as shown in Figure 2 (c).Conversely, removing the current from the heater enables the channel nanowire to revert to its superconducting state.The specific current levels at which superconducting to resistive and resistive to superconducting transitions occur are referred to as the critical current and retrapping current, respectively.The gate current-controlled switching of heater cryotrons has been used in a multitude of applications, including designing logic circuits 15,[19][20][21] , memory systems [22][23][24][25][26] , and neuromorphic systems [27][28][29][30][31] for cryogenic environment, and in interfacing superconducting circuits with semiconducting technology 32 .
Notably, the point at which the device switches between its superconducting and resistive states is characterized by stochastic behavior, which introduces an inherent element of randomness into its operation.This stochasticity makes this device ideal for testing our MDN-based compact modeling approach.The critical current value is not a fixed constant but instead depends on the applied channel current.For example, a higher bias current leads to a lower critical current for the gate, thus influencing the switching behavior of the device.Conversely, a higher bias current leads to a lower retrapping current.Additionally, a much lower bias current is required to switch from a resistive to a superconducting state since the nanowire produces its own heat when resistive among other factors.Understanding this dependence on the channel current is pivotal in characterizing and modeling the behavior of the heater cryotron.
To sample the characteristics of the device, we sweep the gate at different bias currents.By performing each of the sweeps multiple times and recording the resulting current-voltage characteristics of the device, we can have enough data for the MDN to learn the stochastic switching behavior of the device.This experimental approach allows us to explore the device's response under varying conditions and analyze the interplay between the gate and channel currents.

Heater Cryotron Model
For our first iteration of the model, we chose to use our MDN to model the critical gate current for a given bias current.This approach allows the use of a simpler neural network structure for our MDN when compared to learning the I-V characteristics directly.To accomplish this, we use bias current and current state of the device as input with the MDN trained to predict gate critical current (if the heater cryotron is in its superconducting state) and retrapping current if the heater cryotron is in its resistive state.The results of this model can be seen in Figure 2 (d) and (e), where we show the experimental data as well as our model's output distribution at the mean (µ), and with a variation of twice the standard deviation (σ ) in either side (µ − 2σ and µ + 2σ ) of the mean.We can see that the model is able to capture the critical and recapture current at different bias currents as well as the variance in these switching points.Another benefit to this approach is that in addition to reduced model complexity, we don't have to run the model every time step as long as the bias current is constant, making the model even more efficient in certain applications.
While the switching model could work well for some devices, it comes with some major drawbacks.Most notably, it requires separate models for the I-V characteristics of the device in its different states in addition to the switching model.This means that this approach is unable to capture the variance in the device's I-V characteristics outside of switching.Going forward, we are going to use the MDN to directly learn the I-V characteristics of the heater cryotron.This means that we will  One challenge with MDNs is that there is no good way to provide easily interpretable numerical metrics for performance.Log-likelihood can be used to compare different models, but it is not a useful metric on its own.Since there are not any existing approaches to compare against, this metric will not be useful for us.Because of this, we will need to rely on graphical comparisons between the model and the experimental data.First, we can look at a comparison between the distribution from the model vs a histogram of the experimental data for different values of I G and I B .This is shown in Figure 3, where we are using I B of 23.5µA and values of I G from 1.35µA to 1.6µA.Using this range of I G allows us to see how the switching probability distributions change around the critical current.Figure 3 shows that our probability distributions closely match the distributions obtained from the experiment.
Next, we can compare the switching probability of the model.To do this, we use a sweep of I G from 0µA to 3µA for I B values of 14µA, 16.5, 23.5µA, 28µA, and 33µA.To evaluate the switching probability of the model, we can use the CDF evaluated at the voltage midway between the load voltage when the H-Tron is in superconducting state and when it is in resistive state.Figure 4 (left) shows the results of switching probability observed in the experimental data compared to the MDN model.Framing the model in this way allows us to use traditional regression performance metrics, which can be seen in Table 1.On average our model is off by 0.82% for switching probability with an R 2 of 0.9891.Here, the R 2 value represents the percentage of variance in the switching probability explained by the model.
Finally, we can evaluate the performance of the model in transient simulation.So that we can match the output of the model with experimental results, we mirror I G and I B from the experimental data as shown in Figure 6 (a).We use the inverse transform sampling technique as described in section II to achieve the smooth, continuous curves as seen in Figure 4 (right).To show the distribution of the model in the transient simulation, we report µ, µ − 2σ , and µ + 2σ conditions.We can do this by setting the quantile value (q) to 0.5 for the mean and 0.05 and 0.95 for the inverse transform sampling.Figure 4 (right) shows that our MDN model accurately captures the variance in the I-V characteristics of the devices, including the switching dynamics.

Discussion
Our approach utilizing mixture density networks has demonstrated its remarkable ability to accurately model the variance of I-V characteristics and stochastic switching behavior of heater cryotrons.With a 0.82% mean absolute error in the switching probability and an R 2 value of 0.9891, our method has proven its efficacy in capturing the true variance of device behavior.While we've only shown the model using heater cryotrons, the generalizing power of neural networks means this approach will be easily adaptable to other devices.This achievement marks a significant stride towards more precise and reliable electronic circuit simulations, offering unprecedented opportunities for the development of cutting-edge technologies in an era where stochasticity plays an increasingly pivotal role.

Heater Cryotron Fabrication
The fabrication process began with the deposition of a 4 nm layer of superconducting tungsten silicide (WSi) onto a silicon wafer through magnetron sputtering.Following this, contact pads consisting of 5 nm Ti and 30 nm Au were applied using optical lithography and a liftoff procedure.Subsequently, electron beam lithography was employed to define and etch the WSi layer in an RIE process with SF6 chemistry to create the device's channel.A 25 nm layer of SiO 2 was then sputter-deposited across the entire wafer to serve as the dielectric spacer separating the heater from the channel.Another 4 nm layer of WSi was subsequently sputter-deposited, patterned, and etched to form the heater input gate, with contact pads added to this upper layer through the same liftoff process as the initial layer.Finally, openings were etched through the SiO 2 dielectric layer using an RIE process with CHF 3 :O 2 chemistry to establish electrical contact with the underlying channel layer.

Heater Cryotron Characterization
The measurements in this study employed an arbitrary waveform generator (AWG) equipped with two channels and incorporated 10 kΩ resistors in series with each channel.In the data acquisition process, one channel of the AWG was configured to maintain a constant voltage, thereby delivering a fixed bias current, while the other channel was programmed to incrementally ramp up

Figure 1 .
Figure 1.Overview of the machine learning-powered modeling approach for the stochastic behavior of heater cryotron.After training the mixture density network with the device characteristics obtained from experiments, the weights are extracted which are eventually used to create the Verilog-A-based compact model for circuit/system-level simulations.

Figure 2 .
Figure 2. Prediction of Switching characteristics of heater cryotron.(a) A false-colored scanning electron microscope (SEM) image of heater cryotron device consisting two superconducting nanowires of tungsten silicide (W Si) and a dielectric spacer of SiO 2 .(b)-(c) Illustration of the gate-current-controlled switching mechanism.An enough gate current (I G ) switches the gate nanowire to its resistive state and generates thermal phonos which eventually causes the switching of the channel from its superconducting to resistive state.(d) Switching model for critical current evaluated at the mean (µ) and plus or minus 2 standard deviations (σ ) of the resultant distribution and (e) retrapping current of heater cryotron switching model evaluated at the same points.

Figure 4 .
Figure 4. Switching probabilities at varying bias currents and transient simulation results.The predicted vs experimental switching probability for gate current (I G ) between 0 µA and 3 µA at a bias current (I B ) of (a) 14 µA, (b) 16.5 µA, (c) 23.5 µA, (d) 28 µA, and (e) 33 µA.The increase in I B leads to a reduction in the gate switching current.Transient dynamics of (f) bias current and (g) gate current which cause switching of the channel after exceeding the corresponding thresholds.(h) Switching is shown and validated with the experiment by the time dynamics of the load voltage (V L ) at mean (µ) and plus or minus 2 standard deviations (σ ).(i) A zoomed-in view of V L shown in (h).

Table 1 .
Model Performance of Switching Probability , I B , and state as input to the model, and directly predicting V L as the output.