Introduction

In the ever-evolving landscape of electronic devices, the pursuit of smaller, faster, and more efficient technology has led to groundbreaking innovations, unlocking new possibilities for various applications across industries1. However, as devices continue to shrink and push the boundaries of Moore’s law2, the inherent stochastic nature of certain electronic components has become increasingly significant3. These stochastic electronic devices exhibit inherent variability in their behavior, posing a formidable challenge for conventional deterministic modeling approaches4.

In the pursuit of greater accuracy and ease of implementation in electronic circuit simulations, a significant leap has been taken by incorporating machine learning methodologies into compact modeling5,6,7,8,9,10. While various machine learning-based compact models have emerged, these models have thus far failed to address the implications of stochastic behavior in electronic devices. Although traditional physics-based compact models with stochastic properties have been created11,12, they lack the benefits of machine learning-based methods, such as reduced turnaround time and minimal required device knowledge. The dynamics of electronic devices, traditionally considered deterministic, are increasingly proving to be intrinsically stochastic at the nanoscale. Consequently, conventional compact models that ignore these inherent stochastic processes fail to capture the true behavior of the devices, leading to inaccuracies in circuit simulation.

This paper presents a new approach that addresses this gap in the realm of compact modeling. We introduce the use of Mixture Density Networks (MDNs)13 to accurately capture the stochastic nature of electronic devices. In addition to this neural network approach, we introduce a novel sampling methodology based on inverse transform sampling14 that more faithfully reproduces the stochastic behavior of real devices. Together, these techniques yield compact models that account for the inherent stochasticity of certain devices, enabling the development of circuits that are both reliable and innovative. Our approach represents a paradigm shift in electronic device modeling, as it encompasses, for the first time, the full spectrum of electronic device behavior, from deterministic to stochastic.

To demonstrate the effectiveness and versatility of our approach, we focus on modeling the heater cryotron15, a three-terminal device that shows gate-current-controlled switching between superconducting and resistive states. Heater cryotrons, with their stochastic switching behavior, are a particularly compelling case study. Our methodology accurately captures the nuances of this stochasticity, providing a foundation for the development of novel circuits and the optimization of existing designs. It is worth noting that, beyond digital devices, the flexibility of MDNs allows this approach to be used in a variety of applications where stochastic device models could be beneficial, such as analog or mixed-signal electronics.

The contributions of this paper are threefold:

  • Mixture Density Networks: By employing MDNs, we equip our modeling framework with the capacity to capture complex probability distributions, thereby enabling a more faithful representation of device variability. This empowers designers to explore device behavior beyond traditional deterministic bounds.

  • Inverse Transform Sampling: By leveraging inverse transform sampling, we can use our MDN to its full potential. This approach allows the model to provide smooth outputs in transient simulations while maintaining stochasticity, making it more representative of device-to-device and cycle-to-cycle variations.

  • Demonstration with Heater Cryotrons: To showcase the effectiveness of our approach, we focus on modeling heater cryotrons, a class of devices known for their stochastic switching behavior. Through this demonstration, we illustrate the practical applicability of our methodology in real-world electronic systems.

As we delve into the details of our approach, we will elucidate the intricacies of MDN-based modeling for stochastic electronic devices, providing a comprehensive understanding of the benefits it offers to circuit simulation and electronic design.

Results

The following section will describe our neural network architecture, sampling methodology, and the results of our approach using experimentally derived heater cryotron data.

Mixture density network

Mixture density networks are well suited to modeling stochastic devices. They work similarly to standard multilayer perceptrons (MLPs)16, but with three output layers connected to the last hidden layer instead of one. Each output layer has the same number of neurons N, and we label the output layers \(\mu\), \(\sigma\), and \(\alpha\). In our model, we use 2 hidden layers of 512 neurons each with the ReLU activation function, and we set \(N=20\). The outputs of these layers serve as the parameters of a Gaussian mixture probability density function (PDF), as defined in Eq. (1).

$$\begin{aligned} p(x) = \sum _{k=1}^{N} \alpha _k \cdot \frac{1}{\sqrt{2\pi \sigma _k^2}} \cdot \exp \left( -\frac{(x - \mu _k)^2}{2\sigma _k^2}\right). \end{aligned}$$
(1)

Using this approach allows the model to learn the unique output distribution for any given input. The PDF is a combination of N Gaussian distributions, each weighted by a scaling factor \(\alpha _k\). To maintain a valid probability distribution, the scaling factors must sum to 1. To enforce this, we apply a softmax function to the output of the \(\alpha\) layer, as defined in Eq. (2).

$$\begin{aligned} s(\alpha _i) = \frac{e^{\alpha _i}}{\sum _{j=1}^N e^{\alpha _j}}. \end{aligned}$$
(2)

Additionally, the outputs of the \(\sigma\) layer must be positive and non-zero, since they represent the standard deviations of the Gaussian distributions in the mixture density PDF. To enforce this, we use the exponential linear unit (ELU) activation function plus \(1 + \epsilon\), where \(\epsilon\) is the smallest value before loss of precision in floating-point calculations. The activation function is given in Eq. (3).

$$\begin{aligned} ELU(\sigma _i) = \left\{ \begin{array}{lr} \sigma _i + 1 + \epsilon , &{} \text {if } \sigma _i \ge 0\\ 0.5(e^{\sigma _i} - 1) + 1 + \epsilon , &{} \text {if } \sigma _i < 0 \end{array} \right\}. \end{aligned}$$
(3)

Finally, we allow \(\mu\) to be any value, so there is no need to use an activation function to restrict its output. For the hidden layers, we choose to use the rectified linear unit activation function, which is defined in Eq. (4).

$$\begin{aligned} ReLU(x) = \left\{ \begin{array}{lr} x, &{} \text {if } x \ge 0\\ 0, &{} \text {if } x < 0 \end{array} \right\} \end{aligned}$$
(4)
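To make the architecture concrete, below is a minimal TensorFlow/Keras sketch of the network described above: two hidden layers of 512 ReLU neurons feeding three parallel output heads of \(N = 20\) neurons each. The input width, the layer names, and the use of Keras's default \(\epsilon\) are our assumptions, not details taken from the original implementation.

```python
import tensorflow as tf

N = 20  # number of mixture components

def elu_plus_one_plus_epsilon(x):
    # ELU (with the 0.5 coefficient of Eq. (3)) shifted by 1 + epsilon so that
    # sigma is always positive and non-zero; epsilon here is Keras's default (~1e-7).
    return tf.keras.activations.elu(x, alpha=0.5) + 1.0 + tf.keras.backend.epsilon()

inputs = tf.keras.Input(shape=(2,))  # e.g. (I_B, state); the input width is an assumption
h = tf.keras.layers.Dense(512, activation="relu")(inputs)
h = tf.keras.layers.Dense(512, activation="relu")(h)

mu = tf.keras.layers.Dense(N, name="mu")(h)  # unconstrained component means
sigma = tf.keras.layers.Dense(N, activation=elu_plus_one_plus_epsilon,
                              name="sigma")(h)  # positive standard deviations (Eq. (3))
alpha = tf.keras.layers.Dense(N, activation="softmax", name="alpha")(h)  # weights sum to 1 (Eq. (2))

mdn = tf.keras.Model(inputs, [mu, sigma, alpha])
```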

In order to train the model, we need an appropriate loss function to measure its performance. We choose the Gaussian negative log-likelihood (GNLL), since it directly measures how much probability the predicted distribution assigns to the observed data. The GNLL for a single data point x is the negative logarithm of the PDF in Eq. (1):

$$\begin{aligned} \text {GNLL}(x) = -\log \left( \sum _{k=1}^{N} \alpha _k \cdot \frac{1}{\sqrt{2\pi \sigma _k^2}} \cdot \exp \left( -\frac{(x - \mu _k)^2}{2\sigma _k^2}\right) \right) \end{aligned}$$
(5)

The goal during training is to minimize the GNLL across all training data points. Therefore, the overall GNLL loss is the average GNLL over all M training points, where M is the number of data points (and N remains the number of mixture components):

$$\begin{aligned} L = -\frac{1}{M} \sum _{i=1}^{M} \log \left( \sum _{k=1}^{N} \alpha _k \cdot \frac{1}{\sqrt{2\pi \sigma _k^2}} \cdot \exp \left( -\frac{(x_i - \mu _k)^2}{2\sigma _k^2}\right) \right) \end{aligned}$$
(6)
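A compact TensorFlow sketch of the loss in Eq. (6) is shown below. Evaluating the mixture in log space with logsumexp is our addition for numerical stability; the text does not specify this implementation detail.

```python
import math
import tensorflow as tf

def gnll_loss(x, mu, sigma, alpha):
    # x: targets of shape (batch, 1); mu, sigma, alpha: shape (batch, N).
    # Log-density of each Gaussian component evaluated at x.
    log_comp = (-0.5 * tf.math.log(2.0 * math.pi * sigma ** 2)
                - (x - mu) ** 2 / (2.0 * sigma ** 2))
    # log sum_k alpha_k * N(x | mu_k, sigma_k), computed stably with logsumexp.
    log_mix = tf.reduce_logsumexp(tf.math.log(alpha) + log_comp, axis=-1)
    # Average negative log-likelihood over the batch (Eq. (6)).
    return -tf.reduce_mean(log_mix)
```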

During the training process, the network's parameters are adjusted to minimize the GNLL loss. To achieve this, we employ the backpropagation algorithm coupled with the AdamW optimizer introduced by Loshchilov and Hutter17. This optimization technique iteratively refines the network's weights and biases. The gradients of the GNLL with respect to the key parameters \(\mu\), \(\sigma\), and \(\alpha\) are computed within the framework depicted in Fig. 1.

Figure 1

Overview of the machine learning-powered modeling approach for the stochastic behavior of the heater cryotron. After training the mixture density network with the device characteristics obtained from experiments, the weights are extracted and used to create the Verilog-A-based compact model for circuit/system-level simulations.

Model sampling methodology

Algorithm 1

Standard sampling from a Gaussian mixture density distribution.

In order to properly utilize our model, it is critical that we use an appropriate sampling methodology for the output distributions. The obvious approach would be to randomly sample from the distribution; however, this presents several issues. To sample this way, we can use the approach outlined in Algorithm 1: randomly select a component k of the mixture density distribution, then sample from the normal distribution \(\mathcal {N}(\mu _k, \sigma _k)\). This approach only requires uniform and standard Gaussian sampling, which is built into most modeling languages, including Verilog-A. The issue is that it produces sporadic currents in transient simulations, since the sampled current jumps around the distribution at every time step. In addition, it causes multi-state devices whose switching is stochastic to toggle sporadically between states near the critical value. While this approach may be valuable for some devices, most devices require another sampling methodology that more accurately reflects the variation observed in practice. We propose using inverse transform sampling14 to accomplish this.
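For reference, a minimal numpy sketch of the standard sampling procedure of Algorithm 1:

```python
import numpy as np

def sample_mixture(mu, sigma, alpha, rng):
    # Algorithm 1: pick component k with probability alpha_k, then draw
    # one sample from the corresponding Gaussian N(mu_k, sigma_k).
    k = rng.choice(len(alpha), p=alpha)
    return rng.normal(mu[k], sigma[k])

rng = np.random.default_rng(0)
mu, sigma, alpha = np.array([0.0, 1.0]), np.array([0.1, 0.2]), np.array([0.3, 0.7])
print(sample_mixture(mu, sigma, alpha, rng))  # a new independent value on every call
```

Called independently at every time step, this is exactly what produces the sporadic jumps noted above.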

Inverse transform sampling involves obtaining the cumulative distribution function (CDF) of the target distribution and finding its inverse. The CDF represents the probability that a random variable is less than or equal to a specific value. In this method, a uniform random number between 0 and 1 serves as the input, representing a quantile of the distribution. The inverse CDF maps this input to the corresponding value in the distribution.

As the distribution evolves over time, the CDF is updated to reflect these changes. Inverse transform sampling allows for the generation of samples that consistently align with a predetermined quantile (q) for each sweep of the device. This ensures that the model accurately captures the device’s behavior at specific points in the distribution, providing a constant sampling point across different timesteps of the same sweep. This approach allows the model to provide a smooth curve while still demonstrating the stochasticity that we are trying to model and offering a coherent representation of the device’s performance under dynamic conditions. To initiate this process, we can begin by examining the probability density function in Eq. (7).

$$\begin{aligned} p(x) = \sum _{k=1}^{N} \alpha _k \cdot \frac{1}{\sqrt{2\pi \sigma _k^2}} \cdot \exp \left( -\frac{(x - \mu _k)^2}{2\sigma _k^2}\right) \end{aligned}$$
(7)

The CDF is the integral of this PDF from \(-\infty\) to x, as shown in Eq. (8).

$$\begin{aligned} F(x) = \int _{- \infty }^{x} \sum _{k=1}^{N} \alpha _k \cdot \frac{1}{\sqrt{2\pi \sigma _k^2}} \cdot \exp \left( -\frac{(t - \mu _k)^2}{2\sigma _k^2}\right) dt \end{aligned}$$
(8)

While this is the CDF, it is beneficial to manipulate it into a more easily computable form. Moving the integral inside the summation and pulling the scaling factor \(\alpha _k\) outside the integral leaves Eq. (9).

$$\begin{aligned} F(x) = \sum _{k=1}^{N} \alpha _k \int _{- \infty }^{x} \frac{1}{\sqrt{2\pi \sigma _k^2}} \cdot \exp \left( -\frac{(t - \mu _k)^2}{2\sigma _k^2}\right) dt \end{aligned}$$
(9)

This leaves us with a weighted sum of normal-distribution CDFs, each of which can be expressed in terms of the error function, as shown in Eq. (10).

$$\begin{aligned} F(x) = \sum _{k=1}^{N} \frac{\alpha _k}{2}\left[ 1+erf\left( \frac{x-\mu _k}{\sigma _k \sqrt{2}}\right) \right] \end{aligned}$$
(10)

Expressing the CDF in terms of the error function makes computation much easier, since the error function is well defined and easy to evaluate numerically. The error function is defined in Eq. (11).

$$\begin{aligned} erf(x)=\frac{2}{\sqrt{\pi }}\int _{0}^{x}e^{-t^{2}}\, dt \end{aligned}$$
(11)

The issue is that this CDF has no closed-form inverse. As such, we use a numerical root finder to find the value of x for which \(F(x)- q=0\), where q is the quantile we sample at. We choose Brent's method18; however, most numerical root finders should work here. For derivative-based root finders, the derivative of the CDF is simply the PDF.
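A minimal SciPy sketch of this inversion is shown below; the bracketing interval of \(\pm 10\sigma\) around the extreme component means is our assumption, and any interval guaranteed to contain the root will work.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import erf

def mixture_cdf(x, mu, sigma, alpha):
    # Eq. (10): weighted sum of component normal CDFs via the error function.
    return np.sum(alpha / 2.0 * (1.0 + erf((x - mu) / (sigma * np.sqrt(2.0)))))

def inverse_cdf(q, mu, sigma, alpha):
    # Solve F(x) - q = 0 with Brent's method. The bracket is an assumption,
    # chosen wide enough to contain the q-quantile for any mixture.
    lo = float(np.min(mu - 10.0 * sigma))
    hi = float(np.max(mu + 10.0 * sigma))
    return brentq(lambda x: mixture_cdf(x, mu, sigma, alpha) - q, lo, hi)
```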

Now that we have established a way to perform inverse transform sampling, we can take full advantage of its properties to improve our model. To solve our issue with transient simulations, we keep the same value of q for the entirety of a sweep. This results in a continuous output that is more consistent with the device's cycle-to-cycle variation than traditional sampling. We generate a new q every time the derivative of an input with respect to time crosses 0, which is easily done in Verilog-A; a sketch of this scheme follows. Another benefit of this sampling approach is the ability to clip the probability distribution: for example, we could generate q only between 0.05 and 0.95 so that the model never predicts values in the extreme tails of the distribution (or restrict q to any other range). Another use could be modeling device-to-device variation by weighting the distribution when sampling q, though this work does not explore that option.
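The sketch below illustrates this scheme in Python, reusing inverse_cdf from the previous sketch: q is held fixed while the input's derivative keeps its sign and is redrawn, clipped to a chosen range, whenever the derivative crosses zero. The function names and the clipping range are illustrative placeholders for the actual Verilog-A implementation.

```python
import numpy as np

def transient_sample(i_g, predict_params, rng, q_lo=0.05, q_hi=0.95):
    # i_g: gate-current waveform, one value per time step.
    # predict_params: callable returning (mu, sigma, alpha) for an input value;
    # it stands in for the trained MDN and is a placeholder.
    q = rng.uniform(q_lo, q_hi)          # one quantile for the whole sweep
    d_prev, out = 0.0, []
    for n, i in enumerate(i_g):
        d = i - i_g[n - 1] if n > 0 else 0.0
        if d * d_prev < 0:               # dI_G/dt changed sign: a new sweep begins
            q = rng.uniform(q_lo, q_hi)
        if d != 0.0:
            d_prev = d
        mu, sigma, alpha = predict_params(i)
        out.append(inverse_cdf(q, mu, sigma, alpha))  # from the previous sketch
    return np.array(out)
```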

Device characteristics of heater cryotron

Figure 2

Prediction of switching characteristics of the heater cryotron. (a) A false-colored scanning electron microscope (SEM) image of the heater cryotron device, consisting of two superconducting nanowires of tungsten silicide (WSi) and a dielectric spacer of SiO\(_2\). (b,c) Illustration of the gate-current-controlled switching mechanism. A sufficiently large gate current (\(I_G\)) switches the gate nanowire to its resistive state and generates thermal phonons, which eventually cause the switching of the channel from its superconducting to its resistive state. (d) Critical current of the heater cryotron switching model evaluated at the mean (\(\mu\)) and at plus or minus 2 standard deviations (\(\sigma\)) of the resultant distribution, and (e) retrapping current evaluated at the same points.

The heater cryotron is a superconducting device designed to exploit the unique properties of superconducting nanowires15. This device consists of two superconducting nanowires separated by a dielectric, forming the gate and the channel of the device. Figure 2a shows a false-colored scanning electron microscope (SEM) image of the fabricated heater cryotron device, with a channel width of 1 \(\mu m\) and a gate width of 450 nm. The gate and channel nanowires are separated by a SiO\(_2\) dielectric spacer of 25 nm thickness. Initially, for a given channel current (\(I_B\)), both nanowires (gate and channel), when maintained at cryogenic temperatures, remain superconducting, exhibiting zero resistance, until the gate nanowire becomes resistive (Fig. 2b). When a sufficient amount of current (\(I_G\)) is passed through the gate nanowire to make it resistive, it starts generating thermal phonons, which are transported to the channel nanowire through the spacer. These thermal phonons suppress superconductivity and reduce the critical current of the channel nanowire (\(I_{Ch}^C\)). Eventually, when the increasing gate current reduces the channel critical current below the applied channel current, the channel transitions from its superconducting state to a resistive state, as shown in Fig. 2c. Conversely, removing the current from the heater enables the channel nanowire to revert to its superconducting state. The current levels at which the superconducting-to-resistive and resistive-to-superconducting transitions occur are referred to as the critical current and retrapping current, respectively. The gate-current-controlled switching of heater cryotrons has been used in a multitude of applications, including logic circuits15,19,20,21, memory systems22,23,24,25,26, and neuromorphic systems27,28,29,30,31 for cryogenic environments, and in interfacing superconducting circuits with semiconductor technology32.

Notably, the point at which the device switches between its superconducting and resistive states is characterized by stochastic behavior, which introduces an inherent element of randomness into its operation. This stochasticity makes the device ideal for testing our MDN-based compact modeling approach. The critical current value is not a fixed constant but instead depends on the applied channel current. For example, a higher bias current leads to a lower critical current for the gate, thus influencing the switching behavior of the device. Similarly, a higher bias current leads to a lower retrapping current. Additionally, a much lower bias current is required to switch from the resistive back to the superconducting state since, among other factors, the nanowire produces its own heat when resistive. Understanding this dependence on the channel current is pivotal in characterizing and modeling the behavior of the heater cryotron.

To sample the characteristics of the device, we sweep the gate current at different bias currents. By performing each sweep multiple times and recording the resulting current-voltage characteristics of the device, we obtain enough data for the MDN to learn the stochastic switching behavior. This experimental approach allows us to explore the device's response under varying conditions and analyze the interplay between the gate and channel currents.

Heater cryotron model

For our first iteration of the model, we choose to use our MDN to model the critical gate current for a given bias current. This approach allows a simpler neural network structure than learning the I-V characteristics directly. To accomplish this, we use the bias current and the current state of the device as inputs, with the MDN trained to predict the gate critical current if the heater cryotron is in its superconducting state and the retrapping current if it is in its resistive state. The results of this model can be seen in Fig. 2d and e, where we show the experimental data as well as our model's output distribution at the mean (\(\mu\)) and at twice the standard deviation (\(\sigma\)) on either side of the mean (\(\mu -2\sigma\) and \(\mu +2\sigma\)). We can see that the model captures the critical and retrapping currents at different bias currents as well as the variance in these switching points. Another benefit of this approach is that, in addition to reduced model complexity, we do not have to run the model at every time step as long as the bias current is constant, making it even more efficient in certain applications.

Figure 3

Model validation for load voltage (\(V_L\)) against the experiment. The predicted vs experimental distributions of \(V_L\) (dropped across a 1 \(k\Omega\) load resistor) at a bias current of 23.5 \(\mu A\) and gate current of (a) 1.35 \(\mu A\), (b) 1.4 \(\mu A\), (c) 1.45 \(\mu A\), (d) 1.5 \(\mu A\), (e) 1.55 \(\mu A\), and (f) 1.6 \(\mu A\).

While the switching model could work well for some devices, it comes with some major drawbacks. Most notably, it requires separate models for the I-V characteristics of the device in its different states in addition to the switching model. This means the approach cannot capture the variance in the device's I-V characteristics outside of switching. Going forward, we use the MDN to directly learn the I-V characteristics of the heater cryotron: \(I_G\), \(I_B\), and state serve as inputs to the model, which directly predicts the voltage drop across the load resistor (\(V_L\)) as the output.

One challenge with MDNs is that there is no good way to provide easily interpretable numerical metrics for performance. Log-likelihood can be used to compare different models, but it is not a useful metric on its own, and since there are no existing approaches to compare against, it is of little use here. Because of this, we rely on graphical comparisons between the model and the experimental data. First, we can compare the distribution from the model against a histogram of the experimental data for different values of \(I_G\) and \(I_B\). This is shown in Fig. 3, where we use an \(I_B\) of 23.5 \(\mu A\) and values of \(I_G\) from 1.35 \(\mu A\) to 1.6 \(\mu A\). This range of \(I_G\) lets us see how the switching probability distributions change around the critical current. Figure 3 shows that our probability distributions closely match those obtained from the experiment.
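For illustration, a minimal matplotlib sketch of such a comparison; the parameter values and "experimental" samples below are synthetic placeholders, not the measured data or trained parameters.

```python
import math
import numpy as np
import matplotlib.pyplot as plt

def mixture_pdf(x, mu, sigma, alpha):
    # Eq. (7) evaluated pointwise on a grid of candidate V_L values.
    x = x[:, None]
    comp = alpha / np.sqrt(2.0 * math.pi * sigma ** 2) * np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2))
    return comp.sum(axis=1)

# Synthetic placeholders standing in for one (I_G, I_B) condition.
rng = np.random.default_rng(0)
mu, sigma, alpha = np.array([0.0, 23.0]), np.array([0.3, 0.8]), np.array([0.4, 0.6])
v_exp = np.concatenate([rng.normal(0.0, 0.3, 400), rng.normal(23.0, 0.8, 600)])

v = np.linspace(v_exp.min() - 1.0, v_exp.max() + 1.0, 500)
plt.hist(v_exp, bins=60, density=True, alpha=0.5, label="experiment")
plt.plot(v, mixture_pdf(v, mu, sigma, alpha), label="MDN")
plt.xlabel("$V_L$"); plt.ylabel("density"); plt.legend(); plt.show()
```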

Next, we can compare the switching probability of the model. To do this, we use a sweep of \(I_G\) from 0 \(\mu A\) to 3 \(\mu A\) for \(I_B\) values of 14 \(\mu A\), 16.5 \(\mu A\), 23.5 \(\mu A\), 28 \(\mu A\), and 33 \(\mu A\). To evaluate the switching probability of the model, we can use the CDF evaluated at the voltage midway between the load voltages of the superconducting and resistive states. Figure 4 (left) shows the switching probability observed in the experimental data compared to the MDN model. Framing the model in this way allows us to use traditional regression performance metrics, which can be seen in Table 1. On the entire dataset, our model achieves a mean absolute error (MAE) of 0.82% for switching probability with an \(R^2\) of 0.9891. Here, the \(R^2\) value represents the fraction of variance in the switching probability explained by the model. We can compare this to a deterministic neural network such as a multilayer perceptron (MLP), which has been proposed for compact modeling by several previous works6,7,10. To do this, we train the MLP to predict the threshold gate current for a given bias current (\(I_B\)). Since this model is not stochastic, we consider the switching probability to be 0% for any value below the threshold current and 100% above it. Tested in this way, the MLP's MAE is 2.61%, a significant drop in performance compared to our probabilistic approach.
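In code, this amounts to evaluating the mixture CDF of Eq. (10) at the midpoint voltage; reading the switching probability as one minus the CDF there (the probability mass on the resistive side) is our interpretation of the procedure described above.

```python
import numpy as np
from scipy.special import erf

def switching_probability(mu, sigma, alpha, v_super, v_resistive):
    # Probability that the predicted V_L lies above the midpoint between the
    # superconducting and resistive load voltages, i.e. 1 - F(midpoint).
    v_mid = 0.5 * (v_super + v_resistive)
    cdf = np.sum(alpha / 2.0 * (1.0 + erf((v_mid - mu) / (sigma * np.sqrt(2.0)))))
    return 1.0 - cdf
```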

Figure 4

Switching probabilities at varying bias currents and transient simulation results. The predicted vs experimental switching probability for gate current (\(I_G\)) between 0 \(\mu A\) and 3 \(\mu A\) at a bias current (\(I_B\)) of (a) 14 \(\mu A\), (b) 16.5 \(\mu A\), (c) 23.5 \(\mu A\), (d) 28 \(\mu A\), and (e) 33 \(\mu A\). The increase in \(I_B\) leads to a reduction in the gate switching current. Transient dynamics of (f) the bias current and (g) the gate current, which cause switching of the channel after exceeding the corresponding thresholds. (h) Switching validated against the experiment via the time dynamics of the load voltage (\(V_L\)) at the mean (\(\mu\)) and at plus or minus 2 standard deviations (\(\sigma\)). (i) A zoomed-in view of \(V_L\) from (h).

Table 1 Switching-probability mean absolute error (MAE) and \(R^2\) for the MDN and MLP models.

Finally, we can evaluate the performance of the model in transient simulation. So that we can match the output of the model with experimental results, we mirror \(I_G\) and \(I_B\) from the experimental data as shown in Fig. 4 (right). We use the inverse transform sampling technique described in the "Model sampling methodology" section to achieve the smooth, continuous curves seen in Fig. 4 (right). To show the distribution of the model in the transient simulation, we report the \(\mu\), \(\mu -2\sigma\), and \(\mu +2\sigma\) conditions by setting the quantile value (q) for inverse transform sampling to 0.5, 0.05, and 0.95, respectively. Figure 4 (right) shows that our MDN model accurately captures the variance in the I-V characteristics of the device, including the switching dynamics.

Discussion

Our approach utilizing mixture density networks has demonstrated its ability to accurately model the variance of I-V characteristics and the stochastic switching behavior of heater cryotrons. With a 0.82% mean absolute error in the switching probability and an \(R^2\) value of 0.9891, our method has proven its efficacy in capturing the true variance of device behavior. While we have demonstrated the model only on heater cryotrons, the generalization power of neural networks suggests this approach can be readily adapted to other devices. This marks a significant stride towards more precise and reliable electronic circuit simulations, offering new opportunities for developing cutting-edge technologies in an era where stochasticity plays an increasingly pivotal role.

Methods

Heater cryotron fabrication

The fabrication process began with the deposition of a 4 nm layer of superconducting tungsten silicide (WSi) onto a silicon wafer through magnetron sputtering. Following this, contact pads consisting of 5 nm Ti and 30 nm Au were applied using optical lithography and a liftoff procedure. Subsequently, electron beam lithography was employed to define the device's channel, which was etched into the WSi layer in a reactive ion etching (RIE) process with SF\(_6\) chemistry. A 25 nm layer of SiO\(_2\) was then sputter-deposited across the entire wafer to serve as the dielectric spacer separating the heater from the channel. Another 4 nm layer of WSi was subsequently sputter-deposited, patterned, and etched to form the heater input gate, with contact pads added to this upper layer through the same liftoff process as the initial layer. Finally, openings were etched through the SiO\(_2\) dielectric layer using an RIE process with CHF\(_3\):O\(_2\) chemistry to establish electrical contact with the underlying channel layer.

Heater cryotron characterization

The measurements in this study employed an arbitrary waveform generator (AWG) equipped with two channels, with a 10 \(k \Omega\) resistor in series with each channel. During data acquisition, one channel of the AWG was configured to maintain a constant voltage, thereby delivering a fixed bias current, while the other channel was programmed to incrementally ramp up from zero current to 3 \(\mu A\). Simultaneously, the voltage across the device channel was recorded on an oscilloscope. The experimental conditions covered six distinct bias current levels, spanning from 14 \(\mu A\) to 33 \(\mu A\). Following the ramp-up of gate current, a subsequent phase involved the concurrent reduction of gate and bias currents to broaden the dataset. This entire procedure was repeated 1000 times for each bias current setting, ensuring the capture of the stochastic nature of the device.

Model creation

We use Python with TensorFlow to create the mixture density network and train it on the data described in the previous section. We implement a custom loss function to minimize the Gaussian negative log-likelihood, and a custom activation function for the exponential linear unit plus 1 plus \(\epsilon\) as described in Eq. (3). To integrate this model into circuit simulation, we translate the neural network architecture and the inverse transform sampling methodology into Verilog-A. The trained model weights are then extracted with a Python script and inserted into the Verilog-A model. The finalized Verilog-A model is compatible with circuit simulators that support Verilog-A, such as HSPICE or Spectre. This process is derived from Hutchins et al.10.
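As a rough sketch of the weight-extraction step, the snippet below dumps each layer of the trained Keras model (the mdn object from the architecture sketch above) as flat parameter arrays. The file name, layer naming, and Verilog-A parameter formatting are our assumptions and must match the structure of the hand-written Verilog-A model.

```python
import numpy as np

# 'mdn' is the trained Keras model from the architecture sketch above.
with open("mdn_weights.va", "w") as f:
    for layer in mdn.layers:
        for i, w in enumerate(layer.get_weights()):  # [kernel, bias] per Dense layer
            flat = np.ravel(w)
            vals = ", ".join(f"{v:.8e}" for v in flat)
            f.write(f"parameter real {layer.name}_w{i}[0:{flat.size - 1}] = {{{vals}}};\n")
```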