Heuristic recurrent algorithms for photonic Ising machines

The inability of conventional electronic architectures to efficiently solve large combinatorial problems motivates the development of novel computational hardware. There has been much effort toward developing application-specific hardware across many different fields of engineering, such as integrated circuits, memristors, and photonics. However, unleashing the potential of such architectures requires the development of algorithms which optimally exploit their fundamental properties. Here, we present the Photonic Recurrent Ising Sampler (PRIS), a heuristic method tailored for parallel architectures allowing fast and efficient sampling from distributions of arbitrary Ising problems. Since the PRIS relies on vector-to-fixed matrix multiplications, we suggest the implementation of the PRIS in photonic parallel networks, which realize these operations at an unprecedented speed. The PRIS provides sample solutions to the ground state of Ising models, by converging in probability to their associated Gibbs distribution. The PRIS also relies on intrinsic dynamic noise and eigenvalue dropout to find ground states more efficiently. Our work suggests speedups in heuristic methods via photonic implementations of the PRIS.


Definition of an Ising problem
The general definition of an Ising problem is the following: given a matrix (K_ij)_{(i,j)∈I²} and a vector {b_i}_{i∈I}, where I is a finite set with cardinality |I| = N, we want to find the spin distribution {σ_i}_{i∈I} ∈ {−1, 1}^N minimizing the following Hamiltonian function:

H^(K)(σ) = −∑_{i<j} K_ij σ_i σ_j − ∑_i b_i σ_i.

In statistical mechanics, this model represents the energy of an interacting set of spins, the coupling term K_ij representing the coupling between spins i and j. In the general definition of the Ising model, the coupling matrix K is assumed to be symmetric. We could also assume that it has a null diagonal, since a non-zero diagonal only adds a constant offset to the Hamiltonian; we do not make this extra assumption, as the diagonal will later prove useful. An external magnetic field b_i can be applied, which breaks the symmetry of the problem. This general class of Ising problems is NP-hard 1, and many subclasses of this problem exhibit a similar complexity. In this work, we characterize our optical algorithms on two subclasses of problems:
General antiferromagnets, for which the coupling can only take two discrete values, K_ij ∈ {0, −1}, and b_i = 0 for all i. As every problem in this subclass is equivalent to an unweighted MAX-CUT problem, and MAX-CUT is known to be NP-hard 1, this subclass is already NP-hard.
Spin glasses, for which the coupling K_ij can take continuous values in the range [−1, 1], and b_i = 0.
However, our approach can be easily extended to any coupling matrix K. The possible extension to non-zero external magnetic fields will also be discussed. A complete study of the hardness of the various subclasses of Ising problems can be found in Ref. 2 .

Mapping to the MAX-CUT problem
The MAX-CUT problem of a weighted undirected graph can be phrased in terms of its adjacency matrix A:

C_A(σ) = (1/2) ∑_{i<j} A_ij (1 − σ_i σ_j),

where σ ∈ {−1, 1}^N is a spin vector. The maximal value of this cut function is called C_max. More intuitively, this problem can be interpreted as finding a subset of the vertices of the graph such that the total weight of the edges connecting this subset to its complement is maximized. The vertices of this subset (resp. of its complement) have spin up (resp. spin down). We can map the general Ising problem (spin glass) (Equation (2)) to a weighted MAX-CUT problem. This mapping is useful because the publicly-available solver we use to find the Ising ground state works in terms of the MAX-CUT problem 3. With K = −A and b = 0, the Ising energy and the cut value are related linearly:

H^(K)(σ) = ∑_{i<j} A_ij − 2 C_A(σ).

From this linear relation, we also deduce that the solutions of the weighted MAX-CUT problem and of the general Ising model (spin glass) coincide: argmax_σ C_A(σ) = argmin_σ H^(K)(σ).

The Little and Hopfield networks 4,5 are two early forms of recurrent neural networks, originally designed to understand and model associative memory processes. In its general acceptation, the Little network can be understood as a synchronous version of the Hopfield network. Using the latter to solve the NP-hard Ising problem was previously investigated 6,7. The behavior of both networks was later studied from a statistical-mechanics perspective by P. Peretto 8 and by D. J. Amit and colleagues 9. We here generalize their study to the case of the noisy, synchronous Hopfield network (or noisy Little network 8) and investigate the possibility of sampling hard Ising problems with a parallel, recurrent optical system. In the following, we use the terms "neuron" and "spin" interchangeably, in the framework of spin-glass models of neural networks.
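The equivalence between the two optimization problems can be checked by brute force on a small graph. The sketch below is our own illustration, using one common sign convention (H = −∑_{i<j} K_ij σ_i σ_j with K = −A):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)
N = 6
A = rng.uniform(0, 1, (N, N))
A = np.triu(A, 1)                     # upper-triangular edge weights of the graph
K = -(A + A.T)                        # Ising couplings, K = -A

def cut(s):
    """Total weight of edges crossing the partition defined by spins s."""
    return sum(A[i, j] * (1 - s[i] * s[j]) / 2
               for i in range(N) for j in range(i + 1, N))

def H(s):
    """Ising energy H = -sum_{i<j} K_ij s_i s_j (zero external field)."""
    return -sum(K[i, j] * s[i] * s[j]
                for i in range(N) for j in range(i + 1, N))

states = [np.array(s) for s in product([-1, 1], repeat=N)]
best_cut = max(states, key=cut)       # MAX-CUT optimum
best_H = min(states, key=H)           # Ising ground state
assert np.isclose(cut(best_cut), cut(best_H))  # same optimum (up to degeneracy)
```

Comparing the cut values (rather than the states themselves) accounts for the degeneracy under a global spin flip.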
Our machine applies the following algorithm. We initialize the assembly of neurons {σ_i} with random values. The signal is encoded into the optical domain with the reduced spins S_i = (σ_i + 1)/2 ∈ {0, 1}.
At each step of the algorithm, a potential v_i (a random variable) is applied to each neuron. Its expected value is given by a matrix multiplication with C = 2J,

v̄_i = ∑_j C_ij S_j(t),

and its probability density is given by the function f_φ with mean v̄_i and standard deviation φ. A non-linear threshold Th_θ is then applied to v_i to yield the neural state at the next time step: S_i(t+1) = Th_θ(v_i) = 1 if v_i > θ_i, and 0 otherwise. To summarize, the sequential transformation of the neuron state is

S(t+1) = Th_θ(N(C S(t) | φ)),

where N(x|φ) is a Gaussian distribution with mean x and standard deviation φ. The algorithm is fully determined by the following set of transformation variables: (C, f_φ, θ).
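As a minimal illustration, one synchronous PRIS iteration can be sketched in NumPy. The coupling matrix, noise level, and threshold below are illustrative placeholders, not the paper's calibrated values; the threshold θ_i = (1/2)∑_j C_ij is our own choice, which centers the potentials for reduced spins S ∈ {0, 1}:

```python
import numpy as np

def pris_step(S, C, theta, phi, rng):
    """One synchronous PRIS update: noisy matrix multiplication + threshold."""
    v_mean = C @ S                      # expected potentials, v_bar = C S(t)
    v = rng.normal(v_mean, phi)         # intrinsic Gaussian noise, std phi
    return (v > theta).astype(int)      # non-linear threshold Th_theta

rng = np.random.default_rng(0)
N = 8
K = rng.uniform(-1, 1, (N, N))
K = (K + K.T) / 2                       # random symmetric couplings (illustrative)
S = rng.integers(0, 2, N)               # random initial reduced spins
for _ in range(100):
    S = pris_step(S, K, theta=K.sum(axis=1) / 2, phi=0.3, rng=rng)
```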

Determination of the Temperature factor
In this section, we compute the stationary distribution of the spin variable S_t for the general case with variables (C, f_φ, θ). We first determine the probability of a single spin-flip, knowing that the current spin state is J, where we define G as a rescaled version of the noise cumulative density function. In the following, we only assume that f_φ is symmetric with standard deviation φ. The symmetry assumption is not very constraining, as most noise distributions found in nature are symmetric (and, in many cases, Gaussian) 10. Our analysis can also be extended to noise distributions whose variance is infinite, by parameterizing the distribution with another parameter, usually referred to as a "scaling" parameter (see Supplementary Table 1). The case of f_φ being a Gaussian distribution is discussed in Ref. 8, where the function G can be approximated by a sigmoid function after rescaling by an adequate factor. In the general case, we choose the rescaling factor that minimizes this approximation error. Naturally, the optimal scaling factor depends on the standard deviation φ of the original probability distribution f; we numerically observe that this dependence is linear. Optimal scaling factors and corresponding errors are reported in Supplementary Table 1. Identifying the transition probability with a sigmoid function is a necessary step to compute the effective Hamiltonian of the spin distribution 8.
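To illustrate the rescaling step, the sketch below (our own illustration, not the paper's code) numerically finds the sigmoid scale γ that best matches the rescaled Gaussian noise CDF, G(x) = erf(x / (φ√2)), to tanh(γx), and checks that the optimal γ scales as 1/φ:

```python
import numpy as np
from math import erf

def g(x, phi):
    """Rescaled cumulative density of Gaussian noise, mapped to [-1, 1]."""
    return erf(x / (phi * np.sqrt(2)))

def best_gamma(phi, gammas=np.linspace(0.01, 5, 2000)):
    """Grid search for the sigmoid scale minimizing the max approximation error."""
    x = np.linspace(-8 * phi, 8 * phi, 1001)
    gx = np.array([g(xi, phi) for xi in x])
    errs = [np.max(np.abs(np.tanh(gamma * x) - gx)) for gamma in gammas]
    return gammas[int(np.argmin(errs))]

g1, g2 = best_gamma(1.0), best_gamma(2.0)
assert abs(g1 / g2 - 2) < 0.1   # optimal scale is inversely proportional to phi
```

The 1/φ dependence follows exactly from the substitution u = x/φ, which maps the fitting problem at any φ back to the φ = 1 case.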

Transition probability
By adequately choosing the temperature factor k², we minimize the error made when approximating the transition probability by a sigmoid function (in the following, h_i = h_i(J) and σ_i = σ_i(I), for readability). In the following, we drop the subscript in γ_opt and assume that its value is known (it can be determined numerically, given the noise distribution). The error function can be used as an expansion parameter, in order to estimate the accuracy of the approximation performed when neglecting this term (as is done in Ref. 8).

Expansion of the transition probability in terms of
The transition probability can be derived by multiplying the probabilities of single spin-flips, these events being independent, where W_k(I|J) is the k-th order term in the expansion, assuming the error function ε is small. The zeroth-order term of this expansion recovers the result from Ref. 8. Another way to write this zeroth-order term 8 will prove useful later in our derivations.
The general expression for the k-th order term can be derived. The N-th order term scales, in the worst-case scenario, like N⁰. For k < N, analyzing the scaling of the k-th order term requires more assumptions. Let us look at the case k = 1, to verify that we can safely neglect higher-order terms in this expansion. The ratio |W_1(I|J)/W_0(I|J)| scales exponentially with βh_jσ_j/2; however, in this limit the error function also goes to zero. In addition, increasing the value of βh_jσ_j/2 also increases the value of the Hamiltonian H_0(I|J), and thus reduces the likelihood of the transition W(I|J). Thus, the larger the ratio |W_1(I|J)/W_0(I|J)|, the smaller the transition probability. In the following, we neglect all higher-order terms k ≥ 1.
We do not extend our derivations to a general non-symmetric noise distribution in this work. We nevertheless suggest the following ideas for treating this extension: one could first treat the non-symmetric case as a perturbation of the symmetric case, and thus derive the error on the effective Hamiltonian of the spin distribution.
In this view, we suggest parametrizing the skewness using the skew-normal distribution 11, a natural extension of the toy-model symmetric noise distribution analyzed in Ref. 8.

Detailed balance condition
To determine the stationary distribution of the spin state, we first need to verify that the transition probability satisfies the triangle equality 8, which is equivalent to the detailed balance condition in the context of Markov chains 12. This condition is satisfied if J is symmetric (there is no condition on the linear term h⁰_i in the Hamiltonian). We can then deduce the effective Hamiltonian by decoupling the ratio of transition probabilities into the ratio of two terms depending on the distributions I and J (using Supplementary Equation (28)); defining H_L(I) = −(1/β) ln F(I) then gives the effective Hamiltonian describing the dynamics of the spin distribution:
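Detailed balance can be checked numerically on a small network. The sketch below is our own illustration, using the standard sigmoid transition probabilities of the noisy Little network (Ref. 8) and its effective Hamiltonian H_L = −(1/β) ∑_i ln 2cosh(β h_i); for a symmetric J, both sides of the detailed balance condition reduce to exp(β σ(I)ᵀ J σ(J)):

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta = 4, 0.7
J = rng.uniform(-1, 1, (N, N))
J = (J + J.T) / 2                     # symmetric couplings (illustrative)

def h(sigma):
    return J @ sigma                  # local fields h_i

def W(sig_to, sig_from):
    """Synchronous transition probability: product of sigmoid spin-flip terms."""
    hi = h(sig_from)
    return np.prod(np.exp(beta * sig_to * hi) / (2 * np.cosh(beta * hi)))

def H_L(sigma):
    """Effective (Little) Hamiltonian: -(1/beta) sum_i ln 2cosh(beta h_i)."""
    return -np.sum(np.log(2 * np.cosh(beta * h(sigma)))) / beta

I = rng.choice([-1, 1], N)
Jstate = rng.choice([-1, 1], N)
lhs = W(I, Jstate) * np.exp(-beta * H_L(Jstate))
rhs = W(Jstate, I) * np.exp(-beta * H_L(I))
assert np.isclose(lhs, rhs)           # detailed balance holds for symmetric J
```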

Effective Gibbs distribution
Assuming J is symmetric, the stationary distribution of the spin state is a Gibbs distribution given by the effective Hamiltonian H_L(I) 8,9:

F(I) = e^{−β H_L(I)} / ∑_J e^{−β H_L(J)}.

In the following, we define ‖·‖_k as the usual k-norm (where k is an integer).

Small noise approximation
In the small noise approximation (β ≫ 1), the effective Hamiltonian can be simplified using the Taylor expansion log cosh x ≈ |x| − log 2:

H_L(I) ≈ −∑_i |h_i(I)|.

Large noise approximation
In the large noise approximation (β ≪ 1), the effective Hamiltonian can be simplified using the Taylor expansion log cosh x ≈ x²/2:

H_L(I) ≈ −(β/2) ∑_i h_i(I)².

Matching this expression to a general Ising Hamiltonian yields a system of equations with unknown variables (h⁰_i, J), where δ_ij is the Kronecker symbol and ∆_i represents a degree of freedom on the diagonal of the Hamiltonian (which only results in an additional constant term). Studying the set of general Ising Hamiltonians (K, b) (Supplementary Equation (2)) which can be mapped to an Ising network (J, h⁰) in the large noise approximation (Supplementary Equation (42)) would require an extensive study that we do not carry out in this work. However, one can notice that, in the case where the external field satisfies J · h⁰ = 0, the system has a trivial solution h⁰_i = b_i and J = √K (the conditions under which a square root of K can be found are discussed in the next section).
In the more general case, one is given N "free" degrees of freedom ∆_i when finding solutions to Supplementary Equation (43), while there are N constraints to satisfy in Supplementary Equation (44). Thus, it seems likely that, in non-degenerate cases, this system will have a non-trivial solution. However, one should make sure that, while tuning the degrees of freedom ∆_i, the matrix J remains a real square root of the matrix K (see the discussion on finding a square root in the next section).
If not, this algorithm should be generalized to using complex-valued matrices J. We do not discuss this generalization of the algorithm in this work. However, considering that arbitrary (complex) unitary transformations can be implemented with our photonic architecture 13 , this would be an interesting extension to this work.
In the following, for the sake of simplicity, we assume h⁰_i ≡ 0. In this case, the large noise approximation of the Hamiltonian reduces to an Ising Hamiltonian with effective coupling K̃ = J². Thus, the implementation of the Little network to model a general Ising Hamiltonian (Supplementary Equation (2)) strongly relies on the possibility of finding a symmetric, real square root of the matrix K. Let us remind the reader of the assumptions we have made so far:
The model is synchronous or, as stated in Little's original paper 4, "we shall suppose that the neurons are not permitted to fire at any random time but rather that they are synchronized such that they can only fire at some integral multiple of a period τ".
The neurons have no memory of states older than their previous state (Markov process).
The connections do not evolve in time (there is no learning involved).
J is symmetric and h⁰_i = 0, so that the condition of Supplementary Equation (14) is satisfied.
Supplementary Figure 1. Transitioning between Hamiltonians. a: To approximate a given Ising model in the large noise limit, one should take into account all eigenvalues. b: Effective Hamiltonian and its large/low-noise limit approximations when α = 1 (with 16/16 eigenvalues). c: However, to reduce the noise threshold of the large noise expansion, one should drop out some eigenvalues. d: Effective Hamiltonian and its large/low-noise limit approximations when α = 0 (with 7/16 eigenvalues, all negative eigenvalues being dropped out). In this Supplementary Figure, we study a random spin glass with N = 16 and choose ∆_ii = |∑_j K_ij|.

Inequalities between Hamiltonians
Using basic algebra and functional analysis, one can show that log cosh x ≤ x²/2 and that log cosh x ≥ |x| − log 2, which results in the corresponding inequalities between the large-noise, effective, and small-noise Hamiltonians. By summing over spin configurations, we get the reversed inequalities for the partition functions. Another inequality can be derived from the equivalence of the norms ‖·‖₁ and ‖·‖₂.
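These elementary bounds on log cosh are easy to verify numerically; a quick sketch:

```python
import numpy as np

# Check log cosh x <= x^2 / 2 and log cosh x >= |x| - log 2 on a dense grid
x = np.linspace(-10, 10, 10001)
lc = np.log(np.cosh(x))
assert np.all(lc <= x**2 / 2 + 1e-12)
assert np.all(lc >= np.abs(x) - np.log(2) - 1e-12)
```

The lower bound follows from cosh x ≥ e^{|x|}/2, and the upper bound from the Taylor series of log cosh around 0.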
These Hamiltonians are plotted for a sample random Ising problem in Supplementary Figure 1.

Suggested algorithms
As suggested above by the large noise expansion of the effective Hamiltonian, a natural way to probe the Gibbs distribution of an Ising model defined by a coupling matrix K (Supplementary Equation (2)) is to set the matrix of the recurrent loop C to be a modified square root of K, where α ∈ [−1, 1] is the eigenvalue dropout level, a parameter we discuss below, and ∆ is a diagonal offset matrix defined as the sum of the off-diagonal terms of the Ising coupling matrix, ∆_ii = ∑_{j≠i} |K_ij|. An alternative solution is to set C = K. This solution has been studied in the particular case of associative-memory couplings 9, where it was shown that the PRIS in this configuration is described by a free-energy function equal to the Ising free energy (up to a factor of 2). We discuss the pros and cons of both algorithms in Supplementary Note 4.
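One plausible construction of such a modified square root (our own sketch; the exact parametrization used in the paper may differ) eigendecomposes K + α∆, drops the negative eigenvalues, and takes the element-wise square root of the remaining spectrum:

```python
import numpy as np

def modified_sqrt(K, alpha):
    """Real symmetric 'square root' of K with eigenvalue dropout level alpha."""
    # Delta_ii = sum_{j != i} |K_ij| (diagonal offset matrix)
    Delta = np.diag(np.sum(np.abs(K - np.diag(np.diag(K))), axis=1))
    lam, U = np.linalg.eigh(K + alpha * Delta)     # symmetric eigendecomposition
    lam_kept = np.where(lam > 0, lam, 0.0)         # drop out negative eigenvalues
    return U @ np.diag(np.sqrt(lam_kept)) @ U.T    # real symmetric by construction

rng = np.random.default_rng(2)
N = 16
K = rng.uniform(-1, 1, (N, N))
K = (K + K.T) / 2
np.fill_diagonal(K, 0)                             # zero-diagonal spin glass
J = modified_sqrt(K, alpha=1.0)
assert np.allclose(J, J.T)                         # J is symmetric
# With alpha = 1, K + Delta is diagonally dominant, hence PSD: J^2 recovers it
Delta = np.diag(np.abs(K).sum(axis=1))
assert np.allclose(J @ J, K + Delta)
```

For α < 1 some eigenvalues become negative and are clipped, so J² only approximates K + α∆ on the retained eigenspaces.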

SUPPLEMENTARY NOTE 3: CONSTRUCTION OF THE WEIGHT MATRIX
We here propose a technique to build a square root of K, in order to find an approximate solution to any Ising problem. We start from the general Ising weight matrix K of Supplementary Equation (2) and rely on the following standard result.
Lemma 1 Let A be a symmetric matrix with non-negative diagonal elements which is diagonally dominant, i.e. |A_ii| ≥ ∑_{j≠i} |A_ij| for all i. Then A is positive semidefinite.
We notice that, as K is symmetric and K_ii = 0, K itself will never obey the condition of Lemma 1.
Proof 1 Let X be a vector of size N and norm 1 such that AX = λX with λ < 0. Let i₀ be the coordinate of X with maximum absolute value; without loss of generality, x_{i₀} > 0 and x_{i₀} = ‖X‖_∞. We have ∑_j A_{i₀j} x_j = λ x_{i₀}, so that

(A_{i₀i₀} − λ) x_{i₀} = −∑_{j≠i₀} A_{i₀j} x_j ≤ ∑_{j≠i₀} |A_{i₀j}| x_{i₀}.

Since λ < 0 and A_{i₀i₀} ≥ 0, simplifying this inequality gives |A_{i₀i₀}| < ∑_{j≠i₀} |A_{i₀j}|, which contradicts the diagonal dominance assumption. Thus A has no negative eigenvalue: for all X, XᵀAX ≥ 0, which means that A is positive semidefinite.
To be able to apply Lemma 1, we add diagonal terms to the Ising matrix K of Supplementary Equation (2), defining K̃ = K + ∆, where ∆ is a diagonal matrix such that K̃ verifies the assumptions of Lemma 1. We can thus define J = √K̃. In the large noise approximation, we then get a PRIS described by the Ising Hamiltonian of K̃. Since the ∆ matrix is fixed, this extra diagonal term only contributes a constant (independent of the spin distribution, given an Ising problem to minimize). The Gibbs distribution is thus that of H_{K,0}, the Ising Hamiltonian defined in Supplementary Equation (2), up to this constant offset.
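A quick randomized spot-check of Lemma 1 (illustrative only):

```python
import numpy as np

# Lemma 1: symmetric + non-negative diagonal + diagonally dominant => PSD
rng = np.random.default_rng(3)
for _ in range(200):
    n = int(rng.integers(2, 8))
    A = rng.uniform(-1, 1, (n, n))
    A = (A + A.T) / 2
    # Enforce diagonal dominance: A_ii = sum_{j != i} |A_ij| >= 0
    np.fill_diagonal(A, np.abs(A).sum(axis=1) - np.abs(np.diag(A)))
    assert np.linalg.eigvalsh(A).min() >= -1e-10
```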

Discussion on the diagonal offset
The diagonal offset added to the original Ising matrix can be considered as a supplementary degree of freedom. To verify Lemma 1, we can choose ∆_ii = ∑_{j≠i} |K_ij|, which makes sure that K̃ is positive semidefinite and has a real symmetric square root. However, there is no reciprocal to Lemma 1. We here show that even some partial or weak reciprocals are usually wrong.
The direct reciprocal of Lemma 1 is generally wrong. Consider the following counter-example:

A = [ 1  2
      2  8 ].

A is symmetric and has positive diagonal elements, but is not diagonally dominant (the inequality fails for exactly one diagonal element). However, A is symmetric positive definite: its determinant is 4 and its trace is 9, so the product and sum of its eigenvalues are positive, which means both eigenvalues are positive.
Even a weak reciprocal of Lemma 1 is wrong. One could conjecture: if A has positive diagonal elements such that, for all i, |A_ii| < ∑_{j≠i} |A_ij|, then A has at least one negative eigenvalue. For a counter-example, one can consider the 3×3 all-ones matrix

A = [ 1  1  1
      1  1  1
      1  1  1 ].

A is symmetric, has positive diagonal elements, and verifies ∀i, |A_ii| < ∑_{j≠i} |A_ij|; but still, its eigenvalues are 0, 0, and 3.
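Matrices consistent with the stated properties (determinant 4 and trace 9 for the first; eigenvalues 0, 0, 3 for the second) can be checked numerically; the particular entries below are our own illustrative choices:

```python
import numpy as np

# Symmetric, positive diagonal, not diagonally dominant in one row...
A1 = np.array([[1.0, 2.0], [2.0, 8.0]])
assert np.isclose(np.linalg.det(A1), 4) and np.isclose(np.trace(A1), 9)
assert np.all(np.linalg.eigvalsh(A1) > 0)   # ...yet positive definite

# Symmetric, positive diagonal, |A_ii| < sum_{j != i} |A_ij| for every i...
A2 = np.ones((3, 3))
assert np.allclose(np.sort(np.linalg.eigvalsh(A2)), [0, 0, 3])  # ...no negative eigenvalue
```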
One (very) weak reciprocal of Lemma 1 is the following Lemma 2. It implies that an Ising matrix (which has a zero diagonal) will never have a real symmetric square root, unless we add an offset to its diagonal.
Lemma 2 Let A be a non-zero symmetric matrix such that A_ii = 0 for all i. Then A has at least one negative (and one positive) eigenvalue, since its (real) eigenvalues are not all zero and sum to tr A = 0.
Let us also notice that the zero-eigenvalue case still occurs for an NP-hard subclass of Ising problems (unweighted antiferromagnets). In this case, K's elements are −1 if there is a spin-spin connection and 0 otherwise; with the choice ∆_ii = ∑_{j≠i} |K_ij|, K̃ = K + ∆ then has a zero eigenvalue, K̃X = 0 with X = (1, 1, ..., 1)ᵀ, because each of its rows sums to 0. We notice that, for α < 1, K + α∆ = (K + ∆) + (α − 1)∆, where K + ∆ has a zero eigenvalue and (α − 1)∆ only has negative eigenvalues. Using Weyl's inequalities, we deduce that K + α∆ has at least one strictly negative eigenvalue.
Studying the reciprocals of Lemma 1 gives us better insight into the optimal diagonal offset ∆ to add to the original Ising matrix. We can consider ∆ as a supplementary degree of freedom of our algorithm. The choice of ∆ suggested by diagonal dominance is particularly well adapted for α > 0 and for Ising problems whose couplings all have the same sign. However, we notice that, for large spin glasses, ∑_{j≠i} |K_ij| is typically much greater than |∑_{j≠i} K_ij|. Thus, α ∼ −1/N is the lower limit for non-zero J matrices, and the domain of α over which the number of retained eigenvalues actually varies is asymmetric. For this reason, we typically choose ∆_ii = |∑_{j≠i} K_ij| for large spin glasses. In this study, we specify which definition of ∆ is chosen for each Supplementary Figure and study.
Tuning the search dimensionality with eigenvalue dropout
Tuning the eigenvalue dropout gives us a way of tuning the search dimensionality. If K = U D U†, where U is real unitary and D = Diag(λ_1, ..., λ_N), we can parametrize the phase space in terms of the eigenvectors of K, denoted e_j (associated with eigenvalue λ_j), and rephrase the optimization problem as the minimization of −(1/2) ∑_j λ_j (e_jᵀσ)² under the hard constraint σ ∈ {−1, 1}^N. As only the positive eigenvalues can decrease the energy, we would like to conclude that only the eigenvectors associated with positive eigenvalues contribute to minimizing the energy. However, one should make sure that the hard constraint in Supplementary Equation (56) remains satisfied. We can only conclude, heuristically, that the ground state of the optimization problem will likely have most of its weight in eigenspaces associated with positive eigenvalues. As reducing the eigenvalue dropout level amounts to dropping out the negative eigenvalues, this explains why we usually observe better performance around α = 0.
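The eigenbasis decomposition of the energy can be sketched as follows (our own illustration): the Ising energy −(1/2)σᵀKσ equals −(1/2)∑_j λ_j (e_jᵀσ)², so only components on positive-λ eigenspaces can lower it:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 12
K = rng.uniform(-1, 1, (N, N))
K = (K + K.T) / 2
np.fill_diagonal(K, 0)
lam, U = np.linalg.eigh(K)             # K = U diag(lam) U^T

sigma = rng.choice([-1.0, 1.0], N)
c = U.T @ sigma                        # components c_j = e_j^T sigma
E_direct = -0.5 * sigma @ K @ sigma
E_eigen = -0.5 * np.sum(lam * c**2)    # same energy, term by term in the eigenbasis
assert np.isclose(E_direct, E_eigen)
```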

Modified algorithm with tunable offset
We define ∆⁰ as the minimal offset verifying the assumption of Lemma 1, making sure that the modified Ising matrix is symmetric positive semidefinite: ∆⁰_ii = ∑_{j≠i} |K_ij|. The modified algorithm then corresponds to a particular parametrization of the weight matrix with a tunable offset, and can thus be implemented with the PRIS. Writing this problem as a function of the coupling-matrix eigenvalues (see above) shows that, as long as α ≥ 0 in the modified algorithm, the ground state of the Ising problem will likely remain unchanged (see Supplementary Figure 4).
Supplementary Figure 2. Optimized noise level used in Figure 2 (main text). a: for MAX-CUT graphs (Figure 2c in the main text). b: for spin glasses (Figure 2b in the main text).
Supplementary Figure 3. Conjugated influence of the eigenvalue dropout and noise levels on finding the ground state. For each tuple (α, φ), the PRIS is run 100 times and N_iter,90% is plotted (the ground state is pre-computed with BiqMac 3). The colorbar is saturated at 10⁵ iterations in order to show the valley of optimal (α, φ) values. In this Supplementary Figure, we study a MAX-CUT graph with density d = 0.5 and ∆_ii = ∑_{j≠i} |K_ij|.

SUPPLEMENTARY NOTE 4: NUMERICAL SIMULATIONS
Evaluating sampling performance and critical parameters on two-dimensional ferromagnetic and infinite-range Ising models
In the main text, we evaluate the performance of the PRIS at sampling the Gibbs distribution of analytically solvable Ising models and at estimating their critical exponents. We chose the following two problems:
Two-dimensional ferromagnetic Ising model on a square lattice with periodic boundary conditions. This model was first analytically solved by Onsager 14. Its energy is defined as

H(σ) = −∑_{⟨ij⟩} σ_i σ_j,

where ⟨ij⟩ refers to nearest neighbors (i, j) under periodic boundary conditions (see Supplementary Figure 5).
Infinite-range Ising model, where each node is connected with a positive coupling of 1/N to all its neighbors, including itself. This model can be analytically solved by mean-field theory 15. Its energy is defined as

H(σ) = −(1/2N) (∑_i σ_i)².

In order to measure the critical exponents, we need to make sure that the free energy of the network is equivalent to that of the corresponding Ising model at all temperatures. It has been shown that the free energy of a Little network with coupling matrix K equals that of an Ising model defined by the same coupling matrix K, when the coupling weights are configured to learn a set of stable configurations (associative memory) 9. However, this network had been discarded because of the possible existence of loops between some of its degenerate states, which can result in odd dynamics of the system if no additional precautions are taken (see, for instance, Supplementary Figure 7, where this simple algorithm actually converges to the maximal energy). We observe that adding a diagonal offset to the coupling, as we do in the square-root version of this algorithm, suppresses these odd dynamics (Supplementary Figure 7(b)).
The algorithm described earlier to find the ground state of Ising problems could also be used to measure critical exponents, as can be seen in Supplementary Figure 6. However, some complications arise: taking the square root of the coupling matrix prevents the use of symmetry and sparsity to reduce the algorithm complexity. Thus, for large graphs, the time needed for a single matrix multiplication becomes quite large on a CPU.
Taking the square root also modifies the coupling amplitude. It is then unclear how the temperature of the system should be defined (which is a problem if we want to estimate the critical temperature with the PRIS). From the large-noise expansion of the Hamiltonian, we should define the effective temperature as T = k²φ². However, we observe that this definition does not match the theoretical value of the critical temperature of the 2D ferromagnetic Ising model.
For both Ising problems studied, we perform the following analysis. We first estimate the Binder cumulant U₄ (see definition in the main text) as a function of the system temperature, for various graph sizes N = L². The cumulants U₄ plotted for different L intersect at the critical temperature 12 (see Figure 4(a) in the main text).
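As an illustration, the Binder cumulant U₄ = 1 − ⟨m⁴⟩/(3⟨m²⟩²) can be estimated from magnetization samples as sketched below (our own example; the two synthetic sample sets mimic the disordered phase, where U₄ → 0 for Gaussian-distributed m, and the ordered phase, where U₄ → 2/3 for sharply peaked ±m):

```python
import numpy as np

def binder_cumulant(m):
    """U4 = 1 - <m^4> / (3 <m^2>^2), estimated from magnetization samples m."""
    m = np.asarray(m)
    return 1 - np.mean(m ** 4) / (3 * np.mean(m ** 2) ** 2)

rng = np.random.default_rng(5)
u_gauss = binder_cumulant(rng.normal(0, 1, 200_000))   # disordered phase
u_peaks = binder_cumulant(rng.choice([-1.0, 1.0], 200_000))  # ordered phase
assert abs(u_gauss) < 0.05
assert np.isclose(u_peaks, 2 / 3)
```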
We then estimate the dependence on the linear dimension L of various observables at the critical temperature and deduce the corresponding critical exponents. This is enabled by the scaling laws of observables at the critical temperature 12, m ∼ L^{−β/ν}, χ ∼ L^{γ/ν}, and τ_auto ∼ L^z, where m, χ, τ^E_auto, and τ^m_auto are respectively the magnetization, the magnetic susceptibility, the energy autocorrelation time, and the magnetization autocorrelation time 12. To estimate the critical exponents, we run the algorithm 120 times for 10⁵ iterations, with random initial conditions, at the critical temperature for each L, and fit these observables with a power law in L. The estimates of all observables are obtained by taking one sample every τ^E_auto generated samples, where τ^E_auto is the energy autocorrelation time, and dropping the first 10% of the data (arbitrary burn-in or equilibration time). Our findings are summarized in Supplementary Table 2 and Supplementary Figures 8 and 9, and are benchmarked against the Metropolis-Hastings algorithm, which is summarized in Refs. 12,16,17.
In Supplementary Figure 10, we examine the influence of the bit accuracy of the phase settings on the performance of the machine. This can describe any system where the matrix is encoded by a cascaded array of programmable MZIs, such as Refs. 13,18. We assume the phase can be set with b-bit accuracy, which means that when setting a phase θ_m (resp. φ_m) on the PNP, we actually draw from a uniform distribution around the target value. Another way to describe the bit accuracy of the PNP is through the bit accuracy of the voltage setting 13. The phase-voltage relation can be approximated by a quadratic dependence, θ = 2π (V/V_2π)², where V_2π is the voltage setting achieving a 2π phase modulation 19. If we assume the voltage is set with b_V-bit accuracy, then the actual voltage is drawn from a uniform distribution around the target value, and the resulting phase error contains a term linear in the voltage error and a term quadratic in it. We can safely neglect the quadratic term, and we notice that the worst-case scenario corresponds to θ ∼ 2π, for which b_V − 1 = b_θ.
A rule of thumb results: a b_V-bit accuracy of the PNP on its voltage setting corresponds to a (b_V − 1)-bit accuracy on its phase setting. Conversely, a b_θ-bit accuracy on the phase setting corresponds to a (b_θ + 1)-bit accuracy on the voltage setting. Static sources of noise could be a significant bottleneck in scaling the PRIS to large graph orders, N ∼ 100. For instance, static noise on the phase settings of an array of MZIs results in a static error on the effective coupling between spins, thus reshaping the Hamiltonian landscape, which can impact the algorithm efficiency. We simulate the algorithm performance as a function of the graph order N for phase resolutions of b_θ = 8 and 16 bits. The resulting time on a GHz photonic architecture to find the ground state with 90% probability, T_iter,90%, is also shown in Supplementary Figure 10. While a 16-bit phase resolution does not impact the algorithm performance (with scaling results comparable to an ideal photonic network, see main text), an 8-bit phase resolution may increase the required number of algorithm steps by one to two orders of magnitude, depending on the graph order and topology (while still outperforming other photonic systems on a GHz architecture, such as Refs. 20,21). Thus, the reduction of static noise is of paramount importance for the realization of the PRIS on large-scale static photonic networks.
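The voltage-to-phase rule of thumb can be sketched numerically (our own illustration, using the quadratic phase-voltage relation θ = 2π(V/V_2π)² with an arbitrary V_2π):

```python
import numpy as np

V2pi = 1.0  # arbitrary normalization of the 2*pi voltage

def phase(V):
    return 2 * np.pi * (V / V2pi) ** 2   # quadratic phase-voltage relation

# Worst case theta ~ 2*pi: one LSB of a bV-bit voltage DAC...
bV = 8
dV = V2pi / 2 ** bV
dtheta = phase(V2pi) - phase(V2pi - dV)  # ...produces this phase step
# dtheta ~ 2*pi / 2**(bV - 1): one bit of phase accuracy is lost
assert np.isclose(dtheta, 2 * np.pi / 2 ** (bV - 1), rtol=0.01)
```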

Optical Neural Networks based on Photoelectric Multiplication
For larger graphs, N ∼ 10³ − 10⁶, one could resort to recently-proposed large-scale optical neural networks based on photoelectric multiplication 22, in which both the matrix weights C_ij and the input signals S_j are encoded in the optical domain, and the energy cost is set by the number of photons per multiply-and-accumulate operation, n_mac. For the various problems we study in this paper, the working standard deviation is usually ∼ 1. Evaluating the corresponding working n_mac and total energy consumption per matrix multiplication over 10 random spin glasses, we get n_mac ∼ 4 (resp. ∼ 15) for N = 100 (resp. N = 1,000). Smaller working n_mac is required for sparser graphs, since ‖C‖ is smaller. The corresponding SNR scales as ∼ n_mac, and the total energy, N² n_mac, is ∼ 6.2 ± 0.35 fJ/matrix multiplication (resp. 1.9 ± 0.5 pJ/matrix multiplication). There are many attractive features of these networks for the implementation of a large-scale PRIS:
This architecture naturally leverages the quantum noise which perturbs the output, as required for the proper execution of the PRIS. The noise level can be tuned by changing the number of photons per MAC, which is proportional to the SNR.
The non-linear function is executed in the electronic domain, which allows a lot of flexibility in its implementation and reconfiguration (the threshold function required for the operation of the PRIS would be straightforward to implement), while optical nonlinearities working at low power have not been demonstrated so far.
This architecture is in principle scalable to very large numbers of spins, N ∼ 10⁶.

Free space optical Neural Networks
Since the PRIS relies on a static transformation, the use of free-space optical neural networks 23,24 with 3D-printed masks or reconfigurable SLMs is another option to achieve a PRIS with N ∼ 10⁶ neurons. The analysis we made of the influence of heterogeneities (see Supplementary Figure 10) remains relevant here. A lens-mask-lens assembly would realize the matrix multiplication on the signal S(t) encoded in the optical domain, while the coupling matrix C is encoded in the mask transmission. The speed of such a free-space architecture would only be limited by the photodetector (typically ∼ 10 GHz) and modulation (typically ∼ 1 kHz for SLMs, ∼ 1 GHz for on-chip modulators) bandwidths.

Comparison table of various heuristic algorithms and architectures
Since the PRIS essentially relies on fast vector-to-fixed-matrix multiplication, it can be implemented efficiently on various photonic and parallel electronic hardware architectures, such as FPGAs. We summarize in Supplementary Table 3 the projected performance of various heuristic algorithms (MH and the PRIS) running on several hardware platforms (both electronic and photonic). We observe that, while photonics potentially allows the fastest clock for such algorithms and a competitive scaling factor, FPGAs can achieve similar clocks and require much less engineering (they can be bought off the shelf and are easily reconfigurable). To demonstrate this point, we perform a proof-of-concept experiment on an FPGA board, whose results are shown in Supplementary Note 7.
While photonic architectures can significantly reduce the time complexity of an algorithm step by performing massive multiplexing 25, it must be noted that we neglect overheads such as fabrication time. Also, some free-space architectures, such as SLMs, are typically slow to reconfigure (on the order of 1 ms), and are thus only relevant for multiplying very large matrices.
In order to evaluate the performance of the PRIS on larger graphs, for which exact solvers typically fail, we benchmark the PRIS against MH for a set of 10 random spin glasses (whose couplings are randomly chosen from a uniform distribution over the interval [−1, 1]). First, we notice that the behavior probed by the PRIS and by MH is very similar over the 10 spin glasses. The temperature/noise-dependent mean ground-state energy found by MH and the PRIS is shown in Supplementary Figure 11; in particular, the PRIS achieves mean ground-state energies similar to MH. As a reminder, a heuristic mapping between the temperature from statistical physics and the effective temperature (noise level) of the PRIS is T ∼ (kφ)². These graph orders are large enough that standard exact solvers take a long time to find the ground state. For instance, we ran BiqMac on spin glass 1 on its online submission server 3 for three hours; the algorithm could only find an approximate ground state, which was outperformed by MH, PRIS-A, and SA.
We also observe that the PRIS shows an α-dependent optimal noise level, which increases with α. For spin glass 1, the absolute lowest ground state energy is obtained for α = −0.2, i.e. (450/1000) eigenvalues. This hints at the necessity of optimizing the hyperparameter α around α ∼ 0 when running the PRIS on large graphs.

PRIS-A: a proposed metaheuristic variation of the PRIS
A route to achieving systematically low energy states is to design metaheuristics, i.e., master strategies guiding the search for an optimal ground state. Simulated Annealing (SA) 26 is one such algorithm, derived from Metropolis-Hastings. Let us remind the reader of one possible implementation of this algorithm with the widely used geometric schedule for the temperature 26-29:

    Start from random initial state
    Choose initial temperature Ti and geometric factor λ < 1
    T ← Ti
    for all i ∈ {1, ..., Nalg, iter}
        T ← λT
        for all j ∈ {1, ..., Niter per temp.}
            Update state according to MH acceptance rule at temperature T
        end for
    end for

Algorithm 1: Simulated annealing algorithm with a geometric schedule.

Supplementary Table 4. Summary of benchmarking PRIS and PRIS-A against MH and SA. For both PRIS and MH, the algorithm is run 100 times for 10,000 iterations at each temperature/noise level. The table shows the absolute lowest ground state energy recorded. For PRIS-A and SA, the algorithm is run 100 times with Nalg, iter temperature increments (as given by Supplementary Equation (67)) and Niter per temp. = 100. For PRIS-A (resp. SA), the initial noise level (resp. temperature) is φi = 50 (resp. Ti = 5,000), the final noise level is φf = 0.1 (resp. Tf = 0.01), and the geometric factor is λ = √0.991 (resp. λ = 0.991). For PRIS-A, the eigenvalue dropout level is taken to be α = 0, corresponding to (501/1000) eigenvalues. For BiqMac, the algorithm ran for three hours (time limit on the BiqMac online job submission platform). N.C. = Not Computed.
This naturally inspires a metaheuristic based on the PRIS, which we call Photonic Recurrent Ising Simulated Annealing (PRIS-A):

    Start from random initial state
    Choose initial noise level φi and geometric factor λ < 1
    φ ← φi
    for all i ∈ {1, ..., Nalg, iter}
        φ ← λφ
        for all j ∈ {1, ..., Niter per temp.}
            Update state according to PRIS acceptance rule at noise level φ
        end for
    end for

Algorithm 2: Photonic Recurrent Ising Simulated Annealing (PRIS-A) algorithm.
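Algorithm 2 can be sketched in a few lines of Python. This is a schematic software illustration, not the photonic or FPGA implementation: we assume the PRIS update consists of multiplying the spin state by a fixed matrix, injecting Gaussian noise of standard deviation φ, and sign-thresholding (as in the loop module of our FPGA design); the dense matrix `W` standing in for the eigenvalue-dropout-filtered coupling matrix, the zero threshold, and the energy convention are simplifying assumptions.

```python
import numpy as np

def pris_step(state, W, phi, rng):
    """One schematic PRIS update: fixed matrix multiplication,
    Gaussian noise injection, and sign thresholding."""
    field = W @ state + rng.normal(0.0, phi, size=state.shape)
    return np.where(field >= 0.0, 1.0, -1.0)

def pris_annealing(W, phi_i=50.0, phi_f=0.1, lam=np.sqrt(0.991),
                   n_iter_per_temp=100, seed=0):
    """PRIS-A (Algorithm 2): geometric schedule on the noise level phi."""
    rng = np.random.default_rng(seed)
    state = rng.choice([-1.0, 1.0], size=W.shape[0])  # random initial state
    phi = phi_i
    while phi > phi_f:
        phi *= lam                                    # geometric noise decrease
        for _ in range(n_iter_per_temp):
            state = pris_step(state, W, phi, rng)
    return state

# Toy unweighted antiferromagnet on a triangle (K_ij = -1 off-diagonal)
K = -(np.ones((3, 3)) - np.eye(3))
state = pris_annealing(K, phi_i=5.0, phi_f=0.05)
energy = -0.5 * state @ K @ state   # Ising energy (one common convention)
```

On this frustrated triangle, any state with two aligned spins and one opposed is a ground state; the sketch only illustrates the annealing loop structure, not the convergence guarantees discussed in the main text.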
Let us note a couple of peculiarities. The noise level of the PRIS is related to an effective temperature via T ∼ φ². Thus, one should compare SA run with a geometric factor λ to PRIS-A run with a geometric factor √λ.
In the PRIS-A algorithm, the eigenvalue dropout level α is also a degree of freedom. One could thus, in principle, also simulate the annealing of the eigenvalue dropout level, thus affecting the ground state search dimensionality. A comprehensive study of this new class of algorithms goes beyond the scope of this work.
For given initial and final noise levels/temperatures and geometric factors, one can determine the number of temperature increments N alg, iter with the formula

    N alg, iter = ⌈ log(T f / T i ) / log λ ⌉,

since the geometric schedule gives T f = λ^(N alg, iter) T i. λ is typically chosen smaller than but close to 1, in order to mimic adiabatic temperature variations. We verify that, both for SA and PRIS-A, λ > 0.98 yields consistently low energy ground states, with no particular improvement when increasing λ further (and scaling the number of temperature increments N alg, iter accordingly).
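As a quick numerical check of the geometric-schedule count (our reading of Supplementary Equation (67), assuming T f = λ^N T i), the benchmark settings quoted in Supplementary Table 4 give:

```python
import math

def n_increments(x_i, x_f, lam):
    """Number of geometric steps taking x_i down to x_f with factor lam < 1,
    from x_f = lam**N * x_i, i.e. N = ceil(log(x_f/x_i) / log(lam))."""
    return math.ceil(math.log(x_f / x_i) / math.log(lam))

# SA settings from the benchmark table: Ti = 5000, Tf = 0.01, lam = 0.991
n_sa = n_increments(5000.0, 0.01, 0.991)
# PRIS-A anneals the noise level phi with factor sqrt(lam), since T ~ phi**2
n_pris = n_increments(50.0, 0.1, math.sqrt(0.991))
```

Both counts land in the low thousands of temperature increments, consistent with the Niter per temp. = 100 inner loop used in the benchmark.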
The initial and final noise levels chosen for PRIS-A are φ i = 50 and φ f = 0.1. The initial and final temperatures chosen for SA are T i = (2 × 0.5877 × φ i )² ≈ 3454 and T f = (2 × 0.5877 × φ f )² ≈ 0.01. We observed no significant variation in the minimum ground state energy found for λ > 0.99 and ran each algorithm with a rate of λ = 0.991. The performance of the various algorithms we implement is shown in Supplementary Table 4. MH yields ground state energies lower than the PRIS by 0.53% on average. However, PRIS-A outperforms the PRIS by a similar amount, lowering energies by 0.46% on average. This is a larger performance enhancement than that of SA over MH (0.11% decrease in ground state energy). Overall, SA outperforms PRIS-A by 0.18% on average. We expect optimization of the eigenvalue dropout level α (and simulated annealing on this parameter) to further enhance the performance of the PRIS and PRIS-A. The PRIS algorithm is primarily designed for future photonic implementations. However, both photonic chips and FPGAs (Field Programmable Gate Arrays) share parallel processing capability. Thus, it is worthwhile to first demonstrate the performance of the algorithm on an FPGA board.
We have implemented the algorithm on a Xilinx Zynq UltraScale+ multiprocessor system-on-chip (MPSoC) ZCU104 evaluation board, based on the Pynq framework. The high-level synthesis, place, and route have been performed with the Xilinx Vivado 2018.3 design suite. Zynq and Zynq UltraScale+ devices integrate a multi-core ARM processor and programmable logic (FPGA) in the same device. PYNQ is an open-source project released by Xilinx that enables Python productivity for Zynq devices. It incorporates the open-source Jupyter notebook infrastructure to run an Interactive Python (IPython) kernel and a web server directly on the ARM processor of the Zynq device. It also provides extensive hardware libraries (overlays) and APIs which enable easier and faster programming of the FPGA.
For our application, the preprocessing and preparation of data are performed in a Jupyter notebook using Python. Once ready, we transmit the data and write it to memories on the FPGA through the AXI interface. After running the recurrent algorithm for a number of cycles on the FPGA, the final state vector is transmitted back for data analysis, such as energy calculation and correctness verification. In this manner, the same hardware configuration can easily be reconfigured to run different problems efficiently. Noise generation and post-processing (energy calculation and, more generally, extracting observables) could also be run directly on the FPGA in future versions of this implementation.
The overall high-level architecture is shown in Supplementary Figure 12. We make use of the address-mapped AXI interface to encode different operations (e.g., Block RAM (BRAM) addressing, loop setting, result selection) at different addresses. The AXI controller loads matrix, noise, and threshold data into BRAMs, then sets up the initial state vector and starts the computation. The loop module is a finite-state machine which reads the data from BRAM, adds noise, compares with the threshold, and then updates the state vector at every loop step. The final state vector and clock information are transmitted back through AXI.
We emphasize the following considerations regarding our implementation. The performance bottleneck of the implementation is reading the input matrix data from memory. The ZCU104 board has 312 BRAM blocks, each block transmitting a maximum of 72 bits per clock cycle when configured in true-dual-port mode. It also has 96 UltraRAM (URAM) blocks, each block transmitting a maximum of 144 bits per clock cycle. Thus, the total maximum data rate it can read from BRAM and URAM is 36,288 bits per clock cycle. For our current implementation (N < 100) we can assume we have enough memory, but as the problem size increases this becomes a major limitation.
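The quoted aggregate bandwidth can be checked directly; note that the 36,288-bit total requires the ZCU104's 96 URAM blocks (we assume 96 here, since 312·72 + 96·144 = 36,288):

```python
# Aggregate per-cycle read bandwidth of the ZCU104 on-chip memories,
# assuming true-dual-port mode (72 bits/BRAM block, 144 bits/URAM block).
bram_blocks, bram_bits_per_cycle = 312, 72
uram_blocks, uram_bits_per_cycle = 96, 144

total_bits_per_cycle = (bram_blocks * bram_bits_per_cycle
                        + uram_blocks * uram_bits_per_cycle)
```
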
We multiply an N-by-N matrix of b-bit entries (b being the number of bits we choose to represent the Ising problem data) with an N-by-1 1-bit state vector, so each entry of the result is a conditional sum of one row of the matrix C based on the values of the state vector. We implement a binary tree to speed up this sum, as shown in Supplementary Figure 13. At each time step, a certain number of rows of the matrix are loaded onto the leaves of the tree, propagated to the next level based on the values of the state vector, and summed up thereafter. The result of the sum propagates to the root after a delay, in clock cycles, equal to the height of the tree.
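The conditional-sum tree can be modeled in software. This toy model (our own illustration, not the hardware description) masks a matrix row by the state bits at the leaves and reduces pairwise, so the result emerges after a number of levels equal to the tree height, ⌈log₂ N⌉:

```python
def tree_conditional_sum(row, state_bits):
    """Sum the entries of `row` whose state bit is 1, via a pairwise
    (binary tree) reduction as in Supplementary Figure 13."""
    # Leaves: each branch is either active (passes the value) or not (passes 0)
    level = [v if s else 0 for v, s in zip(row, state_bits)]
    depth = 0
    while len(level) > 1:
        if len(level) % 2:          # pad odd-length levels so entries pair up
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        depth += 1                  # one clock cycle per tree level
    return level[0], depth

row = [3, -1, 4, -1, 5]
bits = [1, 0, 1, 1, 0]
total, depth = tree_conditional_sum(row, bits)
# total == 3 + 4 + (-1) == 6, after depth == ceil(log2(5)) == 3 levels
```

In the hardware, new rows are fed into the tree while earlier partial sums are still propagating, which is what makes the pipelined analysis below possible.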
The loop is carefully pipelined: reading the matrix takes N²b/M clock cycles (with M the memory read bandwidth in bits per clock cycle, discussed below), multiplication (binary tree addition) takes log₂ N cycles, while noise injection and thresholding are done in a single clock cycle. Details of the complexity are discussed in the next section. We only run integer arithmetic on the FPGA, so to adapt the original algorithm, which runs on floating-point numbers, we need to scale the parameters (matrix, noise, and threshold) and then round them to integers. The scaling factor and the bit length used to represent each datum need to be carefully designed.

Supplementary Figure 13. Binary tree architecture for matrix-to-vector multiplication. A row of matrix C is read from memory to the top leaves of the binary tree. The branches connecting the top leaves to the next tree layer are either active or not, depending on the value of the current state vector (0 or 1).

Note that the only difference between our implemented algorithm on the FPGA and the ideal algorithm is that we round our values to a specific bit depth. Specifically, we store the matrix, noise, and threshold values only to a specific bit precision. This originates from memory constraints on the FPGA, and it is reasonable to suspect that this rounding affects the performance of the algorithm. To test the effects of this change, we implemented both our original algorithm and the version with rounding in MATLAB, testing various bit depths with varying inputs, as shown in Supplementary Figure 14. We found that, for a matrix of size N = 100 by 100, rounding after multiplying by a scaling factor of 64 or even 32 performed just as well as using arbitrary bit precision (MATLAB's default double-precision floating point). Specifically, for a variety of matrix inputs, the 64 and 32 roundings on average reached states of essentially equivalent energy to the arbitrary-precision version. Over 1000 trials, each with a different input matrix and 1000 iterations (i.e., 1000 matrix multiplications, noise additions, and thresholdings per trial), our normal algorithm reached a minimum energy of −408.63 on average, whereas the 64 rounding reached −409.08 and the 32 rounding reached −409.17.
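The rounding scheme can be reproduced in miniature. This is a sketch under our own assumptions about the update rule (matrix multiply, noise, threshold comparison); it scales the couplings, noise, and threshold by a factor s, rounds to integers, and checks that the integer-arithmetic thresholding decision tracks the floating-point one for a spin glass of the size studied above:

```python
import numpy as np

def quantize(x, scale):
    """Scale and round to integers, mimicking the FPGA's fixed-point data."""
    return np.rint(np.asarray(x) * scale).astype(np.int64)

rng = np.random.default_rng(1)
N, scale = 100, 64
K = rng.uniform(-1.0, 1.0, size=(N, N))
K = (K + K.T) / 2                        # symmetric spin-glass couplings
state = rng.choice([-1, 1], size=N)      # +/-1 spin vector
noise = rng.normal(0.0, 0.5, size=N)
threshold = np.zeros(N)

# One thresholding decision in floating point vs. in integer arithmetic
field_float = K @ state + noise
field_int = quantize(K, scale) @ state + quantize(noise, scale)
decision_float = field_float >= threshold
decision_int = field_int >= quantize(threshold, scale)
agreement = np.mean(decision_float == decision_int)
```

The per-entry quantization error is at most 0.5/s per coupling, so for s = 64 the accumulated error is small compared to the typical field magnitude, and the two decisions agree on nearly every spin, consistent with the MATLAB results above.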

Complexity
In discussing the complexity of our implementation, we must note the dependence on the particular resources of a given FPGA. Here we will discuss the complexity of a single step of our algorithm, consisting of a single matrix multiplication followed by noise addition and threshold comparison. Since the clock speed of any FPGA is variable, we give the complexity in terms of clock cycles.
We focus on computing the complexity of the matrix multiplication, since it dominates the three steps. To perform this matrix multiplication, we must read the entire matrix from memory and feed it to a system of binary trees. Suppose we store each matrix entry using b bits, and that we are working with an N × N size matrix. In our FPGA, we can read 64 bits from each BRAM block per clock cycle, so one limiting factor is the total number k of BRAM blocks we can use concurrently. Note that we are also limited by the maximum number R of bits that can be held in registers, or the working memory, of our FPGA. This allows us to compute a complexity bound.
Assuming the FPGA can read M bits in total per clock cycle, it will take the FPGA at least N²b/M cycles to read the entire matrix. According to the above considerations, M = 64k + 144u, where k (resp. u) is the number of BRAM (resp. URAM) blocks used. In our current implementation, we only used BRAM memory, so u = 0. Alternatively, we are also limited by the amount of data we can work with at a time, R. To complete this matrix multiplication, we must store a binary tree corresponding to each row. The result of these binary trees is a vector of N values to be sent to the noise addition step. Thus, the total amount of data we need to work with in this computation is 2N²b bits, which will take at least 2N²b/R cycles to pass through the registers of the FPGA.
To finish the analysis, note that we may feed new data into the binary trees before the original data has fully propagated through. Thus, the only computation time beyond that required to load the matrix is the time for the final data to propagate through the binary trees. The size of our binary trees is determined by the minimum of the amount of data we can load at a time, M = 64k, and the amount of data (in bits) we can work with at a time, R. Note that we never need a binary tree larger than a row of the matrix, and that the binary tree takes up twice the space of the data input to it. The size of our binary tree is therefore B := min(64k/b, R/(2b), N), and it takes on the order of log₂ B cycles to propagate through. Incorporating the time to load the matrix, the total number of clock cycles per PRIS algorithm iteration is on the order of max(2N²b/R, N²b/(64k)) + log₂ B. Lastly, considering noise addition and thresholding, note that this only involves reading 2Nb more bits from BRAM. The noise addition and thresholding themselves only take two clock cycles, and the time and space necessary to deal with these extra 2Nb bits are negligible compared to those necessary to deal with the N²b bits from the matrix. So these two steps do not change the overall complexity. We also need to load values into the BRAM in the first place, but we do not include this process in the analysis. Thus, the overall complexity of a single iteration of our algorithm is O(max(2N²b/R, N²b/(64k)) + log₂ B). The discussion above is under the assumption that, for large problems (N > 100), the board registers are not sufficient to hold all the data in the matrix. For small problems, we could alternatively use registers and do the multiplication in one clock cycle; the major time consumption would then be the binary tree addition, which takes log₂ B clock cycles.
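The cycle-count bound can be written as a small helper (our own transcription of the formula; the constants N, b, k, and R below are illustrative inputs, not the parameters of our board):

```python
import math

def pris_cycles_per_iteration(N, b, k, R):
    """Clock cycles for one PRIS iteration on the FPGA:
    max(BRAM-read bound, register-bandwidth bound) + tree propagation."""
    read_bound = (N * N * b) / (64 * k)      # 64 bits per BRAM block per cycle
    register_bound = (2 * N * N * b) / R     # working-register limit
    B = min(64 * k // b, R // (2 * b), N)    # binary tree size
    return max(read_bound, register_bound) + math.ceil(math.log2(B))

# Hypothetical example: N = 100, b = 16 bits, k = 32 BRAM blocks, and a
# register budget R large enough that the BRAM-read bound dominates.
cycles = pris_cycles_per_iteration(100, 16, 32, R=10**6)
```

With these illustrative numbers, the read bound is 160000/2048 ≈ 78 cycles and the tree adds ⌈log₂ 100⌉ = 7 cycles, so the read term dominates, as assumed in the large-N discussion above.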
For problems of larger size, we store the matrix in BRAM as mentioned above. We plot the complexity discussed above (Supplementary Figure 15) under different assumptions: (1) using all of the memory blocks on the FPGA (Supplementary Figure 15(a) and (b)), which is optimal but takes time to implement for every problem size; (2) for most reasonably-sized problems, assuming a linear scaling of the number of memory blocks, i.e., M = rNb, which is sub-optimal for a given N but easier to reconfigure for the various N allowing this approximation (we can switch between different problem sizes N by changing the width of the BRAM IPs in our design). We also assumed that R is large, and thus not the limiting factor of the computation.

Experiment
In this experiment, we tested the correctness and runtime for a problem size of N = 100 under a 300 MHz clock. We used 16 bits to encode each datum in the matrix, noise, and threshold. A total of k = 5 BRAM IPs are instantiated in the design, each with a data length of 1600 bits, corresponding to one row of the matrix. (In this design we are using 71% of the BRAM blocks on our board.) The experiment runs as follows. Original test: first, we generate a test case running on floating-point numbers in a Jupyter notebook using Python. We run the test for 200 iterations and record the state vector and energy at each iteration.
Scaling the problem: we scale up the matrix, noise, and threshold by a factor of 128 and round to integers. We run the scaled test for 200 iterations and plot the energy array together with that of the original test to check the accuracy of the algorithm with rounded values. Several scaling factors and the corresponding energy arrays are shown in Supplementary Figure 14, where we see that scaling below 64 shows some deviation from the unrounded algorithm, while scaling above 1024 results in exactly the same sequence of state vectors.
Correctness verification: we load the scaled test onto the FPGA, run it for 200 iterations, and output the state vector at each time step, from which we calculate the energy in Python. The FPGA-computed energies and the Python-calculated energies are exactly the same, confirming the correctness of the implementation.
Timing and complexity analysis: for this part, we only output the state vector and clock count at the very end of the 200 iterations. For our problem of size 100 by 100, a loop of 200 iterations costs 3803 cycles. On average, each time step takes 19 clock cycles (the remaining 3 clock cycles are due to some overhead at the beginning of the run), which is in accordance with our earlier analysis.
The measured clock cycle count per algorithm step is 19 cycles, which is exactly as expected: 19 = 10 + 8 + 1, in which 10 is the number of cycles needed to read the matrix rows from BRAM, 8 is the latency of the binary tree addition, and 1 is the cycle for noise injection and thresholding.