Maximum-Entropy Inference with a Programmable Annealer

Optimisation problems typically involve finding the ground state (i.e. the minimum energy configuration) of a cost function with respect to many variables. If the variables are corrupted by noise then this maximises the likelihood that the solution is correct. The maximum entropy solution on the other hand takes the form of a Boltzmann distribution over the ground and excited states of the cost function to correct for noise. Here we use a programmable annealer for the information decoding problem which we simulate as a random Ising model in a field. We show experimentally that finite temperature maximum entropy decoding can give slightly better bit-error-rates than the maximum likelihood approach, confirming that useful information can be extracted from the excited states of the annealer. Furthermore we introduce a bit-by-bit analytical method which is agnostic to the specific application and use it to show that the annealer samples from a highly Boltzmann-like distribution. Machines of this kind are therefore candidates for use in a variety of machine learning applications which exploit maximum entropy inference, including language processing and image recognition.


Calculation of Continuous Decoded Bit-Error-Rate Curves
Naively, it would appear that to plot message bit-error-rate versus crossover probability curves we would need to take a statistically significant sample for each value of T_Nish. However, in this section we demonstrate that this is not the case: by a careful choice of how we collect and analyze the experimental data, we can efficiently calculate the final bit-error-rate for any value of T_Nish.
This method allows us to plot continuous curves, as displayed in Fig. 4 of the main text.
Let us first consider the general form of the Hamiltonian for our Ising model. Because all bits are subject to the same crossover probability, p, the total probability p(N_corr, p) of having a given number of bits flipped can be written in terms of p, the number of corrupted elements N_corr, and N + M, the total number of elements (couplers and fields) in the Hamiltonian. We can assign to each Hamiltonian a bit-error-rate r_n; for the purposes of this discussion {r} could have either been calculated (for example by exhaustive summing or Bucket Tree Elimination, BTE) or obtained experimentally. We now write the total bit-error-rate as a function of crossover probability, r_tot(p), as the sum of the r_n weighted by the probability of each Hamiltonian.
We now observe that N_corr ∈ {0, 1, 2, ..., N + M}, so for a given value of p there are only N + M + 1 possible values of p(N_corr, p). Based on this observation we can group these terms together to write down the total decoded bit-error-rate (Eq. 3) in terms of sector quantities r̄_s and weights q, where r_{s,n} is the list of bit-error-rates for all Hamiltonians with N_corr = s and N_s is the number of Hamiltonians sampled in sector s.
Based on the formula given in Eq. 3 we can make several observations. First, we note that r̄_s does not depend on p, and q does not depend on any of the decoding rates; therefore knowing the N + M + 1 values of {r̄} allows us to easily calculate r_tot(p) for any value of p. Second, we note that r̄_s need not be an exhaustive sum over all possible Hamiltonians with s corrupted elements; a good approximation of r_tot(p) could be obtained using r̄_s extracted from a representative sample.
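The sector-grouping argument above can be illustrated with a minimal sketch. It assumes that each of the N + M elements is corrupted independently with probability p, so that the sector weight is binomial, and that the sector rate is the plain average of the sampled bit-error-rates in that sector. Neither the explicit form of the weight nor the function names `sector_weight` and `total_bit_error_rate` are taken from the original equations, which did not survive in this text; they are assumptions made for illustration.

```python
from math import comb


def sector_weight(s, n_elements, p):
    """Probability that exactly s of n_elements are corrupted.

    Assumes independent corruption with crossover probability p,
    i.e. a binomial weight (an assumption of this sketch).
    """
    return comb(n_elements, s) * p**s * (1 - p) ** (n_elements - s)


def total_bit_error_rate(sector_rates, p):
    """Combine per-sector bit-error-rates into r_tot(p).

    sector_rates[s] is a non-empty list of decoded bit-error-rates for the
    sampled Hamiltonians with s corrupted elements, s = 0 .. N + M.
    The sector averages are computed once and do not depend on p.
    """
    n_elements = len(sector_rates) - 1
    total = 0.0
    for s, rates in enumerate(sector_rates):
        mean_rate = sum(rates) / len(rates)  # sector average
        total += sector_weight(s, n_elements, p) * mean_rate
    return total
```

Because the sector averages are independent of p, the same `sector_rates` list can be reused to evaluate the curve on an arbitrarily fine grid of p values, which is what makes a continuous curve cheap to plot.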

Confirmation that T_Nish = T is the optimal decoding temperature for our theoretical data
It is not immediately obvious that Fig. 3 obeys the famous result by Nishimori that optimal decoding happens at T_Nish = T. For this reason we have plotted in Fig. S1 several slices of that figure in the T direction, along with the value along the line where T_Nish = T. We immediately note that in that plot the curve representing T_Nish = T crosses the other curves at their global minima, thereby confirming that our theoretical data do in fact agree with this result. It is worth noting, however, that the converse of this famous result is not necessarily true: the minimum in the T_Nish direction does not necessarily occur at T_Nish = T. This arises from the fact that the decoding curves are discontinuous in the T direction.
Figure S1: Theoretically predicted ratio of finite temperature bit-error-rate to zero temperature bit-error-rate versus T for a variety of Nishimori temperatures.
Crossings at minimum values are circled for clarity. Here BER is shorthand for bit-error-rate.

Effect of system size on decoding
As we discussed in the main text, the plateau which can be seen in Figs. 3 and 5 of the main text shrinks, and the discontinuities in the plotted ratio become smaller, as the system size increases. To demonstrate the decrease in the size of the discontinuities with system size, we produce the equivalent of Fig. 3 in the main text but for a truncated version of the Chimera unit cell in which a bit has been removed from each half of the bipartite graph. As Fig. S3 demonstrates, the discontinuities become much more severe for this smaller system.

Estimation of spin-sign transitions for 4xChimera
To estimate the temperatures at which spin-sign transitions occur, we use Bucket Tree Elimination (BTE) code which has been provided by D-Wave Systems Inc. This code acts as a fair Boltzmann sampler and can be made to provide an unbiased sample of N_samp spin states at a given temperature T. We use N_samp = 10^5. We then sample 200 temperatures between T = 0 and T = 7 to determine spin orientations. Fig. S4 gives an example of such data.
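The BTE code itself is not reproduced here. As a stand-in for very small systems only, the thermal spin orientations can be computed exactly by enumerating all states. The sketch below is not the BTE method: it swaps in brute-force enumeration, and it assumes the sign convention E(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i and units in which the Boltzmann constant is 1. The function name `boltzmann_orientations` is hypothetical.

```python
import itertools

import numpy as np


def boltzmann_orientations(J, h, T):
    """Exact thermal averages <s_i> at temperature T > 0 by enumeration.

    J is an n x n coupler matrix (only the upper triangle is used) and h a
    length-n field vector. The cost grows as 2**n, so this is only a
    stand-in for a proper sampler on very small systems.
    """
    J = np.triu(np.asarray(J, dtype=float), 1)
    h = np.asarray(h, dtype=float)
    n = len(h)
    states = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)
    # Assumed convention: E(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i
    energies = -np.einsum("ki,ij,kj->k", states, J, states) - states @ h
    # Subtract the minimum energy before exponentiating for stability.
    weights = np.exp(-(energies - energies.min()) / T)
    weights /= weights.sum()
    return weights @ states
```

Evaluating this function on a grid of temperatures gives one orientation curve per spin, analogous to the sampled curves described above but without sampling noise.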
For each curve in Fig. S4 we can calculate the spin-sign transitions by first taking a running average over 5 neighboring points to suppress multiple spurious transitions and then using linear interpolation to find the zero crossings. An example of this procedure is demonstrated in Fig. S5. This method, however, becomes problematic when the ground state orientation is zero and the orientation remains very close to zero for a wide range of temperatures. As Fig. S6 demonstrates, statistical sampling error then leads to this method finding spurious transitions. Because of these spurious transitions, these spins are excluded from our analysis.
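The smoothing and interpolation procedure described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for the figures: the function name `spin_sign_transitions` is hypothetical, the window is assumed odd so that each average is centred on a grid point, and points where the smoothed curve is exactly zero are not counted as crossings.

```python
import numpy as np


def spin_sign_transitions(temps, orientation, window=5):
    """Estimate temperatures at which a spin orientation changes sign.

    Applies a running average over `window` neighboring points (odd window
    assumed), then linearly interpolates between consecutive smoothed
    points of opposite sign to locate each zero crossing.
    """
    temps = np.asarray(temps, dtype=float)
    orientation = np.asarray(orientation, dtype=float)
    kernel = np.ones(window) / window
    smooth = np.convolve(orientation, kernel, mode="valid")
    half = window // 2
    centres = temps[half:len(temps) - half]  # temperature of each average
    crossings = []
    for i in range(len(smooth) - 1):
        if smooth[i] * smooth[i + 1] < 0:  # strict sign change
            frac = smooth[i] / (smooth[i] - smooth[i + 1])
            crossings.append(centres[i] + frac * (centres[i + 1] - centres[i]))
    return crossings
```

For a curve that hovers near zero over a wide temperature range, sampling noise can still produce many sign changes after smoothing, which is the failure mode that leads to such spins being excluded.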

Hamiltonians which Decode Significantly Differently than Boltzmann
We have not exhaustively searched for Hamiltonians which give significantly different orientations from the Boltzmann distribution at many different values of temperature; however, a few have resulted as byproducts of this project. In the interest of future work, we have included Fig. S6, which displays 4 such Hamiltonians which we have found. Investigation into the cause of these anomalies would be an interesting avenue of future study.
Figure S8: For 10,000 updates the SA data begin to significantly deviate from thermal equilibrium at T/(αJ) ≈ 3.5, while for 1,000,000 updates the data remain in equilibrium down to the temperature of interest, T/(αJ) = 1.405.

Comparison between Simulated Annealing and Bucket Tree Elimination results
While BTE is guaranteed to provide an equilibrium result, it is still interesting to ask whether we can achieve equilibrium through simulated annealing (SA). Fig. S8 demonstrates that for a randomly selected 128-bit Hamiltonian from our data, a linear sweep from T/(αJ) = 10 to T/(αJ) = 1.405 can maintain equilibrium (which can be found using BTE) throughout the entire sweep if 1,000,000 updates are used, but fails to maintain equilibrium if only 10,000 are used, as was the case in Fig. 8 of the main paper.
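A linear temperature sweep of this kind can be sketched with single-spin-flip Metropolis updates. This is a minimal illustration under stated assumptions rather than the simulation code behind the figures: the update rule, the sign convention E(s) = -sum_{i<j} J_ij s_i s_j - sum_i h_i s_i, the function name `simulated_annealing`, and the identification of one "update" with one pass over all spins at a single temperature step are all assumptions of this sketch.

```python
import numpy as np


def simulated_annealing(J, h, t_start, t_end, n_sweeps, rng):
    """Anneal an Ising system with a linear temperature sweep.

    J is an n x n coupler matrix (upper triangle used), h a length-n field
    vector. The temperature is lowered linearly from t_start to t_end
    (t_end > 0) in n_sweeps steps, with one Metropolis pass over all spins
    per step. Returns the final spin configuration.
    """
    upper = np.triu(np.asarray(J, dtype=float), 1)
    coupling = upper + upper.T  # symmetric form for local-field lookups
    h = np.asarray(h, dtype=float)
    n = len(h)
    spins = rng.choice([-1, 1], size=n)
    for temperature in np.linspace(t_start, t_end, n_sweeps):
        for i in range(n):
            # Energy change from flipping spin i under the assumed convention.
            delta_e = 2.0 * spins[i] * (coupling[i] @ spins + h[i])
            if delta_e <= 0 or rng.random() < np.exp(-delta_e / temperature):
                spins[i] = -spins[i]
    return spins
```

Repeating the anneal many times with different seeds and averaging the final spins gives orientation estimates that can be compared against an equilibrium reference; with too few sweeps the averages are expected to freeze in away from equilibrium, which is the effect discussed above.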
We further examine whether we can qualitatively reproduce the behavior seen in Fig. 8a) of the main paper using SA with many updates as well as control error. As Fig. S9 demonstrates, we can reproduce this qualitative behavior. Slight differences between Fig. S9 and Fig. 8a) of the main paper can be attributed either to not all of the Hamiltonians fully equilibrating under SA, or to the fact that restrictions on our available computing power have required us to limit the number of samples used to produce Fig. S9.
Figure S9: Simulated Annealing data subject to control error. These data were taken with a linear sweep starting from T/(αJ) = 5 with 5 × 10^5 samples, 5% field control error and 3% coupler control error. Each randomly generated instance of the error was run for 100 annealing runs and 100 instances were considered for each Hamiltonian.