Abstract
The search for an application of near-term quantum devices is widespread. Quantum machine learning is touted as a potential utilisation of such devices, particularly those out of reach of the simulation capabilities of classical computers. In this work, we study such an application in generative modelling, focussing on a class of quantum circuits known as Born machines. Specifically, we define a subset of this class based on Ising Hamiltonians and show that the circuits encountered during gradient-based training cannot be efficiently sampled from classically up to multiplicative error in the worst case. Our gradient-based training methods use cost functions known as the Sinkhorn divergence and the Stein discrepancy, which have not previously been used in the gradient-based training of quantum circuits, and we also introduce quantum kernels to generative modelling. We show that these methods outperform the previous standard method, which used maximum mean discrepancy (MMD) as a cost function, and achieve this with minimal overhead. Finally, we discuss the ability of the model to learn hard distributions and provide formal definitions for ‘quantum learning supremacy’. We also exemplify the work of this paper by using generative modelling to perform quantum circuit compilation.
Introduction
As quantum devices with ~80–200 qubits, but without fault tolerance, begin to be built, we near the dawn of the noisy intermediate-scale quantum (NISQ)^{1} technology era. Because of the low number of qubits, the limited connectivity between them, and the low circuit depth permitted by short decoherence times, these devices cannot perform many of the most famous algorithms thought to demonstrate exponential speedups over classical algorithms^{2,3}.
In spite of this, NISQ devices could provide efficient solutions to other problems that cannot be solved in polynomial time by classical means. Showing this to be true is referred to as a demonstration of quantum computational supremacy^{4,5,6,7,8,9}, with the first such experimental realisation occurring recently^{10}.
Proposals for demonstrations of quantum computational supremacy on NISQ technology typically involve sampling from the output distribution of random quantum circuits. While a realisation of such an advantage is of great theoretical importance, generating random samples is not obviously independently interesting. We incorporate this sampling into a useful application, keeping the provable quantum advantage, but in a context with more practical applicability.
Specifically, we explore generative modelling in quantum machine learning (QML), which is the task of generalising from a finite set of samples, {y}^{M}, drawn from a data set. By learning the underlying probability distribution from which these samples are drawn, π(y), a model should be able to generate new samples from the said distribution.
Generative models range from simple naive Bayes^{11} models to complicated neural networks, like generative adversarial networks (GANs)^{12}. The intrinsic randomness inherent in quantum mechanics allows for the definition of a new class of generative models that are without a classical analogue. Known as Born machines^{13,14,15}, they have the ability to produce statistics according to Born’s measurement rule. Specifically, for a state |ψ〉, a measurement produces a sample x ~ p(x) = |〈x|ψ〉|^{2}. There are several variants, including Bayesian approaches^{16}, adversarial training methods^{17}, and adaptations to continuous distributions^{18}.
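As a minimal illustration of Born-rule sampling, the sketch below draws bit strings from a given state vector using NumPy. The function names and the bit-ordering convention (qubit 0 as the leftmost bit) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def born_probabilities(state):
    """Return p(x) = |<x|psi>|^2 for a state vector over n qubits."""
    state = np.asarray(state, dtype=complex)
    probs = np.abs(state) ** 2
    return probs / probs.sum()  # renormalise to guard against rounding drift

def sample_bitstrings(state, num_samples, rng):
    """Draw computational-basis samples as rows of bits (qubit 0 = leftmost bit)."""
    probs = born_probabilities(state)
    n = len(probs).bit_length() - 1  # number of qubits for a length-2^n vector
    outcomes = rng.choice(len(probs), size=num_samples, p=probs)
    # Convert each integer outcome to its n-bit binary representation.
    return (outcomes[:, None] >> np.arange(n - 1, -1, -1)) & 1

# Bell state (|00> + |11>)/sqrt(2): only 00 and 11 can be observed.
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
samples = sample_bitstrings(bell, 1000, np.random.default_rng(0))
```

On hardware the state vector is never available; only the samples are, which is why the cost functions discussed later are all estimated from samples.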
Quantum circuit Born machines (QCBM) are a subclass of parameterised quantum circuits (PQCs) and are widely applicable (see ref. ^{19} for a review). PQCs consist of a quantum circuit which carries parameters that are updated during a training process (typically a classical optimisation routine). The circuit is kept as shallow as possible so as to be suitable for NISQ devices.
We ask in this work whether it is possible to have a machine learning application for a PQC that comes with provably superior performance over all classical alternatives on near-term devices. Such provable guarantees are even more relevant given recent work in QML algorithm ‘dequantisations’^{20,21,22,23,24}.
We take the first steps in answering this question in several ways. We define a subclass of QCBM that we call Ising Born machines (QCIBM). We improve the training of the model over previous methods, which use the maximum mean discrepancy^{14} (MMD) with a classical kernel, by introducing quantum kernels into the MMD, as well as by using entirely new cost functions: the Stein discrepancy (SD) and the Sinkhorn divergence (SHD). To do so, we derive their corresponding gradients in the quantum setting.
We show that these novel methods outperform the MMD with classical kernel by achieving a closer fit to the data as measured by the total variation (TV) distance. We derive forms of the SHD, which can either be efficient to compute or result in an upper bound on TV. We observe numerically that the SD provides an upper bound to TV. Next, we show that sampling from this model cannot be simulated efficiently by any classical randomised algorithm, up to multiplicative error in the worst case, subject to common assumptions in complexity theory (namely the non-collapse of the polynomial hierarchy). Furthermore, this holds for many circuit families encountered during training.
We define a framework in which a provable advantage could be demonstrated, which we refer to as quantum learning supremacy (QLS) and which is based on distribution learning theory^{25}. Based on our classical sampling hardness results, we conjecture that the QCIBM may be a good candidate for a quantum model which could demonstrate this notion of learning supremacy; however, we leave the further investigation of QLS and its potential to be achieved by such models to future work. Finally, we provide a novel utilisation of such generative models in quantum circuit compilation.
Results
The main results of this work are new efficient gradient-based training methods and results on the hardness of simulating the model we introduce using classical computers. First, we define the model used and discuss its connection to previously studied quantum circuit families. We then discuss the efficient training of the model, first recalling a previously known gradient-based training method, which uses the MMD cost function, and then moving onto our new training methods, which use the SD and the SHD. We then discuss the SHD complexity in detail, and further argue, using its connection to the TV distance, why it should be used. We then prove the hardness results mentioned above, namely that many circuits encountered during gradient-based training are hard to classically simulate, before finally discussing the potential use of quantum generative models in learning distributions that are intractable to classical models. In addition, we provide a framework to study these advantages.
Ising Born machine
Here we define the model we use for distribution learning. A generic quantum circuit Born machine consists of a parameterised quantum circuit, which produces samples by measuring the resulting quantum state, and a classical optimisation loop used to learn a data distribution. The circuits we study have the following structure:
where x_{i} ∈ {0, 1}; the unitaries are defined by Eqs. (2) and (3); S_{j} indicates the subset of qubits on which each operator, j, is applied; and a boldface parameter indicates a set of parameters, α = {α_{j}}.
The operators, X_{k}, Y_{k}, and Z_{k}, are the standard Pauli operators acting on qubit k. Restricting to the case |S_{j}| ≤ 2 (since only single- and two-qubit gates are required for universal quantum computation), the term in the exponential of Eq. (2) becomes exactly an Ising Hamiltonian:
where we are dividing the diagonal unitary parameters, α = {J_{ij}, b_{k}}, into local terms that act only on qubit k, {b_{k}}, and coupling terms between two qubits i and j, {J_{ij}}. We call the model a QCIBM.
A measurement on all qubits in the computational basis results in sample vectors, \({\mathbf{x}} \in {\cal{X}}^n\), where \({\cal{X}} = \{ 0,1\}\). These samples are drawn from the distribution, p_{θ}(x), parameterised by the set of angles, θ = {α, Γ, Δ, Σ}:
We denote the above model and parameters by QCIBM(θ) := QCIBM(α, Γ, Δ, Σ). We choose this structure in order to easily recover two well-known circuit classes, namely instantaneous quantum polynomial time^{26} (IQP) circuits, and the shallowest depth (p = 1) version of the quantum approximate optimisation algorithm^{27} (QAOA).
IQP circuits are named to reflect the commuting nature of elements in the product defining the unitary U_{z}, while QAOA^{27} was originally developed as an approximate version of the quantum adiabatic algorithm^{28}. Both of these classes of circuits are known to be routes to demonstrate quantum supremacy^{4,6,8,29}, and we extend this property here by using the results of ref. ^{30}. These classes can be recovered by setting the parameters of a QCIBM as follows:
We write, for example, \(\left\{ {\frac{\pi }{{2\sqrt 2 }}} \right\}\) to denote that all parameters of the n single-qubit gates are set to the same value, \(\pi {\mathrm{/}}2\sqrt 2\). We choose the final gate before the computational basis measurement to be in the form of Eq. (3), rather than the Euler decomposition of a single-qubit gate more commonly found in the literature^{14,16}. This is chosen to make the classical simulation hardness results more apparent in our proofs.
To recover IQP circuits, we simply need to generate the final layer of Hadamard gates (up to a global phase) and do so by setting U_{f} in Eq. (3) as follows:
To recreate depth-1 QAOA circuits, we need to set the Pauli Z and Y parameters, Δ, Σ = 0, since the final gates should be a product of Pauli-X rotations with parameters, −Γ.
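The structure described in this section can be sketched as a small state-vector simulation. The sketch assumes, since the displayed equations are not reproduced here, that the circuit applies a layer of Hadamards to |0〉^{⊗n}, then the diagonal Ising unitary exp(i(Σ_{i<j} J_{ij}Z_{i}Z_{j} + Σ_{k} b_{k}Z_{k})), then a final single-qubit gate exp(i(Γ_{k}X_{k} + Δ_{k}Y_{k} + Σ_{k}Z_{k})) on each qubit. The function names are hypothetical.

```python
import numpy as np
from functools import reduce

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def final_gate(gamma, delta, sigma):
    """exp(i(gamma X + delta Y + sigma Z)) for one qubit, in closed form."""
    r = np.sqrt(gamma**2 + delta**2 + sigma**2)
    if r == 0:
        return np.eye(2, dtype=complex)
    A = (gamma * X + delta * Y + sigma * Z) / r  # A @ A is the identity
    return np.cos(r) * np.eye(2) + 1j * np.sin(r) * A

def qcibm_probs(J, b, gammas, deltas, sigmas):
    """Output distribution under the assumed layout: H^n, Ising phases, final gates."""
    J, b = np.asarray(J, dtype=float), np.asarray(b, dtype=float)
    n = len(b)
    # Bits of every basis state, with qubit 0 as the most significant bit.
    bits = (np.arange(2**n)[:, None] >> np.arange(n - 1, -1, -1)) & 1
    z = 1 - 2 * bits  # Z eigenvalues: +1 for bit 0, -1 for bit 1
    # Ising energy sum_{i<j} J_ij z_i z_j + sum_k b_k z_k for each basis state.
    energy = np.einsum("xi,ij,xj->x", z, np.triu(J, 1), z) + z @ b
    state = np.exp(1j * energy) / np.sqrt(2**n)  # diagonal unitary applied to |+>^n
    gates = [final_gate(g, d, s) for g, d, s in zip(gammas, deltas, sigmas)]
    U_f = reduce(np.kron, gates)  # qubit 0 is the leftmost tensor factor
    return np.abs(U_f @ state) ** 2
```

With Γ_{k} = Σ_{k} = π/2√2 and Δ_{k} = 0, `final_gate` returns the Hadamard gate up to a global phase of i, which is the IQP setting described above. With all Ising parameters set to zero, the circuit then returns the all-zero string with probability 1.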
Training the Ising Born machine
Here we introduce the alternative training methods that we use for our purposes and that would be applicable to any generative model. The training procedure is a hybrid of classical and quantum computation, with the only quantum component being the model itself. The remainder of the computation is classical, bringing our scheme into the realm of what is possible for NISQ devices. The procedure can be seen in Fig. 1.
The optimisation procedures we implement are stochastic gradient descents. Parameters, θ_{k}, are updated at each epoch of training, d, according to the rule \(\theta _k^{d + 1} \leftarrow \theta _k^d - \eta \,\partial _{\theta _k}{\cal{L}}_B\). The parameter η is the learning rate and controls the speed of the descent. The initial proposals to train QCBMs were gradient-free^{15,31}, but gradient-based methods have also been proposed^{14,16,32}. In this work, we advocate for increasing the classical computational power required in training to achieve better performance, rather than increasing the quantum resources, for example by adding extra ancillae^{16} or adding costly and potentially unstable (quantum) adversaries^{17,33,34}.
For gradient-based methods, a cost function or metric, \({\cal{L}}_B\left( {p_{\mathbf{\theta }}({\mathbf{x}}),\pi ({\mathbf{y}})} \right)\), is required to compare the Born machine distribution, p_{θ}(x), and the data distribution, π(y). Good cost functions will have several characteristics. They should be efficient to compute, measured both by sample and computational complexity. They should also be powerful in the sense that they are sensitive to differences between the two distributions. In this work, we will assess sensitivity by using the TV metric as a benchmark:
As discussed later, this is a particularly strong metric. The cost functions we use are typically easier to deal with than TV, and we will remark on their relationship to TV.
One cost function commonly used to train generative models is the Kullback–Leibler (KL) divergence. The KL divergence is also relatively strong, in the sense that it upper bounds TV through Pinsker’s inequality:
where D_{KL}(p_{θ}‖π) is the KL divergence of π from p_{θ}. Unfortunately, it is difficult to compute, having a high sample complexity, so neither its gradient nor the KL divergence itself can be evaluated efficiently when training parameterised circuits^{14}.
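For small systems where both probability vectors are known explicitly, TV and the KL divergence can be evaluated directly and Pinsker’s inequality checked numerically. The sketch below assumes the standard definitions TV(p, π) = ½ Σ_{x} |p(x) − π(x)| and D_{KL}(p‖π) = Σ_{x} p(x) log(p(x)/π(x)) with natural logarithms, for which Pinsker’s inequality reads TV ≤ √(D_{KL}/2). The example distributions are arbitrary.

```python
import numpy as np

def total_variation(p, q):
    """TV distance between two probability vectors on the same sample space."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """D_KL(p || q) in nats; requires q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
tv = total_variation(p, q)
pinsker_bound = np.sqrt(0.5 * kl_divergence(p, q))
```

Both functions need the full probability vectors, which is exactly what is unavailable for a Born machine at scale; they serve here only as benchmarks.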
The first efficient gradient method to train Born machines was proposed by ref. ^{14}. There the MMD is used to define the cost function. We extend this methodology in two ways. The first is an alteration to the MMD itself, and the second is by introducing new cost functions. From the MMD, the following cost function^{35,36} can be defined:
The MMD has some very favourable properties; it is a metric on the space of probability distributions, and it is relatively easy to compute (due to low sample complexity). The function κ in Eq. (11) is a kernel function, a measure of similarity between points in the sample space \({\mathbf{x}} \in {\cal{X}}^n\). A popular choice for this function is the Gaussian mixture kernel^{14}:
The parameters, σ_{i}, are bandwidths that determine the scale at which the samples are compared, and ‖⋅‖_{2} is the \(\ell _2\) norm.
Recent works^{37,38} on the near-term advantage of using quantum computers in QML have explored quantum kernels, which can be evaluated on a quantum computer. To gain such an advantage, these kernels should be difficult to compute on a classical device. In particular, we will adopt the following kernel^{37} in which the samples are encoded in a quantum state, |ϕ(x)〉, via a feature map, ϕ: x → |ϕ(x)〉. The kernel is the inner product between vectors:
The inner product in Eq. (13) is evaluated on a quantum computer and is conjectured to be hard to compute on a classical one^{37}, given only a classical description of the quantum states. The state |ϕ(x)〉 is produced by applying an encoding unitary to an initial state, \(\left| {\phi ({\mathbf{x}})} \right\rangle = {\cal{U}}_{\phi ({\mathbf{x}})}\left| 0 \right\rangle ^{ \otimes n}\). Explicitly, the kernel is then given by:
which can be calculated by measuring, in the computational basis, the state which results from running the circuit given by \({\cal{U}}_{\phi ({\mathbf{y}})}\), followed by that of \({\cal{U}}_{\phi ({\mathbf{x}})}^\dagger\). This is seen in Fig. 1. The kernel, Eq. (14), is the observed probability of measuring the all-zero outcome, 0^{n}. If this outcome is not observed after polynomially many measurements, the value of the kernel for this particular pair of samples (x, y) is set to zero. Intuitively, this means the feature map has mapped the original points to points with at most exponentially small overlap in the Hilbert space and therefore will not contribute to the MMD.
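A sample-based estimate of the MMD cost with a Gaussian mixture kernel can be sketched as follows. The sketch assumes the usual form of the squared MMD, E_{x,x′∼p}[κ(x, x′)] + E_{y,y′∼π}[κ(y, y′)] − 2E_{x∼p,y∼π}[κ(x, y)], and a kernel that averages exp(−‖x − y‖_{2}^{2}/2σ_{i}) over the bandwidths. The bandwidth values and the use of the biased (V-statistic) estimator are illustrative choices rather than the settings of the paper.

```python
import numpy as np

def gaussian_mixture_kernel(X, Y, bandwidths=(0.25, 10.0, 1000.0)):
    """Kernel matrix with entries averaged over several Gaussian bandwidths."""
    X = np.atleast_2d(X).astype(float)
    Y = np.atleast_2d(Y).astype(float)
    # Squared Euclidean distance between every row of X and every row of Y.
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.mean([np.exp(-sq_dist / (2 * s)) for s in bandwidths], axis=0)

def mmd_squared(X, Y, kernel=gaussian_mixture_kernel):
    """Biased estimate of the squared MMD from samples X ~ p and Y ~ pi."""
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

# Rows are bit strings; A and B stand in for model samples and data samples.
A = np.array([[0, 0], [0, 1], [1, 1]])
B = np.array([[1, 0], [1, 0], [1, 1]])
```

The cost only ever touches pairs of samples, so its evaluation scales with the square of the number of samples rather than with the size of the sample space.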
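The estimation procedure can be mimicked classically for a toy example. The feature map below is a hypothetical product-state encoding chosen only so that the overlap is easy to compute; it is not the feature map of ref. ^{37} and is certainly not classically hard. The finite-shot estimate is modelled as a binomial draw with success probability equal to the exact all-zero probability.

```python
import numpy as np
from functools import reduce

def feature_state(x, angle=np.pi / 3):
    """Toy product feature map (hypothetical): qubit i is rotated by angle * x_i."""
    qubits = [np.array([np.cos(angle * xi / 2), np.sin(angle * xi / 2)]) for xi in x]
    return reduce(np.kron, qubits)

def exact_kernel(x, y):
    """|<phi(x)|phi(y)>|^2, the probability of the all-zero outcome."""
    overlap = np.vdot(feature_state(x), feature_state(y))
    return float(np.clip(abs(overlap) ** 2, 0.0, 1.0))  # clip rounding overshoot

def estimated_kernel(x, y, shots, rng):
    """Fraction of shots returning all zeros; this is 0 if it is never observed."""
    zeros_seen = rng.binomial(shots, exact_kernel(x, y))
    return zeros_seen / shots

rng = np.random.default_rng(0)
k_hat = estimated_kernel([0, 0], [1, 1], shots=500, rng=rng)
```

When the true overlap is far below 1/shots, the estimate is zero with high probability, which is the truncation described above.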
It is also necessary to derive an expression for the gradient of the cost function. For the MMD, the gradient with respect to the kth parameter^{14}, carried by the kth unitary gate, U_{k}(θ_{k}), is given by:
where \(p_{\theta _k}^ \pm\) are output distributions generated by running the following auxiliary circuits^{39,40} for each unitary gate, U_{k}(θ_{k}):
where \(\theta _k^ \pm : = \theta _k \pm \pi {\mathrm{/}}2\) and U_{l:m} := U_{l}U_{l+1}…U_{m−1}U_{m} are the unitary gates in the Born machine. The gradient takes this form because the unitary gates in our case are exponentiated Pauli operators, U_{k}(θ_{k}) = exp(iθ_{k}Σ_{k}), with \({\mathrm{\Sigma }}_k^2 = {\Bbb I}\). With the unitaries in this form, the gradient of the probabilities outputted from the parameterised state, with respect to a parameter θ, is given by^{14,40}:
There is a slight difference between Eq. (17) and that of ref. ^{14}, due to a different parameterisation of the unitaries above.
The gradients of the cost functions which we introduce next will also require the parameter-shifted circuits in Eq. (16). For more details on kernel methods and the MMD, see Supplementary Material Section II.
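The idea behind parameter-shifted circuits can be checked numerically on a single qubit. The sketch below assumes the common convention U(θ) = exp(−iθX/2), for which the derivative of every outcome probability equals half the difference of the distributions at θ ± π/2. The shift values, sign, and prefactor depend on the parameterisation of the gates, so the constants here need not coincide with those of Eq. (17).

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)

def rx(theta):
    """exp(-i theta X / 2), using X @ X = identity."""
    return np.cos(theta / 2) * np.eye(2) - 1j * np.sin(theta / 2) * X

def output_probs(theta):
    """Distribution over outcomes {0, 1} after applying rx(theta) to |0>."""
    state = rx(theta) @ np.array([1, 0], dtype=complex)
    return np.abs(state) ** 2

def parameter_shift_gradient(theta):
    """Gradient of each outcome probability from two shifted circuits."""
    shifted_plus = output_probs(theta + np.pi / 2)
    shifted_minus = output_probs(theta - np.pi / 2)
    return 0.5 * (shifted_plus - shifted_minus)

theta = 0.7
grad = parameter_shift_gradient(theta)
```

Unlike a finite-difference estimate, the shifted circuits give the exact derivative, and each is a circuit of the same form as the original, so it can be sampled on the same device.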
SD training
So far, we have only proposed a change of kernel in the MMD method of training QCIBMs. We now consider changing the cost function altogether. We endeavour to find costs which are efficient to compute for quantum models, yet stronger than MMD.
The first cost we propose is called the SD. SD has become popular for goodness-of-fit tests^{41}, i.e. testing whether samples come from a particular distribution or not, as opposed to the MMD, which is typically used for kernel two-sample tests^{36}. This discrepancy is based on Stein’s method^{42}, which is a way to bound distance metrics between probabilities including, for example, the other integral probability metrics (IPM) we utilise in this work. For details on IPMs, see Supplementary Material Section I.
We use the discrete version of the SD^{43} since, in its original form^{41}, it only caters for the case where the distributions are supported over a continuous space. The discretisation is necessary since the QCIBM outputs binary strings and so the standard gradient w.r.t. a sample, x, ∇_{x}, is undefined. As such, we need to use a discrete ‘shift’ operator, Δ_{x}, instead, which is an operator defined by [Δ_{x}f(x)]_{i} := f(x) − f(¬_{i}x) for a function f, where ¬_{i} flips the ith element of the binary vector x.
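The discrete shift operator defined above is straightforward to implement for functions of binary vectors. The helper names below are hypothetical.

```python
import numpy as np

def flip(x, i):
    """Return a copy of the binary vector x with its i-th element flipped."""
    y = np.array(x, dtype=int, copy=True)
    y[i] ^= 1
    return y

def discrete_shift(f, x):
    """Vector with i-th entry f(x) - f(flip(x, i)), i.e. [Delta_x f(x)]_i."""
    x = np.asarray(x, dtype=int)
    fx = f(x)
    return np.array([fx - f(flip(x, i)) for i in range(len(x))])

# Example: f counts the ones in x, so flipping a 0 raises f and flipping a 1 lowers it.
shifted = discrete_shift(lambda v: v.sum(), [0, 1, 1])
```

Each evaluation costs n + 1 calls to f, one for x itself and one per flipped bit.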
Fortunately, the discretisation procedure is relatively straightforward (the necessary definitions and proofs can be found in Supplementary Material Section III). The discrepancy is derived^{41,44} from the (discrete) Stein identity^{43}, given by:
where \(\mathop {{\Bbb E}}\limits_{{\mathbf{x}} \sim \pi }\) denotes the expectation value over the distribution, π. This holds for any function \(\phi :{\cal{X}}^n \to {\Bbb C}\) and probability mass function π on \({\cal{X}}^n\). The function s_{π}(x) = Δ_{x} log(π(x)) is the Stein score function of the distribution π, and \({\cal{A}}_\pi\) is a so-called Stein operator of π. Now, the SD cost function can be written in a kernelised form^{41,43}, similarly to the MMD:
where κ_{π} is the Stein kernel and κ is a usual positive semi-definite kernel. \({\mathrm{\Delta }}_{\mathbf{x}}^ \ast\) is a conjugate version of the operator Δ_{x}, but for our purposes, the behaviour of \({\mathrm{\Delta }}_{\mathbf{x}}^ \ast\) and Δ_{x} is identical. For completeness, we define it in generality in Supplementary Material Section III.
Just as above, the gradient (derived in an identical fashion to the MMD gradient Eq. (15) as is detailed in Supplementary Material Section III) of \({\cal{L}}_{{\mathrm{SD}}}\) with respect to the parameter, θ_{k}, is given by:
We show that almost every term in Eqs. (20) and (22) can be computed efficiently, even when the quantum kernel κ_{Q} from Eq. (13) is used in Eq. (21), that is, with the exception of the score function s_{π} with respect to the data distribution. The score contains an explicit dependence on the data distribution, π. If we are given oracle access to the probabilities, π(y), then there is no issue and SD will be computable. Unfortunately, in any practical application this will not be the case.
To deal with such a scenario, we give two approaches to approximate the score via samples from π. The first of these we call the ‘Identity’ method since it inverts Stein’s identity^{45} from Eq. (18). We refer to the second as the ‘Spectral’ method since it uses a spectral decomposition^{46} of a kernel to approximate the score. The latter approach uses the Nyström method^{47}, which is a technique used to approximately solve integral equations. We will only use the Spectral method in training the QCIBM in the numerical results in Fig. 3, since the Identity method does not give an immediate out-of-sample method to compute the score. Details of these methods can be found in Supplementary Material Section III.
Notice that, even with the difficulty in computing the score, the SD is still more suitable for training these models than the KL divergence as the latter requires computing the circuit probabilities, p_{θ}(x), which is in general intractable, and so could not be computed for any data set.
SHD training
The second cost function we consider is the so-called SHD. This is a relatively new method to compare probability distributions^{48,49,50}, defined by the following:
where \({\it{\epsilon }} \ge 0\) is a regularisation parameter, c(x, y) is a Lipschitz ‘cost’ function, and \({\cal{U}}(p_{\boldsymbol{\uptheta }},\pi )\) is the set of all couplings between p_{θ} and π, i.e. the set of all joint distributions, whose marginals with respect to x and y are p_{θ}(x) and π(y), respectively. The above cost function, \({\cal{L}}_{{\mathrm{SHD}}}^{\it{\epsilon }}\), is particularly favourable as a candidate because of its relationship to the theory of optimal transport^{51} (OT), a method to compare probability distributions. OT has become a major tool used to train models in the classical domain, for example GANs^{52}, through the Wasserstein metric, a special case of OT obtained when the cost, c(x, y), is chosen to be a metric on the space \({\cal{X}}^n\).
We would like to use OT itself to train generative models, due to its metric properties. Unfortunately, OT has high computational cost and exponential sample complexity^{53}. For this reason, the SHD was proposed in refs ^{48,49,50} to interpolate between OT and the MMD as a function of the regularisation parameter \({\it{\epsilon }}\) in Eq. (24). In particular, for the two extreme values of \({\it{\epsilon }}\), we recover^{48} both unregularised OT and the MMD:
As before, we need a gradient of the \({\cal{L}}^{\epsilon}_{{\mathrm{SHD}}}\) with respect to the parameters, which is given by:
where φ(x) is a function that depends on the optimal solutions found to the regularised OT problem in Eq. (24). See Supplementary Material Section IV for more details on the SHD and its gradient.
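For explicit probability vectors on a small sample space, the regularised OT problem can be solved with Sinkhorn iterations. The sketch below assumes the common formulation in which the regulariser is ε times the KL divergence of the coupling from the product of its marginals, and in which the divergence is the debiased combination OT_{ε}(p, π) − ½OT_{ε}(p, p) − ½OT_{ε}(π, π). Whether this matches Eqs. (23) and (24) term by term is an assumption, and in training the quantities would be estimated from samples rather than from known probabilities.

```python
import numpy as np
from itertools import product

def _logsumexp(z, axis):
    """Numerically stable log(sum(exp(z))) along one axis."""
    zmax = z.max(axis=axis, keepdims=True)
    return np.squeeze(zmax, axis=axis) + np.log(np.exp(z - zmax).sum(axis=axis))

def regularised_ot(a, b, C, eps, iters=500):
    """Entropy-regularised OT value between strictly positive histograms a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    log_a, log_b = np.log(a), np.log(b)
    f, g = np.zeros(len(a)), np.zeros(len(b))  # dual potentials
    for _ in range(iters):
        # Alternating dual updates (Sinkhorn iterations in the log domain).
        f = -eps * _logsumexp(log_b[None, :] + (g[None, :] - C) / eps, axis=1)
        g = -eps * _logsumexp(log_a[:, None] + (f[:, None] - C) / eps, axis=0)
    return f @ a + g @ b  # dual objective value after the last update

def sinkhorn_divergence(a, b, C, eps):
    """Debiased regularised OT, which vanishes when a equals b."""
    return (regularised_ot(a, b, C, eps)
            - 0.5 * regularised_ot(a, a, C, eps)
            - 0.5 * regularised_ot(b, b, C, eps))

# Hamming cost between all 2-bit strings, and two example distributions.
strings = np.array(list(product([0, 1], repeat=2)))
C = np.abs(strings[:, None, :] - strings[None, :, :]).sum(axis=-1).astype(float)
p = np.array([0.7, 0.1, 0.1, 0.1])
q = np.array([0.1, 0.1, 0.1, 0.7])
shd = sinkhorn_divergence(p, q, C, eps=0.5)
```

Subtracting the two self-transport terms removes the bias that entropic regularisation introduces, so that the cost is zero for identical distributions.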
Sinkhorn complexity
The sample complexity of the SHD is of great interest to us as we claim that the TV and the KL are not suitable to be directly used as cost functions. This is due to the difficulty of computing the outcome probabilities of quantum circuits efficiently. We now analyse why the MMD is a weak cost function and why the SHD should be used as an alternative. This will depend critically on the regularisation parameter \({\it{\epsilon }}\), which allows a smooth interpolation between the OT metric and the MMD.
First, we address the computability of \({\cal{L}}^{\epsilon}_{{\mathrm{SHD}}}\) and we find, due to the results of ref. ^{54}, a somewhat ‘optimal’ value for \({\it{\epsilon }}\), for which the sample complexity of \({\cal{L}}_{{\mathrm{SHD}}}\) becomes efficient. Specifically, the mean error between \({\cal{L}}_{{\mathrm{SHD}}}\) and its approximation \({\hat {\cal{L}}}_{{\mathrm{SHD}}}^{\it{\epsilon}}\) for n qubits, computed using M samples, scales as:
We show in Supplementary Material Section IV.1 that by choosing \({\it{\epsilon }} = {\cal{O}}(n^2)\), we get:
which is the same sample complexity as the MMD^{55} but exponentially better than that of unregularised OT, which scales as \({\cal{O}}\left( {1{\mathrm{/}}M^{1/n}} \right)\)^{53}.
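The gap between the two rates is easy to quantify by ignoring constant factors. If the error decays as M^{−1/r}, then reaching an error of 1/k requires roughly M = k^{r} samples, with r = 2 for the MMD and the suitably regularised SHD and r = n for unregularised OT. The helper below is hypothetical and works in exact integer arithmetic.

```python
def samples_for_error(inverse_error, r):
    """Samples M needed so that M**(-1/r) equals 1/inverse_error (constants ignored)."""
    return inverse_error ** r

# Target error of 0.1 (inverse_error = 10) for a 10-qubit problem.
mmd_like = samples_for_error(10, 2)        # rate 1/sqrt(M)
unregularised = samples_for_error(10, 10)  # rate 1/M**(1/n) with n = 10
```

For this target, the first rate needs about a hundred samples and the second about ten billion, and the gap widens exponentially with n.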
A similar result can be derived using a concentration bound^{54}, such that, with probability 1 − δ,
where we have chosen the same scaling for \({\it{\epsilon }}\) as in Eq. (29). Therefore, we can choose an optimal theoretical value for the regularisation, such that \({\cal{L}}_{{\mathrm{SHD}}}\) is sufficiently far from OT to be efficiently computable but perhaps still retains some of its favourable properties. It is likely in practice, however, that a much lower value of \({\it{\epsilon }}\) could be chosen without a blow up in sample complexity^{49,54}. See Supplementary Material Section IV for derivations of the above results.
Second, we can relate the \({\cal{L}}_{{\mathrm{SHD}}}\) to unregularised OT and TV via a sequence of inequalities. We have mentioned that the MMD is weak, meaning it provides a lower bound on TV in the following way^{55}:
if \(C: = {\mathrm{sup}}_{{\mathbf{x}} \in {\cal{X}}^n}\kappa ({\mathbf{x}},{\mathbf{x}})\, < \, \infty\).
Note that for the two kernels introduced earlier:
hence C = 1 and the lower bound is immediate.
In contrast, as is seen from the inequality on a discrete sample space in Eq. (34)^{56}, the Wasserstein metric (unregularised OT) provides an upper bound on TV, and hence we would expect it to be stronger than the MMD.
where \({\mathrm{diam}}({\cal{X}}^n) = {\mathrm{max}}\{ d({\mathbf{x}},{\mathbf{y}}):{\mathbf{x}},{\mathbf{y}} \in {\cal{X}}^n\}\), d_{min} = min_{x≠y} d(x, y), and d(x, y) is the metric on the space, \({\cal{X}}^n\). This arises by choosing c = d and \({\it{\epsilon }} = 0\) in Eq. (24). If, for instance, we were to choose d(x, y) to be the \(\ell _1\) metric between the binary vectors of length n (a.k.a. the Hamming distance), then we get that \(d_{{\rm{min}}} = 1,{\mathrm{diam}}({\cal{X}}^n) = n\), and so:
Finally, we can examine the relationship induced by the regularisation parameter through the following inequality; Theorem 1 in ref. ^{54}:
where the size of the sample space is bounded by D, as measured by the metric, and L is the Lipschitz constant of the cost c. As detailed in Supplementary Material Section IV.1, we can choose D = n and L = n:
The log term will be positive as long as \({\it{\epsilon }} \le n{\mathrm{e}}^2\), in which case regularised OT will give an upper bound for the Wasserstein metric, and hence the TV through Eq. (34), so we arrive at:
Unfortunately, comparing this with Eqs. (29) and (30), we can see that, with this scaling of \({\it{\epsilon }}\), the sample complexity would pick up an exponential dependence on the dimension, n, so it would not be efficiently computable. We comment further on this point later.
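The values d_{min} = 1 and diam = n used above for the Hamming distance can be confirmed by brute force for small n. The function name is hypothetical, and the enumeration is exponential in n, so it is only meant as a sanity check.

```python
import numpy as np
from itertools import product

def hamming_extremes(n):
    """Return (d_min, diameter) of the Hamming metric on n-bit strings."""
    strings = np.array(list(product([0, 1], repeat=n)))
    # Pairwise l1 distances between all 2^n bit strings.
    D = np.abs(strings[:, None, :] - strings[None, :, :]).sum(axis=-1)
    off_diagonal = D[~np.eye(len(strings), dtype=bool)]  # exclude x == y
    return int(off_diagonal.min()), int(D.max())

d_min, diam = hamming_extremes(4)
```

The minimum is attained by strings differing in one bit and the maximum by a string and its bitwise complement.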
Numerical performance
In Figs 2–4, we illustrate the superior performance of our alternative training methods, as measured by the TV distance. A lower TV indicates that the model is able to learn parameters which fit the true data more closely. TV was chosen as an objective benchmark for several reasons. First, it is typically the notion of distance that is required by quantum supremacy experiments where one wants to prove hardness of classical simulation. Second, we use it in the definitions of QLS. Finally, it is one of the strongest notions of convergence in probability one can ask for, so it follows that a training procedure that can more effectively minimise TV, in an efficient way, should be better for generative modelling.
We train the model on Rigetti’s Forest platform^{57} using both a simulator and the real quantum hardware, the Aspen QPU. Figure 2 illustrates the training of the model using the Gaussian (Eq. (12)) vs. the quantum kernel (Eq. (13)) for 4 qubits, and we see that the quantum kernel offers no significant advantage vs. training with a purely classical one. Figure 2a shows the TV as trained for 200 epochs, using both the classical and quantum kernels with various learning rates. Figure 2b shows the learned probabilities outputted after training with each kernel, and Fig. 2c shows the difference in the actual \({\cal{L}}_{{\mathrm{MMD}}}\) itself while training with both methods. Interestingly, the latter behaviour is quite different for both kernels, with the quantum kernel initialising with much higher values of \({\cal{L}}_{{\mathrm{MMD}}}\), whereas they both minimise TV in qualitatively the same way. This indicates that hardness of classical simulation (of computing the kernel) does not imply an advantage in learning.
On the other hand, a noticeable outperformance is observed for the SHD and the SD relative to training with the MMD (using a Gaussian kernel), as measured by TV in Fig. 3. Furthermore, we observed that the gap (highlighted in the inset in Fig. 3a) which separates the SHD and SD (red and blue lines) from the MMD (green, yellow, and cyan lines) grows as the number of qubits grows. Unfortunately, the Spectral method to approximate the Stein score does not outperform the MMD, despite training successfully. The discrepancy between the true and approximate versions of the Stein score is likely due to the low number of samples used to approximate the score, with the number of samples limited by the computational inefficiency. We leave tuning the hyperparameters of the model in order to get better performance to future work.
This behaviour is shown to persist on the QPU, Fig. 4, where we show training of the model with both the MMD and SHD relative to TV, (Fig. 4a), the learned probabilities of both methods on, and off, the QPU (Fig. 4b), and the behaviour of the cost functions associated with both methods (Fig. 4c, d). This reinforces our theoretical argument that the SHD is able to better minimise TV to achieve superior results.
Given the performance noted above, we would recommend the SHD as the primary candidate for future training of these models, due to its simplicity and competitive performance. One should also note that we do not attempt to learn these data distributions exactly since we use a shallow fixed circuit structure for training (i.e. a QAOA circuit), which we do not alter. Better fits to the data could likely be achieved with deeper circuits with more parameters.
For extra numerical results demonstrating the performance of the learning algorithms, see Supplementary Material Section V, including a comparison between the quantum and Gaussian kernels for two qubits, similar to Fig. 2; the behaviour of the corresponding cost functions themselves associated with Fig. 3; the performance of the model for 4 qubits, similar to Fig. 3; and the results using a 3-qubit device, the Aspen-4-3Q-A. In all cases, the performance was qualitatively similar to that reported in the main text.
Hardness and quantum advantage
It is crucially important, not just for our purposes but for the design of QML algorithms in general, that the algorithm itself provides some advantage over any classical one for the same task. This is the case for so-called coherent algorithms, like the HHL linear equation solver^{3}, which is BQP-complete, and therefore unlikely to be fully dequantised. However, such a proven advantage for near-term QML algorithms is still out of reach. We attempt to address such a question in two steps.

1. We show that, for a large number of parameter values, θ, our QCIBM circuits are ‘hard’. That is to say, they cannot be efficiently simulated classically up to a multiplicative error, in the worst case. We also show that this holds for the auxiliary quantum circuits used for the gradient estimation, and hence the model may remain hard during training (although we do not know for sure).

2. We provide formal definitions for QLS, the ability of a quantum model to provably outperform all classical models in a certain task, and a potential pathway to prove such a thing.
The intuition behind point 2 is the following. If our QCIBM model could learn a target distribution π, which demonstrates quantum supremacy, by providing a quantum circuit C close enough to π (i.e. below a threshold error in TV), then the model would have demonstrated something that is classically infeasible. Else there would exist an efficient classical algorithm that can get close to π, which contradicts hardness.
Point 1 does not completely fit that intuition. For one thing, hardness is not known to hold for the required notion of additive error (i.e. TV distance) but only for multiplicative error. Also, even though the model is more expressive than any classical model^{16}, this does not imply that it could actually learn a hard distribution. On the other hand, it is easy to see why the converse would be true: if the QCIBM could learn a distribution that is hard to sample from classically, the underlying circuit must have, at some point, reached a circuit configuration for which the output distribution is hard to classically sample.
We can address point 1 informally (see Supplementary Material Section VI for the formal statements and proof) in three steps:

If the parameters of the model are initialised randomly in \(\{ {\mathbf{\alpha }}\} = \{ J_{ij},b_k\} \in \{ 0,\frac{\pi }{8}, \ldots ,\frac{{7\pi }}{8}\}\) and final measurement angles are chosen such that U_{f}(Γ, Δ, Σ) = H^{⊗n}, then the resulting QCIBM circuit class will be hard to simulate up to an additive error of 1/384 in TV distance, subject to a conjecture relating to the hardness of computing the Ising partition function^{6}.

If certain configurations of the parameters are chosen to be either of the form, (2l + 1)π/kd, where l and d are integers and k is a number that depends on the circuit family, or in the form 2πν, where ν is irrational, then the resulting class of circuits will be hard to sample from classically, up to a multiplicative error, in the worst case.

The circuits produced at each epoch as a result of the gradient updates will each result in a hard circuit class as long as the gradient updates are not chosen carelessly. In each epoch, if the update step is constrained in a way that the new value of the parameter \(\theta _k^{d + 1} = \theta _k^d - \eta \partial _{\theta _k}{\cal{L}}_B\) remains of the form 2πν with ν irrational, then the updated circuits will also belong to a class that is hard to simulate (a similar result can be shown for the case where the parameters are updated to keep within the form of (2l + 1)π/kd). This is because the updates can simply be absorbed into the original gates, to give a circuit which has the same form. This holds also for the gradient-shifted circuits in Eq. (16) since these correspond to circuits whose parameters are updated as follows: \(\theta _k^{d, \pm } \leftarrow \theta _k^d \pm \pi {\mathrm{/}}2\).
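The initialisation, update and shift rules in the three steps above can be sketched on a toy example. The sketch below does not implement the QCIBM: it assumes a single-qubit rotation R_x(θ) applied to |0⟩, for which the probability of outcome 0 is cos²(θ/2), and it uses the standard parameter-shift convention with a prefactor of 1/2, so the sign and prefactor may differ from those of Eq. (16). All function names are hypothetical.

```python
import math
import random


def random_initial_angle(rng):
    """Draw an angle uniformly from {0, pi/8, ..., 7*pi/8}."""
    return rng.randrange(8) * math.pi / 8


def prob_zero(theta):
    """Toy model (assumption): P(outcome 0) after R_x(theta) acts on |0>."""
    return math.cos(theta / 2.0) ** 2


def shifted_parameters(theta):
    """Parameters of the two gradient-shifted circuits: theta +/- pi/2."""
    return theta + math.pi / 2, theta - math.pi / 2


def parameter_shift_gradient(f, theta):
    """Derivative of f at theta from evaluations at the two shifted parameters."""
    plus, minus = shifted_parameters(theta)
    return 0.5 * (f(plus) - f(minus))


def gradient_step(theta, grad, eta):
    """One update theta^{d+1} = theta^d - eta * grad."""
    return theta - eta * grad


rng = random.Random(0)
theta = random_initial_angle(rng)
grad = parameter_shift_gradient(prob_zero, theta)
theta_next = gradient_step(theta, grad, eta=0.1)
```

In this toy model both the shifted and the updated parameters are again plain rotation angles of the same gate, which mirrors the observation that the updates can be absorbed into the original gates.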
We now provide definitions to meet the requirements of point 2, adapting definitions from distribution learning theory^{25} for this purpose. Specifically, we say that a generative QML algorithm, \({\cal{A}} \in {\mathrm{BQP}}\) (with a small abuse of notation) has demonstrated QLS if there exists a class of probability distributions \({\cal{D}}_n\) over \({\cal{X}}^n\) (bit vectors of length n), for which there exists a metric d and a fixed \({\it{\epsilon }}\) such that \({\cal{D}}_n\) is \((d,{\it{\epsilon }},{\mathrm{BQP}})\)-learnable via \({\cal{A}}\) but not \(\left( {d,{\it{\epsilon }},{\mathrm{BPP}}} \right)\)-learnable (i.e. not learnable by a purely classical algorithm). The task of the learning algorithm \({\cal{A}}\) is, given a target distribution \(D \in {\cal{D}}_n\), to output, with high probability, a Generator, GEN_{D′}, for a distribution D′, such that D′ is close to \(D \in {\cal{D}}_n\) with respect to the metric d. For the precise definitions of learnability we employ, see Supplementary Material Section VII.
This framework is very similar to that of, and inspired by, probably approximately correct (PAC) learning, which has been well studied in the quantum case^{58}, but it applies more closely to the task of generative modelling. It is known that, in certain cases, the use of quantum computers can be beneficial to PAC learning but not generically^{59}. Based on this, it is possible that there exist some classes of distributions that cannot be efficiently learned by classical computers (BPP algorithms) but that could be learned by quantum devices (BQP algorithms). The motivation for this is exactly rooted in the question of quantum supremacy and illustrated crudely in Fig. 5b.
An initial attempt at QLS is as follows. As mentioned above, if random IQP circuits could be classically simulated to within a TV error of \({\it{\epsilon }} = 1{\mathrm{/}}384\)^{6} in the worst case (with high probability over the choice of circuit), this would imply unlikely consequences for complexity theory. Now, if a generative quantum model was able to achieve a closeness in TV less than this constant value, perhaps by minimising one of the upper bounds in Eq. (38), then we could claim that this model had achieved something classically intractable. For example, if we make the following assumptions,

1. QCIBM could achieve a TV < δ to a target IQP distribution.

2. A classical probabilistic algorithm, C, could output a distribution q in polynomial time which was γ close in TV to the QCIBM, i.e. it could simulate it efficiently.
Then
where the third line follows from the triangle inequality. Therefore, C could simulate an IQP distribution also, and we arrive at a contradiction.
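The chain of inequalities behind this argument can be sketched as follows. The symbols \(p_{\mathrm{IQP}}\) for the target IQP distribution and \(p_{\theta}\) for the QCIBM output distribution are notation assumed here for illustration; δ, γ and q are as in the two assumptions above.

```latex
\begin{aligned}
\mathrm{TV}(p_{\mathrm{IQP}}, q)
  &= \frac{1}{2}\sum_{x} \left| p_{\mathrm{IQP}}(x) - q(x) \right| \\
  &= \frac{1}{2}\sum_{x} \left| p_{\mathrm{IQP}}(x) - p_{\theta}(x) + p_{\theta}(x) - q(x) \right| \\
  &\le \mathrm{TV}(p_{\mathrm{IQP}}, p_{\theta}) + \mathrm{TV}(p_{\theta}, q) \\
  &< \delta + \gamma .
\end{aligned}
```

Under this sketch, if δ + γ is no larger than the constant \({\it{\epsilon }} = 1{\mathrm{/}}384\), the classical algorithm C would sample from the IQP distribution to within a TV error that is conjectured to be classically infeasible.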
The major open question left by this work is whether QLS is possible at all; can a quantum model outperform all classical ones in generative learning? This idea motivated our search for metrics that upper bound TV but yet were efficiently computable and therefore could be minimised to efficiently learn distributions to a sufficiently small value of TV. Unfortunately, we can see from the exponential scaling observed in Eq. (38), which gives the upper bound on TV by regularised OT, that SHD will not provably achieve this particular task, despite achieving our primary goal of being stronger than the MMD for generative modelling. We briefly discuss avenues of future research in the ‘Discussion’ section, which could provide alternative routes to QLS.
Quantum compiling
As a concrete application of such a model outside the scope of classical generative modelling, we can use the QCIBM training to perform a type of ‘weak’ quantum circuit compilation. There are potentially other areas that could be studied using these tools or by applying techniques in generative modelling to other quantum information processing tasks, but this is beyond the scope of this work.
The major objective in this area is to compile a given target unitary, U, into one that consists exclusively of operations available to the native hardware of the quantum computer in question. For example, in the case of Rigetti’s Aspen QPU, the native gates are {R_{x}(±π/2), R_{z}(θ), CZ}^{57,60}, and any unitary which a user wishes to implement must be compiled into a unitary V that contains only these ingredients.
Potential solutions to this problem^{60,61} involve approximating the target unitary by assuming that V is a parametric circuit built from the native gates, which can be trained by some optimisation strategy. We adopt a similar view here, but we do not require any extra quantum resources to perform the compilation. With this limitation, we make a trade-off in that we are not guaranteed to apply the same target unitary, only that the output distribution will be close to that produced by the target. Clearly this is a much weaker constraint than the task of direct compilation, since many unitaries may give rise to the same distribution, but it is much closer to the capabilities of near-term devices. To illustrate this application, we train a QCIBM, itself restricted to a QAOA architecture, to learn the output distribution of a random IQP circuit, using \({\cal{L}}_{{\mathrm{SHD}}}\) as a cost function. The process is illustrated in Eq. (43), where we try to determine suitable QAOA parameters, \(\{ J_{ij}^{{\mathrm{QAOA}}},b_k^{{\mathrm{QAOA}}}\}\), which reproduce the distribution observed from a set of random IQP parameters, \(\{ J_{ij}^{{\mathrm{IQP}}},b_k^{{\mathrm{IQP}}}\}\).
The measurement unitary at the end of the circuit makes this process nontrivial, since it gives rise to significantly different distributions, even given the same parameters in U_{z}. We illustrate this in Fig. 6 using the Rigetti 2q-qvm and for three qubits in Supplementary Material Section V. We find that, even though the learned parameter values are different from the target, the resulting distributions are quite similar, as expected.
Discussion
Providing provable guarantees of the superior performance of near-term quantum computers relative to any classical device for some particular nontrivial application is an important milestone of the field. We have shown one potential route towards this goal by combining complexity-theoretic arguments^{4,7,9} with an application in generative machine learning^{14,15,16,62} and improved training methods of generative models. Specifically, we introduced the Ising Born machine, a restricted form of a quantum circuit Born machine. These models utilise the Born rule of quantum mechanics to train a parameterised quantum circuit as a generative machine learning model, in a hybrid manner.
We proved that the model cannot be simulated efficiently by any classical algorithm up to a multiplicative error in the output probabilities, which holds for many circuit families that may be encountered during gradient-based training. As such, this type of model is a good candidate for a provable quantum advantage in QML using NISQ devices. To formalise this intuition, we defined a notion of QLS to rigorously define what such an advantage would look like, in the context of machine learning.
We adapted novel training methods for generative modelling in two ways: first, by introducing quantum kernels to be evaluated on the quantum hardware and, second, by proposing and adapting new cost functions. In the case of SHD, we discussed its sample complexity and used this to define a somewhat optimal cost function through a judicious choice of the regularisation parameter. It is possible to choose this parameter such that the cost is efficiently computable even as the number of qubits grows. We showed numerically that these methods can outperform previous methods on the random data set we used as a test case.
Finally, we demonstrated an application of the model as a heuristic compiler to compile one quantum circuit into another via classical optimisation techniques, which has the advantage of requiring minimal quantum overhead. These techniques could potentially be adapted into methods to benchmark and verify near term quantum devices.
The major question that this work raises is whether or not a provable notion of quantum learning could be achievable for a particular data set, thereby solidifying a use case for quantum computers in the near term with provable advantage. The best prospect for this is the quantum supremacy distributions we know of (for example, IQP), but they are not efficiently testable^{63}. Owing to this, they are also likely to not be efficiently learnable either, given the close relationship between distribution testing and learning^{64}. Furthermore, we can see from the exponential scaling required in Eq. (38) for regularised OT to upper bound TV that other techniques are necessary to achieve QLS, since the methods we present here are not suited to this particular task, despite achieving our goal of being stronger than the MMD for generative modelling. However, this assumes that we have access only to classical samples from the distribution, and the possibility of gaining an advantage using quantum samples^{38,58} is unexplored in the context of distribution learning.
Methods
In this section, we detail the methods used to train the QCIBM to reproduce a given probability distribution. The target distribution is the one given by Eq. (44), which is used in both refs ^{65,66} to train versions of the quantum Boltzmann machine:
To generate this data, T binary strings of length n, written s_{k} and called ‘modes’, are chosen randomly. A sample y is then produced with a probability that depends on its Hamming distance d_{H}(s_{k}, y) to each mode. In all of the above, the Adam^{67} optimiser was applied, using the suggested hyperparameters, i.e. \(\beta _1 = 0.9,\;\beta _2 = 0.999,\;{\it{\epsilon }} = 1 \times 10^{  8},\) and initial learning rate, η_{init}. This was chosen since it was found to be more robust to sampling noise^{14}.
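A minimal sketch of how such a target distribution could be constructed for small n is given below. It assumes a mixture form of the kind used for quantum Boltzmann machine training^{65}, in which each mode contributes a weight p^{n − d_H}(1 − p)^{d_H}, averaged over the T modes. The value p = 0.9 and all function names are illustrative assumptions, not the settings used in this work.

```python
import itertools
import random


def hamming_distance(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def random_modes(n, num_modes, rng):
    """Choose num_modes random binary strings of length n (the 'modes')."""
    return [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(num_modes)]


def target_distribution(n, modes, p=0.9):
    """Assumed mixture: pi(y) = (1/T) * sum_k p^(n - d_k) * (1 - p)^d_k,
    where d_k is the Hamming distance from y to mode s_k."""
    dist = {}
    for y in itertools.product((0, 1), repeat=n):
        weights = [
            p ** (n - hamming_distance(s, y)) * (1 - p) ** hamming_distance(s, y)
            for s in modes
        ]
        dist[y] = sum(weights) / len(modes)
    return dist


def sample(dist, num_samples, rng):
    """Draw bit strings according to the probabilities in dist."""
    outcomes = list(dist)
    probs = [dist[y] for y in outcomes]
    return rng.choices(outcomes, weights=probs, k=num_samples)


rng = random.Random(1)
modes = random_modes(n=3, num_modes=2, rng=rng)
pi = target_distribution(3, modes)
data = sample(pi, 10, rng)
```

Each mode contributes a product distribution over the n bits, so the mixture is normalised by construction; the enumeration over all 2^n strings is only practical for the small qubit numbers considered here.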
In all of the numerical results, we used a QAOA structure as the underlying circuit in the QCIBM. Specifically, the parameters in U_{f} were chosen such that ∀k, Γ_{k} = π/4, Δ_{k} = 0, and Σ_{k} = 0. The Ising parameters {J_{ij}, b_{k}} were initialised randomly.
For the SD, we used 3 Nyström eigenvectors to approximate the Spectral score in Fig. 3 for 3 qubits, and 6 eigenvectors for 4 qubits. In all cases when using the MMD with a Gaussian kernel, we chose the bandwidth parameters, σ = [0.25, 10, 1000]^{14}. Note that this article was previously published as a preprint^{68}.
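To make the role of the bandwidth parameters concrete, the following sketch evaluates a squared MMD between two explicitly known distributions over bit strings, using a Gaussian kernel averaged over the three bandwidths quoted above. The kernel normalisation, the use of 2σ rather than 2σ² in the exponent, and the exact rather than sample-based evaluation are assumptions made for illustration; function names are hypothetical.

```python
import itertools
import math

# Bandwidths quoted in the text for the Gaussian kernel.
BANDWIDTHS = (0.25, 10.0, 1000.0)


def gaussian_mixture_kernel(x, y, bandwidths=BANDWIDTHS):
    """Average of exp(-|x - y|^2 / (2 * sigma)) over the given bandwidths."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(math.exp(-sq_dist / (2.0 * s)) for s in bandwidths) / len(bandwidths)


def mmd_squared(p, q, kernel=gaussian_mixture_kernel):
    """Exact squared MMD: sum over x, y of (p(x) - q(x)) k(x, y) (p(y) - q(y))."""
    support = sorted(set(p) | set(q))
    diff = {x: p.get(x, 0.0) - q.get(x, 0.0) for x in support}
    return sum(diff[x] * kernel(x, y) * diff[y] for x in support for y in support)


n = 2
strings = list(itertools.product((0, 1), repeat=n))
uniform = {x: 1.0 / len(strings) for x in strings}
peaked = {x: (0.7 if x == (0, 0) else 0.1) for x in strings}

# Zero when the distributions coincide, positive otherwise.
same = mmd_squared(uniform, uniform)
different = mmd_squared(uniform, peaked)
```

In training, the expectation values would instead be estimated from samples of the model and the data; the exact form is used here only because it is easy to check by hand for two qubits.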
Data availability
Data and simulations presented in this work are available from the corresponding author upon request. Code used in this work is available from the corresponding author upon request or on GitHub^{69}.
References
Preskill, J. Quantum computing in the NISQ era and beyond. Quantum 2, 79 (2018).
Shor, P. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM J. Comput. 26, 1484–1509 (1997).
Harrow, A. W., Hassidim, A. & Lloyd, S. Quantum algorithm for linear systems of equations. Phys. Rev. Lett. 103, 150502 (2009).
Bremner, M. J., Jozsa, R. & Shepherd, D. J. Classical simulation of commuting quantum computations implies collapse of the polynomial hierarchy. Proc. R. Soc. Lond. A 467, 459–472 (2011).
Gao, X., Wang, S.-T. & Duan, L.-M. Quantum supremacy for simulating a translation-invariant Ising spin model. Phys. Rev. Lett. 118, 040502 (2017).
Bremner, M. J., Montanaro, A. & Shepherd, D. J. Average-case complexity versus approximate simulation of commuting quantum computations. Phys. Rev. Lett. 117, 080501 (2016).
Aaronson, S. & Arkhipov, A. The computational complexity of linear optics. Theory Comput. 9, 143–252 (2013).
Farhi, E. & Harrow, A. W. Quantum supremacy through the quantum approximate optimization algorithm. Preprint at http://arxiv.org/abs/1602.07674 (2016).
Boixo, S. et al. Characterizing quantum supremacy in near-term devices. Nat. Phys. 14, 595–600 (2018).
Arute, F. et al. Quantum supremacy using a programmable superconducting processor. Nature 574, 505–510 (2019).
Maron, M. E. Automatic indexing: an experimental inquiry. J. ACM 8, 404–417 (1961).
Goodfellow, I. J. et al. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27 (eds Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D. & Weinberger, K. Q.), pp. 2672–2680 (Curran Associates, Inc., 2014).
Cheng, S., Chen, J. & Wang, L. Information perspective to probabilistic modeling: Boltzmann machines versus Born machines. Entropy 20, 583 (2018).
Liu, J.-G. & Wang, L. Differentiable learning of quantum circuit Born machines. Phys. Rev. A 98, 062324 (2018).
Benedetti, M. et al. A generative modeling approach for benchmarking and training shallow quantum circuits. npj Quantum Inf. 5, 1–9 (2019).
Du, Y., Hsieh, M.-H., Liu, T. & Tao, D. The expressive power of parameterized quantum circuits. Preprint at http://arxiv.org/abs/1810.11922 (2018).
Zeng, J., Wu, Y., Liu, J.-G., Wang, L. & Hu, J. Learning and inference on generative adversarial quantum circuits. Phys. Rev. A 99, 052306 (2019).
Romero, J. & Aspuru-Guzik, A. Variational quantum generators: generative adversarial quantum machine learning for continuous distributions. Preprint at http://arxiv.org/abs/1901.00848 (2019).
Benedetti, M., Lloyd, E., Sack, S. & Fiorentini, M. Parameterized quantum circuits as machine learning models. Quantum Sci. Technol. 4, 043001 (2019).
Tang, E. Quantum-inspired classical algorithms for principal component analysis and supervised clustering. Preprint at http://arxiv.org/abs/1811.00414 (2018).
Tang, E. A quantum-inspired classical algorithm for recommendation systems. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, 217–228 (2019).
Andoni, A., Krauthgamer, R. & Pogrow, Y. On solving linear systems in sublinear time. Preprint at http://arxiv.org/abs/1809.02995 (2018).
Chia, N.-H., Lin, H.-H. & Wang, C. Quantum-inspired sublinear classical algorithms for solving low-rank linear systems. Preprint at http://arxiv.org/abs/1811.04852 (2018).
Gilyén, A., Lloyd, S. & Tang, E. Quantum-inspired low-rank stochastic regression with logarithmic dependence on the dimension. Preprint at http://arxiv.org/abs/1811.04909 (2018).
Kearns, M. et al. On the learnability of discrete distributions. In Proc. Twenty-sixth Annual ACM Symposium on Theory of Computing 273–282 (ACM, New York, NY, 1994).
Shepherd, D. & Bremner, M. J. Temporally unstructured quantum computation. Proc. R. Soc. A. https://doi.org/10.1098/rspa.2008.0443 (2009).
Farhi, E., Goldstone, J. & Gutmann, S. A Quantum approximate optimization algorithm. Preprint at http://arxiv.org/abs/1411.4028 (2014).
Farhi, E., Goldstone, J., Gutmann, S. & Sipser, M. Quantum computation by adiabatic evolution. Preprint at http://arxiv.org/abs/quant-ph/0001106 (2000).
Bremner, M. J., Montanaro, A. & Shepherd, D. J. Achieving quantum supremacy with sparse and noisy commuting quantum computations. Quantum 1, 8 (2017).
Fujii, K. & Morimae, T. Commuting quantum circuits and complexity of Ising partition functions. New J. Phys. 19, 033003 (2017).
Leyton-Ortega, V., Perdomo-Ortiz, A. & Perdomo, O. Robust implementation of generative modeling with parametrized quantum circuits. Preprint at http://arxiv.org/abs/1901.08047 (2019).
Hamilton, K. E., Dumitrescu, E. F. & Pooser, R. C. Generative model benchmarks for superconducting qubits. Phys. Rev. A 99, 062323 (2019).
Lloyd, S. & Weedbrook, C. Quantum generative adversarial learning. Phys. Rev. Lett. 121, 040502 (2018).
Dallaire-Demers, P.-L. & Killoran, N. Quantum generative adversarial networks. Phys. Rev. A 98, 012324 (2018).
Borgwardt, K. M. et al. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22, e49–e57 (2006).
Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B. & Smola, A. J. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems 19 (eds. Schölkopf, B., Platt, J. C. & Hoffman, T.) 513–520 (MIT Press, 2007).
Havlíček, V. et al. Supervised learning with quantum-enhanced feature spaces. Nature 567, 209–212 (2019).
Schuld, M. & Petruccione, F. Supervised Learning with Quantum Computers. Quantum Science and Technology (Springer International Publishing, 2018).
Mitarai, K., Negoro, M., Kitagawa, M. & Fujii, K. Quantum circuit learning. Phys. Rev. A 98, 032309 (2018).
Schuld, M., Bergholm, V., Gogolin, C., Izaac, J. & Killoran, N. Evaluating analytic gradients on quantum hardware. Phys. Rev. A 99, 032331 (2019).
Liu, Q., Lee, J. D. & Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48, 276–284 (JMLR.org, New York, NY, 2016).
Stein, C. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proc. Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. 583–602 (University of California Press, Berkeley, CA, 1972).
Yang, J., Liu, Q., Rao, V. & Neville, J. Goodness-of-fit testing for discrete distributions via Stein discrepancy. In Proc. 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research (eds Dy, J. & Krause, A.) 5561–5570 (PMLR, Stockholm, 2018).
Gorham, J. & Mackey, L. Measuring sample quality with Stein’s method. In Advances in Neural Information Processing Systems 28 (eds Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R.) 226–234 (Curran Associates, Inc., 2015).
Li, Y. & Turner, R. E. Gradient estimators for implicit models. In 6th International Conference on Learning Representations (ICLR) 2018, Vancouver, BC, Canada (OpenReview.net, 2018).
Shi, J., Sun, S. & Zhu, J. A spectral approach to gradient estimation for implicit distributions. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, 4644–4653 (eds Dy, J. & Krause, A.) (PMLR, 2018).
Nyström, E. J. Über die praktische auflösung von integralgleichungen mit anwendungen auf randwertaufgaben. Acta Math. 54, 185–204 (1930).
Ramdas, A., Trillos, N. G. & Cuturi, M. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19, 47 (2017).
Genevay, A., Peyre, G. & Cuturi, M. Learning generative models with Sinkhorn divergences. In Proc. Twenty-First International Conference on Artificial Intelligence and Statistics, Vol. 84 (eds Storkey, A. & Perez-Cruz, F.) 1608–1617 (PMLR, Playa Blanca, 2018).
Feydy, J. et al. Interpolating between optimal transport and MMD using Sinkhorn divergences. In Proc. Machine Learning Research, Vol. 89 (eds Chaudhuri, K. & Sugiyama, M.) 2681–2690 (PMLR, 2019).
Villani, C. Optimal Transport: Old and New [Grundlehren der mathematischen Wissenschaften] (Springer, Berlin, 2009).
Arjovsky, M., Chintala, S. & Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research 70, 214–223 (eds Precup, D. & Teh, Y. W.), International Convention Centre, Sydney, Australia (PMLR, 2017).
Dudley, R. M. The speed of mean Glivenko-Cantelli convergence. Ann. Math. Stat. 40, 40–50 (1969).
Genevay, A., Chizat, L., Bach, F., Cuturi, M. & Peyré, G. Sample complexity of Sinkhorn divergences. In Proceedings of Machine Learning Research 89, 1574–1583 (eds Chaudhuri, K. & Sugiyama, M.) (PMLR, 2019).
Sriperumbudur, B. K., Fukumizu, K., Gretton, A., Schölkopf, B. & Lanckriet, G. R. G. On integral probability metrics, φ-divergences and binary classification. Preprint at http://arxiv.org/abs/0901.2698 (2009).
Gibbs, A. L. & Su, F. E. On choosing and bounding probability metrics. Int. Stat. Rev. 70, 419–435 (2002).
Smith, R. S., Curtis, M. J. & Zeng, W. J. A practical quantum instruction set architecture. Preprint at http://arxiv.org/abs/1608.03355 (2016).
Arunachalam, S. & de Wolf, R. Guest column: A survey of quantum learning theory. ACM SIGACT News 48, 41–67 (2017).
Arunachalam, S., Grilo, A. B. & Sundaram, A. Quantum hardness of learning shallow classical circuits. Preprint at http://arxiv.org/abs/1903.02840 (2019).
Khatri, S. et al. Quantum-assisted quantum compiling. Quantum 3, 140 (2019).
Jones, T. & Benjamin, S. C. Quantum compilation and circuit optimisation via energy dissipation. Preprint at http://arxiv.org/abs/1811.03147 (2018).
Gao, X., Zhang, Z. & Duan, L. An efficient quantum algorithm for generative machine learning. Sci. Adv. 12, https://doi.org/10.1126/sciadv.aat9004 (2018).
Hangleiter, D., Kliesch, M., Eisert, J. & Gogolin, C. Sample complexity of device-independently certified “quantum supremacy”. Phys. Rev. Lett. 122, 210502 (2019).
Goldreich, O., Goldwasser, S. & Ron, D. Property testing and its connection to learning and approximation. J. ACM 45, 653–750 (1998).
Amin, M. H., Andriyash, E., Rolfe, J., Kulchytskyy, B. & Melko, R. Quantum Boltzmann machine. Phys. Rev. X 8, 021050 (2018).
Verdon, G., Broughton, M. & Biamonte, J. A quantum algorithm to train neural networks using low-depth circuits. Preprint at http://arxiv.org/abs/1712.05304 (2017).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR) 2015 (eds Bengio, Y. & LeCun, Y.) (San Diego, CA, USA, 2015).
Coyle, B., Mills, D., Danos, V. & Kashefi, E. The Born supremacy: quantum advantage and training of an Ising Born machine. Preprint at http://arxiv.org/abs/1904.02214 (2019).
Coyle, B. IsingBornMachine. https://zenodo.org/record/3779865#.XqvfknVKhrk (2020).
Acknowledgements
B.C. thanks Andru Gheorghiu for useful discussions and title suggestion. We also thank Jean Feydy, Patric Fulop, Vojtech Havlicek, and Jiasen Yang for clarifying pointers. We thank Atul Mantri for comments on the manuscript. This work was supported by the Engineering and Physical Sciences Research Council (grants EP/L01503X/1, EP/N003829/1), EPSRC Centre for Doctoral Training in Pervasive Parallelism at the University of Edinburgh, and School of Informatics and Entrapping Machines (grant FA9550-17-1-0055). We also thank Rigetti Computing for the use of their quantum compute resources. Views expressed in this paper are those of the authors and do not necessarily reflect the views or policies of Rigetti Computing.
Contributions
B.C. devised the theoretical aspects of the work and wrote the code for the numerical results with help from D.M.; D.M. contributed to the learning supremacy definitions. E.K. and V.D. supervised the work. All authors contributed to the manuscript writing.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Coyle, B., Mills, D., Danos, V. et al. The Born supremacy: quantum advantage and training of an Ising Born machine. npj Quantum Inf 6, 60 (2020). https://doi.org/10.1038/s41534-020-00288-9