As quantum devices with ~80−200 qubits, but without fault tolerance, begin to be built, we near the dawn of the noisy intermediate scale quantum (NISQ)1 technology era. Because of the low number of qubits, the limited connectivity between them, and the low circuit depth permitted by low decoherence times, these devices cannot perform many of the most famous algorithms thought to demonstrate exponential speedups over classical algorithms2,3.

In spite of this, NISQ devices could provide efficient solutions to other problems that cannot be solved in polynomial time by classical means. Showing this to be true is referred to as a demonstration of quantum computational supremacy4,5,6,7,8,9, with the first such experimental realisation occurring recently10.

Proposals for demonstrations of quantum computational supremacy on NISQ technology typically involve sampling from the output distribution of random quantum circuits. While a realisation of such an advantage is of great theoretical importance, generating random samples is not obviously independently interesting. We incorporate this sampling into a useful application, keeping the provable quantum advantage, but in a context with more practical applicability.

Specifically, we explore generative modelling in quantum machine learning (QML), which is the task of generalising from a finite set of samples, {y}^M, drawn from a data set. By learning the underlying probability distribution from which these samples are drawn, π(y), a model should be able to generate new samples from that distribution.

Generative models range from simple naive Bayes11 models to complicated neural networks, like generative adversarial networks (GANs)12. The randomness inherent in quantum mechanics allows for the definition of a new class of generative models that are without a classical analogue. Known as Born machines13,14,15, they have the ability to produce statistics according to Born’s measurement rule. Specifically, for a state |ψ〉, a measurement produces a sample x ~ p(x) = |〈x|ψ〉|^2. There are several variants, including Bayesian approaches16, adversarial training methods17, and adaptations to continuous distributions18.

Quantum circuit Born machines (QCBM) are a subclass of parameterised quantum circuits (PQCs) and are widely applicable (see ref. 19 for a review). PQCs consist of a quantum circuit which carries parameters that are updated during a training process (typically a classical optimisation routine). The circuit is kept as shallow as possible so as to be suitable for NISQ devices.

We ask in this work whether it is possible to have a machine learning application for a PQC that comes with provably superior performance over all classical alternatives on near-term devices. Such provable guarantees are even more relevant given recent work in QML algorithm ‘dequantisations’20,21,22,23,24.

We take the first steps in answering this question in several ways. We define a subclass of QCBM that we call Ising Born machines (QCIBM). We improve the training of the model over previous methods, which use the maximum mean discrepancy14 (MMD) with a classical kernel, by introducing quantum kernels into the MMD, as well as by using entirely new cost functions: the Stein discrepancy (SD) and the Sinkhorn divergence (SHD). To do so, we derive their corresponding gradients in the quantum setting.

We show that these novel methods outperform the MMD with classical kernel by achieving a closer fit to the data as measured by the total variation (TV) distance. We derive forms of the SHD, which can either be efficient to compute or result in an upper bound on TV. We observe numerically that the SD provides an upper bound to TV. Next, we show that sampling from this model cannot be simulated efficiently by any classical randomised algorithm, up to multiplicative error in the worst case, subject to common assumptions in complexity theory (namely the non-collapse of the polynomial hierarchy). Furthermore, this holds for many circuit families encountered during training.

We define a framework in which a provable advantage could be demonstrated, which we refer to as quantum learning supremacy (QLS) and which is based on distribution learning theory25. Based on our classical sampling hardness results, we conjecture that the QCIBM may be a good candidate for a quantum model which could demonstrate this notion of learning supremacy; however, we leave the further investigation of QLS and its potential to be achieved by such models to future work. Finally, we provide a novel utilisation of such generative models in quantum circuit compilation.


The main results of this work are new efficient gradient-based training methods and results on the hardness of simulating the model we introduce using classical computers. First, we define the model used and discuss its connection to previously studied quantum circuit families. We then discuss the efficient training of the model, first recalling a previously known gradient-based training method, which uses the MMD cost function, and then moving onto our new training methods, which use the SD and the SHD. We then discuss the SHD complexity in detail, and further argue, using its connection to the TV distance, why it should be used. We then prove the hardness results mentioned above, namely that many circuits encountered during gradient-based training are hard to classically simulate, before finally discussing the potential use of quantum generative models in learning distributions that are intractable to classical models. In addition, we provide a framework to study these advantages.

Ising Born machine

Here we define the model we use for distribution learning. A generic quantum circuit Born machine consists of a parameterised quantum circuit, which produces samples by measuring the resulting quantum state, and a classical optimisation loop used to learn a data distribution. The circuits we study have the following structure:


where xi ∈ {0, 1}; the unitaries are defined by Eqs. (2) and (3); Sj indicates the subset of qubits on which each operator, j, is applied; and a boldface parameter indicates a set of parameters, α = {αj}.

$$U_z({\mathbf{\alpha }}): = \mathop {\prod}\limits_j {U_z} \left( {\alpha _j,S_j} \right) = \mathop {\prod}\limits_j {{\mathrm{exp}}} \left( {{\mathrm{i}}\alpha _j\mathop { \otimes }\limits_{k \in S_j} Z_k} \right)$$
$$U_f\left( {{\mathbf{\Gamma }},{\mathbf{\Delta }},{\mathbf{\Sigma }}} \right){\mathrm{: = exp}}\left( {{\mathrm{i}}\mathop {\sum}\limits_{k = 1}^n {{\mathrm{\Gamma }}_k} X_k + {\mathrm{\Delta }}_kY_k + {\mathrm{\Sigma }}_kZ_k} \right)$$

The operators, Xk, Yk, and Zk, are the standard Pauli operators acting on qubit k. Restricting to the case |Sj| ≤ 2 (since only single and two-qubit gates are required for universal quantum computation), the term in the exponential of Eq. (2) becomes exactly an Ising Hamiltonian:

$${\cal{H}}: = {\mathrm{i}}\mathop {\sum}\limits_{i < j} {J_{ij}} Z_iZ_j + {\mathrm{i}}\mathop {\sum}\limits_{k = 1}^n {b_k} Z_k,$$

where we are dividing the diagonal unitary parameters, α = {Jij, bk}, into local terms that act only on qubit k, {bk}, and coupling terms between two qubits i and j, {Jij}. We call the model a QCIBM.

A measurement on all qubits in the computational basis results in sample vectors, \({\mathbf{x}} \in {\cal{X}}^n\), where \({\cal{X}} = \{ 0,1\}\). These samples are drawn from the distribution, pθ(x), parameterised by the set of angles, θ = {α, Γ, Δ, Σ}:

$$p_{\mathbf{\theta }}({\mathbf{x}}): = \left| {\left\langle {\mathbf{x}} \right|U_f\left( {{\mathbf{\Gamma }},{\mathbf{\Delta }},{\mathbf{\Sigma }}} \right)U_z({\mathbf{\alpha }})\left| + \right\rangle ^{ \otimes n}} \right|^2.$$

We denote the above model and parameters by QCIBM(θ) := QCIBM(α, Γ, Δ, Σ). We choose this structure in order to easily recover two well-known circuit classes, namely instantaneous quantum polynomial time26 (IQP) circuits, and the shallowest depth (p = 1) version of the quantum approximate optimisation algorithm27 (QAOA).

IQP circuits are named to reflect the commuting nature of elements in the product defining the unitary Uz, while QAOA27 was originally developed as an approximate version of the quantum adiabatic algorithm28. Both of these classes of circuits are known to be routes to demonstrate quantum supremacy4,6,8,29, and we extend this property here by using the results of ref. 30. These classes can be recovered by setting the parameters of a QCIBM as follows:

$${\mathrm{IQP}}(\{ J_{ij},b_k\} ) = {\mathrm{QCIBM}}\left( {\{ J_{ij},b_k\} ,{\mathbf{\Gamma }} = \left\{ {\frac{\pi }{{2\sqrt 2 }}} \right\},{\mathbf{0}},{\mathbf{\Sigma }} = \left\{ {\frac{\pi }{{2\sqrt 2 }}} \right\}} \right),$$
$${\mathrm{QAOA}}_{p = 1}(\{ J_{ij},b_k\} ,{\mathbf{\Gamma }}) = {\mathrm{QCIBM}}\left( {\{ J_{ij},b_k\} ,{\mathbf{\Gamma }} = - {\mathbf{\Gamma }},{\mathbf{0}},{\mathbf{0}}} \right).$$

By, for example, \(\left\{ {\frac{\pi }{{2\sqrt 2 }}} \right\}\), we denote all parameters of the n single-qubit gates being set to the same value, \(\pi {\mathrm{/}}2\sqrt 2\). We choose the final gate before the computational basis measurement to be in the form of Eq. (3), rather than the Euler decomposition of a single-qubit gate more commonly found in the literature14,16. This choice makes the classical simulation hardness results more apparent in our proofs.

To recover IQP circuits, we simply need to generate the final layer of Hadamard gates (up to a global phase) and do so by setting Uf in Eq. (3) as follows:

$$U_f^{{\mathrm{IQP}}}\left( {\left\{ {\frac{\pi }{{2\sqrt 2 }}} \right\},{\mathbf{0}},\left\{ {\frac{\pi }{{2\sqrt 2 }}} \right\}} \right) = \mathop {\otimes}\limits_{k = 1}^n{\mathrm{e}}^{\frac{{{\mathrm{i}}\pi }}{{2\sqrt 2 }}\left( {X_k + Z_k} \right)} = {\mathrm{i}}H^{ \otimes n}.$$

To recreate depth 1 QAOA circuits, we need to set the Pauli Y and Z parameters to zero, Δ = Σ = 0, since the final gates should be a product of Pauli-X rotations with parameters, −Γ.
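
The model of Eqs. (2)–(5) can be illustrated with a small brute-force statevector simulation. The following is only a sketch for a handful of qubits, not the implementation used in this work: the function names (`single_qubit_gate`, `qcibm_probs`, `sample`), the dense `J` matrix layout, and the convention that qubit 1 is the most significant bit are all assumptions made for illustration.

```python
import itertools
import numpy as np

# Pauli matrices
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)


def single_qubit_gate(gamma, delta, sigma):
    """One tensor factor of Eq. (3): exp(i(gamma X + delta Y + sigma Z))."""
    r = np.sqrt(gamma**2 + delta**2 + sigma**2)
    # np.sinc(t) = sin(pi t)/(pi t), so np.sinc(r/pi) = sin(r)/r (equal to 1 at r = 0)
    pauli_sum = gamma * X + delta * Y + sigma * Z
    return np.cos(r) * np.eye(2) + 1j * np.sinc(r / np.pi) * pauli_sum


def qcibm_probs(J, b, Gamma, Delta, Sigma):
    """Exact p_theta(x) of Eq. (5) for all 2^n bitstrings (exponential cost)."""
    n = len(b)
    bits = np.array(list(itertools.product([0, 1], repeat=n)))
    z = 1 - 2 * bits  # Z eigenvalues: bit 0 -> +1, bit 1 -> -1
    # Diagonal Ising phase: sum_{i<j} J_ij z_i z_j + sum_k b_k z_k
    phase = np.einsum("xi,ij,xj->x", z, np.triu(np.asarray(J), 1), z) + z @ np.asarray(b)
    state = np.exp(1j * phase) / np.sqrt(2**n)  # U_z(alpha) |+>^n
    U_f = np.array([[1.0 + 0j]])
    for g, d, s in zip(Gamma, Delta, Sigma):
        U_f = np.kron(U_f, single_qubit_gate(g, d, s))
    return bits, np.abs(U_f @ state) ** 2


def sample(bits, probs, N, rng):
    """Draw N bitstrings x ~ p_theta(x) via the Born rule."""
    idx = rng.choice(len(probs), size=N, p=probs / probs.sum())
    return bits[idx]
```

Under these assumptions, calling `qcibm_probs` with every entry of `Gamma` and `Sigma` equal to π/(2√2) and `Delta` equal to zero corresponds to the IQP setting of Eq. (6), while `Delta = Sigma = 0` corresponds to the depth-1 QAOA setting of Eq. (7).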

Training the Ising Born machine

Here we introduce the alternative training methods that we use for our purposes and that would be applicable to any generative model. The training procedure is a hybrid of classical and quantum computation, with the only quantum component being the model itself. The remainder of the computation is classical, bringing our scheme into the realm of what is possible for NISQ devices. The procedure can be seen in Fig. 1.

Fig. 1: The hybrid training procedures we propose in this work.

We have a quantum generator along with auxiliary circuits used to compute the gradient of the various cost functions with respect to the parameters. The training procedure proceeds as follows. First, the QCIBM is sampled from N times via measurements. These samples, along with M data samples y ~ π(y), are used to evaluate a cost function, \({\cal{L}}_B\), where B ∈ {MMD, SD, SHD} is one of the efficiently computable cost functions. For each updated parameter, θk, two parameter-shifted circuits are also run to generate samples, a, b ~ pθ±, which are used to compute the corresponding gradients, \(\partial _\theta {\cal{L}}_B\). For all cost functions and gradients, either a kernel is computed for each pair of samples (as is the case for MMD and SD; if a quantum kernel is used, the circuit in this figure must be run) or an optimal transport cost function is evaluated (as is the case for SHD).

The optimisation procedures we implement are stochastic gradient descents. Parameters, θk, are updated at each epoch of training, d, according to the rule \(\theta _k^{d + 1} \leftarrow \theta _k^d - \eta \,\partial _{\theta _k}{\cal{L}}_B\). The parameter η is the learning rate and controls the speed of the descent. The initial proposals to train QCBMs were gradient-free15,31, but gradient-based methods have also been proposed14,16,32. In this work, we advocate for increasing the classical computational power required in training to achieve better performance, rather than increasing the quantum resources, for example by adding extra ancillae16 or adding costly and potentially unstable (quantum) adversaries17,33,34.

For gradient-based methods, a cost function or metric, \({\cal{L}}_B\left( {p_{\mathbf{\theta }}({\mathbf{x}}),\pi ({\mathbf{y}})} \right)\), is required to compare the Born machine distribution, pθ(x), and the data distribution, π(y). Good cost functions will have several characteristics. They should be efficient to compute, measured both by sample and computational complexity. They should also be powerful in the sense that they are sensitive to differences between the two distributions. In this work, we will assess sensitivity by using the TV metric as a benchmark:

$${\mathrm{TV}}(p_{\mathbf{\theta }},\pi ): = \frac{1}{2}\mathop {\sum}\limits_{\mathbf{x}} {|p_{\mathbf{\theta }}({\mathbf{x}}) - \pi ({\mathbf{x}})|}.$$

As discussed later, this is a particularly strong metric. The cost functions we use are typically easier to deal with than TV, and we will remark on their relationship to TV.
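
As a minimal sketch of how the benchmark in Eq. (9) can be evaluated for small systems, the snippet below computes TV between two explicit probability vectors and builds an empirical distribution from binary samples. The helper names and the bit ordering (first bit most significant) are assumptions for illustration; the explicit vectors have length 2^n, so this is only feasible for few qubits.

```python
import numpy as np


def empirical_pmf(samples, n):
    """Relative frequencies of binary samples (shape (N, n)) over all 2^n bitstrings."""
    samples = np.asarray(samples)
    weights = 2 ** np.arange(n - 1, -1, -1)  # first bit is the most significant
    idx = samples @ weights                  # integer label of each bitstring
    return np.bincount(idx, minlength=2**n) / len(samples)


def tv_distance(p, q):
    """Total variation distance of Eq. (9): half the l1 distance between p and q."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()
```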

One cost function commonly used to train generative models is the Kullback–Leibler (KL) divergence. The KL divergence is also relatively strong, in the sense that it upper bounds TV through Pinsker’s inequality:

$${\mathrm{TV}}(p_{\mathbf{\theta }},\pi ) \le \sqrt {\frac{1}{2}D_{{\mathrm{KL}}}(p_{\mathbf{\theta }}||\pi )},$$

where DKL(pθ||π) is the KL divergence of π from pθ. Unfortunately, it is difficult to compute, having a high sample complexity, so neither its gradient nor the KL divergence itself can be evaluated efficiently when training parameterised circuits14.

The first efficient gradient method to train Born machines was proposed by ref. 14. There the MMD is used to define the cost function. We extend this methodology in two ways. The first is an alteration to the MMD itself, and the second is by introducing new cost functions. From the MMD, the following cost function35,36 can be defined:

$${{\cal{L}}_{MMD}}: = \mathop {\Bbb E}\limits_{{{\rm{x}} \sim {p_\theta }}\atop{{\rm{y}} \sim {p_{_\theta }}} } (\kappa ({\rm{x}},{\rm{y}})) + \mathop {\Bbb E}\limits_{{{\rm{x}} \sim \pi }\atop{{\rm{y}} \sim \pi } } (\kappa ({\rm{x}},{\rm{y}})) - 2\mathop {\Bbb E}\limits_{{{\rm{x}} \sim {p_\theta }}\atop{{\rm{y}} \sim \pi } } (\kappa ({\rm{x}},{\rm{y}})).$$

The MMD has some very favourable properties; it is a metric on the space of probability distributions, and it is relatively easy to compute (due to low sample complexity). The function κ in Eq. (11) is a kernel function, a measure of similarity between points in the sample space, \({\mathbf{x}} \in {\cal{X}}^n\). A popular choice for this function is the Gaussian mixture kernel14:

$$\kappa _{\rm{G}}({\mathbf{x}},{\mathbf{y}}): = \frac{1}{c}\mathop {\sum}\limits_{i = 1}^c {{\mathrm{exp}}} \left( { - \frac{{||{\mathbf{x}} - {\mathbf{y}}||_2^2}}{{2\sigma _i}}} \right).$$

The parameters, σi, are bandwidths that determine the scale at which the samples are compared, and \(|| \cdot ||_2\) is the \(\ell _2\) norm.
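
The cost in Eq. (11) with the kernel of Eq. (12) can be estimated from samples by replacing each expectation with a sample mean. The sketch below does this with the simple (biased) estimator that keeps the diagonal terms; the default bandwidths are illustrative placeholders rather than values prescribed here, and the function names are hypothetical.

```python
import numpy as np


def gaussian_mixture_kernel(xs, ys, sigmas=(0.25, 10.0, 1000.0)):
    """Gram matrix of Eq. (12) between sample arrays of shape (N, n) and (M, n)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Squared l2 distance between every pair of samples
    sq_dist = ((xs[:, None, :] - ys[None, :, :]) ** 2).sum(axis=2)
    # Average of Gaussians, with 2*sigma_i in the denominator as in Eq. (12)
    return np.mean([np.exp(-sq_dist / (2.0 * s)) for s in sigmas], axis=0)


def mmd_cost(xs, ys, kernel=gaussian_mixture_kernel):
    """Sample estimate of Eq. (11): xs ~ p_theta (model), ys ~ pi (data)."""
    return kernel(xs, xs).mean() + kernel(ys, ys).mean() - 2.0 * kernel(xs, ys).mean()
```

Any other kernel returning a Gram matrix with the same shape, including an estimate of the quantum kernel, could be passed through the `kernel` argument.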

Recent works37,38 on the near term advantage of using quantum computers in QML have explored quantum kernels, which can be evaluated on a quantum computer. To gain such an advantage, these kernels should be difficult to compute on a classical device. In particular, we will adopt the following kernel37 in which the samples are encoded in a quantum state, |ϕ(x)〉, via a feature map, ϕ:x → |ϕ(x)〉. The kernel is the inner product between vectors:

$$\kappa _{\rm{Q}}({\mathbf{x}},{\mathbf{y}}): = |\left\langle {\phi ({\mathbf{x}})|\phi ({\mathbf{y}})} \right\rangle |^2.$$

The inner product in Eq. (13) is evaluated on a quantum computer and is conjectured to be hard to compute on a classical one37, given only a classical description of the quantum states. The state |ϕ(x)〉 is produced by acting an encoding unitary on an initial state, \(\left| {\phi ({\mathbf{x}})} \right\rangle = {\cal{U}}_{\phi ({\mathbf{x}})}\left| 0 \right\rangle ^{ \otimes n}\). Explicitly, the kernel is then given by:

$$\kappa _{\rm{Q}}({\mathbf{x}},{\mathbf{y}}) = \left| {\left\langle {0^{ \otimes n}} \right|{\cal{U}}_{\phi ({\mathbf{x}})}^\dagger {\cal{U}}_{\phi ({\mathbf{y}})}\left| {0^{ \otimes n}} \right\rangle } \right|^2,$$

which can be calculated by measuring, in the computational basis, the state which results from running the circuit given by \({\cal{U}}_{\phi ({\mathbf{y}})}\), followed by that of \({\cal{U}}_{\phi ({\mathbf{x}})}^\dagger\). This is seen in Fig. 1. The kernel, Eq. (14), is the observed probability of measuring the all-zero outcome, 0^n. If this outcome is not observed after polynomially many measurements, the value of the kernel for this particular pair of samples (x, y) is set to zero. Intuitively, this means the feature map has mapped the original points to points with at most exponentially small overlap in the Hilbert space and therefore will not contribute to the MMD.
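
The finite-shot estimation of Eq. (14) can be mimicked classically for tiny examples. The sketch below is purely illustrative: it uses a toy product feature map (one single-qubit rotation per bit, with an arbitrary angle), which is an assumption and not the feature map of ref. 37, and it simulates the measurement statistics with a binomial draw instead of running a circuit.

```python
import numpy as np


def toy_feature_state(x, angle=np.pi / 4):
    """Hypothetical product feature map: bit x_k -> cos(angle*x_k)|0> + sin(angle*x_k)|1>."""
    state = np.array([1.0])
    for bit in x:
        state = np.kron(state, np.array([np.cos(angle * bit), np.sin(angle * bit)]))
    return state


def quantum_kernel_estimate(x, y, shots, rng):
    """Estimate kappa_Q(x, y) as the observed frequency of the all-zero outcome."""
    # Probability of measuring 0^n after U_phi(y) followed by U_phi(x)^dagger
    overlap = abs(np.vdot(toy_feature_state(x), toy_feature_state(y))) ** 2
    p_zero = float(np.clip(overlap, 0.0, 1.0))  # guard against rounding above 1
    hits = rng.binomial(shots, p_zero)          # number of all-zero outcomes seen
    return hits / shots                         # equals 0 if the outcome never occurs
```

With this toy map, each differing bit halves the exact overlap, so pairs of samples at large Hamming distance will frequently return an estimate of exactly zero for a modest number of shots, mirroring the convention described above.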

It is also necessary to derive an expression for the gradient of the cost function. For the MMD, the gradient with respect to the kth parameter14, carried by the kth unitary gate, Uk(θk), is given by:

$$\frac{{\partial {{\cal{L}}_{{\rm{MMD}}}}}}{{\partial {\theta _k}}} = \mathop {2{\Bbb E}}\limits_{{{\rm{a}} \sim p_{{\theta _k}}^ - }\atop{{\rm{x}} \sim {p_{_\theta }}} } (\kappa ({\rm{a}},{\rm{x}}))- \mathop{{2{\Bbb E}}}\limits_{{\rm{b}} \sim p_{{\theta _k}}^ + \atop {{\rm{x}} \sim {p_\theta }}} (\kappa ({\rm{b}},{\rm{x}})) - \mathop{2{\Bbb E}}\limits_{{{\rm{a}} \sim p_{{\theta _k}}^ - }\atop{{\rm{y}} \sim \pi } } (\kappa ({\rm{a}},{\rm{y}})) + \mathop {2{\Bbb E}}\limits_{{{\rm{b}} \sim p_{{\theta _k}}^ + }\atop{{\rm{y}} \sim \pi } } (\kappa ({\rm{b}},{\rm{y}})),$$

where \(p_{\theta _k}^ \pm\) are output distributions generated by running the following auxiliary circuits39,40 for each unitary gate, Uk(θk):


where \(\theta _k^ \pm : = \theta _k \pm \pi {\mathrm{/}}2\) and Ul:m := UlUl+1…Um−1Um are the unitary gates in the Born machine. This form of the gradient arises because the unitary gates in our case are exponentiated Pauli operators, Uk(θk) = exp(iθkΣk), with \({\mathrm{\Sigma }}_k^2 = {\Bbb I}\). With the unitaries in this form, the gradient of the output probabilities of the parameterised state, with respect to a parameter θk, is given by14,40:

$$\frac{{\partial p_{\mathbf{\theta }}({\mathbf{z}})}}{{\partial \theta _k}} = p_{\theta _k}^ - ({\mathbf{z}}) - p_{\theta _k}^ + ({\mathbf{z}}).$$

There is a slight difference between Eq. (17) and that of ref. 14, due to a different parameterisation of the unitaries above.

The gradients of the cost functions which we introduce next will also require the parameter-shifted circuits in Eq. (16). For more details on kernel methods and the MMD, see Supplementary Material Section II.

SD training

So far, we have only proposed a change of kernel in the MMD method of training QCIBMs. We now consider changing the cost function altogether. We endeavour to find costs which are efficient to compute for quantum models, yet stronger than MMD.

The first cost we propose is called the SD. SD has become popular for goodness-of-fit tests41, i.e. testing whether samples come from a particular distribution or not, as opposed to the MMD, which is typically used for kernel two-sample tests36. This discrepancy is based on Stein’s method42, which is a way to bound distance metrics between probabilities including, for example, the other integral probability metrics (IPM) we utilise in this work. For details on IPMs, see Supplementary Material Section I.

We use the discrete version of the SD43 since, in its original form41, it only caters for the case where the distributions are supported over a continuous space. The discretisation is necessary since the QCIBM outputs binary strings and so the standard gradient w.r.t. a sample x, ∇x, is undefined. As such, we need to use a discrete ‘shift’ operator, Δx, instead, which is an operator defined by [Δxf(x)]i := f(x) − f(¬ix) for a function f, where ¬i flips the ith element of the binary vector x.
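
As a small illustration of the shift operator just defined, the following sketch evaluates [Δxf(x)]i = f(x) − f(¬ix) for an arbitrary function of a binary vector. The function names are hypothetical and chosen only for this example.

```python
import numpy as np


def flip(x, i):
    """Return a copy of the binary vector x with its i-th element flipped."""
    y = np.array(x, dtype=int)
    y[i] ^= 1
    return y


def shift(f, x):
    """Discrete shift operator: the i-th entry is f(x) - f(x with bit i flipped)."""
    x = np.array(x, dtype=int)
    return np.array([f(x) - f(flip(x, i)) for i in range(len(x))])
```

For instance, with f(x) equal to the number of ones in x and x = (0, 1, 1), flipping the first bit increases f by one while flipping either of the others decreases it by one, so the shift vector is (−1, 1, 1).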

Fortunately, the discretisation procedure is relatively straightforward (the necessary definitions and proofs can be found in Supplementary Material Section III). The discrepancy is derived41,44 from the (discrete) Stein identity43, given by:

$$\mathop {{\Bbb E}}\limits_{{\mathbf{x}} \sim \pi } [{\cal{A}}_\pi \phi ({\mathbf{x}})] = \mathop {{\Bbb E}}\limits_{{\mathbf{x}} \sim \pi } \left[ {{\mathbf{s}}_\pi ({\mathbf{x}})\phi ({\mathbf{x}}) - {\mathrm{\Delta }}_{\mathbf{x}}\phi ({\mathbf{x}})} \right] = 0,$$
$${\cal{A}}_\pi \phi ({\mathbf{x}}): = {\mathbf{s}}_\pi ({\mathbf{x}})\phi ({\mathbf{x}}) - {\mathrm{\Delta }}_{\mathbf{x}}\phi ({\mathbf{x}}),$$

where \(\mathop {{\Bbb E}}\limits_{{\mathbf{x}} \sim \pi }\) denotes the expectation value over the distribution, π. This holds for any function \(\phi :{\cal{X}}^n \to {\Bbb C}\) and probability mass function π on \({\cal{X}}^n\). The function sπ(x) = Δx log(π(x)) is the Stein score function of the distribution π, and \({\cal{A}}_\pi\) is a so-called Stein operator of π. Now, the SD cost function can be written in a kernelised form41,43, similarly to the MMD:

$${\cal{L}}_{\mathrm{SD}}(p_{{\boldsymbol{\theta}}},\pi ):={{\Bbb E}}_{{\mathbf{x}},{\mathbf{y}} {\sim} p_{{\boldsymbol{\theta}}}}\left[ {\kappa _{\pi} ({\mathbf{x}},{\mathbf{y}})} \right],$$
$$\kappa _\pi ({\mathbf{x}},{\mathbf{y}}): = s_\pi ({\mathbf{x}})^T\kappa ({\mathbf{x}},{\mathbf{y}})s_\pi ({\mathbf{y}}) - s_\pi ({\mathbf{x}})^T{\mathrm{\Delta }}_{\mathbf{y}}^ \ast \kappa ({\mathbf{x}},{\mathbf{y}}) - {\mathrm{\Delta }}_{\mathbf{x}}^ \ast \kappa ({\mathbf{x}},{\mathbf{y}})^Ts_\pi ({\mathbf{y}}) + {\mathrm{tr}}({\mathrm{\Delta }}_{{\mathbf{x}},{\mathbf{y}}}^ \ast \kappa ({\mathbf{x}},{\mathbf{y}})),$$

where κπ is the Stein kernel and κ is a usual positive semi-definite kernel. \({\mathrm{\Delta }}_{\mathbf{x}}^ \ast\) is a conjugate version of the operator Δx, but for our purposes, the behaviour of both \({\mathrm{\Delta }}_{\mathbf{x}}^ \ast\) and Δx are identical. For completeness, we define it in generality in Supplementary Material Section III.

Just as above, the gradient (derived in an identical fashion to the MMD gradient Eq. (15) as is detailed in Supplementary Material Section III) of \({\cal{L}}_{{\mathrm{SD}}}\) with respect to the parameter, θk, is given by:

$$\frac{{\partial {{\cal{L}}_{{\rm{SD}}}}}}{{\partial {\theta _k}}} = \mathop {\Bbb E}\limits_{{{\rm{x}} \sim p_{{\theta _k}}^ - }\atop{{\rm{y}} \sim {p_\theta }} } [{\kappa _\pi }({\rm{x}},{\rm{y}})] - \mathop {\Bbb E}\limits_{{{\rm{x}} \sim p_{{\theta _k}}^ + }\atop{{\rm{y}} \sim {p_\theta }} } [{\kappa _\pi }({\rm{x}},{\rm{y}})] + \mathop {\Bbb E}\limits_{{{\rm{x}} \sim {p_\theta }}\atop{{\rm{y}} \sim p_{{\theta _k}}^ - } } [{\kappa _\pi }({\rm{x}},{\rm{y}})] - \mathop {\Bbb E}\limits_{{{\rm{x}} \sim {p_\theta }}\atop{{\rm{y}} \sim p_{{\theta _k}}^ + } } [{\kappa _\pi }({\rm{x}},{\rm{y}})].$$

We show that almost every term in Eqs. (20) and (22) can be computed efficiently, even when the quantum kernel κQ from Eq. (13) is used in Eq. (21). The exception is the score function sπ of the data distribution, which depends explicitly on π. If we are given oracle access to the probabilities, π(y), then there is no issue and the SD will be computable. Unfortunately, in any practical application this will not be the case.

To deal with such a scenario, we give two approaches to approximate the score via samples from π. The first of these we call the ‘Identity’ method since it inverts Stein’s identity45 from Eq. (18). We refer to the second as the ‘Spectral’ method since it uses a spectral decomposition46 of a kernel to approximate the score. The latter approach uses the Nyström method47, which is a technique used to approximately solve integral equations. We will only use the Spectral method in training the QCIBM in the numerical results in Fig. 3, since the Identity method does not give an immediate out-of-sample method to compute the score. Details of these methods can be found in Supplementary Material Section III.

Notice that, even with the difficulty in computing the score, the SD is still more suitable for training these models than the KL divergence as the latter requires computing the circuit probabilities, pθ(x), which is in general intractable, and so could not be computed for any data set.

SHD training

The second cost function we consider is the so-called SHD. This is a relatively new method to compare probability distributions48,49,50, defined by the following:

$${\cal{L}}_{{\mathrm{SHD}}}^{\it{\epsilon }}(p_{\mathbf{\theta }},\pi ): = {\mathrm{OT}}_{\it{\epsilon }}^c(p_{\mathbf{\theta }},\pi ) - \frac{1}{2}{\mathrm{OT}}_{\it{\epsilon }}^c(p_{\boldsymbol{\theta }},p_{\boldsymbol{\theta }}) - \frac{1}{2}{\mathrm{OT}}_{\it{\epsilon }}^c(\pi ,\pi ),$$
$${\rm{OT}}_\epsilon ^c({p_{\boldsymbol{\theta }}},\pi ): = \mathop {\min }\limits_{U \in {\cal{U}}({p_{\boldsymbol{\theta }} },\pi )} \left( {\mathop{\sum}\limits_{{({\mathbf{x}},{\mathbf{y}})}\atop{ \in {{\cal{X}}^d} \times {{\cal{Y}}^d}} } {c({\mathbf{x}},{\mathbf{y}})U({\mathbf{x}},{\mathbf{y}}) + \epsilon {\rm{KL}}(U|{p_\theta } \otimes \pi )} } \right),$$

where \({\it{\epsilon }} \ge 0\) is a regularisation parameter, c(x, y) is a Lipschitz ‘cost’ function, and \({\cal{U}}(p_{\boldsymbol{\uptheta }},\pi )\) is the set of all couplings between pθ and π, i.e. the set of all joint distributions whose marginals with respect to x and y are pθ(x) and π(y), respectively. The above cost function, \({\cal{L}}_{{\mathrm{SHD}}}^{\it{\epsilon }}\), is particularly favourable as a candidate because of its relationship to the theory of optimal transport51 (OT), a method to compare probability distributions. OT has become a major tool used to train models in the classical domain, for example GANs52, through the Wasserstein metric, a restriction of OT obtained when the cost, c(x, y), is chosen to be a metric on the space \({\cal{X}}^n\).

We would like to use OT itself to train generative models, due to its metric properties. Unfortunately, OT has high computational cost and exponential sample complexity53. For this reason, the SHD was proposed in refs 48,49,50 to interpolate between OT and the MMD as a function of the regularisation parameter \({\it{\epsilon }}\) in Eq. (24). In particular, for the two extreme values of \({\it{\epsilon }}\), we recover48 both unregularised OT and the MMD:

$$\begin{array}{l}{\underline{{\mathit{\epsilon}}\to 0}} :\\ \quad \quad \;{\cal{L}}_{{\mathrm{SHD}}}^{0}(p_{\boldsymbol \theta},\pi ) \to {\mathrm{OT}}_{0}^{c}(p_{\boldsymbol \theta},\pi )\end{array}.$$
$$\begin{array}{l}{\underline{{\it{\epsilon}} \to \infty }} :\\ \quad \quad \quad {\cal{L}}_{{\mathrm{SHD}}}^{\it{\epsilon }}(p_{\boldsymbol{\theta }},\pi ) \to {\mathrm{MMD}}(p_{\boldsymbol{\theta }},\pi ):\kappa ({\mathbf{x}},{\mathbf{y}}) = - c({\mathbf{x}},{\mathbf{y}})\end{array}.$$

As before, we need a gradient of the \({\cal{L}}^{\epsilon}_{{\mathrm{SHD}}}\) with respect to the parameters, which is given by:

$$\frac{{\partial {\cal{L}}_{{\mathrm{SHD}}}^{\it{\epsilon }}(p_{\boldsymbol{\theta }},\pi )}}{{\partial \theta _k}} = \mathop {\Bbb{E}}\limits_{x \sim {p_{\theta _k^ - }}} [\varphi ({\mathbf{x}})] - \mathop {{\Bbb E}}\limits_{{\mathbf{x}}\! \sim \! p_{\theta _k^ + }}[\varphi ({\mathbf{x}})],$$

where φ(x) is a function that depends on the optimal solutions found to the regularised OT problem in Eq. (24). See Supplementary Material Section IV for more details on the SHD and its gradient.
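
To make Eqs. (23) and (24) concrete, the sketch below estimates the regularised OT cost between two sets of binary samples using log-domain Sinkhorn iterations on the dual potentials, and combines three such terms into the SHD. It is a minimal illustration under stated assumptions: uniform weights on the samples, the Hamming distance as the cost c, a fixed iteration count instead of a convergence check, and hypothetical function names.

```python
import numpy as np


def hamming_cost(xs, ys):
    """Pairwise Hamming (l1) distance between binary arrays of shape (N, n) and (M, n)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    return np.abs(xs[:, None, :] - ys[None, :, :]).sum(axis=2)


def _logsumexp(z, axis):
    """Numerically stable log(sum(exp(z))) along one axis."""
    m = z.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(z - m).sum(axis=axis, keepdims=True))).squeeze(axis=axis)


def ot_eps(xs, ys, eps, iters=500):
    """Dual estimate of the regularised OT cost of Eq. (24) from samples (eps > 0)."""
    a = np.full(len(xs), 1.0 / len(xs))  # uniform weights on model samples
    b = np.full(len(ys), 1.0 / len(ys))  # uniform weights on data samples
    C = hamming_cost(xs, ys)
    f = np.zeros(len(xs))
    g = np.zeros(len(ys))
    for _ in range(iters):
        # Alternating updates that enforce the two marginal constraints in turn
        f = -eps * _logsumexp((g[None, :] - C) / eps + np.log(b)[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - C) / eps + np.log(a)[:, None], axis=0)
    return a @ f + b @ g


def sinkhorn_divergence(xs, ys, eps, iters=500):
    """Sample estimate of Eq. (23): OT(p, pi) - OT(p, p)/2 - OT(pi, pi)/2."""
    return (ot_eps(xs, ys, eps, iters)
            - 0.5 * ot_eps(xs, xs, eps, iters)
            - 0.5 * ot_eps(ys, ys, eps, iters))
```

The potentials `f` and `g` returned by such a procedure are the kind of 'optimal solutions' on which the function φ in Eq. (27) depends; how φ is assembled from them is detailed in the Supplementary Material and is not reproduced here.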

Sinkhorn complexity

The sample complexity of the SHD is of great interest to us as we claim that the TV and the KL are not suitable to be directly used as cost functions. This is due to the difficulty of computing the outcome probabilities of quantum circuits efficiently. We now analyse why the MMD is a weak cost function and why the SHD should be used as an alternative. This will depend critically on the regularisation parameter \({\it{\epsilon }}\), which allows a smooth interpolation between the OT metric and the MMD.

First, we address the computability of \({\cal{L}}^{\epsilon}_{{\mathrm{SHD}}}\) and we find, due to the results of ref. 54, a somewhat ‘optimal’ value for \({\it{\epsilon }}\), for which the sample complexity of \({\cal{L}}_{{\mathrm{SHD}}}\) becomes efficient. Specifically, the mean error between \({\cal{L}}_{{\mathrm{SHD}}}\) and its approximation \({\hat {\cal{L}}}_{{\mathrm{SHD}}}^{\it{\epsilon}}\) for n qubits, computed using M samples, scales as:

$${\Bbb E}|{\cal{L}}_{{\mathrm{SHD}}}^{\it{\epsilon }} - \hat {\cal{L}}_{{\mathrm{SHD}}}^{\it{\epsilon }}| = {\cal{O}}\left( {\frac{1}{{\sqrt M }}\left( {1 + {\mathrm{e}}^{\left( {2\frac{{n^2 + n}}{{\it{\epsilon }}}} \right)}} \right)\left( {1 + \frac{1}{{{\it{\epsilon }}^{\left\lfloor {n/2} \right\rfloor }}}} \right)} \right).$$

We show in Supplementary Material Section IV.1 that by choosing \({\it{\epsilon }} = {\cal{O}}(n^2)\), we get:

$${\Bbb E}|{\cal{L}}_{{\mathrm{SHD}}}^{{\cal{O}}(n^2)} - \hat {\cal{L}}_{{\mathrm{SHD}}}^{{\cal{O}}(n^2)}| = {\cal{O}}\left( {\frac{1}{{\sqrt M }}} \right),$$

which is the same sample complexity as the MMD55 but exponentially better than that of unregularised OT, which scales as \({\cal{O}}\left( {1{\mathrm{/}}M^{1/n}} \right)\)53.
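
To see informally why this scaling removes the exponential factors in Eq. (28), one can take the representative choice \({\it{\epsilon }} = n^2\) (an assumption made here purely for illustration; the full argument is in the Supplementary Material) and substitute it into the two bracketed terms:

$$1 + {\mathrm{e}}^{2\frac{{n^2 + n}}{{n^2}}} = 1 + {\mathrm{e}}^{2\left( {1 + \frac{1}{n}} \right)} \le 1 + {\mathrm{e}}^4,\qquad 1 + \frac{1}{{\left( {n^2} \right)^{\left\lfloor {n/2} \right\rfloor }}} \le 2,\quad n \ge 1.$$

Both factors are therefore bounded by constants independent of n, leaving only the \(1{\mathrm{/}}\sqrt M\) dependence on the number of samples.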

A similar result can be derived using a concentration bound54, such that, with probability 1 − δ,

$$|{\cal{L}}_{{\mathrm{SHD}}}^{{\cal{O}}(n^2)} - \hat {\cal{L}}_{{\mathrm{SHD}}}^{{\cal{O}}(n^2)}| = {\cal{O}}\left( {\frac{n}{{\sqrt M }}{\mathrm{log}}(1{\mathrm{/}}\delta )^{1/2}} \right),$$

where we have chosen the same scaling for \({\it{\epsilon }}\) as in Eq. (29). Therefore, we can choose an optimal theoretical value for the regularisation, such that \({\cal{L}}_{{\mathrm{SHD}}}\) is sufficiently far from OT to be efficiently computable but perhaps still retains some of its favourable properties. It is likely in practice, however, that a much lower value of \({\it{\epsilon }}\) could be chosen without a blow up in sample complexity49,54. See Supplementary Material Section IV for derivations of the above results.

Second, we can relate the \({\cal{L}}_{{\mathrm{SHD}}}\) to unregularised OT and TV via a sequence of inequalities. We have mentioned that the MMD is weak, meaning it provides a lower bound on TV in the following way55:

$${\mathrm{TV}}(p_{\boldsymbol{\theta }},\pi ) \ge \frac{{\sqrt {{\mathrm{MMD}}(p_{\boldsymbol{\theta }},\pi )} }}{{\sqrt C }},$$

if \(C: = {\mathrm{sup}}_{{\mathbf{x}} \in {\cal{X}}^n}\kappa ({\mathbf{x}},{\mathbf{x}})\, < \, \infty\).

Note that for the two kernels introduced earlier:

$$\kappa _{\rm{G}}({\mathbf{x}},{\mathbf{x}}) = \frac{1}{c}\mathop {\sum}\limits_{i = 1}^c {\mathrm{e}}^{ - \frac{1}{{2\sigma _i}}||{\mathbf{x}} - {\mathbf{x}}||_2^2} = \frac{1}{c}(c) = 1,$$
$$\kappa _{\rm{Q}}({\mathbf{x}},{\mathbf{x}}) = |\left\langle {\phi ({\mathbf{x}})|\phi ({\mathbf{x}})} \right\rangle |^2 = |\left\langle {0|0} \right\rangle ^{ \otimes n}|^2 = 1,$$

hence C = 1 and the lower bound is immediate.

In contrast, as is seen from the inequality on a discrete sample space in Eq. (34)56, the Wasserstein metric (unregularised OT) provides an upper bound on TV, and hence we would expect it to be stronger than the MMD.

$$d_{{\rm{min}}}{\mathrm{TV}}(p_{\boldsymbol{\theta }},\pi ) \le {\mathrm{OT}}_0^d(p_{\boldsymbol{\theta }},\pi ) \le {\mathrm{diam}}({\cal{X}}){\mathrm{TV}}(p_{\boldsymbol{\theta }},\pi ),$$

where \({\mathrm{diam}}({\cal{X}}^n) = {\mathrm{max}}\{ d({\mathbf{x}},{\mathbf{y}}),{\mathbf{x}},{\mathbf{y}} \in {\cal{X}}^n\}\), \(d_{{\rm{min}}} = {\mathrm{min}}_{{\mathbf{x}} \ne {\mathbf{y}}}d({\mathbf{x}},{\mathbf{y}})\), and d(x, y) is the metric on the space, \({\cal{X}}^n\). This arises by choosing c = d and \({\it{\epsilon }} = 0\) in Eq. (24). If, for instance, we were to choose d(x, y) to be the \(\ell _1\) metric between the binary vectors of length n (a.k.a. the Hamming distance), then we get that \(d_{{\rm{min}}} = 1,{\mathrm{diam}}({\cal{X}}^n) = n\), and so:

$${\mathrm{TV}}(p_{\mathbf{\theta }},\pi ) \le {\mathrm{OT}}_0^{\ell _1}(p_{\mathbf{\theta }},\pi ) \le n\,{\mathrm{TV}}(p_{\mathbf{\theta }},\pi ).$$

Finally, we can examine the relationship induced by the regularisation parameter through the following inequality (Theorem 1 in ref. 54):

$$0 \le {\mathrm{OT}}_{\it{\epsilon }}^c(p_{\boldsymbol{\theta }},\pi ) - {\mathrm{OT}}_0^c(p_{\boldsymbol{\theta }},\pi ) \le 2{\it{\epsilon }}{\mathrm{log}}\left( {\frac{{{\mathrm{e}}^2LD}}{{n{\it{\epsilon }}}}} \right) \sim _{{\it{\epsilon }} \to 0}2{\it{\epsilon }}\,{\mathrm{log}}\left( {1{\mathrm{/}}{\it{\epsilon }}} \right),$$

where the size of the sample space is bounded by D, as measured by the metric, and L is the Lipschitz constant of the cost c. As detailed in Supplementary Material Section IV.1, we can choose D = n and L = n:

$$0 \le {\mathrm{OT}}_{\it{\epsilon }}^{\ell _1}(p_{\boldsymbol{\theta }},\pi ) - {\mathrm{OT}}_0^{\ell _1}(p_{\boldsymbol{\theta }},\pi ) \le 2{\it{\epsilon }}\,{\mathrm{log}}\left( {\frac{{{\mathrm{e}}^2n}}{{\it{\epsilon }}}} \right).$$

The log term is non-negative as long as \({\it{\epsilon }} \le n{\mathrm{e}}^2\), in which case regularised OT gives an upper bound on the Wasserstein metric, and hence on TV through Eq. (34), so we arrive at:

$${\mathrm{TV}}(p_{\mathbf{\theta }},\pi ) \le {\mathrm{OT}}_0^{\ell _1}(p_{\mathbf{\theta }},\pi ) \le {\mathrm{OT}}_{{\it{\epsilon }} \le n{\mathrm{e}}^2}^{\ell _1}(p_{\mathbf{\theta }},\pi ).$$

Unfortunately, comparing this with Eqs. (29) and (30), we can see that, with this scaling of \({\it{\epsilon }}\), the sample complexity would pick up an exponential dependence on the dimension, n, so it would not be efficiently computable. We comment further on this point later.
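
The ordering between TV and regularised OT can be checked numerically on a small instance. The sketch below is illustrative only: it uses two arbitrary distributions on two-bit strings, a Hamming cost, a plain Sinkhorn matrix-scaling loop, and a regularisation value (0.5) chosen for fast convergence rather than taken from our experiments. It reports the transport cost of the regularised plan, which cannot fall below the unregularised optimum because the plan is itself a valid coupling.

```python
import itertools
import math


def hamming(x, y):
    """l1 (Hamming) distance between two bit strings."""
    return sum(a != b for a, b in zip(x, y))


def total_variation(p, q):
    """TV(p, q) = (1/2) * sum_x |p(x) - q(x)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def sinkhorn_plan(p, q, cost, eps, iters=500):
    """Entropy-regularised transport plan via Sinkhorn matrix scaling."""
    m = len(p)
    gibbs = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(m)]
    u, v = [1.0] * m, [1.0] * m
    for _ in range(iters):
        u = [p[i] / sum(gibbs[i][j] * v[j] for j in range(m)) for i in range(m)]
        v = [q[j] / sum(gibbs[i][j] * u[i] for i in range(m)) for j in range(m)]
    return [[u[i] * gibbs[i][j] * v[j] for j in range(m)] for i in range(m)]


n = 2
outcomes = list(itertools.product((0, 1), repeat=n))
cost = [[hamming(x, y) for y in outcomes] for x in outcomes]
p = [0.4, 0.3, 0.2, 0.1]  # arbitrary stand-in for the model distribution
q = [0.1, 0.2, 0.3, 0.4]  # arbitrary stand-in for the data distribution

plan = sinkhorn_plan(p, q, cost, eps=0.5)
transport_cost = sum(plan[i][j] * cost[i][j]
                     for i in range(len(p)) for j in range(len(q)))
tv = total_variation(p, q)
print(f"TV = {tv:.4f}, transport cost of regularised plan = {transport_cost:.4f}")
```

Smaller regularisation values bring the plan closer to the unregularised optimum but make the scaling loop converge more slowly, so the iteration count would need to grow accordingly.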

Numerical performance

In Figs 2–4, we illustrate the superior performance of our alternative training methods, as measured by the TV distance. A lower TV indicates that the model is able to learn parameters which fit the true data more closely. TV was chosen as an objective benchmark for several reasons. First, it is typically the notion of distance that is required by quantum supremacy experiments where one wants to prove hardness of classical simulation. Second, we use it in the definitions of QLS. Finally, it is one of the strongest notions of convergence in probability one can ask for, so it follows that a training procedure that can more effectively minimise TV, in an efficient way, should be better for generative modelling.

Fig. 2: The performance of the quantum kernel κQ (red) vs. the Gaussian kernel, κG (blue) (with ηinit = 0.1) for 4 qubits.

During training, we sample from the QCIBM and the data 500 times and use a minibatch size of 250. One epoch is one complete update of all parameters according to gradient descent. Error bars represent maximum, minimum, and mean values achieved over five independent training runs, with the same initial conditions on the same data samples. a TV difference achieved with both kernel methods during training. No observable or obvious advantage is seen in using the quantum kernel over the Gaussian one; in contrast, the Gaussian kernel seems to perform better on average. b Final learned probabilities with ηinit = 0.01 using the Adam optimiser. c MMD computed using 400 samples as training points and 100 as test points (seen as the thin lines without markers), independent of the training data.

Fig. 3: MMD (cyan, yellow, green) vs. Sinkhorn (blue) and Stein training with Exact score function (red) and Spectral score method (magenta) for 3 qubits with fully connected topology on the Rigetti 3q−qvm, trained on the data, Eq. (44).

Five hundred data points are used for training, with 400 used as a training set and 100 used as a test set. Plots show mean, maximum, and minimum values achieved over five independent training runs on the same data set. a TV difference between training methods, with regularisation parameter for SHD and 3 eigenvectors for Spectral Stein method. Both Sinkhorn divergence and Stein discrepancy are able to achieve a lower TV than the MMD. Inset shows region of outperformance on the order of ~0.01 in TV. We observe that the Spectral score method was not able to minimise TV as well as the exact Stein discrepancy, potentially indicating the need for better approximation methods. b Final learned probabilities of each training method. See Supplementary Material Section V for behaviour of corresponding cost functions.

Fig. 4: MMD (green, yellow) vs. Sinkhorn (blue, cyan) for 4 qubits comparing performance on the real QPU (Aspen–4–4Q–A) vs. simulated behaviour on QVM (Aspen–4–4Q–A–qvm) using 500 samples and a batch size of 250, learning target data (black), and an initial learning rate for Adam of ηinit = 0.01.

a TV difference between training methods with regularisation parameter \({\it{\epsilon}} = 0.08\). b Final learned probabilities. Black indicates the probabilities of a randomly chosen instance of the data distribution (see ‘Methods’). The probabilities given by the other bars are those achieved after training the model with either the MMD or SHD on the simulator or the physical Rigetti chip, on an average run. The probabilities of the model are generated by simulating the entire wavefunction after training. c \({\cal{L}}_{{\mathrm{SHD}}}^{{0.08}}\) on QVM (cyan) vs. QPU (blue). d \({\cal{L}}_{{\mathrm{MMD}}}\) on QVM (yellow) vs. QPU (green). In both latter cases, trained model performance on 100 test samples is seen as the thin lines without markers. Again it can be seen that the Sinkhorn divergence outperforms the MMD both in simulation and on chip, with the deviation apparent towards the end of training. Similar behaviour was observed after 100 epochs but is not shown due to limited QPU time.

We train the model on Rigetti’s Forest platform57 using both a simulator and the real quantum hardware, the Aspen QPU. Figure 2 illustrates the training of the model using the Gaussian (Eq. (12)) vs. the quantum kernel (Eq. (13)) for 4 qubits, and we see that the quantum kernel offers no significant advantage vs. training with a purely classical one. Figure 2a shows the TV as trained for 200 epochs, using both the classical and quantum kernels with various learning rates. Figure 2b shows the learned probabilities outputted after training with each kernel, and Fig. 2c shows the difference in the actual \({\cal{L}}_{{\mathrm{MMD}}}\) itself while training with both methods. Interestingly, the latter behaviour is quite different for both kernels, with the quantum kernel initialising with much higher values of \({\cal{L}}_{{\mathrm{MMD}}}\), whereas they both minimise TV in qualitatively the same way. This indicates that hardness of classical simulation (of computing the kernel) does not imply an advantage in learning.

On the other hand, a noticeable outperformance is observed for the SHD and the SD relative to training with the MMD (using a Gaussian kernel), as measured by TV in Fig. 3. Furthermore, we observed that the gap (highlighted in the inset in Fig. 3a) which separates the SHD and SD (red and blue lines) from the MMD (green, yellow, and cyan lines) grows as the number of qubits grows. Unfortunately, the Spectral method to approximate the Stein score does not outperform the MMD, despite training successfully. The discrepancy between the true and approximate versions of the Stein score is likely due to the low number of samples used to approximate the score, with the number of samples limited by the computational inefficiency. We leave tuning the hyperparameters of the model in order to get better performance to future work.

This behaviour persists on the QPU (Fig. 4), where we show training of the model with both the MMD and SHD as measured by TV (Fig. 4a), the learned probabilities of both methods on and off the QPU (Fig. 4b), and the behaviour of the cost functions associated with both methods (Fig. 4c, d). This reinforces our theoretical argument that the SHD is better able to minimise TV and so achieve superior results.

Given the performance noted above, we would recommend the SHD as the primary candidate for future training of these models, due to its simplicity and competitive performance. One should also note that we do not attempt to learn these data distributions exactly since we use a shallow fixed circuit structure for training (i.e. a QAOA circuit), which we do not alter. Better fits to the data could likely be achieved with deeper circuits with more parameters.

For additional numerical results demonstrating the performance of the learning algorithms, see Supplementary Material Section V, which includes a comparison between the quantum and Gaussian kernels for two qubits, similar to Fig. 2; the behaviour of the corresponding cost functions themselves associated with Fig. 3; the performance of the model for 4 qubits, similar to Fig. 3; and the results using a 3 qubit device, the Aspen–4–3Q–A. In all cases, the performance was qualitatively similar to that reported in the main text.

Hardness and quantum advantage

It is crucially important, not just for our purposes but for the design of QML algorithms in general, that the algorithm itself provides some advantage over any classical one for the same task. This is the case for so-called coherent algorithms, like the HHL linear equation solver3, which is BQP-complete, and therefore unlikely to be fully dequantised. However, such a proven advantage for near term QML algorithms is as yet out of reach. We attempt to address this question in two steps.

  1. We show that, for a large number of parameter values, θ, our QCIBM circuits are ‘hard’. That is to say, they cannot be efficiently simulated classically up to a multiplicative error, in the worst case. We also show that this holds for the auxiliary quantum circuits used for the gradient estimation, and hence the model may remain hard during training (although we do not know for sure).

  2. We provide formal definitions for QLS, the ability of a quantum model to provably outperform all classical models in a certain task, and a potential pathway to prove such a thing.

The intuition behind point 2 is the following. If our QCIBM model could learn a target distribution π, which demonstrates quantum supremacy, by providing a quantum circuit C close enough to π (i.e. below a threshold error in TV), then the model would have demonstrated something that is classically infeasible. Else there would exist an efficient classical algorithm that can get close to π, which contradicts hardness.

Point 1 does not completely fit that intuition. For one thing, hardness is not known to hold for the required notion of additive error (i.e. TV distance) but only for multiplicative error. Also, even though the model is more expressive than any classical model16, this does not imply that it could actually learn a hard distribution. On the other hand, it is easy to see why the converse would be true: if the QCIBM could learn a distribution that is hard to sample from classically, the underlying circuit must have, at some point, reached a circuit configuration for which the output distribution is hard to sample classically.

We can address point 1 informally (see Supplementary Material Section VI for the formal statements and proof) in three steps:

  • If the parameters of the model are initialised randomly in \(\{ {\mathbf{\alpha }}\} = \{ J_{ij},b_k\} \in \{ 0,\frac{\pi }{8}, \ldots ,\frac{{7\pi }}{8}\}\) and final measurement angles are chosen such that Uf(Γ, Δ, Σ) = \(H^{ \otimes n}\), then the resulting QCIBM circuit class will be hard to simulate up to an additive error of 1/384 in TV distance, subject to a conjecture relating to the hardness of computing the Ising partition function6.

  • If certain configurations of the parameters are chosen to be either of the form, (2l + 1)π/kd, where l and d are integers and k is a number that depends on the circuit family, or in the form 2πν, where ν is irrational, then the resulting class of circuits will be hard to sample from classically, up to a multiplicative error, in the worst case.

  • The circuits produced at each epoch as a result of the gradient updates will each result in a hard circuit class as long as the gradient updates are not chosen carelessly. In each epoch, if the update step is constrained in a way that the new value of the parameter \(\theta _k^{d + 1} = \theta _k^d - \eta \partial _{\theta _k}{\cal{L}}_B\) does not become rational, then the updated circuits will also belong to a class that is hard to simulate (a similar result can be shown for the case where the parameters are updated to keep within the form of (2l + 1)π/kd). This is because the updates can simply be absorbed into the original gates, to give a circuit which has the same form. This holds also for the gradient-shifted circuits in Eq. (16) since these correspond to circuits whose parameters are updated as follows: \(\theta _k^{d, \pm } \leftarrow \theta _k^d \pm \pi {\mathrm{/}}2\).

We now provide definitions to meet the requirements of point 2, adapting definitions from distribution learning theory25 for this purpose. Specifically, we say that a generative QML algorithm, \({\cal{A}} \in {\mathrm{BQP}}\) (with a small abuse of notation) has demonstrated QLS if there exists a class of probability distributions \({\cal{D}}_n\) over \({\cal{X}}^n\) (bit vectors of length n), for which there exists a metric d and a fixed \({\it{\epsilon }}\) such that \({\cal{D}}_n\) is \((d,{\it{\epsilon }},{\mathrm{BQP}})\)-learnable via \({\cal{A}}\) but not \(\left( {d,{\it{\epsilon }},{\mathrm{BPP}}} \right)\)-learnable (i.e. learnable by a purely classical algorithm). The task of the learning algorithm \({\cal{A}}\) is, given a target distribution \(D \in {\cal{D}}_n\), to output, with high probability, a Generator, GEND, for a distribution D′, such that D′ is close to \(D \in {\cal{D}}_n\) with respect to the metric d. For the precise definitions of learnability we employ, see Supplementary Material Section VII.

This framework is very similar to, and inspired by, probably approximately correct (PAC) learning, which has been well studied in the quantum case58, but it applies more closely to the task of generative modelling. It is known that, in certain cases, the use of quantum computers can be beneficial to PAC learning but not generically59. Based on this, it is possible that there exist some classes of distributions that cannot be efficiently learned by classical computers (BPP algorithms) but that could be learned by quantum devices (BQP algorithms). The motivation for this is rooted in the question of quantum supremacy and is illustrated crudely in Fig. 5b.

Fig. 5: Quantum learning supremacy in distribution learning.

a Illustration of a learning procedure using a Generator. The algorithm \({\cal{A}}\) is given access to GEND, which provides samples, x~D, and must output a Generator for a distribution that is close to the original. We allow the target generator to be classical, hence it may take as input a string of random bits of size polynomial in n, r(n), if not able to generate its own randomness. b Crude illustration of quantum learning supremacy. No classical algorithm, Ci, should be able to achieve the required closeness in total variation to the target distribution, but the QCIBM (or similar) should be able to, for some class of target distributions. There should be some path in the parameter space of the QCIBM, θ, which achieves this.

An initial attempt at QLS is as follows. As mentioned above, if random IQP circuits could be classically simulated to within a TV error of \({\it{\epsilon }} = 1{\mathrm{/}}384\)6 in the worst case (with high probability over the choice of circuit), this would imply unlikely consequences for complexity theory. Now, if a generative quantum model was able to achieve a closeness in TV less than this constant value, perhaps by minimising one of the upper bounds in Eq. (38), then we could claim that this model had achieved something classically intractable. For example, if we make the following assumptions,

  1. QCIBM could achieve a TV < δ to a target IQP distribution.

  2. A classical probabilistic algorithm, C, could output a distribution q in polynomial time which was γ close in TV to the QCIBM, i.e. it could simulate it efficiently.

then


$${\mathrm{TV}}(p_{{\mathrm{IQP}}},q) = \frac{1}{2}\mathop {\sum}\limits_{\mathbf{x}} {|p_{{\mathrm{IQP}}}({\mathbf{x}}) - q({\mathbf{x}})|}$$
$$= \frac{1}{2}\mathop {\sum}\limits_{\mathbf{x}} {|p_{{\mathrm{IQP}}}({\mathbf{x}}) - p_{\boldsymbol{\theta }}({\mathbf{x}}) + p_{\boldsymbol{\theta }}({\mathbf{x}}) - q({\mathbf{x}})|}$$
$$\le \frac{1}{2}\mathop {\sum}\limits_{\mathbf{x}} {|p_{{\mathrm{IQP}}}({\mathbf{x}}) - p_{\boldsymbol{\theta }}({\mathbf{x}})|} + \frac{1}{2}\mathop {\sum}\limits_{\mathbf{x}} {|p_{\boldsymbol{\theta }}({\mathbf{x}}) - q({\mathbf{x}})|}$$
$$\le \delta + \gamma \equiv {\it{\epsilon }},$$

where the third line follows from the triangle inequality. Therefore, C could simulate an IQP distribution also, and we arrive at a contradiction.
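
The triangle-inequality step can be illustrated on a toy instance. In the sketch below, the three probability vectors over two-bit outcomes are arbitrary stand-ins for the IQP target, the trained QCIBM, and the classical simulator's output; none of the numbers come from our experiments.

```python
def total_variation(p, q):
    """TV(p, q) = (1/2) * sum_x |p(x) - q(x)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Arbitrary illustrative distributions over the outcomes 00, 01, 10, 11.
p_iqp = [0.40, 0.10, 0.15, 0.35]    # stand-in for the hard target distribution
p_theta = [0.38, 0.12, 0.16, 0.34]  # stand-in for the trained QCIBM
q = [0.37, 0.12, 0.18, 0.33]        # stand-in for the classical simulator

delta = total_variation(p_iqp, p_theta)  # assumption 1: model close to target
gamma = total_variation(p_theta, q)      # assumption 2: simulator close to model
direct = total_variation(p_iqp, q)

print(f"TV(p_IQP, q) = {direct:.4f} <= delta + gamma = {delta + gamma:.4f}")
```

The contradiction in the argument arises only when δ + γ falls below the hardness threshold for the target family, so both errors must be controlled together.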

The major open question left by this work is whether QLS is possible at all; can a quantum model outperform all classical ones in generative learning? This idea motivated our search for metrics that upper bound TV but yet were efficiently computable and therefore could be minimised to efficiently learn distributions to a sufficiently small value of TV. Unfortunately, we can see from the exponential scaling observed in Eq. (38), which gives the upper bound on TV by regularised OT, that SHD will not provably achieve this particular task, despite achieving our primary goal of being stronger than the MMD for generative modelling. We briefly discuss avenues of future research in the ‘Discussion’ section, which could provide alternative routes to QLS.

Quantum compiling

As a concrete application of such a model outside the scope of classical generative modelling, we can use the QCIBM training to perform a type of ‘weak’ quantum circuit compilation. There are potentially other areas that could be studied using these tools or by applying techniques in generative modelling to other quantum information processing tasks, but this is beyond the scope of this work.

The major objective in this area is to compile a given target unitary, U, into one that consists exclusively of operations available to the native hardware of the quantum computer in question. For example, in the case of Rigetti’s Aspen QPU, the native gates are {Rx(±π/2), Rz(θ), CZ}57,60, and any unitary that a user wishes to implement must be compiled into a unitary V that contains only these ingredients.

Potential solutions to this problem60,61 involve approximating the target unitary by assuming that V is a parametric circuit built from the native gates, which can be trained by some optimisation strategy. We adopt a similar view here, but we do not require any extra quantum resources to perform the compilation. With this limitation, we make a trade-off in that we are not guaranteed to apply the same target unitary, only that the output distribution will be close to that produced by the target. Clearly this is a much weaker constraint than the task of direct compilation, since many unitaries may give rise to the same distribution, but it is much closer to the capabilities of near term devices. To illustrate this application, we train a QCIBM, restricted to a QAOA architecture, to learn the output distribution of a random IQP circuit, using \({\cal{L}}_{{\mathrm{SHD}}}\) as a cost function. The process is illustrated in Eq. (43), where we try to determine suitable QAOA parameters, \(\{ J_{ij}^{{\mathrm{QAOA}}},b_k^{{\mathrm{QAOA}}}\}\), which reproduce the distribution observed from a set of random IQP parameters, \(\{ J_{ij}^{{\mathrm{IQP}}},b_k^{{\mathrm{IQP}}}\}\).

$$\begin{array}{l}{\mathrm{QCIBM}}\left( {\left\{ {J_{ij}^{{\mathrm{QAOA}}},b_k^{{\mathrm{QAOA}}}} \right\},\left\{ {{\mathrm{\Gamma }}_k = \frac{\pi }{4}} \right\},0,0} \right)\\ \mathop { \to }\limits^{Compile} {\mathrm{QCIBM}}\left( {\left\{ {J_{ij}^{{\mathrm{IQP}}},b_k^{{\mathrm{IQP}}}} \right\},\left\{ {{\mathrm{\Gamma }}_k = \frac{\pi }{{2\sqrt 2 }}} \right\},0,\left\{ {{\mathrm{\Sigma }}_k = \frac{\pi }{{2\sqrt 2 }}} \right\}} \right).\end{array}$$

The measurement unitary at the end of the circuit makes this process non-trivial, since this will give rise to significantly different distributions, even given the same parameters in Uz. We illustrate this in Fig. 6 using the Rigetti 2q–qvm and for three qubits in Supplementary Material Section V. We find that, even though the learned parameter values are different from the target, the resulting distributions are quite similar, as expected.
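
To see why the two measurement settings in Eq. (43) differ, the following sketch evaluates the single-qubit measurement unitary for both parameter choices. It assumes that each single-qubit factor of Uf takes the form exp(i(Γk X + Δk Y + Σk Z)) with Δk = 0, a form which is not restated in this section; the helper functions are illustrative only. Under this assumption, Γk = Σk = π/(2√2) yields a Hadamard gate up to a global phase (the IQP case), whereas Γk = π/4, Σk = 0 yields an X rotation (the QAOA case).

```python
import math

X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]


def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def expm(m, terms=40):
    """Matrix exponential of a 2x2 matrix via its truncated Taylor series."""
    result = [[1.0 + 0j, 0j], [0j, 1.0 + 0j]]
    term = [[1.0 + 0j, 0j], [0j, 1.0 + 0j]]
    for k in range(1, terms):
        term = [[entry / k for entry in row] for row in matmul(term, m)]
        result = [[result[r][c] + term[r][c] for c in range(2)]
                  for r in range(2)]
    return result


def measurement_unitary(gamma, sigma):
    """Assumed single-qubit factor exp(i * (gamma * X + sigma * Z)), Delta = 0."""
    generator = [[1j * (gamma * X[r][c] + sigma * Z[r][c]) for c in range(2)]
                 for r in range(2)]
    return expm(generator)


angle = math.pi / (2 * math.sqrt(2))
u_iqp = measurement_unitary(angle, angle)       # IQP setting in Eq. (43)
u_qaoa = measurement_unitary(math.pi / 4, 0.0)  # QAOA setting in Eq. (43)

for name, unitary in (("IQP", u_iqp), ("QAOA", u_qaoa)):
    print(name, [[complex(round(e.real, 4), round(e.imag, 4)) for e in row]
                 for row in unitary])
```

Since the two final unitaries are different single-qubit rotations, identical Ising parameters in Uz generally produce different output distributions, which is why the QAOA parameters must be retrained rather than copied from the IQP circuit.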

Fig. 6: Automatic compilation of IQP circuit to a p = 1 QAOA circuit with two qubits using \({\cal{L}}_{{\mathrm{SHD}}}^{\it{\epsilon }}\) with \({\it{\epsilon }} = 0.1\).

Five hundred data samples were used with 400 used for a training set and 100 used as a test set. QCIBM circuit is able to mimic the target distribution well, even though actual parameter values and circuit families are different. Error bars represent mean, maximum, and minimum values achieved over five independent training runs on the same data set. a Initial (cyan) and trained (grey) QAOA circuit parameters for two qubits. Target IQP circuit parameters (green). Parameter values scaled by a factor of 10 for readability. b Final learned probabilities of QCIBM (QAOA) (blue) circuit vs. ‘data’ probabilities (IQP) (black). c Total variation distance and d Sinkhorn divergence for 400 training samples and 100 test samples, using a Hamming optimal transport cost.

Discussion


Providing provable guarantees of the superior performance of near term quantum computers relative to any classical device for some particular non-trivial application is an important milestone of the field. We have shown one potential route towards this goal by combining complexity–theoretic arguments4,7,9, with an application in generative machine learning14,15,16,62, and improved training methods of generative models. Specifically, we introduced the Ising Born machine, a restricted form of a quantum circuit Born machine. These models utilise the Born rule of quantum mechanics to train a parameterised quantum circuit as a generative machine learning model, in a hybrid manner.

We proved that the model cannot be simulated efficiently by any classical algorithm up to a multiplicative error in the output probabilities, which holds for many circuit families that may be encountered during gradient-based training. As such, this type of model is a good candidate for a provable quantum advantage in QML using NISQ devices. To formalise this intuition, we defined a notion of QLS to rigorously define what such an advantage would look like, in the context of machine learning.

We adapted novel training methods for generative modelling in two ways: first, by introducing quantum kernels to be evaluated on the quantum hardware and, second, by proposing and adapting new cost functions. In the case of SHD, we discussed its sample complexity and used this to define a somewhat optimal cost function through a judicious choice of the regularisation parameter. It is possible to choose this parameter such that the cost is efficiently computable even as the number of qubits grows. We showed numerically that these methods have the ability to outperform previous methods in the random data set we used as a test case.

Finally, we demonstrated an application of the model as a heuristic compiler to compile one quantum circuit into another via classical optimisation techniques, which has the advantage of requiring minimal quantum overhead. These techniques could potentially be adapted into methods to benchmark and verify near term quantum devices.

The major question that this work raises is whether or not a provable notion of quantum learning could be achievable for a particular data set, thereby solidifying a use case for quantum computers in the near term with provable advantage. The best prospect for this is the quantum supremacy distributions we know of (for example, IQP), but they are not efficiently testable63. Owing to this, they are also likely to not be efficiently learnable either, given the close relationship between distribution testing and learning64. Furthermore, we can see from the exponential scaling required in Eq. (38) for regularised OT to upper bound TV that other techniques are necessary to achieve QLS, since the methods we present here are not suited to this particular task, despite achieving our goal of being stronger than the MMD for generative modelling. However, this assumes that we have access only to classical samples from the distribution, and the possibility of gaining an advantage using quantum samples38,58 is unexplored in the context of distribution learning.

Methods


In this section, we detail the methods used to train the QCIBM to reproduce a given probability distribution. The target distribution is the one given by Eq. (44), which is used in both refs 65,66 to train versions of the quantum Boltzmann machine:

$$\pi ({\mathbf{y}}): = \frac{1}{T}\mathop {\sum}\limits_{k = 1}^T {p^{n - d_{\rm{H}}(s_k,{\mathbf{y}})}} (1 - p)^{d_{\rm{H}}(s_k,{\mathbf{y}})}.$$

To generate this data, T binary strings of length n, written sk and called ‘modes’, are chosen randomly. A sample y is then produced with a probability that depends on its Hamming distance dH(sk, y) to each mode. In all of the above, the Adam67 optimiser was applied, using the suggested hyperparameters, i.e. \(\beta _1 = 0.9,\;\beta _2 = 0.999,\;{\it{\epsilon }} = 1 \times 10^{ - 8},\) and initial learning rate, ηinit. This was chosen since it was found to be more robust to sampling noise14.
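
The data-generation procedure can be sketched as follows. This is a minimal illustration rather than the code used for our experiments: the values of n, T, and p below are arbitrary, and the sampler draws a mode uniformly and keeps each of its bits with probability p, which reproduces the probabilities of Eq. (44).

```python
import itertools
import random


def hamming(x, y):
    """Hamming distance d_H between two bit strings."""
    return sum(a != b for a, b in zip(x, y))


def target_distribution(modes, p):
    """pi(y) of Eq. (44) for every bit string y of length n."""
    n, num_modes = len(modes[0]), len(modes)
    dist = {}
    for y in itertools.product((0, 1), repeat=n):
        dist[y] = sum(p ** (n - hamming(s, y)) * (1 - p) ** hamming(s, y)
                      for s in modes) / num_modes
    return dist


def sample(modes, p, rng):
    """Pick a mode uniformly, then keep each bit with probability p."""
    mode = rng.choice(modes)
    return tuple(b if rng.random() < p else 1 - b for b in mode)


rng = random.Random(0)
n, num_modes, p = 3, 2, 0.9  # illustrative values only
modes = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(num_modes)]
pi = target_distribution(modes, p)
data = [sample(modes, p, rng) for _ in range(500)]

print(sum(pi.values()))  # the probabilities sum to 1 up to rounding
```

Each mode contributes a product of independent Bernoulli factors, so the distribution is normalised for any p in [0, 1] and any choice of modes.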

In all of the numerical results, we used a QAOA structure as the underlying circuit in the QCIBM. Specifically, the parameters in Uf were chosen such that, for all k, Γk = π/4, Δk = 0, and Σk = 0. The Ising parameters {Jij, bk} were initialised randomly.

For the SD, we used 3 Nyström eigenvectors to approximate the Spectral score in Fig. 3 for 3 qubits, and 6 eigenvectors for 4 qubits. In all cases when using the MMD with a Gaussian kernel, we chose the bandwidth parameters, σ = [0.25, 10, 1000]14. Note that this article was previously published as a preprint68.