Demonstration of quantum advantage in machine learning

The main promise of quantum computing is to efficiently solve certain problems that are prohibitively expensive for a classical computer. Most problems with a proven quantum advantage involve the repeated use of a black box, or oracle, whose structure encodes the solution. One measure of the algorithmic performance is the query complexity, i.e., the scaling of the number of oracle calls needed to find the solution with a given probability. Few-qubit demonstrations of quantum algorithms, such as Deutsch-Jozsa and Grover, have been implemented across diverse physical systems such as nuclear magnetic resonance, trapped ions, optical systems, and superconducting circuits. However, at the small scale, these problems can already be solved classically with a few oracle queries, and the attainable quantum advantage is modest. Here we solve an oracle-based problem, known as learning parity with noise, using a five-qubit superconducting processor. Running classical and quantum algorithms on the same oracle, we observe a large gap in query count in favor of quantum processing. We find that this gap grows by orders of magnitude as a function of the error rates and the problem size. This result demonstrates that, while complex fault-tolerant architectures will be required for universal quantum computing, a quantum advantage already emerges in existing noisy systems.

of an additional bit A. The learner has access to the output register of an example oracle circuit that implements f on random input states, on which he/she has no control. Repeated queries of the oracle allow the learner to reconstruct k. However, any physical implementation suffers from errors, both in the oracle execution itself and in readout of the register. In the presence of errors, the problem becomes hard. Assuming that every bit introduces an equal error probability, the best known algorithms have a number of queries growing as O(n) and runtime growing almost exponentially with n [13,14,20]. In view of the classical hardness of learning parity with noise (LPN), parity functions have been suggested as keys for secure and computationally easy authentication [21,22].
The picture is different when the algorithm can process quantum superpositions of input states, i.e., when the oracle is implemented by a quantum circuit. In this case, applying a coherent operation on all qubits after an oracle query ideally creates the entangled state

$$\frac{1}{\sqrt{2}}\left(|0_A, 0^{n}\rangle + |1_A, k\rangle\right). \qquad (1)$$

In particular, when A is measured to be in |1⟩, |D⟩ will be projected onto |k⟩. With constant error per qubit, learning from a quantum oracle requires a number of queries that scales as O(log n), and has a total runtime that scales as O(n) [15]. This gives the quantum algorithm an exponential advantage in query complexity and a superpolynomial advantage in runtime.
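As a numerical illustration of the ideal, noise-free case, the following sketch builds the oracle output for a key k and applies Hadamard gates to every qubit. The resulting state has equal weight on "ancilla 0, data all zero" and "ancilla 1, data equal to k". The function names and the qubit ordering (D 1 , ..., D n , A) are choices made for this sketch and are not taken from the experiment.

```python
import itertools
import numpy as np

def oracle_state(k):
    """Ideal oracle output sum_d |d>|k.d mod 2> / sqrt(2^n), qubit order D1..Dn, A."""
    n = len(k)
    psi = np.zeros(2 ** (n + 1))
    for d in itertools.product([0, 1], repeat=n):
        a = sum(ki * di for ki, di in zip(k, d)) % 2   # parity of the selected bits
        index = int("".join(map(str, d)) + str(a), 2)  # basis-state index
        psi[index] = 1.0
    return psi / np.sqrt(2 ** n)

def hadamard_all(psi):
    """Apply a Hadamard gate to every qubit of the state vector psi."""
    n_qubits = psi.size.bit_length() - 1
    h = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
    u = np.array([[1.0]])
    for _ in range(n_qubits):
        u = np.kron(u, h)
    return u @ psi

k = (1, 1)
final = hadamard_all(oracle_state(k))
# Only |D=00, A=0> (index 0) and |D=11, A=1> (index 7) carry amplitude.
support = np.flatnonzero(np.abs(final) > 1e-9)
```

In this idealized picture a single query with a = 1 already reveals k; the experimental question is how many queries are needed once gate and readout errors are present.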
In this work, we implement an LPN problem in a superconducting quantum circuit using up to five qubits, realizing the experiment proposed in Ref. 15. We construct a parity function with bit-string k using a series of CNOT gates between the ancilla and the data qubits (Fig. 1b). We then present two classes of learners for k and compare their performance. The first class simply measures the output qubits in the computational basis and analyzes the results. The measurement collapses the state into a random {D, f(D, k)} basis state, reproducing an example oracle of the classical LPN problem. The second class performs some quantum computation (coherent operations), followed by classical analysis, to infer the solution. We show that the quantum approach outperforms the classical one in the number of queries required to reach a target error threshold, and that it is largely robust to noise added to the output qubit register.
The quantum device used in our experiment consists of five superconducting transmon qubits, A, D 1 , ..., D 4 , and seven microwave resonators (Fig. 1c). Five of the resonators are used for individual control and readout of the qubits, to which they are dispersively coupled [24]. The center qubit A plays the role of the result and is coupled to the data register {D i } via the remaining two resonators. This coupling allows the implementation of cross-resonance (CR) gates [25] between A (used as control qubit) and each D i (target), constituting the primitive two-qubit operation for the circuit in Fig. 1b.

To implement a uniform random example oracle for a particular k, we first prepare the data qubits in a uniform superposition (Fig. 1b). Preparing such a state ensures that all parity examples are produced with equal probability and is also key in generating a quantum advantage. We then implement the oracle as a series of CNOT gates, each having the same target qubit A and a different control qubit D i for each k i = 1. Finally, the state of all qubits is read out (with the optional insertion of Hadamard gates, see discussion below). The oracle mapping to the device is limited by imperfections in the two-qubit gates, with average fidelities of 88-94%, characterized by randomized benchmarking [26] (see Extended Data Table 1).

FIG. 1 (caption, partially recovered). Vertical lines indicate CNOT gates between each Di (control) and the ancilla qubit A (target). Quantum learning differs from classical learning only by the addition of single-qubit gates (dashed boxes) applied before measurement (see also Extended Data Fig. 1). (c) Optical image of the superconducting quantum processor (qubits in red). A is coupled to each Di by means of two bus resonators (blue). Each qubit is also coupled to a dedicated resonator for control and readout (green) [23].
Readout errors in the register η Di , defined as the average probability of assigning a qubit to the wrong state, are limited to 20 − 40% by the onset of inter-qubit crosstalk at higher measurement power (Extended Data Fig. 3). A Josephson parametric amplifier [27] in front of the amplification chain of A suppresses its low-power readout error to η A = 5%.
Having implemented parity functions with quantum hardware, we now proceed to interrogate an oracle N times and assess our capability to learn the corresponding k. We start with oracles with register size n = 2, involving D 1 , D 2 , and A. We consider two classes of learning strategies, classical (C) and quantum (Q). In C, we perform a projective measurement of all qubits right after execution of the oracle. This operation destroys any coherence in the oracle output state, thus making any analysis of the result classical. The measured homodyne voltages {V D1 , ..., V Dn , V A } are converted into binary outcomes, using a calibrated set of thresholds (see Methods). Thus, for every query, we obtain a binary string {a, d 1 , d 2 }, where each bit is 0 (1) for the corresponding qubit detected in |0⟩ (|1⟩). Ideally, a is the linear combination (modulo 2) of d 1 , d 2 expressed by the string k (Fig. 1a). However, both the gates comprising the oracle and qubit readout are prone to errors (see Extended Data Table 1). To find the k that is most likely to have produced our observations, at each query m we compute the expected ã k,m for the measured d 1,m , d 2,m and each of the 4 possible values of k. We then select the k which minimizes the distance to the measured results a 1 , ..., a N of N queries, i.e., $\sum_{m=1}^{N} |\tilde{a}_{k,m} - a_m|$ [13]. In the case of a tie, k is randomly chosen among those producing the minimum distance. As expected, the probability p of failing to identify the correct k decreases with N (Fig. 2a). Interestingly, the difficulty of the problem depends on k and increases with the number of k i = 1. This can be intuitively understood as needing to establish a higher correlation between data qubits when the weight of k increases.
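The distance-minimization step of the digital classical learner can be sketched as follows. This is a minimal illustration, assuming each query is stored as a pair (a, d) with d a tuple of n bits; the function name and data layout are hypothetical and not part of the experimental software.

```python
import itertools
import random

def classical_solver(queries, n, rng=random):
    """Return the key k minimizing sum_m |a~_{k,m} - a_m| over all 2^n candidates.

    queries: list of (a, d) pairs, where a is the ancilla bit and d a tuple of n data bits.
    Ties are broken at random, as described in the text.
    """
    best_distance, best_keys = None, []
    for k in itertools.product([0, 1], repeat=n):
        distance = 0
        for a, d in queries:
            expected_a = sum(ki * di for ki, di in zip(k, d)) % 2  # parity predicted by k
            distance += abs(expected_a - a)
        if best_distance is None or distance < best_distance:
            best_distance, best_keys = distance, [k]
        elif distance == best_distance:
            best_keys.append(k)
    return rng.choice(best_keys)

# Noise-free example for k = (1, 0): the ancilla simply copies d1.
example = [(d1, (d1, d2)) for d1 in (0, 1) for d2 in (0, 1)]
guess = classical_solver(example, n=2)
```

The loop over all 2^n candidate keys makes the cost of this exhaustive search grow exponentially with the register size.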
In our second approach (Q), while the oracle is left untouched, we apply local operations (Hadamard gates) to all qubits before measuring. Remarkably, this simple operation completely changes the statistics of the measurement results and the learning procedure. We now use the fact that the state of the data qubits is entangled with the result A (see Eq. 1). Whenever A is measured to be in |1⟩, the data register will ideally be projected onto the solution, |D 1 , D 2 ⟩ = |k 1 , k 2 ⟩. We therefore digitize and postselect our results on the ≈ 50% outcomes where a = 1 and perform a bit-wise majority vote on the Ñ postselected strings {d 1 , d 2 }. Despite every individual query being subject to errors, the majority vote is effective in determining k (Fig. 2b). We assess the performance of the two solvers by comparing the number of queries N 1% required to reach p = 0.01 (Fig. 2c). Whereas Q performs comparably to or worse than C for k = 00, 01 or 10, Q requires less than half as many queries as C for the hardest oracle, k = 11. We note that, while these results are specific to the lowest oracle and readout errors we can achieve (see Extended Data Table 1), a systematic advantage of quantum over classical learning will become clear in the following.

FIG. 2. Error probability p to identify a 2-bit oracle k as a function of the number of queries N. For both classical (a) and quantum (b) learners, one of the four oracles k is applied, followed by the simultaneous measurement of all qubits. Hadamard gates are applied prior to measurement in the quantum case (Fig. 1b). See text for a description of the solvers in the two scenarios. Inset: number of queries N 1% (k) required to reach 1% error for the classical (empty bars) and quantum (solid) solver.

So far we have adhered to a literal implementation of the classical LPN problem, where each output can only be either 0 or 1. However, the actual measurement results are the continuous homodyne voltages {V D1 , ..., V Dn , V A }, each having mean and variance determined by the probed qubit state and by the measurement efficiency, respectively [24]. These additional resources can be exploited to improve the learner's capabilities as follows. A more effective strategy for C uses Bayesian estimation to calculate the probability of any possible k for the measured output voltages, and selects the most probable (see Methods). This approach is expensive in classical processing time (scaling exponentially with n), but drastically reduces the error probability p̄, averaged over all k, at any N. For Q, the postselected voltages of each D i are averaged before digitization (see Methods). For each D i , the majority vote between ≈ N/2 inaccurate observations is then replaced by a single vote with high accuracy.
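The digital quantum learner (postselection on a = 1 followed by a bit-wise majority vote) can be sketched in a few lines. As before, the (a, d) data layout and the function name are assumptions made for illustration; resolving an exact tie in the vote to 0 is also a choice of this sketch, since the text does not specify it.

```python
def quantum_solver(queries, n):
    """Postselect on ancilla a == 1, then majority-vote each data bit.

    queries: list of (a, d) pairs measured after the Hadamard gates.
    Returns None when no query survives postselection.
    """
    kept = [d for a, d in queries if a == 1]
    if not kept:
        return None
    # A bit is set to 1 when more than half of the kept queries report 1.
    return tuple(int(2 * sum(d[i] for d in kept) > len(kept)) for i in range(n))

# Five queries with a few bit flips; the true key is (1, 1).
measured = [(1, (1, 1)), (1, (1, 0)), (1, (1, 1)), (0, (0, 0)), (1, (0, 1))]
estimate = quantum_solver(measured, n=2)
```

Unlike the exhaustive classical search, the classical post-processing here is linear in both the number of queries and the register size.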
Using the analog results, not only does Q retain an advantage over C (smaller p for given N ), but it does so without introducing an overhead in classical processing. The superiority of Q over C becomes even more evident when the oracle size n grows from 2 to 3 data qubits (Fig. 3b). Whereas Q solutions are marginally affected, the best C solver demands almost an order of magnitude higher N to achieve a target error. Maximizing the resources available in our quantum hardware, we observe an even larger gap for oracles with n = 4 (Extended Data Fig. 5), suggesting a continued increase of quantum advantage with the problem size.
As predicted, quantum parity learning surpasses classical learning in the presence of noise. To investigate the impact of noise on learning, we introduce additional readout error on either A or on all D i . This can be easily done by tuning the amplitude of the readout pulses, effectively decreasing the signal-to-noise ratio [28]. When the ancilla assignment error probability η A grows (Fig. 4a), the number of queries N̄ 1% (the average of N 1% over all k) required by the C solver increases by up to 2 orders of magnitude in the measured range (see also Extended Data Fig. 6). Conversely, using Q, N̄ 1% only changes by a factor of ∼ 3. Key to this performance gap is the optimization of the digitization threshold for {⟨V Di ⟩} at each value of η A (see Methods). When η A is increased, an interesting question is whether postselection on V A always remains beneficial. In fact, for η A > 0.25, it becomes more convenient to ignore V A and use the totality of the queries (Q′ in Fig. 4a). Similarly, we step the readout error of the data qubits, with average η D , while setting η A to the minimum. Not only does Q outperform C at every step, but the gap widens with increasing η D .
A numerical model including the measured η A , η D , qubit decoherence, and gate errors modeled as depolarization noise (Extended Data Table 1) is in very good agreement with the measured N̄ 1% at all η A , η D . This model allows us to extrapolate N̄ 1% to the extreme cases of zero and maximum noise. Obviously, when η D = 0.5, readout of the data register contains no information, and N̄ 1% consequently diverges. On the other hand, a random ancilla result (η A = 0.5) does not prevent a quantum learner from obtaining k. In this limit, the predicted factor of ∼ 2 in N̄ 1% between Q and Q′ can be intuitively understood because Q indiscriminately discards half of the queries, while Q′ uses all of them. (See Supplementary Material for theoretical bounds on the scaling of N̄ 1% for different solvers.)

In conclusion, we have implemented a learning parity with noise algorithm in a quantum setting. We have demonstrated a superior performance of quantum learning compared to its classical counterpart, where the performance gap increases with added noise in the query outcomes. A quantum learner, with the ability to physically manipulate the output of a quantum oracle, is expected to find the hidden key with a logarithmic number of queries and linear runtime as a function of the problem size, whereas a passive classical observer would require a linear number of queries and nearly exponential runtime. We have shown that the difference in classical and quantum queries required for a target error rate grows with the oracle size in the experimentally accessible range, and that quantum learning is much more robust to noise. We expect that future experiments with increased oracle size will further demarcate a quantum advantage, in support of the predicted asymptotic behavior.

METHODS
Pulse calibration. Single- and two-qubit pulses are calibrated by an automated routine, executed periodically during the experiments. For each qubit, first the transition frequency is calibrated with Ramsey experiments. Second, π and π/2 pulse amplitudes are calibrated using a phase estimation protocol [29]. The pulse amplitudes, modulating a carrier through an I/Q mixer (Extended Data Fig. 2), are adjusted at every iteration of the protocol until the desired accuracy or signal-to-noise limit is reached. Pulses have a Gaussian envelope in the main quadrature and derivative-of-Gaussian in the other, with DRAG parameter [30] calibrated beforehand using a sequence amplifying phase errors [31]. CR gates are calibrated in a two-step procedure, determining first the optimum duration and then the optimum phase for a ZX 90 unitary.
Experimental setup. A detailed schematic of the experimental setup is illustrated in Extended Data Fig. 2. For each qubit, signals for readout and control are delivered to the corresponding resonator through an individual line in the dilution refrigerator. For an efficient use of resources, we apply frequency division multiplexing [32] to generate the five measurement tones by sideband modulation of three microwave sources. Moreover, the same pair of BBN APS (custom arbitrary waveform generators) channels produces the readout pulses for {D 1 , D 2 }, and another one for {D 3 , D 4 }. Similarly, the output signals are pairwise combined at base temperature, limiting the number of HEMTs and digitizer channels to three. The attenuation on the input lines, distributed at different temperature stages, is a compromise between suppression of thermal noise impinging on the resonators (affecting qubit coherence) and the input power required for CR gates.
Gate sequence. CNOT gates can be decomposed in terms of CR gates using the relation $\mathrm{CNOT}_{12} = (Z_{-90} \otimes X_{-90})\,\mathrm{CR}_{12}$ [33]. Moreover, the roles of control and target qubits are swapped, using $\mathrm{CNOT}_{12} = (H_1 \otimes H_2)\,\mathrm{CNOT}_{21}\,(H_1 \otimes H_2)$. The first of these H gates is absorbed into state preparation for the LPN sequence (Fig. 1a and Extended Data Fig. 1). Similarly, when two CNOTs are executed back to back, two consecutive H gates on A are canceled out. In order to maintain the oracle identical in C and Q, we do not compile the H gates in the CNOTs with those applied before measurement in Q.
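Both gate identities can be checked numerically with small matrices. The sketch below assumes that the CR gate is the ZX 90 unitary mentioned under pulse calibration and that a rotation by angle θ about a Pauli operator P is exp(−iθP/2); this sign convention and all helper names are assumptions of the sketch. Under these assumptions, the CR decomposition reproduces the CNOT up to a global phase, which has no physical effect.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = (X + Z) / np.sqrt(2)

def rot(pauli, degrees):
    """exp(-i * theta/2 * pauli), valid for any operator with pauli @ pauli = identity."""
    theta = np.deg2rad(degrees)
    return np.cos(theta / 2) * np.eye(pauli.shape[0]) - 1j * np.sin(theta / 2) * pauli

def equal_up_to_phase(u, v):
    """True when the unitaries u and v differ only by a global phase."""
    return bool(np.isclose(abs(np.vdot(v, u)), u.shape[0]))

# Basis order |q1 q2>; CNOT12 has control q1 and target q2, CNOT21 the reverse.
CNOT12 = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)
CNOT21 = np.array([[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]], dtype=complex)

CR12 = rot(np.kron(Z, X), 90)                          # ZX_90
from_cr = np.kron(rot(Z, -90), rot(X, -90)) @ CR12     # (Z_-90 x X_-90) CR_12
swapped = np.kron(H, H) @ CNOT21 @ np.kron(H, H)       # Hadamards exchange control and target
```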
Data analysis. For each set of {k, η A , η D }, solver type, and register size n, we measure the result of 10,000 oracle queries. Each set is accompanied by n + 2 calibration points (averaged 10,000 times), providing the distributions of V A , V D1 , ..., V Dn for the collective ground state and for single-qubit excitations (n data and 1 ancilla qubit). These distributions are then used to determine the optimum digitization threshold (for digital solvers) or as input to the Bayesian estimate in C. To obtain p(N), we resample the full data set with 1000-4000 random subsets of each size N.
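The resampling step used to estimate p(N) can be sketched as below. The sketch assumes that subsets are drawn without replacement and that a solver is any function mapping a list of queries to a key; both the function name and these details are illustrative assumptions rather than a description of the actual analysis code.

```python
import random

def error_probability(queries, solver, true_key, n_queries, n_subsets=1000, seed=0):
    """Estimate p(N): fraction of random size-N subsets on which the solver fails."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_subsets):
        subset = rng.sample(queries, n_queries)  # random subset, no repeated queries
        if solver(subset) != true_key:
            failures += 1
    return failures / n_subsets

# Toy check with a solver that reads the key off the first query of the subset.
toy_queries = [(1, (1, 0))] * 50
p_toy = error_probability(toy_queries, lambda qs: qs[0][1], (1, 0), n_queries=10)
```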
Error bars are obtained by first computing the credible intervals for p at each set {N, k, η A , η D }. These intervals are computed with the Jeffreys beta distribution prior Beta(1/2, 1/2) for Bernoulli trials, with a credible level of 100% − (100% − 95%)/8 = 99.375%. This ensures that, under a union bound, the average of estimates for 8 different keys is inside the credible interval with a probability of at least 95%. We then perform antitonic regression on the upper and lower bounds of the credible intervals to ensure monotonicity as a function of N, and find the intercept to p = 0.01 for each k. The bounds on the value of N 1% averaged over the keys are computed by interval arithmetic on the credible intervals of N 1% for each k.
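Two ingredients of this procedure are easy to make concrete: the union-bound credible level and the antitonic (non-increasing) regression. The sketch below uses the standard pool-adjacent-violators algorithm with equal weights, which is one common way to compute such a fit; whether the original analysis used this exact algorithm is not stated. The helper `first_crossing` is a simplified stand-in for the intercept search, returning the first sampled N at which the fitted curve reaches the target.

```python
def credible_level(overall=0.95, n_keys=8):
    """Per-key credible level such that a union bound over n_keys gives `overall`."""
    return 1.0 - (1.0 - overall) / n_keys

def antitonic_regression(values):
    """Least-squares non-increasing fit by pooling adjacent violators (equal weights)."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in values:
        blocks.append([value, 1])
        # Merge while an earlier block mean is smaller than the following one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit

def first_crossing(n_values, p_values, target=0.01):
    """First N at which the (monotone) error estimate is at or below the target."""
    for n, p in zip(n_values, p_values):
        if p <= target:
            return n
    return None

fit = antitonic_regression([0.5, 0.3, 0.4, 0.005])
n_1pct = first_crossing([10, 20, 30, 40], fit)
```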
Classical solver with Bayesian estimate. An improved classical solver for the LPN problem can be constructed when the oracle provides an analog output. Under the assumption of Gaussian distributions for each possible bit value, this improved solver corresponds to a Bayesian estimate of the key after a series of observations of the data and ancilla bits. More formally, taking a uniform prior distribution for all binary strings produced by the oracle, one computes the (unnormalized) posterior distribution p(D i ) for each data bit D i at the output of the oracle, [equation not recovered]. The (unnormalized) posterior distribution p m (k|V D , V A ) for the key k after the mth query, on the other hand, is given by [equation not recovered], where p 0 (k) is the prior distribution for each key. Here and above, {V D1 , ..., V Dn , V A } are rescaled to have mean 0 and 1 for the corresponding qubit in |0⟩ and |1⟩, respectively. Iterating this procedure (while updating p(k) at each iteration), and then choosing the most probable key k Bayes = arg max k p(k), one obtains an estimate for the key.
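A Bayesian key estimate of this kind can be sketched as follows. This is an illustration built from the stated assumptions (Gaussian voltage distributions, voltages rescaled to mean 0 and 1, uniform priors) and is not a transcription of the expressions used in the analysis. The noise widths `sigma_d` and `sigma_a`, the array layout, and the function names are all assumptions of the sketch. For each query, the likelihood of a key k is obtained by summing over all 2^n possible data strings d, each weighted by how well d explains the data voltages and how well its parity under k explains the ancilla voltage.

```python
import itertools
import numpy as np

def gaussian(v, mean, sigma):
    """Unnormalized Gaussian likelihood of voltage v for a given mean and width."""
    return np.exp(-0.5 * ((v - mean) / sigma) ** 2)

def bayes_solver(v_data, v_anc, sigma_d, sigma_a):
    """Most probable key given rescaled voltages v_data (N, n) and v_anc (N,)."""
    n = v_data.shape[1]
    strings = np.array(list(itertools.product([0, 1], repeat=n)))  # all d, also all k
    # Likelihood of the data voltages for every candidate data string: shape (N, 2^n).
    like_d = np.prod(gaussian(v_data[:, None, :], strings[None, :, :], sigma_d), axis=2)
    # Likelihood of the ancilla voltage for a = 0 and a = 1: shape (N, 2).
    like_a = np.stack([gaussian(v_anc, 0.0, sigma_a), gaussian(v_anc, 1.0, sigma_a)], axis=1)
    log_posterior = np.zeros(len(strings))  # uniform prior over keys
    for j, k in enumerate(strings):
        parity = (strings @ k) % 2          # ancilla bit predicted by k for each d
        per_query = np.sum(like_d * like_a[:, parity], axis=1)
        log_posterior[j] = np.sum(np.log(per_query + 1e-300))  # floor avoids log(0)
    return tuple(int(b) for b in strings[int(np.argmax(log_posterior))])
```

The nested sum over keys and data strings makes the cost grow exponentially with n, in line with the processing overhead noted for this solver.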
Analog quantum solver with postselection on A. While postselection on A is performed equally on both digital (Fig. 2) and analog (Figs. 3-4) Q solvers, in the analog case all postselected {V Di } are averaged together. Finally, the results {⟨V Di ⟩} are digitized to determine the most likely k. The choice of digitization threshold for each D i depends on: a) the readout voltage distributions ρ 0 and ρ 1 for the two basis states, each characterized by a mean µ and a variance σ²; b) η A . Ideally (η A = 0 and perfect oracle), the distribution of each query output V Di matches ρ 0 (ρ 1 ) for k i = 0 (1). When η A > 0, the distribution for [remainder of sentence and equation not recovered]. We approximate the expected distribution of the mean ⟨V Di ⟩ with a Gaussian having average and variance obtained from ρ ki=0 (ρ ki=1 ) for k i = 0 (1). Finally, we choose the digitization threshold for ⟨V Di ⟩ which maximally discriminates these two Gaussian distributions. We note that the number of queries scales the variance of both distributions equally and therefore does not affect the optimum threshold. Furthermore, this calibration protocol is independent of the oracle (see Extended Data Fig. 7).
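The averaging and thresholding steps can be sketched as follows. The sketch rests on an assumed model that is not spelled out in the recovered text: for k i = 0 the query output follows ρ 0 , while for k i = 1 it follows a mixture of ρ 1 (weight 1 − η A ) and ρ 0 (weight η A ), with ρ 0 and ρ 1 sharing the same width σ. The threshold rule used here, which equalizes the two Gaussian tail probabilities, is one reasonable reading of "maximally discriminates" and is likewise an assumption; all names are hypothetical.

```python
import numpy as np

def digitization_threshold(mu0, mu1, sigma, eta_a):
    """Threshold between the assumed k_i = 0 and k_i = 1 distributions of the mean voltage."""
    mean1 = (1 - eta_a) * mu1 + eta_a * mu0                   # mean of the assumed mixture
    sd0 = sigma
    sd1 = np.sqrt(sigma ** 2 + eta_a * (1 - eta_a) * (mu1 - mu0) ** 2)  # mixture width
    # Point with equal z-score from both means; averaging over queries rescales
    # sd0 and sd1 by the same factor, so the result does not depend on N.
    return (mu0 * sd1 + mean1 * sd0) / (sd0 + sd1)

def analog_quantum_solver(v_data, v_anc, anc_threshold, data_thresholds):
    """Average the data voltages of queries with ancilla above threshold, then digitize."""
    keep = v_anc > anc_threshold
    if not keep.any():
        return None
    mean_v = v_data[keep].mean(axis=0)
    return tuple(int(m > t) for m, t in zip(mean_v, data_thresholds))

v_data = np.array([[0.9, 0.1], [0.2, 0.1], [1.1, -0.1]])
v_anc = np.array([0.9, 0.1, 0.8])
t = digitization_threshold(0.0, 1.0, 0.5, 0.0)
key_estimate = analog_quantum_solver(v_data, v_anc, 0.5, [t, t])
```

Within this assumed model, the threshold moves from the midpoint toward µ 0 as η A increases, because wrongly postselected queries pull the k i = 1 mean toward µ 0 .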
Analog quantum solver without postselection. The analysis without ancilla (Q′) closely follows the steps outlined in the last paragraph. For the purpose of extracting the optimum digitization thresholds, we consider η A = 0.5 in the expressions above. This corresponds to an equal mixture of ρ 0 and ρ 1 when k i = 1.
Bounds on performance of the analog quantum solvers. Here we demonstrate how the bounds from Ref. 15 can be easily adapted to the case where the solver uses analog voltage measurements. We consider both the case where experiments are postselected based on the digitized value of the ancilla (referred to below as postselected soft averaging), and the case where the ancilla is ignored altogether. We consider different error rates for the ancilla and the data qubits.
Postselected soft averaging. In order to generalize the analysis in Ref. 15 to the postselected soft averaging case, we now need to take two types of data errors into account: depolarizing errors (our crude model for oracle errors), and measurement error (additive Gaussian noise).
First, postselection works identically to Ref. 15, since we treat the ancilla digitally. We note that, in this analysis, the ancilla error rate combines oracle errors and readout errors. Given N queries, Ñ are postselected according to the ancilla value V A , and s of these postselections are correct. Although s is unknown in an experiment, we condition our results on s being typical (i.e., we only consider the values of s that occur with probability higher than 1 − ε for some small ε).
For the correct postselections, we have two possible voltage distributions for each D i , depending on whether the outcome is 0 or 1. The distribution of the outcomes will depend on whether we have one of the correct postselections, and on the value of the ith key bit k i . If k i = 0, the conditional voltage distributions, depending on whether we postselected correctly or not, are respectively [equations not recovered], with N(µ, σ²) the normal distribution with mean µ and variance σ². Therefore, the overall distribution is [equation not recovered]. If the true bit value is 1, we have [equation not recovered], and therefore [equation not recovered].

Now we must compute the optimal voltage threshold which determines the digital decision at each of the data qubits. If we define [equation not recovered], the threshold we must choose is [equation not recovered]. The complication is that this is conditioned on s, but we will deal with that later, as the dependence on s also comes from the distribution of outcomes (not just the threshold). In the following we assume the value of s to be typical (i.e., s is contained in the region around the median excluding the distribution tails that add up to at most some small ε). Under this assumption, we require that µ i|0 ≤ T ≤ µ i|1 .

The probability of having the right answer at a particular bit is the probability that the averaged voltage is on the correct side of the threshold (above or below). If the true value of the bit is 0, i.e., if k i = 0, given the threshold, we can compute [equation not recovered], where Φ is the cumulative distribution function for a normal distribution, and Q is the tail probability for the normal distribution. We can place a lower bound on Pr(M i ≤ T | s, k i = 0) with an upper bound on Q. Note that, for the range of interest, the argument of Q is always positive, so we can use the bound [equation not recovered], and therefore [equation not recovered], which is nearly what we want; we must now address the dependence on s. One way to restrict the analysis to typical s is to require that, for η̃ A = max{η A , 1 − η A }, the probability [equation not recovered] is exponentially close to 1.

This choice of η̃ A requires knowledge of the error rates in the ancilla so that, for example, one knows to postselect on 0 instead of 1 if η A > 0.5. In order to pick a lower bound valid for all typical thresholds and means, we choose the smallest |T − µ i|0 | by choosing T and µ i|0 independently from the typical sets. This leads to [equation not recovered], and thus [equation not recovered], so that, by the union bound, [equation not recovered], and therefore the lower bound on the number of queries is [equation not recovered]. If k i = 1, we take a similar approach, but the lower bound on the distance between the threshold and the mean is smaller, leading to [equation not recovered], so clearly this is the worst case for k i . If we want to bound N instead of Ñ, we just remember that there is a 50% chance of collapsing into the informative branch of the state, and using the same typicality argument as before, we have [equation not recovered], where δ measures how far from the mean k is, with a corresponding Chernoff bound.

Analysis without postselection. The analysis is equivalent to the postselected case, but with η A = 1/2 and Ñ = N, since we keep all experiments and have a 50% chance of collapsing the state in the informative branch. All of this leads to [equation not recovered].
We now see that, depending on the choices of δ and δ′, postselection may or may not lead to better bounds, but the asymptotic scaling is the same.

Complexity of digital classical solvers. Angluin and Laird [13] showed that learning with classification noise requires O(n) queries as long as the classification error rate is below 1/2, and proposed an algorithm (disagreement minimization) that corresponds to solving an NP-complete problem. According to the exponential time hypothesis, it is widely believed that NP-complete problems can only be solved in exponential time. Note that, while the classification error rate is nominally η A in our experiment, all errors (including η D and gate infidelities) can be combined into an effective, k-dependent, single error rate.
Blum, Kalai, and Wasserman [14] devised a subexponential time algorithm for learning with classification errors, as long as the classification error rate is below $\frac{1}{2} - \frac{1}{2^{n^{\delta}}}$ for δ < 1, at the cost of increasing the query complexity to slightly sub-exponential scaling with n.
Later, Lyubashevsky [20] devised another slightly subexponential time algorithm for learning with classification errors, as long as the classification error rate is below $\frac{1}{2} - \frac{1}{2^{(\log n)^{\delta}}}$ for δ < 1, but bringing down the query complexity to $n^{1+\epsilon}$ for ε > 0. Note that the gains over exponential time scaling for these two algorithms are rather small: a reduction from $O(2^{n})$ to $O(2^{n/\log n})$ and $O(2^{n/\log\log n})$, respectively. For n = 3, the Blum-Kalai-Wasserman algorithm can only tolerate a classification error rate below 3/8 = 0.375, while the Lyubashevsky algorithm can only tolerate a classification error rate below $\frac{1}{2} - \frac{1}{2^{\log 3}} \approx 0.033$ (natural logarithm). Lyubashevsky's algorithm does not apply to any of the experiments discussed here because our classification error rates are too high. The Blum-Kalai-Wasserman algorithm only applies to some of the experiments discussed here, so for the sake of fair comparison across all error rates, we use Angluin and Laird's disagreement minimization.
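The two tolerance values quoted for n = 3 follow directly from the error-rate conditions above, evaluated in the limiting case δ = 1 and with the natural logarithm (the choice that reproduces the quoted 0.033). The short check below uses hypothetical function names.

```python
import math

def bkw_max_error_rate(n, delta=1.0):
    """Upper limit 1/2 - 2^-(n^delta) on the classification error rate."""
    return 0.5 - 2.0 ** (-(n ** delta))

def lyubashevsky_max_error_rate(n, delta=1.0):
    """Upper limit 1/2 - 2^-((ln n)^delta) on the classification error rate."""
    return 0.5 - 2.0 ** (-(math.log(n) ** delta))

bkw_limit = bkw_max_error_rate(3)            # 1/2 - 1/8
lyu_limit = lyubashevsky_max_error_rate(3)   # 1/2 - 2^(-ln 3)
```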
Extended Data Fig. 1. Circuit gate decomposition for 3-bit oracles. a, k = 111; b, k = 001. CNOT gates [see Fig. 1 ...

Extended Data Fig. 2. Experimental setup. Complete wiring of control and readout electronics inside and outside the Bluefors BF-LD400 dilution refrigerator (see Methods). Home-made Arbitrary Pulse Sequencers (BBN APS, each indicated by its 4 analog channels Ch1-Ch4) produce the waveforms for single-qubit measurement, control, and CR pulses. The readout signal for A is boosted by a Josephson parametric amplifier (JPA) from UC Berkeley [27].