Machine learning of high dimensional data on a noisy quantum processor

We present a quantum kernel method for high-dimensional data analysis using Google's universal quantum processor, Sycamore. This method is successfully applied to the cosmological benchmark of supernova classification using real spectral features with no dimensionality reduction and without vanishing kernel elements. Instead of using a synthetic dataset of low dimension or pre-processing the data with a classical machine learning algorithm to reduce the data dimension, this experiment demonstrates that machine learning with real, high dimensional data is possible using a quantum processor; but it requires careful attention to shot statistics and mean kernel element size when constructing a circuit ansatz. Our experiment utilizes 17 qubits to classify 67 dimensional data - significantly higher dimensionality than the largest prior quantum kernel experiments - resulting in classification accuracy that is competitive with noiseless simulation and comparable classical techniques.


I. INTRODUCTION
Quantum kernel methods (QKM) [1,2] provide techniques for utilizing a quantum co-processor in a machine learning setting. These methods were recently proven to provide a speedup over classical methods for certain specific input data classes [3]. They have also been used to quantify the computational power of data in quantum machine learning algorithms and drive the conditions under which quantum models will be capable of outperforming classical ones [4]. Prior experimental work [1,5,6] has focused on artificial or heavily pre-processed data, hardware implementations involving very few qubits, or circuit connectivity unsuitable for NISQ [7] processors; recent experimental results show potential for many-qubit applications of QKM to high energy physics [8].
In this work, we extend the method of machine learning based on quantum kernel methods up to 17 hardware qubits requiring only nearest-neighbor connectivity. We use this circuit structure to prepare a kernel matrix for a classical support vector machine to learn patterns in 67dimensional supernova data for which competitive classical classifiers fail to achieve 100% accuracy. To extract useful information from a processor without quantum error correction (QEC), we implement error mitigation techniques specific to the QKM algorithm and experimentally demonstrate the algorithm's robustness to some of the device noise. Additionally, we justify our circuit design based on its ability to produce large kernel magnitudes that can be sampled to high statistical certainty with relatively short experimental runs.
We implement this algorithm on the Google Sycamore processor which we accessed through Google's Quantum Computing Service. This machine is similar to the quantum supremacy demonstration Sycamore chip [9], but with only 23 qubits active. We achieve competitive results on a nontrivial classical dataset, and find intriguing classifier robustness in the face of moderate circuit fidelity. Our results motivate further theoretical work on noisy kernel methods and on techniques for operating on real, high-dimensional data without additional classical pre-processing or dimensionality reduction.

II. QUANTUM KERNEL SUPPORT VECTOR MACHINES
A common task in machine learning is supervised learning, wherein an algorithm consumes datum-label pairs (x, y) ∈ X × {0, 1} and outputs a function f : X → {0, 1} that ideally predicts labels for seen (training) input data and generalizes well to unseen (test) data. A popular supervised learning algorithm is the Support Vector Machine (SVM) [10,11] which is trained on inner products x i , x j in the input space to find a robust linear classification boundary that best separates the data. An important technique for generalizing SVM classifiers to non-linearly separable data is the so-called "kernel trick" which replaces x i , x j in the SVM formulation by a symmetric positive definite kernel function k(x i , x j ) [12]. Since every kernel function corresponds to an inner product on input data mapped into a feature Hilbert space [13], linear classification boundaries found by an SVM trained on a high-dimensional mapping correspond to complex, non-linear functions in the input space. In this experiment we performed limited data preprocessing that is standard for state-of-the-art classical techniques, before using the quantum processor to estimate the kernel matrixK ij for all pairs of encoded datapoints (x i , x j ) in each dataset. We then passed the kernel matrix back to a classical computer to optimize an SVM using cross validation and hyperparameter tuning before evaluating the SVM to produce a final train/test score.
Quantum kernel methods can potentially improve the performance of classifiers by using a quantum computer to map input data in X ⊂ R d into a high-dimensional complex Hilbert space, potentially resulting in a kernel function that is expressive and challenging to compute classically. It is difficult to know without sophisticated knowledge of the data generation process whether a given kernel is particularly suited to a dataset, but perhaps families of classically hard kernels may be shown empirically to offer performance improvements. In this work we focus on a non-variational quantum kernel method, which uses a quantum circuit U (x) to map real data into quantum state space according to a map φ(x) = U (x)|0 . The kernel function we employ is then the squared inner product between pairs of mapped input data given by k(x i , x j ) = | φ(x i )|φ(x j ) | 2 , which allows for more expressive models compared to the alternative choice φ(x i )|φ(x j ) [4].
In the absence of noise, the kernel matrix K ij = k(x i , x j ) for a fixed dataset can therefore be estimated up to statistical error by using a quantum computer to sample outputs of the circuit U † (x i )U (x j ) and then * e6peters@uwaterloo.ca computing the empirical probability of the all-zeros bitstring. However in practice, the kernel matrixK ij sampled from the quantum computer may be significantly different from K ij due to device noise and readout error. OnceK ij is computed for all pairs of input data in the training set, a classical SVM can be trained on the outputs of the quantum computer. An SVM trained on a size-m training set T ⊂ X learns to predict the class f (x) =ŷ of an input data point x according to the decision function: where α i and b are parameters determined during the training stage of the SVM. Training and evaluating the SVM on T requires an m × m kernel matrix, after which each data point z in the testing set V ⊂ X may be classified using an additional m evaluations of k(x i , z) for i = 1 . . . m. Figure 1 provides a schematic representation of the process used to train an SVM using quantum kernels.

A. Data and preprocessing
We used the dataset provided in the Photometric LSST Astronomical Time-series Classification Challenge (PLAsTiCC) [14] that simulates observations of the Vera C. Rubin Observatory [15]. The PLAsTiCC data consists of simulated astronomical time series for several different classes of astronomical objects. The time series consist of measurements of flux at six wavelength bands. Here we work on data from the training set of the challenge. To transform the problem into a binary classification problem, we focus on the two most represented classes, 42 and 90, which correspond to types II and Ia supernovae, respectively.
Each time series can have a different number of flux measurements in each of the six wavelength bands. In order to classify different time series using an algorithm with a fixed number of inputs, we transform each time series into the same set of derived quantities. These include: the number of measurements; the minimum, maximum, mean, median, standard deviation, and skew of both flux and flux error; the sum and skew of the ratio between flux and flux error, and of the flux times squared flux ratio; the mean and maximum time between measurements; spectroscopic and photometric redshifts for the host galaxy; the position of each object in the sky; and the first two Fourier coefficients for each band, as well as kurtosis and skewness. In total, this transformation yields a 67-dimensional vector for each object.
To prepare data for the quantum circuit, we convert lognormal-distributed spectral inputs to log scale, and normalize all inputs to − π 2 , π 2 . We perform no dimensionality reduction. Our data processing pipeline is consistent with the treatment applied to state-of-the-art classical methods. Our classical benchmark is a competitive solution to this problem, although significant additional feature engineering leveraging astrophysics domain knowledge could possibly raise the benchmark score by a few percent.

B. Circuit design
To compute the kernel matrix K ij ≡ k(x i , x j ) over the fixed dataset we must run R repetitions of each circuit U † (x j )U (x i ) to determine the total counts ν 0 of the all zeros bitstring, resulting in an estimatorK ij = ν0 R . This introduces a challenge since quantum kernels must also be sampled from hardware with low enough statistical uncertainty to recover a classifier with similar performance to noiseless conditions. Since the likelihood of large relative statistical error between K andK grows with decreasing magnitude ofK and decreasing R, the performance of the hardware-based classifier will degrade when the kernel matrix to be sampled is populated by small entries. Conversely, large kernel magnitudes are a desirable feature for a successful quantum kernel classifier, and a key goal in circuit design is to balance the a. b.

c.
FIG. 2: a. 14-qubit example of the type 2 circuit used for experiments in this work. The dashed box indicates U (x i ), while the remainder of the circuit computes U † (x j ) to ouput | φ(x j )|φ(x i ) | 2 . Non-virtual gates occurring at the boundary (dashed line) are contracted for hardware runs. b. The basic encoding block consists of a Hadamard followed by three single-qubit rotations, each parameterized by a different element of the input data x (normalization and encoding constants omitted here). c. We used the √ iSWAP entangling gate, a hardware-native two-qubit gate on the Sycamore processor.
requirement of large kernel matrix elements with a choice of mapping that is difficult to compute classically. Another significant design challenge is to construct a circuit that separates data according to class without mapping data so far apart as to lose information about class relationships -an effect sometimes referred to as the "curse of dimensionality" in classical machine learning.
For this experiment, we accounted for these design challenges and the need to accommodate highdimensional data by mapping data into quantum state space using the quantum circuit shown in Figure 2. Each local rotation in the circuit is parameterized by a single element of preprocessed input data so that inner products in the quantum state space correspond to a similarity measure for features in the input space. Importantly, the circuit structure is constrained by matching the input data dimensionality to the number of local rotations so that the circuit depth and qubit count individually do not significantly impact the performance of the SVM classifier in a noiseless setting. This circuit structure consistently results in large magnitude inner products (median K ≥ 10 -1 ) resulting in estimates forK with very little statistical error. We provide further empirical evidence justifying our choice of circuit in Appendix IV.

A. Dataset selection
We are motivated to minimize the size T ⊂ X since the complexity cost of training an SVM on m datapoints scales as O(m 2 ). However too small a training sample will result in poor generalization of the trained model, resulting in low quality class predictions for data in the reserved size-v test set V. We explored this tradeoff by simulating the classifiers for varying train set sizes in Cirq [16] to construct learning curves ( Figure 3) standard in machine learning. We found that our simulated 17qubit classifier applied to 67-dimensional supernova data was competitive compared to a classical SVM trained using the Radial Basis Function (RBF) kernel on identical data subsets. For hardware runs, we constructed train/test datasets for which the mean train and k-fold validation scores achieved approximately the mean performance over randomly downsampled data subsets, accounting for the SVM hyperparameter optimization. The final dataset for each choice of qubits was constructed by producing a 1000 × 1000 simulated kernel matrix , repeatedly performing 4-fold cross validation on a size-280 subset, and then selecting as the train/test set the exact elements from the fold that resulted in an accuracy closest to the mean validation score over all trials and folds.

B. Hardware classification and Postprocessing
We computed the quantum kernels experimentally using the Google Sycamore processor [9] accessed through Google's Quantum Computing Service. At the time of experiments, the device consisted of 23 superconducting qubits with nearest neighbor (grid) connectivity. The processor supports single-qubit Pauli gates with > 99% randomized benchmarking fidelity and √ iSWAP native entangling gates with XEB fidelities [9,17] typically greater than 97%.
To test our classifier performance on hardware, we trained a quantum kernel SVM using n qubit circuits for n ∈ {10, 14, 17} on d = 67 supernova data with balanced class priors using a m = 210, v = 70 train/test split. We ran 5000 repetitions per circuit for a total of m(m − 1)/2 + mv ≈ 1.83 × 10 8 experiments per number of qubits. As described in Section III A, the train and test sets were constructed to provide a faithful representation of classifier accuracy applied to datasets of restricted size. Typically the time cost of computing the decision function (Equation 1) is reduced to some fraction of mv since only a small subset of training inputs are selected as support vectors. However in hardware experiments we observed that a large fraction (> 90%) of data in T were selected as support vectors, likely due to a combination of a complex decision boundary and noise in the calculation ofK.
Training the SVM classifier in postprocessing required choosing a single hyperparameter C that applies a penalty for misclassification, which can significantly affect the noise robustness of the final classifier. To determine C without overfitting the model, we performed leave-one-out cross validation (LOOCV) on T to determine C opt corresponding to the maximum mean LOOCV score. We then fixed C = C opt to evaluate the test accuracy 1 v v j=1 Pr(f (x j ) = y j ) on reserved datapoints taken from V. Figure 4 shows the classifier accuracies for each number of qubits, and demonstrates that the performance of the QKM is not restricted by the number of qubits used. Significantly, the QKM classifier performs reasonably well even when observed bitstring probabil-  ities (and thereforeK ij ) are suppressed by a factor of 50%-70% due to limited circuit fidelity. This is due in part to the fact that the SVM decision function is invariant under scaling transformations K → rK and highlights the noise robustness of quantum kernel methods.

IV. CONCLUSION AND OUTLOOK
Whether and how quantum computing will contribute to machine learning for real world classical datasets remains to be seen. In this work, we have demonstrated that quantum machine learning at an intermediate scale (10 to 17 qubits) can work on "natural" datasets using Google's superconducting quantum computer. In particular, we presented a novel circuit ansatz capable of processing high-dimensional data from a real-world scientific experiment without dimensionality reduction or significant pre-processing on input data, and without the requirement that the number of qubits matches the data dimensionality. We demonstrated classification results that were competitive with noiseless simulation despite hardware noise and lack of quantum error correction. While the circuits we implemented are not candidates for demonstrating quantum advantage, these findings suggest quantum kernel methods may be capable of achieving high classification accuracy on near-term devices.
Careful attention must be paid to the impact of shot statistics and kernel element magnitudes when evaluating the performance of quantum kernel methods. This work highlights the need for further theoretical investigation under these constraints, as well as motivates further studies in the properties of noisy kernels.
The main open problem is to identify a "natural" data set that could lead to beyond-classical performance for quantum machine learning. We believe that this can be achieved on datasets that demonstrate correlations that are inherently difficult to represent or store on a classical computer, hence inherently difficult or inefficient to learn/infer on a classical computer. This could include quantum data from simulations of quantum many-body systems near a critical point or solving linear and nonlinear systems of equations on a quantum computer [18,19]. The quantum data could be also generated from quantum sensing and quantum communication applications. The software library TensorFlow Quantum (TFQ) [20] was recently developed to facilitate the exploration of various combinations of data, models, and algorithms for quantum machine learning. Very recently, a quantum advantage has been proposed for some engineered dataset and numerically validated on up to 30 qubits in TFQ using similar quantum kernel methods as described in this experimental demonstration [4]. These developments in quantum machine learning alongside the experimental results of this work suggest the exciting possibility for realizing quantum advantage with quantum machine learning on near term processors.

ACKNOWLEDGMENTS
We would like to thank Google Quantum AI team for time on their Sycamore-chip quantum computer. In particular, the presentation and discussion with Kostyantyn Kechedzhi on error mitigation techniques that was incorporated into this experiment and Ping Yeh's participation in some of the group discussions. Pedram Roushan provided a great deal of useful feedback on early versions of the draft and joined in several useful discussions. We would also like to thank Stavros Efthymiou for some early work on the quantum circuit simulations, and Brian Nord for consultation on interesting datasets in the domain of astrophysics and cosmology.
EP is partially supported through A Kempf's Google Faculty Award. JC, GP, and EP are partially supported by the DOE/HEP QuantISED program grant HEP Machine Learning and Optimization Go Quantum, identification number 0000240323. This manuscript has been authored by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy, Office of Science, Office of High Energy Physics.
Supervised learning algorithms are tasked with the following problem: Given input data X ⊂ R d composed of d-dimensional datapoints and the corresponding class labels taken from Y = {−1, 1} attached to each datapoint, construct a function f that can successfully predict f (x i ) = y i given a datapoint-label pair taken from the dataset (x i , y i ) ∈ X × Y.
We now introduce the theoretical foundations for the Support Vector Machine, or SVM (see [21,22] for a thorough review). A linear SVM performs binary classification by constructing a (d − 1) dimensional hyperplane x, w + b that divides elements of X according to their class, by finding the hyperplane with the largest perpendicular distance ("margin") to elements of either class. The hyperplane parameters capable of classifying linearly separable data must satisfy the inequality which corresponds to a symmetric margin of 2/||w|| dividing classes of linearly separable data. Maximizing the margin therefore corresponds to minimizing the hyperplane normal vector, so the task of the SVM is to find the solution to the convex problem Equations A1-A2 frame a constrained optimization problem that can be solved by method of Lagrange multipliers. The Lagrangian to minimize is then Recognizing that the inequality of Equation A1 can only be satisfied by fully separable data, SVM classifiers trained on real data typically employ so-called "slack" variables that loosen the classification constraints and introduce a misclassification penalty to the Lagrangian formulation [10]. The constraints for training a linear SVM on nonseparable data using slack variables ξ i then takes on the form: where we choose to assign an L2 penalty for misclassification by adding an additional cost term to the objective function A2, resulting in the modified objective function Applying the method of Lagrange multipliers, the primal Lagrangian for Equation A6 is Alternatively this can be reformulated to the Wolfe dual problem [23], with the goal of maximizing the dual Lagrangian subject to the constraints: As Equation A6 is a convex programming problem, the solutions to the primal Lagrangian of Equation A7 and the dual Lagrangian of Equation A8 are subject to the Karush-Kuhn-Tucker (KKT) conditions [24]: These conditions determine the hyperplane intercept b and also describe the geometry of the maximal margin hyperplane for the trained SVM. Once the optimal set of parameters α is determined with respect to the training inputs X , the linear SVM predicts the class of a data point using the decision function where the sum runs over the indices of the support vectors, or equivalently all nonzero α i . Equation A20 and the dual Lagrangian A8 no longer explicitly reference elements of the input space w, x i ∈ R, so the optimization problem is still valid under the substitution x → φ(x) for some mapping φ : X → H where H is a Hilbert space (this is the so-called "kernel trick" [12]). This permits us to embed input data into higher dimensional Hilbert space for some choice of φ and then train the SVM on inner products in the mapped space, φ( where the status of k as an inner product guarantees that it is symmetric, positive-definite. The resulting SVM will then be capable of constructing decision boundaries that are nonlinear in the input space R d resulting in a decision function Evaluating k(x p , x s ) = K ps for a fixed set x p , x s ∈ T ∪ V recovers Equation 1 from the main body (since α i = 0 for i outside the support vector set by Equations A13 and A16).
Appendix B: Circuit structure and Hilbert space embedding

Statistical uncertainty and vanishing kernels
We now discuss limitations to hardware-based quantum kernel methods due to statistical uncertainty. Recall that each kernel matrix element K ij = k(x i , x j ) is computed by sampling the output of a circuit U † (x j )U (x i ) for a total of R repetitions and counting the number ν 0 of all-zeros bitstrings that appear. This experiment constitutes R trials of a Bernoulli process parameterized by K ij ; the unbiased estimators for K ij and associated variance are therefore given by: Note that positive definiteness ofK is not necessarily preserved in the presence of statistical error and hardware noise, but in practice we found this had little effect on the ability of the SVM to classify data. The O(R −1/2 ) sampling error of Equation B2 combined with a requirement that would suggest that an -close estimation of K could be achieved using R = O( −2 N 2 ) shots per kernel element. In the main body we argue that this fails to bound relative error between kernel matrices. This is evident in the symmetric Chernoff bound for the relative error of a sampledK ij [25]: for which the probablity of large relative error quickly becomes unbounded for R << O(K −1 ij ). Relative error is relevant by the following reasoning: Let L D be the L1-penalized dual Lagrangian corresponding to a kernel constructed from the transformation K = rK, given by If α opt contains the parameters maximizing the Lagrangian then α opt also maximizes the Lagrangian 1 r L D which may be rewritten as follows: By comparison with Equation B4 L D has the immediate choice of unique solution α opt = α opt /r, which may be achieved by appropriate choice of penalty parameter C appearing in the KKT conditions A11-A19 for L D (or equivalently in the primal problem L P before kernelizing the problem). A similar result can be attained if an L2 misclassification penalty is used by restricting the Lagrange multipliers α. Consequently the decision function for an SVM classifier trained on the kernel matrix K is identical to the decision function for the SVM classifier trained on the original K. This result makes intuitive sense for the choice of a linear kernel K ij = x i , x j , for which stretching/shrinking each datapoint x i → x i / √ r has no effect on the geometry of the maximal margin hyperplane. By choosing some r for the transformation K → rK,K → rK, the absolute error ||K −K|| F may be made arbitrarily large or small without affecting the performance of the associated SVM classifier, which suggests that ||K −K|| F does not completely characterize the resulting SVM accuracy.
The high dimensionality of quantum state space poses a threat to the experimental feasibility of quantum kernel methods since the relative statistical error incurred by finite shot statistics grows as the magnitude of the sampled kernel element shrinks. Using a naive example, a randomly selected encoding unitary will almost certainly result in vanishing kernel elements by strong measure concentration: If we treat S = {U † (x j )U (x i )} as a distribution of unitaries that are random with respect to the Haar measure it is well known that the expected probability for any specific bitstring (and therefore ν0 R ) sampled from a unitary in S scales as O(2 −n ). In practice, the preprocessing of input data and circuit structure must be chosen with careful attention given to the corresponding distribution on K.
To explore this effect we constructed a classifier similar to quantum circuits described in [1,26], depicted in Figure  5b and given by where the entangling gates are selected among nearest neighbors on the (simulated) circuit grid. We chose to parameterize entanglers by c 2 (x i − x j ) instead of quadratic terms proportional to x i x j to eliminate concentration of E[x i x j ] → 0 that occurs for our choice of normalization. For clarity we refer to the circuit described by Equation B7 as "Type 1". An increase in the connectivity results in a higher gate/parameter count, resulting in generally smaller sampled kernel elements. This circuit structure requires that input data have dimension equal to the number of qubits. We applied Principal Component Analysis (PCA) [27] to reduce the 67-dimensional data to n dimensions and then standardized the data to the interval [−π/2, π/2]. We defined additional hyperparameters (c 1 , c 2 ) that can be tuned to optimize the cross-validated performance of the corresponding SVM and control the resulting distribution of kernel matrix elements. Figure 5a shows that the magnitude of K vanishes with respect to increasing c 1 , c 2 or number of qubits n. This does not necessarily result low accuracies for the associated SVM classifiers, but describes a family of kernels that are infeasible to sample on hardware. It is possible to preserve the magnitude of K if c 1 and c 2 are scaled down with increasing n but for small enough angles over/under-rotation errors and noise will become dominating factors in hardware outcomes, while the limit c 2 → 0 results in a circuit that can be simulated trivially.
These results motivate a new approach for encoding data on large numbers of qubits, especially if the input data dimensionality is large. To compute large-magnitude kernels on high dimensional data without dimensionality reduction, we designed a circuit encoding to map input data x i , z i ∈ X ⊂ R d into a subspace of C 2 n using an approximately orthogonal parameterization of U (2 n ), the group describing n-qubit unitaries. While examples of exactly orthogonal parameterizations of U (2 n ) exist, such as Euler angle parameterization of U (2 n ) [28], such schemes are generally inefficient to implement on hardware. We approximate such an encoding using circuits structured similarly to the Hardware Efficient Ansatz [29] consisting of an initial layer parameterizing n U (2) interspersed with local entanglers. This circuit structure (referred to here as "Type 2") is shown in Figure 2 of the main body and can be expressed in terms of individual gates as a. b. 2 4 6 8 10 12 14 16 Number of qubits where E(G) denotes the set of edges composing a length-n simple path permitted by the Sycamore connectivity, superscript (i) indicates action on qubit i, and S : R d → R 3n denotes selection of a subset of 3n elements from the input data to be encoded into a given rotation layer. The specific choice of rotation and entangling gates was influenced by the gate set available on the processor at the time the experiments were conducted, namely √ iSWAP and the Sycamore gate [9]. Note that the use of Z rotations in U A reduces the hardware depth of the corresponding circuit by 50% [30]. This architecture therefore encodes d-dimensional input data in O(d/n) depth on hardware. The fact that this circuit choice may be made arbitrarily shallow in number of qubits n rules out the possibility of demonstrating quantum advantage, but the experimental and design challenges of implementing this architecture are relevant to QKM in general.
As depicted in Figure 1, single qubit rotations in each circuit were "filled" sequentially from left to right, top to bottom beginning with the first element x 1 of the data and ending with x 67 . Exceptions were made to this pattern to more evenly distribute single qubit rotations between entangling layers. We explored randomized filling schemes and found no significant difference to circuit performance nor inner product magnitudes. When the number of qubits is not an integer factor of 67, gaps appeared in at most two layers of the circuit.

Hyperparameter tuning in the quantum circuit
Our circuit design introduces a single parameter c 1 for multiplicative scaling of on elements of preprocessed x. We observed that train/test performance varied with respect to c 1 and therefore treated it as a hyperparameter of the kernel function to be optimized in simulation. In practice, the same optimization procedure can be carried out using sweeps over c 1 for hardware submissions. Figure 6a shows a typical outcome of the hyperparameter tuning process and demonstrates both a local optimum for validation scores with respect to c 1 as well as a capacity for overfitting in the large-c 1 limit. The choice of optimal c 1 has a direct impact on the proximity of states |ψ(x) mapped into the quantum state space; in the trivial limit c 1 → 0 the unitary U (x) becomes the identity map and K ij → 1; similarly in the large c 1 limit the angles between mapped input data grow linearly and K quickly vanishes. Therefore there is a tension between producing mapped states {|ψ(x i ) } with good separability but without the isolation of mapped points in high dimensional space that would degrade performance, a phenomenon in classical machine learning known as the "curse of dimensionality" (e.g. [31]).

Appendix C: Dataset selection and preprocessing
We used the dataset provided in the Photometric LSST Astronomical Time-series Classification Challenge (PLAs-TiCC) [14]. After engineering the 67-float data with binary labels (Section II A), preprocessing consisted of the following steps: 1. Logscale transformation: Many of the features were distributed in a lognormal distribution which motivates use of the transformation x i → log 10 (x i ). Some information loss resulted from taking the absolute value of median flux, for which approximately 4% of entries in the original dataset were negative. All other absolute value operations resulted in negligible loss of sign information.
2. Normalization/scaling and outliers: Since the local rotations of U (x) are 2π-periodic, an effective normalization scheme is to map the boundaries of the input range to − π 2 , π 2 . Realistically, outliers will result in the effective range for most data being compressed into a much smaller space (thereby amplifying the effect of over/underrotation errors) and so we scaled data in such a way to ignore the effects of large outliers by applying the transformation: to every element of input data, where P k denotes the value of the k-the percentile of the input domain. This is equivalent to typical implementations of a robust scaler with the quantile range set to (0.01, 0.99) (e.g. [ which removes the effects of outliers on the rescaling. Then hyperparameter tuning was used to adjust the rotation parameters by a further multiplicative factor c 1 (see Appendix B 2).
Appendix D: Error mitigation

Device parameters
Periodic calibrations of the Sycamore superconducting qubit device produce diagnostic data describing qubit and gate performances. Calibration metrics relevant to this experiment included readout errors p 00 (probability of a computational basis measurement reporting a "1" when the result should have been "0") and p 11 (probability of a measurement reporting a "0" when the result should have been "1"), single qubit T 1 , single qubit gate RB error, and √ iSWAP gate cross-entropy benchmarking (XEB) error. The readout error probabilities were used primarily for constructing and analyzing post-processing techniques for error correction while the remainder of the calibration data were fed in to an automated qubit selection algorithm.

Automated qubit selection
To improve the performance of the algorithm for a given number of qubits, we designed a graph traversal algorithm to select qubits based on diagnostic data taken during device calibration. Less weight was applied to p 0 and p 1 due to the availability of readout error mitigation techniques. We constructed a qubit graph G q = (E, V ) where edges represent entangling gate connectivity and nodes represent qubits according to the Sycamore 23-qubit grid layout. The optimization was done by traversing all simple paths of fixed length k (implemented according to [33,34]), and then scoring each path according to some function of the metrics for the subset of nodes and edges visited. Stated as an optimization problem, given an objective function f : V × E → R scoring subsets of vertices and edges composing an Eulerian graph, this algorithm finds the maximum evaluation of f over all possible graphs G q : subject to the constraints |V k (G)| = k, |E k (G)| = k − 1. Before applying f , the heterogeneous calibration data were normalized to the range [0, 1] and inverted if they represented an error (as opposed to a fidelity). Letting the value of the p-th category of calibration data for the i-th qubit v i be c p (v i ), and similarly the p-th category of calibration for the i-th edge e i be d p (e i ), and defining g p as a scoring function applied to the p-th processed calibration metric, our implementation of f takes on the form where C 1 and C 2 represent the calibration metrics corresponding to single and pairs of qubits respectively. We implemented g p as a logarithmic function for T 1 , T 2 , and f XEB,2q metrics and a linear function for p 00 and p 11 metrics. Figure 7 shows the results of an example optimization overlaid on T 1 calibration results.

Readout error correction
Readout error resulting from relaxation and thermal excitation can be modelled by a stochastic bitflip process applied to the observed bitstrings. Here we describe an efficient and accurate technique for correcting readout error for quantum kernel methods.
Let p(y n |x n ) describe the conditional probability for observing bitstring y n after exposing the bitstring x n to n distinct bitflip channels, and let q k (y|x) for x, y ∈ {0, 1} describe the corresponding probability for observing bit "y" after exposing the k-th bit "x" to a single bitflip channel. Then for the k-th qubit, the metrics introduced in Appendix D 1 as p 00 = q k (1|0) and p 11 = q k (0|1) may be used to partially undo readout error by means of postprocessing. We define a response matrix R ∈ R 2 n ×2 n elementwise as (R) xy = p(y n |x n ) that contains as its elements the total probability for transition from bitstring x n = x 1 . . . x n to bitstring y n = y 1 . . . y n computed as the product of q k (1|0) and q k (0|1) corresponding to each individual bit: For simplicity, we assume that each individual bitflip may be modelled as an independent process, although the techniques discussed here are readily applicable to a system of dependent bitflips if the bitflip likelihoods are experimentally measured in parallel. Then evidently, Note that R is generally asymmetric since typically q k (1|0) < q k (0|1). While multiplying R −1 by the set of observed bitstring frequencies would recover the prior distribution of bitstring frequencies with good fidelity, standard matrix inversion is subject to instabilities and is not tractable for even modest numbers of qubits.
Since only the frequency of the all-zero's bitstring is necessary to computeK ij , we implemented correction using a small subset of bitstring transition probabilities to perform quick and relatively high-fidelity readout error correction in post-processing. We generated R and then truncated the full 2 n -dimensional basis to the bitstring space containing strings with Hamming weight ≤ k max for some n-dependent k max resulting in a truncated response matrix R t . We then computed the pseudo-inverse R −1 t and performed error correction by simple matrix multiplication on the array of experimental readout frequencies (similarly truncated). This simplification comes at the expense of knowledge about any other post-correction frequencies since the other bitstrings in the truncated space are off-center within the Hamming sphere of kept bitstrings, resulting in a bias in the inverted linear map.
We now analyze the effect of truncation on the readout error correction. The number of simultaneous readout error events (either relaxation or excitation) may be modelled as an induced random variable Z = k X k for Pr(X k = 1) = q k (¬x|x). This distribution has expected value µ = k q k (¬x|x) and an exponentially suppressed likelihood for simultaneous readout errors via the Chernoff bound Pr(Z ≥ k) ≤ exp(k − µ − k log(k/µ)). Thus a natural measure for the effect of Hamming weight truncation is the empirical probability allocated to the complement of the truncated subspace: While Equation D5 describes the probability of events outside the truncated subspace, it does not directly translate to failure probability for truncated readout correction. To explore this effect numerically, we computed the output probability distributions for a 10-qubit quantum kernel circuit sampled for 5000 repetitions, and then introduced artificial readout error (using bitflip probabilities taken from the Sycamore processor) followed by roughly 5% Gaussian noise on the sampled distribution. Figure 8 shows the error distribution as well as the effect of correcting using an inverted truncated response matrix for a variety of kernel magnitudes and truncation weights. We observed similar behavior for 14 and 17-qubit simulated experiments; readout error for quantum kernel experiments may be corrected reasonably well using a small fraction of the full bitflip response matrix. Empirical probability for transition out of the weight ≤ k max vanishes exponentially in the highest weight considered. Figure 8 suggests a linear relationship between the kernel K computed in the presence of readout error and the noiseless kernel K. While readout error does not constitute a linear process in general (the underlying bitflip probabilities and bitstring distributions may be modified to give rise to arbitrary effects on K within the bounds of Equation E6), by the arguments in Section IV demonstrating such an effect in general would imply minimal effect of readout error on the corresponding classifier. The degree to which readout error plays a role in quantum kernel methods is therefore an important research area for implementation on near-term processors.

Crosstalk optimization
Cross-talk between two-qubit gates on implemented on superconducting processors can contribute to decoherence and decrease circuit fidelity. Since our choice of circuit ansatz requires entire layers of entangling gates, we conducted diagnostic runs to determine whether executing these gates sequentially (in different staggering patterns) could improve performance compared to executing the gates simultaneously. We found that the completely parallel execution of entangling gates achieved the lowest cross entropy with respect to noiseless simulation compared to partially sequential arrangements. Therefore all entangling gates in this experiment were run in parallel.
Appendix E: Hardware error and performance Figure 9 shows a typical outcome for sampled kernel elementsK ij compared to their true values as determined from noiseless simulation. Notably, all of the hardware outcomes are strongly biased towards zero as a result of decoherence. The observation that the SVM decision function is scale-invariant (Appendix IV) suggests that the performance of an SVM trained using data subject to hardware error will only be affected to the degree that the sampled kernel elements K ij differ from some linear transformation of the corresponding exact elements K ij , so that the circuit fidelity and other typical metrics for hardware performance are not predictive of classifier performance in any obvious way. For instance, we achieved comparable test accuracy to noiseless simulation for n = 14 qubits despite circuit fidelity in the neighborhood of 30%.
FIG. 9: (left) Distribution of sampledK compared to simulated K for 17 qubit train set (m = 210) colored according to quartiles of K. The dashed line indicates the value of rK for r = 0.29, demonstrating that the hardware trends towards some scaled value of K. The mean value of K ii corresponding to Tr(|0 0|) ≡ 1 (elements in the top right) is a useful proxy for circuit fidelity and tends towards 30% for the qubit counts investigated. (right) The distribution ofK around rK is irregular but exhibits consistent patterns with respect to kernel magnitude. The hardware error is therefore not normal with respect to (a scaled version of) K, which complicates bounds for guaranteed performance.

Effects of readout error in quantum kernel methods
SinceK is estimated by computing the empirical probability of the all zeros bitstring p(0), the effects of readout error may be bounded in a straightforward manner. The following analysis will consider readout error as the sole source of noise and ignore statistical effects and other sources of decoherence. The resulting bounds apply to only to the infinite-shot limit but will be shown to be approximately correct in the low-shot limit. As in Section D 3, we let p(y n |x n ) describe the conditional probability for observing bitstring y n after exposing the bitstring x n to n distinct bitflip channels, and let q k (y|x) for x, y ∈ {0, 1} describe the corresponding probability for observing "y" after exposing the k-th bit "x" to a single bitflip channel. After exposing the all-zeros bitstring to a stochastic bitflip process repeatedly, the lowest possible value forK occurs when no bitstrings transform into the all-zeros bitstring. The expected fraction of events remaining is The maximum increase in the observed p(0) will result from transitions from other bitstrings into the all-zeros bistring. Intuitively this will occur almost entirely due to low-weight bitstrings with just a few bitflips. The probability of transition into the all-zeros bitstring from an arbitrary starting bitstring y n = y 1 y 2 . . . y n is p(0|y n ) = n k=1 q k (0|y k ), while the log-odds of each term in this product may be rewritten as log q(0|y k ) = log q(0|0)(1 − y k ) + log q k (0|1)y k (E2) The total log-odds for transition into 0 is then which is in the form w · y n + b with w ∈ R n , b ∈ R, indicating a linear dependence of log p(0|y n ) on y n . Since the logarithm is strictly increasing, the corresponding integer programming problem for maximizing p(0|y n ) is: where the second constraint avoids the trivial solution of the all-zeros bitstring. By inspection of Equation E3, q k (0|0) = 1 − q k (1|0) implies that w is strictly negative with the realistic assumption that q k (1|0) < 0.5, q k (0|1) < 0.5. In this case, the solution maximizing Equation E4 must be a bitstring with weight 1 with a "1" at the position argmax k q k (0|1). Combined with Equation E1 this results in the bound: WhereK describes the estimated kernel element after readout error has occurred. This bound is plotted in Figure  8 for one instance of a 10-qubit readout error calibration. The form of these bounds further justify the need for large kernels, since the bound becomes increasingly loose asK approaches the magnitude of typical readout error probabilities on the quantum processor. We now discuss the implementation of the readout error correction technique described in Appendix D 3. While routine calibration of the device returns diagnostic information on readout error probabilities q k (0|1) and q k (1|0), these quantities may drift in the time between calibration and experiment resulting in lower quality readout correction. To account for drift, we periodically estimated q k (0|1) and q k (1|0) by preparing a sequence of random bitstrings |s and the complement |s ⊕ 1 and then measured in the computational basis. We then computed empirical bitflip likelihoods for measurements in parallel over the specific qubits used in each experiment and computed the timeaveraged likelihoods to use for readout correction. We averaged over the different prepared states to reduce the impact of imperfect state preparation on measured outcomes. Figure 10 shows learning curves computed for the 14-qubit experiment with and without readout correction for post-processing. While classifiers trained using error-corrected results are capable of achieving higher accuracies on the reserved test set, the choice of C hyperparameter for improved test accuracy did not generally correspond to improved accuracies for the validation sets. Therefore, we could not consistently improve the classifier performance by applying readout correction and opted to present the results achieved without readout correction in Figure 4 in the main body.

Effects of statistical error on SVM accuracy
While recent work [3] has established performance bounds relating SVM accuracy to statistical sampling error for quantum kernel methods, we conducted experiments using orders of magnitude fewer repetitions than necessary to achieve robust classification. In addition, modified SVM algorithms exist for processing noisy inputs in the data space [35] but these modifications do not generalize to processing noise in the feature space (i.e. ∆k(x i , x j )). Given these limitations, we chose to explore the effects of statistical noise numerically to determine the relative impact of statistical noise on final classifier accuracy compared to other sources of error. We analyzed the LOO cross-validation accuracy versus accuracy on the reserved test set to determine the effect of (truncated) readout correction using readout error likelihoods determined experimentally at periodic intervals during the 14-qubit experiment. While the trained classifiers are often able to achieve significantly higher accuracy on the reserved test set, the error-corrected validation accuracy is not predictive of improved test accuracy. For instance, the k max = 2 classifier achieves a 65.7% validation accuracy and a 64.3% test accuracy while the classifier with no readout correction achieves 66.1% validation accuracy and 61.4% test accuracy, indicating that the improved test score cannot be consistently predicted by LOOCV. Figure 11 shows simulated trials of the type 2 circuits used for experimental runs. Each circuit was initially simulated with an full wavefunction simulator, and then the amplitude K = | 0|ψ | 2 was used to sample R repetitions from the implied binomial distribution Bin(R, K). The sampled kernel elements were used to train and validate SVM classifiers following the procedure outlined in Section III B. The results indicate there are diminishing returns in classifier accuracy beyond the R = 5000 repetitions we used for the experiments, but that this choice incurs ∼ 1-2% classifier error compared to the infinite-shot limit. As expected, greater statistical noise results in less overfitting for the model. Figure 12 shows the result of tuning the hyperparameter C controlling the L2 penalty for violating the SVM margin.

Classifier results
The C values resulting in optimal validation scores were then used to determine the final train/test scores reported in Figure 4 of the main body. We empirically investigated the effect of statistical noise by generating a sampling a binomial distribution with success probability equal to K and enfocing symmetry on the resulting kernel matrix. The results show that additional circuit repetitions beyond R ≈ 5000 provide diminishing returns to the validation accuracy of the classifier. We note that typically 50,000 circuit repetitions per kernel element are required to achieve cross-validated accuracy comparable to noiseless simulation. Fill for finite-R represents 1σ interval for stratified 10-fold cross-validated train/test scores over 10 trials of downsampling to R shots. Hyperparameter optimization for hardware kernels is performed by tuning the L1 penalty parameter C via LOOCV on the training data (top row) during model validation, which then becomes fixed for evaluation of the model on the test set (bottom row). The capacity of the hardware-based models to overfit the data is drastically reduced, and oftentimes the SVM behavior becomes pathological. To avoid undesirable generalization behavior, the validation score corresponding to the optimal C was required to be no greater than the corresponding training score. The vertical dashed line indicates the optimal C decided in the validation stage.