QUBO formulations for training machine learning models

Training machine learning models on classical computers is usually a time- and compute-intensive process. With Moore's law nearing its inevitable end and an ever-increasing demand for large-scale data analysis using machine learning, we must leverage non-conventional computing paradigms like quantum computing to train machine learning models efficiently. Adiabatic quantum computers can approximately solve NP-hard problems, such as the quadratic unconstrained binary optimization (QUBO) problem, faster than classical computers. Since many machine learning problems are also NP-hard, we believe adiabatic quantum computers might be instrumental in training machine learning models efficiently in the post-Moore's-law era. In order to solve problems on adiabatic quantum computers, they must be formulated as QUBO problems, which is very challenging. In this paper, we formulate the training problems of three machine learning models (linear regression, support vector machine (SVM) and balanced k-means clustering) as QUBO problems, making them amenable to training on adiabatic quantum computers. We also analyze the computational complexities of our formulations and compare them to those of the corresponding state-of-the-art classical approaches. We show that the time and space complexities of our formulations are better than (in the case of SVM and balanced k-means clustering) or equivalent to (in the case of linear regression) those of their classical counterparts.

The importance of machine learning algorithms in scientific advancement cannot be overstated. Machine learning algorithms have given us great predictive power in medical science 1 , economics 2 , agriculture 3 and other fields. These algorithms can only be implemented and deployed after they have been trained, a process that requires tuning the model parameters of a machine learning model in order to extract meaningful information from data. Training a machine learning model is usually a time- and compute-intensive process. In such situations, one is often forced to make a trade-off between the accuracy of a trained model and the training time. With the looming end of Moore's law and rapidly increasing demand for large-scale data analysis using machine learning, there is a dire need to explore the applicability of non-conventional computing paradigms like quantum computing to accelerate the training of machine learning models.
Quantum computers are known to bypass classically difficult computations by performing operations on high-dimensional tensor product spaces 4 . To this end, we believe that machine learning problems, which often require such manipulation of high-dimensional data sets, can be posed in a manner conducive to efficient quantum computation. Quantum computers have been shown to yield approximate solutions to NP-hard problems, such as the quadratic unconstrained binary optimization (QUBO) problem 5 , the graph clustering problem 6 , and the protein folding problem 7 . In addition to these results, the demonstration of quantum supremacy by Google 8 has led us to believe that quantum computers might offer speedup in a much wider range of problems, such as accelerating the training of machine learning models.
To this end, the principal contributions of our work are: (1) We formulate the training problems of three machine learning models (linear regression, support vector machine (SVM) and balanced k-means clustering) as QUBO problems so that they can be trained on adiabatic quantum computers. (2) For the aforementioned models, we provide a theoretical comparison between state-of-the-art classical training algorithms and our formulations that are conducive to being trained on adiabatic quantum computers. We observe that the time and space complexities of our formulations are better in the case of SVM and balanced k-means clustering, and equivalent in the case of linear regression, to those of their classical counterparts. Our formulations provide a promising outlook for training such machine learning models on adiabatic quantum computers. In the future, larger and more robust quantum computers are expected to abate the limitations of current machines and potentially allow machine learning models to be trained faster and more reliably.

Related work
Quantum machine learning algorithms have been proposed for both universal and adiabatic quantum computers. We briefly review a handful of such algorithms that leverage universal quantum computers here. Relevant algorithms leveraging adiabatic quantum computers have been reviewed in the subsequent sections. Quantum machine learning algorithms, and in general, all quantum algorithms will greatly benefit from optimal design of quantum circuits 9,10 , optimized quantum states 11 , quantum memory 12 , improved quantum coherence times 13 and quantum error correction 14 . Today's quantum machine learning algorithms are catered towards quantum computers in the noisy intermediate-scale quantum (NISQ) era. Results presented in this paper are part of our ongoing work to accelerate training of machine learning models using quantum computers [15][16][17] . Farhi et al. proposed the Quantum Approximate Optimization Algorithm (QAOA), which produces approximate solutions for combinatorial optimization problems [18][19][20] , is computationally universal 21 , and has been used to train unsupervised machine learning models 22 . Farhi and Neven also proposed quantum neural networks where a sequence of parameter dependent unitary transformations act on classical or quantum input data and produce classification predictions on the output qubits 23 . Gyongyosi and Imre proposed training optimizations for such gate-based quantum neural network models 24 . Benedetti et al. 25 proposed the use of the variational quantum eigensolver (VQE) algorithm in conjunction with parameterized quantum circuits as quantum machine learning models. QAOA and VQE based quantum machine learning models are widely used in the literature.

Adiabatic quantum computers
The adiabatic theorem states that a quantum physical system remains in its instantaneous eigenstate under a slowly acting perturbation if there is a gap between its eigenvalue and the rest of the Hamiltonian's spectrum 26 . Adiabatic quantum computers leverage the adiabatic theorem to perform computation 27 . Specifically, starting with the global minimum of a simple Hamiltonian, they homotopically connect it to the global minimum of the problem of interest 28 . The D-Wave adiabatic quantum computers, for instance, are adept at approximately solving the quadratic unconstrained binary optimization (QUBO) problem, which is stated as follows:

min_{z ∈ B^M} z^T A z + z^T b

where M is a natural number; B = {0, 1} is the set of binary numbers; z ∈ B^M is the binary decision vector; A ∈ R^{M×M} is the real-valued M × M QUBO matrix; and b ∈ R^M is the real-valued M-dimensional QUBO vector.
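The QUBO objective above can be made concrete with a minimal sketch: for tiny M, the minimum of z^T A z + z^T b can be found by brute force on a classical machine, whereas an annealer samples low-energy z for much larger problems. The matrix A and vector b below are illustrative values, not from the paper.

```python
# Brute-force QUBO solver sketch: minimize z^T A z + z^T b over z in {0,1}^M.
# Only feasible for tiny M; shown purely to make the problem statement concrete.
import itertools
import numpy as np

def solve_qubo_brute_force(A, b):
    """Exhaustively minimize z^T A z + z^T b over binary vectors z."""
    M = len(b)
    best_z, best_e = None, np.inf
    for bits in itertools.product([0, 1], repeat=M):
        z = np.array(bits, dtype=float)
        e = z @ A @ z + z @ b
        if e < best_e:
            best_z, best_e = z, e
    return best_z, best_e

A = np.array([[1.0, -2.0], [0.0, 1.0]])  # illustrative QUBO matrix
b = np.array([0.5, -0.5])                # illustrative QUBO vector
z, e = solve_qubo_brute_force(A, b)
```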

Notation
We use the following notation throughout this paper:
• X: Training data set, usually X ∈ R^{N×d}, i.e. X contains N data points along its rows, and each data point is a d-dimensional row vector.
• Y: Regression labels of the training data set in the case of regression (Y ∈ R^N); classification labels of the training data set in the case of support vector machine (Y ∈ B^N).

Linear regression
Background. Linear regression is one of the oldest statistical machine learning techniques that is used in a wide range of applications, such as scientific research 29 , business 30 and weather forecasting 31 . Linear regression models the relationship between a dependent variable and one or more independent variables. Adiabatic quantum computing approaches have been proposed in the literature for solving the linear regression problem (Eq. 2). Borle et al. propose a quantum annealing approach for the linear least squares problem 32 . Chang et al. present a quantum annealing approach for solving polynomial systems of equations using least squares 33 . Chang et al. propose a method for solving polynomial equations using quantum annealing and discuss its application to linear regression 34 . While these approaches can only find positive real-valued regression weights, our formulation finds both positive and negative real-valued regression weights.
Here, we denote X ∈ R^{N×(d+1)} as the augmented regression training data matrix, where we have augmented each row of the original X ∈ R^{N×d} with unity for the sake of mathematical convenience. The regression training labels are denoted by Y ∈ R^N, and the regression weights are denoted by w ∈ R^{d+1}. Given X and Y, training a linear regression model can be stated as follows:

min_w E(w) = ‖Xw − Y‖^2    (2)

Here, E(w) is the Euclidean error function. With reference to Fig. 1, the blue dots represent the data points X and Y, and the green line, characterized by the weights w, is the regression hyperplane which fits the data. The regression problem has an analytical solution, given by w = (X^T X)^{−1} X^T Y when X^T X is non-singular.

QUBO formulation. We start by rewriting Problem (2) as:

min_w E(w) = w^T X^T X w − 2 w^T X^T Y + Y^T Y    (4)

We introduce a K-dimensional precision vector P = [p_1, p_2, ..., p_K]^T. Each entry in P can be an integral power of 2, and can be both positive or negative. We also introduce, for each weight, a K-dimensional vector ŵ_i ∈ B^K with binary coefficients, such that the inner product ŵ_i^T P yields a scalar w_i ∈ R. This scalar w_i represents the ith entry in our weight vector, where 1 ≤ i ≤ (d + 1). The entries of P must be sorted, for instance P = [−2, −1, −1/2, 1/2, 1, 2]^T. The component ŵ_ik can be thought of as a binary decision variable that selects or ignores entries in P depending on whether its value is 1 or 0 respectively. With this formulation, we can have up to 2^K unique values for each w_i when P contains only positive values, for instance. However, if P contains negative values as well, then the number of unique attainable values for each w_i might be less than 2^K. For example, if P = [−1, −1/2, 1/2, 1]^T, then only the following seven distinct values can be attained: {−3/2, −1, −1/2, 0, 1/2, 1, 3/2}. Now, let us define the binary vector ŵ ∈ B^{K(d+1)}, such that:

ŵ = [ŵ_1^T ŵ_2^T · · · ŵ_{d+1}^T]^T

Similarly, with a slight abuse of notation, we can define a precision matrix P as follows:

P = I_{d+1} ⊗ P^T

where I_{d+1} represents the (d + 1)-dimensional identity matrix, and ⊗ represents the Kronecker product. Note that P has the dimensions (d + 1) × K(d + 1).
We can now recover our original weight vector by observing that:

w = P ŵ

We have thus represented our weight vector (to finite precision) in terms of the precision matrix P and the binary vector ŵ ∈ B^{K(d+1)}. We are now able to pose the minimization problem of Eq. (4) as an equivalent QUBO problem. Substituting this expression for the weight vector w in terms of P and ŵ into Eq. (4) yields:

min_ŵ ŵ^T P^T X^T X P ŵ − 2 ŵ^T P^T X^T Y    (8)

Note that we have neglected the term Y^T Y because it is a constant scalar and does not affect the optimal solution to this unconstrained optimization problem. Observe that Eq. (8) now has the form of a QUBO problem, as desired, with QUBO matrix P^T X^T X P and QUBO vector −2 P^T X^T Y. Hence, we can solve this optimization problem using an adiabatic quantum computer.
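The assembly of the regression QUBO in Eq. (8) can be sketched end to end; the data, precision vector, and problem sizes below are illustrative, and the brute-force search stands in for the annealer at toy scale.

```python
# Sketch of Eq. (8): with w = P_mat @ w_hat, the objective ||Xw - Y||^2 becomes
# w_hat^T A w_hat + w_hat^T b (up to the constant Y^T Y), where
# A = P_mat^T X^T X P_mat and b = -2 P_mat^T X^T Y.
import itertools
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # rows augmented with unity
Y = np.array([0.0, 1.0, 2.0])                        # fit exactly by w = [0, 1]
P = np.array([-1.0, 1.0])                            # K = 2 precision vector
P_mat = np.kron(np.eye(X.shape[1]), P)               # (d+1) x K(d+1)

A = P_mat.T @ X.T @ X @ P_mat
b = -2.0 * P_mat.T @ X.T @ Y

# Brute-force the QUBO (stand-in for annealing, toy scale only) and decode w.
best = min(itertools.product([0, 1], repeat=A.shape[0]),
           key=lambda z: np.array(z) @ A @ np.array(z) + np.array(z) @ b)
w = P_mat @ np.array(best, dtype=float)
```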
Computational complexity. The regression problem (Problem 2) has O(Nd) data (X and Y) and O(d) weights (w), which is the same for Problem (8). We introduced K binary variables for each of the d + 1 weights when converting Problem (2) to Problem (8). So, we have O(dK) variables in Eq. (8), which translates to a quadratic qubit footprint (O(K^2 d^2)) using an efficient embedding algorithm such as the one proposed by Date et al. 5 . Embedding is the process of mapping logical QUBO variables to qubits on the hardware, and is challenging because inter-qubit connectivity on the hardware is extremely limited. So, the space complexity of our approach is O(K^2 d^2).

Solving the regression problem classically takes O(Nd^2) time. We analyze the time complexity of our approach in three parts: (i) the time taken to convert the regression problem into a QUBO problem; (ii) the time taken to embed the QUBO problem onto the hardware; and (iii) the time taken to perform quantum annealing. From Eq. (8), we can infer that the conversion takes O(Nd^2 K^2) time. Since we have O(dK) variables in the QUBO formulation, embedding can be done in O(d^2 K^2) time using the embedding algorithm proposed by Date et al. 5 . While the theoretical time complexity of quantum annealing to obtain an exact solution is known to be exponential (O(e^√d)) 35 , a more realistic estimate of the running time can be made by using measures such as ST99 and ST99(OPT) 36 , which give the expected number of iterations to reach a certain level of optimality with 99% certainty. Quantum annealing is known to perform well on problems where the energy barriers between local optima are tall and narrow, because such an energy landscape is more conducive to quantum tunneling. In order to estimate ST99 and ST99(OPT) for our approach, details on specific instances of the regression problem are required.
Estimating ST99 and ST99(OPT) for a generic QUBO formulation of the regression problem remains outside the scope of this paper.
Having said that, we would like to shed some light on the quantum annealing running times observed in practice. An adiabatic quantum computer can only accommodate finite-sized problems; for example, the D-Wave 2000Q can accommodate problems having 64 or fewer binary variables requiring all-to-all connectivity 5 . For problems within this range, a constant annealing time and a constant number of repetitions seem to work well in practice. So, the total time to convert and solve a linear regression problem on an adiabatic quantum computer would be O(Nd^2 K^2).
It may seem that this running time is worse than its classical counterpart. However, the above analysis assumes that K is variable. On classical computers, the precision is fixed, for example, 32-bit or 64-bit precision. We can analogously fix the precision for quantum computers, and interpret K as a constant. The resulting qubit footprint would be O(d 2 ) , and the time complexity would be O(Nd 2 ) , which is equivalent to the classical approach.

Support vector machine (SVM)
Background. Support vector machine (SVM) is a powerful supervised machine learning model that produces robust classifiers, as shown in Fig. 2. The classifier produced by SVM maximizes its distance from the classes of the data points. Although SVM was originally meant for binary classification, several variants of SVM have been proposed over the years that allow multi-class classification 37,38 . SVM has wide-ranging applications in multimedia (vision, text, speech etc.) 39 , biology 40 , and chemistry 41 , among many other scientific disciplines.
Some quantum approaches for training SVM using adiabatic quantum computers have been proposed in the literature. Ahmed proposes a formulation for quantum SVM that runs on noisy intermediate-scale quantum (NISQ) processors 42 . Welsh et al. propose a formulation of SVM for the D-Wave quantum computers 43 . Our findings improve upon their formulation, allowing for real-valued learning parameters up to a certain precision.
Given training data X ∈ R^{N×d} and training labels Y ∈ {−1, +1}^N, we would like to find a classifier (determined by weights w ∈ R^d and bias b ∈ R) that separates the training data. Formally, training SVM is expressed as:

min_{w,b} (1/2) ‖w‖^2  subject to  y_i (w^T x_i + b) ≥ 1, ∀i    (9)

Note that x_i is the ith row vector in X and y_i is the ith element in Y. The objective function is convex because its Hessian matrix is positive semidefinite. Furthermore, since the constraints are linear, they are convex as well, which makes Problem (9) a quadratic programming problem. To solve Problem (9), we first compute the Lagrangian as follows:

L(w, b, λ) = (1/2) w^T w − Σ_{i=1}^N λ_i [y_i (w^T x_i + b) − 1]    (10)

where λ is the vector containing all the Lagrange multipliers, i.e. λ = [λ_1 λ_2 · · · λ_N]^T, with λ_i ≥ 0 ∀i. The non-zero Lagrange multipliers in the final solution correspond to the support vectors and determine the hyperplanes H_1 and H_2 in Fig. 2. The Lagrangian dual problem (Eq. 10) is solved in O(N^3) time on classical computers by applying the Karush-Kuhn-Tucker (KKT) conditions 44,45 . As part of the KKT conditions, we set the gradient of L(w, b, λ) with respect to w to zero, and also set the partial derivative of L(w, b, λ) with respect to b to zero. Doing so yields:

w = Σ_{i=1}^N λ_i y_i x_i    (11)

Σ_{i=1}^N λ_i y_i = 0    (12)

Substituting Eqs. (11) and (12) into Eq. (10):

L(λ) = Σ_{i=1}^N λ_i − (1/2) Σ_{i=1}^N Σ_{j=1}^N λ_i λ_j y_i y_j x_i^T x_j    (13)

Note that Eq. (13) is a function of λ only. We want to maximize Eq. (13) with respect to the Lagrange multipliers, while ensuring λ_i ≥ 0 ∀i and satisfying Eq. (12).
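The dual objective of Eq. (13) can be illustrated on a tiny linearly separable problem; the data and multiplier values below are illustrative, chosen so that the Eq. (12) constraint holds.

```python
# Sketch of the SVM dual objective of Eq. (13):
# L(lam) = sum(lam) - 0.5 * lam^T (y y^T ⊙ X X^T) lam
import numpy as np

def dual_objective(lam, X, y):
    """Evaluate the dual objective for multipliers lam."""
    Q = np.outer(y, y) * (X @ X.T)   # element-wise product of y y^T and X X^T
    return float(lam.sum() - 0.5 * lam @ Q @ lam)

X = np.array([[0.0, 1.0], [0.0, -1.0]])   # one training point per class
y = np.array([1.0, -1.0])
lam = np.array([0.5, 0.5])                # satisfies Eq. (12): sum(lam * y) = 0
val = dual_objective(lam, X, y)

# These multipliers are optimal for this toy problem; by Eq. (11) the weight
# vector is w = sum_i lam_i y_i x_i = [0, 1].
w = (lam * y) @ X
```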

QUBO formulation.
In order to convert SVM training into a QUBO problem, we write Eq. (13) as a minimization problem:

min_λ (1/2) Σ_{i=1}^N Σ_{j=1}^N λ_i λ_j y_i y_j x_i^T x_j − Σ_{i=1}^N λ_i    (14)

This can be written in a matrix form as follows:

min_λ (1/2) λ^T (Y Y^T ⊙ X X^T) λ − λ^T 1_N  subject to  λ ≥ 0_N    (15)

where 1_N and 0_N represent N-dimensional vectors of ones and zeros respectively, and ⊙ is the element-wise multiplication operation. We now reintroduce the K-dimensional precision vector P = [p_1, p_2, . . . , p_K]^T as described in the "Linear regression" section of this paper, but only allow positive powers of 2 in order to impose the non-negativity constraint on λ. We also introduce K binary variables λ̂_ik for each Lagrange multiplier λ_i such that:

λ_i = Σ_{k=1}^K p_k λ̂_ik    (16)

where p_k denotes the kth entry in the precision vector P. Next, we vertically stack all binary variables:

λ̂ = [λ̂_11 · · · λ̂_1K λ̂_21 · · · λ̂_NK]^T    (17)

We now define the precision matrix as follows:

P = I_N ⊗ P^T    (18)

Notice that:

λ = P λ̂    (19)

Finally, we substitute the value of λ from Eq. (19) into Eq. (15), which yields a QUBO problem over the binary variables λ̂.

Computational complexity. Our QUBO formulation has O(NK) binary variables, which translates to a quadratic qubit footprint of O(N^2 K^2) using an efficient embedding algorithm 46 . We analyze the time complexity for training an SVM model in three parts as outlined in the "Linear regression" section. Firstly, the time complexity for converting Problem (9) into a QUBO problem can be inferred from Eqs. (14) and (16) as O(N^2 K^2). Secondly, the time taken to embed the (NK)-sized QUBO problem on the quantum computer is O(N^2 K^2) (see the "Linear regression" section for more details). Lastly, for the reasons mentioned in the "Linear regression" section, it is not straightforward to get a realistic estimate of the time complexity of the quantum annealing process. However, a constant annealing time in conjunction with a constant number of repetitions seems to work well in practice on an adiabatic quantum computer of fixed and finite size, as explained in the "Linear regression" section. So, the total time complexity is O(N^2 K^2).
Note that the qubit footprint O(N^2 K^2) and time complexity O(N^2 K^2) assume that K is a variable. If the precision for all parameters (λ̂) is fixed (e.g. limited to 32-bit or 64-bit precision), then K becomes a constant factor. The resulting qubit footprint would be O(N^2), and the time complexity would also be O(N^2). This time complexity is an order of magnitude better than that of the classical algorithm (O(N^3)).
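The SVM QUBO assembly can be sketched as follows. Note one loudly flagged assumption: the equality constraint of Eq. (12) is folded in here as a squared penalty with a multiplier `xi`, which is a common way to handle linear equality constraints in QUBO form but is our choice, not a detail taken from the paper; the data, precision vector, and penalty weight are illustrative.

```python
# Hedged sketch of the SVM QUBO under lam = Pbar @ lam_hat. The squared-penalty
# treatment of Eq. (12) (multiplier xi) is an assumption of this sketch.
import itertools
import numpy as np

X = np.array([[0.0, 1.0], [0.0, -1.0]])     # one training point per class
y = np.array([1.0, -1.0])
N = len(y)
P = np.array([0.25, 0.5])                   # K = 2, positive powers of two
Pbar = np.kron(np.eye(N), P)                # N x NK precision matrix
xi = 10.0                                   # penalty multiplier (assumed)

# QUBO: 0.5 * lam^T (yy^T ⊙ XX^T) lam - 1^T lam + xi * (lam^T y)^2,
# where the last term expands to xi * lam^T (y y^T) lam.
Q_dual = np.outer(y, y) * (X @ X.T)
A = Pbar.T @ (0.5 * Q_dual + xi * np.outer(y, y)) @ Pbar
b = -Pbar.T @ np.ones(N)

# Brute-force the QUBO (stand-in for annealing, toy scale only) and decode lam.
best = min(itertools.product([0, 1], repeat=A.shape[0]),
           key=lambda z: np.array(z) @ A @ np.array(z) + np.array(z) @ b)
lam = Pbar @ np.array(best, dtype=float)
```

The recovered multipliers match the optimum of the dual on this toy problem, which suggests the penalty weight is large enough here; in general `xi` must be tuned per instance.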

Balanced k-means clustering
Background. k-means clustering is an unsupervised machine learning model that partitions training data into k clusters such that each point belongs to the cluster with the nearest centroid. The optimal cluster assignment of the training data minimizes the within-cluster variance. Balanced k-means clustering is a special case of the k-means model in which each cluster contains approximately N/k points, as shown in Fig. 3. Balanced clustering models have applications in a variety of domains including network design 47 , marketing 48 , and document clustering 49 .
Quantum approaches to training clustering models have been discussed in the literature. Ushijima-Mwesigwa et al. demonstrate partitioning a graph into k parts concurrently using quantum annealing on the D-Wave 2X machine 50 . Kumar et al. present a QUBO formulation for k-clustering that differs from the k-means model 51 . Bauckhage et al. propose a QUBO formulation for binary clustering ( k = 2) 52 and k-medoids clustering 53 . Our QUBO formulation for balanced k-means clustering synthesizes a number of ideas proposed in the literature.
Given training data X ∈ R^{N×d}, we would like to partition the N data points into k clusters Φ = {φ_1, ..., φ_k}. Let the centroid of cluster φ_i be denoted as µ_i. Formally, training the generic k-means clustering model is expressed as:

min_Φ Σ_{i=1}^k Σ_{x ∈ φ_i} ‖x − µ_i‖^2    (21)

Using the identity Σ_{x ∈ φ_i} ‖x − µ_i‖^2 = (1 / 2|φ_i|) Σ_{x, x′ ∈ φ_i} ‖x − x′‖^2, in the case that each cluster is of equal size, |φ_i| is constant, and Problem (21) reduces to:

min_Φ Σ_{i=1}^k Σ_{x, x′ ∈ φ_i} ‖x − x′‖^2    (22)

Note that for most applications of balanced clustering, the cluster sizes are only approximately equal to one another. In these cases, the solution to Problem (22) may not be the exact solution to Problem (21). Classically, the k-means clustering problem is solved heuristically through an iterative approach known as Lloyd's algorithm. A modified version of this algorithm is used for balanced k-means clustering to uphold the constraint that no cluster contains more than N/k points 54 . This modified version of Lloyd's algorithm runs in O(N^3.5 k^3.5) time on classical computers 55 .

QUBO formulation.
To formulate Problem (22) as a QUBO problem, it will be useful to define a matrix D ∈ R^{N×N} where each element is given by:

d_ij = ‖x_i − x_j‖^2

where x_i and x_j are the ith and jth data points in X. We also define a binary matrix Ŵ ∈ B^{N×k} such that ŵ_ij = 1 if and only if point x_i belongs to cluster φ_j. Since we are assuming clusters of the same size, each column in Ŵ should have approximately N/k entries equal to 1. Additionally, since each data point belongs to exactly one cluster, each row in Ŵ must contain exactly one entry equal to 1. Using this notation, the inner sum in Problem (22) can be rewritten as:

Σ_{x, x′ ∈ φ_j} ‖x − x′‖^2 = ŵ′_j^T D ŵ′_j

where ŵ′_j is the jth column in Ŵ. From this relation, we can cast Problem (22) into a constrained binary optimization problem. First, we vertically stack the Nk binary variables in Ŵ as follows:

ŵ = [ŵ′_1^T ŵ′_2^T · · · ŵ′_k^T]^T

Provided the constraints on ŵ are upheld, Problem (22) is equivalent to:

min_ŵ ŵ^T (I_k ⊗ D) ŵ    (26)

where I_k is the k-dimensional identity matrix.
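The relations above can be checked on a tiny illustrative data set: D holds the squared pairwise distances, and for the column-stacked assignment vector the objective of Problem (22) is a single quadratic form.

```python
# Sketch of the clustering objective: for a column-stacked assignment vector
# w_hat, the within-cluster cost of Problem (22) is w_hat^T (I_k kron D) w_hat.
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
N, k = len(X), 2
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # d_ij = ||x_i - x_j||^2

W = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # balanced assignment, N/k = 2
w_hat = W.T.reshape(-1)                          # vertically stack the columns
cost = w_hat @ np.kron(np.eye(k), D) @ w_hat     # within-cluster pairwise cost
```

Here each cluster contributes the squared distance 1 counted over both ordered pairs of its two points, so the total cost is 4.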
We can add the constraints on ŵ by including penalty terms that are minimized when all conditions are satisfied. First, we account for the constraint that each cluster must contain approximately N/k points. For a given column ŵ′_j in Ŵ, this can be enforced by including a penalty of the form:

α (1_N^T ŵ′_j − N/k)^2    (27)

where α is a constant factor intended to make the penalty large enough that the constraint is always upheld. Dropping the constant term α(N/k)^2 and using the fact that ŵ_ij^2 = ŵ_ij for binary variables, this penalty is equivalent to ŵ′_j^T α F ŵ′_j where F is defined as:

F = 1_N 1_N^T − (2N/k) I_N

Using this formulation, the sum of all column constraint penalties is:

Σ_{j=1}^k ŵ′_j^T α F ŵ′_j = ŵ^T (I_k ⊗ αF) ŵ    (29)

Next, we account for the constraint that each point belongs to exactly 1 cluster. For a given row ŵ_i, this can be enforced by including a penalty of the form:

β (1_k^T ŵ_i − 1)^2    (30)

where β is a constant with the same purpose as α in Eq. (27). Dropping the constant term, this penalty is equivalent to ŵ_i^T β G ŵ_i where G is defined as:

G = 1_k 1_k^T − 2 I_k

To find the sum of all row constraint penalties, we first convert the binary vector ŵ into the form v shown below:

v = [ŵ_11 · · · ŵ_1k ŵ_21 · · · ŵ_Nk]^T

i.e., v stacks the rows of Ŵ rather than its columns. This can be accomplished through a linear transformation v = Qŵ, where each element of Q ∈ B^{Nk×Nk} is defined as: q_uv = 1 if u = k(i − 1) + j and v = N(j − 1) + i for some i ∈ {1, ..., N} and j ∈ {1, ..., k}, and q_uv = 0 otherwise. After the transformation, the sum of all row constraint penalties is given by v^T (I_N ⊗ βG) v. This can be equivalently expressed as:

ŵ^T Q^T (I_N ⊗ βG) Q ŵ    (34)

Combining the penalties from Eqs. (29) and (34) with the constrained binary optimization problem from Eq. (26), Problem (22) can be rewritten as:

min_ŵ ŵ^T (I_k ⊗ (D + αF) + Q^T (I_N ⊗ βG) Q) ŵ    (35)

Computational complexity. A classical algorithm for balanced k-means clustering converges to a locally optimal solution 56 in O(N^3.5 k^3.5) time 55 . To compute the time complexity for converting Eq. (22) into a QUBO problem, we can rewrite Eq. (35) as follows:

min_ŵ ŵ^T (I_k ⊗ D) ŵ + α ŵ^T (I_k ⊗ F) ŵ + β ŵ^T Q^T (I_N ⊗ G) Q ŵ    (36)

From Eq. (36), the time complexity is O(N^2 kd), which is dominated by the first term. Embedding a QUBO problem having O(Nk) variables takes O(N^2 k^2) time using the embedding algorithm proposed by Date et al. 5 .
For the reasons mentioned in the "Linear regression" section, it is not straightforward to get a realistic estimate of the time complexity of the quantum annealing process. However, a constant annealing time in conjunction with a constant number of repetitions seems to work well in practice on an adiabatic quantum computer of fixed and finite size, as explained in the "Linear regression" section. Therefore, the total time complexity of the quantum algorithm is O(N^2 k(d + k)). This time complexity is better than the worst-case time complexity of the classical algorithm (O(N^3.5 k^3.5)). However, the number of iterations of the classical algorithm varies greatly depending on the quality of the initial guess at the cluster centroids. In some cases, the classical algorithm may converge in far less than O(N^3.5 k^3.5) time and outperform its quantum counterpart.
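The full balanced k-means QUBO, combining the objective with both penalty terms, can be sketched as follows; the data set and the penalty weights `alpha` and `beta` are illustrative choices (in practice the weights must be tuned per instance), and brute force again stands in for the annealer.

```python
# Hedged sketch assembling the full balanced k-means QUBO: the objective
# I_k kron D plus column (cluster-size) and row (one-cluster-per-point)
# penalties, the latter permuted into column-stacked order by Q.
import itertools
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
N, k = len(X), 2
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
alpha, beta = 100.0, 100.0                                 # penalty weights (assumed)

F = np.ones((N, N)) - (2.0 * N / k) * np.eye(N)  # column penalty matrix
G = np.ones((k, k)) - 2.0 * np.eye(k)            # row penalty matrix
Q = np.zeros((N * k, N * k))                     # column-stacked -> row-stacked
for i in range(N):
    for j in range(k):
        Q[i * k + j, j * N + i] = 1.0

M = np.kron(np.eye(k), D + alpha * F) + Q.T @ np.kron(np.eye(N), beta * G) @ Q

# Brute-force over all 2^(Nk) assignments (toy scale only), then unstack the
# columns to recover the N x k assignment matrix W.
best = min(itertools.product([0, 1], repeat=N * k),
           key=lambda z: np.array(z) @ M @ np.array(z))
W = np.array(best).reshape(k, N).T
```

On this data set the minimizer groups the two left points together and the two right points together, with each point in exactly one cluster, as the penalties intend.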

Conclusion
As the task of training machine learning models becomes more computationally intensive, devising new methods for efficient training has become a crucial pursuit in machine learning. The process of training a given model can often be formulated as the problem of minimizing a well-defined error function for that model. Given the power of quantum computers to approximately solve certain hard optimization problems with great efficiency, as well as the demonstration of quantum supremacy by Google, we believe quantum computers can accelerate the training of machine learning models. In this paper, we posed the training problems for three machine learning models (linear regression, support vector machine, and balanced k-means clustering) as QUBO problems to be solved on adiabatic quantum computers like the D-Wave 2000Q. Furthermore, we analyzed the time and space complexities of our formulations and provided a theoretical comparison to the state-of-the-art classical methods for training these models. Our results are promising for training machine learning models on quantum computers in the future.
In the future, we would like to empirically evaluate the performance of our quantum approaches on real quantum computers. We would also like to compare the performance of our quantum approaches to state-of-the-art classical approaches. Finally, we would like to formulate other machine learning models such as logistic regression, restricted Boltzmann machines, deep belief networks, Bayesian learning and deep learning as QUBO problems that could potentially be trained on adiabatic quantum computers.