Towards provably efficient quantum algorithms for large-scale machine-learning models

Large machine learning models are revolutionary technologies of artificial intelligence whose bottlenecks include huge computational expenses, power, and time used both in the pre-training and fine-tuning process. In this work, we show that fault-tolerant quantum computing could possibly provide provably efficient resolutions for generic (stochastic) gradient descent algorithms, scaling as $\mathcal{O}(T^2 \times \mathrm{polylog}(n))$, where n is the size of the models and T is the number of iterations in the training, as long as the models are both sufficiently dissipative and sparse, with small learning rates. Based on earlier efficient quantum algorithms for dissipative differential equations, we find and prove that similar algorithms work for (stochastic) gradient descent, the primary algorithm for machine learning. In practice, we benchmark instances of large machine learning models from 7 million to 103 million parameters. We find that, in the context of sparse training, a quantum enhancement is possible at the early stage of learning after model pruning, motivating a sparse parameter download and re-upload scheme. Our work shows solidly that fault-tolerant quantum algorithms could potentially contribute to most state-of-the-art, large-scale machine-learning problems.

It is widely believed that large-scale machine learning might be one of the most revolutionary technologies benefiting society [1], with important breakthroughs already achieved in digital arts [2], conversational agents like GPT-3 [3,4], and mathematical problem solving [5]. However, training such models with considerable numbers of parameters is costly and has high carbon emissions. For instance, twelve million dollars and over five hundred tons of CO2-equivalent emissions have been expended to train GPT-3 [6]. Thus, on the one hand, it is important to make large-scale machine-learning models (like large language models, LLMs) sustainable and efficient.
On the other hand, machine learning might be one of the flagship applications of quantum technology. Running machine learning algorithms on quantum devices, implementing readings of so-called quantum machine learning, is widely seen as a potentially very fruitful application of quantum algorithms [7]. Specifically, many quantum approaches have been proposed to enhance the capability of classical machine learning and, hopefully, to find useful applications [8,9]. Despite rapid development and significant progress, current quantum machine learning algorithms feature substantial limitations both in theory and practice. First, practical applications of quantum machine learning algorithms for near-term devices often lack theoretical grounds that guarantee, or at least plausibly suggest, that they outperform their classical counterparts. Second, in fault-tolerant settings of quantum machine learning problems [10-18], rigorous super-polynomial quantum speedups can actually be proven [19-21], but only for highly structured problems. That said, these prescriptions are arguably still far from real state-of-the-art applications of classical machine learning. Some of them primarily use quantum states as training data instead of classical data, which, highly encouraging as these approaches are, can be argued not to be the currently most important classical machine learning application [20,22-25]. Efforts need to be made to extend our understanding of quantum machine learning: we have to understand how such algorithms could come with theoretical guarantees and how they could solve timely and natural problems of classical machine learning, at least in principle. For instance, they should relate to scalable and sustainable natural problems in large-scale machine learning.
In this work, we take significant steps in this direction by designing end-to-end quantum machine learning algorithms that are expected to be timely for the current machine learning community and that are, to an extent, equipped with guarantees. Based on a typical large-scale (classical) machine-learning process (see Fig. 1 for an illustration), we find that after a significant number of neural network training parameters have been pruned (sparse training) [26-29] and the remaining classical training parameters have been compiled to a quantum computer, a quantum enhancement can be found at the early stage of training, before the error grows exponentially. At its heart, the quantum algorithmic part of this work suitably modifies the quantum algorithm of Ref. [30] for solving differential equations so that it runs (stochastic) gradient descent algorithms, presumably the primary classical machine learning algorithm, on a quantum processor after linearization. The expectation of a possible quantum enhancement is rooted in an application of a variant of the so-called Harrow-Hassidim-Lloyd (HHL) algorithm [31], an efficient quantum algorithm for sparse matrix inversion that solves the problem within O(log n) time for suitably conditioned n × n sparse matrices. We find that our algorithm can solve large-scale, model-dimension-n machine learning problems in O(polylog(n) × T) or O(polylog(n) × T^2) time, where T is the number of iterations. The scaling in n outperforms that of any classical algorithm we know of. However, for a given machine learning problem with required performances, there is no guarantee that our hybrid quantum-classical algorithm will necessarily outperform all other conceivable classical algorithms for related, but different, tasks (for instance, algorithms that are not gradient-based). Thus, our result gives rise, to the best of our knowledge, to a potential substantial quantum speedup or enhancement of particular classical algorithms, instead of a quantum advantage over the entire problem class.
From a quantum algorithms perspective, stochastic gradient descent processes are solved here using quantum ordinary differential equation (ODE) solvers derived from the findings of Ref. [30], based on linearizing non-linear equations using so-called quantum Carleman linearization. We find that the corresponding differential equation solvers can, in principle, also be used in the discrete setting and for stochastic gradient descent in machine learning. However, in the discrete setting, the theoretical details are significantly different from those applicable in the small learning rate limit. In this work, we systematically establish a novel discrete Carleman linearization in the supplemental material, including reformulations of the Carleman linearization theory, a tensor network diagrammatic notation for the discretization error, analytic derivations of higher-order corrections, and explicit examples for lower-order expansions. Further details about the novelty of our algorithms beyond the findings of Ref. [30] are summarized in the supplemental material.
It is important to stress that the above algorithm has a number of requirements for admitting a quantum enhancement. First, both the machine learning model and the weight vectors have to be sufficiently sparse, which ensures a fast interface between classical and quantum processors (this requirement could be relaxed in the presence of quantum random access memory (QRAM) [32], a fast uploader of quantum data, but we stress that QRAM is not required and there are no hidden resources in our scheme). Second, the model has to be sufficiently dissipative. For dissipative systems, the linearization error is well controlled, ensuring that the HHL algorithm can obtain reliable results even with non-linear machine learning models. We find that dissipation happens generically in the early training process of large-scale machine learning.
We corroborate the intuition developed here by a number of theorems, as well as extensive numerical experiments. The formal definitions of dissipation and sparsity, as well as the quantum speedups, are rigorously established in the supplementary material. Informal readings of the main theorems are presented in Section I, while solid numerical evidence up to 103 million training parameters is provided in Section III. Finally, a conclusion, also providing an outlook, is given in Section IV.

(Fig. 1 caption.) The neural network weights are pruned and only a small fraction is preserved. A quantum ordinary difference equation system that corresponds to the sparse training dynamics is created using the sparse network and the training data. To allow a quantum enhancement, the system must be sparse and dissipative. A measurement on the solution state is performed to obtain the final trained parameters, which are used to construct a trained classical sparse neural network.

I. THEOREMS
In this section, we lay out the informally formulated main theorems that are established in this work. Details can be found in the supplementary material.
Theorem 1 (Informal). For a sparse machine learning model with model size n, running T iterations, with the algorithm being fully dissipative with small learning rates (whose formal definition is given in the supplemental material), there is a quantum algorithm that runs in time $\mathcal{O}(T \times \mathrm{polylog}(n))$ with precision ϵ > 0. The sparsity condition also ensures the efficiency of uploading and downloading quantum states towards classical processors.
Theorem 2 (Informal). For a sparse machine learning model with model size n, running T iterations, with the algorithm being almost dissipative with small learning rates (whose formal definition is given in the supplemental material), there is a quantum algorithm that runs in time $\mathcal{O}(T^2 \times \mathrm{polylog}(n))$ with precision ϵ > 0. The sparsity condition also ensures the efficiency of uploading and downloading quantum states towards classical processors.
In these expressions, m := log2(n) takes the role of a system size of the quantum system. First, we describe the problem we are trying to solve. A machine learning model is defined in part by a function L_A, called the loss function, depending on the weight vector (variational angles) θ = (θ_μ) ∈ R^n and on the input training set A. The weight vector has n components for an n-dimensional model. The task is to minimize the function L_A by adjusting θ, making use of T iterations.
The presumably most widely utilized algorithm in machine learning is called (stochastic) gradient descent. Starting from the initial weight vector θ(t = 0), we implement the following update equation from t = 0 to t = T with a small, positive learning rate η,

$$\theta_\mu(t+1) - \theta_\mu(t) = -\eta\, \frac{\partial L_A}{\partial \theta_\mu}(\theta(t)).$$

Variants of the gradient descent algorithm also add random noise ξ_μ(t) in each step, yielding so-called stochastic gradient descent algorithms. One can show that, in many cases, at the end of training, θ_μ(t = T) makes the loss function L_A(θ(t = T)) sufficiently small. The quantum algorithm with the efficiency promised in Theorem 1 and Theorem 2 is described in the following.
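The update above can be sketched in a few lines of code. The following is a minimal classical sketch on a toy quadratic loss; the Hessian H, the learning rate, and the noise scale are illustrative choices, not taken from the paper.

```python
import numpy as np

# Minimal (stochastic) gradient descent on a toy quadratic loss
# L_A(theta) = 0.5 * theta^T H theta, a stand-in for a generic model loss.
rng = np.random.default_rng(0)
n, T, eta = 8, 200, 0.05
H = np.diag(np.linspace(0.5, 2.0, n))   # positive-definite Hessian: dissipative regime

def grad(theta):
    return H @ theta                     # gradient of the quadratic loss

theta = rng.normal(size=n)               # initial weight vector theta(0)
initial_norm = np.linalg.norm(theta)
for t in range(T):
    noise = 0.0 * rng.normal(size=n)     # set a nonzero scale for *stochastic* GD
    theta = theta - eta * (grad(theta) + noise)

# With all Hessian eigenvalues positive, theta contracts toward the minimum.
print(np.linalg.norm(theta) < initial_norm)
```

Because every eigenvalue of H is positive, each coordinate contracts geometrically, which is the discrete analogue of the dissipation condition used throughout the paper.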
• Our starting point is given by an initial weight vector θ(0), the maximal number of iterations T, and the machine learning architecture L_A, with model size n.
• In a first step, we use the so-called quantum Carleman linearization introduced in Ref. [30] to linearize the model L_A with the matrix M (see the supplemental material for more details).
• Then, we upload the sparse weight vector θ(0) as a state vector on the quantum device, using the tools of Ref. [33] or, alternatively, more sophisticated and at the same time more challenging architectures like quantum random access memory (QRAM) [32].
• Then, in a further step, we use a variant of the HHL solver introduced in Ref. [31] and the supplementary material to solve for the state vector at the end, t = T. The pipeline is runnable under the conditions of sparsity and dissipation, which are satisfied by our models. Sparsity includes the sparsity of the models themselves and the sparsity of the weight vectors (ensured by the assumptions of sparse training), while dissipation is a natural property of the early steps of training, extensively discussed in Section III.
• Finally, we exploit tomographic methods, described for instance in Refs. [24,34] and the supplemental material, to obtain the classical model parameters θ(T).
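The interface steps of this pipeline can be emulated classically. The sketch below mimics "uploading" a sparse weight vector as a normalized amplitude vector and "downloading" it via a top-k readout in the spirit of sparse tomography; the helper names and the top-k rule are illustrative, not from the paper.

```python
import numpy as np

def upload(theta):
    """Encode a (sparse) weight vector as a normalized state vector."""
    norm = np.linalg.norm(theta)
    return theta / norm, norm            # keep the norm to undo the scaling later

def download_topk(state, norm, k):
    """Read out only the k largest-magnitude components (sparse readout)."""
    out = np.zeros_like(state)
    idx = np.argsort(np.abs(state))[-k:]
    out[idx] = state[idx]
    return out * norm

theta = np.zeros(16)
theta[[1, 5, 11]] = [0.4, -1.2, 0.3]     # a sparse initial weight vector
state, norm = upload(theta)
recovered = download_topk(state, norm, k=3)
print(np.allclose(recovered, theta))     # exact here because theta is 3-sparse
```

When the vector is exactly k-sparse, the top-k readout is lossless; for approximately sparse vectors it doubles as a pruning step, as discussed later in the text.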
Finally, we wish to mention that this potential enhancement from quantum computing concerns the size of the model, not necessarily the precision. For instance, we use tomographic methods to download the sparse quantum state at the end of the quantum ODE solver, with a cost scaling as 1/ϵ^2 in the precision ϵ. This scaling might be optimal in the quantum setting [34], but may not be ideal compared to purely classical algorithms (although the precise relationship between the error and the performance of classical machine learning models is generically not clear to date). We leave those interesting issues for future works.

II. LINEARIZING CLASSICAL NEURAL NETWORKS
In this section, we provide a short and heuristic description of how to solve stochastic gradient descent using HHL algorithms. For a given (stochastic) gradient descent process, the recursion relation is given by

$$\delta u(t) = \eta \left( F_1 u(t) + F_2 u(t)^{\otimes 2} + \cdots + F_q u(t)^{\otimes q} \right), \quad (4)$$

with the initial condition u(0) = u_in. Here, u(t) = (θ_μ)(t) is the weight vector at iteration t, δo ≡ o(t + 1) − o(t) represents the discrete difference between the two time steps (t + 1) and t for a variable o (see Ref. [30] for the continuous version), and u^{⊗ℓ} is the ℓ-th order tensor power. Thus, Eq. (4) characterizes the dynamics of q-th order non-linearity in classical neural networks. Now, we introduce a linear process designed to approximate the non-linear model (4), called quantum Carleman linearization, as

$$\delta \hat{y}(t) = A\, \hat{y}(t). \quad (5)$$

In this linear process, the vector space (whose vectors are denoted by ŷ) is spanned by the weight vectors and all possible tensor products thereof, while A is a large matrix with matrix elements given by the F_ℓ, the so-called quantum Carleman matrix (QCM). In principle, this linear relation is infinite-dimensional, so we are replacing the original non-linear recursion by an infinite set of linear relations. If we wish to solve this infinite process on a digital system, we need to truncate. In this work, we show that for dissipative systems (whose QCMs have sufficiently many negative eigenvalues, roughly corresponding to large enough positive eigenvalues of the Hessians of the loss functions of classical neural networks; the positive Hessian eigenvalues are called dissipative modes in the supplementary material, while the negative ones are called divergent modes), the truncation error can be well controlled.
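A truncated Carleman linearization can be written down concretely in the simplest, scalar case. The sketch below linearizes the quadratic recursion δu = η(f1·u + f2·u²) by tracking the variables y_ℓ = u^ℓ up to a truncation order N, keeping only terms first order in η; the constants are illustrative, and f1 < 0 makes the map dissipative.

```python
import numpy as np

eta, f1, f2, N, T = 0.05, -1.0, 0.3, 6, 50

# Truncated Carleman matrix: to first order in eta, y_l = u**l obeys
#   delta(y_l) = eta * l * (f1 * y_l + f2 * y_{l+1}),
# and the coupling of y_N to y_{N+1} is dropped (truncation).
A = np.zeros((N, N))
for l in range(1, N + 1):
    A[l - 1, l - 1] = eta * l * f1
    if l < N:
        A[l - 1, l] = eta * l * f2

u = 0.5
y = np.array([u ** l for l in range(1, N + 1)])   # initial Carleman vector

for t in range(T):
    u = u + eta * (f1 * u + f2 * u ** 2)          # exact nonlinear recursion
    y = y + A @ y                                 # truncated linear surrogate

print(abs(y[0] - u))                              # truncation/discretization error
```

For this dissipative choice, the first Carleman component stays close to the exact nonlinear iterate; the residual error comes from the dropped O(η²) discrete terms and the truncation at order N, consistent with the error analysis described in the text.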
For sparse, dissipative systems, Eq. (5) can be treated as a matrix inversion problem, and thus solved by the HHL algorithm on a quantum computer,

$$\begin{pmatrix} I & & & \\ -(I+A) & I & & \\ & \ddots & \ddots & \\ & & -(I+A) & I \end{pmatrix} \begin{pmatrix} \hat{y}(0) \\ \hat{y}(1) \\ \vdots \\ \hat{y}(T) \end{pmatrix} = \begin{pmatrix} \hat{y}_{\mathrm{in}} \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

Here, we are considering T + 1 iterations in total, from t = 0 to t = T, and the vector space has been further extended (T + 1)-fold. I is the identity matrix, and ŷ_in is the initial weight vector corresponding to u_in, written as a tensor product. This quantum ODE solver is our primary strategy for solving stochastic gradient descent equations on quantum computers.
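This history-state construction can be checked classically on a small linear toy problem. The sketch below stacks all T + 1 iterates of the update ŷ(t + 1) = (I + A)ŷ(t) into one block lower-bidiagonal system and solves it directly (where a quantum device would apply an HHL-type inversion); the sizes and the matrix A are illustrative.

```python
import numpy as np

n, T = 3, 4
A = -0.1 * np.eye(n)                              # a simple dissipative update matrix
I = np.eye(n)
y_in = np.array([1.0, 2.0, -1.0])

dim = n * (T + 1)
M = np.zeros((dim, dim))
b = np.zeros(dim)
b[:n] = y_in
for t in range(T + 1):
    M[t*n:(t+1)*n, t*n:(t+1)*n] = I               # identity blocks on the diagonal
    if t > 0:
        M[t*n:(t+1)*n, (t-1)*n:t*n] = -(I + A)    # -(I + A) blocks below it

y_all = np.linalg.solve(M, b)
y_T = y_all[-n:]                                  # iterate at the final step t = T

# Sanity check against direct iteration of the recursion.
print(np.allclose(y_T, np.linalg.matrix_power(I + A, T) @ y_in))
```

The single linear solve reproduces the result of iterating the recursion T times, which is exactly why the whole gradient descent trajectory can be cast as one matrix inversion.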
Although similar to its continuous version [30], the distinct differences between our algorithms and those of Ref. [30] are extensively discussed in the supplementary material. More precisely, the discrete contributions lead to higher-order terms in the learning rate η. In fact, in the continuous case A depends linearly on the various F, while in the discrete case, A also has contributions scaling as η^2 F^2 + η^3 F^3 + ⋯ (see the supplementary material for detailed examples). In the limit η → 0, the discrete contributions become identical to the continuous ones. It is also worth clarifying that those discrete contributions go beyond those considered in Ref. [30]. In fact, although one has to discretize the differential equations eventually in the ODE setup of Ref. [30], the time derivative there is computed before discretization for higher-order Carleman linearization. For instance, d(u^{⊗2})(t) is treated as du ⊗ u + u ⊗ du, both at time t, while in the discrete case one has to consider contributions both from the (t + 1)-st step and the t-th step. This contribution is the primary difference between the continuous and the discrete ODEs. Finally, in our problem setup, we always assume that an explicit form of the gradient descent equation is accessible, such that one can construct the Carleman linearization and make it available to quantum devices. This may not always be true for generic, complicated classical neural networks, whose complexity of analytic decomposition might grow with the size of the network. We leave a more thorough treatment to future research.

III. NUMERICAL ANALYSIS
In this part of our work, we focus on providing numerical evidence of a potential quantum enhancement for large-scale machine-learning models. Commercial large-scale LLMs like GPT-3 can have O(100) billion parameters or even more, which is challenging as a starting point due to the tremendous computational costs. Instead, we here provide examples of classification and computer vision machine learning models, which are relatively small compared to the language models used in industry. Our computational resources allow us to reach scales of up to O(100) million parameters, which is both practically minded and reachable. We expect that LLMs and other models will feature a similar behavior to these examples, since our algorithm works in general as a replacement for stochastic gradient descent.
Thus, in order to provide evidence of the functioning of our quantum algorithm in the context of practically minded machine learning, we perform numerical experiments on a state-of-the-art machine vision architecture, namely the so-called ResNet, to tentatively outline schemes with a potential quantum enhancement. First, we study a model with 7 million trainable parameters, trained to distinguish images of 100 classes [35]. We first pre-train the neural network, use the largest 10% of the learned parameters for initialization, and use the quantum ODE system to obtain a sparse output model. We record the Hessian spectra during sparse training, allowing us to track the evolution of an error-bound-related quantity, given by

$$\frac{1}{N_c} \int \rho(a)\, e^{a t}\, \mathrm{d}a,$$

where ρ is the eigenvalue density, a is the negative of the Hessian eigenvalue, and N_c is the renormalization constant implicitly defined by

$$N_c = \int \rho(a)\, \mathrm{d}a,$$

with both integrals running over the retained part of the spectrum. This error proxy discards small-magnitude Hessian eigenvalues because they are close to 0, extremely abundant, and render the error proxy stationary. This numerical prescription is designed according to criteria for the positivity of the Hessian eigenvalues (dissipative modes). More dissipative systems have more positive Hessian eigenvalues, more negative a, and a better-behaved error proxy. Specifically, the dissipative nature of the training dynamics initially leads to a reduction in this error proxy, which then gets overtaken by the divergent modes, leading to an exponentially increasing error bound, as shown in Fig. 2(b). This motivates us to download the quantum-trained model parameters sparsely and re-upload them to the quantum computer to continue training every 100 steps. The effect of this procedure is that the exponentially increasing error restarts at 0 after each re-upload, with the side effects of Hessian broadening and accuracy reduction, as shown in Fig. 2.
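A numerical sketch of such an error proxy follows. Given a Hessian spectrum, it sets a to the negated eigenvalues, discards small-magnitude eigenvalues, and averages e^{at} over the N_c retained modes; the exact functional form and cutoff in the paper may differ in detail, so this is an assumption used only to reproduce the qualitative decrease-then-growth behavior.

```python
import numpy as np

def error_proxy(hessian_eigs, t, cutoff=1e-3):
    """Average exp(a*t) over retained modes, a = -eigenvalue (assumed form)."""
    a = -hessian_eigs[np.abs(hessian_eigs) > cutoff]   # drop tiny eigenvalues
    return np.exp(a * t).sum() / len(a)                # N_c = number of kept modes

rng = np.random.default_rng(1)
eigs = np.concatenate([rng.uniform(0.5, 2.0, 95),      # dissipative modes (positive)
                       rng.uniform(-0.2, -0.05, 5)])   # a few divergent modes

ts = np.arange(0, 200, 5)
proxy = np.array([error_proxy(eigs, t) for t in ts])
# Dissipation dominates early (the proxy drops), divergent modes take over later.
print(proxy[1] < proxy[0], proxy[-1] > proxy[0])
```

With mostly positive Hessian eigenvalues and a handful of negative ones, the proxy first shrinks and is eventually overtaken by the exponentially growing divergent modes, matching the behavior described for Fig. 2(b).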
There is another strategy assuming the existence of QRAM. To combat the effect of Hessian broadening on the error proxy, we train the model classically for 10 steps after each download, before re-uploading the new dense parameters; during these classical steps, no training error is accrued. Although classical training has a cost linear in n, it makes up only a small fraction of the entire training process. The accuracy dip immediately after each download improves as training progresses, so our quantum training scheme is capable of producing useful sparse models. Finally, we examine the Hessian of a 103 million parameter ResNet. We start with a pre-trained model and prune 90% of the parameters. Due to the immense computational cost of computing the Hessian of a large machine learning model (a relatively large-scale model for computer vision given our computational resources), we only benchmark the Hessian spectra to provide evidence of dissipation at this scale.
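The download/re-upload schedule described above can be sketched as follows: run blocks of quantum-simulated steps during which an error bound grows exponentially, and reset the bound to zero at every sparse download/re-upload. The growth rate and the block length of 100 steps are illustrative.

```python
import numpy as np

growth_rate, block_len, n_blocks = 0.05, 100, 3
errors = []
for block in range(n_blocks):
    err = 0.0
    for step in range(block_len):
        err = err * np.exp(growth_rate) + 1e-6   # exponential error accumulation
    errors.append(err)                           # bound just before the download
    # downloading the sparse parameters and re-uploading restarts err at 0

# Each block restarts from zero, so the pre-download bounds are all equal
# and bounded, instead of compounding across the whole training run.
print(np.allclose(errors, errors[0]))
```

The point of the schedule is visible directly: without the resets the bound would grow by a factor e^{0.05·100} per block, whereas with resets every block sees the same bounded error.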

IV. CONCLUSION AND OUTLOOK
In our work, we have provided quantum algorithmic strategies that are presumably helpful for solving the (stochastic) gradient descent dynamics of large-scale classical machine learning models, such as LLMs like GPT-3. We identify certain precisely stated dissipative and sparse regimes of the model where quantum devices could meaningfully contribute, providing an end-to-end HHL-type quantum algorithm application that could outperform known classical algorithms. The observation that an efficient classical algorithm for solving all instances of non-linear dissipative differential equations would imply an efficient classical algorithm for any problem that can be solved efficiently by a quantum computer (i.e., the problem is BQP-hard) [30] can be seen as an argument that our algorithm is unlikely to be de-quantized by classical proposals along the lines of Ref. [36]. Frankly, the core thesis of this work is that a main application of quantum computers may be the training of classical neural networks.
Indeed, we claim that our algorithm might significantly increase the scalability and sustainability of classical large-scale machine-learning models, and we provide numerical evidence for our claims up to 103 million training parameters. Our work provides solid theoretical guarantees and intersections with state-of-the-art classical machine learning research. It sharply deviates from the mindset of variational quantum algorithms, and instead aims at augmenting classical machine learning with a key quantum step that constitutes a bottleneck for classical training. In a way, it can be seen as adding flesh to the expectation that quantum formulations of neural networks may lead to new computational tools [37]. Specifically, our model requires the sparsity to be kept constant (or to feature a polynomial scaling) in the size of the model in order to maintain a potential enhancement, which is consistent with the so-called lottery ticket hypothesis [38]. The setup is expected to be favorable in large-scale machine learning numerical experiments, although the sparsity ratio will generically decay.
Our work is expected to open up several potential directions in the field of quantum machine learning where one can reasonably hope for algorithmic improvements. In the supplemental material, we hint at a number of potentially particularly fruitful directions for future research. In short, they include the development of an alternative, time-dependent version during gradient descent trajectories, the identification of better formal criteria for dissipation, work on connections to diffusion models in classical machine learning and LLMs [39], theoretical improvements on the truncated HHL algorithms, and the identification of mechanisms of possible quantum speedups beyond notions of dissipation. We hope that this work can provide some stimulus for this type of research.

A. Data availability statement
The full code and data for this work are available.

B. Inclusion and ethics statement
All subjects gave their informed consent for inclusion before they participated in the study. The study has been conducted in accordance with the Declaration of Helsinki. This work obviously does not include any data obtained by experiments with living beings.

Quantum computing is considered one of the most compelling alternatives to von Neumann architectures in the post-Moore era [1]. Making use of quantum features such as superposition and entanglement along the computation, quantum computing is expected to outperform its classical counterparts for many specific tasks, including factoring [2], database search [3], the simulation of complex quantum systems [4], sampling [5-7], and tasks of linear algebra [8]. Those algorithms might be useful in multiple areas, while applications in quantum machine learning might combine several of them at the same time. Quantum machine learning [9] utilizes quantum devices to run suitably modified machine learning algorithms. It has been argued that it may have the potential to lead to dramatically faster algorithms than what classical machine learning algorithms can provide in certain scenarios. However, substantial limitations exist for currently known quantum machine learning models.
• There are many studies of hybrid quantum-classical implementations of quantum machine learning, so-called variational quantum algorithms [10]. Relevant algorithms of this type have shown potential applications in quantum computational chemistry [10-13], in combinatorial optimization [14,15], and in data-driven learning tasks [16,17]. However, there is increasing evidence that such variational algorithms run on near-term quantum devices [11] are challenged by noise beyond logarithmic depth [18-21] (even though in some instances, noise can even be of some help [22]). Those algorithms might still be implementable on fault-tolerant quantum computing devices, but it is to date unclear whether and to what extent they are generically expected to have a quantum advantage over classical ones.
• Full-scale quantum machine learning algorithms have at the same time been established in the fault-tolerant setting, including steps of principal component analysis [23] (which has in the meantime largely been de-quantized [24]), quantum kernel methods [25,26], and notions of quantum machine learning of quantum physical states and systems [27,28]. Those algorithms are promising for when fault-tolerant quantum devices eventually become available, and some may even be equipped with theoretical guarantees. For example, it has been shown that quantum computers fare better in notions of distribution learning [29,30], and that applications of kernel methods achieve robust quantum speedups in supervised machine learning [25]. However, to be fair, those algorithms are still quite far away from the realm of applicability of state-of-the-art classical machine learning applications. To start with, most data in common applications arising in everyday life are classical, not quantum. So if quantum computing requires quantum data to yield a speedup, the applicability in our current world, especially from commercial perspectives, seems narrow. In other instances, the data have to be highly structured. It is to date not clear whether those theoretical results are helpful for the general classical machine learning community at the moment. For instance, Ref. [25] presents an algorithm showing a robust quantum advantage in notions of machine learning, but it requires constructing a problem in the discrete logarithm scenario, and there is no immediate state-of-the-art classical machine learning use for such structured distributions, as far as the authors know. Similar in spirit, theoretical contributions of other quantum machine learning works [31], despite contributing substantially to our conceptual and mathematical understanding, often provide information-theoretic bounds using statistical learning theory that might miss intricacies of practical real-world applications [32].
The algorithm designed in this work manages to overcome several of the above issues. Compared to the existing literature, our work can be seen as contributing the following improvements.
• Instead of being decoupled from applications of real-world classical machine learning, our algorithm is relatively practical and expected to be beneficial to the state-of-the-art classical machine learning community, touching upon one of the most important tasks of current interest: large-scale machine learning models, including large language models (LLMs) [33-35]. In order to validate our theory, our work also contains practically minded simulations of around 100 million classical parameters, which is, as far as we know, a record for the quantum computing literature at the moment.
• The performance of our algorithm is, at least in several respects, amenable to rigorous guarantees. Our algorithm is based on a provably efficient quantum algorithm, the HHL algorithm [8], which promises exponentially faster performance in sparse matrix inversion over known classical counterparts. Based on the HHL algorithm and the Carleman linearization introduced for solving dissipative ordinary differential equations (ODEs) [36], we constructively show that there is an algorithm with poly-logarithmic cost in the number of model parameters and linear or quadratic dependence on the number of iterations when the training is sparse, dissipative, and close to a local minimum of the cost function. Moreover, in more realistic scenarios, we also have solid theoretical bounds on the performance, which are supported by our numerical simulations.
• In the sparse training setup, our work could be applicable to classical data as well and does not rely on QRAM [37], while in the case of dense training, QRAM is required to ensure fast uploading, and tomographic methods could serve as a natural pruning approach for further classical machine learning steps. Under the assumptions of sparse training [38-41], we can create efficient interfaces between classical and quantum processors. Moreover, our algorithm only requires that interface to be established once for the entire set of iterations, which maximally reduces the burden of transmission between the classical and quantum worlds. Since our inputs and outputs are sparse on the quantum device, we do not necessarily need access to QRAM, which might be challenging to build in the large-scale and fault-tolerant regime [42,43]. On the other hand, in the dense training task, QRAM could be used to ensure efficient uploading from the classical processors to the quantum devices, and in the example of Section II D, the tomographic algorithm could read the components with the largest norms with high probability, and could thus naturally serve as a pruning method for dense or sparse training in the classical processor in follow-up steps.
• Our algorithm can be seen as a quantum-classical hybrid algorithm, but this is meant in a very different way than what is commonly considered in variational quantum algorithms. The quantum algorithm introduced here tackles a key step in otherwise classical machine learning algorithms.
One of the critical concepts in our work is sparse training, which makes the quantum implementation feasible and practical. We summarize the sparse training perspectives in Section I B.

B. Making large-scale machine learning sustainable
Large-scale deep learning models have been shown to be an efficient method for enhancing performance across various tasks [33][34][35]. Those models are widely used to interact with humans [44], to solve mathematical problems [45], and in the digital arts [46]. They might, after all, be revolutionary for the entire society. Large-scale models, especially LLMs, are beyond reasonable doubt some of the most impressive and cutting-edge technologies in the world. However, training such models with considerable numbers of parameters is costly and can cause high carbon emissions. For instance, twelve million dollars and over five hundred tons of CO2-equivalent emissions are spent to train GPT-3 [47]. It is hard to imagine that, in order to build such a machine, we have to use the equivalent of the annual electricity consumption of a small city. If one were to vote for the specific areas where quantum computing devices could contribute, LLMs might be the most promising playgrounds.
In our work, we make a substantial step towards this goal, in a regime where quantum and classical computing might both help: sparse training, namely, training an otherwise classical network with most parameters being zero or very close to zero. Recently, sparse training has emerged as a promising direction to reduce the training cost and improve the inference efficiency [38]. One may say that the future of deep learning is sparse: By inducing sparsity in neural networks, we are able to train models at scale with unprecedented efficiency, paving the way for next-generation artificial intelligence architectures. Sparse training is also a field where multiple research efforts intersect, including sparse fine-tuning [48,49], sparse training from scratch [39][40][41], different pruning techniques, and techniques built on the lottery ticket hypothesis [50,51]. One intuitive reason for the advantages of sparse training is apparent: The task itself fundamentally requires only a small amount of data, and current artificial neural networks are highly over-parametrized. In principle, the number of parameters required for a given task should be fixed; hence, in large-scale tasks, the remaining parameters should be set to zero (see Ref. [52] for a comprehensive discussion).
Appealingly, sparsity is also one of the requirements of the HHL algorithm for obtaining an exponential improvement over classical algorithms [52]. Moreover, sparsity is also required for establishing interfaces between classical and quantum devices: Uploading dense quantum states necessitates realizations of QRAM, whereas downloading dense quantum states necessitates tomographic readout, which requires significant resources for generic dense quantum states but is efficient for sparse states. See the discussions around Section II D and Section V A.
Fig. 1 illustrates a typical process that elucidates how modern large-scale machine learning models work. The early stage involves pre-training, whose data inputs and data processing are typically dense (although they might become sparse as the technology improves). Since these processes are more data-dependent, it is harder to see how quantum technologies could help there. The second process is the actual training, where quantum technologies might contribute more substantially, and where another key factor of our quantum algorithm comes into play: dissipation.
When applying the ODE algorithm related to Ref. [36], dissipation is an essential condition. In fact, the algorithm presented in Ref. [36] employs linearization techniques to convert a system of non-linear ODEs (in our case, the stochastic gradient descent equations in machine learning) into a sparse linear system problem to which the HHL algorithm applies. If the system is dissipative, the linearization error is dissipative as well and thus exponentially decaying, making it possible to linearize the system in a controllable fashion. In practical machine learning algorithms that are able to generalize efficiently (the generalization regime [52]), we find that the actual training process undergoes two phases: First, the neural network is actually learning non-trivial and significant examples. The system is dissipative, and equivalently, sensitive, during this phase. Later, when the system has accumulated enough data, the model becomes less dissipative.

FIG. 1. A possible learning process in large-scale models. A dense neural network is pre-trained classically. The neural network weights are then pruned and only a small fraction is preserved. A quantum ordinary difference equation system that corresponds to the sparse training dynamics is created using the sparse weight parameters and the training data. To allow for a quantum enhancement, the system must be sparse and dissipative. Measurement on the solution state is performed to obtain the final trained weight parameters, which are used to construct a trained classical sparse neural network.
In our quantum algorithm, we quantify the level of dissipation by the positivity of Hessian matrices during gradient descent (more precisely, of quantum Carleman matrices (QCM)), and the above phenomena might be intuitively explained from the well-established information bottleneck theory [52][53][54]. When the system is actually learning knowledge, the system is more sensitive, and the trace of the Fisher information matrix tends to be higher (see Ref. [52]). When the learning is effectively ending, the Fisher information approaches zero. Because the Hessian matrix is a second-order approximation of the Fisher matrix, the above observation also holds for Hessian matrices. Thus, we identify here the important and specific circumstances where quantum algorithms could meaningfully contribute: when the system is both sparse and dissipative. Then we are likely to be able to employ quantum algorithms to accelerate large-scale machine learning models, being super-polynomially more efficient than known classical algorithms for this task. This regime is likely to occur in the core learning process of sparse training.
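This dissipation diagnostic can be illustrated in a few lines. The following is our own minimal sketch (not from the paper's experiments): it estimates the Hessian of a toy loss by finite differences and reports the fraction of positive-curvature, i.e., dissipative, modes.

```python
import numpy as np

def hessian(loss, theta, h=1e-5):
    """Finite-difference Hessian of a scalar loss at parameter vector theta."""
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (loss(theta + ei + ej) - loss(theta + ei)
                       - loss(theta + ej) + loss(theta)) / h**2
    return (H + H.T) / 2.0  # symmetrize against floating-point noise

# Toy loss near a local minimum: curvature is positive in every direction.
loss = lambda t: 0.5 * np.sum(t**2) + 0.1 * np.sum(t**3)
theta = 0.05 * np.ones(4)
eigs = np.linalg.eigvalsh(hessian(loss, theta))
dissipative_fraction = float(np.mean(eigs > 0))
```

In the regime identified above, this fraction stays close to one during the core learning phase.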

C. Theorems
In this subsection, we primarily state the theorems of the main text and present the respective proofs in full.
Theorem I.1 (First main theorem). For a sparse machine learning model of size n, running T iterations, with the algorithm being fully dissipative (see Eq. (II.1)), there is a quantum algorithm that runs in time O(T² × polylog(n)), up to model-related factors (such as ∥ū∥ and the norms of the F_ℓ), with error ε > 0, where q := log₂(n) is the number of qubits. The algorithm also assumes that the learning rate is small, such that the condition in V.30 is satisfied. Moreover, if we assume that the output weight vectors are s-sparse, the tomographic cost of bringing the quantum states to the classical devices is O(s²q³/ε²), ignoring log factors. The algorithm ignores the cost of preparing the initial weight vector on the quantum device, which is also computationally efficient for sparse training.
Theorem I.2 (Second main theorem). For a machine learning model of size n, running T iterations, with the algorithm being almost dissipative (see Eq. (II.2)), there is a quantum algorithm that runs in time O(T² × polylog(n)), up to model-related factors (such as ∥ū∥ and the norms of the F_ℓ), with error ε > 0, under the assumptions of Theorem II.4, where q := log₂(n) is the number of qubits. The algorithm also assumes that the learning rate is small, such that the condition in V.30 is satisfied. Moreover, again, if we assume that the output weight vectors are s-sparse, the tomographic cost of bringing the quantum states to the classical devices is O(s²q³/ε²), ignoring log factors. The algorithm ignores the cost of preparing the initial weight vector on the quantum device, which is also computationally efficient for sparse training.
For an algorithm for the efficient uploading of sparse quantum states, see Ref. [55], and for further discussion and details, see Section V A.

D. Algorithm description
In this subsection, we describe the actual quantum algorithm. A machine learning model is defined partially by a function L_A, called the loss function, which depends on a weight vector (variational angle) θ ∈ ℝⁿ with components θ_i, and on the input training set A. In instances, we also write 𝜃 when we emphasize that it is a vector. The weight vector has n components for an n-dimensional model. The task is to minimize the function L_A by adjusting θ over T iterations. The most widely utilized algorithm in machine learning is (stochastic) gradient descent. Starting from the initial weight vector θ(t = 0), we implement the following ordinary differential equation from t = 0 to t = T,
dθ_i/dt = −∂L_A/∂θ_i.
Variants of the gradient descent algorithm also include adding random noise, A ↦ A(t), in each step, giving so-called stochastic gradient descent algorithms. One can show that in many cases, at the end of training, θ(t = T) makes the loss function L_A(θ(t = T)) sufficiently small. In our work, we propose the following quantum algorithm.
Algorithm 1: Efficient quantum (stochastic) gradient descent algorithm
Input: An initial weight vector θ(0), maximal number of iterations T, machine learning architecture L_A of size n.
Output: A terminal weight vector θ(T).
1. Linearize the model L_A following Section II with a sparse matrix M.
2. Upload the sparse weight vector θ(0) as a state vector in the quantum device, using the methods of Ref. [55] or more sophisticated architectures like QRAM.
3. Perform the quantum linear system solver described in Section II C once to produce the quantum state vector at the end, t = T.
4. Perform the state tomography described in Section II D to output the classical model parameters θ(T).
Result: θ(T).
As long as the conditions described in Section I E are satisfied, the overall complexity of the efficient quantum (stochastic) gradient descent algorithm scales poly-logarithmically in n and linearly or quadratically in T.
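For reference, the classical dynamics that Algorithm 1 is designed to reproduce is plain (full-batch) gradient descent. A minimal sketch, using a hypothetical sparse quadratic loss of our own choosing:

```python
import numpy as np

def gradient_descent(grad, theta0, eta=0.01, T=1000):
    """Classical reference dynamics theta_{t+1} = theta_t - eta * grad(theta_t);
    Algorithm 1 aims to output (a sparse encoding of) theta(T)."""
    theta = theta0.astype(float).copy()
    for _ in range(T):
        theta = theta - eta * grad(theta)
    return theta

# Hypothetical sparse quadratic loss L(theta) = theta^T A theta / 2 with diagonal A;
# the zero diagonal entry mimics a pruned direction that never moves.
A = np.diag([1.0, 2.0, 0.0, 3.0])
theta0 = np.array([1.0, 1.0, 0.0, 1.0])
theta_T = gradient_descent(lambda t: A @ t, theta0)
```

Note that the pruned (zero) direction stays exactly zero for all T iterations, which is what makes a sparse encoding of θ(T) possible.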

E. Assumptions on the machine learning setup
In this part, we briefly list the assumptions made in the setup.
• Sparsity of the model. First, the sparsity of the matrices F_ℓ used in the matrix M (see Section II for details) should be a constant as n grows. This depends on the machine learning model architecture. However, in Section V E, we show that as long as the machine learning model architecture is pruned well, the overall sparsity of the QCM is well-controlled.
• Sparsity of the weight vectors. The sparsity of the weight vectors is satisfied naturally in sparse training, where the sparsity of the weight vectors stays constant and does not scale with the size of the model.
• Dissipative assumptions. The model should be sufficiently dissipative, such that either the eigenvalues of the Hessian are all non-negative, or the number of positive eigenvalues far exceeds the number of negative eigenvalues (see Section II and Section III for detailed theoretical and experimental discussions).
• Complexity of uploading and downloading quantum states. The tomographic procedure is efficient due to the assumption of sparse training (see the algorithm described in Section II D).

II. EFFICIENT QUANTUM ALGORITHMS FOR DISSIPATIVE DIFFERENTIAL EQUATIONS
The idea of Ref. [36] can be directly generalized to difference equations. Roughly speaking, Ref. [36] primarily makes use of Carleman linearization, transforming non-linear ODEs into a set of linear ODEs. One can then discretize the ODEs as difference equations. The error in this procedure arises from two sources: the linearization error from the Carleman linearization, and the discretization error from solving the ODEs. If we are interested in the difference equations themselves, the discretization error does not have to be considered. Other parts of the proof are very similar to the pure ODE case. The results of Ref. [36] can be generalized from the quadratic case to higher-degree cases [56][57][58]. There are alternative linearization approaches that can be implemented on quantum computers [59][60][61][62][63][64][65][66][67].

A. Quantum Carleman linearization and error bounds
In this subsection, we turn to discussing the specific procedure of quantum Carleman linearization [36]. As an example, let us consider the k-th degree polynomial differential equation
du/dt = F_k u^{⊗k} + ⋯ + F_2 u^{⊗2} + F_1 u + F_0(t).
Here, we assume that F_k, …, F_2, F_1 do not change over time, while F_0 = F_0(t) might depend on time. We further assume that all F_ℓ are at most s-sparse. We can now formulate the so-called quantum Carleman linearization. (There is another contribution due to the discrete nature of the dynamics. We will discuss it in Section V B. That contribution can be ignored if the learning rate is small.) Based on these expressions, we can construct the linear system truncated with the parameter N, collecting the tensor powers u^{⊗j}, j = 1, …, N, into a vector y satisfying the iterative relation
y(t + 1) = (𝟙 + M(t)) y(t) + b(t), (II.17)
where b(t) stems from F_0(t). The term F_0 ↦ F_0(t) can be understood as the noise term in the stochastic gradient descent. We call the matrix M(t) the quantum Carleman matrix (QCM). The dimension of the QCM is
Δ = n + n² + ⋯ + n^N = (n^{N+1} − n)/(n − 1).
We assume that M(t) has Δ eigenvalues denoted by λ_j(t), j ∈ [Δ].
First, we assume that M(t) does not change over time, M(t) = M, which corresponds to a time-independent F_0. More generally, we use the index notation i, j, … ∈ [Δ]. The matrix P diagonalizes the QCM, so that M = P diag(λ_1, …, λ_Δ) P⁻¹. Making use of the sub-multiplicativity of the operator norm and bounding the right-hand side by its largest element, one can bound the second term of the error. Moreover, this term can be put into a more refined form. Assuming that all source terms are uniformly upper bounded as in (II.28), we obtain a bound on the error in terms of the factors (1 + λ_j)^{t+1}. Since the matrix M and the source vector are time independent, the specific time dependence depends entirely on the eigenvalues of M. In the most extreme case, all eigenvalues of M satisfy |1 + λ_j| < 1, and then we obtain a bounded error that converges exponentially. This yields our linearization error bound, stated in terms of the total number of iterations T.
Moreover, we can formulate a probabilistic version. We assume that there are Δ₊ eigenvalues λ_j satisfying |1 + λ_j| < 1; without loss of generality, we take these to be the first Δ₊ eigenvalues. Moreover, we define Δ₋ := Δ − Δ₊. Then, the exponential convergence will still be approximately valid if, weighted by the coefficients P_{i,j}, the contribution of the diverging modes of the QCM (eigenvalues satisfying |1 + λ_j| ≥ 1) is significantly smaller than the contribution of the dissipative modes (eigenvalues satisfying |1 + λ_j| < 1); in that case, we still obtain a convergent error ε.
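This dominance condition can be evaluated numerically. A minimal sketch, with illustrative (non-benchmarked) numbers, computing the largest iteration count for which the decaying dissipative modes still outweigh the growing divergent ones in a two-mode toy spectrum:

```python
import math

def max_iterations(lam_plus, lam_minus, n_plus, n_minus):
    """Largest T with n_minus * |1+lam_minus|**T < n_plus * |1+lam_plus|**T,
    i.e., the growing divergent modes are still outweighed by the decaying
    dissipative ones. Assumes |1+lam_plus| < 1 < |1+lam_minus|, n_plus >> n_minus."""
    ratio = abs(1.0 + lam_minus) / abs(1.0 + lam_plus)
    return math.floor(math.log(n_plus / n_minus) / math.log(ratio))

# Illustrative numbers: a million dissipative modes against ten divergent ones.
T_max = max_iterations(lam_plus=-0.05, lam_minus=0.02, n_plus=10**6, n_minus=10)
```

The more dissipative modes dominate in number, and the more weakly divergent the bad modes are, the more iterations can be carried out before the error bound is lost.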
Let us give a simple example at this point. Assume that we only have two typical eigenvalues λ_j: λ₊ as a dissipative mode and λ₋ as a divergent mode. Furthermore, assume that the coefficients P_{i,j} are equal for all components. The requirement that the dissipative contribution dominates,
Δ₋ |1 + λ₋|^{t+1} < Δ₊ |1 + λ₊|^{t+1}, (II.45)
then implies a constraint on the time t. Replacing t + 1 with the total number of iterations T, we get the constraint that
T < log(Δ₊/Δ₋) / log(|1 + λ₋|/|1 + λ₊|)
has to hold. This calculation gives a precise condition: We have to choose the number of dissipative modes to be much larger than the number of diverging modes in the QCM. One can easily generalize the above argument to several different modes: for the i-th component, one can derive an analogous condition on the total number of iterations T_i. We mostly care about the first component, which carries the scalar difference between the original non-linear equation solution and the linearized version, so we can take i = 1. Finally, in terms of the mathematical statement, we can replace ≫ by >, since the bounds stated can be directly established with >. Thus, we arrive at the following claim.
Then, we have the corresponding error bounds. We now turn to discussing the time-dependent cases, which correspond to settings of stochastic gradient descent. Similarly as before, we have the diagonalization of M(t) at each time step, and we find an error expression of the same structure as before: if there are sufficiently few divergent modes and many more dissipative modes, we obtain similar bounds guaranteeing that the error ε is decaying, giving exponentially converging errors. One might still worry about how realistic this approach is for practically useful machine learning problems. An assumption we could consider is a distribution of the QCM eigenspectrum. For simplicity, we assume that the λ_j independently follow the normal distribution N(λ̄, σ). The resulting average can be expressed through the confluent hypergeometric function. If λ̄ > 0 (the divergent case), the formula quickly becomes dominated by the exponential divergence for large t, which can be used to check the convergence/divergence properties (II.61). A particularly interesting example is λ̄ = −1. In the case where σ < 1, one can estimate that the expression for the error is dominated by the quadratic term; thus, the error is quadratic in the time t + 1, which is much milder than an exponential divergence. The mathematics is easier when we consider the independent uniform distribution with the range [λ̄ − σᵤ, λ̄ + σᵤ] (note that now σᵤ is not exactly the standard deviation). One then obtains exponential convergence if −1 < λ̄ + σᵤ < 0.
One can use those criteria to numerically check the property of the bounds. Finally, we estimate the bound on the source term b̄. We can express this bound using a truncation order N' > N. Moreover, we can assume that t ↦ u(t) is bounded in time. By rescaling variables, one can, without loss of generality, assume that ∥u(t)∥ ≤ 1, and the bound then holds in all cases. Therefore, if we assume that the total error is dominated by the linearization error ε_le, there exists a positive integer N such that the linearization error falls below any prescribed, sufficiently small ε_le > 0.
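To make the Carleman construction concrete, here is a minimal classical sketch (our own illustration, with illustrative parameter values) for a single dissipative quadratic ODE: the truncated linear system in the variables y_j = u^j is iterated alongside the directly integrated non-linear dynamics, and the two solutions agree up to a small linearization error.

```python
import numpy as np

def carleman_matrix(a, b, N):
    """Carleman matrix for du/dt = a*u + b*u**2 in the variables y_j = u**j:
    dy_j/dt = j*a*y_j + j*b*y_{j+1}, truncated at order N."""
    M = np.zeros((N, N))
    for j in range(1, N + 1):
        M[j - 1, j - 1] = j * a
        if j < N:
            M[j - 1, j] = j * b
    return M

a, b, u0, eta, T, N = -1.0, 0.2, 0.5, 0.01, 500, 4  # a < 0: dissipative
y = np.array([u0**j for j in range(1, N + 1)])
M = carleman_matrix(a, b, N)
for _ in range(T):                 # Euler iteration y <- (I + eta*M) y
    y = y + eta * (M @ y)
u_lin = y[0]

u = u0                             # same Euler steps on the non-linear equation
for _ in range(T):
    u = u + eta * (a * u + b * u**2)
```

Because the dynamics is dissipative, the truncated component at order N decays and the error in the first component stays controlled, mirroring the bounds above.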
B. The QCM spectra

Dominance from Hessian matrices
In this part, we derive how the QCM spectrum is related to the Hessian in machine learning problems. For simplicity, let us first consider the linear block of the QCM. We can then bring these findings into the context of actual machine learning models. For a machine learning model with trainable parameters θ_i and loss function L_A, we have (see Section III for further information, where ⊃ indicates that we are calculating the first-order term in the Taylor expansion)
M ⊃ −∂²L_A/∂θ_i ∂θ_j,
i.e., the leading block of the QCM is the negative Hessian of the loss. Thus, more positive Hessian eigenvalues reflect more dissipative modes in the QCM.

Perturbation theory
We here put these findings into the context of perturbation theory. We consider the case where the non-linear blocks of the QCM are small compared to the linear (Hessian) block. In this case, one can use perturbation theory from quantum mechanics to compute the shift of the QCM spectrum induced by the perturbation F_{0,ℓ≥2}. We assume that F_{0,ℓ≥2} = ε_{0,ℓ≥2} F̂_{0,ℓ≥2} (II.85), with 0 < ε_{0,ℓ≥2} ≪ 1 (II.86). We then apply degenerate perturbation theory: the sum is taken over all eigenvalues, and |λ, π⟩ is the eigenvector assigned to the permutation π in the eigenspace with eigenvalue λ_j. Thus, we obtain at most N! different corrections λ_{j,π} in second-order degenerate perturbation theory, i.e., at most N! eigenvalue splittings.

C. Quantum linear system solvers
Until now, we have established how to obtain a linearized difference equation. We now aim at approximately solving this equation by making use of a quantum computer. Given the initial condition described by a quantum state, we aim to provide a quantum-state description of the solution at the terminal time T. Quantum computers can identify the solution of an N_h-dimensional linear system in Hilbert space. Originally proposed by Harrow, Hassidim, and Lloyd [8], quantum linear system algorithms [8,[68][69][70][71][72][73] have been developed to provide a quantum state encoding the solution with complexity poly(log N_h). Building on this body of work, generalizations to solving linear ordinary and partial differential equations [74][75][76][77][78] have been proposed.
Theorem II.3 (Quantum linear system algorithm [73]). Let Ax = b be a system of linear equations, where A is an s_A-sparse N_h × N_h matrix with condition number κ = ∥A∥∥A⁻¹∥. Given a sparse matrix oracle for A and an oracle for |b⟩, there exists a quantum algorithm that produces the state vector |A⁻¹b⟩ within error ε_HHL > 0 using gate complexity O(s_A κ ⋅ poly log(N_h/ε_HHL)). (II.100)
We consider the linearized difference equation, for which A⁻¹ can be written explicitly. Given a sparse matrix oracle, the cost of encoding an s-sparse N_h × N_h matrix A is O(s). The upper bound on the condition number shows that κ = 2(T + 1), which indicates that the gate complexity of solving the desired linear system is O(sT ⋅ poly(log(N_h/ε_HHL))). Furthermore, we can ask what happens when we have divergent modes. Here, we notice that the HHL algorithm can be implemented only on the well-conditioned eigenspaces by adding auxiliary qubits (see Ref. [8], discussion section), which is likely applicable to the adiabatic method [73] as well. In the context of the original work [8], when solving Ax = b, we expand b in the normalized eigenvectors of A. The algorithm outputs |A⁻¹b⟩ up to the given precision. Instead, one can add auxiliary qubits such that the output also carries flag states. Here, κ_r > 0 is an effective positive number and need not be the condition number of A. When ∥A∥ = 1 and κ_r = κ, the algorithm effectively reduces to Theorem II.3, regardless of those auxiliary qubits. The auxiliary states |well⟩ and |ill⟩ are post-selected at the end of the algorithm. If we choose a general κ_r, the error on the state vector |x⟩ can be estimated by the fidelity between the ideal and truncated solutions. In an actual machine learning process, the weights and biases might be highly random. A good model to reflect this randomness is to assume that |b⟩ is a Haar-random state vector, namely |b⟩ = U_Haar|0⟩, where U_Haar is distributed uniformly over the whole unitary group (see Ref.
[79]). We denote ρ₀ = |0⟩⟨0|, write Λ⁻¹ for the diagonal matrix of the inverted eigenvalues of A, and let Λ̄⁻¹ be its pruned version. This is the step of a sparse truncation: all eigenvalues satisfying λ_j ≥ 1/κ_r are included, while the eigenvalues satisfying λ_j < 1/κ_r are excluded and set to zero, leading to a sparse setting. Since the expression for ⟨x|x̄⟩ only involves inner products containing a pair of U_Haar and its Hermitian conjugate, the Page theorem [79,80] implies that we may equivalently assume that |b⟩ is a uniform superposition of all computational basis states. Similarly, the normalization involves #{j : λ_j < 1/κ_r}, the number of eigenvalues smaller than the cutoff 1/κ_r. We can use the above formulas to estimate the inner product (II.110). Note that we average the numerator and the denominator separately, inside and outside the square root. This is the difference between the annealed average and its quenched counterpart [79,81], where the error is likely negligible in the limit of large (T + 1)Δ, the dimension of the Hilbert space. We can specify the situation in the machine learning case, where we have Δ₊ dissipative modes with eigenvalues λ₊. We can simply take κ_r = 2(T + 1) as the upper bound, such that the divergent modes are ill-conditioned. In this case, we estimate the fidelity (II.116). In the limit λ₊ → 0⁻, only a constant error is introduced if we set κ_r linear in T, where the constant refers to the ratio of dissipative modes. This observation is independent of the number of iterations T or (T + 1). Finally, we note that filtering out ill-conditioned eigenvalues might cause additional overhead in the algorithm. The naive usage of the auxiliary qubits in Ref.
[8] introduces an extra gate overhead quadratic in the effective condition number. Rigorously improving this filtering algorithm, with fewer auxiliary qubits and post-selection via amplitude amplification, and adapting the above framework to the adiabatic setting [73], is left for future research.
Theorem II.4 (Truncated quantum linear system algorithm). Let Ax = b be a system of linear equations, where A is an s_A-sparse N_h × N_h matrix. We define the effective condition number κ_r > 0 that filters out all eigenvalues of A smaller than 1/κ_r, and define the truncated solution |x̄⟩ as in Eq. (II.109). Given a sparse matrix oracle for A and an oracle for |b⟩, there exists a quantum algorithm that produces the state vector |x̄⟩ within error ε_HHL > 0 using a gate complexity of O(s_A κ_r ⋅ poly log(N_h/ε_HHL)).
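A classical emulation may clarify the truncation step in Theorem II.4: invert A only on its well-conditioned eigenspace and zero out the ill-conditioned directions. A minimal sketch for a symmetric A (the quantum algorithm performs this filtering coherently via auxiliary qubits):

```python
import numpy as np

def truncated_solve(A, b, kappa_r):
    """Invert A only on the well-conditioned eigenspace (|lambda| >= 1/kappa_r);
    components along ill-conditioned directions are set to zero."""
    lam, V = np.linalg.eigh(A)        # A assumed symmetric in this sketch
    inv = np.zeros_like(lam)
    well = np.abs(lam) >= 1.0 / kappa_r
    inv[well] = 1.0 / lam[well]
    return V @ (inv * (V.T @ b))

A = np.diag([2.0, 1.0, 1e-8])         # one ill-conditioned eigenvalue
b = np.array([1.0, 1.0, 1.0])
x_bar = truncated_solve(A, b, kappa_r=100.0)  # drops the 1e-8 direction
```

The ill-conditioned component, which would otherwise blow up the condition number (and hence the gate complexity), is simply absent from the truncated solution |x̄⟩.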

D. Tomographic recovery and sparsity
In this subsection, we briefly discuss notions of tomographic recovery and the role they play in sparse training. The requirement of sparsity for the implementation of our quantum algorithms has two components: the sparsity of the weight and bias vectors θ on the one hand, and the sparsity of the QCM matrix M on the other hand. The sparsity of the QCM matrix originates from the machine learning architecture and the Taylor expansion, while the sparsity of the weight vectors comes from the nature of sparse training. Here, we primarily discuss the sparsity of the output weight vector, where we denote by s the number of non-zero entries of a vector with Hilbert space dimension n. The vectors to be recovered tomographically in this step are of the form
|ψ⟩ = Σ_{j=1}^{s} c_j |b_j⟩,
where each |b_j⟩ is a computational basis vector of a system of q = log₂(n) qubits. That is to say, the state is s-sparse in the sense that there are at most s non-vanishing coefficients. So, in a first step, it has to be determined which computational basis vectors feature and hence which coefficients are non-zero, as the sparsity pattern in this sense is not known. This problem can be identified as an instance of the coupon collector's problem (see, for instance, Ref. [82]). The algorithm in this step goes as follows. One performs projective measurements in the computational Pauli-Z basis and records the observed patterns. This procedure is repeated until all s different patterns have been observed. The original coupon collector's problem is recovered in this setting exactly for the state vector |ψ⟩ = s^{−1/2} Σ_{j=1}^{s} |b_j⟩, an equal superposition, as then each vector reflecting a different pattern is observed with the same probability. For this problem, the expected number of measurements is known to scale as O(s log s). For the state vector as specified in Eq.
(II.121), the weights are different for each pattern. For this weighted coupon collector's problem, the expected number of steps until all patterns have been observed is also known [82]. It takes a moment of thought to see that equal probabilities for all patterns minimize this expectation value. There is a subtlety worth commenting upon that relates to small non-vanishing weights: If weights converge to zero, the expected number of measurements needed to observe all patterns, including those of small weight, clearly diverges. In this situation, one can consider only the significant weights above a given threshold δ > 0, which leads to an efficient algorithm for an approximation up to weights larger than δ, as a detailed analysis shows. Overall, again, this step can be performed with O(s log s) measurements. After this step of identifying the at most s relevant computational basis vectors, corresponding to an s-sparse weight state, we can pursue a variant of state tomography to extract the full pure quantum state, including phases. We suggest doing so by implementing a variant of the shadow estimation scheme of Ref.
[27], making use of random quantum circuits involving the q-qubit Clifford group. For the physical implementation, O(q²) entangling gates are necessary. After implementing random Clifford circuits, one performs computational basis measurements, giving rise to outcomes reflecting computational basis vectors |b⟩ of the q-qubit system. For the input |ψ⟩⟨ψ|, the output of this random quantum channel is obtained by averaging over circuits C drawn from the q-qubit Clifford group. To compute the inverse of this channel, one has to compute overlaps of computational basis vectors and stabilizer states obtained from applying Clifford circuits to computational basis vectors. Such overlaps can be computed classically efficiently, with an effort polynomial in q. For a state vector of the form as in Eq. (II.121), one can take samples of random Clifford circuits applied to |ψ⟩ and sample from the output distribution obtained from computational basis measurements, creating a classical shadow. From these data, one can infer the elements c_j for j = 1, …, s, requiring an overall effort of O(s²q³/ε²) in the random measurement procedure for a recovery with 1-norm error ε > 0, ignoring logarithmic factors. It also involves a classical effort polynomial in q, due to computing overlaps of stabilizer states and computational basis states [83]. Alternative methods are also conceivable, e.g., based on compressing circuits and the methods presented in Ref. [84] based on projected least squares.
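The support-identification step above can be sanity-checked classically. A Monte-Carlo sketch of the (weighted) coupon collector cost, with illustrative parameters of our own choosing:

```python
import numpy as np

def expected_shots(probs, trials=2000, seed=0):
    """Monte-Carlo estimate of the number of computational-basis measurements
    needed until every non-zero amplitude pattern is observed at least once."""
    rng = np.random.default_rng(seed)
    s = len(probs)
    total = 0
    for _ in range(trials):
        seen, shots = set(), 0
        while len(seen) < s:
            seen.add(rng.choice(s, p=probs))
            shots += 1
        total += shots
    return total / trials

s = 8
uniform = expected_shots(np.full(s, 1.0 / s))           # ~ s * H_s, the O(s log s) regime
skewed = expected_shots(np.array([0.44] + [0.08] * 7))  # unequal weights cost more
```

As the weighted analysis predicts, the equal-superposition case minimizes the expected number of shots, while small-weight patterns drive the cost up.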
In the actual machine learning process, the ratio of pruned parameters in sparse training, r_tr(n), scales with the size of the model n. In this case, the weight sparsity is given by s = (1 − r_tr(n)) n. For our algorithm to be feasible, we expect s to be constant in n. This requires that 1 − r_tr(n) = O(1/n), which guarantees that the cost of our algorithm has a poly-logarithmic dependence on n.
Finally, we wish to mention that in the case of dense input vectors, the tomographic algorithm will read the s vector components with the largest norms, and this can be interpreted as a natural pruning method for further classical processing. In this case, we assume a generic dense training approach, and the dense input weight vectors can be uploaded via QRAM. The downloading process can still be efficient if s does not scale with n, and this can be interpreted as a hybrid quantum-classical pruning method.

III. MACHINE LEARNING AND NUMERICAL ASSESSMENT
In this section, we present substantial numerical evidence regarding the feasibility of accelerating machine learning with our proposed quantum algorithm. First, we formulate the learning dynamics of a multilayer perceptron (MLP) machine learning model as an ODE system and make some error-scaling statements. Second, we numerically analyze the convergence characteristics of the quantum algorithm for large vision models using their Hessian spectra; this is because computing the original QCM and diagonalizing it for large models is numerically intractable. Lastly, we present numerical evidence supporting the use of Hessian spectra for convergence analysis.

A. Training dynamics of MLPs as ODE systems
To analytically understand neural networks, we consider the MLP, a simple neural network architecture (see Ref. [85]). Its elements are defined by
z^(1)(θ) := b^(1) + W^(1) x,   z^(ℓ+1)(θ) := b^(ℓ+1) + W^(ℓ+1) σ(z^(ℓ)),
where x is the input data vector, W^(ℓ) and b^(ℓ) are the ℓ-th layer trainable weights and biases, z^(ℓ) is the ℓ-th layer preactivation, n_ℓ is the ℓ-th layer width, and σ is a non-linear activation function, taken to be a polynomial in its argument. Since x and the labels y are fixed, the z^(ℓ) are already functions of θ; what remains to be shown is how to express the derivatives dθ/dt = −∂L_A/∂θ as functions of θ as well. Expanding the chain rule, we obtain the increments of θ in terms of θ (and the training data), and have thus obtained a non-linear ODE system. The highest-order terms of the ODE correspond to the terms derived from ∂z^(L)/∂W^(1) and ∂z^(L)/∂b^(1). Assuming that the non-linearity σ is a polynomial of order p_σ, then σ(z^(1)) is a polynomial of order p_σ in θ, z^(2) has order p_σ + 1, σ(z^(2)) has order p_σ(p_σ + 1), and σ(z^(ℓ)) has order Σ_{ℓ'=1}^{ℓ} p_σ^{ℓ'} = p_σ(p_σ^ℓ − 1)/(p_σ − 1). Since z^(ℓ) is one order higher than σ(z^(ℓ−1)), Eq. (III A) leads to the following result.
Theorem III.1 (Bound on the order). For an L-layer MLP with activation functions of order p_σ, the ODE dθ_i/dt = −∂L_A/∂θ_i has an order that is finite and scales as O(p_σ^L). Thus, for a neural network with bounded polynomial activation functions, the degree k of the ODE is finite.
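The order recursion behind Theorem III.1 can be sketched directly: each layer adds one order (via the affine map) and then multiplies by p_σ (via the polynomial activation). Our own illustration:

```python
def activation_order(layer, p_sigma):
    """Polynomial order in the parameters of sigma(z^(layer)), via the recursion
    ord(sigma(z^(l+1))) = p_sigma * (ord(sigma(z^(l))) + 1),
    with ord(sigma(z^(1))) = p_sigma."""
    order = 0
    for _ in range(layer):
        order = p_sigma * (order + 1)
    return order

# Quadratic activation (p_sigma = 2): orders p(p^l - 1)/(p - 1) = 2, 6, 14, ...
orders = [activation_order(l, 2) for l in (1, 2, 3)]
```

The closed form p_σ(p_σ^ℓ − 1)/(p_σ − 1) from the text matches this recursion, confirming the O(p_σ^L) growth of the ODE degree with depth.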

B. Numerical analysis on sparse ResNet for vision
In order to understand the potential of our approach in accelerating the training of state-of-the-art large neural networks, we study the training dynamics of a ResNet [86] applied to the CIFAR-100 [87] data set. CIFAR-100 is a set of 60,000 images in 100 categories, and the neural network is trained to classify given images. We first train a dense ResNet, where all parameters are kept as normal, with depth 32 and a widening factor of 4. The parameter count for the dense model is 7,451,044. After that, 90% of the parameters are pruned (set to 0), and a new model is trained. We allow pruned parameters to evolve to non-zero values, so that we might obtain a different set of parameters as sparse weights. Further, since the quantum training algorithm deviates from the ideal training trajectory after a number of steps, we download sparse weights (10% of all parameters in this case) every 100 steps from the quantum ODE system and re-upload the sparse weights to continue quantum training. We train the ResNet and examine the Hessian spectra at different steps, as shown in Fig. 2 (a) and (b), where each step reflects an update using the gradient estimated from 2048 randomly selected training samples with a learning rate of 0.01. We observe in Fig. 2 (a) that the Hessian spectrum immediately after a sparse-weight download is broad, but rapidly concentrates and becomes dissipative within 10 steps of training. We also show the Hessian spectra immediately after sparse-weight downloads for the entire training process. The spectrum-broadening effect of downloads becomes less significant as training progresses. The entire training process contains more dissipative modes than divergent modes, as can be seen in the Hessian spectra.
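The magnitude-based pruning step used here can be sketched in a few lines. This is a generic sketch (the actual experiments prune a ResNet inside a deep-learning framework):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights,
    returning the pruned weights and the binary mask of kept entries."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k]   # k-th smallest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100))            # stand-in for one weight matrix
w_sparse, mask = magnitude_prune(w, sparsity=0.9)
```

The mask can then either be kept fixed or, as in the experiment above, relaxed so that pruned parameters may regrow during sparse training.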
We further quantify the impact of the Hessian spectra on the convergence of the linearized solver. Equation (II.30) gives us a bound on the linearization error. Assuming the bound parameters are the same for all modes, we test this criterion by examining an error proxy defined as an average over the modes, where the relevant rates are the negatives of the Hessian eigenvalues. If all per-mode growth factors are exactly 1, neither dissipative nor divergent, the quantity stays exactly 1. If the divergent contributions outweigh the dissipative contributions, the quantity is greater than 1. We numerically compute the density ρ(λ) of Hessian eigenvalues, normalized so that ∫_{−∞}^{∞} ρ(λ) dλ = 1. Assuming the Hessian spectrum stays constant, the error proxy can be estimated accordingly. We estimate the error proxy by sampling the eigenvalue densities at 256 points. Further, since eigenvalues close to 0 are extremely abundant and lead to a stationary error proxy, we remove eigenvalues with magnitudes less than 0.4 and renormalize the densities; the error proxy is then defined over this truncated, renormalized density. Since the Hessian spectra are broad immediately after download, the error proxy tends to diverge. Therefore, we consider the training scenario in which the first 10 steps after download are trained classically without error, and the new dense weights are uploaded to the quantum ODE system with QRAM. As can be seen in Fig. 2 (a), the Hessian spectra 10 steps after download are already concentrated and mainly dissipative, which results in a less rapidly growing linearization error. Fig.
2 (c) shows the estimated error proxy. The error proxy stays exactly at 1 for 10 steps after download because the model is trained classically and incurs no linearization error. Afterwards, we observe that the error proxy initially decreases due to the dominance of dissipative modes. However, as time increases, the exponentially growing contribution from divergent modes eventually takes over, leading to an exploding error. We download the weights before the error becomes too large, and resume the classical-training and weight-re-uploading scheme. Since the Hessian spectra at larger step numbers have smaller divergent contributions, as indicated in Fig. 2 (b), we can be fairly certain that the error proxy will be even better managed in later stages of training.
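For concreteness, one plausible implementation of such an error proxy is sketched below. This is our reading of the description: the per-mode amplification factor (1 − ηλ)^t of plain gradient descent stands in for the paper's exact expression; the 0.4 cutoff and the 256-point sampling follow the text, while the spectrum itself is made up for illustration.

```python
import numpy as np

def error_proxy(eigs, lr=0.01, t=1, cutoff=0.4):
    """Drop near-zero Hessian eigenvalues (|lambda| < cutoff), then average
    the per-mode amplification factors (1 - lr*lambda)**t of gradient descent."""
    eigs = np.asarray(eigs, dtype=float)
    kept = eigs[np.abs(eigs) >= cutoff]   # remove near-stationary modes
    return np.mean((1.0 - lr * kept) ** t)

# 256 sampled eigenvalues: mostly dissipative (positive), a few divergent.
rng = np.random.default_rng(1)
eigs = np.concatenate([rng.uniform(0.5, 2.0, 230),     # dissipative modes
                       rng.uniform(-0.6, -0.5, 26)])   # divergent modes

p_early, p_mid, p_late = (error_proxy(eigs, t=t) for t in (1, 50, 2000))
# Early on the proxy dips below 1 (dissipation dominates); at very large t
# the exponentially growing divergent modes take over and the proxy exceeds 1.
```

This reproduces the qualitative behavior described in the text: an initial decrease of the proxy followed by eventual blow-up once the divergent modes dominate.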
Finally, we examine the generalization ability of the model following this training scheme. Fig. 2 (d) shows that the test-set accuracy drops after each download, but the drop becomes less significant as training progresses. Our quantum ODE training scheme, therefore, produces useful sparse machine learning models. In addition, we examine the Hessian of a 103 million parameter ResNet. We start with a pre-trained model and prune 90% of the parameters. Due to the immense computational cost of computing the Hessian for a large machine learning model (a relatively large-scale model for computer vision, given our computational resources), we only benchmark the Hessian spectra to provide evidence of dissipation and potential quantum enhancements. Fig. 3 shows the initial Hessian spectrum, which clearly shows the dominance of dissipative modes over divergent modes, similar to the 7 million parameter model. Since the Hessian improves with training for the 7 million parameter model, we believe this is evidence that the 103 million parameter model will have similarly manageable error growth.

C. Numerical verification of the effect of Hessian on convergence
In this subsection, we provide further evidence that dissipative Hessians actually lead to better convergence. We have simulated systems of quadratic ODEs, where the F1 matrix has a tunable percentage of positive eigenvalues, and the F2 matrix entries are randomly sampled from a uniform distribution U(−0.015, 0.015). To generate random F1 matrices, we first generate positive eigenvalues from U(0, 1) and negative eigenvalues from U(−0.01, 0), form F1 in diagonal form, and transform its basis using a randomly generated orthogonal matrix. F0 is 0 in all entries.
First, we provide evidence for the suppression of the error in time and in the QCM truncation order N in the case that all modes are dissipative in Fig. 4(a), which shows the error of a randomly chosen parameter. If all modes are dissipative, an exponential reduction of the error as time increases is observed, due to the exponentially vanishing error bound. Second, for different fractions r of positive Hessian eigenvalues, the error of a randomly chosen parameter for N = 4 is shown in Fig. 4(b). For r < 1, divergent modes take over and eventually lead to exponentially growing errors in the limit of large times. However, reducing the number of divergent modes significantly suppresses the error and extends the time period within which the error remains controlled.
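The experiment just described can be re-created in miniature. The names F0/F1/F2 in the sketch below follow the quadratic-ODE convention du/dt = F0 + F1 u + F2 (u ⊗ u); as a labeled deviation from the paper's U(−0.01, 0), we draw the dissipative eigenvalues from U(−1, −0.1) purely so that the decay is visible within a few thousand Euler steps.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_F1(n, eig_low=-1.0, eig_high=-0.1):
    """Random F1 with a prescribed eigenvalue range, rotated to a random basis."""
    eigs = rng.uniform(eig_low, eig_high, n)
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal matrix
    return Q @ np.diag(eigs) @ Q.T

n, dt, steps = 20, 0.01, 2000
F1 = random_F1(n)                                  # fully dissipative case
F2 = rng.uniform(-0.015, 0.015, size=(n, n * n))   # quadratic term; F0 = 0

u = 0.01 * rng.normal(size=n)
u0_norm = np.linalg.norm(u)
for _ in range(steps):                             # forward-Euler integration
    u = u + dt * (F1 @ u + F2 @ np.kron(u, u))

final_norm = np.linalg.norm(u)
# With all modes dissipative, the trajectory contracts towards the origin.
```

Tuning the fraction of positive eigenvalues in `random_F1` reproduces the transition of Fig. 4(b): as divergent modes are added, the window of controlled error shrinks.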

IV. FUTURE IMPROVEMENTS
Our algorithm, as an end-to-end application of the HHL algorithm in the context of quantum machine learning, seems to open up several novel research directions. In this section, we hint at some future improvements building on our algorithm that may eventually lead to more effective enhancements of classical machine learning tasks using quantum technologies.
• Time-dependent version during gradient descent trajectories: In our work, we have defined the QCMs by Taylor expanding the machine learning models around specific starting points.For models beyond the scale-invariant activation functions, it might, however, be favorable to improve the previous analysis towards a time-dependent picture of generic matrix elements in the QCM matrices.The current strategy we have laid out in Section III is to expand the model around different points of parameters during the gradient descent trajectories.A re-formulation of the existing theory towards fully time-dependent QCM might make our theory more solid and complete.
• Better criteria for dissipation: In our current work, we define the contributions of the different signs of the eigenvalues of the QCM to be the primary criterion of dissipation. These, however, are hard to evaluate in large-scale models. Instead, in practice, we might want to count the contributions from the diagonal elements or, alternatively, use the Hessian eigenvalues as criteria when aiming at identifying regimes of possible quantum advantages. The question arises whether there are better criteria that are both more solid and more computationally efficient.
• Connections to diffusion models in classical machine learning: Our work highlights the observation that dissipation is important to obtain such an efficient quantum enhancement. The celebrated diffusion models used in current large language models [88] are also dissipative systems. It might be interesting to explore their connections.
• Theoretical improvements on the truncated HHL algorithm: In Theorem II.4, we propose a truncated version of the HHL algorithm. However, improvements would be plausible if one could extend the ideas of Ref. [73] to such a truncated version, which would likely reduce the overhead related to the sparsity s and the condition number κ from O(s²κ²) down to O(sκ).
• Quantum enhancement beyond dissipation: It is demonstrated in Ref. [36] that there is numerical evidence that the ODE system may not need to feature full dissipation to obtain a quantum speedup. It might be interesting to explore alternative methods that avoid the condition of dissipation and still provide efficient quantum algorithms.
• Classical counterparts: In the sparse training process, it might be possible that, for certain sparse training schemes with sufficiently few non-zero entries in the weight vectors, efficient classical algorithms could also be devised that restrict the calculations to s-dimensional subspaces of the whole weight-vector space. It might also be interesting to consider more sophisticated hybrid quantum-classical combinations to ensure maximal performance with low overheads.
• Large learning rate: In our theory, we assume that the learning rate is small enough that the update of the weight vectors is slow. This is a usual setting in machine learning problems. However, the learning rate could in principle be large for specific applications [89, 90]. We give an explicit solution of the Carleman linearization in the discrete case in Section V B. We leave studies of how this discreteness affects the performance of machine learning problems with large learning rates to future work.

V. OTHER COMMENTS
A. Aaronson's criteria
In this section, we finally comment briefly on our quantum algorithm from the viewpoint of Ref. [91]. We regard our quantum algorithm as an end-to-end provable application of the HHL algorithm under certain conditions. Ref. [91] summarizes four caveats in applying HHL, i.e., solving A|x⟩ = |b⟩ and reading out |x⟩, to machine learning problems with solid guarantees. We address how our algorithm takes all four conditions into consideration: • Initial state preparation (resolved by either QRAM or sparse training): Applying the HHL algorithm requires an efficient implementation of the initial state. In general, QRAM [37] might be required to encode non-sparse classical data into the quantum computer. QRAM as a possible efficient uploading oracle architecture has been extensively studied in Refs. [42, 92–94]. It is practically challenging to realize such a machine [43], although it has been shown that polynomially controllable error scaling is possible in certain devices [92]. In our work, under the sparse training assumption, the input state is sparse, and an efficient algorithm can be constructed without resorting to QRAM. For instance, Ref. [55] proposes an algorithm for loading an s-sparse quantum state for use in machine learning models of size n, using O(s² log s log n) classical time and O(s log n) quantum circuit depth (which only contains single-qubit gates and CNOT gates). Thus, under the sparse training assumption, we are able to avoid using QRAM altogether. On the other hand, in the dense training case, where input weight vectors are dense, our analysis is still valid when QRAM is available.
• The sparsity of the inverted matrix (resolved by the machine learning model architecture): HHL requires sparsity of the matrix to guarantee the efficiency of the algorithm. In our setup, this issue is related to the machine learning architecture, and it is closely connected to our pruning assumptions. In Section V E, we prove that as long as the machine learning model itself is sufficiently pruned, the QCM matrix A (and hence the matrix to be inverted) is sufficiently sparse as well. This justifies the sparsity of the HHL matrix from theory to practice in certain cases.
• Condition number of the inverted matrix (resolved by the dissipative assumptions): The HHL algorithm also requires a polynomial dependence on the condition number of the matrix to maintain efficiency. In our setup, the condition number is related to our assumption of dissipation. If our QCM matrix A satisfies ∥1 + ηA∥ < 1, the dynamics is entirely dissipative, and the condition number is linearly bounded by the number of iterations T. If not, Theorem II.2 and Theorem II.4 bound the efficiency of the algorithm at early times of training, as has been shown in Section III.
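This linear growth of the condition number can be checked directly. The sketch below is our own illustration: a small diagonal A with negative eigenvalues, and the standard block-bidiagonal encoding of the iteration w_{t+1} = (1 + ηA) w_t into one linear system; the constant 2(T + 1) in the check is the crude bound ‖L‖·‖L⁻¹‖ ≤ (1 + ‖M‖)(T + 1).

```python
import numpy as np

rng = np.random.default_rng(3)

# Dissipative one-step update M = 1 + eta*A with ||M|| < 1.
n, T, eta = 4, 30, 0.1
A = -np.diag(rng.uniform(0.5, 1.0, n))     # negative eigenvalues: dissipative
M = np.eye(n) + eta * A

# Block-bidiagonal system encoding w_{t+1} = M w_t for t = 0..T-1,
# as in HHL-style linear-system formulations of the iteration.
d = n * (T + 1)
L = np.eye(d)
for t in range(T):
    L[(t + 1) * n:(t + 2) * n, t * n:(t + 1) * n] = -M

kappa = np.linalg.cond(L)
# For a dissipative M, the condition number grows at most ~linearly in T.
```

Making M expansive instead (‖M‖ > 1) would make the same condition number grow exponentially in T, which is the regime in which the early-time bounds of Theorem II.2 and Theorem II.4 are needed.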
• The tomographic overhead of the output state vector (resolved by sparse training): The HHL algorithm provides |x⟩ as a quantum state vector as an output, and tomographic efforts are needed to bring the quantum data back to the classical devices. This issue can be resolved by our assumption of sparse training, as has been discussed in Section II D. For an s-sparse quantum state, the tomographic recovery needs additional resources that scale polynomially in s and 1/ε for error ε > 0, ignoring logarithmic factors, which is computationally efficient. On the other hand, under the dense training assumption, the tomographic recovery could still be used to read the s components of the weight vectors with the largest norms, which effectively acts as a pruning method for further training steps on the classical processors.
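The sampling-based readout of a sparse output state can be illustrated classically. In this toy sketch (a coupon-collector style support recovery by computational-basis sampling; the state, sparsity, and shot count are made up for illustration), each shot reveals one index with probability given by the squared amplitude:

```python
import numpy as np

rng = np.random.default_rng(4)

# An s-sparse "output state": s nonzero amplitudes among n entries.
n, s = 10_000, 20
support = rng.choice(n, size=s, replace=False)
amps = np.zeros(n)
amps[support] = rng.uniform(0.5, 1.0, s)
amps /= np.linalg.norm(amps)

# Computational-basis sampling: each shot yields index i with
# probability |amps[i]|^2 (a coupon-collector support recovery).
probs = amps ** 2
shots = rng.choice(n, size=5000, p=probs)
recovered = set(shots.tolist())
# With enough shots, every one of the s support indices is observed.
```

The required number of shots scales with the sparsity s (and the smallest amplitude on the support), not with the full dimension n, which is the point of the sparse-training assumption.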
Overall, our algorithm can be considered an end-to-end application of the HHL algorithm that, under our assumptions, provides guarantees satisfying the requirements of the HHL algorithm and offers efficient interfaces between classical and quantum processors.
The potentially exponential quantum enhancement we suggest invites efforts towards training general large-scale machine learning models, rather than the specific machine learning tasks considered in Refs. [25, 28, 29]. The capability of deep neural network models is far from being well understood, and they may not provide the most efficient resolution for certain tasks due to over-parameterization. However, the computational overhead in terms of the number of parameters n is generally unavoidable and costly in practice. Indeed, it seems highly unlikely that any classical algorithm for non-linear dissipative differential equations (and training dynamics) can run in time complexity poly-logarithmic in n. If we consider the continuous-time limit of the training dynamics as a system of differential equations, and let the dissipation and non-linearity approach zero, then asymptotically we obtain a system of linear differential equations with no dissipation. The problem of efficiently simulating non-dissipative linear differential equations is BQP-hard even when the dynamics is entirely unitary, generated by Hamiltonian evolution [95]. In other words, an efficient classical algorithm for non-linear dissipative differential equations would imply a classical algorithm for any problem that can be solved efficiently by a quantum computer, which is considered implausible. A similar argument is given in Section 7 of Ref. [36]. This can be seen as an argument that a de-quantization of the presented quantum algorithm by classical algorithms along the lines of Ref. [96] seems highly implausible, even though assessing this matter could give rise to a fruitful research question.

B. Elementary derivations
Here, we give a more detailed explanation of Eq. (II.12). Say that we have a differential equation generated by an infinitesimal differential operator; the Taylor expansion then provides the formal solution. There is an alternative way of writing this, using the notation Λ∞ = +∞. Specifically, this applies when we are dealing with the differential equation of interest, whose generator can either be time-dependent or time-independent, and we can identify the corresponding terms. Using the language of Kronecker products, we can understand Carleman linearization. Let us start with a simple example, compute each term, and read off the general closed form. Now, instead, let us estimate the difference between the discrete difference Δ and the differential d. Starting again with the example from before, we can again compute each term and obtain the general expansion; note that a useful identity holds here. For differential equations, we can derive the recursion relation for the linearization error. Using the notation of the main text, with the Taylor expansion truncated at order N, we recover exactly Eq. (II.19) in matrix form; here, we use the fact that the transfer blocks vanish above the truncation order N. On the other hand, in the discrete case, we can define the extra piece between Δ and d as the extra contribution from discretization. In fact, this contribution is of higher order in the learning rate η, since each block is O(η). The condition under which the discretization terms can be ignored is Eq. (V.25), valid in the limit where η is small, such that we can perform a perturbative expansion. There is another, more direct condition: starting from the discrete recursion and applying it multiple times, and noting that each block is O(η), we schematically obtain the exact solution of the Carleman linearization in the discrete case. The linearization error vector can be written in the form A′y + b′, where A′ and b′ reduce to A and b in Eq.
(II.18) when the learning rate η is small. In fact, we know that the discretization contribution can be ignored under the stated condition. Factorizing η, one can make the inequality purely a condition on the tensor A. Thus, the condition in Eq. (V.30) is a quantitative measure of how small the learning rate η is required to be.
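The discrete Carleman construction can be made concrete on a scalar toy update w ← w + η(aw + bw²). This is our own minimal example, not the paper's tensorized construction: the truncated linear map on the moment vector y_k = w^k is built by binomially expanding (c₁w + c₂w²)^k with c₁ = 1 + ηa, c₂ = ηb and dropping powers above the truncation order N.

```python
import numpy as np
from math import comb

# Scalar toy update w <- w + eta*(a*w + b*w**2), with a < 0 (dissipative).
a, b, eta, N = -1.0, 0.3, 0.05, 4        # N = Carleman truncation order
c1, c2 = 1 + eta * a, eta * b            # one-step map: w -> c1*w + c2*w**2

# Discrete Carleman matrix C acting on the moment vector y_k = w**k:
# (c1*w + c2*w**2)**k is expanded binomially and truncated at order N.
C = np.zeros((N, N))
for k in range(1, N + 1):
    for j in range(k + 1):               # choose j factors of c2*w**2
        power = k + j                    # resulting power of w
        if power <= N:
            C[k - 1, power - 1] += comb(k, j) * c1 ** (k - j) * c2 ** j

w0 = 0.5
y = np.array([w0 ** k for k in range(1, N + 1)])
w_exact = w0
for _ in range(200):
    y = C @ y                                            # linearized evolution
    w_exact = w_exact + eta * (a * w_exact + b * w_exact ** 2)

err = abs(y[0] - w_exact)                # truncation error stays tiny
```

Because the map is dissipative and η is small, the truncation error entering at order N feeds back into the first component only weakly, and the linearized trajectory tracks the exact one closely over many steps.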

A more detailed investigation
In this subsection, we discuss the relevant terms in substantial detail. Along the way, we introduce a tensor-network diagrammatic notation. We start by writing Eq. (V.28) in a different way. The first non-trivial term (assuming that Λ∞ − (N + 1) = N, in which case the order is 2) may look tedious. We can see the pattern, however, if we explicitly spell out another term, reflecting the order-3 case, for which the first part, the second term, the third part, the fourth term, and, finally, the last term can each be written out explicitly. Using these two simple examples as a basis, we can reframe this problem as follows. We introduce a matrix such that the relation can be expressed compactly; the objective is to determine this matrix. To streamline our approach, we will introduce a bit of notation. Firstly, we represent the order of the correction term, giving us a representative term of a fixed form. It is apparent that the count of transfer blocks within a term signifies the order of correction for a given expression. Additionally, we establish upper bounds for the summation indices; their values are derived from the numbers of blue sticks between the yellow balls in the diagrams. Therefore, when summing over the ℓs, this signifies a summation across various partitions with distinct upper bounds. Additionally, although the sum incorporates ℓ, it is not connected to any upper bound. Consequently, different partitions with identical upper bounds need to be included in the summation. Going forward, we omit the identity matrices between the blocks for the sake of brevity; the expression is then a summation of typical terms as described above, ranging from order 1 up to the truncation order. The conditions for this summation fix the quantity K and, concurrently, impose the constraints on each summation index, which is bounded between 0 and its upper bound. The ℓ dependence is embedded within the upper bounds, and the tilde denotes a summation under the condition that fixes K.
This conditional summation can be expressed recursively. Let us define, for 0 ≤ N′ ≤ N, the corresponding partial sums.
Therefore, in a general context, we obtain a recursion. Continuing from the previous illustrative explanation, this recursive description can be visualized with the basic elements defined as shown in the figure. An order-2 tensor is introduced, nearly identical to the order-1 tensor defined previously, with the distinction that K relies on both of its indices, while the order-1 tensor is contingent solely on one.
We are now in the position to formulate the recursion. This involves generating additional purple blobs to contract with the larger one (see Fig. 8). Using this recursive expression, one can systematically generate all typical terms at any order, by repeatedly expressing higher-order terms in terms of lower-order ones. The maximal order of correction always comes with a constant correction term. Consequently, we can now formulate the matrix of interest. Once again, the initial sum encompasses all orders of correction. The highest order represents expressions without any identity matrices. The count of identity matrices between the blocks influences the upper bounds of the summation indices, and summation over all ℓ is equivalent to summing over all potential upper bounds. The quantity K is introduced to ensure that the identity matrices are counted as 1 in the sum, thereby maintaining the correct dimension. Reinserting the identity matrices between every term is straightforward, using the upper-bound values, enabling the complete expression to be restored. The construction proceeds by defining an order-2 tensor with a restricted range on K as a sum involving an order-3 tensor contracted with the previously defined order-1 tensor; this process is then repeated recursively for N − 1 iterations. The purple blobs in the representation signify the range confinements associated with each step of this procedure.

Examples
We will now demonstrate the applicability of this recursive procedure by means of specific examples. Let us begin by examining the case N = 2. The first-order contribution takes the standard form. Moving on, let us examine the second-order contribution, which is the highest order in this case, so that K takes its maximal value. Initially, we have a constant term due to the highest order. Now, let us proceed to the intermediate range of powers. We start by determining the upper bounds: the upper bound for the second index exceeds that of the first by one. Using the recursive description, we obtain the corresponding sums. For the lower range of powers, there are no issues with the upper-bound condition, and we find the standard expression. However, for the upper range, the upper-bound condition can pose issues: in the first problematic case, the constraint on the first index becomes strict. Applying the same reasoning, one can deduce the result for the entire upper range. The last contribution comes from the maximal power, giving rise to the final term. To consolidate all the results, we return to the matrix of interest. This can also be visualized using our diagrams.
FIG. 9. Using the recursive approach outlined in Fig. 8, this scenario only necessitates a single iteration, encompassing the two general cases when N = 2.
It is at this point evident that this mirrors the exact description we previously presented in Eq. (V.33). Now, let us delve into the N = 3 case using the recursive process. The first order entails the standard situation, with K reduced accordingly, so that the first-order correction spans the corresponding range of powers. Expressing this in a conventional manner (and expanding the sum over ℓ1) gives the explicit expansion. The second order presents a unique situation: it involves three terms divided into two categories. The first category has a single term, for which the upper bound follows from ℓ1 = 1 while N = 3. For the second category, the upper bound follows from ℓ1 = 2 while N = 3; there are two terms because we are also summing over ℓ2, although it is unrelated to the upper bounds. The second kind is actually trivial, as it follows the same pattern as demonstrated in the previous example; this is analogous to the second order in the N = 2 case with a shifted argument. The first kind exhibits slight differences but remains similar. Summing up all three terms of the first kind encompasses the corresponding range of powers, while the second kind spans a slightly larger range. The third and highest-order contribution for N = 3 is the third-order contribution. Here, K takes its maximal value, and the upper bounds for the three summation indices each increase by one. Indeed, the third-order contribution extends from a constant term up to the maximal power. Taking a closer look at this contribution, the recursive nature of the calculation is evident from the fact that the second-order correction is embedded within the third-order correction. This recursive structure can also be visualized using our graphical representation: FIG.
10. For N = 3, two rounds of recursion are needed. This aligns with the three cases discussed earlier.
The observed pattern becomes clear now: within the appropriate range, each value of the first summation index contributes a full second-order term. This suggests that the summation is effectively over second-order contributions indexed by the first index for a given power. This description is consistent with our initial calculations. This pattern extends to higher values of N as well. For instance, in the case of N = 4, we would have fourth-order contributions. In this situation, for a given power, we would be summing over a collection of third-order contributions with varying first index, much like the ones described earlier. The second- and third-order contributions in the N = 4 case would have multiple variations with different upper bounds, while the first-order contribution would align with the standard case. This conforms to the recursive procedure detailed above. To present these concepts more comprehensively, one can explicitly write out a portion of the matrix of interest.
The matrix elements can be derived from the examples provided earlier. The structure of the matrices for different index values follows the described pattern, and the specific elements can be obtained by using the recursive method and considering the upper bounds for each summation index. For the first block, there is only the first-order contribution. For the second block, both first-order and second-order contributions are present: the first-order contribution extends over the standard range, and the second-order contribution involves specific terms within the range determined by the examples in Eq. (V.62). The results above can be deduced from Eq. (V.70). It is worth noting that the first-order contribution, as illustrated earlier, is encompassed within a window of the matrix, while the second-order contribution initiates earlier and extends further, depending on the value of ℓ1. This was elaborated upon during the derivation of Eq. (V.70). Following the recursive procedure, we can also write down the corrections from order 1 to N for the general case. We leave that as an exercise for the readers.

C. Alternative methods
The main text discusses the brute-force linearization of the original differential equations. Alternatively, one could make use of suitable transformations before linearization to simplify the linearization procedure, and even of transformations of a non-dissipative regime into a dissipative one. We discuss the following concrete example from the toy model of Ref. [89] and restrict our analysis to the ODE limit. We consider a data set with one-dimensional inputs and write the network function with one hidden layer and linear activations. The gradient descent equations (V.84) can be written in terms of the so-called neural tangent kernel (NTK); we denote the primary kernel eigenvalue by λ. We are now in the position to describe the quantum algorithm. Given an initial quantum state vector proportional to the initial parameters, we perform the quantum Carleman linearization algorithm of the main text for the non-linear ODE in Eq. (V.84). One can also regard the kernel as an independent variable to simplify the non-linearities in Eq. (V.84). According to the results of Refs. [36, 56] and the main text, the dissipation in Eq. (V.89), in particular the negative-definite kernel −λ, is beneficial for the convergence of the linearization algorithm. In general, we expect the linearization to be efficient whenever the parameters are located near a flat local minimum of the loss. If that works, there is an exponential enhancement in n.
D. The relationship between our work and Ref. [36]
The foundational algorithm presented in Ref. [36] has significantly influenced our work, and ideas presented there have been further developed here and put to good use. Our quantum ODE solver has technical elements that indeed echo concepts from Ref. [36]. This section delves into our contributions that generalize and expand the insights presented in that work.
• Building upon the machinery developed in that work, we have expanded its scope to encompass general-order nonlinearities, whereas the original publication was limited to second order only.
• The most important distinction at the quantum-algorithmic level between our work and Ref. [36] is the introduction of a notion of discreteness. We introduce a discrete variant of Carleman linearization and delve into its application to solving stochastic gradient descent equations. This is important for our purposes. In support of this, we offer a comprehensive exposition of discrete Carleman linearization, encompassing a reinterpretation of Carleman linearization theory, a tensor-network diagrammatic representation of the discretization error, analytical breakdowns of higher-order corrections, and illustrative instances of lower-order expansions.
• In our study, we have taken significant strides in conceptualizing and dissecting innovative pruning algorithms for classical neural networks.This endeavor is particularly geared towards efficiently navigating the unique input-output challenges inherent to quantum devices.
• Moreover, we have discerned that sparsity and dissipation stand out as crucial components for quantum ODE solvers.We have not only identified these elements in general terms, but have also delved into offering a precise definition and theorems relating to dissipation conditions within classical neural networks, specifically focusing on the Hessian matrices and their various derivations.
• Furthermore, we have designed and laid out innovative tomography methods that meld classical shadows with projective measurements. These methods are intriguingly connected to the coupon collector's problem, providing, we believe, a deeper understanding and a fresh perspective on the topic.
• Finally, we have conducted numerical experiments involving up to 103 million training parameters, potentially setting a new benchmark for large-scale machine learning models in the quantum literature. We have further evaluated the Hessian matrix spectra within these models, ensuring our claims are backed by empirical data and rigorous validation.

E. Sparsity of the quantum Carleman matrix
In our theory, we assume that the machine learning model itself is sparse, and pruning conditions are added to ensure the sparsity of the architecture. Here, we show that pruning the model leads to a sufficiently good sparsity condition for the QCM A, the matrix we have constructed for Carleman linearization from the blocks F. Similarly, every term in all Kronecker powers of the weight vector will be an entry in the vector of Eq. (V.96); in Eq. (V.96), we have subsumed all these components for simplicity and clarity of notation. Hence, consider a differential equation as in Eq. (V.97), whose functions contain finitely many terms. If there are no equal terms in the summations in Eq. (V.97), then we achieve the maximum number of non-zero terms in the selected row, which also reflects the sparsity of the corresponding row. □ The matrix sparsity of A after truncating at order N is then obtained by taking the maximum over all rows. It is less obvious to identify analogous terms in the different functions, because the order of the factors in the tensor products matters. As a result, terms have to take very specific forms in order for two of them to be identical; consider, for example, a product of the form (F1 ⊗ F2). Hence, for terms in the first and second part in Eq.
(V.99) to be the same up to an irrelevant factor, we need to require that the term in F1 starts with F1, that the term in F2 ends with F2, and that all the intermediate factors in the two are exactly identical. Overall, the sparsity is a function of the block sparsity s and the order k; since we maximize over the rows, it is a fixed and bounded number. The order of the ODE is also determined by the problem setup, so k is likewise fixed. Thus, as long as we prune the original machine learning model sufficiently well (reflecting a small s), the overall sparsity will at most acquire factors of s and k, which will not affect the overall result. In our real example in the main text, the model itself has already been significantly pruned to 1%. However, it might be interesting to look at the exact numbers and track the real-time evolution of the sparsity of quantum Carleman matrices in machine learning. Some of these issues will be discussed in future research.
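The counting argument above can be sanity-checked numerically. The sketch below is our own illustration using scipy.sparse: the sum F1 ⊗ I ⊗ I + I ⊗ F1 ⊗ I + I ⊗ I ⊗ F1 is the standard order-3 Carleman transfer block for the linear part, and its row sparsity is bounded by the order times the row sparsity of the pruned block.

```python
import numpy as np
from scipy import sparse

def max_row_nnz(M):
    """Maximum number of nonzeros in any row of a sparse matrix."""
    return int(np.diff(M.tocsr().indptr).max())

# A sparse first-order block F1, mimicking a heavily pruned model.
n, density = 50, 0.02
F1 = sparse.random(n, n, density=density, random_state=5, format="csr")
s1 = max_row_nnz(F1)

# Order-3 Carleman transfer block for the linear part:
# A = F1 (x) I (x) I + I (x) F1 (x) I + I (x) I (x) F1.
I = sparse.identity(n, format="csr")
A = (sparse.kron(sparse.kron(F1, I), I)
     + sparse.kron(sparse.kron(I, F1), I)
     + sparse.kron(sparse.kron(I, I), F1)).tocsr()

# Each row of A is a sum of 3 Kronecker rows, so its sparsity is
# at most 3 * s1: the order times the row sparsity of the block.
```

The same bound extends to higher orders and to the nonlinear blocks: the row sparsity of the Carleman matrix grows only with the order k and the pruned-block sparsity s, never with the full dimension.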

Figure 1. A possible learning process in large-scale models, which might use sparse training and whose early stage of learning might admit a possible quantum enhancement. A dense neural network is pre-trained classically. The neural network weights are then pruned, and only a small fraction is preserved. A quantum ordinary differential equation system corresponding to the sparse training dynamics is constructed from the sparse network and the training data. To allow a quantum enhancement, the system must be sparse and dissipative. A measurement on the solution state is performed to obtain the final trained parameters, which are used to construct a trained classical sparse neural network.

Figure 2. (a)-(c) Numerical results on ResNet as a function of training step. Each step corresponds to one step of stochastic gradient descent based on the derivatives of the loss computed from 2048 randomly selected training samples. (a) ResNet Hessian spectra during training. (b) Estimated error proxy during training. (c) Training accuracy evolution for ResNet.

FIG. 2. Numerical results on ResNet as a function of step. Each step corresponds to one step of stochastic gradient descent based on the derivatives of the loss computed from 2048 randomly selected training samples. (a) Hessian spectra of the ResNet during the first 200 steps of training, shown every 10 steps. The neural network is pruned with ratio 0.9 every 100 steps, which corresponds to the sudden broadening of the Hessian spectra. (b) Hessian spectra shown every 100 steps, where spectral broadening takes place due to pruning. (c) Estimate of the error proxy. The first 10 steps after pruning are trained classically, hence incurring no error. (d) Neural network accuracy during training.

FIG. 3. Hessian spectrum of the pruned 103-million-parameter model immediately after pruning, without any additional training.

FIG. 4. Numerical evidence for the effect of the Hessian on the approximation error. (a) Error scaling for different QCM truncation orders for a Hessian with all positive eigenvalues. (b) Error scaling for different fractions of positive Hessian eigenvalues.

FIG. 5. We visualize this by depicting a blue stick for each identity matrix and a yellow ball to symbolize each $F_2$ factor. For a correction term of $j$th order, there are $N-j$ blue sticks and $j$ yellow balls.

E. Pruning QCM by pruning machine learning models