Quantum imaginary time evolution steered by reinforcement learning

Quantum imaginary time evolution is a powerful algorithm for preparing ground and thermal states on near-term quantum devices. However, algorithmic errors induced by Trotterization and the local approximation severely hinder its performance. Here we propose a deep reinforcement learning-based method to steer the evolution and mitigate these errors. In our scheme, a well-trained agent finds a subtle evolution path along which most algorithmic errors cancel out, enhancing the fidelity significantly. We verified the method's validity with the transverse-field Ising model and the Sherrington-Kirkpatrick model. Numerical calculations and experiments on a nuclear magnetic resonance quantum computer illustrate its efficacy. The philosophy of our method, eliminating errors with errors, sheds light on error reduction on near-term quantum devices.

Quantum imaginary time evolution, a common technique in theoretical studies for preparing ground states of quantum systems, comes with the uneasy requirement of implementing non-unitary time evolution in the lab; and while a recent solution has been proposed, it carries leftover errors. The present work applies reinforcement learning to mitigate such errors in a physics-informed way, demonstrating the efficiency of AI-enhanced algorithms on a quantum computer.

Quantum imaginary time evolution (QITE) is a promising near-term algorithm for finding the ground state of a given Hamiltonian. It has also been applied to prepare thermal states, simulate open quantum systems, and calculate finite-temperature properties [17-19]. A pure quantum state is said to be k-UGS if it is the unique ground state of a k-local Hamiltonian Ĥ = Σ_{j=1}^m ĥ[j], where each local term ĥ[j] acts on at most k neighboring qubits. Any k-UGS state is uniquely determined by its k-local reduced density matrices among pure states (in which case it is called k-UDP) or even among all states (k-UDA) [20, 21]. The QITE algorithm is well suited for preparing k-UGS states with a relatively small k. We start from an initial state |Ψ_init⟩ that is not orthogonal to the ground state of the target Hamiltonian.
The final state after a long imaginary time evolution,

lim_{β→∞} e^{−βĤ} |Ψ_init⟩, (1)

has very high fidelity with the k-UGS state (after normalization). If the ground state of Ĥ is degenerate, the final state still falls into the ground-state space. The Trotter-Suzuki decomposition can simulate the evolution,

e^{−βĤ} = (e^{−Δτ ĥ[1]} e^{−Δτ ĥ[2]} ⋯ e^{−Δτ ĥ[m]})^n + O(Δτ²), (2)

where Δτ is the step interval and n = β/Δτ is the number of Trotter steps. The Trotter error subsumes the terms of order Δτ² and higher. On NISQ devices, the Trotter error is difficult to reduce because of circuit-depth limits, so the Trotterized simulation cannot be implemented accurately [22].
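As a small self-contained illustration of the Trotter error in Eq. (2), the following Python sketch compares the exact imaginary-time propagator with its first-order Trotterization for a single-qubit toy Hamiltonian H = X + Z. This example is ours, chosen only because every exponential has a closed form; it is not the paper's model.

```python
import math

def matmul(a, b):
    # 2x2 matrix product
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def exp_minus_tau_X(tau):
    # X^2 = I, so exp(-tau X) = cosh(tau) I - sinh(tau) X
    c, s = math.cosh(tau), math.sinh(tau)
    return [[c, -s], [-s, c]]

def exp_minus_tau_Z(tau):
    # Z is diagonal: exp(-tau Z) = diag(e^{-tau}, e^{tau})
    return [[math.exp(-tau), 0.0], [0.0, math.exp(tau)]]

def exp_minus_beta_H(beta):
    # H = X + Z satisfies H^2 = 2I, so
    # exp(-beta H) = cosh(sqrt(2) beta) I - sinh(sqrt(2) beta) H / sqrt(2)
    r = math.sqrt(2.0)
    c, s = math.cosh(r * beta), math.sinh(r * beta) / r
    return [[c - s, -s], [-s, c + s]]

def trotter_error(beta, n):
    # Frobenius distance between n first-order Trotter steps and the
    # exact propagator, at fixed total imaginary time beta
    tau = beta / n
    step = matmul(exp_minus_tau_X(tau), exp_minus_tau_Z(tau))
    prod = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        prod = matmul(prod, step)
    exact = exp_minus_beta_H(beta)
    return math.sqrt(sum((prod[i][j] - exact[i][j]) ** 2
                         for i in range(2) for j in range(2)))

# Halving the step interval reduces the total error, consistent with the
# first-order bound at fixed beta.
print(trotter_error(1.0, 4), trotter_error(1.0, 8))
```

Since [X, Z] ≠ 0, the error is nonzero for any finite n, which is exactly the residual that deeper circuits (larger n) would suppress.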
Since we can only implement unitary operations on a quantum computer, the main idea of the QITE is to replace each non-unitary step e^{−Δτ ĥ[j]} by a unitary evolution e^{−iΔτ Â[j]} such that

e^{−Δτ ĥ[j]} |Ψ⟩ / ‖e^{−Δτ ĥ[j]} |Ψ⟩‖ ≈ e^{−iΔτ Â[j]} |Ψ⟩, (3)

where |Ψ⟩ is the state before this step. The Hermitian operator Â[j] is reconstructed from measurements on a domain of D qubits around the support of ĥ[j]; truncating this domain introduces the local approximation (LA) error. The Trotter error and the LA error are two daunting challenges in the QITE. These algorithmic errors accumulate as the number of steps n grows, which severely weakens the practicality of the QITE. On NISQ computers, a circuit with too many noisy gates is unreliable, and the final measurements give no helpful information. Therefore we cannot simply use a small step interval Δτ to reduce the Trotter error, since this would increase the circuit depth and noise would dominate the final state. The number of Trotter steps is a trade-off between quantum noise and Trotter error. For the QITE on large systems, we need more Trotter steps and larger domain sizes, which seems hopeless on current devices. Some techniques exist to alleviate the problem: Refs. [23-25] presented variants of the QITE algorithm with shallower circuits, and Refs. [19, 26] used Hamiltonian symmetries, error mitigation, and randomized compiling to reduce the required quantum resources and improve the fidelity.
In this work, we propose a deep reinforcement learning (RL)-based method to steer the QITE and mitigate algorithmic errors. In our method, we regard the ordering of local terms in the QITE as the environment and train an intelligent agent to take actions (i.e., exchange adjacent terms) that minimize the final-state energy. RL is well suited for this task since the state and action spaces can be quite large. We verified the validity of our method with the transverse-field Ising model and the Sherrington-Kirkpatrick model. The RL agent can mitigate most algorithmic errors and decrease the final-state energy. Our work pushes the QITE algorithm toward more practical applications in the NISQ era.

II. METHODS
The RL process is essentially a finite Markov decision process [47]. This process is described as a state-action-reward sequence: a state s_t at time t transitions into a new state s_{t+1}, together with a scalar reward R_{t+1} at time t+1, under the action a_t with transition probability p(s_{t+1}, R_{t+1} | s_t, a_t). In a finite Markov decision process, the state set, the action set, and the reward set are finite. The total discounted return at time t is

G_t = Σ_{k=0}^∞ γ^k R_{t+k+1},

where γ is the discount rate, 0 ≤ γ ≤ 1.
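The discounted return obeys the backward recursion G_t = R_{t+1} + γ G_{t+1}, which gives a one-pass way to evaluate it. A minimal Python sketch (the function name is ours):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...,
    evaluated backwards so each prefix reuses the suffix sum."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With γ = 1 the return is the plain sum of rewards; γ < 1 discounts rewards that arrive later.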
The goal of RL is to maximize the total discounted return for each state and action selected by the policy π, which is specified by a conditional probability of action a for each state s, denoted as π(a|s).
In this work, we use distributed proximal policy optimization (DPPO) [48, 49], a model-free reinforcement learning algorithm with an actor-critic architecture. The agent has several distributed evaluators, and each evaluator consists of two components: an actor network that computes a policy π, according to which the actions are probabilistically chosen, and a critic network that computes the state value V(s), an estimate of the total discounted return from state s when following policy π. Using multiple evaluators breaks unwanted correlations between data samples and makes the training process more stable. The RL agent updates the neural network weights synchronously. For more details of DPPO, see Appendix B.
The objective of the agent is to maximize the cumulative reward under a parameterized policy π_θ. In our task, the environment state is the ordering of local terms in each Trotter step, so the state-space size is (m!)^n. The agent observes the full state at the learning stage, i.e., we deal with a fully observed Markov process. We define the action set by whether or not to exchange two adjacent operations in Eq. (2). Note that even if two local terms ĥ[j] and ĥ[j+1] commute, their unitary approximations e^{−iΔτ Â[j]} and e^{−iΔτ Â[j+1]} are state-dependent and may not commute, so the ordering of commuting local terms still matters. For a local Hamiltonian with m terms, there are m−1 actions for each Trotter step. Any permutation of m elements can be decomposed into a product of O(m²) adjacent transpositions; therefore our action set is universal, and a well-trained agent can rapidly steer the ordering to the optimum. The agent takes actions in sequence, and the size of the action space is 2^{n(m−1)}. A deep neural network with n(m−1) output neurons determines the action probabilities. We iteratively train the agent from measurement results on the output state |Ψ_f⟩, and the agent updates the path to maximize its total reward. Fig. 1 shows the diagram of our method.
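The universality claim (any ordering of the m local terms is reachable through adjacent swaps) can be checked with a bubble-sort-style decomposition, which never needs more than m(m−1)/2 = O(m²) transpositions. The following sketch uses illustrative names:

```python
def adjacent_swap_decomposition(order):
    """Bubble-sort an ordering of local terms into 0..m-1, recording each
    adjacent swap (i, i+1).  The number of swaps equals the number of
    inversions, at most m(m-1)/2."""
    order = list(order)
    swaps = []
    changed = True
    while changed:
        changed = False
        for i in range(len(order) - 1):
            if order[i] > order[i + 1]:
                order[i], order[i + 1] = order[i + 1], order[i]
                swaps.append(i)
                changed = True
    return swaps

# The fully reversed ordering needs the worst-case m(m-1)/2 = 6 swaps.
print(len(adjacent_swap_decomposition([3, 2, 1, 0])))  # 6
```

Reading the recorded swaps in reverse maps the sorted ordering back to the original one, so the same moves connect any two orderings.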
The reward the agent receives in each step is built from a modified reward function R, delivered after a time delay of N_d steps. In particular, R compares the output energy E given by our RL-steered path with the energy E_std given by the standard repetition path without optimization. To avoid divergence of the reward, we use a clip function to restrict the value of (E/E_std − 1) to the range (0.01, 1.99).
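A minimal sketch of the clipping step, assuming the modified reward is built directly from the clipped quantity (E/E_std − 1) described above; the helper names are ours, and the full form of R is defined in the paper's equations:

```python
def clip(x, lo=0.01, hi=1.99):
    return max(lo, min(hi, x))

def clipped_energy_term(E, E_std):
    # Both energies are negative near the ground state, so a lower
    # (better) RL energy E pushes E / E_std above 1; the clip bounds
    # the term to (0.01, 1.99) to avoid reward divergence.
    return clip(E / E_std - 1.0)
```

For example, an RL path three times lower in energy than the standard path saturates the upper clip, while a path matching the standard energy sits at the lower clip.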

A. Transverse-field Ising model
We first consider the one-dimensional transverse-field Ising model. With no assumption about the underlying structure, we initialize the system in a product state |Ψ_init⟩. The Hamiltonian can be written as

Ĥ_TFI = −J Σ_{j=1}^{N−1} Ẑ_j Ẑ_{j+1} − h Σ_{j=1}^N X̂_j.

In the following, we choose J = h = 1, so the system is in the gapless phase. For finite-size systems, the ground state of Ĥ_TFI is 2-UGS, and therefore 2-UDP and 2-UDA.
In the standard QITE, the ordering of local terms is the same in each Trotter step; e.g., we put commuting terms next to each other and repeat the ordering X̂_1, …, X̂_N, Ẑ_1Ẑ_2, …, Ẑ_{N−1}Ẑ_N. The quantum circuit of the standard QITE with 4 qubits is shown in Fig. 2(a). Inspired by the randomization technique for speeding up quantum simulation [50, 51], we also consider a randomized QITE scheme in which we randomly shuffle the ordering in each Trotter step. There is no large quality difference between randomizations, and we pick a moderate one. In the RL-steered QITE, the reward is based on the expectation value of the output state |Ψ_f⟩ on the target Hamiltonian, E = ⟨Ψ_f| Ĥ_TFI |Ψ_f⟩. The lower the energy, the higher the reward. The RL agent updates the orderings step by step.
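The standard and randomized orderings can be sketched as follows, with hypothetical string labels standing in for the 2N − 1 local terms of a single Trotter step:

```python
import random

def standard_ordering(N):
    # commuting X terms first, then the ZZ terms; repeated identically
    # in every Trotter step of the standard QITE
    return [f"X{j}" for j in range(1, N + 1)] + \
           [f"Z{j}Z{j + 1}" for j in range(1, N)]

def randomized_ordering(N, rng):
    # randomized QITE: shuffle the same 2N - 1 local terms independently
    # in each Trotter step
    order = standard_ordering(N)
    rng.shuffle(order)
    return order

print(standard_ordering(4))
```

The RL-steered scheme replaces the shuffle with learned adjacent swaps, so the set of terms per step is always the same and only their order changes.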
For any given β, the RL agent can steer the QITE path and maximize the reward. We fix the system size N = 4, the number of Trotter steps n = 4, and the domain size D = 2. A numerical comparison of the energy and fidelity obtained by the standard, randomized, and RL-steered QITE schemes for different β values is shown in Fig. 3(a)(b); the RL-steered path here is β-dependent. Throughout this paper we use the fidelity defined by F(ρ, σ) = Tr[(σ^{1/2} ρ σ^{1/2})^{1/2}]. When β is small, RL cannot markedly decrease the energy since the total quantum resource is limited. As β increases, the imaginary-time-evolution target state |Ψ̃_f⟩ = e^{−βĤ_TFI} |Ψ_init⟩ (after normalization) approaches the ground state, so the energies obtained by all paths decrease in the beginning. However, algorithmic errors also increase with β, and this factor becomes dominant after a critical point: the energy of the standard/randomized QITE increases when β > 1/3. Accordingly, the fidelity first increases, then decreases. The RL-steered QITE outperforms the standard/randomized QITE for all β values, as most algorithmic errors along its path cancel out. The fidelity between the output state and the ground state grows steadily to F > 0.996. The gap between the ground-state energy and the minimum achievable energy of the standard QITE is 0.053, whereas that of the RL-steered QITE is only 0.016. For a detailed optimized path, see Appendix C.
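When the reference state σ is pure, as for the ground state here, the fidelity F(ρ, σ) = Tr[(σ^{1/2} ρ σ^{1/2})^{1/2}] reduces to √⟨ψ|ρ|ψ⟩, which avoids matrix square roots entirely. A sketch of that special case (function name ours; states are plain coefficient lists):

```python
import math

def fidelity_with_pure_state(rho, psi):
    """F(rho, |psi><psi|) = sqrt(<psi| rho |psi>) for a pure reference
    state; rho is a d x d density matrix, psi a length-d state vector."""
    d = len(psi)
    overlap = sum(psi[i].conjugate() * rho[i][j] * psi[j]
                  for i in range(d) for j in range(d))
    # overlap is real and nonnegative for a valid density matrix
    return math.sqrt(max(overlap.real, 0.0))
```

For example, a qubit state ρ = diag(0.9, 0.1) has fidelity √0.9 with |0⟩ and √0.1 with |1⟩. The general mixed-mixed case needs a matrix square root and is best left to a linear algebra library.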
Further, we implement the same unitary evolutions on a 4-qubit liquid-state nuclear magnetic resonance (NMR) quantum processor [52]. We carry out the experiments on a 300-MHz Bruker AVANCE III spectrometer. The processor is ¹³C-labelled trans-crotonic acid in a 7-T magnetic field. In the rotating frame, the internal Hamiltonian of the four ¹³C spins can be written as

Ĥ_S = Σ_j π ν_j Ẑ_j + Σ_{i<j} (π/2) J_ij Ẑ_i Ẑ_j,

where ν_j is the chemical shift of the j-th nucleus, J_ij is the J-coupling between the i-th and j-th nuclei, and Ẑ_j is the Pauli matrix σ_z acting on the j-th nucleus. All the parameters can be found in Fig. 2(b). The quantum operations are realized by irradiating radio-frequency pulses on the system. We optimize the pulses against fluctuations of the chemical shifts of the nuclei with the technique of gradient ascent pulse engineering [53]. The experiment is divided into three steps: (i) preparing the system in a pseudo-pure state using the temporal averaging technique [54]; (ii) applying the quantum operations; (iii) performing measurements [55].
Denote the NMR output state as ρ, whose density matrix can be obtained through quantum state tomography. ρ is a highly mixed state since quantum noise is inevitable. We use the virtual distillation technique to suppress the noise [56, 57]: the dominant eigenvector of ρ, lim_{M→∞} ρ^M / Tr(ρ^M), can be extracted numerically. Its expectation value on Ĥ_TFI and its fidelity with the ground state are shown in Fig. 3(a)(b) with unfilled markers. Consistent with our numerical results, the RL-steered path significantly outperforms the other two for large β.
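Numerically, the dominant eigenvector lim_{M→∞} ρ^M / Tr(ρ^M) can be extracted with plain power iteration. A minimal sketch assuming ρ is Hermitian with a nondegenerate largest eigenvalue and the starting vector overlaps it (names ours):

```python
import math

def dominant_eigenvector(rho, iters=500):
    """Power iteration: rho^M v / ||rho^M v|| converges to the dominant
    eigenvector of a (Hermitian, positive semidefinite) density matrix."""
    d = len(rho)
    v = [1.0 / math.sqrt(d)] * d  # uniform starting vector
    for _ in range(iters):
        w = [sum(rho[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(abs(x) ** 2 for x in w))
        v = [x / norm for x in w]
    return v
```

Each iteration amplifies the largest eigenvalue relative to the rest, so the error shrinks geometrically with the ratio of the two largest eigenvalues.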
In our simulation, we have four orderings to optimize and 28 local unitary operations to implement. Denote |Ψ_k⟩ as the state after the k-th operation and |Ψ̃_k⟩ as the instantaneous target state of the ideal imaginary time evolution. The instantaneous algorithmic error during the evolution can be characterized by the squared Euclidean distance between the two,

ε_alg = ‖ |Ψ_k⟩ − |Ψ̃_k⟩ ‖².

For β = 0.9, Fig. 3(c) shows ε_alg as a function of the evolution step k. Although ε_alg fluctuates in all paths, it accumulates markedly in the standard QITE and eventually climbs to ε_alg = 0.082. The randomized QITE performs slightly better and ends with ε_alg = 0.030. The RL-steered QITE is optimal: ε_alg shows no accumulating trend and drops to ε_alg = 0.007 in the end. Although we cannot directly estimate ε_alg in experiments, we can minimize it by maximizing the reward function.
A natural question is whether we can enhance the QITE algorithm with RL for larger systems. We now apply our approach to the transverse-field Ising model with system sizes N = 2, 3, …, 8 to demonstrate the scalability. As before, we consider the QITE with 4 Trotter steps. Denote the target state with "evolution time" β as |Ψ_f(β)⟩ ∝ e^{−βĤ_TFI} |Ψ_init⟩. In the following, we use an adaptive β for different N such that the expectation value ⟨Ψ_f(β)| Ĥ_TFI |Ψ_f(β)⟩ is always higher than the ground-state energy of Ĥ_TFI by 1×10⁻³. The results are illustrated in Fig. 4: the RL agent can efficiently decrease the energy and increase the fidelity between the final state and the ground state for all system sizes. The ratio of the RL-steered energy (E_RL) to the standard QITE energy (E_std) is also given; this ratio increases steadily with the number of qubits. Note that the neural networks we use here contain only four hidden layers, and the hyperparameters were tuned for the N = 4 case. We apply the same neural networks to larger N, and the required number of training epochs does not increase noticeably. To increase the fidelity further for N > 4, we can use more Trotter steps and tune the hyperparameters accordingly; there is little doubt, however, that the training process will then be more time-consuming.

B. Sherrington-Kirkpatrick model
The second model we apply our method to is the Sherrington-Kirkpatrick (SK) model [58], a spin-glass model with long-range frustrated ferromagnetic and antiferromagnetic couplings. Finding a ground state of the SK model is NP-hard [59]. On NISQ devices, solving the SK model can be regarded as a special Max-Cut problem and handled by the quantum approximate optimization algorithm [5, 60]. Here we use the QITE to prepare the ground state of the SK model. Compared with the quantum approximate optimization algorithm, the QITE does not need to sample a bunch of initial points or perform classical optimization with exponentially small gradients [61].
Consider the six-vertex complete graph shown in Fig. 2(c). The SK model Hamiltonian can be written as

Ĥ_SK = Σ_{i<j} J_ij Ẑ_i Ẑ_j,

where we independently sample the couplings from a uniform distribution, J_ij ∼ U(−1, 1). Since the Ẑ Ẑ terms commute, there is no Trotter error in Eq. (2) for the SK model. The ground state of Ĥ_SK is two-fold degenerate, and the QITE algorithm can prepare one of the ground states. We fix β = 5, n = 6, and D = 2, sample the J_ij, and train the agent to steer the QITE path. Define P_gs as the probability of finding a ground state of Ĥ_SK through measurements. The energy and P_gs as functions of β are shown in Fig. 5. Remember that the RL-steered path here was only optimized for a specific β value (namely β = 5), since we want to verify the dependence of the ordering on β.
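For a six-qubit instance, the classical side of the problem (sampling the couplings J_ij and brute-forcing the degenerate ground states of the diagonal Hamiltonian) can be sketched as follows; the function names and random seed are ours:

```python
import itertools
import random

def sk_couplings(n, rng):
    # J_ij ~ U(-1, 1), sampled independently for each edge of the
    # complete graph K_n
    return {(i, j): rng.uniform(-1.0, 1.0)
            for i in range(n) for j in range(i + 1, n)}

def sk_energy(bits, couplings):
    # energy of a computational-basis state: Z_i |b_i> = (-1)^{b_i} |b_i>
    z = [1 - 2 * b for b in bits]
    return sum(J * z[i] * z[j] for (i, j), J in couplings.items())

def ground_states(couplings, n):
    # brute force over all 2^n bitstrings (fine for n = 6)
    energies = {bits: sk_energy(bits, couplings)
                for bits in itertools.product((0, 1), repeat=n)}
    e_min = min(energies.values())
    return e_min, [b for b, e in energies.items() if abs(e - e_min) < 1e-12]

e_min, gs = ground_states(sk_couplings(6, random.Random(42)), 6)
print(len(gs))  # at least two: flipping every spin leaves all ZZ terms invariant
```

The global spin-flip symmetry pairs up the ground states, which is exactly the two-fold degeneracy mentioned above.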
For each path, we observe a sudden switch from a high probability of success to a low one when 4 < β < 5. The reason is that, for some states, when the step interval exceeds a critical value there is an explosion of LA error that utterly ruins the QITE. Beyond the critical value, the QITE algorithm loses its stability and the energy E fluctuates violently. The randomized QITE performs even worse than the standard one, with a highest success probability of only 0.60. In the RL-steered QITE, P_gs falls into a deep "gorge" but soon recovers. When β = 5, most algorithmic errors disappear and P_gs = 0.9964. In comparison, the standard QITE ends with P_gs = 0.0002 and the randomized QITE with P_gs = 0.0568; both drop sharply and cannot be recovered even if we further increase β. For the detailed couplings and QITE path, see Appendix C.

IV. CONCLUSIONS
We have proposed an RL-based framework to steer the QITE algorithm for preparing a k-UGS state. The RL agent can find the subtle evolution path that avoids error accumulation. We compared our method with the standard and the randomized QITE; numerical and experimental results demonstrate a clear improvement. The RL-designed path requires a smaller domain size and fewer Trotter steps to achieve satisfactory performance for both the transverse-field Ising model and the SK model. We also noticed that randomization cannot enhance the QITE consistently, although it plays a positive role in quantum simulation. For the SK model, as the total imaginary time β increases, there is a switch from a high success probability to almost zero. The accumulated error may ruin the QITE algorithm all of a sudden rather than gradually, which indicates the importance of an appropriate β for high-dimensional systems. The RL-based method is a winning combination of machine learning and quantum computing. Even though we investigated only relatively small systems, the scheme can be directly extended to larger ones: the number of neurons in the output layer of the RL agent grows only linearly with the system size N. A relevant problem worth considering is how to apply the QITE (or the RL-steered QITE) to prepare a quantum state that is k-UDP or k-UDA but not k-UGS.
RL has a bright prospect in the NISQ era. In the future, one may similarly use RL to enhance Trotterized or variational quantum simulation [62-65], but the reward-function design will be more challenging. Near-term quantum computing and classical machine learning methods may benefit each other in many ways, and their interplay is worth studying further.
The actor builds a network to evaluate the policy π_θ and takes actions according to it. The critic builds a network to evaluate the state-value function v_φ(s). The loss function of DPPO reads

L(θ) = Ê_{π_{θ_old}} [ min( r_{θ_old}(a|s, θ) Â, clip(r_{θ_old}(a|s, θ), 1−ε, 1+ε) Â ) ], (B1)

where ε is the clipping parameter. The expectation Ê_{π_{θ_old}} means we take the empirical average over a finite batch of samples under the policy π_{θ_old}. Â is the advantage function, which measures how much better the action a is than average in state s under the policy π. The term r_{θ_old}(a|s, θ) is defined as the ratio of likelihoods,

r_{θ_old}(a|s, θ) = π_θ(a|s) / π_{θ_old}(a|s).

The clip function with c ≤ d is defined by clip(x, c, d) = min(max(x, c), d), so clip(r_{θ_old}(a|s, θ), 1−ε, 1+ε) penalizes large changes between successive updates, which corresponds to the trust region of the first-order policy gradient. The loss function poses a lower bound on the improvement induced by an update and hence establishes a trust region around π_{θ_old}. The hyperparameter ε controls the maximal improvement and thus the size of the trust region.
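The per-sample clipped surrogate of Eq. (B1) can be sketched in a few lines; the signs follow the maximized objective, and the function names are ours:

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A).
    Taking the min makes the objective a pessimistic bound, so moving the
    ratio outside the trust region is only ever penalized, never rewarded."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# A > 0: the benefit of raising the likelihood ratio saturates at 1 + eps
print(ppo_clip_term(1.5, 1.0))   # 1.2
# A < 0: a ratio far outside the trust region is penalized in full
print(ppo_clip_term(1.5, -1.0))  # -1.5
```

In practice this quantity is averaged over a batch and maximized (or its negative minimized) by gradient ascent on θ.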
In our computation, the RL agent uses one deep neural network to approximate both the state value and the policy. The neural network consists of four hidden layers. All layers have ReLU activation functions except the output layer. Appendix Table 1 summarizes the hyperparameters of the neural network.
DPPO consists of multiple independent agents (evaluators) with their own weights, which interact with different copies of the environment in parallel. The data collected by the evaluators are assumed to be independent and identically distributed. Moreover, the evaluators can explore a larger part of the state-action space in much less time.
We modify the original DPPO algorithm for better performance. For the critic, we normalize the advantage so that the value estimation is stable, and the clip function is used in both the policy loss and the value loss. For the actor, instead of using the traditional ε-greedy method [47] to sample the policy, we use the entropy regularization method, which was first introduced in the soft actor-critic algorithm [66].
For the SK model, we found that the control landscape undergoes a phase-transition-like behavior, as reported in quantum state preparation paradigms [33, 68-70]. To understand this transition better, we compare the policies trained with different β values; the Hamming-distance matrix is shown in Fig. 6. For small β values, the energy landscape is almost convex, and we can easily find the unique optimal protocol. As β increases, the barriers between local minima get higher. When β > 4, the control landscape has many local minima separated by extensive walls.

The numerical codes were written with Python 3.8 and PyTorch 1.8 [67]. The numerical calculations were run on two 14-core 2.60-GHz CPUs with 188 GB of memory and four GPUs. We plot the RL learning curves in Fig. 7; the reward converges after about 6000 iterations.
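Viewing each trained protocol as a sequence of binary swap/no-swap actions, the Hamming-distance matrix can be computed as follows; the example protocols are illustrative, not trained ones:

```python
def hamming_distance(protocol_a, protocol_b):
    # number of positions where two action sequences differ
    assert len(protocol_a) == len(protocol_b)
    return sum(a != b for a, b in zip(protocol_a, protocol_b))

def hamming_matrix(protocols):
    # symmetric matrix of pairwise distances, zero on the diagonal
    return [[hamming_distance(p, q) for q in protocols] for p in protocols]

M = hamming_matrix([(0, 1, 1, 0), (0, 0, 1, 1), (1, 0, 0, 1)])
print(M)
```

Clusters of small distances indicate policies trapped in the same local minimum, while large blocks of distances signal the separated minima that appear for large β.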

Figure 2. Theoretical and experimental setup. (a) Quantum circuit of the quantum imaginary time evolution (QITE) for the transverse-field Ising model with 4 qubits and n Trotter steps. U_P^j represents the unitary operator that approximates e^{−Δτ P} for the local Hamiltonian term P in the j-th Trotter step. (b) Molecular structure and nuclear parameters of the nuclear magnetic resonance processor. The molecule has 4 carbon atoms: C1, C2, C3, and C4. Diagonal entries of the table are the chemical shifts in Hz; off-diagonal entries are the J-couplings between the two corresponding nuclei. The T2 row gives the relaxation time of each nucleus. (c) A six-vertex complete graph with weighted edges. Different shades of grey represent different couplings in the Sherrington-Kirkpatrick model.

Figure 3. Different QITE schemes for the transverse-field Ising model. Filled markers represent the numerical data; unfilled markers represent the experimental data. (a) Energy versus β; the black dashed line represents the ground-state energy. (b) Fidelity versus β. (c) Algorithmic errors during the evolution.

Figure 4. Scaling of different QITE schemes. Blue diamonds represent the standard quantum imaginary time evolution (QITE); green triangles represent the randomized QITE; red circles represent the reinforcement learning (RL)-steered QITE. (a) Energy versus system size, and the energy ratio E_RL/E_std versus system size (inset). The black dash-dotted line represents the ground-state energy. (b) Fidelity versus system size.

Figure 5. Different QITE schemes for the Sherrington-Kirkpatrick model. Blue dotted lines represent the standard quantum imaginary time evolution (QITE); green dashed lines represent the randomized QITE; red lines represent the reinforcement learning (RL)-steered QITE. (a) Energy versus the imaginary time β; the black dash-dotted line represents the ground-state energy. (b) Success probability versus the imaginary time β.

Figure 6. The Hamming-distance matrix between the locally optimal protocols found by the RL agent.