When does reinforcement learning stand out in quantum control? A comparative study on state preparation

Reinforcement learning has been widely used in many problems, including quantum control of qubits. However, such problems can, at the same time, be solved by traditional, non-machine-learning methods, such as stochastic gradient descent and Krotov algorithms, and it remains unclear which one is most suitable when the control has specific constraints. In this work, we perform a comparative study on the efficacy of three reinforcement learning algorithms: tabular Q-learning, deep Q-learning, and policy gradient, as well as two non-machine-learning methods: stochastic gradient descent and Krotov algorithms, in the problem of preparing a desired quantum state. We found that overall, the deep Q-learning and policy gradient algorithms outperform others when the problem is discretized, e.g. allowing discrete values of control, and when the problem scales up. The reinforcement learning algorithms can also adaptively reduce the complexity of the control sequences, shortening the operation time and improving the fidelity. Our comparison provides insights into the suitability of reinforcement learning in quantum control problems.


I. INTRODUCTION
Reinforcement learning, a branch of machine learning in artificial intelligence, has proven to be a powerful tool for solving a wide range of complex problems, such as the games of Go [1] and Atari [2]. It has also been applied with great success to a variety of problems in quantum physics, including quantum state preparation [3][4][5][6][7][8], state transfer [9], quantum gate design [10], and error correction [11]. In many cases, it outperforms commonly used conventional algorithms, such as the Krotov and stochastic gradient descent (SGD) algorithms [9,10]. In a reinforcement learning algorithm, an optimization problem is converted to a set of policies that governs the behavior of a computer agent, i.e. its choices of actions and, consequently, the reward it receives. By simulating sequences of actions taken by the agent to maximize the reward, one finds an optimal solution to the desired problem [25].
The development of techniques that efficiently optimize control protocols is key to quantum physics. While some problems can be solved analytically using methods such as reverse engineering [26], in most cases numerical solutions are required. Various numerical methods have therefore been put forward, such as gradient-based methods (including SGD [17], GRAPE [27] and variants [28]), the Krotov method [29], the Nelder-Mead method [30] and convex programming [31]. Recently, there has been a surge of attempts to apply reinforcement learning and other machine-learning-based algorithms [32][33][34][35][36][37] to a wide range of physics problems. In particular, the introduction of reinforcement learning to quantum control has revealed interesting new physics [3][4][5][6][7][8], and these techniques have therefore received increasing attention. A fundamental question then arises: under what circumstances is reinforcement learning the most suitable method?
In this paper, we consider problems related to quantum control of a qubit. The goal of these problems is typically to steer the qubit toward a target state under certain constraints. The mismatch between the final qubit state and the target state naturally serves as the cost function used in the SGD or Krotov methods, and the negative cost function can serve as the reward function in the reinforcement learning procedure. Our question then becomes: under different scenarios of constraints, which algorithm works best? In this work, we compare the efficacy of two commonly used traditional methods, SGD and the Krotov method, and three algorithms based on reinforcement learning, tabular Q-learning (TQL) [25], deep Q-learning (DQL) [2], and policy gradient (PG) [38], under situations with different types of control constraints. In Ref. [5], the Q-learning techniques (TQL and DQL) were applied to the problem of quantum state preparation, revealing different stages of quantum control. The problem of preparing a desired quantum state from a given initial state is on the one hand simple enough to be investigated in full detail, and on the other hand contains sufficient physics to allow for various types of control constraints. We therefore take quantum state preparation as the platform on which our comparison of different algorithms is based. While a detailed description of quantum state preparation is provided in Results, we briefly introduce here the five algorithms compared in this work. (Detailed implementations are provided in the Methods and Supplementary Method 1.)
SGD is one of the simplest gradient-based optimization algorithms. In each iteration, a direction in the parameter space is randomly chosen, along which the control field is updated using the gradient of the cost function, defined as the mismatch between the evolved state and the target state. Ideally, the gradient is zero when the calculation has converged to the optimal solution. The Krotov algorithm has a different strategy: the initial state is first propagated forward to obtain the evolved state. The evolved state is then projected onto the target state, defining a co-state that encapsulates the mismatch between the two. The co-state is then propagated backward to the initial state, during which process the control fields are updated. When the calculation has converged, the co-state is identical to the target state.
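The SGD update described above can be illustrated with a minimal sketch. It is one possible reading of the description, not the implementation used in this work: the random direction is taken to be a single randomly chosen time step, the gradient is estimated by a central finite difference, and the Hamiltonian form `J*sigma_z + h*sigma_x`, the learning rate `lr` and the step `eps` are all illustrative assumptions.

```python
import numpy as np

SX = np.array([[0, 1], [1, 0]], dtype=complex)
SZ = np.array([[1, 0], [0, -1]], dtype=complex)


def infidelity(controls, total_time=2 * np.pi, h=1.0):
    """Cost 1 - F for |0> -> |1>, under the ASSUMED form H = J*sz + h*sx."""
    dt = total_time / len(controls)
    psi = np.array([1, 0], dtype=complex)
    for J in controls:
        w = np.hypot(J, h)  # H^2 = w^2 * identity for this 2x2 form
        U = np.cos(w * dt) * np.eye(2) - 1j * np.sin(w * dt) * (J * SZ + h * SX) / w
        psi = U @ psi
    return 1.0 - abs(psi[1]) ** 2


def random_direction_descent(cost, x0, n_iter=500, lr=0.05, eps=1e-6, seed=0):
    """Update one randomly chosen control per iteration along its gradient."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    best_x, best_c = x.copy(), cost(x)
    for _ in range(n_iter):
        k = rng.integers(len(x))  # random coordinate direction
        e = np.zeros_like(x)
        e[k] = eps
        grad_k = (cost(x + e) - cost(x - e)) / (2 * eps)  # finite difference
        x[k] -= lr * grad_k
        c = cost(x)
        if c < best_c:  # keep the best control sequence seen so far
            best_x, best_c = x.copy(), c
    return best_x, best_c
```

Calling `random_direction_descent(infidelity, np.full(6, 0.5))` returns the best control sequence found and its cost; since the best value is tracked, the returned cost never exceeds the initial one.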
In Q-learning (including TQL and DQL), a computer agent evolves in an environment. All information required for optimization is encoded in the environment, which is allowed to be in a set of states S. In each step, the agent chooses an action from a set A, bringing the environment of the agent to another state. As a consequence, the agent acquires a reward, which encapsulates the desired optimization problem. Fig. 1a schematically shows how TQL works. At each state s ∈ S, the agent chooses actions a ∈ A according to the action-value function Q(s, a), defined as the estimated total reward starting from state s and action a, forming the so-called Q-table. Each time the agent takes an action, a reward r is generated according to the distance between the resulting state and the target, which updates the Q-table. An optimal solution is found by iterating this process sufficiently many times. We note that since a table has a finite number of entries, both the states and actions should be discretized. Fig. 1b shows DQL, in which the role of the Q-table is replaced by a neural network, called the Q-network. The agent then chooses its action according to the output of the Q-network, and the reward is used to update the network. In this case, although the allowed actions are typically discrete, the input state can actually be continuous.
Similar to TQL and DQL, PG also requires the sets of states S, actions A, and rewards r. The policy of the agent is represented by a neural network. With the state as the input, the network outputs the probability of choosing each action. After each episode, the policy network is updated in a direction that increases the total reward. Since the state is encoded as the input of the neural network, PG can also accommodate continuous input states.

II. RESULTS

A. Single-qubit case
We start with the preparation of a single-qubit state. Consider the time-dependent Hamiltonian H[J(t)] (Eq. 1), where σ_x and σ_z are Pauli matrices. The Hamiltonian may describe a singlet-triplet qubit [39] or a single spin with energy gap h under tunable control fields [40,41].
In these systems, it is difficult to vary h during gate operations, and we therefore assume that h is a constant in our work, which at the same time serves as our energy unit. Quantum control of the qubit is then achieved by altering J(t) dynamically. Quantum state preparation refers to the problem of finding J(t) such that a given initial state |ψ_0⟩ evolves, within time T, to a final state |ψ_f⟩ that is as close as possible to the target state |φ⟩. The quality of the state transfer is evaluated using the fidelity, defined as
\[ F = |\langle \psi_f | \phi \rangle|^2. \tag{2} \]
We typically use the averaged fidelity F over many runs of a given algorithm in our comparison (unless otherwise noted, we average 100 runs to obtain F), because the initial guesses of the control sequences are random, and the reinforcement learning procedure is probabilistic.
In this work, we take |ψ_0⟩ = |0⟩, |φ⟩ = |1⟩ and T = 2π unless otherwise specified. Under different situations, there are various kinds of preferences or restrictions on the control. We consider the following types of restrictions:
(i) We assume that control is performed with a sequence of piecewise-constant pulses, and for convenience we further assume that all pieces have equal duration. We therefore divide the total time T into N equal time steps of size dt = T/N, with N denoting the maximum number of pieces required by the control. J(t) is discretized accordingly, so that on the ith time step J(t) = J_i and the system evolves under H(J_i). Denoting the state at the end of the ith time step as |ψ_i⟩, the evolution at the ith step is |ψ_i⟩ = U_i|ψ_{i−1}⟩, where U_i = exp{−iH(J_i)dt}. In principle, the evolution time can be less than T, namely the evolution may conclude at the i_f-th time step with i_f ≤ N. (In our calculations, the evolution is terminated when the fidelity F ≥ 0.999.) Due to their nature, SGD and Krotov have to finish all time steps, i.e. i_f = N. However, as we shall see below, TQL and DQL frequently have i_f < N.
(ii) We also consider the case where the magnitude of the control field is bounded, i.e. J_i ∈ [J_min, J_max] for all i. The constraint is straightforwardly satisfied in TQL and DQL, since they only operate within the given set of actions and thus cannot exceed the bounds. For SGD and Krotov, updates to the control fields may exceed the bounds, in which case we enforce the bounds by setting J_i to J_max when the updated value is greater than J_max, and to J_min when it is smaller than J_min. When either bound is absent, we simply take J_min → −∞ or J_max → ∞.
(iii) The values of the control field may be discretized in the given range, i.e. J_i ∈ {J_min, J_min + dJ/M, J_min + 2dJ/M, ..., J_max}, where dJ = J_max − J_min, so that the control field can take M + 1 values including J_min and J_max. In practice this situation may arise, for example, when decomposing a quantum operation into a set of given gates [42][43][44]. TQL and DQL only select actions within the given set, so the constraint is satisfied. For SGD and Krotov, which keep updating the values of the control field during iterations, we enforce the constraint by setting the value of each control field to the nearest allowed value at the end of the execution.
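Restrictions (ii) and (iii), as they are enforced for SGD and Krotov, amount to clipping and rounding to the nearest allowed value. The sketch below illustrates both operations; the function names are hypothetical and chosen only for this example.

```python
import numpy as np


def allowed_values(j_min, j_max, M):
    """The M + 1 equally spaced control values, including both bounds."""
    return np.linspace(j_min, j_max, M + 1)


def clip_controls(controls, j_min, j_max):
    """Restriction (ii): values outside [j_min, j_max] are set to the bound."""
    return np.clip(controls, j_min, j_max)


def snap_to_grid(controls, j_min, j_max, M):
    """Restriction (iii): replace each control by the nearest allowed value."""
    grid = allowed_values(j_min, j_max, M)
    controls = np.asarray(controls, dtype=float)
    nearest = np.abs(controls[:, None] - grid[None, :]).argmin(axis=1)
    return grid[nearest]
```

For example, with J ∈ [0, 1] and M = 4 the allowed values are 0, 0.25, 0.5, 0.75 and 1, so a control of 0.4 is snapped to 0.5 and a control of 2.0 to 1.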
To sum up, the number of pieces in the control sequence N, the bounds of the control field J_min and J_max, and the number of discrete values of the control field M + 1 are the main factors characterizing the situations in which quantum states are prepared, and our comparison of different algorithms is based on them. We also define N_iter as the number of iterations performed in executing an algorithm, which is typically taken to be equal for different algorithms to ensure a fair comparison. Unless otherwise noted, N_iter = 500 in all results shown.
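The piecewise-constant evolution of restriction (i), together with the termination rule F ≥ 0.999, can be sketched as follows. The Hamiltonian form `J*sigma_z + h*sigma_x` used in `hamiltonian` is an illustrative assumption (the prefactors of Eq. 1 may differ), and all function names are hypothetical.

```python
import numpy as np

SIGMA_X = np.array([[0, 1], [1, 0]], dtype=complex)
SIGMA_Z = np.array([[1, 0], [0, -1]], dtype=complex)


def hamiltonian(J, h=1.0):
    """ASSUMED illustrative form; the prefactors in Eq. (1) may differ."""
    return J * SIGMA_Z + h * SIGMA_X


def step_unitary(J, dt, h=1.0):
    """U = exp(-i H(J) dt), computed by diagonalizing the Hermitian H."""
    energies, vecs = np.linalg.eigh(hamiltonian(J, h))
    return vecs @ np.diag(np.exp(-1j * energies * dt)) @ vecs.conj().T


def fidelity(psi, target):
    """F = |<target|psi>|^2 for normalized pure states."""
    return abs(np.vdot(target, psi)) ** 2


def evolve(controls, psi0, target, total_time, threshold=0.999, stop_early=True):
    """Apply the pulses in order; return (state, fidelity, steps used i_f)."""
    dt = total_time / len(controls)
    psi = np.asarray(psi0, dtype=complex)
    for i, J in enumerate(controls, start=1):
        psi = step_unitary(J, dt) @ psi
        if stop_early and fidelity(psi, target) >= threshold:
            return psi, fidelity(psi, target), i  # i_f < N is possible
    return psi, fidelity(psi, target), len(controls)
```

Setting `stop_early=False` mimics SGD and Krotov, which always complete all N time steps, whereas the default mimics the early termination available to the reinforcement learning agents.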
In Fig. 2 we study a situation where the maximum number of pieces in the control sequence, N, is given, and the results are shown as the averaged fidelities as functions of N. Here, the quality of an algorithm is assessed by the averaged fidelity F of the state it prepares (as compared to the target state), and not by the computational resources it consumes. For N ≲ 10, the Krotov method gives the lowest fidelity, possibly because Krotov requires a reasonable level of continuity in the control sequence, and a sequence with only a few pieces is unlikely to reach convergence. As N increases, the performance of Krotov improves considerably, and it has the highest fidelity when N is large (N ≳ 30 as seen in the figure). SGD performs better than Krotov for N ≲ 10, but worse otherwise, because as N increases the algorithm has to search over a much larger parameter space and, within the given number of iterations (N_iter = 500 as noted above), it concludes with a lower fidelity. Of course, this result can be improved if more iterations are allowed, and we show relevant results in Supplementary Discussion 2. The SGD result at N = 2 is irregular (hence the cusp at N = 6), due to the lack of flexibility in the control sequence, which makes it difficult to achieve high fidelity with only two steps.
The fidelity for TQL is higher than those of SGD and Krotov, but still lower than those of DQL and PG, indicating the superior ability of deep learning. Nevertheless, we note that TQL may sometimes fail: it occasionally arrives at a final state which is completely different from the target state. SGD, on the other hand, can fail by being trapped at a local minimum, but even in that case its result is not drastically different from the optimal solution in terms of the fidelity. These occasional failures are the reason why the TQL results drop for N > 10: for larger N, the failure rate for TQL is higher (possibly due to the higher dimensionality of the Q-table), and therefore the averaged fidelity is lower. Among all five algorithms, PG is consistently the best. Apart from PG, DQL gives the highest fidelity for N < 30, but due to its nonzero failure probability, it is outperformed by Krotov for N > 30. Nevertheless, the effect is moderate and the fidelity is still very close to 1 (F = 0.9988).
To further understand the results shown in Fig. 2, we take examples from N = 20 and plot the pulse profiles and the corresponding trajectories on the Bloch sphere in Fig. 3. We immediately see that reinforcement learning (TQL, DQL and PG) yields very simple pulse shapes: one only has to keep the control at zero for time T/2, and the desired target state (|1⟩) will be achieved. However, to find this result, the algorithm has to somehow realize that one does not have to complete all N pieces, which implies an ability to adaptively generate the control sequence. As can be seen from Fig. 3a and 3c, SGD and Krotov only search for pulse sequences with exactly N pieces and therefore miss the optimal solution. Their trajectories on the Bloch sphere are much more complex than those of reinforcement learning. In practice, the complex pulse shapes and longer gate times mean that they are difficult to realize in the laboratory and potentially introduce errors to the control procedure (in Supplementary Discussion 4 we provide more details on this issue). From Fig. 3 we also notice that reinforcement learning possesses a better ability for adaptive sequencing, which is particularly suitable for problems that involve optimization of gate time or speed, such as the quantum speed limit [5,9]. On the other hand, applying SGD or Krotov to the same problem requires searching over various different N values before an optimal solution can be found, which costs much more resources [45,46].
We now study the effect of restrictions on the performances of the algorithms. Namely, the control field is bounded between J_min and J_max, with M + 1 allowed values including the bounds. In Fig. 4, we impose the same restriction J ∈ [0, 1] on all five methods and vary M from M = 1 to M = 49. It is interesting to note that the averaged fidelities of the three reinforcement learning algorithms decrease with M, albeit not considerably. This is because TQL, DQL and PG favor bounded and concrete sets of actions, and more choices only add burden to the searching process, rendering the algorithms inefficient. Improvements may be made by increasing the number of iterations (cf. Supplementary Discussion 2) and by using a larger neural network with stronger representational power. For N = 6 (Fig. 4a), TQL and DQL are comparable and have overall the best performance, except for M > 14, where SGD becomes slightly better. On the other hand, F for PG drops rapidly for M ≳ 30. For N = 20 (Fig. 4b), DQL and PG have the best performance, but for large M they are not significantly better than the other methods. More results involving SGD and Krotov are given in Supplementary Discussion 1, from which we conclude that the effect of bounds on the control is much more pronounced for the Krotov method than for SGD, since Krotov performs much larger updates at each iteration. Meanwhile, the effect of discretization (decreasing M) is severe for both the Krotov and SGD methods, indicating that their successful implementation depends crucially on the continuity of the problem.
Finally, we note that all results obtained so far have the target state |1⟩. Preparing a quantum state other than |1⟩ may give different results, for which an example is presented in Supplementary Discussion 3. Nevertheless, the overall observations on the pros and cons of the algorithms should remain similar.

B. Multi-qubit case
We now consider the preparation of a multi-qubit state, as sketched in Fig. 5a, b. Our system is described by a spin-chain Hamiltonian in which K is the total number of spins, S^k_x, S^k_y and S^k_z are the spin operators of the kth spin, C describes the constant nearest-neighbor coupling strength (set to C = 1), and B_k(t) is the time-dependent local magnetic field applied at the kth spin to perform control. This is essentially a spin-transfer task: the system is initialized to a state with the leftmost spin up and all others down, and the goal is to prepare a state with the rightmost spin up and all others down. We set the operation time to T = (K − 1)π/2, which is divided into 20 equal time steps (i.e. N = 20). The external field is restricted to B_k(t)/C ∈ [0, 40] for SGD and Krotov, and B_k(t)/C ∈ {0, 40} for all three reinforcement learning algorithms. Note that TQL fails for K ≥ 2 due to the large size of the Q-table, and is thus excluded from the comparison.

TABLE I: Summary of the performances under different situations. Columns: SGD, Krotov, TQL, DQL, PG. Rows: performance vs number of time steps N; ability to adaptively segment; discrete operation set (M small); continuous operation set (M large); scaled-up problems (multi-qubits). A check mark indicates that the algorithm performs best, while a downward (upward) arrow denotes a decrease (increase) of the performance as the variable concerned increases.
Fig. 5c shows the average fidelity versus the number of spins (K), after each algorithm is run for 500 iterations. As K increases, the dimensionality of the problem increases and therefore the performances of all algorithms deteriorate. When K < 4, Krotov, DQL and PG have comparable performances, while SGD has the lowest fidelity. As K increases, F for PG and DQL drops much more slowly than that for Krotov. At K = 8, we have F = 0.0989 (Krotov), F = 0.4214 (PG) and F = 0.5433 (DQL). Here, we have not assumed a particular form of the control field, so one has to search over a very large space. Specializing the control to certain types would improve the performances of the algorithms [47].
In order to visualize the final states prepared, we define the amplitude A_k as the absolute value of the inner product between the final state and the state with the kth spin up and all others down. A perfect transfer would give an amplitude of 1 for the rightmost spin and 0 otherwise. Taking K = 8 as an example, we show how the amplitudes are distributed over the different spins in Fig. 5d-g. We compare two kinds of results: one showing the average over 100 runs (red solid bars), and the other the best result among the 100 runs (hollow bars enclosed by dashed lines). We see that SGD completely fails to prepare the desired state. The best results from Krotov, DQL and PG are comparable, but considering the average over many runs, DQL and PG perform better. The optimal control sequences for the different algorithms are provided in Supplementary Tables 1-4.
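The amplitudes A_k defined above can be read off a final state vector as in the sketch below. The basis encoding (spin up = 1, spin down = 0, leftmost spin as the most significant bit of the basis index) is an assumption made only for this illustration, and the function names are hypothetical.

```python
import numpy as np


def single_excitation_index(k, K):
    """Basis index of the state with spin k (1-based, leftmost = 1) up.

    ASSUMED encoding: up = 1, down = 0, leftmost spin = most significant bit.
    """
    return 1 << (K - k)


def amplitudes(psi_final, K):
    """A_k = |<k-th spin up, others down | psi_final>| for k = 1..K."""
    psi_final = np.asarray(psi_final, dtype=complex)
    return np.array(
        [abs(psi_final[single_excitation_index(k, K)]) for k in range(1, K + 1)]
    )
```

With this encoding, a perfect transfer on K spins corresponds to a state vector whose only nonzero entry sits at index 1, so that only the last entry of the returned array is nonzero.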

III. DISCUSSION
In this paper, we have examined the performances of five algorithms, SGD, Krotov, TQL, DQL, and PG, on the problem of quantum state preparation. From the comparison, we can summarize the characteristics of the algorithms under different situations as follows (see also Table I).
Dependence on the maximum number of pieces in the control sequence, N: When all algorithms are executed with the same number of iterations, PG has overall the best performance, but the corresponding fidelity still drops slightly as N increases. In fact, the fidelities from all methods decrease as N increases, except for the Krotov method, for which the fidelity increases when N is large.
Ability to adaptively segment: During the optimization process, TQL, DQL and PG can adaptively reduce the number of pieces required and can thus find optimal solutions efficiently. SGD and Krotov, on the other hand, always work with a fixed N and thus sometimes miss the optimal solution.
Dependence on restricted ranges of the strength of the control field: TQL, DQL and PG naturally work with restricted sets of actions, so they perform well when the strength of the control field is restricted. Such a restriction reduces the efficiency of both SGD and the Krotov method, but the effect is moderate for SGD because its updates to the control field are essentially local. The Krotov method, however, makes significant updates during its execution and thus becomes severely compromised when the strength of the control field is restricted.
Ability to work with control fields taking M + 1 discrete values: TQL, DQL and PG again naturally work with discrete values of the control field. In fact, their fidelities decrease as the allowed values of the control field become more continuous (M increases). This problem may be circumvented using more sophisticated algorithms such as Actor-Critic [48,49] and the deep deterministic policy gradient method [50]. SGD is not sensitive to M because it works with a relatively small range of the control field, for which a reasonable discretization is sufficient. The Krotov method, on the other hand, strongly favors continuous problems, i.e. large M.
Ability to accommodate scaled-up problems (multiple qubits): Except for TQL, all other algorithms can be straightforwardly generalized to treat quantum control problems with more than one qubit.However, SGD is rather inefficient, and DQL generally outperforms all others for cases considered in this work (K ≤ 8).
Moreover, we have found that PG and DQL methods, in general, have the best performances among the five algorithms considered, demonstrating the power of reinforcement learning in conjunction with neural networks in treating complex optimization problems.
Our direct comparison of different methods may also shed light on how these algorithms can be improved. For example, the Krotov method strongly favors "continuous" problems, for which TQL, DQL and PG do not perform well. It should be possible to apply the gradients used in the Krotov method in the Q-learning procedures and thereby improve their performance. We hope that our work has elucidated the effectiveness of reinforcement learning in problems with different types of constraints, and that it may provide hints on how these algorithms can be improved in future studies.

Methods
In this section, we give a brief description of our implementation of TQL, DQL and PG in this work. The full algorithms for all methods used in this work are given in Supplementary Method 1.

A. TQL
For Q-learning, the key ingredients include a set of allowed states S, a set of actions A, and the reward r. The state of the qubit is parametrized by the angles (θ, φ), which correspond to a point on the Bloch sphere, with a possible global phase of −1 included. Our set of allowed states S consists of a discrete set of such points. After each step in the evolution, if the resulting state is not identical to any member of the set, it is assigned to the member that is closest to it, i.e. the one having the maximum fidelity with it.
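The assignment of an evolved state to its closest member of S can be sketched as follows. The grid used in `build_state_set` (its resolution and the standard Bloch-sphere parametrization) is hypothetical and stands in for the actual set S; only the maximum-fidelity rule is taken from the text.

```python
import numpy as np


def bloch_state(theta, phi):
    """Standard Bloch-sphere parametrization cos(theta/2)|0> + e^{i phi} sin(theta/2)|1>."""
    return np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])


def build_state_set(n_theta=7, n_phi=12):
    """HYPOTHETICAL discrete grid on the Bloch sphere, one state per row."""
    thetas = np.linspace(0, np.pi, n_theta)
    phis = np.linspace(0, 2 * np.pi, n_phi, endpoint=False)
    return np.array([bloch_state(t, p) for t in thetas for p in phis])


def nearest_state_index(psi, state_set):
    """Index of the member of the set with maximum fidelity |<s|psi>|^2."""
    overlaps = np.abs(state_set.conj() @ psi) ** 2
    return int(np.argmax(overlaps))
```

Because the fidelity is insensitive to a global phase, `psi` and `-psi` are assigned to the same member of the set.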
In the ith step of the evolution, the system is at a state s_i = |ψ_i⟩ ∈ S, and the action is given by the evolution operator a_i = U_i = exp{−iH(J_i)dt}. All allowed values of the control field J_i therefore form a set of possible actions A. The resulting state U_i|ψ_i⟩ after this step is then compared to the target state, and the reward is calculated using the fidelity F between the two states as
\[ r_i = \begin{cases} 10, & F \in (0.5, 0.9], \\ 100, & F \in (0.9, 0.999], \\ 5000, & F \in (0.999, 1], \end{cases} \tag{7} \]
so that an action that takes the state very close to the target is strongly rewarded. In practice, the agent chooses its action according to the ε-greedy algorithm [25], i.e. the agent either chooses the action with the largest Q(s, a) with probability 1 − ε, or with probability ε randomly chooses an action in the set. The introduction of a nonzero but small ε ensures that the system is not trapped in a poor local minimum. The elements of the Q-table are then updated as
\[ Q(s_i, a_i) \leftarrow Q(s_i, a_i) + \alpha \left[ r_i + \gamma \max_{a'} Q(s_{i+1}, a') - Q(s_i, a_i) \right], \tag{8} \]
where a′ runs over all possible actions in this step, α is the learning rate, and γ is a reward discount that ensures the stability of the algorithm.
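The three ingredients of a TQL step (reward, ε-greedy action selection and Q-table update) can be sketched as below. This is a minimal illustration rather than the implementation used in this work: the update is the standard Q-learning rule, the values of `alpha` and `gamma` are placeholders, and the reward for F ≤ 0.5, which is not listed in Eq. (7), is assumed to be zero.

```python
import numpy as np


def reward(F):
    """Piecewise reward of Eq. (7); the value 0 for F <= 0.5 is an ASSUMPTION."""
    if F > 0.999:
        return 5000.0
    if F > 0.9:
        return 100.0
    if F > 0.5:
        return 10.0
    return 0.0


def epsilon_greedy(Q, s, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))


def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Standard Q-learning update; alpha and gamma are placeholder values."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```

Here `Q` is a two-dimensional array indexed by (state index, action index), so both the states and the actions must already be discretized, as noted for TQL in the Introduction.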

B. DQL
DQL stores the action-value function in a neural network Θ. We take the single-qubit case as an example. Given an agent state s that encodes the qubit state, the network outputs the Q-value for each action a ∈ A as Q(s, a; Θ). We note that in DQL, the discretization of states on the Bloch sphere is no longer necessary and we can deal with states that vary continuously. Otherwise the definitions of the set of actions and the reward are the same as those in TQL.
We adopt the double Q-network training approach [2]: two neural networks, the evaluation network Θ and the target network Θ⁻, are used in training. In the memory we store experiences defined as e_i = (s_{i−1}, a_i, r_i, s_i). In each training step, an experience is randomly chosen from the memory, and the evaluation network is updated using the outcome derived from the experience.
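The experience memory described above can be sketched with a fixed-capacity buffer. The class name, the capacity and the policy of discarding the oldest experience when full are assumptions for illustration; the text only specifies the stored tuple e_i = (s_{i−1}, a_i, r_i, s_i) and the random choice of an experience.

```python
import random
from collections import deque, namedtuple

# One experience e_i = (s_{i-1}, a_i, r_i, s_i)
Experience = namedtuple("Experience", ["s_prev", "action", "reward", "s_next"])


class ReplayMemory:
    """Fixed-capacity memory; the oldest experience is dropped when full (ASSUMED)."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, s_prev, action, reward, s_next):
        self.buffer.append(Experience(s_prev, action, reward, s_next))

    def sample(self, batch_size=1):
        """Draw experiences uniformly at random, without replacement."""
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Each sampled experience would then be used to compute the update of the evaluation network, with the target network providing the reference Q-values.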

C. PG
Similar to DQL, PG is based on neural networks. With the state s as the input vector, the network of PG outputs the probability of choosing each action, p = P(s; Θ), where p = [p_1, p_2, ...]^T. At each time step t, the agent chooses its action according to p, and stores the total reward it has obtained, v_t = Σ_{i=1}^{t} γ^i r_i. In each iteration, the network is updated in order to increase the total reward. This is done according to the gradient of log P(s_t; Θ) v_t, the details of which can be found in Supplementary Method 1.
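The quantities entering the PG update can be sketched as follows, reading the accumulated reward as v_t = Σ_{i=1}^{t} γ^i r_i. The softmax output layer and the function names are assumptions made for illustration; the gradient shown is taken with respect to the network's output logits only, not the full network parameters Θ.

```python
import numpy as np


def softmax(logits):
    """Action probabilities p from the network's output logits (ASSUMED output layer)."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()


def accumulated_rewards(rewards, gamma):
    """v_t = sum_{i=1}^{t} gamma**i * r_i for t = 1, ..., len(rewards)."""
    rewards = np.asarray(rewards, dtype=float)
    powers = gamma ** np.arange(1, len(rewards) + 1)
    return np.cumsum(powers * rewards)


def grad_log_prob(logits, action):
    """Gradient of log p[action] with respect to the logits: onehot(action) - p."""
    g = -softmax(logits)
    g[action] += 1.0
    return g
```

A gradient-ascent step would move the parameters along `v_t * grad_log_prob(...)`, propagated back through the network, so that actions followed by large accumulated reward become more probable.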
We note that unlike the case for SGD and Krotov, in which the fidelity monotonically increases with more training in most cases, the fidelity output by TQL, DQL and PG may experience oscillations as the algorithm cannot guarantee optimal solutions in all trials.In this case, one just has to choose outputs which have higher fidelity as the learning outcome.

Supplementary Discussion 1
Here, we consider the situation in which the control field is bounded for SGD and Krotov, and the results are shown in Fig. S1. Fig. S1a shows the results for the SGD method, with the blue line identical to that in Fig. 2 of the main text (no restriction) and the black one showing results after J is restricted between 0 and 1. We see that imposing a restriction on the available range of the control field does not change the results much, because the search by the SGD algorithm is essentially local: the alteration of J is small in each step and it is unlikely to build up a significant variation of J in the final results. This fact can also be seen from Fig. 3a of the main text: the strength of the control field is mostly within the range [0, 1], so the restriction has minimal effect on the results.
The situation is different for the Krotov method. As can be seen in Fig. S1b, for N < 30, the results from the Krotov method with the restriction J_i ∈ [0, 1] (black line) have considerably lower average fidelities than those without restriction (red line, identical to the results shown in Fig. 2 of the main text). This is because the Krotov method makes large updates to the values of the control fields, as can be seen from Fig. 3b of the main text, where the magnitude of J_i can be above 20. Restricting the control field to a much narrower range severely compromises the ability of the algorithm to find solutions with high fidelities. An exception is N = 6, for which the results with restriction have a higher average fidelity. While the true reason remains unclear, we suspect that the algorithm happens to have found a relatively good local minimum which outperforms many other cases. We believe that the algorithm succeeds in this particular case but not in general. After all, the averaged fidelity is below 0.6 for both lines, with or without restrictions. For N > 30, the results without restriction on the range of the control approach one (1 − F < 10⁻⁷), and those with restriction are lower than one but very close (for example, F = 0.9822 for N = 30). This indicates that having more pieces in the control sequence can greatly help the Krotov algorithm to achieve higher fidelities despite the limited strength of the control fields. The inset of Fig. S1b shows how the two points marked by the vertical dashed line at N = 20 connect when we expand the range of the control field. The bound is given as 1 − J_max ≤ J_i ≤ J_max. When J_max is increased from 0 to 20, the averaged fidelity from the Krotov method increases from 0.4 to above 0.8. This clearly demonstrates that the range of allowed values of the control fields affects the outcome of the Krotov algorithm in a significant way.
We now proceed to consider the effect of discrete control on the averaged fidelities obtained by the algorithms. We start from SGD and Krotov with the range of the control field unrestricted, and the results are shown in Fig. S2. In both panels, we see that the averaged fidelities from the SGD method first increase for small M but quickly saturate. The insensitivity of SGD to the discretization of the control field is due to the fact that SGD updates the control field moderately and can find sufficient control field values within a relatively narrow range, even if the values are discretized. This is similar to the reason why the restriction on the range of the control field has little effect on the results in Fig. S1a.
On the other hand, the averaged fidelities from the Krotov method increase as functions of M, but the increase is much more pronounced for N = 20 (Fig. S2b) than for N = 6 (Fig. S2a). In Fig. S2b, the averaged fidelity from the Krotov method exceeds that from SGD at around M + 1 = 15. The result indicates that successful implementation of the Krotov method depends crucially on the continuity of the problem, in terms of both the number of pieces in the control sequence and the allowed values of the control field. We also note that in the limit M → ∞, the extrapolated fidelity values are consistent with the results shown in Fig. 2 of the main text, providing a consistency check of our calculations.
Supplementary Discussion 2: Improving the fidelity with more iterations
In Fig. 2 of the main text, we have compared five algorithms in terms of the average fidelity versus the maximum number of control pieces. To ensure a fair comparison, all algorithms are required to stop at Niter = 500, i.e. after 500 iterations. Here we show that by allowing more iterations, the fidelity of all algorithms can improve, and the improvement is particularly pronounced for the SGD method. Fig. S3 shows the average fidelity versus the number of iterations, with all other parameters and constraints the same as those used in Fig. 2 of the main text. Fig. S3a shows the results at N = 20 for

FIG. 1: Sketch of the procedure of TQL and DQL. (a) In TQL, the Q(s, a) values are stored in the Q-table. When the agent is at state s(5), it reviews Q(s(5), a(i)) for all possible actions and chooses the one with the maximum "Q-value" (which we assume is a(3)). As a result, the state then evolves to s(2). Depending on the distance between s(2) and the target, the Q-value (e.g. Q(s(5), a(3))) is updated according to Eq. (8). This process is then repeated at the new state s(2) and so forth. (b) In DQL, the Q-table is replaced by the Q-network. Instead of choosing an action with the maximum Q-value from a list, this process is done by a neural network, the Q-network, which takes the input state (s) and outputs the action that it finds most appropriate. Evaluation of the resulting state (s′) after the action suggests how the neural network should be updated (trained). For detailed implementation, see Methods and Supplementary Method 1.
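The greedy selection and Q-table update sketched in panel (a) can be illustrated as follows. This is only a minimal sketch using the standard textbook Q-learning rule; Eq. (8) of the main text is not reproduced here and may differ in detail. The function names, the learning rate `alpha`, the discount factor `gamma`, and the numerical reward are all hypothetical choices made for illustration.

```python
import numpy as np

def greedy_action(Q, s):
    """Return the index of the action with the largest Q-value in state s."""
    return int(np.argmax(Q[s]))

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update (assumed form, not Eq. (8) itself).

    Q[s, a] is moved toward reward + gamma * max_a' Q[s_next, a'].
    alpha and gamma are hypothetical hyperparameters.
    """
    td_target = reward + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example mirroring the caption: in state 5 the best action is 3,
# the system moves to state 2, and Q[5, 3] is then updated.
Q = np.zeros((6, 4))      # 6 states, 4 actions (sizes chosen arbitrarily)
Q[5, 3] = 1.0             # make action 3 the greedy choice in state 5
a = greedy_action(Q, 5)
Q = q_update(Q, 5, a, reward=2.0, s_next=2, alpha=0.5, gamma=0.9)
```

In the setting of the caption, the reward would be derived from the distance between the new state and the target; here it is simply a number supplied by hand.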

FIG. 3: Pulse profiles and the corresponding trajectories on the Bloch sphere. Left column: Example pulse profiles taken from the results of Fig. 2 with N = 20. Right column: Evolution of the state corresponding to the respective control sequence in the left column. TQL, DQL and PG give the same optimal results and are thus shown together.

FIG. 4: Effect of discrete control fields on the averaged fidelity for all five methods considered. The strength of the control field is restricted to Ji ∈ [0, 1], and M + 1 discrete values (including 0 and 1) are allowed. Panel (a) shows the case of N = 6 and panel (b) the case of N = 20, corresponding to the two vertical dashed lines in Fig. 2.

FIG. 5: Spin transfer as preparation of a multi-qubit state. (a) The multi-qubit system is initialized with the leftmost spin being up and all others down. (b) The target state has the rightmost spin being up and all others down. (c) The average fidelity versus the number of spins from different algorithms. (d)-(g) The amplitudes (visualization of the final prepared state) at different spins for K = 8. The red solid bars correspond to results averaged over 100 runs, and the hollow bars enclosed by dashed lines show the results with the highest fidelities.

Supplementary Figure S1: Effect of bounds of the control on the average fidelities as functions of N for SGD and Krotov methods. (a) F versus N for the SGD method, without (blue) and with (black) restriction of Ji ∈ [0, 1]. (b) Main panel: F versus N for the Krotov method, without (red) and with (black) restriction of Ji ∈ [0, 1]. Inset: F versus Jmax for N = 20, where Ji is restricted to Ji ∈ [1 − Jmax, Jmax].

Supplementary Figure S2: Effect of discrete control fields on the averaged fidelity for SGD and Krotov methods. In the calculation, the strength of the control field is not specifically restricted, so the Jmin and Jmax values are determined after the algorithm has run. The values of the control field are then mapped to their respective closest discrete values, with the total number of allowed values, including Jmin and Jmax, being M + 1. Panel (a) shows the case of N = 6 and panel (b) the case of N = 20, corresponding to the two vertical dashed lines in Fig. 2 of the main text.