Transferable control for quantum parameter estimation through reinforcement learning

Measurement and estimation of parameters are essential for science and engineering, where one of the main quests is to find systematic and robust schemes that can achieve high precision. While conventional schemes for quantum parameter estimation focus on the optimization of the probe states and measurements, it has recently been realized that control during the evolution can significantly improve the precision. The identification of optimal controls, however, is often computationally demanding, as the optimal controls typically depend on the value of the parameter and thus need to be re-calculated whenever the estimate is updated. Here we show that reinforcement learning provides an efficient way to identify the controls that can be employed to improve the precision. We also demonstrate that reinforcement learning is highly transferable, namely a neural network trained under one particular value of the parameter can work for different values within a broad range. These desirable features make reinforcement learning more efficient than conventional optimal quantum control methods.


I. INTRODUCTION
Metrology, which studies high precision measurement and estimation, has been one of the main driving forces in science and technology. Recently, quantum metrology, which uses quantum mechanical effects to improve the precision, has gained increasing attention for its potential applications in imaging and spectroscopy [1][2][3][4][5][6].
One of the main quests in quantum metrology is to identify the highest precision that can be achieved with given resources. Typically the unknown parameter is encoded in the dynamics. To achieve the highest precision, one needs to optimize the probe states, the controls during the evolution, and the measurements on the output states. Previous studies have mostly focused on the optimization of the probe states and measurements [6]; control has only recently begun to gain attention [7][8][9][10][11][12][13][14][15][16]. It has now been realized that properly designed controls can significantly improve the precision limits. The identification of optimal controls, however, is often highly complicated and time-consuming. This issue is particularly severe in quantum parameter estimation, as the optimal controls typically depend on the value of the parameter, which can only be estimated from the measurement data. When more data are collected, the optimal controls also need to be updated, which is conventionally achieved by another run of the optimization algorithm. This creates a high demand for efficient algorithms to find the optimal controls in quantum parameter estimation.
Over the past few years, machine learning has demonstrated astonishing achievements in certain high-dimensional input-output problems, such as playing video games [17] and mastering the game of Go [18]. Machine learning techniques have been applied in physics to many topics, including experimental design [19], finding optimal state-transfer schemes in a spin chain [20], and discovering quantum-error-correction strategies under noise [21]. Among machine learning algorithms, reinforcement learning (RL) is one of the most actively researched [22]. Here we show that RL can be used to efficiently identify the optimal controls in quantum parameter estimation. A main advantage of RL, compared to conventional optimal quantum control algorithms, is that it is highly transferable, i.e., the agent trained through RL under one value of the parameter works for a broad range of values. There is then no need for re-training after the estimated value of the parameter is updated from the accumulated measurement data. This makes RL more efficient and less time-consuming than conventional gradient-based optimization algorithms.

A. Model and Method
We consider a generic control problem described by the Hamiltonian [23]

Ĥ = Ĥ₀(ω) + Σ_{k=1}^{p} u_k(t) Ĥ_k,  (1)

where Ĥ₀(ω) is the time-independent free Hamiltonian of the quantum state, ω is the parameter to be estimated, u_k(t) is the kth time-dependent control field, p is the number of control fields, and Ĥ_k couples the kth control field to the state. The density operator ρ(t) of a quantum state (pure or mixed) evolves according to the master equation [11]

∂ρ(t)/∂t = −i[Ĥ, ρ(t)] + Γ[ρ(t)],  (2)

where Γ[ρ(t)] describes a noisy process, the detailed form of which depends on the specific noise mechanism and will be specified later.
The key quantity in quantum parameter estimation is the quantum Fisher information (QFI) [24][25][26][27], defined by

F(t) = Tr[ρ(t) L̂_s²(t)],  (3)

where L̂_s(t) is the so-called symmetric logarithmic derivative, which can be obtained by solving ∂ρ(t)/∂ω = [L̂_s(t)ρ(t) + ρ(t)L̂_s(t)]/2 [24,25,28]. According to the Cramér-Rao bound, the QFI provides a saturable lower bound on the estimation precision as δω̂ ≥ 1/√(nF), where δω̂ is the standard deviation of an unbiased estimator ω̂, and n is the number of times the procedure is repeated. Our goal is therefore to search for optimal control sequences u_k(t) that maximize the QFI at time t = T (typically the conclusion of the control), F(T), respecting all constraints possibly imposed in specific problems. Practically, we consider piecewise constant controls: the total evolution time T is discretized into N steps of equal length ∆T labeled by j, and we use u_k^(j) to denote the strength of the control field u_k on the jth time step. Such problems are frequently tackled by the Gradient Ascent Pulse Engineering (GRAPE) method [23], which searches for an optimal set of control fields by updating their values according to the gradient of a cost function encapsulating the goal of the optimal control. GRAPE has proven successful in preparing optimal control pulse sequences that improve the precision limit of quantum parameter estimation in noisy processes [11,12]. Many alternative algorithms can tackle this optimization problem, such as stochastic gradient ascent (descent) and the microbial genetic algorithm [29], but their convergence to the optimal control fields becomes much slower as the number of control fields (p) or discretization steps (N) increases. Other optimal quantum control algorithms, such as Krotov's method [30][31][32][33][34] and the CRAB algorithm [35], typically depend on the value of the parameter and thus need to be re-run each time the estimate is updated, which is highly time-consuming. More efficient algorithms are thus highly desired.
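To make the QFI concrete, here is a minimal numerical sketch (not part of the original implementation, which used QuTiP): for a full-rank ρ, the SLD equation ρL̂_s + L̂_sρ = 2∂_ωρ is a Sylvester equation, after which F = Tr(ρL̂_s²). The function name `qfi` and the dephased-qubit test state are our own illustrative choices.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def qfi(rho, drho):
    """QFI F = Tr(rho L^2), where the SLD L solves rho L + L rho = 2 drho."""
    L = solve_sylvester(rho, rho, 2 * drho)   # Sylvester equation A X + X B = Q
    return float(np.real(np.trace(rho @ L @ L)))

# Test case: dephased qubit rho(w) with Bloch vector r = v (cos wt, -sin wt, 0);
# its QFI is known analytically to be (v t)^2.
v, t, w = 0.8, 2.0, 1.0
rho = 0.5 * np.array([[1, v * np.exp(1j * w * t)],
                      [v * np.exp(-1j * w * t), 1]])
drho = 0.5 * np.array([[0, 1j * t * v * np.exp(1j * w * t)],
                       [-1j * t * v * np.exp(-1j * w * t), 0]])
print(qfi(rho, drho))  # ≈ (0.8 * 2.0)^2 = 2.56
```

For a pure state ρ is singular and the Sylvester solve is ill-posed, so the example uses a mixed (dephased) state.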
In this work, we employ reinforcement learning to solve the problem. Fig. 1 shows schematics of the RL procedure used in this work. Fig. 1(a) shows the RL agent, which takes an action as prescribed by a neural network. The action is essentially the control field, which steers the qubit according to the master equation, Eq. (2), and the resulting state of the evolution determines the reward the agent receives. In practice, the reward encodes the QFI, i.e. a higher reward is obtained when the control achieves a greater QFI.
In our problem, an action taken by the agent implies a time evolution of the quantum state according to Eq. (2) with the control field u_k(t). All possible actions therefore form a continuous set. We solve this problem using the Actor-Critic algorithm [22], as shown in Fig. 1(b). This algorithm is particularly suitable for our problem as it can treat continuous actions. Its key feature is that the neural network is updated not only using the reward but also a state value, the latter of which greatly improves the efficiency of the training procedure. At each time step, the neural network takes the quantum state as an input and outputs both an action and a state value that assesses how likely the state is to lead to a larger QFI. The state is then evolved using the outputted action, yielding the new state and QFI, which enter the reward function. This reward function, in conjunction with the state value, forms a loss function which is then used to update the neural network.
In order to improve the efficiency of computation, we use a parallel version of the Actor-Critic algorithm called the Asynchronous Advantage Actor-Critic (A3C) algorithm [36]. The details of both the Actor-Critic and A3C algorithms, as well as pseudocode describing the implementation, are given in the Supplementary Materials.
Next we apply the algorithm to two commonly considered noisy processes, dephasing and spontaneous emission, to demonstrate its performance.

B. Dephasing Dynamics
Under dephasing dynamics, the master equation, Eq. (2), takes the following form [11]:

∂ρ(t)/∂t = −i[Ĥ, ρ(t)] + γ[σ̂_n ρ(t) σ̂_n − ρ(t)],  (4)

Ĥ = (ω₀/2) σ̂₃ + u(t) · σ,  (5)

where the control field u(t) = (u₁, u₂, u₃) is a magnetic field that couples to σ = (σ̂₁, σ̂₂, σ̂₃), and γ is the dephasing rate, taken as 0.1 throughout the paper. We consider dephasing along a general direction given by n = (sin ϑ cos φ, sin ϑ sin φ, cos ϑ), with σ̂_n = n · σ. The parameter to be estimated is ω₀ in Eq. (5), whose true value is assumed to be 1, and we take ω₀⁻¹ = 1 as our time unit. We choose the probe state, i.e. the initial state of the evolution, as (|0⟩ + |1⟩)/√2 in all subsequent calculations, where |0⟩, |1⟩ are the eigenstates of σ̂₃.
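As an illustration of how such piecewise-constant controlled evolutions can be propagated numerically (the original work used QuTiP; this standalone sketch instead builds a vectorized Liouvillian and uses matrix exponentials, with `step` a hypothetical helper name):

```python
import numpy as np
from scipy.linalg import expm

s1 = np.array([[0, 1], [1, 0]], dtype=complex)
s2 = np.array([[0, -1j], [1j, 0]], dtype=complex)
s3 = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def step(rho, w0, u, n_vec, gamma, dT):
    """One piecewise-constant step of the dephasing master equation:
    drho/dt = -i[H, rho] + gamma (sn rho sn - rho), H = (w0/2) s3 + u . sigma."""
    H = 0.5 * w0 * s3 + u[0] * s1 + u[1] * s2 + u[2] * s3
    sn = n_vec[0] * s1 + n_vec[1] * s2 + n_vec[2] * s3
    # row-stacked Liouvillian, using vec(A X B) = (A kron B^T) vec(X)
    L = -1j * (np.kron(H, I2) - np.kron(I2, H.T)) \
        + gamma * (np.kron(sn, sn.T) - np.eye(4))
    return (expm(L * dT) @ rho.flatten()).reshape(2, 2)

rho = 0.5 * np.ones((2, 2), dtype=complex)   # probe state (|0> + |1>)/sqrt(2)
for _ in range(10):                          # T = 1 in N = 10 steps, no control
    rho = step(rho, w0=1.0, u=(0, 0, 0), n_vec=(0, 0, 1), gamma=0.1, dT=0.1)
print(abs(rho[0, 1]))   # parallel dephasing: 0.5 * exp(-2 * gamma * T) ≈ 0.4094
```

With no control and parallel dephasing (n along σ̂₃), the coherence decays as |ρ₀₁(t)| = ½e^(−2γt), which the snippet reproduces.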
In Fig. 2 we present our numerical results for the QFI under dephasing dynamics with ϑ = π/4, φ = 0, using square pulses. Fig. 2(a)-(c) show the results for ∆T = 0.1. Fig. 2(a) shows the training process in terms of F(T)/T as a function of the number of training epochs. The blue line shows results from training with the A3C algorithm. The values of F(T)/T from GRAPE and from the case with no control are shown as the orange dotted line and grey dashed line, respectively. The red line shows results from "A3C+PPO", an enhanced version of A3C that converges faster [37]; the details of this algorithm are explained in the Supplementary Materials. We can see that after sufficient training epochs, the results from A3C exceed those of the case with no control and approach the optimal results found by GRAPE. Meanwhile, "A3C+PPO" converges more quickly to essentially the same result as A3C.
We select one training outcome among those with the best performance in Fig. 2(a) and show F(t)/t and the pulse profiles in Fig. 2(b), (c), respectively. As can be seen from Fig. 2(b), both GRAPE and A3C outperform the case with no control, while the results of A3C are comparable to those from GRAPE. We have discussed dephasing dynamics along a particular axis pertaining to Fig. 2; results for several other dephasing axes are shown in the Supplementary Materials. We conclude from these results that in most cases the A3C algorithm is capable of producing results comparable to those from GRAPE, while in selected situations (e.g. larger ∆T) A3C may outperform GRAPE.
We now discuss the transferability of the control sequences for quantum parameter estimation, a key result of this paper. Since the true value of ω₀ is not known a priori, the control sequence can only be optimized for a chosen value of ω₀. When such a sequence is applied in situations with other ω₀ values, the true value is still measured, but the resulting QFI is lower than when the optimal control for the true ω₀ is used.
The dotted lines in the left column of Fig. 3 show the QFI resulting from measurements with the optimal control found for ω₀ = 1 with GRAPE. Results without control are shown as grey dashed lines for comparison. The range of ω₀ covers a period of 2π/T. As expected, the QFI is largest at ω₀ = 1 and decreases as ω₀ deviates from 1. As ω₀ varies further, the QFI increases at some values of ω₀, which may be due to the geometric relationship between the phases corresponding to those ω₀ values and the phase at ω₀ = 1. In any case, these QFI values are consistently lower than the value at ω₀ = 1. An obvious way to improve the QFI is to generate a new optimal control sequence for each value of ω₀ with GRAPE, but this is costly, as the computational complexity scales as O(N³).
With A3C we have an efficient solution to this problem. We can train the neural network at ω₀ = 1 and use this particular network to generate control sequences for different ω₀ values. Although the neural network is only trained at ω₀ = 1, it works for a broad range of parameter values, so there is no need to re-train it with the updated estimate of the parameter. The computational cost is thus only O(N), which is much more efficient than generating new sequences with GRAPE. These results from A3C are shown in the left column of Fig. 3 as blue solid lines, which represent the best-performing sequence out of 100 trials generated from the trained neural network. For ∆T = 0.1 [Fig. 3(a)], although the QFI at the training value ω₀ = 1 is slightly lower for A3C than for GRAPE, A3C demonstrates higher transferability, as the QFI decreases slowly when ω₀ deviates from 1. For ∆T = 1 [Fig. 3(c)], the QFI of A3C is consistently higher than that of GRAPE except in a narrow range of ω₀ around 0.65.
To further reveal the transferability of the different methods, we consider measurements on an ensemble with ω₀ uniformly distributed in [1 − ∆ω, 1 + ∆ω]. The performance of the quantum parameter estimation is therefore characterized by the average F(T)/T,

F̄(T)/T = (1/2∆ω) ∫_{1−∆ω}^{1+∆ω} [F(T; ω₀)/T] dω₀.

These results are shown in the right column of Fig. 3, which are averages of the data in the corresponding panels in the left column. As seen from Fig. 3(b) (∆T = 0.1), F̄(T)/T for GRAPE is high at small ∆ω but drops quickly as ∆ω is increased. On the contrary, F̄(T)/T for A3C is lower than that for GRAPE at small ∆ω, but decays much more slowly. As a consequence, F̄(T)/T for A3C exceeds that for GRAPE beyond ∆ω ≈ 0.22. This result indicates that for measurements involving a reasonably varying parameter, A3C demonstrates higher transferability than GRAPE. For ∆T = 1, the results of A3C always exceed those of GRAPE, as seen from Fig. 3(d). The result for A3C decays much more slowly than that for GRAPE, consistent with the ∆T = 0.1 case.

C. Spontaneous Emission
A process involving spontaneous emission is described by the Lindblad master equation [11]:

∂ρ(t)/∂t = −i[Ĥ, ρ(t)] + γ₊[σ̂₊ρ(t)σ̂₋ − ½{σ̂₋σ̂₊, ρ(t)}] + γ₋[σ̂₋ρ(t)σ̂₊ − ½{σ̂₊σ̂₋, ρ(t)}],

where σ̂± = (σ̂₁ ± iσ̂₂)/2 and Ĥ is defined as in Eq. (5). The decay rates are taken as γ₊ = 0.1, γ₋ = 0 throughout our discussion. Fig. 4 shows numerical results on the QFI with spontaneous emission; for time step ∆T = 1, the total time is T = 20. Fig. 4(a), (d) [left column] show the A3C training processes, in which the results from GRAPE are indicated as orange dotted lines for reference. We see that "A3C+PPO" converges faster, and both A3C and "A3C+PPO" saturate to values slightly lower than GRAPE. Again, one of the best-performing controls is picked out and the corresponding F(t)/t and pulse profiles are shown in the middle and right columns, respectively. From Fig. 4(b), (e) we see that for the best result from A3C, the QFI is lower than, but comparable to, the results from GRAPE.
As in Sec. II B, we consider the transferability of the different methods in a situation where ω₀ is distributed uniformly over a range. Again, we use GRAPE to obtain optimal control sequences for ω₀ = 1 and apply them to other values. For A3C, we train the neural network at ω₀ = 1; the resulting sequence is then used to obtain an estimate of the true ω₀ value, and a new sequence is generated, using the neural network already trained at ω₀ = 1, with the estimated ω₀. The best-performing results out of 100 A3C outputs are shown as the blue solid lines in Fig. 5, while the results from GRAPE are shown as the orange dotted lines. The left column of Fig. 5 shows F(T)/T as a function of ω₀ for two ∆T values. In both cases, the GRAPE method outperforms A3C in a narrow neighborhood around ω₀ = 1, but its QFI decreases substantially as ω₀ deviates further. On the other hand, A3C exhibits great transferability: for ∆T = 0.1 the QFI does not decrease until ω₀ is reduced to ω₀ ≈ 0.6, while for ∆T = 1 the QFI remains approximately the same over the entire range of ω₀ considered. The average F(T)/T over the range [1 − ∆ω, 1 + ∆ω] is shown in the right column of Fig. 5. In Fig. 5(b), A3C outperforms GRAPE when ∆ω ≳ 0.22, while in Fig. 5(d), A3C outperforms GRAPE over an even larger range, ∆ω ≳ 0.07.
Overall we conclude that in the case of spontaneous emission, the A3C algorithm provides results comparable to GRAPE, although it cannot give higher QFIs. Nevertheless, A3C has much greater transferability than GRAPE, consistent with Sec. II B.

D. Sequences with Gaussian Pulses
For all the results shown above, the control sequences involve square pulses only. In practical experiments, shaped pulses are sometimes used, so in this section we consider Gaussian pulses as an example. The total time T is still divided into pieces of length ∆T; however, on the jth piece the piecewise constant pulse is replaced by a Gaussian centered on that piece and truncated at its ends:

u^(j)(t) = A^(j) exp[−(t − t̄_j)² / (2σ_g,(j)²)],

where A^(j) is the amplitude, t̄_j the center of the jth piece, and σ_g,(j) controls the flatness of the pulse. We note that pulse sequences involving non-boxcar pulses are extremely difficult to treat using GRAPE, but we demonstrate here that the A3C method naturally accommodates different pulse shapes. In Fig. 6 we show A3C results using Gaussian pulses and compare them to GRAPE results using square pulses. For dephasing dynamics, our best results from A3C outperform GRAPE, as is also the case for square pulses generated by A3C. For spontaneous emission, our best-performing result has a QFI value slightly lower than, but very close to, that from GRAPE with square pulses. These results indicate that the A3C method can naturally accommodate pulses other than square shapes, which is another key advantage over GRAPE.
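A sketch of how such a truncated-Gaussian control sequence can be parameterized (the helper `gaussian_sequence` and its signature are our own illustration, not the paper's code):

```python
import numpy as np

def gaussian_sequence(amps, sigmas, T):
    """Build u(t) from per-slice truncated Gaussians: on the j-th slice the
    field is A_j * exp(-(t - t_j)^2 / (2 sigma_j^2)), centred at t_j."""
    N = len(amps)
    dT = T / N
    def u(t):
        j = min(int(t // dT), N - 1)     # index of the slice containing t
        t_c = (j + 0.5) * dT             # centre of that slice
        return amps[j] * np.exp(-(t - t_c) ** 2 / (2 * sigmas[j] ** 2))
    return u

u = gaussian_sequence(amps=[1.0, -0.5], sigmas=[0.3, 0.3], T=2.0)
print(u(0.5), u(1.5))   # slice centres -> the full amplitudes 1.0 and -0.5
```

The truncation is implicit: each slice only ever evaluates its own Gaussian, so the field jumps at slice boundaries where the tails are cut off.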

III. CONCLUSION
To summarize, RL, in particular the A3C algorithm, is capable of finding control protocols that enhance the QFI in a way comparable to the traditionally used GRAPE method, and is in certain situations superior to GRAPE, e.g. for pulse sequences with larger time steps. Moreover, RL can accommodate non-boxcar pulse shapes, which would otherwise be difficult for GRAPE. Nevertheless, the key advantage afforded by RL is its transferability: a neural network trained for one estimated parameter value can efficiently generate pulse sequences that provide reasonably enhanced QFI for a broad range of parameter values, while to achieve the same level of QFI the GRAPE algorithm has to be applied in full each time the parameter estimate is updated. Our results therefore suggest that RL-based methods can be powerful alternatives to commonly used gradient-based ones, capable of finding control protocols that could be more efficient in practical quantum parameter estimation.

S-I. REINFORCEMENT LEARNING
The Reinforcement Learning (RL) framework is schematically shown in Fig. 1(a). The key ingredients of the RL process are a state space S, an action space A, and a reward R [22]. In the RL procedure, an agent at state s_j ∈ S chooses an action a_j ∈ A according to a probabilistic policy π_θ(a_j|s_j), where θ represents the parameters of the policy; for example, when a neural network represents the policy, θ comprises the weights and biases of the network. The action a_j results in a new state s_{j+1}, according to which the agent receives a numerical reward r_{j+1} ∈ R. For a given optimization problem, one encapsulates the goal of the problem into the calculation of the rewards, as well as relevant constraints into the available states and actions. In practice, the reward for a given state depends not only on its immediate next step, but also on several steps in its future, so the total discounted reward for s_j, a key quantity, is given by

R_j = Σ_{k≥1} α^{k−1} r_{j+k},

where α ∈ (0, 1] is the reward decay rate, indicating the relative weight of adjacent steps in the total discounted reward received at a given step. When α = 1 the rewards from all future steps contribute equally, while for α → 0 only the immediate next step contributes appreciably. The probability that the agent takes a certain action is then enhanced or suppressed according to the value of the total discounted reward. After sufficient iterations of training, the agent learns the optimal actions to take in order to maximize the total discounted reward, thereby giving an optimal solution to the desired problem.
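The total discounted reward can be accumulated efficiently by a single backward sweep over the recorded rewards; the sketch below (with the illustrative name `discounted_returns`) returns R_j for every step j of a trajectory:

```python
def discounted_returns(rewards, alpha):
    """R_j = sum_{k>=1} alpha^(k-1) r_{j+k} for every step j, computed by one
    backward sweep; rewards[i] is the reward r_{i+1} received after step i."""
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + alpha * R    # R_j = r_{j+1} + alpha * R_{j+1}
        out.append(R)
    return out[::-1]

print(discounted_returns([1.0, 1.0, 1.0], alpha=0.5))  # [1.75, 1.5, 1.0]
```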
In the RL procedure, the exploration of the agent in the state and action spaces is summarized into a sequence s₀, a₀, r₁, s₁, a₁, r₂, . . . , s_k, a_k, r_{k+1}, . . ., called a trajectory. To determine the best action to take at state s, we define the state-action value function

Q_π(s, a) = E_π[R_j | s_j = s, a_j = a],

where the expectation includes the discounted rewards of all trajectories after taking the action a at the state s in the jth step of the trajectory, provided that the policy π is followed thereafter [22]. We also define the value of a state, which evaluates the likelihood that a given state leads to a higher reward,

V_π(s) = E_π[R_j | s_j = s],

where the expectation includes the discounted rewards of all trajectories starting from the state s at the jth step, provided that the policy π is followed thereafter [22].
An RL policy π is declared "optimal" when the actions selected by the policy in each state are such that the resulting expected discounted reward for every state s ∈ S is no less than that from any other policy π′, i.e. V_π(s) ≥ V_{π′}(s) [22]. Corresponding to the optimal policy π_θ*(a|s), the optimal value functions are

Q*(s, a) = max_π Q_π(s, a),  V*(s) = max_π V_π(s),

where the notations θ* and θ*_v represent the optimal choices of the neural network parameters for the policy and value functions, respectively. If the optimal value functions are known, the RL agent simply chooses the action a_j that has the largest state-action value Q*(s_j, a_j) in state s_j. Alternatively, at state s_j one may choose the next state s_{j+1} that has the largest state value V*(s_{j+1}). Thus, there are two ways for an RL algorithm to solve an optimization problem: the agent either learns the optimal policy, or, if the policy is otherwise specified, the optimal value functions [S1]. The two methods are discussed below.
In the so-called value-based method, the RL agent learns the optimal value functions. The state value function and state-action value function are solved iteratively using the Bellman equations

V_π(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r|s, a)[r + α V_π(s′)],

Q_π(s, a) = Σ_{s′,r} p(s′, r|s, a)[r + α Σ_{a′} π(a′|s′) Q_π(s′, a′)],

where a′ runs over all possible actions in the next state s′ [22]. We define the loss functions as

L_Q = [r + α max_{a′} Q_π(s′, a′) − Q_π(s, a)]²,  (S-8)

L_V = [R_j^n + α^n V_π(s_{j+n}) − V_π(s_j)]²,  (S-9)

where R_j^n = Σ_{k=1}^n α^{k−1} r_{j+k} is called the "n-step" return [22]. We take the ε-greedy policy commonly used in deep Q-learning networks [17,20] as an example. Under this policy, the neural network does one of two things at state s_j: with probability 1 − ε it takes the action a_j that maximizes Q_π(s_j, a_j), and with probability ε ∈ (0, 1] it chooses an action at random. The latter mechanism encourages the agent to explore a wider range of the search space in order to reach a globally optimal solution. In practice, Q_π(s, a) in the loss function Eq. (S-8) is the prediction of the neural network, while r + α max_{a′} Q_π(s′, a′) is calculated from the trajectories of the RL agent. The training procedure of the neural network is essentially the minimization of the loss function, so the trained neural network gives optimal state-action values.
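The ε-greedy rule described above amounts to a few lines; `epsilon_greedy` below is an illustrative sketch, taking a list of Q(s, a) values over a discrete action set:

```python
import random

def epsilon_greedy(q_values, eps, rng=random):
    """With probability eps choose a random action, otherwise the one
    maximizing Q(s, a) over the discrete action set."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(epsilon_greedy([0.1, 0.9, 0.3], eps=0.0))  # greedy limit -> action 1
```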
We note that in the value-based algorithm the policy is fixed, and only the value functions are updated, which may not be sufficient to find a globally optimal solution [S1]. More importantly, the way the action space and trajectories are stored assumes that the actions are discrete, and it becomes far more complicated to treat problems with continuous actions, as is the case for control fields. As discussed below, the policy-based algorithm is more suitable for our problem.
The policy-based algorithm directly updates the policy parameters θ without the need to store a large number of RL trajectories. A typical form of the loss function is [S2]

L = −Σ_j log π_θ(a_j|s_j) A_j,  (S-10)

where A_j is the advantage function, which evaluates the advantage of the chosen trajectory,

A_j = R_j − b(s_j),  (S-11)

with the baseline function b(s_j), normally the estimated state value function, reducing the variance and speeding up the learning process [22,36]. When the value of A_j is large for an action a_j, minimizing L increases π_θ(a_j|s_j), i.e. the probability of choosing the action a_j in state s_j is increased. In our problem, a quantum state is completely described by the density matrix ρ̂^(j) at each time step j. Therefore our state in the RL procedure is defined using the elements of the density matrix as

s_j = (Re ρ₀₀, Im ρ₀₀, Re ρ₁₀, Im ρ₁₀, Re ρ₀₁, Im ρ₀₁, Re ρ₁₁, Im ρ₁₁).  (S-12)
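The mapping from the density matrix to the 8-component RL state of Eq. (S-12) can be sketched as follows (the helper name `rl_state` is ours):

```python
import numpy as np

def rl_state(rho):
    """Flatten a 2x2 density matrix into the 8 real numbers of Eq. (S-12):
    (Re r00, Im r00, Re r10, Im r10, Re r01, Im r01, Re r11, Im r11)."""
    comps = (rho[0, 0], rho[1, 0], rho[0, 1], rho[1, 1])
    return np.array([x for c in comps for x in (c.real, c.imag)])

rho = 0.5 * np.ones((2, 2), dtype=complex)   # the probe state (|0>+|1>)/sqrt(2)
print(rl_state(rho))   # [0.5 0.  0.5 0.  0.5 0.  0.5 0. ]
```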
Our action space is formed by the set of control fields u ≡ a_j ∈ A, which steer the quantum state s_j to s_{j+1} according to the master equation Eq. (2). The new state s_{j+1} is then evaluated and the agent obtains a single-step reward that increases with the ratio F/F₀, where F and F₀ are the corresponding QFI from Eq. (3) with and without control, respectively. The reward function further involves constant parameters η ≥ 1 and C ≥ 1 used in the training process: η ensures a non-zero reward in case the RL agent applies u_{1,2,3}(t) = 0, while C gives extra significance to the last evolution step. After an episode of training, the action sequence of each trajectory constitutes the control field. We also note that our choice of the reward function is not unique.

S-II. ACTOR-CRITIC ALGORITHM
The Actor-Critic algorithm combines the advantages of the policy-based and value-based methods. Fig. 1(b) illustrates the basic procedure of the Actor-Critic algorithm. Two neural networks are involved: the actor network, governing the policy that chooses actions, and the critic network, managing the value functions, which in turn change the baseline function used in further policy-making [22]. More specifically, the state value V_π(s) generated by the critic network is plugged into Eq. (S-11), giving

A_j = R_j^n + α^n V_π(s_{j+n}) − V_π(s_j).  (S-16)

Note that the "n-step" return is used instead of R_j, so that only the n future steps are involved; this is the key distinction from the policy-based method [22]. In the training process, the actor and critic networks minimize the loss function simultaneously: we update the critic network through Eq. (S-9), while the actor network is trained through Eq. (S-10) using the advantage function defined by Eq. (S-16).

Supplementary Figure S1: Schematics of the A3C algorithm, adapted from [S3]. The RL neural network is trained asynchronously based on the trajectories of local networks in N_env RL environments, labeled as "env i". The notation t_c is the CPU time and β is the learning rate. The black dots along the time direction mark the end of each training episode.
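The n-step advantage used to train the actor, Eq. (S-16), can be sketched for a single starting step j with recorded rewards r_{j+1..j+n} and the critic's values V(s_j), V(s_{j+n}); the function name and signature are illustrative:

```python
def n_step_advantage(rewards, v_start, v_end, alpha):
    """A_j = sum_{k=1}^n alpha^(k-1) r_{j+k} + alpha^n V(s_{j+n}) - V(s_j),
    with rewards = [r_{j+1}, ..., r_{j+n}], v_start = V(s_j), v_end = V(s_{j+n})."""
    n = len(rewards)
    R_n = sum(alpha ** k * rewards[k] for k in range(n))   # n-step return
    return R_n + alpha ** n * v_end - v_start

# n = 2: 1 + 0.9 + 0.9^2 * 1.0 - 0.5 = 2.21
print(n_step_advantage([1.0, 1.0], v_start=0.5, v_end=1.0, alpha=0.9))
```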
In order to improve the efficiency of the learning process, a parallelized version of the Actor-Critic algorithm called A3C, short for Asynchronous Advantage Actor-Critic [36], is implemented in our calculation.

S-III. ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ALGORITHM
The key structure of the Asynchronous Advantage Actor-Critic (A3C) algorithm is sketched in Fig. S1. The desired policy and value functions are generated by the neural network (left column in Fig. S1), called the "global" network. The neural network is composed of the state value network V_π(s) (orange), the policy network π(a|s) (green) and fully-connected linear layers (blue). At the beginning of the training process, we make N_env copies of the global network, called "local" networks. Each local network then runs in an independent RL environment, in which the RL agent, called a "local" agent, optimizes the policy and value functions via gradients with respect to the loss functions. At the end of a training episode of each parallel RL procedure, the local agent uploads its accumulated gradients to update the global network. The updated global network is then downloaded back to the local environment, which starts a new episode after the environment is properly reset. Note that throughout the entire process all local agents act independently, which is why the algorithm is called asynchronous [36, S4].

Algorithm 1: (episodic) asynchronous advantage actor-critic with clipped surrogate function
  Initialize the global counter N_ep = 0
  repeat
    Clear gradients: dθ ← 0, dθ_v ← 0
    Synchronize thread-specific parameters: θ′ = θ and θ′_v = θ_v
    Reset environment and initial state s₀
    repeat
      Choose action a_j according to policy π_θ′(a_j|s_j)
      Update state s_j ← s_{j+1} and receive reward r_{j+1}
      ...
We now give the details of our implementation of the A3C algorithm. The RL states are first fed through 4 hidden layers, each comprising 200 ReLU units [S5]. The resulting outputs are then passed to both the value and policy networks. The value network is constructed from one hidden layer with 200 ReLU units and one fully-connected linear layer outputting a real number as the state value. The policy network has one hidden layer with 200 ReLU units and two fully-connected linear layers as output layers. The outputs are six real numbers μ_k, σ_k^G, k = 1, 2, 3, forming three normal distributions N(μ_k, σ_k^G). Here, μ_k is modified by the SoftShrink(λ) activation function with λ = 0.25 and σ_k^G is modified by the SoftPlus activation function [S5]. The continuous actions u_k are randomly sampled from these normal distributions.
We add the negative differential entropy of the normal distribution, −½(log(2πσ²) + 1), to the loss as an entropy regularization term, which encourages the agent to explore the entire search space. We use RMSProp optimizers with shared parameters that are updated asynchronously among the parallel environments [36]. The hyper-parameters, listed in the left column of Table S-I, are kept similar to those used in [36]. The pseudocode for A3C can be found in [36]. Next we discuss an optimized version of the algorithm, i.e. with the Proximal Policy Optimization (PPO) strategy [37, S6].
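For reference, this regularization term is (minus) the differential entropy of a normal distribution, ½(log(2πσ²) + 1), which can be checked against SciPy:

```python
import math
from scipy.stats import norm

def normal_entropy(sigma):
    """Differential entropy of N(mu, sigma^2): (1/2) (log(2 pi sigma^2) + 1)."""
    return 0.5 * (math.log(2 * math.pi * sigma ** 2) + 1)

print(normal_entropy(0.7) - norm(0, 0.7).entropy())  # agrees with SciPy: ~0
```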
Generally, optimization with the logarithm of the policy gradient can lead to large policy updates which, in some cases, make the learning process unstable. The Proximal Policy Optimization (PPO) algorithm replaces the logarithm in Eq. (S-10) with the probability ratio between the new and the old policy,

ratio_j = π_θ(a_j|s_j) / π_θ_old(a_j|s_j),  (S-17)

and the loss function is also truncated (clipped) at certain values of the probability ratio [37]. Algorithm 1 shows the pseudocode for the A3C algorithm utilizing the PPO strategy.
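The clipped surrogate objective of PPO, applied to a single (ratio, advantage) pair, can be sketched as follows (the helper name and the clip value 0.2 are illustrative defaults, not hyper-parameters taken from this paper):

```python
def ppo_surrogate(ratio, advantage, clip=0.2):
    """Clipped PPO surrogate for one sample: min(r*A, clip(r, 1-c, 1+c)*A).
    This quantity is maximized (its negative is minimized as a loss)."""
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped * advantage)

print(ppo_surrogate(1.5, 1.0))   # ratio clipped to 1.2 -> objective 1.2
```

Taking the minimum removes the incentive to push the ratio outside [1 − c, 1 + c], which is what keeps the policy updates small.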
In this algorithm, we replace the global RMSProp optimizer with thread-specific Adam optimizers [S5]. The right column of Table S-I lists the hyper-parameters of the A3C algorithm with the PPO strategy.
We have used PyTorch [S5] to implement the algorithms and QuTiP [S7, S8] to obtain numerical solutions of Eqs. (2)-(3). We also note that, in practice, when ∆T = 1 we have to set smaller values for the learning rates, gradient norm, entropy weight and N_max^ppo.

S-IV. ADDITIONAL RESULTS ON DEPHASING DYNAMICS
In the main text we provided results for quantum parameter estimation under dephasing dynamics along a chosen axis (ϑ = π/4) in Fig. 2. Here we present results along two other axes: parallel dephasing (ϑ = 0) and transverse dephasing (ϑ = π/2). In Fig. S2, the training processes are shown in the upper row, F(T)/T vs. ω₀ in the middle row, and the average F(T)/T over [1 − ∆ω, 1 + ∆ω] in the bottom row. For parallel dephasing, our results are very similar to the ϑ = π/4 results shown in the main text, namely F(T)/T calculated from A3C is lower than that from GRAPE only in a narrow range of ∆ω. For ∆T = 0.1, A3C outperforms GRAPE when ∆ω ≳ 0.15, while for ∆T = 1, A3C is better than GRAPE over a wider range, ∆ω ≳ 0.05. For transverse dephasing, the situation is slightly more complicated (note that analytical solutions [11] are provided as references). When ∆T = 1, the results from GRAPE have very low F(T)/T, so A3C always outperforms GRAPE. However, for ∆T = 0.1, A3C does not possess considerable advantages: for 0 ≤ ∆ω ≲ 0.4, the A3C results have lower F(T)/T than GRAPE, albeit very close, while for ∆ω ≳ 0.4 the A3C results are only slightly higher than GRAPE. These calculations therefore suggest that the transferability of our method is superior to that of GRAPE in most situations, in particular for cases with a larger time step (∆T). Nevertheless, in some situations, usually associated with smaller ∆T, our method does not provide considerable improvement. One therefore has to be judicious in choosing the appropriate method for a specific problem. For example, if transferability is not needed, GRAPE may be more appropriate for pulse sequences with smaller time steps. On the other hand, if the pulse sequences have larger time steps, non-boxcar pulse shapes are desired, or transferability is important, the A3C method is preferable.