Article | Open | Published:

# Coherent transport of quantum states by deep reinforcement learning

## Abstract

Some problems in physics can be handled only after a suitable ansatz solution has been guessed, proving to be resilient to generalization. The coherent transport of a quantum state by adiabatic passage through an array of semiconductor quantum dots is an excellent example of such a problem, where it is necessary to introduce a so-called counterintuitive control sequence. Instead, the deep reinforcement learning (DRL) technique has proven to be able to solve very complex sequential decision-making problems, despite a lack of prior knowledge. We show that DRL discovers a control sequence that outperforms the counterintuitive control sequence. DRL can even discover novel strategies when realistic disturbances affect an ideal system, such as detuning or when dephasing or losses are added to the master equation. DRL is effective in controlling the dynamics of quantum states and, more generally, whenever an ansatz solution is unknown or insufficient to effectively treat the problem.

## Introduction

Some problems in physics are solved as a result of the discovery of an ansatz solution, namely a successful test guess, but unfortunately there is no general method to generate one. Recently, machine learning has increasingly proved to be a viable tool for modeling hidden features and effective rules in complex systems. Among the classes of machine learning algorithms, deep reinforcement learning (DRL)1 is providing some of the most spectacular results due to its ability to identify strategies for achieving a goal in a complex space of solutions without prior knowledge of the system2,3,4,5,6,7. Contrary to supervised learning, which has already been applied to quantum systems, such as in the determination of high-fidelity gates and the optimization of quantum memories by dynamic decoupling8, DRL has only very recently been proposed for the control of quantum systems9,10,11,12,13,14,15,16, along with a strictly quantum reinforcement learning implementation14,17. To show the power of DRL, we apply DRL to the problem of coherent transport by adiabatic passage (CTAP) where an electron (encoding the quantum state) is transferred through an array of quantum dots. The ansatz solution for CTAP is notoriously called counterintuitive because of its nonobvious barrier control gate pulse sequence. During coherent adiabatic passage, an electron spends ideally no time in the central quantum dot, because of the simultaneous modulation of the coupling between the dots that suitably drives the trajectory through the Hilbert space18,19,20,21,22. The system moves from an initial equilibrium condition to a different one, namely the one where only the last dot of the array is populated. By exploiting such an ansatz solution, consisting of applying the barrier control gates between the dots in a “reversed order” with respect to what intuition would naturally suggest, the process displays truly quantum mechanical behavior, provided that the array consists of an odd number of dots. Researchers have already explored silicon-based quantum information processing architectures23,24,25, including the coherent transport of multiple-spin qubits into double quantum dots by adiabatic passage22, heuristic search methods, such as genetic algorithms, to find a universal set of quantum logic gates26,27, and the application of DRL to classical systems28,29,30.

Here, we demonstrate that DRL implemented in a compact neural network can, first of all, autonomously discover an analog of the counterintuitive gate pulse sequence without any prior knowledge, therefore finding a control path in a problem whose solution is far from the equilibrium of the initial conditions. More importantly, this method can outperform the previously introduced analytical solutions in terms of processing speed and when the system deviates from ideal conditions, which are here represented by the imperfect tuning of the ground states of the quantum dots, dephasing and losses. Under such conditions, no analytical approach exists to the best of our knowledge. In our approach, we exploit trust region policy optimization (TRPO)31 to handle the CTAP problem. First, we compare the results discovered by the artificial intelligence algorithm with the ansatz solution given in the literature. Next, we apply the method to solve the system when the ground states of the quantum dots are detuned and when the system is perturbed by an interaction with uncontrollable degrees of freedom of the surrounding environment. This latter condition results in dephasing and loss terms in the master equation describing the system, for which there exists no analytical method. Similar to the case of artificial intelligence learning in the classical Atari framework2, in our approach, the DRL agent interacts with a QuTIP32 simulation, which plays the role of the RL environment, by implementing the master equation of the system. The RL agent exploits the information retrieved from the feedback in terms of the temporal evolution of the population of the dots. As a further advantage of such an approach, a 2-step temporal Bayesian network (2TBN) analysis can identify which parameters of the system influence the process to a greater extent.

## Results

### Deep reinforcement learning by actor-critic neural networks

Reinforcement learning (RL) is a set of techniques used to learn behavior in sequential decision-making problems when no prior knowledge about the system dynamics is available or when the control problem is too complex for classical optimal-control algorithms. RL methods can be roughly classified into three main categories: value-based, policy-based, and actor-critic methods1. Recently, actor-critic methods have proven to be successful in solving complex continuous control problems33.

The idea behind actor-critic methods is to use two parametric models (e.g., neural networks) to represent both the policy (actor) and the value function (critic). The actor decides in each state of the system which action to execute, while the critic learns the value (utility) of taking each action in each state (see Supplementary Note 1). Following the critic’s advice, the actor modifies the parameters of its policy to improve the performance. Among the many actor-critic methods available in the literature, we selected the TRPO algorithm31 (Supplementary Note 2) to find an optimal policy of control pulses. The choice of TRPO is motivated both by its excellent performance on a wide variety of tasks and by the relative simplicity of tuning its hyperparameters31 (see the Supplementary Note 3).

### Coherent spatial adiabatic passage of qubits

CTAP is the solid-state version of a method developed for stimulated Raman adiabatic passage21,34, which is relevant for instance in quantum information processing architectures that shuttle a qubit from one location to another, where attention is paid to minimize the information loss during transport. In solid-state quantum devices based on either silicon35 or gallium arsenide36, the qubit can be encoded, for instance, into spin states of either excess electron(s) or hole(s) in quantum dots24. CTAP was originally developed for single-electron states in single-occupied quantum dots, but it can be also extended to more complex spin states, such as hybrid qubits based on spin triplets22. If one imagines employing, for instance, an array of dopants in silicon37,38,39, a reasonable interdopant spacing is of the order of 20 nm, and a hopping time of 100 ps40 can be achieved. Adiabatic passage requires control pulses with a bandwidth and order of magnitude or two lower than the hopping time, which can be managed by conventional electronics41.

### Steering a quantum system initially far from the final equilibrium

To demonstrate the exploitation of RL, we start by the simplest case of CTAP across a chain of three identical quantum dots. The RL architecture is depicted in Fig. 1a. The simulation of the physical system that supports CTAP consists of an RL environment E that receives as input the updated values of the parameters (in our case the control sequence Ωi,i + 1 with i = 1, 2 between the adjacent ith and (i + 1)th dots) that reflect the action of the control gates as calculated by the agent A according to the policy π. In turn, the RL environment E computes the new RL state (here expressed in terms of the density matrix of the triple quantum dot device, see Supplementary Fig. 1) and provides feedback to agent A. Agent A calculates the next input parameters after evaluating the effects of the previous input according to a reward function rt, which is expressed in terms of the system state. It is worth noting that the definition of an RL environment (a black box that receives an action as input and outputs a result) is very different from the usual definition of a physical environment, which corresponds to an external quantum system that interacts with the quantum system under investigation. More explicitly, the ground state of each quantum dot is tuned with respect to the others by external top metal gates (not shown for simplicity in the sketch of the RL environment E in Fig. 1a), while the coupling between two neighboring quantum dots is in turn controlled by additional barrier control gates. The idealized physical system is prepared so that the ground states of the three quantum dots have the same energy. As the reference energy is arbitrary, without loss of generality we can set E1 = E2 = E3 = 0. The Hamiltonian therefore reads:

$$H(t) = \left( {\begin{array}{*{20}{c}} 0 & { - {\it{\Omega }}_{12}(t)} & 0 \\ { - {\it{\Omega }}_{12}(t)} & 0 & { - {\it{\Omega }}_{23}(t)} \\ 0 & { - {\it{\Omega }}_{23}(t)} & 0 \end{array}} \right).$$
(1)

One of the three eigenstates is referred as a “dark state”, and it is expressed as a function of the state of the first and third dots only. The dark state reads:

$$\left| {D_0(t)} \right\rangle = \cos \theta _1(t)\left| 1 \right\rangle - \sin \theta _1(t)\left| 3 \right\rangle ,$$
(2)

where θ1(t) = arctan(Ω12(t)/Ω23(t)). A suitable time evolution of the values of the control sequence Ω12(t) and Ω23(t) between the dots allows to transform |D0(t)〉 from |1〉 at t = 0 to the final target state |3〉 at t = tmax. If the Hamiltonian is prepared in |D0〉 at t = 0, by choosing θ1(t) = 0 (so that |D0(0)〉 = |1〉), it will remain in the same eigenstate if the adiabaticity criterion is met; that is, $$\left| {E_0 - E_ \pm (t)} \right| \gg \left| {\left\langle {\dot{D}_0(t)|D_ \pm (t)} \right\rangle } \right|$$, where the |D±(t)〉 eigenstates are explicitly given in the Supplementary Note 4, and E±(t) are the corresponding eigenenergies. The effectiveness of the pulse sequence of the barrier control gate, which is reflected in the control sequence Ωi,i + 1(t), is addressed by taking F(t) = ρ33(t) as the fidelity and maximizing it while trying to keep the population ρ22(t) constant at zero, where ρij, i, j {1, 2, 3} represents an element of the density matrix ρ. The time evolution is governed by the von Neumann equation for the density matrix ρ, namely:

$$\dot \rho = - \frac{i}{\hbar }[H,\rho ].$$
(3)

Notoriously, a suitable shaping of Ω12(t) and Ω23(t) as Gaussian pulses can achieve coherent transport with a high fidelity, if $$t_{{\rm{max}}} \ \gtrsim \frac{{10\pi }}{{{\it{\Omega }}_{{\rm{max}}}}}$$ (see Supplementary Note 4 and Supplementary Figs. 25). The two pulses must be applied in a so-called counterintuitive sequence, meaning that the first gate controlling Ω12(t) is operated after Ω23(t). Such pulse sequence drives the occupation of the first dot ρ11(tmax) to zero and the occupation of the last dot ρ33(t) to 1, while maintaining empty occupation of the central dot (ρ22(t) = 0 t [0, tmax])18. It is worth mentioning that recently a different ansatz combination of pulse shapes was proposed to speed up this process42, using the so-called shortcut to adiabaticity protocol43. Generally speaking, there is no analytical method to optimize the pulses, so further improvements will be based on still undiscovered ideas. Here is where the power of DRL comes into play.

### Deep reinforcement learning setup

The time evolution of the physical system can be mapped into a Markov decision process (MDP). An MDP is defined by a state space $${\cal{S}}$$, an action space $${\cal{A}}$$, a probability transition $$P(s_{t_{i + 1}}|s_{t_i},a_{t_i})$$ (where $$s_{t_{i + }},s_{t_{i + 1}} \in {\cal{S}}$$ and $$a_{t_i} \in {\cal{A}}$$), and a reward function r that is generally defined in the space $${\cal{S}} \times {\cal{A}}$$, but in our case $$r:{\cal{S}} \to {\Bbb R}$$. We define the discretized time step ti = iΔt, where i = {0, 1, … N} and $$\Delta t = \frac{{t_{{\rm{max}}}}}{N}$$, such that tN = tmax. It is noteworthy to clarify that, albeit the name is the same, the definition of the RL state $$s_{t_i}$$ differs from that of a quantum state. In this work, $$s_{t_i}$$ is defined as:

$$s_{t_i} = \left[ {|\rho _{nm}(t_i)|\,{\mathrm{for}}\,m,n \in \{ 1,2,3\} ,{\mkern 1mu} {\it{\Omega }}_{12}(t_{i - 1}),{\mkern 1mu} {\it{\Omega }}_{23}(t_{i - 1})} \right]$$
(4)

such that the state at each time step contains the instantaneous absolute values of the nine entries of the density matrix and both Ωij values evaluated at the previous time step. Instead, $$a_{t_i}$$ is defined as:

$$a_{t_i} = \left[ {{\it{\Omega }}_{12}(t_i),{\it{\Omega }}_{23}(t_i)} \right]$$
(5)

such that $$a_{t_i} \in [0,{\it{\Omega }}_{{\rm{max}}}] \times [0,{\it{\Omega }}_{{\rm{max}}}]$$. Due to the robust definition of the reward function that allows the agent to judge its performance, the neural network evolves in order to obtain the best temporal evolution of the coupling parameters Ωi,i + 1, ensuring that the electron population reaches the target state over time (Fig. 1b). The definition of the best reward function is certainly the most delicate choice in the whole model. The key features of the reward can be summarized by two properties: its generality and its expression as a function of the desired final state. Generality means that the reward should not contain specific information on the characteristics of the two pulses. Regarding the dependence on the desired final state, in our case the goal is to maximize ρ33 at the end of the time evolution. The structure of the reward function used in the simulations is:

$$r_{t_i} = ( - 1 + \rho _{33}\left( {t_i} \right) - \rho _{22}\left( {t_i} \right)) + A(t_i)\delta _{i,N} - B(t_i),$$
(6)

where A(ti) and B(ti) are the goal-reaching and punishment terms, respectively. We found that the following combinations of A(ti) and B(ti) work well:

$${\begin{array}{*{20}{l}} {} \hfill & {A(t_i) = \left\{ {\begin{array}{*{20}{c}} {1000} & {\rho _{{\mathrm{33}}}(t_i) \ > \ \rho _{33}^{{\rm{th}}}} \\ 0 & {{\mathrm{elsewhere}}} \end{array}} \right.,} \hfill \\ {} \hfill & {B(t_i) = \left\{ {\begin{array}{*{20}{l}} {100{\kern 1pt} \Theta (\rho _{22}(t_i) - 0.05)} \hfill & {{\hbox{Standard CTAP}}}\\ {e^{6\rho _{22}(t_i)}} \hfill & {{\hbox{With detuning, dephasing and loss}}}\end{array}} \right.} \hfill \end{array}}$$
(7)

The reward functions used in this research are fully accounted for in the Supplementary Note 5. The sum of the first three terms is nonpositive at each time step, so the agent will try to bring the sum to 0 by minimizing ρ22 and maximizing ρ33. We have observed that subtracting B(t) (e.g., punishing the electronic occupation in site 2) and awarding the attainment of the goal with the term A(t) improves the convergence of the learning. Furthermore, in some specific cases, we stop the agent at an intermediate episode if ρ33 is greater than an arbitrary threshold $$\rho _{33}^{{\rm{th}}}$$ for a certain number of time steps. This choice can help find fast pulses that achieve high-fidelity quantum information transport at the cost of a higher $$\rho _{22}^{{\rm{max}}}$$ with respect to the analytic pulses (more details are given in the Supplementary Note 5).

### Training of the agent

Figure 2 shows the best results achieved by the agent at various epochs. The agent is free of spanning the whole coupling range at each step. After the training, we smooth the output pulses Ωi,i + 1, and we run the simulation using the smoothed values, obtaining a slight improvement from the calculation using the original output values of the RL agent. Smoothing is achieved with a moving average of the pulses (more details in the Supplementary Note 6, Supplementary Fig. 6, and Supplementary Table 1). The three occupations ρii, i {1, 2, 3} shown in Fig. 2 refer to the smoothed pulse sequences. At the very beginning, the untrained agent tries random values with apparently low success, as the occupation oscillates between the first and the last dot over time. It is worth noting that despite this fluctuation, due to the reward function, the agent learns very quickly after only 45 epochs to always keep the occupation of the central dot below 0.5. After ~2000 epochs, the agent learns to stabilize the occupations of the first and last dots, while maintaining an empty occupation of the central dot. After approximately twice the time, a reversal of the populations of the first and last dots happens, even if they do not yet achieve the extremal values. Finally, after approximately four times the elapsed time, the agent achieves high-fidelity CTAP. Despite the stochastic nature of DRL, these epoch numbers remain consistent through different trainings (see the Supplementary Fig. 7). Notice that the pulse sequence is similar to the ansatz Gaussian counterintuitive pulse sequence as the second gate acting on Ω23 is operated first, but the shape of the two pulses is different. It is remarkable that the reward function implicitly tries to achieve the result as quickly as possible, resulting in a pulse sequence that is significantly faster than the analytical Gaussian case and comparable to the recent proposal of Ban et al.42. The relevant point is that the agent achieves such results irrespective of the actual terms of the Hamiltonian contributing to the master equation. Therefore, DRL can be applied straightforwardly to more complex cases for which there is no generalization of the ansatz solutions found for the ideal case, which we address in the next section.

### Deep reinforcement learning to overcome disturbances

We turn now our attention to the behavior of our learning strategy when applied to a nonideal scenario in which typical realistic conditions of semiconductor quantum dots are considered, to be compared with ideal transport (Fig. 3a). In particular, we discuss the results produced by DRL when the array is affected by detuning caused by energy differences in the ground states of the dots, dephasing and losses (Fig. 3b–d). These three effects exist, to different degrees, in any practical attempt to implement CTAP of an electron spin in quantum dots. The first effect is typically due to manufacturing defects44, while the last two effects emerge from the interaction of the dots with the surrounding environment45,46 involving charge and spin fluctuations47,48 and magnetic field noise49. Under such disturbances, neither analytical nor ansatz solutions are available. On the other hand, the robustness and generality of the RL approach can be exploited naturally since, from the point of view of the algorithm, it does not differ from the ideal case discussed above. Let us consider a system of N = 3 quantum dots with different energies Ei, i {1, 2, 3}. We define Δij = Ej − Ei, so that the Hamiltonian (1) can be written, without loss of generality, as

$$H(t) = \left( {\begin{array}{*{20}{c}} 0 & { - {\it{\Omega }}_{12}(t)} & 0 \\ { - {\it{\Omega }}_{12}(t)} & {\Delta _{12}} & { - {\it{\Omega }}_{23}(t)} \\ 0 & { - {\it{\Omega }}_{23}(t)} & {\Delta _{13}} \end{array}} \right).$$
(8)

Figure 3c refers to the particular choice Δ12 = Δ23 = 0.15Ωmax (a full 2D scan of both Δ12 and Δ13 and the relative pulses are shown in the Supplementary Figs. 8 and 9, respectively). In this case, DRL finds a solution that induces significantly faster transfer than that obtained with standard counterintuitive Gaussian pulses. Moreover, the latter solutions are not even able to achieve a fidelity comparable to that of the ideal case. Such speed is a typical feature of pulse sequences determined by DRL, satisfying the criterion of adiabaticity (see Supplementary Figs. 10 and 11). Besides the energy detuning between the quantum dots, in a real implementation, the dots interact with the surrounding environment. Since the microscopic details of such an interaction are unknown, its effects are taken into account through an effective master equation. A master equation of the Lindblad type with time-independent rates is adopted to grant sufficient generality while keeping a simple expression. To show the ability of DRL to mitigate disturbances, we consider, in particular, two major environmental effects consisting of decoherence and losses. The first environmental effect corresponds to a randomization of the relative phases of the electron states in the quantum dots, which results in a cancellation of the coherence terms, i.e., the off-diagonal elements of the density matrix in the position basis. The losses, instead, model the erasure of the quantum information carried by the electron/hole moving along the dots. In fact, while the carrier itself cannot be reabsorbed, the quantum information, here encoded as a spin state, can be changed by the interaction with some surrounding spin or by the magnetic field noise. In the presence of dephasing and losses, the system’s master equation is of the Lindblad type. Its general form is

$$\dot \rho = - \frac{i}{\hbar }\left[ {H,\rho } \right] + \mathop {\sum}\limits_n {\Gamma _n} \left( {A_n\rho A_n^\dagger - \frac{1}{2}\left\{ {A_n^\dagger A_n,\rho } \right\}} \right),$$
(9)

where Ak are the Lindblad operators, Γk are the associated rates, and {A, B} = AB + BA denotes the anticommutator.

We first considered each single dot being affected by dephasing and assumed an equal dephasing rate Γd for each dot. The master equation can be rewritten as

$$\dot \rho = - \frac{i}{\hbar }\left[ {H,\rho } \right] + \Gamma _d\left[ {\rho - {\mathrm{diag}}(\rho )} \right].$$
(10)

Figure 3c shows an example of the pulses determined by DRL and the corresponding dynamics for the choice Γd = 0.01Ωmax. CTAP is achieved with Gaussian pulses in a time t ≈ 10π/Ωmax at the cost of a significant occupation of the central dot (inset). The advantage brought by the DRL agent with respect to Gaussian pulses is manifest: DRL (solid lines) realizes population transfer in approximately half of the time required when analytic pulses (dashed lines) are employed. The occupation of the central dot is less than that achieved by Gaussian pulses. We now consider the inclusion of a loss term in the master equation. The loss refers to the information carried by the electron encoded by the spin. While the loss of an electron is a very rare event in quantum dots, requiring the sudden capture of a slow trap50,51,52, the presence of any random magnetic field and spin fluctuation around the quantum dots can modify the state of the transferred spin. Losses are described by the operators Γl|0〉〈k|, k {1, 2, 3}, where Γl is the loss rate, modeling the transfer of the amplitude from the i-th dot to an auxiliary vacuum state |0〉. Figure 3d shows the superior performance of DRL versus analytic Gaussian pulses in terms of the maximum value of ρ33 reached during the time evolution. Because of the reward function, DRL minimizes the transfer time so as to minimize the effect of losses. The explicit reward function for each case is given in the Supplementary Table 2. As a further generalization, we applied DRL to the passage through more than one intervening dot, an extension called a straddling CTAP scheme (or SCTAP)18,53. To exemplify a generalization to odd N > 3, we set N = 5 according to the sketch depicted in Fig. 4, and no disturbance was considered for simplicity (see the Supplementary Note 7 for further details on SCTAP). For five-state transfer, like the cases for three states and for any other odd number of quantum dots, there is one state with zero energy (see Supplementary Fig. 12). The realization of the straddling tunneling sequence is achieved by augmenting the original pulse sequence by straddling pulses, which are identical for all intervening tunneling rates Ωm. The straddling pulse involves the second and third control gates, as shown in Fig. 4a, b. The pulse sequence discovered by DRL is significantly different from the known sequence based on Gaussian pulses. It achieves population transfer in approximately one third of the time with only some occupation in the central dot (see Fig. 4c). It is remarkable that the reward function (see the Supplementary Note 5) achieves such a successful transfer without any requirement for the occupation of the intermediate dots ρi,i with (i {2, 3, 4}) as it only rewards the occupation of the last dot.

### Analysis of the relevant variables within the DRL framework

The advantages of employing DRL with respect to an ansatz solution is further increased by the possibility of determining the factors that are more relevant for the effectiveness of the final solution from the analysis of the neural network. In fact, the employment of the DRL algorithm enables an analysis of the state variables that are actually relevant for solving the control problems, like that discussed above. To select the variables needed to solve an MDP, we follow the approach presented by Peters et al.54,55,56. The idea is that a state variable is useful if it helps to explain either the reward function or the dynamics of the state variables that in turn are necessary to explain the reward function. Otherwise, the state variable is useless, and it can be discarded without affecting the final solution. To represent the dependencies between the variables of an MDP, we use a 2TBN57, described in Supplementary Note 8. In the graph of Fig. 5, there are three types of nodes: the gray diamond node represents the reward function, the circle nodes represent state variables, and the squares represent action variables. The nodes are arranged on three vertical layers: the first layer on the left includes the variables at time ti, the second layer includes the state variables at time ti + 1, and the third layer includes the node that represents the reward function. If a variable affects the value of another variable, we obtain a direct edge that connects the former to the latter. Figure 5 shows the 2TBN estimated from a dataset of 100,000 samples obtained from the standard case of ideal CTAP. The weights are quantified in the Supplementary Table 3. As expected from Eq. (6), the reward function depends only on the values of the variables ρ22 and ρ33. From the 2TBN, it emerges that the dynamics of these two state variables can be fully explained by knowing their values at the previous step, the values of the two action variables Ω12 and Ω23 and the actions taken in the previous time step (stored in the variables $${\it{\Omega }}_{12}^\prime$$ and $${\it{\Omega }}_{23}^\prime$$). All other state variables do not appear and therefore can be ignored for the computation of the optimal control policy. This finding matches the expectation from the constraints of the physical model that coherences are not directly involved in the dynamics and that ρ11 is linearly dependent on ρ22 and ρ33 as the trace of the density matrix is constant. To confirm this finding, the agent was successfully trained in the standard case of ideal CTAP by using only the input values (ρ22, ρ33, Ω12, and Ω23). Furthermore, the size of the hidden layer H1 could be reduced in this case from 64 to 4 with the same convergence of the total reward as a function of the episodes or an even faster convergence by reducing the batch size from 20 to 10 (see the Supplementary Figs. 13 and 14). In the cases of detuning and dephasing, the corresponding 2TBNs are more complex, since the dynamics of ρ22 and ρ33 are affected also by the values of the other elements of the matrix ρ. This fact is also apparent in the changes carried by the Hamiltonian, as the trace is not constant when losses are present and dephasing has a direct effect on the coherence.

## Discussion

Neural networks can discover control sequences for a tunable quantum system whose time evolution starts far from its final equilibrium, without any prior knowledge. Contrary to the employment of special ansatz solutions, we have shown that DRL discovers novel sequences of control operations to achieve a target state, regardless of the possible deviations from the ideal conditions. DRL is achieved irrespective of previously discovered ansatz solutions, and it applies when such solutions are unknown. The control sequence found by DRL can also provide insights into novel analytical shapes for optimal pulses. The use of a pretrained neural network as a starting point to identify the solution of a master equation with similar parameters further reduces the computation time by one order of magnitude (see Supplementary Note 9). Our investigation indicates that a key factor is the appropriate definition of the reward function that deters the system from occupying the central quantum dot and rewards the occupation of the last quantum dot. To apply DRL to a practical quantum system known for its counterintuitive nature, we have studied quantum state transport across arrays of quantum dots, including sources of disturbances such as energy level detuning, dephasing, and loss. In all cases, the solutions found by the agent outperform the known solution in terms of either speed or fidelity or both. Both the pretraining of the network and 2TBN analysis—by reducing the computational effort—contribute to accelerating the learning process. In general, we have shown that neural-network-based DRL provides a general tool for controlling physical systems by circumventing the issue of finding ansatz solutions when neither a straightforward method nor a brute force approach is possible. A comparison with other control strategies of quantum systems16,58,59 will be the objective of future work. As a concluding remark, we observe that in this work we have modeled the effects of the environment or noisy control fields by master equations in the Lindblad form with time-independent coefficients; by construction, the same DRL strategy can be used to determine the optimal pulse sequence in the presence of non-Markovian noise/effects originating from an interaction of the system with structured environments. In this case, an efficient way to compute the time evolution of the open system, which is required to determine the value of the pulses at each time step, is needed. Depending on the nature (bosonic, fermionic) and features (spectral density) of the environment, different approximate or numerically exact strategies ranging from Bloch-Redfield or time-convolutionless master equations60 to path integrals61 or chain-mapping techniques (reaction-coordinate62 or TEDOPA63,64) could be exploited.

## Methods

### Master equation for the Hamiltonian including pure dephasing

Pure dephasing65 is the simplest model that accounts for an environmental interaction of an otherwise closed system. The corresponding Lindblad operators are diagonal and have the form:

$$L_n = \sqrt {\gamma _n} \left| n \right\rangle \left\langle n \right|.$$
(11)

In the case of γ1 = γ2 = γ3 = Γd, the master equation becomes:

$$\dot \rho = - \frac{i}{\hbar }\left[ {H,\rho } \right] + \Gamma _d\left[ {\rho - diag(\rho )} \right].$$
(12)

If the Hamiltonian is time-independent, then

$$\frac{d}{{dt}}\rho _{ii} = 0$$
(13)

for every i, so the electronic population remains unchanged in the presence of pure dephasing. This implies66 that the energy of the system remains unchanged during the evolution, since it cannot be changed by environment.

### Neural network

The neural network consists of a multilayered feed-forward network of (Ni, H1, [H2,]No) neurons, where Ni = 11; it includes the nine elements of the density matrix ρ and the control sequence Ωij, No = 2 are the two updated values of the control sequence spanning a continuous range between 0 and an arbitrary value Ωmax chosen according to experimental considerations, and Hi the number of neurons of the ith hidden layer.

Different setups for the neural network have been employed. Nonetheless, all the neural networks are feedforward and fully connected, with an ReLU activation function for the neurons. For the implementation of the neural networks and their training, Tensorforce67 is employed. For the RL algorithm adopted for this work, TRPO, the hyperparameters are set as summarized in the Supplementary Table 4. The optimal number of neurons of the hidden layer peaks at ~25–26 (see the Supplementary Fig. 15).

### Software

The QuTIP routine has been run in parallel by using GNU Parallel68.

## Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

## Code availability

The code and the algorithm used in this study are available from the corresponding author upon reasonable request.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## References

1. 1.

Sutton, R. S. et al. Reinforcement Learning: An Introduction (MIT Press, Cambridge, Massachusetts, 1998).

2. 2.

Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529 (2015).

3. 3.

Melnikov, A. A. et al. Active learning machine learns to create new quantum experiments. Proc. Natl Acad. Sci. USA 115, 1221–1226 (2018).

4. 4.

Nautrup, H. P., Delfosse, N., Dunjko, V., Briegel, H. J. & Friis, N. Optimizing quantum error correction codes with reinforcement learning. https://arxiv.org/abs/1812.08451 (2018).

5. 5.

Sweke, R., Kesselring, M. S., van Nieuwenburg, E. P. & Eisert, J. Reinforcement learning decoders for fault-tolerant quantum computation. https://arxiv.org/abs/1810.07207 (2018).

6. 6.

Colabrese, S., Gustavsson, K., Celani, A. & Biferale, L. Flow navigation by smart microswimmers via reinforcement learning. Phys. Rev. Lett. 118, 158004 (2017).

7. 7.

Prati, E. Quantum neuromorphic hardware for quantum artificial intelligence. J. Phys. Confer. Ser. 880, (2017).

8. 8.

August, M. & Ni, X. Using recurrent neural networks to optimize dynamical decoupling for quantum memory. Phys. Rev. A 95, 012335 (2017).

9. 9.

Dong, D., Chen, C., Tarn, T. J., Pechen, A. & Rabitz, H. Incoherent control of quantum systems with wavefunction-controllable subspaces via quantum reinforcement learning. IEEE Trans. Syst. Man Cybern. Part B Cybern. 38, 957–962 (2008).

10. 10.

Chen, C., Dong, D., Li, H., Chu, J. & Tarn, T. Fidelity-based probabilistic q-learning for control of quantum systems. IEEE Trans. Neural Netw. Learn. Syst. 25, 920–933 (2014).

11. 11.

Fösel, T., Tighineanu, P., Weiss, T. & Marquardt, F. Reinforcement learning with neural networks for quantum feedback. Phys. Rev. X 8, 031084 (2018).

12. 12.

August, M. & Hernández-Lobato, J. M. Taking gradients through experiments: LSTMs and memory proximal policy optimization for black-box quantum control. https://arxiv.org/abs/1802.04063 (2018).

13. 13.

Niu, M. Y., Boixo, S., Smelyanskiy, V. N. & Neven, H. Universal quantum control through deep reinforcement learning. In AIAA Scitech 2019 Forum 0954 (2019).

14. 14.

Albarrán-Arriagada, F., Retamal, J. C., Solano, E. & Lamata, L. Measurement-based adaptation protocol with quantum reinforcement learning. Phys. Rev. A 98, 042315 (2018).

15. 15.

Zhang, X.-M., Wei, Z., Asad, R., Yang, X.-C. & Wang, X. When reinforcement learning stands out in quantum control? A comparative study on state preparation. https://arxiv.org/abs/1902.02157 (2019).

16. 16.

Bukov, M. et al. Reinforcement learning in different phases of quantum control. Phys. Rev. X 8, 031086 (2018).

17. 17.

Yu, S. et al. Reconstruction of a photonic qubit state with reinforcement learning. Adv. Quantum Technol. 0, 1800074 (2019).

18. 18.

Greentree, A. D., Cole, J. H., Hamilton, A. R. & Hollenberg, L. C. Coherent electronic transfer in quantum dot systems using adiabatic passage. Phys. Rev. B 70, 1–6 (2004).

19. 19.

Cole, J., Greentree, A., C. L. Hollenberg, L. & Das Sarma, S. Spatial adiabatic passage in a realistic triple well structure. Phys. Rev. B 77, 235418 (2008).

20. 20.

Greentree, A. D. & Koiller, B. Dark-state adiabatic passage with spin-one particles. Phys. Rev. A 90, 012319 (2014).

21. 21.

Menchon-Enrich, R. et al. Spatial adiabatic passage: a review of recent progress. Rep. Prog. Phys. 79, 074401 (2016).

22. 22.

Ferraro, E., De Michielis, M., Fanciulli, M. & Prati, E. Coherent tunneling by adiabatic passage of an exchange-only spin qubit in a double quantum dot chain. Phys. Rev. B 91, 075435 (2015).

23. 23.

Rotta, D., De Michielis, M., Ferraro, E., Fanciulli, M. & Prati, E. Maximum density of quantum information in a scalable cmos implementation of the hybrid qubit architecture. Quantum Inf. Process. 15, 2253–2274 (2016).

24. 24.

Rotta, D., Sebastiano, F., Charbon, E. & Prati, E. Quantum information density scaling and qubit operation time constraints of CMOS silicon-based quantum computer architectures. npj Quantum Inf. 3, 26 (2017).

25. 25.

Prati, E., Rotta, D., Sebastiano, F. & Charbon, E. From the quantum moore’s law toward silicon based universal quantum computing. In 2017 IEEE International Conference on Rebooting Computing (ICRC) 1–4 (IEEE, 2017).

26. 26.

Ferraro, E., De Michielis, M., Mazzeo, G., Fanciulli, M. & Prati, E. Effective hamiltonian for the hybrid double quantum dot qubit. Quantum Inf. Process. 13, 1155–1173 (2014).

27. 27.

Michielis, M. D., Ferraro, E., Fanciulli, M. & Prati, E. Universal set of quantum gates for double-dot exchange-only spin qubits with intradot coupling. J. Phys. A Math. Theor. 48, 065304 (2015).

28. 28.

Bonarini, A., Caccia, C., Lazaric, A. & Restelli, M. Batch reinforcement learning for controlling a mobile wheeled pendulum robot. In IFIP International Conference on Artificial Intelligence in Theory and Practice 151–160 (Springer, 2008).

29. 29.

Tognetti, S., Savaresi, S. M., Spelta, C. & Restelli, M. Batch reinforcement learning for semi-active suspension control. In Control Applications, (CCA) & Intelligent Control, (ISIC), 2009 IEEE 582–587 (IEEE, 2009).

30. 30.

Castelletti, A., Pianosi, F. & Restelli, M. A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run. Water Resour. Res. 49, 3476–3486 (2013).

31. 31.

Schulman, J., Levine, S., Abbeel, P., Jordan, M. & Moritz, P. Trust region policy optimization. In International Conference on Machine Learning 1889–1897 (2015).

32. 32.

Johansson, J., Nation, P. & Nori, F. Qutip: an open-source python framework for the dynamics of open quantum systems. Comput. Phys. Commun. 183, 1760–1772 (2012).

33. 33.

Duan, Y., Chen, X., Houthooft, R., Schulman, J. & Abbeel, P. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning 1329–1338 (2016).

34. 34.

Vitanov, N. V., Halfmann, T., Shore, B. W. & Bergmann, K. Laser-induced population transfer by adiabatic passage techniques. Annu. Rev. Phys. Chem. 52, 763–809 (2001).

35. 35.

Maurand, R. et al. A CMOS silicon spin qubit. Nat. Commun. 7, 13575 (2016).

36. 36.

Bluhm, H. et al. Dephasing time of gaas electron-spin qubits coupled to a nuclear bath exceeding 200 μs. Nat. Phys. 7, 109 (2011).

37. 37.

Prati, E., Hori, M., Guagliardo, F., Ferrari, G. & Shinada, T. Anderson–Mott transition in arrays of a few dopant atoms in a silicon transistor. Nat. Nanotechnol. 7, 443 (2012).

38. 38.

Prati, E., Kumagai, K., Hori, M. & Shinada, T. Band transport across a chain of dopant sites in silicon over micron distances and high temperatures. Sci. Rep. 6, 19704 (2016).

39. 39.

Achilli, S. et al. GeVn complexes for silicon-based room-temperature single-atom nanoelectronics. Sci. Rep. 8, 18054 (2018).

40. 40.

Hollenberg, L. C. L., Greentree, A. D., Fowler, A. G. & Wellard, C. J. Two-dimensional architectures for donor-based quantum computing. Phys. Rev. B 74, 045311 (2006).

41. 41.

Homulle, H. et al. A reconfigurable cryogenic platform for the classical control of quantum processors. Rev. Sci. Instrum. 88, 045103 (2017).

42. 42.

Ban, Y., Chen, X. & Platero, G. Fast long-range charge transfer in quantum dot arrays. Nanotechnology 29, 505201 (2018).

43. 43.

Torrontegui, E. et al. Chapter 2—shortcuts to adiabaticity (Arimondo, E., Berman, P. R., & Lin, C. C., eds) Advances in Atomic, Molecular, and Optical Physics. Vol. 62, 117–169 (Academic Press, Amsterdam, NL, 2013).

44. 44.

Jehl, X. et al. Mass production of silicon mos-sets: can we live with nano-devices variability? Procedia Comput. Sci. 7, 266–268 (2011).

45. 45.

Breuer, H.-P. & Petruccione, F. The Theory of Open Quantum Systems. (Oxford University Press, Oxford, 2002).

46. 46.

Clément, N., Nishiguchi, K., Fujiwara, A. & Vuillaume, D. One-by-one trap activation in silicon nanowire transistors. Nat. Commun. 1, 92 (2010).

47. 47.

Pierre, M. et al. Background charges and quantum effects in quantum dots transport spectroscopy. Eur. Phys. J. B 70, 475–481 (2009).

48. 48.

Kuhlmann, A. V. et al. Charge noise and spin noise in a semiconductor quantum device. Nat. Phys. 9, 570 (2013).

49. 49.

Prati, E. & Shinada, T. Single-Atom Nanoelectronics 5–39 (CRC Press, Singapore, 2013).

50. 50.

Prati, E., Fanciulli, M., Ferrari, G. & Sampietro, M. Giant random telegraph signal generated by single charge trapping in submicron n-metal-oxide-semiconductor field-effect transistors. J. Appl. Phys. 103, 123707 (2008).

51. 51.

Prati, E. The finite quantum grand canonical ensemble and temperature from single-electron statistics for a mesoscopic device. J. Stat. Mech. Theory Exp. 2010, P01003 (2010).

52. 52.

Prati, E., Belli, M., Fanciulli, M. & Ferrari, G. Measuring the temperature of a mesoscopic electron system by means of single electron statistics. Appl. Phys. Lett. 96, 113109 (2010).

53. 53.

Malinovsky, V. & J. Tannor, D. Simple and robust extension of the stimulated raman adiabatic passage technique to n-level systems. Phys. Rev. A 56, 4929–4937 (1997).

54. 54.

Peters, J. & Schaal, S. Natural actor-critic. Neurocomputing 71, 1180–1190 (2008).

55. 55.

Peters, J., Mülling, K. & Altun, Y. Relative entropy policy search. In Twenty-Fourth AAAI Conference on Artificial Intelligence 1607–1612 (Atlanta, 2010).

56. 56.

Castelletti, A., Galelli, S., Restelli, M. & Soncini-Sessa, R. Tree-based variable selection for dimensionality reduction of large-scale control systems. In Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2011 IEEE Symposium on 62–69 (IEEE, 2011).

57. 57.

Koller, D., Friedman, N. & Bach, F. Probabilistic Graphical Models: Principles and Techniques (MIT press, Cambridge, Massachusetts, 2009).

58. 58.

Koch, C. P. Controlling open quantum systems: tools, achievements, and limitations. J. Phys. Condens. Matter 28, 213001 (2016).

59. 59.

Lamata, L. Basic protocols in quantum reinforcement learning with superconducting circuits. Sci. Rep. 7, 1609 (2017).

60. 60.

Breuer, H.-P. & Petruccione, F. The theory of open quantum systems. (Oxford University, New York, 2006).

61. 61.

Nalbach, P., Eckel, J. & Thorwart, M. Quantum coherent biomolecular energy transfer with spatially correlated fluctuations. New J. Phys. 12, 065043 (2010).

62. 62.

Hughes, K. H., Christ, C. D. & Burghardt, I. Effective-mode representation of non-markovian dynamics: a hierarchical approximation of the spectral density. ii. Application to environment-induced nonadiabatic dynamics. J. Chem. Phys. 131, 124108 (2009).

63. 63.

Chin, A. W., Rivas, Á., Huelga, S. F. & Plenio, M. B. Exact mapping between system-reservoir quantum models and semi-infinite discrete chains using orthogonal polynomials. J. Math. Phys. 51, 092109 (2010).

64. 64.

Tamascelli, D., Smirne, A., Huelga, S. F. & Plenio, M. B. Efficient simulation of finite-temperature open quantum systems. https://arxiv.org/abs/1811.12418 (2018).

65. 65.

Taylor, J. M. et al. Relaxation, dephasing, and quantum control of electron spins in double quantum dots. Phys. Rev. B 76, 035315 (2007).

66. 66.

Tempel, D. G. & Aspuru-Guzik, A. Relaxation and dephasing in open quantum systems time-dependent density functional theory: properties of exact functionals from an exactly-solvable model system. Chem. Phys. 391, 130–142 (2011).

67. 67.

Schaarschmidt, M., Kuhnle, A. & Fricke, K. Tensorforce: a tensorflow library for applied reinforcement learning. https://github.com/reinforceio/tensorforce (2017).

68. 68.

Tange, O. et al. Gnu parallel-the command-line power tool. USENIX Mag. 36, 42–47 (2011).

## Acknowledgements

E.P. gratefully acknowledges the support of NVIDIA Corporation for the donation of the Titan Xp GPU used for this research.

## Author information

R.P. developed the simulation and implemented the algorithms, D.T. elaborated on the Hamiltonian framework and the master equation formalism, M.R. operated the reinforcement learning analysis, and E.P. conceived and coordinated the research. All the authors discussed and contributed to the writing of the manuscript.

### Competing interests

The authors declare no competing interests.

Correspondence to Enrico Prati.