Introduction

Some problems in physics are solved through the discovery of an ansatz solution, namely an educated guess that proves successful, but unfortunately there is no general method to generate one. Recently, machine learning has increasingly proved to be a viable tool for modeling hidden features and effective rules in complex systems. Among the classes of machine learning algorithms, deep reinforcement learning (DRL)1 is providing some of the most spectacular results due to its ability to identify strategies for achieving a goal in a complex space of solutions without prior knowledge of the system2,3,4,5,6,7. Contrary to supervised learning, which has already been applied to quantum systems, such as in the determination of high-fidelity gates and the optimization of quantum memories by dynamic decoupling8, DRL has only very recently been proposed for the control of quantum systems9,10,11,12,13,14,15,16, along with a strictly quantum reinforcement learning implementation14,17. To show the power of DRL, we apply it to the problem of coherent transport by adiabatic passage (CTAP), where an electron (encoding the quantum state) is transferred through an array of quantum dots. The ansatz solution for CTAP is notoriously called counterintuitive because of its nonobvious barrier control gate pulse sequence. During coherent adiabatic passage, an electron ideally spends no time in the central quantum dot, because the simultaneous modulation of the coupling between the dots suitably drives the trajectory through the Hilbert space18,19,20,21,22. The system moves from an initial equilibrium condition to a different one, namely the one in which only the last dot of the array is populated. By exploiting such an ansatz solution, which consists of applying the barrier control gates between the dots in a “reversed order” with respect to what intuition would naturally suggest, the process displays truly quantum mechanical behavior, provided that the array consists of an odd number of dots. Researchers have already explored silicon-based quantum information processing architectures23,24,25, the coherent transport of multiple-spin qubits into double quantum dots by adiabatic passage22, heuristic search methods such as genetic algorithms to find a universal set of quantum logic gates26,27, and the application of DRL to classical systems28,29,30.

Here, we demonstrate that DRL implemented in a compact neural network can, first of all, autonomously discover an analog of the counterintuitive gate pulse sequence without any prior knowledge, thereby finding a control path in a problem whose solution lies far from the equilibrium of the initial conditions. More importantly, this method can outperform the previously introduced analytical solutions in terms of processing speed, and it remains effective when the system deviates from ideal conditions, here represented by the imperfect tuning of the ground states of the quantum dots, dephasing, and losses. Under such conditions, no analytical approach exists to the best of our knowledge. In our approach, we exploit trust region policy optimization (TRPO)31 to handle the CTAP problem. First, we compare the results discovered by the artificial intelligence algorithm with the ansatz solution given in the literature. Next, we apply the method to solve the system when the ground states of the quantum dots are detuned and when the system is perturbed by an interaction with uncontrollable degrees of freedom of the surrounding environment. This latter condition results in dephasing and loss terms in the master equation describing the system, for which there exists no analytical method. As in the case of artificial intelligence learning in the classical Atari framework2, in our approach the DRL agent interacts with a QuTIP32 simulation that plays the role of the RL environment by implementing the master equation of the system. The RL agent exploits the information retrieved from this feedback in the form of the temporal evolution of the populations of the dots. As a further advantage of such an approach, a 2-step temporal Bayesian network (2TBN) analysis can identify which parameters of the system influence the process most strongly.

Results

Deep reinforcement learning by actor-critic neural networks

Reinforcement learning (RL) is a set of techniques used to learn behavior in sequential decision-making problems when no prior knowledge about the system dynamics is available or when the control problem is too complex for classical optimal-control algorithms. RL methods can be roughly classified into three main categories: value-based, policy-based, and actor-critic methods1. Recently, actor-critic methods have proven to be successful in solving complex continuous control problems33.

The idea behind actor-critic methods is to use two parametric models (e.g., neural networks) to represent both the policy (actor) and the value function (critic). The actor decides in each state of the system which action to execute, while the critic learns the value (utility) of taking each action in each state (see Supplementary Note 1). Following the critic’s advice, the actor modifies the parameters of its policy to improve the performance. Among the many actor-critic methods available in the literature, we selected the TRPO algorithm31 (Supplementary Note 2) to find an optimal policy of control pulses. The choice of TRPO is motivated both by its excellent performance on a wide variety of tasks and by the relative simplicity of tuning its hyperparameters31 (see the Supplementary Note 3).
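To make the actor-critic idea concrete, the following minimal sketch implements an actor and a critic as small feedforward networks with a plain policy-gradient update. It is an illustration only: it omits the trust-region (KL-constrained) step that distinguishes TRPO31, and it uses PyTorch rather than the Tensorforce implementation employed in this work (see Methods); the network sizes, learning rate, and Gaussian policy parameterization are illustrative assumptions.

# Minimal actor-critic sketch (illustrative only; the paper uses TRPO via Tensorforce).
# A vanilla policy-gradient update is shown instead of TRPO's trust-region step.
import torch
import torch.nn as nn

obs_dim, act_dim = 11, 2          # 9 density-matrix entries + 2 previous pulses; 2 new pulses

actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
log_std = torch.zeros(act_dim, requires_grad=True)       # spread of the Gaussian policy
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()) + [log_std], lr=3e-4)

def update(states, actions, returns):
    """One actor-critic update from a batch of (state, action, return) tensors."""
    values = critic(states).squeeze(-1)
    advantages = returns - values.detach()                # the critic's advice to the actor
    dist = torch.distributions.Normal(actor(states), log_std.exp())
    logp = dist.log_prob(actions).sum(-1)
    actor_loss = -(logp * advantages).mean()              # policy gradient
    critic_loss = (returns - values).pow(2).mean()        # value regression
    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()

In TRPO the actor update is additionally constrained so that the new policy stays within a fixed KL divergence of the old one, which is what makes the hyperparameters comparatively easy to tune.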

Coherent spatial adiabatic passage of qubits

CTAP is the solid-state version of a method developed for stimulated Raman adiabatic passage21,34, which is relevant, for instance, in quantum information processing architectures that shuttle a qubit from one location to another, where attention is paid to minimizing the information loss during transport. In solid-state quantum devices based on either silicon35 or gallium arsenide36, the qubit can be encoded, for instance, in the spin states of either excess electron(s) or hole(s) in quantum dots24. CTAP was originally developed for single-electron states in singly occupied quantum dots, but it can also be extended to more complex spin states, such as hybrid qubits based on spin triplets22. If one imagines employing, for instance, an array of dopants in silicon37,38,39, a reasonable interdopant spacing is of the order of 20 nm, and a hopping time of 100 ps40 can be achieved. Adiabatic passage requires control pulses with a bandwidth one or two orders of magnitude below the inverse hopping time, which can be managed by conventional electronics41.

Steering a quantum system initially far from the final equilibrium

To demonstrate the exploitation of RL, we start from the simplest case of CTAP across a chain of three identical quantum dots. The RL architecture is depicted in Fig. 1a. The simulation of the physical system that supports CTAP consists of an RL environment E that receives as input the updated values of the parameters (in our case the control sequence Ωi,i+1 with i = 1, 2 between the adjacent ith and (i + 1)th dots) that reflect the action of the control gates as calculated by the agent A according to the policy π. In turn, the RL environment E computes the new RL state (here expressed in terms of the density matrix of the triple quantum dot device, see Supplementary Fig. 1) and provides feedback to agent A. Agent A calculates the next input parameters after evaluating the effects of the previous input according to a reward function rt, which is expressed in terms of the system state. It is worth noting that the definition of an RL environment (a black box that receives an action as input and outputs a result) is very different from the usual definition of a physical environment, which corresponds to an external quantum system that interacts with the quantum system under investigation. More explicitly, the ground state of each quantum dot is tuned with respect to the others by external top metal gates (not shown for simplicity in the sketch of the RL environment E in Fig. 1a), while the coupling between two neighboring quantum dots is in turn controlled by additional barrier control gates. The idealized physical system is prepared so that the ground states of the three quantum dots have the same energy. As the reference energy is arbitrary, without loss of generality we can set E1 = E2 = E3 = 0. The Hamiltonian therefore reads:

$$H(t) = \left( {\begin{array}{*{20}{c}} 0 & { - {\it{\Omega }}_{12}(t)} & 0 \\ { - {\it{\Omega }}_{12}(t)} & 0 & { - {\it{\Omega }}_{23}(t)} \\ 0 & { - {\it{\Omega }}_{23}(t)} & 0 \end{array}} \right).$$
(1)

One of the three eigenstates is referred to as a “dark state”, as it is expressed in terms of the states of the first and third dots only. The dark state reads:

$$\left| {D_0(t)} \right\rangle = \cos \theta _1(t)\left| 1 \right\rangle - \sin \theta _1(t)\left| 3 \right\rangle ,$$
(2)

where θ1(t) = arctan(Ω12(t)/Ω23(t)). A suitable time evolution of the control sequence Ω12(t) and Ω23(t) between the dots allows one to transform |D0(t)〉 from |1〉 at t = 0 to the final target state |3〉 at t = tmax. If the system is prepared in |D0〉 at t = 0 by choosing θ1(0) = 0 (so that |D0(0)〉 = |1〉), it will remain in the same eigenstate provided the adiabaticity criterion is met; that is, \(\left| {E_0 - E_ \pm (t)} \right| \gg \left| {\left\langle {\dot{D}_0(t)|D_ \pm (t)} \right\rangle } \right|\), where the |D±(t)〉 eigenstates are explicitly given in Supplementary Note 4, and E±(t) are the corresponding eigenenergies. The effectiveness of the pulse sequence of the barrier control gates, which is reflected in the control sequence Ωi,i+1(t), is assessed by taking F(t) = ρ33(t) as the fidelity and maximizing it while keeping the population ρ22(t) as close to zero as possible, where ρij, i, j ∈ {1, 2, 3}, represents an element of the density matrix ρ. The time evolution is governed by the von Neumann equation for the density matrix ρ, namely:

$$\dot \rho = - \frac{i}{\hbar }[H,\rho ].$$
(3)

It is well known that a suitable shaping of Ω12(t) and Ω23(t) as Gaussian pulses can achieve coherent transport with high fidelity if \(t_{{\rm{max}}} \ \gtrsim \frac{{10\pi }}{{{\it{\Omega }}_{{\rm{max}}}}}\) (see Supplementary Note 4 and Supplementary Figs. 2–5). The two pulses must be applied in a so-called counterintuitive sequence, meaning that the first gate, controlling Ω12(t), is operated after Ω23(t). Such a pulse sequence drives the occupation of the first dot ρ11(tmax) to zero and the occupation of the last dot ρ33(tmax) to 1, while keeping the central dot empty (ρ22(t) = 0 ∀ t ∈ [0, tmax])18. It is worth mentioning that a different ansatz combination of pulse shapes was recently proposed to speed up this process42, using the so-called shortcut-to-adiabaticity protocol43. Generally speaking, there is no analytical method to optimize the pulses, so further improvements must rely on still undiscovered ideas. Here is where the power of DRL comes into play.
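As an illustration of the ideal dynamics described above, the sketch below integrates Eq. (3) for the three-dot Hamiltonian of Eq. (1) driven by counterintuitive Gaussian pulses, using QuTiP (the same library on which the RL environment of this work is built). The pulse widths and centers are illustrative assumptions rather than the exact analytical pulses of Supplementary Note 4, and the time-dependent coefficients follow the QuTiP 4 calling convention.

# Sketch: ideal three-dot CTAP with counterintuitive Gaussian pulses, solved with QuTiP.
# Pulse widths and centres below are illustrative assumptions (hbar = 1 units).
import numpy as np
from qutip import Qobj, basis, mesolve

omega_max = 1.0
t_max = 10 * np.pi / omega_max
sigma, delay = t_max / 8, t_max / 8            # assumed pulse width and separation

def omega_12(t, args=None):                    # applied second (counterintuitive order)
    return omega_max * np.exp(-((t - (t_max / 2 + delay)) ** 2) / (2 * sigma ** 2))

def omega_23(t, args=None):                    # applied first
    return omega_max * np.exp(-((t - (t_max / 2 - delay)) ** 2) / (2 * sigma ** 2))

# H(t) = -Omega_12(t)(|1><2| + h.c.) - Omega_23(t)(|2><3| + h.c.), Eq. (1)
H12 = Qobj(np.array([[0, -1, 0], [-1, 0, 0], [0, 0, 0]], dtype=float))
H23 = Qobj(np.array([[0, 0, 0], [0, 0, -1], [0, -1, 0]], dtype=float))
H = [[H12, omega_12], [H23, omega_23]]

rho0 = basis(3, 0) * basis(3, 0).dag()         # electron initially in dot 1
tlist = np.linspace(0, t_max, 100)
proj = [basis(3, k) * basis(3, k).dag() for k in range(3)]
result = mesolve(H, rho0, tlist, c_ops=[], e_ops=proj)   # Eq. (3): no dissipation
print("final rho_33 =", result.expect[2][-1])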

Fig. 1
figure 1

The deep reinforcement learning architecture to control the coherent transport by adiabatic passage. a The deep reinforcement learning (DRL) environment E can be represented by a linear array of quantum dots, with tunneling rates controlled by two gates indicated by Ω12 and Ω23. At each time step, the DRL environment can be modeled by a 3 × 3 density matrix that is employed as an input observation (the state) by agent A. In turn, agent A uses the observation to choose the action for the next time step by following a policy \(\pi (a_{t_i}|s_{t_i})\). This action brings E to a new state \(\rho _{t_{i + 1}} = \rho (t + \Delta t)\). Each action is punished or rewarded with a real-valued reward, indicated by rt. b Agent A is represented by a four-layer neural network, which acts as the policy π. At each time step, the network receives the 3 × 3 = 9 real values associated with the absolute values of the elements of the density matrix ρ and the two values of the gate-control pulses Ω12 and Ω23, for a total of 11 neurons in the input layer. The agent then computes the policy π and outputs the values of the two pulses that will be applied to the barrier control gates in the next time step (b, lower right). The starting and ending points of the highlighted segments can be connected by different functions of time, such as a moving average or a spline function, to smooth the step function generated by the discrete time steps. Finally, the physical simulation of the quantum dot array brings the system to a new state by updating the density matrix accordingly and returns rt to the agent. At t = tmax, the system reaches the final step (k = N), and the simulation ends

Deep reinforcement learning setup

The time evolution of the physical system can be mapped onto a Markov decision process (MDP). An MDP is defined by a state space \({\cal{S}}\), an action space \({\cal{A}}\), a transition probability \(P(s_{t_{i + 1}}|s_{t_i},a_{t_i})\) (where \(s_{t_i},s_{t_{i + 1}} \in {\cal{S}}\) and \(a_{t_i} \in {\cal{A}}\)), and a reward function r that is generally defined on the space \({\cal{S}} \times {\cal{A}}\), but in our case \(r:{\cal{S}} \to {\Bbb R}\). We define the discretized time step ti = iΔt, where i = {0, 1, … N} and \(\Delta t = \frac{{t_{{\rm{max}}}}}{N}\), such that tN = tmax. It is worth clarifying that, although the name is the same, the definition of the RL state \(s_{t_i}\) differs from that of a quantum state. In this work, \(s_{t_i}\) is defined as:

$$s_{t_i} = \left[ {|\rho _{nm}(t_i)|\,{\mathrm{for}}\,m,n \in \{ 1,2,3\} ,{\mkern 1mu} {\it{\Omega }}_{12}(t_{i - 1}),{\mkern 1mu} {\it{\Omega }}_{23}(t_{i - 1})} \right]$$
(4)

such that the state at each time step contains the instantaneous absolute values of the nine entries of the density matrix and both Ωij values evaluated at the previous time step. Instead, \(a_{t_i}\) is defined as:

$$a_{t_i} = \left[ {{\it{\Omega }}_{12}(t_i),{\it{\Omega }}_{23}(t_i)} \right]$$
(5)

such that \(a_{t_i} \in [0,{\it{\Omega }}_{{\rm{max}}}] \times [0,{\it{\Omega }}_{{\rm{max}}}]\). Guided by a reward function that allows the agent to judge its performance, the neural network evolves so as to obtain the best temporal evolution of the coupling parameters Ωi,i+1, ensuring that the electron population reaches the target state over time (Fig. 1b). The definition of the reward function is certainly the most delicate choice in the whole model. The key features of the reward can be summarized by two properties: its generality and its expression as a function of the desired final state. Generality means that the reward should not contain specific information on the characteristics of the two pulses. Regarding the dependence on the desired final state, in our case the goal is to maximize ρ33 at the end of the time evolution. The structure of the reward function used in the simulations is:

$$r_{t_i} = ( - 1 + \rho _{33}\left( {t_i} \right) - \rho _{22}\left( {t_i} \right)) + A(t_i)\delta _{i,N} - B(t_i),$$
(6)

where A(ti) and B(ti) are the goal-reaching and punishment terms, respectively. We found that the following combinations of A(ti) and B(ti) work well:

$$A(t_i) = \left\{ {\begin{array}{*{20}{l}} {1000} & {\rho _{33}(t_i) > \rho _{33}^{{\rm{th}}}} \\ 0 & {{\mathrm{elsewhere}}} \end{array}} \right.,\qquad B(t_i) = \left\{ {\begin{array}{*{20}{l}} {100\,\Theta (\rho _{22}(t_i) - 0.05)} & {\text{standard CTAP}} \\ {e^{6\rho _{22}(t_i)}} & {\text{with detuning, dephasing and loss}} \end{array}} \right.$$
(7)

The reward functions used in this research are fully described in Supplementary Note 5. The sum of the first three terms is nonpositive at each time step, so the agent will try to bring it to 0 by minimizing ρ22 and maximizing ρ33. We have observed that subtracting B(t) (i.e., punishing the electronic occupation of site 2) and rewarding the attainment of the goal with the term A(t) improves the convergence of the learning. Furthermore, in some specific cases, we stop the episode before t = tmax if ρ33 exceeds an arbitrary threshold \(\rho _{33}^{{\rm{th}}}\) for a certain number of time steps. This choice can help find fast pulses that achieve high-fidelity quantum information transport at the cost of a higher \(\rho _{22}^{{\rm{max}}}\) with respect to the analytic pulses (more details are given in Supplementary Note 5).
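To make the mapping concrete, the sketch below assembles the RL state of Eq. (4), the action clipping of Eq. (5), the reward of Eqs. (6) and (7) (standard-CTAP case), and a single environment step that evolves the density matrix with QuTiP under constant pulses (ħ = 1 units). The threshold RHO33_TH, the step count, and the helper names are illustrative assumptions, not the exact values or code used in this work.

# Sketch of the MDP ingredients of Eqs. (4)-(7) and of one environment step (standard CTAP).
import numpy as np
from qutip import Qobj, basis, mesolve

OMEGA_MAX, N, RHO33_TH = 1.0, 100, 0.99        # RHO33_TH is an assumed threshold
T_MAX = 10 * np.pi / OMEGA_MAX
DT = T_MAX / N

def rl_state(rho, omega12_prev, omega23_prev):
    """Eq. (4): the nine |rho_nm| values plus the pulses of the previous step."""
    return np.concatenate([np.abs(rho.full()).flatten(), [omega12_prev, omega23_prev]])

def reward(rho, i):
    """Eq. (6) with the standard-CTAP choices of Eq. (7)."""
    rho22, rho33 = rho.full()[1, 1].real, rho.full()[2, 2].real
    A = 1000.0 if (i == N and rho33 > RHO33_TH) else 0.0    # goal bonus, only at the last step
    B = 100.0 if rho22 > 0.05 else 0.0                      # Theta(rho22 - 0.05) punishment
    return (-1.0 + rho33 - rho22) + A - B

def env_step(rho, action, i):
    """Evolve Eq. (3) for one time interval with the (clipped) pulses held constant."""
    omega12, omega23 = np.clip(action, 0.0, OMEGA_MAX)      # Eq. (5)
    H = Qobj(np.array([[0, -omega12, 0],
                       [-omega12, 0, -omega23],
                       [0, -omega23, 0]]))
    rho_next = mesolve(H, rho, [0.0, DT]).states[-1]
    return rho_next, rl_state(rho_next, omega12, omega23), reward(rho_next, i)

# Example: one step starting from the electron localized in dot 1
rho0 = basis(3, 0) * basis(3, 0).dag()
rho1, s1, r1 = env_step(rho0, action=[0.2, 0.8], i=1)

In the actual training loop, the agent receives the returned RL state and reward at every step, and the episode terminates at i = N or earlier when the stopping criterion described above is met.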

Training of the agent

Figure 2 shows the best results achieved by the agent at various epochs. The agent is free to span the whole coupling range at each step. After the training, we smooth the output pulses Ωi,i+1 and run the simulation using the smoothed values, obtaining a slight improvement over the calculation using the original output values of the RL agent. Smoothing is achieved with a moving average of the pulses (more details in Supplementary Note 6, Supplementary Fig. 6, and Supplementary Table 1). The three occupations ρii, i ∈ {1, 2, 3}, shown in Fig. 2 refer to the smoothed pulse sequences. At the very beginning, the untrained agent tries random values with apparently little success, as the occupation oscillates between the first and the last dot over time. It is worth noting that, despite this fluctuation and thanks to the reward function, the agent learns very quickly (after only 45 epochs) to keep the occupation of the central dot below 0.5 at all times. After ~2000 epochs, the agent learns to stabilize the occupations of the first and last dots while keeping the central dot empty. After approximately twice that time, a reversal of the populations of the first and last dots occurs, although they do not yet reach their extremal values. Finally, after approximately four times the elapsed time, the agent achieves high-fidelity CTAP. Despite the stochastic nature of DRL, these epoch numbers remain consistent across different trainings (see Supplementary Fig. 7). Notice that the pulse sequence is similar to the ansatz Gaussian counterintuitive pulse sequence, as the second gate, acting on Ω23, is operated first, but the shape of the two pulses is different. It is remarkable that the reward function implicitly pushes the agent to achieve the result as quickly as possible, resulting in a pulse sequence that is significantly faster than the analytical Gaussian case and comparable to the recent proposal of Ban et al.42. The relevant point is that the agent achieves such results irrespective of the actual terms of the Hamiltonian contributing to the master equation. Therefore, DRL can be applied straightforwardly to more complex cases for which there is no generalization of the ansatz solutions found for the ideal case, which we address in the next section.
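The smoothing step can be as simple as the centered moving average sketched below (a window of four steps, as used in Fig. 2); the smoothed sequence is then fed back into the master-equation simulation to verify the final populations. The function name and the use of numpy.convolve are illustrative choices.

# Sketch: moving-average smoothing of the raw, piecewise-constant agent pulses
import numpy as np

def smooth(pulse, window=4):
    # Centered moving average over the discrete pulse sequence
    kernel = np.ones(window) / window
    return np.convolve(pulse, kernel, mode="same")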

Fig. 2
figure 2

Learning of the agent for various training epochs. Vertical panels show the pulses Ω and the populations ρ accounted for by the diagonal elements of the density matrix after 45 (a), 1824 (b), 4489 (c), and 16,700 (d) epochs, respectively. The horizontal axis represents time, for which in this case \(t_{{\rm{max}}} = 10\frac{\pi }{{{\it{\Omega }}_{{\rm{max}}}}}\) is set. The vertical axis shows the pulse intensity Ω and the diagonal elements of the density matrix ρ. Each time evolution has been divided into N = 100 time steps. In each time step ti, the neural network can choose the value of each interdot coupling between neighboring quantum dots; the couplings range from 0 to Ωmax and are assumed to be constant during each time step. In the top panel of all four epochs shown, the faded pulses are the outputs of the reinforcement learning (RL) agent, while the highlighted lines are a moving average with a window of size four. The bottom panels represent the time evolution of the diagonal elements of the density matrix with such averaged pulses applied. We have empirically observed that using a moving average fit achieves better results in terms of minimizing ρ22 and maximizing ρ33 (in this particular case, the maximum value of ρ22 is decreased to 4 × 10−4, and the maximum value of ρ33 is increased by 5 × 10−4). Three different regimes can be observed: in the first regime (a), the agent explores randomly; later (b), the agent learns how to minimize the population of the second quantum dot; finally (c, d), the agent learns how to achieve a high population in site 3 at the end of the evolution without populating the second quantum dot

Deep reinforcement learning to overcome disturbances

We now turn our attention to the behavior of our learning strategy when applied to a nonideal scenario in which typical realistic conditions of semiconductor quantum dots are considered, to be compared with ideal transport (Fig. 3a). In particular, we discuss the results produced by DRL when the array is affected by detuning caused by energy differences in the ground states of the dots, by dephasing, and by losses (Fig. 3b–d). These three effects exist, to different degrees, in any practical attempt to implement CTAP of an electron spin in quantum dots. The first effect is typically due to manufacturing defects44, while the last two effects emerge from the interaction of the dots with the surrounding environment45,46 involving charge and spin fluctuations47,48 and magnetic field noise49. Under such disturbances, neither analytical nor ansatz solutions are available. On the other hand, the robustness and generality of the RL approach can be exploited naturally since, from the point of view of the algorithm, this case does not differ from the ideal one discussed above. Let us consider a system of three quantum dots with different energies Ei, i ∈ {1, 2, 3}. We define Δij = Ej − Ei, so that the Hamiltonian (1) can be written, without loss of generality, as

$$H(t) = \left( {\begin{array}{*{20}{c}} 0 & { - {\it{\Omega }}_{12}(t)} & 0 \\ { - {\it{\Omega }}_{12}(t)} & {\Delta _{12}} & { - {\it{\Omega }}_{23}(t)} \\ 0 & { - {\it{\Omega }}_{23}(t)} & {\Delta _{13}} \end{array}} \right).$$
(8)

Figure 3c refers to the particular choice Δ12 = Δ23 = 0.15Ωmax (a full 2D scan of both Δ12 and Δ13 and the corresponding pulses are shown in Supplementary Figs. 8 and 9, respectively). In this case, DRL finds a solution that induces significantly faster transfer than that obtained with standard counterintuitive Gaussian pulses. Moreover, the Gaussian pulses are not even able to achieve a fidelity comparable to that of the ideal case. Such speed is a typical feature of the pulse sequences determined by DRL, which nevertheless satisfy the adiabaticity criterion (see Supplementary Figs. 10 and 11). Besides the energy detuning between the quantum dots, in a real implementation the dots interact with the surrounding environment. Since the microscopic details of such an interaction are unknown, its effects are taken into account through an effective master equation. A master equation of the Lindblad type with time-independent rates is adopted to ensure sufficient generality while keeping a simple expression. To show the ability of DRL to mitigate disturbances, we consider, in particular, two major environmental effects consisting of decoherence and losses. The first environmental effect corresponds to a randomization of the relative phases of the electron states in the quantum dots, which results in a cancellation of the coherence terms, i.e., the off-diagonal elements of the density matrix in the position basis. The losses, instead, model the erasure of the quantum information carried by the electron/hole moving along the dots. In fact, while the carrier itself cannot be reabsorbed, the quantum information, here encoded as a spin state, can be changed by the interaction with some surrounding spin or by the magnetic field noise. In the presence of dephasing and losses, the system’s master equation is of the Lindblad type. Its general form is

$$\dot \rho = - \frac{i}{\hbar }\left[ {H,\rho } \right] + \mathop {\sum}\limits_n {\Gamma _n} \left( {A_n\rho A_n^\dagger - \frac{1}{2}\left\{ {A_n^\dagger A_n,\rho } \right\}} \right),$$
(9)

where An are the Lindblad operators, Γn are the associated rates, and {A, B} = AB + BA denotes the anticommutator.
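As a concrete illustration of how Eq. (9) is handled numerically, the sketch below builds, in QuTiP, the collapse operators for the two disturbances considered in this section: pure dephasing on each dot (Eq. (10)) and loss toward an auxiliary vacuum state |0〉 introduced further below. The enlarged four-level space, the embedding of the Hamiltonian, and the constant pulses in the example call are illustrative assumptions; the rates Γd = 0.01Ωmax and Γl = 0.1Ωmax follow Fig. 3.

# Sketch: Lindblad simulation of Eq. (9) with dephasing and loss collapse operators
import numpy as np
from qutip import Qobj, basis, mesolve

omega_max = 1.0
gamma_d, gamma_l = 0.01 * omega_max, 0.1 * omega_max     # dephasing and loss rates (Fig. 3)

# Basis {|0>, |1>, |2>, |3>}: the three dots plus the auxiliary vacuum state |0>
vac, d1, d2, d3 = (basis(4, k) for k in range(4))

c_ops = [np.sqrt(gamma_d) * dk * dk.dag() for dk in (d1, d2, d3)]    # dephasing, sqrt(Gamma_d)|k><k|
c_ops += [np.sqrt(gamma_l) * vac * dk.dag() for dk in (d1, d2, d3)]  # loss, sqrt(Gamma_l)|0><k|

def hamiltonian(omega12, omega23, delta12=0.0, delta13=0.0):
    """Eq. (8) embedded in the enlarged space (row/column 0 is the vacuum state)."""
    h = np.zeros((4, 4))
    h[1, 2] = h[2, 1] = -omega12
    h[2, 3] = h[3, 2] = -omega23
    h[2, 2], h[3, 3] = delta12, delta13
    return Qobj(h)

rho0 = d1 * d1.dag()                                     # electron initially in dot 1
tlist = np.linspace(0.0, 10 * np.pi / omega_max, 100)
# Constant pulses here only illustrate the call; the DRL pulses are time-dependent.
result = mesolve(hamiltonian(0.2, 0.8), rho0, tlist, c_ops)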

Fig. 3
figure 3

Comparison of coherent transport by adiabatic passage achieved with Gaussian pulses and with the deep reinforcement learning algorithm under different kinds of disturbances. Gaussian pulses are represented by dotted lines, while pulses found by deep reinforcement learning (DRL) are represented by solid lines. The horizontal axis represents time, while the vertical axis shows the values of the diagonal elements of the density matrix. a Reference results from a simulation of ideal conditions with identical eigenvalues in the three quantum dots with \(t_{\max} = 12\frac{\pi }{{\Omega}_{\max}}\). b Simulation involving a dephasing term in the master equation, where Γd = 0.01Ωmax and \(t_{\max} = 5\frac{\pi }{\Omega_{\max}}\). c Simulation for a detuned system, where Δ12 = Δ23 = 0.15Ωmax and \(t_{\max} = 5\frac{\pi }{\Omega_{\max}}\). d Simulation affected by loss, which is accounted for by an effective term weighted by Γl = 0.1Ωmax and \(t_{\max} = 10\frac{\pi }{\Omega_{\max}}\). The insets of b, c show simulations achieved with Gaussian pulses for \(t_{\max} = 10\frac{\pi }{\Omega_{\max}}\). The setups of the neural networks employed are, respectively, (16, 0), (128, 64), (64, 64), and (128, 64), where the first number in parentheses represents the number of neurons of the first hidden layer H1 of the network and the second number represents the number of neurons of the second hidden layer H2

We first consider each dot to be affected by dephasing and assume an equal dephasing rate Γd for every dot. The master equation can then be rewritten as

$$\dot \rho = - \frac{i}{\hbar }\left[ {H,\rho } \right] + \Gamma _d\left[ {\rho - {\mathrm{diag}}(\rho )} \right].$$
(10)

Figure 3b shows an example of the pulses determined by DRL and the corresponding dynamics for the choice Γd = 0.01Ωmax. CTAP is achieved with Gaussian pulses in a time t ≈ 10π/Ωmax at the cost of a significant occupation of the central dot (inset). The advantage brought by the DRL agent with respect to Gaussian pulses is evident: DRL (solid lines) realizes population transfer in approximately half of the time required when analytic pulses (dashed lines) are employed. The occupation of the central dot is also lower than that achieved with Gaussian pulses. We now consider the inclusion of a loss term in the master equation. The loss refers to the information carried by the electron, encoded in its spin. While the loss of an electron is a very rare event in quantum dots, requiring the sudden capture of the electron by a slow trap50,51,52, the presence of any random magnetic field and spin fluctuation around the quantum dots can modify the state of the transferred spin. Losses are described by the operators |0〉〈k|, k ∈ {1, 2, 3}, with loss rate Γl, modeling the transfer of amplitude from the kth dot to an auxiliary vacuum state |0〉. Figure 3d shows the superior performance of DRL versus analytic Gaussian pulses in terms of the maximum value of ρ33 reached during the time evolution. Because of the reward function, DRL minimizes the transfer time so as to minimize the effect of losses. The explicit reward function for each case is given in Supplementary Table 2. As a further generalization, we applied DRL to the passage through more than one intervening dot, an extension called the straddling CTAP scheme (SCTAP)18,53. To exemplify a generalization to an odd number of dots greater than three, we considered five dots according to the sketch depicted in Fig. 4, with no disturbance included for simplicity (see Supplementary Note 7 for further details on SCTAP). For five-state transfer, as in the three-state case and for any other odd number of quantum dots, there is one state with zero energy (see Supplementary Fig. 12). The straddling tunneling sequence is realized by augmenting the original pulse sequence with straddling pulses, which are identical for all intervening tunneling rates Ωm. The straddling pulse involves the second and third control gates, as shown in Fig. 4a, b. The pulse sequence discovered by DRL is significantly different from the known sequence based on Gaussian pulses. It achieves population transfer in approximately one third of the time, with only some occupation of the central dot (see Fig. 4c). It is remarkable that the reward function (see Supplementary Note 5) achieves such a successful transfer without any requirement on the occupation of the intermediate dots ρii with i ∈ {2, 3, 4}, as it only rewards the occupation of the last dot.
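For reference, a sketch of the five-dot SCTAP Hamiltonian is given below: the outer couplings are controlled independently, while the two intervening couplings share a single straddling pulse, as in Fig. 4 and ref. 18. The function and argument names are illustrative.

# Sketch: five-dot (SCTAP) Hamiltonian with a shared straddling pulse on the inner couplings
import numpy as np
from qutip import Qobj

def h_sctap(omega_left, omega_middle, omega_right):
    """Nearest-neighbor couplings (left, middle, middle, right) between the five dots."""
    h = np.zeros((5, 5))
    for k, om in enumerate([omega_left, omega_middle, omega_middle, omega_right]):
        h[k, k + 1] = h[k + 1, k] = -om
    return Qobj(h)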

Fig. 4
figure 4

DRL-controlled straddling CTAP (SCTAP). a Schematics of the five-quantum-dot array. b The pair of central gates is coupled according to Greentree et al.18: Ωleft is tuned by the first coupling control gate, Ωright by the last coupling control gate, while Ωmiddle is applied identically by the second and third coupling control gates. In the example shown, \(t_{{\rm{max}}} = 201\frac{\pi }{{{\it{\Omega }}_{{\rm{max}}}}}\). The dashed thick pulses are discovered by the DRL agent, while the solid lines are cubic spline interpolations. c Population transfer by DRL-controlled SCTAP. The x-axis shows the (rescaled) time in units of tmax, while the y-axis shows tmax. During a time evolution, the increase/decrease of the electronic populations shows oscillations (Vitanov et al.34), which are reduced for some values of tmax, e.g., \(t_{{\rm{max}}} = 21\frac{\pi }{{{\it{\Omega }}_{{\rm{max}}}}}\). For \(t_{{\rm{max}}} = 21\frac{\pi }{{{\it{\Omega }}_{{\rm{max}}}}}\), the maximum value of ρ33 is ρ33,max = 0.1946, while ρ55,max = 0.99963689, obtained with the fitted pulses. Notice how ρ22, ρ33, and ρ44 are minimized throughout the time evolution

Analysis of the relevant variables within the DRL framework

The advantage of employing DRL with respect to an ansatz solution is further increased by the possibility of determining, from an analysis of the neural network, the factors that are most relevant for the effectiveness of the final solution. In fact, the employment of the DRL algorithm enables an analysis of the state variables that are actually relevant for solving control problems like the one discussed above. To select the variables needed to solve an MDP, we follow the approach presented by Peters et al.54,55,56. The idea is that a state variable is useful if it helps to explain either the reward function or the dynamics of the state variables that are in turn necessary to explain the reward function. Otherwise, the state variable is useless, and it can be discarded without affecting the final solution. To represent the dependencies between the variables of an MDP, we use a 2TBN57, described in Supplementary Note 8. In the graph of Fig. 5, there are three types of nodes: the gray diamond node represents the reward function, the circle nodes represent state variables, and the squares represent action variables. The nodes are arranged on three vertical layers: the first layer on the left includes the variables at time ti, the second layer includes the state variables at time ti+1, and the third layer includes the node that represents the reward function. If a variable affects the value of another variable, a directed edge connects the former to the latter. Figure 5 shows the 2TBN estimated from a dataset of 100,000 samples obtained from the standard case of ideal CTAP. The weights are quantified in Supplementary Table 3. As expected from Eq. (6), the reward function depends only on the values of the variables ρ22 and ρ33. From the 2TBN, it emerges that the dynamics of these two state variables can be fully explained by knowing their values at the previous step, the values of the two action variables Ω12 and Ω23, and the actions taken in the previous time step (stored in the variables \({\it{\Omega }}_{12}^\prime\) and \({\it{\Omega }}_{23}^\prime\)). All other state variables do not appear and can therefore be ignored in the computation of the optimal control policy. This finding matches the expectation from the constraints of the physical model that the coherences are not directly involved in the dynamics and that ρ11 is linearly dependent on ρ22 and ρ33, as the trace of the density matrix is constant. To confirm this finding, the agent was successfully trained in the standard case of ideal CTAP by using only the input values ρ22, ρ33, Ω12, and Ω23. Furthermore, the size of the hidden layer H1 could in this case be reduced from 64 to 4 with the same convergence of the total reward as a function of the episodes, or an even faster convergence by reducing the batch size from 20 to 10 (see Supplementary Figs. 13 and 14). In the cases of detuning and dephasing, the corresponding 2TBNs are more complex, since the dynamics of ρ22 and ρ33 are also affected by the values of the other elements of the matrix ρ. This is also apparent from the structure of the master equation, as the trace is not constant when losses are present, and dephasing directly affects the coherences.
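The dependency analysis itself follows the 2TBN structure-learning procedure of refs. 54–57. As a rough, purely illustrative proxy for the same idea, one can fit a sparse linear model of each next-step variable (and of the reward) on all candidate variables of the current step and keep the inputs with nonzero coefficients; the sketch below does this with scikit-learn's Lasso. It is a simplification shown only to convey the concept, not the algorithm used in this work, and the variable names are hypothetical.

# Crude linear proxy for the 2TBN variable-relevance analysis (illustrative only)
import numpy as np
from sklearn.linear_model import Lasso

def relevant_inputs(X_t, y_next, names, alpha=1e-3):
    # Return the names of time-t variables with a nonzero coefficient when predicting y_next
    model = Lasso(alpha=alpha).fit(X_t, y_next)
    return [name for name, coef in zip(names, model.coef_) if abs(coef) > 1e-6]

# Hypothetical usage on a dataset of transitions collected from the RL environment:
# names = [f"rho{n}{m}" for n in range(1, 4) for m in range(1, 4)] + ["Omega12'", "Omega23'"]
# parents_of_rho33 = relevant_inputs(X_t, rho33_next, names)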

Fig. 5
figure 5

Two-step temporal Bayesian network (2TBN) for ideal standard coherent transport by adiabatic passage. The gray diamond node represents the reward function, the circles represent the state variables, and the squares represent the action variables. The nodes are arranged on three vertical layers: the first layer on the left includes the variables at time ti, the second layer includes the state variables at time ti+1, and the third layer consists of the node that represents the reward function. The prime index indicates the value of a variable as stored in the previous step. A directed edge between two nodes means that the value of the source node influences the value of the target node. The thicker the line, the greater the influence. As expected, ρ11 (shaded red) has no connections, as it is linearly dependent on ρ22 and ρ33

Discussion

Neural networks can discover control sequences for a tunable quantum system whose time evolution starts far from its final equilibrium, without any prior knowledge. Contrary to the employment of special ansatz solutions, we have shown that DRL discovers novel sequences of control operations to achieve a target state, regardless of possible deviations from ideal conditions. The DRL solution is obtained irrespective of previously discovered ansatz solutions, and it applies when such solutions are unknown. The control sequence found by DRL can also provide insights into novel analytical shapes for optimal pulses. The use of a pretrained neural network as a starting point to identify the solution of a master equation with similar parameters further reduces the computation time by one order of magnitude (see Supplementary Note 9). Our investigation indicates that a key factor is the appropriate definition of the reward function, which deters the system from occupying the central quantum dot and rewards the occupation of the last quantum dot. To apply DRL to a practical quantum system known for its counterintuitive nature, we have studied quantum state transport across arrays of quantum dots, including sources of disturbance such as energy level detuning, dephasing, and loss. In all cases, the solutions found by the agent outperform the known solution in terms of speed, fidelity, or both. Both the pretraining of the network and the 2TBN analysis reduce the computational effort and thereby accelerate the learning process. More generally, we have shown that neural-network-based DRL provides a versatile tool for controlling physical systems by circumventing the issue of finding ansatz solutions when neither a straightforward method nor a brute-force approach is possible. A comparison with other control strategies for quantum systems16,58,59 will be the objective of future work. As a concluding remark, we observe that in this work we have modeled the effects of the environment or of noisy control fields by master equations in the Lindblad form with time-independent coefficients; by construction, the same DRL strategy can be used to determine the optimal pulse sequence in the presence of non-Markovian noise/effects originating from an interaction of the system with structured environments. In this case, an efficient way to compute the time evolution of the open system, which is required to determine the value of the pulses at each time step, is needed. Depending on the nature (bosonic, fermionic) and features (spectral density) of the environment, different approximate or numerically exact strategies, ranging from Bloch-Redfield or time-convolutionless master equations60 to path integrals61 or chain-mapping techniques (reaction-coordinate62 or TEDOPA63,64), could be exploited.

Methods

Master equation for the Hamiltonian including pure dephasing

Pure dephasing65 is the simplest model that accounts for an environmental interaction of an otherwise closed system. The corresponding Lindblad operators are diagonal and have the form:

$$L_n = \sqrt {\gamma _n} \left| n \right\rangle \left\langle n \right|.$$
(11)

In the case of γ1 = γ2 = γ3 = Γd, the master equation becomes:

$$\dot \rho = - \frac{i}{\hbar }\left[ {H,\rho } \right] + \Gamma _d\left[ {\rho - {\mathrm{diag}}(\rho )} \right].$$
(12)

If the Hamiltonian is diagonal in the dot basis (for instance, when all tunnel couplings are switched off), then

$$\frac{d}{{dt}}\rho _{ii} = 0$$
(13)

for every i, so the electronic populations remain unchanged in the presence of pure dephasing. This implies66 that the energy of the system remains unchanged during the evolution, since it cannot be changed by the environment.
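This behavior can be checked numerically with a few lines of QuTiP, as sketched below: with a diagonal Hamiltonian and the dephasing collapse operators of Eq. (11), the populations of an arbitrary initial superposition stay constant. The energies and the rate used here are arbitrary illustrative values.

# Sketch: populations are invariant under pure dephasing with a diagonal Hamiltonian
import numpy as np
from qutip import Qobj, basis, mesolve

gamma_d = 0.1
H = Qobj(np.diag([0.0, 0.3, 0.7]))                      # arbitrary diagonal energies
c_ops = [np.sqrt(gamma_d) * basis(3, n) * basis(3, n).dag() for n in range(3)]   # Eq. (11)
psi0 = (basis(3, 0) + basis(3, 2)).unit()               # superposition of dots 1 and 3
result = mesolve(H, psi0, np.linspace(0, 20, 200), c_ops,
                 e_ops=[basis(3, n) * basis(3, n).dag() for n in range(3)])
print([pop[-1] - pop[0] for pop in result.expect])      # ~0 for each population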

Neural network

The neural network is a multilayered feedforward network of (Ni, H1, [H2,] No) neurons, where Ni = 11 input neurons encode the nine absolute values of the elements of the density matrix ρ and the two values of the control sequence Ωij; No = 2 output neurons provide the two updated values of the control sequence, spanning a continuous range between 0 and an arbitrary value Ωmax chosen according to experimental considerations; and Hi is the number of neurons of the ith hidden layer.

Different setups of the neural network have been employed. Nonetheless, all the neural networks are feedforward and fully connected, with a ReLU activation function for the neurons. For the implementation of the neural networks and their training, Tensorforce67 is employed. For the RL algorithm adopted in this work, TRPO, the hyperparameters are set as summarized in Supplementary Table 4. The optimal number of neurons of the hidden layer peaks at ~2⁵–2⁶ (see Supplementary Fig. 15).

Software

The QuTIP routine was run in parallel using GNU Parallel68.