Identifying optimal cycles in quantum thermal machines with reinforcement-learning

The optimal control of open quantum systems is a challenging task but has a key role in improving existing quantum information processing technologies. We introduce a general framework based on Reinforcement Learning to discover optimal thermodynamic cycles that maximize the power of out-of-equilibrium quantum heat engines and refrigerators. We apply our method, based on the soft actor-critic algorithm, to three systems: a benchmark two-level system heat engine, where we find the optimal known cycle; an experimentally realistic refrigerator based on a superconducting qubit that generates coherence, where we find a non-intuitive control sequence that outperforms previous cycles proposed in the literature; a heat engine based on a quantum harmonic oscillator, where we find a cycle with an elaborate structure that outperforms the optimized Otto cycle. We then evaluate the corresponding efficiency at maximum power.


I. INTRODUCTION
Thermal machines convert between thermal and mechanical energy in a controlled manner. Examples include heat engines such as steam and Otto engines, that extract useful work from a temperature difference, and refrigerators, that extract heat from a cold bath. Quantum thermal machines (QTMs) perform thermodynamic cycles via nanoscale systems that can be as small as single particles or two-level quantum systems (qubits). Quantum heat engines and refrigerators could find applications in heat management at the nanoscale [1], or for on-chip active cooling [2,3].
Quantum thermodynamics is a rapidly growing research area that aims at the understanding, design and optimization of QTMs [4]. An open fundamental question is whether quantum effects can boost the performance of QTMs [2,4,5]. At the same time, understanding how to optimally control the non-equilibrium dynamics of open quantum systems is a complicated task, whose solution could also improve existing quantum information processing technologies.
While the laws of thermodynamics pose universal constraints on the efficiency of thermal machines, regardless of their classical or quantum nature, they do not pose any restriction on the dynamics of the system, and thus on the speed at which it operates. Therefore, it is crucial to study the power to discover potential benefits of using QTMs. However, optimizing the power is a challenging task: since the machine must operate in finite time, the state can be driven far from equilibrium, requiring us to model the full dynamics of the quantum system. Furthermore, strategies are needed to identify optimally controlled cycles.
In general, there is no guarantee that typical regimes and specific cycles considered in literature are optimal for power maximization. Overcoming this limitation may allow us to unlock quantum advantages in power extraction. This calls for the development of powerful search strategies to tackle power-maximization without relying on specific control sequences or assumptions.
In this manuscript we propose a Reinforcement Learning (RL) [72] based approach to optimize the performance of QTMs. Specifically, we use a generalization of Soft Actor-Critic methods [73,74] for combined discrete and continuous actions, introduced in the context of robotics and video games [75,76], to discover thermodynamic cycles that deliver maximum power. RL has received a great deal of attention for its success at mastering complicated tasks beyond human level, such as playing video games [77,78] and the board game Go [79], and for robotic applications [80]. RL has been recently used for accurate quantum state preparation [81][82][83][84], outperforming previous state-of-the-art methods [85,86], to minimize entropy production in closed quantum systems [87], and for fault-tolerant quantum computation [88], and machine learning methods have been used for quantum thermometry [89].
Our RL-based scheme for power maximization of QTMs is generic in that it makes no assumptions on the shape of the control cycle. Rather, it starts from scratch, allowing the RL agent to arbitrarily couple or decouple the quantum system from any bath, and to arbitrarily manipulate the control parameter. We apply our approach to three paradigmatic systems that have been well studied in literature: (i) a benchmark heat engine based on a two-level system, where our approach automatically finds the known maximum power cycle [90].
(ii) an experimentally realistic refrigerator based on a superconducting qubit coupled to resonant circuits [48] which generates coherence during its cycle. Our RL approach discovers a new and non-intuitive cycle that outperforms previous proposals [48,54,62]. (iii) a heat engine based on a harmonic oscillator [42], where we find a cycle with an elaborate structure that shares qualitative similarities with the Otto cycle, but which performs better thanks to additional features. We complement the study of these systems by analyzing the corresponding efficiency at maximum power.
The complexity and structure of these cycles demonstrate both the ability of the RL agent to choose actions based on a long-term return, and the complicated non-analytic nature of optimal cycles. Our results also show that the celebrated Otto cycle is not in general optimal for power maximization. Furthermore, we show that the detrimental effect of coherence on the performance of QTMs [71], specifically observed both in the superconducting qubit and harmonic oscillator cases, can be mitigated thanks to carefully crafted cycles.

A. Quantum Thermal Machines
We describe a QTM by a quantum system (QS), acting as a working medium, that can exchange heat with a hot (H) or cold (C) thermal bath characterized by inverse temperatures β_H < β_C (Fig. 1). We can control the evolution of the QS and exchange work with it through a set of time-dependent control parameters u(t). In classical thermal machines, the working medium could be a gas in a cylinder, and u(t) could be the time-dependent position of the piston which influences the state of the gas and allows us to exchange energy. In a QTM, the working medium is a quantum system whose Hamiltonian can be parametrized by the control variables u(t) [91].
FIG. 1. Schematic representation of a quantum thermal machine. A hot (cold) bath at inverse temperature β_H (β_C), represented by the red (blue) box, can be coupled to the quantum system QS (gray circle), enabling a heat flux J_H(t) (J_C(t)). The quantum system is controlled by an external agent through a set of control parameters u(t), which allow power exchange P(t), and through a discrete control d(t) ∈ {Hot, Cold, None} which determines which bath (if any) is coupled to the QS.

Here we study finite-time thermodynamics of QTMs within the Markovian regime using the commonly employed master equation [92][93][94][95]. This approach describes the time evolution of the reduced density matrix of the QS, ρ̂(t), under the assumption of weak system-bath interaction. Setting ħ = 1, the master equation reads

    ∂ρ̂(t)/∂t = −i[Ĥ[u(t)], ρ̂(t)] + D_{u(t),d(t)}[ρ̂(t)],    (1)

where Ĥ[u(t)] is the Hamiltonian of the QS, which depends explicitly on time via the control parameters u(t), [·, ·] denotes the commutator, and D_{u(t),d(t)} describes the effect of the coupling between the QS and bath α = H, C. Here d(t) ∈ {Hot, Cold, None} is an additional discrete control parameter which allows us to choose which bath (if any) is coupled to the QS. We compute the extracted power P(t) and the instantaneous heat flux J_α(t) flowing out of bath α in the standard way [23],

    P(t) = −Tr[ρ̂(t) ∂Ĥ[u(t)]/∂t],    J_α(t) = Tr[Ĥ[u(t)] D^(α)_{u(t)}[ρ̂(t)]],    (2)

where D^(α)_{u(t)} denotes the bath-α contribution to the dissipator. This choice guarantees the validity of the first law of thermodynamics, ∂U(t)/∂t = −P(t) + Σ_α J_α(t), the internal energy being defined as U(t) = Tr[ρ̂(t) Ĥ[u(t)]] (see Methods for details). The two main thermal machines we consider are the heat engine and the refrigerator. A heat engine is used to extract work, while a refrigerator is used to extract heat from the cold bath. Therefore, we define

    P_E(t) ≡ J_H(t) + J_C(t),    P_R(t) ≡ J_C(t),

respectively as the instantaneous power of a heat engine E (since the total heat extracted coincides with the work if the internal energy difference is zero), and as the instantaneous cooling power of a refrigerator R. Our goal is to determine the optimal driving, i.e. to determine the functions u(t) and d(t) that maximize the average power in the long run.

FIG. 2. Schematic representation of the learning process. A computer agent (blue box) chooses an action a_i at time step i based on the current state s_i of the environment (gray box) through the policy function π(a_i|s_i). The action, which encodes the control (u(t), d(t)), is passed to the environment (lower arrow), which evolves the quantum state s_i of the machine based on the action a_i, and computes the average power during the time step as reward. The new state s_{i+1} and reward r_{i+1} are returned to the agent (upper arrow), which uses this information to improve π(a|s) using the soft actor-critic algorithm, which also learns the value function Q(s, a). Both π(a|s) and Q(s, a) are parameterized using fully-connected neural networks (NNs). This process is reiterated until convergence of the policy.

We thus define the following exponentially weighted average of the power

    P^[ν] = γ̄ ∫₀^∞ dt e^{−γ̄ t} P_ν(t),    ν = E, R,    (3)
where γ̄ determines the timescale over which we average. While we do not enforce any periodic structure on the controls u(t) and d(t), we expect the RL agent to automatically discover the optimality of periodically driving the QTM and the corresponding driving period. The intuition is the following: in the short term, we can maximize the power by taking advantage of the initial preparation of the system, for example by extracting all of its free energy. However, the amount of work that can be extracted this way is bounded, while the work that can be extracted through cycles scales with the number of performed cycles, i.e. with time. Therefore in the long run, i.e. for small enough γ̄, we expect the maximization of Eq. (3) to naturally discover thermodynamic cycles and to prevent the exploitation of transient effects. We confirm this hypothesis in all QTMs studied below.
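As an illustration of the dynamics described by Eq. (1), the following minimal sketch (our own toy code with made-up parameters, not the implementation used in this work) integrates the master equation with a first-order Euler scheme for a qubit in contact with a fermionic bath; the excited-state population relaxes toward the Fermi occupation f of the instantaneous gap.

```python
import numpy as np

# Minimal sketch of one Euler integration of the Lindblad master equation
# drho/dt = -i[H, rho] + sum_k g_k (A_k rho A_k^† - 1/2 {A_k^† A_k, rho}).
# All parameter values are illustrative, not those used in the paper.

def lindblad_rhs(rho, H, jump_ops):
    """jump_ops: list of (rate, operator) pairs."""
    out = -1j * (H @ rho - rho @ H)
    for rate, A in jump_ops:
        AdA = A.conj().T @ A
        out += rate * (A @ rho @ A.conj().T - 0.5 * (AdA @ rho + rho @ AdA))
    return out

def step(rho, H, jump_ops, dt):
    return rho + dt * lindblad_rhs(rho, H, jump_ops)

E0, u, beta = 1.0, 0.8, 2.0
H = 0.5 * E0 * u * np.diag([1.0, -1.0])        # qubit Hamiltonian, gap E0*u
sm = np.array([[0, 0], [1, 0]], dtype=complex)  # lowering operator |g><e|
f = 1.0 / (1.0 + np.exp(beta * E0 * u))         # Fermi factor, ~0.168 here
ops = [(1 - f, sm), (f, sm.conj().T)]           # relaxation / excitation

rho = np.array([[0.5, 0.0], [0.0, 0.5]], dtype=complex)
for _ in range(2000):
    rho = step(rho, H, ops, dt=5e-3)
# rho has relaxed toward the Gibbs state: excited population -> f
```

The trace is preserved by construction, and the stationary state of the dissipator is the thermal state of the instantaneous Hamiltonian, which is the mechanism all three setups below rely on.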

B. Reinforcement Learning for Quantum Thermal Machines
We formulate the power optimization problem as a discounted, continuing RL task. As we will demonstrate, this approach is able to learn far-from-equilibrium finite-time thermodynamic cycles with high performance.
RL is a general framework that tackles optimization problems formulated in the following way. As schematically shown in Fig. 2, a computer agent (blue box) must learn to master some task by repeated interactions with some environment (gray box). Discretizing time in time steps ∆t, we denote with s i ∈ S the state of the environment at time t = i∆t, where S is the state space. The agent must choose an action a i ∈ A to perform on the environment based on its current policy (lower orange arrow). A is the action space, and the policy π(a i |s i ) is a function that describes the probability distribution of choosing action a i , given that the environment is in state s i . The environment then evolves its state according to the chosen action, and provides feedback back to the agent by returning the updated state s i+1 and a scalar quantity r i+1 known as the reward (upper orange arrow). This procedure is reiterated for a large number of time steps.
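The interaction loop described above can be sketched in a few lines. The environment and policy below are trivial stand-ins of our own invention (not the quantum environment of this work); the point is only the s_i, a_i, r_{i+1} interface of Fig. 2.

```python
import random

# Hypothetical minimal agent-environment loop mirroring Fig. 2.
# ToyEnv and policy are illustrative placeholders, not the paper's code.

random.seed(0)

class ToyEnv:
    """Toy environment whose reward simply favors action 1."""
    def __init__(self):
        self.state = 0.0

    def step(self, action):
        reward = 1.0 if action == 1 else 0.0
        self.state += 0.1            # deterministic toy dynamics
        return self.state, reward    # (s_{i+1}, r_{i+1})

def policy(state):
    # A trivial uniform stochastic policy pi(a|s); a real agent learns this.
    return random.choice([0, 1])

env = ToyEnv()
state, total = env.state, 0.0
for i in range(100):
    action = policy(state)               # a_i ~ pi(.|s_i)
    state, reward = env.step(action)     # environment feedback
    total += reward
```

In our setting, `step` integrates the master equation for one interval ∆t and returns the average power as the reward.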
At every time step t_i, the aim of the agent is to use the feedback it receives from the environment to learn an optimal policy that maximizes, in expectation, the return, i.e. the total future reward it receives from the environment, defined as

    R_i = Σ_{k=0}^∞ γ^k r_{i+k+1},    (4)

where γ ∈ [0, 1) is the discount factor, which determines how much we are interested in future rewards, as opposed to immediate rewards.

We now turn to the power maximization problem. Discretizing time in steps ∆t, we search for protocols u(t) and d(t) that are constant during each time step. As shown in Fig. 2, we choose as action space A = {(u, d) | u ∈ U, d ∈ {Hot, Cold, None}}, where U is the continuous set of accessible controls, which can account for any experimental limitation, and d is the discrete action, motivated by typical thermodynamic cycles, which determines which bath (if any) is coupled to the QS. We choose the physical quantum states of the QS and the last chosen action as state space, i.e. S = {(ρ̂, u) | ρ̂ ∈ D, u ∈ U}, where D = {ρ̂ | ρ̂ ≥ 0, Tr[ρ̂] = 1} is the space of density matrices. Crucially, we choose as reward

    r_{i+1} = (1/∆t) ∫_{t_i}^{t_i+∆t} P_ν(t) dt,    (5)

which is the average power of the machine during the time interval [t_i, t_i + ∆t]. Plugging Eq. (5) into Eq. (4), we see that the aim of the agent is to maximize the average power P^[ν] introduced in Eq. (3), where γ̄ = −ln γ/∆t (see Methods for details). In the RL notation, γ sets the timescale for the power averaging, with γ → 1 corresponding to long-term averaging. Our agent has no prior knowledge of quantum dynamics, nor of thermodynamic cycles: the evolution of the state from s_i = ρ̂(t) to s_{i+1} = ρ̂(t + ∆t) and the computation of the rewards r_i is performed by the environment. We learn the optimal policy employing the soft actor-critic method, which relies on learning also a value function Q(s, a), generalized to a combination of discrete and continuous actions [73][74][75][76].
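The correspondence between the discounted return of Eq. (4) and the exponentially weighted power average of Eq. (3) can be checked numerically. The power trace below is made up; the only point is the identity γ^i = e^{−γ̄ t_i} with γ̄ = −ln γ/∆t.

```python
import numpy as np

# Numerical check (with an arbitrary illustrative power trace) that the
# discounted return of rewards r_{i+1} = power on step i reproduces the
# exponentially weighted power average with gamma_bar = -ln(gamma)/dt.

dt, gamma = 0.01, 0.99
gamma_bar = -np.log(gamma) / dt

t = np.arange(0.0, 60.0, dt)
P = 1.0 + 0.5 * np.sin(t)        # illustrative power trace P_nu(t)
rewards = P                      # piecewise-constant protocol: r ~ P per step

ret = np.sum(gamma ** np.arange(len(t)) * rewards)          # Eq. (4)-style return
avg = gamma_bar * np.sum(np.exp(-gamma_bar * t) * P) * dt   # Eq. (3)-style average

# Up to the constant prefactor gamma_bar*dt, the two coincide.
ratio = (gamma_bar * dt) * ret / avg
```

This is the "irrelevant constant prefactor" mentioned in the Methods: maximizing the return and maximizing P^[ν] are the same objective.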
In this approach, the policy function π(a|s) plays the role of an "actor" that chooses the actions to perform, while the value function Q(s, a) plays the role of a "critic" that judges the choices made by the actor, thus providing feedback to improve the actor's behavior. Both π(a|s) and Q(s, a) are parameterized using fully-connected neural networks with two hidden layers, and they are determined by minimizing the loss functions in Eqs. (22) and (29) in Methods using the ADAM optimization algorithm [96]. The gradient of the loss functions is computed off-policy, over a batch of past experience which is recorded and stored in a replay buffer, using backpropagation (see Methods for details).

C. Case Studies
In this section we prove the validity of our RL-based approach by applying it to three different systems, namely a heat engine based on a two-level system, a refrigerator based on a superconducting qubit, and a heat engine based on a quantum harmonic oscillator. While the results presented below were obtained performing a single training, in Methods we show that our RL approach reliably converges to solutions with nearly the same performance across multiple trainings.
a. Two-level system heat engine. We first benchmark our method on a two-level system for which the optimal control cycle is known:

    Ĥ[u(t)] = (E₀/2) u(t) σ̂_z,    (6)

where u(t), which determines the energy gap of the two-level system, is our single control parameter, E₀ is a fixed energy scale and σ̂_z is a Pauli matrix. We consider the qubit to be coupled to fermionic baths with a flat density of states (for example, a single-level quantum dot tunnel-coupled to metallic leads), with thermalization timescales fixed by the rates Γ_α (see Methods for details).
Our results are shown in Fig. 3. Fig. 3a shows the running average of the power, P^[E]_γ, an exponentially weighted average of the past rewards with weight γ, as a function of the time steps. Fig. 3b shows the actions chosen by the agent, as a function of the steps, at three different moments during training, highlighted by the black dots in Fig. 3a. The position of the segments corresponds to the chosen value of u, while the color represents the discrete action d (see legend). The optimal cycle learned at the end of the training is shown in Fig. 3c. Initially, the agent has no knowledge of the system, and its actions appear random (Fig. 3b, left), producing negative power P^[E]_γ (Fig. 3a). As expected, by performing random actions the agent is dissipating work into the heat baths, rather than extracting work. With increasing time, the agent gains experience and learns how to control the heat engine: P^[E]_γ increases, and structure appears in the chosen actions (Fig. 3b, center and right). Eventually the policy converges, and P^[E]_γ saturates to a finite positive value. The optimal cycle for this model was derived in Ref. [90], and it corresponds to the exact structure discovered by the agent, i.e. a square wave alternating between the hot and the cold bath as fast as possible, without spending any time disconnected from the baths. The black dashed line in Fig. 3a shows the corresponding power. Notably, although the frequency of the learned cycle is limited by the choice of ∆t, and although the values of u(t) found by the agent differ slightly from the ones predicted by Ref. [90] (the difference is ≈ 0.05), the power it generates nearly coincides with the upper bound. This occurs because there is a manifold of near-optimal solutions.
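The physics of such a square-wave cycle can be reproduced with a few lines of rate-equation bookkeeping. The sketch below is our own toy model with illustrative parameters (not the paper's simulation): the excited population relaxes toward the Fermi occupation of each bath at the corresponding gap, and the work per cycle is the population difference times the gap difference.

```python
import numpy as np

# Toy square-wave (Otto-like) cycle on a two-level system with gap eps = E0*u,
# alternating contact with the hot bath at u_H and the cold bath at u_C.
# Work extraction requires beta_H * eps_H < beta_C * eps_C. Values illustrative.

E0, beta_H, beta_C = 1.0, 0.5, 2.0
u_C, u_H = 0.5, 1.5
Gamma, tau = 1.0, 1.0      # thermalization rate and half-period (illustrative)

def fermi(x):
    return 1.0 / (1.0 + np.exp(x))

def relax(p, target, t):
    # excited-state population relaxing exponentially at total rate Gamma
    return target + (p - target) * np.exp(-Gamma * t)

p = 0.5
for _ in range(200):  # iterate until the cycle becomes periodic
    p_cold = relax(p, fermi(beta_C * E0 * u_C), tau)      # contact with C at u_C
    p_hot = relax(p_cold, fermi(beta_H * E0 * u_H), tau)  # contact with H at u_H
    p = p_hot

work_per_cycle = (p_hot - p_cold) * E0 * (u_H - u_C)  # net extracted work
power = work_per_cycle / (2 * tau)
```

The quenches are treated as instantaneous and the state as diagonal, which is exact for this commuting-Hamiltonian model; in this toy version the extracted work is positive whenever the hot-bath occupation at ε_H exceeds the cold-bath occupation at ε_C.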
We conclude the analysis of the two-level system heat engine by computing the efficiency at maximum power (EMP), i.e. the thermodynamic efficiency of the heat engine, defined as the ratio between the extracted work and the input heat, while the engine is operated at maximum power. We compare the EMP both to the Carnot efficiency η_c = 1 − β_H/β_C and to the Curzon-Ahlborn efficiency η_ca = 1 − √(β_H/β_C). Despite not being a fundamental upper bound, the latter has received considerable attention in the literature for its simplicity, for being an upper bound to the EMP in various specific models [24,28,97,98], and because it can be derived by general arguments from linear irreversible thermodynamics [99]. Interestingly, we find that the optimal cycle shown in Fig. 3c delivers a large EMP corresponding to 100% of η_ca, equivalent to 59% of η_c.
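For concreteness, the two benchmark efficiencies are one-liners; the temperature ratio below is an illustrative placeholder, not the value used in our simulations.

```python
import math

# Carnot and Curzon-Ahlborn efficiencies for an illustrative choice of
# inverse temperatures (placeholders, not the paper's parameters).
beta_H, beta_C = 0.5, 2.0

eta_c = 1.0 - beta_H / beta_C             # Carnot efficiency
eta_ca = 1.0 - math.sqrt(beta_H / beta_C)  # Curzon-Ahlborn efficiency
```

Since √x > x for 0 < x < 1, one always has η_ca < η_c, consistent with η_ca being the tighter finite-power benchmark.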
b. Superconducting qubit refrigerator. We now consider a refrigerator based on an experimentally realistic system, i.e. a superconducting qubit coupled to two resonant circuits which behave as heat baths [48]. As opposed to the previous setup, here coherence between the instantaneous eigenstates is generated while driving the system, since [Ĥ(u₁), Ĥ(u₂)] ≠ 0 for u₁ ≠ u₂. This quantum effect was found to deter the power of this specific setup [48,54,62], and of arbitrary systems in linear response and for small driving amplitudes [70].
As shown in Refs. [48,54,62], the system Hamiltonian is given by

    Ĥ[u(t)] = E₀ [∆ σ̂_x + u(t) σ̂_z],    (7)

where E₀ is a fixed energy scale, ∆ characterizes the minimum gap of the system, and u(t) is our control parameter. In this setup the coupling to the baths is fixed, and cannot be controlled. However, the qubit is resonantly coupled to the baths at different energies. The u-dependent coupling strength to the C (H) bath is described by a function γ^(C)_u (γ^(H)_u) that, as in Ref. [62], is peaked at u = 0 (u = 1/2), with a resonance width determined by the quality factor Q_C (Q_H) (see Methods for details).
Panels (a), (b) and (c) of Fig. 4 report the results in the same style as in Fig. 3, with the exception that all actions are black, since there is no discrete choice to make, while Fig. 4d shows the coupling strengths γ^(α)_u. The parameters were chosen as in Fig. 7 of Ref. [62]. As previously, the agent begins with random actions in the first steps, and the corresponding running average of the cooling power P^[R]_γ is slightly negative, indicating that we are dissipating heat into the cold bath. As the agent gains experience, P^[R]_γ increases until it saturates to a final positive value, obtained with the cycle shown in Fig. 4c (thick lines).
Interestingly, this setup was partially optimized in Ref. [62]. In their analysis, the authors fix a smoothed trapezoidal cycle u(t) (Fig. 4c, dashed line), which was shown to outperform a sine and a trapezoidal cycle [48]. They find that the cooling power is positive for large cycle periods T, and tends to zero as T → ∞. However, the cooling power becomes negative as T → 0, because the detrimental effect of the generation of coherence increases with the speed of the cycle. As a consequence, there is an intermediate optimal value of T. They find that P^[R] ≈ 2.3 × 10⁻⁴ at this optimal choice (dashed line in Fig. 4a,c). Notably, our RL agent discovers a protocol with P^[R] ≈ 10.8 × 10⁻⁴ using the same system parameters. This improvement is due to the non-intuitive additional step visible in Fig. 4c. As expected, the cycle spends time near the two values at which the qubit is resonantly coupled to the baths, u = 0 and u = 0.5. However, the agent identifies also a third point, around u ≈ 0.13, where it spends ≈ 1/4 of the total cycle time and where the system is essentially decoupled from the baths, thus undergoing unitary evolution (as can be seen from the small values of γ^(α)_u corresponding to u ≈ 0.13 in Fig. 4d). This additional feature allows us to extract roughly the same amount of heat per cycle, but 5 times faster. Interestingly, we verified that the trapezoidal cycle running at the same speed as the cycle found by the RL agent would yield negative power. As argued in Refs. [48,54,62], this power loss is attributed to the generation of coherence during the cycle, so we can interpret the power enhancement achieved by our cycle as a mitigation of such a detrimental effect. To confirm this, we rigorously quantify the generation of coherence in both cycles by computing the time-average of the relative entropy of coherence [100]. Indeed, we find that the trapezoidal cycle operated at the same speed as the cycle found by the RL agent generates almost twice as much coherence (see Methods for details).
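The coherence measure used above, the relative entropy of coherence C(ρ) = S(ρ_diag) − S(ρ), where S is the von Neumann entropy and ρ_diag is ρ with off-diagonal elements removed, is straightforward to evaluate; the states below are illustrative.

```python
import numpy as np

# Relative entropy of coherence C(rho) = S(diag(rho)) - S(rho) [100],
# evaluated here on two illustrative qubit states.

def von_neumann_entropy(rho):
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]           # drop numerically zero eigenvalues
    return float(-np.sum(evals * np.log(evals)))

def rel_entropy_of_coherence(rho):
    diag = np.diag(np.diag(rho))           # dephased state in the chosen basis
    return von_neumann_entropy(diag) - von_neumann_entropy(rho)

plus = np.array([[0.5, 0.5], [0.5, 0.5]], dtype=complex)  # pure |+> state
mixed = np.diag([0.5, 0.5]).astype(complex)               # maximally mixed

c_plus = rel_entropy_of_coherence(plus)    # maximal for a qubit: ln 2
c_mixed = rel_entropy_of_coherence(mixed)  # diagonal state: zero coherence
```

In our analysis this quantity is computed in the instantaneous eigenbasis of Ĥ[u(t)] and averaged over one cycle period.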
We conclude the study of the qubit-based refrigerator by evaluating the coefficient of performance (COP) at maximum power, i.e. the ratio between the heat extracted from the cold bath and the input work, while the system is operated at maximum cooling power. We find that our cycle, shown in Fig. 4c as a thick black line, delivers a COP at maximum power that is 6% of the Carnot COP β_H/(β_C − β_H), which is the upper bound to the COP. While this may appear as a rather low value, we notice that the COP at maximum power of a two-level system coupled to fermionic or bosonic baths is zero [90], and that for on-chip cooling applications the aim is typically to maximize the cooling power, regardless of the efficiency of such a process.
c. Harmonic oscillator heat engine. At last, we consider a heat engine based on a collection of non-interacting particles confined in a harmonic potential [42]. The Hamiltonian is given by

    Ĥ[u(t)] = p̂²/(2m) + (1/2) m ω₀² u²(t) q̂²,    (8)

where m is the mass of the system, ω₀ is a fixed frequency, and p̂ and q̂ are the momentum and position operators. The control parameter u(t) allows us to change the frequency of the oscillator. As in the two-level system heat engine case, we let the agent choose which bath (if any) to couple to the oscillator. Since [Ĥ(u₁), Ĥ(u₂)] ≠ 0 for u₁ ≠ u₂, also this system exhibits a power loss at finite driving speed [42,71]. The coupling to the baths, characterized by the thermalization rates Γ_α, is modeled as in Ref. [42] (see Methods for details). The solid lines in Figs. 5a and 5b show P^[E]_γ, as a function of the step, for the same system parameters (chosen as in the upper panel of Fig. 6 of Ref. [42]), but setting respectively U = [0.5, 1] and U = [0.35, 1.5]. In both cases, the power is negative for small steps, while it converges to a positive value as the agent gains experience. The corresponding final cycles learned by the agent are shown in Figs. 5c and 5d as thick lines. These cycles display a quite elaborate structure, which demonstrates the ability of the agent to plan for a long-time reward. Indeed, while the system is in contact with the cold bath, which happens roughly for 20 time steps, energy is flowing into the bath, thus producing negative rewards. Nonetheless, the agent discovers that this step, required by the second law of thermodynamics, is necessary for power extraction in the long run.
We now compare the cycle discovered by the agent to the well-known Otto cycle often considered in the literature. Even comparing against an Otto cycle whose free parameters are optimized as in Ref. [42], the agent discovers a cycle that outperforms the optimized Otto cycle (Fig. 5a). If we further allow the agent to modulate u(t) in a larger interval, we find a cycle with an even larger power (Fig. 5b).
The protocol discovered by the agent in Fig. 5a resembles the Otto cycle (both are flat for some time at u = u_C and u = u_H), yet there are additional features. First, the agent ramps up the control u(t) between u_C and u_H in a non-linear fashion while in contact with a bath, rather than decoupling the system. Next, there are two strong discontinuities where the system is abruptly disconnected from the baths for a short time (Fig. 5c, green segments). These additional features turn out to enhance the power. At last, we notice that the cycle found by the agent in Fig. 5d is more regular than the one in Fig. 5c, and it deviates further from an Otto cycle. The crucial difference with respect to the Otto cycle therefore seems to be the ramping up of u(t) while in contact with the baths, which probably benefits from simultaneously exchanging heat and modulating u(t), rather than doing so in two separate strokes. As in the two-level case, we evaluate the EMP of the discovered cycles shown in Figs. 5c and 5d. In both cases the EMP is considerably high, corresponding respectively to 60% and 78% of η_ca, equivalent to 46% and 59% of η_c. Interestingly, the cycle shown in Fig. 5d yields both a higher power and a larger EMP than the cycle displayed in Fig. 5c.

III. DISCUSSION
We introduced a general framework based on Reinforcement Learning to discover thermodynamic cycles that maximize the power of out-of-equilibrium quantum thermal machines, paving the way for a more systematic use of machine learning in the field of quantum thermodynamics. Using state-of-the-art machine learning techniques, we applied our method to three different paradigmatic setups. Our method found the optimal known solution in the benchmark system, while in the other systems it discovered new unintuitive and elaborate cycles which outperform previously proposed cycles.
Our results show that the celebrated Otto cycle is not in general optimal for power extraction, and that carefully crafted cycles can mitigate coherence-induced power losses [48,54,70,71] without introducing additional controls as required by Shortcuts to Adiabaticity [56][57][58][59][60][61][62][63]. As opposed to other optimal control techniques, such as the Pontryagin Minimum Principle [64][65][66], our RL-based approach has the following advantages: it does not require any analytic calculation; it can handle both continuous and discrete controls (such as the choice of the heat bath), and it can naturally find discontinuous and irregular protocols; it can be applied as-is to arbitrarily complicated setups, and it could be used to find optimal protocols also in the presence of noise in the controls.
Future research directions include applications to multi-particle systems, where many-body advantages might be revealed, and a systematic study of the mitigation of coherence-induced power losses. Interesting extensions of our framework include investigating the strong system-bath coupling regime going beyond a master equation approach [101][102][103], optimizing additional thermodynamic quantities, such as minimizing the fluctuations in the power output, and developing a scheme that can be applied directly to experimental setups without requiring knowledge of the quantum state.

IV. METHODS
Physical model. As discussed in the main text, we assume that the state evolves according to the Markovian master Eq. (1), which can be derived, also for non-adiabatic drivings [95], in the weak system-bath coupling regime by performing the usual Born-Markov and secular approximations [92][93][94] and neglecting the Lamb-shift contribution. We notice that, since the RL agent produces piece-wise constant protocols, we are not impacted by possible inaccuracies of the master equation subject to fast parameter driving [104], provided that ∆t is not smaller than the bath timescale. Without loss of generality, the dissipators can be expressed as [93,94]

    D_{u(t),d(t)}[ρ̂] = Σ_α λ_α[d(t)] Σ_k γ^(α)_{k,u(t)} ( Â^(α)_{k,u(t)} ρ̂ Â^(α)†_{k,u(t)} − (1/2) {Â^(α)†_{k,u(t)} Â^(α)_{k,u(t)}, ρ̂} ),    (9)

where λ_α[d(t)] ∈ {0, 1} are functions that determine which bath is coupled to the QS, Â^(α)_{k,u(t)} are the Lindblad operators, and γ^(α)_{k,u(t)} are the corresponding rates. In particular, λ_H(Hot) = 1, λ_C(Hot) = 0, while λ_H(Cold) = 0, λ_C(Cold) = 1, and λ_H(None) = λ_C(None) = 0. Notice that both the Lindblad operators and the rates can depend on time through the value of the control u(t). Their explicit form depends on the details of the system, i.e. on the Hamiltonian describing the dynamics of the overall system, including the baths and the system-bath interaction. Below, we provide the explicit form of Â^(α)_{k,u(t)} and γ^(α)_{k,u(t)} used to model the three setups considered in the manuscript. We adopt the standard approach, Eq. (2), to compute the instantaneous power and heat currents [23].

In the two-level system heat engine, we consider the following Lindblad operators and corresponding rates (identifying k = ±):

    Â^(α)_+ = σ̂_+,    Â^(α)_− = σ̂_−,
    γ^(α)_{+,u(t)} = Γ_α f(β_α E₀ u(t)),    γ^(α)_{−,u(t)} = Γ_α [1 − f(β_α E₀ u(t))],

where σ̂_+ and σ̂_− denote the raising and lowering operators, Γ_α is a constant rate which sets the thermalization timescale when the QS is coupled to bath α, and f(x) = [1 + exp(x)]⁻¹ is the Fermi distribution.
This choice can be derived, for example, when considering the qubit as a single-level quantum dot tunnel-coupled to metallic leads, with flat density of states, which act as heat baths [105][106][107][108].
In the superconducting qubit refrigerator, we employ the model first put forward in Ref. [48] and further studied in Refs. [54,62]. In particular, we consider the following Lindblad operators (identifying k = ±):

    Â^(α)_{+,u(t)} = |e_u(t)⟩⟨g_u(t)|,    Â^(α)_{−,u(t)} = |g_u(t)⟩⟨e_u(t)|,    (12)

where |g_u(t)⟩ and |e_u(t)⟩ are, respectively, the instantaneous ground state and excited state of Eq. (7). The corresponding rates are given by γ^(α)_{±,u(t)} = S_α(∓∆_u(t)), where ∆_u(t) is the instantaneous energy gap of the system, and

    S_α(∆) = (g_α/2) ∆ / { [1 + Q_α² (∆/ω_α − ω_α/∆)²] (1 − e^{−β_α ∆}) }    (13)

is the noise power spectrum of bath α. Here ω_α, Q_α and g_α are the base resonance frequency, quality factor and coupling strength of the resonant circuit acting as bath α = H, C (see Refs. [48,62] for details). As in Ref. [62], we choose ω_C = 2E₀∆ and ω_H = 2E₀√(∆² + 1/4), such that the C (H) bath is in resonance with the qubit when u = 0 (u = 1/2). The width of the resonance is governed by Q_α. The total coupling strength to bath α, plotted in Fig. 4d, is quantified by

    γ^(α)_u = γ^(α)_{+,u} + γ^(α)_{−,u}.    (14)

In the harmonic oscillator heat engine, following Ref. [42], we describe the coupling to the baths through the Lindblad operators Â^(α)_{−,u(t)} = â_u(t) and Â^(α)_{+,u(t)} = â†_u(t), with corresponding rates

    γ^(α)_{+,u(t)} = Γ_α n_α(u(t)),    γ^(α)_{−,u(t)} = Γ_α [1 + n_α(u(t))],    (15)

where â_u(t) and â†_u(t) are respectively the (control-dependent) lowering and raising operators of the instantaneous Hamiltonian, Γ_α is a constant rate setting the thermalization timescale of the system coupled to bath α, and n_α(u) = [e^{β_α ω₀ u} − 1]⁻¹ is the Bose-Einstein distribution evaluated at the instantaneous frequency ω₀ u.

Reinforcement-Learning Algorithm. As discussed in the Results section, the choice of the reward as in Eq. (5) guarantees that the aim of the RL agent is to maximize P^[ν] introduced in Eq. (3). To be precise, plugging Eq. (5) into Eq. (4) gives P^[ν] (up to an irrelevant constant prefactor) only in the limit ∆t → 0. However, also for finite ∆t, both quantities are time-averages of the power, so they are equally valid definitions to describe long-term power maximization.
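A property shared by the fermionic and bosonic rates above is detailed balance, γ_+/γ_− = e^{−β·gap}, which guarantees relaxation toward the Gibbs state of the instantaneous Hamiltonian. This can be verified in a few lines (parameter values are illustrative):

```python
import numpy as np

# Detailed-balance check for the excitation (+) and relaxation (-) rates
# used in the two-level and harmonic oscillator models. Values illustrative.

beta, gap, Gamma = 1.3, 0.7, 2.0

# Fermionic bath (two-level engine): f is the Fermi distribution.
f = 1.0 / (1.0 + np.exp(beta * gap))
g_plus_fermi, g_minus_fermi = Gamma * f, Gamma * (1.0 - f)

# Bosonic bath (harmonic oscillator): n is the Bose-Einstein distribution.
n = 1.0 / (np.exp(beta * gap) - 1.0)
g_plus_bose, g_minus_bose = Gamma * n, Gamma * (1.0 + n)

db_fermi = g_plus_fermi / g_minus_fermi  # both equal exp(-beta * gap)
db_bose = g_plus_bose / g_minus_bose
```

The same ratio holds for the resonant-circuit rates of Eq. (13), since S_α(−∆)/S_α(∆) = e^{−β_α ∆}.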
We use a generalization of the soft-actor critic (SAC) method, first developed for continuous actions [73,74], to handle a combination of discrete and continuous actions [75,76]. We here present an overview of our implementation of SAC putting special emphasis on the differences with respect to the standard implementation. However, we refer to [73][74][75][76] for additional details. Our method, implemented with PyTorch, is based on modifications and generalizations of the SAC implementation provided by Spinning Up from OpenAI [109]. All code and data to reproduce the experiments is available online (see Data Availability and Code Availability sections).
The SAC algorithm is based on policy iteration, i.e. it consists of iterating multiple times over two steps: a policy evaluation step, and a policy improvement step.
In the policy evaluation step, the value function of the current policy is (partially) learned, whereas in the policy improvement step a better policy is learned by making use of the value function. We now describe these steps more in detail.
In typical RL problems, the optimal policy π*(a|s) is defined as the policy that maximizes the expected return defined in Eq. (4), i.e.:

    π* = argmax_π E_{s∼μ_π} E_π [ Σ_{k=0}^∞ γ^k r_{k+1} | s₀ = s ],    (16)

where E_π denotes the expectation value choosing actions according to the policy π, and the initial state s₀ = s is sampled from μ_π, i.e. the steady-state distribution of states that are visited by π. In the SAC method, the balance between exploration and exploitation [72] is achieved by introducing an entropy-regularized maximization objective. In this setting, the optimal policy π* is given by

    π* = argmax_π E_{s∼B} E_π [ Σ_{k=0}^∞ γ^k ( r_{k+1} + ε H(π(·|s_k)) ) | s₀ = s ],    (17)

where ε ≥ 0 (usually denoted with α in the RL literature) is a hyper-parameter that balances the trade-off between exploration and exploitation, and

    H(P) = E_{x∼P}[ −ln P(x) ]

is the entropy of the probability distribution P. Notice that we replaced the unknown state distribution μ_π with B, which is a replay buffer populated during training by storing the observed one-step transitions (s_k, a_k, r_{k+1}, s_{k+1}). We define the value function Q_π(s, a) of a given policy π as

    Q_π(s, a) = E_π [ r₁ + Σ_{k=1}^∞ γ^k ( r_{k+1} + ε H(π(·|s_k)) ) | s₀ = s, a₀ = a ],    (18)

and its recursive Bellman equation reads

    Q_π(s, a) = E [ r₁ + γ E_{a₁∼π(·|s₁)} [ Q_π(s₁, a₁) ] + γ ε H(π(·|s₁)) ].    (19)

We use a function approximator Q_φ(s, a) (e.g. a neural network) to describe the value function of the current policy, where φ represents a collection of learnable parameters. Next, we describe the policy π(a|s). Since we are dealing with a combination of discrete and continuous actions, we define a = (u, d), where u is the continuous action and d is the discrete action (for simplicity, we describe the case of a single continuous action, though the generalization to multiple variables is straightforward). From now on, all functions of a are also to be considered as functions of u, d.
We decompose the joint probability distribution as

    π_θ(u, d|s) = π_θD(d|s) · π_θU,d(u|d, s),    (20)

where π_θD(d|s) is a function approximator for the marginal probability of taking discrete action d, which depends on learnable parameters θ_D, and π_θU,d(u|d, s) is a parameterization of the conditional probability density of choosing action u, given action d, which depends on learnable parameters θ_U,d (one set for each discrete action d). We denote with θ the collection of all parameters θ_D and θ_U,d. Notice that this decomposition allows us to describe correlations between the discrete and the continuous action, which are crucial in our application. We further parameterize π_θU,d(u|d, s) as a squashed Gaussian policy, i.e. as the distribution of the variable

    ũ_θ(s, ξ) = u_a + (u_b − u_a) [ 1 + tanh( μ_θU,d(s) + σ_θU,d(s) ξ ) ] / 2,    ξ ∼ N(0, 1),    (21)

where μ_θU,d(s) and σ_θU,d(s), representing respectively the mean and standard deviation of the Gaussian distribution, are function approximators which depend on the learnable parameters θ_U,d, N(0, 1) is the normal distribution with zero mean and unit variance, and where we assume that U = [u_a, u_b].
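For concreteness, sampling from this hybrid policy can be sketched as follows (a minimal numpy illustration; the arrays `pi_d`, `mu` and `log_sigma` stand in for the network outputs, and the tanh rescaling into [u_a, u_b] follows the squashed-Gaussian construction above):

```python
import numpy as np

def sample_hybrid_action(pi_d, mu, log_sigma, u_a, u_b, rng):
    """Sample a = (u, d) from the factorized policy pi(u, d|s).

    pi_d:      (|D|,) discrete-action probabilities pi_D(d|s)
    mu, log_sigma: (|D|,) Gaussian parameters, one set per discrete action d
    """
    d = rng.choice(len(pi_d), p=pi_d)               # draw the discrete action
    xi = rng.standard_normal()                      # xi ~ N(0, 1)
    g = mu[d] + np.exp(log_sigma[d]) * xi           # unsquashed Gaussian sample
    u = u_a + (u_b - u_a) * (1 + np.tanh(g)) / 2    # squash into U = [u_a, u_b]
    return u, d
```

The tanh squashing guarantees that the sampled continuous action always lies in the allowed control range, regardless of the Gaussian parameters.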
We now describe the policy evaluation step. In the SAC algorithm, we learn two value functions Q_φi(s, a) described by the learnable parameters φ_i, for i = 1, 2. Since Q_φi(s, a) should satisfy the Bellman Eq. (19), we learn the parameters φ_i by minimizing the mean-square Bellman error

    L_Q(φ_i) = E_{(s,a,r,s′)∼B}[ ( Q_φi(s, a) − y(r, s′) )² ],    (22)

where the target value y(r, s′) is given by

    y(r, s′) = r + γ E_{a′∼π_θ(·|s′)}[ min_{j=1,2} Q_φtarg,j(s′, a′) − ε ln π_θ(a′|s′) ].    (23)

Here the minimum over the two value functions is taken to mitigate the overestimation of the value function, and φ_targ,j are the parameters of target networks, which are held fixed during the gradient step and updated after each optimization step using Polyak averaging, i.e.

    φ_targ,j ← ρ_polyak φ_targ,j + (1 − ρ_polyak) φ_j,    (24)
where ρ_polyak is a hyperparameter. This change was shown to improve learning [73,74]. In order to evaluate the expectation value in Eq. (23), we use the decomposition in Eq. (20) to write

    E_{a′∼π_θ(·|s′)}[ f(s′, a′) ] = Σ_{d′} π_θD(d′|s′) E_{u′∼π_θU,d′(·|d′,s′)}[ f(s′, u′, d′) ],    (25)

where we denote a′ = (u′, d′) and f is an arbitrary function. Plugging Eq. (25) into Eq. (23), and approximating the expectation value over u′ with a single sampled value ũ′_d′ yields

    y(r, s′) ≈ r + γ Σ_{d′} π_θD(d′|s′)[ min_{j=1,2} Q_φtarg,j(s′, ũ′_d′, d′) − ε ln π_θ(ũ′_d′, d′|s′) ].    (26)

We therefore perform a full average over the discrete action, and a single sampling of the continuous action.
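A minimal numpy sketch of this estimate of y(r, s′) (assuming the −ε ln π form of the entropy bonus; `q_targ` holds the two target critics evaluated at one sampled u′ per discrete action):

```python
import numpy as np

def soft_target(r, gamma, eps, pi_d, logp_d, logp_u, q_targ):
    """Estimate the target y(r, s'): a full average over the discrete
    action d', a single sampled continuous action u' per discrete action.

    pi_d:   (|D|,) probabilities pi_thetaD(d'|s')
    logp_d: (|D|,) = ln pi_thetaD(d'|s')
    logp_u: (|D|,) log-density of the sampled u' under pi_thetaU,d'
    q_targ: (2, |D|) target-critic values at (s', u'_d', d')
    """
    soft_q = q_targ.min(axis=0) - eps * (logp_d + logp_u)  # min_j Q - eps * ln pi
    return r + gamma * float(np.dot(pi_d, soft_q))
```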
We now turn to the policy improvement step. Given a policy π_θold, Ref. [74] proves that π_θnew is a better policy [with respect to the maximization in Eq. (16)] if we update the policy parameters according to

    θ_new = argmin_θ D_KL( π_θ(·|s) ∥ exp( Q^π_θold(s, ·)/ε ) / Z_π_θold(s) ),    (27)

where s is any state, D_KL denotes the Kullback-Leibler divergence, and Z_π_θold(s) is the partition function that normalizes the exponential of the value function. Intuitively, this step is the soft equivalent of making the policy greedy in the standard RL setting. The idea is to use the minimization in Eq. (27) to define a loss function to perform an update of θ. Noting that the partition function does not impact the gradient, multiplying the Kullback-Leibler divergence by ε, and replacing Q^π_θold with min_j Q_φj, we define the loss function as

    L_π(θ) = E_{s∼B} E_{a∼π_θ(·|s)}[ ε ln π_θ(a|s) − min_{j=1,2} Q_φj(s, a) ].    (28)

In order to evaluate the expectation value in Eq. (28), we use the previous trick of averaging over the discrete action, and performing a single sample of the continuous action using ξ. Recalling Eq. (21), this yields

    L_π(θ) = E_{s∼B} Σ_d π_θD(d|s)[ ε ln π_θ( ũ_θ(s, ξ), d | s ) − min_{j=1,2} Q_φj( s, ũ_θ(s, ξ), d ) ].    (29)

To summarize, the SAC algorithm consists of repeatedly alternating a policy evaluation step and a policy improvement step. The policy evaluation step consists of a single optimization step to minimize the loss functions L_Q(φ_i) (for i = 1, 2), given in Eq. (22), where y(r, s′) is computed using Eq. (26). The policy improvement step consists of a single optimization step to minimize the loss function L_π(θ) given in Eq. (29). In both loss functions, the expectation value over the states is approximated with a batch of experience sampled randomly from the replay buffer B.
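Correspondingly, the single-state estimate of the policy loss can be sketched as follows (numpy, value only; in practice the gradient flows through the reparameterized sample ũ_θ(s, ξ), which is handled by automatic differentiation):

```python
import numpy as np

def policy_loss_single_state(eps, pi_d, logp_d, logp_u, q_min):
    """Single-state estimate of L_pi: a full average over the discrete
    action d, one reparameterized continuous sample u_tilde per d.

    pi_d:   (|D|,) probabilities pi_thetaD(d|s)
    logp_d: (|D|,) = ln pi_thetaD(d|s)
    logp_u: (|D|,) log-density of u_tilde under pi_thetaU,d
    q_min:  (|D|,) = min_j Q_phi_j(s, u_tilde_d, d)
    """
    return float(np.dot(pi_d, eps * (logp_d + logp_u) - q_min))
```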
Training details. We now provide the details of the algorithm used to learn the four specific cycles described in the main manuscript.
The value function Q_φ(s, u, d) is parameterized in the following way. We use a fully connected neural network (NN), with two hidden layers, that takes s and u as input (by stacking them into a single array), and outputs |D| values, where |D| is the number of discrete actions. The i-th output corresponds to Q_φ(s, u, d = d_i), where {d_i} are the possible discrete actions. We use the ReLU activation function in all layers except for the output layer, where we apply the identity (since the value function can take arbitrary positive or negative values). The parameters φ correspond to the weights and biases of the whole network.
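The critic architecture just described can be sketched as follows (a forward-pass-only numpy sketch; hidden sizes and the initialization scheme are illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class QNetwork:
    """Fully connected critic: input (s, u) stacked into one array, two
    ReLU hidden layers, |D| linear outputs; output i is Q(s, u, d_i)."""

    def __init__(self, state_dim, hidden, n_discrete, rng):
        dims = [state_dim + 1, hidden, hidden, n_discrete]  # +1 for the continuous action u
        self.W = [rng.standard_normal((a, b)) * np.sqrt(2.0 / a)
                  for a, b in zip(dims[:-1], dims[1:])]
        self.b = [np.zeros(b) for b in dims[1:]]

    def __call__(self, s, u):
        x = np.concatenate([s, [u]])
        for W, b in zip(self.W[:-1], self.b[:-1]):
            x = relu(x @ W + b)
        return x @ self.W[-1] + self.b[-1]   # identity on the output layer
```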
The policy π_θ(u, d|s) = π_θD(d|s) · π_θU,d(u|d, s) is parameterized in the following way. We use a fully connected NN, with two hidden layers, that takes s as input, and produces 3·|D| values, corresponding to π_θD(d_i|s), μ_θU,d_i(s) and log σ_θU,d_i(s), for i = 1, …, |D|. More specifically, we use the ReLU activation in all layers except for the output layer, where we use the identity. However, in order to enforce the normalization Σ_i π_θD(d_i|s) = 1, we apply a soft-max to the corresponding outputs, and instead of outputting σ_θU,d_i(s) directly, we output log σ_θU,d_i(s), which has no constraint on the sign.
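The mapping from the 3·|D| raw outputs to valid policy parameters can be sketched as follows (numpy; the `raw` array is a stand-in for the network's output layer):

```python
import numpy as np

def policy_head(raw):
    """Map the 3*|D| raw outputs to policy parameters: a soft-max turns
    the first |D| values into normalized probabilities pi_D(d_i|s), and
    exponentiation turns the unconstrained log-sigma outputs into
    positive standard deviations.

    raw: (3, |D|) array with rows (logits of pi_D, mu, log sigma).
    """
    logits, mu, log_sigma = raw
    z = np.exp(logits - logits.max())   # numerically stable soft-max
    pi_d = z / z.sum()
    return pi_d, mu, np.exp(log_sigma)
```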
In order to enforce sufficient exploration in the early stage of training, we do the following. For a fixed number of initial steps, we choose random actions, sampling them uniformly within their range. Furthermore, for another fixed number of initial steps, we do not update the parameters, to allow the replay buffer to accumulate enough transitions. B is a first-in-first-out buffer, of fixed dimension, from which batches of transitions are randomly sampled to update the NN parameters. After this initial phase, we repeat a policy evaluation and a policy improvement step n_updates times every n_updates steps. This way, the overall number of updates coincides with the number of actions performed on the environment. The optimization steps are performed using the Adam optimizer with the standard values of β_1 and β_2. To favor an exploratory behavior early in the training, and at the same time to end up with a policy that is approximately deterministic, we schedule ε. In particular, we vary it during each step according to ε(n_steps) = ε_0 exp(−n_steps/ε_decay), where n_steps is the current step number, and ε_0 and ε_decay are hyperparameters. All hyperparameters used to produce the cycles in Figs. 3, 4 and 5 are provided in Table I. Finally, we discuss the parameterization of the state s. As discussed in the manuscript, we use s = (ρ̂, u) as the state. While u is passed to the NNs as-is, we now detail how we encoded ρ̂ in the 3 systems studied in the manuscript.
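The exploration schedule is then a simple exponential decay:

```python
import math

def epsilon(n_steps, eps_0, eps_decay):
    """eps(n_steps) = eps_0 * exp(-n_steps / eps_decay): exploratory
    (large eps) early in training, approximately deterministic at the end."""
    return eps_0 * math.exp(-n_steps / eps_decay)
```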
In the two-level system heat engine, a closed equation of motion governing the evolution of p(t) ≡ Tr[ρ̂(t) σ̂₊σ̂₋] can be derived from the Markovian master equation, and the instantaneous heat flux J_α(t) can be expressed solely in terms of p(t) (see Ref. [90] for details). Therefore, we use the single parameter p(t) ∈ [0, 1] to encode ρ̂. In the superconducting qubit refrigerator, we encode ρ̂ using the following three real parameters: ⟨e_u(t)|ρ̂(t)|e_u(t)⟩, Re[⟨g_u(t)|ρ̂(t)|e_u(t)⟩] and Im[⟨g_u(t)|ρ̂(t)|e_u(t)⟩]. These fully characterize the density matrix of a qubit.
In the harmonic oscillator heat engine, a closed set of equations of motion can be derived from the Markovian master equation for three quantities H(t), L(t) and D(t), given by the expectation values of the Hamiltonian and of two additional quadratic operators defined in Ref. [42]. Furthermore, the heat flux J_α(t) can be expressed solely in terms of these quantities (see Ref. [42] for details). We thus use these three quantities to encode ρ̂. More specifically, since they are not bounded in an obvious way by some system parameter, for numerical stability we use Õ(t) ≡ log(|O(t)| + δ) and s_O(t) ≡ sign(O(t)), for O = H, L, D, to encode the state, where δ = 10⁻²⁰ is a small parameter introduced to prevent numerical divergences. Lastly, we do not use s_H(t), since H(t) is always a positive quantity (being the energy of the harmonic oscillator). Therefore, we encode ρ̂ using the following 5 parameters: (H̃, L̃, D̃, s_L, s_D).
Convergence of the RL approach. The training process presents some degree of stochasticity, such as the initial random steps, and the random sampling of a batch of experience from the replay buffer to compute an approximate gradient of the loss functions. We thus need to evaluate the reliability of our approach.
In Figs. 6 and 7, we show the training curves for 6 consecutive runs of our method applied to all 4 cases studied in the main text. More specifically, panels (a), (c) and (e) of Fig. 6 correspond to the training of the two-level system heat engine (considered in Fig. 3), while panels (b), (d) and (f) of Fig. 6 correspond to the training of the superconducting qubit refrigerator (considered in Fig. 4). The corresponding panels of Fig. 7 refer to the training of the harmonic oscillator heat engine in the intervals considered in Figs. 5b and 5d. Both Figs. 6 and 7 show, as a function of the step, the running average of the power [panels (a) and (b)], the running average of the loss function L_Q [panels (c) and (d)] and the running average of the loss function L_π [panels (e) and (f)], computed on the batch of experience used to estimate the gradient of the corresponding loss function. Every curve corresponds to a separate training.
As we can see in Figs. 6 and 7, in all 4 cases the running average of the reward reliably converges to a solution yielding similar values of the power, and the running averages of the loss functions display a qualitatively similar behavior. We notice that while L_Q is the mean square of the Bellman error [see Eq. (22)], L_π is just a function whose gradient provides a better policy, so its value during training is not required to be a decreasing function.
Generation of coherence. In order to quantify the coherence generated in the instantaneous eigenbasis of the Hamiltonian in the refrigerator based on a superconducting qubit, we evaluate the time average of the relative entropy of coherence [100], defined as C(ρ̂(t)) = S(ρ̂_diag(t)) − S(ρ̂(t)), where S(ρ̂) = −Tr[ρ̂ ln ρ̂] is the von Neumann entropy, and

    ρ̂_diag(t) = ⟨g_u(t)|ρ̂(t)|g_u(t)⟩ · |g_u(t)⟩⟨g_u(t)| + ⟨e_u(t)|ρ̂(t)|e_u(t)⟩ · |e_u(t)⟩⟨e_u(t)|    (34)

is the density matrix, in the instantaneous eigenbasis |g_u(t)⟩ and |e_u(t)⟩, with the off-diagonal terms canceled out. We find that the time average of the relative entropy of coherence is ≈ 0.116 in the cycle found by the RL agent, while it is ≈ 0.194 applying the trapezoidal cycle with the same period.
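As an illustration, the relative entropy of coherence of a qubit state can be computed as follows (a numpy sketch; the state is assumed to be written in the instantaneous energy eigenbasis, so that dropping the off-diagonal entries implements Eq. (34)):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr[rho ln rho], computed from the eigenvalues."""
    w = np.linalg.eigvalsh(rho)
    w = w[w > 1e-12]                      # convention: 0 * ln(0) = 0
    return float(-np.sum(w * np.log(w)))

def relative_entropy_of_coherence(rho):
    """C(rho) = S(rho_diag) - S(rho), with rho_diag obtained by
    canceling the off-diagonal terms in the basis rho is written in."""
    rho_diag = np.diag(np.diag(rho))
    return von_neumann_entropy(rho_diag) - von_neumann_entropy(rho)
```

Any off-diagonal element increases S(ρ̂_diag) relative to S(ρ̂), so C is positive for a coherent state and vanishes for a diagonal one.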
Otto cycle comparison. In Fig. 5 we compare the performance of an Otto cycle, optimized as in the upper panel of Fig. 6 of Ref. [42], with the cycle discovered by the RL agent. However, Ref. [42] only provides the value of the power of the optimized Otto cycle, not the durations of the four strokes that produce such power. We therefore performed a grid search in the space of these four durations. After identifying the largest power, we ran the Newton algorithm to further maximize the power. The final cycle we found is the one shown as dashed lines in Figs. 5c and 5d. The corresponding power, shown as a dashed line in Figs. 5a and 5b, matches the value reported in Ref. [42] well.