The minimal work cost of information processing

Irreversible information processing cannot be carried out without some inevitable thermodynamical work cost. This fundamental restriction, known as Landauer's principle, is increasingly relevant today, as heat dissipation increasingly limits the performance of computing devices. Here we determine the minimal work required to carry out any logical process, for instance a computation. It is given by the entropy of the discarded information conditioned on the output of the computation. Our formula precisely accounts for the statistically fluctuating work requirement of the logical process. It enables the explicit calculation of practical scenarios, such as computational circuits or quantum measurements. On the conceptual level, our result gives a precise and operational connection between thermodynamic and information entropy, and explains the emergence of the entropy state function in macroscopic thermodynamics.

Supplementary Figure 1. Thermodynamical processes can be used to physically implement abstract logical processes. a. An isothermal compression implements the logical process corresponding to randomizing the position of the particles within half the original volume. b. In this logical process, the X positions of the particles are mapped to half their value. This can be implemented by introducing many separators, resolving the position of the particles to good enough precision, and then performing an isothermal compression of each slice of gas. This procedure has the same optimal work cost as the previous one, even though the logical processes are different. c. A single-particle gas can be used to model an AND gate. The particle is to be brought to output region A if it was originally in regions I, II or III, and to region B otherwise. The work cost fluctuates because of the probabilistic nature of the input state. d. If the probabilities that the particle initially resides in the different regions are not equal, the same amount of work as in the previous case is needed. However, if only the correct output state is to be reproduced, less work is required in some situations.

Supplementary Figure 2. Examples of large, non-typical distributions. a. The probability distribution (given by the spectrum of the state) of a classical system of one random qubit, along with n other qubits that are all 0 if the first qubit is 0, or uniformly random otherwise. b. Two different operations on this system may have the same input and output state, yet their work costs may differ arbitrarily. The first operation copies its input to its output (identity map), which costs no work. The second destroys the input and reproduces a fresh system at the output.

Supplementary Figure 3. a. Lambda-majorization corresponds to absorbing a certain amount of randomness from an ancilla during a unitary operation. The system X starts in state σ, and the ancilla A starts in a state with λ_1 fully mixed qubits, the remaining qubits being pure. The goal is to devise a global unitary that brings the system X to the state ρ, while leaving the least possible number λ_2 of fully mixed qubits in A. The difference λ = λ_1 - λ_2 is the work extracted by the process; if this value is negative, it corresponds to a work cost. b. Our main result gives a fundamental lower bound on the work cost W of a process transforming a state σ_X (purified by a fictitious reference R in |σ⟩_{XR}) into a new state ρ_{X'R} obtained by applying a process E_{X→X'}. We optimize the work cost of lambda-majorization operations that perform the process E. The lower bound on the work cost is then given by the entropy of the information E that the process E has to discard (which purifies the state ρ_{X'R}), as measured by the Rényi-zero conditional entropy H_0(E|X')_ρ.

The relation between thermodynamics and information, and in particular between the statistical Gibbs entropy and the information-theoretic entropy, has been extensively studied from various perspectives. We give a short overview in this section; for a rather comprehensive discussion we suggest Ref. [1].
The fundamental question was raised by Maxwell in the 19th century, who imagined a perpetuum mobile acting on a gas divided into two chambers, whose net effect was the reduction of the entropy of an isolated system: a small being with knowledge of the microscopic degrees of freedom of the gas could operate a trap door in the dividing wall, and use it to filter the cold molecules from the hot ones. Szilard [2] realized that the crucial part of the problem was that the demon accessed microscopic degrees of freedom, which are not normally accessible in thermodynamics. He devised a thought experiment, the Szilard box (see main text), which illustrated the reversible conversion of kT ln 2 of work from or into well-defined accessible information. This suggested that the demon had to perform work to compensate for the entropy decrease of the gas. Scientists at the time were thus led to believe that the measurement itself was a process that had to cost work, and corresponding thought models were developed [3,4]. It was Landauer in 1961 [5] who first associated a work cost to the logical irreversibility of an operation, stating that the erasure of 1 bit of information has to cost kT ln 2 of work, and who studied in particular the example of a particle in a double-well potential [5,6]. Bennett showed that computations can be made completely reversible and devised an explicit measurement apparatus requiring no work. On the other hand, resetting the demon's memory back to its original state does cost work, effectively exorcising Maxwell's demon [7][8][9][10][11]. Landauer's principle has been criticized [12,13], but became widely accepted as alternative proofs were proposed [14,15] (cf. also [16]) and its conceptual importance was clarified [9]. Various physical computational models were explored [17][18][19][20], while general considerations relating information and physics were discussed [21][22][23][24].
These efforts were paralleled by Jaynes' work showing the relevance of information theory for statistical mechanics and thermodynamics [25][26][27][28].
While most information-theoretic approaches had studied averages over many independent repetitions of the same experiment, known as the i.i.d. regime, some effort was made to focus on single instances of information-theoretic tasks [51], where the natural entropy measures to consider are the smooth entropies [51][52][53]. In this regime, information erasure has also been characterized using the smooth entropy framework [54][55][56][57][58][59].
The frameworks that have been considered for the study of the thermodynamics of information processing vary widely. While some studies have focused on explicit constructions of physical systems, such as Szilard boxes [2,55,59], others have considered, for example, systems described by general interacting Hamiltonians, or systems for which the modification of individual energy levels is allowed [14,30,46,54,56,[60][61][62][63]. Another very promising approach, from which the framework in this paper is largely inspired, is based on a resource theory of thermal operations, in which Gibbs states are free resources [33,57,[64][65][66]. The two approaches are equivalent [65].
Our result adds to the effort of relating information theory to thermodynamics, in the form of a general formula for the minimal work requirement of a logical process, where the logical process can be any quantum physical evolution. This result may be seen as an information-theoretic statement, expressing the minimal size of an ancillary system needed to store the information discarded by the logical process, with a natural direct application to thermodynamics.
As a special case, our result allows us to study the work cost of quantum measurements. We start by noting that in the literature discussed above there is a slight ambiguity as to what a measurement is exactly, and, in particular, whether the memory register of the apparatus starts in a well-defined pure state or first needs initialization. If we include the memory initialization process, the measurement does cost work, whereas with a pure memory, the act of transferring information to the memory costs no work. This fact was also emphasized by Sagawa and Ueda [60,67,68], cf. also [69].

SUPPLEMENTARY NOTE 2 - SOME INITIAL REMARKS AND CLARIFICATIONS.
We wish to emphasize that in this entire work, and unless otherwise stated, by "process" we mean to denote a logical process, and not a thermodynamical process. Thermodynamical processes will come to play via the operations allowed by our framework.
Logical processes, or computations, are abstract mathematical mappings of input states to output states. For example, an AND gate maps the logical states 00, 01 and 10 to the logical state 0, and 11 to the logical state 1. Logical processes are defined completely independently of their physical implementation, in the same spirit as Shannon's abstraction of the unit of information. In the most general case, a logical process is specified by a completely positive, trace-preserving map E.
On the other hand, the logical information has to be stored in a physical system, and any logical operation has to be implemented through an appropriate time evolution of the physical system in interaction with a thermal bath and some control system(s). The specification of a logical process E contains no information about how much work was actually used to perform it: different strategies, different thermodynamical processes, or different levels of noise, losses or friction might cause the physical procedure to use up very different amounts of work.
However, there is a fundamental limit on how much work is required for the implementation of a logical process; otherwise one could build a thermodynamic cycle with a net work gain. In usual thermodynamics, this limit can simply be calculated as the difference in free energy between the final state and the initial state. In other words, the free energy, a state function, acts as a potential from which one can derive the minimal amount of work needed to perform a transition from one thermodynamic state to another.
We derive the fundamental work requirement of implementing a logical process at the microscopic level. This is again the work cost of the best possible thermodynamic process that succeeds in implementing the given logical process. As mentioned in the main text, one of our main conclusions is that this minimal work requirement can no longer be given by a state function. In other words, it is not possible to define a "generalized free energy", a state function which would give the minimal work requirement of a process as a difference between the initial and final states of the computation. This is also in line with the conclusions of Lieb and Yngvason [70] as well as Horodecki et al. [57].
We wish to draw the attention of the reader to the fact that our conclusions have nothing to do with the statement that thermodynamic work itself is not a state function. Indeed, in standard thermodynamics and as mentioned above, the minimal work requirement for going from one state to another is still given by a state function, namely the free energy.
In order to further clarify the relation between logical and thermodynamical processes, consider an ideal gas of N particles in a box of volume 2V, at temperature T. Consider bringing this gas to a new state with half the volume, given by the parameters (T, V, N) (see Supplementary Figure 1). The specification of a logical process goes beyond specifying the input and output states: one can require, for example, that the position of each particle be completely randomized at the end of the process (Supplementary Figure 1a), with no correlations between input and output; or one can require that a particle located at a position (x, y, z) be located at (x/2, y, z) at the output (Supplementary Figure 1b). Both of these logical processes have the same input and output state.
The first logical process can be physically implemented with a simple isothermal thermodynamical compression.
The second logical process can be implemented similarly, using a trick: first, we insert many separators in the box, resolving the positions of the particles to some acceptable precision, and then we perform an isothermal compression of each of those slices of the gas independently. Each particle originally located at position x on the X axis is then located at x/2 at the output, and kT ln 2 of work was expended per particle. (Should the y and z coordinates also be required to be correlated between input and output, grids of separators along the other axes would have to be inserted as well, while the compression is performed in the X direction.)

In these simple examples, both logical processes have the same minimal work cost (it is easy to see that the thermodynamical processes given above are optimal, for example using our main result). This illustrates our main result in the thermodynamic limit: in an i.i.d. setting, the minimal work requirement is simply given by the difference between final and initial entropy, regardless of the specific logical process. (Note that here entropy corresponds to free energy, since there is no change in internal energy.)

Note that both logical processes could also have been performed with an irreversible thermodynamic process, for example a fast compression followed by a thermalization (of the whole gas in Supplementary Figure 1a, or of each slice in Supplementary Figure 1b). Then additional, irreversible work is required. However, these irreversible processes are not the optimal thermodynamical processes that carry out the requested logical processes. The expression in our main result is given by the optimal thermodynamic implementation of a given logical process.
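As a quick numerical sketch of the compression step (the function name and the choice T = 300 K are ours, purely for illustration), the reversible isothermal work for an ideal gas is N kT ln(V_initial/V_final), so halving the volume costs exactly kT ln 2 per particle, whichever of the two logical processes is being implemented:

```python
import math

k_B = 1.380649e-23  # Boltzmann constant, J/K

def isothermal_work(N, T, volume_ratio):
    """Reversible isothermal compression work (in J) for an ideal gas of
    N particles compressed by volume_ratio = V_initial / V_final."""
    return N * k_B * T * math.log(volume_ratio)

# Halving the volume: the per-particle cost, in units of kT ln 2, is 1,
# whether the whole gas is compressed at once (process a) or each slice
# is compressed separately (process b).
T = 300.0
W = isothermal_work(1, T, 2.0)
print(W / (k_B * T * math.log(2)))  # 1.0
```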
We further note that, in general, logically irreversible processes can be implemented in a thermodynamically reversible way: a Szilard box in a completely mixed state, for example, can be reset to a pure state by a simple reversible isothermal process. While the thermodynamic transformation is reversible, meaning that we can recover the initial mixed state and get back all the work invested in the erasure, the precise logical state the memory was initially in (whether the particle was on the left or the right side of the box) is irreversibly lost in the heat bath. (This reasoning does not contradict the results by Ladyman et al. [24], because the thermodynamical processes they show to be irreversible are the thermodynamic processes p_x "conditioned" on the particular initial logical state x of the device, using their notation.)

Consider now the example depicted in Supplementary Figure 1c: a single particle is in the box, and three partitions are inserted, subdividing the box into four even regions I, II, III and IV. We wish to perform the logical process that maps a particle in regions I, II or III to output region A, and a particle in region IV to region B. (Upon appropriate relabeling of the regions, this is nothing else than an AND gate.) A thermodynamical operation that carries out this logical mapping is an isothermal compression of the first three regions jointly to one-third of their initial volume (after removing the two intermediary separators). The work cost of doing so depends on whether the particle is located in one of the regions I, II, III or not: with probability 3/4, kT ln 2 · log_2 3 of work is expended. (We assume that the isothermal process is carried out infinitely slowly, i.e. quasi-statically, such that its fluctuations have been reduced to zero; the fluctuations of the work cost of the logical process discussed here are solely due to the probabilistic nature of the input state or the logical process, and not due to fluctuations of the thermodynamical process.) However, with probability 1/4, no work is expended, as those regions are empty. In this case, the work cost fluctuates due to the probabilistic nature of the input state. This procedure is again the best strategy we can devise if we require the process to succeed with almost certain probability (this is a consequence of our main result), and its worst-case work cost is kT ln 2 · log_2 3.
If we had considered a large number of particles, then with large probability 3/4 of the particles would be in regions I, II, III, and the work cost would almost deterministically take the average value (3/4) · kT ln 2 · log_2 3 ≈ 1.2 kT ln 2, which one can check to be equal to the difference between initial and final Shannon entropy. This is again in agreement with our main result in the i.i.d. regime.

Now, let us consider again the AND gate with the one-particle gas, but where the particle has different probabilities of being in the different regions, as shown in Supplementary Figure 1d. Specifically, consider the example where the particle resides in region II or III with the same probability as it would be found in region IV, i.e. p_II + p_III = p_IV. Again, we are required to perform the logical process mapping a particle in regions I, II or III to region A, and a particle in region IV to region B. One can convince oneself that the optimal procedure is still to isothermally compress regions I, II and III into the volume of region A, as before. However, as explained in the main text, if it suffices to reproduce the correct output state, p_A = p_I + p_II + p_III, p_B = p_IV, there is a strategy that costs only kT ln 2 of work: isothermally compress regions II and III and merge them into region B, and isothermally compress regions I and IV and merge them into region A (permuting regions and moving them around does not cost any work, as these are unitary, reversible logical operations). This illustrates that the minimal work requirement of a logical process is not fully specified by the initial and final states. In particular, it cannot be given by a state function (such as the free energy in standard thermodynamics).
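The entropy bookkeeping in the i.i.d. version of the AND-gate example can be checked directly; a minimal sketch (the helper `h` and the variable names are ours) verifying that the average work (3/4) · log_2 3 bits equals the drop in Shannon entropy:

```python
import math

def h(ps):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Input: particle uniformly distributed over four equal regions I--IV.
H_in = h([0.25] * 4)        # 2 bits
# Output of the AND-like mapping: region A w.p. 3/4, region B w.p. 1/4.
H_out = h([0.75, 0.25])     # ~0.811 bits

# Average work in units of kT ln 2: compressing I, II, III into A costs
# log2(3) bits of work, and this happens with probability 3/4.
W_avg = 0.75 * math.log2(3)

print(W_avg, H_in - H_out)  # both ~1.19
```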
The important point is that in macroscopic thermodynamics, one does not care about correlations between the input and the output, simply because for large i.i.d. systems (e.g. many independent particles, or large weakly interacting systems such as an ideal gas) those correlations do not matter. This is due to the system being in a typical microstate with overwhelming probability (see main text). We determine, for single quantum systems, the minimal work requirement of a logical process; our formula shows that this requirement is not simply given by a function of state. The key point is that if one goes to the thermodynamic limit, then our formula for the work cost of a logical process does become a function of state. This shows that our result is of a fundamentally different nature from the work loss of a thermodynamically irreversible process, which persists in the thermodynamic limit and is due to avoidable irreversibility.
The literature on information-theoretic tasks, pioneered by Shannon [71], has largely focused in the past on the average resource cost of asymptotically many independent repetitions of a given task, such as the average communication rate needed to send the information output by a source generating independent messages according to a certain distribution.
Recently, frameworks were developed in order to characterize single instances of these tasks, such as determining how many bits are needed to compress a single message distributed according to some known distribution. The two major approaches are the information spectrum [72,73] and the smooth entropy framework [51,53], the two approaches being closely related [74].
We will focus here on the definition of some of the smooth entropies that are needed in this work and some of their properties. More information and proofs can be found in Refs. [51,53,75,76].
In the remainder of this section, let A, B, C be quantum systems and let |ρ⟩_ABC be a pure tripartite state. We say that two states ρ_1 and ρ_2 are ε-close, denoted by ρ_1 ≈_ε ρ_2, if their purified distance as defined in Refs. [53,75] is less than or equal to ε (for normalized states, the purified distance is defined via their fidelity [77]). We refer the reader to these papers for precise definitions of the purified distance, and for comprehensive discussions of optimization ranges over subnormalized states, which will not be particularly relevant here.
Min and Max Entropies. The central quantities of the smooth entropy framework are the so-called min- and max-entropies. The conditional min-entropy is defined as

H_min(A|B)_ρ = max_{σ_B} sup { λ ∈ R : ρ_AB ≤ 2^(-λ) 1_A ⊗ σ_B } ,

and the smooth conditional min-entropy H_min^ε(A|B)_ρ is obtained by maximizing this quantity over all states ρ' that are ε-close to ρ:

H_min^ε(A|B)_ρ = max_{ρ' ≈_ε ρ} H_min(A|B)_{ρ'} .

Similarly, the conditional max-entropy is defined via the fidelity,

H_max(A|C)_ρ = max_{σ_C} 2 log_2 F(ρ_AC, 1_A ⊗ σ_C) ,

and the smooth conditional max-entropy is H_max^ε(A|C)_ρ = min_{ρ' ≈_ε ρ} H_max(A|C)_{ρ'}. The conditional smooth entropies are invariant under local isometries. They also have clear operational interpretations [78]. For example, the min-entropy H_min(A|B) quantifies how many bits in A can be extracted that are uniformly random and uncorrelated with B; the max-entropy H_max(A|C) corresponds to the number of bits one needs to send to a third party who has access to C in order for them to reconstruct A.
Duality Relation. The min- and max-entropy obey the so-called duality relation: for ρ_ABC pure, one has

H_min^ε(A|B)_ρ = -H_max^ε(A|C)_ρ .   (2)

(The max-entropy may also be defined first by the duality relation, as originally done [78], and then (2) becomes a theorem.)

Classical-Quantum States. A state ρ_AB is classical-quantum (c-q) if it can be written in the form

ρ_AB = Σ_x |x⟩⟨x|_A ⊗ ρ_B^x ,

for positive operators ρ_B^x and an orthonormal basis {|x⟩_A}.

Rényi-Zero Entropy. An additional entropy measure that will appear naturally in our calculations is the Rényi entropy of order zero, or Rényi-zero entropy H_0(A|C)_ρ [51,79]. It is defined by

H_0(A|C)_ρ = max_{σ_C} log_2 tr( Π_AC (1_A ⊗ σ_C) ) ,

where Π_AC is the projector onto the support of the state ρ_AC. The Rényi-zero entropy is dual to a specific variant of the min-entropy: for ρ_ABC pure, one has

H_0(A|C)_ρ = -H_min(A|B)_{ρ|ρ} ,

where H_min(A|B)_{ρ|ρ} is defined as H_min(A|B)_ρ but with σ_B fixed to the reduced state ρ_B. The Rényi-zero entropy and this variant of the min-entropy have also been termed alternative max-entropy and alternative min-entropy, respectively [76].
When smoothed, the Rényi-zero entropy is closely related to the max-entropy [76]. We have on one hand

H_max^ε(A|C)_ρ ≤ H_0^ε(A|C)_ρ ,

and the two quantities are almost equal, up to an error term of order log(1/ε) and a small adjustment f(ε) to the smoothing parameter ε:

H_0^{f(ε)}(A|C)_ρ ≤ H_max^ε(A|C)_ρ + O(log(1/ε)) .

Von Neumann Entropy. Recall that the conditional von Neumann entropy is defined as H(A|B)_ρ = H(AB)_ρ - H(B)_ρ, with H(A)_ρ = -tr(ρ_A log_2 ρ_A).

Asymptotic Equipartition Property. The smooth entropies all converge to the von Neumann entropy in the i.i.d. limit, a property known as asymptotic equipartition [52]. When considering n independent copies of the same state ρ, for large n we have

lim_{n→∞} (1/n) H_min^ε(A^n|B^n)_{ρ^⊗n} = lim_{n→∞} (1/n) H_max^ε(A^n|B^n)_{ρ^⊗n} = H(A|B)_ρ .

In particular, any term of order log(1/ε) disappears when taking the limit n → ∞, so the quantities H_0^ε(A|B)_ρ and H_min^ε(A|B)_{ρ|ρ} also obey the asymptotic equipartition property.

Our main result states that, in our framework, the work cost of a physical process implementing the computation E exactly is lower bounded by the quantity

W^{ε=0}(bound) = kT ln 2 · H_0(E|X')_ρ .

In the case where an ε-approximation is tolerated, the bound takes the value

W^ε(bound) = kT ln 2 · H_max^ε̄(E|X')_ρ , with ε̄ = √(2ε) .

In the following sections, we further discuss the implications of our result and provide some examples.
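For classical states without conditioning, these entropies reduce to simple functionals of the eigenvalue distribution, and the Rényi family is non-increasing in the order α, giving H_0 ≥ H_max ≥ H ≥ H_min. A toy numerical illustration (unconditional, classical case; the function name is ours):

```python
import math

def entropies(p):
    """Unconditional entropies, in bits, of a classical distribution p."""
    supp = [q for q in p if q > 0]
    H0 = math.log2(len(supp))                              # Renyi-0: log of support size
    Hmax = 2 * math.log2(sum(math.sqrt(q) for q in supp))  # Renyi-1/2 (max-entropy)
    H = -sum(q * math.log2(q) for q in supp)               # Shannon / von Neumann
    Hmin = -math.log2(max(supp))                           # Renyi-infinity (min-entropy)
    return H0, Hmax, H, Hmin

p = [0.5, 0.25, 0.125, 0.125, 0.0]
H0, Hmax, H, Hmin = entropies(p)
# Renyi entropies are non-increasing in the order alpha:
assert H0 >= Hmax >= H >= Hmin
print(H0, Hmax, H, Hmin)
```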
In a slight abuse of notation, but in an effort to disencumber the mathematical expressions, a generic superscript ε on a work cost W or on an entropy measure will be understood to represent a "smoothing" of the quantity, in order to account for very unlikely events. It is understood that some expressions should actually contain variants of this quantity, such as √(2ε), to be technically correct; since our calculations are relatively simple, a technically complete version is straightforward to obtain.
A. On the Tightness of the Minimal Work Bound.
Since the bound W^{ε=0}(bound) was obtained through a chain of equivalences from our original framework to the expression of the bound, we know there exists a unital map over an "information battery" A and the system X which achieves this bound. However, it is not clear how to physically carry out a general unital operation at no work cost (recall that unital operations were chosen precisely because they form a very permissive class, in order to obtain a more general bound). A convenient special case of unital operations are noisy operations [64]: these consist of bringing in a maximally mixed ancilla, performing a global joint unitary, and then tracing out the ancilla. However, not all unital operations are of this kind [80,81]. This leaves open the question of whether our bound is tight.
We have seen in the main text that, using the method proposed by del Rio et al. [54], we can construct an explicit process that carries out the required transformation, fails with probability less than ε, and costs work kT ln(2) · [H_max^ε(E|X')_ρ + Δ(ε)], where Δ(ε) is an error term of the order of log(1/ε). That is, we are able to achieve the bound up to an error term of order log(1/ε).
In some interesting regimes, such as information coding, the error term may be negligible. Indeed, if we want to reset 1 MB of data with a smoothing parameter of at most ε = 10^-10, then the error term is of the order of log(1/ε) ≈ 33 bits, which is small compared to the original ~10^7 bits. However, this error term can become overwhelming when considering small systems consisting of a few qubits.
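The arithmetic of this example can be checked in two lines (a trivial sketch; we take 1 MB as 8 × 10^6 bits and use the binary logarithm):

```python
import math

# Resetting 1 MB of data with failure probability at most eps = 1e-10:
n_bits = 8 * 10**6             # ~10^7 bits in 1 MB
eps = 1e-10
overhead = math.log2(1 / eps)  # error term of order log(1/eps)

print(overhead)                # ~33 bits
print(overhead / n_bits)       # negligible relative to the message size
```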
It is an open question to understand the significance of this gap, and to determine whether the bound can be achieved exactly. However, this type of error term, which is clearly sublinear in the number of systems, is widespread in information and coding theory, and is typically associated with overheads such as encoding the word length itself, or the overhead of adapting the coding scheme. More specifically, such terms invariably appear in random constructions of protocols, such as the one used in [54], on which our tightness proof is based.

B. Simple examples: the AND and XOR gates.
Consider the classical AND and XOR gates presented in the main text. We would like to calculate how much work the best implementation of these gates requires. We can apply our main result, as given by Eq. (3) of the main text, to obtain

W(bound)_and = kT ln 2 · log_2 3 ≈ 1.6 kT ln 2 ;   (10a)
W(bound)_xor = kT ln 2 .   (10b)

(Several quantum descriptions of these classical gates are possible in quantum mechanics; for this example we assume those which measure the input and prepare the appropriate output.) As long as the input distribution does not have very small eigenvalues, no eigenvalue will be comparable to ε, and all distributions that are ε-close to the initial one will have the same rank. Thus the values (10) are exact also for not too large ε. (Note however that this differs from the expression in Eq. (1) of the main text, because the latter was obtained with an additional relaxation of H_0 to H_max for purposes of presentation.)

As mentioned in the main text, these gates illustrate the dependence of the minimal work requirement on the specific computation, and not only on the input and output states. More generally, it is worth noting that, although a specific input state is given, an observer can still distinguish the different possible logical processes, even if they produce the same output state. Indeed, the observer can prepare a bipartite pure state on X and a reference R, with the reduced state on X matching the required input state. By keeping in this way a purification of the input state, the observer can determine exactly which logical process was performed by appropriate measurements on the joint state ρ_{X'R} of the output and the reference (note that ρ_{X'R} is then the Choi-Jamiołkowski state of the logical process).
Observe also that the value (10a) differs from the average work requirement of the AND gate, which is given by the difference in von Neumann entropy between the input and output states (most previous work has focused on this regime). Assuming that the input is uniformly random, i.e. ρ_X = (1/4) 1, one obtains

W_avg(and) = kT ln 2 · [H(ρ_X) - H(ρ_X')] = kT ln 2 · [2 - h(1/4)] ≈ 1.2 kT ln 2 ,

where h(·) denotes the binary entropy function. Additionally, the value (10b) happens to coincide with the average work requirement (calculated similarly) for a uniformly random input; however, if a different input is given, the two values will differ.
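For a deterministic classical gate acting on a full-support input, the discarded information conditioned on a given output is supported on the preimage of that output, so the Rényi-zero conditional entropy reduces to the logarithm of the largest preimage. A small sketch (function names are ours) reproducing the values (10):

```python
import math
from itertools import product

def h0_discarded(gate, n_inputs=2):
    """log2 of the largest preimage of a deterministic gate: the
    Renyi-zero entropy of the discarded information conditioned on
    the output (classical case, full-support input distribution)."""
    preimages = {}
    for bits in product([0, 1], repeat=n_inputs):
        preimages.setdefault(gate(*bits), []).append(bits)
    return math.log2(max(len(v) for v in preimages.values()))

AND = lambda a, b: a & b
XOR = lambda a, b: a ^ b

print(h0_discarded(AND))  # log2(3) ~ 1.585, i.e. W_and = kT ln 2 * log2(3)
print(h0_discarded(XOR))  # 1.0, i.e. W_xor = kT ln 2
```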
C. Arbitrarily large dependence on the computation, with same input and output states.

Consider the example provided in Supplementary Figure 2a: a classical system consisting of one uniformly random qubit, along with n other qubits that are all 0 if the first qubit is 0, and uniformly random otherwise; call the corresponding state ρ.
Consider also the two logical processes depicted schematically in Supplementary Figure 2b. The first logical process E_1 is the identity map, E_1(σ) = id_{X→X}(σ) = σ. The second logical process E_2 resets its input and prepares a fresh copy of ρ, i.e. E_2(σ) = tr(σ) ρ.
First note that both computations have exactly the same input and output states. The minimal work requirement of the identity map is obviously zero, because it can be implemented by doing nothing, or equivalently because it is logically reversible. The analysis is different for E_2, however. If we did nothing, as for E_1, then high correlations would remain between the input and the output, and we would be implementing not the computation E_2 but rather E_1. The minimal work requirement, if we want to be almost certain that the process succeeds, can be understood intuitively as follows: in the worst case, which happens with probability 1/2, the input is in the state that is almost fully mixed, and one first has to reset ~n bits, costing ~n kT ln 2 of work. When preparing the output, we can decide randomly whether to prepare the pure or the mixed state by extracting 1 bit of work from a Szilard box. However, in the worst case, with probability 1/2, we have to prepare the state |0...0⟩, and then at worst only one bit of work can be extracted. The total worst-case work cost of this strategy is

W ≈ (n - 1) kT ln 2 .   (11)

This value can be calculated exactly, as our example is a special case of subsection 4 G below; it turns out to be optimal. The approximations made above (denoted by '~' and '≈') amount to log_2(2^n + 1) ≈ n and n + 1 ≈ n. We have also assumed that (1 - ε)·n ≈ n.
Note that the quantity (11) can become arbitrarily large, as it scales with the number of qubits n.
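A quick numerical check of this scaling, under the approximations just described (the state ρ has support of size 2^n + 1, and at most one bit of work is recovered during the re-preparation; the function name is ours):

```python
import math

def worst_case_reset_bits(n):
    """Worst-case work, in bits (units of kT ln 2), of destroying the
    input and re-preparing a fresh copy of rho: reset a support of size
    2^n + 1, then extract at most 1 bit when preparing the output."""
    support = 2**n + 1  # |0...0> plus the 2^n states of the mixed branch
    return math.log2(support) - 1

for n in (10, 20, 40):
    print(n, worst_case_reset_bits(n))  # ~ (n - 1): grows linearly with n
```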
This distribution might seem very artificially constructed. We nevertheless provide here an example of a physical system which exhibits such behavior. Consider a particle detector, which we model in the following way: as long as no particle has shown up, the detector is initialized in a state |0⟩. Once a particle hits the device, the state of the detector changes to a very disordered state τ, which for the sake of the example we may choose as a uniformly mixed state of rank d: τ = 1_d/d. Suppose we wish to describe the state of the device without knowing whether a particle has hit it or not. If the probability that a particle was detected is 1/2, then the state of the detector is precisely ρ (with d = 2^n).
"Erasure" here simply means resetting the device to its initialized state: the logical process maps the distribution ρ to the pure state |0 . The first logical processes given in Supplementary Figure 2b corresponds to not doing anything to the detector. The second logical process corresponds to resetting the detector, and then again sending a new particle in with probability 1 /2. Note that in this case, by looking at the detector after the process, we may not know the state of the detector at the input of the process. This scenario was studied in the main text; we repeat this derivation in more detail here.
Consider the setting proposed in [54], where a system S is correlated with a system M in a joint state σ_SM, and where our task is to erase S while preserving the reduced state on M and any possible correlations of M with other systems. Formally, given a purification σ_SMR of σ_SM, we are looking for a process that brings this state to the state ρ_SMR = |0⟩⟨0|_S ⊗ σ_MR, i.e. we require the process to preserve σ_MR. In [54] a process is proposed that performs this task at work cost

W = kT ln 2 · [H_max^ε(S|M)_σ + Δ(ε)] ,

where H_max^ε is the smooth max-entropy [53,75,78] and Δ(ε) is an error term of order log(1/ε). The full process that is eventually performed can be written as

E_{SM→SM}(σ_SM) = |0⟩⟨0|_S ⊗ tr_S(σ_SM) .   (12)

(It is straightforward to verify that this process preserves the reduced state σ_MR.) We can now apply our main result to this particular mapping, simply by considering X to be the joint system of S and the memory M. The bound on the work cost, including a smoothing parameter ε, is then

W^ε(bound) = kT ln 2 · H_max^ε̄(E|SM)_ρ = kT ln 2 · H_max^ε̄(E|M)_ρ = kT ln 2 · H_max^ε̄(S|M)_σ ,

where the first equality follows because ρ is pure on S, and the second by reversing the isometry U that purifies the discarded information. We can immediately conclude that, within our framework, any process that performs this erasure has to cost at least kT ln(2) · H_max^ε(S|M)_σ of work. Thus, the process proposed by del Rio et al. is optimal up to logarithmic terms in ε. Note that if we take the memory M to be trivial, i.e. in a pure state, then we are in the standard scenario of Landauer erasure of a single system, and we have W ≥ kT ln 2 · H_max^ε(S)_σ, which is achievable, recovering the result of [55].
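For purely classical joint states, the conditional max-entropy admits the closed form H_max(S|M) = log_2 Σ_m (Σ_s √p(s,m))^2, which makes the two extreme cases easy to check numerically. A sketch under that classical-state assumption (function and variable names are ours):

```python
import math

def hmax_cond(p):
    """Conditional max-entropy H_max(S|M) in bits for a classical joint
    distribution p[s][m], using the closed form valid for classical
    states: log2 of sum_m (sum_s sqrt(p(s, m)))^2."""
    n_m = len(p[0])
    total = 0.0
    for m in range(n_m):
        c = sum(math.sqrt(row[m]) for row in p)
        total += c * c
    return math.log2(total)

d = 4
# S perfectly correlated with the memory M: erasure is free.
corr = [[1 / d if s == m else 0.0 for m in range(d)] for s in range(d)]
# S uniform and uncorrelated with M: full Landauer cost log2(d) bits.
unco = [[1 / d**2 for _ in range(d)] for _ in range(d)]

print(hmax_cond(corr))  # 0.0
print(hmax_cond(unco))  # 2.0
```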
E. Coherent Preparation of a State on a System with a Memory. "Reverse" of a Logical Process.
One may also wonder which process is the "reverse process" of erasure with a quantum memory. Specifically, starting off with a pure system S and some state ρ M on a memory M , one might ask how much work is needed to prepare a given bipartite state σ SM on these systems.
The process as such is not clearly defined, as we have not specified which correlations between the initial and final state on M are to be preserved, or, equivalently, which completely positive map E is to be applied for this preparation.
Let us first study the erasure mapping (12) a bit more closely. The output state of the erasure, including the reference system R, is given by ρ_X'R = |0⟩⟨0|_S ⊗ ρ_MR, where H_X' = H_S ⊗ H_M is the total output system. As mentioned earlier, the joint state on X' and R may be interpreted as the process matrix of the operation E on σ: it can be thought of as a joint probability distribution giving the probability that we had |k⟩ at the input and obtained |k'⟩ at the output; also, ρ_X' is the output state and ρ_R is the input state. This consideration gives us a natural way of reversing any process: a natural "reverse" process to the process ρ_X'R is simply given by swapping the two systems, i.e. considering R as the output and X' as a purification of the input. Let us return to the case of the erasure. First, consider the purification of σ_SM into a system R_S; applying the erasure process on SM then yields the corresponding process matrix. It is thus natural, for the preparation scenario, to consider this process matrix with the roles of input and output exchanged. (Recall that the input state is pure on S.) This state is purified by a system E which contains the traced-out information, via an appropriate isometry. If we then calculate the minimal work cost of performing this process according to our main result, we obtain −kT ln 2 · H^ε_min(S|M)_σ. (We have used the fact that σ and ρ are related by an isometry between R_S and E, as well as the duality between the min- and max-entropies.) We notice that kT ln 2 · H_min(S|M)_σ work can be extracted in the reverse process of the original erasure process, which required kT ln 2 · H_max(S|M)_σ work. These values can be arbitrarily different; this gap is expected, as we require both processes to succeed with high probability. We find that the gap is exactly the difference between the min- and the max-entropy, similar to the single-shot irreversibility between the distillation rate and the formation rate of entangled pairs under LOCC operations [82-85].
Quantum measurements are special cases of quantum processes, and so may also be plugged into our main result. First, we consider the measurement process to be given access to a pure memory register to store the measurement result. We then consider the minimal work cost of preparing the register in its pure state again for a future measurement.
The Measurement Process and its Work Cost.-Suppose that on the system S, in the state σ, we perform a measurement described by a POVM {Q_k}. Each outcome, labeled by the index k, occurs with probability tr(Q_k σ). The completely positive map associated with this measurement is E_{S→CS'}(σ) = Σ_k |k⟩⟨k|_C ⊗ E^{(k)}(σ), where C is a classical register containing the outcome of the measurement (initially in a pure state), and the E^{(k)} are trace-decreasing maps that bring σ to its post-measurement state for the outcome k, which occurs with probability tr(Q_k σ). Equation (19) simply expresses that the output state of the measurement is a mixture of the possible post-measurement states corresponding to the different outcomes k. We emphasize that the register C must start off in a pure state; if this is not the case (as in a purity resource framework, for example), it should be initialized first, at some work cost, before the map E is performed.
We first need to calculate the Stinespring dilation of the process E, which is given by E_{S→CS'}(·) = tr_E[V_{S→ECS'} (·) V†], where the isometry V_{S→ECS'} can be read off the operator-sum representation of (19). For convenience, let R be a purifying system for σ_S, i.e. let |σ⟩_SR be such that tr_R |σ⟩⟨σ|_SR = σ_S. This allows us to write down the full, pure, post-measurement state |ρ⟩_ECS'R, given in (21). Our main result asserts that the minimal work cost of the measurement (19) is simply W ≈ kT ln 2 · H^ε_max(E|CS')_ρ, where the entropy measure is evaluated for the state ρ_ECS'R given by (21). In the remainder of this section, all entropy measures are implicitly evaluated on this state, unless indicated otherwise. The '≈' symbol recalls that one can only approximately achieve the given bound (see Section 4 A). In the remainder of this section, when discussing optimal work costs, we will only consider the value of our bound; it is understood that a successful implementation is possible at a work cost close to the discussed bound in the sense of Section 4 A. It is also implied that all work costs are smoothed with a small but finite parameter ε. In this spirit, denote the value of the right-hand side of (22) by W_meas. As we will see from some simple examples, this quantity may, for a measurement (19) in its most general form, take any value, from a work cost to a work yield. However, we will consider an important class of measurements: those for which the collapse operators E^{(k)} do not themselves need work. In general, we know that those processes that do not need work are sub-unital, i.e. they satisfy E^{(k)}(𝟙) ≤ 𝟙.
This is for example the case for projective measurements, or more generally if the E^{(k)}'s each have only a single Kraus operator. We also note that any general measurement of the form (19) can be written as a measurement whose collapse superoperators each have a single Kraus operator, followed by a partial erasure on the memory register C. Proposition 1. Let E_{S→CS'} be a measurement process of the form (19), and assume that E^{(k)}(𝟙) ≤ 𝟙 for all k.
Then W_meas, as defined by (23), satisfies W_meas ≤ 0.
Proof. Instead of proving directly that the entropic quantity (23) is non-positive, we show that the full measurement process E itself is a sub-unital superoperator, which we know from our framework costs no work; this follows straightforwardly from the sub-unitality of each E^{(k)}.

Resetting the Memory Register Containing the Measurement Outcome.-Let us now consider the task of resetting C to a pure state, after having performed the measurement process above. This resetting can obviously be performed directly, at a cost given by Landauer's principle as kT ln 2 · H^ε_max(C)_ρ, which in turn depends on the number of possible outcomes of the measurement (if ε > 0, we would only count the outcomes that are not extremely unlikely). This procedure, however, is not optimal if we are allowed to access, for example, the information contained in the post-measurement state on S'. Indeed, in the latter case, we may use the system S' as a memory as discussed in Sec. 4 D, and the optimal work cost is then W_reset^{C|S'} = kT ln 2 · H^ε_max(C|S')_ρ. This work cost is always positive, or at best zero, because ρ_CS' is classical-quantum (c-q).
We will see with some examples that this reset cost may be either smaller or larger than W_meas. Of course, this does not constitute a violation of the second law, as we will discuss.
Additionally, we could imagine a scenario where we have kept a purification |σ⟩_SR of the input state on an ancilla system R, in order to "remember" the state of the initial system S. We may then of course use this system to reduce the work cost of erasing C, which becomes W_reset^{C|R} = kT ln 2 · H^ε_max(C|R)_ρ. It turns out that, if the collapse operators E^{(k)} each have a single Kraus operator, this reset cost is always greater than the work yield of performing the measurement (−W_meas), and that the difference between the two is precisely the difference between the max- and min-entropies of the system C conditioned on R.
Proposition 2. Let E_{S→CS'} be a measurement process of the form (19), with E^{(k)}(·) = E_k (·) E_k† for all k. Let ρ_ECS'R be defined as in (21). Then W_reset^{C|R} + W_meas = kT ln 2 · [H^ε_max(C|R) − H^ε_min(C|R)]. Additionally, if ε is not too large (such that H^ε_max ≥ H^ε_min [86]), this expression is always non-negative. Proof. First notice that the state ρ_ECS'R takes a form symmetric in the systems E and C; in particular, ρ is invariant under interchange of the E and C systems. Then, using duality of the smooth entropies [75] (see Section 3), we have H^ε_max(E|CS') = −H^ε_min(E|R) = −H^ε_min(C|R), and thus W_meas = −kT ln 2 · H^ε_min(C|R). Then recall that W_reset^{C|R} = kT ln 2 · H^ε_max(C|R), and that the max-entropy is larger than the min-entropy for small ε.
The final (pure) state on E, C, S' and R, which is the output of the Stinespring isometry V applied to the S system of |σ⟩_SR, is still given by the expression (21).
Some Examples of Measurement Processes.-Let us now focus on some examples of measurement processes, which are all special cases of (19).
(I) Measurement in the computational basis {|0⟩, |1⟩} of a single qubit in the maximally mixed state 𝟙/2. The measurement process then simply yields the classically correlated output state ρ_CS' = (1/2)(|00⟩⟨00| + |11⟩⟨11|). The input state on S is purified by a maximally entangled state |φ⟩_SR on R. The system E also purifies the measurement process, and as given by (21), the global pure state is |ρ⟩_ECS'R = (1/√2)(|0000⟩ + |1111⟩). It is then evident that W_meas = W_reset^{C|S'} = W_reset^{C|R} = 0, as the corresponding reduced states are all perfectly classically correlated.
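In this example the work cost of the measurement and of both reset procedures all vanish, which can be checked numerically via conditional von Neumann entropies (these coincide with the relevant single-shot quantities for such perfectly classically correlated states). A sketch, with illustrative helper functions of our own:

```python
import numpy as np

def vn(rho):
    """Base-2 von Neumann entropy."""
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-12]
    return float(-np.sum(ev * np.log2(ev)))

def ptrace(rho, dims, keep):
    """Partial trace keeping the subsystems listed in `keep`."""
    n, cur = len(dims), len(dims)
    rho = rho.reshape(dims * 2)
    for i in reversed(range(n)):
        if i not in keep:
            rho = np.trace(rho, axis1=i, axis2=i + cur)
            cur -= 1
    d = int(np.prod([dims[i] for i in keep]))
    return rho.reshape(d, d)

# Global pure state (|0000> + |1111>)/sqrt(2) on E, C, S', R.
psi = np.zeros(16)
psi[0] = psi[15] = 1 / np.sqrt(2)
rho = np.outer(psi, psi)
dims = [2, 2, 2, 2]

# Conditional von Neumann entropies H(A|B) = H(AB) - H(B):
H_E_CS = vn(ptrace(rho, dims, [0, 1, 2])) - vn(ptrace(rho, dims, [1, 2]))
H_C_S = vn(ptrace(rho, dims, [1, 2])) - vn(ptrace(rho, dims, [2]))
H_C_R = vn(ptrace(rho, dims, [1, 3])) - vn(ptrace(rho, dims, [3]))
print(H_E_CS, H_C_S, H_C_R)  # all zero: no work gained or needed
```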
(II) Measurement with a trivial noisy POVM. Consider the extreme case of a POVM where the state is left untouched, but a random outcome is generated according to a distribution {p_k}. The POVM effects are simply Q_k = p_k 𝟙, and the post-measurement operators E^{(k)}(σ) = p_k σ are the identity superoperator weighted by the probability p_k.
Intuitively, this should be no different from rolling a die or, more generally, generating a random outcome with a specific distribution, which is a process that can yield work. Indeed, based on the explicit expression of the final state, we may evaluate the work cost of the measurement using some basic properties of the smooth entropies, presented in Section 3: we use the duality of the min- and max-entropies, the facts that R is not correlated with E and that ρ is invariant under exchange of E and C (both of which can be seen in (25)), and the definition of the min-entropy of C, H_min(C) = −log‖ρ_C‖_∞. Also, smoothing the min-entropy can only increase the quantity, by its definition (1c). One finds W_meas = −kT ln 2 · H^ε_min(C) ≤ −kT ln 2 · H_min(C).
One can also calculate, because C is uncorrelated with both S' and R, that W_reset^{C|S'} = W_reset^{C|R} = kT ln 2 · H^ε_max(C). This means that the work we need to invest to reset C is always larger than what we gain from generating the random outcome. In fact, the gap is precisely the difference between the max- and the min-entropy, which is the same kind of irreversibility that is observed between the single-shot entanglement distillation and formation costs between two parties [87].
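For a classical outcome register these entropies reduce to simple functions of the distribution {p_k}, which makes the min/max gap easy to see numerically. A sketch with an illustrative biased distribution (both the Rényi-1/2 and the Rényi-0 conventions for the max-entropy appear in the single-shot literature, so we evaluate both):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])     # outcome distribution {p_k}

H_min = -np.log2(p.max())                   # work yield ~ kT ln2 * H_min(C)
H_max_half = 2 * np.log2(np.sqrt(p).sum())  # Renyi-1/2 convention for H_max
H_0 = np.log2(np.count_nonzero(p))          # Renyi-0 convention

# Resetting C costs kT ln2 * H_max(C), which always exceeds the yield:
print(H_min, H_max_half, H_0)  # 1.0 <= ~1.87 <= 2.0
```

For a uniform (fair-die) distribution all three values coincide and the gap closes; any bias opens it.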
(III) Projective measurement of a pure superposition state. One might think that, intuitively, for the measurement to yield work the POVM must be noisy. Surprisingly enough, this is not the case: even projective measurements can yield work for specific input states. For example, consider the state |σ⟩_S = |+⟩ := (1/√2)(|0⟩ + |1⟩). Here R is a trivial system, since σ is already pure. Now consider the usual projective measurement of σ in the computational basis {|0⟩, |1⟩}. The final state is |ρ⟩_ECS' = (1/√2)(|000⟩ + |111⟩). We then evidently have W_meas = kT ln 2 · H^ε_max(E|CS') = −kT ln 2; W_reset^{C|S'} = kT ln 2 · H^ε_max(C|S') = 0; W_reset^{C|R} = kT ln 2 · H^ε_max(C|R) = kT ln 2 · H^ε_max(C) = kT ln 2.
We conclude that it is possible to extract one bit of work while performing the measurement, and that resetting the memory register can be done at no work cost using S but needs one bit of work if we use the (trivial) reference system R.
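The three conditional entropies in this example can be verified numerically; for this final pure state they coincide with the corresponding von Neumann quantities (the helper functions below are illustrative, not from the paper):

```python
import numpy as np

def vn(rho):
    """Base-2 von Neumann entropy."""
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-12]
    return float(-np.sum(ev * np.log2(ev)))

def ptrace(rho, dims, keep):
    """Partial trace keeping the subsystems listed in `keep`."""
    n, cur = len(dims), len(dims)
    rho = rho.reshape(dims * 2)
    for i in reversed(range(n)):
        if i not in keep:
            rho = np.trace(rho, axis1=i, axis2=i + cur)
            cur -= 1
    d = int(np.prod([dims[i] for i in keep]))
    return rho.reshape(d, d)

# Measuring |+> in the computational basis: final state (|000> + |111>)/sqrt(2) on E, C, S'.
psi = np.zeros(8)
psi[0] = psi[7] = 1 / np.sqrt(2)
rho = np.outer(psi, psi)
dims = [2, 2, 2]  # E, C, S'

H_E_CS = vn(rho) - vn(ptrace(rho, dims, [1, 2]))                    # -1: one bit of work extracted
H_C_S = vn(ptrace(rho, dims, [1, 2])) - vn(ptrace(rho, dims, [2]))  # 0: free reset using S'
H_C = vn(ptrace(rho, dims, [1]))                                    # 1: reset without S' costs one bit
print(H_E_CS, H_C_S, H_C)
```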
Note that resetting the measurement register C using S costs no work. This is not in violation of the second law of thermodynamics: we have not returned the post-measurement state back to the initial state, but rather we have consumed its purity.
(IV) Measurement with erasing collapse operators. It was noted above that if the collapse operators E^{(k)} are themselves maps that cost work, e.g. erasure channels, then the measurement may also cost work. It suffices to consider the following extreme example: take a single-outcome, i.e. trivial, measurement whose single collapse operator E^{(k=0)}(·) = tr(·) |0⟩⟨0| is an erasure channel. Obviously this operation has to cost work: performing the operation E_{S→CS'} is exactly the same as performing just the erasure E^{(k=0)}, which costs work according to our main result (and is of course also in line with Landauer's principle).
(V) Information Gain of a Measurement. Existing literature [88-91] has studied and identified the amount of information that a quantum measurement provides about the system being measured. With the notation above, the information gain in the asymptotic, i.i.d. regime is defined in [90] by Eq. (26). In our framework, the information contained in a quantum system is represented by how much work we need in order to erase that system. Bearing this in mind, the natural way of defining the amount of information gained about the system through the measurement is the difference between the work costs of erasing S before and after the measurement. Since S was consumed by the measurement, this statement does not quite make sense as stated, so we instead consider erasing the system R, which is a purification of S. Our take on the information gain of the measurement is then the corresponding difference of erasure costs for R. Notice that in the i.i.d. regime, where the smooth entropies converge to the von Neumann entropy, this definition coincides with the previous one (26).

G. State Transformation while Decoupling from the Reference System.
Let us turn to another special case that we can derive as a corollary of our main result. Consider the process that erases its input and prepares the required output independently. This occurs if we require the output state to be completely uncorrelated from the reference system R: ρ_X'R = ρ_X' ⊗ ρ_R. This corresponds to a replacement map: any third party R that was correlated with the input is completely uncorrelated from the output.
Again, we may simply apply our main result with the additional condition that ρ_X'R = ρ_X' ⊗ ρ_R. In this case the purification ρ_X'RE of ρ_X'R takes a special form due to the tensor product structure, with the E system split into two systems E_R and E_X (E = E_R ⊗ E_X), where |ψ⟩_X'E_X and |φ⟩_RE_R are purifications of ρ_X' and ρ_R, respectively. Now, the spectrum of ρ_E_R is exactly the same as the spectrum of ρ_R, by the Schmidt decomposition of |φ⟩. This in turn has the same spectrum as σ_X, again by the Schmidt decomposition of σ_XR and because ρ_R = σ_R. It follows that H^ε_max(E_R)_ρ = H^ε_max(X)_σ. Also, by duality of the min- and max-entropies, H^ε_max(E_X|X')_ρ = −H^ε_min(X')_ρ. The (smooth) lower bound on the minimal work cost W given by our main result then evaluates to kT ln 2 · [H^ε_max(X)_σ − H^ε_min(X')_ρ]. That is, to transform a state σ into ρ while completely decorrelating ρ from the input, one has to erase σ to a pure state (at cost kT ln 2 · H^ε_max(X)_σ) and then prepare ρ (extracting work kT ln 2 · H^ε_min(X')_ρ).

Consider the W state on a system S, a memory M and a reference system R, |W⟩_SMR = (1/√3)(|001⟩ + |010⟩ + |100⟩). The reduced states on SM and M are respectively given by σ_SM = (1/3)|00⟩⟨00| + (2/3)|Ψ+⟩⟨Ψ+| and σ_M = (2/3)|0⟩⟨0| + (1/3)|1⟩⟨1|, where |Ψ+⟩ = (1/√2)(|01⟩ + |10⟩). By the symmetry of the W state, the reduced state on any one or two qubits has the same form.
By actions on S and M , we would like to erase S, leading to the final state on S and M given by ρ SM = |0 0| ⊗ σ M . Let us consider two processes that achieve this goal: the first one will preserve correlations with R but will cost work, the second will not cost work but will modify those correlations.
We may directly apply the special case above concerning the erasure of a system conditioned on a memory: the fundamental work cost of such an erasure, if one preserves correlations with a reference system R, is given by H_0(S|M)_σ. In this case we have H_0(S|M)_σ = log₂(3/2) ≈ 0.59 (which we calculate below), and thus this process must cost at least this amount of work. Because of the small system size, we may not assert the achievability of this erasure at this work cost (the error terms discussed in Section 4 A become overwhelming). However, we can safely exclude the possibility of performing this operation at no work cost, a statement which suffices for our purposes here.
Observe now that both σ_SM and σ_M have the same spectrum {2/3, 1/3}. This means that there exists a unitary U that performs the erasure simply as |0⟩⟨0| ⊗ σ_M = U σ_SM U†, and this unitary process by definition does not cost any work. Note, though, that the correlations with R are not preserved. Indeed, the unitary sends |00⟩ to |01⟩ and |Ψ+⟩ to |00⟩, so one explicitly calculates that the state after the process is (1/√3)(√2 |000⟩ + |011⟩)_SMR. We notice that the reduced state on M and R is now pure and differs from the initial one, σ_MR = (1/3)|00⟩⟨00| + (2/3)|Ψ+⟩⟨Ψ+|. As before, this is an example where one can a priori transform the input state into the output state at no work cost, but if correlations between the memory and a reference system are to be preserved or, equivalently, if the exact erasure process (12) is to be performed, then a physical implementation of this operation requires work.
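The spectra and the value H_0(S|M)_σ = log₂(3/2) ≈ 0.59 for the W state can be checked numerically. We use the characterization H_0(S|M) = log₂ ‖tr_S Π_SM‖_∞, with Π_SM the projector onto the support of σ_SM (a common form of the conditional Rényi-0 entropy, optimized over states on M); the script is a sketch:

```python
import numpy as np

# |W> = (|001> + |010> + |100>)/sqrt(3) on the qubits S, M, R.
W = np.zeros(8)
W[[1, 2, 4]] = 1 / np.sqrt(3)
rho6 = np.outer(W, W).reshape([2] * 6)   # indices (s, m, r, s', m', r')

sigma_SM = np.trace(rho6, axis1=2, axis2=5).reshape(4, 4)
sigma_M = np.trace(sigma_SM.reshape(2, 2, 2, 2), axis1=0, axis2=2)

# Both reduced states share the spectrum {2/3, 1/3}, so a work-free unitary erasure exists.
spec_SM = np.sort(np.linalg.eigvalsh(sigma_SM))[::-1]
spec_M = np.sort(np.linalg.eigvalsh(sigma_M))[::-1]

# H_0(S|M) = log2 || tr_S Pi_SM ||_inf, with Pi_SM the support projector of sigma_SM.
ev, vecs = np.linalg.eigh(sigma_SM)
Pi = sum(np.outer(v, v.conj()) for lam, v in zip(ev, vecs.T) if lam > 1e-12)
tr_S_Pi = np.trace(Pi.reshape(2, 2, 2, 2), axis1=0, axis2=2)
H0 = float(np.log2(np.linalg.eigvalsh(tr_S_Pi).max()))
print(spec_SM[:2], spec_M, H0)  # [2/3 1/3], [2/3 1/3], ~0.585
```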
It remains to calculate H_0(S|M)_σ. Writing out the state σ_SM and the projector Π_SM onto its support explicitly in the basis {|0⟩, |1⟩}, one finds H_0(S|M)_σ = log₂(3/2).

For completeness, we provide an alternative proof of our main result, based on techniques of majorization and semidefinite programming. This proof (historically, the original one) proceeds via the study of possible state transitions, regardless of the logical process, but then imposes that the resulting logical process be the required one.
A. The Framework. Work cost or yield as generating or absorbing randomness.
Framework.-Consider a quantum mechanical system X in an initial state described by the density operator σ.
Our task is to bring the system X to another state ρ, while attempting to maximize some notion of "extracted" work in the process. We postulate a restricted set of operations as the possible physical processes that we may carry out. Throughout this paper we assume that the system starts and ends with a fully degenerate Hamiltonian upon each application of an allowed operation. There is no further restriction, however, on how the allowed operations themselves are implemented; they might require a time-dependent Hamiltonian, for example.
We first postulate two basic operations of a thermodynamical nature, involving a heat bath at temperature T: the erasure of a single qubit to a pure state at work cost kT ln(2), and the corresponding reverse process, which extracts kT ln(2) work by transforming a pure state into a fully mixed state. Here k is the Boltzmann constant. These operations are motivated by the variety of explicit physical thermodynamical frameworks in which they can be performed, for example using Szilard boxes [2,55] or by isothermally manipulating the energy levels of Hamiltonians [30,54,56]. Crucially, we assume the second law of thermodynamics, and require that there exist no operation that would allow us to form a cycle whose net effect is the extraction of work. This justifies that no work extraction procedure can yield more than kT ln(2) from a pure qubit, or else a cycle with net work gain could be formed by appending an erasure process, itself costing only kT ln(2).
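To put a scale on these elementary operations: at room temperature, the unit kT ln(2) of work charged per erased qubit is minuscule by everyday standards, though it has become relevant for highly miniaturized computing devices. A quick numerical check:

```python
import math

k = 1.380649e-23   # Boltzmann constant in J/K (exact in the 2019 SI)
T = 300.0          # room temperature in kelvin
landauer = k * T * math.log(2)

print(landauer)                        # ~2.87e-21 J of work per erased qubit
print(landauer / 1.602176634e-19)      # ~0.018 eV
```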
Apart from this constraint on the set of allowed operations, it is natural to also allow the usual quantum information processing. Since our Hamiltonians are degenerate, we can allow all global unitaries at no work cost. We do not need to use the fact that these unitaries are implementable by a device operating in contact with a heat bath, since expanding the class of allowed operations only strengthens the bound we derive. In practice, one has very crude local control over the operations, and the acting agent does not know which unitary is being implemented; however, this is not an obstacle to implementation [65,92]. In addition to unitaries, we allow pure ancillas to be added to the system, which permits more general computation. Crucially, ancillas will have to be exactly restored to their initial pure state, so that it is not possible to "hide" a work cost in an ancilla left in a mixed state.
The following framework is motivated by the above considerations. The processes we allow are (finite) combinations of the following elementary operations: (a) bring n qubits (of the system X or an ancilla A) from any state to a pure state ('erasure') at work cost n kT ln 2; (b) bring n qubits (of the system X or an ancilla A) from a pure state to a fully mixed state while extracting n kT ln 2 work; (c) add and remove ancillas in a pure state at no work cost, provided every ancilla has been restored to its initial pure state by the time it is removed; (d) perform arbitrary unitaries (over X and any added ancillas) at no work cost.
Operations (a) and (b) are those of thermodynamical nature, and may be carried out in a wide range of existing frameworks as mentioned above. One may view these operations as defining a quantity which we call "work". We note that these operations can be performed quasistatically in a thermodynamically reversible fashion (as long as operation (a) acts on a fully mixed state, which in fact will turn out to be sufficient for our purposes).
On the other hand, operations (c) and (d) are purely information-theoretical. They allow us to perform any quantum information processing circuit, since we allow pure ancillas to be added. However, there is the condition that "randomness" may not be disposed of for free, namely that ancillas have to be restored to their initial pure states at the end of the process.
We emphasize that these operations are allowed operations, but they are not necessarily always optimal. For example, a pure state does not really require n kT ln 2 work for its erasure, as charged by operation (a). However, any attempt to allow operation (a) for an arbitrary state (or even just for a mixed state) at a cost lower than n kT ln 2 would result in a macroscopic violation of the second law of thermodynamics.
Lambda-Majorization.-We will now provide a simple mathematical characterization of all operations allowed in our framework.
First, note that the operations (a)-(d) allow the use of so-called noisy operations [64], which correspond to adding an ancilla system N in a fully mixed state, performing a joint unitary, and removing the ancilla. Specifically, a noisy operation is composed in our framework of first an operation of type (c) (adding a pure ancilla of n qubits), followed by an operation of type (b) (extracting n kT ln 2 work from the ancilla making it fully mixed), then one of type (d) (performing the necessary unitary to carry out the noisy operation), and finally an operation of type (a) (erasing the ancilla back to its pure state at a work cost n kT ln 2). (It can be assumed without loss of generality that the ancilla is left in a fully mixed state after the noisy operation; indeed, this is the case for the construction of the noisy operation given by Ref. [64], which is capable of performing an equivalent transformation to any other noisy operation.) The total process has a work balance of zero. This means that we may thus carry out noisy operations for free within our framework and use them as building blocks for more complex processes. In the following, we deal implicitly with the ancilla N and it should not be confused with further ancillas that will be added.
Noisy Operations and Majorization. The transition on system X from the state σ to the state ρ is possible by a noisy operation if and only if σ ≻ ρ.
Majorization between two (normalized) states, σ ≻ ρ, captures the fact that ρ is "more mixed" than σ, or that the eigenvalues of ρ can be written as a "mixture" of the eigenvalues of σ. Formally, majorization can be characterized by the existence of a unital, trace-preserving completely positive map that brings σ to ρ [96-99]. A map E is trace-preserving if E†(𝟙) = 𝟙 and unital if E(𝟙) = 𝟙. The notion of majorization is discussed in more detail in Section 6.
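For density matrices, majorization reduces to comparing partial sums of eigenvalues sorted in decreasing order. A minimal numerical check (the function name is ours, purely illustrative):

```python
import numpy as np

def majorizes(sigma, rho, tol=1e-12):
    """True if sigma majorizes rho (equal-dimension density matrices)."""
    a = np.sort(np.linalg.eigvalsh(sigma))[::-1]
    b = np.sort(np.linalg.eigvalsh(rho))[::-1]
    return bool(np.all(np.cumsum(a) >= np.cumsum(b) - tol))

pure = np.diag([1.0, 0.0])
mixed = np.diag([0.5, 0.5])
biased = np.diag([0.75, 0.25])

print(majorizes(pure, mixed))    # True: mixing is possible by a noisy operation
print(majorizes(mixed, biased))  # False: a noisy operation cannot "unmix"
```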
We now provide some background insight for our new concept of lambda-majorization, a generalization of majorization inspired by other majorization variants [57,59,100-103]. The idea is to characterize "how well" a state σ majorizes a state ρ. Suppose that we have a system X in state σ_X and we want to bring it to the state ρ_X, where σ_X ≻ ρ_X. In this case, one can simply carry out a noisy operation as described above. Suppose now that we have an ancilla A in a fully mixed state 𝟙_A/|A|, and suppose that we are fortunate enough that σ_X ⊗ 𝟙_A/|A| ≻ ρ_X ⊗ |0⟩⟨0|_A also holds (for some pure state |0⟩_A on A). Then, by applying a joint noisy operation on both systems, we would effectively erase the system A "for free" during the transition σ → ρ. We could then say that the randomness of the ancilla A was "transferred" into the system X. We will view this type of transition as work extraction on the system X during a transition σ_X → ρ_X. Indeed, work can be extracted in an initial stage of the process by starting with a pure ancilla and making it maximally mixed; the operation described above costs no work, and the ancilla can then be restored to its pure final state.
In another situation, it might be that σ_X does not majorize ρ_X. However, in that case, for a large enough ancilla A the majorization σ_X ⊗ |0⟩⟨0|_A ≻ ρ_X ⊗ 𝟙_A/|A| will hold. The corresponding noisy operation then leaves us with a mixed ancilla that started off pure and thus requires work to restore; we view such a transition on the system X as costing work.
Such operations can be performed within our framework, using operations (a)-(d). In particular, the relation to work is given by elementary erasure and work extraction (operations (a) and (b)) applied to the ancilla A after the transition to restore it to its initial state.
In general, the ancilla A may start with λ1 mixed qubits and end up with λ2 mixed qubits after a noisy operation; we say in this case that (λ1 − λ2) kT ln(2) work has been extracted. This situation is depicted in Supplementary Figure 3a. Both considerations above, about work cost and work extraction, are encompassed, simply because we count the difference in the "amount of randomness" present in the ancilla before and after the process. This is the idea behind the concept of lambda-majorization, whose definition we can now state.
Lambda-Majorization. For two density operators σ_X, ρ_Y on two systems X and Y, we say that σ_X lambda-majorizes ρ_Y, denoted σ_X →^λ ρ_Y, if there exists a (large enough) ancilla system A, as well as λ1, λ2 ≥ 0 with λ = λ1 − λ2, such that 2^{−λ1} 𝟙_{2^{λ1}} ⊗ σ_X ≻ 2^{−λ2} 𝟙_{2^{λ2}} ⊗ ρ_Y, where 2^{−λ1} 𝟙_{2^{λ1}} and 2^{−λ2} 𝟙_{2^{λ2}} are fully mixed states on λ1 (respectively λ2) qubits of A, and where the remaining qubits of A in each case are pure.
An expression for "by how much" a state majorizes another was originally introduced in [58] and used in [59], in the context of work extraction games from Szilard boxes. Their measure, the "relative mixedness" between σ and ρ, corresponds to the optimal λ such that σ λ − → ρ. Lambda-majorization captures all the possible processes that are allowed in our framework. Indeed, if σ λ − → ρ, then one has 2 −λ1 1 2 λ 1 ⊗ σ 2 −λ2 1 2 λ 2 ⊗ ρ for some λ 1 , λ 2 with λ = λ 1 − λ 2 . Hence, there exists a noisy operation (itself a combination of operations (a)-(d) with zero total work cost) that performs the transition from 2 −λ1 1 2 λ 1 ⊗ σ to 2 −λ2 1 2 λ 2 ⊗ ρ. The λ 1 mixed qubits that we have appended to σ can be created by appending a large pure ancilla (operation (c)), and using operation (b) to extract λ 1 kT ln(2) work from λ 1 qubits, rendering them fully mixed. At the end of the process, after the noisy operation, we need to restore the ancilla in a pure state; we thus need to erase (operation (a)) the remaining λ 2 qubits, costing λ 2 kT ln(2) work. The total extracted work is then (λ 1 − λ 2 ) kT ln(2) = λ kT ln(2). Conversely, each individual operation (a)-(d), individually transforming some state σ into a state ρ and costing work W , implies the lambda-majorization σ λ − → ρ with W = −λ kT ln(2). This is clear for operations (c) and (d). For operations (a) and (b), this follows from results derived in Section 6 C. Furthermore, the composition of lambda-majorizations is again a lambda-majorization (Section 6).
The ancilla system above may be viewed as some kind of "information battery", as was proposed by Bennett [7] who suggested using a blank memory tape as "fuel" to extract work. In this case, the ancilla can be used as a storage of "purity" (or as a storage for "mixedness" or "randomness" which we would like to get rid of), which is increased or decreased by processes like the ones suggested above. Equivalently, a two level system, or work bit can be used [57].
It turns out that one can characterize lambda-majorization by the existence of a completely positive map satisfying certain normalization conditions, analogously to Proposition 3. Proposition 4. Let λ ∈ R. Two normalized density matrices σ_X and ρ_Y on systems X and Y satisfy σ_X →^λ ρ_Y if and only if there exists a completely positive map T_{X→Y} with ρ_Y = T_{X→Y}(σ_X) that, in addition, satisfies two normalization conditions (one on T and one on its adjoint T†). A map T_{X→Y} that satisfies these two last conditions will be referred to as a lambda-majorization map.
Furthermore, although the map T is not directly a physical mapping (it can, for example, be trace-decreasing), it can always be viewed as part of a unital channel Ē, in the sense that T can be obtained from Ē by projecting onto specific subspaces and tracing out the ancilla A (see Section 6 B). In turn, unital channels are a (strict [80]) superset of the noisy operations. Recall that our task is to find a lower bound on the work cost of all possible processes allowed in our framework, which we do by optimizing the work cost over all processes that perform a given state transition. However, instead of considering only the unital channels Ē that are noisy operations, we relax this last condition and consider all unital maps Ē, and thus allow the optimization to range over all T that satisfy the conditions of the above proposition. This makes our lower bound even stronger, by showing that it still holds even if we somewhat relax the assumptions of our framework.
B. The Main Result.

Formulation and Proof Sketch.
Formulation of the Main Result.-We are now ready to derive our main result. Consider a system X in the state σ X . This system can always be purified by a reference system, R, in a pure joint state |σ XR .
Allowing actions defined by our framework on X, we study the transition of this state to a state ρ_X'R by applying a process T_{X→X'}. The systems are depicted in Supplementary Figure 3b.
The task we would like to solve is the following. Given σ_X and a logical process E_{X→X'}, and given a purification |σ⟩_XR of σ_X and the output state ρ_X'R = E(σ_XR), we would like to find the least amount of work W that must be paid by any process in our framework implementing the action of E on σ. As we have seen in the previous section, we can formulate all possible processes within our framework as lambda-majorizations, so our task is actually to find the best λ such that σ_X →^λ ρ_X', with the corresponding lambda-majorization map T from Prop. 4 satisfying T(σ_XR) = ρ_X'R.
Our main result gives an upper bound on the optimal amount of work that can be extracted in this transition or, equivalently, a lower bound on the minimum amount of work that must be paid in order to perform the transition. The main result follows directly from the following technical proposition.
We are given an input state σ_X and a process E_{X→X'}. Let |σ⟩_XR be a purification of σ_X, and let ρ_X'R = E_{X→X'}(σ_XR). Let also ρ_X'RE be a purification of ρ_X'R in a system E. Main Result. Any procedure in our framework acting on the system X that implements the map E when given the input σ_X (or, equivalently, that brings the state σ_XR to the state ρ_X'R) has a work cost W of at least kT ln(2) · H^ε_max(E|X')_ρ, up to small correction terms involving the smoothing parameter ε. In other words, the minimal work cost of a process E mapping σ to ρ is given by the amount of (information-theoretic) entropy discarded, and thus dumped into the environment, conditioned on the output of the computation. This is precisely the quantitative generalization of the original Landauer principle [5] to correlated quantum systems.
The Main Result follows from Prop. 5 because, as we have noted above, lambda-majorization is equivalent to our original framework of operations (a)-(d).
It is worth noting that instead of specifying the map E, we may also simply specify the output state ρ_{X'R}, which completely determines the process (on the support of σ_X), since it is the Choi–Jamiołkowski state corresponding to E rescaled by σ_X (ρ_{X'R} = E(σ_{XR})). One can thus understand the input to the problem to actually be a bipartite state ρ_{X'R}, such that ρ_{X'} is the required output, ρ_R is the input that will be fed into the process, and any correlations between X' and R specify parts of the output that we wish to be preserved, and not modified or thermalized, by the process.
We have consistently kept the notation X' above as a reminder that we are talking about the output of the computation on X. However, X' could be an entirely different system; it could even have a different dimension than X. In that case some clarifications are needed. Whenever the system dimension increases, pure ancillas have been brought in and have not necessarily been restored to their pure state, since they are part of the output; this operation, however, need not have cost any work (in contrast to other noisy-operations resource theories, where purity is costly [64,65]). Whenever the system dimension decreases, on the other hand, any ancilla that was removed had to be reset to a pure state before being disposed of, which may have cost work. In other words, we adhere to the convention where purity can be brought in for free but disposing of randomness is costly; this is equivalent to the other approach, where purity is costly but disposing of mixed states is free. Our choice is a priori arbitrary, but has the advantage of integrating well into our mathematical framework, with simple descriptions in terms of subunital maps and weak submajorization (see Section 6).
The full proof of Prop. 5 is provided in Section 5 B 2. We provide the general idea of the proof in the following.
Proof Sketch of the Main Result.-The main idea of the proof is to write the optimization problem as a semidefinite program in the variables α = 2^{−λ} ≥ 0 and T_{XX'} ≥ 0 (the Choi–Jamiołkowski representation of T_{X→X'}). Let (·)^{t_X} denote the partial transpose operation on X, and consider the state transformation σ → ρ. An upper bound on the extracted work λ in the lambda-majorization σ_X →_λ ρ_{X'}, while ensuring that the map T from Prop. 4 performs the same logical operation as E, is given by a semidefinite program of the form "minimize α subject to conditions (32)" (see [104,105] for an introduction to SDPs in a style similar to the one used here). The optimal value α = 2^{H_0(E|X')_ρ} is achieved (see Section 5 B 2) by the completely positive map T_{X→X'}(·) = tr_E[V_{X→X'E} (·) V†_{X→X'E}], where V_{X→X'E} is the partial isometry with minimal support relating σ_{XR} to ρ_{X'ER} (both being purifications of the same σ_R = ρ_R).
While it is clear from the formulation of our problem that T is already completely determined on the support of σ_X (as expressed by the condition T(σ_{XR}) = ρ_{X'R}), the optimization over T is done in order to (at least formally) find the optimal action on the complement of the support of σ_X.
Also, the formulation of a lambda-majorization problem as a semidefinite program is a more general toolbox, which can be used when the mapping is not completely determined and arbitrary additional semidefinite conditions are to be imposed at will. For example, instead of fixing the process with T(σ_{XR}) = ρ_{X'R}, one may instead require only that T(σ_X) = ρ_{X'} for given σ_X and ρ_{X'}, not specifying, and optimizing over, what happens to correlations between the input and the output (or, equivalently, one could optimize over ρ_{X'R} with fixed reductions ρ_{X'} and ρ_R). In that case, the semidefinite program can be used to obtain bounds on the optimal value. This also implies that the "relative mixedness" introduced in [59] can be formulated as a semidefinite program. It is not clear, however, whether the result in this case can be written in terms of an entropy measure.
Let H_X be a quantum system in the state σ_X. Let H_R be an additional quantum system and let |σ⟩_{XR} be a purification of σ_X.
Suppose we want to perform the computation E_{X→X'} on system X, bringing its initial state σ_{XR} into a given state ρ_{X'R} with a lambda-majorization. Here ρ_{X'R} is not necessarily pure; giving the joint state with R allows us to specify which correlations we want to preserve, equivalently specifying the computation E on the support of σ_X. The task is then the following.
In other words, we would like to find the trace-nonincreasing map satisfying T_{X→X'}(σ_{XR}) = ρ_{X'R} that has the smallest possible ‖T_{X→X'}(1_X)‖_∞.
This problem can be formulated as a semidefinite program in terms of the variables α ≥ 0 (defined as α = 2^{−λ}) and T_{X→X'} (through its Choi–Jamiołkowski matrix T_{XX'} ≥ 0), together with the dual variables ω_{X'} ≥ 0, X_{X'} ≥ 0 and Z_{X'R} = Z†_{X'R}.
Primal: minimize α subject to conditions (32a)–(32c). Note that since the map does not act on σ_R, we must necessarily have σ_R = ρ_R. Let E be a system that purifies the output state as ρ_{X'RE}. Being two purifications with the same reduced state on R, the states σ_{XR} and ρ_{X'RE} must be related by an isometry V_{X→X'E} as ρ_{X'RE} = V_{X→X'E} σ_{XR} V†_{X→X'E}. Note that this is an equivalent construction of the Stinespring dilation of the original computation E_{X→X'}. We can choose here V_{X→X'E} to be a partial isometry such that V V† = Π_{X'E}, the projector on the support of ρ_{X'E}, and V†V = Π_X, the projector on the support of σ_X. Now, define T through its Stinespring dilation, T_{X→X'}(·) = tr_E[V (·) V†], and let α = ‖T(1_X)‖_∞. In fact, T is simply a projection onto the support of σ_X followed by the mapping E. We will show that this choice of variables is feasible and optimal, and will derive a more explicit value of α. Condition (32a) is satisfied by definition, (32b) because V is a partial isometry, and (32c) by a direct calculation. We now show that this value is optimal by exhibiting a solution to the dual problem that achieves the same value. Let ω_{X'} = τ_{X'} be the optimal τ_{X'} in the definition of H_0(E|X') as in (36), let Z_{X'R} = σ_R^{−1} ⊗ ω_{X'} and let X_{X'} = 0. This choice is feasible, since condition (33a) is automatically satisfied and condition (33b) reduces to an identity involving Φ_{X|R}, the unnormalized maximally entangled state on the supports of σ_X and σ_R. Let ρ_{X'RE} and V_{X→X'E} be defined as before. The value achieved by this choice of dual variables then matches the primal value. From this, we conclude that the optimal λ for this problem is λ = −H_0(E|X')_ρ, where ρ_{X'RE} is a purification of ρ_{X'R}.

In the following, let λ_i(ρ) denote the i-th eigenvalue of ρ (the order does not matter), and λ↓_i(ρ) the i-th eigenvalue of ρ taken in decreasing order.
For σ, ρ ∈ P(H_Z), we say that σ majorizes ρ, written σ ≻ ρ, if Σ_{i≤k} λ↓_i(σ) ≥ Σ_{i≤k} λ↓_i(ρ) for all k, and if tr σ = tr ρ. The notion of majorization defines a (partial) order relation on P(H_Z). When considering the set of normalized density matrices S_=(H_Z), there is a "least" element: the fully mixed state, 1_Z/d. Remark that if σ, ρ ∈ S_=(H_Z), then the concept of weak submajorization is equivalent to regular majorization, simply because the traces of these matrices are already both equal to unity.
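As a concrete illustration, the partial-sum definitions above can be checked numerically on spectra. The following minimal sketch (in Python; the function names `weakly_submajorizes` and `majorizes` are ours, not from the text) pads the eigenvalue vectors to a common length and compares cumulative sums of the decreasingly ordered eigenvalues:

```python
import numpy as np

def weakly_submajorizes(sigma_eigs, rho_eigs, tol=1e-12):
    """sigma >_w rho: every partial sum of the decreasingly ordered
    eigenvalues of sigma dominates the corresponding one of rho."""
    s = np.sort(np.asarray(sigma_eigs, dtype=float))[::-1]
    r = np.sort(np.asarray(rho_eigs, dtype=float))[::-1]
    d = max(len(s), len(r))
    s = np.pad(s, (0, d - len(s)))  # pad with zeros to a common length
    r = np.pad(r, (0, d - len(r)))
    return bool(np.all(np.cumsum(s) >= np.cumsum(r) - tol))

def majorizes(sigma_eigs, rho_eigs, tol=1e-12):
    """Regular majorization: weak submajorization plus equal traces."""
    return (weakly_submajorizes(sigma_eigs, rho_eigs, tol)
            and abs(np.sum(sigma_eigs) - np.sum(rho_eigs)) < tol)
```

For instance, a pure state (spectrum `[1.0]`) majorizes any normalized state, and the fully mixed qubit `[0.5, 0.5]` is majorized by every qubit state, in line with the "least element" remark above.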
The following theorem is due to Hardy, Littlewood and Pólya [93].
Theorem 6 (Hardy, Littlewood, and Pólya, 1929). Let σ, ρ ∈ P(H_Z). Then σ ≻ ρ if and only if there exists a d × d doubly stochastic matrix S^j_i such that λ_i(ρ) = Σ_j S^j_i λ_j(σ). A similar theorem is obtained for weak submajorization and doubly substochastic matrices [94].
Proposition 7. Let σ ∈ P(H_X) and ρ ∈ P(H_Y). Then σ ≻_w ρ if and only if there exists a d_Y × d_X doubly substochastic matrix B^j_i such that λ_i(ρ) = Σ_j B^j_i λ_j(σ). Majorization defines a partial order on states and has a "smallest" element, the fully mixed state. Also, a pure state majorizes any other state.
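The "if" direction of this characterization is easy to probe numerically: acting with any doubly substochastic matrix on a spectrum produces a weakly submajorized spectrum. A small sketch (the helper names are ours):

```python
import numpy as np

def is_doubly_substochastic(B, tol=1e-12):
    """Entrywise nonnegative, with every row and column sum at most 1."""
    B = np.asarray(B, dtype=float)
    return bool(np.all(B >= -tol)
                and np.all(B.sum(axis=0) <= 1 + tol)
                and np.all(B.sum(axis=1) <= 1 + tol))

def weakly_submajorizes(sigma_eigs, rho_eigs, tol=1e-12):
    """Partial sums of decreasingly ordered eigenvalues dominate."""
    s = np.sort(np.asarray(sigma_eigs, dtype=float))[::-1]
    r = np.sort(np.asarray(rho_eigs, dtype=float))[::-1]
    d = max(len(s), len(r))
    s = np.pad(s, (0, d - len(s)))
    r = np.pad(r, (0, d - len(r)))
    return bool(np.all(np.cumsum(s) >= np.cumsum(r) - tol))

B = np.array([[0.5, 0.5],
              [0.25, 0.25]])   # doubly substochastic (row/col sums <= 1)
sigma = np.array([0.7, 0.3])
rho = B @ sigma                # the image spectrum
```

Here `rho = [0.5, 0.25]`, and one checks that `sigma` weakly submajorizes `rho`, as the proposition predicts.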
A proof for the direct sum of two vectors can be found in [94,Cor. II.1.4]. We provide here an alternative proof along with the tensor product case.
Proof. Let S^j_i and S'^j_i be doubly stochastic matrices such that λ_i(ρ) = Σ_j S^j_i λ_j(σ) and λ_i(ρ') = Σ_j S'^j_i λ_j(σ'). Then S ⊕ S' is also doubly stochastic and satisfies λ_i(ρ ⊕ ρ') = Σ_j (S ⊕ S')^j_i λ_j(σ ⊕ σ'), because the vector of eigenvalues of a direct sum is simply the direct sum of the individual vectors of eigenvalues. This shows that σ ⊕ σ' ≻ ρ ⊕ ρ'. The same proof holds for doubly substochastic matrices, so majorization may be replaced by weak submajorization in the proposition.
We are now all set for a formal definition of lambda-majorization.
Let λ ∈ R and let λ_1, λ_2 ≥ 0 be such that λ = λ_1 − λ_2 and 2^{λ_1}, 2^{λ_2} are integers. (The case when 2^λ is irrational will be discussed later.) Take H_C of size greater than both 2^{λ_1} and 2^{λ_2}, and let H_A and H_B be subspaces of H_C of respective dimensions 2^{λ_1} and 2^{λ_2}.
Lambda-Majorization. For σ ∈ P(H_X) and ρ ∈ P(H_Y), we say that σ λ-majorizes ρ, denoted by σ →_λ ρ, if 2^{−λ_1} 1_A ⊗ σ ≻_w 2^{−λ_2} 1_B ⊗ ρ. We have assumed here that 2^λ is rational. If 2^λ is irrational, we say that σ λ-majorizes ρ if σ λ'-majorizes ρ for every λ' < λ such that 2^{λ'} is rational. The following proposition guarantees that the definition above does not depend on the exact values of λ_1 and λ_2 but only on their difference. This is the same as saying that a fully mixed state cannot act as a catalyst.
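The definition can be tested directly on spectra by tensoring with the appropriate uniform ancillas and checking weak submajorization. A sketch for integer λ (the function name is ours; for integer λ one may always pick (λ_1, λ_2) = (λ, 0) or (0, −λ)):

```python
import numpy as np

def lambda_majorizes(sigma_eigs, rho_eigs, lam, tol=1e-12):
    """Check sigma ->_lam rho: 2^{-l1} 1_A (x) sigma weakly submajorizes
    2^{-l2} 1_B (x) rho, with l1 - l2 = lam (lam assumed integer here)."""
    lam = int(lam)
    l1, l2 = (lam, 0) if lam >= 0 else (0, -lam)
    a, b = 2 ** l1, 2 ** l2
    # spectrum of 2^{-l1} 1_A (x) sigma: each eigenvalue of sigma,
    # scaled by 1/2^{l1} and repeated 2^{l1} times (likewise for rho)
    s = np.kron(np.full(a, 1.0 / a), np.asarray(sigma_eigs, dtype=float))
    r = np.kron(np.full(b, 1.0 / b), np.asarray(rho_eigs, dtype=float))
    d = max(len(s), len(r))
    s = np.pad(np.sort(s)[::-1], (0, d - len(s)))
    r = np.pad(np.sort(r)[::-1], (0, d - len(r)))
    return bool(np.all(np.cumsum(s) >= np.cumsum(r) - tol))
```

With λ = −1 this reproduces Landauer erasure of one bit (a fully mixed bit can be purified at the cost of one bit of work), while λ = +1 reproduces the Szilard-type extraction of one bit of work from a pure bit that is left fully mixed.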
Proposition 9. For any σ, ρ ∈ P(H_Z), and for any n, we have σ ≻_w ρ if and only if σ ⊗ 1_n/n ≻_w ρ ⊗ 1_n/n.
Proof. If σ ≻_w ρ, then the weak submajorization carries over the tensor product, which proves the claim. Conversely, if σ ⊗ 1_n/n ≻_w ρ ⊗ 1_n/n, then in particular, for any k ≤ d (where d is the maximum rank of σ or ρ), we have Σ_{i≤kn} λ↓_i(σ ⊗ 1_n/n) ≥ Σ_{i≤kn} λ↓_i(ρ ⊗ 1_n/n). But the eigenvalues of σ ⊗ 1_n/n are the eigenvalues of σ, each divided by n and repeated n times, so Σ_{i≤kn} λ↓_i(σ ⊗ 1_n/n) = Σ_{j≤k} λ↓_j(σ), and likewise for ρ; hence σ ≻_w ρ.

The following proposition is a direct consequence of the definition of lambda-majorization, and simply states that randomness can be moved into or out of the ancillas appearing in the definition. Proposition 10. For any σ ∈ P(H_X), ρ ∈ P(H_Y), and for any λ ∈ R and n > 0, we have σ →_λ ρ if and only if σ ⊗ 1_n/n →_λ ρ ⊗ 1_n/n. Similarly to Thm. 6 and to Prop. 7, it is possible to characterize lambda-majorization by the existence of a matrix relating the vectors of eigenvalues and satisfying specific normalization conditions. Proposition 11. Let σ ∈ P(H_X) and ρ ∈ P(H_Y).
Then σ →_λ ρ if and only if there exists a matrix T^k_i ≥ 0 satisfying λ_i(ρ) = Σ_k T^k_i λ_k(σ), Σ_i T^k_i ≤ 1 and Σ_k T^k_i ≤ 2^{−λ}.

Proof. Suppose first that σ →_λ ρ, i.e., 2^{−λ_1} 1_A ⊗ σ ≻_w 2^{−λ_2} 1_B ⊗ ρ. By Prop. 7, there exists a doubly substochastic matrix S^{ak}_{bi} such that 2^{−λ_2} λ_i(ρ) = Σ_{a,k} S^{ak}_{bi} 2^{−λ_1} λ_k(σ) for each b and i. Summing over the 2^{λ_2} values of b, one can define T^k_i = 2^{−λ_1} Σ_{a,b} S^{ak}_{bi}, which fulfills λ_i(ρ) = Σ_k T^k_i λ_k(σ). Because S is doubly substochastic, and using the fact that the indices a (resp. b) range over 2^{λ_1} (resp. 2^{λ_2}) values, the matrix T satisfies Σ_i T^k_i ≤ 1 as well as Σ_k T^k_i ≤ 2^{−λ}. Additionally, T^k_i ≥ 0 because S^{ak}_{bi} ≥ 0. Conversely, suppose that a matrix T^k_i exists with the properties above, and define S^{ak}_{bi} = 2^{−λ_2} T^k_i. One checks that S is doubly substochastic, and the required weak submajorization for the desired lambda-majorization is provided by this doubly substochastic matrix.
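Under our reading of this matrix characterization, a candidate T must map the spectrum of σ to that of ρ, be entrywise nonnegative, have column sums at most 1 (trace-nonincreasing) and row sums at most 2^{−λ} (2^{−λ}-subunitality). Such a certificate can be verified numerically (a sketch; the helper name and the exact form of the normalization conditions are our reconstruction):

```python
import numpy as np

def certifies(T, sigma_eigs, rho_eigs, lam, tol=1e-12):
    """Check that T witnesses sigma ->_lam rho: T maps the spectrum of
    sigma to that of rho, is entrywise nonnegative, has column sums <= 1
    and row sums <= 2^{-lam}."""
    T = np.asarray(T, dtype=float)
    ok_map = np.allclose(T @ np.asarray(sigma_eigs, dtype=float), rho_eigs)
    ok_pos = bool(np.all(T >= -tol))
    ok_tni = bool(np.all(T.sum(axis=0) <= 1 + tol))          # column sums
    ok_sub = bool(np.all(T.sum(axis=1) <= 2.0 ** (-lam) + tol))  # row sums
    return ok_map and ok_pos and ok_tni and ok_sub
```

For instance, `T = [[1, 1]]` certifies the erasure transition (½, ½) →_{−1} (1), and `T = [[½], [½]]` certifies the work-extraction transition (1) →_{+1} (½, ½).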

B. Formulation of Lambda-Majorization in Terms of Maps
Majorization can also be characterized in terms of unital, trace-preserving completely positive maps [96][97][98][99]. Similarly, one can prove an analogous characterization of weak submajorization. The proof of this proposition will be given later.
Proposition 13. Let σ ∈ P(H_X) and ρ ∈ P(H_Y). Then σ ≻_w ρ if and only if there exists a completely positive map E_{X→Y} such that E_{X→Y}(σ) = ρ, E_{X→Y}(1_X) ≤ 1_Y, and tr[E_{X→Y}(ω)] ≤ tr ω for all ω ≥ 0. The two conditions on the structure of the map E_{X→Y} in the above proposition require the map to be subunital and trace-nonincreasing. A subunital, trace-nonincreasing completely positive map can always be seen as part of a unital, trace-preserving completely positive map on a larger Hilbert space. This is analogous to the result that doubly substochastic matrices are submatrices of doubly stochastic matrices [94].
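A special case of this characterization (with equal traces) is that a unital, trace-preserving map can only make a state more mixed. A quick numeric check of this consequence, using dephasing in the computational basis (a pinching) applied to a randomly chosen qubit state (the setup and helper name are ours):

```python
import numpy as np

def weakly_submajorizes(sigma_eigs, rho_eigs, tol=1e-9):
    """Partial sums of decreasingly ordered eigenvalues dominate."""
    s = np.sort(np.asarray(sigma_eigs, dtype=float))[::-1]
    r = np.sort(np.asarray(rho_eigs, dtype=float))[::-1]
    return bool(np.all(np.cumsum(s) >= np.cumsum(r) - tol))

# Dephasing in the computational basis is unital and trace-preserving,
# so its output must be majorized by its input.
rng = np.random.default_rng(0)
H = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
rho = H @ H.conj().T
rho = rho / np.trace(rho).real   # random normalized qubit state
pinched = np.diag(np.diag(rho))  # keep only the diagonal entries
in_eigs = np.linalg.eigvalsh(rho)
out_eigs = np.linalg.eigvalsh(pinched)
```

The spectra have equal traces and the input spectrum dominates the output spectrum, so the input majorizes the (more mixed) dephased output.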
In the following, let 1_{X→Z} (resp. 1_{Y→Z}) denote the canonical embedding isometry, and define 1_{Z→X} (resp. 1_{Z→Y}) as the canonical projection partial isometry. Let also 1_X (resp. 1_Y) be the projector onto the subspace H_X (resp. H_Y) of H_Z. The space H_Z and the map E_{Z→Z} may then be chosen as in (44); the space H_Z may also be chosen to be any space bigger than H_X ⊕ H_Y.
In order to generalize this concept to our lambda-majorization, let us introduce the concept of an α-subunital map. These maps generalize subunital maps to arbitrary normalizations.
α-subunital Maps. We call a map T_{X→Y} α-subunital if it satisfies T_{X→Y}(1_X) ≤ α 1_Y.
Proposition 15 (Composition of α-subunital maps). Let H_W ⊂ H_Z be another subspace of H_Z in addition to H_X and H_Y, and let T_{X→Y}, T_{Y→W} be completely positive, trace-nonincreasing maps. Assume that T_{X→Y} is α-subunital and that T_{Y→W} is β-subunital. Then their composition T_{Y→W} ∘ T_{X→Y} is a completely positive, trace-nonincreasing, (α·β)-subunital map.

Proof of Prop. 15. The composition of T_{X→Y} and T_{Y→W} is trace-nonincreasing, since each of the maps is. Their composition is also (α·β)-subunital, since T_{Y→W}(T_{X→Y}(1_X)) ≤ α T_{Y→W}(1_Y) ≤ α β 1_W.

Remark 16. Let V_{X→Y} be a partial isometry. Then the map (·)_X → V_{X→Y} (·)_X V†_{X→Y} is trace-nonincreasing and subunital.
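These normalization properties are easy to verify numerically for maps given in Kraus form (a toy sketch; the example maps and helper names are ours):

```python
import numpy as np

def apply_map(kraus, X):
    """Apply the completely positive map with the given Kraus operators."""
    return sum(K @ X @ K.conj().T for K in kraus)

def is_alpha_subunital(kraus, alpha, dim_in, tol=1e-9):
    """T(1_X) <= alpha * 1 as an operator inequality: all eigenvalues of
    alpha*1 - T(1_X) must be nonnegative."""
    TI = apply_map(kraus, np.eye(dim_in))
    gap = alpha * np.eye(TI.shape[0]) - TI
    return bool(np.all(np.linalg.eigvalsh(gap) >= -tol))

# Toy example: "halve everything" on a qubit is 2^{-1}-subunital and
# trace-nonincreasing; composing two copies gives a 2^{-2}-subunital map,
# illustrating the (alpha * beta)-subunitality of compositions.
T1 = [np.sqrt(0.5) * np.eye(2)]
T2 = [np.sqrt(0.5) * np.eye(2)]
composed = [A @ B for A in T2 for B in T1]
```

Here `T1` is 1/2-subunital but not 1/4-subunital, while the composition is 1/4-subunital, matching the proposition with α = β = 1/2.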
Proof of Prop. 14. The remark proves the first part of the proposition. To prove the converse, we show that the expression (44) satisfies the conditions of the claim. Notice first that the channel E_{Z→Z} is equal to its own adjoint, i.e. E_{Z→Z} = E†_{Z→Z}. Moreover, the map is unital, which makes it automatically trace-preserving, the map being its own adjoint. Condition (43) follows from the definition of E_{Z→Z} in (44).
If we choose H_Z to be any space larger than H_X ⊕ H_Y, we can adapt the definition of E_{Z→Z} in (44) for some pure states |i⟩_Q, |f⟩_{Q'}. Proof. Apply the converse of Prop. 14 to dilate the subunital map E_{K→L} to a unital map E_{Z→Z} (take H_X = H_K and H_Y = H_L).
We may choose H_Z = H_X ⊕ H_Y ⊕ H_W, where H_W is a space whose dimension we have not yet fixed. We would like the space H_Z to factorize both as H_Z = H_K ⊗ H_Q and as H_Z = H_L ⊗ H_{Q'} for some systems H_Q and H_{Q'}. A necessary and sufficient condition for this is that dim(H_K) and dim(H_L) both divide dim(H_Z), where dim(H_Z) = dim(H_X) + dim(H_Y) + dim(H_W). We thus fix the dimension of H_Z to be dim(H_Z) = m · lcm(dim(H_K), dim(H_L)), where lcm(·,·) designates the least common multiple of its arguments and m is some integer chosen such that dim(H_Z) ≥ dim(H_K) + dim(H_L). We have now fixed dim(H_Z), and this in turn implicitly fixes the dimension of H_W. This choice of H_Z = H_K ⊗ H_Q = H_L ⊗ H_{Q'} fixes the embedding isometries 1_{K→Z}, 1_{L→Z} and the projective partial isometries 1_{Z→K}, 1_{Z→L}, for some fixed states |i⟩_Q, |f⟩_{Q'}. The right-hand side of (46) is then evaluated using the above definition of the embedding and projection (partial) isometries, together with (43); this proves condition (46). Note that any choice of E_{KQ→LQ'} satisfying the conditions of the proposition requires dim(H_K ⊗ H_Q) = dim(H_L ⊗ H_{Q'}). Indeed, a unital, trace-preserving completely positive map may only exist if the dimensions of the input and output spaces match: with E(1_{KQ}) = 1_{LQ'} and E trace-preserving, we have tr(1_{KQ}) = tr(1_{LQ'}), and thus the dimensions are equal.
Proof of Prop. 13. By the weak submajorization condition, if tr ρ ≠ tr σ, we must have tr ρ < tr σ. Consider an extension space H_{Y'} ⊂ H_Z (taking a larger H_Z if necessary) in which we extend ρ by many small eigenvalues, such that tr ρ_{Y⊕Y'} = tr σ while still having σ ≻_w ρ_{Y⊕Y'}. Now we have a (regular) majorization, σ ≻ ρ_{Y⊕Y'}, and can apply Prop. 12.
The obtained map, E_{Z→Z}, is then unital and trace-preserving. It can be restricted by projecting the input onto H_X and the output onto H_Y. This restricted map, by Remark 16, is a valid trace-nonincreasing, subunital map (take λ = 0).
Conversely, if E_{X→Y} is a subunital, trace-nonincreasing completely positive map with E_{X→Y}(σ) = ρ, then one can dilate it with Proposition 14 to a unital, trace-preserving completely positive map E_{Z→Z} such that 1_Y E_{Z→Z}(σ ⊕ 0_Y) 1_Y = ρ. Note also that the map (·) → 1_Y (·) 1_Y + 1_X (·) 1_X is a pinching [94, p. 50, Prob. II.5.5], so we have σ ⊕ 0_Y ≻ 1_Y E_{Z→Z}(σ ⊕ 0_Y) 1_Y + 1_X E_{Z→Z}(σ ⊕ 0_Y) 1_X ≻_w ρ. The last weak submajorization holds because some eigenvalues were left out.
In the same way as lambda-majorization can be characterized with differently normalized doubly substochastic matrices, it can also be characterized in terms of a differently normalized subunital map. Proof of Prop. 18. "⇒". Assume first that 2^{−λ_1} 1_A ⊗ σ ≻_w 2^{−λ_2} 1_B ⊗ ρ, with H_A, H_B (of respective dimensions 2^{λ_1} and 2^{λ_2}) being subsystems of an ancilla system H_C, and with λ = λ_1 − λ_2.
By Prop. 13, there exists a subunital, trace-nonincreasing completely positive map E_{AX→BY} such that E_{AX→BY}(2^{−λ_1} 1_A ⊗ σ) = 2^{−λ_2} 1_B ⊗ ρ. Now let the map T be defined by T_{X→Y}(·) = 2^{−λ_1} tr_B[E_{AX→BY}(1_A ⊗ (·))], as in Eq. (53). This map is trace-nonincreasing, since 2^{−λ_1} tr_A(1_{AX}) = 1_X ensures that (·) → 2^{−λ_1} 1_A ⊗ (·) is trace-preserving, and it is 2^{−λ}-subunital, since T_{X→Y}(1_X) = 2^{−λ_1} tr_B[E_{AX→BY}(1_{AX})] ≤ 2^{−λ_1} tr_B[1_{BY}] = 2^{−λ_1} 2^{λ_2} 1_Y = 2^{−λ} 1_Y. The map T brings σ to ρ, so T satisfies all the claimed properties. "⇐". To prove the converse, assume that a trace-nonincreasing, 2^{−λ}-subunital map T_{X→Y} exists such that T_{X→Y}(σ) = ρ.
Define E_{AX→BY}(·) = 2^{−λ_2} 1_B ⊗ T_{X→Y}(tr_A(·)). This map is trace-nonincreasing, and it is subunital, since λ = λ_1 − λ_2 and T is 2^{−λ}-subunital. Also, E_{AX→BY}(2^{−λ_1} 1_A ⊗ σ) = 2^{−λ_2} 1_B ⊗ T_{X→Y}(σ) = 2^{−λ_2} 1_B ⊗ ρ. By Prop. 13, we eventually have 2^{−λ_1} 1_A ⊗ σ ≻_w 2^{−λ_2} 1_B ⊗ ρ, which is the desired lambda-majorization. Remark 19. A trace-nonincreasing, 2^{−λ}-subunital completely positive map T_{X→Y} can always be written as in Eq. (53) for a subunital, trace-nonincreasing completely positive map E_{AX→BY}, which itself can always be written as projections of a unital map E_{CZ→CZ} (see the text of the previous proof, and Prop. 14). Conversely, for any unital map E_{CZ→CZ} with E(2^{−λ_1} 1 ⊗ σ_X) = 2^{−λ_2} 1 ⊗ ρ_Y, in particular for any noisy operation in our framework, the map T obtained by Eq. (53) is trace-nonincreasing and 2^{−λ}-subunital.
In particular, for our purposes of optimizing λ over all possible processes of our framework with an additional condition on the map carrying out the process (namely, to preserve correlations between our system X and the reference system R), we may impose that condition directly on the map T to obtain an upper bound on λ.

C. Properties for quantum states
We will consider in this section some useful properties of lambda-majorization in the case where σ and ρ are normalized states. Here, weak submajorization automatically implies (regular) majorization because tr σ = tr ρ = 1.
The converse holds because any state majorizes a uniform state of the same rank. Define the absorbed randomness (or relative mixedness [59]) of a transition from σ to ρ as the maximal amount of randomness that one can get rid of, or the minimal amount of randomness that one has to generate, in a noisy-operation process: λ(σ → ρ) = max { λ : σ →_λ ρ }. Recent work has shown that this measure is relevant for the amount of extractable work of processes acting on arrays of Szilard boxes [59].
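For spectra of normalized states, this quantity can be probed by brute force over integer λ using the lambda-majorization definition (a coarse numeric sketch; the function names are ours, only integer λ are scanned, and the true quantity is a supremum over all real λ):

```python
import numpy as np

def lambda_majorizes(sigma_eigs, rho_eigs, lam, tol=1e-12):
    """sigma ->_lam rho via weak submajorization of the ancilla-extended
    spectra (integer lam; (l1, l2) = (lam, 0) or (0, -lam))."""
    lam = int(lam)
    l1, l2 = (lam, 0) if lam >= 0 else (0, -lam)
    a, b = 2 ** l1, 2 ** l2
    s = np.kron(np.full(a, 1.0 / a), np.asarray(sigma_eigs, dtype=float))
    r = np.kron(np.full(b, 1.0 / b), np.asarray(rho_eigs, dtype=float))
    d = max(len(s), len(r))
    s = np.pad(np.sort(s)[::-1], (0, d - len(s)))
    r = np.pad(np.sort(r)[::-1], (0, d - len(r)))
    return bool(np.all(np.cumsum(s) >= np.cumsum(r) - tol))

def absorbed_randomness(sigma_eigs, rho_eigs, lam_range=range(-10, 11)):
    """Largest integer lam with sigma ->_lam rho (lambda-majorization is
    monotone in lam: smaller lam is always easier)."""
    best = None
    for lam in lam_range:
        if lambda_majorizes(sigma_eigs, rho_eigs, lam):
            best = lam
    return best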
The absorbed randomness has some tight relations to single-shot entropy measures, which we present here. These are reformulations of results shown in [55,58].
Proposition 23. The absorbed randomness defined above satisfies the following bounds.