Optimal entanglement distribution policies in homogeneous repeater chains with cutoffs

We study the limits of bipartite entanglement distribution using a chain of quantum repeaters that have quantum memories. To generate end-to-end entanglement, each node can attempt the generation of an entangled link with a neighbor, or perform an entanglement swapping measurement. A maximum storage time, known as cutoff, is enforced on the memories to ensure high-quality entanglement. Nodes follow a policy that determines when to perform each operation. Global-knowledge policies take into account all the information about the entanglement already produced. Here, we find global-knowledge policies that minimize the expected time to produce end-to-end entanglement. Our methods are based on Markov decision processes and value and policy iteration. We compare optimal policies to a policy in which nodes only use local information. We find that the advantage in expected delivery time provided by an optimal global-knowledge policy increases with increasing number of nodes and decreasing probability of successful swapping.


I. INTRODUCTION
Bipartite entangled states shared between two parties are often required as a basic resource in quantum network applications. As an example, in cryptography, bipartite entanglement can be directly used for quantum key distribution between two parties [1,2], but also in multi-party applications such as quantum secret sharing [3]. Bipartite entanglement can also be used to generate multipartite entangled states that are necessary for other applications [4-6]. As a consequence, a reliable method to distribute entanglement in a quantum network is crucial for the implementation of quantum cryptography applications.
Two neighboring nodes in a quantum network can generate a shared bipartite entangled state, which we call an entangled link. This can be done, e.g., by generating an entangled pair at one node and sending half of the pair to the neighbor via an optical fiber [7,8] or free space [9,10]. Two distant nodes can generate an entangled link by generating entanglement between each pair of adjacent nodes along a path that connects them, and then combining these entangled links into longer-distance bipartite entanglement via entanglement swap operations [11,12]. This path constitutes a quantum repeater chain (see Figure 1). We consider repeater chains in which nodes can store quantum states in the form of qubits and perform operations and measurements on them. Experimentally, qubits can be realized with different technologies, such as NV centers [13-17] and trapped ions [18,19].
We focus on a single repeater chain of n equidistant and identical nodes, which could be part of a larger quantum network. To generate an entangled link between the two end nodes, also called end-to-end entanglement, we assume the nodes can perform the following operations: (i) heralded generation of entanglement between neighbors [13,20], which succeeds with probability p and otherwise raises a failure flag; (ii) entanglement swaps [11,12,21], which consume two adjacent entangled links to generate a longer-distance link with probability p_s; and (iii) removal of any entangled link that has existed for longer than some cutoff time t_cut, to prevent generation of low-quality end-to-end entanglement due to decoherence [16,22-25]. Note that cutoff times are a key ingredient, since many applications require quantum states with a high enough quality.
We assume that nodes always attempt entanglement generation if there are qubits available. Cutoffs are always applied whenever an entangled link becomes too old. However, nodes are free to attempt swaps as soon as entangled links are available or some time later, so they must agree on an entanglement distribution policy: a set of rules that indicate when to perform a swap. We define an optimal policy as a policy that minimizes the expected entanglement delivery time, which is the average time required to generate end-to-end entanglement. Here, we consider optimal global-knowledge policies, in which nodes have information about all the entangled links in the chain. A policy is local when the nodes only need to know the state of the qubits they hold. An example of a local policy is the swap-asap policy, in which each node performs a swap as soon as both entangled links are available.
Previous work on quantum repeater chains has mostly focused on the analysis of specific policies rather than on the search for optimal policies. For example, [26] provides analytical bounds on the delivery time of a "nested" policy [27], and [28] optimizes the parameters of such a policy with a dynamic programming approach. Delivery times can be studied using Markov models. In [29], the authors introduce a methodology based on Markov chains to calculate the expected delivery time in repeater chains that follow a particular policy. Similar techniques have also been applied to other quantum network topologies, such as the quantum switch [30,31]. Here, we focus on Markov decision processes (MDPs), which have already been applied to related problems; e.g., in [32], the authors use an MDP formulation to maximize the quality of the entanglement generated between two neighboring nodes and between the end nodes in a three-node repeater chain. Our work builds on [33], wherein the authors find optimal policies for quantum repeater chains with perfect memories. Since quantum memories are expected to be noisy, particularly in the near future, quantum network protocols must be suitable for imperfect memories. Here, we take a crucial step towards the design of high-quality entanglement distribution policies for noisy hardware. By formulating a generalized MDP that includes finite storage times, we are able to find optimal policies in quantum repeater chains with imperfect memories. Our optimal policies provide insights for the design of entanglement distribution protocols.
Our main contributions are as follows:
• We introduce a general MDP model for homogeneous repeater chains with memory cutoffs. The latter constraint poses a previously unaddressed challenge: MDP states must incorporate not only the absence or presence of each entangled link, but also the age of each link;
• We find optimal policies that minimize the expected end-to-end entanglement delivery time, by solving the MDP via value and policy iteration;
• Our optimal policies take into account global knowledge of the state of the chain, and therefore their expected delivery time is a lower bound on that of policies that use only local information.
Our main findings are as follows:
• The optimal expected delivery time in a repeater chain with deterministic swaps (p_s = 1) can be orders of magnitude smaller than with probabilistic swaps;
• When swaps are deterministic, the advantage in expected delivery time offered by an optimal policy over the swap-asap policy increases with decreasing probability of entanglement generation, p, and decreasing cutoff time, t_cut, in the parameter region explored. However, when swaps are probabilistic, we find the opposite behavior: the advantage increases with increasing p and t_cut;
• The advantage provided by optimal policies increases with the number of nodes, both when swaps are deterministic and when they are probabilistic, although the advantage is larger in the probabilistic case.
This paper is structured as follows. In Section II, we explain in detail our repeater chain model and then present our main results. In Section III, we discuss the implications and limitations of our work. In Section IV, we provide more details on how to formulate the MDP and how to solve it.

A. Network model
We analyze quantum repeater chains wherein nodes can store quantum states in the form of qubits and can perform three basic operations with them: entanglement generation, entanglement swaps, and cutoffs.
Entanglement generation. Two adjacent nodes can attempt the heralded generation of an entangled link (i.e., a shared bipartite entangled state), succeeding with probability p. Generation of entanglement is heralded, meaning that the nodes receive a message stating whether they successfully generated an entangled link or not [13,20]. We assume that entanglement generation is noisy. Hence, the newly generated entangled links are not maximally entangled states but Werner states [34]. Werner states are maximally entangled states that have been subjected to a depolarizing process, which is a worst-case noise model [35], and they can be written as follows:

ρ = ((4F − 1)/3) |φ⁺⟩⟨φ⁺| + (1 − (4F − 1)/3) I_4/4,   (1)

where |φ⁺⟩ = (|00⟩ + |11⟩)/√2 is a maximally entangled state, F is the fidelity of the Werner state to the state |φ⁺⟩, and I_d is the d-dimensional identity. In our notation, the fidelity of a mixed state ρ to a pure state |φ⟩ is defined as

F(ρ, |φ⟩) = ⟨φ| ρ |φ⟩.   (2)

We assume that the fidelity of newly generated entangled links is F_new ≤ 1.
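As a quick numerical sanity check of this definition (a sketch in plain Python; the function names are ours), one can build the Werner state from a target fidelity F and verify that its overlap with |φ⁺⟩ is indeed F:

```python
def werner_state(F):
    """4x4 density matrix of the two-qubit Werner state with fidelity F
    to |phi+> = (|00> + |11>)/sqrt(2), as nested lists (real entries)."""
    x = (4 * F - 1) / 3                       # Werner parameter
    phi = [2 ** -0.5, 0.0, 0.0, 2 ** -0.5]    # amplitudes of |phi+>
    return [[x * phi[i] * phi[j] + (1 - x) * (0.25 if i == j else 0.0)
             for j in range(4)] for i in range(4)]

def fidelity(rho, psi):
    """<psi|rho|psi> for a real state vector psi."""
    return sum(psi[i] * rho[i][j] * psi[j] for i in range(4) for j in range(4))
```

For any F, the overlap ⟨φ⁺|ρ|φ⁺⟩ evaluates to x + (1 − x)/4 = F, and the trace is 1, as expected for a density matrix.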
Entanglement swap. Two neighboring entangled links can be fused into a longer-distance entangled link via entanglement swapping. Consider a situation where node B shares an entangled link with node A, and another link with node C (see Figure 2). Then, B can perform an entanglement swap to produce an entangled link between A and C while consuming both initial links [11,12,21]. We refer to the link generated in a swap operation as a swapped link. This operation is also probabilistic: a new link is produced with probability p_s, and no link is produced (but both input links are still consumed) with probability 1 − p_s.
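The swap operation can be sketched as a small state update (a toy model with illustrative names; link ages follow the rule, described later in the cutoff discussion, that a swapped link inherits the age of its oldest input):

```python
import random

def attempt_swap(link_ab, link_bc, p_s, rng=random):
    """Entanglement swap at the shared middle node.

    Each link is ((left_node, right_node), age).  Both inputs are always
    consumed; with probability p_s a longer link appears, inheriting the
    age of the older input.  Returns the new link, or None on failure.
    """
    (a, b), age_ab = link_ab
    (b2, c), age_bc = link_bc
    assert b == b2, "input links must share the middle node"
    if rng.random() < p_s:
        return ((a, c), max(age_ab, age_bc))
    return None
```

With p_s = 1 this always returns the fused link; with p_s = 0 both links are simply lost.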
The generation of an entangled link between two end nodes without intermediate repeaters is limited by the distance between the end nodes [36]; e.g., the noise affecting a photon sent over an optical fiber grows exponentially with the length of the fiber [27]. Therefore, a repeater chain that makes use of entanglement swapping is needed to generate end-to-end entanglement over long distances.
Cutoffs. The fidelity of a quantum state decreases over time due to couplings to the environment [35,37]. These decoherence processes can be captured using a white noise model in which a depolarizing channel is applied to the entangled state at every instant. As a result, the fidelity of a Werner state at time t, F(t), is given by

F(t) = 1/4 + (F(t − ∆t) − 1/4) e^{−∆t/τ},   (3)

where ∆t is an arbitrary interval of time and τ is a parameter that characterizes the exponential decay in fidelity of the whole entangled state due to the qubits being stored in noisy memories. This parameter depends on the physical realization of the qubit. Equation (3) is derived in Appendix A.
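This exponential decay toward the fully depolarized value of 1/4 can be written as a one-line helper (a sketch; the function name is ours):

```python
import math

def decayed_fidelity(F0, dt, tau):
    """Fidelity of a Werner state after both qubits sit in noisy memory
    for a time dt: F -> 1/4 + (F - 1/4) * exp(-dt / tau)."""
    return 0.25 + (F0 - 0.25) * math.exp(-dt / tau)
```

As expected for exponential decay, chaining two intervals dt1 and dt2 gives the same result as a single interval dt1 + dt2.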
In general, quantum network applications require quantum states with fidelity above some threshold value F_min. A common solution is to impose a cutoff time t_cut on the entangled links: all entangled links used to generate the final end-to-end link must be generated within a time window of size t_cut [23]. Imposing memory cutoffs requires keeping track of the time passed since the creation of each entangled link. We call this time the age of the link. A link is discarded whenever it gets older than t_cut. Moreover, we assume that an entangled link generated as a result of entanglement swapping assumes the age of the oldest link that was involved in the swapping operation. Another valid approach would be to re-compute the age of a swapped link from its post-swap fidelity, although this would lead to a more complicated formulation to ensure that all the links used to produce a swapped link were generated within the time window of size t_cut. To produce end-to-end links with fidelity above F_min on a repeater chain that generates new links with fidelity F_new, it suffices to ensure that the sequence of events that produces the lowest end-to-end fidelity satisfies this requirement. In Appendix B, we show that such a sequence of events corresponds to all links being simultaneously generated in the first attempt and all the entanglement swaps being performed at the end of the t_cut interval. Analyzing such a sequence of events leads to the following condition for the cutoff time:

t_cut ≤ τ [ ln((4F_new − 1)/3) − (1/(n − 1)) ln((4F_min − 1)/3) ],   (4)

where n is the number of nodes. For a full derivation of the previous condition, see Appendix B.
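This condition can be turned into a small numerical check. The sketch below (function names are ours) encodes the worst-case sequence of events just described: all n − 1 links are created at once, stored for the full window t_cut, and then swapped simultaneously, with each swap multiplying the Werner parameters of its inputs. Solving the worst-case constraint for the largest admissible cutoff and plugging it back in should reproduce F_min exactly:

```python
import math

def worst_case_fidelity(F_new, n, tau, t_cut):
    """End-to-end fidelity when all n-1 links age for t_cut and are then
    swapped simultaneously (swaps multiply Werner parameters)."""
    x_new = (4 * F_new - 1) / 3
    x_end = (x_new * math.exp(-t_cut / tau)) ** (n - 1)
    return 0.25 + 0.75 * x_end

def max_cutoff(F_new, F_min, n, tau):
    """Largest t_cut for which the worst case still reaches F_min."""
    x_new = (4 * F_new - 1) / 3
    x_min = (4 * F_min - 1) / 3
    return tau * (math.log(x_new) - math.log(x_min) / (n - 1))
```

Any cutoff above this value makes the worst-case end-to-end fidelity drop below F_min.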
In this paper, we consider quantum networks that operate with a limited number of qubits. Specifically, we use the following additional assumptions:
(i) The chain is homogeneous, i.e., the hardware is identical in all nodes. This means that all pairs of neighbors generate links with the same success probability p and fidelity F_new, all swaps succeed with probability p_s, all states decohere according to some coherence time τ, and all nodes apply the same cutoff time t_cut. This assumption may not hold for some long-distance quantum networks where each node is implemented using a different technology, but may be directly applicable to, e.g., small metropolitan-scale networks.
(ii) We assume that each node has only two storage qubits, each of which is used to generate entanglement with one side of the chain. Each end node has a single storage qubit. This assumption is in line with the expectations for early quantum networks, in which nodes are likely to have only a handful of storage qubits (e.g., in [17] the authors realized the first three-node quantum network using NV centers, each with a single storage qubit).
(iii) We also assume that classical communication between nodes is instantaneous. This means that every node has global knowledge of the state of the repeater chain in real time. In general, this is not a realistic assumption. However, given that classical communication delays degrade the performance of the network, our results constitute a lower bound on the expected delivery time of real setups and can be used as a benchmark.
(iv) A repeater chain under the previous assumptions is characterized by four parameters:
• n: number of nodes in the chain, including end nodes.
• p: probability of successful entanglement generation.
• p_s: probability of successful swap.
• t_cut: cutoff time. Note that F_new, F_min, and τ are used to determine a proper value of the cutoff time (see condition (4)), but they are not needed after that.
In an experimental setup, the value of p is determined by the inter-node distance and the type of hardware used, as quantum nodes can be realized using different technologies, such as NV centers [13-17] and trapped ions [18,19]. Linear optics setups generally perform swaps with probability p_s = 0.5 [11,38], while other setups can perform deterministic swaps (p_s = 1) at the cost of a slower speed of operation [17]. The cutoff time t_cut can be chosen by the user, as long as condition (4) is satisfied. Note that (4) depends on τ (which depends on the hardware available), F_new (which depends on the hardware and the choice of entanglement generation protocol), and F_min (which is specified by the final application).
The state of the repeater chain at the end of each time slot can be described using the age of every entangled link. In Figure 3 we show an example of the evolution of the state of a chain with cutoff t_cut = 3, over four time slots:
• In the first time slot (t ∈ [0, 1)), all pairs of neighbors attempt entanglement generation, but it only succeeds between nodes two and three. No swaps can be performed, and the only link present is younger than the cutoff, so it is not discarded.
• In the second time slot (t ∈ [1, 2)), the age of the link between nodes two and three increases by one. All pairs of neighbors (except nodes two and three) attempt entanglement generation, which succeeds between nodes four and five.
• In the third time slot (t ∈ [2, 3)), the age of both existing links increases by one. All pairs of neighbors (except nodes two and three and nodes four and five) attempt entanglement generation, and only nodes five and six succeed. A swap could be performed at node five, but the nodes decide to wait.
• In the fourth time slot (t ∈ [3, 4)), the age of every existing link increases by one. Nodes one and two and nodes three and four attempt entanglement generation, but neither pair succeeds. A swap is successfully performed at node five, and a new link between nodes four and six is generated. This new link assumes the age of the oldest link involved in the swap operation. Lastly, the entangled link between nodes two and three is discarded, as its age reached the cutoff time.
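The bookkeeping in this walkthrough can be sketched in a few lines. In this toy version (names are ours), generation outcomes are passed in explicitly so the update is deterministic, and swaps are left out; a link (a, b) occupies the right-hand qubit of node a and the left-hand qubit of node b:

```python
def advance_slot(links, n, t_cut, gen_success):
    """One time slot for a chain of n nodes (numbered 1..n).

    links: dict {(a, b): age} of existing links, with a < b.
    gen_success: set of neighbor pairs (i, i+1) whose generation
    attempt succeeds this slot.  Returns the updated link dict.
    """
    links = {pair: age + 1 for pair, age in links.items()}  # links age by one
    right_busy = {a for (a, b) in links}                    # occupied right qubits
    left_busy = {b for (a, b) in links}                     # occupied left qubits
    for i in range(1, n):
        if i not in right_busy and i + 1 not in left_busy and (i, i + 1) in gen_success:
            links[(i, i + 1)] = 0                           # new link has age 0
    return {pair: age for pair, age in links.items() if age < t_cut}  # cutoff
```

Replaying the first slots of the example above (six nodes, t_cut = 3) reproduces the listed link ages, and in the fourth slot the link between nodes two and three is discarded.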

B. Optimal entanglement distribution policies
As described above, nodes always attempt entanglement generation if there are qubits available. Cutoffs are always applied whenever an entangled state becomes too old. Since nodes are free to attempt swaps as soon as entangled links are available or some time later, they must agree on an entanglement distribution policy: a set of rules that indicate when to perform a swap. An optimal policy minimizes the average time required to generate end-to-end entanglement when starting from any state (i.e., from any combination of existing links) and following said policy. In particular, it minimizes the mean entanglement delivery time, which is the average time required to generate end-to-end entanglement when starting from the state with no entangled links. We employ the mean entanglement delivery time as a performance metric.
In a global-knowledge policy, nodes have information about all the entangled links in the chain. In a local-knowledge policy, the nodes only need to know the state of the qubits they hold. An example of a local policy is the swap-asap policy, in which each node performs a swap as soon as both entangled links are available.
We model the evolution of the state of the repeater chain as an MDP. We then formulate the Bellman equations [39] and solve them using value iteration and policy iteration to find optimal global-knowledge policies. More details and formal definitions are provided in Section IV.
Let us now describe the relation between the expected delivery time of an optimal policy, T_opt, and the variables of the system (n, p, p_s, and t_cut). Repeater chains with a larger number of nodes n yield a larger T_opt, since more entangled links need to be generated probabilistically. When p is small, more entanglement generation attempts are required before one succeeds, yielding a larger T_opt. Decreasing p_s also increases T_opt, since more attempts at entanglement swapping are required on average. When t_cut is small, all entangled states must be generated within a small time window and therefore T_opt is also larger. Figure 4 shows the expected delivery time of an optimal policy in a five-node chain. Interestingly, p_s has a much stronger influence on T_opt than p and t_cut: decreasing p_s from 1 to 0.5 in a five-node chain translates into an increase in T_opt of an order of magnitude. Similar behavior is observed for other values of n, as shown in Appendix C.
To evaluate the advantages of an optimal policy, we use the swap-asap policy as a baseline. Early swaps can provide an advantage in terms of delivery time, since swapping earlier can free up qubits that can be used to generate backup entangled links, as displayed in the first transition in Figure 5. However, the age of a swapped link may reach the cutoff time earlier than one of the input links consumed in the swap, as the swapped link assumes the age of the oldest input link. Following the example in Figure 5 and assuming t_cut = 1, if no swaps are performed, the links between nodes two and three and between nodes three and four will exist for one more time slot, while the link between nodes four and five will be removed immediately since it reached the cutoff time. If both swaps are performed, the swapped link between nodes two and five will be removed immediately since it reached the cutoff time. Since there are arguments both in favor of and against swapping early, it is not trivial to determine the scenarios in which the swap-asap policy is close to optimal. Next, we compare the expected delivery times of an optimal global-knowledge policy and the swap-asap policy.
Figure 6 shows the relative difference between the expected delivery times of an optimal global-knowledge policy, T_opt, and the swap-asap policy, T_swap, in a five-node chain. Increasing values of (T_swap − T_opt)/T_opt mean that the optimal policy is increasingly faster on average. Note that we restrict our analysis to the parameter regime p ≥ 0.3 and 2 ≤ t_cut ≤ 6 due to the very large computational cost of calculating the solution for smaller p and larger t_cut (for more details, see Section IV). Let us first focus on deterministic swaps (Figure 6a). The advantage provided by an optimal policy increases for decreasing p. When p is small, links are more valuable since they are harder to generate. Therefore, it is convenient to avoid early swaps, as they effectively increase the ages of the links involved and make them expire earlier. When t_cut is small, a similar effect happens: all entangled links must be generated within a small time window and early swaps can make them expire too soon. For larger t_cut, increasing the age of a link does not have a strong impact on the delivery time, since the time window is larger. Therefore, an optimal policy is increasingly better than swap-asap for decreasing t_cut. The maximum difference between expected delivery times in the parameter region explored is 5.25%.

Figure 5. Entangled links are represented as solid black lines, with occupied qubits in black and free qubits in white. A waiting policy decides not to perform any swap, while the swap-asap policy decides to swap all three links. The swap frees up qubits (marked in orange) that can be used to resume entanglement generation, whether the swap is successful, as in the picture, or not. After performing swaps, a cutoff t_cut = 1 is applied and links with age 1 are removed, causing the swapped link to expire.

Interestingly, probabilistic swaps (Figure 6b) yield the opposite behavior in the parameter region explored: optimal policies are increasingly better than swap-asap for increasing p and t_cut (except when p ≤ 0.4 and t_cut ≤ 3), and the relative difference in expected delivery time can be as large as 13.2% (achieved in a five-node chain with p = 0.9 and t_cut = 6). One reason for this may be the action that each policy decides to perform when the repeater chain is in a full state, which is a situation where each pair of neighboring nodes shares an entangled link (see state at the top of Figure 7). When swaps are deterministic, the optimal policy chooses to swap all links in a full state, since end-to-end entanglement will always be achieved. However, when swaps are probabilistic, an optimal policy generally chooses to perform two separate swaps (see Figure 7), similar to the nested purification scheme proposed in [27]. As an example, for n = 5, p = 0.9, t_cut = 2, and p_s = 0.5, the swap-asap policy yields an expected delivery time of T = 9.35. If, in full states, the swap at the third node is withheld, T drops to 8.34. The swap-asap policy is on average slower than this modified policy by 12.1%. The action chosen in full states has a stronger influence on T for increasing p. This is because full states are more frequent for large p: whenever a swap fails, a full state is soon recovered, since new entangled states are generated with high probability. As a consequence, an optimal policy is increasingly better than swap-asap for higher p when swaps are probabilistic. A similar effect happens for large t_cut. Note, however, that the effect of the action chosen in full states is practically irrelevant in four-node chains (see Appendix C). Note also that the advantage of an optimal policy in terms of delivery time is not always monotonic in p and t_cut (see Appendix C).

Figure 6. In a five-node chain, an optimal policy performs increasingly better than swap-asap for lower/higher values of p and t_cut when swaps are deterministic/probabilistic. Relative difference between the expected delivery times of an optimal policy, T_opt, and the swap-asap policy, T_swap, in a five-node chain, for different values of p and t_cut. (a) Deterministic swaps. (b) Probabilistic swaps.
Optimal policies are also increasingly faster than swap-asap for increasing n, as shown in Figure 8. For example, for p = 0.3, p_s = 0.5, and t_cut = 2, the relative difference in expected delivery time is 1.7%, 5.9%, and 12.3% for n = 4, 5, and 6, respectively. This is in line with the fact that, when the number of nodes grows, there are increasingly more states in which the optimal action to perform is a strict subset of all possible swaps, as shown in Appendix D. Note that, in three- and four-node chains, the relative difference in expected delivery time is generally below 1%.

Figure 7. All possible transitions after performing a swap-asap action or a nested action in a full state, depending on which swaps succeed. In full states, every pair of neighbors shares an entangled link (solid black lines, with occupied qubits in black and free qubits in white). The swap-asap policy decides to swap all links, while the nested approach consists of swapping only at nodes 2 and 4. When swaps are probabilistic, the nested approach is generally optimal in terms of expected delivery time.

III. DISCUSSION
Our work sheds light on how to distribute entanglement in quantum networks using a chain of intermediate repeaters with pre-configured cutoffs. We have shown that optimal global-knowledge policies can significantly outperform other policies, depending on the properties of the network. In particular, we have found and explained non-trivial examples in which performing swaps as soon as possible is far from optimal. We have also contributed a simple methodology to calculate optimal policies in repeater chains with cutoffs that can be extended to more realistic scenarios, e.g., asymmetric repeater chains, by modifying the transition probabilities of the MDP.
In this work, we have assumed that classical communication is instantaneous. Hence, our optimal policies may become sub-optimal in setups with non-negligible communication times, where decisions must be made using local information only. Nevertheless, our optimal policies still constitute a best-case benchmark against which other policies can be compared.
Note also that we have restricted our analysis to repeater chains with fewer than seven nodes. This is due to the exponentially large computational cost of solving the MDP for larger chains (see Appendix G for further details). At the same time, each entanglement swap decreases the fidelity of the entangled links. Hence, a large number of swaps limits the maximum end-to-end fidelity achievable, making chains with a very large number of nodes impractical. Therefore, we consider the analysis of short chains to be the more relevant regime.
An interesting extension of this work would be to explore different cutoff policies. For example, one could allow the nodes to decide when to discard entangled links, or one could optimize simultaneously over the cutoff and the swapping policy. This may lead to improved optimal policies.
As a final remark, note that we have employed the expected delivery time as the single performance metric. In some cases, the expected value and the variance of the delivery time distribution are within the same order of magnitude (some examples are shown in Appendix E). Therefore, an interesting follow-up analysis would be to study the delivery time probability distribution instead of only the expected value. Additionally, we put fidelity aside by only requiring an end-to-end fidelity larger than some threshold value, via a constraint on the cutoff time. This constraint can be lifted to optimize the fidelity instead of the expected delivery time, or to formulate a multi-objective optimization problem that maximizes fidelity while minimizing delivery time.

Figure 8. An optimal policy performs increasingly better than swap-asap in longer chains. Relative difference between the expected delivery times of an optimal policy, T_opt, and the swap-asap policy, T_swap, for t_cut = 2 and different values of p, as a function of the number of nodes n. Black lines correspond to p = 0.3, and the value of p increases in steps of 0.1 with increasing line transparency up to p = 0.9. (a) Deterministic swaps (p_s = 1). (b) Probabilistic swaps (p_s = 0.5).

IV. METHODS
We have formulated the problem of finding optimal entanglement distribution policies as an MDP where each state is a combination of existing entangled links and link ages. Let s be the state of the repeater chain at the beginning of a time slot. As previously explained, s can be described using the age of every entangled link. Mathematically, this means that s can be represented as a vector with one entry per pair of nodes, i.e., of size n(n − 1)/2:

s = [g_1^2, g_1^3, ..., g_1^n; g_2^3, ..., g_2^n; ...; g_{n−1}^n],

where g_i^j is the age of the entangled link between nodes i and j (if nodes i and j do not share an entangled link, then g_i^j = −1). In each time slot, the nodes must choose and perform an action a. Mathematically, a is a set containing the indices of the nodes that must perform swaps (if no swaps are performed, a = ∅).
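For concreteness, flattening a set of links into this age vector can be sketched as follows (illustrative code; here links are stored as a dict keyed by node pairs):

```python
def state_vector(links, n):
    """Vector (g_1^2, ..., g_1^n; g_2^3, ..., g_{n-1}^n) of link ages,
    one entry per pair of nodes i < j, with -1 marking absent links."""
    return [links.get((i, j), -1)
            for i in range(1, n + 1) for j in range(i + 1, n + 1)]
```

For n = 4 the vector has 4 · 3 / 2 = 6 entries, ordered (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).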
The state of the chain at the end of the time slot is s′. Since entanglement generation and swaps are probabilistic, the transition from s to s′ after performing a happens with some transition probability P(s′|s, a). A policy is a function π that indicates the action that must be performed at each state, i.e.,

π : S → A,

where S is the state space and A is the action space. W.l.o.g., we only consider deterministic policies; otherwise, a policy would be a probability distribution instead of a function (see Appendix F for further details).
Let us define s_0 as the state where no links are present and S_end as the set of states with end-to-end entanglement, also called absorbing states. In general, the starting state is s_0, and the goal of the repeater chain is to transition to a state in S_end in the fewest number of steps. When a state in S_end is reached, the process stops. Let us define the expected delivery time from state s when following policy π, T_π(s), as the expected number of steps required to reach an absorbing state when starting from state s. The expected delivery time is also called the hitting time in the context of Markov chains (see Chapter 9 from [40]). A policy π is better than or equal to a policy π′ if T_π(s) ≤ T_π′(s), ∀s ∈ S. An optimal policy π* is one that is better than or equal to all other policies. In other words, an optimal policy is one that minimizes the expected delivery time from all states. One can show that there exists at least one optimal policy in an MDP with a finite and countable set of states (see Section 2.3 from [41]). To find such an optimal policy, we employ the following set of equations, which are derived in Appendix F:

T_π(s) = 0, if s ∈ S_end;
T_π(s) = 1 + Σ_{s′∈S} P(s′|s, π) T_π(s′), otherwise,   (5)

where S is the state space and P(s′|s, π) is the probability of transitioning from state s to state s′ when following policy π. Equations (5) are a particular case of what is generally known in the literature as the Bellman equations.
An optimal policy can be found by minimizing T_π(s), ∀s ∈ S, using (5). To solve this optimization problem, we used value iteration and policy iteration, which are two different iterative methods whose solutions converge to the optimal policy (both methods provided the same results). For more details, see Appendix G, and for a general reference on value and policy iteration, see Chapter 4 from [39].
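As a minimal illustration of the procedure (not the full MDP of this paper), consider a three-node chain with deterministic swaps (p_s = 1) and no cutoff. The only non-absorbing states are "no links" and "one link", since a full state is swapped, and hence absorbed, within the same slot. Iterating the hitting-time equations converges to the expected delivery time, which in this toy case one can check has the closed form (3 − 2p)/(p(2 − p)):

```python
def delivery_time_three_nodes(p, tol=1e-12):
    """Expected delivery time of a 3-node chain (p_s = 1, no cutoff)
    via value iteration on the hitting-time Bellman equations."""
    T0 = T1 = 0.0  # T0: no links present, T1: one link present
    while True:
        new_T1 = 1 + (1 - p) * T1                          # one link still missing
        new_T0 = 1 + (1 - p) ** 2 * T0 + 2 * p * (1 - p) * T1  # both attempts in parallel
        if abs(new_T0 - T0) < tol and abs(new_T1 - T1) < tol:
            return new_T0
        T0, T1 = new_T0, new_T1
```

The fixed point gives T1 = 1/p and T0 = (3 − 2p)/(p(2 − p)), the mean of the maximum of two geometric random variables with parameter p.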
We provide an example of how to calculate the transition probabilities P(s′|s, π) analytically in Appendix H, although this is generally impractical, since the size of the state space grows at least exponentially with n and polynomially with t_cut (as shown in Appendix I, |S| = Ω((t_cut)^{n−2})). Lastly, in Appendix J we discuss how to simplify the calculation of transition probabilities.
As a validation check, we also implemented a Monte Carlo simulation that can run our optimal policies, providing the same expected delivery time that we obtained from solving the MDP.
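In the same spirit, here is a toy Monte Carlo check (a sketch, not the simulation used in the paper) for a three-node chain with deterministic swaps and no cutoff: the delivery time is the maximum of the two link generation times, so the sample mean should approach the analytic value (3 − 2p)/(p(2 − p)).

```python
import random

def mc_delivery_time(p, trials, seed=0):
    """Monte Carlo mean delivery time for a 3-node chain with p_s = 1
    and no cutoff; swap-asap fires in the slot the second link appears."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        left = right = False
        t = 0
        while not (left and right):
            t += 1
            left = left or rng.random() < p    # left pair attempts generation
            right = right or rng.random() < p  # right pair attempts generation
        total += t
    return total / trials
```

With p = 1 every trial finishes in one slot; for intermediate p, the estimate fluctuates around the analytic mean.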

V. DATA AVAILABILITY
The data shown in this paper can be found at [44].

Appendix A: Depolarization of Werner states
In this Appendix we show that the fidelity of a Werner state in which each qubit independently experiences a depolarizing process evolves as

F(t) = 1/4 + (F(t − ∆t) − 1/4) e^{−∆t/τ},    (A1)

where t is the time, ∆t is an arbitrary interval of time, and τ is a parameter that characterizes the exponential decay in fidelity of the whole entangled state due to the qubits being stored in noisy memories. Note that we assume independent noise on each qubit since, in our problem, the qubits are stored in different nodes of the repeater chain.
The depolarizing channel [35,42] is defined as

E(ρ_i) = p ρ_i + (1 − p) I_d / d,

where ρ_i is a single-qubit state (d = 2), 0 ≤ p ≤ 1 (this p is not to be confused with the entanglement generation probability used in the main text of this paper), and I_d is the d-dimensional identity. Let us assume that each qubit independently experiences a depolarizing channel while stored in memory for a finite time t_dep. During an interval of time t_dep, a Werner state ρ with fidelity F is therefore mapped to (E_1 ⊗ E_2)(ρ), where E_i is a depolarizing channel acting on the i-th qubit. Let us calculate this output state explicitly:

(E_1 ⊗ E_2)(ρ) = p^2 ρ + (1 − p^2) I_4 / 4,

with the following steps: a. We use the fact that the partial trace of a maximally entangled state is a maximally mixed state. As a consequence, Tr_i(ρ) = I_2/2, for any Werner state ρ. b. We use the definition of a Werner state: ρ = x |Φ+⟩⟨Φ+| + (1 − x) I_4/4, with Werner parameter x = (4F − 1)/3. The output state (E_1 ⊗ E_2)(ρ) is a Werner state with fidelity

F' = 1/4 + p^2 (F − 1/4).    (A2)

Then, the application of n successive transformations E_1 ⊗ E_2 produces a Werner state with fidelity

F_n = 1/4 + p^{2n} (F − 1/4).    (A3)

This can be shown by induction as follows. The base case (n = 1) is proven in (A2). Next, if we assume that (A3) is true for n = k, we can show that it also holds for n = k + 1:

F_{k+1} = 1/4 + p^2 (F_k − 1/4) = 1/4 + p^{2(k+1)} (F − 1/4),

where we have used (A2) in the first step. The total time required for these operations is ∆t = n t_dep. Therefore, if the fidelity of the state at time t − ∆t was F(t − ∆t), the fidelity at t is given by F(t) = 1/4 + p^{2n} (F(t − ∆t) − 1/4). Finally, we map p ∈ [0, 1] to a new parameter τ ∈ [0, +∞) as p^2 ≡ e^{−t_dep/τ}. Then, we obtain

F(t) = 1/4 + (F(t − ∆t) − 1/4) e^{−∆t/τ}.

x_swap = x_1 x_2,    (B8)

where x_swap is the Werner parameter of the output state after swapping two Werner states with Werner parameters x_1 and x_2. If we apply Equation (B8) repeatedly to our sequence of m entangled links, assuming that all swaps happen simultaneously (i.e., with no decoherence happening in between swaps), we obtain a final state with Werner parameter

x = ∏_{i=1}^{m} x_i.

Then, the final fidelity is given by

F = 1/4 + (3/4) ∏_{i=1}^{m} (4F_i − 1)/3.    (B10)

A similar result was derived in [27], although assuming F_i = F, ∀i.
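The single-step map (A2) can be verified numerically. The sketch below builds a Werner state, applies the depolarizing channel to each qubit (here p is the depolarizing parameter, not the generation probability), and compares the output fidelity with 1/4 + p^2 (F − 1/4); function names are illustrative:

```python
import numpy as np

def werner(F):
    """Two-qubit Werner state with fidelity F with respect to |Phi+>."""
    phi = np.zeros((4, 1)); phi[0] = phi[3] = 1 / np.sqrt(2)
    x = (4 * F - 1) / 3  # Werner parameter
    return x * (phi @ phi.T) + (1 - x) * np.eye(4) / 4

def depolarize_qubit(rho, p, qubit):
    """Depolarizing channel E(sigma) = p sigma + (1-p) I/2 on one qubit of rho."""
    r = rho.reshape(2, 2, 2, 2)  # indices: (ket q0, ket q1, bra q0, bra q1)
    if qubit == 0:
        red = np.einsum('ijik->jk', r)          # trace out qubit 0
        mixed = np.kron(np.eye(2) / 2, red)     # replace qubit 0 by I/2
    else:
        red = np.einsum('ijkj->ik', r)          # trace out qubit 1
        mixed = np.kron(red, np.eye(2) / 2)     # replace qubit 1 by I/2
    return p * rho + (1 - p) * mixed

F, p = 0.95, 0.9
rho = depolarize_qubit(depolarize_qubit(werner(F), p, 0), p, 1)
phi = np.zeros(4); phi[0] = phi[3] = 1 / np.sqrt(2)
F_out = float(phi @ rho @ phi)  # fidelity of the output state
# (A2) predicts F_out = 1/4 + p^2 (F - 1/4)
```

The same routine applied n times reproduces the exponent p^{2n} of (A3).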
We are now ready to find a relationship between the cutoff time and the minimum fidelity in an n-node quantum repeater chain with cutoff time t_cut. For this, we need to identify the sequence of events that produces the end-to-end link with the lowest fidelity. First, note that all the entangled links that eventually form a single end-to-end link are created within a window of t_cut time slots. According to (B6), delaying entanglement swaps has a negative impact on the final fidelity. Therefore, the sequence of events that produces end-to-end entanglement with the lowest fidelity must be one where all swaps are performed at the end of the time window, i.e., all swaps are performed when the oldest link reaches the cutoff time, just before it expires. If any of those swaps were performed earlier, the final fidelity would be larger. Such a sequence of events produces the lowest end-to-end fidelity when all the links are as old as possible, i.e., when their age is t_cut. If any of the links were younger, the end-to-end fidelity would be larger. Hence, the lowest end-to-end fidelity is achieved when all the links are generated simultaneously and all the swaps are performed when those links are t_cut time slots old. In this case, the fidelity of each link before swapping is given by (B2):

F_link = 1/4 + (F_new − 1/4) e^{−t_cut/τ},

where F_new is the fidelity of newly generated elementary links. The final fidelity after swapping all n − 1 links can be calculated using (B10):

F_worst = 1/4 + (3/4) ((4 F_link − 1)/3)^{n−1}.

Finally, we impose that the worst-case end-to-end fidelity must be larger than the desired minimum fidelity F_min: F_worst ≥ F_min. Solving for t_cut yields an explicit condition for the cutoff time:

t_cut ≤ τ [ ln( (4 F_new − 1)/3 ) − (1/(n−1)) ln( (4 F_min − 1)/3 ) ].

When this condition is satisfied, every sequence of events will lead to a large enough fidelity. Consequently, any policy that we implement on the repeater chain will also deliver entanglement with a large enough fidelity.

Figure 10. The swap-asap policy is close to optimal in four-node chains. Relative difference between the expected delivery times of an optimal policy, T_opt, and the swap-asap policy, T_swap, in a four-node chain, for different values of p and t_cut.
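The worst-case analysis above can be sketched numerically. Assuming F_link = 1/4 + (F_new − 1/4) e^{−t_cut/τ} for each link and F_worst = 1/4 + (3/4)[(4 F_link − 1)/3]^{n−1} for the n − 1 swapped links (as follows from (B2) and (B10)), the bound on t_cut follows by solving F_worst ≥ F_min; the helper names are our own:

```python
import math

def max_cutoff(F_new, F_min, n, tau):
    """Largest t_cut (in the same time units as tau) with F_worst >= F_min."""
    x_new = (4 * F_new - 1) / 3  # Werner parameter of a fresh link
    x_min = (4 * F_min - 1) / 3  # Werner parameter at the target fidelity
    return tau * (math.log(x_new) - math.log(x_min) / (n - 1))

def worst_case_fidelity(F_new, n, tau, t_cut):
    """All n-1 links aged t_cut before simultaneous swaps."""
    x_link = (4 * F_new - 1) / 3 * math.exp(-t_cut / tau)
    return 1 / 4 + 3 / 4 * x_link ** (n - 1)
```

At t_cut = max_cutoff(...), the worst-case end-to-end fidelity equals F_min exactly; any smaller cutoff gives a strictly larger worst-case fidelity.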
In four-node chains, an optimal policy can outperform swap-asap, although the largest advantage observed is below 2%, meaning that the swap-asap policy is still close to optimal. The advantage over swap-asap increases up to 30% in five- and six-node chains, as shown in Figure 8 (main text) and in Figure 11.
Figure 11 shows the advantage of an optimal policy over swap-asap in terms of expected delivery time, versus p and for different numbers of nodes. When swaps are deterministic, the advantage is larger for smaller p. The reason is that links are harder to generate as p approaches zero, and therefore a fine-tuned policy that makes better use of those scarce resources is expected to be increasingly better than a greedy policy like swap-asap. When swaps are probabilistic and n > 4, the advantage is no longer monotonic in p, as can clearly be seen for n = 6, t_cut = 2, and p_s = 0.5. On the one hand, the advantage increases as p approaches zero, due to swap-asap making inefficient use of the links, which become a scarce resource. On the other hand, the advantage also increases as p approaches one, due to the effect of full states, as discussed in the main text.
Figure 12. An optimal policy acts as the swap-asap policy in a large number of states. Percentage of states in which the optimal policy found by our solver decides to (a,c) perform all possible swaps or (b,d) not perform any swap, in a five-node repeater chain with (a-b) p_s = 1 or (c-d) p_s = 0.5. We only consider states in which at least one swap can be performed.
In some of these states, not performing swaps reduces the expected delivery time, as explained in the example of full states in the main text.
In Figure 13 we plot the same quantities for increasing numbers of nodes. The percentage of states in which the optimal policy decides to perform all possible swaps, acting as the swap-asap policy, decreases with increasing n. This agrees with the fact that the advantage in expected delivery time provided by an optimal policy over the swap-asap policy increases with increasing n, as shown in the main text. However, the data from Figure 13 alone should not be used to draw any conclusions, since arguments (i) and (ii) also apply to these plots.

Appendix E: Delivery time distribution
Here, we show two examples of repeater chains in which the entanglement delivery time distribution is heavy-tailed. Figure 14 shows the delivery time distribution in a five-node chain with p_s = 0.5, t_cut = 2, and p = 0.5 (Figure 14a) or p = 0.9 (Figure 14b). The results shown here have been calculated by repeatedly simulating the optimal policy in a repeater chain (source code available at https://github.com/AlvaroGI/optimal-homogeneous-chain). As shown in the figure, the distribution is heavy-tailed for some combinations of parameters. In those cases, the average value does not provide an accurate description of the whole distribution.

Figure 13. The percentage of states in which the optimal policy acts as the swap-asap policy decreases in longer chains. Percentage of states in which the optimal policy found by our solver decides to (a) perform all possible swaps or (b) not perform any swap, in a repeater chain with p = 0.3 and t_cut = 2. We only consider states in which at least one swap can be performed.
Appendix F: Expected time to reach an absorbing state

In this Appendix, we show that the expected time required to reach an absorbing state in a discrete Markov decision process (MDP) starting from state s and following policy π, T_π(s), satisfies

T_π(s) = 1 + Σ_{s'∈S} P(s'|s, π) T_π(s'), ∀s ∈ S \ S_end;    T_π(s) = 0, ∀s ∈ S_end,

where S is the state space and P(s'|s, π) is the probability of transitioning from state s to state s' when following policy π. We also discuss the difference between deterministic and stochastic policies.
Let t_π(s) be the time required to reach an absorbing state starting from state s in one realization of the process, and let T_π(s) = E[t_π(s)] be its expected value. The time required to reach an absorbing state starting from s can be calculated as the time required to go from state s to some state s' plus the time required to go from s' to an absorbing state. Since the Markov chain is discrete, we can write this as

T_π(s) = Σ_{s'∈S} P(s'|s, π) (1 + T_π(s')) = 1 + Σ_{s'∈S} P(s'|s, π) T_π(s').

In our implementation of value iteration, the iterative process stops when every value estimate differs by less than a tolerance ε from the values in the previous iteration. All our results have been calculated using ε = 10^−7. In this work, we have applied both value iteration and policy iteration, which provided the same results (our specific implementations can be found at https://github.com/AlvaroGI/optimal-homogeneous-chain). For a detailed explanation of both algorithms, see Sections 4.3 and 4.4 from [39].
In terms of computational cost, policy iteration is generally faster since fewer iterations are required. To the best of our knowledge, there are no known tight bounds on the number of iterations until convergence. However, the computational complexity of a single iteration in policy iteration is O(|A||S|^2 + |S|^3), where A is the action space and S is the state space, which can be prohibitive for some combinations of parameters [43]. The computational complexity of one iteration in value iteration is O(|A||S|^2) [43]. In our problem, the complexity of each iteration increases exponentially with increasing number of nodes and polynomially with increasing cutoff time (see Appendix I), and the number of iterations increases with decreasing probability of entanglement generation and decreasing probability of successful swap, since the estimates of the values are worse when these probabilities are small. Consequently, to study long chains with large cutoffs and small probabilities of successful entanglement generation and swap, one may need to employ approximate methods, such as deep reinforcement learning, which can find sub-optimal but good-enough policies at a lower computational cost.
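A compact policy-iteration sketch for hitting-time MDPs is shown below (illustrative, using the same tabular representation as before; it assumes every policy eventually reaches an absorbing state, otherwise the evaluation system is singular). The O(|S|^3) term in the complexity comes from the linear solve in the evaluation step:

```python
import numpy as np

def policy_iteration(P):
    """P[s] = list of actions; each action = list of (next_state, prob) pairs.

    Absorbing states are those never appearing as keys of P (T = 0 there).
    Returns (policy, T) with policy[s] the index of the chosen action.
    """
    states = list(P)
    idx = {s: i for i, s in enumerate(states)}
    policy = {s: 0 for s in states}  # start from an arbitrary policy
    while True:
        # Policy evaluation: solve (I - P_pi) T = 1 on the non-absorbing states.
        A = np.eye(len(states))
        for s in states:
            for s2, q in P[s][policy[s]]:
                if s2 in idx:  # absorbing successors contribute T = 0
                    A[idx[s], idx[s2]] -= q
        T = np.linalg.solve(A, np.ones(len(states)))
        # Policy improvement: greedy one-step lookahead.
        new = {s: min(range(len(P[s])),
                      key=lambda a: sum(q * T[idx[s2]]
                                        for s2, q in P[s][a] if s2 in idx))
               for s in states}
        if new == policy:
            return policy, dict(zip(states, T))
        policy = new
```

Each iteration performs one exact evaluation (cubic in |S|) and one greedy improvement, and the loop terminates once the policy is stable.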
Appendix H: Markov decision process example

Here we provide an example of how to formulate the Markov decision process (MDP) for a three-node repeater chain with cutoff t_cut = 1. Specifically, we calculate each term in the Bellman equations, which can then be used to find an optimal policy, as explained in the main text.
We start by listing all the states in which the chain can be found. The sequence of events during each time slot is the following:
1. First, the ages of all entangled links are increased by 1.
2. Second, entanglement generation is attempted between every pair of neighbors with qubits available.
3. Third, entanglement swaps can be performed.
4. Lastly, links whose age is equal to t_cut are removed. End-to-end links are not removed.
Since the cutoff is 1, the ages of all links are at most 1. All possible states are listed in Figure 15. Let us now find the equation for T_π(s_0), the expected delivery time from state s_0, which is given by

T_π(s_0) = 1 + Σ_{s'∈S} P(s'|s_0, π) T_π(s'),

as explained in the main text and derived in Appendix F. For clarity, let us abuse notation and write T_i to denote T_π(s_i). We can find each term by considering each possible scenario separately:
• With probability (1 − p)^2, no links are successfully generated and the state remains s_0. Swaps and cutoffs do not apply to this state. This contributes a term (1 − p)^2 T_0.
• With probability p(1 − p), only one of the links is generated and the state becomes either s_1 or s_2. Swaps and cutoffs do not apply to these states. This contributes p(1 − p) T_1 + p(1 − p) T_2.
• With probability p^2, both links are generated and the state becomes s_3. Then, a swap can be performed and the last term splits into two contributions:
- If the policy decides to perform a swap in node 2, i.e., π(s_3) = {2}, the state at the end of the time slot will be s_4 if the swap is successful, and s_0 if the swap fails. These scenarios contribute p^2 1_{π(s_3)={2}} [p_s T_4 + (1 − p_s) T_0], where 1_A is the indicator function that takes value 1 if A is true and value 0 otherwise.
- If the policy decides to not perform the swap, i.e., π(s_3) = ∅, the state remains s_3. The contribution is then p^2 1_{π(s_3)=∅} T_3.

Now, we can write all the terms above in a single equation:

T_0 = 1 + (1 − p)^2 T_0 + p(1 − p)(T_1 + T_2) + p^2 1_{π(s_3)={2}} (1 − p_s) T_0 + p^2 1_{π(s_3)=∅} T_3.

Note that T_4 = T_5 = 0, since s_4 and s_5 are absorbing states. Rearranging terms, we obtain

T_0 [1 − (1 − p)^2 − p^2 1_{π(s_3)={2}} (1 − p_s)] = 1 + p(1 − p)(T_1 + T_2) + p^2 1_{π(s_3)=∅} T_3.    (H2)

Let us now find the equation for s_1. At the beginning of the time slot, the age of the link is increased by 1 and the state becomes s_6. After that:
• With probability (1 − p), no links are successfully generated. Then, the only existing link is removed since it is 1 time slot old. This contributes a term (1 − p) T_0.
• With probability p, the remaining link is generated and the state becomes s_8. Then, a swap can be performed:
- If the policy decides to perform a swap in node 2, i.e., π(s_8) = {2}, the state at the end of the time slot will be s_5 if the swap is successful, and s_0 if the swap fails. These scenarios contribute p 1_{π(s_8)={2}} [p_s T_5 + (1 − p_s) T_0].
- If the policy decides to not perform the swap, i.e., π(s_8) = ∅, the state remains s_8, and the link with age 1 is removed afterwards. The state becomes s_2 and the contribution is then p 1_{π(s_8)=∅} T_2.
Combining all terms, we obtain

T_1 = 1 + (1 − p) T_0 + p 1_{π(s_8)={2}} (1 − p_s) T_0 + p 1_{π(s_8)=∅} T_2,    (H3)

where we have used that T_5 = 0. Due to the symmetry of the problem, T_2 = T_1 (H4), so we do not need to derive a new equation for T_2. Lastly, we find the equation for s_3. At the beginning of the time slot, the age of each link is increased by 1 and the state becomes s_10, in which no more links can be generated. After that, a swap can be performed:
- If the policy decides to perform a swap in node 2, i.e., π(s_10) = {2}, the state at the end of the time slot will be s_5 if the swap is successful, and s_0 if the swap fails. These scenarios contribute 1_{π(s_10)={2}} [p_s T_5 + (1 − p_s) T_0].
- If the policy decides to not perform the swap, i.e., π(s_10) = ∅, the state remains s_10, and both links are removed afterwards, when cutoffs are applied. The state becomes s_0 and the contribution is then 1_{π(s_10)=∅} T_0.
The equation then reads

T_3 = 1 + 1_{π(s_10)={2}} (1 − p_s) T_0 + 1_{π(s_10)=∅} T_0,    (H5)

where we have used that T_5 = 0.
We can write Equations (H2), (H3), (H4), and (H5) as a single system of equations in the unknowns T_0, T_1, T_2, and T_3 (H6). An optimal policy π* can be found by minimizing T_0, T_1, and T_3 in this system of equations. This can be done, e.g., using iterative algorithms such as value and policy iteration, as discussed in the main text. In this case, it can be shown that the swap-asap policy is optimal, i.e., π*(s_3) = π*(s_8) = π*(s_10) = {2}. This also makes sense intuitively: once both links are generated, waiting provides no advantage in terms of delivery time over performing the swap immediately. For this policy, the system of equations becomes

T_0 [1 − (1 − p)^2 − p^2 (1 − p_s)] = 1 + 2p(1 − p) T_1,
T_1 = T_2 = 1 + (1 − p p_s) T_0,
T_3 = 1 + (1 − p_s) T_0,

which yields an optimal expected delivery time of

T_0 = (1 + 2p(1 − p)) / (p^2 p_s (3 − 2p)).
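As a cross-check, the swap-asap system can be solved numerically. Under π(s_3) = π(s_8) = π(s_10) = {2} and T_2 = T_1, the contributions listed above give T_0 = 1 + (1 − p)^2 T_0 + 2p(1 − p) T_1 + p^2 (1 − p_s) T_0 and T_1 = 1 + (1 − p p_s) T_0; the closed form below is our own solution of this small linear system, not quoted from elsewhere:

```python
import numpy as np

def t0_closed_form(p, ps):
    """Expected delivery time of swap-asap in a three-node chain, t_cut = 1."""
    return (1 + 2 * p * (1 - p)) / (p ** 2 * ps * (3 - 2 * p))

def t0_linear_system(p, ps):
    """Solve the two swap-asap equations for (T0, T1) and return T0."""
    A = np.array([[1 - (1 - p) ** 2 - p ** 2 * (1 - ps), -2 * p * (1 - p)],
                  [-(1 - p * ps), 1.0]])
    b = np.array([1.0, 1.0])
    return np.linalg.solve(A, b)[0]
```

For p = p_s = 1, both routines give T_0 = 1: both links are generated in the first slot and the deterministic swap immediately delivers end-to-end entanglement.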
As a final remark, note that the expected delivery times from states s_6 to s_10 were not necessary to compute T_0. In fact, states s_6 to s_10 cannot exist at the beginning of a time slot, since the links that exist at the beginning of a time slot are always younger than t_cut (i.e., their age is 0) or are end-to-end links. This is the reason why we do not need to optimize over T_6 to T_10.
Let us now consider a chain with two links where both are connected to the same node i. From node i towards one end of the chain, there are i − 1 nodes. From i towards the other end of the chain, there are n − i nodes. Then, the number of states with two links where both are connected to node i is given by

(i − 1)(n − i) t_cut^2,

where the last factor accounts for all possible ages of both links. We can find a lower bound to |S(2)| by considering only the states in which both links are connected to the same node i, i.e.,

|S(2)| ≥ Σ_{i=1}^{n} (i − 1)(n − i) t_cut^2 = (n choose 3) t_cut^2,

where, in step a, we have used the binomial sum Σ_{i=1}^{n} (i − 1)(n − i) = (n choose 3).
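The counting argument can be checked numerically; the identity Σ_{i=1}^{n} (i − 1)(n − i) = C(n, 3) used in step a is easy to verify, and the helper below (an illustrative sketch, not our solver's state enumeration) evaluates the resulting lower bound:

```python
from math import comb

def two_link_lower_bound(n, t_cut):
    """Lower bound on |S(2)|: count states whose two links share a node i,
    with (i-1)(n-i) endpoint choices and t_cut^2 age combinations each."""
    return t_cut ** 2 * sum((i - 1) * (n - i) for i in range(1, n + 1))

# The sum over i equals binomial(n, 3), so the bound is C(n, 3) * t_cut^2.
```
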

Figure 1. A quantum repeater chain that can store two qubits per intermediate node and one qubit per end node. White circles represent qubits. All nodes are equidistant and identical.

Figure 2. Entanglement swap. When node B performs a swap, an entangled link between nodes A and B and an entangled link between nodes B and C are consumed to produce a single entangled link between A and C. This operation is essential for the generation of long-distance entanglement.

Figure 3. Example of entangled link dynamics in a repeater chain. Each row represents the state of the chain at the end of time slot t. Entangled links are represented as black solid lines, with occupied qubits as black circles. The number above each entangled link is the age of the link. We assume cutoff t_cut = 3.

Figure 4. The expected delivery time increases with lower p, p_s, and t_cut. Expected delivery time of an optimal policy, T_opt, versus p in a five-node chain, for different values of cutoff (t_cut = 2, 5, 10). Solid lines correspond to deterministic swaps (p_s = 1) and dashed lines correspond to probabilistic swaps with p_s = 0.5.

Figure 5. Swap-asap policies free up qubits, but swapped links expire earlier. Evolution of an example state when following a waiting policy versus the swap-asap policy during a single time slot. Entangled links are represented as solid black lines, with occupied qubits in black and free qubits in white. A waiting policy decides to not perform any swap, while the swap-asap policy decides to swap all three links. The swap frees up qubits (marked in orange) that can be used to resume entanglement generation, whether the swap is successful (as in the picture) or not. After performing swaps, a cutoff t_cut = 1 is applied and links with age 1 are removed, causing the swapped link to expire.
Figure 6. In a five-node chain, an optimal policy performs increasingly better than swap-asap for lower/higher values of p and t_cut when swaps are deterministic/probabilistic. Relative difference between the expected delivery times of an optimal policy, T_opt, and the swap-asap policy, T_swap, in a five-node chain, for different values of p and t_cut. (a) Deterministic swaps (p_s = 1). (b) Probabilistic swaps (p_s = 0.5).

Figure 9. The expected delivery time increases with lower p, p_s, and t_cut. Expected delivery time of an optimal policy, T_opt, versus p for (a) n = 3 and (b) n = 4 and different values of cutoff (t_cut = 2, 5, 10). Solid lines correspond to deterministic swaps (p_s = 1) and dashed lines correspond to probabilistic swaps with p_s = 0.5.

Figure 11. The advantage provided by an optimal policy over swap-asap is not always monotonic with p. Relative difference between the expected delivery times of an optimal policy, T_opt, and the swap-asap policy, T_swap, in an n-node chain with t_cut = 2, for different values of n and p.

Figure 14. The delivery time distribution can be heavy-tailed. Delivery time distribution after simulating an optimal policy in a five-node repeater chain with p_s = 0.5, t_cut = 2, and (a) p = 0.5 or (b) p = 0.9. The number of samples is 10^5. Solid orange lines correspond to the expected delivery time of the optimal policy.

Figure 15. All possible states in a three-node repeater chain with cutoff t_cut = 1. Nodes are labeled 1 to 3 from left to right.

Figure 16. The number of states scales at least exponentially with increasing n and polynomially with increasing t_cut. (a) Number of states versus the cutoff time in a four-node chain, and (b) versus the number of nodes in a chain with cutoff t_cut = 1. Solid lines correspond to the number of states found by our policy iteration algorithm (note that the number of states only depends on n and t_cut). The purple solid line is the total number of states and the green line is the number of states in which a decision can be made (i.e., states in which at least one swap can be performed). The dashed line corresponds to the lower bound (I6) to the total number of states.
Time is discretized into non-overlapping time slots.