Computing secure key rates for quantum cryptography with untrusted devices

Device-independent quantum key distribution (DIQKD) provides the strongest form of secure key exchange, using only the input–output statistics of the devices to achieve information-theoretic security. Although the basic security principles of DIQKD are now well understood, it remains a technical challenge to derive reliable and robust security bounds for advanced DIQKD protocols that go beyond the previous results based on violations of the CHSH inequality. In this work, we present a framework based on semidefinite programming that gives reliable lower bounds on the asymptotic secret key rate of any QKD protocol using untrusted devices. In particular, our method can in principle be utilized to find achievable secret key rates for any DIQKD protocol, based on the full input–output probability distribution or any choice of Bell inequality. Our method also extends to other DI cryptographic tasks.


INTRODUCTION
Device-independent quantum key distribution (DIQKD) considers the problem of secure key exchange using devices which are untrusted or uncharacterized [1][2][3].In this setting, security is based entirely on the observation of nonlocal correlations, which are typically measured by a Bell inequality [4,5].In particular, if the correlations violate a Bell inequality, then we say that they are non-local.This is useful for secure key distribution, for it certifies that the key must come from measurements on an entangled state [6][7][8].While the basic principle behind the security of DIQKD is well understood from the monogamy property of non-local correlations [9], an explicit security analysis is rather involved and tricky.This is mainly because the dimension of the underlying shared quantum state is unknown and therefore the usual security proof techniques can not be applied.
Recently, security proof techniques based on semidefinite programming (SDP) have been proposed for standard QKD [10][11][12][13][14].In this so-called device-dependent (DD) setting, the underlying QKD devices are assumed to be suitably characterized.Our main result extends this approach to a wider range of settings, adapting to different levels of device characterization (see Fig. 1).Previously, to prove the security of DIQKD, the existing approaches were to either employ a reduction to qubit-level systems [1], or to bound the adversary's guessing probability [15][16][17].However, the former is restricted to protocols based on the CHSH inequality or similar Bell inequalities with binary inputs and outputs, while the latter only bounds the minentropy, which typically leads to sub-optimal bounds on the von Neumann entropy (the relevant quantity for computing secret key rates against general attacks in the asymptotic limit [3]).The direct computation of DIQKD secret key rates is therefore an important task to address [18].
Here, we approach this problem with a universal computational toolbox that directly bounds the von Neumann entropy using the complete measurement statistics of a device-independent cryptographic protocol.Given this, our FIG.1. Levels of device assumptions: Under devicedependent (DD) assumptions, all measurements and their underlying Hilbert spaces are characterized.Under fully deviceindependent (DI) assumptions, none of these are known, and we only assume the validity of quantum mechanics.One-sided device-independent (1sDI) assumptions lie between these two cases.For the 1sDI setting, we consider the case where one party's measurements are fully characterized while the other's are unknown (e.g.see Refs.[19,20]).
method not only applies to QKD, but also to some other DI cryptographic tasks such as randomness expansion [21][22][23][24][25]. Importantly, this computational approach liberates the scope of device-independent cryptography to more complex scenarios, which could prove useful in analyzing the security of non-standard protocols which are known to be more robust against noise and loss [26][27][28][29][30], as well as multipartite protocols [31].
The main mechanism of our toolbox is a technique for estimating the entropy production of a quantum channel acting on an unknown state under algebraic constraints.Entropy production [32][33][34][35] is a fundamental concept traditionally used to study non-equilibrium thermodynamical processes, but here we show that it has an intrinsic connection to quantum cryptography as well.The simplest way to understand entropy production is to view it as the amount of entropy introduced to a system after performing some action on it.For instance, in the case of projective measurement, the entropy production is the entropy difference between the post-measurement system and the initial system.
Our toolbox bounds this entropy production via a (noncommutative) polynomial optimization over the measurement operators in the protocol.This can be evaluated using the SDPs in the the Navascués-Pironio-Acín (NPA) hierarchy [36].In this context, switching from DI to 1sDI or DD scenarios translates to adding more constraints on the SDPs and thus higher values for the final secret key rates.We present the key ideas used to derive this bound in the Methods section, and more specific details in Sec.I-III of the Supplement.
(After release of this preprint, other approaches to solve the same optimization problem were separately developed in [37,38], with the technique in the latter yielding arbitrarily tight bounds in principle.We refer the interested reader to those works for comparisons and further details.)

Main theorem
We focus mainly on describing our results for DIQKD, with results for other DI cryptographic tasks elaborated on in Sec.I-IV of the Supplement.
FIG. 2. Basic situation: By measuring her share of the joint state ψABE with measurement A0, Alice is (virtually) sending a raw key to Bob who (virtually) receives it by measuring B0.Bob's uncertainty about Alice's bit values is quantified by the classical entropy H(A0|B0).We assume that Eve has access to all classical communication and her share of the joint quantum state, which gives her some partial information on A0 as well.This is quantified by the classical-quantum entropy H(A0|E).
To assess the security performance of QKD, one can start by finding the asymptotic key rate under the assumption of independent and identically distributed (IID) states.In this setting, we consider protocols that are modelled as follows: in each round, Alice and Bob share a quantum state ρ AB , and Eve's side-information E is described by the purification ψ ABE of ρ AB (see Fig. 2).Qualitatively, this means Eve controls all systems that are not in the labs of Alice and Bob.In each round, Alice (resp.Bob) performs one measurement from a set {A 0 , A 1 , . . .A X −1 } (resp.{B 0 , B 1 , . . ., B Y−1 }) on their local system.The raw key will be produced from the measurements (A 0 , B 0 ).This model describes entanglement-based protocols, but can be easily converted to security proofs for prepare-and-measure protocols [13,39,40].Here, we focus on protocols that use one-way error correction.In this case, the asymptotic key rate r ∞ is lower bounded by the Devetak-Winter formula [41]: where H is the von Neumann entropy (which reduces to the Shannon entropy for H(A 0 |B 0 )).This can be intuitively interpreted as the difference between Eve's and Bob's uncertainty about Alice's measurement A 0 .The H(A 0 |B 0 ) term in Eq. ( 1) can be computed based on the expected behaviour of the devices (see [3] for more details), so the main challenge here is to reliably bound H(A 0 |E) using the observed statistics.More specifically, suppose the protocol estimates parameters of the form abxy Pr(ab|xy) for some coefficients c (j) abxy , where Pr(ab|xy) is the probability of obtaining outcomes (a, b) from measurements (A x , B y ) (e.g.these parameters could be Bell inequalities in a DI scenario).Without loss of generality (see Sec. V of the Supplement) we assume all measurements are projective by taking an appropriate Naimark dilation.For simplicity, we take the systems to be finitedimensional; however, we do not impose any upper bound on the dimension.Let P a|x denote the projector corresponding to an outcome a of Alice's measurement A x , and analogously, let P b|y denote Bob's measurement projectors.Our task is to find lower bounds on where L j = abxy c (j) abxy P a|x ⊗ P b|y , and the infimum takes place over ψ ABE and any uncharacterized measurements (which may be some or all of the measurements, for 1sDI or DI scenarios).This is a non-convex optimization (even after applying the approach from [10]), and the dimensions of any uncharacterized measurements could be arbitrarily large, hence there is no a priori guarantee that any specific dimension suffices to find the minimum.Our central result is a method to tackle this task despite its challenges, which we achieve by proving the following theorem: Theorem 1.For a DI scenario as described, the minimum value of H(A 0 |E) (in base e), subject to constraints where xy ab e κ abxy P a|x ⊗ P b|y ) or detection efficiency (for the scenario studied in [42], which optimizes the state and measurements to achieve maximal CHSH value).For the latter, r∞ was computed by optimizing the key-generating measurement B0 alone to minimize the value H(A0|B0), without changing the state and other measurements from those in [42].Also, to yield higher keyrates, the key-generating measurement B0 was preserved as a 3-outcome measurement (following [43]) rather than postprocessing it to 2 outcomes.It can be seen from the graph that our bounds are either close to or slightly better than the best previous result [1] for these scenarios, which was based on the CHSH value alone.For comparison, we also show the indirect bounds obtained by using the inequality H(A0|E) ≥ 2(1 − Pg(A0|E)) (in base 2).
Importantly, Eq. ( 4) is a non-commutative polynomial in the measurement operators, and thus the task of bounding K ρ can now be tackled using the well-established NPA hierarchy [36].We can also study 1sDI scenarios by imposing additional algebraic constraints corresponding to those satisfied by the characterized measurements.We highlight that since the optimization over λ is a supremum, any value of λ yields a secure lower bound, without needing to identify the optimal λ.
To go beyond the asymptotic IID scenario, one could apply the recently developed entropy accumulation theorem [3,44].This technique is applicable to DD, 1sDI and DI scenarios, and shows that the key rate against general attacks is still of a form essentially similar to Eq. ( 1).It inherently accounts for finite-size and non-IID effects, and reduces the main challenge in a security proof to an IID problem, namely, finding lower bounds on the optimization problem in Eq. ( 2) (see [3,44,45] for more details).Specif-ically, our technique allows us to bound the min-tradeoff function in the statement of the entropy accumulation theorem.Hence our approach could also be used to compute finite key lengths against general attacks, by applying the entropy accumulation theorem.

Computed keyrates
We apply our method to two commonly studied DI scenarios, in which Alice and Bob each perform parameter estimation on 2 binary-outcome measurements.(For QKD purposes, Bob will need to perform a third measurement for key generation, corresponding to B 0 in Eq. ( 1), but we do not use this when bounding H(A 0 |E).)Our results are shown in Fig. 3. Results in some other scenarios, including distributions optimized for tilted CHSH inequalities [46], are presented in Sec.IV of the Supplement.The first scenario is parametrized by a depolarizingnoise value q ∈ [0, 1/2], and corresponds to performing the ideal CHSH measurements (i.e. Lower bounds on H(A0B0|E) as a function of depolarizing noise (for the scenario studied in [1]) or detection efficiency (for the scenario studied in [42]).Our approach outperforms both previous approaches, namely indirect bounds via the one-party entropy H(A0|E) (using the bound in [1]) or the guessing probability.We also show a curve obtained when applying our method with only the CHSH value as the constraint, instead of the full output distribution.
state (1 − 2q)|Φ + Φ + | + (q/2)I, where |Φ + is the Bell state (|00 + |11 )/ √ 2 and Z and X are Pauli operators.The second scenario is a limited-detection-efficiency model parametrized by η ∈ [0, 1], where for every measurement the outcome 1 is flipped to 0 with probability 1 − η.This is a simplistic model for a photonic setup where all nondetection events are mapped to the outcome 0 [42].For this scenario, we use different states and measurements for each value of η, as follows: to compute the H(A 0 |E) bound, we first optimize the state and parameter-estimation measurements to maximize the CHSH value the same way as in [42]; then to compute the r ∞ curves, we optimized the keygenerating measurement B 0 on its own without changing the state or other measurements.In principle, this yields parameter choices that may be suboptimal for maximizing H(A 0 |E) or r ∞ , since maximizing either of these quantities is not necessarily equivalent to maximizing CHSH value (this was later confirmed in [37,38], which aimed to optimize the rates directly).However, our method is too computationally intensive to attempt to maximize H(A 0 |E) or r ∞ directly, so we use the CHSH value as an indirect proxy (since it can be optimized independently of our bounds).
The previous best bound on H(A 0 |E) in these scenarios (see Sec. IV of the Supplement for known results in other cases) was that derived in Ref. [1], which uses only the CHSH value instead of the full probability distribution.To make use of the latter, the only preceding approach was to first bound the guessing probability P g (A 0 |E) and then apply the inequality H(A 0 |E) ≥ − ln P g (A 0 |E) [15][16][17] (all entropies are in base e unless otherwise specified).We note that if the marginal distribution of A 0 is uniform and binary-valued, then in fact the tighter inequality [47] H(A 0 |E) ≥ (2 ln 2)(1 − P g (A 0 |E)) holds, and we plot this bound in Fig. 3. (See Sec.IV of the Supplement for details on how it applies in the limited-detection-efficiency model.)However, approaches based on guessing probability do not outperform the bound in [1] for the two scenarios consid-ered here.
Our method uses the full input-output distribution to bound H(A 0 |E) directly.As shown in Fig. 3, it gives results that are close to or slightly outperform the bound from Ref. [1].Roughly speaking, our approach tends to perform well for moderate noise values, which is useful since many Bell-test implementations are currently in such noise regimes [48][49][50][51][52].Our results prove that for the limiteddetection-efficiency scenario, better bounds on H(A 0 |E) can be obtained by considering the full distribution rather than just the CHSH value (since the CHSH-based bound [1] is tight).This suggests it may not be optimal to simply choose experimental parameters that maximize the CHSH value-maximizing a different Bell value may allow our method to yield a further improvement over the results in Fig. 3.

FIG. 5. 1sDI six-state protocol: Lower bounds on H(A0|E)
for a 1sDI version of the six-state protocol [53].Interestingly, the bound we obtain from our method coincides with that for the BB84 protocol.For reference, we also show the bound that could be obtained from a tomographically complete characterisation of the state, such as via the measurements in the standard (devicedependent) six-state protocol.
With minor modifications (see Sec.I of the Supplement) our method can also bound the "two-party entropy" H(A 0 B 0 |E), which is relevant for DI randomness expansion [21][22][23][24][25].The previous approaches for this were similar to those for H(A 0 |E): firstly, simply noting that H(A 0 B 0 |E) ≥ H(A 0 |E) and then applying the bound from [1]; secondly, bounding it via H(A 0 B 0 |E) ≥ − ln P g (A 0 B 0 |E).These approaches are suboptimal for similar reasons as before, though here the former is further limited by the fact that it ignores the register B 0 .As shown in Fig. 4, our method clearly outperforms both of these approaches, which could improve the key rates for DI randomness expansion.
We also analyze a 1sDI version of the six-state protocol [53], where Bob's measurement device is uncharacterized.As mentioned earlier, the characterization of Alice's device translates to algebraic relations between the operators P a|x , which we impose as additional constraints on top of the NPA hierarchy.We see that in Fig. 5, the resulting bound coincides with the bound for the BB84 protocol.This supports a conjecture [54] that when Bob's measurements are uncharacterized, performing three measurements does not offer any advantage over performing only two measurements.

DISCUSSION
Here, we have developed a universal toolbox to obtain reliable secret key rates for QKD with untrusted devices.The main advantage of our method is that it can be applied to any DIQKD protocol, not only those based on specialized Bell inequalities.The only previous known approach that could be applied to DIQKD with such generality is that based on bounding the guessing probability [15][16][17], which is generally not optimal.Our method outperforms all earlier results in some cases, as shown in Figs. 3 and 4. Importantly, it seems to give good bounds in regimes with substantial noise, which are likely to be experimentally relevant.
Currently, our method scales rapidly in computational difficulty as the number of inputs or outputs for the protocol increases-the polynomial in Eq. ( 4) is generally of high order, hence a high NPA hierarchy level [36] is needed to bound K ρ .Because of this, we currently do not have good bounds for DI scenarios with large numbers of inputs or outputs (though we find suboptimal bounds for some such cases; see Sec.IV of the Supplement).An important goal now would be to find ways to improve the tractability of our approach, perhaps by following reductions along the lines of those described in Ref. [55].This would enable the computation of key rates for DIQKD protocols (or other DI protocols) with more measurement settings and/or outcomes.
With our toolbox in hand, one can now explore DI protocols based on maximizing a variety of Bell expressions (or maximizing the key rate directly) instead of being re-stricted to CHSH.While the scaling issues currently impose some limitations, we observe that there remains substantial unexplored territory even within 2-input 2-output DI protocols.For instance, the tilted CHSH inequalities [46] can certify higher two-party entropies than CHSH in the absence of noise, but the previous bounds were based on min-entropy and not very noise-robust.Using our approach to improve these bounds (see Sec. IV of the Supplement) would be relevant for experimental implementations of DI protocols such as randomness expansion [24,25].

Bounding the von Neumann entropy
The advantage of quantum over classical cryptography stems from the fact that for the former, it is possible to bound Eve's knowledge using only Alice's and Bob's systems (essentially, using the monogamy property of entanglement).To make this precise for H(A 0 |E), one can regard the key-generating measurement as a quantum-to-classical channel that maps Alice's (quantum) system A to a memory register A 0 which stores the (classical) measurement outcomes.By Stinepring's theorem [56], this channel can be described via an isometry V to an extended system A 0 A .This isometry maps the pure initial state Ψ ABE to a pure final state Ψ A BEA0 (see Fig. 6).

FIG. 6. Connection to entropy production:
The keygenerating measurement is regarded as an isometry to a larger Hilbert space, by expanding the classical memory A0 with an ancilla A .From this perspective the initial and final states are pure, and thus the entropy change ∆H on the memory-Eve subsystem equals the entropy change on the Alice-Bob subsystem.
Since the entropies of the marginal states of a pure bipartite state are equal, this gives where (We remark that this approach was used in Ref. [57].)The last line can be interpreted as entropy production, ∆H, resulting from the transformation AB → A B. Since it only depends on the reduced states of Alice and Bob, they can be used to bound Eve's knowledge using only their own systems.For projective measurements, V can be chosen [57] such that T is the pinching channel Besides its application to QKD, the amount of entropy that is produced or consumed by a quantum operation T is one of the central quantities of a physical system.However, computing this entropy quantity is technically challenging, since the entropy of a quantum state is not directly accessible.Instead, the quantities that are directly accessible are typically the expectation values of certain observables, i.e. expressions of the form L j ρ = tr(ρL j ) for operators L j (which in QKD scenarios have the form described earlier).Following this perspective, we have to study the following problem: find bounds on ∆H that hold for all states consistent with the observed constraints L j ρ = l j .For QKD, these bounds have to be lower bounds, since we consider the "worst-case scenario" for the honest parties.
To solve this problem, we propose the following ansatz : for coefficients λ j ∈ R, we define L = j λ j L j and aim to find an operator K such that holds for all states.To find such a K, we note that Jensen's operator inequality and the Gibbs variational principle imply (see Sec. III of the Supplement for details) where T * is the adjoint channel of T .Applying a recently discovered generalisation of the Golden-Thompson inequality [58], it follows that for any self-adjoint Lk such that L = k Lk , we can choose where β(t) = (π/2)(cosh(πt) + 1) −1 .Thus, this yields a family of lower bounds on H (T [ρ]) − H(ρ), characterized by λ j and Lk .
Our task is now reduced to finding upper bounds on K ρ .If the explicit matrix representation of K is known, such as in a DD scenario, this is an SDP in a standard form and can be solved directly (see e.g.[10]).However, 1sDI and DI scenarios appear much more challenging, because one does not have an explicit form for K.This reveals the key breakthrough allowed by our approach: a careful choice of Lxy lets us bound K ρ without an explicit matrix representation.Specifically, by setting we obtain (see Sec. III of the Supplement) Theorem 1 as stated above.For the DI scenario, the channel T is selfadjoint and idempotent, so T * T = T .With this choice of Lxy , we achieved the critical goal of reducing K ρ to a form that can be bounded using the NPA hierarchy.

PUBLICATION
This version of the article has been accepted for publication, after peer review, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections.The Version of Record is available online at: https://doi.org/10.1038/s41534-021-00494-zSupplementary Information: Computing secure key rates for quantum cryptography with untrusted devices

I. EVE'S SIDE-INFORMATION AS ENTROPY PRODUCTION
We first give a more detailed derivation of the relation described in the main text between H(A 0 |E) and entropy production on ρ AB .Following the notation in the main text, recall that our goal is to find a lower bound on where X ab|xy = P a|x ⊗ P b|y , and the infimum takes place over ψ ABE and any measurements.(In the main text, a slightly more general situation was considered, where constraints were formed using linear combinations of the operators X ab|xy ; note that subsequent analyses in this Supplement can be easily generalized to that situation.)In Sec.V below, we argue that without loss of generality, we can assume that all uncharacterized measurements are projective.
We regard the process of producing Alice's raw key (in one round) as a quantum-to-classical channel R that maps Alice's quantum system A to a memory register A 0 that stores the classical measurement outcome.Explicitly, we have By considering the complementary channel of R⊗id B , we can express H(A 0 |E) in terms of ρ AB using the following lemma: Lemma 1.For any projective measurement A 0 on a pure state ψ ABE as described above, we have where T is the pinching channel If all projectors P a|0 are rank-1, this reduces to Proof.By Stinespring's theorem [1], we can describe the channel R as the restriction of an isometry V from A to an expanded system A 0 A .For projective measurements, we can take A to be isomorphic to A and use as easily verified by a short calculation.
When this isometry is applied to Alice's part of our initial global pure state Ψ ABE , it maps it to a pure final state Ψ A BEA0 .Since the entropies of the two sides of a bipartite pure state are equal, we have and hence we can write where T is the complementary channel of R ⊗ id B : If all measurement operators are rank-1 projectors, we can write them in the form P a|0 = |a a| A and identify these with the register states |a a| A0 , in which case T [ρ AB ] ∼ = ρ A0B and hence the expression simplifies to We remark that if one is instead interested in lowerbounding the entropy H(A 0 B 0 |E) of both Alice and Bob's outputs (for inputs (x, y) = (0, 0)), the proof of Lemma 1 also generalizes to that task, by constructing (This slightly increases the order of the operator polynomial K in our subsequent constructions.)Such a bound is used in, for instance, security proofs for DI randomness expansion [2].However, additional work would seem to be necessary to obtain bounds suitable for DI randomness amplification, where the biased input distribution can increase the Bell value achievable by classical (insecure) models.
Additionally, the analysis generalizes to POVMs (i.e. a set of operators E a|0 ≥ 0 such that a E a|0 = id A ) as well, though in that case we would choose A to be isomorphic to A 0 A where A 0 is a copy of A 0 , and take where M a|0 := E a|0 ⊗ I B .

II. BOUNDING THE ENTROPY PRODUCTION
Having reduced our task to bounding the entropy production, we introduce the following ansatz: given a quantum channel T and a self-adjoint operator L, we aim to find an operator K such that holds on all quantum states.Any valid solution to (17) will allow us to lower-bound the entropy production of T by estimating the expectations of K and L. The form of the ansatz (17) arises from our use of the Gibbs variational principle (see Lemma 2), though it can also be viewed as arising from the Lagrange dual of the optimization in 1.(In Sec.VI we give a more detailed analysis of the dual, from which we find that our approach yields bounds that are essentially tight up to the use of the generalised Golden-Thompson inequality and NPA hierarchy.Also, at the end of that section we show that our approach yields a convex bound, which is relevant for applying the entropy accumulation theorem [2][3][4].)For brevity, we will for now denote the constraints X ab|xy = Pr(ab|xy) in the form X j = γ j for j in an index set J .
For simplicity we will assume that all measurements act on finite dimensional systems.However, our bounds will be completely independent of this dimension, and hence can be considered device-independent since they do not need any explicit bound on the system dimensions.Additionally, we will assume that all measurements have a countable set of possible outcomes.Note that this assumption only puts a relatively mild restriction to applicability: firstly, we cover all projective measurements on finite-dimensional systems.By this we already cover a big part of the typical measurement settings considered in literature, e.g.spin measurements, discrete energy levels, and anything that happens in a quantum computer.In addition, one can always argue that continuous sets of outcomes are "only" an idealized object: in an experiment, continuous sets are typically coarse-grained by sorting outcomes into a finite set of bins.
Nevertheless, from our perspective, an extension of this work to continuous sets of outcomes and/or infinitedimensional systems seems possible.However, we will leave this for future work since it would require an extension of some of the theorems used, such as the extended Golden-Thompson inequality of [5], to integrals (in place of sums) and/or infinite-dimensional operators, which would go beyond the scope of this work.
Our main result, which we prove in Sec.III, provides a family of possible choices of K: Proposition 1.Let T be a quantum channel with adjoint T * .For any set of self-adjoint operators Lj , the operators L and K defined as with satisfy Eq. ( 17)1 for all quantum states ρ. (We use the notation There is a considerable amount of freedom in choosing the operators Lj .In particular, we can choose Lj = λ j X j for any coefficients λ j ∈ R.These coefficients would in general play the role of variational parameters that can be used to optimize the bound (17) in a particular situation (in fact they can be interpreted as Lagrange multipliers, see Sec.VI).With this choice of L j , we can directly write down the expectation of L as It therefore only remains to bound the expectation of K in order to evaluate the r.h.s of (17).
For the device-dependent case, Prop. 1 will allow us represent K as an explicit matrix whenever representations of the X j are known, since in this case the integration in (19) can be solved analytically (see Eqs. ( 25)-( 26)).Then bounding K ρ = tr(ρK) via the constraints imposed by the observed values is precisely an SDP: As a further feature, a solution of ( 22) also allows us to judge the quality of our bound: an SDP solver used for (22) should additionally provide an optimizer ρ * , which is a valid quantum state that satisfies the constraints X j ρ * = γ j .Hence it will give a feasible point for our original optimization (Eq.( 1)).We can then compute the value of the objective function at that feasible point, and compare it with our lower bound to quantify the quality of our bound for a given set of variational parameters λ j .
For the device-independent case, the measurement operators X j are not known, but we shall now describe how Prop. 1 can still be applied.As previously mentioned, we can assume that all X j are projectors (though not necessarily rank-1).It is also convenient to assume that these projectors can be relabelled as X k|l such that for each value of l, the operators X k|l form a resolution of the identity, i.e. k X k|l = I and X k|l X k |l = δ kk X k|l .(This would automatically be true if the constraints include the probabilities of all outcomes from every measurement; otherwise, it could always be achieved in principle by introducing an extra projector X j = I − X j for each X j .)Then for any scalars c k|l ∈ C, it holds that which directly yields the following corollary by choosing Ll = k λ k|l X k|l in Prop.1: Let T be a quantum channel with adjoint T * .Take any family of projectors X k|l such that for each l, we have k X k|l = I and X k|l X k |l = δ kk X k|l .Then for any coefficients λ k|l ∈ R, the operators L and K de- satisfy Eq. ( 17) for all quantum states ρ.
Theorem 1 in the main text can be obtained from this corollary by using the fact that the channel T in that scenario satisfies T * T = T , followed by replacing X k|l with P a|x ⊗ P b|y and λ k|l with j λ j c (j) abxy .To use Eq. ( 25) in practice, we need to simplify it further, by expanding the product and evaluating the integrals via the result 2  R dt e iαt β(t) = α csch α for α ∈ R. Supposing that l ∈ {1, 2, . . ., N } and recalling our notation |A| 2 = A * A, this gives The sum can be further simplified slightly by using X k1|1 X k 1 |1 = δ k1k 1 X k1|1 to eliminate the summation over k 1 .The indices {1, 2, . . ., N } can be permuted to yield other possible choices of K.
Note that even without grouping the operators X j into resolutions of the identity, a similar result still holds for coefficients λ j ∈ R by choosing Lj = λ j X j , in which case the operator product could be written as When the operators can be grouped into resolutions of 2 The expression α csch α should be interpreted as having the removable discontinuity at α = 0 "filled in", i.e. take α csch α = 1 at α = 0 (this correctly matches the evaluation of the integral for α = 0).To evaluate the integral for α = 0, we note the integrand e iαt β(t) can be written as (π/4)e iαt sech 2 (πt/2).This integral is easily shown to be absolutely convergent, hence we can evaluate it by finding its Cauchy principal value.Due to the sech 2 (πt/2) factor, the integrand is holomorphic on C except at where it has poles of order 2. The Cauchy principal value can then be evaluated by considering a rectangular contour with corners at ±R and ±R + 2i (which encloses one pole) and taking R → ∞.
the identity, this reduces to the expression in (25) as long as the operator product does not break up the groupings.We now note that as long as the Kraus operators of the channel T * T are polynomials in the measurement operators, Eq. ( 26) is such a polynomial as well.This is an important property, because the task of maximizing K ρ over all possible measurement operators is now a non-commutative polynomial optimisation similar to the type studied in [6].(For completeness, in Sec.VII we will briefly collect the essential steps and assumptions that are needed to bound this optimization via an SDP hierarchy, as described in that work.)

III. PROOF OF THE MAIN RESULT
To prove our main proposition (Prop.1) we will make use of the following lemmas, the proofs of which we provide for completeness: Taking logarithms on both sides of Eq. ( 31) and applying this inequality for p = 2, we find ln tr e j Cj = 2 ln e j Cj /2 where the last line follows from concavity of the logarithm.Since the logarithm is also monotone increasing, this implies the desired result.
We now prove our main result, Prop. 1.
Proof.Starting from (17) we want to find K such that holds on all states ρ.To do so, first we use Jensen's operator inequality to write since ln is concave.We combine this with Lemma 2 with the substitution C = ln (T * T [ρ]) + L: By Lemma 3, the last line is lower-bounded by yielding the desired result.

IV. NUMERICAL BOUNDS ON H(A0|E)
The channel T relevant for DIQKD (Eq.( 4)) indeed has the property that the Kraus operators of T are explicit polynomials in the measurement operators, allowing us to apply the above results 3 .We can hence use our method to bound H(A 0 |E) in various DI and sDI scenarios, which we now describe.In the following, all optimization of the coefficients λ was done with heuristic numerical methods, so it is possible that there exist choices of coefficients that yield better bounds but were not found by the heuristic methods.(In Sec.VIII, we provide more details on our numerical approaches, including some methods to simplify the task.)Supplementary Figure 1.Tilted CHSH inequalities: Lower bounds on H(A0|E) (in base 2) as a function of depolarizing noise applied to distributions (described in Eq. ( 7) of [7]) that maximize the tilted CHSH value, for α = 1.5 and α = 2 (see Eq. ( 38) below) on the left and right respectively.The second curve in each case is obtained by computing the CHSH value of the distributions and then applying the bound from [8].It can be seen that our method outperforms the previous bounds in most noise regimes.
First, as described in the main text, we considered depolarizing-noise and limited-detection-efficiency models for distributions that maximize the CHSH value.We performed calculations for those scenarios at level 6 of the NPA hierarchy (see Sec. VIII).Also, note that the limited-detection-efficiency model does not produce uniform distributions for A 0 , but we still wish to apply the inequality 1 H(A 0 |E) ≥ (2 ln 2)(1 − P g (A 0 |E)) [9], which was derived for uniform A 0 .This can be addressed by performing a symmetrization step [8,10], which produces a distribution with uniform A 0 .The value of H(A 0 |E) before symmetrization is lower-bounded by the value after symmetrization (following the argument in [10] with von Neumann entropy in place of min-entropy), so it suffices to bound P g (A 0 |E) for the latter and use it to bound H(A 0 |E).
As for the bounds on the two-party entropy H(A 0 B 0 |E), they were computed by choosing the channel T as described in Eq. ( 14).This entropy is the relevant quantity for describing the asymptotic key rate for randomness expansion [3], and a finite-size analysis against non-IID attacks can also be performed using entropy accumulation as described in that work.We note that these improved bounds on H(A 0 B 0 |E) could also slightly improve the key rates for DIQKD security proofs based on entropy accumulation, because part of the proof involves the two-party entropy rather than the one-party entropy.(Specifically, the key rate can be improved by an amount on the order of the test-round probability.) In addition, we also consider 2-input 2-output distributions that maximally violate tilted CHSH inequalities [7] of the form It was shown in [7] that as α → ∞, these distributions become arbitrarily close to the set of local distributions but still certify that H(A 0 |E) = 1.However, this result was only derived for distributions achieving maximal tilted CHSH value, i.e. without noise.When noise is applied, the only previous method to robustly bound H(A 0 |E) was the suboptimal P g (A 0 |E)-based approach.As shown in Supp.Fig. 1, we find that for α = 1.5 and α = 2, our method outperforms this approach in most noise regimes (and also the CHSH-based bound, since the CHSH violation for these distributions becomes arbitrarily small at large α).However, note that at any fixed value of q under this noise model, choosing measurements such that the distribution maximizes the CHSH value (instead of the tilted CHSH value) still currently yields better bounds on H(A 0 |E) than applying our approach to these distributions, at least for these values of α.Finally, we consider a 4-input 2-output distribution obtained by performing the Pauli measurements4 on the Werner state (1 − 2q)|Φ + Φ + | + (q/2)I.As the matrices in the NPA hierarchy are substantially larger for this scenario, we could not apply our method in its full generality.Instead, we only considered a special case where the coefficients λ are chosen such that , for two free parameters µ 0 , µ 1 to be optimized over (this can be viewed as yielding an entropic uncertainty relation in the form of lower bounds on This substantially reduces the order of the polynomial in Eq. ( 26), since the product only involves two pairs of (x, y) values, and hence NPA level 3 suffices to bound K ρ in this case. 5As shown in Supp.Fig. the optimum at that point being attainable by µ 1 = 1 and, apparently, arbitrary µ 0 ≥ 0 (negative values of µ 0 appear to yield worse bounds as |µ 0 | increases).However, the bound is not very robust as the noise level increasespresumably, we would need to include more measurement operators in L to obtain more robust bounds.We were also unable to further improve it by heuristically optimizing over all the λ ab|00 , λ ab|11 instead simply µ 0 , µ 1 .We remark that choosing L ρ = µ 1 H(A 1 |B 1 ) for the 2-input 2-output scenarios in the main text results in a suboptimal bound that only certifies H(A 0 |E) ≥ 0.398 in the absence of noise.We also used this choice of L ρ to investigate distributions heuristically optimized to maximize violation of the I4422 inequality [11], but were only able to find the same suboptimal bound even after optimizing over all λ ab|00 .This may possibly be related to the fact that for these distributions we have H(A 1 |B 1 ) = 0 even in the absence of noise.

V. ASSUMING PROJECTIVE MEASUREMENTS
When considering general scenarios involving multiple uncharacterized measurements, it may not always be valid to assume that they are all projective: for instance, counterexamples have been found for sequential measurements [12] and contextuality scenarios [13].However, we shall show in this section that for DIQKD as considered in this work, this assumption poses no issues.
We first construct a simultaneous Naimark dilation of all the POVMs, following fairly standard methods (see e.g.[14]).Suppose that Alice and Bob's devices are performing POVMs, i.e.Pr(ab|xy) = tr[(E a|x ⊗ E b|y )ρ AB ].
To construct the Naimark dilation, we embed an arbitrary state ρ AB in a higher-dimensional Hilbert space via the identification ρ AB ∼ = V ρ AB V * , where V is the isometry with Ā, B being Hilbert spaces of dimension equal to the number of measurement outcomes. 6We then define projectors for Alice via P a|x = U † x (|a a| ⊗ id A )U x , where U x are any unitary operators satisfying Defining P b|y analogously for Bob, it is easily verified that these operators satisfy where on the right-hand side we have implicitly embedded ρ AB in the larger Hilbert space.Therefore, one can dilate the POVMs to projective measurements while preserving the constraints.
We shall now argue that for the optimization problem relevant to DI or sDI cryptography (Eq.( 1)), this dilation also does not affect the objective function, so it is sufficient to optimize over projective measurements only.This simply follows from the fact that the objective function and constraints in Eq. ( 1) can be written entirely in terms of the classical-classical-quantum states (R xy ⊗ id E )[ρ ABE ], where R xy are the channels mapping Alice and Bob's quantum states to classical output registers A x , B y : (43) for all input pairs x, y.By Eq. ( 42), we see that these are exactly the same channels as would be realised by the projectors P a|x , P b|y (embedding ρ AB in the larger Hilbert space).Therefore, the objective function is indeed invariant under the dilation (the embedding does not affect Eve's system).
(We remark that a priori, one might instead wish to consider channels R xy from AB to A x B y A B , where A , B are quantum registers storing Alice and Bob's post-measurement states.The constraints and objective function would then be described via the channels so this produces the same results.)

VI. DUAL PROBLEM
We shall use the dual of the optimization problem (1) to argue that when applied to DIQKD, our approach yields bounds that are tight up to the use of the generalised Golden-Thompson inequality and NPA hierarchy, except possibly for certain edge cases that would not arise in practical applications.
For brevity, we will again denote the constraints X ab|xy ψ = Pr(ab|xy) in the form X j ψ = γ j for j in an index set J , as in Sec.II.Now let F ( γ) denote the optimal value of the optimization problem (1), viewed as a function of the constraint values γ.This is not a convex optimization problem, because the constraints are not convex in the standard sense.However, it has the crucial property that F is still a convex function, because we have a "pseudo-convex" behaviour arising from the fact that Eve can always perform classical mixtures of different strategies. 7This suggests that we can apply some methods from In this section, we will follow conventions in that field and take F to be a function R |J | → [−∞, +∞], with the infimum of an empty set being +∞.It is also conventional to define the "domain" of F to be the set In our case, dom(F ) is exactly equal to the set of γ that are achievable by the family of states and measurements we consider (this follows from the fact that our objective function is always finite, being the conditional entropy of a finite-dimensional classical register).Finally, for any subset C of an Euclidean space, let relint(C) denote its relative interior (basically, its interior relative to its affine hull; see e.g.[15] for a full definition).
We aim to compare F to the bound given by our approach before the generalized Golden-Thompson inequality is applied.Let S γ denote the set 8 of all ρ, X 7 Specifically: consider any γ 0 , γ 1 ∈ dom(F ), and any > 0. By definition of F , there exists a state (and measurements) achieving outcome distribution γ 0 and conditional entropy H(A 0 |E) = t 0 for some value t 0 ≤ F ( γ 0 ) + .Doing the same for γ 1 , we now note that if Eve has a classical register E in the state p|0 0| + (1 − p)|1 1| and uses it to determine which of those strategies to implement (including providing copies of E to the users' devices so they can implement the corresponding measurements), this yields a strategy that achieves outcome distribution p γ 0 + (1 − p) γ 1 and entropy Notice that this argument uses the fact that there is no bound on the dimension of Eve's side-information. 8Strictly speaking, this collection might not be a valid set, depending on the precise specification of the "allowed" states and measurements (for this work, we restrict this to be the union over d ∈ N of states ρ and measurements X on C d ⊗ C d , in which case S γ is a valid set, but if we allow arbitrary algebras or Hilbert spaces then some care may be needed).In any case, this can be such that X ρ = γ (in this section, ρ is to be understood as the state on AB only).We can then write our bound (see Eq. ( 36)) as where we used the fact that X ρ = γ for all states and measurements in S γ , and introduced the function using the fact that T * T = T in the DIQKD setting (Eq.( 4)).By construction, this is a lower bound: F ≥ F .We now prove two lemmas that together imply that the reverse inequality F ( γ) ≥ F ( γ) holds for all γ ∈ relint(dom(F )), and hence the bound is tight for all such points.This is sufficient to cover practical implementations, since any γ ∈ dom(F ) that is not in the relative interior corresponds to a relative boundary point of the set of distributions allowed by quantum theory, which would not arise in physical implementations with nontrivial noise.(Alternatively, one can restrict Eve's allowed strategies slightly such that F ( γ) ≥ F ( γ) holds on all of dom(F ); see Lemma 6 below.)These lemmas are based on the Lagrange dual of the optimization (1), i.e. the function where inf ψ, X refers to an infimum over the entire allowed class of states and measurements (here ψ denotes the joint state on ABE), not just those such that X ψ = γ.The Lagrange dual is always a lower bound on the original optimization.In these two lemmas, we shall first prove that the bound F is always at least a good a bound as the Lagrange dual G, then we shall prove that in fact G( γ) = F ( γ) (i.e.we have strong duality) for all γ ∈ relint(dom(F )), hence the bound F is also tight for all such points.
Lemma 4. Let F be as defined in Eq. (45), and let G be the Lagrange dual as defined in Eq. (47).Then we have F ≥ G.
addressed by noting that strictly speaking, an optimization of the form in e.g. ( 1) is to be understood as F ( γ) = inf T γ where T γ is a subset of R (i.e. a valid set) defined via an existence quantifier taken over the class of "allowed" states and measurements.For brevity, however, we continue referring to S γ as a set.Alternatively, one could analyze this in terms of the universal representation, as in Sec.VII.
Proof.We first note that for any fixed value of λ and operators X (of some finite dimension), we can rewrite the infimum over ψ in the Lagrange dual (47) by applying several results from [16] for device-dependent QKD.Specifically, they perform the following reduction (T again denotes the pinching channel in Eq. ( 4)): where all optimizations take place over the entire set of states on the relevant systems (ABE for ψ and AB for ρ, σ, with dimensions compatible with the fixed operators X), and in the last line σ is relabelled as ρ (since it is merely a "dummy variable" in the optimization).We can use this to rewrite the expression for G, filling in the optimizations over λ and X: Now note that because of the inequality ln x ≤ x/e and the fact that the supremum on the left is over a smaller domain.Referring back to Eq. ( 45), we hence see that F ≥ G.
Lemma 5. Let F correspond to the optimal value of the optimization (1) in the sense described above, and let G be the Lagrange dual as defined in Eq. (47).Then we have G( γ) = F ( γ) for all γ ∈ relint(dom(F )).
Proof.Abstractly, the F we consider here is a constrained optimization of the form where D is some optimization domain (which could in principle be a class rather than a set, though we will continue using set notation for it), and f and Γ j are functions (or class functions) D → R. Take any γ ∈ relint(dom(F )).We now use the fact that since F is convex and γ is in the relative interior of its domain, there exists a subgradient of F at γ (see e.g.[17] Prop.2.2.2).Explicitly, this means there exists λ such that Since the Lagrange dual G is a supremum over all possible λ, it is lower-bounded by setting λ = λ , which yields where the second line holds9 because for each x ∈ D in the optimization (58), we can set γ = Γ(x) to obtain a Hence we have proven that G( γ) ≥ F ( γ) for all γ ∈ relint(dom(F )).Also, the Lagrange dual G always satisfies F ≥ G, so we have the desired result.
Putting together these lemmas, we have F ( γ) ≥ F ( γ) for all γ ∈ relint(dom(F )), as desired.Note that while Lemma 4 holds everywhere, Lemma 5 only holds for γ ∈ relint(dom(F )) (though this is enough for practical applications).The main difficulty in generalizing the latter to all γ ∈ dom(F ) is finding a suitable subgradient λ in the proof (or at least a family of suitable λ, as we shall shortly see).One possible way to ensure this can be done is to slightly restrict the allowed strategies for Eve.For instance, suppose we take a large but fixed value d ∈ N, and restrict Eve to the following class of strategies: she prepares a classical ancilla E with any desired distribution, and then depending on the value of E , implements a "d-dimensional strategy" (i.e.conditioned on the value of E , the Alice-Bob-Eve state is in C d ⊗ C d ⊗ C d 2 , and the measurements are appropriate projectors on the corresponding spaces), and we take E to be part of Eve's side-information.In that case, Lemma 4 (with its optimization domain restricted to such strategies) still holds for all γ via the same proof as before, but now we can prove a variant of Lemma 5 that holds for all γ ∈ dom(F ): Lemma 6.Take any fixed d ∈ N. Let F correspond to the optimal value of the optimization (1), but with the optimization domain restricted to "mixtures of ddimensional strategies" in the sense described above.Let G be the corresponding Lagrange dual, i.e.Eq. (47) but with the domain similarly restricted.Then we have G( γ) = F ( γ) for all γ ∈ dom(F ).
Proof.The argument here essentially follows standard geometric arguments in convex optimization (see e.g.[15] Sec.5.1.6for some useful diagrams), but with modifications to account for the fact that our optimization problem is not quite convex.
We use the same notation as the proof of Lemma 5, though here the optimization domain D refers to "mixtures of d-dimensional strategies" as described above.Furthermore, let D denote the set of d-dimensional strategies without allowing "mixtures", i.e. the Alice-Bob-Eve state is genuinely a state on C d ⊗ C d ⊗ C d 2 alone, without allowing Eve to generate the classical ancilla E .We now define two subsets of R |J |+1 as follows: The set T has two critical properties.Firstly, it is a compact convex set, because it is straightforward to see that T is the convex hull of T , and the latter is a (finitedimensional) compact set since it is the image of the compact set D under a continuous function.Secondly, it is related to F by recalling that the infimum of an empty set is +∞.
Consider any γ ∈ dom(F ), and take an arbitrary > 0. By Eq. ( 63), we see that the point ( γ, F ( γ) − ) does not lie in T .Then since T is a compact (and thus closed) convex set, we can apply a strict separating hyperplane theorem (e.g.[15] Example 2.20) between T and that point, which implies that there exists some ( λ, v) ∈ R |J |+1 such that Furthermore, v must be strictly positive, by the following argument.Recall that in our case, γ ∈ dom(F ) implies that the feasible set in the optimization F ( γ) is nonempty, i.e. there exists some x ∈ D such that Γ(x ) = γ.Then we have ( γ, f (x )) ∈ T , and choosing this point in Eq. (64) yields the strict inequality vf (x ) > v(F ( γ) − ). Hence we cannot have v = 0 (which would imply 0 > 0) or v < 0 (which would imply f (x ) < F ( γ) − , contradicting the definition of F ( γ)).
Since v is strictly positive, we can divide (64) throughout by v and define λ = − λ/v to obtain Considering this value λ in the supremum in the Lagrange dual G, we obtain where the second line holds because each x ∈ D in the optimization (66) yields a ( γ , t) ∈ T with the same feasible value in the optimization (67), and vice versa.Since > 0 was arbitrary, this implies G( γ) ≥ F ( γ), yielding the desired result.
(This proof relies on the fact that we have defined D in such a way that T is a closed convex set, which yields strict inequality in (64) -without this, it may not be possible to ensure that v = 0.The question of whether T is closed when considering more general classes of allowed strategies appears to be fairly subtle -in particular, it likely depends on whether the set of quantum correlations is closed for the Bell scenario being considered, which is in general a complicated open question and outside the scope of this work.) We remark that in this work we have described how to bound F , but the same methods could have been applied to bound the Lagrange dual G in the form (54), since they are essentially describing ways to bound the function g.However, by recalling the inequality ln x ≤ x/e, we can see from the expressions ( 45) and ( 54) that when the same methods are applied in both cases to bound g, the bound resulting from F is always at least as good as the one resulting from G, 10 which is perhaps slightly surprising since we have a tightness argument for G (Lemma 5), whereas the inequalities involved in deriving F do not appear tight a priori.
Lastly, we remark that our final bound (Eq.( 17) with K ρ bounded in terms of γ using the SDP method as described) is convex with respect to γ.This follows from the fact that the optimal value of a maximization SDP is a concave function of the constraint values γ, and − ln is a convex nonincreasing function, which implies their composition is a convex function of γ.Hence our final bound is also a convex function of γ (the L ρ term is linear in γ and hence trivially convex).This property is important in some applications, such as the entropy accumulation theorem [2][3][4].If one needs to impose the stronger requirement that the bound is affine, then note that if we pick any particular λ in Eq. ( 45) rather than taking the supremum, and relax the inner optimization to sup ρ, X , we get an affine lower bound with gradient λ.In addition, this λ essentially describes a single Bell expression that certifies the same amount of entropy as can be certified by all the original constraints using this choice of λ (in the relaxed optimization).However, this and sup ρ, X .Essentially, the optimisation over λ implicitly parametrizes a family of affine upper bounds on ln g ρ, λ • X , and the envelope of these affine bounds returns precisely the ln function.
approach (relaxing the inner optimization) comes with the drawback that it can potentially make the final bound worse.

VII. NON-COMMUTATIVE POLYNOMIAL OPTIMIZATION ON C * -ALGEBRAS AND SDP HIERARCHIES
When studying device-independent scenarios, there can potentially be some variability in specifying the exact class of states and measurements to consider -for instance, whether one restricts the analysis to finitedimensional Hilbert spaces with a tensor-product structure (as in this work), or whether one allows arbitrary Hilbert spaces and merely requires some measurements to commute.In the following, we outline a framework that encapsulates a rather general range of possibilities, within which we can restrict ourselves to narrower classes as necessary.(See footnote 8 for a slightly different perspective.) To begin with, we shall assume there exists enough structure in our problem description to have a (unital) C*-algebra A that abstractly describes all observables X j we have to take into account.Whenever a particular such algebra A is provided, the set of all states S(A) is a well defined object given by the set of all positive normalized linear functionals [18].Using the Gelfand-Naimark-Segal construction we can then assign representations π ρ : A → B(H) to states which then allow us to speak about the X j as an explicit set of bounded operators on a specified Hilbert space, for which a functional calculus exists such that the computations done in the last section would hold.
Building on this, one can additionally consider the universal representation U of A, which is simply the direct sum over all representations π ρ , i.e.
We will hence interpret device-independent formulations of the optimization problem (1) to be an optimization over this underlying object, where we extended the functional calculus from the π ρ to U. Furthermore we have the identification To restrict to narrower classes of possibilities, one would simply need to specify that only certain blocks in the direct sum should be considered (for instance, only finitedimensional representations).
We now turn to a description of non-commutative polynomial optimizations in general.Whenever we fix a list X = {X 1 , . . ., X n }, in the following context also referred to as an alphabet X of letters X j , we can define the set of all possible concatenations of letters, i.e.
We refer to this as the set of words, and index it with some index-set I.A non-commutative polynomial P (X) with coefficients c ij ∈ C is then given as a finite linear combination of words w i , w j ∈ W, i.e.
for some set Ω ⊆ I. (We use two indices i, j to index the terms of the polynomial because it will subsequently be convenient to consider a matrix of coefficients c ij .)We will call P hermitian if c ij = c ji holds for all ij.
We are now in a position to interpret an expression such as the expansion of K in (26) as a non-commutative polynomial P (X) arising from some computation done in the representation U which is then evaluate on a particular state ρ via (70).Additionally we can also collect the constraints X j ρ = γ j and some other algebraic constraints, which are not trivially covered as identity in A, by a list of further constraints Q k (X) ρ ≥ α k that are expressed as non-commutative polynomials Q k .(In a generic non-commutative polynomial optimization, it may not be clear a priori whether an imposed constraint, for example X 1 +X 2 = I, should be fulfilled on the level of an operator identity or only for a particular state, i.e. as X 1 ρ + X 2 ρ = 1, which is weaker.However, in the context of quantum theory this is determined by the chosen postulates.) At the end we arrive at a point where we can formulate the task of finding a device-independent bound on the expectation K ρ from ( 17) as an optimization problem inf P (X) ρ s.th.: on objects defined in the aforementioned sense.By introducing the matrix Γ ρ Ω with entries which is also called the moment matrix of X or the representing GNS matrix of ρ on W| Ω , we can rewrite Let C Ω be the set of indices k such that the polynomial Q k (X) only involves terms from the set Ω.For any k ∈ C Ω , we can analogously write for some matrix C Q k Ω .Then a lower bound on (73) (differing only in that some constraints are omitted) is which is an optimization of linear matrix functionals over the set of all valid matrices Γ ρ Ω .It is well known that a matrix like Γ ρ Ω is a positive matrix whenever ρ is a positive functional.We therefore have that Ξ Ω is a subset of P + n the positive n × n matrices with n = |Ω|.This observation [6] directly suggests to relax the optimization problem (77) by extending the optimization from Ξ Ω to P + n , which will then result in the semidefinite program Note that in constructing this relaxed optimization, we have only required that the domain Ω is large enough to represent P , but not necessarily all algebraic constraints Q k in the original optimization (73).By expanding the set Ω, we could include more and more constraints Q k expressing the structure of A. For our original formulation (73), this does not make any difference since all such constraints are included.For the relaxation (79), however, this will have an impact.We therefore get a hierarchy of semidefinite programs by subsequently replacing Ω in (79) with a sequence of sets Ω ⊂ Ω ⊂ Ω • • • ⊂ I whilst incorporating more and more structural constraints from A by adding new Q k .
In an explicit application of this technique, the optimal choice of those new constraints and a suitable sequence of sets is highly case-dependent and influenced by the computational resources that could be spent.

VIII. DETAILS FOR NUMERICAL WORK
By level n of the NPA hierarchy we always mean the "global level", i.e. the rows/columns of the NPA matrix are indexed by all the operators consisting of products of n or fewer projectors.This is in contrast to "local level" n, where the index operators are all the operators consisting of products of n or fewer projectors per party.Essentially, local level n is equal to global level 2n with some index operators removed (namely, all those which have more than n projectors by a single party).The local level concept can be slightly extended by having different local levels for Alice and Bob.
In our approach, high levels of the NPA hierarchy are required in order to have all the terms in Eq. ( 26) appearing in the NPA matrix (unless we simplify by restricting some of the coefficients λ ab|xy to be zero).To cope with this, we used the simplification that one projector per measurement can be omitted, by using the relation c P c|z = I to write one of the projectors P c|z in terms of the others.
For scenarios with 2 inputs per party, global level 6 is sufficient to capture all the terms in Eq. ( 26) (for some choices of the operator product order, at least), with the NPA matrix having size 85 × 85 when the above simplification is implemented.In principle, local level 5 for Alice with local level 3 for Bob is also sufficient (while producing a slightly smaller NPA matrix of size 77 × 77), since the terms in Eq. ( 26) have basically equal numbers of projectors from each party, apart from the asymmetry introduced by the channel T .However, the resulting bounds we obtained were slightly but noticeably worse than those for global level 6, to the extent that the results for H(A 0 |E) no longer outperform the bound in [8], at least as far as we were able to optimize the coefficients λ ab|xy .
We also note that some algebraic constraints on the projectors can be used to reduce the number of coefficients λ ab|xy to optimize over.For instance, we can use the normalization conditions ab P a|x ⊗ P b|y = I to eliminate one coefficient per input pair (x, y), as follows.Consider any set of values for λ ab|xy .Pick some specific input pair (x, ỹ) and a real number α ∈ R, and define a new set of coefficients as follows: λ ab|xy = λ ab|xy − α for (x, y) = (x, ỹ), λ ab|xy otherwise.
We now show that the coefficients λ ab|xy yield the same bound as the coefficients λ ab|xy .Denote the operators constructed from the coefficients λ ab|xy (in the sense of Corollary 1) as L and K .First, it is easy to see that L ρ = L ρ − α due to the normalization conditions.Similarly, we also have which implies that K ρ = e −α K ρ and thus L ρ − ln K ρ = L ρ − ln K ρ , as claimed.Notice that if we choose α to be equal to one of the λ ab|xỹ values, this procedure yields a new set of coefficients λ ab|xy that gives the same bound, but with the corresponding λ ab|xỹ value equal to zero.In summary, this implies that for each input pair (x, y), we can set one coefficient λ ab|xy to be zero without loss of generality.It might appear that something similar could be done with the no-signalling conditions; namely, if we pick some specific ã, x, ỹ, ỹ , they must satisfy b P ã|x ⊗ P b|ỹ = b P ã|x ⊗ P b|ỹ , since b P b|y = I for any y.(Analogous conditions hold for summations over Alice's outputs a.The statements we derive below also cover those no-signalling conditions, as well as the no-signalling conditions in "marginal form", b P ã|x ⊗ P b|ỹ = P ã|x ⊗ I.) However, this approach runs into a subtle difficulty.To see this, consider how one could try to proceed in a manner similar to the use of the normalization conditions: take any set of values for λ ab|xy , pick some α ∈ R, and define λ ab|xy =      λ ab|xy − α for (a, x, y) = (ã, x, ỹ), λ ab|xy + α for (a, x, y) = (ã, x, ỹ ), λ ab|xy otherwise. (82) Then by the no-signalling conditions, we have L ρ = L ρ .As for the operator product xy in K , the term with index (x, ỹ) has the form where the product of the two exponentials in the last line could also be taken in the reverse order (because the operators P a|x ⊗ P b|ỹ for different a, b all commute).Similarly, the term with index (x, ỹ ) has the form where the product of the two exponentials in the last line could also be taken in the reverse order.All the other terms in the product remain the same as they were with the coefficients λ ab|xy .We now observe that if the operator product is ordered such that the terms (x, ỹ) and (x, ỹ ) are next to each other, we would have K = K because the no-signalling conditions imply that the terms involving α cancel each other out.Hence in such a case, the entropy bound would be invariant under the substitution λ ab|xy → λ ab|xy .However, if the operator product is not ordered such that those terms are next to each other, then the α-dependent terms do not cancel, and it seems possible to have K ρ = K ρ as a result.In summary, whether a substitution based on the no-signalling conditions leaves the bound invariant depends on the ordering of the operator product when constructing K.
When implementing our approach in a context where the constraints are linear combinations of the probabilities instead of simply the form in (1), the above discussion implies that although we can rewrite or eliminate constraints using the normalization conditions while leaving our final bounds invariant, this may not necessarily be true when using the no-signalling conditions.In particular, this means it is possible that the bounds given by our approach might not be invariant if we replace the constraints in terms of probabilities with constraints in terms of the correlators A x ρ , B y ρ , A x B y ρ , e.g. as was done in [19] (which uses both the normalization and no-signalling conditions to perform this replacement).As tentative support for this possibility, we were unable to reproduce the results in the main text when only imposing constraints in terms of the correlators (though it is possible that this might have simply resulted from the heuristic search not finding the optimal bound).

FIG. 3 . 2 -
FIG.3.2-input 2-output DI protocols, H(A0|E) and r∞ (in base 2): Lower bounds on entropy H(A0|E) and keyrate r∞, as a function of depolarizing noise (for the scenario studied in[1]) or detection efficiency (for the scenario studied in[42], which optimizes the state and measurements to achieve maximal CHSH value).For the latter, r∞ was computed by optimizing the key-generating measurement B0 alone to minimize the value H(A0|B0), without changing the state and other measurements from those in[42].Also, to yield higher keyrates, the key-generating measurement B0 was preserved as a 3-outcome measurement (following[43]) rather than postprocessing it to 2 outcomes.It can be seen from the graph that our bounds are either close to or slightly better than the best previous result[1]  for these scenarios, which was based on the CHSH value alone.For comparison, we also show the indirect bounds obtained by using the inequality H(A0|E) ≥ 2(1 − Pg(A0|E)) (in base 2).