(Nearly) optimal P-values for all Bell inequalities

A key objective in conducting a Bell test is to quantify the statistical evidence against a local-hidden variable model (LHVM) given that we can collect only a finite number of trials in any experiment. The notion of statistical evidence is thereby formulated in the framework of hypothesis testing, where the null hypothesis is that the experiment can be described by an LHVM. The statistical confidence with which the null hypothesis of an LHVM is rejected is quantified by the so-called P-value, where a smaller P-value implies higher confidence. Establishing good statistical evidence is especially challenging if the number of trials is small, or the Bell violation very low. Here, we derive the optimal P-value for a large class of Bell inequalities. What's more, we obtain very sharp upper bounds on the P-value for all Bell inequalities. These values are easily computed from experimental data, and are valid even if we allow arbitrary memory in the devices. Our analysis is able to deal with imperfect random number generators, and event-ready schemes, even if such a scheme can create different kinds of entangled states. Finally, we review requirements for sound data collection, and a method for combining P-values of independent Bell experiments.


I. INTRODUCTION
Local hidden variable models (LHVM) predict concrete limitations on the statistics that can be observed in a Bell experiment [1]. These are typically phrased in terms of probabilities or expectation values. However, in any experiment we can only observe a finite number of trials, and not probabilities. We thus need to quantify the statistical evidence against an LHVM given a finite number of trials.
The traditional way to analyze statistics in Bell experiments is to compute the number of standard deviations that separate the observed data from the best LHVM. However, it is now known that this method has flaws [2][3][4][5] (see [4] for a detailed discussion). In particular, we would have to assume Gaussian statistics and independence between subsequent attempts, allowing for the memory loophole [2,3]. Fortunately, it is possible to rigorously analyze the statistical confidence even when allowing for memory as was first done by Gill [6]. This is the approach that we follow here.
Instead of bounding the standard deviation, the intuitive idea behind the rigorous analysis is to bound the probability of observing the experimental data if nature was indeed governed by an LHVM. In the language of hypothesis testing, this is known as the P -value, where the null hypothesis is that the experiment can be modelled as an LHVM (see e.g. [7]). A small P -value can be interpreted as strong evidence against the null hypothesis. Hence, in the case of a Bell experiment, a small P -value can be regarded as strong evidence against the hypothesis that the experiment was governed by an arbitrary LHVM.
There is an extensive literature regarding methods for evaluating the P -value in Bell experiments [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16] and discussions regarding the analysis of concrete experiments and loopholes [17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32]. Previous approaches to obtain such P -values known from the literature can be roughly divided into two categories. In the first approach, we select a suitable Bell inequality based on the expected experimental statistics or test data collected ahead of time. After a Bell inequality is fixed, one can model the process as a (super-)martingale to which standard concentration inequalities [2, 6, 9-11, 13, 15] can be applied. While this allows one to obtain bounds for all Bell inequalities relatively easily, the resulting upper bounds on the P -values are generally very loose. Crucially, this means that a much larger amount of trials would need to be collected than is actually necessary to obtain good statistical confidence. Figure 1 illustrates the significance of using bounds employed in previous works compared to the bound used here. When making a statement about all Bell [36]: The three curves show bounds on the P -value for a fixed number of trials n = 245 and random number generators bias τ = 1.08 · 10 −5 [37,38]. The Pvalue is computed as a function of the violation S which is defined as: S = 8(c/n − 1/2), where c is the number of wins in the CHSH game. From top to bottom, the curves show the bound on the P -value computed with Azuma-Hoeffding used in [11], McDiarmid's inequality [39] given in [14] and the upper bound from (22) (with βwin = 3/4 + τ − τ 2 as shown in Lemma 1 in the Appendix). In the Delft experiment [36] a number c = 196 of wins were observed, giving S = 2.4 and (22) yields P -value≈ 0.039. The dots indicate the Pvalues predicted by the other bounds. To obtain the same P -value with McDiarmid's inequality and Azuma-Hoeffding [11] the required violations would be S = 2.54 and S = 2.98 (beyond QM) respectively.
we will also take a martingale approach using however the much sharper concentration offered by the Bentkus' inequality [33]. For some simple inequalities like Clauser-Horne-Shimony-Holt (CHSH) [34] and Clauser-Horne (CH) [35], tight bounds on the P -value have been obtained when the measurement settings in the experiment are chosen uniformly, and no event-ready scheme is employed [3,5,16]. Such a bound was first informally derived in [3], and later rigorously developed by Bierhorst [5] whose approach for CHSH closely inspires our analysis of Bell inequalities that correspond to win/lose games below.
The second approach that has been pursued is to combine the search for a good Bell inequality with a numerical method adapting to the data [4,12,14]. This method is asymptotically optimal in the limit of many experimental trials. While conceptually beautiful, this numerical method can need a rather significant amount of trials to out-perform even the somewhat loose bounds given by standard martingale concentration inequalities, and can hence only be used in regimes where the amount of trials collected in the experiment is indeed large.

II. MATERIALS AND METHODS
Here we present a method for analyzing the Pvalue for Bell experiments that is optimal for large classes of Bell inequalities. This method also applies to event-ready schemes as used in [36], and can also deal with more complicated forms of event-ready procedures (heralding) in which different states are created in each trial (see Figure 5). In particular, situations in which we apply a different Bell inequality at each trial depending on which state is generated. Furthermore, we show how to bound the P -value of all Bell experiments using Bentkus' inequality which is optimal up to a small constant.
Before we can state the concept of a P -value more precisely, let us briefly recall the concept of a Bell inequality (see e.g. [40] for an in-depth introduction). For simplicity, we thereby restrict ourselves to Bell inequalities involving two sites (Alice and Bob), but all our arguments hold analogously for an arbitrary number of sites. As illustrated in Figure 2, in a Bell experiment we choose inputs x and y to Alice and Bob, and can record outputs a and b [41]. If the experiment was governed by a LHVM, then we could write the probabilities of obtaining outputs a and b given inputs x and y as p(a, b|x, y) = dµ(h)p(a|x, h)p(b|y, h) , (2) where dµ is an arbitrary measure over hiddenvariables h, that also include any prior history of the experiment. The locality of the model is captured by the fact that p(a, b|x, y, h) = p(a|x, h)p(b|y, h) if Alice and Bob are indeed space-like separated. Throughout, we refer to the supplemental material for a formally precise notation, definitions and derivation. A Bell inequality then states that for any LHVM for some numbers s xy ab . Evidently, in an experiment we never have access to actual probabilities p(a, b|x, y). Nevertheless, Bell inequalities turn out to be very useful to establish bounds on the P -value above.
A B x y a b FIG. 2. A Bell test involving two space-like separated sites, labelled Alice and Bob. Alice and Bob receive two randomly chosen inputs x and y, and produce outputs a and b. We indicate that Alice and Bob are space-like separated via the dotted line. When testing the CHSH inequality, for example, the inputs and outputs can be taken to be single bits x, y, a, b ∈ {0, 1}. Viewing CHSH as a non-local game, the winning condition is that x · y = a ⊕ b (we use the shorthand a ⊕ b to denote a + b mod 2). This means that in one trial of the experiment, we check whether x · y = a ⊕ b and if yes we increment the number c of wins by 1. For all Bell inequalities that are win/lose games (see Section III A), we analogously count the number of wins. General Bell inequalities (see Section III B) can also be cast as a game in which we do not just decide on whether Alice and Bob win or lose, but instead assign a score to each correct answer. In the experiment, we then compute the total score from the inputs and outputs observed. Our analysis is analogous for Bell inequalities involving more than two sites.
Let us now rephrase this inequality in a way that will make our approach more intuitive later on. In an experiment we choose settings with some probability p(x, y), hence, it will be convenient to define s ab|xy = s xy ab /p(x, y) .
For the moment, let us assume we have perfect random number generators, and that we choose the settings x and y uniformly such that p(x, y) = p(x)p(y) where p(x) = 1/N x and p(y) = 1/N y . The Bell inequality then reads The reason why this notation is convenient is because we can now think of s ab|xy as a score that Alice and Bob obtain when giving answers a and b for questions x and y. We thus adopt a modern formulation of Bell inequalities in terms of games [40]. The statement that an LHVM governs the experiment then means that Alice and Bob can only use a local-hidden variable strategy to achieve a high score in the game. Using this formulation it is clear that the term in (5) is just the average score that Alice and Bob can hope to achieve in the next trial. Since the Bell inequality holds for any local-hidden variables, including the history, it is clear that playing the game n times in succession, i.e., performing n trials of the experiment corresponds to a classic example of martingale sequence (see supplemental material).
To analyze the experimental data we then proceed as follows: In trial j, we compute the score s aj bj |xj yj that Alice and Bob obtain for the inputs x and y and outputs a and b we observed in that trial. By adding all these numbers we compute the total score c = n j=1 s aj bj |xj yj after performing n trials. The P -value then corresponds to That is, the probability that Alice and Bob would obtain a score C that is at least as large C ≥ c as the score c actually observed in our experiment.
Note that the choice for the score function is not unique. The only restriction, in order to define a P−value, is that the score needs to be a valid test statistic. A test statistic is a function that assigns a real value to each possible experimental outcome. Then, the P−value is the probability, under the null hypothesis, that the value of the test statistic is equal or larger to the value obtained from the observed data. There are many possible score functions that verify this restriction, though we would argue that the one used here is particularly natural.

A. P -values for win/lose games
We first obtain optimal P -values for a certain class of Bell inequalities, also known as non-local games. In particular, this includes the Bell inequalities phrased in terms of correlation functions such as the famous CHSH inequality [34]. What sets these inequalities apart is that the scores s ab|xy can take on only two values, which we associate with winning or losing the game.

Winning probability
To illustrate, how Bell inequalities correspond to games, let us consider the CHSH correlation function where A x and B y correspond to the observables measured by Alice and Bob respectively (see Figure 2). Note that we can write one of the correlators as In terms of the score function, this means that s a,b|x,y = 1 if a = b and s a,b|x,y = −1 if a = b.
Note that in any game in which s a,b|x,y can only take on these two values we can think of the probability that Alice and Bob win for a particular choice of measurement settings x and y as p(lose|x, y) = a,b s a,b|x,y =−1 p(a, b|x, y) Any Bell inequality for which s a,b|x,y ∈ {±1} [42] can thus be written as x,y p(x, y) (p(win|x, y) − p(lose|x, y)) = (12) x,y p(x, y)2p(win|x, y) − 1 .
To draw full analogy with the usual representation of non-local games (see e.g. [40]) let us normalize the scores to be 0 and 1 instead by definingŝ a,b|x,y = s a,b|x,y /2 + 1. We then have which is precisely the probability that Alice and Bob win the non-local game [40]. In this language, a Bell inequality now takes on the form p(win) ≤ β win (15) where β win denotes the optimal winning probability that can be achieved using an LHVM. Note that if necessary, β win can be obtained by normalizing the given values β min , β max appropriately.

Analyzing data
The following steps need to be taken to obtain a P -value for an experiment based on a non-local game, where for simplicity we first consider schemes that are not event-ready. We refer to the supplemental material for formal definitions and derivation.
First, we determine a bound on the bias of the random number generator. We will never be able to generate settings x and y exactly according to the specific distributions p(x) and p(y), instead we will generate the settings according to some other distributionsp(x) andp(y). We are interested in the numbers τ A and τ B such that It is clear that for any physical device, these are estimates ideally supported by a theoretical device model with clearly specified assumptions. Second, we need to obtain a bound on the winning probability using such imperfect random number generators (RNGs).
that is valid for all LHVM, where we condition on the history of the experiment. Such a bound can be obtained analytically for many inequalities, including CHSH (see supplemental material). In general, a bound onβ win can be computed numerically using a linear program (LP), when re-normalizing the score functionsŝ a,b|x,y ∈ {0, 1} as above. We remark that that this LP has size that is exponential in the number of inputs and outputs, but can nevertheless be solved numerically when these are small enough which is typically the case in all experimental Bell tests. It is known that it is NP-hard to compute the winning probability for arbitrary non-local games [43].
Third, in each of the n experimental trials, we generate inputs x and y and record outputs a and b. In the end, we count the number c of trials in which Alice and Bob won the game, i.e., the number of times s ab|xy = 1.
Finally, we compute the P -value. The interpretation of the P -value is the probability that Alice and Bob win at least c times, maximized over any LHV strategy. As we prove in the supplemental material, for all LHVMs including arbitrary memory effects, This bound is a generalization of [3] and [5] that already had given a binomial upper bound for one particular win/lose game, the CHSH game, when the RNGs are perfect, and no event ready-scheme is used. We emphasize that this bound is tight, whenever (18) is tight. That is, there exists a LHVM that produces at least c wins with this probability, and this LHVM does not use any memory. While a theoretical analysis is of course necessary to prove (20) it follows that the memory loophole [3] can only be exploited for general Bell inequalities, where it indeed turns out to be significant. Figures 3 and 4 illustrate this bound for the CHSH and Mermin's inequality [44].

Event-ready schemes
To illustrate the analysis of event-ready schemes, let us here focus on the usual case where the tag (see Figure 5) can be either t = 0 (null game, no entanglement was made) or t = 1 (one game, one specific S PYDOXH n =150 n =200 n =250 FIG. 4. P -values for the Mermin's inequality [44] with perfect random number generators. Mermin's inequality is a tripartite inequality in which each party has two inputs and two possible outputs. It is an example of a non-bipartite inequality that has already been violated in the laboratory [45][46][47]. The three parties Alice, Bob and Charlie receive three random chosen inputs x, y and z with the promise that the parity of the inputs is even, that is that the inputs are limited to (0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 0, 0), and produce outputs a, b and c which can also be taken to be bits. That is: x, y, z, a, b, c ∈ {0, 1}. The winning condition for Mermin's inequality is that a ⊕ b ⊕ c = x ∨ y ∨ z. That is the game is won if the xor of the outputs equals 0 when (x, y, z) = (0, 0, 0) and if the xor of the outputs equals 1 in the remaining cases. Hence we get: s abc|xyz = a ⊕ b ⊕ c ⊕ 1 when (x, y, z) = (0, 0, 0) and s abc|xyz = a ⊕ b ⊕ c when (x, y, z) = (0, 0, 0). The winning probability for this game is p(win) = 3/4 [48], but note that in contrast with CHSH if Alice, Bob and Charlie share entanglement they can win with probability one. The curves show the P -value as a function of S = 8(c/n − 1/2) for fixed number of trials n (c is the number of wins). The three curves show from top to bottom the P -value for n = 150, n = 200 and n = 250. The P -values are computed with the binomial upper bound (20). state was made). We will use the term attempt to refer to an attempt to create entanglement (outcome t = 0 or t = 1) and reserve the word trial for those in which t = 1. In the supplemental material, we will discuss more complex versions of event-ready schemes in which different entangled states can be created, and we employ a different game for each state.
While it is important that the random numbers are chosen independently of the tag t, we otherwise allow the LHVM arbitrary control over the statistics of heralding station. In particular, this means that the LHVM may use more (or less) attempts to realize x y FIG. 5. A Bell test using an event-ready scheme as proposed by Bell [1,49]. In an event-ready scheme, there is an additional site which we call the "heralding station" that is space-like separated from Alice and Bob at the time they receive their inputs (see Figure 2). This heralding station can be under full control of the localhidden variable model. It takes no input, but produces a tag t as output. In the simplest case, t is just a single bit where t = 1 corresponds to 'yes' and t = 0 to 'no'. If yes, then we check the winning condition for Alice and Bob as in Figure 2. If no, then no record is made (i.e., the null game is played). In physical implementations such as [36] this tag indicates whether an attempt to produce entanglement was successful. More complicated scenarios are possible, in which the tag t takes on more than two-values. Depending on t, a particular game is played, i.e., scores are computed according as dictated by the game labelled by t. In physical implementations this is interesting when two different entangled states can be created in the event-ready scheme, and each state is best for a particular game. An example is given by CHSH, where different Bell states are created and we play two different CHSH games with x·y = a⊕b or x·y = a⊕b⊕1.
Using both states can improve the time scales at which statistical confidence can be obtained.
c wins on n trials than we actually observed during the experiment. Specifically, where t m = t 1 , . . . , t m , |t m | denotes the number of ones in t m and the maximization over LHVM includes a maximization over an arbitrarily large number of attempts m and heralding statistics. As we will formally show in the supplemental material, That is, we can formally ignore the non successful attempts. The P -value only depends on the trials.

B. General games
Let us now move on to considering general games, that is, games in which the score functions s ab|xy take on more than two possible values. As before, we first need to consider the bias. Our bound will depend on the values of Recall that since s ab|xy = s xy ab /p(x, y) the distribution p(x, y) and hence also the bias influence s max and s min . Second, we again compute the total score where x j , y j , b j and a j are the inputs and outputs used during trial j respectively. We then have that where C is the random variable corresponding to obtaining a particular score using the LHVM strategy. Using the Bentkus' inequality, we prove in the supplemental material that Whenever the Bell inequality is normalized such that s min = 0 and s max = 1 this becomes where c and c stand respectively for the greatest integer smaller than c and the smallest integer larger than c.
If we treat a win/lose game as a general game we can also upper bound the P−value by (30). However, if we compare this formula with (20), we see that we have lost a factor of e. We have obtained a simple formula that can address general games but it is not tight. It remains unknown whether or not e is the optimal prefactor, but it is known that for general games it cannot be smaller than 2 [33].
In some cases it is possible to transform a general game into a win/lose game by postselecting the trials that take the maximum and minimum value [2,16]. In that situation, it would be possible to apply the tight bounds for win/lose games. Techniques sometimes referred to as "speeding up time" [2,28] can analogously be used in conjunction with this refined bound.
The idea behind this bound is to model an experiment as a bounded difference supermartingale, where we note that a Bell inequality is nothing else than the expectation of the score random variable C j in trial j conditioned on the history leading up to that trial. That is, where the expectation is taken over all inputs x, y and outputs a and b. A (super)martingale is a concept known from probability theory (see supplementary material for details). A sequence M 1 , M 2 , . . . of random variables is known as a supermartingale, if the expectation value of the difference M n − M n−1 conditioned on the history is always negative. Choosing M j to be a weighted sum of the differences n j=1 C j − β max one can easily obtain such a Martingale. The key aspect of a Martingale is that even though the subsequent variables are not independent from each other, we nevertheless observe a concentration akin to the law of large numbers for processes which are independent from each other. The prime example is tossing a coin n times. Indeed, thinking of "heads" as "win" and "tails" as "loose", we can easily evaluate the probability that we get "win" more than k times. When a process is a Martingale a similar argument holds, even if the coin can take many values and depend on the history.  [50] with perfect random number generators. CGLMP is a sequence of bipartite inequalities in which each party has two inputs and d ≥ 2 possible outputs. This is an example of a general game within experimental reach [51].

IV. DISCUSSION
A. Before conducting the experiment To ensure sound data collection, there are several important considerations to make before the experiment takes place. These are standard in statistical testing, and in essence say that the rules on how the statistical analysis is performed is decided independent of the data. This can be achieved by establishing those rules before the data collection starts. First, we choose a Bell inequality. Not all Bell inequalities lead to the same statistical confidence. In Section IV C we discuss methods for obtaining a good one. While there may be future analyses that allow a partial optimization over Bell inequalities using the actual experimental data, we emphasize that the procedure above assumes a fixed inequality has been chosen ahead of time. Second, there are two ways to deal with imperfect random number generators, and a choice should be made as discussed in Section IV B. Third, we assume that the number of trials to be collected is independent of the data. This means that we do not decide to take another few trials if the P -value is not yet low enough for our liking, a practise also known as P -value fishing in statistics. There are ways to augment the analysis [52] to safely collect more data in some specific instances, but this brings many subtleties. A number of trials n can be determined from the expected violation given prior device characterization, aiming for a particular P -value.

B. How to deal with imperfect RNGs
From the discussions above, it becomes clear that there are two ways to deal with imperfect RNGs. The first is of interest in win/lose games. If there is a bias τ , then the winning probability (14) simply increases. This means that when we perform an experiment based on a win/lose game in which we use an imperfect RNG, then the game remains win/lose and the bound in (20) still applies. Since this is a simple Binomial distribution, without any additional factor e this is desirable if the bias is small.
However, we saw from the analysis of general games that there is a second way. When considering a general Bell inequality (3), we make no statements about the probabilities of choosing settings p(x, y) = p(x)p(y). Starting from a scoring function s xy ab we can define s ab|xy = s xy ab /p(x, y) to introduce an explicit dependence on the input distribution p(x, y) of our choosing. Using RNGs with a bias then merely affects p(x, y) and thus the maxima and minima of the scoring functions s ab|xy which enter into the bound given in (27). It is crucial to note that when defining s ab|xy as above, then a win/lose game in which p(x, y) was chosen to be perfectly uniform, can now turn into a general game. That is, we will no longer have that the scoring functions s ab|xy take on only two values. This means that we have to use the general bound (27) carrying the additional factor e, as opposed to (20).
How we deal with imperfect RNGs thus depends: if we start with a win/lose game, and if the bias is small, then it is typically advantageous to preserve the win/lose property of the game and derive a new winning probability as a function of the bias. If, however, the bias is very large, then it can be advantageous to sacrifice the win/lose property, and adopt the analysis for general games. If the game was not win/lose to begin with, we always adopt the second method.

C. Selecting a Bell inequality
One of the main objectives of a Bell experiment is to quantify the evidence against a LHVM , hence ideally one would like to choose a game that would yield the lowest P−value for a fixed number of trials. The optimization of games with this objective is a non-trivial task. A reasonable alternative which one can use as heuristic is to mazimize the gap between the expected score achievable in the experiment and the expected score that a LHVM can attain. In other words, we are looking for a Bell inequality for which the violation we can observe is as large as possible. To find such an inequality, standard linear programming methods can be used (see e.g. [40]).
To apply them we assume that a reasonably good guess is available as to what the probabilities p(a, b|x, y) are in the experiment. Such a guess can be made by either analyzing data collected prior to the Bell experiment and approximating probabilities by relative frequencies, or by having sufficient confidence in the theoretical model that describes the experiment and calculating the probabilities from this model.
Suppose that in the estimation process we find some estimates of p(a, b|x, y).
If such probabilities could be realized by a LHVM, then we could write them as a mixture of deterministic local strategies. To make this precise, let λ = (a 1 , . . . , a |X| , b 1 , . . . , b |Y | ) denote a deterministic strategy in which Alice and Bob give outputs a x and b y for inputs x = {1, . . . , |X|} and y ∈ {1, . . . , |Y |}. In terms of a probability distribution, this would correspond to a distribution d λ (a, b|x, y) such that d λ (a, b|x, y) = 1 if and only if a = a x and b = b y as indicated by the vector λ, and d λ (a, b|x, y) = 0 otherwise. A behaviour, that is distributions p(a, b|x, y), is local if and only if p(a, b|x, y) = λ q λ d λ (a, b|x, y) where the sum is taken over all |X| |A| |Y | |B| possible λ [40], where |A| and |B| denote the number of possible outputs for Alice and Bob, and ∀λ, q λ > 0, and We note that one can test whether such q λ exist, i.e. whether the behaviour is local, using a linear program [53,54]. The dual of this linear program can be used to find a Bell inequality that certifies a behaviour p(a, b|x, y) is not local [40]. One can easily adapt this linear program to search for an inequality that achieves a high violation. Specifically, maximize Violation = x,y,a,b s xy ab p(a, b|x, y) − S x,y,a,b s xy ab d λ (a, b|x, y) ≤ S, ∀λ 0 ≤ s xy ab ≤ 1, ∀x, y, a, b , where the p(a, b|x, y) and d λ (a, b|x, y) are givens, and we optimize over s ab xy (see [40] for details). Note that the second constraint means that for every LHVM, we have a Bell inequality in which β max = S and the difference V is precisely the violation we achieve when normalizing the score functions to lie in the interval [0, 1] which can be done without loss of generality.
It is clear from the discussion above that it can be to our advantage to search for a win/lose game, rather than a general game since the p-values for such games are sharper. This can be done by optimizing over score functions in which s xy ab ∈ {0, 1}. This, however, is now an integer program [55] rather than a linear program [56], which are in general NPhard to solve [55]. Nevertheless, this may be feasible for the small number of inputs and outputs used in any experimental implementation, and heuristic methods exist.

D. Combining independent experiments
Suppose that a series of n experiments is run independently. Each experiment could correspond to completely different settings, Bell inequalities, etc. Associated with each experiment we obtain a series of P -values corresponding to the probability that each of them was governed by a LHVM: (p i ) n i=1 . In this situation, it is possible to take all the P -values associated with each one of the individual experiments and obtain a combined P -value. One such a method is Fisher's method [57,58]. With Fisher's method the combined P -value is given by the tail probability of χ 2 2n , a chi-squared distribution with 2n degrees of freedom: The right hand side of this equation can be easily evaluated numerically. However, it can be shown that the tail probability of χ 2 2n accepts the following closed expression: where we can choose x = − log n i=1 p i . However, we make no claim of optimality regarding the combined P -value. There is a rich literature on methods for combining P -values [59] and depending on the concrete situation a different choice should be made.

E. Conclusions
We have shown how to derive (nearly) optimal Pvalues for all Bell inequalities that can easily be applied to evaluate the data collected in experiments. A suitable Bell inequality can be found as outlined above, however, it would be interesting to combine this method with the numerical search for inequalities in [4,12,14]. The latter can adaptively find the best way to discriminate between LHVMs and theories like quantum mechanics that go beyond localhidden variables that is asymptotically optimal, but requires a significant amount of data to train.
We note that there exist many ways to extend the methods presented here to deal with specific situations at hand, for example, by conducting multiple experiments in succession and using data from prior runs to find more suitable Bell inequalities in the next instance.
We emphasize that the methods outlined here can be used to test other models than LHVMs. It is clear from the proof that only the winning probability in (15), or the expectation value (31), depends on the model to be tested. The argument that extends these bounds for a single trial to a bound on the P -value for the entire experiment allowing arbitrary memory in the devices, however, does not depend on the model tested. In particular, this means that any theories that predict bounds of the form (15) and (31) are excluded with the same bound on P -value. This also makes it ap-parent how one can extend the analysis to refute models that are more powerful than an LHVM. For example, Hall [60] defined and quantified interesting relaxations of an LHVM, with reduced free will, or where some amount of signalling is allowed. It is straightforward to adapt the analysis of [60] to derive bounds on (15) and (31) to subsequently obtain a P -value for testing such extended models. Note that since Alice and Bob obtain an advantage by allowing models such as [60], i.e. they are allowed more powerful strategies, hence they can achieve a higher score in the game. This implies that concrete scores will result in higher P -values and lower confidence.
Furthermore, while we focused the discussion on tests of Bell inequalities, our methods can also be applied to the study of certified randomness as in [11,13,61], or more generally to tests of e.g. noncontextual models that can be phrased as one player games.

ACKNOWLEDGMENTS
We thank Hannes Bernien, Peter Bierhorst, Andrew Doherty, Anaïs Dréau, Richard Gill, Peter Grünwald, Ronald Hanson, Bas Hensen, Jed Kaniewski, Laura Mančinska, Corsin Pfister, Tim Taminiau, Thomas Vidick and Yanbao Zhang for discussions and/or comments on an earlier version of this manuscript. We also thank the referees for their careful reading and suggestions. DE and SW are supported by STW, Netherlands and an NWO VIDI Grant. In this supplemental material, we formalize and prove our claims. To accomplish this, we first need to introduce more precise notation and a formal description of LHVMs in Section A. We then proceed to analyze win/lose games in Section B, and general games in Section C.
Appendix A: Preliminaries

Notation
We will use capital letters to denote random variables: A, B, . . . and the corresponding lower case letters to denote the value that the random variable takes: a, b, . . .. Instead of the notation p(a) used in the main text, we will use the more precise form Pr (A = a) = p(a). During the experiment, we perform many attempts to generate entanglement in which we will record the inputs a i , b i , outputs x i and y i , and event-ready tags t i in each attempt. We will reserve the word trial for the attempts in which t i = 0. Note that in an experiment that does not use an event-ready procedure we always have t i = 0.
While we restrict our explanations to the bipartite case, we emphasize that is straightforward to extend our analysis to any number of sites and we provide a simple example of how this is done in Figure 4 in the main text. A single attempt of a (bipartite) game is characterized by the inputs that we denote by X and Y and the corresponding outcomes A and B. In the case of an event-ready scheme, the outcome of the event-ready station would be denoted by T . Let ∆ a,b,x,y,t i be an indicator function It takes on the value 1 if all equalities are satisfied for a particular choice of a, b, x, y, t, and 0 otherwise. We will let the random variable C i stand for the score of the game obtained in trial i. This variable is defined as function of the coefficients s ab|xy that characterize the game attempt as Depending on the event-ready tag t, a different game might be played, which implies that s ab|xyt depend on t. Whenever the dependence on t is clear, we will drop t and simply write s ab|xy to avoid cluttering the notation. The concrete instance of the i-th attempt we denote by We will often drop the dependence on (a, b, x, y, t) by writing c i in order to lighten the notation. The term concrete instance means that c i can be computed from the observed data. In an experiment consisting of m attempts, we first need to compute the following number, which is the total score Alice and Bob obtain The corresponding random variable is It will furthermore be convenient to define the following shorthand

Local-hidden variable models
In order to formally state the null hypothesis, we briefly need to state what LHVMs are more precisely. More details can be found in e.g. [40] and [16]. To do so, let us introduce the following sequences of random variables in correspondence with the concrete instances of inputs and outputs denoted by the lowercase letters above. Let the histories of attempts previous to the i-th attempt, C m = (C i ) m i=1 denotes the scores at each attempt and T m = (T i ) m i=1 is the sequence of event-ready signals in the case of an event-ready experiment. In an event-ready experiment, we make no assumptions regarding the statistics of the event-ready station, which may be under full control of the LHVM, and can depend arbitrarily on the history of the experiment.
The random variable H i models the state of the experiment prior to the measurement. As such, H i includes any hidden variables, sometimes denoted using the letter λ [40]. It also includes the history of all possible configurations of inputs and outputs of the prior attempts (X j , Y j , A j , B j , T j ) i−1 j=1 . However, this is the only requirement for H i ; that is the history may also include other aspects of the experiment. For simplicity, we assume that it is a countable random variable, though in full generality it could be defined over an arbitrary probability space.
The null hypothesis (to be refuted) is that our experimental setup can be modeled using a LHVM (see [5] for more details). This model has the following properties: 1. Local randomness generation. Conditioned on the history of the experiment the inputs X i , Y i are independent of each other and of the output of the event-ready signal We allow X i and Y i to be partially predictable given the history of the experiment. We use p x and p y for the distribution that we are hoping to achieve using imperfect RNGs. However, we assume this target distribution to be the same for all i.

2.
Locality. The outputs a i and b i only depend on the local input settings and history: they are independent of each other and of the input setting at the other side, conditioned on the previous history and the current event-ready signal 3. Sequentiality of the experiments. Every one of the m attempts takes place sequentially such that any possible signalling between different attempts beyond the previous conditions is prevented. The reason for this condition is that this signalling opens the simultaneous measurement loophole [3]. Also note, that if the sequentiality condition is not met the history random variable becomes ill-defined.
Except for these properties the variables might be correlated in any possible way. A model that verifies these properties or constraints is what we call an LHVM and it is under these conditions that our statements on the P−value do hold. Then, a small P−value can be used to reject the hypothesis that the experiment was governed by such an LHVM. Note that some of these constraints are naturally backed by some experimental setups while in some others they might be less justified. For instance, one might argue that that X i and T i are independent given the history because the corresponding stations are space-like separated. If, on the other hand, the stations can signal to each other, one might still make the assumption that X i and T i are independent. However, in contrast to the scenario in which the stations are space-like separated, a small P−value and consequent rejection of the null hypothesis would still leave the door open to other local models that can explain the observed data with high probability, e.g. models in which X i and T i are not independent.
Appendix B: Analysis of win/lose games In this section we consider games where the scoring variable takes only two values. As argued in the main text, these games can always be transformed, via normalization, into games that take the values 1 and 0. We identify these values with winning and losing. This analysis is an extension of the one done for the Delft experiment (Supplementary information [36]). For a win/lose game, the probability of winning in a given trial equals the probability that the score takes the value 1: Pr (C i = 1). Note that the score c we compute from the data given in (A4) is now just the number of times that Alice and Bob win the game.
Suppose that we perform a win/lose game n times and we observe c wins. The P -value for an experiment that employs an event-ready procedure with two outputs t j = 0 (no, not ready) and t j = 1 (yes, ready) is Let us now show how to obtain a tight upper bound on (B1) for all win/lose games in a systematic way. We detail the procedure in the following. 1.
Step 1: Bounding the probability of winning the next trial First, we need to prove that if the experiment is ruled by an LHVM, then the winning probability of the next trial is bounded from above by some β t win for any possible history h i and event-ready signal t. Such bounds can be obtained in two ways. If a tight bound is achieved for β t win , then our final bound will also be tight and attained by a LHVM strategy that does not use any memory.

a. A numerical bound using linear programming
In general, it is always possible to obtain a numerical bound via a linear program (LP, see e.g. [40] and also Section IV in the main text). While the history can be arbitrary, note that the history can always be reflected in terms of a choice of hidden-variables. It is known that these can be taken to be finite, allowing us to compute whereŝ ab|xyt ∈ {0, 1} are the normalized score functions. Note that we write q λ|h,t for a fixed history H j = h and T j = t, but the LP above does not depend on knowing h and t as they simply form labels. We remark this is an upper bound, since we allowed for maximum bias.
Linear programs can be solved efficiently, although the number of variables is prohibitively large. Nevertheless, for games used in experiment the number of inputs and outputs is generally small enough for the LP to be solved using Mathematica or Matlab.
b. An analytical bound: example CHSH However, analytical derivation is also viable, and indeed for an existing Bell inequality we can convert the parameters β min and β max into suitable bounds. For the purpose of illustration we provide a very simple example of this idea using CHSH, where we derive a bound directly in terms of the probability of winning the game. We remark that is a refinement over the analysis in [36] that becomes interesting for a larger bias τ , but more cumbersome to read.
Specifically, in Lemma 1 we derive a tight upper bound on the winning probability of CHSH with imperfect random number generators in an event-ready setup. Analogous derivations for other simple inequalities are straightforward. For CHSH, the inputs X i , Y i , outputs A i , B i and output of the heralding station T i take values 0 and 1. If T i = 0 the scoring variable C i takes always the value zero, if T i = 1 then C i = 1 when x · y = a ⊕ b and C i = 0 in the remaining cases. Given that the predictability of the RNG is τ , we have for all i ∈ N with i ≤ m, any possible history H i = h i of the experiment, and T i = 1 that the probability of C i = 1 is upper bounded by where Proof. We first expand the desired term using the definition of C i as x,y,z∈{0,1} (x,y) =(1,1) We can break these probabilities into simpler terms The first equality followed by the locality condition, the second one simply by the definition of conditional probability. With this decomposition, we can express (B9) as where we have used the shorthands Now we will expand (B13). We know that 1/2 − τ ≤ α x , β y ≤ 1/2 + τ . In principle, τ does not need to take the values in the extreme on the range. Without loss of generality let α 0 = 1/2 + τ A and β 0 = 1/2 + τ B , with τ A , τ B ∈ (−1/2, 1/2). x,y∈{0,1} It thus remains to bound the sum of f x,y . Note that we can write Since (B23) is a sum of two convex combinations, it must take its maximum value at one of the extreme points, that is with χ 0 ∈ {0, 1} and χ 1 ∈ {0, 1}. We can thus consider all four combinations of values for χ 0 and χ 1 given by Since 0 ≤ γ 0 , γ 1 ≤ 1, we have in all cases that the sum is upper bounded by 3. Finally, using (B21) we have where in the first inequality we bound f 0,0 , f 0,1 , f 1,0 by 1. The second inequality follows since τ ≤ 1/2 and for τ A , τ B below 1/2 (B26) is strictly increasing both in τ A and τ B ; this implies that the maximum is found in the extreme: Note that in the case T i = 0, we trivially have Pr Step 2: Replacing the history with the recorded values of C i−1 and T i . Now, building on the above, we prove that the probability that C i takes the value one given not the entire history, but only the heralding events and the prior sequence of scores, is bounded from above by the same β t win . While the two statements look very similar, the main difference is that while in Step 1 we condition on the entire history H i = h i , in Lemma 2 we condition on the heralding events T i = t i , and the prior sequence C i−1 = (C j ) i−1 j=1 of data that can actually be observed [62]. Although both statements are similar, it is Lemma 2 that we can easily use in the proof of Lemma 3 to bound the p-value of the experiment.
We will need Proposition 1, which is a basic probabilistic statement necessary for Lemma 2. In essence, it is just the measure theoretic version of We state it for completeness, with the purpose of having the derivation of the bound on the p-value as self contained as possible.
Proposition 1 (Law of total probability). Let A, B be two random variables on the same probability space Ω with E(|A|) < ∞. Then the probability of an event A = a admits the following integral form for some measure dµ on Ω.
Then for all sequences t i ∈ {0, 1} i and c i−1 ∈ {0, 1} i−1 : Proof. The following equalities hold from the definition of conditional probability and Proposition 1 Let us bound the integrand in the previous equation. We have where δ is a shorthand for The first equality (B33) follows from the fact that t i−1 and c i−1 are events either compatible or incompatible with h i , the second one (B34) from the definition of conditional probability, and the inequality (B35) from Lemma 1. We now introduce (B36) back into (B32) to obtain where the equality (B39) follows from Proposition 1. We complete the proof by cancelling the terms Pr T i = t i , C i−1 = c i−1 on the right and left side of the equation above.

2.
Step 3: Going from one to many attempts In the last part of this technical derivation, we put together the statements above and instead of making a statement just about the next attempt, we now make a statement about all attempts together. This proof generalizes Proposition 4 in [5] to event-ready schemes. Even though the analysis is more involved, the proof technique follows the same steps as the original one in [5].
What makes the analysis more tricky, is that in an event-ready scheme we have a long sequence of m attempts, and a (potentially much shorter) sequence of n trials, that is, attempts for which t i = 1. It is intuitive that of relevance in the long sequence of m attempts, is the sequence C m = (C 1 , . . . , C m ) together with the sequence of event-ready attempts T m = (T 1 , . . . , T m ). Recall that the latter tells us which elements of C m are of interest, i.e., can at all be non-zero. To reason about the shorter sequence of n trials, let us first introduce some notation. Our goal will be to define a series of random variables D n = (D 1 , . . . , D n ) for the short sequence of trials, where intuitively D j corresponds to the random variable taking value one when the j-th event-ready success also results in T i = 1 for any corresponding i. In other words, we will define D n in such a way that instead of worrying about the number of 1's in (C 1 T 1 , . . . , C m T m ) we will be concerned with the number of 1's in (D 1 , . . . , D n ).
To define this formally, we need a way to map the j-th trial from the short sequence of trials, to the index i in the longer sequence of attempts. Note that for a particular event-ready sequence t m = (t 1 , . . . , t m ) ∈ T m , the j-th trial is mapped to the smallest index i, such that the subsequence t i = (t 1 , . . . , t i ) of t m has exactly j 1's. Of course, there are many sequences t i ∈ T i that have precisely j 1's, where the last element is also a 1, and for all such strings the mapping from j in the sequence of trials, to the index i in the sequence of attempts is the same. Let us thus define T j→i = t m = (t 1 , . . . , t m ) ∈ {0, 1} m | |t m | = n and t i = (t 1 , . . . , t i ) satisfies |t i | = j and |t i−1 | = j − 1 , to be the set of all event-ready sequences t m for which j is mapped to one particular i. By summing over all possible indices i in the long sequence of attempts, we can thus formally define where 1 is as before the indicator function. In terms of probabilities, this means that the probability that the j-th trial gives D j = 1 is given by We can thus express the P -value as Before delving into the proof below, it will be convenient to simplify (B42). Note that for a fixed i, the term in brackets in (B42) contains a sum over all possible t i+1 , . . . , t m and c i+1 , . . . , c m . This means we can use the law of total probability to shorten the sum by expressing (B42) in terms of the marginal distributions as where After having formally established the relation between the sequence of trials and the sequence of attempts, we are now ready for the final proof, where we can now argue in terms of the sequence of trials (D 1 , . . . , D n ).
then we have that for all m ≥ n, the probability that at least k of the (D j ) n j=1 take the value one is upper bounded by where P n,k (B γ ) denotes the probability that n Bernoullis with probability γ := β win yield at least k 1's, and P n,k (B γ ) = 0 if k > n.
Proof. Let us define the shorthand The probability that we see at least zero 1's (k = 0) obeys P n,0 (D) = 1 (B51) = P n,0 (B γ ) for all n and m ≥ n.
We now prove the statement for k > 0 by induction on n. For n = 1, we need only to verify that (B49) holds for k = 1 (we already dealt with k = 0 and the case k > 1 trivially holds). We have where the first equality (B54) is just (B46), the second equality (B55) the definition of conditional probability, the inequality (B56) follows from Lemma 2, and the final equality (B57) from the definition of the sets T j→i and the fact the sum of all probabilities is 1.
In order to prove the induction step below, let us first express the probability of having at least k 1's on trial n as the sum of the probability of having at least k on trial n − 1, plus the probability of having exactly k − 1 1's on trial n − 1 and a one on the n-th trial We now upper bound the second term in (B60), where we will use the shorthand Equality (B63) follows by the definition of conditional probability, inequality (B64) from Lemma 2, equality (B65) from (B42), and the last equality (B66) because the sum over all vectors having exactly k − 1 1's equals the probability of having at least k − 1 1's, minus the probability of having at least k.

a. Event-ready schemes creating multiple states
We now consider arbitrary event-ready schemes in which multiple games can be played as specified by the tag t. This is of interest, for example, when multiple Bell states can be generated at the event-ready station, but we want to use more than one of them. Note that this only makes sense if Alice and Bob can hope to violate all inequalities using the same measurements, since the event-ready signal is space-like separated from the generation of the random numbers used as inputs to Alice and Bob. In the Delft experiment [36], only one Bell state was used, namely the one which is most noise-free resulting in a higher violation.
However, it is possible to use multiple Bell states by playing the standard CHSH game t = 1 for one Bell state, and one in which we flip the role of the inputs for t = 2. That is, instead of taking x · y = a ⊕ b as the winning condition, we take x · y = a ⊕ b ⊕ 1. This game has exactly the same winning probability than the standard CHSH game, meaning that our analysis above applies without change by using a new tag t = 0 whenever t = 0, and setting t = 1 for either t = 1 or t = 2. Naturally, we then need to apply the relabelling to the outputs before computing the total score.
We remark that is possible to combine games that have different success probabilities, but we are not aware of any situation yet in which this may be beneficial.

Conventional analysis
For completeness, let us now illustrate how the p-value for win/lose games compares to statements made in a conventional analysis using standard deviations. In such an analysis it is assumed that there is no memory and that the distribution is Gaussian. For simplicity, we thereby consider the case where no event-ready scheme is used. It is straightforward to extend to event-ready schemes. Observe that in win/lose games without the use of an event-ready procedure we can express the P -value as where n is the number of trials. Now, assume that each of the C i is i.i.d. (independently and identically distributed) and characterized by the probability that it takes the value one ∀i, Pr (C i = 1) = q .
We can now approximate the sum of Bernoulli trials by a Gaussian random variable with mean nq and variance nq(1 − q). If the hypothesis holds we can approximate the right hand side of (B76). However, for all win/lose games we have that q ≤ β win , that is β win is a cap on the probability that C i takes the value one. If additionally c > nγ we obtain an upper bound on the approximation as follows where Q(·) denotes the tail probability of the standard normal distribution. Observe that the right hand side of (B78) is increasing in q (or alternatively c−nq √ is decreasing in q). That is, the inequality follows because β win is the largest possible value of q.

Relation to the CHSH correlator CHSH
For completeness, let us explain how for the example of the CHSH correlator, one can understand the relation between the number of wins c obtained in the win/lose game, and the maybe more familiar form of the correlator. Since our objective is only to illustrate this link and give some intuition on the p-values, we assume, only from here and until the end of this section, perfect RNGs. We will also drop the index i, since we are considering only one trial. We denote by XY ab the average of the random variable XY when the settings are A = a, B = b That is, we can map the average CHSH value S to the probability that C takes the value one. It directly follows that the known CHSH upper bound S ≤ 2 corresponds with Pr (C = 1) ≤ 0.75, which is the upper bound that we obtain if we assume perfect RNGs (τ = 0) in Lemma 1 and Lemma 2. In the same way we can map the observed CHSH violation to the number of successes. Let n a,b , n = a,b and n = a,b denote denote the number of trials, the number of wins and the number of losses associated with setting (a, b) and letS denote the observed CHSH value:S The approximation holds since for large n the number of trials at each setting should be approximately n/4 and in consequence n/n a,b ≈ 4.
to denote the maximum and minimum values of the score functions. Recall that since s ab|xy = s xy ab p(x, y) , the values of s max and s min depend on the distribution p(x, y), if we use imperfect RNGs then bounds on the bias of the random numbers will translate into different values of s max and s min . In order to obtain a bound on the P -value we will proceed as follows. First we state Bentkus' inequality, a concentration bound for bounded martingale difference sequences. Then we show how to construct a martingale sequence with bounded differences for any Bell inequality and finally apply Bentkus' inequality to this martingale sequence.

Bentkus' inequality
The score functions of general games take more than two values, and in full generality they might even take a continuous range of values. However, Bentkus' is given in terms of the tail of a sum of independent and identically distributed Bernoulli random variables or, as we equivalently state it here, in terms of the tail of a binomial distribution. These random variables are discrete and, by definition, can only take a discrete number of values. The gap between both is bridged by an interpolation of the binomial distribution. Let us introduceP n,y (B γ ) a function that interpolates P n,k (B γ ) between consecutive values of k. We define P n,y (B γ ):P n,y (B γ ) = P n, y (B γ ) 1−(y− y ) P n, y (B γ ) y− y (C4) Theorem 1 (Bentkus' inequality (Theorem 1.2 [33])). Let M n be a martingale sequence with differences X k = M k − M k−1 and M 0 = 0. If for k = 1...n the differences satisfy the following boundedness condition: then Pr (M n ≥ x) ≤ eP n,x+nγ (B γ ) (C6) and γ = n i=1 α i /n.
That is, Bentkus' inequality states that the probability that a martingale sequence surpasses some value x for any martingale sequence with bounded differences is bounded by the tail of a binomial distribution multiplied by e. The inequality also holds if M n is a sequence of supermartingale differences [63].

A bounded difference supermartingale sequence
Let {C i } n i=1 denote the scores at each attempt of a game (see Section A) and let E [C i |H i = h i ] stand for the mean value of the score function given the history. This mean is nothing more than a Bell inequality and we have that Consider the sequence of random variables the elements in the sequence correspond to the scores in each attempt normalized by s max −s min and displaced by β max . We subtract β max such that M n is a supermartingale sequence with respect to the sequence H n . This can be readily verified since: With the supermartingale differences bounded above and below by 1 − α k and −α k respectively we can apply Bentkus' inequality: Where we use the identification of x from (C14). In order to evaluate the right hand side of (C18), there remains to identify x + nγ. It can be evaluated from the observed data: Finally, we obtain the following bound by directly applyng Bentkus' inequality Pr (C ≥ c) ≤ eP n,x+nγ (Bγ) (C22)