Confronting p-hacking: addressing p-value dependence on sample size

Biomedical research has come to rely on p-values to determine potential translational impact. The p-value is routinely compared with a threshold commonly set to 0.05 to assess the significance of the null hypothesis. Whenever a large enough dataset is available, this threshold is easily reachable. This phenomenon is known as p-hacking and it leads to spurious conclusions. Herein, we propose a systematic and easy-to-follow protocol that models the p-value as an exponential function to test the existence of real statistical significance. This new approach provides a robust assessment of the null hypothesis with accurate values for the minimum data-size needed to reject it. An in-depth study of the model is carried out in both simulated and experimentally-obtained data. Simulations show that under controlled data, our assumptions are true. The results of our analysis in the experimental datasets reflect the large scope of this approach in common decision-making processes.


INTRODUCTION
In the most complex scenarios, decision making is only possible when we are able to reduce intricate working conditions to a dichotomous or binary case. Statistical hypothesis testing has always supported the ability to discriminate between different events. Yet previous methods do not always provide robust results due to their dependence on the size of the datasets being tested (Altman and Krzywinski, 2017; Krawczyk, 2015; Lin et al., 2013), and they require urgent revision (Altman and Krzywinski, 2017; Amrhein et al., 2019; Leek et al., 2017).
Typically obtained from any conventional test, the "gold standard" p-value has long been recognized as an unreliable but popular measure of statistical significance (Altman and Krzywinski, 2017; Amrhein et al., 2019; Halsey et al., 2015; Nuzzo, 2014). The p-value is itself a random variable that depends on the data used and, therefore, has a sampling distribution. A straightforward example is as follows: under the null hypothesis, the p-value is uniformly distributed on (0, 1). If the null hypothesis is not trivially assessable, it always remains possible to obtain a p-value small enough to reject the null hypothesis by sufficiently increasing the sample size (also called p-hacking) (Bruns and Ioannidis, 2016; Cumming et al., 2007; Halsey et al., 2015; Krzywinski and Altman, 2013). For instance, even when comparing the mean values of two groups with nearly identical distributions, statistically significant differences between the groups can always be found with any of the conventional statistical tests (e.g., Mann-Whitney U test (Mann and Whitney, 1947), rank-sum test (Wilcoxon, 1945), Student's t-test (Student, 1908)) as long as a sufficiently large number of observations is available (Bruns and Ioannidis, 2016). Non-parametric statistical tests for two samples, such as the Kolmogorov-Smirnov test (Smirnov, 1939), also reject the null hypothesis when working with sufficiently large datasets. In other words, big data can make insignificance seemingly significant by means of the classical p-value. Similar to the examples in (Altman and Krzywinski, 2017; Halsey et al., 2015), Fig. 3 in the Materials and Methods further illustrates the described problem. Despite this finding, there remain many situations for which the 'dichotomy' associated with the p-value is necessary for decision-making (Leek et al., 2017). Designing a robust tool devoted to this task could be an inflection point in the use of statistical tests. In this work, we aim to answer the question of when we can solidly assert that bona fide differences exist between two sets of data, independent of sample size.
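The size dependence described above can be made concrete with a small numerical sketch. It is an illustration only: it assumes a two-sample z-test with known unit variance (not one of the tests cited in the text) and evaluates the p-value at the expected value of the test statistic instead of simulating data. The function names `expected_p_value` and `min_n_for_significance` are hypothetical.

```python
import math


def expected_p_value(mean_diff, n, sd=1.0):
    """Two-sided z-test p-value evaluated at the expected test statistic.

    Assumes two groups of n observations each, a known common standard
    deviation `sd`, and a true difference in means of `mean_diff`.
    """
    z = mean_diff / (sd * math.sqrt(2.0 / n))
    return math.erfc(abs(z) / math.sqrt(2.0))


def min_n_for_significance(mean_diff, z_crit=1.959964, sd=1.0):
    """Smallest per-group n at which the expected statistic exceeds z_crit.

    z_crit = 1.959964 is the two-sided critical value for alpha = 0.05.
    """
    return math.ceil(2.0 * (z_crit * sd / mean_diff) ** 2)


if __name__ == "__main__":
    # A fixed, small mean difference (0.1) against growing sample sizes.
    for n in (10, 100, 1000, 10000):
        print(n, expected_p_value(0.1, n))
```

Under these assumptions, any fixed nonzero difference eventually crosses the 0.05 threshold as `n` grows, whereas a difference of exactly zero never does, which is the contrast the method exploits.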
To introduce our method, we first show that the p-value can be accurately approximated through its expression as an exponential function of the sample size $n$, $p(n)$:
$$p(n) = a\,e^{-cn}, \quad a, c \in \mathbb{R}^{+} \tag{1}$$
In Fig. 1a, different randomly generated normal distributions are compared using the Mann-Whitney U statistical test (Mann and Whitney, 1947) to illustrate the decrease of the function $p(n)$ with the sample size.
The use of Student's t-test was avoided, as it is well known that the p-value associated with the t-statistic decays exponentially (Student, 1908). Technical details about the convergence of the function $p(n)$ and evidence that Eq. 1 holds for any statistical test are given in the Materials and Methods.
Note that the p-value curve, the function $p(n)$, is used to compare pairs of experimental conditions; therefore, $p(n)$ is computed as the exponential fit to the p-values of multiple sample comparisons.
Hence, the parameters $a$ and $c$ in Eq. 1 correspond to those defining the exponential fit $p(n)$. We use Monte Carlo cross-validation (MCCV) (Xu and Liang, 2001) as the sampling strategy: two subsets of size $n$ (one from each of the groups to be compared) are randomly sampled and compared with a statistical test. The resulting p-value is stored and the procedure is repeated many times. At the end of the procedure, a large set of $n$-dependent p-values is obtained and the exponential function in Eq. 1 can be fit.
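The sampling and fitting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the Python standard library, the Mann-Whitney p-value is obtained with the normal approximation and no tie correction, and the exponential is fitted by least squares on the logarithm of the mean p-value at each size (a simplification of fitting all p-values directly). The names `mann_whitney_p`, `mccv_p_values` and `fit_exponential` are hypothetical.

```python
import math
import random
import statistics


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    # U counts the pairs (xi, yj) with xi > yj; ties count one half.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def mccv_p_values(group_a, group_b, sizes, repeats, rng):
    """For each size n, compare `repeats` random subsample pairs and keep the p-values."""
    return {
        n: [mann_whitney_p(rng.sample(group_a, n), rng.sample(group_b, n))
            for _ in range(repeats)]
        for n in sizes
    }


def fit_exponential(p_by_size):
    """Fit p(n) = a * exp(-c * n) by linear least squares on log(mean p)."""
    ns = sorted(p_by_size)
    logs = [math.log(statistics.fmean(p_by_size[n])) for n in ns]
    n_bar, l_bar = statistics.fmean(ns), statistics.fmean(logs)
    slope = (sum((n - n_bar) * (l - l_bar) for n, l in zip(ns, logs))
             / sum((n - n_bar) ** 2 for n in ns))
    return math.exp(l_bar - slope * n_bar), -slope  # (a, c)
```

A typical call would draw two lists of observations, run `mccv_p_values(group_a, group_b, sizes, repeats, random.Random(seed))` and pass the result to `fit_exponential`. The log-linear fit weights small p-values heavily, so a nonlinear least-squares fit on the raw p-values may behave differently.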
Like any decaying exponential function, $p(n)$ converges to zero. The faster the function converges, the more robust the significance. When normal distributions of standard deviation one and mean value in the range [0, 3] are compared, we see that the larger the difference between experimental conditions, the faster the decay of the exponential function that approximates $p(n)$ (Fig. 1b). We observe that the parameters $a$ and $c$ (Eq. 1) increase proportionally with the mean value of the distribution compared with $N(0, 1)$ (Fig. 1b). With this idea in mind, a robust decision index, $\theta_{\alpha,\gamma}$, can be mathematically defined (Eq. 10 in Materials and Methods). Note that the subscripts $\alpha$ (statistical significance threshold) and $\gamma$ (regularization parameter) are omitted from now on.
Instead of comparing a single p-value with the ideal statistical significance threshold $\alpha$ (e.g., $\alpha = 0.05$ for 95% statistical significance), a distance $\delta$ (Eq. 9 in the Materials and Methods) is defined to compare the function $p(n)$ with $\alpha$ for all values of $n$. $\delta$ measures the difference between the area under the constant function at level $\alpha$ and the area under the curve $p(n)$ (Fig. 1c). The distance $\delta$ is then used to obtain the binary index $\theta$, which indicates whether $p(n)$ and the constant $\alpha$ are far from each other or not. If $p(n)$ is smaller than $\alpha$ for most values of $n$, then $\theta = 1$, which means that there is acceptable statistical significance. However, if $\theta$ is null, the tested null hypothesis cannot be rejected.
As the exponential function is defined for all values $n \in (-\infty, +\infty)$, it is necessary to determine a range of $n$ for which the function $p(n)$ is meaningful. The decay of $p(n)$ is concentrated in a range between $n = 0$ and a certain value of $n$ for which $p(n) \approx 0$ (convergence of $p(n)$); so, $\delta$ should only be calculated in that range. A parameter $\gamma$ is used as a regularizer to measure the point of convergence $n = n_\gamma$, such that $p(n = n_\gamma) \approx 0$ (Fig. 1c and Eq. 8 in the Materials and Methods). Small $\gamma$ values imply less restrictive decisions. Nonetheless, the experimental evaluation of the method on synthetic and real data shows $\gamma = 5 \times 10^{-6}$ to be a reasonable choice (detailed information is given in the Materials and Methods and the Supplementary Material). Note that once $p(n)$ is determined by the parameters $a$ and $c$ in Eq. 1, the minimum data size needed to observe statistically significant differences at the $\alpha$-level can also be provided. As $p(n)$ decreases continuously, the value of $n$ from which $p(n)$ is always smaller than $\alpha$ can be calculated easily. This value is called $n_\alpha$ (Fig. 1c and Eq. 12 in the Materials and Methods).

RESULTS
Both the decision index $\theta$ and the minimum data size $n_\alpha$ provide intuition about the veracity of the null hypothesis of the statistical test. To illustrate this, different normal distributions were compared using the Mann-Whitney U statistical test (Mann and Whitney, 1947) with an $\alpha$-level of 0.05 (Table 1 in Materials and Methods). When $N(0, 1)$ is compared with $N(0, 1)$, $N(0.01, 1)$ and $N(0.1, 1)$, $\theta$ is null, so those distributions are assumed to be equal. In the remaining comparisons, though, $\theta = 1$; thus, there exist differences between $N(0, 1)$ and $N(\mu, 1)$ for $\mu \in [0.25, 3]$ (Fig. 1d). Likewise, the value of $n_\alpha$ increases towards infinity as the mean value $\mu$ decreases when $N(0, 1)$ is compared with $N(\mu, 1)$ for $\mu \in [0.1, 3]$. Indeed, $n_\alpha$ cannot be determined when $N(0, 1)$ is compared with $N(0, 1)$ and $N(0.01, 1)$, as the null hypothesis in this case is true and, therefore, $p(n)$ is a constant function, which represents the uniform distribution of p-values (Figs. 1e and 1f, and Fig. 3 in Materials and Methods).
To prove the generality of the proposed method, we tested its different functionalities on published and unpublished data from biological experiments. The first application of the method consists of discriminating between conditions; that is, declaring whether two conditions are different or not. In this case, we wanted to determine whether cancer cells cultured in 3D collagen matrices and imaged under a light microscope changed shape after administration of a chemotherapeutic drug (Taxol) (details about data collection and processing are given in the Supplementary Material). Three different groups were compared: control (non-treated) cells, and cells treated with 1 nM and 50 nM Taxol, respectively. Cells exposed to a low concentration of Taxol (1 nM) remained elongated (low roundness index), i.e. $\theta = 0$ for the comparison between control cells and those treated with Taxol at 1 nM. However, when the dose was increased to 50 nM Taxol, cells became circular; therefore, $\theta = 1$ when comparing cells treated with 50 nM Taxol versus control cells or cells treated with 1 nM Taxol (Fig. 2a and Table S3 in the Supplementary Material).
Secondly, we analyzed the flow cytometry data used by Khoury et al. (Khoury et al., 2018) to determine the transcriptional changes induced by the in vivo exposure of human eosinophils to glucocorticoids. As in the previous example, the proposed method allowed us to discriminate between treated and untreated eosinophils using the entire dataset. For that, we analyzed the eosinophil surface expression of the gene CXCR4 2 h after exposure to 20 and 200 mcg/dL of Methylprednisolone. The eosinophils belong to 6 different healthy human subjects. With the estimation of the function $p(n)$ (Eq. 1), it is possible to conclude that the exposure of eosinophils to glucocorticoids causes a differential expression of CXCR4 (Fig. 2b), i.e. $\theta = 1$ for the comparison between vehicle and eosinophils treated with 20 and 200 mcg/dL (Table S6 in the Supplementary Material). Indeed, the conclusion is the same as the one reached in (Khoury et al., 2018), where only the median fluorescence intensity of the data from each subject was calculated and the resulting 6 data points were compared (Fig. 2b). However, the latter approach can lead to false conclusions when the data distribution differs or when the data deviation is large.
The last use of the method consists of analyzing whether a single specific feature of the data (variable) can fully characterize the problem at hand. For instance, many different biomolecular and biophysical features of human cells were analyzed (Phillip et al., 2017) to predict cellular age in healthy humans. This is only possible if these features contain enough information about the aging of the patients. To show that, we re-analyzed a large and a small dataset with information on nuclei morphology and cell motility, respectively, collected by Phillip et al. (Phillip et al., 2017). The information from 2 year-old human cells (the youngest ones) was compared with that of the rest of the ages. The decay of $p(n)$ for cell nuclei area and short axis length is directly related to the age of the human cells. The parameter $c$ (Eq. 1) for the orientation of the cell nuclei is null in all cases, which indicates that this measure does not contain information about aging (Fig. 2c and Table S5 in the Supplementary Material). Moreover, the estimated function $p(n)$ for the total diffusivity of the cells of 2 year-old and 3 year-old human donors shows that even if a larger dataset were given, the result would remain the same (Fig. 2d and Table S4 in the Supplementary Material). Namely, $p(n)$ does not decrease; therefore, there is strong evidence that the null hypothesis is true (i.e. $\theta = 0$, groups behave similarly). The most extreme cases, given by the differences between 2 and 96 year-old human donors, can also be detected without the need for large datasets, $n_\alpha = 11$ (Fig. 2d). That is, the estimation of $p(n)$ allows one to decide whether or not it is valuable to collect new data to determine differences among the studied groups.

DISCUSSION
The use of statistical hypothesis testing is widespread and well established in scientific research. Moreover, the number of statistically significant p-values reported in scientific publications has increased over the years (Chavalarias et al., 2016), and there exists a tendency among researchers to look for the combination of data that provides a p-value smaller than 0.05 (Bruns and Ioannidis, 2016). However, the assessment of the p-value has some drawbacks that can lead to spurious scientific conclusions (Altman and Krzywinski, 2017; Amrhein et al., 2019; Bruns and Ioannidis, 2016; Halsey et al., 2015; Nuzzo, 2014). The data recorded from high-content, high-throughput studies, and the capacity of computers to analyze thousands of numbers, have enabled us to shed light on the current uncertainty around the widely used p-value.
We report clear evidence of the well-known dependence of the p-value on the size of the data (Altman and Krzywinski, 2017; Krawczyk, 2015; Lin et al., 2013). While some approaches analyze the distribution of empirically estimated p-values, also known as the p-curve (Simonsohn et al., 2014), to the best of our knowledge there are no approaches that focus on the size dependence shown here to assess decision making.
Due to the lack of new techniques to face the latter, we believe that our method will have a large impact on the way scientists perform hypothesis testing.
The approximation of the function $p(n)$ by a basic exponential function lets us analyze the data more robustly by using the decay of $p(n)$. With it, we provide a new perspective on hypothesis testing that avoids treating the p-value as a dichotomous index and enables the study of the variability of the data.
Using a simple mathematical formulation, a robust decision index $\theta$ is defined to enable good praxis in the same context as statistical hypothesis testing. Indeed, the presented method is transferable to any field of study, just like common null-hypothesis testing. Moreover, the presented approach, used as a preliminary analysis, provides evidence about the existence (or not) of statistical significance, even when large datasets are not available. Therefore, it supports the management of new data collection and can help researchers reduce the cost of collecting experimental data.

Fig. 1. The p-value is a variable that depends on the sample size and can be modelled as an exponential function ($p(n) = a\,e^{-cn}$, Eq. 1). a) For each pair of normal distributions being compared, two subsets of size $n$ are obtained by sampling from the corresponding normal distribution. Then, these datasets are compared using the Mann-Whitney statistical test and the p-value obtained is stored. The procedure is repeated many times for each size $n$. The blue bars, with the standard error of the mean (SEM), show the distribution of all the p-values obtained at each size $n$ when two normal distributions of mean 0 and 0.1 and standard deviation 1 are compared. The blue curve shows the corresponding exponential fit. The magenta and yellow curves represent the resulting $p(n)$ function when a normal distribution of mean 0 and standard deviation 1 is compared with a normal distribution of the same standard deviation and mean 0.25 and 0.5, respectively. b) The decay of $p(n)$ (parameters $a$ and $c$ of the exponential fit) increases with the mean value of the normal distribution being compared with $N(0, 1)$. The larger the distance between the means of the distributions, the faster the decay of the exponential function (Table 1 in Materials and Methods). c) Comparison of $p(n)$ (red curve) and the significance area at 95% (blue area). If the area under the red curve is smaller than the blue area, then there is strong statistical significance. The parameter $n_\alpha$ measures the minimum data size needed to find statistical significance. The parameter $n_\gamma$ measures the convergence of $p(n)$: $p(n = n_\gamma) \approx 0$. The binary decision index $\theta$ indicates whether the area under $p(n)$ from 0 to $n_\gamma$ is larger than the area under the $\alpha$-level (blue box) in the same range. d) The faster the decay of $p(n)$, the stronger the statistical significance of the tested null hypothesis. For $\gamma = 5 \times 10^{-6}$, $\theta_{\alpha,\gamma} = 1$ whenever the mean value of the normal distribution compared with $N(0, 1)$ is larger than 0.5 (Table 1 in Materials and Methods). e) The empirical estimation of $p(n)$ with small datasets enables the detection of the most extreme cases: those in which the null hypothesis can be accepted, and those in which it clearly cannot. f) The minimum data size needed to obtain statistical significance ($n_\alpha$) is inversely related to the mean value of the normal distributions being compared.

Fig. 2. [...] is higher ($a$ and $c$ parameters in Eq. 1), i.e. it decreases much faster than the one corresponding to the comparison of control and Taxol at 1 nM (orange curve). b) Flow cytometry data were recorded to determine the transcriptional changes induced by the in vivo exposure of human eosinophils to glucocorticoids. (Left) The entire dataset has a wider range of values and a smaller 95% confidence interval around the mean than the distribution obtained when the median fluorescence intensity (MFI) is calculated for each of the 6 subjects. (Right) There is an increase in the surface expression of CXCR4 when human eosinophils are exposed to 20 or 200 mcg/dL of Methylprednisolone. Namely, the minimum size $n_\alpha$ is low and the decision index $\theta = 1$ when any of those conditions is compared with the vehicle condition. Note that the decay parameters $a$ and $c$ are almost the same in those two cases, so the markers co-localize (Supplementary Material). The minimum size $n_\alpha$ when eosinophils are treated (blue circle) is not shown as it has infinite value. c) The morphology of 2 year-old human cells is compared with the morphology of 3, 9, 16, 29, 35, 45, 55, 65, 85 and 96 year-old human cells. For both the nuclei area and the nuclei short axis measures, the minimum size $n_\alpha$ and the decay $c$ change proportionally with the age of the donor. The nuclei orientation does not characterize the age of the human donors: for all the comparisons, the parameter $c$ is null and, therefore, $p(n)$ is constant. d) The analysis of a small dataset is enough to determine that the total diffusivity can characterize cellular aging in humans. The total diffusivities of 2, 3 and 9 year-old human cells are equivalent, while they differ from those of cells from older human donors.

Drug analysis on phase contrast microscopy
Phase contrast microscopy images of a human invasive ductal carcinoma (MDA-MB-231) cell line were acquired. The set-up was composed of a Cascade 1K CCD camera (Roper Scientific) mounted on a Nikon TE2000 microscope with a 10X objective lens. Cells were embedded in a 3D collagen type I matrix at 100,000 cells/mL. The time-lapse videos were recorded every two minutes with a focal plane at least 500 µm away from the bottom of the culture plates to diminish edge effects (He et al., 2017). Three different groups of cells were analyzed: control, and treated with fresh media at 1 nM Taxol and 50 nM Taxol. Ten videos of 16.5 hours each (500 frames of 809 µm x 810 µm with a resolution of 0.806 µm/pixel) were analyzed per group.
All videos were automatically processed using a convolutional neural network (U-net (Ronneberger et al., 2015)) to get binary masks for the cell bodies and their protrusions. The resulting semantic segmentation corresponds only to in-focus cells in the image. For each of these cells, the body and protrusions are segmented. See some examples of the resulting segmentation in Figure S2.
Using the segmentations, eight different morphological measurements were calculated: cell body size (CS), cell body perimeter (CP), cell body roundness (CR), cell with at least one protrusion (Pb), protrusion size (PS), protrusion perimeter (PP), protrusion length (PL) and protrusion diameter (PD) (Table S1). Further information about the distribution of each of the measurements is given in the Supplementary Material. As in the analysis done for CR (Fig. 2a in the main manuscript), the differences between control and 1 nM and 50 nM Taxol were analyzed using each of the remaining variables (Figures S4-S6 and Table S3).

METHOD DETAILS
Here, we first provide the mathematical details behind our hypothesis that the p-value is a variable that critically depends on the size of the sample and that the p-value function can be approximated with an exponential function of the sample size $n$. Then, we define how to work with the p-value as a function and how to determine when a statement of statistical significance can be made ($\theta_{\alpha,\gamma}$, Eq. 9). Once the problem is described technically, it is possible to calculate the minimum size $n_\alpha$ at which the null hypothesis of the test can be rejected with statistical significance (Eq. 11). This parameter $n_\alpha$ can be used to characterize the data. Finally, the reliability of our method is rigorously tested.

p-value as an exponential function of data size
Fig. 3 illustrates the idea that the p-value is a function that depends on the sample size $n$. There exists a continuous inverse relation between p-values and $n$, i.e. p-values decrease when $n$ increases (Altman and Krzywinski, 2017; Krzywinski and Altman, 2013a, 2013b). This allows us to assume that p-values can indeed be considered a function of $n$, i.e. $p(n)$.
With either the Mann-Whitney U test (Mann and Whitney, 1947) or Student's t-test (Student, 1908), it can be proved that the obtained p-value converges to zero when the sample size is large and the distributions being assessed are not exactly the same, i.e., the p-value tends to zero when the sample size tends to infinity.
A mathematical demonstration of this statement is available in the Supplementary Material.
Going a step further, we claim that the p-values can indeed be written directly as a function of $n$, $p(n)$, and that this function is well approximated by an exponential function. To show this, we first estimate the value of the p-value function at each possible value of $n$. This can be done easily with the Monte Carlo cross-validation method (MCCV) (Xu and Liang, 2001): at each iteration $i$ of the procedure, $n = n_i$ is fixed, and two populations of size $n_i$ are compared. This procedure is repeated many times in each given iteration $i$ to cover the variability of the problem at $n = n_i$. At the end, we have as many sets of p-values as iterations $i$, one set for each sample size $n_i$. Note that this procedure is similar to the upstrap (Crainiceanu, 2018) using an increasing fraction of the sample.
The details of the procedure followed for the estimation of the p-values are explained in the Supplementary Material.
In Fig. 3, the procedure is applied using random populations from different normal distributions. We distinguish two different situations: either the obtained distributions are uniform, so the mean value of all the p-values is constant for any $n$ (Figs. 3a and 3b); or the distributions tend to decrease when the sample size $n$ increases (Figs. 3c-f). In other words, $p(n)$ can be written as a continuous function. Hence, for each iteration $i$, each set of p-values is averaged to obtain the empirical estimation of the function $p(n)$ at $n = n_i$ (red markers in Fig. 3). Then, a smooth curve is fitted to these values using locally weighted scatter plot smoothing (LOWESS) (Cleveland, 1979), which shows that $p(n)$ has an exponential shape (Figs. 4a and 4b).
To prove that the estimated function $p(n)$ can be written as an exponential function, it is sufficient to verify that the quotient between its first derivative and the function itself, $p'(n)/p(n)$, is constant [...] whether there exists a statistical difference, as, for instance, when $N(0, 1)$ and $N(0.1, 1)$ are compared. When $p(n)$ becomes very small, the quotient $p'(n)/p(n)$ has more outliers, especially when the sample size $n$ is small. This can be observed when comparing $N(0, 1)$ with $N(0.75, 1)$, $N(1, 1)$, $N(2, 1)$ and $N(3, 1)$ (Fig. 4d). These are extreme cases in which there exist clear differences between populations and, therefore, p-values are close to zero most of the time.
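The exponential-shape check above can be illustrated numerically. A function is exponential exactly when $p'(n)/p(n)$ is constant, and that quotient equals the derivative of $\log p(n)$, so a finite-difference estimate on the logarithm suffices. The sketch below is illustrative only; the helper names `log_derivative` and `is_exponential` and the tolerance value are assumptions, not part of the original method.

```python
import math


def log_derivative(ns, ps):
    """Finite-difference estimate of p'(n)/p(n) between consecutive points.

    Uses d(log p)/dn, which equals p'(n)/p(n) for any positive p.
    """
    return [
        (math.log(ps[i + 1]) - math.log(ps[i])) / (ns[i + 1] - ns[i])
        for i in range(len(ns) - 1)
    ]


def is_exponential(ns, ps, rel_tol=0.05):
    """True if the quotient estimates vary by less than rel_tol of their mean size."""
    q = log_derivative(ns, ps)
    mean_abs = sum(abs(v) for v in q) / len(q)
    return (max(q) - min(q)) <= rel_tol * mean_abs
```

For a curve of the form $a\,e^{-cn}$ every quotient equals $-c$, whatever the spacing of the sample sizes; with noisy empirical means the quotients scatter around that value, which is where outliers of the kind mentioned above would appear.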
As we have shown above that the estimated function $p(n)$ can be written as an exponential function, an exponential curve is fitted to all the p-values calculated with MCCV (Figs. 4a and 4b). The LOWESS and exponential curves are very close to each other, even though the former was fitted using the mean value of each group of p-values and the latter using all of them. An exponential fit is more suitable in this case, as it is calculated with all the values obtained through MCCV and only outputs positive values by definition. A LOWESS approximation can occasionally lead to biased negative values, such as when $N(0, 1)$ and $N(0.75, 1)$ are compared, while the p-values are positive by definition. Note that as $p(n) \to 0$ when $n \to \infty$, $c < 0$ necessarily in Eq. 2. Therefore, we assume from now on that $p(n)$ can be given as an exponential function of the form
$$p(n) = a\,e^{-cn} \tag{3}$$
Here the parameters $a$ and $c$ control the amplitude and the decay of the function $p(n)$, respectively. If $c = 0$, then the value of $p(n)$ would be uniform in $n$: $p(n) = a$. As p-values are computed probabilities and the global maximum of $p(n)$ is $a$, $a$ belongs to the $[0, 1]$ interval.
size (3, 5, 10, 15, 30, 45, 60, 90, 200, 670, 1750

Distance to the 𝜶-level of statistical significance
The ideal case of true $(1 - \alpha)$ statistical significance would lead to the rejection of the null hypothesis independently of data size, i.e., p-values would always be lower than $\alpha$. Hence, we claim that whenever there exist real statistically significant differences between two samples, $p(n)$ reaches $\alpha$ rapidly. So, the values of $p(n)$ are mostly distributed in a range smaller than $\alpha$. Therefore, we compare all the values of the curve $p(n)$ with $\alpha$. In the discrete case, we would evaluate $\alpha - p(n = n_i)$ for each index $i$ and sum all the results: if the sum is positive, then $p(n)$ is smaller than $\alpha$ most of the time. In the continuous case, this sum is obtained by integrating the difference
$$\delta_\alpha(n) = \int_0^n \left(\alpha - p(x)\right) dx = A_\alpha - A_{p(n)} \tag{4}$$
where $A_\alpha$ is the area under the constant function $\alpha$ and $A_{p(n)}$ is the area under the estimated p-value curve, $p(n)$ (Fig. 5). A positive $\delta_\alpha(n)$ implies that $A_\alpha$ is larger than $A_{p(n)}$, i.e. most of the values of $p(n)$ are below the significance threshold $\alpha$; a negative $\delta_\alpha(n)$ implies the opposite.
As shown in the next paragraphs, Eq. 4 aims to quantify and evaluate the distribution of p-values (i.e., the distribution of $\{(n, p(n)),\ n \in \mathbb{N}\}$), taking into account two aspects: whether (1) most of the p-values are smaller than $\alpha$ and (2) the decay of $p(n)$ is large enough.
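The discrete sum and its continuous counterpart can be sketched as below. This is an illustrative sketch: the function names are hypothetical, the integral is approximated with the trapezoidal rule, and the integration is assumed to start at $n = 0$.

```python
import math


def discrete_distance(alpha, p_values):
    """Sum of alpha - p(n_i) over the evaluated sample sizes."""
    return sum(alpha - p for p in p_values)


def integrated_distance(alpha, p, n_max, steps=10_000):
    """Trapezoidal approximation of the integral of alpha - p(n) over [0, n_max]."""
    h = n_max / steps
    ys = [alpha - p(k * h) for k in range(steps + 1)]
    return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


if __name__ == "__main__":
    # Example curve p(n) = 0.5 * exp(-n), compared with alpha = 0.05 on [0, 20].
    print(integrated_distance(0.05, lambda n: 0.5 * math.exp(-n), 20.0))
```

A positive result means that the area under the $\alpha$ level exceeds the area under $p(n)$ on the chosen range; the sign therefore depends on how far the range extends, which is why a convergence point has to be fixed before a decision is taken.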

Mathematical formulation of the decision index
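By means of the exponential expression of $p(n)$ given in Eq. 3, the measure $\delta_\alpha(n)$ (Eq. 4) can be rewritten as follows:
$$\delta_\alpha(n) = \alpha n + \frac{a}{c}\left(e^{-cn} - 1\right) \tag{5}$$
Given the ranges of $a$ and $c$, $\delta_\alpha(n)$ is still well-defined. However, in the limit of $n$, $\delta_\alpha(n)$ will always be positive and it tends to infinity:
$$\lim_{n \to \infty} \delta_\alpha(n) = +\infty \tag{6}$$
Also, from a practical perspective, the area of interest to evaluate the decay of $p(n)$ is that enclosed between zero and its convergence point $n_\gamma$: $|p'(n_\gamma)| \approx 0$. Namely, a relevant sub-sample size $n_\gamma$ can be computed as
$$|p'(n_\gamma)| = a\,c\,e^{-c\,n_\gamma} = \gamma \;\Longrightarrow\; n_\gamma = \frac{1}{c}\ln\frac{a\,c}{\gamma} \tag{7}$$
where $\gamma$ is the threshold chosen to determine the convergence point (Fig. 5). Finally, $\delta_{\alpha,\gamma}$ is now formally defined as
$$\delta_{\alpha,\gamma} = \delta_\alpha(n_\gamma) = \alpha\,n_\gamma + \frac{a}{c}\left(e^{-c\,n_\gamma} - 1\right) \tag{8}$$
As claimed at the end of the last section, the computation of $\delta_{\alpha,\gamma}$ enables the identification of a rapid convergence to zero at small values of $n$ induced by the high slope of $p(n)$, which is indicative of the existence of true statistically significant differences.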
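The decision index we propose, $\theta_{\alpha,\gamma}$, is defined as
$$\theta_{\alpha,\gamma} = \begin{cases} 1, & \text{if } \delta_{\alpha,\gamma} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{9}$$
where $\delta_{\alpha,\gamma}$ follows Eq. 8.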
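The chain from the fitted parameters to the binary decision can be sketched in a few lines. The sketch rests on assumptions that should be checked against the original equations: the convergence point is taken where the slope $|p'(n)| = a\,c\,e^{-cn}$ equals $\gamma$, the distance is the integral of $\alpha - a\,e^{-cn}$ from 0 to that point, the index is 1 when the distance is strictly positive, and $c > 0$. The function names are hypothetical.

```python
import math


def n_gamma(a, c, gamma):
    """Sample size at which the slope a*c*exp(-c*n) has fallen to gamma.

    Returns 0.0 when the slope is already below gamma at n = 0 (assumed convention).
    """
    if a * c <= gamma:
        return 0.0
    return math.log(a * c / gamma) / c


def delta(a, c, alpha, n):
    """Integral of alpha - a*exp(-c*x) over [0, n], assuming c > 0."""
    return alpha * n + (a / c) * math.expm1(-c * n)


def theta(a, c, alpha=0.05, gamma=5e-6):
    """Binary decision index: 1 if the alpha area exceeds the area under p(n)."""
    if a < alpha:
        # The text states that the index is 1 for any gamma when a < alpha.
        return 1
    return 1 if delta(a, c, alpha, n_gamma(a, c, gamma)) > 0 else 0


if __name__ == "__main__":
    print(theta(0.5, 1.0), theta(0.5, 0.01))
```

With $a = 0.5$, a fast decay ($c = 1$) yields a positive distance, while a slow decay ($c = 0.01$) yields a negative one, matching the intuition that a faster decay means stronger significance.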

Restricting an optimal threshold
The proposed approach depends on two thresholds: (1) the significance threshold $\alpha$ and (2) the convergence threshold $\gamma$. The former measures the level of statistical significance, while the latter controls decisions.
Therefore, the only critical threshold to discuss in this work is $\gamma$.
The rules to follow for the selection of the threshold $\gamma$ are:
• The parameter $a$ is the maximum value that $p(n)$ can take. Therefore, if $a$ is smaller than $\alpha$, then $\theta_{\alpha,\gamma} = 1$ for any $\gamma$ given.
• As $\delta_\alpha(n)$ tends to infinity with $n$, the smaller the value of $\gamma$ is set, the larger $n_\gamma$ will be and the chances of $\theta_{\alpha,\gamma} = 1$ will also increase.
• The values of $\gamma$ should be small: $\alpha$ is considered a significant number and $p(n)$ values are constantly compared with it. It seems reasonable to compare the slope of $p(n)$ at the convergence point with a value smaller than $\alpha$, which is usually smaller than 0.

Data characterization in stable and uncertain cases.
The threshold $\gamma$ controls severe decisions. Namely, the lower this value is set, the less strict the decision will be. In Fig. 6a, we show the dynamics of $\theta_{\alpha=0.05,\gamma}$ when $\gamma$ changes: the dark area ($\theta_{\alpha=0.05,\gamma} = 1$) increases inversely to $\gamma$, showing that the chances of rejecting the null hypothesis increase as well. Moreover, the limit between the dark and light ($\theta_{\alpha=0.05,\gamma} = 0$) areas is precisely the curve $\delta_{\alpha,\gamma} = 0$. The value of $\gamma$ determines this curve and, therefore, the conditions for which $\theta_{\alpha=0.05,\gamma} = 1$ (dark area) and $\theta_{\alpha=0.05,\gamma} = 0$ (light area). In Fig. 6b, we illustrate the condition $\delta_{\alpha,\gamma} = 0$ when $\alpha = 0.05$, as a function of $a$, $c$ and $\gamma$. The case $\gamma = 5 \times 10^{-6}$ is underlined in black.
There exist some points $(a, c)$ for which the rejection of the null hypothesis is independent of $\gamma$. A clear example is the case in which $a \geq \alpha$ and $c \approx 0$. These cases represent the situation in which the null hypothesis cannot be rejected with a statistical significance of level $\alpha$, for instance, when $N(0, 1)$ is compared with $N(0, 1)$ or $N(0.01, 1)$ (Fig. 1b in the main manuscript, Figs. 3a-b). Likewise, if $a \leq \alpha$ or $c$ is large enough, the null hypothesis is always rejected with a statistical significance of level $\alpha$, for instance, when $N(0, 1)$ is compared with $N(2, 1)$ or $N(3, 1)$ (Fig. 1b in the main manuscript, Figs. 3e-f).
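The effect of the convergence threshold can be explored by sweeping it for a fixed pair of fitted parameters. The sketch below is hypothetical and relies on the same assumptions as the closed forms it uses: a convergence point where the slope $a\,c\,e^{-cn}$ equals $\gamma$, a distance equal to $\alpha n + (a/c)(e^{-cn} - 1)$ evaluated there, an index of 1 for a strictly positive distance, and $c > 0$. The values $a = 0.5$ and $c = 0.2$ are made up for illustration and are not taken from the tables of the paper.

```python
import math


def decision_index(a, c, alpha, gamma):
    """Decision index for p(n) = a*exp(-c*n), with c > 0, at thresholds alpha, gamma."""
    if a < alpha:
        return 1  # p(n) never exceeds alpha
    # Convergence point: slope a*c*exp(-c*n) equals gamma (clamped at 0).
    n_conv = math.log(a * c / gamma) / c if a * c > gamma else 0.0
    distance = alpha * n_conv + (a / c) * math.expm1(-c * n_conv)
    return 1 if distance > 0 else 0


def sweep_gamma(a, c, alpha, gammas):
    """Decision index for each convergence threshold in `gammas`."""
    return [decision_index(a, c, alpha, g) for g in gammas]


if __name__ == "__main__":
    print(sweep_gamma(0.5, 0.2, 0.05, [2.5e-6, 5e-6, 5e-5]))
```

For a borderline curve such as this one, the smallest threshold still yields a rejection while the larger ones do not, which is the sense in which a lower value gives a less strict decision.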
The proposed methodology also lets us classify each case by its level of uncertainty. The coefficients of a fitted exponential curve are precisely the coordinates of a point in any of the plots in Fig. 6. Therefore, once an exponential curve is fitted and the parameters $a$ and $c$ are known, it is possible to know the position of the case under study in the graph: clear cases will always be close to the left or to the right side of the graphs in Fig. 6, while the most unstable or unclear cases will be placed in the middle. Therefore, with this method, it is possible to determine when there are statistically significant differences, and when these differences are not sufficiently clear and it might be necessary to perform a deeper study.
An intuitive interpretation of statistically significant differences between two groups (the classical threshold p-value < $\alpha$) is that the confidence intervals of their means do not overlap. These intervals shrink when the size of the data increases (Krzywinski and Altman, 2013c). Therefore, this section is devoted to studying how large two populations must be in order to obtain non-overlapping intervals. Interestingly, the estimation of the function $p(n)$ allows us to determine the specific minimum value of $n$, $n_\alpha$, for which $p(n)$ is lower than the significance level $\alpha$ (Fig. 5). This value is the solution to the equation
$$a\,e^{-c\,n_\alpha} = \alpha \;\Longrightarrow\; n_\alpha = \frac{1}{c}\ln\frac{a}{\alpha} \tag{11}$$
As computed, $n_\alpha$ represents the minimum sample size needed to obtain a statistically significant p-value, in case it exists. In other words, reproducing an experiment with $n_\alpha$ samples assures the rejection of the null hypothesis. The estimated $n_\alpha$ allows one to assess the strength of the evidence against the null hypothesis. If $n_\alpha$ is small, the statistical difference is very clear and the two populations are distinguishable.
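Solving $a\,e^{-cn} = \alpha$ for $n$ gives the minimum size directly from the fitted parameters. The helper below is a hypothetical sketch of that calculation; the conventions chosen for the edge cases (an infinite value when the curve does not decay, and zero when the curve starts below the threshold) are assumptions made for illustration.

```python
import math


def n_alpha(a, c, alpha=0.05):
    """Smallest n with a*exp(-c*n) <= alpha, from the fitted a and c."""
    if a <= alpha:
        return 0.0       # the curve is below alpha from the start
    if c <= 0:
        return math.inf  # constant (or growing) curve never reaches alpha
    return math.log(a / alpha) / c


if __name__ == "__main__":
    print(n_alpha(1.0, 0.5), n_alpha(0.5, 0.0))
```

The infinite value for a non-decaying curve corresponds to the cases in which the minimum size cannot be determined because the p-values stay uniform.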
The parameters $a$ and $c$ in Eq. 11 are obtained empirically through MCCV, so they can introduce some bias in the calculation of $n_\alpha$. Hence, a better estimator of $n_\alpha$, $\hat{n}_\alpha$, can be computed using the p-values obtained directly from the data and their variance: it relies on the mean of each set of p-values obtained with MCCV and on the corresponding standard error of the mean (SEM), which is included to correct for the variability of the estimated p-values. The estimator $\hat{n}_\alpha$ is limited to those cases in which the data is large enough: if the size of the data is smaller than $n_\alpha$, then $\hat{n}_\alpha$ cannot be computed (Fig. 1e-f, Fig. 2d in the main manuscript). As $\hat{n}_\alpha$ is more restrictive than $n_\alpha$, its value will always be slightly larger (Table 1).

Test of reliability
Unlike many computational methods, the analysis of the statistical significance of the differences between two groups cannot be evaluated by means of ground-truth data, simulations or human-made annotations. Nonetheless, it is possible to determine the robustness and reproducibility of the results, namely, whether the statistical significance is maintained when the experiment is repeated. To do so, we test our method using simulated normal distributions.
To evaluate the effect of $\gamma$, $p(n)$ is simulated for all pairs of normal distributions and compared with a significance level of $\alpha = 0.05$ using different values of $\gamma$ (Table S7 in the Supplementary Material). The lower the convergence criterion $\gamma$ is, the less restrictive the diagnosis is (Fig. 6). Using the simulated data, the range of $\Theta_{\alpha=0.05,\gamma}$ values obtained lets us recommend a value for this parameter. When $N(0, 1)$ and $N(0.1, 1)$ are compared with $\gamma = 2.5 \cdot 10^{-6}$, the decision index $\Theta_{\alpha=0.05,\gamma=2.5 \cdot 10^{-6}} = 1$ indicates that there exist statistically significant differences between both distributions, which is the opposite of what we expected. If the value of the parameter $\gamma$ increases, the statistical significance is rejected in those cases in which there is a larger uncertainty. For instance, when $N(0, 1)$ and $N(0.25, 1)$ are compared with $\gamma = 5 \cdot 10^{-5}$, $\Theta_{\alpha=0.05,\gamma=5 \cdot 10^{-5}} = 0$. However, the latter is not straightforward for two reasons: $\delta_{\alpha=0.05,\gamma=5 \cdot 10^{-5}} = -5.84$ (small difference) and $\hat{n}_\alpha = 186$ (few samples are needed to observe statistically significant differences).
Therefore, we strongly recommend the use of $\gamma = 5 \cdot 10^{-6}$.
To test the generality of these results, the same procedure was repeated several times, changing the samples of the normal distributions being compared. Hence, it is possible to provide the probability that the resulting $\Theta_{\alpha,\gamma}$ is the same as the one stated in Table 1. Additionally, the presented method is limited by the computational time needed to perform the MCCV iterations: the more iterations we compute, the longer the process takes. Moreover, the accuracy of any estimated $p(n)$ depends on the sample sizes $n = n_i$ and the p-values $p_i$ that the program can evaluate. Therefore, we also tested the results of the method when the number of $n_i$ and $p_i$ in MCCV is reduced. Overall, the probability of obtaining exactly the same result under any change of the previous conditions was calculated (Table S8 in the Supplementary Material). The closer this probability gets to 100%, the more robust and general the result is. We can confirm that the results are, most of the time, the same as the ones given in Table 1 when $\gamma = 5 \cdot 10^{-6}$. The only critical case is the comparison $N(0, 1)$ - $N(0.5, 1)$ when few $n_i$ points are used to estimate $p(n)$.
The last procedure was repeated using the real data from the first experiment (study of the effect of Taxol on cell body and protrusion morphology) (Tables S9 and S10 in the Supplementary Material). Even with more complex and noisier data, the results obtained show that the method is stable and robust. All technical details about these computations are given in the Supplementary Material.

Technical details
The main motivation of the study is that the p-value is no longer useful when working with large datasets, as its value tends to zero. In the next section, we show for the particular cases of the Mann-Whitney U test (1) and Student's t-test (2) that the p-value will indeed always tend to zero, even when the null hypothesis is almost true and should not be rejected.
A. p-values tend to zero for large sample sizes. The statistic $U$ of the Mann-Whitney U test (1) is defined as $\min\{U_1, U_2\}$, where $U_i$ follows Eq. (1), $n_i$ being the size of dataset $i$ and $R_i$ its rank sum.
When the $n_i$ are large enough, $U$ follows a normal distribution (1), with mean and standard deviation, $\mu_U$ and $\sigma_U$ respectively, described by Eq. (2).
Therefore, the main procedure to estimate the p-value consists in analyzing the standardized value of $U$, $z$, defined by Eq. (3). Replacing the values of $U_i$, $\mu_U$ and $\sigma_U^2$ in Eq. (3), we obtain Eq. (4). Note that either $U_1$ or $U_2$ can be chosen as the minimum value for the Mann-Whitney statistic $U$. Hence, for simplicity, $U = U_1$ is assumed.
In the worst-case scenario, when both datasets are identical and therefore the null hypothesis should be true, $R_1 = R_2 = R$. Also, as the $n_i$ are assumed to be large enough, we can study the case $n_1 = n_2 = n$. Moreover, due to the hypothesized large sample size, $R_i$ can be bounded from above. Finally, the value of $z$ in the limit, when $n$ tends to infinity, is also infinity. Therefore, the p-value tends to zero. That is to say, even when we assume that both datasets are equal, the result would be to reject the null hypothesis. Likewise, Student's t-test (2) fails for large samples. The statistic $t$ is defined by Eq. (7), where $\mu_i$ and $n_i$ correspond to the mean and sample size of dataset $i \in \{1, 2\}$. Once again, assuming that both $n_i$ are large enough, $n_i = n$ is accepted; $t$ is directly compared with the Student's t distribution and, in the limit of $n$, $t$ tends to infinity (as long as both mean values are not exactly the same). Thus, the p-value tends to zero and the null hypothesis is rejected.
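The dependence of the p-value on the sample size can be illustrated numerically. The sketch below uses only the Python standard library and the usual textbook normal approximation of the Mann-Whitney statistic, with $U_1 = R_1 - n_1(n_1+1)/2$, $\mu_U = n_1 n_2/2$ and $\sigma_U^2 = n_1 n_2 (n_1 + n_2 + 1)/12$. This convention is an assumption and may differ from the one in Eqs. (1)-(4). The two simulated groups, $N(0, 1)$ and $N(0.1, 1)$, are chosen only as an example of a small difference.

```python
import random
from statistics import NormalDist


def mann_whitney_p(x, y):
    """Two-sided p-value from the normal approximation of U (no tie correction)."""
    n1, n2 = len(x), len(y)
    # Rank the pooled sample; continuous data are assumed, so ties are not handled.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    r1 = sum(rank for rank, (_, group) in enumerate(pooled, start=1) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    mu_u = n1 * n2 / 2
    sigma_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u1 - mu_u) / sigma_u
    return 2 * NormalDist().cdf(-abs(z))


rng = random.Random(0)
for n in (20, 200, 2000, 20000):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [rng.gauss(0.1, 1.0) for _ in range(n)]
    print(n, mann_whitney_p(x, y))
```

For small $n$ the printed p-values fluctuate with the random draw, whereas for the largest sizes the same small shift in the mean produces a p-value far below 0.05.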

B. p-values as a function of the sample size. The proofs in Section A let us conclude that the p-values depend on the size of the data being evaluated. While this is not a breakthrough, it is one of the pillars of this study. The fact that the p-value varies with $n$ allows us to treat it as a function of $n$. In the case of Student's t-test, it is straightforward that the statistic $t$ is $n$-dependent (Eq. 7). Note that the mean and standard deviation values are similar for any $n$.
In the case of the statistic $z$ (Eq. 3), it can be slightly more complicated to prove the same statement. However, it is easy to see that $U$, $\mu_U$ and $\sigma_U$ depend on $n$, and that $z$ will always increase with $n$ (i.e., the p-value decreases). Therefore, we can assume that the estimated p-values can be written as a function of $n$.

C. Estimation of p-values with Monte Carlo cross validation method. We propose to model the p-value empirically as a function of the data size ($p(n)$) by Monte Carlo cross validation (MCCV) with replacement (3). This way, the effect of the sample bias on the p-value can be ignored. The procedure followed for the estimation of the p-values is illustrated in Figure S1 and the corresponding pseudocode in Algorithm 1. Notice that in Algorithm 1 we estimate $p(n)$ in two different ways, using either a locally weighted scatter plot smoothing (LOWESS) approximation (4) ($p_L$) or an exponential fit ($p_e$). The main reason for this is that we use a standard curve-fitting method (LOWESS) to show that $p(n)$ is exponential.
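A minimal sketch of the exponential fit is given below. It assumes that the coefficients of $a e^{-cn}$ are estimated by ordinary least squares on $\log p$, which is a simplification: the fitting routine actually used in Algorithm 1 is not specified here. The function name and the synthetic data are hypothetical.

```python
import math


def fit_exponential(sizes, pvalues):
    """Least-squares fit of log(p) = log(a) - c * n; returns (a, c)."""
    pairs = [(n, math.log(p)) for n, p in zip(sizes, pvalues) if p > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two positive p-values")
    n_mean = sum(n for n, _ in pairs) / len(pairs)
    y_mean = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((n - n_mean) ** 2 for n, _ in pairs)
    sxy = sum((n - n_mean) * (y - y_mean) for n, y in pairs)
    slope = sxy / sxx
    intercept = y_mean - slope * n_mean
    return math.exp(intercept), -slope


# Synthetic, noise-free curve used only to check that the fit recovers a and c.
sizes = [2, 10, 50, 100, 500]
pvalues = [0.5 * math.exp(-0.01 * n) for n in sizes]
a, c = fit_exponential(sizes, pvalues)
print(round(a, 3), round(c, 4))
```

Fitting in log space weights small p-values more heavily than a direct nonlinear fit would, so the two approaches can give different coefficients on noisy data.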
Aiming to compare two sets of values, $S_A$ and $S_B$, and to determine whether there exist statistically significant differences between them, the estimation of the p-values is done in pairs (i.e., two sub-samples are compared each time). A range of values needs to be defined for both the sample size and the number of folds in MCCV. These are given by the grids $N$ and $F$, respectively.
The range of possible sub-sample sizes ($n$) goes from 2 ($n_0$) to the smaller of the sizes of samples $S_A$ and $S_B$ ($N_{max}$). A grid covering all these values is computationally expensive and redundant for large $N_{max}$. As the p-value tends to zero when $n \to \infty$, the most important information is concentrated in the smallest values of $n$, so the grid $N$ follows an exponential distribution starting at $n_0 = 2$. Similarly, a large enough upper limit ($n_\infty$) is chosen such that it ensures a fast computation ($n_\infty \le N_{max}$) and the convergence of $p(n)$ to zero. Hence, $N$ is determined as in Eq. 8, where $U$ is the uniform distribution that goes from $\log(n_0)$ to $\log(n_\infty)$. In MCCV, the number of folds can be extremely large when working with large datasets and a small partition. On the contrary, for a large partition size, the number of folds might decrease dramatically. To compensate for both situations, $F$ is defined as in Eq. 9, where $U$ is the uniform distribution, $k_1$ controls the upper limit on the number of folds for small sub-sample sizes, and $k_2$ controls the lower limit for large sub-sample sizes. Note that $N$ and $F$ have the same number of elements. Finally, for each $n_i$ in $N$, MCCV is applied to obtain the set of p-values $P_i$ (Eq. 10), where $f_i$ is the number of folds in $F$ that corresponds to the sub-sample size $n_i$ in $N$.
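The procedure described above can be sketched as follows. Several elements are assumptions made for illustration only: the grid is taken to be evenly spaced in $\log(n)$, the number of folds is given by a hypothetical rule (`max(20, 2000 // n)`) rather than by Eq. 9, and a large-sample z-test on the means stands in for the statistical test.

```python
import math
import random
from statistics import NormalDist, mean, variance


def log_spaced_grid(n0, n_inf, points):
    """Integer sample sizes evenly spaced in log(n) between n0 and n_inf (duplicates removed)."""
    step = (math.log(n_inf) - math.log(n0)) / (points - 1)
    return sorted({round(math.exp(math.log(n0) + i * step)) for i in range(points)})


def z_test_p(x, y):
    """Two-sided large-sample z-test on the difference of means (stand-in test)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    if se == 0.0:
        return 1.0  # degenerate sub-samples carry no evidence
    z = (mean(x) - mean(y)) / se
    return 2 * NormalDist().cdf(-abs(z))


def mccv_pvalues(sample_a, sample_b, grid, folds, test, rng):
    """For each n_i, draw folds(n_i) pairs of sub-samples (with replacement) and test them."""
    return {
        n_i: [
            test(rng.choices(sample_a, k=n_i), rng.choices(sample_b, k=n_i))
            for _ in range(folds(n_i))
        ]
        for n_i in grid
    }


rng = random.Random(0)
sample_a = [rng.gauss(0.0, 1.0) for _ in range(3000)]
sample_b = [rng.gauss(0.5, 1.0) for _ in range(3000)]
grid = log_spaced_grid(2, 1000, 15)
p_sets = mccv_pvalues(sample_a, sample_b, grid, lambda n: max(20, 2000 // n), z_test_p, rng)
for n_i in grid:
    print(n_i, round(mean(p_sets[n_i]), 4))
```

The log-spaced grid concentrates the evaluation points at small $n$, where the p-value curve changes the most, and the fold rule assigns more repetitions to the small sub-samples, whose p-values are the most variable.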

D. Assessment of minimum data size needed for statistical significance ($n_\alpha$). The estimation of the p-value function $p(n)$ supports the computation of the minimum data size needed to obtain statistically significant differences ($n_\alpha$). This value is the solution to the equation

$$a e^{-c n_\alpha} = \alpha \qquad [11]$$

As explained in the online methods, the parameters $a$ and $c$ are the result of fitting an exponential function to the empirical values $P_i$ in Equation 10. Therefore, there exists an intrinsic bias in the estimated values $a$ and $c$. Additionally, the estimation of $p(n)$, and especially of its decay (parameter $c$), can be less precise when the data size is small (see Figure 2d in the main manuscript), so $n_\alpha$ in Equation 11 can be biased. For this reason, the calculation of a more conservative estimator, $\hat{n}_\alpha$, is strongly recommended. It is defined through an arg min, where $\bar{p}_i$ represents the mean of $P_i$ and $\sigma_{\bar{p}_i}$ the standard error of the mean (SEM), which is included to correct for the variability of the estimated p-values. Hence, the estimator of the theoretical value will always be slightly larger, $\hat{n}_\alpha \ge n_\alpha$.
However, $\hat{n}_\alpha$ can only be provided when the sample is large enough to cover those $n$ values smaller than or equal to $n_\alpha$. For this reason, whenever the data is not large enough, the theoretical value $n_\alpha$ in Equation 11 is also given. In these cases, even if $n_\alpha$ and $\Theta_{\alpha,\gamma}$ might be slightly deviated, they still serve as an indicator of the existence of statistical significance.

Experimental data
A. Experiment 1: Drug analysis on phase contrast microscopy data. Image processing analysis provided the necessary information to distinguish the cellular body and protrusions of each cell in the videos (Figure S2). Hence, we obtained eight different measurements: cell body size ($CS$), cell body perimeter ($CP$), cell body roundness ($CR$), cell with at least one protrusion ($P_b$), protrusion size ($PS$), protrusion perimeter ($PP$), protrusion length ($PL$) and protrusion diameter ($PD$). Each of the morphological measurements is given in microns. Table S1 contains the complete list of variables. Figure S3 shows the distribution of the variables used in the analysis of cellular shape. The cellular body changes with the amount of Taxol used to treat the cells. When they are treated with 50 nM Taxol, the cellular body is bigger and more rounded (Figure S3a). Besides, this same treatment prevents cells from producing long and thick protrusions (Figure S3b).
None of the continuous variables presented in Table S1 follows a normal distribution, so the comparison was carried out with the Mann-Whitney U test (1). $P_b$ was a binary variable (Figure S3c) that distinguishes protruding cells (value one). Therefore, it was analyzed by means of the Pearson $\chi^2$ test for categorical data (5).
As per the number of observations reported in Table S2 and following the methodology guidelines, we set $P_i$ with $n_0 = 2$ and $n_\infty = 2500$. $N$ and $F$ were set to have 190 points. The number of folds $F$ described in the Supplementary Material was computed using $k_1 = 1$, $k_2 = 20$ and $N_{max} = 11037$. These values were chosen to have a reasonable number of permutations for both small and large sample sizes (6,000 permutations when $n_0 = 2$, and 90 when $n_{190} = 2500$, respectively). Table S3 contains the estimated coefficients $a$ and $c$ of the exponential curve ($a e^{-cn}$) for each of the variables we analyzed and each pair of comparisons (Control - 1 nM Taxol, Control - 50 nM Taxol, and 1 nM - 50 nM Taxol). Figures S4, S5 and S6 show the shape of each of the exponential curves obtained with the coefficients in Table S3. To determine whether Taxol has a significant effect on cell morphology, $\Theta_{\alpha,\gamma}$ was chosen such that $\alpha = 0.05$ (95% of statistical significance) and $\gamma = 5 \cdot 10^{-6}$.
When comparing the control group and 1 nM Taxol, there are no statistically significant differences in cell body morphology: the curve $p(n)$ of any cell body feature decreases slowly, i.e. $n_\alpha$ and $n_\gamma$ are large and $\Theta_{\alpha,\gamma} = 0$ (Table S3 and Figure S4a). On the other hand, cells at 50 nM Taxol have a significantly higher roundness index and a bigger cellular body: when comparing control vs. 50 nM Taxol or 1 nM vs. 50 nM Taxol, the curves corresponding to $CR$ and $CS$ decrease rapidly, i.e. $n_\alpha$ and $n_\gamma$ are small, and $\Theta_{\alpha,\gamma} = 1$ (Table S3 and Figures S4b and S4c, respectively). For $CP$, it is also possible to appreciate some differences when comparing the 1 nM and 50 nM Taxol groups, i.e. $\Theta_{\alpha,\gamma} = 1$ (Table S3). Namely, the blue curve shown in Figure S4c decreases faster than those in Figures S4a and S4b. Similar results are obtained when the morphology of cellular protrusions is evaluated (Table S3 and Figure S5). While Taxol at 1 nM does not change their morphology ($\Theta_{\alpha,\gamma} = 0$ in Table S3 and Figure S5a), the effect of Taxol at 50 nM is much larger ($\Theta_{\alpha,\gamma} = 1$ in Table S3, Figures S4b and S4c).
Usually, when a categorical variable such as $P_b$ is analyzed, the input of a statistical test is a percentage rather than the raw data. Hence, when there is no statistical significance, the $p(n)$ function shoots up, as for instance in Figure ??. However, when there exist statistical differences, $p(n)$ decreases and it is possible to analyze its decay, as in Figures ?? and ??. Overall, we can say that the formation of protrusions is inhibited when 50 nM Taxol is administered: there is a significant reduction in the number of cells that form protrusions, and their protrusions are smaller (shorter and thinner) (Table S3, Figures S3b and S3c).

B. Experiment 2: Drug analysis on flow cytometry data. Flow cytometry is a technique that generates a large amount of data for each experiment. Consequently, any statistical test for group comparison results in a vanishing p-value. To avoid that situation, practitioners tend to reduce the data to a single representative measure per subject. For instance, Khoury et al. (6) acquired fluorescence intensity data from 6 different subjects and computed the median fluorescence intensity (MFI) for each of them, so the statistical test is applied to just the 6 MFI values. However, our proposal of estimating the p-value as a function of the sample size makes it possible to incorporate in the test the information given by the whole dataset and to take into consideration the deviation and bias present in the data. To illustrate the proposed procedure, we analyzed the flow cytometry data provided in (7) to determine the transcriptional changes induced by the in vivo exposure of human eosinophils to glucocorticoids. Khoury et al. (6) studied eosinophil surface proteins after exposure to glucocorticoids and demonstrated that this exposure causes the apoptosis of human eosinophils (eosinopenia) once they migrate out of the blood circulation.
While they performed an extensive analysis, we have focused our study on the data related to the chemokine receptor gene CXCR4, in particular, on the expression of CXCR4 on the surface of human eosinophils after being exposed for 2 hours to vehicle, 20 mcg/dL and 200 mcg/dL of Methylprednisolone (MP). After filtering the raw data to discard noise and debris, we obtained clean distributions to analyze (Figure 2b, right). The p-value curves computed for the pairwise group comparisons were the result of applying Mann-Whitney U statistical tests following the proposed procedure. Then, the exponential curves ($a e^{-cn}$) were fitted (Figure 2b, left). The parameter configuration was $n_0 = 2$, $N$ and $F$ had 200 points, $k_1 = 1$, $k_2 = 20$, $\alpha = 0.05$, and $\gamma = 5 \cdot 10^{-6}$. As the number of data points was lower than 1,000, $n_\infty$ and $N_{max}$ were chosen to be the minimum number of points for each pair of groups being compared. The results are summarized in Table S4. Our results are similar to those in (6), in the sense that we also find a differential expression of CXCR4 when eosinophils are exposed to glucocorticoids.

C. Experiment 3: Cellular age characterization by means of biomolecular and biophysical properties.
The data extracted from Phillip et al. (8) were analysed with the following parameter configuration: for each of the stated variables, the data belonging to the group of 2-year-old donors were compared with the data from {3, 9, 16, 29, 35, 55, 65, 85, 92} and {3, 9, 16, 29, 35, 55, 65, 85, 96} year-old human donors to test cell motility and morphology, respectively. For each pair of groups, the distribution of the $n$-dependent p-values was obtained using the Mann-Whitney U statistical test. Then, the parameters of the exponential function ($a e^{-cn}$) were fitted. The parameter configuration was $n_0 = 2$, $N$ and $F$ had 200 points, $k_1 = 1$, $k_2 = 20$, $\alpha = 0.05$, and $\gamma = 5 \cdot 10^{-6}$. As the number of data points was lower than 1,000, $n_\infty$ and $N_{max}$ were chosen to be the minimum number of points of each pair of groups being compared. The results for cell motility and cell morphology are summarized in Tables S5 and S6, respectively.

Test of robustness
The variability in the statistical significance of the results caused by the selection of the parameter $\gamma$ and the sizes of the grids $N$ and $F$ is characterized in this section. The method is first tested using theoretical distributions and then using the real data from Experiment 1.
Theoretically, an optimal grid $N$ would be the one that covers the values from $n_0 = 2$ to $n_\infty = N_{max}$. This setup entails an extremely large amount of computation, while a value $n_\infty \approx 1000$ suffices to understand the tendency of the data. If $p(n)$ converges to zero when $n > 1,000$, then it can be assumed that $p(n)$ does not represent a statistically significant case. Hence, $n_\infty = 2,500$ is large enough for the implementation of the method. As the p-values for very small samples are especially unstable and small samples are not representative of any real scenario, the minimum value $n_0$ can be increased. The number of permutations for each $n_i$ can be decreased as well: while the amount of data to analyze may be infinite, it is enough to study a limited number of different data subsamples to approach a realistic scenario. Hence, to test the robustness of the proposed method, we set the grids $N$ and $F$ using $n_0 = 20$, $n_\infty = 2,500$, $N_{max} = 10,000$, $k_1 = 1$ and $k_2 = 20$ in Equations 8 and 9, respectively. Both $N$ and $F$ were configured to have a size of 200 points. Thus, MCCV is repeated 200 times. See Figure S1 for the workflow.
With these grid parameters, for $\gamma$ in the set $\{2.5 \cdot 10^{-6}, 5 \cdot 10^{-6}, 5 \cdot 10^{-5}, 5 \cdot 10^{-4}\}$, we ran the pipeline to evaluate the effect of the value of $\gamma$ on the rejection of the null hypothesis of the Mann-Whitney U statistical test (Table S7). The results obtained help to assess the most suitable value of $\gamma$. Specifically, the decision about $\gamma$ relies on the result obtained for the comparison between $N(0, 1)$ and $N(0.25, 1)$: while the distance $\delta_{\alpha,\gamma}$ has the same magnitude for $\gamma = 5 \cdot 10^{-6}$ and $\gamma = 5 \cdot 10^{-5}$ ($\delta_{\alpha,\gamma} = \pm 5.84$), the minimum data size needed to observe statistically significant differences is low enough to reject the null hypothesis, i.e. $\hat{n}_\alpha = 186$ and $\Theta_{\alpha,\gamma} = 1$. Hence, the value chosen for the following simulations and for the real data is $\gamma = 5 \cdot 10^{-6}$.
To test the computational limitations of the method, we evaluated the value $\Theta_{0.05,\,5 \cdot 10^{-6}}$ while reducing $N$ and $F$: $N$ was chosen to be a grid of 10, 20, 50, 100, 150 or 200 points, and the values in $F$ were reduced by a factor of 1/2, 1/3, 1/5 and 1/10 (i.e., each of the values in the original $F$ was multiplied by this fraction). The experiment was repeated 100 times for each of the setups, so that the probability of obtaining exactly the same $\Theta_{0.05,\,5 \cdot 10^{-6}}$ (Table S7) and the stability of the method could be evaluated. The information given in Table S8 allows us to assess (1) the size of $N$ and (2) the number of folds in $F$. In most cases, the probability obtained was 100%, which shows that the final results are very stable. When $N(0, 1)$ and $N(0.5, 1)$ were compared with small grid parameters, this probability decreased slightly to 89-96% (Table S8). In conclusion, the number of computations could be considerably reduced, for example, to a grid $N$ of 50 points and a grid $F$ scaled by a factor of 0.2.

B. Test of robustness on real data.
B.1. The p-value can be estimated by an exponential function. Repeating the procedure followed with the simulated normal distributions, we verify that the condition for $p(n)$ being exponential is satisfied again: in Figures S7a-c all LOWESS fittings have exponential shapes, and in Figures S7d-f the quotient $p'(n)/p(n)$ of the LOWESS fits is constant.
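The exponential condition used here, namely that the quotient $p'(n)/p(n)$ is constant, can be checked numerically. The sketch below is an illustration under stated assumptions: the derivative is approximated with central finite differences on an evenly spaced grid, the smoothed LOWESS curve is replaced by synthetic curves, and the function name is hypothetical.

```python
import math


def decay_quotients(sizes, pvalues):
    """Central finite-difference estimate of p'(n) / p(n) at the interior grid points."""
    quotients = []
    for i in range(1, len(sizes) - 1):
        derivative = (pvalues[i + 1] - pvalues[i - 1]) / (sizes[i + 1] - sizes[i - 1])
        quotients.append(derivative / pvalues[i])
    return quotients


def spread(values):
    """Range of a list of values; close to zero when the values are constant."""
    return max(values) - min(values)


sizes = list(range(10, 200, 10))
exponential = [0.8 * math.exp(-0.02 * n) for n in sizes]  # satisfies the condition
power_law = [0.8 / n for n in sizes]  # decreasing, but not exponential

print(spread(decay_quotients(sizes, exponential)))
print(spread(decay_quotients(sizes, power_law)))
```

For the exponential curve the quotient is the same at every grid point up to rounding error, whereas for the power law it varies along the grid, which is the behaviour the constant-quotient criterion is meant to separate.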

Fig. 1| Estimation of the p-value as a function of the size ($p(n)$) enables the correct discrimination between conditions.

Fig. 2| The function $p(n)$ acts as a data descriptor and supports the experimental study of multiple conditions. a) Breast cancer cells (MDA-MB-231) were cultured in collagen and imaged under a microscope to determine if cells change shape when a chemotherapy drug (Taxol) is administered. Three different groups were compared: control (non-treated) cells, cells at 1 nM and at 50 nM Taxol. (Leftmost) The cell roundness distributions of control cells and of cells treated at 1 nM Taxol have lower values than that of cells treated at 50 nM. (Right) The three groups were compared, the p-values were estimated and $p(n)$ was fitted for each pair of compared groups. When Taxol at 50 nM is evaluated (blue and yellow dashed curves), $n_\alpha$ is lower and the decay of $p(n)$

$p'(n)/p(n)$ is itself a constant, i.e.

$$\frac{p'(n)}{p(n)} = -c \leftrightarrow p(n) = a e^{-cn}, \quad a, c \in \mathbb{R} \qquad [2]$$

Collecting the values $p(n)$ of the LOWESS fit, the quotient $p'(n)/p(n)$ is calculated (Figs. 4c and 4d). Most of the quotients verify the condition in Eq. 2. In Fig. 4c, we show cases in which it is more challenging to decide

Figure 3. Distribution of the p-values obtained when two normal distributions are compared. For each fixed value of the sample size (up to 2499 points), two normal distributions of that size are simulated and compared via the Mann-Whitney statistical test. This procedure is repeated multiple times. A normal distribution with a mean of 0 and a standard deviation of 1 is compared with a normal distribution with a standard deviation of 1 and a mean of (a) 0, (b) 0.01, (c) 0.1, (d) 1, (e) 2, and (f) 3. When both normal distributions are almost the same, (a) and (b), the p-value follows a uniform distribution. However, as the two normal distributions move farther apart, the distribution of p-values becomes closer to a normal distribution with a faster decay.

Figure 5. Comparison of a 95% level of statistical significance ($\alpha = 0.05$) and an $n$-dependent p-value curve. The parameter $n_\alpha$ represents the minimum sample size needed to detect statistically significant differences among the compared groups. The parameter $n_\gamma$ represents the convergence point of the p-value curve. When the p-value curve expresses statistically significant differences, the area under the red curve ($A_{p(n)}$) is smaller than the area under the constant function $\alpha = 0.05$ ($A_{\alpha=0.05}$) when it is evaluated between 0 and $n_\gamma$.
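The area comparison described in this caption can be sketched in closed form for the exponential model. In the sketch below, the convergence point $n_\gamma$ is taken as an input because its computation is not detailed here, the sign convention of the difference between areas is an assumption, and the function name and coefficients are hypothetical.

```python
import math


def area_decision(a, c, n_gamma, alpha=0.05):
    """Compare the areas under a*exp(-c*n) and under the constant alpha on [0, n_gamma].

    Returns (decision, delta), where decision is 1 when the curve encloses less area
    than the constant alpha. Assumption: delta = area_alpha - area_curve.
    """
    area_curve = (a / c) * (1.0 - math.exp(-c * n_gamma))  # integral of the exponential
    area_alpha = alpha * n_gamma  # rectangle under the significance level
    delta = area_alpha - area_curve
    return (1 if delta > 0 else 0), delta


# Hypothetical coefficients: a fast decay and a slow decay over the same interval.
print(area_decision(a=0.5, c=0.05, n_gamma=300))
print(area_decision(a=0.5, c=0.001, n_gamma=300))
```

With the fast decay the curve encloses less area than the constant $\alpha$ and the decision is 1, while with the slow decay the opposite holds and the decision is 0.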

Figure 6. Decision index $\Theta_{\alpha=0.05,\gamma}$ for different values of the parameters $a$ and $c$ in the function $a e^{-cn}$ and of the threshold $\gamma$: (a) Each of the subplots is drawn for a specific value of $\gamma$; the dark area covers the cases for which there exists a 95% of statistical significance ($\Theta_{\alpha=0.05,\gamma} = 1$), and the white area the rest of the cases ($\Theta_{\alpha=0.05,\gamma} = 0$); (b) Colors in the image correspond to the values of $\gamma$ for which $\Theta_{\alpha=0.05,\gamma} = 0$. The black frontier shows $\Theta_{\alpha=0.05,\gamma=5 \cdot 10^{-6}} = 0$ (red box in (a)). All the values of $a$ and $c$ for which $\Theta_{\alpha=0.05,\gamma=5 \cdot 10^{-6}} = 1$ (95% of statistical significance) lie on the left side of this limit, and the rest on the right. The plots in (a) show the influence of the parameter $\gamma$ over a wide range of values, while the plots in (b) are limited to the range of values we find in the subsequent experiment. The vertical dashed line indicates the cases $a = 0.05$, which are always statistically significant.

Fig. S1. Illustration of the workflow used for the estimation of p-values as a function ($p(n)$) of the sample size ($n$). For each possible value of $n$ ($n_i$), Monte Carlo cross validation (MCCV) is performed $f_i$ times. For each fold in the cross validation, two random sub-samples of size $n_i$ are chosen from samples A and B (yellow spheres). Then, a statistical test is applied to obtain a p-value ($p_i^j$). The procedure is repeated $f_i$ times; $f_i$ depends on $n_i$, as both samples A and B have to be covered. Thereby, $f_i$ decreases ($f_i > f_{i+1} > f_{i+2}$) as $n_i$ gets larger. This procedure is repeated for $n_0, ..., n_{i+1}, n_{i+2}, ...$ until the desired data size ($n_{i \to \infty}$) is reached.
Fig. S2. Segmentation of phase contrast microscopy images of cancer cells (MDA-MB-231) embedded in a 3D collagen Type I matrix. Cell bodies are labeled in blue and cell protrusions in green. Images of (a) control cells and cells treated at (b) 1 nM and (c) 50 nM Taxol were acquired with a 10x magnification objective.
Fig. S3. Quantitative variables used to measure (a) cell body and (b) cellular protrusion morphology, and (c) the ratio of cells with and without protrusions for the three different treatment groups (control, 1 nM Taxol, 50 nM Taxol). Error bars in (a) and (b) correspond to the 99% confidence interval.
Treatment group    Cell body    Cell protrusions
Control            77,700       45,871
1 nM Taxol
Fig. S4. Results obtained for cell body morphology when the control, 1 nM Taxol and 50 nM Taxol groups are compared as (a) control vs. 1 nM Taxol, (b) control vs. 50 nM Taxol and (c) 1 nM vs. 50 nM Taxol. Vertical lines correspond to the convergence point $n_\gamma$ with $\gamma = 5 \cdot 10^{-6}$.
Fig. S5. Graphical illustration of the results obtained for cell protrusion morphology when (a) control and 1 nM Taxol, (b) control and 50 nM Taxol and (c) 1 nM Taxol and 50 nM Taxol groups are compared. Vertical lines correspond to the convergence point $n_\gamma$ with $\gamma = 5 \cdot 10^{-6}$.
The result of the pipeline ($\Theta$) relies on a new threshold called $\gamma$, which can only change in the most uncertain cases, as shown in Materials and Methods. Compared to the classical p-value and $\alpha$ threshold, the parameter $\gamma$ is mathematically constrained and $\Theta$ is stable to its variations. Similarly, $n_\alpha$ is an unbiased indicator of the effect size, i.e. how different the samples are or how big this difference is (further details about robustness are given in Materials and Methods).
|| is greater than , it would vanish the assumption that () has arrived to a convergence point equivalent to zero.Therefore, our claim is that  || <  with at least  < 0.1.

Table S5. Parameters of the exponential function
all LOWESS fittings have exponential shapes, and in Figures S7d-f the quotient $p'(n)/p(n)$ of the LOWESS fits is constant.

Table S10. Table of results for different sizes of the $N$ and $F$ grids. Each value represents the probability (%) of obtaining the same decision index $\Theta_{0.05,\,5 \cdot 10^{-6}}$ as the one shown in Table S9.