This Month
Published: 29 April 2014

Points of significance

Nonparametric tests

Martin Krzywinski¹ &
Naomi Altman²

Nature Methods volume 11, pages 467–468 (2014)


A Corrigendum to this article was published on 27 June 2014

This article has been updated

Nonparametric tests robustly compare skewed or ranked data.


We have seen that the t-test is robust with respect to assumptions about normality and equivariance¹ and thus is widely applicable. There is another class of methods—nonparametric tests—more suitable for data that come from skewed distributions or have a discrete or ordinal scale. Nonparametric tests such as the sign and Wilcoxon rank-sum tests relax distribution assumptions and are therefore easier to justify, but they come at the cost of lower sensitivity owing to less information inherent in their assumptions. For small samples, the performance of these tests is also constrained because their P values are only coarsely sampled and may have a large minimum. Both issues are mitigated by using larger samples.

These tests work analogously to their parametric counterparts: a test statistic and its distribution under the null are used to assign significance to observations. We compare in Figure 1 the one-sample t-test² to a nonparametric equivalent, the sign test (though more sensitive and sophisticated variants exist), using a putative sample X whose source distribution we cannot readily identify (Fig. 1a). The null hypothesis of the sign test is that the sample median m_X is equal to the proposed median, M = 0.4. The test uses the number of sample values larger than M as its test statistic, W—under the null we expect to see as many values below the median as above, with the exact probability given by the binomial distribution (Fig. 1c). The median is a more useful descriptor than the mean for asymmetric and otherwise irregular distributions. The sign test makes no assumptions about the distribution—only that sample values be independent. If we propose that the population median is M = 0.4 and we observe X, we find W = 5 (Fig. 1b). The chance of observing a value of W under the null that is at least as extreme (W ≤ 1 or W ≥ 5) is P = 0.22, using both tails of the binomial distribution (Fig. 1c). To limit the test to whether the median of X was biased towards values larger than M, we would consider only the area for W ≥ 5 in the right tail to find P = 0.11.
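The binomial calculation behind the sign test can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `sign_test_p` is hypothetical, and it assumes that no sample value is exactly equal to M, so that every value counts as either above or below the proposed median.

```python
from math import comb


def sign_test_p(w, n, two_tailed=True):
    """Exact sign-test P value for w of n values above the proposed median.

    Under the null, each value lies above the median with probability 1/2,
    so the count above follows a Binomial(n, 1/2) distribution.
    Assumes no value is tied with the proposed median.
    """
    # Number of equally likely above/below patterns with exactly k above.
    counts = [comb(n, k) for k in range(n + 1)]
    total = 2 ** n
    upper = sum(counts[w:])      # patterns with W >= w
    lower = sum(counts[:w + 1])  # patterns with W <= w
    if not two_tailed:
        return upper / total
    # The null distribution is symmetric, so doubling the smaller tail
    # is the same as counting both extremes (e.g. W <= 1 or W >= 5).
    return min(1.0, 2 * min(upper, lower) / total)


p_two = sign_test_p(5, 6)                    # both tails
p_one = sign_test_p(5, 6, two_tailed=False)  # right tail only
```

For W = 5 and n = 6, these evaluate to 14/64 and 7/64, which round to the P values of 0.22 and 0.11 quoted above.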

**Figure 1: A sample can be easily tested against a reference value using the sign test without any assumptions about the population distribution.**

The P value of 0.22 from the sign test is much higher than that from the t-test (P = 0.04), reflecting that the sign test is less sensitive. This is because it is not influenced by the actual distance between the sample values and M: it measures only 'how many' instead of 'how much'. Consequently, it needs larger sample sizes or more supporting evidence than the t-test. For the example of X, to obtain P < 0.05 we would need to have all values larger than M (W = 6). Its large P values and straightforward application make the sign test a useful diagnostic. Take, for example, a hypothetical situation slightly different from that in Figure 1, where P > 0.05 is reported for the case where a treatment has lowered blood pressure in 6 out of 6 subjects. You may think this P seems implausibly large, and you'd be right because the equivalent scenario for the sign test (W = 6, n = 6) gives a two-tailed P = 0.03.

To compare two samples, the Wilcoxon rank-sum test is widely used and is sometimes referred to as the Mann-Whitney or Mann-Whitney-Wilcoxon test. It tests whether the samples come from distributions with the same median. It doesn't assume normality, but as a test of equality of medians, it requires both samples to come from distributions with the same shape. The Wilcoxon test is one of many methods that reduce the dynamic range of values by converting them to their ranks in the list of ordered values pooled from both samples (Fig. 2a). The test statistic, W, is the degree to which the sum of ranks is larger than the lowest possible in the sample with the lower ranks (Fig. 2b). We expect that a sample from a population with a smaller median will be converted to a set of smaller ranks.

**Figure 2: Many nonparametric tests are based on ranks.**

Because there is a finite number (210) of combinations of rank-ordering for X (n_X = 6) and Y (n_Y = 4), we can enumerate all outcomes of the test and explicitly construct the distribution of W (Fig. 2c) to assign a P value to W. The smallest value of W = 0 occurs when all values in one sample are smaller than those in the other. When they are all larger, the statistic reaches a maximum, W = n_X n_Y = 24. For X versus Y, W = 3, and there are 14 of 210 test outcomes with W ≤ 3 or W ≥ 21. Thus, P_XY = 14/210 = 0.067. For X versus Z, W = 2, and P_XZ = 8/210 = 0.038. For cases in which both samples are larger than 10, W is approximately normal, and we can obtain the P value from a z-test of (W − μ_W)/σ_W, where μ_W = n₁(n₁ + n₂ + 1)/2 and σ_W = √(μ_W n₂/6). This expression for μ_W applies when W is taken as the sum of ranks in the sample of size n₁; when W is measured from its smallest possible value, as in Figure 2b, the mean is n₁n₂/2 and σ_W is unchanged.
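The enumeration described here is small enough to carry out directly. The sketch below assumes there are no tied values and measures W for sample X as its rank sum minus the smallest possible rank sum; the function names `w_distribution` and `two_tailed_p` are illustrative and not taken from the article.

```python
from collections import Counter
from itertools import combinations


def w_distribution(n_x, n_y):
    """Count the rank arrangements that give each value of W for sample X.

    W is the rank sum of X minus its minimum possible value, so it runs
    from 0 to n_x * n_y. Assumes no ties among the pooled values.
    """
    n = n_x + n_y
    min_rank_sum = n_x * (n_x + 1) // 2
    counts = Counter()
    # Under the null, every choice of n_x ranks out of 1..n is equally likely.
    for ranks_x in combinations(range(1, n + 1), n_x):
        counts[sum(ranks_x) - min_rank_sum] += 1
    return counts


def two_tailed_p(w, n_x, n_y):
    """Fraction of arrangements at least as extreme as w in either tail."""
    counts = w_distribution(n_x, n_y)
    w_max = n_x * n_y
    low = min(w, w_max - w)
    high = w_max - low
    extreme = sum(c for value, c in counts.items()
                  if value <= low or value >= high)
    return extreme / sum(counts.values())


p_xy = two_tailed_p(3, 6, 4)  # 14 of 210 arrangements
p_xz = two_tailed_p(2, 6, 4)  # 8 of 210 arrangements
```

Exhaustive enumeration is practical only for small samples because the number of arrangements grows combinatorially, which is one reason the normal approximation is used for larger ones.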

The ability to enumerate all outcomes of the test statistic makes calculating the P value straightforward (Figs. 1c and 2c), but there is an important consequence: there will be a minimum P value, P_min. Depending on the size of samples, P_min can be relatively large. For comparisons of samples of size n_X = 6 and n_Y = 4 (Fig. 2a), P_min = 1/210 = 0.005 for a one-tailed test, or 0.01 for a two-tailed test, corresponding to W = 0. Moreover, because there are only 25 distinct values of W (Fig. 2c), only two other two-tailed P values are <0.05: P = 0.02 (W = 1) and P = 0.038 (W = 2). The next-largest P value (W = 3) is P = 0.07. Because there is no P with value 0.05, the test cannot be set to reject the null at a type I rate of 5%. Even if we test at α = 0.05, we will be rejecting the null at the next lower P—for an effective type I error of 3.8%. We will see how this affects test performance for small samples further on. In fact, it may even be impossible to reach significance at α = 0.05 because there is a limited number of ways in which small samples can vary in the context of ranks, and no outcome of the test happens less than 5% of the time. For example, samples of size 4 and 3 offer only 35 arrangements of ranks and a two-tailed P_min = 2/35 = 0.057. Contrast this to the t-test, which can produce any P value because the test statistic can take on an infinite number of values.

This has serious implications in multiple-testing scenarios discussed in the previous column³. Recall that when N tests are performed, multiple-testing corrections will scale the smallest P value to NP. In the same way as a test may never yield a significant result (P_min > α), applying multiple-testing correction may also preclude it (NP_min > α). For example, making N = 6 comparisons on samples such as X and Y shown in Figure 2a (n_X = 6, n_Y = 4) will never yield an adjusted P value lower than α = 0.05 because P_min = 0.01 > α/N. To achieve two-tailed significance at α = 0.05 across N = 10, 100 or 1,000 tests, we require sample sizes that produce at least 400, 4,000 or 40,000 distinct rank combinations. This is achieved for sample pairs of size (5, 6), (7, 8) and (9, 9), respectively.
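Whether a given pair of sample sizes can reach significance at all follows from the number of rank arrangements. The following sketch assumes no ties and the simple scaling of P to NP described above; the helper names `min_two_tailed_p` and `can_reach_significance` are hypothetical.

```python
from math import comb


def min_two_tailed_p(n_x, n_y):
    """Smallest attainable two-tailed P value for a rank-sum test.

    Only the two most extreme arrangements (one in each tail) out of
    comb(n_x + n_y, n_x) equally likely ones contribute. Assumes no ties.
    """
    return 2 / comb(n_x + n_y, n_x)


def can_reach_significance(n_x, n_y, n_tests=1, alpha=0.05):
    """True if N * P_min does not exceed alpha, i.e. rejection is possible."""
    return n_tests * min_two_tailed_p(n_x, n_y) <= alpha


single_4_3 = can_reach_significance(4, 3)             # 2/35 exceeds 0.05
six_tests_6_4 = can_reach_significance(6, 4, 6)       # 6 * 2/210 exceeds 0.05
ten_tests_5_6 = can_reach_significance(5, 6, 10)      # 462 arrangements
hundred_7_8 = can_reach_significance(7, 8, 100)       # 6,435 arrangements
thousand_9_9 = can_reach_significance(9, 9, 1000)     # 48,620 arrangements
```

The first two checks come out false and the last three true, in line with the sample sizes quoted in the text.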

The P values from the Wilcoxon test (P_XY = 0.07, P_XZ = 0.04) in Figure 2a appear to be in conflict with those obtained from the t-test (P_XY = 0.04, P_XZ = 0.06). The two methods tell us contradictory information—or do they? As mentioned, the Wilcoxon test concerns the median, whereas the t-test concerns the mean. For asymmetric distributions, these values can be quite different, and it is conceivable that the medians are the same but the means are different. The t-test does not identify the difference in means of X and Z as significant because the standard deviation, s_Z, is relatively large owing to the influence of the sample's largest value (0.81). Because the t-test reacts to any change in any sample value, the presence of outliers can easily influence its outcome when samples are small. For example, simply increasing the largest value in X (1.00) by 0.3 will increase s_X from 0.28 to 0.35 and result in a P_XY value that is no longer significant at α = 0.05. This change does not alter the Wilcoxon P value because the rank scheme remains unaltered. This insensitivity to changes in the data—outliers and typical effects alike—reduces the sensitivity of rank methods.
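The contrast between ranks and raw values can be illustrated with a small sketch. The numbers below are hypothetical and are not the data shown in Figure 2; they were chosen only so that the largest value of x is 1.00 and can be increased by 0.3, as in the example above. The helper `pooled_ranks` is likewise an illustrative name and assumes no ties.

```python
from statistics import stdev


def pooled_ranks(x, y):
    """Rank each value within the pooled, sorted list (assumes no ties)."""
    order = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(order)}
    return [rank[v] for v in x], [rank[v] for v in y]


# Hypothetical samples for illustration only.
x = [0.55, 0.62, 0.70, 0.78, 0.85, 1.00]
y = [0.20, 0.35, 0.48, 0.66]

# Push the largest value of x further out, making it more of an outlier.
x_outlier = x[:-1] + [x[-1] + 0.3]

ranks_before = pooled_ranks(x, y)
ranks_after = pooled_ranks(x_outlier, y)

same_ranks = ranks_before == ranks_after   # rank scheme is unaltered
sd_before, sd_after = stdev(x), stdev(x_outlier)
```

Because the changed value was already the largest, every rank stays the same and any rank-based statistic is unchanged, whereas the standard deviation that enters the t statistic grows.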

The fact that the output of a rank test is driven by the probability that a value drawn from distribution A will be smaller (or larger) than one drawn from B without regard to their absolute difference has an interesting consequence: we cannot use this probability (pairwise preferences, in general) to impose an order on distributions. Consider a case of three equally prevalent diseases for which treatment A has cure times of 2, 2 and 5 days for the three diseases, and treatment B has 1, 4 and 4. Without treatment, each disease requires 3 days to cure—let's call this control C. Treatment A is better than C for the first two diseases but not the third, and treatment B is better only for the first. Can we determine which of the three options (A, B, C) is better? If we try to answer this using the probability of observing a shorter time to cure, we find P(A < C) = 67% and P(C < B) = 67% but also that P(B < A) = 56%—a rock-paper-scissors scenario.
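The three probabilities can be checked by comparing every pairing of cure times. This is a minimal sketch that assumes the disease is drawn independently for each of the two treatments being compared, which is what the quoted percentages imply; the function name `prob_less` is illustrative.

```python
from fractions import Fraction
from itertools import product


def prob_less(a, b):
    """Probability that a time drawn from a is shorter than one from b.

    Each list holds the cure times for three equally prevalent diseases,
    and the two draws are treated as independent.
    """
    pairs = list(product(a, b))
    return Fraction(sum(t_a < t_b for t_a, t_b in pairs), len(pairs))


A = (2, 2, 5)  # treatment A cure times in days
B = (1, 4, 4)  # treatment B cure times in days
C = (3, 3, 3)  # no treatment (control)

p_a_beats_c = prob_less(A, C)  # 2/3
p_c_beats_b = prob_less(C, B)  # 2/3
p_b_beats_a = prob_less(B, A)  # 5/9
```

Each option is more likely than not to beat the next one in the cycle, so these pairwise probabilities cannot be arranged into a single consistent ordering.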

The question about which test to use does not have an unqualified answer; both have limitations. To illustrate how the t- and Wilcoxon tests might perform in a practical setting, we compared their false positive rate (FPR), false discovery rate (FDR) and power at α = 0.05 for different sampling distributions and sample sizes (n = 5 and 25) in the presence and absence of an effect (Fig. 3). At n = 5, Wilcoxon FPR = 0.032 < α because this is the largest P value it can produce smaller than α, not because the test inherently performs better. We can always reach this FPR with the t-test by setting α = 0.032, where we'll find that it will still have slightly higher power than a Wilcoxon test that rejects at this rate. At n = 5, Wilcoxon performs better for discrete sampling: the power (0.43) is essentially the same as the t-test's (0.46), but the FDR is lower. When both tests are applied at α = 0.032, Wilcoxon power (0.43) is slightly higher than t-test power (0.39). The differences between the tests for n = 25 diminish because the number of arrangements of ranks is extremely large and the normal approximation to sample means is more accurate. However, one case stands out: in the presence of skew (e.g., exponential distribution), Wilcoxon power is much higher than that of the t-test, particularly for continuous sampling. This is because the majority of values are tightly spaced and ranks are more sensitive to small shifts. Skew affects t-test FPR and power in a complex way, depending on whether one- or two-tailed tests are performed and the direction of the skew relative to the direction of the population shift that is being studied⁴.

**Figure 3: The Wilcoxon rank-sum test can outperform the t-test in the presence of discrete sampling or skew.**

Nonparametric methods represent a more cautious approach and remove the burden of assumptions about the distribution. They apply naturally to data that are already in the form of ranks or degree of preference, for which numerical differences cannot be interpreted. Their power is generally lower, especially in multiple-testing scenarios. However, when data are very skewed, rank methods reach higher power and are a better choice than the t-test.

Change history

23 May 2014
In the version of this article initially published, the expression X (n_X = 6) was incorrectly written as X (n_Y = 6). The error has been corrected in the HTML and PDF versions of the article.

References

1. Krzywinski, M. & Altman, N. Nat. Methods 11, 215–216 (2014).
2. Krzywinski, M. & Altman, N. Nat. Methods 10, 1041–1042 (2013).
3. Krzywinski, M. & Altman, N. Nat. Methods 11, 355–356 (2014).
4. Reineke, D. M., Baggett, J. & Elfessi, A. J. Stat. Educ. 11 (2003).


Author information

Authors and Affiliations

Martin Krzywinski is a staff scientist at Canada's Michael Smith Genome Sciences Centre.
Naomi Altman is a Professor of Statistics at The Pennsylvania State University.


Ethics declarations

Competing interests

The authors declare no competing financial interests.


About this article

Cite this article

Krzywinski, M., Altman, N. Nonparametric tests. Nat Methods 11, 467–468 (2014). https://doi.org/10.1038/nmeth.2937

Issue Date: May 2014
DOI: https://doi.org/10.1038/nmeth.2937

