Introduction

Next-generation sequencing (NGS) techniques use random (‘shotgun’) sequencing of the template DNA in order to cover all ‘targets’ with a sufficient number of sequencing reads, that is, to reach a sufficient ‘coverage’. Accordingly, NGS always involves at least the fluctuation of a Poisson process. The distribution of the coverage thus cannot be entirely even but must have a variance that is at least as large as the mean (as in case of a Poisson distribution). On top of that lower bound, real NGS distributions show overdispersion and have variances that are substantially larger than their means. Overdispersion is due to various factors, including copy number variability of the template DNA or pre-NGS manipulations such as selective capturing of template DNA.

To assess the degree of inhomogeneity of NGS coverage quantitatively, Mokry et al.1 elaborated on a consideration of Gnirke et al.2 and introduced the ‘evenness score’ E. This score has found its way into the NGS field. Very recently, for instance, Lelieveld et al.3 applied E in their comparison of exome sequencing and whole-genome sequencing. Here I derive a computationally more efficient formula for the calculation of E. Then I use that formula in a general analysis, producing simple but close approximations. The latter allow for a comparison of the evenness score with conventional descriptors of the relative width of a distribution, such as the coefficient of variation.

Material & Methods and Results

The evenness score E

Mokry et al.1 developed the evenness score as a tool to describe the dispersion of the coverage around the average coverage Cave. Their idea is intuitive and proved useful for their study, but it has not been extensively characterized mathematically. Here I show that the explanation and derivation of the evenness score can be simplified significantly, leading to a more efficient computation of the score and to general insights into its relationship with more traditional statistical measures. In order to simplify the explanation, I first reproduce it in the following paragraph. Readers asking for an immediate intuitive understanding of the evenness score are referred to Figure 1 and may then proceed directly to equation (3).

Figure 1

Explanation of the evenness score E (shaded area) according to Mokry et al.1 (also see their Figure 2b). If the coverage is homogeneous, its pdf (probability density function) is narrow and close to the mean of the normalized coverage at x=1. Then the shaded area approximates a size of 1 or 100%. E is calculated as E = ∫₀¹ (1−F(x)) dx = 1 − ∫₀¹ (1−x)·f(x) dx (see derivation of equation (4)), where f(x) is the pdf and F(x) is the cumulative distribution function (cdf).

Mokry et al.1 stated

E = (Σi=1..Cave Mi/NTP)/Cave × 100%      (1)

where Mi ‘is defined as number of targeted positions with at least coverage Ci, Cave is defined as the average coverage through all targeted positions and NTP is defined as the total number of targeted positions.’ The introduction of the term Ci in this definition is an unnecessary complication as the summation instruction in equation (1) obviously implies that Ci=i. Thus, as Mokry et al.1 stated, E equals 1 (=100%) in case of completely uniform coverage of all targeted positions at a level of Cave because in this case Mi/NTP=1 for all i ≤ Cave, yielding Σi=1..Cave Mi/NTP = Cave. (Mokry et al.1 used the letter Pi instead of Mi in equation (1), which is avoided here as P usually indicates a probability or relative frequency. Only after dividing Mi by NTP does a probability result, that is, P(coverage ≥ i) = Mi/NTP.)
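For a concrete illustration of equation (1), the following Python sketch computes E directly from a vector of per-target coverages; the toy data and variable names are hypothetical and not taken from Mokry et al.:

```python
# Toy data (hypothetical): coverage of 10 targeted positions
cov = [8, 10, 10, 11, 9, 12, 10, 10, 7, 13]

NTP = len(cov)                  # total number of targeted positions
Cave = round(sum(cov) / NTP)    # average coverage, rounded to an integer

# M_i = number of targeted positions with coverage of at least i;
# equation (1): E = (sum of M_i/NTP for i = 1..Cave) / Cave
E = sum(sum(1 for c in cov if c >= i) for i in range(1, Cave + 1)) / (NTP * Cave)
print(E)  # → 0.94
```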

Mokry et al.1 also provided a version of equation (1) for the continuous case, that is, E = ∫₀¹ F(i) di × 100%, ‘…where F(i) is the fraction of positions with normalized coverage of at least C(i)/Cave’, with ‘normalization’ meaning division by the mean. Again, the definition is a little complicated as the reader needs to figure out that C(i)/Cave=i. Moreover, the use of the letters i and F in this formula is unfavorable as i has been applied in equation (1) already, although with a different meaning (!), and F usually relates to the left-sided cumulative distribution function (from −∞ to x, see https://en.wikipedia.org/wiki/Cumulative_distribution_function). Therefore, I prefer to write

E = ∫₀¹ G(x) dx      (2)

with x ≈ i/Cave and G(x) ≈ Mi/NTP, where i is defined as in equation (1), and G(x) is the fraction of positions with normalized coverage of at least x. In equation (2), I omitted the factor ‘100%’ as it equals 1 anyway. With increasing Cave, the residual difference between the discrete and the continuous version of E declines. Mokry et al.1 used the continuous version for a visual explanation of the evenness score (see their ‘Figure 2’ and Figure 1 of the present paper). This explanation implies that G(x) (that is, ‘F(i)’ in terms of Mokry et al.1) is the complement of the cumulative distribution function (cdf) of the normalized coverage. Hence G(x)=1−F(x), because 1−G(x) equals the fraction of positions with normalized coverage of at most x, that is, the cdf, for which I use the common descriptor F(x) here. In case of a very even NGS result, almost all target positions have a coverage close to the mean, so that the probability density function (pdf) is restricted to the vicinity of the mean, the cdf is close to 0 for x<1, and the evenness score E approximates 1, or 100%. Conversely, a coverage that is uneven with F(x)>0 for x<1 results in E<1 (that is, <100%).

Figure 2

Evenness scores (E) of some symmetrical distributions as functions of the coefficient of variation, that is, of the standard deviation (σ) after normalization by division by the mean. Note that E is well predicted by the average (dashed line) of the upper (1−σ²/2) and lower limits (1−σ/2) as given by inequality(7). Because of the normalization, the base length of the triangular and the rectangular distribution cannot be > 2, so that the maximal σ is 1/√6 ≈ 0.41 and 1/√3 ≈ 0.58, respectively. The Gaussian normal distribution necessarily is truncated at 0, which inflicts increasing skewness with increasing σ. Interestingly, the normalized left-truncated Gaussian distribution also has a maximal σ, which equals √(π/2−1) ≈ 0.76 (see Supplementary Material F). As the figure shows, the E score of the distribution even then is well predicted by inequality(7).

Thus, except for the expression in percentage, the evenness score E of Mokry et al.1 is given by

E = ∫₀¹ (1−F(x)) dx      (3)

where F(x) is the cdf of the normalized coverage x=coverage/mean coverage.

With f(x) being the related pdf, where F(x) = ∫₋∞^x f(t) dt, that is, F(x) = ∫₀^x f(t) dt, as there is no negative coverage (f(t)=0 for t<0), a rather convenient expression can be derived from equation (3) using integration by parts: As ∫₀¹ F(x) dx = F(1) − ∫₀¹ x·f(x) dx, equation (3) can be written as E = 1 − F(1) + ∫₀¹ x·f(x) dx = 1 − ∫₀¹ (1−x)·f(x) dx. Hence,

E = 1 − ∫₀¹ (1−x)·f(x) dx      (4)

For the discrete case with x ≈ i/Cave and f(x)dx ≈ ni/NTP, the analogous formula is

E = 1 − Σi=0..Cave (Cave−i)·ni/(NTP·Cave)      (5a)

where NTP and Cave are defined as in equation (1) as the total number of targeted positions and the average coverage, respectively, while ni is the number of targets that are covered with exactly i reads (that is, i is the non-normalized coverage). Equation (5a) can be transformed to

E = 1 − Σ{1≤j≤NTP, C(j)≤Cave} (Cave−C(j))/(NTP·Cave)      (5b)

where the condition ‘1≤j≤NTP, C(j)≤Cave’ guarantees that the index of the summation runs through all target positions j whose coverage C(j) is not larger than the average coverage. As each of these positions occurs exactly once, equation (5b) does not have a weighting factor comparable to ni in equation (5a), where i denotes the coverage level instead of the position. For a direct derivation of equations (5a and 5b) from equation (1), see Supplementary Material A and B and Supplementary Figure S1. As Cave is usually not an integer, there might be a small deviation between the values calculated by equation (1) and equations (5a and 5b). The deviation is small for Cave>10 but may be considerable if Cave is of order 1. The difference vanishes if Cave is rounded to the next integer. With equations (5a and 5b), the computation time to calculate E is a linear function of the number of NGS reads as each read is addressed only once in the summation over (Cave−i)·ni, whereas equation (1) requires computational time that increases as a quadratic function of the number of reads as each Mi represents a summation itself. For theoretical considerations (see below), equations (4, 5a and 5b) are also more useful than equations (1, 2, 3) because for various distributions, including the Gaussian normal distribution, F(x) cannot be provided in closed form. Concerning computational efficiency, the situation is then similar to the discrete version, as equation (4) involves only one numerical integration, while the calculation according to equation (3) implies the numerical integration of numerical integrations.
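The computational shortcut can be illustrated with a Python sketch (toy data hypothetical): equation (5b) reproduces the result of the nested summation of equation (1) in a single pass whenever Cave is an integer:

```python
cov = [8, 10, 10, 11, 9, 12, 10, 10, 7, 13]   # hypothetical per-target coverages
NTP = len(cov)
Cave = round(sum(cov) / NTP)

# equation (1): each M_i is itself a summation (quadratic effort)
E1 = sum(sum(1 for c in cov if c >= i) for i in range(1, Cave + 1)) / (NTP * Cave)

# equation (5b): a single pass over the positions with coverage <= Cave
E5b = 1 - sum(Cave - c for c in cov if c <= Cave) / (NTP * Cave)

print(E1, E5b)  # → 0.94 0.94
```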

Commands to calculate E according to equations (4, 5a and 5b) on the R command line or as parts of R programs are as follows (see Supplementary Material B for a detailed explanation): In case of empirical (that is, discrete) data, let D be a vector that contains the data as a sequence of numbers representing the coverage of each of the targets. If this sequence is the column k of a table T, use the command ‘D=T[,k]’ to produce D. Then implementation of equation (5b) in R yields E by the command line script

Cave=round(mean(D)); E=1-sum(Cave-D[D<=Cave])/(Cave*length(D))

where Cave is rounded to the next integer (which only has a substantial effect for data whose non-normalized Cave is very small; that is, of order 1). This command also works after normalization of the data. For operations with a theoretical distribution f(x) of a continuous normalized random variable, equation (4) can be implemented in R as

E=1-integrate(function(x) (1-x)*f(x), 0, 1)$value

where ‘f(x)’ has to be replaced by the specific pdf.
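For readers working outside R, equation (4) can be sketched analogously in Python; the midpoint rule and the example pdf (a Gaussian of the normalized coverage with σ=0.1) are illustrative choices, not part of the original commands:

```python
import math

# Equation (4): E = 1 - integral from 0 to 1 of (1 - x) f(x) dx,
# evaluated here by the midpoint rule.
def evenness(pdf, steps=100000):
    h = 1.0 / steps
    return 1 - h * sum((1 - (k + 0.5) * h) * pdf((k + 0.5) * h) for k in range(steps))

# Example: Gaussian pdf of the normalized coverage with mean 1 and sigma = 0.1
sigma = 0.1
f = lambda x: math.exp(-(x - 1) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
print(round(evenness(f), 2))  # → 0.96
```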

In the following, I derive approximations of the evenness score E, especially in terms of the distribution parameter σ. I also consider alternative scores such as e^(−σ), which is restricted to the interval between 0 and 1 by definition and thus qualifies for scoring in percentage.

E and σ in case of a symmetrical pdf

As coverage always is positive, with f(x)=0 for x<0, and because normalization implies μ=1, the variance is given by σ² = ∫₀^∞ (x−1)²·f(x) dx. With both f(x) ≥ 0 and (x−1)² ≥ 0, the variance can be written as σ² = k2·∫₀¹ (x−1)²·f(x) dx, where k2>1. As (1−x)² ≤ 1−x for 0 ≤ x ≤ 1, equation (4) implies σ²/k2 ≤ 1−E, which yields an upper limit of E. In analogy to k2, a constant k0 can be defined with 1 = k0·∫₀¹ f(x) dx, where k0>1 since ∫₀¹ f(x) dx < ∫₀^∞ f(x) dx = 1 because f(x) is a pdf. To derive the lower limit of E, apply Jensen’s inequality for convex functions such as (x−1)² (see Supplementary Material C), yielding (k0·∫₀¹ (1−x)·f(x) dx)² ≤ k0·∫₀¹ (1−x)²·f(x) dx. With equation (4), this is equivalent to k0σ²/k2 ≥ (k0(1−E))². Hence,

1 − σ²/k2 ≥ E ≥ 1 − σ/√(k0k2)      (6)

The constants k0 and k2 depend on the form of the distribution; k0 is associated with the relation of median m and mean μ=1. If 1>m, then ∫₀¹ f(x) dx > 1/2, and k0<2. In case of symmetrical pdfs, m=μ=1 and k0=2. Moreover, symmetry implies ∫₀¹ (x−1)²·f(x) dx = σ²/2, and therefore, k2=2. Hence,

1 − σ²/2 ≥ E ≥ 1 − σ/2      (7)

(see Figure 2 for some examples). Inequality(7) makes sense only if the normalized standard deviation σ, which equals √(2·∫₀¹ (1−x)²·f(x) dx) for symmetrical pdfs, ranges between 0 and 1. Indeed, this can be shown using the extreme types of symmetrical pdfs: If f(x)→0 for x≠1, we get σ→0, whereas if f(x) has a U-form, with f(x)→0 for x(x−2)≠0, thus maximizing the distance of the random variable from the mean, we have σ→1 as (1−x)² equals either (1−0)² or (1−2)². For these two extremes, E is precisely determined by inequality(7), being 1 and 0.5, respectively. The relative error in estimating E by inequality(7), that is, by the mean of the limits (1−σ/2) and (1−σ²/2), must be smaller than half of their difference divided by the lower limit, 0.5(σ/2−σ²/2)/(1−σ/2). The maximum of the latter term is found at σ = 2−√2 ≈ 0.59 and is only 0.086. Analyzing realistic distributions (see below) yields relative errors even much smaller than that.

Among the pdfs that are symmetrical and unimodal (for example, bell-shaped), the pdf with the maximal σ is realized by an approximate rectangular distribution over the interval [0, 2] with f(x)=0.5 for 0 ≤ x ≤ 2, and f(x)=0 otherwise. A simple calculation yields σ = 1/√3 ≈ 0.58, 1−σ/2=0.71, 1−σ²/2=0.83, E=0.75, a relative error in estimating E by inequality(7) of ((1−σ/2+1−σ²/2)/2−E)/E=0.03, and e^(−σ)=0.56. More so than the rectangular, the triangular distribution might serve as a semi-realistic but still analytically treatable model of a symmetrical and unimodal pdf. For a triangular pdf with its base on the interval [1−b, 1+b], b ≤ 1, and, consequently, peak height of 1/b, one gets σ = b/√6 and E = 1 − b/6. Again, E can be predicted rather well by inequality(7) with a relative error of <0.028. Of note, the range of E, that is, the interval [0.83, 1], is only half as large as the ranges of σ or e^(−σ). Even more realistic, of course, than a triangular pdf is the assumption of a Gaussian normal distribution. The latter is reasonably symmetrical as long as the standard deviation is small with σ ≪ μ (see Figure 2 and Supplementary Material F for the effect of truncation at x=0). If the coverage results from a random production of reads as in a Poisson process, its distribution is approximately Gaussian with a variance before normalization that is as large as the mean coverage. Assuming a mean coverage of 100 before normalization, the standard deviation after normalization then is √100/100 = 0.1. Numerical integration using R (see Supplementary Material B) yields E(σ) as 0.96, 0.92 and 0.84 for σ being 0.1, 0.2, and 0.41, respectively, which is almost the same as in case of the triangular distribution (Figure 2).
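The closed forms for the triangular pdf (σ = b/√6, E = 1 − b/6) make inequality(7) easy to probe numerically; the following sketch (illustrative values of b) checks the bounds and the quoted relative error of <0.028:

```python
import math

# Check inequality (7), 1 - sigma^2/2 >= E >= 1 - sigma/2, for symmetric
# triangular pdfs on [1-b, 1+b], using the closed forms from the text.
for b in (0.25, 0.5, 1.0):
    sigma = b / math.sqrt(6)
    E = 1 - b / 6
    assert 1 - sigma**2 / 2 >= E >= 1 - sigma / 2
    estimate = 1 - sigma / 4 - sigma**2 / 4   # mean of the two limits
    print(b, round(abs(estimate - E) / E, 4))
```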

One might think that E ≥ 1−σ/2 (see inequality(7)) also applies to all positively skewed normalized distributions, that is, normalized pdfs with positive third moment. However, this may not necessarily be the case. Defining constants kn with ∫₀^∞ |1−x|^n·f(x) dx = kn·∫₀¹ (1−x)^n·f(x) dx for n ∈ {0,1,2,3,…}, we get k0 and k2 according to their definitions in the derivation of inequality(6), k1=2 owing to the definition of the mean μ, which equals 1 after normalization, and k3>2 in case of positive skewness. For E ≥ 1−σ/2 to be true, the product k0k2 needs to be >4 (see inequality(6)). Proofs in that matter are not trivial. For ‘positively slanted’ distributions4 (that is, pdfs for which f(μ+x)−f(μ−x) is not identically zero and changes sign in x>0 at most once and from negative to positive, which include the Pearson family and the log-normal distribution) it can be derived, using the reasoning of MacGillivray,5 that k2>2 and 2>k0>1 (not shown). However, this is not very helpful. For the log-normal distribution, better approximations are derived in the following section.

The evenness score of the log-normal distribution

Measurements on biological entities usually are positive with a maximum at x>0 and a tail towards higher values. As such, their distributions resemble a log-normal distribution (see Limpert et al.6 for a review). This type of distribution has been found in a great variety of cases, including gene expression,7 telomere length,8 neuronal activity,9 fecundity10 or time-to-event duration (for example, incubation time) of infectious and other diseases,11, 12 for instance, although log-normal genesis (multiplicative interaction of many random effects) cannot always be demonstrated perfectly. The pdf of the coverage in NGS may also have log-normal appearance (Figure 4): The rolling circle technique of Complete Genomics or the use of selective capturing of targets as in exome sequencing produce such distributions, whereas whole-genome sequencing with the Illumina technique results in rather symmetrical distributions.13, 14 Therefore, I examined the evenness score E(σ) of the log-normal distribution (see Figure 3).

Figure 4

Fitting log-normal distributions to exome data. (a) Variant coverage distribution of an individual exome (‘son’) that can be downloaded from Glusman et al.17 (b) Average coverage of each variant position on chromosome 18 that is covered above threshold (>7 ×) in all 4300 European American samples of the Exome Variant Server (http://evs.gs.washington.edu/EVS, Jan 2016). For the moments and scores of the normalized coverage as discussed in the present article, distribution (a) yields σ=1.006, e^(−σ)=0.366, E=0.657, 1−σ/2=0.497, 1−σ²/2=0.494, e^(−σ*/2)=0.658 and 1−F(e^(−1))=0.724, and (b) yields σ=0.456, e^(−σ)=0.634, E=0.827, 1−σ/2=0.772, 1−σ²/2=0.896, e^(−σ*/2)=0.804 and 1−F(e^(−1))=0.965. Thus the evenness score E of realistic data is well approximated by e^(−σ*/2) as in equation (10), while 1−F(e^(−1)) as in equation (11) is not sufficient yet. For (b), where the deviation from symmetry is relatively small, even inequality(7) yields a good approximation with 0.5(1−σ/2+1−σ²/2)=0.834. Moreover, panels (a) and (b) show that, for the characterization of realistic data, the score e^(−σ) exploits a much larger part of its range than E. (See Supplementary Figure S2 for exome data that fit the log-normal distribution less perfectly while the relations of the moments and scores are still quite similar to here.)

Figure 3

(a) Assumed log-normal probability density functions (pdfs) of the normalized coverage x for different values of the standard deviation σ. (See the main text for the relation between σ and the form parameter σ* of the log-normal pdf.) As normalization means division by the mean here, the mean μ of x is always 1. Therefore, the coefficient of variation (=σ/μ) equals σ. Note that for σ→0, the pdf approximates the form of a Gaussian normal distribution while with increasing σ the skewness also increases. (b) Evenness score E(σ) and alternative scores (e^(−σ), e^(−σ*), e^(−σ*/2) and 1−F(x,σ)) for normalized log-normal distributions with varying σ. The cumulative distribution function F(0.2,σ) quantifies the fraction of targets with a coverage of <0.2 (that is, <20 × if the mean of the non-normalized random variable is 100 ×) depending on σ. Note that E(σ) is well approximated by e^(−σ*/2) up to large σ (≤3) and by 1−F(e^(−1),σ) for very large σ (≥2.5).

The log-normal distribution15 is the density of a variable whose logarithm ln(x) has a Gaussian normal distribution No(μ*, σ*). Being the first derivative of the cdf, with ∂ln(x)/∂x = x^(−1), the log-normal pdf thus is

f(x) = 1/(xσ*√(2π)) · exp(−(ln(x)−μ*)²/(2σ*²)) for x>0      (8)

where μ* and σ* now are form parameters only that relate to mean and variance of x as μ = e^(μ*+σ*²/2) and σ² = (e^(σ*²)−1)·e^(2μ*+σ*²), respectively.16 Normalization (division by the mean) conserves the log-normal form of a distribution, since ln(x/μ)=ln(x)−ln(μ) implies that ln(x/μ) has the Gaussian normal distribution No(μ*−ln(μ), σ*) if the distribution of ln(x) is No(μ*, σ*). For normalized coverage with μ=1, which implies μ* = −σ*²/2, the relation of σ* and σ simplifies to

σ*² = ln(σ²+1)      (9)

The log-normal distribution is increasingly skewed with increasing σ (see Figure 3a), whereas it approximates a Gaussian normal distribution No(μ, σ) if σ→0 (see Supplementary Material D for a proof of the latter tendency). In case of small σ, the evenness score of a normalized log-normal distribution thus can be estimated by inequality(7) (see Figures 3b and 4b). First-order approximation of equation (9) in the vicinity of ln(1) yields σ*² = ln(1+σ²) ≈ σ², that is, σ ≈ σ*, so that inequality(7) translates to 1−σ*²/2 ≥ E ≥ 1−σ*/2 for σ ≈ σ*→0. With first-order approximation in the vicinity of e⁰ as e^(0+Δt) ≈ 1+Δt, this results in e^(−σ*²/2) ≥ E ≥ e^(−σ*/2). Figures 3b and 4 and Supplementary Figure S2 show that E ≈ e^(−σ*/2) also holds beyond the region of small σ. At σ=1.3 where σ*=1, the values of E=0.62 and e^(−1/2)=0.61 still are almost identical. Indeed, the approximation e^(−σ*/2) is valid up to σ=3 (that is, σ*≈1.5), with a maximal absolute error of 0.02,

E ≈ e^(−σ*/2)      (10)
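As a numerical cross-check of this approximation (a sketch; the step count and the test value σ=1.3 are illustrative), the evenness score of the normalized log-normal pdf of equations (8) and (9) can be integrated and compared with e^(−σ*/2):

```python
import math

# E for a normalized log-normal pdf (equations (8) and (9)) by the midpoint
# rule, compared with the approximation E ≈ e^(-sigma*/2) of equation (10).
def lognormal_E(sigma, steps=200000):
    s2 = math.log(1 + sigma**2)      # sigma*^2 = ln(sigma^2 + 1), equation (9)
    mu_star = -s2 / 2                # mu* = -sigma*^2/2 ensures mean 1
    def pdf(x):
        return math.exp(-(math.log(x) - mu_star)**2 / (2 * s2)) / (x * math.sqrt(2 * math.pi * s2))
    h = 1.0 / steps
    return 1 - h * sum((1 - (k + 0.5) * h) * pdf((k + 0.5) * h) for k in range(steps))

sigma = 1.3                          # then sigma* ≈ 1
E = lognormal_E(sigma)
approx = math.exp(-math.sqrt(math.log(1 + sigma**2)) / 2)
print(round(E, 2), round(approx, 2))  # → 0.62 0.61
```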

To derive an approximation for even larger σ, use y=ln(x), which, by definition, has a Gaussian normal distribution No(μ*,σ*). Substituting x by y, that is, FLogNo(x) by FNo(y) in equation (3), yields E = 1 − ∫₋∞^0 FNo(y)·e^y dy, taking into consideration that dx = (∂x/∂y)dy = e^y dy. The value of ∫₋∞^0 FNo(y)·e^y dy is determined by the region close to the origin as the factor e^y is approaching 0 for negative values of y beyond that region. For large σ (and, therefore, large σ*), FNo(y) approximates its maximum (=1) in that region so that its graph becomes flat and rather linear, because its mean μ* moves away from the origin with the square of the standard deviation, μ* = −σ*²/2, according to equation (9). Hence, for large σ, FNo(y) can be replaced by a low-grade Taylor series approximation. The Taylor series can be expressed as FNo(y) ≈ FNo(0) + Σn≥0 fNo⁽ⁿ⁾(0)·y^(n+1)/(n+1)!, considering that fNo is the first derivative of FNo. With ∫₋∞^0 (y^(n+1)/(n+1)!)·e^y dy = (−1)^(n+1), one gets E ≈ 1 − FNo(0) − Σn≥0 fNo⁽ⁿ⁾(0)·(−1)^(n+1) (see Supplementary Material E for a detailed derivation). Stopping that series at n=0 yields E ≈ 1−(FNo(0)+fNo(0)·(−1)). The term in the brackets amounts to a first-order Taylor approximation of FNo(y) for y=−1. Returning to the log-normal distribution of x with x=e^y then results in E ≈ 1−F(e^(−1)). As shown in Figure 3b, this approximation is rather good for very large values of σ. For σ ≥ 2.5, its maximal absolute error is at most 0.02. Hence,

E ≈ 1 − F(e^(−1))      (11)

As F(e^(−1)) = F(0.37) is the fraction of targets with a normalized coverage of at most 0.37, E here indicates the fraction of targets with a normalized coverage of at least 0.37. Figure 3b shows that 1−F(e^(−1)) is quite parallel to 1−F(0.2); in case of a 100 × average coverage, which is frequently aimed for in NGS projects, F(0.2) indicates the limit of 20 × that is usually considered as the minimal coverage necessary for reliable mutation detection. As such, for NGS projects with very large variance of the coverage, E may serve as a useful and more or less direct indicator of the fraction of sufficiently covered targets.
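This large-σ behavior can be probed in the same spirit (a sketch; σ=3 is an illustrative choice), with E obtained by integrating the log-normal cdf as in equation (3) and compared with 1−F(e^(−1)):

```python
import math

def lognorm_cdf(x, sigma):
    # cdf F(x) of the normalized log-normal distribution (mean 1)
    s2 = math.log(1 + sigma**2)          # sigma*^2 = ln(sigma^2 + 1)
    mu_star = -s2 / 2
    return 0.5 * (1 + math.erf((math.log(x) - mu_star) / math.sqrt(2 * s2)))

def evenness(sigma, steps=200000):
    # equation (3): E = 1 - integral from 0 to 1 of F(x) dx (midpoint rule)
    h = 1.0 / steps
    return 1 - h * sum(lognorm_cdf((k + 0.5) * h, sigma) for k in range(steps))

sigma = 3.0
E = evenness(sigma)
approx = 1 - lognorm_cdf(math.exp(-1), sigma)   # equation (11)
print(round(E, 2), round(approx, 2))            # → 0.45 0.46
```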

Discussion

The evenness score is used in NGS to quantify the homogeneity of target coverage with sequencing reads.1, 3 As such, it is a measure of the relative width of a distribution, with the coverage being the distributed variable. Its use can be recommended only if it has advantages compared with the coefficient of variation, which is the parameter conventionally applied for this purpose. Here I have performed that comparison. To do so, I used the evenness score in its continuous version, which assumes a normalized random variable (that is, having a mean μ of 1). Therefore, the evenness score E was compared with the standard deviation σ, as σ equals the coefficient of variation if μ equals 1.

At first, I clarified the mathematical definition of E and derived a computationally more efficient version (see equation (4)), which then was also translated to the non-normalized, discrete case of empirical coverage data (see equations (5a and 5b)). Using this version, the calculation of E avoids double summations, making it about as fast as the calculation of σ. As most software applications still do not contain a built-in routine for the calculation of E, I have provided short R commands that will be easily translatable to analogous commands in other programming languages.

Besides the unconventionality of E, its definition might appear to imply another disadvantage: Since the integration in its calculation runs only up to the mean (=1 owing to normalization), E might appear to be insensitive to the variable’s distribution above the mean. However, this is not true, because equation (4) shows that 1−E = ∫₀¹ (1−x)·f(x) dx with x being the coverage normalized by the mean. Hence, by influencing the location of the mean (before normalization), the upper part of the distribution influences the upper end of the lower part and, thereby, the result of the integration of the normalized lower part.
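A toy demonstration of this coupling (hypothetical numbers): raising the coverage of a single target that lies above the mean changes Cave and thereby E, although the summation runs only up to the mean:

```python
# Modifying coverage only *above* the mean still changes E, because it
# shifts Cave and thus rescales the lower part of the distribution.
def evenness(cov):
    NTP = len(cov)
    Cave = round(sum(cov) / NTP)
    return 1 - sum(Cave - c for c in cov if c <= Cave) / (NTP * Cave)

cov_a = [5, 10, 10, 15]   # mean coverage 10
cov_b = [5, 10, 10, 35]   # only the best-covered target changed; mean now 15

print(evenness(cov_a))             # → 0.875
print(round(evenness(cov_b), 3))   # → 0.667
```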

More important is the outcome of the general analysis of E performed in the present paper. For any symmetrical distribution, including the Gaussian normal distribution, I showed that E can be predicted with little error from σ, that is, by the mean of the limits 1−σ/2 and 1−σ²/2 (see inequality(7) and Figure 2). Moreover, as some NGS methods entail positively skewed coverage data (see Figure 4, Ernani et al.13 and Lam et al.14), I examined the evenness score of the log-normal distribution, which is the typical distribution of positively skewed results of biological measurements:6 For a rather wide range of σ (≤3), E was found to be predictable by e^(−σ*/2) with σ*² = ln(σ²+1) (see equation (10) and Figure 3b). In these cases, E also does not seem to provide much information that is not easily derivable from σ. An advantage of E was revealed only for cases with very large coefficient of variation (that is, σ of normalized data ≥2.5), as it then satisfyingly and directly predicts the fraction of targets with sufficiently high coverage (see equation (11) and Figure 3b), whereas this fraction cannot be easily estimated directly from σ.

Some might argue that the evenness score has the advantage of being a score between 0 and 1 (0% and 100%). However, a simple score with that quality can also be devised using σ, namely e^(−σ), which is 1 (that is, 100%) for absolutely homogeneous coverage and approaches 0 for inhomogeneous coverage. The major difference between E and e^(−σ) is given by the rate of approaching 0, as can be seen in Figure 3b. There, E still indicates an evenness of 0.37=37% if F(0.2)=0.5, with 50% of the targets having a coverage of at most 0.2 (that is, of at most 20 × if the mean is 100 ×), while e^(−σ) is already down to a level of e^(−5)=0.007=0.7%. If such NGS outputs were unacceptable due to insufficient coverage of too many targets, E would not exploit its full range (0–1) for the evaluation of the acceptable NGS outputs. Indeed, the minimal E values of published NGS outputs as calculated in Mokry et al.,1 Lelieveld et al.3 and the present paper are still as large as 0.62, 0.68 and 0.66, respectively, while e^(−σ) goes down to 0.37 (see Figure 4). On the other hand, if outputs with 50% of the targets having a coverage of at most 20% of the mean coverage were acceptable, E would have the advantage of preserving some of its range for their quantitative evaluation.

Dealing with log-normal distributions, it might also be worth considering the analog of the standard deviation of a Gaussian normal distribution, that is, the ‘multiplicative standard deviation’ σ* as recommended by Limpert et al.6 (note that the naming of the variables is different in Limpert et al.6). It is one of the two form parameters in equations (8 and 9). In case of empirical data, it can be calculated as the standard deviation of the natural logarithm of the random variable. In Figure 3b, the score e^(−σ*) is presented as a possible tool for the quantitative evaluation of the homogeneity of NGS outputs. It may provide a compromise between e^(−σ) and E. However, the ‘multiplicative standard deviation’ does not yet seem to be in common use and the NGS community may therefore hesitate to take it into consideration.

In summary, the general evaluation presented in this paper reveals that in most circumstances the evenness score E of an NGS output can be predicted quite well by the standard deviation σ of the normalized data (that is, by the coefficient of variation σ/μ in case of non-normalized data). Only if σ is very large (≥2.5μ) does E have the advantage of directly reflecting the fraction of sufficiently covered targets. The general relation between E and σ set out here should also apply to other scientific fields that develop a parameter equivalent to E for their statistics.