Evaluation of the evenness score in next-generation sequencing

Abstract

The evenness score (E) in next-generation sequencing (NGS) quantifies the homogeneity of coverage across the NGS targets. Here I clarify the mathematical description of E, which is 1 minus the integral from 0 to 1 of the cumulative distribution function F(x) of the normalized coverage x, where normalization means division by the mean: E = 1 − ∫₀¹ F(x) dx. I then derive a computationally more efficient formula, namely 1 minus the integral from 0 to 1 of the probability density function f(x) times (1 − x): E = 1 − ∫₀¹ (1 − x) f(x) dx. An analogous formula for empirical coverage data is provided, as well as fast R command line scripts. This new formula allows for a general comparison of E with the coefficient of variation (= standard deviation σ of the normalized data), which is the conventional measure of the relative width of a distribution. For symmetrical distributions, including the Gaussian, E can be predicted closely as 1 − σ²/2 ≥ E ≥ 1 − σ/2, with σ ≤ 1 owing to normalization and symmetry. For the log-normal distribution, a typical representative of positively skewed biological data, the analysis yields E ≈ exp(−σ*/2) with σ*² = ln(σ² + 1) up to large σ (≤ 3), and E ≈ 1 − F(exp(−1)) for very large σ (≥ 2.5). For such rather uneven coverage, E can provide direct information on the fraction of well-covered targets, which the normalized σ does not deliver immediately. Otherwise, E does not appear to have major advantages over σ or over a simple score exp(−σ) based on it. In fact, exp(−σ) uses a much larger part of its range when evaluating realistic NGS outputs.
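The two descriptions of E above can be checked against each other numerically. The following is a minimal sketch in Python, not the author's R scripts. The function names, the midpoint integration of the empirical cumulative distribution, and the simulated log-normal sample are all assumptions made for illustration. The direct formula used here, E = 1 − (1/n) Σ (1 − x_i) over all x_i < 1, is a plausible empirical analogue of the density-based formula and may differ in form from the one given in the paper.

```python
import bisect
import math
import random
import statistics


def normalize(coverage):
    """Divide every coverage value by the mean coverage."""
    mean = statistics.fmean(coverage)
    return [c / mean for c in coverage]


def evenness_from_cdf(coverage, steps=2000):
    """E = 1 - integral_0^1 F(x) dx, with F the empirical CDF (midpoint rule)."""
    x = sorted(normalize(coverage))
    n = len(x)
    h = 1.0 / steps
    area = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        # Empirical CDF at t: fraction of normalized values <= t.
        area += bisect.bisect_right(x, t) / n * h
    return 1.0 - area


def evenness_direct(coverage):
    """E = 1 - (1/n) * sum of (1 - x_i) over all normalized x_i below 1."""
    x = normalize(coverage)
    return 1.0 - sum(1.0 - xi for xi in x if xi < 1.0) / len(x)


def lognormal_prediction(coverage):
    """E ~ exp(-sigma_star / 2) with sigma_star^2 = ln(sigma^2 + 1)."""
    sigma = statistics.pstdev(normalize(coverage))  # coefficient of variation
    sigma_star = math.sqrt(math.log(sigma ** 2 + 1.0))
    return math.exp(-sigma_star / 2.0)


# Simulated coverage (hypothetical data): log-normal with sigma_star = 0.5.
rng = random.Random(1)
sample = [rng.lognormvariate(0.0, 0.5) for _ in range(20000)]

e_cdf = evenness_from_cdf(sample)
e_direct = evenness_direct(sample)
e_pred = lognormal_prediction(sample)
print(f"E via CDF integral:   {e_cdf:.4f}")
print(f"E via direct formula: {e_direct:.4f}")
print(f"exp(-sigma*/2):       {e_pred:.4f}")
```

The direct formula needs a single pass over the data and no sorting or numerical integration, which is one way to read the claim of greater computational efficiency. Perfectly even coverage gives E = 1, and the first two printed values should agree up to the integration step size.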

Acknowledgements

I thank Kay E Reed for inspiring talks and critical reading.

Ethics declarations

Competing interests

The author declares no conflict of interest.

Supplementary Information accompanies the paper on the Journal of Human Genetics website.

DOI: https://doi.org/10.1038/jhg.2016.21