Avoiding common pitfalls in machine learning omic data science

This Comment describes common pitfalls encountered in deriving and validating predictive statistical models from high-dimensional data. It offers a fresh perspective on key statistical issues and provides guidelines to avoid these pitfalls and to help readers unfamiliar with the field better assess the reliability and significance of their results.

Fig. 1: The curse of dimensionality and overfitting.
Fig. 2: Avoiding bias when training and evaluating molecular predictors.
Fig. 3: Unknown confounders and class prediction.
Fig. 4: Avoiding bias when comparing feature selection methods.

Correspondence to Andrew E. Teschendorff.

Cite this article

Teschendorff, A.E. Avoiding common pitfalls in machine learning omic data science. Nat. Mater. 18, 422–427 (2019). https://doi.org/10.1038/s41563-018-0241-z
