
Interpretable meta-score for model performance

A preprint version of the article is available at arXiv.

Abstract

Benchmarks are an integral part of machine learning development. However, the most common benchmarks share several limitations. For example, the difference in performance between two models has no probabilistic interpretation, such differences cannot be meaningfully compared between data sets, and there is no reference point that indicates a significant performance improvement. Here we introduce an Elo-based predictive power (EPP) meta-score that is built on other performance measures and allows for interpretable comparisons of models. Differences in this meta-score have a probabilistic interpretation and can be compared directly between data sets. Furthermore, this meta-score allows for an assessment of ranking fitness. We prove the properties of the Elo-based predictive power meta-score and support them with empirical results on a large-scale benchmark of 30 classification data sets. Additionally, we propose a unified benchmark ontology that provides a uniform description of benchmarks.
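To make the probabilistic interpretation concrete, the sketch below is a minimal illustration, written under the assumption of an Elo/Bradley–Terry-style model in which the log-odds of one model beating another in a single benchmark round equal the difference of their EPP scores. It is not the authors' implementation and the numbers are made up; under that assumption, the same score difference maps to the same winning probability on every data set.

```python
import math

def win_probability(epp_a: float, epp_b: float) -> float:
    """Probability that model A beats model B in a single round,
    assuming (for illustration only) that the log-odds of winning
    equal the difference of the two EPP scores, as in
    Elo/Bradley-Terry models."""
    return 1.0 / (1.0 + math.exp(-(epp_a - epp_b)))

# An EPP difference of 1.5 corresponds to roughly an 82% chance of
# winning a round, regardless of the data set the scores come from.
print(round(win_probability(2.0, 0.5), 2))  # 0.82
```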

Fig. 1: A diagram of the EPP benchmark.
Fig. 2: Boxplots of EPP scores split by different algorithms across data sets.
Fig. 3: Actual empirical probability and predicted probability of winning computed on the basis of the EPP meta-score value for data sets ‘banknote-authentication’ and ‘wdbc’.
Fig. 4: Boxplots of scores split by four selected models from the VTAB.

Data availability

The data sets generated during the current study are available in the EPP meta-score GitHub repository at https://github.com/agosiewska/EPP-meta-score (ref. 36). Source data are provided with this paper.

Code availability

An implementation of the EPP score is available at https://github.com/ModelOriented/EloML. The code generated during the current study is available in the EPP meta-score GitHub repository at https://github.com/agosiewska/EPP-meta-score (ref. 36).
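As a rough sketch of how Elo-style scores of this kind can be estimated, the snippet below fits a Bradley–Terry model by logistic regression on pairwise round outcomes. It is a simplified illustration with made-up data, written in Python for brevity; it is not the EloML API and not necessarily the authors' exact estimation procedure.

```python
# Bradley-Terry estimation of Elo-style scores via logistic regression.
# Hypothetical data; not the EloML package API.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["gbm", "rf", "knn"]
# Each round: (index of model A, index of model B, 1 if A won else 0).
rounds = [(0, 1, 1), (0, 2, 1), (1, 2, 1), (0, 1, 1), (1, 2, 0), (0, 2, 1)]

X = np.zeros((len(rounds), len(models)))
y = np.array([a_won for _, _, a_won in rounds])
for i, (a, b, _) in enumerate(rounds):
    X[i, a] = 1.0   # +1 for model A
    X[i, b] = -1.0  # -1 for model B

# No intercept: only score differences are identified, so the absolute
# level of the scores is pinned down only by the (very weak) regularisation.
clf = LogisticRegression(fit_intercept=False, C=1e6)
clf.fit(X, y)

scores = dict(zip(models, clf.coef_[0]))
print(scores)  # higher score -> higher estimated chance of winning a round
```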

References

  1. Wang, A. et al. GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: analyzing and interpreting neural networks for NLP (eds. Linzen, T., Chrupała, G. & Alishahi, A.), 353–355 (Association for Computational Linguistics, 2018).

  2. Wang, A. et al. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Adv. Neural Inform. Process. Syst. 32, 3261–3275 (2019).

  3. Zhai, X. et al. A large-scale study of representation learning with the Visual Task Adaptation Benchmark. Preprint at https://arxiv.org/abs/1910.04867 (2020).

  4. Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP) - round XIII. Proteins 87, 1011–1020 (2019).

  5. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  6. Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 1–23 (2019).

  7. Lensink, M. F., Nadzirin, N., Velankar, S. & Wodak, S. J. Modeling protein–protein, protein–peptide, and protein–oligosaccharide complexes: CAPRI 7th edition. Proteins 88, 916–938 (2020).

  8. Vanschoren, J., van Rijn, J. N., Bischl, B. & Torgo, L. OpenML: networked science in machine learning. SIGKDD Explor. Newsl. 15, 49–60 (2014).

  9. Martínez-Plumed, F., Barredo, P., hÉigeartaigh, S. Ó. & Hernández-Orallo, J. Research community dynamics behind popular AI benchmarks. Nat. Mach. Intell. 3, 581–589 (2021).

  10. Powers, D. Evaluation: from precision, recall and F-factor to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2, 37–63 (2008).

  11. Sokolova, M. & Lapalme, G. A systematic analysis of performance measures for classification tasks. Inform. Process. Manage. 45, 427–437 (2009).

  12. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  13. Abadi, M. et al. Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16), 265–283 (USENIX Association, 2016).

  14. Bischl, B. et al. mlr: machine learning in R. J. Mach. Learn. Res. 17, 1–5 (2016).

  15. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006).

  16. Dietterich, T. G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10, 1895–1923 (1998).

  17. Alpaydin, E. Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Comput. 11, 1885–1892 (1999).

  18. Bouckaert, R. R. Choosing between two learning algorithms based on calibrated tests. In Proceedings of the Twentieth International Conference on Machine Learning (eds. Fawcett, T. & Mishra, N.), ICML’03, 51–58 (AAAI Press, 2003).

  19. Salzberg, S. L. On comparing classifiers: pitfalls to avoid and a recommended approach. Data Min. Knowl. Discov. 1, 317–328 (1997).

  20. Guerrero Vázquez, E., Yañez Escolano, A., Galindo Riaño, P. & Pizarro Junquera, J. In Bio-Inspired Applications of Connectionism (eds. Mira, J. & Prieto, A.), 88–95 (Springer, 2001).

  21. Pizarro, J., Guerrero, E. & Galindo, P. L. Multiple comparison procedures applied to model selection. Neurocomputing 48, 155–173 (2002).

  22. Hull, D. Information Retrieval Using Statistical Classification. PhD thesis, Stanford Univ. (1994).

  23. Brazdil, P. B. & Soares, C. A comparison of ranking methods for classification algorithm selection. In Machine Learning: ECML 2000 (eds. López de Mántaras, R. & Plaza, E.), 63–75 (Springer, 2000).

  24. Elo, A. & Sloan, S. The Rating of Chess Players, Past and Present (Ishi, 2008).

  25. Bischl, B. et al. OpenML benchmarking suites. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (eds. Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S. & Wortman Vaughan, J.), vol. 1 (Curran Associates, Inc., 2021).

  26. Kretowicz, W. & Biecek, P. MementoML: performance of selected machine learning algorithm configurations on OpenML100 datasets. Preprint at https://arxiv.org/abs/2008.13162 (2020).

  27. Probst, P., Boulesteix, A.-L. & Bischl, B. Tunability: importance of hyperparameters of machine learning algorithms. J. Mach. Learn. Res. 20, 1–32 (2019).

  28. Bradley, R. A. & Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 39, 324–345 (1952).

  29. Clark, A. P., Howard, K. L., Woods, A. T., Penton-Voak, I. S. & Neumann, C. Why rate when you could compare? Using the “EloChoice” package to assess pairwise comparisons of perceived physical strength. PLOS ONE 13, 1–16 (2018).

  30. Agresti, A. In Categorical Data Analysis, vol. 482, chap. 6 (Wiley, 2003).

  31. Shimodaira, H. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51, 492–508 (2002).

  32. Shimodaira, H. Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling. Ann. Stat. 32, 2616–2641 (2004).

  33. Suzuki, R. & Shimodaira, H. Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics 22, 1540–1542 (2006).

  34. Agresti, A. In Categorical Data Analysis, vol. 482, chap. 4 (Wiley, 2003).

  35. Gosiewska, A., Bakała, M., Woźnica, K., Zwoliński, M. & Biecek, P. EPP: interpretable score of model predictive power. Preprint at https://arxiv.org/abs/1908.09213 (2019).

  36. Gosiewska, A. & Woźnica, K. agosiewska/EPP-meta-score: EPP paper. Zenodo https://doi.org/10.5281/zenodo.6949519 (2022).

Acknowledgements

Work on this project is financially supported by NCN Opus grant 2017/27/B/ST6/01307.

We thank L. Bakała and D. Rafacz for inspiring ideas, W. Kretowicz and M. Zwoliński for preliminary work (ref. 35) and P. Teisseyre, E. Sienkiewicz, H. Baniecki and B. Rychalska for useful comments.

Author information

Authors and Affiliations

Authors

Contributions

A.G. and K.W. designed and implemented the EPP method as an R package, as well as studied and described the theoretical properties of EPP. A.G. performed the EPP leaderboard on the VTAB benchmark and developed the unified benchmark ontology. K.W. performed the EPP leaderboard on the OpenML benchmark and designed and performed the simulations in the Supplementary Materials. P.B. supervised the project, provided technical advice, helped design the method and analysed the experiments. All authors participated in the conceptualization and preparation of the paper.

Corresponding authors

Correspondence to Alicja Gosiewska, Katarzyna Woźnica or Przemysław Biecek.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Philipp Probst and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 A unified ontology of ML benchmarks.

The violet dashed rectangle shows a minimal setup for any benchmark.

Extended Data Table 1 Example Schemes for EPP Benchmark
Extended Data Table 2 The descriptions of the EPP Benchmark components that extend the Unified Benchmark Ontology
Extended Data Table 3 EPP of selected models for the ada_agnostic data set. AUC values are averaged. Model numbers are IDs from the MementoML benchmark
Extended Data Table 4 The best models in each algorithm class for the mozilla4 data set. AUC values are averaged. Model numbers are IDs from the MementoML benchmark
Extended Data Table 5 The best models in each algorithm class for the credit-g data set. AUC values are averaged. Model numbers are IDs from the MementoML benchmark
Extended Data Table 6 Springleaf Marketing Response Kaggle Competition. https://www.kaggle.com/c/springleaf-marketing-response
Extended Data Table 7 IEEE-CIS Fraud Detection Kaggle Competition. https://www.kaggle.com/c/ieee-fraud-detection

Supplementary information

Supplementary information

Supplementary Figs. S1–S5, Discussion S1–S5 and Tables 1 and 2.

Reporting summary

Supplementary Data 1

Source Data Supplementary Material Fig. 1.

Supplementary Data 2

Source Data Supplementary Material Fig. 2.

Supplementary Data 3

Source Data Supplementary Material Fig. 3.

Supplementary Data 4

Source Data Supplementary Material Fig. 4.

Supplementary Data 5

Source Data Supplementary Material Fig. 5.

Source data

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gosiewska, A., Woźnica, K. & Biecek, P. Interpretable meta-score for model performance. Nat Mach Intell 4, 792–800 (2022). https://doi.org/10.1038/s42256-022-00531-2

