
Interpretable meta-score for model performance

A preprint version of the article is available at arXiv.

Abstract

Benchmarks are an integral part of machine learning development. However, the most common benchmarks share several limitations. For example, the difference in performance between two models has no probabilistic interpretation, such differences cannot be meaningfully compared between data sets, and there is no reference point that indicates a significant performance improvement. Here we introduce an Elo-based predictive power (EPP) meta-score that is built on other performance measures and allows for interpretable comparisons of models. Differences in this meta-score have a probabilistic interpretation and can be compared directly between data sets. Furthermore, this meta-score allows for an assessment of ranking fitness. We prove the properties of the Elo-based predictive power meta-score and support them with empirical results on a large-scale benchmark of 30 classification data sets. Additionally, we propose a unified benchmark ontology that provides a uniform description of benchmarks.
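To make the probabilistic interpretation concrete, the sketch below is a minimal illustration, written under the assumption of an Elo/Bradley–Terry-style model in which the log-odds of one model beating another in a single benchmark round equal the difference of their EPP scores. It is not the authors' implementation and the numbers are made up; under that assumption, the same score difference maps to the same winning probability on every data set.

```python
import math

def win_probability(epp_a: float, epp_b: float) -> float:
    """Probability that model A beats model B in a single round,
    assuming (for illustration only) that the log-odds of winning
    equal the difference of the two EPP scores, as in
    Elo/Bradley-Terry models."""
    return 1.0 / (1.0 + math.exp(-(epp_a - epp_b)))

# An EPP difference of 1.5 corresponds to roughly an 82% chance of
# winning a round, regardless of the data set the scores come from.
print(round(win_probability(2.0, 0.5), 2))  # 0.82
```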

Fig. 1: A diagram of the EPP benchmark.
Fig. 2: Boxplots of EPP scores split by different algorithms across data sets.
Fig. 3: Actual empirical probability and predicted probability of winning computed on the basis of the EPP meta-score value for data sets ‘banknote-authentication’ and ‘wdbc’.
Fig. 4: Boxplots of scores split by four selected models from the VTAB.

Data availability

The data sets generated during the current study are available in the EPP meta-score GitHub repository at https://github.com/agosiewska/EPP-meta-score (ref. 36). Source data are provided with this paper.

Code availability

An implementation of the EPP score is available at https://github.com/ModelOriented/EloML. The code generated during the current study is available in the EPP meta-score GitHub repository at https://github.com/agosiewska/EPP-meta-score (ref. 36).
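As a rough sketch of how Elo-style scores of this kind can be estimated, the snippet below fits a Bradley–Terry model by logistic regression on pairwise round outcomes. It is a simplified illustration with made-up data, written in Python for brevity; it is not the EloML API and not necessarily the authors' exact estimation procedure.

```python
# Bradley-Terry estimation of Elo-style scores via logistic regression.
# Hypothetical data; not the EloML package API.
import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["gbm", "rf", "knn"]
# Each round: (index of model A, index of model B, 1 if A won else 0).
rounds = [(0, 1, 1), (0, 2, 1), (1, 2, 1), (0, 1, 1), (1, 2, 0), (0, 2, 1)]

X = np.zeros((len(rounds), len(models)))
y = np.array([a_won for _, _, a_won in rounds])
for i, (a, b, _) in enumerate(rounds):
    X[i, a] = 1.0   # +1 for model A
    X[i, b] = -1.0  # -1 for model B

# No intercept: only score differences are identified, so the absolute
# level of the scores is pinned down only by the (very weak) regularisation.
clf = LogisticRegression(fit_intercept=False, C=1e6)
clf.fit(X, y)

scores = dict(zip(models, clf.coef_[0]))
print(scores)  # higher score -> higher estimated chance of winning a round
```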

References

  1. Wang, A. et al. GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: analyzing and interpreting neural networks for NLP (eds. Linzen, T., Chrupała, G. & Alishahi, A.), 353–355 (Association for Computational Linguistics, 2018).

  2. Wang, A. et al. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Adv. Neural Inform. Process. Syst. 32, 3261–3275 (2019).

  3. Zhai, X. et al. A large-scale study of representation learning with the Visual Task Adaptation Benchmark. Preprint at https://arxiv.org/abs/1910.04867 (2020).

  4. Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP) - round XIII. Proteins 87, 1011–1020 (2019).

  5. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  6. Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 1–23 (2019).

  7. Lensink, M. F., Nadzirin, N., Velankar, S. & Wodak, S. J. Modeling protein–protein, protein–peptide, and protein–oligosaccharide complexes: CAPRI 7th edition. Proteins 88, 916–938 (2020).

  8. Vanschoren, J., van Rijn, J. N., Bischl, B. & Torgo, L. OpenML: networked science in machine learning. SIGKDD Explor. Newsl. 15, 49–60 (2014).

  9. Martínez-Plumed, F., Barredo, P., hÉigeartaigh, S. Ó. & Hernández-Orallo, J. Research community dynamics behind popular AI benchmarks. Nat. Mach. Intell. 3, 581–589 (2021).

  10. Powers, D. Evaluation: from precision, recall and F-factor to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2, 37–63 (2008).

  11. Sokolova, M. & Lapalme, G. A systematic analysis of performance measures for classification tasks. Inform. Process. Manage. 45, 427–437 (2009).

  12. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  13. Abadi, M. et al. Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16), 265–283 (USENIX Association, 2016).

  14. Bischl, B. et al. mlr: machine learning in R. J. Mach. Learn. Res. 17, 1–5 (2016).

  15. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006).

  16. Dietterich, T. G. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Comput. 10, 1895–1923 (1998).

  17. Alpaydin, E. Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Comput. 11, 1885–1892 (1999).

  18. Bouckaert, R. R. Choosing between two learning algorithms based on calibrated tests. In Proceedings of the Twentieth International Conference on Machine Learning (eds. Fawcett, T. & Mishra, N.), ICML’03, 51–58 (AAAI Press, 2003).

  19. Salzberg, S. L. On comparing classifiers: pitfalls to avoid and a recommended approach. Data Min. Knowl. Discov. 1, 317–328 (1997).

  20. Guerrero Vázquez, E., Yañez Escolano, A., Galindo Riaño, P. & Pizarro Junquera, J. In Bio-Inspired Applications of Connectionism (eds. Mira, J. & Prieto, A.), 88–95 (Springer, 2001).

  21. Pizarro, J., Guerrero, E. & Galindo, P. L. Multiple comparison procedures applied to model selection. Neurocomputing 48, 155–173 (2002).

  22. Hull, D. Information Retrieval Using Statistical Classification. PhD thesis, Stanford Univ. (1994).

  23. Brazdil, P. B. & Soares, C. A comparison of ranking methods for classification algorithm selection. In Machine Learning: ECML 2000 (eds. López de Mántaras, R. & Plaza, E.), 63–75 (Springer, 2000).

  24. Elo, A. & Sloan, S. The Rating of Chess Players, Past and Present (Ishi, 2008).

  25. Bischl, B. et al. OpenML benchmarking suites. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (eds. Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P.S. & Wortman Vaughan, J.), vol. 1 (Curran Associates, Inc., 2021).

  26. Kretowicz, W. & Biecek, P. MementoML: performance of selected machine learning algorithm configurations on OpenML100 datasets. Preprint at https://arxiv.org/abs/2008.13162 (2020).

  27. Probst, P., Boulesteix, A.-L. & Bischl, B. Tunability: importance of hyperparameters of machine learning algorithms. J. Mach. Learn. Res. 20, 1–32 (2019).

  28. Bradley, R. A. & Terry, M. E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 39, 324–345 (1952).

  29. Clark, A. P., Howard, K. L., Woods, A. T., Penton-Voak, I. S. & Neumann, C. Why rate when you could compare? Using the “EloChoice” package to assess pairwise comparisons of perceived physical strength. PLOS ONE 13, 1–16 (2018).

  30. Agresti, A. In Categorical Data Analysis, vol. 482, chap. 6 (Wiley, 2003).

  31. Shimodaira, H. An approximately unbiased test of phylogenetic tree selection. Syst. Biol. 51, 492–508 (2002).

  32. Shimodaira, H. Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling. Ann. Stat. 32, 2616–2641 (2004).

  33. Suzuki, R. & Shimodaira, H. Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics 22, 1540–1542 (2006).

  34. Agresti, A. In Categorical Data Analysis, vol. 482, chap. 4 (Wiley, 2003).

  35. Gosiewska, A., Bakała, M., Woźnica, K., Zwoliński, M. & Biecek, P. EPP: interpretable score of model predictive power. Preprint at https://arxiv.org/abs/1908.09213 (2019).

  36. Gosiewska, A. & Woźnica, K. agosiewska/EPP-meta-score: EPP paper. Zenodo https://doi.org/10.5281/zenodo.6949519 (2022).

Acknowledgements

Work on this project is financially supported by NCN Opus grant 2017/27/B/ST6/01307.

We thank L. Bakała and D. Rafacz for inspiring ideas, W. Kretowicz and M. Zwoliński for preliminary work (ref. 35) and P. Teisseyre, E. Sienkiewicz, H. Baniecki and B. Rychalska for useful comments.

Author information

Authors and Affiliations

Authors

Contributions

A.G. and K.W. designed and implemented the EPP method as an R package, as well as studied and described the theoretical properties of EPP. A.G. performed the EPP leaderboard on the VTAB benchmark and developed the unified benchmark ontology. K.W. performed the EPP leaderboard on the OpenML benchmark and designed and performed the simulations in the Supplementary Materials. P.B. supervised the project, provided technical advice, helped design the method and analysed the experiments. All authors participated in the conceptualization and preparation of the paper.

Corresponding authors

Correspondence to Alicja Gosiewska, Katarzyna Woźnica or Przemysław Biecek.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Philipp Probst and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 A unified ontology of ML benchmarks.

The violet dashed rectangle shows a minimal setup for any benchmark.

Extended Data Table 1 Example Schemes for EPP Benchmark
Extended Data Table 2 The descriptions of the EPP Benchmark components that extend the Unified Benchmark Ontology
Extended Data Table 3 EPP of selected models for the ada_agnostic data set. AUC values are averaged. Model numbers are IDs from the MementoML benchmark
Extended Data Table 4 The best models in each algorithm class for the mozilla4 data set. AUC values are averaged. Model numbers are IDs from the MementoML benchmark
Extended Data Table 5 The best models in each algorithm class for the credit-g data set. AUC values are averaged. Model numbers are IDs from the MementoML benchmark
Extended Data Table 6 Springleaf Marketing Response Kaggle Competition. https://www.kaggle.com/c/springleaf-marketing-response
Extended Data Table 7 IEEE-CIS Fraud Detection Kaggle Competition. https://www.kaggle.com/c/ieee-fraud-detection

Supplementary information

Supplementary information

Supplementary Figs. S1–S5, Discussion S1–S5 and Tables 1 and 2.

Reporting summary

Supplementary Data 1

Source Data Supplementary Material Fig. 1.

Supplementary Data 2

Source Data Supplementary Material Fig. 2.

Supplementary Data 3

Source Data Supplementary Material Fig. 3.

Supplementary Data 4

Source Data Supplementary Material Fig. 4.

Supplementary Data 5

Source Data Supplementary Material Fig. 5.

Source data

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Gosiewska, A., Woźnica, K. & Biecek, P. Interpretable meta-score for model performance. Nat Mach Intell 4, 792–800 (2022). https://doi.org/10.1038/s42256-022-00531-2

