Research community dynamics behind popular AI benchmarks

Martínez-Plumed, Fernando; Barredo, Pablo; Ó hÉigeartaigh, Seán; Hernández-Orallo, José

doi:10.1038/s42256-021-00339-6

Article
Published: 17 May 2021

Nature Machine Intelligence volume 3, pages 581–589 (2021)

Abstract

The widespread use of experimental benchmarks in AI research has created competition and collaboration dynamics that are still poorly understood. Here we provide an innovative methodology to explore these dynamics and analyse the way different entrants in these challenges, from academia to tech giants, behave and react depending on their own or others’ achievements. We perform an analysis of 25 popular benchmarks in AI from Papers With Code, with around 2,000 result entries overall, connected with their underlying research papers. We identify links between researchers and institutions (that is, communities) beyond the standard co-authorship relations, and we explore a series of hypotheses about their behaviour as well as some aggregated results in terms of activity, performance jumps and efficiency. We characterize the dynamics of research communities at different levels of abstraction, including organization, affiliation, trajectories, results and activity. We find that hybrid, multi-institution and persevering communities are more likely to improve state-of-the-art performance, which becomes a watershed for many community members. Although the results cannot be extrapolated beyond our selection of popular machine learning benchmarks, the methodology can be extended to other areas of artificial intelligence or robotics, and combined with bibliometric studies.
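
The abstract refers to "performance jumps" in the state of the art (SOTA) without defining them in this preview. As a minimal sketch, assume a SOTA jump is any result entry whose score strictly exceeds every score published before it on the same benchmark. The `Entry` type, its field names, the `sota_jumps` helper and the toy scores below are all illustrative assumptions; they are not taken from the authors' code or data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    """One leaderboard result (hypothetical schema, not the authors')."""
    paper: str
    published: date
    score: float  # assumed higher-is-better, e.g. accuracy in %


def sota_jumps(entries):
    """Return (entry, gain) pairs for entries that beat all earlier scores.

    Entries are processed in publication order. The first entry only sets
    the baseline and is not counted as a jump (an assumption of this sketch).
    """
    ordered = sorted(entries, key=lambda e: e.published)
    jumps = []
    best = None
    for entry in ordered:
        if best is None:
            best = entry.score
        elif entry.score > best:
            jumps.append((entry, entry.score - best))
            best = entry.score
    return jumps


# Invented toy scores, for illustration only.
entries = [
    Entry("paper-a", date(2014, 9, 1), 74.0),
    Entry("paper-b", date(2015, 2, 1), 73.0),   # below the best so far
    Entry("paper-c", date(2015, 12, 1), 78.5),  # new best
    Entry("paper-d", date(2016, 8, 1), 80.0),   # new best
]

for entry, gain in sota_jumps(entries):
    print(f"{entry.paper}: +{gain:.1f}")
```

Under these assumptions, only `paper-c` and `paper-d` count as jumps. A fuller analysis would attach each jump to the community or institution behind the paper, which this sketch does not attempt.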

**Fig. 1: Progress in accuracy over time for ImageNet.**

**Fig. 2: Progress in accuracy over time for SQuAD1.1.**

**Fig. 3: Most prolific institutions (at least ten entries) in terms of total SOTA jumps entries and activity.**

Data availability

The data regarding all the papers analysed, their authors, community memberships, results for the different benchmarks and SOTA jumps can be found in the ‘data’ folder on GitHub⁴¹.

Code availability

The code for reproducing results can be found on GitHub⁴¹.

References

Fortunato, S. et al. Science of science. Science 359, eaao0185 (2018).
Wu, L., Wang, D. & Evans, J. A. Large teams develop and small teams disrupt science and technology. Nature 566, 378–382 (2019).
Frank, M. R., Wang, D., Cebrian, M. & Rahwan, I. The evolution of citation graphs in artificial intelligence research. Nat. Mach. Intell. 1, 79–85 (2019).
Martínez-Plumed, F. et al. Accounting for the neglected dimensions of AI progress. Preprint at https://arxiv.org/abs/1806.00610 (2018).
Perrault, R. et al. The AI Index 2019 Annual Report (AI Index Steering Committee, Human-Centered AI Institute, Stanford Univ. 2019); https://hai.stanford.edu/ai-index-2019
Clauset, A., Newman, M. E. J. & Moore, C. Finding community structure in very large networks. Phys. Rev. E 70, 066111 (2004).
Van Raan, A. The influence of international collaboration on the impact of research results: some simple mathematical considerations concerning the role of self-citations. Scientometrics 42, 423–428 (1998).
Deng, J. et al. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, 2009).
Rajpurkar, P., Zhang, J., Lopyrev, K. & Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing 2383–2392 (Association for Computational Linguistics, 2016).
Bonferroni, C. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8, 3–62 (1936).
Kwok, R. Junior AI researchers are in demand by universities and industry. Nature 568, 581–584 (2019).
Rhoades, S. A. The Herfindahl–Hirschman index. Fed. Res. Bull. 79, 188–189 (1993).
Cave, S. & Ó hÉigeartaigh, S. S. An AI race for strategic advantage: rhetoric and risks. In Proc. 2018 AAAI/ACM Conference on AI, Ethics, and Society 36–40 (Association for Computing Machinery, 2018).
Lee, K.-F. AI Superpowers: China, Silicon Valley, and the New World Order (Houghton Mifflin Harcourt, 2018).
Horowitz, M. C., Allen, G. C., Kania, E. B. & Scharre, P. Strategic Competition in an Era of Artificial Intelligence 8 (Center for New American Security, 2018).
Li, W. C., Nirei, M. & Yamana, K. Value of Data: There’s No Such Thing as a Free Lunch in the Digital Economy Working Paper (US Bureau of Economic Analysis, 2019).
Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (2009).
Hernández-Orallo, J. et al. A new AI evaluation cosmos: Ready to play the game? AI Magazine 38, 66–69 (2017).
Shoham, Y. Towards the AI index. AI Magazine 38, 71–77 (2017).
Niu, J., Tang, W., Xu, F., Zhou, X. & Song, Y. Global research on AI from 1990–2014: spatially-explicit bibliometric analysis. ISPRS Int. J. Geoinf. 5, 66 (2016).
Juan Mateos-Garcia, K. S., Klinger, J. & Winch, R. A Semantic Analysis of the Recent Evolution of AI Research. https://www.nesta.org.uk/report/semantic-analysis-recent-evolution-ai-research/ (NESTA, 2019).
Gao, F. et al. Bibliometric analysis on tendency and topics of artificial intelligence over last decade. Microsyst. Technol. 1–13 (2019).
Tran, B. X. et al. Global evolution of research in artificial intelligence in health and medicine: a bibliometric study. J. Clin. Med. 8, 360 (2019).
Tang, X., Li, X., Ding, Y., Song, M. & Bu, Y. The pace of artificial intelligence innovations: speed, talent, and trial-and-error. J. Inf. 14, 101094 (2020).
Qian, Y., Liu, Y. & Sheng, Q. Z. Understanding hierarchical structural evolution in a scientific discipline: a case study of artificial intelligence. J. Inf. 14, 101047 (2020).
Serenko, A. The development of an AI journal ranking based on the revealed preference approach. J. Inf. 4, 447–459 (2010).
Campbell, M., Hoane Jr, A. J. & Hsu, F.-h. Deep Blue. Artif. Intell. 134, 57–83 (2002).
Ferrucci, D. A. Introduction to ‘This is Watson’. IBM J. Res. Dev. 56, 235–249 (2012).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Schlangen, D. Language tasks and language games: on methodology in current natural language processing research. Preprint at https://arxiv.org/abs/1908.10747 (2019).
Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A. & Choi, Y. HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 4791–4800 (Association for Computational Linguistics, 2019).
Lei, Y. & Liu, Z. The development of artificial intelligence: a bibliometric analysis, 2007–2016. J. Phys. Conf. Ser. 1168, 022027 (2019).
Martínez-Plumed, F. et al. The facets of artificial intelligence: a framework to track the evolution of AI. In Proc. Twenty-Seventh International Joint Conference on Artificial Intelligence 5180–5187 (International Joint Conferences on Artificial Intelligence Organization, 2018).
Bhattacharya, J. & Packalen, M. Stagnation and Scientific Incentives Technical Report (National Bureau of Economic Research, 2020).
Houghton, B. et al. Guaranteeing reproducibility in deep learning competitions. Preprint at https://arxiv.org/abs/2005.06041 (2020).
Lucic, M., Kurach, K., Michalski, M., Gelly, S. & Bousquet, O. Are GANs created equal? A large-scale study. Adv. Neural Inf. Process. Syst. 700–709 (2018).
Hernandez, D. & Brown, T. B. Measuring the algorithmic efficiency of neural networks. Preprint at https://arxiv.org/abs/2005.04305 (2020).
Mattson, P. et al. MLPerf training benchmark. Preprint at https://arxiv.org/abs/1910.01500 (2019).
Martínez-Plumed, F. & Hernández-Orallo, J. Dual indicators to analyse AI benchmarks: difficulty, discrimination, ability, and generality. IEEE Trans. Games 12, 121–131 (2020).
Martínez-Plumed, F., Barredo, P., hÉigeartaigh, S. Ó. & Hernández-Orallo, J. AI research dynamics. GitHub https://github.com/nandomp/AI_Research_Dynamics (2021).
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T. & Serre, T. HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision 2556–2563 (IEEE, 2011).
Soomro, K., Zamir, A. R. & Shah, M. UCF101: a dataset of 101 human actions classes from videos in the wild. Preprint at https://arxiv.org/abs/1212.0402 (2012).
Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
Timofte, R., De Smet, V. & Van Gool, L. Anchored neighborhood regression for fast example-based super-resolution. In Proc. IEEE International Conference on Computer Vision 1920–1927 (IEEE, 2013).
Hutter, M. Human knowledge compression contest. Hutter Prize http://prize.hutter1.net/ (2006).
Mikolov, T., Deoras, A., Kombrink, S., Burget, L. & Černocky, J. Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association 605–608 (2011).
Dettmers, T., Minervini, P., Stenetorp, P. & Riedel, S. Convolutional 2D knowledge graph embeddings. In Proc. AAAI Conference on Artificial Intelligence Vol. 32 (2018).
Bojar, O. et al. Findings of the 2014 workshop on statistical machine translation. In Proc. Ninth Workshop on Statistical Machine Translation 12–58 (Association for Computational Linguistics, 2014); http://www.aclweb.org/anthology/W/W14/W14-3302
Sang, E. F. & De Meulder, F. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 142–147 (2003).
Weischedel, R. et al. Ontonotes Release 5.0 ldc2013t19 23 (Linguistic Data Consortium, 2013).
Lin, T.-Y. et al. Microsoft COCO: common objects in context. In European Conference on Computer Vision 740–755 (Springer, 2014).
Andriluka, M., Pishchulin, L., Gehler, P. & Schiele, B. 2D human pose estimation: new benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 3686–3693 (IEEE, 2014).
Yang, Y., Yih, W.-t. & Meek, C. WikiQA: a challenge dataset for open-domain question answering. In Proc. 2015 Conference on Empirical Methods in Natural Language Processing 2013–2018 (Association for Computational Linguistics, 2015).
Cordts, M. et al. The cityscapes dataset for semantic urban scene understanding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 3213–3223 (IEEE, 2016).
Everingham, M. et al. The Pascal visual object classes challenge: a retrospective. Int. J. Comput. Vis. 111, 98–136 (2015).
Maas, A. L. et al. Learning word vectors for sentiment analysis. In Proc. 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies Vol. 1, 142–150 (Association for Computational Linguistics, 2011).
Socher, R. et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proc. 2013 Conference on Empirical Methods in Natural Language Processing 1631–1642 (Association for Computational Linguistics, 2013).
Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5206–5210 (IEEE, 2015).

Acknowledgements

F.M.-P. acknowledges funding from the AI-Watch project by DG CONNECT and DG JRC of the European Commission. J.H.-O. and S.Ó.h. were funded by the Future of Life Institute, FLI, under grant RFP2-152. J.H.-O. was supported by the EU (FEDER) and Spanish MINECO under RTI2018-094403-B-C32, Generalitat Valenciana under PROMETEO/2019/098 and the European Union’s Horizon 2020 grant no. 952215 (TAILOR).

Author information

Authors and Affiliations

European Commission, Joint Research Centre, Seville, Spain
Fernando Martínez-Plumed
Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Valencia, Spain
Fernando Martínez-Plumed & José Hernández-Orallo
Universidad de Oviedo, Oviedo, Spain
Pablo Barredo
Leverhulme Centre for the Future of Intelligence, University of Cambridge, Cambridge, UK
Seán Ó hÉigeartaigh & José Hernández-Orallo
Centre for the Study of Existential Risk, University of Cambridge, Cambridge, UK
Seán Ó hÉigeartaigh

Contributions

The four authors, F.M.-P., P.B., S.Ó.h. and J.H.-O., participated in the definition and refinement of the goals of this study and the hypotheses. F.M.-P., J.H.-O. and P.B. conceived the technical methodology. P.B. and F.M.-P. implemented the code that collects and processes the data, and creates the communities. F.M.-P. generated the plots. All authors discussed the results and contributed to the writing of the final manuscript.

Corresponding authors

Correspondence to Fernando Martínez-Plumed, Pablo Barredo, Seán Ó hÉigeartaigh or José Hernández-Orallo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Machine Intelligence thanks Nima Dehmamy, Lars Kotthoff and Dashun Wang for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary analysis, Figs. 1–26 and Table 1.

About this article

Cite this article

Martínez-Plumed, F., Barredo, P., hÉigeartaigh, S.Ó. et al. Research community dynamics behind popular AI benchmarks. Nat Mach Intell 3, 581–589 (2021). https://doi.org/10.1038/s42256-021-00339-6

Received: 09 October 2020
Accepted: 07 April 2021
Published: 17 May 2021
Issue Date: July 2021
DOI: https://doi.org/10.1038/s42256-021-00339-6
