Joelle Pineau doesn’t want science’s reproducibility crisis to come to artificial intelligence (AI).
Spurred by her frustration with difficulties recreating results from other research teams, Pineau, a machine-learning scientist at McGill University and Facebook in Montreal, Canada, is now spearheading a movement to get AI researchers to open up their methods and code to scrutiny.
Alongside Koustuv Sinha, a PhD student at McGill, Pineau holds one of two new roles dedicated to reproducibility on the organizing committee for the Conference on Neural Information Processing Systems (NeurIPS), a major meeting for AI that this year attracted some 13,000 researchers. Ahead of this year’s conference in Vancouver, Canada, from 8 to 14 December, the committee asked scientists to provide their code and fill in a checklist of methodological details for each paper submitted. They also ran a competition that challenged researchers to recreate each other’s work.
Pineau spoke to Nature about the measures and how they’ve gone down with the community.
It’s easy to imagine why scientific studies of the natural world might be hard to reproduce. But why are some algorithms irreproducible?
It’s true that with code, you press start and, for the most part, it should do the same thing every time. The challenge can be trying to reproduce a precise set of instructions in machine code from a paper. And then there’s the issue that papers don’t always give all the detail, or they give misleading detail. Sometimes that’s unintentional, and perhaps sometimes it’s done to make the results look more favourable. That’s a big issue.
What got you interested in reproducibility?
I fell into reproducibility by accident. Over and over again my students would say ‘I can’t get these results’, or they found that, to get the results, they had to do things that I didn’t think were methodologically correct. So for me it was important to stop such practices before they become the norm. It’s also very timely for the wider community because there are a lot of people flooding into the field and it’s important to establish what the methodological norms are.
What’s an example of such a practice?
In reinforcement learning, for example, if you do two runs of some algorithms with different initial random settings, you can get very different results. And if you do a lot of runs, you’re able to report only the best ones. Results from the people with more computing power to do more runs will look better. Papers don’t always say how many runs were performed. But it makes a big difference to the conclusions you draw.
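The effect Pineau describes can be illustrated with a toy simulation. The sketch below is not drawn from any NeurIPS paper: `simulate_run` is a hypothetical stand-in for a training run, and its score distribution (mean 100, standard deviation 15) is an assumption chosen only for illustration. It compares the average over all runs with the average of the best three, for increasing numbers of runs.

```python
import random
import statistics


def simulate_run(seed: int) -> float:
    """Stand-in for one training run: returns a noisy final score.

    Hypothetical model: true performance 100, seed-dependent noise with
    standard deviation 15. A real experiment would train an agent here.
    """
    rng = random.Random(seed)
    return rng.gauss(100.0, 15.0)


def summarize(num_runs: int, top_k: int = 1) -> dict:
    """Run `num_runs` seeds and compare honest and cherry-picked reporting."""
    scores = [simulate_run(seed) for seed in range(num_runs)]
    best = sorted(scores, reverse=True)[:top_k]
    return {
        "num_runs": num_runs,
        "mean_all": statistics.mean(scores),
        "stdev_all": statistics.stdev(scores),
        "mean_top_k": statistics.mean(best),
    }


if __name__ == "__main__":
    # More runs means more chances of a lucky seed, so the "best 3" figure
    # can only stay the same or rise, while the underlying algorithm is unchanged.
    for n in (5, 50, 500):
        s = summarize(n, top_k=3)
        print(f"runs={s['num_runs']:>3}  mean of all={s['mean_all']:.1f} "
              f"(sd {s['stdev_all']:.1f})  mean of best 3={s['mean_top_k']:.1f}")
```

Because the larger experiments here reuse the seeds of the smaller ones, the best-three average cannot fall as runs are added, which is why a paper that reports only its top results should also say how many runs it performed.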
What reproducibility measures were enacted at NeurIPS this year?
There were three portions: we encouraged people to submit their code; we’re running a reproducibility challenge; and we introduced a checklist for papers submitted. The checklist asks, for example, whether you clearly labelled the type of metrics and measures you’re using, whether you provided error bars, what the details of your model are and how you set certain aspects of the model that can change the results a lot.
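As a rough illustration of what the error-bar item on the checklist might look like in practice, the sketch below reports a mean score together with its standard error and the number of runs. The function name and the five scores are hypothetical, and the standard error is only one of several reasonable choices of error bar; the checklist itself does not prescribe this one.

```python
import math
import statistics


def mean_and_standard_error(scores):
    """Return (mean, standard error, number of runs) for per-run scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, sem, n


if __name__ == "__main__":
    # Hypothetical final scores from five runs with different random seeds.
    scores = [91.2, 104.5, 98.7, 110.3, 95.1]
    mean, sem, n = mean_and_standard_error(scores)
    print(f"score = {mean:.1f} +/- {sem:.1f} (standard error, n={n} runs)")
```

Stating the number of runs next to the error bar lets a reader judge how much weight the figure can bear.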
What has the reception been like from researchers?
Very good. I’ve had discussions with people in other scientific communities who were quite surprised that I was given so much of a platform to talk about reproducibility. It felt like, in their field, if you start questioning reproducibility, there’s a very strong establishment that wants to keep that message shut down. Code submission is one of the elements I’m most impressed with. A year ago, 50% of accepted NeurIPS papers contained a link to code; this year, we’re at 75%.
We have partnered with two information scientists who are analysing the response and drawing out useful patterns in data. Already, we can say that 34% of reviewers of conference papers indicated that the checklist had been useful, and the presence of the code had a significant positive effect on reviewer score. But we need to study this more before we draw conclusions.
What does the reproducibility challenge entail?
We have 173 participants at 73 institutions worldwide. They pick which paper they want to reproduce — any that’s been accepted at NeurIPS is fair game — and produce a report that documents which part they verified, their method and what they found. Those reports are shared on OpenReview, a website that promotes openness in scientific communication. We will collect a few very good examples and publish them in a special issue of the journal ReScience C.
What’s the ultimate goal of these measures?
Better reproducibility means it’s much easier to build on a paper. Often, the review process is short and limited, and the true impact of a paper is something we see much later. The paper lives on, and as a community we have a chance to build on the work, examine the code and cast a critical eye on what the contributions are.
Nature 577, 14 (2020)