In times of tight budget constraints, scientists' wranglings about the real and perceived sins of public funding agencies become particularly acute. Complaints usually lead to the creation of a panel of respected, thoughtful and well-meaning scientists who come up with a plan of reform based on their intuition and experience. Funding agencies, who are genuinely concerned about improving the productivity of the scientific enterprise, often adopt these recommendations, at least in part. In one example of this process, the US National Institutes of Health (NIH) in Bethesda, Maryland, has created a large array of funding mechanisms, each one targeted to a particular problem — including the K99/R00 or 'kangaroo' grants, which pair postdoctoral scientists with mentors to help them to prepare for tenure-track faculty positions and funding independence. Not only is this range of mechanisms confusing and costly to administer, but the effectiveness of such reforms is never seriously evaluated.

It is time to turn the scientific method on ourselves. In our attempts to reform the institutions of science, we should adhere to the same empirical standards that we insist on when evaluating research results. We already know how: by subjecting proposed reforms to a prospective, randomized controlled experiment. Retrospective analyses using selected samples are often little more than veiled attempts to justify past choices.


What could such a formal experiment look like? Let me give an example. It is well documented that the past 30 years have seen a marked increase in the age at which academic scientists achieve funding independence [1]. One way to ensure the continued injection of talent into these ranks would be to evaluate first-time applicants separately from a larger pool, and dedicate to them a predetermined share of the available funding. An alternative would be to keep the current system in place, but to award their proposals 'bonus points'. Which reform should we adopt? And what if the 'greying' of the scientific workforce does not stem from institutional failure, but rather reflects the influence of an ever-expanding burden of knowledge, whereby scientists must spend more time in training before they can become productive [2]?

To test these questions empirically, for example within the NIH (the agency I have studied most closely), we could choose a random subset of funding panels to implement the first method and a second subset to implement the other. A third subset, in which funding panels proceed with business as usual, would serve as the control group in the experiment. Ideally, the study would be designed to avoid 'panel shopping' by applicants; the 100,000 or so R01 grant proposals reviewed each year by the 183 NIH funding panels are more than enough to craft a statistical protocol with adequate power.
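As a rough illustration of the random assignment described above, the following sketch splits 183 panels evenly across the two reform arms and a control arm. The panel identifiers, arm names and fixed seed are hypothetical placeholders, and a real protocol would also need stratification and a power analysis that this sketch does not attempt.

```python
import random
from collections import Counter


def assign_panels(panel_ids, arms=("set_aside", "bonus_points", "control"), seed=0):
    """Randomly assign each panel to one arm, keeping arm sizes as equal as possible.

    The arm names are illustrative labels for the two reforms and the
    business-as-usual control; they are not official NIH terms.
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be audited
    shuffled = list(panel_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled panels out to the arms in turn, like cards.
    return {panel: arms[i % len(arms)] for i, panel in enumerate(shuffled)}


# Hypothetical identifiers for the 183 funding panels.
panels = [f"panel_{i:03d}" for i in range(183)]
assignment = assign_panels(panels)
print(Counter(assignment.values()))  # each arm receives 61 panels
```

Fixing the seed in advance and publishing it is one simple way to let outsiders verify that the assignment was not adjusted after the fact.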

These experiments could exist outside government agencies, too. When philanthropic organizations develop new models to fund research, they should formally investigate how their approach compares with the dominant model, which averages experts' scores to determine the funding-priority ranking for particular projects. Some emerging models, for example, give higher ranking to projects that elicit enthusiasm and controversy than to projects that generate more consensus but only tepid support across reviewers. It might be that funding projects on the basis of strong but divided reviewer enthusiasm is more likely to result in the selection of truly innovative, field-changing projects, but how will we know for sure? A serious evaluation of this question would compare the two systems by randomizing proposals to one of these two ranking approaches, and then examining which portfolio of projects is more successful.
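To make the contrast between the two ranking approaches concrete, here is a minimal sketch. It assumes that higher scores are better and that 'enthusiasm and controversy' can be approximated by adding the spread of reviewer scores to their mean. That formula, the `weight` parameter and the example scores are illustrative assumptions, not any funder's actual rule.

```python
from statistics import mean, pstdev


def rank_by_mean(scores):
    """Dominant model: order proposals by the average of reviewers' scores."""
    return sorted(scores, key=lambda p: mean(scores[p]), reverse=True)


def rank_by_enthusiasm(scores, weight=1.0):
    """Hypothetical alternative: reward disagreement among reviewers.

    Adds `weight` times the population standard deviation to the mean, so a
    proposal with a few strong champions can outrank a uniformly tepid one.
    """
    return sorted(
        scores,
        key=lambda p: mean(scores[p]) + weight * pstdev(scores[p]),
        reverse=True,
    )


# Made-up scores from four reviewers (higher is better in this sketch).
scores = {
    "consensus": [6, 6, 6, 6],  # mean 6.0, no disagreement
    "divisive": [9, 9, 2, 2],   # mean 5.5, strong disagreement
}

print(rank_by_mean(scores))        # ['consensus', 'divisive']
print(rank_by_enthusiasm(scores))  # ['divisive', 'consensus']
```

In the experiment proposed above, each incoming proposal would be randomly assigned to one of these two rules, and the resulting portfolios compared on outcome measures agreed in advance.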

When I suggest these experiments, I encounter a lot of resistance. Wouldn't this be gambling with scientists' careers? How can we measure success — by counting publications and citations, looking at the students trained as a by-product of these grants, or using other metrics? Won't this work shift scarce funding away from actual scientific investigations?

These criticisms are without merit. The current system already gambles with scientific careers, just in a haphazard way. Scientists often disagree on how to measure success, and the choice of metric, as well as the period necessary for a careful assessment, will always be context-dependent and controversial. With the good will of administrators in public agencies or private foundations, these experiments could be rolled out with minimal disruption for about the cost of an R01 grant (typically US$250,000 per year for 5 years). If the scientific community could test even a small number of hypotheses in this way, the system-wide benefits would dwarf this modest investment.

I am well aware that this vision will sound utopian to some. Sceptics abounded when my colleagues at the Massachusetts Institute of Technology in Cambridge founded the Jameel Poverty Action Lab and subjected development-assistance methods to randomized, controlled trials to see which worked best. But as a result of this work, policy-makers now know that it is better to give out free mosquito nets to prevent malaria than to charge even a low price for their purchase, one example among many [3,4].

We inherited the current institutions of science from the period just after the Second World War. It would be a fortuitous coincidence if the systems that served us so well in the twentieth century were equally adapted to twenty-first-century needs. Experimenting on ourselves may well lay bare some shortcomings of the scientific community and expose us to criticisms from politicians, who are always looking for excuses to cut science funding. But the only alternative to such controlled experimentation is the gradual stultification of our most cherished scientific institutions.