Purists will tell you that science is about what scientists don’t know, which is true but not much of a basis on which to develop new cancer drugs. Hence the importance of knowledge: how crucial this mutation or that cell-surface receptor really is to cancer growth. These are the findings that launch companies and clinical trials — provided, of course, that they have been published in research papers in peer-reviewed journals.

As we report in a News story this week, a systematic effort to check some of these findings by repeating experiments from an initial five published cancer studies has reported that none could be completely reproduced. The significance of this divergence — how the specific experiments were selected and what the results mean for the broader agenda of reproducibility in research — is already hotly contested.

Perhaps the most influential aspect of the exercise, called the Reproducibility Project: Cancer Biology, has nothing to do with those arguments. It lies beneath the surface, in the peer reviews of the project teams’ replication plans, which were published before the studies began. These reviews can be read as part of the editorial decision letters linked to each replication plan, or ‘registered report’ (see go.nature.com/2jte08a).

What, one might ask, could be less interesting? What insights can emerge from technical feedback on plans to repeat what other groups of scientists have done? Plenty. The decision letters reveal the practice of science at its best: probing whether an experiment truly tests a particular idea; identifying shortcomings in the original set-ups; and proposing further, sounder tests. To those committed to improving science, reading these insights — the fruit of voluntary donations of time and effort — will prove a moving experience. Journal clubs would do well to read the original papers, and then compare their own analyses with those of the scholars who reviewed the registered reports.

Again and again, the peer reviewers and the replicators clash. The reviewers are eager to produce the best experiment to test a publication’s conclusions; they want to correct deficiencies in the design of the original high-impact studies. The replicators do, on several occasions, agree to add an extra measurement, particularly of positive and negative controls that had originally been neglected. Often, however, they resist calls for more definitive studies. Testing “the experiment’s underlying hypothesis,” they insist, “is not an aim of the project.”


This is a frustrating, if understandable, response. It is easier to compare the results of highly similar experiments than to assess a conclusion. Thus, the replication efforts are not especially interested in, say, the big question of whether public gene-expression data can point to surprising uses for existing drugs. They focus instead on narrower points, such as whether a specific finding that an ulcer drug stalls the growth of lung cancer in mice holds up (it did, more or less; see I. Kandela et al. eLife 6, e17044; 2017). Even so, the results are not definitive. One set of dissimilar results can't establish that the original result was a false positive; nor can one set of similar results show that the original conclusion is correct.

Yet a project that sticks to direct replication and eschews broader ‘truth’ can still raise bigger issues. After all, grand conclusions are built on myriad laboratory experiments. How well (or not) such experiments work outside the original lab is too often communicated through gossip rather than open discourse. For just US$2 million in private funding — less than a typical 5-year grant from the US National Institutes of Health to a single investigator — this replication project shines a very public light on the sticking points of experiments.

One aspect that merits increased focus is how markedly the results of control experiments varied between the original work and the replications. In one case, mice in the control group of an original study survived for nine weeks after being engrafted with tumours, whereas those in the replication survived for just one week. In another, a bioluminescent signal used to track tumour burden differed substantially between the original and replication studies. Such discrepancies occur frequently in biology, but are often overlooked because researchers focus only on results in their own laboratories. There is great value in highlighting variability in ‘established’ experimental systems.

Such results deserve more attention, both technically and philosophically. Researchers routinely optimize experimental conditions to see the signals that they are looking for. But how far can this proceed before confirmation bias makes results uninterpretable? How can pilot studies be run without unduly encouraging favourable results?

More than 50 years ago, the philosopher Thomas Kuhn defined ‘normal science’ as the kind of work that faithfully supports or chisels away at current hypotheses. It is easy to dismiss this as workmanlike and uninteresting. But only by doing such normal science — and doing it well — can we recognize when revolutions are necessary.