Many scientific disciplines are currently experiencing a 'reproducibility crisis' because numerous scientific findings cannot be repeated consistently. A novel but controversial hypothesis postulates that stringent levels of environmental and biotic standardization in experimental studies reduce reproducibility by amplifying the impacts of laboratory-specific environmental factors not accounted for in study designs. A corollary to this hypothesis is that a deliberate introduction of controlled systematic variability (CSV) in experimental designs may lead to increased reproducibility. To test this hypothesis, we had 14 European laboratories run a simple microcosm experiment using grass (Brachypodium distachyon L.) monocultures and grass and legume (Medicago truncatula Gaertn.) mixtures. Each laboratory introduced environmental and genotypic CSV within and among replicated microcosms established in either growth chambers (with stringent control of environmental conditions) or glasshouses (with more variable environmental conditions). The introduction of genotypic CSV led to 18% lower among-laboratory variability in growth chambers, indicating increased reproducibility, but had no significant effect in glasshouses where reproducibility was generally lower. Environmental CSV had little effect on reproducibility. Although there are multiple causes for the 'reproducibility crisis', deliberately including genetic variability may be a simple solution for increasing the reproducibility of ecological studies performed under stringently controlled environmental conditions.
Reproducibility—the ability to duplicate a study and its findings—is a defining feature of scientific research. In ecology, it is often argued that it is virtually impossible to accurately duplicate any single ecological experiment or observational study. The rationale is that the complex ecological interactions between the ever-changing environment and the extraordinary diversity of biological systems exhibiting a wide range of plastic responses at different levels of biological organization make exact duplication unfeasible1,2. Although this may be true for observational and field studies, numerous ecological (and agronomic) studies are carried out with artificially assembled simplified ecosystems and controlled environmental conditions in experimental microcosms or mesocosms (henceforth, ‘microcosms’)3,4,5. Since biotic and environmental parameters can be tightly controlled in microcosms, the results from such studies should be easier to reproduce. Even though microcosms have frequently been used to address fundamental ecological questions4,6,7, there has been no quantitative assessment of the reproducibility of any microcosm experiment.
Experimental standardization—the implementation of strictly defined and controlled properties of organisms and their environment—is widely thought to increase both the reproducibility and sensitivity of statistical tests8,9 because it reduces within-treatment variability. This paradigm has recently been challenged by several studies on animal behaviour, suggesting that stringent standardization may, counterintuitively, be responsible for generating non-reproducible results9,10,11 and contribute to the current reproducibility crisis12,13,14,15; the results may be valid under given conditions (that is, they are local ‘truths’), but are not generalizable8,16. Despite rigorous adherence to experimental protocols, laboratories inherently vary in many conditions that are not measured and are thus unaccounted for, such as experimenter, micro-scale environmental heterogeneity, physico-chemical properties of reagents and laboratory-ware, pre-experimental conditioning of organisms, and their genetic and epigenetic background. It has even been suggested that attempts to stringently control all sources of biological and environmental variability might inadvertently amplify the effects of these unmeasured variations among laboratories, thus reducing reproducibility9,10,11.
Some studies have gone even further, hypothesizing that the introduction of controlled systematic variability (CSV) among the replicates of a treatment (for example, using different genotypes or varying the organisms’ pre-experimental conditions among the experimental replicates) should lead to less variable mean response values between the laboratories that duplicate the experiments9,11. In short, it has been argued that reproducibility may be improved by shifting the variance from among experiments to within them9. If true, introducing CSV will increase researchers’ ability to draw generalizable conclusions about the directions and effect sizes of experimental treatments and reduce the probability of false positives. The trade-off inherent to this approach is that increasing within-experiment variability will reduce the sensitivity (that is, the probability of detecting true positives) of statistical tests. However, it currently remains unclear whether introducing CSV increases the reproducibility of ecological microcosm experiments and, if so, at what cost for the sensitivity of statistical tests.
To test the hypothesis that introducing CSV enhances reproducibility in an ecological context, we had 14 European laboratories simultaneously run a simple microcosm experiment using grass (Brachypodium distachyon L.) monocultures and grass and legume (Medicago truncatula Gaertn.) mixtures. As part of the reproducibility experiment, the 14 laboratories independently tested the hypothesis that the presence of the legume species M. truncatula in mixtures would lead to higher total plant productivity in the microcosms and enhanced growth of the non-legume B. distachyon via rhizobia-mediated nitrogen fertilization and/or nitrogen-sparing effects17,18,19.
All laboratories were provided with the same experimental protocol, seed stock from the same batch and identical containers in which to establish microcosms with grass only and grass–legume mixtures. Alongside a control with no CSV and containing a homogenized soil substrate (a mixture of soil and sand) and a single genotype of each plant species, we explored the effects of five different types of within- and among-microcosm CSV on experimental reproducibility of the legume effect (Fig. 1): (1) within-microcosm environmental CSV (ENVW) achieved by spatially varying soil resource distribution through the introduction of six sand patches into the soil; (2) among-microcosm environmental CSV (ENVA), which varied the number of sand patches (none, three or six) among replicate microcosms; (3) within-microcosm genotypic CSV (GENW), which used three distinct genotypes per species planted in homogenized soil in each microcosm; (4) among-microcosm genotypic CSV (GENA), which varied the number of genotypes (one, two or three) planted in homogenized soil among replicate microcosms; and (5) both genotypic and environmental CSV (GENW + ENVW) within microcosms, which used six sand patches and three plant genotypes per species in each microcosm. In addition, we tested whether CSV effects are modified by the level of standardization within laboratories by using two common experimental approaches (‘setups’ hereafter): growth chambers with tightly controlled environmental conditions and identical soil (eight laboratories) or glasshouses with more loosely controlled environmental conditions and different soils (six laboratories; see Supplementary Table 1 for the physico-chemical properties of the soils).
We measured 12 parameters representing a typical ensemble of response variables reported for plant-soil microcosm experiments. Six of these were measured at the microcosm level (shoot biomass, root biomass, total biomass, shoot-to-root ratio, evapotranspiration and decomposition of a common substrate using a simplified version of the ‘tea bag litter decomposition method’20). The other six were measured on B. distachyon alone (seed biomass, height and four shoot-tissue chemical variables: N%, C%, δ15N and δ13C). All 12 variables were used to calculate the effect of the presence of a nitrogen-fixing legume on ecosystem functions in grass–legume mixtures (‘net legume effect’ hereafter) (Supplementary Table 2), calculated as the difference between the values measured in the microcosms with and without legumes—an approach often used in grass–legume binary cropping systems19,21 and biodiversity–ecosystem function experiments17,22.
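As a concrete illustration (the study's analyses were carried out in R; this is a minimal Python sketch with hypothetical biomass values), the net legume effect for one laboratory and one response variable is simply the mean of the grass–legume mixtures minus the mean of the grass-only monocultures:

```python
from statistics import mean

def net_legume_effect(mixture_values, monoculture_values):
    """Net legume effect for one response variable in one laboratory:
    mean of grass-legume mixture microcosms minus mean of grass-only
    monoculture microcosms. A positive value indicates that legume
    presence increased the variable."""
    return mean(mixture_values) - mean(monoculture_values)

# Hypothetical total-biomass values (g dry weight per microcosm)
# from six replicated microcosms per treatment:
mixture = [10.2, 11.5, 9.8, 10.9, 11.1, 10.5]
monoculture = [7.1, 6.8, 7.5, 7.0, 6.9, 7.2]
print(round(net_legume_effect(mixture, monoculture), 2))  # → 3.58
```

A negative value would indicate that legume presence reduced the variable, which, as reported below, occurred in some laboratories for several response variables.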
Statistically significant differences among the 14 laboratories were considered an indication of irreproducibility. In the first instance, we assessed how our experimental treatments (CSV and setup) affected the number of laboratories that produced results that could be considered to have reproduced the same finding. We then determined how experimental treatments affected the s.d. of the legume effect for each of the 12 variables both within and among laboratories (lower among-laboratory s.d. implies that the results were more similar, suggesting increased reproducibility). Finally, we explored the relationship between within- and among-laboratory s.d. and how the experimental treatments affected the statistical power of detecting the net legume effect.
Although each laboratory followed the same experimental protocol, we found a remarkably high level of among-laboratory variation for most response variables (Supplementary Fig. 1) and the net legume effect on those variables (Fig. 2). For example, the net legume effect on mean total plant biomass varied among laboratories from 1.31 to 6.72 g dry weight per microcosm in growth chambers, suggesting that unmeasured laboratory-specific conditions outweighed the effects of experimental standardization. Among glasshouses, the differences were even larger: the net legume effect on mean plant biomass varied by two orders of magnitude from 0.14 to 14.57 g dry weight per microcosm (Fig. 2). Furthermore, for half of the variables (root biomass, litter decomposition, grass height, foliar C%, δ13C and δ15N), the direction of the net legume effect varied with the laboratory.
Mixed-effects models were used to test the effect of legume species presence, laboratory, CSV and their interactions (with experimental block—within-laboratory growth chamber or glasshouse bench—as a random factor) on the 12 response variables. The impact of the presence of legumes varied significantly with laboratory and CSV for half of the variables, as indicated by the legume × laboratory × CSV three-way interaction (Table 1 and Supplementary Figs. 2 and 3). For the other half, significant two-way interactions between legume × laboratory and CSV × laboratory were found. The same significant interactions were found when analysing the first (PC1) and second (PC2) principal components from a principal component analysis that included all 12 response variables. PC1 and PC2 together explained 45% of the variation (Table 1 and Supplementary Fig. 4a,b). Taken together, these results suggest that the effect size or direction of the net legume effect was significantly different (that is, not reproducible) in some laboratories and that the introduced CSV treatment affected reproducibility. In a complementary analysis including the setup in the model (and accounting for the laboratory effect as a random factor), we found that the impact of the CSV treatment varied significantly with the setup (CSV × setup or legume × CSV × setup interactions; Supplementary Table 3), suggesting that the reproducibility of the results differed between glasshouses and growth chambers.
To answer the question of how many laboratories produced results that were statistically indistinguishable from one another (that is, reproduced the same finding), we used Tukey’s post-hoc honest significant difference test for the laboratory effect on PC1 and PC2 describing the net legume effect, which together explained 49% of the variation (Supplementary Fig. 4c,d). Of the 14 laboratories, 7 (PC1) and 11 (PC2) were statistically indistinguishable in controls. This value increased in the treatments with environmental or genotypic CSV for PC1 but not PC2 (Table 2). When we analysed the responses in growth chambers alone, five of eight laboratories were statistically indistinguishable in controls, but this increased to six laboratories when we considered treatments with only environmental CSV and seven in treatments with genotypic CSV (GENW, GENA and GENW + ENVW). In glasshouses, introducing CSV did not affect the number of statistically indistinguishable laboratories with respect to PC1, but decreased the number of statistically indistinguishable laboratories with respect to PC2 (Table 2).
We also assessed the impact of the experimental treatments on the among- and within-laboratory s.d. Analysis of the among-laboratory s.d. of the net legume effect revealed a significant CSV × setup interaction (F5,121 = 7.38, P < 0.001; Fig. 3a,b). This interaction included significantly lower fitted coefficients (that is, lower among-laboratory s.d.) in growth chambers for GENW (t5,121 = −3.37, P = 0.001), GENA (t5,121 = −2.95, P = 0.004) and GENW + ENVW treatments (t1,121 = −3.73, P < 0.001) relative to the control (see full model output for among-laboratory s.d. in the Supplementary Note). For these three treatments, the among-laboratory s.d. of the net legume effect was 18% lower with genotypic CSV than without it, indicating increased reproducibility (Fig. 3a). The same analysis performed on within-laboratory s.d. of the net legume effect only found a slight but significant increase of within-laboratory s.d. in the GENA treatment (t5,121 = 3.52, P < 0.001) (see model output for within-laboratory s.d. in the Supplementary Note). We then tested whether there was a relationship between within- and among-laboratory s.d. with a statistical model for among-laboratory s.d. as a function of within-laboratory s.d., setup, CSV and their interactions. We found a significant within-laboratory s.d. × setup × CSV three-way interaction (F5,109 = 2.4, P = 0.040) affecting among-laboratory s.d. (Supplementary Note). This interaction was the result of a more negative relationship between within- and among-laboratory s.d. in glasshouses relative to growth chambers, but with different slopes for the different CSV treatments (Fig. 4).
Introducing CSV can increase within-laboratory variation, as indicated by the positive coefficients fitted in some of the CSV treatments (see model output for within-laboratory s.d. in the Supplementary Note). Thus, for the three CSV treatments that produced the most consistent results (GENW, GENA and GENW + ENVW), we analysed the statistical power of detecting the net legume effect within individual laboratories. In growth chambers, adding genotypic CSV led to a slight reduction in statistical power relative to the control (57% in the control versus 46% in the three treatments containing genotypic variability) that could have been compensated for by using 11 instead of 6 replicated microcosms per treatment. In glasshouses, owing to a higher effect size of legume presence on the response variables, the statistical power for detecting the legume effect in the control was slightly higher (68%) than in growth chambers, but was reduced to 51% on average for the three treatments containing genotypic CSV—a decrease that could have been compensated for by using 16 replicated microcosms instead of 6.
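The trade-off between added within-laboratory variability and replication can be sketched with a standard power calculation. The following is a rough, illustrative Python sketch (the study's power analyses were run in R, and the effect sizes below are hypothetical, not the study's estimates): a normal-approximation formula shows how a CSV-induced drop in the standardized effect size inflates the number of replicates needed per treatment.

```python
from math import ceil
from statistics import NormalDist

def replicates_needed(effect_size_d, power=0.8, alpha=0.05):
    """Per-group sample size for a two-sample comparison under a
    normal approximation: n = 2 * ((z_(1-alpha/2) + z_power) / d)^2,
    where d is the standardized effect size (Cohen's d)."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / effect_size_d) ** 2)

# Hypothetical standardized effect sizes for the legume effect:
print(replicates_needed(1.2))  # larger effect without CSV → 11
print(replicates_needed(1.0))  # effect diluted by CSV → 16
```

Because the required n scales with 1/d², even a modest dilution of the standardized effect size by CSV-induced within-laboratory variability translates into noticeably more replicates.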
Overall, our study shows that results produced by microcosm experiments can be strongly biased by laboratory-specific factors. Based on the PC explaining most of the variation in the 12 response variables (PC1), only 7 of the 14 laboratories produced results that can be considered reproducible (Table 2) with the current standardization procedures. This result is in line with ref. 12, which reports that out of ten laboratories, only four generated similar leaf growth phenotypes of Arabidopsis thaliana (L.). In addition to highlighting that approximately one in two ecological studies performed in microcosms under controlled environments produce statistically different results, our study provides supporting evidence for the hypothesis that introducing genotypic CSV can increase the reproducibility of ecological studies9,10,11. However, the effectiveness of genotypic CSV for enhancing reproducibility varied with the setup; that is, it led to lower (−18%) among-laboratory s.d. in growth chambers only, with no benefit observed in glasshouses. Lower among-laboratory s.d. in growth chambers implies that the microcosms containing genotypic CSV were less strongly affected by unaccounted-for laboratory-specific environmental or biotic variables. Analyses performed at the level of individual variables (Table 1) showed that introducing genotypic CSV affected the among-laboratory s.d. in most, but not all variables. This suggests that the relationship between genotypic CSV and reproducibility is probabilistic and results from the decreased likelihood that microcosms containing CSV will respond to unaccounted-for laboratory-specific environmental factors in the same direction and with the same magnitude. The mechanism is likely to be analogous to the stabilizing effect of biodiversity on ecosystem functions under changing environmental conditions23,24,25,26, but additional empirical evidence is needed to confirm this conjecture.
Introducing genotypic CSV increased reproducibility in growth chambers (with stringent control of environmental conditions), but not in glasshouses (with more variable environmental conditions). Higher among-laboratory s.d. in glasshouses may indicate the existence therein of stronger laboratory-specific factors and our deliberate use of different soils in the glasshouses presumably contributed to this effect. However, the among-laboratory s.d. in glasshouses decreased with increasing within-laboratory s.d., irrespective of CSV—an effect that was less clear in growth chambers (Fig. 4). This observation appears to be in line with the hypothesis put forward in ref. 9, where it was proposed that increasing the variance within experiments can reduce the among-laboratory variability of the mean effect sizes observed in each laboratory. Yet, despite the negative correlation between within- and among-laboratory s.d. observed in glasshouses, the among-laboratory s.d. remained higher in glasshouses than growth chambers. Therefore, we consider that the hypothesized mechanistic link between CSV-induced higher within-laboratory s.d. and increased reproducibility is poorly supported by our dataset. Nevertheless, one possible explanation for the lack of effect on reproducibility in glasshouses is that our CSV treatments did not introduce a sufficiently high level of within-laboratory variability to buffer against laboratory-specific factors for all response variables; across the 12 response variables, the average main effect (that is, without the interaction terms) of the CSV treatment contributed to a low percentage (2.6% ± 1.6 s.e.m.) of the total sum of squares relative to the main effects of laboratory (43.4% ± 5.2 s.e.m.) and legumes (10.9% ± 3.1 s.e.m.). A similar conjecture was put forward by the other two studies that explored the role of CSV for reproducibility in animal behaviour9,10. 
At present, we are unable to conclude that the introduction of stronger sources of controlled within-laboratory variability can increase reproducibility in glasshouses with more loosely controlled environmental conditions and different soils.
Our results indicate that genotypic CSV is more effective at increasing reproducibility than environmental CSV, irrespective of whether the CSV is introduced within or among individual replicates (that is, microcosms). However, we cannot discount the possibility that we found this result because our treatments with environmental CSV were less successful in increasing within-microcosm variability. Additional experiments could test whether other types of environmental CSV, such as soil nutrients, texture or water availability, might be more effective at increasing reproducibility.
We expected higher overall productivity (that is, a net legume effect) in the grass–legume mixtures and enhanced growth of B. distachyon because of the presence of the nitrogen-fixing M. truncatula. However, these species were selected not because they are routinely paired in agronomic or ecological experiments (they are rarely used that way), but because they are frequently used in controlled-environment experiments on functional genomics. Contrary to our expectation, and despite the generally lower 15N signature of B. distachyon in the presence of nitrogen-fixing M. truncatula (suggesting that some of the nitrogen fixed by M. truncatula was taken up by the grass), the biomass of B. distachyon was lower in the microcosms containing M. truncatula. Seed mass and shoot N% of B. distachyon were also lower in mixtures (Supplementary Fig. 1), suggesting that the two species competed for nitrogen. The lack of a significant nitrogen fertilization effect of M. truncatula on B. distachyon could have resulted from the asynchronous phenologies of the two species: the eight- to ten-week life cycle of B. distachyon may have been too short to benefit from nitrogen fixation by M. truncatula.
Because well-established meta-analytical approaches can account for variation caused by local factors and still detect the general trends across different types of experimental setup, environment and population, we should ask whether the additional effort required for introducing CSV in experiments is worthwhile. Considering the current reproducibility crisis in many fields of science27, we suggest that it is, for at least three reasons. First, some studies become seminal without any attempts to reproduce them. Second, even if a seminal study that is flawed due to laboratory-specific biases is later proven wrong, it usually takes significant time and resources before its impact on the field abates. Third, the current rate of reproducibility is estimated to be as low as one-third12,13,14, implying that most data entering any meta-analysis are biased by unknown laboratory-specific factors. The addition of genotypic CSV may enhance the reproducibility of individual experiments and reduce potential biases in the data used in meta-analyses. Additionally, if each individual study were less affected by laboratory-specific unknown environmental and biotic factors, fewer studies would be needed to draw solid conclusions about the generality of phenomena. Therefore, we argue that investing more in making individual studies more reproducible and generalizable will be beneficial in both the short and long term. At the same time, adding CSV can reduce the statistical power to detect experimental effects, so some additional experimental replicates would be needed when using it.
Arguably, our use of statistical significance tests of effect sizes to determine reproducibility might be viewed as overly restrictive and better suited to assessing the reproducibility of parameter estimates rather than the generality of the hypothesis under test27. We used this approach because no generally accepted alternative framework is available to assess how close the multivariate results from multiple laboratories need to be to conclude that they reproduced the same finding. It is worth noting that although the direction of the legume effect was the same in the majority of laboratories, the differences among laboratories were very large (for example, up to two orders of magnitude for shoot biomass) and in 10% of the 168 laboratory × variable combinations (14 laboratories × 12 response variables) the direction of the legume effect differed from the among-laboratory consensus (Fig. 2).
Our study shows that the current standardization procedures used in ecological microcosm experiments do not adequately account for laboratory-specific environmental factors and suggests that introducing controlled variability in experiments may buffer some of the effects of those factors. Although there are multiple causes for the reproducibility crisis15,28,29, deliberately including genetic variability in the studied organisms may turn out to be a simple solution for increasing the reproducibility of ecological studies performed in controlled environments. However, as the introduced genotypic variability only increased reproducibility in experimental setups with tightly controlled environmental conditions (that is, in growth chambers using identical soil), our study indicates that the reproducibility of ecological experiments may be enhanced by a combination of rigorous standardization of environmental variables at the laboratory level and controlled genotypic variability.
All laboratories tried, to the best of their abilities, to carry out identical experimental protocols. While not all laboratories managed to precisely recreate all of the details of the experimental protocol, we considered this to be a realistic scenario under which ecological experiments using microcosms are performed in glasshouses and growth chambers.
The seeds from three genotypes of B. distachyon (Bd21, Bd21-3 and Bd3-1) and M. truncatula (L000738, L000530 and L000174) were first sterilized by soaking 100 seeds in 100 ml of a sodium hypochlorite solution with 2.6% active chlorine, which was stirred for 15 min using a magnetic stirrer. Thereafter, the seeds were rinsed three times in 250 ml of sterile water for 10–20 s with shaking. Sterilized seeds were germinated in trays (10 cm deep) filled with vermiculite. The trays were kept at 4 °C in the dark for 3 days before being moved to light conditions (300 μmol m−2 s−1 photosynthetically active radiation) at 20 °C and 60% relative air humidity during the day and 16 °C and 70% relative air humidity at night. When the seedlings of both species reached 1 cm in height above the vermiculite, they were transplanted into the microcosms.
Preparation of microcosms
All laboratories used identical containers (2 l volume, 14.8 cm diameter and 17.4 cm height). Sand patches were created using custom-made identical ‘patch makers’ consisting of six rigid polyvinyl chloride tubes (2.5 cm in diameter and 25 cm long) arranged in a circular pattern with an outer diameter of 10 cm. A textile mesh was placed at the bottom of the containers to prevent the spilling of soil through drainage holes. The filling of microcosms containing sand patches started with the insertion of the empty tubes into the containers. Thereafter, in growth chambers, 2,000 g dry weight of soil, subtracting the weight of the sand patches, was added to the containers and around the ‘patch maker’ tubes. Because different soils were used in the glasshouses, the dry weight of the soil differed depending on the soil density and was first estimated individually in each laboratory as the amount of soil needed to fill the pots up to 2 cm from the top. After the soil was added to the containers, the tubes were filled with a mixture of 10% soil and 90% sand. When the microcosms did not contain sand patches, the amount of sand otherwise contained in the six patches was homogenized with the soil. During the filling of the microcosms, a common substrate for measuring litter decomposition was inserted at the centre of the microcosm at 8 cm depth. For simplicity, as well as for its fast decomposition rate, we used a single batch of commercially available tetrahedron-shaped synthetic tea bags (mesh size of 0.25 mm) containing 2 g of green tea (Lipton; Unilever), as proposed by the ‘tea bag index’ method20. Once filled, the microcosms were watered until water could be seen pouring out of the pot. The seedlings were then manually transplanted to pre-determined positions (Fig. 1), depending on the genotype and treatment. 
Each laboratory established two blocks of 36 microcosms, resulting in a total of 72 microcosms per laboratory, with blocks representing two distinct chambers in the growth chamber setups or two distinct growth benches in the same glasshouse.
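The per-laboratory design implied by these numbers can be enumerated explicitly. The following is a purely illustrative Python sketch; the within-block replication (three replicates per treatment combination per block, giving six per combination across both blocks) is inferred from the totals stated in the text.

```python
from itertools import product

# Reconstructed per-laboratory design: 2 blocks x 6 CSV treatments x
# 2 plant compositions x 3 replicates per block = 72 microcosms,
# i.e. 36 per block and 6 replicates per treatment combination overall.
csv_levels = ["control", "ENVW", "ENVA", "GENW", "GENA", "GENW+ENVW"]
compositions = ["grass_only", "grass_legume"]
design = [
    {"block": b, "csv": c, "composition": p, "replicate": r}
    for b, c, p, r in product((1, 2), csv_levels, compositions, (1, 2, 3))
]
print(len(design))  # → 72
```

Each CSV × composition combination thus appears six times per laboratory, which is the replicate count used later when computing within-laboratory s.d.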
All laboratories using growth chamber setups used the same soil, whereas the laboratories using glasshouses used different soils (see Supplementary Table 1 for the physico-chemical properties of the soils). The soil used in growth chambers was classified as a nutrient-poor cambisol and was collected from the top layer (0–20 cm) of a natural meadow at the Centre de Recherche en Ecologie Expérimentale et Prédictive (Saint-Pierre-lès-Nemours, France). Soils used in glasshouses originated from different locations. The soil used by laboratory 2 was a fluvisol collected from the top layer (0–40 cm) of a quarry site near Avignon in the Rhône valley, Southern France. The soil used by laboratory 4 was collected from near the La Cage field experimental system (Versailles, France) and was classified as a luvisol. The soil used by laboratories 11 and 12 was collected from the top layer (0–20 cm) of the floodplain of the river Dreisam east of Freiburg, Germany. This soil was classified as an umbric gleysol with high organic carbon content. The soil used by laboratory 14 was classified as a eutric fluvisol and was collected on the field site of the Jena Experiment, Germany. Before the establishment of microcosms, all soils were air-dried at room temperature for several weeks and sieved using a 2 mm mesh sieve. A common inoculum was provided to all laboratories to ensure that rhizobia specific to M. truncatula were present in all soils.
Abiotic environmental conditions
The set points for environmental conditions were 16 h light (at 300 μmol m−2 s−1 photosynthetically active radiation) and 8 h dark, at 20 °C and 60% relative air humidity during the day and 16 °C and 70% relative air humidity at night. Different soils (for glasshouses) and treatments with sand patches likely affected water drainage and evapotranspiration. The watering protocol was thus based on dry weight relative to weight at full water-holding capacity (WHC). The WHC was estimated as the difference between the dry weight of the containers and the wet weight of the containers 24 h after abundant watering (until water was flowing out of the drainage holes in the bottom of each container). Soil moisture was maintained between 60 and 80% of WHC (that is, the containers were watered when the soil water dropped below 60% of WHC and water was added to reach 80% of WHC) during the first 3 weeks after seedling transplantation and between 50 and 70% of WHC for the rest of the experiment. Microcosms were watered twice a week, with the amount of water required estimated from the weighed WHC status of two microcosms per treatment. To ensure that the patch/heterogeneity treatments did not become a water availability treatment, all containers were weighed and brought to 70 or 80% of WHC every 2 weeks. This operation was synchronized with within-block randomization. All 14 experiments were performed between October 2014 and March 2015.
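The watering rule can be summarized as a simple function. This is an illustrative Python sketch (the container weights are hypothetical), assuming WHC is expressed as the weight gained between the dry container and the saturated container 24 h after abundant watering:

```python
def water_to_add(current_weight, dry_weight, wet_weight,
                 lower=0.6, upper=0.8):
    """Watering rule sketched from the protocol: when soil moisture
    drops below `lower` x WHC, add water to bring the container to
    `upper` x WHC (the 0.6-0.8 band applies to the first 3 weeks;
    0.5-0.7 was used thereafter). Weights in grams."""
    whc = wet_weight - dry_weight          # water held at saturation
    moisture = (current_weight - dry_weight) / whc
    if moisture < lower:
        return dry_weight + upper * whc - current_weight
    return 0.0

# Hypothetical container: 2,400 g dry, 3,000 g at full WHC (WHC = 600 g)
print(water_to_add(2700.0, 2400.0, 3000.0))  # at 50% of WHC → 180.0 g
print(water_to_add(2900.0, 2400.0, 3000.0))  # at ~83% of WHC → 0.0 g
```

The same weighing arithmetic underlies the fortnightly re-equilibration of all containers to 70 or 80% of WHC.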
Sampling and analytical procedures
After 80 days, all plants were harvested. Plant shoots were cut at the soil surface, separated by species and dried at 60 °C for 3 days. Roots and any remaining litter in the tea bags were washed out of the soil using a 1 mm mesh sieve and dried at 60 °C for 3 days. The microcosm evapotranspiration rate was measured before harvesting as the weight lost over 48 h after the microcosms had been brought to 70% of WHC. Shoot C%, N%, δ13C and δ15N were measured on pooled shoot biomass (including seeds) of B. distachyon and analysed at the Göttingen Centre for Isotope Research and Analysis using a coupled system consisting of an Elemental Analyzer (NA 1500; Carlo-Erba) and a gas isotope mass spectrometer (Finnigan MAT 251; Thermo Electron Corporation).
Data analysis and statistics
All analyses were done using R version 3.2.4 (ref. 30). Before data analyses, each laboratory was screened individually for outliers. Values falling more than 1.5 × the interquartile range below the first or above the third quartile31 within each laboratory (representing less than 1.7% of the whole dataset) were considered outliers due to measurement errors or typos. These values were removed and subsequently treated as missing values. We then assessed whether the impact of legume presence varied with laboratory and CSV treatment. This was tested individually for each response variable (Table 1) with a mixed-effects model using the ‘nlme’ package32. Following the guidelines suggested by ref. 33, we first identified the most appropriate random structure using a restricted maximum likelihood approach, selecting the random structure with the lowest Akaike information criterion. In this model, CSV and laboratory were included as fixed factors and experimental block as a random factor, with a ‘varIdent’ weighting function to correct for heteroscedasticity among laboratories and legume treatments (R syntax: ‘model = lme(response_variable ~ legume*CSV*laboratory, random = ~1|block, weights = varIdent(form = ~1|laboratory*legume))’) (Table 1). As the laboratory and setup experimental factors were not fully crossed (that is, each laboratory performed the experiment in only one type of setup), the two experimental variables could not be included simultaneously as fixed effects. Therefore, to test for the setup effect, we used an additional complementary model including CSV and setup as fixed effects and laboratory as a random factor (R syntax: ‘model = lme(response_variable ~ legume*CSV*setup, random = ~1|laboratory/block, weights = varIdent(form = ~1|laboratory*legume))’) (Supplementary Table 3).
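The outlier screen described above corresponds to Tukey's 1.5 × IQR fences. A minimal sketch (in Python rather than the R used for the study; the data values are invented) of the per-laboratory screening step:

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Tukey's fence screening: values more than k x IQR below the
    first quartile or above the third quartile are flagged as likely
    measurement errors or typos (applied per laboratory)."""
    q1, _, q3 = quantiles(values, n=4)  # exclusive-method quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [5.1, 5.3, 4.9, 5.0, 5.2, 5.4, 5.1, 50.2]  # 50.2: decimal typo
print(flag_outliers(data))  # → [50.2]
```

Flagged values would then be set to missing rather than corrected, matching the treatment described in the text.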
To test whether the results were affected by collinearity among the response variables, the two models were also run on PC1 and PC2 of the 12 response variables (Fig. 4a,b). PCs were estimated using the ‘FactoMineR’ package34, with missing values imputed using a regularized iterative multiple correspondence analysis35 in the ‘missMDA’ package36. The same methodology was used to compute a second principal component analysis derived from the net legume effect on the 12 response variables (Supplementary Fig. 3c,d). To assess how many laboratories produced results that were statistically indistinguishable from one another, we applied Tukey’s honest significant difference post-hoc test from the ‘multcomp’ package to laboratory-specific estimates of PC1 and PC2 (Table 2).
To assess how the CSV treatments affected among- and within-laboratory variability, we used the s.d. instead of the coefficient of variation, because the net legume effect contained both positive and negative values. Because the 12 response variables were measured in different units, their raw values were first centred and scaled individually using z-score normalization (z-scored variable = (raw value − mean)/s.d.), allowing the resulting s.d. to be analysed and visualized on a common scale. Among-laboratory s.d. was computed from the mean of the laboratory z-scores for each response variable, CSV and setup treatment (n = 144; 6 CSV levels × 2 setup levels × 12 response variables). Within-laboratory s.d. was computed from the values measured in the six replicated microcosms for each CSV and setup treatment combination, individually for each response variable, resulting in a dataset with the same structure as that for among-laboratory s.d. (n = 144; 6 CSV levels × 2 setup levels × 12 response variables). Some of the 12 response variables were intrinsically correlated, but most had correlation coefficients < 0.5 (Supplementary Fig. 5) and were therefore treated as independent variables.
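The variance decomposition described above can be illustrated with a short Python sketch that z-scores one response variable across laboratories and then derives among- and within-laboratory s.d.; the laboratory names and replicate values are invented, and the real analysis was done in R over all 144 treatment × variable combinations:

```python
import statistics

def zscore(values):
    """Centre and scale one response variable: (x - mean) / s.d."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)  # sample s.d., denominator n - 1
    return [(v - mu) / sd for v in values]

# hypothetical replicate values for one response variable in three laboratories
labs = {
    "lab_A": [1.0, 1.2, 0.9, 1.1],
    "lab_B": [2.0, 2.3, 1.8, 2.1],
    "lab_C": [1.5, 1.6, 1.4, 1.5],
}

# z-score across all values of this variable, then regroup by laboratory
flat = [v for reps in labs.values() for v in reps]
z = zscore(flat)
it = iter(z)
z_by_lab = {lab: [next(it) for _ in reps] for lab, reps in labs.items()}

# among-laboratory s.d.: s.d. of the laboratory means of the z-scores
lab_means = [statistics.mean(zs) for zs in z_by_lab.values()]
among_sd = statistics.stdev(lab_means)

# within-laboratory s.d.: one s.d. per laboratory from its replicated microcosms
within_sd = {lab: statistics.stdev(zs) for lab, zs in z_by_lab.items()}
```

In this invented example the laboratories differ more among themselves than their replicates do internally, so the among-laboratory s.d. exceeds every within-laboratory s.d. — the pattern that the CSV treatments were predicted to reduce.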
The impact of experimental treatments on among- and within-laboratory s.d. was analysed using mixed-effects models following the same procedure described for the individual response variables. The model with the lowest Akaike information criterion included a random slope for the setup within each response variable, as well as a ‘varIdent’ weighting function to correct for heteroscedasticity at the variable level (R syntax: ‘model = lme (s.d. ~ CSV*setup, random = ~setup|variable, weights = varIdent (form = ~1|variable))’) (see also Supplementary Note). The relationship between within- and among-laboratory s.d. was also tested with a model with a similar random structure but with among-laboratory s.d. as the dependent variable and within-laboratory s.d., CSV and setup as predictors.
Because the treatments containing genotypic CSV increased reproducibility in growth chambers but slightly increased within-laboratory s.d., we also examined the effect of adding CSV on the statistical power for detecting the net legume effect in each individual laboratory. This analysis was done with the ‘power.anova.test’ function in the ‘stats’ package. We computed the statistical power of detecting a significant net legume effect (had a one-way analysis of variance been used for the legume treatment) for the control, GENW, GENA and ENVW + GENW treatments for each laboratory and response variable. This allowed us to calculate the average statistical power for these treatments and how many additional replicates would have been needed to achieve the same statistical power as in the control.
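The logic of this power calculation can be illustrated with a Monte Carlo analogue of R's ‘power.anova.test’ written in stdlib-only Python: simulate the null to obtain an empirical critical F value, then simulate the alternative and count rejections. The group means, within-group s.d. and replicate number below are hypothetical, not values from the study:

```python
import random
import statistics

def anova_f(groups):
    """F statistic for a one-way ANOVA on equally sized groups."""
    k, n = len(groups), len(groups[0])
    means = [statistics.mean(g) for g in groups]
    grand = statistics.mean(means)  # equal group sizes, so mean of means
    ss_between = n * sum((m - grand) ** 2 for m in means)
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (k * (n - 1)))

def power_oneway(group_means, within_sd, n, alpha=0.05, reps=4000, seed=1):
    """Monte Carlo power of a one-way ANOVA with n replicates per group."""
    rng = random.Random(seed)
    k = len(group_means)

    def sim(means):
        return [[rng.gauss(m, within_sd) for _ in range(n)] for m in means]

    # empirical critical F from the null (all group means equal)
    null_f = sorted(anova_f(sim([0.0] * k)) for _ in range(reps))
    f_crit = null_f[int((1 - alpha) * reps)]
    # proportion of simulated experiments that reject under the alternative
    hits = sum(anova_f(sim(group_means)) > f_crit for _ in range(reps))
    return hits / reps

# hypothetical legume effects of 0, 1 and 2 within-group s.d., 6 replicates
power_null = power_oneway([0.0, 0.0], within_sd=1.0, n=6)
power_small = power_oneway([0.0, 1.0], within_sd=1.0, n=6)
power_large = power_oneway([0.0, 2.0], within_sd=1.0, n=6)
```

With six replicates per group, an effect of one within-group s.d. is detected well under half the time, which is why a CSV-driven increase in within-laboratory s.d. could demand additional replicates to match the control's power.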
Life sciences reporting summary
Further information on experimental design is available in the Life Sciences Reporting Summary.
The data that support the findings of this study are publicly available at https://doi.pangaea.de/10.1594/PANGAEA.880980.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This study benefited from the Centre National de la Recherche Scientifique human and technical resources allocated to the Ecotrons research infrastructures, the state allocation ‘Investissement d’Avenir’ ANR-11-INBS-0001 and financial support from the ExpeER (grant 262060) consortium funded under the EU FP7 research programme (FP7/2007–2013). Brachypodium seeds were provided by R. Sibout (Observatoire du Végétal, Institut Jean-Pierre Bourgin) and Medicago seeds were supplied by J.-M. Prosperi (Institut National de la Recherche Agronomique Biological Resource Centre). We further thank J. Varale, G. Hoffmann, P. Werthenbach, O. Ravel, C. Piel, D. Landais, D. Degueldre, T. Mathieu, P. Aury, N. Barthès, B. Buatois and R. Leclerc for assistance during the study. For additional acknowledgements, see the Supplementary Information.