Structured plant metabolomics for the simultaneous exploration of multiple factors

Multiple factors act simultaneously on plants to establish complex interaction networks involving nutrients, elicitors and metabolites. Metabolomics offers a better understanding of complex biological systems, but evaluating the simultaneous impact of different parameters on metabolic pathways that have many components is a challenging task. We therefore developed a novel approach that combines experimental design, untargeted metabolic profiling based on multiple chromatography systems and ionization modes, and multiblock data analysis, facilitating the systematic analysis of metabolic changes in plants caused by different factors acting at the same time. Using this method, target geraniol compounds produced in transgenic tobacco cell cultures were grouped into clusters based on their response to different factors. We hypothesized that our novel approach may provide more robust data for process optimization in plant cell cultures producing any target secondary metabolite, based on the simultaneous exploration of multiple factors rather than varying one factor each time. The suitability of our approach was verified by confirming several previously reported examples of elicitor–metabolite crosstalk. However, unravelling all factor–metabolite networks remains challenging because it requires the identification of all biochemically significant metabolites in the metabolomics dataset.

Metabolomics generates large, multi-dimensional datasets using automated analytical procedures such as gas chromatography or high-pressure liquid chromatography coupled to mass spectrometry (GC-MS and HPLC-MS). It is therefore necessary to reduce the dimensionality of the data using multivariate statistical methods. The complexity of data mining is enhanced further when the data originate from several sources (e.g. complementary chromatography systems or ionization modes) and data fusion strategies are therefore required. An additional difficulty is encountered when multiple input factors are varied simultaneously, because different sources of variation are mixed. The importance of multiple simultaneous metabolic effects has been underestimated in the past and here we addressed this challenge by combining several orthogonal techniques: reversed-phase ultra-high-pressure liquid chromatography (RP-UHPLC) with positive and negative electrospray ionization (ESI) modes, and hydrophilic interaction liquid chromatography (HILIC), both coupled to time of flight mass spectrometry (TOF-MS) to achieve greater coverage of the metabolome.
Several strategies have been developed for the simultaneous analysis of multiple datasets. The proposed data modelling approach is an extension of the multiple kernel learning method to orthogonal partial least squares discriminant analysis (OPLS-DA), i.e. consensus OPLS-DA, which combines data blocks using the weighted sum of X·X T product association matrices corresponding to their linear kernel 13 . The OPLS-DA framework is advantageous for data interpretation because relevant metabolic variations are associated with predictive components, whereas unrelated variation is summarized in so-called orthogonal components 14 . In consensus OPLD-DA, the block weighting is based on modified RV-coefficients so that the Y response orientates the consensus kernel towards improved predictability. Cross-validation is carried out to assess the optimal model size and avoids overfitting, using DQ 2 (an adaptation of the conventional Q 2 value) for discriminant analysis 15 .
To our knowledge, this is the first systematic investigation of metabolic remodelling in plants following simultaneous multi-factorial treatment. This novel combination of metabolomics and experimental design, associated with the simultaneous analysis of multiblock omics data, is a powerful approach that allows us to unravel the metabolic responses in transgenic tobacco cells at a global level when diverse input factors such as macronutrients, plant growth regulators and light are varied simultaneously. Furthermore, this high-throughput screening system can be used for process optimization with metabolically engineered cell lines. Herein we hypothesize that product optimization using the simultaneous exploration of multiple factors may achieve more accurate and reproducible results than the assessment of one factor at a time.

UHPLC-QTOF-MS fingerprinting.
The acquisition of high-quality metabolomics data is an essential aspect of metabolic profiling because it facilitates the identification step. The UHPLC-QTOF-MS gradient conditions we applied allowed us to monitor more than 3500 features in the m/z range 100-1200 based on RP-UHPLC (ESI + and ESI − ion detection modes) and HILIC (ESI + ion detection mode). Pareto scaling was applied as data normalization. The multiblock data fusion strategy we used allows the integration of datasets originating from different ionization (ESI + /ESI − ) and separation (RP/HILIC) modes, yielding 1500, 1366 and 1368 variables for the RP/ESI + , RP/ESI − and HILIC/ESI + blocks, respectively. Some of the signals were present in all blocks, whereas others were found in only one or two data blocks and thus provided complementary information 13 . The target metabolites in methanol extracts, obtained from tobacco suspensions expressing Valeriana officinalis geraniol synthase (VoGES) gene, were identified based on mass fragmentation and the comparison of characteristic m/z species with internal or publicly accessible natural product databases.
Consensus OPLS-DA data modelling and metabolite identification. A supervised data mining approach (consensus OPLS-DA) was applied to determine the distinct metabolic changes caused by the simultaneous modulation of diverse input factors, and leave-one-out cross-validation (LOOCV) was carried out to evaluate the appropriate number of orthogonal components based on the DQ 2 parameter. A series of 1000 permutation tests was performed for each model by randomizing the original Y class response in order to assess the statistical validity of the models 13 .
Significant models were obtained for most of the factors with a prediction accuracy (PA) of 87-100% and a statistical significance of p < 0.01. Lower prediction accuracy was observed for GA 3 (PA = 62.5%, p = 0.013) and ethephon (PA = 60.4%, p = 0.015), whereas KH 2 PO 4 (PA = 57.3%, p = 0.225) and MgSO 4 (PA = 47.9%, p = 0.62) were found to be statistically non-significant (Table 1). Random permutations of the design matrix simulate data under the null hypothesis, i.e. no effect of the experimental factor under study. Because similar or higher prediction accuracies could be achieved from randomly permutated designs, we decided to omit KH 2 PO 4 and MgSO 4 from subsequent experiments aiming to identify metabolites and to determine their biological importance.
Next we used S-plots to highlight relevant metabolites based on their contributions to the model, reflecting both their amplitude of variation and their reliability 14 . This visualization method helps to identify biochemically significant metabolites based on their position in the S-plots. The ideal biomarker has a high covariance magnitude and high correlation reliability. This low risk of spurious correlations corresponds to the upper right quadrant (area 1) or lower left quadrant (area 4) of each S-plot. The biomarkers located in area 1 are associated with metabolites that become more abundant when the factor strength is enhanced or less abundant when the factor strength is reduced, i.e. an upregulation effect (metabolic levels (+) /factor (+) or metabolic levels (−) /factor (−) ). Similarly, the biomarkers in area 4 are linked to metabolites that become less abundant when the factor strength is enhanced or more abundant when the factor strength is reduced, i.e. a downregulation effect (metabolic levels (−) /factor (+) or metabolic levels (+) /factor (−) ). Because all three data blocks (RPLC NEG, RPLC POS and HILIC) are integrated in a single model for each factor, ions detected by each analytical protocol can be associated in a combined S-plot.
High-resolution QTOF-MS/MS profiling allowed 45 compounds represented by some of the most significant features with respect to the factors we investigated (i.e. they were located in areas 1 and 4 of the S-plots) to be Scientific RepoRts | 6:37390 | DOI: 10.1038/srep37390 tentatively identified, at least to the level of the compound class (Table 2). However, the complete identification of all other relevant metabolites remains challenging and these preliminary results may serve as a starting point for the further targeted isolation and purification of the metabolites of interest.
Our observations show that naturally occurring (endogenous) auxins such as IAA and IBA upregulate (2 and 4) and downregulate (1, 5 and 6) the synthesis of geraniol glycosides, whereas the synthetic auxins (NAA and 2,4-D) have exactly the opposite effects on these metabolites.

Figure 1. KNO 3 S-plot showing the distribution of metabolites in transgenic tobacco cell cultures exposed to different combinations of environmental factors, revealing metabolites in areas 1 (upper right) and 4 (lower left) that are the most sensitive to changes in KNO 3 levels. Numbers refer to compounds listed in
Scientific RepoRts | 6:37390 | DOI: 10.1038/srep37390 Clustering. Cluster analysis provided a global overview of regulation events that follow changes in the experimental factors. This approach allows the grouping of metabolites with similar ion features using a dendrogram, and a heat map summarizes the contribution of each of the factors in the context of each identified metabolite and thus enables the visualization of upregulation and downregulation in response to different treatments (Fig. 2).
Both geranidiol glycosides (1 and 5) are located in a small cluster, whereas the third geranidiol derivative (7), an esterified monoglycoside, is found in a neighbouring cluster (Fig. 2). The geraniol metabolites (1 and

Discussion
The combination of fractional factorial design and consensus OPLS-DA methods allowed us to systematically explore the effect of multiple factors on plant metabolism, using transgenic tobacco cell cultures as a model system. This simultaneous application of treatment stress assesses all experimental factors under diverse conditions that could occur in nature. We tentatively identified 45 constituents in areas 1 and 4 of the S-plots following the fractionation and analysis of plant cell extracts by UHPLC-QTOF-MS, corresponding to metabolites whose abundance changed substantially in response to the experimental factors. These metabolites represented multiple classes of natural products: monoterpenoids, nitrogen-containing compounds, coumarins, fatty acids and their esters, and derivatives of phytohormones and plant growth regulators used as additives in the experiments.
Geraniol and its glycosides do not occur naturally in tobacco plants and their presence in our samples reflects the activity of the stably integrated VoGES gene 16 . However, the glycosylation profile of geraniol produced by our cell suspension cultures differed to that observed in whole plants. The cells produced seven distinct geraniol glycosides whereas 19 variants were produced by transgenic tobacco plants and Nicotiana benthamiana leaves used for transient expression. The acetylated glycosides produced at later stages of plant development were not monitored in our plant cell cultures. The cell suspension cultures produced geraniol monoglycosides and diglycosides, whereas the whole plants also produced geraniol glycosides with three or more sugar adducts. However, our cell suspension cultures accumulated geranidiol derivatives that are not produced in agroinfiltrated or transgenic plants, which instead produce geranic acid glycosides 16 .
Nitrogenous compounds are found in tobacco cells because they have multiple core metabolic functions and are also precursors in the biosynthesis of tobacco alkaloids 17 . Phytoalexins defend plants against biotic and abiotic stress 18 . The coumarin scopoletin (12) is one of the phytoalexins produced in tobacco [19][20][21] . C16:3 monoacylglycerol (29) is a glyceride which can be formed by the esterification of glycerol with one fatty acid or by enzymatic hydrolysis of a fatty acid from diacylglycerol by the action of diacylglycerol lipase. Diacylglycerol acts as a signalling molecule during plant development and in response to stress tolerance, nutrient deficiency and other environmental stimuli [22][23][24] .
Our metabolomic analysis showed that plant cells react strongly to phytohormones and plant growth regulators. Plants limit the impact of harmful xenobiotic compounds by hydroxylation, glutathione conjugation, glycosylation, malonylation and sulfonylation 25,26 . Most of the phytohormone and growth regulator derivatives we identified were polar glucosylated products, which are generally more soluble than the parent molecule thus facilitating elimination. We also detected malonylated geraniol glycosides, confirming that malonylation is one of the key mechanisms used by tobacco cells to metabolize xenobiotic compounds 27 .
The production of glutamine was strongly upregulated by KNO 3 treatment in our experiment. Nitrate is assimilated by plant cells from nitrite and ammonium, and is then converted into the amino acid glutamine 28 . The protein kinase CIPK23 is involved in both nitrate and potassium signalling 29 . CIPK23 phosphorylates nitrate transporter NPF6.3 after interacting with the calcineurin-B-like protein CBL9, and reduces nitrate uptake capacity in the presence of high external NO 3 − concentrations, whereas the CBL1-CIPK23 and CBL9-CIPK23 complexes activate the K + channel AKT1 30 . A monoacylglycerol derivative (29) was also less abundant following KNO 3 treatment. This may be a hydrolysis product of diacylglycerol which activates protein kinase C and induces nitrate reductase gene expression 31 .
The NO 3 − /NH 4 + ratio in the culture medium influences the activity of auxins and cytokinins 32 . We observed an increase in the abundance of ions representing 2,4-D and the loss of ions representing DHZ metabolites and BAP glycoside in the KNO 3 S-plot, but an increase in the abundance of BAP glycosides and the loss of ions representing 2,4-D and NAA metabolites in the NH 4 NO 3 S-plot. This indicates that the balance between NO 3 − and NH 4 + ions may affect phytohormone sensitivity. NH 4 NO 3 and CaCl 2 had the most significant impact on the biosynthesis of geraniol glycosides among the inorganic factors we tested. NH 4 NO 3 induced the formation of geraniol glycosides (2, 4, 7) but inhibited the formation of hexosyl-hexosyl-geraniol (6), whereas calcium showed the opposite behaviour. Higher concentrations of useable nitrogen also enhanced the accumulation of linalool and citronellol by Saccharomyces cerevisiae 33 . Geraniol blocks calcium and potassium channels in mammalian cells 34 and similar cyclic nucleotide-gated ion channels are found in plants 35 .
Ca 2+ induced the hydroxylation of sphingosine and adenosine agreeing with previous observations that sphingosine-1-phosphate increases cytosolic free Ca 2+ 36 and cyclic adenosine monophosphate regulates calcium channels in the plasma membrane of Arabidopsis thaliana leaf guard and mesophyll cells 37 .
Scopolin and its 7-O-glucoside are key components of the abiotic stress response 18 and the abundance of both compounds increased following treatment with all three statistically significant nutrients in our study. The accumulation of scopoletin in tobacco cells and its conversion to a glucoside is also induced by 2,4-D 21 . Ions representing scopolin were also detected following the treatment of our cells with 2,4-D. Scopoletin synthesis was strongly inhibited by MeJa concurring with data showing that scopoletin biosynthesis induced by Alternaria alternata is strongly dependent on jasmonic acid but not ABA, although MeJA does not induce scopoletin production in the absence of A. alternata 38 . In our cells, the formation of scopoletin was also inhibited by GA 3 .
We detected nicotine produced in trace amounts by our transgenic tobacco cell cultures, which comprise green (photosynthesizing) cells derived from the aerial parts of the plant, although de novo nicotine synthesis takes place mainly in the roots 39 . MeJa induced nicotine production in our tobacco cells in a similar manner as previously shown for N. attenuata 39 . The downregulation of nicotine production we observed following the treatment with KNO 3 and NAA agrees with previous reports for cultured tobacco callus, and the effect of K + is probably mediated by NAA 40 . We also observed that nicotine biosynthesis was induced by IAA but moderately suppressed by 2,4-D 41 . The combined effect of MeJa, auxins and K + on the regulation of nicotine synthesis suggests that multiple factors contribute to the same process.
The biosynthesis of monoterpenoid glycosides appeared to be influenced by auxins and cytokinins, perhaps reflecting the antagonistic crosstalk between these two phytohormone classes 42 . Our experimental results support the idea that phytohormones function in a complex network involving the different hormonal pathways but that there is also elaborate crosstalk with nutrients and elicitors. The auxin-sensitive signalling protein SHY2 is regulated by the cytokinin-induced protein ARR1 (Arabidopsis response regulator), which in turn is repressed by gibberellin thus connecting three hormones in one network 42 . This may explain why our GA 3 S-plot contained ion traits that were also affected by auxins and cytokinins. We also observed evidence for ethylene/cytokinin and cytokinin/ABA crosstalk 42 .
We have developed a systematic approach, which implements an experimental design strategy in the context of metabolomics to account for the diverse factors applied simultaneously to plant cells. This is a valuable method for the investigation of complex environmental stress and its impact on plant metabolism by optimizing the number of experiments needed to assess the factors. Our approach significantly reduces the time and effort required for testing by using consensus OPLS-DA models to evaluate and interpret metabolic changes caused by the simultaneous application of diverse ecological factors. This systematic workflow may facilitate the discovery and characterization of factor-nutrient-elicitor networks and appropriate biomarkers. Finally, we conclude that this novel approach should be able to streamline process optimization for the reproducible production of any secondary metabolite in plant cell cultures by the simultaneous exploration of multiple factors rather than the assessment of one factor at a time.

Materials and Methods
Plant cell cultures, treatments and harvesting. We used tobacco (N. tabacum cv. Samsun NN) transgenic cell suspension cultures, expressing stably V. officinalis geraniol synthase. The cell cultures were initiated and maintained as previously described 8 . Two levels (low/high) were used for each of the factors selected for analysis. The low level of each macronutrient in the plant cell culture medium was based on classical Murashige and Skoog (MS) medium 11,45 whereas the high level was based on our recent medium optimization study, although the concentrations of NH 4 NO 3 were reversed 11,45 . The specific low/high concentrations were prepared as follows: KNO 3 Table S1.
The following cultivation conditions were used: flask volume 50 ml (Erlenmeyer glass flasks), filling volume 25 ml, inoculum size 1.4 g fresh weight (FW), triacetyl-β -cyclodextrin concentration 2 mM, and shaking frequency 180 rpm 8 . The plant cell suspension cultures were grown for 9 days before the cells were harvested, then filtered twice under vacuum and frozen at − 20 °C. The cultures were elicited with phytohormones and plant growth regulators 6 days post-inoculation. The experiment was conducted at 26 °C with a 16-h photoperiod in an ISF1-X shaker (Kühner AG, Birsfelden, Switzerland).
Sample preparation. The samples were extracted as previously described 46  A pool was created by adding equal volumes from all samples to serve as a QC injection. Nine QC injections in total were distributed at regular intervals in the analytical batch. An acceptable variation was achieved for all peaks, including those with the highest intensity (coefficient of variation less than 40%).
Positive and negative ESI was applied in separate analytical runs (positive only for the HILIC method) with a desolvation gas flow of 780 l/h at 400 °C, a capillary voltage of 4.5 kV and a cone voltage of 45 V. Mass spectra were acquired over the m/z range 100-1200 in "W mode" using leucine enkephalin as a lock mass standard.
Raw data processing. Mass/retention time markers were extracted from the raw UHPLC-ESI-MS data using MarkerLynx XS v4.1 (Waters). The following method parameters were set: retention time window 1.6-24.6 min, m/z range 100-1200, XIC window 0.02 Da, peak smoothing activated, marker intensity threshold 30, mass window 0.04 Da, retention time window 0.2 min, noise elimination level 6.0, and deisotope data activated. All mass (m/z)/retention-time features related to the auxins, cytokinins and plant growth regulators provided as supplements in the plant cell culture media, as well as features derived from impurities in the LC eluents, were removed from the raw metabolomics datasets by removing all features detected in blank runs (solvent injection) or in analytical runs of the pure additives before data evaluation.

Experimental design. The experimental design was based on an orthogonal array with 96 runs created
with the free open-source R package DoE.base 47 as described in the supplementary material. Given the size of this experiment, tests for effects of 2-level factors at significance level 5% can detect effects of size "one standard deviation" with about 99% power, effects of size "half a standard deviation" with about 68% power, and effects of size "0.75 standard deviations" with about 95% power. The design was optimized to screen 14 factors by keeping the confounding of low-order effects minimal: all main effects are orthogonal to each other (orthogonal array), the design was based on an array with the lowest possible number of squared canonical correlations from three-factor sets equal to 1 48 and the factors were accommodated on columns of the base array such that confounding between main effects and two-factor interactions, and subsequently among two-factor interactions, was minimized 8 .
This fractional factorial design, with a randomized run order, was used to screen 12 two-level factors, one three-level factor and one four-level factor: we thus screened for the effects of light and 18 diverse substances representing macronutrients, auxins, cytokinins and elicitors. For the two-level factors, we investigated the presence of the high levels of NH 4 NO 3 , KNO 3 , CaCl 2 , KH 2 PO 4 , MgSO 4 , MeJa, salicylic acid, GA 3, ethephon, cyclanilide, ABA and light. For auxins (four-level factor), exactly one of IAA, IBA, NAA or 2,4-D was present, whereas for cytokinins (three-level factor), exactly one of kinetin, DHZ or BAP was present. The experimental design with 96 runs is summarized in coded values in Table S2. The remaining potentially relevant confounding between main effects and two-factor interactions in terms of triples of factor comparisons are summarized in Fig. S17. For each such triple, the comparison between the levels of each factor in the triple might be affected by an interaction between the other two factors (e.g. an interaction between KNO 3 and NH 4 NO 3 might affect the assessment of the BAP vs. DH-z comparison for cytokinins). Sceptics might argue that this possibility for confounding is a reason to refrain from using an experimental design approach in favour of only changing one factor at a time (OFAT approach). However, if two-factor interactions are indeed relevant -as would be necessary for the experimental design approach to suffer from misleading conclusions in terms of factor level comparisons -the conclusions from an OFAT approach are also limited in the same manner and would be valid only for the exact settings at which the other factors have been fixed. Furthermore, to achieve a reasonable amount of replication, the OFAT approach would need a much larger number of experimental runs -e.g. 24 runs for the reference level combination (the number obtained for the four-level factor in the 96-run experiment) might be combined with 24 runs each for the other level of the 12 two-level factors (12*24), and 24 runs each for the other levels of the three-level and the four-level factors (5*24), resulting in a total of 18*24 = 432 runs instead of the 96 runs in our experiment.
Data processing and analysis. For each experimental factor, a consensus OPLS-DA model was built to relate the experimental metabolomics data (X) to a class matrix consisting of zeros and ones, filled according to the levels of each factor (Y). The columns of the experimental design were therefore used individually as a response matrix in the context of supervised analysis. For auxins and cytokinins, a response vector was generated individually for each hormone and filled with zero when the corresponding hormone was absent, whereas a value of one indicated its presence. The consensus OPLS algorithm implements data fusion based on multiple kernel learning. The joint analysis of multiple data tables is achieved by the combination of association matrices computed for each block. Therefore, requirements in terms of memory resources and computation time are minimized without information loss even if the experimental data include a large number of signals. A block-scaling step ensures fairness between blocks by offering equal starting chances to contribute to the model. RV coefficients are then computed to build a consensus matrix and orientate the model towards better prediction performance. A common subspace is built using a kernel version of the OPLS algorithm and the optimal number of orthogonal components is estimated by cross-validation. Because systematic variations are summarized using Y-predictive and Y-orthogonal components (OPLS framework), the interpretation of the multiblock model is straightforward. Like classical multivariate methods, a consensus score plot allows the distribution of the observations to be evaluated. Because linearity is maintained, variable loadings can easily be computed for biomarker discovery. The weight of each block in the projection also allows the role of each data source to be evaluated 13 . The OPLS model can be summarized as follows: p p T where X contains the normalized metabolomic data from data block i (i ∈ [RPPOS, RPNEG, HILICPOS]), Y represents a 0/1 indicator variable for experimental condition j (j ∈ [NH 4 NO 3 , KNO 3 , CaCl 2 , KH 2 PO 4 , MgSO 4 , IAA, IBA, NAA, 2,4-D, kinetin, DHZ, BAP, MeJa, salicylic acid, GA 3 , ethephon, cyclanilide, ABA, light]), t p is the Y-predictive score matrix, p p is the Y-predictive loading matrix for X, t o is the Y-orthogonal score matrix, p o is the Y-orthogonal loading matrix for X, q p is the Y-predictive loading matrix for Y, and E and F are the residual matrices for X and Y, respectively. Note that the four indicator variables for the cytokinins sum to a column of "+ 1" entries, as do the three indicator variables for the auxins. Cluster analysis. Subsets of metabolites sharing similar patterns were investigated using cluster analysis. For that purpose, the contribution (loading) of each ion feature associated with an identified metabolite was collected across all significant consensus OPLS-DA models and displayed in a dendrogram and a heat map. This strategy highlights upregulation and downregulation. Cluster analysis was carried out with the Bioinformatics Toolbox v4.2 under the MATLAB ® v8 environment (The MathWorks) using Euclidean distances and the Ward aggregation method.
Factors with more than two levels. Auxins and cytokinins were two factors in our design associated with three and four levels, respectively. Exactly one auxin and one cytokinin were included in each run of the experimental design. Consequently, the four Y(j) indicator columns corresponding to auxins and the three Y(j) indicator columns corresponding to cytokinins are linearly dependent, as stated above. Our analysis included all indicator columns, because each is treated separately. This implies that downregulation or upregulation must be interpreted within the linearly-dependent groups, e.g. if the three auxins IAA, IBA and 2,4-D are identified as downregulators, the fourth (NAA) must be an upregulator (relative to the other auxins). This behaviour is clearly shown in the heat map and also implies analogous dependencies among the S-plots of the auxins and cytokinins, respectively.