Computational workflow to study the seasonal variation of secondary metabolites in nine different bryophytes

In Eco-Metabolomics interactions are studied of non-model organisms in their natural environment and relations are made between biochemistry and ecological function. Current challenges when processing such metabolomics data involve complex experiment designs which are often carried out in large field campaigns involving multiple study factors, peak detection parameter settings, the high variation of metabolite profiles and the analysis of non-model species with scarcely characterised metabolomes. Here, we present a dataset generated from 108 samples of nine bryophyte species obtained in four seasons using an untargeted liquid chromatography coupled with mass spectrometry acquisition method (LC/MS). Using this dataset we address the current challenges when processing Eco-Metabolomics data. Here, we also present a reproducible and reusable computational workflow implemented in Galaxy focusing on standard formats, data import, technical validation, feature detection, diversity analysis and multivariate statistics. We expect that the representative dataset and the reusable processing pipeline will facilitate future studies in the research field of Eco-Metabolomics.

In Eco-Metabolomics interactions are studied of non-model organisms in their natural environment and relations are made between biochemistry and ecological function. Current challenges when processing such metabolomics data involve complex experiment designs which are often carried out in large field campaigns involving multiple study factors, peak detection parameter settings, the high variation of metabolite profiles and the analysis of non-model species with scarcely characterised metabolomes. Here, we present a dataset generated from 108 samples of nine bryophyte species obtained in four seasons using an untargeted liquid chromatography coupled with mass spectrometry acquisition method (LC/MS). Using this dataset we address the current challenges when processing Eco-Metabolomics data. Here, we also present a reproducible and reusable computational workflow implemented in Galaxy focusing on standard formats, data import, technical validation, feature detection, diversity analysis and multivariate statistics. We expect that the representative dataset and the reusable processing pipeline will facilitate future studies in the research field of Eco-Metabolomics.

Background & Summary
In Ecological Metabolomics (or short "Eco-Metabolomics"), metabolite profiles of organisms are studied in order to describe ecological processes such as biotic interactions or the impact of environmental changes on various biological species [1][2][3] . In contrast to biochemistry, wild non-model species are typically studied in their natural environment in ecology. This often involves different individuals of one or more species from populations growing under quite heterogeneous conditions when compared to the controlled conditions in greenhouses or growth chambers. As a result, metabolite profiles are highly variable when compared to each other. Moreover, profiles of non-model species contain a large number of novel compounds (so called "unknown unknowns") that are difficult to identify because of lacking reference compounds, which have so far been mostly elucidated in model organisms 3,4 . Furthermore, designing ecological experiments is often complex and involves multiple factors 5 . Thus, the metabolomics data processing pipeline needs to be adapted in order to deal with the particular hypotheses and idiosyncrasies of ecological experiments.
Here, we present a descriptor for a dataset that we consider representative for the research field of Eco-Metabolomics. Our study makes use of a field campaign with a two-factorial design (seasons and species), which includes (except Marchantia polymorpha) non-model species of bryophytes. In order to facilitate subsequent analysis, we kept the experiment design as simple as possible. The sampling was conducted on-site at the Botanical Garden of Martin Luther University Halle-Wittenberg once in each season over a period of one year (see below). Metabolite profiles were acquired using untargeted liquid chromatography coupled with mass spectrometry (LC/MS). Raw metabolite profiles are available in the metabolomics data repository MetaboLights 6 (Data Citation 1).
In biochemistry there are strict laboratory protocols that ensure reproducibility of the analytical methods, while in bioinformatics this function is accomplished by implementing reusable computational workflows 7,8 . Thus, in addition to the dataset we also address the typical bioinformatic challenges that come with Eco-Metabolomics experiments by implementing a reproducible and reusable computational workflow (Fig. 1). While the analysis and ecological interpretation of the study is described in Peters et al. 9 , here we focus on the analytical and bioinformatic work that is required to create a computational processing pipeline that is reproducible and that can be reused by other subsequent studies.
We describe in detail the experimental methodology that was used to create the dataset as well as the methodology to make the computational workflow reproducible (to give identical results in different computational environments). By formalizing and validating the processes that led to the results 10,11 , we expect that this approach can serve as a model for subsequent studies. We further expect that Eco-Metabolomics studies use our dataset and the computational workflow to foster reuse and improve future data processing pipelines.

Methods
These methods describe in detail the steps in producing the data, including full descriptions of the experimental design in our related work 9 , data acquisition, computational processing, diversity analysis, biostatistics and bioinformatics procedures.

Sampling protocol
In each season, three composite samples of different individuals of each species were taken, leading to a total of 3 * 9 * 4 = 108 samples. Only above-ground parts of the moss gametophytes such as leaves, branches, stems or thalloid parts were taken for sampling. From dioecious species such as M. polymorpha, P. strictum and P. undulatum female, male and sterile gametophytes were collected in a composite sample. Before sampling, visible archegonial and antheridial heads and any belowground parts such as rhizoids and rooting stems were removed with a sterile tweezer. The gametophytic moss parts were put in Eppendorf tubes and were frozen instantly on dry ice and later in the lab in liquid nitrogen.

Collecting ecological characteristics
In order to relate metabolomes of the bryophytes to ecology, several ecological characteristics were recorded on-site and compiled from literature. The on-site characteristics type of substrate with the nominal/categorical levels "soil", "rock with lean soil cover" and "rock"; light conditions with the ordinal www.nature.com/sdata/ SCIENTIFIC DATA | 5:180179 | DOI: 10.1038/sdata.2018.179 levels "sunny", "half-shade" and "shade"; moisture of the substrate with the ordinal levels "dry", "fresh", "damp" and wet; and exposition with the nominal levels "North", "East", "South", "West", "Northeast", "Northwest", "Southeast" and "Southwest" were recorded when taking the samples in the field.
The nominal characteristics growth form, habitat type, substrate and life strategy, the ordinal lifehistory characteristics spore size, gametangia distribution and sexual reproduction frequency, as well as the ordinal Ellenberg indicator values (indices for light, temperature, continentality, moisture, reaction, nitrogen and life-form) were collected from the literature [13][14][15][16][17] . For an overview, please refer Table 1 (available online only) in Peters et al. 9 or the file m_characteristics.csv in the dataset (see Data Citation 1, and Table 1 (available online only)).

Extraction protocol and LC/MS analysis
Frozen moss samples were homogenized by adding 200 mg ceramic beads (0.5 mm diameter, Roth) and ribolysing (Precellys 24, 2 × 20 s at 6500 r.p.m., 5 min pause in liquid nitrogen). 1 ml ice-cold 80/20 (v/v) methanol/water spiked with internal standards 5 μM biochanin A (Sigma-Aldrich), 5 μM kinetin (Sigma-Aldrich) and 5 μM N-(3-indolylacetyl)-l-valine (Sigma-Aldrich) were added. Samples were vortexed and thawed while shaking for 15 min at 1,000 r.p.m. at room temperature followed by ultrasonification for 15 min and again 15 min shaking. After 15 min centrifugation at 13,000 r.p.m. 500 μl of supernatant were dried in a vacuum centrifuge at 40°C and reconstituted in 80/20 (v/v) methanol/water with the volume adjusted to the initial fresh weight of the sample to a final concentration of 10 mg fresh weight per 100 μl extract.

Quality control
In order to validate the instrument performance and to detect batch effects between the instrument runs, the following quality control (QC) protocol was realized. Samples with a lab-internal standard mix (MM8) were interspersed before and after 7 bryophyte samples in the MicrOTOF 18 . The following substances were used in the MM8: 2-Phenylglycine (Fluka), Kinetin (Roth), Rutin (Acros Organics), O-Methylsalicylic acid (Sigma), Phlorizin dihydrate (Sigma), N-(3-Indolyacetyl)-L-valine (Sigma), 3-Indolylacetonitrile (Fluka) and Biochanin A (Sigma). Substances in the MM8 were selected based on their ionization properties (ionization in both positive and negative mode and the differential adduct formation) and a wide coverage of known retention times throughout the gradient with our instrumental setup. Known ionization properties were used to detect shifts and effects in mass-to-charge ratios (m/z) and retention times (RT) of the respective batches and to validate RT correction made by XCMS (see below).

Raw data acquisition
Raw LC/MS data were converted to the open data format mzML 19 with the software CompassXPort 3.0.9 from Bruker Daltonics (available at http://www.bruker.com/service/support-upgrades/software-downloads.html). In compliance with the minimum information guidelines for Metabolomics studies 20 , metadata were recorded to ISA-Tab format 21 using ISAcreator 1.7.10 (ref. 22) (available at https://github. com/ISA-tools/ISAcreator/releases) and uploaded together with the raw data to the metabolomics repository MetaboLights 6 (Data Citation 1). Profiles of positive mode were used for the data analyses as many important and known secondary metabolites classes in bryophytes such as flavonoids, phenylpropanoids, anthocyans, glycosides and previously characterized compounds such as Marchantins, Communins and Ohioensins ionize well in positive mode with our instrumental setup.

Peak detection
Chromatographic peak picking was performed in R 3.4.2 (available at https://cran.r-project.org) with the package XCMS 1.52.0 (ref. 23) using the centWave algorithm and the following parameters: ppm = 35, peakwidth = 4,21, snthresh = 10, prefilter = 5-50, fitgauss = TRUE, verbose.columns = TRUE. Grouping of chromatographic peaks was performed with two factors (in XCMS called "phenoData"): seasons with the levels summer, autumn, winter and spring; and species with the levels Brarut, Calcus, Fistax, Gripul, Hypcup, Marpol, Plaund, Polstr and Rhysqu. The following parameters were used for grouping: mzwid = 0.01, minfrac = 0.5, bw = 4. To improve subsequent data analyses, intensities in the peak table were log transformed before grouping. For further analysis, only features between the retention times 20 s and 1020 s were kept. Retention time correction was performed using the function retcor in XCMS using the parameters method = loess, family = gaussian, missing = 10, extra = 1, span = 2. The parameters were additionally optimized using the R package IPO 1.3.3 (ref. 24), but better alignment precision was achieved with manual control and knowledge of instrument settings 25 .

Exemplary compound annotation
Compounds were putatively annotated for the follow-up validation and biochemical interpretation with the software Bruker Compass IsotopePattern 4.4. Annotation was performed by calculating accurate masses (mass-to-charge values) from known compounds in M. polymorpha and other liverworts found in PubChem, the KNApSAcK database and Asakawa et al. 27,28 . In the software Bruker Compass DataAnalysis 4.4 the mass-to-charge was matched to device-specific retention times in the metabolite profile. To validate whether the known compound was present in the profile, Extracted Ion Chromatograms (EIC) and area-under-curve (integrated intensities) were checked manually.

Diversity analysis
Statistical analyses were performed using the additional R packages: multtest, RColorBrewer, vegan, multcomp, multtest, nlme, ape, pvclust, dendextend, phangorn, Hmisc, gplots and VennDiagram. A presence-absence matrix was generated from the feature matrix to determine the differences in metabolite features between the experimental factors species and season. In accordance with the minfrac parameter in the alignment step in XCMS (see above), a feature was considered present when it was detected at least in two out of three replicates. The presence-absence matrix was used for measuring the metabolite richness for each species and season by calculating the Shannon diversity index (H') for each sample i using the function diversity in vegan with the parameter index = shannon 29 . The following equation was used for calculation: where t represents the number of samples in the particular group. The total number of features and the number of unique features were calculated from the presenceabsence matrix accordingly. To test factor levels for significant differences, the Tukey HSD on a one-way ANOVA was performed post-hoc using the multcomp package.
Variability was calculated with the Pearson Correlation Coefficient (PCC, Pearson's r) using the function rcorr in the package Hmisc. Venn diagrams were created for each species separately using the package VennDiagram. Each set in the Venn diagram represents one season and shows distinct and shared features in all possible combinations between the sets.

Multivariate statistical analysis
Variation partitioning was performed using the function varpart in the package vegan to analyze the influence of the factors species and seasons on the metabolite profiles. Distance-based redundancy analysis (dbRDA) using the function capscale with Bray-Curtis distance and multidimensional scaling in the package vegan was chosen to analyse the relation of the ecological characteristics with the species metabolite profiles 30,31 . Ordinal and categorical ecological characteristics were transformed to presenceabsence matrices for the ordination. The optimal model for the dbRDA was chosen with forward and backward selection using the function ordistep in the package vegan. Ecological characteristics were added to the plots as post-hoc variables using the function envfit in the package vegan.

Chemotaxonomic comparison to phylogeny
Relationships between metabolite profiles and phylogeny were analysed by calculating dissimilarities for phylogeny and the feature matrix using Bray-Curtis distance (function vegdist in vegan) followed by hierarchical clustering using the function hclust and the complete linkage method. In order to improve the visual comparison between the two trees, the chemotaxonomic plot was reordered using the function order.optimal (package cba) and leaves of Polstr and Plaund were swapped using the function reorder in vegan. The similarity of the two trees was determined with the normalized Robinson-Foulds metric (function RF.dist in package phangorn). The similarity of the distance matrices was determined with the Mantel statistics (function mantel in vegan).

Computational workflow
For the computational workflow, the required software tools, their dependencies, as well as software libraries and R packages were containerized using Docker technology 32 . The container was based on Linux and Ubuntu 16.04 and included R version 3.4.2 from the R apt repository. The commands for building the container can be found in the Dockerfile (Table 1 (available online only)). The resulting container image was made available at DockerHub (https://hub.docker.com/r/korseby/mtbls520/).
The computational workflow was constructed with the Galaxy workflow management system 33 . It consists of 20 modules and each individual module represents one or more dedicated steps in the Peters et al. study 9 , e.g. data retrieval, feature detection, alignment or statistical analysis (Fig. 1). For the workflow, individual Galaxy modules were written in XML format. Each Galaxy module executes a shell or R script with defined inputs and outputs. Scripts are only executed inside the software container. Thus, code execution is encapsulated and all required software dependencies were resolved in the software container. In order to comply with the Interoperability criterion in the FAIR guidelines 34 , the PhenoMeNal cloud e-infrastructure was used to test the workflow in different computational environments (https://phenomenal-h2020.eu). To ensure that the workflow generates the same results in different computational environments, continuous automatic workflow testing was implemented with wft4galaxy 35 .

Data Records
The primary access site for the dataset is MetaboLights (Data Citation 1), which includes the 108 metabolite profles of the bryophytes in positive and negative mode, QC profiles, ecological data and meta-data (see Table 2 (available online only) for an overview of sample names and associated factor  Table 1 (available online only) provides an overview of data files, formats and functions in the computational workflow.

Code Availability
The source code (also deposited at https://github.com/korseby/container-mtbls520/) was published 36 and made available under the terms of APACHE license 2.0. Please refer Table 1 (available online only) for an overview of the function of each file of the source code.
Code for building the software container image and the workflow including Galaxy modules and scripts that are executed inside the container were published under Open Access 36 . A pre-built binary software container image was made available at DockerHub (retrievable at https://hub.docker.com/r/ korseby/mtbls520/).

Technical Validation Quality control
Four sets of 27 bryophyte samples were generated in the experiment. One set for each season was analyzed with UPLC/ESI-QTOF-MS (see methods below) which resulted in a total of 108 bryophyte metabolite profiles. In order to validate the instrument performance and to detect batch effects between the four instrument runs, a quality control (QC) protocol was implemented. Sets of 27 species samples were interspersed by samples of a lab-internal standard mix (MM8) before and after 7 bryophyte samples. Peak detection in these MM8 profiles was performed with the identical parameters as for the bryophyte samples.
The four sets containing the MM8 metabolite profiles were checked visually for differences by plotting them against each other (Fig. 2a) and stacked next to each other (Fig. 2b). The density distribution of the intensities within the sets of MM8 profiles were also checked and compared to each other with a density plot (histogram) (Fig. 2c).
Mass-to-charge ratio and retention time deviation (in seconds) and correction made by XCMS were checked with diagnostic plots made by XCMS (Fig. 3). We found maximum retention time deviations within 2 s ( Fig. 3a and b) which are in the expected range of the analytical setup 18 . The determined massto-charge deviations ( Fig. 3c and d) are within instrument specification as well 18 .
The variation in the intensities of the internal lab standards was also checked for each reference compound individually as shown in Figs. 4 and 5. In general, the variation for each reference compound and the deviations between MM8 profiles are both well within the typical range of 10 to 15% (ref. 18).
We conclude that there are no significant batch effects in the technical replicates to overlap with the factor seasons of the experiment. Thus, the automatic retention time correction made by XCMS is validated for the parameters used in the peak detection process. Exemplary annotation of Marchantia polymorpha profile With known accurate masses (m/z values) and calculated retention time values (see methods), we confirm the annotation of many known compounds which are described in literature for the model species Marchantia polymorpha 27,28 (Fig. 6). Many of these known compounds also constitute the most abundant features in the profile of M. polymorpha (Fig. 6).

Computational workflow
We have implemented the computational workflow in the Galaxy workflow management system 33 and have made the workflow and underlying code available as Open Source 36 . The Galaxy workflow represents the entire computational processing pipeline that is used in the Peters et al. study 9 (Fig. 1). Each of the individual modules represents a particular step in the workflow and has defined inputs (e.g. pre-processed peak table data matrix) and outputs (e.g. PDF containing the plot of a particular statistical method) (Fig. 1). We used data standards and minimum information criteria for constructing the modules of the workflow 20,22 . Continuous automatic testing of the workflow was performed with wft4galaxy 35 in the PhenoMeNal e-infrastructure (https://phenomenal-h2020.eu) to ensure that the workflow generates the same results in different computational environments.
We proceeded according to the FAIR guiding principles 34 in order to implement a reusable computational workflow. The acronym FAIR stands for Findable, Accessible, Interoperable and Reusable and encompasses several criteria to support the reuse of scholarly data. So far, the FAIR guidelines have only been aspired to make data reusable. However, as the conceptual formulation within FAIR are quite generalized 37 , these principles can also be applied to computational workflows. Nonetheless, there are some computational challenges involved. For example, software runs in different software environments and software dependencies need to be resolved. We tackle this by creating software containers which can be run on multiple systems and contain the software tools, all required libraries and R packages 32,38 . As dependencies in the container have already been resolved, sharing the container image greatly facilitates to allow the software to be run in multiple environments.
We have chosen the Galaxy Workflow Management system 33,39 to implement the whole data processing pipeline (Fig. 1) as it is already known to facilitate reproducible results 40 . Several processing modules were constructed that represent the individual steps of the Peters et al. study 9 . Software tools are invoked from the Galaxy modules and are executed inside the container, thus, adding a level of encapsulation and eliminating the need for the user to install additional software 41 . Galaxy has a graphical user interface that hides the technical complexity from the end user and does not need intensive bioinformatic background knowledge to run the particular modules and workflows. This greatly contributes to the adoption by the end users (biochemists and ecologists) and facilitates future studies in the research field of Eco-Metabolomics.

Statistical analyses
With untargeted metabolomics analysis in ecology, diversity analysis is typically used to characterize the richness and the abundance of biochemical features in the metabolite profiles of biological species 42 . Metabolite richness is a simple measure that counts the individual biochemical features in the metabolite profiles of the species 43 . The abundance of features in the metabolite profiles is usually calculated by diversity indices such as the Shannon diversity index (H') in order to characterize simple relationships with regard to the study factors 44 .
Ordination methods such as Redundancy Analysis (RDA) and distance-based Redundancy Analysis (dbRDA) are frequently used in Ecology 30 . They allow to derive correlations of specific variables between the matrix of predictors containing the measurements (X matrix) and the response matrix with the ecological traits (Y matrix) 30,45 . These methods are also suitable for Eco-Metabolomics data as they allow the use of multiple (non-categorial) variables in a single model and allow to calculate the amount of explained variance of the model. We have chosen the dbRDA, which can also be regarded as a constrained version of metric scaling (MDS) 46,47 . We have implemented dedicated modules for these statistical operations in our computational workflow (see Methods section and Fig. 1).

Usage Notes
Building the container image Following are instructions to manually build the container image. The file Dockerfile in Table 1 (available online only) contains the ruleset. The container has been built using Docker version 17.05-ce under Linux Ubuntu 16.04. The following commands were run to generate the image: sudo apt-get install apt-transport-https ca-certificates git sudo echo deb http://apt.dockerproject.org/repo ubuntu-xenial main >>/etc/apt/sources.list sudo apt-key adv --keyserver hkp://ha.pool.sks-keyservers.net:80 --recv-keys 58118E89F3A912897C070ADBF76221572C52609D sudo apt-get update && sudo apt-get install docker git clone https://github.com/korseby/container-mtbls520 cd container-mtbls520 docker build -t korseby/mtbls520. Installing and using Galaxy to run the workflow The workflow was tested with Galaxy version 17.09. Instructions how to install Galaxy can be found in the training material of the Galaxy project (accessible at https://galaxyproject.github.io/training-material/). However, it is recommended that an official Galaxy server is used, such as those from the PhenoMeNal infrastructure (available at https://public.phenomenal-h2020.eu/).
After being logged into Galaxy, a click on "Workflow" in the menu bar on the top and then a click on the "Upload" button opens up a new page. In the field "Galaxy workflow URL:" enter the following address "https://raw.githubusercontent.com/korseby/container-mtbls520/develop/galaxy/ mtbls520_workflow.ga" or upload the .ga file from the GitHub repository (Table 1 (available online only)) and then clicking on the button "Import". This will import the workflow of the study into Galaxy. The workflow will now be available in Galaxy under Workflows as "Metabolights 520 Eco-Metabolomics Workflow". From there, clicking on the drop-down menu there are options to "Edit" (visually view the complete workflow in the Galaxy workflow editor) or to "Run" the workflow. Required data can be downloaded from MetaboLights with the Galaxy module "mtbls520_01_mtbls_download" (Table 1 (available online only)). Once the download has been completed, data can be extracted with the Galaxy module "mtbls520_02_extract" (Table 1 (available online only)). The workflow can be directly run once the inputs have been assigned to the extracted data files. Processing will take approx. 40 min depending on the work load of the computational infrastructure.