Introduction
The genetic diversity of primary and secondary metabolites is incredibly high, notably in plants1; however, our understanding of such metabolism and its regulation is still limited2. In a recent paper3, we have made the first attempt to unravel the genetic architecture of metabolism in a model plant using "genetical metabolomics." This is a derivative of the strategy of genetical genomics4 that has been applied in recent years to the genetic study of gene expression data in a wide range of organisms5, 6, 7, 8, 9, 10, 11, 12, 13, 14. For transcriptome data, this strategy works as follows: determine gene expression (preferably genome-wide) in genetically different individuals, treat the transcript abundances of each gene over all individuals as a quantitative trait, use molecular markers to fingerprint the individuals, use quantitative trait locus (QTL) mapping to identify regulators (expression quantitative trait loci (eQTL)) and (re)-construct regulatory networks. For such network reconstruction, correlations of either transcript abundances11, 15, 16 or eQTL profiles11, 17 are applied. Keurentjes et al.3 developed and applied a similar strategy to metabolite abundance data.
Specifics of MetaNetwork
Similar to the approach used in gene expression studies, the genetic determinants of variation for metabolite abundance (mQTL) can be mapped. However, algorithms used for the analysis of transcript abundance have to be adapted to the specifics of metabolite abundance. In the work of Keurentjes et al.3, one-third of the segregating mass peaks were not present in the parental lines, presumably as a result of new allelic combinations. Likewise, many segregating mass peaks were not present in an appreciable proportion of the segregants, causing clear spikes at zero in the corresponding metabolite abundance distributions. Standard parametric approaches for QTL mapping (e.g., t-test12, ANOVA6, 7, 10, maximum likelihood13) assume that the residual variation follows a normal distribution, and departure from this assumption due to a spike can inflate errors of type I and II18. Standard non-parametric approaches for QTL mapping (Wilcoxon–Mann–Whitney test5, 14) can solve this problem, but they are less useful when multiple QTL models are considered18. A more suitable approach is to perform QTL analysis on the binary trait defined by whether an individual has a non-zero abundance, and on the quantitative trait for those individuals who have non-zero abundance. To combine these two analyses, MetaNetwork implements a two-part parametric model18 for mQTL mapping and outputs QTL profiles (-log10 P significance values plotted at marker positions along the genome).
Network reconstruction approaches based on the correlation of transcript abundance15, 16 may also be suitable for metabolite abundance. However, whereas transcripts are translated into molecules of another type (proteins), metabolites are transformed by enzymes into molecules of the same type (other metabolites). Therefore, if one metabolite is the precursor of another metabolite, an mQTL involved in the transformation will exert reversed effects for the precursor and its successor. Counterbalancing of positive and negative effects of multiple mQTLs may make it difficult to infer associations between metabolites from abundance correlations. Metabolites in the same pathway will show similar peaks in their QTL profiles, so that a correlation analysis based on QTL profiles may overcome this problem. MetaNetwork subsequently uses such correlations to determine associations between metabolites and to re-construct metabolic networks.
Challenges in MetaNetwork
Within the context of the genetical genomics experimental space, MetaNetwork encounters numerous challenges due to the size and the scope of the data set and the complexity of metabolic networks. Testing multiplicity is obviously a general challenge in QTL mapping19. The genome-wide mapping of each of many (correlated) mass peaks can result in a large number of false positives and/or false negatives. MetaNetwork uses Storey's method20 to control false discovery rate (FDR). Candidate gene multiplicity is another challenge: an mQTL may still harbor hundreds of candidate genes21. Incorrect connections between metabolites affected by different enzymes may be predicted if the genes for those enzymes appear to colocalize on the genome. To predict or to prioritize candidates among many potential genes in an mQTL region requires additional strategies such as fine mapping and/or follow-up laboratory experiments. Appropriate information can also be derived from the use of assumedly independent (in silico) information in databases with metabolic pathway information, such as KEGG22, MetaCyc23 or AraCyc24, or data on eQTL studies, enzyme activity assays, or phenotypic data on the same segregants. Mass peak multiplicity, that is, metabolites represented by multiple mass peaks, is another challenge25. For example, a metabolite with mass m can have one or more charges, so that peaks can appear at masses m, m/2, m/3 and so on. Likewise, different isotopes of this metabolite have different numbers of neutrons, so that peaks can appear at m + 1, m + 2, m + 3 and so on. Unfortunately, error-free assignment of different mass peaks to a single metabolite is still difficult with today's mass spectrometry methods26. However, MetaNetwork can provide important independent information to improve on this: it can predict possibly related peaks based on highly correlated mQTL profiles (r > 0.95).
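The testing-multiplicity problem can be illustrated with a small simulation. The sketch below is not the MetaNetwork implementation: it uses the Benjamini-Hochberg adjustment from base R ("p.adjust") as a stand-in for Storey's method from the "qvalue" package, and the numbers of tests and the P-value distributions are invented for illustration only.

```r
# Illustrative only: Benjamini-Hochberg adjustment from base R, used as a
# stand-in for Storey's q-value method (package "qvalue") named in the text.
set.seed(1)
n_null <- 900   # hypothetical tests without a true mQTL
n_alt  <- 100   # hypothetical tests with a true mQTL
p_values <- c(runif(n_null), rbeta(n_alt, 0.1, 10))

adjusted    <- p.adjust(p_values, method = "BH")
significant <- adjusted <= 0.05

# Raw P-value cut-off implied by the chosen FDR level, on the -log10 scale
threshold <- if (any(significant)) -log10(max(p_values[significant])) else NA

cat("tests declared significant:", sum(significant), "\n")
cat("-log10(P) threshold:", round(threshold, 2), "\n")
```

The point of the sketch is only that an FDR level translates into a data-dependent cut-off on the -log10 P scale; the threshold reported by MetaNetwork is computed by its own procedure.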
Applications of MetaNetwork
To date, our MetaNetwork applications have been based on untargeted metabolite abundance data collected from recombinant inbred lines (RILs) of Arabidopsis thaliana plants using liquid chromatography–mass spectrometry technology3. It measures a large range of different metabolites mainly involved in secondary metabolism, including phenylpropanoids, flavonoids and glucosinolates27. Many of these metabolites show a spike in their abundance distribution and MetaNetwork was specifically developed to handle such data. However, the MetaNetwork protocol can equally well handle abundance data without spikes. Moreover, it can handle data obtained from other mass spectrometry techniques, such as gas chromatography–mass spectrometry28 that can detect polar primary metabolites.
In addition to mass spectrometry technologies for targeted or untargeted measurement of metabolite amounts3, 29, other high-throughput technologies for measuring amounts of other molecular entities, such as microRNAs, proteins and their post-translational modifications, are rapidly being developed30. The methodology described here is directly applicable to these and other quantitative types of data and helps biologists to understand how biological systems function.
Implementation of MetaNetwork
MetaNetwork is implemented in R, an open source software environment for statistical computing and graphics31. MetaNetwork is executed via a command line. However, users with little experience of command-line-driven applications and/or computer programming can easily run MetaNetwork using default parameter settings. An advanced user of R can change parameter settings or modify the underlying protocol, for example, by replacing the module for calculation of correlations by one for calculation of mutual information32, or the module for QTL analysis on RILs by one for QTL analysis on other types of segregating or natural populations. Future MetaNetwork releases will offer more options, for example, multiple QTL analysis33, 34 in the two-part model, combined analysis of metabolite abundance data with other types of biomolecular data11 and direct access of the R-tools to a metabolite abundance database. A seamless software infrastructure that supports MetaNetwork data management and analysis workflows is under development using code generation techniques35. For more implementation details, please consult the Supplementary Manual online.
Algorithm of MetaNetwork
The flowchart of the MetaNetwork protocol is shown in Figure 1. Given the scope of this manuscript, we will limit ourselves to the definition of the two main steps in the procedure: QTL mapping of metabolite abundances; and reconstruction of metabolic networks from correlations of QTL profiles. It should be noted that MetaNetwork does not offer data pre-processing; for example, alignment of mass peaks has to be performed by external applications such as METALIGN27.
Figure 1: MetaNetwork flowchart.
The shaded squares represent computational steps where names of R-functions are indicated between parentheses and the superscript numbers refer to steps in Box 1. The ellipses represent significance thresholds and cylinders represent biological results where the result names as R objects are indicated between braces. The solid line represents the step that is by default "on" in MetaNetwork and the dashed line represents the step that is by default "off" in MetaNetwork.
MetaNetwork detects the genetic determinants underlying variation in metabolite abundance with the help of a two-part QTL analysis. Part one tests whether the presence/absence of metabolites has a genetic basis: whether different genotype classes at a given marker differ in their numbers of non-zero observations. Part two tests whether quantitative variation in non-zero abundances has a genetic basis: whether the non-zero observations for each of these genotype classes at a given marker differ in mean abundance. The "P-value" of the QTL is computed as the product of the two "P-values" in the two parts. With binary data only (no quantitative data) or quantitative data only (no spike), the "P-value" of the missing part is set to one. These "P-values" are not yet corrected for multiple testing at many markers and also not for testing multiple metabolites. MetaNetwork can run simulation and FDR procedures20 to set an empirical threshold for the "P-values" at desired multiple-testing significance levels. MetaNetwork will output all relevant information such as the estimated effect of each mQTL, its support interval on the genome and the proportion of variance explained by it (see Box 1).
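The two-part test at a single marker can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used by MetaNetwork: the helper "two_part_p" is hypothetical, Fisher's exact test and Welch's t-test are choices made for this sketch, and the data are simulated for a population with two genotype classes.

```r
# Hypothetical helper, not the MetaNetwork implementation.
two_part_p <- function(abundance, genotype, spike = 4) {
  genotype <- factor(genotype)
  stopifnot(nlevels(genotype) == 2)
  present <- abundance > spike          # binary trait: signal above the spike threshold

  # Part one: do the genotype classes differ in their number of non-zero observations?
  p_binary <- 1                         # set to one when there is no spike or no signal
  if (any(present) && !all(present)) {
    p_binary <- fisher.test(table(genotype, present))$p.value
  }

  # Part two: do the non-zero observations differ in mean abundance between classes?
  p_quant <- 1                          # set to one when there are too few non-zero values
  x <- abundance[present & genotype == levels(genotype)[1]]
  y <- abundance[present & genotype == levels(genotype)[2]]
  if (length(x) >= 2 && length(y) >= 2) {
    p_quant <- t.test(x, y)$p.value
  }

  # Combined "P-value": product of the two parts, as described in the text
  c(binary = p_binary, quantitative = p_quant, combined = p_binary * p_quant)
}

# Simulated example: allele B raises both the chance of presence and the level
set.seed(42)
n <- 160
genotype  <- sample(c("A", "B"), n, replace = TRUE)
present   <- runif(n) < ifelse(genotype == "B", 0.9, 0.4)
abundance <- ifelse(present, rnorm(n, mean = ifelse(genotype == "B", 60, 40), sd = 10), 0)

res <- two_part_p(abundance, genotype)
print(-log10(res))
```

Because the product of two P-values does not itself behave like an ordinary P-value, a threshold for it has to be set empirically, which is the role of the simulation step described above.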
MetaNetwork explores the associations between metabolites by comparing their QTL profiles based on correlations. A permutation procedure sets an empirical threshold for the correlation at a desired significance level. MetaNetwork generates files with network connections that can be visualized using Cytoscape, an open source software suite for visualization of biomolecular interactions36 (see Box 1).
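The idea of comparing QTL profiles and thresholding the correlation by permutation can be sketched as follows. The profiles are simulated, and the permutation scheme (shuffling marker positions within each profile and recording the largest absolute correlation) is an assumption made for this illustration; the procedure implemented in MetaNetwork may differ.

```r
set.seed(7)
n_markers <- 117
# Hypothetical signed QTL profiles: rows are metabolites, columns are markers
shared_peak <- dnorm(seq_len(n_markers), mean = 60, sd = 4) * 80
profiles <- rbind(
  m1 = shared_peak + rnorm(n_markers, sd = 0.5),
  m2 = -shared_peak + rnorm(n_markers, sd = 0.5),  # reversed effect at the same locus
  m3 = rnorm(n_markers, sd = 0.5)                  # unrelated metabolite
)

# Zero-order correlation between every pair of profiles
profile_corr <- cor(t(profiles))

# Null distribution: break the marker alignment within each profile
n_perm <- 200
null_max <- replicate(n_perm, {
  shuffled <- t(apply(profiles, 1, sample))
  r <- cor(t(shuffled))
  max(abs(r[upper.tri(r)]))
})
corr_thres <- unname(quantile(null_max, 0.95))

# Pairs whose absolute correlation exceeds the empirical threshold
edges <- which(abs(profile_corr) > corr_thres & upper.tri(profile_corr), arr.ind = TRUE)
print(round(profile_corr, 2))
print(edges)
```

Using the absolute correlation keeps precursor/successor pairs, whose profiles mirror each other at a shared mQTL, in the list of candidate associations.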
Materials
Equipment
- Computer operating systems: Windows XP, GNU Linux or Mac OS X
- R (http://www.r-project.org): software environment for statistical computing and graphics. The R application (current version 2.4.1) and installation manual can be found at http://www.r-project.org. In this paper, we assume an application under Windows XP
- Required R-packages: "qvalue" for FDR control. R packages can be easily installed via Packages | install package(s). The user can choose a mirror site close to his location and then select the package "qvalue" for installation. Please go to http://www.r-project.org for help if necessary
- MetaNetwork package, user manual and example data files can be downloaded from http://gbic.biol.rug.nl/supplementary/2007/MetaNetwork and saved locally. Install the MetaNetwork package via Packages | install package(s) from local zip files: browse to the zip file of the MetaNetwork package
- Cytoscape: open source software for visualizing biomolecular interaction networks. Cytoscape (current version 2.3.2) can be downloaded from http://www.cytoscape.org. Cytoscape requires Java version 1.4.2, which can be downloaded from http://java.sun.com/j2se/1.4.2/index.jsp
Procedure
Overview
Preparing and starting (Steps 1–3)
- Prepare input files. Three kinds of information are required in QTL analysis: the genetic linkage map of molecular markers (markers, see Table 1); the genotypes of each individual at each marker position (genotypes, see Table 2); and the trait values (metabolite abundances) of each individual (traits, see Table 3). Optionally, the user can provide mass weight information for the mass peaks, to allow for a combined analysis of mass data and QTL profiles (peaks, see Table 4). The files should be formatted as comma separated values (CSV), for example, as "markers.csv," "genotypes.csv," "traits.csv" and "peaks.csv," respectively. Files can be formatted by using Microsoft's Excel via File | Save as, and choosing the file type "CSV (comma delimited) (*.csv)" from the pull-down menu of "Save as type."
- Load the MetaNetwork package by starting the R application and typing the command
> library(MetaNetwork)
This loads the functions of MetaNetwork and the required qvalue package.
- Change the working directory (optional). The default directory of R is most likely to be "C:/Program Files/R/R-2.4.1," where R is installed. Users can change it to the directory where the files from Step 1 are saved, for example, change to "C:/MetaAnalysis" using the command
> setwd("C:/MetaAnalysis")
Loading data (Steps 4–7; the order of these steps does not matter)
- Load the marker data. Load marker data (see Table 1 for format) from a file into an R object using the function "loadData," for example, load file "markers.csv" into R object "markerData" using the command
> markerData <- loadData("markers.csv")
If the user did not set the working directory in Step 3, he should give the full path of the file. The same holds for Steps 5–7.
> markerData <- loadData("C:/MetaAnalysis/markers.csv")
- Load the genotype data (see Table 2 for format) using the command
> genotypeData <- loadData("genotypes.csv")
- Load the trait data (see Table 3 for format) using the command
> traitData <- loadData("traits.csv")
- Optionally, load the peak data (see Table 4 for format). Load peak data to allow for a combined analysis of peak masses and QTL profiles using the command
> peakData <- loadData("peaks.csv")
Running the analysis (Step 8)
- Run MetaNetwork. Run the "MetaNetwork" function on data from previous steps and with default settings using the command
> MetaNetwork(markers=markerData, genotypes=genotypeData, traits=traitData, spike=4)
The arguments "markers," "genotypes" and "traits" take values from the R objects "markerData," "genotypeData" and "traitData" loaded in Steps 4–6. Absence of a mass peak in a considerable number of individuals leads to signal intensities equal to or less than the detection limit and therefore causes a spike in the trait distribution at zero. The argument "spike" has to be specified to separate presence/absence (binary) from available trait abundance (quantitative) in the trait data, for example, here using a threshold of four times the local noise3. The order of arguments does not matter (see Table 5). The above command will run analysis steps A–E and G by default (see Box 1). These steps can be individually excluded from, or optional steps F and H can be included in, the analysis using the commands outlined in Box 1. During MetaNetwork analysis (see Box 1), a summary of the process (e.g., the progress of the procedure, generated R objects and output files and the computing time) will be displayed in the R Console (see Fig. 2) and saved in the file "output.txt" for future reference.
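What the "spike" threshold does to a single trait can be sketched in a few lines. The intensities below are simulated, and the strict comparison against the threshold is an assumption made for this illustration; how MetaNetwork treats values exactly at the threshold is not specified here.

```r
set.seed(3)
spike <- 4   # same threshold value as in the command above
# Hypothetical signal intensities for one mass peak across 162 lines:
# 50 lines at noise level, 112 lines with a measurable signal
intensity <- c(runif(50, 0, 4), rlnorm(112, meanlog = 4, sdlog = 0.5))

is_present   <- intensity > spike        # binary trait (presence/absence)
quantitative <- intensity[is_present]    # quantitative trait, non-zero part only

cat("lines at or below the threshold:", sum(!is_present), "\n")
cat("lines with measurable abundance:", length(quantitative), "\n")
```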
Figure 2: The view of the R console for the MetaNetwork application.
The procedures, R object names and file names for saving results and processing times are shown.
Critical step: R objects exist only during the working period of the R Console. To serve later MetaNetwork analyses, R objects can be saved during closure of the R console.
Visualization (Steps 9–10)
- QTL profiles visualization. The QTL likelihood along the genome (-log10 P calculated at each marker position) can be visualized in R with the function "qtlPlot" using the command
> qtlPlot(markers=markerData, qtlProfiles=qtlProfiles, qtlThres=qtlThres)
where argument "markers" takes values from object "markerData" generated in Step 4; argument "qtlProfiles" is the QTL test statistic and takes the values in the object "qtlProfiles" generated in Step 8A (see Box 1) of MetaNetwork; argument "qtlThres" is the threshold for significant QTLs and takes the value from object "qtlThres" generated in Step 8B of MetaNetwork.
- Network visualization using Cytoscape. Launch Cytoscape and choose "File | Import | Network (multiple file types)" to load network file ("network.sif") and "File | Import | Edge Attributes" to load edge attributes file ("network.eda") generated in Step 8G (see Box 1). Different layout and visualization styles can be applied to view the network, for example, applying the threshold "corrThres" from Step 8F (see Box 1) as a filter to only show significant edges. For details, please see the Cytoscape manual (http://www.cytoscape.org).
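For readers who want to see what such files contain, the sketch below writes a small network in Cytoscape's simple interaction format (SIF) together with a matching edge attribute file. The peak names, the correlation values and the edge-type label "cor" are hypothetical, and the exact layout of the files written by MetaNetwork may differ from this illustration.

```r
# Hypothetical significant correlations between three mass peaks
edges <- data.frame(
  from = c("peak1", "peak1", "peak2"),
  to   = c("peak2", "peak3", "peak3"),
  corr = c(0.91, -0.42, 0.37),
  stringsAsFactors = FALSE
)
interaction <- "cor"   # arbitrary edge-type label chosen for this sketch

# SIF: one line per edge, "source <tab> interaction <tab> target"
sif_lines <- paste(edges$from, interaction, edges$to, sep = "\t")

# Edge attributes: a header naming the attribute, then
# "source (interaction) target = value" for each edge
eda_lines <- c(
  "Correlation (class=Double)",
  sprintf("%s (%s) %s = %s", edges$from, interaction, edges$to, edges$corr)
)

# Written to a temporary directory so that no existing results are overwritten
out_dir <- tempdir()
writeLines(sif_lines, file.path(out_dir, "example.sif"))
writeLines(eda_lines, file.path(out_dir, "example.eda"))
cat(sif_lines, sep = "\n")
cat(eda_lines, sep = "\n")
```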
Timing
Figure 2 shows the timing of the analysis of 24 metabolites from 162 RILs in Arabidopsis at 117 markers3, using a Windows XP PC with an AMD Athlon 64 CPU (2.20 GHz) and 1 GB of RAM. The computation time increases with the number of traits and markers: linearly for QTL mapping (Steps 8A and C), and quadratically for correlation (Steps 8D and E) and peak multiplicity finding (Step 8H). The computation time of QTL threshold simulation (Step 8B) and correlation threshold permutation (Step 8F) increases linearly with the number of simulations/permutations. The timing for optional Steps 8F and H is not shown: 10,000 permutations take 5,270 min (use of a computer cluster is suggested); peak multiplicity finding takes a few seconds. The total computation time for a default MetaNetwork analysis of 2,000 mass peaks is up to 4 days.
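The difference between linear and quadratic scaling can be made concrete with a little arithmetic. The sketch below only counts units of work (one QTL test per mass peak and marker, one correlation per pair of mass peaks), which is a simplifying assumption; it does not predict actual run times.

```r
n_small   <- 24     # mass peaks in the timed example
n_large   <- 2000   # mass peaks in the full data set
n_markers <- 117

# QTL mapping: one test per mass peak and marker (linear in the number of peaks)
qtl_small <- n_small * n_markers
qtl_large <- n_large * n_markers

# Correlation: one value per pair of mass peaks (quadratic in the number of peaks)
pairs_small <- choose(n_small, 2)
pairs_large <- choose(n_large, 2)

growth <- c(qtl_mapping = qtl_large / qtl_small,
            correlation = pairs_large / pairs_small)
print(round(growth, 1))
```

Going from 24 to 2,000 mass peaks multiplies the number of QTL tests by about 83 but the number of pairwise correlations by several thousand, which is why the correlation steps dominate for large data sets.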
Troubleshooting
The most important sources of error and possible solutions are given in Table 6.
Anticipated results
MetaNetwork was used for the genetic study of 2,000 mass peaks in 162 RILs of Arabidopsis generated from a cross between the distant accessions Landsberg erecta (Ler) and Cape Verde Islands (Cvi)3. These individuals have been genotyped at 117 markers which are nearly evenly distributed along the genome. The network correlations as predicted by the MetaNetwork protocol were verified against previous knowledge29, 37, 38, 39 for 18 aliphatic glucosinolates and six glycosylated flavonols, all products of secondary metabolism. We use this small data set as an example of the type of results that can be anticipated. All data are shipped with the package and can be loaded in R using
> data(markers)
> data(genotypes)
> data(traits)
Alternatively, users can load data and test MetaNetwork simply by command line
> example(MetaNetwork)
Mapping genetic determinants
The QTL likelihood along the genome as stored in "qtlProfiles" is visualized with the function "qtlPlot," loaded by > data(qtlProfiles) and visualized by > qtlPlot(markers,qtlProfiles,4.11). At the empirical -log10 P threshold of 4.11 (α=0.05, FDR=0.0003), the glucosinolate mQTLs map to two major loci, which were confirmed by a previous targeted study39: gene AOP at 9.0 cM of chromosome 4 is responsible for glucosinolate side-chain modification37, and gene MAM at 35 cM of chromosome 5 is responsible for chain elongation39. The observation that all glucosinolates have a QTL at MAM but only some of them have a QTL at AOP suggests that AOP acts downstream of MAM (Fig. 3a). The mQTL at MAM exerts the same sign of effect for all glucosinolates that are in the same branch of the network, whereas the mQTL at AOP exerts reversed effects on precursors and their successors. Six flavonols showed strong mQTLs at 88.6 cM of chromosome 1, where a not previously known glycosyl transferase or regulator was suggested3 (Fig. 3b).
Figure 3: The visualization of metabolic QTL profiles and networks.
(a) The mQTL profiles for ten aliphatic glucosinolates before AOP catalysis (upper part) and eight after AOP catalysis (lower part). The mQTL at 303.3 cM on chromosome 4 is at the AOP locus. The mQTL at 409.4 cM on chromosome 5 is at the MAM locus. A positive (negative) sign indicates that individuals carrying the Cvi allele have higher (lower) abundance than individuals carrying the Ler allele. The different colors represent different carbon chain lengths (black 3C; red 4C; green 5C; blue 6C; light blue 7C). (b) The mQTL profiles for six glycosylated flavonols. The mQTL at 88.6 cM on chromosome 1 is a putative glycosyl transferase, catalyzing the production of flavonoldihexosides. The different colors represent different aglycone classifications (black: quercetin; red: kaempferol; green: isorhamnetin), different line types represent different glycosylation patterns (solid line: dihexoside; dashed line: hexoside). (c) The detected mQTLs explain a percentage of the total variation observed between the RILs: the percentage of variance explained for the binary presence/absence of metabolite is on the x axis; the percentage of variance for the non-zero quantitative metabolite abundance is on the y axis. The green dots represent MAM mQTLs for glucosinolates; the red dots represent AOP mQTLs for glucosinolates; the blue triangles represent mQTLs for flavonols. (d) Visualization of the metabolic network using Cytoscape. The nodes represent different metabolites and the edges represent significant correlations. Glucosinolates are presented in a different color based on their carbon chain length: gray (3C), red (4C), green (5C) and blue (6C); flavonols are presented in pink.
The mQTLs can underlie binary variation of presence/absence of the metabolite, quantitative variation of metabolite abundance or both types of variation in the segregants (Fig. 3c). For the detected 52 mQTLs, 22 mQTLs only underlie quantitative variation; seven mQTLs predominantly underlie binary variation and the rest underlie both types of variation. For example, two flavonols showed mQTLs at 88.6 cM of chromosome 1 that underlie only quantitative variation, whereas the four other flavonols showed mQTLs at that position that underlie both binary and quantitative variation. Further interpretation of these mQTLs can be obtained from the QTL summary "qtlSumm," loaded by > data(qtlSumm).
A combined analysis of mass data and QTL profiles predicted that a single glucosinolate can have up to six mass peaks (1.2 on average, 6 glucosinolates had 3–6 mass peaks); a single flavonol can have up to four mass peaks.
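A combined analysis of this kind can be sketched by flagging pairs of mass peaks that both fit a simple mass relation and have highly correlated QTL profiles. The helper "find_related_peaks", the mass tolerance, the masses and the correlations below are all hypothetical; the mass relations are the simplified ones used earlier in the text (isotope peaks at m + 1, m + 2, m + 3 and charge peaks at m/2, m/3), and only the cut-off r > 0.95 is taken from the protocol.

```r
# Hypothetical helper, not the MetaNetwork implementation.
find_related_peaks <- function(mass, profile_corr, r_min = 0.95, tol = 0.01) {
  pairs <- t(combn(length(mass), 2))                 # all pairs of peak indices
  heavy <- pmax(mass[pairs[, 1]], mass[pairs[, 2]])
  light <- pmin(mass[pairs[, 1]], mass[pairs[, 2]])

  # Isotope pattern: masses differ by about 1, 2 or 3
  isotope <- sapply(heavy - light, function(d) any(abs(d - 1:3) < tol))
  # Charge pattern: the lighter peak sits at about m/2 or m/3 of the heavier one
  charge <- sapply(seq_along(heavy),
                   function(k) any(abs(light[k] * 2:3 - heavy[k]) < tol))

  r <- profile_corr[pairs]                           # correlation of each pair
  keep <- (isotope | charge) & r > r_min
  data.frame(peak1 = pairs[keep, 1], peak2 = pairs[keep, 2],
             relation = ifelse(isotope[keep], "isotope", "charge"),
             r = r[keep], stringsAsFactors = FALSE)
}

# Invented masses and QTL-profile correlations for four mass peaks
mass <- c(420.05, 421.05, 210.025, 388.10)
profile_corr <- diag(4)
profile_corr[1, 2] <- profile_corr[2, 1] <- 0.99
profile_corr[1, 3] <- profile_corr[3, 1] <- 0.97
profile_corr[2, 3] <- profile_corr[3, 2] <- 0.96

related <- find_related_peaks(mass, profile_corr)
print(related)
```

In this invented example, peaks 2 and 3 have strongly correlated profiles but no fitting mass relation, so they are not reported; both criteria have to hold before two peaks are proposed as belonging to one metabolite.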
Metabolic network (re)-construction
MetaNetwork computes the zero-order correlation "corrZeroOrder" and second-order partial correlation "corrSecondOrder" between pairs of metabolites, loaded by > data(corrZeroOrder) and > data(corrSecondOrder), respectively. Thirty-one second-order correlations were significant at a Bonferroni-corrected α=0.05 level ("corrThres"=0.14 from 20,000 simulations). These significant correlations are plotted using Cytoscape (Fig. 3d). We can observe that glucosinolates and flavonols are separated into two networks because they have different mQTLs.
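The difference between a zero-order and a second-order partial correlation can be illustrated with a generic example. The helper "partial_corr" is hypothetical and uses the standard identity based on the inverse of the correlation sub-matrix; the simulated variables stand in for QTL profiles, and the way MetaNetwork selects the two conditioning metabolites for each pair is not reproduced here.

```r
# Partial correlation of variables i and j given the variables in `given`,
# computed from the inverse of the corresponding correlation sub-matrix.
partial_corr <- function(R, i, j, given) {
  P <- solve(R[c(i, j, given), c(i, j, given)])
  -P[1, 2] / sqrt(P[1, 1] * P[2, 2])
}

# Simulated chain x1 -> x2 -> x3, plus an unrelated variable x4
set.seed(11)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
x3 <- x2 + rnorm(n, sd = 0.5)
x4 <- rnorm(n)
R  <- cor(cbind(x1, x2, x3, x4))

zero_order   <- R[1, 3]                                  # x1 vs x3, ignoring the rest
second_order <- partial_corr(R, 1, 3, given = c(2, 4))   # x1 vs x3 given x2 and x4

cat("zero-order correlation:  ", round(zero_order, 2), "\n")
cat("second-order correlation:", round(second_order, 2), "\n")
```

The indirect association between x1 and x3 is strong at zero order but largely disappears once the intermediate variable is conditioned on, which is the reason for using partial correlations when drawing network edges.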
The similarities between the reconstructed and known glucosinolate pathway validate the approach, and the dissimilarities may suggest (but do not prove) possible previously unknown steps in the formation of glucosinolates. In the constructed network for glucosinolates (left in Fig. 3d), edges for the known transformation between the methylthio group and the methylsulfinyl group were always observed. But novel edges between metabolites were also observed, for example, the edge linking 2-propenyl to 4-methylthiobutyl (but the biochemical linkage may be indirect, that is, due to coregulation by the same mQTL). The reverse additive effect of the AOP locus for 4-hydroxybutyl, 2-propenyl and 4-benzoyloxybutyl formation shows that regulation can be completely different for different growth stages3.
With the exception of one flavonol, all pairwise partial correlations among the other five flavonols remain significant (right in Fig. 3d). Colocation of mQTLs of these six flavonols suggests that the biochemical linkages are indirect, that is, variation in their abundance is attributable to a single locus affecting glycosylation of the basic flavonoid backbone3.
These results show how the combined genetic and metabolomic approach allows the (re)construction of metabolic pathways. It can provide an independent line of evidence to create new knowledge or to validate or modify current knowledge. Even an untargeted approach can therefore facilitate the annotation of metabolites and show that they play a role in existing or new pathways3. Although MetaNetwork can identify meaningful associations between metabolites, it can obviously not prove causality (i.e., that there are true biochemical linkages between highly correlated metabolites). Any output should therefore be treated as an independent source of information solely for the use of hypothesis formation and be used as guidelines for future experimental confirmation.
Although MetaNetwork is developed for and has been applied to metabolite data, its theoretical basis readily extends to other high-throughput quantitative measurements such as gene and protein expression. We expect that MetaNetwork will prove increasingly useful in elucidating systems genetics.
Note: Supplementary information is available via the HTML version of this article.

