Abstract
Understanding the genotype–phenotype map and how variation at different levels of biological organization is associated are central topics in modern biology. Fast developments in sequencing technologies and other molecular omic tools enable researchers to obtain detailed information on variation at DNA level and on intermediate endophenotypes, such as RNA, proteins and metabolites. This can facilitate our understanding of the link between genotypes and molecular and functional organismal phenotypes. Here, we use the Drosophila melanogaster Genetic Reference Panel and nuclear magnetic resonance (NMR) metabolomics to investigate the ability of the metabolome to predict organismal phenotypes. We performed NMR metabolomics on four replicate pools of male flies from each of 170 different isogenic lines. First, our results show that metabolite profiles are variable among the investigated lines and that this variation is highly heritable. Second, we identify genes associated with metabolome variation. Third, using the metabolome gave better prediction accuracies than genomic information for four of five quantitative traits analyzed. Our comprehensive characterization of population-scale diversity of metabolomes and its genetic basis illustrates that metabolites have large potential as predictors of organismal phenotypes. This finding is of great importance, e.g., in human medicine, evolutionary biology and animal and plant breeding.
Introduction
Understanding how information encoded in the DNA is transcribed to RNA, translated to proteins and other downstream endophenotypes, such as metabolites, and how this information dictates the organismal functional phenotype is at the core of several biological disciplines. While the revolutionary work that led to the discovery of this central flow of genetic information within biological systems was published more than 50 years ago (Crick 1970), we are still making progress in understanding the genotype–phenotype map. This is aided by new technologies within molecular and systems biology, allowing researchers to obtain full genome sequences from individuals of any species, detailed information about expression levels of all genes, abundances of proteins and metabolites, etc. These omics tools have provided unforeseen knowledge about the genetic and environmental background of complex traits, which has revolutionized the field of genetics with strong impacts on multiple research disciplines including medicine, animal and plant breeding and evolutionary biology (Dekkers 2012; Elmer 2016; Hasin et al. 2017; Pinu et al. 2019).
One common goal of studies on genotype–phenotype associations is to understand to what extent complex organismal phenotypes, such as behavioural traits, traits linked to reproduction, diseases, yield or the ability to cope with stressful environmental conditions, can be predicted from DNA sequence information or from endophenotypes. If information on endophenotypes, such as transcripts, proteins or metabolites, can accurately predict the phenotype, this has wide-ranging applications across the life sciences, and such prediction has proven useful in several cases (Buckler et al. 2009; Hayes and Goddard 2010; Desta and Ortiz 2014; Hickey et al. 2017; Grinberg et al. 2019). However, studies have also demonstrated that predicting complex phenotypes based on genetic information can often be difficult and the predictive power in such studies is typically low (Schrodi et al. 2014; Märtens et al. 2016; Sun et al. 2016). Reasons for this are many and include (1) that quantitative trait values result from a complex interplay between a large number of genes, each with a small contribution to the phenotype, combined with environmental factors, (2) that, despite progress, the underlying genetic architectures of most traits of medical interest or traits with relevance in agriculture or evolutionary biology are still not well understood, (3) that the effect of so-called candidate genes often depends on the genetic background (epistasis) and that these genes explain only a small proportion of the heritability and (4) that genes and environments interact in their effect on the phenotype. These factors give rise to substantial challenges in constructing and implementing genetic risk prediction models across biological disciplines.
Challenges with using DNA sequence variation to predict variation at the organismal phenotypic level have sparked an interest in using endophenotypes as predictors of complex functional phenotypes (Scoriels et al. 2015; Hayes et al. 2017; te Pas et al. 2017; Van Der Ende et al. 2018). Endophenotypes influence and regulate the functional phenotype and, in contrast to the genotype, which is fixed in an individual’s lifetime, they are governed by interactions between the genome of an individual and internal and external influences that range from the cellular level to the wealth of external biotic and abiotic factors an individual is exposed to in its environment. Thus, endophenotypes have been proposed to constitute proximal links between variation at the genome level and the organismal phenotype, and they may provide more accurate predictors of the functional phenotype than the genotype (te Pas et al. 2017; Zhou et al. 2020).
The predictive value of endophenotypes for organismal functional phenotypes is likely linked to the proximity of the endophenotype to the organismal phenotype (Fiehn 2002; Civelek and Lusis 2014; Hasin et al. 2017; Zampieri and Sauer 2017; Zhou et al. 2020). Since the abundances of metabolites can be seen as an ultimate molecular response of biological systems to genetic or environmental changes, information on the metabolome level may provide more accurate prediction than, e.g., information at the gene expression or protein levels (Xu et al. 2016; Bahado-Singh et al. 2017). This is an emerging research field and results are still scarce. However, there are studies that support the hypothesis that transcriptomic or proteomic data combined with genotype information improve prediction of several traits in, e.g., Drosophila melanogaster and maize (Wang and Marcotte 2010; Harel et al. 2019; Li et al. 2019; Azodi et al. 2020). A recent study investigating 453 metabolites in 40 isogenic lines suggests that the metabolome might constitute a reliable predictor of organismal phenotypes and that the metabolome provides novel insights into the underpinnings of complex traits and their genetic basis (Zhou et al. 2020).
Here we elaborate on the findings of Zhou et al. (2020) by using nuclear magnetic resonance (NMR) metabolomics to obtain information about D. melanogaster metabolomes in pooled samples of whole male flies from 170 inbred lines from the D. melanogaster Genetic Reference Panel (DGRP), a system of fully inbred, sequenced lines of D. melanogaster (Mackay et al. 2012; Huang et al. 2014). NMR metabolomics constitutes a highly reproducible technique that, in contrast to mass spectrometry, allows for metabolic profiling of the total complement of metabolites in a sample (Emwas 2015). With this setup, we first investigated to what extent the metabolome varies across the investigated DGRP lines and whether this variation was heritable. Second, we performed a genome-wide scan to detect DNA sequence variants that were associated with variation in NMR data point intensity. Finally, we investigated to what degree metabolomic data could increase prediction accuracy (PA) of five complex behavioural and stress tolerance phenotypes compared to when predictions were based solely on DNA sequence variation.
Materials and methods
Drosophila melanogaster lines, husbandry and collection
We used 170 inbred lines of the DGRP (Mackay et al. 2012; Huang et al. 2014). The DGRP lines were established by 20 consecutive generations of full sibling inbreeding from isofemales collected at the farmer’s market in Raleigh, NC. Complete genome sequences of the DGRP lines have been obtained using the Illumina platform and are publicly available (Mackay et al. 2012; Huang et al. 2014).
The DGRP lines were maintained on standard Drosophila diet consisting of yeast, sucrose, oatmeal and agar, mixed with tap water. Following autoclaving, nipagin and acetic acid were added to the diet (see Kristensen et al. 2016 for recipe details). Flies were maintained in a 23 °C climate chamber at 50% relative humidity with a 12:12 h light:dark cycle. To generate experimental flies, 20 adult flies (age 3–4 days) from each DGRP line were allowed to reproduce for ~12 h in each of ten replicate vials with 7 mL medium, after which they were discarded. Density of developing larvae was not controlled in the vials. The number of emerging flies was below 80 in all vials and for all lines (data not shown). This level of larval density is considered to be optimal for developing D. melanogaster cultures (Barker and Podger 1970; Lefranc and Bundgaard 2000). The first emerging male flies from these vials were collected when <24 h old. Flies from the ten replicates were pooled and distributed randomly to five new vials with fresh food, with 20 individuals per vial. Flies were kept in these vials for ~72 h and then transferred to Eppendorf tubes and immediately thereafter snap frozen in liquid nitrogen. In cases where we did not get four replicates with 20 males, we supplemented with flies from the extra vial so that we ended up with four replicates with 20 males from each of 170 DGRP lines.
Drosophila melanogaster sample preparation
We prepared four replicates of 20 pooled male flies for NMR spectroscopy using a routinely used protocol (Malmendal et al. 2006; Schou et al. 2017). Males were snap frozen at 3–4 days of age and kept at −80 °C. Samples were mechanically homogenized with a Kinematica, Pt 1200 (Buch & Holm A/S, Herlev, Denmark) in 1 mL of ice-cold acetonitrile (50%) for 45 s. Hereafter samples were centrifuged (10,000g) for 10 min at 4 °C and the supernatant (900 μL) was transferred to new tubes, snap frozen and stored at −80 °C. The supernatant was lyophilized and stored at −80 °C. Immediately before NMR measurements, samples were rehydrated in 200 μL of 50 mM phosphate buffer (pH 7.4) in D2O, and 180 μL was transferred to a 3 mm NMR tube. The buffer contained 50 mg/L of the chemical shift reference 2,2-Dimethyl-2-silapentane-5-sulfonate-d6 sodium salt (DSS), and 50 mg/L of sodium azide to prevent bacterial growth.
NMR experiments and spectral processing
NMR measurements were performed at 25 °C on a Bruker Avance III HD 800 spectrometer (Bruker Biospin, Rheinstetten, Germany), operating at a 1H frequency of 799.87 MHz, and equipped with a 3 mm TCI cold probe. 1H NMR spectra were acquired using a standard NOESYPR1D experiment with a 100 ms delay. A total of 128 transients of 32 K data points spanning a spectral width of 20 ppm were collected.
The spectra were processed using Topspin (Bruker Biospin, Rheinstetten, Germany). An exponential line broadening of 0.3 Hz was applied to the free-induction decay prior to Fourier transformation. All spectra were referenced to the DSS signal at 0 ppm, manually phased and baseline corrected. The spectra were aligned using icoshift (Savorani et al. 2010). The region around the residual water signal (4.87–4.70 ppm) was removed so that the water signal would not interfere with the analysis. The high- and low-field ends of the spectrum, where no signals except the reference signal from DSS appear, were also removed (i.e., leaving data between 9.7 and 0.7 ppm). The spectra were normalized by probabilistic quotient area normalization (Dieterle et al. 2006). To reduce the size of the NMR data, each pair of adjacent NMR data points was averaged, resulting in a final number of 14,440 NMR data points.
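The last two processing steps can be illustrated with a minimal Python sketch. It assumes the commonly described form of probabilistic quotient normalization (total-area scaling, a median reference spectrum, then division by the median quotient) and that the averaged pairs are adjacent, non-overlapping data points. The function names are hypothetical, and this is not the pipeline used in the study.

```python
from statistics import median

def pqn_normalize(spectra):
    """Probabilistic quotient normalization of equal-length spectra (lists of intensities)."""
    # Step 1: scale each spectrum to unit total area.
    scaled = [[x / sum(s) for x in s] for s in spectra]
    # Step 2: reference spectrum = median intensity at each data point.
    n_points = len(scaled[0])
    reference = [median(s[j] for s in scaled) for j in range(n_points)]
    normalized = []
    for s in scaled:
        # Step 3: quotients against the reference (points with zero reference are skipped).
        quotients = [s[j] / reference[j] for j in range(n_points) if reference[j] > 0]
        # Step 4: divide by the most probable (median) quotient.
        q = median(quotients)
        normalized.append([x / q for x in s])
    return normalized

def average_pairs(spectrum):
    """Average each pair of adjacent data points; an unpaired last point is dropped."""
    return [(spectrum[i] + spectrum[i + 1]) / 2 for i in range(0, len(spectrum) - 1, 2)]

# Toy example: the second spectrum is a two-fold dilution of the first.
toy = pqn_normalize([[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 5.0]])
binned = [average_pairs(s) for s in toy]
```

In the toy example the first two spectra become identical after normalization, which is the dilution-correcting behaviour the method is designed for.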
Metabolite assignments were done based on chemical shifts, using earlier assignments and spectral databases previously described (Malmendal et al. 2006; Cui et al. 2008; Ulrich et al. 2008; Schou et al. 2017; Wishart et al. 2018) together with Chenomx NMR Suite (Chenomx Inc.).
Quantitative genetics of NMR intensities
Each NMR data point (14,440 in total) was treated as a quantitative trait. For each NMR data point, we fitted a linear mixed model to partition the total phenotypic variation (i.e., the variation in intensity at one NMR data point) into genetic and environmental variation. Using the R package qgg (Rohde et al. 2020), we fitted the model:
\({\boldsymbol{y}} = {\boldsymbol{Xb}} + {\boldsymbol{Zg}} + {\boldsymbol{e}}\) (1)
where y is a vector containing the replicated measurements of intensities for a particular NMR data point (three to four replicates per DGRP line), X and Z are design matrices linking fixed and random effects to the phenotype, b is a vector of the fixed effects (Wolbachia infection status, and major polymorphic inversions; In2Lt, In2RNS, In2RY1, In2RY2, In3LP, In3LM, In3LY, In3RP, In3RK, In3RMo, In3RC. Information available at http://dgrp2.gnets.ncsu.edu), g is a vector of the random genetic effects defined as \({\boldsymbol{g}}\sim N(0,{\boldsymbol{G}}\sigma _g^2)\) and e is a vector of residual effects defined as \({\boldsymbol{e}}\sim N(0,{\boldsymbol{I}}\sigma _e^2)\). The variance structure among the random effects is modelled as independent for the residual effects (by the identity matrix I), and for the genetic effects by the additive genomic relationship matrix G, which was constructed using all SNPs (minor allele frequency ≥ 0.05) as \({\boldsymbol{G}} = {\boldsymbol{WW}}^\prime /m\), where m is the number of SNPs (i.e., 1,725,755), and W is a centred and scaled genotype matrix, where each column vector is \({\boldsymbol{w}}_i = \frac{{{\boldsymbol{a}}_i - 2p_i}}{{\sqrt {2p_i(1 - p_i)} }}\), \(p_i\) is the allele frequency of the ith SNP, and \({\boldsymbol{a}}_i\) is the ith column vector of the allele count matrix, A, which contains the genotypes coded as 0 or 2 counting the number of the minor allele (genotypes are available at http://dgrp2.gnets.ncsu.edu). The G-matrix was upscaled to a block matrix to match the replicated measurements of the NMR intensities.
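As an illustration of how such a relationship matrix is assembled, the sketch below builds G from a small allele count matrix in plain Python. It assumes the widely used scaling by \(\sqrt{2p_i(1 - p_i)}\) and that every SNP is polymorphic. The function name and the toy genotypes are hypothetical, and the analyses in the study were run with the qgg package rather than with this code.

```python
from math import sqrt

def genomic_relationship(A):
    """Additive genomic relationship matrix G = WW'/m.

    A is a list of lines, each a list of minor-allele counts (0 or 2) per SNP.
    Assumes every SNP is polymorphic, so the scaling factor is never zero.
    """
    n, m = len(A), len(A[0])
    W = [[0.0] * m for _ in range(n)]
    for j in range(m):
        p = sum(A[i][j] for i in range(n)) / (2 * n)  # allele frequency of SNP j
        sd = sqrt(2 * p * (1 - p))                    # assumed scaling factor
        for i in range(n):
            W[i][j] = (A[i][j] - 2 * p) / sd          # centre and scale
    return [[sum(W[i][k] * W[h][k] for k in range(m)) / m for h in range(n)]
            for i in range(n)]

# Toy example: four inbred lines genotyped at three SNPs.
A = [[0, 2, 0], [2, 0, 0], [2, 2, 2], [0, 0, 2]]
G = genomic_relationship(A)
```

Because the columns of W are centred, each row of G sums to zero, and with genotypes coded 0 or 2 the diagonal elements of this toy matrix equal 2, as expected for fully inbred lines under this scaling.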
For each NMR data point we estimated the proportion of phenotypic variation explained by SNP variation as \({\hat h_{{\mathrm{SNP}}}^2} = \frac{{\hat \sigma _g^2}}{{\hat \sigma _g^2 + \hat \sigma _e^2}}\), where \({\hat \sigma} _g^2\) and \({\hat \sigma} _e^2\) are the estimated variance components from Eq. (1). The significance of each \({\hat h}_{{\mathrm{SNP}}}^2\) was determined as \({\hat h}_{{\mathrm{SNP}}}^2 - {\mathrm{SE}}\left( {\hat h_{{\mathrm{SNP}}}^2} \right) \times Z > 0\), where Z is the quantile function of the normal distribution at probability P = 0.05/14,440. Thus, the resulting set of heritability estimates are those estimates that differ significantly from zero when accounting for a total of 14,440 statistical tests.
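The significance rule can be made concrete with a short Python sketch. It assumes that the normal quantile is taken in the upper tail at the Bonferroni-adjusted probability, which is our reading of the criterion. The function names and the example standard error are illustrative only.

```python
from statistics import NormalDist

def h2_snp(var_g, var_e):
    """Proportion of phenotypic variance explained by SNP variation."""
    return var_g / (var_g + var_e)

def is_significant(h2, se, n_tests=14_440, alpha=0.05):
    """True if h2 - SE x Z > 0, with Z the upper-tail standard normal quantile
    at the Bonferroni-adjusted probability alpha / n_tests (assumed one-sided)."""
    z = NormalDist().inv_cdf(1 - alpha / n_tests)
    return h2 - se * z > 0

# Illustrative values only: the same standard error with two different estimates.
strong = is_significant(h2=0.26, se=0.05)
weak = is_significant(h2=0.13, se=0.05)
```

With 14,440 tests the critical value is roughly 4.5 standard errors, so an estimate has to be several times larger than its standard error to be declared significant.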
Mapping of metabolome QTL
Metabolome quantitative trait loci (mQTLs) for mean NMR intensity were identified by linear regression (Huang et al. 2015), using the function for single marker association analysis implemented in the qgg package (Rohde et al. 2020). The estimated genetic effects (\({\hat{\boldsymbol g}}\), from Eq. (1)) were used as line means, since these values represent the within-DGRP-line mean intensity of a single NMR data point adjusted for Wolbachia, chromosomal inversions and polygenicity; these values were then regressed on marker genotypes.
Significant mQTLs were identified as those SNP–NMR associations where the heritability estimate for the NMR data point was significant and further that the P value was below \(\frac{{0.05/1,725,755}}{{14,440 \times 0.39}} = 5.14\, \times \,10^{ - 12}\) to account for both the number of SNP associations performed per NMR data point and the total NMR data points analyzed (i.e., 14,440 × 0.39). Significant mQTLs were annotated to the D. melanogaster genome using FlyBase annotation v.5.57 (flybase.org).
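The arithmetic behind this threshold can be checked directly. The snippet below only restates the numbers given in the text; the variable names are ours.

```python
n_snps = 1_725_755       # SNPs tested per NMR data point
n_points = 14_440        # NMR data points in total
prop_heritable = 0.39    # proportion of data points with significant heritability

# Bonferroni correction over SNPs, then over the heritable NMR data points.
threshold = (0.05 / n_snps) / (n_points * prop_heritable)
print(f"{threshold:.2e}")  # prints 5.14e-12
```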
Phenotypic predictions
Using the best linear unbiased prediction (BLUP) framework of linear mixed models, we fitted phenotypic prediction models based on either genomic information (genomic best linear unbiased prediction, GBLUP) or metabolomic information (metabolomic best linear unbiased prediction, MBLUP) to investigate whether the metabolome provides additional information that increases the accuracy of prediction compared to genomic prediction. The DGRP has been characterized for a wide range of physiological, morphological and behavioural phenotypes (Anholt and Mackay 2018; Mackay and Huang 2018). When the number of individual genotypes is low (i.e., the number of DGRP lines), a large number of within-line replicates are required for accurate predictions (Edwards et al. 2016). Therefore, we restricted the set of quantitative traits to those where we had access to all individual observations, and where the average number of observations within line was >25 (Table 1).
The five test traits were initially adjusted for experimental factors (see references in Table 1), Wolbachia infection status and major polymorphic inversions (In2Lt, In2RNS, In3RP, In3RK, In3RMo). The adjusted phenotypic values were obtained as: \({\tilde{\boldsymbol y}}_l = \hat L_l + {\hat{\boldsymbol e}}_l\), where \(\hat L_l\) is the estimated line effect for DGRP line l (i.e., the BLUP value), and \({\hat{\boldsymbol e}}_l\) is a vector containing the residuals for line l. Thus, \({\boldsymbol{y}}_l\) and \({\tilde{\boldsymbol y}}_l\) have the same dimension. The estimated line effects (\({\hat{\boldsymbol L}}\)) were assumed \({\boldsymbol{L}}\sim N(0,\;{\boldsymbol{I}}\sigma _L^2)\). The metabolome contains aggregated information both on the individual genotypes and environmental exposures. To avoid double counting the genomic variation represented by the SNP information, which would occur if genomic information entered both steps of the GBLUP and MBLUP analyses, we assumed the DGRP lines to be independent in the first step, modelled by the block identity matrix I instead of G.
For each quantitative trait (Table 1), we fitted two models, each for 100 randomly selected training sets (t; the training sets were the same for all prediction models) containing 90% of the DGRP lines:
\({\tilde{\boldsymbol y}}_t = {\boldsymbol{Z}}_t{\boldsymbol{g}}_t + {\boldsymbol{e}}_t\) (2)
\({\tilde{\boldsymbol y}}_t = {\boldsymbol{Z}}_t{\boldsymbol{m}}_t + {\boldsymbol{e}}_t\) (3)
where \({\tilde{\boldsymbol y}}_t\) is the vector of adjusted phenotypic values for the DGRP lines in training set t, \({\boldsymbol{e}}_t\) is a vector of random residuals, and \({\boldsymbol{Z}}_t\) is a design matrix linking the genomic (\({\boldsymbol{g}}_t\)) and metabolomic (\({\boldsymbol{m}}_t\)) effects to the phenotypes. The random genomic effects are \({\boldsymbol{g}}_t\sim N(0,{\boldsymbol{G}}_{[t,t]}\sigma _g^2)\), and the metabolomic effects are \({\boldsymbol{m}}_t\sim N(0,{\boldsymbol{M}}_{[t,t]}\sigma _m^2)\), where G is the additive genomic relationship matrix as specified previously, and M is the metabolomic relationship matrix. The metabolomic relationship matrix was computed as \({\boldsymbol{M}} = {\boldsymbol{QQ}}^\prime /m_{{\mathrm{NMR}}}\), where Q is an \(n \times m_{{\mathrm{NMR}}}\) matrix of adjusted, centred and scaled NMR intensities (\(m_{{\mathrm{NMR}}}\) = 14,440). Each column vector of Q contains the BLUP values from a mixed model where the phenotype was the corresponding NMR intensity, which was adjusted for Wolbachia infection status and major polymorphic inversions (In2Lt, In2RNS, In3RP, In3RK, In3RMo), using a block identity matrix as covariance structure to link the replicated NMR measurements with each DGRP line. This was done to obtain a data structure similar to the genomic data; namely, one value for each combination of DGRP line and NMR data point.
The predicted phenotypes in validation set v (\({\hat{\boldsymbol y}}_v\)) were obtained using Eqs. (4) and (5) for GBLUP and MBLUP, respectively:
The PA was quantified as the mean Pearson’s correlation between observed and predicted phenotypes across training sets, \({\rm{PA}} = \frac{1}{{100}}\mathop {\sum }\nolimits_{t = 1}^{100} {\rm{Cor}}({\tilde{\boldsymbol y}}_v,{\hat{\boldsymbol y}}_v)\). The accuracies of GBLUP and MBLUP were compared using a paired t-test.
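The resampling scheme and the accuracy measure can be sketched as follows. For simplicity the sketch assumes one adjusted phenotypic value per line and treats the fitted model as a hypothetical callable `predict(train, valid)` returning a prediction for each validation line, so it illustrates the bookkeeping only and not the BLUP predictions themselves.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def make_training_sets(lines, n_sets=100, frac=0.9, seed=1):
    """Draw n_sets random training sets, each holding a fraction frac of the lines."""
    rng = random.Random(seed)
    size = round(frac * len(lines))
    return [set(rng.sample(lines, size)) for _ in range(n_sets)]

def prediction_accuracy(observed, predict, lines, training_sets):
    """Mean correlation between observed and predicted values in the validation sets."""
    cors = []
    for train in training_sets:
        valid = [l for l in lines if l not in train]
        predicted = predict(sorted(train), valid)  # hypothetical model interface
        cors.append(pearson([observed[l] for l in valid],
                            [predicted[l] for l in valid]))
    return sum(cors) / len(cors)

# Toy check with 170 lines and a predictor that returns the observed values.
lines = list(range(170))
sets = make_training_sets(lines)
observed = {l: float(l) for l in lines}
pa = prediction_accuracy(observed, lambda train, valid: {l: observed[l] for l in valid},
                         lines, sets)
```

Reusing the same list of training sets for every model, as done here by passing `sets` explicitly, is what makes a paired comparison of accuracies between models meaningful.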
NMR cluster-guided phenotypic predictions
From genomic prediction models, we know that allowing marker effects to come from different distributions, e.g., grouping genetic variants into functional pathways, can increase the PA markedly (Speed and Balding 2014; Edwards et al. 2016; Rohde et al. 2017, 2018; Sørensen et al. 2017). Therefore, we investigated if similar benefits could be achieved by partitioning the metabolome.
Using the Q-matrix (i.e., the \(n \times m_{{\mathrm{NMR}}}\) matrix of adjusted, centred and scaled NMR intensities), we computed all pairwise Pearson’s correlation coefficients and performed hierarchical clustering on the dissimilarity derived from the correlation coefficients using an unweighted pair group method with arithmetic mean agglomeration (Fig. 1). Using a range of total number of clusters Kcl = {25, 50, 75, 100, 125, 200}, we performed metabolomic feature best linear unbiased prediction (MFBLUP), which is an extension of the MBLUP model (Eq. (3)) containing an additional metabolomic effect (Eq. (6), Fig. 1). For each Kcl level, we estimated model parameters for each cluster (for Kcl = 25 we ran 25 models and for Kcl = 50 we ran 50 models) as follows:
\({\tilde{\boldsymbol y}}_t = {\boldsymbol{Z}}_t{\boldsymbol{m}}_t^{K_{cl\_k}} + {\boldsymbol{Z}}_t{\boldsymbol{m}}_t^r + {\boldsymbol{e}}_t\) (6)
where the superscript \(K_{cl\_k}\) indicates the total number of clusters (Kcl) and the cluster number (k). The first metabolomic effect was defined as \({\boldsymbol{m}}_t^{K_{cl\_k}}\sim N(0,{\boldsymbol{M}}_{[t,t]}^{K_{cl\_k}}\sigma _{m^{K_{cl\_k}}}^2)\), where \({\boldsymbol{M}}_{[t,t]}^{K_{cl\_k}}\) corresponds to the metabolomic relationship of the DGRP lines within the training set (t) for the NMR data points within cluster number k among the Kcl clusters. The second metabolomic effect (\({\boldsymbol{m}}_t^r\sim N(0,{\boldsymbol{M}}_{[t,t]}^r\sigma _{m^r}^2)\)) is the random effect based on all NMR data points except those within the \(K_{cl\_k}\) cluster.
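A minimal sketch of how the two relationship matrices of the two-component model could be assembled for one cluster is given below. It assumes that cluster labels are already available for every NMR data point, that columns are standardized with the population standard deviation, and that both matrices take the form QQ′ divided by the number of data points they contain. All function names are hypothetical.

```python
from statistics import mean, pstdev

def standardize_columns(X):
    """Centre and scale each column of X (lines in rows, NMR data points in columns)."""
    cols = list(zip(*X))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) for c in cols]
    return [[(row[j] - mu[j]) / sd[j] for j in range(len(row))] for row in X]

def relationship(Q, columns):
    """Relationship matrix QQ'/m restricted to the given column indices."""
    m = len(columns)
    return [[sum(qi[j] * qh[j] for j in columns) / m for qh in Q] for qi in Q]

def cluster_matrices(Q, labels, k):
    """Relationship matrices for the data points inside cluster k and for the rest."""
    inside = [j for j, lab in enumerate(labels) if lab == k]
    rest = [j for j, lab in enumerate(labels) if lab != k]
    return relationship(Q, inside), relationship(Q, rest)

# Toy example: four lines, four NMR data points, two clusters of two points each.
X = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0], [3.0, 4.0, 1.0, 2.0], [4.0, 3.0, 2.0, 1.0]]
Q = standardize_columns(X)
M_k, M_r = cluster_matrices(Q, labels=[0, 0, 1, 1], k=0)
M_all = relationship(Q, [0, 1, 2, 3])
```

In this construction the full matrix is the size-weighted average of the cluster matrix and the remainder matrix, so the two components together carry the same information as the single-component MBLUP matrix.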
All pairwise correlations among the NMR features were computed, which was used in a hierarchical clustering of NMR data points. The dendrogram was then sequentially cut into K clusters (25, 50, 75, 100, 125, 150 and 200 clusters), and each individual cluster was then used in the MFBLUP model. NMR data points within one cluster were used to construct a metabolomic relationship matrix that was used as covariance matrix in the MFBLUP model. The MFBLUP model was fitted for all clusters within the seven levels of K clusters.
Similar to the MBLUP model, the predicted phenotypes in the validation set v (\({\hat{\boldsymbol y}}_v\)) were obtained as follows:
The PA for each Kcl_k cluster was obtained as \({\rm{PA}} = \frac{1}{{100}}\mathop {\sum }\limits_{t = 1}^{100} {\rm{Cor}}({\tilde{\boldsymbol y}}_v,{\hat{\boldsymbol y}}_v)\) and was compared (using paired t-test corrected for the number of tests performed within each cluster level using a false discovery rate (FDR) of <0.05) within and across Kcl clusters to identify the NMR data points resulting in the largest PA. We only considered clusters to be significant if the FDR was below 0.05, and if the proportion of variance captured by the cluster was larger than 1%, which is computed as \(\left( {\hat \sigma _{m^{K_{cl_k}}}^2/(\hat \sigma _{m^{K_{cl_k}}}^2 + \hat \sigma _{m^r}^2)} \right) > 1\%\).
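The two selection criteria can be combined as in the sketch below. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, and the function names, P values and variance components are purely illustrative.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def significant_clusters(pvalues, var_cluster, var_rest, fdr=0.05, min_prop=0.01):
    """Indices of clusters with FDR below the cut-off that capture more than min_prop
    of the metabolomic variance, var_cluster / (var_cluster + var_rest)."""
    adjusted = benjamini_hochberg(pvalues)
    keep = []
    for k, (q, vk, vr) in enumerate(zip(adjusted, var_cluster, var_rest)):
        if q < fdr and vk / (vk + vr) > min_prop:
            keep.append(k)
    return keep

# Illustrative input: four clusters within one cluster level.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
kept = significant_clusters([0.01, 0.04, 0.03, 0.005],
                            var_cluster=[0.5, 0.001, 0.2, 0.3],
                            var_rest=[1.0, 1.0, 1.0, 1.0])
```

In this toy input the second cluster passes the FDR criterion but captures less than 1% of the variance, so it is excluded.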
Finally, to investigate if we could increase the predictive performance further, we took all the clusters that increased the trait-specific predictive performance (including clusters where the variance captured was below 1%), ranked them by their predictive performance, and ran a new series of prediction models, adding the NMR data points sequentially to the model based on the clusters’ predictive performance (high to low).
Results
The metabolome of D. melanogaster
1H NMR spectroscopy measures the intensity of signals from hydrogens in different chemical environments and can therefore be used to assess which molecules are present in a biological sample. Here we used 1H NMR to quantify the metabolome of male flies from 170 DGRP lines in four biological replicates. For each sample (i.e., one for each DGRP line and replicate), we obtained one NMR spectrum consisting of 14,440 data points, together describing all NMR-visible hydrogens in the sample. For each NMR data point, we estimated the proportion of variation in 1H intensity explained by common genetic variants; i.e., the heritability (\(h^2\), Fig. 2A). In total, 39% of the NMR data points showed a significant heritability estimate (Fig. 2B), of which the average heritability was 0.26 (0.13 across all data points).
Panel (A) shows in solid blue line the average NMR intensity across all DGRP lines (intensity axis not shown) as function of chemical shift. For each NMR data point, we estimated the heritability (h2); the points in grey represent non-significant estimates of h2, and points in green are significant estimates of h2. Panel (B) is a histogram of the significant heritability estimates.
For the NMR data points with a significant heritability estimate, we identified genetic variants associated with NMR 1H intensity, namely, mQTLs. We found a total of 152 genome-wide significant mQTLs (Supplementary Table 1) covering 98 NMR data points and 53 SNPs. The significant mQTLs were distributed across the D. melanogaster genome (Supplementary Fig. S1), and we annotated the significant variants to 56 genes. Of the 56 genes, 20 contained two or more significant mQTLs (Supplementary Table 1); however, in several cases the same SNP was annotated to different genes (five gene sets in total, Supplementary Table 1) because these genomic regions were complex and contained different annotations, so the exact annotation could not be resolved. The genes CCHamide-2 receptor (CCHa2r), sidestep (side), Glutamate receptor IB (GluRIB), Coronin (coro) and CG43373 had the largest number of genome-wide significant associations (between 5 and 74 associations; Supplementary Table 1).
Phenotypic predictions
To test the predictive performance of the metabolome, we obtained phenotypic data from five previously published studies (Table 1); two behavioural traits and three stress resistance traits. We constructed relationship matrices based on genomic and on metabolomic information and performed GBLUP/MBLUP. For each trait we used 90% of the data to estimate the parameters using either the genomic or metabolomic relationship matrices and used the estimated parameters to predict the remaining 10% of the data. This was repeated on 100 random data divisions.
The mean PA for the two behavioural traits, locomotor activity without and with treatment of Ritalin, was below 0.1 when based on genomic information (Fig. 3A, B). Using metabolomic information, the predictive performance increased to above 0.4 (Fig. 3A, B and Supplementary Table S2). By using the D. melanogaster metabolome, we could also increase the predictive accuracy of the two environmental stress resistance traits, chill coma recovery and starvation resistance (Fig. 3D, E and Supplementary Table S2). However, for startle response (a behavioural response to a physical disturbance), prediction using genomic information was superior to prediction using the metabolome (Fig. 3C).
For each panel, the barplot shows the maximum mean prediction accuracies (+standard error) for the different models. GBLUP and MBLUP are based on single component prediction models, whereas the two MFBLUP models are based on two components. The global maximum prediction accuracy obtained across all levels of clusters (Kcl = {25, 50, 75, 100, 125, 200}) is shown in the MFBLUP bar (indicated with white arrow). The prediction accuracy when combining the significant clusters is shown in the MFBLUP2 bar (indicated with white square and circle). Significant improved predictive performance is indicated by asterisks above the bars, see Supplementary Table S3 for all comparisons. The heatmaps on the right side of the panels show all prediction accuracies for the NMR cluster-guided prediction model within Kcl cluster level. The columns correspond to NMR data points (fixed across the Kcl cluster levels) and each cell is one cluster of NMR data points (link between NMR data points and clusters can be found in Supplementary Table S4). The predictive performance of each cluster within Kcl cluster level is indicated with the colour scale. Within cluster level significant prediction accuracies are indicated with white squares, and the cluster with the highest significant prediction accuracy is indicated with asterisk. Across all cluster levels, the highest prediction accuracy is indicated with white arrow (which then corresponds to the orange bars on the left-side panel). The set of significant clusters that give the highest predictive performance is marked with white squares with black circle (corresponds to the light green bars in the barplot).
We computed Pearson’s correlation coefficients among all NMR data points and performed hierarchical clustering (Supplementary Fig. S2). We then ran a two-component NMR cluster-guided prediction model (MFBLUP) where the first component was based on NMR data points within one cluster, and the second component was based on the remaining NMR data points (Fig. 1). We tested all clusters from the hierarchical clustering using different total numbers of clusters; Kcl = {25, 50, 75, 100, 125, 200} (Fig. 1). With this extension of the MBLUP model, we could increase the predictive performance of all five quantitative traits by 17–185% (Fig. 3 and Supplementary Table S2). Interestingly, the largest improvement in PA was obtained at different cluster levels for the five traits (Fig. 3 and Supplementary Fig. S3). For locomotor activity the largest improvement in PA was obtained at cluster level Kcl = 100 (Fig. 3A), with cluster 1 as the only cluster that had significantly increased predictive accuracy (Supplementary Table S3 and Fig. S3). For locomotor activity (Ritalin treatment), startle response and starvation resistance, cluster level Kcl = 200 contained the clusters that gave the highest predictive performance (Fig. 3B–D and Supplementary Fig. S3). For activity (with Ritalin), clusters 121 and 74 (Supplementary Table S3) increased the predictive performance. Combining the two clusters gave a further, non-significant increase in PA of 1% (Figs. 3B and S6). Eight clusters increased the predictive performance (and also captured >1% intensity variance) for startle response (1, 2, 5, 14, 99, 112, 145 and 187; Supplementary Table S3 and Fig. S4), and by combining clusters 1, 2, 5, 112 and 145 we further increased the PA (Figs. 3C and S8 and Supplementary Table S2). Five clusters (21, 65, 83, 95 and 140) increased the accuracy of prediction (and captured >1% intensity variance) for starvation resistance (Supplementary Fig. S4), and the joint effect of clusters 21, 65, 83 and 95 gave a further, non-significant increase in accuracy (Figs. 3D and S9 and Supplementary Table S3). Finally, for chill coma recovery, the maximum PA was obtained at cluster level Kcl = 50 (Fig. 3E), where clusters 5, 27 and 35 significantly explained >1% of the NMR intensity variance (Supplementary Table S3 and Fig. S4). Combining the two clusters with the highest accuracy led to a non-significant increase in accuracy (Figs. 3E and S9 and Supplementary Table S2).
The contributions to the predictive clusters at cluster level Kcl = 200 were mapped on to the NMR spectrum (Fig. 4). Interestingly, none of the clusters contained signals from the highest concentration metabolites. Rather they contain signals from metabolites at lower abundance or very broad signals, suggesting contributions from small metabolites bound to larger molecules or larger molecules themselves. Out of the 18 clusters (Table 2), 7 (1, 2, 5, 21, 32, 45, 65; Fig. 4A) contained signals in a region where aromatic compounds with quaternary nitrogens, such as NAD and nicotinamide ribotide, appear. These clusters showed high PA for locomotor activity, chill coma recovery and startle response. Four clusters (32, 74, 83, 95; Fig. 4A) contained signals in a region where other heterocycles, such as adenosine, appear. Furthermore, six clusters (112, 121, 129, 139, 145, 154; Fig. 4B) contained signals in a region where aromatic groups from amino acids like histidine and tyrosine appear. Most of these are important for locomotor activity with Ritalin and starvation resistance. There is also one cluster (171; Fig. 4C) that contained signals in a region where signals from sugars appear, and another (187; Fig. 4D) that contained signals in a region where amino acid CH2 groups appear. Out of these clusters 1 and 187 clearly contained signals that would usually be identified as baseline, while clusters 2, 5, 32, 112, 129, 139, 145 contained mostly well resolved, though often low intensity, signals. The remaining signals are somewhere in between. Only cluster 32 could be matched to a known metabolite and its signals are assigned as coming from the nicotinamide group of NADP. The other metabolites could not be found in currently available databases with NMR characteristics of metabolites (Cui et al. 2008; Ulrich et al. 2008; Wishart et al. 2018) or in metabolomics studies of D. melanogaster or other model organisms. 
The clusters at cluster level Kcl = 200 were compared with those giving the highest PA for locomotor activity at cluster level Kcl = 100, and chill coma recovery at cluster level Kcl = 50 (Supplementary Fig. S10). For locomotor activity, the larger cluster with the higher prediction accuracy covers a larger stretch of the baseline in the nicotinamide region, indicating that it is optimal to include a larger number of higher molecular weight nicotinamide units (Figs. 3A and S10A, B). For chill coma recovery, the larger cluster covers the broad peak containing the signals from aromatic amino acids in a higher molecular weight context, and there is no smaller cluster that retains significant predictivity once the larger cluster is broken up (Figs. 3E and S10C, D).
Clusters are indicated with coloured dots on the average of all NMR spectra (black line). Panel (A) shows clusters: 1, 2, 5, 21, 32, 45, 65, 74, 83 and 95; panel (B) clusters: 112, 121, 129, 139, 145 and 154; panel (C) cluster: 171; and panel (D) cluster: 187. Selected major metabolites in these regions are identified. The location of the nicotinamide ribotide signals resonating at the highest ppm values is also indicated in panel (A).
Discussion
Prediction of phenotypic trait values using genetic markers has been a central element in plant and livestock breeding for decades (Van Arendonk et al. 1994; Meuwissen et al. 2001; Goddard et al. 2009), and more recently this strategy has emerged within human genetics attempting to accurately predict, e.g., disease risk from DNA information (Hall et al. 2004; Wray et al. 2008, 2019; Schrodi et al. 2014). However, the predictive value from genotyped genetic variants is often low (Schrodi et al. 2014; Patron et al. 2019), and this is problematic when aiming to predict complex phenotypes such as diseases, behaviours or production traits. Therefore, there is a potential to further optimize the applicability of these methods. Here, we used 1H NMR spectroscopy to quantify the metabolomic profiles of groups of male D. melanogaster from 170 completely inbred and genome sequenced lines to investigate if an endophenotype, here the metabolome, has improved predictive power compared with a situation where only genome information was used.
Since the metabolome NMR spectra contain signals from every molecule that contains 1H (and does not exchange with water) there are contributions from many different small molecule metabolites like amino acids and sugars, but also from larger molecules that remain in solution after the acetonitrile/water extraction. The former signals are sharp and often easy to identify, while the latter typically cover very broad areas of the spectra and are only identifiable in terms of the nature of the chemical groups behind the signal. For D. melanogaster it may, however, also be difficult to identify some of the small molecule metabolites since they do not appear in the available databases that typically contain rodent or human metabolites. It is thus not feasible at this point to produce a table with names and concentrations of all metabolites that, e.g., show significant association with the genome. As the main focus in this study was on the predictive ability of the metabolome quantified with 1H NMR, we focused on a few specific metabolites.
We found that the D. melanogaster metabolome was highly variable, with more than 39% of the NMR spectrum having significant heritability estimates (Fig. 2), displaying the same level of genetically determined variability as the D. melanogaster transcriptome (Huang et al. 2015). It has previously been shown that metabolome variation appears to have a genetic signature (Pedersen et al. 2008; Malmendal et al. 2013; Reed et al. 2014; Zhou et al. 2020), but also that the metabolome is highly variable among sexes (Schou et al. 2017; Li et al. 2018; Zhou et al. 2020) and changes with age (Lawton et al. 2008; Yoshida et al. 2010; Sarup et al. 2012; Hoffman et al. 2014). Our findings confirm that variation in metabolite abundance is genetically controlled. The metabolome can therefore be influenced by evolutionary forces like any other phenotypic trait, and this variation can be utilized, e.g., in livestock and plant breeding where specific metabolites may be of interest (Browne and Brindle 2007; Goldansaz et al. 2017; Gamboa-Becerra et al. 2019).
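To make concrete what a heritability estimate for a single NMR data point expresses in a design with replicate pools per inbred line, a minimal sketch is given here. It assumes a balanced one-way ANOVA with method-of-moments variance components, which is not necessarily the estimation procedure used in this study; the function name and the toy intensities are hypothetical.

```python
from statistics import mean


def line_heritability(intensities_by_line):
    """Fraction of intensity variance at one NMR data point found among lines.

    intensities_by_line: one inner list of replicate-pool intensities per
    inbred line. A balanced design (same number of replicates per line)
    is assumed for this illustration.
    """
    n_lines = len(intensities_by_line)
    n_reps = len(intensities_by_line[0])
    if n_lines < 2 or n_reps < 2:
        raise ValueError("need at least two lines and two replicates per line")
    if any(len(reps) != n_reps for reps in intensities_by_line):
        raise ValueError("balanced design required")

    line_means = [mean(reps) for reps in intensities_by_line]
    grand_mean = mean(line_means)

    # One-way ANOVA mean squares among lines and within lines.
    ms_among = n_reps * sum((m - grand_mean) ** 2 for m in line_means) / (n_lines - 1)
    ms_within = sum(
        (x - m) ** 2
        for reps, m in zip(intensities_by_line, line_means)
        for x in reps
    ) / (n_lines * (n_reps - 1))

    # Method-of-moments variance components; negative estimates are set to zero.
    var_line = max((ms_among - ms_within) / n_reps, 0.0)
    total = var_line + ms_within
    return var_line / total if total > 0 else 0.0


# Toy example: two lines with two replicate pools each (made-up numbers).
print(line_heritability([[1.0, 3.0], [5.0, 7.0]]))
```

Applied to every data point in the spectrum, a function of this kind would give one estimate per data point; judging which estimates are significant requires a separate test that is not shown here.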
Metabolome-wide association studies, that is, the mapping of metabolite QTLs (mQTLs), seek the same as genome-wide and transcriptome-wide association studies, namely, to identify genetic variants associated with variation in a functional character or an endophenotype (Holmes et al. 2008; Bictash et al. 2010). Here, we mapped the individual 1H intensities in the NMR spectra with SNP genotypes and identified abundant mQTLs (Supplementary Fig. S1 and Supplementary Table S1). CCHa2R, which encodes a neuropeptide receptor (Hansen et al. 2011), was the gene with the largest number of genome-wide significant associations (74 in total). The human orthologue of CCHa2R is the bombesin receptor subtype 3, which is involved in regulating metabolic rate (Xiao et al. 2017) and glucose metabolism (Feng et al. 2011). This gene associates with signals from tyrosine only (Supplementary Fig. S11D). Recently, CCHa2R was associated with variation in levels of five metabolites, and upregulation of this gene extended the life span of D. melanogaster (Jin et al. 2020). The second most associated gene was sidestep, which controls the migration of motor axons in the developing fly (Siebert et al. 2009). Across several model organisms and humans, sidestep has no apparent orthologous genes (Hu et al. 2011). The glutamate receptor (GluRIB) contained six significant associations, and the variants herein were associated with a set of features close to a signal from the imidazole group in histidine. Coronin, with six associations, is involved in muscle morphogenesis (Schnorrer et al. 2010). The human orthologue of Coronin is coronin 1C, which has previously been associated with lipoprotein and cholesterol levels (Wakil et al. 2016; Siewert and Voight 2018). Accordingly, the NMR data points associated with Coronin include signals corresponding to the choline methyl groups of sn-glycerophosphocholine and methylene and methyl groups from larger molecules such as peptides or fatty acids (Supplementary Fig. S11A).
Another gene with many associated NMR data points (five NMR data points) was CG43373, whose human orthologue, adenylate cyclase 5 (ADCY5), has in multiple studies been associated with type II diabetes mellitus (Mahajan et al. 2014; Qi et al. 2017; Bonàs-Guarch et al. 2018), body mass index (Locke et al. 2015), blood glucose (Manning et al. 2012) and cholesterol levels (Liu et al. 2017; Hoffmann et al. 2018). Both sidestep and CG43373 were associated with the same unidentified signal in a region with signals from hydrogen atoms in the vicinity of hydroxy or carboxy groups (Supplementary Fig. S11B, C). The observation that the genes containing most associations with NMR data points are indeed involved in metabolic processes indicates that the identified mQTLs are biologically relevant and not statistical artefacts. Further studying the regulatory effects of these genes on the metabolome would require functional validation, e.g., using RNA interference (Jin et al. 2020), which is beyond the scope of this study. Furthermore, in the present study, we limited our search for mQTLs to those identifiable using single SNP marginal effects. Gene/pathway-based enrichment analyses are known to increase statistical power and have the potential to improve biological interpretation (Holmans et al. 2009; de Leeuw et al. 2015; Rohde et al. 2016). We demonstrated that, despite using an underpowered statistical mapping approach (a univariate marker model), we were able to identify abundant mQTLs with apparently large effects. Applying a gene-based model would likely have increased the statistical power, enabling identification of additional genes with minor effects. However, as the main focus of the present study was on the predictive performance of the D. melanogaster metabolome, we did not pursue that route.
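The idea of a single SNP marginal effect can be illustrated with a minimal sketch: one NMR data point is regressed on one SNP across lines, ignoring all other variants. This is an illustrative assumption rather than the exact association model of this study, which may have included covariates; the function name, the 0/2 genotype coding and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean


def single_marker_test(genotypes, intensities):
    """Marginal regression of one NMR intensity on one SNP.

    genotypes: allele counts per line (e.g. 0 or 2 for homozygous inbred lines).
    intensities: line-mean intensity at one NMR data point, in the same order.
    Returns (effect, t_statistic) for the regression slope.
    """
    n = len(genotypes)
    if n != len(intensities) or n < 3:
        raise ValueError("need matching inputs from at least three lines")

    g_bar, y_bar = mean(genotypes), mean(intensities)
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    sxy = sum((g - g_bar) * (y - y_bar) for g, y in zip(genotypes, intensities))

    effect = sxy / sxx                      # least-squares slope
    intercept = y_bar - effect * g_bar
    rss = sum((y - intercept - effect * g) ** 2 for g, y in zip(genotypes, intensities))
    std_err = sqrt(rss / (n - 2) / sxx)     # standard error of the slope
    if std_err == 0:
        # Perfect fit: the t-statistic is undefined for a zero slope, infinite otherwise.
        return effect, (0.0 if effect == 0 else float("inf"))
    return effect, effect / std_err


# Toy example with four lines (made-up numbers).
effect, t_stat = single_marker_test([0, 0, 2, 2], [1.0, 2.0, 3.0, 4.0])
print(effect, t_stat)
```

In a genome-wide scan this test would be repeated for every SNP and every NMR data point, with a threshold that accounts for the number of tests. Because each SNP is tested on its own, variants with small effects are easily missed, which is the sense in which such an approach is underpowered compared with gene-based models.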
Accurate phenotypic prediction of any trait requires large sample sizes to reliably estimate model parameters, and some measure that can describe the covariance structure among individuals, such as genetic variants or other molecular variation. Although the DGRP system appears to lack power for mapping and prediction studies because of the limited number of inbred lines, it gains statistical power because it is possible to obtain repeated measures on a large number of individuals from lines that are homozygous genome-wide, resulting in very accurate within-DGRP-line measures (Mackay and Huang 2018).
The main aim of the current study was to investigate the predictive power of the D. melanogaster metabolome compared with prediction models using genomic data. Using genomic information generally resulted in low predictive abilities, and for chill coma recovery time it was even negative (Fig. 3). Despite similar heritability estimates for the five quantitative traits investigated (Table 2), it was not surprising that we observed poor predictive ability for chill coma recovery time, as this has been found multiple times (Ober et al. 2012, 2015; Edwards et al. 2016; Sørensen et al. 2017). It has been suggested that the lack of predictive ability for chill coma recovery might be due to more pronounced non-additive effects, like epistasis, for this trait (Ober et al. 2015; Morgante et al. 2018). We showed that for four out of five quantitative traits, using the D. melanogaster metabolome for phenotypic predictions significantly improved the accuracy of prediction compared to using the D. melanogaster genome (Fig. 3). The extent to which prediction was increased varied across traits, but the trend is clear across all five traits: partitioning the metabolome by highly correlated NMR data points increased the predictive performance (Fig. 3). Even for chill coma recovery time, we observed a great improvement in predictive ability (Fig. 3), which further supports the idea that this particular trait could be under the influence of non-additive effects, as these are inherent in the metabolome. Partitioning the metabolomic variation using the cluster-guided approach further increased the predictive performance (Fig. 3). This is expected because partitioning the genome by functional categories has also been shown to increase the predictive performance (Speed and Balding 2014; Edwards et al. 2016; Fang et al. 2017; Rohde et al. 2018, 2019).
Here we did not partition the genome as our main contrast of interest was to compare predictions based on the whole genome and the whole metabolome. Comparing two partitioned prediction models would be a challenging comparison because the two partitionings would be different; i.e., one based on functional genomic regions and one based on metabolomic signatures.
Comparing the significant mQTLs with the NMR data points within the trait-specific predictive clusters, we see an overlap for startle response, starvation resistance and chill coma recovery time (Supplementary Table S5). However, none of the associated mQTL genes common across traits (Supplementary Table S6) were previously found associated with startle response, starvation resistance or chill coma recovery (Mackay et al. 2012). The most likely explanation for this result is that GWAS capture genetic variants with strong effect sizes, whereas in prediction models variants or metabolomic intensities with small effect sizes are also detectable. Similarly, we compared the NMR data points within the predictive clusters across the five functional phenotypes (Supplementary Table S5) and found commonalities between activity (control treatment) and startle response; activity (Ritalin treatment) and chill coma recovery; startle response and chill coma recovery; and starvation resistance and chill coma recovery. Out of these overlapping regions, the first one contained a very broad signal from what seems to be a nicotinamide bound to or in a large molecule. The others are small unidentified metabolites in this region or in the region where the aromatic groups from amino acids appear.
Recently, Zhou et al. (2020) measured variation in 453 metabolites using 40 DGRP lines, and concluded that the accuracy of metabolome predictions could be improved with a larger sample size. This is exactly what we have shown in this study, namely that the metabolome can increase the accuracy of phenotypic prediction. The increased power relative to Zhou et al. (2020) may originate not only from the larger number of DGRP lines, but also from the higher reproducibility that 1H NMR metabolomics provides compared with mass spectrometric methods. It should be stressed, however, that while NMR has a much higher reproducibility, mass spectrometry is more sensitive and better at resolving which signals belong to which metabolite, but these features seem to be of less importance when it comes to predictive power.
Our results clearly demonstrate the added value of performing predictions of functional phenotypes using NMR metabolomics compared with SNP genotypes. These findings, together with others, open the doors for applying metabolomics in different disciplines, for example, in the human health sector or in animal breeding. Metabolites can be easily quantified in biofluids from livestock, human blood donors or patients that have blood samples taken on a regular basis. This entails a clear advantage over other methods in terms of translatability (Fontanesi 2016). Recently, studies have shown that the metabolomic signatures of blood from cattle (Novais et al. 2019) and pigs (Carmelo et al. 2020) can be used to accurately separate animals by their feed efficiency, which is one of the most economically important traits in livestock production. Metabolomics also has great potential for prediction of traits that are difficult or expensive to measure (Riedelsheimer et al. 2012; Rohart et al. 2012; Xu et al. 2016; Hayes et al. 2017; Gemmer et al. 2020). The number of studies of the human metabolome and its relevance to human health has also clearly increased over the last 5 years (Rangel-Huerta et al. 2019; Zhang et al. 2020), and metabolomics is very likely to be one of the cornerstones in an implementation of personalized medicine. With the emergence of large human cohort projects, like the UK Biobank (Bycroft et al. 2018), FinnGen (Mars et al. 2020) and BioBank Japan (Nagai et al. 2017), sample sizes are reaching a level that would enable ground-breaking research if full metabolomic profiles were obtained. For example, it has recently been shown that the 5–10-year all-cause mortality was better predicted from NMR spectra of plasma and serum samples than using traditional clinical risk factors (Deelen et al. 2019).
The functional phenotypes investigated here were not obtained on the same flies that were used for metabolomics. Thus, the only thing that links the phenotypes to the metabotype is the common genotype of the DGRP lines. This would imply that the higher predictive power of the metabolome comes from the fact that the metabolome has a more linear relationship to the phenotype than the genome does, which makes good sense in an organism with several layers of mechanisms striving to maintain homoeostasis (Chan et al. 2010). Furthermore, predictions based on the BLUP framework rely heavily on the estimated relationship among samples/individuals. The metabolomic relationship is a condensation of all the “causal genetic effects” that actually have an effect on the phenotype, whereas the genomic relationship matrix mostly describes the relationship among samples. Thus, despite the fact that the functional phenotypes and the metabolome were not measured on the same flies from the same environments, we do see improved predictive performance, likely because the metabolomic relationship is closer to the true causal genetic effects than the estimated genomic relationship matrix, as also recently observed by Harrison et al. (2020). An interesting example of the close relation between the metabolome and the phenotype in insects is the very high correlation (R2 = 0.93) between the haemolymph metabolome and cold tolerance in different Drosophila species (Olsson et al. 2016).
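The role of the relationship matrix in BLUP-type prediction can be sketched as follows. The same predictor is used whether the relationship among lines is built from SNP genotypes or from NMR intensities; only the feature matrix changes. This is a minimal sketch under stated assumptions, not the analysis pipeline of this study: the variance ratio `lam` is fixed by hand rather than estimated, and all function names and toy values are hypothetical.

```python
from statistics import mean, pstdev


def relationship_matrix(features):
    """Similarity among lines from column-standardized features (rows = lines)."""
    n, p = len(features), len(features[0])
    cols = list(zip(*features))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    z = [[(row[j] - mus[j]) / sds[j] for j in range(p)] for row in features]
    return [[sum(a * b for a, b in zip(z[i], z[k])) / p for k in range(n)]
            for i in range(n)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def blup_predict(k, y, train, test, lam=1.0):
    """Predict phenotypes of `test` lines from `train` lines.

    k: relationship matrix among all lines; y: phenotypes (test entries unused).
    lam: assumed ratio of residual to line variance (hypothetical, not estimated).
    """
    mu = mean(y[i] for i in train)
    a = [[k[i][j] + (lam if i == j else 0.0) for j in train] for i in train]
    weights = solve(a, [y[i] - mu for i in train])
    return [mu + sum(k[t][i] * w for i, w in zip(train, weights)) for t in test]


# Toy example: four lines, two features; lines 0/1 and 2/3 are identical pairs.
k = relationship_matrix([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(blup_predict(k, [2.0, None, 0.0, None], train=[0, 2], test=[1, 3]))
```

In the toy example each test line is pulled towards the phenotype of the training line it resembles, shrunk towards the training mean. Predictive ability would then be summarized as the correlation between predicted and observed phenotypes across held-out lines, so a relationship matrix that better reflects the causal similarity among lines gives better predictions.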
The fact that many of the NMR signals with high predictivity come from unidentified and/or larger metabolites of lower concentration shows that, despite the inherently low sensitivity of NMR, the high reproducibility of the method allows for a high accumulated sensitivity to be obtained for the combined data set. An interesting example here is the cluster with the highest PA for chill coma recovery, which covers the entire aromatic amino acid region. It also shows the need for a deeper investigation of the NMR detectable D. melanogaster metabolome and stresses its complementarity to mass spectrometry. From a metabolic signalling perspective, it also makes sense that it is the less abundant metabolites with aromatic groups that are important in these processes. In this context, it should be made clear that when using NMR metabolomics, information about the entire spectrum is obtained in one go, so there is no reason to choose beforehand which region to use for predictive purposes. It should also be stressed that in order to make predictions, it is not necessary to identify the molecules behind the NMR signals.
In conclusion, we have shown that metabolomic approaches have large potential for predicting functional phenotypes. The generality and repeatability of these findings should be verified in different genetic backgrounds, in non-model species and in samples that are easy to generate from livestock, humans and plants. However, our main findings, namely, that metabolite profiles are highly heritable, that specific genes are associated with metabolome variation and that the metabolome predicts phenotypes more accurately than genomic data, are likely to be robust.
Data availability
The DGRP genotypes, chromosomal inversions, Wolbachia infection status, and the phenotypic values for startle response, starvation resistance and chill coma recovery can be obtained from http://dgrp2.gnets.ncsu.edu/, and the locomotor activity measurements can be obtained from the original publication (Rohde et al. 2019). The raw metabolomic data can be obtained from MetaboLights under accession number MTBLS2060.
References
Anholt RRH, Mackay TFC (2018) The road less traveled: from genotype to phenotype in flies and humans. Mamm Genome 29:5–23
Azodi CB, Pardo J, VanBuren R, de Los Campos G, Shiu SH (2020) Transcriptome-based prediction of complex traits in maize. Plant Cell 32:139–151
Bahado-Singh R, Poon LC, Yilmaz A, Syngelaki A, Turkoglu O, Kumar P et al. (2017) Integrated proteomic and metabolomic prediction of term preeclampsia. Sci Rep 7:1–10
Barker JSF, Podger RN (1970) Interspecific competition between Drosophila melanogaster and Drosophila simulans: Effects of larval density on viability, developmental period and adult body weight. Ecology 51:170–189
Bictash M, Ebbels TM, Chan Q, Loo RL, Yap IKS, Brown IJ et al. (2010) Opening up the ‘black box’: metabolic phenotyping and metabolome-wide association studies in epidemiology. J Clin Epidemiol 63:970–979
Bonàs-Guarch S, Guindo-Martínez M, Miguel-Escalada I, Grarup N, Sebastian D, Rodriguez-Fos E et al. (2018) Re-analysis of public genetic data reveals a rare X-chromosomal variant associated with type 2 diabetes. Nat Commun 9:1–14
Browne RA, Brindle KM (2007) 1H NMR-based metabolite profiling as a potential selection tool for breeding passive resistance against Fusarium head blight (FHB) in wheat. Mol Plant Pathol 8:401–410
Buckler ES, Holland JB, Bradbury PJ, Acharya CB, Brown PJ, Browne C et al. (2009) The genetic architecture of maize flowering time. Science 325:714–718
Bycroft C, Elliott LT, Young A, Vukcevic D, Effingham M, Marchini J et al. (2018) The UK Biobank resource with deep phenotyping and genomic data. Nature 562:203–209
Carmelo VAO, Banerjee P, da Silva Diniz WJ, Kadarmideen HN (2020) Metabolomic networks and pathways associated with feed efficiency and related-traits in Duroc and Landrace pigs. Sci Rep 10:1–14
Chan EKF, Rowe HC, Hansen BG, Kliebenstein DJ (2010) The complex genetic architecture of the metabolome. PLoS Genet 6:e1001198
Civelek M, Lusis AJ (2014) Systems genetics approaches to understand complex traits. Nat Rev Genet 15:34–48
Crick F (1970) Central dogma of molecular biology. Nature 227:561–563
Cui Q, Lewis IA, Hegeman AD, Anderson ME, Li J, Schulte CF et al. (2008) Metabolite identification via the Madison Metabolomics Consortium Database. Nat Biotechnol 26:162–164
Deelen J, Kettunen J, Fischer K, van der Spek A, Trompet S, Kastenmüller G et al. (2019) A metabolic profile of all-cause mortality risk identified in an observational study of 44,168 individuals. Nat Commun 10:1–8
Dekkers JCM (2012) Application of genomics tools to animal breeding. Curr Genomics 13:207–212
de Leeuw CA, Mooij JM, Heskes T, Posthuma D (2015) MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput Biol 11:1–19
Desta ZA, Ortiz R (2014) Genomic selection: genome-wide prediction in plant improvement. Trends Plant Sci 19:592–601
Dieterle F, Ross A, Schlotterbeck G, Senn H (2006) Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal Chem 78:4281–4290
Edwards SM, Sørensen IF, Sarup P, Mackay TFC, Sørensen P (2016) Genomic prediction for quantitative traits is improved by mapping variants to gene ontology categories in Drosophila melanogaster. Genetics 203:1871–1883
Elmer KR (2016) Genomic tools for new insights to variation, adaptation, and evolution in the salmonid fishes: a perspective for charr. Hydrobiologia 783:191–208
Emwas A-HM (2015) The strengths and weaknesses of NMR spectroscopy and mass spectrometry with particular focus on metabolomics research. In: Bjerrum JT (ed) Metabonomics: methods and protocols. Springer New York, New York, NY, p 161–193
Fang L, Sahana G, Ma P, Su G, Yu Y, Zhang S et al. (2017) Use of biological priors enhances understanding of genetic architecture and genomic prediction of complex traits within and between dairy cattle breeds. BMC Genomics 18:604
Feng Y, Guan XM, Li J, Metzger JM, Zhu Y, Juhl K et al. (2011) Bombesin receptor subtype-3 (BRS-3) regulates glucose-stimulated insulin secretion in pancreatic islets across multiple species. Endocrinology 152:4106–4115
Fiehn O (2002) Metabolomics–the link between genotypes and phenotypes. Plant Mol Biol 48:155–171
Fontanesi L (2016) Metabolomics and livestock genomics: insights into a phenotyping frontier and its applications in animal breeding. Anim Front 6:73–79
Gamboa-Becerra R, Hernández-Hernández MC, González-Ríos Ó, Suárez-Quiroz ML, Gálvez-Ponce E, Ordaz-Ortiz JJ et al. (2019) Metabolomic markers for the early selection of Coffea canephora plants with desirable cup quality traits. Metabolites 9:214
Gemmer MR, Richter C, Jiang Y, Schmutzer T, Raorane ML, Junker B et al. (2020) Can metabolic prediction be an alternative to genomic prediction in barley? PLoS ONE 15:1–15
Goddard ME, Wray NR, Verbyla K, Visscher PM (2009) Estimating effects and making predictions from genome-wide marker data. Stat Sci 24:517–529
Goldansaz SA, Guo AC, Sajed T, Steele MA, Plastow GS, Wishart DS (2017) Livestock metabolomics and the livestock metabolome: a systematic review. PLoS ONE 12:1–26
Grinberg NF, Orhobor OI, King RD (2019) An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat. Mach Learn 109:251–277
Hall W, Morley K, Lucke J (2004) The prediction of disease risk in genomic medicine: Scientific prospects and implications for public policy and ethics. EMBO Rep 5:S22–S26
Hansen KK, Hauser F, Williamson M, Weber SB, Grimmelikhuijzen CJP (2011) The Drosophila genes CG14593 and CG30106 code for G-protein-coupled receptors specifically activated by the neuropeptides CCHamide-1 and CCHamide-2. Biochem Biophys Res Commun 404:184–189
Harel T, Peshes-Yaloz N, Bacharach E, Gat-Viks I (2019) Predicting phenotypic diversity from molecular and genetic data. Genetics 213:297–311
Harrison BR, Wang L, Gajda E, Hoffman EV, Chung BY, Pletcher SD et al. (2020) The metabolome as a link in the genotype-phenotype map for peroxide resistance in the fruit fly, Drosophila melanogaster. BMC Genomics 21:1–22
Hasin Y, Seldin M, Lusis A (2017) Multi-omics approaches to disease. Genome Biol 18:1–15
Hayes B, Goddard M (2010) Genome-wide association and genomic selection in animal breeding. Genome 53:876–883
Hayes BJ, Panozzo J, Walker CK, Choy AL, Kant S, Wong D et al. (2017) Accelerating wheat breeding for end-use quality with multi-trait genomic predictions incorporating near infrared and nuclear magnetic resonance-derived phenotypes. Theor Appl Genet 130:2505–2519
Hickey JM, Chiurugwi T, Mackay I, Powell W (2017) Genomic prediction unifies animal and plant breeding programs to form platforms for biological discovery. Nat Genet 49:1297–1303
Hoffman JM, Soltow QA, Li S, Sidik A, Jones DP, Promislow DEL (2014) Effects of age, sex, and genotype on high-sensitivity metabolomic profiles in the fruit fly, Drosophila melanogaster. Aging Cell 13:596–604
Hoffmann TJ, Theusch E, Haldar T, Ranatunga DK, Jorgenson E, Medina MW et al. (2018) A large electronic-health-record-based genome-wide study of serum lipids. Nat Genet 50:401–413
Holmans P, Green EK, Pahwa JS, Ferreira MAR, Purcell SM, Sklar P et al. (2009) Gene ontology analysis of GWA study data sets provides insights into the biology of bipolar disorder. Am J Hum Genet 85:13–24
Holmes E, Loo RL, Stamler J, Bictash M, Yap IKS, Chan Q et al. (2008) Human metabolic phenotype diversity and its association with diet and blood pressure. Nature 453:396–400
Hu Y, Flockhart I, Vinayagam A, Bergwitz C, Berger B, Perrimon N et al. (2011) An integrative approach to ortholog prediction for disease-focused and other functional studies. BMC Bioinforma 12:357
Huang W, Carbone MA, Magwire MM, Peiffer JA, Lyman RF, Stone EA et al. (2015) Genetic basis of transcriptome diversity in Drosophila melanogaster. Proc Natl Acad Sci 112:6010–6019
Huang W, Massouras A, Inoue Y, Peiffer J, Ràmia M, Tarone AM et al. (2014) Natural variation in genome architecture among 205 Drosophila melanogaster Genetic Reference Panel lines. Genome Res 24:1193–1208
Jin K, Wilson KA, Beck JN, Nelson CS, Brownridge GW, Harrison BR et al. (2020) Genetic and metabolomic architecture of variation in diet restriction-mediated lifespan extension in Drosophila. PLoS Genet 16:e1008835
Kristensen TN, Henningsen AK, Aastrup C, Bech-Hansen M, Bjerre LBH, Carlsen B et al. (2016) Fitness components of Drosophila melanogaster developed on a standard laboratory diet or a typical natural food source. Insect Sci 23:771–779
Lawton KA, Berger A, Mitchell M, Milgram KE, Evans AM, Guo L et al. (2008) Analysis of the adult human plasma metabolome. Pharmacogenomics 9:383–397
Lefranc A, Bundgaard J (2000) Controlled variation of body size by larval crowding in Drosophila melanogaster. Drosoph Inf Serv 83:171–174
Li Z, Gao N, Martini JWR, Simianer H (2019) Integrating gene expression data into genomic prediction. Front Genet 10:1–11
Li Z, Zhang Y, Hu T, Likhodii S, Sun G, Zhai G et al. (2018) Differential metabolomics analysis allows characterization of diversity of metabolite networks between males and females. PLoS ONE 13:1–10
Liu DJ, Peloso GM, Yu H, Butterworth AS, Wang X, Mahajan A et al. (2017) Exome-wide association study of plasma lipids in >300,000 individuals. Nat Genet 49:1758–1766
Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR et al. (2015) Genetic studies of body mass index yield new insights for obesity biology. Nature 518:197–206
Mackay TFC, Huang W (2018) Charting the genotype–phenotype map: lessons from the Drosophila melanogaster Genetic Reference Panel. Wiley Interdiscip Rev Dev Biol 7:1–18
Mackay TFC, Richards S, Stone EA, Barbadilla A, Ayroles JF, Zhu DH et al. (2012) The Drosophila melanogaster Genetic Reference Panel. Nature 482:173–178
Mahajan A, Go MJ, Zhang W, Below JE, Gaulton KJ, Ferreira T et al. (2014) Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat Genet 46:234–244
Malmendal A, Overgaard J, Bundy JG, Sørensen JG, Nielsen NC, Loeschcke V et al. (2006) Metabolomic profiling of heat stress: hardening and recovery of homeostasis in Drosophila. Am J Physiol Regul Integr Comp Physiol 291:205–212
Malmendal A, Sørensen JG, Overgaard J, Holmstrup M, Nielsen NC, Loeschcke V (2013) Metabolomic analysis of the selection response of Drosophila melanogaster to environmental stress: are there links to gene expression and phenotypic traits? Naturwissenschaften 100:417–427
Manning AK, Hivert MF, Scott RA, Grimsby JL, Bouatia-Naji N, Chen H et al. (2012) A genome-wide approach accounting for body mass index identifies genetic variants influencing fasting glycemic traits and insulin resistance. Nat Genet 44:659–669
Mars N, Koskela JT, Ripatti P, Kiiskinen TTJ, Havulinna AS, Lindbohm JV et al. (2020) Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat Med 26:549–557
Märtens K, Hallin J, Warringer J, Liti G, Parts L (2016) Predicting quantitative traits from genome and phenome with near perfect accuracy. Nat Commun 7:1–8
Meuwissen THE, Hayes BJ, Goddard ME (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819–1829
Morgante F, Huang W, Maltecca C, Mackay TFC (2018) Effect of genetic architecture on the prediction accuracy of quantitative traits in samples of unrelated individuals. Heredity 120:500–514
Nagai A, Hirata M, Kamatani Y, Muto K, Matsuda K, Kiyohara Y et al. (2017) Overview of the BioBank Japan Project: study design and profile. J Epidemiol 27:S2–S8
Novais FJ, Pires PRL, Alexandre PA, Dromms RA, Iglesias AH, Ferraz JBS et al. (2019) Identification of a metabolomic signature associated with feed efficiency in beef cattle. BMC Genomics 20:1–10
Ober U, Ayroles JF, Stone EA, Richards S, Zhu D, Gibbs RA et al. (2012) Using whole-genome sequence data to predict quantitative trait phenotypes in Drosophila melanogaster. PLoS Genet 8:e1002685
Ober U, Huang W, Magwire M, Schlather M, Simianer H, Mackay TFC (2015) Accounting for genetic architecture improves sequence based genomic prediction for a Drosophila fitness trait. PLoS ONE 10:1–17
Olsson T, MacMillan HA, Nyberg N, Staerk D, Malmendal A, Overgaard J (2016) Hemolymph metabolites and osmolality are tightly linked to cold tolerance of Drosophila species: a comparative study. J Exp Biol 219:2504–2513
Patron J, Serra-Cayuela A, Han B, Li C, Wishart DS (2019) Assessing the performance of genome-wide association studies for predicting disease risk. PLoS ONE 14:1–24
Pedersen KS, Kristensen TN, Loeschcke V, Petersen BO, Duus J, Nielsen NC et al. (2008) Metabolomic signatures of inbreeding at benign and stressful temperatures in Drosophila melanogaster. Genetics 180:1233–1243
Pinu FR, Beale DJ, Paten AM, Kouremenos K, Swarup S, Schirra HJ et al. (2019) Systems biology and multi-omics integration: viewpoints from the metabolomics research community. Metabolites 9:1–31
Qi Q, Stilp AM, Sofer T, Moon JY, Hidalgo B, Szpiro AA et al. (2017) Genetics of type 2 diabetes in U.S. Hispanic/Latino individuals: results from the Hispanic Community Health Study/Study of Latinos (HCHS/SOL). Diabetes 66:1419–1425
Rangel-Huerta OD, Pastor-Villaescusa B, Gil A (2019) Are we close to defining a metabolomic signature of human obesity? A systematic review of metabolomics studies. Springer, USA
Reed LK, Lee K, Zhang Z, Rashid L, Poe A, Hsieh B et al. (2014) Systems genomics of metabolic phenotypes in wild-type Drosophila melanogaster. Genetics 197:781–783
Riedelsheimer C, Czedik-Eysenberg A, Grieder C, Lisec J, Technow F, Sulpice R et al. (2012) Genomic and metabolic prediction of complex heterotic traits in hybrid maize. Nat Genet 44:217–220
Rohart F, Paris A, Laurent B, Canlet C, Molina J, Mercat MJ et al. (2012) Phenotypic prediction based on metabolomic data for growing pigs from three main European breeds. J Anim Sci 90:4729–4740
Rohde PD, Demontis D, Cuyabano BCD, Børglum AD, Sørensen P (2016) Covariance association test (CVAT) identifies genetic markers associated with schizophrenia in functionally associated biological processes. Genetics 203:1901–1913
Rohde PD, Fourie Sørensen I, Sørensen P (2020) qgg: an R package for large-scale quantitative genetic analyses. Bioinformatics 36:2614–2615
Rohde PD, Gaertner B, Ward K, Sørensen P, Mackay TFC (2017) Genomic analysis of genotype-by-social environment interaction for Drosophila melanogaster aggressive behavior. Genetics 206:1969–1984
Rohde PD, Jensen IR, Sarup PM, Ørsted M, Demontis D, Sørensen P et al. (2019) Genetic signatures of drug response variability in Drosophila melanogaster. Genetics 213:633–650
Rohde PD, Østergaard S, Kristensen TN, Sørensen P, Loeschcke V, Mackay TFC et al. (2018) Functional validation of candidate genes detected by genomic feature models. G3 Genes Genomes Genet 8:1659–1668
Sarup P, Pedersen SMM, Nielsen NC, Malmendal A, Loeschcke V (2012) The metabolic profile of long-lived Drosophila melanogaster. PLoS ONE 7:e47461
Savorani F, Tomasi G, Engelsen SB (2010) icoshift: a versatile tool for the rapid alignment of 1D NMR spectra. J Magn Reson 202:190–202
Schnorrer F, Schönbauer C, Langer CCH, Dietzl G, Novatchkova M, Schernhuber K et al. (2010) Systematic genetic analysis of muscle morphogenesis and function in Drosophila. Nature 464:287–291
Schou MF, Kristensen TN, Pedersen A, Göran Karlsson B, Loeschcke V, Malmendal A (2017) Metabolic and functional characterization of effects of developmental temperature in Drosophila melanogaster. Am J Physiol Regul Integr Comp Physiol 312:R211–R222
Schrodi SJ, Mukherjee S, Shan Y, Tromp G, Sninsky JJ, Callear AP et al. (2014) Genetic-based prediction of disease traits: prediction is very difficult, especially about the future. Front Genet 5:1–18
Scoriels L, Salek RM, Goodby E, Grainger D, Dean AM, West JA et al. (2015) Behavioural and molecular endophenotypes in psychotic disorders reveal heritable abnormalities in glutamatergic neurotransmission. Transl Psychiatry 5:e540
Siebert M, Banovic D, Goellner B, Aberle H (2009) Drosophila motor axons recognize and follow a Sidestep-labeled substrate pathway to reach their target fields. Genes Dev 23:1052–1062
Siewert KM, Voight BF (2018) Bivariate genome-wide association scan identifies 6 novel loci associated with lipid levels and coronary artery disease. Circ Genom Precis Med 11:e002239
Sørensen IF, Edwards SM, Rohde PD, Sørensen P (2017) Multiple trait covariance association test identifies gene ontology categories associated with chill coma recovery time in Drosophila melanogaster. Sci Rep 7:2413
Speed D, Balding DJ (2014) MultiBLUP: improved SNP-based prediction for complex traits. Genome Res 24:1550–1557
Sun S, Yang F, Tan G, Costanzo M, Oughtred R, Hirschman J et al. (2016) An extended set of yeast-based functional assays accurately identifies human disease mutations. Genome Res 26:670–680
te Pas MFW, Madsen O, Calus MPL, Smits MA (2017) The importance of endophenotypes to evaluate the relationship between genotype and external phenotype. Int J Mol Sci 18:1–22
Ulrich EL, Akutsu H, Doreleijers JF, Harano Y, Ioannidis YE, Lin J et al. (2008) BioMagResBank. Nucleic Acids Res 36:402–408
Van Arendonk JAM, Tier B, Kinghorn BP (1994) Use of multiple genetic markers in prediction of breeding values. Genetics 137:319–329
Van Der Ende MY, Said MA, Van Veldhuisen DJ, Verweij N, Van Der Harst P (2018) Genome-wide studies of heart failure and endophenotypes: lessons learned and future directions. Cardiovasc Res 114:1209–1225
Wakil SM, Ram R, Muiya NP, Andres E, Mazhar N, Hagos S et al. (2016) A common variant association study reveals novel susceptibility loci for low HDL-cholesterol levels in ethnic Arabs. Clin Genet 90:518–525
Wang PI, Marcotte EM (2010) It’s the machine that matters: predicting gene function and phenotype from protein networks. J Proteom 73:2277–2289
Wishart DS, Feunang YD, Marcu A, Guo AC, Liang K, Vázquez-Fresno R et al. (2018) HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res 46:D608–D617
Wray NR, Goddard ME, Visscher PM (2008) Prediction of individual genetic risk of complex disease. Curr Opin Genet Dev 18:257–263
Wray NR, Kemper KE, Hayes BJ, Goddard ME, Visscher PM (2019) Complex trait prediction from genome data: contrasting EBV in livestock to PRS in humans. Genetics 211:1131–1141
Xiao C, Piñol RA, Carlin JL, Li C, Deng C, Gavrilova O et al. (2017) Bombesin-like receptor 3 (Brs3) expression in glutamatergic, but not GABAergic, neurons is required for regulation of energy metabolism. Mol Metab 6:1540–1550
Xu S, Xu Y, Gong L, Zhang Q (2016) Metabolomic prediction of yield in hybrid rice. Plant J 88:219–227
Yoshida R, Tamura T, Takaoka C, Harada K, Kobayashi A, Mukai Y et al. (2010) Metabolomics-based systematic prediction of yeast lifespan and its application for semi-rational screening of ageing-related mutants. Aging Cell 9:616–625
Zampieri M, Sauer U (2017) Metabolomics-driven understanding of genotype-phenotype relations in model organisms. Curr Opin Syst Biol 6:28–36
Zhang XW, Li QH, Xu ZD, Dou JJ (2020) Mass spectrometry-based metabolomics in health and medical science: a systematic review. RSC Adv 10:3092–3104
Zhou S, Morgante F, Geisz MS, Ma J, Anholt RRH, Mackay TFC (2020) Systems genetics of the Drosophila metabolome. Genome Res 30:392–405
Acknowledgements
The DGRP lines were obtained from the Bloomington Drosophila Stock Center (NIH P40OD018537, http://flystocks.bio.indiana.edu). We thank Helle Blendstrup, Susan Marie Hansen and Michael Ørsted from Aalborg University for assistance in fly maintenance and sample collection. All of the computing for this project was performed on the GenomeDK cluster. We would like to thank GenomeDK and Aarhus University for providing computational resources and support that enabled us to perform the analyses presented in the paper. The authors thank Anders Pedersen at the Swedish NMR Center at the University of Gothenburg for help with sample preparation and experimental setup and for access to the 800 MHz spectrometer. The study was supported by the Danish Natural Science Research Council through a grant to TNK (DFF-8021-00014B), and by a grant from the Lundbeck Foundation to PDR (R287-2018-735).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Associate editor: Darren Obbard
Cite this article
Rohde, P.D., Kristensen, T.N., Sarup, P. et al. Prediction of complex phenotypes using the Drosophila melanogaster metabolome. Heredity 126, 717–732 (2021). https://doi.org/10.1038/s41437-021-00404-1
DOI: https://doi.org/10.1038/s41437-021-00404-1