Introduction

Trait analysis, especially genome-wide trait analysis, is centered on how genetic variation gives rise to phenotypic variation1. This type of analysis relies on statistical methods and tools to perform association mapping between causal genetic variants and resulting phenotypes, which can determine the heritability of a trait at a subset of genetic variants (typically referred to as single-nucleotide polymorphisms, or SNPs) and delineate regions of the genome that control the trait, thereby providing markers that can be utilized to accelerate breeding by marker-assisted selection2. Because of the great success of genome-wide association studies (GWAS), hundreds of SNPs conferring genetic variation of complex traits have been identified and reported3. However, the genetic structures of most traits remain unexplained, as associated SNPs detected from GWAS explain only a small fraction of heritability (e.g., <40% in schizophrenia studies)4 and a much smaller percentage of the total phenotypic variance. This is mainly because a number of these studies employed only additive models that fail to account for epistasis, or the interaction between multiple loci and the environment4,5,6.

Xu et al.7 proposed a new linear mixed model (LMM) for mapping quantitative trait loci (QTL) by incorporating multiple polygenic covariance structures. Based on this model, a pipeline for estimating epistatic effects (PEPIS) was developed to comprehensively estimate additive effects, dominance effects, and interaction effects between multiple genetic loci. PEPIS allows analysis of genome-wide genetic architectures, including genotype interaction effects (GxG), and can thereby explain more than 80% of phenotypic variance8.

Compared with standard GWAS tools that consider only additive effects, the PEPIS pipeline is equipped with a more complex polygenic linear model that can explain more phenotypic variance. However, neither of these methods can explain nearly 100% of phenotypic variance, as neither considers the interaction between genotypes and environments (GxE). Today, the predominant thinking in biology is that the orchestrated expression of many genes in different environmental conditions affects the transcriptome, proteome, and metabolome to produce a final observable phenotype9. Recent work in Saccharomyces cerevisiae suggests that GxE can occur at the individual locus level and the group level for multiple loci, leading to environment-dependent epistatic interactions10,11,12. Although Muir et al.13 conceptualized the partitioning of GxE into two possible interaction types, our mathematical understanding of the genetic and molecular mechanisms by which GxE collectively gives rise to phenotypes is still incomplete14.

The central dogma of biology is that the genome, transcriptome, proteome, and metabolome are cascading and connected to the end phenome15. The development of life science technologies enables transcriptomic, proteomic, and metabolomic events to be analyzed in detail within the same biological system, allowing the systematic study of a complete biological system16. Of all the omic data from the same biological system, genomic data generally remain constant across environments, although the same genotype subjected to different environments can produce a wide range of phenotypes by triggering the expression of different genes, downstream enzymes, and metabolites17. Most current association methods and analysis tools perform association mapping based on fundamental relationships between DNA sequence variation and phenotypic variation without addressing environmental variation. GxE can be understood by observing and measuring the expression of genes or metabolites. Harper et al.2 developed an associative transcriptomic approach to study complex traits in the polyploid crop species Brassica napus by correlating trait variation with the quantitative expression of genes and sequence variation of transcripts, with the consistent physical positions of the two kinds of associative markers allowing the identification of high-confidence transcription factor candidates2,18. However, their method is based on a purely additive model, and they make no mention of interaction effects between biomarkers or their contribution to phenotypic variation.

To overcome the limitation of standard GWAS that fails to consider the GxG and GxE effects, we extend associative genomics and transcriptomics into a broader associative omics by systematically integrating all available omic data into one analytical model. Here we propose a new LMM and describe the development of a pipeline for analyzing traits through ome-wide association studies (PATOWAS) to implement the model. The proposed LMM considers not only the additive effects of each biological marker but also the interaction effect of each marker pair. The marker pairs’ interaction effect introduced here corresponds to two-dimensional (2D) association mapping, which is complementary to one-dimensional (1D) association mapping in regular GWAS. Consequently, the proposed model and PATOWAS pipeline are not limited to GWAS for genotype-to-phenotype mapping (G2P); instead, they are capable of performing multiple types of ome-wide association studies, such as transcriptome-wide association studies (TWAS) for transcript-to-phenotype mapping (T2P) and metabolome-wide association studies (MWAS) for metabolite-to-phenotype mapping (M2P).

We submit a rice recombinant inbred line (RIL) dataset with three omics markers and two agronomic traits to PATOWAS for comprehensive analyses of associative omics. The results demonstrate that our proposed LMM and the pipeline PATOWAS can effectively address the GxG effect and the GxE effect, perform multiple-level associative omics in one platform, and innovatively provide a systems biology view into the traits analyzed.

Results

Associative omics, PATOWAS, and integrative omics

We aimed to systematically integrate multiple associative omic results to provide more biological insights into the phenotypic traits to be analyzed. We first collected a dataset of 210 rice RILs genotyped with 1619 marker bins, profiled with 22,584 transcripts and 1000 metabolites, and phenotyped with two agronomic traits (Table 1). The phenotypic traits (Supplementary Data 1 and 2) were yield (YIELD) and (kilo-) thousand grain weight (KGW), and the omic quantitative markers (Supplementary Data 3–5) were bin-based genotype data, Affymetrix RNA microarray-based gene expression data, and mass spectrometry-based profiling of metabolite abundance data. We presumed that expressed transcripts, proteins, and metabolites are prone to vary when subjected to different environments, while the genetic variants are considerably stable. Therefore, compared with genome-wide genotypic data, we further presumed that measured gene expression and metabolite abundance contain both gene and environment information and expected that associative transcriptomics (T2P or TWAS) or metabolomics (M2P or MWAS) could explain more phenotypic variance (Supplementary Fig. 1).

Table 1 Summary of phenotypic trait data and omic marker data

Motivated by our consideration of genetic epistasis and our desire to explain more phenotypic variance, we next proposed a statistical LMM that considers not only the additive effects of each marker variant but also the interaction effects of each marker pair. Based on this linear model, we developed a PATOWAS pipeline to analyze traits through multiple ome-wide association studies. Therefore, the proposed model and PATOWAS can be used to study not only GWAS for G2P but also TWAS for T2P and MWAS for M2P, which is progress toward an integrative omics (Fig. 1a).
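As a reading aid, a model of this kind can be written out explicitly. The form below is a sketch under stated assumptions, not the model specification itself: the symbols \(X\), \(\beta\), \(Z_k\), \(a_k\), \((aa)_{kl}\), \(K_{\mathrm{a}}\), and \(K_{\mathrm{aa}}\) are introduced here for illustration, \(m\) denotes the number of markers, and \(\circ\) denotes the element-wise product of two marker columns.

$$y = X\beta + \sum_{k=1}^{m} Z_k a_k + \sum_{k=1}^{m-1}\sum_{l=k+1}^{m} (Z_k \circ Z_l)\,(aa)_{kl} + \varepsilon$$

If the additive effects \(a_k\), the pairwise interaction effects \((aa)_{kl}\), and the residuals \(\varepsilon\) are assumed to be independent with variances \(\sigma_{\mathrm{a}}^2\), \(\sigma_{\mathrm{aa}}^2\), and \(\sigma^2\), the phenotypic covariance splits into three components:

$$\mathrm{var}(y) = K_{\mathrm{a}}\sigma_{\mathrm{a}}^2 + K_{\mathrm{aa}}\sigma_{\mathrm{aa}}^2 + I\sigma^2, \qquad K_{\mathrm{a}} = \sum_{k=1}^{m} Z_k Z_k^{\mathrm{T}}, \qquad K_{\mathrm{aa}} = \sum_{k<l} (Z_k \circ Z_l)(Z_k \circ Z_l)^{\mathrm{T}}$$

Under these assumptions, the first sum corresponds to 1D (per-marker) mapping and the double sum to 2D (per-pair) mapping.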

Fig. 1

Biological concept of the PATOWAS pipeline and biological insight of an example association mapping resulting from PATOWAS. a Three types of omic markers to phenotype association mappings, e.g., (1) genome-wide sequence/genotype variation to phenotypic variation mapping (G2P or GWAS), (2) transcriptome-wide gene expression variation to phenotypic variation mapping (T2P or TWAS), and (3) metabolome-wide metabolite abundance variation to phenotypic variation mapping (M2P or MWAS), can be analyzed using the unified linear mixed model in PATOWAS. b PATOWAS needs 2D omics marker matrix data and 1D phenotypic trait data as input. Specific PATOWAS results include (1) variance component analysis result showing three components: additive, additive–additive, and residual, of which the first two are biologically meaningful; (2) 1D association mapping for the Additive component; and (3) 2D association mapping for the Additive–Additive component, and further, the significant omics marker pairs extracted by thresholding

To test this presumption and verify our consideration, we used PATOWAS to analyze the rice RIL datasets with two agronomic traits and three different omics markers. PATOWAS accepts 2D omics marker matrix data and 1D phenotypic trait data as inputs (Fig. 1b). PATOWAS results for one specific associative omics mainly include three parts: variance component analysis for the partition of phenotypic variance, a 1D association map for the direct biological markers, and a 2D association map for the interaction of biological marker pairs (Fig. 1b). Of the three variance components, the additive component for the markers’ direct effects and the additive–additive component for the marker pairs’ interaction effects are biologically meaningful and can be explained by the linear model. The higher the sum of the two components, the lower the residual component and the more phenotypic variance can be explained by the model. Of all markers’ and marker pairs’ effects, those with higher −log10(p) values indicate markers or marker pairs that are more relevant to the phenotypic trait.

In the present study, we sequentially submitted three omic marker datasets to PATOWAS to analyze the two field traits, YIELD and KGW. We downloaded the results after completion of the analyses. Based on these results, multiple associative omics and the biological insight can be compared and integrated. For example, the combination of 1D association mapping across G2P and T2P can help identify the genotype and expressed gene transcript markers with consistent physical positions; comparison of the metabolites from 1D M2P association mapping can uncover the biochemical relevance of tissue-specific metabolites and traits to be analyzed; and the investigation of major biomarker pairs from 2D association mapping can be used to build an association network. All these together provide a systems biology view into the analyzed traits, leading toward an answer about how genes, transcripts, proteins, and metabolites work together to produce an observable phenotype.

Variance component analysis

Based on the variance component analysis results, we generated six pie charts displaying the three variance components of the two traits across associative genomics, associative transcriptomics, and associative metabolomics (Fig. 2).

Fig. 2

Pie chart illustrations of variance component analysis results for traits YIELD (a) and KGW (b) across associative genomics, transcriptomics, and metabolomics. Three components in each pie chart are colored with blue, brown, and yellow and represent the three estimated variance ratios of additive, additive–additive, and residual, respectively

We found that the two biologically meaningful variance components accounted for nearly all of the phenotypic trait variance in associative transcriptomics and associative metabolomics but not in associative genomics. Also, YIELD was a more complex trait than KGW, as the two biologically meaningful variance components accounted for only 66% of the total phenotypic variance in associative genomics but nearly 100% of the total phenotypic variance in associative transcriptomics and metabolomics (Fig. 2). These findings demonstrate that a chain of environmentally responsive genes and metabolites can be observed and explained at the transcriptomic and metabolomic levels but not at the genomic level.

Here we noticed that the number of transcript markers was one order of magnitude higher than the numbers of the other two marker types. When marker-by-marker interactions are considered, the number of transcript pairs reaches ~250 million, which is about two orders of magnitude larger than for the other two kinds of omic markers.
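The pair counts behind this comparison can be checked with a few lines of arithmetic. The sketch below assumes that a marker pair is an unordered pair of distinct markers, so that n markers give n(n − 1)/2 pairs; the dictionary names are illustrative only.

```python
from math import comb

# Marker counts reported for the rice RIL dataset.
marker_counts = {"genotype bins": 1619, "transcripts": 22584, "metabolites": 1000}

# Number of unordered marker pairs, n * (n - 1) / 2, for each omics layer.
pair_counts = {name: comb(n, 2) for name, n in marker_counts.items()}

for name, pairs in pair_counts.items():
    print(f"{name}: {pairs:,} pairs")

# Ratio of transcript pairs to genotype-bin pairs (roughly two orders of magnitude).
ratio_vs_bins = pair_counts["transcripts"] / pair_counts["genotype bins"]
```

With these counts, the transcripts yield about 255 million pairs, against about 1.3 million for the genotype bins and about 0.5 million for the metabolites.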

To test whether the higher ratio of biologically explanatory components observed in the TWAS result was due to the larger number of transcripts used in TWAS, we produced a reduced transcript gene set with a size comparable to those of the genotype and metabolite marker sets. We separately submitted the reduced transcript gene set to PATOWAS and checked the variance component analysis result.

The procedure used to generate the reduced gene set was as follows. First, we mapped the 22,584 transcript genes to the 1619 genotype bins (Supplementary Data 6); one genotype bin may contain from none to hundreds of transcript genes. Based on the 1D association mapping result, at most one representative transcript per bin was selected: the transcript with the highest −log10(p) in that bin. We then generated a reduced transcript gene set for each phenotypic trait, which is essentially a data matrix with a dimension of 1543 × 210 (Supplementary Data 7 and 8). Its number of markers was comparable to those of the analyzed genotypes and metabolites. The same approach was also used to generate the two positionally comparable 1D G2P and T2P association mapping results in the following section.
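The selection step can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function name, the transcript and bin identifiers, and the toy values are all hypothetical. A bin with no mapped transcript simply yields no representative, which is one way a set of 1619 bins could reduce to fewer markers.

```python
def representative_transcripts(bin_of, neg_log10_p):
    """Return {bin_id: transcript_id}, keeping for each bin the transcript
    with the highest -log10(p). Bins with no mapped transcript are absent."""
    best = {}
    for transcript, bin_id in bin_of.items():
        score = neg_log10_p[transcript]
        if bin_id not in best or score > neg_log10_p[best[bin_id]]:
            best[bin_id] = transcript
    return best

# Toy data (hypothetical IDs and values): bin3 has no transcript mapped to it.
bin_of = {"T_1": "bin1", "T_2": "bin1", "T_3": "bin2", "T_4": "bin4"}
neg_log10_p = {"T_1": 1.2, "T_2": 3.4, "T_3": 0.7, "T_4": 2.0}

reps = representative_transcripts(bin_of, neg_log10_p)
print(reps)
```

Because the representative is chosen per trait from that trait's 1D results, the reduced set can differ between YIELD and KGW.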

We submitted the reduced transcript data and the two phenotypic traits, KGW and YIELD, to PATOWAS for further study. Based on the variance component analysis results, two additional pie charts displaying the three variance components of the two traits in associative transcriptomics were plotted (Supplementary Fig. 2). Again, we observed that the two biologically meaningful components explained nearly 100% of the phenotypic variance, with only a fluctuation between the two components. Thus, we conclude that the much larger number of transcripts used in TWAS is not the reason for the higher explanatory ratio of phenotypic variance in associative transcriptomics.

Our proposed LMM involves two biologically meaningful variance components, \(\sigma _{\mathrm {a}}^2\) and \(\sigma _{\mathrm {aa}}^2\), together with the residual variance \(\sigma ^2\). To measure the portion of phenotypic variance that can be explained by the model, we define the broad-sense heritability by

$$H = \frac{{\sigma _{\mathrm {a}}^2 + \sigma _{\mathrm {aa}}^2}}{{\sigma _{\mathrm {a}}^2 + \sigma _{\mathrm {aa}}^2 + \sigma ^2}} \tag{1}$$
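Equation (1) is straightforward to evaluate once the three variance components are estimated. The sketch below uses an illustrative function name and, as input, variance ratios for YIELD under G2P: the additive share (0.28) and the explained total (0.66) are the values quoted in the Discussion, while the additive–additive share (0.38) and the residual share (0.34) follow from them by subtraction and are used here only for illustration.

```python
def broad_sense_heritability(var_a, var_aa, var_resid):
    """Broad-sense heritability H as in Eq. (1): explained variance over total."""
    explained = var_a + var_aa
    total = explained + var_resid
    if total <= 0:
        raise ValueError("total variance must be positive")
    return explained / total

# Illustrative variance ratios for YIELD under G2P (see the lead-in for their origin).
h_yield_g2p = broad_sense_heritability(var_a=0.28, var_aa=0.38, var_resid=0.34)
print(round(h_yield_g2p, 2))
```

Because H is a ratio, the same function applies whether the inputs are raw variance estimates or already normalized ratios.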

Modern GWAS applications often involve a panel of hundreds of thousands, or even millions, of genetic variants with only several hundred individual samples19. The statistical modeling of such cases is usually challenging because the sample size is substantially smaller than the number of covariates. This is well known as the “large p small n” problem20 and requires careful assessment of the statistical characteristics21.

Our proposed method can explain more of the phenotypic variance, but at the cost of generating a large number of pairwise covariates. It is therefore worthwhile to assess the heritability estimated by the proposed LMM, particularly for high-dimensional data.

First, we applied the predictability22, which is represented by the squared correlation coefficient between the observed and predicted phenotypic values. The squared correlation is approximately equal to \(R^2 = 1 - \mathrm{PRESS}/\mathrm{SS}\), where PRESS is the predicted residual error sum of squares and SS is the total sum of squares of the phenotypic values. In principle, we treated each transcript or metabolite marker as an intermediate phenotypic trait and predicted all of these intermediate phenotypic values from all the genotypic data. Therefore, each transcript or metabolite has an \(R^2\) value, its predictability (PRED). We then used the HAT method23 to calculate the PREDs for all transcripts and metabolites (Supplementary Data 9 and 10), applied a series of variable thresholds to the PREDs, and selected the transcript and metabolite markers. Finally, we submitted the subsets of selected transcript genes and metabolites to PATOWAS for variance component analysis and calculated the broad-sense heritability, H. Figure 3 shows the assessment result of the broad-sense heritability (H) with the markers selected by PRED thresholding. We found that the number of selected markers decreased steadily as the PRED threshold increased; however, the broad-sense H gives a very different perspective on the different traits and different associative omics. Only ~1000 and fewer than 100 transcripts are needed to explain more than 97% of the phenotypic variance in traits YIELD and KGW, respectively. In associative metabolomics, only 30 metabolites are enough to explain more than 90% of the phenotypic variance. In general, trait KGW is more conserved than trait YIELD, and associative metabolomics is more conserved than associative transcriptomics.
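The idea of obtaining PRESS from a hat matrix, without refitting the model once per left-out sample, can be sketched as below. This is an assumption-laden illustration and not the cited HAT method itself: ridge regression with a fixed penalty `lam` stands in for the mixed-model predictor, and the function name and simulated data are hypothetical. For such a linear smoother, the leave-one-out residual equals the ordinary residual divided by one minus the corresponding diagonal element of the hat matrix.

```python
import numpy as np

def loo_predictability(X, y, lam=1.0):
    """R^2 = 1 - PRESS/SS for ridge regression via the hat-matrix shortcut."""
    n, p = X.shape
    # Hat matrix of the ridge smoother: H = X (X'X + lam * I)^-1 X'
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - H @ y
    # Leave-one-out residuals without refitting: e_i / (1 - h_ii)
    loo_resid = resid / (1.0 - np.diag(H))
    press = float(np.sum(loo_resid ** 2))
    ss = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - press / ss

# Simulated toy data: 60 lines, 20 markers coded -1/+1, three markers with an effect.
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(60, 20))
beta = np.zeros(20)
beta[:3] = [1.0, -0.8, 0.5]
y = X @ beta + rng.normal(scale=0.5, size=60)

pred = loo_predictability(X, y)
print(f"PRED (leave-one-out R^2): {pred:.3f}")
```

In the setting described above, `y` would be one transcript or metabolite profile across the lines and `X` the genotype matrix, so the function would be called once per intermediate trait.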

Fig. 3

The assessment of broad-sense heritability (H) with the selected markers after PRED thresholding. a and b H vs. PRED threshold for trait YIELD and trait KGW in associative transcriptomics. c The numbers of selected transcript markers with the applied PRED thresholds. d and e H vs. PRED threshold for trait YIELD and trait KGW in associative metabolomics. f The numbers of selected metabolite markers with the applied PRED thresholds

Variance component analysis provides us with a big picture by partitioning the phenotypic variation into three components. The two biologically meaningful components for individual markers’ direct effects and the marker pairs’ interaction effects can be further illustrated by 1D and 2D association mapping, respectively.

1D association mapping

1D association mappings from PATOWAS across different associative omics can be combined, integrated, and compared, providing biological insights in trait analysis on both system and molecular biology levels.

Consistency of 1D G2P mapping using PATOWAS and other GWAS tools

Conventional GWAS tools such as TASSEL24, GCTA1, and PLINK25 can build associations between genotypes and phenotypes by calculating and outputting a p-value or −log10(p) value for each genotypic marker. The linear model adopted usually considers only the marker’s direct effect, which is mostly additive. This process essentially is 1D association mapping. PATOWAS is based on our proposed LMM, which considers not only the additive effect for each marker but also the additive × additive interaction effect for each marker pair. Therefore, PATOWAS calculates and outputs a p-value for each marker and a p-value for each marker pair, which essentially provides both 1D and 2D association mapping.

Regarding G2P mapping, 1D association mapping using PATOWAS can be compared with other GWAS tools26. We submitted the same RIL rice genotype and two phenotypic trait data to PATOWAS and TASSEL. We compared the 1D p-values returned from both tools and found that the results are very consistent. The Manhattan and Q–Q plots using the same genotype and phenotypic trait data from PATOWAS and TASSEL are illustrated in Supplementary Fig. 3.

Positional alignment and molecular validation across 1D G2P and T2P mapping

Harper et al.2 developed an associative transcriptomic approach to analyzing traits of the polyploid crop B. napus. Their method combines SNP-based and gene expression-based association results to identify high-confidence transcription factor candidates. As mentioned before, the 1D p-values returned from PATOWAS correspond to the additive effects for the individual markers, and can be used to generate a 1D Manhattan plot. To generate comparable plots between associated genotypic markers and transcript gene markers along their chromosomal position, we first mapped transcript genes to genotype bins and then selected the minimum p-value as the representative p-value of a bin (Supplementary Data 6). This mapping process between genotype bins and transcript genes ensured that there would be 1619 p-values for the two associative omic markers, making it possible to generate aligned 1D plots of −log10(p) values along the markers’ chromosomal positions.

We could easily find the positional consistency between genotype and expressed gene markers (Fig. 4). For YIELD, there was one local maximum region matched between G2P and T2P located in chromosome 1 and bounded with two red lines (Fig. 4a). By contrast, for KGW, most local maximum regions were matched between G2P and T2P (Fig. 4b). Therefore, as KGW is a more specific trait that is less affected by external environmental factors than YIELD, its high genotype variation regions always correspond to high gene expression variation regions pinpointed with high −log10(p) values in both G2P and T2P.

Fig. 4

Illustration of the chromosomal position consistency between genomics and transcriptomic markers. a Aligning the 1D G2P and T2P association mapping for trait YIELD. b Aligning the 1D G2P and T2P association mapping for trait KGW. The maximum −log10(p) value among the multiple expressed transcript genes in a genotype bin was selected as the representative −log10(p) value. Dashed lines distinguish the 12 chromosomes and corresponding marker/bin numbers for the complete rice genome. Chromosomal position consistency between G2P and T2P is presented, and the matched local maximum regions are bounded with two red lines

According to the −log10(p) values, we focused on trait YIELD and picked the top 10 transcript gene markers for a deeper investigation of molecular function. Each of the top 10 transcript gene markers is distinguished by a unique index and can be identified by its gene locus ID. Through a literature search, we found that at least five of the top 10 transcript markers have been reported to biologically affect rice YIELD (Supplementary Table 1). For example, marker T_2925 (LOC_Os01g62860) was reported to be related to seed shattering27; marker T_3229 (LOC_Os01g67580) was reported to be related to drug resistance28; and markers T_6368 (LOC_Os03g03070) and T_13429 (LOC_Os06g11330) were reported to control or delay flowering time29,30. Marker T_11921 (LOC_Os05g31040) in particular acquired the highest significance value (−log10(p) = 7.53) and was reported as the CKX9 plant hormone gene that could lead to the accumulation of cytokinin and an increased tiller number31,32. All these literature-validated gene markers demonstrate that PATOWAS is capable of performing trustworthy association mapping between causal expressed transcript variants and the resulting phenotypes. We annotated and marked these five genes on the 1D T2P association mapping plot and found that most of them belong to high association peaks (Supplementary Fig. 4). The aim of associative genomics or transcriptomics is to find the genetic variants or expressed transcript variants that clearly affect the phenotypic trait. Therefore, the regions of high genotype variation or high gene expression variation warrant further study. For YIELD, there is only one clearly matched region between G2P and T2P, and it lies in the vicinity of markers T_2925 and T_3229. However, other transcript gene markers, such as T_11921 with its highest significance value of 7.53, do not fall into the high genotype variation region (Supplementary Fig. 4, Region C). Therefore, we could conclude that PATOWAS and associative transcriptomics capture not only the heritable genetic information from the genome but also the intermediate environmental information at the transcriptome level.

1D M2P association mapping and comparison of metabolite markers between leaf and seed

In the present study, we used PATOWAS to analyze the association of 1000 metabolites with traits YIELD and KGW and then plotted the 1D M2P association results across individual metabolites (Fig. 5).

Fig. 5

Illustration of the 1D M2P association mapping results. Scatter plot of the 1D M2P association mapping for traits YIELD (a) and KGW (b). The M2P analysis in this study includes 683 leaf metabolites and 317 seed metabolites, and the two kinds of metabolites are colored with blue and red in the two scatter plots. A tentative significance threshold bar (−log10(p) = 3.0) is set up to show the relevance of the metabolite markers with the phenotypic trait

In genetic association analysis, choosing an appropriate p-value threshold is critical yet somewhat subjective33. To illustrate the methodology, we tentatively set the threshold at p = 0.001, so that a metabolite marker was considered significant if \(- {\mathrm {log}}_{10}(p) \ge 3.0\). There were more significant (−log10(p) ≥ 3.0) metabolite markers from leaf than from seed for YIELD (Fig. 5a), whereas there were more significant (−log10(p) ≥ 3.0) metabolite markers from seed than from leaf for KGW (Fig. 5b).

Further, we selected the top n most significant metabolite markers from the total of 1000 metabolites, for varying n, and classified them by tissue of origin (leaf or seed). Table 2 gives the relationship of the variable top n with the number of significant (−log10(p) ≥ Significance_Th) leaf and seed metabolites. Considering that there are 683 and 317 metabolites from leaf and seed, respectively, we set 0.683 and 0.317 as two meaningful ratio thresholds for significant metabolites from leaf and seed. From Table 2, we found that (1) for YIELD, when top n < 25, significant leaf metabolites against total top n metabolites always have a ratio higher than 0.683; and (2) for KGW, when top n < 500, the significant seed metabolites against the total top n metabolites usually produce a ratio higher than 0.317.
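The comparison of a top-n tissue ratio against the tissue's share of all metabolites can be sketched as follows. The function names, metabolite IDs, and scores are hypothetical toy values. Only the baseline of 683/1000 comes from the dataset composition stated above.

```python
def count_significant(scores, tissue, threshold=3.0):
    """Count markers with -log10(p) >= threshold, split by tissue of origin."""
    counts = {}
    for marker, score in scores.items():
        if score >= threshold:
            counts[tissue[marker]] = counts.get(tissue[marker], 0) + 1
    return counts

def tissue_fraction_in_top_n(scores, tissue, n, target):
    """Fraction of the n highest-scoring markers that come from `target` tissue."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:n]
    if not ranked:
        raise ValueError("n must select at least one marker")
    return sum(tissue[m] == target for m in ranked) / len(ranked)

# Toy -log10(p) values and tissue labels (hypothetical).
scores = {"M_1": 4.1, "M_2": 3.6, "M_3": 3.2, "M_4": 1.0, "M_5": 0.4}
tissue = {"M_1": "leaf", "M_2": "leaf", "M_3": "seed", "M_4": "seed", "M_5": "leaf"}

LEAF_BASELINE = 683 / 1000  # share of leaf metabolites among all 1000 metabolites

significant = count_significant(scores, tissue)
top2 = tissue_fraction_in_top_n(scores, tissue, n=2, target="leaf")
top3 = tissue_fraction_in_top_n(scores, tissue, n=3, target="leaf")
print(significant, top2 > LEAF_BASELINE, top3 > LEAF_BASELINE)
```

A ratio above the baseline indicates that the tissue is over-represented among the top n markers relative to its share of the full metabolite panel.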

Table 2 Summary of variable top n significant (−log10(p) ≥ Significance_Th) metabolites from leaf and seed across two traits

All these results suggest that leaf metabolites are more relevant to YIELD, while seed metabolites are more relevant to KGW, which is consistent with the findings of Xu et al.34. This could be explained by the fact that the photosynthesis process takes place mainly in leaf tissue and is the main factor determining rice yield35.

Further, we focused on the top 10 significant metabolites for a deeper investigation of molecular function. Based on a literature search, the identification and classification of the top 10 metabolites are summarized in Supplementary Table 2. Of the 10 metabolites, five were identified and two were further classified as flavonoids, whose content has been reported to be relevant to the assessment of crop yield36.

2D association mapping

The 2D association mapping of marker pairs’ interaction effects can be interpreted biologically by directly visualizing the 2D association matrix, applying significance thresholding, and constructing weighted association networks.

Illustration of marker pairs’ interaction effect and its significance thresholding

For trait YIELD, three 2D association mapping results were analyzed, and each association matrix was illustrated as a scaled image with pseudocolor (Fig. 6). By comparison, we found that genotypic markers were neighbor-dependent, as evidenced by the clustering of dots, whereas expressed transcript gene and metabolite markers were neighbor-independent, as evidenced by a random distribution of dots. This phenomenon could be explained by the existence of linkage disequilibrium (LD) blocks in population genetics37.

Fig. 6

The pseudo-color images that illustrate the 2D association mapping results, representing the marker pairs’ interaction effects for the trait YIELD across associative genomics, associative transcriptomics, and associative metabolomics. The pseudo-color scaling maps the −log10(p) value from low (deep blue) to high (yellow)

We are usually interested in the significant (−log10(p) ≥ Significance_Th) marker pairs rather than all the marker pairs. Similar to 1D association mapping, we could set a significance threshold to generate a binarized version of the 2D association matrix (Supplementary Fig. 5). We further zoomed in to a specified local region for each associative omics and found that associative genomics demonstrated a 2D local rectangular array while associative transcriptomics and associative metabolomics showed a 1D local strip (Supplementary Fig. 5 inset). The specificity of the 2D local structure pattern for associative genomics was due to the existence of LD blocks at the genomic level. Further, the dimensions of the 2D local rectangular array correspond to the LD block size.
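Binarizing a 2D association matrix and listing its significant pairs can be sketched in a few lines. The sketch assumes the matrix is symmetric and holds −log10(p) values, so each unordered pair is reported once from the upper triangle. The function names and the 4 × 4 toy matrix are hypothetical.

```python
def binarize(matrix, threshold):
    """Return a 0/1 matrix marking entries with value >= threshold."""
    return [[1 if value >= threshold else 0 for value in row] for row in matrix]

def significant_pairs(matrix, threshold):
    """Index pairs (i, j), i < j, whose value is >= threshold."""
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if matrix[i][j] >= threshold]

# Toy symmetric matrix of -log10(p) values for four markers (hypothetical).
assoc = [
    [0.0, 4.5, 0.2, 3.1],
    [4.5, 0.0, 1.0, 0.4],
    [0.2, 1.0, 0.0, 2.9],
    [3.1, 0.4, 2.9, 0.0],
]

binary = binarize(assoc, threshold=3.0)
pairs = significant_pairs(assoc, threshold=3.0)
print(pairs)
```

For a matrix the size of the transcript-pair map, an array library would be the more practical choice. Nested lists are used here only to keep the sketch dependency-free.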

Conditional 1D association mapping and weighted association network

For a specific omics marker pair, say markers X and Y, there is a p-value p(X, Y) whose significance, measured by −log10(p(X, Y)), shows how relevant the marker pair is to the phenotypic trait under study. If we locate a marker pair (X, Y) in the image of the 2D association mapping, two lines pass through it, one for each marker of the pair, and each line records a series of association significance values (Supplementary Fig. 6). For each associative omics, we selected a representative marker pair, marked its two lines in white and red, and generated the two corresponding conditional 1D association mapping plots (Supplementary Fig. 6, middle and bottom). Here, the conditional 1D association mapping is extracted from the 2D association matrix and biologically indicates how strongly the other omics markers interact with the selected marker to affect the studied phenotypic trait. We found that the conditional 1D associative genomics mapping revealed some clear QTLs, while the conditional 1D associative transcriptomics and metabolomics mappings showed no such positional structure.

Further, if we focus on one specific omics marker and set a significance threshold, its interactive pairs along the vertical or horizontal axes with higher −log10(p) values can be considered relevant regulators of that specific marker. An association network centered on the specific omics marker can then be constructed, in which the tie connecting two omics marker nodes is assigned an association significance value. This could be called a weighted association network, which is very different from the co-expression-based gene regulation network38 because it has a direct biological meaning with respect to the phenotypic trait under study.
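The construction described here can be sketched as follows, assuming that the conditional 1D mapping of a marker is its row in a symmetric 2D matrix of −log10(p) values and that the network keeps the partners whose value passes the threshold. The function names, marker labels, and matrix values are hypothetical.

```python
def conditional_profile(matrix, names, marker):
    """Row of the 2D -log10(p) matrix for `marker`: one value per partner marker."""
    i = names.index(marker)
    return {names[j]: matrix[i][j] for j in range(len(names)) if j != i}

def star_network(matrix, names, marker, threshold):
    """Weighted edges (marker, partner, weight) whose weight is >= threshold."""
    profile = conditional_profile(matrix, names, marker)
    return [(marker, partner, weight)
            for partner, weight in profile.items() if weight >= threshold]

# Toy symmetric matrix of pairwise -log10(p) values (hypothetical).
names = ["A", "B", "C", "D"]
matrix = [
    [0.0, 5.2, 1.1, 4.0],
    [5.2, 0.0, 0.3, 2.2],
    [1.1, 0.3, 0.0, 0.9],
    [4.0, 2.2, 0.9, 0.0],
]

edges = star_network(matrix, names, "A", threshold=3.0)
print(edges)
```

Each returned edge carries its significance value as the weight, which is what distinguishes this network from an unweighted list of significant pairs.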

According to the marker pairs’ significance values, we picked the top 10 associative transcript marker pairs and the top 10 associative metabolite marker pairs. For a deeper molecular-level investigation, we conducted a comprehensive literature search and functional annotation for the two types of associative omics marker pairs (Supplementary Tables 3 and 4).

Of the top 10 transcript marker pairs, most of the expressed transcripts are functionally relevant at the molecular level to plant growth, plant hormones, and cold and drought stress (Supplementary Table 3), which can ultimately affect the phenotypic trait YIELD. In addition, five transcript marker pairs are interconnected with one hub transcript, T_8111 (LOC_Os03g45280). Therefore, an expressed gene association network centered on the hub transcript T_8111 (LOC_Os03g45280) has been tentatively constructed (Fig. 7).
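One simple way to spot such a hub is to count how often each marker appears in the list of top pairs. The sketch below is an illustration under that assumption: the function name and marker labels are placeholders, not the actual top 10 pairs.

```python
from collections import Counter

def find_hub(pairs):
    """Return (marker, count) for the marker appearing in the most pairs."""
    counts = Counter(marker for pair in pairs for marker in pair)
    return counts.most_common(1)[0]

# Placeholder marker pairs (hypothetical labels).
top_pairs = [
    ("T_a", "T_hub"),
    ("T_hub", "T_b"),
    ("T_c", "T_d"),
    ("T_hub", "T_e"),
]

hub, degree = find_hub(top_pairs)
print(hub, degree)
```

The pairs that contain the hub then define the ties of the hub-centered network, with each tie weighted by the pair's −log10(p) value.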

Fig. 7

A weighted expressed gene marker interaction network. a Illustration of the association network. Five of the top 10 expressed gene transcript marker pairs are interconnected with a hub transcript T_8111 (LOC_Os03g45280) and used to construct an example association network. The weight on each tie is the significance value measured by −log10(p). b Plot of conditional 1D association mapping for transcript marker T_8111. For the transcript T_8111 (LOC_Os03g45280), its conditional 1D association mapping was extracted from the 2D association matrix

Although current knowledge about metabolite identification is very limited, we found that most of the identified metabolites from the top 10 metabolite marker pairs were classified as flavonoids or phenolics (Supplementary Table 4). Total phenolic and flavonoid content has been reported to be related to final crop yield36. Of the top 10 associated metabolite marker pairs, four were centered on the unknown metabolite marker M_195 and two were centered on the flavonoid metabolite marker M_311 (Supplementary Fig. 7).

Discussion

We extended the concept of genome-wide association to a broader concept of ome-wide association. To overcome the limitations of regular additive GWAS models that fail to consider epistatic and environmental interaction effects, we proposed a new LMM and successfully developed a new PATOWAS pipeline for ome-wide association studies.

We presume that the measured data of gene expression in the transcriptome and metabolite abundance in the metabolome contain not only heritable, stable genetic information but also fluctuating environmental information. Thus, the systematic integration and analysis of multiple levels of associative omics data can provide panoramic insight for complex trait analysis.

To test and validate our presumption, we analyzed a dataset of 210 RILs of rice consisting of genomic, transcriptomic, and metabolomic markers as well as two agronomic phenotypic traits using PATOWAS. The results of the three associative omics analyses were integrated and compared to perform complete trait analysis.

Compared with the genotype-based G2P association, the variance component analysis of gene expression-based T2P and metabolite abundance-based M2P association explained nearly 100% of the phenotypic variance, supporting our presumption that measured gene expression and metabolite abundance data contain both gene and environment information. For KGW, genotype-based G2P association explained >98% of the phenotypic variance, suggesting that this is a simple trait that is less affected by the environment. Of the two types of biologically meaningful variance components, the additive component corresponding to individual genetic markers accounted for 80% of the phenotypic variance, further suggesting that KGW is a more heritable trait that can be easily manipulated by breeding. However, for YIELD, genotype-based G2P association explained only 66% of the phenotypic variance, suggesting that it is a more complex trait that is easily affected by the external environment. In addition, the additive genetic component accounted for only 28% of the phenotypic variance, suggesting that YIELD is more difficult to manipulate by breeding. However, as YIELD may be the most important agronomic trait, environmental factors that affect the transcriptome and metabolome should be carefully considered to produce improvements in this trait.

We found obvious consistencies in genome coordinates between associated genotype markers and expressed gene transcript markers, allowing us to identify high-confidence, co-verified genotype and transcript markers for the same trait and suggesting the presence of correlations between SNP-based genotype data and transcriptomic data. Compared with trait YIELD, trait KGW had more consistent regions between G2P and T2P, indicating that more correlated information was transferred from the genomic level to the transcriptomic level.

We also investigated inconsistent regions between the G2P and T2P plots for YIELD (Supplementary Fig. 4). On chromosome 3, there was a local maximum peak in the G2P plot but not in the T2P plot (Supplementary Fig. 4, Region B), whereas on chromosome 5, there was a local maximum peak in the T2P plot but not in the G2P plot (Supplementary Fig. 4, Region C). We have verified that the latter was the CKX9 plant hormone gene, which could lead to the accumulation of cytokinin and affect the rice grain yield31,32. We speculate that these inconsistencies occurred because of environmental fluctuations resulting in the downregulation of genes located in Region B, and the upregulation of the CKX9 gene located in Region C.

Our associative metabolomic results from PATOWAS indicated that more leaf metabolites than seed metabolites were relevant to YIELD, whereas the reverse held for KGW, suggesting that significantly associated metabolites are tissue-specific and trait-specific. In contrast to Xu et al.’s method34, which provides only global information, our PATOWAS results provide details about how relevant each metabolite is to YIELD and KGW.

Although there are more than 200,000 different metabolites in the plant kingdom39, only a few hundred can be measured in a single experiment. Furthermore, because of technical bottlenecks in metabolite identification, most measured metabolites are unannotated40. If we can increase the number of measured and identified metabolites, the metabolome-wide association results from PATOWAS will become more accurate. In addition, if we can link associated genes with known metabolites, we can find and explain new pathways connecting enzymatic genes with their eventual metabolites.

The 2D p-value scanning results from PATOWAS can be used to construct an association network. Such an association network is trait-related and also can be constructed and analyzed for different ome-wide association studies. The integration of multiple layers of ome-wide association networks, together with other results of PATOWAS analysis, can provide panoramic biological insight for trait analysis, leading toward an answer to the question of how genes, transcripts, proteins, and metabolites work together to produce an observable phenotype.

Methods

Statistical method

A new LMM incorporating additive and interaction effects: We proposed a new LMM for multiple associative omics, mathematically described below, that incorporates all markers’ direct additive effects and marker pairs’ interaction effects.

Let y be an n × 1 vector of a quantitative phenotypic trait and Z be an m × n marker matrix for a quantitative omic dataset, such as coded genotypic data, transcript gene expression data, or metabolite abundance data. Coded genotypic data can be acquired by sequencing and genotyping a population41, gene expression data can be acquired by microarray hybridization or mRNA-seq experiments, and metabolite abundance data can be acquired by gas chromatography–mass spectrometry or liquid chromatography–mass spectrometry followed by metabolite feature extraction, annotation, alignment, and quantification42,43,44.

The LMM that incorporates the markers’ additive effects and marker pairs’ interaction effects can be represented as

$$y = X\beta + \mathop {\sum}\limits_{{\it{i}}{\mathrm{ = 1}}}^m {Z_ia_i} + \mathop {\sum}\limits_{i = 1}^{m - 1} {\mathop {\sum}\limits_{j = i + 1}^m {W_{ij}\gamma _{ij}} + e}$$
(2)

where X is an n × 1 vector of unity and β is the intercept; Zi is the n × 1 vector of values of the ith marker across the n individuals (the ith row of Z, written as a column), and ai is the ith marker’s additive effect on the trait; \(W_{ij} = Z_i \ast Z_j\) is the element-wise product of vectors Zi and Zj; γij is the interaction effect between marker i and marker j; and e is an n × 1 vector of residual error.

We treat each marker’s effect as a normally distributed random variable with a mean of zero and a variance that is common across all markers or across all pairs of markers, as shown by \(a_i \sim N(0,\sigma _{\mathrm {a}}^2)\) and \(\gamma _{ij}\sim N(0,\sigma _{\mathrm {aa}}^2)\). The residual errors are distributed as \(e\sim N(0,I\sigma ^2)\). The total additive and interaction effects are denoted by \(\mathop {\sum}\limits_{i = 1}^m {Z_ia_i}\) and \(\mathop {\sum}\limits_{i = 1}^{m - 1} {\mathop {\sum}\limits_{j = i + 1}^m {W_{ij}\gamma _{ij}} }\), respectively.

The expectation of the model is E(y) = Xβ, and the variance is

$${\mathrm{var}}\left( y \right) = K_{\mathrm {a}}\sigma _{\mathrm {a}}^2 + K_{\mathrm {aa}}\sigma _{\mathrm {aa}}^2 + I\sigma ^2$$
(3)

where Ka and Kaa are marker-generated additive and epistatic kinship matrices with values calculated by formulas (4) and (5).

$$\begin{array}{l}K_{\mathrm {a}} = \frac{1}{d_{\mathrm {a}}}\mathop {\sum}\limits_{i = 1}^m {Z_iZ_i^T} \hfill \\ K_{\mathrm{aa}} = \frac{1}{{d_{\mathrm{aa}}}}\mathop {\sum}\limits_{i = 1}^{m - 1} {\mathop {\sum}\limits_{j = i + 1}^m {W_{ij}W_{ij}^T} } \hfill \end{array}$$
(4)

where

$$\begin{array}{l}d_{\mathrm {a}} = \frac{1}{n}tr\left( {\mathop {\sum}\limits_{i = 1}^m {Z_iZ_i^T} } \right)\hfill \\ d_{\mathrm {aa}} = \frac{1}{n}tr\left( {\mathop {\sum}\limits_{i = 1}^{m - 1} {\mathop {\sum}\limits_{j = i + 1}^m {W_{ij}W_{ij}^T} } } \right)\hfill \end{array}$$
(5)

are normalization factors that allow the K matrices to have diagonal elements as close to unity as possible.
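As a concrete illustration of the kinship formulas and their normalization factors, the following Python sketch computes both matrices for a toy marker set. It is not the km_cal module: the function names are hypothetical, plain lists are used to keep it dependency-free, and it assumes that the additive sum pairs each marker vector with itself (Z_i Z_i^T).

```python
from itertools import combinations

def outer_sum(vectors, n):
    """Sum of v v^T over all length-n vectors v in `vectors`."""
    k = [[0.0] * n for _ in range(n)]
    for v in vectors:
        for r in range(n):
            for c in range(n):
                k[r][c] += v[r] * v[c]
    return k

def normalize(k):
    """Divide by d = trace(k) / n so that the mean diagonal equals one."""
    n = len(k)
    d = sum(k[r][r] for r in range(n)) / n
    return [[x / d for x in row] for row in k]

def kinship(markers):
    """markers: list of m marker vectors, each holding n individuals' values."""
    n = len(markers[0])
    ka = normalize(outer_sum(markers, n))
    # W_ij is the element-wise product of marker vectors i and j (i < j).
    w = [[a * b for a, b in zip(zi, zj)]
         for zi, zj in combinations(markers, 2)]
    kaa = normalize(outer_sum(w, n))
    return ka, kaa

# Toy data: m = 3 markers coded -1/1 on n = 4 individuals (made-up values).
Z = [[1, 1, -1, -1],
     [1, -1, 1, -1],
     [1, 1, 1, -1]]
Ka, Kaa = kinship(Z)
```

With ±1 coding every individual contributes the same amount to the trace, so both normalized matrices have diagonal entries of exactly one in this toy case; with real-valued expression or metabolite data only the average diagonal is one.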

The model involves three variance components, \(\sigma _{\mathrm {a}}^2\), \(\sigma _{\mathrm {aa}}^2\), and σ2, which can be estimated by the restricted maximum likelihood (REML) method for dissection of phenotypic variance.
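To make the phrase "dissection of phenotypic variance" concrete, the sketch below turns three estimated variance components into proportions of the total. It is a minimal illustration under one stated assumption: because the kinship matrices are normalized to a mean diagonal of about one, the three components are taken to add up to the phenotypic variance. The function name and the input values are hypothetical.

```python
def variance_proportions(sigma_a2, sigma_aa2, sigma_e2):
    """Share of phenotypic variance attributed to each component.

    Assumes normalized kinship matrices (mean diagonal of one), so that
    the three components sum to the total phenotypic variance.
    """
    total = sigma_a2 + sigma_aa2 + sigma_e2
    if total <= 0:
        raise ValueError("variance components must sum to a positive value")
    return {
        "additive": sigma_a2 / total,
        "interaction": sigma_aa2 / total,
        "residual": sigma_e2 / total,
    }

# Made-up estimates, not values from this study.
shares = variance_proportions(2.0, 1.0, 1.0)
```

The proportion "explained" by the markers is then the sum of the additive and interaction shares, that is, one minus the residual share.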

Estimating variance components using the REML method

The model used to estimate the variance components is

$$y = X\beta + \xi + \zeta + e$$
(6)

where ξ and ζ are the additive and interaction effects, respectively. The expectation of the model is E(y) = Xβ, and the variance is

$${\mathrm {var}}\left( y \right) = {\mathrm {var}}\left( \xi \right) + {\mathrm {var}}\left( \zeta \right) + {\mathrm {var}}\left( e \right) = K_{\mathrm {a}}\sigma _{\mathrm {a}}^2 + K_{\mathrm {aa}}\sigma _{\mathrm {aa}}^2 + I\sigma ^2$$
(7)

The restricted log-likelihood function is

$$L(\beta ,\sigma _{\mathrm {a}}^2,\sigma _{\mathrm {aa}}^2,\sigma ^2) = - \frac{1}{2}\ln \left| V \right| - \frac{1}{2}\ln \left| {X^TV^{ - 1}X} \right| - \frac{1}{2}(y - X\beta )^TV^{ - 1}(y - X\beta )$$
(8)

Given \(\sigma _{\mathrm {a}}^2\), \(\sigma _{\mathrm {aa}}^2\), and σ2, we can solve for β by

$$\hat \beta = (X^TH^{ - 1}X)^{ - 1}X^TH^{ - 1}y$$
(9)

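Substituting Eq. (9) into Eq. (8), with \(V = H\sigma ^2\) (so that in Eqs. (9) and (10) H denotes \(K_{\mathrm {a}}\lambda _{\mathrm {a}} + K_{\mathrm {aa}}\lambda _{\mathrm {aa}} + I\), where \(\lambda _{\mathrm {a}} = \sigma _{\mathrm {a}}^2/\sigma ^2\) and \(\lambda _{\mathrm {aa}} = \sigma _{\mathrm {aa}}^2/\sigma ^2\)), gives

$$\begin{array}{l}L(\sigma _{\mathrm {a}}^2,\sigma _{\mathrm {aa}}^2,\sigma ^2) = - \frac{1}{2}\ln \left| H \right| - \frac{1}{2}\ln \left| {X^TH^{ - 1}X} \right| - \frac{1}{{2\sigma ^2}}(y - X\hat \beta )^TH^{ - 1}(y - X\hat \beta ) \\ \qquad \qquad \qquad - \frac{{n - r(X)}}{2}\ln (\sigma ^2)\hfill\end{array}$$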
(10)

The resulting likelihood function has three unknowns, so the REML estimates of the three variance components can be obtained with any numerical optimization routine.

After the three variance components are acquired, we fix the variance ratios \(\hat {\lambda}_{\mathrm{a}} = \hat {\sigma}_{\mathrm {a}}^{2}/\hat {\sigma} ^2\) and \(\hat {\lambda} _{\mathrm {aa}} = \hat {\sigma} _{\mathrm {aa}}^{2}{/}\hat {\sigma}^2\), and then estimate and test the additive effects and interaction effects by conducting 1D scanning across all markers and 2D scanning across all marker pairs, respectively.

1D and 2D scanning to estimate additive and interaction effects

We define model I and use it to estimate the additive effect of marker Zi as shown below:

$$y = X\beta + Z_ia_i + e$$
(11)

The expectation of this model is

$$E(y) = X\beta + Z_ia_i$$
(12)

We also define model II and use it to estimate the interaction effect of marker pair Wij as shown below:

$$y = X\beta + Z_ia_i + Z_ja_j + W_{ij}\gamma _{ij} + e$$
(13)

The expectation of this model is

$$E(y) = X\beta + Z_ia_i + Z_ja_j + W_{ij}\gamma _{ij}$$
(14)

When (λa,λaa) are fixed, the two models are fixed-effect models and can be solved using the weighted least-squares method. The variance of each model can be written as

$$\begin{array}{rcl} V = {\mathop{\rm{var}}} (y) &=& K_{\mathrm {a}}\sigma _{\mathrm {a}}^2 + K_{\mathrm {aa}}\sigma _{\mathrm {aa}}^2 + I\sigma ^2\\ &=& \left( {K_{\mathrm {a}}\lambda _{\mathrm {a}} + K_{\mathrm {aa}}\lambda _{\mathrm {aa}} + I} \right)\sigma ^2 \\ &=& \left( {H + I} \right)\sigma ^2 \end{array}$$
(15)

Furthermore, we define a parameter matrix P and a vector b as the following:

$$P = \left\{ {\begin{array}{*{20}{c}} {\left[ {X||Z_i} \right]} & {{{{\rm{Model}}}\quad {\mathrm{I}}}} \\ {\left[ {X||Z_i||Z_j||W_{ij}} \right]} & {{{{\rm{Model}}}\quad {\mathrm{II}}}} \end{array}} \right.$$
(16)
$$b = \left\{ {\begin{array}{*{20}{c}} {\left[ {\beta //a_i} \right]} & {{{{\rm{Model}}}\quad {\mathrm{I}}}} \\ {\left[ {\beta //a_i//a_j//\gamma _{ij}} \right]} & {{{{\rm{Model}}}\quad {\mathrm {II}}}} \end{array}} \right.$$
(17)

where P is an n × 2 or n × 4 matrix that concatenates its component columns horizontally (denoted by ||) and b is a 2 × 1 or 4 × 1 vector that concatenates all regression coefficients vertically (denoted by //). The generalized least-squares estimate of b is

$$\hat b = (P^TV^{ - 1}P)^{ - 1}P^TV^{ - 1}y$$
(18)

Note that

$$V^{ - 1} = (H + I)^{ - 1}{\mathrm{/}}\sigma ^2$$
(19)

Therefore,

$$\hat b = (P^T(H + I)^{ - 1}P)^{ - 1}P^T(H + I)^{ - 1}y$$
(20)

Note that when (λa,λaa) are fixed, the following matrix is constant across all markers and marker pairs and can be simplified using eigendecomposition:

$$(H + I)^{ - 1} = (UDU^T + I)^{ - 1} = U(D + I)^{ - 1}U^T$$
(21)

where D is a diagonal matrix holding the eigenvalues δi of H, and U is the matrix whose columns are the corresponding eigenvectors of H. The inverse of D + I is simply

$$(D + I)^{ - 1} = {\mathrm {diag}}\left\{ {\frac{1}{{\delta _i + 1}}} \right\}$$
(22)
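The simplification in Eq. (21) relies on H being symmetric, so that U can be chosen orthogonal (\(U^TU = UU^T = I\)). A short derivation of that step, using only the symbols already defined, is:

$$\begin{array}{l}H + I = UDU^T + UU^T = U(D + I)U^T \\ (H + I)^{ - 1} = (U^T)^{ - 1}(D + I)^{ - 1}U^{ - 1} = U(D + I)^{ - 1}U^T \end{array}$$

Because Ka and Kaa are sums of outer products, H has no negative eigenvalues whenever λa and λaa are non-negative, so every diagonal entry of D + I is at least one and the inverse always exists.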

Rewriting Eq. (20) gives

$$\hat b = (P^T(H + I)^{ - 1}P)^{ - 1}P^T(H + I)^{ - 1}y = (P^{ \ast T}WP^ \ast )^{ - 1}P^{ \ast T}Wy^ \ast$$
(23)

where

$$\left\{ {\begin{array}{*{20}{l}} {P^ \ast = U^TP} \hfill \\ {y^ \ast = U^Ty} \hfill \\ {W = (D + I)^{ - 1}} \hfill \end{array}} \right.$$
(24)

The residual error variance is estimated using

$$\hat \sigma ^2 = \frac{1}{{n - r(P)}}(y^ \ast - P^ \ast \hat b)^TW(y^ \ast - P^ \ast \hat b)$$
(25)

where r(P) = 2 and r(P) = 4 for model I and model II, respectively. The variance matrix of the estimated effects is

$${\mathop{\rm{var}}} (\hat b) = (P^{ \ast T}WP^ \ast )^{ - 1}\hat \sigma ^2$$
(26)

For model I,

$${\mathop{\rm{var}}} (\hat b) = \left[ {\begin{array}{*{20}{c}} {{\mathop{\rm{var}}} (\hat \beta )} & {{\mathop{\rm{cov}}} (\hat \beta ,\hat a_i)} \\ {{\mathop{\rm{cov}}} (\hat a_i,\hat \beta )} & {{\mathop{\rm{var}}} (\hat a_i)} \end{array}} \right]$$
(27)

and the Wald test for H0: ai = 0 is

$${\mathrm {Wald}} = \frac{{\hat a_i^2}}{{{\mathop{\rm{var}}} (\hat a_i)}}$$
(28)

For model II,

$${\mathop{\rm{var}}} (\hat b) = \left[ {\begin{array}{*{20}{c}} {{\mathop{\rm{var}}} (\hat \beta )} & {{\mathop{\rm{cov}}} (\hat \beta ,\hat a_i)} & {{\mathop{\rm{cov}}} (\hat \beta ,\hat a_j)} & {{\mathop{\rm{cov}}} (\hat \beta ,\hat \gamma _{ij})} \\ {{\mathop{\rm{cov}}} (\hat a_i,\hat \beta )} & {{\mathop{\rm{var}}} (\hat a_i)} & {{\mathop{\rm{cov}}} (\hat a_i,\hat a_j)} & {{\mathop{\rm{cov}}} (\hat a_i,\hat \gamma _{ij})} \\ {{\mathop{\rm{cov}}} (\hat a_j,\hat \beta )} & {{\mathop{\rm{cov}}} (\hat a_j,\hat a_i)} & {{\mathop{\rm{var}}} (\hat a_j)} & {{\mathop{\rm{cov}}} (\hat a_j,\hat \gamma _{ij})} \\ {{\mathop{\rm{cov}}} (\hat \gamma _{ij},\hat \beta )} & {{\mathop{\rm{cov}}} (\hat \gamma _{ij},\hat a_i)} & {{\mathop{\rm{cov}}} (\hat \gamma _{ij},\hat a_j)} & {{\mathop{\rm{var}}} (\hat \gamma _{ij})} \end{array}} \right]$$
(29)

and the Wald test for H0: γij = 0 is

$${\mathrm {Wald}} = \frac{{\hat \gamma _{ij}^2}}{{{\mathop{\rm{var}}} (\hat \gamma _{ij})}}$$
(30)

The p-value for a marker’s additive effect or the interaction effect of a marker pair is calculated using

$$p = 1 - \Pr (\chi _{1}^{2} < {\mathrm {Wald}})$$
(31)
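The model I scan for a single marker can be sketched as follows. This is an illustrative Python version, not the ps_main module: the function name is hypothetical, the inputs are assumed to be already rotated by U^T (the eigendecomposition itself is not shown), the residual variance is taken as the weighted residual sum of squares divided by n − r(P), and the chi-square tail probability with one degree of freedom is computed through the identity Pr(χ²₁ > w) = erfc(√(w/2)).

```python
import math

def wald_model_one(y_star, x_star, z_star, delta):
    """Model I scan for one marker on eigen-rotated data.

    y_star, x_star, z_star : U^T y, U^T X (intercept column) and U^T Z_i
    delta                  : eigenvalues of H; the weights are 1 / (delta_k + 1)
    Returns (a_hat, wald, p_value).
    """
    n = len(y_star)
    w = [1.0 / (d + 1.0) for d in delta]
    # Entries of P*^T W P* (2 x 2) and P*^T W y* (2 x 1).
    sxx = sum(wk * x * x for wk, x in zip(w, x_star))
    sxz = sum(wk * x * z for wk, x, z in zip(w, x_star, z_star))
    szz = sum(wk * z * z for wk, z in zip(w, z_star))
    sxy = sum(wk * x * y for wk, x, y in zip(w, x_star, y_star))
    szy = sum(wk * z * y for wk, z, y in zip(w, z_star, y_star))
    det = sxx * szz - sxz * sxz
    # b_hat = (P*^T W P*)^{-1} P*^T W y*, via the closed-form 2 x 2 inverse.
    beta = (szz * sxy - sxz * szy) / det
    a_hat = (sxx * szy - sxz * sxy) / det
    # Residual variance with n - r(P) = n - 2 degrees of freedom.
    rss = sum(wk * (y - x * beta - z * a_hat) ** 2
              for wk, x, z, y in zip(w, x_star, z_star, y_star))
    sigma2 = rss / (n - 2)
    var_a = sigma2 * sxx / det  # lower-right entry of var(b_hat)
    wald = a_hat ** 2 / var_a
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return a_hat, wald, p_value

# Toy data (made up). All-zero eigenvalues reduce the fit to ordinary least squares.
x = [1.0] * 5
z = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [1.0, 2.1, 2.9, 4.2, 4.8]
a_hat, wald, p_value = wald_model_one(y, x, z, delta=[0.0] * 5)
```

Model II follows the same pattern with a 4 × 4 system, testing the last coefficient instead of the second; a general linear solver would replace the closed-form 2 × 2 inverse there.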

PATOWAS pipeline

PATOWAS was developed for analyzing traits through ome-wide association studies. It is composed of two primary sub-pipelines. Sub-pipeline 1 consists of one module designed for kinship matrix calculation, and sub-pipeline 2 is designed for association mapping and integrates three related analysis modules: one for the analysis of the three variance components, another for 1D p-value scanning of all markers’ direct additive effects, and a third for 2D p-value scanning of all marker pairs’ interaction effects. The four modules are designated km_cal, vc_anal, ps_main, and ps_inter, respectively. The modules were coded in C/C++ using Code::Blocks in a Linux environment and compiled into four separate executable commands. Several Perl and Linux C shell scripts were developed as wrappers to streamline the complete analysis pipeline. Briefly, when the coded genotype data, transcript gene expression data, or metabolite abundance data are provided, module km_cal calculates and delivers the corresponding kinship matrices. When phenotypic quantitative trait data are provided, module vc_anal estimates and delivers the three variance component ratios using both the quantitative trait data and the available kinship matrices. After performing various information aggregation procedures, including kinship matrix weighting and matrix eigendecomposition, modules ps_main and ps_inter calculate and return 1D p-values for all markers and 2D p-values for all marker pairs, respectively (Fig. 8a).

Fig. 8

PATOWAS data analysis flowchart and user interfaces. a PATOWAS data analysis flowchart. PATOWAS comprises four processing modules; it takes omics marker matrix data and phenotypic trait data as input and outputs the kinship matrices, variance component ratios, and 1D and 2D p-values for main and interaction effects. b User interface for submitting data. c User interface for downloading results

PATOWAS accepts 2D omic marker matrix data and 1D phenotypic trait data in .csv format as inputs (Fig. 8b). After data submission, PATOWAS calculates the kinship matrices Ka and Kaa and uses these intermediate kinship matrices together with the phenotypic trait data to estimate the variance component ratios λa and λaa. Finally, a Wald test is employed to scan the 1D and 2D p-values for the markers’ additive effects and the interaction effects of marker pairs, respectively (Fig. 8a). The PATOWAS analysis output includes the kinship matrices, estimated variance components, and 1D and 2D p-values for the markers’ additive effects and marker pairs’ interaction effects (Fig. 8c). Based on the results, the three variance components accounting for phenotypic variance and the −log10(p) values for the markers’ additive effects and marker pairs’ interaction effects can be visualized and further analyzed.

To increase the flexibility of analyses, users are allowed to run only a portion of the PATOWAS pipeline according to the input data and user-configured parameters (e.g., users can perform only kinship matrix calculations and the three variance component analyses or only kinship matrix calculations). Such configuration flexibility enables users to utilize PATOWAS to generate specific data, such as a kinship matrix, for their own genetic data analysis.

Similar to PEPIS8, PATOWAS was implemented in C/C++. Furthermore, its computationally demanding analysis modules were implemented using parallel computing techniques, which divide large computational tasks into smaller jobs that are distributed to nodes on networked Linux clusters to accelerate computing.

Let n be the number of individuals and m be the number of omic markers. The total number of effects is m + C(m,2) = m(m + 1)/2. According to the kinship matrix calculation formula7 and complexity analysis8, the numbers of multiplications required to calculate the kinship matrices Ka and Kaa are mn(n + 1)/2 and m(m−1)n(n + 1)/4, respectively. These counts demonstrate the enormity of the multiplication demand associated with kinship matrix calculations, especially when both n and m are large. However, the procedure used to calculate each matrix cell value is the same; thus, all n(n + 1)/2 loops for matrix cell calculation can be parallelized.
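The counts above are easy to evaluate for a given dataset. The short Python sketch below does so for the dimensions of the bin genotype matrix used in this study (m = 1619, n = 210); the function names are hypothetical and the code simply restates the formulas.

```python
def kinship_multiplications(m, n):
    """Multiplications needed for Ka and Kaa, following the counts in the text."""
    cells = n * (n + 1) // 2            # cells in one triangle of an n x n matrix
    ka = m * cells                      # m n (n + 1) / 2
    kaa = (m * (m - 1) // 2) * cells    # m (m - 1) n (n + 1) / 4
    return ka, kaa

def total_effects(m):
    """m additive effects plus C(m, 2) pairwise effects = m (m + 1) / 2."""
    return m + m * (m - 1) // 2

# Dimensions of the bin genotype matrix: 1619 markers, 210 individuals.
ka_ops, kaa_ops = kinship_multiplications(1619, 210)
n_effects = total_effects(1619)
```

For these dimensions the additive kinship matrix needs on the order of 3.6 × 10^7 multiplications and the epistatic one on the order of 2.9 × 10^10, which is why the per-cell work is worth distributing.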

The variance component analysis module essentially needs only one optimization of a three-parameter log-likelihood. The 1D additive effects p-value scanning module needs m Chi-square calculations and Wald tests, and the 2D interaction effects p-value scanning module needs m(m−1)/2 Chi-square calculations and Wald tests. However, the procedure to calculate each p-value is the same, so the m additive effect p-value calculations and the m(m−1)/2 interaction effect p-value calculations can also be parallelized. The strategy used in PATOWAS for parallel high-performance distributed computing is summarized in Table 3.
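One generic way to divide the m(m−1)/2 independent pair tests among workers is to cut the list of marker pairs into contiguous chunks, as in the Python sketch below. This is only an illustration of the idea: the function name is hypothetical and it is not a description of the scheduling actually used by PATOWAS.

```python
from itertools import combinations, islice

def pair_chunks(m, n_jobs):
    """Split the m(m-1)/2 marker pairs (i < j) into at most n_jobs chunks."""
    total = m * (m - 1) // 2
    size = -(-total // n_jobs)  # ceiling division
    pairs = combinations(range(m), 2)
    chunks = []
    while True:
        chunk = list(islice(pairs, size))
        if not chunk:
            break
        chunks.append(chunk)
    return chunks

# Five markers give ten pairs; three jobs receive 4, 4 and 2 pairs.
chunks = pair_chunks(5, 3)
```

Each chunk can then be handed to a separate process or cluster node, since no pair test depends on another. For tens of thousands of markers a real implementation would pass index ranges rather than materialized lists of pairs.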

Table 3 Summary of parallel strategy of PATOWAS for high-performance distributed computing

Rice omics data

We analyzed YIELD and KGW from 210 RILs of rice described by Hua et al.45,46. The 210 RILs were derived by single-seed descent from a cross between the rice varieties Zhenshan 97 and Minghui 63. Field phenotypic data pertaining to YIELD and KGW were collected from replicated field trials on the Huazhong Agricultural University Experimental Farm in Wuhan, China.

Ome-wide quantitative marker data consisted of bin-based genotype data, Affymetrix RNA microarray-based gene expression data, and mass spectrometry-based profiling metabolite abundance data. Over 270,000 high-density SNP markers were used to infer recombination breakpoints (crossovers), which were then used to construct a total of 1619 genotype bins41. Each bin was treated as a new synthetic marker for association studies, and the bin map was constructed by genotyping the RIL population sequences.

The transcriptomic data originally consisted of 24,994 expressed genes, which were sampled and measured from flag leaves for all 210 RILs in 2008. Each line had two biological replicates, but RNA extracted from the two replicates was mixed at a 1:1 ratio before microarray expression profiling. The original expression levels were then log2-transformed before analysis47. Of the 24,994 genes, 22,584 were clearly matched to 1619 genotype bins. We found only minor and inconsequential differences between the analysis results for 24,994 versus 22,584 genes.

The metabolomic data consisted of 683 metabolites measured from flag leaves and 317 metabolites measured from germinated seeds48. Metabolomic data were collected in 2009 and 2010. Before mass spectrometry-based metabolic profiling, germinated seeds were sampled in one biological replicate in 2009 and one in 2010, and flag leaves were sampled in two biological replicates in 2009. For both tissues, the abundance level of each metabolite was log2-transformed. For each line, we took the average of two replicates’ abundance levels as the measurement of the metabolite.
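The per-line metabolite measurement described above amounts to a small calculation, sketched here in Python. The function name and the raw abundance values are made up, and the sketch assumes the order given in the text: log2-transform each replicate first, then average.

```python
import math

def metabolite_level(rep1, rep2):
    """log2-transform two replicate abundances, then average them."""
    return (math.log2(rep1) + math.log2(rep2)) / 2.0

# Made-up raw abundances for one metabolite in one line.
level = metabolite_level(8.0, 32.0)
```

Averaging after the transform is equivalent to taking log2 of the geometric mean of the two replicates, which damps the influence of a single unusually large reading.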

In summary, the bin genotype data, microarray-based gene expression data, and mass spectrometry-based metabolite data were acquired and stored in three matrices with dimensions of m × n = 1619 × 210, m × n = 22,584 × 210, and m × n = 1000 × 210, respectively (Table 1). Here, m and n represent the number of markers and individuals, respectively.

Code availability

The PATOWAS pipeline and source code are freely available at http://bioinfo.noble.org/PATOWAS/. In addition, the source code of PATOWAS has been deposited into the public repository GitHub at https://github.com/ZhaoBioinformaticsLab/PATOWAS. We are committed to maintaining and improving the specific function modules per user comments and suggestions.