Background & Summary

Wheat is one of the “big three” cereals that dominate global staple food production. Wheat seeds are milled to remove brans, leaving most starchy endosperm to produce refined white flour, which are mainly starch and proteins. Wheat has become such a popular staple crop probably because of two major reasons. First, wheat is one of the most important sources of total calorie intake and contributes significantly to our daily plant-sourced proteins, fibres, mineral nutrients and beneficial phytochemicals (i.e., bioactive compounds)1. Second, wheat seeds contain a unique set of seed storage proteins (i.e., glutenins and gliadins), which form polymeric protein networks as the physico-chemical basis of unique visco-elastic properties, allowing wheat dough made into numerous types of flour foods consumed globally, such as breads.

High yield and environmental adaptability have long been the main targets of wheat breeding2. In recent years, the conceptual changes of “eat full” to “eat well” have become popular in food consumption, emphasizing the nutrition value and dietary benefits of our daily meal. Nutritional studies provide evidence that long-term intake of refined white-flour foods (e.g., white flour breads) as the staple is not beneficial for human health and is associated with the prevalence and development of chronic diseases, such as cardiovascular diseases and type-2 diabetes1. By contrast, consumption of wholegrain wheat-based food products has become the new dietary trend, as wheat bran provides many health-beneficial compounds3. Thereafter, increasing efforts have been made in the basic and applied research areas to understand the genetic basis of wheat quality, especially the nutritional quality traits, and to genetically improve these traits4,5. For example, several key enzymes in the carotenoid biosynthetic pathway have been functionally studied, including β-hydroxylase (HYD)6, lycopene epsilon cyclase (LCYε)7, carotenoid cleavage dioxygenases (CCD)8 and aldehyde oxidase (AO)9,10, and transgenic wheat lines or mutant lines have been investigated to discover their effects on biofortification of β-carotene in wheat11,12,13,14,15. More recently, owing to the technological advances in metabolomics (e.g., the widely targeted metabolomics based on ultra-performance liquid chromatography- electrospray ionization-tandem mass spectrometry (UPLC-ESI-MS/MS)16), large-scale identification and quantification of phytochemicals in wheat seeds has become possible. Thus, the metabolomes of a few wheat cultivars with black, purple or blue grain colors have been studied, revealing the metabolic basis of grain color formation and possible molecular mechanisms17,18,19,20,21.

Based on the above-mentioned studies, it is demonstrated that: (1) the human-beneficial phytochemicals are deposited in wheat seeds in a spatial and temporal manner; (2) pigmented wheat varieties offer high levels of bioactive compounds (flavonoids, phenolics, vanillin, and azelaic acid, for instance). Many of the bioactive compounds are accumulated in the outer layers of wheat seeds and their contents vary between wheat varieties, which not only serve as the basis supporting the benefits of whole-grain wheat consumption, but, unfortunately, are mostly removed during the grain milling process22. Our group previously showed that the contents of major carotenoids in wheat (lutein, zeaxanthin and β-carotene) were drastically decreased during seed development, possibly representing an intrinsic limitation in carotenoid biofortification of wheat seeds14.

Similar to the phytochemical deposition in seeds, the seed storage proteins (SSPs) of wheat are accumulated in spatial-temporal patterns as well. These SSPs include a large portion of gluten proteins, such as high-molecular-weight glutenin subunit (HMW-GS), low-molecular-weight glutenin subunit (LMW-GS), gliadins, puroindolines and avenin-like proteins (ALPs)23. The expression of these SSP-encoding genes are dramatically up-regulated from 12 to 20 days after pollination (DAP) and peak at the middle and late stages of endosperm development (20~30 DAP)24,25,26. In addition, these SSPs are accumulated in different seed regions: HMW-GS and γ-gliadin are primarily deposited in the central endosperm cells, while the S-rich prolamins (LMW-GS, α-gliadin and ALPs) are abundant in the sub-aleurone layer and the adjacent cells27,28,29. The spatial-temporal expression of SSP-encoding genes are tightly controlled by a combination of different transcription factors, some of which have been functionally characterized with recent advancements in wheat genomics and genetics28,30,31,32,33.

In the well-studied Poaceae species (rice and maize), complex networks of transcription factors and coregulators are known to be involved in the regulation of storage proteins and accumulation of starch and other phytochemicals, serving as, at least partly, the genetic basis of yield and nutritional and processing quality traits34,35. Compared to rice and maize, the post-genomic era of wheat studies has come until the very recent burst of Triticeae genomic resources, landmarking by the release of high-quality reference genomes of common wheat, durum wheat and their diploid ancestors36,37,38,39,40,41. While the recent accumulation of Triticeae genomic datasets, high quality transcriptomic and metabolomic resources of wheat seeds with focus on the spatial-temporal regulation of proteins and phytochemicals are scarce. The expression atlas of wheat includes a few samples of aleurone and endosperm42, while the other public RNA-seq datasets of wheat seeds emphasize on embryogenesis43. Also, environmental factors are known to contribute to grain development and metabolite abundance19,44, suggesting the necessity of a well-designed field experiment in generating transcriptome and metabolome datasets simultaneously for studying seed development. In the present work, we describe a high-quality, 310-GB RNA-seq data of wheat seed tissues together with a metabolomic dataset produced from the same samples. This multi-omic resource covers the stages and tissues suitable for studying the nutrient deposition and protein accumulation during late grain development. We believe that this resource will be helpful for understanding the spatial-temporal patterns of seed proteins and beneficial phytochemicals, for unraveling the regulatory networks involved in these metabolic processes and for wheat quality improvement.


Plant materials and field experiments

The bread wheat (Triticum aestivum L.) cultivar L88-31 was used in the field experiment, which was conducted at the experimental field of Huazhong University of Science and Technology in Wuhan, China during the 2020~2021 season using a randomized completely block design with three replicates (Fig. 1). L88-31 belongs to a set of wheat near isogenic lines that have been significantly contributed to our understanding on wheat dough property and widely used for producing transgenic wheat lines45,46,47. Each block was used to collect the samples of four stages for a biological replicate and the sampling stage was randomly assigned to a plot within each block. Each plot consisted of twelve rows which were 1.5 m long and had 35 seeds per row. Since the focus of our research is to capture the dynamics of gene expression and metabolites during the late stages of wheat seed development, the four stages were used: 20, 25, 30 and 35 days after pollination (DAP). Within each plot, the main tillers and the corresponding spikes were tagged and used for sample collection. The middle part of each sampled spikes was used for seed collection and dissection of tissue samples to avoid samples from different flowering time and seed developmental status. All the spikes were collected in the morning (between 9:00 A.M. to 11:00 A.M.) to avoid potential influences of circadian on transcriptome and metabolome. Once collected from the field, the spikes were transferred to the lab and dissected on ice with scalpels and tweezers immediately to embryos, the out layer of seeds and endosperm tissues (abbreviated as Em, OL, and En, respectively, and used hereafter), followed by snap frozen in liquid nitrogen. Around 150~200 seeds from the same plot were sampled and pooled together to form a biological replicate (including the tissues for both RNA-seq and metabolomics) (Fig. 1). Owing to the tissue availability, the embryo tissues collected at 20, 25, 30, and 35 DAP (Em20, Em25, Em30, and Em35, respectively) were used for RNA-seq analysis and other experiments in future, while the endosperm and outer layer tissues collected at 20, 25, 30, and 35 DAP (En20, En25, En30, En35, and OL20, OL25, OL30, OL35, respectively) were used for both metabolomics and RNA-seq, with the remaining tissues saved for future experiments.

Fig. 1
figure 1

Overview of the experimental design and sample collection for RNA-seq and metabolomics.

RNA extraction, library construction and sequencing

Total RNA was extracted with TRIzol reagent. The quality of extracted RNA samples was examined by agarose gel electrophoresis, NanoDrop 2000, and Agilent 2100 Bio-analyzer (Table 1). Standard protocols for the Illumina NovaSeq platform were used for construction of the wheat mRNA libraries. RNA-seq libraries were sequenced to generate 150-bp pair-end reads. For sequence quality control, cutadapt ( and FASTX-Toolkit ( fastx_toolkit/) were used to trim low-quality base pairs from the 3′ end of each sequence and the quality of raw and clean data was checked with FastQC ( The quality-filtered clean reads were mapped to the wheat reference genome of Chinese Spring (IWGSC_v1.0) using HISAT2 v2.0.1-beta with default parameters (,49. Only uniquely-mapped reads were retained and the read counts that aligned to the 110,790 gene models annotated of the wheat reference genome were calculated by using the featureCount software ( FPKM (fragments per kilobase of exon per million mapped sequence reads) values were calculated for each gene model. Genes met the following criteria were considered as expressed in a stage: (1) at least 5 reads mapped to a gene in each of the three replicates; (2) the average FPKM at a stage should be ≥ 0.5. The Pearson correlation coefficients between biological replicates were calculated using gene expression values. The differentially expressed genes (DEGs) were determined with edgeR by comparing the read count data between stages within each tissue or by comparing the tissues of each stages (,52. The significance threshold for edgeR-based differentially expressed genes was fold change ≥ 2 and a false discovery rate (FDR)-adjusted P-value < 0.05.

Table 1 The quality of each RNA sample. RIN, RNA integrity.

Metabolomic analysis

To capture the metabolomic changes during wheat seed development, quasi-targeted metabolomics was employed to analyze the wheat samples under a contract with Novogene Co., Ltd. (Beijing, China). The metabolomic method was established by Novogene and reported previously53,54,55. The method is described in detail as below.

Metabolites extraction

Wheat tissues (100 mg per sample) were grounded individually into powder with liquid nitrogen and the homogenate was well vortexed together with prechilled 500 μL 80% methanol. The samples were then subject to a 5-min cold incubation (on ice) followed by centrifugation (15,000 g, 4 °C for 20 min). The supernatant was diluted with LC-MS grade water to a final concentration of 53% methanol. The samples were subsequently transferred to a new Eppendorf tube followed by centrifugation (15,000 g, 4 °C for 20 min). After that, the supernatant was injected into the LC-MS/MS system56,57.

HPLC-MS/MS analysis

LC-MS/MS analyses were conducted with an ExionLC™ AD system (SCIEX) coupled with a QTRAP® 6500+ mass spectrometer (SCIEX). By using a 20-min linear gradient at a flow rate of 0.4 mL/min for the positive/negative polarity mode, the sample was injected onto a Xselect HSS T3 (2.1 × 150 mm, 2.5 μm) with 0.1% formic acid-water and 0.1% formic acid-acetonitrile as the eluent A and B, respectively54. The following parameters for solvent gradient was used: 2% B, 2 min; 2–100% B, 15.0 min; 100% B, 17.0 min; 100-2% B, 17.1 min; 2% B, 20 min. QTRAP® 6500+ mass spectrometer was operated in the positive polarity mode with curtain gas of 35 psi, collision gas of medium, ionspray voltage of 5500 V, temperature of 550 °C, ion source gas of 1:60, and ionsource gas of 2:60. QTRAP® 6500+ mass spectrometer was operated in the negative polarity mode with the following settings: curtain gas of 35 psi, collision gas of medium, ionspray voltage of −4500 V, temperature of 550 °C, ion source gas of 1:60, ion source gas of 2:60.

Metabolites identification and quantification

Metabolite identification was based on the in-house database using MRM (Multiple Reaction Monitoring)58. In the in-house library of Novogene Co., Ltd., more than 2500 commercially available purified standard compounds have been registered to the LC-MS/MS platform for determination of their characteristics. This in-house database includes 190+ amino acids and derivatives, 200+ organic acids and derivatives, 100+ nucleotide and derivatives, more than 120 flavonoid compounds, more than 40 anthocyanins, over 30 vitamin compounds, 30+ alkaloids and derivatives, 20+ phenolamides, 95+ carbohydrates and derivatives, 100+ lipids (including phospholipids, fatty acyls, glycerophospholipids and glycerolipids), 20+ phytohormone compounds, 600+ known compounds from medicinal plants and over 1000 other compounds (such as anthracenes, benzene and substituted derivatives, cinnamic acids and derivatives, coumarins and derivatives, iridoid derivatives, etc.).

To accurately identify biochemicals, retention time (RT) with a narrow RT window, accurate mass match to the library entries (+/− 0.005 amu), Q1 (parent ion) and Q3 and the MS/MS forward and reverse scores between the experimental data and the authentic standards were applied as the criteria59,60. SCIEX OS (version 1.4) was used to process the HPLC-MS/MS data files to integrate and correct the peak with the following parameters: minimum peak height, 500; signal/noise ratio, 5; gaussian smooth width, 1. The area of each peak represents the relative content of the corresponding substance.

Metabolomic data analysis

These identified metabolites were annotated using, Human Metabolome Database (HMDB) database61, Lipidmaps database62 and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database63. Principal components analysis (PCA) and partial least squares discriminant analysis (PLS-DA) were conducted with the metabolomics software metaX64. We applied univariate analysis (t-test) to calculate the statistical significance (P-value). The following criteria was used to identify differential metabolites: variable important in projection value VIP > 1 and P-value < 0.05, log2(fold change) ≥ 1 or ≤ −1.

Data Records

This data set contains two parts: the first part is the RNA-seq data and the derived gene expression data; the second part is the results of metabolite identification and quantification with metabolomics. Both parts use the same set of sample names, in which En, Em and OL stands for the endosperm, embryo, and outer layer tissues, respectively, with two digits standing for seed developmental stages and the suffix standing for biological replicates. For the RNA-seq data, the raw sequencing data containing all 36 samples were deposited in the NCBI (Bioproject ID: PRJNA891918; SRA experiments No. SRP430408)65. The gene expression matrix and the results of differential expression analysis are available at figshare66. For the metabolomic data, the metabolite information (i.e., compound name, formula, exact Q1 (m/z), molecular weight, RT and CAS No.), relative quantities and differential metabolite results have been made available by Novogene Co., Ltd. and are deposited at figshare67,68. Owing to the contract with Novogene, the raw metabolomic data have not been publicly available. Other data related to this data set (such as RNA quality, RNA-seq statistics) are provided in the manuscript (Table 1, Supplementary Tables S1S3).

Technical Validation

Quality control

The quality of extracted total RNA is key to high-quality RNA-seq libraries and successful downstream experiments, because low RNA integrity (RIN) values may affect the quality of RNA-seq libraries and may lead to potential deviations in gene expression. In our project, we ensured that the RIN values of all RNA samples were > 7.0. The quality parameters of each RNA samples are shown in Table 1.

Quality Validation

Quality validation for the RNA-seq samples

To ensure that the transcriptome and metabolome data obtained meet statistical standards, we have designed the experiment in consideration of several potential influencing factors, including replicates of the fields, growth differences between major tillers and smaller, minor tillers, differences in the developmental rate between the seeds from upper, middle and lower parts of the spike, and growth differences between individual plants and seeds (Fig. 1). In our experiments, we sampled several hundreds of seeds for each sample and the seeds were dissected on ice into three tissues: embryo (Em), the out layers (OL) and endosperm (En).

The RNA-seq experiments obtained high sequencing quality scores and high clean read ratios. The Q30 scores range from 87.67% to 95.49% and the clean read ratios ranges from 97.18% to 99.37% (Supplementary Table S1; Fig. 2a–c; samples’ metadata in Supplementary Table S2). The GC content of all RNA-seq samples was relatively stable, varying from 48.95% to 57.89% (Supplementary Table S1). The percentage of RNA-seq reads mapped to different parts of coding regions demonstrate that a large portion (from 45.4% to 60.84%) of the reads were mapped to coding sequence (CDS) with a small fraction of the reads mapped to 5′ and 3′ untranslated regions (0.12% to 0.54% for 5′UTR and 0.77% to 2.94% for 3′UTR, respectively) for most of the RNA-seq samples (Fig. 2d). However, two samples, En20-2 and En25-1, might be skewed compared to the other samples, as they contained 9.51% and 9.92% percentage of reads, respectively, mapped to intronic regions, indicating DNA contamination (Fig. 2d). Consistent with this result, En20-2 and En25-1 RNA samples indeed had lower A260/230 scores (1.51 and 1.32, respectively; Table 1) and lower GC contents (49.45% and 48.95%, respectively; Supplementary Table S1). The lowered A260/230 and GC content of En20-2 and En25-1 samples also indicate DNA contamination in these samples. In addition, sample OL30-1 had a higher percentage of reads mapped to intergenic regions (Fig. 2d). Taken together with its low mapping rate but relatively acceptable uniquely mapped rate and the global expression pattern correlated well with the corresponding replicates, we think OL30-1 can be acceptable for gene expression analysis (Fig. 2e,f; Supplementary Table S1).

Fig. 2
figure 2

Quality control and clustering analysis of the RNA-seq data set. (a) Quality score of per position in the reads. (b) Sequence content per base of the clean data. (c) N content per base of the clean data. (d) The percent of reads mapped to coding sequences (CDS), 5′untranslated regions (5′UTR), 3′untranslated regions (3′UTR), introns and intergenic regions for each RNA-seq sample. (e) Hierarchical clustering analysis. (f) Primary component analysis (PCA) of the RNA-seq samples. The analyzed results of Em20_1 sample are shown as a representative in the (a–c).

Differential expression analysis and the expression values were calculated for the high-confidence geneIDs annotated in the wheat reference genome (Supplementary Table S1, Table S2). Among the 110,790 annotated genes, a large portion had expression values ranging from 1 to 100 in the FPKM unit for each RNA-seq sample (a representative FPKM distribution per sample shown in Fig. 3a), supporting FPKM > 0.5 as a reasonable threshold for the expressed genes. The number of expressed genes varied from 37,862 to 51,368 (Supplementary Table S4), consistent with previous reports of wheat seeds samples43. The results of correlation heatmap and principle component analysis (PCA) based on the expression between the replicates further support that En20-2 and En25-1 samples are deviated from the remaining RNA-seq samples (Supplementary Figure S1). Thus, En20-2 and En25-1 samples are excluded from the identification of differentially expressed genes (DEGs). The number of DEGs between each pair of RNA-seq samples are shown (Fig. 3, Supplementary Table S4). Also, when En20-2 and En25-1 were excluded, the remaining RNA-seq samples could be well separated in the correlation heatmap and PCA plot (Fig. 2e,f, respectively). In the PCA result, the principle components 1 (PC1) and 2 (PC2) explains 30.27% and 12.34% of the variations, respectively, and well corresponds to the tissue types and developmental stages. The PCA results show: (1) the three tissue types can be clearly separated; (2) the outer layer and embryo tissues exhibit greater changes in gene expression during the four stages compared to the endosperm tissues. Indeed, the number of DEGs between the three tissues support this finding, with at least over 5000 up- or down-regulated genes identified at each of the four stage (Fig. 3e–g). By contrast, the number of DEGs between different stages within a particular seed tissue are much less, ranging from several hundreds to a few thousands. For example, only a few hundred genes were up- or down-regulated in the comparison of En20-En25 and En25-En30, consistent with the relative stable transcriptomic status of the endosperm samples as indicated in the PCA results (Figs. 2f, 3). In brief, the results of PCA and differential expression analysis consistently reflect the transcriptomic differences between the seed tissues and stages, demonstrating that this RNA-seq dataset is of high quality to study the gene expression regulation in different seed tissues during developmental stages.

Fig. 3
figure 3

Gene expression of the RNA-seq samples (a) and differential expression analyses between the stages (bd) or the tissue types (eg). (a) The distribution of gene expression values (in FPKM) using Em20_1 as a representative. (bd) The number of DEGs identified between the stages from embryo (b), endosperm (c) and the outer layers (d). (eg) The number of DEGs identified between the tissue types from the same stages.

Quality validation for the metabolomic samples

The metabolomic analysis identified 1,067 known metabolites in the samples of wheat endosperm (En) and seed outer layers (OL), covering a diverse set of compounds, including amino acids, carbohydrates, lipids, nucleotides, organic acids and their derivatives and many metabolites belong to secondary metabolic pathways (Fig. 4a). By calculating the coefficient of variance of each metabolite within the three replicates, 984 metabolites67,68 with low CV values (<0.8) have been kept (CV distribution shown in Fig. 4d and Supplementary Fig. 2, meta-data for the metabolomic samples in Supplementary Table S2), with their metabolic categories given in Fig. 4b. PCA analysis well detected the metabolic differences among the samples, with PC1 and PC2 explaining 38.64% and 19.04% metabolic variations, respectively, and probably corresponding to the differences between the tissues and the stages (Fig. 4c). Partial least squares discriminant analysis (PLS-DA) was performed to identify the metabolites with differential abundance between the samples (Fig. 4e,f, Supplementary Table S5). Interestingly, around 250 to 300 metabolites were found to be up-regulated in the outer layer samples at each of the four stages (Fig. 4e), consistent with previous results that the outer layers of wheat seeds have more metabolites with higher abundance22.

Fig. 4
figure 4

Characterization of the metabolomic data. (a) Classification of the 1,067 identified metabolites. (b) Classification of the 984 metabolites with low coefficient of variance values within the replicates. (c) PCA analysis of the metabolic samples. (d) The distribution of coefficient of variance values within the replicates for each metabolite (the three replicates of OL20 were used as a representative). (e) The number of differential metabolites between outer layer samples and endosperm samples at each of the four stages. (f) The number of differential metabolites between the stages within each the outer-layer or endosperm tissue.