Main

The significance of polyploidy in plants has long been recognized, with respect to both species diversification and implications for genetic improvement during plant breeding1,2,3,4,5. The genome-wide redundancy conferred by whole genome duplication increases genetic diversity and provides additional avenues for evolving functionality, thereby increasing biological complexity6,7. Duplicated genes in polyploid organisms, or homoeologs, are coordinated in several ways, mediating gene dosage effects, gene balance, interaction between cis- and trans-acting factors and rewiring of gene expression networks8,9,10,11,12,13. Unequal contribution of the expression of each homoeologous gene to the total expression level (that is, biased homoeolog expression) has been observed in many allopolyploid plants14,15,16,17. Homoeologous expression bias reflects one aspect of duplicate gene coordination, which is thought to be associated with increased variation and hence adaptive potential. The many complexities of duplicate gene cis- and trans-interactions in biological networks, however, have only recently begun to be elucidated8,10,12,13,17. A particularly promising research avenue is the mapping of expression quantitative trait loci (eQTL) for genome-wide discovery of genetic regulatory variants that influence gene expression.

Allotetraploid ‘upland’ cotton, Gossypium hirsutum L., which originated following an interspecific hybridization event between two diploid ancestors (genome type AA and DD) approximately 1–2 million years ago18, is the dominant source of natural renewable fiber for textiles. Cotton ‘fibers’ are single-celled epidermal ovular trichomes with modular expression and phenotypic stages encompassing initiation, primary wall synthesis, secondary wall synthesis and maturation19. As a vital economic commodity, cotton fiber development has been extensively studied, and many genetic loci and functional genes responsible for fiber development have been discovered20,21,22,23,24,25,26. However, our understanding of how the co-existing At and Dt subgenomes genetically coordinate the dynamic development of the fiber is limited. Accordingly, few breeding practices consider the interactions of genetic effects due to the two subgenomes, which have unique transcriptional and biochemical suites of interactions8,10,12,13. Here we present a genetic regulation analysis of dynamic gene expression in developing fibers across a suite of highly diversified G. hirsutum accessions and uncover the genetic components that may optimize homoeologous gene expression for unlocking the potential for fiber improvement.

Results

Gene expression atlas in fiber development

To uncover the genetic regulation of gene expression in fiber development, we collected 376 diverse G. hirsutum accessions for genome and transcriptome analysis. A total of 13.5 Tb of genome resequencing data were generated, with an average depth of 15.6× (Supplementary Table 1). Accessions were sampled at different developmental stages, including ovules on the day the flower opens (0 days postanthesis (DPA)) and fibers at five timepoints spanning elongation to secondary cell wall synthesis (4, 8, 12, 16 and 20 DPA). A total of 2,215 RNA-sequencing (RNA-seq) data samples were generated (with 41 failing to sample), with an average of 40 million read pairs for each sample (Fig. 1a and Supplementary Table 2). Principal component analysis showed that samples from the same timepoint clustered together, and samples from adjacent development stages were closely associated, indicating a continuous developmental trajectory (Fig. 1b and Extended Data Fig. 1a,b).

Fig. 1: Gene expression atlas and genetic regulation during fiber development.
figure 1

a, Samples used for RNA-seq in fiber development. The samples include ovules at 0 DPA and fibers at 4 DPA, 8 DPA, 12 DPA, 16 DPA and 20 DPA. The sample number of each stage is shown in parentheses. b, Principal component analysis (PCA) plots of the first two components for 2,215 RNA-seq samples. c, The number of genes expressed (dark gray outside line) or not expressed (lighter gray outside line) in all RNA-seq samples. d, The number of expressed homoeologous genes in each timepoint. Red, homoeologous genes with expression bias toward the At (BiasA); blue, homoeologous genes with expression bias toward the Dt (BiasD); gray, homoeologous genes without expression bias (BiasN); purple, homoeologous genes with expression bias toward the At and Dt (bidirect bias). e, The number of eQTLs and distribution of eGenes that had cis-eQTL, trans-eQTL or both. At, the At subgenome; Dt, the Dt subgenome; Sca, genes in unanchored scaffolds.

A total of 49,860 genes were transcribed (fragments per kilobase of transcript per million mapped reads (FPKM) > 0.1 in at least 5% accessions) during cotton fiber development, including 24,486 in the At subgenome and 25,238 in the Dt subgenome (Fig. 1c, Extended Data Fig. 1c and Supplementary Table 3), of which 12,875 were expressed at one to five timepoints (Extended Data Fig. 1d). Of note is the observation that 20,189 homoeologous gene pairs (2n = 40,378) were expressed (Extended Data Fig. 1e and Supplementary Tables 3 and 4), of which 79.6% (n = 16,081) showed expression bias (expression level fold change ≥2 between two homoeologs in ≥ 5% accessions) toward the At (BiasA) or Dt (BiasD) subgenome in at least one timepoint (Fig. 1d), including 3,256 pairs with stable direction of expression bias at all timepoints (Extended Data Fig. 1f).

Genetic regulation of dynamic gene expression

Using both RNA-seq data in fiber development and genome resequencing data for each accession, we considered the impact of genetic variants on gene expression. We used eQTL mapping to identify cis- (within 1 megabase (Mb) of each gene on either side) and trans- (>1 Mb apart or on a different chromosome) regulatory variants (eVariants) that are associated with differences in gene expression. We leveraged approximately 2.7 million SNPs with a minor allele frequency (MAF) > 0.05 in conjunction with the 45,545 genes exhibiting expression variation across all stages (64.8% of the predicted transcriptome) in the eQTL mapping (Supplementary Figs. 1 and 2). In total, 53,854 cis-eQTLs were identified for 18,637 genes (23.8–28.4% of cis-eQTLs overlapped with open chromatin20) and 23,811 trans-eQTLs were identified for 10,391 genes (eGenes, that is, genes whose expression is associated with one or more eQTL), with the largest number of eGenes (12,674, or 55.1%) identified at 12 DPA, because of the presence of a few eQTL hotspots (that is, local chromosomal regions that were associated with transcriptional regulation of more than three genes) at this timepoint (Fig. 1e and Supplementary Table 5).

To compare the sharing of eQTL and further understand the differences in genetic regulation between stages, we collated eQTLs from all stages and distinguished regulatory mechanisms that were found in only one stage (stage-specific) or at least two stages (stage-shared; Supplementary Figs. 3 and 4). We found that stage-shared eQTLs showed larger effects than stage-specific eQTLs, and genes with stage-shared eQTLs had a higher proportion of cis-eQTLs than stage-specific eGenes (Extended Data Fig. 2a,b). For each cis-eQTL, we assessed sharing among stages by comparing local false sign rate (LFSR) and magnitude, which represented the metrics for eQTL significance and effect estimates27. In this analysis, 27,102 (50.2%) cis-eQTLs that were shared among six timepoints had significant signals in the comparison of LFSR, of which only 11,072 had detected effects (fold change of magnitude ≤2 between different timepoints), suggesting many stage-shared eQTLs (LFSR ≤ 0.05) showed variable effect magnitude during fiber development (Extended Data Fig. 2c). In terms of the effect magnitude estimates for eQTLs that are shared between timepoints, the vast majority (91.1%) showed consistent effect direction (Extended Data Fig. 2d).

Fine-mapping of fiber quality associations

We decoded the genetics of fiber quality-related traits by integration of genome-wide association study (GWAS) and eQTL data. A total of 18 QTLs were identified, including five for fiber length (FL), six for fiber strength (FS), four for fiber elongation (FE) rate and three for fiber uniformity (FU; Fig. 2a and Supplementary Fig. 5), of which nine were previously uncharacterized (Supplementary Table 6). This result was partially verified by an F2 population (Supplementary Fig. 6). Then, two complementary methods, including a transcriptome-wide association study (TWAS) and a colocalization analysis28,29, were used to prioritize causal genes for fiber quality-related traits. Using TWAS, 1,255 genes (false discovery rate (FDR) < 0.05) were identified across the whole genome (311 for FL, 655 for FS, 877 for FE and 308 for FU; Fig. 2a, Supplementary Fig. 7 and Supplementary Table 7). Specifically, 43 genes were prioritized as candidate genes for 17 GWAS QTLs using TWAS (Supplementary Table 8). As proof, genetic knockout of a TWAS gene (Ghir_D10G004160) showed that FL became significantly shorter (Extended Data Fig. 3a,b and Supplementary Figs. 8 and 9). Using two colocalization strategies, summary-data-based Mendelian randomization (SMR) and coloc29,30, we also identified 14 fiber quality-related genes (3 for FL, 4 for FS, 6 for FE and 4 for FU) in nine GWAS QTLs (Fig. 2b, Extended Data Fig. 3c, Supplementary Figs. 10 and 11 and Supplementary Table 8), 11 of which overlapped with genes in the TWAS analysis. Of note is the observation that 1,243 of the 1,258 genes from TWAS and colocalization analysis maintained the same effect on fiber quality traits in fiber development.

Fig. 2: Fine-mapping of fiber quality associations.
figure 2

a, Manhattan plot of the genome-wide association study (top panel) and transcriptome-wide association study (bottom panel) for fiber quality. Significant QTLs are labeled. Significance thresholds of P = 3.76 × 10−7 (one-sided F test) and FDR = 0.05 (P value of two-sided Student’s t test corrected by FDR) were used, respectively. b, Phenotypic effects (TWAS z score or correlation between expression and phenotype) of FL/FS-related candidate genes. Strategies are shown at the left, with illustrative color ranges. Positive z scores or correlations are shown in orange and negative values in purple. Significant genes are marked with ‘check mark.’ c, Regional association plots for FL (top row) and eQTLs for gene Ghir_D05G007220 (BB2) at 12 DPA and 20 DPA. Chromosomal location and gene position were labeled at the bottom. The lead SNPs are highlighted with a purple diamond. Boxplots on the right panel show fiber length and BB2 expression for accessions with different genotypes (n = 154 versus 215; two-sided Wilcoxon rank-sum test; centerline, median; box limits, first and third quartiles; whisker, 1.5× interquartile range). d, Expression profiles of Ghir_D05G007220 (BB2) in 340 accessions were divided into eight expression groups. The figure shows the four larger expression groups. Heatmaps showing normalized FPKM in each accession at each timepoint. Line charts showing the mean expression of accessions at each timepoint. The arrow points to the timepoint with the highest mean expression. e, Boxplot for fiber length of accessions with different expression patterns (n = 16 versus 88 versus 108 versus 109; two-sided Student’s t test; centerline, median; box limits, first and third quartiles; whisker, 1.5× interquartile range). f,g, Dot plot for the genetic effect of variants associated with FL/FS-related candidate genes. X axis indicates the variants sorted by genetic effect. Y axis indicates the genetic effect of variants. The representative variants and regulated genes are labeled. h, The heritability accumulation accompanying the increase of genetic variants. i, The heritability is explained by only GWAS loci and both GWAS loci and eQTLs.

On chromosome D05, we identified a QTL that is significantly related to FL, for which BB2 (Ghir_D05G007220) is characterized as a causal gene (Fig. 2c). BB2 encodes an E3 ubiquitin ligase, which was found to be related to cell proliferation and elongation in Arabidopsis31, and positively regulates FE after 12 DPA. Based on the expression pattern of BB2, all cotton accessions were divided into eight different groups, with groups 2, 3, 4 and 5 including 94.4% of accessions (Fig. 2d). From group 2 to group 5, the expression pattern of BB2 was gradually delayed, and the median FL value of the corresponding accessions gradually increased (Fig. 2e). This result indicates that the expression pattern of group 5 (highly expressed from 8 DPA to 16 DPA) is conducive to FE, probably coinciding with the extended FE period. Similar to the analysis for BB2, we investigated the expression patterns for all 1,258 genes from TWAS and colocalization analysis in all accessions and defined ‘favorable’ expression patterns as those in accessions with favorable fiber quality traits, such as longer or stronger fiber. We found 158 FL, 196 FS, 349 FE and 148 FU-related genes exhibited favorable expression patterns in the accessions with favorable traits (Extended Data Figs. 3d and 4a). From short-fiber to long-fiber accessions, the number of genes with favorable expression patterns tended to increase in accessions with longer fiber (Pearson coefficient: 0.621, P < 2.2 × 10−16; Extended Data Fig. 4b,c). Similar observations were found for other traits (Extended Data Fig. 4d–f).

We next explored the genetic effect of regulatory variants associated with genes in TWAS and colocalization analysis. Most of the loci showed moderate effects with a median FL of 0.48 mm, FS of 1.14 cN tex−1, FE of 0.04% and FU of 0.36% (Fig. 2f,g). We found that 534 genes showed pleiotropic effects (Extended Data Fig. 4g), such as the regulatory variant of Ghir_D06G018130 that contributed to both FS and FE with very large effects (FS: 2.30 cN tex−1; FE: 0.07%). We delineated the growth trend of heritability by considering different numbers of trait-related loci through random sampling. The estimated heritability grew smaller as the number of loci increased, in a logarithmic manner (Fig. 2h). The phenotypic variance explained by the integration of TWAS and colocalization genes was much more than could be explained by just considering GWAS loci (Fig. 2i).

Regulatory modules underpinning fiber quality

Because one eQTL may regulate multiple eGenes and one gene may also be regulated by multiple eQTLs, this relationship becomes more complicated when considering the presence of eQTL hotspots. We identified 406 eQTL hotspots that regulated 4,689 genes across five fiber developmental timepoints (Fig. 3a), including 283 timepoint-specific hotspots and 123 timepoint-shared hotspots, of which nine representative large hotspots were found to regulate a variable number of genes in fiber development ranging from 4 (Hot292 at 4 DPA) to 1,546 genes (Hot369 at 12 DPA; Fig. 3b). The eQTL/hotspot–eGene relationships constituted a comprehensive genetic network (Fig. 3c and Supplementary Tables 9 and 10j). We found that the in-degree distribution of this network followed a linear trend in the log scale (r2 = 0.7691, P = 2.37 × 10−7), which is a landmark of scale-free networks (Fig. 3c and Extended Data Fig. 4h)32. In this network, all nodes (eQTLs/hotspots or eGenes) were clustered into 36 modules according to the connectivity between nodes.

Fig. 3: eQTL hotspots and genetic network of genes associated with fiber quality.
figure 3

a, Line chart showing the number of genes regulated by eQTL hotspots at each timepoint. Red point, the total number of regulated genes across five timepoints. b, Bubble plot showing the variable number of genes regulated by nine hotspots in fiber development. Point size, the scaled number of regulated genes in each timepoint; line size, the scaled number of genes that were coregulated between adjacent timepoints. c, Genetic network of fiber-related genes. The node color represents the module to which it belongs. Hexagon node, eQTL hotspot; circular, candidate genes; square, eQTL. The edge is used to connect hotspots/eQTLs with genes. Line plot, distribution of node in-degree values (one-sided F test). d, The counts of genes, eQTLs and hotspots (top) in 36 modules and module heritability of four fiber quality-related traits (bottom). e, Analysis of dynamic interpretation (normalized r2) of fiber length in eight modules at five timepoints. The expression patterns of four representative genes are shown with line plots. f, A total of 129 loci with higher favorable allele frequency (>0.2) in the accessions with the longest fiber (n = 50) compared with the accessions with the shortest fiber (n = 50). Centerline, median; box limits, first and third quartiles; whisker, 1.5× interquartile range. g, Module enrichment analysis of the 129 loci in f (one-sided Fisher’s exact test).

Each module shows diverse heritability for four traits (Fig. 3d). We identified 23 FL, 25 FS, 26 FE and 25 FU-related modules with heritability ≥0.05 for each trait (Supplementary Table 10). For example, module 25 shows the highest heritability for FL and FE and module 24 shows the highest heritability for FS. This indicates that modules may differ with respect to phenotypic effects, possibly because different modules are controlled by different regulatory factors such as transcriptional factors (Extended Data Fig. 4i)33. In all modules, 35.3–57.4% of eQTLs regulated candidate genes at a specific timepoint, 11.8–38.2% of eQTLs had steady regulatory effects on candidate genes at all timepoints and the other eQTLs (16.8–37.3%) showed variable regulation at two to four timepoints (Extended Data Fig. 4j).

To examine the dynamic effect of modules on FL, the eight modules (heritability >0.05 and FL-related gene counts ≥5) with higher interpretation (normalized r2) to FL were tabulated (Fig. 3e). For example, genes in module 29 are mainly involved in regulation at the early elongation stage (4–8 DPA), such as Ghir_D02G002150 that encodes a xyloglucan endotransglucosylase/hydrase involved in primary cell wall extension34. The genes in module 1 are mainly involved in regulation at a later elongation stage (16–20 DPA), including Ghir_A10G008020, which encodes a plant glycogen-like starch initiation protein that has a role in secondary cell wall biosynthesis (Fig. 3e)35.

To investigate the extent of favorable allele aggregation in each module, we divided cotton accessions into three categories by their FL, that is, a long-fiber group (31.08 ± 0.68 mm), an intermediate-fiber group (28.85 ± 0.80 mm) and a short-fiber group (26.31 ± 0.82 mm). Among the 436 loci associated with FL, 129 show a higher favorable allele frequency (>0.2) in the long-fiber group compared with the short-fiber group, and these loci are enriched in modules 1, 23, 25 and 36 (Fisher’s exact test, P < 0.05; Fig. 3f). Accordingly, modules 1 and 25 have a very high heritability to FL, representing strong candidates for prioritization in cotton breeding (Fig. 3g).

Genetic effects on homoeologous expression bias

In polyploid cotton, subgenomic expression bias occurs for a number of homoeologous genes; however, the cis- or trans-transcriptional regulatory controls of this bias are not well understood. To investigate the genetic basis of expression bias of homoeologous genes, we performed GWAS using bias fraction score as a phenotype (Supplementary Fig. 12). A total of 14,133 significant associations (bias-eQTLs) were detected for 4,026 homoeologous gene pairs during fiber development.

To explore the possibility that bias-eQTLs contribute to homoeologous expression bias, we compared the genetic effects of eQTLs in one subgenome with those of bias-eQTLs for homoeologous genes, after discovering that eQTLs were significantly enriched for homoeologous pairs exhibiting expression bias (17,828, 48.1% versus 6,225, 29.3%; Fig. 4a and Supplementary Fig. 13). We observed that 2,658 gene pairs with 5,350 significant signals that were detected in both the eQTL and bias–eQTL analyses tended to have a wider pattern of expression bias in different accessions and their cis-regulatory regions had a higher proportion of variance for expression bias (Extended Data Fig. 5a,b). Of the 5,350 bias-eQTLs, 4,846 (90.5%) were colocalized with eQTLs for one of their homoeologous genes, and the number of variants in genic regions showed noticeable differences (Extended Data Fig. 5c). This suggests that variants in the transcriptional regulation region for one copy of homoeologous gene may result in dysregulation of expression and lead to expression bias. Of note is the observation that for gene pairs with bias-eQTLs identified at each stage, 8% (6.2–9.3%) of them are TWAS genes. There is an association between the variation of their homoeologous expression and FL change (Fig. 4b), suggesting that some homoeologous pairs with biased expression were implicated in the regulation of traits.

Fig. 4: Genetic regulation of homoeologous expression bias.
figure 4

a, The circular diagram shows the proportion of homoeologous gene pairs that were characterized as eGenes and non-eGenes. The inner circle shows homoeologous pairs with expression bias, and the outer circle indicates gene pairs without expression bias. b, The number of TWAS genes present in the bias–eQTL gene pairs. Boxplots on the right panel show fiber length and homoeologous expression bias level at 20 DPA for the two haplotypes (n = 85 versus 280; two-sided Wilcoxon rank-sum test; centerline, median; box limits, first and third quartiles; whisker, 1.5× interquartile range). c, The heatmap on the left shows switched, time-dependent and dominant bias patterns for 2,658 gene pairs across stages. The heatmap on the right indicates the correlation (Spearman coefficient) of the eQTL effect for 2,658 gene pairs between stages. d, Histogram shows a significant difference (based on one-sided Fisher’s exact test) in the proportion of gene pairs showing dominant expression bias with shared eQTLs or without shared (nonshared) eQTLs. e, Histogram indicates a significant difference (based on one-sided Fisher’s exact test) in the proportion of gene pairs with switched expression bias in specific cis-eQTLs and trans-eQTLs.

To further interrogate the impact of genetic variants on the dynamics of expression bias during fiber development, we clustered the 2,658 gene pairs into three groups (switched, time-dependent and dominant) according to the number of accessions belonging to the patterns of expression bias (Extended Data Fig. 5d; Methods). Gene pairs belonging to the switched group exhibit changes in the direction of expression bias at different stages, which had relatively low correlation of genetic effects of eQTLs across stages (Fig. 4c). The dominant group shows a higher proportion of shared eQTLs, indicating that eQTLs often contribute to this shared expression pattern (Fig. 4d). Interestingly, compared with gene pairs with nonswitched bias of direction, more gene pairs in the switched group had trans-eQTLs instead of cis-eQTLs (Fig. 4e). This may suggest that trans-eQTLs were prone to mediate the dynamics of expression bias direction and cis-eQTLs were more likely to contribute to stable direction of expression bias during fiber development. In addition, by constructing a co-expression network of 16,081 homoeologous genes with expression bias, we found that homoeologous genes in the switched group have lower network connectivity that probably represents simpler regulatory relationships (Extended Data Fig. 5e–j and Supplementary Table 11). These findings may facilitate further understanding of the genetic regulatory dynamics underlying the expression bias of homoeologous genes.

Subgenomic coordination of genetic effect on fiber quality

Fiber quality-related genes were identified in both subgenomes; however, the genetic contribution of their homoeologous copies to fiber quality is not well understood. In this study, we found that a very small proportion of both homoeologous copies (0.8%, 2 of 241) were identified as candidate genes for FL in the TWAS and colocalization analysis. Similar results were obtained for FS (2.7%, 14 of 518), FE (3.7%, 26 of 700) and FU (0.8%, 2 of 240). We identified pseudoregulatory sites by mapping the significant SNPs of each candidate gene to the sequences flanking the homoeologous gene (2 Mb upstream and downstream) in the other subgenome. This analysis showed that few pseudoregulatory sites (2 of 2,442) for the four fiber-quality traits are mutated in the population (Extended Data Fig. 6a–d).

Due to the relative paucity of regulatory variants, we evaluated the contribution of the homoeologous gene of each candidate gene by comparing their expression levels. For each gene that positively regulates fiber quality, we assume the candidate gene and its homoeologous gene have similar regulatory effects when the candidate gene is associated with a trait-beneficial genotype and the homoeologous gene has higher or no differences in expression level (favorable expression). Based on this, the contribution of each homoeologous gene pair to fiber quality was grouped into the following four models: (1) favorable homoeologous pairs, with both favorable genotype and expression; (2) only favorable genotype with unfavorable expression of its homoeologous gene; (3) only favorable expression that has unfavorable genotype of candidate gene and (4) unfavorable homoeologous pairs (Methods). We found that 29.5% of the 133 homoeologous pairs that positively regulate FL have favorable expression, 34.9% do not, 3.1% lack a favorable genotype and 32.6% lack both favorable genotype and expression (Fig. 5a). Among the 106 homoeologous pairs that negatively regulate FL, 27.5% have the favorable state, 19.6% show decreased expression of homoeologous genes, 8.8% do not have a favorable genotype and 44.1% lack both favorable genotypes and expression levels (Fig. 5b). These data show that the majority (67.5–75.8%) of homoeologous genes might be further optimized for fiber improvement (Extended Data Fig. 6e–g). At the module level, module 31 appears to be relatively optimized with respect to homoeologous gene expression, indicating that most modules might be promising targets for improvement through optimizing both homoeologs (Fig. 5c and Supplementary Figs. 1416).

Fig. 5: Relationships between homoeologous expression and fiber quality.
figure 5

a, A total of 133 homoeologous gene pairs positively regulate fiber length. Red, favorable homoeologous pairs (homoeologous gene pairs with both favorable genotype and favorable expression). Yellow, homoeologous gene pairs with only favorable genotype of the candidate gene. Light blue, homoeologous gene pairs with only favorable expression. Dark blue, unfavorable homoeologous pairs (homoeologous gene pairs without either favorable genotype or favorable expression). Heatmap shows the classification of 133 gene pairs in 340 accessions at five timepoints. b, A total of 106 homoeologous genes negatively regulate fiber length. The classification is the same as in a. c, Pie charts of the states of homoeologous genes in eight FL-related modules. The number of FL-related genes is shown in the parentheses for each module. d, Proportional distribution of four types of homoeologous genes in accessions that were sorted by FL from short to long. e, Comparison of bias score of gene pairs in samples with favorable and unfavorable alleles. The pie chart shows the ratio of gene pairs with different patterns of bias score change. The boxplots indicate gene pairs that show a significant difference in bias score. Centerline, median; box limits, first and third quartiles; whisker, 1.5× interquartile range. f,g, Correlation between the number of gene pairs with biased expression, the number of favorable loci and fiber length. Panel f shows the accumulation in a single accession of 155 gene pairs that show a decrease in bias score in e. Panel g shows the accumulation in a single accession of 153 gene pairs that show an increase in bias score in e. The gray band represents 95% confidence interval for the fitted regression.

To assess the impact of aggregating homoeologous pairs with favorable genotype or expression on fiber quality traits, we counted the number of gene pairs represented by the four models above in each accession. For FL, we observed that the number of gene pairs categorized as favorable homoeologous pairs in long-fiber accessions became generally larger than in short-fiber accessions (Fig. 5d). We also found that 81.6% of the TWAS signals (308) corresponding to 231 homoeologous pairs showed changes of expression bias fraction score between accessions with favorable and unfavorable genotypes (Fig. 5e). Interestingly, by counting the bias levels of homoeologous pairs in each accession, we found that with the aggregation of favorable genotypes in long-fiber accessions, the expression bias exhibited the following two patterns: either fewer gene pairs (Fig. 5f) or more gene pairs (Fig. 5g) showed biased expression. The former implicates that only one subgenome has favorable genotypes and the latter might result from neither of subgenomes having favorable genotypes in short-fiber accessions. Similar patterns were observed for the other three traits (Extended Data Fig. 7a–c). These results suggest that the targeted utilization of both subgenomes may enhance the potential for fiber improvement and that aggregation of favorable genotypes has led to changes in the transcriptional regulation of homoeologous genes.

Genomic design for fiber quality improvement

To expand insight into the aggregation of favorable alleles for fiber improvement, we evaluated the effect of domestication selection on fiber quality-associated loci. In this analysis, 3,552 cotton accessions, including 332 landraces and 3,220 cultivars, were collected (Supplementary Table 12)20,23,36,37. A library of favorable alleles was constructed using the fiber quality-associated loci and these 3,552 accessions (Fig. 6a). Overall, cultivars aggregated more favorable alleles compared with landraces (Extended Data Fig. 8a,b and Supplementary Table 13), providing indirect ex post facto confirmation of the utility of focusing on the regulation of homoeologs related to fiber traits.

Fig. 6: Genomic design for fiber quality improvement.
figure 6

a, Phylogenetic tree and genotypes of fiber-related loci in 332 cotton landraces and 3,220 cultivars. b, Trait-associated loci were grouped into four categories according to their favorable allele frequency in the modern cultivar population. c, Number of loci in four categories with different sharing ratios of favorable allele for FL. d, Heritability of the SNPs in each of the four categories for FL. e, Comparison of favorable allele frequency in the four categories between landrace and cultivar population. The frequency difference above 0.6 was highlighted in green and yellow. f, Favorable allele frequency of loci that were highlighted in green color in e. g, Comparison of fiber traits (FL, FS and FE) and expression levels of Ghir_A01G013620 in accessions with different genotypes (CC and AA). Two-sided Wilcoxon rank-sum test. Centerline, median; box limits, first and third quartiles; whiskers, 1.5× interquartile range. h, Correlation between real FL values and FL breeding value (BV) estimates calculated by ridge regression. i, Model predictability tested by applying the training model in a to an external dataset (1,040 accessions) for FL. j, Distance of real FL (blue dots) and FL BV estimates (red dots) of the accessions to the estimated best value of FL. k, The number of FL-associated genes and the number of the homoeologous genes whose expression was correlated or not correlated with FL. l, Fivefold cross-validated prediction accuracy using ridge regression model based on two predictive sets. One contains genotype and the other contains both genotype and the expression of corresponding homoeologous genes. Hundred replications were run. Circle, mean; error bar, mean ± s.e. m, Genomic-guided donor parent selection. The heatmap (left) shows the level of genetic improvement. Each row represents the accession to be improved and each column represents the donor. Right, the utilization level of FL/FS-associated loci in the eight genetic modules or the homoeologous genes corresponding to these loci of S161 and S003. The loci or the homoeologous genes of S161 that could be improved by S003 were listed at the bottom.

We calculated the sharing ratio of favorable alleles in 3,220 modern cultivars and grouped the trait-associated loci into four categories, C1–C4. From C1 to C4, the utilization level of favorable alleles in modern cultivars increased (Fig. 6b–d and Extended Data Fig. 8c,d). We also compared the differences of the four categories between cultivars and landraces (Fig. 6e and Extended Data Fig. 8e). Based on the differences in allele frequency (>0.6), we identified 91 selectively favorable alleles in cultivars (Supplementary Table 14). We also identified 48 loci for which the favorable alleles were under purifying selection, indicating that human selection did not always lead to the aggregation of favorable alleles (Fig. 6f). We observed no linkage drag effects for the loci under purifying selection, and the effect of these loci was no different to that of preferentially aggregated loci in cultivars (Supplementary Fig. 17). Nevertheless, we found that many of the genes corresponding to the loci under purifying selection showed pleiotropic effects (27/48; Supplementary Table 14). For example, Ghir_A01G013620, encoding a nicotinamide adenosine dinucleotide (phosphate) (NAD(P))-linked oxidoreductase superfamily protein, was characterized as a representative candidate for fiber development. The expression of this gene at 4 and 8 DPA was positively correlated with FL, FS and FE (Fig. 6f,g). This type of functional implication suggests that future fiber improvement efforts should consider enabling the use of these alleles.

A strong linear relationship was observed between the total number of favorable alleles and the phenotype, indicating that combining favorable alleles in elite lines is an effective way to design favorable cotton cultivars (Extended Data Fig. 9a). Ridge regression was used to estimate the overall effect of the trait-associated loci (Extended Data Fig. 9b,c). The correlation coefficient between predicted and observed values for FL reached 0.77, indicating these loci are predictive (Fig. 6h). The accuracy of the model was evaluated using an external GWAS cohort (n = 1,040)37, and the correlation coefficient reached 0.47, which was similar to the accuracy of ridge regression best linear unbiased prediction (rrBLUP) by using the SNP set filtered through linkage disequilibrium (LD) clumping method (GWAS P value threshold = 0.001; Fig. 6i and Supplementary Fig. 18)38. We also predicted the best value of FL, which was 35.55 mm, much longer than the normally field-produced 28 mm fibers. The distance of the 376 accessions to the predicted best FL ranged from 2.88 mm to 11.64 mm (Fig. 6j).

Because the expression of the homoeologous copies of trait-associated candidate genes may affect the phenotypes, we hold the view that fiber quality-related traits can be further improved by modifying the expression of homoeologous genes. To estimate the overall contribution of homoeologous gene expression, we integrated the expression of the homoeologous genes that were correlated with fiber quality-related traits in at least one developmental stage into the predictive model. This gave an increased prediction accuracy of 8% for FL, 10% for FS, 5% for FE and 6% for FU (Fig. 6k,l and Extended Data Fig. 9d). These data pinpoint the importance of the involvement of homoeologous genes and expression levels in genomic breeding.

To enable genomic design for FL and FS improvement, we evaluated the trait-associated loci, as well as the state of utilization of the corresponding homoeologous genes that could be improved by other accessions (Fig. 6m and Supplementary Tables 1518). As a proof of concept, we present the improvement degree matrix of 18 cotton accessions to clarify the donor parent selection process. The accessions S003 and S265 would be the most commonly used donor parents in the first crossing. S161 could be improved at up to 165 loci by S003 for FL and 381 loci for FS, and as a consequence, the genomic-estimated breeding value could be increased by 7.9% for FL and 26.7% for FS. We also present the degree of utilization of S161 and S003, considering both genetic modules and homoeologous genes. A total of 80 and 197 loci (or the homoeologous genes) of S161 could be improved by S003 in the eight genetic modules for FL and FS, respectively (Fig. 6m).

Discussion

Precise spatiotemporal regulation of gene expression by both cis-regulatory sequences and trans-acting factors is required for developmental programs in higher organisms39. In this study, we characterized potential cis- and trans-regulatory variants of gene expression across different stages of cotton fiber development by eQTL mapping, which provides a rich resource for the community to identify genetic regulatory components associated with fiber quality. We show that a large proportion of eQTLs showed stage-dependent regulatory effects, similar to the increasing number of observations that many regulatory variants are not associated with gene expression at a steady state21,40,41,42. This finding suggests that the genetic effect of genes on fiber quality should be evaluated at a specific developmental stage. Future studies might explore the functional implications of cis-eQTLs represented by variants in transcription factor binding sites, and also ascertain whether key transcriptional factors are mutated in trans-eQTLs, the latter requiring the implementation of a cotton Encyclopedia of DNA Elements project43.

In polyploid cotton, the effect of cis- and trans-interactions that lead to both intrasubgenomic and intersubgenomic interactions on the expression of duplicated (homoeologous) genes is largely unresolved8,13. We show that the transcriptional dysregulation mediated by cis-regulatory variants in a certain subgenome may give rise to homoeologous expression bias. This phenomenon may be addressed further from an evolutionary viewpoint to understand the phylogenetic timing and extent of homoeologous gene expression divergence relative to both polyploidization and cotton domestication. We further show that widespread partitioned expression of homoeologous genes during fiber development is accompanied by only 30.2% of fiber quality-related homoeologous copies with both favorable aggregations during breeding. This suggests the exciting prospect that fiber quality may be improved through subgenome optimization, harnessing genomic modifications of genetic regulatory loci associated with homoeologous expression partitioning. This study highlights that the dissection of the genetic basis of agronomic traits or identification of functional genes should consider subgenomic counterparts in polyploid crops. We note that although we do not provide experimental evidence demonstrating that subgenome optimization in breeding programs actually has improved fiber quality, we point to this important consideration here. In summary, genetic dissection of the regulation of homoeologous expression partitioning may lead to both fundamental and applied advances with respect to polyploidy and crop improvement.

Methods

Plant materials

A population of upland cotton (G. hirsutum) containing 376 diverse accessions was cultivated in the field at Huanggang (114°55′22″E, 30°34′19″N), Hubei, China in 2019. Three replicates of each accession were cultivated on three separate plots. Cotton leaf samples of each accession from one plant were collected for genomic DNA extraction using the cetyltrimethylammonium bromide (CTAB) method44. Cotton bolls were marked on the day of flowering (0 DPA) for transcriptomic analysis. Ovules at 0 DPA and developing fibers at 4, 8, 12, 16 and 20 DPA were collected from at least ten cotton bolls of different plants in two rows and stored in liquid nitrogen until RNA extraction (see below). Mature cotton fibers were collected from each plant for fiber quality measurement.

DNA sequencing and SNP calling

Purified genomic DNA was sonicated to about 250–300 bp, followed by pair-end repairing. Resequencing libraries were constructed following manufacturer protocols for the MGISEQ-T7 platform (Pair-end 150 bp reads; including 373 new DNA sequencing data and three published DNA sequencing data20: accession S112, SRR4018975; S123, SRR4018977; S280, SRR4018974). Fastp software (v0.23.0) was used to trim the adapter sequence and remove low-quality reads45. Variant calling was accomplished following the Sentieon pipeline (License 201808.07)46. In brief, Sentieon invoked BWA to align the clean sequencing reads to TM-1 reference genome47 following customized parameter (mem -M -K 10000000)48. SAMtools (v1.9) was used to eliminate duplicated and low-quality reads alignment49. We performed SNP calling in two different ways. First, Sentieon Haplotyper algorithm (--algo Haplotyper --genotype_model multinomial --emit_conf 30 --call_conf 30) was used to process variant calling for each sample. The global variation file (GVCF) for each sample was generated by GVCFtyper algorithm (–algo GVCFtyper –emit_mode gvcf) in Sentieon, followed by joint calling to merge all variations into an integrated VCF file. Only the variations identified by both algorithms and covered by more than five sequencing reads were regarded as high confident and were retained.

RNA sequencing and data analysis

For the cotton ovule and fiber samples, RNA was extracted using the RNAprep Pure Plant Kit (polysaccharides and polyphenolics-rich). In total, 2 μg RNA was used to construct sequencing libraries using VAHTS Universal V6 RNA-seq Library Prep Kit (Vazyme, NRM604-02) that were sequenced on an MGISEQ-2000 platform (pair-end 150 bp reads). After removing adaptors and clipping low-quality bases with Trimmomatic (v0.36), clean reads were mapped to the reference genome of TM-1 using HISAT2 (v2.1.0)50,51. SAMtools (v1.9) was used to remove PCR duplicates and reads with mapping quality less than 20 (ref. 49). The remaining reads were used to calculate the expression level (FPKM) of genes with StringTie (v2.1.4)52.

Expression bias analysis of homoeologous genes

For each accession at each stage, the expression change of homoeologous gene pairs was defined if at least one gene was expressed (FPKM > 0.1) and both genes exhibited an expression difference of at least twofold, which was tested under different conditions (Supplementary Table 3). For each expressed homoeologous gene pair in all accessions, if the number of accessions with expression bias (at the single accession level) was greater than 5% of all accessions, this gene pair was identified as biased homoeologous gene pair. For each homoeologous gene pair with biased expression, we compared the expression level at each timepoint between the At and the Dt in all accessions using a two-sided Wilcoxon rank-sum test and corrected the P value using the Benjamini–Hochberg method. Homoeologous gene pairs with expression bias in at least 5% accessions and FDR ≤ 0.05 were considered biased homoeologous gene pairs at the population level. In this analysis, we classified the expression bias of gene pairs into two kinds, that is, bias-At and bias-Dt, with observed expression bias toward the At or Dt subgenome. For gene pairs showing expression bias toward the At or Dt subgenome in less than 5% of samples or with two bias directions, they were defined as bias-N or bidirectional-bias, respectively. In total, we identified 29,690 gene pairs categorized as bias-At, 31,604 as bias-Dt and 1,667 as bidirectional-bias in all six timepoints of fiber development.

Peer factors for eQTL mapping

To account for the hidden batch effect and other global confounders, we used the probabilistic estimation of expression residuals (PEER, v1.3) method to estimate hidden covariates for gene expression levels for each stage53. This method can also increase the detection power of eQTL mapping by accounting for these covariates in the analysis. A set of PEER factors were tested, and the number of PEER factors was selected to maximize the number of mapped eGenes. Due to the similar number of samples at each timepoint, 20 PEER factors were chosen in six timepoints for the correction of hidden covariates (Supplementary Fig. 1).

Cis-eQTL mapping

To identify eQTL for genes in fiber development, genes with low expression levels (FPKM < 0.1) in more than 95% of samples were filtered, and the expression levels of remaining genes of all samples from a given stage were normalized using an inverse normal transformation. Cis-eQTL mapping was performed using FastQTL (v7)54, and the top three genetic principal components and 20 PEER factors were used as covariates52. By testing the association with variants within ±1 Mb of the transcription start site (TSS) for each gene, an adaptive permutation was used with the setting --permute 1000 10000. The significant variant–gene associations were identified by applying gene level nominal P value thresholds corresponding to FDR < 0.05. The significant SNP with the strongest association signal was defined as the lead SNP for each association. We subsequently performed a forward–backward stepwise regression analysis to identify multiple independent cis-eQTL signals for a given expression phenotype. The stepwise regression procedure was implemented in the conditionally independent QTL module of the software tensorQTL (v1.0.5), as described previously55.

Trans-eQTL mapping

For trans-eQTL mapping, we used the same covariates as for cis-eQTL analysis. The FAST-LMM (v0.2.32) program was used to perform GWAS of each gene and the whole genomic SNPs (MAF > 0.05)56. Variant–gene pairs with P value less than 3.76 × 10−7 were considered significant, and the significant variants for each eGene were grouped into clusters with a maximum distance of 10 kb between two consecutive SNPs, and only those clusters with more than three SNPs were considered as a putative eQTL. We identified a total of 46,749 significant variant–gene associations in six stages, of which 22,938 had target eGenes with variants within ±1 Mb of the genic region. Because the corresponding variants were highly correlated with variants identified in FastQTL, we only kept the variant–gene associations where the variants were found out of the 1 Mb region of the eGene.

Comparison of eQTL effect between stages

To estimate the cross-stage activity of cis-eQTLs, we used Multivariate Adaptive SHrinkage (as implemented in the R package mashr (v0.2.45)) for all cis-eVariant–gene pairs across stages27. This analysis was performed using the β coefficients and s.e. of each eQTL as input, and nine SNPs were randomly selected from the ±1 Mb region of each eGene. A total of 575,820 variant–gene pairs were used to fit the MASHR model. Effect size estimates and LFSR (that is, LFSR) outputted by mash were used as metrics of cis-eQTL magnitude and significance. An LFSR ≤ 0.05 was used as the threshold of significant cis-eQTL activity. To estimate the activity of trans-eQTL across stages and test the robustness of mash, we used a meta-analysis approach as implemented in MetaSoft (v2.0.1) to calculate the posterior probability and m value that the eQTL effect exists in each stage57. An m value ≥ 0.9 was used as the threshold of significant eQTL activity. To assess the stage specificity of eQTL, we paid attention to significant eQTL identified at different stages that may be in the same LD genomic interval. To reduce duplicate statistics, significant eQTLs for each gene were merged based on the lead SNPs with r2 ≥ 0.6 and distance ≤100 kb. For each eQTL, we focused on its lead SNP, and the effect of stage-shared eQTL was defined as the average effect of lead SNP in all stages.

Detection of bias–eQTL

To identify significant variants associated with the expression bias of homoeologous gene pairs, the expression bias level in each accession of each stage was quantified into a biased score as shown below. The association between SNPs and bias score was performed using the factored spectrally transformed linear mixed models (FAST-LMM) program (Supplementary Fig. 12)56.

$${\mathrm{Bias}}\,{\mathrm{score}}=\frac{{\mathrm{At}}-{\mathrm{Dt}}}{{\mathrm{At}}+{\mathrm{Dt}}}$$

We applied a strict filter of homoeologous genes with expression bias—biased expression was present in at least 5% but no more than 95% of all samples, and the expression of the homoeologous genes differed by twofold change of FPKM. After filtering, an average of 5,579-6,563 homoeologous gene pairs in each stage were used for GWAS analysis. We clustered the significant SNPs according to the method described previously58. SNP clusters were further merged based on LD (r2 ≥ 0.6) and distance (≤100 kb) in the genome. Across six timepoints, a total of 14,133 bias-eQTLs were identified for 4,026 homoeologous gene pairs. Genes regulated by eQTLs were significantly enriched in homoeologous genes with expression bias, so we compared the effect of all bias-eQTLs to their effect in eQTL analysis. The effect size was estimated by the β coefficient and s.e. of lead SNP in each bias–eQTL. When calculating the effect of bias–eQTL in regulating the expression of Dt-subgenomic genes, the effect value is correspondingly multiplied by −1.

Analysis of heritability of gene expression

To estimate the variance of expression bias level explained by different genomic regions, genome-wide SNPs were divided into LD-friend SNPs (that is, target SNPs that are in significant LD) of cis-SNPs and trans-SNPs according to the eQTLs of each homologous gene pair. For LD-friend SNPs, GCTA-LDF (v1.94.0) was performed with parameters --ld-wind 100 and --ld-sig 0.05 to search for SNPs that are in LD with the lead SNP59,60. Cis-SNPs and trans-SNPs are distinguished according to whether the distance between SNP and TSS is more than 1 Mb. Each set was used to build a kinship matrix using the direct method and --power −0.25 in LDAK (v5.2)61. Considering the fact that different genome regions have different LD levels, the LDAK weightings model was used to equalize the tagging of SNPs in the genome. The genetic variance was then calculated for these SNPs using the restricted maximum likelihood (REML) model and –mgrm61.

Genome-wide association analysis

For GWAS, we used a total of 2,658,921 high-quality SNPs (MAF > 0.05) from the 376 accessions. Association analysis was carried out using a linear mixed model implemented in FAST-LMM for four fiber quality-related traits, including FL, strength, elongation and uniformity. The population structure was inferred using STRUCTURE software (k = 3, SNP number = 5,000) and included in the model as covariates. The kinship matrix was calculated based on all SNPs using FAST-LMM56. The threshold for genome-wide significance was set as 3.76 × 10−7 (1/n, where n is the total number of genomic SNPs). The significant SNPs were first grouped into one locus if two adjacent SNPs were within a 20-kb interval. Consecutive loci were further merged into a single locus if SNPs with the lowest P values in each of the adjacent loci were located in LD regions (r2 ≥ 0.6).

TWAS and colocalization analysis

To identify associations between gene expression and fiber quality-related traits, we conducted TWAS using the FUSION package28. Briefly, we constructed a standard binary PLINK format file for each gene using SNPs within 500 kb on either side of the gene boundary with the integration of log-transformed expression data. Then, the script FUSIONcompute_weights.R (https://github.com/gusevlab/fusion_twas) in the FUSION package28 was used to compute the expression weights for each gene, taking the binary PLINK format file as input. Five models (BLUP, BSLMM, LASSO, Elastic Net and top SNPs) were used in this step and the effect sizes from these models acted as weights. The script FUSION.test.R (https://github.com/gusevlab/fusion_twas) was run to perform the typical TWAS analysis, with the computed gene expression weights, GWAS summary statistics and an LD reference panel that was constructed using the SNPs matching the GWAS SNPs. We considered genes with corrected P values < 0.05 as significant associations.

To detect the shared causal variants between eQTL and GWAS signals, we performed a colocalization analysis of cis-eQTLs from five fiber developmental stages and GWAS loci of four fiber quality traits using coloc29. The cis-eGenes within 1 Mb of the GWAS loci were extracted for colocalization analysis. The function coloc.abf was used to calculate posterior probabilities for PP.H4 that shows both traits (gene expression and GWAS phenotype) share a single causal variant. A cis-eGene was defined as having evidence of colocalization when the posterior probability of colocalization (PP.H4) was higher than 0.8. We implemented SMR approach to test if an eGene is associated with a trait through eQTL30. SMR (v1.03) implements the heterogeneity in dependent instruments (HEIDI)-outlier test to distinguish causality or pleiotropy from linkage. SMR associations were declared significant if the Bonferroni-corrected SMR P < 0.05 and HEIDI-outlier test P > 0.05.

Definition of biased expression pattern in fiber development

To analyze the dynamic expression bias of homoeologous genes, we explored whether the direction (the At or Dt subgenome) of expression bias in each accession was consistent over six timepoints. For each gene pair, all cotton accessions were classified into the following eight classes according to the biased expression patterns in fiber development: class 1 indicates accession with the same bias direction across six timepoints; class 2 indicates accession with the same bias direction across five timepoints while having no bias in the remaining timepoint; class 3 indicates accession with the same bias direction across four timepoints while having no bias in the remaining timepoints; class 4 indicates accession with the same bias direction across three timepoints while having no bias in the remaining timepoints; class 5 indicates accession with the same bias direction across two timepoints while having no bias in the remaining timepoints; class 6 indicates accession with bias at a specific timepoint while having no bias in the remaining timepoints; class 7 indicates homoeologous genes had balanced expression in all six timepoints (without expression bias) or class 8 indicates the direction of expression bias is different in at least two timepoints. For each gene pair, the number of accessions in each classification was counted. According to the number of accessions in each classification, 2,658 gene pairs were sorted using unsupervised hierarchical clustering (Extended Data Fig. 5d). For the gene pairs of the switched cluster, most of their accessions are presented in the bias expression of class 8. For the gene pairs of the dominant cluster, most of their accessions show the same bias direction across six timepoints (class 1). For the gene pairs of time-dependent cluster, most of their accessions have no bias expression in at least one timepoint (classes 2–7).

Genetic network construction

To explore the regulatory landscape of candidate genes during fiber development, we used eQTLs to constitute a genetic network that integrated regulatory relationships from all timepoints. This network was composed of eQTLs, eQTL hotspots and candidate genes identified in TWAS or colocalization analysis. To describe the dynamic eQTL regulation, eQTLs in all timepoints were merged when the distance between lead SNPs was less than 50 kb and r2 was greater than 0.6. We obtained 16,914 merged eQTLs from 53,854 cis-eQTLs and 23,811 trans-eQTLs from six timepoints. eQTL hotspots were identified using the HOT_SCAN program at six timepoints62 and then merged when the distance between adjacent hotspots was less than 20 kb. We identified a total of 463 hotspots across six timepoints and 406 hotspots across five fiber developmental timepoints. The visualization and module division of the network was accomplished by the built-in program in Gephi (v0.9.5)63. The principle of modularization was to maximize the connectivity between nodes in the same module and minimize the connectivity between nodes in the different modules. Each module consists of a collection of genes related to fiber quality (TWASFDR < 0.05 or COLOCPP. H4 > 0.8). Modules with heritability ≥0.05 and gene counts ≥5 were used for multiple regression of gene expression to phenotypic values across timepoints. For each module, the degree of correlation between gene expression and phenotype at a given timepoint was indicated by the magnitude of the normalized r2 value. We identified 129 loci with higher favorable allele frequency (>0.2) in the top 50 accessions with the longest fiber. To determine whether the modules were enriched for the 129 loci, one-sided Fisher’s exact test was performed in R.

Conception of subgenome coordination model

In this study, favorable genotypes refer to those identified by TWAS or colocalization analysis and corresponding to higher trait values. For candidate genes that positively regulate phenotype, favorable expressions refer to cases where homoeologous genes have higher expression than candidate genes (with twofold expression change) or no differences. For candidate genes that negatively regulate phenotype, favorable expressions refer to cases where homoeologous genes have lower expression than candidate genes (with twofold expression change) or no differences. We classified the status of a pair of homoeologous genes in all accessions into the following four models: favorable homoeologous pairs, only favorable genotype, only favorable expression and unfavorable homoeologous pairs. Based on the above classification, a pair of homoeologous genes in accessions at all timepoints was classified—(1) favorable homoeologous pair, judged as favorable homoeologous pair in ≥50% accessions at all timepoints; (2) only favorable genotype, judged as favorable homoeologous pair or only favorable genotype in ≥50% accessions at all timepoints but did not meet the first condition; (3) only favorable expression, judged as favorable homoeologous pair or only favorable expression in ≥50% accessions at all timepoints but did not meet the first condition; (4) unfavorable homoeologous pairs, the remaining gene pairs that did not meet the above three conditions.

Heritability estimation

To evaluate how the heritability changes when considering more trait-related SNPs, we first took subsets from all SNPs at different sizes by using the sample() function in R and then used the LDAK-Thin model to estimate the heritability contributed by these SNP subsets61. The LDAK-Thin assumes the expected heritability contributed by an SNP is higher for SNPs in regions with lower levels of LD and for those with higher MAF. The process of thinning the genetic variants cameters ‘--window-kb 100 and --window-prune 0.98.’ Kinship matrix was calculated for each SNP subset using the main argument ‘--calc-kins-direct’ by setting the power equal to 0.25. Heritability contributed by each kinship matrix was estimated separately using a generalized REML solver that was included in LDAK61.

Genomic prediction using ridge regression

In this study, ridge regression, a linear regression with penalty, was used to reduce overfitting64. The predictor variables were the number of superior alleles at each locus, encoded as 0, 1 or 2. The responding variable was the real phenotype. The function cv.glmnet() in glmnet package in R was used to fit the ridge regression model by setting α equal to 1 and the cross-validation will be automatically performed. The value of λ (regularization parameter) that gives cross-validated minimum mean error was selected to obtain the model coefficients65.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.