Comprehensive RNA sequencing and co-expression network analysis to complete the biosynthetic pathway of coumestrol, a phytoestrogen

Coumestrol (CMS), a coumestan isoflavone, plays key roles in nodulation through communication with rhizobia, and has been used as phytoestrogens for hormone replacement therapy in humans. Because CMS content is controlled by multiple genetic factors, the genetic basis of CMS biosynthesis has remained unclear. We identified soybean genotypes with consistently high (Daewonkong) or low (SS0903-2B-21-1-2) CMS content over 2 years. We performed RNA sequencing of leaf samples from both genotypes at developmental stage R7, when CMS levels are highest. Within the phenylpropanoid biosynthetic pathway, 41 genes were tightly connected in a functional co-expression gene network; seven of these genes were differentially expressed between two genotypes. We identified 14 candidate genes involved in CMS biosynthesis. Among them, seven were annotated as encoding oxidoreductases that may catalyze the transfer of electrons from daidzein, a precursor of CMS. Two of the other genes, annotated as encoding a MYB domain protein and a MLP–like protein, may increase CMS accumulation in response to stress conditions. Our results will help to complete our understanding of the CMS biosynthetic pathway, and should facilitate development of soybean cultivars with high CMS content that could be used to promote the fitness of plants and human beings.


RNA-seq and DEG profiling.
To identify the differences in gene expression involved in phenylpropanoid biosynthesis between Daewonkong and SS0903-2B-21-1-2, we extracted total RNA from leaf tissues at developmental stage R7. A total of 237 and 245 million reads, encompassing 24 and 25 Gb, respectively, were generated per genotype and about 70% and 75% of total reads were properly mapped to the G. max reference genome sequence (www.phytozome.net/soybean) 43 (Fig. 2A). The DEGs were annotated against the AgriGO genome locus background, and metabolic process was the most enriched GO term in all four DEG sets ( Fig. 2C-F). KEGG pathway analysis revealed that ~40% and ~30% of DEGs were assigned to metabolic pathways and biosynthesis of secondary metabolites, respectively ( Supplementary Fig. 2). These results indicate that the secondary metabolic pathways are differentially regulated between the two genotypes.

Identification of candidate genes for CMS biosynthesis.
To identify candidate genes for biosynthesis of CMS from daidzein, we applied three prediction approaches (Supplementary Table 4). First, using the 41 genes in the network as guide genes (Fig. 3), we searched the whole soybean gene network of 40,812 genes for genes closely connected to the guide genes ("guide prediction"). Second, the genes closely connected to the seven DEGs involved in phenylpropanoid biosynthesis were identified using the same network. Third, candidate genes were predicted in the context of subnetworks consisting of central hubs and their neighbors, and connections between the hub genes and the seven DEGs were identified ("hub prediction") (Supplementary Table 4) 44 . We then listed the top 20 genes predicted from each approach, along with their GO terms from three different databases. The three methods identified 3, 11, and 9 genes as DEGs, all of which were up-regulated in Daewonkong ( Supplementary Fig. 5). Overall, 14 genes were predicted from the three prediction approaches, of which seven were identified by two or more approaches (Supplementary Table 5). These 14 candidate genes may play a key role in determining CMS contents in soybean.

Validation of gene expression level by qRt-pCR.
We validated the expression levels of the 14 DEGs by quantitative reverse-transcription (qRT) PCR (Fig. 4). The qRT-PCR results were consistent with the RNA-seq data: all 14 DEGs up-regulated in Daewonkong in the RNA-seq data were also up-regulated in the qRT-PCR results ( Supplementary Fig. 5).

Discussion
Soybean is one of the most important crops in the world due to its high production of protein and oil. In addition, it is a valuable nutraceutical ingredient because it contains several phytochemicals, including isoflavones, saponins, phenolic acids, and linoleic acids. In light of their contribution to human health and plant defense systems, these phytochemicals, especially isoflavones, are desirable target traits in soybean breeding programs [48][49][50][51][52][53] . Isoflavones, synthesized predominantly in legumes, attract rhizobial bacteria, initiate nitrogen-fixing root nodule formation, exert antifungal activity, and serve as metabolic precursors for major phytoalexins 5,6,54,55 . CMS, a coumestan isoflavone, decreases the risk of breast cancer by binding selectively to ERβ 23,24,26 . Therefore, CMS is a promising phytoestrogen for use as a selective estrogen receptor modulator (SERM) 26 .
The extreme variability of isoflavone contents among different environments has hindered elucidation of the genetic basis of isoflavone biosynthesis [56][57][58] . Indeed, in this study, even though plant samples were prepared at the same location over 2 years in three biological replicates in each year, variations in CMS content were observed between 2016 and 2017, indicating that this trait is environmentally sensitive (Supplementary Fig. 1, Fig. 1A). Therefore, we only used leaf samples from genotypes with relatively consistent levels of CMS over both years. Moreover, to ensure that leaf tissues containing the highest levels of CMS were used for RNA-seq, the CMS content in Daewonkong was measured at two different time points at developmental stage R7 in each year of cultivation (Fig. 1B); CMS content dramatically increases after the reproductive stage and peaks at R7 42 . In 2016, CMS content increased about 3.5% as the leaves matured at R7, but in 2017 it increased about 41% only during R7. The Daewonkong leaf samples used for the second measurement in 2017, which had the highest CMS, were used for RNA-seq. The other genotype, SS0903-2B-21-1-2, exhibited consistently low CMS content in both cultivation years.
The legume-specific isoflavonoid pathway includes several side pathways that overlap and compete with each other, and share daidzein as a central metabolite 62 . Among homologous genes involved in isoflavonoid biosynthesis with daidzein as the common precursor, four (Glyma.08G089500, Glyma.10G176700, Glyma.13G173600, and Glyma.16G149300) were connected in the co-expression network. Glyma.08G089500, Glyma.10G176700, and Glyma.13G173600 were minimally expressed in both genotypes, but Glyma.16G149300 was detected as a DEG. Glyma.16G149300 was down-regulated in Daewonkong, consistent with the higher level of CMS accumulation in this genotype (Fig. 3).
Other than the candidate genes that catalyze the NADPH oxidation reactions starting from daidzein, there are seven more candidate genes involved in the biosynthesis pathway of CMS. MYB domain protein 15 (Glyma.02G005600) and major latex protein (MLP)-like protein 423 (Glyma.17G030400) are annotated as involved in stress responses. Because the regulation of isoflavonoid metabolism is thought to occur primarily at the level of transcription, transcription factors (TFs) are promising candidates 63 . Multiple MYB TFs regulate the expression of structural genes involved in isoflavone biosynthesis under stressed conditions [64][65][66] . MLP-like protein modulates the production of metabolites under drought stress conditions, and overexpression of MLP leads to salt stress insensitivity 67,68 . This agrees well with reports that CMS increases drought stress through communication with mycorrhiza and CMS accumulates the most at R7 growth stage when soybean is dehydrated while maturation 33,42,69 . Thus, these two genes may induce accumulation of CMS under stress conditions. For candidate DEGs unmapped to any GO term, Glyma.01G135200 and Glyma.13G285300 are annotated as cytochrome P450, an oxidoreductase, that is reported to be involved in isoflavonoid biosynthesis 70,71 . Glyma.11G004200, annotated as a member of the alpha/beta-hydrolase superfamily, may play a role in the hydrolysis reaction of CMS biosynthesis. 2-hydroxyisoflavanone dehydratase, which catalyzes a dehydration reaction yielding isoflavone from 2-hydroxyisoflavanone, is a member of the carboxyesterase family (Glyma.02G134000) 72 .
To date, although no evidence for their direct roles in isoflavone biosynthesis has been reported, these genes could also affect the biosynthetic pathway of CMS in as-yet-undiscovered ways (Fig. 5).
The functions of the candidate genes remain to be experimentally verified. The biosynthetic pathways responsible for production of anthocyanins, another class of soybean flavonoid metabolites, have been well studied due to the importance of these compounds for human health and the cosmetic industry. Anthocyanin content in soybean seeds is determined by six loci, and all genes corresponding to these loci have been isolated [73][74][75][76][77][78] . In addition, the key enzymes could be characterized by forward genetic approaches because anthocyanin levels can be easily distinguished based on the color of the seed coat or flower. However, because variations in CMS content do not cause any visible phenotypic variation, identification of the candidate genes involved in CMS biosynthesis is essential for study of the biosynthetic pathway using reverse genetic approaches. Knockout or down-regulation of each candidate gene, or multiple genes at the same time, would enable characterization of their effects on CMS accumulation. In vitro conversion of intermediate products with candidate enzymes might also be a good approach to characterizing the role of each enzyme in CMS biosynthesis.
In summary, to determine the genetic basis of CMS accumulation, we sequenced RNA samples from two genotypes, Daewonkong and SS0903-2B-21-1-2, which had consistently high and low CMS content, respectively. Using a co-expression network database and key DEGs identified in the iso/flavonoid biosynthetic pathway, we identified genes that might play important roles in CMS accumulation. Our results provide a valuable resource to help elucidate CMS biosynthesis in soybean and develop soybean cultivars with desired CMS contents, with the aim of improving plant defense and human health. Future research will focus on functional validation of the identified genes and complete characterization of the CMS biosynthetic pathway.  Table 1).

DeGs and enrichment analysis. Fragments Per Kilobase of transcript per Million mapped reads (FPKM)
values were calculated by mapping raw RNA reads for 56,044 genes to the G. max reference genome annotation data (Gmax_275_Wm82.a2.v1.gene.gff3) using the Tuxedo software suite 79 . DEGs were defined as genes with a log 2 fold change (FC) ≥ 2 between two samples in pairwise comparisons for three replications with p-value < 0.05 ( Fig. 2A). FPKM values < 1 were converted to 1 for the purpose of calculating FC. Sets of DEGs with log 2 FC ≥ 2, 3, 4, or 5 were used for Gene Ontology (GO) enrichment analysis using the Singular Enrichment Analysis (SEA) tool, available at agriGO (http://bioinfo.cau.edu.cn/agriGO/), and KEGG (Kyoto Encyclopedia of Genes and Genomes) ontology (http://www.genome.jp/kegg/tool/map_pathway1.html) (p < 0.05) ( Supplementary Fig. 2). The G. max reference annotation (Wm82.a2.v1) was used as a background reference for enrichment analysis. Candidate gene prediction. New candidate genes in the phenylpropanoid biosynthetic pathway were identified using the soybean co-expression network consisting of 1,940,284 co-functional links between 40,812 soybean genes constructed based on 734 microarrays and 290 RNA-seq data from soybean, available at http://www. inetbio.org/soynet/ 44 . First, new candidate genes in the pathway were identified by searching genes closely connected to the 41 known genes in the pathway as guide genes in the co-expression network ("guide prediction"). Second, the seven DEGs among the 41 genes in the pathway were used as guide genes for the same function. The genes were prioritized according to the sum of their log-likelihood scores, and highly ranked genes were considered good candidates for new members of the pathway 44 . Third, DEGs was used to identify new candidate genes through subnetworks consisting of a central hub and their neighbors ("hub prediction"). If significant overlap was observed between DEGs and neighbor genes of a subnetwork, the central hub of the subnetwork was identified as a candidate gene. Candidate genes were prioritized according to their p-values.
qRt-pCR validation of DeGs. Gene-specific primers for qRT-PCR analysis were designed based on the nucleotide sequences of selected DEGs using Primer3 (http://primer3plus.com/) (Supplementary Table 2). cDNA was synthesized using an iScript ™ cDNA Synthesis Kit (Cat. 170-8891; Bio-Rad, Hercules, CA, USA). qRT-PCR was conducted using an iQ ™ SYBR Green Supermix kit (Cat. 170-8882; Bio-Rad) on a LightCycler ® 480 (Roche Diagnostics, Laval, QC, Canada). Actin was used for normalization of target gene expression, and each sample was analyzed in triplicate. Relative gene expression was analyzed based on the reference gene as previously described 80 . Student's t-test was performed to determine whether differences were statistically significant (p < 0.05).

Data Availability
The raw RNA sequencing reads were deposited at NCBI SRA.