PoplarGene: poplar gene network and resource for mining functional information for genes from woody plants

Liu, Qi; Ding, Changjun; Chu, Yanguang; Chen, Jiafei; Zhang, Weixi; Zhang, Bingyu; Huang, Qinjun; Su, Xiaohua

doi:10.1038/srep31356

Article
Open access
Published: 12 August 2016

Qi Liu¹,
Changjun Ding¹,
Yanguang Chu¹,
Jiafei Chen¹,
Weixi Zhang¹,
Bingyu Zhang¹,
Qinjun Huang¹ &
…
Xiaohua Su^1,2

Scientific Reports volume 6, Article number: 31356 (2016)

Abstract

Poplar is not only an important resource for the production of paper, timber and other wood-based products, but it has also emerged as an ideal model system for studying woody plants. To better understand the biological processes underlying various traits in poplar, e.g., wood development, a comprehensive functional gene interaction network is needed. Here, we constructed a genome-wide functional gene network for poplar (covering ~70% of the 41,335 poplar genes) and created the network web service PoplarGene, offering comprehensive functional interactions and extensive poplar gene functional annotations. PoplarGene incorporates two network-based gene prioritization algorithms, neighborhood-based prioritization and context-based prioritization, which can be used to perform gene prioritization in a complementary manner. Furthermore, the co-functional information in PoplarGene can be applied to other woody plant proteomes with high efficiency via orthology transfer. In addition to poplar gene sequences, the webserver also accepts Arabidopsis reference genes as input to guide the search for novel candidate functional genes in PoplarGene. We believe that PoplarGene (http://bioinformatics.caf.ac.cn/PoplarGene and http://124.127.201.25/PoplarGene) will greatly benefit the research community, facilitating studies of poplar and other woody plants.

Introduction

Woody plants, especially long-lived forest trees, provide large amounts of biomass, serving as vital raw materials for renewable energy production and other valuable commercial products. However, the long lifecycles of these plants, many of which also have relatively large genomes, make them difficult to study experimentally, which has motivated the development of a model woody plant system¹. Poplar has several attributes that have led to its emergence as such a model system, including rapid growth, ease of clonal propagation, a relatively small genome and easy transformation^2,3. Understanding the characteristics of poplar, including developmental processes such as growth and wood development, will greatly facilitate the study of long-lived, large perennial plants. Although poplar is the first woody plant whose complete genome has been sequenced, and dozens of genes encoding poplar traits have been identified, functional knowledge about these genes and the genetic factors underlying these traits remains limited. Recent advances in high-throughput sequencing⁴, such as RNA-seq-based transcriptome studies and re-sequencing-based genetics studies, have generated unprecedented amounts of functional genomics data associated with many traits in poplar^5,6, which greatly facilitates the genome-wide study of many important poplar traits.

The regulation of biological processes involves networks of various genes that function in a complex, coordinated manner. However, to date, most studies of poplar have focused on only a single or limited number of genes. Although gene coexpression networks have been constructed to identify functional gene modules involved in the conditions of interest^7,8,9,10,11, no comprehensive functional network of the interactome of poplar is currently available, and there is a strong demand for such public web resources. Functional gene interaction networks serve as powerful tools for gene functional linkage studies in many organisms including animals, plants and prokaryotes^12,13,14. Among the functional network construction algorithms, the development of probabilistic functional gene networks increases both network accuracy and coverage by integrating heterogeneous biological data into a single model^15,16. Using this approach, functional associations are determined between genes in a genome based on diverse data sets, each containing millions of individual observations, which are then integrated into a comprehensive gene network. Once the comprehensive functional linkage network is generated, genes whose functions are unknown could easily be annotated based on their linkage to genes with known functions. In addition, network-guided screening could be performed to identify new candidate genes linked to a specific trait based upon network linkages with previously identified genes associated with these traits.

Here, we constructed a genome-wide co-functional gene network for poplar (covering ~70% of the 41,335 Populus trichocarpa protein-coding genes) based on machine learning technologies and created a network web service, PoplarGene, offering numerous functional interactions and extensive poplar gene functional annotations. PoplarGene incorporates two network-assisted gene prioritization algorithms, neighborhood-based prioritization¹⁷ and context-based prioritization¹⁸, which can be used to perform gene prioritization and to identify genes underlying traits in a complementary manner. Additionally, the co-functional linkage information in PoplarGene can be utilized for other woody plant proteomes via orthology transfer using two optional orthology mapping algorithms (Bidirectional Best Hits^19,20 and InParanoid²¹). In addition to poplar genes, the webserver also accepts Arabidopsis reference genes as input to guide the search for novel candidate functional genes in the PoplarGene network. We found that PoplarGene has significant predictive power for identifying genes affecting specific traits, such as secondary xylem development, stress response and defense response. To the best of our knowledge, PoplarGene is the most comprehensive functional linkage resource for poplar to date. We believe that its user-friendly web interface will be highly beneficial to the research community, representing a valuable resource for better understanding poplar and other woody plants.

Results and Discussion

Network construction

The PoplarGene network was constructed based on diverse types of large-scale experimental and genomic datasets using machine-learning methods (Fig. 1). Three major steps were involved in PoplarGene network construction: (a) inferring functional gene pairs from each experimental and genomic dataset; (b) assigning a likelihood ratio score to each network linkage by benchmarking against gold-standard gene pairs and (c) integrating component network linkages using a modified naive Bayesian algorithm. Network construction was based on the Populus trichocarpa v3.0 reference genome obtained from Phytozome v10.3²², which contains 41,335 protein-coding genes. The gold-standard functional gene pairs used for network training were derived from Biological Process of Gene Ontology in Biofuel Feedstock Genomics Resource (BFGR)²³, KEGG pathway²⁴, MapMan Pathway²⁵ and PoplarCyc pathway²⁶. We obtained a total of 961,462 positive and 72,756,688 negative gold-standard gene linkage pairs, which were then used as the training set in a Bayesian framework²⁷ to measure the likelihood of functional links between two genes. We performed the training for each type of dataset, generating a total of 23 component networks (Table 1), which were integrated into a single comprehensive network using the weighted sum strategy²⁸. The integrated network contains 29,049 genes (covering >70% of the P. trichocarpa proteome) and 1,967,631 linkages. Precision-Recall analysis²⁹, in which gene pairs were ranked by log-likelihood score (LLS) and cumulative precision and recall were calculated over successive bins of 1,000 gene pairs, indicated that the integration improved both genome coverage and linkage accuracy compared with any individual dataset (Fig. 2A).
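The cumulative precision-recall calculation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the toy gene IDs and the decision to skip pairs that appear in neither gold-standard set are assumptions.

```python
def cumulative_precision_recall(ranked_pairs, positives, negatives, bin_size=1000):
    """Return (recall, precision) after each successive bin of ranked gene pairs.

    ranked_pairs: gene pairs sorted by decreasing LLS.
    positives / negatives: sets of gold-standard pairs, stored as frozensets.
    Pairs found in neither gold-standard set are skipped (an assumption).
    """
    true_pos = false_pos = 0
    curve = []
    for start in range(0, len(ranked_pairs), bin_size):
        for pair in ranked_pairs[start:start + bin_size]:
            key = frozenset(pair)
            if key in positives:
                true_pos += 1
            elif key in negatives:
                false_pos += 1
        if true_pos + false_pos:
            recall = true_pos / len(positives)
            precision = true_pos / (true_pos + false_pos)
            curve.append((recall, precision))
    return curve


# Toy example with hypothetical gene IDs and a bin size of 2.
ranked = [("a", "b"), ("c", "d"), ("a", "c"), ("b", "d")]
gold_pos = {frozenset(("a", "b")), frozenset(("a", "c"))}
gold_neg = {frozenset(("c", "d"))}
pr_curve = cumulative_precision_recall(ranked, gold_pos, gold_neg, bin_size=2)
```

Plotting precision against recall for each component network and for the integrated network gives the kind of comparison summarized in Fig. 2A.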

**Figure 1: The overall workflow of PoplarGene construction.**

Table 1 Summary of the PoplarGene network and 23 network components.

**Figure 2: Summary of quality assessment of the PoplarGene network.**

Network validation

To validate the accuracy of the constructed network, GO-BP terms from the agriGO database were utilized³⁰. This GO annotation set is an alternative to the BFGR GO-BP set used to construct our gold-standard training data. To avoid validation bias towards the broad GO-BP terms, the top 12 broadest terms in GO-BP were excluded from agriGO. We ultimately obtained 247,285 positive and 18,238,543 negative validation gene linkage pairs, overlapping 8% of our gold-standard training-positive gene pairs. We also used the gene pair set derived from agriGO “Cellular Component” ontology terms as an additional benchmark set (220,946 positive and 2,465,233 negative), approximately 4% and 2% of which overlap with BFGR GO-BP-based gene pairs and gold-standard training-positive gene pairs, respectively. One important way to construct a poplar gene network is to perform orthology transfer of linkages from the existing Arabidopsis and rice comprehensive functional gene networks using associalog methods³¹. First, to assess the accuracy of our network, we generated an AraNet-derived network and a RiceNet-derived network by transferring the linkages from AraNet¹² and RiceNet³², respectively. The comparison between the PoplarGene network, the AraNet-derived poplar network and the RiceNet-derived poplar network demonstrated that the PoplarGene network not only has larger genome coverage (number of genes in the network), but it also has higher linkage accuracy, as assessed using the validation gene pairs (Fig. 2B). Precision-Recall (PR) analysis²⁹ further revealed that the logarithmic odds ratios across high-scoring network linkages were higher than those of the AraNet-derived network and the RiceNet-derived network (Fig. 2C). PR analysis using GO-CC-based benchmark sets also supported the same conclusion (Supplementary Figure S2A), confirming the improved accuracy and coverage of the PoplarGene network.

Second, we used several types of network property computational analyses to evaluate the quality of the PoplarGene network for biological process modeling. Power-law degree distribution analysis³³ indicated that, like other large-scale biological system networks, the PoplarGene network is also a scale-free network (Supplementary Figure S1A)³⁴. We then conducted topological analysis to assess the consistency between network modular structures and well-defined biological processes. The results show that the clustering coefficient of PoplarGene was ~200-fold higher than that of a random network (Supplementary Figure S1B), which is an expected property of functional modules comprising a network³³. Moreover, the non-randomness of the shortest path lengths between gene pairs in PoplarGene indicates that tightly interconnected functional modules are separated by long functional links (Supplementary Figure S1C). Together, the network property analyses revealed the gene module organization in the PoplarGene network.

Third, we used guilt-by-association (GBA) analysis¹⁷ to determine whether known biological pathways could be detected by the network modules in PoplarGene³⁵. Candidate genes in the network were prioritized based on the direct network links to known genes (guide genes) in each biological process^17,36. We evaluated the predictive power for candidate gene function for each biological process by leave-one-out cross-validation and receiver operating characteristic (ROC) analysis³⁷. Member genes of a tightly interconnected biological process are ranked highly, which is reflected in a high AUC (area under the ROC curve, 0.5 for random expectation and 1 for perfect prediction)³⁸. We tested the predictive power of 277 agriGO Biological Process terms with more than four annotated genes³⁰. The results reveal that PoplarGene has much higher predictive power for diverse biological pathways than random-chance expectation (P = 2.2 × 10⁻¹⁶, Wilcoxon signed rank test; Fig. 2D). Moreover, PoplarGene had significantly higher AUC scores than both the AraNet-derived network (P = 3.606 × 10⁻¹⁴, Wilcoxon signed rank test) and the RiceNet-derived network (P = 2.2 × 10⁻¹⁶, Wilcoxon signed rank test), indicating that the PoplarGene network is highly predictive of gene function (Fig. 2D). The analysis using agriGO-CC-derived benchmark sets also supported this conclusion (Supplementary Figure S2B).

PoplarGene web service

Implementation

The PoplarGene web service (http://bioinformatics.caf.ac.cn/PoplarGene and http://124.127.201.25/PoplarGene) is hosted in an Apache/PHP/MySQL environment on a Linux server equipped with two octa-core AMD processors (2.6 GHz each) and 64 GB of RAM. The back-end pipeline is implemented in Python and Perl, and the plots are drawn with R (http://www.r-project.org) and JavaScript. Network nodes and edges are stored and organized in Neo4j (http://neo4j.com/), a highly scalable native graph database management system that was specifically designed to host graph data. An integrated network exploration JavaScript library, sigma.js (http://sigmajs.org/), is used for network graph drawing. The web interfaces were successfully tested on different web browsers, including Mozilla Firefox 42.0, Google Chrome 47.0, Safari 5.1.10 and Internet Explorer 11.0. The PoplarGene web service provides users with user-friendly interfaces for performing gene querying and other extensive network analysis functions (Fig. 3).

**Figure 3: Screenshots of the PoplarGene web service.**

Network-assisted gene prioritization

An effective strategy for genetic dissection of complex traits is network-assisted gene prioritization^17,18,32. To better utilize network linkage information and publicly available poplar gene-to-phenotype association information, PoplarGene offers two complementary methods to conduct network-assisted gene prioritizations for specific phenotypes. In addition, the web service can accept guide gene input from Arabidopsis, allowing the user to benefit from the available functional information about the most extensively studied plant species.

The first network-assisted gene prioritization method is neighborhood-based gene prioritization¹⁷, which is based on direct neighborhoods in the network (Fig. 3A). This method prioritizes new candidate genes for a specific phenotype by scoring each gene with the sum of the log-likelihood score (LLS) weights of the edges that directly connect it to known genes involved in the phenotype (guide genes, submitted by the user). The server lists the top 100 novel candidate genes for the specific phenotype; the full list of ranked candidate genes is also available on the Results webpage. In addition, the AUC score, representing the predictive power for the submitted guide genes, is calculated using ROC analysis and is reported on the Results webpage as well. AUC ranges from 0.5 for random chance expectation to 1.0 for perfect predictions; AUC > 0.7 indicates good predictive power.
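A minimal sketch of this scoring scheme is given below. The function names, the edge-list representation and the toy gene IDs are illustrative assumptions and do not reflect the server's actual code. The AUC is computed with the rank-based formulation: the probability that a guide gene outscores a non-guide gene, with ties counted as one half. Because a gene has no edge to itself, each guide gene is scored only by its links to the other guide genes.

```python
from collections import defaultdict


def neighborhood_scores(edges, guide_genes):
    """Score each gene by the summed LLS of its edges to guide genes."""
    guides = set(guide_genes)
    scores = defaultdict(float)
    for gene_a, gene_b, lls in edges:
        if gene_b in guides:
            scores[gene_a] += lls
        if gene_a in guides:
            scores[gene_b] += lls
    return scores


def rank_candidates(edges, guide_genes, top_n=100):
    """Return the top-scoring genes that are not themselves guide genes."""
    guides = set(guide_genes)
    scores = neighborhood_scores(edges, guides)
    candidates = [(g, s) for g, s in scores.items() if g not in guides]
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return candidates[:top_n]


def auc(scores, positives, all_genes):
    """Probability that a positive gene outscores a negative one (ties = 0.5)."""
    pos = [scores.get(g, 0.0) for g in all_genes if g in positives]
    neg = [scores.get(g, 0.0) for g in all_genes if g not in positives]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy network: (gene, gene, LLS) triples with hypothetical IDs.
edges = [("g1", "g2", 3.0), ("g1", "c1", 2.0), ("g2", "c1", 1.5),
         ("g2", "c2", 1.0), ("c2", "c3", 4.0)]
guides = {"g1", "g2"}
all_genes = {"g1", "g2", "c1", "c2", "c3"}

top = rank_candidates(edges, guides)
guide_auc = auc(neighborhood_scores(edges, guides), guides, all_genes)
```

In this toy case `c1` ranks first because it is linked to both guide genes, while `c3` receives no score because it has no direct link to any guide gene.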

The second network-assisted gene prioritization method in the PoplarGene web service is based on a context-centric approach (Fig. 3C)¹⁸. Due to the long reproductive cycle and less efficient transformation procedures in poplar functional studies, the number of known guide genes for numerous poplar traits is still very limited, which hinders the efficient utilization of neighborhood-based gene prioritization. Transcriptomic analysis, largely facilitated by high-throughput sequencing in recent years, has become an efficient alternative approach to studying gene-to-phenotype associations. However, many differentially expressed genes (DEGs) identified in transcriptome studies are not actual regulatory genes but are simply genes that respond to alterations in cellular state. Moreover, many genes associated with a particular phenotype are not significantly differentially expressed. PoplarGene can prioritize genes using DEGs from a specific biological context. We initially identified 15,004 central hub genes with at least 50 directly connected neighbors in the PoplarGene network. Users can initiate the analysis by submitting a set of DEGs that are associated with a specific biological context. For each central hub, Fisher’s exact test is used to evaluate the statistical enrichment of the hub's network neighbors among the DEGs, and the hubs that are significantly associated with the biological context are returned.
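The hub enrichment test can be sketched as below. This is an illustration under stated assumptions: the test is taken to be one-sided (over-representation), the background is taken to be all genes in the network, and the function names and gene IDs are hypothetical. The one-sided Fisher's exact P value equals the upper tail of a hypergeometric distribution, which is computed here with the standard library only.

```python
from math import comb


def fisher_right_tail(k, n_neighbors, n_degs, n_total):
    """P(X >= k) when n_neighbors genes are drawn from n_total, n_degs of which are DEGs."""
    upper = min(n_neighbors, n_degs)
    numerator = sum(comb(n_degs, x) * comb(n_total - n_degs, n_neighbors - x)
                    for x in range(k, upper + 1))
    return numerator / comb(n_total, n_neighbors)


def context_hubs(neighbors, degs, network_genes, min_degree=50, alpha=0.01):
    """Return (hub, P) for hubs whose neighbors are enriched among the DEGs."""
    background = set(network_genes)
    degs = set(degs) & background
    hits = []
    for hub, nbrs in neighbors.items():
        if len(nbrs) < min_degree:
            continue  # not a central hub
        overlap = len(nbrs & degs)
        p_value = fisher_right_tail(overlap, len(nbrs), len(degs), len(background))
        if p_value <= alpha:
            hits.append((hub, p_value))
    return sorted(hits, key=lambda item: item[1])


# Toy example: 20 genes, 5 DEGs, and a lowered degree cutoff of 5.
genes = [f"g{i}" for i in range(20)]
degs = genes[:5]
neighbors = {
    "g19": set(genes[:4]) | {"g10"},   # 4 of 5 neighbors are DEGs
    "g18": set(genes[10:15]),          # no neighbor is a DEG
}
hub_hits = context_hubs(neighbors, degs, genes, min_degree=5)
```

With the defaults (`min_degree=50`, `alpha=0.01`) the thresholds match the hub definition and significance cutoff quoted in the text.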

Mapping functional links to other tree species based on orthology

The PoplarGene web service also provides a feasible and convenient way to construct genome-scale gene functional networks for other woody plants based on proteome sequence data (Fig. 3D). Three gene functional network templates (AraNet v2, RiceNet v2 and PoplarGene) and two orthology mapping algorithms (Bidirectional Best Hit^19,20 and InParanoid²¹) are supported in PoplarGene. The web service also performs functional annotations for the submitted proteome using four pathway annotation systems (GO-BP, KEGG pathway, MapMan pathway and MetaCyc pathway) simultaneously. Once users successfully submit the proteome sequences, the web service will give the users a job ID, which can be used to retrieve the results once the job is completed.
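As an illustration of the Bidirectional Best Hit option, the sketch below derives one-to-one orthologs from two sets of pairwise alignment hits and then transfers template linkages to the target species. It is a simplified assumption of how such a mapping can work, not the server's implementation: the tuple layout, function names and gene IDs are hypothetical, and transferred edges simply keep the template LLS.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject; hits are (query, subject, score)."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


def transfer_links(template_edges, ortholog_of):
    """Carry template edges over to the target species when both genes have orthologs."""
    transferred = {}
    for gene_a, gene_b, lls in template_edges:
        if gene_a in ortholog_of and gene_b in ortholog_of:
            pair = tuple(sorted((ortholog_of[gene_a], ortholog_of[gene_b])))
            if pair[0] != pair[1]:
                transferred[pair] = max(lls, transferred.get(pair, lls))
    return sorted((a, b, lls) for (a, b), lls in transferred.items())


# Hypothetical target genes (Eg*) aligned against hypothetical template genes (Pt*).
target_vs_template = [("Eg1", "Pt1", 200), ("Eg1", "Pt2", 150),
                      ("Eg2", "Pt2", 180), ("Eg3", "Pt1", 90)]
template_vs_target = [("Pt1", "Eg1", 210), ("Pt2", "Eg2", 175), ("Pt2", "Eg1", 100)]

bbh = bidirectional_best_hits(target_vs_template, template_vs_target)
ortholog_of = {template: target for target, template in bbh}
new_edges = transfer_links([("Pt1", "Pt2", 3.2), ("Pt1", "Pt9", 1.0)], ortholog_of)
```

`Eg3` is dropped because its best hit `Pt1` prefers `Eg1`, and the edge to `Pt9` is not transferred because `Pt9` has no ortholog in the target.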

Other functionalities in PoplarGene

All poplar genes (P. trichocarpa v3.0 reference genome) are extensively annotated in the PoplarGene web service, including their pathway annotation, protein domain annotation, orthology annotation, expression atlas and expression profile in woody plant tissues (Fig. 3E). All poplar gene information can be retrieved via user-friendly search interfaces, including single gene search mode and batch gene search mode (Fig. 3B). The linkages of each gene are also downloadable in SIF format, which can serve as input for Cytoscape software (http://www.cytoscape.org/download.php) installed on local desktop computers. Additionally, the functions of query genes whose functions are unknown can be inferred from network neighbors based on GO-BP term annotations. The functional terms for the query genes are assigned based on directly connected network neighbors with GO-BP annotations and are ranked using the sum of the edge LLS weight scores. The top ten GO-BP terms are returned as candidate functions for the query gene. In addition, poplar microRNA target binding information, BLAST search functions, GBrowse2 (http://gmod.org/wiki/GBrowse), JBrowse (http://jbrowse.org/) and Netviewer (based on Sigma.js) tools are also available at the PoplarGene web service (Fig. 3F).
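The neighbor-based function inference described in this paragraph can be sketched as follows. The data structures, function name and gene IDs are illustrative assumptions; only the scoring rule (sum of edge LLS over annotated direct neighbors, top ten terms returned) comes from the text.

```python
from collections import defaultdict


def predict_go_terms(query, edges, go_annotations, top_n=10):
    """Rank GO-BP terms of the query's direct neighbors by summed edge LLS."""
    term_scores = defaultdict(float)
    for gene_a, gene_b, lls in edges:
        if gene_a == query:
            neighbor = gene_b
        elif gene_b == query:
            neighbor = gene_a
        else:
            continue  # edge does not involve the query gene
        for term in go_annotations.get(neighbor, ()):
            term_scores[term] += lls
    ranked = sorted(term_scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy example with hypothetical genes; "q" is the unannotated query.
edges = [("q", "n1", 2.0), ("n2", "q", 1.5), ("q", "n3", 0.5), ("n1", "n2", 9.0)]
go_annotations = {
    "n1": {"GO:0006281"},
    "n2": {"GO:0006281", "GO:0006310"},
    "n3": {"GO:0016567"},
}
predicted = predict_go_terms("q", edges, go_annotations)
```

The edge between `n1` and `n2` does not contribute, because only edges that touch the query gene are considered.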

Case studies

The number of poplar genes annotated using experimental evidence is quite limited, whereas Arabidopsis has the most extensive functional information of any plant. Wood is a complex structure, and thousands of genes have been shown to be associated with wood development in many species^39,40,41,42. A large number of genes associated with wood/xylem development in poplar remain unknown. Thus, an effective approach is to prioritize novel poplar genes for xylem development using Arabidopsis orthologs for the equivalent trait. The plausibility of the new candidates can be assessed based on tissue-specific expression patterns, assuming that genes for xylem development exhibit more active changes in expression in xylem than in leaf tissue. We submitted 50 Arabidopsis genes known to control xylem cell specification for neighborhood-based gene prioritization in the PoplarGene web service (see Supplementary Figure S3A for the workflow), which returned 2,399 new candidate poplar genes. We then used poplar RNA-seq transcriptome data (Sequence Read Archive ID: SRP050172)⁵, which were obtained from a comparative study of gene expression in xylem and leaf tissue, to validate the new candidate genes. The top 100 candidate genes were significantly more differentially expressed in xylem versus leaf tissue than 100 randomly selected poplar genes (P = 5.2 × 10⁻¹⁰, Wilcoxon rank sum test; Fig. 4A).

We then used context-based gene prioritization in PoplarGene to prioritize poplar genes for defense response and stress response traits. First, we submitted 155 stress-responsive poplar DEGs⁴³ to PoplarGene and identified 474 context-associated hubs as new candidate genes (P ≤ 0.01, Fisher’s exact test) (Supplementary Figure S3B). To validate the predictions, we measured the enrichment of 1,035 genes related to stress responses annotated by Gramene⁴⁴ GO-BP terms among the predicted 474 genes, revealing significant enrichment of the annotated stress response genes among the new candidate genes (P = 1.347 × 10⁻¹¹, Fisher’s exact test). Second, we submitted 55 poplar defense DEGs^45,46 to PoplarGene, which returned a total of 367 context-associated hubs as new candidate genes (P ≤ 0.01, Fisher’s exact test). We then measured the enrichment of 841 genes related to defense responses annotated by Gramene⁴⁴ GO-BP terms among the 367 predicted genes. The results also reveal significant enrichment of the annotated defense response genes among the new candidates (P = 0.019, Fisher’s exact test).

To evaluate orthology-transferred functional gene networks for other woody plants using PoplarGene, we constructed Eucalyptus grandis functional gene networks based on AraNet, RiceNet and PoplarGene (Supplementary Figure S3C), which generated 483,742 linkages (14,036 genes), 950,409 linkages (13,844 genes) and 1,328,017 linkages (17,093 genes), respectively. The quality of the transferred networks was assessed using GO-BP term recovery analysis based on the areas under Receiver Operating Characteristic curves. A total of 310 GO-BP terms (≥5 members) from the E. grandis coding-sequence genome annotated by Phytozome v10.3 were used for this analysis. The results demonstrate that the AUC scores of the PoplarGene-derived E. grandis network were significantly higher than those of both the AraNet-derived E. grandis network (P-value = 3.61 × 10⁻¹⁴, Wilcoxon rank sum test) and the RiceNet-derived E. grandis network (P-value = 2.20 × 10⁻¹⁶, Wilcoxon rank sum test; Fig. 4B).

In poplar, PtrWND2B (Potri.002G178700) interacts with PtrVND/SND genes to regulate several poplar R2R3 MYB genes involved in secondary cell wall biosynthesis^47,48. In the PoplarGene network, we found that PtrWND2B has functional links with 15 genes (Potri.013G113100, VND7; Potri.005G096600, MYB63; Potri.017G016700, SND2; Potri.004G207600, XCP1; Potri.001G099800, MYB103; Potri.009G061500, MYB83; Potri.001G112200, KNAT7; Potri.007G135300, SND2; Potri.005G063200, MYB69; Potri.019G083600, VND7; Potri.003G132000, MYB103; Potri.001G197000, MYB26; Potri.003G022800, XND1; Potri.006G122100, MYB27; Potri.004G086300, MYB43). Among these linked genes, eight are MYB genes, and Potri.005G096600 (PtrMYB028/MYB63), Potri.009G061500 (PtrMYB020/MYB83) and Potri.004G086300 (PtrMYB018/MYB43) were reported to be directly linked to PtrWND2B in an experimental study⁴⁷.

Conclusion

In this study, we constructed a functional gene network of poplar from diverse data sources using machine-learning procedures, which improved both the genome coverage and linkage accuracy. We then developed the PoplarGene web service, a publicly available gene network resource and network-assisted gene prioritization service that provides the poplar community with a number of useful functions. We demonstrated that not only can PoplarGene be used to predict the functions of unknown genes and to predict new candidate genes affecting a wide variety of traits in poplar, but it can also be used to map the co-functional linkages to other woody plants with high efficiency. PoplarGene can also accept guide genes from Arabidopsis, the most extensively studied plant species, which will greatly facilitate investigations of the less-studied plant poplar. PoplarGene will continue to be improved. When more published data are available for poplar research, literature-based network inference methods will be incorporated into PoplarGene. In summary, we believe that PoplarGene will serve as a highly useful tool for the scientific community, facilitating studies of poplar and other woody plants.

Methods

Gold standard gene pairs for machine learning

To construct and evaluate the network, gold standard co-functional gene pairs were generated from four sources of annotated sets of P. trichocarpa: Biological Process of Gene Ontology (GO-BP)²³, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways²⁴, MapMan metabolic pathways²⁵ and PoplarCyc metabolic pathways²⁶. The positive gene pairs were derived by pairing genes sharing at least one functional annotation in each annotation set, while the negative pairs were obtained by pairing genes that do not share any functional annotation terms. In the GO annotation set, gene pairs sharing annotation from the same GO term were considered to be functionally linked, while the pairs of annotated genes not sharing any GO terms were treated as negative pairs⁴⁹. For example, the genes Potri.015G088100 and Potri.011G023800 represent a positive pair, sharing the GO terms “GO:0006281: DNA repair” and “GO:0006310: DNA recombination”. The genes Potri.004G061800 and Potri.010G136500 are also a positive pair, sharing the GO term “GO:0016567: protein ubiquitination”. The genes Potri.003G183000 (annotated with GO:0005216, GO:0016020, GO:0006811 and GO:0055085) and Potri.004G061800 (annotated with GO:0016567, GO:0004842, GO:0000151 and GO:0005515) do not share any term and represent a negative example. Among the GO-BP terms, since terms above level 2 are too general and terms below level 11 are too specific, we used the terms belonging to levels 2 through 10 to optimize annotation specificity and comprehensiveness³⁷. If a term/pathway has too many annotated genes, too many gene pairs will be generated from that single term/pathway, which may cause functional bias towards the term/pathway^12,50. For instance, among the poplar BFGR GO-BP terms, the six broadest GO-BP terms generate 1,984,503 positive linkage pairs, which account for ~92% of the total 2,155,797 positive linkage pairs (based on all 341 poplar BFGR GO-BP terms), thereby leading to strong bias toward these broad terms.
The same is true for the KEGG, MapMan and PoplarCyc pathways. Thus, to reduce the training bias, the terms/pathways containing too many genes were ignored in the gold standard gene pair construction. The ignored terms/pathways, which typically contain >300 genes, are listed in Supplementary Table S1. As a result, GO-BP generated 171,294 positive and 7,300,003 negative gene pairs, covering 3,877 (~9.4%) P. trichocarpa genes. For KEGG pathway (Release 76.0) analysis, after ignoring the largest terms and broad-concept terms, 440,925 positive and 12,991,275 negative pairs were obtained, covering 5,198 (12.6%) poplar genes. The gold standard gene pairs from MapMan metabolic pathways included 318,481 positive and 51,307,487 negative gene pairs, covering 10,162 (~24.6%) P. trichocarpa genes. For PoplarCyc (version 3.0), since even the largest pathways contain few annotated genes, no terms were ignored, and 118,243 positive and 10,844,660 negative gene pairs were obtained for 4,683 genes (11.3% of P. trichocarpa genes). Finally, after merging the four types of gold standard gene pairs, a total of 961,462 positive and 72,756,688 negative gold standard gene pairs were obtained, covering 15,677 (~38%) P. trichocarpa genes.
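The pairing rule described above can be illustrated with a short sketch. It is a simplified assumption rather than the authors' code: the function name is hypothetical, the size cutoff of 300 genes stands in for the curated exclusion list in Supplementary Table S1, and enumerating all pairs in memory is only practical for small examples. The gene and GO identifiers are the ones quoted in the text, restricted to Biological Process terms.

```python
from itertools import combinations


def gold_standard_pairs(annotations, max_term_size=300):
    """Build positive and negative gene pairs from a term -> genes mapping.

    Positive pairs share at least one retained term; negative pairs are
    annotated genes that share none. Terms above max_term_size are ignored.
    """
    kept = {term: genes for term, genes in annotations.items()
            if len(genes) <= max_term_size}
    annotated = sorted(set().union(*kept.values())) if kept else []
    positives = set()
    for genes in kept.values():
        positives.update(frozenset(pair) for pair in combinations(sorted(genes), 2))
    all_pairs = {frozenset(pair) for pair in combinations(annotated, 2)}
    negatives = all_pairs - positives
    return positives, negatives


annotations = {
    "GO:0006281": {"Potri.015G088100", "Potri.011G023800"},  # DNA repair
    "GO:0006310": {"Potri.015G088100", "Potri.011G023800"},  # DNA recombination
    "GO:0016567": {"Potri.004G061800", "Potri.010G136500"},  # protein ubiquitination
    "GO:0006811": {"Potri.003G183000"},
}
positive_pairs, negative_pairs = gold_standard_pairs(annotations)
```

A pair that shares two terms, such as the DNA repair and recombination genes, is still counted once because pairs are stored in a set.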

Functional link inference framework and data integration

The functional linkages derived from different data sets have different levels of confidence due to variations in the internal measurements of different types of data sets. To unify the dataset-intrinsic scores and to integrate heterogeneous data into a composite network, a common Bayesian scoring framework, the log-likelihood score (LLS)³⁷, was initially used to measure the functional linkages between two genes in each dataset, which was defined as:

$$\mathrm{LLS} = \ln\left(\frac{P(I|D)\,/\,P(\sim I|D)}{P(I)\,/\,P(\sim I)}\right)$$

where P(I|D) and P(~I|D) represent the frequencies of gold standard positive and negative gene pairs observed in the corresponding dataset (D), and P(I) and P(~I) are the frequencies of all positive and negative gold standard gene pairs, respectively. To avoid over-fitting bias, 0.632 bootstrapping, which provides a robust estimate of classifier accuracy and is appropriate for poorly annotated genomes⁵¹, was used to calculate LLS values³⁷.
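As a worked illustration of these definitions, the sketch below computes an LLS from raw counts, assuming the usual log-likelihood-ratio form: the odds of positive versus negative gold standard pairs within a dataset, divided by the prior odds over all gold standard pairs. The function name and the example bin counts are hypothetical, the prior counts are the gold standard totals reported above, and the 0.632 bootstrap is not implemented here.

```python
from math import log


def log_likelihood_score(pos_in_data, neg_in_data, pos_total, neg_total):
    """LLS = ln( (P(I|D) / P(~I|D)) / (P(I) / P(~I)) ), computed from counts."""
    if pos_in_data == 0 or neg_in_data == 0:
        raise ValueError("LLS is undefined when a count is zero; use larger bins")
    data_odds = pos_in_data / neg_in_data
    prior_odds = pos_total / neg_total
    return log(data_odds / prior_odds)


GOLD_POSITIVE = 961_462      # total positive gold standard pairs (from the text)
GOLD_NEGATIVE = 72_756_688   # total negative gold standard pairs (from the text)

# Hypothetical bin: 300 positive and 100 negative gold standard pairs observed.
example_lls = log_likelihood_score(300, 100, GOLD_POSITIVE, GOLD_NEGATIVE)

# A dataset no better than the prior yields a score of zero.
baseline_lls = log_likelihood_score(GOLD_POSITIVE, GOLD_NEGATIVE,
                                    GOLD_POSITIVE, GOLD_NEGATIVE)
```

Positive scores indicate that a dataset links gold standard positive pairs more often than expected from the prior; scores at or below zero carry no supporting evidence.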

For each dataset, the gene pairs were ranked by their respective continuous intrinsic scores (mutual information, correlation coefficient, gene distance and so on), and LLS for bins with equal numbers of ranked gene pairs were calculated. Regression models were then constructed based on these LLS values, and the set of mean continuous scores for bins was used to map the intrinsic score of each gene pair to LLS values in a continuous manner²⁸.

Linkage data integration framework

The functional links in each dataset were generated; a functional link could be observed in multiple datasets with different LLS values. Because the datasets were not fully independent, the weighted sum (WS)²⁸, which is a modification of the naive Bayesian approach, was used to integrate the linkages derived from the various datasets. WS is defined as:

$$\mathrm{WS} = L_0 + \sum_{i=1}^{n} \frac{L_i}{D \cdot i}, \quad \text{for all } L \ge T$$

where L is the LLS value (L₀ is the largest LLS among the datasets supporting the link), and i (in L_i) is the rank index number of the n remaining LLS values of the link. D is the weight factor, which ranges from 1 to + ∞, and T is the minimum threshold of LLS. Only LLS values above the threshold were considered in order to exclude noisy, low-scoring linkages. Systematic testing was conducted to select the optimal values of D and T in order to maximize overall performance, which was measured as the area under a plot of LLS versus the number of gene pairs in the network³⁷.
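A minimal sketch of the weighted sum is shown below, assuming the form used in earlier probabilistic gene network studies: the largest LLS is kept in full and each remaining LLS, taken in decreasing order, is divided by D times its rank. The function name is hypothetical, and treating values equal to T as retained is an assumption.

```python
def weighted_sum(lls_values, D=1.0, T=0.0):
    """Combine the LLS values supporting one link into a single score.

    Values below the threshold T are discarded. Returns None if nothing remains.
    """
    kept = sorted((value for value in lls_values if value >= T), reverse=True)
    if not kept:
        return None
    largest, remaining = kept[0], kept[1:]
    return largest + sum(value / (D * rank)
                         for rank, value in enumerate(remaining, start=1))


# One link supported by four datasets; the weakest score falls below T.
combined = weighted_sum([3.0, 2.0, 1.0, 0.2], D=2.0, T=0.5)   # 3 + 2/2 + 1/4
nearly_max = weighted_sum([3.0, 2.0, 1.0], D=1e9)             # approaches the largest LLS
```

Small values of D let additional datasets add substantial weight, whereas large values make the combined score approach the single best LLS, which is why D is tuned together with T.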

Functional links inferred from genomic contexts

The two most widely used genomic context methods, Phylogenetic Profiling^52,53,54 and Gene Neighborhood^55,56,57, which have shown reasonable performance for inferring functional linkages in Arabidopsis and rice, were applied to infer functional associations in poplar. Phylogenetic Profiling is a method that uses similarity of evolutionary co-occurrence patterns among large numbers of species to infer functionally coupled genes. First, BLASTP was used to align all P. trichocarpa protein sequences against the unique representative complete genomes in each of the three domains of life (1,188 Bacteria species, 159 Archaea species and 434 Eukaryota species). The species with the largest genomes were chosen as the unique representative species in each genus. Second, the best BLAST hit was used to construct a phylogenetic profile matrix for each domain of life, and the similarity between two profiles was then measured by mutual information (MI)¹⁵. The functional linkages generated in the three domains of life were integrated into a single network by the weighted-sum framework mentioned above. Meanwhile, two complementary Gene Neighborhood algorithms, physical distance based neighborhood⁵⁶ and probability-based neighborhood⁵⁵, were used to infer functional links separately, which were integrated into a single network by the weighted-sum framework as well.
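The mutual information between two phylogenetic profiles can be sketched as follows. The example assumes binary presence/absence profiles (1 if a homolog is detected in a reference genome, 0 otherwise); how the authors discretized BLAST scores is not stated here, so this encoding and the function name are assumptions.

```python
from collections import Counter
from math import log


def mutual_information(profile_a, profile_b):
    """MI (in nats) between two equal-length discrete profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same reference genomes")
    n = len(profile_a)
    joint = Counter(zip(profile_a, profile_b))
    freq_a = Counter(profile_a)
    freq_b = Counter(profile_b)
    mi = 0.0
    for (state_a, state_b), count in joint.items():
        p_joint = count / n
        p_independent = (freq_a[state_a] / n) * (freq_b[state_b] / n)
        mi += p_joint * log(p_joint / p_independent)
    return mi


# Presence (1) or absence (0) of a homolog across four reference genomes.
gene_x = [1, 1, 0, 0]
gene_y = [1, 1, 0, 0]   # same co-occurrence pattern as gene_x
gene_z = [1, 0, 1, 0]   # pattern unrelated to gene_x

mi_identical = mutual_information(gene_x, gene_y)
mi_unrelated = mutual_information(gene_x, gene_z)
```

Identical patterns give the maximum MI for a balanced binary profile (ln 2), while statistically independent patterns give zero.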

Functional links inferred from co-occurring protein domains

The protein domain is the functional subunit of a protein. Proteins sharing a similar set of domains may perform similar functions⁴⁹. Rare domains are more closely related to specific functions than common domains⁴⁹. Using the protein PFAM domain annotation⁵⁸, domain occurrence profiles (3,375 unique domains) were generated for all protein sequences, with the inverse of the domain frequency in the P. trichocarpa proteome indicating the presence of the corresponding domain and 0 indicating its absence. This type of weighted scoring gives more weight to rare domains. The mutual information was then calculated to determine the significance of domain co-occurrence within the profile matrix to infer functional linkages.
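The weighted domain profile can be illustrated with the sketch below. It assumes that "domain frequency" means the number of proteins carrying the domain, which is one plausible reading of the text; the function name and protein IDs are hypothetical, and the Pfam accessions are used only as example labels.

```python
from collections import Counter


def domain_profiles(protein_domains):
    """Build inverse-frequency domain profiles.

    protein_domains maps a protein ID to its set of Pfam domains. Each profile
    entry is 1 / (number of proteins with the domain) if present, else 0.
    """
    frequency = Counter(domain
                        for domains in protein_domains.values()
                        for domain in domains)
    columns = sorted(frequency)
    profiles = {
        protein: [1.0 / frequency[d] if d in domains else 0.0 for d in columns]
        for protein, domains in protein_domains.items()
    }
    return columns, profiles


proteins = {
    "p1": {"PF00069", "PF07714"},
    "p2": {"PF00069"},
    "p3": {"PF00069", "PF00249"},
}
columns, profiles = domain_profiles(proteins)
```

The common domain shared by all three proteins contributes a weight of 1/3, while a domain found in a single protein contributes a weight of 1, so rare domains dominate the profile comparison.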

Inferring functional linkages from associalogs

Associalogs are defined as conserved functional linkages that are transferred from other organisms by orthology³⁷. The functional linkages were transferred to P. trichocarpa genes from AraNet v2 (Arabidopsis thaliana)¹², WormNet v3 (Caenorhabditis elegans)¹⁸, HumanNet v1 (Homo sapiens)⁵⁹, FlyNet v1 (Drosophila melanogaster)¹³, RiceNet v2 (Oryza sativa)³² and YeastNet v3 (Saccharomyces cerevisiae)⁶⁰. All transferred functional linkages were scored by InParanoid weighted LLS (IWLLS)¹⁶, which is defined as:

where A and B are poplar genes and A′ and B′ are orthologous genes from other organisms. An InParanoid score is calculated by multiplying two inparalog scores, i.e., those of the poplar gene and the orthologous gene in another organism (A − A′/B − B′), which are generated from the InParanoid algorithm²¹.

Inferring functional linkages based on co-expression patterns

Functionally associated genes tend to be co-expressed under various conditions³⁵. High-dimensional microarray data have been broadly used to infer co-functional links based on correlations in gene co-expression patterns. First, 32 microarray datasets with at least 12 samples each were obtained from the Gene Expression Omnibus (GEO) in May 2015. Datasets with fewer than 12 samples were excluded because co-functional links inferred from correlations over small sample sizes may be spurious. Second, expression profile vectors for each gene across microarray samples were generated for each GEO dataset. Finally, Pearson Correlation Coefficient (PCC) values were calculated between each pair of expression profile vectors to measure the co-expression correlation. Only gene pairs with PCC values that were statistically significant at the 99% confidence level (t-test) were retained. After filtering out the datasets with lower co-expression correlation, 22 co-expression networks were obtained, which were further integrated into a single co-functional network via the weighted-sum framework.
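The correlation filter for a single gene pair can be sketched as below. The sketch assumes a two-sided t-test on the correlation coefficient with n − 2 degrees of freedom, which the text does not specify; the function names and toy expression values are hypothetical, and the critical value 3.169 is the standard two-sided 99% value for 10 degrees of freedom (12 samples), so it would need to be replaced for datasets of other sizes.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two expression vectors."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two vectors of equal length >= 3")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("constant expression profile")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Assumption: two-sided test, 12 samples -> df = 10, 99% critical value.
T_CRIT_99_DF10 = 3.169


def is_significant(x, y, t_crit=T_CRIT_99_DF10):
    """Keep a gene pair only if |t| exceeds the supplied critical value."""
    return abs(t_statistic(pearson(x, y), len(x))) > t_crit


# Toy expression profiles across 12 samples.
gene_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
gene_b = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2, 21.8, 24.1]
gene_c = [5, 1, 5, 1, 5, 1, 5, 1, 5, 1, 5, 1]

keep_ab = is_significant(gene_a, gene_b)
keep_ac = is_significant(gene_a, gene_c)
```

Here `gene_a` and `gene_b` rise together and pass the filter, whereas `gene_c` oscillates independently of `gene_a` and is discarded.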

Additional Information

How to cite this article: Liu, Q. et al. PoplarGene: poplar gene network and resource for mining functional information for genes from woody plants. Sci. Rep. 6, 31356; doi: 10.1038/srep31356 (2016).

References

Neale, D. B. & Kremer, A. Forest tree genomics: growing resources and applications. Nat Rev Genet 12, 111–122 (2011).
Taylor, G. Populus: arabidopsis for forestry. Do we need a model tree? Ann Bot 90, 681–689 (2002).
Wullschleger, S. D., Tuskan, G. A. & DiFazio, S. P. Genomics and the tree physiologist. Tree Physiol 22, 1273–1276 (2002).
Schneeberger, K. Using next-generation sequencing to isolate mutant genes from forward genetic screens. Nat Rev Genet 15, 662–676 (2014).
Hefer, C. A., Mizrachi, E., Myburg, A. A., Douglas, C. J. & Mansfield, S. D. Comparative interrogation of the developing xylem transcriptomes of two wood-forming species: Populus trichocarpa and Eucalyptus grandis. New Phytol 206, 1391–1405 (2015).
Du, Q. et al. Genetic architecture of growth traits in Populus revealed by integrated quantitative trait locus (QTL) analysis and association studies. New Phytol 209, 1067–1082 (2015).
Lin, Y. C. et al. SND1 transcription factor-directed quantitative functional hierarchical genetic regulatory network in wood formation in Populus trichocarpa. Plant Cell 25, 4324–4341 (2013).
Cai, B., Li, C. H. & Huang, J. Systematic identification of cell-wall related genes in Populus based on analysis of functional modules in co-expression network. PLoS One 9, e95176 (2014).
Gronlund, A., Bhalerao, R. P. & Karlsson, J. Modular gene expression in Poplar: a multilayer network approach. New Phytol 181, 315–322 (2009).
Liu, J., Zhang, J., He, C. & Duan, A. Genes responsive to elevated CO2 concentrations in triploid white poplar and integrated gene network analysis. PLoS One 9, e98300 (2014).
He, J. et al. A transcriptomic network underlies microstructural and physiological responses to cadmium in Populus x canescens. Plant Physiol 162, 424–439 (2013).
Lee, T. et al. AraNet v2: an improved database of co-functional gene networks for the study of Arabidopsis thaliana and 27 other nonmodel plant species. Nucleic Acids Res 43, D996–1002 (2015).
Kim, H., Shim, J. E., Shin, J. & Lee, I. EcoliNet: a database of cofunctional gene network for Escherichia coli. Database (Oxford) 2015 (2015).
Kim, E. et al. MouseNet v2: a database of gene networks for studying the laboratory mouse and eight other model vertebrates. Nucleic Acids Res 44, D848–D854 (2015).
Date, S. V. & Marcotte, E. M. Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat Biotechnol 21, 1055–1062 (2003).
Lee, I. et al. Predicting genetic modifier loci using functional gene networks. Genome Res 20, 1143–1153 (2010).
Wang, P. I. & Marcotte, E. M. It’s the machine that matters: Predicting gene function and phenotype from protein networks. J Proteomics 73, 2277–2289 (2010).
Cho, A. et al. WormNet v3: a network-assisted hypothesis-generating server for Caenorhabditis elegans. Nucleic Acids Res 42, W76–W82 (2014).
Zhang, M. & Leong, H. W. BBH-LS: an algorithm for computing positional homologs using sequence and gene context similarity. BMC systems biology 6 Suppl 1, S22 (2012).
Haberer, G. et al. Large-scale cis-element detection by analysis of correlated expression and sequence conservation between Arabidopsis and Brassica oleracea. Plant Physiol 142, 1589–1602 (2006).
Sonnhammer, E. L. & Ostlund, G. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res 43, D234–D239 (2015).
Goodstein, D. M. et al. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res 40, D1178–D1186 (2012).
Childs, K. L., Konganti, K. & Buell, C. R. The Biofuel Feedstock Genomics Resource: a web-based portal and database to enable functional genomics of plant biofuel feedstock species. Database (Oxford) 2012, bar061 (2012).
Kanehisa, M. et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res 42, D199–D205 (2014).
Thimm, O. et al. MAPMAN: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant J 37, 914–939 (2004).
Caspi, R. et al. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 44, D471–D480 (2015).
Lee, I., Date, S. V., Adai, A. T. & Marcotte, E. M. A probabilistic functional network of yeast genes. Science 306, 1555–1558 (2004).
Lee, I., Li, Z. & Marcotte, E. M. An improved, bias-reduced probabilistic functional gene network of baker’s yeast, Saccharomyces cerevisiae. PLoS One 2, e988 (2007).
Davis, J. & Goadrich, M. In Proceedings of the 23rd international conference on Machine learning 233–240 (ACM, Pittsburgh, Pennsylvania, USA, 2006).
Du, Z., Zhou, X., Ling, Y., Zhang, Z. & Su, Z. agriGO: a GO analysis toolkit for the agricultural community. Nucleic Acids Res 38, W64–W70 (2010).
Kim, E., Kim, H. & Lee, I. JiffyNet: a web-based instant protein network modeler for newly sequenced species. Nucleic Acids Res 41, W192–W197 (2013).
Lee, T. et al. RiceNet v2: an improved network prioritization server for rice genes. Nucleic Acids Res 43, W122–W127 (2015).
Girvan, M. & Newman, M. E. Community structure in social and biological networks. Proc Natl Acad Sci USA 99, 7821–7826 (2002).
Arita, M. Scale-freeness and biological networks. J Biochem 138, 1–4 (2005).
Rhee, S. Y. & Mutwil, M. Towards revealing the functions of all genes in plants. Trends Plant Sci 19, 212–221 (2014).
Lee, I. et al. Genetic dissection of the biotic stress response using a genome-scale gene network for rice. Proc Natl Acad Sci USA 108, 18548–18553 (2011).
Lee, I. et al. A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat Genet 40, 181–188 (2008).
Linghu, B., Snitkin, E. S., Hu, Z., Xia, Y. & Delisi, C. Genome-wide prioritization of disease genes and identification of disease-disease associations from an integrated human functional linkage network. Genome Biol 10, R91 (2009).
Cato, S. et al. Wood formation from the base to the crown in Pinus radiata: gradients of tracheid wall thickness, wood density, radial growth rate and gene expression. Plant Mol Biol 60, 565–581 (2006).
Qiu, D. et al. Gene expression in Eucalyptus branch wood with marked variation in cellulose microfibril orientation and lacking G-layers. New Phytol 179, 94–103 (2008).
Dillon, S. K., Brawner, J. T., Meder, R., Lee, D. J. & Southerton, S. G. Association genetics in Corymbia citriodora subsp. variegata identifies single nucleotide polymorphisms affecting wood growth and cellulosic pulp yield. New Phytol 195, 596–608 (2012).
Xu, T., Ma, T., Hu, Q. & Liu, J. An integrated database of wood-formation related genes in plants. Scientific reports 5, 11422 (2015).
Song, Y., Ci, D., Tian, M. & Zhang, D. Comparison of the physiological effects and transcriptome responses of Populus simonii under different abiotic stresses. Plant Mol Biol 86, 139–156 (2014).
Monaco, M. K. et al. Gramene 2013: comparative plant genomics resources. Nucleic Acids Res 42, D1193–D1199 (2014).
Foster, A. J., Pelletier, G., Tanguay, P. & Seguin, A. Transcriptome Analysis of Poplar during Leaf Spot Infection with Sphaerulina spp. PLoS One 10, e0138162 (2015).
Liang, H., Staton, M., Xu, Y., Xu, T. & Leboldus, J. Comparative expression analysis of resistant and susceptible Populus clones inoculated with Septoria musiva. Plant Sci 223, 69–78 (2014).
Wang, S. et al. Regulation of secondary cell wall biosynthesis by poplar R2R3 MYB transcription factor PtrMYB152 in Arabidopsis. Scientific reports 4, 5054 (2014).
Zhong, R., McCarthy, R. L., Lee, C. & Ye, Z. H. Dissection of the transcriptional program regulating secondary wall biosynthesis during wood formation in poplar. Plant Physiol 157, 1452–1468 (2011).
Lee, I., Ambaru, B., Thakkar, P., Marcotte, E. M. & Rhee, S. Y. Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana. Nat Biotechnol 28, 149–156 (2010).
Shin, J. et al. FlyNet: a versatile network prioritization server for the Drosophila community. Nucleic Acids Res 43, W91–W97 (2015).
Sima, C., Braga-Neto, U. & Dougherty, E. R. Superior feature-set ranking for small samples using bolstered error estimation. Bioinformatics 21, 1046–1054 (2005).
Huynen, M., Snel, B., Lathe, W. 3rd & Bork, P. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res 10, 1204–1210 (2000).
Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. & Yeates, T. O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 96, 4285–4288 (1999).
Wolf, Y. I., Rogozin, I. B., Kondrashov, A. S. & Koonin, E. V. Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res 11, 356–372 (2001).
Bowers, P. M. et al. Prolinks: a database of protein functional linkages derived from coevolution. Genome Biol 5, R35 (2004).
Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G. D. & Maltsev, N. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 96, 2896–2901 (1999).
Dandekar, T., Snel, B., Huynen, M. & Bork, P. Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem Sci 23, 324–328 (1998).
Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44, D279–D285 (2016).
Lee, I., Blom, U. M., Wang, P. I., Shim, J. E. & Marcotte, E. M. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res 21, 1109–1121 (2011).
Kim, H. et al. YeastNet v3: a public database of data-specific and integrated functional gene networks for Saccharomyces cerevisiae. Nucleic Acids Res 42, D731–D736 (2014).

Acknowledgements

This project was financially supported by the Twelfth Five National Key Technology R&D Program (2012BAD01B03).

Author information

Authors and Affiliations

State Key Laboratory of Tree Genetics and Breeding, Research Institute of Forestry, Chinese Academy of Forestry, Key Laboratory of Tree Breeding and Cultivation, State Forestry Administration, Beijing, 100091, China
Qi Liu, Changjun Ding, Yanguang Chu, Jiafei Chen, Weixi Zhang, Bingyu Zhang, Qinjun Huang & Xiaohua Su
Co-Innovation Center for Sustainable Forestry in Southern China, Nanjing Forestry University, Nanjing 210037, China
Xiaohua Su

Contributions

Q.L. constructed the PoplarGene network, developed the web service and drafted the manuscript. C.D., Y.C. and J.C. participated in the pipeline development for the web server. W.Z., B.Z. and Q.H. participated in drafting the manuscript. X.S. was involved in planning the study and headed the project. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiaohua Su.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Information (PDF 2886 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About this article

Received: 21 April 2016
Accepted: 18 July 2016
Published: 12 August 2016
DOI: https://doi.org/10.1038/srep31356
