Abstract
In the general framework of the weighted gene co-expression network analysis (WGCNA), a hierarchical clustering algorithm is commonly used to module definition. However, hierarchical clustering depends strongly on the topological overlap measure. In other words, this algorithm may assign two genes with low topological overlap to different modules even though their expression patterns are similar. Here, a novel gene module clustering algorithm for WGCNA is proposed. We develop a gene module clustering network (gmcNet), which simultaneously addresses single-level expression and topological overlap measure. The proposed gmcNet includes a “co-expression pattern recognizer” (CEPR) and “module classifier”. The CEPR incorporates expression features of single genes into the topological features of co-expressed ones. Given this CEPR-embedded feature, the module classifier computes module assignment probabilities. We validated gmcNet performance using 4,976 genes from 20 native Korean cattle. We observed that the CEPR generates more robust features than single-level expression or topological overlap measure. Given the CEPR-embedded feature, gmcNet achieved the best performance in terms of modularity (0.261) and the differentially expressed signal (27.739) compared with other clustering methods tested. Furthermore, gmcNet detected some interesting biological functionalities for carcass weight, backfat thickness, intramuscular fat, and beef tenderness of Korean native cattle. Therefore, gmcNet is a useful framework for WGCNA module clustering.
Similar content being viewed by others
Introduction
Weighted gene co-expression network analysis (WGCNA) is often used to explore the system-level functionality of gene sets. WGCNA groups thousands of genes into a number of modules, simplifying biological interpretation. The general framework of WGCNA1 can be summarized as follows. First, the adjacencies of paired genes are calculated to define the gene co-expression network. The adjacencies are then incorporated into a topological overlap measure (TOM) to reveal gene-gene connections. Using the TOM, a clustering algorithm assigns intensively connected genes to the same modules. Finally, functional analyses are used to determine the biological meanings of the modules. This pipeline has been widely used in various fields. For example, recent biomedical studies used WGCNA to identify specific modules and hub genes related to human cancer2 and arterial disease3. In animal and plant sciences, WGCNA has often been used to profile plant gene expression4 and detect pathways responsible for complex animal traits5,6. The module definitions greatly affect the interpretations of the results. WGCNA commonly uses a hierarchical clustering (HC) algorithm. This unsupervised clustering method places adjacent genes into the same modules based on pairwise TOM data. However, a concern has been raised that transformations of gene expressions into a TOM results in loss of raw-level expression features. HC-based module assignment depends strongly on the TOM. This can degrade similarity of expression not only between modules, but also within modules. In other words, HC may assign two genes with low topological overlap to different modules even though their expression patterns are similar. Furthermore, once a gene is added to a specific module, HC can never reverse the decision. This poses challenges when clustering complicated networks with many interconnected gene pairs. Thus, a new algorithm is needed to more accurately identify WGCNA gene modules. Langfelder et al.7 developed a “dynamic tree cut” technique that clusters gene modules based on the shapes of dendrogram branches, but this still depends on TOM. Botía et al.8 employed a derivative of K-means processing to refine gene modules generated by standard HC. However, this algorithm requires more than four steps beginning with module clustering, centroid computation, distance measurements, and gene relocation. This complex pipeline requires significant computational time and is thus unsuitable for very large networks.
A graph neural network (GNN)9,10,11,12 is a good alternative algorithm for module clustering. GNNs extend deep neural networks to learn a graph representation by finding stable features of nodes and its neighbors in graph-based data. Gilmer et al.11 introduced a general framework termed message-passing neural network (MPNN), which effectively aggregates each node with its neighbors into embedding features. Many other studies for GNN have achieved impressive performance using this framework12,13. Given the recent successes of GNN, graph-based learning methods have been widely applied in bioinformatics. To predict drug-target interactions, recent studies employed various graphical convolutional networks14,15. For single-cell RNA-seq analysis, a GNN was used to model cell-cell relationships16 and impute gene expression levels within single cells17. Yang et al.18 developed a GNN that extracted protein features from graphical information. However, most studies on WGCNA did not use GNN for module clustering.
In this paper, we introduce a GNN-based clustering algorithm for WGCNA: the gene module clustering network (gmcNet). Our method clusters genes based on their co-expression topologies (genes in the same module should be strongly connected) and single-level expression (genes in the same module should exhibit similar expression patterns). The main innovation of gmcNet is incorporating the expression feature of single gene with co-expression feature of their neighbor genes. gmcNet includes a “co-expression pattern recognizer” (CEPR) and a module classifier. The CEPR has a message-passing (MP) operation similar to that of MPNN11, except that the topological overlap matrix1 is used as the input rather than the adjacency matrix. Using the former matrix, CEPR defines weighted relationships, consistent with the objective of WGCNA. The module classifier assigns genes to various modules using the CEPR-embedded features. We tested gmcNet using RNA-seq data for native Korean cattle, and compared the performance to that of other clustering algorithms. We also validated gmcNet performance on gene expression datasets of human, mouse, pig, and chicken which were downloaded from the Gene Expression Omnibus (GEO) repository19. As GNNs are not widely used for WGCNA, our findings will be of interest to computational biologists.
Results
Model performance on Korean native cattle
To validate gmcNet performance, it was compared to four baseline clustering algorithms including HC, K-means clustering, and K-medoids clustering (Fig. 1). We measured performance in terms of clustering strength and functional enrichment. We used graph modularity20 to measure the clustering strength, and the differentially expressed module (DEM) signals to assess functional enrichment.
Table 1 presents the performances of the various methods on Korean native cattle dataset. The single gene expression-based method (K-means) is robust to DEM signal capture, whereas the TOM-based methods (HC, K-medoids) provide higher modularity. On the other hand, gmcNet, which leverages both single gene expression and TOM, achieves the best DEM signal (27.739) and cluster modularity (\(\mathcal {Q}\): 0.261). Comparison of gmcNet and HC revealed that gmcNet markedly increases modularity and the DEM signal by 0.042 and 9.121, respectively. Thus, gmcNet is more powerful than the other methods for revealing the apparent closeness of genes within the same module, and when making biological sense of the complex traits of native Korean cattle.
CEPR embedding
Figure 2 shows plots based on the first and second principal components of three feature types (single-level expression, TOM, and CEPR embedding) of Korean native cattle dataset. Single-level expression fails to distinguish modules with ambiguous boundaries. This may reflect the low modularity of K-means, which uses single-level expression for clustering. The TOM provides stronger connections between genes than single-level expression. However, it also decreases the distances between different modules and genes. As shown in Fig. 2, K-medoids and HC, which use the TOM for clustering, do not clearly assign genes into different but closely related modules. Compared to the other types, CEPR embedding provides better separation, i.e. smaller distances between genes and larger ones between modules. With CEPR embedding, gmcNet defines gene modules more clearly and increases modularity.
Model performance at different k (number of clusters)
Our current implementation of gmcNet requires the setting of an optimal k (number of clusters). The effects of the k-value on modularity \(\mathcal {Q}\) and the DEM signal are summarized in Fig. 3. With an increasing k-value, the DEM signal increases while the \(\mathcal {Q}\) decreases. In contrast, gmcNet yields a larger DEM signal than HC even at smaller k-values (\(6 \le k<8\)), and remains higher \(\mathcal {Q}\) at larger k-values (\(k=9\)). gmcNet outperforms K-means and K-medoids for all k-values. These results can demonstrate the superiority of gmcNet regardless of the k-value.
Functional enrichment analysis of native Korean cattle
To identify the DEMs, we performed linear regression analysis of the module eigengenes1 for four complex traits, including carcass weight (CWT), backfat thickness (BF), intramuscular fat content (IMF), and the Warner-Bratzler shear force (WBSF). Figure 4 shows the results. In terms of the number of DEMs, IMF ranked first with four modules (K2, K3, K4, and K8) followed by BF (K2, K3, and K4), WBSF (K5 and K7), and CWT (K1 and K6). Interestingly, K5 and K7, which contain large numbers of genes, were significant to WBSF. This may reflect our mode of data collection; the RNA-seq data were from the longissimus-dorsi muscle and WBSF indicates the tenderness of beef muscle. Also, gmcNet detected 11 significant module-trait interactions. gmcNet found more DEMs than the other methods (HC: 9, K-means: 10, and K-medoids: 10) (Fig. S1).
We used Gene Ontology (GO) enrichment analysis21 to annotate the biological processes of the modules defined by gmcNet. Three modules (K1, K5, and K7) were linked to significant processes (Fig. 5). K1, a CWT-related module, was enriched in “biosynthetic” and “metabolic” processes. Based on both the DEM analysis and the GO enrichment results, K1 seems to involve many genes associated with growth-related traits. Two WBSF-related modules (K5 and K7) were enriched in “immune system” and “protein catabolism” , respectively. Although several studies have suggested that the immune system plays a key role in cattle weight gain and feed efficiency22,23, the association between beef tenderness and immune pathways is a novel finding. Various studies have reported an association between “protein catabolic process” and beef tenderness24,25,26. Therefore, the results suggest that K7 is a key module of beef tenderness in native Korean cattle.
Hub gene searches for modules of interest
Given the functional enrichment results, we selected the four modules, K1, K2, K4, and K7, as the principal modules of complex traits. Figure 6 shows the hub gene networks and Table 2 shows the related traits. The six hub genes of K1 are related to quantitative traits including growth (LAMTOR527 and PAM1628) and feed intake (NDUFB129, NDUFB430, ATP5MF31, and SEC61G32). These findings support our suggestion that K1 is significant in terms of CWT. K2 and K4, associated with fat-related traits (BF and IMF) in DEM analysis, include eight (ACSL333, NFKB134, CYP2R135, HSF236, TMEM13537, PDCD438, HERPUD239, and NMRAL140) and seven (SPNS134, MYOD141, PDXK42, TMUB137, ARHGAP2643, RAB1534, and TP7344) fat-related hub genes, respectively. Thus, future research should identify the relationships between fat metabolism and modules K2 and K4. Although K7 was associated with WBSF in DEM analysis, only four hub genes (PARD345, EIF4G346, PAFAH1B147, and CAMTA248) were associated with growth-related traits; the other hub genes were all novel.
Gene Expression Omnibus (GEO) repository
We also performed our method on the NCBI GEO datasets19. The datasets include four different species (GDS6010: human, GDS5618: mouse, GDS4246: pig, and GDS3857: chicken). We measured DEM signals using the trait included in each dataset (human: virus infection, mouse: pancreatic islets, pig: blood, chicken: light pulse). The implementation details for GEO datasets can be shown in supproting information S2. Table 3 presents the performances of the various methods on GED datasets. For mouse and chicken gmcNet achieves the best cluster modularity, while for human and pig gmcNet show much lower modularity than other TOM-based method (HC and K-medoids). However, gmcNet outperforms all methods on DEM signal capture with reasonable modularity for all datasets. These results can prove the gmcNet is useful method to group thousands of gene according to their system-level functionality.
Discussion
Single-level expression is generally appropriate to identify trait-specific marker genes that are differentially expressed depending on the biological phenotype63. Here, we found that single-level expression also revealed trait-specific modules with strong DEM signals. However, most existing WGCNA methods address only the co-expression topology (including TOM); the DEM signals are weak. On the other hand, our gmcNet simultaneously addresses single-level expression and TOM. gmcNet thus yielded larger DEM signals than other clustering methods. Furthermore, gmcNet produced some novel and interesting results. Threfore, gmcNet can detect module functionality and improves our understanding of WGCNA system-level biology. Also, gmcNet yields strong adjacencies between genes in the same module. gmcNet exploits the learnable properties of CEPR, which aggregates single-gene expressions with the co-expression features of its first neighbors, embedding these features to reduced dimensions. As noted in the Results section, CEPR generates more robust features than single-gene expression data or TOM. Given the CEPR-embedded feature, gmcNet achieved the best WGCNA modularity of all clustering methods tested.
Many genes are uniformly expressed in all individuals. Such genes (“noise”) are intimately connected with nested modules and exhibit no differential expression in complex trait analysis. Any attempt to cluster them disrupts module identification and obscures the biological implications. HC uses a dendrogram cut-off to exclude noisy genes. On the other hand, gmcNet assigns every gene to the most probable module. This may yield some meaningless assignments, because uniform expression may render the assignments to nested modules similar. Therefore, in future, it will be important to eliminate noise. We are exploring probability thresholding to this end. Specifically, genes with maximum probabilities lower than a given threshold will be excluded from module assignment. We will also add the optimal k search method to gmcNet; k-values can greatly increase model performance and may be modified depending on the characteristics of a dataset. Here, gmcNet used the optimal k of HC and performed better than other methods. In addition, gmcNet outperformed K-means and K-medoids at all k-values tested (2-10). Thus, the addition of an optimal k search would improve gmcNet performance in the context of WGCNA.
We derived a gene module clustering network, gmcNet, which simultaneously addresses single-level expression and TOM. We validated gmcNet performance using 4,976 genes from 20 native Korean cattle and four GEO datasets. gmcNet reliably assigned genes to modules exhibiting high modularity and DEM signals. gmcNet also detected some interesting biological functionalities. Therefore, gmcNet is a useful framework for WGCNA module clustering.
Materials and methods
Korean native cattle data
A total of 20 native Korean steers, born 2013 at Hanwoo Experiment Station, National Institute of Animal Science (NIAS), Rural Development Administration, South Korea, were used; all were humanely slaughtered at 30 months of age. The CWT (kg), and BF (mm) were measured after chilling for 24 hours. BF was measured at the junction of the 12th and 13th ribs. The WBSF and IMF were measured at the longissimus-dorsi muscle according to64 and65, respectively. RNA from the longissimus-dorsi muscle was extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA). RNA quality and quantity were assessed by automated capillary gel electrophoresis performed using a Bioanalyzer 2100 running the RNA 6000 Nano LabChip (Agilent Technologies Ireland, Dublin, Ireland). Only RNA samples with RNA integrity \(\ge 7\) were retained. Complementary DNA (cDNA) libraries were synthesized with Illumina TruSeq preparation Kit according to the manufacturer’s instruction (Illumina, San Diego, CA, USA). The RNA sequencing was done using Hiseq 2000 Illumina platform to obtain paired-end reads. The quality of the raw RNA samples was confirmed using FastQC v0.1166, and the reads with low quality were removed using Trimmomatic v0.3667. The reads were aligned to the reference genome Bos taurus (Ensemble UMD3.1) with TopHat v2.168. The gene count of the reads was done with HTSeq v0.9169. Reads per kilobase per million (RPKM) were computed for each gene. We used Pearson correlation test to filter out uniformly expressed genes for the four traits (CWT, BF, IMF, and WBSF). Specifically, we calculated correlation coefficients between each gene and the traits. Then, the genes which show non-significant correlation (\(p\text {-value}>0.1\)) for any of the traits, were excluded in further progresses. After deriving Pearson correlation test, we excluded 7,555 genes and subjected 4,976 genes in 20 samples to this study. Notice that the National Institute of Animal Science (NIAS) of the Rural Development Administration (RDA) of South Korea approved the experimental procedures (ethics committee approval number: 2015-150).
Co-expression network construction
To represent the co-expression network in matrix form, we used the topological overlap matrix of1. Briefly, the adjacency of each pair of genes i and j is given by \(a_{ij}=\left| {cor}_{ij}\right| ^{\beta }\) where \(\beta \) is a smoothing parameter and \({cor}_{ij}\) is the correlation coefficient between the single-level expressions of the two genes. Given the adjacency values \(a_{ij}\), the topological overlap matrix \(\mathbf {T}\in \mathbb {R}^{n\times n}\) was created using a TOM70, where n is the number of genes. TOM \(t_{ij}\), which provides a similarity measure in the topological overlap matrix, is calculated as follows:
where, \(l_{ij}=\sum _u{a_{iu}a_{uj}}\) and \(k_i=\sum _u{a_{iu}}\) is a node connectivity.
Also, we constructed two additional topological overlap matrices to train gmcNet (Fig. 7). \(\mathbf {T}_\text {p}\in \mathbb {R}^{n\times }\), representing the positive network, was created leaving only positive correlation coefficients, whereas \(\mathbf {T}_\text {n}\in \mathbb {R}^{n\times n}\), representing the negative network, was created leaving only negative correlation coefficients. After scale-free model fitting1, we chose \(\beta =6\), \(\beta =9\), and \(\beta =10\) as the smoothing parameters for \(\mathbf {T}\), \(\mathbf {T}_\text {p}\), and \(\mathbf {T}_\text {n}\), respectively.
Gene module clustering network
We developed a gene module clustering network (gmcNet) that clusters genes according to their co-expression topologies (genes in the same module should be strongly connected) and their single-level expression (genes in the same module should exhibit similar expression patterns). Figure 8 shows an overview of gmcNet, which features a co-expression pattern recognizer (CEPR) and module classifier. The CEPR incorporates the expression features of single genes into the topological features of co-expressed ones. Given this CEPR-embedded feature, the module classifier computes module assignment probabilities.
Network structure
CEPR: The goal of CEPR is to integrate single-expression features with co-expression features. To achieve this, we used the MP operation of MPNN11, but employed the topological overlap matrix rather than the adjacency matrix. We computed a new topological overlap matrix \(\widetilde{\mathbf {T}}\) by zeroing the diagonal of \(\mathbf {T}\) and applying degree normalization:
where \(\mathbf {D}=\text {diag}({\mathbf {T}_\text {z}1}_n)\) is a degree matrix. Let \(\mathbf {X}\in \mathbb {R}^{n\times m}\) be the single-level expression of n genes in m samples. Then, single and co-expression can be simply combined via an MP operation:
where \(\mathbf {W}_\text {co}\) and \(\mathbf {W}_\text {single}\) are the trainable parameters of the co- and single-expression features. As \(\widetilde{\mathbf {T}}\) includes the topological adjacencies between gene pairs, it is easy to see that \(\widetilde{\mathbf {T}}\mathbf {X}\) can be interpreted as a co-expression feature.
A simple MP operation cannot separate positive and negative co-expressions, even when they differ in different biological pathways. Therefore, we refined a simple MP to become a CEPR, as follows:
where \(\{\mathbf {W}_\text {c}, \mathbf {W}_\text {p}, \mathbf {W}_\text {n}, \mathbf {W}_\text {s}\} \in \mathbb {R}^{m\times m^\prime }\) are the trainable weights of the co-expression, positive co-expression, negative co-expressions, and single-expression, respectively. \(m^\prime \) is an embedding dimension (set to 8). As \({\widetilde{\mathbf {T}}}_\text {p}\mathbf {X}\mathbf {W}_\text {p}\) and \({\widetilde{\mathbf {T}}}_\text {n}\mathbf {X}\mathbf {W}_\text {n}\) are identical in terms of dimensionality, CEPR learns various co-expressions by simply adding them. By skip connections of single-expression \(\mathbf {X}\mathbf {W}_\text {s}\), CEPR generates the embedding feature \(\bar{\mathbf {X}}\in \mathbb {R}^{n\times m^\prime }\), which deals with single-expression and three different co-expressions in the \(m^\prime \) dimension.
Module classifier: Given the CEPR-embedded feature \(\bar{\mathbf {X}}\), the module classifier computes a module assignment probability using a multi-layer perceptron (MLP):
where \(\mathbf {W}_\text {m}\in \mathbb {R}^{m^\prime \times k}\) are the trainable weights for clustering of k modules. As softmax activation guarantees that \(m_{ij}\in [0,1]\), the ith-row of \(\mathbf {M}\in \mathbb {R}^{n\times k}\) corresponds to the module-assignment probability of gene i. In other words, gene i belongs to module c if \(m_{ic}\) is the maximum value of the ith-row of \(\mathbf {M}\).
Loss function
For unsupervised clustering, we employed the cut and orthogonality loss terms of MinCutPool71. The loss function when training gmcNet was defined as:
where \({\left\| \cdot \right\| }_F\) indicates the Frobenius norm and Tr is the trace; \(\lambda \) is a balancing hyper-parameter, which is set to 2.6. The cut loss term, \(\mathscr{L}_\text {c}\), encourages clustering of strongly connected genes within the same module, and the orthogonality loss term, \(\mathscr{L}_\text {o}\), penalizes assignment to similarly sized modules.
Implementation Details
The model was iterated for 5,000 epochs using a GeForce RTX 2080ti. For the first 100 epochs, the balancing hyperparameter \(\lambda \) was set to 0 and the learning rate to 0.01. This prevented the creation of empty modules. After epoch 100, we set \(\lambda \) to 2.6 and the learning rate to 0.001. Model training was early stopped at \(\mathscr{L}_\text {o}>\tau \), where \(\tau \) is the orthogonal threshold, which was set to 0.8. The Adam optimizer72 was used to minimize the loss function. Finally, \(\mathbf {M}\) at the end of training was used for module assignment.
Model performance
To validate gmcNet performance, HC7, K-means73 and K-medoids74 were also used for module clustering and the results were compared to those of gmcNet. K-means uses single-expression feature \(\mathbf {X}\) as input data; the HC and K-medoids use the topological distances \(1-\mathbf {T}\) as inputs. The optimal k for K-means, K-medoids, and gmcNet was set to 8, as suggested by application of the dynamic tree cut technique7 to HC.
Metrics
We measured the model performance in terms of modularity and DEM signaling. Module modularity is a commonly used metric in graph clustering. In a fully random graph, gene i and j of degrees \(c_i=\sum _{u}t_{iu}\) and \(c_j=\sum _{u}t_{ju}\) are connected with a probability \(c_ic_j/s\), where s is the total topological overlap \(s=\sum _{ij}{t_{ij}}\). Modularity measures the divergence between intra-module connections as:
where \(\delta \left[ i,j\right] =1\) if i and j belong to the same module; otherwise, \(\delta \left[ i,j\right] =\ 0\).
To assess functional enrichment of clustering method, we introduce a novel metric, called DEM signal. Let \(\rho [l,t]=1\) if module l is significant (\(\le 0.05\)) for trait t; otherwise, \(\rho \left[ l,t\right] =0\). The final DEM signal was defined as:
where t is traits and \(p\text {-value}_{lt}\) indicates the significance value of module l in terms of trait t. We employed linear regression analysis to the module eigengenes, i.e. the first principal components of the modules, for four complex traits: CWT, BF, IMF and WBSF.
Functional enrichment analysis
The Bioconductor R package “clusterProfiler”75 was used for GO analysis. The adjusted \(p\text {-value}\) (obtained using the Bonferroni method) was employed to examine the significance (p.adjust\(<0.05\)) of all GO terms. The top 20 biological processes were extracted if there were more than 20 significant results. To identify hub genes, we calculated the correlation coefficients between single-level expression of each gene and the ME of the module it belong to. The top 25 genes (in terms of correlation coefficients) were defined as hub genes.
Data availability
The gmcNet code and example data is available on GitHub at https://github.com/gywns6287/gmcNet. Request for full gene expression data of Korean native cattle can be made to Korea National Institute of Animal Science, Animal Genome & Bioinformatics Division (http://www.nias.go.kr/english/sub/boardHtml.do?boardId=depintro).
References
Zhang, B. & Horvath, S. A general framework for weighted gene co-expression network analysis. Stat. applications genetics molecular biology 4 (2005).
Li, J. et al. Application of weighted gene co-expression network analysis for data from paired design. Sci. Rep. 8, 1–8 (2018).
Zheng, P.-F., Chen, L.-Z., Guan, Y.-Z. & Liu, P. Weighted gene co-expression network analysis identifies specific modules and hub genes related to coronary artery disease. Sci. Rep. 11, 1–13 (2021).
Rao, X. & Dixon, R. A. Co-expression networks for plant biology: why and how. Acta biochimica et biophysica Sinica 51, 981–988 (2019).
Salleh, M. et al. Rna-seq transcriptomics and pathway analyses reveal potential regulatory genes and molecular mechanisms in high-and low-residual feed intake in nordic dairy cattle. BMC Genomics 18, 1–17 (2017).
Silva-Vignato, B. et al. Gene co-expression networks associated with carcass traits reveal new pathways for muscle and fat deposition in nelore cattle. BMC Genomics 20, 1–13 (2019).
Langfelder, P., Zhang, B. & Horvath, S. Defining clusters from a hierarchical cluster tree: the dynamic tree cut package for r. Bioinformatics 24, 719–720 (2008).
Botía, J. A. et al. An additional k-means clustering step improves the biological features of wgcna gene co-expression networks. BMC Syst. Biol. 11, 1–16 (2017).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. ICLR-17 (2017).
Xu, D., Zhu, Y., Choy, C. B. & Fei-Fei, L. Scene graph generation by iterative message passing. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5410–5419 (2017).
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. Int. Conf. Mach. Learn. 1263–1272 (2017).
Hamilton, W. L., Ying, R. & Leskovec, J. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, 1025–1035 (2017).
Wang, Y. et al. Dynamic graph cnn for learning on point clouds. Acm. Trans. Graph. (tog) 38, 1–12 (2019).
Peng, J. et al. An end-to-end heterogeneous graph representation learning-based framework for drug–target interaction prediction. Brief. Bioinf. (2021).
Zhao, T., Hu, Y., Valsdottir, L. R., Zang, T. & Peng, J. Identifying drug-target interactions based on graph convolutional network and deep neural network. Brief. Bioinf. 22, 2141–2150 (2021).
Wang, J. et al. scgnn is a novel graph neural network framework for single-cell rna-seq analyses. Nat. Commun. 12, 1–11 (2021).
Rao, J., Zhou, X., Lu, Y., Zhao, H. & Yang, Y. Imputing single-cell rna-seq data by combining graph convolution and autoencoder neural networks. Iscience 24, 102393 (2021).
Yang, F., Fan, K., Song, D. & Lin, H. Graph-based prediction of protein-protein interactions with attributed signed graph embedding. BMC Bioinf. 21, 1–16 (2020).
Database resources of the national center for biotechnology information. Nucleic acids research 46, D8–D13 (2018).
Newman, M. E. Modularity and community structure in networks. Proc. Nat. Acad. Sci. 103, 8577–8582 (2006).
Wu, T. et al. clusterprofiler 4.0: A universal enrichment tool for interpreting omics data. The Innov. 100141 (2021).
Reynolds, J., Foote, A., Freetly, H., Oliver, W. & Lindholm-Perry, A. Relationships between inflammation-and immunity-related transcript abundance in the rumen and jejunum of beef steers with divergent average daily gain. Anim. Gen. 48, 447–449 (2017).
Alexandre, P. A. et al. Liver transcriptomic networks reveal main biological processes associated with feed efficiency in beef cattle. BMC Gen. 16, 1–13 (2015).
Zhao, C. et al. Functional proteomic and interactome analysis of proteins associated with beef tenderness in angus cattle. Livest. Sci. 161, 201–209 (2014).
Tian, X. et al. Quality and proteome changes of beef m. longissimus dorsi cooked using a water bath and ohmic heating process. Innov. Food Sci. Emerg. Technol. 34, 259–266 (2016).
Li, Y. et al. Association of cast gene polymorphisms with carcass and meat quality traits in yanbian cattle of china. Mol. Biol. Rep. 40, 1875–1881 (2013).
Ribeiro, V. M. P. et al. Genes underlying genetic correlation between growth, reproductive and parasite burden traits in beef cattle. Livest. Sci 244, 104332 (2021).
Kern, R. J. et al. Transcriptome differences in the rumen of beef steers with variation in feed intake and gain. Gene 586, 12–26 (2016).
Keogh, K., McKenna, C., Porter, R., Waters, S. & Kenny, D. Effect of dietary restriction and subsequent realimentation on hepatic oxidative phosphorylation in cattle. Animal 15, 100009 (2021).
Benedeti, P. D. B. et al. Nellore bulls (bos taurus indicus) with high residual feed intake have increased the expression of genes involved in oxidative phosphorylation in rumen epithelium. Anim. Feed. Sci. Technol. 235, 77–86 (2018).
Nolte, W. et al. Identification and annotation of potential function of regulatory antisense long non-coding rnas related to feed efficiency in bos taurus bulls. Int. J. Mol. Sci. 21, 3292 (2020).
Hardie, L. et al. The genetic and biological basis of feed efficiency in mid-lactation holstein dairy cows. J. Dairy Sci. 100, 9061–9075 (2017).
Lv, Y. et al. Effect of acsl3 expression levels on preadipocyte differentiation in chinese red steppe cattle. DNA Cell Biol. 38, 945–954 (2019).
Waters, S. M., Coyne, G. S., Kenny, D. A. & Morris, D. G. Effect of dietary n-3 polyunsaturated fatty acids on transcription factor regulation in the bovine endometrium. Mol. Biol. Rep. 41, 2745–2755 (2014).
Li, Y. et al. Transcriptome profiling of longissimus lumborum in holstein bulls and steers with different beef qualities. PloS one 15, e0235218 (2020).
Baik, M., Vu, T., Piao, M. & Kang, H. Association of dna methylation levels with tissue-specific expression of adipogenic and lipogenic genes in longissimus dorsi muscle of korean cattle. Asian-Australasian J. Anim. Sci. 27, 1493 (2014).
Seong, J., Yoon, H. & Kong, H. S. Identification of microrna and target gene associated with marbling score in korean cattle (hanwoo). Gene. Gen. 38, 529–538 (2016).
Melnik, B. C., John, S. M. & Schmitz, G. Milk consumption during pregnancy increases birth weight, a risk factor for the development of diseases of civilization. J. Transl. Med. 13, 1–11 (2015).
Yu, S.-L. et al. Identification of differentially expressed genes between preadipocytes and adipocytes using affymetrix bovine genome array. J. Anim. Sci. Technol. 51, 443–452 (2009).
Engle, B., Masters, M., Boles, J. A. & Thomson, J. Gene expression and carcass traits are different between different quality grade groups in red-faced hereford steers. Animals 11, 1910 (2021).
Shao, T., McCann, J. C. & Shike, D. W. Effects of supplements differing in fatty acid profile to late gestational beef cows on steer progeny finishing phase growth performance, carcass characteristics, and mrna expression of myogenic and adipogenic genes. Animals 11, 1904 (2021).
Peletto, S. et al. Genetic basis of lipomatous myopathy in piedmontese beef cattle. Livest. Sci. 206, 9–16 (2017).
Martins, R. et al. Genome-wide association study and pathway analysis for fat deposition traits in nellore cattle raised in pasture-based systems. J. Animal Breed. Genet 138, 360–378 (2021).
de Las Heras-Saldana, S. et al. Differential gene expression in longissimus dorsi muscle of hanwoo steers–new insight in genes involved in marbling development at younger ages. Genes 11, 1381 (2020).
Zhang, F. et al. Genetic architecture of quantitative traits in beef cattle revealed by genome wide association studies of imputed whole genome sequence variants: I: Feed efficiency and component traits. BMC Gen. 21, 1–22 (2020).
Keogh, K. et al. Effect of dietary restriction and subsequent re-alimentation on the transcriptional profile of bovine ruminal epithelium. PloS one 12, e0177852 (2017).
Srivastava, S. et al. Haplotype-based genome-wide association study and identification of candidate genes associated with carcass traits in hanwoo cattle. Genes 11, 551 (2020).
Bazile, J. et al. Molecular signatures of muscle growth and composition deciphered by the meta-analysis of age-related public transcriptomics data. Physiol. Geno. 52, 322–332 (2020).
Bernard, C. et al. New indicators of beef sensory quality revealed by expression of specific genes. J. Agric. Food Chem. 55, 5229–5237 (2007).
Muniz, M. M. M. et al. Identification of novel mrna isoforms associated with meat tenderness using rna sequencing data in beef cattle. Meat Sci. 108378 (2020).
de Lemos, M. V. A. et al. Association study between copy number variation and beef fatty acid profile of nellore cattle. J. Appl. Gene. 59, 203–223 (2018).
Olivieri, B. F. et al. Differentially expressed genes identified through rna-seq with extreme values of principal components for beef fatty acid in nelore cattle. J. Anim. Breed. Genet 138, 80–90 (2021).
de Almeida Santana, M. H. et al. Copy number variations and genome-wide associations reveal putative genes and metabolic pathways involved with the feed conversion ratio in beef cattle. J. Appl. Gene. 57, 495–504 (2016).
Anton, I. et al. Effect of single-nucleotide polymorphisms on the breeding value of fertility and breeding value of beef in hungarian simmental cattle. Acta Vet. Hungarica 66, 215–225 (2018).
Seabury, C. M. et al. Genome-wide association study for feed efficiency and growth traits in us beef cattle. BMC Geno. 18, 1–25 (2017).
Manca, E. et al. Use of the multivariate discriminant analysis for genome-wide association studies in cattle. Animals 10, 1300 (2020).
Keel, B. N. et al. Rna-seq meta-analysis identifies genes in skeletal muscle associated with gain and intake across a multi-season study of crossbred beef steers. BMC Geno. 19, 1–11 (2018).
Elolimy, A. A. et al. Skeletal muscle and liver gene expression profiles in finishing steers supplemented with amaize. Anim. Sci. J. 89, 1107–1119 (2018).
Kong, R. S., Liang, G., Chen, Y. & Stothard, P. Transcriptome profiling of the rumen epithelium of beef cattle differing in residual feed intake. BMC Geno. 17, 1–16 (2016).
Tizioto, P. et al. Variation in myogenic differentiation 1 mrna abundance is associated with beef tenderness in nelore cattle. Anim. Gene. 47, 491–494 (2016).
Leal-Gutiérrez, J. D., Elzo, M. A., Johnson, D. D., Hamblen, H. & Mateescu, R. G. Genome wide association and gene enrichment analysis reveal membrane anchoring and structural proteins associated with meat quality in beef. BMC Geno. 20, 1–18 (2019).
Ramayo-Caldas, Y. et al. A marker-derived gene network reveals the regulatory role of ppargc1a, hnf4g, and foxp3 in intramuscular fat deposition of beef cattle. J. Anim. Sci 92, 2832–2845 (2014).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for rna-seq data with deseq2. Geno. Biol. 15, 1–21 (2014).
Wheeler, T., Shackelford, S. & Koohmaraie, M. Relationship of beef longissimus tenderness classes to tenderness of gluteus medius, semimembranosus, and biceps femoris. J. Anim. Sci. 78, 2856–2861 (2000).
Feldsine, P., Abeyta, C. & Andrews, W. H. Aoac international methods committee guidelines for validation of qualitative and quantitative food microbiological official methods of analysis. J. AOAC Int. 85, 1187–1200 (2002).
Andrews, S. Fastqc: a quality control tool for high throughput sequence data. Available from: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (2010).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: A flexible trimmer for illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Trapnell, C., Pachter, L. & Salzberg, S. L. Tophat: Discovering splice junctions with rna-seq. Bioinformatics 25, 1105–1111 (2009).
Anders, S., Pyl, P. T. & Huber, W. Htseq–a python framework to work with high-throughput sequencing data. Bioinformatics 31, 166–169 (2015).
Li, A. & Horvath, S. Network neighborhood analysis with the multi-node topological overlap measure. Bioinformatics 23, 222–231 (2007).
Bianchi, F. M., Grattarola, D. & Alippi, C. Spectral clustering with graph neural networks for graph pooling. In International Conference on Machine Learning, 874–883 (PMLR, 2020).
Kingma, D. P. & Ba, J. L. Adam: A method for stochastic gradient descent. In ICLR: International Conference on Learning Representations, 1–15 (2015).
Lloyd, S. Least squares quantization in pcm. IEEE Trans. Inf. The. 28, 129–137 (1982).
Kaufman, L. & Rousseeuw, P. J. Finding groups in data: an introduction to cluster analysis, vol. 344 (John Wiley & Sons, 2009).
Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterprofiler: An r package for comparing biological themes among gene clusters. Omics: A J. Integr. Biol. 16, 284–287 (2012).
Acknowledgements
This work was supported by Institute of Information & communications Technology Planning & Evaluation(IITP) grant funded by the Korea government(MSIT)(No.2020-0-01441, Artificial Intelligence Convergence Research Center(Chungnam National University)).
Author information
Authors and Affiliations
Contributions
Conceptualization: S.H.L., Y.J.K. Data Curation: K.Y.C. Formal Analysis: H.-J.L., J.H.L. Funding Acquisition: J.H.L., Y.-K.K. Methodology: H.-J.L., Y.-K.K. Software: H.-J.L., Y.J.K. Visualization: Y.C. Writing – Original Draft Preparation: H.-J.L., Y.C. Writing – Review & Editing: S.H.L., Y.J.K.
Corresponding authors
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lee, HJ., Chung, Y., Chung, K.Y. et al. Use of a graph neural network to the weighted gene co-expression network analysis of Korean native cattle. Sci Rep 12, 9854 (2022). https://doi.org/10.1038/s41598-022-13796-9
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-022-13796-9
This article is cited by
-
Assessment of Genomic Diversity and Selective Pressures in Crossbred Dairy Cattle of Pakistan
Biochemical Genetics (2024)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.