Use of a graph neural network to the weighted gene co-expression network analysis of Korean native cattle

In the general framework of the weighted gene co-expression network analysis (WGCNA), a hierarchical clustering algorithm is commonly used to module definition. However, hierarchical clustering depends strongly on the topological overlap measure. In other words, this algorithm may assign two genes with low topological overlap to different modules even though their expression patterns are similar. Here, a novel gene module clustering algorithm for WGCNA is proposed. We develop a gene module clustering network (gmcNet), which simultaneously addresses single-level expression and topological overlap measure. The proposed gmcNet includes a “co-expression pattern recognizer” (CEPR) and “module classifier”. The CEPR incorporates expression features of single genes into the topological features of co-expressed ones. Given this CEPR-embedded feature, the module classifier computes module assignment probabilities. We validated gmcNet performance using 4,976 genes from 20 native Korean cattle. We observed that the CEPR generates more robust features than single-level expression or topological overlap measure. Given the CEPR-embedded feature, gmcNet achieved the best performance in terms of modularity (0.261) and the differentially expressed signal (27.739) compared with other clustering methods tested. Furthermore, gmcNet detected some interesting biological functionalities for carcass weight, backfat thickness, intramuscular fat, and beef tenderness of Korean native cattle. Therefore, gmcNet is a useful framework for WGCNA module clustering.

Weighted gene co-expression network analysis (WGCNA) is often used to explore the system-level functionality of gene sets. WGCNA groups thousands of genes into a number of modules, simplifying biological interpretation. The general framework of WGCNA 1 can be summarized as follows. First, the adjacencies of paired genes are calculated to define the gene co-expression network. The adjacencies are then incorporated into a topological overlap measure (TOM) to reveal gene-gene connections. Using the TOM, a clustering algorithm assigns intensively connected genes to the same modules. Finally, functional analyses are used to determine the biological meanings of the modules. This pipeline has been widely used in various fields. For example, recent biomedical studies used WGCNA to identify specific modules and hub genes related to human cancer 2 and arterial disease 3 . In animal and plant sciences, WGCNA has often been used to profile plant gene expression 4 and detect pathways responsible for complex animal traits 5,6 . The module definitions greatly affect the interpretations of the results. WGCNA commonly uses a hierarchical clustering (HC) algorithm. This unsupervised clustering method places adjacent genes into the same modules based on pairwise TOM data. However, a concern has been raised that transformations of gene expressions into a TOM results in loss of raw-level expression features. HC-based module assignment depends strongly on the TOM. This can degrade similarity of expression not only between modules, but also within modules. In other words, HC may assign two genes with low topological overlap to different modules even though their expression patterns are similar. Furthermore, once a gene is added to a specific module, HC can never reverse the decision. This poses challenges when clustering complicated networks with Table 1 presents the performances of the various methods on Korean native cattle dataset. The single gene expression-based method (K-means) is robust to DEM signal capture, whereas the TOM-based methods (HC, K-medoids) provide higher modularity. On the other hand, gmcNet, which leverages both single gene expression and TOM, achieves the best DEM signal (27.739) and cluster modularity ( Q : 0.261). Comparison of gmcNet and HC revealed that gmcNet markedly increases modularity and the DEM signal by 0.042 and 9.121, respectively. Thus, gmcNet is more powerful than the other methods for revealing the apparent closeness of genes within the same module, and when making biological sense of the complex traits of native Korean cattle.
CEPR embedding. Figure 2 shows plots based on the first and second principal components of three feature types (single-level expression, TOM, and CEPR embedding) of Korean native cattle dataset. Single-level www.nature.com/scientificreports/ expression fails to distinguish modules with ambiguous boundaries. This may reflect the low modularity of K-means, which uses single-level expression for clustering. The TOM provides stronger connections between genes than single-level expression. However, it also decreases the distances between different modules and genes. As shown in Fig. 2, K-medoids and HC, which use the TOM for clustering, do not clearly assign genes into different but closely related modules. Compared to the other types, CEPR embedding provides better separation, i.e. smaller distances between genes and larger ones between modules. With CEPR embedding, gmcNet defines gene modules more clearly and increases modularity.
Model performance at different k (number of clusters). Our current implementation of gmcNet requires the setting of an optimal k (number of clusters). The effects of the k-value on modularity Q and the DEM signal are summarized in Fig. 3. With an increasing k-value, the DEM signal increases while the Q decreases. In contrast, gmcNet yields a larger DEM signal than HC even at smaller k-values ( 6 ≤ k < 8 ), and remains higher Q at larger k-values ( k = 9 ). gmcNet outperforms K-means and K-medoids for all k-values. These results can demonstrate the superiority of gmcNet regardless of the k-value.

Functional enrichment analysis of native Korean cattle.
To identify the DEMs, we performed linear regression analysis of the module eigengenes 1 for four complex traits, including carcass weight (CWT), backfat thickness (BF), intramuscular fat content (IMF), and the Warner-Bratzler shear force (WBSF). Figure 4 shows the results. In terms of the number of DEMs, IMF ranked first with four modules (K2, K3, K4, and K8) followed by BF (K2, K3, and K4), WBSF (K5 and K7), and CWT (K1 and K6). Interestingly, K5 and K7, which contain large numbers of genes, were significant to WBSF. This may reflect our mode of data collection; the RNA-seq data were from the longissimus-dorsi muscle and WBSF indicates the tenderness of beef muscle. Also, gmcNet detected 11 significant module-trait interactions. gmcNet found more DEMs than the other methods (HC: 9, K-means: 10, and K-medoids: 10) (Fig. S1). We used Gene Ontology (GO) enrichment analysis 21 to annotate the biological processes of the modules defined by gmcNet. Three modules (K1, K5, and K7) were linked to significant processes (Fig. 5). K1, a CWTrelated module, was enriched in "biosynthetic" and "metabolic" processes. Based on both the DEM analysis and the GO enrichment results, K1 seems to involve many genes associated with growth-related traits. Two WBSF-related modules (K5 and K7) were enriched in "immune system" and "protein catabolism" , respectively.
Although several studies have suggested that the immune system plays a key role in cattle weight gain and feed efficiency 22,23 , the association between beef tenderness and immune pathways is a novel finding. Various studies have reported an association between "protein catabolic process" and beef tenderness [24][25][26] . Therefore, the results suggest that K7 is a key module of beef tenderness in native Korean cattle. www.nature.com/scientificreports/ Hub gene searches for modules of interest. Given the functional enrichment results, we selected the four modules, K1, K2, K4, and K7, as the principal modules of complex traits. Figure 6 shows the hub gene networks and Table 2 shows the related traits. The six hub genes of K1 are related to quantitative traits including growth (LAMTOR5 27 and PAM16 28 ) and feed intake (NDUFB1 29 , NDUFB4 30 , ATP5MF 31 , and SEC61G 32 ). These findings support our suggestion that K1 is significant in terms of CWT. K2 and K4, associated with fatrelated traits (BF and IMF) in DEM analysis, include eight (ACSL3 33

Gene Expression Omnibus (GEO) repository.
We also performed our method on the NCBI GEO datasets 19 . The datasets include four different species (GDS6010: human, GDS5618: mouse, GDS4246: pig, and GDS3857: chicken). We measured DEM signals using the trait included in each dataset (human: virus infection, mouse: pancreatic islets, pig: blood, chicken: light pulse). The implementation details for GEO datasets can be shown in supproting information S2. Table 3 presents the performances of the various methods on GED datasets. For mouse and chicken gmcNet achieves the best cluster modularity, while for human and pig gmcNet show much lower modularity than other TOM-based method (HC and K-medoids). However, gmcNet outperforms all methods on DEM signal capture with reasonable modularity for all datasets. These results can prove the gmc-Net is useful method to group thousands of gene according to their system-level functionality. www.nature.com/scientificreports/

Discussion
Single-level expression is generally appropriate to identify trait-specific marker genes that are differentially expressed depending on the biological phenotype 63 . Here, we found that single-level expression also revealed trait-specific modules with strong DEM signals. However, most existing WGCNA methods address only the co-expression topology (including TOM); the DEM signals are weak. On the other hand, our gmcNet simultaneously addresses single-level expression and TOM. gmcNet thus yielded larger DEM signals than other clustering methods. Furthermore, gmcNet produced some novel and interesting results. Threfore, gmcNet can detect module functionality and improves our understanding of WGCNA system-level biology. Also, gmcNet yields strong adjacencies between genes in the same module. gmcNet exploits the learnable properties of CEPR, which aggregates single-gene expressions with the co-expression features of its first neighbors, embedding these features to reduced dimensions. As noted in the Results section, CEPR generates more robust features than single-gene expression data or TOM. Given the CEPR-embedded feature, gmcNet achieved the best WGCNA modularity of all clustering methods tested. Many genes are uniformly expressed in all individuals. Such genes ("noise") are intimately connected with nested modules and exhibit no differential expression in complex trait analysis. Any attempt to cluster them disrupts module identification and obscures the biological implications. HC uses a dendrogram cut-off to exclude noisy genes. On the other hand, gmcNet assigns every gene to the most probable module. This may yield some meaningless assignments, because uniform expression may render the assignments to nested modules similar. Therefore, in future, it will be important to eliminate noise. We are exploring probability thresholding to this end. Specifically, genes with maximum probabilities lower than a given threshold will be excluded from module assignment. We will also add the optimal k search method to gmcNet; k-values can greatly increase model performance and may be modified depending on the characteristics of a dataset. Here, gmcNet used the optimal k of HC and performed better than other methods. In addition, gmcNet outperformed K-means and K-medoids at all k-values tested (2-10). Thus, the addition of an optimal k search would improve gmcNet performance in the context of WGCNA.
We derived a gene module clustering network, gmcNet, which simultaneously addresses single-level expression and TOM. We validated gmcNet performance using 4,976 genes from 20 native Korean cattle and four GEO datasets. gmcNet reliably assigned genes to modules exhibiting high modularity and DEM signals. gmcNet also detected some interesting biological functionalities. Therefore, gmcNet is a useful framework for WGCNA module clustering.

Materials and methods
Korean native cattle data. A total of 20 native Korean steers, born 2013 at Hanwoo Experiment Station, National Institute of Animal Science (NIAS), Rural Development Administration, South Korea, were used; all were humanely slaughtered at 30 months of age. The CWT (kg), and BF (mm) were measured after chilling for 24 hours. BF was measured at the junction of the 12th and 13th ribs. The WBSF and IMF were measured www.nature.com/scientificreports/ at the longissimus-dorsi muscle according to 64 and 65 , respectively. RNA from the longissimus-dorsi muscle was extracted using TRIzol reagent (Invitrogen, Carlsbad, CA, USA). RNA quality and quantity were assessed by automated capillary gel electrophoresis performed using a Bioanalyzer 2100 running the RNA 6000 Nano Lab-

Co-expression network construction.
To represent the co-expression network in matrix form, we used the topological overlap matrix of 1 . Briefly, the adjacency of each pair of genes i and j is given by a ij = cor ij β where β is a smoothing parameter and cor ij is the correlation coefficient between the single-level expressions of the two genes. Given the adjacency values a ij , the topological overlap matrix T ∈ R n×n was created using a TOM 70 , where n is the number of genes. TOM t ij , which provides a similarity measure in the topological overlap matrix, is calculated as follows: where, l ij = u a iu a uj and k i = u a iu is a node connectivity. Also, we constructed two additional topological overlap matrices to train gmcNet (Fig. 7). T p ∈ R n× , representing the positive network, was created leaving only positive correlation coefficients, whereas T n ∈ R n×n , representing the negative network, was created leaving only negative correlation coefficients. After scale-free model fitting 1 , we chose β = 6 , β = 9 , and β = 10 as the smoothing parameters for T , T p , and T n , respectively.
Network structure. CEPR: The goal of CEPR is to integrate single-expression features with co-expression features. To achieve this, we used the MP operation of MPNN 11 , but employed the topological overlap matrix rather than the adjacency matrix. We computed a new topological overlap matrix T by zeroing the diagonal of T and applying degree normalization: where W co and W single are the trainable parameters of the co-and single-expression features. As T includes the topological adjacencies between gene pairs, it is easy to see that TX can be interpreted as a co-expression feature. A simple MP operation cannot separate positive and negative co-expressions, even when they differ in different biological pathways. Therefore, we refined a simple MP to become a CEPR, as follows: where {W c , W p , W n , W s } ∈ R m×m ′ are the trainable weights of the co-expression, positive co-expression, negative co-expressions, and single-expression, respectively. m ′ is an embedding dimension (set to 8). As T p XW p and T n XW n are identical in terms of dimensionality, CEPR learns various co-expressions by simply adding them. By skip connections of single-expression XW s , CEPR generates the embedding feature X ∈ R n×m ′ , which deals with single-expression and three different co-expressions in the m ′ dimension.
Module classifier: Given the CEPR-embedded feature X , the module classifier computes a module assignment probability using a multi-layer perceptron (MLP): where W m ∈ R m ′ ×k are the trainable weights for clustering of k modules. As softmax activation guarantees that m ij ∈ [0, 1] , the ith-row of M ∈ R n×k corresponds to the module-assignment probability of gene i. In other words, gene i belongs to module c if m ic is the maximum value of the ith-row of M.
Loss function. For unsupervised clustering, we employed the cut and orthogonality loss terms of MinCutPool 71 . The loss function when training gmcNet was defined as: where �·� F indicates the Frobenius norm and Tr is the trace; is a balancing hyper-parameter, which is set to 2.6. The cut loss term, L c , encourages clustering of strongly connected genes within the same module, and the orthogonality loss term, L o , penalizes assignment to similarly sized modules.
Implementation Details. The model was iterated for 5,000 epochs using a GeForce RTX 2080ti. For the first 100 epochs, the balancing hyperparameter was set to 0 and the learning rate to 0.01. This prevented the creation of empty modules. After epoch 100, we set to 2.6 and the learning rate to 0.001. Model training was early stopped (3) MP(X, T) = ReLU( TXW co + XW single ) (4) X = CEPR(X, T, T p , T n ) = ReLU( TXW c + T p XW p + T n XW n + XW s ) Figure 8. The architecture of gmcNet. X ∈ R n×m is the single-level expression of n genes in m samples. X ∈ R n×m ′ is CEPR-embedded feature with m ′ dimension. M ∈ R n×k is assignment probability matrix of n genes to k modules. L is loss function. www.nature.com/scientificreports/ at L o > τ , where τ is the orthogonal threshold, which was set to 0.8. The Adam optimizer 72 was used to minimize the loss function. Finally, M at the end of training was used for module assignment.
Model performance. To validate gmcNet performance, HC 7 , K-means 73 and K-medoids 74 were also used for module clustering and the results were compared to those of gmcNet. K-means uses single-expression feature X as input data; the HC and K-medoids use the topological distances 1 − T as inputs. The optimal k for K-means, K-medoids, and gmcNet was set to 8, as suggested by application of the dynamic tree cut technique 7 to HC.
Metrics. We measured the model performance in terms of modularity and DEM signaling. Module modularity is a commonly used metric in graph clustering. In a fully random graph, gene i and j of degrees c i = u t iu and c j = u t ju are connected with a probability c i c j /s , where s is the total topological overlap s = ij t ij . Modularity measures the divergence between intra-module connections as: where δ i, j = 1 if i and j belong to the same module; otherwise, δ i, j = 0.
To assess functional enrichment of clustering method, we introduce a novel metric, called DEM signal. Let ρ[l, t] = 1 if module l is significant ( ≤ 0.05 ) for trait t; otherwise, ρ[l, t] = 0 . The final DEM signal was defined as: where t is traits and p-value lt indicates the significance value of module l in terms of trait t. We employed linear regression analysis to the module eigengenes, i.e. the first principal components of the modules, for four complex traits: CWT, BF, IMF and WBSF.
Functional enrichment analysis. The Bioconductor R package "clusterProfiler" 75 was used for GO analysis. The adjusted p-value (obtained using the Bonferroni method) was employed to examine the significance (p.adjust< 0.05 ) of all GO terms. The top 20 biological processes were extracted if there were more than 20 significant results. To identify hub genes, we calculated the correlation coefficients between single-level expression of each gene and the ME of the module it belong to. The top 25 genes (in terms of correlation coefficients) were defined as hub genes.

Data availability
The gmcNet code and example data is available on GitHub at https://github.com/gywns6287/gmcNet. Request for full gene expression data of Korean native cattle can be made to Korea National Institute of Animal Science, Animal Genome & Bioinformatics Division (http://www.nias.go.kr/english/sub/boardHtml.do?boardId=depintro). www.nature.com/scientificreports/