Identifying key genes in milk fat metabolism by weighted gene co-expression network analysis

Milk fat is the most important and energy-rich substance in milk, and its content and composition are important reference elements in the evaluation of milk quality. However, the current identification of valuable candidate genes affecting milk fat is limited. IlluminaPE150 was used to sequence bovine mammary epithelial cells (BMECs) with high and low milk fat rates (MFP), the weighted gene co-expression network (WGCNA) was used to analyze mRNA expression profile data in this study. As a result, a total of 10,310 genes were used to construct WGCNA, and the genes were classified into 18 modules. Among them, violet (r = 0.74), yellow (r = 0.75) and darkolivegreen (r =  − 0.79) modules were significantly associated with MFP, and 39, 181, 75 hub genes were identified, respectively. Combining enrichment analysis and differential genes (DEs), we screened five key candidate DEs related to lipid metabolism, namely PI4K2A, SLC16A1, ATP8A2, VEGFD and ID1, respectively. Relative to the small intestine, liver, kidney, heart, ovary and uterus, the gene expression of PI4K2A is the highest in mammary gland, and is significantly enriched in GO terms and pathways related to milk fat metabolism, such as monocarboxylic acid transport, phospholipid transport, phosphatidylinositol signaling system, inositol phosphate metabolism and MAPK signaling pathway. This study uses WGCNA to form an overall view of MFP, providing a theoretical basis for identifying potential pathways and hub genes that may be involved in milk fat synthesis.

Milk is not only the source of nutrition for newborn cows, it is also an important source of protein, sugar, lipids, and other nutrients for humans 1 . Milk fat is the most important and energy-rich substance in milk and is an important component in the production of butter and yogurt, with a content of about 3-5% in milk. Milk fat also plays an important role in the nutrition and metabolism of human growth and development 2 Polyunsaturated fatty acids such as conjugated and non-conjugated linoleic acid (C18:2) contained in milk fat play a beneficial role in lowering blood lipids, suppressing immune responses, and stimulating lipid metabolism 3 ; while high concentrations of saturated fatty acids such as myristic acid (C14:0), lauric acid (C12:0) and palmitic acid (C16:0) increase the concentration of low density lipoproteins in the blood, which is associated with cardiovascular disease 4 . Therefore, the content and composition of milk fat is the main reference element to evaluate the quality of milk. Nowadays, milk fat content is not only one of the important indicators for the core competitiveness of dairy products, but also a major target trait for dairy cattle breeding 5 . Exploring the theory and methods of milk fat formation and regulation to improve milk fat content in dairy cows has become a hot spot in international lactation biology research.
In the past, many scholars have extensively studied the complex regulatory mechanisms of mammary gland development and elucidated the major pathways of milk fat synthesis (including de novo synthesis and FA uptake in the blood) 6 . Breeding researchers have identified a range of potent genes and biomarkers in milk fat metabolism with the widespread use of next-generation sequencing technologies and the dramatic reduction in sequencing costs 7 . Despite the transcriptome determining the characteristic of spatial and temporal specificity, there is less linkage to phenotypic data 8 . Weighted gene co-expression network analysis (WGCNA) can combine gene expression with phenotypic data 9,10 and gather genes with similar expression patterns into one module 11 . Genes in modules are often involved in the same function or pathway, which can be used in data analysis of complex processes, and playing an important role in exploring the characteristics of gene networks associated with  (Fig. 1). It was found that the samples of the high-milk fat group and the low-milk fat group were significantly different, and the correlation coefficients within the group were all above 0.89, which indicated a high similarity of expression patterns within the samples, and no outlier samples were found. Therefore, the transcriptome sequencing results are reliable and can be used for subsequent analysis.
The correlation between the co-expression module and the MFP phenotype was analyzed. The results showed that multiple modules were associated with MFP phenotype, among them violet (r = 0.74) and yellow (r = 0.75)  Fig. 3D) with MFP phenotype in these three modules, respectively, and the correlation between all these genes and module members (MM) is greater than 0.9, so these genes were considered as hub genes (Supplementary  Table 2). After obtaining hub genes of the three modules, the intersection was taken with the 915 DEs screened from the transcriptome data (Supplementary Table 3) (Fig. 3E), and it was found that the hub genes were isolated between the three modules, and the yellow module contains 14 DEs, which is the largest number, followed by the darkolivegreen module (5 DEs) and the violet module (4 DEs). The DEs contained in each module are shown in Table 1.

Functional enrichment analysis of hub genes.
To determine the specific functions of the genes within the three modules significantly associated with MFP, we performed GO and KEGG enrichment analysis on 295 hub genes in the three modules. A total of 1 301 GO terms were enriched as a result (Supplementary Table 4), and 180 GO terms were significantly enriched (P < 0.05). There were 37 significantly enriched GO terms related to lipid metabolism, and Fig. 4A shows 11 significantly enriched biological processes (BP) related to lipid metabolism, 15 molecular functions (MF) related to lipid metabolism and 7 representative cellular components (CC). GO terms closely related to milk fat synthesis include acylglycerol lipase activity, ubiquitin protein ligase binding, intermembrane lipid transfer and lipid binding. Notably, SLC16A1, ATP8A2 and PI4K2A are hub genes related to lipid metabolism and also DEs screened in the transcriptome data (P < 0.05), which were mainly enriched in monocarboxylic acid transport, phospholipid transport and phosphatidylinositol phosphorylation terms.
The KEGG enrichment results showed that 239 pathways were enriched (Supplementary Table 4), among them 66 pathways were significantly enriched (P < 0.05) and 13 pathways were associated with lipid metabolism. Figure 4B show the hub genes contained in 13 lipid metabolism-related pathways, with PI4K2A, VEGFD and ID1 being the DEs and significantly enriched to phosphatidylinositol signaling system, inositol phosphate metabolism, Rap1, MAPK and Ras signaling pathway. Interestingly, the down-regulated gene PI4K2A within the yellow module was significantly enriched in GO terms and KEGG pathways related to lipid metabolism (Fig. 5A). The PI4K2A gene is involved in diacylglycerol and glycerophospholipid metabolism of the phosphatidylinositol and inositol phosphate metabolic pathways, respectively (Fig. 6), which suggesting that PI4K2A may play an www.nature.com/scientificreports/ important function in milk fat synthesis. In addition, some hub genes were not significantly different between the high and low milk fat groups, but they were significantly enriched in the GO terms and KEGG pathways related to lipid metabolism (Table 2), and these genes are likely to play a potential role in milk fat synthesis.

Protein interaction network analysis. The results of hub genes enrichment analysis intersected with
DEs to identify SLC16A1, ATP8A2, PI4K2A, VEGFD and ID1 may be the key candidate genes involved in milk fat synthesis (Fig. 5A). We performed protein interaction network analysis on these five candidate genes. Since these genes are not directly functionally related to each other, we have selected the top 10 interacting proteins that are similar in function to the candidate gene. PI4K2A and VEGFD (FIGF) had the highest strength of data support with the 10 proteins with a common function (Fig. 5B,F) and the confidence levels of 0.9 and 0.7 respectively, followed by ID1 with a confidence level of 0.7 (Fig. 5E). The strength of data support for SLC16A1 and ATP8A2 with the top 10 proteins was low (Fig. 5C,D), with moderate confidence (0.4), but still have reliable reference value.
Tissue expression profile analysis of key candidate genes. The expression levels of SLC16A1, ATP8A2, PI4K2A, VEGFD and ID1 varied in different tissues. PI4K2A gene expression was relatively highest in the mammary gland ( Fig. 7E), which significantly higher than small intestine, liver, kidney, heart and ovary tissues (P < 0.05), and slightly lower in uterus than mammary gland. The expression level of ATP8A2 and ID1 genes in mammary gland ranks second (Fig. 7C,D); VEGFD gene expression in mammary gland was significantly lower than heart and similar to that in uterus (Fig. 7B). The SLC16A1 gene was highly expressed in the kidney, followed by the liver, with lower expression in other tissues and non-significant differences (Fig. 7A).
In addition, we examined the relative expression levels of key candidate genes for milk fat synthesis in BMECs from the high and low milk fat groups (three technical replicates), it was found that the trends of the qRT-PCR www.nature.com/scientificreports/  www.nature.com/scientificreports/ experiment results and the RNAseq sequencing results are consistent by using log 2 FoldChange to convert the difference multiples (Fig. 7F), confirming the accuracy of the transcriptome sequencing.

Discussion
Some new analytical methods are gradually making up for the limitations of traditional biological research with the continuous innovation of sequencing technology and the rapid development of bioinformatics, which can fully and effectively explore the biological significance of the massive amount of data 20,21 . Compared to other  www.nature.com/scientificreports/ regulatory networks, WGCNA is an effective data mining method that modularizes large datasets based on similar expression patterns of genes to obtain co-expression modules with high biological significance 22 . In recent years, WGCNA has been applied to explore the characteristics of human and plant life activities [23][24][25] . However, the use of WGCNA on the metabolism of milk fat has not been reported. Milk fat synthesis in dairy cows is related to many physiological and metabolic changes. To gain new insights into the expression and regulation of key genes in milk fat synthesis, we used WGCNA to comprehensively analyze the mRNA expression profile data of high and low MFP dairy cow mammary epithelial cells which measured by Illumina PE150 in the early stage of our group. Clustering the important genes into modules of specific biological pathways that may be associated with MFP in cows, thereby improving the efficiency of identification of important genes.
In this study, we constructed the first gene co-expression network for high and low MFP in Holstein cows, and obtained three modules significantly associated with MFP, violet, yellow and darkolivegreen respectively. The hub gene enrichment analysis in the three modules showed that SLC16A1, ATP8A2, VEGFD, ID1 and PI4K2A genes, which overlap with DEs, were significantly enriched to lipid metabolism-related pathways. Among them, the SLC16A1, ATP8A2 and PI4K2A genes were significantly enriched for monocarboxylate transport, phospholipid transport and phosphatidylinositol phosphorylation terms. It is well known that many monocarboxylic acids were utilised by the body's metabolism, and acetic acid and β-hydroxybutyric acid are the main substrates for the de novo synthesis of fatty acids in ruminants and are essential for meeting the energy requirements (70%) and milk fat synthesis in cows. Acetic acid and β-hydroxybutyric acid play a positive regulatory role in the de novo synthesis of fatty acids, the transport and desaturation of long-chain fatty acids and the synthesis of triglycerides [26][27][28] ; Phosphatidylinositol is a phospholipid, it is also one of the five main polar lipids in milk (less than 2% of total fat) in milk 29 . Polar lipids are the main component of the milk fat globule membrane (MFGM), which is responsible for wrapping the lipid droplets secreted by BMECs 30 . Therefore, SLC16A1, ATP8A2 and PI4K2A may be key candidate genes for the regulation of milk fat synthesis. The MAPK signalling pathway plays a key role in the inflammatory response and induces the expression of a variety of inflammatory mediators and pro-inflammatory cytokines 31 . The RAP1 pathway is a key component of the BMECs 32 , and during peak lactation in dairy cows, MAPK and RAP1 signaling pathways can increase milk production by regulating the proliferation and differentiation of BMECs 33, 34 . Rap1 has been shown to antagonize Ras signals in inactive www.nature.com/scientificreports/ complexes by capturing its effector protein (serine/threonine kinase Raf) 35 . In this study, KEGG enrichment analysis revealed that PI4K2A, VEGFD and ID1 genes (i.e. DEs and hub genes) were significantly enriched in the phosphatidylinositol, Rap1, MAPK and Ras signaling pathway, which suggested that PI4K2A, VEGFD and ID1 genes were likely to be involved in the lactation process of cows. Notably, the PI4K2A gene was significantly enriched to the GO terms and KEGG pathways associated with lipid metabolism and was involved in diacylglycerol and glycerophospholipid metabolism in the phosphatidylinositol and inositol phosphate metabolic pathways, respectively, which indicates its potential to be involved in regulating milk fat synthesis. PI4K2A is a key enzyme for the synthesis of phosphatidylinositol 4-phosphate with multiple cell signaling functions 36 , which is critical for epidermal growth factor receptor degradation 37 , transferrin receptor recycling 38 , autophagy-lysosome fusion 39 and prognosis of breast cancer patients 40 . A genome-wide association study of milk fatty acid composition in Italian Simmental and Italian Holstein cows by using single nucleotide polymorphism arrays 41 , which revealed that PI4K2A may be involved in milk fat metabolism. In addition, the CDIPT gene, which significantly interacts with PI4K2A, and it also plays an important role in fatty acid and energy metabolism 42 , which reflecting the potential importance of the PI4K2A gene in milk fat metabolism. This study found that the expression level of PI4K2A was significantly higher in dairy cows' mammary tissue than the small intestine, liver, kidney, heart and ovary. It further indicates that PI4K2A may have an important function in milk fat synthesis in dairy cows, and a more in-depth functional verification of specific mechanisms is required. DNA binding inhibitor 1 (ID1) is a helix-loop-helix transcription factor that is highly expressed in brown adipose tissue 43 and promotes obesity by inhibiting brown fat thermogenesis and white fat browning 44 . Functionally ID1 is involved in regulating the transcriptional activity of ADD1/SREBP-1c, thereby regulating adipogenesis 45 . Marcin et al. 46 showed that Mammalian target of rapamycin can regulate mammary epithelial cells growth through ID1. ATP8A2 is a P4-ATPase that transfers (flips) phosphatidylserine and phosphatidylethanolamine from the ectoplasmic lobules of the cell membrane lipid bilayer to the cytoplasmic lobules, resulting in asymmetric lipid partitioning between membrane lobules 47 . Vascular endothelial growth factor D (VEGFD) is considered the main angiogenic component of adipose tissue 48 , which enhances lymphangiogenesis and reduces obesity-related immune accumulation in mouse adipose tissue 49 , but VEGFD deficiency does not affect adipose tissue development in mice 50 . The expression levels of ATP8A2, ID1 and VEGFD were significantly higher in mammary tissue than small intestine, liver, kidney and ovary of dairy cows, which suggested that they may have potential biological functions in the mammary gland. SLC16A1 has an important role in short-chain fatty acid transport 51 . Hu et al. 52 studies suggested that SLC16A1 may be involved in hepatic lipid metabolism in pigs, which is consistent with the high-level expression results of SLC16A1 in liver tissues of this study. Although the expression level of the SLC16A1 gene in dairy cows' mammary gland tissue is significantly lower than that in liver and kidney. However, compared with other tissues, the expression abundance of SLC16A1 is still at the upper-middle level, so the role of SLC16A1 in dairy cows' milk fat metabolism cannot be ignored. In the future, members of our www.nature.com/scientificreports/ group will continue to investigate the functional mechanisms of these key candidate DEs (SLC16A1, VEGFD, ID1, ATP8A2 and PI4K2A) in lipid metabolism screened by WGCNA, and in order to elucidate their specific regulatory mechanisms.

Conclusion
In this study, a comprehensive analysis of mRNA expression profile data based on Illumina PE150 sequencing of high and low MFP BMECs was performed by WGCNA, resulting in three modules (violet, yellow and darkolivegreen) that were significantly associated with MFP. After enrichment analysis, a total of 5 candidate DEs related to lipid metabolism were screened out, namely PI4K2A, SLC16A1, ATP8A2, VEGFD and ID1. Among them, PI4K2A is more likely to be involved in milk fat metabolism. The results of this study provide a new way to understand the function of genes in milk fat synthesis in dairy cows and it also provide a new perspective on the study of the lactation process in cattle.  Table 5). Sequencing samples were obtained from the Maosheng pasture of He Lanshan in Ningxia state farm, where the test cows were fed the same balanced total mixed diet. A total of 245 Holstein cows of similar age and in the mid and late lactation were selected. Collect milk samples of each cow in the morning, at noon, and in the evening for dairy herd improvement (DHI). Screen 8 Holstein cows with somatic cell counts within 100,000/mL and extreme differences in MFP (Table 3). BMECs were isolated from fresh milk by aseptic collection 53 , and the library construction and transcriptome sequencing were carried out by Beijing Nuohe Zhiyuan Biotechnology Co., Ltd. A chain-specific library was constructed by removing ribosomal RNA. After passing the library inspection, Illumina PE150 sequencing was performed. After the original data is obtained, the reads with adapter, N (undetermined base information) ratio greater than 0.002, and low-quality bases with a read length of more than 50% are removed. After sequencing error rates (Q20 and Q30) and GC content distribution checks, clean reads for subsequent analysis were obtained. Hisat2 (http:// ccb. jhu. edu/ softw are/ hisat2, version 2.1.0) software were used to compare and analyze the RNA sequencing (RNA-seq) data (the reference genome version is bos_taurus_Ensembl_97) 54 .

Materials and methods
Since the mRNA expression profile data in transcriptome sequencing is represented by the FPKM value, the FPKM was converted to TPM by using the colSums function of the tidyverse package of the R (version 4.1.1). After that, the principal component and correlation analysis of the eight samples was performed by online postsale tool platform provided in Beijing Nuohe Zhiyuan Biotechnology Co., Ltd (https:// magic. novog ene. com/ custo mer/ main#/home/2d9dc26d1e059b931b9ac5364 9482c7c).
Construction of co-expression network. The median absolute deviation of different gene expression profiles were first calculated by the apply function in R to eliminate outliers and abnormal values in the data set, and then the goodSamplesGenes function was used to detect missing values and samples below the sample threshold. And finally, 10,310 genes with relatively high expression were obtained. The co-expression network of mRNA expression profile data was constructed by the R package WGCNA 55 . The construction of a weighted gene network requires the optimal selection of soft thresholding power β that improves co-expression similar- www.nature.com/scientificreports/ ity and calculates the adjacency. Therefore, picking the optimal soft thresholding power β was performed using the function pickSoftThreshold (based on the criterion of approximate scale-free topology) in the R package WGCNA. When 0.8 is used as the correlation coefficient threshold (R 2 = 0.8), the soft thresholding power β was 14 and the minimum number of genes in the module is 111, and the number of genes to construct the coexpression network is set to 100. The module detection sensitivity was 2 (deepSplit = 2), and the cut height for merging of modules was 0.25 (mergeCutHeight = 0.25, i.e., merge into one module if the correlation coefficient of eigengenes within the module is greater than 0.75).

Identification of key candidate genes.
In the module-trait correlation analysis, hub genes were considered as genes with GS greater than 0.4 and high module group members (MM, weighted correlation index > 0.9), indicating a significant correlation with milk fat percentage.

Functional enrichment and protein interaction network analysis.
Here, functional enrichment analysis was performed using the KOBAS website (http:// kobas. cbi. pku. edu. cn/ genel ist/, version 3.0) and its results were visualized by the R package GOplot (version 1.0.2). The hub genes were intersected with the differential genes (DEs) screened by the transcriptome (P < 0.05) and combining the results of enrichment analysis to screen key candidate genes for milk fat metabolism, and protein interaction network analysis carried by String website (https:// www. string-db. org/https://www.string-db.org/, version 11.5).
qRT-PCR validation of key candidate genes. Small intestine, liver, kidney, heart, ovary, uterus and mammary gland tissues were collected from three cows in the mid and late lactation with similar age, and the tissues were cut and quickly placed in liquid nitrogen and brought back to the laboratory for total RNA extraction and first-strand cDNA synthesis. Real-time quantitative reverse transcription PCR (qRT-PCR) was used to detect the expression of key candidate genes for milk fat in different tissues and to verify their expression levels in BMECs of high and low MFP cows. Total RNA was extracted by using RNA simple Total RNA Kit (Tiangen Biochemical Technology Co., Ltd). First-strand cDNA synthesis was performed by using PrimeScript RT Kit (Takara, Dalian, China). qRT-PCR (three replicates) was performed by SYBR Premix Ex Taq II (TaKaRa, Dalian, China) on the Bio-Rad CFX96 Touch Real-Time PCR Detection System (Bio-Rad, Hercules, CA, USA). Amplification procedure: 95 °C for 30 s, 95 °C for 5 s, annealing for 30 s, 40 cycles. qRT-PCR primers were designed by using Primer Premier 5.0 and the primers span at least one intron, the sequence and annealing temperature of each primer were shown in Table 4. Statistical analysis. The statistical significance of differences between the two groups was analyzed using a non-parametric test or t-test based on the data distribution characteristics. All the analyses were conducted using the software R; the P < 0.05 was considered statistically significant. The relative expression of DEs was analyzed by the 2 −ΔΔCt method and normalized using the glyceraldehyde-3-phosphate dehydrogenase (GAPDH) gene.
Institutional review board statement. The Animal Ethics Committees of Ningxia University approved the experimental design and animal sample collection for the present study (permit number NXUC2 0200619). And animal experiments were conducted strictly followed the guidelines of the Regulations for the Administration of Affairs Concerning Experimental Animals (Ministry of Science and Technology, China, 2004).

Data availability
All data generated or analyzed in this study are included in this article [and its Supplementary Information File], and the datasets have been submitted to the SRA database with the Accession Number PRJNA730595. Access to the data of permanent link to https:// www. ncbi. nlm. nih. gov/ sra/ PRJNA 730595.