Integration of GWAS, pathway and network analyses reveals novel mechanistic insights into the synthesis of milk proteins in dairy cows

The quantities and proportions of protein fractions have notable effects on the nutritional and technological value of milk. Although much is known about the effects of genetic variants on milk proteins, the complex relationships among the set of genes and pathways regulating the different protein fractions synthesis and secretion into milk in dairy cows are still not completely understood. We conducted genome-wide association studies (GWAS) for milk nitrogen fractions in a cohort of 1,011 Brown Swiss cows, which uncovered 170 significant single nucleotide polymorphism (SNPs), mostly located on BTA6 and BTA11. Gene-set analysis and the network-based Associated Weight Matrix approach revealed that the milk proteins associated genes were involved in several biological functions, particularly ion and cation transmembrane transporter activity and neuronal and hormone signalling, according to the structure and function of casein micelles. Deeper analysis of the transcription factors and their predicted target genes within the network revealed that GFI1B, ZNF407 and NR5A1 might act as master regulators of milk protein synthesis and secretion. The information acquired provides novel insight into the regulatory mechanisms controlling milk protein synthesis and secretion in bovine mammary gland and may be useful in breeding programmes aimed at improving milk nutritional and/or technological properties.

B milk is indeed characterised by an increased κ-CN content, which favourably affect MCPs 8 . Moreover, milk payment systems in the dairy sectors producing hard cheeses with EU Protected Designation of Origin (PDO) status often include among their payment criteria coagulation and curd firming properties, which are strongly affected by the amounts, proportions and genetic variants of milk protein fractions 8 , as these are related to cheese quality and sensory properties 9,10 . Different milk protein fractions and genetic variants (such as the A1 and A2 variants of β-CN) also seem to affect human health and wellbeing in different ways 11,12 . In recent decades, there have been extraordinary advances in our knowledge of the physiology and biochemistry of the lactating mammary gland. Despite such efforts, little is as yet known of the genetic regulation of the physiological and cellular mechanisms required for milk protein synthesis and secretion. It is well known that milk protein synthesis in the mammary gland depends on hormonal and developmental cues that modulate the transcriptional and translational regulation of genes through the activity of specific transcription factors, non-coding RNAs and alterations of the chromatin structure in the mammary epithelial cells 13,14 . The interplay between all the aforementioned factors might play a key role in milk protein synthesis, which is crucial during the onset and throughout the lactation in high-producing dairy cattle. Recently, it has also been shown that CN phosphorylation, one of the most important factors controlling the stabilization of calcium phosphate nanoclusters in casein micelles and the internal structure of the casein micelles 15 , is also essential for the protein synthesis machinery in the mammary gland. Differences in the phosphorylation of α S1 -CN may be of particular interest as it represents 40% of the total CN fraction in bovine milk 16 . The possibility of tailoring milk composition, e.g., to obtain milk with high protein content and/or favourable MCPs, would allow to meet specific demands from the cheese industry and consumers, and therefore represents a highly desirable goal for the dairy industry. Since milk protein composition is less responsive to diet than milk fat content 17 , genomic selection may offer a valid alternative for optimising milk protein nutritional value in relation to human health 7 while maximizing economic returns for the dairy industry.
There are substantial differences among different bovine breeds in the proportions of milk protein fractions and in the frequencies of protein genotypes 18 . Several studies have investigated the effects of genetic variants of CN and β-LG genes on the milk protein content and cheese-making ability 8,18,19 . However, other loci seem to contribute to regulate the proportions and characteristics of milk proteins, suggesting that regulation is shared among different genes 16,[20][21][22][23][24][25][26] . Deeper knowledge of the set of genes and pathways regulating bovine milk protein synthesis and secretion might, therefore, help to identify their contribution to optimising casein and whey protein contents during lactation. Pathway-based and gene network analyses have been often used as complementary approaches for extracting biological information from genome-wide association analysis studies (GWAS) and for better characterising the genomic structure of complex traits 21,22 .
To date, only one study has explored this type of integrated analysis for milk protein fractions (albeit limited to κ-CN and β-LG and a small cohort of 164 lactating cows), and it suggests that, in addition to the role played by single genes, a complex multi-hormonal system regulates the expression of milk proteins and the interactions between mammary epithelial cells and the components of the extracellular matrix 23 . Nevertheless, no genome-wide association analysis (GWAS) of Brown Swiss populations with the aim of unravelling the genomic architecture controlling milk protein synthesis and secretion has been yet reported. The aims of this study, therefore, were: i) to perform a GWAS analysis to identify genomic regions associated to the proportions of non-protein nitrogen (N) and protein fractions in milk samples from 1,011 Brown Swiss cows; ii) to uncover the biological functions regulating the milk N compound profile through gene-set enrichment analysis; and iii) to use an association weight matrix (AWM) approach 24 based on SNP co-associations in silico, to identify regulatory networks associated with milk protein synthesis, metabolism and secretion in cattle.
Adjusting for the effect of the highest signals for κ-CN and β-LG altered the SNPs with the most significant associations (see Supplementary Table S1). The genetic variance explained by the SNPs for the κ-CN proportion decreased dramatically (0.124 vs 1.138; −89.1%), as did heritability (0.325 vs 0.681; −73.3%). Significant decreases were also observed for the proportions of β-CN (−43.9% genetic variance, −23.5% heritability) and of β-LG, although to a lesser extent (−23.4% genetic variance, −11.0% heritability) (see Supplementary Table S1). Pathway analysis. Of the total 37,568 SNPs used in this study, 17,006 were located in the 15 kb flanking region of the annotated genes. These were assigned to 13,269 genes on the basis of the UMD3.1 bovine genome sequence assembly. On average, a total of 600 genes showed significant associations (P < 0.05) with MY or milk N fractions. To gain a better understanding of the functional implications of these 600 significant genes, we performed pathway analyses in order to identify over-represented biological processes. On the one hand, the total CN percentage was significantly enriched by K+ transport pathways, including 7 over-represented gene ontology (GO) categories, e.g., K+ ion transmembrane transport (q = 0.00015), voltage-gated K+ channel complex (q = 6.07E-06) and K+ channel activity (q = 1.16E-05; Fig. 3a). The plasma membrane, plasma membrane protein complex and cell-periphery cellular components were also significantly enriched for CN (q = 0.00011, q = 1.33E-05 and q = 8.94E-05, respectively; Fig. 3a). On the other hand, over-represented pathways for β-CN included cellular responses to stimuli, e.g., alcohol (q = 2.89E-06), corticosteroid hormones (q = 2.30E-05) and ketone bodies (q = 4.54E-05; Fig. 3a). Minor N compounds (N min) were significantly associated with the metal ion transport pathways (q = 1.04E-05) (Fig. 3a). The full list of significantly enriched pathways (q < 0.05) is given in Supplementary Table S2).

Gene network analyses.
A total of 15,277 annotated SNPs were used for the AWM construction and the SNP co-association analyses. The AWM matrix was then built using a total of 15 phenotypes and the 1,917 SNPs that were significantly associated with at least one of these phenotypes (selected after applying the filtering steps described in the Material and Methods section). These SNPs corresponded to 1,917 unique genes. The SNPs True Protein nitrogen (N) and milk N fractions are expressed as percentage of total milk N; α S2 -CN: α S2 -casein; α-LA: α-lactalbumin; β-LG: β-lactoglobulin; β-CN: β-casein; κ-CN: κ-casein; α S1 -CN: α S1 -casein; α S1P -CN/ α S1 -CN: ratio between αS1(phosphorylated)-casein and α S1 -casein; αS1P-CN: αS1(phosphorylated)-casein; caseins: Σcaseins (β-CN+ κ-CN+ α S1 -CN+ α S1P -CN+ α S2 -CN+ αS1P/αs1-CN); Whey proteins: Σ whey proteins (α-LA + β-LG selected by the AWM method explained 72% of the phenotypic variance for κ-CN, which was significantly larger (P < 0.001) than the average variance (46%) explained by the same number of randomly selected SNPs (10,000 replicates). Hierarchical clustering of traits was firstly performed to describe the set of phenotypes that inevitable were correlated between them. In fact, milk N fractions profiles were clustered in three different groups: the first comprised the minor N compounds, the second comprised the whey proteins, total CN and the α S -CN fraction, while the third included β-CN, κ-CN, urea, α S1P -CN and the α S1P /α S1 -CN ratio ( Supplementary Fig. S2). Then, operating on the rows of the AWM matrix, the correlations between all pair-wise genes were used to predict gene interactions and generate a regulatory network for the milk N fractions, where the nodes are genes and the edges represent significant interactions between nodes. The PCIT algorithm identified a total of 235,764 edges connecting the 1,917 nodes. After filtering for sparse correlations values ≥ |0.80|, we obtained a regulatory network with 101,284 edges and 1,904 nodes. The analysis of the network topological parameters, e.g., closeness centrality and betweenness centrality, revealed that the genes related to ion transport pathway (e.g., ITPR2, IQGAP1, TP53RK and LACE1), protein metabolism (e.g., METAP1 and PRC1) and axon guidance (e.g., NTNG1 and ROBO3) might have an important influence on the regulatory network. Ranking the nodes according to their degree (number of significant interactions), we found BPIFB1 and FAM169A at the top of the list with 481 and 477 edges, respectively (Supplementary Table S3). Analysis with the LASAGNA tool, which predicts the transcription factors (TF) binding sites in the genes' promoter regions, showed that the promoter of BPIFB1 and FAM169A contained binding sites for several TFs involved in regulating milk protein synthesis, such as GR, ER, STAT5A, C/EBP and YY1 (Supplementary Table S4). Additionally, we detected other highly-connected nodes within our regulatory network, including the K+ channel KCNK9 (with 455 edges), transporters such as CRABP1 (450 edges) and SLC4A7 (420 edges), and the phosphatase PLPP7 (located 2 Mb from PAEP; 418 edges) (Supplementary Table S3). The TFs act in a regulatory network and can drive or repress the expression of different genes in a feed-forward and feedback manner. Accordingly, a second network was generated to explore the main putative regulatory TFs   Nmin: minor N compounds (e.g., small peptides, ammonia, creatine, creatinine, etc.); α S1P /α S1 -CN: ratio between α S1 (phosphorylated)-casein and α S1 -casein; α S1P -CN: α S1 (phosphorylated)-casein; CN: casein, Σcaseins (β-CN+ κ-CN+ α S1 -CN+ α S1P -CN+ α S2 -CN+ α S1P /α S1 -CN); WP: whey proteins, Σ whey proteins (α-LA + β-LG); MUN: milk urea N. The trait with the highest P-value in each genomic region is bolded.
in our regulatory network and the connectivity between them. We identified GFI1B, NR5A1 and ZNF407 as the "best" trio of TFs within our regulatory network. Altogether, they potentially regulated the transcription of 452 genes (about 24% of genes in the AWM matrix filtered for correlations ≥|0.80|; Fig. 4). Figure 5A,B show the distribution of the partial correlation coefficients in the full and TF networks. More sophisticated regulation patterns between the TFs and their target genes were provided by the LASAGNA promotor analyser. For instance, the promoters of GFI1B and NR5A1 were discovered to contain putative binding sites for the TFs that are known to regulate milk protein synthesis (e.g. STAT5A, C/EBPbeta, YY1, NFκB, NF-1 and CREB; Supplementary  Table S4; Fig. 4). Differences between the correlation values of the full regulatory network and the TFs network were apparent. The absolute correlation values of the full regulatory network ranged from 0.80 to 1.00, with a mean of 0.86, whereas the absolute correlation values of the TF network ranged from 0.80 to 0.99, with a mean of 0.86. Moreover, while NR5A1 repressed most of its target genes (63%), the proportion of repressed and induced target genes were similar for GFI1B and ZNF407 (Fig. 5).
To identify the most important cellular activities controlled by the regulatory network and the TFs network, we analysed over-represented GO biological process terms using ClueGO. The full list of enriched pathways and ontologies is reported in Supplementary Table S5. Most of the molecular functions that were commonly enriched in both the full and TF networks were related to ion and cation transmembrane transporter activity and phosphatidylinositol signalling (Fig. 5C). The two networks also shared a considerable number of pathways and biological processes related to neuronal and hormone (e.g. glucocorticoids and insulin) signalling, reproduction, nitrogenous compound metabolism and molecular transport (Supplementary Table S5). Several functions related to the Golgi apparatus were also enriched in both networks such as Golgi vesicle transport, regulation of Golgi organization, intra-Golgi vesicle-mediated transport and post-Golgi vesicle-mediated transport (Supplementary Table S5). In addition, processes and components belonging to the extracellular matrix (ECM), such as the proteinaceous extracellular matrix (q = 0.00418), and cell proliferation, e.g., epithelial cell proliferation (q = 0.03981), were significantly overrepresented in the full network (Supplementary Table S5). Immune system response was only over-represented in the TF network, e.g., "positive regulation of lymphocyte mediated immunity" (q = 0.03696) and "regulation of adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains" (q = 0.04188) (Supplementary Table S5). Discussion GWAS analysis. We carried out GWAS analysis of the bovine milk N profile, including the main CN and whey protein fractions and non-protein N compounds. The genomic heritabilities we found were generally higher than previously found in the literature, which may be partially due to several factors, such as differences in breed, population size, analytical method, statistical model and data measurement unit (e.g., yield vs proportion) [20][21][22][23][24][25][26] . Heritabilities of single casein fractions such as κ-CN and β-CN were much higher than that of total caseins. This might be due to the fact that single protein fractions (as well as totals) were expressed as percentage of total N and therefore qualitative (and not quantitative) information was provided. Accordingly, proportions of single milk protein fractions do not share the same profile nor necessarily vary conforming to the totals. The same explanation might be applied also to the number of significant SNPs (much lower in the case of total caseins). However, it is worth mentioning that when using a less stringent P-value (as in the case of pathway analyses) the situation was reversed, suggesting that in the case of total caseins the significantly associated signals tended to be mostly weak. These findings might provide further indication that selection for individual milk protein fractions might be more effective than selection based on total caseins, especially when setting breeding programmes aimed at improving milk nutritional and/or technological properties. As expected, our GWAS results confirmed the highest signals to be on BTA6 in the region of the casein cluster and its flanking region (~86. , and on the tail part of BTA11 including the region of the PAEP gene (~101. 27-106.54), in line with previous results [20][21][22][23] . The most significant SNPs for κ-CN (Hapmap52348-rs29024684), β-CN (Hapmap28023-BTC-060518 and Hapmap24184-BTC-070077, in full LD) Figure 3. Distribution of the significantly enriched terms/pathways using genes associated to the milk nitrogen fractions. The SNP (P < 0.05) were assigned to genes if they were located within the gene or in a flanking region of 15 kb up-and downstream of the gene using the biomaRt R package. For mapping, the Ensembl Bos taurus UMD3.1 assembly was used as reference. Gene-set enrichment analysis was carried out using the goseq R package. Only the traits showing significantly enriched terms are reported (q < 0.05). (a) GO terms; (b) KEGG-pathways. β-CN: β-casein; CN: Σcaseins (β-CN+ κ-CN+ α S1 -CN+ α S1 phosphorylated-CN+ α S2 -CN+ α S1 (phosphorylated)/α S1 -CN); Nmin: minor nitrogen compounds; a S1 -CN: α S1 -casein; a S1P -CN: α S1 (phosphorylated)-casein; κ-CN: κ-casein. Even after adjusting for the effect of the highest significant SNPs, we still detected high signals on BTA6. Apart from Hapmap28023-BTC-060518 and Hapmap24184-BTC-070077, which are in moderate LD with Hapmap52348-rs29024684, we still found peaks in the ranges 82 to 85 Mb and 88 to 94 Mb. The highest signal in the former region corresponded to Hapmap46932-BTA-111719, which was associated to β-CN, α S1 -CN and α S2 -CN. This marker was located about 0.2 Mb from CTSL2 and 0.5 Mb from IARS. CTSL2 belongs to the cathepsins family, which are endogenous proteases affecting the physicochemical characteristics of fresh milk and the quality of dairy products; an increase in CTSL2 expression in bovine milk was observed over the course of lactation 28 . IARS, on the other hand, encodes for the isoleucyl-tRNA synthetase. Aminoacyl-tRNA synthetases are key enzymes involved in translating the genetic code by attaching the correct amino acid to each tRNA species and hydrolysing an incorrectly attached amino acid in the editing process 29 . Amino acids serve as precursors for protein synthesis but also act as regulators of protein synthesis 30 . Furthermore, isoleucine seemed to act cooperatively with leucine to increase milk protein synthesis 31,32 , which appeared to be controlled (at least partially) by the mTOR pathway 33 . The highest peak in the latter region corresponded to Hapmap43045-BTA-76998, which was associated to α S2 -CN and mapped in close proximity to several genes involved in immune system response, e.g., 0.2 Mb from IL8, 0.1 Mb from CXCL6 and 64 Kb from PPBP. IL8, for instance, is a highly polymorphic gene considered to be a mastitis trait 34 and may also be a quantitative trait locus (QTL) for milk production traits 35,36 . We found highly significant SNPs on BTA11 in the region flanking PAEP (102.94-103.05 Mb) and including the QTL for the β-LG percentage deposited in the Cattle QTL Database. The marker ARS-USMARC-Parent-AY85 1163-rs17871661 (associated to β-LG, whey proteins, other N compounds, protein and N minor compounds) was located within GFI1B (intron variant effect), one of the TFs we proposed as master regulators of milk protein . Activators and repressors of the regulatory network of genes associated with the bovine milk κ-casein content. This network contained 452 nodes and 498 edges. In the network, each node represents a gene, whereas every edge connecting two nodes represents a significant interaction (correlation value ≥ |0.80|). In the network, the best trio of transcription factors is showed: GFI1B, NR5A1 and ZNF407. Together they control 2.5% of the regulatory network. The nodes shape indicates whether the node is a transcription factor (triangles), a miRNA (hexagon), a metabolite (round rectangle), a membrane receptor (rectangle), a transporter (parallelogram), or other type of genes (ellipses). The node colour represents the biological function of the gene according to Ingenuity Pathway Analysis (IPA) annotation. The edge colour intensity indicates the level of the association: red = positive correlation -and blue = negative correlation between two nodes.
SciENtiFic REPORTS | (2018) 8:566 | DOI:10.1038/s41598-017-18916-4 synthesis in bovine mammary gland. We also found a high signal located at 104. 29 Mb and corresponding to ARS-BFGL-NGS-104610, which was associated to the same phenotypes. Interestingly, this region (104.13-104.31 Mb) is densely packed with genes coding for small nucleolar-RNA and micro-RNA, well-known regulators of gene expression 37,38 . Pathway and network analyses. Pathway and network analyses derived from GWAS gave additional insights into the complex relationships among genes and the interconnected pathways that are likely to have a role in regulating protein synthesis and secretion in the mammary gland. For instance, we found several pathway associations within our regulatory network, which to the best of our knowledge have not been fully described before, namely: (i) ion and cation transmembrane transport (particularly K+, P, and Ca 2+ ); (ii) hormone signalling, (iii) neuronal signalling and (iv) immune system response ( Fig. 6; Supplementary Table S5). Additionally, we also identified three TFs, which were likely to be key activators and repressors of a total of 1,904 targets genes within the regulatory network, e.g., GFI1B, ZNF407 and NR5A1, which controlled the expression respectively of 260, 197 and 41 genes in the network. Interestingly, many of these pathways derived from GWAS analysis have been also related to milk coagulation properties, curd firmness, cheese yield and curd nutrient recovery 22 , such as calcium and potassium transport, neuronal and hormonal signalling, as well as phosphatidylinositol signalling. These functional findings might confirm the established relationship between milk protein composition and cheese-making traits.
The relationship between CN percentage in milk and the genes involved in the regulation of Ca 2+ and phosphate transmembrane transport is in line with the structure and the main functions of the casein micelles, which on the one hand act as Ca 2+ -transporting vehicles to supply young mammals with a highly concentrated yet soluble form of calcium phosphate and on the other hand, prevent calcified, proteinaceous deposits containing amyloid fibrils in the mammary gland 39 . Caseins bind Ca 2+ via highly phosphorylated sequences called phosphate centres present in α S1 -CN, α S2 -CN, β-CN 40 . Calcium-dependent CN kinase is responsible for κ-CN phosphorylation before micelle formation and milk secretion 41 . In agreement to our results, the Ca 2+ ion-binding GO term has been already associated with κ-CN and β-LG in bovine milk 23 . These biologically reasonable associations were further confirmed by the enrichment of several functions related to the Golgi vesicle transport within the full-network and our TFs network. Indeed, the milk proteins newly synthesized in the rough endoplasmic reticulum are transferred to the Golgi apparatus where they are processed for transport to the apical area of the mammary epithelial cells through secretory vesicles 3 . A cardiovascular regulation function through several genes (e.g. ARVC, HCM, and DCM) has been also associated with κ-CN, suggesting that this protein fraction is involved with the regulation of Ca 2+ homeostasis. Impaired Ca 2+ ion regulation (and alteration in insulin signalling) is known to contribute to the pathophysiological effect on cardiomyocyte function 42 . Furthermore, these cardiovascular related pathways also included genes coding for integrins, the major ECM receptors that have been identified as important regulators of mammary epithelial cell growth and differentiation 43 . In relation to these results, pathways pertaining to the extracellular matrix were indeed significantly enriched in our full-regulatory network. Similarly, Gambra et al. 23 reported an association between the extracellular matrix receptors, κ-CN and β-LG concentrations in bovine milk 23 . Besides Ca 2+ ion, the K+ transport was also enrichment. It is likely that prolactin (PRL), which have a direct role in milk synthesis 33 , activates the extrusion of Na+ and the entry of K+ in mammary cells in both lactating and pre-lactating tissue 44 . Interestingly, a plasmin-induced β-CN breakdown product (fraction 1-28) has been found to act as a potent blocker of K+ channels in bovine mammary epithelia apical membranes 45 .
Our study also showed that milk proteins related genes were associated with the concerted action of hormones such as prolactin, growth hormone, thyroid hormone, corticosteroids, insulin, and growth factors, which are essential for the regulation of milk protein synthesis within the bovine mammary epithelium 33 . Lactogenic hormones enter MECs by diffusion and synergistically bind to milk protein gene promoters. Indeed, the proximal promoters of the βand κ-CN genes contain so-called lactogenic response elements that harbour binding sites for TFs, which act either as inducers, such as GR, STAT5, NF-1 and C/EBPβ, or as repressors, such as YY-1 46,47 . Remarkably, binding sites for these abovementioned TFs have also been predicted by the LASAGNA tool for the two most important nodes in the full regulatory network, in particular BPIFB1 and FAM169A, and for the two key TFs, GFI1B and NR5A1. Among the pathways overrepresented in the networks, regulation of insulin secretion and of insulin-like growth factor receptor signalling pathways were included (Supplementary Table S5). A direct effect of insulin on the bovine mammary gland might be mediated by the major milk protein ELF5, which seemed to be regulated by means of phosphoinositide 3-kinase/Akt signalling 48 , which has been identified as playing a central role in lactation 49 . Overrepresentation of phosphatidylinositol signalling (PI3K) in the full and TF networks might provide further support for this hypothesis. Both insulin and IGF1 might in turn activate the mTOR signalling pathway, which is crucial for milk protein synthesis 50,51 . Among the enriched genes included in the insulin secretion pathway, GLUT1 (SLC2A1) is of particular interest. The large uptake of glucose by the mammary gland during lactation considerably induces the expression of GLUT1 33 , which seemed also to be regulated by mTOR 52 . Both RPTOR and GLUT1 were predicted to be targets for ZNF407 by our TF network.
Additionally, milk proteins associated genes were involved in the activation of neuronal signalling pathways, suggesting an indirect link to the reproduction process and lactation. The overrepresentation of neurotransmitter signalling, such as the cholinergic synapse (enriched in the full network) and axon guidance may be explained by the stimulation of mechanoreceptors in the teat skin, which induces cholinergic nerve impulses with the result that oxytocin is released from the pituitary gland, essential for milk secretion 53 . In fact, the study carried out by Gao et al. 54 provides support to this hypothesis. These authors reported a significant increase in the expression of all CN genes in the bovine mammary gland at the lactation onset 54 , which is reasonably consistent with the need to meet the nutritional requirements of new-born calves. Having established that neuronal signalling appeared to be associated to milk protein components, we also demonstrated that CN could be related to the control of reproduction. The mammary gland is considered as an accessory reproductive organ 55 . This later association may be attributable to several genes involved in the regulation of reproductive process, including NR5A1, which plays an important role in various aspects of reproductive development and function 56 and also regulates gene expression of pituitary gonadotropins, such as the luteinizing hormone (LH) and the follicle-stimulating hormone (FSH) 57 . On the other hand, we found 100 genes in the full network that might be related to amyloidosis disease. Caseins, as other unfolded proteins, tend to form amyloid fibrils and calcified deposits, although to avoid the risk of amyloidosis and calcification, the mammary gland orchestrates different aggregation mechanisms that result in the formation of the casein micelle 58 . Amyloidosis and the production of amyloid proteins have been associated with a variety of so-called protein conformational or protein misfolding diseases (including Alzheimer's disease, Parkinson's disease, type-II diabetes) 59 . Caseins have been also found to function as holdase molecular chaperones to prevent the potentially harmful formation of amyloid fibrils 58 , which might explain the enrichment of the signal sequence binding found in our study.
Finally, the enrichment of pathways related to immune response observed for the TF-network might be partly related to the biological role of GFI1B which is a transcriptional repressor that plays important roles in the differentiation of several haematopoietic cells 60 . Our findings might be related to the antimicrobial activity of caseins, and specifically of κ-CN 61 ; of interest, an overall increase in the immune response and/or in milk antimicrobial activity of the bovine mammary gland has been observed during lactation 62 .
Milk protein composition is subject to the well-known effect of the major genes coding for the various CNs and whey proteins. In our study, the combination of GWAS and pathway and network analyses showed several genes that were coordinated and highly connected between them, making a substantial contribution at different stages of milk protein synthesis. This information advances our understanding of bovine mammary gland functionality and could be helpful to breeding programmes aimed at improving milk quality and/or technological properties. However, altogether, the correlative nature of associations between outcomes from which causality cannot be determined limits the interpretation of our results. Therefore, it is of paramount importance to carry on larger longitudinal studies to explore the causes and the persistency of these interactions. Additionally, the predicted associations need to be biologically validated, e.g., by integrating genomic data with gene expression profiles, by using machine-learning approaches or animal models with knockout genes.

Methods
Ethics statement. The cows included in this study belonged to commercial private herds and were not subjected to any invasive procedures. Milk and blood samples were previously collected during the routine milk recording coordinated by technicians working at the Breeder Association of Trento Province (Italy) and therefore authorized by a local authority.
Phenotypes and genotypes. Individual milk samples were collected from 1,264 Italian Brown Swiss cows from 85 commercial herds located in the Alpine province of Trento (Italy). Details of the animals used in this study and the characteristics of the area are reported in Cipolat-Gotet et al. 63 and Cecchinato et al. 64 .
Milk total nitrogen, casein and urea nitrogen (MUN) were measured using a MilkoScan FT6000 (Foss, Hillerød, Denmark). Proportions of the true proteins, e.g., casein fractions (α S1 -, α S1P -, α S2 -, βand κ-CN), and whey proteins [β-lactoglobulin (β-LG) and α-lactalbumin (α-LA)] were determined using validated reversed-phase high-performance liquid chromatography (RP-HPLC) 65 . Each fraction was expressed as a percentage of the milk total nitrogen (N) content. These percentages were summed and deducted from the milk total N content to arrive at the proportion of the remaining minor milk N compounds.
The Illumina BovineSNP50 v.2 BeadChip (Illumina Inc., San Diego, CA) was used to genotype 1,152 cows (blood samples were not available for all the phenotyped animals). Quality control excluded markers that do not fulfil the subsequent criteria: call rates >95%, minor allele frequencies >0.5% and no extreme deviation from the Hardy-Weinberg equilibrium (P > 0.001, Bonferroni corrected). After filtering, 1,011 cows and 37,568 SNPs were retained for subsequent analyses.
Genome-wide association study. Genome-wide association analyses (GWAS) were conducted using single-marker regression in the GenABEL R package 66 and GRAMMAR-GC (Genome-wide Association using Mixed Model and Regression -Genomic Control) with the default function gamma 67 . There are 3 steps to the GRAMMAR-GC: firstly, an additive polygenic model with a genomic relationship matrix is fitted; secondly, the residuals obtained from this model are regressed on the SNPs to test for associations; finally, genomic control corrects for conservativeness of the procedure 68 where y is a vector of the milk N fractions; β is a vector with fixed effects of (i) days in milk of the cow (classes of 30 days each), (ii) parity of each cow (classes of 1, 2, 3, ≥4), and (iii) herd-date effect (n = 85); X is an incidence matrix connecting each observation to specific levels of the factors in β. The two random terms in the model were the animal and the residuals, which were assumed to be normally distributed as σ a N G (0, ) g 2 and σ e N I (0, ) e 2 , where G is the genomic relationship, I is the identity matrix, σ g 2 is the additive genomic variance and σ e 2 the residual variance. The G matrix was built in the GenABEL R package, where for a given pair of individuals i and j, the identical by state coefficients (f i,j ) is calculated as: where N is the number of markers used, x i,k is the genotype of the i th individual at the k th SNP (coded as 0, ½ and 1), p k is the frequency of the "+" allele and k = 1, …, N. A significance threshold of P < 5 × − 10 5 was adopted 69 . Manhattan plots were drawn using the qqman R package 70 .
SNP variance was calculated as 2pqa 2 , where p is the frequency of one allele, q = 1 − p is the frequency of the second allele and a is the estimated additive genetic effect. Model (1) was also used to estimate the variance components and the genomic heritability of the traits based on the genomic relationship matrix (2). Heritability was 2 . To identify secondary association signals, association analysis conditioning on the primary associated SNPs was carried out to test for the presence of other significantly associated SNPs. Therefore, in model (1) we fixed the most significant SNPs on BTA6 and on BTA11 to obtain SNP effect estimates adjusted for the effect of these highly significant SNPs.
The r-squared statistic was chosen to predict the extent of LD. The r 2 between pairwise SNPs covering the region of CN loci on BTA6 and the region of the β-LG gene (progestagen-associated endometrial protein, PAEP) on BTA11 and their respective 1 Mb flanking regions was calculated using the R package LDheatmap 71 .
Gene-set enrichment and pathway analyses. Pathway analyses were performed as detailed in Dadousis et al. 22 to identify the biological functions regulating the milk N fraction profile. Briefly, the SNPs (nominal P-values < 0.05) were assigned to genes if they were located within the gene or within 15 kb of 5′ and 3′ends 72 using the BiomaRt R package 73,74 and the Ensembl Bos taurus UMD3.1 assembly. Respect to the GWAS analysis, a less stringent significance threshold was adopted since we aimed to detect the effect of less significant SNPs which still contribute to explain phenotypic variability, as associated to genes which are part of biological networks and cellular processes. Combining weaker but related variant signals we can improve the prediction of how these variants might be collectively related to the phenotypes of interest. The Kyoto Encyclopaedia of Genes and Genomes (KEGG) 75 and the Gene Ontology (GO) databases 76 were used to define the functional categories associated to the gene sets. To avoid testing broad or narrow functional categories, only GO and KEGG terms with >10 and <1000 genes were inspected. A Fisher's exact test was applied to each functional category to test for overrepresentation of significant gene sets. A q-value of 0.05 was set as the cut-off for significant enrichments. The gene-set enrichment analysis was performed using the R package goseq 77 . SNP co-association and network analyses. The GWAS results were used to build the AWM as described by Fortes et al. 24 . The selection criteria favour genes harbouring SNPs with significant associations across related traits. In brief, κ-CN was selected as the key phenotype (due to its greater importance for milk technological properties) and the SNPs that were associated with it (P ≤ 0.05) were included in the AWM.
Dependency among phenotypes was explored by estimating the average number of other phenotypes (Ap) that were associated with these SNPs at the same P value (P ≤ 0.05) (Ap = 3). Then, we selected SNPs that were both close (<10 Kb) to the nearest annotated gene (UMD3.1 assembly) and were associated with any ≥3 other traits (P < 0.05). To identify putative regulators, the TFs reported by Vaquerizas 78 and the microRNA (miRNA) that were mapped to the UMD 3.1 bovine genome assembly (GenBank assembly accession: GCA_000003055.3) were also included in this analysis. To estimate the phenotypic variance explained by the AWM-SNPs, we constructed a first G matrix based only on the SNPs that were selected for the AWM. The same numbers of randomly selected SNPs were used to build 10,000 G matrices (10,000 replicates), to estimate the variance explained by those randomly selected SNPs. The Pearson correlations obtained from pair-wise correlations of AWM columns (standardized SNP effects across traits) were computed and hierarchical clustering of traits was visualised using the hclust function in R 79 . The PCIT algorithm 80 was used to report significant interactions in the network, which were visualized in Cytoscape 81 . Every node in the network represents a gene, while every edge connecting two nodes represents a significant interaction. In order to include only the high-confidence gene co-associations determined by PCIT, those with correlations ≥|0.80| were retained (n = 1,904 unique genes), on the assumption that these genes have relevant biological significance for the key phenotype from which the AWM-PCIT was constructed. The co-association network was automatically generated using the organic layout algorithm in Cytoscape V2.7 (http://cytoscape.org). Network topological parameters and node centrality values were calculated using the NetworkAnalyzer plugin 82 to gain insights into the organisation and structure of the complex networks formed by the interacting molecules. In parallel, the list of co-associated genes was fed into the Cytoscape plugin ClueGo 83 to identify relevant categories of molecular functions, cellular components and biological processes. The ClueGO cut-off for the statistical assessment was FDR < 0.05. In addition, the list of co-associated genes was uploaded to the Ingenuity Pathway Analysis (IPA, version 5.5; Ingenuity Systems, USA) to define information on molecule type (e.g., transcription factor, cytokine, transporter). Genes in the network were coloured according to the biological processes they participate in. Then, a list of TFs (based on Vaquerizas et al. 78 ) and their target genes, to which they were potentially connected, were identified within our high-confidence gene network (r ≥ |0.80|). An information-lossless approach 84 was used to identify the optimal subset of TFs spanning the majority of the network topology. The density plots of the genes' partial-correlation values in the full and the TF network were generated using the R package ggpubr.
Prediction of TF binding sites in the genes' promoter regions was performed by the LASAGNA-Search 2.0 web tool 85 using matrices in the TRANSFAC public database and with a significance threshold of P = 0.001.