Introduction

Milk is an important source of nutrients in human diets all over the world. In particular, milk polyunsaturated fatty acids (PUFAs) such as isomers of conjugated linoleic acid (CLA), arachidonic acid (AA, 20:4n-6), eicosapentaenoic acid (EPA, 20:5n-3) and docosahexaenoic acid (DHA, 22:6n-3) have known positive associations with human health, including protection against cardiovascular diseases and anticancer, antiadipogenic, antiatherogenic, antidiabetogenic and anti-inflammatory properties1,2,3. For these reasons, dairy producers are looking for ways to optimize the beneficial components of milk.

Genetic variability in the relative proportions of the various milk components (proteins, fats, individual fatty acids, lactose, milk urea nitrogen, etc.) exists within and among cattle breeds4,5 and is further influenced by nutrition and gene × environment interactions. This variability indicates the possibility of using genomic selection to improve milk traits6,7,8,9.

Advances in high-throughput sequencing technologies provide the ability to improve complex traits. Over the past two decades, sequence variation information has supported genome wide association studies (GWAS) and candidate gene studies of milk traits10,11,12,13,14,15,16,17,18,19. Genomic information is gaining wide application in livestock improvement schemes20,21 since genotype data and confirmed associations with traits are important for informed decisions in livestock selection.

Out of 36,693 quantitative trait loci (QTL) for 492 traits archived in the cattle QTL database, about 5,815 are QTLs for milk fat composition, 3,157 for milk protein composition, 1,324 for milk yield, 550 for fatty acid content and 1,246 for mastitis (CattleQTLdb, http://www.animalgenome.org/cgi-bin/QTLdb/BT/index, accessed on 27 November 2015). These QTLs are spread over most bovine chromosomes, but only a few of the causative genes have been identified. Furthermore, the majority of reported associations involve common variants with minor allele frequencies (MAF) >5%, while the contributions of low frequency (MAF 0.5% to 5%) and rare (MAF <0.5%) variants remain relatively untapped. Moreover, reported associations with milk traits have uncovered only a few markers or genes that explain a large portion of the variation in traits13,22, while the remaining unexplained variation may be attributable to low frequency and rare variants that remain to be uncovered.
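The three frequency classes used above can be written as a small helper. This is only an illustrative sketch: the function name `maf_class` is hypothetical, MAF is assumed to be given as a fraction rather than a percentage, and the treatment of the exact boundary values (0.5% and 5%) is an assumption, since the text does not specify it.

```python
def maf_class(maf: float) -> str:
    """Classify a variant by minor allele frequency (MAF as a fraction, 0 to 0.5).

    Thresholds follow the text: rare (<0.5%), low frequency (0.5% to 5%),
    common (>5%). Assigning the boundary values to the low frequency class
    is an assumption made for this sketch.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie between 0 and 0.5")
    if maf < 0.005:
        return "rare"
    if maf <= 0.05:
        return "low frequency"
    return "common"
```

For example, a variant with a MAF of 11.2% would be classed as common, whereas one with a MAF of 1.5% would fall in the low frequency class.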

A deeper study of the bovine genome to quantify the contribution of low frequency SNP variants and novel positional candidate genes is necessary. Recently, whole genome sequencing of 234 bull genomes of the Holstein, Fleckvieh, Jersey and Angus breeds uncovered 28.3 million variants, including common and low frequency variants responsible for various conditions in cattle23. Low frequency variants are not represented on currently available low and high density genotyping chips, a situation now remedied by next generation sequencing technologies in humans24, maize25 and cattle26. Using genotyping-by-sequencing (GBS) on the Illumina platform, De Donato et al.26 successfully identified and genotyped 63,697 SNPs in 47 bovine samples from 7 breeds, uncovering more SNPs per bovine chromosome than are represented on the Illumina BovineSNP50 BeadChip. Furthermore, they demonstrated the cost effectiveness of GBS as a complementary tool to available genotyping chips and its potential application to bovine studies and to other species.

In this study, we applied GBS to 1,246 Canadian Holstein cows from 16 different herds in Quebec followed by GWAS and identified population specific SNP markers, novel SNPs, novel SNP associations and novel candidate genes for milk traits.

Results

Sequencing results, identified SNPs and classification

GBS was used to analyse DNA samples from 1,246 Canadian Holstein cows on an Illumina HiSeq 2000 system. Sequencing generated a total of 3.7 billion reads. After an initial quality check, 2.9 billion reads resulting in a total of 92.7 million unique tags were retained (Table S1). Unique tags were merged into a total of 14.6 million merged tags, of which 79.6% aligned to unique positions on the bovine genome (Btau 4.6.1), 10.5% aligned to multiple positions and 9.9% could not be aligned. By analyzing only tags that aligned to unique positions, a total of 515,787 SNPs were identified across all chromosomes. The highest number of SNPs was detected on BTA11 (25,093 SNPs), followed by BTA3 (23,860 SNPs), and the lowest number on BTAX (9,471 SNPs) (Fig. 1, Table S2). SNP variant classification indicated that the majority of identified SNPs were located within intergenic regions (69%) of the genome, followed by intronic regions of genes (25%) (Fig. 1, Table S2). Only 3.46% of SNPs are coding variants. About 4,280 SNPs located on 367 invalid transcripts were not classified. Further classification of coding SNPs indicated that 66% are non-synonymous, 18% are synonymous, 11% are unknown and 4.6% are splicing variants, while SNPs at initiation codons, stop gain and stop loss constituted less than 1% (Fig. 1, Table S2). The majority of identified markers had MAF ≤1% (Figure S1). Only about 29% of identified markers are represented in dbSNP, implying that 71% of identified variants are novel (Table S3a). These novel variants have been submitted to dbSNP and assigned Submitted SNP (SS) numbers (Table S3b).

Figure 1

Distribution of identified markers by chromosomes (A), variant classification (B) and coding variant classes (C).

Genotype imputation

Genotypes were imputed using Beagle (v3.3.2)27 to fill in missing genotypes in some samples. Principal component analysis (PCA) was used to assess population structure. Initially inconsistent PCA patterns in 3 herds (Figure S2) were resolved when call rate ≥80% and minor allele frequency (MAF) ≥1% filters were applied (Figure S3). Call rates also improved post-imputation (Figure S4), with MAF generally unchanged.
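The call rate and MAF filters mentioned above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the function names and the genotype encoding (0, 1 or 2 copies of the alternate allele, `None` for a missing call) are assumptions made for illustration, and only the two thresholds (call rate ≥80%, MAF ≥1%) come from the text.

```python
from typing import Optional, Sequence

# Assumed encoding: 0, 1 or 2 copies of the alternate allele; None = missing call.
Genotype = Optional[int]


def call_rate(genotypes: Sequence[Genotype]) -> float:
    """Fraction of samples with a non-missing genotype call at one SNP."""
    if not genotypes:
        return 0.0
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def minor_allele_frequency(genotypes: Sequence[Genotype]) -> float:
    """Frequency of the less common allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))  # two alleles per diploid sample
    return min(alt_freq, 1.0 - alt_freq)


def passes_filters(
    genotypes: Sequence[Genotype],
    min_call_rate: float = 0.80,  # call rate >= 80%, as in the text
    min_maf: float = 0.01,        # MAF >= 1%, as in the text
) -> bool:
    """True if the SNP meets both the call rate and the MAF thresholds."""
    return (
        call_rate(genotypes) >= min_call_rate
        and minor_allele_frequency(genotypes) >= min_maf
    )
```

Applied per SNP across all samples, such a filter removes markers that are called in too few samples or that are nearly monomorphic.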

Results of significant genome wide association analysis

After genotype imputation, a total of 76,355 SNPs out of 515,787 with call rates >85%, accuracy of imputation score >50% and MAF ≥1.5% were retained and used in GWAS. Also excluded from GWAS were genotypes that deviated significantly from Hardy-Weinberg Equilibrium assumptions. Results of GWAS and Benjamini-Hochberg (BH) false discovery rate (FDR) correction (p-values BH FDR <0.1) are listed in Tables S4a to e, S5a to d, S6a to b and S7a to h. Only associations with corrected p-values BH FDR <0.1 were considered to be of genome wide significance in this study.
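For readers unfamiliar with the Benjamini-Hochberg procedure, the sketch below shows how BH-adjusted p-values can be computed from raw p-values. It is a generic textbook implementation, not the software used in this study, and the function name is hypothetical. Under the significance criterion stated above, a marker would be retained when its adjusted value is below 0.1.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values stay monotone (each is capped by the one ranked above it).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


def significant(p_values: Sequence[float], fdr: float = 0.1) -> List[bool]:
    """Flag tests whose BH-adjusted p-value is below the chosen FDR level."""
    return [q < fdr for q in benjamini_hochberg(p_values)]
```

Because the adjustment depends on the number of tests, the same raw p-value can pass for one trait and fail for another if the tested marker sets differ.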

Significant genome wide associations between markers and milk component traits

Markers were tested for significant associations with test day fat% (TFP), test day fat yield (TFY), 305 day fat yield (305dFY), test day protein% (TPP), test day protein yield (TPY), 305 day protein yield (305dPY), test day milk yield (TMY), 305 day milk yield (305dMY), lactose% (LP), milk urea nitrogen (MUN) and milk somatic cell counts (SCC) (Table 1). Significant GWAS results (p-value BH FDR <0.1) were recorded between 1 and 143 markers and 7 variables (TFP, TPP, 305dFY, TMY, LP, MUN and SCC).

Table 1 Studied milk component and fatty acid traits and their mean values (±standard error [SE]) across 16 herds.

Thirty six markers (26 intergenic and 10 gene region variants) were significantly associated with TFP (Table S4a). In addition, a strong association signal was recorded between 20 markers in the centromeric region of BTA14 and TFP (Table 2, Fig. 2), of which two (rs132685115, rs135581384) are located within the TONSL gene and one each within the ADCK5 (rs135576599), PPP1R16A (rs133629644) and TRAPPC9 (rs207542860) genes. Only one coding region SNP (ss1850090958), within the TEP1 gene, associated significantly with both TFP and TPP.

Table 2 Markers showing genome wide significant associations with test day fat percentage (TFP).
Figure 2

Manhattan plot of –log10(p-value) of genome wide SNP association results showing (A) a strong association signal at the centromeric region of BTA14 with test day fat percent (TFP); (B) expanded 0–2 Mbp region of BTA14 showing significant SNPs and corresponding genes and (C) significant SNP associations with C20:5n3 (eicosapentaenoic acid, EPA). Thirty six and 56 significant (p-value BH FDR ≤ 0.1) genome wide SNP associations with TFP and C20:5n3, respectively, are shown above the horizontal lines. Genome wide association analysis was done with the single-locus efficient mixed model association (EMMA) approach (Kang 2008) using 76,355 SNP markers with minor allele frequencies ≥1.5%.

Fifty three markers including 20 gene region SNPs were significantly associated (p-value BH FDR < 0.1) with TPP (Table S4b). Significant markers for TPP are spread over all chromosomes except BTA2, 12, 23, 26 and X (Table S4b). Two gene region SNPs are coding mutations (ss1850241810, ss1850090958), including a non-synonymous mutation (ss1850241810) within exon 5 of the P4HTM gene.

In this study, the highest number of markers comprising 91 intergenic and 52 gene region SNPs were associated with SCC (Table S4c). Two variants each on PLK1S1 (rs133818453, rs384381919) and RILPL2 (ss1850180613, ss1850180614) genes were significantly associated with SCC and three intronic SNPs (rs459258791, rs470194324, rs211180730) have MAF above 16%. Interestingly, eight SNPs (rs462951569, ss1850157510, ss1850127088, rs445074791, ss1850272832, ss1850161426, rs384261616, ss1849973836) each explained about 4% or more of the variance in SCC.

Fewer significant associations were recorded between studied markers and TMY, 305dFY, LP and MUN (Table S4d,e). About 35 SNPs significantly associated with TFP, TPP, SCC, TMY and 305dFY and having MAF ≥10% were considered the most commonly associated SNPs for these traits in this study (Table 3).

Table 3 Commonly significantly associated variants (minor allele frequency [MAF] ≥10%) with milk component traits.

Results of significant genome wide associations between markers and individual milk fatty acids

Several markers were significantly associated with two omega-3 fatty acids (C20:5n3, eicosapentaenoic acid, EPA and C22:5n3, docosapentaenoic acid, DPA), one omega-6 fatty acid (C20:4n6, arachidonic acid, AA), one CLA isomer (CLA:9c11t) and gamma linolenic acid (C18:3tcc) (Table S5a–d). Significant associations with EPA included 36 SNPs in intergenic regions and 20 markers within coding and regulatory regions (introns and 3′UTR) of 20 genes on 15 chromosomes (Table 4 and Table S5a). Seven of these significant associations were concordant by both additive and dominant models, 11 by the additive and two by the dominant models only. A non-synonymous coding SNP (p.Ala1424Pro) located within exon 21 (rs470755489) in the ERCC6 gene with MAF of 11.2% was identified by the additive model to be significantly associated with EPA. Another non-synonymous coding SNP (ss1850036184) within exon 2 of TTC38 gene (p.Ala16Pro) was significant for EPA by the additive model only.
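The difference between the additive and dominant models lies in how each genotype is converted into a predictor for the association test. The sketch below shows one common convention, assuming genotypes are stored as the number of minor alleles carried (0, 1 or 2). The exact coding used by the association software in this study is not stated here, so the function names and encoding are illustrative assumptions.

```python
from typing import List, Sequence


def additive_code(minor_allele_count: int) -> int:
    """Additive model: the predictor is the number of minor alleles (0, 1 or 2)."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2 minor alleles")
    return minor_allele_count


def dominant_code(minor_allele_count: int) -> int:
    """Dominant model: carriers of one or two minor alleles are pooled (0, 1, 1)."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2 minor alleles")
    return 1 if minor_allele_count >= 1 else 0


def encode(genotypes: Sequence[int], model: str = "additive") -> List[int]:
    """Encode a vector of genotypes under the chosen genetic model."""
    coder = {"additive": additive_code, "dominant": dominant_code}[model]
    return [coder(g) for g in genotypes]
```

Under this convention, a SNP whose heterozygotes resemble minor allele homozygotes tends to be detected by the dominant model, whereas a SNP with a per-allele dose effect tends to be detected by the additive model, which is one way some associations can be concordant between models while others appear in only one.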

Table 4 Markers showing genome wide significant associations with C20:4n6, C20:5n3, C22:5n3, CLA:9c11t and C18:3tcc.

SNPs in mostly intergenic regions were significantly associated with AA (C20:4n6) along with 25 markers on 25 genes (Table 4 and Table S5b). Only one coding synonymous SNP (ss1850101917) within exon 3 of the TLX2 gene showed significant genome wide association with AA, whereas the highest phenotypic variance of 4.3% was explained by an intronic mutation (ss1850112671) within the RABEPK gene.

Only five markers (two within genes and three in intergenic regions) were significantly associated with DPA (C22:5n3) (Table 4 and Table S5c). An intergenic SNP (rs466855972) with the highest MAF (14%) is located within 50 Kbp of the SMOC2 gene on chromosome 9. This mutation also explained the most variance (3.6%) in DPA. A silent mutation within exon 4 of the GNLY gene was the only coding SNP (ss1850106318) with significant association with DPA. One intergenic SNP on BTA1 (ss1849964962) with a MAF of 4% associated significantly with CLA:9c11t (Table 4 and Table S5d). Similarly, only one intergenic variant on BTAX (ss1850306238) reached genome wide significance with gamma linolenic acid (C18:3tcc) (Table 4 and Table S5d).

Palmitoleic acid (C16:1) out of seven monounsaturated fatty acids (MUFAs) (Table 1) was associated with one intergenic SNP on chromosome 26 (rs110405215) and oleic acid (C18:1n9c) was associated with three SNPs (ss1850063824, rs135581384 and rs41855732) on three different chromosomes (Table S6a). Nine SNPs on 8 different chromosomes associated significantly with total MUFA including one synonymous coding region SNP (ss1850271826) within exon 2 of the ACHE gene and two intronic SNPs in the TACR3 (rs440980096) and ITGB4 (rs109739948) genes (Table S6b). Furthermore, SNP rs41855732 associated significantly with both oleic acid and total MUFA.

Significant marker associations with butyric acid (C4:0) are shown in Table S7a. Out of 116 associations, 81 markers are found in intergenic regions while 35 are found within gene regions. The majority of gene region associated variants are intronic (29), followed by three coding SNPs (rs457014340, rs456001743 and ss1850186363), two 3′UTR SNPs (rs480031082 and ss1850158549) and one splice variant (ss1850027498). The SNP rs457014340, detected by both the additive and dominant models, is located within exon 4 of the VPS37C gene (p.Ser115Ala, c.343T>G). The SNPs rs456001743 and ss1850186363 were detected only by the additive model and are located in exon 6 of the COMMD4 gene (p.His102Pro, c.305A>C) and in exon 19 of the FUK gene (p.Gln855Glu, c.2563C>G), respectively.

Results of GWAS analysis for caproic acid (C6:0) are shown in Table S7b. One non-synonymous coding SNP (rs136905662, p.Gly265Val, c.794G>T) on the F7 gene and two intronic variants on the PSMA4 (rs207776812) and BICD2 (rs463987848) genes (each with MAF of 10%) associated significantly with C6:0. Further intronic SNPs within the EVL (rs467244058 and rs432423874), FTO (rs133525188 and rs381581176) and MACROD1 (ss1850302054 and rs451632156) genes associated significantly with C6:0. SNPs significantly associated with C6:0 and having MAF ≥1.5% are shown in Table S7b. Furthermore, 7 markers (rs458879791, rs447857210, ss1850048597, ss1850251288, rs451632156, ss1850074571, rs469668684) out of 69 significant gene region SNPs explained the highest phenotypic variances in C6:0, ranging from 3.2% to 6.9%.

Caprylic acid (C8:0) was significantly associated with 119 SNPs, including 81 intergenic and 38 gene region SNPs (including 4 non-synonymous coding SNPs) (Table S7c). Two intronic SNPs (ss1850302054 and rs451632156) within the MACROD1 gene were significantly associated with both C6:0 and C8:0. Five SNPs (rs451632156, rs458879791, rs447857210, ss1850251288, rs110927574) explained relatively high proportions of the variance in C8:0, ranging from 3.1% to 5.5%.

Significant associations were detected between five SNPs (ss1850128726, ss1850120683, ss1850133655, ss1850196022 and ss1850063824) and C14:0, and between two SNPs each and C11:0 (rs43649533, ss1850043060), C15:0 (rs382773693, ss1850118039) and C17:0 (rs385021638, ss1850255586) (Table S7d). The SNP ss1850196022 (p.His30Pro, c.89A>C) is a non-synonymous coding SNP within exon 1 of the CYP2S1 gene.

The largest number of significant GWAS results was recorded for tridecylic acid (C13:0), with 707 markers out of 76,355 (Table S7e). Of these, 483 are intergenic SNPs while 224, including 27 coding variants (21 of them non-synonymous), are gene region SNPs. Furthermore, one coding SNP (rs471212184) in exon 9 of the TARBP2 gene is a stop loss mutation (p.*367Glyext*64, c.1099T>G). Although most SNPs associated with C13:0 had MAF below 10%, 5 gene region SNPs had MAF ranging from 10.98% to 20.91% (Table S7e) while 16 intergenic variants had MAF above 10%. It should be noted that 5 intronic SNPs within the NPAS2 gene associated with C13:0. Similarly, two SNPs in each of the GRB10, CSGALNACT1, XPNPEP1, MAD1L1 and CLSTN2 genes attained genome wide significance with C13:0. Furthermore, two SNPs within microRNA genes (rs440208182 on MIR130B/MIR301B and rs480300366 on MIR3596/MIRLET7B) associated with C13:0.

The long chain saturated fatty acids (SFAs) tricosanoic acid (C23:0) and lignoceric acid (C24:0) associated significantly with several SNP markers, while no associations were recorded for C20:0 and C22:0 (Table S7f,g). Out of 40 significant markers (30 intergenic, 9 intronic and one 5′UTR) for C23:0, 7 intergenic SNPs had MAF of 10% and above. Furthermore, 12 intergenic variants and 3 intronic variants each accounted for over 3% of phenotypic variance in C23:0. Significant associations for C24:0 included 164 intergenic and 80 gene region SNPs (Table S7g). Three intronic variants (ss1850076897, rs42155039 and ss1850107740) on the MYT1L, MARCH8 and ANTXR1 genes had MAF of 29.4%, 22.6% and 18.6%, respectively. Seven out of nine coding SNPs are non-synonymous (Table S7g). Two intronic SNPs within the GMDS (ss1850253438) and ACPP (ss1849970456) genes explained the highest proportion of variance in C24:0, at 7.3% and 6.3% respectively. Only 11 SNPs on 10 different chromosomes, including one coding SNP on the ACHE (ss1850271826) gene and one intronic SNP each on the TONSL (rs135581384) and TACR3 (rs440980096) genes, associated significantly with milk total SFA (Table S7h).

One or more SNPs within the same gene associated significantly with one or several traits (Table S8). In some cases, one SNP associated significantly with two or more fatty acids (Table S9). For example, one SNP each within the ACOX3, PPP2R4 and GGT6 genes associated significantly with C4:0, C6:0 and C8:0, 5 SNPs within the NPAS2 gene associated significantly with C13:0, and five variants within the EVL gene associated significantly with one or more fatty acids. Several markers within seven regions (1 Mbp to 3 Mbp) on different chromosomes, denoted association hotspots, associated significantly with several fatty acids (Table 5). About 81 variants significantly associated with individual milk fatty acids and having MAF ≥10% were considered commonly associated SNPs for these traits in this study (Table 6).

Table 5 Chromosomal regions (0 to 3 Mbp) harboring four or more significantly associated SNPs with the same or different fatty acid traits and termed association hot spots in this study.
Table 6 Commonly significantly associated SNPs (MAF ≥ 10%) with individual fatty acid traits.
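One way to locate regions matching the hotspot definition in the caption of Table 5 (four or more significantly associated SNPs within a span of at most 3 Mbp) is a sliding window over sorted SNP positions. The sketch below is an assumed reconstruction for illustration only. The study does not describe its procedure, and the function name, input format and choice to report maximal windows are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_hotspots(
    snps: Iterable[Tuple[str, int]],   # (chromosome, position in bp) of significant SNPs
    max_span_bp: int = 3_000_000,      # window size from the Table 5 caption
    min_snps: int = 4,                 # minimum SNP count from the Table 5 caption
) -> List[Tuple[str, int, int, int]]:
    """Return (chromosome, start_bp, end_bp, n_snps) for each maximal window."""
    by_chrom: Dict[str, List[int]] = defaultdict(list)
    for chrom, pos in snps:
        by_chrom[chrom].append(pos)

    hotspots: List[Tuple[str, int, int, int]] = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        last_end = -1  # index of the last SNP in the previously reported window
        j = 0
        for i in range(len(positions)):
            j = max(j, i)
            # Extend the window to the right while it still fits in max_span_bp.
            while j + 1 < len(positions) and positions[j + 1] - positions[i] <= max_span_bp:
                j += 1
            # Report only windows that reach beyond the previous one, so that
            # windows fully contained in an earlier window are skipped.
            if j - i + 1 >= min_snps and j > last_end:
                hotspots.append((chrom, positions[i], positions[j], j - i + 1))
                last_end = j
    return hotspots
```

Overlapping windows on the same chromosome are reported separately in this sketch. Merging them, or counting SNPs per trait rather than across traits, would be equally defensible choices.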

Discussion

We have demonstrated in this study that the genotyping-by-sequencing (GBS) technique is a useful approach to detect population-specific SNPs influencing milk traits in Canadian Holstein cows by identifying 515,787 SNPs in 1,246 animals using the Illumina HiSeq platform. This technique was first developed to support genetic diversity studies in plants25 and tested on cattle samples at a small scale by De Donato et al.26. It has now been extended to identify SNPs in a much larger sample, followed by GWAS.

GBS allows a higher level of ascertainment of the genetic variation within a given population than current genotyping arrays, and at a lower cost (about a third of the cost of using the proprietary Illumina BovineSNP50 array). About 71% of detected SNPs in this study are novel; most of these may be population-specific and will increase our knowledge of the genomic variation in Canadian Holstein cows available for dairy improvement. It should be noted that marker trait association information is best applied in the populations where such associations were identified. The GBS technique may therefore complement available genotyping arrays for detection of novel and known SNPs within specific populations.

The preponderance of variants with MAF <5% in this study is not unusual, because whole genome sequencing of 234 bulls representing Holstein, Fleckvieh, Jersey and Angus breeds and deep sequencing of human genomes from different racial backgrounds indicate that rare (MAF below 0.5%) and low frequency (MAF 0.5% to 5%) variants greatly outnumber common variants23,24. Furthermore, rare and low frequency variants have been shown to explain part of the phenotypic variation in some human diseases28. Our findings showed that the majority of significant SNPs for milk traits are found within non-coding regions of genes and intergenic regions of the genome, which is supported by many recent GWAS and candidate gene studies on milk traits14,29,30. In humans, it has been reported that over 80% of disease associated variants fall outside protein coding regions of genes31, further underscoring the contribution of non-coding and intergenic region SNPs to complex traits and supporting their inclusion in GWAS. Our data further strengthen the notion that regions of the genome previously considered junk harbor mutations that drive gene expression and affect economically important traits.

As complex quantitative traits are controlled by numerous genes with small effects22,32, milk traits were associated mostly with SNPs with small effects. This study confirmed a strong signal for TFP in the centromeric region of BTA14 previously reported for milk fat yield, fat%, protein yield and protein%18,22,29,30,32,33,34,35,36. This peak region (0–2 Mbp, Fig. 2) lies within the same chromosomal region as the DGAT1 gene, whose effect on milk production traits has been confirmed in numerous breeds around the globe11,17,22,33,37,38,39. Smaragdov38 has proposed the use of the DGAT1 K232A mutation as a gold standard in gene sets used in the comparison of effects on milk productivity. However, the pleiotropic effect of the K232A polymorphism on genes related to cell growth, proliferation, development, tissue remodeling, cell signaling and immune system response has led to the argument that the expression pattern of genes carrying the K232A mutation reflects counter mechanisms of mammary gland tissue response to changes in milk fatty acid concentration and/or composition40. Streit et al.39 showed evidence for a major DGAT1 gene by polygene interaction effect for milk fat and protein percentage in German Holstein cattle, while Bennewitz et al.33 reported that the DGAT1 K232A mutation is not solely responsible for all the genetic variation for milk, fat and protein yield and fat and protein percentages at the centromeric region of BTA14. Our data have uncovered more SNPs that contribute to the genetic variation in TFP in the centromeric region of BTA14. Although 15 out of the 20 strong signal variants for TFP on BTA14 are located within intergenic regions, 5 are located within the intronic regions of four genes (ADCK5, TONSL, PPP1R16A and TRAPPC9) (Table 2).
TONSL, ADCK5 and PPP1R16A are among genes identified in a study that assessed the gene content of the chromosomal regions flanking the DGAT1 gene as a basis for future linkage disequilibrium studies aimed at determining whether genes neighboring DGAT1 are associated with variation in milk fat percentage41. The two mutations of TONSL (rs132685115 and rs135581384) had MAF of 27% and 26%, respectively, explained 9.5% of the variation in TFP in this study and may be considered potential candidate markers for milk fat%. An intergenic variant (rs210334336) positioned 0.7 Mbp upstream of the DGAT1 gene has a MAF of 29% and explained the highest proportion of variance (6.6%) in TFP in this study. The high proportion of variance (72.85.4%) in TFP explained by the significant SNPs (20 of them) within the centromeric region of BTA14 in this study supports the notion that the DGAT1 gene is not solely responsible for the variation in milk fat% in this region. Other SNPs within the centromeric region of BTA14 significantly influenced other traits in this study. These include intergenic SNPs (rs109818540, rs109072495 and rs110566728) and one coding SNP in the LRRC14 gene (rs439245899, c.500T>G, p.Val167Gly) that significantly influenced C13:0, while two intergenic SNPs, rs110892754 and rs381071867, significantly associated with 305dFY and LP, respectively. In a recent GWAS utilizing the Illumina BovineSNP50 BeadChip for milk production traits in a Chinese Holstein population29, 92.3% (60 out of 65) of genome-wise significant variants for milk fat percentage were located within a 6.2 Mbp region (0.05–6.25 Mbp) of BTA14, further supporting our findings. The only non-synonymous coding SNP (ss1850090958) that showed genome wide significance with TFP and TPP in this study is located within exon 20 of the TEP1 gene (p.Tyr1006Ser, c.3017A>C), and the affected amino acid lies within a region of unknown function of the protein.

Five significant associations for TPP (rs455358874, rs134756756, rs137597165, ss1850220972 and ss1850220213) are located on BTA20, mostly in the vicinity of reported QTLs for TPP29,42,43. Three further associated SNPs for TPP (ss1850047119, rs379699027, rs133974370) on BTA6 occur within a region (118 Mbp to 120 Mbp) where significant QTLs for protein percentage have been reported34,44 and could be contributing factors to these QTLs.

Many significant associations (91 intergenic and 52 gene region variants) were recorded for SCC. SCC is routinely monitored in dairy herds as an indirect measure of bovine mammary gland health. Mastitis, the most important disease of dairy cows, is under the control of numerous factors including genetics, indicating that many genes and gene pathways spread over the entire genome may contribute to the genetic variance in milk SCC. The 52 SNPs associated with SCC are located in either the intronic or exonic regions of 48 genes. Many of these genes (e.g. RASA3, TPST1, JDP2, PTPN22, CAMKK1, TNR, IGGL1, CDH15, CHD23, NGEF, ANKRD27, SBF2, TXNDC5, RRM3, CHST8, ADAM12) have immune functions, are located in disease related pathways or are implicated in disease progression. Furthermore, several SNPs significantly associated with SCC in this study lie within reported QTL regions for SCC and somatic cell score45.

SNPs associated with one or more PUFAs were found on all chromosomes except BTA8, with 5 or more associations on BTA1, 5, 7, 10, 11, 15, 19, 21, 24 and 28 (Table S5a to S5d). Only a few studies on candidate gene associations and significant QTLs for individual or total milk PUFAs have been reported in cattle10,14,46,47,48, and our study is the first to detect significant SNPs associated with milk EPA (C20:5n3), AA and DPA in genes without prior associations with fatty acid biosynthesis or uptake. Three SNPs (ss1850294609, rs470755489 and rs471314510) within a 10 Mbp region (33.4 Mbp to 43.5 Mbp) of BTA28 associated significantly with EPA. The SNP rs470755489 (p.Ala1424Pro) is a non-synonymous SNP within exon 21 of the ERCC6 gene, suggesting ERCC6 as a likely novel candidate gene for milk EPA. Another potential candidate gene for milk EPA is ACER3, which harbors rs439293424 and is implicated in the sphingolipid metabolism pathway. SNP rs439293424 (on KCNJ1 gene) and ss1850305614 (on LSP1 gene) are located within a chromosomal region harboring the FADS1 and FADS2 genes, which have well-defined roles in the synthesis of PUFAs. SNPs within FADS1 and FADS2 were recently demonstrated to associate significantly with C20:3n6, C20:4n6 and C20:5n3 in bovine milk14. A 28 Mbp region of BTA24 (12.82 Mbp to 40.82 Mbp) harbored 8 SNPs (ss1850256121, ss1850256332, ss1850256353, rs385515058, rs381067250, rs209502433, rs445435952, rs109764724) that were significantly associated with C20:5n3 and together explained 25.23% of the variation in C20:5n3. There are only two reports of significant QTLs for milk fat% and milk fat yield in this region49,50.

Two SNPs (ss1850063824 and rs41855732) were associated with variation in both oleic acid and total MUFA. Oleic acid is the most abundant MUFA and therefore contributed the most to total MUFA. The intergenic SNP ss1850063824 maps close to two genes in the solute carrier family (SLC7A4 and SLC25A1) as well as MAN2A1, which is implicated in two KEGG pathways (metabolic and N-glycan biosynthesis pathways). Recently, Nafikov et al.51 reported a QTL for oleic acid on BTA7 and suggested a gene in the solute carrier family (SLC27A6) as the potential candidate. An intronic SNP (rs109739948) in the ITGB4 gene that associated significantly with C18:1n9c occurs within the region (26.5–57.7 Mbp) of a reported QTL for milk C18:1n9c percentage52. The ITGB4 gene could therefore be a candidate for oleic acid. A SNP in the TONSL gene (rs135581384) with a high MAF significantly associated with both C18:1n9c and TFP and may be a candidate SNP for these traits. Only one coding synonymous variant (ss1850271826), within the ACHE gene with roles in metabolism and in glycerophospholipid biosynthesis and metabolism pathways, was significantly associated with total MUFA.

Of the 16 SFAs studied, six (C4:0, C6:0, C8:0, C13:0, C23:0 and C24:0) accounted for most of the significant associations. The SNP most significantly associated with C4:0, rs458879791, also associated with C6:0 and C8:0 and occurs in the intronic region of the GGT6 gene, which has roles in glutathione metabolism53. This SNP explained 4.8%, 6.9% and 5.5% of the variation in C4:0, C6:0 and C8:0, respectively, and GGT6 is considered a potential candidate gene for these traits even though the SNP has a low MAF of 1.5%. Furthermore, rs458879791 may be localized in the same region as a QTL on BTA19 previously reported for C6:0 and C8:0 by Nafikov et al.51. Another SNP (ss1850048597) that associated significantly with C4:0, C6:0 and C8:0 occurs in the intronic region of the ACOX3 gene, which has documented roles in the biosynthesis of SFAs and fatty acid oxidation54. A MECR gene SNP (ss1849987546) significantly influenced C4:0. MECR is involved in catalyzing the NADPH-dependent reduction of trans-2-enoyl thioesters and generating saturated acyl-groups55. SNPs in the ACSL3 (ss1849985469) and FABP3 (ss1849987006) genes on BTA2, which have well-defined roles in fatty acid biosynthesis56,57, were significantly associated with variation in C13:0 and C6:0, respectively. A SNP (rs379603734) on the ADAM12 gene, which is implicated in breast cancer58, associated significantly with C4:0 and C13:0 and was responsible for 2% of phenotypic variation. Three SNPs in FUK, including a coding variant (ss1850186363, p.Gln855Glu, MAF of 3.8%) in exon 19, associated significantly with C4:0; given the role of FUK in KEGG's fructose and mannose metabolism pathway, its involvement in mammary lipogenesis may not be excluded.

Mutations on the NPRL3 gene (ss1850263261 [p.Val394Gly] and ss1850263264) with significant associations with C4:0 and C24:0, respectively, occur within the region of a QTL for milk palmitic acid (C16:0)59, making it a potential candidate gene for milk fatty acid traits. Numerous SNP associations and QTLs exist on BTA21 for milk traits in dairy cattle19,46,50,60, and this study has identified SNP associations with fatty acid traits in the EVL (rs432423874, rs209872748, rs467244058, rs381368835 and rs454925079), SLCO3A1 (rs434552481 and rs209897920) and PSMA4 (rs207776812) genes, making them potential candidate genes for the reported QTLs. Neighboring SNPs (rs133525188 and rs381581176) occurring with the same MAF (4.2%) on the FTO gene influenced C6:0 significantly, thus supporting a previous report on the impact of two causative mutations in the FTO gene with a functional effect on milk fat and protein yield in Holstein dairy cattle61. MACROD1 and five other genes (SLC22A8, NRXN2, EHD1, DKFZP761E198 and C29H11orf68) harboring variants with significant associations with C6:0, C8:0 and C13:0 are located within an association hotspot region (3 Mbp, from 42,815,829 to 45,731,092 bp) of BTA29 (Table 5). Previous reports36,43 of significant QTLs for milk and protein yield and for milk fat and protein percentages within this region of BTA29 suggest that associated SNPs in this study could be contributing factors to the phenotypic variance in these traits. Two or more SNPs in 7 genes, including 5 in the NPAS2 gene, were observed to significantly associate with C13:0 in this study (Table S7e). SNPs in the NPAS2 gene are located within a previously reported42 QTL region for milk fat yield on BTA11. In addition, two NPAS2 SNPs (rs211557881 and rs208606161) explained about 7% of the variation in C13:0. The presence of NPAS2 amongst genes of human REACTOME's fatty acid, triacylglycerol and ketone body metabolism pathway suggests a similar role for this gene in bovine.

Conclusion

Our study used the GBS method to identify 515,787 SNPs in Canadian Holstein cows. Most SNPs were localized in intergenic regions, followed by intronic regions of genes, which further emphasizes the contribution of non-coding and intergenic variants in defining phenotypes and supports their inclusion in GWAS. Only about 29% of identified SNPs are present in dbSNP, while 71% are novel. Association analysis of 76,355 markers with 44 milk traits identified novel genomic regions associated with milk traits. Most associated SNPs were located in intergenic regions, followed by intronic regions of genes. Twenty markers within the centromeric region of bovine chromosome 14 showed strong association with TFP. Several SNPs were significantly associated with two omega-3 fatty acids (C20:5n3 [EPA] and C22:5n3 [DPA]), one omega-6 fatty acid (C20:4n6, AA), one CLA isomer (CLA:9c11t) and gamma linolenic acid (C18:3tcc). Potential candidate genes uncovered for milk traits or mammary gland functions include ERCC6, TONSL, NPAS2, ACER3, ITGB4, GGT6, ACOX3, MECR, ADAM12, ACHE, LRRC14, FUK, NPRL3, EVL, SLCO3A1, PSMA4, FTO, ADCK5, PP1R16A and TEP1. Our study further demonstrated the utility of the GBS technique for identifying population-specific SNPs for use in breeding to improve complex dairy traits.

Methods

Animal ethics

Animal use procedures and protocols were according to the national codes of practice for the care and handling of farm animals (http://www.nfacc.ca/codes-of-practice) and approved by the animal care committee of McGill University.

Animals and milk sampling

A total of 1246 Canadian Holstein dairy cows enrolled in the dairy production center of expertise for Quebec and the Atlantic Provinces, Valacta (www.valacta.com), were used for this study. Cows were drawn from 16 herds from the province of Quebec with an average of 98 animals per herd. Cows were in mid-lactation and their parities ranged from one to five. Animal management by participating farms was according to standard procedures. Fifty mL of milk was collected from each animal during the morning milking; a portion (about 10 mL) was used to analyse milk components, while 40 mL was separated into fat and milk somatic cells by centrifugation (12000 × g at 4 °C for 30 min) immediately upon arrival at the laboratory. The fat portion was used for fatty acid profile analysis while DNA was isolated from milk somatic cells. Milk sampling was coordinated by Valacta.

Analysis of milk components

The contents of milk components including test day milk yield, fat and protein yields, lactose and milk urea nitrogen were determined with MilkoScan FT 6000 Series mid-range Fourier transform infrared (FTIR) spectrometers, and somatic cell counts were determined with a Fossomatic flow cytometric cell counter at Valacta (Ste-Anne de Bellevue, QC, www.valacta.com). Test day milk fat and protein yields were determined by multiplying the respective percentages by the total test day milk production. Entire lactation production values (305-d total milk production, 305-d milk fat yield and milk protein yield) were obtained by adding together monthly values covering the entire lactation period for each cow.

Fatty acid profile analysis

Fatty acid methyl esters (FAME) for fatty acid profile analysis were prepared according to the procedure of O’Fallon et al.62. FAME were separated into different fatty acid isomers by capillary gas chromatography on a Varian CP-3900 gas chromatograph equipped with a Varian CP-8400 auto-sampler and auto-injector, column oven and a flame ionization detector (Varian Inc., Walnut Creek, CA, USA) according to O’Fallon et al.62. Individual FAME peaks were identified by comparison of retention times with FAME standards (GLC No. 463 and No. UC-59-M, Nu-Chek Prep Inc., Elysian, MN, USA). Agilent Technologies Chemstation (B.04.03) software was used for data analysis.

DNA isolation

Genomic DNA from milk somatic cells was isolated using NucleoSpin® Blood QuickPure kit (MJS Biolynx, Ontario, Canada) with some modifications as described in Ibeagha-Awemu et al.14. The concentration of purified DNA was measured with NanoDrop® spectrophotometer (NanoDrop Technologies, Inc., Wilmington, DE, USA).

Genotyping-by-sequencing (GBS)

GBS libraries were prepared and analyzed at the Institute for Genomic Diversity, Cornell University (IGD), according to Elshire et al.25 with modifications according to De Donato et al.26. To reduce genome complexity, samples were initially digested with PstI enzyme26. Libraries were created with 1246 unique barcodes. Ninety-six multiplexed libraries (including controls) per lane (15 lanes in total) were subjected to single-end 100 bp sequencing on an Illumina HiSeq 2000 system (Illumina Inc., San Diego, CA, USA).

GBS bioinformatics

The GBS analysis pipeline (Fig. 3) implemented in Tassel Version: 3.0.13963 (date: November 8, 2012) was used to process raw Illumina DNA sequence data and to call SNPs. The GBS pipeline options used are listed in Table S10. An overview of GBS bioinformatics and the GBS pipeline can be found at http://www.maizegenetics.net/#!tassel/c17q9. Tags were aligned to the cow reference genome, Btau_4.6.1/bosTau7 assembly using BWA version 0.6.1-r104.

Figure 3. Genotyping-by-sequencing sequence data analysis pipeline (https://bitbucket.org/tasseladmin/tassel-5-source/wiki/Tassel5GBSv2Pipeline).

SNP analysis

VCFtools v0.1.8 (http://vcftools.sourceforge.net/)64 was used to summarize and filter data and to generate input files for PLINK65, which were used for multidimensional scaling (MDS). Analyses were visualized using basic plotting functions in R version 2.15.0 (https://www.r-project.org/).

Genotype imputation

Internal imputation with Beagle v3.3.2 software was done to correct for missing genotypes at some marker sites in some samples and also to increase overall data call rates. Beagle was run with default parameters. The Beagle utility program “gprobs2beagle.jar” was used to make genotype calls based on a probability threshold of 0.95. Any subject/marker combination where the probability of the most likely genotype was less than 0.95 was assigned a missing genotype in the output and was not phased. The Beagle utility program “gprobsmetrics.jar” was used to compute several per-marker metrics including minor allele frequency and allelic R^2 values among other metrics.
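The thresholding rule described above can be illustrated with a short Python sketch. This is not the code of the Beagle utilities; the function names and the posterior probabilities are hypothetical, and genotypes are assumed to be coded as counts (0, 1, 2) of one allele.

```python
def call_genotype(probs, threshold=0.95):
    """Return the index (0, 1 or 2) of the most likely genotype,
    or None (missing) if its probability is below the threshold."""
    best = max(range(len(probs)), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


def minor_allele_frequency(calls):
    """MAF from genotype calls coded as allele counts; missing calls (None) are skipped."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return None
    freq = sum(observed) / (2 * len(observed))
    return min(freq, 1 - freq)


# Hypothetical genotype probabilities for one marker in four samples.
posteriors = [
    (0.99, 0.01, 0.00),
    (0.60, 0.39, 0.01),  # most likely genotype < 0.95, so the call is set to missing
    (0.02, 0.97, 0.01),
    (0.00, 0.03, 0.97),
]
calls = [call_genotype(p) for p in posteriors]
maf = minor_allele_frequency(calls)
```

With these inputs the second sample is left uncalled, and the frequency is computed from the three remaining calls only.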

Genome wide association analysis

GWAS was accomplished with the single-locus mixed linear model procedure implemented in Golden Helix SVS v8.1.1 software (Golden Helix, Inc, Bozeman, MT, USA, www.goldenhelix.com). Specifically, the efficient mixed model association (EMMA) approach66 was used to directly estimate the variance components $\sigma_g^2$ and $\sigma_e^2$, reducing the problem to a maximization search in just one dimension. To correct for population structure in the absence of pedigree data, a kinship matrix was computed once using all markers. The kinship matrix was then used to solve the EMMA equation for every marker. The EMMA procedure and equations have been described in detail in the SNP & Variation Suite Manual Release 8.4.3 (Golden Helix, Inc.) and in Kang et al.66 and are summarized below:

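The genotype to phenotype association was done by testing the hypothesis for each of the m loci, one at a time, on the basis of the model

$$y_i = \beta_0 + \beta_k M_{ik} + \sum_j \gamma_j C_{ij} + \eta_i \qquad (1)$$

where $y_i$ is the phenotype of individual i, $\beta_0$ is the intercept, $M_{ik}$ is the minor allele count of marker k for individual i, $\beta_k$ is the fixed effect size of marker k, and $\gamma_j C_{ij}$ are the other fixed effects of parity and herd. The error term is

$$\eta_i = \sum_{l \neq k} \beta_l M_{il} + \varepsilon_i, \qquad \mathrm{var}(\varepsilon) = \sigma_e^2 I \qquad (2)$$

Assuming the 1246 Canadian Holstein dairy cows were unrelated and there was no dependence across the genotypes, the $\eta_i$ values would be independently and identically distributed (i.i.d.), and simple linear regressions would make appropriate inferences for the values of $\beta_k$.

However, the variance of the first term of $\eta_i$ actually comes closer to being proportional to a matrix $K$ of the relatedness or kinship between samples. Thus,

$$\mathrm{var}(\eta) = \sigma_g^2 K + \sigma_e^2 I \qquad (3)$$

which reduces the equation for $y_i$ to the mixed-model equation

$$y = X\beta + u + \varepsilon, \qquad \mathrm{var}(u) = \sigma_g^2 K, \quad \mathrm{var}(\varepsilon) = \sigma_e^2 I \qquad (4)$$

where $X$ contains the intercept, the minor allele counts of the tested marker and the other fixed effects, and $u$ is the random effect whose covariance is proportional to the kinship matrix.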
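As an illustration of the kinship matrix that enters the mixed model, the following Python sketch computes a simple identity-by-state (IBS) similarity matrix from minor allele counts. The source does not state which kinship estimator was used in SVS, so the IBS formula here is an assumption chosen for simplicity; the function name and the toy genotypes are hypothetical, and complete (non-missing) genotypes are assumed.

```python
def ibs_kinship(genotypes):
    """Pairwise identity-by-state similarity.

    genotypes: one list per individual of minor allele counts (0, 1 or 2),
    all of the same length and without missing values.
    Returns an n x n symmetric matrix with entries in [0, 1].
    """
    n = len(genotypes)
    m = len(genotypes[0])
    kinship = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Each marker contributes 2, 1 or 0 shared alleles.
            shared = sum(2 - abs(a - b) for a, b in zip(genotypes[i], genotypes[j]))
            kinship[i][j] = kinship[j][i] = shared / (2 * m)
    return kinship


# Toy data: individuals 0 and 1 are identical, individual 2 differs at most markers.
genotypes = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 2],
]
K = ibs_kinship(genotypes)
```

Because the matrix depends only on the genotypes, it can be computed once from all markers and reused for every single-marker test, which matches the workflow described in the text.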

Both the additive and dominant models were used in GWAS to capture most existing associations. Under the additive model, testing is designed specifically to reveal associations that depend additively on the allele classification. When alleles are classified according to frequency, the associations depend additively on the minor allele: having two minor alleles (DD) rather than no minor alleles (dd) is twice as likely to affect the outcome in a certain direction as having just one minor allele (Dd) rather than no minor alleles (dd) (SVS Manual release 8.4.3, www.goldenhelix.com). Under the dominant model, allele classification according to frequency specifically tests the association of having at least one minor allele D (either Dd or DD) versus not having it at all (dd) (SVS Manual release 8.4.3). The Benjamini-Hochberg (BH) false discovery rate (FDR) correction was applied to raw p-values and genome wide significance was declared at a BH FDR-adjusted p-value <0.1.
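The two genotype codings and the BH correction described above can be sketched in a few lines of Python. This is an illustrative sketch, not the SVS implementation; the function names and the four raw p-values are hypothetical, and genotypes are assumed to be given as minor allele counts (dd = 0, Dd = 1, DD = 2).

```python
def additive_code(minor_allele_count):
    """Additive model: the predictor is the minor allele count (dd=0, Dd=1, DD=2)."""
    return minor_allele_count


def dominant_code(minor_allele_count):
    """Dominant model: at least one minor allele (Dd or DD) versus none (dd)."""
    return 1 if minor_allele_count >= 1 else 0


def benjamini_hochberg(p_values):
    """Return BH FDR-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four markers.
raw_p = [0.001, 0.04, 0.03, 0.5]
adjusted_p = benjamini_hochberg(raw_p)
significant = [q < 0.1 for q in adjusted_p]
```

In this toy example the first three markers pass the 0.1 threshold after adjustment and the fourth does not.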

Additional Information

How to cite this article: Ibeagha-Awemu, E. M. et al. High density genome wide genotyping-by-sequencing and association identifies common and low frequency SNPs, and novel candidate genes influencing cow milk traits. Sci. Rep. 6, 31109; doi: 10.1038/srep31109 (2016).