Sequence-based GWAS and post-GWAS analyses reveal a key role of SLC37A1, ANKH, and regulatory regions on bovine milk mineral content

The mineral composition of bovine milk plays an important role in determining its nutritional and cheese-making value. Concentrations of the main minerals predicted from mid-infrared spectra produced during milk recording, combined with cow genotypes, provide a unique opportunity to decipher the genetic determinism of these traits. The present study included 1 million test-day predictions of Ca, Mg, P, K, Na, and citrate content from 126,876 Montbéliarde cows, of which 19,586 had genotype data available. All investigated traits were highly heritable (0.50–0.58), with the exception of Na (0.32). A sequence-based genome-wide association study (GWAS) detected 50 QTL (18 affecting two to five traits) and positional candidate genes and variants, mostly located in non-coding sequences. In silico post-GWAS analyses highlighted 877 variants that could be regulatory SNPs altering transcription factor (TF) binding sites or located in non-coding RNA (mainly lncRNA). Furthermore, we found 47 positional candidate genes and 45 TFs highly expressed in mammary gland compared to 90 other bovine tissues. Among the mammary-specific genes, SLC37A1 and ANKH, encoding proteins involved in ion transport were located in the most significant QTL. This study therefore highlights a comprehensive set of functional candidate genes and variants that affect milk mineral content.


Results
We analyzed six traits predicted from MIR spectra in Montbéliarde cows, representing the mineral (Ca, P, Mg, K, and Na) and citrate content of milk. MIR prediction equations originated from the Optimir project 14,15 . The accuracies of these MIR predictions, as assessed by the coefficient of determination (R 2 ) in a validation population (Table 1), ranged from 0.68 to 0.90, with the exception of Na (0.44).
Heritability and genetic correlation estimates for mineral and citrate content. Genetic parameters, i.e. heritabilities (h 2 ) and genetic correlations (r g ), were estimated for milk mineral and citrate content, as predicted from more than 1 million test-day records from 126,873 cows (Table 2). At the test-day level, heritability estimates were moderate for Na content (h 2 = 0.32) but higher for other minerals (h 2 = 0.50 to 0.56) and citrate (h 2 = 0.48). Na was negatively and poorly correlated with other minerals (-0.23 ≤ r g ≤ − 0.02) and citrate (r g = − 0.15), while the levels of most other minerals were generally positively correlated (0.11 ≤ r g ≤ 0.60); the exception was the relationship between Ca and K (r g = − 0. 22). Values of the genetic correlation between citrate and mineral levels in milk ranged quite broadly, depending on the mineral: null with K (r g = 0.01 with K), slightly negative with Na (r g = − 0.15) and P (r g = − 0. 16), and highly positive with Ca (r g = 0.57) and Mg (r g = 0.59). Table 1. Mean, standard deviation (SD), and accuracy (R 2 ), estimated by cross-validation, of mid-infrared (MIR) predictions for concentrations of minerals and citrate in milk from Montbéliarde cows. www.nature.com/scientificreports/ QTL identified in GWAS. We conducted single-trait GWASs on imputed WGSs from 19,586 cows (12,907,802 variants) and primarily identified 96 trait x region combinations with significant effects (-log 10 (P) ≥ 8.4) on milk mineral and citrate composition by applying the procedure described in the Methods section ( Fig. 1). In two genomic regions (a 12Mbp-region on BTA1 and a 27Mbp-region on BTA20), the single marker analyses identified multiple trait x region combinations (10 and 15, respectively). Therefore, conditional analyses including the most significant variant identified in each region was applied to decipher if significant www.nature.com/scientificreports/ variants were due to linkage disequilibrium (LD) with the same causal mutation or to the presence of multiple causal mutations. These analyses led to the exclusion of 4 and 9 trait x region combinations on BTA1 and BTA20, respectively, resulting in 83 independent QTL. These QTL corresponded to 50 genomic regions located on Bos taurus (BTA) autosomes 1, 2, 3, 4, 5, 6, and 7 (Table 3) and 11,12,13,14,15,17,18,19,20,21,22,24,25,27,28, and 29 (Table 4). Of these, 18 regions had effects on several traits (2 to 5), while 32 affected only one trait. The Table 3. QTL identified for milk mineral and citrate content on Bos taurus (BTA) autosomes 1 to 10 (when a QTL region affected multiple traits, the most significantly affected trait is indicated in bold).  Table 5). The most significant QTL, located on BTA1 and BTA20, explained more than 23% of the genetic variance of P and citrate, respectively. Mbp; each CI contained between 1 and 1004 variants with significant effects (99.3 on average). While only ca. 6% of the tested variants were included on SNP chips, these chip variants (mainly HD) were disproportionately represented among the variants ranked in the top 10 in QTL peaks, and accounted for nearly 43% of the variants ranked first in the peaks, i.e. "top 1" variants (  www.nature.com/scientificreports/ ants in the CI of the QTL) of variants were intergenic, while 60.5% (top 1) to 65.7% (top 10) of variants were located in genes, including upstream and downstream regions ( Table 6). In all QTL, we therefore identified at least one positional candidate gene (1 to 20 per QTL, 5.3 on average). In total, 271 different genes were found within the CIs of the 83 QTL. In the four QTL regions with the most significant effects, the top-ranked variants were located in the genes SLC37A1 (BTA1), ANKH (BTA20), SEL1L3 (BTA6), and PPARA (BTA5); and the topranked variants in the remaining QTL were located in dozens of other genes (Tables 3 and 4). The majority of the variants located in genes were intronic, and were only rarely found in coding regions: only 1%, approximately, of variants were non-synonymous, i.e. 84 variants within the CIs of the QTL, and 55, 24, 4, and 1 among the top 100, top 50, top 10 and top 1 variants, respectively ( Table 6). Out of these non-synonymous variants located in coding regions of genes, 4 are serious candidates because they were found among the 10 most significant variants of the QTL peaks: G446S in SLC26A4 (ranked 2nd for the QTL found on BTA4 at 48.7Mbp for Ca), V298M in ENSBTAG00000050954 (ranked 10th for the QTL found on BTA7 at 40.3Mbp for P), F267V in ARHGAP39 (ranked 5th for the QTL found on BTA14 at 0.4Mbp for K and Citrate) and F257Y in GHR (ranked 1st for the QTL found on BTA20 at 31.9Mbp for P, Ca and Mg). We also found a relatively high proportion of variants  www.nature.com/scientificreports/ located in non-coding RNA (7.9% of all variants in the CIs of the QTL and 13.6% of top 1 variants). As presented in Table 6, these variants were mostly located in long non-coding RNA (lncRNA), less frequently in micro RNA (miRNA) and small nuclear RNA (snRNA), and very rarely in small nucleolar RNA (small Cajal body-specific RNA, scaRNA), and transfer RNA (tRNA). We then completed annotation of the QTL (1) identifying putative regulatory SNPs (rSNPs) located within transcription factor binding sites (TFBSs) according to the procedure described in the Methods section, and (2) using transcriptomic data available on the Cattle Gene Atlas website, http:// cattl egene atlas. roslin. ed. ac. uk/.
We then assessed the degree to which these positional candidate genes and TFs demonstrated tissue-specific expression. Of the 271 genes and 438 TFs associated with milk-mineral QTL, 203 and 160, respectively, were present in the Cattle Gene Atlas, which contains expression data from 91 different tissues or cell lines. Using the model described in the Methods section, we assessed the overexpression or tissue specificity of these genes by evaluating the t-statistic and the P value, Pt, associated with the effect of the tissue under consideration (see Supplementary Tables S2 and S3). Among the 203 positional candidate genes and 160 TFs with a putative regulatory effect on these genes, we calculated the number that were overexpressed in each tissue type (Pt < 10 −4 ). The profiles obtained for both categories of genes, presented in Fig. 4, were quite similar. The tissues or cell types in which we found the highest number of genes associated with milk mineral composition were white blood cells (51 candidate genes and 59 TFs) and mammary gland (47 candidate genes and 45 TFs). In each tissue, we identified the most-specific genes by retaining the top 10% with respect to t-statistic values, i.e. 20 candidate genes and 16 TFs. Among the candidate genes that were most specific to mammary gland, we found the genes encoding the main milk proteins (CSN1S1, CSN1S2, CSN2, CSN3, and PAEP), and SLC37A1 and ANKH, which are located within the QTL that had the most significant effects in the present study. The ASCL2 TF, which potentially regulates five genes located in the CIs of the QTL, was one of the most mammary-gland-specific TFs. Among genes specific to white blood cells, we found COTL1, MKL1, and FOXP1. SLC37A1 was not among the 20 mostspecific genes but it was ranked 22nd. In the upstream or intronic regions of these four genes, we identified the top-ranked variant of four different QTL, located on BTA18, BTA5, BTA22, and BTA1, respectively. Furthermore, www.nature.com/scientificreports/ we identified 27 rSNPs for which both TF and target gene were highly specific to the mammary gland; these combinations were composed of 26 unique TFs and 15 unique target genes (see Supplementary Table S4).
Focus on the SLC37A1 and ANKH gene regions. As mentioned above, two QTL-located on BTA1 (142.8 Mbp) and BTA20 (58.2 Mbp) in the vicinity of the SLC37A1 and ANKH genes, respectively-had very strong effects on milk mineral and citrate content. The SLC37A1 region affected P, K, Mg, and Na levels while the ANKH region had effects on citrate, Mg, K, and Ca (Fig. 2). These two regions alone affected all six milk components analyzed in this study. They explained a high proportion of the genetic variance in P (23.3%), citrate (23.1%), Mg (17.9%, i.e. 7.3% for SLC37A1 and 10.6% for ANKH), and K (15.4%, i.e. 13.6% for SLC37A1 and 1.8% for ANKH), but had much more moderate effects on Na (2.6%) and Ca (2.2%).
In the SLC37A1 region, the variant with the most significant effect was located in an intron; it was an imputed variant (R 2 = 0.85) for P and Mg (rs109459130 at 142,834,737 bp) and a chip variant (HD SNP) for K and Na (rs109717634 at 142,835,551 bp). For both variants, which were located 814 bp apart, the most frequent allele (0.58 and 0.61, respectively) increased the amount of all minerals affected by this region.
In the ANKH region, we identified three different top 1 variants, depending on the trait. For K, the top variant was rs134021638 (at 58,386,888 bp, imputed with R 2 = 0.92), located in an intronic region of ANKH; for Ca, Mg and citrate, the top variants were located in an lncRNA ENSBTAG00000048498 (rs110048176 at 58,189,663 bp, HD SNP for Ca and Mg and rs109956167 at 58,204,929 bp, imputed variant with R 2 = 0.90 for citrate). ENSBTAG00000048498 spans from 58,186,301 to 58,224,434 bp and is located around 83 kbp upstream of the ANKH gene (58,307,527-58,477,497 bp). In all cases, the alleles responsible for an increase in citrate and mineral content were not the most common, with a MAF of 0.05 for the ANKH variant (rs134021638) and 0.17 for the two other variants.
In both regions, we examined the top-ranked variant for the most-affected traits, i.e. rs109459130 and rs109956167, and tested the interaction effects of their genotypes on milk mineral and citrate content. The results, presented in Fig. 5, revealed no significant interaction for any of the traits analyzed (P ≥ 0.02), suggesting that the effects of the two regions on milk composition were additive.

Discussion
To the best of our knowledge, this study is unique because it is the first GWAS of imputed whole-genome sequences based on the most-recent bovine reference genome with such a large population of cattle (19,586 cows); and it is the first attempt to investigate milk mineral content with a sequence-based GWAS, which here assessed a very large panel of genomic polymorphisms (12.9 million).
Using this approach, we identified 83 QTL that explained a substantial part of the genetic variance (up to 42.2%) for mineral and citrate content in cows' milk; in each QTL region, we then identified functional candidate genes and variants. Our results build on those of two previous studies that used GWAS to investigate the mineral composition of milk-based on mineral content measured with reference methods and HD 777 k SNP genotypes-in small populations of Holstein (371 9 and 444 12 ) or Jersey (321 9 ) cows. The study of Buitenhuis et al. 9 was dedicated to mineral composition while Kemper et al. 12 attempted to identify QTL that overlapped between milk production and composition traits. Only two QTL were shared between the two studies: a region located on BTA1 that affected P, for which both studies identified SLC37A1 as the best functional candidate gene, and a region located near the DGAT1 gene on BTA14 that affected Ca and P. In our study, we also identified these two regions, together with dozens of other regions located throughout the bovine genome.
By considering a very large dataset (more than 1 million test-day records of 126,873 cows), we estimated heritability values for mineral and citrate composition to be moderate to high, i.e. higher 5,7,8,10 than or similar 6,9 to those previously reported in the literature. Furthermore, we found strong genetic correlations among Ca, P, and Mg and among Ca, Mg, and citrate, which was consistent with the studies of Toffanin et al. 5 and Denholm et al. 8 . Among minerals, Na had both the lowest heritability value (0.32 vs 0.48-0.56 for the other minerals) www.nature.com/scientificreports/ and the smallest percentage of genetic variance explained by its associated QTL (19.8% vs 33.1-42.2% for the other minerals). Ca, P, and Mg shared more QTL than any other group of minerals: 7, 6, and 6 QTL were shared between Ca and P, Ca and Mg, and P and Mg, respectively, while 4 QTL had effects on all three. In contrast, despite the high degree of genetic correlation between Ca, Mg, and citrate, only 2 QTL were shared among these three traits; however, one of these was a QTL located on BTA20 (~ 58.2 Mbp) that explained 2.2%, 10.6%, and 23.1% of the genetic variance of Ca, Mg, and citrate, respectively. Overall, estimates of genetic correlations were consistent with QTL results, which probably reflects the common biological pathway of these minerals in milk. In colloidal form, Ca, P, and Mg are associated with caseins in the micelles while in the soluble fraction, Ca and Mg are associated with citrate. Further studies, using random regression models fitted across lactation, could investigate the pattern of genetic relationships between the different milk minerals during lactation and thus provide a better understanding of the underlying biological mechanisms. The resolution of our study was high enough to identify a single or a few positional candidate genes in most of the 83 QTL we identified. In each of these regions, though, the variant with the most significant effect was not necessarily located in a gene. As an example, we detected the overrepresentation of chip SNPs at the top of the QTL peaks (42.7% of the top 1 variants), and the majority of these SNPs were located in intergenic regions. Chip SNPs are directly genotyped or have a higher imputation accuracy than the surrounding variants, which probably enabled more precise estimation of their effects. To identify the best candidate variant(s) for each QTL, we therefore considered not only the variant with the most significant effect, but the set of variants ranked at the top of the peak that were located in or close to genes. In most cases, we found that these variants were located in non-coding regions of genes; for example, 15.3% and 12.1% of all top 10 variants were located in upstream regions of genes or in lncRNA, respectively, i.e. in putative regulatory regions. To further refine our results, we consulted in silico annotations of rSNPs in upstream regions of genes as well as existing knowledge regarding genomic regions that are transcribed into non-coding RNA. Moreover, to support the putative roles of these variants, we also considered the tissue specificity of candidate genes and transcription factors. Unfortunately, the expression dataset, which was based on Ensembl release 94, did not contain all of the genes located in QTL regions. Here we present some examples of QTL in which the putative causal mutation was located in a regulatory region, with particular attention to functional candidate genes previously associated with milk composition and the QTL with the most significant effects in this study.
GPAT4 (glycerol-3-phosphate acyltransferase 4) was the best candidate gene for the QTL on BTA 27 with effects on Mg. Although it was not the most notable QTL in terms of its effect on minerals, this region, and this gene in particular, have been highlighted by previous studies as affecting milk composition (fat, protein, and lactose content) 19,[22][23][24] . In these studies, five variants in linkage disequilibrium-four in the upstream region and one in the 5′ UTR region of GPAT4-were highlighted as the best candidate causal variants. In our study, these variants were ranked in the top 5 in the GWAS peak; the variants ranked 3rd (rs209479876) and 5th (rs209855549) alter the WRKY48 and TWI transcription factor binding sites (TFBSs), respectively, and are more likely to be the causal variants. Daetwyler et al. 23 also highlighted rs209479876 as the best causal variant in this region because of the high probability that it overlaps a TFBS. We also confirmed that GPAT4 was overexpressed in the mammary glands (P = 3.10 −12 ) compared to 90 other tissues or cell types, but expression data were not available for the TFs associated with this gene.
For the QTL identified on BTA11 at ~ 103.2 Mbp with effects on P and citrate, the top two variants in the peak for P (the most-affected trait) were intergenic. Instead, the variants ranked 3rd to 7th were located in the upstream region of the PAEP (progestogen-associated endometrial protein) gene; the 5th-ranked variant, rs110710904, probably alters a binding site of the Macho-1 transcription factor (P = 0.02). PAEP, which is one of the most mammary-gland-specific genes (Pt = 1.8.10 −37 ), encodes β-lactoglobulin, the most abundant whey protein in cow milk. Two non-synonymous variants in this gene, rs109625649 and rs110066229, were previously highlighted as the causal mutations underlying β-lactoglobulin concentration in milk 25 . In our study, these were respectively ranked 157th and 171st in the peak, i.e. far below the top-ranked variants upstream. These results corroborate previous reports that these two putatively causative missense mutations did not explain all the effects of the region on milk composition 19,20 and highlight an rSNP as a likely causative variant.
BRI3BP (BRI3 binding protein), located on BTA 17 at ~ 50.8 Mbp, has been previously associated with de novo short chain fatty acid synthesis in bovine milk 26,27 . In our study, this region appeared to have effects on the K content of milk, and we found that BRI3BP was significantly overexpressed in mammary gland compared to the 90 other tissues investigated (P = 3.9.10 −8 ). The first seven variants in the peak were located in an intronic region of BRI3BP. The 8th (rs477456528) and 9th (rs440703666) variants, instead, were located in the upstream region of the gene and rs440703666 probably alter the binding site of the transcription factor SPL8. This variant thus represents a very attractive functional candidate in this region.
Five of the six traits analyzed in this study were affected by a QTL region at ~ 116.4 Mbp on BTA5. Here, a variant located at 116,438,773 bp (1000G_80994138) ranked 2nd for four traits. This variant was located in the upstream region of PPARA (peroxisome proliferator activated receptor alpha) and overlapped the binding sites for the transcription factors E2F4, RSC3, RSC30, TDA9, and TFDP1. In this QTL region, top 1 variants were located either in an intronic region of PPARA (citrate, Mg, and P) or in an lncRNA, LOC101903383 (Ca and Na). PPARA encodes the PPAR-α transcription factor, a key regulator of lipid metabolism belonging to the superfamily of PPAR hormone receptors 28 ; expression of this TF in mammary gland was previously found to be associated with milk fatty-acid composition in dairy cows 29,30 . Here, we did not detect any mammary-gland specificity of PPARA but we did observe that this gene was expressed in this tissue. It is therefore possible that polymorphisms in the binding sites of this TF or in the lncRNA LOC101903383, located 1.3 kbp upstream of PPARA , could regulate the expression of PPARA and be responsible for the strong effects of this region on milk composition. www.nature.com/scientificreports/ On BTA6 at ~ 45.3 Mbp, we identified a QTL with very strong significant effects on K and to a lesser extent on P, Na, and Ca. Within the confidence interval of this QTL, we found 80 different potential variants, of which the majority (69) were intergenic, 9 were located in an intronic region of SEL1L3 (SEL1L family member 3), and 2 were in the upstream region of SMIM20 (small integral membrane protein 20). However, when we examined only the top 10 variants for the different traits, the number of distinct variants shrank to 18 (9 intergenic, 7 in SEL1L3, and 2 in SMIM20). Of these, two (rs108972810 at 45,401,485 bp and rs136498639 at 45,401,570 bp) were located in the upstream region of SMIM20 and overlapped a TFBS. SEL1L3 was previously identified as a candidate gene in a QTL region with effects on bovine milk protein composition 19 . Instead, to the best of our knowledge, this study is the first to propose SMIM20 as a candidate gene for milk composition. However, the functional link between these genes and milk composition has yet to be established, as we found both genes to be endometrium-specific and underexpressed in mammary glands compared to the 90 other tissues investigated.
For milk mineral composition, the two best functional candidate genes highlighted by our analysis were SLC37A1 (solute carrier family 37, member A1) and ANKH (inorganic pyrophosphate transport regulator). These two genes were previously found to be overexpressed in mammary glands relative to 17 other types of tissue 31 , and here we found both among the top 10% of mammary-gland-specific genes (Pt = 3.4.10 −26 and 6.6.10 −16 , respectively). This suggests that the main function of these genes occurs in epithelial cells of this tissue. SLC37A1 and ANKH both encode transmembrane proteins involved in ion transport and have been found to play a role in inorganic anion transport 20 . Variants located in both genes explained a large degree of the genetic variation in milk mineral content. Earlier studies also linked SLC37A1 with mineral content, in particular that of P 9,12 , while both genes have been implicated in determining milk protein composition in Holstein, Montbéliarde, and Normande cows 19 and cheese-making traits in the same Montbéliarde cows we analyzed here 20 . These results are consistent with what is known about the association of minerals with casein molecules in micelles; milk mineral composition is strongly related to milk protein composition 2 and therefore to cheese-making traits 3,4 .
In SLC37A1, 7, 31, 9, and 7 variants were located in the confidence intervals of QTL for P, K, Mg, and Na, respectively. These variants (32 of which were unique) were overwhelmingly located in introns of the gene, in a 12.3-kbp-region from 142,826,156 to 142,838,477 bp. Among these 32 variants, we were not able to distinguish the best candidate based on annotation, but two variants, 814 bp apart, were particularly highly ranked for all traits. Specifically, rs109459130, at 142,834,737 bp, was ranked 1st, 2nd, 1st, and 3rd for P, K, Mg, and Na, respectively, while rs109717634, at 142,835,551 bp, was ranked 2nd, 1st, 3rd, and 1st for P, K, Mg, and Na, respectively. Of these, the first (rs109459130) appeared to be the better candidate because it was top-ranked for the most affected trait, P, and because it was imputed (R 2 = 0.85), while the second variant was an HD SNP (R 2 = 0.997).
In the region of the ANKH gene, we found 81, 65, 69, and 21 variants in the confidence intervals of QTL for citrate, Mg, K, and Ca, respectively. These represented 82 unique variants, all located either in intronic regions of the ANKH gene (59 variants between 58,344,839 and 58,388,849 bp) or in the lncRNA ENSBTAG00000048498 (23 variants between 58,185,895 and 58,212,187 bp). The same variant was ranked first for both Mg and Ca. Two top-1 variants were located in the lncRNA ENSBTAG00000048498-rs109956167 (at 58,204,929 bp) for citrate and rs110048176 (at 58,189,663 bp) for Mg and Ca-while the top-ranked variant for K was located in ANKH (rs134021638 at 58,386,888 bp). We propose rs109956167 as the best candidate causal variant because i) it was ranked 1st for the most-affected trait (citrate), ii) it was imputed, in contrast to rs110048176 which was from the HD chip, and iii) it was the only variant present in the top 10 for all traits affected by this region (4th for Mg, 6th for Ca, and 9th for K). We hypothesize that the lncRNA ENSBTAG00000048498, which is 83 kbp downstream of the closest gene, ANKH, may affect milk mineral content by regulating the expression of ANKH and thus the amount of protein available for ion transport.
When we examined the best candidate variants in the regions of SLC37A1 (rs109459130) and ANKH (rs109956167), we found that alleles of these genes did not have significant interactions with respect to the six milk composition traits analyzed in this study and that, combined together, these two regions explained a large part of the genetic variance in milk mineral content. The allele that increased mineral content was the most frequent allele of SLC37A1 (allele A, frequency = 0.58), while it was the least frequent allele in the ANKH region (allele C, frequency = 0.17).
Compared to previous studies, our GWAS-based investigation of milk mineral composition was conducted at the whole-genome level, after imputation with the most recent reference genome and using data from a large population of animals. This approach led to the identification of a large number of genomic regions that explain a great deal of the genetic variance of the traits studied here. Moreover, post-GWAS investigations conducted using different annotation datasets enabled the identification of candidate causal genes and the prioritization of candidate variants in these genomic regions. In most situations, exonic variants were not proposed as the best candidates, although they cannot be excluded in four QTL regions, including GHR. In contrast, the best candidate variants often were putative regulatory variants. Although the functional causative effect of the candidate variants remains to be demonstrated, this study highlights a 'short list' of candidate variants, in particular rs109459130 (SLC37A1) and rs109956167 (ANKH), whose favorable alleles can be feasibly selected to increase the mineral content (Ca, P, Mg, and K) of milk and therefore improve both its nutritional and cheese-making qualities in Montbéliarde cows.

Methods
Ethics statement. Milk samples were analyzed with MIR spectrometry during routine milk recording in commercial herds of Montbéliarde cows in the Franche-Comté region (France). We did not perform any experiments on animals and no ethical approval was required. www.nature.com/scientificreports/ Animals and phenotypes. The original dataset was generated by the From'MIR project and is described in detail by Sanchez et al. 11,20 . It comprised 6,670,769 mid-infrared (MIR) spectra of milk samples from 410,622 Montbéliarde cows. The concentrations of five milk minerals (Ca, P, Mg, K, and Na) and citrate were predicted from MIR spectra using equations developed by the Optimir project 14,15 (Table 1). As previously described 11,20 , to ensure that the dataset was homogeneous (and less computationally demanding), we retained only the firstlactation records with at least seven test-day records per cow for estimating variance components (1,100,238 test-day records from 126,873 cows 11 ). To estimate environmental effects used to derive the phenotypes included in GWAS, at least three test-day records per cow were required, corresponding to a dataset of 1,442,371 test-day records from 189,817 cows 20 .
Estimation of genetic parameters. First lactation data were analyzed using bivariate repeatability animal models applied to all pairs of traits. (Co)variance components were estimated using the AI-REML algorithm as implemented in Wombat software 32

GWAS.
We performed single-trait association analyses between all 12,907,802 polymorphic variants and each of six milk mineral and citrate traits, described in Table 1. All association analyses were performed using the mlma option of GCTA software (version 1.24), which applies a mixed linear model that includes the variant to be tested 39 : where yd is the vector of so-called yield deviations, i.e. test-day records adjusted for non-genetic effects with the mixed linear model (1) using Genekit software 40 and averaged per cow; m is the overall mean; b v is the additive fixed effect of the variant to be tested for association; x v is the vector of imputed allele dosages, ranging from 0 to 2; u ∼ N 0, Gσ 2 u is the vector of random polygenic effects, with G the genomic relationship matrix (GRM) calculated using the HD SNP genotypes (which offer both high density and accuracy of imputation), and σ 2 u the polygenic variance, estimated based on the null model yd = 1m + u + e and then assumed as known while testing for the association between each variant and the trait of interest; and ε ∼ N 0, Iσ 2 ε is the vector of random residual effects, with I the identity matrix and σ 2 the residual variance. Association was tested using a t-statistic calculated by dividing the variant effect estimate by its standard error.
In order to correct for multiple testing, the Bonferroni correction was applied to take into account all 12.9 million independent tests. The 5% genome-wide threshold of significance therefore corresponded to a nominal P-value of 4 × 10 −9 (-log 10 (P) = 8.4) per test. When a given trait was significantly affected by multiple variants, variants that were located less than 2 million base-pairs (Mbp) apart were grouped together to define QTL, www.nature.com/scientificreports/ considering the variants belonged to the same QTL region. The bounds of the confidence intervals (CIs) of each region were then determined based on the positions of variants that were included in the upper third of the QTL peak. In regions where multiple neighboring QTL were identified (BTA1 and BTA20), conditional analyses were carried out using the cojo option of GCTA 39 in order to conclude if multiple significant variants in a genomic region were due to LD with the same causal mutation or to the presence of multiple causal mutations. Association analyses were performed by including in the model the most significant variant as a fixed effect and by testing all variants in these neighboring QTL that were not in strong LD with the conditional variant (r 2 < 0.9).
The ability of genetic variants to alter transcription factor binding sites (TFBSs) was predicted with a custom script that used TFBS models from the JASPAR (JASPAR CORE 2018 collection 41 ), HOCOMOCO (version v10 42 ), and TRANSFAC (version v3.2 public 43 ) databases. These databases contain curated sets of transcription factor binding models represented as Position Weight Matrices (PWM), which are derived from published collections of experimentally defined eukaryote TFBSs. Only vertebrate PWMs were downloaded for use in our study.
Gene overexpression or specificity in different tissues was determined using gene expression patterns of 24,616 genes (Ensembl release 94) available in the Cattle Gene Atlas, which contains 723 RNA-seq datasets representing 91 tissues and cell types, classified into 17 biological categories (http:// cattl egene atlas. roslin. ed. ac. uk/). To assess the expression specificity of each gene in a given type of tissue (by excluding tissues in the same biological category), we applied the following linear model as described in Fang et al. 44 : where ye is the vector of expression level in the tissues, assessed by the scaled log 2 FPKM (Fragments Per Kilobase per Million mapped reads); me is the overall mean; x t is the vector of the variable with value 1 for samples of the tested tissue and − 1 for samples outside the same category; b t is the corresponding tissue effect; z is the incidence matrix related to the corresponding covariables effects c , including age, sex and study effects; ee is the residual effect. Model (3) was implemented adapting R scripts available on the Cattle Gene Atlas website. For each gene, a t-statistic was computed by dividing the tissue effect by its standard error. A gene was considered to be overexpressed in a tissue if the probability associated with the t-statistic (Pt) was lower than 10 −4 . To determine the tissue specificity of genes, we ranked the genes in each type of tissue by their t-statistics; the top 10% were considered to be tissue-specific.

SLC37A1-ANKH genotype interactions.
We tested putative interaction effects between the genes SLC37A1 and ANKH on milk mineral and citrate content with the following mixed linear model: where yd as defined in (2); mi is the overall mean; g SLC37A1 and g ANKH are the fixed effects of the genotypes of the best candidate variant in the SLC37A1 and ANKH regions, respectively; g SLC37A1 x g ANKH represents the interaction between the two genotypes;M, N , and O are incidence matrices related to the individual effects of the SLC37A1 and ANKH genotypes and their interaction, respectively; s is the vector of random sire effects and P the corresponding incidence matrix; and ei is the vector of random residual effects. All effects were tested using t-statistics computed in the MIXED procedure of SAS software.

Data availability
The data (genotypes and phenotypes) that enabled the findings of this study were made available by UMOTEST, CEL25-90, and GENIATEST. However, restrictions apply to the availability of these commercial data: they were used under license for the current study, and are not publicly available. www.nature.com/scientificreports/ acknowledge the breeders who participated in the From'MIR project; colleagues from the Conseil-Elevage 25-90 who coordinated data collection; and the contribution of the 1000 Bull Genomes consortium.