Genome-wide association identifies methane production level relation to genetic control of digestive tract development in dairy cows

The global temperatures are increasing. This increase is partly due to methane (CH4) production from ruminants, including dairy cattle. Recent studies on dairy cattle have revealed the existence of a heritable variation in CH4 production that enables mitigation strategies based on selective breeding. We have exploited the available heritable variation to study the genetic architecture of CH4 production and detected genomic regions affecting CH4 production. Although the detected regions explained only a small proportion of the heritable variance, we showed that potential QTL regions affecting CH4 production were located within QTLs related to feed efficiency, milk-related traits, body size and health status. Five candidate genes were found: CYP51A1 on BTA 4, PPP1R16B on BTA 13, and NTHL1, TSC2, and PKD1 on BTA 25. These candidate genes were involved in a number of metabolic processes that are possibly related to CH4 production. One of the most promising candidate genes (PKD1) was related to the development of the digestive tract. The results indicate that CH4 production is a highly polygenic trait.

performed GWAS on different methane phenotypes in beef cattle and validated the results on dairy cattle, whereas Van Engelen 15 performed GWAS on Holstein cows using phenotypes predicted from milk and breath analyses .
Only lately the technology for measuring CH 4 production both on a large scale and on individual animals has become available. Among others, high throughput measuring techniques are based on breath analyses, since approx. 90% of enteric CH 4 is released during eructation events and by breathing 5 . Nonetheless, collection of such phenotypic records is still challenging and data sets are limited. This technique is promising as it is non-invasive, based on infra-red analyses of breath samples and measurements can easily be taken during milking or feeding 13,[16][17][18][19] . The most common application of these techniques is in combination with automatic milking systems (AMS) enabling a relatively long measurement period (duration of the milking); additionally, several observations per cow per day may be collected from a large number of animals. This type of measurement set up enables collection of large volumes of data, which is a prerequisite for genetic analyses.
To our knowledge, to date no reports are available on genome-wide association analyses based on direct measurements of daily CH 4 production in dairy cattle. Therefore, the objective of this study was to undertake a genome-wide association study using CH 4 phenotypes measured by breath analyzers to unravel the genomic regions controlling CH 4 production from dairy cattle.

Results
Detected SNPs. The genetic variance for daily CH 4 production was estimated independently for each level of 2 nd order Legendre polynomials. As the first parameter explains most of the variation, only SNP detected with it will be presented and discussed in this study. The GWAS performed on daily CH 4 production indicated 50 SNPs with BF > 10 associated with CH 4 production in dairy cattle (Fig. 1). Those SNPs were located on 18 different BTA (Tables 1 and 2). From detected SNPs, three had a BF above 30, which is defined as "very strong" association 20 . On BTA 1,4,9,13 and 25 analysis in Haploview 21 indicated six potential candidate QTL regions (Fig. 2). For those regions and two single SNP associations on BTA 9 and 20, a total of 130 candidate genes (protein-coding and non-coding RNA) were located with BIOMART 22 (Table 1).
The three SNP detected for raw phenotypes with BF > 30 and six possible candidate QTL regions explained 0.032% of the total genetic variance (Table 1), whereas the remaining SNPs with 10 < BF < 30 explained 0.122% of this variance (Table 2). Overall this gives a very low result of 0.154% of the total genetic variance explained by detected SNPs.
Bioinformatics analysis of detected regions. Out of 130 candidate genes for CH 4 production, 46 remained for a further GO Term analysis as known and non-ambiguous genes. For possible candidate genes, 428 different GO Terms were described: 82 cellular component terms, 251 biological process terms and 95 molecular function terms. Based on the GO Terms, five candidate genes were selected as the most promising: CYP51A1 on BTA 4, PPP1R16B on BTA 13, and NTHL1, TSC2, and PKD1 on BTA 25 (Table 3).
Based on Cow QTLdb 20 , 52 QTLs involved in production and reproduction traits were selected as potentially playing a role in daily CH 4 production in cows. Those QTLs were clustered into five groups: feed efficiency, milk related, body size and health status (see Table 4).

Discussion
To our knowledge this is the first GWAS on direct measurements of daily CH 4 production performed in dairy cattle. So far one GWAS on direct measurements of CH 4 production was performed in beef cattle with validation on dairy cattle 14 and for CH 4 intensity 15 . Another GWAS study on dairy cattle 23 used predicted CH 4 following the formula proposed by Dijkstra et al. 24 . Thus very little is still known on the actual genomic architecture of CH 4 production in dairy cattle. Our results provide more insight into the genomic architecture of CH 4 production thanks to the identification of genomic regions involved in the control of this trait and revealed genomic relationships between CH 4 production and other traits.
Methane emission may be expressed in several ways depending on the aim of a given study [25][26][27][28] . First of all, when the total CH 4 emitted by cows is of interest, the CH 4 production phenotype expressed in g/d or l/d may be used [16][17][18]28 . When the goal is to minimize the amount of CH 4 emitted from the supplied unit of feed (i.e. dry matter intake) in order to maximize feed conversion, the CH 4 yield 26 is the trait of interest (CH 4 produced per kg of dry matter intake). Another way of expressing emission is CH 4 intensity 26 , where produced CH 4 is expressed per unit of product (milk or meat). Similarly to the residual feed intake, CH 4 may be expressed as a difference between predicted and measured CH 4 emission (i.e. residual CH 4 emission) 14,26,29 . For our analyses we have decided to use the phenotype applied most widely in the literature and the least influenced by other traits not strictly related to CH 4 emission itself (e.g. dry matter intake, milk production, live weight). Another reason is related with the fact that when calculating our CH 4 production phenotype, we account for body weight, physiological status and milk production as described in Pszczola et al. 13 following Madsen et al. 28 . Therefore, calculations of CH 4 yield or CH 4 intensity may have resulted in some potential overestimation of CH 4 emissions due to double counting.

Selected candidate regions.
Based on the bioinformatics analysis of detected regions for CH 4 production in dairy cattle, five most promising candidate genes were selected based on GO Term analysis ( Table 3). The first of them, CYP51A1 (BTA4: 9,306,414-9,323,252) located within the region of a candidate QTL on BTA 4, is a member of the cytochrome P450 family 51 subfamily A. Based on GO Terms this gene is involved in two biological processes that could potentially affect CH 4 production in dairy cattle. Those GO Terms are the lipid metabolic process and the steroid metabolic process 30,31 , which are confirmed by CYP51A1 and its family members being involved in the synthesis of cholesterol, steroids and other lipids 32 . Lipids (i.e. fatty acids) were previously reported to be related to CH 4 production, including several studies that used fatty acids present in milk to predict CH 4 production 24,33-39 .
The second gene, namely PPP1R16B (BTA13: 68,258,627-68,366,080), a protein phosphatase 1 regulatory subunit 16B, is located within the candidate QTL region on BTA 13. For this gene two biological processes were found in GO Terms analysis that could link it to CH 4 production. One of them, the establishment of the endothelial barrier, e.g. in the intestine, is defined as "… specific and selective control over the passage of water and solutes, thus allowing formation and maintenance of compartments that differ in fluid and solute composition" 40 . The other, the positive regulation of blood vessel endothelial cells 30,40,41 . The biological processes involving PPP1R16B suggest that this gene could affect the digestive process by controlling the passage of water within the intestine and providing blood vessels to the endothelial cells of the intestine. Being part of such processes, PPP1R16B could affect efficient use of feed and in this way control the amount of by-products (including CH 4 ) produced during the process of digestion.
The three other genes were all located within the largest detected candidate QTL region on BTA 25, comprising of four SNPs. The first of the genes, NTHL1, nth like DNA glycosylase 1, is located at 1,590,252-1,595,934 bp. Its GO Term is the metabolic process, which includes protein synthesis and gradation 31,40 . The process involving this gene suggests that NTHL1 may affect digestive processes and consequently also a number of their by-products, e.g. CH 4 , being released post feeding.
The second of the above-mentioned genes, TSC2 (BTA25:1,596,730-1,626,967), tuberous sclerosis 2, is the only candidate gene with a GO Term related to a cellular component, in that case lysosome 31,40 . Moreover, TSC2 has been very well studied in humans, as its mutation causes tuberous sclerosis and its product is believed to be a tumor suppressor 32 . In the case of dairy cattle the location of the TSC2 gene in lysosome, which contains hydrolytic enzymes and takes part in energy metabolism, suggest that it could be involved in digestion processes and degradation of metabolites, this may affect CH 4 produced by a cow. The last of the candidate genes on BTA 25 is PKD1, encoding polycystein 1, a transient receptor potentially involved in channel interacting (1,627,978-1,666,088 bp). Three biological processes were assigned to it in GO Term analysis, i.e. blood vessel development 40 , nitrogen compound metabolic process 31,40,42,43 and the digestive tract development 40 . All three GO Terms indicate that PKD1 is involved in digestion processes either directly by affecting the development of the digestive tract, or possibly also blood vessels around it as well as metabolic processes of a nitrogen compound. All those functions, in general, indicate that PKD1 might be involved in emissions of greenhouse gases, not only CH 4 but also nitrogen related.
To confirm that the candidate genes detected in this study are the actual causative mutations affecting CH 4 further functional studies such as gene expression or sequencing of the region of highest interest are required. However, this was outside the scope of this paper. Potential Quantitative Trait Loci. Next to the search for candidate genes, we have also looked for previously detected QTLs for traits potentially related to CH 4 production. Those QTLs were clustered in four groups of similar traits: feed efficiency, milk related, body size and health status (see Table 4). It has to be noted that in this study the estimation of CH 4 production included an equation, in which fat-protein-corrected milk, live weight and pregnancy status are taken into account, and some of the found relationships may be present due to this fact. Alternatively, CH 4 concentration (expressed in ppm) could be used for the association study. At this moment, however, CH 4 production is the most widely reported trait in genetic studies regarding reduction of enteric CH 4 emissions. For this reason we restricted our study to this trait.
Firstly, the comparison indicated an overlap between the genomic regions controlling the CH 4 production and QTLs for feed efficiency traits (e.g. residual feed intake, feed conversion ratio, average daily gain; Table 4). The relationship between diet composition and CH 4 production 44 or the effect of additives reducing emission [45][46][47][48][49] or dry matter intake 50-52 is well known. It is anticipated that increased CH 4 production leads to the loss of energy provided with feed 5,6 , and therefore more efficient cows should produce less CH 4 . Jentsch et al. 53 showed that greater feed ingestion results in higher total CH 4 production; however, CH 4 production per kg dry matter intake decreases. Pickering et al. 54 also reported the presence of a correlation between CH 4 production and intake, while studies of [55][56][57] showed that selection for cows with a low residual intake (efficient ones) results in lower CH 4 production. Unfortunately, in this study no data was available on individual feed intake of cows and therefore we were not able to verify this statement empirically. Secondly, regions controlling CH 4 production were also overlapping with QTLs for traits describing various aspects of milk production (e.g. milk yield, milk protein and fat yield, milk composition; Table 4). The relationship between milk composition and CH 4 production is particularly plausible because of common biochemical pathways between CH 4 , acetate and butyrate 58 . Furthermore, earlier studies showed that it is possible to use milk fatty acid composition to predict CH 4 production 24,33,35-39 .
Thirdly, it was found that height, chest depth and body weight of the cow were genetically controlled by the same regions as potential QTLs for the CH 4 production. Body characteristics such as body weight were earlier shown to be related to CH 4 production 52,59,60 . Heavier cows are usually bigger and have a larger rumen capacity and a lower passage rate 61 , which leads to greater CH 4 production 52 .
Finally, the QTLs detected previously for the health status of the cow (e.g. mastitis, somatic cell score, immunoglobulin G level) were also found in regions overlapping with SNPs detected in this study for CH 4 production. Thus reports on the relationship between the health status of the animal and the direct CH 4 production are limited. Zetouni    production and a very low positive genetic correlation with udder health 62 . Elliott- Martin et al. (1997), based on breath analyses, indicated that CH 4 could be used to diagnose ketosis. Moreover, the health status of the animal is known to affect other traits such as dry matter intake or production, and therefore is likely to affect CH 4 production. It is likely that a sick animal produces less methane due to a lower milk production; however, methane intensity (i.e. the amount of methane produced per kg of milk) would increase. Next to QTLs related to traits indicating the health status of the cow also QTLs indicating susceptibility to illness were found in the regions important to CH 4 production. Based on the several traits mentioned above that share the genetic background with CH 4 production, it may be suggested that some of the detected regions in this study have a pleiotropic effect. This knowledge is very beneficial especially in the case of production traits controlled by the same regions as CH 4 production (i.e. assumed to be genetically correlated), which could serve as indicator traits for enteric CH 4 production and eliminate difficult and time-consuming phenotyping. Our findings mostly match the study of Negussie et al. 63 , who reviewed literature on potential indirect traits for measuring CH 4 production. Further evaluation of genetic relationships between CH 4 and other traits is necessary to confirm relationships revealed by our study and before inclusion of CH 4 to the breeding program can be made.
Power of the experimental design. The Bayesian method selected to perform GWAS for CH 4 production allows for good distinctions between SNP with large and small effects on a trait, as in each iteration a different combination of SNPs is given a large effect. Thus detected SNPs give a valuable indication for the genomic regions potentially involved in CH 4 production in dairy cattle. This was confirmed also by bioinformatics post-analysis of detected regions with the functions of selected candidate genes and QTLs for other traits detected within those regions. However, the total genetic variance explained by significant SNPs was very low. This could be due to several possible reasons, i.e. (1) a low number of animals used in the study, (2) the accuracy of the collected phenotypes, and (3) the polygenic nature of the studied trait.
Firstly, it should be noted that the analyzed dataset was relatively small, and therefore the power of the GWAS design was too low to detect a majority of SNPs associated with CH 4 production. Taking into account the heritability of this trait at 0.27 13 , a higher number of genotyped animals would be needed to obtain a higher percentage of genetic variance explained by the detected SNP. Therefore, the analyses of a larger dataset (for both phenotypic and genomic data) may shed light on more specific SNPs with large effects However, generating a large data set by one project is difficult due to related costs (measuring and genotyping). Therefore, generating such a dataset by combining phenotypic observations and genotypes from various experiments could be a solution producing more reliable results in the future.
Secondly, to obtain reliable GWAS results reliable phenotypes are needed. In our study we used a technique that measures CH 4 at the AMS during milking. To verify the accuracy of the sensor used in this study we validated the used sensor against sensors used in Respiration Chambers (the standard CH 4 measuring technique). This comparison showed a high similarity between results generated by the two sensors when used in the AMS 64 . There are no studies comparing the performance of sensors used in the present study when installed in the Respiration Chamber. Several factors could lead to inaccuracies in the collected measurements such as occasional wind in the area of AMS or cows' head movement. These factors were not controlled in this study. To account for these arguably random effects we measured CH 4 for the individuals in the long period of time (i.e. resulting in multiple observations per animal). The average repeatability of the analyzed phenotype was 0.25 as reported in Pszczola et al. 13 .
Thirdly, the greater data set and increased accuracy of the measuring method could not have been enough to explain more genetic variation if the analyzed trait was highly polygenic. In previous studies using the same methodology, but larger data sets, only 0.83% of the genetic variance was explained by SNP in GWAS on litter size in pigs 65 and 9.5% in GWAS on teat number in pigs 66 . Based on the presented results it seems that CH 4 production is also a highly polygenic trait and many different regions are involved in its regulation. It might not be, therefore, possible to detect all of them using GWAS.
As CH 4 production turned out to be a very polygenic trait in application to breeding practice, it may be more advisable to use the genomic prediction approach without specifying particular SNPs as being more important than others (e.g. genomic BLUP). In fact, de Haas et al. 67 , Lassen et al. 68 and Wilson et al. 69 performed genomic prediction type analyses while searching for correlated traits. The biggest challenge for the performance of genomic prediction with sufficient, reasonable or high accuracy of the estimated genotypic values is to create an adequately large reference population, which is likely to require cooperation between several countries.

Conclusions
This study aimed at detecting genomic regions affecting CH 4 production in dairy cattle and showed that SNPs associated with the trait of interest may be detected. However, CH 4 data collection poses a challenge, leading to a lower power of the experimental design and prevented detection of a high number of SNPs with a large effect on CH 4 production. Consequently, only a small proportion of the genetic variance was explained by the SNPs. Nonetheless, the candidate QTL region on BTA 25, where three candidate genes were identified, may be considered as a genomic region regulating CH 4 production in dairy cattle. Furthermore, the comparison of the QTL regions affecting CH 4 production with previously reported QTLs indicated common genomic regions between CH 4 production and traits related to feed efficiency, milk related, body size and health status. The found candidate genes were also involved in a number of metabolic processes possibly related to CH 4 production. One of the most promising candidate genes (PKD1) was related to the development of the digestive tract being the environment inhabited by methanogens and the site for methane production. In general, all the evidence shows that CH 4 production is a polygenic trait.

Methods
All research was approved by the Local Ethical Committee for Experiments on Animals in Poznan, Poland (Decision Number: 64/2012) and performed in accordance with the "Act on the protection of animals used for scientific purpose" of the Republic of Poland, which complies with the European Union Legislation for the protection of animals used for scientific purposes.
Phenotypes. The observations on CH 4 production [g/d] used in this study were obtained from Pszczola et al. 13 , where all the detailed information on farms, measuring set-up and data processing can be found.
In short, animals available for this study were 287 Polish Holstein-Friesian cows kept on two commercial farms in Poland. This was a subset of 483 cows phenotyped for CH 4 production and analyzed in Pszczola et al. 13 , of which 287 were genotyped. The CH 4 production was measured repeatedly on Farm1 during two periods: from 2014/12/02 to 2016/02/03, and from 2016/06/01 to 2016/09/17, and on Farm2 from 2016/02/05 to 2016/03/14. Cows were milked repeatedly during the experiment, in total 25,872 CH 4 production observations were collected for the genotyped animals.
The CH 4 production was measured using a non-invasive Fourier Transform Infrared Spectroscopy breath analyzer (GASMET 4030; Gasmet Technologies Oy, Helsinki, Finland) during milking in AMS (Lely Astronaut A4). Concentrations of CH 4 and CO 2 measured during milking were converted to daily CH 4 production in grams per day [g/d] following Madsen et al. 28 and Pedersen et al. 70 . This calculation took into account the concentrations of CH 4 and CO 2 , fat-protein corrected milk, live weight and duration of the pregnancy. Multiple daily outputs per cow were corrected for the diurnal variation in CH 4 and averaged per cow per day. The genotyped SNPs were processed with following quality control checks: (1) being in Hard-Weinberg equilibrium, (2) having the minor allele frequency above 0.05, (3) not being monomorphic, and (4) having a call rate of above 0.95. Six cows were removed as they had the call rate below 0.9. After quality control and removing SNPs located on sex chromosomes and chromosome 0 (unassigned), 39,680 SNPs remained for the genome-wide association analysis.

Genotypes.
Genome-wide association. To identify regions of the genome affecting CH 4 production, a multi-SNP genome-wide association analysis was performed with the application of the Bayesian Variable Selection method 71 . The method allows for a simultaneous estimation of the effects of all markers used in the analysis. The analysis was performed with the Bayz software 72 on daily CH 4 production using the model developed by Pszczola et al. 13  where CH 4 stands for the daily CH 4 production levels of a cow; µ is an n-vector equal to the mean; Xb is the design matrix of fixed effects of year-week of measurement and cow's lactation number (levels 1 or 2+) fitted within the general lactation curve, which was modeled using 3 rd order Legendre polynomials; and e is an n-vector of random residual effects assumed to be normally distributed σ N(0, ) e 2 . The L k is a vector of individual random animal effect, which was modeled using 2 nd order Legendre polynomials. The mapping of marker effects is constructed as a hierarchical model on random animal effects 73 . Firstly, the model accounts for genetic variance only. Secondly, at the next level the model allows disentangling permanent environmental (Note: this accounted for repeated observations of daily CH 4 production per cow.) and genetic variances independently for each level of 2 nd order Legendre polynomials. Here the Z u is a matrix with dimensions n by p, with n being the number of genotypes and p being the number of SNP coded as 0, 1, 2 copies of a specific allele vector; β ijk is a p-vector with the random effects of markers; and ε ijk accounts for the permanent environmental effect assumed to be normally distributed σ ε N(0, ) 2 ijk . For the marker effect the Bernoulli distribution was applied: where for the first distribution it is assumed that the SNPs have a small effect (σ g 2 0 ); whereas in the second distribution the SNPs are assumed to have a large effect, which explains a large part of variance (σ g 2 1 ) of analyzed traits. In this study, a prior of π 1 = 0.001 was selected, thus on average only 1 in 1,000 SNPs was in the second distribution in each cycle. This resulted in only ~38 SNPs per cycle to have a large effect on the traits. The posterior means were calculated with 500k MCMC iterations with burn-in of 5k iterations to secure that all the SNPs were used 65,66,74 . Selecting a stringent prior provides a more precise distinction between SNPs with large and small effects on the trait 66,75 . If the SNP was not genotyped for a certain animal then Bayz assigned an average genotype to that position. Identification of significant SNPs. The Bayes Factor (BF) was calculated for each SNP to determine the significant associations: where π 1 and π 0 are the prior probabilities and p i is the posterior probability of the fraction of times the SNP was in the distribution with a large effect. Following the definitions of Kass and Raftery 20 , the SNPs with BF > 30 are described as a "very strong" association and with BF > 150 as "decisive". The variance explained by significant SNPs was estimated as a fraction of the total genetic variance explained by all SNPs.
To confirm the potential QTL regions, also the linkage disequilibrium (LD) measured by r 2 was estimated in Haploview 21 between the SNPs detected on one BTA and not further from each other than 500 kbp. The candidate gene search was performed with the BIOMART software available in Ensembl Bos Taurus UMD 3.1 32 by entering the position of a possible QTL region or one of the most significant SNPs with ±500 kbp. To limit the number of QTLs to the most promising as candidate genes for daily CH 4 production the BIOMART database was also used to study Gene Ontology Terms (GO Terms) of those QTLs. Furthermore, the Cow QTL database of the Animal Genome project 20 was used to find previously detected QTLs within the most promising regions detected here for daily CH 4 production. This was done analogically as for the candidate gene search, i.e. by entering the position of a possible QTL region or one of the most significant SNPs with ±500 kbp.