Introduction

As a potent greenhouse gas that contributes significantly to climate change, methane (CH4) emissions from ruminants pose a direct threat to the environment and sustainable agricultural production1. Methane emissions from ruminants also lead to a waste of part of the ingested feed energy2. In sheep, CH4 emissions result in approximately a 7% loss of gross energy intake3. In recent years, there has been growing interest in prediction models for CH4 emissions for several reasons. First, direct measurement of CH4 emissions from large herds or flocks of ruminants is technically difficult and time-consuming, and thus it can be done practically only with small numbers of animals in research settings4. Second, it is not practical or feasible to establish regional, national, or global inventories of CH4 emissions by direct measurement of CH4 emissions from individual animals. Third, statistical models can assist in describing and understanding the relationship among CH4 emissions, animal performance, and other factors (nutritional, microbial, and physiological).

A diverse set of prediction models has been developed to predict enteric CH4 emissions from sheep5 and cattle6,7. These prediction equations include predictor variables describing the dietary composition, feed intake, rumen fermentation profiles, and other animal traits such as body weight (BW), but not individual taxa (genera or species) of the rumen microbiota. These prediction models vary considerably in robustness and prediction accuracy8,9. The rumen microbiota is solely responsible for CH4 emissions from ruminants. Specifically, rumen bacteria, fungi, and protozoa digest and ferment the ingested feed and produce the substrates (methylamines, hydrogen, CO2, and formate) that are then used by the rumen methanogens to produce CH410. Thus, the rumen microbiota is directly responsible for the CH4 emitted by ruminants although it can be affected by several factors, including diet and host-related factors11. Indeed, several studies have reported strong associations or correlations of some bacteria, methanogens, or protozoa with or qualitatively indicative of CH4 production12,13. In this context, we hypothesized that the inclusion of individual taxa of the ruminal microbiota as predictor variables, in addition to animal and feed-related predictor variables, in CH4 prediction models would improve our ability to predict CH4 emissions. Methane prediction equations or models containing rumen microbial taxa as predictor variables may also help understand their roles and importance in enteric CH4 emissions. The objective of this study was to develop such prediction models using multiple statistical/machine learning (ML) methods and a dataset that contained both animal-related data and metataxonomic data of the rumen microbiota of a cohort of sheep that were not subjected to any dietary interventions.

Methods

Dataset

This study did not involve the use of any animals, and thus no approval of the Institutional Animal Care and Use Committee was required. The dataset (Supplementary Table 1) used in this study was generated in a previous study that compared the rumen microbiota and CH4 emissions among a group of New Zealand sheep that were not subjected to any dietary interventions13. The animal experiment and analyses were detailed in another report14. Briefly, 1225 sheep were fed a pelleted alfalfa diet (without any anti-methane inhibitors) twice daily at two times their maintenance requirement for energy (based on BW). CH4 emissions from 340 individual animals were measured with open-circuit respiration chambers twice in two independent measuring rounds (10–15 days apart) with each time lasting for two consecutive days14. CH4 production (g CH4/animal·d) and CH4 yield (g CH4/kg of dry matter intake, DMI) were calculated. Feed intake and BW of the 340 individual animals were also recorded. Rumen samples were collected from 118 of the 340 sheep, with 60 representing the highest and 58 representing the lowest CH4 yield sheep, immediately after each CH4 emission measurement (corresponding to 17–18 h after the last feeding) using the stomach tubing method. The two groups of sheep were balanced by BW (24–60 kg), sex, and breeds (indicated below). The rumen fermentation characteristics, including acetate:propionate (A:P) ratio, were analyzed using gas–liquid chromatography. We provided the density plots of the CH4 data (g/d and g/kg of DMI, Supplementary Fig. 1). Both density curves appeared to represent the mixture of the two groups, which is consistent with the data feature (high and low emission sheep) and with the patterns found in the previous study13.

Two rumen samples were collected (10 to 15 days apart) from each sheep and the composition and structure of the rumen microbiota were analyzed using metataxonomics sequencing the 16S rRNA genes of both bacteria and archaea, 18S rRNA gene of protozoa, and internal transcribed spacer 1 (ITS1) of fungi13. One sheep was excluded due to missing data, and 16 sheep were represented by only one sample that yielded complete metataxonomic data. This resulted in a cumulative 218 observations, with 101 sheep each contributing two observations and the remaining 16 sheep contributing one observation.

In this study, we retrieved the metataxonomic sequencing data of the sheep rumen samples from the EMBL database per the accession numbers (ERP003779 for Bacteria, ERP003773 for Archaea, ERP003772 for ciliate protozoa, and ERP003764 for anaerobic fungi). The sequences were then analyzed using DADA215. Briefly, the amplicon sequences were clustered into amplicon sequence variants (ASVs), which were then taxonomically assigned to individual taxa using specialty databases: the Silva database (version 138.1) for bacteria, archaea, and protozoa16; and the UNITE database for fungi17. The relative sequence abundance of each genus was calculated as the sum of the relative sequence abundance of all the ASVs assigned to the respective genus. The sequences of the ASVs that were classified to an unclassified genus (Uncl_genus) are listed in Supplementary Table 2. Only the genera that each had a relative sequence abundance reaching > 1% in all the samples were used as microbial variables. For an Uncl_genus, its higher and known taxon name was used to show its taxonomic lineage. In total, 330 microbial genera were obtained and used, including 308 genera of bacteria, 9 genera of archaea, 5 genera of protozoa, and 8 genera of fungi. The relative sequence abundance data of the microbial genera were log-transformed before the modeling process. In addition to the rumen microbiota data and CH4 emission data (CH4 production, g CH4/animal·d; CH4 yield, g CH4/kg of DMI), the dataset also contained the following five animal-related data of the individual sheep: DMI (kg/d), BW (kg), A:P ratio, sex, and breed (Coopworth, Corriedale, Finnish Landrace, Perendale, Romney, and Texel).

Linear mixed model

In our dataset, most animals (101 out of 118) had two records of CH4 emissions and other data (except for sex and breed). These repeated measurements from the same sheep were likely to be more akin to each other than to the measurements from a different sheep. Thus, the two records from the same sheep were expected to be correlated. Linear mixed-effects models were used to model the dependence of the data and account for correlations between the two repeated measurements. The general structure of the model was described as follows:

$${y}_{ij}={{\varvec{X}}}_{{\varvec{i}}}{\varvec{\beta}}+{\alpha }_{i}+{\varepsilon }_{ij}$$
(1)

where yij is the observed CH4 emissions from the ith animal and jth observation (i = 1, …, 118 and j = 1, 2), set as either g CH4/animal·d (referred to as ANIM-B models) or g CH4/kg of DMI (referred to as DMI-B models); Xi = (1, DMI, BW, A:P ratio, sex, breed, microbial variables) is the design vector for the fixed effects containing the values of the predictor variables for the ith animal and the “1” entry for the intercept (DMI was not included as a variable in developing the DMI-B models because it is a dependent variable in those models); β is the coefficients vector of the fixed effects describing the relationships between CH4 emissions and the predictor variables; αi is the random animal effects to account for intra-animal correlation between the two measurements; and εij is the random error. The αi was assumed to be independent and identically distributed with mean zero and variance \({{\sigma }_{\alpha }}^{2}\), and the εij errors were assumed to be independent and identically distributed with mean zero and variance \({\sigma }^{2}\). Further, it was assumed that random effects and errors were mutually independent.

Machine learning methods

The rumen microbiota is diverse, containing many hundreds to thousands of distinct species of microbes at varying relative sequence abundance. The inclusion of such a large number of microbial variables in prediction models can lead to overfitting. Additionally, relative abundance data of microbiota are usually on different scales, zero-inflated, sparse, and over-dispersed. Thus, proper selection of microbial predictor variables to be included in prediction models is crucial in model development. In the present study, we used four ML methods to select microbial predictor variables: random forest combined with boosting (RF-B) using the randomForest and gbm packages18, least absolute shrinkage and selection operator (LASSO) using the glmnet package19, generalized linear mixed model with LASSO (glmmLasso) using the glmmLasso package20, and smoothly clipped absolute deviation (SCAD) using the splmm package21. We opted for both LASSO and glmmLasso because the former does not consider intra-animal correlation, while the latter takes these correlations into account by utilizing the linear mixed-effect model described in Eq. (1). Each of the ML methods was evaluated for both prediction performance and fitting performance. Specifically, to evaluate prediction performance (based on 200 evaluations with random data splitting into testing data and training data (see section ‘Evaluation of prediction performance’ below), each ML method was used to make variable selection (selection for both animal and microbial predictor variables). Using the lmer function of the lme4 package22, we also fitted conventional models utilizing solely the animal-related data (i.e., BW, A:P ratio, and DMI (DMI was used only in developing the ANIM-B models)) as predictors. The two categorical variables (sex and breed) were not included in the prediction models because their addition did not lead to improvement in prediction performance (data not shown). The sets of variables selected by glmmLasso, LASSO, SCAD, RF-B, and conventional modeling were then separately fitted with a linear mixed model, and the resultant models were used to predict CH4 emissions. Further, to evaluate fitting performances, linear mixed models were fitted with all available data by selecting the variables using the ML and the conventional method. The Akaike information criterion (AIC) and Bayesian information criterion (BIC) of those models were computed using the ANOVA function in R v4.0.223. The estimated standard deviation of the animal random effects (\({\widehat{\sigma }}_{\alpha }\)), coefficients, and associated standard errors (SE) of the predictor variables were computed using the lmer function22. The observations and residuals versus the model predictions were plotted to evaluate the linear mixed model assumptions for all the ML models.

Evaluation of prediction performance

Prediction performance was validated in five steps. Step 1 involved screening the independent variables using the sure independence screening (SIS) test implemented in the SIS package in R24. This step was omitted for the RF-B method because this method can handle high-dimensional data without the need for data prescreening when building the decision trees. To check for potential multicollinearity, the variance inflation factor (VIF) was examined, and variables with a VIF < 5 were retained as recommended by Niu et al.7.

Step 2 began by randomly splitting the dataset into testing data (20% of the data) and training data (the remaining data). The training data were standardized by mean centering (subtracting each variable from the corresponding mean) and scaling (dividing each variable by its corresponding standard deviation). The testing data were standardized based on the centering and scaling from the training data. To prevent data leaking of the repeated records from the same animal, the two repeated records were always allocated to either the testing data or the training data.

In step 3, glmmLasso, LASSO, SCAD, and RF-B were individually used to select predictor variables from the training data. For glmmLasso and SCAD, the optimal tuning parameter λ was determined using BIC. Two increments of λ, 0.1 and 0.5, were also evaluated, but the results were virtually the same (data not shown) except for a substantial increase in computing time. For LASSO, tenfold cross-validation was used to determine the optimal tuning parameter λ. For RF-B, the top 10 important variables identified using the randomForest package were selected when the number of decision trees was set as 500, and the number of variables randomly sampled as candidates at each split (using mtry in the randomForest package) was set as 40. The top 10 important variables determined by boosting using the gbm package were also selected with the default settings applied. The union of the variables selected by random forest and boosting was kept.

Step 4 involved separately refitting a linear mixed model with the sets of selected variables using each of the ML methods and the training data. Predictions were then made using the values of the variables from the testing data. All the predictions and observations were recorded.

Step 5 repeated steps 2 to 4 for 200 times (sufficiently large and not too computationally intensive) by re-splitting the data into testing and training data in each iteration. Mean squared prediction error (MSPE), root MSPE (RMSPE), mean absolute errors (MAE), and Lin’s concordance correlation coefficients (CCC) were calculated using the equations specified in the section ‘Model evaluation equations’ below. We utilized RMSPE, MAE, and CCC to provide a comprehensive evaluation of the model performance. In particular, MAE is a robust assessment that is less sensitive to extreme values compared to RMSPE, whereas CCC can provide information on the precision and accuracy of the prediction.

Model evaluation equations

The values of MSPE, RMSPE, and MAE of all the models were calculated using the equations shown below25. The unit of MSPE was (g/d)2 for the ANIM-B models and (g/kg)2 for the DMI-B models. Lin’s CCC was calculated by the epi.ccc function in the epiR package in R26.

$$MSPE=\frac{\sum_{i=1}^{N}{{(O}_{i}-{P}_{i})}^{2}}{N}$$
$$MAE=\frac{{\sum }_{i=1}^{N}\left|{(O}_{i}-{P}_{i})\right|}{N}$$
$$RMSPE= \sqrt{MSPE}$$
$$CCC=\frac{2{s}_{op}}{{s}_{o}^{2} + {s}_{p}^{2}+{(\overline{o }-\overline{p })}^{2}}$$

where \({O}_{i}\) is the CH4 emissions of ith observation; \({P}_{i}\) is CH4 prediction from the model; N is the total number of observations; \({S}_{o}\) is the standard deviation of the observations; \({S}_{p}\) is the standard deviation of the predictions; \(\overline{o }\) is the mean of the observations; \(\overline{p }\) is the mean of the predictions; \({S}_{op}\) is the covariance of the observations and the predictions.

The R codes we used in the present study were available at GitHub (https://github.com/yu2269/CH4_prediction_model).

Results

Model performance

Tables 1 and 2 summarize the prediction performances of the ANIM-B and the DMI-B models developed using both the ML and the conventional methods, based on random data splitting (200 times). The ML methods notably improved the performance of ANIM-B models, reducing RMSPE (by − 1.25% for SCAD and 3.8% for glmmLasso) and MAE (by − 0.39% for SCAD and 5.08% for LASSO), while increasing CCC (by 2.64% for SCAD and 9.8% for LASSO), as compared to the conventional method. Similarly, the ML methods enhanced the performance of DMI-B models, reducing RMSPE (by 6.22% for glmmLasso and 11.12% for LASSO) and MAE (by 7.36% for RF-B and 13.31% for LASSO) and increasing CCC (by 68.73% for glmmLasso and 106.18% for LASSO). Tables 3 and 4 present the fitting performance of the ANIM-B and the DMI-B models developed using both the ML methods and the conventional method, based on the entire dataset. Compared to the conventional method, the ML methods decreased BIC (by − 33 for LASSO and 19 for glmmLasso) and AIC (by − 152 for LASSO and 53 for glmmLasso) for the ANIM-B models. Similarly, for the DMI-B models, the ML methods decreased BIC (by − 201 for RF-B and 5 for glmmLasso and SCAD) and AIC (by − 157 for RF-B and 18 for glmmLasso), as compared to the conventional method.

Table 1 Prediction performance of the animal-based (ANIM-B) models (CH4 production; g/d) developed using conventional method and four machine learning methods.
Table 2 Prediction performance of the dry matter intake-based (DMI-B) models (CH4 yield; g/kg DMI) developed using conventional method and four machine learning methods.
Table 3 Fitting performance of the animal-based (ANIM-B) models (CH4 production; g/d) fitted with the variables selected from all the available data by the conventional method and machine learning methods.
Table 4 Fitting performance of the dry matter intake-based (DMI-B) models (CH4 yield; g/kg DMI) fitted with the variables selected from all the available data by the conventional method and machine learning methods.

The final models were selected by balancing the prediction performance and the fitting performance. Among the ANIM-B models, the glmmLasso model had the lowest BIC and AIC (Table 3). It also had the lowest RMSPE in prediction performance (Table 1). Thus, we selected the ANIM-B glmmLasso model as the final model. For the prediction performance of the DMI-B models, the LASSO model had a lower RMSPE than the SCAD model (1.89 vs. 1.92), which was lower than that of the glmmLasso model (Table 2), but the LASSO model had a higher BIC and AIC than the SCAD model and the glmmLasso model (Table 4). The DMI-B SCAD model also had the fewest variables (6 in total, Table 4) among the ML models. Thus, the DMI-B SCAD model was selected as the final DMI-B model.

We assessed the performance of the models by plotting the observed CH4 emissions against the CH4 emissions predicted using the final ANIM-B and the DMI-B models, along with the corresponding studentized residuals (Supplementary Figs. 2 and 3, respectively). Both the ANIM-B glmmLasso and the DMI-B SCAD models showed more tightly clustered and less scattered patterns in the observation vs. prediction plots, as compared to the ANIM-B and the DMI-B conventional models (Supplementary Figs. 2A,B and 3A,B, respectively). On the studentized residuals vs. predictions plots, the two ANIM-B and the DMI-B ML models also had tighter and more symmetric patterns around the regression lines that had smaller slopes compared to the ANIM-B and the DMI-B conventional models (Supplementary Figs. 2a,b and 3a,b, respectively).

Animal-based (g CH4/animal·d) prediction models

The final ANIM-B glmmLasso model contained three animal-related predictor variables (DMI, A:P ratio, and BW) and 10 bacterial genera (or equivalents) of rumen bacteria out of the 308 genera (Table 5). None of the protozoal, fungal, or archaeal genera was selected as predictor variables. The three animal-related variables all had a positive coefficient. Of the 10 microbial predictor variables, four had a positive coefficient: 0.59 for the Uncl_genus of Oscillospiraceae, 0.49 for the Uncl_genus of the order Clostridia, 0.46 for the Uncl_Family of the order Gastranaerophilales, and 0.24 for the genus Moryella. The remaining six genera had a negative coefficient: Uncl_Family of the order RF39 (‒0.27), the genus Prevotella_7 (‒0.30), the Uncl_Genus of Marinifilaceae (‒0.34), the genus Syntrophococcus (‒0.37), the genus Oribacterium (‒0.40), and the Uncl_Family of the order Oscillospirales (‒0.44). The ANIM-B conventional model had the same three of the five animal-related data as predictor variables (DMI, A:P ratio, and BW). Compared to the ANIM-B conventional model, the ANIM-B glmmLasso model had a reduced \({\widehat{\sigma }}_{\alpha }\) (1.59 vs. 2.04), RMSPE (by 3.8%), and MAE (by 4.6%), and increased CCC (by 9.49%). The ANIM-B glmmLasso model also had a lower AIC and BIC than the ANIM-B conventional model (by 53 and 19, respectively).

Table 5 Coefficients and the associated standard errors (SE) of the predictor variables of the final ANIM-B (CH4 production; g/d) model.

Dry matter intake-based (g CH 4 /kg of DMI) prediction model

The final DMI-B SCAD model had six predictor variables: one animal-related predictor variable (A:P ratio) and five bacterial genera as microbial predictor variables (Table 6). No protozoal, fungal, or archaeal genera were selected as predictor variables. The A:P ratio predictor variable had a positive coefficient. Except for Uncl_Family of the order Gastranaerophilales (with a coefficient of 0.50), the other four microbial predictor variables were all known genera, all of which had a negative coefficient: − 0.24 for Pseudoramibacter, − 0.33 for Megasphaera, − 0.39 for Selenomonas, and − 0.40 for Oribacterium. The DMI-B conventional model had two animal-related predictor variables (A:P ratio and BW), and both had a positive coefficient. Compared to the DMI-B conventional model, the DMI-B SCAD model had a reduced \({\widehat{\sigma }}_{\alpha }\) (0.98 vs. 1.47), RMSPE (by 9.66%), and MAE (by 9.97%), and increased CCC (by 84.36%). Compared to the DMI-B conventional model, the DMI-B SCAD model reduced both AIC and BIC (by 15 and 5, respectively).

Table 6 Coefficients and the associated standard errors (SE) of the predictor variables of the final DMI-B (CH4 yield; g/kg DMI) model.

Discussion

The rumen microbiota is responsible for about 90% of the CH4 emitted from sheep27. With respect to CH4 production, rumen microbes can be categorized as producers of methanogenesis substrates (primarily bacteria and protozoa), CH4 producers (i.e., methanogens), and those that influence CH4 production by interacting with the above two categories either positively (e.g., through mutualism and commensalism) or negatively (i.e., through amensalism, competition, and predation) interactions. Thus, some of the rumen microbes could have a quantitative relationship with the overall CH4 emissions. Using four statistical/ML methods and one dataset (218 observations) that contains both animal-related data and metataxonomic data of the rumen microbiota (330 genera in total), we developed ML models for improved prediction of CH4 production (g CH4/animal·d) and CH4 yield (g CH4/kg of DMI) from sheep, which include rumen microbes as predictor variables.

New ML models including microbial predictor variables

As evaluated for prediction performance, it is evident that all the ML models, except for the ANIM-B SCAD model, improved prediction of CH4 emissions in terms of error, bias, and accuracy when compared with the conventional models. Additionally, the improvement in prediction performance demonstrated in the evaluation with testing data randomly selected from the entire dataset suggests the potential usefulness of these models over a broad range of CH4 emissions. The improvement in CH4 prediction by the final ANIM-B glmmLasso model and the final DMI-B SCAD model is also evident in the plots of the predictions vs. the observations along with the studentized residuals. In addition to the same animal-related predictor variables included in the conventional models (except for the exclusion of BW in the DMI-B SCAD model), the final ML included genera of bacteria as predictor variables. The improvement in prediction performance strongly suggests that the inclusion of these microbial predictor variables might have fine-tuned the models. Moreover, the improved predictions of the final ML models align well with the assumptions of linear mixed models. These findings suggest that microbial predictor variables can differentiate sheep based on CH4 production levels, offering promising prospects for capturing variations in CH4 emissions and helping achieve accurate CH4 prediction.

The bacterial genera included in the ML models

Despite the dataset containing 330 genera of microbes, only a small number of genera were selected as microbial predictor variables, and their inclusion did not lead to model overfitting. Notably, all the microbial predictor variables in both the ANIM-B and the DMI-B prediction models are genera of bacteria, with the majority belonging to Firmicutes, an abundant phylum of rumen microbiota. Interestingly, no archaea, protozoa, or fungi were selected as predictor variables in any of the ML models. The absence of methanogen genera in the models might seem counterintuitive, given their role as CH4 producers. However, previous studies have shown that the abundance of rumen methanogens, even that of the most abundant genus, Methanobrevibacter28, has very a weak to no correlation with CH4 production12,29. Furthermore, changes in diets that significantly affected CH4 production were not found to lead to substantial alterations in the relative sequence abundance of Methanobrevibacter30. Thus, the lack of methanogens in the final prediction models is not surprising. At low abundance, methanogens, fungi, and protozoa might have little predicting power for CH4 emissions.

Both the ANIM-B and DMI-B ML models included the genus Oribacterium and one unclassified genus of the order Gastranaerophilales (the Uncl_Family of the order Gastranaerophilales) with a negative and a positive coefficient, respectively. Although the function of this unclassified genus is not known, its large coefficients in both prediction models suggest a high predicting power. Interestingly, Gastranaerophilales has been shown to be negatively associated with milk yield in dairy cattle31 and negatively associated with growth in lambs32. Regarding the genus Oribacterium, it had a negative coefficient in both the final ANIM-B and the DMI-B models. This aligns with Oribacterium being a potential H2 sink in the rumen of sheep33 and its negative association with CH4 emissions documented in dairy cows34. Future research is warranted to explore and verify if Oribacterium has a quantitative relationship with rumen CH4 production and can be used as a predictor variable or biomarker of CH4 emissions.

In addition to the two genera shared with the DMI-B SCAD model, the ANIM-B glmmLasso model included another eight genera, five of which were unclassified, and their functions remain unknown. Among these genera, Moryella, Prevotella_7, and Syntrophococcus are known genera of bacteria, but their association or correlation with CH4 emissions in ruminants has not been reported in the literature. Prevotella_7 is a newly established genus35, consisting of P. multiformis, P. multisaccharivorax, and uncultured bacteria from the oral cavities and the rumen (https://www.arb-silva.de/). Studies have shown that both P. multiformis and P. multisaccharivorax were enriched by concentrate-based diets correlated with decreased CH4 emissions36,37. In buffalos, a higher relative sequence abundance of Prevotella (closely related to Prevotella_7) was associated with low CH4 emissions38. Prevotella is considered to be hydrogen-consuming39, making the inclusion of Prevotella_7 as a predictor variable with a negative coefficient not surprising. As for Moryella and Syntrophococcus, the literature has no information about their relationship with CH4 emissions, except for one in vitro study that showed a positive correlation between Syntrophococcus and maximum CH4 production, but a negative correlation with the appearance of peak CH4 production40. Future quantitative studies are needed to verify the qualitative relationship between these two genera and CH4 emissions.

The DMI-B SCAD model did not include BW as a predictor variable, but the inclusion of bacterial predictor variables improved the prediction accuracy. This suggests that these bacterial predictor variables might have reduced the variation in CH4 prediction introduced by BW. In addition to the two unknown genera discussed earlier, the final DMI-B SCAD model included three known genera (Pseudoramibacter, Megasphaera, and Selenomonas), which were not included in the final ANIM-B glmmLasso model. While the literature has no information about any potential association between Pseudoramibacter and CH4 emissions, Pseudoramibacter was reported to be positively correlated with feed efficiency in sheep41. Similarly, it remains unknown if Megasphaera or Selenomonas has any association or correlation with CH4 emissions in the absence of anti-methane inhibitors. However, one in vitro study has shown that calcium salts of long-chain fatty acids42, tucuma oil43, and ginkgo extract44 decreased CH4 emissions while increasing propionate and the abundance of Megasphaera and Selenomonas. Encapsulated nitrate has also been found to reduce CH4 emissions while increasing Selenomonas in steers45. Additionally, Megasphaera was more abundant in low than in high CH4-yield sheep46. Thus, the negative coefficients of both Megasphaera and Selenomonas in the DMI-B SCAD model align with their negative association with CH4 emissions and their ability to consume hydrogen and produce propionate33,39,47,48. It should be noted that many of these correlations were observed in ruminants that were subjected to dietary interventions using anti-methane inhibitors, which could directly or indirectly affect these genera, skewing their relationships with CH4 production. Future research is warranted to determine their relationship with CH4 emissions in the absence of anti-methane inhibitors and their predicting power of CH4 emissions.

Potential limitations of the sheep dataset and the selected models

The dataset used in developing the prediction models was from one study using a relatively small number of sheep that were fed the same diet13. A larger database from multiple studies using different diets would improve prediction models and enhance their applicability across a broad range of feeding systems. However, currently, no such a database exists. A few studies in the literature have reported rumen microbiota data together with CH4 emission data, and we attempted to combine these data into a larger dataset. However, most of the studies used anti-methane inhibitors, likely skewing the microbiota data. Additionally, different studies sequenced different regions of the 16S rRNA or 18S rRNA gene, making it impossible to combine the sequence data for microbiota analysis. To facilitate the combination of sequence data from multiple studies, a standardized metataxonomic protocol should be adopted.

Metataxonomic analysis of the rumen microbiota incurs additional cost, which may hinder the adoption and application in the livestock industry, but it is relatively small and justifiable for several reasons. First, amplicon sequences are routinely generated and analyzed in numerous studies to understand the microbiological underpinning of treatment effects or group differences, making sequencing cost a part of the study cost. Second, the cost of amplicon sequencing has been steadily decreasing, and with the advent of new sequencing technologies (e.g., the Illumina NovaSeq 6000), it is expected to further decrease. Third, accurate and precise CH4 emission prediction is becoming increasingly important for precision livestock farming, which is an essential approach to increase feed efficiency while reducing CH4 emissions for a sustainable industry of ruminants. Improved prediction of CH4 emissions can assist researchers and farmers in improving the overall feed efficiency and sustainability of livestock production.

Conclusion

Using machine learning methods, we successfully developed animal-based and dry matter intake-based CH4 prediction models that include both animal data and microbial genera as predictor variables. The inclusion of microbial predictor variables improved CH4 emission prediction in terms of error, bias, and accuracy compared to the conventional models that solely contain animal-related data, particularly for predicting CH4 yield (g CH4/kg of DMI). Notably, the coefficients (positive or negative) of some of the predictor variables from known bacterial genera (e.g., Oribacterium, Prevotella_7, Megasphaera, and Selenomonas) align well with their known associations or correlations (positive or negative) with CH4 emissions and/or with their known functions in the rumen ecosystem, such as hydrogen consumption. However, the role and association of other predictor variables of known genera (e.g., Moryella, Pseudoramibacter, and Syntrophococcus), as well as the unclassified genera, with CH4 emissions remain unknown.

While further improvements in CH4 prediction models would benefit from extensive databases encompassing rumen bacteria, archaea, fungi, and protozoa, along with animal-related data from multiple studies, this study demonstrated the potential and feasibility to include rumen microbes as predictor variables to improve CH4 predictions from sheep, likely from cattle also. A standardized approach to generate and analyze rumen microbial data is imperative to facilitate the compilation of multiple datasets from numerous studies. Such efforts will contribute to a better understanding of the rumen microbiota's impact on CH4 emissions and improved prediction models for sustainable agricultural practices.