Using phenotypic distribution models to predict livestock performance

Livestock production systems of the developing world use indigenous breeds that locally adapted to specific agro-ecologies. Introducing commercial breeds usually results in lower productivity than expected, as a result of unfavourable genotype by environment interaction. It is difficult to predict of how these commercial breeds will perform in different conditions encountered in e.g. sub-Saharan Africa. Here, we present a novel methodology to model performance, by using growth data from different chicken breeds that were tested in Ethiopia. The suitability of these commercial breeds was tested by predicting the response of body weight as a function of the environment across Ethiopia. Phenotype distribution models were built using machine learning algorithms to make predictions of weight in the local environmental conditions based on the productivity for the breed. Based on the predicted body weight, breeds were assigned as being most suitable in a given agro-ecology or region. We identified the most important environmental variables that explained the variation in body weight across agro-ecologies for each of the breeds. Our results highlight the importance of acknowledging the role of environment in predicting productivity in scavenging chicken production systems. The use of phenotype distribution models in livestock breeding is recommended to develop breeds that will better fit in their intended production environment.


Materials and Methods
Live body weight data from the five breeds tested in Ethiopia was obtained from the ILRI datasets portal. For details on ACGG see https://africacgg.net/. Environmental variables that characterize the different agro-ecological zones in Ethiopia were obtained from WorldClim 16 and the Harmonized World Soil Database v 1.2 17 . All data cleaning and analyses were undertaken in R version 3.5.1 18 running on RStudio version 1.1.383 19 . The experimental protocols and ethical guidelines were approved by the ILRI Institutional Research Ethics Committee (ILRI IREC) under reference number ILRI-IREC2015-08. ILRI IREC is accredited by the National Commission for Science, Technology and Innovation (NACOSTI) in Kenya. All the methods were performed in accordance with the relevant guidelines and regulations. chicken breeds. The ACGG program identified four exotic chicken breeds presumably tolerant to sub-Saharan African environmental conditions (Koekoek, Kuroiler, Sasso, and Sasso-RIR), and in 2016 fertile eggs were imported into Ethiopia. A locally improved breed named Horro was also used in parallel to the exotic breeds. The latter one assumed to be locally adapted, and genetically improved for growth and egg production in the environmental conditions of the low input systems characterizing the smallholders farms in Ethiopia 20 .
The five breeds were tested in 63 villages, divided over five regions across Ethiopia (Addis Ababa, Amhara, Oromia, Southern Nations Nationalities, and Tigray; see Fig. 1). A total of 1393 households received approximately 25 six-week old chicks per household. All breeds were tested in all the villages, but only one breed was tested in each household. The introduced chickens were exposed to the same environmental and management www.nature.com/scientificreports www.nature.com/scientificreports/ conditions as the local breeds already kept by the farmers. The breed distribution started in August 2016, and data collection ended in January 2018.
Breed and phenotypic data. Households were georeferenced, and each breed was randomly assigned to a household. Chickens were delivered at 6 weeks of age. Traditionally in Ethiopia, farmers sell the male chickens when they reach 2 kg, which is around 20 weeks. For male chickens, weight data was collected until 20 weeks of age. Females are kept longer for egg production, therefore, weight data for females was collected until 72 weeks of age. Data cleaning. In each household, body weight data was collected every two weeks as a group measure.
Taking into account the number of birds weighted, we converted the group weight to average individual body weight. After calculating the average individual weight, we first eliminated all observations with weight less than 50 grams. We deleted the households with more than 30 birds reported per household. We also deleted those observations where age was less than 6 weeks. For males we used male body weight during the growing phase (weeks [14][15][16][17][18][19] and for females we divided the data into growing phase (weeks [14][15][16][17][18][19], and adult phase (weeks 20-72). Not all the weeks had data for all of the households. Therefore, we standardized the body weight in each household to the average value of week in each age phase for all breeds for males and females independently. The average week value for the growing phase for males and females was 16.54 and 16.52 respectively. For females during the adult phase this was 41.81. Then, the following linear model was fitted: where y is the vector with the average bird weights, H is the household at the week of measurement, W is the week of measurement, and e is the vector containing the random residual error. We assumed that within breeds all chickens had the same body weight at the time of hatching and fitted a model with common intercept and different slope per household. For each household for each age phase independently, we calculated the least squares means (LSmeans), using the emmeans package in R 21 .
environmental data. The introduced chickens were kept in low input systems along with indigenous chicken. Some feed supplementation was given by farmers at their discretion. The supplemented feed was either grown by the farmers or bought externally, the rest of the feed intake was from scavenging. The introduced chickens had to cope with the same environmental conditions as the indigenous chickens, where food availability depends on seasonality. Therefore, we used a total of 21 variables at a 1 km by 1 km resolution ( Table 1). The environmental data, representing current conditions included 19 bioclimatic variables and an elevation variable. These 20 variables represent trends in temperature, seasonality and precipitation, variables that are expected to have an influence on the chicken productivity and food availability. Smallholders in Ethiopia tend to have crops in their households, which has an influence on what is fed to their livestock. To represent this link between human intervention and agriculture 1,4 we used an additional layer, the total cultivated land. This variable represents the percentage of area that is used in agriculture, therefore having an impact on the food availability. www.nature.com/scientificreports www.nature.com/scientificreports/ phenotypic variation models. In ecology, regression analyses are often used to explain the relationship between ecological data and to build prediction models. Ecological data can sometimes be complex showing non-linear relationships and/or spatial and temporal correlation. To address these problems, more flexible, but still complex models are often used, such as generalized additive models (GAMs). GAMs can identify nonlinear approaches, and still generate easy to interpret relationships. However, including too many covariates will lead to overfitting, which decreases the prediction accuracy. Therefore, several machine learning algorithms, such as boosting, have been used to increase the prediction accuracy of standard regression models 22 .
To model the relationship between environmental variables and chicken phenotypes, the values of the 21 environmental variables for each of the georeferenced household locations were used as predictor variables. The response variable was the LSmeans for body weight per household for each of the breeds at the average week for each growing phase. To increase prediction accuracy, gradient boosting, a machine learning algorithm, was applied to a GAM. Gradient boosting is an iterative process, where models are built in a step-wise fashion, starting from weak prediction models that fit the data poorly (base-learners), and by learning from the previous step, the fitting is improved in the next iteration 22 . All the models were built using all the environmental variables as base-learners. Boosting algorithms use as tuning parameter a stopping iteration, which determines the optimal point of where the algorithm should stop before convergence (e.i. early stopping). This choice of stopping iteration becomes crucial, as it prevents overfitting the data and improves the accuracy. The model for each breed was run independently, and a separate stopping iteration was determined for each model using the method described in Mayr, et al. 23 . The predicted body weight was represented in a heatmap that covers all the regions of the country were chickens are known to be kept. Desert areas (Somali and Afar regions (see Fig. 1) were excluded from the predictions, as the environmental conditions in these regions lay outside the range of the environmental conditions in which the chickens where tested (Supplementary Tables S1 and S2).
The variable of highest contribution in building each of the models was obtained. For this, the in-bag risk reductions per boosting step of a fitted model are accumulated individually for each variable (base-learner) contained in the model. This quantifies the individual variable contribution to risk reduction of each variable (base-learner), and can be used to compare the importance of different variable in the model 24 . To quantify the precision of our predictions, we estimated the correlation between the predicted value obtained from the phenotypic distribution models, and the LSmeans per household in each region. The LSmeans are the observed values for each of the households. With the model we predict the performance. If the predictions are precise, then the correlation between the predictions and the LSmeans would be high. Data cleaning, modelling the phenotypic variation and importance variable selection were done using the package mboost in R 24 .
Results estimated body weights. LSmeans for male body weights were estimated for week 16.54 (growing phase; Table 2). The heaviest breed was the Sasso with a mean estimated body weight of 1035.7 g, and the lightest was the Horro with a mean of 635.3 g. LSmeans for female body weights were estimated for week 16.52 (Table 3). The heaviest breed was the Koekoek with a mean estimated body weight of 1280.2 g, and the lightest was the Horro with a mean estimated weight of 738.6 g. LSmeans for female body weights in week 41.81 (adult phase) were highest for the Sasso-RIR with a mean weight of 2863.3 g, and the lightest remained to be the Horro with a mean estimated body weight of 2250.8 g (Table 4).  www.nature.com/scientificreports www.nature.com/scientificreports/ predicted body weights. Predicted body weights varied considerably between breeds across Ethiopia. In general, for the growing phase of males and females, the models predicted for the Sasso breed to have the highest body weight with 2185.2 g for males and 3862.8 g for females ( Fig. 2; Tables 2 and 5). The Horro breed is predicted to have the lowest body weight with 1327 g for males and 2540.3 g for females ( Fig. 3; Tables 5 and 6). The female body weight predicted for the adult phase was the highest for the Sasso breed with 4330.8 g. The lightest breed predicted was the Horro with 3074.6 g ( Fig. 4; Table 7).
To assess the precision of our predictions, correlations between the predicted values from the phenotypic variation models and the household LSmeans were estimated for each growing phase and breed within each of the regions. There is considerable variation within region for all breeds and growing phases, both in estimated and predicted weights (Supplementary Figs S1-S3). The correlation estimates have a median value of 0.61 but vary by breed and region ranging from 0.18 to 0.89. As suggested by Bonett and Wright 25 , to better estimate a correlation, the sample size should be greater than 25. Therefore, the correlations were only estimated for those regions where the total number of households for each breed was greater than 25. For the lowest number of households, N = 28, the S.Es were 0.098 and 0.103. For the highest number of household, N = 155, the S.E was 0.016. (Supplementary  Table S3; Supplementary Figs S1-S3).
Region suitability. Predictions of optimal region per breed, and the most important variable in building the model are given in Tables 5, 6 and 7. Results for male and female body weights during the growth phase and adult weights were very similar. Sasso always had the highest predicted body weights in Tigray. Koekoek had the highest weights in Amhara, while Kuroiler had the highest weights in Oromia. For Horro and Sasso-RIR, the pattern was less clear. Horro was predicted to perform best in either Amhara or Tigray for males and females. For Sasso-RIR there was no clear single region where the breed was expected to perform best for all sex and age groups.
Variables contributing to the models. The individual contribution from each environmental variable to build the best model were identified. The environmental variables of strongest influence for the predicted male body weight distribution were different for each breed. Patterns of male body weights for Horro, Koekoek, Sasso and Sasso-RIR were best predicted by variables associated with temperature. For Kuroiler, precipitation of the coldest quarter was the variable that had the strongest influence on the distribution (Table 5). For female body weights during the growing phase, the patterns of the body weight distribution for Horro and Sasso were again best predicted by temperature related variables. However, for Koekoek, Kuroiler and Sasso-RIR, precipitation variables had more influence on the distribution (Table 6). For female body weights during the adult phase, the body weights distribution patterns for Horro, Koekoek and Kuroiler were best predicted by variables associated with precipitation. For Sasso and Sasso-RIR, temperature annual range and elevation respectively had the largest influence on the body weights distribution (Table 7).

Discussion
We used phenotypic distribution models, implementing gradient boosting, a machine learning technique, to increase the predictive ability of GAMs. Machine learning procedures improve the accuracy of the models, by addressing overfitting, and variable selection simultaneously. Broadly, our results demonstrate the importance of understanding how different breeds respond to environmental conditions, and how different environmental variables can influence their productivity.
In Ethiopia, the Horro breed was characterized and genetically improved for growth traits 6,20 . The breeding program was conducted under an intensive management system, including high quality poultry houses, formulated feed, and vaccination. The ACGG program is the first attempt to test the ability of this breed to adapt to on-farm environment and management conditions 20 . Even though our results show higher body weights than previously reported for the breed 26 , compared to the introduced commercial lines they exhibit lower growth performance. The Horro breed originates in the Oromia region, at the Horro district at altitudes ranging between 2580 to 2810 meters above sea level (masl) in cool wet highlands. The average annual temperature is 13.3 °C, and the seasons can be divided in three; a main rainy season, a dry season, and a short rainy season. The seasonal variety in the Horro district can explain why the Horro breed seems to perform better in low humidity areas with low levels of precipitation. A recent study showed that during wet periods, the Horro chickens are prone to die due to disease, and survive better in dry seasons 27 .
The Sasso breed originates from the south of France in warm, and dry areas. In accordance to what we find here, we see that the Sassos' performance across ages is linked to temperature. Different studies have shown, in line with    www.nature.com/scientificreports www.nature.com/scientificreports/ our findings, that the Sasso breed was more productive than the indigenous breeds in Ethiopia 28 . We identified that temperature-associated variables have an influence on the Sasso predicted body weight distribution during all the age phases evaluated. The Sasso-RIR tested in Ethiopia was obtained through a private poultry farm, were crosses    www.nature.com/scientificreports www.nature.com/scientificreports/ were made to generate the breed for the ACGG program 28 . Therefore, no other studies have been reported using the Sasso-RIR cross to evaluate performance under scavenging conditions. For Sasso-RIR, different environmental variables are responsible for shaping the predicted body weight distribution depending on the age phase, suggesting    www.nature.com/scientificreports www.nature.com/scientificreports/ that the breeds response to the agro-ecology depends on age. In comparison, the pure breed Sasso, was predicted to be heavier than the crossbred Sasso-RIR for the male and female growing periods. For the adult females, the prediction was similar for both breeds. It should be noted that the body weight patterns and the variables of importance in building the models are different between these two breeds, suggesting that both breeds may have been adapted to different environmental conditions. As this is the first report of the Sasso-RIR cross in on-farm testing, we encourage the continuation of testing Sasso and Sasso-RIR in a broader scale. Our results show that these breeds outperform the others, suggesting that they both cope with low-input conditions.
The Kuroiler is a dual-purpose breed developed in India under humid conditions to perform in low maintenance systems. Studies have shown its capacity to adapt to tropical countries and significantly outperform the local breeds on scavenging conditions 10 . In this study, this breed showed higher estimated body weights compared to the locally improved Horro. For the phenotypic prediction models, for all of the age phases, the body weight distribution was influenced by environmental variables linked to precipitation. Rainy season has been reported preferred for farmers to rear chickens, as high precipitation increases the vegetation and lowers predation 27 . It also shows that the humid origin of the breed plays a role in the productivity of the breed.
The Koekoek breed is of South African origin, descending from a cross between three different breeds. This breed has been of popular use among rural farmers in African countries for egg and meat production. In a previous study using predictive habitat distribution models we showed that the Amhara region, and areas with colder temperatures and bigger fluctuations in annual mean temperature were the most suitable areas for the survival of the Koekoek breed 13 . In accordance, the Amhara region and areas with higher temperature fluctuations are now predicted to have higher body weights for Koekoek males and females. Here we show, in contrast to previous results, that precipitation during the wettest and warmer periods have an influence on the predicted distribution of the breed for body weight. Even though both predictive models show temperature fluctuation to be an important variable, precipitation seems to also play a role in shaping the variation of the phenotype. Dissimilarities in the important variables that shape the predicted distribution of the breed based on presence versus the phenotype, suggests that different processes can determine the range of the Koekoeks' breed performance distribution. We believe that the use of quantitative data in phenotypic distribution models gives a more accurate result compared to the presence only models (distribution models), as the former takes into account more detailed information about how the environment influences the life history traits of the breed.
Here we show that there is variation in the body weights estimated and predicted for each of the breeds for all the growth phases. In other words, that the breeds respond differently when exposed to the same environments. Correlations between predicted and estimated body weights are positive, but for some breeds and regions they are higher than others. These correlations were calculated using the locations where both a prediction and an LSmeans estimate were available. Therefore, the correlation of predictions to performance outside the locations of the households in the current dataset may be lower. For some breeds the correlations are similar between regions, however for others the values greatly differ, highlighting the variation in response within breeds to different agro-ecologies.
Variation in productivity among breeds can be attributed to the breeds' origin, which can have an effect on the breeds intrinsic response to different environmental conditions. A recent study characterized the genetic diversity and identified genomic regions that presented adaptive advantages of chickens in Uganda and Rwanda 29 . Here the authors showed that the Kuroiler breed has a gene which does not occur in different African ecotypes. They indicate that this gene is associated with homeostatic regulatory functions such as response to hypoxia, cold, and starvation. They suggest that the presence of this gene may be representing the result of the selective pressures for stress tolerance during their development in India 29 .
Our results highlight the importance of taking environmental variables into account for different applications. Genotype by environment interaction (G x E), defined as the change in phenotypic performance of genotypes, relative to one another, when measured in different environments 30,31 can be acknowledged using phenotypic distribution models. How sensitive a breed is to a particular environment has been of interest for many years in animal breeding. Accounting for which environmental variables have an influence on breed performance can help in predicting the occurrence of G x E interaction, and by establishing which areas within a country can be suitable for breed introduction. This knowledge can help breeding companies when choosing in which ecology they should introduce a breed, or it may help a farmer decide which breed to use in its own agro-ecology. Integrating habitat distribution models and phenotypic distribution models into animal breeding can provide a richer and more specific understanding of the environment's role in explaining the variation in productivity of a breed in different circumstances.

conclusions
Agro-ecological diversity can be a challenge when developing breeding programs, particularly in tropical countries 27 . The use of gradient boosting GAMs is novel approach that can be applied to this problem. They can be used either by breeding companies or by farmers to include environmental information when deciding where to introduce a breed, or which breed is better suited for which environment. We encourage the use of these models in livestock studies to capture the breeds' phenotypic variation as a response to different agro-ecological conditions. The approaches outlined in this paper can also be applied to different livestock breeds and locations, and could also be used to predict impacts of different climate change scenarios on change of productivity.

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.