A systematic approach to analyse the impact of farm-profiles on bovine health

In this study we present systematic framework to analyse the impact of farm profiles as combinations of environmental conditions and management practices on common diseases in dairy cattle. The data used for this secondary data analysis includes observational data from 166 farms with a total of 5828 dairy cows. Each farm is characterised by features from five categories: husbandry, feeding, environmental conditions, housing, and milking systems. We combine dimension reduction with clustering techniques to identify groups of similar farm attributes, which we refer to as farm profiles. A statistical analysis of the farm profiles and their related disease risks is carried out to study the associations between disease risk, farm membership to a specific cluster as well as variables that characterise a given cluster by means of a multivariate regression model. The disease risks of five different farm profiles arise as the result of complex interactions between environmental conditions and farm management practices. We confirm previously documented relationships between diseases, feeding and husbandry. Furthermore, novel associations between housing and milking systems and specific disorders like lameness and ketosis have been discovered. Our approach contributes to paving a way towards a more holistic and data-driven understanding of bovine health and its risk factors.

Husbandry, housing and milking systems. Data about husbandry, housing and milking systems has been collected by a survey within the scope of the "Efficient Cow" project 8,9 . The survey was conducted by trained staff from the performance recording organisations at all 166 farms, assessing information on farm management practices describing the management for young stock, lactating and dry cows. The original farm-based dataset included 70 variables: 8 numeric, 2 nominal and 60 categorical. Variables that contained irrelevant information for the analysis, such as the code for main production area or the unique identifier of hoof trimmers were excluded after consultation with domain experts. For the statistical analysis, categorical variables were one-hotencoded. Individual categories with less than ten observations were grouped together as a new variable labelled as "other". Each variable was then tested for internal consistency. A variable was included if it contained at least 80otherwise.
Feeding. Information about feed composition and feeding stems from 54,661 cow-based observations of the 5828 animals that were conducted within the scope of routine dairy herd improvement assessments (DHI) between 2/1/2014 and 5/3/2015. To estimate a farm's feed composition, we calculated the average amount of crude fiber and concentrate feed and their respective standard deviations for lactating and dry cows of a given farm. Information about feed quality stems from feed quality assessments conducted in 2014 10 . These cow-based observations were also matched to diagnoses based on unique and pseudonymised animal identifiers and the observation and diagnosis date, respectively. This data set was then used in the multivariate logistic regression (see "Methods" section "Multivariate logistic regression model" below) to assess the impact of individual features on the odds for specific diseases. The assessors rated the contamination of feed with mold as well as feed temperature as an indication for fermentation on a scale from 1 (perfect) to 4 (unsatisfactory). As assessments returned a perfect rating for both categories in almost all instances, feed quality was defined as problematic when the average of these two ratings was higher than 1.1.
Weather. Records from the national weather service from 2014 were used to determine the environmental conditions of a farm. Based on the farms' locations we computed its yearly average temperature, precipitation and humidity and the corresponding standard deviations. Furthermore, we computed the number of days with high wind and low temperature conditions. A day with at least one measurement smaller than 0.5 • C was classified as a low temperature day 11 , and a day with at least one observation of wind speed greater than 2.5 m/s was classified as a windy day.
National disease register. We used information about diagnoses from the national disease register to calculate disease prevalence for the eight most common diseases for each farm. Entries in the national disease registry are either diagnoses by veterinarians 12 or observations at calving and culling reasons 13 . The main data source for diagnostic information is field data based diagnoses by veterinarians, which must be documented due to legal regulations. This information is predominantly transmitted by the veterinarian directly to the central cattle database (RDV) 12 . Additional health-related observations from farmers around calving are recorded by employees from the performance recording organisation 13 . These employees were trained in advance, before starting the nationwide data collection at these 166 farms, in locomotion scoring, using the Sprecher et al. 14 method in real farms and using videos showing different locomotion scores. All health-related information is collected according to the law on the control of veterinary medicinal products (Tierarzneimittelkontrollgesetz), which requires a standardization and validation of the data. Health-related information is standardised using a coding system consisting of 65 diagnoses divided into ten categories. The diagnosis code for the Austrian National Disease Register is based on the very extensive ICAR central health code 15 . In addition to plausibility checks, the validation of the data contained is based on a strict criterion for the regular and complete recording of diagnoses. Furthermore, we derived information on diagnoses for lameness and ketosis based on recurring ketosis tests and lameness scoring. A ketosis diagnosis was assigned to a cow if beta-hydroxybutyrate levels (ketones) of ≥ 200µmol/l 16 was measured either 7 or 14 days after calving. Lameness was diagnosed if a cow had a lameness-score equal to or greater than 3 14,17 . In the present analysis we refer to all the described disease labels as "diagnosis", regardless of their origin.
We aggregated the information contained in all data sets except for the diagnoses based on unique and pseudonymized farm identifiers. Categorical variables were one-hot encoded. The final aggregated data set used to analyse farm profiles contained 119 binary variables and 23 numeric variables. www.nature.com/scientificreports/ Methods. In our methodological approach, we combine machine learning techniques with statistical analysis. To identify farm profiles, we combined the UMAP and HDBSCAN machine learning technique. For the evaluation of the farm profiles and their associated disease risks, we conduct a statistical analysis that includes calculating cluster-wise ORs and z-scores obtained from a multivariate regression model.
UMAP dimension reduction algorithm. In order to identify pertinent factors, we first reduced the dimensionality of the aggregated data. We applied the uniform manifold approximation and projection for dimension reduction (UMAP 6 to investigate the underlying structure in the farm-based dataset. UMAP is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis, constructed from a theoretical framework based on Riemannian geometry and algebraic topology. UMAP optimizes the layout of the data representation by creating a two-dimensional description of a high-dimensional dataset. Therefore, this approach preserves more of the hidden global structure 6,18 within the data than other linear dimensionality reduction techniques such as principal component analysis (PCA). Number of neighbours and minimal distance are the main parameters to define. We set number of neighbours to 50 and minimal distance to 0 as suggested in the documentation on using UMAP as a precursor step to clustering 18 . To improve the analysis, we assigned weights to the individual variables. As weights, we used the test statistic obtained by the multivariate logistic regression model; see below. To estimate the overall importance of a given factor, we computed the average of all test statistic results. Factors, for which no test statistics could be obtained because they did not meet our regression analysis criteria, see "Methods"-"Multivariate logistic regression model", were omitted. They were considered to be less important, and exclusion allows a uniform analysis without special treatment of individual features. A total of 13 variables were excluded this way, reducing the number of features to 129. Nevertheless, the information of these 13 variables is not lost. The features were excluded for the purpose of clustering, but all 142 available variables were later used to characterise the individual farm profiles, based on their z-scores.
HDBSCAN clustering algorithm. The two-dimensional embedding of our dataset obtained from the UMAP analysis was then used as input for the HDBSCAN clustering algorithm to identify regions of similar density and therefore assign farms to clusters (groups) with similar attributes. The hierarchy density-based spatial clustering of applications with noise algorithm (HDBSCAN) 7,19 constructs a density-based cluster hierarchy tree and then uses a specific stability measure to extract flat clusters from the tree. Each obtained cluster defines a farm-profile as a combination of environmental factors, feeding characteristics and management practices. The minimum cluster size has to be defined manually and was set to 15 for the present analysis. The optimal number of clusters is not an input parameter and is identified by the algorithm. Unlike many other clustering algorithms, HDBSCAN can also refrain from assigning an observation to a cluster if it is too uncertain about its belonging. The mathematical procedure to reduce the complexity of a higher dimensional space is based on stochastic computation, whereby the same analysis may produce different results. To ensure that the identification of relevant factors in our data is robust, we applied the UMAP algorithm in conjunction with HDBSCAN over 10,000 times with different random seeds and recorded the number of identified clusters identified. This approach showed a clear preference for five clusters. For the present analysis, we used a UMAP embedding that results in five distinct clusters and where all farms in the dataset are assigned to a cluster.
Disease risk profile. A disease risk profile for every cluster was created by comparing disease prevalence between the individual clusters. The disease risk of a profile was examined for the eight most common diseases in our data: anestrous, acute and chronic mastitis, ketosis, lameness, metritis, periparturient hypocalcemia and ovarian cysts. For every disease we identified the cluster with the lowest disease prevalence and used it as the reference group to calculate the OR of that disease for the other clusters. We only used clusters if they had at least five cows diagnosed with a certain disease. The OR was computed using a one-sided Fisher's exact test.
z-Scores. In order to identify the pertinent variables of a farm profile, we computed z-scores for all variables of each cluster. To interpret a z-score on a cluster level, we computed the z-scores in comparison to as follows: where x is the cluster mean value of a variable, µ the global mean value of all clusters and σ the standard deviation of all clusters. Here, a z-score measures how many standard deviations a variable of a given cluster is above or below the global mean. The corresponding p-value determines whether the variable differs significantly from the observed values in the other clusters. Significance was tested using a 2-sided t-test for numeric variables, and a Fisher's exact test for binary variables. A cluster variable with a high z-score is interpreted as a pertinent factor for that cluster, and the p-value specifies if the variable is significantly different from the other clusters. The obtained cluster specific z-score ranges differ between the clusters. For a better comparison of the factor importance between the clusters, we introduced a ranking of the z-scores from 1 to 142, where the highest z-score is assigned a 1 and the lowest 142.
Multivariate logistic regression model. To investigate the impact of the individual factors on diseases, we applied multivariate logistic regression. The dependent variable is the disease, which is explained by independent variables. The analysis was carried out based on the observation dataset linked to the diagnoses. We matched every diagnosis to one out of the 54,661 observations based on animal label and temporal proximity. We only included observations before the occurrence of a diagnosis even if an observation of the same animal after the diagnosis www.nature.com/scientificreports/ was closer in time, to exclude a possible bias introduced by the farmer's reaction to a disease. Furthermore, all observations of dry cows were excluded due to their specific treatment. A cow undergoes a metabolic change during dry periods that needs a change in management, such as the management of feeding. Therefore, observations from the dry period might not be representative for the relationship between disease and factor outside of the dry period. Animals with no diagnosis during the entire observation period were considered as the control group. In the logistic regression, 6615 observations linked to a diagnosis were analysed compared to 20,932 observations in the control group. Due to the heterogeneity of the data, numerical variables were normalized to a mean of zero and variance of one. To minimize the effects of confounding variables in the regression model we adjusted for the number of lactations, season, breed, the performance level of the farm (assessed by average yearly milk yield) as well as for differences in the diagnoses' origins (veterinarian, observation close to calving/ culling reason or based on a keto-test or lameness scoring). We corrected for biases due to different reporting schemes in certain regions by introducing a binary variable that indicates whether the diagnosis was made within such a region or not. In the regression model, the association between a given independent variable and a disease was estimated by keeping the other (confounding) variables constant. For each variable-disease pair with more than 50 observations, we applied the logistic regression model. To correct for multiple hypothesis testing, we applied the Benjamini-Hochberg procedure. The variance inflation factor was computed to control for possible multicollinearity. To improve interpretability of the obtained results, we converted the log odds into odds ratios (OR). The odds ratios were used to investigate the effects of each factor on diseases, particularly those factors characteristic of a farm profile.

Results
We report our results in the following structure. First, we describe the five farm profiles obtained by dimension reduction and clustering. Second, we describe and discuss the obtained disease risk profiles for these farm profiles. We conclude the manuscript by discussing technical limitations of our approach and summarizing our main findings. Statistical significance will be reported as *** for p < 0.001 , ** for p < 0.01 and * for p < 0.05.

UMAP-HDBSCAN-clustering. The obtained results of the UMAP-HDBSCAN cluster analysis reveal five
well-separated clusters of farms, the farm profiles; see Fig. 1. The farms have been grouped along the two-dimensional representation of our data as inferred with the UMAP procedure. Each dot represents one of 166 farms, and the size is proportional to the number of cows. For descriptive statistics for all variables see Supplement Table 6. In the following, we present the pertinent factors that characterise a farm profile structured along five categories: environmental conditions, husbandry, feeding, as well as housing and milking systems.
Farm profile of cluster 1. Cluster 1 describes the farm management of 12 farms located at a high altitude (i.e., height above mean sea level) [z-score-rank: 2*** p-value < 0.001] that are exposed to a high number of days with low temperature [rank 3***]. The farm management of this cluster is characterised by providing access to alpine pasture and pasture. We observe high ranks for providing access to alpine pasture for young Disease risk profiles: grouped feature-disease relations. Here, we report associations between groups of interacting variables comprising the farm profiles and disease risks. As shown in Fig. 2 and Table 1, we find that the observed differences in farm profiles associate with substantial differences in disease risks. In general, we observe the trend that farms in cluster 1 and 2 are associated with a decreased disease risk, while farms of cluster 4 and 5 have an increased disease risk. In cluster 3, we observe a trend towards a moderately increased odds ratios Multivariate logistic regression: feature-disease relations. For a better understanding of the relationship between diseases and the individual factors, we conducted a multivariate logistic regression analysis. In the following we use the estimates obtained by the logistic regression model to investigate the relationship between individual factors and diseases. The results of the logistic regression model are reported in the "Supplement" in Tables 7-14. Each cluster is analysed based on the factors highlighted in "Results" and their relationship to disease risks identified in "Conclusions". In the disease risk profiles of cluster 1 and cluster 4, we focus on the observed trends, while we concentrate on the most or least frequently observed diseases in the other profiles.
Further, we will focus on effects on lactating cows unless we do not observe any results for them, in which case we report results for dry cows or young stock, if available. We also report results from the latter whenever they differ from results obtained for lactating cows.
Disease risk profile for cluster 1. Cluster 1 is, overall, the cluster with the healthiest animals compared to all other clusters, thus showing the lowest disease risk for anestrous, ketosis, chronic mastitis, metritis and ovarian cysts. In this cluster, we observe strong negative correlations between disease risks and environmental conditions and husbandry, respectively. Regarding environmental conditions we find a strong negative association  Disease risk profile for cluster 2. For the farm-profile of cluster 2, which has the lowest prevalence of periparturient hypocalcemia and lameness diagnoses and the highest risk for metritis, we also observe a decreasing effect of high altitude [rank 1], low annual temperature [rank 9] and a husbandry that does provide access to alpine pasture. We find that altitude reduces the ORs for lameness, periparturient hypocalcemia and, contrary to the observed disease risk at a cluster level, also of metritis [ Disease risk profile for cluster 3. Compared to clusters 1 and 2, in cluster 3 we observe a general trend towards increased disease risk, although this cluster shares the risk lowering effects of altitude and pasturing. Nevertheless, the cluster has the lowest risk for acute mastitis and the second lowest for chronic mastitis. In terms of environmental conditions, we find a negative association between acute and chronic mastitis and altitude [ Discussion. The presented systematic methodological approach demonstrates how to identify and assess a variety of interacting farm management and environmental factors in relation to certain diseases. The obtained results allow to analyse disease risks in relation to a set of potentially interacting factors. We find that the farm profile of cluster 1 is associated with an overall decreased disease risk. The profile describes the management of farms at high altitude in which cows are often exposed to low temperatures. The overall decreased disease risk in this cluster is further related to the following factors: a free-stall barn, a husbandry that provides access to alpine pasture and pasture, and a feeding management where silo bales are used. In line with the literature, we find that a free-stall system is associated with better udder health 20,21 , lower risk of ketosis and better fertility 22 . Access to pasture and alpine pasture indicate a farm management that allows regular physical exercise. Previous studies revealed that the combination of loose-housing and regular outdoor exercise 21,23 have positive effects on the health and welfare of cattle. Alpine pasture in particular is beneficial for the fitness of an animal and thus for its longevity 24 . However, there is also evidence that neither a free-stall www.nature.com/scientificreports/ system nor increased physical exercise is inherently beneficial to dairy cattle's health 25 . For instance, increased physical activity has been identified as a risk factor for lameness 26,27 .
In cluster 2 we observe that the positive effect of a free-stall system also depends on environmental conditions, the type of floor type and its quality and farm management practices 28 . We find that tie-stalls that allow outdoor access are known to be associated with a reduced risk of lameness 21 . On the other hand, in line with the literature we observe that a free-stall system with deep-bed cubicles is also associated with reduced risk for injuries, in particular for lameness [29][30][31][32] . These inconsistent reports on free-and tie-stall housing emphasize the need for further analyses on the disease risks associated with different forms of housing; namely, an analysis that takes into account additional characteristics of the farm profile such as pasture management and environmental conditions. In this context, we find that access to alpine pasture has a positive effect on the fitness of an animal 24 , as does pasturing on claw health, depending on the duration of pasturing 33 . In line with previous reports where periods of heat stress were associated with an increase in the rate of claw horn lesion 34,35 , we find that low temperatures reduce the risk for lameness.
The overall highest diseases risk is associated with the farm profile of cluster 4. The highest risk increases are observed for anestrous, lameness, acute mastitis and ovarian cysts. As the main risk factors of this cluster, we find a combination of high annual temperature and its standard deviation, a husbandry of a large herd size where cows are kept in an outdoor climate barn, a free-stall housing system with high bed cubicles and a walkway with concrete slits, a milking system with post-milking technology and a feeding management that uses bunker silos together with a year-round corn silage feeding and TMR for forage and provision of concentrates. In line with literature, we observe an association between high temperature and an increased risk of illness. Heat stress has been associated with increased standing time 36 , which increases the risk of lameness 26,27,34 . The risk for lameness also correlates with the size of the herd 1,37 . As described, we also observe that an increased risk for disease correlates with housing factors such as high bed cubicles 30 , or a milking system with post-milking technology. In accordance with the literature, we find that feeding corn silage increases lameness [38][39][40] , and that a TMR feeding method has a strong negative impact on dairy cattle health 25 . This negative association is also illustrated when compared to the farm-profile of cluster 5. Although the profile of cluster 5 is similar to that of cluster 4, cluster 5 shows a reduced risk of disease. In addition to the annual milk yield, we have identified the feeding method as a distinguishing feature between the two clusters. In cluster 5 a partial mixed ration method is used for feeding, while in cluster 4 a TMR method is used. Furthermore, we observe in cluster 5 that the environmental conditions indicate a problematic feed quality. High humidity and high temperature promote fungal growth and a possible toxic contamination of the feed 41,42 . Silage in particular is known as a source of mycotoxigenic fungi 43 . When animals are fed mycotoxin-contaminated diets, toxic effects such as decreased feed intake and milk production or even death can occur [42][43][44] .
For some individual disease risk associations as in cluster 3, however, we could not find any confirmation from the literature. This is particularly the case for milking-system-related risk factors. We only found one study 45 that linked herringbone milking parlors to a higher risk of hock lesion, a sign of lameness, than tandem parlors, which is consistent with our findings on lameness risk variables (see "Supplement").
Limitations. Due to the observational study design and its secondary data use, our reported findings have an exploratory character. While the diagnosis information was collected by trained personnel according to clearly defined and strict guidelines, as prescribed by Austrian law, our data comes with the usual limitations of field-based observations. The identified associations therefore need to be evaluated by further research and prospective studies, especially for those of our findings that have not yet been thoroughly addressed in the literature. Care has to be taken when generalising from our findings for Austria to other countries due to unique geographic characteristics of farming in Austria. We found that altitude has a strong impact on the results of the clustering. Altitude and the associated environmental conditions featured prominently amongst the most important variables for defining the clusters. We, therefore, tested the dimension reduction clustering procedure with and without altitude. We observed that without including altitude as a variable, a significant number of farms would have remained unclassified. However, it became clear that the outcomes concerning other variables than environmental ones did not change qualitatively when comparing the farm risk profiles with and without altitude. In other words, associations between farm management practices and diseases have shown to be robust with respect to the in-or exclusion of environmental variables.

Conclusions
In this study, we presented a systematic framework to analyse the impact of farm profiles as combinations of environmental conditions and management practices on common diseases in dairy cattle. Our findings show that analysing disease risks in relation to a collection of variables can lead to a more holistic description of relevant risk factors and their interrelations as opposed to analysing disease risk in relation to individual variables, which may lead to seemingly inconsistent results if taken out of the context of other farm variables, as has been shown for the free-and tie-stall effects described in "Disease risk profile for cluster 2". In addition, our analysis reveals new relationships between housing and milking systems and specific diseases. Further prospective studies are needed to examine these relationships in more detail.