Infection prediction in swine populations with machine learning

The pork industry is an essential part of the global food system, providing a significant source of protein for people around the world. A major factor restraining productivity and compromising animal wellbeing in the pork industry is disease outbreaks in pigs throughout the production process: widespread outbreaks can lead to losses as high as 10% of the U.S. pig population in extreme years. In this study, we present a machine learning model to predict the emergence of infection in swine production systems throughout the production process on a daily basis, a potential precursor to outbreaks whose detection is vital for disease prevention and mitigation. We determine features that provide the most value in predicting infection, which include nearby farm density, historical test rates, piglet inventory, feed consumption during the gestation period, and wind speed and direction. We utilize these features to produce a generalizable machine learning model and evaluate its ability to predict outbreaks both seven and 30 days in advance, allowing for early warning of disease infection. We evaluate our model on two swine production systems and analyze the effects of data availability and data granularity in the context of these two systems' different volumes of data.
Our results demonstrate good ability to predict infection in both systems, with a balanced accuracy (average prediction accuracy on positive and negative samples) of 85.3% on any disease in the first system, and balanced accuracies of 58.5%, 58.7%, 72.8%, and 74.8% on porcine reproductive and respiratory syndrome, porcine epidemic diarrhea virus, influenza A virus, and Mycoplasma hyopneumoniae in the second system, respectively, using the six most important predictors in all cases. These models provide daily infection probabilities that can be used by veterinarians and other stakeholders as a benchmark to support preventive and control strategies on farms in a more timely manner.


Supplementary Results
Tables S1, S2 and S3 give the performance of additional machine learning models on both farm systems in comparison to the random forest, gradient boosting, and neural network (system B) models reported in the paper.

Dimension Reduction Analysis
Table S4 and Figure S1 present the results of our analysis of the model's dependence on the dimension reduction hyperparameter (the number of features selected for dimension reduction) in system A.

Distribution Shift
Tables S5 and S6 detail distribution differences between the train and test sets. Table S7 details missing movement data in the training set in system B.

Test Rates and Heuristic Model
The structure of the available data leads to test rates being an overly effective predictor in the dataset in system A that would not generalize to other production systems or sample sets. This is a result of the fact that many farms have an overwhelming number of samples of one class and few or none of the other class.
To illustrate this, we build a baseline heuristic model based on the historical test rates. Specifically, consider a sample taken at farm i on day T that has historical positivity and negativity rates P+(i, T) and P−(i, T), respectively. Then we define the probability of infection at farm i on day T as

p(i, T) = P+(i, T) / (P+(i, T) + P−(i, T)). (1)

On system A, this simple model is effective: with a 60-day window, it obtains a balanced accuracy of 0.883 on the test set, surpassing all of the machine learning models. This observation is an artifact of the data: many samples exist on farms that have few or no tests of one class; with a 60-day window, a particularly extreme case consists of a farm with 577 negative samples (over half of the dataset) and no positive samples.
The full distribution of samples by farm is available in Supplementary Figure S4; the majority of samples occur at farms that have only negative samples. Ultimately, to maintain the robustness of the model, we remove historical rates from the feature set in system A.
The distributions in system B do not suffer to the same extent; the heuristic model performs well in some cases but not all, achieving scores of 0.584, 0.578, 0.725, and 0.752 on PRRS, PEDV, IAV, and MHP, respectively. As a result, we retain test rates as a feature in this system.
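The heuristic baseline can be sketched as follows. This is an illustrative simplification on a toy farm history: the function name, the per-day record format, and the use of the positive-test share within the window as the probability are assumptions for this sketch, not the exact implementation used in the study.

```python
import numpy as np

def heuristic_predict(test_days, test_labels, query_day, window=60):
    """Probability of infection at a farm from its historical test rates:
    the share of positive tests among all tests taken in the preceding
    `window` days (illustrative sketch)."""
    test_days = np.asarray(test_days)
    test_labels = np.asarray(test_labels)
    mask = (test_days >= query_day - window) & (test_days < query_day)
    if not mask.any():
        return 0.5  # no history in the window: uninformative prior
    return test_labels[mask].mean()

# Toy farm history: day index of each test and its 0/1 diagnosis.
days = [1, 5, 10, 40, 55]
labels = [0, 0, 1, 1, 1]
p = heuristic_predict(days, labels, query_day=60, window=60)
```

With a farm that has only negative historical tests, this heuristic always outputs 0, which is exactly why it scores so well on system A's skewed farm-level distributions without learning anything generalizable.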
Supplementary Methods

Data
This section contains additional details on the predictor categories.

Direct Contact Predictors
Direct contact predictors derive from the fact that the transportation of infected pigs is a major factor leading to infection and outbreak. These predictors are motivated by the idea that the movement of pigs between farms may beget disease at the farm receiving pigs; intuitively, infections are more likely to occur if diagnoses at the farm sending pigs imply previous disease infection or outbreak. We model this disease pathway with three main types of features: movement quantity features, source distance features, and source diagnosis features, broken down below.
1. Movement quantity features enumerate the number of pigs sent to farm i within the window W_{T,n} from each of its respective sourcing farms. The number of features is determined by the farm that has the highest number of unique source farms in its busiest window.
2. Source distance and source diagnosis features are compiled using the source farms from above. Source distance features are the distances of each of the respective source farms to the farm receiving pigs. Source diagnosis features, on the other hand, are the counts of positive and negative diagnoses at each source farm within W_{T,n}. For every source movement feature, there is one corresponding source distance feature and two source diagnosis features (one positive, one negative).
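The feature layout described above can be sketched as follows. All records here (source farm names, pig counts, distances, diagnosis counts, and the maximum number of sources) are invented placeholders; only the ordering-by-volume and fixed-width padding mirror the construction in the text.

```python
import numpy as np

# Hypothetical records for one receiving farm within a window W_{T,n}.
movements = {"farm_a": 120, "farm_b": 45}         # pigs received per source
distances = {"farm_a": 12.5, "farm_b": 30.2}      # km from source to receiver
diagnoses = {"farm_a": (2, 7), "farm_b": (0, 3)}  # (positive, negative) counts

# Sources are ordered by incoming volume (largest first), then padded to a
# fixed width so every sample has the same number of features.
MAX_SOURCES = 3  # in the real data: set by the busiest farm/window
order = sorted(movements, key=movements.get, reverse=True)

features = []
for k in range(MAX_SOURCES):
    if k < len(order):
        src = order[k]
        pos, neg = diagnoses[src]
        # one movement, one distance, two diagnosis features per source
        features += [movements[src], distances[src], pos, neg]
    else:
        features += [0, 0.0, 0, 0]  # pad absent sources with zeros

features = np.array(features)
```

Padding with zeros is one reasonable convention; a movement count of zero naturally encodes "no such source" for the model.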

Spatio-temporal predictors
Spatio-temporal predictors attempt to model local area spread of disease between swine populations. We expect local area spread to occur more often between nearby farms; as a result, we add the distances of the five nearest farms to the farm in a given sample as features. In addition, climatic factors are known to affect disease transmission; consistent warm or cold temperatures, for example, often influence the speed of disease spread and the intensity of outbreaks. To model this, we encode five meteorological features: maximum and minimum temperature, humidity, wind direction, and wind speed. Each feature is extracted daily over W_{T,n}, at which point their respective means are computed to give the final value.
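The construction above can be sketched with invented daily readings and distances. One caveat worth noting: a plain arithmetic mean of wind direction (used here for simplicity, matching the per-feature averaging in the text) can be distorted near the 0/360 degree wrap-around, where a vector mean would be more robust.

```python
import numpy as np

# Hypothetical daily readings over W_{T,n}: rows are days; columns are
# max temp, min temp, humidity, wind direction (deg), wind speed.
daily = np.array([
    [21.0,  9.0, 0.62, 180.0, 3.1],
    [23.0, 11.0, 0.58, 190.0, 2.7],
    [22.0, 10.0, 0.60, 170.0, 3.5],
])

# Each meteorological feature is the mean of its daily values in the window.
met_features = daily.mean(axis=0)

# Distances (km) to the five nearest farms: a proxy for local farm density.
nearest_five = np.sort(np.array([4.2, 1.3, 9.8, 2.5, 6.1]))

sample_features = np.concatenate([nearest_five, met_features])
```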

Historical predictors
Our final set of predictors are historical predictors, which attempt to capture past farm performance that may correlate with infection and outbreak. Our main source of historical features is production data, which include a variety of recorded statistics, ranging from the number of litters per female to the total pigs born dead, and including feed consumption, pig weights, and spending on veterinary services, among others.
In system B, we focus on three subsets of the production features: mortality-related, feed-related, and cull-related.
We find that training models solely with these feature subsets, combined with the other predictors, provides improved performance compared to using all production features.

Test Rates
Our final type of historical feature is historical test rates. Specifically, we define two features, the historical positivity rate and the historical negativity rate, as follows. Let N+_i[0, T) and N−_i[0, T) be the number of positive and negative tests, respectively, at farm i across all days from the beginning of available data through day T − 1. Then the historical positivity and negativity rates at farm i on day T are N+_i[0, T)/T and N−_i[0, T)/T, respectively. We examine models with test rates for both the sample farm, at which the sample is defined, and its source farms. Due to the structure of the dataset, however, we remove test rates at the sample farm as features in system A, while retaining them in system B. In both systems, we use the test rates at source farms as features.
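The rate definition above can be sketched directly; the function and variable names are illustrative, and the toy day indices are invented.

```python
import numpy as np

def historical_rates(pos_days, neg_days, T):
    """Historical positivity/negativity rates at a farm on day T:
    the count of positive (resp. negative) tests on days in [0, T),
    divided by the total number of days T (sketch of the definition
    in the text)."""
    pos = np.sum(np.asarray(pos_days) < T)  # positive tests before day T
    neg = np.sum(np.asarray(neg_days) < T)  # negative tests before day T
    return pos / T, neg / T

# Toy history: days on which positive / negative tests occurred.
pos_rate, neg_rate = historical_rates(pos_days=[3, 17, 40],
                                      neg_days=[1, 2, 5, 9], T=50)
```

Because the denominator is the full span of days rather than the number of tests, both rates can be small simultaneously on sparsely tested farms, which is the distinction from the windowed test counts described in the Feature Names section.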

Farm-specific predictors
Finally, we consider additional farm-specific predictors in system B. Specifically, we analyze biosecurity data, a type of data unavailable in system A. This data concerns the management policies of each farm. We focus on five main policy features: the estimated number of swine on site, the carcass disposal plan, the manure storage method, the employee access point plan, and whether there is a shared lagoon. The estimated number of swine feature is continuous while the others are categorical; the categorical information is one-hot encoded to form numerical features, which are concatenated with the rest of the feature set.
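The one-hot encoding step can be sketched with pandas; the farm records and category values below are invented, and only a subset of the five policy features is shown.

```python
import pandas as pd

# Hypothetical biosecurity records for three farms in system B.
bio = pd.DataFrame({
    "est_swine_on_site": [2400, 5100, 1200],         # continuous, kept as-is
    "carcass_disposal":  ["compost", "render", "compost"],
    "manure_storage":    ["lagoon", "pit", "lagoon"],
    "shared_lagoon":     ["yes", "no", "yes"],
})

# One-hot encode the categorical policy columns; the continuous column
# passes through unchanged and the dummies are concatenated alongside it.
categorical = ["carcass_disposal", "manure_storage", "shared_lagoon"]
encoded = pd.get_dummies(bio, columns=categorical)
```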

Biosecurity Features
The following is a list of the categories involved in the biosecurity features for system B. These features (with the exception of the estimated number of swine on site) are one-hot encoded and then appended to the numerical features. The estimated number of swine on site is appended as-is.
Tables S8 and S9 contain details of the sample sets used in the machine learning process.

Feature Names
This section describes the naming conventions behind the figures used in the text in Figures 1 and 2, and in Supplementary Table S9. All numbered features are zero-indexed.
• Farm Density Indicator/Nearest Farm Distance (nth). These features are the distances in kilometers of the five nearest farms. They serve as a proxy for the local farm density: if the five closest farms are nearby and the distances are low, the local farm density is high, and vice versa.
• Sow Production: <Feature Name>. These are sow production features. Details on specific features, as well as a list of all features, are available in Supplementary Table S8.
• Close Out: <Feature Name>. These are key performance indicators on finishing farms in system A.
• Nurfin: <Feature Name>. These are production features on nursing/finishing farms in system B. Details on specific features, as well as a list of all features, are available in Supplementary Table S8.
• Source Movement: nth. This is the number of incoming swine from the nth source farm.
• Source Positivity/Negativity Rate: n. This is the historical positivity/negativity rate (days with positive/negative diagnoses/total days) on the nth source farm.
• Source Positive/Negative Tests: nth. This is the count of positive or negative tests within the historical window on the nth source farm. This differs from historical test rates in that test rates are normalized by the number of total historical days and are collected over all previous days, while positive/negative tests are strict counts and collected only over the historical window.

Model Stages
This section contains additional details on the model stages.
The first stage is standardization, where each feature has its respective mean subtracted and is then divided by its standard deviation in order to make the features scale-invariant. The motivation and process behind the four subsequent stages are discussed below.
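Standardization can be sketched as follows; the key detail (consistent with the leakage-avoidance discussion in the cross-validation section) is that the test data reuses the training statistics. The toy arrays are invented.

```python
import numpy as np

def standardize(X_train, X_test):
    """Subtract each feature's training mean and divide by its training
    standard deviation; the test split reuses the training statistics so
    no information leaks from the test set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test = np.array([[3.0, 20.0]])
Z_train, Z_test = standardize(X_train, X_test)
```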

Feature Selection
The next step is feature selection, which attempts to select an optimal subset of features to maximize performance of the model. Feature selection differs from dimension reduction in that it is a supervised process; features are selected according to their ability to aid model performance. As this type of supervision can depend significantly on the performance of the model, we perform this step across multiple models to determine the best combination of features and model. As a benchmark, we first analyze models with all features before performing a feature selection; this can be thought of as an identity feature selection. We discuss feature selection methodology and results in depth in a separate section.
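Supervised feature selection of this kind can be illustrated with a simple greedy forward search scored by cross-validated balanced accuracy. This is a generic sketch on synthetic data, not the exact procedure from the paper; the stopping rule and feature budget are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the real features come from the systems' datasets.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
model = LogisticRegression(max_iter=1000)

# Greedily add, one at a time, the feature that most improves
# cross-validated balanced accuracy; stop when nothing helps.
selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
for _ in range(3):  # illustrative budget of three features
    scores = {
        f: cross_val_score(model, X[:, selected + [f]], y, cv=5,
                           scoring="balanced_accuracy").mean()
        for f in remaining
    }
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:
        break  # no further improvement
    best_score = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)
```

Because the search is scored with a specific model, the chosen subset is model-dependent, which is why the text repeats the step across multiple models.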

Dimension Reduction
Due to the relatively large number of features (ranging from 204 to 404 depending on the system, disease, and window), we reduce the dimension of the data to improve the model's ability to generalize. In system B, we use a dimension reduction across all features. In system A, we explore both a standard dimension reduction and a stratified dimension reduction, wherein we perform two dimension reductions, one each for the sow and close out features, and then recombine them with the other features post-reduction. The latter is motivated by the idea that linear correlations are likely to exist within the sow and close out data; running a dimension reduction separately on these datasets will best extract those relationships. In both cases, we use principal component analysis (PCA) to perform the dimension reduction; we choose the number of PCA components via the cross-validation process, discussed in the cross-validation section.
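The stratified variant can be sketched as follows; the block sizes and component counts are invented (in the study the number of components is a hyperparameter chosen by cross-validation), and random data stands in for the sow, close out, and remaining predictors.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
sow = rng.normal(size=(100, 40))       # stand-in sow production features
closeout = rng.normal(size=(100, 30))  # stand-in close out features
other = rng.normal(size=(100, 10))     # remaining predictors, untouched

# Stratified dimension reduction (system A): reduce the sow and close out
# blocks separately, then recombine with the other features post-reduction.
sow_r = PCA(n_components=5).fit_transform(sow)
closeout_r = PCA(n_components=5).fit_transform(closeout)
X_reduced = np.hstack([sow_r, closeout_r, other])
```

Reducing each block separately lets PCA spend its components on the correlations within sow data and within close out data, rather than on cross-block variance.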

Machine Learning Models
We examine five binary classifiers: logistic regression, support vector machines, decision trees, gradient boosting, and random forests. In system B, we also explore a multi-layer perceptron model. Specifically, we examine fully connected models with both two layers and three layers, the latter in an auto-encoder-type architecture. We find that this architecture is particularly suited for system B, as the samples suffer from significant distribution drift; the auto-encoder-type models are able to extract the most important signals, allowing for better performance on validation and testing sets. As this is less of a factor in system A, we do not report results with this model there.
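A three-layer fully connected model with a narrow middle layer, loosely mirroring the auto-encoder-type bottleneck described above, can be sketched with scikit-learn; the layer widths, iteration budget, and synthetic data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data for the system B samples.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Wide-narrow-wide hidden layers: the 8-unit bottleneck forces the model
# to compress the input into its most important signals.
mlp = MLPClassifier(hidden_layer_sizes=(32, 8, 32), max_iter=500,
                    random_state=0)
mlp.fit(X, y)
acc = mlp.score(X, y)  # training accuracy, for illustration
```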
In all cases, we tune hyperparameters with the five-fold cross-validation procedure.

Metrics
Due to the imbalance between classes in this binary classification, we select balanced accuracy as the metric to evaluate model performance. Balanced accuracy can be viewed in two ways. First, it is the weighted accuracy of each class, so that performance on the smaller class is weighted equally to performance on the larger class:

Balanced Accuracy = (1/2) (correct predictions on true negatives / total true negatives + correct predictions on true positives / total true positives).

Second, it is the average of the model's sensitivity and specificity:

Balanced Accuracy = (1/2) (TP / (TP + FN) + TN / (TN + FP)),

where TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative counts, respectively.
While balanced accuracy scores avoid the optimism of the receiver operating characteristic area under the curve (ROC-AUC) on imbalanced datasets, they do necessitate the choice of a specific threshold for prediction. As with our other hyperparameters, we determine the optimal threshold in the cross-validation process as the threshold that maximizes Youden's J statistic on the validation set; this is equivalent to the threshold that maximizes the difference between the true and false positive rates. Since these thresholds are chosen to maximize performance on the validation set under consideration, we evaluate the performance of the models with these thresholds on the test set.
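The metric and the threshold choice can be sketched together; the toy validation labels and scores are invented. Note that since J = TPR − FPR = 2 × (balanced accuracy) − 1, maximizing Youden's J over thresholds is the same as maximizing balanced accuracy on the validation set.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (TPR) and specificity (TNR)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return 0.5 * (tpr + tnr)

def youden_threshold(y_true, p_scores):
    """Threshold maximizing Youden's J = TPR - FPR on validation data."""
    candidates = np.unique(p_scores)
    js = [balanced_accuracy(y_true, p_scores >= t) * 2 - 1
          for t in candidates]
    return candidates[int(np.argmax(js))]

# Toy validation labels and model probabilities.
y_val = np.array([0, 0, 0, 1, 1])
p_val = np.array([0.1, 0.4, 0.35, 0.8, 0.6])
t_star = youden_threshold(y_val, p_val)
bacc = balanced_accuracy(y_val, p_val >= t_star)
```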

Cross Validation
All of our hyperparameters are explored by cross-validation. We first perform a train/test split by setting the last quarter of the samples aside, to ensure that the model avoids time-series data leakage. With the remaining data, we perform five-fold cross-validation across all of our hyperparameters; in each case, the folds are split before any scaling/dimension reduction is performed, to prevent data leakage. The full set of hyperparameters explored is in Table S10.
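The split scheme can be sketched as follows; the sample count is invented, and the folds preserve sample order since scikit-learn's KFold does not shuffle by default.

```python
import numpy as np
from sklearn.model_selection import KFold

n = 100
X = np.arange(n).reshape(-1, 1)  # samples already in time order

# Hold out the last quarter as the test set, avoiding time-series leakage.
split = int(n * 0.75)
X_train, X_test = X[:split], X[split:]

# Five-fold CV on the remaining data; any scaling or PCA would be fit
# inside each fold, on the training folds only, to prevent leakage.
folds = list(KFold(n_splits=5).split(X_train))
```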

Table S3. Balanced accuracy with selected hyperparameters for IAV and MHP on system B. Columns CV and Test correspond to balanced accuracy scores in cross-validation and on the test set, respectively, while Thresh gives the optimal threshold for that model as determined via the metric-computation process.

Farm 0 indicates the farm sending the most pigs in the historical window and farm n denotes the farm sending the least pigs. The numbers 0, 1, ..., n correspond to the Source Negativity Rate, Source Positivity Rate, Source Negative Tests, Source Positive Tests, and Source Distance features; e.g., Source Negativity Rate: 4 refers to the same farm as Source Movement: 4 within a given sample. Note that farm 4 in one sample does not necessarily refer to farm 4 in another sample, as the corresponding ordering of numbers of incoming pigs changes between destination farms and historical windows.

Table S1 .
Balanced accuracy with selected hyperparameters on system A. Balanced accuracy represents the average recall, weighted between positive and negative samples. Columns CV and Test correspond to balanced accuracy scores in cross-validation and on the test set, respectively, while Thresh gives the optimal threshold for that model as determined via the metric-computation process. Higher thresholds imply better model discrimination between positive and negative predictions.

Table S2 .
Balanced accuracy with selected hyperparameters for PRRS and PEDV on system B. Columns CV and Test correspond to balanced accuracy scores in cross-validation and on the test set, respectively, while Thresh gives the optimal threshold for that model as determined via the metric-computation process.