Estimation of swine movement network at farm level in the US from the Census of Agriculture data

Swine movement networks among farms/operations are an important source of information to understand and prevent the spread of diseases, nearly nonexistent in the United States. An understanding of the movement networks can help the policymakers in planning effective disease control measures. The objectives of this work are: (1) estimate swine movement probabilities at the county level from comprehensive anonymous inventory and sales data published by the United States Department of Agriculture - National Agriculture Statistics Service database, (2) develop a network based on those estimated probabilities, and (3) analyze that network using network science metrics. First, we use a probabilistic approach based on the maximum information entropy method to estimate the movement probabilities among different swine populations. Then, we create a swine movement network using the estimated probabilities for the counties of the central agricultural district of Iowa. The analysis of this network has found evidence of the small-world phenomenon. Our study suggests that the US swine industry may be vulnerable to infectious disease outbreaks because of the small-world structure of its movement network. Our system is easily adaptable to estimate movement networks for other sets of data, farm animal production systems, and geographic regions.

In this work, we estimate the swine movement probabilities between counties based on published inventory and sales data from the USDA Census of Agriculture. We develop a convex optimization problem with some linear constraints for the US swine industry. To solve this problem, we adapt the cattle movement model from Schumm et al. 9 for the swine population. In particular, we maximize the entropy of the distributions of the objective function (Eq. 1). Maximum information entropy methods have been used in various research fields [10][11][12] . The maximum entropy principle states that the best way to approximate the unknown distribution that satisfies all the constraints will have the maximum entropy 13 .
We propose a novel algorithm to develop a farm level swine movement network using the estimated swine movement probabilities. In this network, nodes (or vertices) represent swine-farms and directed links (or edges or connections) represent directional swine movements between the farms. Network realizations from the interactions among the elements of different dynamic systems can be seen several times in the literature; for example, weighted network for worldwide air transportation 14 , network for collaboration among scientists 14 , network to understand complex intercellular interactions 15 , and network to represent interplay among different physiological systems [16][17][18][19] . To understand the generated swine movement network, we use network centrality measures. They have been used often in the literature to understand the livestock movement patterns [20][21][22] . The network centrality measures can assist in detection of the important farms, which can control the movement flows in the network. This information can be useful to plan effective mitigation strategies to reduce an epidemic size. In the literature, researchers have used targeted vaccination, or quarantine, or culling of important agents to control epidemics 23,24 . The network centrality measure also can help us to understand the movement pattern. From the analysis of the developed swine movement network, we find a trace of the small world phenomenon and the presence of hubs in the US swine movement network.

Materials and Methods
First, we develop a convex optimization problem to estimate swine movement probabilities. Next, we propose an algorithm to develop a network based on those probabilities, where nodes or vertices are farms or operations and edges among them represent swine movement. Finally, we analyze the network using different network analysis metrics.
Data. We have collected the hog inventory, sales, slaughter, and dead/lost pig data from the United States Department of Agriculture National Agricultural Statistics Service (USDA-NASS) 25 . The USDA-NASS conducts a census every five years, which compiles a uniform, comprehensive agricultural data set for each county of the entire US. We used the data from the 2012 Census of Agriculture, as the census of 2017 is not published fully at the time of this research. For each county, two sets of data are available: (1) inventory and (2) sales. In both types, pigs are grouped into seven classes based on operation/farm size. These groups are: size1 (1-24 pigs), size2 (25-49 pigs), size3 (50-99 pigs), size4 (100-199 pigs), size5 (200-499 pigs), size6 (500-999 pigs), and size7 (more than 1000 pigs). For each size group, data for the number of operations and the number of pigs are available. However, several data points are not published to maintain anonymity; we estimate those to develop the network model. The study time of this research is the year 2012. We have assumed that the inventory sizes are constant throughout the year because of the resolution limitation of the available data. Another set of missing data are the geographic farm locations; we use geographical county centroids to measure the distances among counties.
We estimate the swine movement probabilities among sub-populations for the State of Iowa, where a sub-population is denoted as the swine population in a size group in a county. Iowa has the largest swine inventory (31.43%) in the US 25 . In the list of America's top 100 pig farming counties, 42 counties are from Iowa alone 26 . It is also the most vulnerable state for the introduction of classical swine fever and African swine fever viruses due to legal import of live swine 27 . Iowa has 99 counties in total, the number of swine sub-populations in our optimization problem is 99 × 7. swine movement probability estimation. To estimate the pig movement probabilities in a week among different sub-populations, we use a convex optimization problem. This convex optimization problem consists of two steps: (1) estimation of the non-disclosed data points in the inventory and sales data and (2) estimation of movement probabilities among different sub-populations.
To estimate non-disclosed points in the inventory data, we formulate an entropy function. By maximizing this function, we estimate the data points with minimum assumptions 28 . This process is detailed in Schumm et al. 9 . In step 2, we construct a convex optimization problem, which includes a series of linear constraints. The purpose of this problem is to maximize the entropy of the distributions of the objective function, the distributions of the objective function for a sub-population are presented in Fig. 1. The maximum entropy is a well-known method of statistical inference, which has been used in diverse research fields including ecology, thermodynamics, economics, forensics, language processing, astronomy, image processing etc. 12,29,30 . This method produces the least biased predictions while maintaining prior knowledge constraints.
In the convex optimization problem, there are C counties and each county has I size groups. A pig from a sub-population can be moved to a sub-population in the state, or moved outside of the state, or not moved at all, or slaughtered, or lost. Therefore, a pig in a sub-population has five movement options, which construct the distributions of the objective function. We define the objective function of this estimation problem as, The objective function of this problem is to maximize the Entropy. We estimate the movement probabilities m i j dist x y d , , ( , ) , which represents the movement probability from sub-population (x, i) to sub-population (y, j) in a week. A sub-population (x, i) is the swine population in the size group i in the county x. The index variable x and i are used for the originating sub-population, x = 1, 2, 3 … C and i = 1, 2, … I. Again, y and j are the index variable for the receiving sub-populations (y, j). The superscript d marks the decision parameters. The parameter os x i d , represents movement probability from sub-population (x, i) to outside of the state in a week, rn x i d , is the probability to remain or not-moved in the sub-population (x, i) in a week, sl x i d , is the probability of pigs being slaughtered for meat from sub-population (x, i) in a week, and lt x i d , is the probability of pigs being dead or lost in sub-population (x, i) in a week. We divide the distance between counties into five classes: (1) distance ∈[0, 20), (2) distance ∈[20, 100), (3) distance ∈[100, 200), (4) distance ∈[200, 400), and (5) distance ∈[400, D max ]. D max is the maximum distance between two counties. dist(x, y) represents the distance class for the distance between county x and y. We divide the distances between all pairs of counties in that way to group them into discrete distance groups. This problem is subject to several linear constraints, which we construct from probability rules, sales data, swine population conservation etc.
As a pig can move (from the sub-population (x, i) to a sub-population in the state, or outside of the state, or slaughtered, or death) or it could stay in the sub-population, therefore the summation of these possibilities is equal to one. From the rule of the probability, we can get the following constraint for any sub-population (x, i), The probabilities in Eq. 2 are considered in the objective function.
There are three types of sales in the system, (1) sales for the movement from sub-population (x, i) to the all sub-populations in the state, (2)  The superscript r indicates published data. The parameter Iv x i r , is the swine inventory in the sub-population (x, i), and Sales x r represents the total sales from county x in a year. The parameter scaled is used to convert the timescale, www.nature.com/scientificreports www.nature.com/scientificreports/ this parameter allows us to convert the timescale from yearly to weekly basis. ET x sales is the error term for the constraint 3.
The constraint for the slaughtered swine is, The term TotalSlaughtered r represents the total number of slaughtered in a year in the system, and ET sl is the error term for slaughtered data.
The constraint for the sales to the outside of the state is; The term TotalOutshipment r is the total sales to the outside of the state in a year, and ET out is the error term for outshipment.
The constraint for the inshipments from the outside of the state is; The parameter is x i d , indicates the inshipment probability in a week from outside of the state to the sub-population (x, i), TotalInshipment r is the inshipment from outside in a year in the system, and ET in is the error term for inshipment.
The constraint for the death or lost is, The term TotalLost r represents the total number of death or lost in a year from the system, and ET lt is the error term for this constraint.
We assume that the population or inventory size of a sub-population remain constant throughout the year. Therefore, in a sub-population, the summation of the outgoing flows from the sub-population (solid black lines in Fig. 1) is equal to the summation of the incoming flows into the sub-population (dotted red lines in Fig. 1). Constraints for the population conservation are, x i r Here, Iv x i b d , , represents the breeding population, bt x i d , is the probability of birth in the sub-population (x, i) in a week, and ET x i pop , is the error term. The left side of the Eq. 8 is the summation of the outgoing flows from sub-population (x, i) and the right side is the summation of the incoming flows into the sub-population (x, i). The range for bt x i d , is (7 × 9)/115 − (7 × 12)/112 week −1 , as time period for gestation is 112-115 days and average litter rate is 9-12 25 . The range for sl x i d , was chosen based on the lifespan of market pigs in the US, which is about 25 to 28 weeks.
Constraint for the errors is, The left side of Eq. 9 represents the summation of all the errors in the optimization problem. Here, R c is a proportional constant, and TotalPopulation r is the total swine population in the system. The inequality (Eq. 9) states that the total error in the convex optimization problem should be less than equal to a fraction R c of the TotalPopulation r . The value of R c is calculated by using trial and error with an objective to minimize the total error.
Convex cost function (Eq. 1) and constraints (Eqs 2-9) constitute our optimization linear problem. The objective of this estimation problem is to maximize the entropy of the distributions of the objective function of all sub-populations. The performance of entropy measures is sensitive to different factors 31 . Maximum entropy methods can predict accurately given a prior knowledge. However, maximum entropy methods can perform poorly if the prior knowledge is insufficient or inaccurate or contains biases 32 . In our estimation problem, published USDA-NASS data are used as the prior knowledge, and the data was sufficient to solve the formulated convex optimization problem. Maximum entropy methods can also perform poorly if the system changes very rapidly 32 , which is not our case.
Network development. We develop a network using the movement parameters which are obtained using the maximum entropy optimization. The network development is done in two stages: (1) setup of the population in each farm and (2) setup of the movement links between farms.
www.nature.com/scientificreports www.nature.com/scientificreports/ In order to generate the network, first, we need the farm level estimates of the pig population. The USDA-NASS data only provide the number of farms in a size range and the number of total pigs in that range in a county. Recorded data on the number of pigs in a farm are generally not available in the US (with the exception of a few counties). To allocate the pig population, we generate random numbers for every farm in a size group i within a county x with the following constraints: (a) The random numbers fall in the range of the corresponding group i. (b) The sum of all generated numbers is equal to the total number of pigs in that sub-population (x, i).
The procedure to establish the movement links between farms is inspired by the random network model 33 . Our movement network for pig farms is represented as (V, E, W). The term V denotes the set of nodes, the term E represents the set of links or connections among individual nodes, and W denotes the weight of each link. To generate the movement network among farms, we use the following procedures: Step 1 For each pig p 1 in a sub-population (x, i), we generate a random number rand from the uniform distribution U(0, 1) for sub-population (y, j), y = 1, 2, 3, ….. C, and j = 1, 2, 3, … I. Here, C is the number of counties in the system and I is the number of size groups. Step ( , ) , a link is created from pig p 1 to another pig p 2 . Pig p 2 is picked randomly from the sub-population (y, j).
Step 3 If there is no link from the parent farm f 1 of pig p 1 to the parent farm f 2 of pig p 2 , we create a link flink from f 1 to f 2 . Otherwise, if a link already exists, we increase its weight by 1.
Step 4 For each sub-population (x, i), we repeat Steps 1-3. This process produces a directed weighted network at the farm level. Links or connections among farms represent swine movement. The weight of a link represents the volume of movements occurring from one farm to another.

Network analysis.
To capture the particular features of the developed network, we compute the following network analysis metrics: node strength, betweenness, eigenvector, clustering coefficient, and average shortest path [33][34][35] . Centrality measures can help us determine the most important or central nodes in a network.
The node strength-centrality measure is the strength of the nodes or sum of the weights of the edges connected to it 36 . In a directed network, the nodes have two types of vertex-strength centralities: (1) in-strength InS, and (2) out-strength OuS.
l NB k kl ( ) Here, w lk is the connection strength of the edge/link from node l to node k, NB(k) is the set of the neighbors of node k. Vertex strength can be illuminating in the investigation of diseases spreading. A high in-strength node has a high risk of receiving an infection. On the other hand, a high out-strength node is influential over the network, as such a node can infect many more nodes.
The betweenness centrality measure suggests which nodes are important in the connection flow or act as bridges in the network. Betweenness centrality of a node measures how many shortest paths between different pairs of nodes go through that particular node. The shortest path between two nodes is the path with the fewest number of connections. Nodes with high betweenness centrality have high control over movement flow (here, concerning flow of swine) in the network. Removal of such nodes can effectively reduce connectivity in the network. Knowledge of these nodes can be useful in controlling outbreaks 37 . Let, p st be the number of shortest paths from s∈N to t∈N. We denote, p st (k) to be the number of shortest paths from s to t, that includes node k somewhere in between. The betweenness centrality of a node k is defined 38 as: Eigenvector centrality is an extension of the degree/strength centrality. In the eigenvector centrality measure, the centrality of a node is proportional to the sum of the centralities of its neighbors. Here, e(k) is the eigenvector centrality of the node k, and λ 1 is the largest eigenvalue of the adjacency matrix [a kl ] of the network. Eigenvector centrality of a node can be large if either it has many neighbors or it has important neighbors. Nodes with high eigenvector centralities have high probabilities of becoming infected 39,40 .
The clustering coefficient measures local group cohesiveness. The clustering coefficient Cc(k) for a node k is the ratio of the number of edges among the neighbors of k and the maximum possible number of such edges (for the fully-connected network formed by the neighbors of node k). If neighboring nodes of node k has c k connections among them then clustering coefficient can be defined as 35 The average shortest path is the average of the shortest path length between all pairs of nodes in the network.

Results
Movement probability estimation. In this research, we solve a convex optimization problem to estimate the swine movement probabilities by using the maximum entropy approach for Iowa. We utilized the AIMMS modeling system 41 of Paragon Decision Technology to solve this convex optimization problem. The time-scale of our estimation problem is weekly, which we controlled it by using scaled = 52 weeks/year. The boundary of error limit in our system is 5.45% of total swine population in Iowa (R c = 5.45%). The estimated probabilities are given in Table 1. This table shows swine movement probabilities between size groups for five different distance ranges. The highest movement probability is from size7 to size7 sub-population when the distance between them is less than 20 km. We divide seven size groups into three categories; size: 1-3(small farms), 4-5(medium farms), and  For these 12 counties, we have developed a movement network (V, E, W), which is shown in Fig. 2. This network is a realization based on the movement probabilities from Table 1. For the network, |V| = 641 and |E| = 22, 461, the description of the nodes, and the adjacency list for this network is provided in the Supplementary Material Dataset 2 and 3. In Fig. 2, this network has seven types of nodes representing the seven size groups. A description of size groups is presented in Table 2. The largest group is the size7, contains 393 nodes which are presented by light blue. There are 17484 edges among the nodes of this group (67.41% of total edges).   www.nature.com/scientificreports www.nature.com/scientificreports/ Network analysis. The clustering coefficient of the full network is 0.363, the diameter of the network is 7, and the average shortest path length is 2.598. A summary of various centrality measures for the network is provided in Table 3. Node-strength, betweenness, eigenvector and clustering coefficient centrality for seven size groups are presented here. In-strength, out-strength, betweenness, and eigenvector centralities were calculated from the overall network. Clustering coefficients in Table 3 were calculated for networks of the same size group (any node and its neighbors are in the same size group). We used the open source package Gephi to visualize and analyze the network 42 .

Size1
Size2 In  Table 3. A summary of centrality measures for different size groups in the network. www.nature.com/scientificreports www.nature.com/scientificreports/ From the node-strength centrality measures, we observe that the average node-strength is positively correlated with the size groups. Larger size groups have higher average node-strengths. Consequently, size7 has the highest average node-strength ( Table 3). The node-strength distribution is provided in Fig. 3. In the network, only a few nodes have high strength and most of the nodes have low strength. This characteristic is similar to the power-law distribution. The range of in-strength is 0-1426. About 90.95% of the total nodes have in-strengths less than 285, which is merely the first 20% of the in-strength range. The range for out-strength is 0-1372. About 91.11% of the total nodes have out-strengths less than 274, which is within the first 20% of the range of out-strength values. The correlation coefficient between in-strength and out-strength is 0.9523, which is an indication of strong correlation.
The betweenness centrality is positively correlated with size groups until group6, after which farms in the group7 have lower betweenness. The farms in group6 have the highest average betweenness. The distribution of betweenness centrality measure is given in Fig. 4. Most of the farms have low betweenness. Few farms act as hubs in the network which have high betweenness. The range for betweenness is 0-15229. We divide the nodes into three groups, (1) low-betweenness (0-50), (2) medium-betweenness (51-500), and (3) high-betweenness (>500). These three groups contain 183, 302, and 156 nodes respectively. These three groups are illustrated in Fig. 5. In the low-betweenness group majority of the nodes are from small size groups, in the medium-betweenness group most of the nodes are from group7, and in the high-betweenness group, most of the nodes are from group6.
The mean eigenvector centrality is positively correlated with the size groups. Larger size groups have higher eigenvector centralities (Table 3). We have divided the nodes (farms) into three groups: (1) low-eigenvector central nodes (0-0.1), (2) medium-eigenvector central nodes (0.11-0.3), and (3) high-eigenvector central nodes (0.31-1). The low-eigenvector central group consists of 298 nodes, the medium group consists of 233 nodes, and the high group contains the rest of the nodes. The network for different eigenvector groups is presented in Fig. 6. Clustering coefficient for group size 7 is 0.755, which is quite high. The nodes from this group form several clusters, which are quite visible in Figs 2 and 6.
In the network, the importance of links is another useful topic to study 17 . From the link strength or weight distribution, we can see that the majority of the links have a low weight however very few links have high weight (Fig. 7). A link with high-weight represents a high volume swine movement. For a susceptible farm, an infected neighbor connected by a high-strength-link is riskier than an infected neighbor connected by a low-strength-link.

Discussion
In this study, we have three objectives: (1) we compute optimal estimates swine movement probabilities among counties from the aggregated data of USDA-NASS, (2) we develop a realization of the network from the estimated probabilities, and (3) we analyze the developed network with different network analysis metrics.
Animal movement has been one of the major causes of diseases spread among farms for several outbreaks in the US swine industry. A better understanding of the swine movement network can increase the feasibility of planning effective mitigation strategies that can reduce the risk of disease spread. There is no mandatory animal movement tracking system in the US due to the industry preference for privacy in the swine business. We have estimated the movements among different swine sub-populations using a convex optimization problem, have formulated according to the USDA-NASS data. The discrepancy from our optimization problem is about 5.45% of the total swine population, which is slightly higher than that of a similar work on cattle movement probability estimation 9 due to a greater amount of data available for cattle. Our estimation can be improved if more data are available. The additional data that would improve the results most is the type of swine operations (for example, nursery, farrow-to-feeder, farrow-to-wean, farrow-to-finish, finish only etc.) at the county level. The USDA-NASS department can collect and publish this information in future reports, as this additional data would not hamper the anonymity of the Census of Agriculture yet greatly improve movement estimations.  www.nature.com/scientificreports www.nature.com/scientificreports/ The network development algorithm can provide us a realization of the network from the estimated movement probabilities. The generated swine movement network was well connected with a giant component containing 95.94% of the farms. The implication of this high connectivity is that the swine industry may be vulnerable to infectious diseases. All the disconnected farms were smaller farms (inventory size less than 100) where most of them produce meat for their own consumption (60.5% of all small swine farms) 43 . In addition to that, most of these small farms are engaged in all of the phases of swine production (farrow-to-finish producers) 44 . On the other hand, larger farms have more connections among them. One possible reason could be that most of the large farms are specialized in a single production phase to increase productivity 45,46 . Consequently, pig shipments are very frequent among them.
We use centrality measures to understand the characteristics of the movement network. From the analysis of the node-strength centrality measure, we notice that many nodes in the network have low node-strength however very few nodes have high node-strength, who work as hubs in the network. The node-strength distribution of the network is similar to that of scale-free networks (Fig. 3). Compared to a random network, epidemics can spread faster in a scale-free network. In addition to that, scale-free networks have lower epidemic threshold than comparable random networks 47 . This information could be useful because targeted vaccination/node-removal is more effective in scale-free structures than random vaccination 48 . The vaccination, or culling, or quarantine of the hubs (farms with high node-strength) can be crucial to control an epidemic.
If we analyze the average shortest path length and the clustering coefficient of the overall network, we see evidence of the small-world phenomenon in the network. The average path length was similar and clustering coefficient was more than six times larger compared to the similar properties of the equivalent Erdos-Renyi random network 49 , which satisfy the sufficient conditions for small-world properties of the network 50 . The US swine movement network structure is quite vulnerable to any pathogen spreading because of its small-world nature. This result is similar to other studies as well [20][21][22] . This network has high local clustering. Size7 group (larger operations: headcount is more than 1000) has the highest amount of local clustering (Figs 2 and 6). Therefore, large operations are highly interconnected, making them more vulnerable to outbreaks. Moreover, the structure of the US swine industry has been changing over several years. The number of large operations is increasing, where most of them specialize in one particular phase of production. These changes are increasing the risk for disease outbreaks in the swine industry.
The correlation between in-strength (incoming movements) and out-strength (outgoing movements) is strong. The nodes with high out-strength values also have high in-strength values. This is an important indicator as the nodes with a high risk of receiving infection are also highly capable of spreading them.
Although the group size7 (largest operations) has the highest values of node-strength, clustering coefficient, and eigenvector centralities it is not necessarily highest in terms of the betweenness centrality measure. We found that group size6 has the highest betweenness centrality values ( Table 3). The groups size4 and size5 also show high betweenness. The above-mentioned properties indicate that the group size7 forms various clusters in the network, where the operations are highly connected. The operations of medium size, however, maintain the connectivity among the clusters of the largest group. Hence, these medium size operations play a key role in the system. During www.nature.com/scientificreports www.nature.com/scientificreports/ an epidemic, it is possible to use these high betweenness farms to disconnect the movement network and confine the disease in a smaller part of the network.
We make several assumptions to simplify our model as all necessary data are not available. We assume that the inventory size of the operations is constant on a year-to-year basis. We also consider that movement flows are the same throughout the year because of the resolution limitation of the available data. However, movement flows can be different from one season to another season. The movement flows also can be sensitive to other factors, for example, production technology, business strategy, and food availability. However, we do not have specific knowledge about these factors at this point and inclusion of too many unknown factors increases the complexity of the estimation problem given the limited data. Our estimation steps can be easily adapted by adding more constraints when more data are available.
One immediate use of this network could be the investigation of the stochastic spreading processes [51][52][53][54][55] . This kind of study can help us understand the underlying mechanisms and threshold conditions of spreading processes for various swine diseases including porcine reproductive and respiratory syndrome (PRRS), classical swine fever (CSF), African swine fever (ASF) and many more.
In summary, we present a maximum entropy approach to estimate the swine movement network from aggregated anonymous census data. This method can be used to estimate movement probabilities of other farm animals too for various locations.

Data Availability
The dataset used to perform this research is available from https://quickstats.nass.usda.gov/, https://quickstats. nass.usda.gov/. The authors are willing to provide further details upon request.