Introduction

Livestock are often moved between facilities to reduce costs and improve productivity. There is an old adage, “Livestock follow the grain”. Even now this aphorism seems true, as shipping animals is less expensive than shipping grains, which are required for animals to attain their slaughter weights. The corn-belt region (Iowa, Missouri, Illinois, Indiana, and Ohio states) is the largest market for feeder pigs1 because they are the largest producers of two major sources of hog rations (corn and soybeans). Although movements in the livestock industry can reduce the cost of production, movements have a major role in the risk of pathogens spread. Movement of swine among farms is one of the major pathways for the spread of several diseases (e.g., Porcine reproductive and respiratory syndrome-PRRS, Porcine epidemic diarrhea-PED etc.) in the United States (US) swine industry2,3. Knowledge of livestock movement can be useful in the control of pathogen spread. In Europe, there are several well-established animal tracking systems. However, similar programs are yet to be mandated for the US. In the US, a comprehensive livestock tracking system has not been implemented because of a cultural preference for privacy and competition between producers4. The United State Department of Agriculture (USDA) collects movement information when livestock shipments cross state boundaries. There is no program that collects movement information at the county or farm level.

In the prior literature, several models have been developed to understand swine movement in different regions of the US4,5,6. However, all of them used confidential incomplete datasets, which are not publicly accessible, and also which are not inclusive of the whole US. Yadav et al.5 developed a model to understand classical swine fever outbreak-related outcomes in Indiana. They used data from USAHerds (US Animal Health Emergency Reporting and Diagnostic System), where import-export activities, location of import origin, receiving swine premises, shipment size and shipment date are listed. However, only 22% of the states participates in the USAHerds program. Another research group predicted movement networks of the swine industry for some counties of Minnesota using a machine learning approach6. They used confidential survey data from two counties to train their model. The objective of our research is to understand the swine movement network in the US from publicly available data. A network is a useful structure in the study of any spreading phenomena, where farm-level animal movement networks are used as a key component in the area of disease spreading7,8.

In this work, we estimate the swine movement probabilities between counties based on published inventory and sales data from the USDA Census of Agriculture. We develop a convex optimization problem with some linear constraints for the US swine industry. To solve this problem, we adapt the cattle movement model from Schumm et al.9 for the swine population. In particular, we maximize the entropy of the distributions of the objective function (Eq. 1). Maximum information entropy methods have been used in various research fields10,11,12. The maximum entropy principle states that the best way to approximate the unknown distribution that satisfies all the constraints will have the maximum entropy13.

We propose a novel algorithm to develop a farm level swine movement network using the estimated swine movement probabilities. In this network, nodes (or vertices) represent swine-farms and directed links (or edges or connections) represent directional swine movements between the farms. Network realizations from the interactions among the elements of different dynamic systems can be seen several times in the literature; for example, weighted network for worldwide air transportation14, network for collaboration among scientists14, network to understand complex intercellular interactions15, and network to represent interplay among different physiological systems16,17,18,19. To understand the generated swine movement network, we use network centrality measures. They have been used often in the literature to understand the livestock movement patterns20,21,22. The network centrality measures can assist in detection of the important farms, which can control the movement flows in the network. This information can be useful to plan effective mitigation strategies to reduce an epidemic size. In the literature, researchers have used targeted vaccination, or quarantine, or culling of important agents to control epidemics23,24. The network centrality measure also can help us to understand the movement pattern. From the analysis of the developed swine movement network, we find a trace of the small world phenomenon and the presence of hubs in the US swine movement network.

Materials and Methods

First, we develop a convex optimization problem to estimate swine movement probabilities. Next, we propose an algorithm to develop a network based on those probabilities, where nodes or vertices are farms or operations and edges among them represent swine movement. Finally, we analyze the network using different network analysis metrics.

Data

We have collected the hog inventory, sales, slaughter, and dead/lost pig data from the United States Department of Agriculture National Agricultural Statistics Service (USDA-NASS)25. The USDA-NASS conducts a census every five years, which compiles a uniform, comprehensive agricultural data set for each county of the entire US. We used the data from the 2012 Census of Agriculture, as the census of 2017 is not published fully at the time of this research. For each county, two sets of data are available: (1) inventory and (2) sales. In both types, pigs are grouped into seven classes based on operation/farm size. These groups are: size1 (1–24 pigs), size2 (25–49 pigs), size3 (50–99 pigs), size4 (100–199 pigs), size5 (200–499 pigs), size6 (500–999 pigs), and size7 (more than 1000 pigs). For each size group, data for the number of operations and the number of pigs are available. However, several data points are not published to maintain anonymity; we estimate those to develop the network model. The study time of this research is the year 2012. We have assumed that the inventory sizes are constant throughout the year because of the resolution limitation of the available data. Another set of missing data are the geographic farm locations; we use geographical county centroids to measure the distances among counties.

We estimate the swine movement probabilities among sub-populations for the State of Iowa, where a sub-population is denoted as the swine population in a size group in a county. Iowa has the largest swine inventory (31.43%) in the US25. In the list of America’s top 100 pig farming counties, 42 counties are from Iowa alone26. It is also the most vulnerable state for the introduction of classical swine fever and African swine fever viruses due to legal import of live swine27. Iowa has 99 counties in total, the number of swine sub-populations in our optimization problem is 99 × 7.

Swine movement probability estimation

To estimate the pig movement probabilities in a week among different sub-populations, we use a convex optimization problem. This convex optimization problem consists of two steps: (1) estimation of the non-disclosed data points in the inventory and sales data and (2) estimation of movement probabilities among different sub-populations.

To estimate non-disclosed points in the inventory data, we formulate an entropy function. By maximizing this function, we estimate the data points with minimum assumptions28. This process is detailed in Schumm et al.9. In step 2, we construct a convex optimization problem, which includes a series of linear constraints. The purpose of this problem is to maximize the entropy of the distributions of the objective function, the distributions of the objective function for a sub-population are presented in Fig. 1. The maximum entropy is a well-known method of statistical inference, which has been used in diverse research fields including ecology, thermodynamics, economics, forensics, language processing, astronomy, image processing etc.12,29,30. This method produces the least biased predictions while maintaining prior knowledge constraints.

Figure 1
figure 1

The movement flows of a sub-population (x, i). Solid black lines represent the outgoing flows from the sub-population, dotted red lines represent the incoming flows into the sub-population, and the blue solid line represents the possibility to stay or not moved. Solid lines (black and blue) form the distributions of the objective function. The probability of each movement are shown with the arrows.

In the convex optimization problem, there are C counties and each county has I size groups. A pig from a sub-population can be moved to a sub-population in the state, or moved outside of the state, or not moved at all, or slaughtered, or lost. Therefore, a pig in a sub-population has five movement options, which construct the distributions of the objective function. We define the objective function of this estimation problem as,

$$\begin{array}{rcl}max\{Entropy\} & = & max\{-\sum _{x\in C}\,\sum _{i\in I}\,\sum _{y\in C}\,\sum _{j\in I}\,{m}_{i,j,dist(x,y)}^{d}\ast log({m}_{i,j,dist(x,y)}^{d})\\ & & -\,\sum _{x\in C}\,\sum _{i\in I}\,o{s}_{x,i}^{d}\ast log(o{s}_{x,i}^{d})-\sum _{x\in C}\,\sum _{i\in I}\,r{n}_{x,i}^{d}\ast log(r{n}_{x,i}^{d})\\ & & -\,\sum _{x\in C}\,\sum _{i\in I}\,s{l}_{x,i}^{d}\ast log(s{l}_{x,i}^{d})-\sum _{x\in C}\,\sum _{i\in I}\,l{t}_{x,i}^{d}\ast log(l{t}_{x,i}^{d})\}\end{array}$$
(1)

The objective function of this problem is to maximize the Entropy. We estimate the movement probabilities \({m}_{i,j,dist(x,y)}^{d}\), which represents the movement probability from sub-population (x, i) to sub-population (y, j) in a week. A sub-population (x, i) is the swine population in the size group i in the county x. The index variable x and i are used for the originating sub-population, x = 1, 2, 3 … C and i = 1, 2, … I. Again, y and j are the index variable for the receiving sub-populations (y, j). The superscript d marks the decision parameters. The parameter \(o{s}_{x,i}^{d}\) represents movement probability from sub-population (x, i) to outside of the state in a week, \(r{n}_{x,i}^{d}\) is the probability to remain or not-moved in the sub-population (x, i) in a week, \(s{l}_{x,i}^{d}\) is the probability of pigs being slaughtered for meat from sub-population (x, i) in a week, and \(l{t}_{x,i}^{d}\) is the probability of pigs being dead or lost in sub-population (x, i) in a week. We divide the distance between counties into five classes: (1) distance [0, 20), (2) distance [20, 100), (3) distance [100, 200), (4) distance [200, 400), and (5) distance [400, Dmax]. Dmax is the maximum distance between two counties. dist(x, y) represents the distance class for the distance between county x and y. We divide the distances between all pairs of counties in that way to group them into discrete distance groups. This problem is subject to several linear constraints, which we construct from probability rules, sales data, swine population conservation etc.

As a pig can move (from the sub-population (x, i) to a sub-population in the state, or outside of the state, or slaughtered, or death) or it could stay in the sub-population, therefore the summation of these possibilities is equal to one. From the rule of the probability, we can get the following constraint for any sub-population (x, i),

$$\sum _{y\in C}\,\sum _{j\in I}\,{m}_{i,j,dist(x,y)}^{d}+o{s}_{x,i}^{d}+r{n}_{x,i}^{d}+s{l}_{x,i}^{d}+l{t}_{x,i}^{d}=1\,\,\forall (x,i)$$
(2)

The probabilities in Eq. 2 are considered in the objective function.

There are three types of sales in the system, (1) sales for the movement from sub-population (x, i) to the all sub-populations in the state, (2) sales for the movement to the outside of the state, and (3) sales for slaughter. Constraint for the sales or movement from any county x is,

$$\begin{array}{l}\sum _{i\in I}\,\sum _{y\in C}\,\sum _{j\in I}\,I{v}_{x,i}^{r}\ast {m}_{i,j,dist(x,y)}^{d}+\sum _{i\in I}\,I{v}_{x,i}^{r}\ast s{l}_{x,i}^{d}+\sum _{i\in I}\,I{v}_{x,i}^{r}\ast o{s}_{x,i}^{d}+E{T}_{x}^{sales}=\frac{Sale{s}_{x}^{r}}{scaled}\,\,\forall x\end{array}$$
(3)

The superscript r indicates published data. The parameter \(I{v}_{x,i}^{r}\) is the swine inventory in the sub-population (x, i), and \(Sale{s}_{x}^{r}\) represents the total sales from county x in a year. The parameter scaled is used to convert the timescale, this parameter allows us to convert the timescale from yearly to weekly basis. \(E{T}_{x}^{sales}\) is the error term for the constraint 3.

The constraint for the slaughtered swine is,

$$\sum _{x\in C}\,\sum _{i\in I}\,I{v}_{x,i}^{r}\ast s{l}_{x,i}^{d}+E{T}^{sl}=\frac{TotalSlaughtere{d}^{r}}{scaled}$$
(4)

The term TotalSlaughteredr represents the total number of slaughtered in a year in the system, and ETsl is the error term for slaughtered data.

The constraint for the sales to the outside of the state is;

$$\sum _{x\in C}\,\sum _{i\in I}\,I{v}_{x,i}^{r}\ast o{s}_{x,i}^{d}+E{T}^{out}=\frac{TotalOutshipmen{t}^{r}}{scaled}$$
(5)

The term TotalOutshipmentr is the total sales to the outside of the state in a year, and ETout is the error term for outshipment.

The constraint for the inshipments from the outside of the state is;

$$\sum _{x\in C}\,\sum _{i\in I}\,I{v}_{x,i}^{r}\ast i{s}_{x,i}^{d}+E{T}^{in}=\frac{TotalInshipmen{t}^{r}}{scaled}$$
(6)

The parameter \(i{s}_{x,i}^{d}\) indicates the inshipment probability in a week from outside of the state to the sub-population (x, i), TotalInshipmentr is the inshipment from outside in a year in the system, and ETin is the error term for inshipment.

The constraint for the death or lost is,

$$\sum _{x\in C}\,\sum _{i\in I}\,I{v}_{x,i}^{r}\ast l{t}_{x,i}^{d}+E{T}^{lt}=\frac{TotalLos{t}^{r}}{scaled}$$
(7)

The term TotalLostr represents the total number of death or lost in a year from the system, and ETlt is the error term for this constraint.

We assume that the population or inventory size of a sub-population remain constant throughout the year. Therefore, in a sub-population, the summation of the outgoing flows from the sub-population (solid black lines in Fig. 1) is equal to the summation of the incoming flows into the sub-population (dotted red lines in Fig. 1). Constraints for the population conservation are,

$$\begin{array}{l}I{v}_{x,i}^{r}\ast [\sum _{y\in C}\sum _{j\in I}{m}_{i,j,dist(x,y)}^{d}]+I{v}_{x,i}^{r}\ast s{l}_{x,i}^{d}+I{v}_{x,i}^{r}\ast l{t}_{x,i}^{d}+I{v}_{x,i}^{r}\ast o{s}_{x,i}^{d}\\ \,=\sum _{y\in C}\sum _{j\in I}I{v}_{y,j}^{r}\ast {m}_{j,i,dist(y,x)}^{d}+I{v}_{x,i,b}^{d}\ast b{t}_{x,i}^{d}+I{v}_{x,i}^{r}\ast i{s}_{x,i}^{d}+E{T}_{x,i}^{pop}\,\,\forall (x,i)\end{array}$$
(8)

Here, \(I{v}_{x,i,b}^{d}\) represents the breeding population, \(b{t}_{x,i}^{d}\) is the probability of birth in the sub-population (x, i) in a week, and \(E{T}_{x,i}^{pop}\) is the error term. The left side of the Eq. 8 is the summation of the outgoing flows from sub-population (x, i) and the right side is the summation of the incoming flows into the sub-population (x, i). The range for \(b{t}_{x,i}^{d}\) is (7 × 9)/115 − (7 × 12)/112 week−1, as time period for gestation is 112–115 days and average litter rate is 9–1225. The range for \(s{l}_{x,i}^{d}\) was chosen based on the lifespan of market pigs in the US, which is about 25 to 28 weeks.

Constraint for the errors is,

$$\sum _{x\in C}\,|E{T}_{x}^{sales}|+|E{T}^{sl}|+|E{T}^{in}|+|E{T}^{out}|+|E{T}^{lt}|+\sum _{x\in C}\,\sum _{i\in I}\,|E{T}_{x,i}^{pop}|\,\le {R}_{c}\ast TotalPopulatio{n}^{r}$$
(9)

The left side of Eq. 9 represents the summation of all the errors in the optimization problem. Here, Rc is a proportional constant, and TotalPopulationr is the total swine population in the system. The inequality (Eq. 9) states that the total error in the convex optimization problem should be less than equal to a fraction Rc of the TotalPopulationr. The value of Rc is calculated by using trial and error with an objective to minimize the total error.

Convex cost function (Eq. 1) and constraints (Eqs 29) constitute our optimization linear problem. The objective of this estimation problem is to maximize the entropy of the distributions of the objective function of all sub-populations. The performance of entropy measures is sensitive to different factors31. Maximum entropy methods can predict accurately given a prior knowledge. However, maximum entropy methods can perform poorly if the prior knowledge is insufficient or inaccurate or contains biases32. In our estimation problem, published USDA-NASS data are used as the prior knowledge, and the data was sufficient to solve the formulated convex optimization problem. Maximum entropy methods can also perform poorly if the system changes very rapidly32, which is not our case.

Network development

We develop a network using the movement parameters which are obtained using the maximum entropy optimization. The network development is done in two stages: (1) setup of the population in each farm and (2) setup of the movement links between farms.

In order to generate the network, first, we need the farm level estimates of the pig population. The USDA-NASS data only provide the number of farms in a size range and the number of total pigs in that range in a county. Recorded data on the number of pigs in a farm are generally not available in the US (with the exception of a few counties). To allocate the pig population, we generate random numbers for every farm in a size group i within a county x with the following constraints:

  1. (a)

    The random numbers fall in the range of the corresponding group i.

  2. (b)

    The sum of all generated numbers is equal to the total number of pigs in that sub-population (x, i).

The procedure to establish the movement links between farms is inspired by the random network model33. Our movement network for pig farms is represented as (V, E, W). The term V denotes the set of nodes, the term E represents the set of links or connections among individual nodes, and W denotes the weight of each link. To generate the movement network among farms, we use the following procedures:

Step 1 For each pig p1 in a sub-population (x, i), we generate a random number rand from the uniform distribution U(0, 1) for sub-population (y, j), y = 1, 2, 3, ….. C, and j = 1, 2, 3, … I. Here, C is the number of counties in the system and I is the number of size groups.

Step 2 If \(rand < ={m}_{i,j,dist(x,y)}^{d}\), a link is created from pig p1 to another pig p2. Pig p2 is picked randomly from the sub-population (y, j).

Step 3 If there is no link from the parent farm f1 of pig p1 to the parent farm f2 of pig p2, we create a link flink from f1 to f2. Otherwise, if a link already exists, we increase its weight by 1.

Step 4 For each sub-population (x, i), we repeat Steps 1–3.

This process produces a directed weighted network at the farm level. Links or connections among farms represent swine movement. The weight of a link represents the volume of movements occurring from one farm to another.

Network analysis

To capture the particular features of the developed network, we compute the following network analysis metrics: node strength, betweenness, eigenvector, clustering coefficient, and average shortest path33,34,35. Centrality measures can help us determine the most important or central nodes in a network.

The node strength-centrality measure is the strength of the nodes or sum of the weights of the edges connected to it36. In a directed network, the nodes have two types of vertex-strength centralities: (1) in-strength InS, and (2) out-strength OuS.

$$InS(k)=\sum _{l\in NB(k)}\,{w}_{lk}$$
(10)
$$OuS(k)=\sum _{l\in NB(k)}\,{w}_{kl}$$
(11)

Here, wlk is the connection strength of the edge/link from node l to node k, NB(k) is the set of the neighbors of node k. Vertex strength can be illuminating in the investigation of diseases spreading. A high in-strength node has a high risk of receiving an infection. On the other hand, a high out-strength node is influential over the network, as such a node can infect many more nodes.

The betweenness centrality measure suggests which nodes are important in the connection flow or act as bridges in the network. Betweenness centrality of a node measures how many shortest paths between different pairs of nodes go through that particular node. The shortest path between two nodes is the path with the fewest number of connections. Nodes with high betweenness centrality have high control over movement flow (here, concerning flow of swine) in the network. Removal of such nodes can effectively reduce connectivity in the network. Knowledge of these nodes can be useful in controlling outbreaks37. Let, pst be the number of shortest paths from sN to tN. We denote, pst(k) to be the number of shortest paths from s to t, that includes node k somewhere in between. The betweenness centrality of a node k is defined38 as:

$$B(k)=\sum _{s\ne k\ne t\in N}\,\frac{{p}_{st}(k)}{{p}_{st}}$$
(12)

Eigenvector centrality is an extension of the degree/strength centrality. In the eigenvector centrality measure, the centrality of a node is proportional to the sum of the centralities of its neighbors.

$$e(k)={\lambda }_{1}^{-1}\ast \sum _{l\in NB(k)}\,e(l)$$
(13)

Here, e(k) is the eigenvector centrality of the node k, and λ1 is the largest eigenvalue of the adjacency matrix [akl] of the network. Eigenvector centrality of a node can be large if either it has many neighbors or it has important neighbors. Nodes with high eigenvector centralities have high probabilities of becoming infected39,40.

The clustering coefficient measures local group cohesiveness. The clustering coefficient Cc(k) for a node k is the ratio of the number of edges among the neighbors of k and the maximum possible number of such edges (for the fully-connected network formed by the neighbors of node k). If neighboring nodes of node k has ck connections among them then clustering coefficient can be defined as35:

$$Cc(k)=\frac{{c}_{k}}{|NB(k)|(|NB(k)|-\mathrm{1)/2}}$$
(14)

The average shortest path is the average of the shortest path length between all pairs of nodes in the network.

Results

Movement probability estimation

In this research, we solve a convex optimization problem to estimate the swine movement probabilities by using the maximum entropy approach for Iowa. We utilized the AIMMS modeling system41 of Paragon Decision Technology to solve this convex optimization problem. The time-scale of our estimation problem is weekly, which we controlled it by using scaled = 52 weeks/year. The boundary of error limit in our system is 5.45% of total swine population in Iowa (Rc = 5.45%). The estimated probabilities are given in Table 1. This table shows swine movement probabilities between size groups for five different distance ranges. The highest movement probability is from size7 to size7 sub-population when the distance between them is less than 20 km. We divide seven size groups into three categories; size: 1–3(small farms), 4–5(medium farms), and 6–7(large farms). From Table 1, we can notice that the movement probabilities from large farms to small farms are small and vice versa.

Table 1 Estimated swine movement probabilities mi,j,dist(x,y) × 103 from maximum entropy approach.

Network description

We generate a swine movement network for the central agricultural district of Iowa. It has 12 counties: Boone, Dallas, Grundy, Hamilton, Hardin, Jasper, Marshall, Polk, Poweshiek, Story, Tama, and Webster. The total number of farms in those 12 counties is 641, while the net pig population is 2,600,888, which is 12.71% of the total pig population in Iowa. Grundy, Hamilton, Hardin, Jasper, Marshall, and Webster Counties are within the America’s top 100 pork producer counties. Among these, Hardin County is in the 9th position. The descriptions of pig inventories for the above-mentioned counties are provided in the Supplementary Material Dataset 1.

For these 12 counties, we have developed a movement network (V, E, W), which is shown in Fig. 2. This network is a realization based on the movement probabilities from Table 1. For the network, |V| = 641 and |E| = 22, 461, the description of the nodes, and the adjacency list for this network is provided in the Supplementary Material Dataset 2 and 3. In Fig. 2, this network has seven types of nodes representing the seven size groups. A description of size groups is presented in Table 2. The largest group is the size7, contains 393 nodes which are presented by light blue. There are 17484 edges among the nodes of this group (67.41% of total edges).

Figure 2
figure 2

Movement Network for the pig population at the farm level. Different colors represent different size groups. Farms are divided into 7 size groups, size: 1–3(small farms), 4–5(medium farms), and 6–7(large farms).

Table 2 A summary of the size groups in the network.

Network analysis

The clustering coefficient of the full network is 0.363, the diameter of the network is 7, and the average shortest path length is 2.598. A summary of various centrality measures for the network is provided in Table 3. Node-strength, betweenness, eigenvector and clustering coefficient centrality for seven size groups are presented here. In-strength, out-strength, betweenness, and eigenvector centralities were calculated from the overall network. Clustering coefficients in Table 3 were calculated for networks of the same size group (any node and its neighbors are in the same size group). We used the open source package Gephi to visualize and analyze the network42.

Table 3 A summary of centrality measures for different size groups in the network.

From the node-strength centrality measures, we observe that the average node-strength is positively correlated with the size groups. Larger size groups have higher average node-strengths. Consequently, size7 has the highest average node-strength (Table 3). The node-strength distribution is provided in Fig. 3. In the network, only a few nodes have high strength and most of the nodes have low strength. This characteristic is similar to the power-law distribution. The range of in-strength is 0–1426. About 90.95% of the total nodes have in-strengths less than 285, which is merely the first 20% of the in-strength range. The range for out-strength is 0–1372. About 91.11% of the total nodes have out-strengths less than 274, which is within the first 20% of the range of out-strength values. The correlation coefficient between in-strength and out-strength is 0.9523, which is an indication of strong correlation.

Figure 3
figure 3

Node strength distribution of the directed network. (a) In-strength, (b) out-strength.

The betweenness centrality is positively correlated with size groups until group6, after which farms in the group7 have lower betweenness. The farms in group6 have the highest average betweenness. The distribution of betweenness centrality measure is given in Fig. 4. Most of the farms have low betweenness. Few farms act as hubs in the network which have high betweenness. The range for betweenness is 0–15229. We divide the nodes into three groups, (1) low-betweenness (0–50), (2) medium-betweenness (51–500), and (3) high-betweenness (>500). These three groups contain 183, 302, and 156 nodes respectively. These three groups are illustrated in Fig. 5. In the low-betweenness group majority of the nodes are from small size groups, in the medium-betweenness group most of the nodes are from group7, and in the high-betweenness group, most of the nodes are from group6.

Figure 4
figure 4

Betweenness distribution of the network.

Figure 5
figure 5

Node groups according to betweenness. (a) nodes with low-betweenness, (b) nodes with medium-betweenness, and (c) nodes with high-betweenness. The connections among visible nodes are presented here.

The mean eigenvector centrality is positively correlated with the size groups. Larger size groups have higher eigenvector centralities (Table 3). We have divided the nodes (farms) into three groups: (1) low-eigenvector central nodes (0–0.1), (2) medium-eigenvector central nodes (0.11–0.3), and (3) high-eigenvector central nodes (0.31–1). The low-eigenvector central group consists of 298 nodes, the medium group consists of 233 nodes, and the high group contains the rest of the nodes. The network for different eigenvector groups is presented in Fig. 6. Clustering coefficient for group size 7 is 0.755, which is quite high. The nodes from this group form several clusters, which are quite visible in Figs 2 and 6.

Figure 6
figure 6

Node groups according to eigenvector centrality, (a) low-eigenvector central nodes, (b) medium-eigenvector central nodes, and (c) high-eigenvector central nodes. The connections among visible nodes are presented here.

In the network, the importance of links is another useful topic to study17. From the link strength or weight distribution, we can see that the majority of the links have a low weight however very few links have high weight (Fig. 7). A link with high-weight represents a high volume swine movement. For a susceptible farm, an infected neighbor connected by a high-strength-link is riskier than an infected neighbor connected by a low-strength-link.

Figure 7
figure 7

Link-strength or connection-weight distribution of the network. Log-log scale has used for better visualization.

Discussion

In this study, we have three objectives: (1) we compute optimal estimates swine movement probabilities among counties from the aggregated data of USDA-NASS, (2) we develop a realization of the network from the estimated probabilities, and (3) we analyze the developed network with different network analysis metrics.

Animal movement has been one of the major causes of diseases spread among farms for several outbreaks in the US swine industry. A better understanding of the swine movement network can increase the feasibility of planning effective mitigation strategies that can reduce the risk of disease spread. There is no mandatory animal movement tracking system in the US due to the industry preference for privacy in the swine business. We have estimated the movements among different swine sub-populations using a convex optimization problem, have formulated according to the USDA-NASS data. The discrepancy from our optimization problem is about 5.45% of the total swine population, which is slightly higher than that of a similar work on cattle movement probability estimation9 due to a greater amount of data available for cattle. Our estimation can be improved if more data are available. The additional data that would improve the results most is the type of swine operations (for example, nursery, farrow-to-feeder, farrow-to-wean, farrow-to-finish, finish only etc.) at the county level. The USDA-NASS department can collect and publish this information in future reports, as this additional data would not hamper the anonymity of the Census of Agriculture yet greatly improve movement estimations.

The network development algorithm can provide us a realization of the network from the estimated movement probabilities. The generated swine movement network was well connected with a giant component containing 95.94% of the farms. The implication of this high connectivity is that the swine industry may be vulnerable to infectious diseases. All the disconnected farms were smaller farms (inventory size less than 100) where most of them produce meat for their own consumption (60.5% of all small swine farms)43. In addition to that, most of these small farms are engaged in all of the phases of swine production (farrow-to-finish producers)44. On the other hand, larger farms have more connections among them. One possible reason could be that most of the large farms are specialized in a single production phase to increase productivity45,46. Consequently, pig shipments are very frequent among them.

We use centrality measures to understand the characteristics of the movement network. From the analysis of the node-strength centrality measure, we notice that many nodes in the network have low node-strength however very few nodes have high node-strength, who work as hubs in the network. The node-strength distribution of the network is similar to that of scale-free networks (Fig. 3). Compared to a random network, epidemics can spread faster in a scale-free network. In addition to that, scale-free networks have lower epidemic threshold than comparable random networks47. This information could be useful because targeted vaccination/node-removal is more effective in scale-free structures than random vaccination48. The vaccination, or culling, or quarantine of the hubs (farms with high node-strength) can be crucial to control an epidemic.

If we analyze the average shortest path length and the clustering coefficient of the overall network, we see evidence of the small-world phenomenon in the network. The average path length was similar and clustering coefficient was more than six times larger compared to the similar properties of the equivalent Erdos-Renyi random network49, which satisfy the sufficient conditions for small-world properties of the network50. The US swine movement network structure is quite vulnerable to any pathogen spreading because of its small-world nature. This result is similar to other studies as well20,21,22. This network has high local clustering. Size7 group (larger operations: headcount is more than 1000) has the highest amount of local clustering (Figs 2 and 6). Therefore, large operations are highly interconnected, making them more vulnerable to outbreaks. Moreover, the structure of the US swine industry has been changing over several years. The number of large operations is increasing, where most of them specialize in one particular phase of production. These changes are increasing the risk for disease outbreaks in the swine industry.

The correlation between in-strength (incoming movements) and out-strength (outgoing movements) is strong. The nodes with high out-strength values also have high in-strength values. This is an important indicator as the nodes with a high risk of receiving infection are also highly capable of spreading them.

Although the group size7 (largest operations) has the highest values of node-strength, clustering coefficient, and eigenvector centralities it is not necessarily highest in terms of the betweenness centrality measure. We found that group size6 has the highest betweenness centrality values (Table 3). The groups size4 and size5 also show high betweenness. The above-mentioned properties indicate that the group size7 forms various clusters in the network, where the operations are highly connected. The operations of medium size, however, maintain the connectivity among the clusters of the largest group. Hence, these medium size operations play a key role in the system. During an epidemic, it is possible to use these high betweenness farms to disconnect the movement network and confine the disease in a smaller part of the network.

We make several assumptions to simplify our model as all necessary data are not available. We assume that the inventory size of the operations is constant on a year-to-year basis. We also consider that movement flows are the same throughout the year because of the resolution limitation of the available data. However, movement flows can be different from one season to another season. The movement flows also can be sensitive to other factors, for example, production technology, business strategy, and food availability. However, we do not have specific knowledge about these factors at this point and inclusion of too many unknown factors increases the complexity of the estimation problem given the limited data. Our estimation steps can be easily adapted by adding more constraints when more data are available.

One immediate use of this network could be the investigation of the stochastic spreading processes51,52,53,54,55. This kind of study can help us understand the underlying mechanisms and threshold conditions of spreading processes for various swine diseases including porcine reproductive and respiratory syndrome (PRRS), classical swine fever (CSF), African swine fever (ASF) and many more.

In summary, we present a maximum entropy approach to estimate the swine movement network from aggregated anonymous census data. This method can be used to estimate movement probabilities of other farm animals too for various locations.