The origin of motif families in food webs

Food webs have been found to exhibit remarkable “motif profiles”, patterns in the relative prevalences of all possible three-species subgraphs, and this has been related to ecosystem properties such as stability and robustness. Analysing 46 food webs of various kinds, we find that most food webs fall into one of two distinct motif families. The separation between the families is well predicted by a global measure of hierarchical order in directed networks—trophic coherence. We find that trophic coherence is also a good predictor for the extent of omnivory, defined as the tendency of species to feed on multiple trophic levels. We compare our results to a network assembly model that admits tunable trophic coherence via a single free parameter. The model is able to generate food webs in either of the two families by varying this parameter, and correctly classifies almost all the food webs in our database. This is in contrast with the two most popular food web models, the generalized cascade and niche models, which can only generate food webs within a single motif family. Our findings suggest the importance of trophic coherence in modelling local preying patterns in food webs.

that motivates a new classification of food webs which could provide insight into the controversy regarding the nature and role of omnivory 23 .
Local structural patterns in complex networks are intimately related to global network properties 24,25 . A network metric called trophic coherence was recently introduced in order to capture the degree to which the nodes fall neatly into distinct levels 21,[26][27][28] . In the context of food webs, these are the trophic levels, and high coherence corresponds to the species at one level consuming almost exclusively species at the level immediately below (i.e. low omnivory). Trophic coherence was shown to be a major predictor of the linear stability of ecosystem models, as well as of a number of structural properties of empirical food webs 21 . It has also been related to the numbers of cycles in directed networks, and to the distribution of eigenvalues of associated matrices 26 .
Trophic coherence is a structural property of directed networks that places constraints on local topological features and on the prevalence of small subgraphs in particular. In this paper we present evidence that the relative prevalence of three-species subgraphs in food webs can be explained by the level of trophic coherence in both empirical and model food webs. This result provides another viewpoint in the debate about the origin of subgraph prevalences in food webs and further evidence of the importance of global organization in food webs 21 .

Methods
Quantifying triad significance. For any given network the exact number N k of any of the k = 1, …, 13 connected three-node subgraphs (triads, Fig. 1) is influenced by the network size and the degree distribution of the vertices. To test the statistical significance of any given triad k, the empirically observed number N k is compared against appearances of the same triad in a randomized ensemble of networks serving as a null model 16 . This comparison gives a statistical significance or z-score k k krand rand where N k rand and rand σ are the randomized ensemble average and standard deviation for triad k, respectively. The z-score of triad k thus measures the deviation of prevalence in the observed network with respect to the null model.
The z-scores of all 13 triads can be summarized in a triad significance profile (TSP) which is a vector z z { } k = with components z k for each triad k. Additionally, the normalized version of the TSP is often used to compare networks of different sizes and link densities 16 . This is given by The randomization procedure used to obtain the randomized ensemble statistics is a matter of choice. A careful selection of null model is important to discern between real effects and artefacts present in the TSP 29 . In our analysis, we follow the configuration model (CM) prescription 30,31 , and preserve the number of incoming and outgoing links for each node (the degree sequence) while randomizing links via a Markov chain Monte Carlo switching algorithm 15,16 . This preserves both the total number of nodes (species) and the links (trophic interactions) in the network. The generation of randomized networks and counts of triads was carried out with mfinder, the algorithm used by Milo et al. in their seminal work on network motifs 15,32 .
It is important to emphasize that the TSP is a relative measure of which triads are over-and under-represented with respect to the null model provided by the randomized CM networks. The over-(under-)representation as indicated by a positive (negative) z-score indicates that these triads appear more (less) frequently than in the randomized networks but do not imply an absolute saturation (absence) of said triads. Nevertheless, the TSP is an adequate tool for comparing networks of different sizes and degree distributions.
Comparing networks based on triad significance. To quantitatively compare networks based on their TSP, we use Pearson's correlation coefficient r between the normalized z-score vectors ẑ a and z b of networks a and b, respectively 12,16 . This is defined as¯σ where Figure 1. The 13 unique connected triads. These can be separated into two groups: (a) five triads, S1-S5, that only contain single links, (b) eight triads, D1-D8, that have double links (corresponding to mutual predation).
Scientific RepoRTS | 7: 16197 | DOI:10.1038/s41598-017-15496-1 are the mean and the standard deviation of the normalized z-score vectors, a and b specify the networks, k is an index over the triads and n = 13 is the total number of triads. With this definition a value of r close to 1 indicates that the two networks have very similar TSPs and thus patterns of over-and under-represented triads, a value close to 0 indicates no similarity, and a value close to −1 indicates anti-similarity-i.e. triads over-represented in one network will typically be under-represented in the other (and vice versa).
Comparing the empirical networks is straightforward as we just calculate the r-coefficient pairwise for the z-score vectors of all 46 food webs in our database. On the other hand, for comparison with the model (described in a subsequent section), for each empirical network we fit our food-web model to the data, generate 1000 instances of a model network and then compute the r-coefficient of the empirical z-score vector and the average z-score vector of the model-generated ensemble.
Clustering food webs into families. To uncover clusters of food webs with similar TSPs, we use a hierarchical, agglomerative clustering algorithm 33 based on the Pearson's correlation coefficient r between TSPs. First, we need to convert this to a distance measure. We define This definition ensures that d is a Euclidean metric 34 and we can readily apply hierarchical clustering. We use the UPGMA (average linkage) algorithm 33 to uncover the full cluster hierarchy.
Trophic coherence. Trophic coherence is a topological metric for directed networks that characterizes how layered the network is 21,26 . It measures the extent to which we can separate nodes into distinct groups so that any given group receives incoming links from just one other group and has outgoing links to another, different group of nodes. In the context of food webs, it measures the overall tendency of species to feed on multiple distinct trophic levels.
For each species j in the network, we define its trophic level s j as the average trophic level of its prey, plus one 21,35 , where k a j i ij in = ∑ is the number of prey of species j (also known as the in-degree) and a ij are entries of the adjacency matrix A of the food web. Here the convention is that the directed trophic links point from prey i to predator j.
Because of the recursive nature of Eq. (7), to assign a trophic level to every node in the network two conditions must hold. First, there must be at least one node of zero in-degree -we call such nodes basal; and second, every node in the network must be reachable by a path from at least one basal node. Food webs satisfy both conditions so the linear system defined by Eq. (7) has a unique solution. Without loss of generality we assign s j = 1 for all basal species, as is the convention in ecology.
We define the trophic distance associated to link a ij in the network as the difference between the trophic levels at the endpoints, = − x s s ij j i . Note that this is not a distance in the mathematical sense as it can take negative values. Denote by p(x) the distribution of trophic distances as measured on a network. This will have mean 〈x〉 = 1 by definition and a standard deviation q x 1 2 = − which we will call the trophic incoherence parameter. The trophic incoherence parameter is thus a measure of the homogeneity of the distribution p(x). For perfectly coherent networks we have q = 0, which translates to having only integer valued trophic levels and all species feeding on prey only one trophic level below their own. In this case the network is perfectly structured, or layered, as there are distinct groups of herbivores feeding only on basal species, predators feeding only on herbivores and so on. For less coherent networks, q > 0 indicates a less ordered trophic structure, where trophic levels take fractional values and species tend to prey on a broader range of trophic levels. See Fig. 2 for examples of coherent and incoherent food webs.
Model with tunable trophic coherence. Various mathematical models of food webs have been proposed to capture and explain different aspects of food webs 19,20,[36][37][38] , but the main models still fail to capture the full variety of empirically observed structures 12,21,39,40 . To reproduce many of the empirical structures 21 , in particular the prevalence of three-species motifs, we propose a model for food webs that allows us to adjust the incoherence parameter q by means of fitting a single free parameter. The model is a generalization of the Preferential Preying Model (PPM) introduced in ref. 21 , with the improvement that it can generate bidirectional links and cycles of higher order, thus producing more realistic networks. In the following we denote by B, N and L the number of basal nodes, total nodes and links in the network respectively, all parameters to be fitted using the empirical network data.
Scientific RepoRTS | 7: 16197 | DOI:10.1038/s41598-017-15496-1 We begin with B basal nodes and no links. We assign trophic levels s = 1 to all basal nodes. We then add N − B new nodes to the network sequentially according to the following rule. For each new node j, pick exactly one prey i at random from among all the existing nodes in the network, thus creating a link from i to j. In doing so, we define the temporary trophic level of node j as = +ŝ s 1 j i . After this procedure finishes, we have a network of N nodes and N−B links, and each node has a (temporary) trophic level s î .
Once all N nodes are created, we add the remaining links to the network to bring the expected number of links up to L. The links are chosen among all possible pairs of nodes (i, j) where j is not a basal node (this ensures no incoming links to basal nodes which would make them non-basal), with a probability P ij that decays with the where T is a free parameter which sets the degree of prey diversity between multiple trophic levels. This form of probability ensures that the most likely links to be created are between adjacent (temporary) trophic levels. The probabilities in Eq. (8) are normalized so that the expected number of links in the final network is L. At the end of the network creation procedure the trophic levels need to be recalculated according to Eq. (7) as the addition of new links will have changed the network topology, and the trophic levels in the final network need not correspond to the temporary integer valued trophic levels.
The free parameter T is analogous to temperature in statistical physics and sets the amount of deviation from a perfectly coherent network. For T = 0, only links between adjacent (temporary) trophic levels are allowed which results in the incoherence parameter q = 0. In this case the temporary trophic levels coincide with the actual trophic levels as the addition of links does not change the initially assigned trophic levels. As T is increased, links between a wider range of (temporary) trophic levels become more probable, so we expect q > 0 and increasingly more random networks. A sample dependence of q on T is shown in Fig. 3. The model exhibits a monotonic  dependence of the incoherence parameter q on temperature T which provides a basis for fitting the model to empirical food webs given the empirically observed q. We also find that the level of incoherence that is achieved at any given temperature depends on B/N, the ratio of basal species to all species. We will further explore this relationship in the subsequent section.
To fit the model to the food web data, we provide as input the number of basal species B, the number of total species N, and the number of links or trophic interactions L. We then use stochastic root finding to find the value of the temperature parameter T that results in an ensemble of networks whose incoherence parameter q is centred about the empirical incoherence parameter as measured from the food web topology.
Empirical food web data. We study the triad significance profile (TSP) in 46 empirical food webs from a variety of environments: marine, freshwater (river and lake) and terrestrial. Table 1 gives the relevant summary statistics of each food web. The full structure of each food web is included in supplementary information.

Results
Motifs in empirical food webs. The main results are summarized in Figs 4 and 5. Figure 4 shows the pairwise Pearson correlation coefficients of the triad significance profiles between all 46 food webs. The food webs are arranged by increasing incoherence parameter q so that more coherent food webs are assigned a lower ID. Red hue or warmer colours indicate a larger coefficient, while blue hue or colder colours indicate an anti-correlation in the TSPs.
We see that roughly two families of food webs emerge with similar TSPs. The first family (roughly ID 1-22) is characterized by relatively high coherence (low incoherence parameter q), for which the similarities in the TSPs are very high (r 0 8 ≥ . ). There is a second family of food webs, characterized by a high incoherence parameter q, that also show high similarities in their TSPs. Membership to this second family is not as clear as there is a tighter core of food webs belonging to it, with a periphery that only shares some similarities.
To make these ideas more precise, we performed hierarchical clustering of food webs based on a distance metric derived from the pairwise Pearson correlation coefficients. The resulting clusters are shown as a dendrogram in Fig. 4. By choosing a threshold distance d c , we can group food webs into a number of distinct families based on the similarities of their TSPs. Setting = .
Setting a lower threshold distance could provide a more fine-grained classification of food webs in more than two distinct families but we now show that this coarse classification into two families allows us to qualitatively differentiate food webs based on species preying patterns, specifically the extent of omnivory. To this end, we look closer at the bulk behaviour of the TSPs for the two families. Figure 5 shows the normalized profiles of Family 1 (top) and Family 2 (bottom).
We first consider Family 1. The bulk behaviour of food webs in this family is characterized by an over-representation of triads S1, S4 and S5, as well as an under-representation of triad S2 (with the exception of ID 22 Michigan Lake, ID 29 Florida Bay and ID 46 Everglades Graminoid Marshes). We should find the pattern of under-representation of triad S2 (which represents omnivory) unsurprising, since the majority of food webs belonging to this family have a low incoherence parameter q, which limits the ability of species to feed on multiple different trophic levels. Equally, the over-representation of triads S1, S4 and S5 is to be expected as these are the only three triads out of 13 that can arise in a hypothetical food web with q = 0, which is a value close to the empirical values of q for food webs in this family. The double link triads D1-D8 are all under-represented or close to even, in agreement with our expectations.
We now turn to Family 2. Here the triads S1, S4 and S5 no longer follow a strong pattern of over-representation and the double link triads D1-D8 are not always under-represented. The most distinguishing feature, however, is the bulk over-representation of triad S2 (with the exception of ID 40 Weddell Sea), in stark contrast to Family 1. We will argue that this is the main feature that separates the two food web families.
This pattern of food webs based on the under-or over-representation of triad S2 was alluded to in previous work 12 , however it is in disagreement with the predictions of the generalized cascade 19 and niche 20 models which can only produce food webs where S2 is over-represented 12 . Subsequently, we present results from our model which show that it is possible to change the pattern of under-representation to over-representation of triad S2 by increasing the incoherence parameter q, thus providing evidence that trophic coherence can naturally give rise to two food web families characterized by low or high prevalence of omnivory, respectively.
Comparison between empirical and model networks. We have also investigated the similarities of triad significance profiles between the empirical food webs and model generated food webs. To this end we study the similarity of the TSPs between each empirical food web and an ensemble of model food webs fitted to the data of the empirical one. The results are summarized in Fig. 6. Averaging over an ensemble of 1000 model generated food webs fitted to each empirical food web, we measured the Pearson correlation coefficient between the TSP of the empirical food web and the TSP of the ensemble average. The results show that the model is able to reproduce empirically observed TSPs for the majority of food webs in both families with high accuracy. The model fails to produce accurate TSPs for a number of food webs and sometimes even produces anti-correlated TSPs (r < 0). If we require that r > 0.5, eight food webs are not able to be reproduced accurately by our model, five in Family 1 (ID 31 Lough Hyne, ID 36 Carpinteria Salt Marsh Reserve, ID 41 Caribbean Reef, ID 44 El Verde Rainforest and ID 46 Everglades Graminoid Marshes) and three in Family 2 (ID 23 Bridge Broom Lake, ID 24 Grassland (U.K.) and ID 40 Weddell Sea). Recall that IDs are assigned in the order of increasing q so these particular food webs are unusual members of their respective families in that they tend to have extreme values of q with respect to the majority of networks in either family (higher than average in Family 1 and lower than average in Family 2). Because of the imperfect agreement between q and family membership, our model cannot replicate the structure  Table 1. An alphabetical list of the 46 food webs studied in the paper. From left to right, the columns are for: name, number of species N, number of basal species B, number of links L, ecosystem type, trophic incoherence parameter q, value of the temperature parameter T found to yield (on average) the empirical q with our model, references to original work, and the numerical ID.
of these sporadic webs. This suggests that for some food webs information about trophic coherence q may not be enough to reproduce realistic looking TSPs and there may be further mechanisms of prey selection at play 12 .
The role of omnivory and basal species. We now focus on the claim that the main difference between the two families of food webs is the relative under-and over-representation of triad S2, or the degree of omnivory in a food web. A prevalence of triad S2 indicates that the species in a food web often feed on different trophic levels, contributing to an increased incoherence parameter q as discussed at the start of this section. A scarcity of triad S2, on the other hand, indicates that species only tend to feed on prey with similar trophic levels, which in turn signals a low incoherence parameter. This suggests a relationship between the z-score of triad S2 and network incoherence as measured by q. Furthermore, model results (Fig. 3) suggest that a high proportion of basal species to all species, B/N, produces more coherent food webs (i.e. with a low incoherence parameter q). We take this as an additional predictive food web statistic for family membership.

= .
R 0 53 2 , p 8 47 10 9 = . ⋅ − ) that indicates a positive relationship between how coherent a network is (low q) and how many of its species are basal.
We have also coloured the markers of each food web to indicate the level of over-or under-representation of triad S2 as measured by the normalized z-score ẑ S2 . Red circles indicate an over-representation while blue diamonds indicate an under-representation of S2 in the respective food web. Remarkably, based on this measure, we uncover two clusters of food webs corresponding roughly to the two families based on TSP similarities. The first cluster is once again characterized by a high incoherence parameter q as well as a low ratio of basal species to all species B/N. The second cluster is characterized by a low incoherence parameter and a high ratio of basal species to all species. The only exceptions are six food webs in the first family (ID 20 Coweeta (1), ID 21 Martins Stream, ID 31 Lough Hyne, ID 36 Carpinteria Salt Marsh Reserve, ID 41 Caribbean Reef and ID 44 El Verde Rainforest), four of which correspond to food webs poorly matched by our model (Fig. 6). We conclude that, indeed, the main difference between the two families is the relative role of triad S2 as already observed in the bulk behaviour of the TSPs in Fig. 5.
Finally, we study whether our model exhibits a similar transition from a relatively S2-poor to an S2-rich state which would explain the relatively good agreement between empirical and model generated TSPs for the two families (Fig. 6). We find that for a given basal species ratio B/N there exists a critical temperature T c , and thus a critical incoherence parameter q c , which signifies such a transition. For T (and q) below these critical values, the model generates networks where S2 is under-represented, while for values above critical, the networks generated have either an even or an over-represented number of S2 triads. We include the transition line of the two regimes in Fig. 7 for an ensemble of 100 model networks with N = 100 species and an average (non-basal) degree = − = k L N B /( ) 10. Networks with q below the line show an under-representation of S2 triads, while networks with q above the line show an over-representation as measured by z S2 .
Remarkably, the model results are in very good agreement with the empirical data despite the fact that both the network size N and the average degree k vary considerably between the empirical food webs. Almost all food  webs with an under-represented number of S2 triads fall below the transition line of the model while those with an over-represented number reside above the line.
These findings suggest that the two families of food webs differ in the degree of omnivory present as measured by the prevalence of triad S2 which is itself intimately related to the incoherence parameter q. Interestingly, based on the strong anti-correlation between q and B/N, either parameter is a strong determinant of family membership. To our knowledge, the GPPM is the first food-web model able to reproduce triad significance profiles consistent with empirical observations. The ability to produce model networks belonging to either of the two families suggests that the parameters q and B/N are both important in the mathematical modelling of food webs and may, in fact, be fundamental for understanding local preying patterns in food webs.

Discussion
Our investigation of trophic interaction patterns in food webs has revealed significant correlations between the degree of omnivory, hierarchical organization of trophic species and the density of basal species.
The analysis of local trophic interactions via triad significance profiles in empirical food webs reveals two distinct families of food webs characterized by a relatively low or high incoherence parameter respectively. While certain differences across families of food webs based on their TSPs have been observed before 12 , these are not predicted by any existing food web models, calling into question their use as null models given the academic significance attached to food web motifs [9][10][11][12][13] . Trophic coherence provides a network theoretic metric that enables us to classify and predict the relative prevalence of such motifs.
We have shown qualitatively that the the main difference between the two food web families is the extent of omnivory, as measured by the over-or under-representation of triad S2 (the feed-forward loop). This classification of food webs into two families according to the extent of omnivory is at odds with previous claims that omnivory occurs more often than one would expect to happen by chance across most food webs 12 . On the other hand, the existence of these families may be related to different ways omnivory emerges in food webs and influences their stability [21][22][23] . We have tested our prediction for the onset of omnivory using a new model for generating synthetic food webs with a given trophic coherence. We find that the model exhibits a transition from an under-representation of omnivory to an over-representation of omnivory as a function of trophic coherence. Our model results fit the food web data very well, providing evidence of the importance of trophic coherence as well as the basal species density in modelling realistic trophic interactions. We would like to emphasize that these findings are remarkably robust between food webs originating from vastly different habitats.
This work has expanded on the importance of trophic coherence in predicting structural features in food webs 21 , but the biological origin of trophic coherence remains elusive. Basal species density and its effect of suppressing highly incoherent structures in both empirical and model food webs may provide some clues. All other things being equal, a higher proportion of autotrophs in a food web will necessarily mean that a higher proportion of consumers will feed on these basal species. In turn, this would have a dampening effect on the formation of long food chains in the trophic hierarchy and hence fewer possibilities for a varied diet of species at the top. Figure 2 exemplifies how this hypothesis could lead to very different food web structures. Established food web models do not treat basal species density as a predictor for emergent structure but rather as an emergent property itself. On the other hand, most food webs have been found to be significantly more trophically coherent than a random graph with the same density of basal species, so there must be other coherence-inducing mechanisms at play 26 . Further work is needed to elucidate the reasons behind this property of ecosystem structure.