Symbiosis limits establishment of legumes outside their native range at a global scale

Microbial symbiosis is integral to plant growth and reproduction, but its contribution to global patterns of plant distribution is unknown. Legumes (Fabaceae) are a diverse and widely distributed plant family largely dependent on symbiosis with nitrogen-fixing rhizobia, which are acquired from soil after germination. This dependency is predicted to limit establishment in new geographic areas, owing to a disruption of compatible host-symbiont associations. Here we compare non-native establishment patterns of symbiotic and non-symbiotic legumes across over 3,500 species, covering multiple independent gains and losses of rhizobial symbiosis. We find that symbiotic legume species have spread to fewer non-native regions compared to non-symbiotic legumes, providing strong support for the hypothesis that lack of suitable symbionts or environmental conditions required for effective nitrogen-fixation are driving these global introduction patterns. These results highlight the importance of mutualisms in predicting non-native species establishment and the potential impacts of microbial biogeography on global plant distributions.

distances approximated by the geosphere package in R 1 . This measure was square root transformed prior to analysis.
We also calculated the log-transformed total area of all non-native ranges for each species, as well as the log-transformed mean area per individual non-native range polygon. All the following analyses were performed only on the subset of species for which there was at least one non-native range polygon (zeroes were not included), or at least two non-native ranges in the case of geographic dispersion, and analysed using standard Gaussian linear models (after appropriate transformations of the response variables).
The mean area of non-native polygons and their geographic dispersion was not significantly different between symbiotic and non-symbiotic species (Supplementary Table 1). Total area of nonnative polygons, however, was significantly smaller for symbiotic species as compared with nonsymbiotic species. This is consistent with our main result, in that it suggests that not only do symbiotic species have a lower prevalence of non-native ranges (Supplementary Table 1), but they also have a smaller total area of introduction once they have been introduced. We recognize the caveat of interpreting area measures from TDWG political regions. But our analyses of geographic properties indicate that any variation introduced by using TDWG definitions do not conflict with biological inferences obtained from our main results. Table 1. Regression of various geographic measures of successful non-native ranges (with denoted response variable transformations) against the presence or absence of the symbiosis trait, with the inclusion of other factors found to predict successful introduction in our legume dataset. A negative coefficient indicates a lower response value for symbiotic legumes compared with nonsymbiotic legumes. Each cell contains the estimate with 95% confidence interval following in brackets.

Supplementary Note 2: Accounting for phylogenetic non-independence
There is a possibility that phylogenetic non-independence could influence patterns of non-native establishment of symbiotic nitrogen-fixation in our analyses, so we performed several tests to try and rule this possibility out. To do so, we used the most complete angiosperm phylogeny available from Zanne et al. 2,3 . We dropped all the tips of the publicly available Zanne et al. phylogeny 2,3 that were not found in our legume database, giving a total of 1301 legume species available (1140 symbiotic species, 161 non-symbiotic species) for phylogenetic analyses. This is a significant reduction in our sample size (nearly ⅓), but assuming a relatively random sample, we should be able to generalize our results here to the full dataset.
We were unable to create an exactly analogous model to our main result (i.e. hierarchical mixed model), because there is no available software that can feasibly analyse a dataset of this size while including both phylogenetic structure and crossed random effects to account for regional nonindependence. Therefore, we used a species-level analysis comparing species that had at least one nonnative range to species with no non-native ranges. We modelled the probability that a species had at least one non-native range using phylogenetic logistic regression implemented in the R package phylolm 4 . This method uses maximum penalized likelihood, which maximizes the penalized likelihood of the logistic regression using Firth's correction, which is implemented in the phyloglm function (model parameter = "logistic_MPLE") 4 . We used the 95% confidence interval generated by a bootstrap procedure with 1000 replicates (boot = 1000 in phyloglm) to determine if a coefficient was different from zero.
We ran two models. The first used no independent variables and was designed to test if there was any significant phylogenetic signal in the probability of a species having a non-native range. If there is no phylogenetic signal in the response it is unlikely that phylogenetic non-independence could affect our main result. However, it is still possible that phylogenetic signal could be obscured by the effects of covariates. So additionally we ran a model including the same set of independent variables used in our  Table   2). The only difference we saw in this model from our main results was that the symbiosis by number of human uses interaction was not significant, whereas in our main result we found a significant positive interaction. This could be due to the large decrease in sample size in this model or the slightly different formulation of the model (e.g. lack of a geographic area parameter or using probability of at least one non-native range rather than the prevalence of non-native ranges). The low estimated alpha in this model

Supplementary Note 3: Imputation of missing trait values
To determine whether our results were sensitive to the method of imputing missing life history and lifeform traits, we also tried two other imputation methods. The two other methods were a 'majority rules' taxonomic imputation, and an imputation based on other species traits using Random Forest.
For the second method, or majority rules taxonomic imputation, we assigned a state to each trait in the same manner as for the taxonomic mean imputation used in the main text, except instead of using the mean for a taxonomic group, we used the mode, or the most common value. This way, each trait could only take on values available in the original dataset (a 0 or a 1). Our results from the main model analyses did not change qualitatively regardless of which imputation method we used, with all model coefficients changing only superficially, and no change in levels of significance for any factor, indicating that our results are robust to the method of imputation chosen.

Supplementary Figure 2.
Example range species distribution maps for 32 randomly chosen species with both ILDIS and GBIF data. Native (green), introduced (orange) and unknown status (grey) ranges are defined as polygons according to TDWG World geographic scheme of plant distributions. Overlaid occurrence records from GBIF generally show the degree of consistency between ILDIS and GBIF. GBIF points that fall inside ILDIS polygons are coloured according to the status of the polygon they fall in (blue = native, yellow = introduced, grey = unknown); those that fall outside ILDIS polygons are coloured red. Most GBIF points that fall outside ILDIS polygons are not far from ILDIS polygons (see also Supplementary Fig. 3).