Multi-scale interplays of biotic and abiotic drivers shape mammalian sub-continental diversity over millions of years

The reconstruction of deep-time diversity trends is key to understanding current and future species richness. Studies that statistically evaluate potential factors affecting paleodiversity have focused on continental and global, clade-wide datasets, and thus we ignore how community species richness build-up to generate large-scale patterns over geological timescales. If community diversity is shaped by biotic interactions and continental and global diversities are governed by abiotic events, which are the modulators of diversity in subcontinental regions? To address this question, we model Iberian mammalian species richness over 13 million years (15 to 2 Ma) using exhaustive fossil evidence for subcontinental species’ ecomorphology, environmental context, and biogeographic affinities, and quantitatively evaluate their impact on species richness. We find that the diversity of large Iberian mammals has been limited over time, with species richness showing marked fluctuations, undergoing substantial depletions as diversity surpasses a critical limit where a significant part of the niches is unviable. The strength of such diversity-dependence has also shifted. Large faunal dispersals and environmental heterogeneity increased the system’s critical diversity limit. Diversity growth rate (net migration and diversification) also oscillated, mainly modulated by functional saturation, patchiness of canopy cover, and local temperature and aridity. Our study provides quantitative support for subcontinental species pools being complex and dynamic systems where diversity is perpetually imbalanced over geological timescales. Subcontinental diversity-dependence dynamics are mainly modulated by a multi-scale interplay of biotic and abiotic factors, with abiotic factors playing a more relevant role.


SI Methods
Fossil data. Our fossil dataset includes 94 Iberian fossil sites with macro-mammals remains whose ages span from 16 Ma to around 2 Ma. We excluded from the database fossil sites from the Northeastern Iberian Peninsula (i.e., Vallès-Penedès and Seu d'Urgell Basins) since they have been traditionally assigned to a different bioprovince 1 , and, also, because we do not have isotopic data from these basins (see below). Seventy seven of the fossil sites included in this study had been precisely dated using the Maximum Likelihood Appearance Event Ordination (ML-AEO) 2 . The other 17 localities are assigned to local mammal age units (MN units; see temporal ranges assigned to each locality in Dataset S1). After discarding fossil occurrences that were not identified at the species level, our dataset includes 721 occurrences of 209 species.
Taxonomic diversity through time. Estimates of past diversity that use the fossil record should account for sampling bias. Our approach uses two different methods of paleodiversity estimation: the True Richness estimated using a Poisson Sampling model (TRiPS) 3 , and the Shareholder Quorum Subsampling method (SQS) 4 . Taking advantage of the precise calibration of the fossil sites in our dataset, we performed a sliding time window analyses to avoid edge effects. We sampled occurrence data within a time window of 1 myr every 0.25 myr. Occurrence data from the 17 localities assigned to MN units were assigned a random age within the MN unit temporal range prior to the analyses.
The TRiPS approach jointly models biodiversity and the sampling process within each time bin, here each 1 myr window. The method provides maximum likelihood estimates of both sampling rates and species richness 3 . We retain the maximum likelihood estimate of biodiversity for subsequent analyses.
The SQS is a subsampling approach. The SQS is based on each species 'coverage', the percentage of the total occurrences belonging to that species in each focus interval. This 'coverage' is modified by multiplying by Good's u (the proportion of occurrences of non-singleton data, which is a proxy for sampling completeness; Fig. S1E) to estimate the real frequency 4 .
Good's u also marks the maximum quorum level (q) to which we can subsample (fewer intervals can be subsampled as we increase this quorum) (Fig. S1E). We estimate Good's u for the 100 resampled versions of the dataset, and explore the percentage of time windows where at least 10 from 100 of the Good's u values fell above 0.4 and 0.5, since a q below 0.4 is not recommended 4 .
A q of 0.4 still retains information for 84% of the time windows, whereas using 0.5 drops a 25% of the data (Fig. S1E). Thus, we use a q of 0.4 for estimating SQS diversity through time.
We estimate TRiPS and SQS diversity for the 100 resampled versions of the dataset. For TRiPS, we exclude estimates associated with a sampling probability below 0.25 (Fig. S1C), because these yield non-realistic diversity values and very broad confidence intervals. Also, time windows for which less than 10 richness estimates are computed across the 100 resampled datasets are also discarded. For SQS estimates, we retain values from time windows where at least 10 estimates were obtained (10 estimates are above the quorum of 0.4 across resampled versions of the data).
To synthesize the obtained clouds of points from the resampled TRiPS and SQS estimates (Fig.   S1), we fit locally weighted (LOESS) curves to the data ( Fig. S1D and S1F). An optimal smoothing parameter is selected using an Akaike Information Criterion (AIC) 5 so that the resulting curves capture the general trend, and reduces influence of both extreme points and the noise caused by the temporal uncertainty of in our data. Also, this kind of approach allows to predict values in short intervals bins where we have fewer direct observations (e.g. intervals where Good's u is below the quorum value of 0.4, for example between 5.5 and 4.75 Ma; see Fig. S1F). The resulting curves are retained for posterior analyses. LOESS curves were also used for the rest of the time series ( Fig. 1 and methods below).
The functional traits are used to investigate ecological functionality of the regional species pool through time. We compute overall functional space by calculating Gower's distances 26 applying the function daisy in the R package cluster to our three traits. Gower's metric allows the inclusion of missing data (here only 6.6%) since the method rescales distances based on the amount of information available 26 . We estimate functional disparity (FD) and functional saturation (FS) 27 . FD is measured as the mean distances among the taxa in a given time window. FS is calculated as the negative of the mean nearest-neighbour distance (NND) among the taxa (saturation increases when NNDs are smaller). FD reflects the overall functional diversity, whereas FS measures the relative saturation of taxa in the functional space 28 (functional partitioning). 1 myr time windows are placed every 0.25 myr through the study interval and we extract the species with functional information in each window. FD and FS are only computed for time windows that contain at least 4 localities with species with functional information for two of the traits, resulting in a range of 8 to 41 species across bins. We find no significant correlation between FD and the number of species with functional information sampled in each window (Kendall's τ = 0.157; P = 0.118). This suggests that the trends observed are the result of true biological signal, and not just an artefact of paleontological sampling. FS could also be sensitive to sampling 28 . Poor sampled windows are expected to show less extreme nearest-taxon distances (a lower range of NNDs) 29 . Our data show the opposite trend, with the standard deviation of the observed NNDs decreasing with species richness (τ = -0.463; P < 0.001), indicating that the signal of the less rich temporal bins is not affected by sampling. FD and FS are scaled from 0 to 1 to ease model fitting (see below) and graphical representation (Fig. 1B). LOESS curves are applied to the estimated values of FD and FS through time (Fig. S3), using an Akaike Information Criterion (AIC) for selecting the appropriate smoothing parameter 5 . These curves are retained for posterior analyses.
Biogeography. Large scale biogeographic events (e.g. intercontinental faunal interchanges) should influence terrestrial communities composition [30][31][32] . To test whether such episodes also impact diversity trends in small regions, we investigate the influence of European and African faunas on our species pool through time. We estimate the mean Raup-Crick similarity of the Iberian localities with those from Europe and Middle East, and Africa. Presence-absence data for all the localities was obtained from the NOW database 25 and the analyses were carried out at the genus level. Only localities with at least 5 genera were considered. Tooth enamel samples (n=728) were analyzed for the carbon and oxygen isotope composition of carbonate in bioapatite (δ 13 C and δ 18 O CO3 , respectively). Chemical pre-treatment followed the one described in Domingo et al. 33 . Isotopic analyses were conducted at the stable isotope laboratories of the University of California Santa Cruz using a ThermoScientific MAT253 dual inlet isotope ratio mass spectrometer coupled to a ThermoScientific Kiel IV carbonate device and of the University of Minnesota using a ThermoScientific MAT252 dual inlet isotope ratio mass spectrometer coupled to a ThermoScientific Kiel II carbonate device.
The δ 18 O values of phosphate in bioapatite (δ 18 O PO4 ) were measured on 372 tooth enamel samples. The chemical treatment is described in Domingo et al. 33 . Analyses were performed at the stable isotope laboratories of the University of California Santa Cruz using a ThermoFinnigan Delta plus XP IRMS coupled to a ThermoFinnigan High Temperature Conversion Elemental Analyzer (TCEA) and of the University of Kansas using a Thermo Finnigan MAT 253 IRMS coupled to a ThermoFinnigan TCEA.
Among mammalian herbivores, there is a contrast between obligate drinkers (animals that obtain most of their water from drinking) and non-obligate drinkers (those obtaining water mainly from plant water and metabolic water) 36 . The first group drinks water daily or consumes non-leafy parts of plants (which contain water that has not experienced evaporation). As a result, these animals record more faithfully local meteoric water δ 18 O values as well as temperature. Non-obligate drinkers can survive with little or no drinking water. They obtain most of their water from leaf water, which is prone to strong evaporative 18   Diversity models. We used a maximum likelihood approach 38 to assess the effect of diversitydependence (DD), functional space, biogeography and environment in diversification+migration rates and diversity limits. The basis of this approach are three basic models of competition (see detailed model descriptions, formulas and R code in Ezard and Purvis) 38 . These basic models assume constant carrying capacity (K) and constant intrinsic diversification+migration rate (the rate when diversity tends to zero or unconstrained rate, r 0 ). We term this basic models 'pure DD models'. The realized diversification+migration rate (r) is a function of diversity in relation to K and r 0 . The 'bounded contest' model is the analog of logistic growth. In this model diversity plateaus (r becomes 0) as it reaches the carrying capacity (K). In the 'expansion and collapse' scenario, adding species to the regional species pool produces the contraction of the niche breath of the other species. As a result, there is a diversity limit where competition is analogous to the 'bounded contest', but beyond this point the system is not sustainable any longer and diversity collapses (r becomes negative). In the 'damped increase', the intensity of competition is included as the competition coefficient (c). Values of c ~1 would mimic the 'bounded contest'. For c > 1, the model would work similar to an 'expansion and crash' scenario. For c < 1, diversity would never stop diversification completely. In particular, as c tends to zero, K tends to infinite, and thus diversity would only depend on diversification (a true unbounded scenario).
More complex versions of these DD models can be built by incorporating factors (e.g. FD, FS, δ 13 C, δ 18 O, etc.) that modulate K and r 0 over time 38 . We also build models where K and r 0 directly depend on such factors, but with no diversity-dependence involve. In total, 412 models were fitted (206 based on TRiPS diversity and 206 based on SQS diversity; see descriptions of the models in Dataset S1). We used AICc weights to evaluate the statistical support of each model. Akaike weights measure the relative probability of each model being the best one among those compared.
Specifically, we compute aggregated AICc weights across different factors to investigate their contribution to the fit of the models (Dataset S1). To test that the observed pattern is not a result of the amalgamation of models with poor support, we also estimate AICc weights from the best performing model (lowest AICc) in each factor's category (Dataset S1).
To interpret the contribution of each factor to the models, relevant parameters are model-averaged by multiplying them by the AICc weight of the respective model and then summing the results across all models 38 . We compare averaged parameters from the pure DD models with those where a given factor modulated K or r 0 . In order to assess how a given factor impacts K, we look at the differences in AICc-averaged diversity between pure DD models and models where K was a function of that factor and plot these differences against values of the factor (Fig. 3, S8 and S9). To assess how a given factor impacts r 0 , we look at the differences in AICc-averaged diversity between pure DD models and models where r 0 was function of that factor and plot these differences against values of the factor (Fig. 3, S8 and S9).        Effect of different factors on the strength of diversity-dependence. The first column shows the rate of diversity growth (diversity in one bin divided by the diversity in the previous bin) plotted against observed diversity, based on AICc-weight-averaged predicted values from models where K is controlled by each factor. General trends based on local regression fitting (LOESS) and their 95% prediction band are shown. Light grey represents the real trend (Fig. 1A). The second column shows the differences between AICc-weight-averaged predicted diversity from pure DD models and models where each factor regulates K, plotted against each factor. Points are colored according to the predicted diversity under each factor (darker blue is higher diversity). Dark lines represent linear or polynomial regressions that are significant (P < 0.05). Third and fourth columns are the same but here we average model predictions based on the influence of each factor on r 0 . Figure S8. Results of model-averaging parameters from models based on TRiPS-estimated diversity. Bar-plots show the support for pure DD models in grey (fixed K and r 0 , respectively).

Supporting Figures and Dataset Captions
Effect of different factors on the strength of diversity-dependence. The first column shows the rate of diversity growth (diversity in one bin divided by the diversity in the previous bin) plotted against observed diversity, based on AICc-weight-averaged predicted values from models where K is controlled by each factor. General trends based on local regression fitting (LOESS) and their 95% prediction band are shown. Light grey represents the real trend (Fig. 1A). The second column shows the differences between AICc-weight-averaged predicted diversity from pure DD models and models where each factor regulates K, plotted against each factor. Points are colored according to the predicted diversity under each factor (darker blue is higher diversity). Dark lines represent linear or polynomial regressions that are significant (P < 0.05). Third and fourth columns are the same but here we average model predictions based on the influence of each factor on r 0 . Dataset S1. Information on the age of the fossil sites, and functional traits for the 209 mammalian species analyzed. Description of the 206 models run based on diversity curves estimated using TRiPS and SQS, showing classifications based on different criteria, AICc scores and AICc weights.
Aggregated AICc weights for all the models are shown for each classification. Aggregated AICc weights based on AICc scores of the best model for each category in each classification.