Disentangling the determinants of species richness of vascular plants and mammals from national to regional scales

Understanding spatial patterns in species richness has taken on new importance for biodiversity conservation in the context of climate change and intensified human intervention. Here, we created a database of the geographical distribution of 30,519 vascular plant species and 565 mammal species from 2,376 counties across China and disentangled the determinants that explain species richness patterns at both national and regional scales using spatial linear models. We found that the determinants of species richness patterns varied among regions. Elevational range was the most powerful predictor of the species richness of plants and mammals across China. However, species richness patterns in the Qinghai-Tibetan Plateau Region (QTR) were distinct: there, net primary productivity was the most important predictor. We also found that elevational range was positively related to plant species richness when it was less than 1,900 m, whereas the relationship was not significant when elevational range exceeded 1,900 m. This indicates that elevational range often emerges as the predominant controlling factor in regions where energy is sufficient. The effects of land use on mammal species richness deserve special attention. Our study suggests that region-specific conservation policies should be developed based on the regional features of species richness.


Table S1 Correlations between environmental variables in whole China and its three regions
Table S2 Variables selected based on Spearman (two-sided) correlation analysis and univariate regression for the residuals of species richness of vascular plants and mammals in whole China and its three regions

Supplementary text

Errors in species richness data
Although this is one of the best datasets ever developed in China, it has some sampling bias that may introduce errors into the results of our study. Two types of errors are possible: (1) omission errors, which occur when a species is not recorded in a county although it is in fact present there; and (2) commission errors, which occur when a species is recorded in a county although it is in fact not present there.
We discuss the errors in the dataset to better understand the results and implications of our study. These errors are not exclusive to our dataset; in fact, they are prevalent among most published species distribution databases at all scales 1-7. Current datasets are much more prone to omission errors than to commission errors.
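The two error types can be stated compactly as set differences between the species recorded for a county and the species actually present. The sketch below is purely illustrative: the function name and the species labels are hypothetical, and in practice the set of species truly present is unknown, which is why these errors cannot be counted directly.

```python
def classify_errors(recorded, present):
    """Return (omission, commission) error sets for one county.

    recorded: species listed for the county in the database
    present:  species truly occurring in the county (unknown in practice)
    """
    omission = present - recorded      # present but never recorded
    commission = recorded - present    # recorded but not actually present
    return omission, commission


# Hypothetical county with placeholder species labels.
recorded = {"species_a", "species_b", "species_c"}
present = {"species_a", "species_b", "species_d", "species_e"}

omission, commission = classify_errors(recorded, present)
print("omission errors:", sorted(omission))
print("commission errors:", sorted(commission))
```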

Omission errors
A common shortcoming of current databases is that sampling effort is not uniform in space 5,8-10. Most species records are derived from opportunistic collections made without a coordinated sampling strategy that covers the full variation of environmental conditions across the target region. Sampling bias tends to be higher for certain taxonomic groups and regions: some species receive special attention or are easy to detect, and some regions are close to researchers or receive more research funding 7,11.
Such biases are very common, as records can be spatially biased towards more popular species or more easily accessible sites 5,12. Many species that are actually present have never been recorded. Poorly known species and regions are most likely to be affected by these limitations. Detailed surveys across the entire possible range of a species are rarely conducted because of a lack of resources 13,14. Few counties in China have been surveyed with the aim of generating complete species lists 15.
Consequently, omission errors exist in the species distribution data for some counties of China because of incomplete documentation of species distributions 15.
Well-known regions are less likely to have omission errors.

Commission errors
Species may be misidentified, or locations may be wrongly recorded, either of which results in commission errors 5,11. Species occurrence is not fixed, especially for mammals, which move dynamically in time and space in response to changing biotic and abiotic conditions. Changes in habitat or land use, such as the transformation of forests into croplands and of wetlands into rice paddies, are likely to change the distribution of species. Thus, species distribution data based on the literature or on specimens may overestimate the number of locations where a species occurs and thereby introduce commission errors into this analysis.

Data sources and peer-review
Data on species distribution were collected from three sources: (i) approximately (ii) collection information of specimens in the herbaria of more than 20 institutes and universities; and (iii) recent ground observations of vascular plants and mammals based on records of field surveys by experts from more than 11 institutes of the Chinese Academy of Sciences and over 14 universities. Of the species distribution data, 80% came from the literature, 15% from specimen collection information, and 5% from ground observations. To improve data quality, we organized more than 20 expert meetings and invited over 50 experts specializing in specific taxa to review the spatial distribution data of each species across the country, using a GIS information system that we developed for species distribution at the county level. These experts specialized in specific taxa are

Using stratified random sampling to test the robustness of multivariate models
Although limitations in the database were minimized as far as possible, we performed a bootstrap procedure with stratified random sampling to test the impact of sampling bias on the robustness of the final (GLM) model 180-183. The procedure was as follows:
(1) We implemented a stratified random sampling procedure to generate a sample of 60% of the total dataset from each target region (whole China, EMR, NAR and QTR, respectively). The stratification was based on the phytogeographic regions of China 184 for vascular plants and the zoogeographical regions of China 185 for mammals.
(2) We fitted a GLM to the 60% subset.
(3) We assessed the statistical significance of each variable's regression coefficient at p<0.05.
(4) Steps (1) to (3) were replicated 1000 times with randomly generated samples. For each variable, we counted the number of replicates in which its regression coefficient reached statistical significance, and we selected the six variables with the highest counts.
(5) Steps (1) to (4) were repeated with samples of 70%, 80% and 90% of the total dataset.
(6) If the environmental variables obtained from the models based on the different sample proportions (60%, 70%, 80% and 90%) were consistent with those of the multivariate models for the target regions (100%), we concluded that the impact of sampling bias was effectively controlled and that the multivariate models were robust.
Through the above analysis, we found that the GLM models based on different sample proportions (from 60% to 90%) gave results consistent with those of the multivariate models for the target regions (100%) (Table S4). For example, in all GLM models for plants in whole China, the six environmental variables that most often reached statistical significance remained unchanged from 60% to 100%: elevational range, net primary productivity, precipitation seasonality, mean annual precipitation, maximum temperature of the warmest month, and main land cover type (Table S4). This suggests that the impact of sampling bias on the multivariate models was effectively controlled and that the multivariate models were robust.
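The resampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the GLM is assumed to be Gaussian with an identity link (fitted here by least squares), significance is assessed with two-sided Wald z-tests, subsamples are drawn without replacement within each stratum, and all function names and the synthetic data are hypothetical.

```python
import math
import numpy as np


def wald_p_values(design, y):
    """Fit a Gaussian GLM (identity link) by least squares and return
    two-sided Wald p-values for every column of the design matrix."""
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    scale = resid @ resid / (n - k)                 # residual variance
    cov = scale * np.linalg.inv(design.T @ design)  # coefficient covariance
    z = beta / np.sqrt(np.diag(cov))
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])


def stratified_sample(strata, fraction, rng):
    """Draw `fraction` of the rows from each stratum without replacement."""
    chosen = []
    for label in np.unique(strata):
        members = np.flatnonzero(strata == label)
        size = max(1, int(round(fraction * members.size)))
        chosen.append(rng.choice(members, size=size, replace=False))
    return np.concatenate(chosen)


def significance_counts(X, y, strata, fraction=0.6, n_rep=1000,
                        alpha=0.05, seed=0):
    """Count, per predictor, how often its coefficient is significant
    across `n_rep` stratified subsamples."""
    rng = np.random.default_rng(seed)
    design = np.column_stack([np.ones(len(y)), X])  # add intercept
    counts = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_rep):
        idx = stratified_sample(strata, fraction, rng)
        p = wald_p_values(design[idx], y[idx])
        counts += p[1:] < alpha                     # skip the intercept
    return counts


# Synthetic example: 300 "counties" in 3 strata, 3 predictors,
# of which only the first truly drives the response.
rng = np.random.default_rng(42)
n = 300
strata = np.repeat(np.arange(3), n // 3)
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + rng.normal(size=n)

counts = significance_counts(X, y, strata, fraction=0.6, n_rep=200)
ranking = np.argsort(-counts)  # predictors ordered by significance count
print(counts, ranking)
```

Running the same function with `fraction` set to 0.7, 0.8 and 0.9 and comparing the top-ranked predictors with those of the full-data model corresponds to the consistency check in the final step.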

Table S2 Variables selected based on Spearman (two-sided) correlation analysis and univariate regression for the residuals of species richness of vascular plants and mammals in whole China and its three regions.
Spearman (two-sided) correlation analysis was performed between every pair of variables in each category. The deviance in species richness explained by each variable in each category was calculated using univariate regression models. If the correlation coefficient between two variables in a category was >0.7, the two variables were considered strongly intercorrelated and we retained the one that explained more deviance in the univariate regression models. Mean annual precipitation: PRE; precipitation of the wettest quarter: PREmax; precipitation of the driest quarter: PREmin; mean annual dryness: DRY; mean annual temperature: TEM; maximum temperature of the warmest month: TEMmax; minimum temperature of the coldest month: TEMmin; annual potential evapotranspiration: PET; annual actual evapotranspiration: AET; net primary productivity: NPP; temperature annual range: TEMvar; mean diurnal range: TEMday; temperature seasonality: TEMsea; precipitation seasonality: PREsea; elevational range: ELEV; mean elevation: ELEM; main land cover type: LC.
Number of land cover types 9.9 LC 5.9
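The variable screening described in this legend can be illustrated with the following sketch. It rests on several assumptions that are not stated in the text: deviance explained is computed as R² of a Gaussian univariate regression, the 0.7 threshold is applied to the absolute Spearman coefficient, and collinear variables are removed greedily in order of decreasing deviance explained. The function names and the synthetic data are hypothetical; only the abbreviations PRE, PREmax and TEM are taken from the legend.

```python
import numpy as np


def average_ranks(x):
    """Rank values from 1..n, giving tied values their average rank."""
    order = np.argsort(x, kind="mergesort")
    sorted_x = x[order]
    ranks = np.empty(len(x), dtype=float)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sorted_x[j + 1] == sorted_x[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def deviance_explained(x, y):
    """Fraction of deviance explained by a univariate Gaussian regression
    of y on x (equal to R^2 under this assumption)."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


def select_variables(X, y, names, threshold=0.7):
    """Keep variables in order of deviance explained, dropping any whose
    absolute Spearman correlation with a kept variable exceeds threshold."""
    dev = [deviance_explained(X[:, j], y) for j in range(X.shape[1])]
    ranks = np.column_stack([average_ranks(X[:, j]) for j in range(X.shape[1])])
    rho = np.corrcoef(ranks, rowvar=False)  # Spearman = Pearson on ranks
    kept = []
    for j in sorted(range(len(names)), key=lambda j: -dev[j]):
        if all(abs(rho[j, k]) <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]


# Synthetic data: PREmax is strongly correlated with PRE; TEM is independent.
rng = np.random.default_rng(0)
pre = rng.normal(size=1000)
premax = pre + 0.5 * rng.normal(size=1000)
tem = rng.normal(size=1000)
richness = 2.0 * pre + tem + rng.normal(size=1000)

X = np.column_stack([pre, premax, tem])
selected = select_variables(X, richness, ["PRE", "PREmax", "TEM"])
print(selected)
```

In this synthetic case PREmax is dropped because it is strongly correlated with PRE and explains less deviance, while TEM is retained.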