Improving prediction of rare species’ distribution from community data

Zhang, Chongliang; Chen, Yong; Xu, Binduo; Xue, Ying; Ren, Yiping

doi:10.1038/s41598-020-69157-x

Published: 22 July 2020

Scientific Reports volume 10, Article number: 12230 (2020)

Abstract

Species distribution models (SDMs) have been increasingly used to predict the geographic distribution of a wide range of organisms; however, relatively few research efforts have concentrated on rare species despite their critical roles in biological conservation. The present study tested whether community data may improve the modelling of rare species by sharing information among common and rare ones. We chose six SDMs that treat community data in different ways, including two traditional single-species models (random forest and artificial neural network) and four joint species distribution models that incorporate species associations implicitly (multivariate random forest and multi-response artificial neural network) or explicitly (hierarchical modelling of species communities and generalized joint attribute model). In addition, we evaluated two approaches of data arrangement, species filtering and conditional prediction, to enhance the selected models. The model predictions were tested using cross validation based on empirical data collected from marine fisheries surveys, and the effects of community data were evaluated by comparing models for six selected rare species. The results demonstrated that the community data improved the predictions of rare species’ distributions to a certain extent but might also be unhelpful in some cases. The rare species could be appropriately predicted in terms of occurrence, whereas their abundance tended to be underestimated by most models. Species filtering and conditional predictions substantially benefited the predictive performances of multiple- and single-species models, respectively. We conclude that both the modelling algorithms and community data need to be carefully selected in order to deliver improvement in modelling rare species. The study highlights the opportunities and challenges of improving prediction of rare species’ distribution by making the most of community data.

Introduction

Species distribution models (SDMs) have been widely used to evaluate ecological niches and to predict the geographic distribution of organisms across terrestrial, freshwater, and marine habitats^1,2,3,4,5,6. A majority of SDMs have been developed for common and economically important species because of practical incentives, while predictive modeling is more challenging for rare species due to methodological difficulties^7,8,9. As most species are rare in natural biological communities^10,11, modeling common species alone cannot depict the full picture of biodiversity. In addition, rare species, characterized by low occurrence, are particularly vulnerable to environmental changes and human impacts and thus deserve special attention in biological conservation^8,12. As such, there is a pressing need to predict the distribution of rare species for successful conservation, for example in designing marine protected areas (MPAs) and identifying priorities for monitoring programs¹³.

Accurate prediction of rare species is not easy. The difficulties come largely from the limits of data, as the observations of rare species are typically sparse in terms of spatial location and temporal frequency^14,15,16. The sparse data imply that the number of presence observations is often small compared to the number of influential predictors, resulting in a critical problem of over-fitting in modelling^8,16,17. Besides, occurrence or abundance of rare species are often vulnerable to sampling errors, which may lead to model misspecification, making it unfeasible to characterize species’ niche space¹⁸. There are a few studies aiming to address the issue of rarity, e.g., by developing a large number of simple models averaged in an ensemble^8,9, and generating pseudo-absence from a habitat suitability map^19,20,21. In spite of the progress, many issues remain, such as species’ nonlinear responses to environmental variables^2,22, unobserved/unknown driving forces²³, imperfect detection^16,24, among other outstanding difficulties²⁵.

With the development of modern statistics, technical advances provide powerful tools to estimate and predict species distributions. For example, machine learning methods and Bayesian hierarchical models are highly flexible in handling complex ecological responses and are promising for data-limited situations^26,27,28. Some predictive methods have emerged to account for community information, leading to a new modelling approach known as community-level models²⁹ or joint species distribution models (JSDMs)^30,31,32,33. This modelling approach may benefit the prediction of rare species by borrowing strength from community data^29,34,35,36, which include rich information on species correlations resulting from biological interactions or shared environmental gradients^30,37,38. These factors have essential influences on species distributions and thus may improve the predictive power of species distribution models. That is, models that integrate community data may contribute to solving the ‘rare-species modelling paradox’.

It should be acknowledged that this idea of community modelling is not new^39,40, and some studies have compared the performances of single- and multi-species models^41,42,43. However, JSDMs remain underutilized to date^29,44,45, and there is limited understanding of their advantages and limitations. Although many studies suggest that JSDMs may outperform single-species SDMs (SSDMs), the advantage is not guaranteed²⁹, and JSDMs may lead to biased parameters if some species respond to the environment very differently from others. Therefore, the gains of adopting JSDMs need to be carefully considered.

This study tested the predictive performances of rare species distribution models, focusing on the hypothesis that community data may improve model prediction. We chose a range of SDMs that treat community data in different ways²⁹ and compared their performances using cross validation with survey data collected in the coastal waters of the Yellow Sea, China. Both species occurrence and abundance were considered in the evaluation, as previous studies have concentrated on occurrence data although abundance data are better indicators of extinction risk^42,46. In addition to comparing modelling algorithms, we evaluated two approaches of data arrangement, species filtering and conditional prediction, to enhance the predictive performances of the chosen models. These approaches were considered from a pragmatic viewpoint, i.e., available data and modelling techniques are often fixed and can hardly be improved in the short term, so improving model prediction, even to a limited extent, may be the only way to address the rare-species challenge. The goal of this study is to improve our ability to predict the spatial distribution of rare species for biological conservation.

Results

Variations in predictability

The tested SSDMs and JSDMs had substantially different predictive abilities. Considering the results for Japanese seahorse (Hippocampus mohnikei, Sp4), AUCs (the area under the curve of the receiver operating characteristic) of around 0.9 showed that the occurrence of this species could be properly predicted by most models, except artificial neural network (ANN) (Fig. 1). The Cohen’s κ coefficient indicated a similar pattern, whereas hierarchical modelling of species communities (HMSC) and generalized joint attribute model (GJAM) performed worse than the machine learning methods. The results of RMSE (root mean square error) were consistent with AUC, and ANN yielded an RMSE larger than that obtained by simply assuming the absence of this species over the survey area (dashed line). All the models had negative partial relative bias (PRB) on average, implying a tendency to underestimate abundance. The results for the other five species showed a similar pattern, but the values of the performance metrics varied substantially (Supplementary Figure S5). In general, multivariate random forest (MRF) and random forest (RF) showed the best predictive power for this species, followed by multi-response artificial neural network (MANN).

The divergences in model performance were also compared for the other species. In terms of occurrence, MRF provided the best predictions of Sp1 (Brown croaker, Miichthys miiuy) and Sp3 (Blackhead seabream, Acanthopagrus schlegelii), and RF was optimal for Sp5 (Black scraper, Erisphex pottii). HMSC and MANN provided better predictions of Sp2 (Ocellate spot skate, Raja porosa) and Sp6 (Bartail flathead, Platycephalus indicus) in some measurements (Table 1). The RMSE results were more complicated, i.e., HMSC and GJAM were the best for Sp1 and Sp2, respectively, RF was the best for Sp3 and Sp5, and MRF for Sp4 and Sp6. It should be noted that the discrepancies among models were relatively small in terms of the performance metrics, especially between RF and MRF. The predictions of abundance were poor for very rare species, and no model made better predictions than assuming all zeros for Sp2. In addition, the relative performances of the models were not consistent among species. Sp3, Sp4 and Sp6 were more readily predicted than the other species (Table 1). The occurrence of the rarest species in this study, Sp1, could be properly predicted, whereas Sp2 and Sp5 were less well predicted in terms of both occurrence and abundance.

Table 1 A summary of model predictive performances for target rare species.

Species filtering

Increasing the thresholds of species selection (filtering) led to fewer but more strongly correlated species, which had different effects on the four JSDMs (Fig. 2). Among them, MRF tended to be less responsive to changes in species selection, and the corresponding RMSE increased slightly only for Sp2 and Sp6 at LV3 (LV denotes the level of species filtering, and LV3 denotes the smallest set of selected species). In contrast, the predictions of MANN were substantially improved by reducing the number of species, with decreasing RMSE, except for Sp6. HMSC was barely influenced in the cases of Sp1, Sp2 and Sp3 but benefited from species selection for the other species. GJAM also showed little response to species selection for Sp1, Sp2 and Sp3, but its performance decreased for the other species. At LV3, MANN and HMSC tended to outperform the other models.

Conditional predictions

Compared with single-species RF, the predictive accuracy of conditional-RF (using ancillary species as predictive variables) was substantially improved for most species, as indicated by the decreases in RMSE (ΔRMSE in Fig. 3). Predictions conditioned on observation data of ancillary species (RF-OBS) showed the largest gains in accuracy; meanwhile, comparable improvement could be obtained with the help of JSDMs, i.e., conditional-RF based on JSDMs (using the outputs of JSDMs as predictors) could substantially improve RF and performed better than MRF in many cases.

Conditional predictions also remarkably improved ANN, bringing its performance to a level similar to or better than that of MANN (Fig. 3). The degree of improvement differed little between observation-based and model-based conditioning. However, the effects differed substantially among species, being largest for Sp5 and Sp6 and smallest for Sp3.

Discussion

Given the global awareness of biodiversity loss under climate change and anthropogenic pressures, it is not surprising that SDMs have been increasingly used in recent years. It is therefore of great concern how reliable the models are in predicting species distribution^47,48,49. In this study, we examined the performances of a representative selection of modelling methods for rare species using a typical dataset available from marine fisheries surveys. Our results were generally mixed, that is, most species could be appropriately predicted in terms of occurrence, whereas non-zero abundance tended to be underestimated. Nevertheless, given the rather limited occurrence (mostly less than 10%), such performances were acceptable for rare species in the context of biological conservation. Although the conclusions may depend on the specific objectives of studies and the characteristics of targeted ecosystems, we highlight the opportunities of community data to address the ‘rare-species modelling paradox’^30,35.

It is worth noting that this study covers a limited scope of SDMs in a continuous spectrum of complexity, and the potential of existing models may not be fully reflected. In particular, the literature has concluded that the predictive abilities of SDMs may vary in different circumstances, depending on the type of organisms, their life-history traits, behavior, prevalence, data quality, spatial resolution and extent, and the impacts of human activities^17,25,50,51. The target species in this study by no means represent the high diversity of marine organisms. Moreover, the so-called ‘rare species’ may also diverge in definition, characterized by geographic range, habitat specificity and local density, and different types of ‘rarity’ may influence predictive models in different ways^16,52,53. In general, substantial challenges still lie ahead on the road to predicting rare species.

In our evaluation, the six models had divergent performances when evaluated with different objectives, measures and target species. In general, the models using RF algorithms had better predictive ability than ANN- and regression-based models for both occurrence and abundance. The advantage could be largely attributed to the successful control of overfitting by model ensembles and internal cross-validation⁵⁴. In contrast, ANN easily led to overfitting in the presence of sampling errors and environmental noise⁵⁵. Nevertheless, the predictive power was substantially improved in MANN and conditional ANN, implying that the overfitting issue was effectively alleviated by borrowing information from common species. The regression algorithm adopted by HMSC and GJAM, meanwhile, implied that they were less flexible with respect to non-linear relationships³⁰ and at the same time less vulnerable to overfitting⁵⁶. However, the regression-based JSDMs tended to be ‘conservative’ for rare species in terms of PRB. We highlight that model ensembles and internal cross-validation should be considered in the future development of SDMs, and particularly the capacity of JSDMs to account for non-linearity and overfitting⁵⁷.

Considering the overall performances of the SDMs, our evaluations generally found better predictive power in the category of machine-learning JSDMs and conditional SSDMs, suggesting that community information is useful for the prediction of rare species³⁶, although the extent of improvement depends on the statistical algorithms adopted. It is well established that such gains could be attributed to the covariations in species distribution, as a result of (dis)similar environmental requirements, biotic interactions such as competition and predation, human impacts such as fishing, and other stochastic processes such as observation/sampling errors^29,30,32. Our results are consistent with this conclusion, i.e., the species less correlated with the others (Sp2) tended to be poorly predicted while the well-predicted ones (Sp3, Sp4 and Sp6) showed relatively high correlations in the raw data (Supplementary Fig. S2). Meanwhile, it should be noted that SSDMs, specifically RF, may outperform the community models when predicting rare species, implying that community information is not helpful in certain circumstances. This is because the underlying driving forces may be idiosyncratic for the target species and others^29,58. In this case, the distributional patterns of rare species reflected by the limited data may be concealed by the relatively large amount of data on common species, and increasing the number of species may make the situation worse for model fitting. Such a result was evident in the species selection processes in MANN and HMSC, both of which tended to have improved predictive power when the number of species was reduced. On the other hand, MRF showed less response to species selection because the RF algorithm could effectively suppress predictor species with loose correlations⁵⁴. The declining performance of GJAM might also be attributed to the predicting algorithm, which generated latent variables randomly from a multivariate normal distribution according to species covariance matrices⁵⁹. In this case, a strong correlation matrix might lead to larger predictions of latent variables and increased RMSE for rare species. Our results highlight the critical role of species selection in the implementation of JSDMs, especially MANN and HMSC.

This study provides suggestions for the application of SDMs to rare species. MRF, conditional RF and HMSC are recommended, provided the models are properly tuned in structure and input variables. Conditional RF should be most powerful for modelling rare species when the distributions of common species are known at the locations of interest (RF-OBS). These results may contribute to extending the scope of species that can be statistically modelled and to facilitating studies of similar backgrounds in the cases of rare species or limited data. In future studies, in addition to the improvement of data quality and quantity, algorithmic development is still needed to address the multiple issues raised by rarity. As no model is likely to be superior in all circumstances, diverse types of SSDMs and JSDMs with different features should be combined to address different situations of biological characteristics, rarity and available data, for which a better understanding of the potential and shortcomings of the existing models is required. Finally, as these challenges are far from solved, we highlight the need for research efforts in the field of modelling rare species to deliver successful ecosystem management and biodiversity conservation.

Methods

Study area and data

A marine fisheries survey was conducted in the north Yellow Sea, China to collect data. A modified systematic survey design was implemented with a total of 118 sampling stations in 2017 (Supporting information, Supplementary Fig. S1). At each station, an otter trawl with a net width of 15 m and a cod-end mesh size of 20 mm was towed for around 1 h at a speed of nearly 3 knots. Catch data were standardized to the same sampling effort (trawling speed × time) for modelling. The survey and analysis methods were carried out in accordance with the ethics and guidelines of Chinese law, and the experimental protocol was approved by the Ethical Committee of Ocean University of China.
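As a rough illustration of the effort standardization described above, the following Python sketch rescales a raw catch to a common effort defined as towing speed × time. The function name, the reference effort of 3 knots for 1 h (taken from the nominal tow described in the text) and the example tows are assumptions made for illustration, not the authors' code or data.

```python
def standardize_catch(catch, speed_knots, tow_hours,
                      ref_speed_knots=3.0, ref_hours=1.0):
    """Scale a raw catch to a reference effort (effort = towing speed * time).

    The reference effort (3 knots for 1 h) is an assumption based on the
    nominal tow described in the text.
    """
    effort = speed_knots * tow_hours
    if effort <= 0:
        raise ValueError("effort must be positive")
    return catch * (ref_speed_knots * ref_hours) / effort


# Hypothetical tows: (raw catch, speed in knots, duration in hours)
tows = [(12.0, 3.0, 1.0), (12.0, 3.0, 0.5), (12.0, 2.0, 1.0)]
standardized = [standardize_catch(c, s, t) for c, s, t in tows]
```

A tow with half the nominal effort has its catch doubled, so that abundances from all stations are expressed per unit of the same effort.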

A total of 145 fish, shrimp and cephalopod species, in addition to benthos, were identified in the survey. As this study concentrated on rare species, only species occurring in less than 15% of the survey stations were selected as target species. As a result, six species with occurrence frequencies ranging from 3 to 12% were selected, including Brown croaker (Miichthys miiuy, Sp1, 3.5%), Ocellate spot skate (Raja porosa, Sp2, 4.3%), Blackhead seabream (Acanthopagrus schlegelii, Sp3, 6.1%), Japanese seahorse (Hippocampus mohnikei, Sp4, 8.8%), Black scraper (Erisphex pottii, Sp5, 9.6%), and Bartail flathead (Platycephalus indicus, Sp6, 12.3%) (Supplementary Table S1 in Supporting Information). In addition, the 31 most prevalent species, with occurrence frequencies ranging from 23 to 87%, were used as ancillary species (Supplementary Fig. S2) to help the prediction of target species. Commonly available hydrological variables in marine surveys were measured, including bottom water temperature, salinity, and depth (details are shown in Supplementary Table S2; Supplementary Fig. S3), using a CTD system (XR-420) at the same sampling stations after hauling.

Predictive models

We selected six SDMs following three approaches in terms of how species associations are utilized. The first modelling approach is single-species distribution models (SSDM), which refer to the traditional methods that exclude community data. Two commonly used models, random forest (RF)⁶⁰ and artificial neural network (ANN)⁶¹ are adopted. The two models are selected because they are powerful and can automatically deal with non-linear relationships that are prevalent in ecological studies^62,63. The two models are used as references to evaluate how community information may improve the prediction of rare species distribution.

The second approach includes multivariate random forest (MRF) and multi-response artificial neural network (MANN), which are extensions of RF and ANN, respectively, to account for multiple response variables. The former is analogous to RF in terms of bootstrap resampling, but the split function is modified to minimize species compositional dissimilarity within groups^64,65. The latter shares the same algorithm as ANN, but its output layer has multiple neurons⁶⁶. The connection coefficients between input and hidden layers affect all species collectively in MANN. Although both MRF and MANN are designed for modelling community data, their algorithms account for the information of species associations implicitly (cf. the following category).

The third approach accounts for species associations explicitly, including two JSDMs that adopt the Bayesian hierarchical framework. The first is a versatile statistical framework of hierarchical modelling of species communities (HMSC)³², which uses latent variables to incorporate information of species associations^32,67. The other is generalized joint attribute model (GJAM), designed to accommodate multifarious data types flexibly, such as presence-absence, ordinal, continuous, discrete, composition and censored data^59,68. The model represents species responses using a latent continuous variable, which can be censored to the discrete space of observations.

All the models were implemented on the R platform (version 3.5.1), using the packages “randomForest”, “nnet”, “MultivariateRandomForest”, “HMSC”, and “gjam”. A summary of the models is provided in Table 2, and additional technical details are given in the Supporting Information.

Table 2 A summary of predictive models used in this study.

Prediction improvement

We tested two approaches to improving the predictions of JSDMs and SSDMs, using species filtering and conditional prediction, respectively. It should be noted that the “improved” models used the same algorithms as above, whereas the variables used for model fitting varied. The first approach followed the concern that community models might not benefit predictions when the response variables were poorly correlated²⁹. To avoid such undue influences, we selected ancillary species from the 31 common ones according to their correlations with the target species. Three levels of species filtering were considered: level-1 (LV1) included all 31 common species, level-2 (LV2) included the two-thirds of species with the highest correlations, and level-3 (LV3) included the one-third of species with the highest correlations. The process of species selection was conducted for each target species, and JSDMs were fitted with each target species and its corresponding ancillary species at the different levels of filtering (LV).
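The three filtering levels can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the text does not specify the correlation measure, so Pearson correlation ranked by absolute value is assumed here, and the function names and the toy abundance vectors are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def filter_ancillary(target, ancillary, fraction):
    """Keep the top `fraction` of ancillary species, ranked by |r| with the target.

    Ranking by absolute Pearson correlation is an assumption of this sketch.
    """
    ranked = sorted(ancillary,
                    key=lambda name: abs(pearson(target, ancillary[name])),
                    reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical abundances at six stations
target = [0, 0, 1, 3, 0, 2]
ancillary = {
    "a": [0, 0, 2, 6, 0, 4],   # perfectly correlated with the target
    "b": [5, 4, 3, 2, 1, 0],   # moderately (negatively) correlated
    "c": [1, 1, 1, 1, 1, 1],   # constant, treated as uncorrelated
}
lv1 = filter_ancillary(target, ancillary, 1.0)    # all species
lv2 = filter_ancillary(target, ancillary, 2 / 3)  # top two-thirds
lv3 = filter_ancillary(target, ancillary, 1 / 3)  # top third
```

With 31 ancillary species, this rounding rule would keep 21 species at LV2 and 10 at LV3; the exact counts used in the study are not stated in the text.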

The second approach, conditional prediction, was designed to improve the SSDMs by using ancillary species directly as predictive variables⁶⁹. The ancillary species were considered in two scenarios: one in which ancillary species were observed at all sampling sites, and the other in which they were predicted from JSDMs. Obtained either way, the information on ancillary species was used in SSDMs as predictive variables. To suppress noise and reduce the number of predictive variables, principal component analysis (PCA) was conducted on the ancillary species data prior to model fitting, and only PCs with eigenvalues above one were included in the conditional models⁷⁰.

Evaluation procedures

A four-fold cross validation procedure was used to evaluate the models’ predictive performances. The total data were split into four equal-sized subsamples, of which 75% were used for model training and the remaining 25% for testing, iteratively. To avoid potential failures with all-zero training/testing datasets, the nonzero data of rare species were randomly assigned to the four subsamples to ensure that each had an equal number of occurrences of the target species. Specifically, data splitting was conducted separately for samples with and without target species, and a permutation process was used to assign the survey data to the four subsamples.
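The splitting scheme described above amounts to a stratified four-fold split. The Python sketch below is one way to express it, assuming presences are defined as abundance > 0; the function names, the random seed and the toy abundance vector are hypothetical, and the authors' actual permutation procedure may differ in detail.

```python
import random


def stratified_folds(abundance, k=4, seed=0):
    """Assign site indices to k folds, spreading presences and absences evenly."""
    rng = random.Random(seed)
    present = [i for i, a in enumerate(abundance) if a > 0]
    absent = [i for i, a in enumerate(abundance) if a == 0]
    folds = [[] for _ in range(k)]
    position = 0
    # Shuffle each group separately, then deal sites to the folds in turn,
    # so presence counts and fold sizes differ by at most one.
    for group in (present, absent):
        rng.shuffle(group)
        for idx in group:
            folds[position % k].append(idx)
            position += 1
    return folds


def train_test_pairs(folds):
    """Yield (train, test) index lists, using each fold once for testing."""
    for j, test in enumerate(folds):
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield sorted(train), sorted(test)


# Hypothetical abundances at 16 stations with four presences
abundance = [3, 1, 0, 0, 0, 2, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0]
folds = stratified_folds(abundance)
pairs = list(train_test_pairs(folds))
```

Repeating the call with different seeds gives the repeated random splits used to average performance over many cross-validation runs.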

The predictive performances for species abundance were measured by the root mean square error (RMSE) between observations and model predictions, RMSE = \(\sqrt{\sum_{i=1}^{N}{({P}_{i}-{O}_{i})}^{2}/N}\), where \({P}_{i}\) and \({O}_{i}\) were the prediction and observation of abundance at sampling site i, respectively, and N was the number of sampling sites (RMSE thus has the same unit as abundance, and the unit is omitted in the text). In addition, we were concerned with the models’ predictive power for non-zero observations and used partial relative bias (PRB) to measure predictive accuracy at the sampling sites where target species were present, i.e., PRB = \(({P}_{p}-{O}_{p})/{O}_{p}\), where \({O}_{p}\) was a non-zero observation and \({P}_{p}\) was the prediction at the corresponding sampling site.
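The two abundance metrics can be written directly from their definitions. The following Python sketch is illustrative only: averaging the site-level PRB values over presence sites is an assumption (the text defines PRB per site and reports it on average), and the example numbers are made up.

```python
import math


def rmse(pred, obs):
    """Root mean square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def partial_relative_bias(pred, obs):
    """Mean of (P - O) / O over sites where the species was observed (O > 0).

    Averaging across presence sites is an assumption of this sketch.
    """
    ratios = [(p - o) / o for p, o in zip(pred, obs) if o > 0]
    if not ratios:
        raise ValueError("no presence sites to evaluate")
    return sum(ratios) / len(ratios)


# Hypothetical observations and predictions at four sites
obs = [0.0, 0.0, 4.0, 0.0]
pred = [0.0, 1.0, 2.0, 0.0]

model_rmse = rmse(pred, obs)                 # sqrt((1 + 4) / 4)
baseline_rmse = rmse([0.0] * len(obs), obs)  # predicting absence everywhere
prb = partial_relative_bias(pred, obs)       # (2 - 4) / 4
```

Comparing `model_rmse` with `baseline_rmse` mirrors the all-zero reference used in the Results: a model is only informative for abundance if its RMSE falls below that of assuming the species is absent everywhere. A negative `prb` indicates underestimation at presence sites.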

Performances on predicting species occurrence were measured by the area under curve (AUC) of receiver operating characteristic and Cohen’s κ coefficient¹⁷. The former has been commonly used for model evaluation of presence-absence, and the latter is used to indicate the chance-corrected agreement between predictions and observations⁷¹. A random guess of occurrence leads to 0.5 and zero in AUC and Cohen’s κ, respectively. Additionally, True Skill Statistics⁷² were calculated and shown in the Supporting Information. Given that low detectability of rare species might lead to zero observations, a species-specific threshold, mean abundance in the whole area, was used to determine species occurrence from predicted abundance. Data splitting, model fitting, prediction, and evaluation were conducted for each of the target species, and the processes of cross-validation were repeated 500 times.
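For readers who want to reproduce the occurrence metrics, a minimal Python sketch follows. AUC is computed as the proportion of presence/absence pairs ranked correctly (ties counted as one half), and Cohen’s κ from the confusion matrix. Using the mean observed abundance as the occurrence threshold is an assumption about how the species-specific threshold was applied, and all names and numbers are hypothetical.

```python
def auc(scores, labels):
    """AUC as the share of (presence, absence) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one presence and one absence")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cohen_kappa(pred, obs):
    """Chance-corrected agreement between binary predictions and observations."""
    n = len(obs)
    tp = sum(1 for p, o in zip(pred, obs) if p and o)
    tn = sum(1 for p, o in zip(pred, obs) if not p and not o)
    fp = sum(1 for p, o in zip(pred, obs) if p and not o)
    fn = sum(1 for p, o in zip(pred, obs) if not p and o)
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    if p_expected == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is this sketch's convention
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical observed and predicted abundances at eight sites
obs_abund = [0, 0, 4, 0, 0, 2, 0, 0]
pred_abund = [0.1, 0.0, 2.5, 0.9, 0.2, 0.5, 0.0, 0.3]

threshold = sum(obs_abund) / len(obs_abund)  # assumed: mean observed abundance
obs_occ = [a > 0 for a in obs_abund]
pred_occ = [p > threshold for p in pred_abund]

auc_value = auc(pred_abund, obs_occ)
kappa_value = cohen_kappa(pred_occ, obs_occ)
```

AUC uses the continuous predictions and is threshold-free, whereas κ depends on the chosen threshold, which is one reason the two metrics can rank models differently.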

Data availability

Data and R codes may be available from the Dryad Digital Repository.

References

Guisan, A. et al. Predicting species distributions for conservation decisions. Ecol. Lett. 16, 1424–1435 (2013).
PubMed PubMed Central Google Scholar
Elith, J. & Leathwick, J. R. Species distribution models: Ecological explanation and prediction across space and time. Annu. Rev. Ecol. Evol. Syst. 40, 677–697 (2009).
Google Scholar
Robinson, N. M., Nelson, W. A., Costello, M. J., Sutherland, J. E. & Lundquist, C. J. A systematic review of marine-based species distribution models (SDMs) with recommendations for best practice. Front. Mar. Sci. 4, 421 (2017).
Google Scholar
Sofaer, H. R. et al. Development and delivery of species distribution models to inform decision-making. Bioscience 69, 480–480 (2019).
Google Scholar
Guisan, A. & Thuiller, W. Predicting species distribution: Offering more than simple habitat models. Ecol. Lett. 8, 993–1009 (2005).
Google Scholar
Hao, T., Elith, J., Guillera-Arroita, G. & Lahoz-Monfort, J. J. A review of evidence about use and performance of species distribution modelling ensembles like BIOMOD. Divers. Distrib. https://doi.org/10.1111/DDI.12892 (2019).
Article Google Scholar
Gogol-Prokurat, M. Predicting habitat suitability for rare plants at local spatial scales using a species distribution model. Ecol. Appl. 21, 33–47 (2011).
PubMed Google Scholar
Lomba, A. et al. Overcoming the rare species modelling paradox: A novel hierarchical framework applied to an Iberian endemic plant. Biol. Conserv. 143, 2647–2657 (2010).
Google Scholar
Breiner, F. T., Guisan, A., Bergamini, A. & Nobis, M. P. Overcoming limitations of modelling rare species by using ensembles of small models. Methods Ecol. Evol. 6, 1210–1218 (2015).
Google Scholar
Magurran, A. E. & Henderson, P. A. Explaining the excess of rare species in natural species abundance distributions. Nature 422, 714–716 (2003).
ADS CAS PubMed Google Scholar
Cao, Y., Larsen, D. P. & Thorne, R.S.-J.J. Rare species in multivariate analysis for bioassessment: Some considerations. J. N. Am. Benthol. Soc. 20, 144–153 (2001).
Google Scholar
Foden, W. B. et al. Climate change vulnerability assessment of species. Wiley Interdiscip. Rev. Clim. Chang. 10, e551 (2019).
Google Scholar
Guisan, A. et al. Using niche-based models to improve the sampling of rare species. Conserv. Biol. 20, 501–511 (2006).
PubMed Google Scholar
Ancillotto, L. et al. An African bat in Europe, Plecotus gaisleri: Biogeographic and ecological insights from molecular taxonomy and Species Distribution Models. Ecol. Evol. https://doi.org/10.1002/ece3.6317 (2020).
Article PubMed PubMed Central Google Scholar
Della Rocca, F., Bogliani, G., Breiner, F. T. & Milanesi, P. Identifying hotspots for rare species under climate change scenarios: Improving saproxylic beetle conservation in Italy. Biodivers. Conserv. 28, 433–449 (2019).
Google Scholar
Cunningham, R. B. & Lindenmayer, D. B. Modeling count data of rare species: Some statistical issues. Ecology 86, 1135–1142 (2005).
Google Scholar
Vaughan, I. P. & Ormerod, S. J. The continuing challenges of testing species distribution models. J. Appl. Ecol. 42, 720–730 (2005).
Google Scholar
Franklin, J., Wejnert, K. E., Hathaway, S. A., Rochester, C. J. & Fisher, R. N. Effect of species rarity on the accuracy of species distribution models for reptiles and amphibians in southern California. Divers. Distrib. 15, 167–177 (2009).
Google Scholar
Engler, R., Guisan, A. & Rechsteiner, L. An improved approach for predicting the distribution of rare and endangered species from occurrence and pseudo-absence data. J. Appl. Ecol. 41, 263–274 (2004).
Google Scholar
Chefaoui, R. M. & Lobo, J. M. Assessing the effects of pseudo-absences on predictive distribution model performance. Ecol. Modell. 210, 478–486 (2008).
Google Scholar
Phillips, S. J. et al. Sample selection bias and presence-only distribution models: Implications for background and pseudo-absence data. Ecol. Appl. 19, 181–197 (2009).
PubMed Google Scholar
Meynard, C. N. & Quinn, J. F. Predicting species distributions: A critical comparison of the most common statistical models using artificial species. J. Biogeogr. 34, 1455–1469 (2007).
Google Scholar
Royle, J. A., Nichols, J. D. & Kéry, M. Modelling occurrence and abundance of species when detection is imperfect. Oikos 110, 353–359 (2005).
Google Scholar
Welsh, A. H., Cunningham, R. B., Donnelly, C. F. & Lindenmayer, D. B. Modelling the abundance of rare species: Statistical models for counts with extra zeros. Ecol. Modell. 88, 297–308 (1996).
Google Scholar
Yates, K. L. et al. Outstanding challenges in the transferability of ecological models. Trends Ecol. Evol. 33, 790–802 (2018).
PubMed Google Scholar
Williams, J. N. et al. Using species distribution models to predict new occurrences for rare plants. Divers. Distrib. 15, 565–576 (2009).
Rufener, M.-C., Kinas, P. G., Nóbrega, M. F. & Lins Oliveira, J. E. Bayesian spatial predictive models for data-poor fisheries. Ecol. Modell. 348, 125–134 (2017).
Blangiardo, M. & Cameletti, M. Spatial and spatial-temporal Bayesian models with R-INLA. Spat. Spatiotemporal Epidemiol. 4, 33–49 (2013).
Nieto-Lugilde, D., Maguire, K. C., Blois, J. L., Williams, J. W. & Fitzpatrick, M. C. Multiresponse algorithms for community-level modelling: Review of theory, applications, and comparison to species distribution models. Methods Ecol. Evol. 9, 834–848 (2018).
Warton, D. I. et al. So many variables: Joint modeling in community ecology. Trends Ecol. Evol. 30, 766–779 (2015).
Thorson, J. T., Pinsky, M. L. & Ward, E. J. Model-based inference for estimating shifts in species distribution, area occupied and centre of gravity. Methods Ecol. Evol. https://doi.org/10.1111/2041-210X.12567 (2016).
Ovaskainen, O. et al. How to make more out of community data? A conceptual framework and its implementation as models and software. Ecol. Lett. 20, 561–576 (2017).
Hui, F. K. C. Boral-Bayesian ordination and regression analysis of multivariate abundance data in R. Methods Ecol. Evol. 7, 744–750 (2016).
Warton, D. I., Foster, S. D., De’ath, G., Stoklosa, J. & Dunstan, P. K. Model-based thinking for community ecology. Plant Ecol. 216, 669–682 (2015).
Ovaskainen, O. & Soininen, J. Making more out of sparse data: Hierarchical modeling of species communities. Ecology 92, 289–295 (2011).
Hui, F. K. C., Warton, D. I., Foster, S. D. & Dunstan, P. K. To mix or not to mix: Comparing the predictive performance of mixture models vs separate species distribution models. Ecology 94, 1913–1919 (2013).
Leach, K., Montgomery, W. I. & Reid, N. Modelling the influence of biotic factors on species distribution patterns. Ecol. Modell. 337, 96–106 (2016).
Anderson, R. P. When and how should biotic interactions be considered in models of species niches and distributions? J. Biogeogr. 44, 8–17 (2017).
D’Amen, M., Rahbek, C., Zimmermann, N. E. & Guisan, A. Spatial predictions at the community level: From current approaches to future frameworks. Biol. Rev. 92, 169–187 (2017).
Kindsvater, H. K. et al. Overcoming the data crisis in biodiversity conservation. Trends Ecol. Evol. 33, 676–688 (2018).
Thorson, J. T., Kell, L. T., De Oliveira, J. A. A., Sampson, D. B. & Punt, A. E. Introduction to data-poor stock assessment. Fish. Res. 171, 1–3 (2015).
Schliep, E. M. et al. Joint species distribution modelling for spatio-temporal occurrence and ordinal abundance data. Glob. Ecol. Biogeogr. 27, 142–155 (2018).
Maguire, K. C. et al. Controlled comparison of species- and community-level models across novel climates and communities. Proc. R. Soc. B Biol. Sci. 283, 20152817 (2016).
Zhang, C., Chen, Y., Xu, B., Xue, Y. & Ren, Y. Comparing the prediction of joint species distribution models with respect to characteristics of sampling data. Ecography (Cop.) 41, 1876–1887 (2018).
Wilkinson, D. P., Golding, N., Guillera-Arroita, G., Tingley, R. & McCarthy, M. A. A comparison of joint species distribution models for presence–absence data. Methods Ecol. Evol. 10, 198–211 (2019).
Ehrlén, J. & Morris, W. F. Predicting changes in the distribution and abundance of species under environmental change. Ecol. Lett. 18, 303–314 (2015).
Smeraldo, S. et al. Modelling risks posed by wind turbines and power lines to soaring birds: The black stork (Ciconia nigra) in Italy as a case study. Biodivers. Conserv. 29, 1959–1976 (2020).
Rizvanovic, M., Kennedy, J. D., Nogués-Bravo, D. & Marske, K. A. Persistence of genetic diversity and phylogeographic structure of three New Zealand forest beetles under climate change. Divers. Distrib. 25, 142–153 (2019).
Guillera-Arroita, G. et al. Is my species distribution model fit for purpose? Matching data and models to applications. Glob. Ecol. Biogeogr. 24, 276–292 (2015).
Hernandez, P. A., Graham, C. H., Master, L. L. & Albert, D. L. The effect of sample size and species characteristics on performance of different species distribution modeling methods. Ecography 29, 773–785 (2006).
Thibaud, E., Petitpierre, B., Broennimann, O., Davison, A. C. & Guisan, A. Measuring the relative effect of factors affecting species distribution model predictions. Methods Ecol. Evol. 5, 947–955 (2014).
Rabinowitz, D., Cairns, S. & Dillon, T. Seven forms of rarity and their frequency in the flora of the British Isles. In Conservation Biology: The Science of Scarcity and Diversity 182–204 (Sinauer, 1986).
Gaston, K. J. What is Rarity? In The Biology of Rarity: Causes and Consequences of Rare-Common Differences 30–47 (Chapman and Hall, New York, 1997).
Boulesteix, A.-L., Janitza, S., Kruppa, J. & König, I. R. Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2, 493–507 (2012).
Özesmi, S. L., Tan, C. O. & Özesmi, U. Methodological issues in building, training, and testing artificial neural networks in ecological applications. Ecol. Modell. 195, 83–93 (2006).
Norberg, A. et al. A comprehensive evaluation of predictive performance of 33 species distribution models at species and community levels. Ecol. Monogr. 89, e01370 (2019).
Harris, D. J. Generating realistic assemblages with a joint species distribution model. Methods Ecol. Evol. 6, 465–473 (2015).
Elith, J. et al. Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29, 129–151 (2006).
Clark, J. S., Nemergut, D., Seyednasrollah, B., Turner, P. J. & Zhang, S. Generalized joint attribute modeling for biodiversity analysis: Median-zero, multivariate, multifarious data. Ecol. Monogr. 87, 34–56 (2017).
Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
Suryanarayana, I. et al. Neural networks in fisheries research. Fish. Res. 92, 115–139 (2008).
Brun, P., Kiørboe, T., Licandro, P. & Payne, M. R. The predictive skill of species distribution models for plankton in a changing climate. Glob. Chang. Biol. 22, 3170–3181 (2016).
Smoliński, S. & Radtke, K. Spatial prediction of demersal fish diversity in the Baltic Sea: Comparison of machine learning and regression-based techniques. ICES J. Mar. Sci. J. Cons. 74, 102–111 (2017).
Segal, M. & Xiao, Y. Multivariate random forests. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 1, 80–87 (2011).
Rahman, R., Otridge, J. & Pal, R. IntegratedMRF: Random forest-based framework for integrating prediction from different data types. Bioinformatics 33, 1407–1410 (2017).
Olden, J. D. A species-specific approach to modeling biological communities and its potential for conservation. Conserv. Biol. 17, 854–863 (2003).
Ovaskainen, O., Roy, D. B., Fox, R. & Anderson, B. J. Uncovering hidden spatial structure in species communities with spatially explicit joint species distribution models. Methods Ecol. Evol. 7, 428–436 (2016).
Clark, J. S. Why species tell more about traits than traits about species: Predictive analysis. Ecology 97, 1979–1993 (2016).
Araújo, M. B. & Luoto, M. The importance of biotic interactions for modelling species distributions under climate change. Glob. Ecol. Biogeogr. 16, 743–753 (2007).
Peres-Neto, P. R., Jackson, D. A. & Somers, K. M. How many principal components? Stopping rules for determining the number of non-trivial axes revisited. Comput. Stat. Data Anal. 49, 974–997 (2005).
Fielding, A. H. & Bell, J. F. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environ. Conserv. 24, 38–49 (1997).
Allouche, O., Tsoar, A. & Kadmon, R. Assessing the accuracy of species distribution models: Prevalence, kappa and the true skill statistic (TSS). J. Appl. Ecol. 43, 1223–1232 (2006).
Basheer, I. & Hajmeer, M. Artificial neural networks: Fundamentals, computing, design, and application. J. Microbiol. Methods 43, 3–31 (2000).
CAS PubMed Google Scholar


Acknowledgements

The authors acknowledge all colleagues who have contributed to the marine surveys, data collection and analyses. Funding for this study was provided by the National Key R&D Program of China (2018YFD0900906) and the National Natural Science Foundation of China (31802301, 31772852).

Author information

Authors and Affiliations

College of Fisheries, Ocean University of China, 216, Fisheries Hall, 5 Yushan Road, Qingdao, 266003, China
Chongliang Zhang, Binduo Xu, Ying Xue & Yiping Ren
School of Marine Sciences, University of Maine, 216, Libby Hall, Orono, ME, 04469, USA
Yong Chen
Field Observation and Research Station of Haizhou Bay Fishery Ecosystem, Ministry of Education, Qingdao, 266003, China
Yiping Ren
Laboratory for Marine Fisheries Science and Food Production Processes, Pilot National Laboratory for Marine Science and Technology (Qingdao), 1 Wenhai Road, Qingdao, 266237, China
Yiping Ren

Authors

Chongliang Zhang
Yong Chen
Binduo Xu
Ying Xue
Yiping Ren

Contributions

C.Z. conceived the ideas, designed the study and wrote the first draft of the manuscript. Y.C. contributed ideas for the interpretation of the analyses and revised the manuscript. B.X., Y.X. and Y.R. collected the data and performed the analyses. All authors gave final approval for publication.

Corresponding author

Correspondence to Yiping Ren.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, C., Chen, Y., Xu, B. et al. Improving prediction of rare species’ distribution from community data. Sci Rep 10, 12230 (2020). https://doi.org/10.1038/s41598-020-69157-x

Received: 19 February 2020
Accepted: 29 June 2020
Published: 22 July 2020
DOI: https://doi.org/10.1038/s41598-020-69157-x
