Unveiling global species abundance distributions

Whether most species are rare or have some intermediate abundance is a long-standing question in ecology. Here, we use more than one billion observations from the Global Biodiversity Information Facility to assess global species abundance distributions (gSADs) of 39 taxonomic classes of eukaryotic organisms from 1900 to 2019. We show that, as sampling effort increases through time, the shape of the gSAD is unveiled; that is, the shape of the sampled gSAD changes, revealing the underlying gSAD. The fraction of species unveiled for each class decreases with the total number of species in that class and increases with the number of individuals sampled, with some groups, such as birds, being fully unveiled. The best statistical fit for almost all classes was the Poisson log-normal distribution. This strong evidence for a universal pattern of gSADs across classes suggests that there may be general ecological or evolutionary mechanisms governing the commonness and rarity of life on Earth.

There was a strong relationship for the 9,047 species assessed providing support to the notion of the number of GBIF occurrences serving as a proxy for the abundance of a species.All abundances are log10-transformed.The three asterisks (***) represent significance at the p<0.001 level from a Pearson two-sided correlation test.For a smaller subset of global abundance estimates, from a different source, see Fig. S3.S7.An exploratory test of the gSAD for dragonfly species (N=3538)a subset of Insecta as a whole.We used the methods described for the main analysis and tested the statistical fit of the log-series (ls; green line), negative binomial (nbinom; orange line), and Poisson log-normal (poilog; purple line) distributions.The aggregation method shown is the 20-year rolling window.

Fig. S8.
An exploratory test of the gSAD for butterfly species (N=9702)a subset of Insecta as a whole.We used the methods described for the main analysis and tested the statistical fit of the log-series (ls; green line), negative binomial (nbinom; orange line), and Poisson log-normal (poilog; purple line) distributions.The aggregation method shown is the 20-year rolling window.

Fig. S9.
An exploratory test of the gSAD for Coleoptera species (N=49175)a subset of Insecta as a whole.We used the methods described for the main analysis and tested the statistical fit of the log-series (ls; green line), negative binomial (nbinom; orange line), and Poisson lognormal (poilog; purple line) distributions.The aggregation method shown is the 20-year rolling window.

Fig. S10.
An exploratory test of the gSAD for Diptera species (N=25265)a subset of Insecta as a whole.We used the methods described for the main analysis and tested the statistical fit of the log-series (ls; green line), negative binomial (nbinom; orange line), and Poisson log-normal (poilog; purple line) distributions.The aggregation method shown is the 20-year rolling window.

Fig. S11.
A posterior distribution of a Bayesian model fit for the relationship between the percentage of the gSAD uncovered and the total observed species richness in a taxonomic class.Models were fit using n=39 independent classes as replicates.The more observed species in a class, the further away from the gSAD was of being fully sampled.The bars represent the posterior median (the dot), 66%, and 95% quantile intervals.

Fig. S12.
A posterior distribution of a Bayesian model fit for the relationship between the percentage of the gSAD uncovered and the total number of observations in a taxonomic class.Models were fit using n=39 independent classes as replicates.The more observations in a class, the closer the gSAD was of being fully sampled.The bars represent the posterior median (the dot), 66%, and 95% quantile intervals.

Fig. S13.
A posterior distribution of a Bayesian model fit for the relationship between the percentage of the gSAD uncovered and the total observed species richness divided by the total number of observations.Models were fit using n=39 independent classes as replicates.The lower the proportional sampling in a class (i.e., the more observations per species) the closer the gSAD was of being fully sampled.The bars represent the posterior median (the dot), 66%, and 95% quantile intervals.

Fig. S1 .
Fig. S1.The 39 classes included in our analyses, and the number of total observations used, present in GBIF from 1900-2019, for each class (on a log10-scale).

Fig. S2 .
Fig. S2.The relationship between a global abundance estimate (in absolute numbers of individuals) from Callaghan et al. 2021 and the number of records in GBIF for 1900-2019 (cumulative), 1999-2019 (20 years cumulative), 2009-2019 (10 years cumulative), and just 2019 (individual year).There was a strong relationship for the 9,047 species assessed providing support to the notion of the number of GBIF occurrences serving as a proxy for the abundance of a species.All abundances are log10-transformed.The three asterisks (***) represent significance at the p<0.001 level from a Pearson two-sided correlation test.For a smaller subset of global abundance estimates, from a different source, see Fig.S3.

Fig. S3 .
Fig. S3.The relationship between a global abundance estimate (in absolute numbers of individuals) from BirdLife International 2022 and the number of records in GBIF for 1900-2019 (cumulative), 1999-2019 (20 years cumulative), 2009-2019 (10 years cumulative), and just 2019 (individual year).There was a strong relationship for the 3216 species assessed providing support to the notion of the number of GBIF occurrences serving as a proxy for the abundance of a species.All abundances are log10-transformed.The three asterisks (***) represent significance at the p<0.001 level from a Pearson two-sided correlation test.For a larger subset of global abundance estimates, from a different source, see Fig.S2.

Fig. S5 .
Fig. S5.The gSAD for each of 10 example classes, where each year is an individual year treated separately, in contrast to our analysis presented in the main text where each year represents a 20year rolling window.Despite only being yearly, by the end of the time series, birds are still being fully sampled showing a log-left skew distribution.See animations for all 39 classes here.

Fig. S6 .
Fig. S6.The final 20-year rolling window correlation of predicted vs observed fitted values for each class in our analysis, showing that the Poisson log-normal had the best 'fit' for all but one (Charopyceae) class.

Fig.
Fig.S7.An exploratory test of the gSAD for dragonfly species (N=3538)a subset of Insecta as a whole.We used the methods described for the main analysis and tested the statistical fit of the log-series (ls; green line), negative binomial (nbinom; orange line), and Poisson log-normal (poilog; purple line) distributions.The aggregation method shown is the 20-year rolling window.

Fig. S14 .
Fig. S14.The global species abundance distribution for 39,065 tree species, as defined by the Botanic Gardens Conservation International global tree list (36), showing that the Poisson lognormal is the superior fit both in the last 20-year rolling window (A) and through time (B).The aggregation method shown is the 20-year rolling window.

Fig. S15 .
Fig. S15.Exploratory analysis for the class Aves, showing that when trimming the GBIF species 'taxonomy' (A) to only those species included in an official bird taxonomy checklist (B) the same qualitative and quantitative results are returned.For this reason, because we were assessing 39 different classes that would present a computation bottleneck to resolve taxonomies for each class, we use the GBIF 'taxonomy' throughout our main results.The aggregation method shown is the 20-year rolling window.

Fig. S16 .
Fig. S16.Exploratory analysis for the class Mammalia, showing that when trimming the GBIF species 'taxonomy' (A) to only those species included in an official mammal taxonomy checklist (B) the same qualitative and quantitative results are returned.For this reason, because we were assessing 39 different classes that would present a computation bottleneck to resolve taxonomies for each class, we use the GBIF 'taxonomy' throughout our main results.The aggregation method shown is the 20-year rolling window.

Fig. S17 .
Fig. S17.Exploratory analyses showing the robustness of the negative binomial fitting procedure by adjusting the starting parameter values, mu, and r.The results here are presented for Aves, and illustrate that regardless of starting parameter values (each line in a panel), the negative binomial produced a poor fit (color of lines is equal to the correlation value between observed and fitted distribution).

Fig. S18 .
Fig. S18.Exploratory analyses showing the robustness of the negative binomial fitting procedure by adjusting the starting parameter values, mu, and r.The results here are presented for Mammalia, and illustrate that regardless of starting parameter values (each line in a panel), the negative binomial produced a poor fit (color of lines is equal to the correlation value between observed and fitted distribution).

Fig. S19 .
Fig. S19.Exploratory analyses showing the robustness of the negative binomial fitting procedure by adjusting the starting parameter values, mu, and r.The results here are presented for Arachnida, and illustrate that regardless of starting parameter values (each line in a panel), the negative binomial produced a poor fit (color of lines is equal to the correlation value).

Fig. S20 .
Fig. S20.Exploratory analyses showing the robustness of the negative binomial fitting procedure by adjusting the starting parameter values, mu, and r.The results here are presented for Amphibia, and illustrate that regardless of starting parameter values (each line in a panel), the negative binomial produced a poor fit (color of lines is equal to the correlation value).

Fig. S21 .
Fig. S21.Exploratory analyses illustrating that different measures of goodness of fit show similar trends in our understanding of the gSAD through time, where Poisson log-normal gains importance and becomes the statistically best fit by the end of the time series.Illustrated here is the fit for Cephalopoda.The Chi-sqaure test was two-sided.

Fig. S22 .
Fig. S22.Exploratory analyses illustrating that different measures of goodness of fit show similar trends in our understanding of the gSAD through time, where Poisson log-normal gains importance and becomes the statistically best fit by the end of the time series.Illustrated here is the fit for Cycadopsida.The Chi-sqaure test was two-sided.

Fig. S23 .
Fig. S23.Exploratory analyses illustrating that different measures of goodness of fit show similar trends in our understanding of the gSAD through time, where Poisson log-normal gains importance and becomes the statistically best fit by the end of the time series.Illustrated here is the fit for Aves.The Chi-sqaure test was two-sided.

Fig. S24 .
Fig. S24.Exploratory analyses illustrating that different measures of goodness of fit show similar trends in our understanding of the gSAD through time, where Poisson log-normal gains importance and becomes the statistically best fit by the end of the time series.Illustrated here is the fit for Amphibia.The Chi-sqaure test was two-sided.

Fig. S25 .
Fig. S25.Hypothetical relationship between observed abundances (as measured by number of occurrences in GBIF) and the real abundances of species.The observed abundances can be biased towards rare species (left) or biased towards common species (right).See methods for more details.

Fig. S26 .
Fig.S26.Proportion of simulated communities for which the Poisson log-normal is a better fit than the negative binomial at different values of q, from extremely biased occurrences towards rare species (q=0.1) to extremely biased occurrences towards common species (q=2), for communities sampled from a gamma distribution with the best GBIF fit parameters for birds.Similar results occur when communities are sampled from a gamma distribution with the best fit parameters for mammals.

Fig. S27 .
Fig. S27.Relationship between log number of GBIF occurrences and log estimated global abundance for birds.Top: linear regression plot for GBIF occurrences 1999-2019.Middle: Q-Q plot of the residuals 1999-2019.Bottom: linear regression plot for GBIF occurrences 1900-1919.