## Introduction

Our understanding of the global patterns of plant diversity largely stems from studies based on either local to national floras or stacked distribution range maps1,2,3,4. These approaches allow quantification of the total number of species occurring in a region but do not address how plant species co-occur locally and form species-rich or species-poor communities. With the notable exceptions of trees and ferns5,6,7,8,9, the global distribution of local plant diversity remains poorly understood10.

The species richness of local plant communities, i.e., alpha diversity, is non-linearly related to the size of the sampling unit, i.e., the spatial grain9,11,12,13. Enlarging the sampling unit means that more species are progressively captured in the same plot, so that the alpha diversity of a sampled plot slowly, but non-linearly, approaches the regional species richness, i.e., gamma diversity11,12,14. The steepness of the curve, i.e., beta diversity, determines how the plant community composition varies from place to place11. This non-linearity complicates direct comparisons of biodiversity data from place to place and makes mapping alpha diversity across large areas challenging. Even in well-sampled regions, available data are heterogeneous mixtures of surveys with varying spatial grains, sampling protocols and reference taxonomies9,15,16. Furthermore, there is a typical trade-off between spatial grain and extent in biodiversity research, with most fine-grained studies only covering limited spatial extents. Thus, the question of whether global patterns of alpha diversity are consistent with known patterns of regional gamma diversity has remained unanswered.

Plant diversity patterns result from ecological and evolutionary processes acting at different spatial and temporal scales17,18. At continental and regional scales, evolutionary processes (migration, speciation, extinction) as well as geological and climatic history play key roles19,20. At local scales, diversity depends primarily on assembly processes related to species dispersal, habitat filtering and biotic interactions (including humans)3,21,22. There is clearly an intimate nested relationship between processes at different scales, as a species must be present regionally to occur locally, and large-scale environmental factors influence local conditions8,17,23. An exploration of alpha diversity patterns at multiple grain sizes can discriminate between areas where species richness is consistently high or low across grain sizes, and those where it is not, i.e., where species richness is either high at fine grains and low at coarse grains, or vice versa9,10,24,25. This may provide insights into the prevailing mechanisms that shape biodiversity distribution at different scales, and which produce and maintain global plant diversity11,26. For example, the discrepancies between alpha diversity patterns at different grains could indicate regional or biome-related variation in the roles of habitat heterogeneity, dispersal barriers or environmental filtering27,28,29.

Here, we explore alpha diversity patterns across multiple spatial grains globally. We leverage methodological advances in modeling biodiversity across scales9,12,18,30,31,32 using the sPlot database, a global initiative that aggregates and harmonizes local-scale species co-occurrence data from hundreds of independent datasets and vegetation surveys15,16. The sPlot database incorporates more than 1 million vegetation plots and covers both natural and semi-natural ecosystems on all continents and in all biomes15. We focused on terrestrial vascular plants only, since data on bryophytes, lichens, vascular epiphytes and aquatic habitats are too scattered in the sPlot database.

We applied machine learning (boosted regression trees) to model the relationships between vascular plant species richness at different grains and 20 global datasets on current and past climate, soil and topography. Our models allowed relationships between alpha diversity and environmental variables to vary across grains by including interaction terms between plot size and other predictors9. To simultaneously quantify uncertainty and to account for the uneven distribution of data across biomes and vegetation formations in our database, we averaged our results over 99 model runs, each based on a stratified resampling of the data (Supplementary Fig. 1, Supplementary Data 1). By modeling the relationships between alpha diversity and environmental variables across the globe, we (1) predicted alpha diversity of vascular plants at three different grain sizes spanning two orders of magnitude, (2) determined how the explanatory power of potential environmental drivers on alpha diversity varies across the three grain sizes, and (3) identified regional scaling anomalies, i.e., areas where alpha diversity is high at fine grain but low at coarse grains, or vice versa.

## Results

### Multi-grain global maps of local species richness

We modeled forest and non-forest ecosystems jointly but focus on each broad formation separately in the main text. Modeling them separately yielded similar results (not shown). For forests, we generated estimates for the three grain sizes most commonly used for sampling forests: 400 m2, 1000 m2 and 1 ha. At the finest grain (400 m2), the estimated alpha diversity of vascular plants (median prediction of each pixel of 2.5 arcminute resolution across the 99 resampled subsets) ranged from 1 to 120 species (median across all pixels = 22, interquartile range or IQR = 10; Fig. 1A, Supplementary Table 1). The areas with alpha diversity above the global 95th percentile (hereafter ‘hotspots’) were the forest-steppe region of easternmost Europe and Siberia, East Asia, Borneo and New Guinea, the eastern coast of Australia, the western Congo Basin, eastern Madagascar, the Andean-Amazonian foothills, the South American Atlantic Forest (‘Mata Atlântica’) and the Appalachian Mountains. Coldspots (i.e., areas with alpha diversity at a given grain size below the global 5th percentile) occurred in the Atlantic and Mediterranean part of Europe, central and western India, southern Australia, central Africa – specifically the eastern Guinean forest and the Sudanian savanna belt – and along the Pacific coast of North America. At the intermediate grain (1000 m2), the median estimated richness per grid cell in forest ecosystems ranged from 1 to 197 vascular plant species (global median across all grid cells = 29, IQR = 13) (Fig. 1B, Supplementary Table 1). Compared to the finest grain, all the hotspots in the equatorial region (Indonesia, Borneo, Andean-Amazonian foothills) increased in extent, whereas hotspots in the temperate and boreal regions either disappeared or shrank considerably. The coldspots in Western and Southern Europe and western North America remained, while those in central Africa diminished in size. 
Finally, at the coarsest grain (1 ha), average species richness per grid cell ranged from 2 to 921 species (median = 40, IQR = 39; Fig. 1C, Supplementary Table 1). At this grain, the well-known difference in species richness between the tropics and the boreal and temperate regions became apparent. The South American hotspots became connected, forming a belt spanning from the Andean-Amazonian foothills through the Chiquitano dry forest to the southern Pantanal and the Mata Atlântica regions. The hotspot in the western Congo Basin increased in size (Fig. 1C). The temperate region contained no hotspots at this grain. The coldspot in southern Australia expanded to the eastern coast, while the coldspot in central Africa disappeared. The uncertainty in alpha diversity estimates, quantified as the ratio between IQR and median across the 99 resampled subsets, was highest in the boreal regions of Canada, Central and Eastern Siberia, the Amazon and Sundaland (Supplementary Fig. 2).
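The summary statistics used above (per-pixel median across resampled runs, uncertainty as the ratio of IQR to median, and hotspots or coldspots defined by the global 95th and 5th percentiles) can be illustrated with a minimal sketch. The function names and the input layout (a list of predictions per pixel, and a dictionary of per-pixel medians) are hypothetical and are not taken from the original analysis code.

```python
from statistics import median, quantiles


def summarize_pixel(predictions):
    """Return (median, IQR / median) for one pixel across resampled model runs."""
    q1, _, q3 = quantiles(predictions, n=4, method="inclusive")
    med = median(predictions)
    # Relative uncertainty: interquartile range divided by the median.
    uncertainty = (q3 - q1) / med if med > 0 else float("nan")
    return med, uncertainty


def flag_extremes(pixel_medians, lower_pct=5, upper_pct=95):
    """Label pixels as 'hotspot', 'coldspot' or 'neither' using global percentiles."""
    cuts = quantiles(list(pixel_medians.values()), n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    labels = {}
    for pixel, value in pixel_medians.items():
        if value > high:
            labels[pixel] = "hotspot"
        elif value < low:
            labels[pixel] = "coldspot"
        else:
            labels[pixel] = "neither"
    return labels
```

In this sketch, percentiles are computed over all pixels of one formation and one grain size, so a pixel can be a hotspot at one grain and not at another.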

For non-forest ecosystems, we used an alternative set of grains: 10 m2, 100 m2 and 1000 m2, to match the most frequently used plot sizes in our database. At the finest grain (10 m2), the median estimated alpha diversity across the 99 resampled subsets ranged from 0 to 68 vascular plant species (median across all grid cells = 14, IQR = 7; Fig. 2A, Supplementary Table 1). At this grain, non-forest hotspots were widely distributed across the forest-steppe region of easternmost Europe and Siberia, the central loess plateau of China, southeastern Australia, the Drakensberg region in South Africa, subtropical South America and eastern North America. Coldspots were widespread in southern Central Asia, central and northwestern Australia, the Sahel region of Africa and along the Pacific coast of South America. At the intermediate grain (100 m2), the median estimated species richness per grid cell ranged from 0 to 90 (median = 17, IQR = 9; Fig. 2B, Supplementary Table 1), and the distribution of hotspots and coldspots remained essentially unchanged compared to the finest grain. At the coarsest grain (1000 m2), the median estimated richness per grid cell ranged from 0 to 184 species (median = 23, IQR = 13; Fig. 2C, Supplementary Table 1). Except for the loess plateau in China, hotspots were almost exclusively concentrated in subtropical regions at this grain, especially southeastern Australia, Madagascar, the Appalachian region, and the Pantanal and southern Cerrado in South America. The location of coldspots hardly changed compared to finer grains. The uncertainty in alpha diversity estimates was highest in northern Canada, the Tibetan Plateau and the Persian Gulf region (Supplementary Fig. 2). A map jointly showing alpha diversity of forest and non-forest ecosystems at 1000 m2 grain is available in the supplementary material (Supplementary Fig. 4).

Overall, the models showed a relatively high predictive power (average over 99 resampling iterations: Pearson’s r = 0.49), even after implementing a spatially constrained, block cross-validation33 that accounted for the residual non-independence of training and test datasets arising from the clustered nature of our database34. We found no major bias or trend in residuals across grain sizes, biomes or geographical regions (Supplementary Fig. 6), and the frequency distributions of observed and predicted values largely overlapped (Supplementary Fig. 7, Supplementary Table 2). The predicted values showed a slight tendency towards the sample mean with thinner tails at the extremes, which is a common feature of ensemble machine-learning methods, even with the bias-correction method we used (see Methods)35. Minor deviations only occurred for the dry mid-latitude and boreal biomes at coarse grains (Supplementary Fig. 7). Given the relatively small sample size for the wet tropics, we recommend interpreting the results for these regions with caution. For a complete description of model validation, see Supplementary Methods.
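The idea behind a spatially constrained, block cross-validation can be sketched as follows: plots are grouped into spatial blocks, and whole blocks are held out together so that test plots are never immediate neighbors of training plots. This is only an illustration under stated assumptions. The square latitude/longitude blocks, the block size `block_deg`, the number of folds `k` and all function names are hypothetical; the actual procedure is described in the Supplementary Methods.

```python
import math
import random
from collections import defaultdict


def assign_block(lat, lon, block_deg=5.0):
    """Map a coordinate to a square lat/lon block (block size is an assumption)."""
    return (math.floor(lat / block_deg), math.floor(lon / block_deg))


def spatial_block_folds(plots, k=5, block_deg=5.0, seed=0):
    """Yield (train_ids, test_ids) pairs in which whole blocks are held out.

    `plots` maps a plot identifier to a (latitude, longitude) tuple.
    """
    blocks = defaultdict(list)
    for plot_id, (lat, lon) in plots.items():
        blocks[assign_block(lat, lon, block_deg)].append(plot_id)
    block_keys = sorted(blocks)
    # Shuffle blocks (not plots) so that each fold is a random set of blocks.
    random.Random(seed).shuffle(block_keys)
    for fold in range(k):
        test_blocks = set(block_keys[fold::k])
        test = [p for b in block_keys if b in test_blocks for p in blocks[b]]
        train = [p for b in block_keys if b not in test_blocks for p in blocks[b]]
        yield train, test
```

Because the split is made at the block level, every plot appears in exactly one test fold, and no block contributes plots to both the training and the test set of the same fold.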

### Environmental and biogeographical determinants

Our statistical models reveal which of the environmental and biogeographic variables tested appear to drive alpha diversity of vascular plants (Fig. 3). Among the predictors having a higher-than-expected relative influence, plot size, i.e., the grain size of the vegetation plot, consistently ranked first across the 99 resampled models. Climate also had a high relative influence in shaping alpha diversity patterns, especially annual mean temperature and the temperature of the warmest and wettest quarter of the year (PC1 and PC4, respectively, in a principal component analysis based on 18 bioclimatic variables). The ecoregional species pool, i.e., the estimated number of species occurring in the ecoregion in which a given plot is located2, was the fourth most important predictor, highlighting the nested link between local and regional biodiversity. Finally, despite the expected importance of soil conditions for local plant diversity, only one soil variable, i.e., the percentage of coarse soil fragments, had an influence greater than 5%.

We created partial dependence plots to explore the directionality of these relationships and whether they are consistent across spatial grains and vegetation formations (Fig. 4). Plant alpha diversity increased non-linearly with increasing plot size. This effect saturated at relatively fine grains (~100 m2) in non-forest ecosystems and at 1 ha in forest ecosystems, which can be explained by the different grains at which forests and non-forests were sampled, and the different spatial structure of these vegetation types. Grain size interacted with most of the other predictors, as revealed by the different environment–richness relationships at different grains (Fig. 4). Alpha diversity increased when the size of the ecoregional species pool increased, but only for coarse grains. It also increased toward tropical regions (i.e., regions with higher temperatures of the warmest and wettest quarters, high scores on PC4) and at higher mean annual temperature (PC1), especially for coarse grains.

### Regional scaling anomalies in species richness across grain sizes

Many areas with relatively high fine-grained alpha diversity also had high alpha diversity at coarser grains (Fig. 5). For forests, our models revealed consistently high alpha diversity across grains in Sundaland, the Congo Basin, Madagascar, as well as in the eastern Andean foothills, the Amazon Basin and the South American Mata Atlântica (Fig. 5A). Areas with consistently low alpha diversity across all grains were the western parts of the USA and Canada, the Atlantic region of Europe, Fennoscandia, the Mediterranean Basin, central and northern India, and southern Australia. However, not all areas with relatively high fine-grained richness also had high coarse-grained richness, and vice versa, revealing regional scaling anomalies in plant alpha diversity patterns18. Areas with high plant alpha diversity at coarse grains, but relatively low alpha diversity at fine grains, were the tropical forests of Africa and the Guiana Shield in South America. The opposite was true in the Eastern European forest–steppe belt, northeastern Argentina, Eastern Australia and New Zealand (Fig. 5A).

The regions hosting non-forest ecosystems with consistently high plant alpha diversity across grains were the European Alps, the forest-steppe of Eastern Europe and Siberia, the loess plateau of China, Eastern Australia, eastern South Africa, Madagascar, the Chaco, Mata Atlântica and some other regions of South America, and eastern North America (Fig. 5B). Consistently low plant alpha diversity across grains occurred in Inner Asia and in the northern African desert and semi-desert regions, the Tibetan Plateau, Namib Desert, central Australia, the Atacama and High Monte deserts in the high Andean plateaus south of the equator as well as in the North American prairies and deserts. High coarse-grain species richness was associated with low fine-grain species richness in the Myanmar-Thailand-China borderland, Ethiopia and Mexico. The opposite situation was relatively rare, occurring locally in the temperate grasslands of southeastern Australia.
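One simple way to make the notion of a scaling anomaly concrete is to compare the relative rank of each pixel at the finest and at the coarsest grain. The sketch below is a hypothetical illustration only: the rank-based comparison, the thresholds `high` and `low`, the category labels and the function names are assumptions, and this is not necessarily how the maps in Fig. 5 were produced.

```python
from bisect import bisect_left


def percentile_ranks(values):
    """Return, for each pixel, the fraction of pixels with a strictly lower value."""
    ordered = sorted(values.values())
    n = len(ordered)
    return {pixel: bisect_left(ordered, v) / n for pixel, v in values.items()}


def classify_scaling(fine, coarse, high=0.75, low=0.25):
    """Compare fine- and coarse-grain richness ranks for pixels present in both maps."""
    fine_rank = percentile_ranks(fine)
    coarse_rank = percentile_ranks(coarse)
    labels = {}
    for pixel in fine:
        f, c = fine_rank[pixel], coarse_rank[pixel]
        if f >= high and c >= high:
            labels[pixel] = "consistently high"
        elif f <= low and c <= low:
            labels[pixel] = "consistently low"
        elif f >= high and c <= low:
            labels[pixel] = "high fine, low coarse"  # scaling anomaly
        elif f <= low and c >= high:
            labels[pixel] = "low fine, high coarse"  # scaling anomaly
        else:
            labels[pixel] = "intermediate"
    return labels
```

Using ranks rather than raw richness makes the two grains comparable even though absolute species numbers differ by an order of magnitude between them.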

## Discussion

By simultaneously highlighting patterns at multiple spatial grains, our maps provide a nuanced picture of the pattern of alpha diversity of vascular plants. This complements our understanding of the distribution of biodiversity hotspots36 and regional (i.e., gamma) vascular plant diversity2,3,4,37. Within the broad range of plot sizes commonly used for vegetation sampling, our maps distinguish between regions where high coarse-grained alpha diversity results largely from high fine-grained richness, and regions where high coarse-grained alpha diversity results more from species turnover between adjacent plant communities (i.e., fine-grained beta diversity).

Our results are consistent with previous studies suggesting that forests in Borneo, New Guinea, Madagascar, eastern South Africa and the Andean-Amazonian foothills are hotspots for plant biodiversity across all spatial grains37. There is considerable agreement between our map of 1-ha alpha diversity in forests and a recently published global map of tree species richness at the same grain9. Similarly, patterns of fine-grained alpha diversity in non-forest ecosystems are consistent with the local and regional patterns recently observed for alpine vegetation38 and Palearctic grasslands25. We also found good agreement with previous research in the distribution of areas of low diversity (coldspots), such as the non-forest vegetation in the western Tibetan Plateau, the semi-desert regions of central Asia, coastal Somalia and the forests in the Pacific Northwest of North America, despite the large difference in grain37.

In some regions, however, the difference between our results and previously reported patterns was striking. Few of the regions holding the world records of plant alpha diversity appeared in our results39. The foothills of the Carpathians, for instance, are known for hosting semi-natural grasslands that are among the most species-rich plant communities globally at fine grains (e.g., >100 species in 16 m2)39,40. As many as 233 species (including 59 epiphyte species, not considered here) were observed in a 100 m2 rain forest plot in Costa Rica41. At intermediate grains, very high plant species richness has been reported for the hemiboreal forests of the northern Russian Altai (149 species per 1000 m2)42 and Colombia (313 species per 1000 m2)43. At coarse grains, the world record is in Ecuador (942 species in 1 ha, including 172 epiphytes)44. Except for the Altai region, however, our maps do not show record-high species richness in any of these regions. A general explanation is that our maps represent local averages across model runs, large areas (2.5 arcminute grid resolution) and a mixture of habitat types, so that the richest sites, which are rare in the landscape, have been averaged with neighboring sites that belong to other ecosystems with lower species richness. This is true, for instance, in Europe, where our data contained most non-forest vegetation types, including species-poor grasslands on acidic soils. The lack of data for epiphytes can partially explain why our model did not predict the expected high alpha diversity in Mesoamerica, where this growth form can account for up to 25% of forest species41,45,46.

Interestingly, our models highlighted that alpha diversity does not differ markedly between temperate and tropical regions at the finest grains, but differences become more pronounced at coarser grains. This may reflect the often overlooked fact that tropical forests have a relatively species-poor herb layer compared to temperate forest ecosystems46,47. For instance, the high alpha diversity of trees in West African forests7 is not accompanied by an equally high richness of herb or shrub species in the understorey. The low diversity in these understories could be due to the fact that tropical lowland forests have a closed canopy year-round48, or that fires occur frequently, favoring grass-dominated, species-poor understories49. Together with the scarcity of data on epiphytes, a species-poor herb layer might explain why tropical lowland forests exhibit scaling anomalies, namely low alpha diversity at fine grains but high at coarse grains. If most of the diversity (or data) is in the tree layer, large vegetation plots are needed to ensure that the diversity of an ecosystem is appropriately sampled, as few tree individuals can physically co-occur at small sampling grains. We note, however, that uncertainties were high for tropical forests, requiring a cautious interpretation of these results.

In general, finding these scaling anomalies points to the role of beta diversity as a cross-scale diversity metric, and suggests that the relative contribution of different eco-evolutionary processes in determining plant diversity patterns varies between regions. In many tropical lowland forests, alpha diversity is low at fine grains but increases rapidly with increasing grain size. This is the case, for instance, in the western Amazon, where much of the regional (gamma) diversity depends on species turnover rather than on the coexistence of a high number of species at the same site50. This suggests that the tropics might be shaped by processes promoting species coexistence through a tighter packing in the niche space. Recent work found a latitudinal increase in niche specialization and marginality of trees towards the equator, which has been attributed to the stable climate and high productivity in the tropics51. Alternative explanations include rarity and priority effects related to high productivity29, more uniform environmental conditions and stronger dispersal limitation at fine scales28, or stronger mycorrhiza-mediated effects of interspecific competition and habitat adaptation52 in the tropics compared to temperate regions. While the relative contribution of these processes remains a matter of speculation, our work points to the need for an improved understanding of the spatial variation of beta diversity in plant diversity analysis53. Beta diversity, rather than alpha diversity per se, appears to be the main driver of spatial differences in gamma diversity between temperate and tropical regions.

Conversely, we observed high plant alpha diversity at fine grains but relatively low alpha diversity at coarse grains in many temperate regions, including the Eastern European forest-steppe belt, East Asia and southeastern Australia. This pattern might be indicative of effective niche partitioning at fine grains and more homogeneous landscapes without dispersal barriers at coarse grains54. There is evidence that niche processes play a stronger role than neutral processes in determining fine-scale beta diversity at higher latitudes and altitudes28,29,30, where species are thought to have broader niches and be less responsive to geographical changes55. This is consistent with recent findings that the nestedness of tree communities increases with latitude, possibly due to the high share of ectomycorrhizal species in colder and wetter conditions52. Finally, high species richness at fine grains might also depend on plant size, as many small plants can coexist in a given grain size. Such conditions mainly occur in grasslands, e.g., in Eastern Australia, where this mechanism has been invoked to explain differences in beta diversity among vegetation types56.

Our work allows us to rank the predictors of alpha diversity by their importance. Since the species–area relationship has often been described as one of the few rules in ecology14, the high importance of plot size in our models is not surprising. Our important advance, however, is that by explicitly incorporating this nonlinear relationship into our models, we created a grain-dependent model that links alpha diversity to multiple climatic, topographic and biogeographical predictors. We also showed that ecoregions with a large species pool are more likely to host species-rich communities. This pattern became disproportionately stronger at coarser grains, probably because at finer grains the maximum number of locally co-occurring species is constrained by the number of individuals that can fit into the grain. The other biogeographical covariates, namely biomes and realms, had very little effect on predicting alpha diversity. This is probably because they are closely related to other predictors with stronger explanatory power, i.e., macroclimate and ecoregions, respectively3. The increasing influence of macroclimate and ecoregional species pool with grain size is, however, in line with evidence on the role of climatic and geological histories of ecoregions on species pools8,10,20,24. This is not surprising since tectonic movements, uplift of mountain ranges, climatic stability, and glaciation events all play a role in driving regional speciation and extinction rates3. This result supports the view that, although intimately related, habitat filtering and biogeographical factors related to regional differences in geological and climatic history have a different influence on patterns of alpha diversity at fine vs. coarse grains10.

Although our study is based on the largest collection of global vegetation-plot data ever compiled, there are some shortcomings. The most important limitation is the uneven distribution of vegetation plots across biogeographical regions. Most of our data points were in Europe and other regions with a strong tradition of vegetation surveys, while the coverage of tropical areas, especially the Amazon and equatorial Africa, was poor (Supplementary Fig. 1). Furthermore, data from tropical forests were often incomplete, containing information on woody species only. Although the targeted search for additional data, the stratified resampling and the statistical model we applied mitigate these problems (see Methods), they clearly cannot compensate for the lack of comprehensive data on plant composition in many species-rich regions (especially large parts of the tropics). Ongoing initiatives to mobilize existing data, expand biodiversity surveys by including underreported growth forms such as herbs or epiphytes45,46, and improve the overall taxonomic knowledge for these regions47,57 are, therefore, high priorities in biodiversity research58. A second limitation is the scale mismatch between some very fine-grained vegetation plots and our use of coarse-grained environmental predictors, as highlighted by other global-scale biogeographical analyses22. Thus, our models ignore the mounting evidence of the strong modulating impact of local land cover, topographic heterogeneity and vegetation structure on climatic conditions, rendering the environmental conditions experienced by organisms at the local scale markedly different from those inferred from global macroclimatic models59. Finally, our analysis focuses on natural and semi-natural plant communities but ignores the role of human impacts and non-native species invasions.
These effects are too diverse and multifaceted to be included in a simple statistical model but clearly play a major role in the distribution of plant species, both at local and regional scales60. Taken together, these limitations imply that although the accuracy of our models was relatively high, our results may still be missing important environmental drivers, especially at fine grain sizes.

Despite these limitations, our analysis provides important insights and is a step forward in mapping global plant diversity. First, it reinforces the idea that large-scale evolutionary and historical processes interact with local factors to shape plant communities3,17,23. Indeed, our models indicate that macroecological gradients have a consistent effect on plant alpha diversity, but with magnitudes that vary across grains. Second, by highlighting regional scaling anomalies in alpha diversity across different plot sizes, our study can improve our ability to predict biodiversity response to global change11. Third, our work adds a new dimension to our understanding of global biodiversity patterns and hotspots previously defined based on gamma diversity only. This could have implications for conservation. For example, coarse-grained hotspots might require networks of relatively large protected areas, whereas fine-grained hotspots might be more sensitive to biotic homogenization and more dependent on maintaining traditional management or a particular type of land use. Explicit consideration of the difference between coarse- and fine-grained hotspots complements the regional data on species richness and endemism commonly used for delineating global biodiversity hotspots.

## Methods

### Species richness data

The vegetation-plot database ‘sPlot’ (www.idiv.de/splot) collates 110 national or regional vegetation-plot datasets. Vegetation-plot records provide geo-referenced information on the presence and cover/abundance of all vascular plants co-occurring within a delimited area. The sPlot database version 2.1 contains records from 1,121,244 vegetation plots surveyed between 1885 and 2015. These comprise 23,586,216 occurrence records for 58,066 vascular plant taxa, whose names have been standardized to a common nomenclature15. When the formation to which a plot belonged was not specified (n = 137,146 plots), we used the growth form of the recorded species61 to classify a plot as forest or non-forest as in ref. 22. That is, we defined a plot record as forest if the sum of the cover values of all tree taxa was >25% of the sum of the cover values of all species in that plot, and as non-forest, if the sum of cover values of all low‐growing taxa other than trees and shrubs was >90% of the sum of the cover values of all species in that plot. Plots not meeting either condition were excluded from the analysis, as well as all plots belonging to wetland or aquatic vegetation. Plots also had a wide variation in the sampled area (1–25,000 m2). Therefore, we performed a preliminary screening and only retained plots sized between 100 and 25,000 m2 for forest, and between 10 and 1500 m2 for non-forest, as these are the most frequent plot sizes used by plant ecologists in the field. Plots without information on the sampled area were also excluded. Similarly, we excluded all plots that we could confidently assign to anthropogenic communities, here defined as any vegetation that is shaped by intensive and repeated human interference, including weed communities on arable land, ruderal vegetation and intensively managed pastures and meadows.
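The classification and screening rules described above can be written compactly. The sketch below is a minimal illustration, assuming that cover values have already been summed per growth form; the dictionary keys (`"tree"`, `"shrub"`, `"other"`), the function names and the inclusive treatment of the plot-size bounds are assumptions made here for illustration.

```python
def classify_formation(cover_by_growth_form):
    """Classify a plot as 'forest', 'non-forest' or None (excluded).

    `cover_by_growth_form` maps 'tree', 'shrub' and 'other' (low-growing taxa
    other than trees and shrubs) to the summed cover of those taxa.
    """
    total = sum(cover_by_growth_form.values())
    if total <= 0:
        return None
    tree_share = cover_by_growth_form.get("tree", 0.0) / total
    low_share = cover_by_growth_form.get("other", 0.0) / total
    if tree_share > 0.25:   # trees make up >25% of total cover
        return "forest"
    if low_share > 0.90:    # low-growing taxa make up >90% of total cover
        return "non-forest"
    return None             # meets neither condition: excluded


# Retained plot-size ranges in m2 (bounds assumed to be inclusive).
SIZE_RANGE_M2 = {"forest": (100, 25_000), "non-forest": (10, 1_500)}


def keep_plot(formation, area_m2):
    """Apply the preliminary plot-size screening for a classified plot."""
    if formation not in SIZE_RANGE_M2 or area_m2 is None:
        return False
    lower, upper = SIZE_RANGE_M2[formation]
    return lower <= area_m2 <= upper
```

The two cover conditions are mutually exclusive, so the order in which they are tested does not affect the result.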

The data in the sPlot database are geographically biased since plots are unevenly distributed across geographical regions and formations (Supplementary Fig. 1), with relatively few data from the wet tropics. We therefore made a special effort to improve the data coverage in these regions by searching for publications and databases that report species richness, plot size and spatial coordinates of vegetation plots in the tropics. We focused on plots for which the full assemblage of vascular plants (with or without epiphytes) was sampled. However, such data were particularly scarce in many regions (e.g., the central Amazon, Western Ghats and Sundaland). For these regions, we also included data reporting woody species richness only (along with the diameter at breast height, DBH, used as the minimum sampling threshold). In total, we found information for an additional set of 1914 vegetation plots from 53 papers (Supplementary References). Of these, only 170 vegetation plots contained species richness information for all vascular plants. Finally, we scanned the Global Index of Vegetation-plot Databases62 to retrieve additional datasets from the tropics, which were not included in sPlot 2.1. We obtained permission to use 11 local datasets, totaling 7929 additional vegetation plots (7385 with species richness data for all vascular plants). In total, our database contained 412,452 vegetation plots41,59,63–196 (Supplementary Fig. 1, Supplementary Data 1).

### Data cleaning and geographical resampling

To further mitigate the remaining geographical bias in vegetation-plot distribution and to account for the fact that plot sizes vary markedly across regions and vegetation types (Supplementary Fig. 8), we applied a stratified resampling strategy that we repeated 99 times. We defined each stratum as a unique combination of realm197, biome15, broad formation (two classes: forest and non-forest), and plot size as a factor variable with four levels (small: ≤150, medium: 150–600, large: 600–1200, very large: >1200 m2). These intervals were chosen to encompass the grains used for predictions (i.e., 10, 100, 400 and 1000 m2, and 1 ha, see below) while accounting for the fact that some plot sizes are more routinely used than others. For each stratum, we randomly sampled (without replacement) up to 100 vegetation plots in each iteration. If a stratum had fewer than 100 vegetation plots, we retained all of them. This procedure resulted in the selection of 17,972 plots in each iteration. The total number of plots used across the 99 iterations was 170,272. Altogether, these plots provided 9,953,940 occurrence records for 53,271 vascular plant taxa, i.e., ~15% of the estimated ~350,000 vascular plant species that exist. This figure is slightly underestimated, since for 1893 plots (59,299 occurrence records) only aggregated alpha diversity data were available, but no species-level data.
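A single iteration of the stratified resampling can be sketched as follows. This is a minimal illustration rather than the original implementation: the record layout (dictionaries with `realm`, `biome`, `formation` and `area_m2` keys), the function names and the assignment of boundary values (150, 600 and 1200 m2) to the smaller size class are assumptions.

```python
import random
from collections import defaultdict


def size_class(area_m2):
    """Bin plot size into the four levels used to define strata."""
    if area_m2 <= 150:
        return "small"
    if area_m2 <= 600:
        return "medium"
    if area_m2 <= 1200:
        return "large"
    return "very large"


def stratified_resample(plots, max_per_stratum=100, seed=None):
    """Draw up to `max_per_stratum` plots per stratum, without replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for plot in plots:
        key = (plot["realm"], plot["biome"], plot["formation"],
               size_class(plot["area_m2"]))
        strata[key].append(plot)
    sample = []
    for members in strata.values():
        if len(members) <= max_per_stratum:
            sample.extend(members)  # small stratum: retain all plots
        else:
            sample.extend(rng.sample(members, max_per_stratum))
    return sample
```

Repeating the call with 99 different seeds would give 99 resampled subsets in which well-sampled strata are capped and poorly sampled strata are kept in full.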

Not all vegetation plots were complete with respect to the sampled functional groups. Most records from tropical forest plots contained either only tree data, or only data on trees and shrubs (Supplementary Fig. 9). Excluding these plots would have greatly reduced the spatial coverage of our dataset. Since most of these incompletely sampled plots were from the tropics, excluding them would also have risked introducing a strong spatial bias into our model. Therefore, we retained these plots in the dataset and included a new predictor variable called ‘plants recorded’ (three levels: ‘complete vegetation’, ‘only trees and shrubs’, ‘only trees’) in our statistical models (see below). Specifically, a plot belonged to the ‘only trees’ level if it only contained information on woody species with a diameter at breast height (DBH) larger than 5 cm. It belonged to the ‘only trees and shrubs’ level if it either contained information on all woody species (both trees and shrubs) but not herbs, or if the minimum DBH threshold used for sampling woody individuals was less than or equal to 5 cm.

As most of these incompletely sampled plots were in the tropics, we also simulated incomplete plots in the other biomes when resampling the full database. This was achieved by selecting some plots with complete vegetation information and recalculating their species richness as if they had been sampled for ‘only trees’, i.e., discarding all information on the occurrence of shrub and herb species, or for ‘only trees and shrubs’, i.e., discarding information on herbs. We limited this procedure to biomes with >10,000 plots with complete vegetation information (i.e., subtropics with winter rain, subtropics with year-round rain, temperate mid-latitudes). In these biomes, 20% of all the plots selected randomly within each resampling iteration (623 on average) were transformed this way. This corresponded to an increase in the number of incomplete plots in these selected biomes from 151 to 359 (on average over the 99 iterations), which is close to the average number of incomplete plots occurring in the other biomes (n = 373). By rarefying data to simulate plots with incomplete vegetation records, we reduced the possible geographical bias resulting from the uneven distribution of incomplete plots across biomes. This allowed the use of incomplete plots from tropical regions (where complete plots are rare) when modeling the response of local vascular plant richness at the global scale (see below).
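The recalculation of species richness for a simulated incomplete plot can be illustrated with a minimal Python sketch. It assumes a hypothetical data structure in which each species in a plot is tagged with a single growth form (`tree`, `shrub` or `herb`); the DBH-based criterion used for real ‘only trees’ plots is not reproduced here.

```python
# Growth forms retained at each level of the 'plants recorded' predictor.
KEPT_FORMS = {
    "complete vegetation": {"tree", "shrub", "herb"},
    "only trees and shrubs": {"tree", "shrub"},
    "only trees": {"tree"},
}


def simulated_richness(species_growth_forms, level):
    """Recount the richness of a complete plot as if only part of the
    vegetation had been recorded.

    species_growth_forms: dict mapping species name -> growth form
    (hypothetical structure, for illustration only).
    """
    kept = KEPT_FORMS[level]
    return sum(1 for form in species_growth_forms.values() if form in kept)
```

For a plot with two tree, one shrub and two herb species, this returns 5, 3 and 2 for the three levels, respectively.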

### Explanatory variables

Based on the plots’ geographic coordinates, we retrieved bioclimatic, soil, topographic and biogeographical variables from external sources, which we used as explanatory variables for species richness modeling. We extracted all 19 bioclimatic variables included in CHELSA v1.1198, and seven soil variables at 250-m resolution from the SOILGRIDS project199. The soil variables were: (1) clay mass fraction (%); (2) silt mass fraction (%); (3) sand mass fraction (%); (4) coarse fragment fraction (%); (5) soil organic carbon content (g/kg); (6) soil pH (measured in water); and (7) cation exchange capacity. After standardizing and centering all 26 variables, we performed two principal component analyses (PCA), one for climate and one for soil. For subsequent analyses, we used the first five principal components for climate and the first four for soil, because these components accounted for more than 90% of the total variation in these ordinations. We interpreted these principal components based on the respective loadings of the corresponding environmental variables. For climate, the predictors with the highest loadings were: mean annual temperature for PC1; mean annual precipitation and mean diurnal temperature range for PC2; precipitation seasonality and precipitation of the wettest quarter for PC3; mean temperature of the wettest quarter and of the warmest quarter for PC4; and precipitation of the coldest quarter for PC5 (Supplementary Table 3, Supplementary Fig. 10). For soils, PC1 was mainly explained by soil bulk density; PC2 by sand content; PC3 by the percentage of coarse fragments and PC4 by soil pH (Supplementary Table 4, Supplementary Fig. 11).
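The rule used to decide how many principal components to retain (the leading components that together account for more than 90% of the total variation) can be sketched as follows. This is an illustrative Python helper, not the original R code, and the eigenvalues in the usage note are made up.

```python
def n_components_for(eigenvalues, threshold=0.90):
    """Smallest number of leading components whose cumulative share of the
    total variance exceeds `threshold`."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total > threshold:
            return k
    return len(eigenvalues)
```

With hypothetical eigenvalues `[5, 3, 1.5, 0.3, 0.2]`, the cumulative shares are 0.50, 0.80 and 0.95, so three components would be retained.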

To account for topographic heterogeneity, we also extracted data on plot topography from the EarthEnv.org data portal200. Specifically, we used terrain ruggedness (TRI, calculated at 50 km resolution), dominant landform (10 types at 1 km resolution: flat, peak, ridge, shoulder, spur, slope, hollow, footslope, valley, pit), and the number of landforms within a 50 km radius around each plot.

To account for historical and biogeographical factors, we included two predictors of the velocity of climate change between the Last Glacial Maximum and the present (one for temperature, one for precipitation) derived from ref. 201. These layers measure the local rate of displacement of climatic conditions and integrate macroclimatic shifts with local spatial topoclimatic gradients. Additionally, we included two nominal biogeographical variables, realm197 and biome15, which we treated as rough proxies of the different geologic, biogeographical and climatic histories of different regions. The biomes were derived from Schultz’s ecozones202, which we modified to distinguish alpine areas203. Thus, our biomes are not nested within realms. As another surrogate for the biogeographical imprint on alpha diversity patterns, we also accounted for regional effects by including the estimated size of the regional species pool for each of the 867 terrestrial ecoregions of the world2.

We then considered three additional predictors: a binary variable distinguishing two broad formations (i.e., forest: True/False), a nominal predictor accounting for the different functional groups sampled in each plot (i.e., ‘complete vegetation’, ‘only trees and shrubs’ and ‘only trees’, see above), and plot size, i.e., the spatial grain used in vegetation sampling.

In total, we considered 20 predictors: five principal components summarizing climate, four principal components summarizing soils, three variables quantifying topographic heterogeneity, five related to biogeographical history, one representing vegetation formation and two related to sampling design. Multicollinearity among predictors was limited, as no pair of predictors had Pearson’s r coefficient greater than 0.64 (Supplementary Fig. 12).
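A pairwise collinearity screen of this kind can be illustrated with a short Python sketch (the study itself used R). The predictor names in the usage note are hypothetical, and comparing absolute values of r is an assumption about how the 0.64 bound was evaluated.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson's product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def max_abs_correlation(columns):
    """Return (|r|, name_a, name_b) for the most strongly correlated pair.

    columns: dict mapping predictor name -> list of values (quantitative only).
    """
    return max(
        (abs(pearson_r(columns[a], columns[b])), a, b)
        for a, b in combinations(sorted(columns), 2)
    )
```

Calling `max_abs_correlation({"pc1": [...], "pc2": [...], "tri": [...]})` returns the strongest pairwise correlation, which can then be compared against a chosen bound.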

### Statistical modeling

We used boosted regression trees (BRTs) to model the relationships between species richness and the explanatory variables. BRTs are nonparametric machine-learning models based on decision trees in a boosting framework. BRTs have few prior assumptions, are relatively robust against overfitting, missing data, and collinearity, and are very flexible in detecting nonlinear relationships and interactions among predictors204. We parameterized our BRTs as follows. We first set a tree complexity of 5 and a bag fraction of 0.5. We then systematically tested learning rates (from 0.00025 to 0.1) and selected the combination of learning rate and number of trees that returned the highest 10-fold cross-validated model fit, using the gbm.step routine from the dismo package205. For each explanatory variable, we calculated its relative influence (i.e., the fraction of times a variable was selected for splitting a tree in each BRT model, weighted by the squared model improvement) across the 99 resampled sets. To visualize the marginal effect of a given predictor on species richness, we created partial dependence plots at selected grain sizes. We considered an explanatory variable as relevant in the model if its relative influence (averaged over the 99 resamplings) was greater than 5%, which is the expected share if all 20 predictors had the same relative importance.
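The relevance criterion follows from simple arithmetic: with 20 predictors, an equal share of relative influence is 100/20 = 5%. A minimal Python sketch of the averaging and thresholding step is given below; the input structure (one dictionary of relative influences, in percent, per resampled set) and the numbers in the usage note are hypothetical.

```python
def relevant_predictors(influence_by_iteration, n_predictors=20):
    """Average relative influence (%) over resampled sets and keep the
    predictors whose mean exceeds the equal-share threshold 100 / n_predictors."""
    threshold = 100.0 / n_predictors
    n_sets = len(influence_by_iteration)
    names = influence_by_iteration[0].keys()
    mean_influence = {
        name: sum(run[name] for run in influence_by_iteration) / n_sets
        for name in names
    }
    return {name: value for name, value in mean_influence.items()
            if value > threshold}
```

For two made-up runs with influences of 30% and 34% for plot size, 4% and 5% for biome, and 6% and 8% for a climate component, only plot size (mean 32%) and the climate component (mean 7%) would pass the 5% threshold.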

BRTs are unbiased on average, i.e., the sum of the residuals is close to zero. Yet, like other ensemble machine-learning methods, they produce results that are biased in a different sense: small values are often overestimated and large values underestimated35. This happens because the final prediction is the unweighted average of a collection of regression trees, which inevitably leads to results biased towards the sample mean. To correct for this, we implemented a bias-correction algorithm called ROE (regression of observed on estimated values)35,206. In the first step, we fitted a linear regression of the fitted values on the observed values:

$$S_{\mathrm{fit}} = a + b\,S_{\mathrm{obs}}$$
(1)

where $$S_{\mathrm{fit}}$$ is the vector of species richness predicted by a BRT in a given iteration, and $$S_{\mathrm{obs}}$$ is the vector of observed species richness in that iteration. We then created a vector of bias-corrected, fitted species richness $$S_{\mathrm{fit}}^{\mathrm{bc}}$$ as:

$$S_{\mathrm{fit}}^{\mathrm{bc}} = \max\left[\frac{S_{\mathrm{fit}} - a}{b},\, 0\right]$$
(2)

thus, introducing the constraint that $$S_{\mathrm{fit}}^{\mathrm{bc}}$$ is no smaller than zero206.
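Equations 1 and 2 can be illustrated with a minimal Python sketch (not the original R implementation). It assumes ordinary least squares for the linear fit; the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return a, b


def roe_correct(s_fit, a, b):
    """Equation 2: invert the fitted line and truncate the result at zero."""
    return [max((s - a) / b, 0.0) for s in s_fit]


# Equation 1: fitted richness as a linear function of observed richness.
# a, b = fit_line(s_obs, s_fit)
# s_fit_bc = roe_correct(s_fit, a, b)
```

As a made-up example, if predictions are shrunk towards the mean so that fitted = 8 + 0.5 × observed, the fit recovers a = 8 and b = 0.5, and the correction maps fitted values of 8, 13, 18 and 28 back to 0, 10, 20 and 40. The same a and b are then applied to new predictions.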

We then used the above BRT models, together with the regression parameters a and b, to make bias-corrected predictions of local vascular plant richness at different plot sizes for all terrestrial pixels of the globe at 2.5 arcminute resolution. We did this separately for forest and non-forest ecosystems. For each pixel, we extracted the values of all 17 spatially explicit predictors (climate, soil, topography and biogeography) based on the pixel location. The variable ‘forest’ was set to ‘True’ for creating forest maps and ‘False’ for non-forest maps. For each of the 99 resampling iterations, we created multiple predictions, one for each selected sampling grain (i.e., 400 m2, 1000 m2 and 1 ha for forests, and 10, 100 and 1000 m2 for non-forests). In all cases, we only predicted species richness for the complete vegetation (i.e., including trees, shrubs and herbs). We also mapped the variability of our predictions as the interquartile range (IQR, i.e., the difference between the 75th and 25th percentiles) across the 99 resampling iterations. Finally, we created a map of ignorance207 showing the geographic distance from the nearest vegetation plot used to calibrate our models (Supplementary Fig. 13). The map of ignorance highlights the uncertainty due to the uneven geographic distribution of vegetation plots and shows areas with limited or no data, where our estimates should be interpreted with caution. Based on this map, we highlighted all data-poor regions, i.e., regions located farther than 500 km from the nearest plot, by parallel hatching in our maps. Given the strong structural differences between forests and non-forest ecosystems, we presented the multi-grain maps of plant richness separately for these two broad formations in the main text. Nevertheless, we also produced a joint map at 1000 m2 grain by complementing species richness estimates for forests with non-forest species richness for pixels outside the forest mask.
For forests, we made predictions for all pixels where forests would grow under current climate conditions and without human influence208. For non-forests, we extracted all pixels where the land cover class ‘herbaceous vegetation’ occurs based on a consensus map integrating land-cover products derived from remote sensing209.
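The per-pixel variability measure can be sketched in Python as follows. This is an illustration under stated assumptions: predictions are held as one list of pixel values per resampling iteration (a hypothetical layout), and quartiles are computed with the linear-interpolation (‘inclusive’) method of the standard library, which may differ from the quantile definition used in the original analysis.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range: 75th minus 25th percentile."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def pixelwise_iqr(predictions):
    """predictions[i][p] is the richness predicted for pixel p in
    resampling iteration i; returns one IQR per pixel."""
    return [iqr(column) for column in zip(*predictions)]
```

A pixel whose prediction is identical in every iteration gets an IQR of zero, whereas a wide spread across the 99 iterations flags a prediction that depends strongly on which plots were resampled.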

### Model validation

We assessed model performance in three ways. First, we averaged the tenfold cross-validation results obtained from the BRT output across resampled sets. Second, for each of the 99 resampled sets, we selected all plots not used in the specific set and calculated Pearson’s correlation between species richness observed in a given plot and the respective BRT prediction at a grain corresponding to the plot area. As a third approach, we performed a spatially-constrained cross-validation210. We did this because our plots were spatially clustered and our spatial predictors had high spatial autocorrelation (2320 km on average across all the quantitative predictors, based on 5000 random samples, Supplementary Fig. 14). This means that even selecting plots completely independent of the training dataset does not ensure proper validation of our models, as the training and the test data remain spatially dependent. This violation of the fundamental assumption of model validation, namely the independence between training and test data, has been shown to affect many mapping models created with ‘Big Data’ approaches34. To avoid this problem, we divided the world into square spatial blocks whose size corresponds to the average spatial autocorrelation range of the quantitative predictors (i.e., 2320 km, n = 84, Supplementary Fig. 14). For each resampling, we randomly assigned each block to one of five folds using the function spatialBlock in the R package blockCV33, which selects the most even spread of vegetation-plot data across folds in 99 iterations (Supplementary Fig. 15). We then refitted our BRT model five times for each resampling, each time using four out of five folds for training and the remaining fold for validation, and averaged Pearson’s correlation coefficient between the observed and predicted species richness across folds.
We also repeated this process separately for each biome, i.e., sequentially withholding all data located within a fold and a given biome for validation. We then reported the distribution of these correlation coefficients across the resampled sets, both when considering all plots, and when disaggregating by biomes. Finally, we checked the model residuals for spatial autocorrelation by fitting variogram models to the residuals using the function variogram from the R package gstat211. All analyses were performed in R 3.6.3212. Map boundaries derive from R package rnaturalearth213.
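The core idea of the spatially-constrained cross-validation, namely that whole spatial blocks rather than individual plots are assigned to folds, can be illustrated with a simplified Python sketch. It is not a reimplementation of blockCV: it assumes plot coordinates already projected to kilometres (hypothetical inputs), uses a single random assignment, and omits the search for the most even spread of plots across folds.

```python
import random


def block_id(x_km, y_km, block_size_km=2320.0):
    """Index of the square block containing a point (projected coordinates, km)."""
    return (int(x_km // block_size_km), int(y_km // block_size_km))


def assign_folds(plots, k=5, rng=None, block_size_km=2320.0):
    """Randomly assign whole blocks to k folds.

    plots: list of (x_km, y_km) tuples.
    Returns a dict mapping fold number -> list of plot indices.
    """
    rng = rng or random.Random()
    blocks = sorted({block_id(x, y, block_size_km) for x, y in plots})
    rng.shuffle(blocks)
    fold_of_block = {block: i % k for i, block in enumerate(blocks)}

    folds = {fold: [] for fold in range(k)}
    for index, (x, y) in enumerate(plots):
        folds[fold_of_block[block_id(x, y, block_size_km)]].append(index)
    return folds
```

Each fold can then be withheld in turn, with the model refitted on the remaining four folds, so that validation plots are never in the same block as training plots.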

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.