Machine learning-based global maps of ecological variables and the challenge of assessing them

The recent wave of published global maps of ecological variables has generated as much excitement as criticism. Here we examine the data and methods most commonly used to create these maps, and discuss whether the quality of the predicted values can be assessed, both globally and locally.

The reference data underlying these maps are typically compilations of observations from different sources. As a consequence, these data are strongly concentrated, e.g., in Europe and North America, and within these regions they are extremely clustered around intensively studied areas. We are aware that large gaps in geographic space do not always imply large gaps in feature space, but it is the former that most concerns the accuracy of the maps of focus here, as we will discuss.
For three publicly available datasets that were used for global mapping, Fig. 1A-C compares the distributions of the spatial distances of reference data to their nearest neighbor (pink) with the distribution of distances from all points of the global land surface to the nearest reference data point (prediction locations, blue). The difference between the two distributions reflects the degree of spatial clustering in the reference data: Fig. 1D shows the corresponding distributions for a simulated spatially random sample of the same size as in Fig. 1C. This clustering has consequences and raises challenges for accuracy assessment that we discuss in the following.
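The comparison underlying Fig. 1 can be reproduced in miniature. The following Python sketch (study area, cluster layout, and sample sizes are invented purely for illustration) contrasts nearest-neighbor distances within a clustered reference sample with the distances from spatially random prediction locations to that sample:

```python
import numpy as np

rng = np.random.default_rng(42)

def nn_distances(from_pts, to_pts):
    """Distance from each point in from_pts to its nearest point in to_pts."""
    d = np.linalg.norm(from_pts[:, None, :] - to_pts[None, :, :], axis=2)
    return d.min(axis=1)

# Hypothetical unit-square study area with clustered "reference data":
# all samples fall close to a handful of cluster centers.
centers = rng.uniform(0, 1, size=(5, 2))
ref = centers[rng.integers(0, 5, 200)] + rng.normal(0, 0.02, size=(200, 2))

# Prediction locations: a dense, spatially random sample of the whole area.
pred = rng.uniform(0, 1, size=(2000, 2))

# Sample-to-sample nearest-neighbor distances (cf. the pink curves in Fig. 1);
# exclude each point's zero distance to itself.
d_ref = np.linalg.norm(ref[:, None, :] - ref[None, :, :], axis=2)
np.fill_diagonal(d_ref, np.inf)
d_ref = d_ref.min(axis=1)

# Prediction-location-to-sample distances (cf. the blue curves in Fig. 1).
d_pred = nn_distances(pred, ref)

# Under strong clustering the two distributions separate widely: mapping the
# whole area involves far larger distances than occur within the sample.
print(np.median(d_ref), np.median(d_pred))
```

For realistically sized global datasets one would use a spatial index (e.g., a k-d tree) rather than a dense pairwise distance matrix, but the comparison of the two distance distributions is the same.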

Map quality: global or local assessment?
The quality of global maps can be assessed in different ways. One way is global assessment, where a single statistic is chosen to summarize the quality of the entire map: the map accuracy. For a categorical variable, this can be the probability that, for a randomly chosen location on the map, the map value corresponds to the true value. For a continuous variable, it can be the RMSE, describing for a randomly chosen location on the map the expected difference between the mapped value and the true value. When a probability sample, such as a completely spatially random sample, is available for the area for which a global assessment is needed, then map accuracy can be estimated model-free (also called design-based, e.g., by using the unweighted sample mean in case of a completely spatially random sample). This circumvents modeling of spatial correlation because observations are independent by design 6,9. The approach is called model-free because no model needs to be assumed about the distribution or correlation of the data: the only source of randomness is the random selection of sample units from a target population. If a probability sample is not available this approach cannot be used, and the accuracy assessment automatically becomes model-based 10, which involves modeling a spatial process by assuming distributions and taking spatial correlations into account, and choosing estimation methods accordingly. Using naive random n-fold or leave-one-out cross-validation methods (or a simple random train-test split) to assess global model quality (usually equated with map accuracy) makes sense when the data are independent and identically distributed. When this is not the case, dependencies between nearby samples, e.g., in a spatial cluster, are ignored and result in a biased, overly optimistic model assessment, as shown in, e.g., Ploton et al. 5.
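The optimism of naive cross-validation under clustering can be demonstrated with a toy simulation. In this sketch (the smooth "true" field, the cluster layout, and the 1-nearest-neighbor predictor are all invented stand-ins, not any published model), leave-one-out cross-validation on a clustered sample is compared with a design-based RMSE estimated from a completely spatially random probability sample of the whole area:

```python
import numpy as np

rng = np.random.default_rng(0)

def field(xy):
    # A smooth, spatially autocorrelated "true" variable (illustrative only).
    return np.sin(3 * xy[:, 0]) + np.cos(3 * xy[:, 1])

def predict_1nn(train_xy, train_z, query_xy):
    # 1-nearest-neighbor interpolation as a stand-in for a fitted model.
    d = np.linalg.norm(query_xy[:, None, :] - train_xy[None, :, :], axis=2)
    return train_z[d.argmin(axis=1)]

# Clustered reference sample: all points near a few centers.
centers = rng.uniform(0, 1, size=(4, 2))
ref_xy = centers[rng.integers(0, 4, 120)] + rng.normal(0, 0.03, (120, 2))
ref_z = field(ref_xy)

# (a) Naive leave-one-out cross-validation on the clustered sample: each
# held-out point has a very close neighbor in the same cluster.
sq_err = []
for i in range(len(ref_xy)):
    mask = np.arange(len(ref_xy)) != i
    pred = predict_1nn(ref_xy[mask], ref_z[mask], ref_xy[i:i + 1])
    sq_err.append((pred[0] - ref_z[i]) ** 2)
rmse_loo = np.sqrt(np.mean(sq_err))

# (b) Design-based map RMSE from a completely spatially random probability
# sample of the whole area: the unweighted mean of squared errors is a
# model-free estimate of the map's mean squared error.
test_xy = rng.uniform(0, 1, size=(2000, 2))
rmse_design = np.sqrt(
    np.mean((predict_1nn(ref_xy, ref_z, test_xy) - field(test_xy)) ** 2))

print(rmse_loo, rmse_design)  # naive CV is overly optimistic here
```

With this setup the leave-one-out RMSE is far smaller than the design-based map RMSE, because the cross-validation never has to predict across the large distances that dominate the actual mapping task.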
Alternative cross-validation approaches such as spatial cross-validation 5,11 that control for such dependencies are the only way to overcome this bias. Different spatial cross-validation strategies have been developed in the past few years, all aiming at creating independence between cross-validation folds 5,11-13. Cross-validation creates prediction situations artificially by leaving out data points and predicting their value from the remaining points. If the aim is to assess the accuracy of a global map, the prediction situations created need to resemble those encountered when predicting the global map from the reference data (see Fig. 1 and discussions in Milà et al. 14). This occurs naturally when reference data were obtained by (completely spatially random) probability sampling, but in other cases it has to be forced, for instance by controlling spatial distances (spatial cross-validation). Such forcing, however, is only possible when the distances in space that need to be resembled are available in the reference data. In the extreme case where all reference data come from a single cluster, this is impossible. When all reference data come from a small number of clusters, larger distances are available between clusters but do not provide substantial independent information about variation associated with these distances. Lack of information about larger distances means that we cannot assess the quality of predictions associated with such distances and cannot properly estimate global quality measures. Alternative approaches such as experiments with synthetic data 15 or validation using independent data at a higher level of integration 16 would then be options to support confidence in the predictions.
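A minimal form of spatial cross-validation is to leave out entire clusters rather than random points, so that held-out data must be predicted across the same large distances that arise when mapping. The sketch below (again with an invented field and a 1-nearest-neighbor stand-in model, not any of the cited strategies in full) compares random folds with leave-cluster-out folds on the same clustered sample:

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_1nn(train_xy, train_z, query_xy):
    # 1-nearest-neighbor interpolation as a stand-in for a fitted model.
    d = np.linalg.norm(query_xy[:, None, :] - train_xy[None, :, :], axis=2)
    return train_z[d.argmin(axis=1)]

# Clustered reference data with known cluster membership.
centers = rng.uniform(0, 1, size=(4, 2))
cluster_id = rng.integers(0, 4, 120)
xy = centers[cluster_id] + rng.normal(0, 0.03, (120, 2))
z = np.sin(3 * xy[:, 0]) + np.cos(3 * xy[:, 1])  # smooth "true" variable

def cv_rmse(fold_labels):
    """RMSE of cross-validation where each distinct label is one fold."""
    sq_err = []
    for f in np.unique(fold_labels):
        test = fold_labels == f
        pred = predict_1nn(xy[~test], z[~test], xy[test])
        sq_err.append((pred - z[test]) ** 2)
    return np.sqrt(np.mean(np.concatenate(sq_err)))

# Random folds split clusters across folds, so every held-out point keeps a
# near neighbor in the training part; spatial folds remove whole clusters,
# forcing predictions over between-cluster distances.
rmse_random = cv_rmse(rng.integers(0, 4, 120))
rmse_spatial = cv_rmse(cluster_id)
print(rmse_random, rmse_spatial)
```

The spatially blocked estimate is markedly larger: it reflects the between-cluster prediction distances, which random folds never probe. Note that with only four clusters the spatial estimate rests on very few effectively independent units, illustrating the limits discussed above.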
Another way of accuracy assessment is local assessment: for every location, a quality measure is reported, again as a probability or a prediction error. Such a local assessment predicts how close the map value is to newly observed values at particular locations. If the measurement error is quantified explicitly, a smoother, measurement-error-free value may be predicted 10. If the model accounts for change of support 10,17, prediction errors may refer to average values over larger areas such as 1 × 1, 5 × 5, or 10 × 10 km grid cells. Examples of local assessment in the context of global ecological mapping are modeled prediction errors using Quantile Regression Forests 18 or the mapped variance of predictions made by ensembles 1,2. Neither of these examples quantifies spatial correlation or measurement error, or addresses change of support, as is done in other modeling frameworks 19. By omitting to model the spatial process, the local accuracy estimates as presented in the global studies that motivated this comment are disputable.
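To make the ensemble-variance idea concrete, the following sketch maps the per-location spread of a bootstrap ensemble (a deliberately simple stand-in for the cited approaches, not Quantile Regression Forests; data, field, and model are invented). It also illustrates the limitation discussed above: the spread reflects only disagreement among refitted models, not measurement error, spatial correlation, or support:

```python
import numpy as np

rng = np.random.default_rng(2)

def predict_1nn(train_xy, train_z, query_xy):
    # 1-nearest-neighbor interpolation as a stand-in for a fitted model.
    d = np.linalg.norm(query_xy[:, None, :] - train_xy[None, :, :], axis=2)
    return train_z[d.argmin(axis=1)]

# Training data concentrated in one corner of a hypothetical map, with noise.
xy = rng.uniform(0, 0.3, size=(80, 2))
z = np.sin(3 * xy[:, 0]) + np.cos(3 * xy[:, 1]) + rng.normal(0, 0.05, 80)

# Regular prediction grid over the whole (unit-square) map.
g = np.linspace(0, 1, 20)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)

# Bootstrap ensemble: refit on resamples, predict everywhere, and map the
# per-location standard deviation of the members as "local uncertainty".
members = []
for _ in range(50):
    idx = rng.integers(0, len(xy), len(xy))
    members.append(predict_1nn(xy[idx], z[idx], grid))
local_sd = np.std(np.stack(members), axis=0)

# local_sd is a 400-cell uncertainty map; note that it measures ensemble
# disagreement only, and can stay deceptively small far outside the data.
print(local_sd.reshape(20, 20).round(2))
```

Such a map is cheap to produce, which is part of why it is popular; the point of the text above is that it should not be read as a full prediction-error model.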
The difference between global and local assessment is striking, in particular for global maps. A global, single number averages out all variability in prediction errors, and obscures any differences, e.g., between continents or climate zones. It is of little value for interpreting the quality of the map for particular regions.
Limits to accuracy assessment
Maps, and in particular global maps, create a strong feeling of satisfaction, suggesting we now know it all. They are, however, also used, enlarged, torn apart, read in detail, and may form the basis for local decisions of all kinds, or even form the inputs for follow-up models. If a global map does not come with clear instructions about its value, like a prescription for subsequent use, it is easy to abuse it. Wyborn and Evans 4 rightly ask "what changes are global maps, and their creators, trying to bring about in the world?", and suggest a re-engagement with empirical studies of local and regional contexts while seeking co-construction with those having local knowledge. The fact that creating global maps of anything is nowadays so easy does not mean these maps are always useful.
Technically, a trained Random Forest (or other) model can be applied globally as long as global predictors are available. Predictions far beyond the reference data, however, often lead to extrapolation in predictor space, and models typically produce meaningless predictions when provided with predictor values that do not resemble the training data. The same applies to local accuracy estimates based on the variance of predictions 7. Good coverage of the training data in predictor space is hence required to produce globally applicable predictions. Since distances in geographic space often go along with distances in feature space, it can be assumed that this condition is not met for many prediction models based on sparse and clustered reference data. In Meyer and Pebesma 7, we suggest a procedure to limit spatial predictions to the area of applicability of the model: global maps would need to gray out areas where predictor values are too different from the values in the training data, i.e., the areas for which we cannot assess the quality of predictions. Similar approaches have been suggested and discussed, e.g., by Jung et al. 16. Limiting predictions to the area of applicability of the model is relevant not only to avoid wrong conclusions about prediction patterns but also to avoid the propagation of large errors: many global maps of environmental variables used the global soil maps produced by Hengl et al. 3 as input predictors 1,2,20. The global soil maps by Hengl et al. 3 in turn used other modeled maps as input (e.g., WorldClim 21). If the latter maps had labeled locations with predictions whose quality cannot be assessed, or whose quality was very low, the follow-up studies could have benefited from that information. Without it, both WorldClim and the soil layers were taken as if they contained true values.
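The core of such a procedure can be sketched in a few lines. The version below is a strong simplification of the published dissimilarity-index approach (it omits the variable-importance weighting and the cross-validation-derived threshold of Meyer and Pebesma 7, and all data are invented): standardize predictors by the training statistics, measure each prediction location's distance to the nearest training sample, and flag locations whose distance exceeds what is seen within the training data itself:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical predictor table: training samples cover only part of the
# feature space; half of the new locations are strong extrapolations.
train_X = rng.normal(0, 1, size=(200, 3))            # e.g., climate predictors
new_X = np.vstack([rng.normal(0, 1, size=(150, 3)),  # similar conditions
                   rng.normal(6, 1, size=(50, 3))])  # far outside training data

# Standardize by the training statistics so all predictors are comparable.
mu, sd = train_X.mean(axis=0), train_X.std(axis=0)
t = (train_X - mu) / sd
n = (new_X - mu) / sd

def pairwise(a, b):
    """Euclidean distances between all rows of a and all rows of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

# Distance of each prediction location to its closest training sample.
d_new = pairwise(n, t).min(axis=1)

# Threshold from distances seen within the training data itself (here simply
# the largest training-to-training nearest-neighbor distance).
d_train = pairwise(t, t)
np.fill_diagonal(d_train, np.inf)
threshold = d_train.min(axis=1).max()

# Locations outside this "area of applicability" would be grayed out.
inside_aoa = d_new <= threshold
print(inside_aoa[:150].mean(), inside_aoa[150:].mean())
```

In this toy setting nearly all of the similar locations fall inside the area of applicability and the extrapolated block falls outside it; a real application would additionally weight predictor dimensions by their importance in the fitted model, as the cited method does.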
We argue that showing predicted values on global maps without a reliable indication of global and local prediction errors or of the limits of the area of applicability, and distributing these maps for reuse, is not congruent with basic scientific integrity. Reusing such global maps while ignoring prediction errors amplifies this problem; hence, more transparency and a clear indication of the limitations of predictions are required. Global maps are distributed digitally and may be used for decision making, e.g., in the context of nature conservation 22. We call for global maps of ecological variables to be published only when they are accompanied by properly derived local and global accuracy measures.