To ensure that future mapmakers have specific guidance on how to release their data products responsibly, we propose that, whenever possible, mapmakers release at least two accompanying data layers for each predicted layer that is created: (1) a layer showing the uncertainty of the predictions per pixel (i.e., an error distribution of the predicted value); and (2) a layer showing the degree to which the training data match the area being predicted (i.e., a layer highlighting the degree of extrapolation, or similar). Moreover, the modeling process itself should make the best use of uncertainty estimates; one technical example is bootstrapping the training data so that each model training set samples from the provided error distributions at the per-pixel level (see the sketch below). If training datasets do not themselves contain error estimates, scientists could experiment with Bayesian-style priors that they estimate themselves, so that uncertainty can still be properly represented and propagated (e.g., ref. 5).
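A minimal sketch of what such error propagation could look like is given below, assuming a training table that carries both a measured value and its standard error; the column layout, the choice of RandomForestRegressor, and the number of bootstrap replicates are illustrative assumptions, not a prescribed workflow. It also derives a crude extrapolation indicator in the spirit of the second proposed layer.

```python
# Minimal sketch: propagate per-sample error estimates into per-pixel
# prediction uncertainty by bootstrapping the training targets.
# All names and parameter choices below are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Toy training table: covariates X, measured value y, and its standard error.
n, p = 500, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=n)
y_se = np.full(n, 0.5)                         # per-sample error estimate

# Pixels to be predicted (in practice, a raster stack reshaped to 2-D).
X_pixels = rng.normal(size=(10_000, p))

n_boot = 25
preds = np.empty((n_boot, X_pixels.shape[0]))
for b in range(n_boot):
    # Resample rows with replacement and perturb targets by their errors.
    idx = rng.integers(0, n, size=n)
    y_b = y[idx] + rng.normal(scale=y_se[idx])
    model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=b)
    model.fit(X[idx], y_b)
    preds[b] = model.predict(X_pixels)

# Layer 1: the prediction itself; layer 2: its per-pixel uncertainty.
prediction_map = preds.mean(axis=0)
uncertainty_map = preds.std(axis=0)

# Layer 3: a crude extrapolation indicator -- the share of covariates at
# each pixel that fall outside the range seen in the training data.
lo, hi = X.min(axis=0), X.max(axis=0)
extrapolation_map = ((X_pixels < lo) | (X_pixels > hi)).mean(axis=1)
```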

Deriving local prediction errors for complex machine learning models is, however, computationally demanding, and data producers often need to use non-parametric bootstrapping methods6. These can significantly increase production costs, ultimately leading to delays and/or a reduction in the number of mapped variables. In one of our recent analyses7, the prediction error maps required an order of magnitude more computational effort than the actual predictions. Overall, the ultimate goal of mapping should be to present not only the pixel-level prediction at every point but also a full suite of accuracy statistics (output probability distributions) at every point, or at least for the whole area of interest. This requires a great degree of planning, computational power, storage space, and fully transparent communication of results.
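As a lower-cost complement to a full non-parametric bootstrap, per-pixel error layers are sometimes approximated from the spread of individual trees within a single fitted forest. The sketch below illustrates that shortcut only; it does not yield calibrated intervals, and the data, model settings, and quantile choices are assumptions for illustration.

```python
# Minimal sketch: approximate per-pixel prediction intervals from the
# spread of individual trees in one fitted random forest. This is a rough,
# uncalibrated shortcut, not a substitute for proper bootstrapping.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))                        # toy training covariates
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=500)  # toy target values
X_pixels = rng.normal(size=(10_000, 6))              # pixels to be predicted

forest = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
forest.fit(X, y)

# Collect every tree's prediction, then summarize the spread per pixel.
tree_preds = np.stack([tree.predict(X_pixels) for tree in forest.estimators_])
prediction_map = tree_preds.mean(axis=0)
lower, upper = np.quantile(tree_preds, [0.16, 0.84], axis=0)
error_map = upper - lower        # interval width as a per-pixel error layer
```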

In summary, producing globally consistent and usable datasets takes time, effort, and healthy scientific debate. In the context of inevitably limited and imperfect datasets, more information on error estimates is superior to less information (i.e., masking pixels). Extrapolation inevitably comes with risks, but it is always useful to test scenarios so that the scientific community can advance our work and continuously improve our methods.

Overfitting is also a serious problem in data science. An effective strategy to avoid it is to spend more effort testing model performance using various resampling strategies6, especially a realistically simulated probability re-sampling. We recommend that, as often as possible, scientists fit their models using Ensemble Machine Learning frameworks such as mlr3, h2o, or scikit-learn, which come with robust mechanisms to reduce overfitting via meta-learners or stacking of models.
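For illustration only, a stacked ensemble with resampling-based evaluation might look like the following in scikit-learn, one of the frameworks named above; the base learners, meta-learner, and cross-validation scheme are arbitrary choices for the sketch, not a recommended configuration.

```python
# Minimal sketch: model stacking plus resampling-based evaluation in
# scikit-learn. The learners and CV settings are illustrative choices.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
        ("knn", KNeighborsRegressor(n_neighbors=10)),
    ],
    final_estimator=Ridge(),  # meta-learner combines out-of-fold predictions
    cv=5,                     # out-of-fold stacking guards against leakage
)

# Judge performance by resampling rather than by the training fit.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(stack, X, y, cv=cv,
                         scoring="neg_root_mean_squared_error")
print(f"CV RMSE: {-scores.mean():.2f} +/- {scores.std():.2f}")
```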

Another serious challenge is the problem of artifacts in input data. It is well known that many Machine Learning algorithms are sensitive to data artifacts. Algorithms such as Random Forest may be noise-proof8, yet they are sensitive to artifacts and bias in the training data. For example, some data producers use 0 for missing values; when not properly cleaned or accounted for, even a very small portion of such “polluted” data can have an enormous impact on the results. Luckily, there are more and more diagnostic tools to visualize and investigate data quickly (especially multivariate correlation plots and density plots) and identify artifacts. The R and Python open source communities provide many such tools, available completely open source (e.g., ref. 9). It is crucial to fully explore the data before production runs, to ensure datasets have been satisfactorily cleaned, and to document the methodology so that end-users can both understand and reproduce your work.
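The check below sketches how such artifacts might be caught before a production run, using pandas and matplotlib; the variable names, the simulated zero artifact, and the 1% flagging threshold are assumptions for illustration.

```python
# Minimal sketch: screen input tables for "polluted" values such as 0 used
# as a missing-data code. Column names and thresholds are illustrative.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "elevation": rng.normal(800.0, 200.0, 1_000),
    "rainfall": rng.gamma(2.0, 300.0, 1_000),
})
# Simulate the artifact: ~2% of rainfall records stored as 0 instead of NaN.
df.loc[df.sample(frac=0.02, random_state=1).index, "rainfall"] = 0.0

# Flag variables with a suspicious spike at exactly zero.
zero_share = (df == 0).mean()
print(zero_share[zero_share > 0.01])

# Density plots make such spikes easy to spot; a correlation matrix gives a
# quick multivariate sanity check.
df.plot(kind="density", subplots=True, layout=(1, 2), sharex=False, figsize=(8, 3))
print(df.corr())
plt.show()

# Recode the artifact to missing before any modeling.
df["rainfall"] = df["rainfall"].replace(0.0, np.nan)
```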

This interview was conducted by Walter Andriuzzi.