Background & Summary

In the last two decades, there has been a growing interest in estimating soil organic carbon (SOC) content and stocks for management (e.g. carbon sequestration), economic (e.g. greenhouse gas emission trading schemes, commercial incentives for Net Zero targets)1 and scientific (e.g. dynamics of the carbon cycle) purposes. Soils are an essential component of the ecosystem carbon cycle, storing approximately two-thirds of the total terrestrial organic carbon pool2. Organic carbon is also a key indicator of overall soil functioning. It is the main constituent of soil organic matter and is related to most soil functions, such as water and nutrient cycling, plant productivity, carbon storage and climate mitigation, among others3.

The total SOC concentration is conventionally measured at a point using laboratory techniques (e.g. the Walkley-Black method4 or high-temperature combustion), and the resulting values can be used in models of soil carbon dynamics such as RothC5. Organic carbon, however, is a continuum of compounds with different origins, stages of decomposition and decay, and chemical compositions6. The processes that control organic carbon composition vary spatially and with depth, depending on soil, climate, land use and management practices, and are controlled by a myriad of biotic and abiotic factors7. Several recent lines of work have therefore been developed to model the spatial distribution of SOC at regional8, national9 and continental10 scales. The mapping of SOC also reflects an increasing demand for spatially explicit organic carbon assessments, for example to prioritize local actions for soil carbon sequestration and to monitor SOC change over time. SOC maps are also used as input into Earth System Models and are relevant to calibrate and initialize mechanistic simulation models of the terrestrial carbon cycle11.

SOC maps are usually made using statistical or non-statistical models that exploit the quantitative relationships between point-measured values of SOC and a set of environmental covariates that control SOC spatial distribution11,12. Various methods can be used for this purpose, including geostatistical methods that rely on the variogram and kriging and, more recently, complex algorithmic tools from machine learning13. These models predict SOC at unobserved locations using the relationship fitted at the observation points and spatially explicit covariates, such as terrain attributes and remote sensing imagery.

In Australia, digital maps of SOC have been produced extensively for nearly two decades. The Soil and Landscape Grid of Australia (SLGA)14, for example, used a large soil inventory of more than 27,000 sites where SOC was either measured or inferred with spectroscopic techniques. These SOC maps have been used extensively for national carbon accounting and to monitor soil carbon change. However, data collection and processing techniques have progressed rapidly in the last eight years, and it is worthwhile to include these recent developments to produce new high-resolution maps of SOC in Australia.

In this paper, we present updated continental-scale maps of SOC. These maps are part of the new version of the SLGA and will be the new baseline maps of Australia from which carbon stocks can be estimated and changes in SOC can be monitored. The resolution at which the maps are produced (90 m × 90 m and 30 m × 30 m) enables applications from regional assessment to local-scale soil management. The maps are produced for six depth intervals, following the specifications of the GlobalSoilMap project15, and we also produced maps of the prediction uncertainty. The maps are based on measured total organic carbon (TOC) content in soils compiled from various sources, which represents the most comprehensive dataset on SOC currently available in Australia.

Methods

Organic carbon data

Data on total organic carbon (TOC) concentration (%) were extracted with the Soil Data Federator developed by CSIRO with support from the Terrestrial Ecosystem Research Network (TERN). The Soil Data Federator is a web API that compiles soil data from different institutions and government agencies throughout Australia. The SOC data used in this study are publicly available through the Soil Data Federator (https://esoil.io/TERNLandscapes/Public/Pages/SoilDataFederator/SoilDataFederatorHelp.html) managed by CSIRO16. The laboratory methods for total organic carbon included in the study are presented in Table 1. We selected TOC data from the period 1970–2020 as a compromise between the representativeness of current TOC concentrations and spatial coverage. The data were checked for errors and processed to harmonize units, excluding duplicates and potentially erroneous entries (e.g. missing upper or lower horizon depths, extreme TOC values, unknown sampling date). Additional TOC measurements from the Biomes of Australian Soil Environments (BASE) contextual data17 were also included in the analyses. TOC concentration for BASE samples was determined by the Walkley-Black method4 (method 6A1 in Table 1). Upper limits for TOC concentration by biome and land cover class were set according to the published literature, consistent datasets (the Australian national Soil Carbon Research Program (SCaRP) and BASE; see refs. 17,18) and data exploration, to exclude unrealistic TOC values (e.g. maximum TOC = 30% in temperate forests, maximum TOC = 14% in temperate rainfed pasture). Since TOC concentration in Australian ecosystems has been underestimated by previous SOC maps19, we did not set conservative TOC upper limits, knowing that the machine learning model would likely underestimate high SOC values.

Table 1 Datasets used for spatial modelling along with the laboratory method and reference.
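To illustrate the cleaning rules described above, the following R sketch applies the date, depth, duplicate and upper-limit filters to a table of records retrieved from the Soil Data Federator. The object and column names (toc_raw, site_id, toc, upper_depth, lower_depth, sample_date, land_cover) and the per-class limits are illustrative, not the actual Federator field names or the full set of limits used in this study.

```r
# Minimal sketch of the cleaning rules described above (names and limits illustrative).
library(dplyr)

# Hypothetical upper limits of TOC (%) by land cover class (values indicative only).
toc_limits <- tibble(
  land_cover = c("temperate forest", "temperate rainfed pasture"),
  toc_max    = c(30, 14)
)

toc_clean <- toc_raw %>%
  mutate(sample_date = as.Date(sample_date)) %>%
  filter(!is.na(sample_date),
         sample_date >= as.Date("1970-01-01"),
         sample_date <= as.Date("2020-12-31")) %>%        # restrict to 1970-2020
  filter(!is.na(upper_depth), !is.na(lower_depth),
         lower_depth > upper_depth) %>%                    # drop missing or inverted depths
  distinct(site_id, upper_depth, lower_depth,
           sample_date, toc, .keep_all = TRUE) %>%         # remove duplicate records
  left_join(toc_limits, by = "land_cover") %>%
  filter(toc > 0, is.na(toc_max) | toc <= toc_max)         # drop unrealistic TOC values
```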

Data for TOC concentration come from bulk soil samples taken at various depth intervals in the soil profile. To standardize the depth intervals, we used a mass-preserving depth function, the equal-area quadratic spline. The mathematical description of this function and its application to build continuous depth functions over the soil profile have been extensively described in the literature20,21. The equal-area quadratic spline was fitted to the whole collection of pre-processed TOC data, and values were then extracted for the 0–5 cm, 5–15 cm, 15–30 cm, 30–60 cm, 60–100 cm and 100–200 cm depth intervals, following the GlobalSoilMap specifications15. Boxplots of TOC values by biome and land cover after data cleaning and depth standardization are shown in Fig. 1. A total of 90,025 measurements of TOC are used hereafter for mapping.
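The depth standardization can be reproduced, for example, with the mpspline2 R package, which implements the equal-area quadratic spline. The sketch below assumes an input table with one row per horizon (site identifier, upper and lower depths in cm, TOC value) and uses the GlobalSoilMap depth breaks; the exact interface and output structure should be checked against the package documentation.

```r
# Sketch: standardize horizon TOC values to the GlobalSoilMap depth intervals with a
# mass-preserving (equal-area quadratic) spline, here via the mpspline2 package.
library(mpspline2)

# One row per horizon: site id, upper depth (cm), lower depth (cm), TOC (%).
# Column order and interface assumed; check the mpspline2 documentation.
horizons <- data.frame(
  SID = c("site_1", "site_1", "site_1"),
  UD  = c(0, 10, 30),
  LD  = c(10, 30, 60),
  TOC = c(2.4, 1.1, 0.6)
)

spl <- mpspline(horizons,
                var_name = "TOC",
                d = c(0, 5, 15, 30, 60, 100, 200))  # GlobalSoilMap depth breaks

str(spl, max.level = 2)  # per-site output with values averaged over the requested intervals
```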

Fig. 1
figure 1

Values of total organic carbon concentration by biome and land cover after data cleaning and depth standardization.

Environmental covariates

We collected a set of 57 spatially exhaustive environmental covariates made available by the Terrestrial Ecosystem Research Network (TERN), covering Australia and representing proxies for the factors influencing SOC formation and spatial distribution: soil properties, climate, organisms/vegetation, relief and parent material/age. The covariates were reprojected to WGS84 (EPSG:4326) and cropped to the same spatial extent. All covariates were resampled, using bilinear interpolation, or aggregated to conform to grid cell resolutions of 90 m × 90 m and 30 m × 30 m. The list of covariates along with their units and references is provided in Table 2. The covariates used in this study are freely available at https://data.tern.org.au/landscapes/slga/NationalMaps/SoilAndLandscapeGrid/. Instructions for accessing the covariate rasters as Cloud Optimised GeoTIFFs are provided at https://esoil.io/TERNLandscapes/Public/Pages/SLGA/GetData-COGSDataStore.html.

Table 2 List of environmental covariates with unit and associated reference when applicable.
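The covariate pre-processing can be reproduced, for example, with the terra R package, as sketched below for a single covariate layer; the file names are placeholders and the choice between bilinear resampling and aggregation follows the description above.

```r
# Sketch: bring a covariate raster onto the common 90 m grid (terra package;
# file paths are placeholders).
library(terra)

covariate <- rast("covariate_native_resolution.tif")   # a single covariate layer
template  <- rast("template_grid_90m.tif")             # target grid in WGS84 (EPSG:4326)

# Reproject to WGS84 if needed, then resample onto the template grid.
covariate_wgs84 <- project(covariate, "EPSG:4326", method = "bilinear")
covariate_90m   <- resample(covariate_wgs84, template, method = "bilinear")

writeRaster(covariate_90m, "covariate_90m.tif", overwrite = TRUE)
```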

Mapping

The spatial distribution of soil TOC concentration is driven by the combined influence of climate, vegetation, relief and parent material22. We thus modelled TOC concentration as a function of environmental covariates representing the biotic and abiotic controls on TOC. The TOC measurements and the corresponding values of the environmental covariates from Table 2 at the measurement locations were used to fit the mapping model. For the mapping, we used a machine learning algorithm called quantile regression forest (QRF).
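A minimal sketch of this pairing step is given below, using terra to extract covariate values at the observation coordinates; the object and column names (toc_clean, longitude, latitude) are illustrative.

```r
# Sketch: pair TOC observations with covariate values at the sampling locations
# (terra package; object and column names are illustrative).
library(terra)

covariates <- rast(list.files("covariates_90m", pattern = "\\.tif$", full.names = TRUE))
sites      <- vect(toc_clean, geom = c("longitude", "latitude"), crs = "EPSG:4326")

site_covs  <- extract(covariates, sites)                     # one column per covariate + ID
calib      <- cbind(as.data.frame(sites), site_covs[, -1])   # drop the ID column
```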

Quantile regression forest

Quantile regression forest23 is an ensemble of decision trees. A decision tree is built by recursively partitioning the calibration dataset on the covariates. At each step, a number of candidate partitions are evaluated with a splitting metric, the variance, and the partition with the smallest value is selected. The procedure is repeated until a user-defined stopping parameter, the minimum node size, is reached. For a single tree, the prediction is the average of the calibration values in the terminal node of the tree.

The decision tree is extended by the process of bagging (i.e. bootstrap aggregating)24, which builds an ensemble of decision trees called a random forest. In a random forest, a large number of decision trees is built on bootstrap samples of the original calibration data. For each tree, an additional random perturbation is introduced during partitioning: at each split, only a random subset of size mtry of the covariates is considered. The final prediction from the random forest is simply the average of all the decision tree predictions. Extending the standard random forest to QRF is straightforward. Instead of retaining a single statistic, that is, the mean prediction of the decision trees in the random forest, we retain all the calibration values in the terminal nodes of the decision trees. With QRF, the prediction is thus not a single value but a cumulative distribution of TOC at each location, from which empirical quantile estimates can be computed.

Fitting a QRF model is thus based on three user-defined parameters: the subset size mtry, the number of trees ntree and the stopping criterion for tree splitting nodesize. We fitted a QRF model for each depth interval with mtry and nodesize held at their default values: mtry defaults to the square root of the total number of covariates, rounded down, and nodesize was set to 5. To compromise between computational load and accuracy, we fixed the number of trees at 250. We tested parameter tuning using a random grid-search procedure over the three QRF parameters, with 1,000 parameter combinations and the mean squared error from a 10-fold cross-validation as the selection criterion. Parameter tuning was very computationally demanding for a negligible improvement in prediction accuracy, so we did not proceed any further with it. We used the R programming language and the ranger25 package for model fitting and prediction.
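The sketch below illustrates such a fit with ranger, where calib denotes a calibration table holding the standardized TOC values and the covariates for one depth interval (names illustrative); setting quantreg = TRUE retains the node observations needed for quantile predictions.

```r
# Sketch: quantile regression forest with ranger, one model per depth interval.
library(ranger)

qrf <- ranger(
  toc ~ .,                  # TOC as a function of all covariates in 'calib'
  data          = calib,
  num.trees     = 250,      # number of trees used in this study
  min.node.size = 5,        # default node size for regression
  quantreg      = TRUE      # keep node observations to allow quantile predictions
)                           # mtry left at its default, floor(sqrt(number of covariates))

# Predict the 0.05, 0.5 and 0.95 quantiles at new locations (e.g. the prediction grid).
pred <- predict(qrf, data = newdata,
                type = "quantiles", quantiles = c(0.05, 0.50, 0.95))
head(pred$predictions)
```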

Evaluation of the prediction and uncertainty quantification

Model prediction

Each depth-specific model of TOC was validated by K-fold cross-validation. The whole dataset was randomly split into K = 10 approximately equal-size folds. Each fold was held out in turn for validation and the remaining K − 1 folds were used as the calibration dataset. Models were compared using the mean error (ME), the root mean square error (RMSE), the squared Pearson correlation coefficient (r2) and the modelling efficiency coefficient (MEC), defined by the following equations:

Mean error:

$${\rm{ME}}=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}\left({z}_{i}-{\widehat{z}}_{i}\right),$$
(1)

where z and \(\widehat{z}\) denote the measured and predicted values of TOC, respectively, and n is the total number of measured values. The ME represents the bias, i.e. the systematic over- or under-prediction of the model.

Root mean square error:

$${\rm{RMSE}}=\sqrt{\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{({z}_{i}-{\widehat{z}}_{i})}^{2}}.$$
(2)

The RMSE represents the magnitude of the error; its optimal value is 0 and it is expressed in the unit of TOC (i.e. in %).

Squared Pearson’s r correlation coefficient:

$${r}^{2}={\left(\frac{\mathop{\sum }\limits_{i=1}^{n}\left({z}_{i}-\overline{z}\right)\left({\widehat{z}}_{i}-\overline{\widehat{z}}\right)}{\sqrt{\mathop{\sum }\limits_{i=1}^{n}{\left({z}_{i}-\overline{z}\right)}^{2}}\sqrt{\mathop{\sum }\limits_{i=1}^{n}{\left({\widehat{z}}_{i}-\overline{\widehat{z}}\right)}^{2}}}\right)}^{2},$$
(3)

where \(\overline{z}\) is the mean of the measured values and \(\overline{\widehat{z}}\) is the mean of the predicted values. The r2 describes the linear correlation between measured and predicted values and ranges from 0 (no linear correlation) to 1 (perfect linear correlation).

Modelling efficiency coefficient:

$${\rm{MEC}}=1-\frac{\mathop{\sum }\limits_{i=1}^{n}{\left({z}_{i}-{\widehat{z}}_{i}\right)}^{2}}{\mathop{\sum }\limits_{i=1}^{n}{\left({z}_{i}-\overline{z}\right)}^{2}}.$$
(4)

The optimal MEC value is 1, but the MEC can be negative if the mean of the measured values is a better predictor than the model. Positive MEC values can be interpreted as the proportion of variance explained by the model.
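The cross-validation and the four statistics can be computed as in the following R sketch, where calib is the calibration table used above and obs and pred are the measured and predicted TOC values pooled over the ten folds (names illustrative).

```r
# Sketch: validation statistics (Eqs. 1-4) from observed (obs) and predicted (pred) values.
validation_stats <- function(obs, pred) {
  me   <- mean(obs - pred)                                     # mean error (bias), Eq. 1
  rmse <- sqrt(mean((obs - pred)^2))                           # root mean square error, Eq. 2
  r2   <- cor(obs, pred)^2                                     # squared Pearson correlation, Eq. 3
  mec  <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)   # modelling efficiency, Eq. 4
  c(ME = me, RMSE = rmse, r2 = r2, MEC = mec)
}

# Example of a 10-fold cross-validation loop (random fold assignment).
set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(calib)))
cv <- lapply(1:10, function(k) {
  fit <- ranger::ranger(toc ~ ., data = calib[folds != k, ], num.trees = 250)
  data.frame(obs  = calib$toc[folds == k],
             pred = predict(fit, calib[folds == k, ])$predictions)
})
cv <- do.call(rbind, cv)
validation_stats(cv$obs, cv$pred)
```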

We use a solar diagram26 to exploit the relationships between the statistical indices and to compare the maps. In a Cartesian coordinate system, the x-axis represents the ME (Eq. 1) and the y-axis the standard deviation of the error (SDE). The distance from the origin to any point in the diagram is the RMSE (Eq. 2), since RMSE² = ME² + SDE², which makes it possible to see the individual contributions of the ME and SDE to the RMSE. The ME, SDE and RMSE are standardized by the standard deviation of the observations. The solar diagram further includes information on the correlation (Eq. 3) and MEC (Eq. 4). In a solar diagram, the closer a point is to the origin, the better the map. More information on this representation of the validation statistics is provided in ref. 26.

Uncertainty quantification

We report the depth-specific lower (q0.05) and upper (q0.95) limits of the 90% prediction interval as two maps. Validation of the uncertainty estimates is obtained through a so-called accuracy plot. In an accuracy plot, the proportion of cross-validation observations contained in each q prediction interval is calculated. Ideally, the proportion of observations covered by a q interval is approximately equal to the value of q. If the proportion of observations in the interval is substantially greater than q, the uncertainty is over-estimated, whereas a substantially smaller proportion suggests that the uncertainty is under-estimated. The process is repeated for all q, and the values of q are plotted against the actual proportion of observations covered in a scattergram. Ideally, all points should be close to the 1:1 line, which means that the uncertainty is adequately estimated. Note that we evaluate the uncertainty for all q against the cross-validation observations, but report the maps of the 90% prediction interval only.
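The accuracy plot can be computed from quantile predictions on held-out observations as sketched below; qrf and validation_data refer to the objects of the earlier sketches and the column name toc is illustrative.

```r
# Sketch: accuracy plot, i.e. observed coverage of symmetric prediction intervals,
# computed from QRF quantile predictions on held-out (cross-validation) data.
p_levels  <- seq(0.1, 0.9, by = 0.1)          # nominal interval probabilities
lower_q   <- (1 - p_levels) / 2               # e.g. 0.05 for the 90% interval
upper_q   <- 1 - lower_q                      # e.g. 0.95 for the 90% interval
quantiles <- sort(c(lower_q, upper_q))

# 'qrf' and 'validation_data' are assumed from the fitting sketch above.
qpred <- predict(qrf, data = validation_data,
                 type = "quantiles", quantiles = quantiles)$predictions
obs   <- validation_data$toc

coverage <- sapply(seq_along(p_levels), function(i) {
  lo <- qpred[, match(lower_q[i], quantiles)]
  hi <- qpred[, match(upper_q[i], quantiles)]
  mean(obs >= lo & obs <= hi)                 # proportion of observations inside the interval
})

plot(p_levels, coverage, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Nominal interval probability", ylab = "Observed coverage")
abline(0, 1, lty = 2)                         # 1:1 line = well-calibrated uncertainty
```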

Data Records

Prediction results

The depth-specific maps of TOC are distributed through the CSIRO Data Access Portal. Each product contains a map of TOC and maps of its lower and upper prediction intervals. The products are available for grid cell sizes of 30 m × 30 m27 and 90 m × 90 m28 and for the six depth intervals: 0–5 cm, 5–15 cm, 15–30 cm, 30–60 cm, 60–100 cm and 100–200 cm. The unit of TOC is percent mass (%). The coordinate reference system is WGS84 (EPSG:4326) and the maps have extent 112.54449°E–153.38239°E, 9.59539°S–43.38329°S for the 30 m product and 112.99958°E–153.99958°E, 10.0004°S–44.00042°S for the 90 m product. There are 36 maps in total: six depths, each with a prediction map and maps of the lower and upper prediction intervals, at two spatial resolutions. An example of the prediction along with the lower and upper intervals (0.05 and 0.95 percentiles) is shown in Fig. 4 for the 0–5 cm depth interval at 90 m × 90 m spatial resolution. The maps of TOC prediction for all depth intervals and the two spatial resolutions are shown in Fig. 5. The 90 m × 90 m resolution maps can also be accessed directly as Cloud Optimised GeoTIFF files through the TERN Data Store28.
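For programmatic access, the 90 m Cloud Optimised GeoTIFFs can be read directly over HTTP via GDAL's /vsicurl/ mechanism, for example with terra as sketched below; the URL is a placeholder and should be replaced by the actual file address listed in the TERN Data Store.

```r
# Sketch: read a Cloud Optimised GeoTIFF over HTTP without downloading the full file
# (terra/GDAL /vsicurl/ mechanism; the URL below is a placeholder).
library(terra)

cog_url <- "https://data.tern.org.au/.../TOC_0-5cm_90m.tif"   # placeholder address
toc_0_5 <- rast(paste0("/vsicurl/", cog_url))

# Crop to an area of interest given as a lon/lat extent.
aoi     <- ext(146, 148, -36, -34)
toc_aoi <- crop(toc_0_5, aoi)
```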

Technical Validation

Prediction accuracy

The validation statistics for the QRF model predictions at the two spatial resolutions and the six depth intervals are shown in Table 3 and visualized in Fig. 2. All maps have a negligible bias (the ME is close to 0 in all cases). The RMSE decreases with depth, which is expected because it is reported in the units of SOC (i.e. %) and the SOC content decreases with depth (see also Fig. 1). The solar diagram in Fig. 2 confirms that the RMSE is mostly composed of the SDE, because all points are close to ME = 0. The r2 and MEC suggest that the models are accurate, but that prediction accuracy decreases dramatically for deeper depth intervals. This is an expected result reported in previous studies14. There is no substantial difference in ME, r2 or MEC between the two spatial resolutions, and a slight improvement in RMSE between the prediction made with covariates at 30 m resolution and that made with covariates at 90 m resolution.

Table 3 Depth-specific validation statistics for the maps.
Fig. 2
figure 2

Summary diagram (solar diagram) of the validation statistics. The statistics are obtained by 10-fold cross-validation. The x-axis represents the mean error (ME) and the y-axis the standard deviation of the error (SDE), both standardized by the standard deviation of the observations (denoted by *). Any point in the diagram has a distance to the origin equal to the RMSE. The colour scale indicates the Pearson's r correlation coefficient and the modelling efficiency coefficient (MEC). Note that 30 m and 90 m refer to the maps made at 30 m × 30 m and 90 m × 90 m resolution, respectively.

Validation of the uncertainty

The validation of the prediction intervals is shown in Fig. 3. Note that while Fig. 3 shows all prediction intervals, we only report the maps of the 90% prediction interval, that is, the maps of the 0.05 and 0.95 percentiles (see Figs. 4 and 5). Overall, the uncertainty is adequately estimated because most points are close to the line of equality. The uncertainty of SOC for the 0–5 cm depth interval is slightly overestimated: for example, nearly 55% of the observations from the validation dataset fall within the 50% prediction interval.

Fig. 3
figure 3

Accuracy plots showing the nominal probability of each prediction interval against the proportion of validation observations falling within that interval, for the six depth intervals and the two spatial resolutions.

Fig. 4
figure 4

Map of the TOC prediction for the 0–5 cm depth interval at 90 m × 90 m spatial resolution (centre), along with the lower (left) and upper (right) prediction intervals (0.05 and 0.95 percentiles).

Fig. 5
figure 5

Maps of total organic carbon prediction for the six depth intervals and the two spatial resolutions (i.e. 90 m × 90 m and 30 m × 30 m).

Comparison with existing maps

We compared the topsoil (0–5 cm) TOC prediction map at 30 m × 30 m (Fig. 6a) with existing maps available for Australia: the previous version of the Soil and Landscape Grid of Australia (SLGA v1, Fig. 6b)29 and the global SoilGrids 2.0 map (Fig. 6c)30. Note that in Fig. 6 the SLGA v1 map is at 90 m × 90 m whereas the SoilGrids map is at 250 m × 250 m spatial resolution. Three small areas in the West, East and North of Australia are shown. Overall, the three maps have similar patterns and ranges of values over the three areas. However, the new map reveals much more detailed information than the previous maps. The SoilGrids map varies smoothly, while the SLGA v1 map captures more variation but misses the fine-scale variation associated with fields and river beds.

Fig. 6
figure 6

Topsoil (0–5 cm) TOC prediction maps: (a) this study at 30 m × 30 m spatial resolution, compared with (b) the previous version of the Soil and Landscape Grid of Australia and (c) the global SoilGrids 2.030 map, for three small areas in the West, East and North of Australia.

Usage Notes

With its high spatial resolution and national coverage, this dataset should be useful for a range of stakeholders, including policymakers, scientists and land users. Potential uses include establishing a baseline for Australia against which changes in soil organic carbon resulting from changes in land use, land cover, soil management practices and greenhouse gas offset activities can be estimated. The maps can also be used by researchers to gain insight into the large-scale and local-scale drivers of soil organic carbon in relation to biophysical factors and the environment. National-scale maps of soil organic carbon also constitute an input to guide the design of national soil monitoring networks and soil organic carbon accounting projects. The uncertainty maps reported in this study can be used to guide sampling to refine the existing maps in areas of large uncertainty. At the international level, we envision that these maps will help policymakers report on national soil carbon budgets and may assist with strategies to mitigate climate change through carbon storage in soils.