Spatial Distribution of Carbon Stored in Forests of the Democratic Republic of Congo

National forest inventories in tropical regions are sparse and have large uncertainty in capturing the physiographical variations of forest carbon across landscapes. Here, we produce for the first time the spatial patterns of carbon stored in forests of Democratic Republic of Congo (DRC) by using airborne LiDAR inventory of more than 432,000 ha of forests based on a designed probability sampling methodology. The LiDAR mean top canopy height measurements were trained to develop an unbiased carbon estimator by using 92 1-ha ground plots distributed across key forest types in DRC. LiDAR samples provided estimates of mean and uncertainty of aboveground carbon density at provincial scales and were combined with optical and radar satellite imagery in a machine learning algorithm to map forest height and carbon density over the entire country. By using the forest definition of DRC, we found a total of 23.3 ± 1.6 GtC carbon with a mean carbon density of 140 ± 9 MgC ha−1 in the aboveground and belowground live trees. The probability based LiDAR samples capture variations of structure and carbon across edaphic and climate conditions, and provide an alternative approach to national ground inventory for efficient and precise assessment of forest carbon resources for emission reduction (ER) programs.

transects to make sure that the process of the validation does not introduce any artificial bias. We 23 also used pre-existing plots such as plots from Dr. Bastin and Smithsonian, where some plots fall 24 within the LiDAR scenes. In all cases, the measurement of the trees followed the standard 25 protocols provided by the IPCC guidelines: Within each plot, all stems > 10 cm in diameter (D) 26 at breast height (dbh) were measured at 130 cm from ground or above the butresses or trunk 27

deformations. 28
In addition, the total heights (H) of all trees were measured using instruments such as the 29 LaserACE 2D Hypsometer (MDL, York, UK) (i.e. Bastin plots). At YANGAMBI plots, all tree 30 individuals were classified into diameter classes (that is, 10-20, 20-30, 30-50, >50 cm) and all 31 species contributing to 95% of the basal area of the plot were selected for further measurements. 32 For each of these species, two individuals were selected within each diameter class for tree 33 height measurements using a Nikon Laser Rangefinder Forestry Pro hypsometer 1 . Height of trees 34 that were not measured in the field was estimated using local DBH-H models, typically for each 35 study site. DBH-H allometric models have the form: 36 However, some sites did not have any measured tree heights and had no neighboring plots that 37 could be used to assign tree height. For trees in these sites (Ituri, PARAP Luhudi and PARAP 38 Maniema), AGB was estimated using a different allometric equation (see below, Eq. 2). 39 Tree species were identified and recorded for wood density calculations. Wood density was 40 assigned using the Global Wood Density Database (GWDD) for tropical trees 2,3 and the FAO 41 database, at the highest level possible, species level being the most precise level, followed by 42 genus and family level. In the cases where tree identification was missing or did not match any 43 name in our databases, the mean wood density of the plot was assigned. At YANGAMBI plots, 44 wood samples of all important species were taken for the calculations of wood density. 45 Plot location is known for all the plots and is characterized by either the center of the plot 46 (circular plots, with known radius) or the four corners of the plots (square or rectangular plots). 47 Although the accuracy of geolocation is supposed to be between 3 and 8m, additional error may 48 ; is the height of each tree in meter (m), and ; is the wood density of each tree in g cm -3 . is a 70 measure of environmental stress, taking into account temperature seasonality, precipitation 71 seasonality and climatic water deficit, at any location on the globe 6 .
-./ was used for sites 72 where no tree height measurement or estimation was available (Ituri, PARAP Luhudi and 73

PARAP Maniema). 74
The aboveground biomass was further augmented for all trees with DBH < 10 cm. Trees < 10 cm 75 in diameter and height > 1.3 m were measured in Ituri plots and few plots outside of DRC in 76 Gabon and Republic of Congo. We developed an equation from 1-ha plots in DRC and Gabon 77 where trees with DBH > 1cm have been measured in the field. Small trees will add 78 approximately 3-7% on the average to the aboveground biomass values. The equation below 79 converts the AGB estimates for trees > 10 cm (AGB >10cm ) to AGB estimate for all trees with 80 DBH > 1 cm (AGB >1cm where WD represents the plot mean value of wood density in units of g cm -3 , MCH is in units of 95 m from LiDAR observation, and ~0, < represents the uncertainty in measurements. 96

Belowground Biomass Estimation 97
We encountered virtually no consistent measurements of belowground biomass in our data 98 compilation efforts. This result was not surprising, as it is not practical to measure below ground 99 biomass in most tropical forests on a routine basis. It is also very difficult to develop an 100 appropriate, country-specific allometric equation for root biomass. Methods for collecting 101 belowground biomass data are laborious, time-consuming, and technically challenging to 102 perform correctly. Instead, belowground biomass is usually estimated from aboveground 103 biomass using regression equations developed from field data collected across multiple biomes. 104 A synthesis of data from available literature, along with elimination of data collected using 105 unclear or incorrect methods, provided a universal equation for estimating forest belowground 106 biomass 11 . The equations below show how the belowground biomass (BGB) can be estimated 107 from AGB for humid tropical forests: 108 For all forest types, we use: 110

National Carbon Mapping 112
To produce the national carbon map in 1-ha (100-meter) spatial resolution, we built a synergistic 113 model for estimating biomass and total carbon from a variety of data sources, including the in-114 situ measurements of key forest attributes such as AGB and wood density, airborne small-115 footprint LiDAR sampling with a wider spatial coverage at the country level, as well as the 116 spatial data from contemporary satellite imagery covering the DRC country in the optical and 117 microwave domain. Supplementary Fig. S2 demonstrates the basics and key steps of our 118 workflow. 119

Nation-level Mapping Data Preprocessing 120
Nation-level Input feature layers in our study include satellite data and ancillary mapping 121 products of administrative boundary and major land cover types in the country of DRC. We used 122 the Congo basin forest atlas data from the Ministère de l'environnement et Développement 123 Durable (MEDD) to define the country boundary 12 . From the source vector data in shapefile 124 format, we rasterized the map in 100-meter resolution with binary numbers. We assigned the 125 value of 1 to each pixel within the country boundary, and the value of 0 to the rest pixels. The 126 final output is under the geographic coordinate system (GCS) with WGS84 datum (pixel 127 resolution in 0.0008983°). This becomes our reference 1-ha map, and every other data set was 128 registered to this map. 129 We possess continuous coverage of medium-resolution satellite data products over the study 130 region. We used three sources of satellite data as our inputs to the synergistic model. The preprocessing of satellite data includes spatial aggregation and image registration. We 157 aggregated 4 bands of Landsat-8 mosaic, 2 bands (HH/HV) of ALOS PALSAR, and the SRTM 158 v3 DEM data into 100-meter spatial resolution using spatial average. We also kept the local 159 standard deviation of SRTM data as an additional layer, creating a final set of satellite inputs 160 with 8 layers. Using the 1-ha reference map created from the MEDD country boundary, we 161 registered all our satellite layers to the same raster grid. 162 To have a prior knowledge on the vegetation types of DRC, we used the land cover (LC) product 163 from the Climate Change Initiative (CCI) project of the European Space Agency (ESA) 19 . The 164 CCI-LC map has a global coverage in 300-meter spatial resolution and has been validated using 165 ground and high resolution imagery. The new release (version 1.6.1) has improved the 166 representation of cropland patterns in the Democratic Republic of Congo 19 . In this study, we 167 points in these pixels. These data gaps cannot be corrected by the spatial interpolation method, 205 because ground points exist in some of these gaps ( Supplementary Fig. S6b). Such ground points 206 in gaps lead to extremely low CHM values compared to nearby pixels, while they are apparently 207 canopy pixels observed from aerial photos ( Supplementary Fig. S6c). We decided to use the 2-208 meter spatial resolution when a sufficient number of LiDAR points are available in each pixel to 209 determine the maximum height. Results show that the CHM raster in 2-meter resolution can 210 eliminate most of the undesired gap signals without sacrificing too much spatial detail 211 ( Supplementary Fig. S6d). 212 The 2-meter CHM products were further used to create the 1-ha map of LiDAR-measured mean 213 canopy height (MCH). For 1-ha (100-meter) spatial resolution, we have approximately 2500 2-214 meter observations for each pixel. To ensure that the MCH values can well represent the mean 215 characteristics of 1-ha pixels, we set MCH observations valid only when 90% of the 1-ha pixels 216 are covered by airborne LiDAR measurements, i.e., we excluded most of the edge pixels with 217 partial coverage in the 1-ha mapping process. The final LiDAR-derived 1-ha map of DRC has a 218 total of ~665K valid pixels that can be used in national-level MCH and carbon mappings. 219

Spatial Modeling 220
With the availability of nation-level feature layers and LiDAR-derived MCH, we were able to 221 build a supervised learning model as the spatial estimator to predict the unknown MCH for 222 locations where we have environmental data. Maximum Entropy (ME), as a supervised learning 223 algorithm, is a probability-based algorithm that seeks the probability distribution by maximizing 224 the information contained in the existing measurements 21,22 . In the ME algorithm, a measurement 225 of class has the probability of occurrence ( j ) with the constraint that probabilities of all 226 ( j ) must sum to 1 ( Similarly, MSD2 is defined as MSD for samples with the sum of predicted MCH and measured 250 MCH to be larger than 60 meters. Results also suggest that we use a relatively larger background 251 number to avoid overfitting. 252 The mapping of WD adopted the same spatial modeling procedure as MCH. But WD training 253 data were from field measurements of 1-ha plots covering the whole country (Supplementary 254 To avoid the test samples in one plot being dependent on the training samples from nearby plots 275 (long-distance spatial autocorrelation), we sampled the training and test sets based on latitudinal 276 bands, so that the test samples were the farthest from training samples. This approach finds the 277 lower bound of our spatial modeling, and guarantees that most pixels at the country level should 278 have predictions within the uncertainty range. In every repetition of these 2 CV test, we 279 randomly sampled 5000 observations as training and 3000 observations as test samples. 280

Spatial Autocorrelation 281
We used the variogram plot to demonstrate the existence of spatial autocorrelations 25,30 . The 282 variogram-based approaches assume that the spatial autocorrelation of variables only depends on 283 the distance (h), while it has no other directional or locational dependence. The Variogram 284 (γ(h)) is defined as 285 where ℎ is the covariagram depending on the distance h. The investigation of variogram and 286 covariagram of model residuals can check the model performance on removing the spatial 287 dependence of the original data. 288

Pixel Uncertainty and Regional Uncertainty 289
Our spatial estimator (ME) inherently provides pixel-level mapping uncertainty as the sum of To assess the uncertainty of regional mean, we followed Chen et al. 2016 for regional 293 where is the total number of pixels, OE , e and ' are the pixel-level errors from 1) spatial 295 mapping uncertainty, 2) allometric equation uncertainty, and 3) uncertainty of predictor variables 296 (often LiDAR height retrievals), respectively. The three sources of errors are assumed 297 independent, so that the overall uncertainty of regional estimates comes from the three 298 covariance terms. 299 We model the first covariance using spatial autocorrelation, 300 where ;• is the correlation coefficient between pixels and , and it can be approximated from 301 the variogram (Eq. 12) normalized ℎ under the assumption that spatial autocorrelation only 302 changes with distance ℎ. 303 The second covariance is related only to the allometric model coefficients and can be 304 are statistically independent 31 . We thus conclude that the third covariance contributes very little 314 in the regional uncertainty estimates (often thousands or millions of 1-ha pixels), and has been 315 set to zeros in the regional uncertainty calculations. 316

Climate and Soil Data 318
To evaluate the environmental control of the DRC tropical forests, we used several data sets 319 produced from different research groups. The climate data were obtained from the WorldClim 320 climate database 32,33 for our study. WorldClim is a set of average monthly climate data collected 321 globally from ground-based weather stations and interpolated to a 1-km resolution grid.  The high-resolution original data were resampled from approximately 90-meter (3 arcsec) to 1-345 km (30 arcsec) spatial resolution using spatial average as well as the local variation (interquartile 346 range), both of which were used as environmental layers representing terrain characteristics. 347

Data Processing and Regression Model Selection 348
To compare different environmental variables with our retrieved forest carbon density values, we 349 resampled the original high-resolution data to 0.25°x0.25° spatial resolution, using the spatial  Table S3). For the current study, we simply wanted to identify the 359 important environmental variables which are linearly related to forest carbon density, without 360 considering multi-collinearity, spatial autocorrelation, and non-linear effects. To achieve this 361 goal, we used the Lasso (least absolute shrinkage and selection operator) regression 26 which 362 performs both variable selection and regularization to find the best minimum set of predictor 363 features. Through the 10-fold cross-validation procedure using all available training data, we 364 identified the best 6 environmental variables that give the best prediction accuracy for forest 365 carbon density of each selected land cover type (Supplementary Table S4 Supplementary