Background & Summary

The Coupled Model Intercomparison Project (CMIP), established in 1995, is an effort of the World Climate Research Programme (WCRP) that produces and studies the outputs of Global Climate Models (GCMs) to better understand the past, present, and future of the climate system. The new state-of-the-art climate projections for the coming decades, made available by CMIP Phase 6 (CMIP6)1, provide the underlying scientific basis for the Sixth Assessment Report (AR6)2 of the Intergovernmental Panel on Climate Change (IPCC). The Shared Socioeconomic Pathway (SSP) scenarios used in AR6 include the SSP1 (Sustainability), SSP2 (Middle of the Road), SSP3 (Regional Rivalry), SSP4 (Inequality), and SSP5 (Fossil-fueled Development) pathways3. The CMIP6 future scenario experiments4 are classified into priority groups, including (1) the tier-1 experiments with SSPs 1–2.6, 2–4.5, 3–7.0, and 5–8.5, and (2) the tier-2 experiments with SSPs 1–1.9, 4–3.4, 4–6.0, and 5–3.4.

To date, the resolution of the CMIP6 GCMs is still too coarse (usually over 100 km) for many applications, such as risk assessment, adaptation management, and decision-making at the regional or local scale. Besides the coarse resolution, biases and uncertainties contained in GCMs are often amplified from global to regional and local scales, restraining the usefulness and applicability of GCMs in small-scale studies5,6,7. Hence, downscaling techniques, which transform the coarse GCM information to a higher spatial resolution, should be applied to a specific region of interest before conducting local impact assessment and risk management. Two downscaling techniques, dynamical and statistical, have been widely used. Dynamical downscaling feeds the coarse-resolution initial conditions (IC) and lateral boundary conditions (LBC) provided by a GCM into a regional climate model (RCM) to produce higher-resolution climate information8,9. Along with its massive requirements for computing resources and modeling time, dynamical downscaling still has to deal with biases in, and sensitivity to, the boundary conditions taken from the host GCM, as well as the accuracy and uncertainty of the dynamics and physical parameterizations of each RCM8. Statistical downscaling, on the other hand, first searches for a relationship between local observed variables (called predictands) and large-scale GCM climate predictors. Then, assuming that the derived relationship holds over time, GCM predictors are fed into the statistical model to obtain future climate information10,11. The statistical downscaling approach is effective and generally does not require huge computing resources. To date, different statistical downscaling techniques have been widely adopted in studies of, e.g., climate risks and water resources at local and regional scales12,13,14.

In Vietnam, the Ministry of Natural Resources and Environment (MONRE) has published several reports on climate change and sea-level rise scenarios for Vietnam15,16,17,18. Constrained by computing resources, the results of the latest MONRE reports15,16 were based on only a limited number of dynamically downscaled experiments (14 experiments in total with 4 RCMs and 10 CMIP5 GCMs) and on only two Representative Concentration Pathway (RCP) scenarios, RCP4.5 and RCP8.5. In addition, a statistical downscaling effort has been implemented in Vietnam to downscale 31 CMIP5 GCMs under four RCP scenarios19,20.

In the present study, we produce a new high-resolution (10-km) climate dataset for Vietnam by statistically downscaling 35 CMIP6 GCMs under eight SSP scenarios. The Bias Correction and Spatial Disaggregation (BCSD) method21,22 was applied. Notably, we were able to collect daily temperature and rainfall data observed at stations in order to generate a gridded observation dataset over Vietnam, which served as an important and mandatory input for the BCSD scheme. The new dataset, called CMIP6-VN, contains downscaled values of 4 variables, i.e. daily rainfall (pr), daily mean 2m-temperature (tas), daily maximum 2m-temperature (tasmax), and daily minimum 2m-temperature (tasmin) for the historical period 1980–2014 and the future period 2015–2099. CMIP6-VN is expected to be a valuable input with the most up-to-date information for studies on climate change assessment and impacts in Vietnam.

Methods

Study area

We focus on Vietnam in this study. Vietnam has a diverse topography spanning a wide range of latitudes, a long coastline, and a monsoon-influenced climate. Based on specific criteria for radiation, temperature, and rainfall23,24, Vietnam is divided into seven climatic regions: (1) the Northwest (denoted as R1), (2) the Northeast (denoted as R2), (3) the Red River Delta (denoted as R3), (4) the North Central (denoted as R4), (5) the South Central (denoted as R5), (6) the Central Highlands (denoted as R6), and (7) the South region (denoted as R7) (Fig. 1). In this study, the CMIP6-VN dataset is developed for the inland territory of Vietnam.

Fig. 1

The study domain Vietnam and its 7 climate regions. Green and blue dots show the locations of the 147 temperature and 481 precipitation stations whose data are used in this study. The red circles indicate 12 random stations selected for the validation of the OBS gridded dataset. The topography over Vietnam (shaded, in m) is obtained from the 5-minute Gridded Global Relief Data (ETOPO5)46.

Data acquisition

To guide the statistical downscaling process, we collected daily-observed precipitation (pr) and near-surface temperatures (daily average tas, daily maximum tasmax, and daily minimum tasmin) from 481 and 147 stations, respectively, of the Vietnam Meteorological and Hydrological Administration (VMHA) for the period 1980–2014 (Fig. 1). The data underwent prior verification by VMHA’s Documentation Center, following established operational processes. We subsequently applied the three-sigma (five-sigma) rule to identify any suspect values in the temperature (rainfall) data and re-examined each identified case.
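The sigma-rule screening can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the mean and standard deviation are computed over one whole series, and the function name `flag_suspect` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def flag_suspect(values, n_sigma):
    """Return indices of values lying more than n_sigma standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # constant series: nothing can be flagged
    return [i for i, v in enumerate(values) if abs(v - mu) > n_sigma * sigma]

# Hypothetical daily temperatures (deg C) with one implausible entry at the end.
temps = [25.0 + 0.1 * (i % 10) for i in range(100)] + [60.0]
suspect_temp = flag_suspect(temps, n_sigma=3)   # three-sigma rule for temperature

# Hypothetical daily rainfall (mm) with one implausible entry at the end.
rain = [0.0] * 200 + [5.0] * 50 + [400.0]
suspect_rain = flag_suspect(rain, n_sigma=5)    # five-sigma rule for rainfall
```

Flagged indices are only candidates for re-examination; as described above, each case is then checked rather than removed automatically.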

Monthly rainfall and near-surface temperatures from 35 CMIP6 GCMs (Table 1) are downscaled for the historical period 1980–2014 and the future period 2015–2099 under the eight SSPs 1–1.9, 1–2.6, 2–4.5, 3–7.0, 4–3.4, 4–6.0, 5–3.4, and 5–8.5. The CMIP6 GCM data are acquired via the Earth System Grid Federation website (ESGF, https://esgf-node.llnl.gov/projects/cmip6/).

Table 1 List of 35 CMIP6 GCMs and associated scenarios considered in this study.

Building the gridded observation dataset

We interpolated the observed station data of rainfall and temperature into a 0.1° × 0.1° gridded dataset (hereafter called OBS) using the Spheremap25 and Kriging26 interpolation techniques, respectively. For rainfall, the Spheremap method has some advantages over other interpolation techniques such as Cressman27, Inverse Distance Weighted28, or Kriging26. The Kriging method, in turn, is more suitable for interpolating continuous spatial variables such as temperature29. The OBS dataset is subsequently used to bias-correct the CMIP6 GCM data.

Downscaling process

In this study, the BCSD method21,22 is applied for downscaling the CMIP6 GCM outputs for Vietnam. The BCSD consists of two major steps: bias correction (BC) and spatial disaggregation (SD), which are briefly described below.

Before the BC step, all GCM data and the OBS are regridded to the intermediate resolution of 1° × 1°. We detrend the temperature data at each grid point before the BC, then add the trends back afterward to preserve the climatic trends in the original GCM21,30,31.
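The detrend-then-restore idea can be sketched as follows. The text does not state the form of the trend, so this sketch assumes a linear least-squares trend over the time index; the function names are hypothetical.

```python
def linear_trend(series):
    """Least-squares linear trend over the time index, expressed as departures from the series mean."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    return [slope * (t - t_mean) for t in range(n)]

def detrend(series):
    """Return (detrended series, removed trend); the mean level is left untouched."""
    trend = linear_trend(series)
    return [y - tr for y, tr in zip(series, trend)], trend

def retrend(corrected, trend):
    """Add the saved trend back after bias correction."""
    return [y + tr for y, tr in zip(corrected, trend)]

# Hypothetical temperature series at one grid point (deg C), warming by 0.05 per step.
series = [20.0 + 0.05 * t for t in range(10)]
residual, trend = detrend(series)
restored = retrend(residual, trend)
```

Keeping the trend separate means the bias correction acts only on the detrended values, so it cannot distort the warming signal of the original GCM.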

The BC first applies the quantile mapping (QM) method32, which corrects the biases in the GCM data relative to the OBS at the resolution of 1° × 1°. For each variable, grid cell, and month, cumulative distribution functions (CDFs) are generated separately for the OBS and the historical GCM data. Transfer functions (TFs) that map the model CDFs onto the OBS CDFs are subsequently developed. The biases in the GCM monthly outputs are then corrected by applying the TFs, which replace each GCM value with the OBS value at the same CDF quantile. These QM TFs are assumed to be stable through the historical and future periods and are thus applied to correct the future projected variables. For temperatures, after the QM step, the previously saved climatic trends are added back to the quantile-mapped model data. Then, the climatological mean bias adjustment between the GCM and the OBS temperature at each grid point is applied, producing the BC temperature data at the intermediate resolution of 1° × 1°.
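The quantile-mapping transfer function can be illustrated with an empirical sketch for one variable, grid cell, and calendar month. The linear interpolation between sample points and the clamping of out-of-range values to the extreme quantiles are assumptions made for brevity, and all function names and sample values are hypothetical.

```python
from bisect import bisect_right

def quantile(sorted_sample, p):
    """Linearly interpolated quantile of a sorted sample for p in [0, 1]."""
    pos = p * (len(sorted_sample) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_sample) - 1)
    frac = pos - lo
    return sorted_sample[lo] * (1 - frac) + sorted_sample[hi] * frac

def cdf_position(sorted_sample, x):
    """Empirical CDF position of x in [0, 1]; values outside the sample are clamped."""
    if x <= sorted_sample[0]:
        return 0.0
    if x >= sorted_sample[-1]:
        return 1.0
    hi = bisect_right(sorted_sample, x)      # first index whose value exceeds x
    lo = hi - 1
    frac = (x - sorted_sample[lo]) / (sorted_sample[hi] - sorted_sample[lo])
    return (lo + frac) / (len(sorted_sample) - 1)

def quantile_map(value, model_hist, obs_hist):
    """Map a model value to the observed value at the same CDF quantile."""
    p = cdf_position(sorted(model_hist), value)
    return quantile(sorted(obs_hist), p)

# Hypothetical monthly means (deg C): the model runs 2 deg C warmer than the observations.
obs_hist = [20.0, 21.0, 22.0, 23.0, 24.0]
model_hist = [22.0, 23.0, 24.0, 25.0, 26.0]
corrected = quantile_map(24.5, model_hist, obs_hist)
```

Because future values can exceed the historical model range, a production implementation needs an explicit extrapolation rule at the tails; clamping, as done here, is only the simplest choice.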

In the SD step, the BC data are interpolated to the resolution of 0.1° × 0.1° following a three-step procedure: (1) bilinearly interpolating the additive (for temperatures) and multiplicative (for precipitation) change factors estimated between the BC GCM fields and the OBS climatology to the targeted high-resolution of 0.1° × 0.1°; (2) constructing the high-resolution BC GCM data by adding (for tas, tasmax, and tasmin) or multiplying (for pr) the interpolated change factors and the 0.1° × 0.1° OBS climatology; (3) finally, the monthly BC fields of the future period are temporally disaggregated to a daily scale by randomly choosing a respective month from OBS and additively (for temperatures) and multiplicatively (for precipitation) adjusting its daily values to reproduce the future monthly BC data. Note that for precipitation downscaling, the temporal disaggregation requires an assessment of the scaling factor and the number of wet days to avoid unrealistic precipitation values. For example, if the number of wet days in the selected month is less than three and its scaling factor is greater than three, another year with more wet days in that month will be selected.
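The temporal disaggregation in step (3) can be sketched as follows. This is a simplified illustration under stated assumptions: the wet-day threshold of 1 mm per day is not given in the text, candidate years are tried in random order until one passes the wet-day rule (rather than explicitly picking a year with more wet days), and all names and sample values are hypothetical.

```python
import random

def disaggregate_temperature(target_monthly_mean, obs_daily):
    """Shift observed daily values so that their mean equals the target monthly mean."""
    shift = target_monthly_mean - sum(obs_daily) / len(obs_daily)
    return [v + shift for v in obs_daily]

def count_wet_days(daily, wet_threshold):
    return sum(1 for v in daily if v >= wet_threshold)

def disaggregate_precipitation(target_monthly_mean, obs_by_year, rng, wet_threshold=1.0):
    """Scale one observed month so that its mean daily rainfall equals the target.

    obs_by_year maps a year to the observed daily values of the same calendar month.
    """
    years = list(obs_by_year)
    rng.shuffle(years)  # random order stands in for the random choice of an observed month
    for year in years:
        daily = obs_by_year[year]
        obs_mean = sum(daily) / len(daily)
        if obs_mean == 0:
            continue  # a completely dry month cannot be rescaled
        factor = target_monthly_mean / obs_mean
        if count_wet_days(daily, wet_threshold) < 3 and factor > 3:
            continue  # too few wet days and too large a scaling factor: try another year
        return [v * factor for v in daily]
    raise ValueError("no observed month satisfies the wet-day rule")

# Hypothetical observed months (mm per day): 1990 has 2 wet days, 1991 has 10.
obs_by_year = {1990: [0.0] * 28 + [5.0, 5.0], 1991: [2.0] * 10 + [0.0] * 20}
daily_pr = disaggregate_precipitation(2.0, obs_by_year, random.Random(0))
daily_tas = disaggregate_temperature(28.0, [25.0, 26.0, 27.0])
```

In this example the 1990 month is rejected (2 wet days, scaling factor of about 6), so the 1991 month is scaled instead, which avoids concentrating the monthly total in two unrealistically intense days.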

The implementation of the BCSD approach in this study contains two main phases:

- Phase 1 (Testing): the climatological fields of the training period 1980–2004 from the OBS and GCMs are used to develop the TFs between the simulations and observations. Then, the BCSD is applied to the independent (testing) period 2005–2014, and the results (hereafter called BCSD-CMIP6) are compared with OBS to examine the performance of the BCSD in reproducing past climate conditions.

- Phase 2 (Future downscaling): to maximize the construction period of the BCSD approach33,34, the full 35 years from 1980 to 2014 are used to generate the TFs and to guide spatial disaggregation for the future period 2015–2099. The BCSD is applied to all GCMs and SSP scenarios listed in Table 1 to generate the targeted CMIP6-VN dataset.

Data Records

The generated OBS and CMIP6-VN datasets are stored in the figshare repository35. Both datasets include daily values of rainfall (pr) and temperatures (tas, tasmax, and tasmin) and cover the historical period of 1980–2014. In addition, CMIP6-VN provides future projections for 2015–2099 under the eight SSPs. The datasets are in the Network Common Data Form (netCDF) classic format. For CMIP6-VN, one compressed tar file is created for each model, in which each variable for each historical/scenario period is saved in one netCDF file. The total data size for OBS and CMIP6-VN is 480 MB and 76.31 GB, respectively.

Technical Validation

The OBS gridded dataset

The OBS gridded dataset was constructed from daily rainfall and temperature data from 481 and 147 stations, respectively. To validate the quality of the OBS data, we followed the methodology presented in Nguyen-Xuan et al.36 by randomly selecting 12 stations in the sub-climatological regions in Vietnam (see Fig. 1 and Supplementary Figure S1). These 12 stations were removed from the set of stations used to interpolate the OBS gridded dataset, and with the remaining stations we applied the same interpolation algorithm to create a testing dataset, which we refer to as OBS-WT-12. Monthly rainfall and temperature data from OBS and OBS-WT-12 at these 12 stations during the period of 1980–2014 were compared with the gridded datasets typically used in the study region (Fig. 2). These datasets include: (1) the monthly Global Precipitation Climatology Centre (GPCC)37, (2) the daily Asian Precipitation – Highly-Resolved Observational Data Integration Towards Evaluation (APHRODITE)38, (3) the monthly Climate Research Unit (CRU) data39, and (4) the daily ERA5-Land data (hereafter called ERA5)40, which is the fifth generation European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric reanalysis. Note that GPCC provides only rainfall data, while only CRU and ERA5 provide daily maximum and minimum temperatures. The horizontal resolution of OBS, OBS-WT-12, and ERA5 is 0.1°, while APHRODITE has a resolution of 0.25°, and GPCC and CRU have a resolution of 0.5°.

Fig. 2

RMSE (mm d−1, left axis), and correlation (right axis) at the 12 randomly selected stations between the gridded datasets and the gauge observation.

Figure 2 illustrates the superior performance of OBS over the other datasets based on statistical parameters, namely the root mean square error (RMSE) and correlation (CORR), when compared to in-situ data from the 12 selected stations. Even without the use of inputs from these 12 stations, OBS-WT-12 outperforms the remaining datasets. For temperature, the correlations of OBS and OBS-WT-12 with in-situ data are approximately equal to 1, while for rainfall, they are greater than 0.9 across most stations. Meanwhile, the correlation of the remaining datasets is lower, particularly for rainfall across all stations and for temperature at stations numbered 9–12 (Table 2), located in the central and southern parts of Vietnam.

Table 2 Locations of 12 stations randomly selected for validating the OBS dataset.

The RMSE varies across stations, with OBS performing the best. For rainfall, OBS-WT-12 has a smaller RMSE than the other remaining datasets, except at the MongCai station, where the OBS-WT-12 RMSE is slightly higher. For temperature, OBS and OBS-WT-12 have lower RMSEs than the other datasets. At the SinHo station, the RMSE is much higher than at the other stations, possibly because of its high altitude of 1529 m, which means that the temperature observed at SinHo does not represent the temperature of the entire grid cell. Note that we adjusted the temperature for its topographic dependence when applying the Kriging interpolation method, using an environmental lapse rate of 6.5 °C/km.

In brief, the results from Fig. 2 demonstrate that OBS performs well not only in grids with monitoring stations but also significantly better than the commonly used gridded datasets in the study region over the grids where there are no stations for rainfall and temperature.
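For reference, the two scores used in this validation can be computed as in the sketch below. It assumes the standard definitions of RMSE and the Pearson correlation coefficient; the function names and sample values are hypothetical.

```python
from math import sqrt

def rmse(estimates, observations):
    """Root mean square error between gridded estimates and station observations."""
    n = len(observations)
    return sqrt(sum((e - o) ** 2 for e, o in zip(estimates, observations)) / n)

def pearson_corr(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical monthly rainfall (mm per day) at one station and at the co-located grid cell.
station = [1.0, 2.0, 6.0, 9.0, 4.0]
gridded = [1.5, 2.5, 5.0, 8.0, 4.5]
score_rmse = rmse(gridded, station)
score_corr = pearson_corr(gridded, station)
```

A low RMSE together with a correlation close to 1 indicates that the gridded product matches both the magnitude and the temporal variation of the station record.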

The CMIP6-VN dataset

Spatial distribution

Figure 3 displays the spatial distribution of observed and modeled 2m-temperature and their difference over Vietnam. The modeled 2m-temperature is derived from the ensemble mean of the 35 BCSD downscaled CMIP6 GCMs described in Table 1 (hereinafter referred to as BCSD-ENS). There are high similarities between the BCSD-ENS and OBS in both training and testing periods. The model ensemble effectively reproduces the slight temperature shift between regions where the temperature tends to increase from North to South and reaches the highest value in Southern Vietnam. The annual and seasonal biases, given by the difference between the BCSD-ENS and OBS, range from −0.26 °C to 0.82 °C and from −0.67 °C to 0.98 °C for the training (Fig. 3e–i) and testing periods (Fig. 3j–n), respectively. The overall average annual bias is negligible, reaching only 0.05 °C (0.16 °C) for the training (testing) period. It is worth noting that the seasonal temperature biases are more pronounced than the annual average biases (middle and lower panels of Fig. 3), and the biases in the testing period are more pronounced than those in the training period. The seasonal temperature biases are larger in the northern regions than in the southern regions, especially in the testing period. It should be noted that the bias values shown in Fig. 3 are very small compared to those of non-bias-corrected and higher resolution simulations over the region41,42,43,44. For example, the absolute temperature biases of an ensemble of dynamical downscaling simulations42 are generally between 0.5–1 °C, which are much larger than the biases obtained with the BCSD products in this study. In brief, although biases are unavoidable in the BCSD products, the overall average annual and seasonal temperature biases over Vietnam are relatively small, suggesting the appropriate quality of the BCSD temperatures.

Fig. 3

Spatial distribution of the average temperature in Vietnam; (a, b) and (c, d) indicate the average temperature of 1980–2004 and 2005–2014 by OBS and the BCSD-ENS, respectively; (e–i) and (j–n) show the biases of the BCSD-ENS compared to OBS for the training period 1980–2004 and the testing period 2005–2014. Hatched lines show the regions in which more than two-thirds of the CMIP6 models have the same sign as the BCSD-ENS. Statistical values (average, maximum, minimum) over the entire Vietnam inland territory are also displayed.

The biases of the BCSD-ENS daily maximum and minimum temperatures are provided in the online supplemental materials (Supplementary Figures S1, S2). The biases of these fields, although generally higher than those of 2m-temperature in both annual and seasonal averages, are still small enough to demonstrate the good performance of the BCSD products.

The precipitation climatology simulated by the BCSD-ENS agrees well with OBS, as shown by the relatively low annual biases of 2.17% and 5.23% in the training (Fig. 4e) and testing (Fig. 4j) periods, respectively. The locations of the dry regions (i.e. the Northeast and South of the South Central) and the wet regions (i.e. North of the Northeast and Central) are accurately captured by the BCSD outputs (Fig. 4a–d). The simulation biases are non-uniformly distributed between seasons and regions, and are generally larger in the testing period than in the training period. The BCSD-ENS precipitation for the dry season of December-January-February (DJF) exhibits the largest bias compared to the other seasons. The DJF biases reach 30.38% in the Central Highlands and −50.41% in the South during the testing period (Fig. 4k), and 24.9% in the north and −8.37% in the Central Highlands during the training period (Fig. 4f). However, it should be noted that a large relative bias (in %) can result from a small absolute bias (in mm), especially in a dry region, where a small change in precipitation often produces a significant relative bias.

Fig. 4

Same as Fig. 3 but for precipitation.

Seasonal cycles

The reproducibility of the BCSD-ENS for seasonal cycles is examined by calculating the temporal correlations between the simulation and observation (Fig. 5). The average correlation values for the testing period are very high, over 0.994 for temperature and 0.982 for precipitation, indicating the good performance of the downscaled products (Fig. 5d,h). The overall performance of the BCSD-ENS is slightly better in the training period than in the testing period. The BCSD-ENS is further compared to the simple bilinear interpolation products (BIP), bilinearly interpolated from the CMIP6 GCM outputs onto the grid of 0.1° × 0.1° over Vietnam. The ensemble average of the BIP products (BIP-ENS) also shows a good agreement with OBS, which is partly illustrated by the high correlation coefficient of 0.976 and 0.926 for the seasonal cycles of temperature and precipitation, respectively, in the testing period (Fig. 5c,g). The BCSD-ENS generally exhibits better performance compared to the BIP-ENS. The skill of the BCSD-ENS in reproducing the seasonal cycle is also better for temperature than precipitation.

Fig. 5

Temporal correlations of the temperature (upper) and precipitation (lower) seasonal cycles between the BCSD-ENS and BIP-ENS with OBS for the training period 1980–2004 (two left figures) and the testing period 2005–2014 (two right figures).

We compare the BCSD-ENS, the BIP-ENS, the individual models downscaled by both the BCSD and BIP methods, and OBS to further examine the ability of the downscaled product to reproduce the seasonal temperature cycles of the testing period (Fig. 6). The comparison was conducted at seven stations randomly taken from the list of stations located in the seven climatic sub-regions, including Lai Chau (Northwest), Bai Chay (Northeast), Nam Dinh (Red River Delta), Ha Tinh (North Central), Da Nang (Central), Kon Tum (Central Highlands) and Can Tho (South). OBS agrees well with the observed station data, as illustrated by low root mean square error (RMSE) values ranging from 0.26 °C in Can Tho to 1.51 °C in Lai Chau. The RMSE values between the BCSD-ENS and OBS are small, e.g. only 0.3 °C in Da Nang, indicating good agreement between OBS and the BCSD outputs. On the other hand, the BIP-ENS generally overestimates spring-summer temperature in the northern part of Vietnam and consistently underestimates temperature in the remaining months and regions. The RMSEs of the BIP-ENS, ranging from 0.6 °C in Da Nang to 1.44 °C in Lai Chau, are larger than those of the BCSD-ENS. Note that the dispersions among the BIP members are also larger than those among the BCSD members.

Fig. 6

Comparison of the seasonal temperature/precipitation cycles according to station data, OBS, BCSD outputs and their ensemble mean, and BIP outputs and their ensemble mean for seven station locations in the period 2005–2014. Values of the root mean square errors (RMSE) of the BCSD-ENS, BIP-ENS, and OBS are also displayed.

Regarding precipitation, the BCSD outputs effectively capture the temporal variation and rainfall amount at all stations, including Ha Tinh (North Central) and Da Nang (Central), where the rainy season arrives two to three months later than in the other regions of the country (Fig. 6). In months with high rainfall amounts, the dispersions among the downscaled products are larger. The average RMSEs of the BCSD-ENS outputs, ranging from 0.53 mm per day (Nam Dinh) to 0.78 mm per day (Da Nang), are much smaller than those of the BIP outputs (0.88–1.91 mm per day). The BCSD outputs consistently outperform the BIP outputs in all regions and seasons.

Added values

To assess whether the BCSD method outperforms the simple BIP approach, we used the added value (AV) metric45, defined as:

$$AV={\left({X}_{BIP}-{X}_{OBS}\right)}^{2}-{\left({X}_{BCSD}-{X}_{OBS}\right)}^{2}$$
(1)

where X represents one of the four downscaled variables. A positive AV indicates that the BCSD method outperforms the BIP approach, and vice-versa.
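Equation (1) translates directly into code. The sketch below evaluates AV per grid cell and reports the share of cells with positive AV; the function names and the four sample grid values are hypothetical.

```python
def added_value(x_bip, x_bcsd, x_obs):
    """AV = (X_BIP - X_OBS)^2 - (X_BCSD - X_OBS)^2, following Eq. (1)."""
    return (x_bip - x_obs) ** 2 - (x_bcsd - x_obs) ** 2

def percent_positive(bip, bcsd, obs):
    """Percentage of grid cells where BCSD is closer to OBS than BIP (AV > 0)."""
    avs = [added_value(a, b, o) for a, b, o in zip(bip, bcsd, obs)]
    return 100.0 * sum(1 for av in avs if av > 0) / len(avs)

# Hypothetical climatological temperatures (deg C) at four grid cells.
obs = [25.0, 26.0, 27.0, 28.0]
bip = [27.0, 27.0, 25.0, 28.5]
bcsd = [25.5, 26.2, 26.5, 29.0]
share = percent_positive(bip, bcsd, obs)
```

In this example BCSD is closer to OBS at three of the four cells, so the share of positive-AV cells is 75%.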

Positive AV grids dominate the majority of Vietnam’s mainland across all BCSD model outputs (Supplementary Figures S3, S4). For temperature, the percentage of grids with positive AVs ranges from 88.9% (EC-Earth3) to 98.2% (INM-CM5-0). Regarding rainfall, the percentage of positive AV grids varies from 85.9% (MIROC6) to 98% (FGOALS-g3). In brief, the BCSD method outperforms the BIP method in almost all regions of Vietnam for each CMIP6 GCM, and for the ensemble mean.

Future projections

The range of uncertainty among the BCSD downscaled CMIP6 GCMs for the historical (1986–2014) and future (2015–2099) periods, illustrated by one standard deviation around the mean, is displayed in Fig. 7. The averaged dispersion of the BCSD downscaled products in the historical period is relatively small, i.e. ±0.21 °C for temperature and ±5.1% for precipitation. A clear warming trend toward the end of the 21st century is projected under all SSPs, along with growing model uncertainty. In contrast to the clear increasing temperature trends across all models, the BCSD-ENS shows only a slight increasing precipitation trend over the whole of Vietnam in the late 21st century, with much larger uncertainties.

Fig. 7

Projected changes relative to the baseline period 1986–2005 based on the CMIP6-VN data for temperature (upper) and precipitation (lower). Five-year moving averages are applied. Colored lines show the ensemble means of the models and colored shaded areas represent the areas of uncertainty (1 standard deviation) for each scenario. The number of models used for each scenario is given in brackets.
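The quantities plotted in Fig. 7 can be approximated with the sketch below: anomalies relative to the 1986–2005 baseline, a five-year moving average, and the ensemble mean with a one-standard-deviation band. The use of the population standard deviation and of additive anomalies (as for temperature) are assumptions, and the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def anomalies(series, years, base_start=1986, base_end=2005):
    """Departures of a yearly series from its mean over the baseline years."""
    reference = mean(v for v, y in zip(series, years) if base_start <= y <= base_end)
    return [v - reference for v in series]

def moving_average(series, window=5):
    """Moving average over full windows only; returns len(series) - window + 1 values."""
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]

def ensemble_band(members):
    """Per-year ensemble mean and standard deviation across equally long member series."""
    per_year = list(zip(*members))
    centre = [mean(vals) for vals in per_year]
    spread = [pstdev(vals) for vals in per_year]
    return centre, spread

# Hypothetical yearly anomalies (deg C) from two downscaled models.
members = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
centre, spread = ensemble_band(members)
smoothed = moving_average([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], window=5)
```

The shaded band for a scenario then runs from `centre - spread` to `centre + spread` for each year.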

Usage Notes

The sample scripts and the CMIP6-VN products are available for download. The GCM input variables (pr, tas, tasmax, and tasmin) and the gridded observation dataset should be prepared following the guidelines provided together with the script and dataset (please refer to the file readme.txt in the GitHub link). The step-by-step guide to the BCSD process is also described in detail in the readme.txt file. It should be noted that the sample script is prepared for downscaling the ACCESS-CM2 outputs for the 2015–2099 period under the SSP5-8.5 scenario. The script can be applied for any GCM and any historical/SSP scenario at any time frame with minor adjustments.

The CMIP6-VN dataset provides high resolution and multiple climate change scenarios, but it is important to acknowledge its limitations. Specifically, the QM step of the BCSD method may amplify the tails of the distribution in certain situations, potentially leading to under/overestimation of extreme events. Furthermore, the temporal disaggregation step may constrain future changes in the distribution of temperature or precipitation, such as changes in extreme event frequency or intensity. These limitations should be taken into account when interpreting results and utilizing the CMIP6-VN dataset for impact assessments.