Background & Summary

Oceans play a significant role in regulating the amount of CO2 in the atmosphere. Human-induced anthropogenic activities have increased atmospheric CO2, counterbalanced by the increasing global ocean CO2 uptake. Thus, the oceans become over-saturated, and as a result, the regional oceans have been increasingly becoming sources of atmospheric CO2. An increase in ocean sink strength has been seen in the past decade (\(\approx 2.5\pm 0.6\) GtC per year1). The first two years of this decade are reported to have even higher ocean sink strength (\(\approx 3.0\pm 0.6\) GtC per year in 20202 and \(\approx 2.9\pm 0.6\) GtC per year in 20213). Based on previous literature, the estimated ocean sink strength of the global coasts has decreased to \(\approx 0.2\) PgC per year4,5. On the other hand, current research shows an increase in the continental shelves' sink strength6. The wintertime CO2 sink in the northern South China Sea behaves stronger after 2007, although this sea area still serves as a weak annual source of atmospheric CO27,8,9. These global studies have highlighted the importance and role of the ocean in modulating the atmospheric CO2, and hence the environment. With the rise in importance of studying the sea-surface partial pressure of CO2 (pCO2), the paucity of measured data (especially on a regional scale) is an impediment for observational analysis and model validation10,11,12.

This study aims to develop pCO2 climatological data based on observation for the Bay of Bengal region (BoB). The BoB is recognized for having complex physical dynamics because of significant freshwater input and its distinctive geographical location. The Ganges-Brahmaputra river system, the second largest river system in the world, brings in high freshwater along with organic pollutants into the BoB region13,14. The freshwater influx increases stratification and reduces the vertical mixing (thick barrier layer), which influences the absorption and/or outgassing of atmospheric CO2 in BoB15. The nutrients brought down by these rivers decrease the ocean-surface pCO2 in the offshore region, but its influence diminishes away from the coast14.

The BoB is influenced by the seasonal reversing coastal currents (East India Coastal Currents (EICC)). From February to March, the EICC brings high saline waters from south to north, which weakens stratification and initiates upwelling. The upwelling brings high subsurface dissolved inorganic carbon (DIC) to the surface, which increases the sea-surface pCO216,17. The EICC flows south from October to December, carrying less saline waters from the north towards the south. This results in low sea-surface pCO2 (≈ 320 μ atm) values during this period. The freshwater plume spread due to this southward motion of EICC results in low sea-surface pCO2 values in the northern BoB15. The spatial pattern of the sea-surface pCO2 is dominated by the biological and thermal mechanisms14,18. Temporal evolution is dominated by solubility, primarily increased by sea-surface temperature (SST) and decreased by DIC18,19.

The sparse observations of sea-surface pCO2 constitute a significant hindrance in validating the coupled bio-physical model simulated ocean carbon cycle. The studies15,17,18,19,20 based on bio-physical models often validate the model with the BOBOA (Bay of Bengal Ocean Acidification) mooring21 at 15° N, 90° E. Another popular observation data is the SOCAT (Surface Ocean Carbon-dioxide Surface Atlas) data22, which has poor spatial and temporal coverage in the BoB region. These models are often compared with observation-based products like GLODAP23, which is a spatial annual mean data, and Takahashi data24, which has a very coarse resolution (4° × 5°). The observation-based products suffer due to a lack of observations in the BoB region, specifically, the unavailability of data near the coast25,26. The high freshwater flux, affecting the physical dynamics, also affects these observation-based products as the general assumption (e.g., failure of linear relation assumption of potential alkalinity and sea-surface salinity (SSS)) often fails in the BoB24.

Besides bio-physical models, the use of regression models27,28,29 is popular to understand the carbonate dynamics of the BoB region. These regression models emulate the sea-surface pCO2 with relatively larger errors. The linearity assumption between the dependent and independent variables is not always true. Region-specific Machine Learning (ReML) algorithms showed promising results for the central BoB30. Hence, this study attempts to construct spatiotemporal sea-surface pCO2 maps for the BoB using observations and advanced ML techniques.


This study includes a significant number of open and coastal ocean pCO2 observations and associated variables regulating pCO2 variability in BoB to come up with a data set that could aid in training advanced ML models (Fig. 1). We assume that the sea-surface pCO2 is a function of sea-surface temperature (SST), sea-surface salinity (SSS), mixed layer depth (MLD), atmospheric CO2 mole fraction (xCO2) and chlorophyll-a (CHL). The influence of the above-mentioned independent variables in regulating sea-surface pCO2 variability has been included as a proxy of different mechanisms (thermal, solubility, mixing, air-sea interaction, and biology).

Fig. 1
figure 1

Representation of the study region (BoB) and the available of pCO2 observations included in this study.

Data acquisition

SST and SSS observations, along with collocated sea-surface pCO2, are available at the locations shown in Fig. 1. We obtain the synthesised SST, SSS, and pCO2 observations from SOCAT ( and other locations shown in Fig. 1. Other than the observations at SOCAT and RAMA buoy locations, the available observations are addressed here as SAS (Sridevi and Sarma) data28. The data collection and quality control methods are elaborately available in the literature corresponding to each of these data22,28. The monthly data frequency of collocated SST, SSS, and pCO2 from various sources is shown in Fig. 2. The maximum number of observations is sourced from SOCAT (Fig. 2b), but it does not uniformly cover all the months. Specifically, in the open ocean SOCAT and SAS data, the observations are unavailable for the winter monsoon season (Dec, Jan, Feb). But these data provide a very good spatial coverage in other seasons. Further, the winter monsoon season observations are available from two sources: firstly, from the RAMA mooring and secondly, from the coastal transects of SAS data (as shown in Fig. 1). All ship-based observations (available in the SOCAT database) from 1991 to 2020 were acquired for this study. In the SAS data, the observations were available from 1991 to 2019.

Fig. 2
figure 2

Monthly observations of SST, SSS, and pCO2 were acquired from various sources. The RAMA buoy (a) provides the sea-surface pCO2 observations between November 2013 to December 2018. All ship-based observations (available in the SOCAT database) from 1991 to 2020 were used in this study (b). Further, additional ship-based observations available from 1991 to 2019 (denoted here as SAS data) were also included (c). The availability of pCO2 data from coastal transects from 2007 to 2018 is shown in (d).

CHL concentration is not available in SOCAT and SAS (except in a few locations) database; hence, we use a merged satellite product OC-CCI (Ocean Color Climate Change Initiative, available at This data has excellent spatial (1/12°) and temporal coverage (1997–2020). We extract collocated monthly CHL concentrations from OC-CCI at the available observation locations. Like CHL, MLD data cannot be obtained from SOCAT and SAS since temperature and salinity depth profiles are unavailable. So we obtain MLD from GLORYS12V1 product, which is a CMEMS eddy-resolving reanalysis product (data available at The MLD product has a spatial resolution of 1/12°, and the data is available from 1993–2020. The xCO2 is obtained from CAMS CO2 atmospheric inversion product32,33,34 ( The xCO2 data has spatial coverage of 0.25° and is available from 1985–2020. We use the nearest-neighbor interpolation method to find collocated data at the available sea-surface pCO2 observation locations.

We checked the data distribution before using these data for training and predictions. The MLD and CHL data are converted to normal by taking their log transformation. Since ML models are known to be sensitive to outliers (> 3 σ), these are removed from the available data.

Splitting and scaling data

To avoid data leakage, using a train-test split from the Scikit-Learn module, we divided the data into train-set (80%) and test-set (20%)35. The same training and test data are used for all ML models used in this study, which gives an advantage in testing model performance with respect to common test data. The K-fold (10 K-folds) technique is utilized for training each model, which aids in circumventing the over-training issue.

The data is then scaled using the StandardScaler method from the pre-processing libraries of the scikit learn. Scaling converts all the data between the range −1 to 1 with a mean of zero and a standard deviation of one. This process, called standardization, simplifies learning new things for ML models.


The study tests four advanced ML algorithms, and the best among the four is used to create sea-surface pCO2 maps for the BoB. The description of each of these algorithms is as follows:

  • Multiple Linear Regression (MLR)

    Multiple linear regression is an analysis that builds the output variables from the input variables. The approach attempts to link the response and interpretation variables linearly. It extends the traditional least square strategy because it considers numerous pertinent variables.

    The use of multiple linear regression is evident and well-established for different applications in the literature27,28,29. It is to be noted that advanced ML can only be used if a significant number of observations are available. The multiple linear regression equation to predict sea-surface pCO2 is as follows:

    $$p{{\rm{C}}{\rm{O}}}_{2}=365.94+11.92\times {\rm{S}}{\rm{S}}{\rm{T}}+7.45\times {\rm{S}}{\rm{S}}{\rm{S}}-1.23\times {\rm{l}}{\rm{o}}{\rm{g}}({\rm{C}}{\rm{H}}{\rm{L}})+0.86\times {\rm{l}}{\rm{o}}{\rm{g}}({\rm{M}}{\rm{L}}{\rm{D}})+19.29\times {{\rm{x}}{\rm{C}}{\rm{O}}}_{2}$$
  • Artificial Neural Network (ANN)

    The Artificial Neural Network (ANN) is a part of artificial intelligence based on the biological neural system. It has become common practise to establish the pCO2 for regional scales30,36,37,38,39. The ANN comprises interconnected neurons that interpret incoming data like how the human brain learns. Each connection’s signals are absolute values, and each neuron’s output is calculated as the sum of its inputs, a nonlinear function. The edges are another name for the physical link that exists between neurons. Weights are allocated to the neurons and edges, and they self-adjust to get the best results. An input, an output, and at least one hidden layer compose an ANN. The neurons in the input layer equal the number of input parameters (independent variables) as the input layer is linked to the input data. Similarly, the output layer’s neurons match the number of dependent variables. A signal can go through numerous hidden layers comprising several neurons, from the input to the output layer. The hidden layer’s main objective is to establish a link between the output and input variables.

    The ANN hyper-parameters are tuned using KerasTuner40 class from the Keras library. Rectified Linear Unit (ReLU)41 activation function is used for the hidden layers and the Linear activation function for the output layer. The network is optimized using the Adam optimizer42. The loss function, Mean Absolute Error, is employed and must be minimized. Two executions per trial are allowed with the parameters set for 100 trials.

    There are 18 hidden layers in the optimized ANN used in this study. Table 1 displays the neurons associated with each hidden layer. The model operates most well at a 0.0001 learning rate.

    Table 1 Neurons in each hidden layer.
  • Xtreme Gradient Boosting (XGB)

    Extreme Gradient Boosting (XGB)43 is one of the members of the family of boosting algorithms built on decision trees. The gradient-boosted algorithm’s performance and computational speed were both expanded to produce the XGBoost algorithm. Since it performed well for the central BoB region30, the model’s great speed and accuracy motivate us to compare its performance to that of other models. Only the residuals are supplied to the following weaker learners once the trees or vulnerable learners have been added in sequential order. This method helps to cut down on errors. Contrary to gradient descent, the Newton boosting based on the Newton Raphson method accelerates the approach to global minima.

    Similar to the ANN, the XGBoost model also has tunable hyperparameters. Following previous literature30, we employ the Optuna optimization framework44 to fine-tune the hyper-parameters. At one may find the description for each of the XGB hyper-parameters. The hyper-parameters range and final optimized values are shown in Table 2.

    Table 2 Optimized values of the XGB hyper-parameters.
  • Random Forest(RF)

As XGB belongs to a family of boosting algorithms, Random Forest (RF)45 belongs to a family of bagging algorithms. RF is also built on decision trees. RF uses with-replacement random samples from the training data to generate decision trees, and the results of these decision trees are averaged to get the final output. The combined output from several trees tends to smooth out the volatility between trees and improves the ability to generalize the model as a whole. One appealing aspect of RF is its ability to estimate error using out-of-bag error estimates without needing a set-aside testing dataset46. Like the ANN and XGB, RF also had tunable hyper-parameters optimized using the Optuna optimization framework. The list of the range and optimized hyper-parameter are provided in Table 3.

Table 3 Optimized values of the RF hyper-parameters.

Mapping method

After selecting the best algorithm from the four algorithms described in the previous section, we employ the best algorithm to build spatial maps. To build these maps, we select SST and SSS from different products, and the rest of the input variables are chosen from the same data used for acquiring collocated data at SOCAT cruise locations. The SST is taken from the GLORYS12V1 product, which is a CMEMS eddy-resolving reanalysis product (data available at The SST has a spatial resolution of 1/12° and is available from 1993 to 2020. We obtained the SSS from ESA-CCI (ESA stands for European Space Agency, and CCI stands for Climate Change Initiative), a merged product of three satellite data (SMOS, Aquarius, and SMAP). This ESA-CCI (having a spatial resolution of 0.25°) is reported to perform excellently for the BoB region47. The ESA-CCI SSS is available at

Since ESA-CCI is available only for the period from 2010–2020, we predict sea-surface pCO2 for the previous decade (2010–2020) and then average it to form a climatology. The mean of the common period (2010–2020) is centered around 2015. Thus, 2015 is the climatological reference year for the INCOIS-ReML sea-surface pCO2 climatology. The reason for making climatology is to reduce the uncertainty caused by extreme events. All the independent data are interpolated to 1/12° resolution (same as SST, CHL, and MLD) and provided to the model for prediction. Further, we compare our product with the climatology produced by averaging pCO2 of RAMA (which is available from November 2013 to November 2018) and the gridded SOCAT data (having a spatial resolution of 1° and temporally available from 2010 to 2020). This product is expected to help in evaluating high-resolution bio-physical model simulated ocean carbon cycle as only a limited number of spatial pCO2 observations are available in the BoB across different time scales.

CO2 flux calculation

After preparing the climatological sea-surface pCO2 for the BoB region, we calculate air-sea CO2 flux to examine the sink and source regions of the BoB. The flux is calculated using the following equation.

$${{\rm{CO}}}_{{\rm{2}}}\,{\rm{flux}}={\rm{kw}}\times {\rm{L}}\times \Delta {{\rm{pCO}}}_{{\rm{2}}}$$

where kw is the piston velocity calculated as a function of wind speed48. We use ERA549 winds ( to calculate the piston velocity50. L represents solubility of CO251, and Δ pCO2 is the difference between sea-surface pCO2 and atmospheric pCO2.

Data Records

The high-resolution sea-surface pCO2 maps and associated CO2 flux data produced for the BoB (reported in this paper) could be accessed from 52 The dataset contains two products, the first being sea-surface pCO2 and the second being air-sea CO2 flux for the BoB region. It is a monthly climatological data. Each of these data has a spatial resolution of 1/12°. A positive value of CO2 flux indicates outgassing of CO2, and the negative value shows uptake of atmospheric CO2.

Technical Validation

In this study, we use the Taylor diagram representation53 to evaluate the performance of the models. The Taylor diagram provides a summarized graphical view of the model performance with respect to the available observation data. Three statistics, namely Correlation Coefficient (r), Standard Deviation (STD), and Centred Root Mean Square Difference (CRMSD), are used to create the Taylor Diagram. The correlation coefficient ranges between −1 and 1; higher negative or positive values represent a strong inverse or in-sync relation between prediction and observation. Ideally, the STD of predicted values should be the same as observed, and lower CRMSD represents better model performance.

Model selection

Figure 3 represents the performance of all four models against a common test data. The performance of multiple linear regression is the worst, whereas the ANN, RF, and XGB perform almost closely to each other. The CRMSD (centered root-mean-square difference) of ANN, XGB, and RF is 6.26, 4.52, and 5.71 μatm, respectively. At the same time, the correlation of ANN, XGB, and RF is, respectively, 0.978, 0.988, and 0.982. Based on the statistics, XGB seems to have a slight edge over the other two ML models. The STD of the test data is 30.38 μatm , and all three models (ANN, XGB, and RF) are very close to this STD. Hence, from Fig. 3, it is clear that the XGB performs best among the four ML models chosen in this study. Thus, we employ the XGB model to build sea-surface pCO2 maps for the BoB. Henceforth, we refer to the XGB-based climatological data product as INCOIS-ReML (Indian National Centre for Ocean Information Services-Regional Machine Learning model).

Fig. 3
figure 3

Comparison of model performance with respect to the test data.

Creating sea-surface pCO2 maps

INCOIS-ReML is a high-resolution monthly climatological data product (Fig. 4). The temporal evolution of the INCOIS-ReML pCO2 climatology has been compared with BOBOA mooring-based pCO2 climatology (averaging over the available observation from 2014–2018) using correlation, root mean square error (RMSE), and Willmott skill score (WSS)54. The monthly variability of sea-surface pCO2 is satisfactorily captured by the INCOIS-ReML (correlation (r) = 0.93; Fig. 5). This comparison shows that INCOIS-ReML underestimates the sea-surface pCO2 (particularly in April and May). However, the RMSE between the observed and modeled values is 7.40, which indicates that the error is within acceptable bounds (Fig. 5). The capability of INCOIS-ReML pCO2 is also evident from its WSS of 0.885.

Fig. 4
figure 4

Climatological monthly variability of the sea-surface pCO2 produced by INCOIS-ReML. The climatological reference year for this dataset is 2015.

Fig. 5
figure 5

Climatological monthly variability of the sea-surface pCO2 produced by INCOIS-ReML is compared with the climatology created by RAMA mooring buoy. The climatological reference year for this dataset is 2015.

Using the available observations from BOBOA mooring (location-specific data), we validated the temporal variability of INCOIS-ReML pCO2. However, a limited number of observations makes it difficult to validate spatial variability of INCOIS-ReML pCO. Therefore, we use observations-based gridded (1° × 1°) SOCAT product (available from the 1990s to date) to compare spatial variability of pCO2. As a first step, we generate a climatology of SOCAT data product with reference to the year 2015 for comparison. Before comparison, we interpolate the high-resolution INCOIS-ReML data product (Fig. 6a) to the spatial resolution of SOCAT gridded data product (Fig. 6b) using the nearest-neighbor interpolation method. Here, the reader must understand that the unavailability of a sufficient number of temporally varying observations in the BoB impacts the magnitude of the sea-surface pCO2 climatology derived from SOCAT. INCOIS-ReML satisfactorily captures the spatial pattern, i.e., lower sea-surface pCO2 in the north and higher sea-surface pCO2 in the south. Figure 6c,d provide spatial statistics to evaluate the performance of the INCOIS-ReML data product. A high correlation is seen in the central BoB region (Fig. 6c). A few grids show negative to low correlation in the south of the Sri Lankan coast. Figure 6d shows overestimation in the region east of 92° E, but low negative bias persists in the rest of the region. The domain average bias is approximately 0.92 μatm. The overestimation of the INCOIS-ReML can be attributed to the discontinuous time-series data from SOCAT in a large part of BoB.

Fig. 6
figure 6

Comparison (annual mean of the climatological year) between the (a) INCOIS-ReML produced sea-surface pCO2 and (b) SOCAT. The spatial correlation and spatial bias (difference (Model - Observation (M-O)) in an annual mean of the climatological year) are shown in figures (c) and (d). The climatological reference year for this dataset is 2015.

We compare the INCOIS-ReML pCO2 with the results of existing studies, carried out using in-situ observations, available in the literature to validate the spatial variability of pCO2 more rigorously. The spatial monthly variation of INCOIS-ReML is shown in Fig. 4. The northern BoB (approximately above 15° N) is seen to have lower sea-surface pCO2 than the southern BoB region55,56. The EICC (East India Coastal Current) is known to have dominant control over the sea-surface pCO2, especially in the western coast of BoB16 due to the spreading of river-influenced water along the coast. The northward-moving EICC is primarily strong from March to May when high salinity and pCO2 levels are observed. In contrast, southward-moving EICC during October to December brings river-influenced low saline and pCO2 water along the coast16. The INCOIS-ReML well reproduces the coastal pattern of pCO2 levels with the lowest during November and the highest pCO2 levels during May (Fig. 4). Overall, the spatial and temporal patterns are well captured by INCOIS-ReML.

Further, we compare our climatological product with six widely used ML-based pCO2 products (listed in Table 4). Figure 7 shows that based on the Willmott skill score (WSS), INCOIS-ReML performs better than all the other six products. This is due to two primary reasons: a) the inclusion of a significant number of open and coastal ocean observations from SAS leads to an improvement in model prediction, and b) the high spatial resolution of INCOIS-ReML. Figure 7 (based on WSS) shows that CMEMS performs as good as INCOIS-ReML. Hence, we further compare the two products spatially and explain the advantages of high-resolution INCOIS-ReML (Fig. 8).

Table 4 List of ML-based models with which we compare INCOIS-ReML.
Fig. 7
figure 7

Willmott Skill Score of the comparison between INCOIS-ReML and other six widely used ML-based products' climatological pCO2 with BOBOA based climatological pCO2 observations.

Fig. 8
figure 8

Seasonal spatial comparison of pCO2 between CMEMS and INCOIS-ReML.

The first observation from Fig. 8 is that INCOIS-ReML is capable of capturing spatio-temporal variability of pCO2 in the coastal waters of the BoB. Since BoB receives high freshwater flux from rivers and precipitation during the southwest monsoon (June-September), low salinity water is found in the north that spreads to the south by monsoon currents57. This freshwater plume spreads to the BoB by fall monsoon (ON)015,58,59. This plume first spreads in the eastern Bay, followed by the western Bay, with minimal impact on freshwater during spring inter-monsoon (March to May). CMEMS and INCOIS-ReML performed well in capturing spatial variations of low pCO2 primarily driven by the low saline waters in the BoB. However, the spatial variations were not well captured by CMEMS compared to INCOIS-ReML during spring monsoon (MAM). Perennial occurrence of low pCO2 due to low salinity during the summer monsoon season was reported in the northern BoB60, that was not well captured by CMEMS (Fig. 8). In addition, the pCO2 levels in the low salinity plume region were underestimated by CMEMS compared to INCOIS-ReML61. The presence of low-saline freshwater and associated strong stratification lower the sea-surface pCO2 values in the northern BoB16. These physical processes play a significant role in regulating the seasonality of sea-surface pCO2 in the BoB15,17,19. It is evident that the seasonality of sea-surface pCO2 is well captured by the INCOIS-ReML. Therefore, the high resolution INCOIS-ReML data product is an improved version of the climatological mean state of sea-surface pCO2 in the BoB region.

Hence, we provide a high-resolution sea-surface pCO2 maps and associated air-sea CO2 flux (calculated using the equation mentioned in the earlier section) data product, which would immensely aid in validating not only high-resolution bio-physical model simulated ocean carbon cycle but also coarser-resolution CMIP6 models. Further, it is worth mentioning that the inclusion of SAS data makes it possible for this high-resolution product to capture the coastal pCO2 dynamics better, which is missing in other observation-based data products. We understand that the product can still be improved, and we will keep on updating the product as the number of observations increases. This product is expected to be extremely helpful in validating models (especially spatial variability) used to understand the future scenarios of the sea-surface pCO2 in the BoB.