Prediction of monthly dry days with machine learning algorithms: a case study in Northern Bangladesh

Osmani, Shabbir Ahmed; Kim, Jong-Suk; Jun, Changhyun; Sumon, Md. Wahiduzzaman; Baik, Jongjin; Lee, Jinwook

doi:10.1038/s41598-022-23436-x


Article
Open access
Published: 16 November 2022


Shabbir Ahmed Osmani¹,
Jong-Suk Kim²,
Changhyun Jun^1,3,
Md. Wahiduzzaman Sumon⁴,
Jongjin Baik³ &
…
Jinwook Lee³

Scientific Reports volume 12, Article number: 19717 (2022)



Abstract

Dry days at various scales are an important topic in climate discussions. Prolonged dry days define a dry period, and dry days defined by a specific rainfall threshold can characterize the climate of a locality. The variation of monthly dry days from station to station may be correlated with several climatic factors. This study proposes a novel approach for predicting the monthly dry days (MDD) of six target stations in Bangladesh using different machine learning (ML) algorithms. Several rainfall thresholds were used to prepare the datasets of MDD and monthly wet days (MWD). A group of ML algorithms, namely Bagged Trees (BT), Exponential Gaussian Process Regression (EGPR), Matern Gaussian Process Regression (MGPR), Linear Support Vector Machine (LSVM), Fine Trees (FT) and Linear Regression (LR), were evaluated for building a competitive prediction model of MDD. In validation, EGPR-based models captured MDD over Bangladesh better than MGPR, LSVM, BT, LR and FT-based models. When MDD were the predictors for all six target stations, EGPR produced the highest mean R² of 0.91 (min. 0.89 and max. 0.92) with the lowest mean RMSE of 2.14 (min. 1.78 and max. 2.69). An evaluation of the ML algorithms with a one-year lead time showed that BT and EGPR performed best (R² = 0.78 for both models); because it had the lower RMSE, EGPR was chosen as the best model for the one-year lead time. The dataset of monthly dry–wet days was the best predictor in the lead-time approach. In addition, a sensitivity analysis showed how sensitive the prediction of MDD at the target stations is to each predictor station. Monte Carlo simulation was used to assess the robustness of the developed models, and the EGPR model remained robust up to a certain level of randomness in the testing data. The output of this study can be used by the agricultural sector to mitigate the impacts of dry spells on agriculture.


Introduction

The global temperature increased by 0.6 °C (0.4–0.8 °C) from 1901 to 2001, highlighting the warming of the Earth in recent decades¹. The resulting extreme temperatures, precipitation, and continuous wet or dry conditions have severely impacted human activities and the ecosystem^2,3,4. Similarly, droughts due to extreme temperatures and dry conditions have become increasingly commonplace worldwide^5,6. These drought events and their frequency are directly affected by global warming, with 30% of the Earth’s surface expected to experience as much as twice the drought intensity by the end of this century, affecting most of the global population^5,6,7. Hence, from an agricultural point of view, the occurrence of droughts is a prime focus of monitoring and management to ensure food security in affected areas.

Bangladesh is characterized as one of the most environmentally vulnerable countries in the world^8,9,10 owing to the substantial adverse impacts of climate change, in combination with its geographical location and socio-economic conditions. Its developing economy, geography, and high population density give Bangladesh a low adaptive capacity to withstand the adverse effects of climate change¹¹. The adverse impacts of climate change are generally visible in the agricultural sector, as most agricultural processes depend on rainfall¹². Agriculture contributes approximately 14% to Bangladesh’s GDP and employs approximately 40% of its labor force¹³. As a result of reduced or no rainfall, regional droughts currently affect approximately 2.5 million and 1.2 million ha of agricultural land in a year in the wet and dry seasons, respectively¹⁴. Therefore, predicting dry days could support measures to mitigate the regional effects of prolonged dry spells.

Droughts have been identified and characterized at different scales. There are four types of droughts¹⁵: meteorological, agricultural, hydrological, and socio-economic. Meteorological droughts are defined based on the degree of dryness (an expression of precipitation departure) and the duration of the dry period^{15,16,17,18,19}. Agricultural drought occurs when there is insufficient soil moisture to meet the needs of a particular crop at a specific time owing to deficient precipitation over an extended period. Hydrological drought occurs when there are deficiencies in surface and subsurface water supplies, based on measurements of streamflow and lake, reservoir, and groundwater levels. Meanwhile, socioeconomic drought refers to situations in which the supplied volume of water is less than the demand for water in a specific region²⁰. Hoyt²¹ defined socioeconomic drought as precipitation that is insufficient to meet the needs of human activities. This concept was expanded by Hoyt²² in 1942, who stated that socio-economic development in a region demands more water than is normally available.

Multiple drought indices (DIs) have been used to define drought events and their intensities²³ and to identify the spatiotemporal distribution of droughts²⁴. The standardized precipitation index (SPI)²⁵ is the most popular meteorological drought index, based on monthly precipitation²⁶. The effective drought index (EDI)²⁷ is another useful tool for distinguishing the characteristics of droughts. However, SPI has limitations in defining short- and long-term droughts, whereas EDI has proved effective at detecting both^26,28. In addition, different monthly SPI values can be obtained for a particular month, which can cause misinterpretation of droughts for that month, while EDI provides a single value. Other studies^28,29,30 have found that EDI can detect a wide range of drought events. Moreover, precipitation and temperature together define another drought index, the Standardized Precipitation Evapotranspiration Index (SPEI)³¹. The strength of SPEI lies in incorporating the effects of temperature variability into drought assessments.

Besides drought indices, other approaches have been used to characterize a dry event or period. A dry period has been described as prolonged consecutive dry days with little or no precipitation over a specific duration^32,33,34,35. Some meteorologists and climatologists define a dry spell by precipitation of less than 2 or 5 mm²⁷. Drought events have been characterized by 15 consecutive dry days^35,36, or by a long dry period of 25 consecutive dry days³⁵. Moreover, climate scenarios have been effectively presented through wet and dry periods^{37,38,39,40,41,42,43,44}, and wet and dry periods have been argued to be useful indicators of weather^45,46. In Switzerland, the spatial and temporal trends of wet and dry periods were found capable of characterizing the climate³⁸. Dry days have been found to generate heat waves, and in tropical weather dry days are directly or indirectly related to heat waves. Heatwave vulnerability was used to identify the hot zones in a locality⁴⁷ through climatic, socio-economic, physiological, and environmental parameters. Heat waves were also analyzed in terms of the effect of the North Atlantic Oscillation⁴⁸. Similarly, a dense meteorological network was used to study urban and rural air temperatures in both daytime and nighttime situations, and the urban heat index (UHI) was highest when the weather was dry⁴⁹. Hence, dry days are logically related to the occurrence of heatwaves.

Only a limited number of studies have predicted future dry days based on monthly cumulative dry days. Other researchers, for example, focused mainly on Monthly Consecutive Dry Days (MCDD) over Japan⁵⁰ to present the zonal climate and established the application of consecutive dry days. Meanwhile, a study⁵¹ on monthly dry days (MDD) argued that MDD cannot directly define a particular type of drought, but that it is meaningful for finding trends in dry spells in different months. This study was motivated to establish new approaches for finding correlations of MDD and monthly wet days (MWD) between stations.

Dry period or drought prediction and forecasting can be performed using either physical or data-driven models. A data-driven flood forecasting model⁵² showed that data-driven models require minimal information over a short duration to build an effective model. Precipitation and droughts have also been forecast using statistical data-driven models in several studies. For example, linear regression⁵³, support vector machine (SVM)⁵⁴ and artificial neural network (ANN)⁵⁵ have been used extensively for long-term drought prediction using SPI. These data-driven models took rainfall or drought-relevant variables in the previous months as inputs, and the rainfall or drought indicators as outputs. ANN-based models were more capable of forecasting droughts than the others. Furthermore, ANN performed better than multiple linear regression in forecasting SPEI in Wilsons Promontory in Australia⁵⁶. Several ML algorithms were also applied to rainfall forecasting⁵⁷, and the results were consistently better when autocorrelation functions were used.

However, in Pakistan, the prediction of SPEI showed the superiority of SVM over ANN and k-nearest neighbor (KNN)⁵⁸. Another study⁵⁹ found SVM more accurate than ANN in predicting SPI over Iran. These studies rest on the observation that ML models can achieve better accuracy using only hydro-meteorological data, without considering the inherent physical processes⁶⁰.

Drought forecasting with longer lead times and higher accuracy is of significant value in agricultural applications. A study of different lead times across drought studies acknowledged the challenges of lead-time forecasting⁶¹. Among the different ML algorithms, artificial neural network (ANN) based models have been used in several studies and proved effective at forecasting droughts at lead times of 1 to 12 months^62,63,64.

Uncertainty analysis of a proposed model assesses the robustness of the model. The uncertainty may originate from a systematic error or from a random error. Quantifying the uncertainty of hydrological models in predicting climate events has been established as a vital approach for determining the admissible domain of study inputs or model parameters. In these studies, Monte Carlo sampling-based methods were adopted^65,66,67. Different ranges of random data were generated from the input parameters to see the effect on the original level of output. For example, Monte Carlo simulation was used to analyze uncertainty in different water model parameters^68,69 and to check the robustness of the proposed models.

This study deals with monthly dry days (MDD) and monthly wet days (MWD) instead of consecutive dry days, and its novelty lies in finding regressions among MDD and MWD. It does not aim to characterize any dry spell or dry period in the study area. Rather, it seeks a strong regression among the MDD of different climate stations through several machine learning algorithms. Here, a dry day was defined as a day with rainfall of less than 2 mm, instead of 1 mm⁵⁰, and MDD was the cumulative number of dry days in each month. Datasets of monthly wet days, defined by several daily rainfall thresholds, were also used to establish regressions with MDD. Different ML algorithms, namely Fine Tree (FT), Bagged Trees (BT), Linear Regression (LR), Linear Support Vector Machine (LSVM), Exponential GPR (EGPR) and Matern GPR (MGPR), were used to find a strong prediction model of MDD at the climate stations. The robustness of the resulting models was also assessed using Monte Carlo simulation with different ranges of random datasets.

Results

Statistical summary

The MDD of the 27 stations show varied statistical responses. Figure 1A presents the ranges of mean, median and standard deviation, which differ considerably between stations. The datasets are approximately normally distributed, since the mean and median are very close to each other. Negative skewness indicates a higher concentration of data to the right. Skewness values in the range of −2 to +2 are generally acceptable⁷⁰. The datasets are only slightly skewed, with skewness in the range of −0.6 to −0.2, which means they are very close to normally distributed.

In contrast, kurtosis describes the peakedness or flatness of the data relative to a normal distribution. Figure 1B shows that all values are negative, within −1.5 to −0.5, which means thinner tails than a normal distribution. Kurtosis values in the range of −2 to +2 are generally considered acceptable evidence of a normal univariate distribution⁷⁰.
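The two shape statistics discussed above can be computed directly from a station's MDD series. The following is a minimal sketch, assuming population (biased) moment estimators and excess kurtosis (so that a normal distribution scores 0); the study does not state which estimator was used, and the sample values are hypothetical.

```python
def skewness(values):
    """Population skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment over variance^2, minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


# Hypothetical MDD values for one station over twelve months (not study data).
mdd_sample = [30, 29, 28, 25, 20, 12, 10, 11, 15, 24, 28, 30]
print(skewness(mdd_sample), excess_kurtosis(mdd_sample))
```

Negative values from both functions would correspond to the left-skewed, thin-tailed behaviour reported for Fig. 1.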

Prediction of MDD

The performance of the ML models for the prediction of MDD was assessed using multiple approaches. In the first approach, only the MDD of all stations were used as the study dataset: each target station was taken as the response while the remaining 26 stations were the predictors. In the second approach, the MWD of all 26 stations (other than the target) were used as predictors. In the third approach, integrated monthly dry and wet days (MDWD) at all stations were used as predictors. From the 35-year dataset, 23 years (about two-thirds) of data were used for training and 12 years (about one-third) for testing. Two performance indicators, R² and RMSE, were used to rank the prediction performance of each developed model.
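The split and the two indicators can be illustrated as follows. This is a minimal sketch, assuming a chronological split of monthly records and the usual coefficient-of-determination form of R²; the function names are hypothetical and the exact definitions used by the MATLAB tooling in the study are not stated here.

```python
import math


def chronological_split(rows, train_years=23, months_per_year=12):
    """Split monthly records into the first `train_years` years and the remainder."""
    cut = train_years * months_per_year
    return rows[:cut], rows[cut:]


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean square error, in the same unit as MDD (days)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# 35 years of monthly records: 276 months for training, 144 for testing.
months = list(range(35 * 12))
train, test = chronological_split(months)
print(len(train), len(test))
```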

Overall, EGPR and MGPR achieved better results than any other algorithm on the training dataset (Table 1). In particular, EGPR consistently outperformed all other algorithms, with the highest mean R² (~1.00) for the first and third approaches. MGPR had the second-best R² (~0.99) for the same two approaches. As expected, the performance of the developed models is somewhat lower for the testing period.

Table 1 Values of R² from the ML models for the approaches (1) MDD to MDD (2) MWD to MDD & (3) MDWD to MDD.


For the testing results under the second approach, BT outperformed the other algorithms with a mean R² of 0.87, while FT produced the lowest mean R² (~0.77). Under the first approach, EGPR, LSVM and LR each scored a mean R² of 0.91, with RMSE of 2.14, 2.16 and 2.16, respectively. In contrast, under the third approach, EGPR, MGPR, and LSVM each had a slightly lower mean R² (0.90) and higher RMSE of 2.19, 2.26 and 2.21, respectively. Therefore, EGPR achieved its best scores of R² and RMSE with the data of the first approach.

On the other hand, when MDD was predicted from MDWD using the third approach, BT scored the highest mean R² (0.91) and the second-lowest mean RMSE of 2.20 (Table 2). In summary, comparing all scores, EGPR has the lowest mean RMSE of 2.14 with the highest R² of 0.91. Hence, the study identified EGPR as the best model and the first approach as the best approach.

Table 2 RMSE of the ML models for the approaches (1) MDD to MDD (2) MWD to MDD & (3) MDWD to MDD.


Figure 2a and b compare the MDD predicted by all ML models for the six target stations under the first approach. The actual values at Sylhet are traced better by LSVM than by any other model, whereas EGPR and LR captured most of the actual MDD values at Srimangal. Meanwhile, Rangpur was predicted most accurately by EGPR, LSVM and LR, while EGPR and MGPR worked well for Dinajpur. Therefore, different models fit different stations best, while the combined performance in terms of lowest RMSE suggests EGPR as the best algorithm.

Lead time forecasting

The key objective of the lead-time approach was to evaluate the effectiveness of ML techniques for developing a reliable forecasting model that the agricultural industry can use to manage dry periods in advance, so that the authorities can take necessary precautions against possible dry spells. A one-year lead time was considered, to forecast dry days one year ahead. All three approaches and their predictors were employed to identify the most significant input datasets for building an MDD forecasting model with high R² and low RMSE. The training dataset contained predictors from 1982 to 2003 and responses from 1983 to 2004. The testing period for the predictors was from 2004 to 2016, and consequently, the forecasted period was 2005–2017.
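The pairing of predictors with responses one year later can be sketched as a simple shift of the monthly series. The helper below is hypothetical and assumes that both series are ordered chronologically at monthly resolution, so that a one-year lead corresponds to an offset of 12 rows.

```python
def make_lead_pairs(predictors, response, lead_months=12):
    """Pair the predictor row of month t with the response of month t + lead_months."""
    if len(predictors) != len(response):
        raise ValueError("predictors and response must cover the same months")
    n_pairs = len(response) - lead_months
    if n_pairs <= 0:
        raise ValueError("series is shorter than the requested lead time")
    # Drop the last `lead_months` predictor rows and the first `lead_months` responses.
    return predictors[:n_pairs], response[lead_months:]


# Three hypothetical years of monthly data from a single predictor station.
predictors = [[month] for month in range(36)]
response = [100 + month for month in range(36)]
X, y = make_lead_pairs(predictors, response)
print(len(X), X[0], y[0])
```

With this alignment, the first predictor year is matched to the second response year, mirroring the 1982 to 2003 predictors and 1983 to 2004 responses described above.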

The results of the lead-time approach in Table 3 showed a consistent regression for forecasting MDD. For the third approach, the BT and EGPR models produced the highest R² and lowest RMSE among the models. With an identical mean R² of 0.78, BT and EGPR are the strongest models in this simulation for predicting MDD with a one-year lead. However, EGPR outperformed BT on the basis of a lower RMSE.

Table 3 R² & RMSE of the ML models for the approaches (1) MDD to MDD (2) MWD to MDD & (3) MDWD to MDD using testing dataset.


The performance of LSVM was not satisfactory, with a low R² (0.71), even though it had the lowest RMSE (2.75) for forecasting Srimangal. In addition, FT produced the highest RMSE (5.32) for Dinajpur and the lowest R² (0.53) for Mymensingh. EGPR and LSVM were competitive for Rangpur, with the highest R² but different RMSE. Every ML algorithm uses a specific set of model parameters and coefficients to generate prediction models from a variety of input datasets, minimizing prediction errors as measured by performance indicators such as RMSE and R²^57,71. Consequently, performance levels vary here across ML algorithms as well as input datasets.

The results for the testing dataset using EGPR are shown in Figs. 3 and 4. Most of the highs and lows are captured by the model. However, some MDD values show slight deviations. For example, in 2006 the predicted values deviate noticeably from the actual values. These deviations are small compared with the overall agreement of the predictions. In particular, Sylhet and Bogra have a very good one-year lead time prediction throughout the testing period.

Sensitivity analysis

Sensitivity analysis assesses the influence of input parameters in developing data-driven models. The focus is on how the input parameters affect the variation of the model output. Different parameters have different (sometimes extreme) effects on the model’s outcome. Because some parameters play significant roles while others are only marginally important, sensitivity analysis is a valuable tool.

To perform the sensitivity analysis, a scenario was assumed in which a station did not have any data in the testing period. Taking each target station in Northern Bangladesh in turn, all 26 predictor stations were checked through the developed EGPR model. Figure 5 summarizes the resulting prediction levels for the six target stations. The significance of each station in model validation is checked through this process. The results showed that the stations needed to reach the desired levels of prediction vary between targets.
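One way to picture this station-by-station check is a leave-one-station-out loop. The sketch below is illustrative only: it assumes the model is refitted without the excluded station, which the text does not confirm, and `fit`, `score` and the toy row-mean model are hypothetical stand-ins for the trained EGPR model and its performance indicator.

```python
import math


def leave_one_station_out(X_train, y_train, X_test, y_test, fit, score):
    """Score the model once per predictor station, with that station's column removed."""
    results = {}
    for j in range(len(X_train[0])):
        train_rows = [row[:j] + row[j + 1:] for row in X_train]
        test_rows = [row[:j] + row[j + 1:] for row in X_test]
        model = fit(train_rows, y_train)
        results[j] = score(y_test, [model(row) for row in test_rows])
    return results


def fit_row_mean(X, y):
    """Toy stand-in for a trained regressor: predict the mean of the predictor row."""
    return lambda row: sum(row) / len(row)


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical MDD values for three predictor stations (not study data).
X_train, y_train = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]], [2.0, 3.0]
X_test, y_test = [[3.0, 4.0, 8.0]], [5.0]
print(leave_one_station_out(X_train, y_train, X_test, y_test, fit_row_mean, rmse))
```

A station whose removal changes the score most would be read as the most sensitive for that target.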

Rangpur is the most sensitive station when predicting the MDD of Sylhet with a one-year lead time, whereas Faridpur and Rajshahi were sensitive without any lead time (Fig. 5). Mymensingh and Khepupara were the least sensitive without any lead time, while Comilla was the least sensitive with a one-year lead when targeting Sylhet.

Sylhet is significant when targeting Srimangal, Rangpur and Dinajpur, while no station is significant for predicting Mymensingh at zero lead time. In summary, for the different target stations with no lead time, the R² values of the predicted models lie within 0.90 ± 0.04 for Sylhet, 0.84 ± 0.03 for Srimangal, 0.88 ± 0.02 for Rangpur, 0.94 ± 0.02 for Mymensingh, 0.86 ± 0.02 for Dinajpur, and 0.87 ± 0.03 for Bogra. In contrast, with a one-year lead time, R² values remain around 0.78 when targeting Sylhet, Srimangal, Mymensingh and Bogra, whereas R² was approximately 0.68 and 0.81 for Rangpur and Dinajpur, respectively.

In summary, the sensitivity analysis shows that no particular station was highly sensitive for most of the target stations. Specifically, Sylhet and Dinajpur were found sensitive solely for Rangpur and Srimangal stations, respectively. Hence, sensitivity analysis is less informative for the procedure and models of this study.

Uncertainty analysis

An uncertainty analysis shows the propagation of uncertainty through hydrological models and derives meaningful uncertainty bounds for the model simulations⁷². This study used two scenarios to perform the uncertainty analysis. In the first, any one station was assumed to have random data within different coefficients of variation (CV). In the second, any two stations were random within different CVs. CVs of 0.01, 0.05, 0.1, 0.5, 1 and 2 were considered in the simulation.

The typical syntax to generate random data is:

For the Monte Carlo simulations, a total of 10,000 new datasets⁷³ were generated for each CV. When Sylhet was the target, for example, a station was picked randomly among the 26 stations and the data of that station for the testing period were generated randomly with a specific CV. This was repeated 10,000 times for that CV. Every dataset was then evaluated by the developed EGPR model. The statistical details of the results are summarized in the boxplots in Figs. 6 and 7.
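The procedure can be sketched as below. This is a minimal illustration under stated assumptions, not the study's MATLAB code: the randomized station is assumed to be redrawn from a normal distribution whose mean is the station's testing-period mean and whose standard deviation is CV times that mean, and `predict` is a hypothetical stand-in for the trained EGPR model.

```python
import random
import statistics


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = statistics.fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def randomize_station(X, station, cv, rng):
    """Copy X with one station column replaced by normal draws (assumed scheme)."""
    mu = statistics.fmean([row[station] for row in X])
    sigma = cv * abs(mu)  # CV = standard deviation / mean
    return [row[:station] + [rng.gauss(mu, sigma)] + row[station + 1:] for row in X]


def monte_carlo_r2(X, y, predict, cv, n_runs=10_000, seed=0):
    """Collect the R² obtained on the testing data for each randomized dataset."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        station = rng.randrange(len(X[0]))  # pick one predictor station at random
        X_random = randomize_station(X, station, cv, rng)
        scores.append(r_squared(y, [predict(row) for row in X_random]))
    return scores


# Hypothetical testing data for two predictor stations and a toy row-mean model.
X_test = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
y_test = [1.0, 2.0, 4.0]
scores = monte_carlo_r2(X_test, y_test, lambda row: sum(row) / len(row), cv=0.1, n_runs=100)
print(min(scores), statistics.median(scores), max(scores))
```

The resulting list of scores per CV is the kind of distribution that a boxplot would summarize; the two-station case would repeat the replacement for a second randomly chosen column.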

Case A: a single station was random

This type of uncertainty could originate from errors in data recording, data processing, or systems. The results in Fig. 6a and b show that the models are consistent under randomness of the predictors up to a CV of 0.1. If any station's data vary at a CV of 0.5 or more, the performance of the models deteriorates.

Case B: any two stations were random

If any two stations have random data with different spikes, the developed model should also work with the new testing dataset. Figure 7a and b present the outputs of this scenario. This type of randomness produced responses quite similar to those for the randomness of one station. However, for Rangpur, Mymensingh, Dinajpur and Bogra, the robustness of EGPR extended to random data with a CV of 0.5.

Discussion

The data analysis and the models developed with different sets of inputs and outputs constitute a detailed data-driven model for forecasting a climate parameter. The simulations in this study generated some key outputs for the prediction of MDD. The study was not intended to define a drought or any similar event through the values of MDD. Instead, it sought correlations among the MDD of all climate stations in Bangladesh through regressions using ML algorithms.

The MDD of the target stations showed a good regression with the MWD and MDD of the predictor stations in Bangladesh. ML algorithms were capable of building a good prediction model of MDD. A prolonged dry spell or regional drought due to low or no rainfall is harmful to the agricultural sector¹⁴. Dryness is the defining feature of a dry spell, thereby allowing the interpretation of a drought. This study can help the agricultural sector to take precautions against periodic dry days in a month. The prediction models were assessed on the basis of R² and RMSE. A very strong regression was found among the MDD of the climate stations. MWD were also strongly correlated with MDD, which suggests a future study targeting the MWD of the target stations. The response with a one-year lead time was also satisfactory for predicting MDD.

The sensitivity analysis examined how much the presence of each station contributes to producing the desired level of model output. It showed that no particular station was highly sensitive for most of the target stations. However, Sylhet and Dinajpur were found sensitive for predicting the MDD of Rangpur and Srimangal, respectively. In general, a single station does not produce much deviation in the model outputs.

Uncertainty analysis assessed the domain of the study data for predicting MDD with a satisfactory level of output. The robustness of the proposed models under Monte Carlo simulation was clearly determined for certain ranges of random input data. In most cases, input data could vary with a maximum CV of 10% while keeping the output of the predicted model at a satisfactory level. Figures 6 and 7 summarize this scenario. However, for Rangpur, Mymensingh, Dinajpur and Bogra, the robustness of EGPR was sustained up to random data at a CV of 0.5.

Several optimized model parameters from the simulation of different ML algorithms in MATLAB are summarized. Tables 4 and 5 present the changes of the optimized parameters of the developed EGPR models for six target stations. The EGPR models with these values of the model parameters can be used for forecasting MDD without lead (Table 4) and with one year lead time (Table 5).
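For orientation, the covariance functions conventionally associated with the names EGPR and MGPR can be written out as below. This is a sketch under assumptions: the Matern order is taken to be 5/2 (the text does not state it), and `sigma_f` (signal standard deviation) and `length_scale` are generic hyperparameter names that may not match the quantities listed in Tables 4 and 5.

```python
import math


def exponential_kernel(x1, x2, sigma_f, length_scale):
    """Exponential kernel: sigma_f^2 * exp(-r / length_scale), r = Euclidean distance."""
    r = math.dist(x1, x2)
    return sigma_f ** 2 * math.exp(-r / length_scale)


def matern52_kernel(x1, x2, sigma_f, length_scale):
    """Matern 5/2 kernel: sigma_f^2 * (1 + a + a^2 / 3) * exp(-a), a = sqrt(5) * r / length_scale."""
    a = math.sqrt(5.0) * math.dist(x1, x2) / length_scale
    return sigma_f ** 2 * (1.0 + a + a * a / 3.0) * math.exp(-a)


# Covariance between two hypothetical predictor vectors (MDD at three stations).
month_a = [28.0, 30.0, 27.0]
month_b = [25.0, 29.0, 26.0]
print(exponential_kernel(month_a, month_b, sigma_f=1.0, length_scale=10.0))
print(matern52_kernel(month_a, month_b, sigma_f=1.0, length_scale=10.0))
```

Both kernels equal `sigma_f ** 2` at zero distance and decay as the predictor vectors move apart; the exponential kernel decays more sharply near zero, which yields rougher fitted functions than the Matern 5/2 form.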

Table 4 Optimized parameters of EGPR models for six target stations without any lead.


Table 5 Optimized parameters of EGPR models in one year lead time.


The outcome of the study demonstrates the possibility of using MDWD instead of consecutive dry days^32,33,34,35. This approach can be useful for defining dry periods with certain rainfall thresholds. The rainfall threshold used in this study was 2 mm²⁷. This concept can be used for real-time dry day forecasting by reducing computational time, improving water resource management against possible droughts, and reducing the cost of unnecessary field data collection. Hence, the novelty of the study lies in the outcomes obtained with different ML algorithms from the correlation analysis of monthly dry days between different stations and the relationship between monthly dry days and monthly wet days. It demonstrates that ML methods are capable of outperforming current state-of-the-art methods for the prediction of MDD, representing a novel lead-time approach with an established path for forecasting MDD.

Conclusion

MWD and MDWD datasets were prepared based on daily rainfall at all stations in Bangladesh to establish a strong regression with the MDD of the six target stations in Northern Bangladesh. The summary of all approaches identifies EGPR as the best model among EGPR, BT, MGPR, FT, LSVM and LR. In addition, the lead-time approach also gave satisfactory results for forecasting MDD one year ahead.

Uncertainty analysis based on Monte Carlo simulation established the robustness of the developed EGPR model. The sensitivity analysis showed that no particular station was highly sensitive for most of the target stations; Sylhet and Dinajpur were found sensitive for Rangpur and Srimangal, respectively. Hence, sensitivity analysis is less informative for the procedure and models of this study. The combination of all approaches and the findings with the predictors and responses confirmed the novelty of the study. The outcomes of the study are summarized as:

The EGPR algorithm provided a satisfactory model with the highest mean R² of 0.91 and the lowest mean RMSE of 2.14 among all six algorithms.
A very good regression was found between MDD and MWD. Hence, dry days with 0–2 mm rainfall have a strong correlation with 10–25 mm and 26–50 mm of rainfall.
With a one-year lead time, EGPR also performed very well and showed the best response for forecasting MDD.
The robustness of the EGPR model was assessed through Monte Carlo simulation. The model is robust up to a CV of 0.1 when random data are considered in a single station or in two stations.
For most of the target stations, no station is highly sensitive except Sylhet and Dinajpur.

This study provides novel insights into the analysis of monthly dry and wet days in climate research, which may directly or indirectly relate to the actual impacts of droughts. These results could be used in a future study for the definition of a new drought situation with other drought indices based on a strong relationship with monthly dry days. Future studies could seek to establish the relationship between dry events and consecutive dry days compared with different drought indices. More generally, within the broad area of intelligent systems, this study showed that ML algorithms can be applied to establish relationships between dry and wet days.

Methods

Study area and data

Bangladesh is prone to natural disasters and extremely vulnerable to climate change^74,75. Bangladesh extends from 20°34′ N to 26°38′ N and 88°01′ E to 92°41′ E. Except for the hilly southeast, the majority of the country is characterized by low-lying plains situated on deltas of large rivers flowing from the Himalayas. The country is surrounded by the Meghalaya Plateau in the north, the lofty Himalayas lying farther to the north, the Assam Hills in the east, and the Bay of Bengal in the south. Located in a tropical monsoon region, the climate of Bangladesh is characterized by moderately warm temperatures and high humidity with marked seasonal variations in rainfall.

The four recognized seasons are a hot, humid summer from March to May, a wet, warm, and rainy monsoon season from June to September, autumn from October to November, and a dry winter from December to February^76,77,78. January is the coldest month, with an average temperature of 18.1 °C, while May is the hottest month with an average temperature of 28.7 °C.

In the summer, the mean temperature gradient leans towards the northeast (cooler) from the southwest (warmer); in contrast, the winter mean temperature gradient is oriented towards the north (cooler) from the south (warmer). Rainfall in Bangladesh mostly occurs in the monsoon, induced by weak tropical depressions that are brought from the Bay of Bengal into Bangladesh by wet monsoon winds⁷⁷. More than 75% of the rainfall in Bangladesh occurs during the monsoon season. Daily rainfall varies greatly between stations and between seasons. Due to reduced or no rainfall, regional droughts currently affect approximately 2.5 million and 1.2 million ha of agricultural land in a year in the wet and dry seasons, respectively¹⁴. Hence, correlations between stations in terms of varied rainfall magnitudes on a monthly or seasonal scale may exist and could help in dealing with dry periods or droughts, providing better guidance for the agriculture sector.

Figure 8 shows 27 rain gauge stations with rainfall records for more than 30 years (1982–2016) operated by the Bangladesh Meteorological Department (BMD). To predict monthly dry days (MDD), we selected only six target stations (Sylhet, Srimangal, Rangpur, Dinajpur, Bogra, and Mymensingh) located in Northern Bangladesh.

A daily rainfall threshold of 2 mm was used to characterize a dry day, and the sample in Table 6 shows monthly cumulative dry days. MDD was defined as the number of dry days in a month, as elaborated in Table 6. Details of the custom datasets prepared from daily rainfall are listed in Table 7.
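The counting rule can be sketched as follows. The record layout (year, month, rainfall in mm) and the function name are assumptions made for illustration; only the rule that a day with rainfall below 2 mm counts as dry comes from the text.

```python
from collections import defaultdict


def monthly_dry_days(daily_records, threshold_mm=2.0):
    """Count, per (year, month), the days whose rainfall is below the threshold."""
    counts = defaultdict(int)
    for year, month, rainfall_mm in daily_records:
        # Adding 0 still registers the month, so fully wet months appear with a count of 0.
        counts[(year, month)] += 1 if rainfall_mm < threshold_mm else 0
    return dict(counts)


# Hypothetical daily records for one station (not study data).
records = [
    (1982, 1, 0.0),
    (1982, 1, 1.9),
    (1982, 1, 2.0),   # exactly 2 mm is not below the threshold, so not a dry day
    (1982, 2, 5.0),
]
print(monthly_dry_days(records))
```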

Table 6 Calculation of dry days when a day has a rainfall less than 2 mm.


Table 7 Three datasets in this study to predict MDD at six target stations.


Study procedure

After preparing the datasets, the study used the Regression Learner toolbox in MATLAB to simulate the proposed ML models. The study has two perspectives. In the first, the predictor stations were used to predict the MDD of the target stations without any lead time, whereas in the second, the predictor stations were used to predict MDD one year ahead.

The best model was chosen on the basis of the best values of R² and RMSE. Sensitivity and uncertainty analyses were then performed to establish the robustness of the developed model. The detailed procedure of the study is presented in Fig. 9.