Improving multiple model ensemble predictions of daily precipitation and temperature through machine learning techniques

Multi-Model Ensembles (MMEs) are used to improve the performance of GCM simulations. This study evaluates the performance of MMEs of precipitation, maximum temperature and minimum temperature over a tropical river basin in India, developed using techniques such as the arithmetic mean, Multiple Linear Regression (MLR), Support Vector Machine (SVM), Extra Tree Regressor (ETR), Random Forest (RF) and Long Short-Term Memory (LSTM). Twenty-one General Circulation Models (GCMs) from the National Aeronautics and Space Administration (NASA) Earth Exchange Global Daily Downscaled Projections (NEX-GDDP) dataset and 13 GCMs of the Coupled Model Intercomparison Project, Phase 6 (CMIP6) are used for this purpose. The results of the study reveal that an LSTM ensembling model performs significantly better than the other models in the case of precipitation, with a coefficient of determination (R2) value of 0.9. In the case of temperature, all the machine learning (ML) methods showed equally good performance, with RF and LSTM performing consistently well in all cases, with R2 values ranging from 0.82 to 0.93. Hence, based on this study, the RF and LSTM methods are recommended for the creation of MMEs in the basin. In general, all ML approaches performed better than the mean ensemble approach.

(e.g., Ahmed et al.2; Crawford et al.22; Sachindra et al.23; Wang et al.24). However, most of these studies performed monthly, annual or seasonal evaluation; reliable MMEs of daily climate variables are therefore necessary. All the above approaches essentially assume climate data to be stationary and linear. Several ML models have been proposed for climate data downscaling and multi-model ensemble prediction as alternatives that can address the non-linearity in time series data2,23,25-32. The most commonly used models are the Support Vector Machine (SVM), Random Forest (RF) and artificial neural networks (ANNs), which can model the complex, mostly nonlinear relationships in climate data. Although these approaches can address nonlinearity in the data, they have the drawback of assuming that all inputs and outputs are independent of each other, even when dealing with sequential data33. Since climate data has dependency between successive values, it is imperative to consider this dependency. Long Short-Term Memory (LSTM) deep learning models are specially designed to learn the long-term dependencies present in sequential data34. Compared to shallow ANN architectures, deep learning models are more capable of representing highly varying nonlinear functions, such as complex temporal patterns, via high-level temporal abstractions35,36.
The present study aims at the comparison and improvement of MMEs using various ensembling techniques. Special attention is paid to improving the MMEs of daily climate variables, namely precipitation (P), minimum temperature (Tmin) and maximum temperature (Tmax). Furthermore, special emphasis has been given to testing the ability of each ensembling technique in simulating monsoon rainfall. The methodology proposed in the present study is demonstrated on the Netravati basin, a tropical river basin on the southwest coast of India. The present paper is organized as follows: the "Study area" and "Data products" sections introduce the study area and the datasets considered. The "Methodology" section describes the methodology followed for creating ensembles of GCMs using a simple statistical technique (mean), regression models (i.e., SVM and MLR), ensemble learning models (i.e., the extra tree regressor (ETR) and RF), and a deep learning time series model (i.e., multivariate LSTM). The "Results and discussion" section presents the results, while the "Summary and conclusions" section concludes and discusses the scope for future work.

Study area
The Western Ghats of India is one of the global biodiversity hotspots. It is biologically rich and biogeographically unique, with diverse species of plants, mammals, birds and amphibians37. The Netravati, a west-flowing river which drains into the Arabian Sea, is located in the central zone of the Western Ghats. The river basin is situated between 12°30′N and 13°10′N latitude and 74°50′E and 75°50′E longitude, covering an area of about 3415 km² (Fig. 1). The Netravati river basin experiences a humid tropical climate with an average annual rainfall of around 4000 mm. The rainfall over the basin is distributed across three seasons, namely the Pre-Monsoon (March-May), Southwest Monsoon (June-September), and Northeast Monsoon (October-December). The Southwest Monsoon is the major contributor to annual rainfall. The average daily temperature is highest during March to May and lowest during December and January. The average minimum and maximum temperatures of the basin are

Methodology
There are many methods available for ensembling, such as Bayesian approaches and machine learning approaches52,53. Six techniques were used for creating MMEs of P, Tmax and Tmin simulated by the 21 NEX-GDDP and 13 CMIP6 GCMs in the Netravati basin: mean, Multiple Linear Regression (MLR), Support Vector Machine (SVM), Extra Tree Regressor (ETR), Random Forest (RF) and Long Short-Term Memory (LSTM). These cover the major types of existing machine learning ensembling methods and can be classified as a simple statistical technique (mean), regression models (i.e., SVM and MLR), ensemble learning models (i.e., ETR and RF), and a deep learning time series model (i.e., multivariate LSTM). All these methods attempt to improve the GCM simulations with respect to the observation dataset over the historical time period. All the methods except LSTM were implemented for P, Tmin and Tmax using the scikit-learn library in Python54. The LSTM was implemented using Keras, one of the most popular deep learning libraries in Python55. All the calculations were carried out independently for each grid cell, and the results for one representative grid in the basin are shown to simplify the presentation. More about data pre-processing and a brief description of each ensembling method are provided in the following sections.
Data preparation. Each ensembling method was carried out at each grid point, considering P, Tmax and Tmin separately. Bilinear interpolation was performed to bring the GCM values to the corresponding observation grids in the basin. The ensemble mean was calculated by averaging the P, Tmax and Tmin simulated by all GCMs at each grid, respectively. The data was split into training and testing datasets for the validation and comparison of each ensembling method. The input to each ML model was preprocessed using Principal Component Analysis (PCA)56. In this study, the features are the various GCMs in the ensemble. Ahmed et al.2 note that the choice of the number of GCMs used in an MME is a key decision in ensembling; in the present study, PCA was used for this purpose. PCA is part of exploratory data analysis in ML for predictive models57. It makes the model simple and efficient, which in turn reduces the run time of the model. PCA prevents overfitting and converts a group of correlated variables to uncorrelated variables through an orthogonal transformation58. A principal component (PC) is chosen such that it describes most of the available variance59; PCA thus removes the risk of multicollinearity. In this study, the PCs of the 21 GCMs of the NEX-GDDP dataset and the 13 GCMs of the downscaled CMIP6 dataset were calculated separately for each grid. The leading PCs with a cumulative contribution rate greater than 95% were used as input to the ML models.
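As a minimal sketch of this preprocessing step, scikit-learn's PCA can be asked directly for the smallest number of components whose cumulative explained variance reaches 95%. The data below is synthetic (correlated columns standing in for 21 GCM simulations at one grid); it is not the study's dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic stand-in for a daily variable simulated by 21 GCMs at one grid:
# one shared signal plus model-specific noise, one row per day.
signal = rng.normal(size=(1000, 1))
gcms = signal + 0.3 * rng.normal(size=(1000, 21))

# Passing a float in (0, 1) keeps the leading PCs whose cumulative
# explained variance first exceeds that fraction, as done in the study.
pca = PCA(n_components=0.95)
pcs = pca.fit_transform(gcms)

print(pcs.shape)                              # fewer columns than 21 GCMs
print(pca.explained_variance_ratio_.sum())    # at least 0.95
```

The transformed `pcs` array is what would be fed to the downstream ML models in place of the raw, highly correlated GCM columns.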
Machine learning algorithms. MMEs were developed for P, Tmax and Tmin separately at each grid point in the basin using machine learning methods. The observed and simulated values of P, Tmax and Tmin were divided into a calibration period and a validation period. The first 45 years (1951-1995) of overlapping observed and simulated data were used for calibrating the MMEs; the rest of the data were used for validating them. More about the methods adopted in the study is given in the following sections.
Multiple linear regression (MLR). MLR is a common form of regression analysis. It attempts to explain the relationship between one dependent variable and two or more independent variables by fitting a linear equation60. It has been widely used in climate studies for downscaling and impact analysis27,61. In general, MLR can be mathematically written as:

(1) y = β0 + β1x1 + ··· + βnxn + ε

where y is the dependent variable, xi are the independent variables, βi are parameters, and ε is the error. In this study, ordinary linear least squares (LLS) regression, which minimizes the residual sum of squares between the observed values and the ensemble values, was used. This was implemented using the 'sklearn.linear_model' module in Python54.

Support vector regression (SVR). The SVR estimate takes the form

(2) f(x) = Σi (αi − αi*) Kernel(xi, x) + b

where Kernel(xi, x) represents the kernel function used; αi and αi* denote the Lagrange multipliers; xi denote the support vectors; x represents the independent vector; and b represents the bias parameter. SVR uses a symmetrical loss function, which equally penalizes high and low misestimates. Using Vapnik's ε-insensitive approach, a flexible tube of minimal radius is formed symmetrically around the estimated function, such that errors with absolute values less than a certain threshold ε are ignored both above and below the estimate. In this manner, points outside the tube are penalized, but those within the tube, either above or below the function, receive no penalty. One of the main advantages of SVR is that its computational complexity does not depend on the dimensionality of the input space. Additionally, it has excellent generalization capability and high prediction accuracy64.
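The LLS regression step of the MLR ensembling described above can be sketched as follows. The predictors and target here are synthetic (standing in for the retained ensemble PCs and the observed series at one grid); the coefficient vector is a hypothetical choice used only to generate data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical predictors: retained PCs of the GCM ensemble at one grid.
pcs_train = rng.normal(size=(500, 5))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 1.5])   # illustrative only
obs_train = pcs_train @ beta_true + 0.1 * rng.normal(size=500)

# Ordinary least squares: minimises the residual sum of squares between
# the observed values and the ensemble prediction, as in Eq. (1).
mlr = LinearRegression().fit(pcs_train, obs_train)

print(np.round(mlr.coef_, 2))                  # close to beta_true
print(round(mlr.score(pcs_train, obs_train), 3))  # R^2 near 1 on this data
```

With low noise the fitted coefficients recover the generating ones almost exactly, which is what makes this a useful quick sanity check for the pipeline.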
MMEs which used the polynomial kernel function performed better than MMEs that used other kernel functions. Hence, the polynomial kernel function was used in this study, similar to Sachindra et al.23 and Ahmed et al.2. The choice of hyperparameters plays an important role in machine learning methods. In the current study, Bayesian hyperparameter optimization (BHO) was used to determine the hyperparameters of all the machine learning algorithms; the "hyperopt" package in Python was used to implement BHO65. The important hyperparameters optimized in SVR are C, the kernel function and epsilon.
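A minimal sketch of polynomial-kernel SVR on synthetic data is shown below. The hyperparameter values (C, epsilon, degree, coef0) are illustrative placeholders, not the BHO-optimised values used in the study, and the data is hypothetical.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Hypothetical predictors (ensemble PCs) and observed target at one grid,
# with a mildly nonlinear relationship.
X = rng.normal(size=(400, 4))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=400)

# Polynomial-kernel SVR; coef0=1.0 lets the degree-2 kernel also capture
# lower-order (linear) terms. All values here are illustrative.
svr = SVR(kernel="poly", degree=2, coef0=1.0, C=10.0, epsilon=0.1)
svr.fit(X, y)
pred = svr.predict(X)

print(pred.shape)  # one prediction per training day
```

In the study itself, C, the kernel and epsilon would instead be chosen by BHO (e.g. via hyperopt's tree-structured search) against the calibration period.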
Random forest and extra tree regressor. The RF and ETR models are ensemble machine learning techniques. RF was proposed by Breiman66 based on a combination of statistical learning theory and classification or regression methods. The multiple classification and regression trees (CART) included in the algorithm prevent over-fitting and accommodate different types of input variables. The algorithm generates many independent trees and produces a decision based on the characteristics of nonparametric statistical regression and randomness26. A decision tree comprises a root node, sub-nodes and leaf nodes; a leaf node corresponds to a judgement result, while a sub-node contains a judgement rule. The average of the predicted values from all trees is the result of the algorithm. RF is internally cross-validated using the out-of-bag (OOB) score25. ETR is a variation of RF that adds a further level of randomness to the splitting of the trees67. It is an extension of RF with two major differences: (1) ETR does not apply bootstrapping; each tree is trained on the whole training dataset; (2) ETR selects a random cut point instead of a locally optimum cut point. The split which gives the highest score is selected from the set of randomly generated splits. That is, k decision trees are generated and m features are selected for each training sample; in each decision tree a random cut point is chosen. This helps to avoid overfitting to some extent. More about ETR can be found in Xu et al.25.
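The two tree ensembles above, including RF's internal OOB cross-validation, can be sketched with scikit-learn on synthetic data (the predictors and target below are hypothetical stand-ins, not the study's dataset):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

rng = np.random.default_rng(2)

# Hypothetical ensemble PCs (predictors) and observations (target).
X = rng.normal(size=(600, 5))
y = X @ np.array([1.0, 0.5, -0.5, 2.0, 0.0]) + 0.2 * rng.normal(size=600)

# RF bootstraps the training data, so the samples left out of each tree's
# bootstrap give an internal out-of-bag (OOB) estimate of skill.
rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                           random_state=0).fit(X, y)

# ETR trains every tree on the full dataset and splits at random cut
# points instead of locally optimal ones.
etr = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, y)

print(round(rf.oob_score_, 2))   # OOB R^2: internal cross-validation
print(etr.predict(X[:3]).shape)  # (3,)
```

The averaging of all trees' predictions described in the text is what `predict` performs internally for both estimators.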
Long short-term memory (LSTM) deep learning models. Climate data is time series data involving a sequence of observations over regularly spaced intervals, with trend (upward, downward or absent), seasonality (periodic fluctuation within a certain period), cyclic variations (rises and falls) and irregular or random components68,69. The meteorological predictions of GCMs can be seen as multivariate sequential data. Hence an LSTM model, which belongs to the family of deep recurrent neural networks, can be used for creating multi-model ensembles of climate data. The current prediction of an LSTM is influenced by the network activations from previous time steps; this connection develops a memory of previous events in the LSTM network. The architecture of an LSTM cell is given in Fig. 2, where ft, it and ot are the forget, input and output gates, respectively; Xt, St and Ct are the input, hidden state and cell state at time step t, respectively; St−1 and Ct−1 are the hidden and cell states at time step t − 1, respectively; and ⊗, ⊕ and σ are pointwise multiplication, pointwise addition and the sigmoid activation, respectively.
The network has three inputs: Xt, the input at the current time step; St−1, the output of the previous LSTM unit; and Ct−1, the memory of the previous unit. As for outputs, St is the output of the current network and Ct is the memory of the current unit. The LSTM model has learnable input (it), output (ot) and forget (ft) gates that modulate the flow of information, and it maintains an explicit hidden state that is recursively carried forward and updated as each element of the sequential data passes through the network. The input gate decides what information to add from the present input to the cell state; the forget gate decides what must be removed from the St−1 state, thus keeping only relevant information; and the output gate decides what information to output from the current cell state. More information on LSTM can be found in Bouktif et al.69 and Sagheer and Kotb70. In this study, the LSTM was optimised for learning rate, batch size, units, layers and window size.
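The gate computations described above can be made concrete with a single LSTM cell step written in plain NumPy. This is a pedagogical sketch of the standard cell equations, with random untrained weights; the study itself used Keras, whose LSTM layer implements these same equations internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, s_prev, c_prev, W, b):
    """One step of a standard LSTM cell. x_t is the current input,
    s_prev and c_prev are the previous hidden and cell states."""
    z = np.concatenate([s_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])    # forget gate: what to drop from c_prev
    i_t = sigmoid(W["i"] @ z + b["i"])    # input gate: what to add from x_t
    c_hat = np.tanh(W["c"] @ z + b["c"])  # candidate cell update
    o_t = sigmoid(W["o"] @ z + b["o"])    # output gate: what to emit
    c_t = f_t * c_prev + i_t * c_hat      # new cell state (memory)
    s_t = o_t * np.tanh(c_t)              # new hidden state (output)
    return s_t, c_t

rng = np.random.default_rng(3)
n_in, n_hid = 4, 8   # e.g. 4 ensemble PCs as input, 8 hidden units
W = {k: rng.normal(scale=0.1, size=(n_hid, n_hid + n_in)) for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}

# Run a short sequence through the cell, carrying both states forward,
# which is how the "memory of previous events" is maintained.
s, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):
    s, c = lstm_cell(rng.normal(size=n_in), s, c, W, b)

print(s.shape)  # (8,)
```

Because the hidden state is the product of a sigmoid gate and a tanh, each of its components is always bounded in magnitude by 1, regardless of the weights.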
Performance evaluation. The observed and simulated values of P, Tmax and Tmin used for developing the MMEs were divided into calibration and validation datasets. The first 45 years (1951-1995) of overlapping observed and simulated data were used for calibrating the machine learning models; the rest of the data were used for validating the MMEs developed using each method. Performance evaluation on the validation data at the daily scale was done in terms of the Root-Mean-Square Error (RMSE), also called the Root-Mean-Square Deviation (RMSD), and the correlation coefficient (R). These performance indicators are widely used by many researchers71-73. Further, the daily data were aggregated to monthly data for performance evaluation. Scatter plots and Taylor diagrams were used for the evaluation of performance at the monthly scale. The scatter plots, along with the coefficient of determination (R²), provide a useful comparison of observed and MME values; the Taylor diagram summarises the performance of each MME in terms of RMSD, R and standard deviation (SD). The procedure was repeated explicitly for the MMEs of precipitation in the monsoon season, to study their ability in simulating rainfall magnitudes.
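The two daily-scale metrics used above have straightforward definitions; a short sketch (with small hypothetical observed/simulated series, not the study's data) is:

```python
import numpy as np

def rmse(obs, sim):
    """Root-mean-square error between observed and MME-simulated series."""
    obs, sim = np.asarray(obs), np.asarray(sim)
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def corr(obs, sim):
    """Pearson correlation coefficient R."""
    return float(np.corrcoef(obs, sim)[0, 1])

obs = np.array([2.0, 5.0, 9.0, 4.0, 7.0])   # hypothetical daily values
sim = np.array([2.5, 4.0, 8.5, 5.0, 6.5])

print(round(rmse(obs, sim), 3))  # 0.742
print(round(corr(obs, sim), 3))  # 0.958
```

A perfect MME would give RMSE = 0 and R = 1; lower RMSE and higher R therefore indicate a better ensemble, which is the sense in which the LSTM improvements reported later are read.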

Results and discussion
The performance evaluation of each ensembling method for simulating P, Tmin and Tmax was done grid-wise at daily and monthly scales for the NEX-GDDP and CMIP6 datasets separately. The performance evaluation at the daily scale was done using R and RMSE; the results of this evaluation for the validation period are given in Table 3. Further, scatter plots and Taylor diagrams were used to evaluate the performance at the monthly scale. The performance of each ML method was more or less the same at each grid; hence, the results obtained for one representative grid in the basin are shown and discussed to simplify the presentation.

Performance evaluation of MMEs in the case of precipitation. Performance evaluation of MMEs for daily rainfall. The results of the performance evaluation on daily precipitation given in Table 3 indicate that the ML approaches improve the performance of the MMEs when compared with the mean ensemble approach. However, the improvements are not very significant for any ML method except LSTM. The MME developed using LSTM for the NEX-GDDP dataset significantly improved the R value from 0.52 to 0.74 when compared to the mean ensemble technique. Similarly, a reduction in RMSE from 19.03 to 14.59 was achieved by using LSTM for ensembling. Thus, the MMEs made using LSTM perform significantly better for both the NEX-GDDP and CMIP6 datasets. The same is observed in the scatterplots of monthly precipitation given in Figs. 3 and 4 for the NEX-GDDP and CMIP6 datasets, respectively. The R² value increased from 0.82 to 0.94 and from 0.78 to 0.92 for the LSTM ensemble when compared to the mean ensemble for the NEX-GDDP and CMIP6 datasets, respectively. Figures 5 and 6 show the Taylor diagrams of observed and MME-simulated monthly precipitation for the NEX-GDDP and CMIP6 datasets, respectively, for the validation period. These figures demonstrate that the MME developed using the LSTM method matches the observed data better than the MMEs developed using the other methods.
Performance evaluation of MMEs for the monsoon season. The results of the performance evaluation on daily precipitation in the monsoon months (June to September) are given in Table 4. These results indicate that the ML approaches MLR, SVM, ETR and RF show only slight and insignificant improvements in the performance of the MMEs when compared with the mean ensemble approach for daily precipitation in the monsoon months of the NEX-GDDP and CMIP6 datasets. However, the MME made using LSTM shows significant improvement in the performance for daily monsoon rainfall in terms of R and RMSE. The MME developed using LSTM for the NEX-GDDP dataset improved the R value from 0.038 to 0.386 when compared to the mean ensemble technique. Similarly, a reduction in RMSE from 31.49 to 23.35 was achieved by using the LSTM model. Similar improvements in R (0.031 to 0.357) and RMSE (29.26 to 23.33) were seen for the CMIP6 dataset. Thus, the MMEs of monsoon precipitation made using LSTM perform significantly better for both the NEX-GDDP and CMIP6 datasets. The same is observed in the scatterplots of monthly monsoon precipitation given in Figs. 7 and 8 for the NEX-GDDP and CMIP6 datasets, respectively. The R² value increased from 0.506 to 0.81 and from 0.366 to 0.788 for the LSTM ensemble when compared to the mean ensemble for the NEX-GDDP and CMIP6 datasets, respectively. Figures 9 and 10 show the Taylor diagrams of observed and MME-simulated monthly monsoon precipitation for the NEX-GDDP and CMIP6 datasets, respectively, for the validation period. These figures demonstrate that the MME of monsoon precipitation developed using the LSTM method matches the observed data better than the MMEs developed using the other methods.
Figure 3. Scatter plot of observed and MME simulated monthly precipitation for the NEX-GDDP dataset.
Performance evaluation of MMEs in the case of maximum temperature. The scatter plot (Fig. 13) and Taylor diagram (Fig. 14) show the better performance of all ML methods when compared to the mean ensemble approach. In the case of temperature, all ML approaches performed equally well compared to the mean ensembling approach; all ML methods improved the R value from about 0.5 to around 0.8. For the maximum temperature of the NEX-GDDP dataset, the MME made with RF (R = 0.872) slightly outperformed LSTM (R = 0.868). In all the other cases, all ML methods performed equally well, with LSTM showing slightly better performance. The same pattern was observed at all the grid points in the basin. Ensemble learning models like RF and ETR also performed well in the basin for maximum and minimum temperature, outperforming MLR and SVM in all cases. Hence, the LSTM, RF and ETR algorithms are recommended for creating MMEs in the basin. In general, all ML methods performed better than the mean ensemble approach, as also seen in other studies such as Ahmed et al.2.

Summary and conclusions
In this study, an attempt has been made to evaluate the performance of MMEs developed using six ensembling methods: a simple statistical technique (mean), regression models (i.e., SVM and MLR), ensemble learning models (i.e., ETR and RF), and a deep learning time series model (i.e., LSTM). The performance evaluation of each ensembling technique was carried out to find the best-performing MMEs of the 21 NEX-GDDP and 13 CMIP6 GCMs in the Netravati basin. The comparison shows that the application of an LSTM model for climate model ensemble prediction performs significantly better than the benchmark models, including the other machine learning techniques and the mean ensembling technique, in the case of precipitation: it gave coefficients of determination of 0.94 and 0.92 for the NEX-GDDP and CMIP6 monthly precipitation datasets, respectively. The LSTM MME could also simulate monsoon rainfall magnitudes satisfactorily compared to all the other methods. Hence, LSTM deep learning models are seen to be an attractive approach for climate data prediction. This could be because of their capability to learn long-term dependencies in the observed data, which leads to better prediction results that outperform several alternative machine learning and statistical approaches.
In the case of temperature, all the ML methods showed equally good performance, with RF and LSTM performing consistently well in all cases. Coefficients of determination in the range of 0.9 and 0.8 are observed for the MMEs developed using the RF and LSTM techniques for monthly average maximum and minimum temperature, respectively. Hence, based on this study, RF and LSTM are recommended for the creation of MMEs in the basin. In general, all ML approaches performed better than the mean ensemble approach. However, this study limits its scope to machine learning methods and does not analyse their performance on extreme values; a future study analysing their effectiveness on extreme values may be undertaken. Further, other multi-model
Figure 7. Scatter plot of observed and MME simulated monthly monsoon precipitation for the NEX-GDDP dataset.