Deep learning for bias correction of MJO prediction

Producing accurate weather predictions beyond two weeks is an urgent challenge due to their ever-increasing socioeconomic value. The Madden-Julian Oscillation (MJO), a planetary-scale tropical convective system, serves as a primary source of global subseasonal (i.e., targeting three to four weeks) predictability. Operational forecasting systems have improved substantially over the past decades, yet MJO prediction skill has not reached its potential predictability, partly due to the systematic errors caused by imperfect numerical models. Here, to improve the MJO prediction skill, we blend state-of-the-art dynamical forecasts and observations with a Deep Learning bias correction method. With Deep Learning bias correction, multi-model forecast errors in MJO amplitude and phase averaged over four weeks are significantly reduced by about 90% and 77%, respectively. Most models show the greatest improvement for MJO events starting from the Indian Ocean and crossing the Maritime Continent.

Manuscript title: "Deep Learning for bias correction of MJO prediction"
Authors: H. Kim, Y. G. Ham, Y. S. Joo, S. W. Son

* We wish to express our gratitude to the reviewers for the time and effort spent in reviewing our manuscript, and we feel the updated version is much stronger. The major changes include:
- Adding confidence intervals to Figure 3, Figure 4, and Extended Data Figs. 3 and 4
- Adding new verification and predictability analyses (BCOR and signal/noise; new Extended Data Figs. 7 and 8)

Reviewer #1
The authors blend dynamical model forecasts with a Deep Learning model to improve the prediction skill of the Madden-Julian Oscillation (MJO). The MJO serves as a primary source of subseasonal predictability for the global climate system. With Deep Learning post-processing, the authors show that the multi-model forecast errors in MJO amplitude and phase averaged over four weeks are reduced by about 90% and 77%, respectively. Their deep learning model even improves the MJO propagation through the Maritime Continent, which has been a long-standing challenge for weather and climate models to get right.
The paper is well organized and clear. The results are significant and represent a step change in the weather prediction field by combining machine learning with weather prediction models. Once the minor revisions below are addressed, I feel the manuscript would be a welcome addition to the MJO prediction community and suitable for publication in Nature Communications.

General comment:
-There are some typos that need to be corrected before the manuscript can be ready for publication. There is good hindcast verification in the manuscript and the authors quantify improved skill in a valuable manner, yet the conclusions and discussion are a bit vague and not based on quantified metrics about why two variables would coevolve and how they can be copredicted with improved skill. Improving the discussion of the paper and eliminating the simple typos will make it a much stronger paper.
 Thanks for your constructive comments. We did our best to improve the discussion and correct typos throughout the entire manuscript.

Comments:
(1) The following sentence can be split into two for easier reading: "Model deficiencies in simulating realistic MJO events are partially due to our poor understanding of the underlying physics, thus more efforts on process-level diagnostics are suggested to further improve MJO simulation and prediction"
 Split into two sentences as suggested.
(2) Change the preposition "at" to "on" in this sentence: "A large discrepancy between S2S reforecasts and observations is clearly shown on day 1."
 Changed as suggested throughout the entire manuscript.
(3) Could the authors present a hypothesis of why the MJO propagation across the Maritime Continent is improved with the DL model? They only mention "With DL correction, the MJO anomalies become close to the observations beyond two weeks by realistically forecasting both amplitude and phase" with no explanation or possible hypothesis. It would be good to present a hypothesis here that could be used for model development if possible.
 We think the main reason for the improvement is that the DL-correction amplifies the strongly damped MJO signal in the original forecasts (Fig. 4). The S2S/SubX models tend to underestimate the amplitude of the MJO, especially over the Maritime Continent (Kim et al. 2019), and this systematic error is likely to be successfully captured, and therefore corrected, by the DL model. Since this is a statistical correction of the RMMs, it is hard to link the improved propagation to any physics-based hypothesis, such as moisture advection processes. We make our point clear and add the implications for model development as: (Page 4) "With DL-correction, however, the MJO anomalies become close to the observations beyond two weeks by realistically forecasting both amplitude and phase of the MJO (Fig. 4c). The improved MJO eastward propagation is mostly due to the amplification of the strongly damped MJO signal shown in most of the S2S models (Extended Data Fig. 2)." (Page 4) "However, although the Deep Learning approach can assist in correcting model biases, continuous effort on developing dynamical forecast systems to minimize the inherent errors is the key for MJO forecasts to reach their potential predictability. Note that the improved MJO prediction with DL-correction does not guarantee an improved prediction of MJO-related phenomena such as tropical cyclones, monsoons, and midlatitude teleconnections, because they rely on both the MJO and the background state within the model. Therefore, further improvement of dynamical models and initialization is fundamental to ultimately improve S2S predictions."
 Changed to a positive-only color bar as suggested. Thanks for pointing this out.
(5) For the LSTM training, will the LOOCV methodology bias the LSTM models to foresee long-term variability in the RMM modes (multiyear/decadal variability), which could help the predictions for the year left out?
 To examine whether the RMM indices have multiyear/decadal variability, Figure A shows the power spectrum of the observed daily RMM1 index from 1979 to 2016. As expected, RMM1 has significant power on the intraseasonal period (30-80 days), but not on periods longer than the intraseasonal window. Therefore, we can conclude that the forecasts via LSTM models are not influenced by long-term MJO variability. Also shown are the Markov red-noise spectrum (black curve) and its bounds at the 5% and 95% confidence levels (black dashed curves). The two vertical red lines represent the 30-80-day range.
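For reference, the red-noise significance test behind Figure A can be sketched as follows. This is a generic Markov (AR1) background spectrum with chi-square bounds, not the authors' code, and the `rmm1` array below is a synthetic stand-in.

```python
import numpy as np
from scipy import signal, stats

# Stand-in for the observed daily RMM1 series (1979-2016 is ~13,880 days).
rmm1 = np.random.default_rng(0).normal(size=13880)

def rednoise_spectrum_test(x):
    """Periodogram of a daily index against a Markov (AR1) red-noise background."""
    x = x - x.mean()
    freq, pxx = signal.periodogram(x, fs=1.0)        # frequency in cycles per day
    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]            # lag-1 autocorrelation
    red = (1 - r1**2) / (1 - 2*r1*np.cos(2*np.pi*freq) + r1**2)
    red *= pxx.mean() / red.mean()                   # scale to the series variance
    dof = 2                                          # ~2 degrees of freedom per estimate
    upper = red * stats.chi2.ppf(0.95, dof) / dof    # 95% bound
    lower = red * stats.chi2.ppf(0.05, dof) / dof    # 5% bound
    return freq, pxx, red, lower, upper

# The intraseasonal band (30-80-day periods) is 1/80 to 1/30 cycles per day.
freq, pxx, red, lo, hi = rednoise_spectrum_test(rmm1)
```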
As an alternative way to examine the year-to-year variation and stability of the LSTM model, we compare the prediction results of each target year. Specifically, we target forecasting MJO events (phases 2 and 3) selected from random years from 1997 to 2016, allowing duplication of years. A total of 25 MJO events are randomly selected from the 20 years. Then, we use the LSTM model built in our study for each year (LSTM(yr)) to predict the 25 events. If the forecast results from each LSTM(yr) are consistent for the randomly selected 25 MJO events, it indicates that the model does not differ much between years and is stable.
- Figure B shows the RMM1 index on forecast day 21 (the result is consistent at other lead days). Although some variation exists, the RMM1 forecasts of the 25 MJO cases are consistent across the 20 years.
- Figure C shows the 4-week forecast of the RMM1 index for a randomly selected case among the 25 (an MJO case from 2016). The 20 LSTM(yr) models show consistent forecasts of this specific case.
A discussion is now added in the method section as: (Page 13) "Note that, for given input data sets that were randomly selected, the LOOCV produces very similar results for every target year, indicating that the LSTM model is stable."
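A minimal sketch of this consistency check is given below; `events` and `lstm_by_year` are hypothetical stand-ins for the 25 sampled events and the per-year trained models, not the actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(1997, 2017)

# Stand-ins: 25 MJO events (phases 2-3, drawn from 1997-2016 with possible
# duplication) represented by their model inputs, and one correction model
# per target year. In the real setup, lstm_by_year[yr] is the trained LSTM.
events = rng.normal(size=(25, 7, 2))                              # hypothetical inputs
lstm_by_year = {yr: (lambda e: float(e.mean())) for yr in years}  # placeholder

# Day-21 RMM1 forecast of every event by every LSTM(yr).
day21 = np.array([[lstm_by_year[yr](e) for e in events] for yr in years])

# A small spread across rows (the 20 LSTM(yr) models) for each column (event)
# indicates that the leave-one-year-out models are mutually consistent.
print(day21.std(axis=0).max())
```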

Reviewer #2
Skillful subseasonal prediction has a direct benefit to society and the economy, while it remains very challenging. The Madden-Julian Oscillation (MJO) acts as one of the major predictability sources and has received great attention from the community. The MJO prediction skill in the state-of-the-art dynamic models differs dramatically: some models have skill of about two weeks, while the ECMWF model has outstanding skill beyond 30 days. Improving the MJO prediction skill is one of the major tasks for the S2S prediction community. The Deep Learning (DL) post-processing leads to improved amplitude and phase predictions of the MJO compared to the raw model results. I think this is an interesting and timely study that provides a new path for S2S prediction. I would recommend this paper be considered for publication if the authors can address my comments listed below.
 Thanks for your constructive comments. We tried our best to address your comments.
(1) As a predictor, the MJO is emphasized because it provides predictability sources for subseasonal predictions. The DL technique may lead to an 'improved' MJO prediction through the post-processing, while it may not benefit the real prediction of other variables and phenomena, such as temperature, precipitation, TC, atmospheric rivers, and so on. The modulation of MJO on other phenomena relies on the 'actual' amplitude/phase of MJO and the background mean climate within the dynamic models. The 'improved' MJO prediction, in particular through the post-process, does not guarantee an improved prediction of its impacts. It also implies that the further improvement of dynamic models and initialization tends to be fundamental to improve the S2S predictions.
From this point of view, this work is one example to show the potential application of the proposed methodology on subseasonal prediction. However, I think this work itself is not significant enough to warrant its publication in Nature Communications unless this can be demonstrated to some extent. One way is to build a hybrid dynamic-statistical model (using the MJO in model prediction and the observational relationship between MJO and other climate variables, such as temperature, precipitation, TC, circulation) to truly prove the value of the DL on S2S prediction (one example is fine).
 We appreciate the reviewer's valuable points. First of all, we totally agree that the improved MJO prediction does not guarantee an improved prediction of its impacts, so we reflect this point in the revision as: (Page 4) "Note that the improved MJO prediction with DL-correction does not guarantee an improved prediction of MJO-related phenomena such as tropical cyclones, monsoons, and midlatitude teleconnections, because they rely on both the MJO and the background state within the model. Therefore, further improvement of dynamical models and initialization is fundamental to ultimately improve S2S predictions." Forecasting the MJO itself is far from perfect, with even the best model remaining 2-3 weeks below the potential predictability. Therefore, we would first like to demonstrate the usefulness of deep learning for the MJO alone, since MJO prediction by itself presents considerable systematic biases.
While we totally agree with the reviewer, ultimately we felt a look at MJO impacts is outside the scope of this study and is more suited for a stand-alone paper. As suggested, our next plan is to apply the DL-correction to the prediction of MJO-related Atmospheric Rivers.

(2) I am surprised that 'day-1' MJO shows a relatively large bias from Fig. 1. Compared to the first author's previous work using the ECMWF and NCEP models (Fig. 9, Kim et al. 2014), the bias shown here seems to be too big. How to understand this? Please clarify.

 The weaker amplitude of the RMM indices is evident in both forecasts, which indicates that the overall bias shown in the current study is not systematically larger than that in Kim et al. (2014). The difference could be due to different reforecast periods (1993-2009 vs. 1997-2016), model versions (Cy32 vs. Cy43), etc., while it is not easy to disentangle the exact reason.

 Thanks for your suggestion. We add the multi-model averaged bivariate correlation coefficient (BCOR) of MJO predictions before and after DL-correction (new Extended Data Fig. 7 and methodology) and add a discussion as: (Page 3) "To assess prediction skill and predictability of the MJO, two additional verification metrics are applied. It turns out that the multi-model mean BCOR of the DL-correction is consistently higher than that of the S2S models up to four weeks (Extended Data Fig. 7). The increased signal and reduced noise after the DL-correction, resulting in a higher MJO predictability than the original ECMWF-Cy43r3 reforecasts, is also observed (Extended Data Fig. 8)." Extended Data Fig. 7: Correlation coefficients in multi-model mean S2S reforecasts and DL-corrections. Multi-model mean BCOR of S2S reforecasts (blue) and DL-corrections (red). The gray horizontal line is BCOR of 0.6.
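For concreteness, the two bivariate verification metrics can be computed as below. These are the standard definitions from the MJO verification literature (e.g., Lin et al. 2008); the manuscript's exact averaging (composites vs. individual events) may differ.

```python
import numpy as np

def bivariate_skill(o1, o2, f1, f2):
    """BCOR and BMSE at one lead time.

    o1, o2: observed RMM1/RMM2 over all verification times at this lead.
    f1, f2: the corresponding forecasts.
    """
    bcor = np.sum(o1*f1 + o2*f2) / (
        np.sqrt(np.sum(o1**2 + o2**2)) * np.sqrt(np.sum(f1**2 + f2**2)))
    bmse = np.sqrt(np.mean((o1 - f1)**2 + (o2 - f2)**2))   # bivariate RMSE
    return bcor, bmse
```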
(4) I suggest adding some discussion about the implications of this study for model development to improve MJO simulation and prediction. Also, it would be better to provide some insight into why the individual models using DL show similar skills, no matter how good or bad the raw model is.
(5) Does the DL post-processing change the potential predictability? It will be interesting to check this.
 Thank you for the suggestion. The figure below shows the signal and noise from the raw ECMWF hindcast and the DL-correction (new Extended Data Fig. 8). Clearly, the signal is larger and the noise is smaller after DL-correction, making the potential predictability higher. The methodology and discussion are now added as: (method): "The MJO potential predictability is assessed via the signal and noise^25, defined as

$$\mathrm{Signal}(\tau)=\frac{1}{N}\sum_{i=1}^{N}\left[\overline{x_{i}}(\tau)\right]^{2},\qquad \mathrm{Noise}(\tau)=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{E}\sum_{j=1}^{E}\left[x_{i,j}'(\tau)\right]^{2},$$

where the overbar denotes the ensemble mean, the prime denotes individual ensemble members' deviations from the ensemble mean, N is the number of forecast cases, and E is the ensemble size. In this formulation, the signal refers to the variability of the ensemble mean while the noise refers to the variability of individual forecasts around the ensemble mean (i.e., the forecast spread), and both quantities depend on the forecast lead time (τ). The ECMWF-Cy43r3 is used to estimate the signal and noise due to its relatively large ensemble size and high MJO skill." (Page 3): "To assess prediction skill and predictability of the MJO, two additional verification metrics are applied. It turns out that the multi-model mean BCOR of the DL-correction is consistently higher than that of the S2S models up to four weeks (Extended Data Fig. 7). The increased signal and reduced noise after the DL-correction, resulting in a higher MJO predictability than the original ECMWF-Cy43r3 reforecasts, is also observed (Extended Data Fig. 8)." Extended Data Fig. 8: Forecast signal and noise. Signal (solid) and noise (dashed) as a function of forecast lead days for ECMWF-Cy43r3 (blue) and DL-correction (red).
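A sketch of how these quantities can be estimated from an ensemble reforecast array, following the definitions quoted above; the array layout is an assumption.

```python
import numpy as np

def signal_noise(fcst):
    """Signal and noise at one forecast lead time.

    fcst: array (n_cases, n_members, 2) of ensemble (RMM1, RMM2) forecasts.
    """
    ens_mean = fcst.mean(axis=1)                       # overbar: ensemble mean
    dev = fcst - ens_mean[:, None, :]                  # prime: member deviations
    signal = np.mean(np.sum(ens_mean**2, axis=-1))     # variability of ensemble mean
    noise = np.mean(np.sum(dev**2, axis=-1))           # spread about ensemble mean
    return signal, noise
```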
(6) 'The LSTM model is built at every target year, forecast lead time, MJO phase, and each model individually due to their unique systematic biases'. Does the LSTM differ much for different years?
 To examine the stability of the LSTM model, we compare the prediction results of each target year. In this case, the input variables are set to be equal in all years. First, we target forecasting MJO events (phases 2 and 3) selected from random years from 1997 to 2016, allowing duplication of years. A total of 25 MJO events are randomly selected from the 20 years. Then, we use the LSTM model built in our study for each year (LSTM(yr)) to predict the 25 events. If the forecast results from each LSTM(yr) are consistent for the randomly selected 25 MJO events, it indicates that the model does not differ much between years and is stable.
- Figure E shows the RMM1 index on forecast day 21 (the result is consistent at other lead days). Although some variation exists, the RMM1 forecasts of the 25 MJO cases are consistent across the 20 years.
- Figure F shows the 4-week forecast of the RMM1 index for a randomly selected case among the 25 (an MJO case from 2016). The 20 LSTM(yr) models show consistent forecasts of this specific case.
A discussion is now added in the method section as: (Page 13) "Note that, for given input data sets that were randomly selected, the LOOCV produces very similar results for every target year, indicating that the LSTM model is stable."

(7) The leave-one-year-out cross-validation is shown, while it will be useful to show the 'real' prediction. For example, the data during the first 10 years are used as the training period to build the LSTM model, and the remaining independent 7/10 years are examined to show its application to MJO prediction.
 Thanks for the suggestion. To see whether the DL-correction is useful for real prediction, we perform two experiments with different training/testing sets from ECMWF-Cy43r3. In experiment (A), the 10y-train forecast, data during the first 10 years (1997-2006) are used as the training period to build the LSTM, and the remaining independent 10 years (2007-2016) are evaluated; the results are shown in Figure G. These methods all result in reduced forecast errors compared to the raw ECMWF forecasts, while the level of reduction differs by training method, mainly due to the number of samples. Note that the results for BMSEp are similar as well (not shown). To reflect the reviewer's comments in the revision, we add a discussion as: (Method, page 13) "We also perform the DL-correction in a real-forecast manner. The ECMWF-Cy43r3 reforecasts during the first 10 years (1997-2006) are used as the training period to build the LSTM model, and the remaining independent 10 years (2007-2016) are evaluated. In this real-forecast procedure, biases are still significantly reduced compared to the raw S2S reforecasts (not shown), but are larger than those from the LOOCV approach due to the limited training sample size."

(2) There is little or no mention of other methods of bias correction. Modeling centers are typically aware of biases and drifts in their models and have methods for correcting/accounting for them. This work needs to be put in that context. Specifically, how does it compare with other methods?

 We contacted researchers at two operational centers that provide real-time MJO forecasts, ECMWF and NOAA CPC, to check whether they apply any bias correction. ECMWF provides a mean-bias-corrected forecast by removing the model climatology (via personal communication with Frederic Vitart, ECMWF). In the NOAA CPC forecast, NCEP EMC GEFS provides a version of a Bias-Corrected Ensemble Forecast (named "NCEP EMC GEFSBC"). EMC's bias correction uses a hybrid method, combining the bias from the latest prior forecasts for short lead times with the model climatology for extended-range and longer lead times (via personal communication with Yuejian Zhu, Qin Ginger Zhang, Kyle MacRitchie, and Dan Collins, NOAA). Therefore, as far as we know, no bias correction is applied to operational MJO forecasts besides removing the model climatological bias, which is already performed in our current S2S reforecasts.
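Returning to the real-forecast experiment above, a minimal sketch of the chronological 10-year train/test split is given below; the LSTM architecture, shapes, and synthetic data are illustrative placeholders, not the configuration used in the study.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 7-day input sequences of raw-model (RMM1, RMM2)
# forecasts with the verifying observed RMMs as targets.
rng = np.random.default_rng(0)
years = rng.integers(1997, 2017, size=2000)
X = rng.normal(size=(2000, 7, 2)).astype("float32")
y = rng.normal(size=(2000, 2)).astype("float32")

train, test = years <= 2006, years >= 2007      # 1997-2006 train, 2007-2016 test

model = tf.keras.Sequential([
    tf.keras.Input(shape=(7, 2)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(2),                   # corrected (RMM1, RMM2)
])
model.compile(optimizer="adam", loss="mse")
model.fit(X[train], y[train], epochs=10, batch_size=32, verbose=0)
corrected = model.predict(X[test])              # evaluated only on held-out years
```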
(3) The key challenge that the authors did not address is uncertainty. How is the error-bar/confidence level affected by the bias correction?
 Thanks for your suggestion. To address this, we utilize 11 ensemble members from each of ECMWF and CESM1 and use bootstrapping. Confidence intervals are now added to Figure 3, Figure 4, and Extended Data Figs. 3 and 4. The methodology is added in the text and in the figure captions: (method, page 15) "The statistical significance test is performed with ECMWF-Cy43r3 and NCAR-CESM1 only, due to their relatively large ensemble sizes. The confidence interval of the DL-correction results is calculated using the bootstrap method. We randomly select 11 ensemble members from the S2S reforecasts, allowing overlap, to calculate the ensemble-averaged BMSE. This process is repeated 10,000 times, and the 2.5th and 97.5th percentile values are used to define the 95% confidence interval. The ensemble-averaged BMSE of the DL-correction (Fig. 3, Extended Data Figs. 3 and 4) is significant at the 95% confidence level if it lies outside the 2.5th-97.5th percentile range. The same process is performed for the reconstructed OLR and U850 to check whether the composited anomalies from the DL-correction are significantly different from the S2S forecast results (Fig. 4)."
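A sketch of the bootstrap described in the quoted method text; the array shapes are assumptions, and the per-iteration statistic is the BMSE of the resampled ensemble mean.

```python
import numpy as np

def bootstrap_bmse_ci(fcst, obs, n_boot=10_000, seed=0):
    """95% confidence interval of the ensemble-mean BMSE at one lead time.

    fcst: (n_members, n_times, 2) forecast (RMM1, RMM2); here n_members = 11.
    obs:  (n_times, 2) verifying observations.
    """
    rng = np.random.default_rng(seed)
    n_mem = fcst.shape[0]
    samples = np.empty(n_boot)
    for b in range(n_boot):
        pick = rng.integers(0, n_mem, size=n_mem)   # resample members with overlap
        err = fcst[pick].mean(axis=0) - obs
        samples[b] = np.sqrt(np.mean(err[:, 0]**2 + err[:, 1]**2))
    return np.percentile(samples, [2.5, 97.5])
```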
Manuscript title: "Deep Learning for bias correction of MJO prediction"
Authors: H. Kim, Y. G. Ham, Y. S. Joo, S. W. Son
Reviewer #2
-First, I should applaud the authors' effort to address my comments. The revised manuscript is much clearer and improved.
→ We appreciate the reviewer for another careful review.
-However, I still have one comment following my previous comment #3: since the error of amplitude and phase has been reduced by more than 70%, why does the BCOR skill not change much, as shown in Extended Data Fig. 7?
→ First of all, the ~70% error reduction shown in Fig. 2 and Fig. 3 is calculated based on the 'composite' maps shown in Figure 1. This has been described in the manuscript as (page 3) "To evaluate the forecast errors quantitatively, the bivariate root-mean-squared error (BMSE) is calculated as a function of initial MJO phases from the composites shown in Fig. 1." and also in the figure captions (Fig. 2) "BMSE is divided into (a, c) BMSEa and (b, d) BMSEp for the RMM composites shown in Fig. 1". Therefore, the reduced mean errors do not exactly match the BCOR (Supplementary Fig. 7), which is calculated with all individual MJO events.
-More specifically, useful prediction skill can be estimated when BCOR drops to 0.5, which generally corresponds to the RMSE increasing to sqrt(2). That is not the case when comparing Extended Figs. 5 and 7. This needs some clarification.
→ For a fair comparison between BMSE (same as RMSE) and BCOR, one should compare Supplementary Figs. 5 and 7, in which the individual MJO events are used for the BMSE and BCOR calculations. For easier comparison, we combine Extended Figs. 5 and 7 to compare the multi-model averaged BMSE and BCOR (Figure A below). Clearly, the BCOR and BMSE show improved skill after DL-correction.
Although previous studies used BCOR=0.5 and BMSE=1.414 as thresholds for useful MJO skill, the BCOR=0.5 line does not always match the BMSE=1.414 line. For example, Figure B is from Lim et al. (2018), where the BCOR and BMSE of the RMM indices were compared using 10 S2S models. The BMSE=1.414 lines do not exactly match the BCOR=0.5 line there either, so they used BMSE=2 as a reference skill measure.
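One way to see why the two thresholds need not coincide is the standard bivariate error decomposition (our notation, added here for intuition; not taken from the manuscript):

$$\mathrm{BMSE}^{2}(\tau)=\sigma_{o}^{2}(\tau)+\sigma_{f}^{2}(\tau)-2\,\sigma_{o}(\tau)\,\sigma_{f}(\tau)\,\mathrm{BCOR}(\tau),$$

where $\sigma_{o}$ and $\sigma_{f}$ are the root-mean-square bivariate amplitudes of the observed and forecast RMMs. Only for $\sigma_{o}=\sigma_{f}=1$ does this reduce to $\mathrm{BMSE}=\sqrt{2(1-\mathrm{BCOR})}$, in which case BCOR=0, not BCOR=0.5, corresponds to BMSE=sqrt(2); with the damped forecast amplitudes typical of S2S models the pairing shifts again, so BCOR=0.5 and BMSE=1.414 are empirical rules of thumb rather than an algebraic correspondence.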
As mentioned by the reviewer, the BCOR does not show continuous improvement after 3 weeks while the BMSE keeps decreasing. This could be because the absolute error of the RMMs is reduced, but the sign of the error is not improved as much as the absolute error when individual MJO events are considered. This could depend on each forecast system, and a more detailed analysis of individual model output would be needed to get a clearer picture.