A joint role for forced and internally-driven variability in the decadal modulation of global warming

Despite the observed monotonic increase in greenhouse-gas concentrations, global mean temperature displays important decadal fluctuations typically attributed to both external forcing and internal variability. Here, we provide a robust quantification of the relative contributions of anthropogenic, natural, and internally-driven decadal variability of global mean sea surface temperature (GMSST) by using a unique dataset consisting of 30-member large initial-condition ensembles with five Earth System Models (ESM-LE). We present evidence that a large fraction (~29–53%) of the simulated decadal-scale variance in individual timeseries of GMSST over 1950–2010 is externally forced and largely linked to the representation of volcanic aerosols. Comparison with the future (2010–2070) period suggests that external forcing provides a source of additional decadal-scale variability in the historical period. Given the unpredictable nature of future volcanic aerosol forcing, it is suggested that a large portion of decadal GMSST variability might not be predictable.

This manuscript examines how much of the decadal variability in the Large Ensembles is due to external forcing and how that amount compares to the decadal variability in observations. By using Large Ensembles in both the historical and future periods, the authors are able to assess how much decadal-timescale variability is externally forced (using the ensemble mean) versus internally generated (by examining all ensemble members). They trace much of the decadal variability in modeled global warming during the historical period to volcanic aerosols. The addition of results from the CESM all-but-one-forcing ensembles further illustrates this possibility in a convincing way. In these Large Ensembles, volcanic eruptions tend to excite the IPO pattern, indicating that much variability in the IPO may be externally forced. From there, the manuscript concludes that modeled decadal variability in the future may be unpredictable given the unpredictable nature of volcanic eruptions.
I very much appreciate the approach of using multiple Large Ensembles. For studies examining climate variability, this approach is still quite novel (we've only had convenient access to multiple Large Ensembles for the last year) and should be a gold-standard method going forward for examining these kinds of questions in climate models. This method gets at both models' representation of internal variability and allows an assessment of structural model differences, as the authors have done here. Because of the rigor of using Large Ensembles along with strong statistical methods, I find the conclusions drawn from this study convincing and of relevance to the broader climate community. My one suggestion to the authors is to further flesh out another possible conclusion that is implied at multiple points in the manuscript: that climate models as a whole are reacting too strongly to aerosols and that forced IPO variability is too strong overall as compared to observations (see discussion below). In the figures, I see some support for this argument and think that the authors would only strengthen their manuscript by pointing out this possibility more explicitly.
The figures are of publication quality and the writing is very clear with almost no typos. I recommend acceptance after addressing a few (very) minor concerns and suggestions, listed below.
Minor concerns/suggestions:

1. As discussed above, I think that the results of this manuscript suggest that all of these ensembles have too-strong responses to volcanic aerosol forcing (not just CANESM2 and GFDL) and that there is a structural model deficiency across all the ensembles in the same direction. I suggest that the authors address this possibility a bit more in the conclusions section. Here is the evidence I see in the figures for this possibility:
- Fig 1b/c, 1991: The observed dip in residual GMSST is at the very upper edge of, or outside, the spread of all ensembles. While it is possible that 1991 was a rare event, the fact that observations lie outside the spread of all 5 ensembles (in 1b) suggests to me that all 5 ensembles have something in common that is leading them to a similar (possibly incorrect) response.
- Fig 2, left column: The observed power spectral density (black lines) is lower than the ensemble-mean (forced) variability and lower than almost all the ensemble members (gray lines) in all 5 ensembles for periods >= 8 years. I find it rather unlikely that, across all 5 ensembles with 150+ possible realizations, the observed realization at 8-30-year periods is at the low edge for all ensemble members. The observed realization could be a rather rare event; this low observed PSD for 8-30-year periods is within the spread of some ensembles, but I think another possibility is that the ensembles are all reacting too strongly to non-GHG forcing. I am also reminded of other evidence in Atlantic variability in the CESM LE that the model is responding too strongly to aerosol forcing, particularly from a paper by Kim et al. (2018): https://journals.ametsoc.org/doi/pdf/10.1175/JCLI-D-17-0193.1. It might be worth citing this study also at L244.
2. I realize that it is somewhat outside the scope of this study, but it might be useful to briefly cite studies in the introduction on how Atlantic variability modulates GMST variability, but on slightly longer timescales. This topic is briefly touched upon in the conclusions, but (in my opinion), could stand another sentence or two of discussion in the introduction since it is likely that forced Atlantic variability may also be overestimated in models.
3. I am a little bit concerned with the quadratic trend removal (though this is vastly better than the linear trend removal that I see in many other studies). How much of the multidecadal aerosol forcing is removed by the quadratic trend removal? Would this affect the power spectra estimates for the longer 20-30-year periods? For the decadal volcanic downturns examined here, I imagine that this is probably not a large concern.
Typos: In the captions for Fig. 2 and S2, "subfigures" should be singular ("subfigure"). Caption 2: I think the word "integrating" in the last full sentence may be superfluous. I found this sentence confusing to get through; I suggest revising.

Reviewer #1 (Remarks to the Author):
Review of the manuscript "A joint role for forced and internally-driven variability in the decadal modulation of global warming" by G. Liguori et al. This paper presents the effects of volcanic aerosol forcing on the decadal variations of global mean surface temperature. By using large initial-condition ensembles with Earth System Models, the authors separate anthropogenic, natural, and internally-driven components of decadal variability. They find that volcanic aerosols account for ~29-53% of the decadal variance over 1950-2010, while the externally forced decadal variance is very weak during 2010-2070 due to the lack of volcanic forcing. In addition, they also confirm the possible projection of external forcing onto the IPO variability. The paper is well-written and well-organized. However, some drawbacks should be addressed before the paper can be considered for publication in Nature Communications. Thus, the paper needs a major revision.
We sincerely thank the referee for reviewing our manuscript and providing many excellent and constructive suggestions for improving the overall quality of the manuscript. A detailed report describing how the comments were addressed can be found below. We use style and colour codes as follows: italic/blue for reviewer comments; italic/black for manuscript text; regular/black for answers to the reviewer; italic/magenta for proposed changes in the manuscript text.

Major comments:

1. In this paper, both IPO and PDV are used to describe the decadal variability in the Pacific. Do you regard them as the same thing in this paper or not? Why do you use PDV in some places but IPO in others? Please clarify. For example, in Lines 71-76, the two sentences provide similar information, but the former uses IPO while the latter uses PDV. What is the difference between IPO and PDV?
We thank the reviewer for pointing this out. We agree that the lack of any explanation of what we mean by PDV and IPO generates confusion, especially in the text lines highlighted by the reviewer. We have decided to drop the use of PDV except in the last paragraph, where we inform the reader that the IPO is only one contributor (the largest) to the broader concept of Pacific decadal variability (we also direct the reader to the valuable work on the topic by Newman et al., 2016).

2. Besides the effect of volcanic forcing, the importance of internal variability can be inferred from the results of this paper. I suggest that the authors add some analysis and discussion to compare the contributions from the internal component and volcanic forcing to the decadal modulation of GMSST quantitatively, not only for the variance as you have mentioned but also for the time evolution.
We appreciate the reviewer's suggestion and have added a supplementary figure (Fig. S2) showing the correlation (as a boxplot) between the GMSSTr ensemble mean and each individual member for both the historical and future periods. In addition, the figure indicates the correlation value between each model ensemble mean and the observations. Thus, when this value falls within the range indicated by the boxplot, the model is assumed to be consistent with the observations. With this criterion, only two out of five models are formally consistent, with the majority of models overestimating the externally-forced decadal variability and/or underestimating the range of observed internal variability. These conclusions are similar to those presented at the end of section 3 ("Role of aerosols, GHG, and biomass burning in decadal variability").
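The explained-variance calculation described above (correlation between the ensemble mean and each member, squared) can be illustrated with a minimal synthetic sketch; the forced signal, noise level, and ensemble size below are illustrative numbers, not model output:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 30, 600  # ensemble members, time steps (illustrative)

# Shared "forced" signal plus independent "internal" noise in each member
forced = np.sin(np.linspace(0.0, 12.0 * np.pi, T))
members = forced + rng.normal(0.0, 1.0, size=(N, T))

ens_mean = members.mean(axis=0)

# Correlation between ensemble mean and each member; squared -> explained variance
r = np.array([np.corrcoef(ens_mean, m)[0, 1] for m in members])
forced_fraction = float((r ** 2).mean())

# True forced fraction here is var(forced) / (var(forced) + 1) ~ 1/3; the
# estimate is biased slightly high by residual noise left in the ensemble mean,
# which is the caveat noted in the text ("neglecting any residual internal
# variability in the ensemble mean")
```

The small upward bias of the estimate is exactly why the in-text calculation is prefaced by the "neglecting any residual internal variability" assumption.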
As a result, the following text has been added to section 2: Neglecting any residual internal variability in the ensemble mean, the correlation between the ensemble mean and each individual member gives a direct measure of the GMSSTr forced component (Fig. S2). Depending on the ensemble, the forced component explains between 30 and 58% of the variance (explained variance obtained as the correlation squared) in the historical period and 8 to 18% in the future period, with the remaining variance associated with unforced internal variability. However, three out of five models show correlations between the observed and ensemble-mean time series outside the range of modelled internal variability (i.e., outside the range of correlations between the ensemble mean and individual ensemble members). This may be a result of models underestimating the range of natural variability, but, as we show below, it likely also indicates an overestimation of the forced component in the historical period (e.g., Fig. S4).

3. By comparing the historical and future periods, the effect of volcanic forcing is discussed. Besides volcanic aerosols, are there any other forcing differences between the two periods that can cause differences in decadal variability? How about the anthropogenic aerosol forcing? Are there any decadal variations in anthropogenic aerosol forcing in the historical or future periods? Please clarify.
We understand the reviewer's concern that external forcings other than volcanic may cause differences between the historical and future periods. While the CESM fixed-aerosol experiments presented in Fig. 3c seem to exclude an important role for the aerosol forcing in the historical period, they do not preclude a significant role in future periods. However, the evolution of GMSSTr in this experiment until 2070 (2080 being the last year available) confirms that forced decadal fluctuations (i.e., coherent between the all-forcing and fixed-aerosol experiments) only appear in the historical period, which is when the volcanic forcing is present (see the figure below). In addition, while the anthropogenic aerosol forcing differs between the historical and future periods, its evolution in the 21st century presents a rather smooth negative trend that is difficult to link with the kind of decadal-scale fluctuations analysed in this study.
Here is a figure from Chapter 8 of the IPCC Fifth Assessment Report (AR5) showing the aerosol forcing in CMIP5 experiments for the RCP8.5 scenario, which is also the one used in the LENS simulations. Moreover, we acknowledge that other studies have suggested an important role for tropospheric aerosols in both global mean surface temperature and IPO variability. Specifically, in the opening paragraph we present this possibility and refer the reader to the seminal study of Smith et al. (2016), which suggests the possibility that the early-2000s hiatus in the GMST increase was driven by changes in atmospheric aerosols. While our study does not focus on the warming hiatus, it must be noted that our five large ensembles do not show any consistent slowdown in temperature during the hiatus period (i.e., 1998-2013), suggesting a dominant role of internal variability over externally-forced variability for this event.

We agree with the reviewer that lines 126-128 do not accurately describe the differences between the observations and the ensemble mean. We have changed that statement, which now reads:

FIGURES FOR REVIEWERS
"The observed GMSSTr trajectory lies largely within the ESM-LE model spread envelopes and presents some similarities with GMSSTr ensemble means from about 1975 to 1995, with the lack of full agreement largely due to the internal variability but also to model deficiencies in representing the external forcing (discussed later).
The lack of similarity between the modelled and observed GMSSTr trajectories is largely due to the internal variability, but also to model deficiencies in representing the external forcing. As noted by the reviewer in their point #5, and clearly visible in the previous Fig. S2, now Fig. S4, models tend to overestimate the cooling response after major volcanic events, causing the simulated trajectories to diverge from the observed one. It is noteworthy that while the amplitude of the decadal fluctuations is partially off, the timing is often correct (see the drop in the observed time series after the three major events in the study period, Fig. 1b and c). While it is impossible to fully quantify the role of the volcanic forcing without a dedicated ensemble (as also suggested by the reviewer's point #8), we believe that our calculations of the percentage of externally-forced decadal variations, obtained by integrating the power spectrum of Fig. 3, provide an upper bound for the amount of variance in GMSSTr driven by volcanic eruptions, acknowledging that minor contributions may come from other external forcings (e.g., tropospheric aerosols). We thank the reviewer for their advice on looking at the correlation between the observations and the ensemble means. Their suggestion was well taken and led to an in-text discussion and a supplementary figure (see our answer to the reviewer's comment #2). We show the correlation (as a boxplot) between the GMSSTr ensemble mean and each individual member, together with the correlation value between each model ensemble mean and the observations. We find that only two out of five models are consistent with the observations, as the majority of models either overestimate the externally-forced decadal variability and/or underestimate the range of observed internal variability.

5. Lines 162-165: Similar to the last comment, the decadal variations from volcanic eruptions are obvious only in the ensemble means but not in the observations, again indicating the importance of internal variability in the observations. This should be clarified in the text.
We thank the reviewer for this insightful comment. We now clarify at multiple points (see our answers to the reviewer's comments #2 and #4, and the text added to section 2, copied below) that the internal variability seems to play a more important role in the observations than in the models, as a consequence of the model tendency to overestimate the volcanic response (Fig. S3), which may simply be the result of inaccuracies in the external forcing (Kravitz & Robock, 2011; Santer et al., 2014; Toohey et al., 2011). However, it must be noted that the observed power spectrum (black lines in Fig. 2) for periods between 8 and 30 years is lower than that of almost all ensemble members (grey lines in Fig. 2), independently of the model. While it is possible that the observed GMSSTr trajectory is a rare event, the discrepancy is more likely a symptom of a systematic bias in decadal variability, probably linked to the model tendency to overestimate the non-GHG forced component (e.g., Fig. S2 and Fig. S4).
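As an aside for the interested reader, the kind of power-spectrum comparison discussed here can be sketched with synthetic data. The example below uses illustrative numbers and scipy's Welch estimator (not necessarily the manuscript's exact spectral method) to show how variance in the 8-30-year decadal band can be isolated from a series containing a 12-year oscillation plus white noise:

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(3)
T = 2048  # years of synthetic annual data (illustrative)

# White-noise "internal variability" plus a 12-year oscillation standing in
# for a forced decadal signal
x = rng.normal(0.0, 1.0, T) + 0.5 * np.sin(2.0 * np.pi * np.arange(T) / 12.0)

# One-sided PSD; fs = 1 sample/year, so frequency is in cycles per year
f, psd = welch(x, fs=1.0, nperseg=512)

# The spectral peak should fall near the 12-year period
peak_period = 1.0 / f[np.argmax(psd)]

# Variance integrated over the 8-30-year band (frequencies 1/30 to 1/8)
band = (f >= 1.0 / 30.0) & (f <= 1.0 / 8.0)
decadal_var = float(np.sum(psd[band]) * (f[1] - f[0]))
```

Comparing such band-integrated variance between an observed series and ensemble members is the essence of the Fig. 2 comparison discussed above.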
6. Lines 180-183: This statement is interesting and needs to be clarified. Why do aerosols in the troposphere offset the effect of those in the stratosphere? Do you have any conjecture or related references to support this statement?

Lines 180-183: It is noteworthy that fixing tropospheric anthropogenic aerosol concentrations to their 1920 value results in larger-amplitude fluctuations in GMSSTr, suggesting that the higher tropospheric aerosol concentrations typical of the second half of the 20th century have a damping effect on the volcanic forcing.
This is a very good point that deserves a full study. At the moment we do not have an explanation for this result except the following possible line of reasoning: in the limit of a troposphere saturated with aerosols, any further addition in the stratosphere (e.g., volcanic aerosols) would not be able to affect the Earth's surface, as the incoming solar radiation would already be screened by the saturated troposphere.
Given the importance of this result, we plan a dedicated study, the first step of which will be to verify that similar effects can be seen in other models.

7. Lines 216-217: "Assuming the IPO to be purely a result of internal variability, one would expect this IPO variance to decrease in proportion to the number of ensemble members, n."
How did you get this expectation? I cannot understand. Please explain.
We agree that an explanation for this expectation is needed. In the new version of the manuscript we have added the following paragraph to the Methods section:

A null hypothesis for the IPO variance in the ensemble mean
Assuming the IPO to be purely a result of internal variability, and thus independent in each ensemble member, one can use a known statistical result to predict the ensemble-mean IPO variance as a function of the number of ensemble members.
Consider N independent time series x_i, i = 1, 2, ..., N, each with variance Var(x_i) = σ². The variance of the average of the N time series is then Var((1/N) Σ_i x_i) = σ²/N, namely the variance of each independent time series, σ², reduced by a factor of N.
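For completeness, this σ²/N scaling can be checked numerically with a minimal sketch using synthetic white-noise series (the ensemble size, series length, and variance below are illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 30, 1000   # ensemble members, time steps (illustrative)
sigma2 = 4.0      # variance of each independent member (illustrative)

# N independent time series, each with variance sigma^2
members = rng.normal(0.0, np.sqrt(sigma2), size=(N, T))

# The variance of the ensemble mean should be close to sigma^2 / N
var_of_mean = float(members.mean(axis=0).var())
```

Under the null hypothesis of purely internal (independent) IPO variability, the ensemble-mean IPO variance should therefore shrink like 1/N as members are added.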

8. Fig. 3: Why don't you provide the results for simulations with fixed volcanic forcing here? That would be a more straightforward way to prove its dominant effect on the decadal variations in GMSST.
We fully agree with the reviewer that fixed-volcanic-forcing simulations would be ideal; however, a large ensemble for such an experiment is unavailable. Running such an ensemble is computationally expensive, requires significant storage capability, and is outside the scope of this study. While a large ensemble is unavailable, several years ago NCAR scientists used an early version of CESM to produce a few simulations in which the volcanic forcing was either the only one present (i.e., a volcanic-forcing-only experiment; 5 members) or the only one excluded (i.e., an all-but-volcanic-forcing experiment; 4 members).
While the small number of members does not allow us to effectively extract the forced variability, these experiments seem to be consistent with the hypothesized key role for volcanic forcing, as shown by the GMSSTr time series presented below and added to the supplementary material (Fig. S3).
This new text has been added to the manuscript: (Fig. S3).

Minor comments:
1. Lines 216 and 220: Fig. S2 should be Fig. 4c.

We thank the reviewer for noticing this.

I very much appreciate the approach of using multiple Large Ensembles. For studies examining climate variability, this approach is still quite novel (we've only had convenient access to multiple Large Ensembles for the last year) and should be a gold-standard method going forward for examining these kinds of questions in climate models. This method gets at both models' representation of internal variability and allows an assessment of structural model differences, as the authors have done here. Because of the rigor of using Large Ensembles along with strong statistical methods, I find the conclusions drawn from this study convincing and of relevance to the broader climate community. My one suggestion to the authors is to further flesh out another possible conclusion that is implied at multiple points in the manuscript: that climate models as a whole are reacting too strongly to aerosols and that forced IPO variability is too strong overall as compared to observations (see discussion below). In the figures, I see some support for this argument and think that the authors would only strengthen their manuscript by pointing out this possibility more explicitly.

The figures are of publication quality and the writing is very clear with almost no typos. I recommend acceptance after addressing a few (very) minor concerns and suggestions, listed below.
We sincerely thank the referee for reviewing our manuscript and providing many excellent and constructive suggestions for improving the overall quality of the manuscript. A detailed report describing how the comments were addressed can be found below. Following suggestions from both reviewers, we now address more explicitly the possibility that models overestimate the response to the external forcing. We have added an in-text discussion, a new figure to the supplementary materials (Fig. S2), and an additional statement to the conclusions section.
From section 2: The observed GMSSTr trajectory lies largely within the ESM-LE model spread envelopes and presents some similarities with the GMSSTr ensemble means from about 1975 to 1995, with the lack of full agreement largely due to the internal variability but also to model deficiencies in representing the external forcing (discussed later). ... Neglecting any residual internal variability in the ensemble mean, the correlation between the ensemble mean and each individual member gives a direct measure of the GMSSTr forced component (Fig. S2). Depending on the ensemble, the forced component explains between 30 and 58% of the variance (explained variance obtained as the correlation squared) in the historical period and 8 to 18% in the future period, with the remaining variance associated with unforced internal variability. However, three out of five models show correlations between the observed and ensemble-mean time series outside the range of modelled internal variability (i.e., outside the range of correlations between the ensemble mean and individual ensemble members). This may be a result of models underestimating the range of natural variability, but, as we show below, it likely also indicates an overestimation of the forced component in the historical period (e.g., Fig. S4). While the forced variability is visible in the observations, it must be noted that the models present too much decadal variability (Fig. 2) and likely overestimate the response to the external forcing (Fig. S2 and S4).
In addition, we think that some of the reviewer's suggestions are captured in these statements from section 3 ("Role of aerosols, GHG, and biomass burning in decadal variability"): Comparing GMSST anomalies after each major volcanic eruption suggests a tendency for a colder-than-observed response in ESM-LE models during the 1982 and 1991 events (Fig. S3). ... Furthermore, the models that likely overestimate the volcanic response, CANESM2 and GFDL, also present the highest percentage of decadal variability accounted for by FDV (53%; Fig. 2a,l), suggesting that these models might overestimate the externally-forced fraction of decadal variability, which is therefore closer to the lower boundary of the estimated range.
- Fig 2, left column: We find the reviewer's comment very valuable and incorporated its essence into the following new lines of section 2: However, it must be noted that the observed power spectrum (black lines in Fig. 2) for periods between 8 and 30 years is lower than that of almost all ensemble members (grey lines in Fig. 2), independently of the model. While it is possible that the observed GMSSTr trajectory is a rare event even with these multiple large ensembles, the discrepancy is more likely a symptom of the model tendency to overestimate internal and/or non-GHG forced decadal variability (e.g., Fig. S2 and Fig. S4). While we agree with the reviewer's comment, we also think that the following text from section 3 captures its essence: Comparing GMSST anomalies after each major volcanic eruption suggests a tendency for a colder-than-observed response in ESM-LE models during the 1982 and 1991 events (Fig. S3, Fig. S4). Furthermore, the models that likely overestimate the volcanic response, CANESM2 and GFDL, also present the highest percentage of decadal variability accounted for by FDV (53%; Fig. 2a,l), suggesting that these models might overestimate the externally-forced fraction of decadal variability, which is therefore closer to the lower boundary of the estimated range. We thank the reviewer for the suggestion of citing Kim et al. (2018); the reference has been included.
2. I realize that it is somewhat outside the scope of this study, but it might be useful to briefly cite studies in the introduction on how Atlantic variability modulates GMST variability, but on slightly longer timescales. This topic is briefly touched upon in the conclusions, but (in my opinion), could stand another sentence or two of discussion in the introduction since it is likely that forced Atlantic variability may also be overestimated in models.
We readily share the reviewer's opinion. In fact, in an early draft of the manuscript we briefly discussed the role of the Atlantic multidecadal variability in GMST. However, given the format of the journal, we preferred a shorter and more concise introductory section. Moreover, while the Atlantic variability is not explicitly named in the introduction, it is implicitly included in some of the references (Meehl et al., 2013; Haustein et al., 2019).

3. I am a little bit concerned with the quadratic trend removal (though this is vastly better than the linear trend removal that I see in many other studies). How much of the multidecadal aerosol forcing is removed by the quadratic trend removal? Would this affect the power spectra estimates for the longer 20-30-year periods? For the decadal volcanic downturns examined here, I imagine that this is probably not a large concern.
We fully understand the reviewer's concern, as this was also our concern in the early stages of the study. However, except for possible effects at the edges of the study period (i.e., 1950-2070), the removal of a smooth centennial-scale signal fitted by the quadratic function has little impact on the decadal-scale fluctuations targeted in this study. The large difference between the timescales involved, centennial vs. decadal, prevents the creation of spurious decadal-scale fluctuations far from the edges of the study period. Moreover, while the quadratic trend removal clearly affects the shape of the power spectrum (the quadratic fit has power at all frequencies), it does not significantly affect our findings on the forced decadal variability.
We now note possible effects of the trend removal with these lines in section 2: Moreover, the removal of a centennial-scale quadratic trend does not significantly affect the shorter decadal-scale fluctuations targeted in this study, apart from possible effects at the edges of the study period (i.e., 1950-2070).
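This timescale-separation argument can be illustrated with a minimal synthetic sketch (hypothetical amplitudes, not model output): removing a fitted quadratic from a series built as a smooth centennial trend plus a 12-year oscillation leaves the oscillation essentially untouched.

```python
import numpy as np

t = np.arange(1950, 2071).astype(float)  # study period 1950-2070

# Synthetic GMSST: smooth centennial warming (quadratic) + a 12-year oscillation
# (amplitudes chosen for illustration only)
centennial = 1e-4 * (t - 1950.0) ** 2
decadal = 0.1 * np.sin(2.0 * np.pi * (t - 1950.0) / 12.0)
gmsst = centennial + decadal

# Quadratic trend removal: fit and subtract a 2nd-order polynomial
# (time is centred before fitting for numerical conditioning)
tc = t - t.mean()
coeffs = np.polyfit(tc, gmsst, deg=2)
residual = gmsst - np.polyval(coeffs, tc)

# The residual retains essentially all of the decadal signal
retained = float(np.corrcoef(residual, decadal)[0, 1])
```

Because roughly ten full decadal cycles fit inside the 121-year window, the oscillation projects only weakly onto the quadratic basis, which is why the detrending leaves it nearly intact away from the period edges.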
In the captions for Fig. 2 and S2, "subfigures" should be singular ("subfigure").

We thank the reviewer for noticing this.
Caption 2: I think the word "integrating" in the last full sentence may be superfluous. Found this sentence confusing to get through, suggest revising.