A novel method to test non-exclusive hypotheses applied to Arctic ice projections from dependent models

A major conundrum in climate science is how to account for dependence between climate models. This complicates interpretation of probabilistic projections derived from such models. Here we show that this problem can be addressed using a novel method to test multiple non-exclusive hypotheses, and to make predictions under such hypotheses. We apply the method to probabilistically estimate the level of global warming needed for a September ice-free Arctic, using an ensemble of historical and representative concentration pathway 8.5 emissions scenario climate model runs. We show that not accounting for model dependence can lead to biased projections. Incorporating more constraints on models may minimize the impact of neglecting model non-exclusivity. Most likely, September Arctic sea ice will effectively disappear at between approximately 2 and 2.5 K of global warming. Yet, limiting the warming to 1.5 K under the Paris agreement may not be sufficient to prevent the ice-free Arctic.

C omplex mathematical models are a key tool for providing projections of future climate and environmental changes [1][2][3][4] . Yet, it is well known that such models are dependent, which complicates extracting probabilistic projection information from multi-model ensembles [5][6][7][8][9][10][11][12][13] . Model dependence can be defined rather qualitatively as sharing of code or belonging to the same modeling group 5,6 . More quantitative definitions involve correlation 5,7,8 , the concept of conditional probability 9 , and distance between models in physical 6,10 or some transformed low-dimensional space 11 . Ignoring the dependence is expected to disproportionately pull future probabilistic projections towards clusters of similar dependent models. In the worst-case scenario, the future projections may be centered on output from a large group of models which rely on the same erroneous code, yet little weight may be assigned to an independent correct model. This is one of the reasons Intergovernmental Panel on Climate Change (IPCC) has outright removed any probabilistic global multimodel climate projections from its Fifth Assessment Report 12 .
Fortunately, several new studies break important new ground [5][6][7][8][9][10][11] . These results indeed show that ignoring model dependence can result in overconfidence and bias in the future projections 5,6 . Yet, these methods typically treat model dependence in a simplistic way, without considering higher-order dependence (e.g., dependence between three or more models) or relying on probability theory. Moreover, many studies additionally assume exclusivity, i.e., that only one of the models is correct, e.g., refs. [14][15][16] . Hence, "the concept of independence has been frequently mentioned in climate science research, but has rarely been defined and discussed in a theoretically robust and quantifiable manner" 9 .
Here, we first mathematically define model dependence and model exclusivity. Then, we present a novel statistical approach that can test multiple dependent non-exclusive statistical hypotheses, and make probabilistic projections under such hypotheses using Markov chain Monte Carlo (MCMC). The new approach differs from the traditionally used Bayesian model averaging (BMA), which assumes that hypotheses are exclusive. We apply the new method to provide the first multi-model probabilistic projections of the global mean surface temperature (GMST) change from the preindustrial period at which September Arctic sea ice will effectively disappear (thereafter called GMST change to melt). The projections are based on the output of the Coupled Model Intercomparison Project phase 5 (CMIP5) models. The method considers both model dependence and skill at capturing known present-day and uncertain future sea ice metrics. GMST change to melt, along with many other statistical model parameters, is jointly estimated using MCMC. Before making actual projections, we calibrate the method in a suite of observation system simulation experiments to provide approximately correct coverage of the 90% posterior credible intervals.

Results
Defining relevant statistical concepts. Given the confusion in the literature, we first set out to define independence and exclusivity of events through their joint probability. Assume there are n statistical hypotheses, and H 1 , …, H n are events corresponding to each hypothesis being true. These hypotheses are tested using some observations y or a random variable Y representing the state of the dynamical system. In what follows we use a short-hand notation Y to mean the event of observing the underlying random variable Y taking a value of y. Joint probability of two hypotheses H i and H j given the observations is defined as If these hypotheses are conditionally independent given Y, the joint probability is the product of the conditional probabilities 9 : Similar result holds for the combination of n multiple hypotheses H 1 , H 2 , …, H n . They are independent given Y if 17 : for every r-combination of the hypotheses, and r = 2, 3, …, n. The hypotheses are conditionally dependent if the expression above does not hold. In the general case where hypotheses can be conditionally dependent: If the hypotheses are exclusive, then all the joint probabilities are zero: and similarly for higher-order combinations. There are some relationships between the terms "exclusive" and "dependent". For example, exclusive events are dependent. However, dependent events do not have to be exclusive. These concepts can be applied to climate modeling. While any overall hypothesis about model correctness is "false (and, it could be argued, not even approximately true)" 18 , hypotheses may be formulated for model adequacy for a particular purpose. An example hypothesis may be that a model is adequate for predicting GMST change by year 2050 relative to the preindustrial period under a particular emissions scenario within a given uncertainty. In case of hypotheses representing physical models being adequate for a particular purpose, both assumptions of independence and exclusivity are likely wrong. Consider, for example close-cousin climate models, such as IPSL-CM5A-LR and IPSL-CM5B-LR. Defining probability as degree of belief in a model being correct for a specific modeling purpose, a climate scientist may assign a higher conditional probability to IPSL-CM5A-LR if s/he were informed that its cousin IPSL-CM5B-LR is correct. This is inconsistent with independence (where a fact of one model being correct has no bearing on the conditional probability of any other model), and with exclusivity (where probability of IPSL-CM5A-LR would be zero if its cousin model was known to be correct).
Law of total probability for non-exclusive events. We now discuss BMA and extend it to properly accounting for model dependence. BMA is a popular procedure to make probabilistic projections [14][15][16][19][20][21][22][23][24] . BMA is based on the law of total probability applied to a continuous random prediction variable of interest A, several exclusive hypotheses (models of reality) H 1 , …, H n ∈ S, and a sample space S. Here, an ith hypothesis refers to the event H i that the ith model is adequate for a particular modeling purpose. BMA states that where lower-case p represents a probability density function (pdf), upper-case P is a probability, and all densities and probabilities are implicitly conditioned on S. With a slight abuse of the notation, \ Y refers to intersection with the event that a certain value of Y has been observed. According to Eq. (6), the resulting final projection pdf is a weighted mean of pdfs given each model, with weights representing relative probabilities of each model. We define sample space as at least one hypothesis being correct, i.e., S ¼ ∪ H i . One important implicit assumption for BMA is that the events of individual hypotheses being correct are mutually exclusive, i.e., P H i \ H j jY ¼ 0 8i≠j, and similarly for all higher-order model combinations. This is in fact a binding assumption and it will lead to too much weight assigned to similar predictions created by highly dependent (similar) models. As we present later, it is possible for several good models to jointly match observations, and to be simultaneously correct. We show that when the hypotheses are non-exclusive, the BMA is modified as follows (non-exclusive law of total probability): We prove this formula using the so-called inclusion-exclusion principle in Supplementary Note 1. In the modified BMA the probability densities of future metric A given all model pairs are subtracted, weighted by the respective pair probabilities; conditional weighted densities given model triplets are added back again, and so forth. As Eq. (4) shows, the full law considers potential model dependencies up to order n. We propose to adopt a new term "model non-exclusivity", which refers to the fact that more than one model can be jointly consistent with reality. Non-exclusive model probabilities incorporate both models' skill at capturing observations, as well as inter-model dependencies (Eq. (4)).
Probabilistic projections using the marginalization theorem. Evaluating the terms in the full BMA equation (Eq. (7)) is nontrivial. However, if our hypotheses can be defined by some parameters θ belonging to regions in parameter space, the problem simplifies. Specifically, we can use the marginalization theorem to obtain The key idea is that if sample space is appropriately defined, this approach is equivalent to adding up all models interactions in Eq. (7). We stress that this also considers all model interactions of up to order n (Eq. (4)). As we will show later, this approach is tractable, and can be implemented with relative ease using MCMC 25,26 . MCMC is a method to sample from a joint distribution of parameters using Markov chains. The chain of the prediction variable A provides a way to estimate its probability density. The fraction of samples within an intersection of regions associated with any hypothesis combination naturally provides a tractable way to calculate the probability for that combination.
Application to projections of Arctic sea ice. We make CMIP5 multi-model projections of GMST change to melt, using presentday September Arctic sea ice extent (SIE), as well as the recent trend in SIE with respect to GMST (SIE sensitivity), as constraints. For the ice-free Arctic, we use the SIE cut-off of 1 million km 2 to be consistent with previous work [27][28][29][30] . Several studies address future sea ice projections 10,27-29,31,32 , but few in a probabilistic way 30,[33][34][35][36] . Our work can be seen as a first attempt to provide probabilistic projections of GMST change to melt, while explicitly accounting for all model dependencies. Our data consist of autocorrelated annual time-series of present-day (1979-2017) September SIE from 31 CMIP5 37 climate models y i with unknown means μ i (i = 1, 2, …, n is model index), corresponding observations y o with an unknown random mean μ o , and the deterministic future GMST change to melt for the representative concentrations pathway (RCP8.5) scenario from each model z Ã i . In addition, we use the sensitivity of SIE to GMST (e.g., trend from linear regression) from each model u i and corresponding observations u o . The list of models and their institutions is shown in Supplementary Table 1. Let y = (y o , y 1 , y 2 , …, y n ) be the collection of observed and model SIE means, Y be the event that the underlying random variable Y takes the value of y, and μ = (μ o , μ 1 , …, μ n ) be the collection of true observed and model present-day means. We assume that each model and observed SIE time series is modeled by a sum of a linear term and an autoregressive model of order 1 [AR(1)], with the only uncertain parameters being the means of the linear terms (see Methods). Thus, we assume that the slopes and the AR(1) parameters are fixed at estimated values, in the interest of reducing computational cost and complexity. Our projection random variable of interest z* is the future GMST change to melt.
The event for each model adequacy hypothesis can be defined in terms of the random parameters belonging to a region in R m where m is the number of parameters where Δ μ , Δ Ã z , and Δ u are uncertain random tolerance ranges for the hypotheses. In other words, hypothesis H i states that the present-day SIE mean for model i is within some distance from the observed mean, its temperature sensitivity is within some distance from the observed sensitivity, and future GMST change to melt is, too, within some distance of the actual unknown GMST change to melt. Originally, we were going to use fixed tolerances in our analysis, which seemed as a more natural choice in the beginning. However, during validation of the method, this resulted in flat posterior pdfs with sharp cutoffs for a projection variable of interest. This lead us to make the tolerances uncertain parameters. Specifically, we assume that the tolerances follow half-normal (or truncated normal) distributions where f is deterministic scalar error expansion factor, which can be calibrated using cross-validation, and σ μ is deterministic sample standard deviation of differences between each model's present-day mean and next-closest model's mean. This standard deviation is the same for all models. We use this quantity in the tolerance parametrization since it can be thought of as a measure of model error. Thus, broadly speaking a model is correct if it is within some model-error-informed tolerance away from the observations. We use the half-normal distribution because the tolerances cannot be negative, and because it results in reasonably looking posterior pdfs. Testing other prior distributions for the tolerances is left to future work. The f factor is introduced in order to correct for potential overconfidence. Here, fσ μ = 0.50. The distributions for Δ Ã z and Δ u are defined similarly in terms of σ Ã z and σ u , which are defined analogously to σ μ but for the future GMST change to melt and the sea ice sensitivity, respectively. Here, f σ Ã z ¼ 0:41 and fσ μ = 0.47. The events for hypothesis combinations are simply defined by intersections of the regions for constituent hypotheses. We use the mean SIE and its sensitivity to constrain the models because we find considerable relationships between these variables and the GMST change to melt ( Supplementary Fig. 1). Moreover, previous research 28 shows that in climate models the present-day mean SIE is more strongly correlated to the Arctic ice-free year under the RCP8.5 scenario than other variables, such as mean annual sea ice volume, September sea ice trend, mean seasonal cycle of SIE, or the thin ice SIE. As previously, we define our sample space as the union of all hypotheses: S ¼ ∪ H i . The sample space can also be thought of as a region in the m-dimensional space.
The probabilities can be calculated using Eq. (8) (reformulated in different variables) and Bayes' theorem: According to Bayes' theorem 38 , the posterior density of parameters given the observations is proportional to the product of the likelihood of the observations given the parameters which depends only on μ, is defined in the Methods section), and the prior belief in the model parameters pðz Ã ; μ; Δ μ ; Δ Ã z ; Δ u Þ. Here we decompose the prior into priors for individual components assuming independence. Furthermore, p(Δ μ ) follows Eq. (10), p Δ Ã z À Á and p(Δ u ) are defined similarly, and the prior for other parameters pðz Ã ; μÞ / 1 S is an indicator function for membership in the set S. (The prior is uniform over set S, and 0 outside of the set.) We explore the joint pdf presented in the second term of Eq. (11) using MCMC. For each posterior MCMC chain draw l of parameters z ÃðlÞ ; μ ðlÞ ; Δ ðlÞ μ ; Δ ÃðlÞ z , and Δ ðlÞ u we compute a value of binary variable h ðlÞ i : which is an indicator of whether the ith hypothesis is true or not. The joint hypotheses are correct if all constituent hypotheses are correct, e.g., Model probabilities for any combination (e.g., p(H i ) or p(H ijk )) are obtained from the MCMC chain using relative frequency of samples falling into each hypothesis region (or hypothesis combination region).
We first test the method using one-at-a-time observation system simulation experiments (Methods section). Here, we pretend each of the models is correct and use its output as observations one-at-a-time. We then exclude the true model from the model set and calculate the pdf for z* (GMST change to melt), which we compare to actual projections from the true model (Supplementary Figs. 2 and 3). In these experiments we choose the deterministic error expansion factor f = 3 such that the method gives correct coverage of the 90% posterior credible intervals for z*. We then constrain all available models with real observations to make actual projections of GMST change to melt. We compare the method to the standard BMA, which does not account for the non-exclusive terms. To do that, we simply perform weighted sampling from conditional GMST change pdfs given each model being correct p(z*|H i ) using our MCMC chain.
We perform five projection experiments. These experiments explore the sensitivity of the results to observational datasets, and to the natural fraction of the recent SIE decline. The range of natural fractions considered here is roughly consistent with prior model-based studies which place it between 5 and 50% 39-41 . HadISST_r51_40p.nat uses 1979-2017 HadISST data for the SIE, the 51st realization of the HadCRUT4 temperature for the calculation of sea ice sensitivity over years 1979-2017, and assumes that 40% of the 1979-2017 SIE decline was natural. HadISST_r51_anthro experiment is exactly the same except is assumes that all of the recent decline was anthropogenic. NSIDC_r47_20p.nat uses NSIDC SIE observations, 47th realization of temperature observations, and assumes a natural contribution to the ice decline of 20%. NSIDC_r91_40p.nat uses NSIDC observations, the 91st realization of temperature observations, and assumes the natural contribution to the decline of 40%. There are large uncertainties regarding the magnitude of the true ice sensitivity given the short length of the observational records, and regarding the cause of the recent SIE decline 31,35,42 . One way to sidestep the problem is to just use the mean SIE constraint for the shorter 1979-2004 period which is not as severely affected by the recent melt. This motivates the NSIDC_mean_only experiment. This experiment uses NSIDC observations for SIE for years 1979-2004, and no SIE sensitivity constraint (see Eq. (18) for the formula for the pdf of GMST change for this experiment). To make sure the expansion factor f = 3 is reasonable for this experiment, we perform observational system simulation experiments using the 1979-2004 calibration period, and no SIE sensitivity constraint. Supplementary Figures 4 and 5 provide validation results for these experiments, and confirm that in this case f = 3 provides reasonable coverage of the 90% posterior credible interval.
In what follows we focus on the HadISST_r51_anthro experiment to illustrate the capabilities of the new method. However, pdf properties of GMST change to melt are provided for all experiments, and averaged across the experiments.
Model hypothesis probabilities. We remind the reader that the models are defined to be adequate for modeling the relationship between GMST and SIE if their present SIE means fall within a distance Δ μ of observed SIE mean, the future GMST changes to melt fall within tolerance Δ Ã z of the true GMST change to melt, and the sea ice sensitivities to temperature are within Δ u of the observed sensitivity. We do not know the true GMST change to melt, but the method provides its pdf. To calculate hypothesis probabilities we integrate out this uncertainty. The tolerances are also estimated as part of the method, and their uncertainties are also integrated out. We provide the pdf of the tolerances for the HadISST_r51_anthro experiment in Supplementary Fig. 6. The present-day SIE tolerance Δ μ has a mean of 0.561 million km 2 and its 90% posterior credible interval ranges from 0.121 to 1.14 million km 2 . Similarly, for the GMST change to melt, Δ Ã z has a mean of 0.474 K, with the 90% posterior credible interval from 0.11 to 0.963 K. Finally, the tolerance for sea ice sensitivity Δ u has a mean of 0.515 million km 2 K −1 and the 90% posterior credible interval from 0.098 to 1.07 million km 2 K −1 .
The model weights for HadISST_r51_anthro experiment are presented in Fig. 1. As it can be seen in the figure, a few models attain considerable weights, while many models receive zero or near-zero weights. The model with the highest weight (model 11, GFDL-CM3), has the mean present-day SIE of 6.43 million km 2 , similar to the observed value of 6.40 million km 2 used in this experiment; and the sea ice sensitivity of −2.87 million km 2 K −1 , similar to the observations (−3.01 million km 2 K −1 ). The model with the second highest weight is HadGEM2-CC, while MRI-CGCM3 has the third highest weight. We note that the weights are a diagnostic of the method, and they are not needed to obtain the pdf of the future projections. Hence, the step of finding weights can be theoretically skipped if only the projections are needed.
Besides individual models weights, our approach provides joint model weights for all model pairs (Fig. 2). These weights are less than individual model weights, since they represent intersections of regions for hypothesis pairs. Note that these weights account for both model skill and dependence, since PðH i \ H j jYÞ ¼ PðH i jH j \ YÞPðH j jYÞ. Notably, for the HadIS-ST_r51_anthro experiment (Fig. 2a) the pair of 2nd and 11th models (ACCESS1.3 and GFDL-CM3) gets a high weight. These models produce relatively similar mean present-day SIE (5.62 and 6.43 million km 2 ), future GMST change to melt (both 2.12 K), and present-day SIE sensitivity to temperature change (−3.08 and −2.87 million km 2 K −1 ). These models have a similar ocean component: it is MOM4.1 for ACCESS1.3 and MOM4-based for GFDL-CM3. However, the sea ice models are not the same: CICE4.1 for ACCESS1.3 and GFDL Sea Ice Simulator (SIS) for GFDL-CM3 [43][44][45] . In addition, out of the top five most dependent pairs, none of the pairs belong to the same modeling family 13 . Thus, our results indicate that models may exhibit similar behavior for reasons other than coming from the same modeling center. These other reasons include equifinality (different parameters and modeling structures resulting in the same output by chance), subtler sharing of ideas, using the same observational datasets for input parameter calibration, or a random realization of internal variability. However, our approach accounts for the observed and modeled uncertainty in mean present-day SIE due to internal variability.
A unique feature of our method is the capacity to obtain joint hypothesis probabilities for any desired hypothesis combination. Figure 3 illustrates this by showing joint weight of combinations of models 11, 18, and 20 with all other models. This triplet is chosen because of the inter-dependencies between these models (Fig. 2). Specifically, these models are included in the top five most dependent model pairs. A combination of (2,11,18,20) gets the highest combined weight (note also that the pair 2 and 11 is the most likely pair in the ensemble), and a combination (29,11,18,20) gets the second highest combined weight (the pair 29 and 11 forms the second most likely pair in the ensemble).
Finally, the probability of having exclusively one model being correct is higher than the probability of more than one model being correct. Specifically, about 1,200,000 of 1,900,000 MCMC samples correspond to just one correct model ( Supplementary  Fig. 7a) for the HadISST_r51_anthro. Nonetheless, the amount of samples corresponding to more than one model being correct is still considerable. For the progressively increasing number of models the joint model probability tends exponentially to zero ( Supplementary Fig. 7a).
Comparing the HadISST_r51_anthro and other experiments with two constraints to the NSIDC_mean_only experiment (where only one constraint is used), it appears that the model non-exclusivity may be more important in the one-constraint case. There are less instances of high pair weights once the second constraint is introduced (Fig. 2), and the non-exclusive model probabilities appear lower ( Supplementary Fig. 7). Moreover, in the NSIDC_mean_only experiment the number of MCMC samples with an exclusively one correct model is smaller than in the rest of the projection experiments. This suggests that using multiple constraints (at least in the context of introduced  GMST change at which Arctic sea ice will melt. Our method provides pdfs for 36 parameters, however of main interest is the GMST change at which the Arctic ice will effectively melt in September. Figure 4a explores the effect of accounting for the non-exclusivity on the GMST to melt pdfs. It appears that the effect is relatively small for the case of two constraints (HadIS-ST_r51_anthro with and without interactions). Yet, in the NSIDC_mean_only experiment, once the interactions are accounted for, the peak in GMST change to melt between 2 and 2.5 K decreases considerably, while the upper tail of the pdf becomes much fatter. Thus, the effects of accounting for interactions diminish in the case of using two constraints on the models. This is consistent with discussion in the previous section regarding introducing multiple constraints. Figure 4b illustrates the differences between the pdfs under different assumptions about the observations, and about the fraction of the recent SIE decline caused by anthropogenic effects. For example, under the assumption of a fully-anthropogenic SIE decline, the HadISST sea ice dataset, and using the 51st realization of HadCRUT4 temperature observations, the pdf is relatively tight with the most likely GMST change to melt (mode) slightly above 2 K, and the 90% credible interval of (1.44, 3.17) K (Table 1). Keeping the HadISST SIE dataset, but assuming that 40% of the recent SIE decline was natural (HadISST_r51_40p.nat) changes the pdf considerably and results in the second peak above 4 K. This suggests that determining the cause of the recent SIE decline is fundamental in reducing the uncertainty about future GMST to melt projections. Yet, keeping the natural fraction of 40%, but switching to the NSIDC ice dataset and to the 91st realization of temperature observations substantially changes the pdf once again (NSIDC_r91_40p.nat experiment). Specifically, the pdf is shifted to the left, and the second temperature peak between the 4 and 4.5 K is almost completely removed (Table 1). This illustrates the monumental impact the uncertainties in the observations can have on future Arctic sea ice projections. These effects have been pinpointed in previous work 34,35 . Finally, the pdf under the NSIDC sea ice observations, the 47th realization of temperature observations, and a natural fraction of the sea ice decline of 20% (NSIDC_r47_20p.nat), is again tight, with no second peak, a mean of 2.11 K, and the 90% posterior credible interval of (1.35, 2.92) ( Table 1). Removing the sea ice sensitivity constraint results in a wide pdf with a relatively high mean (2.90 K). In most of the experiments, the lower bound of the 90% posterior credible interval is below 1.5 K. This indicates that there is a distinct probability of the Arctic losing essentially all of its summer ice even if we fulfill commitments to the lower 1.5 K warming limit of the Paris agreement. These results are in stark contrast to an almost zero probability of Arctic ice disappearance at 1.5 K warming suggested in previous work 33 . Yet, that work does not constrain the models by the sea ice sensitivity to temperature. On the other hand, our results broadly agree with a recent study that does use the sea ice sensitivity in correcting climate model biases 34 . Overall, the mean GMST change to melt for all experiments accounting for non-exclusivity is 2.54 K, with the 90% posterior credible interval from 1.49 to 3.83 K ( Table 1). Our results have been performed under the assumption of RCP8.5 emissions scenario. However, previous work has found a remarkably linear relationship between temperature and Arctic sea ice in various months in climate models/climate model ensembles 34,35,42 . Moreover, SIE metrics of individual models (at least for the annual mean), and entire CMIP5 ensemble (for September) have been found to not strongly depend on the driving emissions scenario 36,42, and others] . Thus suggests that current pdfs may be adequate not just for RCP8.5, but for a range of future emissions scenarios 35 where GMST monotonically increases with time. Validating our results using other emissions scenarios is subject of future work.

Discussion
The non-mathematical understanding of our method is as follows. To provide pdf of the GMST change to melt, we randomly sample all GMST changes that fall within our sample space: GMST changes that are within some distance of at least one model whose present-day mean September SIE is within some distance of the observed mean, and whose present-day SIE sensitivity is also within some distance of the observed value. In addition, we also account for the uncertainty in present-day modeled and observed SIE, as well as for the uncertainty in distances themselves. We show that such an approach is equivalent to considering all model interactions of up to order n, where n is the number of models.
Our results, specifically Eq. (7) indicate that BMA is limited in that it considers that models are exclusive representations of reality. This point has been previously discussed in the literature 22,46 . We show that using standard BMA can lead to biased projections in the case of a single observational constraint (Fig. 4a). In this case, the probability that more than one model is simultaneously correct is around 60% (Supplementary Fig. 7b). Yet, adding a second constraint substantially reduces this probability ( Supplementary Fig. 7a). This illustrates that while the model non-exclusivity may be important, using multiple constraints may reduce its effect on model projections.
In the two-constraint cases, the most likely scenario is that of one and only one model (out of the 31 models in the ensemble) being adequate to represent reality. This raises a daunting question of usefulness of the model ensemble as a whole for making future projections. Should we not have restricted our sample space to a region with at least one model being adequate to represent relationship between sea ice and temperature, what would have been the probability of none of the models being adequate? To our knowledge, methods to quantify such a probability currently do not exist. Yet, determining the validity of the ensemble as a whole is fundamental from the policy-making perspective. This should become focus of future research on this topic.
Our method can be compared with a recent study 11 . Specifically, that work uses singular value decomposition (SVD) and multidimensional scaling (MDS) to map present-day CMIP5 spatial model output, and corresponding observations into a 2D parameter space. Associated with each point in this space is a prediction property of interest (in this case climate sensitivity CS, as well as future regional temperature and precipitation changes), interpolated between model points. The study then proceeds to sample randomly from all points in the space within the convex hull of the models, with denser sampling close to observations, to construct a cumulative distribution function for CS, and pdfs for temperature and precipitation changes. The work claims their pdfs are just "resampled histograms of model behavior" 11 . However, we show using additional statistical analysis in Supplementary Note 2 (for their "Gaussian" experiment), that their approach is similar, and also considers all model interactions under some limiting statistical assumptions. Specifically, by using deterministic interpolation they assume a degenerate conditional probability for CS given a value of parameters in the 2D space. In addition, convex hull of the models is clearly a crude approximation to the probability space. Even if questions remain about the justification for their statistical model, this means that Bayesian multi-model probabilistic projections accounting for all model interactions (2 50 − 1 ≈ 1.1 × 10 15 interactions) are already available to climate community. There are numerous differences between our work and that study. First, we consider dependence in both present-day and future model output. Second, we use time-series while that work uses spatial model output 11 . Third, we provide a statistical theory and method for finding non-exclusive hypothesis probabilities, and for prediction under such hypotheses that accounts for all hypothesis interactions.
Another relevant method is Bayesian model combination or ensemble BMA 22,47 . This method accounts for model nonexclusivity in a different way. Specifically, it considers a collection of augmented models H Ã j , where each augmented model represents a combination of original models H i . It then performs standard BMA with the augmented models. However, such an approach suffers from the same problem as original BMA: it assumes exclusivity of model combinations. But if one model combination is correct, it does not preclude another model combination (e.g., a subset of the original combination) to be correct. We show here that the proper way to account for model interactions is to subtract model combination terms with an even number of models, and to add odd-number combinations following Eq. (7). By working with a parameter space (Eq. (8)) as opposed to model combinations (Eq. (7)), our approach can handle ensembles with a large number of combinations. This is because unlike the ensemble BMA we do not need to explicitly calculate the probability of each combination. Specifically, current work handles 2 31 -1 combinations, which is approximately 2.1 billion. Finally, previous ensemble BMA implementation does not properly account for model dependence when evaluating model combination probabilities 22 . Our work is subject to important caveats. First, joint/combination model weights are dependent only on model and observed output in a low-dimensional space. As such, they do not explicitly consider model families 13 , or sharing model code between institutions. While similar model output from unrelated models can result from subtle model dependencies such as sharing ideas or calibration datasets, it may also arise due to a random realization of internal variability 48 . Our code partially accounts for this by considering the uncertainty in the present-day model and observed SIE means. It has been previously shown, that when spatial information is considered, models from the same institution often produce similar output 11,13 . Hence, incorporating such information can be considered in future work. Dimension reduction methods used previously may constitute one useful avenue for action 11,49 . Second, we consider only two observational SIE datasets [50][51][52] . However, another popular dataset, Meier dataset, is based on the same satellite observations as NSIDC from year 1979 50,53 . Third, we do not consider the uncertainty in the autocorrelation, or standard deviation of the interannual present-day September SIE variability. While considering such uncertainty may present a substantial improvement, we refrain from doing this to drastically reduce the computational cost and complexity. Considering these uncertainties is subject of future work. Fourth, we use only a subset of available models and do not make use of multi-parameter model ensembles, or intermediate complexity Earth System models. Fifth, our tolerances for sea ice sensitivity are the same for all models, and do not account for the different internal variability in different models 54 . Ideally, this information should be incorporated in the sea ice sensitivity constraint. Note that this is not an issue with the mean SIE constraint, as there we sample from the pdfs of unknown population means. These pdfs take into account different internal variabilities of different models, e.g., if a particular model has a high internal variability this is expected to result in a broader pdf for the population mean of the SIE time series. Finally, our decision to formulate hypothesis tolerances using half-normal prior distributions is clearly subjective. We have chosen to make tolerances uncertain parameters because using fixed values resulted in flat pdfs with sharp cutoffs for a projection variable of interest in method validation experiments. We justify the prior distributions post-hoc by the fact that such priors lead to reasonable projection pdfs and model weights that gradually taper off further away from observations (Supplementary Figure 8). More rigorous determination of appropriate tolerance priors needs to be explored in future work.
In closing, we provide a statistically-robust Bayesian method to calculate probabilities of non-exclusive dependent hypotheses and their combinations; and to make predictions under such hypotheses. The approach accounts for 2 n − 1 hypothesis combinations for n hypotheses. We use this method to make projections of the GMST change from preindustrial (1861-1890) climatology at which the Arctic will lose almost all of its September ice using 31 non-exclusive climate models. Neglecting model non-exclusivity produces biased results in case of using just a single mean sea ice extent (SIE) constraint on the models, but the effects of non-exclusivity appear to diminish when a second constraint of SIE sensitivity to temperature is added. There is a distinct probability the sea ice may vanish below 1.5 K warming limit of the Paris agreement, even if 40% of the recent sea ice decline has been naturally-caused. The projections of GMST change to melt are sensitive to the assumptions about the observational datasets, and to the natural fraction of the recent sea ice melt. The overall mean for the GMST change to melt is 2.54 K, and the 90% posterior credible interval is (1.49, 3.83) K.
The study raises important questions about the usefulness of model ensembles (hypothesis sets) for making future projections. While the model runs used here were performed under RCP8.5 emissions scenario, the conclusions may hold more generally for a range of future emissions scenarios 35 where GMST continuously increases with time. Finally, by making parallels between our work and a previous study 11 , we show mathematically that probabilistic regional climate projections and climate sensitivity estimates accounting for all model interactions are already available to climate community.

Methods
Model output and observations. We use 31 CMIP5 climate models, utilizing the historical and future RCP8.5 first run from each model (Supplementary Table 1). This output has been obtained from the ESGF LLNL portal 55 . We first base model selection on the availability of output, which originally results in 33 climate models. We use the first run from each climate model. We then discard GISS-E2-H model as it reaches ice-free conditions (defined by <1 million km 2 September SIE) before present under the RCP8.5 scenario, and CSIRO-Mk3.6.0 as it has the highest September SIE of all models during the historical period 1979-2004, making it inconsistent with observations. We interpolate all sea ice concentration (sic) model output to a common 1°× 1°latitude-longitude grid using nearest-neighbor interpolation. Interpolating modeled sic to a common grid has been previously performed in the literature 29 . We calculate September SIE for years 1979-2017, and 2006-2099 as the total area of all cells in the Northern Hemisphere with sic > 0.15 56 . To obtain GMST change to melt we use the global mean surface atmospheric temperature difference between the 5 years centered on the year when Arctic first becomes ice-free, and the 1861-1890 preindustrial climatology for each model. These years were chosen because several models do not have data for the complete period 1850-1861. Before global averaging, we first bilinearly interpolate temperature fields to a 2°× 2°grid.
We use several observational sources for the Arctic September SIE. First, they include HadISST1 observations 52 for years 1979-2017. HadISST1 observations are in the form of monthly sea ice concentrations. We obtain SIE from these observations using the same procedure as used for the CMIP5 models. Second, we use version 3 NSIDC September SIE observations spanning years 1979-2017 57 . Third, we utilize NSIDC Sea Ice Extent version 1 50,51,58 for years 1979-2004. The differences in September SIE between the successive versions are small 57,59 . For GMST observations (used to calculate the sea ice sensitivity) we use different randomly chosen realizations of HadCRUT4 dataset 60 . We use different realizations of the same dataset as opposed to using different products, because the difference between HadCRUT4 1979-2010 GMST trend, and the GISS and NCDC datasets, is small compared to HadCRUT4 trend's own uncertainty range 60 .
Statistical model and detailed likelihood function. We assume each present-day model SIE output, as well as observations can be modeled as sum of a linear trend and an AR(1) process: wherek o andk i are observed and modeled sample slopes respectively, Δt = (t 1 − t 0 , ; ϵ i;p Þ are residual AR(1) processes (representing interannual variability) where p is the number of yearly datapoints. The AR(1) processes are defined as To obtain the likelihood function for the observations y we assume that the random variable Y describing the historical climate and model SIE output is independent between the models, and real climate, given the present-day modeled and observed SIE mean μ. This is justified as internal variability in climate models and observations is known to be random. Under such assumption the likelihood function can be decomposed as pðYjμÞ ¼ p y o ; y 1 ; y 2 ; ; y n jμ o ; μ 1 ; μ 2 ; ; μ n À Á Here, individual likelihood terms are standard AR(1) process likelihood functions 61  À Á is stationary process variance for observed variability, and w o;t ¼ ϵ o;t Àρ o ϵ o;tÀ1 ; t>1 represent whitened observed variability. The likelihoods for model outputs pðy i jμ i ;k i ;σ i ;ρ i Þ are defined similarly; we omit them for brevity.
Spectral analysis strongly suggests that AR(1) is a reasonable approximation for climate models used here (Supplementary Fig. 8). For the spectral analysis we use time series of September SIE variability of CMIP5 models over years 1880-2004 around a lowess trend 62 with a span = 0.8. We choose this value because it appears to remove multidecadal variability from the trends.
Details of the observation system simulation experiments. During the observation system simulation experiments for the GMST change to melt we use each model's output as pseudo-observations one-at-a-time. In the first set of experiments we use both SIE mean and SIE sensitivity constraints for years 1979-2017. For each perfect model, we explore the joint pdf specified in the second term of Eq. (11) using MCMC. We use 200,000 MCMC samples, with a burn-in of 50,000. We use f = 3 for the observation system simulation experiments, as this results in reasonable empirical coverage of the 90% posterior credible intervals. For each excluded model we compare the posterior pdf for the GMST change to melt to the value from the true excluded model ( Supplementary Figs. 2 and 3). The GMST change from the true model, and the mean projection exhibit a considerable positive correlation, highlighting the skill of the method at projecting this quantity. We then perform a similar experiment with just the mean SIE constraint for years 1979-2004. Supplementary Figures 4 and 5 show validation results for this case.
We test whether the chain length and burn-in settings in the observation system simulation experiments are reasonable. Using such MCMC length in an additional estimation experiment similar to NSIDC_r91_40p.nat, we find that the differences between marginal pdfs for the GMST change in the short and the standard run are small. In addition, chains for all parameters for the short run appear to be reasonably stationary and not unduly influenced by the initial random seed. This suggests that these MCMC settings are reasonable for calibration of the 90% posterior credible interval.
Details of the future projection experiments. We explore the joint pdf in the second term of Eq. (11) using MCMC. For the MCMC, we use 2,000,000 samples with a burn-in of 100,000 for all experiments. Following the cross-validation experiment, we use f = 3.0 for all experiments. We run a second run of the NSIDC_r91_40p.nat experiment with a different random seed, and the diagnostics from the two runs are reasonably similar.

Data availability
CMIP5 model output is available online from the ESGF LLNL portal at https://esgf-node. llnl.gov/projects/esgf-llnl/. NSIDC Sea Ice Index version 3 is available online from the NSIDC website at https://nsidc.org/data/seaice_index. The version 1 data is available from the authors upon request. HadISST1 sea ice concentrations are available online from the Met Office Hadley Center website at https://www.metoffice.gov.uk/hadobs/ hadisst/. HadCRUT4 temperature observations are located at https://crudata.uea.ac.uk/ cru/data/temperature/. All other relevant data are available from the authors upon request.