Many of the greatest scientific challenges involve problems of vast complexity and interconnectedness that transcend traditional disciplinary boundaries. Climate science is one such example, requiring an interdisciplinary approach to advance scientific knowledge. The fast-growing availability of observations from remote sensing platforms (space-borne, aircraft-based and ground-based) and of detailed outputs from global-scale earth system models provides an overwhelming flow of spatio-temporal data that far exceeds data analysis capacity. While the development of statistical tools applied to climate fields is mature, the big data–induced revolution seen in health care, banking, advertising or biology has yet to be duplicated in climate science. In the last decade, however, several groups have attempted to apply the so-called Knowledge-Discovery through Data mining (KDD)1,2 to climate. KDD refers to the overall process of using data mining algorithms that autonomously identify patterns from various data sources to find, extract and identify what qualifies as knowledge, and to interpret the outcomes. It begins with choosing the tools for the data mining steps, as well as the preprocessing steps, and concludes with the evaluation and interpretation of the patterns resulting from the chosen algorithms. The KDD process, therefore, while encompassing data mining, adds several important steps.

Here we discuss key big-data challenges facing climate science, with an overview of recent efforts to apply KDD to this field, and we provide concrete examples from ongoing research. We focus on knowledge discovery using complex network analysis3 coupled to dimensionality reduction techniques with the objective of extracting and analyzing statistical interrelationships in fields of climatic interest.

Complex network analysis and climate science

It is widely recognized that anthropogenic emissions contribute to the observed rates of temperature increase, as acknowledged in the 2015 Paris Agreement.4 The fundamental scientific mechanism behind greenhouse gas-induced climate warming is straightforward and indisputable, but many uncertainties remain on the extent, patterns, and implications of changes in climate fields over space and time. We have reasonably constrained global mean trends and rates of change of heat and carbon dioxide reservoirs in the ocean, atmosphere and land over the past forty years, but we struggle to provide robust regional assessments, diagnose how modes of natural climate variability and global warming are interlinked, deduce ecosystem responses, or infer how climate change may affect weather events.5,6,7,8,9,10,11 The spatial and temporal scales involved in climate-relevant interactions are daunting. For example, greenhouse gases and aerosols modify the millimeter-scale size of cloud droplets and ice crystals, which in turn modulates the ability of clouds to reflect sunlight and planetary heat, with global feedbacks on temperature and precipitation,12,13,14 while decadal or longer-scale climate changes are felt by society through changes in the character of weather-like extreme events.15,16,17 A second challenge is associated with the inadequacy of the available observing system to sample thoroughly the spatio-temporal scales on which climate varies. Remote sensing platforms have revolutionized climate science, but satellite records effectively start in the late 1970s, while technological challenges hamper sensing key areas of climatic interest such as high latitudes or the deep portions of the oceans.18,19,20

As in other areas of science and engineering, numerical models have become indispensable for understanding climate science. In the past thirty years, they have evolved to account for an increasing number of physical, chemical, and biological processes. The end result is better numerical simulations with codes that use finer grids and include more interacting processes. Climate modelers, however, face challenges that include the multiplicity and nonlinearity of the processes contributing to the climate system, the high dimensionality of the problem, and the computational requirements.13,21 Despite substantial improvements in the representation of large-scale averages, climate models remain difficult to constrain at regional scales. The uncertainties about linkages between subgrid processes, regional-scale changes and large-scale dynamics in both observations and model outputs undermine confidence in regional-scale attribution of ongoing changes and future projections.13,22,23,24

Evaluating climate datasets and model outputs in an efficient and robust way, while gathering information about linkages between fields, geographical regions, or time intervals, is therefore a priority. This can be achieved through complex network analysis, whose premise is that the underlying topology or network structure of a system has a strong impact on its dynamics and evolution. Applications to climate science have received growing attention since 2004,25 when graph theory was applied to the investigation of global geopotential height. Network analysis has since been applied to studies of numerous climate modes,26,27,28,29,30,31 of atmospheric and oceanic circulation drivers,32,33,34,35 of precipitation in different time periods,36,37,38 and of Rossby wave dynamics.39

Generally, networks are constructed as undirected, binary graphs. A graph is a set of vertices or nodes that, in the case of climate variables, represent geographical locations and, for gridded datasets, grid points. The edges or links between the nodes are bidirectional (undirected), commonly do not carry information about the weight of the links (binary), and are inferred using simultaneous linear or non-linear similarity measures such as Pearson correlation, mutual information, or phase synchronization.27,29,39,40 Often, two nodes that are not linked according to the chosen criterion have their correlations deleted or pruned. In the case of climate fields, however, cell-level pruning can cause loss of robustness in the network inference, and methods that adopt pruning should not be used for intercomparison studies.41,42 Community detection (clustering) algorithms are commonly used to reduce the dimensionality of graphs.43
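As an illustration of the standard construction described above, the following minimal sketch builds an undirected, binary graph from a gridded anomaly field using zero-lag Pearson correlation. The function name `correlation_network`, the threshold of 0.5 and the synthetic data are assumptions made for the example and do not come from any of the cited studies.

```python
import numpy as np

def correlation_network(anomalies, threshold=0.5):
    """Build an undirected, binary graph from a (time, nodes) anomaly array.

    Nodes are grid points; an edge joins two nodes when the absolute
    zero-lag Pearson correlation of their time series exceeds `threshold`
    (an illustrative value, not one taken from the text).
    """
    corr = np.corrcoef(anomalies, rowvar=False)   # (nodes, nodes) matrix
    adjacency = np.abs(corr) > threshold          # binary: link or no link
    np.fill_diagonal(adjacency, False)            # no self-loops
    return adjacency

# Synthetic example: 240 months at 4 hypothetical grid points.
rng = np.random.default_rng(0)
common = rng.standard_normal(240)                 # shared signal
field = rng.standard_normal((240, 4))
field[:, 0] += 3 * common                         # nodes 0 and 1 co-vary
field[:, 1] += 3 * common

adj = correlation_network(field)
degree = adj.sum(axis=0)                          # number of links per node
```

Correlations below the threshold are discarded here, which is exactly the pruning step that the text cautions against for intercomparison studies.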

Recent developments in network analysis applications to climate have focused on three issues. First, it has been noted that detecting communities in climate variables requires distinguishing between dynamical links and autocorrelations44,45 because of teleconnections between non-adjacent regions and autocorrelations over different spatio-temporal scales. Second, multivariate networks46 and networks with links that account for lagged interactions38 have been developed to explore interactions between different variables and characterize time-lagged relationships. Finally, new methodologies that uncover directed or even causal relationships have been proposed.47,48


Here we focus on a network-based methodology, δ-MAPS, that we developed to robustly compare spatially consistent gridded fields. Our goal is to exemplify how data mining methods can assist with discovering important linkages, or their absence, in climate data.

δ-MAPS identifies the spatially contiguous components of a system, or domains, that contribute in a homogeneous way to the system's dynamics, and then infers their connections accounting for autocorrelations. It refines a previously proposed methodology41,42 and allows for overlapping domains and weighted links at a temporal lag, both relevant to climate fields. After the domains are identified, δ-MAPS infers a functional network between them by examining the statistical significance of each lagged cross-correlation between any two domains, calculating a range of potential lag values for each edge, and assigning a weight that is based on the covariance of the signals of the corresponding two domains. While a temporally ordered correlation does not imply causation, it provides information on the plausible directionality of interactions. Finally, each domain has a ‘strength’ calculated as the sum of the absolute weights of all links, ignoring their directionality. The greater the strength, the larger the domain's influence on the system at the temporal scales considered.
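The link-inference and strength steps can be sketched as follows. This is a simplified illustration rather than the δ-MAPS implementation: the statistical significance test is replaced by a fixed correlation cutoff (`min_abs_corr`), only the single lag with the largest absolute correlation is kept instead of a range of lags, and the weight is the lagged correlation scaled by the two full-series standard deviations as an approximation of the lagged covariance. All function names and parameter values are hypothetical.

```python
import numpy as np

def lagged_corr(x, y, lag):
    """Pearson correlation between x(t) and y(t + lag); lag > 0 means x leads y."""
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:lag]
    return np.corrcoef(x, y)[0, 1]

def infer_links(signals, max_lag=12, min_abs_corr=0.3):
    """Return {(i, j): (lag, weight)} for pairs of domain signals.

    A pair is linked when its peak lagged correlation exceeds a fixed
    cutoff, used here as a stand-in for a proper significance test.
    """
    names = list(signals)
    lags = range(-max_lag, max_lag + 1)
    links = {}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            x, y = signals[names[a]], signals[names[b]]
            corrs = [lagged_corr(x, y, k) for k in lags]
            best = int(np.argmax(np.abs(corrs)))
            if abs(corrs[best]) >= min_abs_corr:
                # Approximate lagged covariance: correlation times both std devs.
                weight = corrs[best] * x.std() * y.std()
                links[(names[a], names[b])] = (lags[best], weight)
    return links

def strength(links, name):
    """Sum of absolute link weights attached to a domain, ignoring direction."""
    return sum(abs(w) for (i, j), (_, w) in links.items() if name in (i, j))

# Synthetic domain signals: B repeats A three time steps later, C is unrelated.
rng = np.random.default_rng(1)
driver = rng.standard_normal(300)
signals = {
    "A": driver,
    "B": np.roll(driver, 3) + 0.5 * rng.standard_normal(300),
    "C": rng.standard_normal(300),
}
links = infer_links(signals)
lag_ab, weight_ab = links[("A", "B")]
```

In this toy case the link between A and B is recovered with a positive lag, meaning that A leads B, which is the sense in which a temporally ordered correlation suggests a plausible direction without establishing causation.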

Details about the methodology are provided as a Supplementary file (Supplementary Methods) and illustrations of advantages of δ-MAPS compared to standard techniques such as principal component analysis, clustering and community detection are presented in Fountalis et al.49

We present a sample of networks from two global monthly sea surface temperature (SST) reanalysis datasets, HadISST50 and COBE-SST2,51 from the fractional ice content within clouds from the MERRA-2 project,52 available from 1980 onward, and from the corresponding variables in a representative member of the Community Earth System Model (CESM) large ensemble.53 The resolution is 1.25° × 1°, and the analysis focuses on the latitudinal range [60°S-60°N] for SST and [55°S-55°N] for clouds to avoid regions where the correlation across reanalyses is generally low51 or data are not continuously available. All networks are built using detrended monthly anomalies.
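For readers unfamiliar with the preprocessing, the sketch below shows one common way to obtain detrended monthly anomalies at a single grid point. The text does not specify the procedure, so two choices here are assumptions: the seasonal cycle is the monthly climatology over the full record, and the trend removed is a least-squares straight line. The function name and synthetic series are illustrative.

```python
import numpy as np

def detrended_monthly_anomalies(series):
    """Remove the mean seasonal cycle and a linear trend from a monthly series.

    `series` is a 1-D array of monthly means that starts in January and
    spans whole years. Both choices (full-record climatology, linear trend)
    are assumptions of this sketch.
    """
    series = np.asarray(series, dtype=float)
    by_month = series.reshape(-1, 12)                        # (years, months)
    anomalies = (by_month - by_month.mean(axis=0)).ravel()   # remove seasonal cycle
    t = np.arange(anomalies.size)
    slope, intercept = np.polyfit(t, anomalies, 1)           # least-squares line
    return anomalies - (slope * t + intercept)

# Synthetic 40-year record: seasonal cycle + warming trend + noise.
rng = np.random.default_rng(2)
months = np.arange(40 * 12)
raw = (10 * np.sin(2 * np.pi * months / 12)
       + 0.01 * months
       + 0.3 * rng.standard_normal(months.size))
clean = detrended_monthly_anomalies(raw)
```

Applied to a gridded dataset, the same function would be run independently on the time series of each grid cell.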

Figure 1 presents strength maps over the period 1971–2015. Domains are similar in the reanalyses, but generally weaker in COBE. The strongest domain covers the El Niño Southern Oscillation (ENSO) region extending to 60°N with a pattern reminiscent of the Pacific Decadal Oscillation (PDO) footprint. Strong domains include the horseshoe areas north and south of the equator, the eastern portion of the South Pacific, the tropical Indian Ocean, the north Tropical Atlantic, and in the reanalyses the south Tropical Atlantic. A domain occupies the Warm Pool only in HadISST. We also verified that the ERSSTv454 reanalysis network and the MERRA-2 cloud fields presented later do not include it. In the randomly chosen CESM member no domain occupies the Warm Pool region and the south Tropical Atlantic area is extremely weak. Both features are common to all other CESM runs analyzed.

Fig. 1

SST domains identified by δ-MAPS and their strength in (a) HadISST, (b) COBE, and (c) one member of the CESM ensemble over the 1971–2015 period. The strength of the domain occupying the ENSO region (E) is off-scale and indicated atop each panel.

The connections between the strongest domains, including the Warm Pool for HadISST, and their lags are shown in Fig. 2. In the reanalyses the ENSO/PDO area is linked to all others at zero or positive lags except for the south Tropical Atlantic, which is anticorrelated and leads by 8 to 10 months. Positive (negative) spring SST anomalies in the Equatorial Tropical Atlantic and in the Gulf of Guinea indeed strengthen (weaken) the Walker circulation, modifying the equatorial winds and the eastern Pacific upwelling and favoring La Niña (El Niño) conditions the following winter55,56 through a Gill-Matsuno-type response.57 This connection is only partially counteracted by the thermodynamic link from the ENSO area into the Tropical Atlantic through the warming of the entire tropical troposphere following El Niño events58,59 and by the dynamical response of the tropical Atlantic trades to the Pacific warming.59,60,61 In CESM, links from the Pacific to the Indian Ocean and north Tropical Atlantic are stronger than observed, while the connection from the south Tropical Atlantic is missing. The relation between ENSO and south Atlantic domains is indeed weak and opposite in sign.

Fig. 2

SST network across (a) seven of the strongest domains in HadISST (the Warm Pool domain is excluded), (b) the seven strongest domains in COBE, and (c) the six strongest domains in CESM, where TAS has no links. The color of each link represents the corresponding cross-correlation. Arrows indicate lags of definite sign (positive or negative). The absence of an arrow indicates that the connection is also significant at zero lag. Some lags (not all, for clarity) are indicated.

The network analysis of cloud fields can help diagnose this common model bias.62 Despite the higher level of noise and intermittency of cloud fields compared to SST, the δ-MAPS outcome is insightful. Figure 3 presents maps of strength for all domains and links from the ENSO area for the ice cloud fraction. Focusing on the Equatorial and south Tropical Atlantic, two domains are identified in MERRA-2, with the first negatively connected to the Equatorial Pacific, and the southern one positively correlated, as expected in the thermodynamic response to ENSO; in SST these domains are merged due to the oceanic circulation. In the CESM ice cloud fraction network there is only one domain, positively, but statistically insignificantly, linked to ENSO; a weakly anticorrelated one is found entirely shifted into the northern hemisphere. The domains in MERRA-2 are used to define boxes to evaluate correlograms of SST anomalies with respect to those from the E domain (Fig. 3e–f). In HadISST (or COBE) both the thermodynamic feedback, led by ENSO and mostly effective in the southern box, and the dynamical Gill-Matsuno teleconnection, led by the Equatorial Atlantic, are identified. The latter dominates the total domain signal. In CESM the dynamical connection is mostly absent, the Equatorial Atlantic evolves independently of ENSO, and the thermodynamic link is stronger than observed63 but not sufficient to achieve statistical significance. All other 29 members of the large ensemble confirm that CESM overestimates the thermodynamic feedback and underestimates the dynamic teleconnection, which prevails only in one run. In several integrations the thermodynamic feedback is so strong that a significant link from ENSO to the south Tropical Atlantic domain characterizes the SST network.

Fig. 3

Cloud ice fraction domains identified by δ-MAPS over the period 1980–2015 with their strength (left) and link maps (right) from the Equatorial Pacific (ENSO-related) domain in (a, b) MERRA-2 and (c, d) CESM. (e, f) Correlograms between the SST domain signals of E and TAS, and between E and the signal calculated over the Eq. Atl. and S. Atl. boxes identified in the MERRA-2 network.
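A box correlogram of the kind shown in panels e and f can be sketched as below. The cosine-of-latitude area weighting, the lag range, the box limits and the synthetic field are all assumptions made for illustration; the source does not state how the box signals were averaged.

```python
import numpy as np

def box_mean(field, lats, lons, lat_range, lon_range):
    """Area-weighted (cos latitude) mean of a (time, lat, lon) field over a box."""
    lat_sel = (lats >= lat_range[0]) & (lats <= lat_range[1])
    lon_sel = (lons >= lon_range[0]) & (lons <= lon_range[1])
    sub = field[:, lat_sel][:, :, lon_sel]
    weights = np.cos(np.deg2rad(lats[lat_sel]))[None, :, None]
    weights = np.broadcast_to(weights, sub.shape)
    return (sub * weights).sum(axis=(1, 2)) / weights.sum(axis=(1, 2))

def correlogram(reference, other, max_lag=12):
    """Correlation of reference(t) with other(t + lag) for each lag.

    A peak at a positive lag means the reference signal leads.
    """
    lags = np.arange(-max_lag, max_lag + 1)
    n = reference.size
    values = []
    for lag in lags:
        if lag >= 0:
            r, o = reference[: n - lag], other[lag:]
        else:
            r, o = reference[-lag:], other[: n + lag]
        values.append(np.corrcoef(r, o)[0, 1])
    return lags, np.array(values)

# Synthetic anomaly field on a small hypothetical grid.
rng = np.random.default_rng(3)
lats = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
lons = np.array([0.0, 10.0, 20.0, 30.0])
reference = rng.standard_normal(240)                  # stand-in for the E domain signal
field = 0.5 * rng.standard_normal((240, lats.size, lons.size))
field[:, 1:4, 1:3] += np.roll(reference, 2)[:, None, None]  # box follows 2 steps later

box = box_mean(field, lats, lons, lat_range=(-5, 5), lon_range=(10, 20))
lags, values = correlogram(reference, box)
peak_lag = lags[np.argmax(values)]
```

Reading off the sign of the lag at which the correlogram peaks is what distinguishes a response led by the reference domain from one led by the box.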

Discussion: a way forward

To understand past, present and future changes in our climate, it is essential to leverage advances in KDD research while accounting for the characteristics of climate data. KDD methods that do so can effectively aid scientific theory and should be integral to any interdisciplinary framework to quantify uncertainties in climate projections or to unveil linkages between perturbations to the climate system and its response. δ-MAPS, for example, infers the high-level abstract linkages across components of the climate system,49 highlights quantifiable differences across datasets, and provides a reduced-form model that can be continuously informed by data updates. It is therefore uniquely suited to assess impacts, evaluate model performance and biases, and characterize pathway scenarios, climate trajectories, and the propagation of perturbations from local forcing agents (e.g., aerosols) across climate fields.

Immediate applications range from diagnosing representation and changes in teleconnections (or connectivity, in the case of ecosystems) over space and time, to aiding adjoint models in a general framework for regional or global attribution studies.

Data availability

All data sets used are publicly available. The software for δ-MAPS is available at