A new CAM6 + DART reanalysis with surface forcing from CAM6 to other CESM models

An ensemble Kalman filter reanalysis has been archived in the Research Data Archive at the National Center for Atmospheric Research. It used a CAM6 configuration of the Community Earth System Model (CESM), several million observations per day, and the Data Assimilation Research Testbed (DART). The data saved from this global, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim 1^\circ $$\end{document}∼1∘ resolution, 80 member ensemble span 2011–2019. They include ensembles of: sub-daily, real world, atmospheric forcing for use by all of the nonatmospheric models of CESM; weekly, CAM6, restart file sets; 6 hourly, prior hindcast estimates of the assimilated observations; 6 hourly, land model, plant growth variables, and 6 hourly, ensemble mean, gridded, atmospheric analyses. This data can be used for hindcast studies and data assimilation using component models of CESM; CAM6, CLM5, CICE5, POP2. MOM6, MOSART, and CISM; and non-CESM Earth system models. This large dataset (~ 120 Tb) has a unique combination of a large ensemble, high frequency, and multiyear time span, which provides opportunities for robust statistical analysis and use as a machine learning training dataset.

"Data assimilation" ("DA") is the term used in many geophysical sciences for the merging of observations with a model state created by a (usually) numerical model of the physical system. The model state used in this dataset is a "hindcast" because it represents a past state 1 . The model state is referred to as "prior" to the assimilation. The result of assimilating observations into a hindcast is a "reanalysis", which is a better description of the system than either the observations or the model state individually. Observations and model hindcasts have both valuable information and errors. Successful DA keeps the information from both and reduces the errors. It also reduces the prior uncertainty 2,3 .
The reanalyses created by the Data Assimilation Research Testbed 4,5 (DART) use an 80 member ensemble of similar hindcasts, which leverages the power of statistics to give a more comprehensive estimate than a single hindcast can provide. The ensemble of reanalyses is a sample of the probability distribution of the physical system after the observations are assimilated (the "posterior") 6 .
The mean of this ensemble can be viewed as the most likely weather. The spread of the ensemble tells us how much confidence to have in it (smaller spread implies more certainty) or how much variability we should expect.
DA was developed to improve the initial conditions used in numerical weather forecasts 1 , but its use is spreading into studies of other Earth system components; the oceans, land, cryosphere, etc. These components are strongly forced by the atmosphere. In order to successfully model them, the atmospheric forcing must be specified correctly, both its mean and variability. The first motivation for the creation of this dataset was to provide that forcing in the context of an Earth system modeling framework, in which the atmospheric forcing can be applied to the nonatmospheric components consistently and conveniently. This is accomplished by running an atmospheric reanalysis for as long as possible and archiving all of the fluxes and other variables as frequently as required by the nonatmospheric models. Then the nonatmospheric models can be run repeatedly without the cost of regenerating the atmospheric forcing. The nonatmospheric model runs could be single or ensemble hindcasts to study the model behavior, or they could be the hindcasts used in generating reanalyses of the nonatmospheric components (see Fig. 1 for an illustration).  www.nature.com/scientificreports/ set, docum ented here). This reanalyis used the compset HIST_CAM60_CLM50%BGC-CROP_CICE%PRES_ DOCN%DOM_MOSART_SGLC_SWAV which is parsed as in Table 1 . HIST does not describe a component. It means that initial conditions are historical (not from a climate projection). CAM's "resolution" variable defines the horizontal grid and the dynamical core, which in this case is the Finite Volume core, which is defined on a logically rectangular grid with spacing ∼ 0.9 • in latitude and 1.25 • in longitude. The land model (CLM5) uses the same grid. The vertical resolution is defined by the choice of the atmospheric component. For CAM6-FV it is 32, hybrid σ + pressure levels, which are terrain following ("σ ") near the surface (the pressure on a level varies from place to place), constant pressure surfaces near the top, and a mixture of the two in the middle. The levels range from 7.44 hPa above the model ground up to a height of 3.64 hPa.
The data ocean model reads sea surface temperature (SST) and sea ice coverage data from files containing AVHRR data. The spatial resolution of this data is higher (0.25 • ) than CAM. The temporal resolution is nominally daily, but represents an ocean state which is a weighted average of data from several days. See "Sea surface temperature and ice coverage" for details.
Additional external forcing of CAM includes specified aerosols, greenhouse gases, and volcanic forcing, from CESM datasets. These are historical data for dates through 2014, and use CMIP6 scenarios after that. More information about these datasets is in "Files that force CAM".
Assimilation software. DART provides an extensive choice of algorithms for ensemble Kalman filter data assimilation 4 . Based on extensive experimentation, the choices in Table 2 define the assimilation algorithm which generated these reanalyses.
The model state is chosen to be the set of prognostic variables in CAM, which are required to define all other variables at the beginning of a hindcast. The models' states are stored in CAM "initial" files, from which both CAM and DART read them, and to which both write updated model states. See Table 6 for the list of variable names, which define the model state in a namelist in the input.nml file. Although the model state is defined using only atmospheric variables, and all observations are atmospheric, the land and sea ice variables evolve  www.nature.com/scientificreports/ consistently with the atmosphere, due to the coupling in the model. Those variables are passed along from cycle to cycle by being stored in those components' "restart" files. Note that there is no specification of the error covariances to use in the assimilation. The reason is that in the ensemble Kalman filter the background error covariance is never explicitly calculated. It is implicit in the algorithm, which calculates, at each time and for each observation, the correlation between the ensemble of model estimates of the observation and the ensemble of each model state variable. Large correlations result in large increments added to the state variable, and small correlations result in small increments.
Data management. The assimilation process generates a large amount of data in the form of large numbers of "small" files (Kbytes to several Gbytes), which are useful in the moment, but only some of which need to be archived. Those which need to be archived have been combined into files which have a convenient size for future use and make the data easier to find in the archive. In addition, they are compressed for efficient storage and transfer to where they are needed. This is described in "Data records".
CESM model state at the start. The CESM compset used in this reanalysis has active components with a broad range of dynamical memory, due to differing levels of nonlinear behavior and inertia in the physical systems. The atmosphere has some long-lived constituents, which influence the weather, but their persistent influence is overwhelmed by the model's perturbation error growth. Starting hindcasts with tiny differences in initial conditions and running them for 2 weeks results in model states that appear to be unrelated, other than being in the same season. There is also a relatively high density of atmospheric observations, so the model state can be brought into agreement with them in a time span measured in days. In contrast, the land model has large reservoirs of some constituents, which react slowly and non-chaotically to forcing from the atmosphere. There is a relatively low density of observations, especially below a few meters below the surface, so bringing an arbitrary land model state into agreement with observations can take months to decades of assimilation. This reanalysis dealt with those issues by starting an atmospheric assimilation 6 months before the reanalysis start date. This "spin up" assimilation used an ensemble of CLM model states, which itself had already been spun up for 12 years, forced by a "2 degree" CAM4 ensemble reanalysis.

Validation.
We include a subsection about validation in "Methods" section because the results of our validation calculations are, themselves, one of the products in this dataset. They are available for further use, so the methods used to derive them should be described here. Further validation details are in "Technical Validation".
One of the most important aspects of the assimilation to keep in mind is that it balances the uncertainty in the model, as represented by its ensemble spread, and the uncertainty in the observations, as represented by the observation error variance. This balance is a function of field, location, and time, due to the heterogeneous distribution of observations and the range of weather phenomena which the model must describe. If the uncertainty in the model becomes too small, the assimilation will reject observations that appear to be outliers, even though they are valid observations. This can allow the spread to become even smaller, and so on, until the model has diverged from the observations and ignores almost all of them, in which case the assimilation has failed. This is prevented by the ensemble spread "inflation" algorithm 14 .
Observation space diagnostics. Assimilation health can be monitored by processing the assimilation output periodically to examine the fraction of observations used as a function of time, and the bias and RMSE relative to the observations. The latter are calculated by comparing the model mean estimates of the observations to the actual observations at the observation locations, i.e. in "observation space". We do not use the posterior model state because it may be overfit to the observations, which would give an overly optimistic representation of the assimilation. The assimilation results are compared against each observation type separately, e.g. a figure may show a time series of the bias of the model estimates of temperature relative to the radiosonde temperatures that were assimilated, but not relative to the aircraft temperatures, which are taken at different locations and times and may have different biases. Details of the DART diagnostic software and how to do further validations are described in "Observation sequence files".
These ideas are illustrated in Fig. 4, which is derived from the "spinup" assimilation, which ensured that the initial ensemble of the reanalysis represented a stable pattern, with no remnants of the arbritrary initial conditions. The figure displays a useful quantity we call "total spread". Since both the hindcast and observations have uncertainty associated with them, it is helpful to view them as probability distributions, each with its own spread around its most likely value, those being the ensemble spread ( σ model ) and the observation error standard deviation ( σ obs ). Whether the two distributions are distinct or essentially indistinguishable depends on both spreads, as well as the difference between the means. The total spread is a combination of the two spreads, σ total = σ 2 obs + σ 2 model , which can be used as a measure of whether the RMSE is "large" or not. The model state can be arbitrarily accurate, given enough good observations, so the RMSE can be arbitrarily small. The totalspread, however, is always at least as large as the observation error standard deviation, which is independent of the ensemble model state and the assimilation. So we expect the RMSE to be less than or equal to the total spread. In practice, the RMSE is often less than the total spread, which indicates that the observation error is larger than it needs to be.
In Fig. 4 the initial ensemble has a tiny spread, so the total spread is small. This causes the assimilation to ignore more than half of the observations because it considers them to be impossible outliers. Note that the RMSE of the initial ensemble is relatively large, even though the observations that are farthest from the ensemble have been excluded from the RMSE calculation because the observations were rejected. So the ensemble has very www.nature.com/scientificreports/ high confidence in a poor estimate of the temperature. However, almost half of the observations are assimilated, so the model state is pushed towards those observations. At the same time, the assimilation sees that the model estimate is incorrect and it increases the size of the ensemble spread (reduces its confidence) via the adaptive inflation algorithm. Around day 5 the ensemble spread has become large enough that it allows the number of observations assimilated to increase. Simultaneously the RMSE rises, as larger numbers of observations are used and they reinforce the conclusion that the ensemble estimate is poor. Around day 7 the totalspread has increased enough to permit the assimilation of almost all of the observations, and to allow the ensemble to agree with them. By day 10 the RMSE has fallen by about half and virtually all of the observations are being used. This signals to the assimilation that the ensemble now has a good estimate, so it allows the ensemble spread to decrease. By the end the ensemble has high confidence in a correct estimate, which is confirmed by the small RMSE. At that point the assimilation is considered to be healthy: almost all of the observations are being assimilated and the RMSE is approximately the same size as the total spread. If the RMSE in that figure is replaced by the bias, then it indicates possible bias in the model and/or the observations. If the biases relative to the observations from a variety of platforms (radiosondes, satellites, ...) are of the same sign, then it is reasonable to conclude that the model is biased. If the biases are of different signs, then it seems likely that some of the observations are biased, and possibly the model too.
This dataset was checked after every month of assimilation. The graphical results of that are available as described in "Statistical representations of observation space performance". The files which were processed ("observation sequence files") are also available for further evaluation. There were no instances of the ensemble diverging catastrophically from the observations. In fact, the filter program never failed for algorithmic reasons (there were plenty of machine hardware and software failures).
State space. The assimilation can cycle through months in a fairly stable, healthy state, but occasionally there are features in the observation space diagnostics which merit investigation. For example, the number of radiosonde observations might suddenly increase, and, possibly with it, the RMSE. This is often caused by the appearance of a hurricane, which prompts weather services to take extra observations, as can be seen in Fig. 2a (in the Gulf of Mexico). This can be verified by examining the assimilation output which is on the model grid, often referred to as "state space". If a hurricane is visible, and maps of the observation locations coincide with it, then the issue is resolved. Other features of the assimilation are more subtle, and may indicate that the quality of an observation type is evolving, or that something about the model is changing from year to year, such as the external data forcing files it reads.
Community evaluation. This ensemble reanalysis is a large and varied dataset, which has been evaluated for only basic signs of correctness. There are many opportunities for researchers at all stages of their careers (including students) to use or probe the data and publish the results. We welcome interactions ranging from simply learning about results, to answering questions, to in-depth collaborations.  Surface forcing files. The physical system components of CESM influence each other via fluxes and component variables at the Earth's surface. CESM's coupler mediates this communication and can write those fluxes and variables to "history" files for later use. This reanalysis produced the five different coupler history files needed to run combinations of CESM components which require a data atmosphere (with a few, rarely used exceptions). These files are intended to be used only in this way, so we will not describe each variable in each file, but will focus on how to use them. The files are in NetCDF format and contain useful metadata, such as variable long names, so individual variables can easily be extracted and evaluated, but researchers should be careful to learn the exact nature and provenance of those variables. Metadata can be misleading. The files are written at the end of each 6 h hindcast, but contain data with a variety of frequencies, as appropriate for each variable. At the end of each year they are concatenated into year long time series for convenient archiving and later direct use by CESM. They are also compressed for efficient archiving and copying to local computers. Table 4 describes the CESM coupler history files and which CESM components need which ones. An example of a file name is f.e21.FHIST_BGC.f09_025.CAM6assim.011.cpl_0080.ha2x1d.2011.nc.gz which can be parsed using Table 3. In addition, the filetypes can be parsed as follows: h = (coupler) history file a,r = the component which provides the forcing fields ("atm" or "river") 2 = "to" x = all other components d,h = the frequency of the data (days or hours). "Day" is actually a 6 h span, due to the length of the hindcasts. i = flag that the data is instantaneous, instead of averaged (the default).
NOTE: CESM generally interpolates these data to the start time requested by the user. In order to start a hindcast on the first date of the dataset (2011-01-01-00000), the 2011 forcing files must contain data from the end of 2010. This sometimes raises questions about whether the "time" variable, and "time:units" and "time:bounds" metadata are all correct. They are.  The list of files in such a file can be found in Table 5. The variables in some of the files will be described in subsections, below. These restart sets can be used to start single or ensemble hindcasts, and to start data assimilation experiments. Each ensemble member is an equally likely model state, so they can be chosen at random, unless a specific atmospheric feature is needed, such as the strongest hurricane in the ensemble. They are available every Monday at 00:00 UTC.
CAM initial files. In this reanalysis CAM6 is instructed to write out an ensemble of "initial" files at the end of each hindcast. See Table 5 for an example file name. They contain the variables which comprise the DART prior state vector, and enough other variables to start a CAM6 hindcast (Table 6). In contrast to the "restart" files, they cannot resume a previous hindcast in a continuous manner. Initial files were chosen over restart files for the storage of the state variables for several reasons: • smaller size (less than 1/3), • more intuitive variables (e.g. sensible temperature instead of scaled virtual potential temperature), • better metadata, • no need for identical hindcast continuation.
That last reason is because DART changes the model state variables between the end of the previous hindcast and the start of the next, so identical continuation is not possible.
Restart files. The initial intended use of these file sets was to provide convenient dates for restarting assimilations due to computing environment difficulties, or to branch experiments, which tested aspects of the assimilation. They will hopefully prove useful for starting (ensemble) hindcast experiments and further CAM assimilation experiments. Due to their low frequency (weekly), and the fact that the contents are not easily or completely described, they do not lend themselves to meaningful evaluation, so no detailed table of variables will be provided here. Table 5. The list of files in a "restart set", their filetypes, and example file names (CASENAME = f.e21.FHIST_ BGC.f09_025.CAM6assim.011). The history files are not required for an exact restart of the model, but are archived with the restart files to maintain continuity of optional time series of (time averaged) fields requested from the case. The relationship between history and restart history files is described in CESM Post Proce ssing Data. The DART inflation restart history files are not used during CESM's restart process, but are used by the assimilation. They are also archived more frequently with the other DART output as described in "Ensemble mean and spread files". www.nature.com/scientificreports/ For the curious, here is an overview of the file contents. The CAM restart files contain similar variables to the initial files, but many more (184 versus 36). There is no metadata for many of them and some do not even appear in the CAM log file "MASTER LIST". The CLM restart files have 276 variables, whose metadata includes their long names, and 143 more which lack that. The coupler restart files contain 203 variables, many of which are related to the variables in the coupler history ("forcing") files. They contain additional fluxes from non-atmospheric (or river) components, such as CICE and CLM. The CICE restart files contain 54 variables with no metadata for them. Consult the User's Guide if needed. The river runoff (MOSART) restart files contain 14 variables on a longitude-latitude grid, with long name metadata.
Inflation files. Following Bayes' theorem, the prior distribution (represented by an ensemble) is combined with the observations at each assimilation cycle to produce a posterior. If the spread of the prior is small due to model and sampling errors or if the observation estimate is overconfident, the posterior will also be overconfident. This is easy to view because after applying Bayes' rule, the posterior variance will be smaller than both the prior and observation error variances. The easiest way to fix that is to "inflate" the ensemble; increase the spread of the ensemble around its mean, while preserving the mean. In large, complex models such as CAM6, an effective algorithm for doing that is adaptive inflation, which defines an inflation value (and standard deviation) at each model grid point for each state variable, and evolves them in time in order to adapt to the evolving distribution of observations and model state. The inflation can be applied to the prior or posterior ensemble. Studies have shown that prior inflation can mitigate model biases while posterior inflation is effective at reducing sampling errors 20 . Model biases often dominate sampling errors. In this reanalysis study, we use the adaptive prior inflation algorithm described in El Gharamti et al. 14 .
This evolving inflation data is stored in "inflation restart files" during each assimilation cycle. The product of the algorithm is essentially a distribution, so the inflation is written out as a pair of mean and standard deviation inflation files. This should not be interpretted as the inflation of the model mean and the inflation of the model standard deviation; these files describe the state of the inflation algorithm. Examples of those file names are in Table 6. The CAM6 variables which are written to the initial files. The state variables are those that are directly changed by the assimilation. The rest evolve in response to the changes to the state variables as well as the model physics and dynamics. The "modes" in the MAM4 19 modal aerosol scheme are "accum"(1), "aitken"(2), "coarse"(3), and "primary_carbon"(4). www.nature.com/scientificreports/  www.nature.com/scientificreports/ the restart file Table 5 and have the filetypes rh.cam_forecast_priorinf_mean and rh.cam_forecast_priorinf_sd. These files have variables named the same as the model state variables (see Table 6), but their contents are inflation values, defined by the filetype. The inflation fields are useful for evaluating the assimilation performance, and they are relatively small, so they are archived for every cycle. The subset of these files, which are archived as part of the restart file sets, are in archive files having "infl_log" in their names, e.g. as in the restart files outline. See Fig. 5 for examples of the inflation mean fields.
Atmosphere model ensembles. The whole ensemble of prior model states is archived at least weekly in files of the form f.e21.FHIST_BGC.f09_025.CAM6assim.011.cam_allinst.e.forecast.YYYY-MM-DD-SSSSS. tar (not to be confused with the "alltypes" files in restart sets). The posterior versions of the contents of these files are archived in the CAM initial files in the (weekly) restart sets (see Table 5) [NOTE: When the initial files were written they contained the prior model states (see "CAM Initial Files"). After the assimilation they contain the posteriors]. The assimilation innovations can be calculated by subtracting a prior (forecast) variable from its posterior (initial file) version, e.g. for temperature; T innovation = T init − T forecast . NOTE: Similar ensemble files with filetype "preassim" contain the inflated ensemble, which cannot be used to calculate assimilation innovation without deflating the ensemble first, using the inflation values described in "Inflation time series". This is in contrast to the innovations of the mean, which can be calculated from the ensemble mean of either "forecast" or "preassim". See "Ensemble mean and spread files". Assimilation files. The program "filter", which does the assimilation, writes out a variety of files, which were used to evaluate the health of the assimilation, but can also be a source of research data. They are described in detail in the following subsections.
Observation sequence files. DART currently requires that the observations to be assimilated and the metadata describing them be packaged into a custom formatted file. The assimilation process generates an ensemble of model estimates of each observation, which are written, with the actual observations, to an "obs_seq_final" file using the same custom format. In this reanalysis the files are binary for compactness.
At the end of each month of assimilation the obs_seq_final files are compressed into a tar archive for archiving in the RDA. The file names have the form f.e21.FHIST_BGC.f09_025.CAM6assim.011.cam_obs_seq_final.

YYYY-MM.tar
The observation types, which were assimilated (e.g. RADIOSONDE_TEMPERATURE), are a subset of the types available in the obs_seq_final files. Other types were merely "evaluated" during the assimilation, meaning that filter calculated the ensemble model estimates of the observations, but did not change the model state to fit them better. These are often referred to as "withheld" observations, which provide an independent verification of the assimilation. Table 7 lists the observation types and how they were used.
Statistical representations of observation space performance. In order to calculate the statistics of the observation data in the obs_seq_final files, as discussed in "Observation space diagnostics", program "obs_diag" harvests data from them, performs some of the statistical analysis, and writes the results into a NetCDF file ("obs_diag_output. nc"). This workflow, and related workflows, are outlined in Fig. 6. Obs_diag enables users to focus the evaluation in many ways. The following features of the evaluation can be specified as inputs to obs_diag: • observation types • horizontal and vertical domains • date range of the data • statistics or variables displayed (bias, RMSE, total spread, rank histograms, and more) DART provides Matlab © scripts to read the obs_diag_output.nc files, finish the statistical analysis, and display the results. (see "Assimilation diagnostic files" for guidance about replacing the Matlab © scripts by similar software of the user's choosing.) The resulting files are packaged into files of the form "Diags_NTrS_YYYY-MM.tgz", which contain the following files, most of which are described by a template due to the large number of similar names. Not all combinations of TYPE and QTY could be generated; see Table 7 Fig. 4 and is discussed in "Observation space diagnostics". A more condensed view of some of the data in this reanalysis is shown in Fig. 7. This picture provides examples of many features of the assimilation and evaluation, which are helpful for understanding the data products. The statistics shown there are calculated at the observation locations, so any interpolation is performed on the model state, not on the observations. The vertical dimension is divided into 14 layers, each of which contains a standard or mandatory pressure level of the atmosphere. The layers are not the model vertical levels. The figure also shows the number of observations available and the number assimilated in each layer, which provide guidance about the health of the assimilation (a large fraction assimilated is healthy) and about the trustworthiness of the RMSE and bias (small number assimilated = untrustworthy). Note that there are no observations assimilated in the top two layers, due to a choice to avoid assimilation in the top levels of the model, which are damped and don't respond well to assimilation.
The observations in Fig. 7 come from low orbit satellites, such as COSMIC, which very accurately measure the transit time of radio signals emitted by Global Positioning System satellites. The times are distorted by the atmosphere, which makes it possible to calculate the atmospheric index of refraction. The index is reported as the observation, but is more accurately called a "retrieval" because it is derived from the raw measurements. Note that, despite the fact that the retrievals are not model state variables, they can be assimilated and provide information which is useful for updating the model state variables. The actual index values are near 1.0, but the reported numbers have the form n rep = (n whole − 1.0) × 100 in order to focus on the variable part of the numbers. The index of refraction is a function of atmospheric density, which decreases exponentially with height. The profiles have been normalized by the average n rep in each layer, so that the curves in the upper layers don't appear to be "zero". During this month in this region, all of the biases happen to be ≥ 0 , which reveals that CAM6 has Figure 6. A map of the observation space diagnostics software available in DART. The CAM6+DART cycling (Fig. 2) creates a time series of binary obs_seq_final files, which have a custom format. They are easily converted to NetCDF using DART's obs_diag.f90 and obs_seq_to_netcdf.f90 programs. The obs_diag_output.nc file has the time series concatenated into it. The obs_epoch_#.nc files map one-to-one to the obs_seq_final files, and are numbered sequentially. Then a wealth of statistics ( www.nature.com/scientificreports/ a denser atmosphere than the observations indicate. Further probing reveals that a cold bias explains some of this density bias (not shown). This reanalysis reveals a variety of persistent biases in the model and/or in some of the observation platforms. The RMSE and bias shown in Fig. 7 are two examples of the quantities available in the obs_diag_output.nc files for visualization. The complete list of these "copies" is shown in Table 8, and is taken from the variable CopyMetaData in those files. Some of the temperature bias mentioned in the discussion of Fig. 7 can be seen in Fig. 8, which shows pictures from the Diags_NTrS_2019-07.tgz file. In contrast to Fig. 4, these time series show an assimilation that has settled into a stable, healthy pattern. There is no obvious trend in any of the statistics, including the fraction of observations used. The fraction of RADIOSONDE_TEMPERATURE observations used is quite small in the layer displayed here because this reanalysis did not assimilate any observations below the bottom model level, which is ~ 52 m above the surface, and the surface itself is smoothed to be suitable for the model's 1° resolution. The fraction used is much higher in higher layers and for other observations, such as AIRS (bottom of Fig. 8). A few of the time slots show relatively large (or small) RMSE and bias, which is almost always due to there being very few observations at that time, which can result in randomly bad (or good) agreement with the observations.
Since the bias of the model relative to both observation types is consistently ~ − 1 K, it is fair to conclude that the model probably has a deficiency. This bias persists despite the assimilation frequently pushing the model state towards the observations, so the bias in a free running hindcast would be larger. We should keep in mind the other possibility, that both observation platforms may be biased compared to reality and that the model is closer to the truth than both observation types, but we can't confirm that from anything in this assimilation. The RMSE represents a combination of the bias and random error. A large portion of this RMSE is due to this bias, implying that the distribution of estimated observations is relatively narrow, which in turn implies that the model estimates are truly inconsistent with the observations and not just "different within the noise". These www.nature.com/scientificreports/ biases are of the prior model states, which are in the atmospheric "Surface Forcing Files". In contrast, the CAM initial files in the "Restart File Sets" contain posterior fields, which will generally have smaller biases because the assimilation has pushed those model states closer to the observations.
Ensemble mean and spread files. Filter also calculates the ensemble mean and spread in state space (on the model grid) and writes those to NetCDF files. Those files are written at two "stages"; the "forecast" and "output", which are the "prior" and "posterior" states of the ensemble, respectively. Those features are summarized in the filetype part of the file name, e.g. "output_sd" means the standard deviation (spread) of the output ensemble. They are available every 6 h. The impact of the assimilation on the model state ensemble mean can be calculated as "innovation = output -forecast". Note that the "output" files have filetype ".i. " to signal that their contents describe the files which will be used to initialize the next hindcast, while the "forecast" files have filetype ".e. " to signal that they describe model data which will be modified by DART. All of these files contain only quantities describing the model state variables, (see the list in Table 6). NOTE: Some years have "preassim" files instead of forecast. The ensemble means of those 2 stages are identical, by the design of the inflation algorithm, so both the preassim and forecast means can be used in the mean innovation calculation. This is in contrast to the innovations calculated from ensemble members.
Inflation time series. The dart.rh files are concatenations of the inflation restart files, which are described in "Inflation Files" (p11), so they are 6-hourly time series. The rh.cam_output_priorinf* filetypes contain the same types of data as the rh.cam_forecast_priorinf*, but the fields have been updated by the assimilation. They were used by the subsequent assimilation.
Land model time series. The CLM namelists instructed it to write out "history" files of fields describing the evolution of plant growth under the influence of an ensemble of realistic atmospheric forcing. The fields were written to two files to separate the time averaging applied to each. Filetype "h0" has instantaneous values. Filetype "h1" has time averaged values. Files were written at the end of each 6 h hindcast, then concatenated into year long time series, and compressed for archiving. The requested fields, along with fields written by default, are listed in Table 9. Table 8. Obs_diag_output.nc files contain variables with a dimension ("copy"), which organizes the statistics of the contents of the observation sequence files (obs_seq_final). They are ordered in that file according to the CopyMetaData variable. Bold text highlights statistics which are displayed in the Diags_NTrS_YYYY-MM. tgz files. Further details can be found in the DART software package; . . ./your_DART/assimilation_code/ programs/obs_diag/threed_sphere/obs_diag.html.

Variable
Description Use(s)

Nposs
Number of observations available Helps assess the assimilation health www.nature.com/scientificreports/ Files that force CAM. The component set chosen for this reanalysis uses a "data ocean" as a time evolving boundary condition which forces CAM6. While that dataset completely specifies the influence of the oceans on CAM, other datasets supply partial influences. These are added to related, internally generated fields, which are read from the initial files. One example is aerosols, which evolve according to model physics and chemistry, but also have contributions from sources outside of the atmosphere, such as industrial emissions. The files which provide this "external forcing" are provided in a CESM archive. Installation of CESM on most supercomputers includes at least parts of this archive in a location which is accessible to everyone. If these files are not in that archive, request that they be downloaded from the CESM data repository, which can be accessed by following instructions in the CESM User's Guide. The namelist variables which specify the needed files are describe in the CESM2 CAM6 namel ist varia bles page.

Log files.
Further details about the hindcast model are in the weekly "infl_log" files. These include log files from the overall job ("cesm"), one of the 80 couplers ("cpl_0001"), and one each of the 5 nonstub components; cpl_0001, ice_0001, lnd_0001, ocn_0001, rof_0001. Log files from the assimilations are not archived in the RDA collection, but are available on NCAR's Campaign Storage file system for a limited time. The most important aspects of the assimilation setup and results are described here and in the RDA dataset.

Technical validation
The intent of this project is to provide a useful representation of Earth's actual weather for use in Earth system research, especially ensemble data assimilation. It is not to provide the best atmospheric reanalyses possible. As in any reanalysis, the fidelity to the weather depends on the types and numbers of observations at the time and in the area of interest, as well as on the ability of the model to simulate the atmospheric state. The validity of the data is described by some of the dataset itself, as noted in the "Validation" section. DART also provides methods for further evaluating these data products when a more focused evaluation is needed. See "Observation space diagnostics", "Statistical representations of observation space performance", and "Ensemble mean and spread Monthly observation space evaluation. Much of the technical evaluation of this reanalysis is contained in the monthly, observation space pictures generated by DART's Matlab © scripts. That process is described in "Methods" and "Statistical representations of observation space performance". In addition to examining individual months, we compared the same month from all years to evaluate long term trends in the assimilation quality. Examples are shown in Fig. 9, which are not included in the reanalysis archive (e.g. Diags_NTrS_YYYY-MM.tgz) but could be recreated from the available data. Note that in Fig. 9a there is a dramatic decrease in the RMSE in layers 525 hPa and 400 hPa during the last 4 years and an increase in the number of observations assimilated. We believe that that this is due to an improvement in 2016 in the algorithm used to calculate the wind from the movement of clouds. The RMSE values in the 100 hPa layer should be ignored because there were hardly any observations there, so the statistics are not robust. Figure 9b shows generally larger temperature biases relative to radiosonde observations in the lower layers in recent years. We have not found an explanation for this trend, but have not seen any evidence that the reanalysis is failing. This sort of behavior could be the subject of further research.
Ensemble spread. The observation space diagnostics discussed in the previous sections show the fidelity of the model state ensemble mean to the observations. They basically answer the question "Is the model's most likely estimate of the weather consistent with the observations?" An equally important question is "Is the consistency due to chance or to high certainty in the model ensemble?" That question is answered by the prior ensemble spread, which varies widely with location, time, and the atmospheric field of interest. An example is shown in Fig. 10. There is some correlation between the weather (ensemble mean, bottom figure) and the ensemble spread (top figure) in areas which are not well observed, such as the southern oceans. Well observed areas, such as over continents and down wind from them, have very small ensemble spread, indicating high confidence in the means.

Use of the surface forcing files in CESM.
Multiple tests have demonstrated that the coupler history files ("forcing files") can be read by CESM and provide atmospheric forcing for data atmosphere configurations of CESM ("DATM"), such as compsets whose short names start with "I" (CLM), "C" or "G" (POP2), and "D" or "G" (CICE). See Table 4 from "Surface Forcing Files", Fig. 1, and the CESM compo nent set web page for details.

Reproducibility.
A reanalysis such as this is a complex activity, involving many layers of software, evolving computing environments, and human intervention. Fortunately, the core activity is comparison of model output Table 10. Information about files which CAM6 reads to get data describing external contributions to chemical and aerosol constituents. SERIAL; extract data which has the current model date. CYCLICAL; extract data from the specified year from the file. INTERP_MISSING_MONTHS; extract data from the 2 years in the file, which bound the current model year, and from the current model month of those years. For example, if the curent model date is 2018-06-12 and the file has years 2015 and 2020 in it, then data from 2015-06 and 2020-06 will be used to interpolate to 2018-06. The 6 digits following the _c encode the creation date (YYMMDD). The other 6 digit numbers are year + month combinations. www.nature.com/scientificreports/ against observations, so the end product is closely tied to reality and is verifiable. Our intent in providing this dataset to the community is to prevent the needless reproduction of such a large amount of data. However, some reproducibility exercises are valuable for verification of the data products, the software used to create them, and how well data users understand it all. There are at least two varieties of reproducibility that are desirable from the various subsets of data; bitwise or statistical. Although the hindcast model and DART are designed to be capable of bitwise reproducibility, the fundamental nature and goal of ensemble data assimilation is a statistical description of the physical system. Two assimilations which use fundamentally the same hindcast model and the same assimilation algorithm, but different initial ensembles, should generate ensembles which are statistically indistinguishable after the typical spinup period. Conclusions drawn from such ensembles are robust and meaningful, while conclusions based on individual members or details of the differences between the the ensembles are generally not robust and are suspect.
Bitwise. The extent to which the results are bitwise reproducible will depend on many factors, including availability of the same or equivalent hardware, compilers, and libraries. The farther removed in time the reproduction effort is, the less likely it is to succeed, and the more effort it will take. For example, at one point some desirable output was lost during the archiving process. We attempted to recreate the data, after a significant computer upgrade, by recompiling the programs with the updated compilers to ensure consistency with the updated environment. We were unable to bitwise reproduce the data in the available time, but we discovered that the original executables, compiled before the upgrade, still reproduced earlier results exactly.
Statistical. When attempting to reproduce this reanalysis it is necessary to define an acceptable disagreement between the reproduction and the original. There is a range of standards which could be used, depending on the intended use. Some examples are, in roughly increasing precision: Are the differences between the means smaller than the model state changes between assimilation times? ...smaller than the combined uncertainties of the two distributions? ...smaller than the assimilation innovations? www.nature.com/scientificreports/ Are the distributions indistinguishable using formal statistical methods of comparison, which take into account the possibility that the distributions are not random draws from the continuous distribution which represents the atmospheric variable?
The last, strictest standard is difficult to define and calculate, even in the case of ensemble hindcasts with no data assimilation, as described by Milroy et al. 21 . We are not aware of any adaptations of those techniques to ensemble data assimilation.

Usage notes
Surface forcing files. CESM uses the coupler history (DATM surface forcing) files by reading their names and locations from "stream" files in the CASEROOT directory where the experiment is set up. Their names have the form "user_datm.streams.txt.prescribed_INST" where INST is an "instance number" (ensemble member, padded with 0s). In the context of a DART assimilation experiment, these stream files are created during the set up of the CASE from template files found in: ${DART}/models/{cice,clm,POP}/shell_scripts/streams.txt In those directories there are also scripts with "setup" in their names, which can be used to set up assimilation experiments, including the creation of the user_docn.streams.txt.prescribed_INST files. Before use, the coupler history files (e.g. f.e21.FHIST_BGC.f09_025.CAM6assim.011.cpl_0001.ha2x3h.2011.nc.gz) will need to be decompressed using 'gunzip'.
Restart file sets. The restart file sets can be used to start a simple hindcast by decompressing one of them into an accessible "REFDIR" directory and setting CESM's env_run.xml: www.nature.com/scientificreports/ CONTINUE_RUN = FALSE RUN_TYPE = hybrid RUN_REFCASE = f.e21.FHIST_BGC.f09_025.CAM6assim.011 RUN_REFDIR = where the restart file set has been decompressed RUN_REFDATE = the YYYY-MM-DD chosen as the date of the initial condition data. This can be a different year and day from the desired start date, but should be in the same month. RUN_REFTOD = 00000 The restart file sets can be used to start an atmospheric assimilation by decompressing an ensemble of them into an accessible "REFDIR" directory and following the usual DART set up procedures, as outlined in the DART tutorials and scripts, such as ${DART}/models/cam-fv/shell_scripts/cesm2_1/*setup*. This is not a trivial exercise, and will be done much more efficiently by becoming familiar with DART before attempting it.
Assimilation diagnostic files. If additional observation space diagnostics are needed, they can be derived from the f.e21.FHIST_BGC.f09_025.CAM6assim.011.cam_obs_seq_final.YYYY-MM.tgz files using scripts and programs in DART, as outlined in Fig. 6. Some of those are described in "Observation sequence files". In particular, program obs_diag can be rerun using different horizontal regions, time spans, vertical layers, etc. These are controlled by fortran namelist parameters in input.nml:obs_diag_nml.
DART has numerous Matlab © (.m) scripts to further process and display the contents of the NetCDF files produced by obs_diag.f90 and obs_seq_to_netcdf.f90. The latter creates files whose contents can be viewed as multiple linked frames with data selection capabilities. Several scripts display three-dimensional, rotatable renderings of the data. See DART's Diagn ostic s page for illustrations and details.
Once the NetCDF files are created, software packages other than Matlab © could be used. In that case the Matlab © scripts can serve as helpful templates for dealing with the many issues which must be resolved to create useful pictures. Users have begun contributing Python code for the manipulation of DART diagnostic output. It is not as advanced as DART's Matlab © environment, but we expect it to develop rapidly. Please contact us for the current status (dart@ucar.edu).
Land model time series. These NetCDF files can be evaluated and manipulated by the usual tools. The "h0" file variable time series are defined on the CAM lon-lat grid, so they can be easily displayed. The "h1" file variable time series have the CLM "pft" dimension (plant functional type), which is too large for display ( ≈ 800,000 ) without subsetting. There are many pfts in each grid box, distributed among the land surface types 22 .

Use cases. Several uses of this dataset have been described in previous sections:
⊲ Use of the Surface Forcing Files in CESM (see Fig. 1 for an illustration). ⊲ Restart File Sets describes initial conditions for hindcasts. ⊲ assimilation innovations can be calculated from the data described in "Atmosphere model ensembles" and "Ensemble mean and spread files" ⊲ Assimilation Files describes how to see and generate statistics describing the assimilation. ⊲ Land Model Time Series describes the plant growth variables in the CLM5 history files ( Table 9) ⊲ Comparison of these reanalyses with others, but see comments in "Technical Validation", or combination with them, as in the Multi-Reana lysis Ensem ble v2.
This large dataset has a unique combination of an 80 member ensemble, 1° global resolution, 6-hourly frequency, and 9 year time span, which provides opportunities for robust statistical analysis and use as a machine learning training and verification dataset. The data differs from model generated training sets in that it is constrained to represent the atmosphere (and plant growth characteristics), rather than the model formulation.
The observation space diagnostics can be used to help identify biases in observation platforms. The dataset already contains biases of the reanalysis relative to a variety of observation types, as a function of season, time of day, height, and broad latitude regions. Further refinement of the influences on the biases can be generated, as needed.