## Introduction

Black carbon (BC) aerosols play an important role in the Earth’s climate system through absorption of solar radiation, interaction with clouds, and deposition on snow and ice. However, the contribution of BC to current climate change remains highly uncertain.1,2,3,4 Both the radiative forcing due to the direct interaction with sunlight (DRF) of BC, and compensating rapid adjustments, are very sensitive to the vertical distribution of this aerosol.5,6 Previous studies have shown considerable diversity in vertical BC concentration profiles in global models as well as important model-measurement discrepancies, particularly at high altitudes, which contributes significantly to the uncertainty in BC DRF (e.g., Samset, et al.7,8).

Often, the geographical and vertical distributions of BC after emission are characterized through a global-mean atmospheric residence time, or lifetime. However, many of the processes affecting the lifetime remain poorly constrained. Recently, several single- and multi-model studies have used aircraft measurements from the HIAPER Pole-to-Pole Observations (HIPPO) campaign9 to investigate factors contributing to the existing spread in the vertical BC distributions and implications for the global BC lifetime.7,10,11,12,13 Results indicated a relatively linear relationship between global-mean atmospheric BC lifetime and bias versus aircraft observations in current models, and suggested that a global lifetime of <5 days is necessary for models to reasonably reproduce these observations. This is shorter than what many current global models predict (e.g., multi-model mean of 6.5 and 7.4 days found by Samset, et al.7 and Lee, et al.,14 respectively), and has been shown to have significant implications for revised estimates of the direct BC RF.7,13 These findings have led to an increased focus on tuning global BC lifetime to improve model skill.

However, most studies thus far have been limited to observations obtained over the Pacific Ocean. Moreover, high-resolution in-situ flight measurements are often compared with model data with much coarser spatial and temporal resolution. For instance, the multi-model assessment by Samset, et al.7 used monthly mean model output averaged over larger areas. This difference in sampling can lead to artificially poor comparisons, as the temporal and geographical scales of observations are small compared to typical model grids.15,16

Over the past years, aircraft measurements of BC concentrations from several additional flight campaigns have become available. These campaigns sampled both remote marine and continental regions over a range of latitudes and in most seasons. Here we compare and contrast these recent observations (see Methods for details about the instrumentation) with results from two atmosphere-climate models using a range of spatial and temporal resolutions. Based on this dataset, we examine the validity of conclusions drawn in earlier model-flight intercomparisons, and critically discuss the potential usefulness of a global “BC lifetime” as an indicator of model skill when considering a broader geographical scope than in previous literature. We find significant regional differences in the diagnosed “lifetime”, and a shortcoming of the method over the Atlantic Ocean, indicating either uncertainties in emission sources or a poor representation of transport and microphysical processes. Such discrepancies are likely to affect studies using average BC distributions and a global-mean lifetime to diagnose RF or climate impacts.

## Results

In the following, we compare BC vertical concentration distributions at three-hourly, daily, and monthly resolution, simulated by the chemistry transport model OsloCTM3 and the global climate model ECHAM-HAM, against a range of aircraft observations from 2008–2017 (Methods). Fig. 1 shows flight tracks for all campaigns (see Table S2 for details) and the geographical domains where the model-measurement comparison is performed. These domains are further aggregated into four focus regions, each reflecting a single set of BC source and/or transport characteristics: Pacific Ocean, Continental (US and Europe), Atlantic Ocean and Arctic. Our analysis has three main steps: (1) Evaluation of model performance against the observed BC concentration profiles from aircraft measurements over the past decade. (2) Assessment of the influence of sampling methodology on the model-measurement comparisons. (3) Quantification of the relationship between BC lifetime and model performance across different regions.

### Model evaluation

Our observational data set includes several recent flight campaigns from regions where evaluation of vertical aerosol distribution so far has been limited. We therefore first present regionally averaged observed profiles and document the models’ ability to reproduce results from individual aircraft campaigns. Fig. 2 shows concentration profiles in each domain defined in Fig. 1, compared against both models, using the finest available temporal and spatial resolution. The global-mean BC lifetimes, defined as the global burden divided by total emissions, of these models in the baseline configuration are 4.4 and 6.5 days for OsloCTM3 and ECHAM-HAM, respectively.
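The lifetime definition above (global burden divided by total emissions) can be illustrated with a minimal Python sketch. The function name and the burden value are assumptions chosen for illustration; only the emission magnitude is of the order quoted in the Method section.

```python
def bc_lifetime_days(burden_tg, emissions_tg_per_year):
    """Global-mean lifetime in days: burden divided by the emission rate."""
    if emissions_tg_per_year <= 0:
        raise ValueError("emissions must be positive")
    # Convert the emission rate from Tg per year to Tg per day.
    return burden_tg / (emissions_tg_per_year / 365.0)


# Hypothetical global burden (Tg); not a value reported in this study.
burden = 0.11
# Annual emissions (Tg per year), of the order given in the Method section.
emissions = 9.1

lifetime = bc_lifetime_days(burden, emissions)
print(f"Global-mean BC lifetime: {lifetime:.1f} days")
```

Under these assumed inputs, the diagnosed lifetime falls in the range of a few days, comparable to the baseline values quoted above.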

The top two rows show flight campaigns over the Pacific Ocean. HIPPO provides a unique climatology of measured BC concentrations over the Pacific Ocean that covers the full troposphere and includes sampling in all seasons obtained over multiple years. The HIPPO climatology has been extensively used to constrain models, and we include it here for comparison to previous studies. Recently, additional measurements over the Pacific Ocean have become available from the Atmospheric Tomography (ATom) mission.17 We find good agreement between model calculations and observed HIPPO vertical BC concentration profiles at mid- and low latitudes, although with some underestimation of the near-surface concentrations in several regions. Modeled concentrations are generally higher than the average vertical profile from ATom, especially at northern mid-latitudes (P2 region). Here we find that OsloCTM3 performs noticeably better than ECHAM-HAM, a result that has not been observed with other data sets. The best agreement is seen in the P5 region, where both models reproduce the data to within 50–90% at most altitudes. More data covering more seasons from HIPPO1-5 compared to ATom1-2 may be a reason for the better agreement with the former.

Whereas HIPPO and ATom over the Pacific Ocean sample remote, marine airmasses in which aerosols have undergone significant processing during long-range transport, several recent flight campaigns have been carried out over continental US and Europe, including SENEX,18 DC3,19 SEAC4RS,20 CONCERT,21 and ACCESS22 (Fig. 2, third row). OsloCTM3 and ECHAM-HAM results are generally similar in these regions. Both models capture the general shape and magnitude of observed vertical profiles from SENEX, the US portion of HIPPO, CONCERT, and ACCESS. They also capture the overall shape for DC3, but modeled concentrations are on average 50% lower than the mean of the measurements. The SEAC4RS campaign is split into two regions. In US west, both models fail to capture the observed peak of 500 ng m−3 around 600 hPa, where measurements were strongly influenced by biomass burning (see Discussion). In both regions, high-altitude modeled concentrations are higher than observed.

Vertical profiles over the Atlantic Ocean (Fig. 2, row 4) are obtained from SALTRACE23 and ATom. The main objective of SALTRACE was to characterize the long-range transport of dust from the Sahara across the Atlantic Ocean, and we focus on the two regions off the coast of Senegal and over Barbados (Fig. 1), representing “near-field” and “aged” air masses.24 OsloCTM3 significantly overestimates concentrations above 500 hPa, on average by factors of 9 and 6 for Senegal and Barbados, respectively; ECHAM-HAM more closely reproduces observations over Senegal. Similar overestimation was previously seen when comparing against the mean of the AeroCom Phase II models.24 For both models used here, an even larger overestimation is found for high-altitude measurements from ATom over Senegal, as well as further south in the A2 region and, to a lesser degree, in A1. In the two former regions, the models also significantly underestimate near-surface concentrations, while the opposite is seen in the A3 region. We note that the latter region differs from the other Atlantic Ocean domains in terms of airmass influence and source regions. To examine constraints on emission and transport processes, we perform an additional comparison of modeled and measured CO (see Discussion).

Significant seasonal and inter-model differences are seen in the comparison with flight campaigns in the Arctic (Fig. 2, bottom row), especially for OsloCTM3. The agreement is considerably better for campaigns conducted in summer and fall (including ATom1, HIPPO4-5 and ACCESS) than in winter and spring (ARCTAS,25 ARCPAC,26 HIPPO1-3, and ATom2). ECHAM-HAM performs better when compared against ARCTAS and ARCPAC. A recent study found good agreement between the AeroCom models and Arctic aircraft measurements24; however, these observations only covered May to September. Our results show that notable discrepancies remain, and that OsloCTM3 continues to struggle with the representation of Arctic BC during winter and spring despite recent updates.

### Impact of temporal and spatial sampling

The comparison presented above was performed at the highest available temporal and spatial model resolution (referred to as “raw”). Previous studies have however often used more coarsely aggregated model data. Figure S2 shows modeled BC vertical profiles constructed by sampling coarser resolution output (Methods). Differences between sampling methods exist in certain regions, but no clear pattern or bias is immediately obvious. For instance, using monthly or box data, ECHAM-HAM performs better compared to ATom P2 data, while both models perform worse in the P4 region. To quantify the impact of sampling choices on flight-model comparisons in a more systematic way, we therefore compare results using Taylor diagrams (Fig. 3). Model output at different temporal resolutions is interpolated onto the flight tracks in each of the aggregated focus regions, and compared against raw model data and aircraft measurements.

The top row of Fig. 3 shows the model output at each temporal resolution against the highest temporal resolution output available for that model, i.e. a model-to-model comparison. While the biases remain small, the centered root-mean-square error (RMSE) increases as we interpolate from daily model output, and the correlation decreases. The normalized variances remain better than 0.8–0.9 when using daily data, suggesting that although the correlation is impaired, we are still sampling most of the variance in the model fields. A significant further decrease in correlation, and increase in centered RMSE, is found when monthly data are used. For Arctic and Pacific campaigns, there is also a reduction in the normalized variance when using monthly data. These differences are sampling errors introduced by the use of coarse temporal model output.
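The loss of variance with coarser temporal output can be illustrated with a one-dimensional toy example. The sketch below is not the flight simulator or the CIS tool used in this study; it simply interpolates a hypothetical three-hourly time series and its daily means onto assumed flight times with `numpy.interp`. All values are invented for illustration.

```python
import numpy as np

# Hypothetical three-hourly model output over two days (hours since start).
hours = np.arange(0.0, 48.0, 3.0)
# Assumed BC concentration (ng m-3) with an idealized diurnal cycle.
raw = 10.0 + 5.0 * np.sin(2.0 * np.pi * hours / 24.0)

# Daily means, assigned to the middle of each day.
daily_times = np.array([12.0, 36.0])
daily = raw.reshape(2, 8).mean(axis=1)

# Assumed observation times along a flight track (hours since start).
flight_times = np.array([6.0, 18.0, 30.0])

# Sample both model products at the observation times.
raw_sampled = np.interp(flight_times, hours, raw)
daily_sampled = np.interp(flight_times, daily_times, daily)

print("three-hourly sampling:", raw_sampled)
print("daily-mean sampling:  ", daily_sampled)
print("variance ratio:", daily_sampled.var() / raw_sampled.var())
```

In this idealized case the daily means remove the diurnal signal entirely, so the sampled variance collapses; real model fields retain part of the variability, consistent with the normalized variances reported above.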

The bottom panels of Fig. 3 show the same model values, but this time compared to the observations. While using monthly model fields clearly introduces a sampling error, this error is small compared to model errors. The exception is the Pacific region, where sampling methodology makes a noticeable difference, as seen in particular in the ATom P2, ATom P4 and HIPPO P5 profiles in Figure S2. These are remote, background regions, and a possible reason for the larger differences when sampling with monthly data here could be that there is a higher chance of averaging out specific features as the air masses undergo long-range transport.

### Implication of model performance for global BC lifetime

Previous studies have suggested that models with short global-mean atmospheric BC lifetime reproduce both the spatial distribution and magnitude of the HIPPO measurements better than models with longer lifetimes (e.g.,7,13). The large number of flight campaigns used in the present analysis allows us to go beyond previous work, and examine the relationship between model skill and global BC lifetime for separate geographical regions. By changing assumptions about scavenging efficiency in the OsloCTM3 to obtain a spread in simulated patterns and mean residence time (Methods), we investigate BC lifetime constraints from the aircraft observations in the four aggregated focus regions (Fig. 1). We also examine the influence on this analysis from the sampling method errors discussed in the previous section.

Figure 4a–d shows normalized mean bias (NMB) regressed against global BC lifetime from each OsloCTM3 sensitivity simulation, while Fig. 4e shows the resulting lifetime constraint, as diagnosed by the x-axis intercept (zero bias) of a least squares linear fit. As an assessment of the robustness and sensitivity of the linear fit to specific data points, we derive upper and lower lifetime estimates from fits to all combinations of N-1 data points. We also regress the global BC lifetime against the RMSE, which accounts for error compensation due to differences of opposite sign and captures the average error produced by the model (Fig. S3). ECHAM-HAM is added for comparison of the models in their baseline configuration.
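The fitting procedure described above can be sketched as follows. This is a minimal illustration rather than the analysis code used for Fig. 4: the function names and the lifetime and NMB values are hypothetical, and `numpy.polyfit` is assumed as the least squares routine.

```python
import numpy as np


def zero_bias_lifetime(lifetimes, nmb):
    """Lifetime at which a least squares line through (lifetime, NMB) crosses zero."""
    slope, offset = np.polyfit(lifetimes, nmb, 1)
    return -offset / slope


def leave_one_out_range(lifetimes, nmb):
    """Lower and upper zero-bias lifetimes from fits to all sets of N-1 points."""
    lifetimes = np.asarray(lifetimes, dtype=float)
    nmb = np.asarray(nmb, dtype=float)
    estimates = []
    for k in range(len(lifetimes)):
        keep = np.arange(len(lifetimes)) != k  # drop data point k
        estimates.append(zero_bias_lifetime(lifetimes[keep], nmb[keep]))
    return min(estimates), max(estimates)


# Hypothetical sensitivity simulations: global lifetime (days) and regional NMB.
# NMB follows the (O - M) sign convention of Eq. 1, so longer lifetimes
# (more modeled BC) give more negative values.
lifetimes = np.array([3.0, 4.4, 5.5, 6.0, 8.0])
nmb = np.array([0.25, -0.10, -0.28, -0.42, -0.80])

best = zero_bias_lifetime(lifetimes, nmb)
low, high = leave_one_out_range(lifetimes, nmb)
print(f"zero-bias lifetime: {best:.1f} [{low:.1f}, {high:.1f}] days")
```

Reporting the leave-one-out range alongside the central estimate shows how strongly a single sensitivity simulation controls the diagnosed lifetime.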

For measurements over the Pacific Ocean, a global BC lifetime of 4 [2.3–4.2] days corresponds to the lowest bias and error, in line with previous findings.7,13 For ECHAM-HAM, which has a lifetime of around 6.5 days, a negative NMB is found, as was also reflected in the average profiles of Fig. 2. A similarly short global lifetime of 3.5 [2.1–3.7] days is also diagnosed from the continental flight measurements. For the Arctic as a whole, the lowest model bias and error corresponds to a lifetime of 5.5 [5.3–5.9] days. This is longer than in the baseline configuration of the OsloCTM3, which is reflected in the underestimation of Arctic vertical profiles (Fig. 2). Because of the longer lifetime, the Arctic profiles from ECHAM-HAM agree well with observations. Nevertheless, a lifetime of 5 days is still quite low and shorter than in many current models.7 There is, however, significant seasonal variability behind this aggregated estimate (see Discussion). The choice of sampling method affects both slope and intercept. Consistently, sampling with monthly mean model data results in a lower lifetime estimate for zero bias, while using the box averaging method implies a longer lifetime. The latter approach was used in the multi-model study by Samset, et al.7

Over the Atlantic Ocean, the bias and error decrease with decreasing lifetime. However, even for BC concentrations in the sensitivity simulation with the shortest lifetime, a negative bias of around 20% is found. In fact, the intercept with the x-axis is as low as 2 [−1.5–2.4] days. In contrast to the other regions, sampling with monthly mean data even gives an intercept below zero. Our analysis assumes that an optimal lifetime for model-measurement agreement exists, and that emissions and transport processes are relatively well captured by the models. The negative lifetime estimates suggest that in this case emission sources may be missing or processes other than removal may dominate. Moreover, a lifetime this short is not suggested by observations from SALTRACE, since the transport of dust layers across the Atlantic takes 5 days from the African coast to the Caribbean, without much removal during transport. This is a clear indication of how regional discrepancies between models and observations may skew analyses drawing conclusions based on a global-mean BC lifetime.

Two of the data points in Fig. 4 are close in terms of global BC lifetime, at 5.5 and 6 days, but result from modifying the scavenging efficiency of convective and large-scale precipitation by ice clouds, respectively (Table S1), and the spatial patterns of changes in burden differ (Fig. S1). To examine impacts on covariance in the shape of the vertical profiles, we calculate the Pearson correlation coefficient for each sensitivity test (see Fig. 3 for correlations in the baseline simulation). Differences are mostly small for all regions. For the Continental region, we find negligible changes. Over the Pacific, correlation is reduced from ~0.6 to 0.5 in the two simulations with longest lifetime. For the Arctic, the best correlation is found when the longer lifetime compared to baseline is achieved by reducing large-scale ice cloud removal, while in the Atlantic region, simulations with changes to the convective scavenging result in a small improvement. The latter provides further illustration of the inhomogeneity in the processes that affect BC distributions.

## Discussion and conclusions

Using measurements from a large set of recent flight campaigns, we have performed an updated evaluation of modeled BC vertical concentration profiles from OsloCTM3 and ECHAM-HAM and examined implications for global BC lifetime in a broader geographical scope than previous studies.

Modeled and observed average BC profiles agree well over the Pacific Ocean and for many flight campaigns over the Continental US and Europe, while there are notable discrepancies in other regions. Most notably, we show a persistent overestimation of measured BC concentrations at higher altitudes over the Atlantic Ocean. This is seen in both models, which have different aerosol treatments and emission inventories, and across several seasons covered by the ATom and SALTRACE campaigns. Interestingly, updates to the scavenging efficiency of BC in OsloCTM3 have mostly eliminated the previously seen high-altitude overestimation compared to HIPPO1-5 over the Pacific Ocean (e.g.,7), and the OsloCTM3 generally performs better than its predecessor OsloCTM2. In the present analysis, the high-altitude overestimation over the tropical Atlantic Ocean is also significantly lower in a sensitivity simulation where both hydrophobic and hydrophilic BC are allowed to act as efficient cloud condensation and ice nuclei, yielding a global BC lifetime of 3.2 days. However, such high ice nucleating activity of BC is not supported by available measurements. Compared to ATom data, there is also a strong underestimation of near-surface concentrations, likely driven by differences between the emission inventories used by the models and the actual emissions influencing the observations. Uncertainties in the magnitude, injection height and temporal resolution of the biomass burning emission inventory likely contribute to the discrepancies over the Atlantic Ocean, as well as in other regions. Separating contributions from fossil fuel and biomass burning sources in the OsloCTM3 (Fig. S4), we find that biomass burning BC constitutes the largest fraction of total BC in the A2 region during the time of ATom1 and 2, and in the Senegal domain during ATom2, while the high-altitude overestimation over Senegal in ATom1 and SALTRACE is dominated by fossil fuel plus biofuel BC.
Furthermore, the strong local peak in the vertical BC concentration profiles from SEAC4RS corresponds to measurements from the flights following the plume from the large Rim Fire in Yosemite.20 The high-altitude enhancement seen in measurements from CONCERT (up to ~25 ng m−3), but not captured by the models, was also influenced by biomass burning emissions from North America transported across the Arctic with little removal.21,24 Going beyond the use of monthly mean emission fields, as used in this and many other studies, could improve the models' ability to capture such features. Significant uncertainties are also associated with anthropogenic BC emissions, up to a factor of two globally, but with large regional differences (e.g.,27), and could influence our results. In addition to the range of emissions spanned by the use of two different inventories in the model runs (Methods), a set of sensitivity simulations is performed to investigate the impact of higher and lower emissions on the BC lifetime constraint derived from ATom1 measurements (see Supplementary material). A slightly lower estimate of global lifetime is derived for the Pacific, Atlantic and Arctic regions when emissions are scaled up (Table S3). Higher emissions generally result in higher NMB for each domain than in the corresponding original model run (Fig. S7). To compensate for this additional discrepancy, a shorter global lifetime is implied. As expected, the opposite is found when emissions are halved. We note, however, that in both cases numbers are within the range given by the error bars in Fig. 4e and <5 days. Although higher emissions indicate a poorer agreement with observed BC profiles, at least for ATom1, this confirms that our framework can be influenced by uncertainties in processes other than scavenging, as already discussed above in the case of the Atlantic region.

In addition to emissions, possible sources of uncertainty include model representation of microphysical aging and transport processes. To examine the latter, we performed an additional evaluation of the modeled vertical distribution of CO from OsloCTM3 against ATom measurements (Methods, Fig. S5). The general shapes of the measured and modeled profiles agree well, indicating that transport is reasonably represented in the model, as was also found in a previous comparison of OsloCTM2 CO concentrations against measurements over the Pacific Ocean.28 However, the modeled concentrations are on average 20% lower than the observed mean profiles in the A2 and Senegal regions, suggesting that emissions could be too low. Such emission-related uncertainties could also contribute to the discrepancies in BC profiles below 600 hPa, but do not provide an explanation for the high-altitude overestimation. A too slow conversion rate from hydrophobic to hydrophilic mode could play an important role in underestimating removal of the aerosols. The OsloCTM3 captures some temporal and spatial variability in aging time scales (Methods), but the representation is simplified. As shown by Lund, et al.,28 changes to parameters related to the microphysical aging of BC resulted in notable reductions in the high-altitude concentrations in the OsloCTM2-M7. ECHAM-HAM has a more sophisticated treatment of aerosol microphysics,29 but neither model includes processes such as coating by secondary organics, which has been suggested to play an important role for improved simulated vertical BC distribution over the Pacific (e.g.,30). Further dedicated studies are needed to investigate and resolve these issues seen across campaigns, seasons and models over the Atlantic Ocean.

Our analysis confirms that model error and bias generally increase with increasing global BC lifetime, and that a global lifetime of less than 5 days is necessary to broadly reproduce observed BC vertical concentration profiles. The choice of sampling methodology influences results, and using box average profiles generally gives a longer global BC lifetime constraint. However, differences are ~10–50% depending on region, and sampling errors are mostly small compared to both the model error and inter-model variation (for this specific model field). Hence, while sampling methodology may have influenced lifetime estimates from previous studies using monthly model data or box average profiles, our findings do not indicate that the main conclusions would have been affected. It should be noted that the spatial resolution considered here is still coarse, even when using the flight simulator to interpolate in time and space. Schutgens, et al.15 showed that model resolution needs to be 4 times higher than current common resolutions before sampling errors become significantly smaller.

More important for the usefulness of global-mean BC lifetime as an indicator of model skill is the significant geographical difference, with estimates for best agreement with observed concentrations ranging from less than 2 days for the Atlantic Ocean to around 5.5 days for the Arctic. The results point to distinct differences between high and lower latitudes, in line with previous studies (e.g., 28). There are also notable differences in the bias-lifetime relationship between individual campaigns within our aggregated Arctic region (Fig. S6); while a good agreement with measurements during winter and spring implies a global BC lifetime of 5–6 days, 3–4 days gives a better agreement with measurements from summer and early fall. It should be noted that some of the springtime Arctic campaigns, specifically ARCPAC and ARCTAS, were heavily influenced by biomass burning plumes, which are often difficult to capture in models using monthly biomass emission fields. For the other regions, the lifetime corresponding to the lowest bias in individual campaigns generally agrees more closely with the results for the aggregated regions.

Hence, in line with previous studies, we find that a relatively short global BC lifetime is a general prerequisite for models to be able to capture the observed vertical aerosol distribution. However, the strength of this result is significantly dependent on regional differences and inhomogeneities in BC processing, indicating that lifetime may only serve as a first order indicator of model skill. Future work combining these powerful observational constraints with source-receptor analyses (e.g., 31) would permit regional and source dependent lifetimes to be compared, providing more robust metrics of model performance.

## Method

### Model output

The distribution of BC is obtained from simulations with OsloCTM332 and ECHAM6.3-HAM2.3.29 The OsloCTM3 is a global three-dimensional chemistry transport model. The model is run at a 2.25° × 2.25° horizontal resolution with 60 vertical layers for the years 2008–2017 using meteorological data from the European Centre for Medium-Range Weather Forecasts (ECMWF) Integrated Forecast System (IFS) model and emissions from the Community Emission Data System (CEDS) for CMIP6.27,33 After 2014, biomass burning emissions from the Global Fire Emission Database version 434 and constant year 2014 anthropogenic emissions are used. Global annual BC emissions range from 9.1 to 9.7 Tg from 2008 to 2014. In addition to the baseline simulations, four sensitivity simulations with changes to the scavenging efficiency of BC are performed for each year to obtain a spread in global BC lifetime from 3 to 8 days (see Supplementary material for details). Carbonaceous aerosols in OsloCTM3 are treated with a bulk parameterization and represented by a hydrophobic and a hydrophilic mode. At the time of emission, BC is assumed to be 20% hydrophilic and 80% hydrophobic. The transfer to the hydrophilic mode (i.e., aging) is parameterized using fixed time scales with monthly and latitudinal distribution based on simulations with a microphysical aerosol model, where aging is slowest during winter at high latitudes, but faster and with a less pronounced seasonal cycle further south.35

ECHAM6.3-HAM2.3 is the latest version of the HAM aerosol-climate model and was run at T63 (roughly 1.8° horizontal resolution) with 31 vertical layers. The model was run only for the year 2008 using ECMWF meteorological data. The ACCMIP36 interpolated anthropogenic emissions dataset and ACCMIP-MACCity biomass burning inventory were used, giving a total of 7.3 Tg BC emitted. Aerosols are treated using the microphysical parameterization M7 of Vignati, et al.,37 which simulates the formation and growth of aerosol particles due to nucleation and condensation of sulfuric acid gas, coagulation of particles, and aerosol water uptake. The size distribution is given by a lognormal distribution function with four modes available for BC. Upon emission, all BC is assumed to be in the insoluble Aitken mode (mean radius 0.03 μm), and the subsequent aging and growth explicitly depend on the ambient concentrations of sulfate.

### Observations

We use single-particle soot photometer (SP2, Droplet Measurement Technologies, Longmont, CO, USA) measurements from 15 flight campaigns: ACCESS (European Arctic Climate Change, Economy and Society), ARCPAC (Aerosol, Radiation, and Cloud Processes affecting Arctic Climate), ARCTAS (Arctic Research of the Composition of the Troposphere from Aircraft and Satellites), ATom (Atmospheric Tomography Mission) 1 and 2, HIPPO (HIAPER Pole-to-Pole Observations) 1-5, CONCERT (COntrail and Cirrus ExpeRimenT), DC3 (Deep Convective Clouds and Chemistry), SALTRACE (the Saharan Aerosol Long-range Transport and Aerosol-Cloud-Interaction Experiment), SEAC4RS (the Studies of Emissions and Atmospheric Composition, Clouds and Climate Coupling by Regional Surveys) and SENEX (Southeast Nexus). The geographical coverage, timing and documentation for each campaign are summarized in Table S2. The SP2 provides measurements of accumulation mode refractory black carbon (rBC). For simplicity, we use the term BC when discussing both modeled and measured aerosols. Although three different groups operated the SP2s over this set of campaigns, the SP2 calibration uncertainties are likely negligible in the context of the model comparison; SP2 observation of refractory BC concentration is typically associated with ~25% uncertainty. Adjustments for accumulation-mode rBC mass not detected by the SP224 have been applied to measurements from ATom, HIPPO, SENEX, and ARCTAS. The corrections are generally small at <20%. Measured rBC is compared with total BC mass from the models. Additionally, we use CO measured with the Picarro Cavity Ring Down Spectrometer38 from ATom.17

### Analysis

Vertical profiles are constructed for 17 regions (Fig. 1) by averaging measured concentrations across 13 altitude bins within the respective geographical domain. Simulated vertical BC profiles are constructed by sampling model data at varying spatial and temporal resolution. For OsloCTM3, a flight simulator is used to extract data along the flight tracks, either online or using monthly mean data. For ECHAM-HAM, the three hourly, daily and monthly BC concentrations are interpolated onto the observed flight-track using the CIS tool39 and averaged in the respective regions. Additionally, monthly mean modeled concentrations are averaged over each region, referred to as box averaging.
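The construction of a mean vertical profile from along-track samples can be sketched as below. This is an illustrative example only: the pressure bin edges are assumed (the text specifies 13 bins but not their boundaries), the minimum-count threshold follows the requirement of 10 observations per bin stated with Eq. 1, and the input data are synthetic.

```python
import numpy as np


def mean_profile(pressure_hpa, conc, bin_edges_hpa, min_obs=10):
    """Average concentrations in pressure bins; bins with too few samples are NaN."""
    pressure_hpa = np.asarray(pressure_hpa, dtype=float)
    conc = np.asarray(conc, dtype=float)
    n_bins = len(bin_edges_hpa) - 1
    # Bin index for each sample; out-of-range samples fall outside 0..n_bins-1.
    idx = np.digitize(pressure_hpa, bin_edges_hpa) - 1
    profile = np.full(n_bins, np.nan)
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        in_bin = idx == b
        counts[b] = in_bin.sum()
        if counts[b] >= min_obs:
            profile[b] = conc[in_bin].mean()
    return profile, counts


# Assumed bin edges: 13 equally spaced pressure bins between 200 and 1000 hPa.
edges = np.linspace(200.0, 1000.0, 14)

# Synthetic samples: 12 near the surface and 5 in the upper troposphere.
pressure = np.concatenate([np.full(12, 950.0), np.full(5, 300.0)])
bc = np.concatenate([np.full(12, 10.0), np.full(5, 2.0)])

profile, counts = mean_profile(pressure, bc, edges)
print(profile)
print(counts)
```

The same function can be applied to observed and model-sampled concentrations so that both profiles share identical bins and identical data coverage.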

To quantify model performance, we calculate the NMB and root-mean-square error (RMSE) for each region $r$. When using the flight simulator, $\mathrm{NMB}_r$ is calculated as:

$$\mathrm{NMB}_r = \frac{1}{N_{\mathrm{lev}}} \sum_{i=1}^{N_{\mathrm{lev}}} \left[ \frac{1}{N_{\mathrm{obs}}} \sum_{j=1}^{N_{\mathrm{obs}}} \frac{O_j - M_j}{\frac{1}{2}(O_j + M_j)} \right] \tag{1}$$

where $O$ denotes the observed and $M$ the modeled concentration, $N_{\mathrm{obs}}$ is the number of observations in the respective altitude bin and $N_{\mathrm{lev}}$ is the total number of bins. A minimum of 10 observations is required in a given altitude bin. When modeled vertical profiles are constructed by the box averaging method, bias is calculated as:

$$\mathrm{NMB\_BOX}_r = \frac{1}{N_{\mathrm{lev}}} \sum_{i=1}^{N_{\mathrm{lev}}} \frac{\bar O_i - \bar M_i}{\frac{1}{2}(\bar O_i + \bar M_i)} \tag{2}$$

where $\bar O$ and $\bar M$ are the average observed and modeled concentrations in each altitude bin.

Corresponding equations for $\mathrm{RMSE}_r$ are:

$$\mathrm{RMSE}_r = \frac{1}{N_{\mathrm{lev}}} \sum_{i=1}^{N_{\mathrm{lev}}} \left[ \sqrt{\frac{1}{N_{\mathrm{obs}}} \sum_{j=1}^{N_{\mathrm{obs}}} (O_j - M_j)^2} \right] \tag{3}$$

$$\mathrm{RMSE\_BOX}_r = \sqrt{\frac{1}{N_{\mathrm{lev}}} \sum_{i=1}^{N_{\mathrm{lev}}} (\bar O_i - \bar M_i)^2} \tag{4}$$

The 17 regions are further aggregated to four focus regions as indicated by different colors of boxes in Fig. 1: continental US and Europe, Pacific Ocean, Arctic, and Atlantic Ocean. Aggregated bias and error are calculated as unweighted averages of the statistics in individual sub-regions.
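Equations 1–4 and the unweighted aggregation can be written compactly in Python. The sketch below is one possible reading of the equations, not the code used in the study; function names and the sample values are hypothetical, and the altitude bins are assumed to have already passed the minimum-observation filter.

```python
import numpy as np


def nmb_track(obs_bins, mod_bins):
    """Eq. 1: mean over bins of the per-bin mean of (O - M) / (0.5 * (O + M))."""
    per_bin = [np.mean((o - m) / (0.5 * (o + m))) for o, m in zip(obs_bins, mod_bins)]
    return float(np.mean(per_bin))


def rmse_track(obs_bins, mod_bins):
    """Eq. 3: mean over bins of the per-bin root-mean-square difference."""
    per_bin = [np.sqrt(np.mean((o - m) ** 2)) for o, m in zip(obs_bins, mod_bins)]
    return float(np.mean(per_bin))


def nmb_box(obs_mean, mod_mean):
    """Eq. 2: normalized bias computed from bin-averaged profiles."""
    o, m = np.asarray(obs_mean, float), np.asarray(mod_mean, float)
    return float(np.mean((o - m) / (0.5 * (o + m))))


def rmse_box(obs_mean, mod_mean):
    """Eq. 4: root-mean-square difference of bin-averaged profiles."""
    o, m = np.asarray(obs_mean, float), np.asarray(mod_mean, float)
    return float(np.sqrt(np.mean((o - m) ** 2)))


def aggregate(region_stats):
    """Unweighted average of sub-region statistics for a focus region."""
    return float(np.mean(region_stats))


# Hypothetical data: two altitude bins with paired observed and modeled values.
obs_bins = [np.array([2.0, 2.0]), np.array([4.0])]
mod_bins = [np.array([1.0, 1.0]), np.array([4.0])]

print(nmb_track(obs_bins, mod_bins), rmse_track(obs_bins, mod_bins))
print(nmb_box([2.0, 4.0], [1.0, 4.0]), rmse_box([2.0, 4.0], [1.0, 4.0]))
print(aggregate([0.2, -0.4]))
```

Note that Eq. 3 averages per-bin RMSE values, whereas Eq. 4 takes a single root over all bins, so the two error measures are not numerically interchangeable even for identical profiles.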

The correlations, standard deviations, and centered RMSE summarized in the Taylor diagrams are calculated using 2-min averages, but without weighting by number of observations in the vertical.
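The three statistics summarized in a Taylor diagram can be computed as in the following sketch. It assumes the standard definitions (Pearson correlation, ratio of standard deviations, and RMSE after removing each series' mean); the function name and input values are illustrative and not taken from the study.

```python
import numpy as np


def taylor_stats(reference, test):
    """Return correlation, normalized standard deviation and centered RMSE."""
    ref = np.asarray(reference, dtype=float)
    tst = np.asarray(test, dtype=float)
    corr = np.corrcoef(ref, tst)[0, 1]
    std_ratio = tst.std() / ref.std()
    # Centered RMSE: remove the mean of each series before differencing.
    crmse = np.sqrt(np.mean(((tst - tst.mean()) - (ref - ref.mean())) ** 2))
    return corr, std_ratio, crmse


# Hypothetical 2-min averages along a flight track (ng m-3).
reference = np.array([1.0, 2.0, 3.0, 4.0])
test = 2.0 * reference + 1.0  # perfectly correlated, doubled amplitude, offset

corr, std_ratio, crmse = taylor_stats(reference, test)
print(corr, std_ratio, crmse)
```

Because the means are removed, a constant offset between the two series does not affect any of the three statistics; the overall bias has to be assessed separately, for example with the NMB.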

### Data availability

Aircraft observations are available online: https://www.esrl.noaa.gov/csd/field.html (ARCPAC, HIPPO, SEAC4RS, SENEX, DC3), https://dx.doi.org/10.5067/Aircraft/ATom/TraceGas_Aerosol_Global_Distribution (ATom), https://www-air.larc.nasa.gov/missions.htm (ARCTAS), http://www.pa.op.dlr.de/CONCERT/ (CONCERT). SP2 data collected with the Falcon research aircraft are available by request to the DLR for the CONCERT, DC3, ACCESS, and SALTRACE missions; the PI for the Falcon BC data can be reached at: Bernadett.Weinzierl@univie.ac.at. Model data are available from Marianne T. Lund (m.t.lund@cicero.oslo.no) and Duncan Watson-Parris (duncan.watson-parris@physics.ox.ac.uk). CIS is an open-source tool available at www.cistools.net.