Estimation of hygroscopic growth properties of source-related sub-micrometre particle types in a mixed urban aerosol

Knowledge of hygroscopic properties is essential to prediction of the role of aerosol in cloud formation and lung deposition. Our objective was to introduce a new approach to classify and predict the hygroscopic growth factors (Gfs) of specific atmospheric sub-micrometre particle types in a mixed aerosol based on measurements of the ensemble hygroscopic growth factors and particle number size distribution (PNSD). Based on a non-linear regression model between aerosol source contributions from PMF applied to the PNSD data set and the measured Gf values (at 90% relative humidity) of ambient aerosols, the estimated mean Gf values for secondary inorganic, mixed secondary, nucleation, urban background, fresh, and aged traffic-generated particle classes at a diameter of 110 nm were found to be 1.51, 1.34, 1.12, 1.33, 1.09 and 1.10, respectively. It is found possible to impute (fill) missing HTDMA data sets using a Random Forest regression on PNSD and meteorological conditions.


INTRODUCTION
Processes affecting atmospheric aerosol and its effects on climate change are strongly dependent upon hygroscopic properties 1,2 . Once a particle is emitted or formed in the atmosphere, it can grow or shrink in size by water vapour uptake due to its hygroscopicity, therefore altering the scattering and absorption of solar radiation and consequently changing the Earth's radiation balance 3,4 . In addition, hygroscopic properties of aerosols play an important role in determining the impact of aerosols in cloud droplet formation 5,6 , indicating that particles with a high hygroscopic growth factor (Gf) mode are predominantly scavenged into cloud droplets with an activation efficiency of 57-83% 6 . Particle growth factor is also considered as a vital parameter in determining the deposition of aerosols in the human respiratory system 2,7,8 as atmospheric aerosols at 200 nm can grow their size to more than double when exposed to a high relative humidity (>99.5%) in the human lung, enhancing their lung deposition efficiencies.
Hygroscopic properties of aerosols can be described by the hygroscopic Gf, which is commonly measured by a hygroscopicity tandem differential mobility analyser (HTDMA) instrument 1 . The Gf is defined as the ratio between the particle diameter measured at a high relative humidity (RH) condition ($d_w$, RH~90%) and the dry particle diameter measured at a low RH condition ($d_p$, RH < 10%):

$$\mathrm{Gf} = \frac{d_w}{d_p} \qquad (1)$$

The values of Gf have been widely used to classify hygroscopic properties of aerosols into four groups: nearly-hydrophobic (Gf~<1.11), less-hygroscopic (Gf~1.11-1.33), more-hygroscopic (Gf~1.33-1.85) and sea-salt aerosols (Gf~>1.85) 1 . HTDMA techniques are now used extensively to characterise hygroscopic properties of atmospheric aerosols in different environments, including rural, marine, remote, urban background, and roadside areas 2 . However, HTDMA measurements have some disadvantages: (1) a HTDMA instrument can only measure particle growth factors for several selected diameters in a limited size range (30-350 nm); (2) the measurement does not have high time resolution; (3) the measurement may be biased if equilibrium conditions are not adequately reflected 8 ; and (4) vapours condensed on the surface of the measured particles can create internally mixed particles and introduce error in the HTDMA measurements [8][9][10] . Particles released from different sources have different chemical compositions and sizes, which determine their different hygroscopic properties. Vu et al. 2 reviewed the Gf of particles from three sources, including traffic emissions, biomass burning and nucleation, and found that fresh combustion-generated aerosols are predominantly nearly-hydrophobic or less-hygroscopic particles. During the atmospheric ageing process, aerosols can be subjected to other aerosol dynamical processes which may change their shape and chemical composition, leading to a change in their hygroscopic properties.
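The growth-factor definition and the four hygroscopicity groups described above can be illustrated with a minimal sketch. The function names are hypothetical, and the treatment of values lying exactly on a class boundary is an assumption, since the text gives only approximate limits.

```python
def growth_factor(d_wet_nm: float, d_dry_nm: float) -> float:
    """Gf = humidified (wet) diameter divided by dry diameter."""
    return d_wet_nm / d_dry_nm


def classify_gf(gf: float) -> str:
    """Map a growth factor measured at ~90% RH to a hygroscopicity group.

    Boundary handling (< versus <=) is an assumption of this sketch.
    """
    if gf < 1.11:
        return "nearly-hydrophobic"
    if gf < 1.33:
        return "less-hygroscopic"
    if gf <= 1.85:
        return "more-hygroscopic"
    return "sea-salt"


if __name__ == "__main__":
    # Made-up wet diameters for a 110 nm dry particle.
    for d_wet in (115.0, 140.0, 170.0, 220.0):
        gf = growth_factor(d_wet, 110.0)
        print(f"{d_wet:.0f} nm -> Gf = {gf:.2f} ({classify_gf(gf)})")
```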
Kotchenruther and Hobbs 11 found higher growth factors of aged wood smoke aerosols (1.3-1.5) than those of fresh wood smoke aerosol (1.1-1.3). Aerosols emitted from traffic sources have sometimes been observed to increase their hygroscopic growth factor during atmospheric ageing, but this increment was much lower than that of biomass burning aerosols 12 . McMeeking et al. 13 used a HTDMA coupled with a single-particle soot photometer (HTDMA-SP2) system to measure the refractory black carbon (rBC) aerosol hygroscopicity and found an influence of mixed rBC on the hygroscopic properties of aerosols. Despite a considerable number of HTDMA measurements for ambient aerosols in recent years, there are few modelling studies on hygroscopic properties of aerosols from sources such as coal combustion or secondary organic aerosols.
Hygroscopic growth is affected by both the inherent hygroscopicity of the material comprising the particle, and for smaller particles (<100 nm) by the influence of the Kelvin Effect upon vapour pressure. This work addresses measured growth factors that are affected by both influences, and does not therefore address directly the inherent hygroscopicity of the particles. However, the growth factor is a practically valuable quantity in the context of actual growth under high humidity conditions and is directly relevant to particle size changes at high relative humidities in both the atmosphere and the human respiratory tract.
The aim of this work was to investigate the relationship between hygroscopic properties of aerosols and their emission sources. To do so, Gf values for particles from different sources have been estimated based on a non-linear regression (NLR) model between the measured Gf of ambient particles and factor contributions obtained by a positive matrix factorisation (PMF) model applied to particle number size distribution (PNSD) data sets. This study also reports the use of a PMF model on a combined data set of PNSD and hygroscopic particle counts for the selected radii to investigate the hygroscopicity of each emission source. Furthermore, this study also develops a new approach for prediction of aerosol growth factors based on their PNSD. This work has important potential applications. Particle hygroscopicity, as indicated above, is a powerful determinant of the ability of particles to act as cloud condensation nuclei (CCN), and to deposit with high efficiency in the respiratory system. Since measurements of hygroscopic growth factors with the HTDMA do not distinguish the sources of particles, methods such as this, which are capable of associating growth factors with particles from specific sources, are a high priority in understanding the sources of CCN and identifying those particles which are most responsible for contributing to the lung dose of atmospheric pollutants.

Hygroscopic properties of aerosols from measurements
The measured growth factor probability density function (Gf-PDF) presented a bi-modal distribution for each selected particle size of 50, 75, 110, 165, and 265 nm during the two sampling campaigns as shown in Fig. 1. Two dominant fractions of aerosols are found as nearly-hydrophobic (Gf ∼1.00-1.15, at 90% RH) and more-hygroscopic particles (Gf ∼1.35-1.75). Larger particles (d p > 100 nm) show greater hygroscopicity by shifting the Gf probability density towards the mode of greater Gf. There is no significant difference in mean growth factors between summer and winter, but the peak modes of the nearly-hydrophobic and more-hygroscopic groups seem shifted to smaller Gf values in summer, which was probably attributable to more atmospheric evaporation of hygroscopic components (e.g. nitrate) of aerosols or possibly more less-hygroscopic secondary organic aerosol (SOA) in summer than winter.
In terms of the fraction of aerosol by number, a greater proportion of nearly-hydrophobic particles (43-45%) were less than 100 nm whereas more-hygroscopic particles were larger than 100 nm. For example, number fractions of aerosols at 265 nm are 35, 4 and 61% for nearly-hydrophobic, less-hygroscopic, and more-hygroscopic aerosol groups, respectively. Less-hygroscopic particles were only found to occur appreciably for particles with diameter smaller than 200 nm. In all, 12-15% of such particles were ultrafine particles (diameters <100 nm) that can be attributed to aged combustion-generated aerosols. Sea-salt aerosols were found for some events, but their contribution was not significant (<1%). The number fraction of nearly-hydrophobic particles in this study was greater than those in previous studies in Leipzig (22-31%), Neuherberg (35-42%), Beijing (17-24%), and Shanghai (18-24%) [14][15][16] . It is also greater than that measured at a rural site (30-52%), but smaller than that measured at roadside (70%) 7,17 .
The scheme of model development is shown in Fig. 2. In order to characterise the distribution of growth factors associated with particles from different sources, we developed two approaches: (1) application of a PMF model to the particle size distribution data, followed by use of an NLR model to associate growth factors with each PMF source-related factor; (2) application of PMF to both the particle size distribution and hygroscopic growth factor data in a single model. Additionally, a Random Forest (RF) model was used to predict the hygroscopic growth factor of ambient aerosols using source contribution factors, for the imputation of missing data in HTDMA data sets.
The optimal solution from the application of PMF to the PNSD data (see "Methods" section) revealed six source-related factors. Figure 3 shows time trend comparisons between the number fraction of nearly-hydrophobic particles and the PMF traffic factor contribution, and between the number fraction of more-hygroscopic particles and the PMF secondary and nucleation aerosol contribution. Based on the results obtained from the PMF model applied to the PNSD data set only, traffic emissions contribute 52% of total particles while secondary aerosols and nucleation contribute 34% of total particles by number, which are comparable with 43-45% and 40-44% for nearly-hydrophobic and more-hygroscopic ultrafine particles. This suggests that traffic emissions are the main source of nearly-hydrophobic particles in the urban background atmosphere of London.
Estimated growth factors by non-linear least-squares fitting
Hygroscopic growth factors of particles from different sources obtained by the non-linear least-squares fitting based on Eq. (3) (see "Methods" section) are shown in Table 1 18 .
The Gf values of particles from traffic-generated aerosols (at 90% RH) are mainly found between 0.92 and 1.29 (except for fresh traffic particles at a diameter of 265 nm with Gf~1.36), indicating that the particles emitted from traffic emissions are nearly hydrophobic or less hygroscopic. These growth factors are in agreement with those reported from chamber experiments and measurements at roadside sites by Weingartner et al. 19 , Tritscher et al. 12 , Baltensperger et al. 17 , Ferron et al. 7 and Löndahl et al. 20 . Weingartner et al. 19 reported that the Gf values of aerosols (in the size range of 29-111 nm) released from a four-stroke spark ignition engine ranged from 0.98 to 1.15 at 95% RH.
For particles around 110 nm, there is no significant difference between Gf values for fresh (Gf~1.09) and aged traffic particles (Gf~1.10). It is likely that particles in this size range are aggregate or soot particles which change their size slightly during the atmospheric ageing process 12,21 . Aged traffic particles at 50 and 75 nm were found to be more hygroscopic (Gf~1.19-1.29) than those of fresh traffic emissions (Gf~0.92-1.20). This may be explained by the evaporation of semi-volatile organic compounds (which are predominantly hydrophobic compounds) from the aerosol surface or the condensation of oxidised species onto the particles during the atmospheric ageing process, creating a more hygroscopic surface. In contrast, larger particles (with diameters at 165 and 265 nm) show higher average Gf values for aerosols from fresh traffic emissions (Gf~1.26-1.36) than aged traffic emissions (Gf~1.11-1.12). Larger fresh traffic-generated aerosols may contain more hygroscopic compounds (e.g. NH4NO3) which can evaporate during ageing, shifting to aged traffic aerosols with lower Gf 22 .
The urban accumulation particles are found to be less hygroscopic with the Gf ranging from 1.32 to 1.33 for particles in a range of 75-110 nm, and more hygroscopic (Gf~1.68) with a diameter of 265 nm. These particles are believed to be a mixture of aged urban combustion-generated aerosols (traffic emissions and wood burning). This source was found to correlate well with both organic compounds (r²~0.77) and black carbon (r²~0.77) 23 . Aged wood smoke aerosols are known to be more hygroscopic (Gf 1.3-1.7 at 90-95% RH) compared to fresh wood aerosols 11,[24][25][26] .
A very high growth factor (1.68) for aerosols at 265 nm suggests the urban accumulation mode aerosol contains inorganic salts, many of which have Gf values of 1.6-1.8, or may be mixtures including some compounds with Gf > 2 (at RH 90%) 27 .
True nucleation mode particles fall below the lowest particle size measured by the HTDMA of 50 nm. However, the regression model assigns growth factor values to this PMF source factor which may relate to the upper tail of the distribution. This is an artefact of the PMF, which assigns a small number of larger particles to this factor. Ultrafine particles assigned by PMF to the nucleation factor show growth factors of 1.12-1.30 with higher Gf values for smaller particles. These particles arise from regional nucleation and have grown due to condensation of oxidised species 23 . Sakurai et al. 28 reported a Gf value of 1.4 for nucleation particles at a diameter of 10 nm from an urban background site in Atlanta. The high Gf factor of nucleation aerosols indicates that these atmospheric nucleation aerosols are probably mainly formed from oxidation of SO 2 29 but are also affected by oxidised surface layers. The nucleation factor contributes only a small fraction of particles in the accumulation mode size range.
Mixed secondary organic and inorganic aerosols (MIA) are also found to be less hygroscopic (Gf~1.34-1.48) than secondary inorganic aerosols (SIA) (Gf~1.46-1.77). These aerosols were predominantly transported from mainland Europe to London during the sampling campaign. The higher Gf value for SIA is due to more hygroscopic compounds contained in SIA than MIA. Vu et al. 23 found that SIA   [30][31][32] . Hygroscopic growth factors of SOA were reported to be lower with a range of 1.1-1.45 in a previous study 33 . The Gf value of MIA is greatly dependent upon the mixing ratio between inorganic salts and SOA 34 .
Discussion on hygroscopic growth factors from a NLR model
The next step was to associate an estimate of growth factor with each particle class derived from the PMF. Based on averaged growth factors estimated following the NLR model based upon Eq. (3) for aerosols from each source as shown in Table 1 18 , and the contribution of each emission source obtained by the PMF model, the hygroscopic growth factors of the mixed ambient aerosol were reconstructed. A comparison between simulated and measured growth factors is shown in Fig. 4.
The simulated growth factors derived from the NLR model could predict well the average temporal trend in the growth factors of ambient aerosols, but failed to predict the lowest and highest points. The poor correlations (r 2 < 0.4) between hourly modelled and measured growth factors can be explained by four main reasons: (1) The time variation of growth factors of each source: In the NLR model, we used an averaged growth factor value for particles from each source but the Gf values of aerosols from a given source may vary within the sampling time. In summary, despite the poor prediction of high time resolution of the aerosol Gf values, the NLR model based on the PMF factors derived from PNSD data sets predicts well the average trend of Gf values and provides estimated Gf values for particles emitted from different sources.

PMF model applied to PNSD and hygroscopic growth factors
In order to further investigate the probability distribution of the Gf values, we applied a PMF model to a combined data set of PNSD obtained by Scanning Mobility Particle Sizer (SMPS) and hygroscopic particle counts obtained by the HTDMA. The results are shown in Fig. 5. The six source-related factors seen in the analysis of PNSD data only were identified, but the source profiles are different for urban aerosols, which were separated into two groups (less and more-hygroscopic aerosols), and secondary aerosols, which were split into two factors (inorganic and mixed secondary), compared to the previous PMF run on the PNSD data only.
For particles from the nucleation source, the hygroscopic growth factor distributions (Gf-PDF) of particles at 50, 75, 110 and 165 nm have a dominant mode of 0.9-1.1. Similar to nucleation particles, the growth factors of particles from fresh traffic and aged traffic sources are mainly distributed at 0.8-1.1 with minor peaks at 1.4 and 1.6. This is as expected for the hydrocarbon and elemental carbon-based traffic exhaust-related particles, but not for the nucleation mode. However, the smallest particle size (50 nm) at which hygroscopicity data are available is almost outside of the size range of the nucleation source particles (Fig. 5) and much larger than the mode. Consequently, some of the particle sizes for which hygroscopicity is estimated are relevant to this particle class. However, there is a very large associated uncertainty as the HTDMA did not measure below 50 nm, and therefore omitted most of this category of particle.
T.V. Vu et al.
By contrast, particles from the urban accumulation mode and secondary aerosol categories show three dominant Gf modes at 1.0, 1.2 and 1.5 which were assigned into hydrophobic, less-hygroscopic and more-hygroscopic groups. The mixed hygroscopic properties of these particles may be the explanation for the poor fitting of the obtained averaged Gf value from the NLR model, if composition varies with time over the wide range of Gf factors seen in Fig. 5 for SIA and MIA.
Limitations of this approach are that particles from lesser sources (e.g. brake wear or sea salt) may not be separated.
RF technique to impute missing data from HTDMA
A machine learning technique, RF, was used to predict growth factors by assuming that the Gf for a given particle size is a function of the factor contribution (g k ) of each source obtained by PMF and meteorological parameters. A part of the data set was used to "train" the model, with the other part used to test its skill (see "Methods" section).
The RF model simulates the hygroscopic growth factors of particles well, as shown in Table 2 and Fig. 6. The model performance does not improve significantly with additional chemical input variables such as NH4+, SO42− and black carbon, because these chemical components have a close relationship with the PMF factors.
We tested the application of the RF method in prediction of hygroscopic growth factors with two training methods. In the first training method, we selected the first two-thirds of the whole data set as the training data set, and then used the rest of the data, in August, as the testing data set. The model performed very poorly (r < 0.5) in predicting the hygroscopic growth factors of particles for the new period. The poor performance may be caused by the difference in hygroscopic growth factors of particles between winter and summer, leading to an inadequate training data set. It is outside the scope of our study to investigate this.
In the second model training method, we randomly selected the training data set (accounting for 70% of the whole data set) from across the whole sampling period. The model performance was found to be much better, as shown in the Supplementary Information, indicating that we can use this method to impute (fill) the missing data from HTDMA measurements. This is, however, dependent upon the use of a training data set.
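The difference between the two training methods can be sketched as follows. This is an illustration only: the record layout, the function names and the season lengths are hypothetical, and only the split fractions (two-thirds and 70%) are taken from the text.

```python
import random


def chronological_split(records, train_fraction=2 / 3):
    """Train on the earliest part of the record, test on the remainder."""
    cut = round(len(records) * train_fraction)
    return records[:cut], records[cut:]


def random_split(records, train_fraction=0.7, seed=0):
    """Draw the training set at random from the whole sampling period."""
    rng = random.Random(seed)
    shuffled = records[:]          # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Hypothetical records: (season, index); winter precedes summer.
    records = [("winter", i) for i in range(60)] + [("summer", i) for i in range(90)]
    for name, split in (("chronological", chronological_split),
                        ("random", random_split)):
        train, test = split(records)
        summer_in_test = sum(r[0] == "summer" for r in test) / len(test)
        print(f"{name}: {len(train)} train / {len(test)} test, "
              f"summer share of test set = {summer_in_test:.2f}")
```

With a chronological split the test set comes from a period that the model has seen little or nothing of, whereas the random split exposes the model to both seasons during training.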

DISCUSSION
This study has introduced and investigated new approaches for estimation of hygroscopic properties of sub-micrometre aerosols using the PNSD. As a result, both an NLR model and a PMF model applied to PNSD and hygroscopicity data sets can predict well the hygroscopic growth factors of particles emitted from different sources without knowledge of chemical composition from each source. In addition, the PMF model can show not only the averaged value but also a probability distribution function of each source's growth factors. This study confirms that aerosol from SIA and MIA sources was found to be more hygroscopic whereas combustion creates dominant fractions of nearly-hydrophobic and less-hygroscopic groups. Furthermore, it found that a NLR model could predict well the average trends of ambient aerosols, but it fails to predict the high time resolution ambient Gf values. This study introduced an RF technique which shows a good capability to predict hygroscopic growth factors of aerosols based on the source factors and ambient meteorological conditions. It performs well with high time resolution data, resulting in the application of this technique in the imputation of missing data for the data sets obtained by a HTDMA system.

Measurements and data sources
PNSD, hygroscopic properties and other air pollutant concentrations were measured by the University of Manchester at a monitoring station located in the grounds of Sion Manning School in North Kensington, which is representative of a typical urban background area of London, UK 36 . Those measurements were conducted during two intensive sampling campaigns (January-February and July-August 2012) within the Clean Air for London (ClearfLo) project 37 .
PNSDs were measured by an SMPS system which consisted of an electrostatic classifier (EC, TSI model 3080) and a condensation particle counter (TSI model 3775). The instrument was set up to collect PNSDs every 15 min with six scans of 2.5 min covering the size range 14.6-623 nm. The size distribution is measured after the aerosol has passed through a dryer, following the guidelines set out by Wiedensohler et al. 38 . The hygroscopic growth factors and number fraction of sub-micrometre aerosols were measured by a HTDMA system, operated by the University of Manchester, at 90 ± 0.8% RH with five selected initial dry diameters of 50, 75, 110, 165 and 265 nm. The scan time of the HTDMA was 15 min for each run. Both SMPS and HTDMA measurements were conducted at the same location and over the same sampling period. The Gf-PDF was then retrieved using a TDMAinv inversion 38 . PNSD and hygroscopic data sets were extracted from the Centre for Environmental Data Analysis (CEDA database, https://catalogue.ceda.ac.uk/uuid/03cf72a33d1fcf00908bf9eca3be7eca) and have been described by Whitehead et al. 10 .
Fig. 6 Comparison of time series of observed growth factors (in red) with those modelled using Random Forest (in black). Each graph relates to a particle size, as indicated.
Other hourly data sets on meteorological parameters and aerosol chemical composition (ion species and black carbon) were extracted from the Department for Environment Food & Rural Affairs website (DEFRA, http://uk-air.defra.gov.uk/). All data were averaged on an hourly or three-hourly (for Gf) basis.

First approach using PMF and an NLR model
The application of PMF models to PNSD data has previously been used for source apportionment of particles by number 39,40 . In this approach, each size bin in the PNSD data set is considered as an input variable. To reduce the uncertainty associated with the SMPS measurements, three consecutive original SMPS size bins were summed into one new size bin. A total of 1170 hourly PNSD data sets (26/01-11/02/2012 and 21/07-23/08/2012) comprising 32 new size bins ranging from 15 to 500 nm were input into the US EPA PMF model version 5. The PMF model can reduce the dimensions of PNSD data sets by identifying the number of factors (p), the size profile (f) of each source, and the number concentration (g) contributed by each factor to each individual measurement using the following equation:

$$x_{ij} = \sum_{k=1}^{p} g_{ik} f_{kj} + e_{ij} \qquad (2)$$

where $x_{ij}$ is the particle number concentration of size bin j in the ith sample and $e_{ij}$ is the corresponding residual. The details of this PMF method have been described in our previous study. This reported six identified sources: urban accumulation mode, nucleation, SIA, MIA, fresh and aged traffic emissions in the London samples. In the current study, we utilised source contributions and profiles obtained from the PMF model to investigate the influence of sources upon the hygroscopic properties of aerosols.
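The bin merging and the bilinear structure of the PMF model can be sketched with a few lines of plain Python. This is not the US EPA PMF solver, which estimates g and f from the data; it only shows how three consecutive bins are summed and how modelled concentrations follow from given contributions and profiles, assuming the standard PMF form in which each concentration is the sum over factors of contribution times profile plus a residual. All numbers and function names are hypothetical.

```python
def merge_bins(row, width=3):
    """Sum each run of `width` consecutive size bins into one new bin."""
    usable = len(row) - len(row) % width   # drop an incomplete trailing group
    return [sum(row[i:i + width]) for i in range(0, usable, width)]


def reconstruct(g, f):
    """Modelled x[i][j] = sum over k of g[i][k] * f[k][j] (residual excluded)."""
    n_factors, n_bins = len(f), len(f[0])
    return [
        [sum(g_i[k] * f[k][j] for k in range(n_factors)) for j in range(n_bins)]
        for g_i in g
    ]


if __name__ == "__main__":
    raw = [10.0, 12.0, 8.0, 5.0, 4.0, 3.0]    # six original bins (made up)
    print(merge_bins(raw))                     # two merged bins

    g = [[2.0, 1.0], [0.5, 3.0]]               # 2 samples x 2 factors
    f = [[1.0, 0.0, 2.0], [0.0, 4.0, 1.0]]     # 2 factors x 3 size bins
    print(reconstruct(g, f))
```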
It was assumed that the average growth factor of mixed aerosols can be estimated based on the ZSR mixing rule 41,42 :

$$\mathrm{Gf}_{\mathrm{mixed}} = \left( \sum_{k} \varepsilon_{k}\, \mathrm{Gf}_{k}^{3} \right)^{1/3} \qquad (3)$$

where $\mathrm{Gf}_{\mathrm{mixed}}$ is the averaged growth factor retrieved from the HTDMA system after the TDMAinv fitting 43 , $\mathrm{Gf}_{k}$ is the averaged growth factor of aerosols from each source and $\varepsilon_{k}$ is the number fraction of particles from each source (k), which is estimated from the PMF factor contribution and factor profile. The number fraction of particles at a given diameter in a size bin (j) from a source (k) for the ith sample is therefore calculated by the following equation:

$$\varepsilon_{ijk} = \frac{g_{ik} f_{kj}}{\sum_{k'=1}^{p} g_{ik'} f_{k'j}} \qquad (4)$$

The $\mathrm{Gf}_{k}$ values were obtained by a non-linear least-squares fitting of Eq. (3). The "nls" function in R (or the "scipy.optimize" library in Python version 2.3) was used for our calculations. To reduce errors arising from the time lag between the SMPS and HTDMA instruments, hourly values of measured Gf and PMF factor contributions were averaged into three-hourly data before running the non-linear least-squares fitting.
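A minimal sketch of such a fit with `scipy.optimize.least_squares` is given below. It assumes a cubic ZSR-type mixing rule weighted by number fraction, and number fractions computed as the product of factor contribution and factor profile normalised over all sources; both forms are assumptions of this sketch, as are the function names, the parameter bounds and the synthetic data. It is not the authors' code.

```python
import numpy as np
from scipy.optimize import least_squares


def number_fraction(g, f_j):
    """eps[i, k] = g[i, k] * f_j[k] / sum_k(g[i, k] * f_j[k]) for one size bin j."""
    counts = g * f_j                               # shape (n_samples, n_sources)
    return counts / counts.sum(axis=1, keepdims=True)


def gf_mixed(gf_k, eps):
    """Assumed ZSR-type cubic mixing: (sum_k eps_k * Gf_k**3) ** (1/3)."""
    return np.cbrt(eps @ gf_k**3)


def fit_source_gf(eps, gf_measured, start=1.2):
    """Non-linear least squares for per-source Gf within assumed bounds."""
    x0 = np.full(eps.shape[1], start)
    result = least_squares(
        lambda gf_k: gf_mixed(gf_k, eps) - gf_measured,  # residual vector
        x0,
        bounds=(0.8, 2.5),                               # hypothetical Gf limits
    )
    return result.x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.uniform(0.1, 1.0, size=(200, 6))         # synthetic contributions
    f_j = np.array([0.5, 0.8, 0.1, 1.0, 0.6, 0.7])   # synthetic profile, one bin
    true_gf = np.array([1.51, 1.34, 1.12, 1.33, 1.09, 1.10])
    eps = number_fraction(g, f_j)
    measured = gf_mixed(true_gf, eps)                # noise-free synthetic "data"
    print(np.round(fit_source_gf(eps, measured), 2))
```

On real data the three-hourly averaging described above would be applied to both inputs before the fit, and measurement noise would limit how well the per-source values are constrained.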

Use of combined PNSD and growth factor data in a PMF model
In this approach, both PNSD data and hygroscopic growth factor data were used as inputs to the PMF model. In a single step, this method gives an output of source-related factors, for each of which there is a particle size distribution and a set of Gf-PDFs for five selected particle sizes. A brief description of the PMF method and its results (Supplementary Figs. 1 and 2) appear in the Supplementary Information.

A decision tree model based on RF algorithm
A decision tree model based on an RF algorithm has been used recently for classification and regression of time series data sets 44,45 . The RF model is an ensemble consisting of hundreds of individual models (decision trees, each with different rules) that together give the final decision. Each tree is grown on a bootstrap sample, drawn randomly with replacement from the training set (a procedure referred to as bagging). At each split, the algorithm considers a random subset of the predictor variables and selects the best of them to partition the data. Observations not drawn into a tree's bootstrap sample are referred to as out-of-bag data and can be used to assess that tree's predictions. The final prediction is the mean of the predictions aggregated from the individual trees.
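The sampling side of this procedure can be sketched without any tree-growing code. The example below only illustrates bootstrap sampling with replacement, the resulting out-of-bag rows, and the averaging of per-tree predictions; the function names and the per-tree values are hypothetical.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def out_of_bag(n_samples, in_bag):
    """Rows that were never drawn into the bootstrap sample."""
    drawn = set(in_bag)
    return [i for i in range(n_samples) if i not in drawn]


def bagged_mean(per_tree_predictions):
    """Ensemble prediction for one observation: mean over the trees."""
    return sum(per_tree_predictions) / len(per_tree_predictions)


if __name__ == "__main__":
    rng = random.Random(42)
    in_bag = bootstrap_indices(1000, rng)
    oob = out_of_bag(1000, in_bag)
    # On average roughly 37% of rows are out-of-bag, the limit of (1 - 1/n)**n.
    print(f"out-of-bag share: {len(oob) / 1000:.2f}")
    print(f"ensemble Gf: {bagged_mean([1.30, 1.36, 1.33]):.2f}")
```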
In investigating the relationship of Gf and the source contributions to aerosol, the Gf for a given particle size was assumed to be a function of the factor contribution (g k ) of each source obtained by PMF and meteorological parameters, as in the following equation:

$$\mathrm{Gf} = f(g_k, \mathrm{wd}, \mathrm{ws}, \mathrm{RH}, \mathrm{temp})$$

where wd and ws are the wind direction and wind speed. PNSD can also be the input instead of g k , but use of many variables could cause a model over-fitting problem. To investigate the effects of chemical composition, we also added concentrations of ionic species (SO42−, NO3−, Cl−, NH4+) and black carbon into the model as input variables. The RF code was developed based on the "random forest" function from the "Keras library" in Python (version 2.3) or the "H2O2" package in R. The data set was split by a fraction of 0.7 to train the model and 0.3 for testing the model. Optimised tuning parameters for an RF model are: number of variables randomly sampled is 3; number of trees is 100; and minimum size of terminal nodes is 3.
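A comparable configuration can be sketched with scikit-learn, which is swapped in here for illustration and is not the library named above. The mapping of "minimum size of terminal nodes" to `min_samples_leaf` is an assumption, and the feature matrix is synthetic: six columns stand in for the PMF factor contributions and four for wind direction, wind speed, RH and temperature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def fit_rf(features, gf, seed=0):
    """Fit an RF regressor on a random 70/30 split; return model and held-out data."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, gf, test_size=0.3, random_state=seed
    )
    model = RandomForestRegressor(
        n_estimators=100,      # number of trees
        max_features=3,        # variables randomly sampled at each split
        min_samples_leaf=3,    # assumed equivalent of minimum terminal node size
        random_state=seed,
    )
    model.fit(x_train, y_train)
    return model, x_test, y_test


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 500
    features = rng.random((n, 10))   # synthetic stand-ins for g_k and meteorology
    gf = (1.1 + 0.4 * features[:, 0] - 0.1 * features[:, 4]
          + rng.normal(0.0, 0.01, n))                 # made-up response
    model, x_test, y_test = fit_rf(features, gf)
    print(f"held-out R2 = {model.score(x_test, y_test):.2f}")
```

Note that `model.score` returns the coefficient of determination, which is not necessarily the same statistic as the r² reported in the study.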
In our RF method, the relationships between the dependent variable (hygroscopic growth factor of particles at a certain dry diameter of 50, 75, 110, 165 and 265 nm) and its predictor features, including the factor contribution (g k , k = 1:6) of each source obtained by PMF and meteorological parameters (wind speed, wind direction, temperature and humidity), are built based on decision trees. The performance of the model was evaluated by RMSE and r² values. The goodness of fit shown in Supplementary Table 1 indicated that RF could be used to impute the missing data sets from the HTDMA based on the PNSD and weather data.
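The two evaluation metrics can be computed as in the sketch below. It assumes that r² denotes the squared Pearson correlation coefficient between observed and modelled growth factors, which the text does not state explicitly; the function names and example values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between paired observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation coefficient between the two series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


if __name__ == "__main__":
    observed = [1.30, 1.42, 1.25, 1.38, 1.33]    # made-up measured Gf
    predicted = [1.32, 1.40, 1.27, 1.35, 1.34]   # made-up modelled Gf
    print(f"RMSE = {rmse(observed, predicted):.3f}")
    print(f"r2   = {r_squared(observed, predicted):.3f}")
```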