Zero-shot learning of aerosol optical properties with graph neural networks

Black carbon (BC), a strongly absorbing aerosol sourced from combustion, is an important short-lived climate forcer. BC’s complex morphology contributes to uncertainty in its direct climate radiative effects, as current methods to accurately calculate the optical properties of these aerosols are too computationally expensive to be used online in models or for observational retrievals. Here we demonstrate that a Graph Neural Network (GNN) trained to predict the optical properties of numerically-generated BC fractal aggregates can accurately generalize to arbitrarily shaped particles, including much larger (\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$10\times$$\end{document}10×) aggregates than in the training dataset. This zero-shot learning approach could be used to estimate single particle optical properties of realistically-shaped aerosol and cloud particles for inclusion in radiative transfer codes for atmospheric models and remote sensing inversions. In addition, GNN’s can be used to gain physical intuition on the relationship between small-scale interactions (here of the spheres’ positions and interactions) and large-scale properties (here of the radiative properties of aerosols).

Aerosols sourced from combustion such as black carbon (BC) are important short-lived climate forcers whose direct radiative forcing and atmospheric lifetime depend on their morphology.These aerosols' complex morphology makes modeling their optical properties difficult, contributing to uncertainty in both their direct and indirect climate effects.Accurate and fast calculations of BC optical properties are needed for remote sensing inversions and for radiative forcing calculations in atmospheric models, but current methods to accurately calculate the optical properties of these aerosols are computationally expensive and are compiled in extensive databases off-line to be used as a look-up table.Recent advances in machine learning approaches have shown the potential of graph neural networks (GNN's) for various physical science applications, demonstrating skill in generalizing beyond initial training data by learning internal properties and smallscale interactions defining the emergent behavior of the larger system.Here we demonstrate that a GNN trained to predict the optical properties of numerically-generated BC fractal aggregates can accurately generalize to arbitrarily shaped particles, even over much larger (10x) aggregates than in the training dataset, providing a fast and accurate method to calculate aerosol optical properties in models and for observational retrievals.This zero-shot learning approach could be integrated into atmospheric models or remote sensing inversions to predict the physical properties of realistically-shaped aerosol and cloud particles.In addition, GNN's can be used to gain physical intuition on the relationship between small-scale interactions (here of the spheres' positions and interactions) and large-scale properties (here of the radiative properties of aerosols).
Carbonaceous aerosols such as black carbon (BC) are important short-lived climate forcers [1,2].To understand their impact on climate, accurate predictions of the optical properties of absorbing aerosols such as BC are needed in atmospheric models and observational retrievals: for estimating the top-of-the atmosphere radiative effects of black carbon [3] and the impact of aged soot on cloud formation [4], for the calculation of the mass absorption coefficient of BC deposited on snow [5], for estimating the relative shortwave heating rates for different types of combustion aerosols [6], for calculating particle-to-gas heat transfer to interpret laser-induced incandescence signals [7], for accurate inversions of imaging nephelometers [8], for constraining the index of refraction of biomass burning aerosols [9], and for interpreting the optical properties of aerosols deposited on filters [e.g.10,11].Accurate calculations of carbonaceous aerosol optical properties are also important for observational retrievals in other planetary atmospheres, as these aerosols may play a role in the radiative balance of e.g. the middle atmosphere of Jupiter [12].BC particles in the atmosphere have a variety of sizes, shapes, and chemical compositions, all of which impact their optical properties (Figure 1).BC's optical properties depend on both the morphology of the primary (bare) BC particle, as well as its internal mixing with other materials (coatings) through the condensation of gas phase species during atmospheric aging.Both combustion conditions [13] and atmospheric aging [14] impact the morphology of these aerosols, which are fractal-like aggregates, typically embedded within (internally mixed) or attached to other aerosol components.The complex morphology of bare BC is generally not parameterized in models, although modeling bare BC as a sphere biases radiative forcing estimates, with too little warming by absorption and too much cooling by scattering [15].Internal mixing is modeled using a Mie Theory core-shell model, which approximates the bare BC portion as an absorbing "core", with a concentric sphere of "coating" material with an index of refraction characteristic of the internally mixed material.Several recent papers have demonstrated this Mie Theory core-shell approximation leads to an over-prediction of BC absorption in models by as much as a factor of 2 [13,16].In addition, not only are more accurate calculations of BC optical properties needed to better constrain models to observations, but models need to be capable of representing the heterogeneity of optical properties in diverse aerosol populations [13,16].
While models and observational retrievals have generally relied on Mie Theory, more accurate methods to predict the optical properties for arbitrarily shaped particles such as the Multiple Sphere T-Matrix Method (MSTM) [17,18], the discrete dipole approximation (DDA) [19,20], and the Generalized Multiple-Particle Mie (GMM) Theory [21,22] have been developed.These methods approximate BC fractal aggregates as clusters of spheres (Figure 1) and provide exact analytical solutions to the timeharmonic Maxwell's equations for the multiple sphere system.However, these approaches are computationally expensive, often requiring hours or even days to compute the optical properties of single aerosol particles with complex morphologies [23].To mitigate this computational bottle-neck, pre-calculated databases of fractal aggregate optical properties using these exact analytical methods have recently been created [23,24,25,26], but such approaches are limited to linear interpolation within the data-bases' optical and morphological properties.There is still significant uncertainty about the fundamental properties of BC from different emission sources and under different combustion conditions, and the additional complexity of internal mixing with nonabsorbing and absorbing materials during atmospheric aging [2] would require these databases to cover a very large parameter space to accurately represent the range of conditions for BC aerosols observed in the atmosphere.Moreover, observational inversions of BC have greater uncertainty when performed with only a subset of possible parameters.Machine learning offers a promising approach for reducing computational bottle-necks by speeding up numerically-intensive aspects of atmospheric models [27,28].As such it could offer an efficient alternative approach to compiling pre-computed databases for BC's optical properties.However machine learning methods are traditionally strongly dependent on the data they are trained with, and struggle to generalize beyond the training distribution.An illustrative example previously investigated a machine learning approach to predicting BC's optical properties from its morphological parameters and index of refraction using a support vector machine (SVM) trained on accurate MSTM calculations but could not accurately predict the optical properties of aggregates with morphological parameters beyond those used in the initial training data set [29].Other brute force approaches such as neural networks (NN) or random forests (RF) will similarly struggle to generate realistic BC properties outside of the training datasets.
Here we show the optical properties of bare BC with complex morphology can be accurately predicted with a graph neural network (GNN) by representing BC fractal aggregates as networks of interacting spheres.GNN's are recently developed machine learning algorithms that learn on graph-structured data sets, allowing models to directly include arbitrary relational information [30,31].These models have shown great promise in predicting the large-scale properties of structured physical science datasets such as molecules [32,33], protein-protein interaction networks [34], and glasses [35].GNN's have demonstrated skill in predicting complex global features of physical systems through learning simpler local physics [36]; here we demonstrate that through including local information about BC's structure, BC's global properties can be inferred.Importantly, because GNN's learn models for specific substructures (i.e. the nodes and their relationships with their neighbors in the graph), they are able to immediately generalize to graphs with arbitrary numbers of nodes; we exploit this feature of GNN's to predict the optical properties of BC aggregates that are significantly larger than those used in the training data set.This zeroshot learning (where models can immediately generalize to samples not represented in their original training data) paves the way towards new, flexible parameterizations of aerosol microphysical properties.

BC Fractal Aggregates as Networks
Physical properties of bare BC Primary (bare) BC particles are fractal-like aggregates with geometries that can be described according to a statistical scaling rule as where a is the primary particle mean radius, k f is the fractal pre-factor, D f is the fractal (Hausdorff) dimension, N s is the number of primary spheres, or monomers, in the aggregate, and R g is the radius of gyration, defined as where r i and r 0 denote the ith monomer center and the center of mass of the cluster, respectively (assuming all monomers have the same mass [37]).In addition to the aggregate geometry, the basic physical properties of these particles follow this scaling law [38].As a consequence of their fractal nature, aggregates are self-similar on different length scales.
The fractal-like nature of these aerosols is a result of their formation from gas-phase precursors through the aggregation and growth of hydrocarbon clusters during incomplete combustion, although this process is not yet completely understood [39].The initial morphology depends on both the combustion conditions and the emission source, with different observational methods also impacting the retrieved parameters [15].After their initial formation during combustion, atmospheric aging (due to cloud processing or the condensation of gas phase species) leads to these aerosols becoming more compact, causing D f to increase over time.This aging is expected to lead to a decrease in their top of the atmosphere radiative effects [13].Previous work has shown that k f determines the compactness of aggregate branches, although little is understood about k f 's evolution over time [15].

Numerically-generated Fractal Aggregates
To investigate how fractal aggregate particles can be modeled as networks of interacting spheres, we numerically generated fractal aggregates with N s spheres using a cluster-cluster algorithm [40] based on the one described in [38], which uses a Monte Carlo approach to randomly generate aggregates with a specified fractal dimension D f and fractal pre-factor k f .We generate Cartesian coordinates for the monomers in the aggregate in dimensionless coordinates by scaling by a factor of k = 2π λ , where λ is the wavelength of the incident light.

Characteristic length scale
To represent fractal aggregates as graphs, monomers with center positions closer together than the characteristic length scale C of the network, are connected, where X v = ka is the monomer size parameter (Fig. 2).This assumption derives from the characteristic length scale of a network with N s nodes [41].
Multiplying by X v gives a consistent number of edges irregardless of the size parameter of the aggregate, such that aggregates with the same fractal parameters but different size parameters would be encoded within the same graph structure.An example of the resulting undirected graph structure and adjacency matrix for two different aggregates with different fractal dimensions but the same number of monomers is shown in Figure 3a-d.This scaling encodes the density of edges in local neighborhoods relative to the fractal dimension of the aggregate, irregardless of the actual size of the aggregate.The total number of edges in the graph is then proportional to both N s and D f (Figure 3e), with the average degree of nodes increasing relative to D f (SI Fig. 4a).The degree distribution of nodes also depends on the fractal pre-prefactor k f (SI Fig. 4b).

GNN model for BC optical properties
Accurate solutions for the electromagnetic scattering and absorption properties for multiple sphere clusters (as BC aggregates are typically modeled) is computationally expensive because a full-wave optics treatment is needed.In the general case, spheres interact with one another, and the total scattering field component is a superposition of the components radiated from each sphere in the system [42].While the solution for the continuity equation at the surface of each sphere in the system can be solved analytically by expanding the incident and scattered fields from each sphere in terms of vector spherical wave functions, this approach generates a very large system of coupled linear equations that must be solved iteratively [43].
Additional details about the formal solution are given in Supplementary Information.
While this approach provides a fully analytical solution for light scattering from the multiple sphere cluster, the computational time for these brute-force approaches scale significantly with N s and X v as they do not take into account specific details of BC's topological structure, which could lend itself to model order reduction.Filippov et al. [38] previously explored the relationship between the morphology of BC and their aggregate physical properties using the Rayleigh-Debye-Gans (RDG) approximation and found that aggregates with similar fractal parameters also have similar physical properties.Recent work in [23] found empirical relationships between the optical properties of aggregates and their morphological parameters using extensive MSTM calculations.Machine learning offers an alternative approach for learning relevant predictors without the need for human-defined features; GNN's in particular can learn features that correspond to the relationship between the nodes (the individual spheres) and the large scale physical properties of the aggregates.
To investigate the connection between BC's fractal structure and its optical properties, we trained a GNN to predict the optical properties of BC aggregates, using the values from an analytical solution for the electromagnetic scattering and absorption properties from MSTM as ground-truth (Figure 2).We tested several different GNN approaches (See SI) and found an Interaction network (IN) [31] gave the best performance for predicting both the integral and angle-resolved optical properties.The IN (Fig. 2) is based on message passing, where nodes send and receive messages along edges from their neighbors.The messages are aggregated for each node and the nodes are updated based on the central node features and the messages received from neighboring nodes.Graph level predictions are made by aggregating the updated node embeddings from all the nodes in the graph and then applying a graph level model to the aggregated node embeddings (the readout in Figure 2).Here we predict the total extinction, scattering, and absorption efficiencies Q ext , Q scat , Q abs , the asymmetry parameter, g, and the angle-resolved elements of the scattering phase matrix S ij (θ) for the orientation-averaged case.(See Methods for discussion of aerosol optical properties and data sets).For each training example, we input X v , the real part of the index of refraction Re(n k ) (since we consider only cases where the imaginary part is Im(n k ) = 1 − Re(n k )), and the dimensionless coordinates of each sphere as node features.As edge features, we use the distance between neighboring spheres.We trained the model using 15314 aggregates from the training data set, as the training loss did not significantly decrease with additional samples (SI Figure S13); training data sets as small as 3000 aggregates showed reasonable generalization performance.The training data set consisted of aggregates with a small number of monomers (N s < 100).We tested the model on an independent test set of 7656 aggregates with the same distribution of parameters as the training data set (N s < 100).We further investigated the generalizability of the model on an independent zero-shot test set of 440 aggregates that were significantly larger (100 < N s < 1000) than the ones the model was trained on.An additional 440 large aggregates were used as a zero-shot validation data set to determine which model architecture provided the best zero-shot performance (SI Figures S9-S12).N s = 100 was chosen as the maximum size for aggregates in the training data set as smaller maximum sizes increased the bias in the zero-shot performance (SI Figure S14).The zero-shot test data set was evenly distributed among the aggregate parameters (Figure S2) to provide an estimate of generalization performance across the full parameter space.
Figure 4a-d shows the IN predictions compared to the actual values for Q ext , Q scat , Q abs , and g with the training data shown in blue and the zero-shot test data sets shown in orange, while Figure 4e shows the predictions for the S 11 (θ) element of the scattering phase matrix for several different aggregates in the test sets.Predictions for both integral and angle resolved optical properties were reasonable across the entire range of size parameters (X v =0.1 to 1.0), indices of refraction n k = 1.4+0.4i to n k = 2.0 + 1.0i, and fractal parameters, with predictions for the integral optical properties in the test data set within 10% of the true values.For S 11 , both the magnitude and functional form were well-approximated across the range of parameters in the test set, although the model did deviate slightly more from the true values for larger N s and X v (e.g. the green line in Fig. 4e).Predictions for the entire angle-resolved scattering phase matrix elements S ij (θ), for j ≥ i, were also reasonable (See SI Figure S16).For the test data with the same distribution of parameters as the training data set (N s < 100), the model predictions were very close to the true values; these results are shown in SI Figure S17.
In addition to generalizability, the IN model demonstrated physical consistency in its predictions for the aggregate optical properties.The 3 scattering efficiencies are not independent, as The model directly inferred this dependency for both the training and test sets without imposing this as a constraint.Additionally, integrating S 11 over the solid angle is equivalent to g [42], Without explicitly imposing this integral constraint, the model predictions were consistent with this constraint (Figure 4f).

Analysis of the GNN predictions
To understand how the IN model predicts the optical properties of BC fractal aggregates, including those much larger than the model was trained on, we emphasize that the graph input for the model does not directly include D f or k f as features but rather the fractal structure is implicitly encoded as the interactions between the neighboring spheres.The previously used SVM approach to predict BC's optical properties included N s , D f , and k f as features to predict Q ext , Q scat , Q abs , and g [29].
Since the network structure in the IN approach is directly learned from the sphere positions, and the model is learned at the node level, the IN approach can generalize beyond its initial training set for these morphological parameters to unseen configurations.The generalization of the IN model to a range of D f is an important feature, as it is challenging to find approximations that are valid across fractal dimension [44].Because the IN model learns about the local neighborhood of each sphere, it is able to more accurately estimate the impacts of screening on absorption and scattering than the RDG approximation, [42,44], an approach often used to approximate the optical properties of BC aggregates in a computationally efficient manner as an improvement on the equivalent sphere model.RDG assumes that individual monomers only interact with the incident electromagnetic field (neglecting multiple interactions), which can lead to absorption being under-predicted by 10-20%, and significantly under-predicting g by more than a factor of 10 [15].The IN model effectively learns, in an unsupervised manner, a simplified sphere level model that more fully captures the complexity of the optical properties of the full analytical solution [15].
The optical properties of aggregates in this regime can be modeled with the assumption of a fairly shallow graph model (for the IN model a single layer performed best; for the GCN little improvement was seen beyond 3 or 4 layers, Figure S5), suggesting that the majority of the structure influencing the optical properties of aerosols in this regime can be approximated from local interactions.We also investigated using a length scale of C = X v Log(N s )/Log(Log(N s )) (characteristic of scalefree networks) to form graphs from aggregates [41], rather than Eq. 3.This length scale has the advantage that the degree of each node scales less quickly with N s , but the IN model performed worse in this case.This indicates that including a larger local neighborhood at each layer (Eq. 3) is more informative for the model.

Discussion and Outlook
The network approach presented here provides a new framework for understanding the microphysical relationship between the morphological properties of BC and its larger scale physical properties.Here we have chosen to focus on the prediction of optical properties for numerically generated fractal aggregates, as the generation of these aggregates from combustion processes and their transformation during atmospheric aging is not yet completely understood.However, applying network theory to atmospheric aerosols suggests new directions for thinking about the generation of these fractal aggregates through combustion processes due to the connection between complex networks and percolation theory [45].Here we have used a cluster-cluster algorithm, although previous work has noted that the morphology of numerically generated fractal aggregates depends not only on the parameters (N s , a, D f , and k f ) defining the shape of the aggregate, but also on which algorithm is used to generate the sphere positions (e.g.diffusion-limited aggregation or diffusionlimited cluster aggregation) [38,46].The network approach provides a new framework from which to understand how realistically numerical algorithms reproduce the properties of aerosols formed during incomplete combustion through comparison of their network characteristics [41].This approach may also be useful for inferring 3 dimensional structure of aggregates from 2 dimensional transmission electron microscope (TEM) images of these aerosols [10,11], since it relates the relative positions of spheres to their overall morphological features; 2D methods have previously been shown to systematically underestimate the fractal dimension of BC [47].
As a proof of concept we have trained a GNN to predict the optical properties of bare BC fractal aggregates with a range of different fractal parameters.This study demonstrates that modeling aerosol fractal aggregates as networks of interacting spheres provides morphological information that allows the machine learning model to extrapolate far beyond their initial training data set.This approach may also be useful for other fractal systems found in nature, such as turbulence, vegetation, or river networks.
BC in the atmosphere is typically internally mixed.The GNN approach provides an obvious extension to internally mixed aerosols (Figure 1), as the thickness of coatings and their indices of refraction or organic fraction could be included as additional node-level features (in the thinly coated case) or graph-level features (for the thickly coated case).Other factors influencing the optical properties of aggregates such as "necking" between overlapping monomers could be included as edge features.Because atmospheric aerosol retrievals rely on orientation averaged parameters, models for predicting the scatter-ing phase function should be equivariant under rotations.Recently developed equivariant machine learning methods [48,49,50,51] may provide improved prediction of the orientation averaged optical properties.
Uncertainty in BC direct radiative climate effects is attributable to multiple factors, including BC's emissions, lifetime, atmospheric processing, and optical properties [1,2,52]; the GNN approach could help resolve this uncertainty by improving both the interpretation of BC observations and by allowing BC's morphology to be accurately represented in atmospheric models in a computationally efficient manner.As a greater understanding of BC's physical properties from different source contributions and atmospheric aging pathways becomes available through laboratory and observational studies [13,14,16], the major remaining hurdle to accurately representing BC in models will be computational.
While previous exact analytical methods have computational wall-times scaling from hours to days for larger aggregates, inference is on the order of < 0.3 seconds per aggregate for the trained GNN model (On a CPU-see SI Figure S15).The computational time for these exact analytical methods has precluded exact calculations of aerosol optical properties being used in models or observational retrievals.CELES, a CUDA-accelerated version of MSTM capable of running on a GPU, demonstrated a factor of 1.5-6 times speed up over MSTM, but was still too slow to be implemented online in models [53].The significantly faster time-scale for the GNN model, as well as its generalizability to arbitrarily shaped aggregates compared to more standard ML methods, has the potential to transform existing model parameterizations for BC.For MSTM computational wall times scale with N s , X v , and D f ; while the total inference and memory scales with N s and D f in the GNN approach, it is no longer a function of X v .
We have focused here on the forward problem of predicting the optical properties of BC given an assumed single particle morphology; however such an approach may also be useful for the inverse problem, i.e. inferring the morphology given the scattering phase function and integral optical properties.This approach could also provide insight into other physical properties which require detailed information about particle morphology [38], such as energy and heat transfer between aggregates and the surrounding gas needed to develop physical models of laser-induced incandescence [7,54].Radiative transfer calculations for mineral dust and ice crystals also rely on detailed information about particle morphology, suggesting that the GNN approach would be useful for modeling their optical properties as well.This approach could mitigate several long-standing issues with model parameterizations and observational retrievals for these species, by providing flexible parameterization of arbitrarily shaped aerosol and cloud particles that are fast enough to be deployed online in atmospheric models.
Finally, these methods have potential for new applications of machine-learning assisted materials discovery [55,56].Proposed geo-engineering approaches to mitigate global or regional impacts of climate change, such as stratospheric aerosol injection, marine cloud brightening, or precipitation enhancement, rely on the development of novel aerosol materials.Generative graph models could be used to determine optimal aerosol morphologies resulting in physical properties specific to these applications at a fraction of the cost of traditional numerical methods [57].

Numerical aggregate properties
Primary clusters of size Nc = 3, 4, 5, 7, 9, 11, 13, 15, 17, 19 were used to generate aggregates between Ns=8 to 960 spheres with fractal dimensions between D f =1.8 to 2.3.Following [23], we assume a fractal pre-factor of k f =1.2 (for the aggregates used in the MSTM calculations).We also investigated the network parameters of aggregates with k f = 1.0 − 1.5, for a given D f = 1.8 (Figure S4b).Aerosols are assumed to consist of isotropic, homogeneous spheres, with size parameters Xv = 0.1, 0.3, 0.5, 0.7, 0.9, and 1.0, corresponding to monomer radii between 7 to 72 nm for incident light at 450 nm and 10 to 104 nm at 650 nm.For each primary cluster size and fractal dimension, 10 aggregate realizations were randomly generated.

Aerosol optical properties
For radiative transfer applications, the orientation-averaged total scattering Qsca , extinction Qext , and absorption efficiencies Q abs , as well as the asymmetry parameter g = Cscacos(θ ) are typical parameters that are needed (Csca is the scattering cross-section which is related to the efficiency as Qsca = Csca/(pi * a 2 agg ), where aagg is the effective radius of the aggregate).The asymmetry parameter relates the amount of forward to back-scattered light.Other parameters relevant for radiative transfer, such as the single scattering albedo (SSA), can be derived from these parameters (SSA = Qsca/Qext).The mass absorption coefficient (MAC) or mass extinction coefficient (MEC) are typically used to relate emissions of these aerosols to their direct radiative effects, and they are sometimes estimated theoretically from C abs or Cext with assumptions about particle density.
The scattering phase function relates the incident and scattered Stokes parameters, e.g. it indicates how light scattering from the particle is transformed relative to incident light in terms of its intensity and polarization state [42].Here we assume initially unpolarized incident light, in which case the S11 element specifies the angular distribution of the intensity of scattered relative to incident light.The scattered light is partially polarized, with degree of polarization given by (S

MSTM calculations of bare BC optical properties
To determine the ground-truth optical properties for the BC fractal aggregates generated by the cluster-cluster algorithm we use the Fortran-90 implementation of the multiple-sphere T-matrix code as described in [43], which can run on a highperformance, parallel based computational platform.This code numerically solves for electromagnetic wave scattering from multiple (non-overlapping) sphere systems for either a fixed or random (orientation-averaged) orientation with respect to an incident plane wave.Here we have focused on calculation of random orientation optical properties, which utilizes the T-matrix procedure developed in [18].We assume indices of refraction consistent with a range of values from the literature for BC at 550 nm: (1.4+0.4i,1.6+0.6i,1.8+0.8i,2.0+1.0i).MSTM calculations were performed for these range of indices of refraction for 57,556 numerically generated aggregates for Ns < 100; we used randomly chosen aggregates from this data set for the training, validation, and test sets for the model.To test the zero-shot performance, MSTM calculations were performed for 880 aggregates with these parameters in the size range 100 < Ns < 1000; we randomly split this data into a zero-shot validation data set to evaluate the model's performance and an independent zero-shot test data set.A summary of the range of parameters for each data set is given in Table S1.

Graph Neural Networks
We used Pytorch Geometric [58] to implement the GNN models.Several GNN approaches were tested, including a simple graph convolutional network (SGC) [59], a graph convolutional network (GCN) [30], and an interaction network (IN) [31] (See Supplementary Information for additional details of the graph models and a comparison of performance metrics among different model parameters and targets).The best performance for the integral optical properties used an IN model with a hidden layer size of 300 for both the node and edge models, and a message size of 100.Both the node and edge models are MLPs with ReLU as non-linear activation function between layers.Aggregation for the edge model is addition, with global mean pooling followed by dropout (p=0.5) and a linear layer of size 100 as the global aggregation function.For the prediction of S11 we found that adding a fully connected node to each graph slightly improved the zero-shot performance.The model architecture was the same as that used to predict the integral optical properties.A batch size of 20 was used (training with a batch size of 2 led to slower training but did not lead to significantly worse performance).For the graph regression task, MSE loss was assumed.We trained the GNN models on a Nvidia RTX 8000 GPU.In this Supplementary Information we provide a more detailed description of the electromagnetic scattering problem (Section S1) and the graph neural network modeling approaches that we tested for the prediction of aerosol optical properties (Section S2).We discuss only approaches relevant to the current work here; additional information about graph neural networks and their applications can be found in several recent comprehensive review papers [1,2,3,4,5].

S1 Electromagnetic scattering from multiple spheres Mie Theory
Electromagnetic scattering from a sphere is known as the Lorenz-Mie solution to Maxwell's equations [6,7].Assuming a plane wave is incident on the sphere, the solution can be found by decomposing the incident plane wave into vector spherical harmonics and writing the sphere's internal and external electromagnetic fields in terms of a vector harmonic expansion.Matching the boundary conditions at the sphere's surface allows the expansion coefficients of the incident, internal, and scattered fields to be determined as an infinite series of spherical multipole partial waves.

Analytical solution for the multiple sphere system
Following the notation in [8], extending the Lorenz-Mie theory to the multiple sphere system is more complex because the field external to a sphere i is a superposition of both the incident plane wave and the scattered fields from all the other spheres in the system: The incident and scattered wave for sphere i can then be written in terms of vector spherical wave functions centered about the origin of the ith sphere.
where N mnp is the vector spherical wave function of type 1 (Bessel) or type 3 (Hankel) functions, n is the order, m is the degree, and p is the mode (1 for TM or 2 for TE).The incident field coefficient f i mnp depends on the field incident on sphere i, and the scattered field coefficients a i mnp are unknown coefficients that need to be solved for.To find analytical solutions for the scattered field coefficients, a truncation limit L i for the infinite series of vector spherical wave functions can be determined to provide an acceptable level of convergence.In general, L i will be proportional to the size parameter X v .Applying the continuity equation at the surface of each sphere then leads to a system of coupled linear equations for the scattered field coefficients for all the spheres in the systems: where a i np are the Mie coefficients of the sphere and H i−j is the transformation matrix for outgoing spherical vector wave functions centered about origin j to an expansion about origin i. Equations 1-4 are the formal solution for the scattered field from all the spheres in the system.Equation 4 is a system of 2N s L i (L i + 2) linear equations, assuming the same truncation limit L i can be used for all the spheres in the system.Numerical methods are needed to iteratively solve this large system of linear equations.
To find the random orientation optical properties adds an additional level of complexity since the scattering must be averaged over all possible orientations and polarization states of the incident light.The T-Matrix approach [9,10] is one method that is commonly used.Other computational approaches to calculate the optical properties of BC are discussed in a recent review [11].

Rayleigh-Debye-Gans Approximation
Rayleigh-Debye-Gans theory is a widely used approximation for light scattering from irregularly shaped particles [7,12].It relies on a length scaling argument, e.g.whether scattering from individual particles in a system will add constructively or destructively [7,12].It is valid in the range where particles are small relative to the wavelength of light: The second condition, e.g. that particles be optically "soft", is not satisfied for BC.

S2 Machine Learning Approach Graph Data Sets
Here we define a graph as G = (V, A), where V is the vertex set of nodes {ν 1 , ..., ν n } and A ∈ R n×n is a symmetric adjacency matrix (as we consider only undirected graphs here).The adjacency matrix summarizes the connections, or edges, between the nodes.It can be either unweighted or weighted, such that elements of the adjacency matrix a ij denote the edge weight between nodes ν i and ν j .For the unweighted case, a ij = 1 if ν i and ν j are connected, a ij = 0 otherwise.The graph's degree matrix D is a diagonal matrix of node degrees with elements d ii = j a ij (non-diagonal elements are 0).The neighborhood N k (i) are the nodes that are k steps away from the center node ν i .We omit the subscript for neighborhood k = 1 steps away, e.g. the immediate neighborhood of the ith node is N (i).
Node features Each node ν i has an associated feature vector F νi .Here the features for each node are F νi = (X v , Re(n k ), x i , y i , z i ), where (x i , y i , z i ) are the cartesian coordinates of the ith sphere.Since we make the assumption that aggregates are composed of isotropic, homogeneous spheres, X v and Re(n k ) could be considered "graph" features (e.g.F u ) rather than "node" features, but in order to make the approach generalizable to systems with heterogenous spheres, we do not take this perspective here.

Edge features
In general, edges e ij can also have associated feature vectors F eij ; in this case we consider only a single edge feature, which can be represented as the weights a ij in the adjacency matrix.The edge feature we consider is the normalized distance between adjacent spheres, , where C is the characteristic length scale (maximum distance between connected nodes) in the aggregate.
Graph Targets Here we focus on two supervised graph regression tasks, based on typical calculations needed for BC radiative forcing calculations and observational retrievals: • prediction of the integral optical properties of BC (the efficiencies Q scat , Q abs , and Q ext and the asymmetry parameter g) • prediction of the angle-dependent elements of the scattering phase matrix S ij (θ).We have mainly focused here on the prediction of S 11 (θ), although we also tested the prediction of all S ij (θ) elements for j ≥ i (Figure S16).
We denote the integral targets as Y I i and the angle dependent target S 11 (θ) as Y S i .The predicted values for these targets from the GNN model are denoted as Ŷ I i and Ŷ S i , respectively.

Graph Neural Network Models
The basic idea of graph neural networks for the graph regression task that we consider here is that each node, regardless of its number of neighbors, learns an embedding based upon the value of its feature vector F νi as well as the features of its neighbors F νj , j ∈ N (i).Because the model is learned at the node level, the information from each of the nodes in the graph contributes to the fidelity of the model (e.g. the weights of the node-level model are learned based on all the nodes, and the neighborhoods of all the nodes, in the graphs).This node-level inductive bias also enables the zero-shot learning perspective since the model is able to immediately generalize to graphs with arbitrary numbers of nodes [2].These node-level embeddings are then aggregated together and a graph level prediction is made.The model weights are learned by optimizing the graph-level predictions Ŷ I i and Ŷ S i against the true values Y I i and Y S i for the training data.
Optimization problem The model parameters Θ are learned through back-propagation using the ADAM optimizer [13].Here we assume mean-squared error (MSE) loss: The learning rate is varied during training according to a 1 cycle learning rate policy with an upper learning rate boundary of 1e −4 [14].We use L 2 regularization to minimize the model parameter weights with a weight decay parameter of 1e −8 .We also investigated training the model on each set of targets individually (e.g. on only the first or only the second term on the right hand side of Eq. 7).We found that it was preferable to separately train the model to predict Ŷ I for the first task, while the prediction of Ŷ S was improved by training the model to predict both targets simultaneously.

Propagation rule
The propagation rule is the method used to propagate information between nodes, capturing both the topological information about the graph structure and aggregating the node features.We tested several different approaches for the propagation module, including the simple graph convolution network (SGC) [15], the graph convolutional network (GCN) [16], and an interaction network (IN) [2,17].We denote each layer as H l , where H l is a R n×f l dimensional matrix, with n the number of nodes in the graph and f l is the dimension of the node state, or embedding, at that layer (for the input layer, this is just the node feature vector F νi ).
• SGC: The Simple Graph Convolution (SGC) is described in [15].The SGC differs from the other propagation rules that we tested, in that rather than having multiple layers, it reduces the complexity of multiple layers to a single linear transformation.The advantage of this approach is to remove potentially unnecessary complexity in GNN's and be more computationally efficient, with fewer fit parameters.The SGC is also more directly interpretable: the effect of the SGC is as a fixed filter on the graph spectral domain, effectively smoothing the local neighborhood so that nearby nodes have similar representations and predictions.Because there are no non-linear activation functions between layers, the effects of applying the SGC over multiple layers can be collapsed into a single linear transformation.The propagation rule to the Kth layer can be written as where S = D− 1 2 Ã D− 1 2 is the normalized adjacency matrix with added self-loops, and Θ is a learned weight matrix.
• GCN: The Graph Convolutional Network (GCN) is described in [16].The GCN's layer-wise propagation rule is based on a first-order approximation of spectral graph convolutions, which can be understood as localized spectral filters on graphs.This filter is localized to nodes that are at most 1 step away from the central node, e.g. the neighborhood N (i), and the propagation rule is where Ã = A + I Ns is the adjacency matrix plus self-loops, D is the corresponding degree matrix, and Θ l is the layer-specific trainable-weight matrix.σ is a non-linear activation function; here we use the rectified linear unit (ReLU), which is defined as ReLU(x) = max(0, x).Multiple convolutional layers can be stacked to create a deeper model, e.g. to propagate information to nodes that are K steps away.Each layer in this framework has a separate trainable-weight matrix.
• WGCN: The WGCN includes a weighted adjacency matrix, with distances scaled as r ij /C, where C is the characteristic length scale (in this case the maximum possible distance between connected nodes).
• IN: The Interaction Network (IN) is described in [17].The IN is based upon the idea of message-passing [18], where nodes "send" and "receive" messages along edges from their neighbors.(The GCN can also be reframed as message passing, as was done in [18]).In general, the IN can accommodate directed multi-graphs, although we consider only the case of an undirected graph with a single edge attribute between node i and node j, F eij .We also do not include the aggregation of all edge features in the global update, as in the full GraphNet block described in [2].
Following the terminology in [2], the IN block consists of three "update" functions φ: and two "aggregation" functions ρ: where E i = {F e ij } is the set of all the message functions for the central node i and all of its neighbors j, and V = {F ν i } i=1:N ν is the set of all the nodes in the graph.The update functions φ e , φ νi and φ u are denoted the edge-level, node-level, and graph-level models, respectively, and the primes indicate the updated state of the edges, node, and graph features.Each update function is a neural network with trainable weights.In this case both φ e and φ νi are implemented as MLP's with ReLU as a non-linear activation function between layers.
As we are interested in graph-level prediction, the model is trained such that F u predicts the graph targets.
We try two variants, one where the center node's features are concatenated to the aggregated messages before passing to the node model (Eq.11, denoted 'IN') and one where only the learned messages are passed to the node model , denoted as 'IN2'.We use addition for the aggregation scheme over the edges for each node (Eq.13).We test different aggregation functions for the global function (Eq.14), as described in the next section.

Read-out operations for graph regression
In order to aggregate the information from all of the nodes for the graph regression task, we use a graph pooling operation; graph pooling is required to be order-invariant (since a typical assumption for graph models is that nodes are permutation invariant-i.e. the identity of nodes should be interchangeable, as the graph-level prediction should be independent of the labeling of the nodes).We compare summation, averaging, and max.pooling, as pooling operations, where h G is the graph representation, and h L i is the representation of node ν i at the final layer L.
After the graph pooling operation, we apply a final classifier, which is a fully connected layer.Here we use dropout (a regularization technique used to avoid overfitting where some of the elements of input tensor are randomly zeroed with a probability of p [19]) before the final classifier, with p = 0.5.The model is trained in an end-to-end fashion, where the weights for both the node level propagation/message passing layers and the final pooling and classification layer are optimized at once.

Impact of graph model on predictions
Propagation rules and graph layers Since the output of a GCN or IN layer is itself a graph G (with the same connectivity as the original graph G), multiple layers can be stacked together.The layers correspond to the depth into a graph from each node that information is propagated, e.g.having k layers means that information can be propagated to neighbors that are at most k steps away.Because the information is effectively averaged over neighbors at each layer, however, having deep graph models is typically not as effective as having e.g.deep convolutional networks [20].We tested both varying the number of layers before pooling and the number of hidden dimensions in each layer.
In Figure S5 we show a comparison of the MSE loss for the training data set as a function of the number of convolutional layers for the SGC vs. the GCN model on the task of predicting the integral optical properties Ŷi .The loss was significantly higher for the SGC, and did not improve with more layers, as the GCN model did, suggesting that the non-linearity of the GCN spectral filtering contributes to the improved prediction skill.The GCN model's performance significantly improved with more layers, suggesting that both propagating information further in the graph and allowing different weights to be learned at each layer improved the prediction relative to the SGC model.Similar results were seen for predicting Ŷ S or both Ŷ I and Ŷ S , as in Eq. 7. Most of the improvement was seen in the addition of the first 4 layers (Figure S6); the validation loss did not significantly improve after 4 layers.The training loss for prediction of Ŷ S or both Ŷ I and Ŷ S similarly showed most improvement in the addition of the first 4 layers.The WGCN model performed slightly better than the GCN model with the same number of layers, indicating that including the distance between neighboring spheres as edge weights can improve the prediction of the model.The IN model's performance, even with a single graph layer, significantly out-performed the GCN variations; however, adding more than a single message passing layer for this model worsened, rather than improved, performance for the IN.
Impact of fully connected node Adding a special node to the graph that is connected to all the other nodes in the graph is one method commonly used to allow information to be propagated throughout the graph efficiently [18,21], although it obscures the direct interpretation of the nodes in the graphs as the spheres of the aggregate, in this case.This approach is less computationally expensive than having e.g. a fully connected graph [18].In this case we tested adding a fully connected node that has the same X v and Re(n k ) as the other spheres in the aggregate.To test models that include edge features, this node is assumed to be an averaged distance r ij away from every other node in the graph (in this case we average only over nodes that are connected, e.g.only over the distances less than the characteristic length scale of the network).
Adding a fully connected node to the GCN (GCN-FC) improved the training loss slightly for the prediction of the integral optical properties, which suggests that allowing information to propagate further in the graph could improve prediction of Y i .We denote this as "FC" in Figure S5.Adding a fully connected node improved performance on the validation sets for the IN model (Figure S10), but led to a bias in the performance on the zero-shot validation data set (Figure S11).For the prediction of the angle-dependent targets Y s with the IN model, we found that the inclusion of the fully connected node led to slightly improved generalizability when comparing the predicted S 11 integrated over the solid angle to the values of g (Eq. 4 in the main manuscript) compared to no fully connected node for the zero-shot validation data set.

Impact of Node Model and Message Size
For the IN model, we investigated different approaches for the node and edge level models.We varied the number of layers in both the node and edge level models (as discussed above, they are each MLP's with ReLU as a non-linear activation function between layers), as well as the size of the learned message.In Figure S7, we show that inputting both the concatenated messages from all the neighbors as well as the central node's features ('IN') performed better than passing only the learned messages to the node model ('IN2').Having several layers in the node and edge models decreased MSE loss for the training and validation data sets compared with only a single layer (H e L=3 ,H v L=3 vs. H e L=1 ,H v L=1 ).This suggests that more complex approaches for the node and edge models, such as using spherical harmonic filters [22], could further improve predictions by more fully exploiting the connection to the Mie theory solution for each individual sphere as influenced by the incident light and all the other spheres in the system (Eq.1).We also found that, while increasing the size of the learned message slightly improved the MSE loss for the training and validation sets for both predicting Ŷ S and predicting Ŷ I (Figure S8), restricting the size to 100 improved the zero-shot generalization performance on the zero-shot validation data set (Figure S9).This bottle-necking was previously observed to provide improved interpretability of the learned messages in an IN applied to an n-body physics simulation [23].

Impact of readout function
The readout operations of average and max pooling performed better for the training and validation sets for both the GCN and IN for the prediction of Ŷ I (Figure S10).Summation gave a lower MSE loss than the mean global pooling for the training and validation data sets for the prediction of Y S ; however the generalization performance was typically better for the mean and max.pooling cases (Figure S12).Intuitively we might expect summation to provide the best performance, as the total optical properties of the aggregate are due to the contributions of all the individual spheres; however because scattering and absorption can add constructively and destructively, this suggests the mean is more informative.

Impact of characteristic length scale
As discussed in the main text, we generate graphs for each BC fractal aggregate from their dimensionless cartesian coordinates by connecting spheres with center positions closer than the characteristic length scale C of the network.Since this length scale will impact the resulting graph structure and potentially impact the performance of the graph model, we also tested whether using a different characteristic length scale would impact the predictions.We tried characteristic length scales of C = X v Log(N s ) and C = X v Log(N s )/Log(Log(N s )).We found that the training and validation loss were both lower for C = X v Log(N s ) for all the prediction tasks.We also found that the model demonstrated superior zero-shot performance for C = X v Log(N s ).
One might also consider using a fully connected graph (since in general all spheres interact with one another), but this becomes prohibitively expensive for large N s , as the number of edges would grow as N 2 s rather than Log(N s ) (Figure 3e in the main manuscript).We note a similar assumption was made for CELES, the CUDA-accelerated version of MSTM, which sub-divided neighboring spheres into subsets, to block-diagnolize portions of the T-matrix in order to parallelize calculations and improve computational performance [24].
Impact of training sample size, batch size, and maximum graph size We vary the number of graphs s used in the training data set.Figure S13 shows the training loss and validation loss for the GCN and IN as a function of the number of samples in the training data set.The model is fairly efficient in terms of its usage of the data, and the most significant improvement is seen when increasing the sample size up to s ∼ 10000 samples, with less significant improvement in the training loss for the prediction of Ŷ I with larger sample sizes.The loss for predicting Ŷ S showed a slightly more linear decrease with increasing s for the IN, but the majority of the improvement was also seen for increasing the number of samples up to s ∼ 15000.
For zero-shot learning we also investigated how varying the maximum number of nodes N s,max allowed in graphs in the training data sets impacted the generalizability of the model to larger graphs (Figure S14).Smaller N s,max significantly increased the bias for the zero-shot prediction, although the model still showed some skill in zero-shot prediction for N s,max ≥ 40.

Figure 1 :
Figure 1: BC optical properties.Top, from left to right: Equivalent volume sphere for bare BC, thinly coated BC, and thickly coated BC particles.Bottom, from left to right: Geometry of bare BC, thinly coated BC, and thickly coated BC as used in typical MSTM calculations.

Figure 2 :
Figure2: A schematic of the GNN modeling approach for predicting aerosol optical properties.Accurate calculations of aerosol optical properties from the sphere positions are calculated using MSTM.For the GNN model, graphs are generated from aggregates by connecting spheres closer together than the characteristic length scale, C, of the aggregate (Eq.3).Embeddings are learned for each node in the graph based on the central node features, the neighboring node features, and the edge features.These node-level embeddings are then aggregated together and a graph level prediction of the optical properties of the aggregate is made.

Figure 3 :
Figure 3: Visualization of the graphs (a,c) and adjacency matrices (b,d) for fractal aggregates with the same number of spheres (Ns=288) but different fractal dimensions.D f =1.8 for (a,b) and D f =2.3 for (c,d).(e) The number of edges scales with the total number of spheres in the aggregates (Ns) and the fractal dimension of the aggregates (D f ).
The distribution of parameters among the small (Ns < 100) and large (Ns > 100) aggregates are shown in Figures S1 and S2, and the integral optical properties calculated with MSTM are shown in Figure S3.

Figure S1 :
Figure S1: Distribution of parameters in training, validation, and test data sets (Ns < 100).

Figure S2 :
Figure S2: Distribution of parameters in zero shot test and validation data sets (Ns > 100).

Figure S3 :
Figure S3: Integral optical properties as a function of the parameters for the large (Ns > 100, orange) and small (Ns < 100, blue) aggregates.

Figure S4 :
Figure S4: (a.)The degree distribution for aggregates with Ns = 1024 spheres, k f = 1.2, and D f as given in the legend.(b.) Degree distributions for an aggregate with Ns = 1024 spheres, D f = 1.8, and k f as given in the legend.

Figure S5 :
Figure S5: Training loss for the prediction of integral optical properties for the models with different number of layers.

Figure S6 :
Figure S6: Training and validation loss for the prediction of task of different targets (Y I , Y S or Y I and Y S ) for the GCN model with different number of layers.

Figure S7 :
Figure S7: Training and validation loss for the prediction of Y I for different IN model architectures.'IN' concatenates the central node features with the aggregated edges, as given in Eq. 11, while 'IN2' does not include the central node features.H e L and H v L indicate the number of layers in the MLP's for the edge and node models, respectively.

Figure S8 :
Figure S8: Training and validation loss for the prediction of different targets (Y I or Y S ) for the IN model with different sizes for the learned message.

Figure S9 :
Figure S9: Predictions for the integral optical properties for the training and zero-shot validation data sets using an IN model.Each row shows the true vs. predicted values for the model as trained with different message sizes.

Figure S10 :
Figure S10: Training and validation loss for the prediction of the GCN model and the IN model with different graph readout functions and targets.The GCN model has 4 layers, a hidden dimension size of 400, and dropout of 0.5, using a global mean pooling layer.The IN model has a message size of 100.

Figure S11 :
Figure S11: Predictions for the integral optical properties for the training and zero-shot validation data sets using an IN model with a message size of 100.Each row shows the true vs. predicted values for the model as trained without a fully connected node (No FC) and with a fully connected node (FC) in each graph.

Figure S12 :
Figure S12: Predictions for the integral optical properties for the training and zero-shot validation data sets using an IN model with a message size of 100.Each row shows the true vs. predicted values for the model as trained with different global pooling functions.

Figure S13 :
Figure S13: Training and validation loss for the prediction of integral optical properties for the GCN model and the IN model with different training sample size and targets.The GCN model has 4 layers, a hidden dimension size of 400, and dropout of 0.5, using a global mean pooling layer.

Figure S14 :
Figure S14: Predictions for the integral optical properties for the training and zero-shot validation data sets using an IN model with a message size of 100.Each row shows the true vs. predicted values for the model as trained with different maximum aggregate sizes in the training data sets.

Figure S15 :
Figure S15: Inference time in seconds as a function of the size of the aggregate for the trained IN model for predictions on the training data set and zero-shot test data sets.The inference time here was estimated on a CPU using the trained model weights.

Table S1 :
Parameters of BC aggregates for training, validation, and test data sets.Data set Training/Val./TestSets Zero-shot Val./Test Sets