Introduction

Modern engineering applications are driving the demand for heterogeneous materials with tailored multifunctional properties. These properties often depend strongly on the underlying microstructure. In recent years, there has therefore been a sustained focus on microstructure-sensitive design, where the intent is to identify tailored microstructures that result in desired properties.

The rational design of heterogeneous materials has emerged as a very promising route to discovering new materials and devices with tailored properties and, subsequently, to spurring novel applications. One such example is organic electronics, specifically organic photovoltaics (OPV). In spite of exhibiting multiple benefits (tunability, flexibility, cost, low-temperature manufacturability), organic photovoltaic films remain a niche market due to their relatively poor photoconversion efficiency compared to inorganic counterparts. Careful theoretical1,2,3,4,5,6 and experimental analyses7,8,9,10 have revealed how the microstructure impacts each stage of the photoconversion process. However, the complexity of these analysis approaches has made systematic exploration infeasible, with the result that there exist no design principles or approaches for identifying promising microstructures in a systematic way. Thus, a key bottleneck to microstructure-sensitive design is the paucity of techniques that can rapidly evaluate the performance of a microstructure.

Our approach to resolving this bottleneck is through machine learning (ML), which is used to create a fast surrogate for any complex functional map in a data-driven manner. Over the last decade, machine learning models have proved their ability to ingest large volumes of data-label pairs and create efficient proxy or surrogate models that predict labels for similar instances of data. Deep learning, the state-of-the-art form of ML, has especially advanced the field by incorporating the ability to learn features from high-dimensional data such as multi-spectral images,11,12,13 speech14 and text.15 A particular form of deep network, the Convolutional Neural Network (CNN), has become very popular due to its ability to autonomously create and analyze features in image-like inputs. Through the use of convolution operations, these models retain spatial neighbourhood information, thus linking local (hierarchical) features of an image to an associated label without the need for hand-crafted features. Because such ML algorithms are input agnostic, i.e., they automatically extract features from input data, they have found utility in a wide variety of applications, including recommendation systems16 and self-driving cars.17 These approaches are slowly gaining popularity in physics and engineered systems,18,19,20 where modern sensor and computational developments have paved the way for structured data generation.21,22

Here, we utilize the versatility of CNNs to map the active layer morphology of thin film OPVs to a performance metric, the short-circuit current Jsc. Specifically, we train a morphology classifier that maps an OPV morphology to a short-circuit current class. We test several architectures (of varying depth and width) that can learn from a given set of morphologies and their labels, and demonstrate very high accuracy and F1 scores. To distinguish and rank-order these equally well-performing models, we use two additional measures. The first is based on the observation that a good model must be able to generalize the learnt structure-property relationship. Thus, we identify network architectures that can generalize the map with the available dataset. We quantify this in terms of the ability of the architecture to ‘project the unseen’ morphology onto the learnt distribution and make good predictions.

Apart from generalizability, the other critical requirement for the ML model in our context is interpretability. While model interpretability is not a very critical metric for some applications (for instance, network failure or stock pricing), it becomes a fairly important metric for understanding the behavior of engineered systems. This is because having a purely predictive ‘black-box’ model that is not interpretable raises a critical question—why should a domain expert believe in the prediction of a black-box model? This lack of “interpretability or explainability” is endemic to most black-box models and presents a major bottleneck to the widespread acceptance of ML models.23 Recently, there have been several approaches towards extracting interpretation from these “black-box” models.23,24,25,26,27 This includes domain-specific explanation of models.28,29,30,31 In the current context, the process of learning the structure-property relationship involves identifying several distinct local morphological traits (i.e., unsupervised feature learning) and weighing them appropriately to predict the performance of the morphology. While several (similarly performing) architectures will learn to look at multiple features, we argue that the most useful network is the one that can also identify the right features of the morphology used to make the (correct) prediction. In other words, the chosen architecture should be interpretable to gain trust in the model.

We introduce an approach called DLSP (Deep Learning for Structure Property interrogation) for learning the structure-property relationship from data. Figure 1 illustrates this approach graphically. We first construct a surrogate model of the structure-property relationship using a custom architecture based on a deep convolutional neural network. After training, this architecture is characterized for its trustworthiness using generalizability and interpretability measures. Specifically, generalizability is characterized by the performance of the models on off-sample morphologies, whose characteristics are not present in the training dataset. Subsequently, interpretability is characterized by evaluating the “salient” features using saliency map visualizations. This dual characterization allowed us to pick a custom architecture over standard classification architectures such as VGG-16 and ResNet-50, all of which had nearly identical predictive power. We further use this trust-worthy architecture to perform manual as well as automated explorations of the structure-property space. Using a graphical web application, we simplified the process of manual exploration and intuition building in the structure-property space. Here, the user can manually draw (2D) microstructures, perturb them, and use the trained model to rapidly explore the impact of specific features on performance. Such analysis using a full-scale physics model would require dedicated, complex computing resources, which are generally not available to every researcher. Additionally, we integrated this trained model into an optimization framework to enable automated morphology design. This work illustrates the substantial promise of such surrogate-based design procedures for complex multi-physics systems.

Fig. 1

DLSP (Deep Learning for Structure Property) framework: We construct a forward map from morphology to performance. Upon building trust in this trained model, we use it for manual exploration and insight building, as well as automated design

Results

Training and validation

We develop a CNN-based architecture to classify morphologies into performance classes. A diverse set of binary morphologies was computationally created for use in training, testing and validation. We solved a thermodynamically consistent Cahn-Hilliard equation32 for binary phase separation using an in-house finite element library.33 We ensured creation of a diverse set of morphologies by simulating systems with different volume fractions and different binary interaction parameters. As the Cahn-Hilliard equation models spinodal decomposition (or coarsening dynamics), we output morphologies at several time snapshots for each simulation. A total of ~65,000 morphologies were generated. Each of these morphologies was computationally interrogated to evaluate its photovoltaic performance. The short-circuit current, Jsc, was evaluated for each morphology using the excitonic drift-diffusion equation,1 which models the photocurrent generation process in organic semiconducting films. Across the dataset, Jsc ranged from a minimum of 0.6 mA/cm2 to a maximum of 7.0 mA/cm2. Subsequently, the continuous output Jsc was binned into 10 distinct equi-spaced bins, and each morphology was assigned a one-hot vector as its label.
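
As an illustration, the sketch below (not the authors' code) shows one way to bin a continuous Jsc value into 10 equi-spaced classes and convert it into a one-hot label; the bin edges follow the dataset range reported above, and the function name is hypothetical.

```python
# Minimal sketch: bin a continuous J_sc (mA/cm^2) into 10 equi-spaced classes
# and build a one-hot label. Edges follow the reported range 0.6-7.0 mA/cm^2.
import numpy as np

J_MIN, J_MAX, N_CLASSES = 0.6, 7.0, 10
edges = np.linspace(J_MIN, J_MAX, N_CLASSES + 1)

def jsc_to_one_hot(jsc):
    """Map a short-circuit current value to a one-hot class vector."""
    # np.digitize returns 1..N_CLASSES inside the range; clip handles the edges.
    label = np.clip(np.digitize(jsc, edges) - 1, 0, N_CLASSES - 1)
    one_hot = np.zeros(N_CLASSES)
    one_hot[label] = 1.0
    return one_hot

print(jsc_to_one_hot(3.2))  # falls in class 4 of 0..9
```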

The dataset consists of images aggregated from solving the Cahn-Hilliard equations for a binary phase-separating mixture with various blend ratios and interaction parameters (the complete dataset is publicly available). Varying interaction parameters produces morphologies with different domain purities, while varying blend ratios produces domains of different sizes. Here, we choose to consider 2D morphologies, with extension to 3D morphologies being conceptually straightforward (but computationally non-trivial11,34). The chosen dataset of morphologies (i.e., 2D, amorphous, isotropic) is a subset of the diversity of morphologies that OPV films exhibit (amorphous-crystalline, anisotropic, and multi-phase). (Interestingly, we show that our model, trained on this strict subset of plausible morphologies, performs well on morphologies representative of the larger OPV diversity; see Sec. Out-of-sample testing to characterize model generalizability.)

We choose the short-circuit current, Jsc, as the output of the model. The performance of an OPV device is characterized by its current-voltage (JV) plot, which is completely parameterized by three quantities: (a) the open-circuit voltage Voc, (b) the short-circuit current Jsc, and (c) the fill factor. Jsc explicitly depends on the morphology, while Voc depends on the chemistry of the acceptor-donor materials (essentially the HOMO-LUMO gap). This motivates our choice of Jsc as the output, since it explicitly encodes the influence of morphology. Our custom network architecture for mapping a specific morphology to its label is depicted in Fig. 2. It has 1.2 million learnable parameters and consists of four blocks, each comprising a convolutional layer followed by a pooling layer (downsampling by 2 × 2 max-pooling) and a batch normalization layer; batch normalization is used to remove internal covariate shift.36 The first and second blocks have 16 feature maps with 5 × 5 convolutional kernels. The third block has 64 feature maps with 2 × 2 kernels and the final block has 128 feature maps with 2 × 2 kernels. After the final block, the output is flattened and passed through three fully connected (FC) layers with 512, 128, and 32 hidden units, respectively, before reaching the final softmax output (prediction) layer of 10 units. The Rectified Linear Unit (ReLU) is used as the activation function for each of the convolutional and dense (FC) layers. To address over-fitting, a dropout layer35 with 50% dropout was added after each of the FC layers (FC Layer 1, FC Layer 2, and FC Layer 3, as shown in Fig. 2). Training was performed on a total of 45,108 samples (with an additional 11,109 validation samples), and testing was performed on 11,109 samples. The network was trained for approximately 120 epochs (18 s per epoch) with a learning rate of 0.0001, giving an accuracy of 95.80%. Categorical cross-entropy was used as the loss function and the Adam optimizer37 was used to minimize it.
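
The following minimal Keras sketch illustrates an architecture consistent with the description above; it is not the authors' implementation, and details such as padding and exact layer ordering are assumptions (with 101 × 101 × 1 inputs and 'valid' padding, the parameter count comes out close to the reported ~1.2 million).

```python
# Minimal Keras sketch of the custom CNN described above (assumed details:
# 101x101 grayscale input, 'valid' padding, Adam optimizer with lr = 1e-4).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(101, 101, 1), n_classes=10):
    m = models.Sequential()
    m.add(layers.Input(shape=input_shape))
    # Four conv -> 2x2 max-pool -> batch-norm blocks
    for filters, kernel in [(16, 5), (16, 5), (64, 2), (128, 2)]:
        m.add(layers.Conv2D(filters, kernel, padding="valid", activation="relu"))
        m.add(layers.MaxPooling2D(pool_size=(2, 2)))
        m.add(layers.BatchNormalization())
    m.add(layers.Flatten())
    # Three FC layers with 50% dropout after each
    for units in [512, 128, 32]:
        m.add(layers.Dense(units, activation="relu"))
        m.add(layers.Dropout(0.5))
    m.add(layers.Dense(n_classes, activation="softmax"))
    m.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
    return m

model = build_custom_cnn()
model.summary()
```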

Fig. 2

Confusion matrix for in-sample test predictions. Notice the heavily diagonally dominant matrix, indicating a very good classification accuracy. (Scalebar limits: 0–1)

Apart from this network, we also tested two standard architectures on our dataset (a minimal setup sketch is shown after the list):

  • VGG-16 (~50 million learnable parameters), trained on the training dataset with a learning rate of 0.0001, a batch size of 128 and randomly initialized weights, achieving a test accuracy of 96.61% at epoch 70 (180 s per epoch) with no further improvement in test accuracy.

  • ResNet-50 (~23 million learnable parameters), trained on the training dataset with a learning rate of 0.0001, a batch size of 128 and randomly initialized weights, achieving a test accuracy of 96.45% at epoch 70 (580 s per epoch) with no further improvement in test accuracy.
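
The sketch below illustrates one way such standard architectures can be adapted to this 10-class problem with randomly initialized weights. It is an assumed setup, not the authors' code: the grayscale morphologies are assumed to be replicated across three channels, and a simplified global-average-pooling head is used in place of the original fully connected heads.

```python
# Minimal sketch: adapt standard Keras architectures to 10 output classes,
# with random initialization (weights=None). Head and input shape are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_standard(arch="vgg16", input_shape=(101, 101, 3), n_classes=10):
    if arch == "vgg16":
        base = tf.keras.applications.VGG16(weights=None, include_top=False,
                                           input_shape=input_shape)
    else:
        base = tf.keras.applications.ResNet50(weights=None, include_top=False,
                                              input_shape=input_shape)
    x = layers.GlobalAveragePooling2D()(base.output)
    out = layers.Dense(n_classes, activation="softmax")(x)
    m = models.Model(base.input, out)
    m.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
    return m

vgg = build_standard("vgg16")
resnet = build_standard("resnet50")
```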

A key point to note is that our network, although shallower, performs as well as the established deeper CNN models. Therefore, we select the network based on the learnt features (‘interpretability’) and out-of-sample performance (‘generalizability’), and not just the accuracy/F1-score of the model on the testing dataset. We also note that deeper networks have additional problems: vanishing (or exploding) gradients,38 which hinder convergence, and the saturation of accuracy with increasing depth. We use saliency maps27 to visualize the learnt features (Sec. Building trust via interpretability characteristics), i.e., to identify the microstructural features used by the model to make classification decisions. The resulting heat-maps highlight regions of varying degrees of importance and suggest a physical interpretation, which is discussed further in Sec. Building trust via interpretability characteristics.

Performance of models: statistical metrics

A standard approach to quantifying the performance of a classification-based machine learning framework is the confusion matrix. Figure 2b shows the confusion matrix for the in-sample test data, corresponding to an accuracy of 95.80% and an F1-score of 97.28%. From the confusion matrix, it is clear that most classifications are correct, and those that are incorrectly predicted are usually only off by one class. Some incorrect predictions are not unexpected, as we are binning a continuous variable into non-overlapping classes; the edge cases have the potential to be misclassified. We also note that the other two standard architectures show similar confusion matrices, with similar prediction accuracy (see SI).
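
The short sketch below shows how such metrics can be computed from one-hot labels; the names model, test_images and test_labels are hypothetical placeholders for the trained classifier and held-out test set.

```python
# Minimal sketch (hypothetical variable names): confusion matrix, accuracy and
# weighted F1-score for the in-sample test set.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

y_true = np.argmax(test_labels, axis=1)                 # one-hot -> class indices
y_pred = np.argmax(model.predict(test_images), axis=1)  # CNN predictions

cm = confusion_matrix(y_true, y_pred, labels=range(10))
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1 (weighted):", f1_score(y_true, y_pred, average="weighted"))
```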

Out-of-sample testing to characterize model generalizability

It is well known35 that neural networks can overfit, depending on the model capacity, the amount of training data and the training hyperparameters. An overfit network memorizes the data and exhibits poor generalization as well as brittleness (i.e., lack of robustness to perturbations). We, therefore, use two methods to check the robustness of our trained network(s). As noted earlier, the morphology data used for training are generated by solving a PDE. This imparts certain properties to the data, such as smooth interface contours and uniform domain sizes. Hence, we systematically break these assumptions about the dataset and evaluate the performance of the network. First, we test the network on a columnar structure (Fig. 3). This structure is postulated as an ideal structure in the literature.39 As the width of the columns decreases (to the order of the exciton diffusion length) and the length of the columns increases, the performance of the morphology increases. This is an example of out-of-sample data: it has several sharp interface contours, which are completely absent in the training dataset. The performance of the models on this morphology is shown in Fig. 3. The actual Jsc values from a full-scale drift-diffusion simulation (along with the corresponding true labels) are also presented. It is promising that the custom network accurately predicts the correct label for each of the columnar microstructures.

Fig. 3

Saliency maps and performance of our custom trained CNN. Note how the saliency maps closely follow the interface regions in the microstructure. It should also be noted that the network shows good performance even on samples outside the training dataset

In a more difficult generalizability test, we use fractal-like morphologies,40 which are constructed to maximize the interfacial area while minimizing tortuous transport. These ‘virtual’ morphologies have been shown to exhibit enhanced performance,40 but are currently difficult to fabricate experimentally. We make this point to emphasize that our training dataset consists entirely of morphologies that are experimentally feasible to fabricate. Our model correctly predicts the Jsc class of all the fractal-like morphologies we considered (100% accuracy); it is very promising that our network identified (Fig. 3) all of them as belonging to the high-performing class label 9. This provides substantial evidence of the generalizability of the model.

Building trust via interpretability characteristics

We next query the network to characterize the learnt features. We accomplish this using saliency maps27,41 to identify the important features of the input image. Saliency mapping is a visualization technique that generates heat-maps highlighting the regions (microstructural regions, in our case) the trained CNN model focuses on to generate a classification output. Figure 3 shows the saliency maps for morphologies in the dataset, columnar structures and the “high”-performing morphologies identified in ref. 40

We can see in Fig. 3 how the network uses the interface between the acceptor and donor regions as a key feature for prediction. We believe this is key evidence that the network is trust-worthy, because the interface is the most critical feature affecting performance. The length of the interface determines the amount of excitons that are dissociated. Additionally, interfaces that result in isolated islands or highly tortuous pathways lead to enhanced recombination, thus reducing performance. Finally, interfaces in the middle of the domain (away from the top and bottom electrodes) are more important, as the charges produced at these locations have a higher chance of recombination. Figure 3 shows how the network is able to identify and utilize this interface information as critical to the prediction of device performance.

Finally, we observe in Fig. 4 that the saliency maps from the standard deep networks (VGG-16 and ResNet-50) do not locate any interpretable features. Although the test accuracy of these networks is marginally higher than that of our custom network, the saliency outputs from these networks do not provide any understandable information. Extensive numerical experimentation revealed that our model is shallow enough to provide meaningful saliency maps (i.e., be interpretable) while deep enough to produce accurate (and generalizable) predictions. We provide additional details in Sec. How shallow can the network be?. This observation is in line with ref. 42, where it was shown that deeper models are harder to explain than their shallower counterparts even though they may achieve a higher classification accuracy. These results signify the importance of tailoring architectures to the application. Thus, for performing morphology design, we use this customized architecture as a surrogate map from the microstructure space to the performance space.

Fig. 4

Comparison of Saliency map outputs for our Custom Model (second column), VGG-net (third column) and ResNet-50 (fourth column), with input morphologies shown in the first column: top row shows an example image for class 0, bottom row shows an example image from fractal-like morphologies (correctly predicted as class 9 by our custom model)

Morphology design

Having developed a fast and trust-worthy surrogate map from microstructure to performance, we use it to enable microstructural design. In this section, we show two distinct applications, one manual and one automated, that use this surrogate model for microstructure exploration and design. The goal of both techniques is to explore and identify morphologies that demonstrate superior performance. Traditionally, this is achieved through a conventional optimization strategy, like simulated annealing, where an initial morphology is tweaked repeatedly to achieve superior performance. At every stage, the current morphology is evaluated for its performance. Consequently, the whole process requires several computationally expensive evaluations and becomes time consuming. In the OPV context, evaluating the Jsc for a 2D morphology requires access to dedicated high-performance computing resources. While our highly optimized in-house excitonic drift-diffusion1,43 code is able to perform one simulation in a few minutes on 24 processors, this is still not a viable approach for in-line design exploration and insight generation. In contrast, with the CNN-based framework, evaluating a morphology becomes significantly faster and easier. Hence, it provides a very powerful way to quickly ‘evolve’ morphologies towards optimized performance.

Using the surrogate, we created a browser-based interface (Fig. 5a) that enables the user to interactively modify morphologies to visualize, test and build intuition, and improve morphology performance. Using this interface, the user can gain insight into the effect of morphological features on performance. Figure 5 shows how one can modify morphologies to sequentially include several features of varying sizes, with the aim of improving performance. This tool can in turn help identify features of the morphology that affect performance. An example is demonstrated in Fig. 5b–j, which shows a set of morphologies along with the respective performance labels predicted by our network. First, we can see how performance can be improved from a simple bilayer by increasing the amount of interfacial area between the acceptor and donor.44 The maximum boost in performance is obtained when the donor (black) domains are fractal-like,40 as shown in Fig. 5e. Next, we add island-type structures, which inhibit performance.44 In our example, a ‘line’ of donor is added to the existing morphology, creating several acceptor domains unconnected to the cathode. The performance suffers drastically, as expected from the physics of photoconversion.1 This reduction can be compensated if the connectivity of the acceptor is improved, as seen in Fig. 5h. Figure 5j shows how larger domains are not beneficial, as they lead to geminate recombination and hence lower performance. Finally, a user can employ this approach as a design tool by incrementally adding changes to the initial morphology that improve the predicted performance. Since the performance assessment is done by the trained CNN, the whole process happens in real time.
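
The kind of query the interface issues can be sketched as follows; this is an illustrative snippet with hypothetical names (model, predict_class) and an assumed 0/1 donor-acceptor pixel encoding, not the web application's actual code.

```python
# Minimal sketch (hypothetical names and encoding): perturb a 2D morphology and
# re-evaluate it with the trained CNN, as the browser tool does interactively.
import numpy as np

def predict_class(model, morphology):
    """morphology: 101x101 array with 0 = donor, 1 = acceptor (assumed encoding)."""
    x = morphology.reshape(1, 101, 101, 1).astype("float32")
    return int(np.argmax(model.predict(x, verbose=0)))

bilayer = np.zeros((101, 101))
bilayer[:50, :] = 1.0                                     # simple bilayer
print("bilayer class:", predict_class(model, bilayer))

perturbed = bilayer.copy()
perturbed[40:60, 30:35] = 1.0 - perturbed[40:60, 30:35]   # flip a small region
print("perturbed class:", predict_class(model, perturbed))
```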

Fig. 5

Manual exploration and insight building using the browser interface. Notice how several physics based intuitive trends can be identified and understood by incrementally perturbing the original bilayer morphology

The above interface enables manual exploration and insight building into the influence of various morphological features on performance. Manual exploration, however, makes a full exploration of the best-performing morphology manifold difficult and time-consuming. Thus, to explore this space more fully, we link the fast surrogate with a probabilistic optimization algorithm to find promising, high-performing morphology classes. More specifically, we use a population-based incremental learning (PBIL) approach to perturb morphologies and evolve them towards higher performance.40 PBIL estimates an explicit probability distribution of the optimal morphology. The multivariate probability distribution is stored as a probability matrix P over the 2D morphology, i.e., each pixel is associated with a probability that is updated at each iteration to evolve towards promising morphology classes. This matrix P is updated as follows: the optimization starts with a given probability matrix, generally based on the intuition of the researcher. Subsequently, n morphology instances are sampled from this matrix P. For each realization, the fast ML surrogate is deployed to evaluate the performance fj, j ∈ [1, n]. The nb best samples (nb < n) are then used to calculate Pu, the probabilistic update matrix. Next, the probability matrix is updated according to P ← P(1 − lr) + Pu lr, where lr is the learning rate. Intuitively, the update step reinforces features present in the best performing morphologies and dampens those that are absent. The algorithm terminates using standard criteria (iteration limits and improvement bounds). The integration of a robust and fast surrogate with a probabilistic exploration algorithm produces very promising results. Representative results are shown in Fig. 6, where the morphology evolves towards features with multiple scales, mimicking the finger-like fractal structures exhibited by high-performance morphologies.40 We perform full-physics simulations on one of the optimized morphologies (Fig. 6c), which confirm that the surrogate-derived morphology is in fact a high-performing morphology (Fig. 6d).
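
A minimal PBIL sketch following the update rule above is given below. It assumes a pixel-wise Bernoulli model for P and a hypothetical surrogate_score function (e.g., the CNN-predicted class); it is an illustration, not the authors' implementation.

```python
# Minimal PBIL sketch: sample n binary morphologies from P, score them with a
# surrogate, and reinforce P towards the nb best samples (P <- P(1-lr) + Pu*lr).
import numpy as np

def pbil(surrogate_score, shape=(101, 101), n=100, nb=10,
         lr=0.1, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    P = np.full(shape, 0.5)                              # initial probability matrix
    for _ in range(iters):
        samples = (rng.random((n,) + shape) < P).astype("float32")
        scores = np.array([surrogate_score(s) for s in samples])
        best = samples[np.argsort(scores)[-nb:]]         # nb best morphologies
        Pu = best.mean(axis=0)                           # probabilistic update matrix
        P = P * (1.0 - lr) + Pu * lr                     # reinforce good features
    return P

# Example usage with the (hypothetical) CNN surrogate from the earlier sketch:
# P_opt = pbil(lambda m: predict_class(model, m))
```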

Fig. 6

Exploration by semi-automated design: The optimization started with a bilayer structure. Notice how the framework directs the formation of finer features. Panel d shows the simulated electron and hole current densities under short-circuit conditions for this optimized morphology. The result from automated design has been modified using physics-based principles. (Scalebar limits: Jpy: 0–10; Jny: 0–22)

Discussion

In this work, we address the computationally challenging issue of rapidly exploring the morphology space to identify promising morphologies, especially in the context of multi-physics phenomena. While the approach is general, we illustrate it using the case of morphology tuning to enhance the performance of organic photovoltaic films. We take a data-driven approach and learn a morphology quantifier that can perform fast evaluations. We train a custom-designed CNN that maps a specified morphology into short-circuit current (Jsc) classes. Using out-of-sample datasets, we confirm the absence of over-fitting during the training process. Two other standard networks (VGG-16 and ResNet-50) were also trained. The custom network, although shallower, gave very similar accuracy. However, our custom network performed much better when visualized using saliency maps as well as when tested on out-of-sample datasets. It identified critical interface features of the morphology, which both VGG-16 and ResNet-50 failed to identify consistently. The custom-designed network is then used to perform morphology design for enhanced performance. Two approaches were taken: the first aims to inform the user about the effect of morphology on performance; the second uses the trust-worthy network as a fast cost function and performs morphology optimization using the PBIL algorithm. This work serves as a proof of concept of using deep neural networks for material morphology quantification and design.

There are several interesting areas of research that this work suggests. First, we show that our model—though trained on a subset of plausible morphologies—is able to make accurate predictions on a much more diverse set of morphologies. This raises the question: ‘What is the minimal diversity of morphologies that is needed for a trained model to be generalizable?’ Such questions are particularly important to answer when data collection is resource intensive. Promising approaches include methods of active learning,45 and physics-aware models.46,47 Next, we show that CNN-based surrogate models are promising approaches to rapidly explore structure-property manifolds. This raises the question: ‘How can such techniques be extended to map and explore process-structure-property manifolds?’ This question is particularly important to isolate promising processing windows that produce high-performing devices. Promising approaches include surrogate models based on smart sampling,48 and ideas of manifold learning.49

Methods

Organic photovoltaics

Organic photovoltaic devices are energy harvesting devices that employ organic materials for solar energy conversion. They provide multiple advantages over traditional silicon-based cells, such as flexibility, transparency, and ease of manufacture. They are, however, limited by their efficiency of operation. Although major breakthroughs in processing and materials have improved the efficiency drastically, they still lag behind traditional photovoltaics.

The efficiency of these devices is intricately dependent on the material distribution/morphology in the active layer. The active layer is generally a bulk heterojunction, providing numerous sites for energy conversion. Different features of the morphology play different roles in the process of converting solar energy. The ability to change these morphological features by changing the processing protocol is a major source of control in these devices.

The solar power conversion happens in several stages. First, the incident solar energy generates excitons in the donor phase. These excitons are highly unstable and need to diffuse to the nearest interface with the acceptor material to separate into positive and negative charges. This diffusion to the interface is critical in determining how efficiently the absorbed light is utilized. The excitons dissociate at the acceptor-donor interface to form charges, and the nature and quality of the interface has a direct impact on this efficiency. For example, interfaces with non-aligned crystal boundaries show lower dissociation than those with aligned crystals. In the next stage, these charges (positive in the donor and negative in the acceptor) drift to their respective electrodes to produce electricity. Usually, this drift is driven by the potential difference between the two electrodes. However, these charges also encounter other interfaces containing pairs of positive and negative charges, leading to potential recombination.

In this context, quantifying the stage efficiencies (generation, dissociation, and transport) becomes a critical part of developing strategies to design processing conditions. It can already be seen that the role of morphology in the power conversion efficiency cannot be overstated. Hence, strategies were developed2,7 to quantify the efficiencies of these morphologies.

While these techniques are robust and rigorous, they are expensive and time intensive. This makes them infeasible for designing morphologies, which often requires many such quantifications. We therefore turn to modern, fast methods of quantifying data, especially images. We represent the morphologies as images and take advantage of deep convolutional neural networks to perform performance-based classification.

Data generation and quantification

In order to train the network, we generate a dataset of microstructures using a thermodynamically consistent binary phase separation simulation. This is done by solving the well-known Cahn-Hilliard equation,32 which tracks the local volume fraction of each material (ϕi):

$$\begin{array}{c}\frac{{\partial \phi _i}}{{\partial t}} = \nabla \cdot \left( {M\left( {\phi _i} \right)\nabla \mu _i} \right)\\ \mu _i = \frac{{\partial f}}{{\partial \phi _i}} - \epsilon ^2\nabla ^2\phi _i\end{array}$$
(1)

M(ϕi) is the mobility of component i and μi is its chemical potential. The chemical potential, as defined in Eq. (1), is the variational derivative of the total free energy of the system. The total free energy comprises the bulk free energy f and the interfacial energy. The interfacial free energy is characterized as \(0.5\epsilon ^2|\nabla \phi _i|^2\), where \(\epsilon\) is the interfacial energy parameter, usually correlated with the thickness of the interface between the components. The bulk free energy is described using the Flory-Huggins50 energy representation:

$$f = \frac{\phi _{1}}{N_{1}}{\mathrm{ln}}\phi _1 + \frac{\phi_{2}}{N_{2}}{\mathrm{ln}}\phi _2 + \chi _{12}\phi _1\phi _2$$
(2)

The degree of polymerization of the components is denoted by Ni, and χij represents the severity of interaction between the components. The values of χ are either estimated using molecular simulations,51,52 measured experimentally,53 or calculated through empirical methods.54

This process generates a time series of morphologies that can be treated as independent morphologies for the purpose of training a machine learning model, and it can produce several thousand microstructures within a very short amount of time. To generate numerous consistent morphologies, we perform 100 simulations of Eq. (1) (10 values of χ12 combined with 10 values of initial concentration), with morphologies output every 20 timesteps (which provides distinguishable morphologies across timesteps). Previous analysis using this data can be found in ref. 44. A characteristic of morphologies generated through such simulations is their similarity to morphologies in real active layers produced during thermal annealing; for example, the domains are similar in size and have smooth interface contours. These characteristics also help us build trust in the training process, by manually creating morphologies that break these characteristics and testing the performance of the trained network on such samples. We produce a dataset of nearly 65,000 (2D) gray-scale morphologies of size 101 × 101 pixels.
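
For illustration, the sketch below integrates Eqs. (1)-(2) on a periodic grid using a semi-implicit Fourier spectral scheme with constant mobility; the authors used an in-house finite element solver, so the numerical method and all parameter values here are assumptions chosen only to show how such spinodal morphologies can be produced.

```python
# Minimal periodic Cahn-Hilliard sketch (semi-implicit spectral scheme, constant
# mobility M, hypothetical parameters). Not the authors' finite element solver.
import numpy as np

N, dt, M, eps2, chi, N1, N2 = 128, 0.05, 1.0, 2.0, 2.5, 1.0, 1.0
k = 2.0 * np.pi * np.fft.fftfreq(N)           # wavenumbers (grid spacing = 1)
kx, ky = np.meshgrid(k, k, indexing="ij")
k2 = kx**2 + ky**2

def dfdphi(phi):
    """Flory-Huggins bulk free energy derivative, with phi2 = 1 - phi."""
    phi = np.clip(phi, 1e-6, 1.0 - 1e-6)
    return ((np.log(phi) + 1.0) / N1
            - (np.log(1.0 - phi) + 1.0) / N2
            + chi * (1.0 - 2.0 * phi))

rng = np.random.default_rng(0)
phi = 0.5 + 0.05 * (rng.random((N, N)) - 0.5)  # near-symmetric blend ratio

for step in range(2000):
    g_hat = np.fft.fft2(dfdphi(phi))
    phi_hat = np.fft.fft2(phi)
    # explicit bulk term, implicit interfacial (4th-order) term
    phi_hat = (phi_hat - dt * M * k2 * g_hat) / (1.0 + dt * M * eps2 * k2**2)
    phi = np.real(np.fft.ifft2(phi_hat))

morphology = (phi > 0.5).astype(np.uint8)      # threshold into a binary image
```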

These morphologies were then characterized using an in-house physics-based simulator.1 This simulator uses the steady-state excitonic drift-diffusion equations to model the processes of exciton dissociation and charge transport:

$${\boldsymbol{J}}_{\boldsymbol{n}} = - qn\mu _n\nabla \varphi + qV_t\mu _n\nabla n$$
(3)
$${\boldsymbol{J}}_{\boldsymbol{p}} = - qp\mu _p\nabla \varphi - qV_t\mu _p\nabla p$$
(4)
$$\nabla .{\boldsymbol{J}}_{\boldsymbol{n}} = qfR_{[n,p]} - qfD_{[\nabla \varphi ,X]}$$
(5)
$$- \nabla .{\boldsymbol{J}}_{\boldsymbol{p}} = qfR_{[n,p]} - qfD_{[\nabla \varphi ,X]}$$
(6)
$$\nabla .(\epsilon _r\epsilon _0\nabla \varphi ) = q(n - p)$$
(7)
$$- \nabla .(V_t\mu _x\nabla X) - fD_{[\nabla \varphi ,X]} - R_{[x]} = - G - R_{[n,p]}$$
(8)

where μn and μp are the mobilities of electrons and holes, respectively. The quantities of interest are the electrostatic potential in the active layer φ, the electron density n, the hole density p and the exciton density X. G and D[∇φ,X] represent the rates of generation and dissociation of excitons, respectively. R[x] is the exciton relaxation rate. Jn and Jp are the current densities of electrons and holes, respectively. We use the short-circuit current Jsc to label the data. The data were divided into 10 classes, equally spaced between the best (Jsc = 7 mA/cm2) and worst performing (Jsc = 0.2 mA/cm2) morphologies in the data.

Convolutional Neural Networks (CNNs)

CNNs have become the standard framework for computer vision tasks in recent times. To classify microstructures, we likewise use a CNN-based model: we train it on our dataset, establish trust in the trained model and then use that trained model to make predictions.

CNNs achieve a high level of performance with fewer parameters to learn55,56 when compared to networks constructed simply via Fully-Connected (FC) layers. By design, they exploit the two-dimensional (2D) structure of an input image by preserving the locality of features and utilize spatially local correlations of an image by using tied weights, which are invariant to the translation of the feature positions.55,57

In CNNs, data are represented by multiple feature maps in each hidden layer. These feature maps are obtained by performing local convolutions of the input image with multiple filters. The feature maps then undergo non-linear downsampling with a max-pooling operation58 to decrease the data dimension. Max-pooling partitions the input image into sets of non-overlapping rectangles and uses the maximum value of each partition as the output, so that redundant information from neighboring pixels sharing similar features can be discarded. Both spatial and feature abstractness are increased as a result, imparting increased position invariance to the filters.58,59
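
A tiny numerical illustration of the 2 × 2 max-pooling operation described above (an illustrative snippet, not part of the authors' code):

```python
# 2x2 max-pooling: each non-overlapping 2x2 patch is reduced to its maximum,
# halving each spatial dimension.
import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 3]], dtype=float)

pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[4. 2.]
                #  [2. 7.]]
```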

We use batch normalization layers, which normalize the activations of the previous layer over each batch, to improve the overall performance of the architecture. Batch normalization applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.36

Post max-pooling, multiple dimension-reduced vector representations of the input are acquired, and the process is repeated in the next layer to achieve a higher-level representation of the data. At the final pooling layer, the resultant outputs are linked to the FC layers, where Rectified Linear Unit (ReLU) activation outputs60 from the hidden units are joined to the output units to infer a predicted class on the basis of the highest probability given the input data. With this in mind, the probability of an input vector v being a member of class i can be written as follows:

$$Pr(Y = i|{\mathbf{v}},{\mathbf{W}},{\mathbf{b}}) = {\mathrm{softmax}}_i({\mathbf{Wv}} + {\mathbf{b}}) = \frac{{e^{W_iv + b_i}}}{{\mathop {\sum}\nolimits_j {e^{W_jv + b_j}} }}$$
(9)

where elements of W denote the weights and elements of b denote the biases. The model prediction is the class with the highest probability:

$$y_{{\mathrm{pred}}} = {\mathrm{argmax}}_iPr(Y = i|{\mathbf{v}},{\mathbf{W}},{\mathbf{b}})$$
(10)

The model weights W and biases b are optimized using the error back-propagation algorithm,61 wherein the true class labels are compared against the model prediction using an error metric/loss function. We choose categorical cross entropy62 as the loss function to be minimized over the dataset V; it is given as follows:

$${\cal{L}}(V,Y) = - \frac{1}{n}\mathop {\sum}\limits_{i = 1}^n {{\mathbf{y}}^{(i)}} {\mathrm{ln}}\;{\mathbf{a}}({\mathbf{v}}^{(i)}) + (1 - {\mathbf{y}}^{(i)})\;{\mathrm{ln}}\;(1 - {\mathbf{a}}({\mathbf{v}}^{(i)}))$$
(11)

Here, V = {v(1), …, v(n)} is the set of input examples in the training dataset, and Y = {y(1), …, y(n)} is the corresponding set of labels for those input examples; a(v) represents the output of the neural network given an input v.
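
The following numpy sketch illustrates Eqs. (9)-(10) and a standard categorical cross-entropy over one-hot labels; all variable names and values are illustrative assumptions, not the authors' code.

```python
# Minimal numpy sketch: softmax class probabilities (Eq. 9), argmax prediction
# (Eq. 10), and mean categorical cross-entropy over a small batch.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # numerically stable
    return e / e.sum(axis=-1, keepdims=True)

def predict(v, W, b):
    probs = softmax(v @ W.T + b)                    # Eq. (9)
    return probs, np.argmax(probs, axis=-1)         # Eq. (10)

def cross_entropy(probs, y_onehot, eps=1e-12):
    # mean of -sum_i y_i * ln(a_i) over the batch
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=-1))

rng = np.random.default_rng(0)
v = rng.random((4, 32))          # 4 examples, 32 features from the last FC layer
W, b = rng.random((10, 32)), rng.random(10)
probs, y_pred = predict(v, W, b)
y_true = np.eye(10)[[3, 1, 7, 7]]
print(y_pred, cross_entropy(probs, y_true))
```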

Class specific visualization: Saliency Maps

A detailed description of saliency maps and their use in visualizing the class-specific regions learnt by CNNs is given in ref. 27. Here, we give a brief overview for completeness. Saliency map generation is a technique that takes an input image, a learnt classification CNN model and a class of interest as its inputs, and generates as output an image that is representative of that particular class in terms of what the learnt CNN model sees in the given input image. Formally, we define this as follows: let αi(A) be the score of class i, computed by the classification layer of the CNN for an image A. The target is to find an L2-regularized image A such that αi(A) is high:

$$\mathop {{{\mathrm{argmax}}}}\limits_A \alpha _i(A) - \gamma \left\| A \right\|_2^2$$
(12)

where γ is the regularization parameter. Using the back-propagation algorithm (also used to optimize the layer weights), we obtain a locally optimal A by optimizing with respect to the input image, with the model weights fixed to those obtained at the best training step.
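
The sketch below illustrates the gradient computation underlying this procedure: a single input-gradient pass yields a saliency heat-map for a given morphology, and iterating the regularized gradient ascent of Eq. (12) yields a class-model image. The model name, the use of the softmax output as the class score, and the step sizes are assumptions for illustration.

```python
# Minimal TensorFlow sketch (assumed names/parameters): input-gradient saliency
# and gradient-ascent class visualization following Eq. (12).
import numpy as np
import tensorflow as tf

def saliency_map(model, image, class_idx):
    """image: (101, 101, 1) array; returns |d score / d pixel| as a heat-map."""
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x, training=False)[0, class_idx]
    grad = tape.gradient(score, x)[0]
    return np.abs(grad.numpy()).max(axis=-1)          # collapse channel dimension

def class_model_image(model, class_idx, steps=200, lr=1.0, gamma=1e-3):
    """Gradient ascent on the input image (Eq. 12), starting near mid-gray."""
    x = tf.Variable(tf.random.normal((1, 101, 101, 1), stddev=0.01) + 0.5)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            objective = (model(x, training=False)[0, class_idx]
                         - gamma * tf.reduce_sum(x ** 2))
        x.assign_add(lr * tape.gradient(objective, x))
    return x.numpy()[0, ..., 0]
```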

Performance of standard architectures

As discussed in Sec. Training and validation, we compared the performance of our custom architecture with standard convolutional network architectures, namely ResNet-5063 and VGG-16.12 ResNet-50 is a 50-layer deep convolutional network, widely used to classify ImageNet images into 1000 object categories. It uses residual network blocks that simultaneously reduce the model size and capture the diversity of the input images. The final layer was modified to classify into 10 categories and the network was trained end-to-end with our data. VGG-16 is another very popular architecture tested on data from ImageNet, which uses 13 layers of 3 × 3 convolutions with max-pooling, followed by two fully connected layers of 4096 neurons each. As with ResNet-50, we modified the final layer of VGG-16 to classify into only 10 categories. Although our architecture is shallower, it showed similar performance in terms of the confusion matrix. The confusion matrices on validation data for ResNet-50 and VGG-16 are shown in Fig. 7.

Fig. 7

Both standard architectures show performance similar to our custom architecture, but they do not provide any meaningful explanations for their predictions (Fig. 3) (Scalebar limits: 0–1)

How shallow can the network be?

In order to determine the simplest model with the desired generalizability and interpretability characteristics, we performed an analysis of shallower variants of the presented architecture (Model α). We trained a shallower model (Model αs1) retaining the first three convolution-max-pool-BN blocks of Model α (i.e., removing the last block from Model α), as well as an even shallower model, αs2, which retains the first two blocks of Model α (i.e., removing the last two blocks from Model α). Table 1 compiles the performance of these models on three test datasets: in-sample morphologies, fractal-like morphologies, and columnar morphologies. We observe that progressively shallower models perform worse in terms of prediction accuracy, especially for the out-of-sample data (fractal-like and columnar morphologies). In other words, generalizability suffers when the models become shallower than the presented model (Model α). This evidence suggests that Model α is the shallowest model that still produces viable accuracy.

Table 1 Effect of depth of network on generalizability

The accuracy values, especially in the case of the columnar morphologies, are slightly misleading because they consider all wrong classifications as equally bad, irrespective of how close the prediction is to the true class. Hence, we also analyzed the weighted categorical cross-entropy loss for the columnar morphologies, included in parentheses in Table 1.