Non-empirical identification of trigger sites in heterogeneous processes using persistent homology

Macroscopic phenomena, such as fracture, corrosion, and degradation of materials, are associated with various reactions which progress heterogeneously. Thus, material properties are generally determined not by their averaged characteristics but by specific features in heterogeneity (or ‘trigger sites’) of phases, chemical states, etc., where the key reactions that dictate macroscopic properties initiate and propagate. Therefore, the identification of trigger sites is crucial for controlling macroscopic properties. However, this is a challenging task. Previous studies have attempted to identify trigger sites based on the knowledge of materials science derived from experimental data (‘empirical approach’). However, this approach becomes impractical when little is known about the reaction or when large multi-dimensional datasets, such as those with multiscale heterogeneities in time and/or space, are considered. Here, we introduce a new persistent homology approach for identifying trigger sites and apply it to the heterogeneous reduction of iron ore sinters. Four types of trigger sites, ‘hourglass’-shaped calcium ferrites and ‘island’-shaped iron oxides, were determined to initiate crack formation using only mapping data depicting the heterogeneities of phases and cracks without prior mechanistic information. The identification of these trigger sites can provide a design rule for reducing mechanical degradation during reduction.


Note S1. XRD analysis of specimens
We used X-ray diffraction (XRD) with a Cu Kα X-ray source to identify the coexisting phases in the pulverized specimen and their crystal structures. The XRD patterns and volume fractions of the detected phases are shown in Fig. S1 and Table S1.
As shown by the chemical state map, the iron chemical state changed from Fe(III) to Fe(III) + Fe(II) and finally to Fe(II) during reduction. The results in Table S1 clearly demonstrate that specimens obtained at increasing reduction times have higher mass

Note S3. X-CT data and image processing
Crack formation was investigated using X-ray computed tomography (X-CT) with an in-house X-ray source. A cuboidal specimen with dimensions of 2 × 2 × 10 mm³ was cut from an iron ore sinter. A CCD camera with a scintillator was used as the detector. Transmission images were measured in 1° or 2° steps over a 360° rotation using a white X-ray source generated by a tungsten target with a tube voltage of 70 kV. The spatial resolution was as fine as 0.7 µm. The leftmost column in Fig. S3 shows images of slices extracted from the three-dimensional (3D) X-CT dataset of the reduced sinter, which was reconstructed from the observed data. Panels (a)-(d) show the decomposition of the microstructure into images representing (a) the initial pores, (b) the microcracks formed during reduction, (c) calcium ferrite phases, and (d) iron oxide phases.
These components were extracted for three regions (pores/cracks, calcium ferrites, and iron oxides) by setting threshold values for the image contrast corresponding to the densities and by considering the contrast differences observed at the boundaries. Each image of a slice with a thickness of 4.0 µm was processed with a spatial resolution of 4.0 µm. In other words, the 2 × 2 × 10 mm³ specimen was divided into voxels of 4 × 4 × 4 µm³, and each voxel was assigned to (a) an initial pore, (b) a microcrack, (c) a calcium ferrite phase, or (d) an iron oxide phase. This dataset was further analysed using persistent homology.
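The threshold-based assignment can be sketched as follows. This is only an illustration under stated assumptions: the grey-level thresholds `T_PORE` and `T_FERRITE` are hypothetical values, and the sketch assumes that denser phases appear brighter. It covers only the three density-based regions, because the boundary-contrast criterion that separates initial pores from microcracks is not specified here.

```python
from bisect import bisect_right

# Hypothetical grey-level thresholds (not taken from the source).
# Assumption: denser material appears brighter in the reconstructed slice.
T_PORE, T_FERRITE = 60, 150
LABELS = ("pore_or_crack", "calcium_ferrite", "iron_oxide")


def classify_voxel(intensity):
    """Map one grey level to a region label by simple thresholding."""
    # Values below T_PORE -> index 0, below T_FERRITE -> 1, otherwise 2.
    return LABELS[bisect_right((T_PORE, T_FERRITE), intensity)]


def classify_slice(slice_2d):
    """Apply the voxel classifier to every pixel of one 2D slice."""
    return [[classify_voxel(v) for v in row] for row in slice_2d]


# Voxel grid implied by a 2 x 2 x 10 mm specimen sampled at 4 um.
voxels_per_axis = tuple(round(mm * 1000 / 4) for mm in (2, 2, 10))
print(voxels_per_axis)
```

Under these numbers the specimen corresponds to a grid of 500 × 500 × 2500 voxels.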

Note S4. Persistence diagrams from image data
Persistence diagrams (PDs) [6][7][8][9] are computed from finite point sets in space or from binary images and have been used in the structural analysis of amorphous solids [8] and granular media [9]. Here, to quantify the topological features of holes in the phase-mapping datasets obtained by X-CT, we computed the 0-th PDs of those datasets, where we consider iron oxides as 'holes' in the matrix of calcium ferrite. It should be noted that we could characterise topological features in the opposite manner, i.e., with calcium ferrites as 'holes'. We performed the calculation in both manners and confirmed their consistency.
In this note, we discuss how to compute the 0-th PD of an image using the example shown in Fig. S5. Fig. S5a shows the input binary image. We focus on the white island-type shapes (complementarily, the grey holes). The 0-th PD of the image is calculated to be {(−3, ∞), (−2, −1), (−2, 3)} (Fig. S5b) as follows. First, each pixel in the image is numbered as in Fig. S5c, and these numbers are used to compute the diagram. The white and grey pixels are numbered negatively and positively, respectively. The rule for assigning these numbers, known as the Manhattan distance, is given as follows:
• All grey pixels next to the white pixels are numbered as 1.
• All not-yet-numbered grey pixels next to '1' pixels are numbered as 2.
• In the same way, all not-yet-numbered grey pixels next to 'k' pixels are numbered as k + 1 (k = 2, 3, 4, …).
• All white pixels next to the grey pixels are numbered as −1.
• In the same way, white pixels are numbered as −2, −3, etc.
Using these assignments, we can enlarge or contract each white region ('island' domain) of the binary image by changing a threshold value M. Namely, we define the white regions as the union of all pixels whose assigned number is less than or equal to M … (Fig. S5b).
Hence the PD can be expressed as a 2-dimensional histogram on the birth-death plane.
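The numbering rule and the threshold sweep can be sketched as follows. This is a minimal illustration, not the implementation used in this study. The test image is a hypothetical example (not the image of Fig. S5), and the sketch assumes that pixels are merged through their four nearest neighbours and that the younger component dies when two components merge.

```python
from collections import deque
import math

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-neighbourhood (Manhattan)


def signed_manhattan(white):
    """Number pixels by the rule above: white pixels < 0, grey pixels > 0."""
    h, w = len(white), len(white[0])

    def distance_to(colour):
        # Multi-source breadth-first search from every pixel of `colour`.
        dist = [[None] * w for _ in range(h)]
        queue = deque()
        for r in range(h):
            for c in range(w):
                if white[r][c] == colour:
                    dist[r][c] = 0
                    queue.append((r, c))
        while queue:
            r, c = queue.popleft()
            for dr, dc in STEPS:
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and dist[rr][cc] is None:
                    dist[rr][cc] = dist[r][c] + 1
                    queue.append((rr, cc))
        return dist

    to_white, to_grey = distance_to(True), distance_to(False)
    return [[-to_grey[r][c] if white[r][c] else to_white[r][c]
             for c in range(w)] for r in range(h)]


def zeroth_pd(values):
    """0-th PD of the sublevel sets {value <= M}, computed with union-find."""
    h, w = len(values), len(values[0])
    order = sorted((values[r][c], r, c) for r in range(h) for c in range(w))
    parent, birth, pairs = {}, {}, []

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for level, r, c in order:  # sweep the threshold M upwards
        parent[(r, c)] = (r, c)
        birth[(r, c)] = level
        for dr, dc in STEPS:
            nb = (r + dr, c + dc)
            if nb not in parent:
                continue
            old, young = find((r, c)), find(nb)
            if old == young:
                continue
            if birth[old] > birth[young]:
                old, young = young, old
            if birth[young] < level:  # drop pairs with birth == death
                pairs.append((birth[young], level))
            parent[young] = old
    pairs.append((order[0][0], math.inf))  # the component that never dies
    return sorted(pairs)


# Hypothetical image: '#' = white, '.' = grey. A 5x5 blob is joined to a
# 3x3 blob by a one-pixel neck, and a separate 3x3 island lies below.
ROWS = [
    "...........",
    ".#####.....",
    ".#####.###.",
    ".#########.",
    ".#####.###.",
    ".#####.....",
    "...........",
    "...........",
    "..###......",
    "..###......",
    "..###......",
    "...........",
]
image = [[ch == "#" for ch in row] for row in ROWS]
diagram = zeroth_pd(signed_manhattan(image))
print(diagram)  # [(-3, inf), (-2, -1), (-2, 1)]
```

In this hypothetical image, the pair (−2, −1) comes from the 3 × 3 blob attached through the neck, and the pair (−2, 1) comes from the separate island, which merges with the large blob only after the threshold reaches the grey pixels between them.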
Following the same process, the PD of the real phase-mapping data has the general form of a collection of birth-death pairs (b, d). Here, we summarize some of the important topological features encoded in the birth-death pairs.
• The pairs (−3, ∞) and (−2, 3) in the above example correspond to the two white islands in the input image (Fig. S5c).
• Each birth-death pair with a negative death value corresponds to an hourglass shape in the input image. Its width is encoded as the magnitude of the birth value (the half-width of the widest section of the hourglass) and the death value (the half-width of the middle section of the hourglass). The pair (−2, −1) corresponds to such a structure (Fig. S5c).
• Birth-death pairs with large differences correspond to persistent topological features. In other words, they show topological features that remain for relatively longer periods during evolution (i.e. the reaction) and are expected to play important roles in initiating trigger sites.
It should be noted that the birth position of each island represents the centre of the persistent structure associated with the corresponding birth-death pair (red and blue squares in Fig. S5c).
This results in the appearance and the disappearance of some islands.

Note S5. Principal component analysis (PCA) of PDs and the identification of trigger sites by linear regression
Among the efforts to map PDs into vector spaces for machine learning tasks, one tool for converting PDs to (finite-dimensional) vectors is known as a persistence image (PI). The PI of a given PD is a weighted sum of Gaussian distributions on the birth-death plane and is regarded as a vector in the function space $L^2(\mathbb{R}^2)$, which is an infinite-dimensional vector space with an inner product. More explicitly, for a PD with birth-death pairs $(b_i, d_i)$, the PI is defined with parameters $C > 0$, $p > 0$, and $\sigma > 0$ and a weight function $w(b, d)$. Because a birth-death pair with a large difference is expected to be an important topological feature, the weight function is chosen to respect this requirement. In practice, the PI is discretized into a finite-dimensional vector using a histogram on the birth-death plane with a finite mesh. It should be noted that the PI is stable under this transformation with respect to small perturbations in the inputs [11].
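A minimal sketch of evaluating a PI on a finite mesh is given below. The Gaussian sum follows the description above. The specific weight w(b, d) = arctan(C (d − b)^p) is an assumption (a commonly used choice that grows with d − b), as is the omission of pairs with infinite death value. The function names are illustrative, and the sketch samples the PI at mesh points rather than binning it.

```python
import math


def pi_weight(b, d, C=1.0, p=1.0):
    """Assumed weight: arctan(C * (d - b)**p), increasing with d - b."""
    return math.atan(C * (d - b) ** p)


def persistence_image(pairs, xs, ys, sigma=1.0, C=1.0, p=1.0):
    """Sample the weighted Gaussian sum at mesh points (x, y) = (birth, death)."""
    finite = [(b, d) for b, d in pairs if math.isfinite(d)]  # assumption
    image = []
    for y in ys:
        row = []
        for x in xs:
            value = sum(
                pi_weight(b, d, C, p)
                * math.exp(-((x - b) ** 2 + (y - d) ** 2) / (2 * sigma ** 2))
                for b, d in finite
            )
            row.append(value)
        image.append(row)
    return image


def flatten(image):
    """Turn the sampled PI into the finite-dimensional vector used later."""
    return [v for row in image for v in row]


pairs = [(-3, math.inf), (-2, -1), (-2, 3)]
mesh = list(range(-4, 5))
vector = flatten(persistence_image(pairs, mesh, mesh, sigma=1.0))
print(len(vector))  # 81 mesh points
```

Each entry of `vector` corresponds to one mesh point on the birth-death plane, which is what allows a coefficient learned on the vector to be mapped back to a region of the PD.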
Using the vectors of the discretized PIs, we applied machine learning methods to investigate the characteristic topological features of the phase image datasets of calcium ferrites and iron oxides. In particular, we used PCA and linear regression with ℓ1 regularization (LASSO).
Comparison with other methods. As described in the main text, PCA succeeded in identifying the increase in the number of 'island'- and 'hourglass'-shaped features during the reduction from S-1 to S-3 using minimal prior knowledge, i.e. the heterogeneous features (or 'shapes') of co-existing iron oxides and calcium ferrites.
Once the types of trigger sites have been identified as 'island'- and 'hourglass'-shaped iron oxides and 'hourglass'-shaped calcium ferrites, we can analyse the phase images using a simple image analysis technique. For comparison, Fig. S7 contains histograms showing the number of connected components in the iron oxide images. The number increases from S-1 to S-2 and decreases slightly from S-2 to S-3. This finding is consistent with the PCA.

Fig. S7 Histograms of the number of connected components in iron oxide images.
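The connected-component count used in this comparison can be sketched with a simple flood fill. The sketch assumes 4-connectivity; the connectivity actually used for Fig. S7 is not stated here, and the small mask is a hypothetical example.

```python
from collections import deque


def count_components(mask):
    """Count 4-connected components of truthy pixels in a 2D list."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            count += 1  # a new component starts at this pixel
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


# Hypothetical iron oxide mask (1 = iron oxide).
example = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
]
print(count_components(example))  # 3
```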
We can understand the problem better by counting only small or large connected components. Figure S8 includes histograms showing the number of small and large connected components of iron oxide. In this analysis, we count the number of birth-death pairs whose birth value is larger than -7 (resp. less than -7) and whose death value is positive to count small (resp. large) connected components. These figures show that the number of small connected components of iron oxide increases from S-1 to S-2, whereas the number of large connected components does not change as the reduction process progresses.
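The counting rule just described can be written directly on the birth-death pairs. In this sketch the function name is illustrative, pairs with an infinite death value are counted as having a positive death value (an assumption), and a pair whose birth value equals −7 falls into neither class, following the wording above.

```python
import math


def count_small_large(pairs, birth_cut=-7):
    """Count small and large connected components from birth-death pairs."""
    # Positive death value: the pair represents a connected component.
    small = sum(1 for b, d in pairs if d > 0 and b > birth_cut)
    large = sum(1 for b, d in pairs if d > 0 and b < birth_cut)
    return small, large


# Hypothetical diagram for illustration.
pairs = [(-3, math.inf), (-2, -1), (-2, 3), (-9, 2), (-7, 4)]
print(count_small_large(pairs))  # (2, 1)
```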
However, a naive analysis using the number of connected components is not appropriate for calcium ferrites. Figure S9 shows histograms of the number of connected components, which give no insight into the characteristic features of calcium ferrites during the reaction process. This comparison of our approach with simple image analysis shows that connected-component counts can serve as a simpler descriptor, and that those histograms provide a partial understanding of the reaction process once we notice that the number of small connected components of iron oxide increased from S-1 to S-2. However, we emphasize again that this finding was only clarified by the PCA on PDs and is difficult to obtain without prior knowledge of the reaction mechanism. This comparison highlights an advantage of our method, which applies machine learning to persistent homology and automatically captures significant features in a data-driven way.

Fig. S9 Histograms of the number of connected components in calcium ferrite images.
We also compared the performance of our method with that of a standard machine learning method, the bag-of-keypoints approach with SIFT features, in the task of classifying S-1 and S-2 images. We selected 60 × 2 iron oxide images from S-1 and S-2. For each image, 400 keypoints (20 × 20) are placed on a lattice, SIFT features are computed at each keypoint, and the feature dictionary is computed by k-means. Then, linear logistic regression with an ℓ1 penalty is applied to the constructed bag-of-keypoints histograms.
We remark that, although nonlinear kernels are often used for simple classification tasks, the linear method with an ℓ1 penalty is more suitable for applying feature selection techniques to identify significant geometric features. Here, the parameters are adjusted by cross-validation. The accuracy computed by cross-validation was 75%, while that computed by our framework using linear logistic regression on persistence images was 80%.
Fig. S10 shows the result of feature selection by the bag-of-keypoints approach with SIFT, where the blue (resp. red) circles correlate with S-1 (resp. S-2). The bag-of-keypoints analysis shows that the detected areas are somehow correlated with S-1 and S-2 but does not explicitly identify the geometric structures used to distinguish between S-1 and S-2. In contrast, our method provides a more explicit and intuitive understanding of the data. These results lead us to conclude that persistent homology analysis is more suitable than bag-of-keypoints analysis for processing images of materials.

Fig. S10 Results of feature selection by the bag-of-keypoints approach with SIFT. The images shown correspond to S-1 and S-2, respectively. The blue (resp. red) circles correlate with S-1 (resp. S-2). Intuitively, the more blue (resp. red) circles an image contains, the greater the likelihood that the image is S-1 (resp. S-2).
LASSO analysis. Linear regression is a statistical approach for estimating the relationship between a vector (i.e. an explanatory variable $v_i$) and a scalar value (i.e. a dependent variable $s_i$). The input of the linear regression is given by the pairs $(v_i, s_i)$ with $v_i \in \mathbb{R}^n$ and $s_i \in \mathbb{R}$, and is modeled as $s_i \approx a \cdot v_i + b$, where the coefficient vector $a \in \mathbb{R}^n$ and the intercept $b \in \mathbb{R}$ are unknown parameters estimated from the input data. A standard linear regression method estimates $a$ and $b$ by least-squares minimization. Furthermore, to avoid overfitting, we often combine this approach with the LASSO technique [13], in which the least-squares error plus the penalty $\alpha \|a\|_1$ is minimized with respect to $a$ and $b$, where the term $\|a\|_1 = \sum_{j=1}^{n} |a_j|$ is called the regularization term and $\alpha > 0$ is its controlling parameter.
One advantage of LASSO regression is the sparseness of the learned vector $a$. In fact, by applying the LASSO to the vectors of the discretized PIs, we can identify a few grid cells (as a result of sparseness) in the histogram of the persistence diagram that have the largest impact on the dependent variable $s_i$, because each element of the vector corresponds to a grid cell in the histogram.
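The sparseness can be illustrated with a minimal coordinate-descent sketch. It is not the solver used in this study. It assumes the objective (1/(2N)) Σ_i (s_i − a · v_i − b)² + α‖a‖₁, where the 1/(2N) scaling is one common convention and N denotes the number of samples. The toy data are hypothetical.

```python
def soft_threshold(x, t):
    """Shrink x towards zero by t; values within [-t, t] become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso(V, s, alpha, n_iter=200):
    """Coordinate descent for the assumed LASSO objective; returns (a, b)."""
    N, n = len(V), len(V[0])
    v_mean = [sum(row[j] for row in V) / N for j in range(n)]
    s_mean = sum(s) / N
    # Centre the data so that the intercept can be recovered afterwards.
    X = [[row[j] - v_mean[j] for j in range(n)] for row in V]
    resid = [si - s_mean for si in s]  # residual for a = 0
    a = [0.0] * n
    for _ in range(n_iter):
        for j in range(n):
            col = [X[i][j] for i in range(N)]
            z = sum(x * x for x in col) / N
            if z == 0.0:
                continue  # constant feature carries no information
            rho = sum(col[i] * (resid[i] + a[j] * col[i]) for i in range(N)) / N
            new = soft_threshold(rho, alpha) / z
            delta = new - a[j]
            resid = [resid[i] - delta * col[i] for i in range(N)]
            a[j] = new
    b = s_mean - sum(a[j] * v_mean[j] for j in range(n))
    return a, b


# Hypothetical data: s depends only on the first element of each vector.
V = [[1, 0], [2, 1], [3, 1], [4, 0]]
s = [3, 5, 7, 9]
a, b = lasso(V, s, alpha=0.1)
selected = [j for j, coef in enumerate(a) if coef != 0.0]
print(selected)  # indices of the grid cells kept by the LASSO
```

Only the first coefficient is non-zero in this toy example, so only the corresponding element (a grid cell, in the PI setting) is selected.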
In this study, we applied the LASSO method to detect the key birth-death pairs having the largest correlation with the areas (number of pixels) of the microcracks. To this end, the areas of the detected microcracks were used as the dependent variables $s_i$ in the LASSO regression. The parameter $\alpha$ was determined by cross-validation [12]. Then, four types of key birth-death pairs were identified, two for calcium ferrites (TS_CF1 and TS_CF2) and two for iron oxides (TS_IO1 and TS_IO2). These birth-death pairs are highly correlated with cracks and correspond to trigger sites.

Note S6. Datasets for PCA and linear regression
For this analysis, we prepared 80 × 3 × 4 X-CT images from S-1, S-2, and S-3 for iron oxides, calcium ferrites, large pores, and cracks. The size of each image is 255 × 255 pixels (900 × 900 µm² in real scale). Since an image with a large pore area carries less information and is therefore harmful for statistical analysis, we removed such images from the datasets. The threshold for the pore area ratio was 0.6. As a result, we used 64 × 4 images from S-1, 72 × 4 images from S-2, and 62 × 4 images from S-3. The resulting iron oxide and calcium ferrite images were used in the PCA. Only S-3 images were used in the LASSO analysis. Since the dataset is relatively small, the parameter was adjusted by cross-validation in the LASSO analysis to avoid overfitting.
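The pore-area filter can be sketched as follows. The function names are illustrative, and treating an image whose pore ratio is exactly 0.6 as retained is an assumption, since the text does not state whether the threshold is inclusive.

```python
def pore_ratio(pore_mask):
    """Fraction of pixels marked as pore in a 2D mask (truthy = pore)."""
    total = sum(len(row) for row in pore_mask)
    pores = sum(1 for row in pore_mask for v in row if v)
    return pores / total


def keep_image(pore_mask, max_ratio=0.6):
    """Retain an image only if its pore-area ratio does not exceed max_ratio."""
    return pore_ratio(pore_mask) <= max_ratio


# Hypothetical 2 x 2 masks for illustration.
mostly_pore = [[1, 1], [1, 0]]   # ratio 0.75, removed
mostly_solid = [[1, 0], [0, 0]]  # ratio 0.25, kept
print(keep_image(mostly_pore), keep_image(mostly_solid))  # False True
```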