Introduction

Composition-spread experiments enable rapid mapping of composition–structure–property relationships and are a powerful tool for materials discovery owing to recent developments in high-throughput materials synthesis and rapid characterization techniques.1,2,3,4,5 Machine learning techniques provide a means to convert the large volume of diverse, complex data collected from materials experiments into actionable knowledge6,7,8,9,10,11,12,13,14,15,16,17,18 and have been successfully demonstrated in composition-spread experiments. Of particular interest is the use of machine learning to rapidly determine potential composition-phase maps from X-ray diffraction (XRD) data collected from composition spreads.19 Because the number of potential materials with three or more components vastly exceeds the number of expert-analyzed materials (typically unary or binary), this is primarily an unsupervised learning task. The knowledge gained from unsupervised analysis of the structure of phase data can also be used to optimize semi-supervised learning in those cases where phase-labeled data exist. These techniques promote a reduced reliance on expert human input, greater throughput, and the end goal of autonomous materials exploration systems. A variety of machine learning methods have been investigated in the literature, beginning with hierarchical clustering of XRD and Raman spectra and continuing with recent investigations into constraint programming and hyperspectral graph-based techniques for concurrent composition-phase map determination and constituent phase identification.19

A key challenge for these methods is the selection of an appropriate dissimilarity (or similarity) measure, also known as a kernel function, to quantify the relationship between two pieces of structure data. Once computed, the dissimilarity between the two pieces of structure data is taken as the dissimilarity between the structures of the originating samples. Proper selection of the dissimilarity measure is essential to achieving quality results from unsupervised (and semi-supervised) learning, as emphasized by Hastie et al.20 in their textbook The Elements of Statistical Learning: “Specifying an appropriate dissimilarity measure is far more important in obtaining success with clustering than choice of clustering algorithm.” This work seeks to provide insight to the experimentalist for dissimilarity measure selection through an efficacy analysis, identifying the impact of dissimilarity measure selection on composition-phase map accuracy and computation time. While particularly beneficial to the high-throughput experiment community, where computation time is of concern, the analysis should also benefit the larger materials research community dealing with large amounts of XRD data. The structural data of interest for this work is the one-dimensional XRD pattern: a set of intensities, where each intensity value corresponds to a 2θ value or q-spacing. The diffraction data is typically collected as a two-dimensional image of intensities as a function of the instrument parameters 2θ and χ, integrated over the χ angle to generate the one-dimensional diffraction pattern, and the 2θ values are translated to q-spacing values using Bragg’s law.21

A number of confounding factors can impact the performance of a dissimilarity measure. Peak shifting can occur in XRD data due to lattice expansion or contraction, so that two samples sharing the same structure may have differing XRD patterns. The texturing of samples is another challenge: samples that share the same structure but have different texturing can show different peak heights, with some peaks possibly reduced in intensity to the point of being lost in measurement noise. Finally, the typical one-dimensional XRD pattern used to identify material structure is a data reduction from the two-dimensional Ewald’s sphere.21 This data reduction, along with measurement signal-to-noise issues, can result in a potential loss of distinguishing information, so that two different structures have similar diffraction patterns. The optimal choice of measure takes all of these factors into account to accurately determine dissimilarity, typically through increased complexity of the measure itself and increased computational cost. The data analyst must therefore make a cost-benefit decision between accuracy and computational cost. For high-throughput combinatorial library data analysis, computational cost and the resulting computation time hold a high priority. The same holds when analyzing other large sets of XRD patterns, for example those collected in an Edisonian sample-by-sample study. For non-high-throughput studies, where sample synthesis can take an appreciable amount of time, the experimentalist may choose to select the XRD dissimilarity measures with the best performance despite their computational cost.

A variety of measures have been used throughout the literature, and in this work we investigate a set of common measures for their relative computational cost in comparing one-dimensional XRD data and their relative accuracy in the resulting cluster-based composition-phase map determination. Long et al.22 reported on the use of the Pearson correlation coefficient as part of a software package to perform hierarchical cluster analysis (HCA) for composition-phase map determination. The software package, CombiView, also allows the user to select from a range of measures.23 Kusne et al.24 demonstrated the speed and accuracy of using the L1 norm with a mean-shift-theory clustering method to analyze XRD data as the data was being collected. Baumes et al.25 investigated the use of the dynamic time warping (DTW) measure to assist in XRD analysis and found it to be resilient to peak shifting when the range of peak shifting was known. LeBras et al.26 and Ermon et al.27 later applied DTW as part of larger constraint-programming-based algorithms for concurrent composition-phase map determination and constituent phase identification from XRD data. Another study of high-speed methods by Kusne et al.28 demonstrated the efficacy of the cosine metric in a regularized least squares method for concurrent composition-phase map determination and constituent phase identification from both XRD data and Raman spectra.

In this work, we investigate a set of measures for accuracy and computational cost in analyzing XRD data from a Fe–Co–Ni ternary-alloy thin-film composition spread.29 This work provides the basis for a broader analysis of dissimilarity measures and their efficacies over the wide range of materials systems and their respective structure-data artifacts, such as extreme variations in peak heights, variations in the number of peaks, and greater number and complexity of phase regions. This work also serves to demonstrate a system for rapid analysis of combinatorial spreads (Fig. 1). The system begins with the synthesis of the thin-film composition spread using a combinatorial co-sputtering technique, followed by rapid characterization of composition using wavelength dispersive spectroscopy and of structure via XRD. The dissimilarity is computed between each pair of composition-spread samples, and a set of machine learning techniques is used to sort the samples into clusters of similar structure and to identify potential constituent phases. The samples are then visualized in a composition diagram with color-coded cluster labels. Contiguous regions of the composition diagram that share a cluster label correspond to potential phase regions of a composition-phase map. Dimension reduction techniques such as multidimensional data scaling are used to visualize the high-dimensional XRD data in two or three dimensions to facilitate user evaluation of the analysis results.

Fig. 1

Composition spread analysis flow chart: The strategy begins with a synthesis of the composition spread and rapid characterization of composition and structure, followed by b computation of dissimilarity measures, c composition-phase map determination and constituent phase identification via machine learning, and d visualization of analysis results via the use of a composition diagram and data dimension reduction techniques applied to the structural data

Dissimilarity measures

The data of interest for this work is the one-dimensional XRD pattern from a Fe–Co–Ni composition spread. Each diffraction pattern is described by a set of intensities with a one-to-one correspondence to a set of 2θ values. The diffraction patterns are measured over the same set of 2θ values, indexed here with i ∈ {1, …, N}. There were 1125 diffraction patterns uniformly covering the entire Fe–Co–Ni ternary composition-phase map, taken over the 2θ range of 42.6° to 47.0° (with λ = 0.15418 nm for Cu Kα).29 To demonstrate the robustness of the overall analysis protocol developed here, we have also applied it to another ternary system, namely a Fe–Ga–Pd composition spread, and the results are presented in the supplementary section. 2θ is used here rather than q-spacing due to the simplicity of having equal sampling spacing of Δ(2θ) between each intensity measurement, allowing the use of the index i as the independent variable. The results of this work should generalize to an equivalent set of uniformly spaced q-spacing values with appropriately interpolated XRD intensity values. For two diffraction patterns s and t, the dissimilarity measure is defined as D(s,t). For D(s,t) = 0 the two diffraction patterns are assumed to be identical and the corresponding samples are assumed to share the same structure. Larger values of the dissimilarity measure imply greater dissimilarity between the samples’ structures.
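Should an analysis require the q-spacing representation, the conversion from 2θ follows directly from Bragg’s law, q = 4π sin(θ)/λ. The short sketch below is illustrative only (the grid size, array names, and placeholder pattern are assumptions, not values from this study); it converts a 2θ grid to q and re-interpolates a pattern onto a uniform q grid so that the index i can again serve as the independent variable.

```python
# Illustrative sketch: convert a 2-theta grid (degrees) to q-spacing and re-sample a
# pattern onto a uniform q grid. Wavelength matches the Cu K-alpha value quoted above;
# the grid size and the random placeholder pattern are assumptions for illustration.
import numpy as np

wavelength = 0.15418                                 # nm, Cu K-alpha
two_theta = np.linspace(42.6, 47.0, 880)             # degrees; assumed number of points
intensity = np.random.rand(two_theta.size)           # placeholder for one measured pattern

q = 4.0 * np.pi * np.sin(np.radians(two_theta / 2.0)) / wavelength   # Bragg's law

q_uniform = np.linspace(q.min(), q.max(), q.size)    # uniform q grid
intensity_q = np.interp(q_uniform, q, intensity)     # interpolated intensities
```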

The set of dissimilarity measures investigated in this work falls into a group of categories (with some measures falling into more than one): the L1 norm (“Manhattan”), the L2 norm (“Euclidean”), and the cosine metric are geometry-based measures; the Pearson correlation coefficient and the Spearman rank correlation coefficient are statistics-based measures; DTW and the earth mover’s distance (EMD), also known as the “Wasserstein distance”, are measures specifically developed for feature preservation, i.e., resilience to peak shifting; and the Jensen–Shannon divergence (JSD) is based in information theory. These measures, except for DTW and EMD, satisfy the requirements of a metric. DTW is not a metric because it does not satisfy the triangle inequality. EMD is a metric under two conditions: the ground distance used must be a metric and the areas under the two diffraction patterns being compared must be equal. While the ground distance used here is the L1 norm metric, the areas under any two diffraction patterns are not necessarily equal, and EMD is therefore not assumed to satisfy the requirements of a metric. This work introduces the normalized and constrained dynamic time warping measure (NC-DTW) as a novel means for computing XRD dissimilarity, which, like the DTW measure, does not satisfy the triangle inequality.

L1 & L2 norms

The L1 and L2 norms are special cases of the p-norm, given by:

$$D_{p\text{-}\mathrm{norm}}(\mathbf{s},\mathbf{t}) = \left(\sum_{i=1}^{N} \left|s_i - t_i\right|^{p}\right)^{1/p},$$
(1)

where p = 2 is the L2 norm, commonly known as the Euclidean distance, and p = 1 is the L1 norm, also known as the Manhattan, taxi-cab, or city block distance.
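As a rough illustration (not the authors' implementation), the snippet below evaluates Eq. (1) for two patterns using SciPy's built-in distance functions; the array length and the random placeholder patterns are assumptions.

```python
# Minimal sketch: L1 and L2 norms between two 1-D diffraction patterns s and t sampled on
# the same 2-theta grid. The patterns here are random placeholders.
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, minkowski

s, t = np.random.rand(880), np.random.rand(880)

d_l1 = cityblock(s, t)        # L1 (Manhattan) distance, Eq. (1) with p = 1
d_l2 = euclidean(s, t)        # L2 (Euclidean) distance, Eq. (1) with p = 2
d_p3 = minkowski(s, t, p=3)   # general p-norm, e.g. p = 3
```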

Cosine metric

The cosine metric is one minus the cosine of the angle between two vectors; it therefore measures only the difference in vector orientation and ignores differences in vector magnitude.

$$D_{\mathrm{cosine}}(\mathbf{s},\mathbf{t}) = 1 - \frac{\sum_{i=1}^{N} s_i t_i}{\left(\sum_{i=1}^{N} s_i^{2}\right)^{1/2}\left(\sum_{i=1}^{N} t_i^{2}\right)^{1/2}}$$
(2)
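A hedged sketch of Eq. (2): SciPy's cosine distance is defined as one minus the cosine similarity, so it can stand in directly for this measure (the placeholder arrays are again assumptions).

```python
# Sketch of Eq. (2); scipy.spatial.distance.cosine returns 1 - cos(angle between s and t).
import numpy as np
from scipy.spatial.distance import cosine

s, t = np.random.rand(880), np.random.rand(880)   # placeholder patterns
d_cosine = cosine(s, t)
```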

Pearson product-moment correlation

The Pearson product-moment correlation measures the linear correlation between two spectra; the corresponding dissimilarity is one minus the correlation coefficient.

$$D_{\mathrm{Pearson}}(\mathbf{s},\mathbf{t}) = 1 - \frac{\sum_{i=1}^{N}(s_i-\bar{s})(t_i-\bar{t})}{\left(\sum_{i=1}^{N}(s_i-\bar{s})^{2}\right)^{1/2}\left(\sum_{i=1}^{N}(t_i-\bar{t})^{2}\right)^{1/2}},$$
(3)

where \(\bar{s}\) and \(\bar{t}\) denote the average values of \(s_i\) and \(t_i\), respectively. The Pearson product-moment correlation is related to the cosine metric by replacing \(s_i\) and \(t_i\) with \(s_i-\bar{s}\) and \(t_i-\bar{t}\), respectively.
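A minimal sketch of Eq. (3): SciPy's “correlation” distance is defined as one minus the Pearson coefficient, and the cross-check against scipy.stats.pearsonr makes the relationship explicit (placeholder arrays assumed).

```python
# Sketch of Eq. (3): the "correlation" distance equals 1 - Pearson r.
import numpy as np
from scipy.spatial.distance import correlation
from scipy.stats import pearsonr

s, t = np.random.rand(880), np.random.rand(880)   # placeholder patterns
d_pearson = correlation(s, t)
assert np.isclose(d_pearson, 1.0 - pearsonr(s, t)[0])
```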

Spearman rank correlation coefficient

Each component s_i is ranked by decreasing value, converting s_i and t_i into their ranks S_i and T_i, respectively. For example, if s_l is the second largest value in the set of all s_i, then S_l = 2. The Spearman measure is computed using:

$$D_{\mathrm{Spearman}}(\mathbf{s},\mathbf{t}) = \frac{6\sum_{i=1}^{N} d_i^{2}}{N^{3}-N},$$
(4)

where \(d_i = S_i - T_i\) is the difference in rank between s_i and t_i.
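As a sketch only: in the absence of tied ranks, Eq. (4) equals one minus the Spearman rank correlation coefficient, so the measure can be obtained from scipy.stats.spearmanr (placeholder arrays assumed).

```python
# Sketch of Eq. (4): with no ties, 1 - Spearman rho = 6 * sum(d_i^2) / (N^3 - N).
import numpy as np
from scipy.stats import spearmanr

s, t = np.random.rand(880), np.random.rand(880)   # placeholder patterns
d_spearman = 1.0 - spearmanr(s, t)[0]
```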

Jensen–Shannon divergence (JSD)

The JSD is a symmetric version of the Kullback–Leibler divergence D_KL, which measures the information lost when one distribution is used to approximate another.

$$D_{\mathrm{JSD}}(\mathbf{s},\mathbf{t}) = \frac{1}{2}D_{\mathrm{KL}}\!\left(\mathbf{s},\frac{\mathbf{s}+\mathbf{t}}{2}\right) + \frac{1}{2}D_{\mathrm{KL}}\!\left(\mathbf{t},\frac{\mathbf{s}+\mathbf{t}}{2}\right)$$
(5)
$$D_{\mathrm{KL}}(\mathbf{a},\mathbf{b}) = \sum_{i=1}^{N} a_i \log\frac{a_i}{b_i}$$
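A hedged sketch of Eq. (5): scipy.spatial.distance.jensenshannon returns the Jensen–Shannon distance (the square root of the divergence) and normalizes the input patterns to unit sum, so squaring the result recovers the divergence; the placeholder arrays are assumptions.

```python
# Sketch of Eq. (5); jensenshannon gives sqrt(JSD) with natural-log base by default.
import numpy as np
from scipy.spatial.distance import jensenshannon

s, t = np.random.rand(880), np.random.rand(880)   # placeholder non-negative patterns
d_jsd = jensenshannon(s, t) ** 2                  # square to recover the divergence
```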

Earth mover’s distance (EMD)

The EMD is computed by solving a transport problem. The intensity values of the diffraction patterns being compared are interpreted as mass, with each unit of intensity equal to a unit of mass. The measure is computed as the minimum total work required to deform one diffraction pattern into the second by moving mass.30 The EMD has a computational complexity of O(n²) and is therefore more computationally expensive than the previously mentioned dissimilarity measures, as D_p-norm, D_cosine, D_Pearson, D_Spearman, and D_JSD all have O(n) computational complexity. A fast-EMD algorithm with O(n) complexity is used here to reduce computation time.30
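For a one-dimensional illustration only (not the fast-EMD implementation cited above), SciPy's wasserstein_distance can play this role when the 2θ grid is treated as the support and the intensities as weights; note that SciPy normalizes the weights to unit mass, which differs from the unequal-area case discussed earlier. The grid and patterns below are assumptions.

```python
# Sketch: 1-D earth mover's (Wasserstein) distance between two patterns on a shared grid.
import numpy as np
from scipy.stats import wasserstein_distance

two_theta = np.linspace(42.6, 47.0, 880)          # degrees; shared measurement grid
s, t = np.random.rand(880), np.random.rand(880)   # placeholder non-negative patterns

# SciPy normalizes the weights, so total-intensity differences are not penalized here.
d_emd = wasserstein_distance(two_theta, two_theta, u_weights=s, v_weights=t)
```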

Dynamic time warping (DTW)

DTW measures the minimum non-linear warping required to map one numeric array onto another. DTW also has a computational complexity of O(n²); in this study, a fast-DTW algorithm with O(n) complexity was employed to shorten the calculation time.31
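As a sketch under the assumption that the third-party fastdtw package (one implementation of the O(n) approximation) is installed, the call below returns an approximate DTW distance and the associated warping path; the element-wise cost function and placeholder arrays are illustrative choices.

```python
# Sketch: approximate DTW distance between two 1-D patterns via the fastdtw package.
import numpy as np
from fastdtw import fastdtw  # third-party package, assumed installed (pip install fastdtw)

s, t = np.random.rand(880), np.random.rand(880)             # placeholder patterns
d_dtw, path = fastdtw(s, t, dist=lambda a, b: abs(a - b))   # absolute-difference local cost
```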

Normalized and constrained dynamic time warping (NC-DTW)

This work introduces the NC-DTW measure as a means for computing XRD dissimilarity. For NC-DTW, each diffraction pattern is normalized by its maximum value. Additionally, the range of potential warping paths from one array to another is limited by setting a window size. The window size defines the maximum number of indices by which each intensity value can be warped; e.g., for a window size of r, the intensity at index i can only be warped to a value in the set [s_{i−r}, …, s_{i+r}].32 Specifically, if the window size is set to 0.5 degrees, peaks in two diffraction patterns that are separated by less than 0.5 degrees can be matched and identified as similar, while peaks separated by more than the window size will be considered dissimilar.
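The following is a minimal sketch of the NC-DTW idea as described here, not the authors' implementation: each pattern is divided by its maximum, and a window of w index steps (the window in degrees divided by the 2θ step size) restricts how far the warping path may stray from the diagonal. The function name, the step size, and the example window are assumptions.

```python
# Minimal NC-DTW sketch: max-normalization plus a windowed (Sakoe-Chiba-style) DTW.
import numpy as np

def nc_dtw(s, t, w):
    """Windowed DTW distance between max-normalized 1-D patterns s and t (window w in index steps)."""
    s = np.asarray(s, dtype=float) / np.max(s)
    t = np.asarray(t, dtype=float) / np.max(t)
    n, m = len(s), len(t)
    w = max(w, abs(n - m))                        # the window must allow reaching (n, m)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = abs(s[i - 1] - t[j - 1])       # local cost between intensities
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Example: a 0.6 degree window on an (assumed) 0.005 degree grid corresponds to w = 120.
```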

Results

Performance: composition-phase map determination accuracy

A dissimilarity matrix was computed from the Fe–Co–Ni XRD patterns for each measure. The dissimilarity matrices were then used to sort samples into groups of similar structure using agglomerative HCA and k-medoids clustering, both methods that rely only on the dissimilarity matrix rather than the original XRD data. For HCA, cluster-cluster dissimilarity was computed using Ward's, average, centroid, complete, McQuitty, and median linkage methods. A discussion of the clustering methods and their respective efficacies can be found in ref. 20. The HCA and k-medoids clustering results were then compared to cluster labels defined by expert analysis29 using the multi-class F-measure to compute accuracy. The number of clusters was varied between 2 and 10 to observe the impact of cluster number choice on accuracy and to ensure that data effects such as outliers do not skew the overall analysis of measure performance. For visual analysis, the samples were color coded by cluster membership and plotted as a function of composition. The resulting diagram provides a potential composition-phase map, with cohesive regions of shared cluster label corresponding to potential phase regions. The phase map deduced from individual XRD patterns of the spread wafer is shown in Fig. 2a2, with the equilibrium phase diagram shown in Fig. 2a1 for reference. The automated composition-phase maps for each measure are shown in Fig. 2b–i for HCA using the average linkage method. The number of clusters was set to five because the known composition-phase map has five regions: bcc, fcc, hcp, bcc + fcc, and a region with no detected XRD peaks.
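As a hedged sketch of this step (the actual software packages are listed in the Supplemental), agglomerative HCA can be driven directly by a precomputed dissimilarity matrix; the L1-based distance and the random placeholder data below are assumptions, and note that Ward, centroid, and median linkage formally assume Euclidean input.

```python
# Sketch: hierarchical clustering of XRD patterns from a precomputed dissimilarity matrix.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((1125, 880))                        # placeholder patterns (samples x 2-theta bins)

D = pdist(X, metric="cityblock")                   # condensed pairwise L1 dissimilarities
Z = linkage(D, method="average")                   # or "complete", "ward", "centroid", ...
labels = fcluster(Z, t=5, criterion="maxclust")    # cut the dendrogram into five clusters
```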

Fig. 2

Result of the hierarchical cluster (HC) analysis in the Fe–Co–Ni ternary alloy system with different dissimilarity measures. a1 Known phase diagram (ref. 35). a2 Phase map deduced from individual XRD patterns of the spread wafer (ref. 29). b–i Results of the HC analysis with different measures: b Euclidean metric. c Manhattan metric. d Cosine metric. e Pearson metric. f JSD. g EMD. h DTW. i NC-DTW

The Spearman metric performed poorly for all the clustering techniques and numbers of clusters tested, sorting the majority of data points into one cluster with the remaining clusters containing one or two points from the edge of the composition diagram, and is excluded from the figure. The Spearman metric's poor performance was found to be due to the statistical distribution of XRD intensities. Figure 3 shows the data distribution (histogram) with respect to the diffraction intensity for the four XRD patterns used in the later qualitative metric multidimensional data scaling (MMDS) analysis (Section 2.1.1). The majority of intensity values are low and typical of noise, containing no diffraction peaks. The Spearman metric therefore calculates ranks primarily from noise, leading to improper dissimilarity values and poor clustering, and it was found to be an inadequate metric for XRD analysis.

Fig. 3

Data distribution (histogram) with respect to the diffraction intensity for four selected XRD patterns

F-measure is an accuracy measure that combines two properties, precision and recall. Precision is defined as the number of samples correctly identified as belonging to a phase region, the true positive (TP) count, divided by the sum of TP and the number of samples incorrectly labeled as belonging to that phase region, the false positive (FP) count. Similarly, recall is defined as the ratio of TP to the sum of TP and the number of samples incorrectly labeled as not belonging to the cluster, the false negative (FN) count. The F-measure is defined as the harmonic mean of precision and recall, with larger values corresponding to higher accuracy. To compute the F-measure, each sample is labeled by its phase region, given by the expert-derived composition-phase map. The set of samples in each expert-labeled cluster A_m is indexed by cluster label m and the set of samples in each computed cluster C_n is indexed by cluster label n. For each expert cluster, the best-matching automated cluster is selected by maximizing over n, and the weighted sum over expert clusters is recorded:

$$F = \sum_{m=1}^{K}\frac{|A_m|}{N}\max_{n}\frac{2R_{mn}P_{mn}}{R_{mn}+P_{mn}}$$
(6)
$$P_{mn} = \frac{TP}{TP+FP} = \frac{|A_m \cap C_n|}{|C_n|}$$
$$R_{mn} = \frac{TP}{TP+FN} = \frac{|A_m \cap C_n|}{|A_m|}$$

where |A_m| and |C_n| are the numbers of samples in clusters A_m and C_n, respectively, K is the total number of clusters, and N is the number of XRD patterns, i.e., K = 5 and N = 1125.
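A minimal sketch of Eq. (6) (not the authors' evaluation code), assuming `true` holds the expert phase-region labels and `pred` the automated cluster labels:

```python
# Sketch of the multi-class F-measure of Eq. (6).
import numpy as np

def multiclass_f_measure(true, pred):
    true, pred = np.asarray(true), np.asarray(pred)
    N, F = len(true), 0.0
    for m in np.unique(true):                      # expert-labeled clusters A_m
        A_m = true == m
        best = 0.0
        for n in np.unique(pred):                  # automated clusters C_n
            C_n = pred == n
            overlap = np.sum(A_m & C_n)            # |A_m intersect C_n|
            if overlap == 0:
                continue
            P = overlap / C_n.sum()                # precision P_mn
            R = overlap / A_m.sum()                # recall R_mn
            best = max(best, 2 * P * R / (P + R))  # harmonic mean, maximized over n
        F += (A_m.sum() / N) * best                # weight by cluster size |A_m| / N
    return F
```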

Figure 4 shows the mean and standard deviation (indicated by error bars) of F-measure accuracies, over both clustering method and number of clusters, computed for each dissimilarity measure as a function of computation time. The F-measure accuracy for each measure and clustering method with the number of clusters varied from 2 to 10 are shown in the Supplemental Figure S3.

Fig. 4

The computational cost and F-measure for each measure. F-measure statistics were computed over the various clustering techniques and over the range of cluster number 2–10

Figure 4 shows that the cosine, Pearson, and JSD measures all perform better than the Euclidean, Manhattan, DTW, and EMD measures in F-measure accuracy. This result is consistent with intuition from Fig. 2a–i. The cosine, Pearson, and JSD measures also come at a significantly lower computational cost than the DTW and EMD measures. Of interest is the significantly greater computational cost of the cosine metric compared to the Pearson metric, as these differ only in a simple computation, the subtraction of the variable mean in the Pearson metric. The difference in computation time was found to be due to the use of different computation packages, as discussed in the Supplemental.

This work introduces the use of the NC-DTW measure, which utilizes prior knowledge of the maximum peak shift magnitude to improve peak mapping and dissimilarity evaluation. Here the NC-DTW peak shift window is set to 0.6 degrees, permitting peak matching across ±0.6 degrees, as the magnitude of the peak shift for the bcc structure is approximately 0.6 degrees. As seen in Fig. 4, use of the NC-DTW measure provides the greatest accuracy, though still at a significant computational cost (2003.41 s). The clustering result shown in Fig. 2i is also quite similar to the correct composition-phase map shown in Fig. 2a2. The light blue, red, green, and blue regions in Fig. 2i represent the bcc, hcp, fcc, and bcc + fcc phases, respectively.

Fig. 5

Result of the MMDS analysis. a Four selected XRD patterns. The solid-blue and dotted-blue patterns are from the bcc phase; the solid-red and solid-black patterns are from the fcc and hcp structural phases, respectively. a′ Composition mapping for the four selected XRD patterns. b–i Results of the 2D MMDS analysis with different measures: b Euclidean metric. c Manhattan metric. d Cosine metric. e Pearson metric. f Jensen–Shannon divergence. g EMD. h DTW. i NC-DTW

To gain further insight into the relative performance of the various measures, a qualitative analysis using MMDS was performed.

Qualitative analysis: dimension reduction with metric multidimensional data scaling (MMDS)

To gain a qualitative understanding of the clustering performance of each measure, MMDS was used to map the measure space (defined by the minimum number of dimensions required to preserve the dissimilarities between each set of samples) to the more easily visualized two- or three-dimensional space, while minimizing the loss of inter-sample dissimilarities.22 The two- or three-dimensional projection allows for easier visualization of the clustering results and identification of the impact of measure choice on clustering performance.
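A hedged sketch of this projection, assuming the full symmetric dissimilarity matrix is available (the matrix below is a random placeholder): scikit-learn's metric MDS accepts precomputed dissimilarities and returns low-dimensional coordinates suitable for plotting.

```python
# Sketch: 2-D metric MDS embedding from a precomputed dissimilarity matrix.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
D = rng.random((50, 50))
D = (D + D.T) / 2.0                               # placeholder symmetric dissimilarities
np.fill_diagonal(D, 0.0)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)                     # (n_samples, 2) coordinates for plotting
```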

Four XRD patterns with different compositions were selected, as shown in Fig. 5a. The solid-red and solid-black patterns represent the Ni fcc and Co hcp structures, respectively. Although the solid-blue and dotted-blue patterns show different peak intensities, both correspond to the Fe bcc structure. The solid-red, solid-blue, dotted-blue, and solid-black patterns correspond to samples with compositions Fe5.9Co7.7Ni86.5, Fe85.3Co6.5Ni8.2, Fe89.8Co4.3Ni5.9, and Fe1.3Co77.4Ni21.3, respectively, as shown in Fig. 5a′.

Figure 5b–i shows the 2D MMDS results for the four selected patterns using the Euclidean, Manhattan, cosine, Pearson, JSD, EMD, DTW, and NC-DTW measures, respectively. A good dissimilarity measure should place the solid-blue and dotted-blue patterns close together, as they share the same structural phase (bcc), while keeping the distances between different structural phases large. In the 2D MMDS mappings for the Euclidean, Manhattan, EMD, and DTW measures, the distance between the solid-blue and dotted-blue patterns is large despite the fact that they come from the same structural phase, because these measures are strongly affected by the difference in peak intensity. On the other hand, the cosine, Pearson, and JSD dissimilarities place samples of the same structural phase closer together in the 2D MMDS mapping, since they mainly judge peak shape rather than peak intensity differences. The cosine, Pearson, and JSD metrics thus allow the different structural phases to be separated and the same structural phases to be clustered in the defined space.

When the structural phases are roughly divided into only the fcc, bcc, and hcp phases, and we expect to obtain the “correct” clustering result shown in Fig. 2a (the known composition-phase map), the A2 and B2 structures (A1 and L10 structures), in the Strukturbericht designations discussed below, have to be recognized as the same structural phase in metric space, because both A2 and B2 (A1 and L10) are based on the bcc (fcc) structure. In order to arrive at the “correct” clustering result, the peak shift due to the difference in lattice constant needs to be accounted for, and the diffraction peaks from A2 and B2 (A1 and L10) need to be identified as arising from the same or equivalent reflections. EMD and DTW were expected to handle the peak shift issue because of their robustness against such peak shifting. However, EMD and DTW led to poor clustering results, as shown in Fig. 2g, h, for two reasons. First, EMD and DTW are strongly affected by peak intensity differences; accordingly, the distance between the solid-blue and dotted-blue patterns in the MMDS space (Fig. 5g, h) is large despite the fact that they come from the same structural phase. Second, misidentification of diffraction peaks may occur. For example, DTW has “mixed up” the fcc and bcc peaks, since the distance between solid-red (fcc) and solid-blue (bcc) is very small in Fig. 5h due to the measure’s robustness against peak location differences.

The NC-DTW measure introduced in this work resolves these issues. Compared to the MMDS mapping for DTW in Fig. 5h, the distance between the solid-blue and dotted-blue patterns in Fig. 5i is small due to the normalization. This indicates that the normalization reduces the influence of peak intensity differences on the clustering results. Moreover, the peak shift window constraint increases the distance between fcc and bcc by defining an allowable peak shift range.

From these qualitative and quantitative analyses, we conclude that the cosine and Pearson metrics should be the measures of choice in cluster analysis of XRD patterns due to their high accuracy and low computational cost. Furthermore, when prior knowledge of the maximum peak shift magnitude is available, the NC-DTW should be selected if adequate time exists for the increased computational cost.

Impact of measure on parameter selection: number of clusters

For this study, expert-derived labels were available to quantify the accuracy of clustering. However, for the high-throughput systems described in Fig. 1, this is rarely the case. Typically, only the structure and composition data are available, and the automated clustering results are presented to the expert for verification. For these cases, a set of factors can impact performance: the choice of measure, the choice of clustering method, and the choice of clustering method parameters. There are numerous methods for quantifying clustering performance when the correct class assignment is unknown,20 though expert analysis is typically taken as the final say. These methods are used for identifying the optimal clustering parameters, including the number of clusters. For this work, cluster performance is evaluated with the bootstrap method, and the number of clusters is chosen to minimize cluster instability.33 Figure 6a shows the cluster instability calculated by the bootstrap method for the HCA-average linkage method with the JSD metric in the Fe–Co–Ni system. When the number of clusters is equal to six, the cluster analysis is the most stable. Therefore, this result suggests setting the number of clusters to six.
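A minimal sketch of a bootstrap instability estimate, under the assumption (one of several published variants) that instability is measured as the average co-membership disagreement between clusterings of bootstrap resamples, evaluated on the samples the resamples share; the L1 distance, average linkage, and placeholder data are illustrative choices.

```python
# Sketch: bootstrap cluster-instability estimate for choosing the number of clusters k.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def instability(X, k, n_boot=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    labelings, index_sets = [], []
    for _ in range(n_boot):
        idx = np.unique(rng.integers(0, n, n))                  # bootstrap resample (unique indices)
        Z = linkage(pdist(X[idx], metric="cityblock"), method="average")
        labelings.append(fcluster(Z, t=k, criterion="maxclust"))
        index_sets.append(idx)
    scores = []
    for a in range(n_boot):
        for b in range(a + 1, n_boot):
            common, ia, ib = np.intersect1d(index_sets[a], index_sets[b], return_indices=True)
            if len(common) < 2:
                continue
            la, lb = labelings[a][ia], labelings[b][ib]
            same_a = pdist(la[:, None], metric="hamming") == 0  # co-membership in clustering a
            same_b = pdist(lb[:, None], metric="hamming") == 0  # co-membership in clustering b
            scores.append(np.mean(same_a != same_b))            # pairwise disagreement
    return float(np.mean(scores))

# Choose the k (e.g. k = 2..10) that minimizes instability(X, k).
```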

Fig. 6

Automated determination of the cluster number using the bootstrap method. a Cluster instability computed by the bootstrap method. b Result of the cluster analysis with six clusters, the number determined by the bootstrap method. c XRD patterns for the samples in the black rectangle (FexCo100−xNi2.5±1.0) in b

Figure 6b shows the clustering result with six clusters obtained by the HCA-average linkage method with the JSD metric. Compared to Fig. 2f, the bcc phase is separated into two phases (the blue and pink regions in Fig. 6b). This result reflects the difference in lattice constant, which causes a peak shift in the XRD patterns. The Co atomic radius is slightly smaller than the Fe atomic radius, and as a result, the bcc diffraction peak shifts to higher angle for Co-rich compositions, as shown in Fig. 6c, which displays the XRD patterns in the black rectangle (FexCo100−xNi2.5±1.0) in Fig. 6b.

The result in Fig. 6b can be explained by the Strukturbericht designation, where Fe50Co50 and Fe50Ni50 can take the B2 and L10 structures, respectively. Therefore, the pink, blue, red, black, light blue, and green regions in Fig. 6b represent the A2, B2, A1, A3, L10, and A1 + A2 structures, respectively.

Existing problems and future work

Window size in NC-DTW

In Section 2.1, the window size r was set to 0.6 degrees, since we knew that the magnitude of the peak shift was about 0.6 degrees in the Fe–Co–Ni system. However, the magnitude of the peak shift is usually unknown when combinatorial data is analyzed for the first time. Therefore, a method that determines the window size automatically is essential when NC-DTW is used in the cluster analysis. One possible approach is deductive, based on condensed matter theory; for example, ab initio calculations might allow us to estimate the magnitude of the peak shift in the XRD patterns. Another possible approach is inductive, based on statistics; for example, the magnitude of the peak shift might be predicted by supervised machine learning when a large amount of XRD data from other materials has already been stored in a database.

Extension to the other fields

The cosine, Pearson, JSD, and NC-DTW measures were found to be the best-suited metrics for cluster analysis of combinatorial XRD data. It would be interesting to determine the best metrics for Raman spectroscopy (Raman), X-ray magnetic circular dichroism (XMCD), and other physical property data. One must carefully choose the appropriate metric(s) taking into account the detailed characteristics of the different types of physical property data. For example, the cosine metric, which captures only peak shape (not peak intensity), is not likely to be suitable for XMCD data because peak intensity information in XMCD is very important. We have previously shown that the Pearson metric is effective for clustering Raman spectroscopy data.34 We are currently in the process of evaluating different metrics for a variety of data formats from other materials characterization techniques.

Discussion

We have investigated the effect of the dissimilarity measure on cluster analysis of XRD data. Clustering with the Euclidean, Manhattan, cosine, Pearson, Spearman, JSD, EMD, DTW, and NC-DTW measures was carried out on XRD data from the Fe–Co–Ni ternary alloy spread. It was found that the cosine, Pearson, JSD, and NC-DTW measures are the most suitable for XRD data analysis, with the cosine, Pearson, and JSD measures providing optimal results in the presence of peak height changes and NC-DTW providing optimal results when prior knowledge of the peak shift magnitude is available to be incorporated into the analysis. In addition, selecting the cosine measure over DTW was shown to reduce the dissimilarity computation time by two orders of magnitude (fastDTW: 1165 s versus cosine metric: 8.52 s). Similar results were obtained for another ternary system, and those results are provided in the supplementary section. Dissimilarity measure selection was shown to provide the translation invariance and scale invariance required to properly handle peak height and peak shift changes. For improved results, further physical constraints such as the Gibbs phase rule can be introduced through constraints applied in the clustering algorithm.19

We have looked at composition spreads of a number of other materials systems (about 10, all metallic alloys or oxides) using some of the same metrics and clustering algorithms. Some systems were indeed more complicated than others: while most of the samples were textured, some were polycrystalline, and some contained diffraction peaks from impurity phases. The results on the relative performance of the different metrics were found to be the same as, or very similar to, those described here for the Fe–Co–Ni spread. We have chosen the Fe–Co–Ni system for the current manuscript as the most representative and most effectively illustrative example because it has a known phase diagram and because the phases on the composition spread had been identified “manually,” one by one, prior to the current work.29

This study provides the basis for a broader analysis of dissimilarity measures and their efficacy across different materials systems and their respective data features. In particular, future work will investigate whether the lessons learned in this study extend to systems with a greater number and complexity of phase regions by appropriately increasing the composition sampling resolution of the composition spread to ensure well-defined phase regions. Furthermore, the effects of data artifacts associated with XRD measurement, such as a limited 2θ sampling range or a low signal-to-noise ratio, will also be addressed. Additionally, the impact of uncertainty in measurement and data analysis will be analyzed for its propagated effect on the final composition-phase map.

It is important to point out that the challenge of having to analyze a large amount of data in materials science goes beyond screening combinatorial libraries. Mapping spatially resolved properties of samples is increasingly a desirable mode of materials interrogation as various characterization techniques (such as scanning probe microscopy and synchrotron micro-diffraction) continue to become more powerful with ever higher spatial resolution. The present results are therefore applicable to any experiment where a large number of XRD patterns is collected.

Methods

The set of dissimilarity measures was computed for the Fe–Co–Ni and Fe–Ga–Pd XRD data using the software packages listed in the Supplemental. The dissimilarity results were then used to cluster the diffraction patterns into phase regions using agglomerative HCA and k-medoids clustering. For HCA, cluster-cluster dissimilarity was computed using Ward's, average, centroid, complete, McQuitty, and median linkage methods. The clustering results were then compared to phase region labels defined by expert analysis,29 with the accuracy computed using the multi-class F-measure. The number of clusters was varied between 2 and 10 to observe the impact of cluster number choice on accuracy and to ensure that data effects such as outliers do not skew the overall analysis of measure performance. To gain a qualitative understanding of the clustering performance of each dissimilarity measure, MMDS was used to map each measure space (defined by the minimum number of dimensions required to preserve the dissimilarities between each set of samples) to the more easily visualized two-dimensional space, while minimizing the loss of inter-sample dissimilarities.22 The two-dimensional projections were then used to identify the impact of measure choice on clustering performance.