Density-of-states similarity descriptor for unsupervised learning from materials data

We develop a materials descriptor based on the electronic density-of-states (DOS) and investigate the similarity of materials based on it. As an application example, we study the Computational 2D Materials Database (C2DB) that hosts thousands of two-dimensional materials with their properties calculated by density-functional theory. Combining our descriptor with a clustering algorithm, we identify groups of materials with similar electronic structure. We introduce additional descriptors to characterize these clusters in terms of crystal structures, atomic compositions, and electronic configurations of their members. This allows us to rationalize the found (dis)similarities and to perform an automated exploratory and confirmatory analysis of the C2DB data. From this analysis, we find that the majority of clusters consist of isoelectronic materials sharing crystal symmetry, but we also identify outliers, i.e., materials whose similarity cannot be explained in this way.


Introduction
The creation of databases for computational materials science has led to a huge amount of stored calculations, far exceeding any human's ability to comprehend the information they contain. Thus, algorithmic data-analysis methods need to be leveraged to extract knowledge from this large pool of data. Domain-specific search interfaces, provided by public databases [1,2,3,4,5], are one way to make information findable. These interfaces allow researchers to identify materials of interest, e.g., in terms of structural features like space group or atom types, or in terms of properties like the electronic band gap. However, such features provide only little insight. Furthermore, the use of search interfaces is mostly limited to confirmatory analysis: Having a concrete physical mechanism in mind, e.g., the change of properties of alloys with stoichiometry, researchers can manually search for materials that allow them to confirm or reject a hypothesis.
Learning from data, however, is not limited to this kind of analysis. For instance, relations between materials in terms of certain properties may become apparent only in large quantities of data. To reveal such relations and make use of them, both an in-depth understanding of when we consider materials to be similar and powerful data-analysis methods are required. A prerequisite for understanding how different materials relate to one another is the availability of descriptive, numerical representations (descriptors) that accurately capture (dis)similarities, e.g., stemming from the atomic and/or electronic structure.
In recent years, several descriptors of the atomic structure have been published [6,7,8] and successfully applied to the prediction of material properties using machine learning (ML) techniques. However, descriptors based on the electronic structure are not well established in the ML community. In early work by Isayev and coworkers [9], descriptors of both the electronic density-of-states (DOS) and the band structure were used to create a graphical representation of more than 20000 materials from the AFLOWlib database. More recently, supervised ML was proposed [10] to predict electronic densities of states by decomposing them into local atomic contributions. Furthermore, a descriptor based on atomic distances, the projected densities of states (PDOS), and the Kohn-Sham band gap was shown [11] to improve the prediction of computationally expensive material properties.
The majority of ML approaches in materials science focus on speeding up research. This concerns, for instance, the prediction of materials properties that are time-consuming to compute, like the electronic band gap, or the optimization of established methods, e.g., speeding up molecular-dynamics simulations through ML-based force fields. For such tasks, highly non-linear ML models and/or complex material descriptors are necessary to achieve decent prediction accuracy. Moreover, the underlying data are typically considered only as input for the ML models and are not analyzed further.
In this work, we aim at obtaining a deeper understanding of large materials data spaces by rationalizing the reasons behind features that materials may share. We demonstrate our approach using the similarity of materials in terms of their electronic properties. To this end, we develop a tunable DOS fingerprint that encodes the DOS of a material into a binary-valued two-dimensional (2D) map, inspired by the work of Ref. [9]. Combining it with unsupervised ML methods, we showcase its use by revealing similarities in the electronic structure of materials from the Computational 2D Materials Database (C2DB) [2]. We are able to uncover not only expected trends, e.g., clusters consisting of materials containing isoelectronic substitutions of atomic species, but also unexpected correlations, e.g., clusters of structurally very different materials. Our results show that explorative analysis of a database allows for finding relations between materials that could not be foreseen without comprehensive, data-driven analysis.

Clustering
To identify sets of similar materials, we use the clustering algorithm defined in the methods section. Employing a similarity threshold of $S_\mathrm{thres} = 0.75$, we find 294 distinct clusters that contain in total ∼23% of the materials in the entire data set. The remaining 2697 orphans are less similar than $S_\mathrm{thres}$ to any other material in the data set and are not considered further in the analysis presented here. We call the materials in a cluster its members and define the size of a cluster as the number of its members. The compactness of a cluster is determined by its radius $r_c = 1 - S_\mathrm{min}$, with $S_\mathrm{min}$ the minimum similarity between any two members of the cluster. Figure 1 presents the distribution of cluster sizes on a logarithmic scale together with the maximal and mean cluster radii for clusters of a given size. About two thirds (200) of the clusters contain only two materials. Since the clustering algorithm requires that every member of a cluster has a similarity of at least $S_\mathrm{thres}$ to the reference material, the radii of two-point clusters are bounded by $r_c \le 0.25$. The mean cluster radii for clusters with more than two members increase to $r_c \sim 0.4$ with increasing cluster size. Interestingly, even though the clustering algorithm allows the maximal cluster radius to be as large as $r_c = 0.5$, the maximal cluster radii of the discovered clusters are all smaller than 0.4. We note that the parameters chosen here serve the purpose of showcasing our approach. Both the similarity threshold and the energy range can be varied to focus such an analysis on certain aspects of the data. For instance, to search only for compact clusters, the similarity threshold can be increased. This, however, reduces the number of discovered clusters and their size, which ultimately prevents the discovery of meaningful clusters. Conversely, reducing the similarity threshold increases both cluster size and number of clusters, at the expense of larger cluster radii. Too large cluster radii bear the risk of masking meaningful relations between data points in large clusters, which hinders the automatic analysis of clusters.
To illustrate the similarity relations between materials, we calculate pairwise similarities between all materials in the data set, i.e., a symmetric matrix with elements $S_{ij} = S(f_i, f_j)$. In other words, each column and row of this matrix corresponds to the similarities of a single material to the rest of the data set. The diagonal elements of the matrix are identical, $S_{ii} = 1$, as they describe the similarity of each material with itself. An excerpt of the full matrix can be seen in Fig. 2. The order of columns and rows has been chosen according to the cluster sizes, i.e., such that the largest cluster appears in the top left corner of the matrix, and the cluster radius decreases with increasing index. The color code makes apparent that many of the clusters are very dissimilar to each other, i.e., the average similarity of the cluster members to the rest of the data set is low. For some others, however, the opposite is the case, and one could expect that choosing a smaller threshold would merge them, as pointed out above. An example of this will be given in the following section.
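The cluster radius defined above can be sketched in a few lines of Python. This is only an illustration, assuming that the pairwise similarities are available as a nested list; the function name `cluster_radius` and the numbers in the toy matrix are hypothetical and not taken from the C2DB analysis.

```python
def cluster_radius(similarity, members):
    """Return r_c = 1 - S_min, with S_min the smallest pairwise similarity
    between any two members of the cluster."""
    s_min = min(
        similarity[i][j]
        for pos, i in enumerate(members)
        for j in members[pos + 1:]
    )
    return 1.0 - s_min

# Hypothetical symmetric similarity matrix with unit diagonal (S_ii = 1).
S = [
    [1.00, 0.80, 0.76],
    [0.80, 1.00, 0.60],
    [0.76, 0.60, 1.00],
]

radius_three = cluster_radius(S, [0, 1, 2])  # limited by S = 0.60
radius_two = cluster_radius(S, [0, 1])       # two-point cluster
```

In this toy example, materials 1 and 2 are both above the threshold of 0.75 with respect to material 0 but only 0.60 similar to each other, so the three-point cluster has a larger radius than the two-point cluster.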

Analysis of selected clusters
In the following we analyze individual clusters and reason why the materials in these clusters are similar.

Isoelectronic substitutions
Figure 3 presents the DOS and crystal structures of five transition-metal dichalcogenides (TMDCs) forming a cluster. Its cluster radius $r_c = 0.28$ is close to the mean for this cluster size (see Fig. 1). Visual inspection of the corresponding DOS reveals a pronounced overall similarity in terms of i) the shape of the spectra inside the feature region $|E| < 2$ eV, and ii) the size of the PBE band gap, which varies from 0.52 eV (TiSe$_2$) to 0.65 eV (Hf$_2$Ti$_2$Se$_8$). Above 2 eV, the DOS become more dissimilar, as expected from the coarser representation of the DOS outside the feature region.
Considering the crystal lattice, the cluster members are very similar. All materials consist of a layer of TMs between two layers of Se. The cluster contains the binary phase (TiSe$_2$) as well as ternary phases, where either one or two Ti atoms are substituted with Hf, Zr, or both. The type of substitution has only a minor influence on the DOS. This does not come as a surprise, as all substitutions within this cluster are isoelectronic, i.e., with atomic species from group 4 of the periodic table of elements (PTE). We note here that there exists another cluster of materials with the same structural prototype, containing, among other materials, the binary compounds Se$_8$Hf$_4$ and Se$_8$Zr$_4$. These materials form a separate cluster because they have higher PBE band gaps, ranging between 0.72 eV (Se$_8$TiHf$_3$) and 0.82 eV (Se$_8$HfZr$_3$). With a lower similarity threshold, these clusters merge.
To further demonstrate the isoelectronic behavior of the materials of the cluster considered here, we compare their PDOS in Fig. 4. Their valence bands are mainly composed of fully occupied Se 4p states. The conduction bands have predominant Ti 3d character, with additional contributions from 4d or 5d states of Zr and Hf (when present) [12]. The latter all lie in the same energy range and sum up to the same number of d states of the four group-IV TMs. The hybridization of TM-3d with Se-p orbitals is evident from small contributions of d states in the valence region and Se-p states in the conduction region. In sum, the similarity of the electronic spectrum of these materials becomes clear: The replacement of Ti by either Hf or Zr does not alter the valence band, while the conduction states are composed of empty d shells of the transition metals, which amount to the same number of empty states.
Several other clusters exist in the data set that consist of isoelectronic materials, i.e., they may contain different elements but have the same number of valence electrons. To discover them, we make use of the PTE descriptor introduced in the methods section. Overall, our descriptor identifies 230 clusters in which all members have the same $c_m$ (compare Sec. Methods, Eq. 7 for details). This number of clusters corresponds to 78% of all clusters and 16.5% of the materials in the full data set. Therefore, we conclude that isoelectronicity is a main reason for the similarity of the DOS of the materials in the C2DB. In addition, 88.8% of all clusters contain at least two materials that have the same $c_m$.
Figure 3: Densities of states (top) and unit cells (bottom) of the materials of a selected DOS cluster. The Fermi level is located at E = 0 eV. The cluster center, Se$_8$Ti$_3$Hf, is indicated in bold font in the legend. The inset shows their similarity matrix, where the color code is adapted to reflect the high similarities between the cluster members. Here, sub-clusters become visible for materials with the same number of substituents, i.e., the materials containing one or two substituents are more similar to one another than to the other cluster members.

Materials with isoelectronic surface groups
The second most common origin of similarity concerns the substitution of fluorine atoms at the materials' surfaces with OH groups. Figure 5 presents an example of such a cluster. These metallic materials consist of five alternating layers of carbon and either Ta or Nb. Again, Ta and Nb are isoelectronic, i.e., both are from group 5 of the PTE. At the two surfaces of the materials, either F atoms or OH groups form bonds with the underlying TM atoms.
The cluster radius is $r_c \sim 0.28$, which is close to the mean value for four-point clusters. The general shapes of the DOS curves are similar. Inspection of the PDOS (not shown) reveals that the whole spectrum is dominated by d states of the TM. They hybridize with C and TM p states. The p bands have the largest contribution around E ≃ 0 and E ≃ 1.5 eV, where also significant contributions from F and O p states are present. Thus, the saturated O in the hydroxyl group acquires an electronic configuration analogous to F and binds similarly to the TM atoms. In other words, the OH group can be regarded as isoelectronic to F. Minor differences between the spectra, for instance displaced peaks below -1 eV, mainly originate from multiple van Hove singularities, which are very sensitive to the precise location of band extrema and flat bands.
In total, we find 33 clusters in the data set where F at the surface is interchanged with an OH group. In most cases, these clusters also contain sets of materials with other isoelectronic substitutions. Let us note in passing that, although the similar behavior of fluorine atoms and hydroxyl groups is well-established expert knowledge in chemistry and electronic-structure theory, it is not trivial to access such knowledge in an automated manner. So far, the search interfaces of most databases, as well as descriptors for machine learning, rely on structural features, e.g., the chemical formula or the number of atoms in the unit cell. Thus, similar materials with, e.g., different numbers of atoms in the unit cell are unlikely to be found by these methods.

Role of crystal lattice
Now, we focus our search on clusters of materials composed of identical atoms but different host lattices. Figure 6 presents the data from three different phases of In$_2$S$_2$, which belongs to the class of post-transition metal chalcogenides. We designate them by their SG (value of their SG descriptor, cf. Methods), where we distinguish the two materials with SG 164 as SG 164-1 and SG 164-2. The semiconducting structures resulting from the phases SG 164-1 and SG 187 form a cluster due to the similarity of their DOS throughout the whole observed energy range. For comparison, we show the DOS of a third phase, SG 164-2, which shares its symmetry with the first material but shows markedly different behavior and, thus, is not part of the same cluster. While the similarity coefficient between the clustered materials is 0.76, the corresponding values with the third phase are 0.32 and 0.34, respectively. This finding goes hand in hand with the fact that the clustered materials have medium-sized band gaps of 1.60 eV (SG 187) and 1.68 eV (SG 164-1) [13], whereas the third one is metallic. Despite sharing the space group, the two phases with SG 164 show significant structural differences, as evident from the top views of the unit cells depicted in Fig. 6. This can be further illustrated by considering the stacking of In and S layers: For SG 164-1, the layer sequence corresponds to ABBC stacking; for SG 187, it is ABBA; for SG 164-2, it is ABDC. The (dis)similarity of the different phases lies in the particular electronic configuration acquired by the atomic species: In the semiconducting phases, In adopts covalent bonding, manifested by a valence band that is dominated by hybridized S and In p states [14]. In this electronic configuration, the In atoms are tetrahedrally coordinated by three S atoms and one In atom. Here, the In-In bond length is $d_\mathrm{In-In} = 2.82$ Å. In the metallic phase, In atoms form metallic bonds with a significant contribution from In s states. In this case, every In atom is coordinated with three In atoms, and the bond length is $d_\mathrm{In-In} = 3.62$ Å, which is close to that of bulk metallic In (3.38 Å). The metallic phase is metastable as compared to the semiconducting ones [13].
Figure 4: PDOS of the materials in the cluster shown in Fig. 3. The Fermi level is located at E = 0 eV. Although the contributions of individual orbitals vary between the different materials, due to their similar shape, their sum, i.e., the total DOS, is similar.
We note that in this case the dissimilarity between the semiconducting and metallic phases can be explained by neither the SG nor the PTE descriptor. Nonetheless, our DOS similarity search is able to capture the underlying electronic configuration and group together structures with identical atomic coordination, albeit different crystal structure.

Outliers
Overall, there are 25 clusters in the data set with materials that are neither isoelectronic nor share the crystal lattice, i.e., they cannot be explained by our SG and PTE descriptors. Therefore, as a final example, we focus on a cluster that consists of two materials that have no apparent similarities in their atomic structures, neither in symmetry nor in composition. They are presented in Fig. 7. While Ta$_2$BS$_2$ has the trigonal space group 164, Bi$_4$Cu$_4$ is characterized by an orthorhombic lattice with SG 51. Despite their structural dissimilarities, the DOS of both materials resemble each other with a similarity coefficient of 0.76, i.e., slightly above the threshold of $S_\mathrm{thres} = 0.75$. Both materials exhibit a nearly constant DOS between −1.5 and 2 eV, while it increases below. To get a deeper insight, we show in Fig. 8 the band structures of Ta$_2$BS$_2$ and Bi$_4$Cu$_4$, indicating the atomic character of the bands, together with the corresponding DOS projected on the different atomic species. While in the case of Ta$_2$BS$_2$ only one band with mixed p-d character crosses the Fermi level, the energy spectrum of Bi$_4$Cu$_4$ exhibits several bands composed almost exclusively of Bi and Cu p states, giving rise to a more complicated topology of the Fermi surface. We conclude that the similarity of these materials' DOS is accidental and that this cluster can indeed be considered an outlier.

Discussion
In this work, we have presented a fingerprint of the electronic DOS that allows one to quantitatively evaluate the similarity of materials in terms of their electronic structure. We have applied this fingerprint to the C2DB database, a large, heterogeneous data set of two-dimensional materials. Based on our similarity measure, we have devised a clustering algorithm to filter the data for sets of materials that exhibit pronounced similarities to one another. A significant number of (small) clusters have been identified and further analyzed. More specifically, 23% of the materials can be associated with at least one other material in the data set. The majority of similarities in these particular materials can be explained by the similarity of the valence configuration of the involved atomic species, thus confirming physical expectations. This confirmatory analysis has been performed in an automatic fashion based on descriptors that are able to identify those physical reasons. In this way, we could identify, for instance, 16.5% of the materials as being isoelectronic to at least one other material of the database and exhibiting a similar DOS. Our approach could easily be extended by introducing new descriptors with the potential of explaining other reasons for materials to be similar. In summary, our method provides a means of analyzing large data sets from electronic-structure theory and contributes to understanding, controlling, and selecting such data in view of their re-use in other contexts. Last but not least, the finding of accidental (unexpected) similarities may be of relevance in technological applications, where considering materials with different composition but similar properties could lead to, e.g., structures that are easier to synthesize or reveal other properties that are superior to those of the known materials.

Electronic density-of-states fingerprints
The analysis of spectra like the DOS is typically done by visual inspection, i.e., in a qualitative manner. For large data sets, this kind of analysis quickly becomes unfeasible. Therefore, a descriptor that allows for automated processing of such data is required. This includes a suitable numerical representation of the DOS. In the following, we review such representations that have been proposed in the literature, state their drawbacks, and describe how we overcome them.
To quantitatively compare the DOS of materials, Isayev et al. [9] constructed a DOS fingerprint by encoding the DOS in the energy range between −10 and 10 eV as a series of 256 floating-point numbers (4 bytes each). A similar pointwise representation was considered in Ref. [10] for building predictive models for the DOS based on Gaussian regression. It was pointed out that such a representation is inefficient, as it may require many sampling points of the DOS to train the models efficiently. Moreover, loss functions based on that representation turned out to be largely insensitive to spectral features with small overlap. To overcome these problems, the authors proposed two approaches: i) a truncated basis expansion based on principal-component analysis (PCA), which leads to an effective reduction of the degrees of freedom of the fingerprint (effectively smoothing the DOS spectra), and ii) a representation based on the cumulative distribution function associated with the DOS. The latter improved the sensitivity of the loss functions to non-overlapping spectral features. More recently, a high-dimensional fingerprint based on the DOS projected on different atomic orbitals and sites, followed by PCA dimensionality reduction, has been proposed [11].
A common drawback of all these DOS fingerprints is that they equally weigh the contributions from the entire energy range considered in the spectra. Thus, they do not account for the fact that different energy regions are associated with distinct physical phenomena. For instance, the shape of the DOS close to the top of the valence band and the size of the band gap are the most important aspects in the search for p-doped materials. Likewise, for metals, the magnitude and shape of the DOS around the Fermi energy are most relevant. Other research may focus on some features of the conduction band. Although the PCA-based approach mentioned above can effectively lead to a re-weighting of spectral features, it cannot be tailored at will to focus on specific regions, but is determined by the training data that is used for the construction of the descriptors.
Figure 7: DOS (top) and atomic structures (bottom) of a cluster of materials that share neither atomic species nor crystal structure. To increase visibility, the unit cell is repeated in both in-plane directions. The Fermi level is located at E = 0 eV.
To overcome the described issues, we have developed a DOS fingerprint that allows for a tailored weighting of spectral features. Using a non-uniform discretization of the energy axis, the fingerprint can be adapted to focus on desired energy regions. To achieve this discretization, the DOS is transformed into a two-dimensional raster image (Fig. 9, bottom panel) as follows: First, the spectrum is shifted such that the energy $\varepsilon = 0$ is located at a reference energy $\varepsilon_\mathrm{ref}$, which defines the main focus of the fingerprint. Then, the DOS ($\rho(E)$, Fig. 9 top) is integrated over an even number $N_\varepsilon$ of intervals of variable widths $\Delta\varepsilon_i$ to obtain a histogram $\{\rho_i\}$ (Fig. 9, second panel):
$$\rho_i = \int_{\varepsilon_i}^{\varepsilon_i + \Delta\varepsilon_i} \rho(\varepsilon)\,\mathrm{d}\varepsilon. \tag{1}$$
The integration intervals $\Delta\varepsilon_i$ are defined as
$$\Delta\varepsilon_i = n(\varepsilon_i, W, N)\,\Delta\varepsilon_\mathrm{min}, \tag{2}$$
where $\Delta\varepsilon_\mathrm{min}$ is a parameter giving the minimal integration width and $n$ is the integer-valued function
$$n(\varepsilon, W, N) = \lfloor g(\varepsilon, W)\,N + 1 \rfloor, \tag{3}$$
where $\lfloor\cdot\rfloor$ denotes the 'round down' operator and $g(\varepsilon, W) = 1 - \exp(-\varepsilon^2/2W^2)$. Here, the parameter $N \in \mathbb{N}$ ($N > 1$) determines the maximum interval width $N\Delta\varepsilon_\mathrm{min}$, and the parameter $W$ determines the feature region: For $\varepsilon = 0$, $\Delta\varepsilon_i$ equals $\Delta\varepsilon_\mathrm{min}$, while it approaches $N\Delta\varepsilon_\mathrm{min}$ for $|\varepsilon| > W$. In this way, a finer discretization of the histogram is obtained for energies in the feature region $|\varepsilon| < W$. This is illustrated by the integration limits indicated by vertical lines in the second panel of Fig. 9. From this histogram, a raster graphic is generated by defining a grid of pixels, as shown in the third panel of Fig. 9. Every column $i$ of the histogram is discretized in a grid of $N_\rho$ intervals of height
$$\Delta\rho_i = n(\varepsilon_i, W_H, N_H)\,\Delta\rho_\mathrm{min}. \tag{4}$$
Here, the parameters $W_H$, $N_H$, and $\Delta\rho_\mathrm{min}$ play a role analogous to $W$, $N$, and $\Delta\varepsilon_\mathrm{min}$ above: Close to $\varepsilon = 0$, a fine discretization $\Delta\rho_i = \Delta\rho_\mathrm{min}$ is obtained, while it approaches $N_H\Delta\rho_\mathrm{min}$ for $|\varepsilon| > W_H$. Finally, the number of "filled" pixels in column $i$ is determined by
$$\min\left(\left\lfloor \rho_i/\Delta\rho_i \right\rfloor, N_\rho\right), \tag{5}$$
resulting in the 2D raster image of the bottom panel of Fig. 9, containing $N_\varepsilon \times N_\rho$ pixels enumerated by an index $\alpha$. This image is then transformed into a binary-encoded vector $f = (f_1, ..., f_{N_\varepsilon \times N_\rho})$ with component $f_\alpha = 1$ if the pixel $\alpha$ is filled and 0 otherwise.
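The construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy DOS, all parameter values, the midpoint-rule integration, the symmetric outward growth of the energy grid from ε = 0, and the choice of evaluating the step function at the interval edge closest to ε = 0 are assumptions made for illustration only.

```python
import math

def n_steps(eps, width, n_max):
    """Integer multiplier floor(g * n_max + 1), g = 1 - exp(-eps^2 / (2 W^2))."""
    g = 1.0 - math.exp(-eps ** 2 / (2.0 * width ** 2))
    return math.floor(g * n_max + 1)

def energy_edges(e_max, d_eps_min, width, n_max):
    """Symmetric interval edges grown outward from eps = 0 (assumed construction)."""
    pos = [0.0]
    while pos[-1] < e_max:
        pos.append(pos[-1] + n_steps(pos[-1], width, n_max) * d_eps_min)
    return [-e for e in pos[:0:-1]] + pos

def integrate(rho, a, b, samples=50):
    """Midpoint-rule integral of the callable rho over [a, b)."""
    h = (b - a) / samples
    return h * sum(rho(a + (k + 0.5) * h) for k in range(samples))

def dos_fingerprint(rho, edges, d_rho_min, width_h, n_h, n_rho):
    """Binary vector with n_rho pixels per energy column, filled from the bottom."""
    bits = []
    for a, b in zip(edges[:-1], edges[1:]):
        eps_i = a if a >= 0 else b  # edge closest to eps = 0 (assumption)
        rho_i = integrate(rho, a, b)
        d_rho = n_steps(eps_i, width_h, n_h) * d_rho_min
        filled = min(math.floor(rho_i / d_rho), n_rho)
        bits.extend([1] * filled + [0] * (n_rho - filled))
    return bits

# Hypothetical toy DOS (one Gaussian peak below the reference energy).
toy_dos = lambda e: math.exp(-(e + 1.0) ** 2)
edges = energy_edges(e_max=2.0, d_eps_min=0.5, width=1.0, n_max=2)
fingerprint = dos_fingerprint(toy_dos, edges, d_rho_min=0.05,
                              width_h=1.0, n_h=2, n_rho=8)
```

With these toy parameters the intervals next to ε = 0 have the minimal width, while the outermost intervals are twice as wide, which mirrors the finer discretization inside the feature region.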

DOS similarity metric
The similarity between two materials $i$ and $j$ in terms of their DOS fingerprints $f_i$ and $f_j$ is denoted by $S(f_i, f_j)$.
As similarity metric, we use the Tanimoto coefficient (Tc) [15], defined as:
$$S(f_i, f_j) = \frac{f_i \cdot f_j}{|f_i|^2 + |f_j|^2 - f_i \cdot f_j}. \tag{6}$$
$S(f_i, f_j)$ can be interpreted as the overlap of the areas covered by the raster images represented by $f_i$ and $f_j$, divided by the union of the areas. $S$ takes real values in the range [0, 1], being equal to 1 (0) if the images $f_i$ and $f_j$ are identical (have no overlap).
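For binary fingerprints, the Tanimoto coefficient reduces to counting shared and total filled pixels. The following Python sketch illustrates this; the function name `tanimoto` and the convention of returning 1 for two empty fingerprints are assumptions, not part of the original work.

```python
def tanimoto(f_i, f_j):
    """Tanimoto coefficient of two equal-length binary fingerprints.

    For 0/1 vectors, f_i . f_j counts pixels filled in both images and
    |f|^2 equals the number of filled pixels in one image.
    """
    if len(f_i) != len(f_j):
        raise ValueError("fingerprints must have the same length")
    both = sum(a & b for a, b in zip(f_i, f_j))
    union = sum(f_i) + sum(f_j) - both
    # Assumed convention: two empty images are treated as identical.
    return both / union if union else 1.0

s_partial = tanimoto([1, 1, 0, 0], [1, 0, 1, 0])    # one shared pixel, union of three
s_identical = tanimoto([1, 1, 0, 0], [1, 1, 0, 0])
s_disjoint = tanimoto([1, 1, 0, 0], [0, 0, 1, 1])
```

The three toy cases cover partial overlap, identical images, and images without overlap.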
As an example of this metric, Fig. 10 shows the DOS of four different materials from the C2DB and their respective similarities. In the considered energy interval, C$_2$ (graphene) has far fewer available states than the other examples. Mainly for this reason, it is dissimilar to all of them ($S \le 0.14$, see similarity matrix in Fig. 10). The DOS of MoS$_2$ is similar to that of FeO$_2$ in magnitude for $|\varepsilon| > 1$ eV, but since MoS$_2$ is a semiconductor and FeO$_2$ is a metal, the overall similarity is low ($S = 0.4$). MoS$_2$ and WMo$_3$S$_8$ exhibit a high similarity coefficient of $S = 0.84$, as both the shape and the magnitude are similar.
The electronic spectrum of a material can often be understood by counting the valence electrons in the outermost shells of its constituent atoms. This counting can, in principle, be obtained from the average of the column numbers of the atoms in the unit cell:
$$c_m = \frac{1}{N} \sum_{i=1}^{N} c_{im}, \tag{7}$$
where $i$ runs over all $N$ atoms in the unit cell of material $m$, and $c_{im}$ denotes their column in the PTE. $c_m$ is calculated for all materials in a cluster. If it is equal for all of them, we conclude that the cluster is formed by isoelectronic materials. Note that here we employ a lax definition of isoelectronicity that considers only electron counting but not electronic configuration. As an example, $c_m$ for two Si atoms is identical to that of two C atoms or the combination of one Al and one P atom. We call this descriptor the PTE descriptor.
The geometry of the crystal structures can also help to explain clusters obtained from the DOS similarity metric. Accordingly, we consider a similarity measure based on the space group (SG) of the crystal structures, after removing all information about the species that form the structure. In practice, this is achieved by first replacing all atoms by a single species and then employing the software package spglib [16] with a tolerance of symprec = $1\times10^{-1}$ to find the SG of the resulting geometry. In the main text we call this the SG descriptor.
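The PTE descriptor can be illustrated with a short Python sketch. The lookup table `PTE_COLUMN` is a hypothetical, hand-written subset of periodic-table columns, and the function names are likewise chosen for illustration; exact fractions are used so that the averages can be compared without rounding issues.

```python
from fractions import Fraction

# Hypothetical lookup of periodic-table columns (groups) for a few elements.
PTE_COLUMN = {"C": 14, "Si": 14, "Al": 13, "P": 15,
              "Ti": 4, "Zr": 4, "Hf": 4, "Se": 16}

def pte_descriptor(symbols):
    """Average column number c_m over all atoms in the unit cell."""
    return Fraction(sum(PTE_COLUMN[s] for s in symbols), len(symbols))

def is_isoelectronic(cluster):
    """True if all materials in the cluster share the same c_m."""
    return len({pte_descriptor(material) for material in cluster}) == 1

# The example from the text: Si2, C2, and AlP share the same c_m.
c_si2 = pte_descriptor(["Si", "Si"])
c_c2 = pte_descriptor(["C", "C"])
c_alp = pte_descriptor(["Al", "P"])

# Two members of the TMDC cluster discussed in the main text.
tise2 = ["Ti"] + ["Se"] * 2
se8ti3hf = ["Se"] * 8 + ["Ti"] * 3 + ["Hf"]
same_cm = is_isoelectronic([tise2, se8ti3hf])
```

Because the descriptor averages over all atoms, it does not depend on the size of the unit cell, so TiSe$_2$ and the larger Se$_8$Ti$_3$Hf cell can be compared directly.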

Clustering algorithm
A similarity metric allows for a range of practical applications, for instance, determining which materials from a data set are most similar to a given reference. The latter could be a material with a desired property for which one seeks alternatives. This kind of analysis is commonly applied in chemical similarity searching [15,17] or drug discovery [18]. A related application is the detection of (sub)sets of materials, i.e., clusters, that are more similar to one another than to other materials. In this work, we focus on the second case and develop a clustering algorithm that takes advantage of the following property of our similarity measure (Eq. 6): Its complement $1 - S$ is a distance measure that is identical to the Soergel distance for dichotomous fingerprints [15]. For binary-valued descriptors, it obeys the triangle inequality [15], i.e.,
$$S(f_i, f_k) + S(f_k, f_j) - 1 \le S(f_i, f_j) \tag{8}$$
for any three fingerprints $f_i$, $f_j$, and $f_k$. This can be easily verified with the examples shown in Fig. 10. An important consequence is that any two materials that are more similar to a third one than a threshold $S_\mathrm{thres}$ will be more similar than $2S_\mathrm{thres} - 1$ to each other. For example, if we choose $S_\mathrm{thres} = 0.75$, all materials $f_k$ within a cluster centered at the reference material $f_\mathrm{ref}$, i.e., with $S(f_k, f_\mathrm{ref}) \ge 0.75$, have $S \ge 0.5$ to all other cluster members. This motivates a simple clustering algorithm as follows: Start by i) making a list of the materials in the database. Then, ii) identify the material (reference) with the highest number of other materials that are more similar to it than a given threshold $S_\mathrm{thres}$. If no such materials can be found for any reference, stop the algorithm, as all possible clusters have been found. Otherwise, iii) consider the found reference and its similar materials as a cluster and extract them from the list; and return to step ii). The materials that do not belong to any cluster are considered orphans. When two materials have the same number of neighbors and share any of them, the cluster with the highest average similarity is selected.
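Steps i) to iii) can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' code: the function name `greedy_clusters`, the use of "greater than or equal" for the threshold, the toy similarity matrix, and the simplification that every tie in neighbor count is resolved by the higher mean similarity (regardless of shared neighbors) are assumptions.

```python
def greedy_clusters(similarity, s_thres):
    """Return (clusters, orphans); each cluster is (reference, [neighbors])."""
    remaining = set(range(len(similarity)))  # step i): list of materials
    clusters = []
    while remaining:
        best = None  # (n_neighbors, mean_similarity, reference, neighbors)
        for ref in sorted(remaining):  # step ii): find the best reference
            nbrs = [k for k in sorted(remaining)
                    if k != ref and similarity[ref][k] >= s_thres]
            if not nbrs:
                continue
            mean_s = sum(similarity[ref][k] for k in nbrs) / len(nbrs)
            cand = (len(nbrs), mean_s, ref, nbrs)
            # Most neighbors wins; ties go to the higher mean similarity.
            if best is None or cand[:2] > best[:2]:
                best = cand
        if best is None:  # no reference has neighbors left: stop
            break
        _, _, ref, nbrs = best
        clusters.append((ref, nbrs))  # step iii): extract the cluster
        remaining -= {ref, *nbrs}
    return clusters, sorted(remaining)

# Hypothetical similarity matrix for six materials.
S = [
    [1.00, 0.80, 0.78, 0.30, 0.20, 0.10],
    [0.80, 1.00, 0.60, 0.30, 0.20, 0.10],
    [0.78, 0.60, 1.00, 0.30, 0.20, 0.10],
    [0.30, 0.30, 0.30, 1.00, 0.76, 0.10],
    [0.20, 0.20, 0.20, 0.76, 1.00, 0.10],
    [0.10, 0.10, 0.10, 0.10, 0.10, 1.00],
]
clusters, orphans = greedy_clusters(S, s_thres=0.75)
```

In this toy example, materials 1 and 2 end up in the same cluster although their mutual similarity (0.60) is below the threshold, which is consistent with the lower bound of $2S_\mathrm{thres} - 1 = 0.5$ for members of one cluster.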
PBE parameterization. The data were generated in an automated manner using the Atomic Simulation Recipes (ASR) [24] package. Before using these data for our analysis, we preprocess them in the following way: First, we sum over all PDOS of a material in order to obtain the total DOS (TDOS). Then, we define the Fermi level, $E_F$, as the energy zero. The TDOS of every structure is then normalized with respect to the area of the unit cell spanned by the two periodic cell vectors. In this way, the results can be consistently compared across different geometries. For instance, supercells of the same material are considered identical. The resulting normalized TDOS are then employed to generate the fingerprints that encode the electronic structure. Analysis of atomic structures is performed using the Python package ASE [25]. Further plots are generated using matplotlib [26].
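The preprocessing steps can be summarized in a small Python sketch. The function names, the plain-list data layout, and the toy numbers are hypothetical; the sketch only illustrates why normalizing by the in-plane cell area makes a supercell equivalent to its primitive cell.

```python
def cell_area(a, b):
    """Area |a x b| spanned by the two periodic cell vectors (3 components each)."""
    cx = a[1] * b[2] - a[2] * b[1]
    cy = a[2] * b[0] - a[0] * b[2]
    cz = a[0] * b[1] - a[1] * b[0]
    return (cx * cx + cy * cy + cz * cz) ** 0.5

def shift_to_fermi(energies, e_fermi):
    """Shift the energy axis so that the Fermi level is the energy zero."""
    return [e - e_fermi for e in energies]

def normalized_tdos(pdos_list, a, b):
    """Sum all PDOS curves into the TDOS and divide by the cell area."""
    area = cell_area(a, b)
    return [sum(values) / area for values in zip(*pdos_list)]

# Hypothetical primitive cell with two PDOS curves on a two-point energy grid.
pdos_unit = [[1.0, 2.0], [0.5, 1.0]]
tdos_unit = normalized_tdos(pdos_unit, (3.0, 0.0, 0.0), (0.0, 3.0, 0.0))

# A 2x1 supercell has twice the area and twice as many atomic contributions.
pdos_super = pdos_unit + pdos_unit
tdos_super = normalized_tdos(pdos_super, (6.0, 0.0, 0.0), (0.0, 3.0, 0.0))

energies = shift_to_fermi([-1.5, 0.5], e_fermi=-0.5)
```

Both cells give the same normalized TDOS, so fingerprints generated from them would coincide.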

Figure 1: Distribution of cluster sizes (blue bars) and maximal (black line) and mean (red line) cluster radii of the DOS clusters in the data set using a similarity threshold of $S_\mathrm{thres} = 0.75$. The dashed line indicates the maximal possible cluster radius for this threshold. The bars in light blue indicate the clusters that are used to generate the similarity matrix in Fig. 2.

Figure 2: Similarity matrix for materials in clusters with more than six members.The red boxes indicate the clusters detected by our algorithm.

Figure 5: DOS (top) and atomic structures (bottom) of materials with isoelectronic surface groups. The Fermi level is located at E = 0 eV. The cluster center is indicated in bold face. For increased visibility, the unit cell is repeated in both in-plane directions.

Figure 6: DOS (top) and atomic structures (bottom) of In$_2$S$_2$. The Fermi level is located at E = 0 eV. The structures with SG 164-1 and SG 187 form a DOS cluster. The structure with SG 164-2 (right) has a dissimilar DOS and is not part of the cluster. The unit cells are repeated in both in-plane directions to increase visibility.

Figure 8: Band structures and projected DOS of Ta$_2$BS$_2$ (left) and Bi$_4$Cu$_4$ (right), indicating the atomic characters. The Fermi energy is located at E = 0.

Figure 9: Generation of DOS fingerprints (bottom panel) from the electronic DOS, $\rho(E)$ (top panel). The DOS of a material (top panel) is numerically integrated over small energy intervals $[\varepsilon_i, \varepsilon_i + \Delta\varepsilon_i)$ (Eq. 1). The histogram of states generated in this way (second panel) is subsequently discretized (third panel), resulting in an image (bottom panel) of the DOS. In this image, each dark (light) pixel corresponds to a 1 (0) in the fingerprint. In the third panel, only every fifth discretization step is shown. To increase visibility, we use $N_\rho = 30$, $\rho_\mathrm{min} = 0.075$, and $\rho_\mathrm{max} = 0.825$ for this figure. The other parameters are set as described in the Code Availability section.

Figure 10: Illustration of the similarity metric with the examples of four materials. The DOS of graphene (C$_2$), MoS$_2$, WMo$_3$S$_8$, and FeO$_2$ are presented on the left. The Fermi level is located at E = 0 eV. The similarity matrix (rows and columns in the same order, color-coded) is shown in the right panel.