Introduction

Materials science is entering the data age. This transition is spearheaded by projects such as the Materials Genome Initiative,1 the Novel Materials Discovery Laboratory2 and Marvel3 that combine high-throughput screening with data storage, systematic data curation and machine learning. Such projects produce computational materials databases that contain information extracted from atomistic simulations, e.g., system geometries, details of the applied theory, electronic structures, methodology and implementation. The number of such databases is increasing rapidly.4,5,6,7,8,9,10,11,12,13,14

A common problem in these databases is materials classification. Often database users would like to search for specific material types, specific functions or structural classes, such as crystals, molecules, surfaces or 2D materials. To facilitate such searches, the database entries should be tagged according to a classification system. Unfortunately, classifications are not always provided when the data are uploaded to the database, and when they are, they are often based on custom or unspecified definitions. To cope with large heterogeneous datasets from atomistic calculations, automated and verifiable methods for analyzing and categorizing atomistic structures have become a necessity.

Previous work on automated classification of atomistic structures has focused on very specific areas and often required an explicit structural search pattern. For example, defect identification and detection schemes based on neighbourhood analysis have been developed for crystals.15,16,17,18,19,20 In another example, a more automated workflow was applied to identify lower dimensional stable structures in crystals, such as layered solids.21 Conversely, many tools are available for the inverse problem of generating an atomistic representation from a given structure definition. Tool sets such as the atomic simulation environment (ase)22 and pymatgen23 include routines for automating tasks like creating a surface given a lattice, orientation and number of layers, generating crystal structures with desired symmetry properties or generating a system representing surface adsorption, given an adsorbate and an adsorbent.

In this work, we focus on structural classification and present a generic structure classification scheme that encompasses all possible structure types. We then introduce a materials structure genealogy presented as an intuitive and human-readable materials structure 'tree of life'. After this general introduction, we present an automatic and accurate classification scheme for two-dimensional structures, including surfaces and 2D materials, that requires no explicit search patterns. This classification process also returns the underlying unit cell and works even in the presence of defects, dislocations and additional atoms. Because the unit cell is identified, these structures can be meaningfully characterized and often linked to their bulk counterpart. We also present methods that can be used to accurately identify the outlier atoms that are not part of the underlying structure. The classification does not make assumptions about the used cell or the positioning of the structure within the cell.

The NOMAD Archive2 is used as a benchmark for testing the classification accuracy and the applicability of our method in a realistic database environment containing heterogeneous data. The classification tools are implemented as a Python library licensed under the open-source Apache 2 license,24 and the source code together with installation instructions can be found at https://github.com/SINGROUP/matid. This library is directly compatible with the popular atomic structure manipulation library ase.

Results

Topological classification

Often materials are classified based on a set of functional and/or chemical properties. This kind of classification can be based on a single property, such as the bandgap in the case of conductors, semiconductors and insulators, or it can be based on a collection of multiple functional and chemical properties like in the case of material types, such as ceramics, metals, polymers and composites.

In an analogous way, materials can be classified based on their structural properties. This is here referred to as topological classification. By topological classification we mean a classification that is based on structural features and properties that can be deduced from static atomistic geometries. This thus excludes structural phenomena that are dynamic in nature or depend on some external perturbation. Unlike functional or chemical properties, structural properties are often not so easily quantifiable. Some combinations of specific structural or electronic quantities have been shown to separate different structural types to provide 'structure maps'.25,26 Such maps are, however, too crude to distinguish the full variety of the structural space. The task of structural classification is for these reasons often left to humans. As the systematic exploration of different materials is increasing, human labelling becomes inefficient and error-prone. For this reason, it is important to develop systematic concepts and tools for this task.

The general rules surrounding structural classification have not been discussed systematically before in the literature, and to fill this gap we introduce a recursive scheme based on primary building blocks and their higher level combinations. Primary building blocks are the lowest-level structural building units. One can recursively combine these building blocks to create higher order combinations. An example of common primary building blocks and higher order combinations is given in Fig. 1. For example, a polymer can be combined with water molecules to form a polymer–water solution, which can again be combined with a surface to form the final system of a polymer–water solution on a surface. This systematically extendable classification scheme can be used to describe arbitrarily complex structures.
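The recursive scheme can be sketched as a small tree data structure. The following is only an illustrative sketch: the class name `Structure`, its fields and the relation labels are hypothetical and are not part of the published library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Structure:
    """A primary building block (no parts) or a higher order combination."""
    name: str
    parts: List["Structure"] = field(default_factory=list)
    relation: str = ""  # hypothetical label describing how the parts relate

    def primary_blocks(self) -> List[str]:
        """Recursively collect the names of the primary building blocks."""
        if not self.parts:
            return [self.name]
        blocks: List[str] = []
        for part in self.parts:
            blocks.extend(part.primary_blocks())
        return blocks


# The example from the text: polymer + water -> solution, solution + surface.
polymer = Structure("polymer")
water = Structure("water molecule")
surface = Structure("surface")
solution = Structure("polymer-water solution", [polymer, water], "solvated in")
system = Structure(
    "polymer-water solution on a surface", [solution, surface], "adsorbed on"
)

print(system.primary_blocks())
```

Because every combination is itself a `Structure`, arbitrarily deep combinations can be described without changing the data model.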

Fig. 1

Illustration of commonly used primary building blocks and their higher order combinations. The structures are here organized by dimensionality in the vertical direction and the cardinality of the used building blocks horizontally. The higher order combinations are specified by the involved building blocks and their relation to each other

Structures can be further grouped by the presence of certain building blocks and additional structural properties, such as dimensionality and long-range order. For example, a heterostructure that consists of multiple stacked two-dimensional building blocks can be categorized as a 2D or 3D heterostructure depending on whether there is a vacuum separating the periodic copies or not. This grouping forms structural families that can be better visualized in a tree structure. This materials genealogy is presented in Fig. 2.

Fig. 2

Illustration of our topological materials genealogy. The first classification level is based on the dimensionality of the structures, and further branches distinguish structural building blocks and other structural properties, like the presence of long-range order. The branches have been labelled with more common expressions for certain structure types. This map is not exhaustive, but illustrates the concept and includes many of the most commonly encountered structural categories

Given an arbitrary atomic structure, an ideal classifier would detect the different building blocks, their interrelation and other structural features and be able to place the structure on the correct branch in the materials genealogy. The classifier should have the following generic properties:

Accuracy—The classifier should be able to perform accurate classification on a wide range of samples that have not been explicitly tested.

Invariance—The classifier should be able to correctly classify structures independent of the chosen unit cell, the basis vector directions and the relative position of the structure with respect to the unit cell.

Robustness—Small displacements of atoms from their expected positions should not disrupt the classification.

Efficiency—The time taken for the classification should be small enough that processing of a large data set is possible in a reasonable time.

It is worth noting that some of these properties are not independent. For example, reaching good accuracy for a wide range of structures may come at the cost of computational efficiency.

Classification for surfaces and 2D materials

Here we demonstrate the application of a topological classifier for identifying structures in the two-dimensional branch of the materials genealogy. Figure 3 shows a high-level overview of this classification procedure, and the reader is referred to the methods section for an in-depth discussion of the steps shown in the figure.

Fig. 3

Overview of the classification procedure for two-dimensional structures. a The classification starts with detecting the dimensionality of the system. b If a two-dimensional structure is detected, the classification proceeds to find a unit cell that best describes the structure. c This unit cell is then used to track all atoms that belong to the material. d In the final step, the outlier atoms are identified

We apply the classification routine to atomic structures from the NOMAD Archive.2 We chose the NOMAD Archive because it contains atomic structures and calculation results from a variety of electronic structure codes. The structures we used in this work originate from density functional theory simulations and were specifically calculated either with the electronic structure program Exciting27 or FHI-aims.28,29,30 Both these codes have a significant number of calculations present in the database for two-dimensional systems, but the classification is generally applicable.

To cover a wide range of different structures, a cross-validation of the parameters was performed for each structure individually as described in section Parameters and generalizability. For each atomic element, a representative atom nearest to the centre of mass was tested as a seed atom. The values \(0.25\,d_{\mathrm{min}}\) and \(0.75\,d_{\mathrm{min}}\) were tested for the position tolerance \(\Delta r\). Here \(d_{\mathrm{min}}\) is the minimum distance between two atoms in the structure. The value 12 Å was used for the maximum cell size \(r_{\mathrm{max}}\).

From an original data set of 394,332 structures, 35 were recognized as zero-dimensional, 13 as one-dimensional, 14,959 as two-dimensional, 376,756 as three-dimensional and 2569 could not be classified due to multiple disconnected components as described in section Dimensionality detection. For each two-dimensional structure, we ran the full classification procedure including the cell and outlier detection. The classified structures were then grouped by the chemical formula of the possible outliers and the chemical formula of the rest of the system. For each unique pair of these two chemical formulas, a representative sample was chosen, resulting in 192 unique structures. These structures were visually inspected to assess the correctness of the classification: the material type, the found cells and the detected outliers were checked. A breakdown of the classification results is shown in Fig. 4.

Fig. 4

Distribution of the classification results for 192 visually inspected two-dimensional systems from the NOMAD Archive

In the visual inspection, we follow the same material definitions as used in the algorithm: surfaces should have at least two repetitions of the unit cell in three linearly independent directions, and 2D materials should have at least two repetitions in two linearly independent directions restricted by a maximum thickness \(h_{{\mathrm{max}}}^{{\mathrm{2D}}}\). Seven structures were found to be unclassifiable into surfaces or 2D materials both by the algorithm and by visual inspection. From the remaining set, 183/185 ≈ 99% were correctly classified into a surface or 2D material and had a correct cell and correct outliers. Two structures were misclassified as false positives by the algorithm. In one, an incorrect cell and thus a wrong outlier was detected, and in the other, not all of the correct outliers were identified. Examples of the correctly classified, incorrectly classified and unknown structures are given in Figs. 5 and 6.
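The reported counts can be cross-checked with a few lines of arithmetic. The numbers below are taken directly from the text; the variable names are only illustrative.

```python
# Counts reported for the dimensionality detection step.
counts = {"0D": 35, "1D": 13, "2D": 14959, "3D": 376756, "unclassified": 2569}
total = sum(counts.values())
print(total)

# Visual inspection of the representative two-dimensional structures.
inspected = 192
unknown = 7      # unclassifiable by both the algorithm and visual inspection
correct = 183    # correct material type, cell and outliers
classifiable = inspected - unknown
accuracy = correct / classifiable
print(classifiable, round(100 * accuracy, 1))
```

The dimensionality counts add up to the size of the original data set, and the accuracy is computed over the 185 structures that remain after the seven unknown ones are set aside.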

Fig. 5

Examples of correctly classified structures. The top right corner in each image shows a primitive cell corresponding to the unit cell identified by the algorithm. All detected outlier atoms are highlighted in blue. a Pristine 2D material. b 2D material with substitution. c 2D material with adsorbate. d Pristine surface. e Surface with multiple adsorbates. f Stepped surface with a substitution and adsorbate. g Surface with two different terminations. h Chemisorption. i Surface with a reconstruction

Fig. 6

Structures that were misclassified or labelled unknown. a Wrong cell and a wrong outlier detected. b Not all outliers detected. c–i Unknown classification

Discussion

When a cross-validation of the parameters is performed for each structure individually, a very good accuracy for both the cell identification and outlier identification is achieved. The found structures cover many different cell geometries and types of structures, include optimized structures with deviations from perfect lattice symmetry, and we were able to analyze large quantities of structures efficiently.

The structure in Fig. 6a was misclassified because a wrong unit cell was selected and thus also a wrong outlier was detected. Further analysis reveals that the unit cell is missing one carbon atom because it is rejected from the unit cell due to its too low average degree 〈deg(v)〉 in the corresponding connected component. One can decrease the threshold value \(n_{{\mathrm{min}}}^{\mathrm{e}}\) to get this atom included in the cell, but this will lead to problems in distinguishing real outliers. This issue could be avoided by requiring more repetitions of the cell to gain better statistics about the cell contents, but that will prevent the classification of many small structures.

Our chemical similarity measure proved to be highly accurate in detecting outliers. The data set contained multiple systems, one example shown in Fig. 5e, where adsorbates would not have been correctly detected if the chemical environment had not been checked. Only in the system shown in Fig. 6b was the carbon atom of the carbon dioxide adsorbate not fully detected. The carbon in question is directly on top of an expected location for a carbon in the lattice and is highly connected to the surface itself. Tweaking the similarity threshold ΔC or using a more sophisticated chemical environment measure, such as SOAP34 or ACSF,35 might resolve the problem partially. It is, however, likely that perfect outlier detection cannot be reached without explicit information of the expected outlier structures.

The structures in Fig. 6c, d represent surface-like structures that do not meet our classification criteria. Both are thicker than the height threshold \(h_{{\mathrm{max}}}^{{\mathrm{2D}}}\) set for 2D materials and do not have two full repetitions of a unit cell in three linearly independent directions.

Figure 6e–g shows two-dimensional structures that comprise multiple networked molecules. These molecular networks fail the cell validity checks, as mentioned in section Cell validation, because they have a sparse unit cell. The classification of such sparse structures depends on the threshold rbond, and depending on its value they can be regarded as 2D materials if other criteria are fulfilled. For now these molecular networks are given a generic two-dimensional classification both by the algorithm and by visual inspection. Figure 6h, i shows two additional structures that could not be classified by the algorithm or by visual inspection because their unit cell is too sparse.

The centre of mass proves to be a good starting point for selecting the seed atoms, and by default the algorithm automatically tries multiple seed atoms with different atomic elements near the centre of mass. It is important to bear in mind that the choice of the seed atom is critical to the success of the algorithm, and generally multiple seed atoms should be tested. For example, if the seed atom is chosen to be a substitutional atom in a lattice, the algorithm will stop almost immediately without finding the correct unit cell. Also, if the seed atom is chosen to be located within a region with multiple substitutions or dislocations, the correct unit cell cannot be obtained.

To robustly detect the most common material classes, the classification procedure considers displacements that are below the user-specified threshold as noise. Some material characteristics may, however, arise from systematic displacements that are smaller than this displacement threshold. The implementation can be extended in the future to keep track of the displacement vectors for individual atoms, or to track distortions in the shapes of individual unit cells. This would allow building vector maps where different displacement patterns could be used to classify structures in more detail.

To reduce the need for manual parameter tuning, a wide range of parameter combinations are tested by default, and the best parameters are chosen by minimizing the number of found outlier atoms with some additional restrictions. So far the use of a systematic grid-search has been enough to find a good parameter combination, but in cases where a large parameter space is required, this optimization problem could be more efficiently solved with probabilistic optimization methods, such as Bayesian optimization.
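The grid search described above can be sketched as follows. This is a minimal sketch, not the library's implementation: `count_outliers` is a hypothetical callback that stands in for one full classification run, and returning `None` is an assumed way of modelling a parameter combination that violates the additional restrictions.

```python
from itertools import product
from typing import Callable, Iterable, Optional, Tuple


def grid_search(
    seeds: Iterable[int],
    tolerances: Iterable[float],
    count_outliers: Callable[[int, float], Optional[int]],
) -> Optional[Tuple[int, float, int]]:
    """Return (seed, tolerance, n_outliers) with the fewest outliers.

    count_outliers is a hypothetical callback: it runs the classification
    for one parameter combination and returns the number of outliers, or
    None when the combination is rejected (e.g. no valid cell was found).
    """
    best = None
    for seed, tol in product(seeds, tolerances):
        n_out = count_outliers(seed, tol)
        if n_out is None:
            continue
        if best is None or n_out < best[2]:
            best = (seed, tol, n_out)
    return best


# Toy stand-in for the real classification results (made-up numbers).
d_min = 1.4
tolerances = [0.25 * d_min, 0.75 * d_min]
fake_results = {
    (0, tolerances[0]): 3,
    (0, tolerances[1]): 1,
    (5, tolerances[0]): None,
    (5, tolerances[1]): 2,
}
best = grid_search([0, 5], tolerances, lambda s, t: fake_results[(s, t)])
print(best)
```

An exhaustive loop like this scales with the product of the grid sizes, which is why a probabilistic optimizer becomes attractive for larger parameter spaces.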

Providing automatic and tractable topological classification for large quantities of atomic structures has become a challenge, as the number of new materials databases and their data volume are steadily increasing. We have introduced a general and systematic approach for the topological classification of atomistic structures and have developed an automatic, robust and accurate way for identifying two-dimensional structures, detecting the underlying unit cell and outlier atoms in them. The method has been validated on realistic data taken from a database of DFT calculations performed with different programs and computational methods and containing a wide variety of structures. Our implementation can be easily integrated into any existing database that provides atomistic geometries.

One important remaining task in the two-dimensional branch is the further categorization of outlier atoms into more specific groups, such as adsorbates, substitutions or interstitials. This would provide one more level of detail when classifying structures according to the structural genealogy. Also, extending the automatic classification for other branches in the materials genealogy is a major remaining challenge. One possibility is to apply the concepts introduced for the identification of unit cells in classifying multi-component systems such as heterostructures or crystals with defects.

An interesting alternative for structural classification is using machine learning with pre-labelled data as training material. The level of detail our method is able to achieve, including the identification of the unit cell and outlier atoms, might be hard to match by a machine learning approach. However, a supervised learning technique could be envisioned for the classification of different structural families of the materials genealogy, which does not require the detailed knowledge of the unit cell or the exact identification of outliers. The introduced methods also offer a more automated approach for linking material properties to structural features by correlating structural building blocks and structural features with materials phenomena. By using automated tools, the user does not require extensive knowledge of the original simulation setup to investigate how different physical properties, such as adsorption energies or electronic features, relate to different structural properties, such as the location and type of adsorbate, the unit cell or different defects.

Methods

Here we introduce methods for the automatic classification of two-dimensional structures especially focusing on surfaces and 2D materials, with or without outlier atoms, such as adsorbates.

Dimensionality detection

Dimensionality detection is in principle easy, as a zero-dimensional system has no periodicity, a one-dimensional system is periodic in only one direction and so forth. The dimensionality is thus given by the number of lattice vectors with periodic boundary conditions. However, in many electronic structure theory codes, it is common to represent even non-periodic structures such as molecules, clusters, nanowires or surfaces with periodic boundary conditions, in particular if plane waves are used as a basis set. For this reason, we can no longer use the number of periodic directions as a criterion. Instead we must analyze the extent of vacuum that is added to the simulation cell to decouple the original structure from its periodic images in directions that are not meant to be periodic (see Fig. 7 for examples).

Fig. 7

Illustration of dimensionality detection. The images show a 2 × 2 × 2 supercell for a system with three periodic directions (\(n_{{\mathrm{pbc}}} = 3\)) where atoms in the original simulation cell are highlighted in blue. Analyzing how the number of spatially separated clusters changes when going from the original cell to the supercell reveals that zero-dimensional structures will have eight separate clusters, one-dimensional structures four, two-dimensional structures two and three-dimensional structures only one cluster. The dimensionality depends on this scaling and is given by \(D = n_{{\mathrm{pbc}}} - {\mathrm{log}}_n\left( {N_n} \right)\), where \(n\) is the number of repetitions for the system, \(N_n\) is the number of clusters in the supercell and \(n_{{\mathrm{pbc}}}\) is the number of periodic directions

To determine the dimensionality of a system, we use a modified version of the topological scaling algorithm (TSA).21 The basic idea behind the algorithm is illustrated in Fig. 7, and it is based on analyzing the size scaling of atomic clusters when going from the original system to a bigger supercell of the same system. With TSA, the dimensionality D is given by

$$D = \left\{ {\begin{array}{*{20}{l}} {n_{{\mathrm{pbc}}} - {\mathrm{log}}_n\left( {N_n} \right),{\mathrm{when}}\, n_{{\mathrm{pbc}}}\, \ne 0} \hfill \cr {0,{\mathrm{when}}\, n_{{\mathrm{pbc}}} = 0} \hfill \end{array}} \right.,$$
(1)

where \(N_n\) is the number of clusters in a supercell that is repeated \(n\) times in each periodic direction and \(n_{{\mathrm{pbc}}}\) is the number of periodic dimensions.
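Equation (1) can be written as a short function. This is a minimal sketch: the function name is hypothetical, and rounding the logarithm to the nearest integer is an assumption made here to absorb floating-point error, which presumes a single cluster in the original cell.

```python
import math


def tsa_dimensionality(n_pbc: int, n_rep: int, n_clusters: int) -> int:
    """Dimensionality D from Eq. (1).

    n_pbc: number of periodic directions of the simulation cell
    n_rep: repetitions n of the cell in each periodic direction (n >= 2)
    n_clusters: number of clusters N_n found in the supercell
    """
    if n_pbc == 0:
        return 0
    # Assumption: the logarithm is an integer up to floating-point error.
    return n_pbc - round(math.log(n_clusters, n_rep))


# The 2 x 2 x 2 supercell cases for a cell with three periodic directions.
for n_clusters in (8, 4, 2, 1):
    print(n_clusters, tsa_dimensionality(3, 2, n_clusters))
```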

For the clustering, we use the density-based spatial clustering of applications with noise (DBSCAN)31 data clustering algorithm. The advantage of this algorithm is that it does not require an initial guess for the number of clusters, and it can find arbitrarily shaped clusters. The clustering requires that we define a metric for the distance between the atoms. We use the following metric:

$$d_{ij} = \left| {{\bf{r}}_i - {\bf{r}}_j} \right|^{{\mathrm{MIC}}} - r_i^{{\mathrm{cov}}} - r_j^{{\mathrm{cov}}},$$
(2)

where \({\bf{r}}_i\) and \({\bf{r}}_j\) are the Cartesian positions of atoms i and j, respectively, and \(r_i^{{\mathrm{cov}}}\) and \(r_j^{{\mathrm{cov}}}\) are their covalent radii.32 It is important to notice that in this metric the distances always follow the minimum image convention (MIC), i.e., the distance is calculated between the two closest periodic neighbours. By using the distance to the closest periodic neighbour, we obtain the correct clusters regardless of what shape of cell is used in the original simulation.
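The metric of Eq. (2) can be illustrated with a brute-force minimum image search. This sketch makes several assumptions that are not taken from the source: all three cell directions are periodic, positions are wrapped into the cell, the cell is not strongly skewed (so that the 27 nearest images suffice), and the covalent radius of 0.76 Å is only an illustrative value.

```python
import math
from itertools import product
from typing import Sequence

Vec = Sequence[float]


def mic_distance(r_i: Vec, r_j: Vec, cell: Sequence[Vec]) -> float:
    """Shortest distance between atom i and any periodic image of atom j.

    Brute-force search over the 27 neighbouring images (an assumption that
    holds for wrapped positions in a cell that is not strongly skewed).
    """
    best = math.inf
    for shifts in product((-1, 0, 1), repeat=3):
        d = [
            r_j[k] - r_i[k] + sum(s * cell[a][k] for a, s in enumerate(shifts))
            for k in range(3)
        ]
        best = min(best, math.sqrt(sum(x * x for x in d)))
    return best


def metric(r_i: Vec, r_j: Vec, cell: Sequence[Vec], rcov_i: float, rcov_j: float) -> float:
    """Eq. (2): MIC distance minus the two covalent radii."""
    return mic_distance(r_i, r_j, cell) - rcov_i - rcov_j


cell = [(10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 10.0)]
# Two atoms near opposite faces of the cell are close through the boundary.
d = metric((0.5, 5.0, 5.0), (9.5, 5.0, 5.0), cell, 0.76, 0.76)
print(round(d, 2))
```

A negative value means that the covalent spheres of the two atoms overlap, so the atoms end up in the same cluster for any positive neighbourhood radius.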

The clustering uses two parameters: the minimum cluster size \(n_{\mathrm{min}}\) and the neighbourhood radius \(\epsilon\). We set \(n_{\mathrm{min}}\) to 1 to allow clusters consisting of even single atoms and \(\epsilon\) to 3.5 Å. At present, a system that has more than one cluster in the original non-repeated cell (\(N_1 > 1\)) is classified as unknown. Such a case corresponds to systems with multiple components that are spatially separated, such as a molecule far above a surface, low density gases, widely spaced clusters in vacuum, etc.
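With a minimum cluster size of 1 every atom is a core point, so the DBSCAN clusters coincide with the connected components of the graph that links atoms closer than the neighbourhood radius. The sketch below uses that equivalence instead of calling a DBSCAN implementation; the function name and the toy distance matrix are illustrative assumptions.

```python
from typing import List, Sequence


def count_clusters(dist: Sequence[Sequence[float]], eps: float = 3.5) -> int:
    """Number of clusters given a pairwise distance matrix.

    With a minimum cluster size of 1, the clusters are the connected
    components of the graph that links atoms with dist[i][j] <= eps.
    """
    n = len(dist)
    labels: List[int] = [-1] * n
    n_clusters = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Flood-fill a new cluster starting from an unlabelled atom.
        labels[start] = n_clusters
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and dist[i][j] <= eps:
                    labels[j] = n_clusters
                    stack.append(j)
        n_clusters += 1
    return n_clusters


# Three atoms: the first two are bonded, the third is about 8 Å from both.
dist = [
    [0.0, 0.2, 8.0],
    [0.2, 0.0, 8.1],
    [8.0, 8.1, 0.0],
]
print(count_clusters(dist))
```

In this toy case the non-repeated system already contains two clusters, so it would be labelled unknown under the rule stated above.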

Determining the unit cell basis vectors

After we have detected a structure as being two-dimensional, we will try to find an underlying unit cell. The shape of this unit cell, its chemical elements and atomic positions identify a specific material and can be used to identify the bulk material that the structure originates from. By our definition, both surfaces and 2D materials should consist of multiple repetitions of this unit cell. In our classification scheme, surfaces have a unit cell that is repeated at least twice in three linearly independent directions. There are no limitations to the thickness of a surface. Similarly, 2D materials consist of a unit cell that is repeated at least twice in two linearly independent directions, but they have a controllable maximum thickness \(h_{{\mathrm{max}}}^{{\mathrm{2D}}}\). According to this definition, there is no such thing as a single-layer surface, but even very thin structures with two layers will be labelled as surfaces.

Traditional, symmetry-based unit cell reduction algorithms like the Niggli reduction33 cannot be used to find a repeating primitive cell when the system contains symmetry-breaking vacuum, vacancies or atoms from another structural component, such as an adsorbate. Here we introduce a more general cell identification algorithm for detecting a unit cell and the structure spanned by it in complex environments.

Cell identification starts by determining the basis vectors of the unit cell. The search for the basis vectors starts by selecting a seed atom that serves as a starting point for finding a unit cell. There are multiple approaches for selecting this seed atom, and depending on the expected complexity of the analyzed structures, one may need to use several spatially distributed seed points. A good initial guess is to use atoms that are close to the centre of mass of the system.

When a seed atom has been chosen, we collect all neighbouring atoms within a radius rmax. From this set, we then choose only those atoms that belong to the same species as the seed atom. The displacement vectors from the seed atom to these atoms form the first possible set of unit cell basis vectors, va. This process is illustrated in Fig. 8. All of the basis vectors of the original simulation cell that are shorter than the defined maximum cell size rmax are automatically included as possible bases.
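The collection of candidate basis vectors can be sketched in a few lines. This is a simplified illustration with hypothetical names: periodic images and the inclusion of the simulation cell vectors are left out, and the coordinates are made up.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def candidate_vectors(
    positions: Sequence[Vec], species: Sequence[str], seed: int, r_max: float
) -> List[Vec]:
    """Displacements from the seed atom to same-species atoms within r_max.

    Simplification: periodic images and the simulation cell vectors
    are ignored in this sketch.
    """
    sx, sy, sz = positions[seed]
    vectors = []
    for i, (x, y, z) in enumerate(positions):
        if i == seed or species[i] != species[seed]:
            continue
        v = (x - sx, y - sy, z - sz)
        if math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2) <= r_max:
            vectors.append(v)
    return vectors


positions = [
    (0.0, 0.0, 0.0),   # seed atom
    (2.5, 0.0, 0.0),
    (0.0, 2.5, 0.0),
    (1.2, 1.2, 1.5),   # different species, skipped
    (20.0, 0.0, 0.0),  # same species but beyond r_max
]
species = ["C", "C", "C", "O", "C"]
v_a = candidate_vectors(positions, species, seed=0, r_max=12.0)
print(v_a)
```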

Fig. 8

Illustration of finding the candidate set va of basis vectors and filtering this set into vb based on the connectivity of atoms corresponding to these vectors. The neighbourhood of the seed atom is searched within a radius rmax for atoms of the same species as the seed atom. The vectors connecting the seed atom to such atoms form a candidate set va of basis vectors for the unit cell. For each of these vectors v, a graph Gv is created. The set vb is then formed by choosing vectors with enough connections in the graph as illustrated for the vectors v4 and v7

For each vector v in the set va, we form a graph Gv of atoms that are connected by the periodicity defined by the vector. The graph is formed by first including all the atoms within rmax from the seed atom as nodes in the network. Next, for each atom in the graph, we look for an atom with the same element in the two directions ±v. If such an atom exists, it is added to the graph if not already present, and the two atoms are marked as being connected by an edge. The directionality of the edges is taken into account, meaning that an edge E(i, j) is distinguished from E(j, i). The search also extends beyond the periodic boundary conditions, and it is possible that an atom is connected to a periodic copy of itself in a neighbouring cell.

The possible basis vectors va are next filtered by checking the number of edges corresponding to the vector v in the graph Gv. If the number of edges is less than the tunable parameter \(n_{{\mathrm{min}}}^{\mathrm{b}}\), v is rejected, as there are not enough repetitions of atoms corresponding to that vector near the seed atom. The default value of \(n_{{\mathrm{min}}}^{\mathrm{b}}\) is given in Table 1. The remaining set of vectors is labelled as vb. This filtering prevents the usage of candidate vectors that result only from the presence of defects or adsorbates, making the search robust against such outlier atoms. An illustration of the connections in a valid and an invalid graph Gv is shown in Fig. 8.
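The edge counting behind this filter can be sketched as follows. This is not the library's implementation: periodic images are ignored, all atoms are treated as graph nodes, and the tolerance and the threshold value of 4 are illustrative assumptions rather than the defaults of Table 1.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]


def count_edges(
    positions: Sequence[Vec], species: Sequence[str], v: Vec, tol: float
) -> int:
    """Count directed edges i -> j where atom j has the same element as
    atom i and lies within tol of r_i + v or r_i - v."""
    n_edges = 0
    for i, r_i in enumerate(positions):
        for sign in (1.0, -1.0):
            target = tuple(r_i[k] + sign * v[k] for k in range(3))
            for j, r_j in enumerate(positions):
                same = species[j] == species[i]
                if j != i and same and math.dist(target, r_j) <= tol:
                    n_edges += 1
                    break
    return n_edges


# A row of four carbon atoms plus one carbon adsorbate above the row.
positions = [
    (0.0, 0.0, 0.0), (2.5, 0.0, 0.0), (5.0, 0.0, 0.0), (7.5, 0.0, 0.0),
    (1.0, 0.0, 2.0),
]
species = ["C"] * 5
n_min_b = 4  # hypothetical threshold, for illustration only

v_lattice = (2.5, 0.0, 0.0)    # true repetition of the row
v_adsorbate = (1.0, 0.0, 2.0)  # vector from the first atom to the adsorbate
for v in (v_lattice, v_adsorbate):
    n = count_edges(positions, species, v, tol=0.5)
    print(v, n, "kept" if n >= n_min_b else "rejected")
```

The vector pointing to the adsorbate produces only the two edges between the seed and the adsorbate, so it falls below the threshold, whereas the lattice vector is supported by every atom in the row.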

Table 1 Summary of parameters used in the classification

Next, we determine how many linearly independent vectors there are in the set vb, that is, we calculate the dimension of the space spanned by this set. Exactly linearly dependent vectors could be identified by storing them as rows of a matrix and reducing this matrix to row echelon form. Because the atomic structures are rarely perfectly aligned, the vectors will in reality be parallel only up to some threshold. For this reason, we use angles to measure the linear dependence of vectors and specify an angle threshold αpar to identify the parallel ones. To do this, we calculate the following variables

$${\bf{n}}_{ij} = \widehat {\bf{v}}_i \times \widehat {\bf{v}}_j$$
(3)
$$\alpha _{ijk} = \left| {\widehat {\bf{v}}_i \cdot \widehat {\bf{n}}_{jk}} \right|$$
(4)

Vector \({\bf{n}}_{ij}\) is a normal vector for the plane defined by vectors \(\widehat {\bf{v}}_i\) and \(\widehat {\bf{v}}_j\), and its length is the sine of the angle between these two vectors. Variable \(\alpha _{ijk}\) is equal to the sine of the angle between the vector \(\widehat {\bf{v}}_i\) and the plane defined by vectors \(\widehat {\bf{v}}_j\) and \(\widehat {\bf{v}}_k\). The values \(\alpha _{ijk}\), \(\alpha _{jki}\) and \(\alpha _{kij}\) are calculated for all combinations of vector triplets \({\bf{v}}_i\), \({\bf{v}}_j\) and \({\bf{v}}_k\) in the set vb. First, we find all triplets for which \(\alpha _{ijk}\), \(\alpha _{jki}\) and \(\alpha _{kij}\) are all \(\ge \sin (\alpha _{{\mathrm{par}}})\). If at least one such triplet is found, it represents a unit cell with repetitions in three directions, i.e., a surface. If no such triplet is found, we instead find all pairs \({\bf{v}}_i\), \({\bf{v}}_j\) for which \(\left| {{\bf{n}}_{ij}} \right| \ge \sin (\alpha _{{\mathrm{par}}})\). If at least one such pair is found, then it represents a unit cell with repetitions in two directions, i.e., a 2D material.
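The triplet and pair tests built on Eqs. (3) and (4) can be sketched with plain vector arithmetic. The function names are hypothetical, and the angle threshold of 20° is an illustrative assumption, not a value taken from Table 1.

```python
import math
from itertools import combinations


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)


def alpha(v_i, v_j, v_k):
    """Eq. (4): sine of the angle between v_i and the plane of v_j, v_k."""
    n_jk = cross(unit(v_j), unit(v_k))  # Eq. (3)
    length = math.sqrt(dot(n_jk, n_jk))
    if length == 0.0:
        return 0.0  # v_j and v_k are exactly parallel: no plane is defined
    return abs(dot(unit(v_i), n_jk)) / length


def count_independent_directions(vectors, alpha_par_deg):
    """Return 3 (surface-like), 2 (2D-material-like), or fewer."""
    threshold = math.sin(math.radians(alpha_par_deg))
    for a, b, c in combinations(vectors, 3):
        if min(alpha(a, b, c), alpha(b, c, a), alpha(c, a, b)) >= threshold:
            return 3
    for a, b in combinations(vectors, 2):
        n_ab = cross(unit(a), unit(b))
        if math.sqrt(dot(n_ab, n_ab)) >= threshold:
            return 2
    return 1 if vectors else 0


planar = [(2.5, 0.0, 0.0), (0.0, 2.5, 0.0), (5.0, 0.1, 0.0)]
print(count_independent_directions(planar, 20.0))
print(count_independent_directions(planar + [(0.0, 0.0, 3.0)], 20.0))
```

Requiring all three cyclic values to pass the threshold guards against triplets in which two vectors are nearly parallel to each other.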

If more than one combination of valid vectors is found, a graph GΣ is formed for each of them by combining the individual graphs Gv that correspond to vectors in the combination. The combinations are filtered so that only the ones which have a number of edges e in the graph GΣ equal to the maximum found value of e are kept. This filtering is done to avoid choosing basis vector combinations that correspond to symmetric substructures inside the correct unit cell.

The remaining cells are all valid and equal in terms of the repetitions of the seed atom. From this remaining set, we then select cells with size similar to the smallest found size and then choose the combination with the highest orthogonality between the basis vectors. These properties are desirable when the cell is used to track the structure as explained in section Tracking. With small cells, the tracking can better adapt to local changes, and orthogonality makes it easier to find the atoms that belong to a cell. Cell volumes V or areas A are readily available from the already calculated quantities \(\alpha _{ijk}\) and \({\bf{n}}_{ij}\):

$$A = \left| {{\bf{v}}_i \times {\bf{v}}_j} \right| = \left| {{\bf{v}}_i} \right|\left| {{\bf{v}}_j} \right|\left| {{\bf{n}}_{ij}} \right|$$
(5)
$$V = \left| {{\bf{v}}_i \cdot \left( {{\bf{v}}_j \times {\bf{v}}_k} \right)} \right| = \left| {{\bf{v}}_i} \right|\left| {{\bf{v}}_j} \right|\left| {{\bf{v}}_k} \right|\left| {{\bf{n}}_{jk}} \right|\alpha _{ijk}$$
(6)

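The orthogonality between the vectors is measured by the sum of the squared dot products between the normalized vectors, which is zero for mutually orthogonal vectors and grows as the vectors become more parallel. By using the identity \(\left| {{\bf{a}} \cdot {\bf{b}}} \right|^2 = \left| {\bf{a}} \right|^2\left| {\bf{b}} \right|^2 - \left| {{\bf{a}} \times {\bf{b}}} \right|^2\), we can express this sum in terms of the already calculated normal vectors: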

$$\left| {\widehat {\bf{v}}_i \cdot \widehat {\bf{v}}_j} \right|^2 + \left| {\widehat {\bf{v}}_j \cdot \widehat {\bf{v}}_k} \right|^2 + \left| {\widehat {\bf{v}}_k \cdot \widehat {\bf{v}}_i} \right|^2$$
(7)
$$= 3 - \left| {{\bf{n}}_{ij}} \right|^2 - \left| {{\bf{n}}_{jk}} \right|^2 - \left| {{\bf{n}}_{ki}} \right|^2$$
(8)
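The size and orthogonality criteria of Eqs. (5) to (8) can be combined into a small selection routine. The sketch below is only illustrative: the source does not quantify "similar to the smallest found size", so the relative tolerance rel_size_tol is a hypothetical parameter, and the function names are not taken from any existing code.

```python
import itertools
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def norm(a):
    return math.sqrt(dot(a, a))


def unit(a):
    length = norm(a)
    return tuple(x / length for x in a)


def cell_size(vectors):
    """Area for two basis vectors (Eq. 5), volume for three (Eq. 6)."""
    if len(vectors) == 2:
        return norm(cross(vectors[0], vectors[1]))
    return abs(dot(vectors[0], cross(vectors[1], vectors[2])))


def orthogonality_penalty(vectors):
    """Sum of squared dot products of the unit vectors (Eq. 7).

    Zero for mutually orthogonal vectors, larger for skewed cells.
    """
    units = [unit(v) for v in vectors]
    return sum(dot(a, b) ** 2 for a, b in itertools.combinations(units, 2))


def choose_cell(candidates, rel_size_tol):
    """Among cells close to the smallest size, pick the most orthogonal."""
    sizes = [cell_size(c) for c in candidates]
    limit = min(sizes) * (1.0 + rel_size_tol)
    similar = [c for c, s in zip(candidates, sizes) if s <= limit]
    return min(similar, key=orthogonality_penalty)
```

For two cells of equal area, such as a square cell and a skewed cell spanning the same lattice, the routine returns the square one.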

Determining unit cell contents

The unit cell shape is determined by the unit cell vectors, but the atoms belonging to the unit cell and their positions still need to be determined. Not every atom within the radius rmax has to belong to the unit cell, as there might be adsorbates or defects within the neighbourhood. To correctly identify atoms within the cell, the graph GΣ, corresponding to the best basis vectors, is divided into its connected components, i.e., subgraphs that are connected internally, but not to each other. Figure 9 illustrates the separation into these connected components. Atoms that are not part of the cell can now be filtered by discarding subgraphs that have too few edges or nodes. This makes the detection robust against any possible outlier atoms that are not part of the correct unit cell. Filtering can be done by discarding the subgraphs where \(\left\langle {{\mathrm{deg}}(v)} \right\rangle \le n_{{\mathrm{min}}}^{\mathrm{e}}\) and \(n \le n_{{\mathrm{min}}}^{\mathrm{v}}\), where 〈deg(v)〉 is the average degree of the subgraph nodes that are within the radius rmax and n is the number of nodes in the subgraph. The default values of \(n_{{\mathrm{min}}}^{\mathrm{e}}\) and \(n_{{\mathrm{min}}}^{\mathrm{v}}\) can be seen in Table 1.
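The component filtering described above can be sketched with a plain breadth-first traversal. The example makes two simplifying assumptions that should be read as such: the average degree is taken over all nodes of a component (not only those within rmax), and the thresholds are passed in explicitly instead of using the defaults of Table 1.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Split an undirected graph into its connected components."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components, adjacency


def valid_components(nodes, edges, n_min_e, n_min_v):
    """Discard components that are both weakly connected and small."""
    components, adjacency = connected_components(nodes, edges)
    kept = []
    for component in components:
        mean_degree = sum(len(adjacency[n]) for n in component) / len(component)
        if mean_degree <= n_min_e and len(component) <= n_min_v:
            continue  # too sparse and too small: treated as outliers
        kept.append(sorted(component))
    return kept
```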

Fig. 9

Illustration of the connected components of the graph GΣ corresponding to two vectors v2 and v4 shown in Fig. 8. The connected components will be filtered based on their size and connectivity. In this example, only two of the connected components are valid and thus the final unit cell will have two atoms. The positions of these two atoms will be based on an average gathered from the connected components. Notice that an outlier atom from an adsorbate is here included in a connected component, but this does not have a large impact on the final averaged positions.

Each remaining valid connected component now represents a set of periodically repeated atoms corresponding to the same relative position within a repetition of the unit cell. For each atom in a valid connected component, a relative position in its respective unit cell repetition is calculated. These relative positions, which are in the interval [0, 1], are wrapped to the periodic repetition nearest to the origin of the cell and averaged to reach a robust estimate for a final relative atom position in the unit cell.
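Averaging relative positions requires care near the cell boundary, because 0.9 and 0.1 may describe atoms only 0.2 apart across the periodic boundary. One possible reading of the wrapping step, given here as a sketch under that assumption, shifts each periodic coordinate to the image closest to the origin before averaging and maps the mean back to the unit interval afterwards. The periodic flag per axis is a hypothetical addition for cells with fewer than three periodic directions.

```python
def average_relative_position(rel_positions, periodic=(True, True, True)):
    """Average relative positions after wrapping them close to the origin.

    rel_positions: sequence of 3-component relative coordinates.
    periodic: for each axis, whether the coordinate repeats with period 1.
    """
    averaged = []
    for axis, is_periodic in enumerate(periodic):
        values = [p[axis] for p in rel_positions]
        if is_periodic:
            # Shift to the periodic image nearest to the origin, [-0.5, 0.5].
            values = [x - round(x) for x in values]
        mean = sum(values) / len(values)
        # Map the averaged coordinate back to the unit interval.
        averaged.append(mean % 1.0 if is_periodic else mean)
    return tuple(averaged)
```

A naive mean of the first coordinates 0.9, 0.1 and 0.1 would give roughly 0.37, whereas the wrapped average stays near the cell origin.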

Cell validation

Sometimes the unit cell found in this way cannot represent a valid surface or 2D material because it is too sparse. This can happen, for example, when the cell contains many outliers. To detect these cases, we run the dimensionality detection routine as described in section Dimensionality detection on the found unit cell with a clustering threshold \(\epsilon\) = rbond. The default value for rbond is given in Table 1. If during this dimensionality detection, multiple clusters are detected for the original cell or the detected dimensionality is incorrect, the structure cannot be classified as a surface or a 2D material. In these cases, the structure is classified as a generic 2D structure.

The quality of cell identification depends heavily on the number of repetitions of the cell in the original system. If the algorithm finds that the best basis vectors correspond to the vectors of the original simulation cell, the correct classification cannot be guaranteed as we then only have information from one repetition. By default, these systems are classified as being generic 2D structures, but without the cell information. An exception to this can be made for 2D materials, for which it is quite typical to perform the simulation by using a single repetition of a primitive cell. To allow the identification of these systems, a maximum allowed basis vector length \(l_{{\mathrm{max}}}^{{\mathrm{2D}}}\) for 2D materials with only one unit cell in the original simulation can be specified. When a relatively small value for \(l_{{\mathrm{max}}}^{{\mathrm{2D}}}\) is allowed, it is a reasonable assumption that the unit cell does not contain defects or adsorbates, and it can be accepted. The parameter \(l_{{\mathrm{max}}}^{{\mathrm{2D}}}\) is not applicable to surfaces because by our definition, valid surfaces must have a repetition in three linearly independent directions, and thus systems with a single repetition of a unit cell can never represent a surface.

Tracking

During unit cell detection, only the area within the radius rmax from the seed atom is taken into account. Usually the unit cell is also repeated beyond this radius. To correctly identify all atoms that are a part of the underlying structure, a more extensive tracking of unit cell repetitions is needed. This becomes especially important for detecting outlier atoms that do not match the atoms in any repetition of the unit cell.

Tracking works by identifying atoms belonging to periodic repetitions of the unit cell. This search is initiated by finding periodic copies of the seed atom near the positions rseed + v, where rseed is the seed atom position and v is any of the unit cell basis vectors or their corresponding opposite vectors. If such atom locations are found, they are used as new seed locations \({\bf{r}}_{{\mathrm{seed}}}^\prime\), otherwise \({\bf{r}}_{{\mathrm{seed}}}^\prime = {\bf{r}}_{{\mathrm{seed}}} + {\bf{v}}\). Using \({\bf{r}}_{{\mathrm{seed}}}^\prime\) as a new cell origin, the atoms belonging to the neighbouring cell are then identified by searching for atoms in the known positions of the unit cell. In larger systems, the orientation and size of individual unit cells can be changed locally by external factors, such as adsorbates or pressure. To better adapt to these local distortions and reorientations of the lattice, the unit cell basis vectors v are updated as \({\bf{v}}\prime = {\bf{r}}_{{\mathrm{seed}}}^\prime - {\bf{r}}_{{\mathrm{seed}}}\). If new atoms are identified as belonging to a neighbouring unit cell, the search is extended in a breadth-first manner, i.e., the search continues recursively by using the new origin \({\bf{r}}_{{\mathrm{seed}}}^\prime\) and new basis vectors v′, and the neighbouring cells closest to the original seed location are searched before advancing to the next layer of neighbouring cells. The branches of the search are stopped when no new atoms are identified within a neighbouring cell.
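The breadth-first search with locally updated basis vectors can be illustrated with a strongly simplified sketch. It tracks only copies of the seed atom (a one-atom unit cell), ignores periodic boundary conditions of the simulation cell, and stops a branch as soon as no atom is found near the expected position. The function names and the position tolerance tol are hypothetical.

```python
import math
from collections import deque


def nearest_atom(positions, target, tol):
    """Index of the atom closest to target within tol, or None."""
    best, best_distance = None, tol
    for index, position in enumerate(positions):
        distance = math.dist(position, target)
        if distance <= best_distance:
            best, best_distance = index, distance
    return best


def track_seed_copies(positions, seed_index, basis, tol):
    """Breadth-first tracking of seed-atom copies with adaptive basis."""
    visited = {seed_index}
    queue = deque([(seed_index, [tuple(v) for v in basis])])
    while queue:
        index, vectors = queue.popleft()
        origin = positions[index]
        for i, v in enumerate(vectors):
            for sign in (1, -1):
                target = tuple(o + sign * c for o, c in zip(origin, v))
                found = nearest_atom(positions, target, tol)
                if found is None or found in visited:
                    continue  # branch ends: nothing new at this location
                visited.add(found)
                # Update the basis vector from the actual displacement,
                # v' = r'_seed - r_seed, keeping its original direction.
                updated = tuple(sign * (p - o)
                                for p, o in zip(positions[found], origin))
                new_vectors = list(vectors)
                new_vectors[i] = updated
                queue.append((found, new_vectors))
    return visited
```

In a chain whose spacing grows gradually, updating the basis vector at each step lets the search follow atoms that a fixed vector would miss.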

Outlier detection

Often 2D systems include atoms that cannot be attributed to the unit cell. Such atoms include adsorbates, vacancies, substitutions, interstitials and surface reconstructions. Being able to distinguish these outlier atoms is useful because it enables the search for structural combinations that go beyond bulk-terminated surfaces or ideal 2D materials.

During the tracking of the unit cells, the atoms that can be attributed to periodic repetitions of a unit cell are identified. However, in many cases, some of these atoms lie outside the extent of the surface and are in fact outliers that happen to be near a position where an atom is expected to be found based on the translational symmetry. This is also seen in Figs. 8 and 9, where one atom from the adsorbate happens to be near a position where a surface atom is expected. To better distinguish these outliers, we compare the chemical environments of all atoms in the unit cells to the ideal environment, as found in the detected unit cell. There are many ways to accurately quantify the local chemical environment near an atom, including the smooth overlap of atomic positions (SOAP)34 and atom-centered symmetry functions (ACSF).35 However, for the purpose of detecting outliers, only a rough estimate for the chemical environment is needed, and we apply a simpler and faster-to-compute quantity that is based on counting the occurrence of atomic elements within a finite radius. The chemical environment of atom a thus consists of a vector:

$${\bf{C}}_a = \left[ {n_1,n_2, \ldots } \right]$$
(9)

where ni is the number of atoms with atomic number i within a distance \(r_a^{{\mathrm{cov}}} + r_i^{{\mathrm{cov}}} + r_{{\mathrm{bond}}}\). We only consider atomic numbers that are found in the unit cell. Here \(r_a^{{\mathrm{cov}}}\) and \(r_i^{{\mathrm{cov}}}\) are covalent radii and rbond is a tunable parameter. The similarity between a real chemical environment Cb and the ideal chemical environment Ca as found in the unit cell is then measured as:

$$T\left( {{\bf{C}}_a,{\bf{C}}_b} \right) = \mathop {\sum}\limits_i {\kern 1pt} {\mathrm{min}}\left( {n_i^a,n_i^b} \right){\mathrm{/}}\mathop {\sum}\limits_i {\kern 1pt} n_i^a$$
(10)

This measure represents the ratio of common elements in the neighbourhoods of Ca and Cb to the number of neighbouring elements in the ideal environment Ca. This form is similar in style to the Tanimoto similarity measure,36 but is modified so that the comparison is always done to the ideal environment instead of the combined environments Ca and Cb. If this similarity is above a controllable parameter ΔC, the atom is assigned to a unit cell. Otherwise the atom is labelled as an outlier. The default value for ΔC is given in Table 1.
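Equation (10) translates directly into a few lines of code. In the sketch below, the environments are assumed to be given as mappings from atomic number to neighbour count, the threshold delta_c is passed explicitly rather than taken from Table 1, and returning 1.0 for an ideal environment without neighbours is a choice made here to avoid a division by zero, not something stated in the text.

```python
def environment_similarity(ideal, real):
    """Similarity T(C_a, C_b) of Eq. (10).

    ideal, real: dicts mapping atomic number -> number of neighbours.
    """
    total = sum(ideal.values())
    if total == 0:
        return 1.0  # assumption: an empty ideal environment cannot mismatch
    common = sum(min(count, real.get(z, 0)) for z, count in ideal.items())
    return common / total


def is_outlier(ideal, real, delta_c):
    """An atom is kept only if its similarity is above the threshold."""
    return not environment_similarity(ideal, real) > delta_c
```

Because the sum in the denominator runs over the ideal environment only, extra neighbours in the real environment do not lower the similarity, while missing neighbours do.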

After this analysis, the outlier atoms are flagged as atoms that were not matched to any repetition of the unit cell or were rejected from a unit cell by the chemical similarity check. If the number of outliers is larger than \(n_{{\mathrm{max}}}^{{\mathrm{outliers}}}\), which defaults to 50% of the total number of atoms, the found unit cell is not accepted. This is done to avoid using cells that are a part of a larger heterostructure or which have an exceptionally large fraction of outliers.

Parameters and generalizability

Several parameters control the classification and allow the user to customize the routine to different datasets. In Table 1, all the controllable parameters are introduced together with an explanation and the default values.

The given default parameters are a good starting point for most systems, but they can be adjusted to specific environments. The optimal values for the parameters Δr, rmax and iseed are highly dependent on the system, and producing good estimates for them can be hard. One approach is to find optimal parameter values for each structure separately by trying out multiple parameter values. The optimal parameters can be selected by specifying a list of possible values for each parameter, performing the classification with all possible combinations of these parameters and then keeping the result that gives the fewest outlier atoms. This parameter cross-validation requires that the position tolerance is kept below the minimum distance between any two atoms in the structure so that the outliers are properly detected.
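Such a parameter scan amounts to a grid search over the candidate values. The sketch below assumes a hypothetical callable classify(structure, **parameters) that returns the number of outlier atoms, or None when no valid cell is found; neither this callable nor the parameter names in the usage example correspond to a documented interface.

```python
import itertools


def best_parameters(classify, structure, grid):
    """Try every parameter combination and keep the one with fewest outliers.

    classify: hypothetical callable returning an outlier count or None.
    grid: dict mapping parameter name -> list of candidate values.
    Returns (outlier_count, parameters) or None if nothing succeeded.
    """
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        parameters = dict(zip(names, values))
        n_outliers = classify(structure, **parameters)
        if n_outliers is None:
            continue  # classification failed for this combination
        if best is None or n_outliers < best[0]:
            best = (n_outliers, parameters)
    return best
```

A call could look like best_parameters(classify, structure, {"delta_r": [0.1, 0.3, 0.5], "r_max": [6, 8, 10]}), where the listed values are placeholders chosen only to show the shape of the input.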