Introduction

Recent developments in materials informatics1,2 combined with ever-growing computational power have opened the way towards performing high-throughput calculations based on first-principles (ab initio) methods.3 This approach significantly accelerates the discovery of various materials with special functional properties.4,5,6,7,8,9 As a result, we are witnessing an exponentially increasing amount of data, usually organized in the form of databases like the Materials Project,10 the Computational 2D Materials Database11 or the Organic Materials Database (OMDB),12 to name but a few. To keep pace with the amount of data generated, there has to be a commensurate development of data mining and information retrieval tools capable of answering non-trivial questions about the data. Here, we present an online graphical pattern search tool capable of finding user-specified graphical patterns in a collection of thousands of electronic band structures (EBS).

There is ongoing interest in extending the theory of electronic bands. This effort is mainly motivated by two ideas: the search for semimetals with low-energy excitations behaving as exotic quasi-particles13 and recent developments in topological band theory.8,9,14,15,16,17 Realizations of non-trivial EBS features comprise the massless Dirac fermions, which were experimentally verified in graphene,18 as well as the Weyl fermions, which were found for instance in TaAs crystals.19 With the introduction of the so-called Weyl type-II semimetals20 (Weyl semimetals with heavily tilted energy-momentum cones), it is claimed that elementary excitations of the crystal can even mimic the physics of electrons close to the event horizon of black holes.21 This interpretation opens a path to verifying theoretical statements of black hole physics with relatively accessible measurements on single crystals. More exotic quasiparticles, which were discussed in a similar manner, are, for example, the double Dirac semimetal,22 the node-line semimetals,23 the hourglass fermions24 or the triple-fermion materials.25

Finding material realizations of these topological band features by manual inspection of EBSs is relatively easy for a small number of materials. However, this approach becomes impracticable for the thousands of band structures contained in modern EBS databases. Despite providing basic search functionality, most of the online databases lack non-trivial online search tools for EBS data querying and analysis. The software implementation of our tool, based on the approximate nearest neighbor search algorithm, is designed to meet the constraints of web applications, namely fast execution time and low memory usage. The tool is accessible within the web interface of the OMDB, which hosts thousands of EBSs for previously synthesized organic crystals, at https://omdb.diracmaterials.org/search/pattern. The source code of the developed tool is freely available at https://github.com/OrganicMaterialsDatabase/EBS-search and can be adapted to any other EBS database.

The rest of the paper is organized as follows. In Results, we describe the pattern search tool interface and its implementation. In Discussion, application examples for the discovery of novel functional materials are shown. Finally, technical details related to the OMDB data and pattern-matching algorithms are provided in Methods.

Results

Pattern search algorithm

For a three-dimensional crystalline solid, the EBS is a four-dimensional object representing energy levels of electrons dependent on a three-dimensional momentum vector. To capture its most distinctive features, the EBS is usually calculated along specific paths within the Brillouin zone, chosen, for example, according to the crystalline symmetry.26 Hence, properties of the EBS can be effectively characterized by one-dimensional patterns involving one or multiple bands.

To locate query patterns in the EBS data from the ab initio calculations stored in the OMDB, we employ a moving window approach. Each continuous path in the Brillouin zone is scanned with a moving window of width w in the momentum space with the stride s, specifying the number of data points the window jumps at each scanning step. Since the EBS is calculated numerically along a discrete mesh with different spacing for different paths within the Brillouin zone, linear interpolation is used to approximate energy values between the mesh points. For each moving window, we uniformly select d energy values from each band and form a vector to be compared with a query pattern, being also represented as a vector in the same way (Fig. 1a). Thus, in the case of a query pattern consisting of n bands, the resulting vector dimensionality is d × n (Fig. 1c). It is important to note that the present pattern search algorithm does not take into account the distance between bands (for instance, the distance between the maximum value of the lower band and the minimum value of the upper band in the n = 2 case), which needs to be specified explicitly by the user.
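The moving window procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the OMDB back-end: the function names `interpolate` and `window_vectors` and the data layout (one list of mesh momenta per path, one list of energies per band) are assumptions made for this example.

```python
from bisect import bisect_right


def interpolate(ks, es, k):
    """Linearly interpolate the band energy at momentum k from the mesh (ks, es)."""
    # Clamp to the path ends to guard against floating-point overshoot.
    if k <= ks[0]:
        return es[0]
    if k >= ks[-1]:
        return es[-1]
    i = bisect_right(ks, k) - 1  # index of the mesh point at or just below k
    t = (k - ks[i]) / (ks[i + 1] - ks[i])
    return es[i] + t * (es[i + 1] - es[i])


def window_vectors(ks, bands, w, d, s):
    """Yield (k_start, vector) for every window of width w along one path.

    ks    -- increasing momentum values of the mesh points on the path
    bands -- list of n energy lists, one per band, each aligned with ks
    w     -- window width in momentum units
    d     -- number of uniformly spaced samples per band (d >= 2)
    s     -- stride, in mesh points
    """
    for start in range(0, len(ks), s):
        k0 = ks[start]
        if k0 + w > ks[-1] + 1e-12:
            break  # the window would run past the end of the path
        sample_ks = [k0 + w * j / (d - 1) for j in range(d)]
        # Concatenate the d samples of each band into one vector of length d * n.
        vector = [interpolate(ks, band, k) for band in bands for k in sample_ks]
        yield k0, vector
```

For a path with 11 mesh points on [0, 1], two bands, w = 0.4, d = 3 and s = 2, this yields four windows, each represented by a vector of dimension 3 × 2 = 6.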

Fig. 1

A short summary of the pattern search algorithm. For each moving window of size w, d points are selected from each band for the analysis. Although the dimension of an electronic band along some high-symmetry path in the Brillouin zone is one, the dimension of the corresponding feature space, being represented in a vector form, is defined by the number of points in it. For instance, for a moving window comprising 2 bands with 3 points each (a), the dimensionality of the corresponding feature space is 3 for each band (b) and 6 for the final concatenated vector (c). In the last step, the distance between the normalized concatenated vector and the query pattern vector is calculated.

To measure the similarity between a vector obtained from the moving window and the query vector, the cosine distance \(\sqrt{2 - 2\cos\theta}\) is used, where θ is the angle between the normalized vectors. The normalization makes the cosine distance equivalent to the Euclidean (L2) distance. It also makes the distance insensitive to energy scaling. As θ ranges from 0 (two vectors are the same) to π (two vectors are opposite), the distance ranges from 0.0 to 2.0, respectively. Finally, K nearest vectors to the query vector are retrieved.
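The equivalence between the cosine distance and the Euclidean distance of normalized vectors can be checked numerically. The sketch below is illustrative only; the helper names `normalize` and `cosine_distance` are chosen for this example.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_distance(u, v):
    """Return sqrt(2 - 2 cos(theta)), computed as the L2 distance of unit vectors."""
    a, b = normalize(u), normalize(v)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Rescaling a vector, for example from [1, 2, 3] to [2, 4, 6], leaves the distance at zero up to rounding, while two opposite vectors are at the maximum distance of 2.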

Unfortunately, finding the nearest vectors becomes computationally demanding with respect to memory and CPU usage, especially for online applications. A straightforward exhaustive search algorithm, which goes through every vector, requires a number of comparisons equal to the total number of vectors to be queried. For example, applying the moving window approach with the realistic parameters w = 0.4, d = 16 and s = 2 for 10 bands near the Fermi level for 26,739 materials in the OMDB produces over \(1.6 \times 10^{7}\) vectors to query. As performance is crucial for online implementation, the exhaustive solution becomes impractical.
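For reference, an exhaustive search of this kind can be sketched in a few lines. This is a minimal illustration with a hypothetical function name `exhaustive_search`, not the implementation used for the benchmarks in this paper; it performs exactly one distance computation per stored vector.

```python
import heapq
import math


def exhaustive_search(query, vectors, k):
    """Return the k closest (distance, index) pairs, scanning every stored vector."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = unit(query)
    # One comparison per stored vector: the cost grows linearly with the data size.
    scored = ((math.dist(q, unit(v)), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)
```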

The exhaustive search can be accelerated with a computation-memory trade-off using a precalculated index structure based on search space partitioning. We implemented fast data access using the open-source ANNOY library,27 which uses the approximate nearest neighbor search algorithm. During the indexing step, it creates multiple binary tree structures, where each intermediate node represents a split and each leaf node represents an area in the search space (Fig. 2). This precalculated index helps to significantly reduce the search time. More details about the approximate nearest neighbor algorithm can be found in Methods.

Fig. 2

An example of the ANNOY algorithm for 100 points in a 2D space. (a) First, the space is split into two subspaces. The split occurs as the equidistant hyperplane between two randomly selected points indicated by the dashed line. For each subspace, this step is repeated recursively, until the number of points is below a certain threshold. (b) Using the constructed binary tree, the nearest neighbors can be found in logarithmic time. The algorithm generalizes to higher dimensional spaces. For instance, for a pattern consisting of 2 bands with 3 points each, the dimensionality of the corresponding search space is 6.

Since the bands near the Fermi level are usually of physical interest, we have indexed the 9 closest pairs of bands (5 bands above and 5 below the Fermi level). Thus, at the current stage, only these bands are available for the online search. We started with the implementation for the patterns consisting of two bands. However, the approach can be extended in a similar manner to patterns involving an arbitrary number of bands.

The tool’s interface

The developed pattern search tool is available online at https://omdb.diracmaterials.org/search/pattern. The tool’s web interface is shown in Fig. 3. A user can either select one of the predefined query patterns (two crossing straight lines or two parabolas) or use the free drawing input interface to search for an arbitrary pattern. Also, a user can specify the band indices with respect to the Fermi level where the search is performed, the moving window size in the momentum space, the maximum/minimum distance between the bands, if zero density of states between the bands is required, and other basic filtering options, such as space group number or chemical composition of the materials of interest.

Fig. 3

The web interface of the pattern search tool. A user can either select a predefined pattern or use the free drawing input interface to search for an arbitrary pattern (a sketch of “Mexican hat” is shown). Also, a user can specify bands of interest, moving window size, distance and density of states between the bands in the pattern, along with other basic filtering options like space group number or chemical composition of the materials of interest

Performance tests and calibration

To test and calibrate our tool, we use the EBS data contained in the OMDB. We also provide additional synthetic data tests together with the source code at https://github.com/OrganicMaterialsDatabase/EBS-search.

The first parameters to be defined are the moving window size w and the stride s. To this end, we test the sensitivity of the cosine distance to various distortions of the search pattern. The results are shown in Fig. 4. As can be verified, the distance between the query pattern and the example increases when shifts, oblique distortions, skews, or other nonlinear distortions are introduced. While s should be small with respect to w so as not to miss any possible search results (we use s = 2 DFT mesh points), the moving window size w is more task-specific. It should correspond to the expected characteristic momentum scale of the pattern of interest. For example, Fig. 5a suggests that the top search results for a linear crossing pattern show a much better agreement for a window size of w = 0.4 than for w = 0.8. At the same time, a similar test for two gapped parabolas gives qualitatively acceptable results for both moving window sizes (Fig. 5b). As w is pattern-dependent, its value should be specified by the user. Furthermore, it is worth noting that for smaller values of w, we are restricted by the mesh resolution in the momentum space stemming from the ab initio calculations. For example, for the EBSs contained in the OMDB, the moving window for w = 0.4 contains only 14.4 mesh points per band on average (minimum 9 and maximum 33).

Fig. 4

Sensitivity of the cosine (L2) distance (solid blue line) and the scaled Manhattan (L1) distance (dashed gray line) to various distortions of the Dirac crossing pattern: (a) shift, (b) oblique, (c) skew and (d) nonlinear distortion/change of the characteristic scale. The distorted patterns are shown for the red dots. High-frequency noise and outliers are not included because band structures are usually smooth objects with low variance over a characteristic scale.

Fig. 5

Comparison of the top 6 search results for linear crossings (a) and two gapped parabolas (b, the gap is not shown) for two different moving window sizes: 0.4 (first row) and 0.8 (second row). The top search results for the linear crossings have much better quality for w = 0.4 than for w = 0.8, while the search for two gapped parabolas gives qualitatively acceptable results for both moving window sizes. The titles above the graphs indicate the OMDB-ID. The values for E and k match the values on the website.

It is also important to determine the maximum distance at which a search result is still of acceptable quality. Since similarity to a pattern is an essentially subjective quality specific to the task at hand, we resort to visual inspection of the search results. Figure 6 shows that this value can vary from 0.8 for a linear crossing (Fig. 6a) to 0.5 for two gapped parabolas (Fig. 6b). On the website, we show the top search results ranked by their distance to the query pattern and use this threshold value in a warning message only.

Fig. 6

Pattern search results for a linear crossing in the two highest valence bands (a) and two parabolas in the highest valence and lowest conduction bands (b). Each row shows the nearest vectors (best search results) starting from a distance threshold, for threshold values 0.0, 0.5, 0.8, 0.9, 1.0 and 1.5, respectively, for the moving window size of 0.4. The distance between upper and lower bands was set to be less than 0.0001 eV for (a) and was not restricted for (b). The titles above the graphs indicate the OMDB-ID. The values for E and k match the values on the website.

As mentioned before, the exact nearest neighbor search algorithm is not applicable in the context of a web application due to the high computational demand. To tackle this issue, we choose the approximate nearest neighbor algorithm implemented in the ANNOY library, which has two parameters to tune: the number of search trees, N, and the number of points to examine, K. Increasing both parameters gives more precise search results at the expense of computational resources. Namely, N affects the memory usage and K affects the search time.

To tune these parameters, we compare the top 100 search results of the approximate nearest neighbor search algorithm for different values of N and K to those of the exact algorithm. As a ground truth, we use the top 100 exhaustive search results with w = 0.4 for the linear crossing pattern in the two bands below the Fermi level. As can be seen in Fig. 7, the performance of the approximate nearest neighbor search is close to the exact solution but the search time is significantly reduced. For example, using the values N = 20 and K = 1500, the approximate search is more than two orders of magnitude faster than the exact algorithm while obtaining comparable search results. The level of approximation can always be adjusted to the computational resources available.

Fig. 7

The quality of the top 100 search results obtained using the ANNOY library grows with the number of trees N for fixed K = 1500 (a) and the number of points to examine K for fixed N = 20 (b). As a ground truth, we used the top 100 search results from the exact algorithm for the linear crossing pattern with a moving window size of w = 0.4 in the two highest valence bands. The precision is calculated as the fraction of coinciding search results and micro-averaged over 10 different ANNOY indices.
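The precision measure used in Fig. 7 can be illustrated with a short sketch. The function names are hypothetical, and the micro-average is assumed here to pool the coinciding results and the ground-truth sizes over all indices before dividing, which is the usual meaning of the term; the original evaluation script may differ in detail.

```python
def precision(approx_ids, exact_ids):
    """Fraction of the exact top results that the approximate search also returns."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("the ground truth must not be empty")
    return len(set(approx_ids) & exact) / len(exact)


def micro_average(runs):
    """Pool hits and totals over several (approx_ids, exact_ids) runs.

    Each run would correspond to one independently built index
    (an assumption made for this example).
    """
    hits = sum(len(set(approx) & set(exact)) for approx, exact in runs)
    total = sum(len(set(exact)) for _, exact in runs)
    return hits / total
```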

Discussion

Several research groups have shown that the data mining approach can be successful, for example, in the search for stable nitride perovskites,28 thermoelectric materials,4 electrocatalytic materials for hydrogen evolution,5 or lithium-ion battery cathodes.6 Using a pattern search analysis of the data within the Electronic Structure Project,29 Klintenberg et al. identified 17 candidates for strong topological insulators by mining for materials exhibiting the specific “Mexican hat” shaped dispersion relation.7 Similarly, by searching for linear crossings in band structures, novel Dirac materials can be identified as recently shown using the data in the OMDB8,9 and the Materials Project database.30 Alternatively, new functional materials can be predicted by comparison of specific features in the EBSs of known prototype materials to the EBSs in electronic structure databases, as shown for example in the case of potential high-temperature superconductors.31,32 Similar statistical methods can also be used to identify systematic trends in strongly correlated f-electron materials.33

Here, we present a new approach to search for novel functional materials characterized by a specific pattern in their electronic structure, such as Dirac materials, topological insulators, and novel semimetals with low-energy excitations behaving as exotic quasi-particles.

A data-mining approach by means of the described pattern-matching algorithm can be a powerful tool. As the first example, we consider the linear crossing of two bands indicating Dirac materials. This class of materials has been extensively studied due to their exceptional transport and optical properties.34,35 To achieve an isolated crossing in the energy space, the additional constraint of having vanishing density of states at the crossing point was applied. Since the majority of organic crystals are insulating,12 we searched for the pattern in the first and second highest valence bands. The maximum band distance was set to 0.01 eV and the moving window size was restricted to 0.4. Under these conditions, the algorithm found 51 matching results, the best of which has a match error of 0.075 and a band distance of 0 eV. The corresponding band structure, plotted in Fig. 8a, belongs to the material C9H5ClN2O2 (OMDB-ID 4381, COD-ID 7155013), which crystallizes in a triclinic structure. It is also worth mentioning that, using an offline version of the presented tool, several novel organic Dirac materials have already been predicted.8,9

Fig. 8

Examples of search results for the patterns which might be interesting from a physical point of view: Dirac crossing, OMDB-ID 4381 (a); two touching parabolas, OMDB-ID 4492 (b); Mexican hat, OMDB-ID 2308 (c). Plotted using the Highcharts library.55

Whereas a linear crossing of bands corresponds to a nearly free electron gas of massless Dirac fermions, two touching parabolas mimic the behavior of massive free electrons corresponding to the Schrödinger equation. However, the search for two touching parabolas did not retrieve any materials with vanishing density of states at the touching point. Having weakened this criterion, the search for two touching parabolas in the second and third valence bands retrieved 1443 materials with the matching error for the top result of 0.224. The corresponding band structure is illustrated in Fig. 8b, which belongs to C20H20BrN3O3 (OMDB-ID 4492, COD-ID 7153203), having a monoclinic crystal structure.

Next to semimetals, materials possessing a gap can also show specific patterns. The most relevant examples are the topological insulators,36 where an overlap of two bands combined with a forbidden crossing leads to the specific Mexican hat shape of bands. This phenomenon is also referred to as band inversion. While the bulk of a topological insulator is insulating, metallic states on the surface can be found as a consequence of the topological gap. Well-known examples comprise the materials PbxSn1−xTe37,38,39 or Bi2Se3.40 The theory of topological gaps is clearly not restricted to a band gap at the Fermi level but can be generalized to any occurring spectral gap in the band structure. By searching for the Mexican hat shape in the third and fourth bands below the Fermi level, we found 290 materials using a moving window size of 0.8. The band distance was allowed to be in the range of 0.05–9 eV and the density of states was forced to be zero between the bands. As an example, the material C11H17ClO2 (OMDB-ID 2308, COD-ID 4030217) was found with the match error of 0.59 (Fig. 8c).

Methods

Organic materials database (OMDB)

The Organic Materials Database (OMDB)12 is an online database available at https://omdb.diracmaterials.org containing the output of ab initio calculations based on density functional theory (DFT)41,42 for 26,739 (at the moment of writing) previously synthesized three-dimensional organic crystal structures taken from the Crystallography Open Database (COD).43 The DFT calculations were performed using the Vienna Ab initio Simulation Package (VASP).44 The OMDB contains EBSs calculated along high symmetry \(\vec k\)-paths in the Brillouin zone which were automatically generated by the Pymatgen package.45 Electronic bands for each path were calculated on a discrete mesh consisting of 20 points independently of its length in the momentum space. For the pattern search, we use continuous paths suggested by Pymatgen. However, we plan to extend the search to cover all possible combinations of calculated paths sharing the same high-symmetry point. Although the calculations were performed spin-polarized, we do not distinguish between spin-up and spin-down bands for the pattern search task. More details about the DFT calculations can be found in ref. 12.

Problem overview

The problem of locating patterns similar to a target (query) pattern in a sequence of data points has a long interdisciplinary history. Related approaches are typically based on scanning the sequence with a moving window followed by the comparison of these shorter subsequences with the query.46 This approach has several dimensions to explore. The first one is related to the data representation. As an alternative to the raw data points, a fitted model or a transformation, such as Fourier,47 wavelet48 or dimensionality reduction,49 can be employed. Second, a similarity measure between the subsequences and the query needs to be defined. Most such measures are based on the Lp-norms; however, more advanced probability measures50 have also been discussed. Finally, for practical applications, an efficient search algorithm is necessary. Usually, it involves indexing the subsequences obtained by a moving window with a tree-like partition structure. The solution presented in this paper uses a cosine similarity (equivalent to the L2 distance for normalized vectors) and binary search trees as implemented in the open-source ANNOY library.27 No advanced data transformations are used.

Nearest neighbor search algorithm

The main idea of the nearest neighbor search51 is to find the nearest vectors to a query vector, given some distance measure. The most straightforward (exact) nearest neighbor algorithm iterates through each vector and calculates the distance to the query. This linear complexity algorithm can be accelerated with a computation-memory trade-off using a pre-calculated index structure based on search space partitioning. However, the related algorithms are no longer exact, because they can miss some search results. Nevertheless, due to the high computational demand of the exact search, it becomes necessary to use an approach which returns “close enough” neighbors in order to obtain a good speed improvement. In many cases, approximate methods perform comparably to the exact one.52 Many open-source libraries are available where various indexing strategies and approximation methods have been implemented, for example, “FAISS” released by Facebook AI Research,53 “ANNOY” by Spotify,27 and Non-Metric Space Library (NMSLIB).54

The back-end of the graphical pattern search tool is implemented using the open-source ANNOY library27 which is based on the approximate nearest neighbor search. During the indexing step, it creates a binary tree structure for the data vectors where each intermediate node represents a split and each leaf node represents an area in the search space. It keeps splitting the space randomly using equidistant hyperplanes between two randomly selected vectors in each node until the number of vectors in each subspace is below a certain threshold. It can also use multiple trees N (n_trees in the ANNOY documentation) in order to improve the quality of search results at the expense of memory usage. When a user tries to find closest neighbors of a query vector, the library first finds the leaf node that the query vector would belong to and collects K vectors to test (search_k in the ANNOY documentation) from that node as well as nearby leaf nodes for each tree. Then, it eliminates the duplicates which come from different trees and calculates the distance between each selected vector and the query. Here, N and K can be tuned to find a trade-off between the algorithm’s precision and performance.
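The indexing and query steps described above can be illustrated with a simplified pure-Python sketch. It is not the ANNOY implementation and does not use the ANNOY API: the names `build_forest` and `query`, the default leaf size and the priority rule for visiting nearby leaves are assumptions made for this example. Vectors are assumed to be normalized beforehand, so that the Euclidean distance equals the cosine distance.

```python
import heapq
import itertools
import math
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def build_tree(ids, vectors, leaf_size, rng):
    """Return a leaf (list of ids) or a split (normal, offset, left, right)."""
    if len(ids) <= leaf_size:
        return ids
    a, b = (vectors[i] for i in rng.sample(ids, 2))
    # Hyperplane equidistant from a and b: dot(normal, x) == offset.
    normal = [x - y for x, y in zip(a, b)]
    offset = dot(normal, [(x + y) / 2 for x, y in zip(a, b)])
    left = [i for i in ids if dot(normal, vectors[i]) < offset]
    right = [i for i in ids if dot(normal, vectors[i]) >= offset]
    if not left or not right:  # degenerate split, e.g. duplicate vectors
        return ids
    return (normal, offset,
            build_tree(left, vectors, leaf_size, rng),
            build_tree(right, vectors, leaf_size, rng))


def build_forest(vectors, n_trees, leaf_size=8, seed=0):
    """Build n_trees independent random trees over all vectors."""
    rng = random.Random(seed)
    ids = list(range(len(vectors)))
    return [build_tree(ids, vectors, leaf_size, rng) for _ in range(n_trees)]


def query(forest, vectors, q, k, search_k):
    """Collect about search_k candidates from all trees, then rank them exactly."""
    tie = itertools.count()  # unique tie-breaker so that nodes are never compared
    heap = [(-math.inf, next(tie), tree) for tree in forest]
    heapq.heapify(heap)
    candidates = set()  # a set removes duplicates coming from different trees
    while heap and len(candidates) < search_k:
        priority, _, node = heapq.heappop(heap)
        if isinstance(node, list):
            candidates.update(node)
            continue
        normal, offset, left, right = node
        margin = dot(normal, q) - offset
        # The side containing q is visited first; the far side is penalized
        # by how far q lies from the splitting plane.
        heapq.heappush(heap, (max(priority, margin), next(tie), left))
        heapq.heappush(heap, (max(priority, -margin), next(tie), right))
    ranked = sorted((math.dist(q, vectors[i]), i) for i in candidates)
    return ranked[:k]
```

In this sketch, increasing `n_trees` enlarges the index and increasing `search_k` lengthens each query, which mirrors the roles of N and K discussed in the text. When `search_k` exceeds the number of stored vectors, every leaf is visited and the result coincides with the exhaustive search.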