Introduction

Considerable growth in computational power and its ubiquity has been coupled with the development of efficient algorithms1, 2 and their implementation in robust3,4,5 and reliable6 computer codes. This has permitted the first principles, quantum mechanical (through density functional theory (DFT)7,8,9), treatment of material,10 chemical11 and biological systems12 of increasing complexity. High throughput computations can be performed directly on pre-existing data of more modest complexity, or some modification of it,13 or as a way to sample configuration space to discover previously unknown structures.14,15,16,17 Altogether, these approaches offer a route to computational materials discovery.18

In parallel, increasing computational power and an abundance of data has given rise to another rapidly expanding field—the study of complex networks.19,20,21,22 A key reason for its success is the fact that related mathematical approaches can be applied to a wide range of real-world network data across many academic disciplines. The structure of networks can be studied at a variety of resolutions. Local measures of connectivity can quantify the topological properties of individual nodes or edges. Global measures, calculated across the entire network, such as the average shortest path length,19 can help us to compare networks as a whole. Between these two extremes however lies an entire field of research that searches for meaningful descriptions of intermediate structures, such as ‘cliques’,23 ‘communities’24 and ‘rich clubs’,25 among others. These are sets of nodes or edges which are particularly densely connected, or which share some other defining topological feature. Many definitions of such structures have been put forward.23,24,25,26,27,28 Here we select an approach that is particularly good at detecting hierarchical modularity27 and apply it to atomic networks, which until now have received scant attention from the networks research community, beyond the study of proteins.29,30,31 Our aim is to provide an automated coarse-graining of the atomic structures of crystal structures. The simplification of the space of possible configurations of complex atomic systems has the potential to vastly accelerate the process of computational materials discovery, among other tasks that can benefit from automatic coarse-graining based on hierarchical decomposition. We illustrate this through the identification of a possible new allotrope of boron.

Results

Determining the modularity of atomic networks

For atomic structures of a single atomic species we can generate an unweighted network of atoms by simply imposing a threshold d * on the interatomic distance and connecting atoms that are closer to each other than this threshold distance. The communities in this network can then be extracted by using the algorithm of Arenas et al.,27 as discussed below.

The extent to which a network has well-defined community structure can be quantified with a metric known as the modularity24 (Fig. 1). This is defined as the fraction of edges that run between nodes of the same community, minus the expected fraction if the edges of the network were positioned randomly:

$$Q = \frac{1}{{2A}}\mathop {\sum}\limits_{i,\, j} {\left( {{A_{ij}} - {P_{ij}}} \right)\delta \left( {{C_i},{C_j}} \right)}$$
(1)

Here i, j [1, n], where n is the number of nodes. δ(C i , C j ) = 1 if nodes i and j belong to the same community, and 0 otherwise, and A ij and P ij are the adjacency matrices of the network and of the null model (the randomised network), respectively. \(A = \frac{1}{2}\mathop {\sum}\nolimits_{i,\, j} {A_{ij}}\) gives the total number of edges in the network.

Fig. 1
figure 1

A network with three communities of five nodes each. Let the white, grey and black communities be C 1, C 2, C 3. Then A ss , the number of internal edges, is {8, 9, 9}. The sum of degrees of each community, A s is {18, 21, 21}, and the total number of edges is 30, giving a final value for Q of 0.532. An alternative partition, indicated by the dotted lines, has A ss  = {18, 9} and A s  = {39, 21}, giving Q = 0.355. As expected, the less intuitive partitioning gives a lower Q value

This metric relies on the concept that a random graph is not expected to exhibit community structure. As such, the quality of the proposed community structure can be quantified by the difference between the network and the null model. The degree of a node i is defined as the number of edges of the node, \({A_i} = \mathop {\sum}\nolimits_j {A_{ij}}\;\). Choosing the null model to have the same degree distribution as the target network gives:

$$Q = \frac{1}{{2A}}\mathop {\sum}\limits_{i,\, j} \left( {{A_{ij}} - \frac{{{A_i}{A_j}}}{{2A}}} \right)\delta \left( {{C_i},{C_j}} \right)$$
(2)

This can be rewritten as a sum over the M communities of the network:

$$Q = \mathop {\sum}\limits_{s = 1}^M \left( {\frac{{{A_{ss}}}}{A} - {{\left( {\frac{{{A_s}}}{{2A}}} \right)}^2}} \right)$$
(3)

Where for unweighted networks A ss is the number of edges within community s, and A s is the sum of the degrees of nodes within community s. (Equivalently, A s is the number of edges exiting the community +2A ss .) This metric allows community detection to be recast as an optimisation problem; maximising Q minimises the number of edges between communities. However, naive optimisation of the modularity has been shown to have a fundamental resolution limit.32 Communities smaller than a certain size \(A_{ss}^{min}\) will not be detected. This threshold depends on the total number of edges in the network:

$$A_{ss}^{min} = \sqrt {\frac{A}{2}} - 1$$
(4)

This resolution limit arises due to the explicit dependence on the number of edges within the null model. This introduces a preferred size for the communities in the network.

The algorithm of Arenas et al.27 utilises this bound on the size of detectable communities. By adding a self-loop of strength w to each node in network, AA + w I, the resolution limit inequality becomes:

$${A_{ss}} < \frac{1}{2}\left( {\sqrt {2A + Nw} - {N_s}w - 2} \right)$$
(5)

where N s is the number of nodes in community s, and N the total number of nodes. By varying the effective resolution limit, the scale of the communities extracted can be varied. As w is increased, modularity optimisation will result in an increasingly fragmented decomposition. By optimising modularity for a range of this parameter w, and across a range of threshold values d *, one obtains a variety of hierarchical decompositions into modules (see the section ‘Relax and Shake Algorithm’ for details of the optimisation algorithm). The simplest quantity one can establish across this two-dimensional (2D) space is the number of modules. Regions across which this value is stable represent more meaningful modules that may have real physical or biological meaning as rigid clusters or units of protein architecture.

Pauling’s rule of parsimony suggests that the number of different kind of constituents in a crystal is small.33 This suggests that in complex crystal structures, in which modules are likely to exhibit a degree of symmetry, it makes sense to minimise a more sophisticated quantity, namely the information content of a structure. The identification of modules corresponds to a compression if these modules contain symmetries or if the same module appears multiple times. We can calculate the amount of information I required to describe a given module structure in terms of the global degrees of freedom of that structure, and minimise this quantity over the space of d * and w.

In order to calculate I, consider M modules of M′ distinct types. The number of modules with one atom only is 0 ≤ M * ≤ M, and the number of modules with two atoms is 0 ≤ M ** ≤ M. To position and rotate the M modules relative to one another we need 6M − 6 degrees of freedom in general, with one degree less for every two-atom module, and three degrees less for every one-atom module. Now consider each of the M′ distinct modules: If we have n i  > 2 inequivalent atoms in module i we need 3n i  − 6 degrees of freedom to describe them. If we have n i  = 2 inequivalent atoms in module i we need 3n i  − 5 = 1 degree of freedom to describe them, which corresponds to the distance between the atoms. If we have n i  = 1 inequivalent atom in module i we need 0 degrees of freedom to describe it internally. The global number of degrees of freedom is then:

$$I = 6M - {M^{**}} - 3{M^*} - 6 + \mathop {\sum}\limits_{i = 1}^{M'} 3{n_i} - 6 + 3{\delta _{1{n_i}}} + {\delta _{2{n_i}}}$$
(6)

Note that the number of inequivalent atoms n i depends on the number of atoms in module i, as well as its point group symmetry. If all N atoms are in one module, repeated once, we have M = 1, M′ = 1, M * = 0, M ** = 0, and n i  = N. Hence I = 3N − 6, as required. If all N atoms are in N modules of one atom we have M = N, M′ = 1, M * = N, M ** = 0, n i  = 1. Hence in this case also I = 3N − 6, as required. The information I can be normalised by the maximum possible value of 3N − 6. This normalised value is used in all the heat-map figures in this manuscript.

The decomposition which has the minimum I gives us the most concise description of a structure. This minimisation of the description length is conceptually related to the idea of algorithmic information theory,34,35,36 as the symmetry operations and inequivalent atomic positions that form part of the compression can be thought of algorithms which allow us to reconstruct the original atomic structure. The length of the shortest such description is a quantitative measure of the structure’s complexity. Because of the presence of crystal symmetries, we need to establish the modularity of the atomic network with high accuracy. In addition, the modularity is highly degenerate; there is a greater than exponential number of distinct possible community structures, and many will have modularity values close to that of the global maximum.37 Moreover, these structures may have very different topologies to that of the true partition, resulting in a large change in the compression achieved. We therefore employ an algorithm similar to that of the ‘relax and shake’ algorithm15 or zero temperature basin hopping.38

Relax and Shake Algorithm

The relax and shake (RASH) algorithm uses a repeated series of local modularity optimisations (relax) followed by the assignment of a small number of nodes into random communities (shake), in order to escape local maxima. The local optimisation follows existing work on community detection.39, 40 The modularity is optimised by moving each node in the network to the community of the neighbouring node resulting in the highest increase in the modularity (if > 0). This is then repeated until no further local optimisations increase the modularity. Following the local optimisation, a subset of the nodes (10%) are shaken into other communities within the network, and the local optimisation repeated. This continues until 200 consecutive relax-and-shake iterations have failed to improve the modularity. As an additional check on the solution, the modularity change resulting in merging any two communities is calculated; if this results in a modularity increase, the merge is performed, and the relax-and-shake iteration process is begun again using the new partition. The above can be considered a single optimisation step; following this, a larger subset of the nodes (20%) are shaken into either pre-existing, or new communities. The optimisation is then performed again. This shake-and-relax is performed until 200 iterations fail to improve the modularity. The whole process is repeated until three consecutive runs have failed to produce a community structure with a higher modularity. The degree of repetition is parameterisable, and allows us to have confidence in the community structure obtained (at the cost of speed).

If we partition a structure of N atoms into M multi-atom modules, so that typically MN, and assume that the modules correspond to rigid clusters, then we reduce the dimensionality of configuration space from 3N − 6 (atomic positions minus global translation and rotation) to 6M − 6 (as we have to specify a relative translation and rotation for each module) or less (if any of the modules have less than three atoms). We will show that this can be exploited in the first principles prediction of crystal structure.

Application to crystal structures

Boron. Boron is known to form several allotropes,41 including α-B,42 β-B43, 44 and γ-B.45,46,47 The structure of rhombohedral α-B is widely recognised as being made up of interconnected B12 icosahedra (Fig. 2). However, because the bonds between different icosahedra are shorter than the bonds within, simple thresholds on bond length will not yield the underlying modular structure of this crystal phase. This observation motivated the development of our current scheme, which, based on network modularity, does yield the scientifically agreed icosahedral modules.

Fig. 2
figure 2

The structure of α-B12 (in the 36 atom hexagonal unit cell) and its modular decompositions. As established by the modularity decomposition the simplest way of describing this structure is as three interconnected icosahedra. For the purpose of comparison we also show an alternative, less efficient decomposition into six units of six atoms with C3v symmetry. These units correspond to halves of the icosahedra. The decomposition that minimises the amount of information I (normalised here by the maximum value of 3N − 6) required to describe the crystal structure, occurs for values around w s  = 1 (modularity tuning parameter) and d * = 2.5 Å (interatomic distance threshold). This decomposition occupies a broad plateau (darkest shade of green) in the 2D space of the modularity tuning parameter w s and the interatomic distance threshold d *

The structure of β-B is much more complicated, and various models have been proposed, typically with 105 or 106 atoms per unit cell. We choose the 105-atom model of Geist et al.44 for further investigation (Fig. 3). Our modularity detection scheme identifies four icosahedra, two larger 25-atom modules with threefold cyclic point group symmetry C3v , and one module with threefold dihedral symmetry D3d of seven atoms. Two of the four icosahedra are slightly distorted, resulting in C2v symmetry, rather than icosahedral I h symmetry. Of course, the decomposition of complex boron structures into compact and symmetric sub units is not unprecedented, see for example Fig. 2 in ref. 41. We emphasise, however, that our scheme performs the decomposition automatically and is suitable for integration into complex computational workflows.

Fig. 3
figure 3

The highly complex structure of β-B105 (in the 105-atom rhombohedral unit cell) and its modular decompositions. The complexity of the structure means a more complex landscape of possible decompositions. The module structure that minimises the amount of information I (normalised here by the maximum value of 3N − 6) required to describe the crystal structure, occurs for values around w s  = 1.66 and d * = 2.82Å. This decomposition consists of four icosahedra, two large 25-atom structures with threefold symmetry (C3v ), and a further module with dihedral symmetry (D3d ). Two of the four icosahedra have exact icosahedral symmetry, whereas the other two are slightly distorted, and exhibit C2v symmetry

Recently the structure of a high-pressure phase of boron, γ-B28, has been described in the literature41, 46, 47 (Fig. 4). A w s vs. d * heat-map of I for the unweighted network reveals a global minimum at w s  = 1.0 and d * = 2.0 Å. This corresponds to two 14-atom modules with D2h symmetry, which are icosahedra plus two atoms on either side (Fig. 4). Note that this contrasts with the conventional decomposition into two icosahedra and two dimers found in the literature,46 which is less favoured in our scheme as it corresponds to a higher value of I. Our approach offers a meaningful partition of this structure into modules, providing insight into the organisation and visualisation of this structure and opening the door to the systematic exploration of the structure space that neighbours this γ-B28 allotrope.

Fig. 4
figure 4

The structure of γ-B28 (Pnnm unit cell) and its modular decompositions. The most parsimonious decomposition, which minimises the amount of information I (normalised here by the maximum value of 3N − 6) required to describe the crystal structure, occurs for values around w s  = 1.0 and d * = 2.0 Å. This decomposition consists of two identical modules of fourteen atoms with dihedral symmetry (D2h ), which in turn are each composed of an icosahedron with two adjoining atoms, one on either side. For the purposes of comparison we also show a less efficient decomposition, with C s symmetry

Phosphorus. Like boron, phosphorous exhibits rich allotropism, from the highly metastable white phosphorous, to layered black phosphorous, and extremely complex fibrous, or layered, structures.48 There is considerable current interest in 2D black phosphorous, or phosphorene49 and other layered forms.50 Here we investigate the crystal structure of red phosphorus,51 and attempt to identify a simple decomposition into modules using our current scheme. In the 42-atom primitive unit cell (space group P\(\bar 1\)) the modularity decomposition finds two modules that each occur twice (Fig. 5). The bigger module has symmetry (C s ) and contains 13 atoms, while the smaller has C2v symmetry and contains eight atoms. The relatively low degree of symmetry of red phosphorous means that the landscape of I with varying w s and d * is flatter, but our approach nevertheless finds a parsimonious decomposition of the crystal structure.

Fig. 5
figure 5

The structure of red phosphorous (42 atom P\(\bar 1\) unit cell) and its modular decomposition. The relatively low degree of symmetry in the structure as a whole results in a relatively flat landscape. The information I (normalised here by the maximum value of 3N − 6) required to describe the structure is minimised for values around w s  = 1 and d * = 3.33 Å. This decomposition consists of two modules, which each appear twice. One module consists of 13 atoms with C s point group symmetry. The other consists of eight atoms with C2v symmetry

Metal-organic frameworks. We extend our framework to multi-species structures requiring only a definition of the relationship between d * and the interatomic distances used to determine the network connectivity. In principle, a separate d * could be defined for each pair of atomic species. However, this introduces the cost of exploring a higher dimensional space in the search for an optimal I. Instead, we define a single dimensionless parameter \(d_{{\rm{eff}}}^{\rm{*}}\) that specifies the distance threshold as the \(d_{{\rm{eff}}}^{\rm{*}}\)-fold multiple of the sum of the fixed atomic radii for a given pair of atoms. We apply this multi-species version of our approach to the metal-organic framework MOF-5, or Zn4O(BDC)3, where BDC2– is 1,4-benzenedicarboxylate.52 Metal-organic frameworks exhibit a vast range of structures and are of great interest because their porosity allows them to be used for the storage of gases, such as hydrogen, or carbon dioxide.53 As can be seen in Fig. 6 our algorithm finds two similar decompositions with almost equally low I-values. The lowest minimum corresponds to a decomposition of the structure into six 16-atom modules with D2h symmetry, and two six-atom modules with tetrahedral symmetry (T d ). The 16-atom modules correspond to the BDC2– molecules that are sometimes referred to as the ‘struts’ of metal-organic frameworks. This suggests that the decomposition derived automatically through our procedure is chemically meaningful.

Fig. 6
figure 6

The structure of the metal-organic framework MOF-5 (Zn4O(BDC)3, where BDC2– is 1,4-benzenedicarboxylate) and its modular decompositions. This structure has 106 atoms per unit cell. The two lowest minima have very similar values for the normalised information, with I = 0.163 at w s  = 2 and \(d_{{\rm{eff}}}^{\rm{*}} = 3\) and I = 0.166 at w s  = 1 and \(d_{{\rm{eff}}}^{\rm{*}} = 2\). Unlike d *, which is a physical distance threshold, the distance threshold parameter \(d_{{\rm{eff}}}^{\rm{*}}\) is dimensionless and expressed as the multiple of the summed atomic radii for a given pair of atoms (see text). Both decompositions consist of two modules, one of which appears six times and has D2h symmetry, and the other one of which appears twice and has T d symmetry. The modules for I = 0.163 are one with 16 atoms (8 C, 4 O, 4 H) and one with five atoms (4 Zn, 1 O)

Structure prediction

Ab initio random structure searching (AIRSS)14, 15 is a simple, and demonstrably effective, approach to first principles structure prediction. It has been applied to a wide range of systems, from the crystal structures of dense hydrogen54 and hydrogen rich compounds,14 to matter under extreme compression,55 and interfaces.56 The approach involves selecting initially random structures from distributions defined by physically motivated constraints (for example, density, composition, atomic distances, symmetry, molecular units or fragments). These random ‘sensible’15 structures are fully relaxed (moved to the nearest local minimum in the energy landscape) under forces derived from DFT. Once a large number of computations have been performed the resulting structures can be ranked according to energy (free enthalpy) or any computable property of interest.

It has been a surprise to many that such a naive approach performs well, but the method’s success is linked to intrinsic features of the first principles energy landscape, such as its smoothness (a result of the quantum mechanical interactions between the atoms and electrons) and the relatively large number of large energy basins. In a smooth energy landscape the size of the basins correlates with their depth (deep basins occupying a large volume of configuration space), there is a natural bias in random sampling, plus relaxation, to the stable, low energy and relevant structures. In what follows we exploit our new approach for the automatic decomposition of known structures into minimum I fragments to accelerate the search for complex structures by restricting the regions of configuration space that must be explored.

Application to dense boron. We generate 3303 initial random structures based on packing four 14 atom D2h modules derived from γ-B28 into unit cells with a randomly chosen shape, and the same density as γ-B28. The units are not permitted to overlap each other, or be closer than 1.63 Å (the measured inter-icosahedral distance in a computed α-B structure), and are related to each other by symmetry. The symmetry is chosen at random from those space groups with four symmetry operators in the primitive cell. The random initial structures are then relaxed to nearby local energy minima (see ‘Methods’ for computational details). Four of the initial structures relaxed to supercells of the Pnnm γ-B28 structure, and three of them relaxed to a previously unreported structure with space group Pbcn and 56 atoms in a unit cell. This structure has a density very close to that of γ-B28, and is just 3 meV/atom less stable. In Fig. 7 it is shown that this near degeneracy persists over a wide range of pressures.

Fig. 7
figure 7

a Relative enthalpy of boron structures up to 20 GPa. b The new Pnnm-B56 structure is practically degenerate in energy with c the known γ-B28 structure

Our new 56 atom structure corresponds to a distorted hexagonal packing of boron icosahedra, whereas γ-B28 corresponds closely to cubic packing. The small energy difference indicates that γ-B may be susceptible to polytypism or stacking disorder. The 3 meV/atom energy difference between the hexagonal and cubic polytypes is small compared to the 25 meV difference between cubic and hexagonal (carbon) diamond computed at the same level. The situation is very similar to that for α − B12, for which an alternatively packed structure of 24 atom and space group Cmcm has been identified,15 and further discussed.57, 58

Discussion

In contrast to other applications of community-detection algorithms, in which a degree of ambiguity in the definition of communities is often tolerated, our application relies on the detection of robust and compressible modules of atoms with a maximum amount of symmetry. Small differences in the assignments of atoms to modules can have large effects on the information measure I if modular symmetries are established or broken as a result. For this reason the implementation of the modularity optimisation requires particular care in order to find maximally robust modular decompositions.

In the context of multi-species atomic structures, and particularly biological molecules, it can be valuable to consider weighted atomic networks, in which the edge weight scales with the interatomic distance. The simplest choice is a linear one:

$${w_{ij}} = K\left( {1 - \frac{{{d_{ij}}}}{{{d^*}}}} \right)$$

for d ij  < d * and w ij  = 0 otherwise, where d ij is the distance between atoms i and j, and K is an arbitrary constant. Increasing all edge weights by a constant factor leaves the modularity unchanged, so in practice this is chosen to ensure numerical precision. This choice of edge weighting reflects the fact that equilibrium bond lengths scale with the equilibrium bond energies.

Our approach can be applied in the context of proteins, similar to,29,30,31 where the vastness of configuration space has traditionally also been a difficult barrier to overcome. The lack of symmetry in biological molecules mean that the information required to describe the structure is less useful than in crystal structures. Other measurements, such as the stability of the module number, can replace the information measurement as a criterion for assessing the quality of a modular decomposition in biomolecules. While methods for rigidity analysis in proteins already exist, such as the FIRST algorithm,59 the tuning parameter in our method allows for the detection of rigid clusters on a variety of length scales, making it a complementary approach. Like FIRST, our method could inform coarse-grained multi-scale molecular dynamics methods such as FRODA,60 supplying the rigid units described as ‘ghost templates’ in FRODA, and lead to more efficient computational models of protein dynamics.

In the context of complex crystal structures this approach has numerous potential applications. It first of all suggests an automatic coarse-graining and thereby provides an intuitive simplification and visual aid. The discovery of modules also facilitates the accelerated exploration of configuration space, particularly in the context of random structure search. This has been demonstrated by the new structure of boron closely related to γ-B28 that we find using a module-based search. We believe that these results show the potential of atomic network analysis as a tool for materials discovery. Our algorithm is evidently fast enough for these purposes—the run time for calculating the metal-organic framework MOF-5 decomposition (106 atoms per unit cell) was 2.3 s on a 3.1 GHz Intel Core i7 processor.

Methods

The density functional computations on boron were performed using CASTEP 17.2,3 a full featured plane wave pseudopotential total energy code. The GGA-PBE61 density functional was used, along with a legacy Vanderbilt ultrasoft pseudopotential,62 a plane wave cutoff of 240 eV, and a k-point sampling density of 0.1 × 2π Å−1, for the random searches. The enthalpies were calculated using a higher precision default on-the-fly pseudopotential, 700 eV plane wave cutoff and a k-point sampling density of 0.03 × 2π Å−1.

Data availability

The data associated with this manuscript is made available in ref. 63.