Revealing and exploiting hierarchical material structure through complex atomic networks

One of the great challenges of modern science is to faithfully model, and understand, matter at a wide range of scales. Starting with atoms, the vastness of the space of possible configurations poses a formidable challenge to any simulation of complex atomic and molecular systems. We introduce a computational method to reduce the complexity of atomic configuration space by systematically recognising hierarchical levels of atomic structure, and identifying the individual components. Given a list of atomic coordinates, a network is generated based on the distances between the atoms. Using the technique of modularity optimisation, the network is decomposed into modules. This procedure can be performed at different resolution levels, leading to a decomposition of the system at different scales, from which hierarchical structure can be identified. By considering the amount of information required to represent a given modular decomposition we can furthermore find the most succinct descriptions of a given atomic ensemble. Our straightforward, automatic and general approach is applied to complex crystal structures. We show that modular decomposition of these structures considerably simplifies configuration space, which in turn can be used in discovery of novel crystal structures, and opens up a pathway towards accelerated molecular dynamics of complex atomic ensembles. The power of this approach is demonstrated by the identification of a possible allotrope of boron containing 56 atoms in the primitive unit cell, which we uncover using an accelerated structure search, based on a modular decomposition of a known dense phase of boron, $\gamma$-B$_{28}$.

decomposition of these structures considerably simplifies configuration space, which in turn can be used in discovery of novel crystal structures, and opens up a pathway towards accelerated molecular dynamics of complex atomic ensembles. The power of this approach is demonstrated by the identification of a possible allotrope of boron containing 56 atoms in the primitive unit cell, which we uncover using an accelerated structure search, based on a modular decomposition of a known dense phase of boron, γ-B 28 .

INTRODUCTION
Considerable growth in computational power and its ubiquity has been coupled with the development of efficient algorithms [1,2] and their implementation in robust [3][4][5] and reliable [6] computer codes. This has permitted the first principles, quantum mechanical (through density functional theory -DFT [7][8][9]), treatment of material [10], chemical [11], and biological systems [12] of increasing complexity. High throughput computations can be performed directly on pre-existing data of more modest complexity, or some modification of it [13], or as a way to sample configuration space to discover previously unknown structures [14][15][16][17]. Together, these approaches offer a route to computational materials discovery [18].
In parallel, increasing computational power and an abundance of data has given rise to another rapidly expanding field -the study of complex networks [19][20][21][22]. A key reason for its success is the fact that related mathematical approaches can be applied to a wide range of real-world network data across many academic disciplines. The structure of networks can be studied at a variety of resolutions. Local measures of connectivity can quantify the topological properties of individual nodes or edges. Global measures, calculated across the entire network, such as the average shortest path length [19], can help us to compare networks as a whole. Between these two extremes however lies an entire field of research that searches for meaningful descriptions of intermediate structures, such as "cliques" [23], "communities" [24], and "rich clubs" [25], among others. These are sets of nodes or edges which are particularly densely connected, or which share some other defining topological feature. Many definitions of such structures have been put forward [23][24][25][26][27][28]. Here we select an approach that is particularly good at detecting hierarchical modularity [27] and apply it to atomic networks, which until now have received scant attention from the networks research community, beyond the study of proteins [29][30][31]. Our aim is to provide an automated coarse-graining of the atomic structures of crystal structures. The simplification of the space of possible configurations of complex atomic systems has the potential to vastly accelerate the process of computational materials discovery, among other tasks that can benefit from automatic coarse graining based on hierarchical decomposition. We illustrate this through the identification of a possible new allotrope of boron.

Determining the modularity of atomic networks
For atomic structures of a single atomic species we can generate an unweighted network of atoms by simply imposing a threshold d * on the interatomic distance and connecting atoms that are closer to each other than this threshold distance. The communities in this network can then be extracted by using the algorithm of Arenas et al. [27], as discussed below.
The extent to which a network has well-defined community structure can be quantified with a metric known as the modularity [32] (Fig. 1). This is defined as the fraction of edges that run between nodes of the same community, minus the expected fraction if the edges of the network were positioned randomly: Here i, j ∈ [1, n], where n is the number of nodes. δ(C i , C j ) = 1 if nodes i and j belong to the same community, and 0 otherwise, and A ij and P ij are the adjacency matrices of the network and of the null model (the randomised network), respectively. A = 1 2 i,j A ij gives the total number of edges in the network.
This metric relies on the concept that a random graph is not expected to exhibit community structure. As such, the quality of the proposed community structure can be quantified by the difference between the network and the null model. The degree of a node i is defined as the number of edges of the node, A i = j A ij . Choosing the null model to have the same degree distribution as the target network gives: This can be rewritten as a sum over the M communities of the network: Where for unweighted networks A ss is the number of edges within community s, and A s is the sum of the degrees of nodes within community s. (Equivalently, A s is the number of edges exiting the community + 2A ss .) This metric allows community detection to be recast as an optimisation problem; maximising Q minimizes the number of edges between communities. However, naive optimisation of the modularity has been shown to have a fundamental resolution limit [33]. Communities smaller than a certain size A min ss will not be detected. This threshold depends on the total number of edges in the network: This resolution limit arises due to the explicit dependence on the number of edges within the null model. This introduces a preferred size for the communities in the network.
The algorithm of Arenas et. al [27] utilises this bound on the size of detectable communities. By adding a self-loop of strength w to each node in network, A → A + wI, the resolution limit inequality becomes: Where N s is the number of nodes in community s, and N the total number of nodes.
By varying the effective resolution limit, the scale of the communities extracted can be varied. As w is increased, modularity optimisation will result in an increasingly fragmented decomposition. By optimising modularity for a range of this parameter w, and across a range of threshold values d * , one obtains a variety of hierarchical decompositions into modules (see the section "Relax and Shake Algorithm" for details of the optimisation algorithm).
The simplest quantity one can establish across this two-dimensional space is the number of modules. Regions across which this value is stable represent more meaningful modules that may have real physical or biological meaning as rigid clusters or units of protein architecture.
Pauling's rule of parsimony suggests that the number of different kind of constituents in a crystal is small [34]. This suggests that in complex crystal structures, in which modules are likely to exhibit a degree of symmetry, it makes sense to minimise a more sophisticated quantity, namely the information content of a structure. The identification of modules corresponds to a compression if these modules contain symmetries or if the same module appears multiple times. We can calculate the amount of information I required to describe a given module structure in terms of the global degrees of freedom of that structure, and minimise this quantity over the space of d * and w.
In order to calculate I, consider M modules of M distinct types. The number of modules with one atom only is 0 ≤ M * ≤ M , and the number of modules with two atoms is 0 ≤ M * * ≤ M . To position and rotate the M modules relative to one another we need 6M − 6 degrees of freedom in general, with one degree less for every two-atom module, and three degrees less for every one-atom module. Now consider each of the M distinct modules: If we have n i > 2 inequivalent atoms in module i we need 3n i − 6 degrees of freedom to describe them. If we have n i = 2 inequivalent atoms in module i we need 3n i − 5 = 1 degree of freedom to describe them, which corresponds to the distance between the atoms.
If we have n i = 1 inequivalent atom in module i we need 0 degrees of freedom to describe it internally. The global number of degrees of freedom is then: The decomposition which has the minimum I gives us the most concise description of a structure. This minimisation of the description length is conceptually related to the idea of algorithmic information theory [35][36][37], as the symmetry operations and inequivalent atomic positions that form part of the compression can be thought of algorithms which allow us to reconstruct the original atomic structure. The length of the shortest such description is a quantitative measure of the structure's complexity. Because of the presence of crystal symmetries, we need to establish the modularity of the atomic network with high accuracy.
In addition, the modularity is highly degenerate; there is a greater than exponential number of distinct possible community structures, and many will have modularity values close to that of the global maximum [38]. Moreover, these structures may have very different topologies to that of the true partition, resulting in a large change in the compression achieved. We therefore employ an algorithm similar to that of the 'relax and shake' algorithm [15] or zero temperature basin hopping [39].

Relax and Shake Algorithm
The relax and shake (RASH) algorithm uses a repeated series of local modularity optimisations (relax) followed by the assignment of a small number of nodes into random communities (shake), in order to escape local maxima. The local optimisation follows existing work on community detection [40,41]. The modularity is optimised by moving each node in the network to the community of the neighbouring node resulting in the highest increase in the modularity (if > 0). This is then repeated until no further local optimisations increase the modularity. Following the local optimisation, a subset of the nodes (10%) are shaken into other communities within the network, and the local optimisation repeated. This continues until 200 consecutive relax-and-shake iterations have failed to improve the modularity. As an additional check on the solution, the modularity change resulting in merging any two communities is calculated; if this results in a modularity increase, the merge is performed, and the relax-and-shake iteration process is begun again using the new partition. The above can be considered a single optimisation step; following this, a larger subset of the nodes (20%) are shaken into either pre-existing, or new communities. The optimisation is then performed again. This shake-and-relax is performed until 200 iterations fail to improve the modularity.The whole process is repeated until three consecutive runs have failed to produce a community structure with a higher modularity. The degree of repetition is parameterisable, and allows us to have confidence in the community structure obtained (at the cost of speed).
If we partition a structure of N atoms into M multi-atom modules, so that typically M << N , and assume that the modules correspond to rigid clusters, then we reduce the dimensionality of configuration space from 3N −6 (atomic positions minus global translation and rotation) to 6M − 6 (as we have to specify a relative translation and rotation for each module) or less (if any of the modules have less than three atoms). We will show that this can be exploited in the first principles prediction of crystal structure.

Boron
Boron is known to form several allotropes [42], including α-B [43], β-B [44,45], and γ-B [46][47][48]. The structure of rhombohedral α-B is widely recognised as being made up of interconnected B 12 icosahedra (see Fig. 2). However, because the bonds between different icosahedra are shorter than the bonds within, simple thresholds on bond length will not yield the underlying modular structure of this crystal phase. This observation motivated the development of our current scheme, which, based on network modularity, does yield the scientifically agreed icosahedral modules.
The structure of β-B is much more complicated, and various models have been proposed, typically with 105 or 106 atoms per unit cell. We choose the 105-atom model of Geist et al. [45] for further investigation (see Fig. 3). Our modularity detection scheme identifies four icosahedra, two larger 25-atom modules with threefold cyclic point group symmetry C 3v , and one module with threefold dihedral symmetry D 3d of seven atoms. Two of the four icosahedra are slightly distorted, resulting in C 2v symmetry, rather than icosahedral I h symmetry. Of course, the decomposition of complex boron structures into compact and symmetric sub units is not unprecedented, see for example Figure 2 in Albert and Hillebrecht [42]. We emphasise, however, that our scheme performs the decomposition automatically and is suitable for integration into complex computational workflows.
Recently the structure of a high-pressure phase of boron, γ-B 28 , has been described in the literature [42,47,48] (see Fig. 4). A w s versus d * heat-map of I for the unweighted network reveals a global minimum at w s = 1.0 and d * = 2.0Å. This corresponds to two 14-atom modules with D 2h symmetry, which are icosahedra plus two atoms on either side. (see Fig.   4). Note that this contrasts with the conventional decomposition into two icosahedra and two dimers found in the literature [47], which is less favoured in our scheme as it corresponds to a higher value of I. Our approach offers a meaningful partition of this structure into modules, providing insight into the organisation and visualisation of this structure and opening the door to the systematic exploration of the structure space that neighbours this γ-B 28 allotrope.

Phosphorus
Like boron, phosphorous exhibits rich allotropism, from the highly metastable white phosphorous, to layered black phosphorous, and extremely complex fibrous, or layered, structures [49]. There is considerable current interest in two dimensional black phosphorous, or phosphorene [50] and other layered forms [51]. Here we investigate the crystal structure of red phosphorus [52], and attempt to identify a simple decomposition into modules using our current scheme. In the 42-atom primitive unit cell (space group P1) the modularity decomposition finds two modules that each occur twice (see Fig. 5). The bigger module has symmetry (C s ) and contains 13 atoms, while the smaller has C 2v symmetry and contains eight atoms. The relatively low degree of symmetry of red phosphorous means that the landscape of I with varying w s and d * is flatter, but our approach nevertheless finds a parsimonious decomposition of the crystal structure.

Metal-organic frameworks
We extend our framework to multi-species structures requiring only a definition of the relationship between d * and the interatomic distances used to determine the network connectivity. In principle, a separate d * could be defined for each pair of atomic species. However, this introduces the cost of exploring a higher dimensional space in the search for an optimal I. Instead, we define a single dimensionless parameter d * eff that specifies the distance threshold as the d * eff -fold multiple of the sum of the fixed atomic radii for a given pair of atoms. We apply this multi-species version of our approach to the metal-organic framework MOF-5, or Zn 4 O(BDC) 3 , where BDC 2− is 1,4-benzenedicarboxylate [53]. Metal-organic frameworks exhibit a vast range of structures and are of great interest because their porosity allows them to be used for the storage of gases, such as hydrogen, or carbon dioxide [54]. As can be seen in Fig. 7 our algorithm finds two similar decompositions with almost equally low I-values. The lowest minimum corresponds to a decomposition of the structure into six 16atom modules with D 2h symmetry, and two six-atom modules with tetrahedral symmetry (T d ). The 16-atom modules correspond to the BDC 2− molecules that are sometimes referred to as the 'struts' of metal-organic frameworks. This suggests that the decomposition derived automatically through our procedure is chemically meaningful.

Structure prediction
Ab initio random structure searching (AIRSS) [14,15] is a simple, and demonstrably effective, approach to first principles structure prediction. It has been applied to a wide range of systems, from the crystal structures of dense hydrogen [55] and hydrogen rich compounds [14], to matter under extreme compression [56], and interfaces [57]. The approach involves selecting initially random structures from distributions defined by physically motivated constraints (for example, density, composition, atomic distances, symmetry, molecular units or fragments). These random "sensible" [15] structures are fully relaxed (moved to the nearest local minimum in the energy landscape) under forces derived from DFT. Once a large number of computations have been performed the resulting structures can be ranked according to energy (free enthalpy) or any computable property of interest.
It has been a surprise to many that such a naive approach performs well, but the method's success is linked to intrinsic features of the first principles energy landscape, such as its smoothness (a result of the quantum mechanical interactions between the atoms and electrons) and the relatively large number of large energy basins. In a smooth energy landscape the size of the basins correlates with their depth (deep basins occupying a large volume of configuration space), there is a natural bias in random sampling, plus relaxation, to the stable, low energy and relevant structures. In what follows we exploit our new approach for the automatic decomposition of known structures into minimum I fragments to accelerate the search for complex structures by restricting the regions of configuration space that must be explored.

Application to dense boron
We generate 3303 initial random structures based on packing four 14 atom D 2h modules derived from γ-B 28 into unit cells with a randomly chosen shape, and the same density as γ-B 28 . The units are not permitted to overlap each other, or be closer than 1.63Å(the measured inter-icosahedral distance in a computed α-B structure), and are related to each other by symmetry. The symmetry is chosen at random from those space groups with four symmetry operators in the primitive cell. The random initial structures are then relaxed to nearby local energy minima (see Methods for computational details). Four of the initial structures relaxed to supercells of the Pnnm γ-B 28 structure, and three of them relaxed to a previously unreported structure with space group Pbcn and 56 atoms in a unit cell. This structure has a density very close to that of γ-B 28 , and is just 3 meV/atom less stable. In Fig. 6 it is shown that this near degeneracy persists over a wide range of pressures.
Our new 56 atom structure corresponds to a distorted hexagonal packing of boron icosahedra, whereas γ-B 28 corresponds closely to cubic packing. The small energy difference indicates that γ-B may be susceptible to polytypism or stacking disorder. The 3meV/atom energy difference between the hexagonal and cubic polytypes is small compared to the 25meV difference between cubic and hexagonal (carbon) diamond computed at the same level. The situation is very similar to that for α−B 12 , for which an alternatively packed structure of 24 atom and space group Cmcm has been identified [15], and further discussed [60,61]. For this reason the implementation of the modularity optimisation requires particular care in order to find maximally robust modular decompositions.
In the context of multi-species atomic structures, and particularly biological molecules, it can be valuable to consider weighted atomic networks, in which the edge weight scales with the interatomic distance. The simplest choice is a linear one: for d ij < d * and w ij = 0 otherwise, where d ij is the distance between atoms i and j, and K is an arbitrary constant. Increasing all edge weights by a constant factor leaves the modularity unchanged, so in practice this is chosen to ensure numerical precision. This choice of edge weighting reflects the fact that equilibrium bond lengths scale with the equilibrium bond energies.
Our approach can be applied in the context of proteins, similar to [29][30][31], where the vastness of configuration space has traditionally also been a difficult barrier to overcome.
The lack of symmetry in biological molecules mean that the information required to describe the structure is less useful than in crystal structures. Other measurements, such as the stability of the module number, can replace the information measurement as a criterion for assessing the quality of a modular decomposition in biomolecules. While methods for rigidity analysis in proteins already exist, such as the FIRST algorithm [62], the tuning parameter in our method allows for the detection of rigid clusters on a variety of length scales, making it a complementary approach. Like FIRST, our method could inform coarse-grained multi-scale molecular dynamics methods such as FRODA [63], supplying the rigid units described as 'ghost templates' in FRODA, and lead to more efficient computational models of protein dynamics.
In the context of complex crystal structures this approach has numerous potential applications. It first of all suggests an automatic coarse-graining and thereby provides an intuitive simplification and visual aid. The discovery of modules also facilitates the accelerated exploration of configuration space, particularly in the context of random structure search. This has been demonstrated by the new structure of boron closely related to γ-B 28 that we find using a module-based search. We believe that these results show the potential of atomic network analysis as a tool for materials discovery. Our algorithm is evidently fast enough for these purposes -the run time for calculating the metal-organic framework MOF-5 decomposition (106 atoms per unit cell) was 2.3 seconds on a 3.1 GHz Intel Core i7 processor.

METHODS
The density functional computations on boron were performed using CASTEP 17.2 [3], a full featured plane wave pseudopotential total energy code. The GGA-PBE [58] density functional was used, along with a legacy Vanderbilt ultrasoft pseudopotential [59], a plane wave cutoff of 240 eV, and a k-point sampling density of 0.1×2πÅ −1 , for the random searches.
The enthalpies were calculated using a higher precision default on-the-fly pseudopotential, 700 eV plane wave cutoff and a k-point sampling density of 0.03×2πÅ −1 .

AUTHOR CONTRIBUTIONS
SEA and CJP designed the study and methodology. SEA, CJP and WPG performed computations, generated data, and wrote the manuscript.

AUTHOR INFORMATION
The authors declare no competing financial interests.

DATA AVAILABILITY
The data associated with this manuscript is made available in Ref. [