An extended cluster expansion for ground states of heterofullerenes

It is challenging to determine the ground states of heterofullerenes due to the numerous isomers. Taking the C60-nBn heterofullerenes (1 ≤ n ≤ 4) as an example, our first-principles calculations with the isomer enumeration present the most stable structure of C57B3, which is energetically favored by 0.73 eV than the reported counterpart. It was difficult to conduct the enumeration for the isomers with n beyond 4 because of the expensive first-principle calculations. Here, we propose a nomenclature to enhance structural recognition and adopt an extended cluster expansion to describe the structural stabilities, in which the energies of the heterofullerenes with various concentrations are predicted by linear combination of the multi-body interactions. Unlike the conventional cluster expansion, the interaction parameters are derived from the enumeration of C60-nBn (n = 1~4), where there are only 4 coefficients to be fitted as a function of composition for the consideration of local bonding. The cross-validation scores are 1~2 meV per atom for both C55B5 and C54B6, ensuring the ground states obtained from our model are in line with the first-principles results. With the help of the structural recognition, the extended cluster expansion could be further applied to other binary systems as an effective complement to the first-principle calculations.

Since the discovery in 1985 1 , the C 60 fullerene has attracted great attentions due to various potential applications with its unique structure-dependent properties. The cage of C 60 fullerene, with the size which is large enough to be observed by transmission electron microscopy or scanning probe methods 2,3 , is likely to keep stable when built into molecular circuits as semiconductor materials [4][5][6][7] . Doping has been adopted as the conceivable way to alter its charge distribution and then tune the optical, electronic and magnetic properties in the solid state [8][9][10] , including the way of adding exohedral, endohedral and substitutional atoms 11 . As neighbors to carbon in the Periodic Table, boron and nitrogen with similar atomic radius are the popular choices as the heteroatoms for the substitution of one or more of the carbon atoms [12][13][14] . The C 60-n B n heterofullerenes with 1 ≤ n ≤ 6 were produced by Laser vaporization of a graphite pellet containing boron nitride powder 15 , which indicated that boron doped C 60 cage still appeared to be particularly stable. During the synthesis and characterization of C 59 N 16 , C 59 N in the vapor phase was found existing in monomer form as a molecular free radical 17 , where the single C 59 N heterofullerene molecule could be used as a new molecular rectifier in a double-barrier tunnel junction via the single electron tunneling effect 18 .
Theoretically, Kurita et al. found that the molecular structures of C 59 B and C 59 N maintained the cage of C 60 , which was distorted due to a large-size dopant such as Sulfur 19 . In addition to the calculation of C 59 B by the first-principles method 20 , the ground state geometries of C 60-n N n and C 60-n B n for 2 ≤ n ≤ 8 were screened using semiempirical MNDO, AM1, PM3, and ab initio methods 21 . As C 48 B 12 and C 48 N 12 are promising components for molecular rectifiers 22 , Garg et al. 23 reported a detailed study of structural, electronic and vibrational properties of B-doped heterofullerenes (C 60-n B n , for n = 1~12) based on ab initio calculations, concluding that the maximum number of boron atoms in a pentagon/hexagon ring was one/two. In general, it is difficult to determine the ground state of heterofullerenes due to two main obstacles: (i) only small amount of isomers for given compositions were considered; (ii) the optimized heterofullerenes largely depend on the initial geometry of numerous possible isomers 24 . Hence, there is still a lack of theoretical studies to search the energetically-preferred structures of heterofullerenes.
In this paper, we perform a systematic investigation of C 60-n B n (n = 1~6) based on the first-principles calculations with the congruence check, in which the structure recognition is achieved by a uniform numbering scheme for C 60 . Furthermore, an extended cluster expansion is proposed to estimate the total energies with all the possible pair, three-body, and four-body interactions derived from the enumeration of C 60-n B n (n = 1~4), however, there are only four coefficients to be fitted for the consideration of composition. We determine the ground state structures of C 55 B 5 and C 54 B 6 as confirmed by the first-principles calculations, indicating the possible application to other alloy systems.

Structure recognition
We adopt the systematic numbering scheme recommended by the IUPAC 25 to identify the vertices of the C 60 cage, as shown in Fig. 1a. Any atom in the C 60 fullerene cage (the coordinates are listed in Supplementary Table S1) has its unique sequence number (SN). An isomer of C 60-n B n heterofullerenes is denoted by an index consisting of the ascending ordered SNs of the substituted vertices, i.e. σ σ σ … ( , , , ) n 1 2 which is called structural index (SI). According to the congruence check, the total number of the C 60-n B n (n ≤ 4) heterofullerenes isomers is 4,517, which is only about 1% of the corresponding combination number (see Supplementary Table S2). All the total energies of these candidates are obtained by the first-principles calculations (details in Supplementary  Information), where all the calculated structures are fully relaxed without any symmetry constraint. The ground state structures with the energy profiles are shown in Fig. 1. There are 23 different isomers (shown in Table 1) for C 58 B 2 , which is in agreement with earlier calculations 23,26,27 . As shown in Fig. 1b, the global minimum structure of C 58 B 2 is the cage with two boron atoms at the opposite sites of a hexagon ring. The ground states of C 57 B 3 and C 56 B 4 , shown in Fig. 1c and d, respectively, have the similar pattern that all boron atoms are at the opposite vertices of the hexagon rings adjacent to each other, where there are no more than two boron atoms on a hexagon ring. The ground state structures of C 58 B 2 and C 56 B 4 are in agreement with those previous calculations 21,23 . However, the total energy of the most stable C 57 B 3 is 0.73 eV lower than that of the one proposed by Garg et al. 23 , which ranks 126 th according to our enumeration of the 303 isomers as shown in Fig. 1c.  Supplementary Table S1) has its unique sequence number (SN). The arrow indicates the direction of the numbering commencement. (b) The energies of C 58 B 2 isomers. The lowest, the 2 nd and 3 rd lowest energetic structure of C 58 B 2 , denoted by (1,7), (1,23) and (1,41), respectively are shown. (c) The energies of C 57 B 3 isomers. The cage with the lowest and the 2 nd lowest energy denoted by (1,7,11) and (1,7,28), respectively are shown. The cage denoted by (1,14,42) is the previously reported 23 ground state of C 57 B 3 which has a higher energy of 0.73 eV than the global minimal state (1,7,11). (d) The energies of C 56 B 4 isomers. The global minimal energetic structure, the 2 nd and 3 rd lowest energetic cage, denoted by (1,7,11,24), (1,7,32,35) and (1,6,11,18), respectively, are shown. In (b), (c) and (d), we only show the relative energies lower than 0.8 eV. Buckminsterfullerene, i.e., the only isomer fulfilling the isolated pentagon rule (IPR) 28 , is a spherical molecule with 60 carbon atoms at vertices, containing 32 faces including 20 hexagons and 12 pentagons where no pentagon shares a vertex 29 . Treating C 60 as a semiregular polyhedron and considering all the symmetry operations represented by the 120 symmetry matrices (SMs), we obtain a 60 × 120 matrix called numbering matrix (NM) (available in the Supplementary Dataset file), in which the n th row lists the coincident atoms for the n th atom under all the symmetry operations and the n th column contains the corresponding coincident atoms for all the 60 atoms under the operation of the n th SM.
Herein, based on the NM of C 60 , we propose a nomenclature for the C 60-n B n heterofullerenes. The flow chart of our structure recognition method is shown in Supplementary Fig. S1. The detailed information of the structure recognition method is available in the Supplementary Information. We have deduced all the SIs of the inequivalent structures of C 60-n B n heterofullerenes for 2 ≤ n ≤ 10, the numbers of the SIs are listed in Supplementary  Table S2, which is in good agreement with the previous results 24,30,31 . The inequivalent structures can also be singled out by our recently developed structure recognition method 32 . Note that our nomenclature is derived from the symmetry operation matrices, which can be obtained according to the coordinates of the system with the corresponding symmetry operations. Thus, this nomenclature can be extended for the C 60 with non-IPR isomers, as well as the larger fullerenes e.g. C 70 and C 82 .

Extended cluster expansion method
Combined with the isomer enumeration and the first-principles calculations, we have determined the ground state structures of C 60-n B n heterofullerenes with 2 ≤ n ≤ 4. For those heterofullerenes with higher boron concentration, we can enumerate all the isomers by the recognition method discussed above. However, it will be over expensive to search the ground state structures with the first-principles calculations, because there are 45,718, 418,470, 3,220,218, 21,330,558, 123,204,921 and 628,330,629 isomers for C 55 B 5 , C 54 B 6 , C 53 B 7 , C 52 B 8 , C 51 B 9 and C 50 B 10 , respectively. Analogue to the conventional cluster expansion(CE) 33 , an extended cluster expansion (ExCE) method is discussed aiming at this problem in the following.
As is known, the CE method is an efficient tool for studying structural properties of any binary structures over a wide range of concentrations [34][35][36][37][38] , parameterizing the total energy for any given configuration of A x B 1−x (0 ≤ x ≤ 1) to avoid the expensive cost of the first-principles calculations. The enthalpy of formation for a certain configuration  s is described exactly by a set of multi-body interaction parameters J i combined as the form of an Ising-like Hamiltonian 37 , which is often approximated as a polynomial function of occupation variables, where the summation is over all the non-equivalent clusters, a set of sites i denoted by α, and the average is taken over all the clusters α that are equivalent to α by symmetry. The coefficients are defined as effective cluster interaction (ECI) parameters, and m α is the number of the clusters equivalent to α.
In general, the total energies of given configurations are described by the combine of single-atom contributions, pair interactions and multi-body interactions, which are expected to gradually converge as more interactions are considered. However, fitting with larger number of parameters will be also time expensive. To balance the accuracy and efficiency of CE method, the number of effective multisite interactions can be greatly reduced 39 , while the combinations of possible effective interactions result in another global optimization. Herein, we attempt to derive the multi-body interactions taking account of the total impurity concentration and fit the total energy with fewer parameters for higher accuracy.
An isomer of C 60-n B n heterofullerenes, taking the carbon atoms as the background, can be viewed as a cluster of boron atoms denoted by  σ σ σ ( , , , ) n 1 2 . We suppose the total energy be attributed to all the subclusters, which are enumerated for fitting the total energy. Firstly, there is only one isomer for C 59 B, therefore the energy difference of C 59 B relative to C 60 is E 1 = E tot − E 0 = 3.39 eV, where E tot and E 0 are the total energy of C 59 B and C 60 , respectively. The energy difference E 1 is responsible for the reaction heat when one C atom is substituted by one boron atom, which can be considered as the single-atom contribution in the expansion. For a C 58 B 2 isomer whose subclusters are two equivalent boron single dopants, the environment of each boron atom is different from that of C 59 B, thus the energies can be expressed as E 0 + 2c 1 E 1 with the coefficient c 1 as a function of boron concentration. Based on all the energies of C 58 B 2 isomers, we fit this coefficient c 1 to be 0.980 with the average deviation of 0.208 eV. Similar to the CE method, the fitting quality is determined by the cross-validation (CV) score 39 , ∑ = − DFT and Ê i denote the DFT calculated and predicted energy of a particular structure i. The deviation of E i DFT from Ê i is taken as the interactive energy of the corresponding boron cluster. For C 58 B 2 , The 23 fitting deviations are the B-B interactions in the 23 isomers, respectively.
Similarly, the total energy of a C 57 B 3 isomer denoted by  σ , is contributed by three single dopants and three pair interactions. For example, the isomer of C 57 B 3 denoted by (1,7,11) can be expanded as 6 clusters including 3 singles and 3 pairs. The initial SIs of the subclusters are listed in Fig. 2a along with their smallest SIs by our recognition method. Apart from the singles, the isomer denoted by (1,7,11) has the 3 subclusters denoted by (1, 7), (1,7) and (1,24). Hence we express the total energy as E cE c E 3 interactions, and 4 triplet interactions. For example, the subclusters from the expansion of the isomer of C 56 B 4 denoted by (1,7,11,24) are listed in Fig. 2b. Analogously, we obtain the 4,190 4-body interactions when the energies of all C 56 B 4 isomers are fitted by E 0 , 3E 1 , the 2-body and 3-body interactions, in which the average fitting deviation is 0.065 eV and the coefficients c 1 ,c 2 and c 3 are 0.966, 0.690 and 0.640, respectively.
As shown above, the coefficients reflect the boron atom's concentration and the fitting deviations are attributed to the multi-body interactions. The fitting coefficients and CV scores for C 58 B 2 , C 57 B 3 and C 56 B 4 are listed in Table 2. It shows that the introducing of multi-body interactions will improve the accuracy of cluster expansion and the interactions will decrease as boron dopants increasing. For example, the coefficients c 1 is from 0.980 to 0.966, and c 2 is from 0.782 to 0.690. Fig. 3 shows the statistical distributions for the 2-body, 3-body and 4-body interactions. Compared to the 3-body and 4-body interactions, the 2-body interactions distribute in a wider range. The 4-body interactions exhibit the similar characteristics of normal distribution around zero point. It can be inferred that the interactions of 2-body and 3-body are much more important than that of 4-body or other multi-body interactions, hence the fitting will also reach a rather good convergence even if only 2-body 3-body interactions are considered.
Herein, we propose an extended cluster expansion for the C 60-n B n heterofullerene, where the energy of isomer denoted by  σ is expressed asÊ   Table 2. Cross-validation (CV) score versus fitting cutoff. Fitting cutoff refers to the number of atoms in the largest subcluster used for the fitting. We perform three kinds of fittings for C 55 B 5 and C 54 B 6 in which the largest subcluster are pair, triplet and quadruplet, respectively.
where  E ( ) DFT σ and  σ E( ) are the DFT calculated energy and the predicted energy, respectively, and E n denotes the fitting deviation which is supposed to be responsible for the n-body interaction of the boron cluster denoted by  σ . The predicted energy is as follows, where the summation runs over all possible sizes of the subclusters of  σ . The first term E 0 represents the energy of C 60 , and E i denotes the total effective energy contribution from all the clusters with i heteroatoms, as is expressed as below where the summation is over all those subclusters consisting of i heteroatoms, i.e. α ≤ ≤ C 1 n i . Different from the conventional CE, the multi-body interactive energies E i apart from E n , should be multiplied by different combination coefficients c i before they make contribution to the total energy of the C 60-n B n heterofullerene cages, where the coefficients are obtained by fitting the DFT-calculated energies of the selected C 60-n B n heterofullerenes with those E i for i < n. To balance the accuracy and computation cost, we set 4 as the maximum value of the summing index in Eq. (4) for the cages of C 60-n B n where n ≥ 5.
We show the flow chart of our method in Fig. 4 and make a detailed description for the process in searching the ground state structures for the C 55 B 5 cage.  The fitting ultimately reaches a rather reasonable convergence after several hundreds of structures with the lowest predicted energy are calculated. Supplementary Table S3 and Table S4 shows the variations of coefficients for C 55 B 5 and C 54 B 6 , respectively, as a function of the number of isomers calculated by the first-principles method. Note that the coefficients of c 2 and c 3 are around 0.6 and 0.4 respectively, while the coefficient c 4 is approaching zero. Similar to the conventional CE, the energy of interatomic bonds is usually dominated by short-range interactions 39 . On the other hand, enormous interactions would be introduced when we use the C 60-n B n with higher boron concentration for cluster expansion, which will be expensive in computational cost. Note that any other binary systems can be similarly searched by the extended cluster expansion, where the appropriate cutoff of the size of the subcluster should be carefully made to balance of the accuracy and computation cost. The nomenclature and extended cluster expansion can be also applied for the ternary systems, where different atoms are distinguished from the candidates found in the binary systems.
Application to C 55 B 5 and C 54 B 6 According to the structure recognition, there are 45,718 and 418,470 inequivalent structures for C 55 B 5 and C 54 B 6 , respectively. Using the method discussed above and following those steps, we have made a prediction for the ground state structures of C 55 B 5 and C 54 B 6 , where the energy profiles are shown in Fig. 5 (detailed in Supplementary Table S5). The optimized fitting coefficients of C 55 B 5 are adopted for the initial combination coefficients for C 54 B 6 . As the fitting steps move forward, most of the energies of the newly added isomers in each fitting iteration gradually increase. However, the 100 th lowest energy of the fitting iteration decreases rapidly and eventually converges, after 6/8 fitting iterations for C 55 B 5 /C 54 B 6 . The minimum energetic isomers for both C 55 B 5 and C 54 B 6 emerge in the first iteration. The optimized fitting coefficients were obtained for C 55 B 5 /C 54 B 6 after the energies of the selected 600/800 isomers are calculated and fitted, for which the results are listed in Table 2. The CV score of the final fitting is 0.064/0.124 eV for C 55 B 5 /C 54 B 6 , and the largest deviation of total energy between ExCE method and DFT calculations is 0.192/0.403 eV for C 55 B 5 /C 54 B 6 , indicating that the fitting energies are in good agreement with the DFT calculations. For both C 55 B 5 and C 54 B 6 , the coefficients c 1 are close to 1, implying that single boron atom does make an important contribution despite of the concentration. The coefficient c 4 is much smaller and the quadruplet interactions play a trivial role in the ExCE calculations, since the fitting will be in good accuracy when the pair and triplet interactions are considered in the energy predications of C 60-n B n for n ≥ 5. This is consistent with the above assumption that the energy of interatomic bonds should be usually dominated by short-range interactions.
For C 55 B 5 , the ExCE energy versus the computational energy is shown in Fig. 6a. The putative ground state is (1,7,11,24,27). The five heteroatoms are located at the 5 apposite sites of 5 hexagon rings and make up of a pentagon which encloses a carbonic pentagon ring, with the similar pattern of the ground state of C 60-n B n for 2 ≤ n ≤ 4. The next preferred positions for boron atoms are (1,7,11,32,35), with a total energy of 0.32 eV higher. It was reported in ref. 23 that the minimal energetic structure for C 55 B 5 was (1,7,18,51,59). In contrast, this structure is found to be 253 rd in the stability ranking and higher in energy by 0.68 eV than our minimal energetic structure.
For C 54 B 6 , the putative lowest energy is from (1,6,11,18,24,27) which is shown in Fig. 6b. In this isomer, 4 boron atoms are at the consecutive opposite sites and the other two are at the isolated opposite sites. The next most favorable configuration is (1,7,11,16,24,36). Chen et al. 21 reported that the global minimum structure was (1,6,9,12,15,18), but from our result, this structure has a higher energy of 0.34 eV relative to our minimal energy and ranks 100 th in our ascending order list of the total energies. Garg et al. 23 predicted the minimal energetic cage for C 54 B 6 was (1, 6, 11, 18, 27, 31), now in our calculation, this structure rates the 340 st in the ranking of stability and is less stable by 0.52 eV with respect to our minimal energy structure. . Flowchart for the extended cluster expansion (ExCE) method. The selection of the coefficients for the initial linear combination have a certain arbitrariness, which may bring about the different 100 lowest energy isomers for the first fitting iteration but not affect the final convergence and the searching for the putative ground state isomer. Here, we set 100 as the number of the new added isomers for each iteration, which may differ for different iteration and alloy systems.

Summary
We have developed a nomenclature to enhance structural recognition and adopted an extended cluster expansion to describe the structural stabilities, which is good agreement with the results from the first-principles calculations. Unlike the conventional cluster expansion, the interaction parameters are derived from the enumeration of C 60-n B n (n = 1~4), where there are only 4 coefficients to be fitted for the composition consideration. Notably, we have found the stable isomers of C 57 B 3 , C 55 B 5 , and C 54 B 6 , which are energetically favored by at least 0.3 eV than the reported counterparts. With the symmetry operation matrices, the nomenclature can be applied for other binary/ ternary systems, where the ground state structures are searched with the extended cluster expansion. Thus, our finding will be an effective complement to the first-principles calculations in materials science.  The predicted and DFT calculated energies for C 55 B 5 isomers. The putative ground state and the isomer with the 2 nd lowest energy for C 55 B 5 are denoted by (1,7,11,24,27) and (1,7,11,32,35), respectively. The ground state of C 55 B 5 reported previously 23 is marked as (1,7,18,51,59). (b) The predicted and the DFT calculated energies for C 54 B 6 isomers. The ground state and the isomer with the 2 nd lowest energy for C 54 B 6 are denoted by (1,6,11,18,24,27) and (1,7,11,16,24,36), respectively. The two points marked as (1,6,9,12,15,18) and (1, 6 11, 18, 27, 31) represent the reported ground states by Chen 21 and Garg 23 , respectively.