Introduction

Machine learning (ML) techniques are gaining popularity in estimating the properties of materials,1,2,3,4,5,6,7,8 inspired by high-throughput first-principles calculations based on density functional theory (DFT).9,10,11 This new tool enables researchers to explore compounds in the vast chemical space in a reasonable amount of time. The key problem for ML methods applied in materials science is the construction of descriptors, which map the structure and composition to a fixed-length vector. To ensure that the feature vectors of arbitrary crystal systems have the same size, early studies usually considered the overall properties of the compounds,12 while recent advances have treated materials as crystal graphs with atomic properties encoded.4,5,7 Despite the differences in approach, these methods all show good performance in predicting crystal and/or molecular properties, such as formation energy and elastic moduli.

However, for properties like the band gap (Eg), standard DFT calculations show a systematic underestimation compared with experiments.13 To bridge the gap between theoretical calculations and experimental values, ML models based on eXtreme Gradient Boosting (XGBoost), random forest (RF), and support vector machine (SVM) have recently been applied to directly estimate the experimental Eg14 and the superconducting critical temperature (Tc).15,16 Nevertheless, obtaining the best predictive performance requires well-constructed features, and the choice of manually constructed descriptors is quite arbitrary: it is hard to explain why those features are chosen rather than others. Here, we treat compounds as atom tables (AT) and propose a generic framework called atom table convolutional neural networks (ATCNN) to predict experimentally obtained compound properties. In this framework, the descriptors are learned by the network itself, and no additional prior knowledge, such as atomic properties or the underlying physics, is utilized except for the composition. Although the detailed structural parameters are ignored, for a specific composition the experimental structure is often determined and unique; in other words, the composition is entangled with the crystal structure. Therefore, our approach also has the advantage of circumventing the difficulty of obtaining accurate atomic positions from experiments.

Under the ATCNN framework, we have constructed ML models to predict the experimental Tc, formation energy (Ef), and Eg. The performance of these models is greatly improved compared to previous models. For the Tc prediction, to avoid systematic bias, we propose a data-enhancement method that enables the model to distinguish superconductors from non-superconductors. Utilizing the well-trained model, we have screened out dozens of compounds from existing materials databases that are potential high-Tc materials. For the prediction of Eg, the accuracy of our model exceeds that of hybrid functional calculations, which are considered accurate in calculating Eg,17 which means the ATCNN model has great application prospects in the semiconductor industry.

Results

Before the emergence of deep learning, researchers spent a lot of time constructing appropriate feature vectors to obtain ML models with superior performance. Deep learning algorithms instead learn high-level features from raw data and directly output the target properties. This end-to-end approach has achieved great success in image recognition,18,19 speech recognition,20,21 and machine translation.22 However, in the field of materials science, feature engineering is still one of the most important aspects of building ML models.4,12 The method of manually constructing features and then predicting the properties from those features is called non-end-to-end learning, which is not only inefficient but usually fails to achieve optimal performance21 owing to improperly constructed features. Here, we develop an end-to-end framework, ATCNN, to directly predict the target properties. In this approach, no prior knowledge other than the composition is used. The detailed construction of the ATCNN model is presented in the "Methods" section, and a typical ATCNN model is shown in Fig. 1.

Fig. 1

Schematic diagram of the ATCNN model for Tc prediction. A typical ATCNN model contains one input layer (atom table), one output layer (compound property, CP), several convolutional layers (Conv), one pooling layer (Pool), and several fully connected layers (FC). The sizes of the blue, teal, and cyan colored Conv kernels are 5 × 5, 3 × 3, and 2 × 2, respectively. For the Pool layer, the max-pooling method is used with a size of 2 × 2. The detailed hyperparameters are presented in Table S1

The experimental data for Tc, Eg, and Ef are extracted from the SuperCon database,24 previous literature,14 and the Open Quantum Materials Database (OQMD),25 respectively. The atomic numbers of the elements involved are within 86, covering the elements of the first six periods. For the Eg and Ef data, a total of 3896 and 5886 compounds are included, respectively. These data are used without further screening, and each data set is randomly divided into a training set (80%) and a test set (20%) (see Table S4 in the Supplementary Information). However, in the SuperCon database, materials of the same composition often have different Tc values obtained from different experiments. For example, at ambient pressure H2S is not a superconductor, but under high pressure it has a very high Tc.26 Because of this, the database contains two very different Tc values for H2S, namely 185 and 60 K. To avoid confusion, for a compound with multiple Tc values, the material is removed if the maximum exceeds twice the minimum; otherwise the average is taken as the Tc value of this material and the duplicate entries are discarded. In addition, we have also removed unreasonable data, including the unconfirmed 2D high-Tc superconductor HWO3,27 and entries with the coefficient of an element in the chemical formula >50, such as "Hg1234O10+z", or with uncertain oxygen content, such as "Yb16Ba1Cu2Oz".
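The deduplication rule above can be expressed compactly. The following is a minimal sketch in Python/pandas, assuming a hypothetical data frame with `formula` and `Tc` columns; it is not the exact cleaning script used for this work.

```python
import pandas as pd

def deduplicate_tc(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse repeated measurements of the same composition.

    If the largest reported Tc exceeds twice the smallest, the reports are
    considered contradictory and the compound is dropped; otherwise the mean
    Tc is kept as a single record.
    """
    records = []
    for formula, group in df.groupby("formula"):
        tc = group["Tc"]
        if tc.max() > 2 * tc.min():
            continue  # contradictory reports (e.g. ambient- vs high-pressure H2S)
        records.append({"formula": formula, "Tc": tc.mean()})
    return pd.DataFrame(records)
```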

In the cleaned Tc dataset, Hg, MgB2, FeSe, and YBa2Cu3O7 are set aside to test the generalization ability of the ATCNN model; they are typical representatives of elemental superconductors, conventional BCS superconductors, iron-based superconductors, and copper-based superconductors, respectively. Before splitting into training and test sets, the related compounds, including Hg, MgxBy, FexSey, and YBa2CuxOy, are removed from the cleaned dataset, where x and y denote the content of the corresponding elements. Therefore, compounds like MgB2, Mg0.31B0.69, and Mg0.9B2 are all removed. The cleaned data set contains 13,598 superconductors, which is divided into a training set (80%) and a test set (20%). When determining model hyperparameters, 20% of the training data is used to validate the model, and each hyperparameter is determined by multiple tests. On the validation set, the model with five Conv layers performs best, as shown in Fig. S2. In addition, the number of kernels in each Conv layer, the kernel size, the number of FC layers, and the number of neurons in each FC layer are also tested. After hyperparameter optimization, the structure of the ATCNN-I model is determined, as shown in Fig. 1. In ATCNN-I, each Conv layer contains 64 kernels, and the sizes of the hidden layers are 200 and 100 (see Table S1 in the Supplementary Information), respectively. To avoid overfitting, the dropout method28 is used and the dropout rate of the FC layers is set to 0.2. The training process is terminated after 500 epochs, because the error of the model on the validation set decreases before 500 epochs and then remains almost unchanged, or even tends to increase, as shown in Fig. S1. The test results are summarized in Table 1 and shown in Fig. 2c.
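For readers who want a concrete picture of the ATCNN-I topology, the following Keras sketch assembles a model with the layer sizes quoted above (five 64-kernel Conv layers, one 2 × 2 max-pooling layer, FC layers of 200 and 100 neurons with 0.2 dropout, and a ReLU output). The kernel-size ordering, padding mode, optimizer, and batch size are illustrative assumptions, not the exact settings of Table S1.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_atcnn_tc(kernel_sizes=(5, 3, 3, 2, 2)):
    """Sketch of an ATCNN-like Tc regressor (hyperparameters partly assumed)."""
    model = models.Sequential()
    model.add(layers.InputLayer(input_shape=(10, 10, 1)))      # 10x10 atom table
    for k in kernel_sizes:                                     # five Conv layers, 64 kernels each
        model.add(layers.Conv2D(64, k, padding="same", activation="relu"))
    model.add(layers.MaxPooling2D(pool_size=2))                # single 2x2 max-pooling layer
    model.add(layers.Flatten())                                # concatenate feature maps into v
    model.add(layers.Dense(200, activation="relu"))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(100, activation="relu"))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(1, activation="relu"))              # ReLU output keeps Tc >= 0

    def rmse(y_true, y_pred):                                  # RMSE loss, as used for Tc
        return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

    model.compile(optimizer="adam", loss=rmse, metrics=["mae"])
    return model

# model = build_atcnn_tc()
# model.fit(x_train, y_train, validation_split=0.2, epochs=500, batch_size=64)
```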

Table 1 Statistical summary of the prediction performance of ATCNN-I and ATCNN-II on superconductors
Fig. 2

Detailed results of ATCNN-I and ATCNN-II models on superconductors. The test set only contains superconductors. a Statistics of the compounds in the test set. The blue, red, green, and gray bars indicate the number of compounds containing Cu, Fe, Cu–Fe, and others, respectively. b The distributions of Tc of the compounds in a. c Comparison of the ATCNN-I model predicted Tc against the experimental Tc in the test set. d Comparison of the ATCNN-II model predicted Tc against the experimental Tc in the test set

On the test set, the MAE, RMSE, and coefficient of determination (r2) are 4.12 K, 8.14 K, and 0.97, respectively. The overall performance of the ATCNN-I model is much better than that of the previous fine-tuned RF model (which has an r2 of nearly 0.88)15 and the XGBoost model (which has an r2 of 0.92 and an RMSE of 9.5 K).16 In addition, except for Hg, the Tc values predicted for the independent dataset (Hg, MgB2, FeSe, and YBa2Cu3O7) are nearly identical to the experimental results (see Table 2), showing that the ATCNN-I model has strong generalization ability. For Hg, its superconducting behavior can be learned from Hg-containing compounds in the training set, such as HgSr2Cu10O4, Hg0.76Tl0.76BaCuO4.5, and HgBa2CuO4.19.
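The error measures quoted throughout (MAE, RMSE, and r2) can be reproduced from predicted and measured Tc values with standard tools; a minimal sketch using scikit-learn is shown below, with `y_true` and `y_pred` standing in for the experimental and predicted values.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """MAE, RMSE, and coefficient of determination r^2, as used in the text."""
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    return mae, rmse, r2
```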

Table 2 Comparison of experimentally measured Tc with the values predicted by ATCNN-I and ATCNN-II models

However, like the previous models, ATCNN-I has difficulty distinguishing between superconductors and non-superconductors, because the training data only contain superconductors. To fix this problem, we add 9399 energetically stable insulators with DFT band gaps larger than 0.1 eV to the 13,598-superconductor data set; 80% of them are mixed into the training set, while 20% of them are added to the test set. These insulators are extracted from the Materials Project repository9 and are treated as non-superconductors. The full training and test data sets are listed in Table S5. After retraining, we obtain the ATCNN-II model. As shown in Fig. 2 and Tables 1 and 2, ATCNN-II performs similarly to ATCNN-I on superconductors, with an MAE, RMSE, and r2 of 4.12 K, 8.19 K, and 0.97, respectively. Both models are capable of capturing the change in Tc when the composition changes slightly, as for YBa2Cu3O7 and YBa2Cu3O6.6. The biggest difference between ATCNN-I and ATCNN-II is in predicting the non-superconductors. In Figs. S3, S4 and Table S5, the predictive performance of these two models on superconductors and non-superconductors is presented. On the full test data set, which contains 2720 superconductors and 1880 non-superconductors, the MAE and RMSE of the ATCNN-I model increase to 8.76 and 10.45 K, while those of the ATCNN-II model decrease to 3.17 and 6.91 K. The reason is that ATCNN-I treats all compounds as superconductors and predicts non-zero Tc values even for insulators, whereas ATCNN-II gives Tc values of exactly zero for most insulators and non-superconducting metals such as the alkali metals. If compounds with predicted Tc > 0 and Tc = 0 are classified as superconductors and non-superconductors, respectively, 8.9% of the superconductors and 2.2% of the non-superconductors are misclassified by ATCNN-II, while 2.9% of the superconductors and 99.6% of the non-superconductors are misclassified by ATCNN-I on the full test set, as shown in the confusion matrices in Fig. S3. The ability to distinguish between superconductors and non-superconductors can greatly increase the efficiency of searching for new superconductors. For example, from 20,574 energetically stable materials in the Materials Project database, we screen out 20 materials with large predicted Tc (see Table S6). These selected compounds are potential high-Tc materials.
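A minimal sketch of this data-enhancement step and of the Tc-based classification rule is given below; the column names and the numerical tolerance are illustrative assumptions rather than the exact implementation.

```python
import numpy as np
import pandas as pd

def build_augmented_dataset(superconductors: pd.DataFrame,
                            insulators: pd.DataFrame) -> pd.DataFrame:
    """Label DFT-stable insulators (gap > 0.1 eV) as non-superconductors with Tc = 0
    and merge them with the superconductor records."""
    insulators = insulators.assign(Tc=0.0)
    return pd.concat([superconductors[["formula", "Tc"]],
                      insulators[["formula", "Tc"]]], ignore_index=True)

def classify_from_tc(tc_pred, eps: float = 1e-6) -> np.ndarray:
    """Call a compound a superconductor if the predicted Tc is (numerically) > 0."""
    return (np.asarray(tc_pred) > eps).astype(int)  # 1 = superconductor, 0 = non-superconductor
```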

To further analyze the model, the 2720 superconductors in the test set are divided into four groups: the first group (I) has 1122 materials, all of which contain Cu; the second group (II) has 287 materials, all of which contain Fe; the third group (III) has 69 materials, all of which contain both Cu and Fe; the rest are classified as the fourth group (IV), with a total of 1242 materials. The first, second, and fourth groups roughly represent copper-based superconductors, iron-based superconductors, and conventional BCS superconductors, respectively. Their statistical distributions are shown in Fig. 2a, b. The predicted results for each group are summarized in Table 1. It can be seen from the statistical results that both ATCNN models perform better on groups I and IV, but worse on groups II and III. The possible reason is that the number of iron-based superconductors in the data set is much smaller than that of conventional BCS superconductors and copper-based superconductors. Therefore, caution is warranted when predicting the Tc of iron-based superconductors.
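The four-group split is a simple element-presence test on the chemical formula; a sketch is given below, where the element-parsing regular expression is an illustrative assumption.

```python
import re

def contains_element(formula: str, symbol: str) -> bool:
    """True if `symbol` appears as an element in the chemical formula."""
    return any(el == symbol for el in re.findall(r"[A-Z][a-z]?", formula))

def assign_group(formula: str) -> str:
    has_cu, has_fe = contains_element(formula, "Cu"), contains_element(formula, "Fe")
    if has_cu and has_fe:
        return "III"  # contains both Cu and Fe
    if has_cu:
        return "I"    # copper-based
    if has_fe:
        return "II"   # iron-based
    return "IV"       # others (roughly conventional BCS superconductors)
```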

In the prediction of crystal properties, electronic structures are often used to test the performance of ML models.4,5,6,7,33,34 As a general framework, the ATCNN model can also be applied to other experimental data, such as Eg and Ef. Owing to the small amount of data, only one (two) Conv layer(s), one Pool layer, and two hidden layers are used for the prediction of Ef (Eg). The detailed network hyperparameters are listed in Tables S2 and S3. Figure 3a, b shows the comparison of the experimental Eg and Ef with the predictions of the ATCNN model. It is clear that the ATCNN model achieves excellent agreement between the experimental data and the predicted values on the test set. For the Ef prediction, the MAE and r2 are 0.078 eV/atom and 0.99, while the MAE of DFT calculations with respect to experimental measurements is 0.081–0.136 eV/atom,11 which means that the accuracy of the ATCNN model exceeds that of the DFT calculations. In addition, for the same training data size (4708) for Ef, the performance of the ATCNN model is much better than that of the structure-free ElemNet model (MAE of about 0.15 eV/atom).33 We attribute the better performance of the ATCNN model to its unique network structure, since different network structures lead to different solutions. Because of its fully connected structure, the ElemNet model treats the relationships between elements as equivalent and requires a lot of data to learn the unique relationships between elements; in the ATCNN model, because of the convolutional structure, the relationships between elements are naturally differentiated. Besides, for the Ef prediction, the ATCNN model is much smaller than the ElemNet model and does not overfit easily. Therefore, the ATCNN model performs better than the ElemNet model when the dataset is small.

Fig. 3

a Comparison of the ATCNN model predicted Eg against the experimental Eg. b Comparison of the ATCNN model predicted Ef against the experimental Ef. c ROC curve for the metal/insulator classification model with an AUC of 0.97. d Confusion matrix of the metal/insulator classification model

For the Eg prediction, the MAE of the ATCNN model is 0.307 eV, while the MAE of the CGCNN model is 0.388 eV,5 and the MAE between standard DFT calculations and experimentally measured values is 0.6 eV,35 demonstrating the superior performance of our model. Compared with Ef, the prediction of Eg is less accurate, with an r2 of 0.94. Nevertheless, the performance of the ATCNN model is better than that of the gradient-boosting decision tree model based on property-labeled materials fragment descriptors, which has an r2 of 0.90,4 and the SVR model using the same experimental data, which achieves an r2 of 0.90.14 To quantify the capabilities of the ATCNN model, we apply the trained model to a set of specific compounds that have been studied intensively at different levels of theory and are often used to benchmark new methods.36,37,38 The predicted results for the selected compounds are listed in Table 3. Among these methods, the PBE-calculated band gaps (Eg,PBE) differ greatly from the experimental values and are all underestimated. The GW approximation36 is the most accurate in calculating Eg, with an MAE of 0.22 eV and an RMSE of 0.33 eV. However, GW-type calculations are not currently amenable to high-throughput screening because of their expensive computational cost. The most effective way to perform high-throughput screening is to predict Eg with the ATCNN model, because it is both accurate (with an MAE of 0.25 eV and an RMSE of 0.58 eV, second only to the GW calculations) and fast. In addition, the architecture of the Eg prediction model can be reused for metal/insulator classification. For the sake of comparison, the same dataset as used for the SVC model,14 which contains 2458 unique insulators and 2458 metals, is used to train and test the classification model. The performance of the ATCNN classification model is characterized by the receiver operating characteristic (ROC) curve and the confusion matrix, as shown in Fig. 3c, d. The area under the ROC curve (AUC) is 0.97, the same as for the SVC model. From the confusion matrix, it can be seen that the ATCNN model is more accurate in classifying metals (91.2% vs. 88.8%), but slightly worse in classifying semiconductors (90.5% vs. 95.2%) than the SVC model.
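The ROC curve, AUC, and confusion matrix in Fig. 3c, d can be computed from the classifier's predicted class probabilities; the following scikit-learn sketch assumes hypothetical arrays `y_true` (0 = metal, 1 = insulator) and `y_score`, and a 0.5 decision threshold.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, confusion_matrix

def classification_report(y_true, y_score, threshold=0.5):
    """ROC/AUC and confusion matrix for a metal/insulator classifier."""
    fpr, tpr, _ = roc_curve(y_true, y_score)               # ROC curve points
    roc_auc = auc(fpr, tpr)                                 # area under the ROC curve
    y_pred = (np.asarray(y_score) >= threshold).astype(int) # hard labels at a fixed threshold
    cm = confusion_matrix(y_true, y_pred)                   # rows: true class, cols: predicted
    return fpr, tpr, roc_auc, cm
```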

Table 3 Comparison of experimentally measured band gap (Eg), with values calculated by PBE functional (Eg,PBE), hybrid functional (Eg,HSE), the GW approach (Eg,GW), the SVR model that relies on manually constructed feature descriptors (Eg,SVR), and the ATCNN model (Eg,ATCNN)

Discussion

For any ML algorithm applied in materials science, model interpretability is desirable. Here, we take the Ef prediction model as an example to illustrate how to extract knowledge from the ATCNN model. Generally, visualizing the element representations is a good way to examine the learned features. In Fig. 4a, the 50-dimensional representations (for each element, the first FC layer outputs 50 values) of the main-group elements are shown. However, these features are intertwined with each other, and it is difficult to visually discern the relationships between elements. To better understand the high-dimensional feature vectors, all features are first decoupled by principal component analysis (PCA) and then projected onto the space spanned by the first two principal axes. PCA is a method of feature extraction and dimensionality reduction, and it has been widely used in previous studies to visualize high-dimensional features.39,40,41 As seen in Fig. 4b, the alkali metals (group I), alkaline earth metals (group II), halogens (group VII), and rare gases (group VIII) are clustered in different regions. Moreover, the elements O, N, and S are close to the halogens, showing strong non-metallicity, while In is close to the alkaline earth metals, showing metallicity. Since the Ef of elemental crystals is always zero, the features are not learned from the elements themselves but from the compounds. The PCA results reflect the periodic law of the elements and confirm that the ATCNN model indeed learns the properties of elements. Furthermore, Ef reflects the stability of compounds, and this stability is related to the arrangement of electrons outside the nucleus,25,42 that is, to the periodic law of the elements. Thus, the PCA results indicate that the ATCNN model has captured the underlying physics of Ef.
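A sketch of this analysis is given below: single-element atom tables are passed through the trained Ef model up to the first FC layer, and the resulting 50-dimensional vectors are projected with PCA. The Keras sub-model construction and the layer name are implementation assumptions, not the exact code used here.

```python
import numpy as np
from sklearn.decomposition import PCA
from tensorflow.keras import Model

def element_features(model, element_z, fc1_name="dense"):
    """Extract the first-FC-layer (50-d) representation of single elements.

    `fc1_name` is a hypothetical layer name; it must match the first Dense
    layer of the trained model.
    """
    feature_net = Model(inputs=model.input,
                        outputs=model.get_layer(fc1_name).output)
    tables = np.zeros((len(element_z), 10, 10, 1))
    for row, z in enumerate(element_z):                    # z: atomic number (1-based)
        tables[row, (z - 1) // 10, (z - 1) % 10, 0] = 1.0  # pure element -> fraction 1
    return feature_net.predict(tables)

# features = element_features(trained_ef_model, main_group_z)   # shape (36, 50)
# projected = PCA(n_components=2).fit_transform(features)       # project onto PC1/PC2
```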

Fig. 4

Feature analysis of the Ef prediction model. a Illustration of the feature vectors of 36 main-group elements in vector space. Group I represents the first main-group elements, group II the second main-group elements, and so on. b Projection of the feature vectors of the 36 main-group elements onto the plane spanned by the first and second principal axes (PC1 and PC2). The percentages represent the ratio of the variance along each principal axis. Elements are colored according to their elemental groups

In summary, we treat compounds as atom tables and propose a universal ML framework called ATCNN to predict experimentally measured properties. The ATCNN model automatically learns the needed features and directly predicts the properties without manually constructed feature vectors. Under this framework, we construct the ATCNN-I and ATCNN-II models to predict the superconducting transition temperature. ATCNN-I accurately predicts the Tc of superconductors, while ATCNN-II is not only accurate but also able to distinguish between superconductors and non-superconductors. Using the ATCNN-II model, we have screened dozens of unexplored compounds that are potential high-Tc materials. In addition, for experimentally measured Ef and Eg, the accuracy of the ATCNN model exceeds that of standard DFT calculations. Furthermore, we use PCA to analyze the learned features of the main-group elements and find that the ATCNN model indeed learns the properties of elements and reproduces the chemical trends that reflect the underlying physics of Ef.

Methods

In the framework of ATCNN, a compound is treated as a 10 × 10 pixel image that we call the atom table (AT). Each pixel of the AT represents an element, and its value is the proportion of that element in the compound. Therefore, a 10 × 10 pixel AT can represent any material composed of the first 100 elements. Because all compounds involved here contain only the first 86 elements of the periodic table, this size of the AT is sufficient. The element represented by each pixel can be specified arbitrarily, but the assignment must be unique and deterministic. For convenience, we specify them in the order of the periodic table of elements; that is, the first pixel represents the proportion of H, and the last pixel represents the proportion of Fm. We do not use the actual layout of the periodic table as the AT because we try to build the ML models with as little prior knowledge as possible, and it is difficult to represent the actinides and lanthanides in the periodic table. The AT represents only the compound; its feature vector is learned through convolutional neural networks (CNN).
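As an illustration, the following sketch encodes a composition into the 10 × 10 atom table; the element-to-atomic-number lookup shown here is truncated for brevity and would cover Z = 1–100 in practice.

```python
import numpy as np

# Truncated lookup of atomic numbers (assumed helper; a full table covering
# Z = 1..100 would be used in practice).
Z = {"H": 1, "B": 5, "O": 8, "Mg": 12, "S": 16, "Fe": 26, "Cu": 29,
     "Se": 34, "Y": 39, "Ba": 56}

def atom_table(composition: dict) -> np.ndarray:
    """Encode a composition, e.g. {"Mg": 1, "B": 2}, as a 10x10 atom table.

    Pixel (z - 1) in row-major order holds the atomic fraction of the element
    with atomic number z, so the table entries sum to 1.
    """
    table = np.zeros(100)
    total = sum(composition.values())
    for symbol, amount in composition.items():
        table[Z[symbol] - 1] = amount / total
    return table.reshape(10, 10)

# Example: YBa2Cu3O7 -> pixels for Y, Ba, Cu, O hold 1/13, 2/13, 3/13, 7/13.
at = atom_table({"Y": 1, "Ba": 2, "Cu": 3, "O": 7})
```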

A typical CNN contains two major components: convolutional layers (Conv) and pooling layers (Pool). In the first Conv layer, the convolution is performed on the AT with filters, or kernels, to produce feature maps. Each convolution kernel produces a feature map of the original map size,

$${\mathbf{\Lambda }}_l = {\mathrm{Conv}}\left( {{\mathrm{AT}},{\boldsymbol{K}}_{l,m \times m},\phi } \right),$$
(1)

where Λl is the lth feature map produced by the lth kernel Kl,m×m with size m × m. The kernels K are learned automatically by the network. ϕ is the non-linear activation function, the Rectified Linear Unit (ReLU),23 defined as ϕ(x) = max(0, x). The (n + 1)th Conv layer takes the feature maps Λn produced by the nth Conv layer as input and outputs new feature maps Λn+1,

$${\mathbf{\Lambda }}_l^{n + 1} = {\mathrm{Conv}}\left( {{\mathbf{\Lambda }}^n,{\mathbf{K}}_{l,m \times m},\phi } \right).$$
(2)

After several Conv layers, the final feature maps Λf are obtained. The max-pooling method is then used to reduce the size of Λf and produce the feature vector v,

$${\mathbf{\Lambda }}^{\mathrm {r}} = {\mathrm{Pool}}\left( {{\mathbf{\Lambda }}^{\mathrm {f}},P_{h \times h}} \right),$$
(3)
$${\mathbf{v}} = \mathop{\bigoplus}\limits_{l,ij} {\mathbf{\Lambda}}_{l,ij}^{\mathrm{r}},$$
(4)

where $\oplus$ is the concatenation operator and \({\mathbf{\Lambda }}_{l,ij}^{\mathrm {r}}\) denotes the entries of the lth feature map of Λr. Ph×h denotes the pool of size h × h. The role of the Pool layer is to reduce the size of the feature maps, but in our model the atom table is only 10 × 10, so we do not alternate Conv and Pool layers as in common CNNs. In fact, using a single Pool layer after the final Conv layer works best in our models. Finally, several FC hidden layers are used to capture the complex relation between the feature vector and the target property. The parameters of the entire network are learned by minimizing the loss function. For the predictions of Tc, Eg, and Ef, the loss functions are the root mean square error (RMSE), the mean absolute error (MAE), and the MAE, respectively. To ensure that the results are non-negative, we add the ReLU activation function to the output layer for the predictions of Tc and Eg. Thus, the predicted Tc and Eg are non-negative, consistent with the actual situation.
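To make Eqs. (3) and (4) concrete, the following numpy sketch applies non-overlapping 2 × 2 max-pooling to a stack of final feature maps and concatenates the pooled entries into the feature vector v; the shapes are illustrative assumptions.

```python
import numpy as np

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """Eq. (3): non-overlapping 2x2 max-pooling of a single feature map."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def feature_vector(feature_maps: np.ndarray) -> np.ndarray:
    """Eqs. (3)-(4): pool every final feature map, then concatenate all entries into v."""
    pooled = np.stack([max_pool_2x2(fm) for fm in feature_maps])
    return pooled.reshape(-1)  # concatenation over l, i, j

# Example: 64 final feature maps of size 10x10 give a feature vector of
# length 64 * 5 * 5 = 1600, which feeds the first FC layer.
v = feature_vector(np.random.rand(64, 10, 10))
```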