Introduction

Over recent years, the use of machine learning (ML) methods has increased drastically in the materials and nanoscience fields. To overcome the most common ML limitation, the need for massive amounts of data, applications have primarily focused on utilizing well-developed image processing approaches1,2,3, optimizing automated experimentation techniques and existing data mining4,5,6, and using more widely available computational data7,8,9,10,11. Such an approach, however, can only be of use in special cases, preventing the application of ML to typical systems with scarce experimental data. The reasonably accessible computational data commonly lack the accuracy required to naturally supplement experimental results. It is also common for computed and experimental data to not correspond to exactly the same system; e.g., experimentally measured parameters can be strongly affected by the presence of structural defects that cannot be accounted for in computations due to inherent scale differences. Furthermore, nanomaterial properties are highly dependent on the material's size and structure, resulting in a small set of discrete values rather than a large continuous data space, which places a hard limit on data availability. Another common complication in applying ML to such problems is the periodic nature of the phenomena that need to be captured. ML models often have difficulty representing periodic behavior and typically require larger, more complex architectures that, in turn, need even more data to train.

Here, we demonstrate how those obstacles (the small amount of accurate data, the periodic nature of the physical properties, etc.) can be overcome with careful selection of the activation function and the use of a transfer learning approach, further improved with physics-inspired restrictions. As a familiar and important example, we chose the prediction of the band gap values of carbon nanotubes (CNTs) from their chiral indices (n,m), which exhibit a well-known step-like periodic behavior with a characteristic period of 3 in (n−m). First, we will show that while the most common ML approaches fail to represent such periodic functions given the limited available data, the recently proposed Snake activation function shows greatly improved results.

A second obstacle, the discrete nature of the data characteristic of nanomaterials, places a hard cap on the amount of available data, creating a very difficult task for ML. As is typical for many relevant nanosystems, the available experimental data are insufficient for use with ML methods; a thorough literature search produced only 137 experimental and high-accuracy computational values12,13,14,15,16,17 (Fig. 1a) (see SI for a complete list of used values and sources). Consequently, it is common to attempt to enrich the dataset through affordable computations; in our case, the DFTB (density functional tight binding) method provides values for 851 CNTs18,19 (Fig. 1a). Unfortunately, such data, even if internally precise, often lack consistency across the two pools, experimental and computed. Not only does the magnitude of the band gaps vary significantly between experimental and DFTB results, but the fine details of the trend for semiconducting tubes do not match (Fig. 1b), ruling out simply mixing the two available datasets. In this work, we demonstrate that this low-quality data can still be useful, enabling the learning of general trends. Subsequently, employing the transfer learning (TL) approach, this rough model is re-trained to accurately represent experimental results despite the extremely sparse data.

Fig. 1: Data.
figure 1

a The range and values of the available experimental and DFTB data for the CNT band gap (the color bar is shown in the top right). b Comparison of the experimental (hollow blue) and DFTB (filled green) values of (n,0) CNTs highlighting the difference in magnitude and the fine details of the trend.

The physics of band gaps in CNTs is well understood: it derives from the linear band dispersion at the Fermi level near the K point of a rolled-up graphene sheet20,21. It must be mentioned that, over the years, various equations predicting the CNT band gap have been proposed and fitted to experimental data. Such equations, whether based on a theoretical understanding of the band gap origins or purely empirical, provide reasonably accurate values at a computational speed that cannot be matched even by the simplest ML approach22,23,24. The goal of this manuscript is not to compete with those equations or even to predict the band gap values but rather to illustrate a successful transfer learning ML approach capable of handling challenging periodic data while being trained on a realistically small dataset. The existence of the empirical equations provides a convenient way to prove the methodology's effectiveness and evaluate performance without being used to generate training data.
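For context, the zone-folding picture behind such equations can be written down in a few lines. The sketch below is purely illustrative and is not the fitted form of refs. 22,23,24; the hopping parameter and bond length are textbook values, and curvature corrections are neglected.

```python
import math

A_CC = 0.142    # carbon-carbon bond length, nm
GAMMA0 = 2.9    # tight-binding hopping parameter, eV (textbook value, not a fitted parameter)

def diameter_nm(n: int, m: int) -> float:
    """CNT diameter from the chiral indices (n, m)."""
    a = A_CC * math.sqrt(3.0)                       # graphene lattice constant, nm
    return a * math.sqrt(n * n + n * m + m * m) / math.pi

def band_gap_estimate_ev(n: int, m: int) -> float:
    """Zone-folding estimate: metallic when (n - m) % 3 == 0, otherwise Eg ~ 2*gamma0*a_cc/d."""
    if (n - m) % 3 == 0:
        return 0.0                                  # metallic (small curvature-induced gaps neglected)
    return 2.0 * GAMMA0 * A_CC / diameter_nm(n, m)

print(round(band_gap_estimate_ev(10, 0), 2))        # ~1.05 eV for the (10,0) zigzag tube
```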

Results and discussion

Representation of periodic functions with machine learning

First, we must address the ability of ML to represent periodic physical data, which often poses a challenge and requires the use of overly complex models and, hence, an increased amount of data. For this first stage, we used the more numerous DFTB data, randomly split into training and testing datasets in a 70:30 ratio (see Methods section for details). To establish a baseline and demonstrate the failure of the common approach in this situation, we trained a simple 2-layer neural network (NN) of variable layer width using the popular ReLU (Rectified Linear Unit)25,26 activation function, ReLU(x) = max(0, x). For simplicity, the average absolute error was used as the loss function (L1 loss). The plots of training and testing loss for these 2xReLU NNs display all the common training characteristics (Fig. 2a): underfitting for smaller NNs (widths below 200 neurons per layer), where model simplicity prevents accurate data representation; overfitting for larger NNs (above 512 neurons per layer), where memorization of the training data prevents generalization; and somewhat accurate prediction of the testing data for moderate NN sizes. The NNs of optimal size were only able to achieve a relatively low accuracy of εmax ≈ 0.45 eV and, more importantly, failed to accurately capture the periodicity. Note that we use the maximal absolute error εmax of an individual CNT band gap prediction of the best-performing NN to characterize the performance through the ultimate guaranteed accuracy of each band gap prediction. To simplify the visualization, we plot DFTB data and ML predictions for only zigzag CNTs (m = 0), which, in the case of 2xReLU NNs, clearly show the absence of any kind of periodic behavior, a shortcoming typical for conventional activation functions (Fig. 2b).
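For concreteness, a minimal PyTorch sketch of this baseline is given below. The two hidden ReLU layers, L1 loss, optimizer, and learning rate follow the Methods section; feeding the raw (n, m) pair as a two-dimensional input and regressing the band gap directly are our assumptions about details not spelled out in the text.

```python
import torch
import torch.nn as nn

def make_relu_net(width: int) -> nn.Sequential:
    """Two hidden ReLU layers of equal width; input is the (n, m) pair, output is the band gap in eV."""
    return nn.Sequential(
        nn.Linear(2, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, 1),
    )

model = make_relu_net(width=256)
loss_fn = nn.L1Loss()                                        # average absolute error (L1 loss)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # optimizer and learning rate per Methods

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One optimization step on a batch of chiral indices x of shape (N, 2) and band gaps y of shape (N, 1)."""
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```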

Fig. 2: Representation of the periodic function with ML, results on DFTB data.
figure 2

ML results of two-layer NN with ReLU a, b and Snake c, d activation functions and four-layer NN with two Snake and two ReLU layers e, f. The best achievable εmax by the 4-layer NN (circle size) e with corresponding widths of the Snake and ReLU layers was used to identify combinations of widths resulting in the best performance (shaded green area). Simplified (m = 0) visualization of the best performance 2xReLU b, 2xSnake d, and 2xSnake-2xReLU f networks demonstrating representation of the periodic data. NN with just ReLU layers fails to capture the general trend of the data b with εmax > 0.45 eV. The performance with just Snake is greatly improved, well-representing periodic data, εmax > 0.2 eV d. The combination of Snake and ReLU shows the best performance with εmax > 0.0075 eV f. Error bars indicate standard deviation.

It should be mentioned that in simpler cases, one can devise a scheme that separates the data into distinct sets that do not display periodicity, for example, grouping CNTs by (n−m) mod 3 in our case, completely sidestepping the problem. However, this does not represent a generic solution and, significantly, would further reduce the amount of data available for each set.
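A hypothetical sketch of such a split is shown below; it is mentioned only to illustrate the workaround and is not part of the approach used in this work.

```python
from collections import defaultdict

def split_by_family(chiralities):
    """Group (n, m) pairs by (n - m) mod 3 so that each subset no longer exhibits the period-3 pattern."""
    families = defaultdict(list)
    for n, m in chiralities:
        families[(n - m) % 3].append((n, m))
    return families

# Example: zigzag tubes (m = 0) separate into three much smaller subsets
print(split_by_family([(n, 0) for n in range(4, 13)]))
```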

Recently, Ziyin et al.27 addressed the problem of representing periodic functions with NNs by creating a parametric periodic activation function named Snake, Snake(x) = x + sin²(ax)/a, where the parameter a can be learned within the optimization algorithm or set by the user. While less computationally efficient, a 2-layer NN with Snake activation functions significantly outperforms its ReLU counterpart in representing periodic data (Fig. 2c, d). Not only is the performance of 2xSnake NNs improved to εmax ≈ 0.2 eV, but, more importantly, the periodicity is accurately captured (Fig. 2d). Interestingly, the dependence of the loss on layer width does not show the typical overfitting behavior.
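A minimal PyTorch implementation of Snake consistent with the Methods section (both the weights and the period parameter are updated during training) could look as follows; treating a as a single learnable scalar per layer is our assumption.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation, Snake(x) = x + sin^2(a*x)/a, with a learnable frequency parameter a."""
    def __init__(self, a_init: float = 1.0):
        super().__init__()
        self.a = nn.Parameter(torch.tensor(float(a_init)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.sin(self.a * x) ** 2 / self.a
```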

For comparison, we also evaluated the simpler, traditional periodic activation function sin(ax), which showed performance slightly below that of Snake (Supplementary Fig. 1). Notably, NNs with sin(ax) showed a tendency to become trapped in local minima28, resulting in reduced stability, manifested in the significantly increased error bars in Supplementary Fig. 1e.

Further, by combining two layers with Snake and two layers with ReLU activation functions and varying the layer widths (both Snake layers share one width, and both ReLU layers share another), we create the most complex model considered in this work. In principle, such an architecture should separate the periodic trend, captured by the Snake layers, from the magnitude trend, captured by the ReLU layers, allowing for better performance as well as for the transfer learning approach discussed later.
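A sketch of this 2xSnake-2xReLU architecture, assuming the Snake module defined above and the same two-dimensional (n, m) input as before:

```python
import torch.nn as nn

def make_snake_relu_net(snake_width: int, relu_width: int) -> nn.Sequential:
    """Two Snake layers (periodic trend) followed by two ReLU layers (magnitude trend) and a linear output."""
    return nn.Sequential(
        nn.Linear(2, snake_width), Snake(),
        nn.Linear(snake_width, snake_width), Snake(),
        nn.Linear(snake_width, relu_width), nn.ReLU(),
        nn.Linear(relu_width, relu_width), nn.ReLU(),
        nn.Linear(relu_width, 1),
    )

# The smallest architecture within the optimal region of Fig. 2e
model = make_snake_relu_net(snake_width=16, relu_width=64)
```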

Plotting the εmax of the best-performing NN of a given architecture against the widths of the Snake and ReLU layers, we find the region of optimal performance for 2xSnake-2xReLU NNs (highlighted in green in Fig. 2e). The periodicity of the data was well captured by those NNs (Fig. 2f), with the accuracy further improved significantly, to εmax ≈ 0.0075 eV. Note that overfitting for larger NNs is again present due to the use of ReLU layers (top right corner of Fig. 2e).

Originally, Ziyin et al.27 demonstrated the use of Snake for conventional, continuous periodic functions. The promising results obtained here clearly confirm the ability of NNs with the Snake27 activation function to reproduce almost step-like periodic trends in discrete physical data where more common activation functions fail. The rather remarkable ability of Snake to extrapolate should also be mentioned; its importance for ML on physical data cannot be overstated. While we use a rather simple case here, significantly more complex periodic functions, such as Hamiltonians, could be of interest to the nanomaterials community.

Transfer learning for ML of small experimental datasets

Our second goal was to overcome the scarcity of accurate computational and experimental data. Characteristically for discrete physical data, one is not interested in predictive interpolation between available data points; in our case, such predictions would correspond to non-existent CNTs with fractional chiral indices. It would be of significant benefit, however, to predict the band gap values for a much wider range of CNTs than is already described by the experimental data, i.e., to create an extrapolative model. To this end, we use the transfer learning technique, in which a NN previously trained on the low-quality (DFTB) data is partially re-trained on the limited, accurate experimental data. In our particular case, we start with the best-performing 2xSnake-2xReLU NNs trained on DFTB data and re-train only the ReLU layers (Fig. 3a). The distribution of the learned Snake parameter a for both layers is shown in Supplementary Fig. 4. The motivation for this particular architecture is to preserve the pre-learned periodicity of the data, present in the DFTB results and captured by the parameters of the Snake layers. To evaluate the extrapolative performance, we consider both the maximal error εmax and the average error 〈ε〉 for the training set (comparison with the experimental data) and the testing set (comparison with values predicted by the empirical equations22,23,24 over the range of the DFTB data); for convenience, we refer to these as the experimental (Exp.) and extended range (Ext. Range) errors. We train 50 NN instances with randomized initialization for all considered approaches and NN architectures to evaluate the achievable performance.
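A sketch of this re-training step, assuming the make_snake_relu_net helper above, is shown below: the parameters of the Snake part are frozen, and only the ReLU layers are optimized on the experimental set. The checkpoint name is hypothetical.

```python
import torch

# Start from a DFTB-pretrained 2xSnake-2xReLU model (see the sketch above)
pretrained = make_snake_relu_net(snake_width=16, relu_width=64)
# pretrained.load_state_dict(torch.load("dftb_pretrained.pt"))   # hypothetical checkpoint file

# Freeze the Snake part: the first four modules (Linear + Snake, Linear + Snake) of the Sequential
for module in list(pretrained)[:4]:
    for p in module.parameters():
        p.requires_grad = False

# Re-train only the remaining ReLU layers on the 137-point experimental dataset
trainable = [p for p in pretrained.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
```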

Fig. 3: Transfer learning ML.
figure 3

a Schematic of the 2xSnake-2xReLU NN used for transfer learning, where the parameters of the first two layers are kept constant after pre-training on DFTB data while the 2 ReLU layers are re-trained using only the experimental dataset. b–e The performance of 2xSnake-2xReLU NNs (Snake layer width of 16 and ReLU layer width of 64 nodes) on the experimental range (Exp., blue) in comparison to the extrapolative performance on the extended range (Ext. Range, orange), trained with simple training (ML, b), transfer learning (TrL, c), transfer learning with physics-inspired restrictions (Phys-TrL, d), and transfer learning with physics-inspired restrictions on the reduced range of CNTs with diameters smaller than that of the (40,0) tube (Range-Phys-TrL, e). The zoomed-in version of panels b–e is shown in Supplementary Fig. 3. Whisker plots (showing outliers, mean, median, and all quartile boundaries) of results for the simple and transfer learning approaches on the experimental f and extended g ranges at different approximations.

For an initial comparison, we start with the NN architecture with 2 Snake layers 16 nodes (neurons) wide and 2 ReLU layers 64 nodes wide (performance results shown in Fig. 3b–e and Fig. 3f, g), the smallest NN within the optimal region highlighted in Fig. 2e. For completeness, we perform from-scratch training of a 2xSnake-2xReLU NN on the experimental data (marked as ML). Due to the significant complexity of the model and the small size of the dataset, the results are plagued by overfitting, showing outstanding results in the prediction of data points within the experimental dataset (best performance εmax = 0.097 eV and 〈ε〉 = 0.007 eV) and poor performance on the extended range (best performance εmax = 1.20 eV and 〈ε〉 = 0.15 eV) (Fig. 3b). The experimental and extended ranges are shown in Supplementary Fig. 2. Employing the transfer learning procedure described above (marked as TrL in Figs. 3 and 4) markedly improves the extrapolative performance of the model, lowering the achieved error level to εmax = 0.31 eV and 〈ε〉 = 0.043 eV (Fig. 3c).

Fig. 4: Transfer learning results.
figure 4

Experimental a and extended range b results for 2xSnake-2xReLU NNs with various layer widths partially re-trained on the experimental data. Whisker plots show outliers, mean, median, and all quartile boundaries.

Even further improvement can be achieved by incorporating restrictions based on an understanding of the physical nature of the data. Including such additional rules during the optimization process is a common approach that allows scientists to leverage preexisting knowledge of the phenomena, only requiring that the restrictions be formulated in a way compatible with the form of the loss function. Such physics-inspired restrictions are generally implemented by designing both the restrictions and the loss function ahead of time. In our case, relying on an understanding of the band gap's nature, we include an additional term that penalizes the prediction of any negative values (marked as Phys-TrL in Figs. 3 and 4). Despite being relatively simple, this modification is effective in improving the results on the extended range to εmax = 0.28 eV and 〈ε〉 = 0.032 eV (Fig. 3d). Even further improvement can be achieved by limiting the range of extrapolation (marked as Range-Phys-TrL in Figs. 3 and 4). Considering only CNTs with diameters below that of the (40,0) nanotube, the performance can reach εmax = 0.098 eV and 〈ε〉 = 0.030 eV (Fig. 3e). Note that while the performance on the extended range is improved, the restrictions applied to the optimization inevitably result in reduced performance on the training data (Fig. 3f, g).
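A sketch of such a physics-inspired loss term is given below. The 10× weighting of negative predictions follows the Methods section; reducing the penalty with a mean over the batch is our assumption.

```python
import torch

def physics_l1_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 loss plus a penalty on non-physical (negative) band gap predictions,
    weighted 10x as described in Methods."""
    l1 = torch.mean(torch.abs(pred - target))
    negative_penalty = 10.0 * torch.mean(torch.relu(-pred))   # zero whenever all predictions are >= 0
    return l1 + negative_penalty
```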

For easier comparison, we also provide the whisker charts of results on the experimental (Fig. 3f) and extended range (Fig. 3g) for simple ML, transfer learning, transfer learning with physics-inspired restrictions, and transfer learning with physics-inspired restrictions on the reduced range. As mentioned above, it is easy to see that the performance on the training set worsens with additional restrictions, while the results on the extended range drastically improve.

Now that the potential of transfer learning has been shown for a single NN architecture, we proceed to test various combinations of Snake and ReLU layer widths, using the best-performing DFTB-trained NNs as starting points. We limit our investigation to the NNs highlighted in green in Fig. 2e. As before, for each architecture, 50 NN instances are trained following the TrL, Phys-TrL, and Range-Phys-TrL approaches.

It is clear that the performance on the experimental range is universally good, with the error level slightly dipping at a Snake width of 32 nodes (Fig. 4a). At the same time, the prediction accuracy on the extended range is progressively and significantly improved with the increased complexity of the model, reaching a plateau at a Snake width of 64 (Fig. 4b). Interestingly, not only is the best performance enhanced with higher complexity, but the variability of the results across the produced NNs also decreases, suggesting more robust models. The accuracy achieved through the transfer learning approach on the extended range reaches εmax = 0.091 eV and 〈ε〉 = 0.016 eV, on par with that of conventional ML training on DFTB data, despite the very small dataset size.

To further investigate the potential usefulness of the outlined methodology on experimental data, we consider its performance on data with imperfect periodicity. While local deviations from periodicity are very typical for materials science datasets (e.g., localized structural defects within an otherwise perfect lattice), they are absent in our test case. We therefore artificially introduced such deviations to an increasing number of points within the DFTB dataset (Supplementary Fig. 5), finding only a slight performance decrease for datasets with up to 10% non-periodic points (Supplementary Fig. 6). This level of tolerance towards imperfect periodicity in the pre-training data opens a variety of possible applications. While used here as an example, the CNT band gaps are indeed a representative and relevant dataset. It is worth recalling that the variability of the band gap with nanotube chirality has been seen as both an opportunity and a hurdle for electronics applications, posing a decades-long challenge to reveal the origins of chiral-type distributions29,30 and, especially, to achieve chirality-selective synthesis31.
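As a hypothetical illustration of how such deviations could be introduced, the snippet below perturbs a chosen fraction of the DFTB band gaps with a Gaussian shift; both the shift type and its magnitude are our own choices, not the procedure used in this work.

```python
import numpy as np

def perturb_fraction(gaps: np.ndarray, fraction: float = 0.10,
                     scale: float = 0.3, seed: int = 0) -> np.ndarray:
    """Randomly shift a chosen fraction of band gap values to locally break the period-3 pattern.
    The Gaussian shift and its scale (in eV) are illustrative choices only."""
    rng = np.random.default_rng(seed)
    perturbed = gaps.copy()
    idx = rng.choice(len(gaps), size=int(fraction * len(gaps)), replace=False)
    perturbed[idx] += rng.normal(0.0, scale, size=len(idx))
    return perturbed
```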

In conclusion, we have successfully demonstrated a methodology to overcome several common obstacles to the use of ML on datasets in the nanomaterials field. First, the use of the recently proposed Snake activation function enables the learning of the periodic functions quite common in physical data. Here, Snake's effectiveness is illustrated on a discrete, step-wise periodic function, the CNT band gap, of a kind common for electronic and optical properties of nanostructures (the Periodic Table of the chemical elements is also a compelling example of this kind); yet its use on more conventional continuous periodic functions, such as Hamiltonians32, can prove to be important for the field of nanomaterials. It can also find application in many other tasks that remain challenging for NNs, such as learning symmetry from diffraction images33,34. Furthermore, we employed transfer learning by re-using NNs pre-trained on the numerous but inaccurate DFTB data. This approach allowed us to successfully represent accurate experimental data from just 137 data points, clearly illustrating the capabilities of transfer learning for the typical case of extremely limited data availability. Moreover, the represented range significantly exceeded that of the data used. We believe that the demonstrated approach should significantly expand the usability of ML techniques in the nanomaterials research field.

Methods

Dataset preparation

We used three distinct data sources to compose two different datasets. The first dataset was composed of DFTB data18,19 for training and testing. The DFTB data were used to evaluate the ability of the network to learn characteristically periodic patterns. The second was composed of experimental and high-accuracy DFT data12,13,14,15,16,17 for training, with testing performed using empirical formulas22,23,24. This second dataset was used primarily to evaluate the transfer learning potential of the neural network. The exact datasets used are available upon request.

The DFTB dataset included all valid (n,m) combinations with n in [4,40] and m in [0,n], a total of 851 points (see Fig. 1). The data were then split into training and testing subsets in an approximately 70:30 ratio (592 and 259 points, respectively) (Fig. 5 visualizes the full set and the training subset). The training set was randomly upsampled (sampled with replacement) from 592 to 1024 points, while the testing set was left at its native 259 points.
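A sketch of this split and upsampling step is given below; the random number generator, seed, and use of NumPy are our own choices, not specified in the text.

```python
import numpy as np

def split_and_upsample(data: np.ndarray, train_frac: float = 0.7,
                       upsample_to: int = 1024, seed: int = 0):
    """Random ~70:30 split of the DFTB points, then upsampling of the training subset
    with replacement; the testing subset keeps its native size."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(train_frac * len(data))              # ~70% of the 851 points (592 in the paper)
    train, test = data[idx[:n_train]], data[idx[n_train:]]
    train_up = train[rng.integers(0, len(train), size=upsample_to)]
    return train_up, test
```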

Fig. 5: DFTB data preparation.
figure 5

The chiral map showing the full DFTB dataset (851 points) and randomly chosen 592 data points (~70% of the full set) that were further randomly upsampled to 1024 data points used as a training set.

The transfer learning training set was composed of 137 training points12,13,14,15,16,17 (see Supplementary Information for a complete list of used values and sources) that were upsampled to 342 points by resampling the less-represented part of the data space (larger-diameter and larger-chiral-angle CNTs). The oversampling of the large-diameter CNTs corrected for the underrepresentation of the part of the data space with smaller absolute energy values. The testing set was evaluated over the entire (n,m) range of interest in each experiment. The full range of the DFTB data was used, except for the transfer learning evaluation on the reduced extended range, where only nanotubes with diameters below that of the (40,0) nanotube were included.

Neural network methods

The networks were built using Python 3.7.9, PyTorch 1.7, and CUDA 11.0. We evaluated three different networks in this work: two versions of a two-layer network and one four-layer fully connected feed-forward network. All of these are traditional neural networks that include the bias term as part of their topology. For brevity, we denote the topology of a network by the number of elements in each of the feed-forward layers and by the transfer function used in each layer, as the input and output were the same across all networks studied.

All networks in this work utilized the AdamW35 optimization methodology with an initial learning rate of 10⁻³. ReLU layers were initialized using He initialization25. The Snake layers were initialized using He initialization for the weight component and the Uniform[0,3] distribution for the period component. Unlike the implementation used by Ziyin et al.27, we allowed both of these components to update with the network. The networks were all trained to minimize the L1 loss between the prediction and the band gap in the dataset. Networks were stopped after they ran for 20 × 10⁶ epochs of the training set data, and the best-performing network was evaluated by the L1 loss over the testing set. We also recorded the absolute value of the maximum deviation of any given prediction from the actual value to evaluate the worst possible prediction of the network.
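A sketch of this initialization scheme, assuming the Snake module and the make_snake_relu_net helper sketched in the Results section, is given below; the normal variant of He initialization and zeroed biases are our assumptions.

```python
import torch.nn as nn

def init_network(model: nn.Sequential) -> None:
    """He initialization for all Linear weights; Snake period a drawn from Uniform[0, 3]."""
    for module in model:
        if isinstance(module, nn.Linear):
            nn.init.kaiming_normal_(module.weight, nonlinearity='relu')   # normal variant assumed
            nn.init.zeros_(module.bias)                                   # zeroed biases assumed
        elif isinstance(module, Snake):
            nn.init.uniform_(module.a, 0.0, 3.0)

model = make_snake_relu_net(snake_width=16, relu_width=64)
init_network(model)
```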

The first variation of the two-layer network utilized the ReLU transfer function. The second variation of the two-layer network utilized the Snake transfer function. For both types of two-layer networks, we evaluated networks with the same number of neurons in each of the two hidden layers. The examined sizes are 16, 32, 64, 128, 256, 512, and 1024.

We observed that the Snake layers did an excellent job of learning the underlying periodic behavior; therefore, we decided to use a four-layer network consisting of two Snake layers followed by two ReLU layers, so that the Snake layers learn the underlying phenomena and the ReLU layers learn the appropriate scale. This allowed us to transfer train the network by relearning only the ReLU layers using the much smaller, highly accurate dataset. To help ensure that these networks were learning physically meaningful outputs, we modified our L1 loss to also penalize negative band gap energies, with a penalty equivalent to 10× the negative value. This strongly discouraged the network from learning any non-physical solutions. The evaluation of the retrained networks over a range significantly exceeding the limited range of the accurate dataset was performed using empirical data22,23,24 (see SI for details) over the range of the DFTB dataset, or a slightly reduced range, as described above. To simplify the evaluation, the widths of the ReLU and Snake layers were varied independently, keeping the two ReLU layers at one common width and the two Snake layers at another. We evaluated the four-layer networks with the following layer widths: 16, 32, 64, and 128 for both ReLU and Snake layers, and 200 and 256 for ReLU layers only.